Evolutionary advantages of adaptive rewarding

Our wellbeing depends as much on our personal success as it does on the success of our society. The realization of this fact makes cooperation a much-needed trait. Experiments have shown that rewards can elevate our readiness to cooperate, but since giving a reward inevitably entails paying a cost for it, the emergence and stability of such behavior remain elusive. Here we show that allowing the act of rewarding to self-organize in dependence on the success of cooperation creates several evolutionary advantages that instill new ways through which collaborative efforts are promoted. Ranging from indirect territorial battle to the spontaneous emergence and destruction of coexistence, phase diagrams and the underlying spatial patterns reveal fascinatingly rich social dynamics that explain why this costly behavior has evolved and persevered. Comparisons with adaptive punishment, however, uncover an Achilles heel of adaptive rewarding that is due to over-aggression, which in turn hinders the optimal utilization of network reciprocity. This may explain why, despite its success, rewarding is not as firmly woven into our societal organization as punishment.


Introduction
Responsible usage of public goods and continuous investments into the common pool are of paramount importance for sustainable development on a global scale. Losing sight of this by over-exploiting the goods for short-term benefits inevitably creates systemic risks that may lead to the "tragedy of the commons" [1]. The public goods game captures succinctly the essence of the underlying social dilemma by requiring that players decide simultaneously whether they wish to bear the cost of cooperation and thus to contribute to the common pool, or not. Regardless of their decision, each member of the group receives an equal share of the public good after the initial contributions are multiplied by a factor that takes into account the added value of collaborative efforts. Individuals are best off by defecting, while the group is most successful if everybody cooperates. Historical evidence suggests that humans have developed remarkable other-regarding abilities to mitigate between-group conflicts [2], as well as to help each other by rearing offspring that survived [3]. However, while these issues might have sparked our cooperative behavior, it is mechanisms like kin and group selection as well as different forms of reciprocity [4] or other recently identified mechanisms [5][6][7] that have likely been instrumental in eliciting its full potential and solidifying it as one of the most distinguishable behavioral traits of mankind.
Reward and punishment [9] are also cited frequently as viable means to promote the evolution of public cooperation, although punishment has received substantially more attention, as reviewed comprehensively in [10]. Related to that, it is important to note that recent research on antisocial punishment [11,12] and reward in particular [13] is questioning the aptness of sanctioning for elevating collaborative efforts and raising social welfare. Indeed, while the majority of previous studies addressing the "stick versus carrot" dilemma concluded that punishment is more effective than reward in sustaining cooperation [9,10], evidence suggesting that rewards may be as effective as punishment and lead to higher total earnings without potential damage to reputation [14] or fear of retaliation [15] is mounting rapidly. Moreover, in their recent paper [12], Rand and Nowak provide firm evidence that antisocial punishment renders the concept of sanctioning ineffective, and argue further that healthy levels of cooperation are likelier to be achieved through less destructive means.
Regardless of whether we place the burden of cooperation promotion on punishment [16][17][18][19] or reward [20][21][22], the problem with both actions is that they are costly. In particular, punishment implies paying a cost for another person to incur a cost, while rewarding likewise incorporates a cost to bear, but for another person to experience a benefit. Cooperators who abstain therefore become "second-order free-riders", and they can seriously challenge the success of sanctioning [23][24][25] as well as rewarding [21]. Here we focus on the latter and take into account the fact that our willingness to reward others depends sensitively on the success of antisocial behavior. If defection is on the rise, we may feel more inclined to support cooperation by means of additional incentives in order to avert an impending social decline. On the other hand, if everybody is already cooperating, such actions may appear superfluous. Moreover, there is a permanent tendency to eschew the costs that are associated with administrating rewards. Inspired by these observations, we introduce a third strategy to the spatial public goods game to supplement the traditional cooperators and defectors, namely the so-called rewarding cooperators, and show that adaptive rewarding yields several evolutionary advantages that can overcome the "second-order free-rider" problem. Compared to steady rewarding [21], for example, the cyclic dominance between the three competing strategies can be broken, which in turn leads to higher levels of cooperation and even to completely defector-free states. Punishment, nevertheless, still outperforms rewarding, for it acts more coherently with network reciprocity.
We thus arrive at interesting and partly counterintuitive conclusions that extend the existing theory on sanctioning and rewarding in structured populations [21,[26][27][28][29], as well as supplement the array of recently identified mechanisms that promote cooperation in public goods games, ranging from complex interaction networks and coevolution [30][31][32][33][34] through diversity [35][36][37][38][39] to the risk of collective failures [40] and selection pressure [41]. Before presenting the main results, however, we proceed with a detailed description of the studied spatial public goods game.

Spatial public goods game with adaptive rewarding
The game is contested by cooperators (s x = C), defectors (s x = D) and rewarding cooperators (s x = R), who initially populate the square lattice with equal probability. A player x plays the public goods game with its k = G − 1 = 4 interaction partners as a member of all the g = 1, . . . , G = 5 groups it belongs to. Both cooperating strategies contribute 1 to the public good while defectors contribute nothing. The sum of all contributions in each group is multiplied by the factor r > 1, reflecting the synergetic effects of cooperation, and the resulting amount is equally divided amongst all group members irrespective of their strategy. Adaptive rewarding is accommodated by assigning each rewarding cooperator an additional parameter π_x, which keeps score of the rewarding activity. While this parameter is initially zero, subsequently, whenever a defector succeeds in passing its strategy, all the remaining rewarding cooperators in all the groups containing the defeated player increase their rewarding activity by one, i.e., π_x = π_x + 1. The related costs increase accordingly. However, maintaining a high rewarding activity is costly and hence unwanted, and so at every second round all rewarding cooperators decrease their rewarding activity by one, as long as π_x remains non-negative. The payoff of player x adopting s x = C in a given group g of size G is thus

P^g_C = r(N_C + N_R + 1)/G − 1 + (∆/k) Σ_i π_i ,

where N_C, N_D and N_R are the numbers of other cooperators, defectors and rewarding cooperators in the group g, respectively. The sum runs across all the neighbors in the group, while π_i is the actual rewarding activity of player i. The corresponding payoff of a rewarding cooperator at site x is

P^g_R = r(N_C + N_R + 1)/G − 1 + (∆/k) Σ_i π_i − (N_C + N_R) π_x α∆/k ,

while a defector, whose payoff is derived exclusively from the contributions of others, gets

P^g_D = r(N_C + N_R)/G .

As it follows, each player adopting s x = C or s x = R is rewarded with an amount π_i ∆/k from every rewarding cooperator, having rewarding activity π_i, that is a member of the same group.
At the same time, each rewarding cooperator bears the cost π_i α∆/k for every cooperator that was rewarded. Self-rewarding is excluded. Here ∆ and α are important free parameters, determining the incremental step used for the rewarding activity and the cost of rewards, respectively. Note that α is actually the ratio between the cost of rewarding and the reward that is allotted to cooperators. The stationary fractions of cooperators ρ C, defectors ρ D and rewarding cooperators ρ R on the square lattice are determined by means of a random sequential update comprising the following elementary steps. First, a randomly selected player x plays the public goods game with its partners as a member of all the five groups it belongs to. The overall payoff it thereby obtains is thus P sx = Σ g P g sx. Next, one of the four nearest neighbors of player x is chosen randomly. This player y also acquires its payoff P sy identically as previously player x. Finally, if s x ≠ s y, player y imitates the strategy of player x with the probability q = 1/{1 + exp[(P sy − P sx)/K]}, where K determines the level of uncertainty in strategy adoptions. Without loss of generality we set K = 0.5 [42], implying that better performing players are readily imitated, although it is not impossible to adopt the strategy of a player performing worse. Each full Monte Carlo step of the game involves all players having a chance to adopt a strategy from one of their neighbors once on average. Depending on the proximity to phase transition points and the typical size of emerging spatial patterns, the linear system size was varied from L = 200 to 2000, and the equilibration required up to 10^6 full rounds of the game for finite size effects to be avoided.
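To make the payoff structure concrete, the per-group accounting described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name, data layout and default parameter values are ours, not from the paper, and a single group is evaluated rather than the full lattice.

```python
def group_payoffs(strategies, activities, r=2.0, delta=1.0, alpha=0.1):
    """Return {player_index: payoff} for one public goods group.

    strategies : list of 'C', 'D' or 'R' for the G group members
    activities : rewarding activity pi_i of each member (0 for C and D)
    """
    G = len(strategies)
    k = G - 1
    n_contrib = sum(1 for s in strategies if s in ('C', 'R'))
    share = r * n_contrib / G                      # equal share of the pool
    payoffs = {}
    for x, s in enumerate(strategies):
        if s == 'D':
            payoffs[x] = share                     # defectors contribute nothing
            continue
        # contribution cost of 1, plus rewards pi_i * delta / k collected
        # from every *other* rewarding cooperator in the group
        reward = sum(activities[i] for i, t in enumerate(strategies)
                     if t == 'R' and i != x) * delta / k
        p = share - 1.0 + reward
        if s == 'R':
            # cost alpha * delta / k * pi_x for every other cooperator rewarded
            n_rewarded = sum(1 for i, t in enumerate(strategies)
                             if t in ('C', 'R') and i != x)
            p -= alpha * delta / k * activities[x] * n_rewarded
        payoffs[x] = p
    return payoffs
```

For instance, in a group with one rewarding cooperator of activity π = 2, one cooperator and three defectors at r = 2, ∆ = 1 and α = 0.1, the cooperator nets 0.3, each defector collects 0.8, and the rewarding cooperator ends up with −0.25 once the rewarding cost is deducted.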
Figure 1. Fractions of the three competing strategies in dependence on ∆, as obtained at r = 2 and α = 0.1 for steady (left) and adaptive (right) rewarding. While steady rewarding fails to eliminate defection due to the spontaneous emergence of cyclic dominance that is brought about by "second-order free-riding", adaptive rewarding suffers from no such drawbacks, gradually leading to complete dominance of rewarding cooperators as ∆ increases.

It is worth noting that this set-up enables us to directly compare the effectiveness of adaptive rewarding with the steady rewarding efforts studied previously in [21]. While the simulation details are identical in both cases, in the steady rewarding model players adopting s x = R always reward every cooperator with a reward ∆/k and therefore bear the cost of rewarding α∆/k. The initially set rewarding activity of rewarding cooperators π_x = 1 never increases or decreases, while ∆ simply determines the strength of rewards. As in the adaptive model, α determines just how costly rewards are. For further details we refer to [21], where the steady rewarding model was presented and studied in detail. Moreover, the outcome of the presently studied model can also be compared to the one obtained by means of adaptive punishment, as studied recently in [29]. The main difference is that while rewarding cooperators increase their rewarding activity to reward cooperators, punishing cooperators increase their punishing activity to punish defectors. In both cases a constant drift towards inactivity in terms of either punishment or reward is assumed. For further details we again refer to [29], while here we proceed with presenting the main results.
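The elementary steps of the random sequential update can likewise be summarized in a short Python sketch. Again, all names are ours: the evaluation of payoffs over the five overlapping groups is abstracted into a total_payoff callback, and resetting the rewarding activity upon a strategy change is our assumption, consistent with the initially zero activity.

```python
import math
import random

K = 0.5  # uncertainty in strategy adoptions

def fermi(p_y, p_x, K=K):
    """Probability that player y imitates player x (argument clamped to avoid overflow)."""
    z = min((p_y - p_x) / K, 700.0)
    return 1.0 / (1.0 + math.exp(z))

def elementary_step(strategy, activity, neighbors, total_payoff, groups_of):
    """One imitation attempt; returns True if a defector passed on its strategy."""
    x = random.randrange(len(strategy))
    y = random.choice(neighbors[x])
    if strategy[x] == strategy[y]:
        return False
    if random.random() < fermi(total_payoff(y), total_payoff(x)):
        strategy[y] = strategy[x]
        activity[y] = 0  # reset on strategy change (assumption; pi starts at zero)
        if strategy[x] == 'D':
            # every rewarding cooperator in any group containing the defeated
            # player y raises its rewarding activity by one
            for group in groups_of(y):
                for member in group:
                    if strategy[member] == 'R':
                        activity[member] += 1
            return True
    return False

def decay_activities(strategy, activity):
    """Applied every second round: the constant drift towards non-rewarding."""
    for i, s in enumerate(strategy):
        if s == 'R' and activity[i] > 0:
            activity[i] -= 1
```

A full Monte Carlo step then consists of as many elementary steps as there are players, so that every player gets a chance to adopt a strategy once on average.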

Adaptive versus steady rewarding
Firstly, it is instructive to compare the impact of the newly introduced adaptive rewarding with that of steady rewarding at the same synergy factor r and cost of reward α. As shown in Fig. 1 (left), the application of steady rewards yields a stable presence of defectors virtually across the whole span of ∆. This implies that no matter how strong the rewarding, defection cannot be eliminated. Here rewarding cooperators enable the survival of cooperators, who act as "second-order free-riders", who in turn provide easy targets for defectors, thus creating a closed loop of dominance. The persistence of defectors is thus a direct consequence of "second-order free-riding", which emerges almost as soon as rewarding cooperators are able to invade defectors. Notably though, there is a very narrow span of intermediate ∆ values at which steady rewarding is just successful enough to overcome defection, but not sufficiently so to enable cooperators to free-ride on the newly acquired success. For adaptive rewarding, however, the outcome is significantly different, as shown in Fig. 1 (right). To begin with, much lower values of ∆ suffice to elicit the downfall of defectors. But even more importantly, "second-order free-riding" never gets a foothold in the population. Accordingly, as ∆ increases rewarding cooperators gradually rise to complete dominance, despite the very low synergy factor (r = 2) governing the production of public goods. Besides stable two-strategy D + C and D + R phases, the coexistence of all three competing strategies is also possible, where D and C form an alliance to compete against R. Notably, R(C) denotes the defection-free phase, but since in the absence of defectors strategies R and C become equivalent, the evolutionary process proceeds via slow logarithmic coarsening, as in the voter model [43].
However, since at the time of extinction of defectors the majority of players are rewarding cooperators, the system finally arrives at the R phase with a significantly higher probability. Notably, the dominance of strategy R becomes more evident if rare mutations are allowed, similarly as reported for punishment in [44].
As demonstrated in [21], defector-free states are attainable also with steady rewarding, but they require α < 0.05, i.e., very low costs of administrating the rewards. Adaptive rewarding is thus more effective, predominantly because "second-order free-riders" fail to induce cyclic dominance between the three competing strategies.

Phase diagrams and spatial patterns
The comparison with steady rewarding begets further explorations. In particular, the questions are whether coexistence in the absence of cyclic dominance is nevertheless possible, and to what degree the results presented in Fig. 1 are robust to parameter variations. To address this systematically, we proceed with the presentation of characteristic phase diagrams and spatial patterns for different values of r. Figure 2 features the full ∆ − α phase diagram for r = 4.4. It is important to note that for such a relatively high value of r, cooperators can survive in the presence of defectors without rewards, solely on the basis of spatial reciprocity. Accordingly, if rewarding is inefficient and costly, rewarding cooperators die out, leaving D + C as the stable two-strategy phase. As α decreases, however, rewarding cooperators become more and more competitive, which culminates in the outbreak of the stable D + R phase if ∆ is sufficiently small. Yet the discontinuous D + C → D + R transition is deceptive, in that it suggests that the competition is won or lost directly between cooperators and rewarding cooperators. This is in fact not the case, because in the absence of defectors the relation between the two eventually becomes neutral. The victor between C and R is therefore determined indirectly in terms of which of the two strategies is more successful in invading defectors. This indirect territorial battle is illustrated in Fig. 3, where in the upper row cooperators are more successful, while in the bottom row rewarding cooperators prevail. Note that in both cases cooperators and rewarding cooperators form compact clusters that are isolated from one another, which is a direct consequence of coarsening within a finite-size domain.
An identical phenomenon was reported in [28], where punishing cooperators and cooperators ("second-order free-riders") engaged in indirect competition that was mediated by defectors, and where too the victor was determined based on the success and efficiency of this invasion. It is also worth emphasizing that the fraction of defectors changes insignificantly during this evolutionary process, regardless of whether finally the D + C or the D + R phase is reached, i.e., C(R) spreads almost exclusively at the expense of R(C) (not shown). Thus, defectors truly just mediate the difference in efficiency between cooperators and rewarding cooperators.
Returning to the phase diagram presented in Fig. 2, it can be observed that as ∆ increases, the discontinuous first-order phase transitions give way to a continuous transition line leading to the D + C + R coexistence. In contrast to the steady rewarding model, however, here the coexistence is not rooted in a dynamical invasion process of the form D → C → R → D, but rather it is due to a static equilibrium. For details concerning the dynamical invasion fronts brought about by steady rewarding we refer to [21], while here we elaborate further on the static equilibrium that is characteristic for adaptive rewarding. Figure 4 (left) features a cross-section of the phase diagram presented in Fig. 2 at ∆ = 1.5. It can be observed that as the cost of rewarding (α) increases, the pure R phase transforms into the three-strategy D + C + R phase, which for still higher values of α becomes the D + C phase. This indicates that as rewarding cooperators lose their ability to deter defectors, they also simultaneously enable the existence of cooperators. Since the value of r is sufficiently high, cooperators can coexist with defectors, in fact forming an alliance with them to compete against rewarding cooperators. The emergence of this alliance can also be inferred from the cross-section plot, where ρ C and ρ D change simultaneously as α increases but all the while their ratio remains approximately the same. A characteristic spatial pattern attesting to this fact is presented in Fig. 4 (right), where the D + C patches, which are locally similar to the stable morphology plotted in the upper right panel of Fig. 3, are surrounded by invading green R players. For the latter the cost of rewarding is simply too high to eliminate defectors, which brings along the "second-order free-riders" to form the D + C free-riding axis. It is also worth pointing out that as soon as rewarding cooperators die out, the fractions of the D and C strategies cease to vary, indicating that the two indeed form an alliance that depends only on the value of r. If, however, the adaptive rewarding response is made more severe while at the same time rewards remain sufficiently affordable, the three-strategy phase terminates into a defector-free state, denoted as R(C) in Fig. 2. The absence of defectors makes cooperators and rewarding cooperators two equivalent strategies. Note that there is a constant drift towards non-rewarding if defectors fail to spread. This can be either because they are altogether missing, as is the case in the R(C) phase, or because they are not within the immediate neighborhood of R and thus spread undetected. The evolutionary process proceeds without surface tension via logarithmically slow coarsening, as is characteristic for the universality class of the voter model [43]. In [44], albeit within a model based on steady punishment, we have demonstrated that the prevalence of "active cooperators", here players adopting strategy R, can be accelerated very effectively by means of rare mutations. The latter give rise to occasional defectors, who in turn mediate the winner similarly as described by the indirect territorial battle in the realm of the D + C → D + R transition.

Figure 5. Full ∆ − α phase diagrams, as obtained at r = 3.5 (left) and r = 2 (right). As in Fig. 2, blue solid lines depict continuous second-order phase transitions and symbols mark the surviving strategies in the stationary state. Since the synergetic effects of collaborative efforts are too weak, cooperators can no longer survive alone in the presence of defectors. Accordingly, the D + C phase is missing. Instead, as ∆ increases, and if α is sufficiently small, the pure D phase gives way to the two-strategy D + R phase, which may further transform into the three-strategy D + C + R phase, but only if r is sufficiently large (left). At r = 2, for example, the three-strategy phase is no longer attainable on the considered ∆ − α plane. For small rewarding costs the defector-free R(C) phase is obtained (having the same properties as described for r = 4.4), although its area shrinks continuously as r decreases.
If the added value of collaborative efforts is smaller, i.e., if r decreases, the phase diagrams change significantly, primarily because the D + C alliance is no longer possible. Figure 5 features two phase diagrams, the left as obtained for r = 3.5 and the right as obtained for r = 2, where the differences compared to Fig. 2 are clearly inferable. If the cost of rewarding is substantial, defectors are the only ones to survive. Naturally, the lower the value of r, the lower the value of α that still warrants defector dominance. The pure D phase becomes the two-strategy D + R phase by means of a continuous phase transition even at small r, provided the value of ∆ is not too small and the value of α is not too large. Continuing further towards more efficient rewarding may lead to the defector-free R(C) state, which has the same properties as described above for r = 4.4. As with the D + R phase, the extent of the R(C) region expectedly shrinks with decreasing r towards higher ∆ and lower α.

Figure 6. The left panel features a cross-section of the phase diagram presented in Fig. 5 (left), as obtained at ∆ = 2.0. As α increases the rewarding cooperators first give way to a three-strategy D + C + R phase, but further on persevere longer than the "second-order free-riders". At smaller values of r the latter require a delicate balance of conditions to survive, and can do so only along the D + R interfaces. The right panel depicts a characteristic snapshot of such a three-strategy phase, taken at ∆ = 2.0 and α = 0.55. Small and rare patches of cooperators (blue) can survive where defectors (red) and rewarding cooperators (green) meet.
The "second-order free-riders", on the other hand, can survive only in the three-strategy D + C + R phase, but its existence is limited to high values of ∆, intermediate α and still sufficiently high values of r, as can be observed by comparing the left and right panels of Fig. 5. Importantly, this three-strategy phase is qualitatively different from the one described above for r = 4.4. As mentioned, because of the lower r, here cooperators cannot survive alone if surrounded solely by defectors. In fact, they can survive only where defectors and rewarding cooperators meet, i.e., along the D + R interfaces. The characteristic snapshot presented in Fig. 6 (right) confirms such a spatial configuration within the three-strategy phase. Details of its emergence are inferable from the cross-section of the phase diagram presented in Fig. 6 (left), which reveals that as α exceeds a critical value, the efficiency of R weakens to the point where defectors are able to survive. The stable presence of a small fraction of cooperators, surviving at the D + R interfaces, accompanies this transition. Interestingly, as α is further increased, the first to go extinct are not rewarding cooperators but the "second-order free-riders", who fail to harvest the benefits of decreased rewarding efficiency. This indicates that, especially at small synergy factors, only a fine balance of all the other parameters enables the survival of "second-order free-riding".

Reward versus punishment
Finally, we address the "stick versus carrot" dilemma within the realm of adaptive modeling. To do so, we first focus on the competition solely between defectors and rewarding cooperators. The question is, given a fixed cost of administrating rewards α, what is the minimally required value of ∆ that warrants the complete elimination of defectors? The answer is presented in Fig. 7 as a function of the synergy factor r (solid green line). Next, we answer the same question again, but replacing the rewarding cooperators with punishing cooperators. For consistency we use the same value of α, but accordingly it now represents the punishment cost rather than the cost of rewarding. The dashed gray line in Fig. 7, depicting the results for adaptive punishment, falls significantly below the one obtained with adaptive rewarding. This leads to the conclusion that adaptive punishment, which we studied separately in [29], is more effective than adaptive rewarding in warranting defector-free states.
An intuitive explanation as to why this is the case is presented in Fig. 8, where we follow the evolution of interfaces separating defectors and punishing cooperators (top row) as well as defectors and rewarding cooperators (bottom row) under identical conditions. It can be observed that while rewarding cooperators are more successful in penetrating the area of defectors, the punishing cooperators advance more slowly but maintain a compact phase. For example, in the third snapshot from the left, some rewarding cooperators have already reached the border of the lattice while punishing cooperators have yet to advance notably. However, rewarding cooperators pay a price for their over-aggressive invasion, namely an irregular interface that facilitates coexistence with defectors. Paradoxically, the less aggressive effect of punishment, which focuses on repairing the cracks in the phalanx rather than on advancing into the territory of defectors at any cost, turns out to be more effective in the end. Punishing cooperators rise to complete dominance with the aid of near flawless support of network reciprocity [45]. Rewarding cooperators, on the other hand, sacrifice the latter for a faster advancement, and therefore fail to create the desired defector-free state. The Achilles heel of rewarding is thus an excessively aggressive invasion of defectors that neglects the benefits of network reciprocity.

Figure 8. Comparison of the evolution of interfaces separating punishing cooperators and defectors (top row), and rewarding cooperators and defectors (bottom row). It can be observed that while rewarding cooperators (green) advance faster into the territory of defectors (red), the punishing cooperators (gray) are relentlessly bent on keeping their phase compact. Although advancing more slowly, they ultimately succeed in completely eliminating the defectors. Rewarding cooperators, on the other hand, have to make do with coexistence. Note that darker shades of gray (green) denote players with higher punishing (rewarding) activity. The parameter values are the same for both cases, namely r = 2, ∆ = 2 and α = 0.4, while the snapshots were taken at 1, 70, 300, 1000, 3000 and 6000 full Monte Carlo steps.

Summary
We have shown that adaptive rewarding creates several evolutionary advantages by means of which public cooperation is promoted, in particular many that go beyond those warranted by steady rewarding [21]. Phase diagrams and the corresponding analysis of spatial patterns reveal that, if the added value of collaborative efforts is substantial, rewarding cooperators fight an indirect territorial battle with the cooperators. The catalysts are the defectors, who essentially determine the victor depending on who can invade them more successfully. If the parameters determining adaptive rewarding are set properly, most notably if the rewards are sufficiently cheap yet not excessively so and the response to invading defectors is sufficiently strong, the three competing strategies form a stable phase wherein defectors form a free-riding alliance with the cooperators, i.e., the "second-order free-riders", to compete against rewarding cooperators. This three-strategy phase can also be observed at intermediate multiplication factors, although its extent in the phase diagrams shrinks continuously as the synergetic effects of cooperation are lowered and the D + C alliance accordingly becomes more and more difficult to sustain. It is also worth emphasizing that the spatial dynamics enabling the three-strategy phase changes as well. While for sufficiently high multiplication factors cooperators can survive alone in the presence of defectors, at lower values of r they can avoid extinction only in the immediate vicinity of D − R interfaces. If either the cost of rewarding is decreased further or the adaptive response is made even more severe, the coexistence is terminated, which leads to a defector-free state. Due to a constant drift towards non-rewarding in the absence of defectors that could spread successfully, cooperators and rewarding cooperators become equivalent strategies, and accordingly the victor is determined via slow logarithmic coarsening, as known from the voter model [43].
In the majority of cases, however, rewarding cooperators occupy the larger portion of the square lattice at the time defectors die out, and accordingly they are the more likely winners. This competition becomes even more biased in the presence of rare mutations.
Comparing the outcome with the one elicited by adaptive punishment [29], we find that the supreme efficiency of rewarding cooperators in terms of invading defectors lessens the effectiveness of network reciprocity, which in turn designates the slower-advancing adaptive punishers as the more effective and indeed the more successful strategy. We report that the minimally required fine to reach a defector-free state is much lower than the minimal reward needed to achieve the same goal. Thus, while a deep invasion of isolated players into the territory of defectors is better supported by rewarding, punishers can reach the collective target of eliminating defectors only by collaborating and "holding the line". The latter statement is reminiscent of an instruction that is frequently given to soldiers engaging in combat, highlighting the continued importance of network reciprocity despite additional, and locally more effective, means to overcome defection. The uncovered Achilles heel of rewards may also provide further clues as to why order and justice in society are maintained by law that focuses on sanctioning rather than rewarding: although the former acts more subtly and requires higher coherence between group members, under the same conditions it provides a higher collective wellbeing for the whole community.