Reward versus punishment: averting the tragedy of the commons in eco-evolutionary dynamics

We consider an unstructured population of individuals who are randomly matched in an underlying population game whose payoffs depend on the evolving state of a common resource exploited by the population. Many mechanisms are known for averting the overexploitation (tragedy) of the common resource; probably the most common is reinforcing cooperation through rewards and punishments. Additionally, the depleting resource itself can provide feedback that reinforces cooperation. It is therefore an interesting question how reward and punishment comparatively fare in averting the tragedy of the commons (TOC) in the game-resource feedback evolutionary dynamics. Our main finding is that, while averting the TOC completely, rewarding cooperators cannot get rid of all the defectors, unlike what happens when defectors are punished; as a consequence, in the completely replete resource state, the outcome of the population game can be socially optimal in the presence of punishment but not in the presence of reward.


Introduction
In human society, the idea that a person earns a just reward for good deeds and a just punishment for crimes is the very foundation of both human and divine conceptions of justice. Rational individuals, in the absence of any designed reinforcers like reward and punishment, selfishly defect from helping others with a view to maximizing their own utilities, thereby leading to the tragedy of the commons (TOC) [1][2][3]: the overexploitation of shared common resources. Uncontrolled population growth [3], water pollution and water crises [4], pollution of the earth's atmosphere [5], property rights, communal rights or state regulation [6], and wildlife crimes [7] are a few illustrative examples of the TOC. The TOC drives a system to a state of depleted shared resources. For such systems, designing a proper reinforcement and feedback mechanism that regulates resource consumption is a pressing issue.
Since the payoffs and strategies of individuals in a population competing for a common resource should, in practice, depend on the state of the resource, another interesting question presents itself: in the course of preventing the TOC, when can the outcome of the underlying population game be rendered socially optimal? Social optimality is the desired state of affairs for a social group: a strategic interaction is socially optimal if it maximizes the sum of the payoffs of all the individuals (players or agents) in the population [8]. Prevention of the TOC is obviously closely tied to the dominance of cooperators in the population. Reward and punishment are the most straightforward mechanisms for the emergence and sustenance of cooperation in both human and animal behavior. However, it is not apparent a priori whether reward and punishment act as two sides of the same coin or whether they are two kinds of reinforcers playing fundamentally distinct roles [9][10][11][12][13]. Thus, it is of interest to ask how effective they are in averting the TOC and which of the two has more efficacy in furnishing a socially optimal outcome.
Reward and punishment have been shown to exhibit a statistically equivalent positive effect on cooperation [14]. Sometimes [9,15] their simultaneous action is desirable for averting the TOC. Both reward [16] and punishment [17] can plant the first seed of cooperation for it to take over globally. Both are incapable of achieving global cooperation when the population is divided into localized subgroups and the payoffs are locally inefficient [18]. There are, however, some subtle differences: e.g., rewarding those who reward has been shown to be preferable to punishing those who do not punish [19,20]. In an experimental setting of a variant of the public goods game, punishment was shown to have a more substantial effect than reward on cooperation when considered by itself; however, reward had a more significant influence when combined with optional participation [21]. Of course, the effectiveness of reward and punishment in inducing cooperation is expected to depend non-trivially on their intensity [22] and on the underlying game [23].
The TOC is omnipresent; it goes beyond human behavior and action. It can also be observed in nonhuman biological evolutionary systems [24][25][26][27][28][29][30][31][32][33][34][35][36][37] where the players may have almost no rationality. Punishment is not unheard of in such systems [28,38-41]. In such biological systems, the commons undergoing the tragedy have a rather more general interpretation: it either can be a pre-existing resource, or it can be a social good formed either by cooperation or by abstaining from conflict. Furthermore, the tragedy is classified as either collapsing or component depending on whether the corresponding resource is respectively completely or partially lost.
Obviously, evolutionary game theory [42][43][44][45] is a relevant formalism for investigating the TOC, and hence the use of the paradigmatic replicator equation [45][46][47][48][49][50] in this context is only natural. For example, the replicator equation has been used in the public goods game to observe the success of altruistic punishment [51] and to show that reward may encourage cooperation but, unlike punishment, may fail to stabilize it [52]; in the latter case, complex dynamics and unpredictable oscillations are also observed. More importantly, in close relation to the central idea of the present paper, game-resource feedback dynamics have been studied within the paradigm of the replicator equation to reveal how harsh punishments may not save a renewable resource [53].

Model
In this paper, we work with the simplest non-trivial setup of the TOC, in which an external resource is exploited by players (either cooperators or defectors) playing the prisoner's dilemma game, where defection is the dominant strategy. We allow for the effects of punishment and reward by modifying the game payoff matrix. Furthermore, we allow the depleting resource to feed back into the payoff matrix, which models the effective fitness of the cooperators whose frequency evolves in accordance with the replicator equation. The indirect [54] and direct [55] roles of the environment in sustaining cooperation are well known. Subsequently, we perform a comparative study of the effects of reward and punishment. But before we discuss the results, let us first elaborate on the mathematical model that is central to our work.

Game-environment feedback: general formulation
Let us consider μ different strategies that can be adopted by any individual member of a very large unstructured population of constant size. Let x_i(t) be the frequency with which the ith strategy is used at any instant t. The vector x(t) ≡ (x_1(t), x_2(t), . . . , x_μ(t)) defines the frequency distribution of the strategies over the consumer population. Let n(t) ≡ (n_1(t), n_2(t), . . . , n_ν(t)) denote the states of ν different common shared resources in the environment. For convenience, and without any loss of generality, we assume that every n_j(t) is normalized so as to lie in the unit interval, i.e., 0 ≤ n_j(t) ≤ 1. The state of the composite system consisting of the consumer population and the common shared resources can be represented by s(t) ≡ (x(t), n(t)). The resultant dynamical system is governed by μ − 1 differential equations for x(t) (since Σ_{i=1}^{μ} x_i = 1) and ν differential equations for n(t).
The state of the population evolves under the replication-selection process; the time-continuous replicator equation is a suitable model dynamic. The replicator equation can be obtained in several similar ways [45, 49, 56-58], either rigorously in the setting of one-locus-many-allele theory or phenomenologically from the interplay between the behavioural or strategic actions of the players. The fitness f_i of an individual with the ith strategy should, in full generality, be a function of s(t) ≡ (x(t), n(t)). The per capita rate of increase of the ith strategy, d(log x_i)/dt, is a measure of evolutionary success; the basic tenet of Darwinism suggests that this evolutionary success can be expressed as the difference between the fitness f_i(s(t)) of the ith strategy and the average fitness Σ_{j=1}^{μ} x_j f_j(s(t)). Thus, we write

\dot{x}_i = x_i\left[f_i(\mathbf{s}) - \sum_{j=1}^{\mu} x_j f_j(\mathbf{s})\right], \quad (1)

which is the replicator equation. To give an explicit functional form to the fitness function in the evolutionary game-theoretic set-up, we define a μ × μ dimensional payoff matrix U(n), whose element U_ij(n) represents the payoff obtained by an individual with the ith strategy when it interacts with an opponent employing the jth strategy. The payoff matrix is independent of x(t) [59], whose dependence enters the fitness through the following relation:

f_i(\mathbf{s}) = \sum_{j=1}^{\mu} U_{ij}(\mathbf{n})\, x_j. \quad (2)

A complete description of the system requires the specification of the payoff matrix U(n). Finally, the model needs to be supplied with a dynamics for the state n of the common shared resources. In general, these resources may have intrinsic dynamics even in the absence of exploitation by the consumer population. Let the intrinsic rate of change of n_i(t) be g_i^(in)(n(t)) (1 ≤ i ≤ ν). The corresponding equations of motion can be written as

\dot{n}_i = \lambda_i^{(\mathrm{in})} g_i^{(\mathrm{in})}(\mathbf{n}) + \lambda_i^{(\mathrm{ex})} g_i^{(\mathrm{ex})}(\mathbf{s}), \quad (3)

where g_i^(ex)(s(t)) is the effect of the consumer population's consumption on the ith resource. The parameters λ_i^(in) and λ_i^(ex) respectively determine the speeds of the intrinsic and extrinsic dynamics of the ith resource [60].
Equations (1) and (3) constitute the set of dynamical equations that describe the eco-evolutionary dynamics of the composite system's state, s(t) ≡ (x(t), n(t)).
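As a minimal numerical sketch of equations (1) and (2), the replicator vector field can be written in a few lines; the payoff values below are purely illustrative assumptions, not taken from the paper.

```python
import numpy as np

def replicator_rhs(x, U):
    """Replicator vector field dx_i/dt = x_i (f_i - <f>) of equation (1),
    with the linear fitness f_i = sum_j U_ij x_j of equation (2)."""
    f = U @ x      # fitness of each strategy
    avg = x @ f    # population-average fitness
    return x * (f - avg)

# Illustrative 2-strategy (prisoner's dilemma type) payoff matrix.
U = np.array([[3.0, 0.0],
              [5.0, 1.0]])
x = np.array([0.6, 0.4])
dx = replicator_rhs(x, U)
```

Note that the components of dx sum to zero, so the total frequency is conserved; for the dilemma-type matrix above, the first (cooperator) strategy declines.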

Social dilemmas and external reinforcers
We now adopt the simplest non-trivial specific setting that is conducive to studying the TOC with added reinforcers, e.g., rewards and punishments. Often, the rise of the defectors who bring about the TOC is exemplified through the prisoner's dilemma game (T > R > P > S) [61], a one-shot two-player two-strategy game (μ = 2), in which the players play the non-Pareto-optimal [62] Nash equilibrium [63] strategy (defection) although mutual cooperation would fetch both players a comparatively higher payoff. Its normal bi-matrix form, with rows (columns) indexed by the row (column) player's strategies C and D in that order, is

\begin{pmatrix} (R, R) & (S, T) \\ (T, S) & (P, P) \end{pmatrix},

where the first and the second comma-separated elements in each entry are the payoffs of the row and the column player, respectively. We denote by x_1 and x_2 the fractions of cooperators and defectors, respectively; here, obviously, μ = 2. Since x_1(t) and x_2(t) = 1 − x_1(t) are not independent variables, we use x(t) ≡ x_1(t) as the only variable specifying the state of the consumer population. Furthermore, in order to keep the discussion of the eco-evolutionary model simple, we choose ν = 1, so that we may write s(t) ≡ (x(t), n(t)) as the determiner of the composite state. As discussed earlier, our goal is to study the relative effects of reinforcers like reward and punishment. To this end, we assume that if a player defects against a cooperator, she is punished by reducing her payoff by a positive real number p; similarly, if a player cooperates against a defector, she is rewarded with an additional payoff r, a positive real number.
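As a concrete illustration, with hypothetical prisoner's dilemma payoffs satisfying T > R > P > S (the numerical values are assumptions for the sketch), the two reinforcers modify the row player's payoff matrix as follows:

```python
import numpy as np

# Hypothetical payoffs satisfying T > R > P > S.
T, R, P, S = 5.0, 3.0, 1.0, 0.0

def with_punishment(p):
    # Defecting against a cooperator is penalized: T -> T - p.
    return np.array([[R, S],
                     [T - p, P]])

def with_reward(r):
    # Cooperating against a defector is rewarded: S -> S + r.
    return np.array([[R, S + r],
                     [T, P]])

U_p = with_punishment(3.0)   # punished defection now pays T - p = 2
U_r = with_reward(2.0)       # rewarded cooperation now pays S + r = 2
```

For p > T − R, mutual cooperation pays more than defecting against a cooperator, which is exactly how harsh punishment relaxes the dilemma.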

Reinforcement by shared resource's feedback
In addition to the effect of external reinforcers like reward and punishment, we must consider an inherent reinforcing mechanism introduced by the shared resource's feedback. When the individuals in the population start to exploit the shared resource, the state of the shared resource depletes; consequently, there is a potential change in the interaction pattern amongst the individuals. This concept is mathematized through the form of the total payoff matrix U(n) as given below [60, 64-66]:

U(n) = n\,U_1 + (1 - n)\,U_0. \quad (4)

In the above equation, U_k (k ∈ {0, 1}) is the symbolic notation for U(n = k), where k = 0 signifies the state of the completely depleted common resource while k = 1 is for the fully replete state. Therefore, in the presence of reward or punishment, the mathematical form of U_1 is, respectively,

U_1 = \begin{pmatrix} R_1 & S_1 + r \\ T_1 & P_1 \end{pmatrix} \quad \text{or} \quad U_1 = \begin{pmatrix} R_1 & S_1 \\ T_1 - p & P_1 \end{pmatrix}. \quad (5)

Additionally, we note from equation (4) that, as depletion starts, the changed interaction pattern due to the feedback of the shared resource is mathematically reinforced through U_0, which, for consistency of notation, can be displayed in its most general form as

U_0 = \begin{pmatrix} R_0 & S_0 \\ T_0 & P_0 \end{pmatrix}. \quad (6)

In our setup, individuals in the fully replete state are involved in selfish behaviour modelled by the prisoner's dilemma game matrix, i.e., T_1 > R_1 > P_1 > S_1. But as degradation of the resource starts, cooperation can emerge through reinforcement; the form of U_0, thus, could deviate from the prisoner's dilemma.
In general, depending on how the preferences of the players change as degradation of the shared resource takes place, U_0 could be the payoff matrix of any of the four exhaustive and mutually exclusive classes of games [67][68][69][70][71], classified based on the correspondence of the Nash equilibria with cooperation and defection: (a) Harmony game: R_0 > T_0 and S_0 > P_0; the only symmetric Nash equilibrium is mutual cooperation. (b) Anti-coordination game: R_0 < T_0 and S_0 > P_0; there exists a unique mixed symmetric Nash equilibrium in which the players play a mixed strategy randomized over the pure strategies. (c) Prisoner's dilemma: T_0 > R_0 > P_0 > S_0; mutual defection is the only symmetric Nash equilibrium. (d) Coordination game: R_0 > T_0 and P_0 > S_0; there are two symmetric pure Nash equilibria (mutual defection and mutual cooperation) and one mixed symmetric Nash equilibrium similar to that of the anti-coordination game.
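The four classes above can be read off from the signs of R_0 − T_0 and S_0 − P_0 alone; the small sketch below (with hypothetical payoff values in the checks) makes the classification explicit.

```python
def classify(R0, S0, T0, P0):
    """Classify a symmetric 2x2 game by the signs of R0 - T0 and S0 - P0,
    following the four exhaustive classes listed in the text."""
    drt, dsp = R0 - T0, S0 - P0
    if drt > 0 and dsp > 0:
        return "harmony"
    if drt < 0 and dsp > 0:
        return "anti-coordination"
    if drt < 0 and dsp < 0:
        return "prisoner's dilemma"
    return "coordination"
```

For example, classify(3, 0, 5, 1) returns "prisoner's dilemma", while classify(5, 2, 3, 1) returns "harmony".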

Dynamics of cooperators and common resource
With the choice of equation (4) for the payoff matrix, the variation of the fraction of cooperators with time is governed by the following equation:

\dot{x} = x(1-x)\left[x\,\{R(n) - T(n)\} + (1-x)\,\{S(n) - P(n)\}\right], \quad (7)

where R(n) ≡ nR_1 + (1 − n)R_0, etc., denote the elements of U(n), and, in the presence of a reinforcer, the relevant elements of U_1 are modified as in equation (5). In our model, we ignore the intrinsic dynamics of the resource for simplicity. We incorporate the fact that cooperating consumers can cause augmentation of the resource, while the defecting consumers always cause its deterioration. If θ > 0 is the ratio of the efficacy of the augmentation rate to that of the deterioration rate, then it has been shown [64] that a simple model for the state of the shared resource n can be written as

\dot{n} = \epsilon\, n(1-n)\left[\theta x - (1-x)\right], \quad (8)

where the parameter ε ≡ λ_1^(ex) ≪ 1 because the dynamics of the environment is assumed to be slower than the dynamics of the strategists' frequencies.
Equations (7) and (8) constitute the set of dynamical equations that mathematically describe the eco-evolutionary dynamics of cooperators and defectors consuming a single shared resource in the additional presence of external reinforcement through reward or punishment.
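Equations (7) and (8) are straightforward to integrate numerically. The sketch below uses a simple Euler scheme for the punishment case; all payoff matrices and parameter values are illustrative assumptions (a harmony game at n = 0 and a harshly punished prisoner's dilemma at n = 1), chosen so that the trajectory from (x, n) = (0.5, 0.5) flows to the fully replete, all-cooperator state.

```python
import numpy as np

def rhs(state, U0, U1, theta, eps):
    """Vector field of equations (7)-(8): replicator dynamics for the
    cooperator fraction x coupled to the resource state n."""
    x, n = state
    U = n * U1 + (1.0 - n) * U0              # equation (4)
    f = U @ np.array([x, 1.0 - x])           # fitnesses of C and D
    dx = x * (1.0 - x) * (f[0] - f[1])
    dn = eps * n * (1.0 - n) * (theta * x - (1.0 - x))
    return np.array([dx, dn])

def integrate(state, U0, U1, theta, eps, dt=0.05, steps=40_000):
    # Plain Euler stepping; adequate for a qualitative sketch.
    for _ in range(steps):
        state = state + dt * rhs(state, U0, U1, theta, eps)
    return state

# Hypothetical matrices: harmony game at n = 0, punished PD at n = 1.
U0 = np.array([[5.0, 2.0], [3.0, 1.0]])
U1 = np.array([[3.0, 0.0], [1.0, 1.0]])     # T - p = 1 with T = 5, p = 4
x_final, n_final = integrate(np.array([0.5, 0.5]), U0, U1, theta=2.0, eps=0.1)
```

With these values the system settles near (x, n) = (1, 1): the TOC is averted and the population ends up fully cooperative.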

Results
For later convenience, we first introduce four parameters [60, 64-66] that can be intuitively interpreted as four distinct types of incentives for changes in the strategies:

\Delta^k_{RT} \equiv R_k - T_k \quad \text{and} \quad \Delta^k_{SP} \equiv S_k - P_k, \quad k \in \{0, 1\}.

In the literature [72][73][74][75], −Δ^0_RT and −Δ^1_RT are the gamble-intending dilemma strengths, and −Δ^0_SP and −Δ^1_SP are the risk-averting dilemma strengths, for the completely depleted and the completely replete shared resource states, respectively. It is easy to see that, by construction, the four quadrants of the Δ^0_RT-Δ^0_SP plane correspond to the aforementioned four classes of games: the first quadrant is for the harmony game, the second quadrant for the coordination game, the third quadrant for the prisoner's dilemma, and the fourth quadrant for the anti-coordination game. (As is customary, the quadrants are counted anticlockwise starting from the first quadrant at the top-right of figure 1.) Furthermore, we also introduce a parameter, δ_0 ≡ Δ^0_RT/Δ^0_SP = (−Δ^0_RT)/(−Δ^0_SP), that quantifies by what multiplicative factor a player has more affinity to defect against a cooperator than against a defector when the shared resource is in the worst possible state, i.e., n = 0. We draw two lines, δ_0 = Δ^1_RT/Δ^1_SP and δ_0 = −θ, to divide the Δ^0_RT-Δ^0_SP plane into seven distinct regions, as seen in figure 1.
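In code, these incentive parameters and the ratio δ_0 are one-liners; the payoff values below are hypothetical, forming a prisoner's dilemma at n = 0.

```python
def deltas(R, S, T, P):
    """Return (Delta_RT, Delta_SP) = (R - T, S - P); their negatives are
    the gamble-intending and risk-averting dilemma strengths."""
    return R - T, S - P

# Hypothetical n = 0 payoffs forming a prisoner's dilemma (T > R > P > S).
drt0, dsp0 = deltas(R=3.0, S=0.0, T=5.0, P=1.0)
delta0 = drt0 / dsp0   # affinity ratio at n = 0
```

Here both Δ^0_RT and Δ^0_SP are negative (third quadrant, prisoner's dilemma) and δ_0 = 2: the player is twice as inclined to defect against a cooperator as against a defector.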

Equilibrium states
Now we perform the linear stability analysis of the eco-evolutionary dynamics, i.e., of equations (7) and (8) along with equations (4)-(6). For the case of either reward or punishment, we get a total of seven fixed points (x*, n*), four of which are the corners of the phase space, i.e., (0, 0), (0, 1), (1, 0), and (1, 1); these are common to both cases and exist for all parameter values. Another common fixed point, (x*_b, 0), lies on the bottom edge (n = 0) of the x-n phase space, where

x^*_b = \frac{1}{1 - \delta_0} = \frac{\Delta^0_{SP}}{\Delta^0_{SP} - \Delta^0_{RT}}.

This fixed point exists if δ_0 < 0. However, the fixed points on the top edge (n = 1) and in the interior of the phase space differ depending on the reinforcer. For the case of punishment, the fixed point on the top edge is (x*_tp, 1), where x*_tp is given by

x^*_{tp} = \frac{1}{1 - \delta_{1p}}, \qquad \delta_{1p} \equiv \frac{p + \Delta^1_{RT}}{\Delta^1_{SP}},

and the interior fixed point is (x*_ip, n*_ip), where

x^*_{ip} = \frac{1}{1 + \theta}, \qquad n^*_{ip} = \frac{\Delta^0_{SP}(\theta + \delta_0)}{\Delta^0_{SP}(\theta + \delta_0) - \Delta^1_{SP}(\theta + \delta_{1p})}.

For the case of reward, the fixed point on the top edge is (x*_tr, 1), where

x^*_{tr} = \frac{1}{1 - \delta_{1r}}, \qquad \delta_{1r} \equiv \frac{\Delta^1_{RT}}{r + \Delta^1_{SP}},

and the interior fixed point is (x*_ir, n*_ir), where

x^*_{ir} = \frac{1}{1 + \theta}, \qquad n^*_{ir} = \frac{\Delta^0_{SP}(\theta + \delta_0)}{\Delta^0_{SP}(\theta + \delta_0) - (r + \Delta^1_{SP})(\theta + \delta_{1r})}.

Table 1. We tabulate the conditions that need to be satisfied for a fixed point to acquire its possible type in the case of reward. The +/− sign refers to the sign that the corresponding parameter must take for the fixed point to be of the type mentioned in the last column. 'na' (not applicable) stands for the fact that the nature of the fixed point is independent of the corresponding parameter. Here, ξ_r ≡ (θ + δ_0) − (θ + δ_1r).

It is important to note that the locations of the fixed points on the top edge and in the interior of the phase space depend on the strength of the reinforcement; for certain values of r or p, these fixed points can leave the physically allowed phase space and become effectively non-existent. For the case of reward, (x*_tr, 1) exists if δ_1r ≡ Δ^1_RT/(r + Δ^1_SP) < 0, whereas for the case of punishment, the analogous fixed point (x*_tp, 1) exists if δ_1p ≡ (p + Δ^1_RT)/Δ^1_SP < 0.
Also, the interior fixed points (x*_ir, n*_ir) and (x*_ip, n*_ip) exist if the inequalities (r + Δ^1_SP)(θ + δ_1r)/[Δ^0_SP(θ + δ_0)] < 0 and Δ^1_SP(θ + δ_1p)/[Δ^0_SP(θ + δ_0)] < 0, respectively, are satisfied. Our next goal is to find the stability of the aforementioned fixed points, which correspond physically to the composite system's equilibrium states. We accomplish this by employing linear stability analysis exhaustively for all the cases under consideration: we calculate the 2 × 2 Jacobian matrices of the linearized dynamical system (equations (7) and (8)) at all the fixed points and find the corresponding eigenvalues, whose signs decide the type (and hence the stability) of the fixed points (see tables 1 and 2).
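The interior fixed point and its linear stability are easy to check numerically. The sketch below (all payoff and parameter values are illustrative assumptions: a prisoner's dilemma at n = 0 and a harshly punished one, p = 5, at n = 1) verifies that the vector field vanishes at (1/(1 + θ), n*) and inspects the eigenvalues of a finite-difference Jacobian; for these particular values the interior point turns out to be a saddle.

```python
import numpy as np

def rhs(state, U0, U1, theta, eps):
    # Vector field of equations (7)-(8).
    x, n = state
    U = n * U1 + (1.0 - n) * U0
    f = U @ np.array([x, 1.0 - x])
    return np.array([x * (1.0 - x) * (f[0] - f[1]),
                     eps * n * (1.0 - n) * (theta * x - (1.0 - x))])

def jacobian(state, U0, U1, theta, eps, h=1e-6):
    """Central-difference Jacobian of the vector field at a point."""
    J = np.zeros((2, 2))
    for k in range(2):
        e = np.zeros(2); e[k] = h
        J[:, k] = (rhs(state + e, U0, U1, theta, eps)
                   - rhs(state - e, U0, U1, theta, eps)) / (2.0 * h)
    return J

# Hypothetical payoffs: PD at n = 0, harshly punished PD at n = 1 (p = 5).
U0 = np.array([[3.0, 0.0], [5.0, 1.0]])
U1 = np.array([[3.0, 0.0], [0.0, 1.0]])
theta, eps = 2.0, 0.1
# Interior fixed point predicted by the formulas of the linear analysis:
point = np.array([1.0 / (1.0 + theta), 0.8])
residual = rhs(point, U0, U1, theta, eps)
eigvals = np.linalg.eigvals(jacobian(point, U0, U1, theta, eps))
```

The residual is zero to numerical precision, and the Jacobian has one positive and one negative eigenvalue, i.e., a saddle for these parameter choices.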
In figure 1, we present the gist of the mathematical results pictorially; the figure elaborately shows the bifurcations in the phase portraits as the intensity of reward or punishment is increased. While reading the figure, one should interpret any stable fixed point as an equilibrium state that can eventually be reached starting from some initial state in the corresponding basin of attraction; any unstable fixed point (unstable focus, unstable node, or saddle) is not realized in practice. A close inspection of the figure reveals some interesting features of the system that we discuss in what follows.

Bistability
The very first thing we note is the phenomenon of bistability, which is an omnipresent feature of many natural and man-made systems. In the case under study, the bistability partitions (almost all) initial conditions into two mutually exclusive classes depending on where the initial conditions are eventually attracted. The two classes can correspond to two of the three possible equilibrium states of the resource, viz, complete prevention of the TOC (n = 1), partial prevention of the TOC (0 < n < 1), and complete realization of the TOC (n = 0).

Table 2. We tabulate the conditions that need to be satisfied for a fixed point to acquire its possible type in the case of punishment. The +/− sign refers to the sign that the corresponding parameter must take for the fixed point to be of the type mentioned in the last column. 'na' (not applicable) stands for the fact that the nature of the fixed point is independent of the corresponding parameter. Here, ξ_p ≡ (θ + δ_0) − (θ + δ_1p).

The bistability is not realized in quadrant I (refer to figure 1) in the presence of either reward or punishment. In passing, we remark that in this quadrant, an oscillatory TOC (mathematically modelled by a heteroclinic cycle [64]) is known to be present in the absence of any reward or punishment; the presence of either reinforcer beyond a certain threshold, however, averts the oscillatory TOC. In the remaining three quadrants, both reward and punishment can bring about bistability. However, the exact nature of the bistability depends on whether reward or punishment is being employed.
We note that while the emergence of bistability leads to prevention of the TOC that is otherwise inevitable in the absence of reward or punishment (refer to quadrant II, quadrant III, and region IVA in figure 1), the prevention is not guaranteed; rather, it depends on the initial state of the system in addition to the extent of reward or punishment. In these cases, when the cooperator fraction is too low in an extremely poor state of the resource, no finite amount of reward or punishment can avert the TOC.
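This sensitivity to the initial state is easy to reproduce numerically: under the same dynamics, two initial conditions on either side of the basin boundary flow to the fully replete and the fully depleted states, respectively. In the sketch below the payoffs and parameters are hypothetical, chosen so that U_0 is a coordination game (quadrant II) and U_1 a harshly punished prisoner's dilemma.

```python
import numpy as np

def rhs(state, U0, U1, theta, eps):
    # Vector field of equations (7)-(8).
    x, n = state
    U = n * U1 + (1.0 - n) * U0
    f = U @ np.array([x, 1.0 - x])
    return np.array([x * (1.0 - x) * (f[0] - f[1]),
                     eps * n * (1.0 - n) * (theta * x - (1.0 - x))])

def integrate(state, U0, U1, theta, eps, dt=0.05, steps=40_000):
    # Euler stepping; adequate for a qualitative sketch.
    for _ in range(steps):
        state = state + dt * rhs(state, U0, U1, theta, eps)
    return state

U0 = np.array([[4.0, 0.0], [3.0, 2.0]])   # coordination game at n = 0
U1 = np.array([[3.0, 0.0], [1.0, 1.0]])   # punished PD at n = 1 (T - p = 1)
theta, eps = 2.0, 0.1
hi = integrate(np.array([0.9, 0.5]), U0, U1, theta, eps)
lo = integrate(np.array([0.1, 0.5]), U0, U1, theta, eps)
```

Starting from a high cooperator fraction the system reaches the neighbourhood of (1, 1) (TOC completely averted), whereas starting from a low one it reaches the neighbourhood of (0, 0) (TOC fully realized): the hallmark of bistability.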
We also note (in region IA and region IVB in figure 1) that if a partial TOC is present in the absence of reward or punishment, enough reward or punishment leads to complete prevention of the TOC. Unlike the case of reward, the complete prevention of the TOC due to punishment passes through a stage of bistability (region IVB in figure 1(b)) between the states of partial TOC and completely averted TOC.

Socially optimal outcome
One notes that the most replete states of the resource effected by reward and by punishment differ in an important respect: the fraction of defectors in the case of reward is not zero, unlike in the case of punishment. This observation has an interesting consequence that we discuss below.
We find that a high enough punishment leads to a stable rich-environment state which furnishes a socially optimal outcome for the underlying population game. This is rather easy to see when we note that the stable rich-environment state is actually n = 1. Consider the payoff matrix U(n) when the environment is replete, i.e., n = 1 (cf. equation (4)). Linear stability analysis tells us that when the punishment is high enough (p > −Δ^1_RT, to be precise), the stable fixed point (1, 1) comes into existence, denoting that in a rich environment the population is completely overwhelmed by the cooperators. In terms of the payoff elements of U(1), this means that R_1 is the largest element, because R_1 > (T_1 − p) and R_1 > P_1 > S_1 (by construction). Since R_1 is the largest element, it is the maximum average payoff that can be achieved in the population: the average payoff is simply R_1 if only cooperators are present, and any other composition of the population results in a smaller average payoff. Hence, the richest resource state, when achieved with only cooperators around, produces the socially optimal outcome for the population game in the presence of enforced punishment.
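A quick numerical check of this argument: with hypothetical replete-state payoffs and p > −Δ^1_RT = T_1 − R_1 (the values below are assumptions), the population-average payoff at n = 1 is indeed maximized by the all-cooperator composition x = 1.

```python
import numpy as np

# Hypothetical replete-state payoffs (T1 > R1 > P1 > S1).
R1, S1, T1, P1 = 3.0, 0.0, 5.0, 1.0
p = 4.0                      # harsh punishment: p > T1 - R1 = 2
Tp = T1 - p                  # punished temptation payoff

def avg_payoff(x):
    """Population-average payoff at n = 1 under punishment."""
    return R1 * x**2 + (S1 + Tp) * x * (1.0 - x) + P1 * (1.0 - x)**2

xs = np.linspace(0.0, 1.0, 1001)
best = xs[np.argmax(avg_payoff(xs))]
```

A grid search confirms that the maximizer is x = 1, where the average payoff equals R_1.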
In contrast, a high enough reward does not lead to a socially optimal outcome for the population game in the rich-environment state. Incidentally, here also, the most replete state achievable corresponds to n = 1. Linear stability analysis shows that a high enough reward (specifically, r > −Δ^1_SP) is required for the fixed point (x*_tr, 1) to become attracting. Furthermore, recall the payoff matrix U(n) for the replete environment, n = 1 (cf. equation (4)). The average payoff of the whole population, consisting of a fraction x of cooperators and a fraction (1 − x) of defectors, is

A(x) = R_1 x^2 + (S_1 + r + T_1)\, x(1-x) + P_1 (1-x)^2,

which is extremized (i.e., dA/dx = 0) at

x_e = \frac{S_1 + T_1 + r - 2P_1}{2\,(S_1 + T_1 + r - R_1 - P_1)}.

The second-order derivative of A(x),

\frac{d^2A}{dx^2} = 2\,(R_1 + P_1 - S_1 - T_1 - r),

is a negative quantity (since T_1 > R_1 and r > −Δ^1_SP together imply S_1 + T_1 + r > R_1 + P_1), implying that the extremum is a maximum. However, as already noted, the TOC-averted rich state due to reward is (x*_tr, 1) and not (x_e, 1). Hence, unlike punishment, the reward-induced TOC-averted resource state is incapable of furnishing a socially optimal outcome for the population game.
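The mismatch can be checked numerically with hypothetical payoffs (the values below are assumptions): the rewarded equilibrium fraction x*_tr = 1/(1 − δ_1r) obtained from the linear analysis differs from the maximizer x_e of the average payoff A(x), so the equilibrium average payoff falls short of the optimum.

```python
# Hypothetical replete-state payoffs (T1 > R1 > P1 > S1).
R1, S1, T1, P1 = 3.0, 0.0, 5.0, 1.0
r = 3.0                                  # reward with r > P1 - S1 = 1

def avg_payoff(x):
    """Population-average payoff A(x) at n = 1 under reward."""
    return R1 * x**2 + (S1 + r + T1) * x * (1.0 - x) + P1 * (1.0 - x)**2

delta1_rt, delta1_sp = R1 - T1, S1 - P1
delta1r = delta1_rt / (r + delta1_sp)    # = -1 for these values
x_tr = 1.0 / (1.0 - delta1r)             # rewarded equilibrium fraction
x_e = (S1 + T1 + r - 2.0 * P1) / (2.0 * (S1 + T1 + r - R1 - P1))
```

Here x*_tr = 0.5 while x_e = 0.75, and A(x_e) > A(x*_tr): the rewarded equilibrium is not socially optimal.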

Discussion and conclusion
In this paper, we have extended the framework of deterministic eco-evolutionary dynamics to explore the effects of reward and punishment on the TOC. We have employed linear stability analysis to exhaustively study the nonlinear dynamical system. We have found that the phenomenon of bistability is brought forth through reward and punishment. However, our study reveals that these two kinds of reinforcers have fundamentally distinct roles in the prevention of the TOC.
Reward and punishment are not merely the same reinforcing factor acting in opposite directions. Intuitively, in a system with a replete resource and a high level of cooperation, reward is quite a costly mechanism for sustaining cooperation, since every cooperator is supposed to be rewarded. If the system instead implements punishment, it is relatively cheaper: only the defector fraction of the population has to be punished, and recall that in the replete state defectors and cooperators coexist in the case of reward, whereas punishment can eliminate the defectors altogether. This is the source of the asymmetric impact of reward and punishment on the prevention of the TOC. Our analysis shows that, within this framework, harsh punishment, unlike generous reward, results in a socially optimal outcome when the TOC is averted completely.
The eco-evolutionary dynamics we have considered, by construction, models the mean-field behaviour [65] of a well-mixed population of individuals sharing a common resource. In reality, however, any population is usually structured. In such a case, one needs to go beyond differential-equation models of the dynamics and adopt the relevant tools of non-equilibrium statistical mechanics [76]. Of course, the most general set-up of a finite, arbitrarily structured population is analytically intractable and can only be handled case by case using agent-based simulations. Nevertheless, remarkably, in the limit of weak selection, the frequencies of strategies in a large structured population can be modelled using the replicator equation with a transformed payoff matrix, provided the interaction structure is that of a regular graph with one player at each vertex; the transformed matrix depends on the local strategy update rules [77]. Hence, the eco-evolutionary dynamics discussed in this paper can be straightforwardly adapted to at least such structured populations.
The folk theorem of evolutionary game theory (along with similar theorems for the evolutionarily stable strategy) [59,78] establishes that many characteristics of the asymptotic dynamics of the replicator equation can be gauged merely by analyzing the corresponding payoff matrix for the Nash equilibria and the evolutionarily stable strategies. These theorems have paved the way for game-theoretic strategic reasoning in population biology. Thus, when one notes that the outcome of the analyses presented in this paper can also be understood through strategic reasoning using the effective (n-dependent) payoff matrix U(n), one should understand it as a paradigm analogous to that created by the aforementioned folk theorem; the only difference now is that the replicator equation is coupled with an equation for the common resource. Consequently, on the face of it, the mathematical framework presented herein might appear to be overkill.
However, we recall that mere existence of some game theoretic equilibria does not mean they are realized in practice as well because they may be dynamically unstable. Moreover, for games with many strategies and games with discrete-time dynamics, more complex dynamics [69,71,[79][80][81] (like oscillations, invariant cycles, and chaos) beyond the simple convergent (fixed point type) solutions appear. Such complex solutions are known not to be related to any standard game-theoretic equilibrium concept (cf [82,83]). Thus, the eco-evolutionary mathematical framework becomes indispensable in such situations.
We believe that the mathematical framework explored in this paper opens up the possibility of further exciting research themes in game-resource feedback dynamics. For example, one could construct a model that takes the effect of changing population size [66] into account. Specifically, one could consider a renewable resource with intrinsic growth dynamics and study the effect of a finite growing population on the mechanism of reward and punishment. More elaborately, one could develop microscopic stochastic birth-death models for both the population and the resource, with reward and punishment realized respectively by enhanced birth and death rates. The impact of the simultaneous action [84] of reward and punishment, and the time-delayed effect [85] of players' actions on a common resource, are also worth investigating within the framework of this paper. Lastly, we hope that some experiments can be designed to put the conclusions of this paper to test.