Local and global stimuli in reinforcement learning

In efforts to resolve social dilemmas, reinforcement learning is an alternative to imitation and exploration in evolutionary game theory. While imitation and exploration rely on the performance of neighbors, in reinforcement learning individuals alter their strategies based on their own performance in the past. For example, according to the Bush–Mosteller model of reinforcement learning, an individual’s strategy choice is driven by whether the received payoff satisfies a preset aspiration or not. Stimuli also play a key role in reinforcement learning in that they can determine whether a strategy should be kept or not. Here we use the Monte Carlo method to study pattern formation and phase transitions towards cooperation in social dilemmas that are driven by reinforcement learning. We distinguish local and global players according to the source of the stimulus they experience. While global players receive their stimuli from the whole neighborhood, local players focus solely on individual performance. We show that global players play a decisive role in ensuring cooperation, while local players fail in this regard, although both types of players show properties of ‘moody cooperators’. In particular, global players evoke stronger conditional cooperation in their neighborhoods based on direct reciprocity, which is rooted in the emerging spatial patterns and stronger interfaces around cooperative clusters.

Individuals' adaptive behavior based on experience usually involves two aspects. On the one hand, an individual's future strategy follows specific action rules. For example, always cooperating, always defecting, tit-for-tat (TFT), tit-for-two-tats (TF2T), generous tit-for-tat (GTFT), win-stay-lose-shift (WSLS), grim trigger, and extortion strategies have been identified as representative action rules in repeated prisoner's dilemma games [36][37][38][39]. Experimental studies also suggest that participants' decision-making can be characterized as noisy TFT, which is the dominant strategy in a pairwise interactive environment [40]. In addition, the behavior patterns behind the demise of the commons across different cultures have also been studied [41].
On the other hand, humans and many other species are capable of complex cognition, and many cognitive skills have been considered as mechanisms for promoting the evolution of cooperation, such as learning [42], theory of mind [43], intent recognition [44,45], intelligence [46], and emotion [47,48]. Here we focus on learning ability: individuals typically adjust their future decisions through reinforcement learning. Macy and Flache [49] used the traditional Bush-Mosteller (BM) stochastic learning model [50] for binary selection, calling it the BM model of reinforcement learning. This model consists of two parts. First, a player chooses an action based on her probability of cooperation and obtains the corresponding payoff; she then calculates her stimulus, which measures whether the payoff satisfies her aspiration. Second, driven by the reinforcement learning algorithm, the player updates her tendency to cooperate based on the current action and stimulus.
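As a minimal sketch, this two-part BM step can be written in Python. The tanh form of the stimulus and all function and parameter names here are illustrative assumptions on our part, not code from the original studies:

```python
import math

def bm_update(p, action, payoff, aspiration, beta=1.0):
    """One Bush-Mosteller step: compute the stimulus, then reinforce
    or inhibit the action just taken (illustrative sketch)."""
    # Stimulus in (-1, 1): positive if the payoff satisfies the aspiration.
    s = math.tanh(beta * (payoff - aspiration))
    if action == 'C':
        # A satisfied cooperator cooperates more; a dissatisfied one less.
        p_new = p + (1 - p) * s if s >= 0 else p + p * s
    else:
        # A satisfied defector cooperates less; a dissatisfied one more.
        p_new = p - p * s if s >= 0 else p - (1 - p) * s
    # The rule keeps p in [0, 1]; clamp only against rounding error.
    return min(1.0, max(0.0, p_new))
```

For example, a cooperator who receives a payoff above an aspiration of 0.5 increases her tendency to cooperate, while an exploited cooperator (payoff 0) decreases it.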
Following Macy and Flache's study, reinforcement learning mechanisms have attracted the attention of many scholars [51]. For a fixed aspiration level, several studies have shown that BM players can cooperate with each other when the payoff satisfies the aspiration [52][53][54]. In addition to changing actions, individuals can also adjust their aspiration level, whether in the BM model or in other reinforcement learning models [55]. In short, the principle of reinforcement learning is that individuals form two cognitive mechanisms, approach and avoidance, from experiential information. Approach means that the payoff satisfies the aspiration, so the individual's probability of repeating her previous action increases. Conversely, avoidance means that the payoff falls short of the aspiration, so the probability of repeating the previous action decreases.
In the reinforcement learning mechanism, the stimulus is measured by the individual's satisfaction with the payoff and serves as the key indicator driving the individual to adjust her action probability. It is therefore essential to note that each player receives payoffs from different sources: on the one hand, individuals obtain a payoff from the interaction with each neighbor, and on the other hand, they receive a cumulative payoff over the whole neighborhood. In view of this, it is natural to assume that players' stimuli, like their payoffs, can have various sources. Here, we divide agents into global players and local players depending on the source of their stimulation. In this framework, a global player's stimulus reflects the focal player's overall satisfaction with all neighbors in the neighborhood, measured by the difference between the cumulative payoff and the total expectation. A local player's stimulus reflects the focal player's satisfaction with a specific neighbor, determined by the difference between the payoff from that neighbor and the local expectation. In this article we focus on the performance of the two types of players with different sources of stimulus under the reinforcement learning rule. Simulation results show that global players play the leading role in promoting cooperation, and their probability of cooperation in the steady state falls into two separated states, high cooperation and low cooperation, whereas the probability of cooperation of local players follows a normal distribution and assists the global players in achieving a high level of cooperation.

Methods
Players follow the reinforcement learning rule on a 100 × 100 grid with periodic boundary conditions. All players decide whether to cooperate or defect according to an intended probability and play the prisoner's dilemma game with their four neighbors. If a pair of players both cooperate, each gains the reward R = 1. If both defect, each gets the punishment P = 0. If one cooperates and the other defects, the former gets the sucker's payoff S = 0 and the latter obtains the temptation T = b (b > 1). The cumulative payoff Φ_t of the focal agent therefore reads

Φ_t = Σ_{i=1}^{4} P_{y_i,t},

where P_{y_i,t} is the payoff that the focal player gets from her neighbor y_i (figure 1). Since satisfaction with the current payoff causes fluctuations of individual emotion, the stimulus s_t is measured as a function of the difference between payoff and expectation, according to the source of stimulation.

Figure 1. (a) A global player (left) considers the combined performance of her neighbors: her stimulus s_t is generated by whether the cumulative payoff Φ_t satisfies the total expectation 4A, and she obtains the intended probability to cooperate p_{G(t+1)} following the reinforcement learning rule given by equation (4), based on the current willingness to cooperate p_{Gt}, stimulus s_t, and action a_t. (b) A local player (right) is more concerned about the performance of each neighbor: her four independent stimuli s_{y_i,t} are generated by measuring whether the payoff P_{y_i,t} from each neighbor satisfies the single expectation A, her willingness to cooperate towards each neighbor p_{y_i(t+1)} is then obtained from the reinforcement learning rule, and the local player's intended probability to cooperate p_{L(t+1)} is the average of her willingness to cooperate towards the four neighbors.
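A sketch of this payoff structure in Python. The parameters R = 1, P = S = 0 and T = b follow the text; the helper names and the dictionary representation of the lattice are our own assumptions:

```python
def pd_payoff(my_action, their_action, b=1.5):
    """Weak prisoner's dilemma: R = 1, P = S = 0, T = b > 1."""
    if my_action == 'C':
        return 1.0 if their_action == 'C' else 0.0  # R or S
    return b if their_action == 'C' else 0.0        # T or P

def neighbors(x, y, L=100):
    """Von Neumann neighborhood on an L x L lattice, periodic boundaries."""
    return [((x + 1) % L, y), ((x - 1) % L, y),
            (x, (y + 1) % L), (x, (y - 1) % L)]

def cumulative_payoff(actions, x, y, b=1.5, L=100):
    """Phi_t: the focal player's summed payoff over her four neighbors."""
    return sum(pd_payoff(actions[(x, y)], actions[n], b)
               for n in neighbors(x, y, L))
```

Here `actions` maps lattice coordinates to 'C' or 'D'; a cooperator surrounded by four cooperators would accumulate Φ = 4R = 4.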
Players in the network are divided into two categories (figure 1): global players and local players, randomly distributed on the network, with the proportion of global players denoted by u. Given an aspiration level A, global players care about the comprehensive performance of the entire neighborhood, while local players care more about the performance of each neighbor. Thus, for a global player, the stimulus s_t is measured by the difference between the cumulative payoff Φ_t and the total expectation 4A, while a local player faces four independent stimuli s_{y_i,t} (i = 1, 2, 3, 4), each measured by the difference between the payoff P_{y_i,t} and the single expectation A (figure 1). The details are as follows [51,55]:

s_t = tanh[β(Φ_t − 4A)],    s_{y_i,t} = tanh[β(P_{y_i,t} − A)],

where the aspiration level A is fixed at 0.5 and the parameter β measures the sensitivity of the stimulus to the difference between payoff and expectation.
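In code, and assuming the tanh form of the stimulus commonly used in this family of aspiration-based models (an assumption on our part; the paper's own equation may differ in detail), the two stimulus sources can be sketched as:

```python
import math

def global_stimulus(cum_payoff, A=0.5, beta=1.0):
    """Global player: one stimulus, cumulative payoff vs total expectation 4A."""
    return math.tanh(beta * (cum_payoff - 4 * A))

def local_stimuli(pair_payoffs, A=0.5, beta=1.0):
    """Local player: one stimulus per neighbor, pairwise payoff vs expectation A."""
    return [math.tanh(beta * (p - A)) for p in pair_payoffs]
```

A global player whose four interactions exactly meet the total expectation (Φ = 4A = 2) feels a zero stimulus, while a local player can feel a positive stimulus from one neighbor and a negative one from another in the same round.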
Further, players' current stimulus and action affect their intended probability to cooperate. They therefore update the tendency to cooperate p_t according to the reinforcement learning rule of the BM model, based on the current intended probability to cooperate p_t, stimulus s_t, and action a_t [51,55]:

p_{t+1} = p_t + (1 − p_t) s_t,  if a_t = C and s_t ≥ 0,
p_{t+1} = p_t + p_t s_t,        if a_t = C and s_t < 0,
p_{t+1} = p_t − p_t s_t,        if a_t = D and s_t ≥ 0,
p_{t+1} = p_t − (1 − p_t) s_t,  if a_t = D and s_t < 0.

Specifically, global players obtain the intended probability to cooperate p_{G(t+1)} at round t + 1 by following this rule with the intended probability to cooperate p_{Gt}, stimulus s_t, and action a_t at round t. Local players, however, are sensitive to the performance of each neighbor and perceive a distinct stimulus s_{y_i,t} from each of them, thereby maintaining an intended probability to cooperate towards each neighbor, p_{y_i,t} (figure 1). In this situation, they first update the tendency to cooperate towards each neighbor according to the reinforcement learning rule; their overall tendency to cooperate p_{L(t+1)} is then the average of the intended probabilities to cooperate towards the four neighbors,

p_{L(t+1)} = (1/4) Σ_{i=1}^{4} p_{y_i(t+1)}.

In the beginning, each global player is randomly assigned an intended probability to cooperate, and each local player is randomly assigned a vector of intended probabilities to cooperate, one for each neighbor. In each time step, players are selected once on average to update their intended probability to cooperate. For a full reinforcement learning run, we observed the probability of cooperation on the lattice of size L = 100 over 800 000 time steps, with the system having reached a stable state during the last 10 000 steps.
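The local player's two-stage update (per-neighbor BM update, then averaging) can be sketched as follows. The BM rule is reproduced inline so the sketch is self-contained, the tanh stimulus is the same assumption as above, and the function names are illustrative:

```python
import math

def bm_rule(p, action, s):
    """BM rule: reinforce the last action when s >= 0, inhibit it when s < 0."""
    if action == 'C':
        return p + (1 - p) * s if s >= 0 else p + p * s
    return p - p * s if s >= 0 else p - (1 - p) * s

def local_player_step(p_vec, pair_payoffs, action, A=0.5, beta=1.0):
    """One update of a local player.

    p_vec        -- four tendencies to cooperate, one per neighbor
    pair_payoffs -- the four payoffs received, one from each neighbor
    action       -- the single action the player just took ('C' or 'D')
    Returns the updated per-neighbor vector and its mean, p_L(t+1).
    """
    new_vec = [bm_rule(p, action, math.tanh(beta * (pay - A)))
               for p, pay in zip(p_vec, pair_payoffs)]
    return new_vec, sum(new_vec) / 4.0
```

A cooperating local player with two rewarding and two exploiting neighbors is pulled in opposite directions neighbor by neighbor, so her averaged intended probability can remain unchanged even as the per-neighbor tendencies diverge.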

Results
To explore the cooperative behavior of the two types of players with different sources of stimulus under the reinforcement learning rule, we focused on how the sensitivity β and the temptation to defect b affect cooperation (figure 2). The results show that the difference between payoff and aspiration generates the players' stimulus, and greater sensitivity to the stimulus increases the tendency to cooperate. However, the effect of sensitivity is limited: once β exceeds a certain threshold, in particular for β > 1, cooperative behavior no longer increases with β. In addition, cooperative behavior gradually decreases with the temptation to defect b (figure 2(a)). The responses of the two types of players to changes in b, however, are quite different. Global players' cooperation gradually decreases with b from high to low levels (figure 2(b)), whereas local players are more tenacious: the effect of changes in β and b on their cooperative behavior is minimal, with their cooperation remaining stable in the range of 0.2 to 0.3 (figure 2(c)).
The dynamic process of the players' cooperation probability and its distribution in the steady state for different temptations to defect b is shown in figure 3. The results show that the evolutionary trend of the global players' cooperation probability is determined by the temptation to defect. In contrast, local players' cooperation probabilities do not fluctuate significantly with external factors, either in the dynamic process or under changes in b (figures 3(a) and (d)). Thus, the trend in the probability of cooperation of the global players determines the trend of the group. To clearly show the distribution of the cooperation probabilities of the two types of players in the steady state, local players and global players are fixed in the upper and lower parts of the grid, respectively, and the initial cooperation probabilities of all players are assigned randomly. We confirmed that fixing players' positions in this way does not affect individual decision-making. The results show that in the steady state the cooperation probability of local players presents a disordered state, while the cooperation probability of global players clearly separates into two cooperation levels, namely high cooperation and low cooperation (figures 3(b) and (e)). For a small b value, high cooperation is dominant (figure 3(b)), while for a large b value, low cooperation is dominant (figure 3(e)). Histograms of the probability of cooperation for the two types of players confirm this separation from a quantitative perspective.

Given the large differences between global and local players, we explored how the probability of strategy shifts changes with the sensitivity β for the two types of players separately under the reinforcement learning rule (figure 4). The proportion of C → C shifts gradually increases, implying that global players' cooperative behavior is reinforced.
The decline in the proportion of D → D shifts means that the defection strategy is less likely to be repeated, while the proportions of shifts between cooperation and defection remain consistently small and equal. Thus, for small b, increasing β drives global players to gradually converge towards cooperation and avoid defection under the reinforcement learning rule (figure 4(a)). The trend in strategy shifts for local players is similar to that of global players, but with small fluctuations (figure 4(b)). Larger sensitivities β are therefore more likely to motivate cooperative behavior in global players.
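The strategy-shift proportions discussed here (C → C, C → D, D → C, D → D) can be tallied directly from two consecutive rounds of actions; a minimal sketch with illustrative names:

```python
from collections import Counter

def transition_fractions(prev_actions, curr_actions):
    """Fraction of players making each of the four strategy shifts,
    given the action sequences of two consecutive rounds."""
    counts = Counter(zip(prev_actions, curr_actions))
    n = len(prev_actions)
    return {f"{a}->{b}": counts[(a, b)] / n for a in "CD" for b in "CD"}
```

For example, over the population `prev_actions = "CCDD"` and `curr_actions = "CDCD"`, each of the four shifts occurs with frequency 0.25.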
We then investigated the influence of the proportion of global players in the network on cooperative behavior (figure 5). The results show that an appropriate mixing ratio of the two types of players enables the group to achieve high cooperation (figure 5(a)). The cooperative behavior of global players increases with u (figure 5(b)), while the opposite holds for local players (figure 5(c)). It is worth noting that although the global players' cooperative behavior is dominant, group cooperation is not maximal when the network consists entirely of global players; rather, the network reaches its highest level of cooperation when the proportion of global players is around 0.65 (figure 5(a)).
We also analyzed the pairwise interactions of strategies at steady state. The results show that when u is smaller than 0.4, the overall pairwise interactions are not significantly different from each other (figure 6(a)), while pairwise interactions starting with global players gradually increase (figure 6(b)) and those starting with local players gradually decrease (figure 6(c)). As the proportion of global players increases further, CC interactions grow rapidly (figure 6(a)), especially for global players (figure 6(b)). At the same time, surprisingly, the decreasing trend in CC interactions for local players reverses, showing a brief increase (figure 6(c)). When the proportion of global players exceeds 0.7, CC interactions gradually decline but still prevail. Accordingly, group cooperation reaches its highest level when the density of global players is around 0.7.
To observe the results of strategy evolution under the reinforcement learning rule more intuitively, snapshots of the distribution of strategies in the steady state are given for different u (figure 7). When the proportion of local players is large, cooperation strategies hardly form clusters and are distributed as scattered dots or bands (figure 7(a)). As the number of global players gradually increases, cooperative clusters form in the network, especially at u = 0.7 (figure 7(c)). However, in the case of all global players, the cooperation strategy does not form larger cooperative clusters as expected, but instead shows a maze-like distribution (figure 7(d)). We then analyzed the distribution of the basic structures of the two types of players forming the clusters (figure 8). When all players are local, the neighborhoods are dominated by the basic structures c, d, h and i, so that cooperative clusters of large size hardly exist. With the introduction of global players, the proportion of basic structures a and b increases significantly; in particular, the proportion of local players with basic structure a reaches 43.8%, which provides the necessary conditions for the network to form larger cooperative clusters. When u = 1, however, the basic structures a and b decrease significantly while c, g and h increase, resulting in a maze-like distribution of strategies. Therefore, although global players play the leading role in promoting cooperation, only in combination with local players can the group achieve a high level of cooperation.
Previous studies have shown that reinforcement learning reflects features of direct reciprocity, so we analyzed conditional cooperation and moody conditional cooperation [51,56] for different proportions of global players (figure 9). The results show that pure local players (u = 0) do not exhibit significant conditional cooperation (figures 9(a) and (g)), but do show the characteristics of moody conditional cooperation (figures 9(b), (c), (h) and (i)). The presence of global players can trigger conditional cooperation (figure 9(a)). However, the conditional cooperation patterns of the two types of players differ greatly: global players' cooperative tendency increases with the number of cooperators in their neighborhood (figure 9(d)), while local players' cooperative tendency rises significantly only when all of their neighbors are cooperators (figure 9(g)). In addition, regardless of the mixing ratio of the two types of players in the population, players show moody conditional cooperation, that is, individuals who cooperated in the previous round are more likely to cooperate again, and vice versa. The more global players in the population, the more pronounced this moody conditional cooperation becomes (figures 9(b) and (c)). These results once again confirm the leading role of global players: the trends of conditional cooperation and moody conditional cooperation of global players determine the trends of the group.
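Conditional cooperation of this kind can be estimated empirically as the frequency of cooperating given k cooperating neighbors in the previous round; a sketch with an illustrative data layout of our own choosing:

```python
from collections import defaultdict

def conditional_cooperation(records):
    """Estimate P(cooperate | k cooperating neighbors in the previous round).

    records -- iterable of (k, cooperated) pairs, where k counts the
               cooperating neighbors last round and cooperated is a bool.
    """
    coop = defaultdict(int)
    total = defaultdict(int)
    for k, cooperated in records:
        total[k] += 1
        coop[k] += cooperated  # True counts as 1, False as 0
    return {k: coop[k] / total[k] for k in sorted(total)}
```

A rising curve of these frequencies over k = 0..4 indicates conditional cooperation; splitting the records by the player's own previous action gives the moody variant.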
Finally, considering that in the repeated prisoner's dilemma game players may implement wrong decisions during the interactions [57], we examined the impact of noise (errors), ε, on the outcome of the interactions (figure 10). The actual probability to cooperate at round t + 1 is then given by p̃_{t+1} = p_{t+1}(1 − ε) + (1 − p_{t+1})ε. The results show that noise influences cooperative behavior: under high noise the tendency of conditional cooperation changes (figures 10(a) and (d)), but both types of players still present stable moody conditional cooperation (figures 10(b), (c), (e) and (f)).
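This noise rule can be sketched directly; the function name is ours:

```python
import random

def noisy_action(p_intended, eps=0.05, rng=random):
    """Sample an action after mixing implementation noise eps into the
    intended probability: p~ = p(1 - eps) + (1 - p)eps."""
    p_actual = p_intended * (1 - eps) + (1 - p_intended) * eps
    return 'C' if rng.random() < p_actual else 'D'
```

With ε = 0 the intended probability is followed exactly; with ε > 0 even a fully committed cooperator (p = 1) occasionally defects by mistake.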

Conclusion
Under the BM model of reinforcement learning, we classified players into two types, global players and local players, depending on the source of the stimulus they perceive. How players with different sources of stimulus influence individual cooperative behavior was investigated by varying the mixing ratio of the two types of players in the group. The research shows that global players play a dominant role in facilitating cooperation, and their behavior largely determines the trend of the entire group. This does not mean, however, that the network achieves a high level of cooperation when all players are global; rather, the network reaches high cooperation when a low density of local players remains in the population. This is due to the significant differences in the reciprocity patterns of the two types of players.
Looking at the group as a whole, our results reconfirm previous research [51,56] showing that conditional cooperation and moody conditional cooperation capture behavior patterns that individuals generally follow in repeated dilemma games. Most importantly, we also find differences in direct reciprocity among individuals with different sources of stimulus. The probability of cooperation of global players increases with the number of cooperators in the neighborhood, exhibiting conditional cooperation. In contrast, no conditional cooperation is observed when the population consists entirely of local players. The introduction of global players can stimulate conditional cooperation, and a neighborhood in which all neighbors are cooperators significantly increases the probability of cooperation of local players. All individuals show the characteristics of moody conditional cooperation, even in the presence of noise. In particular, global players are sensitive to the number of cooperators among their neighbors in the previous round, while local players appear more cautious or strict, as their probability of cooperation increases significantly only when more than half of their neighbors are cooperators.
In addition, the distributions of the individual cooperation probabilities of the two types of players in the steady state show large differences, with the global player's cooperation probability showing a two-level distribution (close to 0 or close to 1), while the local player's cooperation probability approximately follows a normal distribution.
Our model is a simple variant of the BM model of reinforcement learning, yet it yields interesting results and provides new insights into cooperative behavior in repeated dilemma games.
Adaptive strategy adjustment is considered a highly cognitive and complex behavior, and the high costs involved in implementing learning strategies are easy to appreciate. Cognitive costs have been analyzed in studies of trust-based strategies [58], intention recognition [59], evolutionary cycles in finite populations [60], and finite automata [61], among others, in the context of the evolution of cooperation in repeated prisoner's dilemma games. In future work we will focus on the impact of cognitive costs on the evolution of cooperation within the reinforcement learning framework.