Quantal response equilibrium for the Prisoner’s Dilemma game in Markov strategies

Within the studies of human cooperation, there are gaps that require further investigation. One possible area for growth is developing theoretical concepts that describe high levels of cooperation. In this paper, we present a symmetrical quantal response equilibrium (QRE) in the Prisoner’s Dilemma (PD) game constructed in Markov strategies (tolerance to defection and mutual cooperation). To verify the adequacy of the resulting equilibrium, we compare it with the previously found Nash equilibrium in PD in Markov strategies: the QRE converges to the Nash equilibrium, as the theory predicts. Next, we investigate the properties of QRE in PD in Markov strategies by testing it against experimental data. For low levels of rationality, the equilibrium found manages to describe high cooperation. We derive the levels of rationality under which the QRE and Nash equilibrium curves intersect. Lastly, our experimental data suggest that QRE serves as a dividing line between behavior with low and high cooperation.

Previous studies have demonstrated that social interaction significantly increases the level of cooperation in iterated Prisoner's Dilemma (PD) games (the most commonly known example of social dilemma), from a 20% cooperation rate prior to socialization to 53% after socialization 9,29-31 . We require a specific approach to model such a high level of cooperative strategy choice. Menshikov et al. 30 proposed to consider PD in Markov strategies. A symmetric totally mixed Nash equilibrium was found for this game. However, this equilibrium better fits strategies prior to socialization than after. Therefore, we developed a new model that can describe high-cooperation strategies.

Model
Prisoner's Dilemma game (PD). This work is based on the broadly known PD game. In this game, two participants choose between two strategies: Left or Right for the first participant, and High or Low for the second. The choices are simultaneous and independent of each other. Payoffs are given by the payoff matrix in Table 1, which was employed in the model and experiments. The Left and High strategies represent Cooperation; Right and Low represent Defection.
Nash equilibrium for PD. PD has one Nash equilibrium: it is a mutual choice of Defection strategy that gives a payoff of 1 for two players. However, laboratory experiments show that people under some conditions avoid Nash equilibrium 9,30 . For example, individuals under social framing may more frequently choose the Cooperation strategy, behavior that could be considered irrational. For this reason, it would be interesting to discern a theoretical concept underlying this specific behavior.
Nash equilibrium for PD in Markov strategies. Several researchers 7,9,32,33 argue that for some subjects, social context led to an increase in cooperative choices of up to 100%. Therefore, the behavior under social context is far from a Nash equilibrium. One way to describe this cooperative behavior is to consider the PD game in Markov strategies.
Consider two participants $i \in \{1, 2\}$. Let us denote the probability of cooperation in round $t$ for the first participant as $p_c^1(t)$ (and for the second as $p_c^2(t)$). We describe participants' behavior by means of the following two quantities: (1) γ, mutual cooperation (the probability of a cooperative choice in response to the opponent's cooperative choice in the previous round); (2) α, tolerance to defection (the probability of a cooperative choice in response to the opponent's defection in the previous round). We assume that the decision in round $t$ depends only on the results in round $t - 1$. Thus, given γ and α, the choices made in round $t - 1$ completely determine the participants' (probabilistic) behavior in round $t$. This model will be referred to as PD in Markov strategies 30,34 . For brevity, we will refer to γ and α as Markov strategies. The dynamics of participants' actions can be presented as follows:

$$ p_c^1(t) = \gamma_1\, p_c^2(t-1) + \alpha_1 \big(1 - p_c^2(t-1)\big), \qquad p_c^2(t) = \gamma_2\, p_c^1(t-1) + \alpha_2 \big(1 - p_c^1(t-1)\big). \tag{1} $$

In a stationary state, we have:

$$ p_c^1 = \gamma_1\, p_c^2 + \alpha_1 \big(1 - p_c^2\big), \qquad p_c^2 = \gamma_2\, p_c^1 + \alpha_2 \big(1 - p_c^1\big), \tag{2} $$

where $p_c^1$ and $p_c^2$ are stationary probabilities of cooperation. According to ref. 30, the payoff function for participant 1 takes the form (3). Menshikov et al. 30 found a symmetric (whereby $\gamma_1 = \gamma_2 = \gamma$ and $\alpha_1 = \alpha_2 = \alpha$) totally mixed Nash equilibrium for PD in Markov strategies in explicit form. This equilibrium can be represented as the points (α, γ) that satisfy Eq. (4) and are located in the unit square (see Fig. 1). We will refer to this equilibrium as a Nash equilibrium and to this curve as a Nash equilibrium curve.

It is evident from Fig. 1 that curve (4) exists in the area characterized by relatively small values of α (the tolerance to defection does not exceed 0.3). However, experimental results (see "Experimental results" section) reveal that tolerance to defection could exceed 0.5. Also, recent studies indicate that tolerance plays a highly important role in promoting cooperation within a population 35-37 .
To resolve this discrepancy, we derive a quantal response equilibrium for PD in Markov strategies in the next section.
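As a minimal sketch of the Markov-strategy dynamics described above, the following Python snippet writes the update rule out directly from the verbal definitions of γ and α (cooperate with probability γ after the opponent's cooperation and with probability α after the opponent's defection) and solves the resulting stationary linear system. The function names `stationary_cooperation` and `iterate_dynamics` are hypothetical and are not taken from ref. 30.

```python
def stationary_cooperation(alpha1, gamma1, alpha2, gamma2):
    """Stationary cooperation probabilities (p1, p2) of two Markov strategies.

    Solves the linear system (an assumption written out from the definitions):
        p1 = gamma1 * p2 + alpha1 * (1 - p2)
        p2 = gamma2 * p1 + alpha2 * (1 - p1)
    """
    d1 = gamma1 - alpha1
    d2 = gamma2 - alpha2
    denom = 1.0 - d1 * d2
    if abs(denom) < 1e-12:
        # Happens only in degenerate pure-strategy corners, e.g. both (0, 1).
        raise ValueError("stationary state is not unique for these strategies")
    p1 = (alpha1 + d1 * alpha2) / denom
    p2 = (alpha2 + d2 * alpha1) / denom
    return p1, p2


def iterate_dynamics(alpha1, gamma1, alpha2, gamma2, p1=0.5, p2=0.5, rounds=200):
    """Iterate the round-to-round update from arbitrary starting probabilities."""
    for _ in range(rounds):
        # Both updates use the previous round's values (simultaneous update).
        p1, p2 = (gamma1 * p2 + alpha1 * (1 - p2),
                  gamma2 * p1 + alpha2 * (1 - p1))
    return p1, p2
```

For the symmetric strategy α = 0.2, γ = 0.5, the stationary condition p = α + (γ − α)p gives p = 2/7 for both participants, and the iterated dynamics approach the same value.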
Quantal response equilibrium (QRE). The QRE model was established to explain participants' observed behavior in laboratory experiments that differs significantly from the Nash equilibrium 23 . QRE is an internally consistent equilibrium model in the sense that the quantal response functions are based on the equilibrium probability distribution of the opponents' strategy choices, not simply on arbitrary beliefs that players may have about these probabilities. One feature of this model is that it allows the possibility of player error. QRE requires that expectations match the equilibrium choice probabilities. However, in contrast to the classical Nash equilibrium, the definition of QRE assumes that participants strive for the best response only in the probabilistic sense: the better the response, the more likely the participant will choose it 38,39 . A comparison of the QRE with experimental data indicated that this approach provided a better fit than the Nash equilibrium 40 . In practice, the QRE is usually computed with the logistic (logit) specification. The response $s_i$ to the mixed strategy $s_{-i}$ of the remaining players (the probability of choosing strategy $s_i$) is expressed through the following formula:

$$ P(s_i) = \frac{\exp\big(\lambda\, U_i(s_i, s_{-i})\big)}{\sum_{s'_i} \exp\big(\lambda\, U_i(s'_i, s_{-i})\big)}, \tag{5} $$

where λ is the parameter of a participant's rationality, and $U_i(s_i, s_{-i})$ is the expected gain of participant $i$ when the strategies $s_{-i}$ of the other players and the strategy $s_i$ of participant $i$ are given. Therefore, when λ → 0 (low rationality), participants choose strategies randomly. When λ → ∞ (high rationality), participants choose strategies with no errors (with the highest expected payoff).
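The two limits of the logit response can be illustrated with a short Python sketch. The function name `logit_response` and the example payoffs are hypothetical; the snippet only shows how choice probabilities proportional to exp(λU) behave as the rationality parameter λ varies.

```python
import math


def logit_response(payoffs, lam):
    """Return choice probabilities proportional to exp(lam * payoff).

    payoffs: expected payoffs of each available strategy
    lam:     rationality parameter (lam >= 0)
    """
    # Subtracting the maximum payoff keeps the exponentials numerically stable
    # and does not change the resulting probabilities.
    top = max(payoffs)
    weights = [math.exp(lam * (u - top)) for u in payoffs]
    total = sum(weights)
    return [w / total for w in weights]


# lam = 0: every strategy is equally likely, whatever the payoffs are.
uniform = logit_response([1.0, 3.0], 0.0)

# Large lam: almost all probability mass goes to the higher-payoff strategy.
sharp = logit_response([1.0, 3.0], 50.0)
```

Intermediate values of λ interpolate between these two regimes: strategies with a higher expected payoff are chosen more often, but not always.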
Ref. 30 demonstrated that the concept of QRE for a simple PD game works only when the probabilities of cooperative choices do not exceed 50%. Here we present a QRE for PD in Markov strategies. Let $\{\alpha_1, \gamma_1\}$ be the Markov strategies of player 1, $\{\alpha_2, \gamma_2\}$ the Markov strategies of player 2, and $\{\lambda_1, \lambda_2\}$ the players' rationalities.
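From system (2), we obtained the following expressions for the stationary-state probabilities of cooperation:

$$ p_c^1 = \frac{\alpha_1 + (\gamma_1 - \alpha_1)\,\alpha_2}{1 - (\gamma_1 - \alpha_1)(\gamma_2 - \alpha_2)}, \qquad p_c^2 = \frac{\alpha_2 + (\gamma_2 - \alpha_2)\,\alpha_1}{1 - (\gamma_1 - \alpha_1)(\gamma_2 - \alpha_2)}. \tag{7} $$

Then, we established the profiles of pure strategies for both players when the others' pure strategies were fixed. To do this, we fixed the profile of strategies $\alpha_2, \gamma_2, \gamma_1$ and calculated the probabilities of cooperative choice for the pure strategies $\alpha_1 = 0$ and $\alpha_1 = 1$.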
The pure strategies $\gamma_1 = 0$ and $\gamma_1 = 1$ are treated in the same way. In the symmetrical case, we have $\alpha_1 = \alpha_2 = \alpha$, $\gamma_1 = \gamma_2 = \gamma$, and $\lambda_1 = \lambda_2 = \lambda$, and the probabilities of cooperative choice take the forms (8)-(11). Note that expressions like $(\ldots)|_{\alpha=0}$ in formulas (8)-(11) are formal notation rather than traditional substitution. From this perspective, it is not problematic that the expression for $p_c^1|_{\alpha=0}$ includes α. Additionally, despite the symmetrical assumption, formulas (8)-(11) are naturally asymmetric, as the expressions for $p_c^1$ and $p_c^2$ are different. If the symmetrical assumption were made before substituting pure strategies into the resulting expressions, a significantly poorer set of (symmetric) stationary-state cooperation probabilities would result. Namely, the symmetrical assumption would transform (7) into a single expression for the common stationary probability (denoted by the capital letter $P_c$ for clarity), from which the expressions (12) follow. Evidently, the expressions (12) arise only as extreme cases of (8)-(11). For example, $P_c|_{\alpha=0} = p_c^1|_{\alpha=0} = p_c^2|_{\alpha=0}$ if we set α = 0 in (8). Next, we established the payoff functions (13)-(16) for specific strategies by substituting the expressions (8)-(11) for the probabilities into (3).
Finally, we substituted the expressions (13)-(16) for the payoff functions into system (6), which gave system (17), where α ∈ [0; 1] and γ ∈ [0; 1] are unknown, λ ∈ [0; +∞) is fixed, and U is the payoff function. Note that different values of the rationality λ could lead to different profiles of strategies. We propose to solve system (17) numerically, reducing it to finding (as far as feasible) the optimal solution of optimization problem (18) with objective function (19). To solve optimization problem (18), we used the minimize function from the Python package scipy.optimize 41 . In Fig. 2, we illustrate the contour lines of objective function (19) for different values of rationality, together with the solutions of optimization problem (18) for the same value of rationality (yellow points) and the Nash equilibrium curve. We noticed that for small values of λ (λ ≲ 4), objective function (19) has a unique local minimum, and the optimization method found the solution within the area of this minimum. However, for λ ≳ 4, objective function (19) contained more than one local minimum, and most of these local minima were located along the Nash equilibrium curve. For λ ∈ (∼4, ∼7), the optimization method found the global minimum inside one of the local minima. For λ ≳ 7, the results of the optimization method started to shift toward the standard Nash equilibrium α = 0, γ = 0 (defection/defection) and finally reached this point, which lies outside the Nash equilibrium curve, in a local minimum of objective function (19). This result corresponds to the theory, as the Nash equilibrium describes the behavior of fully rational participants 23 . Therefore, we can conclude that the chosen optimization method worked adequately.
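The reduction of a logit fixed-point system to a minimization problem can be sketched as follows. This is an illustration under loudly stated assumptions, not the authors' implementation: the payoffs R, S and T are hypothetical (only the mutual-defection payoff of 1 comes from the text, since Table 1 is not reproduced here), the payoff function is assumed to be the expected one-round payoff in the stationary state, the squared-residual objective is an assumed form standing in for (19), and a coarse grid search is swapped in for scipy.optimize.minimize to keep the snippet dependency-free.

```python
import math

# Hypothetical PD payoffs; only P = 1 (mutual defection) is taken from the text.
R, S, T, P = 5.0, 0.0, 10.0, 1.0


def stationary(a1, g1, a2, g2):
    """Stationary cooperation probabilities for Markov strategies (assumed model)."""
    d1, d2 = g1 - a1, g2 - a2
    denom = 1.0 - d1 * d2
    return (a1 + d1 * a2) / denom, (a2 + d2 * a1) / denom


def payoff1(a1, g1, a2, g2):
    """Assumed payoff of player 1: expected one-round payoff in the stationary state."""
    p1, p2 = stationary(a1, g1, a2, g2)
    return (R * p1 * p2 + S * p1 * (1 - p2)
            + T * (1 - p1) * p2 + P * (1 - p1) * (1 - p2))


def logit_prob(u_one, u_zero, lam):
    """Logit probability of the pure choice worth u_one against the one worth u_zero."""
    return 1.0 / (1.0 + math.exp(-lam * (u_one - u_zero)))


def residual(alpha, gamma, lam):
    """Squared distance between (alpha, gamma) and its logit response.

    Player 1 deviates to a pure alpha or gamma while the opponent keeps
    the symmetric strategy (alpha, gamma).
    """
    alpha_resp = logit_prob(payoff1(1.0, gamma, alpha, gamma),
                            payoff1(0.0, gamma, alpha, gamma), lam)
    gamma_resp = logit_prob(payoff1(alpha, 1.0, alpha, gamma),
                            payoff1(alpha, 0.0, alpha, gamma), lam)
    return (alpha - alpha_resp) ** 2 + (gamma - gamma_resp) ** 2


def grid_search(lam, steps=20):
    """Coarse search over the interior of the unit square (stand-in for a solver)."""
    grid = [i / steps for i in range(1, steps)]
    return min(((a, g) for a in grid for g in grid),
               key=lambda ag: residual(ag[0], ag[1], lam))
```

With λ = 0, the logit response equals 0.5 regardless of the payoffs, so the residual vanishes at α = γ = 0.5 and the grid search returns that point. For larger λ, a residual of this kind can have several local minima, which is why a local solver may need multiple starting points.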
In Fig. 3, we plot the arrangement of the obtained QRE for PD in Markov strategies (the solution of (18)) and the Nash equilibrium curve. The QRE for PD in Markov strategies forms a nearly smooth curve in the range of small λ (approximately less than 5). For these values of rationality, objective function (19) consistently showed a unique global minimum that was reliably found by the solver. In the middle range of λ (approximately the interval [5, 7.08]), the solution of optimization problem (18) approaches the Nash equilibrium curve, which agrees with the theory 23 . Nonetheless, at these levels of rationality, the QRE for PD in Markov strategies curve (further, QRE curve) loses its smoothness and solutions "leapfrog" along the Nash equilibrium curve (see the blue triangles in Fig. 3). This could be the result of a weakness of the optimization method (in Fig. 2, we demonstrate that when λ ≳ 5 objective function (19) has many local minima). However, the exact "first" intersection between the QRE curve and the Nash equilibrium curve (which is located near α ≈ 0.2, γ ≈ 0.5 and is obtained for λ ≈ 5) fits the experimental data best when compared to other intersections (see "Experimental results" section). For large values of the rationality (λ > 7.08), solutions of (18) converge to the point α = 0, γ = 0, which does not belong to the Nash equilibrium curve. Instead, it marks the strategy profile of the standard Nash equilibrium (defection/defection).

Experimental results
In this section, we evaluate the equilibrium found against the data from laboratory experiments which were presented in several publications 9,29,30 . The primary goal of these experiments was to identify the effect of socialization on the level of cooperation choice in the PD game.
The full description of the experiments (N = 14) can be found in Supplementary 1. The following is a schematic representation of the experimental design:
1. The sample for one experiment included 12 recruited participants (all strangers).
2. Participants played iterated PD (Table 1) in a mixed-gender group of 12 participants for 11-22 rounds.

The study procedures involving human participants were approved by the Skolkovo Institute of Science and Technology (Skoltech) Human Subjects Committee. Written informed consents were obtained from participants. All methods were performed in accordance with the relevant guidelines and regulations.
In Supplementary Table S1 (Supplementary 2), we present aggregated results of the 14 experiments (168 participants). We find that the rate of cooperative choices is higher after socialization (58%) than before (22%). We assume that socialization compensates for the irrationality of these choices. This implies that, despite the expectation that the payoff of defection is higher than that of cooperation, the utility of sociality is higher than the probable losses from choosing cooperation. To compare the theoretical results with the experimental data, we computed the probabilities of mutual cooperation (γ) and tolerance to defection (α) for every part of the experiments. Specifically, we analyzed strategies aggregated across participants, which represent the most popular strategies in the observed part of the experiment. Due to the experimental design, we aggregated strategies over all 12 participants for every experiment before socialization, and over each of the two socialized groups of six participants for every experiment after socialization (see Supplementary Table S1, Supplementary 2).
We first analyzed how experimental points corresponded to values of objective function (19) under different levels of rationality (see Fig. 4). We observed that most participants' strategies could be approximated by the minima of the objective function after selecting the appropriate level of rationality. More precisely, we recognized that the behavior of individuals with a high level of cooperation (more than 50%) could be modeled by selecting low rationality rates (which was one of our objectives), whereas low-cooperative participants were well approximated by high values of rationality. Unfortunately, participants located in the upper right zone of the phase plane remain unexplained. Notably, among all the Nash equilibria, the one that best fits the experimental data was found at the initial intersection between the QRE curve and the Nash equilibrium curve (at the rationality level λ ≈ 5). The QRE for PD in Markov strategies for λ ≲ 5 divided most of the strategies before and after socialization (see Fig. 5), resembling a phase boundary.

Discussion and conclusion
In an era that highlights the importance of every individual's choice while promoting living for oneself, it is crucial to consider cooperation as an effective mechanism to promote the overall well-being of society. In this paper, we present a theoretical concept that works well with the high cooperation levels previously obtained in laboratory experiments 9,30 examining the influence of socialization on strategy choices in the PD game. Specifically, we present a symmetrical QRE for PD in Markov strategies (further, QRE-PD-M) and compare it with the symmetrical Nash equilibrium for PD in Markov strategies 30 and experimental results 9 . Under the PD game in Markov strategies, we specify the following: a) instead of pure PD strategies (cooperation or defection), we employ mutual cooperation (the probability of a cooperative choice in response to an opponent's cooperative choice in the previous round) and tolerance to defection (the probability of a cooperative choice in response to an opponent's defection in the previous round); and b) the choice of strategy in the current period depends solely on the strategies from the previous period. We found that the QRE-PD-M with low values of rationality describes high cooperation, while the QRE-PD-M with high values of rationality approximates low-cooperative results before socialization. We also found how the QRE-PD-M complements the Nash equilibrium. The intersection between the equilibrium curves at a low value of the rationality parameter (λ ≈ 5) gives a unique selection of Nash equilibrium, the one that fits the experimental data most closely (compared to other Nash equilibria for PD in Markov strategies). Additionally, the QRE-PD-M curve (for the parameter of rationality λ ≲ 5) serves somewhat as a phase boundary for the experimental data before and after socialization; most of the points before socialization lie below the QRE-PD-M curve, while most of the points after socialization lie above it. However, this result may be coincidental. Our study is not without limitations.
They are as follows:
1. We found and described only the symmetrical case of the QRE-PD-M. Symmetry allows us to avoid some mathematical difficulties. However, by doing so, we likely lose some equilibria.
2. The optimization problem was solved using numerical methods that come with possible noise in the solutions.
3. We compared the theoretical results against experiments that involved only one socialization strategy. Further, these experiments were performed within the same socio-demographic and cultural context.
In this paper, our initial motivation was to explain the high levels of cooperation observed empirically via the concept of QRE. In a nutshell, we found that the QRE-PD-M curve fits the data well. On the one hand, this means that we achieved our goal. On the other hand, one could argue that our approach is rather meaningless because QRE draws upon the idea that individuals may "make errors," and thus our explanation of high cooperation via the QRE approach could imply that cooperation is simply "an error".
To resolve this problem, we put forward the following hypothesis (which should be carefully tested in future studies), whereby we attempt to bring together our findings and the modern notion of what constitutes rational behavior. First, we propose that the segment of the QRE-PD-M curve that is characterized by small values of λ (λ ≲ 5; hereafter, the small-rationality segment) serves as a natural indicator of a participant's state. This state could be individual (leading to selfish behavior) or social (in the social state, individuals should act unselfishly). Further, we suggest that the difference between these two states lies in an individual's utility function. According to the moral preferences hypothesis, the utility function includes a "moral component" that describes our "internal standards about what is right or wrong in a given situation" 22 . We suggest that in the individual state, the moral component is not well pronounced and does not compensate for the risk of cooperative behavior according to our internal moral standards (e.g., behavior in a group of unfamiliar people). Thus, behavior that is rational from the classical point of view should occur (the corresponding strategies are located beneath the small-rationality segment). In the social state, this component contributes more strongly to the utility function because of our need to belong to the social group 42 (and to behave in a more empathetic and sensitive manner towards the group members). Then, people in the social state prefer to cooperate rather than defect (strategies above the small-rationality segment).
In other words, if individuals' strategies fall beneath the small-rationality segment, then we conclude that the people are in the individual state. If strategies are above the small-rationality segment, then we observe the social state. If the strategies are located near the small-rationality segment, then we hypothesize that the corresponding individuals are in "an intermediate regime." In our empirical context, the changes in participant behavior were caused by the socialization procedure that made the inter-group relations more solid and thus increased the effect of the moral component. As a result, participants' strategies tended to move from the individual to the social state. However, questions remain: (1) why does the small-rationality segment play this "dividing" role and (2) how should we interpret strategies that are located near the small-rationality segment? We propose the following answer: one could think about the small-rationality segment as a zone where individuals experience a phase transition between the individual and social states. In this intermediate regime, the participants' behavior becomes rather "unpredictable" (people face considerable uncertainties about choosing between selfish and unselfish behavior) and thus may be described meaningfully via the concept of QRE.

Figure 5. The dashed line represents the Nash equilibrium. The pink circles represent the QRE curve. The orange triangles signify participants' strategies before socialization, while the violet circles signify participants' strategies after socialization. The QRE for PD in Markov strategies points for λ ≲ 5 serve as a natural border between points before and after socialization.
Depending on individuals' personal characteristics, the socio-demographic and cultural contexts, and, importantly, the effectiveness of socialization, individuals may find themselves in individual, intermediate, or social states both before and after socialization. Figure 5 may be considered as evidence to support this statement.
A promising avenue for future studies would be to test our hypothesis. We believe that it could be done in two directions: (1) theoretical, by deriving corresponding models that consider a utility function from which the small-rationality segment and the corresponding unpredictable regime should come as a special case that marks the phase transition between individual and social states; and (2) empirical, by conducting more experiments involving different socio-demographic and cultural contexts as well as various socialization strategies. Apart from socialization 9,43 , the phase shift could be triggered by nudging personal norms, turning on morality, and other methods presented by Capraro and Perc 22 . These experiments may reveal whether the small-rationality segment serves as a boundary between the individual and social states.
If the theory is found to be true, then this segment can be considered as a universal (and, importantly, theoretically motivated) indicator of whether the social relationships in a group of individuals are solid enough. This question is extremely important because solid relationships ensure that the group members are open to cooperative actions and the group can solve complex tasks that require high levels of collaboration. From this perspective, our results could be implemented in team-building activities. For example, one could use the PD game to analyze how participants act before and after implementing team-building tasks. If the tasks were implemented successfully, then it is reasonable to expect that the participants chose strategies above the small-rationality segment afterwards. Further, this methodology could also be used to compare different team-building strategies.