REINFORCEMENT LEARNING IN A PRISONER’S DILEMMA

I fully characterize the outcomes of a wide class of Q-value based model-free reinforcement learning algorithms, such as Q-learning, in a prisoner’s dilemma. Learning is shown to always converge to one of two states. Whether the players learn to cooperate or defect can be determined in closed form from the relationship between the learning rate and the payoffs of the game. The results generalize to asymmetric learners and many experimentation rules.


Introduction
While wide classes of learning rules have been studied in relation to the prisoner's dilemma, reinforcement learning algorithms, such as Q-learning, are rarely considered outside of simulation studies due to their large state complexity. In this paper I fill this gap by offering a complete closed-form characterization that removes the need for simulations.
This paper studies all algorithms in a class that generalizes the so-called Q-learning algorithm. A Q-learning agent maintains a vector of Q-values that encode her expected payoff from taking the corresponding action. She then usually takes an action with the highest Q-value, but sometimes experiments with other actions according to some predetermined rule.
Part of the attractiveness of reinforcement learning as a model for behavior lies with the minimal assumptions imposed by such algorithms on the players' understanding of the game. In the economics literature the dynamics of such learning processes are often called "completely uncoupled" (Hart and Mas-Colell, 2003; Foster and Young, 2006; Nax, 2019) or asynchronous (Asker et al., 2021), as the players themselves use only their prior experience to play, having no knowledge of the game structure. It is therefore not surprising that the predictions derived from our model are in stark contrast with the predictions of other adaptive dynamics, such as best-response, or conventional analysis of repeated games through subgame-perfect equilibria and folk theorems. I further outline these differences in the conclusion, having the expression for the relevant payoff information at hand.
Unlike some of the similar studies of learning in a prisoner's dilemma (Mengel, 2014; Calvano et al., 2020), I do not consider "memory", i.e. actions cannot be conditioned on past play. This is intentional: the Q-learning algorithm, while being a very simple technique, proves capable of maintaining enough information in the Q-values to guarantee convergence to non-Nash outcomes even without relying on conditional strategies. In fact, this exact property has gained attention for the algorithm as a proof-of-concept technique for showing the possibility of algorithmic collusion in deceptively benign environments where neither the algorithm nor its designers observe anything beyond their own payoffs (Calvano et al., 2020; Klein, 2021).
The most closely related studies (Waltman and Kaymak, 2007, 2008) pursue the same goal as this paper and have partially characterized the convergence in the prisoner's dilemma game for high learning rates. In particular, when one experimentation step is enough for a switch from a noncooperative state to a cooperative state and vice versa, the analysis can be simplified by considering only the minimum cost paths.
In many applications, however (e.g., Calvano et al., 2020), the learning rate may be expected to be low to ensure enough experimentation over a short period and full traversal of the state space. Nor is it clear whether the high learning rate assumption would be restrictive for human subjects.
Unlike Waltman and Kaymak (2007, 2008), I follow the evolutionary game theory approach of characterizing stochastically stable sets through spanning trees (Young, 1993). This approach has been applied to the prisoner's dilemma, in particular to characterize learning rules based on sampling from past history (Mengel, 2014). The Q-learning algorithm instead maintains an "expectation" of the payoff from choosing actions in the state of the learning process.
Another set of closely related learning rules is adaptive dynamics (Milgrom and Roberts, 1990), which can be shown to always converge to Nash equilibria in supermodular games and thus also in a prisoner's dilemma. However, Q-learning is not in this class and will be shown to converge to non-equilibrium actions for some combinations of parameters and payoffs.
Finally, the proof uses techniques introduced by Newton and Sawa (2015) for learning in matching games. They show that in the class of games that they consider a minimum cost path always exists from any state to a state that is most robust to one-shot deviations. In the prisoner's dilemma this is not always the case, so the results do not apply directly. However, the same idea can be used to construct a path to a certain "central" state, which narrows down the possible minimal spanning trees and ultimately leads to a characterization.
The rest of the paper is organized as follows. I begin by introducing the game and learning rules in the next section, then I characterize the stable (absorbing) sets of states of the unperturbed process without experimentation, refine them to stochastically stable states of the process with experimentation, and finally apply the results to the prisoner's dilemma game under two common learning rules. I conclude by discussing possible extensions and comparing the results to other learning rules.

Preliminaries
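Let $\pi(a, x)$ denote the payoff of playing $a$ when the opponent plays $x$. The two possible actions of each player comprise the set $A = \{C, N\}$. The four possible payoffs are written $\pi_{CC}$, $\pi_{CN}$, $\pi_{NC}$, $\pi_{NN}$, where $\pi_{ax} = \pi(a, x)$; they are distinct and ordered as $\pi_{NC} > \pi_{CC} > \pi_{NN} > \pi_{CN}$. Under the last condition, the game is a prisoner's dilemma: $C$ stands for the cooperative action and $N$ for the non-cooperative, or Nash, action.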
Every state $g$ can be identified with a pair of Q-vectors, $g = (Q_1, Q_2)$, where $Q_i = (Q_i^C, Q_i^N)$ for $i \in \{1, 2\}$. The set of all possible states, i.e. pairs of valid Q-vectors, will be denoted $G$.
In order to stay true to practical implementations of reinforcement learning and to avoid unnecessary continuity arguments while staying formal, I assume $G$ to be a fine grid with $\epsilon > 0$ between consecutive Q-values, i.e. $G = \{(Q_1, Q_2) : Q_i^a \in D \text{ for all } a \in A,\ i \in \{1, 2\}\}$, where $D = \{q : q = \epsilon z \text{ for some } z \in Z\}$, $Z$ denoting some compact subset of $\mathbb{Z}$. Naturally $\pi_{CC}, \pi_{CN}, \pi_{NC}, \pi_{NN} \in D$. Whenever the Q-value does not conform to this grid, it is rounded to the closest grid point (this will be formalized below). This specification represents machine precision: a computer running a reinforcement learning algorithm would eventually reach the limit for the machine representation of a decimal number. All paths through the state space in our proofs are therefore finite, but this formulation will require additional steps to capture behavior at the boundary.
Cast in terms of a stochastic Markov process, it means that players always choose the action with the higher Q, obtain $\pi_{t,i}$, and then each update the Q-value of the action $a$ just played as follows:
$$|Q^a_{t+1,i} - \pi_{t,i}| < |Q^a_{t,i} - \pi_{t,i}| \quad \text{whenever } Q^a_{t,i} \neq \pi_{t,i}. \tag{1}$$
In other words, it does not matter how the Q-values are updated, as long as they get strictly closer to the obtained payoff, i.e. the player updates her expectation towards the realized payoff in full or in part. If the player continues to obtain the same payoff $\pi$, I assume that she approaches this value in the limit of some convergent sequence, i.e. $\lim_{t \to \infty} Q^a_t = \pi$.
The process that is usually called Q-learning is a particular kind of such updating rule, in which the speed of updating is captured by a single parameter $\alpha$ that is independent of the current Q-value:
$$Q^a_{t+1,i} = (1 - \alpha_i) Q^a_{t,i} + \alpha_i \pi_{t,i}, \tag{2}$$
where $1 \geq \alpha_i > 0$ is the learning parameter for player $i$. Note that this parameter can be different between players, and that the process is mapped to a finite grid by taking the closest value in $D$ to $(1 - \alpha_i) Q^a_{t,i} + \alpha_i \pi_{t,i}$. The main results are for the general setup in (1), and I will later discuss (2) as an illustration.
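As an illustration of rule (2), the following minimal Python sketch applies one update and snaps the result to the grid. The function name `q_update`, the use of exact fractions, the numerical values, and the handling of a value that lies exactly halfway between two grid points are assumptions made for this example only.

```python
from fractions import Fraction


def q_update(q, payoff, alpha, eps):
    """Apply rule (2) to the Q-value of the action just played, then snap to the grid.

    All arguments are Fractions, so the arithmetic is exact.
    """
    target = (1 - alpha) * q + alpha * payoff
    # round() on a Fraction returns the nearest integer (exact halves go to the
    # even neighbour); this stands in for "the closest value in D" and the
    # tie-breaking convention is an assumption of this sketch.
    return round(target / eps) * eps


# Hypothetical numbers: payoff 3, learning rate 1/2, grid step 1/100.
first_step = q_update(Fraction(0), Fraction(3), Fraction(1, 2), Fraction(1, 100))

q = Fraction(0)
for _ in range(20):
    q = q_update(q, Fraction(3), Fraction(1, 2), Fraction(1, 100))
```

With these particular numbers the repeated update reaches the payoff exactly on the grid. For small learning rates, nearest-point rounding can leave the Q-value unchanged before the payoff is reached, which is one reason boundary behavior needs separate care.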
These updates thus move the process to the new state $g' = (Q_{t+1,1}, Q_{t+1,2})$. For convenience I will introduce the functions $F_{a_1,a_2}(\cdot)$, which for any state $g = (Q_{t,1}, Q_{t,2})$ return the new state $g'$ that results from updating the previous values $Q_{t,i}$ for the two players after playing some action profile $(a_1, a_2)$ once.
I will refer to the actions with the higher Q as the action profile "played on path", i.e. the actions in $\{a : Q^a_{t,i} = \max_b Q^b_{t,i}\}$ for each player $i$. If there is more than one such action, I further assume that the player randomizes over the full support of this set, i.e. all actions with the maximum Q-value have some positive probability to be taken.
In terms of the unperturbed dynamic I will be interested in the set of stable states.
The set of such states is denoted $C$, and it is the set of all states that are reentered with probability 1 in the unperturbed process, i.e. $C = \{g \in G : P_0^t(g, g) = 1 \text{ for some } t\}$.
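The unperturbed dynamic described above can be simulated directly. The sketch below is only an illustration under stated assumptions: the payoff numbers are hypothetical (chosen with $\pi_{NC} > \pi_{CC} > \pi_{NN} > \pi_{CN}$ and expressed in integer grid units), the helper names are invented, and the update rounds toward the payoff rather than to the nearest grid point so that every update is strictly closer to the payoff, as the general rule (1) requires.

```python
from fractions import Fraction
import math
import random

# Hypothetical payoffs in integer grid units: PI[(own action, opponent action)].
PI = {("C", "C"): 3000, ("C", "N"): 0, ("N", "C"): 5000, ("N", "N"): 1000}


def toward(q, payoff, alpha):
    """Move q strictly closer to payoff on the integer grid (assumed rounding)."""
    if q == payoff:
        return q
    target = (1 - alpha) * q + alpha * payoff
    return math.ceil(target) if q < payoff else math.floor(target)


def on_path(q):
    """Actions with the maximal Q-value for one player."""
    best = max(q.values())
    return [a for a in ("C", "N") if q[a] == best]


def run_unperturbed(q1, q2, alpha, rng, max_steps=100_000):
    """Iterate the unperturbed process until a state maps to itself."""
    q1, q2 = dict(q1), dict(q2)
    for _ in range(max_steps):
        a1, a2 = rng.choice(on_path(q1)), rng.choice(on_path(q2))
        n1 = toward(q1[a1], PI[(a1, a2)], alpha)
        n2 = toward(q2[a2], PI[(a2, a1)], alpha)
        unique = len(on_path(q1)) == 1 and len(on_path(q2)) == 1
        if unique and n1 == q1[a1] and n2 == q2[a2]:
            return (a1, a2), q1, q2  # no randomization and no change: absorbed
        q1[a1], q2[a2] = n1, n2
    raise RuntimeError("no absorbing state reached within max_steps")


start = {"C": 0, "N": 5000}
profile, end1, end2 = run_unperturbed(start, start, Fraction(1, 2), random.Random(0))
```

Starting from Q-values that favor $N$ for both players, this run is absorbed in a state with $(N, N)$ on path, where the Q-value of $N$ equals $\pi_{NN}$ and the Q-value of $C$ is left untouched.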
2.2. Perturbed dynamics. Let $\{P_\beta\}_{\beta \in (0, \bar\beta)}$ be the family of perturbed dynamics indexed by the experimentation parameter $\beta$. In particular, $P_\beta(g, g')$ denotes the probability of transition from state $g$ to state $g'$. It is assumed to satisfy the following conditions expanded from the list in Newton and Sawa (2015):
Assumption 1. (Conditions on the perturbed dynamic).
(i) $P_\beta \to P_0$ as $\beta \to 0$, where $P_0$ are the transition probabilities for some unperturbed dynamic as described above.
(ii) For $\beta > 0$, the chain induced by $P_\beta$ is irreducible.
(vi) For any $\beta > 0$, $P_\beta(g, g') > P_\beta(g, \hat g)$ for any $g$ with $(a_1, a_2)$ played on path and $g' = F_{b_1,a_2}(g)$, $\hat g = F_{b_1,b_2}(g)$, where $b_1 \neq a_1$ and $b_2 \neq a_2$.
(vii) For any $\beta > 0$ and states $g$, $g'$ in which player $i$ has the same action $a_i$ on path: if $Q_i^{a_i}(g') \leq Q_i^{a_i}(g)$ and $Q_i^{b_i}(g') \geq Q_i^{b_i}(g)$ for $b_i \neq a_i$, then player $i$ is at least as likely to experiment in $g'$ as in $g$.
The first four conditions are borrowed directly from Newton and Sawa (2015). They connect perturbed and unperturbed processes and restrict the perturbed process to be "weakly regular" (Sandholm, 2010). Condition (v) states that every transition is a valid Q-learning update, possibly an update on the profile that resulted from experimentation. Note that while the dynamics are parametrized by a single variable $\beta$, this condition admits different experimentation rules for players with different probabilities of experimentation or different processes altogether, as long as the probability of experimentation decreases in $\beta$ for all players and the other conditions are satisfied.
The remaining two conditions impose mild restrictions stemming from the interpretation of the Q-vector as an imperfect estimate of the value function. In general, if the two players experiment independently of each other, and each experiments with a probability that is lower for actions whose Q-values are further from the on-path payoff, then both of the remaining conditions are satisfied. Condition (vi) requires that only one player experimenting be more likely than two players experimenting simultaneously, keeping their actions fixed. Importantly, however, this does not imply that a two-player experimentation for some state cannot be less costly than a single-player experimentation from another state. In particular, this is not true if the former state has no single-player experimentation that eventually leads to another stable state. Condition (vii) states that if for some player the Q-value for the action on path is the same or lower, and the Q-value for the action off path is higher, in one of the two states, then she is at least as likely to experiment in this state as in the other state.
In particular, (vii) is true if the probability of experimentation is increasing in the Q-value of the corresponding action.
Overall, these conditions are quite permissive. The logit choice rule (also called the Boltzmann softmax function) described in Waltman and Kaymak (2007, 2008) can be shown to satisfy them, as can uniform experimentation, probit, etc. The results do not depend on the choice of perturbations as long as they satisfy these regularity assumptions.
Let $G_{PD}$ be the set of states such that $Q_i^a \in [\min_{(a_1, a_2)} \pi(a_1, a_2), \max_{(a_1, a_2)} \pi(a_1, a_2)]$ for any $a \in A$ and both $i \in \{1, 2\}$. The initial seed for the process has to be chosen from $G_{PD}$ since the other states are not reachable from within the set. This follows from the regularity conditions in Assumption 1 because any state $g_2$ reachable from $g_1$ has to be a valid update, i.e. $g_2 = F_{a_1,a_2}(g_1)$ for some profile $(a_1, a_2)$. However, from the definition of $F$ in (1) it follows that there are no states $g_1 \in G_{PD}$ and $g_2 \notin G_{PD}$ such that $F_{a_1,a_2}(g_1) = g_2$ for any $a_1, a_2 \in A$. Therefore there can be no path from any state in the set $G_{PD}$ to any state not in the set $G_{PD}$, and $P_\beta(g, g') = 0$ for any $\beta$ whenever $g \in G_{PD}$ and $g' \notin G_{PD}$.
In the actual computer implementation of the algorithm, the restriction is irrelevant as the learning process on a finite grid will eventually reach $G_{PD}$. Thus, the designer does not need to have prior knowledge of the payoffs or other information about the game to set up the initial conditions, provided there is sufficient experimentation.
As was mentioned in the introduction, the proof relies on the machinery of the "one-shot deviation principle" introduced in Newton and Sawa (2015) for matching games and uses the spanning trees approach from Young (1993). The definitions below are taken from these papers.
Definition 1. The 1-step cost of the process moving from $g$ to $g'$ is defined as: adopting the convention that $-\log 0 = \infty$.
The 1-step cost $c(g, g')$ is the exponential decay rate of the probability of transition from $g$ to $g'$. The rarer a transition, the higher its cost. Impossible transitions have infinite cost. Note that for $g \notin C$, there is a zero cost transition from $g$. This is because there is some $g' \neq g$ such that $P_\beta(g, g')$ does not approach zero as $\beta \to 0$. We will also need the overall cost of moving between $g$ and $g'$, even if many steps are required. Let the $t$-step transition probabilities be given by $P^t_\beta(g, g') \equiv P(g_t = g' \mid g_0 = g, P_\beta(\cdot, \cdot))$.
Definition 2. The overall cost of the process moving from $g$ to $g'$ is defined as:
A spanning tree rooted at $\hat g \in C$ is a directed graph over the set $C$ such that every $g \in C$ other than $\hat g$ has exactly one exiting edge, and the graph has no cycles (implying that $\hat g$ has no exiting edges). The cost of a spanning tree is the sum of the costs of its edges given by $C(\cdot, \cdot)$. A minimum cost spanning tree is a spanning tree whose cost is lower than or equal to the cost of any other spanning tree. A state $\hat g \in C$ is stochastically stable only if there exists a minimum cost spanning tree rooted at $\hat g$. I will use $cost(\hat g)$ to denote the cost of a minimal spanning tree among all trees rooted in $\hat g$.
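To make the spanning tree criterion concrete, the sketch below enumerates all spanning trees rooted at each state of a tiny three-state example and reports the cheapest one per root. The state names, the cost table, and the function names are hypothetical, and brute-force enumeration is used only because the example is small; it is not the method of the proofs.

```python
from itertools import product

# Hypothetical overall costs C(g, g') between three stable states.
COST = {
    "a": {"b": 1, "c": 4},
    "b": {"a": 2, "c": 1},
    "c": {"a": 3, "b": 3},
}


def reaches_root(start, edge, root):
    """Follow the single exiting edge of each state; True if the root is reached."""
    seen = set()
    cur = start
    while cur != root and cur not in seen:
        seen.add(cur)
        cur = edge[cur]
    return cur == root


def min_tree_cost(states, cost, root):
    """Cost of the cheapest spanning tree rooted at `root` (brute force)."""
    others = [s for s in states if s != root]
    best = float("inf")
    for targets in product(states, repeat=len(others)):
        edge = dict(zip(others, targets))
        if any(s == t for s, t in edge.items()):
            continue  # self-loops are not tree edges
        if all(reaches_root(s, edge, root) for s in others):
            best = min(best, sum(cost[s][t] for s, t in edge.items()))
    return best


states = list(COST)
tree_costs = {root: min_tree_cost(states, COST, root) for root in states}
lowest = min(tree_costs.values())
candidates = {root for root, value in tree_costs.items() if value == lowest}
```

In this toy table only state `c` admits a minimum cost spanning tree, so it is the only candidate for stochastic stability.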
I call a transition $g \to g'$ from $g \in G$ the least cost transition from $g$ if it has the lowest cost of all possible 1-step transitions from $g$. This is either the regular update of Q-values after the on-path action profile is played or an update after the most likely experimentation.
Definition 3. Denote the set of possible least cost transitions from $g \in G$ by: $L(g) := \arg\min_{g' \neq g} c(g, g')$. $c_L(g)$ will be used to denote the cost of the least cost transition from $g$: $c_L(g) := \min_{g' \neq g} c(g, g')$. Define $OS$, the set of states which are most robust to one-shot deviation: $OS := \arg\max_{g \in G} c_L(g)$. As $c_L(g)$ is strictly positive only for $g \in C$, it must be that $OS \subseteq C$.

Recurrent classes (unperturbed process).
As usual for the study of stochastic stability, the first step is based on the fact that the stochastically stable states belong to the recurrent classes (absorbing states or repeating sequences of states) of the unperturbed process (Young, 1993).
I will show that the process is always absorbed by a single state and cannot get "stuck" in a recurring cycle.
Let $A_i(G) \subseteq A$ be the set of actions that are played by $i$ on path in a recurrent class $G$. That is, any action
Lemma 1. For any recurring class $G$, any state $\hat g \in G$, $\hat g = (\hat Q_1, \hat Q_2)$, and any action $a_i$ that is played by $i$ in this or any other state in $G$, max
Proof. By construction in (1), for both players and any action $a_i$ played on path.
From any state g From this I know that any state ĝ for which Qa
I can now characterize the absorbing states.
if and only if: (1) Moreover, these states are the only recurrent classes, i.e. there are no recurrent classes that are not singletons.
Proof. I start with the "if" part. Each of the players is taking the action with the higher Q-value, $a_1$ and $a_2$ respectively, in the unperturbed process. Since the payoffs from this profile are exactly $\pi_{a_1 a_2}$ and $\pi_{a_2 a_1}$, the new Q-vectors are unchanged, $F_{a_1,a_2}(g) = g$. Thus the process stays in $g$.
For the "only if" part, suppose there is a recurring class $G$ with possibly more than one state. Suppose first that only one profile is played in $G$, i.e. $A_1(G) = \{a_1\}$ and $A_2(G) = \{a_2\}$ for some pair of actions $a_1, a_2 \in A$. For the action profile played on path, the Q-values should equal the expected value of playing these actions by Lemma 1.
and thus $Q_i^{a_i} = \pi(a_i, a_{-i})$ and similarly for the other player. Moreover, if $Q_i^{b_i} > Q_i^{a_i}$ for some $i$ and some action $b_i \in A$, $b_i \neq a_i$, then a different action profile is played, which is a contradiction. Therefore there are no other recurrent classes where only one action profile is played, and in particular there are no other singleton absorbing states.
It remains to show that there are no non-singleton recurrent classes. Note first that from any state where π(a i , a , and $b_i \in A \setminus \{a_i\}$, the process is eventually absorbed into a singleton absorbing state. This is because in any state that follows $(a_i, a_{-i})$ is played again and | is bounded by 0 the process either eventually ends.
Suppose now that $(C, C)$ is played in some state $g \in G$. Then $Q_i^C \geq Q_i^N \geq \pi_{CC}$ for at least one of $i \in \{1, 2\}$: the first inequality holds because $(C, C)$ is played, and the second because otherwise by the above remark the process is absorbed into a singleton state. But by Lemma 1 $Q_i^C \leq \pi_{CC}$. Moreover the inequality is strict; the equality only holds if the same maximal payoff was obtained by $i$ in the previous period, which, since all payoffs are distinct, is only possible if the same profile was played in the previous period by construction in (1). Therefore, since at least some other profile is also recurring, $Q_i^C < \pi_{CC}$. This is a contradiction and therefore $(C, C)$ is not played in $G$.
Then by Lemma 1, $Q_i^C = \pi_{CN}$ because in all profiles where $i$ plays $C$ the opponent plays $N$. At the same time since C cannot be played in any state in $G$. Thus only $(N, N)$ can be played on path, which is again a contradiction.

Stochastically stable states (perturbed process).
Using this characterization, the absorbing states can now be refined to stochastically stable states.
The following two lemmas will allow us to generalize the approach from Newton and Sawa (2015). Instead of showing that all states have a minimum cost path to the OS set, which is not true in the present case, I will use the fact that all states have such paths to some "central" state, not necessarily in the OS. If there are minimum cost paths to some state $g_c$ that is not in OS, it is still possible to say that the minimal trees are of a particular form. The next lemma says that if there is such a state $g_c$, then every minimal spanning tree rooted in some state $\hat g$ can only have non-minimum-cost arcs from the states on the path from $g_c$ to the root $\hat g$.
for any $l \in \{1, \ldots, L - 1\}$ then in any minimal spanning tree rooted in $\hat g$ for any $g' \in C$, either the outgoing arc from $g'$ has the cost $c_L(g')$ or there is a path from $g_c$ to $g'$.
Proof. Suppose to the contrary that there is a state $g' \in C$ with the cost of the outgoing arc greater than $c_L(g')$ and there is no path from $g_c$ to $g'$. Since the graph is a spanning tree, there is then a path from $g'$ to $g_c$ and it is not minimal. Replacing this arc with a minimum cost arc then yields a tree with a lower cost.
If $g_c \in OS$ then, by the result in Newton and Sawa (2015), $SS = OS$. However, the lemma also allows for the case when $g_c \notin OS$, which is used in the next proposition.
Proposition 1. The minimal trees are rooted in states that minimize $C(g_c, \hat g) - c_L(\hat g)$ among all possible roots $\hat g \in C$.
Proof. By Lemma 3 any minimal spanning tree has all minimal cost outgoing arcs except for the path between $g_c$ and $\hat g$. The difference with the minimal tree rooted in $g_c$ is then the cost of this path and the least cost transition from $\hat g$. The cost of the tree rooted in $\hat g$ is then $cost(g_c) - c_L(\hat g) + C(g_c, \hat g)$.

Prisoner's Dilemma
I will introduce two stable states, $g^*$ and $g^{**}$, in $C$. Depending on the choice rule, one of these two states will be shown to be the unique state in SS. The first
To be able to use Proposition 1, I will show that indeed a variant of a "getting closer" lemma holds, but instead of approaching the OS set, the process will be approaching the stable state $g^*$ with Nash equilibrium actions on path. In other words, $g_c = g^*$ in Lemma 3.
Before stating the lemma let me formalize what "closer" will mean. I define the distance $D(g, g')$ as follows: (3) where the Q-vectors $\tilde Q(g)$ are constructed based on what is played on path in $g$ as follows.
If $a_i$ is played on path in $g$ by each player $i \in \{1, 2\}$, then $\tilde Q_i^{a_i}(g) = \pi(a_i, a_{-i})$ and $\tilde Q_i^{b_i}(g) = \pi(b_i, a_{-i})$ for $b_i \neq a_i$ for $i \in \{1, 2\}$. That is, the Q-values for the "on path" action profile equal the respective payoffs for the two players, i.e. $\pi(a_1, a_2)$ and $\pi(a_2, a_1)$, and the values for the "off-path" actions are the payoffs of the action profile resulting from a single-player experimentation, i.e. $\pi(b_1, a_2)$ and $\pi(b_2, a_1)$, and therefore single-player experimentation does not change the Q-vector.
I also introduce $0 \leq m(g, g') \leq 2$ as the number of actions that differ on path in $g$ and $g'$. This is used for the situations when precisely one player cooperates.
Finally I introduce the values $\bar d(g)$ and $\underline d(g)$, the probabilities of experimentation for the player who is more likely to experiment and for the player who is less likely to experiment, respectively.
where g i is the state after i experiments from g and β is any strictly positive value.
These values will be used for states where both players cooperate.
The next lemma uses the fact that experimentation by two players is less likely than experimentation by one player (Assumption 1, vi) to show that a single-player experimentation from any state, possibly followed by zero-cost deviations, will get the process closer to the state g * from which only a two-player experimentation can lead to a new state.
) and so on.
Take any state $g \in C \setminus \{g^*\}$ with $(a_1, a_2)$ played on path and $b_i \neq a_i$ for $i \in \{1, 2\}$.
Neither $(N, C)$ nor $(C, N)$ can be played on path in $g \in C$ because then for one of the players and $g \notin G$.
The remaining proof is by cases.
(1) Suppose $(N, N)$ is played on path. Then $Q_i^C \neq \pi(C, N)$ for one of the players $i \in \{1, 2\}$ in order for $g \neq g^*$. If the non-equality holds for both players, suppose without loss of generality that the least cost transition is by player $i$.
Then in all cases experimentation leads $i$ to play $C$. By Lemma 2 since $g \in C$, ) is played again. Then, with positive probability and zero cost in the states $g_2, g_3, \ldots, g'$ that follow $g$ again for some $g'$. Thus, in the new absorbing state . the distance has decreased and thus $D(g^*, g') < D(g^*, g)$, while $m(g^*, g') = m(g^*, g) = 2$, so $g' \prec g$ as required.
(2) Suppose $(C, C)$ is played on path. Then experimentation leads $i$ to play $N$.
If the process converges to a state where at least one player defects, then $2 = m(g^*, g') > m(g^*, g) = 1$, and $g' \prec g$ as required. So suppose instead that eventually a stable state with $(C, C)$ on path is reached. Two subcases are possible.
) is played again. Then, with positive probability and zero cost in the states again for some $g'$. Thus, in the new absorbing state Moreover, since $i$ was the experimenting player in $g$, player $i$ is also at least as likely to experiment in $g'$ as the other player by condition (vii) in Assumption 1. Then d(g) = d(g') and d(g) < d(g') and $g' \prec g$ as required.
(b) In all remaining cases eventually QC ) is played. The payoff of player $i$ is the highest possible, and the payoff of the other player is the lowest possible, so the Q-values of their actions increase and decrease respectively. Then eventually $(N, N)$ is played and again $Q_{-i}^C < Q_{-i}^N$ in some $\hat g$ and will not increase unless $(C, C)$ is played again. Once $(C, C)$ is played again, the game continues with $(C, C)$ until convergence to a stable state. Since eventually $(C, C)$ is reached, at some future state Moreover, since $i$ experimented in $g$, one of the following must occur: either $-i$ is also less likely to experiment than $i$ in $g'$ and then again by condition (vii) in Assumption 1 and because or $i$ is at least as likely to experiment as $-i$ in $g'$ but $P_{-i}(g) > P_{-i}(g') \geq P_i(g')$ and again d(g) > d(g'). Thus $g' \prec g$ as required.
Since states with (C, N ), (N, C) or mixed actions on path cannot be stable in C ⊆ G P D , one only needs to consider trees rooted in states with (N, N ) and (C, C) on path.
At the same time, a tree rooted in some state $\hat g \neq g^*$ with $(N, N)$ on path cannot be minimal. This can be stated as a corollary of Proposition 1:
Corollary 1. A minimal cost spanning tree has to be rooted in a state $\hat g$ such that
Proof. By Proposition 1 for any minimal spanning tree $cost(g^*) - c_L(\hat g) + C(g^*, \hat g)$, where $\hat g$ is the root. By definition, C(g Then the minimal tree rooted in $g^*$ has smaller cost, which is a contradiction.
The cost of leaving any state $\hat g \neq g^*$ with $(N, N)$ on path is at most $c_L(\hat g) = \pi_{NN} - \pi_{CN}$, while the minimal cost of leaving $g^*$ is strictly greater than $\pi_{NN} - \pi_{CN}$ by the regularity conditions in Assumption 1 (vi), because both players have to experiment simultaneously. Therefore the cost of a minimal tree rooted at $g^*$ is strictly lower. This leaves only the state $g^*$ and states in $C$ with $(C, C)$ on path as candidates for stochastic stability. One can further refine the possibilities by the following corollary:
Corollary 2. If a minimal cost spanning tree is rooted in a state $\hat g \neq g^{**}$ and $(C, C)$ is played on path in $\hat g$, then there is also a minimal cost spanning tree rooted in $g^{**}$.
Proof. From any state with $(N, C)$ or $(C, N)$ on path there is a zero-cost path to a state in $(N, N)$. Therefore in any minimum cost path to $g^{**}$ any state with $(C, C)$ has to follow another state with $(C, C)$ or a state with $(N, N)$. Take the first state with $(C, C)$; then there is a state with $(N, N)$ before it. But then in this state for both $i \in \{1, 2\}$ and this state is therefore $g^{**}$. So any minimum cost path from $g^*$ to a state with $(C, C)$ passes through $g^{**}$ and has cost no less than $C(g^*, g^{**})$. At the same time, $c_L(\hat g) \leq c_L(g^{**})$ for any state $\hat g$ with $(C, C)$ by regularity conditions in Assumption 1. Then $cost(g_c) - c_L(\hat g) + C(g_c, \hat g) \geq cost(g_c) - c_L(g^{**}) + C(g_c, g^{**})$ and by Proposition 1 the result follows.
Thus, to argue about the action profiles on path in the limit, one only needs to consider $g^*$ and $g^{**}$. This leads to a characterization:
Corollary 3. (i) If $C(g^*, g^{**}) < c_L(g^{**})$ then $g^{**} \in SS$ and players always converge to cooperation in any state in SS.
(ii) If $C(g^*, g^{**}) > c_L(g^{**})$ then $SS = \{g^*\}$ and players always converge to defection.
(iii) If $C(g^*, g^{**}) = c_L(g^{**})$ for one or more states, both defection and cooperation may occur in the limit with positive probability.
Proof.Follows directly from Proposition 1 and from Corollaries 1, 2 by the remark above.
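The comparison behind Proposition 1 and the three cases above can be written as a small helper. This is a sketch under assumptions: the function name, the root labels and the numerical costs are hypothetical, and the central state (here labelled `g*`) is given score zero because a tree rooted in $\hat g$ costs $cost(g_c) - c_L(\hat g) + C(g_c, \hat g)$.

```python
def minimal_roots(candidates, central="g*"):
    """Return the roots of minimum cost spanning trees.

    `candidates` maps a root label to the pair (C(g_c, root), c_L(root)).
    The central state g_c is scored 0; every other root is scored by the
    change in tree cost relative to the tree rooted in g_c.
    """
    scores = {central: 0}
    for name, (path_cost, exit_cost) in candidates.items():
        scores[name] = path_cost - exit_cost
    best = min(scores.values())
    return {name for name, score in scores.items() if score == best}


# Hypothetical costs for the three cases of the characterization.
case_cooperate = minimal_roots({"g**": (2, 3)})  # C(g*, g**) < c_L(g**)
case_defect = minimal_roots({"g**": (4, 3)})     # C(g*, g**) > c_L(g**)
case_tie = minimal_roots({"g**": (3, 3)})        # C(g*, g**) = c_L(g**)
```

Exact equality is only meaningful here because the example uses integer costs; with floating-point costs a tolerance would be needed to detect the knife-edge case.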
I now apply these concepts to particular experimentation rules: the ε-greedy rule (which I call β-greedy in my notation) and the logit rule (also called softmax or Boltzmann).
Under the β-greedy rule, with probability $(1 - k_i\beta)$ player $i$ chooses the actions with highest Q-values (with ties resolved uniformly), and with probability $k_i\beta$ the action is chosen by randomizing uniformly. Formally in our definitions, $Pr^{greedy}_i(a|g) = \frac{1 - k_i\beta}{n_i(g)} + \frac{1}{2}k_i\beta$ if $a$ is played by $i$ on path in $g$, where $n_i(g)$ is the number of actions played by $i$ on path in $g$, and $Pr^{greedy}_i(a|g) = \frac{1}{2}k_i\beta$ otherwise. The denominator in the former case only divides the amount among all actions played on path if there is more than one. The chosen experimentation function $\beta_i(\beta) = k_i\beta$ ensures that the probability of experimentation is decreasing in $\beta$ for any choice of the constants $k_i > 0$, as required by condition (i) in Assumption 1. For this rule, the starting value of $\beta$ has to be taken so that $k_i\beta$ is less than 1 to ensure that the resulting probability of experimentation is well-defined. If $k_1 = k_2 = 1$, I obtain the simpler case with symmetric experimentation probabilities $\beta_1(\beta) = \beta_2(\beta) = \beta$.
Under the logit choice rule instead $Pr^{logit}_i(a|g) = \frac{\exp(Q_i^a / (k_i\beta))}{\sum_{b \in A} \exp(Q_i^b / (k_i\beta))}$, with no restriction on $k_i$ and $\beta$ as long as they are positive. The value $\beta_i(\beta) = k_i\beta$ is sometimes called the temperature: with higher values of $\beta$, experimentation becomes more likely and less dependent on Q-values. When $\beta$, and therefore also $\beta_i(\beta)$, approach zero, the actions with the highest Q-value are always chosen and the process approaches the unperturbed dynamic $P_0$. For the symmetric setup in the limit with $\beta_1(\beta) = \beta_2(\beta) = \beta$ approaching infinity, the actions are chosen with equal probability.
In both cases then $P_\beta(g, g')$ is the product of these probabilities, $\prod_{i \in \{1,2\}} Pr^{greedy}_i(a_i|g)$ or $\prod_{i \in \{1,2\}} Pr^{logit}_i(a_i|g)$ for $g' = F_{a_i,a_{-i}}(g)$.
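The two choice rules can be sketched in a few lines of Python. The function names and the numerical Q-values are hypothetical, and the logit rule is written in the standard softmax form with temperature $k_i\beta$, which is an assumption of this sketch.

```python
import math


def greedy_probs(q, k, beta):
    """Beta-greedy choice probabilities for one player.

    With probability 1 - k*beta the player picks uniformly among the on-path
    (maximal-Q) actions; with probability k*beta she picks uniformly among all.
    """
    best = max(q.values())
    n_on_path = sum(1 for v in q.values() if v == best)
    explore = k * beta
    return {
        a: ((1 - explore) / n_on_path if v == best else 0.0) + explore / len(q)
        for a, v in q.items()
    }


def logit_probs(q, k, beta):
    """Softmax choice probabilities with temperature k*beta (assumed form)."""
    temp = k * beta
    top = max(q.values())  # subtract the maximum for numerical stability
    weights = {a: math.exp((v - top) / temp) for a, v in q.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


# Hypothetical Q-vector with N on path.
q_example = {"C": 0.0, "N": 1.0}
p_greedy = greedy_probs(q_example, k=1.0, beta=0.1)
p_logit = logit_probs(q_example, k=1.0, beta=0.01)

# Probability that both players (with the same Q-vector) deviate to C at once.
p_joint_deviation = p_greedy["C"] * p_greedy["C"]
```

Under the β-greedy rule the deviation probability does not depend on the Q-values, while under the logit rule it shrinks as the gap between the on-path and off-path Q-values grows, which is what drives the difference between the two rules in the results that follow.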
For β-greedy choice rule experimentation by 2 players in any state is less likely than experimentation by 1 player in any state, which will be shown to always lead to players converging to defection.
The logit rule case on the other hand can lead to cooperation, depending on the values of parameters. A commonly used property of the logit choice rule is that the transition probabilities in the limit $\beta \to 0$ are determined by the absolute difference in payoffs between the states. Let $z^{logit}_i$ be the smallest integer equal to or greater than which is the necessary number of updates on the profile $(C, C)$ for the player $i$ to get from $g^*$ to a state where $i$ cooperates on path under the logit rule. The expression can be obtained by rewriting the recursive equations
These equations describe the minimum cost path from $g^*$ to $g^{**}$ because the only possible updates that increase Q-values for $C$ are the updates on the profile $(C, C)$, i.e. $F_{a_i,a_{-i}}(g) = g'$ can only hold if $a_i = C$ for both $i \in \{1, 2\}$ if $g'$ has a higher Q-value for $C$ than $g$. This is proven in the following lemma:
Lemma 5. The path $g^* = g_1, \ldots, g_L = g^{**}$, where $(C, C)$ is played in every state $g_l$, is the lowest cost path between $g^*$ and $g^{**}$.
Proof.Let g l = (Q l,1 , Q l,2 ) and suppose there is a lower cost path g * = g 1 , ...g L = g * * , ) with at least one defection on path.Let g C,C l be the state (if any) on this path where (C, C) is played for the l-th time after l − 1 (not necessarily consecutive) plays of (C, C) on this path.That is, the next state on the path after By the regularity condition vii in Assumption 1, c(g l , g l+1 ) is the lowest 1-step cost c(g, ĝ) among all pairs of states g, ĝ ∈ C, g = (Q 1 , Q 2 ) and ĝ = F CC (g) = ( Q1 , Q2 ), such that Q C i ≤ Q C l,i for both i ∈ {1, 2} and (N, N ) is played on path.Moreover, for any such pair of states, QC i ≤ Q C l+1,i by construction in (1).At the same time, any profile where at least one player defects cannot increase the Q-value of cooperation for either player, i.e.
Then for every pair of states g l and g C,C l on their respective paths for any l ∈ 1..L, Q C l,i ≤ Q C l,i for both i ∈ {1, 2}.Moreover, L > L as there is at least one state with defection that does not increase the Q-value of cooperation.But then the cost at every (g C,C l ) is at least as high as the cost c(g l , g l+1 ) on the other path.Since there are no other states on the cooperative path g * = g 1 , ...g L = g * * , its cost is the same or lower.
Further, let q C l,i = π CC + (1 − α i ) l−1 (π CN − π CC ).These are the Q-values of cooperation for each player on the minimum-cost path in Lemma 5 from g * to g * * under the logit rule.Then the characterization for the two rules is as follows: Proposition 2. For the asymmetric Q-learners the SS set depends on the choice rule: (i) SS = g * under β-greedy choice rule (ii) Under the logit choice rule: Proof.(i) Under β-greedy rule, a two-player simultaneous experimentation has the probability β 1 (β) × β 2 (β) = k 1 k 2 β 2 , while a single-player experimentation has the probability at most max i β i (β) = max{k 1 , k 2 }β.By construction, leaving g * requires a two-player simultaneous experimentation, while a least cost transition from any state ĝ ∈ C with (C, C) on path requires only a single-player experimentation.Then and by Corollary 1 SS = {g * }.
(ii) For $C(g^*, g^{**})$ on the minimum-cost path where all players cooperate (by Lemma 5), the cost of transitions is while both players have $N$ on path (both players have to experiment), i.e. for $l = 1$ to $\min_i z^{logit}_i$.
For $l = \min_i z^{logit}_i$ to $\max_i z^{logit}_i$ only the player who still has $N$ on path has to experiment. Then $C(g^*, g^{**}) =$ The result then follows by applying Proposition 1.
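The closed form for $q^C_{l,i}$ can be checked numerically. The sketch below is illustrative only: it assumes the update in (1) takes the standard stateless form $Q \leftarrow (1 - \alpha)Q + \alpha\pi$, which is the recursion that the closed form satisfies, and it uses hypothetical payoffs and a hypothetical learning rate. The function names are not from the paper.

```python
def q_coop_closed_form(l, alpha, pi_cc, pi_cn):
    """Closed form: q^C_l = pi_CC + (1 - alpha)^(l - 1) * (pi_CN - pi_CC)."""
    return pi_cc + (1.0 - alpha) ** (l - 1) * (pi_cn - pi_cc)


def q_coop_iterated(l, alpha, pi_cc, pi_cn):
    """Start from pi_CN and apply l - 1 updates Q <- (1 - alpha) * Q + alpha * pi_CC.

    Assumption: this is the form of the learning rule in (1).
    """
    q = pi_cn
    for _ in range(l - 1):
        q = (1.0 - alpha) * q + alpha * pi_cc
    return q


# Hypothetical parameters, chosen only for illustration.
ALPHA, PI_CC, PI_CN = 0.3, 1.0, 0.0

# Q-value of cooperation before the 1st, 2nd, ..., 5th play of (C, C).
path = [q_coop_closed_form(l, ALPHA, PI_CC, PI_CN) for l in range(1, 6)]
```

With these numbers the sequence starts at $\pi_{CN}$ and rises monotonically towards $\pi_{CC}$, which is why each further play of $(C, C)$ makes cooperation less costly to sustain.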
In particular, the previous proposition implies that there is always a low-enough $\alpha = \min\{\alpha_1, \alpha_2\}$ for any $\pi_{NN}$ such that defect-defect (the $g^*$ state played repeatedly) is the unique action profile in the limit. In other words, one of the learners can always preclude cooperation if her learning parameter is low enough.

It is illustrative to also consider a supergame of choosing a learning algorithm against an opponent. In a supergame of choosing the parameters $\alpha_i$ and $k_i$, since the algorithms can only converge to $(N, N)$ or $(C, C)$ on path, low values of $\alpha_i$ are dominated. In other words, setting $\alpha_i = 1$ is never a bad strategy. For the experimentation parameter $k_i$ it is best for the player in the supergame to try to match the opponent's value of $k_{-i}$. This can be seen by moving the cost of the player with the higher $k_i$ to the right side in the expression in Proposition 2. In sum, it is always best to remember only the immediately previous payoff, disregarding prior history of play, while trying to experiment about as often as the opponent. The shapes of the regions for the asymmetric case are similar and can be seen in Figure 2.
It is easy to extend the analysis to other learning and experimentation rules, so long as the regularity conditions are satisfied.The difference in learning parameters will only affect the regions through the changing costs C(g * , g * * ) and c L (g * ), so Corollary 3 can again be used to obtain the characterization.

Conclusion
The textbook approach to repeated games focuses on the existence of cooperative equilibria in terms of the discount rate $\delta$ and three of the four payoffs of the game. Namely, the cooperative equilibrium will exist if $\delta \ge \frac{\pi_{NC} - \pi_{CC}}{\pi_{NC} - \pi_{NN}}$. One can perhaps think of this as an alternative learning concept (parametrized by $\delta$) that relies on players understanding the repeated nature of the game. This expression differs from the payoff information relevant for cooperation of reinforcement learners, $\frac{\pi_{NN} - \pi_{CN}}{\pi_{CC} - \pi_{CN}}$. In fact, the payoff $\pi_{CN}$ is completely irrelevant in the former, and the temptation payoff $\pi_{NC}$ in the latter. Blonski et al. (2011) argue in favor of a third view, that all four payoffs should matter axiomatically with extremely low and extremely high $\pi_{CN}$ corresponding to Nash and cooperative outcomes respectively, provided that the discount is high enough to support cooperation in the first place.
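The contrast between the two expressions can be made concrete with a small numerical sketch. The payoff values and function names below are hypothetical and serve only to show which payoffs enter which expression.

```python
def repeated_game_threshold(pi_nc, pi_cc, pi_nn):
    """Minimum discount rate for the cooperative equilibrium:
    (pi_NC - pi_CC) / (pi_NC - pi_NN). Note that pi_CN does not appear."""
    return (pi_nc - pi_cc) / (pi_nc - pi_nn)


def relative_punishment_payoff(pi_nn, pi_cn, pi_cc):
    """Payoff information relevant for reinforcement learners:
    (pi_NN - pi_CN) / (pi_CC - pi_CN). Note that pi_NC does not appear."""
    return (pi_nn - pi_cn) / (pi_cc - pi_cn)


# Hypothetical prisoner's dilemma: pi_NC > pi_CC > pi_NN > pi_CN.
PI_NC, PI_CC, PI_NN, PI_CN = 5.0, 3.0, 1.0, 0.0

delta_min = repeated_game_threshold(PI_NC, PI_CC, PI_NN)
rel_punishment = relative_punishment_payoff(PI_NN, PI_CN, PI_CC)

# Lowering the sucker payoff pi_CN changes only the second quantity.
rel_punishment_low_cn = relative_punishment_payoff(PI_NN, -2.0, PI_CC)
```

Varying $\pi_{CN}$ leaves `delta_min` unchanged, while varying $\pi_{NC}$ leaves `rel_punishment` unchanged, so sufficient variation in these two payoffs separates the two predictions.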
The characterization of learning equilibria in this paper thus addresses two issues. With results that differ from the predictions of other learning processes, Q-learning becomes a testable theory given enough variation in payoffs: whether subjects think in terms of adjusting their best responses, or instead keep a mental model of expected valuations of different actions, the Q-vector, has implications for observable behavior. The potential of the reinforcement learning models is supported by previous studies such as Roth and Erev (1995) and Erev and Roth (1998), which combined simulations with experiments to show that reinforcement learning models have better predictive and descriptive power than standard equilibrium analysis. At the same time, in this paper reinforcement learning is only a long-term equilibrium selection concept, instead of a description of the medium-term dynamics studied in these papers, which may be more relevant for discerning learning algorithms in human behavior.
Another, more direct target of this research is the field of algorithmic pricing. Due to their simplicity, Q-learning algorithms are a natural candidate for building reinforcement learning into automated pricing systems. However, even these simple algorithms have been shown empirically to be able to learn to support supracompetitive prices. I confirm these simulation results theoretically and, moreover, show that there is an optimal set of parameters that will always be chosen by a rational designer of the algorithm to maximize the chance of collusion, namely the highest learning rate and an attempt to match the experimentation rate of the opponent.
The most natural extension of this analysis is to expand the scope from a two-action game to a discretized Bertrand competition or a similar game. Unfortunately, not all results extend in a straightforward manner. Most importantly, in a differentiated Bertrand competition a minimum-cost path to a central state $g^*$ no longer has to exist.
for any $b \in A \setminus \{a\}$. I first show that the Q-values are bounded by the lowest and highest payoffs that can happen on path. I can also get the same results by considering the limiting distribution.

Corollary 4. There is always a low-enough $\alpha > 0$ for any $\pi_{NN}$ such that $\{g^*\} = SS$ under either rule.

Proof. $z^{logit}_i$ increases without bound as $\alpha_i$ approaches 0. Then $\sum_{l=1}^{z^{logit}_i} 2(\pi_{NN} - q^C_{l,i})$ also increases without bound and by Proposition 2 $\{g^*\} = SS$.

The characterization is simpler for the symmetric case where $\alpha_1 = \alpha_2 = \alpha$, $\beta_1(\beta) = \beta_2(\beta) = \beta$, and therefore $C(g^*, g^{**}) = \sum_{l=1}^{z^{logit}} 2(\pi_{NN} - q^C_l)$ and $c_L(g^{**}) = \pi_{CC} - \pi_{NN}$ with $z^{logit} = z^{logit}_1 = z^{logit}_2$. Substituting, the corollary follows immediately:

Corollary 5. If $\alpha_1 = \alpha_2$ and $\beta_1(\beta) = \beta_2(\beta) = \beta$ then:

the regions with cooperation and defection for symmetric learners with a two-dimensional graph because the only relevant factors are the learning rate $\alpha$ and the position of the $\pi_{NN}$ payoff between $\pi_{CN}$ and $\pi_{CC}$. Therefore I fix $\pi_{CC} = 1$ and $\pi_{CN} = 0$ without loss of generality; the shape of the regions is preserved for other values of the three payoffs $\pi_{CN}$, $\pi_{NC}$, and $\pi_{CC}$. Instead I will equivalently use the value $\frac{\pi_{NN} - \pi_{CN}}{\pi_{CC} - \pi_{CN}}$, which captures the relative position of $\pi_{NN}$ between $\pi_{CN}$ and $\pi_{CC}$. The regions are shown in Figure 1. The boundary of the regions consists of the only values where both cooperation and defection are possible in the limit. The area above and to the left of the red dashed line is the region covered by the theoretical part of Waltman and Kaymak (2008).
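The role of a low learning rate in Corollary 4 can be illustrated in the normalized symmetric setting with $\pi_{CC} = 1$ and $\pi_{CN} = 0$. The sketch below counts how many steps along the cooperative path are needed before the Q-value of cooperation reaches a given relative punishment payoff. This count is only a hypothetical stand-in for $z^{logit}$, whose exact definition is not restated here, and the update rule is assumed to take the standard form $Q \leftarrow (1 - \alpha)Q + \alpha\pi$.

```python
def steps_until_coop_value_reaches(x, alpha):
    """Smallest l >= 1 such that q^C_l = 1 - (1 - alpha)^(l - 1) >= x.

    Normalization: pi_CC = 1 and pi_CN = 0, so x plays the role of the
    relative punishment payoff (pi_NN - pi_CN) / (pi_CC - pi_CN).
    Assumption: this threshold rule is a stand-in for z^logit.
    """
    if not (0.0 < x < 1.0):
        raise ValueError("x must lie strictly between 0 and 1")
    if not (0.0 < alpha <= 1.0):
        raise ValueError("alpha must lie in (0, 1]")
    l, q = 1, 0.0  # q^C_1 = pi_CN = 0 at the start of the path
    while q < x:
        q = (1.0 - alpha) * q + alpha  # one more play of (C, C)
        l += 1
    return l


# The step count for a fixed x = 0.5 under progressively lower learning rates.
counts = [steps_until_coop_value_reaches(0.5, a) for a in (1.0, 0.5, 0.1, 0.01)]
```

In this sketch the count is weakly increasing as $\alpha$ falls and grows without bound as $\alpha$ approaches 0, so the accumulated cost of leaving $g^*$ grows as well.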

Figure 1. Trade-off between learning rate $\alpha$ and relative punishment payoff $\frac{\pi_{NN} - \pi_{CN}}{\pi_{CC} - \pi_{CN}}$ for symmetric learners. The blue region has $(C, C)$ in the limit, the light region has $(N, N)$ in the limit.