Functional Dynamics by Intention Recognition in Iterated Games

Intention recognition is an important characteristic of intelligent agents. In their interactions with others, they try to read others' intentions and form an image of others in order to choose their actions accordingly. While the way in which players choose their actions depending on such intentions has been investigated in game theory, how dynamic changes in intentions caused by mutually reading others' intentions can be incorporated into game theory has not been explored. We present a novel formulation of game theory in which players read others' intentions and change their own through an iterated game. Here, an intention is given as a function whose argument is the other's action and whose value is the player's own action to be taken accordingly, while the mutual recognition of intention is represented as functional dynamics. It is shown that a player suffers no disadvantage when he/she recognizes the other's intention, and that the functional dynamics reach equilibria in which both players' intentions are optimized. These cover the classical Nash and Stackelberg equilibria, and we extend them in this study: novel equilibria exist depending on the degree of mutual recognition. Moreover, the degree to which each player recognizes the other can also differ. This formulation is applied to resource competition, duopoly, and prisoner's dilemma games. For example, in the resource competition game with a player-dependent capacity for gaining the resource, the superior player's recognition leads to the exploitation of the other, while the inferior player's recognition leads to cooperation through which both players' payoffs increase.


I. INTRODUCTION
How each individual decides his/her own behavior is a long-standing problem in nature. Each agent behaves freely and receives a reward as a result. This situation is generally formulated as a "normal form game" in which the "player," "action," and "payoff" are given [1]. In a standard game played only once, each player needs to decide which action he/she chooses as his/her own strategy. An optimal strategy is to choose the best-rewarded action depending on others' actions, which results in the Nash equilibrium [2,3]. However, while this is the optimal strategy in a one-shot game, games are commonly played repeatedly in reality. In a repeated game, each player can refer to the history of actions played in the past and use it for the next game or a later one. The Nash equilibrium was not originally introduced to deal with such a situation. Instead, one possible equilibrium is given by the "folk theorem" [4][5][6], where deviation from the Nash equilibrium can happen; however, the theorem only provides the requirements for the achieved equilibrium and cannot specify which equilibrium is actually achieved. To determine an optimal strategy, which is no longer given as it is in a one-shot game, we must therefore adopt a concrete learning process through which each player improves his/her own strategy against that of the other player.
An important characteristic of an intelligent agent (i.e., a human) is to recognize and form an image of others by using the history of the other's actions. Such an agent assumes that the other intentionally changes the next action in response to the agent's own action. For example, a descriptive and predictive model of a person's cognitive behavior has been proposed [7,8], which is based on experiments with a repeated beauty contest game [9,10]. In this model, each person is given a cognitive level. The level-0 person chooses an action with no recognition (randomly), while the level-k (k > 0) person best responds to the image of others at the (k − 1)th level (or lower).
Recently, such cognitive levels have been uncovered in neuroeconomics [11]. As another example, it has been reported that information on others' true intentions increases the performance in the game when the actions of others are transferred with disturbances by noise [12][13][14][15]. Following this experimental result, the evolution of the ability to recognize others' intentions has been discussed theoretically [16][17][18]. Although the above studies sufficiently justify the existence of humans' ability to recognize others and to benefit from it, a theory for the dynamic coevolution of images between agents remains underdeveloped. In [19], how a player builds and deconstructs the other's image in prisoner's dilemma games was studied by using a recurrent network. How the equilibrium of actions is shaped and how it deviates from the Nash equilibrium, however, were not analyzed.
Here, we develop a theoretical framework with mutual learning that shapes the other's internal intentions, generally applicable to any games, without resorting to specific learning algorithms.
Once every agent has an image of the other and best responds to it, the sequential actions are given. Therefore, how an agent constructs the other's image is itself a strategy in repeated games, which is represented as a function, as shown later. Initially, both agents best respond to each other without referring to the other's strategy. Hence, both best-response functions as strategies achieve nothing but the Nash equilibrium [2,3]. Then, before considering the dynamics of a pair of such strategy functions, we consider an extreme case in which an agent one-sidedly reads the other's strategy function. In this case, the Stackelberg equilibrium [20] is achieved, which is defined as an equilibrium in an "extensive form game" [21] under perfect information.
Following these introductory results, we study the dynamics of strategy functions, which represent agents mutually recognizing each other's intentions. Through repeated games, each agent accurately reads the other's strategy function and optimizes his/her own based on it. These dynamics reach an equilibrium when there is no additional advantage in the further recognition of the other's strategy. At this point, a "functional equilibrium" is achieved between both players' strategies instead of the original Nash equilibrium.
Note that our formulation can be applied to general games. Here, applications to resource competition, duopoly, and probabilistic prisoner's dilemma games are provided as examples. In the resource competition game, it is found that learning by an inferior agent increases the payoffs of both players, while that by a superior agent enhances exploitation and decreases the payoff of the other.

II. NASH EQUILIBRIUM
We consider a two-player game in which players are denoted by $i \in \{1, 2\}$. Each player $i$'s action and payoff are represented by $x_i$ and $u_i(x_1, x_2)$, respectively, which are continuous variables. A player tries to receive a higher payoff by optimizing his/her action depending on the other's action. Now, each player has an intention regarding which action he/she chooses depending on the other's action. The intentions vary depending on how the player imagines the other's action.
Thus, the intention of player 1 is given as the strategy function $f_1(x_2)$, which represents that action $x_1$ is chosen when player 2 takes action $x_2$. Player 2's strategy function is similarly defined as $f_2(x_1)$. Then, assuming that each player's action follows his/her own strategy, the equilibrium set of actions, denoted by $(x^{eq}_1, x^{eq}_2)$, is given by the crossing point of both players' functions. In other words, we get
$$x^{eq}_1 = f_1(x^{eq}_2), \quad x^{eq}_2 = f_2(x^{eq}_1). \tag{1}$$
In this section, we consider a situation in which both players have no recognition of the other's intention. In this case, each player simply maximizes his/her own payoff without referring to the other's strategy. To be consistent with the standard terminology, this strategy is called the "best response" [3], denoted by $f^B_1(x_2)$ (B-response) for player 1. According to this definition, $f^B_1(x_2)$ satisfies
$$f^B_1(x_2) = \arg\max_{x_1} u_1(x_1, x_2). \tag{2}$$
Eq. 2 simply means that player 1's strategy function is given by maximizing the payoff under the assumption that the other's action is constant, independent of his/her own action. At this point, note that the strategy function given by the B-response depends on the other's action $x_2$.
Player 2's B-response is given in the same way. Thus, when both players make B-responses, the equilibrium, denoted by $(x^{BB}_1, x^{BB}_2)$, is nothing but the Nash equilibrium by definition. In the present paper, however, we call it the BB equilibrium, where the left index indicates player 1's strategy and the right index indicates player 2's strategy, because the same equilibrium set of actions can be achieved by different sets of functions. In this study, which pair of functions results in the equilibrium action is important; hence, we need to specify not only the equilibrium point but also the pair of functions that achieves it. At the BB equilibrium, each player's payoff is defined as $u^{BB}_i = u_i(x^{BB}_1, x^{BB}_2)$.

III. DEFINITION OF THE LEARNING RESPONSE AND ONE-SIDED RECOGNITION
Next, we define another type of intention where a player perfectly recognizes the other's intention. Then, each player optimizes his/her action based on the information on the other's strategy function. This strategy is termed the "learning response" (L-response), denoted by $f^L_1(x_2)$, which is the response to the function $f_2(x_1)$. Hence, it follows that
$$f^L_1(x_2) = \arg\max_{x_1} u_1(x_1, f_2(x_1)).$$
An obvious difference between the L- and B-responses lies in the form of the recognized player's action. Recall that in the B-response, 1's strategy is given under the image that the other's action is independent of his/her own action (see Eq. 2). On the contrary, in the L-response, 1's strategy is given by learning that the other's action depends on his/her own action (see Eq. 34). Therefore, the L-response is independent of $x_2$, while the B-response depends on $x_2$.
We now consider a situation in which player 1 one-sidedly recognizes 2's intention. In this case, player 1 (2) makes the L-(B-) response. The crossing of these functions is defined as the LB equilibrium $(x^{LB}_1, x^{LB}_2)$, which is given by
$$x^{LB}_1 = f^L_1(x^{LB}_2), \quad x^{LB}_2 = f^B_2(x^{LB}_1).$$
Then, player $i$'s payoff is defined as $u^{LB}_i = u_i(x^{LB}_1, x^{LB}_2)$. In the same way, the BL equilibrium is defined as the crossing point between the B-response of player 1 and the L-response of player 2.
In the duopoly game to be discussed later, the LB (BL) equilibrium is known as the "Stackelberg equilibrium" [20], while in general games, it belongs to the "sub-game perfect equilibria" [22,23]. Here, we use the term Stackelberg equilibrium for any game. Therefore, one-sided recognition means a transition from the Nash equilibrium to the Stackelberg equilibrium.
We now study some of the general properties of such one-sided recognition. First, a player does not lose any benefit by learning the other's B-response; in other words, $u^{LB}_1 \geq u^{BB}_1$ holds. This is easily proven as
$$u^{LB}_1 = \max_{x_1} u_1(x_1, f^B_2(x_1)) \geq u_1(x^{BB}_1, f^B_2(x^{BB}_1)) = u^{BB}_1.$$
This inequality is understood as follows: because the player adopting the B-response chooses his/her strategy depending on the other, the other player can take advantage of this dependence and shift the equilibrium point (i.e., $x^{LB}_1$ or $x^{BL}_2$) one-sidedly in order to gain a higher payoff. (Note that the zero-determinant strategy of Press and Dyson [24] in the prisoner's dilemma game is similar, in that the optimization by one player is itself taken advantage of by the other to increase the payoff.)
Second, we obtain a necessary and sufficient condition for a recognizing player to increase his/her payoff. When player 1 makes the L-response, 1 refers to 2's strategy, in other words, to how 2's action changes depending on 1's action. Thus, 1's action deviates from the BB equilibrium if 2's strategy function has a nonzero gradient around the BB equilibrium. Considering the case in which the LB and BB equilibria are achieved within the interior of the possible range of players' actions $[x_{min}, x_{max}]$ (i.e., at $x_{min} < x < x_{max}$), the condition is given by
$$\left.\frac{\partial u_1}{\partial x_2}\frac{\partial f^B_2}{\partial x_1}\right|_{BB} \neq 0.$$
The condition for player 2's L-response is obtained in the same way.
The above result is interpreted through the relationship between both players' strategies. For a fixed strategy $f_2(x_1)$ of player 2, 1's strategy $f_1(x_2)$ enables him/her to realize the payoff $u^{eq}_1$. Therefore, the condition that 1's strategy is optimal and is not changed by the other's strategy is given by
$$u^{eq}_1 = \max_{x_1} u_1(x_1, f_2(x_1)).$$
If the same equation holds for player 2, the set of strategy functions is in equilibrium. We define this as the "function equilibrium."
As illustrated in the following two examples, the function equilibrium is not satisfied at the BB equilibrium in general because the function of the other player imagined by one player does not agree with the real function of the other. In the B-response, the player imagines that the other's strategy function is constant, and he/she chooses his/her strategy accordingly. When both players make B-responses, however, the function of each player is no longer constant in contrast to the assumption for the B-response. Therefore, both players still gain an advantage by learning the other's strategy function.
On the contrary, there is no such disagreement at the LB or BL equilibria, where the L-response player imagines that the other's action changes depending on the learning side's action, and as a result, the learning side's real strategy function becomes constant. Thus, the real and imagined strategy functions are consistent with each other. Then, there are no more advantages of learning the other's strategy for either player, and the function equilibrium is satisfied.
Next, we consider whether the learned side increases or decreases his/her payoff. Let us consider the "competitive" case in which an increase in $x_i$ leads to disutility for the other, as given by
$$\frac{\partial u_1}{\partial x_2} < 0, \quad \frac{\partial u_2}{\partial x_1} < 0.$$
Indeed, a few nontrivial games satisfy such a relation, as discussed in the following two examples. In this case, if player 1 is more competitive owing to recognition ($x^{LB}_1 > x^{BB}_1$), the following relationship is satisfied:
$$u^{LB}_2 = u_2(x^{LB}_1, x^{LB}_2) < u_2(x^{BB}_1, x^{LB}_2) \leq u_2(x^{BB}_1, x^{BB}_2) = u^{BB}_2.$$
Then, the learned player is proven to receive a payoff below that at the BB equilibrium.
On the contrary, if player 1 is less competitive ($x^{LB}_1 < x^{BB}_1$), $u^{LB}_2 > u^{BB}_2$ is obtained in the same way. In addition, we can deal with other cases, for example, a cooperative situation; the public goods game [25] belongs to this type.
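The "in the same way" step for the less competitive case can be written out explicitly. The following is a sketch rather than the authors' displayed derivation; it assumes that player 2 makes the B-response at both equilibria and that the competitive condition $\partial u_2/\partial x_1 < 0$ holds along the whole segment between $x^{LB}_1$ and $x^{BB}_1$.

```latex
\begin{align}
u^{LB}_2 &= \max_{x_2} u_2(x^{LB}_1, x_2)
  && \text{($x^{LB}_2$ is the B-response to $x^{LB}_1$)} \\
 &\geq u_2(x^{LB}_1, x^{BB}_2)
  && \text{($x^{BB}_2$ is one admissible choice of $x_2$)} \\
 &> u_2(x^{BB}_1, x^{BB}_2) = u^{BB}_2
  && \text{($x^{LB}_1 < x^{BB}_1$ and $\partial u_2/\partial x_1 < 0$)}
\end{align}
```

Under these assumptions, a learner who becomes less competitive raises the learned player's payoff above its BB value, which mirrors the chain used for the more competitive case with the two inequalities applied in the opposite order.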
Here, in contrast to earlier studies, we consider not only the equilibrium set of actions but also the functions of the players that achieve it, based on the recognition of the other's intention.
Accordingly, the function equilibrium that deviates from the Nash equilibrium is introduced. We explicitly calculate the BB, LB, and BL equilibria in specific examples.
A. Example 1: resource competition game
As an example of the BL and LB equilibria formulated above, we consider a "resource competition" game. In this game, both players $i \in \{1, 2\}$ pay cost $x_i \geq 0$ to compete for a restricted resource with a total amount of one unit. Each player's reward, defined as the distributed resource, is proportional to the paid cost. Here, the efficiency of gaining the resource per cost is given by $r_i$.
Each player's payoff $u_i$ is defined as the difference between the reward and cost, so that
$$u_i(x_1, x_2) = \frac{r_i x_i}{r_1 x_1 + r_2 x_2} - x_i.$$
We assume that the abilities of the players differ, meaning that $r_1 \geq r_2$. Without loss of generality, $r_2$ is set at 1, and we take $r_1 \equiv r \geq 1$. When $r = 1$, the abilities of the two players are identical, while $r > 1$ means that player 1 is superior to 2. This game is a continuous version of the hawk-dove game [26,27]. In addition, this continuous game was recently applied to hierarchical competition [28]. In this resource competition game, each player's B-response is given (see Fig. 1-A) by
$$f^B_1(x_2) = \frac{\sqrt{r x_2} - x_2}{r}, \quad f^B_2(x_1) = \sqrt{r x_1} - r x_1.$$
From Eq. 1, we get the set of actions at the BB (Nash) equilibrium as the crossing of the strategy functions.
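The B-responses and their crossing can be checked numerically. The sketch below is illustrative only: it assumes the payoff $u_i = r_i x_i/(r_1 x_1 + r_2 x_2) - x_i$ with $r_2 = 1$, which is the form consistent with the crossing-point equations given for this game in Section IV, and all function names are hypothetical. The closed-form B-responses follow from setting $\partial u_i/\partial x_i = 0$ under that assumption.

```python
import math

def payoffs(x1, x2, r):
    """Return (u1, u2): each player's share of the unit resource minus the cost paid."""
    total = r * x1 + x2
    if total == 0.0:
        return 0.0, 0.0  # convention assumed here: no cost paid, no resource gained
    return r * x1 / total - x1, x2 / total - x2

def b_response_1(x2, r):
    """Player 1's B-response: the x1 >= 0 maximizing u1 with x2 held fixed.
    The degenerate input x2 = 0 simply returns 0."""
    return max(0.0, (math.sqrt(r * x2) - x2) / r)

def b_response_2(x1, r):
    """Player 2's B-response: the x2 >= 0 maximizing u2 with x1 held fixed."""
    return max(0.0, math.sqrt(r * x1) - r * x1)

def bb_equilibrium(r, x0=0.1, steps=200):
    """Crossing point of the two B-responses, found by simultaneous iteration."""
    x1, x2 = x0, x0
    for _ in range(steps):
        x1, x2 = b_response_1(x2, r), b_response_2(x1, r)
    return x1, x2

r = 1.5  # hypothetical ability ratio, player 1 superior
x1_bb, x2_bb = bb_equilibrium(r)
u1_bb, u2_bb = payoffs(x1_bb, x2_bb, r)
print(x1_bb, x2_bb, u1_bb, u2_bb)
```

Under the assumed payoff, solving the two B-responses simultaneously gives $x_1 = x_2 = r/(1+r)^2$, so the iteration can be compared against that value; the iteration itself is only one convenient way to locate the crossing, not a model of the players' learning.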
Next, we consider the case in which only player 1 learns 2's B-response. Player 1's L-response is given (see Fig. 1-B) by $f^L_1(x_2) = \arg\max_{x_1} u_1(x_1, f^B_2(x_1))$. Then, the LB equilibrium is different from the BB equilibrium (compare Fig. 1-B with A).
We now study how these two are different (see the Supplementary Data for the detailed calculation). For $r > 1$, we get
$$x^{LB}_1 > x^{BB}_1, \quad u^{LB}_1 > u^{BB}_1, \quad x^{LB}_2 < x^{BB}_2, \quad u^{LB}_2 < u^{BB}_2.$$
These inequalities indicate that owing to the superior player's one-way learning, he/she increases both his/her cost and his/her payoff, while the other player decreases both his/her cost and his/her payoff.
On the contrary, when player 2 one-sidedly learns 1's B-response, 2's L-response is given (see Fig. 1) by $f^L_2(x_1) = \arg\max_{x_2} u_2(f^B_1(x_2), x_2)$. In this case, we get
$$x^{BL}_1 < x^{BB}_1, \quad x^{BL}_2 < x^{BB}_2, \quad u^{BL}_1 > u^{BB}_1, \quad u^{BL}_2 > u^{BB}_2,$$
as shown in Fig. 1 (see the Supplementary Data for the detailed calculation). Hence, both players decrease their costs and increase their payoffs owing to the one-way learning of the inferior player, in contrast to that of the superior player.
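The contrast between the superior and the inferior learner can be reproduced with a brute-force search. This is a minimal sketch under stated assumptions: the payoff $u_i = r_i x_i/(r_1 x_1 + r_2 x_2) - x_i$ with $r_2 = 1$, actions restricted to $[0, 1]$, a hypothetical ratio $r = 1.5$, and a plain grid search standing in for the L-response maximization. None of the function names come from the paper.

```python
import math

def payoffs(x1, x2, r):
    """Return (u1, u2) for the assumed resource competition payoff."""
    total = r * x1 + x2
    if total == 0.0:
        return 0.0, 0.0  # convention assumed here: no cost paid, no resource gained
    return r * x1 / total - x1, x2 / total - x2

def b1(x2, r):
    """Player 1's B-response to a fixed x2."""
    return max(0.0, (math.sqrt(r * x2) - x2) / r)

def b2(x1, r):
    """Player 2's B-response to a fixed x1."""
    return max(0.0, math.sqrt(r * x1) - r * x1)

def argmax_on_grid(objective, lo=0.0, hi=1.0, n=20001):
    """Return the grid point in [lo, hi] with the largest objective value."""
    best_x, best_val = lo, objective(lo)
    for k in range(1, n):
        x = lo + (hi - lo) * k / (n - 1)
        val = objective(x)
        if val > best_val:
            best_x, best_val = x, val
    return best_x

def bb_equilibrium(r):
    """Crossing of both B-responses (closed form under the assumed payoff)."""
    x = r / (1.0 + r) ** 2
    return x, x

def lb_equilibrium(r):
    """Player 1 maximizes u1 knowing that player 2 will B-respond."""
    x1 = argmax_on_grid(lambda x: payoffs(x, b2(x, r), r)[0])
    return x1, b2(x1, r)

def bl_equilibrium(r):
    """Player 2 maximizes u2 knowing that player 1 will B-respond."""
    x2 = argmax_on_grid(lambda x: payoffs(b1(x, r), x, r)[1])
    return b1(x2, r), x2

r = 1.5
bb, lb, bl = bb_equilibrium(r), lb_equilibrium(r), bl_equilibrium(r)
for name, (x1, x2) in (("BB", bb), ("LB", lb), ("BL", bl)):
    print(name, round(x1, 4), round(x2, 4), [round(u, 4) for u in payoffs(x1, x2, r)])
```

With these assumptions, the LB row should show a higher cost and payoff for player 1 and a lower cost and payoff for player 2 than the BB row, while the BL row should show lower costs and higher payoffs for both players. A grid search is used only to keep the sketch dependency-free; any one-dimensional optimizer would serve.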
The LB and BL equilibria correspond to the classical Stackelberg ones [20,22,23]. In particular, LB indicates an equilibrium for a situation in which player 1 first determines his/her action and player 2 follows it given the information on 1's action. BL indicates the converse situation. Note again that we focus not only on the crossing equilibrium but also on which pair of strategy functions is achieved in the equilibrium. Thus, we here call the Stackelberg equilibrium the LB or BL equilibrium in the same way that we call the Nash equilibrium the BB equilibrium.
The superior player's one-way learning results in exploitation by gaining more benefit by increasing its own cost, while the inferior player's learning results in cooperation by decreasing its own cost. This is interpreted as follows. First, the cost a player should pay depends on the other's cost. A player would not need to pay so much when the other's cost is too small because the player would monopolize most resources by paying not so much cost. On the contrary, if the other's cost is too large, the player would not pay much cost either because one should pay too much cost to obtain more resources. Therefore, a player's optimal cost is maximal when the other pays a moderate cost (see Fig. 1-A). Second, in the BB equilibrium, both players pay a moderate cost.
Hence, no matter whether the learning player is superior or not, the player has to repress the other's cost to gain more benefits. How to repress the other's cost, however, depends on whether the player is superior or not. The superior player increases his/her cost and forces the inferior one to give up competition. Therefore, the former exploits the latter by learning the other's strategy.
On the contrary, to gain a higher payoff, the inferior player decreases his/her cost and relaxes the competition. Therefore, the learning player cooperates with the learned one.
How the learning and learned players' payoffs change depends on the type of game. Below, we discuss an alternative example, namely a duopoly game, in which competition always persists to the point that an increase in the payoff of the learning player always decreases the other's payoff.

B. Example 2: duopoly game
In a duopoly game, two companies i ∈ {1, 2}, which separately supply products, compete for a limited market. The more products supplied in this limited market, the cheaper their prices are.
Here, player $i$'s action $x_i$ is the number of products he/she supplies. We assume that the price decreases with the total supply $x_1 + x_2$, where $p$ represents the maximal price. In addition, player $i$'s cost of supplying products is assumed to be $c_i$. Thus, player $i$'s payoff is given by his/her revenue minus this cost. Here, we assume $c_1 \leq c_2$ without loss of generality. In other words, player 1 is superior to 2.
The strategy functions in the BB, LB, and BL equilibria are plotted in the accompanying figures. Although both the resource competition and the duopoly games are categorized as competitive ($\partial u_1/\partial x_2 < 0$, $\partial u_2/\partial x_1 < 0$), the change in the learned player's payoff differs between the two. As has already been explained, this difference depends on whether the learning side becomes more or less competitive through one-way learning (see Eq. 8). Each player's motivation to change his/her competitiveness is now discussed based on the following dynamic process of learning.
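The duopoly case can be illustrated with a textbook Cournot model. The price and cost forms are not recoverable from the text above, so this sketch assumes a linear price $p - x_1 - x_2$ and a constant per-unit cost $c_i$; both are assumptions made here, as are the function names and the numerical values. The leader quantity is obtained by maximizing the leader's payoff with the follower's B-response substituted in, and interior solutions are assumed throughout.

```python
def payoff(x_own, x_other, p, c_own):
    """Assumed payoff: (price - unit cost) * quantity with price p - x_own - x_other."""
    return (p - x_own - x_other - c_own) * x_own

def b_response(x_other, p, c_own):
    """B-response under the assumed linear price: argmax of payoff over x_own >= 0."""
    return max(0.0, (p - c_own - x_other) / 2.0)

def bb_equilibrium(p, c1, c2):
    """Crossing of the two B-responses (Nash equilibrium), interior case."""
    return (p - 2.0 * c1 + c2) / 3.0, (p - 2.0 * c2 + c1) / 3.0

def leader_follower(p, c_lead, c_follow):
    """Stackelberg quantities: the leader makes the L-response, the follower the B-response."""
    x_lead = (p + c_follow - 2.0 * c_lead) / 2.0
    return x_lead, b_response(x_lead, p, c_follow)

p, c1, c2 = 10.0, 1.0, 2.0  # hypothetical values with c1 <= c2

x1_bb, x2_bb = bb_equilibrium(p, c1, c2)
x1_lb, x2_lb = leader_follower(p, c1, c2)   # LB: player 1 learns
x2_bl, x1_bl = leader_follower(p, c2, c1)   # BL: player 2 learns

u_bb = (payoff(x1_bb, x2_bb, p, c1), payoff(x2_bb, x1_bb, p, c2))
u_lb = (payoff(x1_lb, x2_lb, p, c1), payoff(x2_lb, x1_lb, p, c2))
u_bl = (payoff(x1_bl, x2_bl, p, c1), payoff(x2_bl, x1_bl, p, c2))
print(u_bb, u_lb, u_bl)
```

In this assumed model the learning player supplies more and gains, and the learned player loses, whichever of the two is the learner. That is the sense in which competition persists in the duopoly game, in contrast to the resource competition game.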

IV. FUNCTIONAL DYNAMICS OF STRATEGIES
So far, we first considered the BB equilibrium in which both players have no recognition of the other's intention. Then, we introduced the LB and BL equilibria in which a player one-sidedly recognizes the intention of the other. When one recognizes the other but not vice versa, the recognizing player has an advantage. However, such one-way recognition rarely appears because both players usually try to recognize each other's intention. Another problem with the L-response is that it assumes that one player knows the other's intention perfectly in a one-shot game, while players usually shape the image of the other successively through the iteration of games. If both players recognize the other's intention gradually, the LB or BL (Stackelberg) equilibrium is no longer achieved. Instead, a set of actions that has not been discussed in previous studies can be achieved, as shown below.
To represent such a gradual recognizing process, we assume that each player $i$ learns the other's strategy at a rate of $\epsilon_i$. In this case, each player's strategy function, $f_1(x_2)$ and $f_2(x_1)$, changes depending on the other's according to the functional dynamics of Eq. 49. For $\epsilon_1 = 0$, $f_1$ in the one-shot game corresponds to the B-response (see Eq. 2); for $\epsilon_1 = 1$, $f_1$ corresponds to the L-response (see Eq. 34). In addition, when at least one player makes the B-response, both players' strategies in the equilibrium are given as fixed functions, as already mentioned. In the present case with $\epsilon_1, \epsilon_2 > 0$, however, it is necessary to consider the functional dynamics, where both players change their strategy functions by learning the other's strategy function. Therefore, we add the time variable $t$.
Eq. 49 represents the functional dynamics [29][30][31], where the change in time depends on the function rather than the dynamic systems of state variables of a finite dimension (for example, in dynamical-systems game [32]). Hence, we need to solve the dynamics of infinite dimensions.
We now analyze the equilibrium state of Eq. 49. In the following, we assume that there exists a pair of fixed-point functions as an equilibrium state of the functional dynamics, which is denoted by $(f^*_1(x_2), f^*_2(x_1))$. As demonstrated later numerically, fixed-point functions are reached in various games. To study the behavior near the equilibrium, we derive a crossing point of the equilibrium functions and its neighborhood. By assuming the continuity of the functions around the crossing point, we expand the equilibrium functions as
$$f^*_1(x_2) \simeq x^{eq*}_1 + a^*_1 (x_2 - x^{eq*}_2), \quad f^*_2(x_1) \simeq x^{eq*}_2 + a^*_2 (x_1 - x^{eq*}_1). \tag{16}$$
By substituting Eq. 16 into Eq. 15, we get the first-order term (Eq. 17) and then the second-order term (Eq. 18). Here, Eq. 17 indicates that the crossing point $(x^{eq*}_1, x^{eq*}_2)$ satisfies the optimization condition for the other's function, while Eq. 18 is a consequence of the fixed-point functions, indicating that the set of equilibrium functions $(f^*_1(x_2), f^*_2(x_1))$ satisfies the optimization condition in the neighborhood of the crossing point. To compute the players' equilibrium payoffs given by $(x^{eq*}_1, x^{eq*}_2)$, the above calculation of $f^*_1$ and $f^*_2$ is thus sufficient.
As an extreme case, we consider $\epsilon_i = 1$, in which both players perfectly recognize the other's intention. The fixed point in this case is the LL equilibrium. Here, both players make L-responses, which are constant with respect to the other's action. Therefore, from Eq. 34, the achieved actions at the LL equilibrium correspond to those at BB (i.e., the Nash equilibrium), namely $x^{LL}_i = x^{BB}_i$ (see the upper right of Fig. 3). Note that the equilibrium points are identical, whereas the strategy functions that achieve them differ. Indeed, from Eqs. 17 and 18, we can confirm that $x^{eq*}_i$ is equal for LL and BB, while $a^*_i$ is not (see the Supplementary Data for the detailed calculation). Owing to this difference in $a^*_i$, the function equilibrium holds in LL, but not in BB: in the functional dynamics with $\epsilon_i = 1$ (LL), both players' strategy functions are constant, while for $\epsilon_i = 0$ (BB), they are not.
It may be disappointing that the classical Nash equilibrium is achieved as the LL equilibrium in the game between the learning players. As explained later, however, this LL equilibrium is rarely achieved; indeed, in many cases, novel equilibria are achieved according to the functional dynamics.
Below, we discuss some specific examples for the functional dynamics.

A. Example 1: resource competition game
We again consider the resource competition game. From Eqs. 17 and 18, we get the set of equilibrium actions $(x^{eq*}_1, x^{eq*}_2)$ and the set of equilibrium gradients around them $(a^*_1, a^*_2)$. The equilibrium actions satisfy
$$x^{eq*}_1 = \frac{1}{r + \epsilon_1 a^*_2}\left[\sqrt{r\,(x^{eq*}_2 - \epsilon_1 a^*_2 x^{eq*}_1)} - (x^{eq*}_2 - \epsilon_1 a^*_2 x^{eq*}_1)\right],$$
$$x^{eq*}_2 = \frac{1}{1 + r \epsilon_2 a^*_1}\left[\sqrt{r\,(x^{eq*}_1 - \epsilon_2 a^*_1 x^{eq*}_2)} - r\,(x^{eq*}_1 - \epsilon_2 a^*_1 x^{eq*}_2)\right],$$
together with the corresponding conditions on $(a^*_1, a^*_2)$ from Eq. 18. We now simulate Eq. 49 and compare the simulation results with the calculation, confirming that both players' strategy functions immediately converge to fixed ones (see Fig. 4). The crossing points $(x^{eq*}_1, x^{eq*}_2)$ of these functions agree well with the above analytic estimation, and the converged strategy functions in the neighborhood of the crossing points are well estimated by Eq. 16 with the above values $a^*_1$ and $a^*_2$. In addition, the dependence of each player's action on the other's action is calculated from the fixed-point function $f$, as shown in Fig. 3. This indicates that the more (less) each player learns the other's strategy, the less (more) dependent the strategy function is on the other's action.
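The two crossing-point equations can be solved in closed form once the effective slopes are treated as given. The sketch below is a derivation made here, not taken from the paper: it writes $s_1 = \epsilon_1 a^*_2$ and $s_2 = \epsilon_2 a^*_1$ (hypothetical shorthand), uses the fact that both equations imply $(r x_1 + x_2)^2 = r(x_2 - s_1 x_1) = r(x_1 - s_2 x_2)$, and then checks the result against the equations as stated. The gradients $a^*_i$ themselves are not computed, since that would require Eq. 18.

```python
import math

def crossing_point(r, s1, s2):
    """Solve the two crossing-point equations for given effective slopes.

    s1 = eps1 * a2_star and s2 = eps2 * a1_star are treated as known inputs.
    Assumes 1 + s2 != 0 and a positive solution (k - s1 > 0).
    """
    k = (1.0 + s1) / (1.0 + s2)          # ratio x2 / x1 implied by equating both conditions
    x1 = r * (k - s1) / (r + k) ** 2
    return x1, k * x1

def residuals(r, s1, s2, x1, x2):
    """Left-hand side minus right-hand side of the two crossing-point equations."""
    z1 = x2 - s1 * x1
    z2 = x1 - s2 * x2
    res1 = x1 - (math.sqrt(r * z1) - z1) / (r + s1)
    res2 = x2 - (math.sqrt(r * z2) - r * z2) / (1.0 + r * s2)
    return res1, res2

r = 1.5  # hypothetical ability ratio
print(crossing_point(r, 0.0, 0.0))      # no learning: B-responses on both sides
print(crossing_point(r, 1.0 - r, 0.0))  # one-sided learning by player 1
print(crossing_point(r, -0.2, 0.1))     # arbitrary illustrative slopes
```

With $s_1 = s_2 = 0$ the formula reduces to $x_1 = x_2 = r/(1+r)^2$, the crossing of the two B-responses. Setting $\epsilon_1 = 1$, $\epsilon_2 = 0$ and taking $a^*_2$ as the slope of player 2's B-response, which evaluates to $1 - r$ at $x_1 = r/4$ under the payoff assumed here, gives $(r/4,\ r(2-r)/4)$; this interior value is meaningful only for $1 \leq r \leq 2$.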

B. Example 2: duopoly game
As in the resource competition game, both players' strategy functions converge to fixed functions for any $\epsilon_1$ and $\epsilon_2$. Here, recall that the learning side exploits the learned side regardless of whether the former is superior or not. This result can be applied to the case with continuous learning degrees $\epsilon_1$ and $\epsilon_2$. The larger $\epsilon_1$ is, the larger (smaller) 1's (2's) payoff is, with larger exploitation (see the Supplementary Data for the details).

V. DYNAMICS OF THE DEGREE OF LEARNING
Thus far, the learning degree $\epsilon_i$ has been given and fixed. Thus, for each player, the case with $\epsilon_i = 1$ would be the better one for receiving a higher payoff. Each player, however, can change the degree to which he/she learns the other's strategy. Initially, each player may not care about the other, and he/she learns the other's strategy more through the repeated game. In the following, we consider this temporal evolution in the degrees of learning, $\epsilon_1$ and $\epsilon_2$. Here, assuming that the other's strategy function is fixed, each player tries to increase his/her payoff by changing his/her learning degree. Therefore, the dynamics of both players' learning degrees are given by
$$\dot{\epsilon}_1 = S_1 \frac{\partial u_1}{\partial \epsilon_1}, \quad \dot{\epsilon}_2 = S_2 \frac{\partial u_2}{\partial \epsilon_2},$$
where $S_1$ and $S_2$ are the speeds with which each player optimizes the intensity of recognition. In the following, we simulate these dynamics for the introduced example and examine what equilibrium is reached.
A. Example 1: resource competition game
Fig. 5 shows the dynamics of $\epsilon_1$ and $\epsilon_2$ for various sets of learning speeds $(S_1, S_2)$, while the temporal evolution in the payoff of each player according to Eq. 50 is shown in Fig. 6.
First, the initial BB (Nash) equilibrium is unstable under the learning dynamics. In other words, both players are motivated to learn the other's strategy and to change their strategies accordingly because the payoff has a nonzero gradient at $(x_1, x_2)$ around the BB equilibrium point.
Second, as one player's learning is superior, the other's learning is repressed. During the evolution of learning, one player learns the other's gradient of the strategy function, and his/her strategy

VI. SUMMARY AND DISCUSSION
In this study, we introduce a new formulation for the mutual recognition of intention, which is represented as functional dynamics. In the formulation, every player can read the other's strategy function f (x), which determines the action to be chosen for the other's action x.
As a result, we proved that both players can increase their payoffs according to their learning.
The more a player learns the other's strategy, the less his/her action depends on the other's action (i.e., the function approaches a constant function). Since such a constant function does not provide any motivation to learn, the process of mutual learning stops when one player perfectly learns the other's strategy function.
The resultant function equilibrium includes and extends two kinds of well-known equilibria: the Nash and Stackelberg equilibria. This finding may lead to an understanding of how the leader-follower relationship is formed in game theory according to intention recognition.
Furthermore, we also confirm that each player's payoff changes according to intention recognition in the resource competition game. The faster the superior player learns the other's strategy function, the more that player exploits the other, where the learning (learned) player's payoff increases (decreases). On the contrary, the faster the inferior player learns, the more the players cooperate, meaning that both players' payoffs increase. In the learning process, one player's positive (negative) gradient of his/her strategy function leads to a decrease (increase) in the other's competitiveness.
In this study, we only consider the two-player case. In the case of n(≥ 3) players, each player's strategy function is (n−1)-dimensional, and the function equilibrium is more complicated. Furthermore, some equilibria regarded as neither Nash nor Stackelberg equilibria emerge. For example, we can consider the cases of one-sided learning from 1 to 2, from 2 to 3, and from 3 to 1. Such a learning loop does not appear as an equilibrium in the extensive form game [21]. This will be discussed in future work.
In our formulation, we assume that each player's learning speed is independent of its accuracy.
This assumption results in a monotonic advantage for the increase in learning speed, at least in the duopoly game. In reality, however, there is a trade-off between the accuracy of reading and speed of evolution, which provides another disadvantage for the fast evolution owing to incomplete information. Indeed, some previous studies show that such incomplete information on the other's action leads to disutility [33].

VIII. GENERAL GAMES
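A. Condition for disagreement between the Nash and Stackelberg equilibria
Here, we consider a condition under which a player increases his/her payoff owing to the one-sided recognition of the other's response. In the following, we assume a game satisfying the following two points: (1) the space of each player's possible action is bounded and (2) each player's payoff is differentiable with respect to both actions. First, when the BB equilibrium lies in the interior of the action space, it satisfies
$$\left.\frac{\partial u_1}{\partial x_1}\right|_{BB} = 0, \tag{21}$$
$$\left.\frac{\partial u_2}{\partial x_2}\right|_{BB} = 0. \tag{22}$$
Next, we consider a condition satisfied by the Stackelberg (LB) equilibrium. When player 1's equilibrium action $x^{LB}_1$ is optimal given the recognition of the other's intention $f^B_2$, we get
$$\left.\left(\frac{\partial u_1}{\partial x_1} + \frac{\partial u_1}{\partial x_2}\frac{\partial f^B_2}{\partial x_1}\right)\right|_{LB} = 0. \tag{23}$$
Here, $\partial f^B_2/\partial x_1$ is given by
$$\frac{\partial f^B_2}{\partial x_1} = -\frac{\partial^2 u_2/\partial x_1 \partial x_2}{\partial^2 u_2/\partial x_2^2}. \tag{24}$$
By substituting Eq. 24 into Eq. 23, we obtain a condition for the LB equilibrium as
$$\left.\left(\frac{\partial u_1}{\partial x_1} - \frac{\partial u_1}{\partial x_2}\,\frac{\partial^2 u_2/\partial x_1 \partial x_2}{\partial^2 u_2/\partial x_2^2}\right)\right|_{LB} = 0. \tag{25}$$
In addition, since player 2 makes a B-response, another condition for the LB equilibrium is given by
$$\left.\frac{\partial u_2}{\partial x_2}\right|_{LB} = 0. \tag{26}$$
Let us now compare the above conditions (Eqs. 25 and 26) for the LB equilibrium with those for the BB equilibrium (Eqs. 21, 22). First, the second condition is common between the two. Next, the first condition is different as long as the second term in Eq. 25 is nonzero. Hence, the condition for the mismatch between LB and BB is given by
$$\left.\frac{\partial u_1}{\partial x_2}\,\frac{\partial^2 u_2}{\partial x_1 \partial x_2}\right|_{BB} \neq 0.$$
The condition for disagreement between BL and BB is given in the same way. In examples 1 and 2, we discuss this condition concretely.
Second, we also consider a case in which the Nash and Stackelberg equilibria exist on the border of the space of the players' actions. In particular, we consider a situation in which the LB and BB equilibria are on the border of player 2's action; in other words, $x^{LB}_2 = x^{BB}_2 = \inf x_2$ or $\sup x_2$ holds. In this case, the recognizer's payoff satisfies
$$\begin{aligned} u^{LB}_1 &= u_1(x^{LB}_1, x^{LB}_2) \\ &= u_1(x^{LB}_1, x^{BB}_2) \\ &\leq u_1(x^{BB}_1, x^{BB}_2) = u^{BB}_1. \end{aligned}$$
Here, from the assumption that the LB equilibrium exists on the border of player 2's action, we can derive the second line from the first one (see Fig. S1). In addition, as shown in the main manuscript, player 1 suffers no disadvantage by recognizing the other's B-response; in other words, $u^{LB}_1 \geq u^{BB}_1$ holds. Thus, we get $u^{LB}_1 = u^{BB}_1$ and then prove $x^{LB}_1 = x^{BB}_1$. Putting this in other terms, $x^{LB}_1 \neq x^{BB}_1$ and $x^{LB}_2 = x^{BB}_2 = \inf x_2, \sup x_2$ are incompatible. In example 3, we see a concrete illustration of the overlap between the LB and BB equilibria.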
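The interior mismatch condition can be evaluated numerically for a concrete game. The sketch below is illustrative and rests on assumptions: it takes the mismatch indicator to be the product of $\partial u_1/\partial x_2$ and $\partial^2 u_2/\partial x_1 \partial x_2$ at the BB point, uses the resource competition payoff $u_1 = r x_1/(r x_1 + x_2) - x_1$, $u_2 = x_2/(r x_1 + x_2) - x_2$ as an assumed example, and approximates the derivatives by central finite differences. The function names and step size are hypothetical.

```python
def u1(x1, x2, r):
    """Assumed payoff of player 1 in the resource competition game."""
    return r * x1 / (r * x1 + x2) - x1

def u2(x1, x2, r):
    """Assumed payoff of player 2 in the resource competition game."""
    return x2 / (r * x1 + x2) - x2

def mismatch_indicator(r, h=1e-4):
    """Finite-difference estimate of (du1/dx2) * (d^2 u2 / dx1 dx2) at the BB point."""
    x = r / (1.0 + r) ** 2  # BB actions x1 = x2 under the assumed payoff
    du1_dx2 = (u1(x, x + h, r) - u1(x, x - h, r)) / (2.0 * h)
    d2u2_dx1dx2 = (
        u2(x + h, x + h, r) - u2(x + h, x - h, r)
        - u2(x - h, x + h, r) + u2(x - h, x - h, r)
    ) / (4.0 * h * h)
    return du1_dx2 * d2u2_dx1dx2

for r in (1.0, 1.5, 2.0):
    print(r, mismatch_indicator(r))
```

For the assumed payoff the product works out analytically to $(r^2 - 1)/r$, so it vanishes for equal abilities ($r = 1$) and is nonzero for $r > 1$. This agrees with the statement in the example section that LB coincides with BB for $r = 1$ and not for $r > 1$.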

B. Solution of the functional dynamics
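Here, we derive the solution of the functional dynamics introduced in the main text. First, we expand the equilibrium functions around the crossing points as
$$f^*_1(x_2) \simeq x^{eq*}_1 + a^*_1 (x_2 - x^{eq*}_2), \quad f^*_2(x_1) \simeq x^{eq*}_2 + a^*_2 (x_1 - x^{eq*}_1).$$
In the following, we calculate the solution of player 1's function $f^*_1(x_2)$ in $x_2 \simeq x^{eq*}_2$. Thus, we obtain the first-order term as
$$\left.\left(\frac{\partial u_1}{\partial x_1} + \epsilon_1 a^*_2 \frac{\partial u_1}{\partial x_2}\right)\right|_{eq*} = 0.$$
Here, $|_{eq*}$ denotes evaluation at $(x_1, x_2) = (x^{eq*}_1, x^{eq*}_2)$. Then, we also obtain the second-order term as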

IX. EXAMPLE 1: RESOURCE COMPETITION GAME
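In this game, both players' payoffs are defined by
$$u_1 = \frac{r x_1}{r x_1 + x_2} - x_1, \quad u_2 = \frac{x_2}{r x_1 + x_2} - x_2.$$

A. BB equilibrium
Each player's B-response is calculated as follows:
$$f^B_1(x_2) = \frac{\sqrt{r x_2} - x_2}{r}, \quad f^B_2(x_1) = \sqrt{r x_1} - r x_1.$$
Then, the crossing actions at the BB equilibrium are given as
$$x^{BB}_1 = x^{BB}_2 = \frac{r}{(1 + r)^2}.$$
From Eq. 36, each player's payoff is obtained as
$$u^{BB}_1 = \frac{r^2}{(1 + r)^2}, \quad u^{BB}_2 = \frac{1}{(1 + r)^2}.$$

B. LB equilibrium
Player 1's L-response is given by
$$f^L_1(x_2) = \arg\max_{x_1} u_1(x_1, f^B_2(x_1)).$$
Then, we get the actions and payoffs at the crossing point. By comparing LB with BB, we can prove
$$x^{LB}_1 > x^{BB}_1, \quad u^{LB}_1 > u^{BB}_1, \quad x^{LB}_2 < x^{BB}_2, \quad u^{LB}_2 < u^{BB}_2$$
for all $r > 1$.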

C. BL equilibrium
Player 2's L-response is given by
$$f^L_2(x_1) = \arg\max_{x_2} u_2(f^B_1(x_2), x_2).$$
Then, we get the actions and payoffs at the crossing point. This equation shows that LB coincides with BB in the case of $r = 1$ and not in the case of $r > 1$.
We discuss the overlap between BL and BB in the same way.
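For reference, the one-sided equilibria of this game can be written in closed form. The expressions below are derived here and are not the authors' displayed formulas; they assume the payoff $u_1 = r x_1/(r x_1 + x_2) - x_1$, $u_2 = x_2/(r x_1 + x_2) - x_2$ and interior solutions, which for LB restricts the ability ratio to $1 \leq r \leq 2$.

```latex
\begin{align}
\text{LB:}\quad
 & u_1\bigl(x_1, f^B_2(x_1)\bigr) = \sqrt{r x_1} - x_1
   \;\Rightarrow\;
   x^{LB}_1 = \frac{r}{4},\quad
   x^{LB}_2 = \frac{r(2 - r)}{4}, \\
 & u^{LB}_1 = \frac{r}{4},\quad
   u^{LB}_2 = \frac{(2 - r)^2}{4}, \\
\text{BL:}\quad
 & u_2\bigl(f^B_1(x_2), x_2\bigr) = \sqrt{x_2 / r} - x_2
   \;\Rightarrow\;
   x^{BL}_2 = \frac{1}{4r},\quad
   x^{BL}_1 = \frac{2r - 1}{4 r^2}, \\
 & u^{BL}_2 = \frac{1}{4r},\quad
   u^{BL}_1 = \frac{(2r - 1)^2}{4 r^2}.
\end{align}
```

At $r = 1$ both sets of values reduce to actions of $1/4$ and payoffs of $1/4$, which coincide with the BB values $r/(1+r)^2$, $r^2/(1+r)^2$, and $1/(1+r)^2$ evaluated at $r = 1$. For $r > 1$ the comparisons with BB reduce to elementary inequalities such as $(1 + r)^2 > 4$ and $(1 - r)^2 > 0$.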

F. Simulation of the learning process
Here, we consider a process to change the degree of recognition $\epsilon_i$ to increase the first agent's payoff under the other's intention. Fig. S4 shows how both players' recognition degrees change for some $S_1/S_2$. In addition, the finally achieved payoffs are plotted in Fig. S5. The red (yellow) broken lines indicate player 1's (2's) payoff at the LB (right), LL (middle), and BL (left) equilibria for reference, respectively.

XI. EXAMPLE 3: PROBABILISTIC PRISONER'S DILEMMA GAME
Our third example is a prisoner's dilemma game in which each of the players chooses whether to cooperate or defect. In general, each player's payoff is given by $R$ when both cooperate, $P$ when both defect, and $T$ ($S$) when he/she defects (cooperates) while the other cooperates (defects). Here, both $T > R > P > S$ and $2R > T + S$ are required. Then, we consider a situation in which each player determines the probability to cooperate, $x_1$ or $x_2$, as his/her action.
First, each player's B-response is given by
$$f^B_1(x_2) = 0, \quad f^B_2(x_1) = 0. \tag{60}$$
Eq. 60 indicates that every player is better off choosing defection regardless of the opponent's cooperativeness. In other words, both B-responses are constant with respect to the opponent's action. Then, we get the following actions and payoffs at the BB equilibrium: $(x^{BB}_1, x^{BB}_2, u^{BB}_1, u^{BB}_2) = (0, 0, P, P)$.
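The constancy of the B-response can be checked directly. This is a minimal sketch under stated assumptions: the expected payoff is taken to be bilinear in the cooperation probabilities (each player cooperates independently with probability $x_i$), the payoff values $T = 5$, $R = 3$, $P = 1$, $S = 0$ are hypothetical, and a coarse grid search stands in for the maximization. The function names are not from the paper.

```python
def pd_payoff_1(x1, x2, T, R, P, S):
    """Expected payoff of player 1 when the players cooperate with probabilities x1 and x2."""
    return (R * x1 * x2 + S * x1 * (1.0 - x2)
            + T * (1.0 - x1) * x2 + P * (1.0 - x1) * (1.0 - x2))

def pd_payoff_2(x1, x2, T, R, P, S):
    """Expected payoff of player 2, obtained by exchanging the players' roles."""
    return pd_payoff_1(x2, x1, T, R, P, S)

def b_response_1(x2, T, R, P, S, n=101):
    """Player 1's B-response: the cooperation probability on a grid that maximizes u1."""
    grid = [k / (n - 1) for k in range(n)]
    return max(grid, key=lambda x1: pd_payoff_1(x1, x2, T, R, P, S))

T, R, P, S = 5.0, 3.0, 1.0, 0.0  # hypothetical values with T > R > P > S and 2R > T + S

for x2 in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(x2, b_response_1(x2, T, R, P, S))
```

Because $\partial u_1/\partial x_1 = (R - T)x_2 + (S - P)(1 - x_2)$ is negative for every $x_2$ whenever $T > R$ and $P > S$, the maximizer stays at full defection whatever the opponent does, and the crossing of the two B-responses yields the payoff $P$ for both players.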
Player 1's L-response is given by