Strategic Experimentation with Asymmetric Safe Options

We study a two-player game of strategic experimentation with exponential bandits à la Keller, Rady and Cripps (2005) in which the safe-arm payoff differs across players. We show that, as in Das, Klein and Schmid (2020), there exists an equilibrium in cut-off strategies if and only if the difference in safe-arm payoffs is large enough. In the equilibrium in cutoff strategies, the player with the higher safe-arm payoff conducts less experimentation. This feature of the equilibrium offers an explanation for the fact that oftentimes technological innovations are due to startups rather than established market leaders.


Introduction
There are many situations in which information generated through exploration by one agent is helpful to other agents as well. For example, when, in 2008, Tesla launched the Roadster, the first highway-compatible all-electric car using lithium-ion battery cells, other automobile companies became aware of the prospects of lithium-ion technology; gradually, electric cars were launched by companies like Nissan, Citroën and BMW. In the economics literature, such games with informational externalities have been analysed using models of strategic experimentation with bandits. Most of this literature, however, has been concerned with homogeneous players. Yet car manufacturers clearly differ widely in terms of market shares. For example, when Tesla was founded in 2003, non-electric automobile markets were dominated by companies like General Motors and Toyota. In terms of a two-armed bandit model, this means the safe-arm payoffs were different across players. In this article, we analyse the impact of asymmetric safe-arm options in a game of strategic experimentation with two-armed exponential bandits.
The strategic exponential-bandit model with homogeneous players was introduced in a seminal paper by Keller, Rady and Cripps (2005). Das, Klein and Schmid (2020) introduced asymmetry between the players in terms of different payoff arrival rates. In this article, we consider a variant of the two-armed bandit model in which the asymmetry pertains to the players' safe-arm payoffs. While we assume the payoff arrival rates on a good risky arm to be the same across players, each player has a different flow payoff from their respective safe arm. In all other aspects, our model is identical to the canonical strategic exponential-bandit model of Keller, Rady and Cripps (2005). The latter show that with homogeneous players, there does not exist a Markov perfect equilibrium in which both players use a cut-off strategy; in any equilibrium, players swap the roles of pioneer and free-rider at least once. By contrast, Das, Klein and Schmid (2020) show that, with sufficiently different payoff arrival rates from a good risky arm, there exists an equilibrium in which both players use cut-off strategies. This equilibrium is unique in the class of equilibria in cutoff strategies; whenever only one player free-rides while the other experiments, it is always the player with the lower arrival rate who free-rides.
In this article, we show that this feature is robust to the type of asymmetry across players. In particular, we show that, if the safe-arm flow payoffs are sufficiently different across players, there exists an equilibrium in which both players use a cut-off strategy. As in Das, Klein and Schmid (2020), this equilibrium is unique in the class of Markov perfect equilibria (MPE) in cut-off strategies. Whenever only one player experiments in equilibrium, it is the player with the lower safe-arm payoff.
In the context of our automobile example, this implies that a company with a greater market share for non-electric vehicles will put relatively less effort into innovating by, e.g., offering an electric vehicle than a new entrant with a lower safe-arm payoff. It is indeed the case that the first highway-compatible electric vehicle was, perhaps surprisingly, launched by a small startup company from California named Tesla, rather than by established automobile giants like General Motors or Toyota. There are other instances as well in which technological innovations have come from startups rather than established market leaders. For example, Christensen (2013) finds that over a period of two decades from 1976, each new generation of hard drives became smaller than the previous one, and a different set of companies dominated the market in each generation. In a recent paper, Awaya and Krishna (2021) have offered an explanation as to why, in many instances, technological innovations come from startups rather than firms that are already established in the respective industry. They consider a strategic bandit model in which players can either innovate or not. The established incumbent is better informed about the viability of the innovation. Their main result shows that the incumbent tends to innovate less because better information dulls its incentive to learn from its rival. Our equilibrium in cutoff strategies offers a simple alternative explanation for the same phenomenon. Our model has the feature that a company working towards an innovation has to divert resources from the market where it is already established. Hence, the opportunity cost of pursuing an innovation is higher for an established market leader than for a startup, so the latter will be more inclined to persevere in its pursuit of innovation.

Related Literature
The paper contributes to the literature on strategic bandit models, a problem studied widely in economics. Some of the seminal papers in this area are by Bolton and Harris (1999), Keller, Rady and Cripps (2005), Keller and Rady (2010), Klein and Rady (2011), Klein (2013) and Thomas (2021). All these papers deal with symmetric players. The paper closest to the present article is Das, Klein and Schmid (2020), where we demonstrate how an equilibrium in cut-off strategies can exist if players have different learning abilities concerning the risky arm. In the current paper, we show that the same conclusion holds if the players have different safe-arm payoffs, while their innate exploration abilities are the same.
The rest of the paper is organised as follows. Section 2 describes the two-armed bandit model with different safe-arm payoffs. Sections 3 and 4 analyse the social planner's problem and the non-cooperative game, respectively. Section 5 concludes. Our results are proved in the Appendix.

Two-armed bandit model with different safe-arm payoffs
There are two players (1 and 2), each of whom faces a two-armed bandit in continuous time. One of the arms is safe. If player i ∈ {1, 2} uses it, he gets a flow payoff of s_i, where 0 < s_1 < s_2. The risky arm can be either good or bad. Both players' risky arms are of the same quality. If the risky arm is good, a player using it receives a lump sum at the jumping times of a Poisson process with parameter λ > 0; the lump sum is drawn from a time-invariant distribution with mean h > 0. If the risky arm is bad, it never yields any payoff. Thus, a good risky arm gives both players an expected flow payoff of g = λh, where we assume that 0 < s_1 < s_2 < g. In all other aspects, the model is similar to the one in Das, Klein and Schmid (2020). The parameter values and the structure of the game are common knowledge.
Players do not initially know whether their risky arms are good or bad. They start with a common prior belief p_0 ∈ (0, 1) that their risky arms are good. Players decide in continuous time whether to choose the risky arm or the safe arm, and at each instant a player can choose only one arm. We write k_{i,t} = 1 (0) if player i ∈ {1, 2} uses the risky (safe) arm at instant t ∈ ℝ_+. Players' actions and outcomes are perfectly publicly observable. This implies that players hold a common posterior belief p_t at all times t ≥ 0. Players discount the future at the common discount rate r > 0.
Let p_t be the players' common belief at time t ≥ 0 that their risky arms are good. Since only a good risky arm can ever yield positive payoffs, the arrival of a lump sum fully reveals the risky arm to be good. In the absence of a lump-sum arrival, by Bayes' rule, the belief follows the law of motion dp_t = −(k_{1,t} + k_{2,t}) λ p_t (1 − p_t) dt. In the next section, we characterise a utilitarian planner's solution.
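As a numerical sanity check of this belief dynamic, the following sketch compares an Euler discretisation of the ODE with the closed-form posterior obtained from Bayes' rule under a constant experimentation intensity K = k_1 + k_2; all parameter values are illustrative placeholders, not taken from the paper.

```python
# Belief dynamics absent a success: dp = -K * lam * p * (1 - p) dt,
# with closed form p_t = p0 * e^{-K*lam*t} / (p0 * e^{-K*lam*t} + 1 - p0).
import math

def posterior(p0, lam, K, t):
    """Closed-form posterior after t units of experimentation by K players."""
    decay = math.exp(-K * lam * t)
    return p0 * decay / (p0 * decay + 1 - p0)

def posterior_euler(p0, lam, K, t, n=100_000):
    """Euler discretisation of dp = -K*lam*p*(1-p) dt."""
    p, dt = p0, t / n
    for _ in range(n):
        p -= K * lam * p * (1 - p) * dt
    return p

p0, lam = 0.6, 1.0
for K in (1, 2):
    assert abs(posterior(p0, lam, K, 2.0) - posterior_euler(p0, lam, K, 2.0)) < 1e-4
# Two experimenters drive the belief down faster than one.
assert posterior(p0, lam, 2, 1.0) < posterior(p0, lam, 1, 1.0) < p0
```

In log-odds the belief decays linearly at rate Kλ, so a second experimenter doubles the speed of learning; this is exactly the informational externality that invites free-riding.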

The Planner's Solution
The planner's objective is to maximise the sum of the players' expected discounted payoffs. The planner's action is denoted by the pair (k_1, k_2), where k_i = 0 (1) denotes that the planner has allocated player i to the safe (risky) arm. The planner's value function v satisfies the appertaining Bellman equation.

Proposition 1 There exist thresholds 0 < p*_1 < p*_2 < 1 such that the planner's optimal policy is (k_1, k_2) = (0, 0) for p ≤ p*_1, (1, 0) for p ∈ (p*_1, p*_2], and (1, 1) for p > p*_2. The planner's value is C^1, (strictly) increasing and convex (on (p*_1, 1)); the threshold p*_2 is implicitly defined by v^{(1,0)}(p*_2) = 2s_2.

The proof can be found in Appendix A. Next, we turn our attention to the analysis of the non-cooperative game.

The Non-Cooperative Game
In the non-cooperative game, we restrict ourselves to Markovian strategies with the common posterior belief as the state variable, defined as mappings k_i: [0, 1] → {0, 1}, where k_i(p) = 0 (1) indicates that player i ∈ {1, 2} chooses the safe (risky) arm if the common posterior belief is p ∈ [0, 1]. The Bellman equations for players 1 and 2's value functions take the same form as in Keller, Rady and Cripps (2005), with b(p, v_i) denoting the expected benefit of experimentation and c_i(p) = s_i − pg the opportunity cost of using the risky arm. We shall now determine the players' best responses, using the same method as Keller, Rady and Cripps (2005) or Das, Klein and Schmid (2020). Suppose player 2 chooses the safe arm in some neighbourhood of p ∈ (0, 1). Then, player 1 chooses the risky arm at p if and only if the belief p exceeds his single-agent threshold p̄_1 = rs_1/((r+λ)g − λs_1). By the same token, if player 1 chooses the safe arm in a neighbourhood of p, player 2 will choose the risky arm as long as the belief is greater than p̄_2 := rs_2/((r+λ)g − λs_2). Since s_1 < s_2, we have p̄_1 < p̄_2. Now, suppose that player 2 plays risky in some neighbourhood of p ∈ (0, 1). Player 1's best response is to choose the risky arm if and only if b(p, v_1) ≥ c_1(p), and hence v_1 ≥ s_1 + c_1(p), in this neighbourhood. This means that, given the other player chooses the risky arm, choosing the risky arm constitutes a best response for player 1 if and only if v_1 lies above D_1 in the (p, v)-plane, where the diagonal is defined as D_1(p) = s_1 + c_1(p). Similarly, given that player 1 chooses the risky arm, choosing the risky arm constitutes a best response for player 2 if and only if v_2 lies above D_2 in the (p, v)-plane, where D_2(p) = s_2 + c_2(p). The players' best-response diagonals are depicted in Figure 1.
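To make the threshold comparison concrete, the following sketch evaluates the single-agent cutoffs p̄_i = rs_i/((r+λ)g − λs_i) from the text at illustrative parameter values (placeholders chosen to satisfy 0 < s_1 < s_2 < g, not taken from the paper).

```python
# Single-agent cutoffs from the text: p_bar_i = r*s_i / ((r + lam)*g - lam*s_i).
def cutoff(s, r, lam, g):
    """Single-agent experimentation cutoff for safe flow payoff s."""
    return r * s / ((r + lam) * g - lam * s)

r, lam, g = 1.0, 1.0, 1.0
s1, s2 = 0.3, 0.7                    # placeholder values with 0 < s1 < s2 < g
p1, p2 = cutoff(s1, r, lam, g), cutoff(s2, r, lam, g)

assert 0 < p1 < p2 < 1               # s1 < s2 implies p_bar_1 < p_bar_2
assert p1 < s1 / g and p2 < s2 / g   # cutoffs lie below the myopic thresholds s_i/g
```

Each cutoff lies strictly below the corresponding myopic threshold s_i/g, reflecting the option value of experimentation; a higher safe payoff raises a player's cutoff, so player 2 stops experimenting at more optimistic beliefs than player 1.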
From Figure 1, it can be observed that heterogeneity in safe-arm payoffs drives a wedge between the players' best-response lines. When s_1 = s_2, D_1 and D_2 coincide; when s_2 = g, the lines are farthest apart. The next proposition shows that an equilibrium in which both players use cutoff strategies exists if and only if the players are sufficiently heterogeneous; moreover, this equilibrium is unique in the class of equilibria in which both players use cutoff strategies.
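A minimal sketch of this wedge, assuming the opportunity-cost term c_i(p) = s_i − pg of Keller, Rady and Cripps (2005), so that each diagonal takes the linear form D_i(p) = s_i + c_i(p) = 2s_i − gp (an assumption for illustration; parameter values are placeholders):

```python
# Best-response diagonals under the assumed form D_i(p) = 2*s_i - g*p.
def D(s, g, p):
    """Diagonal for a player with safe flow payoff s."""
    return 2 * s - g * p

g, s1, s2 = 1.0, 0.3, 0.7
for p in (0.1, 0.5, 0.9):
    # The vertical distance between the two lines is the constant 2*(s2 - s1):
    # it vanishes when s1 = s2 and grows as s2 approaches g.
    assert abs((D(s2, g, p) - D(s1, g, p)) - 2 * (s2 - s1)) < 1e-12
```

Under this form the wedge is a constant vertical gap of 2(s_2 − s_1), which is why the asymmetry in safe-arm payoffs translates directly into asymmetric free-riding opportunities.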
Proposition 2 There exists an s*_2 ∈ (s_1, g) such that an equilibrium in which both players use cutoff strategies exists if and only if s_2 ∈ [s*_2, g). In this equilibrium, player 1 chooses the risky arm for all beliefs greater than p̄_1 and the safe arm otherwise. There exists a unique p'_2 ∈ (p̄_1, 1) such that player 2 chooses the risky arm for beliefs greater than p'_2 and the safe arm otherwise. This equilibrium is unique in the class of equilibria in which both players use cutoff strategies.
The formal proof of this proposition is given in Appendix B. It can be seen from Figure 2 that, as the players' safe-arm payoffs diverge, the range of beliefs over which only player 2 can free-ride expands. Just as in Das, Klein and Schmid (2020), this difference in free-riding opportunities allows for the existence of an equilibrium in cutoff strategies. The equilibrium is depicted in Figure 2. It exists only if the curve v_1 intersects D_1 at a belief lower than the one at which the curve v_2 intersects D_2, which is possible only if the players are sufficiently heterogeneous.
Thus, as in Das, Klein and Schmid (2020), sufficient heterogeneity, and hence sufficiently different free-riding opportunities, allow for the existence of a cutoff equilibrium.A larger difference in players' safe-arm payoffs reduces the free-riding opportunities of player 1, while expanding those of player 2.

Conclusion
In this article, we have considered a two-armed strategic bandit model in which players face different opportunity costs of experimentation in terms of safe-arm payoffs but an identical rate of learning on the risky arm. We show that, as in Das, Klein and Schmid (2020), there exists a Markov perfect equilibrium in cut-off strategies if and only if the players are sufficiently asymmetric, i.e., the difference in safe-arm payoffs is large enough. Hence, the conclusion that a sufficient degree of asymmetry across players is necessary and sufficient for the existence of an equilibrium in cutoff strategies continues to apply when the asymmetry pertains to the opportunity costs, rather than the yield, of information acquisition via the risky arm. Furthermore, our equilibrium in cut-off strategies offers a simple explanation for the observation that, in many instances, innovation comes from start-ups rather than established companies, the idea being that start-ups have less to lose by focussing their resources and energy on the pursuit of an innovation.

A Proof of Proposition 1
where C_10 > 0 is a constant of integration, determined by value matching at p = p*_1; direct computation verifies that smooth pasting obtains as well, i.e., that v^{(1,0)′}(p*_1) = 0. As C_10 > 0, v^{(1,0)} is strictly convex and hence strictly increasing. Moreover, v^{(1,0)}(1) > 2s_2 > s_1 + s_2 = v^{(1,0)}(p*_1). Thus, p*_2 ∈ (p*_1, 1) is well-defined. For p > p*_2, both players use the risky arm according to the conjectured solution. Hence, the planner's value v(p) satisfies the appertaining ODE, whose solution involves a constant of integration C_11 determined by value matching at p*_2. It is immediate to verify from the appertaining ODEs that smooth pasting obtains at p*_2, and that therefore v is strictly increasing and convex on (p*_2, 1) as well.

B Proof of Proposition 2
The same argument as in Das, Klein and Schmid (2020), Proposition 2, establishes that neither player will play risky on [0, p̄_1] in any equilibrium.
In any equilibrium, for beliefs just above p̄_1, only player 1 can experiment with the risky arm, as p̄_1 < p̄_2 and the point (p̄_1, s_i) lies below both diagonals D_1 and D_2 in (p, v)-space. This implies that, for beliefs just above p̄_1, the payoff of player 1 follows the single-agent ODE, and that the only candidate for an equilibrium in cutoff strategies is the one in which player 1 applies the cutoff p̄_1 and player 2 applies some cutoff p'_2 ∈ (p̄_1, 1). For p ∈ [p'_2, 1], the proposed strategies imply that player i's value function v^{(1,1)}_i follows the ODE for two experimenting players. To show that player 1 plays a best response in this range, it suffices to show that v^{(1,1)}_1(p) ≥ D_1(p) on [p'_2, 1].
Now suppose to the contrary that there exists some p̃ ∈ [p'_2, 1] with v^{(1,1)}_1(p̃) = D_1(p̃) and v^{(1,1)′}_1(p̃) ≤ −g = D′_1(p̃). By the ODE for v^{(1,1)}_1, this leads to a contradiction; hence v^{(1,1)}_1(p) ≥ D_1(p) for all p ∈ [p'_2, 1], and player 1 is playing a best response in this range as well. Now let us turn to player 2. The diagonal D_2 is strictly decreasing, s_2 = v^{(1,0)}_2(p̄_1) = D_2(s_2/g), and p̄_1 < s_2/g. Moreover, v^{(1,1)}_2 is convex. Direct computation from the appertaining ODEs furthermore shows smooth pasting at p'_2, and v^{(1,1)}_2(p) > D_2(p) for all p ∈ (p'_2, 1], as D_2 is monotonically decreasing. This shows that player 2 is playing a best response on [p'_2, 1] as well, thus concluding the proof. Finally, b(p, v) ≤ c_2(p) if and only if v(p) ≤ 2s_2; both inequalities hold by the monotonicity of v.