Decision Maker based on Atomic Switches

We propose a simple model for an atomic switch-based decision maker (ASDM) and show that, as long as the total volume of precipitated Ag atoms is conserved when coupled with suitable operations, an atomic switch system provides a sophisticated "decision-making" capability, which is known to be one of the most important intellectual abilities in human beings. We consider the multi-armed bandit problem (MAB): the problem of finding, as accurately and quickly as possible, the most profitable option from a set of options that give stochastic rewards. Decisions are dictated by the volumes of precipitated Ag atoms, which move in a manner similar to the fluctuations of a rigid body in a tug-of-war game. The "tug-of-war (TOW) dynamics" of the ASDM exhibit higher efficiency than conventional MAB solvers. We present analytical calculations that explain, on statistical grounds, why the ASDM dynamics produce such high performance despite their simplicity. These results imply that various physical systems in which some conservation law holds can be used to implement efficient "decision-making objects." Efficient MAB solvers are useful for many practical applications, because the MAB abstracts a variety of decision-making problems in real-world situations where efficient trial and error is required. The proposed scheme will introduce a new physics-based analog computing paradigm, which will include such things as "intelligent nano devices" and "intelligent information networks" based on self-detection and self-judgment.


I. INTRODUCTION
When we look at the natural world, information processing in biological systems is elegantly coupled with their underlying physics [1,2]. This suggests a potential for establishing a new physics-based analog-computing paradigm. About ten years ago, a conceptually novel switching device called the "atomic switch" was proposed; it is based on metal ion migration and electrochemical reactions in solid electrolytes [3]. Because its resistance state is controlled continuously by the movement of a limited number of metal ions/atoms, the atomic switch can be regarded as a physics-based analog-computing element. In this paper, using atomic switches, we show that a physical constraint, the volume conservation law, allows for the efficient solving of decision-making problems; decision making is one of the most important intellectual abilities in human beings.
Suppose there are $M$ slot machines, each of which returns a reward (for example, coins) drawn from a probability density function (PDF) that is unknown to the player. Let us consider a minimal case: two machines A and B give rewards with individual PDFs whose mean rewards are $\mu_A$ and $\mu_B$, respectively. At each trial, the player decides which machine to play, trying to maximize the total reward obtained after repeating several trials. The multi-armed bandit problem (MAB) is to determine the optimal strategy for playing the machines, as accurately and quickly as possible, by referring to past experience.
In the context of decision-making algorithms, the MAB was originally described by Robbins [4], although the essence of the problem had been studied earlier by Thompson [5]. The optimal strategy, called the "Gittins index," is known only for a limited class of problems in which the reward distributions are assumed to be known to the players [6,7]. Even in this limited class, computing the Gittins index becomes intractable in practice for many cases. Agrawal and Auer et al. proposed algorithms in which another index is expressed as a simple function of the reward sums obtained from the machines [8,9]. In particular, the "upper confidence bound 1 (UCB1) algorithm" for solving MABs is used worldwide in many practical applications [9]. The MAB is formulated as a mathematical problem without loss of generality and, as such, is related to various stochastic phenomena. In fact, many application problems in diverse fields, such as communications (cognitive networks [10,11]), commerce (advertising on the web [12]), and entertainment (Monte-Carlo tree search, which is used for computer games [13,14]), can be reduced to MABs.

II. MODEL
Kim et al. proposed a MAB solution called "tug-of-war (TOW)," which uses a dynamical system. This algorithm was inspired by the spatiotemporal dynamics of a single-celled amoeboid organism (the true slime mold P. polycephalum) [15-21], which maintains a constant intracellular-resource volume while collecting environmental information by concurrently expanding and shrinking its pseudopod-like terminal parts. In this bio-inspired algorithm, the decision-making function is derived from its underlying physics, which resembles that of a tug-of-war game. The physical constraint in TOW dynamics, the conservation law for the volume of the amoeboid body, entails a nonlocal correlation among the terminal parts. That is, the volume increment in one part is immediately compensated for by volume decrement(s) in the other part(s). In our previous studies [15-21], we showed that, owing to the nonlocal correlation derived from the volume-conservation law, TOW dynamics exhibit higher performance than other well-known algorithms such as the modified ε-greedy algorithm and the modified softmax algorithm, and performance comparable to that of the UCB1-tuned algorithm (seen as the best choice among parameter-free algorithms [9]). These observations suggest that efficient decision-making devices could be implemented using any physical object as long as it holds some common physical attribute, such as a conservation law. In fact, Kim et al. demonstrated that optical energy-transfer dynamics between quantum dots, in which energy is conserved, can be exploited for the implementation of TOW dynamics [22-24].
Here, we propose a simplified model for an atomic switch-based decision maker (ASDM). Consider two atomic switches located close to each other, in which a solid electrolyte (SE) is sandwiched between one top Pt electrode and two bottom Pt electrodes, one on each side, as shown in Fig. 1(a). Each atomic switch is operated in a metal/ionic conductor/metal configuration, which is referred to as a "gapless type atomic switch" [25]. Here we assume that the operation of each switch is influenced by the other, which implies a certain interaction between the two switches. In the initial state, Ag ions are distributed uniformly in the electrolyte. When a bias voltage of $-V_0$ is applied to the bottom Pt electrodes relative to the top Pt electrode, Ag ions migrate to the bottom electrodes, and the same amount of Ag atoms is precipitated on each electrode. We denote this initial height of precipitated Ag atoms by $X_0$ and the displacement of the height from $X_0$ at time $t$ by $X_k(t)$ ($k \in \{A, B\}$). The total height is therefore $X_0 + X_k(t)$.
If the current $I_k > \theta$, we consider that the ASDM chooses machine $k$ and obtains a reward $R_k(j)$ generated from the corresponding "unknown PDF" (the mean reward $\mu_k$ is also assumed to be unknown). According to the reward, the added voltage $\Delta V_k(j)$ is determined by
$$\Delta V_k(j) = R_k(j) - K. \qquad (1)$$
Here, $R_k(j)$ is a "reward," which can take an arbitrary real value, and $K$ is a parameter described in detail later in this paper. Each voltage $V_k$ is then updated by this added voltage. We assume the following conditions:
1. In the initial equilibrium state, the SE is nearly empty of Ag ions to be precipitated. This implies that an increment in one height is compensated by a decrement in the other (Eq. (3) holds).
2. Here, $T_h$ and $\theta$ are thresholds. If $T_h$ is set smaller than $X_0$, the dynamics work from the initial state without fluctuations.
3. For simplicity, we assume a linear dependence between $\Delta V_k$ and $\Delta X$ (Eq. (4)), even though the actual dependence is affected by the shape of the precipitated Ag atoms and the amount of Ag ions remaining.
4. The duration $\Delta t$ over which voltage is added is sufficiently longer than the interval $\Delta t_{\mathrm{int}}$ between additions, so that the decay of the precipitated Ag atoms during $\Delta t_{\mathrm{int}}$ can be ignored.
The displacement $X_A$ ($= -X_B$) is determined by the following equations:
$$X_A(t+1) = Q_A(t) - Q_B(t) + \delta(t),$$
$$Q_k(t) = \sum_{j=1}^{N_k} \Delta V_k(j).$$
Here, $Q_k(t)$ ($k \in \{A, B\}$) is an "estimate" that accumulates the information of past experiences from the initial time 1 to the current time $t$, $N_k$ counts the number of times that machine $k$ has been played, $\Delta V_k$ is the added voltage when playing machine $k$, $\delta(t)$ is an arbitrary fluctuation to which the body is subjected, and $K$ is a parameter. Eqs. (1) and (4) are called the "learning rule." Consequently, the ASDM dynamics evolve according to a particularly simple rule: in addition to the fluctuation, if machine $k$ is played at time $t$, $R_k - K$ is added to $X_k(t)$ (Fig. 1).
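The rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' simulation code: it assumes that machine A is chosen when the displacement $X_A = Q_A - Q_B + \delta$ is positive and machine B when it is negative (a simplification of the threshold conditions on $I_k$, $\theta$, and $T_h$), that rewards are Gaussian, and that the proportionality constant between $\Delta V$ and $\Delta X$ is 1. The function name `simulate_asdm` and its arguments are hypothetical.

```python
import math
import random


def simulate_asdm(mu_a, mu_b, sigma, n_plays, k=None, seed=0):
    """Toy two-machine TOW/ASDM run; returns (total reward, play counts)."""
    rng = random.Random(seed)
    if k is None:
        k = (mu_a + mu_b) / 2.0  # K_0; assumes the sum of the means is known
    q = {"A": 0.0, "B": 0.0}  # accumulated (R - K) per machine
    counts = {"A": 0, "B": 0}
    total_reward = 0.0
    for t in range(1, n_plays + 1):
        # Fluctuation quoted in the text: delta = sin(pi/2 + pi*t) = +/-1.
        delta = math.sin(math.pi / 2.0 + math.pi * t)
        x_a = q["A"] - q["B"] + delta  # displacement X_A (= -X_B)
        if x_a > 0:
            arm = "A"
        elif x_a < 0:
            arm = "B"
        else:
            arm = rng.choice("AB")  # tie-break (assumption)
        mu = mu_a if arm == "A" else mu_b
        reward = rng.gauss(mu, sigma)  # Gaussian reward PDF (assumption)
        q[arm] += reward - k  # learning rule: add R_k - K
        counts[arm] += 1
        total_reward += reward
    return total_reward, counts
```

Calling `simulate_asdm(0.5, 0.6, 0.2, 1000)` mirrors the reward settings quoted for Fig. 2, although the selection rule here is the simplified one described above.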

B. SOFTMAX Algorithm
The SOFTMAX algorithm is a well-known algorithm for solving MAB problems [26]. In this algorithm, the probabilities of selecting A or B, $P'_A(t)$ or $P'_B(t)$, are given by Boltzmann distributions with a parameter $\beta$. In our study, $\beta$ takes a time-dependent form: $\beta = 0$ corresponds to a random selection, and $\beta \to \infty$ corresponds to a greedy action. From computer simulations, we confirmed that, in almost all cases, an ASDM with the parameter $K_0 = (\mu_A + \mu_B)/2$ can acquire more rewards than the SOFTMAX algorithm with the optimized parameter $\tau_{\mathrm{opt}}$, although SOFTMAX is well known as a good algorithm for efficient decision making [27]. Figure 2 shows an example of an ASDM/SOFTMAX performance comparison. The vertical axis denotes the sum of acquired rewards (mean values of 1000 samples), and the horizontal axis denotes the number of plays. For the reward PDFs, we used normal distributions $N(\mu_A, \sigma^2)$ and $N(\mu_B, \sigma^2)$, where $\mu_A = 0.5$, $\mu_B = 0.6$, and $\sigma = 0.2$. Computer simulations were executed under the condition that $T_h = X_0$ and $\delta(t) = \sin(\pi/2 + \pi t)$.
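As a rough illustration of Boltzmann (softmax) selection between two machines, the sketch below computes the probability of choosing A from two reward estimates. Several details are assumptions rather than the paper's definitions: the estimates `q_a` and `q_b` are taken to be any running reward estimates (for example, sample means), and the time dependence is taken to be $\beta = \tau t$. The function names are hypothetical.

```python
import math
import random


def softmax_probability_a(q_a, q_b, beta):
    """P(select A) = exp(beta*q_a) / (exp(beta*q_a) + exp(beta*q_b))."""
    z = beta * (q_a - q_b)
    # Evaluate the logistic form on the numerically safe side.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softmax_select(q_a, q_b, t, tau, rng):
    """Pick a machine at time t, with beta = tau * t (assumed form)."""
    beta = tau * t
    p_a = softmax_probability_a(q_a, q_b, beta)
    return "A" if rng.random() < p_a else "B"
```

With $\beta = 0$ the function returns 0.5 regardless of the estimates (random selection), and for large $\beta$ it approaches a greedy choice, matching the two limits described in the text.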

IV. THEORETICAL ANALYSES OF THE ASDM
Theoretical analyses of the TOW dynamics for a Bernoulli type MAB problem, in which a reward is limited to 0 or 1, are described in [21].In this section, theoretical analyses of the ASDM are described for a general MAB where a reward is not limited to 0 or 1.

A. Solvability of the MAB
To explore the MAB solvability of the ASDM dynamics, let us consider a random-walk model, as shown in Fig. 3(a). Here, $R_k(t)$ ($k \in \{A, B\}$) is the reward at time $t$, and $K$ is a parameter (see Eq. (1)). For simplicity, we assume that the means of the probability density functions of $R_k$ satisfy $\mu_A > \mu_B$. After time step $t$, the displacement $D_k(t)$ ($k \in \{A, B\}$) can be described by
$$D_k(t) = \sum_{j=1}^{N_k} \left( R_k(j) - K \right).$$
The expected value of $D_k$ is
$$E(D_k) = (\mu_k - K)\, N_k.$$
In the overlapping area between the two distributions shown in Fig. 3(b), we cannot accurately estimate which is larger. The overlapping area should decrease as $N_k$ increases so as to avoid incorrect judgments. This requirement can be expressed by the following forms:
$$\mu_A - K > 0, \qquad \mu_B - K < 0.$$
These expressions can be rearranged into the form
$$\mu_B < K < \mu_A.$$
In other words, the parameter $K$ must satisfy the above conditions so that the random walk correctly indicates which machine has the larger mean reward.
We can easily confirm that the following form satisfies the above conditions:
$$K_0 = \frac{\mu_A + \mu_B}{2}.$$
From $Q_k(t)$ and Eq. (13), we obtain
$$E(Q_A) = \frac{\mu_A - \mu_B}{2}\, N_A > 0, \qquad E(Q_B) = -\frac{\mu_A - \mu_B}{2}\, N_B < 0.$$
Here, we have set the parameter $K$ to $K_0$. Therefore, we can conclude that the ASDM dynamics using the learning rule $Q_k$ with the parameter $K_0$ can solve the MAB correctly.
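The sign argument can be checked numerically. The following sketch estimates the mean displacement of a random walk with flight $R_k - K$ by Monte Carlo sampling, assuming Gaussian rewards; with $K = K_0$, the walk for the better machine should drift upward and the other downward. The helper name `mean_displacement` and the sample sizes are illustrative choices, not taken from the paper.

```python
import random


def mean_displacement(mu, sigma, k, n_steps, n_walks=200, seed=0):
    """Average final displacement of walks with step R - K, R ~ N(mu, sigma^2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_walks):
        total += sum(rng.gauss(mu, sigma) - k for _ in range(n_steps))
    return total / n_walks


mu_a, mu_b, sigma = 0.6, 0.5, 0.2  # mu_A > mu_B, as assumed in this section
k0 = (mu_a + mu_b) / 2.0
d_a = mean_displacement(mu_a, sigma, k0, n_steps=500, seed=1)
d_b = mean_displacement(mu_b, sigma, k0, n_steps=500, seed=2)
# Expected drifts are (mu_k - K_0) * n_steps, i.e. about +25 and -25 here.
print(d_a, d_b)
```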

B. Origin of the high performance
In many popular algorithms such as the ε-greedy algorithm, at each time $t$, an estimate of the reward probability is updated only for the machine being played. On the other hand, in an imaginary circumstance in which the sum of the mean rewards $\gamma = \mu_A + \mu_B$ is known to the player, we can update both estimates simultaneously, even though only one of the machines was played.

TABLE I: Estimates for each mean reward based on the knowledge that machine A was played $N_A$ times and that machine B was played $N_B$ times, on the assumption that the sum of the mean rewards $\gamma = \mu_A + \mu_B$ is known.
The top and bottom rows of Table I provide estimates based on the knowledge that machine A was played $N_A$ times and that machine B was played $N_B$ times, respectively. Note that we can also update the estimate of the machine that was not played, owing to the given $\gamma$.
From the above estimates, each expected reward $Q'_k$ ($k \in \{A, B\}$) is given as follows:
$$Q'_A = N_A \frac{\sum_{j=1}^{N_A} R_A(j)}{N_A} + N_B \left( \gamma - \frac{\sum_{j=1}^{N_B} R_B(j)}{N_B} \right), \qquad Q'_B = N_A \left( \gamma - \frac{\sum_{j=1}^{N_A} R_A(j)}{N_A} \right) + N_B \frac{\sum_{j=1}^{N_B} R_B(j)}{N_B}.$$
These expected rewards, the $Q'_j$, are not the same as those given by the learning rules of TOW dynamics, the $Q_j$ in Eqs. (1) and (4). However, what we use substantially in TOW dynamics is the difference
$$Q_A - Q_B = \left( \sum_{j=1}^{N_A} R_A(j) - \sum_{j=1}^{N_B} R_B(j) \right) - K (N_A - N_B).$$
When we transform the expected rewards $Q'_j$ into $Q''_j = Q'_j / 2$, we can obtain the difference
$$Q''_A - Q''_B = \left( \sum_{j=1}^{N_A} R_A(j) - \sum_{j=1}^{N_B} R_B(j) \right) - \frac{\gamma}{2} (N_A - N_B).$$
Comparing the coefficients of Eqs. (18) and (19), the differences in their constituent terms are always equal when $K = K_0$ (Eq. (14)) is satisfied. Eventually, we can obtain the nearly optimal weighting parameter $K_0$ in terms of $\gamma$. This derivation implies that the learning rule for the ASDM dynamics is equivalent to that of the imaginary system in which both estimates can be updated simultaneously. In other words, the ASDM dynamics imitate the imaginary system that determines its next move at time $t + 1$ by referring to the estimates of the two machines, even if one of them was not actually played at time $t$. This unique feature of the learning rule, derived from the fact that the sum of the mean rewards is given in advance, may be one of the origins of the high performance of the ASDM dynamics.
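The claimed equivalence can be verified with a short numeric check. The sketch below assumes that the TOW estimate is $Q_k = \sum_j (R_k(j) - K)$ and that the imaginary system forms its expected rewards from the played machine's sample mean and from $\gamma$ minus the other machine's sample mean, halved before taking the difference. These forms are a reading of the text, and the function names are hypothetical.

```python
def tow_difference(sum_a, sum_b, n_a, n_b, k):
    """Q_A - Q_B for the TOW learning rule Q_k = sum of (R_k - K)."""
    q_a = sum_a - k * n_a
    q_b = sum_b - k * n_b
    return q_a - q_b


def imaginary_difference(sum_a, sum_b, n_a, n_b, gamma):
    """Half-difference of expected rewards when gamma = mu_A + mu_B is known.

    Requires n_a > 0 and n_b > 0 so that both sample means exist.
    """
    q_a = n_a * (sum_a / n_a) + n_b * (gamma - sum_b / n_b)
    q_b = n_a * (gamma - sum_a / n_a) + n_b * (sum_b / n_b)
    return (q_a - q_b) / 2.0


# Arbitrary illustrative reward sums and play counts.
gamma = 1.1
lhs = tow_difference(12.3, 7.1, 20, 13, k=gamma / 2.0)
rhs = imaginary_difference(12.3, 7.1, 20, 13, gamma)
print(lhs, rhs)  # the two values agree when K = gamma / 2
```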
Monte Carlo simulations verified that the ASDM dynamics with $K_0$ exhibit an exceptionally high performance, which is comparable to the peak performance achieved with the optimal parameter $K_{\mathrm{opt}}$. To derive the optimal value $K_{\mathrm{opt}}$ accurately, we need to take the fluctuations into account.
In addition, the essence of the process described here can be generalized to $M$-machine cases. To separate the distributions of the top $m$-th and top $(m + 1)$-th machines, as shown in Fig. 3(b), all we need is the following $K_0$:
$$K_0 = \frac{\mu_{(m)} + \mu_{(m+1)}}{2}.$$
Here, $\mu_{(m)}$ denotes the top $m$-th mean, and $m$ is any integer from 1 to $M - 1$. The MAB is a special case where $m = 1$. In fact, for $M$-machine and $X$-player cases, we have designed a physical system that can determine the overall optimal state, called the "social maximum," quickly and accurately [28,29].
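For illustration, the threshold separating the top $m$ machines from the rest can be computed as the midpoint between the $m$-th and $(m+1)$-th largest means. This midpoint form is an assumption, chosen because it reduces to $(\mu_A + \mu_B)/2$ in the two-machine case; the function name `k0_for_top_m` is hypothetical.

```python
def k0_for_top_m(means, m):
    """Midpoint between the m-th and (m+1)-th largest mean rewards."""
    ordered = sorted(means, reverse=True)
    if not 1 <= m <= len(ordered) - 1:
        raise ValueError("m must satisfy 1 <= m <= M - 1")
    return (ordered[m - 1] + ordered[m]) / 2.0
```

For two machines with means 0.5 and 0.6 and $m = 1$, this gives the same $K_0$ as in the two-machine analysis.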

C. Performance characteristics
To characterize the high performance of the ASDM dynamics, let us consider an imaginary model for solving the MAB, called the "cheater algorithm." The cheater algorithm selects a machine to play according to the following estimate $S_k$ ($k \in \{A, B\}$):
$$S_k = \sum_{i=1}^{N} X_{k,i}.$$
Here, $X_{k,i}$ is a random variable. If $S_A > S_B$ at time $t = N$, machine A is played at time $t = N + 1$. If $S_B > S_A$ at time $t = N$, machine B is played at time $t = N + 1$. If $S_A = S_B$ at time $t = N$, a machine is chosen at random at time $t = N + 1$. Note that the algorithm refers to the results of both machines at time $t$, regardless of which machine was played at time $t - 1$. In other words, the algorithm "cheats" because it plays both machines and collects both results, but declares that it plays only one machine at a time.
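A minimal sketch of the cheater algorithm follows, assuming Gaussian rewards with a common standard deviation; the function name `simulate_cheater` and its parameters are illustrative. At every step both machines are sampled and both sums are updated, but only the reward of the declared machine is counted.

```python
import random


def simulate_cheater(mu_a, mu_b, sigma, n_plays, seed=0):
    """Run the cheater algorithm; returns (counted reward, declared plays)."""
    rng = random.Random(seed)
    s_a = s_b = 0.0  # running sums S_A and S_B over all past steps
    counts = {"A": 0, "B": 0}
    total_reward = 0.0
    for _ in range(n_plays):
        if s_a > s_b:
            arm = "A"
        elif s_b > s_a:
            arm = "B"
        else:
            arm = rng.choice("AB")  # tie: choose at random
        # The "cheat": results of both machines are observed every step.
        x_a = rng.gauss(mu_a, sigma)
        x_b = rng.gauss(mu_b, sigma)
        s_a += x_a
        s_b += x_b
        counts[arm] += 1
        total_reward += x_a if arm == "A" else x_b
    return total_reward, counts
```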
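The expected value and the variance of $X_k$ are defined as $E(X_k) = \mu_k$ and $V(X_k) = \sigma_k^2$. Here, $\mu_k$ is the mean reward of machine $k$ defined earlier. From the central-limit theorem, $S_k$ has a Gaussian distribution with $E(S_k) = \mu_k N$ and $V(S_k) = \sigma_k^2 N$. If we define a new variable $S = S_A - S_B$, then $S$ has a Gaussian distribution with the following values:
$$E(S) = (\mu_A - \mu_B) N, \qquad V(S) = (\sigma_A^2 + \sigma_B^2) N, \qquad \sigma(S) = \sqrt{(\sigma_A^2 + \sigma_B^2) N}.$$
From Fig. 4, the probability of playing machine B, which has the lower mean reward, can be described as $Q\!\left( E(S)/\sigma(S) \right)$. Here, $Q(x)$ is the Q-function. We obtain
$$P(t = N + 1, B) = Q\!\left( \phi \sqrt{N} \right).$$
Here,
$$\phi = \frac{\mu_A - \mu_B}{\sqrt{\sigma_A^2 + \sigma_B^2}}.$$
Using the Chernoff bound $Q(x) \le \frac{1}{2} \exp\!\left( -\frac{x^2}{2} \right)$, we can calculate the upper bound of a measure, called the "regret," which quantifies the accumulated losses of the cheater algorithm.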

regret = (µ
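The Q-function and its Chernoff bound are easy to evaluate numerically. The sketch below uses the standard identity $Q(x) = \frac{1}{2}\,\mathrm{erfc}(x/\sqrt{2})$ and the bound $Q(x) \le \frac{1}{2} e^{-x^2/2}$ for $x \ge 0$. The helper `wrong_choice_probability` assumes independent rewards, so that the variances of $S_A$ and $S_B$ add; all function names are hypothetical.

```python
import math


def q_function(x):
    """Gaussian tail probability Q(x) = P(Z > x) for standard normal Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def chernoff_bound(x):
    """Upper bound (1/2) * exp(-x^2 / 2) on Q(x), valid for x >= 0."""
    return 0.5 * math.exp(-x * x / 2.0)


def wrong_choice_probability(mu_a, mu_b, sigma_a, sigma_b, n):
    """Q(E(S)/sigma(S)) after n steps, with S = S_A - S_B and mu_a > mu_b."""
    phi = (mu_a - mu_b) / math.sqrt(sigma_a ** 2 + sigma_b ** 2)
    return q_function(phi * math.sqrt(n))


for x in (0.0, 0.5, 1.0, 2.0, 3.0):
    print(x, q_function(x), chernoff_bound(x))
```

The printed pairs show how the bound tracks the exact tail probability as $x$ grows, which is what makes the regret sum of the cheater algorithm easy to bound.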
In contrast, the ASDM demonstrates that higher performance can be achieved by introducing a parameter $K_0$ that refers to the sum of the rewarded experiences, i.e., $\mu_A + \mu_B$. This type of optimization, using the sum of the rewarded experiences, is particularly useful for time-varying environments (reward probability or reward PDF) [16]. Owing to this novelty, the high performance of the TOW dynamics can be reproduced when implementing these dynamics with atomic switches. The ASDM proposed in this paper is a simple "ideal model." While the assumptions used for constructing the model may contain some points that do not match real experimental situations, the model can be extended so that the modified assumptions do match real experimental situations. As long as the TOW dynamics between atomic switches are implemented, high-performance decision making can be guaranteed even in the extended model.
The ASDM will introduce a new physics-based analog computing paradigm, which will include such things as "intelligent nanodevices" and "intelligent information networks" based on self-detection and self-judgment. Thus, our proposed physics-based analog-computing paradigm would be useful for a variety of real-world applications and for understanding the biological information-processing principles that exploit their underlying physics.
FIG. 1: (a) The ASDM using gapless type atomic switches. The ASDM decides which machine (A and/or B) is to be played at time $t$ according to whether the current $I_k$ is larger than $\theta$. (b) Voltages $V_A$ and $V_B$. Here, the added voltage $\Delta V_k(j)$ is determined by each reward $R_k(j)$ (Eq. (1)) at play $j$ ($I_k > \theta$).

FIG. 3: (a) Random walk: flight $R_k(t) - K$. Here, the probability density function of $R_k$ has the mean $\mu_k$. (b) Probability distributions of two random walks.