Efficient Decision-Making by Volume-Conserving Physical Object

We demonstrate that any physical object, as long as its volume is conserved when coupled with suitable operations, provides a sophisticated decision-making capability. We consider the problem of finding, as accurately and quickly as possible, the most profitable option from a set of options that gives stochastic rewards. These decisions are made as dictated by a physical object, which is moved in a manner similar to the fluctuations of a rigid body in a tug-of-war game. Our analytical calculations validate statistical reasons why our method exhibits higher efficiency than conventional algorithms.

The computing principles in modern digital paradigms have been designed to be dissociated from the underlying physics of natural phenomena [1]. In the construction of CMOS devices, wide-band-gap materials have been employed so that physical fluctuations such as thermal noise, which often violate logically valid behavior, could be neglected [2]. Since electron dynamics constrained by physical laws cannot be controlled when only parameters of the same degree of freedom as those of logical input-output responses are modulated, considerably complicated circuits are required for implementing relatively simple logic gates such as NAND and NOR [3]. However, these efforts to circumvent the division between physics and computation are costly in terms of energy consumption and manufacturing resources. On the other hand, when we look at the natural world, information processing in biological systems is elegantly coupled with their underlying physics [4,5]. This suggests a potential for establishing a new physics-based analog-computing paradigm. In this Letter, we show that a physical constraint, the conservation law for the volume of a rigid body, allows for efficient solving of decision-making problems when subjected to suitable operations involving fluctuations.
Suppose there are M slot machines, each of which returns a reward (for example, a coin) with a certain probability that is unknown to the player. Let us consider a minimal case: two machines A and B give rewards with individual probabilities $P_A$ and $P_B$, respectively. The player decides which machine to play at each trial, trying to maximize the total reward obtained after repeating several trials. The multi-armed bandit problem (MBP) is to determine the optimal strategy for finding the machine with the highest reward probability as accurately and quickly as possible by referring to past experiences.
The MBP is formulated as a mathematical problem without loss of generality and so is related to various stochastic phenomena. In fact, many application problems in diverse fields, such as communications (cognitive networks [6,7]), commerce (advertising on the web [8]), and entertainment (Monte-Carlo tree search, which is used for computer games [9,10]), can be reduced to MBPs. In particular, the "upper confidence bound 1 (UCB1) algorithm" for solving MBPs is used worldwide in many practical applications [16].

* Electronic address: KIM.Songju@nims.go.jp
In the context of reinforcement learning, the MBP was originally described by Robbins [11], though the essence of the problem had been studied earlier by Thompson [12]. The optimal strategy, called the "Gittins index", is known only for a limited class of problems in which the reward distributions are assumed to be known to the players [13,14]. Even in this limited class, in practice, computing the Gittins index becomes intractable for many cases. For the algorithms proposed by Agrawal and Auer et al., another index was expressed as a simple function of the reward sums obtained from the machines [15,16]. Kim et al. proposed an MBP solution using a dynamical system, called "tug-of-war (TOW) dynamics"; this algorithm was inspired by the spatiotemporal dynamics of a single-celled amoeboid organism (the true slime mold P. polycephalum) [17-22], which maintains a constant intracellular-resource volume while collecting environmental information by concurrently expanding and shrinking its pseudopod-like terminal parts. In this nature-inspired algorithm, the decision-making function is derived from its underlying physics, resembling that of a tug-of-war game. The physical constraint in TOW dynamics, the conservation law for the volume of the amoeboid body, entails a nonlocal correlation among the terminal parts; that is, the volume increment in one part is immediately compensated by volume decrement(s) in the other part(s). In our previous studies [17-22], we showed that, owing to the nonlocal correlation derived from the volume-conservation law, TOW dynamics exhibits higher performance than other well-known algorithms such as the modified ε-greedy algorithm and the modified soft-max algorithm, and that its performance is comparable to that of the UCB1-tuned algorithm (seen as the best choice among parameter-free algorithms [16]).
These observations suggest that efficient decision-making devices could be implemented using any physical object, as long as it has some common physical attribute such as a conservation law. In fact, Kim et al. demonstrated that optical energy-transfer dynamics between quantum dots, in which energy is conserved, can be exploited for the implementation of TOW dynamics [23,24]. Consider a volume-conserving physical object; for example, a rigid body like an iron bar (the slot-machine's handle), as shown in Fig. 1. Here, the variable $X_k$ represents the displacement of terminal k from an initial position, where $k \in \{A, B\}$. If $X_k$ is a maximum, we assume that the body makes a decision to play machine k. In TOW dynamics, the MBP is represented in its inverse form: instead of "rewarding" the player when machine k produces a coin with a probability $P_k$, we "punish" the player when the machine gives no coin with a probability $1 - P_k$. In this respect, the displacement $X_A$ ($= -X_B$) is determined by the following equations:

$$X_A(t+1) = Q_A(t) - Q_B(t) + \delta(t), \tag{1}$$

$$Q_k(t) = N_k(t) - (1 + \omega)\, L_k(t). \tag{2}$$

Here, $Q_k(t)$ ($k \in \{A, B\}$) is an "estimate" of information on past experiences accumulated from the initial time 1 to the current time t, $N_k$ counts the number of times that machine k has been played, $L_k$ counts the number of punishments when playing machine k, $\delta(t)$ is an arbitrary fluctuation to which the body is subjected, and $\omega$ is a weighting parameter to be described in detail later in this Letter. Eq. (2), called the "learning rule", reflects the volume-conservation law. Consequently, the TOW dynamics evolve according to a particularly simple rule: in addition to the fluctuation, if machine k is played at time t, $+1$ and $-\omega$ are added to $X_k(t-1)$ when rewarded and non-rewarded, respectively (Fig. 1).
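The simple update rule described above can be illustrated with a short simulation. The following is a minimal sketch, not the authors' implementation: it assumes that the displacement is the difference of the two estimates plus a fluctuation, that the fluctuation is uniform noise of hypothetical amplitude `noise`, and that ties at zero displacement are broken at random. The function name and all parameter names are illustrative.

```python
import random

def tow_two_armed(p_a, p_b, omega, steps, noise=0.5, seed=0):
    """Simulate a two-machine tug-of-war; return (plays of A, plays of B, total reward)."""
    rng = random.Random(seed)
    probs = {"A": p_a, "B": p_b}
    n = {"A": 0, "B": 0}  # N_k: number of times machine k was played
    l = {"A": 0, "B": 0}  # L_k: number of punishments (no coin) on machine k
    x_a = 0.0             # displacement X_A; X_B = -X_A by volume conservation
    total = 0
    for _ in range(steps):
        # Play the machine whose terminal is displaced furthest (assumed tie-break: random).
        if x_a > 0:
            choice = "A"
        elif x_a < 0:
            choice = "B"
        else:
            choice = rng.choice("AB")
        n[choice] += 1
        if rng.random() < probs[choice]:
            total += 1
        else:
            l[choice] += 1
        # Each play adds +1 to the estimate when rewarded and -omega when not.
        q = {k: n[k] - (1 + omega) * l[k] for k in "AB"}
        delta = rng.uniform(-noise, noise)  # assumed form of the fluctuation
        x_a = q["A"] - q["B"] + delta
    return n["A"], n["B"], total

n_a, n_b, total = tow_two_armed(0.7, 0.3, omega=1.0, steps=1000)
print(n_a, n_b, total)
```

Because only the difference of the two estimates enters the displacement, a punishment on one machine pushes the body toward the other machine, which is the tug-of-war coupling described in the text.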
To explore the origins of the high performance of TOW dynamics, let us consider a random-walk model for comparison. As shown in Fig. 2(a), $\alpha$ (right flight when rewarded) and $\beta$ (left flight when non-rewarded) are the parameters. We assume that $P_A > P_B$ for simplicity. After time step t, the displacement $R_k(t)$ ($k \in \{A, B\}$) can be described by

$$R_k(t) = \alpha\,(N_k - L_k) - \beta\, L_k. \tag{3}$$

The expected value of $R_k$ can be obtained from the following equation:

$$E(R_k(t)) = \{\alpha P_k - \beta\,(1 - P_k)\}\, N_k. \tag{4}$$

In the overlapping area between the two distributions shown in Fig. 2(b), we cannot accurately estimate which is larger. The overlapping area should decrease as $N_k$ increases so as to avoid incorrect judgments. This requirement can be expressed by the following forms:

$$\alpha P_A - \beta\,(1 - P_A) > 0, \tag{5}$$

$$\alpha P_B - \beta\,(1 - P_B) < 0. \tag{6}$$

These expressions can be rearranged into the form

$$\frac{P_B}{1 - P_B} < \frac{\beta}{\alpha} < \frac{P_A}{1 - P_A}. \tag{7}$$

In other words, the parameters $\alpha$ and $\beta$ must satisfy the above conditions so that the random walk correctly identifies the larger reward probability. We can easily confirm that the following form satisfies the above conditions:

$$\frac{\beta}{\alpha} = \frac{P_A + P_B}{2 - (P_A + P_B)}. \tag{8}$$

From $R_k(t)/\alpha = Q_k(t)$, we obtain $\omega = \beta/\alpha$. From this and Eq. (8), we obtain

$$\omega_0 = \frac{\gamma}{2 - \gamma}, \qquad \gamma = P_A + P_B. \tag{9}$$

Here, we have set the parameter $\omega$ to $\omega_0$. Therefore, we can conclude that the algorithm using the learning rule $Q_k$ with the parameter $\omega_0$ can solve the MBP correctly.

In many popular algorithms such as the ε-greedy algorithm, at each time t, an estimate of the reward probability is updated only for the machine being played. On the other hand, in an imaginary circumstance in which the sum of the reward probabilities $\gamma = P_A + P_B$ is known to the player, we can update both estimates simultaneously, even though only one of the machines was played.
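The separation requirement on the random walk can be checked numerically. The sketch below assumes that the weighting parameter takes the form $\omega_0 = \gamma/(2-\gamma)$ with $\gamma = P_A + P_B$, and that the expected displacement per play, in units of $\alpha$, is $P_k - \omega(1 - P_k)$; both function names are hypothetical. Exact fractions are used so that the signs are not blurred by rounding.

```python
from fractions import Fraction

def omega0(p_a, p_b):
    """Assumed form of the weighting parameter: gamma / (2 - gamma)."""
    gamma = p_a + p_b
    return gamma / (2 - gamma)

def drift_per_play(p, omega):
    """Expected displacement per play in units of alpha: +1 w.p. p, -omega w.p. 1 - p."""
    return p - omega * (1 - p)

p_a, p_b = Fraction(9, 10), Fraction(6, 10)
w = omega0(p_a, p_b)
print(w, drift_per_play(p_a, w), drift_per_play(p_b, w))  # prints: 3 3/5 -3/5
```

Under this assumed form the two drifts are equal in magnitude and opposite in sign, $\pm (P_A - P_B)/(2 - \gamma)$, so the better machine walks to the right and the worse machine walks to the left at the same average speed.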
TABLE I: Estimates for each reward probability based on the knowledge that machine A was played $N_A$ times and that machine B was played $N_B$ times, on the assumption that the sum of the reward probabilities $\gamma = P_A + P_B$ is known.

      Estimate of P_A                  Estimate of P_B
A:    (N_A - L_A)/N_A                  γ - (N_A - L_A)/N_A
B:    γ - (N_B - L_B)/N_B              (N_B - L_B)/N_B

The top and bottom rows of Table I provide estimates based on the knowledge that machine A was played $N_A$ times and that machine B was played $N_B$ times, respectively. Note that we can also update the estimate of the machine that was not played, owing to the given $\gamma$.
From the above estimates, each expected reward $Q'_k$ ($k \in \{A, B\}$) is given as follows:

$$Q'_A = N_A\,\frac{N_A - L_A}{N_A} + N_B\left(\gamma - \frac{N_B - L_B}{N_B}\right),$$

$$Q'_B = N_A\left(\gamma - \frac{N_A - L_A}{N_A}\right) + N_B\,\frac{N_B - L_B}{N_B}.$$

These expected rewards, the $Q'_j$, are not the same as those given by the learning rules of TOW dynamics, the $Q_j$ in Eq. (2). However, what we use substantially in TOW dynamics is the difference

$$Q_A - Q_B = (N_A - N_B) - (1 + \omega)\,(L_A - L_B). \tag{13}$$

When we transform the expected rewards $Q'_j$ into $Q''_j = Q'_j/(2 - \gamma)$, we can obtain the difference

$$Q''_A - Q''_B = (N_A - N_B) - \frac{2}{2 - \gamma}\,(L_A - L_B). \tag{14}$$

Comparing the coefficients of Eqs. (13) and (14), the two differences are always equal when $\omega = \omega_0$ (Eq. (9)) is satisfied, that is, when $1 + \omega = 2/(2 - \gamma)$. In this way, we obtain the nearly optimal weighting parameter $\omega_0$ in terms of $\gamma$. This derivation implies that the learning rule for TOW dynamics is equivalent to that of the imaginary system in which both estimates can be updated simultaneously. In other words, TOW dynamics imitates the imaginary system that determines its next move at time $t + 1$ by referring to the estimates of the two machines, even if one of them was not actually played at time t. This unique feature of the learning rule, derived from the fact that the sum of the reward probabilities is given in advance, may be one of the origins of the high performance of TOW dynamics.
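The claimed equivalence can be verified with exact arithmetic. The sketch below is illustrative only and rests on two assumptions that are not spelled out in the surviving text: the estimate for the played machine is its observed reward rate, the estimate for the other machine is $\gamma$ minus that rate, and each expected reward $Q'_k$ weights the two estimates of $P_k$ by the corresponding play counts. All function names are hypothetical.

```python
from fractions import Fraction

def imaginary_diff(n_a, l_a, n_b, l_b, gamma):
    """Q''_A - Q''_B for the imaginary system that knows gamma (assumed estimate forms)."""
    rate_a = Fraction(n_a - l_a, n_a)  # observed reward rate of A
    rate_b = Fraction(n_b - l_b, n_b)  # observed reward rate of B
    q_a = n_a * rate_a + n_b * (gamma - rate_b)
    q_b = n_a * (gamma - rate_a) + n_b * rate_b
    return (q_a - q_b) / (2 - gamma)

def tow_diff(n_a, l_a, n_b, l_b, omega):
    """Q_A - Q_B for the TOW learning rule: +1 per reward, -omega per punishment."""
    return (n_a - n_b) - (1 + omega) * (l_a - l_b)

gamma = Fraction(3, 2)
omega_0 = gamma / (2 - gamma)  # assumed form of omega_0
print(imaginary_diff(10, 3, 5, 4, gamma), tow_diff(10, 3, 5, 4, omega_0))  # prints: 9 9
```

The two differences agree for any counts because the identity $1 + \omega_0 = 2/(2-\gamma)$ holds independently of $N_k$ and $L_k$; the counts used here are arbitrary.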
Monte Carlo simulations were performed, and it was verified that TOW dynamics with $\omega_0$ exhibits an exceptionally high performance, comparable to its peak performance achieved with the optimal parameter $\omega_{\mathrm{opt}}$. To derive the optimal value $\omega_{\mathrm{opt}}$ accurately, we need to take into account the fluctuation and other dynamics of the terminals [20].
In addition, the essence of the process described here can be generalized to M-machine cases. To separate the distributions of the top m-th and top (m + 1)-th machines, as shown in Fig. 2(b), all we need is the following $\omega_0$:

$$\omega_0 = \frac{\gamma'}{2 - \gamma'}, \qquad \gamma' = P_{(m)} + P_{(m+1)}.$$

Here, $P_{(m)}$ denotes the top m-th reward probability. In fact, for M-machine and X-player cases, we have designed a physical system that can determine the overall optimal state, called the "social maximum," quickly and accurately [25].
To further investigate the origins of the high performance of TOW dynamics, let us consider another imaginary model for solving the MBP, called the "cheater algorithm." The cheater algorithm selects a machine to play according to the following estimate $S_k$ ($k \in \{A, B\}$):

$$S_k = \sum_{i=1}^{N} X_{k,i}.$$

Here, $X_{k,i}$ is a random variable that takes either 1 (rewarded) or 0 (non-rewarded). If $S_A > S_B$ at time $t = N$, machine A is played at time $t = N + 1$. If $S_B > S_A$ at time $t = N$, machine B is played at time $t = N + 1$. If $S_A = S_B$ at time $t = N$, a machine is played randomly at time $t = N + 1$. Note that the algorithm refers to the results of both machines at time t without any attention to which machine was played at time $t - 1$. In other words, the algorithm "cheats" because it plays both machines and collects both results, but declares that it plays only one machine at a time.
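The selection rule of the cheater algorithm can be written out in a few lines. This is a minimal sketch under the description above: both machines are sampled at every step, the declared choice follows the larger running sum, and ties are broken at random. The function name, the seed handling, and the choice to return the number of declared plays of machine B are illustrative assumptions.

```python
import random

def cheater(p_a, p_b, steps, seed=0):
    """Run the cheater algorithm; return how often machine B was the declared choice."""
    rng = random.Random(seed)
    s_a = s_b = 0  # running reward sums S_A and S_B
    plays_b = 0
    for _ in range(steps):
        # Declare a machine from the sums collected so far.
        if s_a > s_b:
            choice = "A"
        elif s_b > s_a:
            choice = "B"
        else:
            choice = rng.choice("AB")
        if choice == "B":
            plays_b += 1
        # The "cheat": both machines are sampled regardless of the declared choice.
        s_a += int(rng.random() < p_a)
        s_b += int(rng.random() < p_b)
    return plays_b

print(cheater(0.8, 0.2, steps=1000))
```

Since both sums grow at every step, their difference drifts steadily toward the better machine, so declared plays of the worse machine are expected to occur mostly in the first few steps.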
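The expected value and the variance of $X_k$ are defined as $E(X_k) = \mu_k$ and $V(X_k) = \sigma_k^2$. Here, $\mu_k$ is the same as the $P_k$ defined earlier. From the central-limit theorem, $S_k$ has a Gaussian distribution with $E(S_k) = \mu_k N$ and $V(S_k) = \sigma_k^2 N$. If we define a new variable $S = S_A - S_B$, S has a Gaussian distribution and carries the following values:

$$E(S) = (\mu_A - \mu_B)\, N,$$

$$V(S) = (\sigma_A^2 + \sigma_B^2)\, N, \qquad \sigma(S) = \sqrt{\sigma_A^2 + \sigma_B^2}\,\sqrt{N}.$$

As shown in Fig. 3, the probability of playing machine B, which has the lower reward probability, can be described as $Q\!\left(E(S)/\sigma(S)\right)$. Here, $Q(x)$ is the Q-function, the tail probability of the standard Gaussian distribution. We obtain

$$Q\!\left(\frac{E(S)}{\sigma(S)}\right) = Q\!\left(\varphi \sqrt{N}\right).$$

Here, $\varphi = (\mu_A - \mu_B)/\sqrt{\sigma_A^2 + \sigma_B^2}$. Using the Chernoff bound $Q(x) \le \frac{1}{2} \exp(-x^2/2)$, we can calculate the upper bound of a measure, called the "regret", which quantifies the accumulated losses of the cheater algorithm.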
Note that the regret becomes constant as N increases.
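One way to see why the bound saturates is to sum the per-step Chernoff bounds. The sketch below does not reproduce the source's expression for the regret; it assumes that the regret is $(\mu_A - \mu_B)$ times the expected number of plays of machine B, and that each step t contributes at most $\frac{1}{2}\exp(-\varphi^2 t/2)$ to that number. The resulting sum is a geometric series, so it converges to a constant. Function names are hypothetical.

```python
import math

def phi(mu_a, mu_b, var_a, var_b):
    """Separation parameter: (mu_A - mu_B) / sqrt(sigma_A^2 + sigma_B^2)."""
    return (mu_a - mu_b) / math.sqrt(var_a + var_b)

def regret_bound(mu_a, mu_b, var_a, var_b, n):
    """Assumed bound: (mu_A - mu_B) * sum over t = 1..n of (1/2) exp(-phi^2 t / 2)."""
    p = phi(mu_a, mu_b, var_a, var_b)
    plays_b = sum(0.5 * math.exp(-p * p * t / 2) for t in range(1, n + 1))
    return (mu_a - mu_b) * plays_b

def regret_limit(mu_a, mu_b, var_a, var_b):
    """Limit of the geometric series as n grows without bound."""
    p = phi(mu_a, mu_b, var_a, var_b)
    r = math.exp(-p * p / 2)
    return (mu_a - mu_b) * 0.5 * r / (1 - r)

# Bernoulli rewards with mu_A = 0.7 and mu_B = 0.3, so each variance is mu * (1 - mu) = 0.21.
for n in (10, 100, 1000):
    print(n, regret_bound(0.7, 0.3, 0.21, 0.21, n))
print("limit", regret_limit(0.7, 0.3, 0.21, 0.21))
```

The partial sums stop growing once $\varphi^2 N$ is large, which is the sense in which a bound of this kind is constant in N.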
Using the "cheated" results, we can also calculate the regret of TOW dynamics in the same way. In this case, $X_{k,i}$ is also a random variable that takes either 1 (rewarded) or 0 (non-rewarded). Here, we use $L_k = (1 - \mu_k) N_k$. Then, we obtain $E(S_k) = \{\mu_k - (1 - \mu_k)\,\omega\} N_k$ and $V(S_k) = \sigma_k^2 N_k$. Using the new variables $S = S_A - S_B$, $N = N_A + N_B$, and $D = N_A - N_B$, we also obtain

$$E(S) = \frac{(\mu_A - \mu_B)(1 + \omega)}{2}\, N + \frac{(\mu_A + \mu_B)(1 + \omega) - 2\omega}{2}\, D,$$

$$V(S) = \frac{\sigma_A^2 + \sigma_B^2}{2}\, N + \frac{\sigma_A^2 - \sigma_B^2}{2}\, D.$$

If the conditions $\omega = \omega_0$ and $\sigma_A = \sigma_B \equiv \sigma$ are satisfied, the terms proportional to D vanish, and we then obtain

$$E(S) = \frac{(\mu_A - \mu_B)(1 + \omega_0)}{2}\, N, \qquad V(S) = \sigma^2 N,$$

and

$$Q\!\left(\frac{E(S)}{\sigma(S)}\right) = Q\!\left(\varphi_T \sqrt{N}\right).$$

Here, $\varphi_T = (\mu_A - \mu_B)(1 + \omega_0)/(2\sigma)$. We can then calculate the upper bound of the regret for TOW dynamics. Note that the regret for TOW dynamics also becomes constant as N increases. It is known that optimal algorithms for the MBP, defined by Auer et al., have a regret proportional to $\log(N)$. Their regret has no finite upper bound as N increases because such algorithms must continue to play the lower-reward machine to ensure that the probability of incorrect judgment goes to zero. A constant regret means that the probability of incorrect judgment remains non-zero in TOW dynamics, although this probability is nearly equal to zero. However, reward probabilities are likely to change frequently in actual decision-making situations, so long-term behavior is not crucial for many practical purposes. For this reason, TOW dynamics would be more suited to real-world applications.
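The role of $\omega_0$ in this calculation can be checked with exact arithmetic: with that weighting, the expected value of S depends only on the total number of plays N and not on how the plays are split between the machines. The sketch below uses the expression $E(S_k) = \{\mu_k - (1 - \mu_k)\,\omega\} N_k$ from the text and assumes $\omega_0 = \gamma/(2-\gamma)$ with $\gamma = \mu_A + \mu_B$; the function name and the sample values are illustrative.

```python
from fractions import Fraction

def expected_s(mu_a, mu_b, omega, n_a, n_b):
    """E(S) = E(S_A) - E(S_B) with E(S_k) = {mu_k - (1 - mu_k) * omega} * N_k."""
    e_a = (mu_a - (1 - mu_a) * omega) * n_a
    e_b = (mu_b - (1 - mu_b) * omega) * n_b
    return e_a - e_b

mu_a, mu_b = Fraction(7, 10), Fraction(1, 2)
gamma = mu_a + mu_b
omega_0 = gamma / (2 - gamma)  # assumed form of omega_0; equals 3/2 here

# Same total N = 100, different splits D = N_A - N_B.
print(expected_s(mu_a, mu_b, omega_0, 80, 20))  # prints: 25
print(expected_s(mu_a, mu_b, omega_0, 50, 50))  # prints: 25
# With a different weighting the split matters.
print(expected_s(mu_a, mu_b, Fraction(1), 80, 20), expected_s(mu_a, mu_b, Fraction(1), 50, 50))  # prints: 32 20
```

The common value 25 equals $(\mu_A - \mu_B)(1 + \omega_0) N / 2$ for these inputs, which is consistent with the definition of $\varphi_T$.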
In this Letter, we proposed TOW dynamics for solving the MBP and analytically validated that their high efficiency in making a series of decisions for maximizing the total sum of stochastically obtained rewards is embedded in any volume-conserving physical object when subjected to suitable operations involving fluctuations. In conventional decision-making algorithms for solving the MBP, the parameter for adjusting the "exploration time" must be optimized. This exploration parameter often reflects the difference between the rewarded experiences, i.e., $|P_A - P_B|$. In contrast, TOW dynamics demonstrates that a higher performance can be achieved by introducing a weighting parameter $\omega_0$ that refers to the sum of the rewarded experiences, i.e., $P_A + P_B$. Owing to this novelty, the high performance of TOW dynamics can be reproduced when implementing these dynamics with various volume-conserving physical objects. Thus, our proposed physics-based analog-computing paradigm would be useful for a variety of real-world applications and for understanding the biological information-processing principles that exploit their underlying physics.