Bandit Algorithm Driven by a Classical Random Walk and a Quantum Walk

Quantum walks (QWs) have a property that classical random walks (RWs) do not possess—the coexistence of linear spreading and localization—and this property is utilized to implement various kinds of applications. This paper proposes RW- and QW-based algorithms for multi-armed-bandit (MAB) problems. We show that, under some settings, the QW-based model realizes higher performance than the corresponding RW-based one by associating the two operations that make MAB problems difficult—exploration and exploitation—with these two behaviors of QWs.


Introduction
Random walk (RW) is one of the most ubiquitous stochastic processes and is employed for both mathematical analyses and applications, such as describing real-world phenomena and constructing various algorithms. Meanwhile, along with the increasing interest in quantum mechanics from both theoretical and applied perspectives, the quantum counterpart of an RW, known as a quantum walk (QW), is also attracting attention [1][2][3][4]. A QW includes the effects of quantum superposition or time evolution. In classical RWs, a random walker (RWer) selects in which direction to go probabilistically at each time step, and thus, one can track where the RWer is at any time step. On the other hand, in QWs, one cannot tell where a quantum walker (QWer) exists during the time evolution, and the location is determined only after conducting the measurement.
QWs have a property that classical RWs do not possess: the coexistence of linear spreading and localization [5,6]. As a result, QWs show probability distributions that are totally different from those of random walks, which weakly converge to normal distributions. The former behavior, linear spreading, means that the standard deviation of the probability distribution of measurement of quantum walkers (QWers) grows in proportion to the run time t. In the case of discrete-time RWs on a one-dimensional lattice Z, denoting the random variable of the position where a walker is measured at time t ∈ N 0 = N ∪ {0} by X (RW) t , then the standard deviation is D[X (RW) On the other hand, in discrete-time QWs on Z, the standard deviation of a walker's position at time t is D[X (QW) t ] = O(t), and thus, discrete-time QWs outperform RWs in terms of the propagation velocity [7]. The latter behavior, localization, implies that the probability is distributed at a particular position no matter how long the walk runs. In the classical RWs, the probability distribution becomes flat despite keeping a bell-shaped curve; that is, localization is not observed.
QWs were first introduced in the field of quantum information theory [7][8][9]. The idea of weak convergence, which is frequently used in probability theory, was introduced to show the properties of QWs [10,11], and since then, quantum walks have been actively studied from both fundamental and applied perspectives. In fundamental fields, there have been many attempts to analyze these evolution models mathematically [6,[12][13][14][15][16][17][18][19][20][21][22] due to varying behavior of QWs depending on the conditions or settings of time and space. In applied fields, their unique behavior is useful for implementing quantum structures or quantum analogs of existing models; therefore, various QW-based models have been considered for subjects such as time-series analysis [23], topological insulators [24,25], radioactive waste reduction [26,27], and optics [28,29]. In addition, the contribution to quantum information technology is becoming more prominent these days. QWs have been applied not only to the principle of technologies such as quantum search, quantum teleportation [30,31], and quantum key distribution [32] but also to the implementation of quantum gates themselves [33,34].
Throughout the extensive research conducted, considerable attention has also been devoted to the models of QWs themselves. Initially, QW models were introduced in the form of coined QWs, wherein the time evolution of walkers is considered through the utilization of unitary matrices called coin matrices, analogous to transitions in RWs [9]. According to the No-Go Lemma [35,36], we need the additional subspace, namely the coin space to each vertex, to construct a non-trivial unitary time evolution on the cycle, introduced later. This model remains one of the most intuitive models and is widely studied or applied even in contemporary times. On the other hand, depending on the research interests, some literature also explores QW models that do not incorporate coins [37][38][39], which represent a generalization of quantum cellular automata [35]. The coinless QW models seem to achieve reduced computational costs, but it is shown that such QW models are unitarily equivalent to coined QW models [40]. In this paper, we focus on using the coined QW models because adopting the coined QWs is reasonable when conducting a comparison with RWs. In the following, when referring to QWs, it indicates the coined QWs.
This paper proposes new solution schemes for multi-armed bandit (MAB) problems [41] using RWs and QWs. In the MAB problems, we consider a situation where there are multiple slot machines in an environment, each gives a reward with a probability allocated to it, and an agent iterates the selection of slot machines and probabilistic gain of the rewards and tries to maximize the total reward. Initially, the agent has no information about the probability of giving rewards, especially which slot machine has the maximum probability, which we call the best slot machine. Thus, it is required to accumulate such information through a certain number of selections, an action which we call exploration. On the other hand, too much exploration will use up the opportunities for selecting the better slot machines that have already been found; that is, it is also necessary to spend some rounds to bet on slot machines that are reliable based on the information obtained, which we call exploitation. The difficulty of MAB problems occurs under the balance between these two operations, known as the exploration-exploitation trade-off [42].
One of the purposes of this study is to show that, by utilizing QWs, we can construct an algorithm to solve the MAB problem. This algorithm can outperform models implemented by RWs in terms of the total rewards under some settings. Our idea to realize this is to address this dilemma derived from the exploration-exploitation trade-off by utilizing a unique property of QWs, i.e., the coexistence of linear spreading and localization. More precisely, we combine exploration with linear spreading, and exploitation with localization, as shown in Figure 1. By utilizing linear spreading, we intend to cover the whole environment and prevent us from missing some slot machines. In addition, by applying localization, we intend to mark slot machines that should be recommended with a high probability distribution. This paper introduces a QW-based algorithm for MAB problems, which realizes these combinations. Our study focuses on three-state site-dependent QWs on a cycle. The behavior of these walks corresponds to that of lazy random walks, with walkers moving clockwise, anti-clockwise, or staying in place in superposition. Giving three states to QWs enables us to obtain a high existence probability at the initial position of a QWer. In addition, site-dependent coin matrices make it possible to trap or dam the QWer on certain vertices. By taking advantage of these, we attempt to bring about a high probability on a vertex whose slot machine should be recommended. To facilitate a clear comparison, we also construct an RW-based model wherein lazy RWs occur on cycles, and the transition probabilities depend on the position from which the walkers depart. While the QW-based model possesses the coexistence of linear spreading and localization, the RW-one does not; while the results depend on the specifics of the MAB problem, our study reveals that, for certain settings, the different properties of QWs and RWs also lead to a significant difference in total rewards between both algorithms. The rest of this paper is organized as follows. First, in Section 2, we present an algorithm for MAB problems based on the RW, which is more intuitive than the QWbased one. Then, in Section 3, we introduce a system of discrete-time quantum walks on a cycle and the QW-based algorithm for MAB problems. In Section 4, we show some results for numerical simulations of the RW-and QW-based models and compare the performance between the two models. Section 5 concludes this paper and discusses the future possibilities of our work.

Random-Walk-Based Model for MAB Problem
This section presents an MAB algorithm implemented using a discrete-time random walk (RW) on cycles. The RW model presented in this paper describes the walkers that can stay at the same position, which we often call a lazy random walk. First, we present the mathematical system of the lazy RW, and then we construct the MAB algorithm based on it.

Random Walk on Cycles
Assume that cycle C N is composed of N vertices and edges. Here vertices are labeled by set V N := {0, 1, · · · , N − 1}, and the label is ordered clockwise. Thus, the set of edges is given by E N := {x, x + 1} | x ∈ V N , where one applies addition and subtraction modulo N to V N ; i.e., (N − 1) + 1 ≡ 0 and 0 − 1 ≡ N − 1. In other words, V N is isometric to Z/NZ.
We assume that the position of a walker is determined as follows: • A walker initially exists at position s ∈ V N . • At each time step, a walker at position x moves one unit clockwise with probability q(x), moves one unit anti-clockwise with probability q(x) or stays at the current position with probability 1 − 2q(x).
Here the probabilities of moving clockwise and anti-clockwise are equal to each other in this paper. This is due to correspondence with the setting of the QW presented later; therein, the setting of the initial state of the QW gives symmetric probability distributions when the coin matrix is homogeneous for the space. Note that q(x) should satisfy 0 ≤ q(x) ≤ 1/2 under this condition, and this setting is equivalent to the simple RW on cycles in the case of q(x) = 1/2 for all x ∈ V N .
Such an RW is mathematically constructed as follows. For N 0 = {0, 1, · · · }, let {X t } t∈N 0 be the sequence of random variables that represent the position of a walker at time step t. The next position X t+1 depends on the current one X t , and the conditional probability is determined as follows: with x ∈ V N . We recall that V N Z/NZ; the equation above includes Here we denote the probability that a walker is at position x at time step t by ν (t) (x): Then, the relation Moreover, by Equations (1) and (2), ν (t) (x) varies as follows:

Random-Walk-Based Algorithm
We consider an N-armed bandit problem with cycle C N ; each vertex x ∈ V N is given a slot machine that gives a reward with the probability p(x). In the following, each slot machine is identified by the same label as the corresponding slot machine; for example, we call the slot machine on the vertex x slot machine x. In addition, we call probability p(x) the success probability of slot machine x. Moreover, we denote the slot machine with the best success probability in V N by x * ; that is, and we call it the best slot machine.
The principle consists of the following four steps: [STEP 0] initializing the quantum walk settings, [STEP 1] running random walks, [STEP 2] playing the selected slot machine, and [STEP 3] updating the quantum walk settings. After finishing [STEP 3], the process returns to [STEP 1]. We call the series of the last three steps ([STEP 1-3], shown in Figure 2) a decision, and decisions are iterated J times over a run. Here, we use the following notations:  • s j ∈ V N : Initial position of random walk in the j-th decision. • q j (x) ∈ [0, 1/2]: Clockwise-transition probability and anti-clockwise-transition probability in the j-th decision. •x j ∈ V N : Vertex (slot machine) measured in the j-th decision. •r j ∈ {0, 1}: Reward on the j-th decision. This value is probabilistically determined by the Bernoulli distribution Ber(p(x j )); that is, • H j (x): Number of decisions where the slot machine x is selected until the j-th decision. • L j (x): Number of decisions where the slot machine x gives the reward until the j-th decision. •p j (x): Empirical probability that the slot machine x gives the reward on the j-th decision: [STEP 0] RW-setting initialization For the first decision, the settings of the random walk are determined as follows: • Initial position s 1 : Probabilistically determined by the uniform distribution on V N . • Transition probability: After finishing this step, the process iterates the following three steps.
[STEP 1] Random walk Random walks are run over T time steps with the initial position s j and transition probability q j (x), and the valuex j ∈ V N is obtained following probability distribution ν [STEP 2] Slot machine play The slot machinex j ∈ V N obtained at [STEP 1] is played. Then, the reward (r j = 1) is obtained with probability p(x j ).
Here Hand L-values are updated. First, the H-value onx j is incremented: Entropy 2023, 25, 843 6 of 20 Ifr j = 1, the L-value onx j is also incremented (otherwise, the value is maintained): For x =x j , the Hand L-values are maintained: Based on that,p j (x)s are updated.
[STEP 3] RW-setting adjustment Using the newp j (x)s, the settings of quantum walks are updated for the next decision. The new initial position is defined as Moreover, the new transition probabilities are determined as where a, b ≥ 1, and q • are defined in [STEP 0]. Note that the q-value monotonically decreases about the empirical success probability; that is, ifp j (x) is larger, then q j+1 (x) is smaller. By setting the new initial state and q-value in these manners, we aim at confining walkers to the desired position while concurrently affording them opportunities to depart when the current decision is uncertain. The parameters a and b control the strength of the effect byp j (x); the details are given in Appendix A. After this step, the process returns to [STEP 1].

Quantum-Walk-Based Model for MAB Problem
This section presents the MAB algorithm implemented by the discrete-time quantum walk (QW) on cycles. The difference between QW and RW lies in whether one handles the quantum superposition of states pertaining to the walker's positions. Herein, the transition of walkers at each time step is also superposed; that is, it is uncertain even after the time step which transition occurs: moving clockwise, anti-clockwise, or staying in place. We introduce probability amplitude vectors and coin matrices in the QW to describe the quantum superposition and its time evolution.
The QW model employed in our study is a three-state QW on a cycle, which can be naturally reduced to a finite space from the one-dimensional lattice model [5]; see, for example, [18,43,44]. First, we explain the definition of our QW in detail, and then we present the MAB algorithm based on it.

Quantum Walk on Cycles
Assume that cycle C N is constructed in the same manner as in Section 2; that is, it is the graph established by the set of vertices V N := {0, 1, · · · , N − 1} and edges where one applies addition and subtraction modulo N to V N .
The space of the probability amplitude vectors driving our QW is defined in a compound Hilbert space consisting of the position Hilbert space H P and the coin Hilbert space H C . The position Hilbert space H P is spanned by the unit vectors corresponding to the vertices on C N ; i.e., H P = span{|x | x ∈ V N }. Here we require them to be mutually orthogonal, which is equivalent to satisfying the relation y|x = δ y (x) for any x, y ∈ V N , where δ y is the delta function defined by Equation (7). Then, H P C N holds. The coin Hilbert space H C pertains to the internal state of walkers. In this model, we assume that there exist three internal states: clockwise (+), anti-clockwise (−), and staying (O). We define the three-dimensional unit vectors corresponding to them as |− = [1 0 0] T , |O = [0 1 0] T , and |+ = [0 0 1] T , where a superscript T on a matrix represents its transpose, and construct the coin Hilbert space as H C = span{|− , |O , |+ }. Here you see that H C = C 3 . Based on H P and H C , the whole system is described by Then the total state of our QW at time t ∈ N 0 is represented as follows: there exists Here, t ∈ N 0 represents time step of QWs, and |ψ (t) (x) ∈ C 3 is called the probability amplitude vector at position x ∈ V N at run time t. We set the initial state as where s ∈ V N , and |ϕ ∈ C 3 is a constant vector with ϕ = 1. In this paper, we fix |ϕ to |O , realizing a symmetric probability distribution about the initial position when the coin matrix defined later is homogeneous for positions. Now, we introduce the time evolution of |Ψ (t) by Here U is the unitary operator, referred to as the time evolution operator, and is composed of shift operator S and coin operator C: and S and C are given by and Here S is defined as and represents the clockwise transition, and then indicates the anti-clockwise transition. The identity matrix I N corresponds to staying in place. Here, note that N ≡ 0 on V N ( Z/NZ); that is, for example, in the case of N = 4, and C(x) is a unitary matrix called a coin matrix, which is defined as follows: is reduced to the Grover matrix, which is important in quantum searching [45]. Let us explain the equivalent expression for the time evolution operator U, which is useful to understand the dynamics of our QW: by applying the property of the Kronecker product, we have where The matrices P(x), Q(x), and R(x) are considered to be the decomposition elements of C(x); that is, the relation P(x) + Q(x) + R(x) = C(x) holds. They describe the matrixvalued weight of a clockwise transition, an anti-clockwise transition, and staying in place, respectively, corresponding to the transition probabilities of the RW, as shown in Figure 3. By Equations (21) and (30), we have Moreover, from the initial state (20), there exists a 2-dimensional matrix Ξ (t) (x) such that Here Ξ (t) (x) describes the weight of all the possible paths from the origin to the position x at run time t. From Equation (34), the following relation holds:  Finally, the measurement probability of the particle at position x at run time t, denoted by µ (t) (x), is given by Setting random variable X t following the distribution µ (t) , we call X t the position of a QWer at time t. This definition is based on the Born probability interpretation in quantum mechanics. Note that for any t ∈ N 0 , the following is satisfied:

Quantum-Walk-Based Algorithm
We consider an N-armed bandit problem with cycle C N ; each vertex x ∈ V N is given a slot machine that gives a reward with the probability p(x), identically to the RW-based model in Section 3.
The principle is also similar to the RW-based model: First, the QW settings are initialized ([STEP 0]), and then decisions are iterated ([STEP 1-3]) J times over a run. As shown in Figure 4, the QW-based model controls the coin matrix C(x) by adjusting the value of the parameter θ(x) instead of the transition probability q(x) in the RW-based model. Here we use the following notations: • |Φ j ∈ H PC : Initial state of quantum walk on the j-th decision. • s j ∈ V N : Initial position of quantum walk on the j-th decision. • θ j (x) ∈ [0, 2π): Parameter of Equation (24) on vertex x on the j-th decision; then the coin matrix there is C(x). •x j ∈ V N : Vertex (slot machine) measured on the j-th decision. •r j ∈ {0, 1}: Reward on the j-th decision, which follows the Bernoulli distribution Ber(p(x j )):r •p j (x): Empirical probability that slot machine x gives the reward on the j-th decision:   Parameter of coin matrices: θ 1 (x) = θ • ∈ [0, 2π) for all x ∈ V N . After finishing this step, the run iterates the following three steps.
[STEP 1] Quantum walk Quantum walks are run over T time steps with the initial position s j and the parameter θ j (x). After running T steps of time evolution, the QWer is measured to obtain the valuê x j ∈ V N following probability distribution µ (T) j (x).
[STEP 2] Slot machine play The slot machinex j ∈ V N obtained at [STEP 1] is played. Then, the reward (r j = 1) is obtained with probability p(x j ).
Here Hand L-values are updated. First, the H-value onx j is incremented: Ifr j = 1, the L-value onx j is also incremented (otherwise, the value is maintained): For x =x j , the Hand L-values are maintained: Based on that, thep j (x) values are updated.
where s j is the provisionally best machine: Moreover, the new parameters of the coin matrices are determined as where a, b ≥ 1, and θ • are defined in [STEP 0]. Note that the θ-value is defined similarly to the q-value in the RW-based model; that is, ifp j (x) is larger, then θ j+1 (x) is smaller. When the θ-value at a certain position x L ∈ V N is updated, the difference between C(x L ) and C(x) with x = x L ± 1 emerges, wherein x L is often called a defect. If a defect exists at the initial position, the coin matrix, depending on the θ-value, controls the strength of localization there. Incidentally, whenp j (x) is large, θ j+1 (x) can be almost 0. If θ(x) in Equation (24) is exactly 0, then C(x) is the identity matrix. This means that, if the initial position of the walker has a coin matrix with θ(x) = 0, the walker is completely trapped there because the internal state is set to be |O . Thus, a larger empirical success probability indicates a strong localization if it is provisionally best. However, ifp j (x) is not large, this phenomenon is relaxed. In short, this θ-value plays a role corresponding to the q-value in the RW-based model; that is, it confines walkers to the desired position while concurrently affording them opportunities to depart when the current decision is uncertain. Regarding the analysis of θ-values, also see Appendix A. After this step, the process returns to [STEP 1].

Numerical Simulations
In this section, we give and compare simulation results for the RW-and QW-based models. Assume that the RW-and QW-based models are run in parallel K times, respectively, and each run is labeled by the set {1, 2, · · · , K}. We indicate that the parameters are in the k-th run by a subscript next to the number of iterations; for example, the reward in the j-th decision in the k-th run is denoted byr j, k .
As figures-of-merit, we define quantities M(j), ρ(j), and CDR(j): M(j) indicates the mean of total rewards until the j-th decision over K runs. The aim of the proposed models is to make M(j) as large as possible. ρ(j) is the mean of cumulative regret until the j-th decision over K runs. The cumulative regret is equal to the difference in expectations of total reward between the case where only the best machine is selected until the j-th decision and that of actual selections until then. CDR(j) is the correct decision rate of the j-th decision, which is the ratio of the number of runs in selecting the best slot machine to the total number of runs K. Herein, δ y for y ∈ V N is the delta function defined by Equation (7).
The parameter values used for this series of simulations are summarized in Table 1. The success probabilities of slot machines are given as shown in Figure 5; that is, 14) 0.1 (x = 15) 0.7 (x : even except for 14) 0.5 (x : odd except for 15) . (51) Herein, the best slot machine is x * = 14. Recall that the agent cannot directly access all information regarding the success probabilities of the slot machines above. The tuples for the QW-and RW-based models are selected as one of the best performers in a certain range of parameters on each model. The details about parameter-dependencies of both models are found in Appendix A. The blue and orange curves in Figure 6a-c demonstrate the performances of RW-and QW-based models as the variations of the mean of total reward M(j), the cumulative regret ρ(j), and the maximum value of CDR over the number T of time steps of walks for single decision-making, respectively. The total reward and the cumulative regret are taken for the final decision J = 5000. The maximum value of CDR is taken over J decisions; that is, max(CDR) := max j=1, ··· , J CDR(j). (52) For T ≥ 4, we observe that M(5000) and max(CDR) of the QW-based model are larger than those of the RW-based model. On the other hand, for the cumulative regret ρ(5000), the value for the QW-based model is lower than that for the RW-based model. Both results indicate that the performance of the QW-based model is superior to that of the RW-based model. You see the particular difference in the growth of M(5000) and max(CDR) over the variation of T between the QW-and RW-models; the gradient of the orange curves (QW) in the range of 2 ≤ T ≤ 8 in M(5000) and max(CDR) is larger than that of the blue curve (RW). In addition, the QW-based model has higher suprema of M(5000) and max(CDR) than the RW-based model. Similar discussions are also made for regret ρ(5000). In both the RW-and QW-models ρ(5000) decreases over the variation of T in the range of 2 ≤ T ≤ 8, but the gradient of the QW-model is steeper than that of the RW-one. Moreover, the QW-based model has a lower infimum than the RW-based model.
These results depend on a variety of choices; in particular, the casino setting and the parameters a and b. Indeed, the QW-based model shows a faster growth in reward for many of a and b compared to the RW-based model (see Appendix A). However, there are also settings where this relationship is reversed. We have also observed that in the limit of very large a and b, when fine-tuned to a specific casino setting, the performance of both RW and QW can grow even faster than the results shown in Figure 6. We speculate that linear spreading and localization may lose their advantage for certain settings.  Table 1.
You can see the contribution of linear spreading and localization from the behavior of variation of decision-making and probability distributions for making a decision, which is particularly apparent for smaller T. Figures 7 and 8, respectively, show the precise performances of runs of the RW-and QW-based models whose resultant total rewards M(J) were almost equal to the average value. Herein, the number of time steps T is set to be 8 for each model, and the other parameters are set as in Table 1. Figure 7a,b indicate the relationship between the decision j and the selected slot machinex j for the RW-and QW-based models, respectively. From this figure, you can see that the decision-making in the QW-based model almost converges to x = x * near j = 1200, while that in the RW-based model does so near j = 1400. This means that exploration in the QW-based model is more successful than that in the RW-based model. Linear spreading makes the probability distribution of QWs wider, whose variance is larger than that of RWs, which results in faster exploration of the QW-based model. Moreover, the behavior of the QW-based model after finding x = x * is more stable than that of the RW-based model, which indicates that the QW-based model also realizes more effective exploitation than the RW-based one for this set of parameters. j )-based models over decision j shown in Figure 8a,b, respectively. You see that, for smaller j, the probability of QW is more widely distributed than that of RW, although the values are quite small except for a certain position. It is important that the probabilities are distributed in a wider range even if they are quite small because it indicates that the agent has more selections, which is crucial to realize exploration. As a result, the QW-based model can obtain a high measurement probability of the best slot machine at j = 1200 at the latest, while the RW-based model does at j = 1400, which corresponds to the convergence of the decision-making shown in Figure 7. After beginning the concentrated investment to x = x * , the measurement probability of walkers on the QW is almost 1, while the corresponding probability on the RW is around 0.9, which shows that strong localization occurs on vertex x * after finding the slot machine there. This phenomenon contributes to exploitation; as you see in Figure 7, the QW-based model after finding the best slot machine is much more likely to select it than the RW-based one, and  Table 1. Each run is selected as the one whose resultant total rewards M(J) were almost equal to the average value: M(J) = 4063 in (a), and M(J) = 4200 in (b). The black, sky blue, gray, and light green lines indicate the slot machines whose success probabilities are 0.9, 0.7, 0.5, and 0.1, respectively.  Figure 7; that is, for both settings, the number of time steps T is set to 8, other parameters are determined as in Table 1, and each run is selected as the one whose resultant total rewards M(J) were almost equal to the average value: M(J) = 4063 in (a), and M(J) = 4200 in (b).

Conclusions and Discussion
This paper has proposed new solution schemes for multi-armed bandit (MAB) problems using random walks (RWs) and quantum walks (QWs). We demonstrated that we could find parameter regimes where the QW-based model performs better than the RWbased model by addressing the exploration-exploitation dilemma by utilizing a unique property of QWs, i.e., the coexistence of linear spreading and localization. Our idea was to combine exploration with linear spreading and exploitation with localization. By utilizing linear spreading, we expect the QWs to cover the whole environment to prevent it from missing some slot machines. In addition, by applying localization, we expect the QWs to identify the slot machine that should be recommended with a high probability distribution. Indeed, we showed that, under some settings, linear spreading contributes to exploring the environment and quickly finding the best slot machine, and localization contributes to exploiting the best slot machine more frequently.
The positive results obtained in this study open the possibility for further extensions of this approach. First, can we apply this algorithm to the case of multi-agent systems such as competitive or adversarial bandit problems [46][47][48] with some revision? Especially for the QW-based model, we will examine its application to the use of coin matrices implemented by multiple registers or to drive walkers on a torus. Moreover, there are possibilities for constructing application models in the single-agent case. For example, we may construct an evolved version of the QW-based model, including a quantum version of the optimal stopping problem.
Moreover, analyses of our models are also important. The performances obtained in Section 4 should depend on the number of slot machines (i.e., vertices) N, the true success probability p(x), and the parameter settings (a, b, q • ) or (a, b, θ • ). Obtaining theoretical formulae for the figures-of-merit would be desirable, which would make our results more confident. Specifically, it is not immediately evident, given the parameter configurations outlined in this paper, that the QW-based model outperforms the RW-based model in general. Our investigations have focused solely on pairs (a, b) with small values; in the event that larger values are assigned to either a or b, the RW-based model can exhibit superior performance compared to any QW-based model (assuming the parameters are chosen correctly for a specific casino setting). Indeed, we have confirmed that the RW-based model with tuple (a, b, q • ) = (50, 20, 0.5) performs as well as the QW-based one with tuple (a, b, θ • ) = (50, 20, 29π/64), and in both results, the mean of total reward is around 4320 even in setting the number of time steps T to 8 (better than the results for small a and b as shown in Figure 6). However, sufficiently addressing the parameter dependency and its interplay with the casino settings is a highly complex problem at this stage. We provide our existing analysis in Appendix A.
Several factors contribute to the complexity of a rigid mathematical treatment of this problem, but the one that should be remarked on is the position-dependency of the coin matrices in the QW-based model. Solving QW-based models with site-dependent coins is very difficult, in general. While some studies have addressed this matter for the case where the coin matrix only on the origin differs from the others [6,22] or that the coin matrices are controlled by the trigonometric function whose input is in proportion to the label position [12,14,21], the generalized case remains an open problem. To conduct a thorough analysis of this model, it is necessary to accumulate analytical results for site-dependent quantum walks over an extended period of time.

Data Availability Statement:
The data that support the findings of this study are available from the corresponding author upon reasonable request.

Appendix A. Parameter-Dependencies of RW-and QW-Based Models
We analyze the (a, b, q • )-dependency of the RW-based model and the (a, b, θ • )dependency of the QW-based model.
First, we introduce the function f : [0, 1] → R defined as follows: with a, b ≥ 1 and c ≥ 0. Variance u corresponds to empirical success probabilityp j (x) in both cases, and f (u) to q j+1 (x) and θ j+1 (x) in Equations (17) and (47), respectively. Note that q j+1 (x) and θ j+1 (x) are controlled by position x throughp j (x); in that sense, they are the functions of the empirical success probability. Additionally, parameter c is represented by q • and θ • in the RW-and QW-based models, respectively. By the property of exponential functions, f (u) monotonically decreases; the maximum and minimum of f (u) are f (0) = c and f (1) = c exp(−a), respectively. This indicates that, if c is fixed, the minimum of f (u) is determined by a and smaller in larger a as shown in Figure A1a. Furthermore, if both a and c are fixed, the maximum and minimum of f (u) are constant; b solely governs how f (u) decreases along with the growth of u. Precisely, larger b makes the gradient of f (u) negatively larger as shown in Figure A1b.
In the following, we fix the number T of time steps of the walk for single decisionmaking to 32; the tendency is common for any T. The numbers of slot machines, runs, and decisions for a single run are equal to those of the simulation in Section 4; that is, (N, K, J) = (32, 500, 5000). In the RW-based model, f (u) directly controls the behavior of walkers as the q-value given by Equation (17), which indicates the transition probability clockwise and anticlockwise. By iterating the decision-making, the empirical success probabilityp j (x) is likely to be p(x) given by Equation (51). Especially, the algorithm should experience many plays with slot machine x with p(x) = 0.9 or 0.7; that is, considering f (0.9) and f (0.7) is particularly important. First, f (0.9) should be smaller because one demands walkers stay at x = x * once they find the slot machine due to exploitation. To realize it, larger a and smaller b are desirable. On the other hand, f (0.7) should be larger because one requires walkers to be active and explore more. To realize it, the opposite of the previous condition is desirable; i.e., smaller a and larger b. Here exploration-exploitation trade-off emerges as the balance of a and b. That is, it is required to choose a and b with an appropriate proportion (and this balance depends on the casino setting). If a is large, this leads to a small f (u) on slots with highp j (x), making it likely for the walker to stay on that slot and play that machine. This way, one can realize the exploitation of the best slot machine, but it may also lead to the over-exploitation of a non-best slot machine. Importantly, if b is large, one can save walkers from sticking to a non-best slot machine, but this also may result in over-exploration even after finding the best slot machine. Figure A2 demonstrates the performance of the RW-based model depending on tuple (a, b, q • ). You see that the larger b becomes, the larger a is required for improvement of the performance, which matches the analysis of f (u) above. If the value of a is small under the large value of b, the algorithm conducts exploration too much. On the other hand, setting a too large is not demanded; this case results in over-exploitation. Indeed, you see that the performance of (a, b, q • ) = (9, 2, 0.5) is worse than that of (a, b, q • ) = (7, 2, 0.5). In our results, the best performance is obtained for the case of (a, b) = (9, 6), where max(CDR) is over 0.95 for q • ≥ 0.25 in Figure A2(c3). In Section 4, we selected tuple (a, b, q • ) = (9, 6, 0.5) as the representative of this range for the analysis in the main manuscript.
In the QW-based model, f (u) contributes to controlling the behavior of walkers as the θ-value given by Equation (47), which plays the role of the phase of the coin matrix. The measurement probability on the defect strongly depends on the coin matrices on the defect and its neighborhood, and thus, θ • pertains to the performance much more than q • for the RW-based model. Moreover, the measurement probability of the defect is not monotonic over θ-values, whose analysis is still open. However, what we can say from f (u) is that the exploitation will be frequently conducted if a is small b is large, which is the same as the RW-base model because the θ-value near 0 distributes the probability of almost 1 on the defect as mentioned on [STEP 3] in Section 3.2.  Figure A3 demonstrates the performance of the QW-based model depending on tuple (a, b, θ • ). You see that some cases exhibit extreme results; the results can be both almost the same as the worst one in the RW-based model and better than any in the RW-based model. We observe that when setting a = 1, θ • makes the performance worse with violent oscillations. Especially, the case of (a, b, θ • ) = (1, 6, 29π/64) performs worst due to overexploitation, which is almost the same as the worst one in the RW-based model. For other cases of a, performance tends to improve over the growth of θ • . Especially in the case of b = 6, max(CDR) is almost 1 for the sufficiently large θ • as shown in Figure A3(c3), which is better than any case in the RW-based model. This indicates that exploration should be introduced by large b-values. Particularly, the case of (a, b) = (5, 6) realizes performs very well. Based on the above, it is important to choose the appropriate tuples for obtaining the result of the QW-based model that outperforms the RW-based model. In Section 4, we selected tuple (a, b, θ • ) = (5, 6, 5π/16) as the representative of this case.
As mentioned in the conclusions of the main manuscript, we have confirmed that the RW-based model with tuple (a, b, q • ) = (50, 20, 0.5) performs as well as the QW-based one with tuple (a, b, θ • ) = (50, 20, 29π/64), and in both results, the mean of total reward is around 4320 even in setting the number of time steps T to 8 (better than the results for small a and b as shown in Figure 6. In this case, f (u) is basically a step-function with f (0.9) 0 and f (0.7) 1, leading to walkers basically ignoring any position except the one optimal slot. It should be remarked that the discussion of f (u) above strongly depends on the setting of success probability p(x) given by Equation (51). For example, if we consider the setting where even the best slot machine has a success probability of 0.7, and the other ones have lower success probabilities, then f (0.7) should be smaller. As mentioned in Section 5, the dependency on the settings regarding slot machines is a heavily complex problem and needs detailed discussions as future tasks.