1 Introduction

The broad field of machine learning (Bishop 2011; Cover and Thomas 1991; Hastie et al. 2009) aims to develop computer algorithms that improve automatically through experience, with many cross-disciplinary applications ranging from domotics systems to autonomous cars, from face/voice recognition to medical diagnostics. Such systems learn from data to identify distinctive patterns and consequently make decisions, with minimal human intervention. The three main paradigms of machine learning are supervised learning, unsupervised learning and reinforcement learning (RL). The goal of a supervised learning algorithm is to use an output-labeled dataset \(\lbrace x_i,y_i \rbrace _{i=1}^N\) to produce a model that, given a new input vector x, can predict its correct label y. Unsupervised learning, instead, uses an unlabelled dataset \(\lbrace x_i \rbrace _{i=1}^N\) and aims to extract useful properties (patterns) from the single datapoints or from the overall data distribution of the dataset (e.g. clustering). In reinforcement learning (Sutton and Barto 2018), the learning process relies on the interaction between an agent and an environment and defines how the agent performs its actions based on past experiences (episodes). In this process, one of the main problems is how to resolve the trade-off between the exploration of new actions and the exploitation of learned experience. RL has been applied to many successful tasks, e.g. outperforming humans on Atari games (Mnih et al. 2015) and Go (Silver et al. 2016), and it has recently become popular in the contexts of autonomous driving (Kiran et al. 2020) and neuroscience (Botvinick et al. 2020).

In recent years, considerable effort has been directed towards developing new algorithms combining machine learning and quantum information tools, in a new research field known as quantum machine learning (QML) (Schuld et al. 2015; Wittek 2014; Adcock et al. 2015; Arunachalam and de Wolf 2017; Biamonte et al. 2017), mostly in the supervised (Neven et al. 2008; Mott et al. 2017; Lloyd et al. 2020; Martina et al. 2022) and unsupervised domains (Otterbach et al. 2017; Winci et al. 2020; Hu et al. 2019), both to gain an advantage over classical machine learning algorithms and to control quantum systems more effectively. Some preliminary results on quantum reinforcement learning (QRL) have been reported in refs. Dong and Chen (2005); Paparo et al. (2014), and more recently for closed (i.e. unitarily evolving) quantum systems in ref. Dunjko et al. (2016), where the authors have shown quadratic improvements in learning efficiency by means of a Grover-type search in the space of the rewarding actions. Similarly, ref. Saggio et al. (2021) has shown how to obtain quantum speedups for reinforcement learning agents. The setting of an agent acting on an environment, however, has a natural analogue in the framework of open quantum systems (Breuer and Petruccione 2002; Caruso et al. 2014), where one can embed the entire RL framework into the quantum domain, and this has not been investigated in the literature yet. Moreover, one of the authors of this manuscript, inspired by recent observations in biological energy transport phenomena (Caruso et al. 2009), has shown in ref. Caruso et al. (2016) that one can obtain a very remarkable improvement in finding the solution of a problem, given in terms of the exit of a complex maze, by playing with quantum effects and noise. This improvement was about five orders of magnitude with respect to the purely classical and purely quantum regimes for large maze topologies. In the same work, these results were also experimentally tested by means of an integrated waveguide array probed by coherent light.

Motivated by these previous works, here we define the building blocks of RL in the quantum domain within the framework of open (i.e. noisy) quantum systems, where coherent and noise effects can strongly cooperate to achieve a given task. Then, we apply this framework to solve the quantum maze problem which, being a very complicated one, can represent a crucial step towards applications in very different problem-solving contexts.

2 Reinforcement learning

In RL, the system consists of an agent that operates in an environment and gets information about it, with the ability to perform some actions in order to gain some advantage in the form of a reward. More formally, RL problems are defined by a 5-tuple \((S,A,P_\cdot (\cdot ,\cdot ),R_{\cdot }(\cdot ,\cdot ),\gamma )\), where S is a finite set of states of the agent, A is a finite set of actions (alternatively, \(A_s\) is the finite set of actions available from the state s), \(P_a(s,s') = \Pr (s_{t+1}=s' \mid s_t = s, a_t=a)\) is the probability that action a in state s at time t leads to the state \(s'\) at time \(t + 1\), \(R_a(s,s')\) is the immediate reward (or expected immediate reward) received after transitioning from state s to state \(s'\) due to action a, and \(\gamma \in [0,1]\) is the discount factor balancing the relative importance of present and future rewards. In this setting, one can introduce different types of problems, based on the information one has at one's disposal. In multi-armed bandit models, the agent has to maximise the cumulative reward obtained by a sequence of independent actions, each of which gives a stochastic immediate reward. In this case, the state of the system describes the uncertainty of the expected immediate reward for each action. In contextual multi-armed bandits, the agent faces the same set of actions but in multiple scenarios, such that the most profitable action is scenario-dependent. In a Markov decision process (MDP), the agent has information on the state and the actions have an effect on the state itself. Finally, in partially observable MDPs, the state s is partially observable or unknown.
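
For concreteness, the 5-tuple above can be encoded directly as a data structure. The following minimal Python sketch is one possible representation; the field names are purely illustrative and not taken from our codebase.

```python
from typing import Dict, List, NamedTuple, Tuple

class MDP(NamedTuple):
    """Finite Markov decision process (S, A, P, R, gamma)."""
    states: List[int]                        # finite set S
    actions: List[int]                       # finite set A
    P: Dict[Tuple[int, int, int], float]     # P[(s, a, s_next)] = Pr(s_{t+1}=s' | s_t=s, a_t=a)
    R: Dict[Tuple[int, int, int], float]     # R[(s, a, s_next)] = immediate (expected) reward
    gamma: float                             # discount factor in [0, 1]
```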

The goal of the agent is to learn a policy (\(\pi\)), that is, a rule according to which an action is selected. In its most general formulation, the choice of the action at time t can depend on the whole history of agent-environment interactions up to t, and it is defined as a random variable over the set of available actions if such a choice is stochastic. A policy is called Markovian if the distribution depends only on the state at time t, with \(\pi _t (a \vert s)\) denoting the probability of choosing the action a from the state s, and if a policy does not change over time it is referred to as stationary (Ghavamzadeh et al. 2015). Then, the agent aims to learn the policy that maximises the expected cumulative reward, which is represented by the so-called value function. Given a state s, the value function is defined as \(V^\pi (s) = \mathbb {E}[\sum _{t=0}^\infty \gamma ^tR(Z_t)\vert Z_0=(s, \pi (.\vert s))]\), where \(Z_t\) is a random variable over state-action pairs. The policy \(\pi\) giving the optimal value function \(V^*(s) = \sup _\pi V^\pi (s)\) is the RL objective. It is known (Sutton and Barto 2018; Ghavamzadeh et al. 2015) that the value function of a policy \(\pi\) has to satisfy the Bellman equation, i.e. \(V^\pi (s) = R^\pi (s) + \gamma \int _S P^\pi (s'\vert s)V^\pi (s')ds'\), and the optimal value function \(V^*(s)\) satisfies the corresponding Bellman optimality equation. In deep RL, the policy is learned by a deep neural network whose objective function is the Bellman equation itself. The network starts by randomly exploring the space of possible actions and iteratively reinforcing its policy through the Bellman equation, given the reward obtained after each action. A popular way to transition from a purely random exploration to a conclusive reinforced policy is the \(\varepsilon\)-greedy policy, which chooses a random action with probability \(0 < \varepsilon \ll 1\) and the provisional optimal one with probability \(1-\varepsilon\). Furthermore, a smooth transition can be obtained with a time-dependent slow decay of the parameter \(\varepsilon\). A pictorial view of the iterative process between the agent and the environment can be found in Fig. 1.
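
As a concrete illustration of the exploration-exploitation mechanism described above, the following sketch implements an \(\varepsilon\)-greedy action selection with an exponentially decaying \(\varepsilon\); the decay constants are illustrative placeholders and not the values used in our experiments.

```python
import math
import random

def decayed_epsilon(step, eps_start=1.0, eps_end=0.05, decay=500.0):
    """Smooth transition from exploration (eps_start) to exploitation (eps_end)."""
    return eps_end + (eps_start - eps_end) * math.exp(-step / decay)

def epsilon_greedy(q_values, step):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    epsilon = decayed_epsilon(step)
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```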

Fig. 1

Deep reinforcement learning scheme. A deep neural network learns the policy \(\pi\) that the agent uses to perform an action on the environment. A reward and the information about the new state of the system are given back to the agent that improves and learns its policy accordingly

3 Quantum maze

Here we transfer the RL concepts into the quantum domain, where both the environment and the reward process follow the laws of quantum mechanics and are affected by both coherent and incoherent mechanisms. We consider, for simplicity, a quantum walker described by a qubit that is transmitted over a quantum network representing the RL environment. The RL state is the quantum state over the network, represented by the so-called density operator \(\rho\). The RL actions are variations of the environment, e.g. of its network topology, that affect the system state through a noisy quantum dynamics. The reward process is obtained from the evolution of the quantum network and is hence associated to some probability function to maximise. Following the results in ref. Caruso et al. (2016), and just to test this framework on a specific model, we consider a perfect maze, i.e. a maze where there is a single path connecting the entrance with the exit port. The network dynamics is described in terms of a stochastic quantum walk model (Whitfield et al. 2010; Caruso 2014), whose main advantage here is that, within the same model, it allows one to consider a purely coherent dynamics (quantum walk), a purely incoherent dynamics (classical random walk), and also the hybrid regime where coherent and incoherent mechanisms interplay or compete with each other. Although making a fair comparison between QRL and RL applied to the same task is very challenging and out of the scope of this paper, the model we consider here gives us the non-trivial chance to analyse the performance of the classical and quantum RL models in terms of the same resources and degrees of freedom. Very recently, we have also exploited this model to propose a new transport-based (neural network-inspired) protocol for quantum state discrimination (Dalla Pozza and Caruso 2020).

According to this stochastic quantum walk model, the time evolution t of the walker state \(\rho\) is governed by the following Lindblad equation (Lindblad 1976; Whitfield et al. 2010; Caruso 2014):

$$\begin{aligned} \frac{d \rho }{d t} = (1-p)\ \mathcal {L}_{QW} (\rho ) + p\ \mathcal {L}_{CRW} (\rho ) + \mathcal {L}_{exit} (\rho ) \end{aligned}$$
(1)

where \(\mathcal {L}_{QW} (\rho ) = - i[A,\rho ]\) describes the coherent hopping mechanisms, \(\mathcal {L}_{CRW} (\rho ) = \sum _{i,j} L_{ij} \rho L_{ij}^{\dagger } - \frac{1}{2} \{L_{ij}^\dagger L_{ij}, \rho \}\) with \(L_{ij} = (A_{ij}/d_j)\vert {i} \rangle \langle {j} \vert\) describes the incoherent hopping ones, while \(\mathcal {L}_{exit} (\rho ) = 2\vert {n+1} \rangle \langle {n} \vert \rho \vert {n} \rangle \langle {n+1} \vert - \{\vert {n} \rangle \langle {n} \vert , \rho \}\) is associated to the irreversible transfer from the maze (via the node n) to the exit (i.e., a sink in the node \(n+1\)). Here the maze topology is encoded in the so-called adjacency matrix A of the graph, whose elements \(A_{ij}\) are 1 if there is a link between the nodes i and j, and 0 otherwise. Besides, \(d_j\) is the number of links attached to the node j, while \(\vert {i} \rangle\) is the basis vector (in the Hilbert space) corresponding to the node i. The parameter p quantifies how incoherent the walker evolution is. In particular, when \(p=1\) one recovers the model of a classical random walk, when \(p=0\) one deals with a purely coherent quantum walk, while when \(0< p < 1\) the walker hops via both incoherent and coherent mechanisms (stochastic quantum walker). Let us point out that the complex matrix \(\rho _{ij} \equiv \langle {i} \vert \rho \vert {j} \rangle\) contains the node (real) populations along the diagonal, and the coherence terms in the off-diagonal (complex) elements. More generally, in order to represent a physical state, the operator \(\rho\) has to be positive semi-definite (to have meaningful occupation probabilities) and with trace one (for normalised probabilities). Hence, in this basis, \(\rho\) is a fully diagonal matrix only for a classical state. Then, the escaping probability is measured as \(p_{exit}(t) = 2 \int _0^t \rho _{n n}(t^{\prime }) \mathrm {d} t^{\prime }\). Ideally, we would like to reach \(p_{exit}=1\) in the shortest possible time, meaning that with probability 1 the walker has left the maze.
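
To illustrate how Eq. 1 can be integrated numerically, the following minimal sketch builds the Hamiltonian and the collapse operators of the stochastic quantum walk with QuTiP (the package used in Methods); the helper name, the node ordering and the toy graph are our own illustrative assumptions and do not reproduce our actual implementation.

```python
import numpy as np
import qutip as qt

def maze_generators(A, p, exit_node):
    """Hamiltonian and collapse operators reproducing Eq. (1).

    A         : (n x n) symmetric adjacency matrix of the maze (numpy array)
    p         : classicality parameter in [0, 1]
    exit_node : index of the maze node n that is attached to the sink
    """
    n = A.shape[0]
    dim = n + 1                                  # one extra level for the sink node n+1
    ket = [qt.basis(dim, i) for i in range(dim)]
    deg = A.sum(axis=0)                          # d_j, number of links attached to node j

    # Coherent part (1-p) L_QW: Hamiltonian (1-p) A, embedded in the enlarged space
    H = qt.Qobj(np.pad(A, ((0, 1), (0, 1)))) * (1 - p)

    # Incoherent hopping p L_CRW: collapse operators sqrt(p) (A_ij / d_j) |i><j|
    c_ops = [ket[i] * ket[j].dag() * (np.sqrt(p) * A[i, j] / deg[j])
             for i in range(n) for j in range(n) if A[i, j] != 0]

    # Irreversible transfer L_exit: collapse operator sqrt(2) |n+1><n|
    c_ops.append(ket[dim - 1] * ket[exit_node].dag() * np.sqrt(2))
    return H, c_ops

# Free evolution of a toy 3-node path graph; p_exit(t) equals the sink population
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
H, c_ops = maze_generators(A, p=0.4, exit_node=2)
rho0 = qt.fock_dm(4, 0)                          # walker injected at the entrance node 0
result = qt.mesolve(H, rho0, np.linspace(0, 20, 201), c_ops, e_ops=[qt.fock_dm(4, 3)])
p_exit = result.expect[0]                        # escaping probability vs time
```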

In the RL framework, \(\rho (t)\) is the state \(s_t\) evolving in time, the environment is the maze, and the objective function is the probability \(p_{exit}\) that the walker has exited the maze in a given amount of time (to be maximised), or, in an equivalent formulation of the problem, the amount of time required to exit the maze (to be minimised). In this paper, we consider the former objective function. The actions are obtained by changing the environment, that is, by varying the maze adjacency matrix. More specifically, we consider three possible actions a performed at given time instants during the walker evolution: (i) building a new wall, i.e. \(A_{ij}\) is changed from 1 to 0 (removing a link); (ii) breaking through an existing wall, i.e. \(A_{ij}\) is changed from 0 to 1 (adding a new link); (iii) doing nothing (null action) and letting the environment evolve with the current adjacency matrix. The action (i) may prevent the walker from wasting time in dead-end paths, while the action (ii) may create shortcuts in the maze — see Fig. 2. Notice that the available actions a are indexed by the link to modify, so that the action space is discrete and finite. In the following, we fix the total number of actions to be performed during the transport dynamics. In principle, one could add a penalty (negative term in the reward) in order to let the learning minimise the total number of actions (which might correspond to energy-consuming physical processes). The immediate reward \(R_a(s,s^{\prime })\) is the incremental probability that the walker leaves the maze in the time interval \(\Delta t\) following the action a, which changes the state from \(\rho (t)\) to \(\rho (t+\Delta t)\). This is an MDP setting. The optimal policy \(\pi\) gives the optimal actions maximising the cumulative escaping probability. Besides, one could also optimise the noise parameter p, but we have decided to keep it fixed and run the learning for each value of p in the range [0, 1].
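
To make the action and reward definitions concrete, the following sketch (again an illustrative encoding, not our actual code) maps a discrete action index to a link toggle or to the null action, and computes the immediate reward as the increment of the escaping probability over one interval \(\Delta t\).

```python
import numpy as np

def apply_action(A, action, links):
    """Toggle the link selected by `action` (the last index is the null action).

    A     : current adjacency matrix (a modified copy is returned)
    links : list of (i, j) pairs indexing the links the agent may modify
    """
    A = A.copy()
    if action < len(links):                      # actions (i) and (ii)
        i, j = links[action]
        A[i, j] = A[j, i] = 1 - A[i, j]          # build or break through a wall
    return A                                     # action == len(links): do nothing

def immediate_reward(p_exit_before, p_exit_after):
    """R_a(s, s'): escaping probability gained during the interval Delta t."""
    return p_exit_after - p_exit_before
```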

Fig. 2

Pictorial view of a maze where a classical/quantum walker can enter from a single input port and escape through a single output port. In order to increase the escaping probability within a certain time, at given time instants the RL agent can modify the environment (maze topology) by breaking through existing walls and/or building new walls, while the walker moves around to find the exit as quickly as possible

This approach is slightly different from the scenario pictured in the traditional maze problem (classical RL). A classical educational example is provided, for instance, by a mouse (the agent) whose goal is to find the shortest route from some initial cell to a target cheese cell in a maze (the environment). The agent needs to experiment and exploit past experiences in order to achieve its goal, and only after many trials and errors will it solve the maze problem. In particular, it has to find the optimal sequence of states in which the accumulated sum of rewards is maximal, for instance considering a negative reward (penalty) for each move on free cells in the maze. This is indeed an MDP setting, where the possible actions are the agent moves (left, right, up, down). In our case, we deal instead with a probability distribution for finding the walker on the maze positions, while in the classical setting the corresponding state would be a diagonal matrix \(\rho\) where only one diagonal element \(\rho _{ii}\) is equal to 1 and the others vanish. Our setup introduces an increased complexity with respect to the classical case, both in the definition of the state and in the number of available actions. In addition, a quantum walker can move in parallel along different directions (quantum parallelism), due to the quantum superposition principle, and interfere constructively or destructively on all maze positions, i.e. the quantum walker behaves as an electromagnetic or mechanical wave travelling through a maze-like structure (wave-particle duality). For these reasons, it is more natural to consider topology modifications (i.e. in the hopping rates described by \(A_{ij}\)) as possible actions. However, let us point out that changing the hopping rates is qualitatively similar to forcing the walker to move more in one or in the other direction, hence mimicking a continuous version of the discrete moves of the mouse in the classical scenario.

4 Results

Within the setting described above, we set a time limit T for the overall evolution of the system and define the time instants \(t_k = k \tau\), with \(\tau = \Delta t=T/N\) and \(k=0, \ldots, N-1\), at which the RL actions can be performed. The quantum walker evolves according to Eq. 1 in the time interval between \(t_k\) and \(t_{k+1}\). We then implement deep reinforcement learning with the \(\varepsilon\)-greedy algorithm for the policy improvement, and run it with \(N=8\) actions and 1000 training epochs (see Methods for more technical details). At each time instant \(t_k\), the agent can choose to modify any link in the maze, although we would expect its actions to be localised around the places where they have the chance to further increase the escaping probability. The \(\varepsilon\)-greedy algorithm implies that the agent picks either the action suggested by the policy, with probability \(1-\varepsilon\), or a random action, with probability \(\varepsilon\). This method increases the chances that the policy explores different strategies in search of the best one, instead of just reinforcing a sub-optimal solution. The value of \(\varepsilon\) is slowly decreased during training so that, at the end, the agent is just applying the policy without much further exploration. This optimisation is repeated for different values of p and T in order to investigate their role in the learning process.

Fig. 3

Cumulative reward as a function of p and \(\tau\), for a given \(6 \times 6\) perfect maze and \(N=8\) actions, equally spaced in time by the amount \(\tau\). The time unit is given in terms of the inverse of the sink rate, which is set to 1. The dotted grid above represents the performance of the quantum walker after the training, while the coloured solid surface below is the baseline on the same maze with no actions performed by the agent (only free evolution). Repeating the training over 30 random \(6 \times 6\) mazes and averaging their performances for each \((p, \tau )\), we qualitatively obtain the same trend

Fig. 4

Training curves for an agent performing RL actions with \(p=0.4\), \(\tau =28\), and \(N=8\) actions on a \(6 \times 6\) perfect maze. The curves show the cumulative rewards of single episodes (light blue), their ten-episode window average (dark blue) and the cumulative reward of the target network (orange) — see RL optimisation in Methods. The two horizontal lines are the (constant) cumulative reward in the case of no RL actions (magenta) and for the final trained policy (green)

As shown in Fig. 3, there is a clear RL improvement for any value of p, especially for large T (i.e. also large \(\tau\)), while for small T the improvement occurs mainly in the quantum regime (i.e. p going to 0), when the walker exploits coherent (and fast) hopping mechanisms. This is due to the fact that the classical random walker (without RL) moves very slowly and remains close to the starting point for small T, as reported in ref. Caruso et al. (2016). Repeating this experiment for 30 random \(6\times 6\) perfect mazes, we find a very similar behaviour — see also Fig. 6 in Methods, where, interestingly, a dip in the cumulative reward enhancement appears at around \(p=0.1\): there, the interplay between quantum coherence and noise already optimises the escaping probability without acting on the maze (Caruso 2014). In that regime, it is remarkable to observe that a small amount of noise allows the walker to both keep its quantumness (i.e. to move in parallel over the entire maze) and find the shortest path to the exit of the maze.

Figure 4 shows an example of the cumulative rewards obtained during the training of a network while the agent explores the space of possible actions. Initially, some random actions are performed; soon the agent finds some positive reinforcement and learns to consistently apply better actions, outperforming the case with no actions.

The proposed way of learning the best actions also comes with an intrinsic robustness to stochastic noise; indeed, this is a crucial property of RL-based approaches. In our case, we can suppose that we do not have perfect control over the system and that there might be perturbations, for example in the timing at which the actions are effectively performed. These kinds of perturbations are in general detrimental for hard-coded optimisation algorithms, and we want to analyse how our QRL approach performs in this regard. To check this, we first train the agent in an environment with fixed \(\tau\) and p. Afterwards, we evaluate the performance of the trained agent in an environment where the time \(\tau\) at which the actions are performed becomes stochastic (noisy). This additional timing noise is controlled by a parameter \(0 \le \eta \le 1\), while the total time of the actions is kept fixed. In this setting, we observe a remarkable robustness of our agent, which is capable of great generalisation and keeps the cumulative reward almost constant despite the added stochasticity. Indeed, in Fig. 5 we plot the average reward obtained by the agent in this stochastic environment over 100 different realisations of the noise. We can see that, as we increase the parameter \(\eta\), our agent on average keeps the ability to find the correct actions and to obtain a consistent reward in the stochastic environment, even though it has not been retrained in the noisy setting. However, while the average reward remains stable, the difference between the minimum and maximum reward increases significantly as \(\eta\) increases.
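
The precise timing-noise model is discussed in the Appendix; as a purely illustrative assumption, one simple way to picture it is to jitter each nominal action time by a uniform shift of magnitude at most \(\eta \tau\) while keeping the total evolution time fixed, as in the following sketch.

```python
import numpy as np

def noisy_action_times(tau, N, eta, rng=None):
    """One *assumed* noise model (not necessarily the one used in the Appendix):
    each nominal action time t_k = k * tau is shifted by a uniform jitter of at
    most eta * tau, while the total evolution time T = N * tau is unchanged."""
    rng = np.random.default_rng() if rng is None else rng
    t_nominal = tau * np.arange(N)
    jitter = eta * tau * rng.uniform(-1.0, 1.0, size=N)
    t_noisy = np.clip(t_nominal + jitter, 0.0, N * tau)
    return np.sort(t_noisy)                      # keep the action times ordered

# Evaluating a trained agent then amounts to averaging its cumulative reward
# over many realisations of noisy_action_times (100 realisations in Fig. 5).
```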

Fig. 5

Cumulative reward of an agent trained at \(\tau =14\) and \(p=0.4\) and deployed in a stochastic environment controlled by the parameter \(\eta\). The solid line is the average reward obtained by the agent over 100 realisations of the noise, the shaded area represents the minimum and the maximum achieved reward and the dashed orange line is the baseline of the walker with no actions performed. While the average performance of the agent remains stable, the variance in the outcomes increases greatly as \(\eta\) increases

The other tested scenario is the one in which, instead of taking the actions equally spaced over the total evolution time, we concentrate them all at the beginning or at the end of the evolution. This gives our agent a different environment, to which it adapts once again by implementing different strategies. Indeed, we find that our training method is successfully applicable also in this more general scenario, thus concluding our remarks on the robustness of the proposed QRL implementation.

A detailed discussion of the robustness analysis and all the aforementioned experiments can be found in the Appendix.

5 Methods

5.1 Quantum maze simulation

To simulate the stochastic quantum walk on a maze, we have used the popular QuTiP package (Johansson et al. 2012) for Python. In order to account for the actions performed by the agent at the time instants \(t_k=k \tau\), which modify the network topology, and to evaluate the reward signal, we have wrapped the QuTiP simulator in a Gym environment (Brockman et al. 2016). Gym is a Python package created by OpenAI specifically to tackle and standardise reinforcement learning problems. In this way, we can apply any RL algorithm to our quantum maze environment. The initial maze can either be randomly generated or loaded from a saved adjacency matrix, in order to allow both the reproducibility of single experiments and the averaging over different configurations.
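
A minimal skeleton of such a wrapper, assuming the classic Gym API and the hypothetical helpers sketched in Section 3, could look as follows; the class and attribute names are illustrative and do not correspond to our actual codebase.

```python
import gym
import numpy as np
import qutip as qt
from gym import spaces

class QuantumMazeEnv(gym.Env):
    """Gym wrapper around the QuTiP simulation of Eq. (1) (illustrative sketch)."""

    def __init__(self, A0, p, exit_node, tau, n_actions=8):
        super().__init__()
        self.A0, self.p, self.exit_node = A0, p, exit_node
        self.tau, self.n_slots = tau, n_actions
        self.dim = A0.shape[0] + 1
        self.links = [(i, j) for i in range(A0.shape[0]) for j in range(i)]
        self.action_space = spaces.Discrete(len(self.links) + 1)   # +1 null action
        # observation: real and imaginary parts of the flattened density matrix
        self.observation_space = spaces.Box(-1.0, 1.0, shape=(2 * self.dim ** 2,))

    def _obs(self):
        rho = self.rho.full().ravel()
        return np.concatenate([rho.real, rho.imag]).astype(np.float32)

    def reset(self):
        self.A, self.rho, self.k = self.A0.copy(), qt.fock_dm(self.dim, 0), 0
        self.p_exit = 0.0
        return self._obs()

    def step(self, action):
        # toggle the chosen link, evolve for tau, and reward the p_exit increment
        self.A = apply_action(self.A, action, self.links)
        H, c_ops = maze_generators(self.A, self.p, self.exit_node)
        self.rho = qt.mesolve(H, self.rho, [0.0, self.tau], c_ops).states[-1]
        new_p_exit = qt.expect(qt.fock_dm(self.dim, self.dim - 1), self.rho)
        reward, self.p_exit = new_p_exit - self.p_exit, new_p_exit
        self.k += 1
        return self._obs(), reward, self.k >= self.n_slots, {}
```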

5.2 RL optimisation

We have used a feed-forward neural network to learn the policy of our agent, following the Deep Q Learning approach (Stooke and Abbeel 2019), realised with the PyTorch package for Python (Paszke et al. 2019). In this approach, at each iteration of the training loop (defining a training epoch), a new training episode is generated by numerically solving Eq. 1 for the time evolution and employing an \(\varepsilon\)-greedy policy for the action selection. The new training episode is recorded in a fixed-size pool of recent episodes called replay memory, from which, after every new addition, a random batch of episodes is selected to train the policy neural network. The \(\varepsilon\) parameter is reduced at each epoch, in order to reduce the exploration of new action sequences and increase the exploitation of the good ones proposed by the policy neural network. Periodically, the policy neural network is copied into a target neural network, a standard trick used to reduce the instabilities in the training of the policy neural network. Figure 4 shows the reward of the training episodes, their ten-episode window average, and the reward provided by the target network, alongside the free evolution (no RL actions) and the final reward provided by the trained target network (constant lines).
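
The main ingredients of this training loop (replay memory, \(\varepsilon\)-greedy policy network and periodically updated target network) are sketched below in PyTorch; the network size, the loss function and the hyper-parameter values are illustrative placeholders rather than the exact choices used in our experiments.

```python
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn

obs_dim, n_actions = 128, 10           # placeholders: set from the environment spaces
GAMMA, BATCH, LR = 0.99, 64, 1e-3      # illustrative hyper-parameters

def make_net():
    return nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                         nn.Linear(128, 128), nn.ReLU(),
                         nn.Linear(128, n_actions))

policy_net, target_net = make_net(), make_net()
target_net.load_state_dict(policy_net.state_dict())
optimizer = torch.optim.Adam(policy_net.parameters(), lr=LR)
memory = deque(maxlen=10_000)          # replay memory of (obs, act, rew, next_obs, done)

def select_action(obs, epsilon):
    """epsilon-greedy action selection from the policy network."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        return int(policy_net(torch.as_tensor(obs, dtype=torch.float32)).argmax())

def optimize():
    """One Bellman update on a random batch sampled from the replay memory."""
    if len(memory) < BATCH:
        return
    batch = random.sample(memory, BATCH)
    obs, act, rew, nxt, done = (torch.as_tensor(np.array(x)) for x in zip(*batch))
    q = policy_net(obs.float()).gather(1, act.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():              # Bellman target computed with the target network
        best_next = target_net(nxt.float()).max(1).values
        target = rew.float() + GAMMA * best_next * (1 - done.float())
    loss = nn.functional.smooth_l1_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Training loop (schematic): after each env.step, append the transition to
# `memory`, call optimize(), decay epsilon once per epoch, and periodically
# refresh the target network with target_net.load_state_dict(policy_net.state_dict()).
```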

Fig. 6

Cumulative reward improvement over the no RL action (free evolution) dynamics as a function of p and \(\tau\), averaged over 30 random perfect mazes (\(6 \times 6\) size) and \(N=8\) (equally spaced in time) actions

Despite the relatively simple architecture, we have found the training to be quite sensitive to the choice of the learning hyper-parameters, such as the batch size of the training episodes per epoch, the replay memory capacity, the rate of the target network update and the decay rate of \(\varepsilon\) in the \(\varepsilon\)-greedy policy. In particular, for each \((p,\tau )\) in Fig. 3, we have run multiple independent hyper-parameter optimisations and trainings, employing the libraries Hyperopt (Bergstra et al. 2013) and Tune (Liaw et al. 2018). Due to the small size of the networks, we were able to launch multiple instances of our training procedure on a single Quadro K6000 GPU. Figure 6 shows the mean cumulative reward improvement of the trained strategy over the no-action strategy (only free evolution), averaged over 30 random perfect mazes (size \(6 \times 6\)) with \(N=8\) actions.
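
A hyper-parameter search of this kind can be set up, for instance, with Hyperopt's Tree-structured Parzen Estimator. The search space below is only a hedged illustration (the ranges and the hypothetical train_and_evaluate wrapper are not the ones actually used); it is shown to make the optimisation step explicit.

```python
import numpy as np
from hyperopt import fmin, hp, tpe

space = {
    "lr": hp.loguniform("lr", np.log(1e-4), np.log(1e-2)),
    "batch_size": hp.choice("batch_size", [32, 64, 128]),
    "memory_capacity": hp.choice("memory_capacity", [1_000, 10_000, 50_000]),
    "target_update": hp.choice("target_update", [5, 10, 20]),   # epochs between copies
    "eps_decay": hp.uniform("eps_decay", 100.0, 1000.0),
}

def objective(config):
    """Return the loss to minimise, e.g. the negative final cumulative reward.
    `train_and_evaluate` is a hypothetical wrapper around the training loop
    sketched above, not an actual function of our codebase."""
    return -train_and_evaluate(config)

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50)
```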

6 Discussion

To summarise, here we have introduced a new QML model that brings the classical concept of reinforcement learning into the quantum domain, and also in the presence of external noise. An agent operating in an environment experiments and exploits past experiences in order to find an optimal sequence of actions (following the optimal policy) to perform a given task (maximising a reward function). In particular, this was applied to the maze problem, where the agent aims to optimise the escaping probability within a given time interval. The dynamics on the maze was described in terms of the stochastic quantum walk model, which exactly includes also the purely classical and purely quantum regimes. This has allowed us to make a fair comparison between transport-based RL and QRL models exploiting the same resources. We have found that the agent always learns a strategy that allows a quicker escape from the maze, but in the quantum case the walker is faster and can exploit useful actions also at very short times. Instead, in the presence of a small amount of noise, the transport dynamics is already almost optimal and RL shows a smaller enhancement, hence further supporting the key role of noise in transport dynamics. In other words, some decoherence effectively reproduces a sort of RL optimal strategy in enhancing the transmission capability of the network. Moreover, more quantumness in our QRL protocol leads to more robustness of the optimal reward with respect to the exact timing of the actions performed by the agent.

Finally, let us discuss how the RL actions in the maze problem could be implemented from the physics point of view. In ref. Caruso (2014), one of us has shown that one can design a sort of noise mask that leads to a transport behaviour as if one had modified the underlying topology. For instance, dephasing noise can open shortcuts between non-resonant nodes, and Zeno-like effects can suppress the transport over a given link, hence mimicking the two types of RL actions discussed in this paper. As a future outlook, one could test these theoretical predictions via atomic or photonic experimental setups, or even on new-generation NISQ devices, whose current technologies allow one to engineer complex topologies and to modify them on the same time scale as the quantum dynamics, while also exploiting some beneficial effects of the environmental noise that cannot be suppressed.