Reinforcement learning of optimal active particle navigation

The development of self-propelled particles at the micro- and the nanoscale has sparked a huge potential for future applications in active matter physics, microsurgery, and targeted drug delivery. However, while the latter applications provoke the quest on how to optimally navigate towards a target, such as e.g. a cancer cell, there is still no simple way known to determine the optimal route in sufficiently complex environments. Here we develop a machine learning-based approach that allows us, for the first time, to determine the asymptotically optimal path of a self-propelled agent which can freely steer in complex environments. Our method hinges on policy gradient-based deep reinforcement learning techniques and, crucially, does not require any reward shaping or heuristics. The presented method provides a powerful alternative to current analytical methods to calculate optimal trajectories and opens a route towards a universal path planner for future intelligent active particles.


Introduction
The problem of finding suitable navigation strategies is of great interest to applications ranging from motion planning for autonomous underwater vehicles, ocean gliders [1][2][3][4], and aerial vehicles [5][6][7] to microorganisms searching for food and prey [8,9] and striving for survival in complex environments [10,11]. One important class of path planning problems which is currently attracting a rapidly increasing attention is centered around the quest for the optimal trajectory allowing an active particle, which can freely steer but cannot control its speed, to reach a given target in a complex environment. This active particle navigation (APN) problem is relevant both for biological swimmers like fish or for turtles on the way to their breeding grounds [12,13] and for future applications of synthetic microswimmers [14] such as targeted drug [15][16][17] and gene delivery [18,19] or microsurgery [20].
Contrasting classical navigation problems of vehicles like ships or airplanes [21,22], mesoscopic active particles can face a variety of new challenges including fluctuations, hydrodynamic interactions with obstacles and boundaries [23,24], and highly complex environments [25][26][27][28]. Following these complex ingredients, the optimal (fastest) path generically differs from the shortest one and is highly challenging to determine. In fact, as of now, there is no standard receipe to find the optimal path even for the simplest case of a dry active particle with perfect steering (no fluctuations, no delay) [29] in sufficiently complex environments. While a method to tackle this challenge, which we refer to as the "minimal APN problem", would serve as a crucial step forward en route to the ultimate dream of a universal path planner for microswimmers, such a method is not yet available. In fact, classical path planning algorithms such as A* or Dijkstra can not be straightforwardly adjusted to account for complex smooth environments (even when using heuristic ingredients) and achieving general solutions with analytical methods based on variational methods [29], optimal control theory [23,30] or transformations to a problem of finding geodesics of a Randers metric [31] is highly challenging (if not impossible) if the environment is sufficiently complex. In contrast, recent advancements in the field of artificial intelligence and reinforcement learning (RL) have allowed handling the required complexity and can be used in principle to approach the (minimal) APN problem for generic environments. Corresponding pioneering studies at the interface of RL and active matter [32] have demonstrated, remarkably, that tabular Q-learning algorithms are able to uncover efficient navigation strategies for certain environments [33,34] including complex fluid flows [27]. Other RL based methods have been used to develop efficient navigation strategies even in the presence of turbulent and chaotic flows [35,36] or learning and modeling chemotaxis behavior [37]. Deep RL methods have been applied to colloidal robots in unknown environment with random obstacles [38,39] and very recently Q-learning was used to explore optimal colloidal predator-prey dynamics and strategies [40]. However, despite these successes, the challenge of finding the globally optimal path in generic environments still remains. First, there is generally a major risk that agents converge to a local optimum rather than to the global one [41]. Second, even after finding the global optimum for given reward and state-action space definitions, it often remains unclear how the result relates to optimality in the physical reality. In some cases this deviation can reach a level where the final result is nowhere near the optimal trajectory, as we'll see below.
To overcome this gap in the literature between methods which can not handle the complexity of generic complex environments and methods leading to solutions which do not (or are not known to) lead to the globally optimal path, in the present work we develop a new approach which does not only asymptotically reproduce analytically known optimal solutions for the minimal APN problem but also allows to find the optimal path in very complex environments. To achieve this, we combine a hybrid discrete-continuum representation of the The RL model is designed, trained and fine-tuned to find the optimal trajectory in environments with known analytical solutions. Upon training, the model's early performance (blue curve) converges asymptotically (green dashed curve) to the optimal solution (dashed black curve). Arrows and colors indicate the direction and strength of force/flow fields acting on the agent. (b) The designed model with proven performance is trained in highly complex environments to provide asymptotically optimal solutions in cases which are analytically inaccessible. The dashed line exemplarically shows the learned path in a Gaussian random potential (background colors). environment with policy gradient based Deep RL agents (advantageous actor-critic algorithms) to find the optimal path. Remarkably, we find that using a single relatively simple neural network architecture across different environments, is sufficient to achieve asymptotically optimal trajectories i.e. trajectories which are optimal for a given discretization. Our results unify, for the first time, asymptotic optimality with the feasibility of handling generic complex environments. They can also be interpreted as an indication that there is a way for biological agents to learn optimal navigation strategies from generation to generation (e.g. throughout evolution) without requiring direct global knowledge of their environment or even of the location of their target.

Model
To define the minimal APN problem, we consider an overdamped dry active particle in 2D at position r(t) = (x(t), y(t)) which evolves as: Here, v 0 is the constant self-propulsion velocity of the active particle and f ( r) = F ( r)/γ( r) + u( r) is the overall external field with F ( r), u( r) being the force field and the solvent flow field due to the environment and γ( r) is the Stokes drag coefficient which can vary in space as relevant e.g. for viscotaxis [42]. We assume that the steering directionn(t) = (cos ψ(t), sin ψ(t)) can be freely controlled by the active particle, as relevant e.g. for biological microswimmers, or by external fields e.g. via feedback control systems [43][44][45] or external electric, magnetic or phoretic fields [46][47][48].
Reinforcement learning of optimal active particle navigation 4 While in reality there would of course be some delay in the control [49] as well as thermal fluctuations (for small agents), here we neglect both complications as even in their absence there is no known way to systematically determine or learn globally optimal trajectories in the literature. However, we would like to stress that our approach can be generalized to include both ingredients. To complete the definition of the minimal APN problem, let us now assume that the starting r(0) = r start and target r(T ) = r end points as well as f ( r) are given and the task is to find the connecting path (and the associated navigation strategy) which minimizes the travelling time.

Approach
Here we use reinforcement learning to solve the minimal APN problem. We train an agent within a complex environment, to take actions in order to maximize a cumulative reward [41,50]. Within each episode, the agent starts from a certain initial state s 0 (initial position), and chooses an action a t (direction of motion) at each time step t until it either reaches the target point (or region) or faces an exiting condition, e.g. by hitting the boundaries of the simulation domain or exceeding the maximum number of allowed steps (see SI for details).
For this study, we consider a hybrid interplay of discrete and continuous state spaces for the environment so that the RL agent can harness the power of working inside an infinite dimensional continuous state space without dealing with high dimensional input data ( Fig. 2), which is crucial for the present approach. That is, we account for the exact continuous external field f ( r), but discretize the environment, represented as a 2D gridworld observation, when feeding it into the neural networks as the input states, s t . In each step, the agent chooses an action from a set of 60 equally spaced orientation angles ψ(s t ) = { mπ 30 |m ∈ Z, 0 ≤ m < 60}, defining its direction of motion and receives a fixed negative reward R t . The agent receives an additional positive reward of 100|R t | and a very large penalty amounting to R t times the maximum number of steps available for each episode in case of reaching the target region or hitting the environment boundaries. Hence the optimal navigation strategy corresponds to reaching the target by taking the least number of actions necessary, without hitting the boundaries. Importantly, in contrast to most previous studies, our approach neither requires any reward shaping [51,52] nor any heuristics [53,54] and the active particles are able to develop the optimal strategy only through pure exploration. The main component of our approach is the RL algorithm itself. It became evident during our tests, that, on-policy methods [41] provide more robustness in convergence toward the global optimal solution, as compared to off-policy methods such as Qlearning; hence a policy gradient method was selected (see SI for details). This method allows learning a parametrized policy π θ for choosing actions by optimizing the expected [55][56][57]. This goal is achieved by updating the set of parameters of the policy θ, which in our case corresponds to the parameters (weights) of the policy network.
Reinforcement learning of optimal active particle navigation

5
The update is performed in the direction of the gradient of the expected return, i.e. such that the expected return is enhanced. The gradient of total expected return, averaged across K trajectories, can be approximated as [55][56][57]: where R t denotes the reward value at time step t , achieved by policy π θ , and T i , denotes the overall lengths (number of steps) of the i-th trajectory, which can be written in stateaction space as We compute these gradients via the back-propagation algorithm in the training process of the policy network. It can be seen from Eq. (2), that as the policy network improves to achieve higher total returns, the gradient of the expected return starts to diminish, until eventually the agent converges to the final policy. In the following we choose the advantageous actor critic method (A2C) as a specific example of a policy gradient method. This method involves, besides the policy network, also a critic network which assignes a value to each state which corresponds to the expected temporal distance to the target and is used to guide the updating of the parameters (weights) of the policy network in the following way: The total return T i t =t R t is replaced with a more well-behaved term known as the advantage function [55,58,59], which together with the critic network, essentially rates the possible actions by evaluating the benefit of choosing a specific action in comparison to the average action for a given state. In our approach, we define the advantage function as follows [59]: where V π θ w is the critic network with parameters (weights) w, which estimates the value of a given state, and Q π θ (s it , a it ) is the state-action value function under policy π θ and discount factor λ, which determines the expected reward when choosing a specific action.
We train the two networks which are involved in the A2C algorithm (policy and critic network) simultaneously based on the trajectories from multiple past episodes. At each training round the critic network is updated based on the value (discounted total return computed based on the present policy) of each state and is used to judge the performance of previous episodes through the advantage term, which influences (or guides) the update of the policy network through Eq. (3). Note that the Deep RL approach is able to manifest a surprisingly high level of generalization potential. Remarkably, all of the results presented in this study, across different setups with various force fields, are achieved with policy networks containing Reinforcement learning of optimal active particle navigation

Results
To demonstrate the power of the developed approach, we now consider various environments of increasing complexity, starting with cases for which exact analytical solutions are available [29]. As a first application of our method, we explore an active particle in a linear force (or flow) field f = (kx, 0). Despite its apparent simplicity, for some combinations of starting and end points, this problem can not be straightforwardly obtained by "optimistic" applications of classical path planning algorithms like Dijkstra or A* [60] (see SI). In contrast, by applying our RL-based method, we find that after training the model (see SI for details), the agent is able to achieve asymptotically optimal solutions (Figs. 3a,b). That is, any remaining deviations due to the discretization of the problem can be systematically reduced by choosing an even finer discretization. Notably, the algorithm finds the optimal path even for k = 1.0, r start = (0, 0) and r target = (5, 5) (Fig. 3b) where the drift due to the external field dominates the self-propulsion in a large portion of space such that choosing unsuitable actions at early times can make it impossible for the agent to reach the target even after very long times. Let us now focus on a somewhat more complex environment, defined by f = (k(1 − y 2 ), 0), representing e.g. shear flow in a pipe [61]. Here the optimal path is S-shaped for sufficiently large k (dashed lines in Fig. 3c and 3d). From the RL point of view, such trajectories are rather unapparent, because achieving this symmetric S-shaped path, in the presence of the strong flow field, would require the agent to employ a strategy with intricate combination of actions. Accordingly it is not surprising that simple RL algorithms such as tabular Q-learning tend to fail finding the global optimal path or at least require unsystematic hyperparameter fine-tuning for each considered k-value (see Reinforcement learning of optimal active particle navigation SI for more details). Remarkably, also for this setup, our RL agent is able to learn the asymptotically optimal path for any k value. That is, also here any remaining deviations from the optimal path can be systematically reduced by choosing a finer discretization for state and action spaces. Notice however, that for a given discretization degenerate trajectories may occur which have very similar (or even identical) temporal costs but differ slightly in shape (see SI for details).
Let us now exploit the very close agreement between theoretically calculated optimal paths and the learned ones, to explore truly complex environments for which optimal trajectories cannot be analytically calculated. For this purpose, we create a Gaussian random potential (GRF) [62,63] U ( r) with a power spectrum of <Ũ (k)Ũ (−k) > ∝ k −α and determine f = − ∇U (see SI for more details).
Here we consider the cases of α = 2 and 4 for two different combinations of starting and end points (see Fig. 3e,f). Notably for α = 4 the learned solutions are comparatively smooth curves, while for α = 2 the learned paths involve comparatively sharp turns, reflecting the structure of the underlying potential islands. In all shown cases, we find that the travelling time of the learned optimal trajectories is shorter than the travelling time when following a straight line.

Conclusions
We have developed an end-to-end deep reinforcement learning approach which creates asymptotically optimal solutions for complex navigation problems which are not straightforwardly attainable with existing standard methods. This approach complements the broad literature on methods to find the shortest path and opens a route towards a universal path planner for microswimmers and larger self-driven agents which are subject to continuously varying force or flow fields. To achieve this, our method can be generalized in many ways, e.g to find globally optimal paths with respect to fuel consumption or dissipated power, or after generalization to continuous state-action-spaces, to account for microswimmer-specific ingredients such as hydrodynamic interactions with obstacles and fluctuations. Potential applications of the presented method range from testing future theoretical developments e.g. regarding the optimal path of active particles in the presence of fluctuations to the programming of nano-and microscale robots for targeted drug and gene delivery. On larger scales our approach could also be used to test if biological agents like turtles or fish manage to find the globally optimal path (based on a comparison of their trajectories with the learned results) or, possibly, even for route planning of macroscopic vehicles like cleaning-robots or spacecrafts where it could provide an alternative to optimization methods based on nonlinear programming or meta-heuristics [64].

Supplemental Materials for
Reinforcement learning of optimal active particle navigation Mahdi Nasiri and Benno Liebchen *

A. A2C algorithm
In this section we will briefly review our implementation of the A2C algorithm. Just like most other policy gradient based methods, the A2C algorithm consists of two separate neural networks known as the policy (actor) and the value (critic) networks. The final decision making is based on the output of the policy net. In Algorithm 1, a pseudo-code of the algorithm is provided.

6:
Estimate the policy gradient:

7:
Update the policy network via Adam optimization.

8:
Compute the loss on the value network:

9:
Update the value network via Adam optimization. 10: end As it can be seen from this pseudocode, at each step, both networks are updated using the aggregated data of |M k | = 10 different trajectories created based on the current state of the policy network. We use an actor network with two hidden layers with 250 and 50 neurons respectively and an output layer providing the probability of each action in the form of a softmax function. The critic, on the other side, has only one hidden layer with N neurons the value of which corresponds to the dimensions of the state space. We use learning rates of 10 −4 and 2 × 10 −4 for the Adam optimizer [S1] in policy and value networks respectively. The training of the neural networks are done with the help of the Pytorch deep learning framework [S2]. Table B.1 summarizes the parameter values which we have used for the calculation of the optimal paths in this work. Here the step size is the only parameter which is chosen based on the characteristics of the environment; the step size then determines the resolution of the discrete state space. In all of our setups, a fixed negative reward (penalty) of R t = −0.1 is used for each actions and we use a positive reward of R G = 10 (R G = 20 in case of the Gaussian random field with α = 4) for reaching the target domain and a penalty of Max # of steps × R t (2 × Max # of steps × R t for the Gaussian random field with α = 4) for hitting the boundary. The environments are implemented using OpenAI's gym framework [S3]. * benno.liebchen@pkm.tu-darmstadt.de

C. Method to obtain the exact solutions
For the exact (analytical) solutions of each setup, the final forms of the trajectories, y (x) and the boundary conditions, y(x A ) = y A and y(x B ) = y B are the same as in [S4]. We utilize a modified version of the shooting method to numerically solve these equations, in which we couple an optimization process to the ordinary shooting method that automatically updates the required constant c 0 in y (x), for each round of the shooting approach. For this purpose, at the first step a random small value is given to c 0 and after solving the equation, the last value of the achieved solution is compared with the boundary condition and the resulting difference ∆ = y(x B ) −ŷ(x B ) is used to update the initial constant, c 0 = c 0 + β∆, where β andŷ correspond to the learning rate and the numerically estimated solution. This process is continued until the estimated value of the boundary condition reaches a certain accuracy threshold, .

D. Classical path planners and Q-learning
To further illustrate the power of the developed method, we now use two types of existing approaches to determine the optimal path (to the extend possible). We first discuss the Dijkstra and the A* algorithms as examples of classic path planning methods and then turn our attention to the tabular Q-learning algorithm. While the Dijkstra algorithms has been mainly designed to find the shortest path, it can be heuristically transformed into an algorithm aiming to find the fastest path by defining a temporal cost of t cell = | ∆X|/|˙ r tot |. Here | ∆X| is the displacement length in a certain step (which is larger for diagonal moves than for horizontal and vertical ones) and |˙ r tot | is the maximum total velocity possible for that displacement (which must be a positive real value). For the implementation of the algorithm we follow [S5]. At each step we allow the agent to choose between eight equidistantly chosen possible directions. To illustrate the power of our deep RL approach as compared to these classic path finders, we focus on the simplest case of a linear flow or force field. The Dijkstra algorithm was able to achieve decent results for weak flow field (low k), closely approximating the analytical solutions (Fig D.1a), however once the force field dominates over the self-propulsion, the results started significantly deviate from the exact solutions (Fig D.1b). Let us now discuss the A* algorithm. Here the same cost is used together with a heuristic of h(s t ) = d heur /v heur , where d heur is the minimum distance to the target from the current cell, and v heur = max(|˙ r tot |). As it can be seen, the trajectories predicted by the A* algorithm turn out to be identical to those of the modified form of the Dijkstra algorithm. These examples depict the limitations of classical path planners for finding shortest time trajectories, in complex environments. For these methods to achieve asymptotically optimal solutions, one usually has to balance between the generalization capabilities of these approaches and their accuracy by designing task specific heuristics which in return would need to be modified in case of changes to the environment or the underlying flow fields.
Finally a tabular Q-learning agent was implemented with the same 60-dimensional action space as used for the on-policy approach which we have discussed in the main text. To assess the power of this algorithm and its limitations, we trained this agent on the environment hosting the shear flow f (x, y) = (k(1 − y 2 , 0) with k = −0.8. The same hybrid observation of the environment with the resolution of 43 × 19 was fed to train the model with learning rate= 0.1, exploration ( − greedy) parameter 0 = 0.25, and discount factor =0.9 for 4 × 10 6 generations, More details of this algorithm can be found in [S6]. Given these specifications of the environment and the algorithm, we were able to train the model with a reward of R t = −1, target reward R G = 0.5, and boundary penalty R B = −100. Note that even small modifications to these values or the hyperparameters turned out to drastically affect the outcome trajectories. In particular, when choosing a much larger target reward, the results did not improve. As it can be seen from some representative examples of the converged results (Fig D.1 c), most of the achieved trajectories are far from the optimal solution as opposed to the performance of our on-policy agent. It is important to note that although one may be able to achieve asymptotically optimal solutions using Q-learning via brute force grid searching in the hyperparameter space, problems arise due to the instability of the final converged results. That is for any independent training session, there is the possibility of converging to a totally different outcome. Also the discovered hyperparameter set lacks the necessary robustness with regards to changes in the environment or the underlying flow fields, and a slight modification of the flow field strength (k-value) will require a renewed grid search in the hyperparamter space. Note that increasing the dimensionality of the state space would decrease the degeneracy of the temporal cost of the outcome trajectories, but would not help to systematically address the instability of the outcome trajectories or the sensitivity of the results to hyperparameter definitions. One of the main reasons for this limited performance of simple off-policy methods such as tabular Q-learning could be rooted in the partial observability of the underlying state space of our hybrid setup. This characteristics of our state space is turning the path planning task into a complicated problem since in the presence of a strong flow field, small modifications in the navigation strategy would result in different outcomes for a given state and the final trajectory, which is not easily comprehensible to a simple off-policy method such as tabular Q-learning given the current coarse-grained image of the environment.

E. Degenerate trajectories
In all of our setups, at the end of the training, the RL agent does not always converge to one and the same trajectory, but it alternatingly chooses a trajectory out of a set of distinct trajectories which have identical temporal costs for a given discretization (but have different temporal length in continuum space). This phenomena is expected due to the coupling of the intrinsic stochasticity of the training process of the neural networks [S7] and the hybrid definition of our state spaces. The effects of these characteristics on the reproducability of the results decrease as finer discretizations are chosen for the state-action spaces. It is expected, in the limit of a very fine grid all of these degenerate trajectories would converge to the same optimal path in continuum space. To further illustrate this matter, it is instructive to look at the final learned strategy from the critic network's point of view. As it can be seen from the value map (background colors in Fig. E.1), the values as predicted by the critic network are essentially constant in the vicinity of the optimal path in continuum space (dashed line).