Creation and manipulation of quantized vortices in Bose-Einstein condensates using reinforcement learning

We apply the technique of reinforcement learning to the control of nonlinear matter waves. In this method, an agent controls the position, strength, and shape of an external Gaussian potential to create and manipulate quantized vortices in a Bose-Einstein condensate (BEC) trapped in a harmonic potential. The density and velocity distributions of the BEC at each moment obtained by the Gross-Pitaevskii evolution are directly input into a convolutional neural network to determine the next action of the agent. We demonstrate that a stationary single-vortex state can be produced in a two-dimensional system, and a stationary vortex-ring state can be produced in a three-dimensional system.


Introduction
Recent developments in machine learning have been remarkable. A standout example is the computer program "Al-phaGo", 1,2) which defeated the best human players in the game of Go. In AlphaGo, reinforcement learning with deep neural networks (called deep-Q learning) is used to evaluate the situation of the game and determine the next action. Another interesting example of the use of deep-Q learning was reported in Ref. 3, in which a computer agent playing video games is trained to get higher scores. It was demonstrated that the computer agent outperforms human players after training, without prior knowledge about the games.
In this study, we focus on the control of quantum systems by reinforcement learning. [4][5][6][7][8][9][10][11][12][13][14][15][16][17] Controlling quantum systems and producing desired quantum states are important in a variety of areas in quantum physics. Reinforcement learning consists of an agent and the environment. In this case, the quantum system is regarded as the environment. An easily accessible initial state, such as the ground state, is first prepared, and during the time evolution, the agent makes decisions to control the time-dependent parameters in the Hamiltonian. The quantum state develops depending on the parameters determined by the agent, and the information of the quantum state is fed back to the agent. Depending on the quantum state, a reward is also given to the agent, and the agent is trained to maximize the reward. The reward is, for example, the overlap or fidelity between the controlling state and the target state, In this paper, we consider a Bose-Einstein condensate (BEC) of an atomic gas as a quantum system to be controlled. Optimal control of a BEC has been studied for a variety of purposes, such as spatial transport, [18][19][20] splitting into two BECs, 18,21,22) squeezing of quantum fluctuations, 23,24) excitation to specific states, 21,[25][26][27][28][29] changing trap geometry, 22,30) improving interferometry, 31) and creating a selfbound droplet. 32) In these studies, quantum control theories, such as GRAPE 33) and CRAB 34,35) are used. Recently, machine learning techniques have been used for the optimized Fig. 1. Schematic illustration of our system. The system consists of an agent and the environment. At each time step, the agent decides on an action on the environment, namely, how to change the external Gaussian potential applied to the BEC. The dynamics of the BEC with the time-dependent Gaussian potential is obtained by numerically solving the Gross-Pitaevskii (GP) equation, and the state of the BEC and the reward are given to the agent. The agent is trained to get a higher reward, which leads to the production of a desired state of the BEC. creation of a BEC [36][37][38][39] as well as the transport and decompression 40) of a BEC. Figure 1 illustrates reinforcement learning to control a BEC. We aim to create and manipulate quantized vortices in a BEC by changing a Gaussian external potential. Experimentally, such a potential can be produced by a nonresonant Gaussian laser beam applied to the BEC. On the environmental side, the dynamics of the BEC with a time-dependent Gaussian external potential are obtained by numerically solving the Gross-Pitaevskii (GP) equation. At each time step, the agent obtains the state of the BEC from the environment, and decides on the next action, i.e., how to change the position, strength, and shape of the Gaussian potential. The decision of the agent is made using a deep convolutional neural network (CNN), where the spatial distributions of the density and ve-locity of the BEC are directly input into the CNN, giving the next action as an output. At each time step, the agent also receives reward from the environment. The reward is higher when the state of the BEC is closer to the target state. The CNN of the agent is trained to maximize the total amount of rewards, which enables the agent to discover an optimal control of the Gaussian potential.
Using the above scheme, we demonstrate that quantized vortices can be created and manipulated in two-dimensional (2D) and three-dimensional (3D) BECs. In a 2D system, the target state is set to be the state with a singly-quantized vortex at the center. Although both vortices and antivortices are created in pairs by a simple translation of the Gaussian potential, 41,42) the agent finds a clever way to generate a single vortex state. In a 3D system, the target state is set to be the stationary vortex-ring state, and the agent finds that it can be created by a simple Gaussian laser beam.
The remainder of the paper is organized as follows. Section 2 explains our method, Sec. 3 shows the numerical results, and Sec. 4 gives the conclusions to this study.

Environment
We begin by describing the environment of the reinforcement learning system. As described above, the environment receives the action from the agent and evolves to the next time step depending on the action. The environment then relays its state to the agent. The environment also evaluates whether the action of the agent was a good action, and gives a reward to the agent.
As an environment, we consider a BEC of bosonic atoms with mass m confined in a harmonic potential. In the meanfield approximation at zero temperature, the macroscopic wave function ψ(r, t) obeys the GP equation given by where V trap = m[ω 2 ⊥ (x 2 +y 2 )+ω 2 z z 2 ]/2 is the harmonic trap potential with ω ⊥ and ω z being the radial and axial trap frequencies, V G is the Gaussian external potential, and g = 4π 2 a/m is the interaction coefficient with a being the s-wave scattering length of the atom. The wave function is normalized as |ψ| 2 dr = N atom , where N atom is the number of atoms.
In the next section, we will investigate both 2D and 3D systems. For the 3D system, for simplicity, the harmonic trap potential is assumed to be isotropic, i.e., ω ⊥ = ω z ≡ ω. The external Gaussian potential for the 3D problem has the form where the strength A(t) > 0, width d y (t) > 0, and position ζ(t) are control parameters, and the width d z is a constant. Such an external potential can be produced by a blue-detuned Gaussian laser beam propagating in the x direction. The 2D system is experimentally realized by tight confine-ment in the z direction such that ω z is much larger than the other characteristic energies. In this case, the wave function can be approximated as ψ(r, t) = ψ ⊥ (x, y, t)ψ 0 (z)e −iω z t/2 , where ψ 0 (z) is the ground state of the harmonic oscillator potential mω 2 z z 2 /2. Multiplying Eq. (1) by ψ 0 (z) and integrating it with respect to z, the system is reduced to 2D as The external Gaussian potential for the 2D problem is assumed to be where the strength A ⊥ (t) > 0 and position ξ(t), η(t) are control parameters, and the width d is a constant. This form of the potential is generated by a blue-detuned Gaussian laser beam propagating in the z direction. At the start of each run of time evolution (called an episode), the environment is reset: the control parameters are set to the initial values, and the wave function is set to the ground state for V G (t = 0). The ground-state wave function is prepared by the imaginary-time evolution, where i on the right-hand sides of Eqs. (1) and (3) is replaced with −1. The real-time and imaginary-time evolution is numerically obtained by the pseudospectral method. 43) After the initial reset of the environment, the real-time evolution starts. At intervals of time ∆t (which is much larger than the discretized time interval δt in the pseudospectral method), the environment returns its state and the reward to the agent, and receives the action from the agent (see Fig. 1). The state of the environment given to the agent is the density distribution ρ = |ψ| 2 and flux distribution F = (ψ * ∇ψ − ψ∇ψ * )/(2mi) (ψ → ψ ⊥ for 2D) of the BEC. The state of the environment also includes the current shape of the Gaussian potential V G so that the agent can determine how to change the potential. The reward given to the agent is calculated from the overlap between the current wave function and the target wave function, which is specified in the next section. By the action received from the agent, one of the parameters in the Gaussian external potential is changed by a small amount at each time step. Each episode terminates at t = 500∆t ≡ T end , which is followed by the next episode.

Agent
The agent observes the state s t of the environment at time t, and determines the action a t on the environment. The agent then gets the reward r t for the action from the environment. From these experiences, the agent tries to maximize the discounted cumulative reward, ∞ n=0 γ n r t+n∆t , where 0 < γ < 1 is the discount rate. 44) As mentioned in Sec. 2.1, as the state s t of the environment, we take the density distribution ρ(r, t), flux distribution F(r, t), and the Gaussian potential V G (r, t). For the 2D case, the four distributions ρ(x, y), F x (x, y), F y (x, y), and V G (x, y) are expressed by 64 × 64 × 4 pixels, and are input into the CNN 45) shown in Fig. 2. For the 3D case, the five distributions ρ(x, y, z), F x (x, y, z), F y (x, y, z), F z (x, y, z), and V G (x, y, z) are expressed by 32 × 32 × 32 × 5 pixels. The CNN consists of four convolutional layers and two fully-connected layers. The CNN outputs six values, which we denote Q(s t , a) with a = 0, 1, · · · , 5. The agent determines the action a t by the ǫgreedy policy: where 0 ≤ r < 1 is a random number and argmax a Q(s t , a) indicates the value of a that maximizes Q(s t , a). The value of ǫ is linearly decreased from 1 to 0.1 in the first 50,000 steps (100 episodes), and a constant value of ǫ = 0.1 is used after that. The action a t chosen by the agent is applied to the environment, where the Gaussian external potential is changed depending on a t , and the BEC evolves from t to t + ∆t. The agent then receives the reward r t from the environment, and observes the new state s t+∆t . According to the Bellman optimality equation, 44) Q(s t , a t ) should approach the target value r t + γQ(s t+∆t , a ′ ), where a ′ = argmax a Q(s t+∆t , a). It was reported in Ref. 46 that the training process is stabilized by replacing Q withQ in the above target value, whereQ is generated by another CNN, called the target network. Thus, the target value is given by The original CNN to generate Q is updated in every training step, and the original CNN is copied to the target CNN every C steps, where we take C = 1000. The target value y t is thus calculated by fixing the target CNN during C steps, which stabilizes the learning process. This method with Eq. (6) is called double-Q learning. 46) The agent (CNN) is trained as follows. In every time step, the set of data (s t , a t , r t , s t+∆t ) is stored in the memory, which is called the replay memory. 3) In the present case, the replay memory can store 50,000 sets of data. Once the limit is reached, old data are removed to store new data (queue). From the replay memory, 32 sets of data are taken randomly, which is called a minibatch. For each dataset in the minibatch, the target value y t in Eq. (6) is calculated, and the network parameters of the CNN are updated to minimize minibatch L(Q(s t , a t ) − y t ), where the function L is called the Huber loss, 47) defined by Using the Huber loss, large gradients are suppressed and network updates are stabilized. Updating the CNN is performed using the Adam optimization scheme 48) with a learning rate of 10 −5 -10 −4 . The procedure for the agent is summarized in Algorithm 1.

Algorithm 1
Initialize deep-Q network Q Initialize target network asQ = Q for episode = 1, N episode do Initialize environment and get initial observation s 0 for t = 0, T end do Select action a t = argmax a Q(s t , a) with ǫ-greedy exploration Execute action a t on environment and get reward r t and next state s t+∆t Store (s t , a t , r t , s t+∆t ) in replay memory Sample minibatch of (s t , a t , r t , s t+∆t ) from replay memory Set y t = r t if t = T end , otherwise y t = r t + γQ(s t+∆t , a ′ ), where a ′ = argmax a Q(s t+∆t , a) Train network using gradient of L(y t − Q(s t , a t )) CopyQ = Q every C steps end for end for

Two-dimensional system
We first consider the 2D system, which obeys the GP equation in Eq. (3). The initial parameters of the Gaussian potential in each episode are ξ(0) = 0, η(0) = 2, and A ⊥ (0) = 20. The initial state of the BEC is the ground state for these parameters, as shown in Fig. 3(a), where the density dip is due to the Gaussian potential. For this initial state, therefore, the rotational symmetry of the problem is broken. Here we set our target state ψ target (x, y) to the lowest-energy stationary state having a singly-quantized counterclockwise vortex at the center, as shown in Fig. 3(f). Numerically, such a wave function can be obtained by phase imprinting followed by imaginarytime evolution. Starting from the initial state without a vortex, the agent tries to produce ψ target (x, y) by controlling the position and strength of the Gaussian potential in Eq. (4).
We specify the action and reward. Depending on the six actions of the agent, a = 0, 1, · · · , 5, the parameters in Eq. (4) are changed in each time step as where the displacement 0.15 is identical with the numerical mesh size. Since the strength A ⊥ of the Gaussian potential produced by a blue-detuned laser beam should be nonnegative, the actions a = 5 is ignored when A ⊥ becomes negative. The fidelity between the wave function ψ ⊥ (x, y, t) at time t and the target wave function ψ target (x, y) is defined by Since the present purpose is to increase the fidelity as much as possible, the reward r t given to the agent should be taken to be a monotonically increasing function of the fidelity. Here, to enhance the increase of the fidelity near F = 1, we take the reward as As shown in the inset of Fig. 3(g), this function steeply rises for F 0.9. We performed the reinforcement learning in Algorithm 1  49) for N episode = 5000 episodes. The total reward obtained in each episode is defined as where the episode terminates at T end = 500∆t. The episode with the largest R total among 5000 episodes is shown in Fig. 3. First, the Gaussian potential is moved leftward (Fig. 3(b)), and at the edge of the BEC a clockwise vortex is released to the periphery of the BEC, leaving a counterclockwise vortex in the Gaussian potential (Fig. 3(c)). The Gaussian potential then carries the counterclockwise vortex to the center of the BEC (Fig. 3(d)), at which point the strength of the Gaussian potential is reduced to produce the desired state (Fig. 3(e)). This process finishes at t ≃ 130, and after that the vortex is kept at the center. It is numerically predicted 41) and experimentally verified 42) that the simple translation of a circular potential in a BEC produces a pair of clockwise and counterclockwise vortices simultaneously. In the above dynamics, in contrast, the strong anisotropy at the edge of the condensate is used to release only a single vortex from the potential.   The strength A(t) of the Gaussian potential is first decreased from the initial value as the potential goes to the edge of the BEC. This is because the density is low at the edge of the BEC and a weak Gaussian potential is suitable for manipulating the vortices. The strength A(t) is then increased to rapidly transport the counterclockwise vortex to the center. Even after the final state is produced at t ≃ 130, A(t) is kept at ≃ 4, which helps fix the vortex to the center. It was confirmed that the vortex stays almost at the center, even if A(t) vanishes for t ≥ 130. Figure 4(a) shows the total reward R total obtained in each episode. The total reward increases as the agent experiences more episodes, and finally reaches R total ≃ 2000. The total reward first increases to R total ≃ 400, then remains on a plateau till the ≃ 2000th episodes and then increases again. This behavior is related to the form of the reward in the inset in Fig. 3(g), because the total reward saturates at R total ≃ 400 if we replace the reward in Eq. (10) with r t = F(t).
(12) Figure 4(b) shows the time evolution of the fidelity F(t) for the best episode, which indicates that the fidelity reaches F ≃ 0.97 using the reward in Eq. (10), while F ≃ 0.9 for the reward in Eq. (12). A snapshot of the density profile for the reward in Eq. (12) is shown in Fig. 4(c), which exhibits more short-wavelength excitations than Fig. 3(e). Thus, the nonlinear term in Eq. (10) makes the fidelity closer to unity, and reduces short-wavelength excitations. In Fig. 4(a), we also show the case without double-Q learning, i.e., we use y t = r t + γQ(s t+∆t , argmax a Q(s t+∆t , a)) (13) instead of Eq. (6). Without the target networkQ, the learning process does not proceed, which indicates the validity of the target network.
3.2 Three-dimensional system Next, we consider a 3D system obeying Eq. (1), in which a BEC is confined in an isotropic harmonic potential, and a Gaussian laser beam is applied from the x direction, producing the potential in Eq. (2). The initial parameters in Eq. (2) are set to A(0) = 20, d y (0) = 1, and ζ(0) = 0, and d z = 0.5 is fixed. The initial wave function is the ground state for these parameters, as shown in Fig. 5(a). Here, our target state ψ target (r) is the stationary state containing a vortex ring 50) with axial symmetry about the z axis, as shown in Fig. 5(f). Although a vortex ring in a uniform system travels in the direction of the symmetry axis, the vortex ring in Fig. 5(f) is stationary due to the inhomogeneity. Numerically, this target state is generated by imprinting the phase e iφ with φ = tan −1 [z/(r ⊥ − r 0 )] − tan −1 [z/(r ⊥ + r 0 )] followed by the imaginary-time evolution, where r 0 and the duration of the imaginary-time evolution are chosen appropriately.
As in the 2D case, the agent chooses one of six actions, a = 0, 1, · · · , 5, according to which the parameters of the Gaussian potential are modified as in each time step. The actions a = 3 and 5 are ignored when d y and A become negative, respectively. The reward has the same form as in Eq. (10) with the fidelity being given by Figures 5(a)-5(e) show the time evolution of the BEC for the episode with the best total reward among N episode = 3000 episodes. The controlled parameters of the Gaussian external potential are shown in Fig. 5(g). First, the Gaussian potential is moved in the −z direction, and a pair of vortex-antivortex lines are produced (Fig. 5(b)). The width d y of the Gaussian potential is then increased (Fig. 5(c)), and the two vortex lines connect with each other at the ±x edges (Fig. 5(d)), which forms a vortex ring (Fig. 5(e)). The produced vortex ring is almost stationary and stays in the condensate until the end of the episode t/∆t = 500. After t/∆t 100, the Gaussian  The episode with the best total reward among these 3000 episodes is used in (a)-(e) and (g). See the Supplemental Material for a movie of the dynamics shown in (a)-(e). 51) external potential almost vanishes because either A = 0 or d y = 0.
The generation of the vortex ring shown above is nontrivial, because the external Gaussian potential is uniform in the x direction. In a uniform system, using such a potential, we can only generate vortex lines along the x direction, and therefore the vortex-ring formation in our system is due to the inhomogeneity in the trap. The board-like potential with large d y , as in Figs. 5(c) and 5(d), also plays an important role. This potential creates the constrictions at the ±x edges of the isodensity surface, as shown in Fig. 5(c), which bend the vortex lines and forms a ring.
The time evolution of the fidelity F(t) is shown in Fig. 5(g). Unlike the 2D case in Fig. 4(b), the fidelity in Fig. 5(g) exhibits nonadiabatic oscillation, which may be unavoidable for the present condition. Figure 5(h) shows the total reward R total obtained in each episode. The total reward first increases gradually for 1000th step and then suddenly rises to R total ≃ 800. This behavior is similar to that in the 2D case shown in Fig. 4(a).

Conclusions and discussions
We have applied reinforcement learning to the control of nonlinear matter waves. The agent in the reinforcement learning is implemented using a deep convolutional neural network (CNN), and the state of the Bose-Einstein condensate (BEC) is input into the CNN to determine the next action of the agent. According to the action of the agent, the position and shape of the external Gaussian potential applied to the BEC are controlled. The agent is trained so that the state of the BEC approaches the prescribed target state.
Using this method, we demonstrated that quantized vortices can be created and manipulated in 2D and 3D systems. In the 2D system, the target state was set to be the singlevortex state at the center. Although vortices and antivortices are always created in pairs, the agent found a way to expel one of them to leave only a single vortex at the center of the BEC (Fig. 3). In the 3D system, the target state was set to be the stationary vortex-ring state. Although the Gaussian potential is 2D (uniform in the x direction), the agent found a way to produce such a 3D structure (Fig. 5).
The two examples demonstrated in the present paper are interesting, but rather simple; the control parameters found in Figs. 3(g) and 5(g) for creating the target states are very simple functions. Other methods have been developed to control quantum systems [33][34][35] that may find similar optimal results. Furthermore, the simple vortex states used for our target states can also be created by other methods, such as methods using Rabi transitions. 52,53) In this sense, our method might be considered overcomplicated for the present examples. However, reinforcement learning is very versatile and may prove to be an effective approach for handling more complicated problems (e.g., manipulation of multiple vortex states, control of quantum turbulence, creation of nontrivial topological excitations, etc.). This will be the topic of future studies.
Another challenging extension of this study will be the realtime control of BECs in experiments, where a series of nondestructive imaging data is given to the agent, and the reward is based on the measurement data. Reinforcement learning may be suitable for such measurement-based control which intro-