Reinforcement learning-enhanced protocols for coherent population-transfer in three-level quantum systems

We deploy a combination of reinforcement learning-based approaches and more traditional optimization techniques to identify optimal protocols for population transfer in a multi-level system. We constrain our strategy to the case of fixed coupling rates but time-varying detunings, a situation that would simplify considerably the implementation of population transfer in relevant experimental platforms, such as semiconducting and superconducting ones. Our approach is able to explore the space of possible control protocols to reveal the existence of efficient protocols that, remarkably, differ from (and can be superior to) standard Raman, stimulated Raman adiabatic passage or other adiabatic schemes. The new protocols that we identify are robust against both energy losses and dephasing.


I. INTRODUCTION
It is well known that quantum systems can provide a clear computational advantage over their classical counterparts, and several algorithms have been presented whereby this advantage is exploited to carry out so-called super-classical tasks [1][2][3]. The required control over quantum systems, however, remains the biggest challenge for the full implementation of quantum computing algorithms. A promising experimental platform for controlling general quantum systems is provided by superconducting circuits, which have been widely employed to fabricate qubits (see [4] and [5] for reviews) and two-qubit gates [6][7][8], as well as implementations of so-called circuit quantum electrodynamics (QED) [9,10], which is at the forefront of the current "quantum race" [11]. Multi-level dynamics has also been addressed both theoretically [12][13][14][15] and experimentally [16][17][18]. However, together with the promise of an experimental platform to manipulate quantum systems towards the achievement of a quantum network of multiple nodes comes an increased demand for quantum-control schemes pertinent to the experimental constraints at play. Much of the work towards this goal employs techniques from NMR, quantum optics and quantum optimal control theory [19,20]. Specifically, gradient-based optimization methods have recently been employed to control general open systems with a myriad of applications [21], as well as to aid the design of high-fidelity, protected superconducting quantum gates [22][23][24][25]. In the context of multi-level systems, the use of two-tone pulses allows faithful, selective and robust single-qubit quantum operations, such as population transfer and the generation of superpositions, and can be generalized to quantum operations on multiple nodes of a network, such as state shuttling, entanglement and single-photon generation. Stimulated Raman adiabatic passage (STIRAP) and Raman oscillations are two well-known protocols for the implementation of these quantum operations. Some effort has been made to adapt the original formulation of such protocols to reduced-control architectures [13,14,26] or to improve them by using optimal pulse shaping and superadiabatic techniques [16,27,28].

* These authors contributed equally to this work.
More recently, machine learning techniques have emerged as a viable option for finding alternative optimal control schemes. In particular, reinforcement learning (RL) has been employed in the context of state preparation [29,30], circuit architecture design [31] and control of multi-level systems [32]. In the context of three-level systems, deep-neural-network-based RL has been used along with state monitoring to learn optimal pulse shapes for driving fields [33,34]. Here we implement a two-step optimization approach that combines different optimization techniques. Initially, Deep Reinforcement Learning (DRL)-like techniques, in conjunction with Recurrent Neural Networks (RNNs), are used to learn the shape of efficient piecewise-constant control pulses, without the requirement for state monitoring [35]. This key insight is then used to implement a suitable traditional optimization method. This two-step approach yields smooth, analytically well-defined control pulses. An important point is that the application of such conventional optimization methods without any pre-available information is, in general, much more difficult due to the "curse of dimensionality" [36]. To succeed, they require the choice of a suitably truncated basis upon which to expand their control functions. This highlights the utility of the initial learning step, which is essentially user-independent and can provide a suitable ansatz without the need for prior knowledge of the system. For example, a requirement for the success of STIRAP is the existence of (a manifold of) adiabatic dark states, and the full knowledge of their structure [37]. On the other hand, for Raman oscillations, the hallmark for adiabatic elimination is the validity of restrictive parameter conditions (such as large detunings), so as to constrain the dynamics to relevant subspaces. The RL-based step discussed here provides protocols that violate both such restrictive conditions, and thus differ from both STIRAP and simple adiabatic elimination, while combining advantages of both to achieve near-optimal dynamics. This provides an ansatz for the control that may otherwise not have been arrived at analytically, and whose flexibility could be exploited to engineer operations in multi-node architectures. While delivering previously unforeseen protocols, this hybrid approach to optimization marks a significant departure from previous methods for the control of quantum dynamics, and embodies one of the pillars of our proposal.
The remainder of this paper is organized as follows. In Sec. II we introduce the physical system of interest, which allows us to motivate the specific form of control chosen. In Sec. III we show how an RL agent was able to learn control schemes that induce the desired dynamics in the system. Then, in Sec. IV, we use a less sophisticated coefficient optimization over a polynomial basis in an attempt to reproduce the results obtained by the RL approach. In Sec. V we use the results from the RL agent in Sec. III as the starting point for the simpler coefficient optimization, obtaining a further improvement in protocol efficiency compared with either method alone. We then dedicate Sec. VI A to an analysis of the resilience of the learned protocols to stochastic decay within the system, where we explicitly consider the performance of both protocols in a 3-level Ladder system. We finally discuss the resilience of the protocols to pure dephasing in Sec. VI B and their robustness to low-frequency noise in Sec. VI C, followed by a brief discussion of the results in Sec. VII.

II. THE SYSTEM
We investigate control protocols for an abstract 3-level quantum system and specifically consider the task of population transfer in so-called Lambda systems, where a ground state |g⟩ and a target state |f⟩ are indirectly coupled via an intermediate excited state |e⟩, as shown in Fig. 1. The states |g⟩ and |f⟩ are here considered to be 'quasi-stable' ground states, whereas |e⟩ is a radiatively decaying excited state. The typical Hamiltonian for this physical system, Eq. (1), is usually considered in the context of STIRAP [37][38][39], where for δ ≃ 0 there exists a suitable control scheme for Ω_P(t) and Ω_S(t), the so-called counterintuitive pulse sequence, such that perfect transfer from |g⟩ to |f⟩ is achieved whilst |e⟩ is kept depopulated at all times. Here we instead consider the case of always-on Rabi frequencies whilst modulating the single- and two-photon detunings. The population transfer thus achieved mimics protocols in circuit-QED where the couplings between qubit and harmonic mode are not switchable [14]. Specifically, we investigate the case where the couplings Ω_P and Ω_S both assume the constant value Ω_0, whilst freedom is afforded to modulate the detunings δ_P(t) and δ(t), which embody a set of controls that are simple to manipulate experimentally. The remit of our investigation extends beyond the context set by the 3-level system illustrated in this Section. Indeed, the three-level model considered here can also be used to address the problem of population transfer between two remote quantum resonators, both connected by non-switchable couplings to a three-level system that can be operated locally [13]. Moreover, this configuration also describes a system consisting of two qubits connected by the field of a cavity and working in the single-excitation subspace. In this context, the two low-energy states of the equivalent three-level system would represent states where a single excitation is carried by one of the remote qubits, while the top-most state would imply that the cavity field is populated. This configuration is the building block of cavity-/circuit-QED architectures for controlled quantum dynamics currently being explored experimentally.
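As an illustration of the control setting described above, the Hamiltonian with always-on couplings and modulated detunings can be sketched in a few lines of Python. The matrix layout below follows a common rotating-wave convention for Lambda systems (ħ = 1, couplings entering as Ω_0/2, δ_P on |e⟩ and δ on |f⟩) and is an assumption of this sketch rather than a transcription of Eq. (1); the names `lambda_hamiltonian` and `is_hermitian` are hypothetical.

```python
def lambda_hamiltonian(delta_p, delta, omega_0=1.0):
    """Return a 3x3 Hamiltonian in the basis (|g>, |e>, |f>), with hbar = 1.

    Assumed convention (not taken from the text): both couplings equal
    omega_0 / 2, the single-photon detuning delta_p sits on |e> and the
    two-photon detuning delta sits on |f>.
    """
    half = 0.5 * omega_0
    return [
        [0.0, half, 0.0],
        [half, delta_p, half],
        [0.0, half, delta],
    ]


def is_hermitian(matrix, tol=1e-12):
    """Check that a square matrix equals its conjugate transpose."""
    n = len(matrix)
    return all(
        abs(matrix[i][j] - complex(matrix[j][i]).conjugate()) <= tol
        for i in range(n)
        for j in range(n)
    )


# Example values in units of Omega_0 (illustrative only).
H = lambda_hamiltonian(delta_p=5.0, delta=0.1)
```

In this form the only time-dependent entries are the two diagonal detunings, which is the restriction to "fixed couplings, time-varying detunings" discussed in this Section.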

III. REINFORCEMENT LEARNING BASED OPTIMIZATION
In order to find an efficient control scheme we first employ an RL-inspired approach. Initially, we fix the total time for the system evolution to T, which is then divided into N_steps time intervals, t_i, of equal duration. This constitutes one episode.
During each of these intervals the one- and two-photon detunings have constant values, δ_P(t_i) and δ(t_i), which are all determined by an RL agent prior to each interval. Thus, for each time interval, we use the Hamiltonian in Eq. (1) with t → t_i and Ω_P = Ω_S = Ω_0 to evolve the continuous-time open-system dynamics ruled by the Lindblad master equation for the duration of the time interval. Here ρ is the density matrix of the system and D is the Lindblad-like operator accounting for the non-unitary part of the dynamics. More specifically, the agent provides two values for each individual time step, which act as the mean values of two separate Gaussian policies from which the detunings are sampled at said time step. Learning is implemented using the policy gradient REINFORCE (with baseline) algorithm for continuous action spaces [40], employing a long short-term memory (LSTM) neural network [41] as a function approximator (with only the series of time steps {t_i}, i = 1, ..., N_steps, as external input to the network), mirroring previous work [35].
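The sampling step described above can be sketched as follows. This is a minimal illustration, not the configuration used in this work: the list of means stands in for the LSTM output, the value of `sigma` is arbitrary, and the clipping of the normalized actions to [−1, 1] is an assumption of this sketch. The function name `sample_detunings` is hypothetical.

```python
import random


def sample_detunings(means, sigma, max_abs, rng):
    """Sample one piecewise-constant detuning sequence from a Gaussian policy.

    means   : policy means, one per time interval, in normalized units
    sigma   : common standard deviation of the Gaussian policy (illustrative)
    max_abs : largest allowed |detuning| in units of Omega_0
    rng     : random.Random instance, so that runs are reproducible
    """
    actions = []
    for mu in means:
        a = rng.gauss(mu, sigma)
        a = max(-1.0, min(1.0, a))  # clipping is an assumption of this sketch
        actions.append(a)
    # Rescale the normalized actions to physical detunings.
    return actions, [a * max_abs for a in actions]


rng = random.Random(0)
n_steps = 20
mu_delta_p = [0.0] * n_steps  # stand-in for the network output at each t_i
actions, delta_p = sample_detunings(mu_delta_p, sigma=0.1, max_abs=50.0, rng=rng)
```

The same call, with its own list of means, would produce the sequence for δ(t_i); the two sequences together define one episode.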
Thus the agent is tasked with learning a policy that provides the optimal detuning control scheme, where performance is considered with respect to perfect transfer between |g⟩ and |f⟩, whilst keeping |e⟩ depopulated at all times. In order to meet such a request we couple the intermediate state to a sink |s⟩, in the learning phase only, as shown in Fig. 1.
This coupling is operationally implemented by introducing the Lindblad operator √Γ |s⟩⟨e| into the dissipator in Eq. (2) and induces a decay mechanism in the system, whereby any protocol that appreciably populates the excited state invariably leads to population loss. This is crucial: removing the state observation at each time step removes the ability to explicitly define a reward function that encourages the desired dynamics. In this case we can define the delayed reward granted to the RL agent at the end of the evolution as the final target-state population, R = ρ_f,f(T). (3) This explicitly promotes population of the final state |f⟩, whilst any transient population of |e⟩ during the dynamics will act to lower this final population thanks to the aforementioned leakage mechanism. In this sense, punishment for populating |e⟩ is built into the mechanisms of the system via |s⟩.
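The built-in punishment mechanism can be illustrated with a short numerical sketch. Because the sink only receives population and is not coupled back to the system, the populations of |g⟩, |e⟩ and |f⟩ for a pure initial state can be obtained from an effective non-Hermitian Hamiltonian H − i(Γ/2)|e⟩⟨e|. The sketch below uses this shortcut with a fixed-step Runge-Kutta integrator; the Hamiltonian convention (ħ = 1, couplings Ω_0/2), the function names and the number of substeps are assumptions of this example, not the implementation used in this work.

```python
import math


def h_eff(delta_p, delta, omega_0, gamma):
    """Effective Hamiltonian in the basis (|g>, |e>, |f>); the sink is traced out."""
    half = 0.5 * omega_0
    return [
        [0.0, half, 0.0],
        [half, delta_p - 0.5j * gamma, half],
        [0.0, half, delta],
    ]


def rhs(h, psi):
    """Right-hand side of d(psi)/dt = -i H_eff psi."""
    return [-1j * sum(h[r][c] * psi[c] for c in range(3)) for r in range(3)]


def rk4_step(h, psi, dt):
    """One fourth-order Runge-Kutta step with a constant Hamiltonian."""
    k1 = rhs(h, psi)
    k2 = rhs(h, [p + 0.5 * dt * k for p, k in zip(psi, k1)])
    k3 = rhs(h, [p + 0.5 * dt * k for p, k in zip(psi, k2)])
    k4 = rhs(h, [p + dt * k for p, k in zip(psi, k3)])
    return [p + dt / 6.0 * (a + 2 * b + 2 * c + d)
            for p, a, b, c, d in zip(psi, k1, k2, k3, k4)]


def final_reward(delta_p_seq, delta_seq, total_time,
                 omega_0=1.0, gamma=1.0, substeps=1000):
    """Return R = rho_ff(T) for piecewise-constant detunings, starting in |g>."""
    psi = [1.0 + 0j, 0j, 0j]
    dt = total_time / len(delta_p_seq) / substeps
    for dp, d in zip(delta_p_seq, delta_seq):
        h = h_eff(dp, d, omega_0, gamma)
        for _ in range(substeps):
            psi = rk4_step(h, psi, dt)
    return abs(psi[2]) ** 2


# Resonant transfer through |e>: complete without the sink, penalized with it.
T = math.pi * math.sqrt(2.0)
reward_no_sink = final_reward([0.0], [0.0], T, gamma=0.0)
reward_with_sink = final_reward([0.0], [0.0], T, gamma=1.0)
```

With both detunings set to zero the transfer proceeds through |e⟩, so switching on the sink lowers the final reward even though no explicit penalty term appears in R.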
The way this algorithm is able to work without monitoring the system at each time step can be rationalised in the following way. As the LSTM does not monitor the state of the system at each time step, it relies only on the ability to 'memorize' the actions that it has taken at each time step leading up to the final reward R. Thus, over several episodes, the agent is able to build an internal representation of the system dynamics and learn to act optimally with only the series of time steps as input and the final target-state population as feedback. Consequently, this type of optimization could in principle be employed as an iterative, closed-loop scheme. This key feature of our approach would be beneficial for optimizing control in the presence of difficult-to-simulate environmental decoherence, such as in the situations faced by solid-state quantum hardware [24,42]. A detailed explanation of the RL-LSTM approach is provided in Appendix A, while the network configuration, along with all the learning parameters, is reported in Appendix B.
Using the RL-based optimization outlined above, with Ω_0 T = 20, N_steps = 20 and (δ, δ_P)/Ω_0 ∈ [−50, 50], the agent was able to obtain a target-state population at the end of the protocol of ρ_f,f(T) ≈ 0.9993, with a maximum excited-state population over the entire time interval of max_t ρ_e,e(t) ≈ 0.0884. The learned protocol and the induced population dynamics can be found in Fig. 2. Despite the evidently desirable features of the results thus achieved, it is worth remarking that the learning process is in general stochastic and different runs of the optimization can produce different shapes for the detuning functions. However, successfully optimized detuning functions all shared common traits, which can be summarized by the following list of characteristic features:
C1: We have |δ(t)| ≪ |δ_P(t)| for most of the evolution.
C2: The detuning δ_P(t) always exhibits comparatively large initial and final values.
C3: δ_P(t) always seems to exhibit specific parity features about T/2. Such a feature is more sporadically shared by δ(t).
In particular, feature C2 is to be expected if one wants to avoid populating the excited state at the beginning and at the end of the transfer, and agrees with previous findings reported in the literature [14]. Furthermore, feature C1 can be justified by inspecting how the presence of non-vanishing detunings affects the efficiency of both standard STIRAP and Raman protocols: while even small non-null values of |δ| are detrimental for the performance of the transfer, much larger values of δ_P can be tolerated [37,39,43].
We have performed an optimization process based on the use of a restricted range for the values of |δ|, thus limiting the action space of the RL agent and guaranteeing the validity of C1. In particular, we considered δ/Ω_0 ∈ [−0.2, +0.2] and [...]. Remarkably, and contrary to what one would naively expect, this protocol is not akin to a Raman-like or a STIRAP-like one. First, two-photon Raman protocols require large single-photon detunings while, in our case, δ_P can even vanish, thus making the dynamics comparatively faster. Second, the protocols that we have found are non-adiabatic, which makes them markedly different from adiabatic population transfers such as STIRAP. Our LSTM RL approach thus delivers genuinely new protocols that combine features of robustness akin to STIRAP but without requiring the demanding switching of coupling fields.

IV. POLYNOMIAL COEFFICIENT OPTIMIZATION
Instead of dividing the time of the evolution into a certain number of steps and optimizing the values of the detunings at each step, an alternative approach for the optimization consists in expanding δ(t) and δ_P(t) over a specific functional basis. The effectiveness of this approach depends on the choice of such a basis, making it less general than the technique used in the previous section or other sophisticated optimal control techniques such as CRAB [44]. However, should a suitable basis be found, the suggested approach translates the problem of finding the best protocol into a simpler numerical optimization over the coefficients of the expansion, while also providing us with a simple analytical expression for the control terms.
We found that writing δ(t) and δ_P(t) as 5th-order polynomial functions and using a Powell method search [45] over the coefficients of the polynomial expansion is enough to achieve an effective population transfer. In Figs. 4 and 5 we show the best protocols obtained after 10 different runs of the optimization for Ω_0 T = 40. It can be seen that, while still effective, they are different from the protocol found via the RL-based optimization (although conditions C1 and C2 found by the RL agent can still be observed). This again suggests that various quasi-optimal protocols can be identified as candidates for an efficient population transfer. However, the effectiveness of such an optimization technique depends on the choice of the basis for the specific problem. Performing a simple numerical optimization to solve the same problem assigned to the RL agent (finding the values of piecewise-constant functions) gives us far worse solutions compared to those obtained using the RL-based approach [35]. Therefore, not only can the RL-based approach be successfully applied to a wider class of problems with a simpler pre-optimization analysis, but it also provides a better exploratory tool when only sub-optimal solutions are achieved, as these solutions are not biased by the choice of a specific basis of functions.

V. OPTIMAL PROTOCOLS
Based on the success of both the RL-based optimization and the optimization with a polynomial basis, we combined the two approaches, performing a straightforward numerical optimization starting from the results of the RL-based technique. To this end, observing the features in Fig. 3, we propose an ansatz for δ_P(t), Eq. (4). Similarly, we suggest a linear ansatz for δ(t), Eq. (5). The choice of Eqs. (4) and (5) ensures that the symmetry or anti-symmetry point of the proposed functions occurs at t = T/2. This optimization was carried out using a Powell method search [45] over the space of parameters (C_1, C_2, k, m) for the maximization of R. The benefit here is two-fold: on the one hand, it allows us to find an analytical expression for the protocol, thus contributing to the interpretation of the results that we achieve; on the other hand, it smooths the protocol found by the RL agent, presenting us with a continuous control scheme, which is experimentally more tractable. In achieving these two goals, the analytic, smooth control pattern maintained a final target-state population comparable to that of the RL-learned scheme, while further reducing the transient population of the excited state. Specifically, in Fig. 6 we present results for Ω_0 T = 20 and Ω_0 T = 40 showing that, for the second case, ρ_f,f(T) ≈ 0.9994 and max_t ρ_e,e(t) ≈ 0.0143 can be achieved with the simple ansatz that we have proposed.
The insight provided by the RL-based optimization approach suggests the existence of different valid protocols. In this regard, an interesting question concerns the role of the parity exhibited by the detuning functions with respect to t = T/2. That is, we wonder whether optimal functional behaviors akin to those exhibited in Fig. 2 can be identified. To ascertain this, we use an odd 5th-order polynomial function of (t/T − 0.5) for δ_P and an even 4th-order polynomial of (t/T − 0.5) for δ, and perform a similar optimization, finding that the corresponding optimized protocol is still effective [cf. Fig. 7]. The resulting final target-state population is ρ_f,f ≈ 0.9969, while the maximum excited-state population is max_t ρ_e,e(t) ≈ 0.01555. For brevity, we label the protocol of Fig. 6 (c)-(d) as protocol 1, while that of Fig. 7 will be referred to as protocol 2. We point out that the performances of protocol 1 and protocol 2 mentioned here are extremely similar to that of the RL protocol of Fig. 3. Optimality can thus be understood in terms of the evident simplicity of the control functions needed to achieve such performance. The behaviours showcased in our results allow us to corroborate quantitatively the differences between our protocols and Raman-like ones. The first clear difference is the absence of Raman oscillations [cf. Fig. 9] from the dynamics of the populations resulting from our protocols. A second difference between the two approaches stems from the fact that in protocol 1 δ_P is constant most of the time and we get δ(t) ≪ δ_P(t). One can then ask how this compares to a Raman scheme with δ = 0 and a constant δ_P ≫ Ω_0. When no constraint is imposed on the total time of the evolution, one would expect that increasing δ_P progressively improves the transfer. However, our approach assumes a fixed value of Ω_0 T. This implies that a very large value of δ_P could prevent the completion of the corresponding very slow population transfer. Both of these effects are relevant for the optimal choice of δ_P. In Fig. 8 we show that protocol 1 achieves a more efficient population transfer relative to the case of a completely constant Hamiltonian. Moreover, in line with previous considerations, we also remark that the protocols are not adiabatic. If we increase the total time of the evolution while still using protocols 1 and 2 (without performing a new optimization for each value of Ω_0 T), the performance does not increase monotonically, as it would in a Raman protocol (Fig. 9).
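The parity-constrained parametrization used for protocol 2 can be sketched as below: an odd 5th-order polynomial in x = t/T − 0.5 for δ_P and an even 4th-order polynomial in x for δ, so that the (anti-)symmetry point sits at t = T/2 by construction. The coefficient values are placeholders chosen for illustration and the function names are hypothetical; in practice the six coefficients would be the variables passed to the numerical optimizer.

```python
def odd_poly(coeffs, x):
    """c1*x + c3*x**3 + c5*x**5, antisymmetric about x = 0."""
    c1, c3, c5 = coeffs
    return c1 * x + c3 * x**3 + c5 * x**5


def even_poly(coeffs, x):
    """c0 + c2*x**2 + c4*x**4, symmetric about x = 0."""
    c0, c2, c4 = coeffs
    return c0 + c2 * x**2 + c4 * x**4


def detunings(t, total_time, odd_coeffs, even_coeffs):
    """Return (delta_P(t), delta(t)) for the parity-constrained ansatz."""
    x = t / total_time - 0.5  # shift so that t = T/2 maps to x = 0
    return odd_poly(odd_coeffs, x), even_poly(even_coeffs, x)


# Placeholder coefficients (illustrative only, not optimized values).
odd_c = (1.0, -2.0, 3.0)
even_c = (0.1, 0.2, 0.3)
dp_early, d_early = detunings(10.0, 40.0, odd_c, even_c)
dp_late, d_late = detunings(30.0, 40.0, odd_c, even_c)
```

Restricting the basis in this way halves the number of free coefficients per detuning compared with a general polynomial of the same order.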

VI. RESILIENCE TO DECOHERENCE

A. Spontaneous Decay
Here we consider how the protocols that we have found perform when a multi-level system is subjected to spontaneous decay from some of its energy levels. We investigate two cases:
A: The decay from the intermediate excited state ρ_e,e to the first ground state ρ_g,g with a decay rate γ_e,g, implemented using the Lindblad operator √γ_e,g |g⟩⟨e|.
B: The case of an additional decay channel, from ρ_f,f to ρ_e,e, with rate γ_f,e, implemented by √γ_f,e |e⟩⟨f|.
Scenario A is what one would expect to be relevant for the Lambda system that we have discussed thus far, particularly when fluxonium-based embodiments of the multi-level system are considered [46], for which an incoherent mechanism driving decay of population from |f⟩ to |g⟩ can be safely neglected. On the other hand, scenario B is motivated by the fact that the Hamiltonian in Eq. (1) encapsulates the so-called Ladder energy level structure, where |f⟩ becomes a higher excited state than |e⟩ and is thus susceptible to spontaneous decay. The Lambda and Ladder scenarios are operationally equivalent as far as the control protocols are concerned. A diagrammatic outline of these two scenarios is shown in Fig. 10. We consider the sensitivity of the protocols, with respect to the final target-state population ρ_f,f(T), for a range of decay rates in both cases. Fig. 11 shows how the performance of protocols 1 and 2 depends on the strength of the decay rates in each of the Lambda and Ladder cases, where performance is gauged simply by the final target-state population.
It can be appreciated that both protocols carry a strikingly similar dependence on the decay rates and exhibit relative robustness against decay from the intermediate state. This is to be expected, since the RL process included a mechanism to punish population of such a level. On the other hand, both protocols exhibit great sensitivity to decay from the target state. Thus, in a Ladder system, a decaying target state embodies the main limiting factor.
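The two decay scenarios can be made concrete with a short sketch of the corresponding dissipator, written in the standard Lindblad form LρL† − ½{L†L, ρ} summed over the jump operators. The rates below are arbitrary illustrative values, and the helper names (`jump_operator`, `dissipator`) are hypothetical.

```python
import math


def matmul(a, b):
    """Product of two square matrices stored as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def dagger(a):
    """Conjugate transpose of a square matrix."""
    n = len(a)
    return [[complex(a[j][i]).conjugate() for j in range(n)] for i in range(n)]


def jump_operator(rate, to_idx, from_idx, dim=3):
    """sqrt(rate) |to><from| as a dim x dim matrix."""
    op = [[0j] * dim for _ in range(dim)]
    op[to_idx][from_idx] = math.sqrt(rate) + 0j
    return op


def dissipator(jump_ops, rho):
    """Sum over L of  L rho L^dag - (1/2) {L^dag L, rho}."""
    n = len(rho)
    out = [[0j] * n for _ in range(n)]
    for op in jump_ops:
        op_dag = dagger(op)
        op_dag_op = matmul(op_dag, op)
        gain = matmul(matmul(op, rho), op_dag)
        loss_left = matmul(op_dag_op, rho)
        loss_right = matmul(rho, op_dag_op)
        for i in range(n):
            for j in range(n):
                out[i][j] += gain[i][j] - 0.5 * (loss_left[i][j] + loss_right[i][j])
    return out


G, E, F = 0, 1, 2
lambda_ops = [jump_operator(0.1, G, E)]                # scenario A: |e> -> |g>
ladder_ops = lambda_ops + [jump_operator(0.05, E, F)]  # scenario B adds |f> -> |e>

# Density matrix with all population in the target state |f>.
rho_f = [[0j] * 3 for _ in range(3)]
rho_f[F][F] = 1.0 + 0j
rate_a = dissipator(lambda_ops, rho_f)
rate_b = dissipator(ladder_ops, rho_f)
```

For a system already in |f⟩, scenario A produces no instantaneous loss while scenario B drains the target state directly, which is consistent with the much stronger sensitivity to decay from the target state discussed above.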

B. Dephasing
We extend the previous analysis by studying the behaviour of protocols 1 and 2 under the effects of dephasing. While sophisticated models can be invoked to illustrate the various facets of dephasing, in order to gather an understanding of its implications for the protocols identified here, we focus on pure dephasing implemented using the Lindblad operators of Eq. (6), where Γ_kl = Γ_k + Γ_l and ρ_kl = ⟨k|ρ|l⟩ (k, l = g, e, f). We are now able to investigate the sensitivity of protocols 1 and 2 with respect to such a mechanism, an analysis that we perform by independently varying the value of one of the Γ_k's, whilst keeping the others at zero. The relationship between protocol efficiency and each of these dephasing rates can be inspected in Fig. 12. Owing to the two-photon character of the protocols at hand, we find a much higher sensitivity to non-zero Γ_g and Γ_f, whilst the protocols are comparatively resilient to Γ_e. In terms of Eq. (6), this translates into a much larger sensitivity to Γ_gf relative to Γ_ge and Γ_ef. This is a consequence of the protocols having been optimized to constrain the system dynamics to the subspace {|g⟩, |f⟩} of the full Hilbert space and, as such, being most reliant on coherence between the initial and target states.

FIG. 12. Final population achieved by the optimal protocols while independently varying the dephasing strengths Γ_k with Γ_j = 0 ∀ j ≠ k (k = g, e, f).

C. Robustness against low-frequency noise
We conclude our assessment of the robustness of the proposed protocols by addressing the sensitivity to perturbations of the detunings. This analysis is particularly relevant for superconducting devices, where the main dephasing mechanism can be attributed to the presence of low-frequency noise that often has a 1/f spectrum characterized by slow fluctuations of the detunings [42,43]. Due to the slowness of such fluctuations, the value of the detunings induced by the low-frequency noise can be considered constant during the population transfer. Hence, a simple way to achieve a meaningful and informative characterization of its effects on our protocols is to study their performance when we include a constant perturbation in each of the detunings. We thus add a constant offset to each detuning. In the following analysis, we have considered both such constant perturbations and the leakage mechanism outlined in Sec. III. We have used the final target-state population ρ_f,f(T) as a measure of performance. From Fig. 13 it can be seen that both protocols are almost insensitive to the perturbation of the single-photon detuning δ_P, while the sensitivity to the perturbation of the two-photon detuning is larger, as expected, and comparable to that of STIRAP. Interestingly, protocol 2 results in ρ_f,f(T) being strongly asymmetric with respect to the sign of the perturbation to the two-photon detuning δ.

VII. CONCLUSIONS
We have successfully employed a combination of RL-based methods and more traditional optimization techniques to achieve optimal population transfer in a three-level system, whilst operating in an experimentally relevant control regime. Further, we have highlighted that our technique can in principle be implemented as an iterative, closed-loop optimization. Its use will be beneficial in all those situations where the underlying decoherence mechanisms are not fully understood. We have also demonstrated that, even when an RL-based approach gives us sub-optimal solutions, it can still provide a useful tool for building better protocols through simpler numerical optimization techniques. The approach produced two novel protocols which differ remarkably from other control methods such as STIRAP, standard Raman or adiabatic schemes, whilst exhibiting comparable performance and robustness. To this end, it is worth noticing that, due to the specific constraints of the protocol, STIRAP cannot be operated with both couplings always on.
Several works in the last few years have proposed the implementation of multi-level systems, including both Lambda and Ladder configurations, using superconducting artificial nanostructures subjected to suitable driving configurations. These arrangements, though, expose the nanostructure to an increased noise level, which severely affects the performance of population transfer, limiting it to values that are typically in the range of 70%. Our approach will be invaluable for enhancing the performance of such systems above and beyond the possibilities offered by demonstrated techniques for quantum control. In particular, our approach minimizes the need for switchable coupling mechanisms, which is a key advantage when designing robust schemes with low hardware overhead in the noisy intermediate-scale quantum technology framework. By serving both as an alternative control scheme for the specific physical system discussed above and as a proof of concept for the optimization technique itself, our protocols could be exported to other relevant contexts, from quantum simulation to gate engineering.

Appendix A: RL-based optimization with LSTM Neural Network

Traditional RL focuses on solving Markov Decision Processes (MDPs). In an MDP, the state of the environment (and the agent observation) at each time step and the corresponding action taken by the agent uniquely determine the state of the environment at the next time step [40].
If we now consider our physical problem where the agent is trying to learn the optimal Hamiltonian of the system (under the given constraints) and the Lindblad operators are not influenced by the agent actions, the natural choice to define a MDP would be to take the density matrix of the system as the agent observation.
Based on this input, we can use a function approximator (i.e. a neural network) to predict the mean values μ^δ_θ and μ^δP_θ of Gaussian distributions (with standard deviation σ) whose product constitutes the policy function from which we sample the actions of our agent. If a reward R is granted to the agent at the end of the system dynamics, policy gradient REINFORCE [40] can be implemented by training the neural network with a cost function C = (1/2σ²) Σ_{a_i} R |a_i − μ_θ(s_i)|², where μ_θ = (μ^δ_θ, μ^δP_θ) and a_i = (δ(t_i), δ_P(t_i)), with δ(t_i) and δ_P(t_i) here denoting the detunings normalized with respect to their maximum values (defined by their ranges).
Easy improvements of this algorithm can be achieved by working with a batch of agents in parallel (instead of a single agent), training the network with stochastic gradient descent (or more advanced, similar techniques such as Adam [47,48]), and subtracting a baseline from the reward (in our work we subtract the average value of the reward over the batch).
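The cost function with a batch-average baseline can be sketched as follows. For simplicity, this example treats the policy means directly as trainable parameters and uses a single scalar action per time step, whereas in this work the means are produced by the LSTM and each action has two components; the function name and the numbers are illustrative assumptions.

```python
def reinforce_cost_and_grad(means, batch_actions, batch_rewards, sigma):
    """Cost C and its gradient with respect to the policy means.

    means         : policy means mu_i, one per time step
    batch_actions : one sampled action sequence a_i per agent in the batch
    batch_rewards : final reward R of each agent
    sigma         : standard deviation of the Gaussian policy
    """
    # Baseline: average reward over the batch.
    baseline = sum(batch_rewards) / len(batch_rewards)
    cost = 0.0
    grad = [0.0] * len(means)
    for actions, reward in zip(batch_actions, batch_rewards):
        advantage = reward - baseline
        for i, (a, mu) in enumerate(zip(actions, means)):
            # Sampled actions are treated as constants when differentiating.
            cost += advantage * (a - mu) ** 2 / (2.0 * sigma**2)
            grad[i] += -advantage * (a - mu) / sigma**2
    return cost, grad


# Two agents, one time step: the agent that sampled a larger action did better.
means = [0.0]
cost, grad = reinforce_cost_and_grad(
    means, batch_actions=[[0.1], [-0.1]], batch_rewards=[0.9, 0.1], sigma=0.1
)
learning_rate = 0.01
updated_means = [mu - learning_rate * g for mu, g in zip(means, grad)]
```

A gradient-descent step on C moves each mean towards the actions of agents whose reward exceeded the batch average and away from those that fell below it.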
This approach is effective for planning, when one can simulate the system dynamics, but it is extremely limiting as a control technique that works with real experimental data, since it requires full quantum tomography for each step of the MDP. Since measurements on a quantum system perturb the system dynamics, a good control technique would require taking measurements only at the beginning and at the end of the system evolution. Such a control technique, if effective, would be more powerful than a simple application of RL to quantum systems, as it would be useful even when we are not able to simulate the system dynamics (e.g. when the noise mechanism is not fully understood).
Since all the other parameters of the evolution of the system are fixed, the reward that the agent gets at the end of the process is uniquely determined by the agent's actions, as the evolution of the density matrix is deterministic. Hence, in principle, the decision process in which the agent receives information about its previous actions (which now constitute the agent's observation) can be solved by means of RL techniques (and policy gradient in particular). While defining the observation as a list of all these actions is impractical and likely ineffective compared to other optimization techniques, we can still pursue this approach by making use of a Recurrent Neural Network (RNN) as function approximator.
RNNs are neural networks specifically designed for sequential data and especially useful for time sequences. In an RNN, the output associated with each element of the input sequence depends on all the previous inputs and outputs of the network, and hence this implicitly implements the desired feature. In particular, we chose to use a Long Short-Term Memory (LSTM) neural network that takes as external input only the time at which the agent is operating (details of the configuration can be found in Appendix B).
Comparison with standard numerical optimization techniques has been carried out in Ref. [35].There, it has been shown that this approach requires a smaller number of experiments to achieve optimal protocols and shows better performances when one increases the number of control terms and the dimension of the system.

FIG. 2. (a) Detunings as a function of time for the control scheme obtained with the first RL-LSTM optimization. Notice that δ varies within a much smaller range than δ_P (cf. inset). (b) Population transfer achieved with the protocol shown. The target-state population at the end of the protocol is ρ_f,f(T) ≈ 0.9993, while the maximum excited-state population is max_t ρ_e,e(t) ≈ 0.0884.

FIG. 4. (a) Example of optimized protocol obtained using a polynomial basis expansion for the optimization of the detunings. The inset shows the behavior of the respective two-photon detuning in a smaller vertical range. (b) Corresponding population transfer. The target-state population at the end of the protocol is ρ_f,f(T) ≈ 0.9980, while the maximum excited-state population is max_t ρ_e,e(t) ≈ 0.0250.

FIG. 8. Maximum target-state population reached during the transfer performed using a protocol with δ = 0 and a constant value of δ_P for the system in Fig. 2. The dashed horizontal line indicates the population of the target state achieved using protocol 1. In panel (a) [panel (b)] we have used Ω_0 T = 40 [Ω_0 T = 20].

FIG. 9. Target-state population and maximum excited-state population (insets) achieved with protocol 1 [panel (a)] and 2 [panel (b)] as a function of Ω_0 T. Both protocols are optimized for Ω_0 T = 40. For Raman-like protocols, one would expect oscillations with a period Ω_0 T ∼ 40, which our results do not exhibit, thus marking the difference between these approaches.

FIG. 10. Schematics of three-level systems affected by spontaneous decay. We consider: (a) a Lambda system with intermediate-level decay only; (b) a Ladder scenario where both the intermediate state and the target state are subjected to decay.

FIG. 11. (a) Sensitivity of the performance for protocol 1 in the presence of the single-level decay mechanism. (b) Same analysis performed against the two decaying levels of the Ladder scheme. Note how the ranges used for the two decay rates vastly differ; this is a consequence of the protocols being, by construction, considerably more robust to decay from the intermediate excited state. In panel (c) [panel (d)] we show the sensitivity of the performance of protocol 2 under the effects of the single-channel [double-channel] decay mechanism.
Appendix A: RL-based optimization with LSTM Neural Network