Reinforcement learning based robust control algorithms for coherent pulse stacking: supplement

For fast and robust control of the delay lines in coherent pulse stacking, we combined stochastic parallel gradient descent with momentum (SPGDM) and the soft actor-critic (SAC) into a powerful algorithm, SAC-SPGDM. The simulation shows that the algorithm can find the optimal delay-line positions that ensure the 128 pulses of a 7-stage pulse stacker are coherently stacked within 25 steps.


Training Soft Actor-Critic
Conventionally, in a DL-CPS (delay-line coherent pulse stacking) experiment, the initial alignment has to be performed by experienced operators, and it becomes increasingly difficult as the number of stages grows. SAC (Soft Actor-Critic) is well suited to this kind of non-convex optimal control problem.
Soft Actor-Critic (SAC) is a maximum-entropy deep reinforcement learning algorithm with a stochastic actor [1]. In SAC, the agent receives a bonus reward at each time step proportional to the entropy of the policy at that time step. The action-value function $Q^{\pi}$ is then changed to include the entropy bonuses from every time step:

$$Q^{\pi}(s,a) = \mathop{\mathbb{E}}_{\tau \sim \pi}\left[\, \sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t, s_{t+1}) + \alpha \sum_{t=1}^{\infty} \gamma^{t} H\!\left(\pi(\cdot \mid s_t)\right) \,\middle|\, s_0 = s,\ a_0 = a \right],$$

where $\gamma$ is the discount factor, $\alpha$ is the entropy trade-off coefficient, and $H$ denotes the policy entropy.

The overall SAC controller consists of one actor neural network $f_{\theta}$ (parameters $\theta$) and four critic neural networks: local-critic-1 $Q_{\phi_1}$, local-critic-2 $Q_{\phi_2}$, target-critic-1 $Q_{\phi_{\mathrm{targ},1}}$, and target-critic-2 $Q_{\phi_{\mathrm{targ},2}}$. The policy $\pi_{\theta}$ is a squashed Gaussian policy: the actor network $f_{\theta}$ outputs the mean $\mu_{\theta}$ and standard deviation $\sigma_{\theta}$ of the policy $\pi_{\theta}$, and the action samples $a_{\theta}$ are obtained by adding Gaussian noise $\xi$ and applying the tanh (hyperbolic tangent) activation function:

$$a_{\theta}(s, \xi) = \tanh\!\left(\mu_{\theta}(s) + \sigma_{\theta}(s) \odot \xi\right), \qquad \xi \sim \mathcal{N}(0, I).$$

Algorithm S1 Soft Actor-Critic (SAC): Training
Input: initial actor $f_{\theta}$ with parameters $\theta$, local-critic-1 $Q_{\phi_1}$ with parameters $\phi_1$, local-critic-2 $Q_{\phi_2}$ with parameters $\phi_2$, empty replay buffer $D$, EMA smoothing factor $\rho$, discount factor $\gamma$, trade-off coefficient $\alpha$, actor learning rate $\lambda_A$, critic learning rate $\lambda_C$
Output: optimized policy $\pi_{\hat{\theta}}$
1. Set target parameters equal to local parameters: $\phi_{\mathrm{targ},1} \leftarrow \phi_1$, $\phi_{\mathrm{targ},2} \leftarrow \phi_2$
2. Repeat until convergence:
3. Observe state $s$, sample and execute an action $a \sim \pi_{\theta}(\cdot \mid s)$, observe the reward $r$ and next state $s'$, and store $(s, a, r, s')$ in $D$
4. Sample a mini-batch of transitions $B = \{(s, a, r, s')\}$ from $D$
5. Compute the targets $y = r + \gamma\left(\min_{i=1,2} Q_{\phi_{\mathrm{targ},i}}(s', \tilde{a}') - \alpha \log \pi_{\theta}(\tilde{a}' \mid s')\right)$, with $\tilde{a}' \sim \pi_{\theta}(\cdot \mid s')$
6. Update each local critic by one gradient step on $\left(Q_{\phi_i}(s,a) - y\right)^2$ with learning rate $\lambda_C$
7. Update the actor by one gradient step on $\alpha \log \pi_{\theta}(\tilde{a} \mid s) - \min_{i=1,2} Q_{\phi_i}(s, \tilde{a})$, with $\tilde{a} \sim \pi_{\theta}(\cdot \mid s)$ and learning rate $\lambda_A$
8. Update the target critics: $\phi_{\mathrm{targ},i} \leftarrow \rho\, \phi_{\mathrm{targ},i} + (1-\rho)\, \phi_i$

The training algorithm of SAC is shown in Algorithm S1 (steps 2-8 follow the standard SAC update of [1]). There are many tricks for training SAC; one can refer to the original paper [1] and our implementation code [2] for details. For efficient training, we first run the interaction between the environment and the SAC agent many times without updating the parameters, so as to collect data into the replay buffer; once sufficient data have been collected, the SAC update process starts. In our simulation, the actor and critic networks have similar structures: each consists of 3 fully connected layers of 400 neurons with Rectified Linear Unit (ReLU) activation functions, $\mathrm{ReLU}(x) = \max(0, x)$.
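As a concrete illustration of the squashed Gaussian actor described above (3 fully connected layers of 400 ReLU neurons, with heads for $\mu_{\theta}$ and $\sigma_{\theta}$), a minimal PyTorch sketch follows; the class and variable names are our own illustrative choices, not those of the released code [2].

```python
import torch
import torch.nn as nn

class SquashedGaussianActor(nn.Module):
    """Minimal sketch of the SAC actor: 3 fully connected layers of 400 ReLU
    neurons, producing the mean and standard deviation of a tanh-squashed
    Gaussian policy."""

    def __init__(self, obs_dim, act_dim, hidden=400):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu_head = nn.Linear(hidden, act_dim)         # mean mu_theta
        self.log_sigma_head = nn.Linear(hidden, act_dim)  # log of sigma_theta

    def forward(self, obs):
        h = self.body(obs)
        mu = self.mu_head(h)
        # Clamping log-sigma is a common numerical guard, not a detail from [2].
        sigma = torch.clamp(self.log_sigma_head(h), -20, 2).exp()
        xi = torch.randn_like(mu)           # Gaussian noise xi ~ N(0, I)
        return torch.tanh(mu + sigma * xi)  # squashed Gaussian action sample
```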

Simulation experiment
The proposed algorithms are evaluated on the simulated 7-stage DL-CPS environment. Before evaluating the RL-based algorithms, we trained the RL agent on the simulated environment. The training procedure for an RL agent is divided into episodes, each lasting T = 200 steps. As soon as one episode is completed, the system is randomly reset to a new initial condition.
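The episode structure above corresponds to the standard Gym-style interaction loop sketched below (the Gym wrapping of the environment is described in the next paragraph); `env`, `agent`, and their method names are illustrative placeholders rather than our released implementation.

```python
# Sketch of the episode loop, assuming a Gym-wrapped DL-CPS simulator "env"
# and a SAC agent "agent"; all names and the episode count are placeholders.
num_episodes = 500   # illustrative value
T = 200              # steps per episode

for episode in range(num_episodes):
    obs = env.reset()                          # random new initial condition
    for t in range(T):
        action = agent.act(obs)                # sample from the squashed Gaussian policy
        next_obs, reward, done, info = env.step(action)
        agent.store(obs, action, reward, next_obs, done)  # fill the replay buffer
        agent.update()                         # no-op until enough data is collected
        obs = next_obs
```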
The simulated environment is based on the Nonlinear-Optics-Modeling package [3] and wrapped with the Gym package [4] for convenient interaction with RL agents. The proposed algorithms were implemented in Python with PyTorch [5] on a Windows PC. Figure S1 shows the pulse trains controlled by SAC-SPGDM from a random initial state in which the delay lines are not well matched; an animation of this process, together with a comparison against the other controllers, is provided in Visualization 1. The combined algorithm quickly finds the matched delay lines and achieves the maximum peak power within 25 steps. Figure S2 shows the pulse trains controlled by SAC-SPGDM from an initial state in which the delay lines have slightly drifted away from the matched lengths; the corresponding animation and comparison are provided in Visualization 2. Here the combined algorithm finds the matched delay-line lengths within 15 steps. Although SPGDM alone can also pull the system back, SAC-SPGDM is still the fastest.

Effectiveness of momentum in SPGDM
The algorithm SAC-SPGDM contains two parts: SAC and SPGDM. In this section, we investigate the effectiveness of SPGDM on the 7-stage DL-CPS environment, starting from a "good" initial state in which the delay lines have slightly drifted away from the matched lengths.
The performance of the SPGDM algorithm for different momentum factors β is shown in Fig. S3. SPGDM (β > 0) takes 17 control steps to reach 90% of the full output power, while plain SPGD (β = 0) takes 28 control steps. This indicates that introducing momentum reduces the number of control steps by ~40% without adding computational complexity.
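To make the role of the momentum term explicit, the following is a minimal NumPy sketch of one SPGD-with-momentum update under a common two-sided perturbation convention; the gain, perturbation amplitude, and default β are illustrative assumptions, and J stands for the measured metric (e.g., the stacked-pulse power).

```python
import numpy as np

def spgdm_step(u, v, J, beta=0.9, gain=0.5, amp=0.01):
    """One SPGD-with-momentum update of the delay-line control vector u.

    u    : current delay-line positions
    v    : momentum (velocity) vector, initially zeros
    J    : callable returning the metric (e.g. stacked-pulse power) at u
    beta : momentum factor; beta = 0 recovers plain SPGD
    """
    delta = amp * np.random.choice([-1.0, 1.0], size=u.shape)  # parallel random perturbation
    dJ = J(u + delta) - J(u - delta)      # two-sided metric difference
    v = beta * v + gain * dJ * delta      # momentum-accumulated gradient estimate
    return u + v, v                       # ascend the metric
```

With β = 0 the update degenerates to the plain SPGD step, so the ~40% reduction in control steps reported above costs only a single extra vector update, consistent with the claim of no additional computational complexity.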

Effectiveness of SAC monitoring the pulse train
In this section, we investigate the effectiveness of our SAC algorithm on the 7-stage DL-CPS environment from two perspectives: 1) the effectiveness of monitoring the combined pulse train as the input (observation) signal rather than the SHG power (as previously used in [6]); 2) the effectiveness of the SAC algorithm compared with DDPG (Deep Deterministic Policy Gradient, the RL method previously used in [6]). The experimental parameter settings are listed in Table S1. The cumulative reward during training is shown in Fig. S4(a), and the SHG power of the combined pulses under evaluation control is shown in Fig. S4(b). Figure S4(a) shows that SAC with burst-pulse-train monitoring (purple, converged in ~110 episodes) converged faster than SAC with pulse-power monitoring alone (orange, ~170 episodes); monitoring the pulse train (using it as the input observation) instead of the pulse power thus reduces the number of training episodes by ~35%. Figure S4(a) also shows that SAC (orange, converged in ~170 episodes) converged ~40% faster than DDPG (green, ~300 episodes) under the same observation signal. More importantly, the final reward of SAC (r = 0) is much higher than that of DDPG (r = −40). Similarly, in the evaluation control of Fig. S4(b), the final power achieved by SAC (~100% of the maximum) is higher than that achieved by DDPG (~70%).
Furthermore, among all the algorithms, SAC-SPGDM (red) achieves the fastest convergence and the highest final power. Training of SAC-SPGDM converged within 50 episodes, 55% faster than SAC. The simulation demonstrated that SAC-SPGDM can find the matched delay-line lengths that bring the stacked pulse to 90% of the maximum power within 17 steps, 26% faster than SAC (23 steps).
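For completeness, the sketch below shows one plausible way to chain the two controllers, with SAC driving the delay lines from an arbitrary initial state and SPGDM refining near the optimum; the switching threshold and all environment accessors are hypothetical illustrations, not necessarily the exact hand-off used in our implementation [2].

```python
import numpy as np

# Hypothetical SAC -> SPGDM hand-off; every name below (env accessors, agent,
# spgdm_step, the 0.8 threshold) is an illustrative assumption.
obs = env.reset()
u = env.get_delay_lines()                  # assumed accessor: current positions
v = np.zeros_like(u)

for t in range(25):                        # budget matching the ~25-step result
    if env.normalized_power() < 0.8:       # coarse phase: SAC proposes large moves
        obs, _, _, _ = env.step(agent.act(obs))
    else:                                  # fine phase: SPGDM locks onto the peak
        u, v = spgdm_step(u, v, env.evaluate_power)
        env.set_delay_lines(u)
```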