Adaptive Optics control using Model-Based Reinforcement Learning

Reinforcement Learning (RL) presents a new approach for controlling Adaptive Optics (AO) systems for Astronomy. It promises to effectively cope with some aspects often hampering AO performance such as temporal delay or calibration errors. We formulate the AO control loop as a model-based RL problem (MBRL) and apply it in numerical simulations to a simple Shack-Hartmann Sensor (SHS) based AO system with 24 resolution elements across the aperture. The simulations show that MBRL controlled AO predicts the temporal evolution of turbulence and adjusts to mis-registration between deformable mirror and SHS which is a typical calibration issue in AO. The method learns continuously on timescales of some seconds and is therefore capable of automatically adjusting to changing conditions.


Introduction
Atmospheric turbulence distorts astronomical imagery obtained with ground-based telescopes. Adaptive optics (AO) [1][2][3] is a technique that aims at minimizing the distortions caused by the turbulence. In AO, a wavefront emitted by an astronomical object, such as a star, and distorted by the atmosphere is directed to one or more deformable mirrors (DM) before it propagates to the scientific camera. The distortions are measured with a wavefront sensor (WFS), and optimal image quality is obtained by setting the DM to a shape that partially cancels the distortions after reflection. In this work, we consider the classical single-conjugate AO (SCAO) system, which requires a bright star that is close to the object of interest. This reference star is used to calculate the distortion caused by the atmosphere along the propagation path. Since the atmosphere is continuously evolving, the mirror's shape has to be controlled in real time, often from 300 to more than 1000 times a second.
Most AO systems run in a closed-loop configuration, where the WFS measures the wavefront distortions after DM correction; see Figure 1. The goal of such a control loop is to minimize the distortions in the measured wavefront, i.e., the residual wavefront. For high contrast imaging (HCI), the wavefront error budget (within the AO-controlled region) is often dominated by the temporal delay error [4]. In addition, real systems often suffer from a dynamic misalignment between DM and WFS called mis-registration [5]. Reinforcement learning (RL) provides an automated approach for control, which promises to cope with these limitations of current AO systems. Unlike classical control methods, RL methods aim to learn a successful closed-loop control strategy by interacting with the system. Hence, they do not require accurate models of the components in the control loop, and they adapt to a changing environment.
In recent years, the merger of RL and deep neural networks (NN), called deep RL, has become increasingly popular due to its effectiveness in problems with large state and action spaces. This type of RL has been used, for example, to play video and board games on a superhuman level [6,7] and for vision-based real-world robot control [8,9]. Much of the success can be specifically attributed to model-based RL (MBRL), where a model of the environment is learned using data obtained by interaction, and a planning algorithm is used in conjunction to decide the next action. Inspired by these successes, we attempt to generalize the adaptive optics problem to the general framework of reinforcement learning and apply existing algorithms in solving it.

Fig. 1. Overview of the task and method. The distorted wavefront is propagated to the deformable mirror (DM), which is controlled by our control algorithm. The algorithm reads the wavefront sensor (WFS) input, simulates how it will evolve using the learned dynamics model, and plans the next DM commands with a process called model predictive control (MPC).
Our starting point is to formulate the closed-loop AO system as a Markov decision process (MDP), the prevailing mathematical framework for reinforcement learning [10]. We describe the state of the AO system as a finite time series of past control voltages and WFS measurements and assume that such a state exhibits Markovian statistics to a good approximation, i.e. each state depends only on the previous state, where a state can also include data from several timesteps from the past. The key to successful prediction lies in finding a reliable model for the system dynamics. Here, we parameterize the dynamics model describing the conditional distribution of the next state given the current state and action using standard NN architectures. This parameterization is fitted to closed loop data in a process called training. Using this framework, we adapt a standard state-of-the-art MBRL algorithm, Probabilistic Ensemble Trajectory Sampling (PETS) [11], to train the model and optimize for the next action, i.e. the set of control voltages.
The paper's structure is as follows: In Section 2, we state the novelty of our method and position our work with respect to existing literature. In Section 3, we give a short description of an AO control loop and the baseline method. Section 4 describes the MDP formulation of the AO control loop, setting a platform for RL. Further, we describe the algorithm used and how we adapt it to AO. For finer details and a more in-depth justification of the method, we strongly encourage the reader to consult the original paper on the algorithm [11]. In Section 5, we demonstrate the performance of our method through simulation of a small and simple SCAO system controlled either by RL or by the baseline integrator controller. The algorithm and MDP formulation presented are only one way to solve the control loop with RL, but they already hint at the great potential of MBRL control for AO. Finding the optimal formulation and bringing the computational time of the controller to the required level are left for future research. Finally, Section 6 discusses the topic, especially how MBRL could be implemented in a real system and how to overcome the significant hurdle of inference time and computational jitter.

Related work
In order to mitigate the measurement noise and temporal error, predictive controller methods have been proposed for ground-based adaptive optics. These methods include the Kalman filter-based linear quadratic Gaussian control (LQG) [12,13] and its variants [14][15][16][17][18] and predictive filters operating on separate modal coefficients such as Zernike polynomials or Fourier modes [19][20][21], which provide up to a factor of 1000 gain in raw point spread function contrast in an idealized simulation environment for an extremely large telescope at very small angular separations and using a very bright AO guide star [20]. The contrast performance gets shallower for larger angular separation, smaller telescopes or fainter stars. More recently, data-based predictive methods have emerged in AO literature. Examples include linear predictive filter methods such as empirical orthogonal functions [22], the low-order linear minimum mean square error predictor [23][24][25], as well as NN-based methods [26][27][28][29].
Many of the existing machine learning-based predictive control methods [26][27][28] have not been studied in a closed-loop configuration, but in principle, they can be integrated into closed-loop systems by utilizing a so-called pseudo-open loop telemetry [22,30]. These procedures consist roughly of two steps: collecting open-loop wavefront estimates from pseudo-open loop telemetry and learning a predictive filter as a supervised learning task. This procedure assumes accurate knowledge of system time lags and DM response, as well as close to linear behavior of the WFS. As a consequence, the predictive filter will inherit the errors in system calibration. These methods, therefore, learn the temporal evolution of the turbulence from the data but rely on modelling of the system components and interactions between them, which leads to the need for external tuning and re-calibration of the predictive controller to ensure robustness. Moreover, some AO systems operate at framerates which are high enough that usually neglected system dynamics (e.g. the finite response time of the DM) become important. Consequently, a control algorithm suffers from the simplifying assumption of a temporal step-wise response.
In contrast to these methods, we present a technique that learns predictive and noise-robust control straight from the system feedback without the set of prior assumptions mentioned earlier and eliminating the need for accurate calibration or modeling assumptions. Our RL formulation uses a generic Neural Network (NN) architecture to build the dynamics model. NNs have been applied to various aspects of AO before. The topics vary from open-loop systems to the extraction of Zernike coefficients directly from the images and to non-linear wavefront reconstruction; see [31][32][33][34][35].
Also RL-based concepts have already been applied to AO. Self-adaptive control has been studied in [36], where a deep learning control model is proposed to mitigate alignment errors in the calibration. Model-free RL methods for wavefront sensorless AO have been studied in [37,38], where the method is compared against stochastic parallel gradient descent providing improved correction speed. Finally, model-free RL for ground-based AO was implemented to control tip and tilt only [39]. The model-free RL method they used learns a policy NN that directly outputs the two values for the tip and tilt mirror given the observations. Such methods often require a large number of interactions with the environment, which increase exponentially with the degrees of freedom to be controlled if no additional measures are taken. In contrast, we control each actuator of a high-order DM via model-based RL, formulate ground-based astronomical AO as a general MBRL task, and discuss its potential benefits. We show that state-of-the-art model-based RL learns a self-calibrating noise-robust predictive control law using only a few seconds of past telemetry data.

Adaptive optics and the classical integrator
We first present the adaptive optics task, along with useful notation, and then frame it in the reinforcement learning setting. An overview of the AO control loop is given in Figure 1. The incoming light at timestep $t$ gets corrected by the DM. After this correction, the WFS measures the residual wavefront $\phi^{\mathrm{res}}_t$. Commonly, a linear relationship between the WFS observation and the residual wavefront is assumed, i.e.,

$$\Delta w_t = D\,\phi^{\mathrm{res}}_t + \xi_t, \qquad (1)$$

where $\Delta w_t$ is the WFS data, $D$ is the so-called interaction matrix modelling the WFS measurement, and $\xi_t$ is the measurement noise, typically composed of photon and detector noise. Depending on the type of WFS, a component of the residual wavefront can represent, e.g., a wavefront modal coefficient, a wavefront slope or the wavefront phase itself. Classical control algorithms are often modelled by a linear mapping of the WFS measurements $\Delta w_t$ to the residual DM control voltages $\Delta v_t$, i.e.,

$$\Delta v_t = C\,\Delta w_t, \qquad (2)$$

where $C$ is the so-called reconstruction matrix. To obtain the reconstruction matrix, we decompose the DM on a Karhunen-Loeve (K-L) modal basis. Each mode of the K-L basis has a representation in terms of actuator voltages. This relation is fully determined by a linear map $B$ from voltages to modes. The matrix $B$ is computed by a double diagonalization process, which takes into account the geometrical and statistical properties of the telescope [40]. In the following, we utilize a reconstruction matrix defined by the Moore-Penrose pseudo-inverse

$$C = (D B)^{+}. \qquad (3)$$

We truncate the number of K-L modes in $B$ to have a stable inversion and a reasonably low noise amplification by $C$.
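The truncated pseudo-inverse can be illustrated with a minimal NumPy sketch. The matrix shapes, the convention that `D` acts on modal coefficients while `B` maps voltages to modes, the cutoff `rcond`, and the function name `reconstruction_matrix` are all assumptions made for this example; they are not taken from the simulation code.

```python
import numpy as np

def reconstruction_matrix(D, B, n_modes, rcond=1e-10):
    """Illustrative truncated pseudo-inverse reconstructor.

    D: (n_slopes, n_modes_total) hypothetical interaction matrix (modes -> WFS data).
    B: (n_modes_total, n_act) hypothetical map from actuator voltages to K-L modes.
    n_modes: number of leading K-L modes that are kept (the rest are discarded).
    rcond: relative cutoff for small singular values (assumed value).
    """
    B_trunc = B[:n_modes, :]   # keep only the leading modes
    D_trunc = D[:, :n_modes]   # matching columns of the interaction matrix
    # Pseudo-inverse of the voltage -> WFS map restricted to the kept modes.
    return np.linalg.pinv(D_trunc @ B_trunc, rcond=rcond)  # shape (n_act, n_slopes)
```

Keeping fewer modes lowers the rank of the inverted matrix, which is what limits noise amplification; the price is that the discarded modes are no longer reconstructed.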
Let us now consider a simple non-predictive control algorithm known as the integrator law. At a given timestep $t$, the WFS measures the residual wavefront. The new control voltages $\tilde{v}_t$ are

$$\tilde{v}_t = \tilde{v}_{t-1} + g\,C\,\Delta w_t, \qquad (4)$$

where $g$ is the integrator gain. In order to stabilize the loop, the value of $g$ is often fixed below a value of about 0.5 for a two-step delay system [41]. Large values of $g$ increase the correction bandwidth, i.e., the loop reacts faster. On the other hand, a large gain reduces the control loop's stability margin and amplifies noise propagation. The challenge in classical integrator control is in balancing these two effects to minimize the average error of the method [40]. In the following, we denote the vector concatenating the past control voltages and the vector concatenating the past residual voltages, constructed from the WFS slopes, by

$$\tilde{V}_t = (\tilde{v}_{t-1}, \tilde{v}_{t-2}, \dots, \tilde{v}_{t-k}), \qquad (5)$$

$$\Delta V_t = (\Delta v_t, \Delta v_{t-1}, \dots, \Delta v_{t-m+1}), \quad \Delta v_t = C\,\Delta w_t. \qquad (6)$$

The latter quantity merely represents WFS measurements in the voltage space projected on the K-L modal basis defined by $B$. It does not represent voltages applied to the DM. On the millisecond time scale of AO operations, a large part of the turbulence is presumably in frozen flow, and the turbulence evolution is predictable to some extent [42]. Control methods that use past telemetry data have shown great potential both in turbulence prediction and noise reduction [22]. In a closed-loop set-up, these methods would, for example, utilize past control and residual voltages in equations (5) and (6), respectively, to construct a pseudo-open loop data stream used for the prediction. This paper aims to obtain a controller with similar properties but without the need for accurate knowledge of the time delay, accurate calibration, or a linear response of the WFS to wavefront errors.
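The trade-off between gain and stability can be seen in a toy scalar loop. The sketch below is only an illustration under simplifying assumptions: a single actuator, a noise-free measurement equal to the disturbance minus the command issued `delay` frames earlier, and a hypothetical function name `run_integrator`. It is not the simulated AO system.

```python
import numpy as np

def run_integrator(disturbance, gain=0.4, delay=2):
    """Integrator law on a toy scalar plant with a pure delay (illustrative)."""
    disturbance = np.asarray(disturbance, dtype=float)
    n = len(disturbance)
    commands = np.zeros(n)
    residuals = np.zeros(n)
    for t in range(n):
        # The correction seen by the sensor is the command issued `delay` frames ago.
        applied = commands[t - delay] if t >= delay else 0.0
        residuals[t] = disturbance[t] - applied
        # Integrator update: previous command plus gain times measured residual.
        previous = commands[t - 1] if t >= 1 else 0.0
        commands[t] = previous + gain * residuals[t]
    return commands, residuals
```

For a constant disturbance, a moderate gain drives the residual towards zero, whereas a gain that is too large makes this toy loop oscillate with growing amplitude.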

Markov decision process and the dynamics model
We model the closed-loop adaptive optics control problem as an MDP. An MDP consists of a set of states $S$, a set of actions $A(s_t)$ at the given state $s_t$, a set of transition probabilities $p(s_{t+1} \mid s_t, a_t)$ and a reward function $r(s_t, a_t)$.
In AO, the set of actions consists of different combinations of control voltages, and the state consists of the prevailing atmospheric turbulence and the shape of the mirror during the measurement. In practice, we do not have access to the full state of the AO system, i.e., full turbulence, wind speeds and DM shape. We only partially observe the state through a noisy WFS measurement. Consequently, past observations and actions are still valid information for the prediction of the next observation. To account for partial observation and to ensure the Markovian property of the state formulation, we define the state as a sequence of previous voltages $\tilde{v}$ and residual voltages $\Delta v = C\,\Delta w$ derived from WFS measurements:

$$s_t = (\tilde{v}_{t-1}, \dots, \tilde{v}_{t-k}, \Delta v_t, \dots, \Delta v_{t-m+1}), \qquad (7)$$

where we typically choose $k = m$. The state includes data from the previous $m$ (or $k$) time steps and the reconstruction matrix $C$. We stress that the residual voltages are not applied to the DM. They are merely a quantity closely related to the residual wavefront through eq. 6, and which the MBRL control approach (see Section 4.3) will try to minimize. The matrix $C$ must only be chosen such that the residual voltages are well observable by the WFS. It does not have to match the actual registration of DM and WFS precisely and could be given by either a previous calibration or derived from a system model. Moreover, previous studies have shown that the neural network-based wavefront reconstructor benefits from involving a linear control matrix with a non-linear WFS [35], and we observe below that MBRL is robust to errors or perturbations in the reconstruction matrix; see Section 5.4. The action of the MDP is simply a vector of the changes to the control voltages,

$$a_t = \tilde{v}_t - \tilde{v}_{t-1}. \qquad (8)$$

Let us now represent the true transition probability $p(s_{t+1} \mid s_t, a_t)$, i.e., the conditional distribution of the next state (including the next WFS residual) given the current state and action, as a parameterized distribution family $\hat{p}_\theta(s_{t+1} \mid s_t, a_t)$.
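One way to maintain such a state in software is a pair of fixed-length history buffers. The following sketch is an assumption-laden illustration: the class name `StateBuffer`, the newest-first ordering, the zero initialization and the flat concatenated layout are choices made here and are not prescribed by the method.

```python
from collections import deque
import numpy as np

class StateBuffer:
    """Keeps the last k control voltages and last m residual voltages (illustrative)."""

    def __init__(self, n_act, k=4, m=4):
        # Histories start filled with zeros; the newest entry sits at index 0.
        self.voltages = deque([np.zeros(n_act) for _ in range(k)], maxlen=k)
        self.residuals = deque([np.zeros(n_act) for _ in range(m)], maxlen=m)

    def push(self, voltage, wfs, C):
        """Store the latest control voltages and the residual voltages C @ wfs."""
        self.voltages.appendleft(np.asarray(voltage, dtype=float))
        self.residuals.appendleft(C @ np.asarray(wfs, dtype=float))

    def state(self):
        """Flat state vector: past voltages first, then past residual voltages."""
        return np.concatenate(list(self.voltages) + list(self.residuals))
```

Because both deques have a maximum length, pushing a new entry silently drops the oldest one, so the state always has a fixed size of `(k + m) * n_act`.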
The aim of MBRL is to find the optimal approximative model $\hat{p}_\theta$ given a data set from the real environment. We solve this problem by fitting NNs using straightforward supervised learning, detailed in Section 4.3. In our case, the parameters $\theta$ represent the weights of the neural networks. The transition probability approximations $\hat{p}_\theta$ represent our probabilistic dynamics and are hence called the dynamics model. It provides an estimation of the next state (of which only the next WFS measurements are new) from the current state and the control voltages. The dynamics model involves information about the interaction of voltages with WFS measurement as well as the system's temporal evolution, including the turbulent wavefront.
In adaptive optics, we aim to minimize the residual wavefront $\phi^{\mathrm{res}}$ over the whole time interval. The most natural reward for an AO system would be the Strehl ratio or, for a high contrast imaging (HCI) instrument, the contrast obtained. Since we are considering a control system with just one WFS, we can only choose a reward function observable on that specific sensor. We choose a reward for a state-action pair as the residual voltages' negative squared norm corresponding to the next measurement:

$$r(s_t, a_t) = -\lVert \Delta v_{t+1} \rVert^2. \qquad (9)$$

This quantity is proportional to the observable part of the negative norm of the true residual wavefront. The WFS measurement is blind to some modes, e.g., the waffle mode for a Shack-Hartmann Sensor (SHS). We ensure that we do not control these modes by projecting each action, i.e., set of control voltages, to the control space. That is,

$$a_t \leftarrow B^{+} B\, a_t, \qquad (10)$$

where $B^{+}B$ projects the control voltages onto the control space defined by the K-L modes.
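The reward and the control-space filtering can be written in a few lines. This is a sketch under stated assumptions: `B` is taken to be a matrix mapping voltages to the retained K-L modes, the projector is formed from `B` and its pseudo-inverse, and the function names are hypothetical.

```python
import numpy as np

def reward(next_residual_voltages):
    """Negative squared norm of the residual voltages of the next measurement."""
    r = np.asarray(next_residual_voltages, dtype=float)
    return -float(np.sum(r ** 2))

def project_action(action, B):
    """Project voltages onto the space spanned by the retained modes (illustrative).

    B: (n_modes, n_act) hypothetical map from voltages to K-L modes.
    """
    # Voltages -> modal coefficients -> back to voltages; unseen components vanish.
    return np.linalg.pinv(B) @ (B @ np.asarray(action, dtype=float))
```

Projecting twice gives the same result as projecting once, so the filter can be applied to every candidate action without accumulating changes.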

Model-based reinforcement learning
Now that we have defined the MDP components and the dynamics model, we can outline our MBRL approach. First, we initialize an empty data set, and we initialize the dynamics model parameters (the weights of the NN) randomly from a zero-mean Gaussian distribution. Then, we collect our first data set by running the AO loop for a particular time interval (an episode) with random actions (DM control voltages) sampled from a zero-mean Gaussian distribution as well.
After the first episode we have the first data set and use it to train the dynamics model. The training is described in more detail in Section 4.3.
We now have a first reasonable guess for the dynamics model and start to use it during the second episode to find the action that maximizes the expected future reward (minimizes the residual voltages) for a given state. This optimization task is called planning and replaces the regulator/controller in classical AO. We detail the methods used for this in Section 4.3.2.
After the second and subsequent episodes, the previous data set is concatenated with the new data and the dynamics model is trained again and updated. When the data set gets sufficiently long, old data is removed to ensure that the NNs are trained on sufficiently fresh data only. The dynamics model is entirely learned from data obtained while running the loop, i.e., during the experiment; no simulation or modeling steps are involved.
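The episode structure described above can be summarized as a short loop. The sketch below is schematic: `run_episode` and `train_model` are hypothetical callables standing in for the AO loop and the NN training, and `max_transitions` is an assumed name for the limit beyond which old data is discarded.

```python
def mbrl_loop(run_episode, train_model, n_episodes, max_transitions):
    """Alternate between data collection and model training (illustrative).

    run_episode(model): returns a list of transitions; with model=None it is
        expected to act randomly (first episode), otherwise to plan with the model.
    train_model(dataset): returns a newly fitted dynamics model.
    """
    dataset = []
    model = None
    for _ in range(n_episodes):
        new_data = run_episode(model)
        # Concatenate with earlier data, then keep only the most recent transitions.
        dataset = (dataset + list(new_data))[-max_transitions:]
        model = train_model(dataset)
    return model, dataset
```

With an episode length of 400 frames, for example, a limit of 1200 transitions would mean that the model is always trained on the last three episodes only.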

The PETS algorithm
We implement the MBRL control for AO approach described above following the PETS algorithm [11]. We use OOMAO [43] to simulate the AO system plant (turbulence, telescope, DM, WFS), and the Probabilistic Ensemble Trajectory Sampling (PETS) algorithm replaces the classical reconstruction, control and calibration. The algorithm combines a probabilistic ensemble (PE) neural network dynamics model and model predictive control (MPC) [44] that is based on trajectory sampling (TS). We combine the TS with the cross-entropy method (CEM) as described in Section 4.3.2.

The dynamics model
Our choice of the dynamics model, an ensemble of probabilistic NNs, can model two types of uncertainty. Firstly, it models the uncertainty associated with the predictions, e.g., the stochastic behavior of the turbulence and measurement noise, by outputting a variance estimate in addition to a mean prediction. Secondly, it models the uncertainty associated with the model's parameters by learning an ensemble of bootstrap models. Each model has its unique data set to be trained upon that is bootstrap sampled (a statistics term meaning sampling with replacement) from the whole data set recorded so far [45,46].
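Bootstrap sampling for the ensemble amounts to drawing, for each member, as many indices as there are recorded transitions, with replacement. The helper below is a minimal sketch; its name and the use of a seeded NumPy generator are assumptions for illustration.

```python
import numpy as np

def bootstrap_indices(n_samples, n_models, seed=0):
    """One row of indices per ensemble member, sampled with replacement."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, n_samples, size=(n_models, n_samples))
```

Each row selects the training set of one network; because of the replacement, some transitions appear several times in a row and others not at all, which is what makes the ensemble members differ.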
In preparation for the experiment, we verified that using an ensemble of NNs leads to better correction performance than a single NN. We also ran tests and confirmed that estimating the next state's variance improves the performance compared to a fixed variance. Both measures combined stabilize training by a fair amount and eventually reach a higher reward, i.e., a better correction performance.
Each neural network in the ensemble defines a parameterized distribution family $\hat{p}_\theta(s_{t+1} \mid s_t, a_t)$ satisfying

$$\hat{p}_\theta(s_{t+1} \mid s_t, a_t) = \mathcal{N}\big(\mu_\theta(s_t, a_t),\, \sigma^2_\theta(s_t, a_t)\big), \qquad (11)$$

where the mean $\mu_\theta(s_t, a_t)$ and the variance $\sigma^2_\theta(s_t, a_t)$ of the Gaussian are outputs of a neural network. We train the dynamics model ensemble by maximizing the log-likelihood of a Gaussian for which the parameters are outputs of the neural network model. More specifically, given a dataset of transitions $D = \{(s_n, a_n), s_{n+1}\}_{n=1}^{N}$, we maximize the following objective function:

$$\hat{\theta} = \arg\max_{\theta} \sum_{n=1}^{N} \log \hat{p}_\theta(s_{n+1} \mid s_n, a_n), \qquad (12)$$

where $\hat{p}_\theta$ is given by equation (11). Each network that is part of the ensemble is trained similarly, but with a different bootstrap-sampled data set from $D$. Each network is modelled as a convolutional neural network with 2 hidden layers of 8 feature maps each. Both layers are activated by a leaky rectified linear unit (LReLU) [47]. We use the concatenated vector $[s_t, a_t]$ as an input and output the mean and log-scale variance of a normal distribution: the distribution of the next state. The maximization in equation (12) is done using an extension of stochastic gradient descent called the Adam algorithm [48]. The neural network hyperparameters (e.g., number of layers, convolutional feature maps, activation function used) provided a relatively fast implementation and performed well in our experiments. We did not tune them further because the number of hyperparameters is large and the method was not very sensitive to them. However, when moving to more complex numerical simulations or lab experiments, the hyperparameters have to be studied more extensively. A full pseudocode is given in Algorithm 1, where $\emptyset$ stands for the empty set and $D \leftarrow D \cup D^{(\mathrm{new})}$ for the concatenation of the previous data set and the new data set that was collected during the last episode.
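Maximizing a Gaussian log-likelihood is equivalent to minimizing the corresponding negative log-likelihood, which is what a training loop would typically evaluate. The NumPy sketch below assumes a diagonal Gaussian parameterized by a mean and a log-variance, averages over all elements, and uses a hypothetical function name; it is not the training code of the paper.

```python
import numpy as np

def gaussian_nll(target, mean, log_var):
    """Average negative log-likelihood of `target` under N(mean, exp(log_var))."""
    target, mean, log_var = (np.asarray(x, dtype=float) for x in (target, mean, log_var))
    squared_error = (target - mean) ** 2
    # Per-element NLL: 0.5 * (log variance + scaled squared error + log(2*pi)).
    per_element = 0.5 * (log_var + squared_error / np.exp(log_var) + np.log(2.0 * np.pi))
    return float(np.mean(per_element))
```

Predicting a log-variance rather than the variance itself keeps the variance positive without any constraint on the network output. For a fixed prediction error, the loss is smallest when the predicted variance equals the squared error, which is how the network learns a meaningful uncertainty estimate.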

Planning control
We use the learned dynamics model to plan for the action, i.e., the mirror commands to apply at each timestep. The goal of the planning algorithm is to optimize a sequence of actions $\{a_t, a_{t+1}, \dots, a_{t+T}\}$ such that it maximizes the expected reward inside some planning horizon $T$ [44].
For the AO case, the action taken at timestep $t$ takes one timestep to be executed, and one additional timestep for the corresponding observation to be recorded. Therefore, we are essentially planning to minimize the observed wavefront sensor measurements up to $t+2$, i.e., we implicitly predict the best control action by the DM at the time of the WFS measurement (two frames into the future in this case). This planning horizon of two steps provides stable control for time delays smaller than or equal to 2 frames. On a real AO system, the time delay is to some extent stochastic and/or non-integer. Therefore, the planning horizon should include the longest time delays that may occur in the control loop. Further, in the presence of DM dynamics, the effective planning horizon might be a couple of time steps longer, since the control voltage decisions are not fully independent.
Starting at the given initial state, the CEM works as follows. We first sample a trajectory of actions $a_t, a_{t+1}$ from a Gaussian distribution parameterized by some starting $\mu$ and $\sigma^2$. Next, we use the learned dynamics model $\hat{p}_\theta$ to produce a sequence of potential next states given the actions and the initial state, i.e., $s_{t+2} \sim \hat{p}_\theta(s_{t+1}, a_{t+1})$, where $s_{t+1} \sim \hat{p}_\theta(s_t, a_t)$. Since the dynamics model is approximated by an ensemble, these states will include samples from models trained using different bootstrapped training data sets. The algorithm then chooses the so-called elites, i.e., the actions that produce the best rewards, and recalculates the sampling distribution parameters $\mu, \sigma^2$ to adjust to the elites using a maximum likelihood estimate. Finally, the mean of the sampling distribution is returned as the best trajectory. Note that in the actual task only the first action is executed, after which another transition is observed, and the algorithm is run again using the new observation as the starting state. This procedure of re-planning at each timestep is often referred to as model predictive control (MPC). The full pseudo-code is given in Algorithm 2.

Algorithm 2 (final steps). Update $\mu \leftarrow \mathrm{mean}(\hat{A})$ and $\sigma^2 \leftarrow \mathrm{Var}(\hat{A})$ over the elite actions $\hat{A}$; return $\mu$.
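The planning step can be sketched with a generic cross-entropy method over a short horizon. Everything in the example is a stand-in chosen for illustration: the scalar toy dynamics `toy_models` (two slightly different linear models playing the role of the ensemble), the reward `toy_reward`, and all sample counts and iteration numbers are assumptions, not the settings used in the simulations.

```python
import numpy as np

def cem_plan(state, models, reward_fn, horizon=2, act_dim=1,
             n_samples=200, n_elites=20, n_iters=5, seed=0):
    """Cross-entropy method with trajectory sampling over an ensemble (illustrative)."""
    rng = np.random.default_rng(seed)
    mu = np.zeros((horizon, act_dim))
    var = np.ones((horizon, act_dim))
    for _ in range(n_iters):
        noise = rng.standard_normal((n_samples, horizon, act_dim))
        actions = mu + np.sqrt(var) * noise          # candidate action sequences
        returns = np.zeros(n_samples)
        for i in range(n_samples):
            s = state
            for h in range(horizon):
                # Pick a random ensemble member for every simulated step.
                model = models[rng.integers(len(models))]
                s = model(s, actions[i, h])
                returns[i] += reward_fn(s)
        elites = actions[np.argsort(returns)[-n_elites:]]   # best sequences
        mu, var = elites.mean(axis=0), elites.var(axis=0) + 1e-8
    return mu  # only mu[0] would be applied before re-planning (MPC)

# Toy stand-ins: the residual shrinks by gain * action; gains differ per member.
toy_models = [lambda s, a, g=g: s - g * a for g in (0.95, 1.05)]
toy_reward = lambda s: -float(np.sum(s ** 2))
```

In this toy problem, the plan for an initial residual of one should come out close to "remove the residual in the first step, then do nothing", which mirrors how the controller nulls the predicted future measurement.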

Simulation set-up
In the following numerical simulations, the OOMAO simulator serves as the plant of the control system: it only provides the WFS measurements and receives a vector of the control voltages. The PETS algorithm runs in Python and interacts with the plant via a Python/MATLAB interface. We compare the results against the ones obtained by a well-tuned integrator controller as well as a theoretical controller that suffers neither from time delay nor measurement noise. This theoretical controller is computed from the non-delayed noiseless measurement, i.e., it still contains errors due to aliasing and uncontrolled high-order modes. The same limitation also applies to the MBRL and integrator controllers. The optimum integrator gain is always tuned globally to give the best performance (Strehl ratio) for each simulation set-up (guide star (GS) magnitude and mis-registration (MR)) separately. This is done manually, and typical values were between 0.3 and 0.6 for our simulation set-ups.
We simulated an 8 m telescope observing a single natural guide star (NGS), equipped with a 23 × 23 SHS and a 24 × 24 DM with a Fried geometry (actuators on the subaperture corners). The DM actuator influence functions are assumed to be Gaussian with a 45% coupling. Atmospheric turbulence is simulated as a sum of three frozen flow layers with Von Karman power spectra, combining to a Fried parameter $r_0$ of 15 cm at 550 nm wavelength. The parameters of the atmosphere are listed in Table 1. The loop is running at a framerate of 500 Hz with a time delay of 2 steps. We pick the simulation parameters to demonstrate three key properties of the proposed method:
• The predictive capacity of the method is shown on a system with negligible measurement noise; see Figures 5a and 3, and Table 2.
• The robustness of the method against observation noise is shown by observing natural guide stars of different magnitudes; see Figures 4, 5 and Table 2.
• The self-calibrating property is demonstrated by running the same simulations but introducing MR between the WFS and DM; see Figure 7 and Table 1.
We model MR in calibration by changing the alignment between the WFS and the DM in two different directions and shift amplitudes (see Table 1). All images and contrast plots are calculated at $\lambda = 1.65\,\mu$m (H-band), and the WFS measures at $\lambda = 551$ nm (V-band). Wind speed and MR are somewhat pessimistic to prevent the error budget from being dominated by the significant aliasing error of the SHS [49].
We set the state of the MDP to include the last four actions and four WFS measurements and set the episode length to 400 frames giving a balance between a fast iteration and a reliable performance estimate. We validate our proposed algorithm by running multiple simulations in the simulator. Each simulation starts with the knowledge of the reconstruction matrix, but zero knowledge of temporal behavior including the time lag. Note here that the sole purpose of the reconstruction matrix is to implement the control space filtering by mapping WFS measurements on residual voltages to be included in the state. We never change it when running the MBRL control, in particular we do not update it to match the MR. Our model learns to compensate for the measurement noise, misregistration in the reconstruction matrix, and the atmosphere's temporal behavior by interacting with the environment.

Training
To demonstrate how fast the method learns a successful control strategy in different noise conditions and MRs, we compare the learning curve of the method to the baseline of the integrator; see Figures 2 and 6. In terms of loss, i.e., the negative reward over the episode, our model outperforms the integrator baseline after about 1600 frames and reaches its full potential in about 4000 frames in all of the test cases. The total loss in the figure corresponds to the sum of normalized residual voltages computed from the WFS measurements. For the simulated system running at 500 Hz, 1600 timesteps are equivalent to 3.2 seconds of actual time, while 4000 timesteps correspond to 8 seconds. As described in Section 4.2, we train and update the dynamics model after each episode. The loop is suspended during this time, which amounts to several seconds given our rather shallow NN architecture and moderate computational power. At the telescope, with typically variable observing conditions (wind speed and direction, seeing, guide star magnitude), the dynamics model has to be trained in parallel to the observation, for example using a separate computer. The available time for training is then set by the episode length and should not exceed the time scale of environment variability.

Prediction and noise robustness
We compare the correction performance of the fully converged PETS models to the integrator in terms of raw point spread function (PSF) contrast [50] and Strehl ratio [3]. Each simulation run consists of 8000 frames, i.e., 16 s. The resulting H-band Strehl ratios are computed from the wavefront error maps by Marechal's approximation [3] and are presented in Figure 4. The MBRL control outperforms the integrator in all cases. The predictive capacity of the MBRL algorithm should result in an improved raw PSF contrast by reducing the notorious wind-driven halo (WDH) [51]. The raw PSF contrast is given by the ratio of the intensity of the perfect coronagraphic PSF [52] at a certain angular separation to the peak intensity of the non-coronagraphic image. In Figure 5a, we see that the RL method significantly reduces the WDH in all noise cases and hence delivers a better raw PSF contrast, especially along the dominant wind direction. We also analyze a time series of one randomly picked actuator, shown in Figure 3, and see that the RL method follows the non-delayed signal much more closely than the integrator, which exhibits the expected 2-frame delay between incident wavefront and correction by the DM.
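For orientation, the conversion from a residual wavefront error to a Strehl ratio can be sketched as follows. The example assumes the commonly used extended form of the approximation, $S \approx \exp(-\sigma^2)$ with $\sigma$ the RMS phase error in radians; the text does not state which form was used, and the function name and default wavelength of 1650 nm (H-band) are choices made here.

```python
import numpy as np

def strehl_marechal(rms_wavefront_error_nm, wavelength_nm=1650.0):
    """Strehl ratio from RMS wavefront error via S = exp(-sigma^2) (assumed form)."""
    # Convert the RMS wavefront error from nanometres to radians of phase.
    sigma_rad = 2.0 * np.pi * np.asarray(rms_wavefront_error_nm, dtype=float) / wavelength_nm
    return np.exp(-sigma_rad ** 2)
```

A vanishing residual gives a Strehl ratio of one, and the ratio drops quickly as the RMS error approaches a sizeable fraction of the wavelength, which is why the approximation is normally reserved for well-corrected wavefronts.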

Performance under misregistration
Besides the predictive power, MBRL may provide other benefits for AO. One such benefit could be the automatic adaptation to dynamic MR between DM and WFS. MR is often introduced through mechanically or thermally induced flexure in a real AO system and negatively affects the performance if left uncompensated. Algorithms to detect and compensate for MR exist [5], but combining these with a data-driven predictive control might not be trivial, or at least might need online tuning of the hyperparameters involved. In turn, RL does not make a specific assumption on the origin of error terms. Consequently, the very same algorithm with the same hyperparameters, including the reconstruction matrix C, also learns errors due to MR. Prospects are that RL might also learn to minimize some error terms we are not expecting. In order to verify this claim, we ran a simulation of the bright guide star case while shifting the WFS with respect to the DM by 14% to the upper left (1 px up and 1 px right on the WFS) and in another case by 28% to the lower right (2 px down and 2 px right). Note that the reconstruction matrix C does not include the MR, i.e., the residual voltage representation of the WFS measurement does not match the mirror's voltage representation anymore. The results are shown in Figures 6 and 7. The MBRL control maintains its performance and predictive capacity even when a severe MR of 28% of a subaperture is applied. Only at high spatial frequencies close to the DM correction radius [50] do we see a small contrast degradation in the 28% MR case. This is due to the non-optimal alignment geometry, i.e., some higher-order modes on the DM are no longer visible in the WFS. The RL method also learns to stabilize these modes.

Discussion
We have formulated the control task of a closed-loop adaptive optics system as a Markov decision process and evaluated the performance of standard deep reinforcement learning algorithms on such a system. Our simulation results demonstrate that a state-of-the-art MBRL algorithm, PETS, performs robustly with no environment-specific assumptions apart from a generic reconstruction matrix. Moreover, the MBRL method predicted the turbulence evolution to a good approximation, automatically adapted to misregistration between DM and WFS, and was robust to measurement noise. Even though the algorithm itself is rather complicated to implement, its usage is simple: the algorithm calibrates, tunes, and maintains itself automatically. The MBRL method operates on control voltages and residual voltages, which are derived from the residual WFS measurements, and takes into account closed-loop dynamics along with the temporal evolution of the atmosphere. All the data needed are recorded on the control system itself, eliminating dependencies on any numerical simulator or assumptions about the physics of the system. The MBRL control also outperformed classical integrator control in all simulation environments considered in Section 4.3.2. The simulated performance is limited by the aliasing error of the SHS. With our single-sensor setup and the objective to null future measurements, the correction of the DM unavoidably includes low spatial frequency aberrations which cancel the SHS signal of high spatial frequency turbulence [53]. Finally, the MBRL method learns quickly, requiring only 1600 timesteps in the simulator to surpass the baseline controller and converging at around 4000 timesteps.
We simulate a relatively low-order system with 24 actuators across the pupil. On the one hand, this keeps the execution times low with our moderate computational resources. On the other hand, the chosen system size is very relevant because it matches the size foreseen for the second AO stages currently planned or under development [54][55][56], which are to be added to already existing first AO stages. While here we consider a single-stage SCAO system, our method could be extended to control such a second-stage AO system by including the first stage's voltages in the state as well.
In future work we plan to extend the algorithm and comprehensively study a system with more complex DM dynamics, non-linear WFS such as the Pyramid WFS, saturations, alignment errors, turbulence boiling, and a cascaded AO system with a fast second stage. In particular, the future extreme AO systems on the upcoming generation of extremely large telescopes will control more than 10^4 degrees of freedom; as such, scalability of the method shall be considered.
Future work should also address the challenges imposed by variable turbulence. Understanding the trade-off between model complexity and fast training is essential for a successful implementation. Our MBRL method already learns continuously on timescales of several seconds. Therefore, prospects are good that it can automatically adjust to changing conditions on the timescales on which atmospheric parameters typically change [42]. Finally, we believe that the biggest and most important challenge for a successful on-sky implementation of MBRL control for AO is the computational complexity of the method. In this work, the computational time at each timestep of the MPC on 448 degrees of freedom is around 80-120 ms using a laptop equipped with a single NVIDIA Quadro RTX 3000 GPU and a straightforward implementation in PyTorch [57]. Both the delay and the temporal jitter are too large for stable control of a real system. In contrast to a real system, whose cadence is defined by the atmosphere and the WFS framerate, our simulations are stepwise and therefore not sensitive to the jitter, and no strategies to minimize it were devised. Jitter could, for example, be mitigated by exiting the planning algorithm after a given time rather than after a fixed number of iterations (20 in our simulations).
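The time-budgeted planning idea could look roughly like the following sketch of a cross-entropy method (CEM) loop that stops either after a maximum number of iterations or once a wall-clock deadline has passed. This is not the implementation used in our simulations: the function name `cem_plan`, the generic `cost_fn` (which stands in for rolling out candidate actions through the learned dynamics model), and all default parameter values except the 20 iterations are assumptions made for illustration.

```python
import random
import statistics
import time


def cem_plan(cost_fn, dim, time_budget_s, n_samples=100, n_elite=10,
             max_iters=20, seed=0):
    """Minimize cost_fn over a dim-dimensional action with a time-limited CEM.

    cost_fn is a hypothetical stand-in for the model-based rollout cost.
    Returns the final mean of the sampling distribution and the number of
    CEM iterations that were actually completed.
    """
    rng = random.Random(seed)
    mean = [0.0] * dim
    std = [1.0] * dim
    deadline = time.perf_counter() + time_budget_s
    iters_done = 0
    for _ in range(max_iters):
        # Sample candidates from the current diagonal Gaussian.
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                   for _ in range(n_samples)]
        # Keep the lowest-cost candidates and refit the distribution to them.
        samples.sort(key=cost_fn)
        elite = samples[:n_elite]
        mean = [statistics.fmean(col) for col in zip(*elite)]
        std = [statistics.pstdev(col) + 1e-6 for col in zip(*elite)]
        iters_done += 1
        # Exit early once the time budget is exhausted.
        if time.perf_counter() >= deadline:
            break
    return mean, iters_done
```

In this sketch the deadline is only checked between iterations, so the planner can overrun the budget by up to one iteration. At least one iteration is always completed, so a command is available even with a very small budget.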
The large computational cost could be alleviated by reducing the number of parameters in the dynamics model, employing fewer samples in the planning phase, and tuning the CEM procedure's hyperparameters. It seems feasible that these measures, combined with better hardware and an optimized low-level implementation, are sufficient to bring the running time of our method with 448 degrees of freedom down into the range needed for an on-sky system.
However, the algorithm's brute-force approach could possibly be improved. A promising way to speed up the MBRL control system could be to replace the dynamics model and/or the planning algorithm with computationally cheaper alternatives. We proposed a dynamics model composed of an ensemble of convolutional NNs. If the non-linearity of NNs turns out not to be needed, a much simpler linear model, e.g., an autoregressive model, could be used instead. Also, we are already investigating other methods that replace the planning phase of MBRL with a so-called policy function [58], which could be implemented as an NN and would therefore avoid iterations and make the controller fast.
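To make the linear alternative concrete, the sketch below fits a second-order autoregressive model, x[t+1] ≈ a1*x[t] + a2*x[t-1], to a scalar time series by least squares and uses it for a one-step prediction. It is a toy illustration only: the restriction to a single scalar series (for example one actuator or mode), the model order, and the function names are assumptions, and a real AO dynamics model would be multivariate and include the control voltages.

```python
import math


def fit_ar2(series):
    """Least-squares fit of x[t+1] ~ a1*x[t] + a2*x[t-1]; returns (a1, a2)."""
    s11 = s12 = s22 = b1 = b2 = 0.0
    for t in range(1, len(series) - 1):
        x0, x1, y = series[t], series[t - 1], series[t + 1]
        s11 += x0 * x0
        s12 += x0 * x1
        s22 += x1 * x1
        b1 += x0 * y
        b2 += x1 * y
    # Solve the 2x2 normal equations with Cramer's rule.
    det = s11 * s22 - s12 * s12
    if abs(det) < 1e-12:
        raise ValueError("series is too short or degenerate for an AR(2) fit")
    a1 = (b1 * s22 - b2 * s12) / det
    a2 = (s11 * b2 - s12 * b1) / det
    return a1, a2


def predict_next(series, coeffs):
    """One-step-ahead prediction from the last two samples."""
    a1, a2 = coeffs
    return a1 * series[-1] + a2 * series[-2]


# Toy usage: a pure sinusoid obeys x[t+1] = 2*cos(w)*x[t] - x[t-1] exactly,
# so the fit should recover a1 = 2*cos(w) and a2 = -1.
w = 0.3
signal = [math.sin(w * t) for t in range(200)]
coeffs = fit_ar2(signal[:-1])
prediction = predict_next(signal[:-1], coeffs)
```

Such a model is cheap to fit and to evaluate, which is the motivation for considering it, but whether it captures enough of the closed-loop dynamics is exactly the open question raised above.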
Finally, a potentially efficient way to reduce the computational effort is to apply MBRL control only to a low-dimensional subset of the controlled parameters. For example, modal control could allow us to control a small set of modes with MBRL, while the other modes are controlled classically.
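A hybrid scheme of this kind might be organized as in the following sketch, in which the first few modal coefficients are handed to a learned controller and the remaining modes are updated by a classical integrator. Everything here is hypothetical: the function name, the `rl_controller` callable standing in for the MBRL controller, the ordering of modes, the integrator gain, and the sign convention (residuals expressed as the correction still to be applied) are assumptions for illustration only.

```python
def hybrid_modal_update(prev_cmd, residual_modes, n_rl_modes, rl_controller,
                        gain=0.5):
    """Compute new modal commands with a split between two controllers.

    The first n_rl_modes modes are delegated to rl_controller, a hypothetical
    callable taking (previous commands, residuals) for those modes and
    returning their new commands. All other modes use a plain integrator.
    """
    if len(prev_cmd) != len(residual_modes):
        raise ValueError("command and residual vectors must have equal length")
    rl_part = list(rl_controller(prev_cmd[:n_rl_modes],
                                 residual_modes[:n_rl_modes]))
    if len(rl_part) != n_rl_modes:
        raise ValueError("rl_controller returned the wrong number of modes")
    # Classical integrator on the remaining modes: c_new = c_old + gain * r.
    integ_part = [c + gain * r for c, r in
                  zip(prev_cmd[n_rl_modes:], residual_modes[n_rl_modes:])]
    return rl_part + integ_part
```

The computational saving would come from the learned controller only having to model and plan over `n_rl_modes` degrees of freedom instead of the full command vector.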