Towards on-sky adaptive optics control using reinforcement learning
Model-based policy optimization for Adaptive Optics

Context. The direct imaging of potentially habitable exoplanets is one prime science case for the next generation of high-contrast imaging instruments on ground-based extremely large telescopes. To reach this demanding science goal, the instruments are equipped with eXtreme Adaptive Optics (XAO) systems which will control thousands of actuators at framerates of one to several kilohertz. Most of the habitable exoplanets are located at small angular separations from their host stars, where the current control laws of XAO systems leave strong residuals.
Aims. Current AO control strategies like static matrix-based wavefront reconstruction and integrator control suffer from temporal delay error and are sensitive to mis-registration, i.e., to dynamic variations of the control system geometry. We aim to produce control methods that cope with these limitations, provide a significantly improved AO correction, and, therefore, reduce the residual flux in the coronagraphic point spread function (PSF).
Methods. We extend previous work in reinforcement learning (RL) for AO. The improved method, called PO4AO, learns a dynamics model and optimizes a control neural network, called a policy. We introduce the method and study it through numerical simulations of XAO with pyramid wavefront sensing for the 8-m and 40-m telescope aperture cases. We further implemented PO4AO and carried out experiments in a laboratory environment using the Magellan Adaptive Optics eXtreme system (MagAO-X) at the Steward laboratory.
Results. PO4AO provides the desired performance, improving the coronagraphic contrast by a factor of 3-5 within the control region of the DM and pyramid WFS, both in numerical simulations and in the laboratory. The method is also quick to train, i.e., on timescales of typically 5-10 seconds, and the inference time is sufficiently short (sub-millisecond) to be used in real-time control for XAO with currently available hardware, even for extremely large telescopes.


1. Introduction
The study of extrasolar planets (exoplanets) and exoplanetary systems is one of the most rapidly developing fields of modern astrophysics. More than 3000 confirmed exoplanets have been identified, mainly through indirect methods by NASA's Kepler mission. High-contrast imaging (HCI) detections are mostly limited to about a dozen very young and luminous giant exoplanets (e.g., Marois et al. 2010; Lagrange et al. 2009; Macintosh et al. 2015) due to the challenging contrast requirements at a fraction of an arcsecond angular distance from a star that can be a billion times brighter than the exoplanet.

[Figure caption: The method, PO4AO, feeds actions to the environment, observes the outcome, and then improves the control with respect to the reward. Starting from random behavior (frame 0), the method learns a predictive control strategy in only 5000 frames of interaction.]
optics from the exoplanet such as spectral and angular differential imaging (Marois et al. 2004, 2006) or high-dispersion spectroscopy (Snellen et al. 2015). With an optimized instrument design, the XAO residual halo may be the dominant source of noise (Otten et al. 2021). Therefore, minimizing the XAO residuals is a key objective for ground-based HCI.
Adaptive optics systems typically run in a closed-loop configuration, where the wavefront sensor (WFS) measures the wavefront distortions after deformable mirror (DM) correction. The objective of this control loop is to minimize the distortions in the measured wavefront, that is, the residual wavefront, which, in theory, corresponds to minimizing the speckle intensity in the post-coronagraphic image. In the case of the widely used integrator controller, temporal delay error and photon noise usually dominate the wavefront error budget in the spatial frequency regime controlled by the DM (Guyon 2005; Fusco et al. 2006). A large part of the turbulence presumably follows frozen flow on the millisecond timescales of AO control, and hence a significant fraction of wavefront disturbances can be predicted (Poyneer et al. 2009). Therefore, control methods that use past telemetry data have shown a significant potential for reducing the temporal error and photon noise (Males & Guyon 2018; Guyon & Males 2017; Correia et al. 2020). Further, real systems suffer from dynamic modeling errors such as misregistration (Heritier et al. 2018), the optical gain effect for the pyramid WFS (Korkiakoski et al. 2008; Deo et al. 2019), and temporal jitter (Poyneer & Véran 2008). Combined, these errors lead to a need for external tuning and recalibration of a standard pseudo-open-loop predictive controller to ensure robustness.
An up-and-coming field of research aimed at improving AO control methods is the application of fully data-driven control methods, where the control voltages are derived from a learned control model (Nousiainen et al. 2021; Landman et al. 2020, 2021; Haffert et al. 2021a,b; Pou et al. 2022). A significant benefit of fully data-driven control in closed loop is that it does not require an estimate of the system's open-loop temporal evolution and that it is, therefore, insensitive to pseudo-open-loop reconstruction errors, such as the optical gain effect (Haffert et al. 2021a). In particular, reinforcement learning (RL) has also been shown to cope with temporal and misregistration errors (Nousiainen et al. 2021). RL is an active branch of machine learning that learns a control task via interaction with the environment. The principal idea is to let the method feed actions to the environment, observe the outcome, and then improve the control strategy with respect to the long-term reward. The reward is a predefined function giving a concrete measure of the method's performance. By learning this way, RL methods do not require accurate models of the components in the control loop and, hence, can be viewed as an automated approach to control.
Previous work in RL-based adaptive optics control has focused on either controlling DM modes using model-free methods that learn a policy π_θ : s_t → a_t, parameterized by θ, that maps states s_t (or observations) into actions a_t directly (Landman et al. 2020, 2021; Pou et al. 2022), or using model-based methods that employ a planning step to compute actions (Nousiainen et al. 2021). The model-free methods have the advantage of being fast to evaluate, as the learned policies are often neural networks that support sub-millisecond inference. However, they suffer from the large space of actions resulting from the number of actuators that need to be controlled in adaptive optics systems: learning to control each actuator simultaneously with a model-free method is difficult. On the other hand, model-based RL approaches benefit from being simple to train, even using off-policy data, that is, data obtained while using a different (e.g., classical integrator) control method. A model-based method may only need hundreds of iterations, while a model-free algorithm such as a policy gradient method may need millions of iterations (Janner et al. 2019). However, the planning step of model-based RL is often iterative and could, therefore, be too slow for AO control, even with expensive hardware (Nousiainen et al. 2021).
In this paper, we unify the approaches described above by learning a dynamics model and using the model to train a policy that is fast to evaluate and scales to control all actuators in a system. We call this hybrid algorithm Policy Optimization for Adaptive Optics (PO4AO). We do this by employing an end-to-end convolutional architecture for the policy, leveraging the differentiable nature of the chosen reward function, and directly backpropagating through trajectories sampled from the model. Our method scales to sub-millisecond inference, and we present promising results in both a large pyramid-sensor-based simulation and a laboratory setup using the Magellan Adaptive Optics eXtreme system (MagAO-X), where our method is trained from scratch using interaction.

2. Related work
The adaptive optics control problem differs from the typical control problems considered by modern RL research. The main challenges of AO control are twofold: first, the control space is substantially larger than in the classical RL literature and is typically parameterized by 500 to 10 000 degrees of freedom (DoF). Second, the state of the system is observed through an indirect measurement, where the related inverse problem is not well-posed. On the bright side, it has been observed in the literature that simple differentiable reward functions with a relatively short time horizon can lead to good performance (Nousiainen et al. 2021).
Recently, progress has been made toward full reinforcement learning-based adaptive optics control. Landman et al. (2020) use the model-free recurrent deterministic policy gradient algorithm to control the tip and tilt modes of a DM, and a variation of the method to control a high-order mirror in the special case of ideal wavefront sensing. Pou et al. (2022) implemented a model-free multi-agent approach to control a 40 x 40 Shack-Hartmann-based AO system and analyzed its robustness against noise and variable atmospheric conditions. On the other hand, Nousiainen et al. (2021) present a model-based solution that learns a dynamics model of the environment and uses it with a planning algorithm to decide the control voltages at each timestep. This method shows good performance but requires heavy computation at each control loop iteration, which will be a problem in future generations of instruments with more actuators per DM. PO4AO aims for the best of both worlds: it requires only a small amount of training data and has a high inference speed, capable of scaling to modern telescopes. Further, we analyze the performance of our method under different noise levels and varied wind conditions combined with nonlinear wavefront sensing.
In RL terms, model-based policy optimization is an active area of research. Work that tackles the full reinforcement learning problem without assuming a known reward function includes Heess et al. (2015), and Janner et al. (2019). In contrast, PILCO and the subsequent deep PILCO (Deisenroth & Rasmussen 2011;Gal et al. 2016) are methods that directly backpropagate through rewards. Our method is similar to deep PILCO in the sense that it learns a neural network policy from trajectories sampled from a neural network dynamics model.
In addition, significant progress has also been made in AO control methods outside RL and fully data-driven algorithms. Linear-quadratic-Gaussian (LQG) control based methods have been studied in Kulcsár et al. (2006); Paschall & Anderson (1993); Gray & Le Roux (2012); Conan et al. (2011); Correia et al. (2010a,b, 2017), sometimes combined with machine learning for system identification (Sinquin et al. 2020). Predictive controllers have been studied in Guyon & Males (2017); Poyneer et al. (2007); Dessenne et al. (1998); van Kooten et al. (2017, 2019). Methods vary from linear filters operating on single modes (such as Fourier modes) to neural network approaches (Swanson et al. 2018; Sun et al. 2017; Liu et al. 2019; Wong et al. 2021). Predictive control methods have also been studied in a closed-loop configuration. Males & Guyon (2018) address a closed-loop predictive controller's impact on the post-coronagraphic contrast with a semianalytic framework. Swanson et al. (2021) studied closed-loop predictive control with NNs via supervised learning, where an NN is trained to compensate for the temporal error.
Finally, other RL-based methods have been developed for different types of AO. In order to mitigate alignment errors in calibration, a deep-learning control model was proposed in Xu et al. (2019). A model-free RL method for wavefront-sensorless AO was studied in Ke et al. (2019). That method is shown to provide faster correction than a baseline method for a relatively low-order AO system, while our work focuses on the case of XAO for HCI.

3. Reinforcement learning applied to adaptive optics
Since we introduce a novel approach (RL) to the field of AO, we present hereafter some of the standard notation and terms used in RL. The de facto mathematical framework for modeling sequential decision problems in the field of RL is the "Markov decision process" (MDP). An MDP is a discrete-time stochastic process which, at time step t, is in a "state" s_t ∈ S, where S is the set of all possible states. A "decision-maker" then takes an "action" a_t ∈ A (again, A is the set of possible actions) based on the current state, and the "environment" changes to the next state s_{t+1}. As the transition dynamics (a_t, s_t) → s_{t+1} is random in nature (influenced, e.g., by the turbulence evolution), it is represented here by the conditional probability density function p(s_{t+1} | s_t, a_t). At each timestep a "reward" R_t = r(s_t, a_t) is also observed, which is a (possibly stochastic) function of the current state and action. The modeler usually designs the reward to make the decision-maker produce some favorable behavior (e.g., correcting for turbulence distortions). The actions the decision-maker takes are determined by a "policy" π_θ : s_t → a_t, which is a function that maps states into actions. For example, the matrix-vector multiplier (MVM) can be viewed as a policy, taking a wavefront sensor measurement as input and outputting the control voltages. The objective of reinforcement learning is to find a policy that maximizes the expected cumulative reward:

$$ \max_{\theta} \; \mathbb{E}_{p_\theta} \left[ \sum_{t=0}^{T} r\big(s_t, \pi_\theta(s_t)\big) \right], \tag{1} $$

where $p_\theta(s_0, \dots, s_T) = p_0(s_0) \prod_{t=1}^{T} p(s_t \mid s_{t-1}, \pi_\theta(s_{t-1}))$, with the initial distribution s_0 ∼ p_0 and the convention π_θ(s_{−1}) = a_0 for fixed initial DM commands a_0. In particular, we focus here on parametric models of π_θ, where θ is the set of parameters of the policy, for example, the weights and biases of a neural network. That is, given that the actions are given by π_θ, we wish to find the parameters θ that maximize the expected cumulative reward the decision-maker receives.
Here T is the maximum length of an episode, or a single run of the algorithm in the environment. The transition dynamics is usually not known in adaptive optics control: it includes a multitude of unknowns, including the atmospheric turbulence, the dynamics of the WFS and DM, and the jitter in the computational delay. In order to solve Eq. (1) efficiently, model-based RL algorithms estimate the true dynamics model p(s_{t+1} | s_t, a_t) in (1) by an approximate model p̂(s_{t+1} | s_t, a_t). Model-free methods, in turn, only learn a policy; they do not attempt to model the environment.
The standard MDP formulation assumes that all information about the environment is contained in the state s t . This is not the case in many real-world domains, such as adaptive optics control. A more refined formulation is then the "partially observed" MDP or POMDP, where the decision-maker observes o t , which is some subset or function of the true underlying state. The Markov property, that is, the assumption that the next state depends only on the previous state and action, does not necessarily apply to the observations in a POMDP. This work uses the standard method of having our state representation include a small number of past observations (WFS measurements) and actions (control voltages) to deal with this issue. This allows the policy to use knowledge of past actions to predict the next action. The exact form of the observations o t and the full state s t for adaptive optics control will be given in Section 5.1.
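The observation-and-action stacking described above can be sketched as follows; this is a minimal numpy illustration with a hypothetical grid size N and history lengths k = m = 3 (the values used in the experiments may differ):

```python
import numpy as np
from collections import deque

k = m = 3      # number of past observations / actions kept in the state
N = 4          # illustrative actuator-grid side length (toy size)

obs_hist = deque(maxlen=k)   # past reconstructed WFS frames o_t
act_hist = deque(maxlen=m)   # past differential DM commands a_t

def make_state(obs_hist, act_hist):
    """Stack past observations and actions along a channel axis."""
    return np.stack(list(obs_hist) + list(act_hist), axis=0)

rng = np.random.default_rng(0)
for t in range(10):
    obs_hist.append(rng.standard_normal((N, N)))  # new observation arrives
    act_hist.append(np.zeros((N, N)))             # command issued this step
    if len(obs_hist) == k and len(act_hist) == m:
        s_t = make_state(obs_hist, act_hist)      # (k + m, N, N) state tensor

assert s_t.shape == (k + m, N, N)
```

The fixed-length deques implement the "small number of past observations and actions" directly: old frames fall out automatically as new ones arrive.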
Finally, it is common in RL to use reward functions that are not differentiable (such as 1 for winning a game, 0 otherwise) or functions that do not depend directly on the state. In high-contrast imaging, we would like to minimize the speckle intensity in the post-coronagraphic PSF. However, this can be difficult to estimate at the high frequencies of modern HCI instruments. We discuss the specific choices in this regard in Section 5.1.

4. Adaptive optics control
This section introduces AO control aspects that are relevant to our work. First, we introduce the AO system components and then outline a standard control law called the integrator and the related calibration process. An overview of the AO control loop is given in Fig. 1; the incoming wavefront φ_t^tur at timestep t gets corrected by the DM. Next, the WFS measures the DM-corrected residual wavefront φ_t^res. After receiving the wavefront sensor measurement, the control computer calculates a set of control voltages and sends the commands to the DM.
Further, the AO control loop suffers from a temporal delay. The delay consists of a measurement delay introduced by the WFS integration and a control delay consisting of the WFS readout, the computation of the correction signal by the control algorithm, and its application to the DM. These add up to a total delay of at least twice the operating frame time of the AO system (Madec 1999).
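The effect of this loop delay can be illustrated with a small buffer sketch, assuming a two-frame total delay: the command computed at step t only reaches the DM two frames later.

```python
from collections import deque

delay = 2                      # total loop delay in frames (assumed here)
buf = deque([0.0] * delay)     # commands "in flight" through the pipeline

applied = []
for t in range(6):
    cmd = float(t)             # command computed at step t
    buf.append(cmd)            # enters the WFS-readout/compute pipeline
    applied.append(buf.popleft())  # command actually reaching the DM now

# the command stream arrives shifted by two frames
assert applied == [0.0, 0.0, 0.0, 1.0, 2.0, 3.0]
```

Any controller that ignores this shift corrects for a wavefront that has already evolved, which is the temporal delay error discussed above.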

4.1. Pyramid wavefront sensor for adaptive optics
The function of the WFS is to measure the spatial shape of the residual wavefront phase φ_t^res. There are several different types of WFSs, but in this work, we focus on the so-called pyramid WFS (PWFS), which is a mature concept providing excellent performance for HCI (Guyon 2005). In the following, we give a short description of the PWFS.
The PWFS can be viewed as a generalization of the Foucault knife-edge test (Ragazzoni 1996). In pyramid wavefront sensing, the electric field of the incoming wavefront is directed to a transparent four-sided pyramid prism. The prism is located in the focal plane of an optical system and, hence, can be modeled as a spatial Fourier filter that introduces specific phase changes according to the shape of the prism (Fauvarque et al. 2017). This four-sided pyramid divides the incoming light into four different directions, and most of the light is propagated to four intensity images on the PWFS detector. Due to the slightly different optical paths of the light, the intensity fields differ from each other. These differences are then used as the data for recovering the disturbances in the incoming phase screen.
Commonly, pyramid data, that is, the intensity fields, are processed to so-called slopes w_x, w_y that correlate positively with the actual gradient fields of the phase screen. In this paper, we follow the approach of Vérinaud (2004), where the slopes are normalized with the global intensity. In practice, we receive a vector w that is a collection of the measurements w_x, w_y at all possible locations x, y.
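As an illustration of this slope computation, a minimal numpy sketch; the quadrant labeling and sign conventions here are assumptions for illustration, not the exact slope definition of Vérinaud (2004):

```python
import numpy as np

def pwfs_slopes(I1, I2, I3, I4):
    """Slope-like signals from the four PWFS pupil images, normalized by
    the global intensity. The quadrant combinations below are only an
    illustrative convention."""
    I_tot = np.mean(I1 + I2 + I3 + I4)          # global intensity normalization
    wx = (I1 + I2 - I3 - I4) / I_tot            # x-slope-like signal
    wy = (I1 - I2 + I3 - I4) / I_tot            # y-slope-like signal
    return np.concatenate([wx.ravel(), wy.ravel()])  # the vector w

# a flat wavefront gives equal quadrant intensities, hence zero slopes
I = np.ones((8, 8))
w = pwfs_slopes(I, I, I, I)
assert np.allclose(w, 0.0)
```

The normalization by the global intensity makes the signal insensitive to the overall flux level, which is why it is the standard preprocessing for PWFS data.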
Both modulated and nonmodulated pyramid sensor observations are connected to the incoming wavefront via a nonlinear mathematical model. This study considers nonmodulated PWFSs, where the nonlinearity is stronger, but the sensitivity is better at all spatial frequencies (Guyon 2005). Currently, most wavefront reconstruction algorithms utilize a linearization of this model, inducing a trade-off between sensitivity and robustness (modulated PWFS vs. nonmodulated PWFS). Machine learning techniques have the potential to overcome this trade-off and increase PWFS performance without a decisive robustness penalty.
Another feature of the PWFS is that its sensitivity varies with both the seeing conditions and the level of the AO correction itself (Korkiakoski et al. 2008). This so-called optical gain (OG) effect is mainly introduced by high-spatial-frequency aberrations which the DM cannot control. The presence of these aberrations reduces the signal strength of the measurement also for the controlled modes, and the strength of the reduction depends on the modes' spatial frequencies (Korkiakoski et al. 2008).
To illustrate the OG effect of the pyramid sensor, we use a preliminary version of a semi-analytical model code-named "AO cockpyt" (in preparation). This model is based on the work of Fauvarque et al. (2019), describing the sensitivity of the pyramid sensor in the presence of residuals, and on an adaptation of the Fourier models of Jolissaint (2010) and Correia et al. (2020). Figure 2 shows the analytically predicted modal optical gains for the case of an 8-m telescope with zero modulation and integrator control, considering two different wavefront sensing wavelengths. The assumed AO system for this analytical prediction is the same as the one used for our numerical simulations presented in Section 6 (41 x 41 actuators correcting for a seeing of 0.7" at 550 nm at a 1000 Hz framerate using a 0th-magnitude guide star). The figure shows how the optical gain depends on the spatial frequency of the control modes (the K-L modes are numbered from low to high spatial frequencies) and on the WFS Strehl ratio, which is lower at the shorter wavelength.
A modal optimization of the controller gains using the knowledge of Fig. 2 (the diagonality assumption in Chambouleyron et al. 2020), combined with the usual control-theory margins (gain and phase) for ensuring a robust system, can solve most of these problems. Determining optical gains in real time is possible but complex (Deo et al. 2021; Chambouleyron et al. 2020), and the relative variations shown in Figure 2 are on the order of 10-20% for our XAO case. Hence, compensating for the mode-dependent optical gains with a single integrator gain may lead to acceptable results. However, an aggressive static integrator gain could impair loop robustness when the correction improves and the optical gains increase. Section 6 presents evidence that PO4AO takes the PWFS OG effect into account for improved performance. Further, modal gain compensation of the OG is a solution that is expected to work in favorable cases, but the nonlinearities remaining after OG compensation can only be treated with nonlinear methods such as the one studied in this paper.

4.2. Classical adaptive optics control
Classically, an AO system is controlled by combining a linear reconstructor with a proportional-integral (PI) control law. We call this controller the integrator and use it as the reference method for the comparison with PO4AO. As a starting point, the controller is assumed to operate in a regime where the dependence between WFS measurements and DM commands is linear to a good approximation, satisfying

$$ w_t = D v_t + \xi_t, \tag{2} $$

where w_t = (δw_t^1, ..., δw_t^n) is the WFS data, v_t the DM commands, and D the so-called interaction matrix. Moreover, ξ_t models the measurement noise, typically composed of photon and detector noise. The DM command vector v_t represents the DM shape given in the function subspace linearly spanned by the DM influence functions.
The interaction matrix D represents how the WFS sees each DM command. It can be derived mathematically if we accurately know the system components (WFS and DM) and the alignment of the system. In practice, it is usually measured by poking the DM actuators with a small amplitude staying inside the linear range of the WFS, and recording the corresponding WFS measurements (Kasper et al. 2004;Lai et al. 2021).
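The poking procedure can be sketched as follows, assuming a hypothetical linear WFS response; real calibrations use small push-pull amplitudes within the linear range and average over noise:

```python
import numpy as np

def measure_interaction_matrix(wfs, n_act, amp=1e-2):
    """Poke each actuator with +/- amp and record the differential WFS
    response; column i of D is the response to actuator i."""
    cols = []
    for i in range(n_act):
        v = np.zeros(n_act)
        v[i] = amp
        cols.append((wfs(v) - wfs(-v)) / (2 * amp))  # push-pull estimate
    return np.stack(cols, axis=1)

# toy linear "system": wfs(v) = A @ v, so the measured D should recover A
rng = np.random.default_rng(0)
A = rng.standard_normal((12, 6))
D = measure_interaction_matrix(lambda v: A @ v, n_act=6)
assert np.allclose(D, A)
```

For a perfectly linear, noiseless system the push-pull estimate is exact; in practice the amplitude is a compromise between signal-to-noise ratio and staying inside the WFS linear range.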
The interaction matrix D is generally ill-conditioned, and regularization methods must be used to invert it (Engl et al. 1996). Here, we regularize the problem by projecting v_t onto a smaller-dimensional subspace spanned by a Karhunen-Loève (KL) modal basis. The KL basis is computed via a double diagonalization process, which considers the geometrical and statistical properties of the telescope (Gendron 1994). This process results in a transformation matrix B_m which maps DM actuator voltages to modal coefficients.
We observe that the modal interaction matrix is now obtained as D B_m^†, where B_m^† is the Moore-Penrose pseudo-inverse of B_m. A well-posed reconstruction matrix for the inverse problem in (2) is then given by

$$ C = B_m^{\dagger} \left( D B_m^{\dagger} \right)^{\dagger}, \tag{3} $$

so that the reconstructed commands lie in the range of the projection P_m = B_m^† B_m onto the KL basis. Regularization by projection is a classical regularization method with well-established theory (Engl et al. 1996). It is well suited to the problem at hand due to the physics-motivated basis expansion and the fixed, finite dimension of the observational data.
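A minimal numpy sketch of regularization by projection with toy dimensions (random matrices stand in for the real D and B_m); it checks that the reconstructed commands stay in the KL-spanned subspace:

```python
import numpy as np

rng = np.random.default_rng(1)
n_meas, n_act, n_modes = 20, 10, 6

D  = rng.standard_normal((n_meas, n_act))      # interaction matrix: voltages -> WFS
Bm = rng.standard_normal((n_modes, n_act))     # voltages -> modal coefficients

Dm = D @ np.linalg.pinv(Bm)                    # modal interaction matrix D Bm^+
C  = np.linalg.pinv(Bm) @ np.linalg.pinv(Dm)   # reconstructor: WFS -> voltages

# reconstructed voltages lie in the KL-spanned subspace: Pm C = C
Pm = np.linalg.pinv(Bm) @ Bm                   # projection onto the KL basis
assert np.allclose(Pm @ C, C)
```

Restricting the inversion to the n_modes-dimensional modal subspace is what tames the ill-conditioning: the pseudo-inverse is only taken over a well-conditioned, lower-dimensional operator.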
With ∆w_t denoting the residual error seen by the WFS in closed loop, and t denoting the discrete time step of the controller, the integrator control law is

$$ v_{t+1} = v_t + g\, C \Delta w_t, \tag{4} $$

where g is the so-called integrator gain. In the literature, g < 0.5 is typically found to provide stable control for a two-step-delay system (Madec 1999).
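The integrator law can be sketched in a toy closed loop with an identity interaction matrix and a static disturbance; the residual then decays geometrically with factor (1 - g):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
D = np.eye(n)                 # toy interaction matrix (identity for clarity)
C = np.linalg.pinv(D)         # reconstructor
g = 0.4                       # integrator gain, within the stable range
phi = rng.standard_normal(n)  # static disturbance, expressed in DM space

v = np.zeros(n)
for t in range(60):
    dw = D @ (phi - v)        # the WFS sees the residual wavefront
    v = v + g * (C @ dw)      # integrator update: v <- v + g C dw

# the commands converge to the disturbance; residual ~ (1 - g)^t
assert np.linalg.norm(phi - v) < 1e-6
```

With a real, dynamic atmosphere and a two-frame delay the residual does not vanish, which is exactly the temporal delay error that predictive methods target.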

5. Learning to control using a model
Here we detail the control algorithm, including the optimization of the dynamics model p̂_ω(s_t, a_t) and the policy π_θ(s_t). In standard AO terms, the policy combines the reconstruction and the control law (e.g., a least-squares modal reconstruction followed by integrator control); in our case, a nonlinear correction to a least-squares modal reconstruction (MDP formulation) and a predictive control law. The key idea is to learn a dynamics model that predicts the next wavefront sensor measurement given the previous measurements and actions, and to use that model to optimize the policy. Our method iterated the following three phases:
1. Running the policy: we ran the policy in the AO control loop for T timesteps (a single episode).
2. Improving the dynamics model: we optimized the dynamics model using the supervised learning objective, Eq. (9).
3. Improving the policy: we optimized the policy using the dynamics model, Eq. (12).
At each iteration of our algorithm, we collected an episode's worth of data, e.g., 500 subsequent sensor measurements and mirror commands, by running the policy in the AO control loop for T timesteps. We then saved the observed data and the applied actions, and trained our policy and dynamics model using gradients computed from all previously observed data.
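The episode-based data collection (phase 1) can be sketched as follows, with a toy scalar environment and linear policy standing in for the AO loop and the CNN policy:

```python
import numpy as np

def run_episode(policy, env_step, s0, T):
    """Phase 1: run the policy in the loop for T steps, collecting
    (s_t, a_t, s_{t+1}) tuples for model and policy training."""
    data, s = [], s0
    for _ in range(T):
        a = policy(s)
        s_next = env_step(s, a)
        data.append((s, a, s_next))
        s = s_next
    return data

# toy scalar "environment" and policy, for illustration only
env_step = lambda s, a: 0.9 * s + a
policy = lambda s: -0.5 * s

dataset = []                      # grows over episodes (the replay data D)
for episode in range(3):
    dataset += run_episode(policy, env_step, s0=1.0, T=500)
    # ...phases 2 and 3 (model and policy updates) would run here...

assert len(dataset) == 3 * 500
```

Because the tuples are stored regardless of which policy produced them, the same dataset can also contain off-policy data, e.g., frames recorded while an integrator was running.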
The following sections discuss how we represented each observation, our convolutional neural network architecture for both the dynamics model and the policy, and the optimization algorithm itself.

5.1. Adaptive optics as a Markov decision process
We defined the adaptive optics control problem as an MDP by following the approach of Nousiainen et al. (2021). As discussed in Sects. 3 and 4, we do not directly observe the state of the system but instead observe a noisy WFS measurement. In addition, adaptive optics systems suffer from a control delay resulting from the high speed of operation, which means that the system evolves before the latest action has been fully executed. Hence, we set our state representation to include a small number of past WFS measurements and control voltages.
We denote the control voltages applied to the DM at a given time instance t by ṽ_t and the preprocessed PWFS measurements by w_t. We defined the set of actions to be the set of differential control voltages:

$$ a_t = \tilde v_t - \tilde v_{t-1}. \tag{5} $$

In adaptive optics, at each timestep t, we observe the wavefront sensor measurement w_t. We project the measurement into voltage space by utilizing the reconstruction matrix C. The observation is then given by the quantity:

$$ o_t = C w_t. \tag{6} $$

To represent each state, we concatenated previous observations and actions. That is,

$$ s_t = \left( o_t, o_{t-1}, \dots, o_{t-k}, a_{t-1}, a_{t-2}, \dots, a_{t-m} \right), \tag{7} $$

where we chose k = m (as in the typical pseudo-open-loop prediction). The state includes data from the previous m time steps and the reconstruction matrix C. Here the reconstruction matrix serves solely as a preprocessing step for WFS measurements. It speeds up the learning process by simplifying the convolutional NN (CNN) architecture (same-dimensional observations and actions). However, it does not directly connect the measurements to actions and, therefore, using it does not imply a sensitivity to misregistration (Nousiainen et al. 2021).
For a state-action pair, the reward was chosen as the negative squared norm of the residual voltages corresponding to the following measurement:

$$ r(s_t, a_t) = -\left\| \tilde o_{t+1} \right\|^2, \tag{8} $$

where õ_{t+1} was obtained from s̃_{t+1} ∼ p(·|s_t, a_t). This quantity is proportional to the observable part of the negative norm of the true residual wavefront. This reward function does not capture all error terms, such as aliasing and non-common path aberrations (NCPA), and hence the final contrast performance will always be limited by these. The aliasing errors could be mitigated with traditional means, e.g., by introducing a spatial filter (Poyneer & Macintosh 2004) or by oversampling the wavefront, that is, by using a WFS with finer sampling than the one provided by the DM. We also already alluded to the fact that minimizing the residual wavefront seen by the WFS does not necessarily minimize the residual halo in the science image because of NCPA between the two. PO4AO could treat NCPA by including science camera images in the state formulation, but these would have to be provided at the same cadence as the WFS data, which is usually not the case. Still, NCPA can be handled by PO4AO in the usual way by offsetting the WFS measurements by an amount determined by an auxiliary image processing algorithm (e.g., Give'on et al. 2007; Paul et al. 2013). Finally, the reward does not include an assumption on the time delay of the system, so the method learns to compensate for any delay and predict the wavefront.
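A minimal sketch of this reward, with the next residual-voltage observation õ_{t+1} represented by a plain numpy array:

```python
import numpy as np

def reward(o_next):
    """Negative squared norm of the next residual-voltage observation.
    Larger (closer to zero) is better; a perfectly flat residual gives 0."""
    return -float(np.sum(o_next ** 2))

assert reward(np.zeros(4)) == 0.0            # perfect correction
assert reward(np.array([1.0, -2.0])) == -5.0 # squared norm 1 + 4
```

Because this function is a smooth, differentiable function of the observation, its gradient with respect to the policy parameters can be backpropagated through the learned dynamics model, which is what the policy optimization below exploits.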

5.2. The dynamics model
An adaptive optics system exhibits strong spatial correlations in the observation and control spaces: neighboring actuators and WFS pixels close to each other are more correlated than those further apart, due to the steep negative slope of the turbulence temporal PSD (Fried 1990) and the frozen flow hypothesis. We employed a standard fully convolutional neural network (CNN), equipped with leaky rectified linear unit (LReLU; Maas et al. 2013) activation functions, that predicts the next wavefront sensor readout. The CNN should work well for our setup, with DM actuators and WFS subapertures aligned on a grid in a spatially homogeneous geometry.
In practice, the state is a 3D tensor (matrices stacked along the third dimension, that is, an (N × N × (k + m)) tensor), with the channel dimension corresponding to the number of previous observations (k) and actions (m) on the 2D DM actuator grid. See Fig. 3 for an illustration. The deterministic dynamics model p̂_ω(s_t, a_t) estimates the next state s_{t+1} given the previous state and action. The model parameters ω (i.e., the NN weights and biases) were trained by first running the policy π in the environment, that is, controlling the AO system with the policy, collecting tuples of (s_t, a_t, s_{t+1}) into a dataset D, and minimizing the squared difference between the true next observations and the predictions:

$$ \mathcal{L}(\omega) = \sum_{(s_t, a_t, s_{t+1}) \in \mathcal{D}} \left\| o_{t+1} - \hat o_{t+1} \right\|^2, \tag{9} $$

where o_{t+1} is obtained from the state s_{t+1} and ô_{t+1} is the observation predicted by p̂_ω(s_t, a_t). The optimization was done using the Adam algorithm (Kingma & Ba 2014). Again, we did not assume any integer time delay here; as the past actions are included in the state formulation, the model learns to compensate for it.
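To illustrate the supervised objective, the following sketch fits a linear dynamics model by gradient descent on the same squared-error loss; the real model is a CNN trained with Adam, so this is only a schematic stand-in with toy dimensions:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
A_true = 0.8 * np.eye(n)                 # unknown "dynamics" to be learned
S = rng.standard_normal((256, n))        # inputs: states with actions folded in
O = S @ A_true.T                         # targets: next observations o_{t+1}

A = np.zeros((n, n))                     # model parameters (the role of omega)
lr = 0.05
for _ in range(1000):                    # gradient descent on mean ||o - o_hat||^2
    grad = 2 * (S @ A.T - O).T @ S / len(S)
    A -= lr * grad

assert np.linalg.norm(A - A_true) < 1e-2
```

The structure is identical for the CNN case: forward-predict ô_{t+1}, form the squared error against the recorded o_{t+1}, and descend its gradient with respect to the model parameters.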
It is well known that model-based RL methods tend to unfavorably exploit an overfitted dynamics model during control (e.g., planning or policy optimization), especially in the early stages of training (Nagabandi et al. 2018). To discourage this behavior, we employed an ensemble of several models, each of which is trained on a different bootstrap dataset, that is, a resampled subset of the observations collected during training. In practice, this means that each model sees a different subset of observations, leading to different NN approximations. During policy training, predictions are averaged over the models (line 9 of Algorithm 1). See, for example, Chua et al. (2018) for a more detailed discussion of ensemble models.
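Bootstrap ensembling can be sketched with linear least-squares models standing in for the NN ensemble; each member is fit on a resampled-with-replacement subset and predictions are averaged:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(100)

models = []
for _ in range(5):                         # ensemble of 5 bootstrap fits
    idx = rng.integers(0, len(X), len(X))  # resample with replacement
    w, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    models.append(w)

def predict(x):
    """Average predictions over the ensemble, as done during policy training."""
    return np.mean([x @ w for w in models])

assert abs(predict(np.array([1.0, 0.0, 0.0])) - 1.0) < 0.1
```

Averaging smooths out the idiosyncratic errors of individual members, so the policy cannot latch onto a quirk of any single overfitted model.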

5.3. The policy model
Again, we employed a fully convolutional neural network as the policy, similar to the dynamics model. The input is a 3D tensor representing the state, and the output is a 2D tensor (a matrix) representing the actuator voltages. The WFS measurement is blind, or insensitive, to some shapes of the mirror, such as the well-known waffle mode and the actuators on the boundary. We ensured that we do not control these modes by projecting each set of control voltages onto the control space, that is, we reshaped the 2D output to a vector, multiplied it by a filter matrix, and then reshaped the output back to a 2D image. The full policy model π is given by

$$ \pi_\theta(s_t) = B^{\dagger} B\, F_\theta(s_t), \tag{10} $$

where B^† B projects the control voltages onto the control space defined by the K-L modes and F_θ is the standard fully convolutional NN, whose output is vectorized. Figure 3 gives a more detailed overview of the network architecture of F_θ.
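The projection step of the policy can be sketched as follows; a random matrix stands in for the mode matrix B, and the idempotence of the projector is what guarantees that projecting an already-filtered command changes nothing:

```python
import numpy as np

rng = np.random.default_rng(5)
N = 4                                       # actuator grid side (toy size)
n_modes = 10
B = rng.standard_normal((n_modes, N * N))   # voltages -> controlled modes
P = np.linalg.pinv(B) @ B                   # projector B^+ B onto control space

def project_action(u):
    """Reshape the 2D network output to a vector, project onto the
    controlled subspace, and reshape back to the actuator grid."""
    return (P @ u.ravel()).reshape(N, N)

u = rng.standard_normal((N, N))             # raw network output F_theta(s)
a = project_action(u)
# projecting twice changes nothing (P is idempotent), so filtered
# commands never excite the unsensed modes
assert np.allclose(project_action(a), a)
```

Anything outside the controlled subspace, e.g., a waffle-like pattern in the null space of B, is removed before the command reaches the DM.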

Policy optimization
Ideally, the policy $\pi_\theta(s_t)$ would be optimized based on the expected cumulative reward function (1). However, as we do not have access to the true dynamics model p, we must approximate it with the learned dynamics model $\hat{p}_\omega$. To stabilize this process, we introduced an extended time horizon H over which the performance was optimized. Let us define
$$\hat{r}_\omega(s_t, a_t) = -\left\| \tilde{o}_{t+1} \right\|_2^2,$$
where $\tilde{o}_{t+1}$ is obtained from $\tilde{s}_{t+1} = \hat{p}_\omega(s_t, a_t)$. This leads to the approximative policy optimization problem
$$\max_\theta \; \mathbb{E}\left[ \sum_{t=1}^{H} \hat{r}_\omega\!\left(s_t, \pi_\theta(s_t)\right) \right],$$
where H is the planning horizon, $s_1 = s$, and $s_{t+1} = \hat{p}_\omega(s_t, \pi_\theta(s_t))$.
Here the planning horizon H was chosen based on the properties of the AO system. More precisely, for AO control, the choice of the planning horizon H is driven by the system's control delay. In the case of a simple two-frame delay, no DM dynamic, and no noise, we would plan to minimize the observed wavefront sensor measurements two steps into the future, that is, we would implicitly predict the best control action by the DM at the time of the corresponding WFS measurement. However, the effective planning horizon is longer in the presence of DM dynamics and temporal jitter since the control voltage decisions are not entirely independent. The choice of the planning horizon is a compromise between two effects: too short a planning horizon jeopardizes the loop stability, and too long a planning horizon makes the method prone to overfitting. We used H = 4 frames in all our experiments (numerical and laboratory) as a reasonably well-working compromise. The policy π was optimized by sampling initial states from previously observed samples, computing actions for them, and using the dynamics model to simulate what would happen if we were to take those actions. We could then use the differentiable nature of both our models and the reward function to backpropagate through rewards computed at each timestep. More specifically, at each iteration, we sampled a batch of initial states s τ and computed the following H states using the dynamics model. We then had H rewards for each initial state, and we used the gradients of the sum of those rewards with respect to the policy parameters θ to improve the parameters. The full procedure of training the dynamics and the policy is given in Algorithm 1, where the while-loop (line 3) iterates over episodes and lines 6 -16 execute an update of policy via policy optimization.
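The H-step rollout and backpropagation described above can be sketched as follows. The dynamics network is a frozen random stand-in for the learned model, and the reward (negative squared norm of the predicted state) is an assumption made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
H, N, C = 4, 8, 6          # planning horizon, grid side, state channels (toy)

dynamics = nn.Conv2d(C + 1, C, 3, padding=1)   # frozen surrogate for p̂_ω
policy = nn.Conv2d(C, 1, 3, padding=1)         # π_θ (toy single layer)
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def policy_update(s0):
    """Roll H steps through the differentiable model, backprop the rewards."""
    s, total_reward = s0, 0.0
    for _ in range(H):
        a = policy(s)                              # a_t = π_θ(s_t)
        s = dynamics(torch.cat([s, a], dim=1))     # s_{t+1} = p̂_ω(s_t, a_t)
        total_reward = total_reward - (s ** 2).mean()  # reward from prediction
    loss = -total_reward                           # maximize cumulative reward
    opt.zero_grad()
    loss.backward()                                # gradients w.r.t. θ only
    opt.step()
    return loss.item()

s0 = torch.randn(16, C, N, N)                      # batch of sampled init states
losses = [policy_update(s0) for _ in range(50)]
```

Only the policy parameters are stepped; the dynamics model is held fixed during the policy update, mirroring the alternation in Algorithm 1.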

Setup description
We evaluate the performance of PO4AO by numerical simulations. We used the COMPASS package (Ferreira et al. 2018) to simulate an XAO system on an 8-m telescope employing a nonmodulated Pyramid WFS in low-noise (0 mag) and moderately high-noise (9 mag) conditions. For comparison, we also considered the theoretical case of an "ideal" wavefront sensor, where the wavefront reconstruction is simply a projection of the 2D turbulence screen onto the DM's influence functions.
We also include a simulation of a 40-meter telescope XAO system with a PWFS to confirm that PO4AO scales well with aperture size and XAO degrees of freedom. A comprehensive error analysis and fine-tuning are left for future work. In order to stabilize the performance of the integrator, we added a 2λ/D modulation to the PWFS. For all simulations, we modeled the atmospheric turbulence as a sum of three frozen-flow layers with von Kármán power spectra, combining to a Fried parameter r0 of 16 cm at a 500 nm wavelength. The complete set of simulation parameters is provided in Table 1.
We compare PO4AO against a well-tuned integrator and an instantaneous controller that is not affected by measurement noise or temporal error. For the Pyramid WFS, the instantaneous controller still propagates aliasing and the fitting error introduced by uncontrolled or high-spatial-frequency modes. For the idealized WFS, it acts as a spatial high-pass filter, instantaneously subtracting the turbulent phase projected onto the DM control space (DM fitting error only).
In particular, we chose the simulation setups to demonstrate the following key properties of the proposed method. Firstly, the method achieves the required real-time control speed while being quick to train. This property enables the controller to be trained just before the science operation and further updated during the operation. Consequently, the method is trained with the most relevant data and does not need to generalize to all possible conditions at once; furthermore, the method retains these properties with an ELT-scale instrument. Secondly, the method is a predictive controller that is robust to nonlinear wavefront sensing and photon noise. Thirdly, the method can cope with the optical gain effect of the pyramid sensor.

Algorithm setup
We chose the state s_t (in the MDP) to consist of the 15 latest observations and actions and set the CNNs (dynamics and policy) to have 3 layers with 32 filters each. For further details on these choices, see Sect. 6.3. The episode length was set to 500 frames.
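The state assembly can be sketched with rolling buffers; the grid size is a toy value and `push`/`state` are hypothetical helper names.

```python
import torch
from collections import deque

# The MDP state: the k latest WFS observations and the m latest actions,
# each an N×N map, stacked along the channel dimension (k = m = 15 as in
# the text; N is a toy grid size).
k = m = 15
N = 8
obs_hist = deque(maxlen=k)   # oldest entries fall out automatically
act_hist = deque(maxlen=m)

def push(obs, act):
    obs_hist.append(obs)
    act_hist.append(act)

def state():
    # (k + m, N, N) tensor fed to the dynamics and policy CNNs.
    return torch.stack(list(obs_hist) + list(act_hist), dim=0)

for _ in range(20):          # fill the buffers past capacity
    push(torch.randn(N, N), torch.randn(N, N))
s = state()
```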
Each simulation started with the calibration of the system and the derivation of the reconstruction matrix C and the K-L basis B; see Sect. 4. We note that the reconstruction matrix C serves solely as a filter that projects the WFS measurement to the control space. It does not have to match the actual registration of the DM and WFS (Nousiainen et al. 2021). In particular, the reconstruction matrix is measured around the null point during calibration and, hence, suffers from the optical gain effect (Korkiakoski et al. 2008). For the PWFS simulations, the K-L filter was set to include 85% of the total degrees of freedom; for ideal wavefront sensing, the filter matrix was the identity, that is, no filtering was included.

Table 2 (caption): Performance of 11 different 3-layer CNNs. All CNN models were trained from scratch with the same PO4AO parameters (see Table 1) and the VLT 0-mag simulation environment (see Sect. 6.1 and Table 1). The Strehl ratio and reward were calculated from the last 1000 steps of the experiment. The inference time was calculated for both VLT- and ELT-scale systems, while the training time after each episode was only calculated for the VLT-scale system due to computational limitations. The corresponding integrator performance (dominated by the fitting and temporal error) for the "VLT" simulation was 93.59 / -10 085 (Strehl/Reward).

For all conditions and instruments, we let the simulations run until the performance of PO4AO converged, that is, 46 000 frames (46 seconds in theoretical real-time) with an episode length of 500 frames. While the final contrast performance shown in Figs. 7b, 7c, 7d, and 8 is calculated from the last 1000 frames, we note that the correction performance surpasses the integrator performance very quickly, as shown in Figs. 5a, 5b, 5c, and 6. After each episode, as described in Sect. 5.4, we halted the simulations and updated the dynamics and policy models.
Given the shallow convolutional structure (3 layers and 32 filters per layer) of the NN models and our moderate hardware, the combined (dynamics and policy) training time after each episode was about 1.5 seconds for the VLT (and 7 seconds for the ELT with the same training hyperparameters). For a real-time implementation, training the NN models should be completed within the duration of an episode, that is, in 0.5 s (500 frames at 1 kHz). Given that we did not use the latest GPU hardware, and that an NN update could also be done at a slower rate than after each episode, it is conceivable that this small gap can be overcome, and a real-time implementation of PO4AO is already possible.

Warm-up
The dynamics model can also be trained with data obtained with a different controller, such as the integrator or random control. Therefore, to improve the stability of the learning process, we "warmed up" the policy by running the first ten episodes with the integrator plus added binary noise to develop a coarse understanding of the system dynamics:
$$a_t = a_t^{\mathrm{int}} + \sigma x,$$
where $a_t^{\mathrm{int}}$ is the integrator action, x is binary noise (−1 or 1 with equal probability), and σ ∈ [0, 1] is reduced linearly after each episode such that the first episode was run with high binary noise and the 10th episode with zero noise.
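The warm-up schedule can be sketched as follows, with `integrator_action` a placeholder for the real integrator command and the actuator count a toy value.

```python
import numpy as np

rng = np.random.default_rng(0)
n_act, n_warmup = 64, 10   # actuator count (toy) and warm-up episodes

def integrator_action(obs):
    # Placeholder for the real matrix-vector integrator command.
    return np.zeros(n_act)

def warmup_action(obs, episode, sigma0=1.0):
    """Integrator action plus binary noise decaying linearly over 10 episodes."""
    sigma = sigma0 * max(0.0, 1.0 - episode / (n_warmup - 1))
    x = rng.choice([-1.0, 1.0], size=n_act)   # binary (+/-1) noise
    return integrator_action(obs) + sigma * x

a0 = warmup_action(None, episode=0)   # first episode: full noise amplitude
a9 = warmup_action(None, episode=9)   # 10th episode: zero noise
```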

CNN design and MDP state definition
PO4AO includes two learned models: the policy and the dynamics model. This paper aims to introduce an optimization method, PO4AO, to train the policy (from scratch) such that it maximizes the expected reward. The algorithm works for all differentiable function classes, for example, neural networks. For simplicity, we chose to model the environment dynamics and the policy using generic 3-layer fully convolutional neural networks. While further research is needed to find the best possible architectures, we experimented with the number of convolutional filters per layer and the amount of past telemetry data by testing the algorithm in the "VLT" environment with different combinations; see Table 2. We chose the model CNN 2 as a compromise between overall performance, inference speed for the VLT and ELT, and training speed. The chosen model performed well in all simulations and provided inference and training fast enough that the training could be completed during a single episode. Full model architecture optimization is left for future work (see Sect. 8 for more details). The inference speed in Table 2 is the speed of the fully convolutional NN architecture inside the policy model (see Fig. 3). The total control time includes two standard MVMs (preprocessing to voltages + K-L filtering in the output layer) in addition to the inference time. The inference and training times were measured with PyTorch on an NVIDIA Quadro RTX 3000 GPU. Note that, given enough parallel computational power (e.g., a GPU), the inference time of a fully convolutional NN is determined more by the number of layers and filters (the same for the VLT and ELT) than by the size of the input image. We observe that for CNNs with fewer filters, the inference speed is very similar for the VLT and ELT cases, while for heavier CNNs, the inference speed differs more on the given hardware. The computational time of the MVMs naturally depends on the DoF.

Training
To evaluate the training speed of the method, we compare its learning curves (of which the first 5000 frames are obtained with the integrator + noise controller) to the baseline integrator performance under the same realization of turbulence and noise (see Figs. 5a, 5b, 5c, and 6). Since the simulations are computationally expensive, in the 40-meter telescope experiments we compare the performance of PO4AO only to the average integrator performance (see Fig. 6).
We plotted the training curves with respect to the total reward (the sum of normalized residual voltages computed from the WFS measurements) and the Strehl ratio side by side. The method tries to maximize the reward and, consequently, also maximizes the Strehl ratio. In all our simulations, the method achieves better performance than the integrator already after the integrator warm-up of 5000 frames (5 s on a real telescope), and the performance stabilizes at around 30 000 frames (30 s). Since the fully convolutional NN structure can capture and utilize the homogeneous structure of the turbulence, the number of data frames needed to train the VLT and ELT controllers is on the same scale. However, running the same number of gradient steps is computationally more expensive (although highly parallelizable) for the ELT-scale system.

Prediction and noise robustness
Here, we compare the fully converged PO4AO, the integrator, and ideal control in terms of raw PSF contrast. We ran each controller for 1000 frames, and the wavefront residuals for each controller were propagated through a perfect coronagraph (Cavarroc et al. 2006). The raw PSF contrast was calculated as the ratio between the peak intensity of the noncoronagraphic PSF and the post-coronagraphic intensity field. A nonpredictive control law suffers from the notorious wind-driven halo (WDH) (Cantalloube et al. 2018), that is, the butterfly-shaped contrast loss in the raw PSF contrast in Figs. 7a, 7b, 7c, and 8. Figure 7a assumes the ideal WFS, that is, the incoming phase is measured by a noiseless projection of the incoming phase onto the DM. Therefore, the ideal WFS eliminates aliasing and noise in the wavefront reconstruction process, leaving only temporal and fitting errors. Further, we can easily eliminate the temporal error in a simulation by directly subtracting the measured phase from the incoming phase. The "no noise, no temporal error" curve (black dashed) in Fig. 7a is therefore only limited by the ability of the DM to fit the incoming wavefront. The integrator with a 2-frame delay (blue curve) is then additionally limited by the temporal error. PO4AO (red curve) largely reduces the WDH by predicting the temporal evolution of the wavefront but does not fully recover the fitting error limit (black dashed). Figure 7a, therefore, demonstrates the ability of PO4AO to reduce the temporal error. Figure 7b replaces the ideal WFS with the nonmodulated PWFS, which is affected by aliasing and requires some filtering of badly seen K-L modes during the reconstruction. Therefore, the "no noise, no temporal error" contrast performance is worse than for the ideal WFS in Fig. 7a. The integrator with a 2-frame delay (blue curve) performs at a very similar contrast as in the ideal WFS case, so it is still limited mostly by the temporal error.
Again, PO4AO (red curve) lies about halfway between the integrator and the "no noise, no temporal error" controllers but performs at a reduced contrast compared to the ideal WFS case. Therefore, the PO4AO performance with the nonmodulated PWFS is affected by aliasing and reconstruction errors as well as the temporal error. Figure 7c adds a significant amount of measurement noise. While this obviously does not affect the "no noise, no temporal error" case, the contrast performance of both the integrator and PO4AO is strongly reduced and dominated by noise. Still, PO4AO outperforms the integrator, which demonstrates its resilience under noise-dominated conditions. Finally, Fig. 8 demonstrates that PO4AO maintains these properties in an ELT-scale simulation.
Unfortunately, a "black box" controller like PO4AO does not allow us to cleanly separate all individual terms in the error budget because the controller's behavior is, to some extent, driven by the error terms themselves. However, as discussed above, we explored the relative importance of the individual terms by switching them on and off in our numerical experiments.

Fig. 4: Sensitivity to the PWFS optical gain effect. The blue line corresponds to the ratio between the optical gain estimates at the two wavelengths. The red line is the ratio between the semi-analytically derived optical gains at the two wavelengths (see Sect. 4.1 and Fig. 2).

Robustness against data mismatch
So far, we have focused on static atmospheric conditions, and the size of the data set D was not limited, that is, it was "ever-growing." However, in reality, atmospheric conditions are constantly changing, creating a so-called data mismatch problem: the prevailing atmospheric conditions are slightly different from the conditions in which the model was trained. To ensure the method's robustness to data mismatch, we trained the model under very different conditions and then tested it with the original wind profile by plotting the raw PSF contrast averaged over 1000 frames. For training, we altered the wind by reducing the wind speed by 50 percent and adding 90-degree variations to the directions, that is, we altered the spatial and temporal statistics of the atmosphere. We do not show the corresponding training plot since it was very similar to Fig. 5b. The result of this experiment is shown in Fig. 7d. The integrator naturally has the same performance as before. PO4AO still delivers better contrast close to the guide star but suffers from a pronounced WDH farther from the guide star. Most importantly, PO4AO is robust and maintains acceptable performance even under heavy data mismatch, which could occur in the unlikely case that atmospheric conditions drastically change from one episode to the next, that is, on a timescale of seconds. Moreover, PO4AO with a limited data set size (with old, irrelevant data removed) would adapt to such a change and recover the performance within the typical training times discussed in the previous paragraph.
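The size-limited data set mentioned above can be sketched as a fixed-capacity buffer that evicts the oldest transitions; the capacity here is a toy value.

```python
from collections import deque

# Fixed-capacity replay dataset: keeping only the most recent transitions
# lets the models track changing atmospheric conditions by discarding old,
# no-longer-relevant data.
capacity = 1000
dataset = deque(maxlen=capacity)

def append_transition(s, a, s_next):
    # Oldest transition is evicted automatically once capacity is reached.
    dataset.append((s, a, s_next))

for t in range(1500):
    append_transition(t, t, t + 1)   # toy transitions tagged by timestep
oldest = dataset[0][0]               # the 500 earliest transitions are gone
```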

Sensitivity to the PWFS optical gain effect
PO4AO uses convolutional NNs and is, therefore, a nonlinear method. The prospect is that it can adapt to nonlinearities in the system, such as the optical gain effect observed for the Pyramid WFS. To examine this property, we ran the following experiment. We controlled the nonmodulated PWFS with PO4AO at 850 nm and 600 nm and recorded the policy after training. Then we controlled the PWFS with the integrator and recorded, in parallel, the actions PO4AO would have taken. The integrator control results in a correction performance similar to the Strehl ratios derived by the semi-analytical model (Fig. 2). At the shorter wavelength, the PWFS sees larger residual wavefront errors (in radians) and a stronger effect on the optical gains. However, if the controller can cope with such an effect, which we would expect for PO4AO, the suggested actions should counteract the dampened measurement. To validate this, we compare the ratios between the standard deviation of the observations (PWFS measurements) and the standard deviation of the suggested PO4AO actions. We define an estimate of the optical gain compensation,
$$\hat{g}_{\mathrm{OG}}(\lambda) = \frac{\mathrm{std}\!\left(a^{\lambda}_{\mathrm{po4ao}}\right)}{\mathrm{std}\!\left(o^{\lambda}_{\mathrm{int}}\right)},$$
where std is the temporal standard deviation, $o^{\lambda}_{\mathrm{int}}$ are the observations recorded while running the integrator, λ is the observing wavelength, and $a^{\lambda}_{\mathrm{po4ao}}$ are the actions suggested by PO4AO. As PO4AO is a predictive control method, this quantity also includes the effect of prediction, that is, compensation for the temporal error as well. However, we can approximately cancel out the temporal error by comparing the ratio between the optical gain estimates obtained at the two wavelengths. The result of this experiment is shown in Fig. 4. We see that the empirical estimate of the optical gain sensitivity of PO4AO roughly follows the corresponding ratio of the two semi-analytically derived curves plotted in Fig. 2. In particular, we see that the lower-order modes are compensated more than the higher-order modes.
We, therefore, conclude that PO4AO adequately compensates for the optical gain effect of the PWFS.
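The estimate above can be computed from recorded telemetry as a modewise ratio of temporal standard deviations; the arrays below are synthetic stand-ins for real telemetry of shape (timesteps, modes).

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_modes = 1000, 50   # toy telemetry length and mode count

def og_estimate(actions, observations):
    """Modewise ratio: std of suggested actions over std of observations."""
    return actions.std(axis=0) / observations.std(axis=0)

# Synthetic telemetry: the 600 nm actions are scaled up more than the
# 850 nm ones, mimicking stronger optical gain compensation.
obs_850 = rng.normal(0, 1.0, (T, n_modes))
act_850 = rng.normal(0, 1.5, (T, n_modes))
obs_600 = rng.normal(0, 1.0, (T, n_modes))
act_600 = rng.normal(0, 2.5, (T, n_modes))

# Taking the ratio between the two wavelengths approximately cancels the
# (shared) predictive contribution, isolating the optical gain response.
ratio = og_estimate(act_600, obs_600) / og_estimate(act_850, obs_850)
```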

Magellan adaptive optics extreme system
In addition to the numerical simulations presented in the previous section, we also implemented PO4AO on the MagAO-X instrument. MagAO-X is an experimental coronagraphic extreme adaptive optics system that uses a woofer-tweeter architecture (an ALPAO-97 DM as the woofer and a Boston Micromachines 2K as the tweeter). We used a point source in the f/11 input focus to illuminate the DMs, the Pyramid WFS, and the scientific camera. Further, we placed a classical Lyot coronagraph with a 2.5 λ/D Lyot mask radius in front of the science camera. We set the PWFS's modulation radius to 3λ/D, and the brightness of the guide star was adjusted to match the flux per frame that a 0th-magnitude star would provide in 1 ms (i.e., for a system running at 1 kHz). We used a test setup similar to Haffert et al. (2021b) and ran our experiment by controlling only the woofer DM and injecting disturbances by running simulated phase screens across it. The phase screens were simulated as single-layer frozen-flow turbulence with an r0 of 16 cm at 500 nm. We experimented with three different single-layer wind profiles, 5 m/s, 15 m/s, and 30 m/s, where the wind speeds again correspond to a 1 kHz framerate.
The PO4AO is implemented with PyTorch and utilizes the Python interface of the MagAO-X RTC to pass data from CPU to GPU memory, do the PO4AO calculations on the GPU, and transfer them back. The data transfer takes time and limits the achievable framerate in this setup to 100 Hz. RTC software that would run entirely on GPUs would not suffer from this limitation.

The integrator
To retrieve the interaction matrix, we used the standard calibration process described in Sect. 4. From the interaction matrix A, we derived the reconstruction matrix by Tikhonov regularization,
$$C = \left(A^{\top} A + \alpha I\right)^{-1} A^{\top},$$
where α is tuned manually. We also tuned the integrator gain manually for each wind profile.
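A minimal sketch of this reconstructor, assuming a toy interaction matrix A and problem size:

```python
import numpy as np

rng = np.random.default_rng(0)
n_meas, n_act = 120, 40                     # measurements and actuators (toy)
A = rng.normal(size=(n_meas, n_act))        # interaction matrix stand-in
alpha = 1e-2                                # manually tuned regularization

# Tikhonov-regularized reconstructor: C = (AᵀA + αI)⁻¹Aᵀ.
C = np.linalg.solve(A.T @ A + alpha * np.eye(n_act), A.T)

# Reconstruct actuator commands from a (noiseless) measurement.
true_cmd = rng.normal(size=n_act)
meas = A @ true_cmd
recon = C @ meas
```

With a well-conditioned interaction matrix and small α, the reconstruction is close to the true command; α mainly damps the poorly sensed modes.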

Policy optimization for adaptive optics
The structure of the MagAO-X experiment is similar to our numerical simulations. First, we trained PO4AO for 50 episodes (25 000 frames) and then ran it for an additional 5000 frames to compare the post-coronagraphic PSFs. We also used the 10-episode warm-up with the noisy integrator and the same NN architectures. Given the low number of actuators and the high-order PWFS, we set the amount of past telemetry data (k and m) to 10, and instead of filtering 20% of the K-L modes, for maximum performance, we only filtered the piston mode in the policy output (see Fig. 3).

Results
We compare the performance of PO4AO to the integrator in two ways: by looking at the training curves (see Fig. 9a) and by comparing the post-coronagraphic speckle variance (see Fig. 9c). PO4AO achieves better performance than the integrator in all wind conditions after 10k data frames (10 s in theoretical real-time). The reward is proportional to the mean RMS of the reconstructed wavefront. We further examine the performance by comparing the post-coronagraphic images with the 30 m/s wind profile; see Figs. 9b and 9c. The residual intensities of the images (see Fig. 9b) are limited by NCPA. Therefore, instead of comparing the raw PSF contrast, we compare the temporal speckle variance (see Fig. 9c). We see a factor of 3-7 improvement in the speckle variance at 2.4-6λ/D. Given the inner working angle of the coronagraph and the DM's control radius, that is where we would expect to see the improvement. Further, these results are in line with the results from the numerical simulations.

Discussion
In conclusion, reinforcement learning is a promising approach for AO control that could be implemented in on-sky systems with already existing hardware. The algorithm we propose requires only a small amount of training data and maintains acceptable performance even when the training conditions differ heavily from those at test time. Further, it has a high inference speed and is capable of scaling to high-order instruments with over 10k actuators. Thanks to the use of relatively shallow convolutional NNs, the inference time is just 300 µs on a modern laptop GPU. The inference time is also similar for an ELT-scale system with more than 10k actuators and for a VLT-scale system with "just" 1400 actuators.
The method was tested in numerical simulations and a lab setup and provides significantly improved post-coronagraphic contrast in both cases compared to the integrator. It is entirely data-driven, and in addition to predictive control, it can cope with modeling errors such as the optical gain effect and highly nonlinear wavefront sensing. Due to the constantly self-calibrating nature of the algorithm, it could turn AO control into a turnkey operation, where the algorithm maintains itself entirely automatically.

Fig. 5: Training plots for the 8-meter telescope experiments. Panel a is for the ideal wavefront sensor, panel b for the 0th-magnitude guide star, and panel c for the 9th-magnitude guide star. The red lines correspond to the performance of PO4AO during each episode and the blue lines to the integrator. The gray dashed line marks the end of the integrator warm-up for PO4AO. In all cases, PO4AO outperforms the integrator already after the warm-up period, in both the Strehl ratio and the rewards. An optimized implementation of PO4AO could run the training in parallel to control, and the training time would then be included in the plot (see Sect. 6.2).

Fig. 7: Radial averages over the image. The blue lines are for the integrator and the red for PO4AO. The raw PSF contrast was computed over the 1000 frames of the experiment. Panel a: performance of PO4AO with the ideal WFS; PO4AO delivers a factor of 20-90 improvement inside the AO control radius compared to the well-tuned integrator. Panel b: performance of PO4AO on a 0th-mag guide star and a nonmodulated PWFS; PO4AO delivers a factor of 4-7 better contrast inside the AO control radius. Panel c: performance of PO4AO on a 9th-mag guide star; we see a factor of 3-9 improvement in the raw PSF contrast. Panel d: performance of PO4AO under heavy data mismatch; PO4AO was trained with drastically different wind conditions but still delivers better contrast at small angular separations.
We showed that our method is robust to heavy data mismatch, but the performance is reduced for a short time while PO4AO adapts to the evolution of external conditions. Such abrupt changes in wind conditions will rarely occur in the real atmosphere. Therefore, future work should also address maintaining the best possible performance under reasonably varying turbulence. The model learns on a timescale of several seconds and can presumably adapt to changing atmospheric conditions on the same timescale. However, more research on the trade-off between model complexity and training speed is still needed. For example, a deeper NN model could generalize better to unseen conditions, while shallower NN models could learn new, unseen conditions faster. Currently, the CNN model architectures themselves are not thoroughly optimized, and an exciting research topic would be to find the optimal CNN design to capture the AO control system dynamics for model-based RL. For example, U-net-type CNN architectures (Ronneberger et al. 2015) and mixed-scale dense CNNs (Pelt & Sethian 2018) have shown excellent performance in imaging-related applications. On the other hand, we could utilize NN structures similar to those that have shown excellent performance in pure predictive control (Swanson et al. 2018, 2021). Such a study should consider a variety of different, preferably realistically changing, atmospheric conditions and misalignments, as well as prerecorded on-sky data. As a caveat, the algorithm, like most deep RL methods, is somewhat sensitive to the choice of hyperparameters (e.g., the number of layers in the neural networks, the learning rates, etc.). Moreover, control via deep learning is hard to analyze, and no stability bounds can be established.
Further development is needed to move from the laboratory to the sky. The method currently runs through a Python interface that has to pass data via the CPU on MagAO-X. To increase the speed of the implementation and the maximum framerate, we must switch to a lower-level implementation that runs both the real-time pipeline and the PO4AO control on the GPU using the same memory banks. In addition, the training procedure needs to run in parallel with the inference, which should be straightforward to implement.
To summarize, this work presents a significant step forward for XAO control with RL. It will allow us to increase the S/N, detect fainter exoplanets, and reduce the time it takes to observe them on ground-based telescopes. As astronomical telescopes become larger and larger, the choice of the AO control method becomes critically important, and data-driven solutions are a promising direction in this line of work. Deep learning and RL methods are transforming many fields, such as protein folding, inverse problems, and robotics, and there is potential for the same to happen for direct exoplanet imaging.