Issue 
A&A
Volume 664, August 2022



Article Number  A71  
Number of page(s)  15  
Section  Astronomical instrumentation  
DOI  https://doi.org/10.1051/0004-6361/202243311  
Published online  09 August 2022 
Toward on-sky adaptive optics control using reinforcement learning
Model-based policy optimization for adaptive optics
^{1}
Lappeenranta–Lahti University of Technology,
Lappeenranta, Finland
email: jalo.nousiainen@lut.fi
^{2}
University of Helsinki, Department of Computer Science,
Helsinki, Finland
^{3}
European Southern Observatory,
Garching bei München, Germany
^{4}
University of Arizona, Steward Observatory,
Tucson,
Arizona, USA
^{5}
Wyant College of Optical Science, University of Arizona,
1630 E University Blvd,
Tucson,
AZ 85719
USA
^{6}
Astrobiology Center, National Institutes of Natural Sciences,
2211 Osawa, Mitaka,
Tokyo, JAPAN
^{7}
National Astronomical Observatory of Japan, Subaru Telescope, National Institutes of Natural Sciences,
Hilo,
HI 96720
USA
^{8}
Kirtland Air Force Base, Air Force Research Laboratory,
Albuquerque, NM, USA
Received: 11 February 2022
Accepted: 25 May 2022
Context. The direct imaging of potentially habitable exoplanets is one prime science case for the next generation of high-contrast imaging instruments on ground-based, extremely large telescopes. To reach this demanding science goal, the instruments are equipped with eXtreme Adaptive Optics (XAO) systems which will control thousands of actuators at a framerate of kilohertz to several kilohertz. Most of the habitable exoplanets are located at small angular separations from their host stars, where the current control laws of XAO systems leave strong residuals.
Aims. Current AO control strategies such as static matrix-based wavefront reconstruction and integrator control suffer from a temporal delay error and are sensitive to misregistration, that is, to dynamic variations of the control system geometry. We aim to produce control methods that cope with these limitations, provide a significantly improved AO correction, and, therefore, reduce the residual flux in the coronagraphic point spread function (PSF).
Methods. We extend previous work in reinforcement learning for AO. The improved method, called Policy Optimization for Adaptive Optics (PO4AO), learns a dynamics model and optimizes a control neural network, called a policy. We introduce the method and study it through numerical simulations of XAO with a pyramid wavefront sensor (PWFS) for the 8-m and 40-m telescope aperture cases. We further implemented PO4AO and carried out experiments in a laboratory environment using the Magellan Adaptive Optics eXtreme system (MagAO-X) at the Steward laboratory.
Results. PO4AO provides the desired performance by improving the coronagraphic contrast in numerical simulations by factors of 3–5 within the control region of the deformable mirror and PWFS, both in simulation and in the laboratory. The presented method is also quick to train, that is, on timescales of typically 5–10 s, and the inference time is sufficiently small (<ms) to be used in real-time control for XAO with currently available hardware even for extremely large telescopes.
Key words: instrumentation: high angular resolution / instrumentation: adaptive optics / atmospheric effects / methods: data analysis / techniques: high angular resolution / methods: numerical
© J. Nousiainen et al. 2022
Open Access article, published by EDP Sciences, under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
This article is published in open access under the Subscribe-to-Open model. Subscribe to A&A to support open access publication.
1 Introduction
The study of extrasolar planets (exoplanets) and exoplanetary systems is one of the most rapidly developing fields of modern astrophysics. More than 3000 confirmed exoplanets have been identified mainly through indirect methods by NASA’s Kepler mission^{1}. High-contrast imaging (HCI) detections are mostly limited to about a dozen very young and luminous giant exoplanets (e.g., Marois et al. 2010; Lagrange et al. 2009; Macintosh et al. 2015) due to the challenging contrast requirements at a fraction of an arcsecond angular distance from the star, which could be a billion times brighter than the exoplanet.
High-contrast imaging aims to separate the exoplanet light from stellar light optically, thereby dramatically increasing the signal-to-noise ratio (S/N) over the one provided by indirect methods. However, significant advances in HCI technology are needed to address two major scientific questions: the architectures of outer planetary systems, which remain essentially unexplored (e.g., Dressing & Charbonneau 2015; Fernandes et al. 2019); and the atmospheric composition of small exoplanets outside the solar system, which is especially interesting because it addresses the question of habitability and life in the universe.
For ground-based observations, HCI combines eXtreme Adaptive Optics (XAO, e.g., Guyon 2005, 2018) and coronagraphy (Mawet et al. 2012) with a way to distinguish stellar quasi-static speckles (QSS) produced by imperfect instrument optics from the exoplanet, such as spectral and angular differential imaging (Marois et al. 2004, 2006) or high-dispersion spectroscopy (Snellen et al. 2015). With an optimized instrument design, the XAO residual halo may be the dominant source of noise (Otten et al. 2021). Therefore, minimizing the XAO residuals is a key objective for ground-based HCI.
Adaptive optics systems typically run in a closed-loop configuration, where the wavefront sensor (WFS) measures the wavefront distortions after deformable mirror (DM) correction. The objective of this control loop is to minimize the distortions in the measured wavefront, that is, the residual wavefront, which, in theory, corresponds to minimizing the speckle intensity in the post-coronagraphic image. In the case of a widely used integrator controller, temporal delay error and photon noise usually dominate the wavefront error budget in the spatial frequency regime controlled by the DM (Guyon 2005; Fusco et al. 2006). A big part of the turbulence is presumably in frozen flow considering the millisecond timescale of AO control, and hence a significant fraction of wavefront disturbances can be predicted (Poyneer et al. 2009). Therefore, control methods that use past telemetry data have shown a significant potential for reducing the temporal error and photon noise (Males & Guyon 2018; Guyon & Males 2017; Correia et al. 2020). Further, real systems suffer from dynamic modeling errors such as misregistration (Heritier et al. 2018), the optical gain effect for the pyramid WFS (Korkiakoski et al. 2008; Deo et al. 2019), and temporal jitter (Poyneer & Véran 2008). Combined, these errors lead to a need for external tuning and recalibration of a standard pseudo-open-loop predictive controller to ensure robustness.
An up-and-coming field of research aimed at improving AO control methods is the application of fully data-driven control methods, where the control voltages are separately added to the learned control model (Nousiainen et al. 2021; Landman et al. 2020, 2021; Haffert et al. 2021a,b; Pou et al. 2022). A significant benefit of fully data-driven control in closed loop is that it does not require an estimate of the system’s open-loop temporal evolution and that it is, therefore, insensitive to pseudo-open-loop reconstruction errors, such as the optical gain effect (Haffert et al. 2021a). In particular, reinforcement learning (RL) has also been shown to cope with temporal and misregistration errors (Nousiainen et al. 2021). RL is an active branch of machine learning that learns a control task via interaction with the environment. The principal idea is to let the method feed actions to the environment, observe the outcome, and then improve the control strategy regarding the long-term reward. The reward is a predefined function giving a concrete measure of the method’s performance. By learning this way, RL methods do not require accurate models of the components in the control loop and, hence, can be viewed as an automated approach for control.
Previous work in RL-based adaptive optics control has focused on either controlling DM modes using model-free methods that learn a policy π_{θ} : s_{t} → a_{t} parameterized by θ that maps states s_{t} (or observations) into actions a_{t} directly (Landman et al. 2020, 2021; Pou et al. 2022), or using model-based methods that employ a planning step to compute actions (Nousiainen et al. 2021). The model-free methods have the advantage of being fast to evaluate, as the learned policies are often neural networks that support sub-millisecond inference. However, they suffer from the large space of actions resulting from the number of actuators that need to be controlled in adaptive optics systems – learning to control each actuator simultaneously with a model-free method is difficult. On the other hand, model-based RL approaches benefit from being simple to train using even off-policy data, that is, data obtained while using a different (e.g., classical integrator) control method. A model-based method may only need hundreds of iterations, while a model-free algorithm such as a policy gradient method may need millions of iterations (Janner et al. 2019). However, the planning step of model-based RL is often iterative and could, therefore, be too slow for AO control, even with expensive hardware (Nousiainen et al. 2021).
In this paper, we unify the approaches described above by learning a dynamics model and using the model to train a policy that is fast to evaluate and scales to control all actuators in a system. We call this hybrid algorithm Policy Optimization for Adaptive Optics (PO4AO). We do this by employing an end-to-end convolutional architecture for the policy, leveraging the differentiable nature of the chosen reward function, and directly backpropagating through trajectories sampled from the model. Our method scales to sub-millisecond inference, and we present promising results in both a large pyramid-sensor-based simulation and a laboratory setup using the Magellan Adaptive Optics eXtreme system (MagAO-X, Males et al. 2018), where our method is trained from scratch using interaction.
2 Related Work
The adaptive optics control problem differs from the typical control problems considered by modern RL research. The main challenges of AO control are twofold. First, the control space is substantially larger than in classical RL literature and is typically parameterized by 500–10000 degrees of freedom (DoF). Second, the state of the system is observed through an indirect measurement, where the related inverse problem is not well-posed. On the bright side, it has been observed in the literature that simple differentiable reward functions with a relatively short time horizon can lead to good performance (Nousiainen et al. 2021).
Recently, progress has been made toward full reinforcement-learning-based adaptive optics control. Landman et al. (2020) use the model-free recurrent deterministic policy gradient algorithm to control the tip and tilt modes of a DM and a variation of the method to control a high-order mirror in the special case of ideal wavefront sensing. Pou et al. (2022) implemented a model-free multi-agent approach to control a 40 × 40 Shack-Hartmann-based AO system and analyzed the robustness against noise and variable atmospheric conditions. On the other hand, Nousiainen et al. (2021) present a model-based solution that learns a dynamics model of the environment and uses it with a planning algorithm to decide the control voltages at each timestep. This method shows good performance but requires heavy computation at each control loop iteration, which will be a problem in future generations of instruments with more actuators per DM. PO4AO aims for the best of both worlds: it requires only a small amount of training data and has a high inference speed, capable of scaling to modern telescopes. Further, we analyze the performance of our method in different noise levels and varied wind conditions combined with nonlinear wavefront sensing.
In RL terms, model-based policy optimization is an active area of research. Work that tackles the full reinforcement learning problem without assuming a known reward function includes Heess et al. (2015) and Janner et al. (2019). In contrast, PILCO and the subsequent deep PILCO (Deisenroth & Rasmussen 2011; Gal et al. 2016) are methods that directly backpropagate through rewards. Our method is similar to deep PILCO in the sense that it learns a neural network policy from trajectories sampled from a neural network dynamics model.
In addition, significant progress has also been made in AO control methods outside RL and fully data-driven algorithms. Linear-quadratic-Gaussian (LQG) control-based methods have been studied in Kulcsár et al. (2006); Paschall & Anderson (1993); Gray & Le Roux (2012); Conan et al. (2011); Correia et al. (2010a,b, 2017), sometimes combined with machine learning for system identification (Sinquin et al. 2020). Predictive controllers have been studied in Guyon & Males (2017); Poyneer et al. (2007); Dessenne et al. (1998); van Kooten et al. (2017, 2019). Methods vary from linear filters to filters operating on single modes (such as Fourier modes) to neural network approaches (Swanson et al. 2018; Sun et al. 2017; Liu et al. 2019; Wong et al. 2021). Predictive control methods have also been studied in a closed-loop configuration. Males & Guyon (2018) address the impact of closed-loop predictive control on the post-coronagraphic contrast with a semi-analytic framework. Swanson et al. (2021) studied closed-loop predictive control with NNs via supervised learning, where an NN is trained to compensate for the temporal error.
Finally, other RL-based methods have been developed for different types of AO. In order to mitigate alignment errors in calibration, a deep-learning control model was proposed in Xu et al. (2019). A model-free RL method for wavefront-sensorless AO was studied in Ke et al. (2019). The method is shown to provide faster correction speed than a baseline method assuming a relatively low-order AO system, while our work focuses on the case of XAO for HCI.
3 Reinforcement Learning Applied to Adaptive Optics
Since we introduce a novel approach (RL) to the field of AO, we present hereafter some of the standard notations and terms used in RL. The de facto mathematical framework for modeling sequential decision problems in the field of RL is the “Markov Decision Process” (MDP). An MDP is a discrete-time stochastic process which, at time step t, is in a “state” S_{t} ∈ S, where S is the set of all possible states. A “decision-maker” then takes an “action” a_{t} ∈ A (again, A is the set of possible actions) based on the current state, and the “environment” changes to the next state S_{t+1}. As the transition dynamics (a_{t}, S_{t}) → S_{t+1} is random in nature (influenced, e.g., by the turbulence evolution), it is represented here by the conditional probability density function p(S_{t+1} | S_{t}, a_{t})^{2}. At each timestep a “reward” R_{t} = r(S_{t}, a_{t}) is also observed, which is a (possibly stochastic) function of the current state and action. The modeler usually designs the reward to make the decision-maker produce some favorable behavior (e.g., correcting for turbulence distortions).
The actions our decision-maker takes are determined by a “policy” π_{θ} : S_{t} → a_{t}, which is a function that maps states into actions. For example, the matrix-vector multiplier (MVM) can be viewed as a policy, taking a wavefront sensor measurement as input and outputting the control voltages. The objective of reinforcement learning is to find a policy such that
$$ \theta^{*} = \arg\max_{\theta} \, \mathbb{E}\left[ \sum_{t=0}^{T} r(S_{t}, a_{t}) \right] \qquad (1) $$
with the initial distribution S_{0} ~ p_{0} and the convention π_{θ}(S_{–1}) = a_{0} for fixed initial DM commands a_{0}. In particular, we focus here on parametric models of π_{θ}, where θ is the set of parameters of the policy, for example, the weights and biases of a neural network. That is, given that the actions are given by π_{θ}, we wish to find the parameters θ that maximize the expected cumulative reward the decision-maker receives. Here T is the maximum length of an episode, that is, of a single run of the algorithm in the environment.
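As a minimal sketch of this objective, the expected cumulative reward of a given policy can be estimated by averaging the summed rewards over several simulated episodes. The scalar toy environment, the proportional policy, and all function names below are illustrative assumptions, not part of PO4AO. For simplicity, the sketch lets the action depend on the current state and ignores the loop delay.

```python
import random

def rollout_return(policy, step, reward, s0, horizon):
    """Sum of the rewards collected over one episode of `horizon` steps."""
    state, total = s0, 0.0
    for _ in range(horizon):
        action = policy(state)
        total += reward(state, action)
        state = step(state, action)
    return total

def expected_return(policy, step, reward, sample_s0, horizon, episodes=200):
    """Monte Carlo estimate of the expected cumulative reward of a policy."""
    returns = [rollout_return(policy, step, reward, sample_s0(), horizon)
               for _ in range(episodes)]
    return sum(returns) / len(returns)

# Hypothetical 1-D environment: the state is a residual that the action shifts.
rng = random.Random(0)

def toy_step(state, action):
    return state + action + rng.gauss(0.0, 0.01)

def toy_reward(state, action):
    return -(state + action) ** 2

def sample_s0():
    return rng.uniform(-1.0, 1.0)

def make_policy(gain):
    """Proportional policy a = -gain * s; `gain` plays the role of theta."""
    return lambda s: -gain * s

j_weak = expected_return(make_policy(0.1), toy_step, toy_reward, sample_s0, horizon=20)
j_strong = expected_return(make_policy(0.9), toy_step, toy_reward, sample_s0, horizon=20)
```

In this toy setting the policy with the larger gain removes the residual faster and therefore obtains the higher estimated return, which is the quantity that Eq. (1) maximizes over θ.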
The transition dynamics is usually not known in adaptive optics control: it includes a multitude of unknowns, including the atmospheric turbulence, the dynamics of the WFS and DM, and the jitter in the computational delay. In order to solve Eq. (1) efficiently, model-based RL algorithms estimate the true dynamics model p(S_{t+1} | S_{t}, a_{t}) in Eq. (1) by an approximate model p_{ω}. Model-free methods, in turn, learn only a policy; they do not attempt to model the environment.
The standard MDP formulation assumes that all information about the environment is contained in the state S_{t}. This is not the case in many real-world domains, such as adaptive optics control. A more refined formulation is then the “partially observed” MDP or POMDP, where the decision-maker observes o_{t}, which is some subset or function of the true underlying state. The Markov property, that is, the assumption that the next state depends only on the previous state and action, does not necessarily apply to the observations in a POMDP. This work uses the standard method of having our state representation include a small number of past observations (WFS measurements) and actions (control voltages) to deal with this issue. This allows the policy to use knowledge of past actions to predict the next action. The exact form of the observations o_{t} and the full state S_{t} for adaptive optics control will be given in Sect. 5.1.
Finally, it is common in RL to use reward functions that are not differentiable (such as 1 for winning a game, 0 otherwise) or functions that do not depend directly on the state. In high-contrast imaging, we would like to minimize the speckle intensity in the post-coronagraphic PSF. However, this can be difficult to estimate at the high frequencies of modern HCI instruments. We discuss the specific choices in this regard in Sect. 5.1.
4 Adaptive Optics Control
This section introduces AO control aspects that are relevant to our work. First, we introduce the AO system components and then outline a standard control law called the integrator and the related calibration process. An overview of the AO control loop is given in Fig. 1: the incoming light at timestep t is corrected by the DM. Next, the WFS measures the DM-corrected residual wavefront. After receiving the wavefront sensor measurement, the control computer calculates a set of control voltages and sends the commands to the DM.
Further, the AO control loop inherits a temporal delay. The delay consists of a measurement delay introduced by the WFS integration and a control delay consisting of WFS readout, computation of the correction signal by the control algorithm, and its application to the DM. These add up to a total delay of at least twice the operating frametime of the AO system (Madec 1999).
Fig. 1 Overview of the AO control loop and the performance of PO4AO. The method, PO4AO, feeds actions to the environment, observes the outcome, and then improves the control regarding the reward. Starting from a random behavior at first (frame 0), the method learns a predictive control strategy in only 5000 frames of interaction. 
4.1 Pyramid Wavefront Sensor for Adaptive Optics
The function of the WFS is to measure the spatial shape of the residual phase of a wavefront. There are several different types of WFSs, but in this work, we focus on the so-called pyramid WFS (PWFS), which is a mature concept providing excellent performance for HCI (Guyon 2005). In the following, we give a short description of the PWFS.
The PWFS can be viewed as a generalization of the Foucault knife-edge test (Ragazzoni 1996). In pyramid wavefront sensing, the electric field of the incoming wavefront is directed to a transparent four-sided pyramid prism. The prism is located in the focal plane of an optical system and, hence, can be modeled as a spatial Fourier filter that introduces specific phase changes according to the shape of the prism (Fauvarque et al. 2017). This four-sided pyramid divides the incoming light into four different directions, and most of the light is propagated to four intensity images on the PWFS detector. Due to the slightly different optical paths of the light, the intensity fields differ from each other. These differences are then used as the data for recovering the disturbances in the incoming phase screen.
Commonly, pyramid data, that is, the intensity fields, are processed to so-called slopes w_{x}, w_{y} that correlate positively with the actual gradient fields of the phase screen. In this paper, we follow the approach of Vérinaud (2004), where the slopes are normalized with the global intensity. In practice, we receive a vector w that is a collection of the measurements w_{x}, w_{y} at all possible locations x, y.
Both modulated and non-modulated pyramid sensor observations are connected to the incoming wavefront via a nonlinear mathematical model. This study considers non-modulated PWFSs, where the nonlinearity is stronger, but the sensitivity is better at all spatial frequencies (Guyon 2005). Currently, most wavefront reconstruction algorithms utilize a linearization of this model, inducing a trade-off between sensitivity and robustness (modulated PWFS vs. non-modulated PWFS). Machine learning techniques have the potential to overcome this trade-off and increase PWFS performance without a decisive robustness penalty.
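The slope computation can be illustrated with a short sketch, assuming the four pupil images have already been cut out of the detector frame as equally sized 2-D lists. The quadrant layout, the sign convention, and the function names are assumptions made for illustration. Only the normalization with the global intensity is taken from the text.

```python
def pyramid_slopes(i1, i2, i3, i4):
    """Return (w_x, w_y) slope maps from four pupil images given as 2-D lists.

    Assumed layout: i1 top-left, i2 top-right, i3 bottom-left, i4 bottom-right.
    """
    rows, cols = len(i1), len(i1[0])
    # Global intensity: mean over all pixels of the summed four images.
    total = sum(i1[r][c] + i2[r][c] + i3[r][c] + i4[r][c]
                for r in range(rows) for c in range(cols))
    global_intensity = total / (rows * cols)
    # Right minus left images give the x slope, top minus bottom the y slope.
    w_x = [[(i2[r][c] + i4[r][c] - i1[r][c] - i3[r][c]) / global_intensity
            for c in range(cols)] for r in range(rows)]
    w_y = [[(i1[r][c] + i2[r][c] - i3[r][c] - i4[r][c]) / global_intensity
            for c in range(cols)] for r in range(rows)]
    return w_x, w_y

def flatten_slopes(w_x, w_y):
    """Stack both slope maps into the single measurement vector w."""
    return [v for row in w_x for v in row] + [v for row in w_y for v in row]

# Toy frames: the two right-hand images are brighter, i.e., a pure x slope.
flat = [[1.0, 1.0], [1.0, 1.0]]
bright = [[1.5, 1.5], [1.5, 1.5]]
wx, wy = pyramid_slopes(flat, bright, flat, bright)
w = flatten_slopes(wx, wy)
```

Because every pixel is divided by the same global intensity rather than by its local intensity, the normalization does not amplify noise in dim pixels.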
Another feature of the PWFS is that its sensitivity varies depending on both the seeing conditions and the level of AO correction itself (Korkiakoski et al. 2008). This optical gain (OG) effect is mainly introduced by high-spatial-frequency aberrations that the DM cannot control. The presence of these aberrations reduces the signal strength of the measurement also for the controlled modes, and the strength of the reduction depends on the mode’s spatial frequencies (Korkiakoski et al. 2008).
To illustrate the OG effect of the pyramid sensor, we use a preliminary version of a semi-analytical model codenamed “AO cockpyt” (in prep.). This model is based on the work of Fauvarque et al. (2019), describing the sensitivity of the pyramid sensor in the presence of residuals, and on an adaptation of Fourier models from Jolissaint (2010) and Correia et al. (2020). Figure 2 shows the analytically predicted modal optical gains for the case of an 8-m telescope with zero modulation and integrator control, considering two different wavefront sensor wavelengths. The assumed AO system for this analytical prediction is the same as the one used for our numerical simulations presented in Sect. 6 (41 × 41 actuators correcting for a seeing of 0.7′′ at 550 nm at a 1000 Hz framerate using a 0th magnitude guide star). The figure shows how the optical gain depends on the spatial frequency of the control modes (the KL modes are numbered from low to high spatial frequencies) and on the WFS Strehl ratio, which is lower at the shorter wavelength.
A modal optimization of the controller gains using the knowledge of Fig. 2, combined with the usual control-theory margins (gain and phase) to ensure a robust system, can solve most of the problems (diagonality assumption in Chambouleyron et al. 2020). Determining optical gains in real time is possible but complex (Deo et al. 2021; Chambouleyron et al. 2020), and the relative variations shown in Fig. 2 are of the order of 10–20% for our XAO case. Hence, compensation for the mode-dependent optical gains with a single integrator gain may lead to acceptable results. However, an aggressive static integrator gain could impair loop robustness when the correction improves and the optical gains increase. Section 6 presents evidence that PO4AO takes the PWFS OG effect into account for improved performance. Further, modal gain compensation of OG is a solution that is expected to work in favorable cases, but the nonlinearities remaining after OG compensation can only be treated with nonlinear methods such as the one studied in this paper.
Fig. 2 Modal optical gains for the case of an 8-m telescope with zero modulation and integrator control, considering two different wavefront sensor wavelengths. 
4.2 Classical Adaptive Optics Control
Classically, an AO system is controlled by combining a linear reconstructor with a proportional-integral (PI) control law. We call this controller the integrator and use it as the reference method for the comparison with PO4AO. As a starting point, the controller is assumed to operate in a regime where the dependence between WFS measurements and DM commands is linear to a good approximation, satisfying
$$ w_{t} = D \upsilon_{t} + \xi_{t} \qquad (2) $$
where w_{t} is the WFS data, υ_{t} the DM commands, and D is the so-called interaction matrix. Moreover, ξ_{t} models the measurement noise, typically composed of photon and detector noise. The DM command vector υ_{t} represents the DM shape given in the function subspace linearly spanned by the DM influence functions.
The interaction matrix D represents how the WFS sees each DM command. It can be derived mathematically if we accurately know the system components (WFS and DM) and the alignment of the system. In practice, it is usually measured by poking the DM actuators with a small amplitude staying inside the linear range of the WFS, and recording the corresponding WFS measurements (Kasper et al. 2004; Lai et al. 2021).
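The poking procedure can be sketched as follows. The callable `wfs_response` is a hypothetical stand-in for applying a command to the DM and reading out the WFS, and the push-pull (central difference) scheme is one common choice assumed here, not necessarily the one used by the authors.

```python
def measure_interaction_matrix(wfs_response, n_actuators, amplitude=0.01):
    """Build D column by column with push-pull pokes of each actuator.

    `wfs_response(commands)` is a hypothetical callable returning the WFS
    measurement vector (a list) for a DM command vector (a list).
    """
    columns = []
    for j in range(n_actuators):
        push = [0.0] * n_actuators
        pull = [0.0] * n_actuators
        push[j], pull[j] = amplitude, -amplitude
        w_plus, w_minus = wfs_response(push), wfs_response(pull)
        # The central difference removes any static offset of the sensor.
        columns.append([(p - m) / (2.0 * amplitude)
                        for p, m in zip(w_plus, w_minus)])
    n_meas = len(columns[0])
    # Transpose so that D[i][j] is the response of measurement i to actuator j.
    return [[columns[j][i] for j in range(n_actuators)] for i in range(n_meas)]

# Toy linear sensor with a constant offset, standing in for real hardware.
TRUE_D = [[1.0, 0.5], [0.0, 2.0], [-1.0, 0.25]]

def toy_wfs(commands):
    return [sum(d * v for d, v in zip(row, commands)) + 0.3 for row in TRUE_D]

D = measure_interaction_matrix(toy_wfs, n_actuators=2)
```

For a real, nonlinear sensor the poke amplitude has to be small enough to stay in the linear range, as stated above, while remaining large enough to rise above the measurement noise.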
The interaction matrix D is generally ill-conditioned, and regularization methods must be used to invert it (Engl et al. 1996). Here, we regularize the problem by projecting υ_{t} to a lower-dimensional subspace spanned by a Karhunen–Loève (KL) modal basis. The KL basis is computed via a double diagonalization process, which considers the geometrical and statistical properties of the telescope (Gendron 1994). This process results in a transformation matrix B_{m} which maps DM actuator voltages to modal coefficients.
We observe that the modal interaction matrix is now obtained as D_{m} = D B_{m}^{†}, where B_{m}^{†} is the Moore–Penrose pseudoinverse of B_{m}. A well-posed reconstruction matrix for the inverse problem in Eq. (2) is then given by
$$ C = B_{m}^{\dagger} D_{m}^{\dagger} \qquad (3) $$
where restricting the inversion to the KL modes acts as a projection onto the KL basis. Regularization by projection is a classical regularization with well-established theory (Engl et al. 1996). It is well-suited for the problem at hand due to the physics-motivated basis expansion and the fixed finite dimension of the observational data.
With ∆w_{t} denoting the residual error seen by the WFS in closed loop, and t denoting the discrete time step of the controller, the integrator control law is
$$ \upsilon_{t} = \upsilon_{t-1} + g \, C \, \Delta w_{t} \qquad (4) $$
where g is the so-called integrator gain. In the literature, g < 0.5 is typically found to provide stable control for a two-step delay system (Madec 1999).
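The interplay between the integrator gain and the loop delay can be illustrated on a single mode. The sketch below is a toy model under stated assumptions: a constant disturbance, a noiseless measurement already expressed in command units, a pure two-frame delay, and a sign convention in which the residual is the disturbance minus the applied command.

```python
def run_integrator(gain, disturbance, n_steps=200, delay=2):
    """Closed-loop residuals of a single mode under integrator control.

    The command computed at step t takes effect `delay` frames later,
    mimicking the two-frame delay of the AO loop.
    """
    in_flight = [0.0] * delay            # commands not yet applied to the DM
    command = 0.0
    residuals = []
    for _ in range(n_steps):
        applied = in_flight.pop(0)       # command issued `delay` frames ago
        residual = disturbance - applied
        command = command + gain * residual   # integrator update
        in_flight.append(command)
        residuals.append(residual)
    return residuals

stable = run_integrator(gain=0.4, disturbance=1.0)
unstable = run_integrator(gain=1.2, disturbance=1.0)
```

With a gain of 0.4 the residual of this toy loop decays to zero, whereas a gain of 1.2 makes the delayed loop oscillate with growing amplitude. The exact stability limit depends on the loop model, so the sketch only illustrates why the gain cannot be raised arbitrarily to reduce the temporal error.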
5 Learning to Control Using a Model
Here we detail the control algorithm, including the optimization of the dynamics model p_{ω}(S_{t}, a_{t}) and the policy π_{θ}(a_{t} | S_{t}). In standard AO terms, the policy combines the reconstruction and control law (e.g., a least-squares modal reconstruction followed by integrator control); in our case, it combines a nonlinear correction to a least-squares modal reconstruction (MDP formulation) and a predictive control law. The key idea is to learn a dynamics model that predicts the next wavefront sensor measurement given the previous measurements and actions and to use that model to optimize the policy. Our method iterated the following three phases^{3}:
Running the policy: we ran the policy in the AO control loop for T timesteps (a single episode).
Improving the dynamics model: we optimized the dynamics model using a supervised learning objective (Eq. (9)).
Improving the policy: we optimized the policy using the dynamics model (Eq. (12)).
At each iteration of our algorithm, we collected an episode’s worth of data, e.g., 500 subsequent sensor measurements and mirror commands, by running the policy in the AO control loop for T timesteps. We then saved the observed data and given actions and trained our policy and dynamics model using gradients computed from all previously observed data.
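The three phases can be sketched on a scalar toy problem. Everything in the example is an illustrative assumption: the plant gain `TRUE_B`, the one-parameter linear model and policy, the exploration noise, and the plain gradient steps, which stand in for the neural networks and optimizer of the actual method.

```python
import random

rng = random.Random(1)
TRUE_B = 0.7  # plant gain, unknown to the controller (hypothetical toy plant)

def run_episode(k, n_steps=50, explore=0.05):
    """Phase 1: run the policy a = -k * o and record (o, a, o_next) tuples."""
    o, data = 1.0, []
    for _ in range(n_steps):
        a = -k * o + rng.gauss(0.0, explore)     # small exploration noise
        o_next = o + TRUE_B * a + rng.gauss(0.0, 0.01)
        data.append((o, a, o_next))
        o = o_next
    return data

def fit_model(data):
    """Phase 2: least-squares fit of b in the model o_next = o + b * a."""
    num = sum(a * (o_next - o) for o, a, o_next in data)
    den = sum(a * a for _, a, _ in data)
    return num / den

def improve_policy(k, b_hat, data, lr=0.5, n_grad=20):
    """Phase 3: gradient ascent on the model-predicted reward -(o + b_hat*a)^2."""
    mean_o2 = sum(o * o for o, _, _ in data) / len(data)
    for _ in range(n_grad):
        grad = 2.0 * b_hat * (1.0 - b_hat * k) * mean_o2   # d(reward)/dk
        k += lr * grad
    return k

k, dataset = 0.0, []
for episode in range(10):
    dataset += run_episode(k)        # keep all previously observed data
    b_hat = fit_model(dataset)
    k = improve_policy(k, b_hat, dataset)
```

The policy never touches the real plant during phase 3. It is improved only through the learned model, so its quality is bounded by the quality of the model fit. In this toy case the gain k converges toward 1/b, the value that cancels the residual in one step.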
The following sections discuss how we represented each observation, our convolutional neural network architecture for both the dynamics model and the policy, and the optimization algorithm itself.
5.1 Adaptive Optics as a Markov Decision Process
We defined the adaptive optics control problem as an MDP by following the approach of Nousiainen et al. (2021). As discussed in Sects. 3 and 4, we do not directly observe the state of the system but instead observe a noisy WFS measurement. In addition, adaptive optics systems suffer from control delay resulting from the high speed of operation, which means that the system evolves before the latest action has been fully executed. Hence, we set our state representation to include a small number of past WFS measurements and control voltages.
We denote the control voltages applied to the DM at a given time instance t by υ_{t} and the preprocessed PWFS measurements by w_{t}. We defined the set of actions to be the set of differential control voltages:
$$ a_{t} = \Delta \upsilon_{t} = \upsilon_{t} - \upsilon_{t-1} \qquad (5) $$
In adaptive optics, at each timestep t, we observe the wavefront sensor measurement w_{t}. We project the measurement into voltage space by utilizing the reconstruction matrix C. The observation is then given by the quantity:
$$ o_{t} = C w_{t} \qquad (6) $$
To represent each state, we concatenated previous observations and actions. That is,
$$ S_{t} = (o_{t}, o_{t-1}, \ldots, o_{t-k}, a_{t-1}, a_{t-2}, \ldots, a_{t-m}) \qquad (7) $$
where we chose k = m (as in the typical pseudo-open-loop prediction). The state includes data from the previous m time steps and the reconstruction matrix C. Here the reconstruction matrix serves solely as a preprocessing step for WFS measurements. It speeds up the learning process by simplifying the convolutional NN (CNN) architecture (observations and actions have the same dimension). However, it does not directly connect the measurement to actions and, therefore, using it does not imply a sensitivity to misregistration (Nousiainen et al. 2021).
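A minimal sketch of this state construction keeps fixed-length histories of observations and actions and concatenates them into channels. The class and parameter names are hypothetical, the reconstructor is a made-up 2 × 3 matrix, and each channel is a flat vector here instead of an N × N actuator map.

```python
from collections import deque

class StateBuffer:
    """Keeps fixed-length histories of observations and actions."""

    def __init__(self, n_obs, n_prev_actions, n_act):
        zero = [0.0] * n_act
        self.obs = deque([list(zero) for _ in range(n_obs)], maxlen=n_obs)
        self.act = deque([list(zero) for _ in range(n_prev_actions)],
                         maxlen=n_prev_actions)

    def push(self, observation, action):
        """Store the newest observation and the action that preceded it."""
        self.obs.appendleft(list(observation))   # oldest entry drops off the end
        self.act.appendleft(list(action))

    def state(self):
        """Return the state as a list of channels, newest first."""
        return list(self.obs) + list(self.act)

def reconstruct(C, w):
    """Observation o_t = C w_t: project a WFS measurement into voltage space."""
    return [sum(c * x for c, x in zip(row, w)) for row in C]

C = [[1.0, 0.0, 0.5], [0.0, 2.0, 0.0]]        # hypothetical 2 x 3 reconstructor
buf = StateBuffer(n_obs=2, n_prev_actions=2, n_act=2)
buf.push(reconstruct(C, [1.0, 1.0, 2.0]), action=[0.1, -0.1])
buf.push(reconstruct(C, [0.0, 0.5, 0.0]), action=[0.2, 0.0])
s = buf.state()
```

Because observations and actions live in the same voltage space after the projection with C, all channels share one shape, which is what allows them to be stacked into a single input tensor.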
For a state-action pair, the reward was chosen as the negative squared norm of the residual voltages corresponding to the following measurement:
$$ r(S_{t}, a_{t}) = - \left\| o_{t+1} \right\|^{2} \qquad (8) $$
where o_{t+1} was obtained from S_{t+1}.
This quantity is proportional to the observable part of the negative norm of the true residual wavefront. This reward function does not capture all error terms such as aliasing and non-common path errors (NCPA), and hence, the final contrast performance will always be limited by these. The aliasing errors could be mitigated with traditional means, e.g., by introducing a spatial filter (Poyneer & Macintosh 2004) or by oversampling the wavefront, that is, by using a WFS with finer sampling than the one provided by the DM. We also already alluded to the fact that minimizing the residual wavefront seen by the WFS does not necessarily minimize the residual halo in the science image because of NCPA between the two. PO4AO could treat NCPA by including science camera images in the state formulation, but these would have to be provided at the same cadence as the WFS data, which is usually not the case. Still, NCPA can be handled by PO4AO in the usual way by offsetting the WFS measurements by an amount determined by an auxiliary image processing algorithm (e.g., Give’on et al. 2007; Paul et al. 2013). Finally, the reward does not include an assumption on the time delay of the system, so the method learns to compensate for any delay and predict the wavefront.
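The reward and its derivative can be written in a few lines. The sketch assumes that the next observation is available as a flat list of residual voltages, and the function names are illustrative.

```python
def reward(next_observation):
    """Negative squared norm of the residual voltages o_{t+1}."""
    return -sum(v * v for v in next_observation)

def reward_gradient(next_observation):
    """Gradient of the reward with respect to o_{t+1}, i.e., -2 * o_{t+1}."""
    return [-2.0 * v for v in next_observation]

r_good = reward([0.1, -0.1, 0.0])      # small residual, reward close to zero
r_bad = reward([1.0, -2.0, 0.5])       # large residual, strongly negative
g_bad = reward_gradient([1.0, -2.0, 0.5])
```

The closed-form gradient is what makes the reward convenient for policy optimization: it can be propagated back through a differentiable dynamics model to the policy parameters.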
Fig. 3 Neural network architectures. Both the dynamics model and the policy NN take the same input: concatenations of past actions and observations. They also share the same fully convolutional structure in the first layers. At the output layer, the policy model includes the KL-filtering scheme (upper right corner) and the dynamics model output is multiplied with the WFS mask (lower right corner). See Sect. 5.2 for details. 
5.2 The Dynamics Model
An adaptive optics system inherits strong spatial correlations in observation and control space: neighboring actuators and WFS pixels close to each other are more correlated than those further apart, due to the steep negative slope of the turbulence temporal PSD (Fried 1990) and the frozen flow hypothesis. We employed a standard fully convolutional neural network (CNN), equipped with leaky rectified linear unit (LReLU; Maas et al. 2013) activation functions, that predicts the next wavefront sensor readout. The CNN should work well for our setup, with DM actuators and WFS subapertures aligned on a grid in a spatially homogeneous geometry.
In practice, the state is a 3D tensor: matrices stacked along the third dimension, that is, an (N × N × (k + m)) tensor. The first two dimensions correspond to the 2D DM actuator grid, and the channel dimension corresponds to the number of previous observations (k) and actions (m). See Fig. 3 for an illustration.
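As an illustration of this state layout, the following minimal sketch stacks the latest k observations and m actions along the third axis. The function name `build_state`, the toy sizes, and the channel-last ordering are assumptions made for this example; a PyTorch implementation would typically place the channel axis first.

```python
import numpy as np

def build_state(observations, actions, k, m):
    """Stack the k latest observations and m latest actions along the third axis.

    observations, actions: lists of (N, N) arrays, oldest first.
    Returns an (N, N, k + m) array.
    """
    if len(observations) < k or len(actions) < m:
        raise ValueError("not enough history to build the state")
    channels = list(observations[-k:]) + list(actions[-m:])
    return np.stack(channels, axis=-1)

# Toy sizes (hypothetical, far smaller than a real actuator grid).
N, k, m = 4, 3, 3
rng = np.random.default_rng(0)
obs_history = [rng.standard_normal((N, N)) for _ in range(5)]
act_history = [rng.standard_normal((N, N)) for _ in range(5)]

state = build_state(obs_history, act_history, k, m)
print(state.shape)
```

The first k channels hold the observations and the last m channels hold the actions, so the most recent observation sits in channel k - 1.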
The deterministic dynamics model \hat{p}_{ω} estimates the next state S_{t+1} given the previous state and action. The model parameters ω (i.e., the NN weights and biases) were trained by first running the policy π in the environment, that is, controlling the AO system with the policy, collecting tuples of (S_{t}, a_{t}, S_{t+1}) into a dataset \mathcal{D}, and minimizing the squared difference between the true next states and the predictions
\sum_{\mathcal{D}} \| o_{t+1} - \hat{o}_{t+1} \|^{2}, (9)
where o_{t+1} is obtained from the state S_{t+1} and \hat{o}_{t+1} is the observation predicted by \hat{p}_{ω}(S_{t}, a_{t}). The optimization was done using the Adam algorithm (Kingma & Ba 2014). Again, we did not assume any integer time delay here; because the past actions are included in the state formulation, the model learns to account for the delay.
It is well known that model-based RL tends to exploit an overfitted dynamics model in the control step (e.g., planning or policy optimization), which degrades performance, especially in the early stages of training (Nagabandi et al. 2018). To discourage this behavior, we employed an ensemble of several models, each of which is trained using a different bootstrap dataset, that is, a subset of the observations collected during training. In practice, this means that each model sees a different subset of observations, leading to different NN approximations. During policy training, predictions are averaged over the models (line 9 of Algorithm 1). See, for example, Chua et al. (2018) for a more detailed discussion on ensemble models.
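The ensemble idea can be sketched on a toy problem. The example below replaces the CNN dynamics models with linear least-squares models purely for brevity, and it assumes bootstrap resampling with replacement; the function names and data sizes are hypothetical and not taken from the PO4AO implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: the next observation is a linear function of a flattened
# (state, action) feature vector, plus a small amount of noise.
n_samples, n_features, n_outputs = 200, 6, 3
true_weights = rng.standard_normal((n_features, n_outputs))
features = rng.standard_normal((n_samples, n_features))
targets = features @ true_weights + 0.01 * rng.standard_normal((n_samples, n_outputs))

def fit_bootstrap_ensemble(x, y, n_models, rng):
    """Fit one least-squares model per bootstrap resample of the data."""
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(x), size=len(x))  # sample with replacement
        weights, *_ = np.linalg.lstsq(x[idx], y[idx], rcond=None)
        models.append(weights)
    return models

def ensemble_predict(models, x):
    """Average the predictions of all ensemble members."""
    return np.mean([x @ w for w in models], axis=0)

ensemble = fit_bootstrap_ensemble(features, targets, n_models=5, rng=rng)
prediction = ensemble_predict(ensemble, features)
mse = float(np.mean((prediction - targets) ** 2))
print(f"ensemble mean squared error: {mse:.2e}")
```

Each member sees a different resample and therefore ends up with slightly different weights; averaging their predictions reduces the influence of any single overfitted member.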
5.3 The Policy Model
Again, we employed a fully convolutional neural network as the policy, similar to the dynamics model. The input is a 3D tensor representing the state, and the output is a 2D tensor (a matrix) representing the actuator voltages. The WFS measurement is blind or insensitive to some shapes of the mirror, such as the well-known waffle mode and actuators on the boundary. We ensured that we do not control these modes by projecting each set of control voltages onto the control space, that is, we reshaped the 2D output to a vector, multiplied it by a filter matrix, and then reshaped the result back to a 2D image. The full policy model π is given by
π_{θ}(S_{t}) = B^{†}B F_{θ}(S_{t}), (10)
where B^{†}B projects the control voltages onto the control space defined by the KL modes and F_{θ} is the standard fully convolutional NN, where the output is vectorized. Figure 3 gives a more detailed overview of the network architecture of F_{θ}.
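The filtering step can be illustrated with a small numerical sketch. Here B is a random stand-in for a modes-by-actuators matrix of the retained KL modes (an assumption for illustration; the real basis comes from the calibration), and `filter_policy_output` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(2)
n_act_side = 4                    # toy actuator grid of 4 x 4
n_actuators = n_act_side ** 2
n_modes = 10                      # number of retained modes (hypothetical)

# Stand-in for the retained modal basis: maps actuator voltages to mode coefficients.
B = rng.standard_normal((n_modes, n_actuators))
kl_filter = np.linalg.pinv(B) @ B  # B^+ B, an orthogonal projector onto the row space of B

def filter_policy_output(raw_output_2d, kl_filter):
    """Vectorize the 2D network output, project it, and reshape it back."""
    vec = raw_output_2d.reshape(-1)
    filtered = kl_filter @ vec
    return filtered.reshape(raw_output_2d.shape)

raw = rng.standard_normal((n_act_side, n_act_side))
filtered = filter_policy_output(raw, kl_filter)
twice = filter_policy_output(filtered, kl_filter)
print(np.allclose(filtered, twice))
```

Because B^{†}B is a projector, applying it a second time leaves the voltages unchanged, which is a convenient sanity check for a filter matrix.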
5.4 Policy Optimization
Ideally, the policy π_{θ}(S_{t}) would be optimized based on the expected cumulative reward function Eq. (1). However, as we do not have access to the true dynamics model p, we must approximate it with the learned dynamics model \hat{p}_{ω}. To stabilize this process, we introduced an extended time horizon H ≪ T over which the performance was optimized. Let us define (11)
where õ_{t+1} is obtained from . This leads to the approximative policy optimization problem (12)
where H the planning horizon and
Here the planning horizon H was chosen based on the properties of the AO system. More precisely, for AO control, the choice of the planning horizon H is driven by the system’s control delay. In the case of a simple two-frame delay, no DM dynamics, and no noise, we would plan to minimize the observed wavefront sensor measurements two steps into the future, that is, we would implicitly predict the best control action by the DM at the time of the corresponding WFS measurement. However, the effective planning horizon is longer in the presence of DM dynamics and temporal jitter since the control voltage decisions are not entirely independent. The choice of the planning horizon is a compromise between two effects: too short a planning horizon jeopardizes the loop stability, and too long a planning horizon makes the method prone to overfitting. We used H = 4 frames in all our experiments (numerical and laboratory) as a reasonably well-working compromise.
The policy π was optimized by sampling initial states from previously observed samples, computing actions for them, and using the dynamics model to simulate what would happen if we were to take those actions. We could then use the differentiable nature of both our models and the reward function to backpropagate through rewards computed at each time step. More specifically, at each iteration, we sampled a batch of initial states S_{τ} and computed the following H states using the dynamics model. We then had H rewards for each initial state, and we used the gradients of the sum of those rewards with respect to the policy parameters θ to improve the parameters. The full procedure of training the dynamics and the policy is given in Algorithm 1, where the while-loop (line 3) iterates over episodes and lines 6–16 execute an update of the policy via policy optimization.
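The structure of this H-step optimization can be sketched with a deliberately simple scalar toy problem. Everything below is an assumption made for illustration: the linear stand-in for the learned dynamics model, the one-parameter policy, the constants, and the use of finite differences in place of backpropagation through the networks.

```python
import numpy as np

RHO, GAIN, HORIZON = 0.9, 1.0, 4  # toy model constants and planning horizon H

def model_step(obs, action):
    """Toy stand-in for the learned dynamics model: predicts the next observation."""
    return RHO * obs - GAIN * action

def policy(obs, theta):
    """Toy linear policy with a single parameter theta."""
    return theta * obs

def rollout_objective(theta, initial_obs, horizon=HORIZON):
    """Batch mean of the summed rewards -|o|^2 over `horizon` simulated steps."""
    obs = initial_obs.copy()
    total = np.zeros_like(obs)
    for _ in range(horizon):
        obs = model_step(obs, policy(obs, theta))
        total += -obs ** 2
    return float(np.mean(total))

def finite_difference_gradient(theta, initial_obs, eps=1e-5):
    """Central finite difference, used here instead of automatic differentiation."""
    upper = rollout_objective(theta + eps, initial_obs)
    lower = rollout_objective(theta - eps, initial_obs)
    return (upper - lower) / (2 * eps)

rng = np.random.default_rng(3)
initial_obs = rng.standard_normal(64)  # batch of sampled initial observations
theta = 0.0
objective_before = rollout_objective(theta, initial_obs)
for _ in range(300):
    theta += 0.05 * finite_difference_gradient(theta, initial_obs)  # gradient ascent
objective_after = rollout_objective(theta, initial_obs)
print(theta, objective_before, objective_after)
```

In this toy setting the summed reward is maximized when the policy cancels the predicted observation, that is, when theta approaches RHO / GAIN.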
6 Numerical Simulations
6.1 Setup Description
We evaluated the performance of PO4AO by numerical simulations. We used the COMPASS package (Ferreira et al. 2018) to simulate an XAO system at an 8 m telescope employing a non-modulated Pyramid WFS in low noise (0 mag) and moderately large noise (9 mag) conditions. For comparison, we also considered the theoretical case of an “ideal” wavefront sensor where the wavefront reconstruction is simply a projection of the 2D turbulence screen onto the DM’s influence functions.
We also include a simulation of a 40-meter telescope XAO with PWFS to confirm that PO4AO scales well with aperture size and XAO degrees of freedom. Comprehensive error analysis and fine-tuning are left for future work. In order to stabilize the performance of the integrator, we added 2λ/D modulations to the PWFS.
For all simulations, we simulated the atmospheric turbulence as a sum of three frozen flow layers with von Kármán power spectra, combining to a Fried parameter r_{0} of 16 cm at 500 nm wavelength. The complete set of simulation parameters is provided in Table 1.
We compare PO4AO against a well-tuned integrator and an instantaneous controller that is not affected by measurement noise or temporal error. For the Pyramid WFS, the instantaneous controller still propagates aliasing and the fitting error introduced by uncontrolled or high-spatial-frequency modes. For the idealized WFS, it acts as a spatial high-pass filter, instantaneously subtracting the turbulent phase projected on the DM control space (DM fitting error only).
In particular, we chose the simulation setups to demonstrate the following key properties of the proposed method. Firstly, the method achieves the required real-time control speed while being quick to train. This property enables the controller to be trained just before the science operation and be further updated during the operation. Consequently, the method is trained with the most relevant data and does not need to generalize to all possible conditions at once; furthermore, the method retains these properties with an ELT-scale instrument. Secondly, the method is a predictive controller, robust to nonlinear wavefront sensing and photon noise. Thirdly, the method can cope with the optical gain effect of the pyramid sensor.
Table 1 Simulation parameters.
6.2 Algorithm setup
We chose the state S_{t} (in the MDP) to consist of the 15 latest observations and actions and set the CNNs (dynamics and policy) to have three layers with 32 filters each. For further details on these choices, see Sect. 6.3. The episode length was set to 500 frames.
Each simulation started with the calibration of the system and the derivation of the reconstruction matrix C and the KL basis B; see Sect. 4. We note that the reconstruction matrix C serves solely as a filter that projects WFS measurements to the control space. It does not have to match the actual registration of DM and WFS (Nousiainen et al. 2021). In particular, the reconstruction matrix is measured around the null point in the calibrations and, hence, it suffers from the optical gain effect (Korkiakoski et al. 2008). For PWFS simulations, the KL filter was set to include 85% of the total degrees of freedom, and for ideal wavefront sensing the filter matrix was the identity, that is, no filtering was included.
For all conditions and instruments, we let the simulations run until the performance of PO4AO had converged, that is, for 46000 frames (46 s in theoretical real time) with an episode length of 500 frames. While the final contrast performance shown in Figs. 4b–d and 5 is calculated from the last 1000 frames, we note that the correction performance very quickly passes the integrator performance, as shown in Figs. 6a–c and 7. After each episode, as described in Sect. 5.4, we halted the simulations and updated the dynamics and policy models. Given the shallow convolutional structure (3 layers and 32 filters per layer) of the NN models and our moderate hardware, the combined (dynamics and policy) training time after each episode was about 1.5 s for VLT (and 7 s for ELT with the same training hyperparameters). For a real-time implementation, training the NN models should be completed within the duration of an episode, that is, in 0.5 s (500 frames at 1 kHz). Given that we do not use the latest GPU hardware, and that an NN update could also be done at a slower rate than after each episode, it is conceivable that this small gap can be overcome and that a real-time implementation of PO4AO is already possible.
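The timing figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text; the variable names and the derived "update every n-th episode" quantity are illustrative assumptions.

```python
import math

frame_rate_hz = 1000          # 1 kHz loop rate
episode_frames = 500          # frames per episode
total_frames = 46000          # frames until convergence
training_time_s = {"VLT": 1.5, "ELT": 7.0}  # measured training time per update

episode_duration_s = episode_frames / frame_rate_hz
n_episodes = total_frames // episode_frames
total_time_s = total_frames / frame_rate_hz

# Ratio of training time to episode duration, and the number of episodes one
# update would span if training ran in parallel at the measured speed.
slowdown = {name: t / episode_duration_s for name, t in training_time_s.items()}
update_every = {name: math.ceil(ratio) for name, ratio in slowdown.items()}

print(episode_duration_s, n_episodes, total_time_s)
print(slowdown, update_every)
```

With these numbers, a VLT-scale update at the measured speed would span three episodes, which is the "slower rate" option mentioned in the text.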
The dynamics model can also be trained with data obtained with a different controller, such as the integrator or random control. Therefore, to improve the stability of the learning process, we “warm up” the policy by running the first ten episodes with the integrator and added binary noise to develop a coarse understanding of the system dynamics: (13)
where x is binary noise (–1 or 1 with the same probability) and σ ∈ [0, 1] is reduced linearly after each episode such that the first episode was run with high binary noise and the 10th episode with zero noise.
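A possible form of this warm-up schedule is sketched below. The exact expression of Eq. (13) is not reproduced here, so the additive form of the noise, the maximum noise level of 1, and the function names are assumptions; only the linear decay to zero at the 10th episode and the binary ±1 noise follow the description above.

```python
import numpy as np

def warmup_sigma(episode, n_warmup=10, sigma_max=1.0):
    """Linearly decaying noise level: sigma_max in episode 1, zero in episode n_warmup."""
    if not 1 <= episode <= n_warmup:
        raise ValueError("episode must lie in 1..n_warmup")
    return sigma_max * (n_warmup - episode) / (n_warmup - 1)

def binary_noise(shape, rng):
    """Draw -1 or +1 with equal probability for every element."""
    return rng.choice(np.array([-1.0, 1.0]), size=shape)

rng = np.random.default_rng(4)
sigmas = [warmup_sigma(e) for e in range(1, 11)]
noise = binary_noise((4, 4), rng)

integrator_action = np.zeros((4, 4))  # placeholder for the integrator command
noisy_action = integrator_action + sigmas[0] * noise  # assumed additive form
print(sigmas)
```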
Fig. 4 Raw PSF contrast in VLT-scale telescope experiments. Upper images: raw PSF contrast. Lower plot: the radial averages over the image. The blue lines are for the integrator and red for the PO4AO. The raw PSF contrast was computed during the 1000 frames of the experiment. Panel a: performance of PO4AO with ideal WFS. We see that PO4AO delivers a factor of 20–90 improvement inside the AO control radius compared to a well-tuned integrator. Panel b: performance of PO4AO on a 0th mag guide star and a non-modulated PWFS. PO4AO delivers a factor of 4–7 better contrast inside the AO control radius. Panel c: performance of PO4AO on a 9th mag guide star. We see a factor of 3–9 improvement in the raw PSF contrast. Panel d: performance of PO4AO under heavy data mismatch. PO4AO was trained with drastically different wind conditions. The PO4AO still delivers better contrast at small angular separations.
Fig. 5 Raw PSF contrast in ELT-scale experiment.
6.3 CNN Design and MDP State Definition
The PO4AO includes two learned models: the policy and the dynamics model. This paper aims to introduce an optimization method called PO4AO to train the policy (from scratch) that maximizes the expected reward. The algorithm works for all differentiable function classes, for example, neural networks. For simplicity, we chose to model the environment dynamics and policy using generic 3-layer fully convolutional neural networks. While further research is needed to find the best possible architectures, we experimented with the number of convolutional filters per layer and the number of past telemetry data by testing the algorithm in the “VLT” environment with different combinations; see Table 2. We chose the model CNN 2 as a compromise between the overall performance, inference speed for VLT and ELT, and training speed. The chosen model performed well in all simulations and provided fast inference and fast training, such that training could be completed during a single episode. Full model architecture optimization is left for future work (see Sect. 8 for more details).
The inference speed in Table 2 is the speed of the fully convolutional NN architecture inside the policy model (see Fig. 3). The total control time includes two standard MVMs (preprocessing to voltages + KL filtering in the output layer) in addition to the inference time below. The inference time and training time were measured with PyTorch on an NVIDIA Quadro RTX 3000 GPU. Note here that, given enough parallel computational power (e.g., a GPU), the inference time of a fully convolutional NN is determined more by the number of layers and filters (same for VLT and ELT) than by the input image’s size. We observe that for CNNs with fewer filters, the inference speed is very similar for the VLT and ELT cases, while for heavier CNNs, the inference speed differs more with the given hardware. The computational time of the MVMs is naturally dependent on the DoF.
6.4 Results
6.4.1 Training
To evaluate the training speed of the method, we compare the learning curves of the method (of which the first 5000 frames are obtained with the integrator + noise controller) to the baseline of the integrator performance under the same realization of turbulence and noise (see Figs. 6a–c and 7). Since the simulations are computationally expensive, in the 40-meter telescope experiments, we compare the performance of the PO4AO only to the average integrator performance (see Fig. 7).
We plotted the training curves with respect to total reward (the sum of normalized residual voltages computed from the WFS measurements) and Strehl ratio side by side. The method tries to maximize the reward, and consequently, it also maximizes the Strehl ratio. In all our simulations, the method achieves better performance than the integrator already after the integrator warm-up of 5000 frames (5 s on a real telescope), and the performance stabilizes at around 30000 frames (30 s). Since the fully convolutional NN structure can capture and utilize the homogeneous structure of the turbulence, the number of data frames needed for training VLT and ELT control is on the same scale. However, training for the same number of gradient steps is computationally more expensive (although very parallelizable) for the ELT-scale system.
6.4.2 Prediction and Noise Robustness
Here, we compare the fully converged PO4AO, the integrator, and ideal control in raw PSF contrast. We ran each controller for 1000 frames, and the wavefront residuals for each controller were propagated through a perfect coronagraph (Cavarroc et al. 2006). The raw PSF contrast was calculated as the ratio between the peak intensity of the non-coronagraphic PSF and the post-coronagraphic intensity field. A non-predictive control law suffers from the notorious wind-driven halo (WDH) (Cantalloube et al. 2018), that is, the butterfly-shaped contrast loss in the raw PSF contrast in Figs. 4a–c and 5.
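A raw contrast map and its radial average, as plotted in the lower panels of Fig. 4, could be computed along the following lines. This is a sketch under stated assumptions: the contrast is taken as the post-coronagraphic intensity divided by the peak of the non-coronagraphic PSF, the images are synthetic placeholders, and the function names are hypothetical.

```python
import numpy as np

def raw_contrast(coronagraphic_image, reference_psf):
    """Post-coronagraphic intensity normalized by the non-coronagraphic PSF peak."""
    return coronagraphic_image / reference_psf.max()

def radial_profile(image, center=None, bin_width=1.0):
    """Azimuthal average of `image` in annuli of width `bin_width` pixels."""
    ny, nx = image.shape
    if center is None:
        center = ((ny - 1) / 2.0, (nx - 1) / 2.0)
    y, x = np.indices(image.shape)
    radius = np.hypot(y - center[0], x - center[1])
    bins = (radius / bin_width).astype(int).ravel()
    sums = np.bincount(bins, weights=image.ravel())
    counts = np.bincount(bins)
    return sums / np.maximum(counts, 1)  # guard against empty annuli

# Synthetic placeholder images: a Gaussian PSF and a flat residual halo.
size = 33
y, x = np.indices((size, size))
r2 = (y - 16) ** 2 + (x - 16) ** 2
psf = np.exp(-r2 / (2 * 2.0 ** 2))            # peak value 1 at the center
coronagraphic = 1e-4 * np.ones((size, size))  # uniform residual intensity

contrast = raw_contrast(coronagraphic, psf)
profile = radial_profile(contrast)
print(profile[:5])
```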
Figure 4a assumes using the ideal WFS, that is, the incoming phase is measured by a noiseless projection of the incoming phase onto the DM. Therefore, the ideal WFS eliminates aliasing and noise in the wavefront reconstruction process, only considering temporal and fitting errors. Further, we can easily eliminate temporal error in a simulation by directly subtracting the measured phase from the incoming phase. The “no noise, no temporal error” curve (black dashed) in Fig. 4a is therefore only limited by the ability of the DM to fit the incoming wavefront. The integrator with a 2-frame delay (blue curve) is then limited by the temporal error in addition. The PO4AO (red curve) largely reduces the WDH by predicting the temporal evolution of the wavefront but does not fully recover the fitting error limit (black dashed). Figure 4a, therefore, demonstrates the ability of PO4AO to reduce the temporal error.
Figure 4b replaces the ideal WFS with the non-modulated PWFS, which is affected by aliasing and requires some filtering of badly seen KL modes during the reconstruction. Therefore, the “no noise, no temporal error” contrast performance is worse than for the ideal WFS in Fig. 4a. The integrator with a 2-frame delay (blue curve) performs at a very similar contrast as in the ideal WFS case, so it is still limited mostly by temporal error. Again, PO4AO (red curve) lies about halfway between the integrator and “no noise, no temporal error” controllers but performs at a reduced contrast compared to the ideal WFS case. Therefore, the PO4AO performance with the non-modulated PWFS is affected by aliasing and reconstruction errors as well as the temporal error.
Figure 4c adds a significant amount of measurement noise. While this obviously does not affect the “no noise, no temporal error” case, the contrast performance of both integrator and PO4AO is strongly reduced and dominated by noise. Still, PO4AO outperforms the integrator, which demonstrates the resilience of PO4AO against noise-dominated conditions. Finally, Fig. 5 demonstrates that PO4AO maintains its properties in an ELT-scale simulation.
Unfortunately, a “black box” controller like PO4AO does not allow us to cleanly separate all individual terms in the error budget because the controller’s behavior is to some extent driven by the error terms themselves. However, as discussed above, we explored the relative importance of the individual terms by switching them on and off in our numerical experiments.
Fig. 6 Training plots for 8-meter telescope experiments. Panel a is for the ideal wavefront sensor, panel b for the 0th magnitude guide star, and panel c for the 9th magnitude guide star. The red lines correspond to the performance of PO4AO during each episode and the blue lines to the integrator. The gray dashed line marks the end of the integrator warm-up for PO4AO. In all cases the PO4AO outperforms the integrator already after the warm-up period, in both the Strehl ratio and rewards. An optimized implementation of the PO4AO could run the training in parallel to control, and the training time would then be included in the plot (see Sect. 6.2).
Fig. 7 Training plots for the 40 m telescope experiment. The red lines correspond to the performance of the PO4AO during each episode and the blue lines to the average integrator performance. The gray dashed line marks the end of the integrator warm-up for PO4AO. Similarly to the 8-meter telescope experiments, the PO4AO outperforms the integrator after the warm-up.
Table 2 Performance of 11 different 3-layer CNNs.
6.4.3 Robustness Against Data Mismatch
So far, we have focused on static atmospheric conditions and on a data set of unlimited size, that is, an “ever-growing” one. However, in reality, the atmospheric conditions are constantly changing, creating a so-called data mismatch problem: the prevailing atmospheric conditions are slightly different from the conditions in which the model was trained. To test the method’s robustness to data mismatch, we trained the model with very different conditions and then tested the model with the original wind profile by plotting the raw PSF contrast averaged over 1000 frames. For training, we altered the wind by reducing the wind speed by 50 percent and adding 90-degree variations to the directions, that is, we altered the spatial and temporal statistics of the atmosphere. We do not show the corresponding training plot since it was very similar to Fig. 6b. The result of this experiment is shown in Fig. 4d. The integrator naturally has the same performance as before. The PO4AO still delivers better contrast close to the guide star but suffers from pronounced WDH further from the guide star. Most importantly, the PO4AO is robust and maintains acceptable performance even with heavy data mismatch, which could occur in the unlikely case that atmospheric conditions drastically change from one episode to the next, that is, on a timescale of seconds. In any case, PO4AO with a limited data set size (old, irrelevant data removed) would adapt to such a change and recover the performance within the typical training times discussed in the previous paragraph.
6.5 Sensitivity to the PWFS Optical Gain Effect
The PO4AO uses convolutional NNs and is, therefore, a nonlinear method. It can therefore be expected to adapt to nonlinearities in the system, such as the optical gain effect observed for the Pyramid WFS. To examine this property, we ran the following experiment. We controlled the non-modulated PWFS with PO4AO at 850 and 600 nm and recorded the policy after training. Then we controlled the PWFS with the integrator and recorded, in parallel, the actions PO4AO would have taken. The integrator control results in a correction performance similar to the Strehl ratios derived by the semi-analytical model (Fig. 2). At the shorter wavelength, the PWFS sees larger residual wavefront errors (in radians) and a stronger effect on the optical gains. However, if the controller can cope with such an effect, which we would expect for PO4AO, the suggested actions should counteract the dampened measurement. In order to validate this, we compare the ratios between the standard deviation of the observations (PWFS measurements) and the standard deviation of the suggested PO4AO actions. We define an estimate for the optical gain compensation: (14)
where std is the temporal standard deviation, the observations while running the integrator, λ the observing wavelength, and the PO4AO suggested actions. As PO4AO is a predictive control method, this quantity also includes the effect of the prediction, that is, it includes compensation for the temporal error as well. However, we can approximately cancel out the temporal error by comparing the ratio between optical gain estimates obtained at different wavelengths. The result of this experiment is shown in Fig. 8. We see that the empirical estimate for the optical gain sensitivity of PO4AO follows roughly the corresponding ratio of the two semi-analytically derived curves plotted in Fig. 2. In particular, we see that the lower-order modes are compensated more than the high-order modes. We, therefore, conclude that PO4AO adequately compensates for the optical gain effect of the PWFS.
Fig. 8 Sensitivity to the PWFS optical gain effect. The blue line corresponds to the ratio of the optical gain estimates at the two wavelengths. The red line is the ratio between the semi-analytically derived optical gains at the two wavelengths (see Sect. 4.1 and Fig. 2).
7 Magellan Adaptive Optics Extreme System
In addition to running the numerical simulations presented in the previous section, we also implemented PO4AO on the MagAO-X instrument. MagAO-X is an experimental coronagraphic extreme adaptive optics system that uses a woofer-tweeter architecture (ALPAO-97 DM as the woofer and Boston Micromachines 2K as the tweeter). We used a point source in the f/11 input focus to illuminate the DMs, Pyramid WFS, and science camera. Further, we placed a classical Lyot coronagraph with a 2.5 λ/D Lyot mask radius in front of the science camera. We set the PWFS’s modulation radius to 3λ/D, and the brightness of the guide star was adjusted to match the flux per frame which a 0th magnitude star would provide in 1 ms (i.e., for a system running at 1 kHz). We used a similar test setup as Haffert et al. (2021b) and ran our experiment by only controlling the woofer DM and injecting disturbances by running simulated phase screens across it. The phase screens were simulated as single-layer frozen flow turbulence with r_{0} of 16 cm at 500 nm. We experimented with three different single-layer wind profiles: 5, 15, and 30 m s^{–1}, where the wind speeds again correspond to a 1 kHz framerate.
The PO4AO is implemented with PyTorch and utilizes the Python interface of the MagAO-X RTC to pass data from CPU to GPU memory, do the PO4AO calculations on the GPU, and transfer the results back. The data transfer takes time and limits the achievable framerate in this setup to 100 Hz. RTC software that would run entirely on GPUs would not suffer from this limitation.
7.1 The Integrator
To retrieve the interaction matrix, we used the standard calibration process described in Sect. 4. From the interaction matrix, we derived the reconstruction matrix by Tikhonov regularization given by, (15)
where α is tuned manually. We also tuned the integrator gain manually for each wind profile.
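For orientation, a Tikhonov-regularized reconstructor can be computed as in the sketch below. It assumes the standard form C = (D^T D + α I)^{-1} D^T, where D denotes the interaction matrix; both the symbol D and this specific form are assumptions, since Eq. (15) is not reproduced here, and the matrix sizes are toy values.

```python
import numpy as np

def tikhonov_reconstructor(interaction_matrix, alpha):
    """Regularized pseudo-inverse (D^T D + alpha I)^(-1) D^T of the interaction matrix."""
    d = interaction_matrix
    n_act = d.shape[1]
    # Solving the linear system avoids forming an explicit matrix inverse.
    return np.linalg.solve(d.T @ d + alpha * np.eye(n_act), d.T)

rng = np.random.default_rng(5)
D = rng.standard_normal((40, 10))  # hypothetical: 40 WFS signals, 10 actuators

C_unregularized = tikhonov_reconstructor(D, alpha=0.0)
C_regularized = tikhonov_reconstructor(D, alpha=10.0)
print(np.linalg.norm(C_unregularized), np.linalg.norm(C_regularized))
```

With α = 0 and a full-rank interaction matrix, the result coincides with the Moore-Penrose pseudo-inverse; increasing α shrinks the reconstructor and damps poorly sensed modes, which is why α has to be tuned.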
7.2 Policy Optimization for Adaptive Optics
The structure of the MagAO-X experiment is similar to our numerical simulations. First, we trained the PO4AO for 50 episodes (25 000 frames) and then ran for an additional 5000 frames to compare the post-coronagraphic PSFs. We also used the 10-episode warm-up with the noisy integrator and the same NN architectures. Given the low number of actuators and the high-order PWFS, we set the number of past telemetry data (k and m) to 10, and instead of filtering 20% of the KL modes, for maximum performance, we only filter the piston mode in the policy output (see Fig. 3).
7.3 Results
We compare the performance of the PO4AO to the integrator in two ways: by looking at the training curves (see Fig. 9a) and by comparing the post-coronagraphic speckle variance (see Fig. 9c). The PO4AO achieves better performance than the integrator in all wind conditions after 10 k data frames (10 s in theoretical real time). The reward is proportional to the mean RMS of the reconstructed wavefront. We further examine the performance by comparing the post-coronagraphic images with the 30 m s^{–1} wind profile; see Figs. 9b,c. The residual intensities of the images (see Fig. 9b) are limited by NCPA. Therefore, instead of comparing the raw PSF contrast, we compare the temporal speckle variance of the methods (see Fig. 9c). We see a factor of 3–7 improvement in the speckle variance at 2.4–6λ/D. Given the inner working angle of the coronagraph and the DM’s control radius, that is where we would also expect to see the improvement. Further, these results are in line with the results from the numerical simulations.
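The reason for using the temporal variance when static aberrations dominate can be made concrete with a short sketch. The image cubes below are synthetic placeholders with arbitrary noise levels, and the function name is hypothetical; the point is only that a static pattern shared by all frames drops out of the per-pixel temporal variance.

```python
import numpy as np

def temporal_speckle_variance(image_cube):
    """Per-pixel variance over the time axis (axis 0) of a (frames, ny, nx) cube."""
    return np.var(image_cube, axis=0)

rng = np.random.default_rng(6)
n_frames, ny, nx = 500, 8, 8
static = rng.random((ny, nx))  # static pattern shared by all frames (e.g., NCPA speckles)

# Two synthetic controllers with different levels of temporal fluctuation.
cube_a = static + 0.2 * rng.standard_normal((n_frames, ny, nx))
cube_b = static + 0.1 * rng.standard_normal((n_frames, ny, nx))

var_a = temporal_speckle_variance(cube_a)
var_b = temporal_speckle_variance(cube_b)
improvement = float(var_a.mean() / var_b.mean())
print(f"variance ratio: {improvement:.2f}")
```

Halving the fluctuation amplitude gives a variance ratio close to four, regardless of the static pattern added to both cubes.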
8 Discussion
In conclusion, reinforcement learning is a promising approach for AO control that could be implemented in on-sky systems with already existing hardware. The algorithm we propose requires only a small amount of training data and maintains an acceptable performance even when the training conditions differ heavily from test time. Further, it has a high inference speed, capable of scaling to high-order instruments with up to 10k actuators. Thanks to the use of a relatively shallow convolutional NN, the inference time is just 300 µs with a modern laptop GPU. The inference time is also similar for an ELT-scale system with more than 10k actuators and for a VLT-scale system with “just” 1400 actuators.
The method was tested in numerical simulations and a lab setup and provides significantly improved post-coronagraphic contrast for both cases compared to the integrator. It is entirely data-driven, and in addition to predictive control, it can cope with modeling errors such as the optical gain effect and highly nonlinear wavefront sensing. Due to the constantly self-calibrating nature of the algorithm, it could turn AO control into a turnkey operation, where the algorithm maintains itself entirely automatically.
We showed that our method is robust to heavy data mismatch, but the performance is reduced for a short time while PO4AO is adapting to the evolution of external conditions. Such abrupt changes in wind conditions will rarely occur in the real atmosphere. Therefore, future work should also address maintaining the best possible performance under reasonably varying turbulence. The model learns on a scale of several seconds and can presumably adapt to changing atmospheric conditions at the same time scale. However, more research on the trade-off between model complexity and training speed is still needed. For example, a deeper NN model could generalize better to unseen conditions, while shallower NN models could learn new unseen conditions faster. Currently, the CNN model architectures themselves are not thoroughly optimized, and an exciting research topic would be to find the optimal CNN design to capture the AO control system dynamics for model-based RL. For example, U-Net-type CNN architectures (Ronneberger et al. 2015) and mixed-scale dense CNNs (Pelt & Sethian 2018) have shown excellent performance on imaging-related applications. On the other hand, we could utilize similar NN structures that have shown excellent performance in pure predictive control (Swanson et al. 2018, 2021). Such a study should consider a variety of different, preferably realistically changing atmospheric conditions and misalignments as well as prerecorded on-sky data.
As a caveat, the algorithm, like most deep RL methods, is somewhat sensitive to the choice of hyperparameters (e.g., number of layers in neural networks, learning rates, etc.). Moreover, control via deep learning is hard to analyze, and no stability bounds can be established.
Further development is needed to move from the laboratory to the sky. The method currently runs on a Python interface that has to pass data via the CPU on MagAO-X. To increase the speed of the implementation and the maximum framerates, we must switch to a lower-level implementation that runs both the real-time pipeline and the PO4AO control on the GPU using the same memory banks. In addition, the training procedure needs to run in parallel with the inference, which should be straightforward to implement.
To summarize, this work presents a significant step forward for XAO control with RL. It will allow us to increase the S/N, detect fainter exoplanets, and reduce the time it takes to observe them on ground-based telescopes. As astronomical telescopes become larger and larger, the choice of the AO control method becomes critically important, and data-driven solutions are a promising direction in this line of work. Deep learning and RL methods are transforming many fields, such as protein folding, inverse problems, and robotics, and there is potential for the same to happen for direct exoplanet imaging.
Fig. 9 MagAO-X experiment results. Panel a: training curves for PO4AO in the lab setup. The red lines are for PO4AO performance and the dashed blue line represents the average integrator performance over an episode. The dashed gray vertical line is where the policy is switched from the noisy integrator to PO4AO. For all wind conditions the PO4AO passes the integrator performance after 10 k frames of data. Panel b: MagAO-X post-coronagraphic PSFs of the methods. Left is for the integrator and right for the PO4AO. The PSFs are limited by NCPA and, in order to validate the method, we examined the temporal variance of the PSFs (see panel c). Panel c: temporal variance of MagAO-X post-coronagraphic PSFs. Upper images: temporal speckle variance at the image plane for both control methods (left: integrator, right: PO4AO). Lower image: radial average over the images. The blue line is for the integrator and the red line for the PO4AO. The gray vertical line represents the inner working angle of the coronagraph (radius 2.5λ/D).
Acknowledgments
The work of T.H. was supported by the Academy of Finland (decisions 320082 and 326961). Support for the work of S.H. was provided by NASA through the NASA Hubble Fellowship grant #HST-HF2-51436.001-A awarded by the Space Telescope Science Institute, which is operated by the Association of Universities for Research in Astronomy, Incorporated, under NASA contract NAS5-26555. We are very grateful for support from the NSF MRI Award #1625441 for MagAO-X. Finally, we would like to thank the referee for their insightful comments.
References
 Cantalloube, F., Por, E. H., Dohlen, K., et al. 2018, A&A, 620, L10 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Cavarroc, C., Boccaletti, A., Baudoz, P., Fusco, T., & Rouan, D. 2006, A&A, 447, 397 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Chambouleyron, V., Fauvarque, O., JaninPotiron, P., et al. 2020, A&A, 644, A6 [EDP Sciences] [Google Scholar]
 Chua, K., Calandra, R., McAllister, R., & Levine, S. 2018, in Advances in Neural Information Processing Systems, 4754 [Google Scholar]
 Conan, J.M., Raynaud, H.A.R., Kulcsár, C., Meimon, S., & Sivo, G. 2011, in Adaptive Optics for Extremely Large Telescopes (Singapore: World Scientific) [Google Scholar]
 Correia, C., Conan, J.M., Kulcsár, C., Raynaud, H.F., & Petit, C. 2010a, in 1st AO4ELT conferenceAdaptive Optics for Extremely Large Telescopes, EDP Sciences, 07003 [Google Scholar]
 Correia, C., Raynaud, H.F., Kulcsár, C., & Conan, J.M. 2010b, J. Opt. Soc. Am. A, 27, 333 [NASA ADS] [CrossRef] [Google Scholar]
 Correia, C. M., Bond, C. Z., Sauvage, J.F., et al. 2017, J. Opt. Soc. Am. A, 34, 1877 [NASA ADS] [CrossRef] [Google Scholar]
 Correia, C. M., Fauvarque, O., Bond, C. Z., et al. 2020, MNRAS, 495, 4380 [NASA ADS] [CrossRef] [Google Scholar]
 Deisenroth, M., & Rasmussen, C. E. 2011, in Proceedings of the 28th International Conference on machine learning (ICML11), Citeseer, 465 [Google Scholar]
 Deo, V., Gendron, É., Rousset, G., et al. 2019, A&A, 629, A107 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Deo, V., Gendron, É., Vidal, F., et al. 2021, A&A, 650, A41 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Dessenne, C., Madec, P.Y., & Rousset, G. 1998, Appl. Opt., 37, 4623 [NASA ADS] [CrossRef] [Google Scholar]
 Dressing, C. D., & Charbonneau, D. 2015, ApJ, 807, 45 [Google Scholar]
 Engl, H. W., Hanke, M., & Neubauer, A. 1996, Regularization of Inverse Problems (Berlin: Springer Science & Business Media), 375 [Google Scholar]
 Fauvarque, O., Neichel, B., Fusco, T., Sauvage, J.F., & Girault, O. 2017, J. Astron. Teles. Instrum. Syst., 3, 019001 [NASA ADS] [CrossRef] [Google Scholar]
 Fauvarque, O., JaninPotiron, P., Correia, C., et al. 2019, J. Opt. Soc. Am. A, 36, 1241 [Google Scholar]
 Fernandes, R. B., Mulders, G. D., Pascucci, I., Mordasini, C., & Emsenhuber, A. 2019, ApJ, 874, 81 [NASA ADS] [CrossRef] [Google Scholar]
 Ferreira, F., Gratadour, D., Sevin, A., & Doucet, N. 2018, in 2018 International Conference on High Performance Computing & Simulation (HPCS), IEEE, 180 [CrossRef] [Google Scholar]
 Fried, D. L. 1990, J. Opt. Soc. Am. A, 7, 1224 [NASA ADS] [CrossRef] [Google Scholar]
 Fusco, T., Rousset, G., Sauvage, J.F., et al. 2006, Opt. Exp., 14, 7515 [Google Scholar]
 Gal, Y., McAllister, R., & Rasmussen, C. E. 2016, in DataEfficient Machine Learning workshop (USA: ICML), 4, 25 [Google Scholar]
 Gendron, E. 1994, in European Southern Observatory Conference andWorkshop Proceedings, European Southern Observatory Conference and Workshop Proceedings, 48, 187 [NASA ADS] [Google Scholar]
 Give’on, A., Kern, B., Shaklan, S., Moody, D. C., & Pueyo, L. 2007, SPIE, 6691, 66910A [Google Scholar]
 Gray, M., & Le Roux, B. 2012, SPIE, 8447, 84471T [Google Scholar]
 Guyon, O. 2005, ApJ, 629, 592 [NASA ADS] [CrossRef] [Google Scholar]
 Guyon, O. 2018, Ann. Rev. Astron. Astrophys., 56, 315 [CrossRef] [Google Scholar]
 Guyon, O., & Males, J. 2017, AJ, accepted [arXiv:1707.00570] [Google Scholar]
 Haffert, S. Y., Males, J., Close, L., et al. 2021a, SPIE, 11823, 118231C [NASA ADS] [Google Scholar]
 Haffert, S. Y., Males, J. R., Close, L. M., et al. 2021b, J. Astron. Teles. Instrum. Syst., 7, 029001 [NASA ADS] [CrossRef] [Google Scholar]
 Heess, N., Wayne, G., Silver, D., et al. 2015, ArXiv eprints [arXiv:1510.09142] [Google Scholar]
 Heritier, C., Esposito, S., Fusco, T., et al. 2018, MNRAS, 481, 2829 [NASA ADS] [Google Scholar]
 Janner, M., Fu, J., Zhang, M., & Levine, S. 2019, ArXiv eprints [arXiv:1906.08253] [Google Scholar]
 Jolissaint, L. 2010, J. Euro. Opt. Soc., 5, 10055 [NASA ADS] [CrossRef] [Google Scholar]
 Kasper, M., Fedrigo, E., Looze, D. P., et al. 2004, J. Opt. Soc. Am. A, 21, 1004 [NASA ADS] [CrossRef] [Google Scholar]
 Ke, H., Xu, B., Xu, Z., et al. 2019, Optik, 178, 785 [NASA ADS] [CrossRef] [Google Scholar]
 Kingma, D. P., & Ba, J. 2014, International Conference for Learning Representations, San Diego, 2015 [Google Scholar]
 Korkiakoski, V., Vérinaud, C., & Le Louarn, M. 2008, Appl. Opt., 47, 79 [Google Scholar]
 Kulcsár, C., Raynaud, H.F., Petit, C., Conan, J.M., & Lesegno, P. V. D. 2006, Opt. Express, 14, 7464 [CrossRef] [Google Scholar]
 Lagrange, A. M., Gratadour, D., Chauvin, G., et al. 2009, A&A, 493, L21 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Lai, O., Chun, M., Dungee, R., Lu, J., & Carbillet, M. 2021, MNRAS, 501, 3443 [NASA ADS] [CrossRef] [Google Scholar]
 Landman, R., Haffert, S. Y., Radhakrishnan, V. M., & Keller, C. U. 2020, SPIE, 11448, 1144849 [NASA ADS] [Google Scholar]
 Landman, R., Haffert, S. Y., Radhakrishnan, V. M., & Keller, C. U. 2021, J. Astron. Teles. Instrum. Syst., 7, 039002 [NASA ADS] [CrossRef] [Google Scholar]
 Liu, X., Morris, T., & Saunter, C. 2019, in International Conference on Artificial Neural Networks (Berlin: Springer), 537 [Google Scholar]
 Maas, A. L., Hannun, A. Y., & Ng, A. Y. 2013, Proc. ICML, 30, 3 [Google Scholar]
 Macintosh, B., Graham, J. R., Barman, T., et al. 2015, Science, 350, 64 [Google Scholar]
 Madec, P.Y. 1999, Adaptive Optics in Astronomy (Cambridge: Cambridge University Press), 131 [CrossRef] [Google Scholar]
 Males, J. R., & Guyon, O. 2018, J. Astron. Teles. Instrum. Syst., 4, 019001 [NASA ADS] [CrossRef] [Google Scholar]
 Males, J. R., Close, L. M., Miller, K., et al. 2018, SPIE, 10703, 1070309 [NASA ADS] [Google Scholar]
 Marois, C., Racine, R., Doyon, R., Lafrenière, D., & Nadeau, D. 2004, ApJ, 615, L61 [NASA ADS] [CrossRef] [Google Scholar]
 Marois, C., Lafrenière, D., Doyon, R., Macintosh, B., & Nadeau, D. 2006, ApJ, 641, 556 [Google Scholar]
 Marois, C., Zuckerman, B., Konopacky, Q. M., Macintosh, B., & Barman, T. 2010, Nature, 468, 1080 [NASA ADS] [CrossRef] [Google Scholar]
 Mawet, D., Pueyo, L., Lawson, P., et al. 2012, SPIE Conf. Ser., 8442, 844204 [NASA ADS] [Google Scholar]
 Nagabandi, A., Kahn, G., Fearing, R. S., & Levine, S. 2018, in 2018 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 7559 [CrossRef] [Google Scholar]
 Nousiainen, J., Rajani, C., Kasper, M., & Helin, T. 2021, Opt. Express, 29, 15327 [NASA ADS] [CrossRef] [Google Scholar]
 Otten, G. P. P. L., Vigan, A., Muslimov, E., et al. 2021, A&A, 646, A150 [EDP Sciences] [Google Scholar]
 Paschall, R. N., & Anderson, D. J. 1993, Appl. Opt., 32, 6347 [NASA ADS] [CrossRef] [Google Scholar]
 Paul, B., Sauvage, J.F., & Mugnier, L. 2013, A&A, 552, A48 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Pelt, D. M., & Sethian, J. A. 2018, Proc. Natl. Acad. Sci., 115, 254 [CrossRef] [Google Scholar]
 Pou, B., Ferreira, F., Quinones, E., Gratadour, D., & Martin, M. 2022, Opt. Express, 30, 2991 [NASA ADS] [CrossRef] [Google Scholar]
 Poyneer, L. A., & Macintosh, B. 2004, J. Opt. Soc. Am. A, 21, 810 [NASA ADS] [CrossRef] [Google Scholar]
 Poyneer, L., & Véran, J.P. 2008, J. Opt. Soc. Am. A, 25, 1486 [NASA ADS] [CrossRef] [Google Scholar]
 Poyneer, L. A., Macintosh, B. A., & Véran, J.P. 2007, J. Opt. Soc. Am. A, 24, 2645 [NASA ADS] [CrossRef] [Google Scholar]
 Poyneer, L., van Dam, M., & Véran, J.P. 2009, J. Opt. Soc. Am. A, 26, 833 [NASA ADS] [CrossRef] [Google Scholar]
 Ragazzoni, R. 1996, J. Mod. Opt., 43, 289 [Google Scholar]
 Ronneberger, O., Fischer, P., & Brox, T. 2015, in International Conference on Medical Image Computing and Computerassisted Intervention (Berlin: Springer), 234 [Google Scholar]
 Sinquin, B., Prengère, L., Kulcsár, C., et al. 2020, MNRAS, 498, 3228 [NASA ADS] [CrossRef] [Google Scholar]
 Snellen, I., de Kok, R., Birkby, J. L., et al. 2015, A&A, 576, A59 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Sun, Z., Chen, Y., Li, X., Qin, X., & Wang, H. 2017, Opt. Commun., 382, 519 [NASA ADS] [CrossRef] [Google Scholar]
 Swanson, R., Lamb, M., Correia, C., Sivanandam, S., & Kutulakos, K. 2018, SPIE, 10703, 107031F [NASA ADS] [Google Scholar]
 Swanson, R., Lamb, M., Correia, C. M., Sivanandam, S., & Kutulakos, K. 2021, MNRAS, 503, 2944 [NASA ADS] [CrossRef] [Google Scholar]
 van Kooten, M., Doelman, N., & Kenworthy, M. 2017, Performance of AO predictive control in the presence of nonstationary turbulence (Instituto de Astrofisica de Canarias) [Google Scholar]
 van Kooten, M., Doelman, N., & Kenworthy, M. 2019, J. Opt. Soc. Am. A, 36, 731 [NASA ADS] [CrossRef] [Google Scholar]
 Vérinaud, C. 2004, Opt. Commun., 233, 27 [Google Scholar]
 Wong, A. P., Norris, B. R., Tuthill, P. G., et al. 2021, J. Astron. Teles. Instrum. Syst., 7, 019001 [NASA ADS] [CrossRef] [Google Scholar]
 Xu, Z., Yang, P., Hu, K., Xu, B., & Li, H. 2019, Appl. Opt., 58, 1998 [NASA ADS] [CrossRef] [Google Scholar]
Exoplanet Orbit Database: http://exoplanets.org/