FlopPITy : Enabling self-consistent exoplanet atmospheric retrievals with machine learning

Context. Interpreting the observations of exoplanet atmospheres to constrain physical and chemical properties is typically done us-ing Bayesian retrieval techniques. Since these methods require many model computations, a compromise must be made between the model’s complexity and its run time. Achieving this compromise leads to a simpliﬁcation of many physical and chemical processes (e.g. parameterised temperature structure). Aims. Here, we implement and test sequential neural posterior estimation (SNPE), a machine learning inference algorithm for atmospheric retrievals for exoplanets. The goal is to speed up retrievals so they can be run with more computationally expensive atmospheric models, such as those computing the temperature structure using radiative transfer. Methods. We generated 100 synthetic observations using ARtful Modeling Code for exoplanet Science (ARCiS), which is an atmospheric modelling code with the ﬂexibility to compute models across varying degrees of complexity and to perform retrievals on them to test the faithfulness of the SNPE posteriors. The faithfulness quantiﬁes whether the posteriors contain the ground truth as often as we expect. We also generated a synthetic observation of a cool brown dwarf using the self-consistent capabilities of ARCiS and ran a retrieval with self-consistent models to showcase the possibilities opened up by SNPE. Results. We ﬁnd that SNPE provides faithful posteriors and is therefore a reliable tool for exoplanet atmospheric retrievals. We are able to run a self-consistent retrieval of a synthetic brown dwarf spectrum using only 50000 forward model evaluations. We ﬁnd that SNPE can speed up retrievals between ∼ 2 × and ≥ 10 × depending on the computational load of the forward model, the dimensionality of the observation, and its signal-to-noise ratio (S / N). We have made the code publicly available for the community on Github.


Introduction
Interpreting exoplanet and brown dwarf observations to estimate the physical and chemical properties of their atmospheres is typically done using Bayesian inference to find the joint posterior probability distribution of model parameters.This posterior is traditionally found using sequential sampling-based inference methods, most commonly Markov Chain Montecarlo (MCMC) or different nested sampling algorithms, particularly Multinest (Feroz et al. 2009).This is a computationally expensive process, often requiring hundreds of thousands to millions of forward model evaluations to converge.
The high computational expense essentially limits the use of complex atmospheric models in retrievals, making it necessary to reach a compromise between model complexity and compute time.The model complexity arises from including more realistic physics, e.g.computing self-consistently the temperature structure and cloud formation or including disequilibrium chemistry.Models that take more than a second to run already push a single retrieval to between a day and two weeks depending on the number of model evaluations needed.In addition, higher spectral resolution or larger wavelength coverage with new instruments increase the size of the data set, and are often coupled with higher sensitivities e.g. by JWST (Gardner et al. 2006).Together, these cause inference methods to require more model evaluations to converge, easily reaching 10 7 evaluations based on retrievals in Barrado et al. (2023).
These previous studies have shown that machine learning can provide constraints compatible with those obtained with Multinest.In particular, Ardévol Martínez et al. (2022) showed the parameter constraints obtained with machine learning retrievals to be extremely reliable.However in this previous approach the posterior was approximated by a multivariate Gaussian.Although with good enough data it would be a reasonable assumption, this is not yet the case for exoplanet atmospheres, so the reliability came at the expense of inaccurate posterior shapes.To fix this, Vasist et al. (2023) developed a normalizing flow based retrieval framework that removed the assumption of gaussianity and is able to reproduce the shape of nested sampling posteriors.
The vast majority of previous approaches have been amortised estimators.This means that after training, they are suitable to perform inference on observations occupying any region of parameter space.To accomplish this a large enough training set covering the full parameter space in sufficient resolution is required.As we already made explicit in Ardévol Martínez et al. (2022) Section 3.4, the need for a large training set leads to inflexibility with regards to the atmospheric model used.
To counter this shortcoming, Yip et al. (2022) developed a machine learning retrieval framework based on variational inference and normalising flows that was able to reproduce accurately nested sampling posteriors using <10% of the forward models while retaining the same flexibility.However, this approach requires the atmospheric model to be differentiable and so one would need to either use Diff-τ (the differentiable model developed by Yip et al. 2022), or develop new differentiable models, limiting its usability.
Here we present FlopPITy (normalising FLOw exoPlanet Parameter Inference ToolkYt), a machine learning retrieval framework based on neural spline flows (Durkan et al. 2019) and sequential neural posterior estimation (SNPE, Greenberg et al. 2019).This approach retains the flexibility of sampling-based methods while requiring only a fraction of the forward model evaluations.Additionally, it works with any atmospheric modelling code without the need to rewrite it or adapt it.
In this letter, we first describe briefly the machine learning approach we use in Section 2. In Section 3 we apply FlopPITy to synthetic observations to test and showcase its performance.Section 4 discusses the implications of these results and the advantages and shortcomings of our method.Finally, in Section 5 we present our conclusions.

(Sequential) Neural Posterior Estimation
FlopPITy1 (normalising FLOw exoPlanet Parameter Inference ToolkYt) is a new tool for exoplanet atmospheric retrievals that uses sequential neural posterior estimation.In particular, we use Automatic Posterior Transformation (APT or SNPE-C, Greenberg et al. 2019) as implemented in the python package SBI (Tejero-Cantero et al. 2020).SNPE-C belongs to a larger group of likelihood-free inference methods, which are useful when the likelihood function is intractable.In our case the likelihood can be calculated, so the advantage of SNPE is the speed-up that the use of machine learning provides.SNPE-C approximates the posterior distribution p(θ|x) with the distribution q F(x,ϕ) (θ|x), where q is a density family2 and F is a neural network with weights ϕ.In this work we use normalizing flows (Papamakarios et al. 2021) for our posterior estimator q F .See Appendix A in Vasist et al. (2023) for a brief overview on normalizing flows.More specifically, we use neural spline flows (Durkan et al. 2019), for which a more palatable explanation can be found in Green & Gair (2021).

Multi-round training
Amortised estimators such as those presented in Ardévol Martínez et al. (2022) and Vasist et al. (2023) are incredibly convenient as once trained, inference can be carried out almost instantly for any observation.However, they are relatively inflexible.Observational details and data processing choices (e.g.wavelength range used, spectral resolution, noise properties) are fixed during training, which means that for real world usage the estimator needs to be retrained with the observational properties of each observation.The main source of inflexibility comes from the need to pre-compute large training sets covering the whole prior parameter space, limiting the use cases for the trained estimator.This is because simple changes like adding extra chemical species to the models imply computing a whole new training set.Additionally this large upfront computational cost, although lower than would be required for Multinest, can still be unfeasible for more complex models.
Here we present a non-amortised approach that is as flexible as traditional sampling-based retrievals.Like the latter, it requires computing new forward models for every retrieval.However it needs only a fraction of them compared to nested sampling, resulting in a significant speed-up.
We use SNPE-C with multi-round training as described in Greenberg et al. (2019).The way it works is as follows: initially, N parameter vectors θ i are drawn from the prior, and corresponding forward models x i are computed.These are used to train the estimator and obtain a first approximation to the posterior.From this posterior, we draw another N samples and compute the corresponding forward models, which we use to improve the training and obtain an improved estimate for the posterior distribution.Figure 1 illustrates how the method works.After the first round of training the estimator is no longer amortised but is only suitable for the specific observation we are analysing.This process is repeated for a set number of rounds.Unfortunately, because after the first round we are not sampling from the prior p(θ) but rather a so-called proposal distribution p(θ), the resulting distribution is no longer the true posterior p(θ|x) but rather a so-called proposal posterior: where (2) SNPE-C automatically transforms between estimates of the true posterior and the proposal posterior, making it easy to sample the estimated true posterior.The interested reader can consult Greenberg et al. (2019) for the details of how this is accomplished.
The observational error is accounted for by adding noise to the simulated spectra.Training is done on noisy samples x ′ i = x i + ε with ε ∈ N(0, σ 0 ), where x i are the simulated spectra, N(µ, σ) is the normal distribution, and σ 0 the observed noise.
Because the network has already learnt the noise, at the time of inference only x 0 is passed.

Mock retrievals
We test FlopPITy on two scenarios.First, we run retrievals using a simple and fast atmospheric model to test its reliability.Second, we run a retrieval using a complex, computationally expensive atmospheric model to showcase the science cases that this methods enables.

Bulk retrievals
We perform retrievals on a hundred synthetic NIRSpec PRISM spectra to test the faithfulness of our method.For the purpose of illustration, we use a relatively simple (computationally fast) atmospheric model to minimise the computational load.The synthetic observations are generated using ARCiS (Min et al. 2020) with an isothermal temperature structure T (p) = T 0 , free H 2 O and CO 2 abundances, and the radius R P and log g of the planet.The parameter ranges from which the synthetic spectra were sampled are the same as the prior ranges used for the retrievals, and are shown in Table A.1.The spectroscopic channels and observational noise are taken from the FIREFLy reduction in Rustamkulov et al. (2023).
To perform the retrievals, we train FlopPITy in 20 rounds using 1,000 simulations in each round.Such a high number of rounds is not necessary but this allows us to check if training for more rounds than necessary leads to overconfidence.We also perform Multinest retrievals on the simulated observations to have a baseline to compare against.
We want to check that the posteriors produced by our method are correctly estimated and are not too broad or too narrow.We do this by calculating the expected coverage probability following Hermans et al. (2022) and Vasist et al. (2023).The coverage probability is the probability of a certain confidence region containing the ground truth.If the posteriors were correctly estimated, a region of the posterior with a fraction (1 − α)% of the probability would contain the ground truth (1 − α)% of the time.Figure 2 shows the coverage probability for the posteriors produced at each round.Its interpretation is simple: if the coverage probability is below the diagonal, the posteriors are overconfident (too narrow), whereas if the coverage probability is above, posteriors are underconfident (too wide).
Although the curves in Fig. 2  shows that the posteriors do not become increasingly overconfident with subsequent training iterations, but are reliable at each step.Figure 2 seems to indicate that at high credibility levels, FlopPITy is slightly overconfident when compared to Multinest.A larger number of mock retrievals would need to be run to be able to ascertain whether it is a real effect or just an artifact.If real, it would indicate that the probability in the wings of the posteriors is underestimated and therefore care should be taken not to overinterpret them.

Self-consistent retrieval
To showcase the real power of FlopPITy, we perform a retrieval with a self-consistent setup on a simulated observation of a brown dwarf.We generate the synthetic observation using the self-consistent capabilities of ARCiS.We assume a 1D (cloud free) atmosphere in radiative-convective equilibrium (with an effective temperature T eff ) as well as thermochemical equilibrium computed from the carbon to oxygen ratio C/O and metallicity Z, and include chemical disequilibrium due to vertical mixing (Kawashima & Min 2021).The vertical mixing is parameterised by the eddy diffusion coefficient K zz .We use the best fitting parameters for the cool Y dwarf WISE 1828+2650 (De Furio et al. 2023).We generate a spectrum with R = 1, 000 in the wavelength range 5−18 µm.We assume an error on the flux of 10 −5 Jy (corresponding to an SNR between ∼ 0 and ∼ 40 depending on the wavelength), which is representative of what can be achieved in most of the MIRI MRS wavelength range when binned down to R = 1, 000 (Barrado et al. 2023).The model parameters and their priors are shown in Table A.2.We do not show a comparison to a Multinest retrieval as it is not computationally feasible.
We train FlopPITy in ten rounds, using 5,000 forward models in each round.We use a larger simulation budget than in the previous case because the mapping from atmospheric parameters to spectra is more complex.In the simple model case, changes in parameters directly affect the spectrum, e.g. a higher molecular abundance increases the amplitude of the features in the spectrum.For the self-consistent model this is no longer the case.For example, the molecular abundances (which influence the amplitude of the spectral features) depend on the internal temperature, the carbon-to-oxygen ratio, the metallicity, and the vertical diffusion.
To account for possible biases due to the noise realization of an individual observation, we generate five different noisy observations from the simulated spectrum and aggregate the posteriors retrieved.The corresponding corner plot can bee seen in Fig. 3, with all parameters being faithfully recovered.
Figure 3 shows that for the SNR considered, we can get errors of ∼ 1% on crucial evolutionary properties, such as T eff , R and log g.This is only the case when the model perfectly reproduces the observation.For real data, this will virtually never be the case as there will always be physical/chemical processes not taken into account by the model.In this case it is unknown what level of precision and bias one should expect, and even to which extent the retrievals remain reliable.There might also be differences in performance between FlopPITy and Multinest, making one of them a better option.These questions will be explored in future work.

Discussion
We have shown that FlopPITy, an implementation of SNPE-C with neural spline flows, is a reliable method for exoplanet and brown dwarf atmospheric retrievals and it allows us to perform retrievals using forward models that are too slow for traditional sampling-based retrievals.In particular, the retrieval performed on the simulated brown dwarf spectrum (Section 3.2) used 50,000 simulations in total, taking ∼18 h to complete on a 2020 M1 MacBook Pro.This time was split in ∼15 h for the computation of the simulations and ∼3 h for training.Based on the retrievals ran in Barrado et al. (2023), we can estimate that the same retrieval with Multinest would require between 500,000 and 1,500,000 model evaluations to converge.Since each forward model requires ∼3 s to run, such a retrieval would take at least ∼ 20 days to converge.
As a comparison, the simple retrievals took ∼ 3 h each in a cluster with 48 AMD Opteron 6348 processors, out of which ∼ 1 h 25 min were used for training.Their Multinest counterparts typically needed between 20,000 and 60,000 forward models to converge, taking between 3 and 8 hours.Higher SNR observations typically require more forward models for Multinest to converge.The lower (to no) computational advantage of FlopPITy over Multinest in this case is due to two factors.First, having a simpler dataset (with a factor ∼ 6 lower spectral resolution) results in Multinest requiring fewer model evaluations to converge.Second, using a faster forward model (∼0.5 s versus ∼3 s) causes training to represent a significant fraction of the time needed for the FlopPITy retrieval.
This means that for simpler datasets and faster forward models, our method does not provide a substantial speed up compared to Multinest, and for very simple datasets (e.g.HST WFC3 spectra), Multinest might even be preferred.Conversely, FlopPITy enables analyses of complex datasets with computationally costly atmospheric models that would otherwise not be feasible with sampling-based retrieval methods.
Additionally, Multinest has the disadvantage that models need to be computed sequentially, so it can not be parallelised 3 .SNPE does not have this limitation, and the computation of the forward models can be spread over multiple CPUs, further speeding up retrievals.
The sequential approach presented here is not universally preferable to amortised approaches such as the one in Vasist et al. (2023).In particular, if the goal is to perform retrievals on a large number of spectra using the same atmospheric model (as will undoubtedly be the case for future missions such as ARIEL, Tinetti et al. 2018), an amortised approach will be computationally more efficient.However for the exploration of a single dataset, our sequential approach is more appropriate as the extra flexibility allows to more easily try and compare different atmospheric models with different assumptions.
The main drawback of FlopPITy is the need to choose how many rounds to train for, and how many training examples to use per round.Fortunately, as we have shown in Section 3.1, the posteriors do not become overconfident by training for too many rounds.Nevertheless training for more rounds than necessary represents an unnecessary computational expense that we would like to avoid.A way to gauge the convergence of the retrieval is to compare the posteriors at consecutive rounds, and train for a few more rounds if there is still significant variation.The 1σ, 2σ and 3σ envelopes of the retrieved spectra can also be 3 A small level of parallelization can be achieved by simultaneously drawing N points at each nested sampling iteration, with 1/N being an estimate of the sampling efficiency.used to ensure that the posterior is not over-dispersed.Regarding the amount of training examples, ideally one would choose as many as computationally affordable.Due to the stochastic nature of the random sampling of parameter vectors in the prior and proposal distributions, using too few examples will result in an uneven coverage of the parameter space, which could have a negative effect on the results.
Finally, it is still unclear how FlopPITy or nested sampling respond to adversarial examples.These are observations with features that are not well reproduced by the model, which we coined 'uncomfortable retrievals' in Ardévol Martínez et al. (2022).There we showed that machine learning was more reliable than Multinest when the observations were not well reproduced by the underlying atmospheric model.However, this came at the expense of very broad posteriors, so the information gain over the prior was limited (which is still preferred to biased posteriors).In future work we will explore if FlopPITy can remain reliable in this scenario while being more informative than the machine learning retrievals in Ardévol Martínez et al. (2022), or if it instead behaves more similarly to Multinest.

Conclusions
In this letter we present FlopPITy, a new machine learning retrieval tool that uses normalizing flows and multi-round training, and we show that it works reliably.
In contrast with previous machine learning retrieval methods, this method retains the flexibility of sampling-based methods and is applicable to any atmospheric model, requiring only a fraction of the forward models needed by the latter, with the exact fraction depending on the specific observation and model used.When performing individual retrievals, the sequential approach presented here requires fewer models to train on than amortised approaches, as the computational effort is focused in the relevant regions of parameter space.However, once trained, amortised estimators are able to perform retrievals almost instantly and are therefore better suited for the analysis of large datasets with the same forward model.
Our method enables retrievals of high quality observations, such as those provided by JWST, with computationally costly forward models, e.g.self-consistent temperature structure or cloud formation.This reduces the need to simplify atmospheric models to speed them up for retrievals, although of course it does not eliminate it completely, as still tens of thousands of forward models need to be computed in a reasonable time frame.Additionally, unlike with sampling-based retrievals, the computation of these models can be parallelised over multiple CPUs, further speeding up a retrieval.
This work highlights the avenues that machine learning is opening for characterizing exoplanets and expands the suite of existing machine learning retrieval methods and their use cases.

Fig. 1 .
Fig. 1.Schematic representation of the iterative training of a neural spline flow.In the very first iteration, the proposal distribution is the prior.

Fig. 2 .
Fig. 2. Coverage probability of SNPE posteriors at different training rounds compared to Multinest.The red dashed line denotes the 1:1 line.All the lines are close to the diagonal, indicating that both the Multinest and SNPE posteriors are faithful.Importantly, the latter remain faithful at every round.

Fig. 3 .
Fig. 3. Posteriors for five noise realizations of a synthetic WISE 1828+2650 spectrum.The input parameters of the synthetic spectrum are denoted by the black lines and are: T eff = 325 K, R P = 1.83 R J , log g = 3.6, log K zz = 7, C/O= 0.55 and log Z = 0. To keep the figure readable, we show only the 2σ contours.The quantiles shown in the titles correspond to one of the noisy realisations.