A&A
Volume 686, June 2024
Article Number: A89
Number of pages: 35
Section: Planets and planetary systems
DOI: https://doi.org/10.1051/0004-6361/202347613
Published online: 31 May 2024
How do wavelength correlations affect transmission spectra? Application of a new fast and flexible 2D Gaussian process framework to transiting exoplanet spectroscopy
^{1} School of Physics, Trinity College Dublin, University of Dublin, Dublin 2, Ireland
email: fortunma@tcd.ie
^{2} Center for Computational Astrophysics, Flatiron Institute, New York, NY, USA
^{3} School of Information and Physical Sciences, University of Newcastle, Callaghan, NSW, Australia
^{4} Max Planck Institute for Astronomy, Königstuhl 17, 69117 Heidelberg, Germany
Received: 31 July 2023
Accepted: 22 February 2024
The use of Gaussian processes (GPs) is a common approach to account for correlated noise in exoplanet time series, particularly for transmission and emission spectroscopy. This analysis has typically been performed for each wavelength channel separately, with the retrieved uncertainties in the transmission spectrum assumed to be independent. However, the presence of noise correlated in wavelength could cause these uncertainties to be correlated, which could significantly affect the results of atmospheric retrievals. We present a method that uses a GP to model noise correlated in both wavelength and time simultaneously for the full spectroscopic dataset. To make this analysis computationally tractable, we introduce a new fast and flexible GP method that can analyse 2D datasets when the input points lie on a (potentially non-uniform) 2D grid – in our case a time by wavelength grid – and the kernel function has a Kronecker product structure. This simultaneously fits all light curves and enables the retrieval of the full covariance matrix of the transmission spectrum. Our new method can avoid the use of a ‘common-mode’ correction, which is known to produce an offset to the transmission spectrum. Through testing on synthetic datasets, we demonstrate that our new approach can reliably recover atmospheric features contaminated by noise correlated in time and wavelength. In contrast, fitting each spectroscopic light curve separately performed poorly when wavelength-correlated noise was present. It frequently underestimated the uncertainty of the scattering slope and overestimated the uncertainty in the strength of sharp absorption peaks in transmission spectra. Two archival VLT/FORS2 transit observations of WASP-31b were used to compare these approaches on real observations.
Our method strongly constrained the presence of wavelength-correlated noise in both datasets, and significantly different constraints on atmospheric features such as the scattering slope and strength of sodium and potassium features were recovered.
Key words: methods: data analysis / methods: statistical / techniques: spectroscopic / planets and satellites: atmospheres / stars: individual: WASP-31
© The Authors 2024
Open Access article, published by EDP Sciences, under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1 Introduction
Low-resolution transmission spectroscopy has been a powerful technique for probing the atmospheric composition of exoplanets ever since the first detection of an exoplanet atmosphere (Charbonneau et al. 2002). The technique relies upon observing an exoplanet transit – when an exoplanet appears to pass in front of its host star – and analysing the decrease in flux during the transit as a function of wavelength. The resulting transmission spectrum contains information about the planetary atmosphere (Seager & Sasselov 2000; Brown 2001).
The field of exoplanet atmospheres has entered a new era due to the recent launch of JWST. Early Release Science observations of WASP-39b using JWST NIRSpec’s PRISM mode have produced a 33σ detection of H_{2}O in addition to strong detections of Na, CO, and CO_{2} (Rustamkulov et al. 2023) – far exceeding what had been achieved with previous ground-based and space-based observations (Nikolov et al. 2016; Sing et al. 2016). As the exoplanet community pushes JWST further to its limits towards smaller terrestrial planets (e.g. Greene et al. 2023; Zieba et al. 2023), careful treatment of systematics – astrophysical or instrumental – will become increasingly important.
Gaussian processes (GPs) were introduced in Gibson et al. (2012b) to account for the uncertainty that correlated noise – also referred to as systematics – produces in the resulting transmission spectrum in a statistically robust way. GPs have been shown to provide more reliable estimates of uncertainties when compared to other common techniques such as linear basis models (Gibson 2014). This difference may help explain contradictory results in the field. For example, WASP-31b was reported to have a strong potassium signal at 4.2σ using data from the Hubble Space Telescope (HST; Sing et al. 2015), which were analysed using linear basis models. However, follow-up measurements in Gibson et al. (2017) with the FOcal Reducer and low dispersion Spectrograph (FORS2) on the Very Large Telescope (VLT; Appenzeller et al. 1998) found no evidence of potassium. High-resolution observations in Gibson et al. (2019) using the Ultraviolet and Visual Echelle Spectrograph (UVES) on the VLT as well as low-resolution observations from the Inamori-Magellan Areal Camera and Spectrograph (IMACS) on Magellan (McGruder et al. 2020) also failed to reproduce this detection. These results are more consistent with the reanalysis of the HST data using GPs, which reduced the significance of the potassium signal to 2.5σ (Gibson et al. 2017), demonstrating the importance of careful treatment of systematics.
In addition to inconsistent detections of species, conflicting measurements of the slope of the transmission spectrum are also common (e.g. Sedaghati et al. 2017, 2021; Espinoza et al. 2019). A slope in the transmission spectrum can be caused by Rayleigh scattering or by scattering from aerosols such as clouds or hazes in the atmosphere (Lecavelier Des Etangs et al. 2008), and it is therefore often referred to as a scattering slope. However, stellar activity can also produce an apparent scattering slope, which is typically used to explain these contradictory results (Espinoza et al. 2019; McCullough et al. 2014; Rackham et al. 2018). It is worth considering whether these effects could instead be caused by systematics, because it is difficult to obtain direct evidence that stellar activity is the cause of these contradictions. This could be particularly relevant for measurements of extreme scattering slopes such as in May et al. (2020), where the authors note that a combination of atmospheric scattering and stellar activity still struggles to explain the observed slope.
One potential issue with current data analyses is that each transit depth in the transmission spectrum is fit separately and is assumed to have an independent uncertainty. Ih & Kempton (2021) studied the effect on retrieved atmospheric parameters if this assumption is incorrect and transit depth uncertainties are correlated. The authors suggest that both instrumental and stellar systematics could cause correlations in transmission spectrum uncertainties, which they showed could significantly impact atmospheric retrievals, but they did not provide a method for retrieving these correlations. Holmberg & Madhusudhan (2023) report the presence of wavelength-correlated systematics in observations of WASP-39b and WASP-96b using the Single Object Slitless Spectroscopy (SOSS) mode of JWST’s Near Infrared Imager and Slitless Spectrograph (NIRISS). They demonstrate why these systematics should result in correlated uncertainties in the transmission spectrum and make multiple simplifying assumptions to derive an approximate covariance matrix of the transmission spectrum, although their method cannot account for both time-correlated and wavelength-correlated systematics.
In this work, we demonstrate a statistically robust way to account for the presence of both time-correlated and wavelength-correlated systematics, and apply it to both simulated datasets and real transit observations of WASP-31b from VLT/FORS2 (originally analysed in Gibson et al. 2017). We model the noise present across the full dataset as a Gaussian process. By simultaneously fitting all spectroscopic light curves for transit depth, we can also explore the joint posterior of all transit depths using Markov chain Monte Carlo (MCMC) and recover the covariance matrix of the transmission spectrum. We present an efficient optimisation that can dramatically speed up the required log-likelihood calculation, based on work in Saatchi (2011) and Rakitsch et al. (2013). We assume that the inputs to the kernel function of the GP lie on a 2D grid – such as a time by wavelength grid – and refer to this as a 2D GP. Intuitively, this assumption is valid for many datasets that can be neatly arranged in a 2D grid; for example, an image typically has a 2D grid structure where the inputs describing correlations could be chosen as the x and y coordinates of each pixel. It is not required that the grid is uniform; that is to say, for transmission spectroscopy neither the time points nor the wavelengths need to be uniformly separated. In contrast, we refer to GPs where the input(s) all lie along a single dimension such as time as 1D GPs. This includes the standard approach of fitting individual transit light curves using a GP.
One example where 2D GPs have already been used is radial velocity analysis, where multiple parallel time series may be jointly fit with a GP (Rajpaul et al. 2015). In this case, time is one of the dimensions and the other dimension can be considered to consist of a small number (~3) of different observables stacked together, such as radial velocity and any suitable activity indicators. This is in contrast to transmission spectroscopy, where dozens or even hundreds of parallel time series may be observed simultaneously in the form of different wavelength channels, and the number of observations in time may also be greater. Unfortunately, the algorithms developed for radial velocity either do not scale to the significantly larger datasets encountered in transmission spectroscopy (as noted in Barragán et al. 2022), or else they make strong assumptions about the form of correlations in both dimensions (Gordon et al. 2020; Delisle et al. 2022) which were not sufficient for our analysis of real observations.
In particular, Gordon et al. (2020) introduced a 2D GP method that could be applied to transmission spectroscopy and scales better than the method introduced here, but under the much stronger assumption that the shape of the correlated noise is identical at different wavelengths and can only change in amplitude. With weaker assumptions about the correlation in wavelength, their method has worse scaling in the number of wavelength channels. The covariance matrix in the time dimension is also limited to celerite kernels, as introduced in Foreman-Mackey et al. (2017). Similarly, the extension of the Gordon et al. (2020) method derived in Delisle et al. (2022) is still too limited to apply the kernel function used to fit the VLT/FORS2 data in this work. Our method has significant advantages for transmission spectroscopy as it is computationally tractable even when the length of both dimensions is of the order of hundreds of points and can be used with any general covariance matrices describing the noise in each dimension.
This paper is laid out as follows: Sect. 2 gives a brief introduction to Gaussian processes and the challenges of scaling them to two-dimensional datasets, and outlines the mathematics of the 2D GP method used in this paper; Sect. 3 compares our method to standard approaches on simulated datasets containing wavelength-correlated noise. Finally, Sect. 4 reanalyses archival VLT/FORS2 observations using this method, and the discussion and conclusions are presented in Sects. 5 and 6.
The code developed for this work has been made available as a Python package called LUAS^{1}. It is available on GitHub^{2} along with documentation and tutorials.
2 2D Gaussian processes for transmission spectroscopy
2.1 Introduction to 2D Gaussian processes
To fit our transit observations we first model the transit using a deterministic function. This function is typically referred to as the ‘mean function’ in the context of GPs and denoted by μ. This light curve model includes a transit depth parameter for each spectroscopic light curve, which we fit for to obtain our transmission spectrum (see Sec. 3.1 for more details on the light curve model). When using 2D GPs, we combine light curve models for each spectroscopic light curve to form our 2D mean function μ(λ, t, θ), which is a function of model parameters θ as well as the wavelength channels λ and the times of our flux observations t.
We model the noise in our observations as a Gaussian process. This means that if we take any two arbitrary input locations x_{i} and x_{j} out of the collection of points in a dataset D, we assume that the noise observed at these points follows a multivariate Gaussian distribution with the covariance given by a kernel function of our choosing. In the case of 1D GPs fitting a single transit light curve, the covariance might be described by the commonly used squared-exponential kernel as a function of the time separation of each data point, giving the covariance matrix:
$$K_{ij} = h^2 \exp\left(-\frac{(t_i - t_j)^2}{2 l_t^2}\right) + \sigma^2 \delta_{ij}, \quad (1)$$
where h is the height scale (or amplitude) of the correlated noise, t_{i} and t_{j} are the times of x_{i} and x_{j}, l_{t} is the length scale of the correlated noise in time, and we have also included a white noise term with variance σ^{2} added to the diagonal (δ_{ij} is the Kronecker delta). The choice of kernel function is problem-specific, with a range of kernel functions to choose from (see Rasmussen & Williams 2006).
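As a minimal illustration of Eq. (1), the covariance matrix can be built directly in NumPy (the amplitude and length-scale values below are hypothetical, chosen only for the example):

```python
import numpy as np

def squared_exponential_K(t, h, l_t, sigma):
    """Build the covariance matrix of Eq. (1):
    K_ij = h^2 exp(-(t_i - t_j)^2 / (2 l_t^2)) + sigma^2 delta_ij."""
    dt = t[:, None] - t[None, :]          # matrix of pairwise time separations
    return h**2 * np.exp(-dt**2 / (2 * l_t**2)) + sigma**2 * np.eye(len(t))

# hypothetical values: 1000 ppm correlated noise, 500 ppm white noise
t = np.linspace(-0.1, 0.1, 100)           # times in days
K = squared_exponential_K(t, h=1e-3, l_t=0.02, sigma=5e-4)
```

The resulting matrix is symmetric and positive definite, with diagonal entries h² + σ².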
If we now consider fitting multiple spectroscopic light curves with a GP, then any two flux measurements have both a time separation and a wavelength separation. We can combine correlations in both time and wavelength into a single kernel. If we choose the squared-exponential kernel for both dimensions and introduce a wavelength length scale l_{λ} we get:
$$K_{ij} = h^2 \exp\left(-\frac{(\lambda_i - \lambda_j)^2}{2 l_\lambda^2} - \frac{(t_i - t_j)^2}{2 l_t^2}\right) + \sigma^2 \delta_{ij}, \quad (2)$$
where λ_{i} and λ_{j} are the wavelength values at locations x_{i} and x_{j}. We can also mix kernel functions; that is, we could model time with the Matérn 3/2 kernel and wavelength with a squared-exponential kernel.
In general, the covariance matrix K describing the noise in the observations is a function of the inputs λ and t, as well as of parameters such as h, l_{t}, l_{λ} and σ, which are typically referred to as hyperparameters. We include both hyperparameters and light curve parameters in θ.
Overall, our dataset of flux observations y is modelled as a multivariate Gaussian distribution, with the likelihood (the probability of the data given our model) given by:
$$\mathcal{L}(\boldsymbol{y}\,|\,\boldsymbol{\theta}) = \frac{1}{\sqrt{(2\pi)^{n}|K|}} \exp\left(-\frac{1}{2}\boldsymbol{r}^{\mathrm{T}} K^{-1} \boldsymbol{r}\right). \quad (3)$$
Taking the logarithm of this (to avoid numerical errors) we get the log-likelihood of our model:
$$\log \mathcal{L} = -\frac{1}{2}\boldsymbol{r}^{\mathrm{T}} K^{-1} \boldsymbol{r} - \frac{1}{2}\log|K| - \frac{n}{2}\log 2\pi, \quad (4)$$
where we have defined our residuals vector r = y − μ and n is the total number of data points.
In accordance with Bayes’ theorem, we can add the logarithm of any prior probability to our log-likelihood to get the log-posterior of all the parameters^{3}. This equation can then be used by inference methods such as Markov chain Monte Carlo (MCMC) to explore the marginal probability distributions of each parameter.
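For a small dataset, Eq. (4) can be evaluated directly; a minimal sketch using a Cholesky factorisation, which is the baseline against which the later optimisations are compared:

```python
import numpy as np

def gp_log_likelihood(r, K):
    """Direct evaluation of the log-likelihood of Eq. (4):
    -0.5 r^T K^{-1} r - 0.5 log|K| - (n/2) log(2 pi)."""
    n = len(r)
    L = np.linalg.cholesky(K)                            # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, r))  # alpha = K^{-1} r
    log_det_K = 2.0 * np.sum(np.log(np.diag(L)))         # log|K| from the factor
    return -0.5 * r @ alpha - 0.5 * log_det_K - 0.5 * n * np.log(2 * np.pi)
```

The Cholesky route avoids forming K⁻¹ explicitly and gives the log-determinant essentially for free, but still costs O(n³) in the number of data points, which motivates the optimisations of the following sections.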
2.2 The computational cost of 2D Gaussian processes
The log-likelihood function in Eq. (4) can be used for analysing any general dataset, including 1D or 2D datasets. However, the computational cost becomes prohibitive for large datasets. Assume we have flux observations measured at N different times and binned into M different wavelength channels. If we move from analysing individual light curves to all spectroscopic light curves simultaneously, then the covariance matrix describing the correlation will go from an N × N matrix to an MN × MN matrix to describe correlations in both time and wavelength. Studying Eq. (4), it should be noted that it is necessary to invert the covariance matrix as well as calculate its determinant. The runtime of both of these computations scales as the cube of the number of data points in the most general case. We can write this scaling of runtime in ‘big $\mathcal{O}$’ notation as $\mathcal{O}(M^3 N^3)$. The covariance matrix K also requires storing M^{2}N^{2} entries in memory. The scaling in memory is therefore $\mathcal{O}(M^2 N^2)$. This poor scaling in both runtime and memory would make most transmission spectroscopy datasets unfeasible to analyse and is the motivation for introducing a new log-likelihood optimisation.
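To make this scaling concrete, a back-of-the-envelope comparison for a hypothetical dataset of M = 16 wavelength channels and N = 100 time points (constant factors ignored; the optimised operation count quoted below is the one derived in Sect. 2.5):

```python
# Illustrative operation/memory counts for a hypothetical M x N dataset
M, N = 16, 100                         # wavelength channels, time points

naive_ops = (M * N) ** 3               # O(M^3 N^3): factorising the full MN x MN matrix
naive_mem = (M * N) ** 2               # O(M^2 N^2): entries of K held in memory

opt_ops = 2 * M**3 + 2 * N**3 + M * N * (M + N)   # optimised scaling (Sect. 2.5)
opt_mem = M**2 + N**2

print(f"runtime ratio ~{naive_ops / opt_ops:.0f}x, memory ratio ~{naive_mem / opt_mem:.0f}x")
```

Even for this modest dataset size, the optimised method wins by roughly three orders of magnitude in operation count.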
The optimised GP method introduced in this work is based on Saatchi (2011) and Rakitsch et al. (2013), which both present exact methods of optimising the calculation of Eq. (4). Both methods assume that all inputs for the GP lie on a (potentially non-uniform) 2D grid. In the context of transmission spectroscopy this simply means that if wavelength and time are used as the two inputs then the same set of wavelength bands must be chosen for all points in time (but the wavelength bands do not need to be evenly separated). The grid must also be complete with no missing data, which means outliers at a specific time and wavelength cannot simply be removed from the dataset without either removing all other data points at that specific time or that particular wavelength. However, a common approach for dealing with non-Gaussian outliers has been to replace them using an interpolating function (e.g. Gibson et al. 2017). This approach keeps the inputs on a complete grid and so this grid assumption likely satisfies most low-resolution transmission spectroscopy datasets.
The two methods differ in the range of kernel functions they can be used with. The Saatchi (2011) method is simpler and is described first, with the Rakitsch et al. (2013) method being a more general extension with a similar computational scaling.
2.3 Kronecker products
Some basics of Kronecker product algebra are useful to explain the optimisations used in this work. See Saatchi (2011) for more details, proofs of the results and how these results generalise to more than two dimensions.
The Kronecker product between two matrices may be written using the ⊗ symbol. It can be thought of as multiplying every element of the first matrix by every element in the second matrix, with each multiplication producing its own term in the resulting matrix. For example:
$$\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \otimes B = \begin{pmatrix} a_{11}B & a_{12}B \\ a_{21}B & a_{22}B \end{pmatrix}. \quad (5)$$
If A is M × M and B is N × N, then A ⊗ B is MN × MN.
The following equations hold:
$$(A \otimes B)^{\mathrm{T}} = A^{\mathrm{T}} \otimes B^{\mathrm{T}}, \quad (6)$$
$$(A \otimes B)(C \otimes D) = AC \otimes BD, \quad (7)$$
$$(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}. \quad (8)$$
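Eqs. (6)-(8) are reconstructed here as the standard transpose, mixed-product, and inverse identities for Kronecker products; they can be checked numerically with `np.kron`:

```python
import numpy as np

rng = np.random.default_rng(0)
A, C = rng.normal(size=(2, 3, 3))      # two random M x M matrices
B, D = rng.normal(size=(2, 4, 4))      # two random N x N matrices

# Eq. (6): the transpose distributes over the Kronecker product
assert np.allclose(np.kron(A, B).T, np.kron(A.T, B.T))
# Eq. (7): the mixed-product property
assert np.allclose(np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D))
# Eq. (8): the inverse of a Kronecker product
assert np.allclose(np.linalg.inv(np.kron(A, B)),
                   np.kron(np.linalg.inv(A), np.linalg.inv(B)))
```

Eq. (8) in particular means the MN × MN inverse never needs to be formed directly, which is the key to the scaling improvements below.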
Both methods presented assume the covariance matrix of the noise can be expressed using Kronecker products. The Saatchi (2011) method assumes the covariance matrix can be expressed as:
$$K = K_\lambda \otimes K_t + \sigma^2 \mathbb{I}, \quad (9)$$
where we have written the covariance matrix K as a Kronecker product of separate covariance matrices: K_{λ}, constructed using a wavelength kernel function k_{λ}(λ, λ′); and K_{t}, constructed using the time kernel function k_{t}(t, t′). It also assumes that white noise is constant across the full dataset with variance σ^{2} – a limitation of the Saatchi (2011) method. An equivalent way of stating this restriction is that the kernel function can be written as:
$$k(\{\lambda_i, t_i\}, \{\lambda_j, t_j\}) = k_\lambda(\lambda_i, \lambda_j)\,k_t(t_i, t_j) + \sigma^2 \delta_{ij}. \quad (10)$$
In most transmission spectroscopy datasets, the amplitude of white noise varies significantly in wavelength across the dataset, making the Saatchi (2011) method highly restrictive. In contrast, the Rakitsch et al. (2013) method is valid for any covariance matrix that is the sum of two independent Kronecker products:
$$K = K_\lambda \otimes K_t + \Sigma_\lambda \otimes \Sigma_t, \quad (11)$$
or equivalently any kernel function of the form:
$$k(\{\lambda_i, t_i\}, \{\lambda_j, t_j\}) = k_\lambda(\lambda_i, \lambda_j)\,k_t(t_i, t_j) + \sigma_\lambda(\lambda_i, \lambda_j)\,\sigma_t(t_i, t_j). \quad (12)$$
This method is much more general, with the second pair of covariance matrices Σ_{λ} and Σ_{t} being able to account for white noise that varies in amplitude across the dataset. We can choose K_{λ} and K_{t} to include our correlated noise terms and Σ_{λ} and Σ_{t} to be diagonal matrices containing the white noise terms. This allows us to account for any correlated noise that is separable in wavelength and time, as well as any white noise that is separable in wavelength and time. The white noise as a function of wavelength and time, σ(λ, t), must be able to be written as:
$$\sigma(\lambda, t) = \sigma_\lambda(\lambda)\,\sigma_t(t). \quad (13)$$
However, there is no requirement that Σ_{λ} and Σ_{t} be diagonal, and the method works for any valid covariance matrices K_{λ}, K_{t}, Σ_{λ} and Σ_{t}.
Unlike the method introduced in Gordon et al. (2020), which uses a celerite kernel for the time covariance matrix and is limited to a single regression variable within that matrix, multiple regression variables may be used within any of these covariance matrices. Multiple regression variables are often used in transmission spectroscopy as it is argued they can provide extra information about how strongly correlated the noise may be for different flux observations. For example, the width of the spectral trace on the detector can change in time, and a kernel function could be chosen where the correlation in the noise between two points is an explicit function of both the time separation and the difference in trace widths (e.g. Gibson et al. 2012a; Diamond-Lowe et al. 2020), which is not possible using a celerite kernel (Foreman-Mackey et al. 2017). Due to the Kronecker product structure, any regression variables used must also lie on a 2D grid, which will be true if they vary in time but are constant in wavelength (or vice versa). One caveat is that multiple regression variables may be used with the Gordon et al. (2020) method by treating them as additional time series to be fit by the GP (see our discussion of Rajpaul et al. 2015 and Delisle et al. 2022 in Sect. 1).
2.4 Log-likelihood calculation for uniform white noise
The algorithms described in Chapter 5 of Saatchi (2011) allow for the calculation of Eq. (4) in $\mathcal{O}((M + N)MN)$ time after an initial matrix decomposition step that scales as $\mathcal{O}(M^3 + N^3)$. This holds for any covariance matrix expressible in the form of Eq. (9).
To understand this method, first note that it takes advantage of the Kronecker product structure to speed up matrix-vector products [A ⊗ B]c for an arbitrary vector c of length MN using:
$$[A \otimes B]\,\boldsymbol{c} = \mathrm{vec}(A C B^{\mathrm{T}}), \quad (14)$$
where we reshape c into an M × N matrix C with C_{ij} = c_{(i−1)N + j} and the vec() operator reverses this reshaping (row-major flattening) such that vec(C) = c.
As A ⊗ B is an MN × MN matrix, we would expect multiplication by a vector c to take $\mathcal{O}(M^2 N^2)$ operations to calculate in general. However, Eq. (14) takes $\mathcal{O}((M + N)MN)$ operations. We also do not need to store the full matrix A ⊗ B in memory but instead just A and B separately, reducing the memory requirement from $\mathcal{O}(M^2 N^2)$ to $\mathcal{O}(M^2 + N^2)$.
We combine this with Eq. (8) and find that once we perform the $\mathcal{O}(M^3 + N^3)$ operations of computing $K_\lambda^{-1}$ and $K_t^{-1}$, then
$$K^{-1}\boldsymbol{r} = \left[K_\lambda^{-1} \otimes K_t^{-1}\right]\boldsymbol{r} \quad (15)$$
can be computed in $\mathcal{O}((M + N)MN)$ operations and using $\mathcal{O}(M^2 + N^2)$ memory.
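A sketch of the Eq. (14) trick in NumPy, using a row-major reshape consistent with the indexing above and checked against the dense product:

```python
import numpy as np

def kron_matvec(A, B, c):
    """Compute [A ⊗ B] c via Eq. (14) without ever forming the MN x MN matrix."""
    M, N = A.shape[0], B.shape[0]
    C = c.reshape(M, N)              # row-major reshape: C_ij = c_{(i-1)N + j}
    return (A @ C @ B.T).ravel()     # vec(A C B^T) with row-major flattening

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3))
B = rng.normal(size=(5, 5))
c = rng.normal(size=15)
assert np.allclose(kron_matvec(A, B, c), np.kron(A, B) @ c)
```

NumPy's default C ordering matches the row-major convention used here, so `reshape` and `ravel` implement the reshaping and vec() operations directly.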
While it would be sufficient to use any method such as Cholesky factorisation to compute $K_\lambda^{-1}$ and $K_t^{-1}$, using the eigendecomposition of each matrix permits white noise to be easily accounted for. We exploit a particular property of the eigendecomposition: adding a constant to the diagonal of a matrix shifts the eigenvalues but leaves the eigenvectors unchanged. We denote the eigendecomposition of some matrix A as:
$$A = Q_A \Lambda_A Q_A^{\mathrm{T}}, \quad (16)$$
after which the inverse can be easily computed as:
$$A^{-1} = Q_A \Lambda_A^{-1} Q_A^{\mathrm{T}}. \quad (17)$$
For two matrices A and B it can be shown that:
$$A \otimes B = (Q_A \otimes Q_B)(\Lambda_A \otimes \Lambda_B)(Q_A \otimes Q_B)^{\mathrm{T}}. \quad (18)$$
We can then add a white noise term to Eq. (18) simply by shifting the eigenvalues:
$$K^{-1} = (Q_\lambda \otimes Q_t)\left(\Lambda_\lambda \otimes \Lambda_t + \sigma^2 \mathbb{I}\right)^{-1}(Q_\lambda \otimes Q_t)^{\mathrm{T}}, \quad (19)$$
where $\mathbb{I}$ represents the MN × MN identity matrix. We use Eq. (14) to multiply r by the eigenvector matrices Q_{λ} and Q_{t}. The middle term containing the product of eigenvalues is diagonal and so the inverse is easily calculated.
Calculating $\log|K|$ can be performed in $\mathcal{O}(MN)$ time using:
$$\log|K| = \sum_{i=1}^{M}\sum_{j=1}^{N} \log\left([\Lambda_\lambda]_{ii}\,[\Lambda_t]_{jj} + \sigma^2\right). \quad (20)$$
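Combining Eqs. (14) and (16)-(20), a minimal sketch of the uniform-white-noise log-likelihood (a simplified stand-in, not the LUAS implementation itself) can be written and validated against the direct calculation on a small grid:

```python
import numpy as np

def kron_gp_logL(K_lam, K_t, sigma, r):
    """Log-likelihood for K = K_lam ⊗ K_t + sigma^2 I, with r flattened
    row-major over an M x N (wavelength x time) grid."""
    M, N = K_lam.shape[0], K_t.shape[0]
    lam_l, Q_l = np.linalg.eigh(K_lam)           # eigendecompositions, Eq. (16)
    lam_t, Q_t = np.linalg.eigh(K_t)
    # rotate the residuals into the joint eigenbasis using the Eq. (14) trick
    R = Q_l.T @ r.reshape(M, N) @ Q_t
    shifted = np.outer(lam_l, lam_t) + sigma**2  # shifted eigenvalues, Eq. (19)
    rKinvr = np.sum(R**2 / shifted)
    log_det = np.sum(np.log(shifted))            # Eq. (20)
    return -0.5 * rKinvr - 0.5 * log_det - 0.5 * M * N * np.log(2 * np.pi)
```

For an M × N = 3 × 4 test case, this agrees with a direct dense evaluation of Eq. (4) to machine precision, while never forming the MN × MN matrix.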
2.5 Accounting for non-uniform white noise
The assumption of uniform white noise can be avoided by using a linear transformation presented in Rakitsch et al. (2013). Using this transformation, we can efficiently solve for the log-likelihood for any covariance matrix that is the sum of two Kronecker products – as described by Eq. (11).
First, we solve for the eigendecomposition of Σ_{λ} and Σ_{t}, notating this as follows:
$$\Sigma_\lambda = Q_{\Sigma_\lambda} \Lambda_{\Sigma_\lambda} Q_{\Sigma_\lambda}^{\mathrm{T}}, \quad (21)$$
$$\Sigma_t = Q_{\Sigma_t} \Lambda_{\Sigma_t} Q_{\Sigma_t}^{\mathrm{T}}. \quad (22)$$
In the case that Σ_{λ} is diagonal, we simply have $\Lambda_{\Sigma_\lambda} = \Sigma_\lambda$ and $Q_{\Sigma_\lambda} = \mathbb{I}$ (and similarly for Σ_{t}).
We transform K_{λ} and K_{t} using these eigendecompositions:
$$\tilde{K}_\lambda = \Lambda_{\Sigma_\lambda}^{-1/2} Q_{\Sigma_\lambda}^{\mathrm{T}} K_\lambda Q_{\Sigma_\lambda} \Lambda_{\Sigma_\lambda}^{-1/2}, \quad (23)$$
$$\tilde{K}_t = \Lambda_{\Sigma_t}^{-1/2} Q_{\Sigma_t}^{\mathrm{T}} K_t Q_{\Sigma_t} \Lambda_{\Sigma_t}^{-1/2}. \quad (24)$$
Doing this allows us to write our covariance matrix K as a product of three terms:
$$K = \left(Q_{\Sigma_\lambda}\Lambda_{\Sigma_\lambda}^{1/2} \otimes Q_{\Sigma_t}\Lambda_{\Sigma_t}^{1/2}\right)\left(\tilde{K}_\lambda \otimes \tilde{K}_t + \mathbb{I}\right)\left(\Lambda_{\Sigma_\lambda}^{1/2} Q_{\Sigma_\lambda}^{\mathrm{T}} \otimes \Lambda_{\Sigma_t}^{1/2} Q_{\Sigma_t}^{\mathrm{T}}\right). \quad (25)$$
Multiplying this out reproduces Eq. (11).
The inverse of this expression can be calculated to be:
$$K^{-1} = \left(Q_{\Sigma_\lambda}\Lambda_{\Sigma_\lambda}^{-1/2} \otimes Q_{\Sigma_t}\Lambda_{\Sigma_t}^{-1/2}\right)\left(\tilde{K}_\lambda \otimes \tilde{K}_t + \mathbb{I}\right)^{-1}\left(\Lambda_{\Sigma_\lambda}^{-1/2} Q_{\Sigma_\lambda}^{\mathrm{T}} \otimes \Lambda_{\Sigma_t}^{-1/2} Q_{\Sigma_t}^{\mathrm{T}}\right). \quad (26)$$
We can now solve K^{−1}r in three steps. K^{−1} is the product of three matrices, two of which are of the form A ⊗ B, and so we can use Eq. (14). The middle matrix in this product is equivalent to Eq. (19), because we have a Kronecker product plus a constant added to the diagonal being multiplied by a vector. In this case, we need the eigendecomposition of the transformed covariance matrices $\tilde{K}_\lambda$ and $\tilde{K}_t$, and the ‘white noise’ term is set to σ = 1. The scaling has not changed significantly compared to the method with constant white noise, although the eigendecomposition of four matrices instead of two is now required (although this is trivial if Σ_{λ} and Σ_{t} are diagonal).
$\log|K|$ is now computed as follows:
$$\log|K| = \sum_{i=1}^{M}\sum_{j=1}^{N}\log\left([\tilde{\Lambda}_\lambda]_{ii}\,[\tilde{\Lambda}_t]_{jj} + 1\right) + N\log|\Sigma_\lambda| + M\log|\Sigma_t|, \quad (27)$$
where $\tilde{\Lambda}_\lambda$ and $\tilde{\Lambda}_t$ contain the eigenvalues of $\tilde{K}_\lambda$ and $\tilde{K}_t$.
The log-likelihood can now be solved for any covariance matrix that can be written in the form of Eq. (11), with an overall scaling of $\mathcal{O}(2M^3 + 2N^3 + MN(M + N))$ operations and $\mathcal{O}(M^2 + N^2)$ memory. Compared to a more general approach of performing Cholesky factorisation on the full MN × MN covariance matrix K, our optimised method provides greater than an order of magnitude improvement in runtime for typical transmission spectroscopy datasets (shown in Sect. 4.5).
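The full calculation of Sects. 2.3-2.5 can be sketched end-to-end (again a simplified stand-in for the LUAS implementation, assuming Σ_λ and Σ_t are supplied as full positive-definite matrices):

```python
import numpy as np

def general_kron_logL(K_lam, K_t, S_lam, S_t, r):
    """Log-likelihood for K = K_lam ⊗ K_t + S_lam ⊗ S_t following the
    Rakitsch et al. (2013) transformation (Eqs. (21)-(27))."""
    M, N = K_lam.shape[0], K_t.shape[0]
    lS_l, QS_l = np.linalg.eigh(S_lam)           # Eqs. (21)-(22)
    lS_t, QS_t = np.linalg.eigh(S_t)
    W_l = QS_l / np.sqrt(lS_l)                   # Q_Sigma Lambda_Sigma^{-1/2}
    W_t = QS_t / np.sqrt(lS_t)
    Kt_l = W_l.T @ K_lam @ W_l                   # transformed matrices, Eqs. (23)-(24)
    Kt_t = W_t.T @ K_t @ W_t
    lam_l, Q_l = np.linalg.eigh(Kt_l)
    lam_t, Q_t = np.linalg.eigh(Kt_t)
    # rotate the residuals: noise transform first, then into the eigenbasis
    R = Q_l.T @ (W_l.T @ r.reshape(M, N) @ W_t) @ Q_t
    shifted = np.outer(lam_l, lam_t) + 1.0       # the 'white noise' term is now 1
    rKinvr = np.sum(R**2 / shifted)
    # Eq. (27): log|K| picks up the determinants of S_lam and S_t
    log_det = (np.sum(np.log(shifted))
               + N * np.sum(np.log(lS_l)) + M * np.sum(np.log(lS_t)))
    return -0.5 * rKinvr - 0.5 * log_det - 0.5 * M * N * np.log(2 * np.pi)
```

For small test matrices this agrees with the direct dense evaluation; the only dense objects ever built are M × M and N × N.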
2.6 Efficient inference with large numbers of parameters
As multiple parameters are typically fit for each spectroscopic light curve, parameter inference can become computationally expensive when simultaneously fitting multiple light curves, and a good choice of inference method may be required. Hamiltonian Monte Carlo (HMC) was chosen as it can scale significantly better when fitting large numbers of parameters compared to other, more traditional MCMC methods such as Metropolis-Hastings or affine-invariant MCMC (Neal 2011). HMC can make use of the gradient of the log-likelihood to take longer trajectories through parameter space for each step of the MCMC compared to these other methods. Specifically, we use No U-Turn Sampling (NUTS) because it eliminates the need to hand-tune parameters while achieving a similar sampling efficiency to HMC (Hoffman & Gelman 2014).
To provide the gradients of the log-likelihood calculation for NUTS, the above algorithms were implemented using the Python package JAX (Bradbury et al. 2018). JAX has the capability of providing the values as well as the gradients of functions, often with minimal additional runtime cost. It has an implementation of NumPy that allows NumPy code to be converted into JAX code with limited alterations. It also allows the same code to be compiled to run on either a CPU or GPU – where GPUs may provide significant computational advantages. Novel analytic expressions for the gradients and Hessian of the log-likelihood (used in combination with JAX) were developed to aid numerical stability – as discussed in Appendices H and I – with numerical stability further discussed in Appendix J.
3 Testing on simulated data
As we do not know a priori if wavelength-correlated systematics are present in a real dataset, this could lead to challenging model selection problems when deciding whether to fit the datasets with wavelength-independent 1D GPs or a 2D GP that can account for wavelength-correlated systematics. We will demonstrate using simulated data that 2D GPs can accurately account for both wavelength-independent and wavelength-correlated systematics. This avoids the need for model selection – which could become computationally intractable for large numbers of parameters and is also known to place a heavy reliance on the choice of priors^{4} (Llorente et al. 2023). For example, the 2D GP method can fit for a wavelength length scale l_{λ} – with the log-likelihood varying significantly with l_{λ} – while the 1D GP method does not. This would result in the favoured model (e.g. measured using the Bayesian evidence) being strongly dependent on the arbitrary choice of prior on l_{λ}. This dependence on the choice of priors can be significantly reduced by avoiding model selection and instead marginalising over both wavelength-independent 1D GPs and wavelength-correlated 2D GPs. This can be performed simply by fitting with a 2D GP using a kernel such as that of Eq. (2) and choosing a prior where l_{λ} can go to small values, where the 2D GP becomes equivalent to jointly fitting wavelength-independent 1D GPs. While the choice of prior on l_{λ} will still affect how strongly the 1D and 2D GP methods are weighted, both methods are still marginalised over, reducing the importance of the choice of priors.
We aim to validate this approach by testing both methods across different sets of simulated data containing either wavelength-independent or wavelength-correlated systematics. We will show that 2D GPs can have an added benefit of sharing hyperparameters between light curves to better constrain systematics and improve accuracy. We also use these datasets to characterise how wavelength-correlated systematics may contaminate transmission spectra when fitting with 1D GPs.
Finally, we study how we can account for common-mode systematics – systematics that are constant in wavelength – using 2D GPs. Typically, these are accounted for using a separate common-mode correction step before fitting for the transmission spectrum (described in Sect. 3.9); we show that these systematics can be accounted for using a 2D GP while simultaneously retrieving the transmission spectrum.
To summarise, we will demonstrate four major points:
 (i)
Sharing hyperparameters between light curves can improve the reliability of results compared to individual 1D GP fits.
 (ii)
2D GPs can accurately account for wavelength-independent systematics by fitting for a wavelength length scale parameter.
 (iii)
When systematics are correlated in wavelength, 2D GPs can accurately retrieve transmission spectra while 1D GPs may retrieve erroneous results.
 (iv)
Common-mode systematics can be accounted for using 2D GPs without requiring a separate common-mode correction step.
Sections 3.1–3.5 will describe how the synthetic data were generated and analysed, with Sects. 3.6–3.9 explaining how the results demonstrate points (i), (ii), (iii) and (iv).
3.1 Light curve model
In this work, our light curve model is calculated using the equations of Mandel & Agol (2002) with quadratic limb-darkening parameters c_{1} and c_{2}. The other parameters describing the light curve are the central transit time (T_{0}), period (P), system scale (a/R_{*}) and impact parameter (b), as well as the planet-to-star radius ratio (ρ = R_{p}/R_{*}) we aim to measure. We chose a standard approach of using the transit depth (ρ^{2} = (R_{p}/R_{*})^{2}) to fit each light curve (e.g. Rustamkulov et al. 2023; Alderson et al. 2023). We can also fit a linear baseline function to each light curve, which introduces the out-of-transit flux parameter (F_{oot}) and a fit to the slope of the baseline flux (T_{grad}).
Our mean function parameters therefore include five parameters that may be fit for each light curve independently (ρ^{2}, c_{1}, c_{2}, F_{oot}, T_{grad}) and four parameters that are shared across all light curves (T_{0}, P, a/R_{*}, b). For the simulations, the transit depths ρ^{2} were the only mean function parameters being fit for to reduce runtimes for the large number of simulations (the other parameters were kept fixed to their injected values).
We note that the 1D GP method generated light curves using BATMAN (Kreidberg 2015) to fit the data while the 2D GP method used JAXOPLANET (Foreman-Mackey & Garcia 2023). Any differences between the light curves generated by the two packages were orders of magnitude below the level of white noise being added. JAXOPLANET is a Python package that can generate transit light curves similarly to BATMAN but is implemented in JAX. This allows for the calculation of gradients of the log-likelihood – as required to implement No U-Turn Sampling (NUTS).
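As a minimal sketch of the mean function described above, the snippet below combines a linear baseline with a toy box-shaped transit. The box transit is only a stand-in for the full quadratic limb-darkened Mandel & Agol model, and the parameter values (duration, depth, slope) are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

def mean_function(t, rho2, F_oot, T_grad, T0=0.0, duration=0.05):
    # Box-shaped transit as a stand-in for the quadratic limb-darkened
    # Mandel & Agol model; the dip depth equals the transit depth rho^2
    transit = np.where(np.abs(t - T0) < duration / 2, 1.0 - rho2, 1.0)
    # Linear baseline: out-of-transit flux plus a slope in time
    baseline = F_oot + T_grad * (t - T0)
    return baseline * transit

t = np.linspace(-0.1, 0.1, 100)   # ~3 min cadence over 0.2 days
flux = mean_function(t, rho2=0.01, F_oot=1.0, T_grad=0.02)
```

Multiplying (rather than adding) the baseline and transit mirrors the paper's choice of treating systematics as proportional to flux.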
3.2 Generating the exoplanet signal
Each simulated dataset contained 100 time points and 16 wavelength channels. The parameter ranges used for generating the exoplanet signal for all sets of simulations are included in Table 1. They are similar to the ranges chosen in Gibson (2014) and are designed to represent typical hot Jupiter transits.
The time range for these observations is [−0.1, 0.1] days sampled uniformly in time. For 100 time points this gives a cadence of approximately 3 min. This cadence is longer than in many real observations – such as from VLT/FORS2 or JWST – but is still well-sampled. This was required to reduce runtimes and allow for the analysis of a large number of simulated datasets. The parameter ranges chosen resulted in the proportion of time spent in transit varying within the range [26%, 65%].
The wavelength range considered was [3850 Å, 8650 Å] which for 16 evenly separated wavelength bins gives a bin width of 300 Å. This wide wavelength range ensures light curves with a range of limb darkening parameters were included. Limb darkening parameters were calculated by first uniformly sampling a confirmed transiting exoplanet listed on the NASA Exoplanet Archive^{5} and selecting the listed host star parameters. These parameters were used to compute the limb darkening parameters using PyLDTk, allowing for a realistic range of stellar limbdarkening profiles (Parviainen & Aigrain 2015).
The synthetic transmission spectra were generated using a simple model with the aim of testing the retrieval of typical sharp and broad features within transmission spectra. The planet-to-star radius ratio was randomly drawn as described in Table 2. To study transmission spectra that may have a scattering slope or flat ‘grey’ cloud deck, there was a 50% chance of a Rayleigh-scattering slope being included and a 50% chance the spectrum was flat. The Rayleigh scattering was implemented using the relation from Lecavelier Des Etangs et al. (2008):

m = d(R_{p}/R_{*})/d(ln λ) = αH/R_{*}, (28)

where m is the slope of the transmission spectrum, H is the scale height and α describes the strength of scattering (α = −4 for Rayleigh scattering).
For simplicity, m = 0 was used for a flat ‘cloudy’ model and m = −0.005 for Rayleigh scattering. For context, WASP-31b has a reported scale height of 1220 km and stellar radius of R_{*} = 1.12 R_{⊙} (Sing et al. 2015; Anderson et al. 2011), which results in a predicted slope of m = −0.0063 for Rayleigh scattering.
In order to compare how the different methods may constrain a sharp spectral feature such as the K feature (a sharp potassium absorption doublet at ~7681 Å), each dataset had a 50% probability of adding a change in radius ratio of Δρ_{K} = 0.005 to the single wavelength bin covering the wavelength range 7450 Å to 7750 Å. Overall this resulted in an even 25% probability of the spectrum being flat and featureless, hazy and featureless, flat with a K feature, or hazy with a K feature (see Fig. 1).
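The spectrum-generation recipe above can be sketched in a few lines of numpy. This is an illustrative reconstruction, not the paper's code: the choice of reference wavelength for the slope (here the mean bin centre) and the random seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
wl = np.linspace(4000.0, 8500.0, 16)       # bin centres of the 300 A bins

rho0 = 0.1                                  # baseline radius ratio
m = rng.choice([0.0, -0.005])               # 50/50: flat vs Rayleigh slope
has_K = rng.random() < 0.5                  # 50% chance of a K feature

# Slope applied in ln(lambda); the reference wavelength is arbitrary here
rho = rho0 + m * np.log(wl / wl.mean())
K_bin = (wl > 7450) & (wl < 7750)           # the single bin covering the doublet
rho = rho + 0.005 * has_K * K_bin

depth = rho**2                              # transit depths rho^2 for fitting
```

Drawing `m` and `has_K` independently reproduces the even 25% split between the four spectrum types.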
Table 1. Mean function parameter ranges used for all simulations.
Table 2. Transmission spectrum parameter ranges for all simulations.
Fig. 1 Four equally likely types of transmission spectra generated for a radius ratio of ρ = 0.1, showing the spectrum in transit depths. 
Table 3. Parameter ranges used to generate systematics for Simulations 1–4.
3.3 Generating systematics
Five hundred synthetic transmission spectroscopy datasets (each containing 16 light curves) were generated for each of the four sets of simulations. All sets of simulations used the same transit signal model described in Sec. 3.2. The parameter ranges used to generate systematics for Simulations 1–4 are included in Table 3. The generated systematics were multiplied by each of the transit models. We chose to multiply by the transit signal because many sources of systematics would be expected to be proportional to flux (e.g. from varying instrumental or atmospheric throughput).
For Simulation 1, noise was generated using Eq. (1) to build a covariance matrix and taking random draws from it independently for each wavelength. These datasets therefore contained wavelength-independent systematics (equivalently, this is the limit of Eq. (2) as l_{λ} → 0). For Simulations 2 and 3, noise was generated using the kernel in Eq. (2); both represent scenarios where wavelength-correlated systematics are present but with different ranges of wavelength length scales. Simulation 4 represents contamination from common-mode systematics – which are often encountered in real observations and are relevant to the VLT/FORS2 analysis in Sec. 4. For each simulated dataset, a single draw of correlated noise was generated using the kernel in Eq. (1) (with negligible white noise σ = 10^{−6} for numerical stability). All wavelength channels used this same draw of correlated noise so that it was constant in wavelength (the limit of Eq. (2) as l_{λ} → ∞). White noise was then added to the full dataset from the parameter range in Table 3.
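A draw of noise correlated in both time and wavelength can be sketched as follows. Since Eqs. (1) and (2) are not reproduced in this section, the snippet assumes squared-exponential forms for both the time and wavelength kernels, and all amplitudes and length scales are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
Nt, Nl = 100, 16
t = np.linspace(-0.1, 0.1, Nt)
lam = np.linspace(4000.0, 8500.0, Nl)

def sq_exp(x, ell):
    # Squared-exponential correlation matrix (assumed kernel form)
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

h, l_t, l_lam, sigma = 5e-4, 0.02, 2000.0, 5e-4

# Kronecker-structured covariance over the wavelength-by-time grid
K = h**2 * np.kron(sq_exp(lam, l_lam), sq_exp(t, l_t))
K += sigma**2 * np.eye(Nt * Nl)

# One draw of wavelength-correlated systematics for the full dataset
sys_noise = rng.multivariate_normal(
    np.zeros(Nt * Nl), K, method="cholesky").reshape(Nl, Nt)

transit_models = np.ones((Nl, Nt))          # placeholder transit light curves
noisy_flux = transit_models * (1.0 + sys_noise)
```

Taking l_lam very small recovers the wavelength-independent noise of Simulation 1, while a very large l_lam gives the common-mode noise of Simulation 4.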
For examples of noise-contaminated light curves, three synthetic light curves from a Simulation 1 dataset are plotted in the left panel of Fig. 2, while all light curves from a Simulation 3 dataset are shown in the left panel of Fig. 5. Simulation 2 datasets would appear similar to Simulation 3 but with the systematics changing shape more significantly between different light curves. Simulation 4 datasets would have the same shape of systematics in all light curves (but with different white noise).
3.4 Inference methods
Each of the fitting methods was initialised with the true values used in the simulations. A global optimiser was then used to locate the best-fit values of the parameters being fit. MCMCs were initialised by perturbing the parameters from this best-fit value and run until convergence was reached – as measured by the Gelman–Rubin (GR) statistic (Gelman & Rubin 1992) – or until a limit on the number of chains or samples was reached.
The 1D GP fits were performed using the GEEPEA (Gibson et al. 2012b) and INFERNO (Gibson et al. 2017) Python packages. GEEPEA was used for the implementation of the 1D GP, with the Nelder–Mead simplex algorithm used to locate a best-fit value. Differential Evolution MCMC (DE-MC) was used to explore the posterior of each of the light curves using the implementation in INFERNO. Four parameters were fit in each MCMC: ρ^{2}, h, l_{t} and σ. One hundred independent chains were run in each MCMC with 200 burn-in steps followed by another 200 steps. If any of the parameters had a GR statistic >1.01 then the chains were extended by another 200 steps, repeating up to a maximum chain length of 1000 steps. This was sufficient to achieve formal convergence in over 99% of simulations performed.
In order to perform inference for the 2D GP fits, the log-likelihood calculation was implemented in JAX using the equations described in Sec. 2.5 and Appendix H. The Python package PYMC was used for its implementation of limited-memory BFGS for best fits and NUTS for inference (Salvatier et al. 2016). Limited-memory BFGS is a gradient-based optimisation method that can be used for models with large numbers (potentially thousands) of parameters (Liu & Nocedal 1989).
MCMC inference for the 2D GP method initially used two chains with 1000 burn-in steps each, followed by another 1000 steps used for inference. Convergence was checked and, if all parameters had not yet converged, another two chains were run. This was repeated up to a maximum of ten chains. This was also sufficient to achieve convergence in over 99% of simulations performed.
For many MCMC methods, an approximation of the covariance matrix of the parameters being fit can be used as a transformation to reduce correlations between parameters and improve sampling efficiency. In NUTS, the so-called mass matrix performs this function (Hoffman & Gelman 2014). A Laplace approximation was performed to find an approximate covariance matrix, which requires calculating the Hessian of the log-posterior at the location of the best fit (see Appendix I). This was found to be significantly more efficient than using the samples from the burn-in period to approximate the covariance matrix.
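The Laplace approximation step can be illustrated with a toy two-parameter posterior. The paper computes the Hessian via JAX autodiff; here a central finite-difference Hessian stands in, and the quadratic negative log-posterior (with Hessian A by construction) is purely an illustrative assumption.

```python
import numpy as np

# Toy 2-parameter negative log-posterior standing in for the real model;
# its Hessian is exactly A by construction
A = np.array([[2.0, 1.2], [1.2, 1.0]])

def neg_log_post(p):
    return 0.5 * p @ A @ p

def hessian_fd(f, p0, eps=1e-4):
    # Central finite-difference Hessian of f at p0
    n = p0.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i = np.zeros(n); e_i[i] = eps
            e_j = np.zeros(n); e_j[j] = eps
            H[i, j] = (f(p0 + e_i + e_j) - f(p0 + e_i - e_j)
                       - f(p0 - e_i + e_j) + f(p0 - e_i - e_j)) / (4 * eps**2)
    return H

p_best = np.zeros(2)                  # location of the best fit
H = hessian_fd(neg_log_post, p_best)
cov_approx = np.linalg.inv(H)         # Laplace-approximation covariance
mass_matrix = H                       # candidate NUTS mass matrix
```

The inverse Hessian of the negative log-posterior at the best fit is the Laplace estimate of the parameter covariance, which is what the mass matrix is built from.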
To overcome a few other challenges in achieving convergence (see Appendix D for details), blocked Gibbs sampling (Jensen et al. 1995) was used to update some groups of parameters together. Slice sampling (Neal 2003) was also used if the time or wavelength length scales were within certain ranges close to a prior limit, as this could sometimes sample these parameters more efficiently than NUTS. Priors used for both 1D and 2D GP methods are described in Appendix C and were unchanged between Simulations 1–4.
Fig. 2 Example of a synthetic dataset analysed in Simulation 1 (containing wavelength-independent systematics), showing some simulated light curves (left) and the recovered transmission spectrum (right). Left: the three shortest-wavelength light curves in the dataset being fit by the transit model combined with a 2D GP noise model (the other 13 light curves were simultaneously fit but not plotted). Right: resulting transmission spectrum from the joint fit of all 16 light curves, with the three leftmost points corresponding to the light curves in the left plot. The recovered transmission spectrum was consistent with the injected spectrum. An atmospheric retrieval was performed on the recovered transmission spectrum (with the best-fit model plotted) and was consistent with the flat slope and strong K feature of the injected spectrum.
3.5 Accuracy metrics
Multiple metrics were used to determine how accurately the different methods could recover the injected transmission spectra. An example analysis of a single synthetic dataset from Simulation 1 is shown in Fig. 2, demonstrating both the fitting of the light curves and how the resulting transmission spectrum and atmospheric retrieval compare to the injected spectrum. We will study how accurately each method can extract the transmission spectrum, in addition to the accuracy of the retrieved atmospheric parameters.
To measure the accuracy of the transmission spectra, we assumed that the uncertainty in the recovered transit depths was Gaussian. This implies that if we are accurately retrieving their mean and covariance, the reduced chi-squared χ^{2}_{ν} of the injected values should follow a χ^{2}_{ν} distribution with 16 degrees of freedom. We compute the χ^{2}_{ν} of the injected transmission spectrum (accounting for covariance between transit depths) using:

χ^{2}_{ν} = (1/M) (ρ^{2}_{inj} − ρ^{2}_{ret})^{T} Σ_{ret}^{−1} (ρ^{2}_{inj} − ρ^{2}_{ret}), (29)

where ρ^{2}_{ret} and Σ_{ret} represent the retrieved mean and covariance of the transmission spectrum, respectively (with ρ^{2} representing the vector of transit depths), ρ^{2}_{inj} represents the injected transmission spectrum, and M is the number of light curves (M = 16 for all the simulations in this paper). As the 1D GP method fits each light curve independently, the covariance between all the transit depths is always assumed to be zero, resulting in the covariance matrices from the 1D GP method being diagonal. The 2D GP method instead recovers the full covariance matrix of transit depths from the MCMC.
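Eq. (29) is a one-liner in numpy. The sketch below (with hypothetical input values) shows that with a diagonal covariance, as for the 1D GP method, it reduces to the familiar independent reduced chi-squared.

```python
import numpy as np

def chi2_nu(depth_inj, depth_ret, cov_ret):
    # Reduced chi-squared of the injected transit depths,
    # accounting for covariance between depths (Eq. (29))
    r = depth_inj - depth_ret
    return r @ np.linalg.solve(cov_ret, r) / r.size

# Hypothetical example: 16 depths each exactly 1 sigma from the truth,
# with a diagonal (1D GP style) covariance matrix
inj = np.zeros(16)
ret = np.full(16, 1e-4)
cov = np.diag(np.full(16, 1e-8))
value = chi2_nu(inj, ret, cov)
```

Using `np.linalg.solve` rather than explicitly inverting the covariance is the numerically preferred way to evaluate the quadratic form.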
Since a single synthetic dataset is not sufficient to study the accuracy of each method, we generated 500 synthetic datasets for each set of simulations to provide a sufficient sample to test each method. By comparing how the resulting 500 χ^{2}_{ν} values trace out the theoretical χ^{2}_{ν} distribution, we can determine if a method is accurately recovering the uncertainty in the transmission spectrum. For M = 16 degrees of freedom, χ^{2}_{ν} has mean μ = 1 and variance σ^{2} = 2/M = 0.125. A mean significantly larger than one suggests our uncertainties are too small on average, while a mean smaller than one suggests our uncertainties are too large on average. We should expect the sample variance of 500 samples from a χ^{2}_{ν} distribution with 16 degrees of freedom to be distributed as 0.125 ± 0.009; values diverging from this may suggest the presence of outliers.
We also performed one-sample Kolmogorov–Smirnov (K–S) tests on the distribution of recovered χ^{2}_{ν} values, which is a method for testing whether a set of samples follows a given distribution – in this case a χ^{2}_{ν} distribution. It works by comparing the empirical cumulative distribution function (ECDF) of the samples to the cumulative distribution function (CDF) of the chosen distribution, returning the maximum deviation D from the desired CDF. For 500 samples and a χ^{2}_{ν} distribution with 16 degrees of freedom, the K–S statistic should have D < 0.060 at a p-value of 0.05 and D < 0.072 at a p-value of 0.01.
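This K–S test can be run with scipy by rescaling the χ^{2} CDF to the reduced form. The snippet below is a sketch of the procedure on ideal synthetic samples (the seed and sample draw are illustrative, not from the paper).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
M = 16

# 500 reduced chi-squared values; here drawn from the ideal distribution
samples = rng.chisquare(M, size=500) / M

# One-sample K-S test against the CDF of chi^2 with M dof, scaled by 1/M
D, p_value = stats.kstest(samples, lambda x: stats.chi2.cdf(x * M, M))
```

For correctly calibrated uncertainties, D should typically fall below the 0.060 and 0.072 thresholds quoted above.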
In addition to examining the χ^{2}_{ν} values, we can also examine how accurately we can retrieve atmospheric features. For our simulations, there are three features of the injected transmission spectra: the radius ratio ρ = R_{p}/R_{*}, the slope m = d(R_{p}/R_{*})/d(ln λ) and the change in radius ratio in the wavelength bin centred on the K feature, Δρ_{K}. For each simulation, these three features were determined by performing a simple atmospheric retrieval on the recovered transmission spectrum using the same atmospheric model that was used to generate the injected spectra. We determined the mean and uncertainty for each of these features using an MCMC (also using the NUTS algorithm). We measured the number of standard deviations – or the Z-score – each prediction was away from the true injected value for all 500 synthetic datasets (as used previously in e.g. Carter & Winn 2009; Gibson 2014; Ih & Kempton 2021). If we are robustly retrieving a given atmospheric parameter, we expect these 500 Z-scores to follow a Gaussian distribution with mean μ = 0 and variance σ^{2} = 1.
If the mean Z-score is zero, the parameter is not biased towards larger or smaller values than the truth. The Z-scores having unit variance shows that the uncertainty estimates are accurate. If the variance is less than one, the error bars are too large on average, which might result in a missed opportunity to detect an atmospheric feature. If the variance is greater than one, the error bars are too small on average, which could lead to false detections of signals.
If two methods are both shown to produce reliable uncertainties based on these metrics, then we can also study which method gives smaller uncertainties on the transmission spectrum. Some methods described can produce tighter constraints than others but at the cost of making stronger assumptions about the systematics in the data.
Tables of each accuracy metric for all methods tested are included in Appendix G.
3.6 Result (i): sharing hyperparameters can make constraints more reliable
Before fitting for wavelength correlations, we first demonstrate that simply by joint-fitting spectroscopic light curves and sharing hyperparameters between them, we can more reliably constrain transmission spectra. This technique has already been demonstrated on ground-based observations of WASP-94Ab (Ahrer et al. 2022), but to examine reliability we apply it to simulated datasets where the transmission spectrum is known.
For this section, we used the Simulation 1 data containing only wavelength-independent systematics. We first fit each synthetic dataset with individual 1D GPs that fit a separate transit depth ρ^{2}, height scale h, time length scale l_{t} and white noise amplitude σ to each individual light curve. The kernel in Eq. (1) was used for the GP (the same kernel used to generate the systematics). We compared this to joint-fitting all 16 spectroscopic light curves in each dataset, where we still fit for all 16 transit depths but used single shared values for h, l_{t} and σ. This provides more information to constrain the values of the hyperparameters and should produce better constraints than using wavelength-varying values. We note that this can be considered a 2D GP, as the kernel is a function of the two dimensions time and wavelength (the correlation in wavelength is given by the Kronecker delta δ_{λλ′}). We therefore performed this joint fit identically to the 2D GP method described in Sec. 3.4 and used the same kernel function but fixed the wavelength length scale l_{λ} to be negligible^{6}. Comparing Eq. (1) and Eq. (2), it can be seen that both kernel functions are mathematically equivalent in the limit l_{λ} → 0. We therefore refer to this method as the Hybrid method, as it shares the same kernel function as the 1D GP method but joint-fits light curves and can benefit from shared hyperparameters like the 2D GP method.
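The distinction between the Hybrid limit and a full 2D kernel can be made concrete with a tiny Kronecker-product example. Squared-exponential kernel forms and all parameter values are illustrative assumptions, and the grid is deliberately small.

```python
import numpy as np

t = np.linspace(-0.1, 0.1, 5)                 # tiny time grid for clarity
lam = np.array([4000.0, 4300.0, 4600.0])      # three wavelength channels

def sq_exp(x, ell):
    # Squared-exponential correlation matrix (assumed kernel form)
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

h, l_t = 1e-3, 0.02

# Hybrid limit: the wavelength kernel collapses to the Kronecker delta,
# so the joint covariance is block-diagonal - independent light curves
# that nevertheless share the hyperparameters h and l_t
K_hybrid = h**2 * np.kron(np.eye(lam.size), sq_exp(t, l_t))

# Finite wavelength length scale: off-diagonal blocks couple light curves
K_2d = h**2 * np.kron(sq_exp(lam, 1000.0), sq_exp(t, l_t))
```

The diagonal blocks of the two covariance matrices are identical; only the cross-light-curve blocks differ, which is exactly the l_{λ} → 0 equivalence described above.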
The Hybrid method had either equal or better performance compared to individual 1D GPs across every accuracy metric measured. For example, the distribution of χ^{2}_{ν} values for the 1D GPs had a mean of 1.35 ± 0.04, compared to 1.00 ± 0.02 for the Hybrid method. Similarly, the Z-scores of all three atmospheric parameters studied had sample variances that were constrained to be greater than one for the 1D GP fits but consistent with one for the joint fits.
Our results suggest that 1D GPs underestimated the uncertainties of the transmission spectra on average, which may appear surprising given that the systematics in Simulation 1 were generated using 1D GPs with the exact same kernel function. The different results for the two methods appear to have been largely driven by a small number of outliers where individual light curves were fit to have much weaker systematics than were present. This did not happen with the joint fits and, since both methods have mathematically equivalent kernel functions, we interpret this result as the 1D GPs not having sufficient data to consistently fit the systematics. This could happen if an individual light curve happens to have systematics with a similar shape to a transit dip, making it difficult to determine the amplitude of the systematics. The joint fit can instead make use of more light curves to accurately constrain the systematics. We note that the 1D GP method had metrics consistent with ideal statistics for the 130 datasets with l_{t} < 0.01 days, which is much shorter than the minimum transit duration of 0.052 days and therefore less likely to mimic a transit dip. We conclude that sharing hyperparameters can improve the reliability of data analyses by utilising more data to account for the systematics, verifying Result (i).
We note however that for real data we need to be careful with which parameters we assume are wavelengthindependent. The amplitude of systematics and white noise could vary significantly across wavelength channels. For our VLT/FORS2 analysis in Sec. 4, we only assume that the length scales of correlated noise are the same across light curves, similar to Ahrer et al. (2022).
3.7 Result (ii): 2D GPs can account for wavelengthindependent systematics
As stated at the beginning of Sect. 3, 2D GPs (with the kernel given in Eq. (2)) can be used to fit wavelengthindependent systematics, meaning that there is no need to use model selection to determine whether to use 1D GPs or 2D GPs. We have already demonstrated that a 2D GP with a wavelength length scale fixed to negligible values (the Hybrid method) can accurately account for wavelengthindependent systematics. For this test, we wanted to demonstrate that even if we fit for the wavelength length scale with a broad prior then we can retrieve similar results. This method has the benefit of avoiding the a priori assumption that systematics are wavelengthindependent.
Our results showed no statistically significant difference in any accuracy metric between the Hybrid method and fitting for l_{λ} with a 2D GP. Both methods showed ideal statistics, based on both the mean and variance of the resulting χ^{2}_{ν} distributions and the values of the K–S statistics. The Z-scores of all retrieved atmospheric parameters were consistent with the unit normal distribution for both methods. However, the average recovered transmission spectrum did have 2.4% larger uncertainties, so there may have been a minor loss in precision from the more conservative approach of fitting for the wavelength length scale.
We see that if systematics are wavelengthindependent, a 2D GP can constrain the wavelength length scale to be sufficiently small to provide robust retrievals. This also has limited cost to the precision of the transmission spectrum compared to assuming the systematics are wavelengthindependent a priori, confirming Result (ii).
Fig. 3 χ^{2}_{ν} histograms for all three methods with the theoretical χ^{2}_{ν} distribution overplotted. Both the 1D GP and Hybrid methods have many outliers that increase the variance of the retrieved χ^{2}_{ν} values. The χ^{2}_{ν} values of the Hybrid method have the correct mean but greater variance than the theoretical distribution, while the 2D GP traces the theoretical distribution very closely.
3.8 Result (iii): 2D GPs can account for wavelengthcorrelated systematics
To study how accurately 1D GPs and 2D GPs can account for systematics correlated in both time and wavelength, we analysed Simulation 2 datasets that cover systematics correlated over wavelength length scales larger than one wavelength bin but smaller than half the wavelength range of the datasets. We fit all 500 datasets with 1D GPs and 2D GPs. However, in addition to having a different kernel to the 1D GP method, the 2D GP could also benefit from sharing hyperparameters (Result i). To isolate the effect of each of these differences, we also fit the Hybrid method to each dataset, which shares hyperparameters between light curves but uses a mathematically equivalent kernel to the 1D GP method.
We found that only the 2D GP method had correctly distributed χ^{2}_{ν} values and atmospheric-parameter Z-scores with unit variance. The χ^{2}_{ν} distributions of each method are shown in Fig. 3. The Hybrid method produces a distribution of χ^{2}_{ν} values with the correct mean but with greatly increased variance, while the 1D GP method performs similarly but with outliers producing a higher mean value (similar to Sec. 3.6). We interpret the result of the Hybrid method as demonstrating that the systematics in each individual light curve are being accurately described by the kernel function of the Hybrid method (resulting in the correct mean value) but the correlation between light curves is not being accounted for (increasing the variance of the χ^{2}_{ν} values). To understand this, consider fitting two light curves where it is incorrectly assumed that the systematics are independent, but the systematics are actually identical. If the systematics in one light curve are recovered with the correct systematics model but happen to result in a 2σ error in the measured transit depth, then the same error will occur in the other light curve. The two light curves analysed together would appear to contain two independent 2σ errors, resulting in a high χ^{2}_{ν} value. However, when we account for the fact that the two light curves are perfectly correlated, this is really only a single 2σ error with a smaller χ^{2}_{ν} value. Similarly, when the error in transit depth is very small in one light curve (e.g. 0.1σ), it will be small in both light curves, resulting in a χ^{2}_{ν} value that is too low. The effect is that the χ^{2}_{ν} distribution of two such light curves can have the correct mean but increased variance when correlations between the light curves are not accounted for.
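The two-light-curve argument above can be checked numerically. This toy Monte Carlo (not from the paper) duplicates a single Gaussian error across two "light curves" and compares the resulting reduced chi-squared statistics with the truly independent case.

```python
import numpy as np

rng = np.random.default_rng(3)
n_sim = 100_000

# Perfectly correlated case: one error duplicated across two light curves
e = rng.normal(0.0, 1.0, n_sim)
chi2_dup = (e**2 + e**2) / 2          # treated as two independent residuals

# Truly independent case for comparison
e1, e2 = rng.normal(0.0, 1.0, (2, n_sim))
chi2_ind = (e1**2 + e2**2) / 2

# Both have mean ~1, but ignoring the correlation inflates the variance:
# ~2 for the duplicated case versus ~1 for truly independent residuals
```

This reproduces the qualitative behaviour of the Hybrid method in Fig. 3: a correct mean but an inflated spread of χ^{2}_{ν} values.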
The 1D GP and Hybrid methods were found to poorly retrieve atmospheric parameters: on average, the radius ratio and slope uncertainties were underestimated and the uncertainty in the strength of a K feature was overestimated. This is shown in Fig. 4, where it can be seen that the Z-scores have variance that is either too high or too low for the different features using these two methods. In contrast, the Z-scores for the 2D GP method are consistent with unit normal distributions.
To understand why the 1D GP kernel underestimates errors in broad features (such as the slope and radius ratio) but can overestimate errors for sharp features (such as the K feature), note that signals in data are most degenerate with systematics when both have similar length scales. For example, wavelength-independent systematics should be more degenerate with a sharp K feature than with broader features. A slope in the transmission spectrum should be more degenerate with systematics that vary gradually in wavelength (i.e. with a long wavelength length scale). Both the 1D GP and Hybrid methods assume that the systematics are wavelength-independent, which could produce uncertainties that are too large when fitting the sharp K feature but too small for broader features. The 2D GP can constrain the systematics to have a longer wavelength length scale, which may increase the uncertainty in broader features while reducing uncertainty in sharp features. This result matches related work examining how different length scale correlations in exoplanet transmission or emission spectra affect atmospheric retrievals (Ih & Kempton 2021; Nasedkin et al. 2023), but it had yet to be demonstrated how these correlations can arise from wavelength-correlated systematics in the spectroscopic light curves.
3.9 Result (iv): 2D GPs can account for commonmode systematics
We first consider how to fit the Simulation 4 data containing common-mode systematics (i.e. systematics that are constant in wavelength) before analysing Simulation 3, which contains systematics with long wavelength length scales. Common-mode systematics are often encountered in transmission spectroscopy datasets, particularly from ground-based observations such as the VLT/FORS2 observations analysed in Sec. 4. These systematics are often dealt with by performing a ‘common-mode correction’. First, a much broader wavelength range relative to the spectroscopic light curves is chosen to extract a ‘white’ light curve. This was replicated for the simulations by averaging all spectroscopic light curves together. The systematics in this white light curve are then fit with a 1D GP, similar to the individual spectroscopic light curve fits. The GP predictive mean is used to model the systematics in this white light curve, which is assumed to describe the common-mode systematic affecting all spectroscopic light curves. Each of the spectroscopic light curves is then divided by this recovered common-mode systematic, resulting in ‘common-mode corrected’ spectroscopic light curves. Each of these corrected light curves is then fit separately to recover the transmission spectrum using the 1D GP approach already outlined, accounting for any wavelength-independent systematics remaining in the light curves.
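The three steps of the common-mode correction can be sketched as below. This is a deliberately simplified toy: the light curves contain no transit, the systematic is a smooth sine curve rather than a GP draw, the GP hyperparameters are fixed by hand, and a squared-exponential kernel is assumed.

```python
import numpy as np

rng = np.random.default_rng(4)
Nl, Nt = 16, 100
t = np.linspace(-0.1, 0.1, Nt)

# Toy spectroscopic light curves (no transit, for clarity) sharing one
# smooth common-mode systematic, plus independent white noise
common = 1.0 + 3e-3 * np.sin(20 * t)
flux = common + rng.normal(0.0, 5e-4, (Nl, Nt))

# Step 1: extract a 'white' light curve by averaging over wavelength
white = flux.mean(axis=0)

# Step 2: GP predictive mean of the white light curve systematics
def sq_exp(x, h, ell):
    d = x[:, None] - x[None, :]
    return h**2 * np.exp(-0.5 * (d / ell) ** 2)

sigma_w = 5e-4 / np.sqrt(Nl)               # white noise of the averaged curve
K = sq_exp(t, 5e-3, 0.03)
alpha = np.linalg.solve(K + sigma_w**2 * np.eye(Nt), white - white.mean())
common_mode = white.mean() + K @ alpha     # predictive mean at the data points

# Step 3: divide each spectroscopic light curve by the recovered systematic
corrected = flux / common_mode
```

Because only this single realisation of `common_mode` is used to divide every channel, all downstream fits are conditioned on it, which is the weakness discussed next.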
There are multiple weaknesses to performing this separate common-mode correction. It is known to produce offsets in the resulting transmission spectrum, as the particular fit of the common-mode systematic chosen to correct the spectroscopic light curves can shift all transit depths equally. Only a single fit of the common-mode systematic is chosen to correct the light curves, so the transmission spectrum is conditioned on this particular fit instead of marginalising over all possible common-mode corrections. This likely has the effect of underestimating the uncertainty in the average radius ratio, because different common-mode corrections offset the transmission spectrum differently. Finally, the assumption that a real dataset contains systematics that are constant in wavelength may be incorrect, as examined in Sec. 4. 2D GPs can avoid using a common-mode correction because they can account for common-mode systematics by using the kernel in Eq. (2) with l_{λ} set to be effectively infinite (in practice, to very large values to maintain numerical stability).
We tested three different approaches on each Simulation 4 dataset. First, all spectroscopic light curves were averaged together to perform a common-mode correction on each dataset, followed by fitting the corrected light curves with 1D GPs as performed for the other simulations. We also fit the data using a 2D GP with the wavelength length scale fixed to a very large value (50 times the wavelength range of the data), which effectively assumes a priori that the systematics present do not vary in wavelength. This is similar to the common-mode correction method but does not account for remaining wavelength-independent systematics (which were not simulated) and also takes advantage of shared hyperparameters between light curves. Finally, we fit a 2D GP where the wavelength length scale was fit to the data (with a log-uniform prior from 1/4 of a wavelength bin to 50 times the total wavelength range of the dataset). This can be viewed as a more conservative approach that uses the data to constrain whether or not the systematics are constant in wavelength. Both 2D GPs used the kernel in Eq. (2).
We found, as expected, that the common-mode 1D GP method underestimated the uncertainty in the radius ratio. In contrast, both 2D GP methods retrieved the radius ratio accurately with unit normal Z-score distributions, as they are not conditioned on a single fit of the common-mode systematics. All methods performed similarly at retrieving the strength of a K feature.
Interestingly, the 2D GP fitting for the wavelength length scale overestimated the uncertainty in the slope of the transmission spectrum (Z-score variance of 0.67 ± 0.04), while both methods that assumed wavelength-constant systematics retrieved the slope with ideal Z-score statistics and with ≈15% smaller average standard deviations. Fitting for the wavelength length scale appears to have reduced the precision of the scattering slope constraint. Since systematics varying in wavelength could produce an apparent slope in the transmission spectrum, it makes sense that removing the assumption that the systematics are constant in wavelength increases the uncertainty in our constraints.
While this loss in precision may not be desirable, we must also consider the risk of incorrectly assuming systematics are constant in wavelength. For Simulation 3, we generated systematics that gradually varied in wavelength using long wavelength length scales (>1/2 the wavelength range of the dataset but <8 times the wavelength range). We then analysed these datasets with the same three methods as for the common-mode systematics. In this case, only the 2D GP method that fit for the wavelength length scale produced robust retrievals of the injected signals across all accuracy metrics, although the constraints on the slope may have been slightly too small, with a Z-score sample variance of s^{2}(Z_{m}) = 1.18 ± 0.07 (the other two methods both had s^{2}(Z_{m}) > 7). We note that when fitting for l_{λ}, the slope constraints for Simulations 3 and 4 may be affected by the choice of prior bounds, as l_{λ} was often consistent with its maximum prior bound for both sets of simulations.
An example of one of the synthetic datasets from Simulation 3 is shown in the left plot of Fig. 5, along with the corresponding common-mode corrected light curves (right plot). We note that the systematics may appear to be constant in wavelength, despite being generated with a wavelength length scale of 4170 Å. The resulting transmission spectra for the 1D GPs with a common-mode correction and for the 2D GP method fitting for l_{λ} are included in Fig. 6. This was chosen as a particularly extreme example, where the common-mode 1D GP method incorrectly detected a non-zero slope at 5.8σ, compared to 2.1σ for the 2D GP fitting for l_{λ}. Of the datasets that were simulated to have a flat spectrum, the 1D GP method with a common-mode correction retrieved a slope >3σ away from zero in 26.7% of them, compared to 0.4% of >3σ outliers for the 2D GP method (consistent with the expected 0.3% for a normally distributed variable). This clearly demonstrates the potential for false detections of scattering slopes if systematics are incorrectly assumed to be constant in wavelength.
Overall, the 2D GP fitting for the wavelength length scale performed well on Simulations 3 and 4 compared to methods that assumed the systematics were common-mode. While there was some loss in precision on the scattering slope for common-mode systematics, this was likely the cost of a more conservative approach that performed significantly better when the systematics had long wavelength length scales.
Fig. 4 Z-scores of retrieved atmospheric features from Simulation 2, which had short wavelength length scale correlated noise. The three methods tested were fitting the light curves with 1D GPs (top row), the Hybrid method (middle row) and a 2D GP (bottom row). All methods have mean Z-scores consistent with zero, that is, none of them biases the results towards measuring larger or smaller values of the features. The retrieved variances of the Z-scores for the 2D GP method are all consistent with a variance of one (matching the Gaussian distribution plotted in red), but those of the other methods are not, which indicates overestimation or underestimation of uncertainties.
Fig. 5 Raw light curves (left) and common-mode corrected light curves (right) for one simulated dataset in Simulation 3 (containing wavelength-correlated systematics). The injected transit signal is plotted as a black dotted line. The variation of the systematics in wavelength is not obvious by eye but can be noticed by comparing the regions highlighted in red boxes. Despite the systematics varying in wavelength, the common-mode correction appears to remove visual signs of systematics.
Fig. 6 Example of retrieved transmission spectra with the 1D GP method (left) analysing common-mode corrected light curves (right plot in Fig. 5) compared with the 2D GP method (right) analysing uncorrected light curves (left plot in Fig. 5). The error bars in blue only convey the mean and standard deviation of each transit depth measurement, so random draws are taken from the covariance matrix to convey potential correlations between transit depths, helping to visualise the increased uncertainty in the offset and scattering slope of the spectrum from the 2D GP method. Only the 1D GP method erroneously detects a negative scattering slope. While it is not visually apparent, the 2D GP method gives a stronger constraint on the detection of potassium (13.9σ compared to 8.7σ).
3.10 Conclusions from simulations
2D GPs allow for hyperparameters to be shared between light curves, which we showed in Simulation 1 has the potential to improve the reliability of retrievals. 2D GPs with a broad wavelength length scale prior are able to accurately account for systematics across a wide range of wavelength length scales – including when the systematics are wavelength-independent (Simulation 1), varying in wavelength (Simulations 2 and 3), or constant in wavelength (Simulation 4).
When systematics are correlated in wavelength with a short wavelength length scale (Simulation 2), analysing these data with 1D GPs or the Hybrid method resulted in constraints on broad features (i.e. the radius ratio and scattering slope) that were too small and constraints on sharp features (e.g. a K feature) that were too large.
The effect of incorrectly assuming wavelength-correlated systematics are common-mode was examined in Simulation 3. This was also found to significantly underestimate the uncertainty in the scattering slope, which may help explain previous conflicting scattering slope measurements in the literature. We note that stellar activity (e.g. from unocculted starspots) has previously been used to explain conflicting scattering slopes between different transits (e.g. Rackham et al. 2017; May et al. 2020). Since unaccounted-for wavelength-correlated systematics also have the potential to affect the measured transmission spectrum slope, it is possible that stellar properties such as starspot covering fractions deduced from the slope of the transmission spectrum may be inaccurate.
We conclude that our 2D GP method is a much safer approach to fitting data that may contain wavelength-correlated systematics: it was the only model tested that could accurately account for systematics across all four sets of simulations, although common-mode systematics may cause the uncertainty in the scattering slope to be slightly overestimated. As our method does not require a separate common-mode correction, the uncertainty in the radius ratio can be fully accounted for in the transmission spectrum.
4 Reanalysis of VLT/FORS2 observations
To test our method on real data, we performed a reanalysis of the VLT/FORS2 transit observations of WASP-31b first analysed in Gibson et al. (2017). We aimed to identify whether wavelength-correlated systematics are present in these observations and how the constraints on atmospheric parameters changed after accounting for these systematics. Based on the results in Sec. 3, if there is strong evidence for wavelength-varying systematics from the 2D GP analyses (based on the recovered wavelength length scale), then we should be sceptical of the original analysis, as it did not account for wavelength-varying systematics – making the constraints unreliable.
WASP-31b is a low-density hot Jupiter. It was initially reported in Anderson et al. (2011) with a mass of 0.478 ± 0.029 M_{J} and a radius of 1.549 ± 0.050 R_{J}. It has a 3.4-day orbit around a late F-type dwarf (V = 11.7).
The data consist of two transits of WASP-31b, the first using the GRIS600B grism taken on the night of 2016 February 15 and the second using the GRIS600RI grism with the GG435 order-blocking filter taken on 2016 March 3. These are referred to as the 600B and 600RI datasets, with the extracted light curves covering the wavelength ranges 3868–6238 Å and 5206–8476 Å, respectively. See Gibson et al. (2017) for more details about the observations and subsequent data reduction. The same wavelength calibration and spectral extraction were used as in the previous analysis, and the same 150 Å wide wavelength bins were used for extracting the spectroscopic light curves.
As noted in Sec. 1, these data conflict with previous observations from HST/STIS, which identified a strong K feature at 4.3σ significance (Sing et al. 2015) that was not replicated in the original VLT analysis in Gibson et al. (2017). This conflict was the initial reason for choosing these observations to reanalyse, in order to study whether accounting for wavelength-correlated noise may resolve the discrepancy. We leave the reanalysis of the HST/STIS light curves for future work.
4.1 Previous analysis with 1D GPs
The approach used to fit the light curves in Gibson et al. (2017) was as follows: after basic data reduction steps were performed (such as background subtraction, aperture extraction and wavelength calibration), the flux from the target star was divided by the flux from a comparison star to account for changes in flux due to the Earth's atmosphere.
A common-mode correction was then performed – as described in Sec. 3.9. However, in addition to dividing the spectroscopic light curves through by the common-mode systematics, the residuals of the white light curve (found by subtracting the GP predictive mean from the white light curve) were subtracted from each spectroscopic light curve. This was to account for high-frequency systematics – which are similar to common-mode systematics but occur on a faster timescale, making them indistinguishable from white noise in the white light curve fit. The presence of high-frequency systematics can make the residuals of each spectroscopic light curve correlated – which this process aims to remove. Because the white light curve residuals are subtracted from all spectroscopic light curves, this procedure assumes that these systematics are constant in wavelength. After these corrections were performed, each spectroscopic light curve was fit separately using a transit model combined with a 1D GP – accounting for remaining systematics that are correlated in time but assumed to be wavelength-independent. The amplitude of white noise was also fit separately for each wavelength.
The exact cause of each of these systematics is unclear. In Gibson et al. (2017), the authors speculate that the common-mode systematics are most likely instrumental (e.g. inhomogeneities in components such as the grism or derotator), while the high-frequency systematics could be due to varying atmospheric throughput, potentially from uneven cloud cover. Stellar activity can in principle also cause variations in flux during transit (i.e. from starspot crossings) or at any time (i.e. due to varying starspot coverage or stellar oscillations). However, photometric monitoring described in Sing et al. (2015) suggests that WASP-31 is a quiet star with little photometric variability – so we consider it unlikely that stellar activity is a significant contributor to systematics in these data. Remaining wavelength-independent systematics are often found to be particularly strong near telluric features, such as the ~7550–7750 Å O_{2} A band (see Fig. 7 in Sedaghati et al. 2016 for an example), which suggests these systematics could be due to atmospheric variability, or perhaps the sharp changes in flux increase instrumental systematics.
4.2 Choosing the kernel function
Examining the systematics accounted for in the previous analysis suggests that the kernel of our 2D GP should be able to account for (a) ‘common-mode’ (CM) systematics, (b) high-frequency systematics (HFS), (c) any remaining time-correlated systematics in the spectroscopic light curves (which we call wavelength-specific systematics or WSS) and (d) the changing amplitude of white noise across different wavelength channels. It should also match the form of Eq. (12) to make this analysis computationally tractable. The kernel chosen for the 2D GP analysis is given in Eq. (30) and our choice of priors for all parameters is outlined in Appendix C:

K(λ, t; λ′, t′) = [h_{CM}^{2} exp(−(λ − λ′)^{2}/(2l_{λ}^{2})) + δ_{λλ′} h_{WSS,λ}^{2}] exp(−(t − t′)^{2}/(2l_{t}^{2})) + [h_{HFS}^{2} exp(−(λ − λ′)^{2}/(2l_{λ,HFS}^{2})) + δ_{λλ′} σ_{λ}^{2}] δ_{tt′}. (30)
This equation can be multiplied out to reveal the four individual terms used to account for (a)–(d):

h_{CM}^{2} exp(−(λ − λ′)^{2}/(2l_{λ}^{2})) exp(−(t − t′)^{2}/(2l_{t}^{2})) (30a)

h_{HFS}^{2} exp(−(λ − λ′)^{2}/(2l_{λ,HFS}^{2})) δ_{tt′} (30b)

δ_{λλ′} h_{WSS,λ}^{2} exp(−(t − t′)^{2}/(2l_{t}^{2})) (30c)

δ_{λλ′} δ_{tt′} σ_{λ}^{2} (30d)
Equation (30a) was included to account for (a), but instead of assuming a priori that the systematics are common-mode as in the previous analysis, we fit for a wavelength length scale l_{λ} with a broad prior range that can account for wavelength-varying or common-mode systematics. The right plot in Fig. 7 appears to show the presence of systematics that vary quite gradually in wavelength across the 600RI dataset, similar to the simulated light curves in Fig. 5 containing wavelength-varying systematics. Correlation in time was fit with the time length scale l_{t}, and we assumed the height scale of these systematics did not vary over the dataset and therefore fit for a single parameter h_{CM}.
Equation (30b) accounts for (b) high-frequency systematics. We fit for a wavelength length scale l_{λ,HFS} and a single height scale h_{HFS} for these systematics. The difference in our kernel choice between (a) and (b) is that we assume the high-frequency systematics are independent in time (as was also assumed in the original analysis). The Kronecker delta term δ_{tt′} represents this independence in time in our kernel function.
Equation (30c) accounts for remaining systematics that are correlated in time (with time length scale l_{t}) but wavelength-independent (represented by the Kronecker delta term δ_{λλ′}). We tested replacing this Kronecker delta term with a squared-exponential kernel in wavelength to avoid assuming these systematics are wavelength-independent, but it had little effect on the results (similar to the analyses in Sec. 3.7). Wavelength-independent systematics appear as light curves having systematics of a different shape to neighbouring light curves – such as the region highlighted in the left plot of Fig. 7. The height scales of these systematics are fit independently for each wavelength (given by h_{WSS,λ}) because some wavelength channels may cover telluric features that can have particularly large amplitude systematics. The time length scale l_{t} is shared with Eq. (30a) in order to fit the form of Eq. (12). This assumes that any common-mode or wavelength-varying systematics accounted for by Eq. (30a) have the same timescale as any wavelength-independent systematics, which is not necessarily true but is an unavoidable assumption for this method. However, as shown in Sec. 3.6, sharing hyperparameters between light curves can provide more robust constraints when it is a safe assumption that the parameters are constant.
The terms in Eq. (30d) fit for the amplitude of white noise. This is done separately for each wavelength channel (with amplitudes σ_{λ}) as the white noise amplitude can be strongly wavelength-dependent for real observations. There is a Kronecker delta term δ_{λλ′} that accounts for the noise being independent in wavelength and a second Kronecker delta term δ_{tt′} to account for the noise being independent in time.
By calculating the GP mean at the location of best fit, but with different height scales h_{CM}, h_{HFS} or h_{WSS} set to zero, it is possible to roughly visualise each type of systematics in these datasets. The common-mode systematics were visualised by setting both h_{HFS} and h_{WSS} to zero when calculating the GP mean, while the high-frequency and wavelength-specific systematics were visualised by examining the change in the GP mean when just h_{HFS} or just h_{WSS} is set to zero, respectively^{7}. This is shown for the 600B data in Fig. 8. The autocorrelation of the residuals when each of these terms is set to zero is examined in Appendix F.
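For additive kernels, the change in the GP predictive mean from zeroing one height scale is equivalent to the posterior mean of that single component, K_comp K^{−1} r. A minimal 1D sketch of this standard GP identity (illustrative only; the example data and hyperparameter values are hypothetical and the paper applies the 2D analogue):

```python
import numpy as np

def sq_exp(x, h, l):
    """Squared-exponential kernel matrix for 1D inputs x."""
    d = x[:, None] - x[None, :]
    return h**2 * np.exp(-0.5 * (d / l)**2)

def component_means(kernels, sigma, y):
    """Posterior mean of each additive GP component: K_c @ K_total^{-1} @ y."""
    K_total = sum(kernels) + sigma**2 * np.eye(y.size)
    alpha = np.linalg.solve(K_total, y)
    return [K_c @ alpha for K_c in kernels]

# hypothetical data: a slow trend plus fast systematics plus white noise
t = np.linspace(0.0, 1.0, 200)
rng = np.random.default_rng(0)
y = (np.sin(2 * np.pi * t) + 0.1 * np.sin(40 * np.pi * t)
     + 0.05 * rng.standard_normal(t.size))
slow, fast = component_means([sq_exp(t, 1.0, 0.3), sq_exp(t, 0.1, 0.02)], 0.05, y)
# slow + fast reconstructs the full (noise-free) GP predictive mean exactly
```

This decomposition is what allows each panel of a figure like Fig. 8 to display one systematics component at a time.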
Equation (30) can also be written in the Kronecker product form of Eq. (11) by choosing:

[K_{λ}]_{λλ′} = h_{CM}^{2} exp(−(λ − λ′)^{2}/(2l_{λ}^{2})) + δ_{λλ′} h_{WSS,λ}^{2}, (31)

[K_{t}]_{tt′} = exp(−(t − t′)^{2}/(2l_{t}^{2})), (32)
and similarly for ∑_{λ} and ∑_{t} (in this case ∑_{t} is the identity matrix).
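With ∑_t the identity, the GP log-likelihood over the full grid can be evaluated with two small eigendecompositions instead of factorising the full MN × MN covariance. A minimal numpy sketch of this standard Kronecker trick (hyperparameter values and grid are illustrative; this is not the authors' implementation):

```python
import numpy as np

def sq_exp(x, h, l):
    """Squared-exponential kernel matrix for 1D inputs x."""
    d = x[:, None] - x[None, :]
    return h**2 * np.exp(-0.5 * (d / l)**2)

def kron_gp_loglike(R, K_lam, K_t, S_lam):
    """Log-likelihood for cov = K_lam (x) K_t + S_lam (x) I,
    with residuals R on an (M wavelengths, N times) grid."""
    M, N = R.shape
    lam_s, Q = np.linalg.eigh(S_lam)           # S_lam = Q diag(lam_s) Q^T
    W = Q / np.sqrt(lam_s)                     # whitening factor: W^T S_lam W = I
    sl, Ul = np.linalg.eigh(W.T @ K_lam @ W)   # whitened wavelength kernel
    st, Ut = np.linalg.eigh(K_t)
    Z = Ul.T @ (W.T @ R) @ Ut                  # residuals rotated to the eigenbasis
    D = sl[:, None] * st[None, :] + 1.0        # eigenvalues of whitened covariance
    quad = np.sum(Z**2 / D)
    logdet = np.sum(np.log(D)) + N * np.sum(np.log(lam_s))
    return -0.5 * (quad + logdet + M * N * np.log(2 * np.pi))

# hypothetical grid and hyperparameters
wl = np.linspace(5200.0, 8400.0, 8)     # wavelength bin centres [Angstrom]
t = np.linspace(-0.1, 0.1, 20)          # time from mid-transit [days]
h_cm, l_lam, l_t = 5e-4, 4000.0, 0.05
h_hfs, l_hfs = 2e-4, 1000.0
h_wss, sigma = 1e-4, 5e-4
K_lam = sq_exp(wl, h_cm, l_lam) + np.diag(np.full(wl.size, h_wss**2))  # cf. Eq. (31)
K_t = sq_exp(t, 1.0, l_t)                                              # cf. Eq. (32)
S_lam = sq_exp(wl, h_hfs, l_hfs) + np.diag(np.full(wl.size, sigma**2))
```

Here the cost is dominated by O(M³ + N³) eigendecompositions and O(MN(M + N)) matrix products, rather than the O(M³N³) of a dense factorisation; the row-major ravel of R corresponds to the kron(K_lam, K_t) ordering.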
4.3 Reanalysis procedure
We compare the original 1D GP analysis from Gibson et al. (2017) to multiple analyses with 2D GPs to examine how different assumptions affect the results. In addition, a reanalysis using 1D GPs was performed to enable a closer comparison of the differences between 1D and 2D GPs. This 1D GP reanalysis was similar to the original but more closely matched the 2D GP method implemented in this work, for example by fitting each light curve for the transit depth ρ^{2} instead of the radius ratio ρ, using the same priors on corresponding parameters as the 2D GP fit, and using a squared-exponential kernel instead of the Matérn 3/2 kernel chosen in the original analysis. Log-uniform priors on all of the hyperparameters were used, which is also a slight change from the original analysis.
The spectroscopic light curves were binned using the same wavelength bins as the original analysis, although there may potentially have been more outliers clipped at the edges of the datasets. Each light curve had separate parameters for the transit depth ρ^{2}, limb-darkening parameters c_{1} and c_{2}, out-of-transit flux F_{oot} and linear trend in baseline flux T_{grad}. The light curves shared the central transit time T_{0}, system scale a/R_{*}, period P and impact parameter b. The period was fixed to the mean literature value given in Patel & Espinoza (2022) of P = 3.4059095 ± 0.0000047 days.
For the 1D GP reanalysis, a common-mode correction was performed (see Sec. 3.9), with the white light curve for each dataset extracted across the same broad wavelength ranges as in Gibson et al. (2017). It was fit for T_{0}, a/R_{*}, ρ^{2}, b, c_{1}, c_{2}, F_{oot}, T_{grad}, h, l_{t} and σ using DEMC. T_{0}, a/R_{*} and b were fixed to their best-fit values when fitting the spectroscopic light curves. The same Gaussian priors were placed on a/R_{*}, b and on the radius ratio of the white light curve as in Gibson et al. (2017). These priors were taken from previously reported values in Sing et al. (2015). The priors were included because the large amplitude common-mode systematics made constraining parameters that affect the overall transmission spectrum difficult. To replicate this constraint for the 2D GP analyses – which do not perform a common-mode correction – these Gaussian priors were kept on a/R_{*} and b and also on the mean radius ratio^{8} when fitting the spectroscopic light curves.
As the data alone were not sufficient to accurately constrain the limb-darkening parameters, Gaussian priors were placed on c_{1} and c_{2} based on predicted values from theoretical models using PYLDTK. The mean values given by PYLDTK were set as the prior means, but the prior uncertainties were increased to a standard deviation of 0.1 to account for uncertainty in the limb-darkening models (matching the procedure in Gibson et al. 2017).
Outliers in the spectroscopic light curves were clipped by performing a best fit to the data with a GP – optimising both the transit parameters and hyperparameters – and clipping 4σ outliers (evaluated using the GP predictive mean and variance). The outliers were replaced with the GP predictive mean evaluated at the location of the points. This was performed using the 2D GP predictive mean for the 2D analysis (see Appendix A) and the 1D GP predictive means for each light curve for the 1D GP reanalysis. Replacing these outliers as opposed to removing them ensures the flux observations still lie on a complete 2D grid. This is required for our optimised 2D GP method and also matches the procedure in the original analysis.
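The clipping-and-replacement step can be sketched as follows, assuming the GP predictive mean and variance have already been computed (array names are illustrative):

```python
import numpy as np

def clip_outliers(flux, gp_mean, gp_var, nsigma=4.0):
    """Replace points more than nsigma from the GP predictive mean with the
    mean itself, so the flux observations stay on a complete grid."""
    z = np.abs(flux - gp_mean) / np.sqrt(gp_var)
    return np.where(z > nsigma, gp_mean, flux)
```

In the 2D case, `flux`, `gp_mean` and `gp_var` are wavelength-by-time grids and `np.where` broadcasts over the whole grid, so no points are removed and the Kronecker structure of the inputs is preserved.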
The 1D GP reanalysis was similar to the procedure described in Sec. 3.4, but with each (corrected) spectroscopic light curve being fit for ρ^{2}, c_{1}, c_{2}, F_{oot}, T_{grad}, h, l_{t} and σ. As more parameters were being fit, the chains were run for longer, with 100 independent chains having 500 burn-in steps and another 500 steps run for the chain. If necessary, chains were extended until convergence occurred (measured using the Gelman–Rubin statistic as in Sec. 3.4). The 2D GP analyses used blocked Gibbs and slice sampling in combination with NUTS sampling for inference (see Appendix E for more details). Sampling was performed with a burn-in length of 1000 and a chain length of 1000 with four independent chains. Most of the 2D GP analyses were fitting seven parameters for each light curve (ρ^{2}, c_{1}, c_{2}, F_{oot}, T_{grad}, h_{WSS}, σ) with five wavelength-independent parameters (h_{CM}, l_{t}, l_{λ}, h_{HFS}, l_{λ,HFS}). Exceptions to this are explained in Sec. 4.4. For the 600B data with 16 light curves, this results in 117 parameters. The 600RI data had 22 light curves and therefore had 159 parameters to fit. The use of NUTS permitted convergence for all parameters within this relatively small number of sampling steps.
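The Gelman–Rubin convergence check used for the chains can be computed per parameter as follows (a standard textbook form of the statistic, not the authors' code):

```python
import numpy as np

def gelman_rubin(chains):
    """R-hat for one parameter, chains shaped (n_chains, n_steps)."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()    # mean within-chain variance
    B = n * chain_means.var(ddof=1)          # between-chain variance
    var_hat = (n - 1) / n * W + B / n        # pooled variance estimate
    return np.sqrt(var_hat / W)

# well-mixed chains sampling the same target give R-hat close to 1
rng = np.random.default_rng(0)
chains = rng.standard_normal((100, 500))
```

Chains are typically extended until R-hat falls below a threshold such as 1.01 or 1.1 for every parameter.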
Basic atmospheric retrievals similar to those described in Sec. 3.5 were performed. We fit a radius and slope to each transmission spectrum, in addition to measuring the change in the radius ratio at the bin centred on the Na feature, Δρ_{Na}, for both the 600B and 600RI data, as well as the change in radius ratio at the K feature, Δρ_{K}, for the 600RI data. The 600RI data have a 150 Å bin centred on 7681 Å, which is wide enough to cover the full K I doublet. The 600B and 600RI datasets have 150 Å bins centred at 5893 Å and 5881 Å respectively, which both cover the Na I doublet. This same atmospheric retrieval was also performed on the transmission spectra taken from the original analysis.
Fig. 7 600B grism (left) and 600RI grism (right) spectroscopic light curves with the best-fit transit region shaded in grey. Left: the light curves appear to have very similar systematics, with the exception of the top one or two light curves, in which the region in the red box shows a small increase in flux at the end of the observation. This could suggest the presence of wavelength-specific systematics (WSS) in these data. Right: the slowly varying shape of the systematics – as highlighted in the red box – suggests that the ‘common-mode’ systematics are not actually common-mode but gradually vary in wavelength.
Fig. 8 Plots showing the GP predictive means fit by the 2D GP to the 600B dataset to account for (in order from left to right): (a) common-mode systematics, (b) high-frequency systematics and (c) wavelength-specific systematics. The rightmost plot shows the raw light curves, included for reference. Specifically, the middle plots show the change in the GP mean caused by including these systematics in the fit, as discussed in Sec. 4.2.
Fig. 9 600B (top) and 600RI (bottom) retrieved transmission spectra comparing the original analysis, the 1D GP reanalysis and the analogous 2D GP analysis. The 2D GP best-fit atmospheric retrieval is shown as a red dotted line and random draws taken from the 2D GP spectrum are shown in grey. Top: all three analyses show largely consistent results. Bottom: the error bars of each method may appear consistent, but the random draws from the 2D GP spectrum demonstrate a large uncertainty in the slope of this spectrum after accounting for the covariance between transit depths.
4.4 Reanalysis results
A comparison of the transmission spectra from the 1D GP analyses to our initial 2D GP analysis is plotted in Fig. 9. The results of all atmospheric retrievals performed are included in Tables 4 and 5. The scattering slope reported in these tables is given in terms of the α parameter from Eq. (28) to make it easier to compare to Rayleigh scattering (α = −4). The α values were calculated by taking the literature values of H = 1220 km for the scale height and R_{*} = 1.12 R_{⊙} for the stellar radius, as reported in Sing et al. (2015) and Anderson et al. (2011) respectively. The reduced chi-squared χ_{ν}^{2} of the best-fitting atmospheric model for each recovered transmission spectrum (accounting for covariance for the 2D GP analyses) is also included. Unlike in Sec. 3, where the true transmission spectra were known, these values can only be used to help quantify how consistent each analysis is with the specific choice of atmospheric model.
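The conversion from a fitted transmission-spectrum slope to α uses the standard relation d(R_p/R_*)/d ln λ = α H/R_*. A short sketch with the quoted H and R_* (the slope value below is hypothetical, chosen only to illustrate the conversion):

```python
# Convert a fitted slope d(rho)/d(ln lambda) into the alpha of Eq. (28),
# using H = 1220 km (Sing et al. 2015) and R_* = 1.12 R_sun (Anderson et al. 2011)
H = 1220e3           # atmospheric scale height [m]
R_sun = 6.957e8      # nominal solar radius [m]
R_star = 1.12 * R_sun

slope = -6.26e-3     # hypothetical fitted d(rho)/d(ln lambda)
alpha = slope * R_star / H   # ≈ -4 here, i.e. consistent with Rayleigh scattering
```

Because R_*/H enters multiplicatively, any systematic error in the adopted scale height or stellar radius rescales α but does not change the significance of a slope detection.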
Appendix L includes visualisations of the correlation matrix for both the 600B and 600RI transmission spectra, 2D GP bestfits to the spectroscopic light curves and corner plots for two of the 2D GP analyses.
4.4.1 1D GP reanalysis and the analogous 2D GP analysis
Our initial 2D GP analysis attempted to closely resemble the 1D GP analysis, and so we refer to it as the ‘analogous 2D GP analysis’. However, the 2D GP has a different kernel with wavelength length scale parameters l_{λ} and l_{λ,HFS} that can account for how the ‘common-mode’ and high-frequency systematics may actually vary in wavelength. It shares the same time length scale parameter l_{t} across all light curves, which could help provide more robust constraints (see Sec. 3.6). It also retrieves the full covariance matrix of the transmission spectrum instead of assuming all transit depths are independent. Figure 9 compares the resulting transmission spectra from the original Gibson et al. (2017) analysis, the 1D GP reanalysis and the analogous 2D GP reanalysis. Random draws taken from the 2D GP covariance matrix are included to visualise the correlated uncertainties in the 2D GP transmission spectrum (as the error bars do not convey this information).
For the 600B dataset, the data were found to be consistent with containing common-mode systematics, as the posterior of the wavelength length scale was concentrated towards the maximum prior limit of 50 times the wavelength range of the data. This does not rule out that the systematics could be varying very gradually in wavelength; specifically, the wavelength length scale was constrained to be > 6772 Å (≈3 times the wavelength range of the dataset) to 99% confidence. This constraint implies that systematics that vary over length scales shorter than 3 times the wavelength range of the dataset must be distinguishable from common-mode systematics, and so if we constrain other wavelength length scales to be below this value we can safely assume those systematics are varying in wavelength. For example, the high-frequency systematics in this dataset were strongly constrained to be varying in wavelength with Å. The 600RI data were significantly affected by wavelength-varying systematics as Å for data with a wavelength range of 3135 Å. The high-frequency systematics in these data were also found to vary in wavelength with Å.
Since the wavelength length scale for the 600B dataset was concentrated towards an arbitrary choice of maximum prior limit, this analysis was repeated with the prior limit reduced to 15 times the wavelength range of the data. There was little difference in the results other than a slight increase (~11%) in the uncertainty of the slope of the transmission spectrum. This increase in uncertainty makes sense, as the shorter prior limit forces the wavelength length scale to take smaller values, which are likely to increase the uncertainty in the slope.
Note there is a risk that the wavelength-correlated systematics identified actually arise from wavelength-varying inaccuracies in our transit model (e.g. due to unaccounted-for limb asymmetries; see Powell et al. 2019). We performed another test where we masked out the transit region of each dataset and performed an identical analysis on the masked datasets, except without fitting for a transit model. The linear baseline parameters F_{oot} and T_{grad} were still fit for to account for changes in baseline flux. The resulting constraints on the hyperparameters that fit for wavelength correlation (h_{CM}, h_{HFS} and ) were all consistent with the values from the previous analyses to 2.2σ. The systematics were similarly constrained to be varying in wavelength, except for the 600B wavelength length scale, which was again consistent with common-mode systematics.
Our results strongly contradict the assumptions of the original analysis that only common-mode systematics and wavelength-independent systematics are present in the 600B and 600RI datasets. Based on the results of Sects. 3.8 and 3.9, these incorrect assumptions could lead the uncertainties on broad features such as the scattering slope to be underestimated and could cause the uncertainties on sharp features such as the strength of Na and K features to be overestimated. Tables 4 and 5 appear consistent with this interpretation, as the 2D GP analyses all show increased uncertainty in the scattering slope and decreased uncertainty on the constraints of Na and K. The 600RI data were particularly affected; for example, the uncertainty in the scattering slope was ~3.9 times larger compared to the original analysis. This makes sense, as both the ‘common-mode’ systematics and high-frequency systematics were found to vary in wavelength for this dataset. The constraint of α = 8.09 ± 8.02 from the 2D GP analysis is weak enough to be consistent with a positive slope (α > 0), a flat spectrum (α = 0), Rayleigh scattering (α = −4) or super-Rayleigh scattering (α < −4). This is in contrast to the 1D GP analyses, which are both consistent with either a positive slope or a flat spectrum. The tighter constraints on Na and K in the 2D GP analyses (~50% smaller) still did not lead to a detection of either species in any analysis (<2.3σ for all analyses). We note that the shared time length scale l_{t} in the 2D GP analysis could also have tightened the constraints on Na and K – similar to results in Ahrer et al. (2022).
We found that the χ_{ν}^{2} values of the best-fitting atmospheric models increased significantly for the 2D GP analyses, particularly on the 600RI data. We should expect that χ_{ν}^{2} ∈ [0.46, 1.75] to 95% confidence for the 600RI data. The 1D GP analyses are on the lower end of this confidence interval while the analogous 2D GP analysis is on the higher end^{9}. Since the 1D GP analyses do not account for wavelength-correlated systematics, they could certainly be overestimating uncertainties. However, it is not clear whether the high χ_{ν}^{2} values for the 2D GP analyses indicate that the 2D GP analyses are underestimating uncertainties (potentially due to a suboptimal choice of kernel) or whether the chosen atmospheric model was not flexible enough. The χ_{ν}^{2} value is still close to the 95% confidence interval, however, so this may also be due to random chance.
The runtime of the 2D GP analysis was longer than the 1D GP analysis by about one order of magnitude. The combined runtime of the MCMCs for the 1D GP reanalysis took 28 min for the 600B data and 54 min for the 600RI data. The MCMCs for each of the analogous 2D GP analyses on the 600B and 600RI datasets took 6.5–7 h each^{10}.
Results of atmospheric retrievals on the 600B data analysed using different methods.
Results of atmospheric retrievals on the 600RI data analysed by different methods.
Fig. 10 Similar to Fig. 9 but showing the different 2D GP analyses tested. As error bars do not convey the correlations between the data points, it is difficult to see that the 600B fit with no mean radius prior – which appears to have much larger uncertainties – actually has a similar constraint on the slope and Na feature compared to the other methods. For the direct retrieval of atmospheric parameters from the data, a dotted line is included that shows the best-fit atmospheric model (as the transmission spectrum is not directly retrieved with this method).
4.4.2 Fitting without a mean radius prior
The original analysis placed a Gaussian prior on the radius ratio of the white light curve, which the 2D GP analyses match by placing the same prior on the square root of the mean transit depth (which should approximately match the radius ratio of the white light curve). This prior is in addition to the uniform priors already used when fitting for each transit depth. Fitting the transmission spectrum without this prior on the mean radius ratio, combined with the lack of a common-mode correction, permits the retrieval of the overall offset of the transmission spectrum.
We repeated the analogous 2D GP analyses without this prior on the mean radius ratio. By examining Table 4, we can see for the 600B data that the uncertainty of the radius ratio increases significantly without this prior and is ~13.9 times larger than in the analogous 2D GP analysis. Other than this, however, the constraints on the other parameters are almost identical, with no significant difference in the scattering slope or the retrieval of the Na feature. See Fig. 10 for comparisons between the transmission spectra for each of the 2D GP analyses, including the 600B data fit without a mean radius prior.
The corresponding analysis of the 600RI data resulted in a much weaker constraint on the offset of the transmission spectrum. The mean radius ratio was retrieved to be ρ̄ = 0.08687 ± 0.00896, which is highly inconsistent with the value of ρ̄ = 0.12348 ± 0.00362 from the 600B data (which shares some overlapping wavelength bins). Fitting the white light curve with a 1D GP without including the Gaussian radius prior also results in a low radius ratio of ρ_{white} = 0.10277 ± 0.00891. As these results are inconsistent with previous observations, the simplest explanation may be that the systematics present in this dataset happened to be particularly similar to a transit signal, as suggested in Sec. 3.7. The other atmospheric parameters were not significantly changed from the analogous 2D GP analysis – as can be seen in Table 5 – so the choice of including this prior mainly affects the constraint on the radius ratio.
4.4.3 Marginalising over uncertainty in T_{0}, a/R_{*} and b
Jointly fitting the spectroscopic light curves allows us to vary the central transit time T_{0}, system scale a/R_{*} and impact parameter b as parameters within the MCMC. We recommend this approach as it accounts for how the uncertainty in these parameters affects the transmission spectrum. We can examine whether this has a significant effect by comparing our results to the analogous 2D GP analysis, which matched the 1D GP procedure of fixing these parameters.
The central transit time was allowed to vary freely with a uniform prior for the 600B data, but the 600RI data struggled to tightly constrain T_{0} (as was also seen in the 600RI white light curve analysis). It was therefore decided to constrain T_{0} using the retrieved value from the 600B data as a Gaussian prior. This assumes that WASP-31b does not have significant transit timing variations (TTVs) and that the literature value used for the period is accurate. While there is some evidence to question these assumptions (i.e. Patel & Espinoza 2022; Bonomo et al. 2017; Kokori et al. 2022), we ignore them here to demonstrate the method on both datasets. The Gaussian priors on a/R_{*} and b from Sing et al. (2015) that were used in the white light curve fitting in the original analysis were still kept on these parameters for the 600B analysis. The retrieved constraints on a/R_{*} and b from the 600B data were then used as Gaussian priors for the 600RI data. A more rigorous but computationally expensive approach could have been to perform a joint fit on both datasets with common a/R_{*} and b parameters and T_{0} values offset by 5 periods (the observations are 5 orbits apart).
Marginalising over T_{0}, a/R_{*} and b appears to have had little effect on the retrieval of the 600B data. The mean radius would likely be the parameter most affected by varying these parameters (as they are shared by all light curves) but the mean radius prior could be restricting this effect. The retrieved central transit time was T_{0} (HJD − 2457433) = 0.753597 ± 0.000578, which is very close to the value retrieved from the white light curve of T_{0} (HJD − 2457433) = 0.753609 ± 0.000662. The marginalisation over these parameters also had little effect on the 600RI data, as the atmospheric retrieval produced similar results to the analogous 2D GP analysis.
4.4.4 Directly fitting atmospheric forward models to the data
Typically, a transmission spectrum is first obtained by individually fitting each spectroscopic light curve, and an atmospheric retrieval is then performed on the resulting transmission spectrum. With all spectroscopic light curves being fit simultaneously, we can instead perform an atmospheric retrieval while fitting the light curves. The transit depths can be generated from atmospheric forward models, with the atmospheric parameters directly retrieved in the MCMC fit to the light curves, similar to how high-resolution transmission spectroscopy atmospheric retrievals are performed (e.g. Maguire et al. 2023).
This reduces the number of parameters to fit, which may be useful when scaling this method to fit more light curves simultaneously. It also avoids approximating the posterior of the transmission spectrum. Typically, the transmission spectrum is extracted by taking the mean and standard deviation (or the covariance matrix in our method) of each transit depth from the MCMC chains, which approximates the posterior as a Gaussian distribution. While different distributions could potentially model the posterior better, there would still likely be some modelling error, which can be avoided by directly fitting atmospheric models to the light curves.
The results from using this method closely match the analogous 2D GP analysis for the 600B and 600RI data, with the exception of the retrieved slope of the 600RI data, which has a significantly reduced mean and larger uncertainty. Explaining why this discrepancy occurred is difficult, as individual transit depths are not recovered with this method, making it harder to determine the goodness of fit. The increased complication in interpreting the results is a major downside of this approach.
With this method, the systematics may be fit differently depending on the choice of atmospheric model. This is concerning, as imperfections in our models could result in the GP fitting larger systematics to the data. While it is beneficial to avoid approximating the posterior of the transmission spectrum, this method appears to complicate the choice of atmospheric model and makes it harder to visualise whether the data are consistent with a model. This method is therefore not recommended for testing different atmospheric retrievals but could be an interesting test once a specific atmospheric model has been chosen.
4.5 Benchmarking
We tested how the runtime for this method is affected as a function of the number of time exposures and wavelength bins. The kernel function used to fit the VLT datasets was used (Eq. (30)) with the kernel hyperparameters set to the best-fit values of the 600RI analogous 2D GP analysis. As some hyperparameters were fit separately for each wavelength, changing the number of wavelength bins required us to interpolate these best-fit values. One of the two time covariance matrices produced by this kernel function is the identity matrix. This makes the eigendecomposition of this matrix trivial, which was exploited in these calculations. For a large number of time points relative to wavelength bins, the eigendecomposition of these two time covariance matrices is typically the bottleneck in the log-likelihood calculation, so avoiding one eigendecomposition can almost halve the runtime.
For each number of wavelengths and time points tested, a random draw of noise described by this kernel function was first taken. The runtime of the mean function was not included, so the log-likelihood calculation was performed using this random draw of noise as the residuals. The mean function used in this paper would have had little effect, as even for N = 512 and M = 256 the runtime of the mean function was only ~3.5 ms.
The gradient calculation of the log-likelihood was not performed for this test. However, the runtime would likely not be significantly affected, as eigendecomposition is generally the computational bottleneck regardless of whether the gradients are calculated, and the same eigendecomposition can be stored to calculate both the log-likelihood and its gradients (see Appendix H).
Log-likelihood calculations were performed using both our method and Cholesky factorisation on the full MN × MN covariance matrix. The benefit of the Cholesky method is that it places no restrictions on the kernel function or on the data needing to lie on a complete grid. However, its poor runtime scaling of O(M^{3}N^{3}) makes it unfeasible for most real datasets.
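To illustrate the structure being exploited, the sketch below (our illustration, not the LUAS implementation) computes the log-likelihood for the simplified case of a single Kronecker term plus white noise, K = K_λ ⊗ K_t + σ²I; the full method (following Rakitsch et al. 2013) generalises this to a sum of two Kronecker products:

```python
import numpy as np

def kron_loglike(K_wl, K_t, sigma2, R):
    """Log-likelihood of vec(R) ~ N(0, K_wl ⊗ K_t + sigma2*I) for
    residuals R of shape (M, N): M wavelength channels by N exposures."""
    wa, Ua = np.linalg.eigh(K_wl)       # O(M^3) eigendecomposition
    wb, Ub = np.linalg.eigh(K_t)        # O(N^3) eigendecomposition
    lam = np.outer(wa, wb) + sigma2     # eigenvalues of the full MN x MN covariance
    Rt = Ua.T @ R @ Ub                  # residuals rotated into the eigenbasis
    quad = np.sum(Rt**2 / lam)          # r^T K^{-1} r
    logdet = np.sum(np.log(lam))        # log|K|
    return -0.5 * (quad + logdet + R.size * np.log(2.0 * np.pi))
```

Because the eigendecompositions act on the M × M and N × N factors separately, the cost is O(M³ + N³) plus the cost of the rotations, rather than the O(M³N³) of Cholesky factorisation on the full covariance matrix.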
Figure 11 shows the results of this runtime comparison performed on a quad-core Intel Core i5-6500 CPU at 3.20 GHz. Calculations where MN ≥ 32768 were avoided for the Cholesky factorisation as they required ≥8 GB of memory to store the MN × MN covariance matrix. This was not an issue for our new method, as it only needs to store separate time and wavelength covariance matrices.
The Kronecker product sum method has a runtime scaling of O(M^{3} + N^{3}) due to the expensive eigendecomposition step, with the effect that when one dimension is much longer than the other it tends to act as the computational bottleneck. The 600B and 600RI datasets analysed had dimensions of (M = 16, N = 262) and (M = 22, N = 319) respectively, so the number of time points far exceeded the number of wavelength channels chosen. As can be seen in the central plot of Fig. 11, increasing the number of wavelength channels would have little effect on the log-likelihood calculation runtime due to the eigendecomposition of the time covariance matrix dominating the runtime when N ≈ 256. There were, however, seven wavelength-dependent parameters that each light curve was individually fit for, so the convergence time of the MCMC would increase as the number of parameters increased. Convergence time with No U-Turn Sampling has been shown to scale as O(d^{5/4}) for d parameters under certain assumptions (Neal 2011).
As each MCMC chain is run independently with NUTS, parallelisation is trivial to implement. The same JAX code can also be run on a GPU, which could significantly improve performance. Utilising parallelisation and GPU acceleration could permit this method to be scaled to larger datasets from JWST.
Fig. 11 Comparison of log-likelihood calculation runtime (in seconds on a logarithmic scale) between the general approach of Cholesky factorisation (left) and our method (centre), as well as the relative performance improvement (right). Cholesky factorisation was not calculated when MN ≥ 32768 (shown in white) due to memory limitations. 
5 Discussion
The reanalysis of the VLT/FORS2 data confirms that systematics that vary in wavelength can be present in real observations and can be mistaken for common-mode systematics. As demonstrated in Sect. 3.9, this can significantly impact atmospheric retrievals, and so we recommend a principled method to account for them using 2D GPs. Both the 600B and 600RI datasets contained high-frequency systematics that were found to gradually vary in wavelength. The reanalysis of the 600RI data also revealed that time-correlated systematics previously assumed to be common-mode were strongly constrained to vary in wavelength. Accounting for this led to a weaker constraint on the scattering slope but produced tighter constraints on the Na and K features, consistent with the results of Sects. 3.8 and 3.9.
Multiple 2D GP analyses were performed, either with a different choice of priors or with different parameters being fit. For the 600B data, each of these analyses was consistent with Rayleigh scattering and no strong evidence for Na was found (<2.3σ for all analyses). The results of the 600RI analyses were less consistent and show that different assumptions made when fitting these data can significantly affect the retrieved radius and slope of the transmission spectrum, although evidence for sodium or potassium always remained below 2σ. This means that there is still a conflict over the presence of potassium between the VLT/FORS2 data and the original analysis of the HST observations from Sing et al. (2015). We leave the reanalysis of the HST data using our new method for future work.
We have demonstrated that our method can avoid the need for a common-mode correction, which avoids artificially reducing uncertainty in the offset of the transmission spectrum. This is even more important when considering eclipse spectra, where the overall offset is critical to the interpretation of the spectrum (e.g. Evans et al. 2013; Bell et al. 2017). Joint-fitting spectroscopic light curves also permitted us to marginalise over uncertainty in the central transit time, system scale and impact parameter – helping to improve the rigour of transmission spectroscopy. Approximating the posterior of the transmission spectrum for atmospheric retrievals can also be avoided if atmospheric forward models are fit directly to the data.
One disadvantage of assuming noise follows a Gaussian process is that it makes the analysis very sensitive to outliers. Student-t processes have been applied to transmission spectroscopy as they are less sensitive to outliers (Wilson et al. 2021). The optimisations presented in this work could also permit the application of 2D Student-t processes to transmission spectroscopy, as both methods have similar likelihood functions.
Our method could be used to account for wavelength-correlated systematics in JWST observations. For example, NIRCam observations are contaminated by correlated 1/f readout noise that is read parallel to the spectral trace (in the dispersion direction) on the detector, introducing wavelength-correlated systematics that are visible in the residuals of the light curves (Ahrer et al. 2023). In addition, wavelength-correlated systematics have been identified in NIRISS/SOSS observations (Holmberg & Madhusudhan 2023). Scaling our method to the larger datasets typical of JWST observations could be enabled by utilising parallelisation and GPUs.
In addition, there has been recent work attempting to fit low-resolution ground-based transmission spectroscopy datasets without the use of a comparison star (Panwar et al. 2022; Spyratos et al. 2023). Combining 2D GPs with this approach could be beneficial because the wavelength-dependent effects of atmospheric extinction are no longer accounted for with a comparison star.
GPs have a broad variety of uses within astronomy (Aigrain & Foreman-Mackey 2023), and astronomical data often include noise that is correlated in two dimensions (i.e. x and y pixel positions on a detector), so there could be many other useful applications for this method. Our optimisation should have significant performance advantages over other 2D GP optimisations (e.g. Gordon et al. 2020; Delisle et al. 2022) when there are non-trivial correlations across both input dimensions and neither dimension is very large. This could benefit fields that already use these other GP optimisations (e.g. exoplanet detection). GPs have also been proposed for use in high-resolution time series spectroscopy (e.g. Meech et al. 2022); our method may enable much faster application to large datasets. Interpolation of 2D datasets can also be performed efficiently with this method (see Appendix A), which could have very broad applications. For example, 2D GPs have already been used to interpolate supernova light curves (e.g. Fakhouri et al. 2015; Boone 2019), which this method might help optimise.
6 Conclusions
We have developed a new method that can reliably recover transmission spectra in the presence of both time- and wavelength-correlated systematics, tested it on a range of simulated data to show its advantages over current approaches, and demonstrated it on real data of WASP-31b observed using VLT/FORS2, despite the presence of large amplitude systematics. To summarise our conclusions:
We developed a 2D GP framework exploiting the Kronecker product structure of the kernel for data described on a (potentially non-uniform) 2D grid;
Our method is capable of handling correlations in time and wavelength when extracting the transmission spectra, enabling us to recover the covariance matrix of the spectra, which in turn can be used for atmospheric retrievals;
Simulations show that our method provides robust retrievals in the presence of wavelengthcorrelated systematics, whereas these same systematics can cause 1D GPs to overestimate uncertainties on sharp spectral features and recover erroneous scattering slopes;
Previous detections of extreme scattering slopes that are difficult to explain physically could be explained by unaccounted-for wavelength-correlated systematics;
Both VLT/FORS2 datasets analysed were found to contain wavelength-correlated systematics – contradicting the assumptions of their original analysis. Our method recovered significantly weaker constraints on the scattering slope from the 600RI data but tighter constraints on sodium for both datasets, as well as on potassium for the 600RI data;
Our method removes the need for the common-mode correction – which normally introduces arbitrary offsets to the transmission spectra – making the comparison between independent transit observations simpler;
Other benefits of joint-fitting spectroscopic light curves include the ability to marginalise over the uncertainty in the central transit time T_{0}, system scale a/R_{*} and impact parameter b. The simulations also demonstrated that sharing hyperparameters between light curves may result in more robust retrievals.
Overall, our method presents several clear advantages over the use of 1D GPs with the potential to be applied to a wide variety of datasets in transmission and emission spectroscopy.
Acknowledgements
We thank Soichiro Hattori for careful reading of the manuscript and providing helpful comments. We thank the team who led the observations analysed in this work (PI: Nikolov). The observations are publicly available in the ESO Science Archive Facility (http://archive.eso.org) under ESO programme 096.C-0765. Many of the calculations in this work were performed on the astro01 system maintained by the Trinity Centre for High Performance Computing (Research IT). This system is co-funded by the School of Physics and by the Irish Research Council grant award IRCLA/2022/3788. This research has made use of the NASA Exoplanet Archive, which is operated by the California Institute of Technology, under contract with the National Aeronautics and Space Administration under the Exoplanet Exploration Program. We are grateful to the developers of NUMPY, SCIPY, MATPLOTLIB, ASTROPY, PANDAS, IPYTHON, BATMAN, ARVIZ, CORNER, PYLDTK, JAX, PyMC, JAXOPLANET and EXOPLANET as these packages were used extensively in this work (Harris et al. 2020; Virtanen et al. 2020; Hunter 2007; Astropy Collaboration 2022; Pandas development team 2020; Pérez & Granger 2007; Kreidberg 2015; Kumar et al. 2019; Foreman-Mackey 2016; Parviainen & Aigrain 2015; Bradbury et al. 2018; Salvatier et al. 2016; Foreman-Mackey & Garcia 2023; Foreman-Mackey et al. 2021).
Appendix A Gaussian process regression
This method can be used to efficiently perform Gaussian process regression (as demonstrated in Rakitsch et al. 2013), although the full predictive covariance matrix can be expensive to compute. The predicted mean and covariance of the GP conditioned on the data were used in this work to help visualise the fit of the systematics (Fig. 8) and also to clip outliers and replace with interpolated values (see Sec. 4.3).
Suppose we have a set of observations y ~ 𝒩(μ(t, λ, θ), K(t, λ, θ)), where we are using the best-fit values for all parameters θ. We can compute the predictive distribution for points y_{*} – which are located on a grid of λ_{*} wavelengths and t_{*} times – using:

E[y_{*}] = μ_{*} + K_{*}^{T}K^{−1}(y − μ), (A.1)

Var[y_{*}] = K_{**} − K_{*}^{T}K^{−1}K_{*}, (A.2)
where μ_{*} is the mean function computed at the locations of y_{*}, K_{*} is the covariance between y and y_{*}, and K_{**} is the covariance of y_{*} with itself.
Calculating E[y_{*}] is straightforward, as we can break down the calculation of K^{−1}r using Eq. (26), and K_{*} is in general a sum of two Kronecker products similar to K, so we can use Eq. (14) to perform this matrix-vector multiplication.
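As a concrete sketch (ours, for the single-Kronecker special case K = K_wl ⊗ K_t + σ²I rather than the full two-term kernel), the predictive mean can be computed without ever forming the full MN × MN covariance matrix:

```python
import numpy as np

def kron_predict_mean(K_wl, K_t, sigma2, R, Ks_wl, Ks_t):
    """Predictive GP mean on a new (M*, N*) grid for K = K_wl ⊗ K_t + sigma2*I.
    R is the (M, N) residual grid y - mu; Ks_wl (M*, M) and Ks_t (N*, N)
    are the cross-covariances in wavelength and time."""
    wa, Ua = np.linalg.eigh(K_wl)
    wb, Ub = np.linalg.eigh(K_t)
    lam = np.outer(wa, wb) + sigma2
    alpha = Ua @ ((Ua.T @ R @ Ub) / lam) @ Ub.T   # K^{-1} r, kept in (M, N) shape
    return Ks_wl @ alpha @ Ks_t.T                 # Kronecker matrix-vector product
```

The result is the predictive mean reshaped onto the (M*, N*) prediction grid; adding the mean function evaluated on that grid recovers E[y_{*}].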
Calculating all of the terms in the covariance matrix Var[y_{*}] would be computationally expensive and memory-intensive, as it would require the calculation and storage of M_{*}N_{*} × M_{*}N_{*} terms. No efficient method was found to take random draws from the predictive distribution, as this would likely require the calculation of the full covariance matrix Var[y_{*}], followed by Cholesky factorisation or eigendecomposition of that matrix. However, if we are only interested in calculating the diagonal terms of the covariance matrix (such as for identifying outliers), then an optimisation was found which does not require the storage of any M_{*}N_{*} × M_{*}N_{*} matrices in memory. This method has no calculations that scale worse in runtime than the cube of any of the dimensions M, N, M_{*} or N_{*}. The details of this algorithm have been omitted for brevity but the implementation is available on the LUAS GitHub repository.
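For the single-Kronecker special case, one way such an optimisation can work is shown below; this is our own illustration and not necessarily the algorithm used in LUAS. The diagonal of K_{*}^{T}K^{−1}K_{*} factorises into squared projections onto the eigenbasis, so no M_{*}N_{*} × M_{*}N_{*} matrix is ever formed:

```python
import numpy as np

def kron_predict_var_diag(K_wl, K_t, sigma2, Ks_wl, Ks_t, Kss_wl, Kss_t):
    """Diagonal of the predictive covariance for K = K_wl ⊗ K_t + sigma2*I.
    Kss_wl, Kss_t are the prior covariances on the prediction grid
    (the noise-free GP is predicted here, so no sigma2 term in K_**)."""
    wa, Ua = np.linalg.eigh(K_wl)
    wb, Ub = np.linalg.eigh(K_t)
    inv_lam = 1.0 / (np.outer(wa, wb) + sigma2)
    P = Ks_wl @ Ua                           # (M*, M) projections in wavelength
    Q = Ks_t @ Ub                            # (N*, N) projections in time
    explained = (P**2) @ inv_lam @ (Q**2).T  # diag of K_*^T K^{-1} K_*
    prior_diag = np.outer(np.diag(Kss_wl), np.diag(Kss_t))
    return prior_diag - explained            # shape (M*, N*)
```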
Appendix B Error propagation with a covariance matrix
While it is well known how to propagate uncertainties when they are independent, propagating uncertainties given a general covariance matrix is less well known. It is included here for reference.
Given a vector x that follows a multivariate normal distribution x ~ 𝒩(μ_{x}, Σ_{x}) and values y we wish to calculate that are a linear transformation of x (i.e. y = Tx for some matrix T), then y follows a normal distribution y ~ 𝒩(μ_{y}, Σ_{y}), and we can calculate μ_{y} and Σ_{y} using the following equations (see Brandt 2014 for derivations):

μ_{y} = Tμ_{x}, (B.1)

Σ_{y} = TΣ_{x}T^{T}. (B.2)
For example, to compute the average transit depth given a transmission spectrum of M transit depths ρ^{2} with covariance matrix Σ_{x}, we must take the mean of ρ^{2}. We can do this by taking T = (1/M)1^{T}, where 1 is an M-long vector in which every element is equal to one. If the covariance matrix Σ_{x} is diagonal with constant variance σ^{2}, then this reduces to the familiar form for the standard error of the mean.
This can also be used to bin a transmission spectrum to larger bin sizes. For example, if binning each pair of light curves together, the vector T = (1/2, 1/2, 0, …, 0) could be used to propagate the uncertainty for binning the first pair of bins. The covariance matrix of the full (binned) transmission spectrum could be obtained by using the (M/2) × M matrix whose ith row contains 1/2 in columns 2i − 1 and 2i and zeros elsewhere (assuming an even number of bins) in Equations B.1 and B.2.
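The steps above can be sketched in a few lines; T_mean and T_bin below are the two example transformations (averaging all M depths and binning pairs of depths):

```python
import numpy as np

def propagate(T, mu_x, cov_x):
    """Propagate a multivariate normal x ~ N(mu_x, cov_x) through y = T x
    (Eqs. B.1 and B.2): returns mu_y = T mu_x and cov_y = T cov_x T^T."""
    return T @ mu_x, T @ cov_x @ T.T

M = 4  # example: 4 transit depths

# Averaging all depths: T = (1/M) * ones, reduces to the standard error of the mean
T_mean = np.full((1, M), 1.0 / M)

# Binning pairs of depths: row i averages depths 2i and 2i+1
T_bin = np.zeros((M // 2, M))
for i in range(M // 2):
    T_bin[i, 2 * i : 2 * i + 2] = 0.5
```

With a diagonal covariance σ²I, the averaged depth recovers the familiar variance σ²/M, while off-diagonal terms in Σ_{x} would correctly inflate (or deflate) the binned uncertainties.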
Appendix C Priors for the simulations and the VLT/FORS2 analyses
Priors used for the simulations are included in Table C.1. These same priors were used for all methods where the parameter is relevant; that is, the 1D GP method did not use the parameter l_{λ} but used the same priors on the other parameters.
The default priors used for both VLT/FORS2 datasets are given in Table C.2. The central transit time priors are given relative to the predicted central transit times from Patel & Espinoza (2022). Where analyses in Sect. 4.4 diverged from these default values, this was explicitly stated. Where parameters are shared between the 1D GP reanalysis and the 2D GP analyses, the same priors were used for both. The differences for the 1D GP reanalysis are that: it used the mean radius ratio prior during the white light curve fit instead of during the spectroscopic fit; the high-frequency systematics were recovered from the residuals of the white light curve fit and therefore h_{HFS} was never fit for; and finally, the wavelength length scales were not fit for.
We note that while prior bounds were included for all uniform and log-uniform priors, the only parameters that had MCMC values close to these bounds were one parameter for the 600B data and the wavelength-specific systematics height scale h_{WSS} for many of the wavelength channels (it was fit independently for each wavelength). The implications for the former were discussed in Sect. 4.4.1, while for h_{WSS} (which could sometimes be consistent with the minimum prior bound) this suggests that some wavelength channels were consistent with having no wavelength-independent systematics present in them at all. The choice of minimum prior bound could therefore have affected the results, with lower minimum prior bounds more strongly weighting the probability of negligible systematics being present and higher minimum prior bounds forcing the probability of non-negligible systematics to be higher. The rest of the parameters were constrained within the prior bounds (see the corner plots for some of these parameters in Figs. L.5 and L.6) and were therefore likely unaffected by them, other than by the choice of a uniform prior (for most mean function parameters) or a log-uniform prior (for hyperparameters).
Gaussian priors were also taken from Sing et al. (2015) for the mean radius ratio, a/R_{*} and b – as also used in Gibson et al. (2017). Gaussian priors on the limb-darkening parameters c_{1} and c_{2} were calculated using PyLDTk, as discussed in Sect. 4.3.
For the direct atmospheric retrievals in Sect. 4.4.4, the Gaussian prior on the mean radius ratio was still used and the remaining parameters (m, Δρ_{Na}, Δρ_{K}) were varied uniformly, with the constraint that the radius ratio was positive for all wavelengths.
Priors used when fitting parameters in Simulations 1–4.
Priors used for fitting the 600B and 600RI VLT/FORS2 datasets.
Appendix D MCMC sampling for the simulations
We initially found that some simulations would not be efficiently sampled by varying all parameters together in the same No U-Turn Sampling (NUTS) step. As a result, blocked Gibbs sampling was used to speed up convergence. The approach used for all the simulations was to vary all transit depths simultaneously with NUTS in one blocked Gibbs sampling step, followed by varying the hyperparameters together in one or more blocked Gibbs sampling steps. Separately varying the transit depths and the hyperparameters would not be efficient if significant correlations existed between them, but for the simulations this does not appear to have been a significant issue. For real datasets such as the VLT/FORS2 data, the approach used was different because the correlations between the parameters and hyperparameters were found to be more significant.
Work performed after the simulations has revealed that the improved efficiency when varying parameters separately was likely caused by the mass matrix we generated from a Laplace approximation not accounting for the transformations performed by PyMC to keep parameters within the desired prior bounds. As different parameters can have significantly different prior bounds, this likely had a significant impact on sampling efficiency, and future updates to the LUAS package may be able to account for this without using blocked Gibbs sampling. However, blocked Gibbs sampling is a valid MCMC method (Jensen et al. 1995) and the choice of mass matrix should only affect the sampling efficiency, not whether the sampling is valid. We present in this section the procedure used for the simulations in this work, noting that it may be improved upon in future work.
A Laplace approximation was used to generate the mass matrix for each blocked Gibbs step of NUTS. However, it was found that parameters being fit close to a prior limit would often converge slowly with NUTS, likely because of the unaccounted-for transformations performed by PyMC and also because they were poorly approximated by a Gaussian (since a Laplace approximation works by approximating the posterior as Gaussian). For the simulations performed in this work, only the wavelength and time length scales l_{λ} and l_{t} were occasionally fit close to their prior limits, so these were the parameters most affected. To deal with this issue, if the best-fit values of l_{t} and l_{λ} were identified to be outside of a certain range (given in Table D.1), they were instead sampled in a separate Gibbs step using slice sampling (see Neal (2003) for an explanation of slice sampling). Slice sampling draws samples from under the area of the probability density by alternating between vertical and horizontal ‘slices’. Univariate slice sampling – as implemented in PyMC – was often found to converge faster than NUTS for these length scale parameters when they were close to prior limits. Future work may be done to avoid the need for slice sampling and improve sampling efficiency. However, blocked Gibbs sampling between NUTS and slice sampling is valid^{11} and the results of the simulations demonstrate that the retrievals using this inference approach were generally consistent with the true simulated values.
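For reference, a minimal univariate slice sampler using Neal's (2003) stepping-out and shrinkage procedures might look as follows (a sketch for illustration, not the PyMC implementation):

```python
import numpy as np

def slice_sample_1d(logp, x0, w=1.0, rng=None, max_stepout=50):
    """One update of univariate slice sampling (Neal 2003): step out to
    bracket the slice at a random height, then shrink until a point
    inside the slice is drawn."""
    rng = np.random.default_rng() if rng is None else rng
    log_y = logp(x0) + np.log(rng.uniform())  # auxiliary height defining the slice
    left = x0 - w * rng.uniform()             # randomly position initial interval
    right = left + w
    steps = max_stepout
    while steps > 0 and logp(left) > log_y:   # step out until the ends leave the slice
        left -= w
        steps -= 1
    steps = max_stepout
    while steps > 0 and logp(right) > log_y:
        right += w
        steps -= 1
    while True:                               # shrinkage: reject and shrink towards x0
        x1 = rng.uniform(left, right)
        if logp(x1) > log_y:
            return x1
        if x1 < x0:
            left = x1
        else:
            right = x1
```

Unlike NUTS, this requires no gradients and no step-size tuning beyond the width w, which is why it can behave better for parameters pushed against a prior bound.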
Table D.1 summarises the 2D GP sampling steps used depending on the best-fit values of the hyperparameters. All hyperparameters were drawn from log-uniform distributions within the ranges listed. NUTS 1 and NUTS 2 refer to two different blocked Gibbs steps that use NUTS; that is, all parameters being varied in step ‘NUTS 2’ are varied together. The transit depth parameters were always varied together in a single step of NUTS. If the best-fit values of l_{t} and/or l_{λ} prior to running the MCMC were not within the ranges listed, then they were independently varied with univariate slice sampling in additional Gibbs sampling steps (resulting in a maximum of four steps of Gibbs sampling). Both the Hybrid and 2D GP methods followed this description, with the exception that the Hybrid method did not have a wavelength length scale.
Overview of MCMC blocked Gibbs sampling steps for Simulations 1–4.
Appendix E MCMC sampling for the VLT/FORS2 analyses
Fitting both VLT/FORS2 datasets with 2D GPs required fitting over 100 parameters with an MCMC. The approach to do this was the same for both datasets. It was similar to the method followed for the simulations, although many more parameters were being fit in this case, with more parameters combined into the same blocked Gibbs steps. It was found that adjusting the mass matrix from the Laplace approximation by scaling the rows and columns relating to certain parameters could improve sampling efficiency. It was not known at the time why this was the case, but the transformations by PyMC to keep parameters within certain prior bounds are likely the cause. We could still account for this by running NUTS on each group of parameters (i.e. ρ^{2}, F_{oot}, T_{grad}, etc.) individually and using dual-averaging (as described in Hoffman & Gelman 2014 and implemented in PyMC) to optimise the step size ε for each group of parameters. This was performed for each group of parameters for 200 steps with a single chain of NUTS, which was long enough for the step sizes to have converged near to ‘optimal’ values. The approximate covariance matrix of the posterior obtained using the Laplace approximation was scaled using these values, which significantly improved sampling efficiency, likely because it accounted for the different transformations applied to each group of parameters. This procedure allowed all the mean function parameters, the common-mode height scale h_{CM} and the length scale in time l_{t} to be varied within a single blocked Gibbs step of No U-Turn Sampling (112 parameters for the 600RI data). A simpler procedure identified later would have been to take the Laplace approximation with respect to the log-posterior calculated from the transformed parameters which PyMC uses. However, as this only affects the choice of mass matrix used to tune NUTS, it should not impact the validity of our results but has allowed us to improve sampling efficiency and reduce runtimes.
Similar to what was discussed in Appendix D, for parameters close to a prior limit it was often more efficient to sample them independently using slice sampling rather than NUTS. This was found to be the case for the wavelength length scale l_{λ} and for some of the wavelength-specific systematic height scales h_{WSS}. In particular, h_{WSS} values which were largely consistent with zero and had probabilities that dropped off sharply for larger values could be efficiently sampled with NUTS, and values that were constrained to be larger than zero and had approximately Gaussian distributions were also typically sampled well with NUTS. However, parameters that were both consistent with zero and had significant probability mass larger than zero appeared to have long correlation lengths with NUTS and were sampled more efficiently with slice sampling (e.g. see the posterior of h_{WSS;5743Å} in Fig. L.5). Since it was difficult before running an MCMC to distinguish between the parameters that were constrained to be greater than zero and the parameters that were better sampled with slice sampling, all of the larger h_{WSS} parameters were fit with slice sampling. The threshold chosen was that if the best-fit value was > 5 × 10^{−5} then the parameter would be fit with slice sampling. This is not a particularly efficient method but it avoided hand-picking particular terms and was still faster to converge than using NUTS for these parameters. For larger datasets – such as from JWST – a more efficient approach may need to be found if a similar kernel function is used.
Overview of MCMC blocked Gibbs sampling steps for all 2D GP analyses of both VLT/FORS2 datasets.
The remaining parameters that were below the 5 × 10^{−5} threshold were all fit together with a single blocked Gibbs step of NUTS. However, using a mass matrix calculated from a Laplace approximation was found to be slow for these parameters, so the mass matrix was instead automatically adapted by PyMC during the burn-in phase for these parameters.
Finally, the white noise parameters and high-frequency systematics parameters were sampled together in a separate blocked Gibbs sampling step with a mass matrix taken from the Laplace approximation. Table E.1 summarises which parameters were sampled in which blocked Gibbs sampling steps. For the other 2D GP analyses performed, any additional mean function parameters (such as T_{0}, a/R_{*}, b or the atmospheric parameters in the direct retrieval) were always varied in the first blocked Gibbs sampling step with NUTS.
Appendix F Examining the choice of kernel for the VLT/FORS2 data using autocorrelation
A useful approach to help visualise correlated noise in the VLT/FORS2 data is to calculate the autocorrelation of the residuals. The autocorrelation of a signal is the correlation of the signal with an offset copy of itself, and it can be used to identify correlations in the signal. In this case, we want to identify correlations in the noise of the data, so our signal is the flux observations minus our best-fit mean function. We consider correlations in two dimensions – both in time and in wavelength.
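A 2D autocorrelation of a residual grid can be computed efficiently via the FFT; the sketch below (our illustration, with zero-padding to avoid circular wrap-around) normalises so that the zero-offset value equals one:

```python
import numpy as np

def autocorr_2d(residuals):
    """2D autocorrelation (wavelength offset by time offset) of a
    residual grid, computed via the FFT (Wiener-Khinchin theorem)."""
    r = residuals - residuals.mean()
    M, N = r.shape
    f = np.fft.fft2(r, s=(2 * M - 1, 2 * N - 1))  # zero-pad: linear, not circular
    ac = np.fft.ifft2(f * np.conj(f)).real        # power spectrum -> autocorrelation
    ac = np.fft.fftshift(ac)                      # place zero offset at the centre
    return ac / ac[M - 1, N - 1]                  # normalise by the zero-offset value
```

The zero-offset pixel dominates such plots, which is why it is set to zero for display in the figures discussed below.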
The leftmost plot in Fig. F.1 shows the autocorrelation of the data minus the best-fit transit mean function. This should look approximately like a heat map of the kernel function. Note that the autocorrelation at zero offset has been set to zero in all these plots to increase the contrast for the rest of the values.
The second plot shows the autocorrelation of the residuals after subtraction of the best-fit GP mean if we set h_{HFS} = 0. This visualises the approximate correlation in the high-frequency systematics – which appears as a vertical line. The correlation appears to reduce with increasing wavelength separation, suggesting a finite wavelength length scale. The third plot is similar to the second but instead has all the wavelength-specific systematics height scales set to zero. If our assumption that these systematics are fully wavelength-independent is true, then we should only expect to find correlations at zero wavelength separation (i.e. a horizontal line in the middle of the plot), with other wavelength separations being independent of each other.
Fig. F.1 Autocorrelation plots of residuals minus the GP mean if (from left to right): no correlated noise is fit, no high-frequency systematics are fit, no wavelength-specific systematics are fit, or the full kernel in Eq. (30) is fit.
In the rightmost plot, we see the autocorrelation of the residuals after subtraction of the best-fit GP mean including all the terms in our kernel function. The resulting plot shows almost no correlation, with the exception of an apparent purple cross in the middle of the plot, implying some residual correlation from both high-frequency (wavelength-correlated but time-independent) systematics and wavelength-specific (wavelength-independent but time-correlated) systematics. This could suggest that the squared-exponential kernel used for these systematics was not the ideal shape to account for them, or perhaps that the best fit did not converge to the optimal values. It could also be due to restrictions placed on the kernel, such as forcing the wavelength-specific systematics to share the same time length scale as the common-mode systematics. However, the remaining correlation is low in magnitude compared to the other autocorrelation plots, so it is possible that it has a negligible impact on the resulting transmission spectrum.
Appendix G Tables of results for Simulations 1–4
The accuracy metrics from Sec. 3.5 calculated for Simulations 1–4 are included in this section, with Table G.1 explaining how to read and interpret each table. An interpretation of the results of each table is included here for convenience, though it repeats many of the points in Sec. 3.
Table G.2 presents the results of Simulation 1, which contains wavelength-independent systematics, tested on three methods. First, 1D GPs fitting each spectroscopic light curve separately were tested but did not produce reliable constraints, as the mean and variance of the Z-score for each atmospheric parameter are all larger than one. This suggests that the 1D GPs underestimate uncertainty on average, despite having the correct kernel to fit these systematics. This can be explained by the very flexible noise model, which fits a separate height scale, length scale and white noise amplitude for each spectroscopic light curve. The Hybrid method shares the same hyperparameters for all light curves and performs reliably across all accuracy metrics measured, despite using the same kernel function as the 1D GPs. The 2D GP method is similar to the Hybrid method but also fits for a wavelength length scale, which is a more conservative approach as it does not assume a priori that systematics are wavelength-independent. This still results in reliable retrievals of all parameters measured, with no statistically significant increase in the mean parameter uncertainties, so the more conservative analysis does not incur a significant loss in precision (although there was a minor loss in precision on the transmission spectrum, as noted in Sec. 3.7).
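The Z-score reliability criterion used throughout these tables can be sketched in a few lines (the function name is our own; this is illustrative of the metric, not the paper's analysis code): for a well-calibrated method, (retrieved − injected)/σ should have mean ≈ 0 and variance ≈ 1, with variance > 1 indicating underestimated uncertainties and variance < 1 indicating overestimated ones.

```python
import numpy as np

def zscore_reliability(retrieved_mean, retrieved_std, injected):
    """Mean and variance of Z-scores of retrievals against injected truths.

    Mean ~0 means no bias; variance ~1 means well-calibrated
    uncertainties (variance > 1: underestimated, < 1: overestimated).
    """
    z = ((np.asarray(retrieved_mean) - np.asarray(injected))
         / np.asarray(retrieved_std))
    return z.mean(), z.var(ddof=1)

rng = np.random.default_rng(1)
truth = np.zeros(500)
sigma = np.full(500, 0.05)
est = truth + sigma * rng.normal(size=500)   # well-calibrated case
m, v = zscore_reliability(est, sigma, truth)
```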
Table G.3 shows the results of Simulation 2, which contains wavelength-correlated systematics with a relatively short wavelength length scale (300 Å < l_{λ} < 2250 Å). The same three methods were run for Simulation 2 as for Simulation 1, demonstrating that the 1D GP and Hybrid methods, which assume wavelength-independent systematics, perform very unreliably when systematics are correlated in wavelength. They underestimate the uncertainty in the radius ratio and slope (Z-score variances > 1) and overestimate the uncertainty in the strength of the K feature (Z-score variance < 1). The KS statistics show that the distribution of the samples does not match the correct distribution, meaning the Hybrid method does not reliably retrieve the transmission spectrum even though the mean chi-squared value is consistent with one. The 2D GP method performs reliably across all accuracy metrics, showing that our method is robust at accounting for wavelength-correlated systematics within this range of wavelength length scales (although the sample variance of the values is slightly too high, which could suggest the presence of a few outliers). The uncertainty in the strength of the K feature is both more reliable and much smaller than for the 1D GP and Hybrid methods, lending support to our result from the VLT/FORS2 analyses, in which all 2D GP analyses produced tighter constraints on Na and K than the 1D GP analyses.
Table G.4 shows the results of Simulation 3, which contains systematics with longer wavelength length scales (2250 Å < l_{λ} < 36000 Å). While the 2D GP method was identical to that in Simulations 1 and 2, here we investigated how the assumption that systematics are constant in wavelength performs when the systematics actually vary gradually in wavelength. We tested our 2D GP method against the 1D GP method with an initial common-mode (CM) correction, as well as against the same 2D GP method but with the wavelength length scale fixed to a large value, which is close to assuming the systematics are common-mode (labelled 2D GP (CM) in the table, although note this method does not perform a common-mode correction). Both the 1D GP (CM) and 2D GP (CM) methods were unreliable across almost all metrics, whereas the 2D GP that fit for the wavelength length scale produced robust retrievals across all metrics (the variance in the slope Z-scores may be slightly too high, although this is only a 2.6σ result). The asterisk on the 1D GP (CM) values indicates that they were calculated after subtracting the offset of the recovered transmission spectrum relative to the injected spectrum, because this offset does not have a large effect on atmospheric retrievals but was significantly affecting the values. This offset results from the common-mode correction and so did not affect the other two methods, for which no offset correction was performed.
Explanation of the tables that report the results for each set of simulations.
Results of Simulation 1 containing wavelength-independent systematics.
Results of Simulation 2 with short wavelength length scale (300 Å < l_{λ} < 2250 Å) systematics.
Results of Simulation 3 containing long wavelength length scale (2250 Å < l_{λ} < 36000 Å) systematics.
Results of Simulation 4 containing common-mode systematics.
Table G.5 shows the results of Simulation 4, which contains common-mode systematics. The same methods as for Simulation 3 were tested, although the assumption that the systematics are common-mode is correct for these simulations, so both the 1D GP (CM) and 2D GP (CM) methods perform much more reliably. The 1D GP (CM) method is slightly more conservative than the 2D GP (CM) method in constraining both the slope and the strength of the K feature; this may be because the 1D GPs still account for wavelength-independent systematics in each spectroscopic light curve, while the 2D GP (CM) method assumes that only common-mode systematics and white noise are present in the data. The 2D GP method that fits for a wavelength length scale performs reliably on all metrics except the constraint on the slope, where it overestimates the uncertainties on average. This is likely due to the challenge of constraining systematics to be fully common-mode, as the 2D GP (CM) method is much better at constraining the slope. However, since for real data we do not know whether systematics are constant or gradually varying in wavelength, the safest approach is to fit for the wavelength length scale and accept some loss of precision in the slope of the transmission spectrum when the systematics are common-mode.
Appendix H Gradient calculations
No U-Turn Sampling requires computing the gradient of the log-likelihood with respect to each parameter being varied. One of the benefits of JAX is that it can compute the gradient of functions with minimal modification to NumPy code. Using jaxoplanet in combination with implementing the algorithms of Sec. 2.5 in JAX was sufficient to compute the gradient of the log-likelihood with respect to any of the mean function parameters efficiently and accurately.
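As a minimal sketch of this workflow (using a generic 1D squared-exponential kernel as a stand-in for the paper's 2D kernel; the function and variable names are our own), `jax.grad` differentiates a hand-written GP log-likelihood directly:

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)  # double precision for stability

def neg_log_likelihood(params, t, y):
    """Negative GP log-likelihood with a squared-exponential kernel
    plus white noise, written in jax.numpy so JAX can differentiate it."""
    h, l, sigma = params
    dt = t[:, None] - t[None, :]
    K = h**2 * jnp.exp(-0.5 * (dt / l) ** 2) + sigma**2 * jnp.eye(t.size)
    L = jnp.linalg.cholesky(K)
    alpha = jax.scipy.linalg.cho_solve((L, True), y)
    logdet = 2.0 * jnp.sum(jnp.log(jnp.diag(L)))
    return 0.5 * (y @ alpha + logdet + t.size * jnp.log(2.0 * jnp.pi))

grad_fn = jax.grad(neg_log_likelihood)  # gradient w.r.t. params

t = jnp.linspace(0.0, 1.0, 50)
y = jnp.sin(6.0 * t)
g = grad_fn(jnp.array([1.0, 0.3, 0.1]), t, y)
```

For well-conditioned kernels like this one, the autodiff gradient agrees with finite differences; the text below describes where this default route breaks down for the 2D hyperparameters.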
However, these algorithms were not sufficient for computing the gradients with respect to many of the hyperparameters: the gradients JAX computed from these equations were inconsistent with finite-difference methods. The numerical stability of the eigendecomposition was likely the issue when computing hyperparameter gradients from these formulae.
Fortunately, it is possible to rewrite the gradient of the log-likelihood in a way that is much less sensitive to numerical stability issues than the default method used by JAX. A method for analytically calculating the gradient of the log-likelihood with respect to hyperparameters was demonstrated in Rakitsch et al. (2013). That method is included here, although the calculation of the derivative of the log-determinant of the covariance matrix has been changed. The method presented in Rakitsch et al. (2013) would require additional eigendecompositions of the matrices K_{λ} and K_{t}, and of the transformed matrices, in order to calculate the gradient with respect to all hyperparameters. It also has no way of obtaining the gradient of a hyperparameter that appears in both Kronecker product terms, such as in both K_{λ} and ∑_{λ}. The method presented here is therefore faster, more general, and more numerically stable.
Suppose θ is a hyperparameter, with r independent of θ (i.e. ∂r/∂θ = 0). We can find ∂ log ℒ/∂θ by calculating:

∂ log ℒ/∂θ = ∂/∂θ [−½ r^{T} K^{−1} r − ½ log|K|] (H.1)
= −½ r^{T} (∂K^{−1}/∂θ) r − ½ ∂ log|K|/∂θ, (H.2)

where we can use the following two identities from matrix calculus:

∂K^{−1}/∂θ = −K^{−1} (∂K/∂θ) K^{−1}, (H.3)
∂ log|K|/∂θ = tr(K^{−1} ∂K/∂θ). (H.4)
The advantage of these identities is that they allow us to rewrite the gradient of the log-likelihood in terms of ∂K/∂θ instead of ∂K^{−1}/∂θ or ∂ log|K|/∂θ (which JAX fails to calculate in a numerically stable way by default). This approach was found to give gradients of the log-likelihood that are consistent with finite-difference methods.
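The log-determinant identity above is easy to check numerically. The following sketch (with an arbitrary well-conditioned covariance depending linearly on a scalar hyperparameter, our own toy example) compares the identity against a central finite difference:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(6, 6))
base = A @ A.T + 6 * np.eye(6)   # a well-conditioned SPD matrix

def K_of(theta):
    # Toy covariance depending on a scalar hyperparameter theta
    return theta * base + np.eye(6)

theta, eps = 0.7, 1e-6
# Finite-difference derivative of log|K|
fd = (np.linalg.slogdet(K_of(theta + eps))[1]
      - np.linalg.slogdet(K_of(theta - eps))[1]) / (2 * eps)
# Identity: d log|K| / d theta = tr(K^{-1} dK/dtheta), with dK/dtheta = base
analytic = np.trace(np.linalg.solve(K_of(theta), base))
```

The two values agree to roughly finite-difference precision, which is the consistency check referred to in the text.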
Suppose θ is a parameter of K_{λ} and of none of the other matrices in K; then:

∂K/∂θ = ∂/∂θ [K_{λ} ⊗ K_{t} + ∑_{λ} ⊗ ∑_{t}] (H.5)
= (∂K_{λ}/∂θ) ⊗ K_{t}. (H.6)
This allows us to efficiently calculate the first term in Eq. (H.2) using the identity in Eq. (H.3):

−½ r^{T} (∂K^{−1}/∂θ) r = ½ r^{T} K^{−1} (∂K/∂θ) K^{−1} r (H.7)
= ½ α^{T} [(∂K_{λ}/∂θ) ⊗ K_{t}] α, (H.8)

where α = K^{−1} r can be efficiently solved using Eq. (26), and [(∂K_{λ}/∂θ) ⊗ K_{t}] α can be computed using Eq. (14) as it is of the form [A ⊗ B]c.
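Products of the form [A ⊗ B]c can be evaluated without ever forming the Kronecker product, via the standard identity [A ⊗ B] vec(X) = vec(A X B^{T}) (a sketch of the generic trick, not a transcription of Eq. 14, which is not reproduced in this appendix):

```python
import numpy as np

def kron_matvec(A, B, c):
    """Compute [A kron B] c without forming the Kronecker product.

    c is the row-major flattening of an (A.shape[1], B.shape[1])
    matrix X; the identity [A kron B] vec(X) = vec(A X B^T) reduces
    the cost from O((pr)(qs)) to O(pqs + prs)."""
    X = c.reshape(A.shape[1], B.shape[1])
    return (A @ X @ B.T).reshape(-1)

rng = np.random.default_rng(3)
A = rng.normal(size=(4, 4))
B = rng.normal(size=(5, 5))
c = rng.normal(size=20)
fast = kron_matvec(A, B, c)
slow = np.kron(A, B) @ c   # brute-force reference
```

The two results agree to machine precision while the fast route never allocates the 20×20 Kronecker matrix, which is the memory saving exploited throughout this appendix.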
The second term in Eq. (H.2) may be computed as follows. We use the identity in Eq. (H.4), but it is first helpful to expand the K^{−1} term using the factorisation described in Sec. 2.5:

K^{−1} = [Q_{λ} Λ_{λ}^{−1/2} ⊗ Q_{t} Λ_{t}^{−1/2}] [K̃_{λ} ⊗ K̃_{t} + I]^{−1} [Λ_{λ}^{−1/2} Q_{λ}^{T} ⊗ Λ_{t}^{−1/2} Q_{t}^{T}]. (H.9)

We then factorise the middle term similarly to Eq. (19) as:

[K̃_{λ} ⊗ K̃_{t} + I]^{−1} = [Q̃_{λ} ⊗ Q̃_{t}] [Λ̃_{λ} ⊗ Λ̃_{t} + I]^{−1} [Q̃_{λ}^{T} ⊗ Q̃_{t}^{T}]. (H.10)

Plugging this in and using the properties of Kronecker product algebra:

K^{−1} = [W_{λ} ⊗ W_{t}] D^{−1} [W_{λ}^{T} ⊗ W_{t}^{T}], (H.11)

where we have defined:

W_{λ} = Q_{λ} Λ_{λ}^{−1/2} Q̃_{λ}, (H.12)
W_{t} = Q_{t} Λ_{t}^{−1/2} Q̃_{t}, (H.13)
D = Λ̃_{λ} ⊗ Λ̃_{t} + I. (H.14)
We note that while it may look as if we have found the eigendecomposition of the covariance matrix K, the matrices W_{λ} and W_{t} can clearly be shown not to be eigenvector matrices, as they are not orthogonal (i.e. the product W_{λ}^{T} W_{λ} does not equal the identity matrix, unlike for an actual eigendecomposition). Similarly, the diagonal matrix D does not contain the eigenvalues of K, as can be seen by noting that the product of its diagonal entries would not in general equal the determinant given in Eq. (27). Nonetheless, it is a useful way of writing K^{−1}, which we can now plug into Eq. (H.4):

tr(K^{−1} ∂K/∂θ) = tr([W_{λ} ⊗ W_{t}] D^{−1} [W_{λ}^{T} ⊗ W_{t}^{T}] [(∂K_{λ}/∂θ) ⊗ K_{t}]). (H.15)
We can use the cyclic property of the trace:

tr(K^{−1} ∂K/∂θ) = tr(D^{−1} [W_{λ}^{T} ⊗ W_{t}^{T}] [(∂K_{λ}/∂θ) ⊗ K_{t}] [W_{λ} ⊗ W_{t}]). (H.16)

This allows us to simplify further:

tr(K^{−1} ∂K/∂θ) = tr(D^{−1} [W_{λ}^{T} (∂K_{λ}/∂θ) W_{λ} ⊗ W_{t}^{T} K_{t} W_{t}]). (H.17)

Finally, D^{−1} is a diagonal matrix and the trace depends only on the diagonal of the overall matrix, so we can equivalently write this expression as a dot product of the diagonals of the two matrices, making use of the fact that the diagonal of a Kronecker product matrix is the Kronecker product of the diagonals:

tr(K^{−1} ∂K/∂θ) = diag(D^{−1})^{T} [diag(W_{λ}^{T} (∂K_{λ}/∂θ) W_{λ}) ⊗ diag(W_{t}^{T} K_{t} W_{t})], (H.18)
where diag(A) denotes the vector formed from the diagonal of some matrix A.
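The diagonal dot-product shortcut in the final step can be verified numerically (a generic sketch with arbitrary small matrices P and S standing in for the bracketed factors, and d for the diagonal of D):

```python
import numpy as np

rng = np.random.default_rng(4)
P = rng.normal(size=(3, 3))
S = rng.normal(size=(4, 4))
d = rng.uniform(1.0, 2.0, size=12)   # diagonal of D (size 3 * 4)

# Full computation: trace of D^{-1} (P kron S)
full = np.trace(np.diag(1.0 / d) @ np.kron(P, S))
# Fast computation: dot product of diagonals, using
# diag(P kron S) = diag(P) kron diag(S)
fast = (1.0 / d) @ np.kron(np.diag(P), np.diag(S))
```

The fast route touches only the diagonals, so its cost is linear in the size of D rather than quadratic, which is what makes the trace term in the gradient cheap.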
Equations H.8 and H.18 permit the efficient calculation of the gradient of the log-likelihood with respect to a hyperparameter of K_{λ}. By taking advantage of the Kronecker product structure, the memory requirement is still 𝒪(M^{2} + N^{2}), and no additional eigendecompositions are required beyond those in the log-likelihood calculation. The mathematics follows similarly for the gradients of parameters of K_{t}, ∑_{λ} or ∑_{t}.
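The whole factorisation can be checked end-to-end on a small example. The sketch below follows the Rakitsch et al. (2013) construction as described above (the helper names and the notation for the eigendecompositions are our own, since Sec. 2.5 is not reproduced here): eigendecompose the ∑ matrices, transform and eigendecompose the K matrices, then assemble K^{−1} from W_{λ}, W_{t} and D.

```python
import numpy as np

def spd(n, seed):
    """Random symmetric positive-definite matrix for testing."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(n, n))
    return A @ A.T + n * np.eye(n)

M, N = 3, 4
K_lam, K_t = spd(M, 0), spd(N, 1)
S_lam, S_t = spd(M, 2), spd(N, 3)   # stand-ins for Sigma_lambda, Sigma_t
K = np.kron(K_lam, K_t) + np.kron(S_lam, S_t)

def transform(Kmat, Sigma):
    """Return W = Q Lam^{-1/2} Qtilde and the transformed eigenvalues."""
    lam, Q = np.linalg.eigh(Sigma)
    root_inv = Q * lam**-0.5              # Q @ diag(lam^{-1/2})
    Kt = root_inv.T @ Kmat @ root_inv     # transformed matrix
    lam2, Q2 = np.linalg.eigh(Kt)
    return root_inv @ Q2, lam2

W_lam, lt_lam = transform(K_lam, S_lam)
W_t, lt_t = transform(K_t, S_t)
D = np.kron(lt_lam, lt_t) + 1.0           # diagonal of D as a vector
# K^{-1} = (W_lam kron W_t) D^{-1} (W_lam kron W_t)^T
K_inv = (np.kron(W_lam, W_t) / D) @ np.kron(W_lam.T, W_t.T)
```

Multiplying `K_inv @ K` recovers the identity, while `W_lam.T @ W_lam` is visibly not the identity, matching the remark above that W_{λ} and W_{t} are not orthogonal eigenvector matrices.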
Appendix I Hessian calculations
Similar calculations to those in Appendix H can be performed to efficiently calculate the Hessian of the log-likelihood with respect to all of the mean function parameters and hyperparameters. This is implemented in the luas GitHub repository but is excluded here for brevity.
The benefit of computing the Hessian of the log-likelihood is that we can use it to calculate a Laplace approximation of the posterior. This is an analytic estimate of the posterior built from the best-fit values of all the parameters and the Hessian at the location of best fit. It works by taking a second-order Taylor expansion of the log-posterior around the mode of the posterior (see Bishop & Nasrabadi 2007 for more information about the Laplace approximation): it assumes the posterior can be approximated as a Gaussian and determines which Gaussian would have the same Hessian at its mean. This approximation was found to fit the posterior accurately enough that using it to calculate the mass matrix of the No U-Turn Sampler significantly improved the efficiency of sampling.
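The mechanics can be sketched with a toy quadratic log-posterior (our own stand-in for the GP posterior; `neg_log_post` is hypothetical): locate the mode, then invert the Hessian there to get the Laplace covariance.

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)

def neg_log_post(p):
    """Toy negative log-posterior (a 2D quadratic well), standing in
    for the GP posterior; any twice-differentiable function works."""
    return (0.5 * (p[0] ** 2 / 0.04 + (p[1] - 1.0) ** 2 / 0.25)
            + 0.3 * p[0] * p[1])

p0 = jnp.zeros(2)
g = jax.grad(neg_log_post)(p0)
H = jax.hessian(neg_log_post)(p0)
p_map = p0 - jnp.linalg.solve(H, g)       # one Newton step (exact here,
                                          # since the toy posterior is quadratic)
cov = jnp.linalg.inv(jax.hessian(neg_log_post)(p_map))  # Laplace covariance
```

The mean `p_map` and covariance `cov` define the approximating Gaussian; `cov` (or its inverse) is what would seed the NUTS mass matrix.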
While the Laplace approximation is useful for tuning No U-Turn Sampling, it can also serve as a very computationally inexpensive approximation of the posterior (taking < 10 seconds for the VLT/FORS2 datasets) that avoids expensive inference techniques such as MCMC. While the results are not as robust as a full MCMC retrieval, this approximation may permit the application of 2D GPs to very large datasets for which a full exploration of the posterior would be unfeasible. It is also a useful check before running an MCMC (which may take hours) to see whether the Laplace approximation produces a reasonable fit to the data.
Appendix J Numerical stability with eigendecomposition
The numerical stability of eigendecomposition depends on the condition number of the matrix in question. For any covariance matrix, the condition number is the ratio of the largest to smallest eigenvalue of that matrix. Matrix inversion is generally more prone to numerical error for matrices with large condition numbers.
This is often not a major issue for 1D GPs because it is common to have terms in the kernel that are added to the diagonal, such as white noise terms. Adding to the diagonal generally reduces the condition number: if a constant c is added to all of the diagonal elements of a covariance matrix, all of its eigenvalues increase by c, which always decreases the ratio of the largest to smallest eigenvalue. However, because this method splits the covariance matrix into sums of Kronecker products, the component wavelength and time covariance matrices may not all have terms added to their diagonals. For the kernel function used to fit the VLT/FORS2 data (shown in Eq. 30), the matrix accounting for time-correlated systematics, K_{t}, is calculated from a squared-exponential kernel with no terms added to the diagonal, potentially resulting in a high condition number and significant numerical errors.
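The effect of the diagonal constant on the condition number can be sketched numerically (an illustrative squared-exponential time kernel of our own choosing, with the c = 10^{−5} value from the text compared against a larger constant):

```python
import numpy as np

# Squared-exponential time kernel with no diagonal (white-noise) term,
# as for K_t in Eq. (30): its smallest eigenvalues are near zero
t = np.linspace(0.0, 1.0, 50)
K_t = np.exp(-0.5 * ((t[:, None] - t[None, :]) / 0.3) ** 2)

def cond(M):
    """Condition number as the ratio of largest to smallest eigenvalue."""
    w = np.linalg.eigvalsh(M)
    return w[-1] / w[0]

# Adding c to the diagonal shifts every eigenvalue up by c, so the
# condition number drops as c grows
cond_weak = cond(K_t + 1e-5 * np.eye(50))    # the value used in the text
cond_strong = cond(K_t + 1e-2 * np.eye(50))
```

Even with c = 10^{−5} the condition number remains large, but it is bounded well away from the near-singular unregularised case, which is what makes the eigendecomposition stable.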
The approach used to deal with these numerical errors may be referred to as ridge regression or Tikhonov regularisation (Hoerl & Kennard 1970): a constant was added to the diagonal elements of the time covariance matrix K_{t} in order to increase all of its eigenvalues, often referred to as regularising the matrix. The value of the constant was likely larger than necessary to guarantee stability, as it was unclear how to determine an optimal value; the diagonal values, which were all equal to one, were increased by 10^{−5}. To understand the effect this could have had, we can treat the constant like any other term in the kernel and consider what type of noise it accounts for. The actual kernel function used for the VLT analysis was effectively: (J.1)
This new term multiplies both the h_{CM} and h_{WSS} terms in the wavelength kernel function it multiplies. We can rearrange this kernel function as: (J.2)
This rearrangement of the equation shows that adding the constant c to the diagonal of the first time covariance matrix can be thought of as instead adding a second high-frequency systematics term, with its own height scale and wavelength length scale, as well as changing the amplitude of the white noise terms.
Changing the amplitude of the white noise terms should not be an issue since each term is freely fit for; the effect is simply that the retrieved terms are slightly smaller than the actual level of white noise the kernel is fitting. For the best-fit values of the analogous 2D GP fits to the 600RI and 600B data, the maximum decrease in any retrieved white noise value would be 8.8 × 10^{−9} with c = 10^{−5}.
The only negative effect regularisation might have had comes from this additional high-frequency systematics term, which has a different height scale and wavelength length scale to the actual high-frequency systematics term in the kernel. Suppose the common-mode height scale h_{CM} is ten times greater than the high-frequency systematics height scale h_{HFS}. The additional term in the kernel would then have a height scale of 3.2% of h_{HFS}. If both wavelength length scales were identical, this would result in the MCMC retrieving h_{HFS} to be only ≈ 0.05% smaller, with no effect on the retrieved transmission spectrum. Since the two wavelength length scales differed, the kernel used for the high-frequency systematics in effect deviated very slightly from a squared-exponential kernel, being instead the sum of two squared-exponential kernels with different length scales. Overall, this should have a very small effect.
For future work, it may be possible to avoid the addition of a constant to the diagonals. This is because the log-likelihood calculation does not use the eigenvalues of the K_{λ} and K_{t} matrices directly, but rather the eigenvalues of the transformed matrices, to which the identity matrix is added; this addition should regularise them. The eigenvalues of the matrices ∑_{λ} and ∑_{t} still need to be calculated directly; however, as white noise terms are likely needed in any suitable kernel, these matrices can be chosen to include the white noise terms, which naturally regularises them.
Appendix K Extension of method to higher dimensions
The optimisations developed in this paper could be extended to higher dimensions with limited extra work. For the case of a 3D Gaussian process, given data lying on a complete 3D grid and a covariance matrix of the form: (K.1)
then K^{−1} could be factorised using a similar method: first take the eigendecompositions of ∑_{λ}, ∑_{t} and ∑_{n} and use them to transform K_{λ}, K_{t} and K_{n}; then calculate the eigendecompositions of these transformed matrices and factorise K^{−1} similarly to Eq. (26). Saatchi (2011) contains algorithms to efficiently calculate matrix-vector products for any number of Kronecker product dimensions, including [A ⊗ B ⊗ C]d for three dimensions (analogous to Eq. 14). Although not implemented here, any of the algorithms described in this work could conceivably be generalised to higher dimensions using this approach.
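The 3D matrix-vector product can be sketched by applying each factor matrix along its own axis of the reshaped vector (a generic implementation in the spirit of Saatchi 2011; the function name is our own):

```python
import numpy as np

def kron_matvec_3d(mats, d):
    """Compute [A kron B kron C] d without forming the Kronecker
    product, by applying each factor matrix along its own axis of
    the reshaped vector (after Saatchi 2011)."""
    shape = [M.shape[1] for M in mats]
    X = d.reshape(shape)
    for axis, M in enumerate(mats):
        # Contract M against the current axis, then restore axis order
        X = np.tensordot(M, X, axes=([1], [axis]))
        X = np.moveaxis(X, 0, axis)
    return X.reshape(-1)

rng = np.random.default_rng(5)
A, B, C = (rng.normal(size=(n, n)) for n in (2, 3, 4))
d = rng.normal(size=2 * 3 * 4)
fast = kron_matvec_3d([A, B, C], d)
slow = np.kron(A, np.kron(B, C)) @ d   # brute-force reference
```

The loop generalises to any number of Kronecker factors, which is what makes the extension beyond two dimensions mostly mechanical.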
An example within transmission spectroscopy where this could be useful is studying whether the systematics of different transit observations with the same instrument are correlated. In that case, the three dimensions could be wavelength, time, and transit number. However, many hyperparameters would need to be shared between different transit observations, so this may not be a reasonable approach for most datasets.
Appendix L Visualisations of 2D GP analyses of the VLT/FORS2 data
Fig. L.1 Heat maps of the correlation matrices of the transmission spectra for the 600B data. Left: from the analogous 2D GP analysis, visualising how the uncertainty in each transit depth is correlated across the spectrum. Right: for the 2D GP analysis with no added prior on the mean radius (from Sect. 4.4.2). In both plots, the strength of correlation is not significantly affected by the wavelength separation between light curves for most wavelengths. The right plot in particular shows significant correlations that are mostly independent of wavelength separation, consistent with significant uncertainty in the offset of the transmission spectrum, as all wavelengths are affected approximately equally.
Fig. L.2 Same as Fig. L.1 but for the 600RI data. Both plots demonstrate that transit depths at large wavelength separations are less correlated than those at closer separations, which likely explains the significant uncertainty in the slope of the transmission spectrum for this analysis.
Fig. L.3 Fitted transit light curves for the 600B data, showing the GP predictive mean in red with the 1σ and 2σ uncertainty in the mean shaded in grey. Left: fits to the raw light curves. Middle: the same fits minus the best-fit transit model, displaying just the GP fit to the systematics. Right: the same fits minus both the best-fit transit model and the GP mean (fitting the systematics), showing the residual white noise.
Fig. L.5 Corner plot for the 600B data fit with a 2D GP, marginalising over T_{0}, a/R_{*} and b (from Sect. 4.4.3). Only the wavelength-dependent parameters from two of the light curves are included (from the wavelength bands immediately left of, and centred on, the Na feature), in addition to any parameters shared by all light curves. Note that one wavelength length scale is consistent with the maximum prior limit, showing that these systematics may be constant in wavelength, while the other is constrained within the prior bounds.
Fig. L.6 Corner plot for the 600RI data from the analogous 2D GP fit (from Sect. 4.4.1). The wavelength-dependent parameters included are from the light curves centred on the Na and K features. We note that both wavelength length scales are constrained within the prior bounds for this dataset.
References
 Ahrer, E., Wheatley, P. J., Kirk, J., et al. 2022, MNRAS, 510, 4857
 Ahrer, E.-M., Stevenson, K. B., Mansfield, M., et al. 2023, Nature, 614, 653
 Aigrain, S., & Foreman-Mackey, D. 2023, ARA&A, 61, 329
 Alderson, L., Wakeford, H. R., Alam, M. K., et al. 2023, Nature, 614, 664
 Anderson, D. R., Collier Cameron, A., Hellier, C., et al. 2011, A&A, 531, A60
 Appenzeller, I., Fricke, K., Fürtig, W., et al. 1998, The Messenger, 94, 1
 Astropy Collaboration (Price-Whelan, A. M., et al.) 2022, ApJ, 935, 167
 Barragán, O., Aigrain, S., Rajpaul, V. M., & Zicher, N. 2022, MNRAS, 509, 866
 Bell, T. J., Nikolov, N., Cowan, N. B., et al. 2017, ApJ, 847, L2
 Bishop, C. M., & Nasrabadi, N. M. 2007, J. Electron. Imaging, 16, 049901
 Bonomo, A. S., Desidera, S., Benatti, S., et al. 2017, A&A, 602, A107
 Boone, K. 2019, AJ, 158, 257
 Bradbury, J., Frostig, R., Hawkins, P., et al. 2018, JAX: composable transformations of Python+NumPy programs
 Brandt, S. 2014, Data Analysis (Springer International Publishing)
 Brown, T. M. 2001, ApJ, 553, 1006
 Carter, J. A., & Winn, J. N. 2009, ApJ, 704, 51
 Charbonneau, D., Brown, T. M., Noyes, R. W., & Gilliland, R. L. 2002, ApJ, 568, 377
 Delisle, J. B., Unger, N., Hara, N. C., & Ségransan, D. 2022, A&A, 659, A182
 Diamond-Lowe, H., Berta-Thompson, Z., Charbonneau, D., Dittmann, J., & Kempton, E. M. R. 2020, AJ, 160, 27
 Espinoza, N., Rackham, B. V., Jordán, A., et al. 2019, MNRAS, 482, 2065
 Evans, T. M., Pont, F., Sing, D. K., et al. 2013, ApJ, 772, L16
 Fakhouri, H. K., Boone, K., Aldering, G., et al. 2015, ApJ, 815, 58
 Foreman-Mackey, D. 2016, J. Open Source Softw., 1, 24
 Foreman-Mackey, D., & Garcia, L. J. 2023, jaxoplanet: Astronomical time series analysis with JAX
 Foreman-Mackey, D., Agol, E., Ambikasaran, S., & Angus, R. 2017, AJ, 154, 220
 Foreman-Mackey, D., Luger, R., Agol, E., et al. 2021, exoplanet: Gradient-based probabilistic inference for exoplanet data & other astronomical time series
 Gelman, A., & Rubin, D. B. 1992, Stat. Sci., 7, 457
 Gibson, N. P. 2014, MNRAS, 445, 3401
 Gibson, N. P., Aigrain, S., Pont, F., et al. 2012a, MNRAS, 422, 753
 Gibson, N. P., Aigrain, S., Roberts, S., et al. 2012b, MNRAS, 419, 2683
 Gibson, N. P., Nikolov, N., Sing, D. K., et al. 2017, MNRAS, 467, 4591
 Gibson, N. P., de Mooij, E. J. W., Evans, T. M., et al. 2019, MNRAS, 482, 606
 Gordon, T. A., Agol, E., & Foreman-Mackey, D. 2020, AJ, 160, 240
 Greene, T. P., Bell, T. J., Ducrot, E., et al. 2023, Nature, 618, 39
 Harris, C. R., Millman, K. J., van der Walt, S. J., et al. 2020, Nature, 585, 357
 Hoerl, A. E., & Kennard, R. W. 1970, Technometrics, 12, 55
 Hoffman, M. D., & Gelman, A. 2014, J. Mach. Learn. Res., 15, 1593
 Holmberg, M., & Madhusudhan, N. 2023, MNRAS, 524, 377
 Hunter, J. D. 2007, Comput. Sci. Eng., 9, 90
 Ih, J., & Kempton, E. M. R. 2021, AJ, 162, 237
 Jensen, C. S., Kjærulff, U., & Kong, A. 1995, Int. J. Hum.-Comput. Stud., 42, 647
 Kokori, A., Tsiaras, A., Edwards, B., et al. 2022, ApJS, 258, 40
 Kreidberg, L. 2015, PASP, 127, 1161
 Kumar, R., Carroll, C., Hartikainen, A., & Martin, O. 2019, J. Open Source Softw., 4, 1143
 Lecavelier Des Etangs, A., Pont, F., Vidal-Madjar, A., & Sing, D. 2008, A&A, 481, L83
 Liu, D. C., & Nocedal, J. 1989, Math. Program., 45, 503
 Llorente, F., Martino, L., Curbelo, E., López-Santiago, J., & Delgado, D. 2023, WIREs Comput. Stat., 15, e1595
 Maguire, C., Gibson, N. P., Nugroho, S. K., et al. 2023, MNRAS, 519, 1030
 Mandel, K., & Agol, E. 2002, ApJ, 580, L171
 May, E. M., Gardner, T., Rauscher, E., & Monnier, J. D. 2020, AJ, 159, 7
 McCullough, P. R., Crouzet, N., Deming, D., & Madhusudhan, N. 2014, ApJ, 791, 55
 McGruder, C. D., López-Morales, M., Espinoza, N., et al. 2020, AJ, 160, 230
 Meech, A., Aigrain, S., Brogi, M., & Birkby, J. L. 2022, MNRAS, 512, 2604
 Nasedkin, E., Mollière, P., Wang, J., et al. 2023, A&A, 678, A41
 Neal, R. M. 2003, Ann. Stat., 31, 705
 Neal, R. 2011, in Handbook of Markov Chain Monte Carlo, 113
 Nikolov, N., Sing, D. K., Gibson, N. P., et al. 2016, ApJ, 832, 191
 The pandas development team 2020, https://doi.org/10.5281/zenodo.3509134
 Panwar, V., Désert, J.-M., Todorov, K. O., et al. 2022, MNRAS, 510, 3236
 Parviainen, H., & Aigrain, S. 2015, MNRAS, 453, 3821
 Patel, J. A., & Espinoza, N. 2022, AJ, 163, 228
 Pérez, F., & Granger, B. E. 2007, Comput. Sci. Eng., 9, 21
 Powell, D., Louden, T., Kreidberg, L., et al. 2019, ApJ, 887, 170
 Rackham, B., Espinoza, N., Apai, D., et al. 2017, ApJ, 834, 151
 Rackham, B. V., Apai, D., & Giampapa, M. S. 2018, ApJ, 853, 122
 Rajpaul, V., Aigrain, S., Osborne, M. A., Reece, S., & Roberts, S. 2015, MNRAS, 452, 2269
 Rakitsch, B., Lippert, C., Borgwardt, K., & Stegle, O. 2013, in Advances in Neural Information Processing Systems, 26
 Rasmussen, C. E., & Williams, C. K. I. 2006, Gaussian Processes for Machine Learning, Adaptive Computation and Machine Learning (MIT Press)
 Rustamkulov, Z., Sing, D. K., Mukherjee, S., et al. 2023, Nature, 614, 659
 Saatchi, Y. 2011, PhD thesis, University of Cambridge
 Salvatier, J., Wiecki, T. V., & Fonnesbeck, C. 2016, PeerJ Comput. Sci., 2, e55
 Seager, S., & Sasselov, D. D. 2000, ApJ, 537, 916
 Sedaghati, E., Boffin, H. M. J., Jeřabková, T., et al. 2016, A&A, 596, A47
 Sedaghati, E., Boffin, H. M. J., MacDonald, R. J., et al. 2017, Nature, 549, 238
 Sedaghati, E., MacDonald, R. J., Casasayas-Barris, N., et al. 2021, MNRAS, 505, 435
 Sing, D. K., Wakeford, H. R., Showman, A. P., et al. 2015, MNRAS, 446, 2428
 Sing, D. K., Fortney, J. J., Nikolov, N., et al. 2016, Nature, 529, 59
 Spyratos, P., Nikolov, N. K., Constantinou, S., et al. 2023, MNRAS, 521, 2163
 Virtanen, P., Gommers, R., Oliphant, T. E., et al. 2020, Nat. Methods, 17, 261
 Wilson, J., Gibson, N. P., Lothringer, J. D., et al. 2021, MNRAS, 503, 4787
 Zieba, S., Kreidberg, L., Ducrot, E., et al. 2023, Nature, 620, 746
Alternating between NUTS and other MCMC methods is especially common when sampling both continuous and discrete variables (Salvatier et al. 2016).
All Tables
Results of atmospheric retrievals on the 600B data analysed using different methods.
Results of atmospheric retrievals on the 600RI data analysed by different methods.
Overview of MCMC blocked Gibbs sampling steps for all 2D GP analyses of both VLT/FORS2 datasets.
All Figures
Fig. 1 Four equally likely types of transmission spectra generated for a radius ratio of ρ = 0.1, showing the spectrum in transit depths.
Fig. 2 Example of synthetic dataset analysed in Simulation 1 (containing wavelengthindependent systematics) showing some simulated light curves (left) and recovered transmission spectrum (right). Left: the three shortest wavelength light curves in the dataset being fit by the transit model combined with a 2D GP noise model (the other 13 light curves were simultaneously fit but not plotted). Right: resulting transmission spectrum from the joint fit of all 16 light curves, with the three leftmost points corresponding to the light curves in the left plot. The recovered transmission spectrum was consistent with the injected spectrum (). An atmospheric retrieval was performed on the recovered transmission spectrum (with the bestfit model plotted) and was consistent with the flat slope and strong K feature of the injected spectrum. 

Fig. 3 Z-score histograms for all three methods with the theoretical distribution overplotted. Both the 1D GP and Hybrid methods have many outliers that increase the variance of the retrieved values. The values from the Hybrid method have the correct mean but greater variance than the theoretical distribution, while the 2D GP traces the theoretical distribution very closely. 

Fig. 4 Z-scores of retrieved atmospheric features from Simulation 2, which had correlated noise with short wavelength length scales. The three methods tested were fitting the light curves with 1D GPs (top row), the Hybrid method (middle row), and a 2D GP (bottom row). All methods have mean Z-scores consistent with zero, that is, none of them biases the results towards measuring larger or smaller values of the features. The variances of the retrieved Z-scores for the 2D GP method are all consistent with one (matching the Gaussian distribution plotted in red), but those of the other methods are not, which indicates overestimation or underestimation of uncertainties. 
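The Z-score diagnostic behind this figure can be sketched as follows. This is a minimal illustration rather than the authors' code; the array names and toy numbers are assumptions. When quoted uncertainties are well calibrated, the Z-scores should follow a standard normal distribution, so their sample mean should be near zero and their sample variance near one:

```python
import numpy as np

def z_scores(retrieved, injected, sigma):
    """Z-score of each retrieved feature relative to its injected value."""
    return (np.asarray(retrieved) - np.asarray(injected)) / np.asarray(sigma)

# Toy example: retrieval errors that genuinely match the quoted sigma
rng = np.random.default_rng(0)
injected = np.zeros(10000)
sigma = np.full(10000, 0.5)
retrieved = injected + rng.normal(0.0, sigma)

z = z_scores(retrieved, injected, sigma)
# mean ~ 0 indicates no bias; variance ~ 1 indicates calibrated uncertainties
```

An underestimated sigma would inflate the Z-score variance above one, while an overestimated sigma would shrink it below one, which is the signature the figure looks for in the 1D GP and Hybrid methods.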

Fig. 5 Raw light curves (left) and common-mode corrected light curves (right) for one simulated dataset in Simulation 3 (containing wavelength-correlated systematics). The injected transit signal is plotted as a black dotted line. The variation of the systematics in wavelength is not obvious by eye but can be noticed by comparing the regions highlighted in red boxes. Despite the systematics varying in wavelength, the common-mode correction appears to remove visual signs of systematics. 
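A common-mode correction of the kind shown here is often implemented by dividing each spectroscopic light curve by a wavelength-averaged (white-light) systematics estimate. The sketch below works under that assumption; the function name and toy data are hypothetical, not the paper's pipeline:

```python
import numpy as np

def common_mode_correct(flux):
    """Divide each spectroscopic light curve by the wavelength-averaged
    systematics estimate.  flux has shape (n_wavelength, n_time)."""
    flux = np.asarray(flux, dtype=float)
    common_mode = flux.mean(axis=0)        # mean light curve across wavelength
    common_mode /= np.median(common_mode)  # normalise to ~1
    return flux / common_mode

# toy example: a ramp shared by all channels plus independent white noise
rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 100)
ramp = 1.0 + 0.02 * t                      # systematic common to all channels
flux = ramp[None, :] + rng.normal(0.0, 1e-4, (16, 100))
corrected = common_mode_correct(flux)
# the shared ramp is removed; residual scatter is at the white-noise level
```

As the figure demonstrates, such a correction can make the light curves look clean even when the systematics are not truly common-mode, since the slow variation with wavelength is hidden in the residuals.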

Fig. 6 Example of retrieved transmission spectra with the 1D GP method (left) analysing common-mode corrected light curves (right plot in Fig. 5) compared with the 2D GP method (right) analysing uncorrected light curves (left plot in Fig. 5). The error bars in blue convey only the mean and standard deviation of each transit depth measurement; random draws are therefore taken from the covariance matrix to convey potential correlations between transit depths, helping to visualise the increased uncertainty in the offset and scattering slope of the spectrum from the 2D GP method. Only the 1D GP method erroneously detects a negative scattering slope. While it is not visually apparent, the 2D GP method gives a stronger constraint on the detection of potassium (13.9σ compared to 8.7σ). 
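Random draws from a transmission spectrum's covariance matrix, as used in this figure, can be generated with a multivariate normal. A sketch with an assumed toy covariance follows; the decay length and depth values are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)

# hypothetical transmission spectrum: mean depths and their covariance matrix
n = 16
mean_depth = np.full(n, 0.01)
# toy covariance: uncertainties correlated with an exponential decay in channel
i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
cov = (2e-4) ** 2 * np.exp(-np.abs(i - j) / 5.0)

# random draws from the full posterior of the spectrum
draws = rng.multivariate_normal(mean_depth, cov, size=100)
# each row is one plausible spectrum; plotted together, the draws reveal
# correlated offset/slope uncertainty that plain error bars cannot show
```

Plotting each row of `draws` as a faint line is what produces the spread of grey spectra seen in the 2D GP panels.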

Fig. 7 600B grism (left) and 600RI grism (right) spectroscopic light curves with the best-fit transit region shaded in grey. Left: the light curves appear to have very similar systematics, with the exception of the top one or two light curves, in which the region in the red box shows a small increase in flux at the end of the observation. This could suggest the presence of wavelength-specific systematics (WSS) in these data. Right: the slowly varying shape of the systematics – as highlighted in the red box – suggests that the ‘common-mode’ systematics are not actually common-mode but gradually vary in wavelength. 

Fig. 8 Plots showing the GP predictive means fit by the 2D GP to the 600B dataset to account for (in order from left to right): (a) common-mode systematics, (b) high-frequency systematics, and (c) wavelength-specific systematics. The rightmost plot shows the raw light curves, included for reference. Specifically, the middle plots show the change in the GP mean caused by including these systematics in the fit, as discussed in Sect. 4.2. 

Fig. 9 600B (top) and 600RI (bottom) retrieved transmission spectra comparing the original analysis, the 1D GP reanalysis, and the analogous 2D GP analysis. The 2D GP best-fit atmospheric retrieval is shown as a red dotted line, and random draws taken from the 2D GP spectrum are shown in grey. Top: all three analyses show largely consistent results. Bottom: the error bars of each method may appear consistent, but the random draws from the 2D GP spectrum demonstrate a large uncertainty in the slope of this spectrum after accounting for the covariance between transit depths. 

Fig. 10 Similar to Fig. 9 but showing the different 2D GP analyses tested. As error bars do not convey the correlations between the data points, it is difficult to see that the 600B fit with no mean radius prior – which appears to have much larger uncertainties – actually has a similar constraint on the slope and Na feature compared to the other methods. For the direct retrieval of atmospheric parameters from the data, a dotted line shows the best-fit atmospheric model (as the transmission spectrum is not directly retrieved with this method). 

Fig. 11 Comparison of log-likelihood calculation runtime (in seconds on a logarithmic scale) between the general approach of Cholesky factorisation (left) and our method (centre), as well as the relative performance improvement (right). Cholesky factorisation was not calculated when MN ≥ 32768 (shown in white) due to memory limitations. 
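The speed-up shown here comes from exploiting the Kronecker product structure of the kernel: eigendecomposing the two small factor matrices avoids factorising the full MN × MN covariance. The sketch below evaluates the log-likelihood for a kernel of the simplified form K = K_λ ⊗ K_t + σ²I (a plain white-noise term; the paper's full kernel is more flexible) and checks it against the dense computation:

```python
import numpy as np

def kron_gp_loglike(R, K_lam, K_t, sigma2):
    """Gaussian log-likelihood of a residual grid R (M wavelengths x N times)
    under covariance K = K_lam (x) K_t + sigma2 * I, using eigendecompositions
    of the two small factor matrices instead of one (MN x MN) Cholesky."""
    lam_l, Q_l = np.linalg.eigh(K_lam)   # M x M eigendecomposition
    lam_t, Q_t = np.linalg.eigh(K_t)     # N x N eigendecomposition
    # eigenvalues of the full covariance: outer product plus white noise
    D = np.outer(lam_l, lam_t) + sigma2
    # rotate the residuals into the Kronecker eigenbasis
    R_tilde = Q_l.T @ R @ Q_t
    chi2 = np.sum(R_tilde ** 2 / D)
    logdet = np.sum(np.log(D))
    return -0.5 * (chi2 + logdet + R.size * np.log(2.0 * np.pi))

# verify against the direct dense computation on a small grid
rng = np.random.default_rng(3)
M, N = 4, 6
x = np.linspace(0, 1, M)[:, None]
t = np.linspace(0, 1, N)[:, None]
K_lam = np.exp(-0.5 * (x - x.T) ** 2 / 0.3 ** 2)
K_t = np.exp(-0.5 * (t - t.T) ** 2 / 0.2 ** 2)
R = rng.normal(size=(M, N))
K_full = np.kron(K_lam, K_t) + 1e-2 * np.eye(M * N)
r = R.ravel()  # row-major flattening matches np.kron block ordering
direct = -0.5 * (r @ np.linalg.solve(K_full, r)
                 + np.linalg.slogdet(K_full)[1]
                 + M * N * np.log(2.0 * np.pi))
fast = kron_gp_loglike(R, K_lam, K_t, 1e-2)
```

The fast path costs O(M³ + N³ + MN(M + N)) time and O(M² + N² + MN) memory, versus O(M³N³) time and O(M²N²) memory for the dense Cholesky, which is why the dense approach runs out of memory at MN ≥ 32768 in the figure.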

Fig. F.1 Autocorrelation plots of the residuals minus the GP mean if (from left to right): no correlated noise is fit, no high-frequency systematics are fit, no wavelength-specific systematics are fit, or the full kernel in Eq. (30) is fit. 

Fig. L.1 Heat map of the correlation matrices of the transmission spectra for the 600B data. Left: from the analogous 2D GP analysis, visualising how the uncertainty in each transit depth is correlated across the spectrum. Right: for the 2D GP analysis with no added prior on the mean radius (from Sect. 4.4.2). In both plots, the strength of correlation is not significantly affected by the wavelength separation between light curves for most wavelengths. The right plot in particular shows significant correlations that are mostly independent of wavelength separation, consistent with significant uncertainty in the offset of the transmission spectrum, as all wavelengths are affected approximately equally. 
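Correlation matrices like these are obtained by normalising the transmission spectrum's covariance matrix by its diagonal. A minimal sketch with an assumed toy covariance containing a shared offset term, which reproduces the separation-independent correlation the caption describes:

```python
import numpy as np

def correlation_matrix(cov):
    """Convert a covariance matrix of transit depths into the correlation
    matrix visualised in the heat maps."""
    sigma = np.sqrt(np.diag(cov))
    return cov / np.outer(sigma, sigma)

# toy covariance: independent per-channel scatter plus a shared offset term,
# giving off-diagonal correlations independent of wavelength separation
n = 5
cov = 1e-8 * np.eye(n) + 4e-8 * np.ones((n, n))
corr = correlation_matrix(cov)
# diagonal is exactly 1; every off-diagonal element is 4e-8 / 5e-8 = 0.8
```

A large shared offset term dominates the off-diagonal structure, which is the pattern seen in the right panel where the mean radius is unconstrained by a prior.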

Fig. L.2 Same as Fig. L.1 but for 600RI data. Both plots demonstrate that transit depths at large wavelength separations in the dataset are less correlated than for closer separations, which likely explains the significant uncertainty in the slope of the transmission spectrum for this analysis. 

Fig. L.3 Fitted transit light curves for the 600B data, showing the GP predictive mean in red as well as the 1σ and 2σ uncertainty in the mean shaded in grey. Left: fitting the raw light curves. Middle: the same fit with the best-fit transit model subtracted, displaying just the GP fit to the systematics. Right: the same fit with both the best-fit transit model and the GP mean (fitting the systematics) subtracted, showing the residual white noise. 

Fig. L.4 Same plot as Fig. L.3 but for 600RI data. 

Fig. L.5 Corner plot for 600B data fit with 2D GP and marginalising over T_{0}, a/R_{*} and b (from Sect. 4.4.3). Only the wavelengthdependent parameters from two of the light curves are included (from the wavelength bands immediately left of and centred on the Na feature) in addition to any parameters shared by all light curves. Note the wavelength length scale is consistent with the maximum prior limit, showing that these systematics may be constant in wavelength, while is constrained to be within the prior bounds. 

Fig. L.6 Corner plot for 600RI data for analogous 2D GP fit (from Sect. 4.4.1). The wavelengthdependent parameters included are from the light curves centred on the Na and K features. We note that both the wavelength length scales and are constrained within the prior bounds for this dataset. 
