A&A
Volume 686, June 2024
Article Number A89
Number of page(s) 35
Section Planets and planetary systems
DOI https://doi.org/10.1051/0004-6361/202347613
Published online 31 May 2024

© The Authors 2024


1 Introduction

Low-resolution transmission spectroscopy has been a powerful technique for probing the atmospheric composition of exoplanets ever since the first detection of an exoplanet atmosphere (Charbonneau et al. 2002). The technique relies upon observing an exoplanet transit – when an exoplanet appears to pass in front of its host star – and analysing the decrease in flux during the transit as a function of wavelength. The resulting transmission spectrum contains information about the planetary atmosphere (Seager & Sasselov 2000; Brown 2001).

The field of exoplanet atmospheres has entered a new era with the recent launch of JWST. Early Release Science observations of WASP-39b using JWST NIRSpec’s PRISM mode have produced a 33σ detection of H2O in addition to strong detections of Na, CO, and CO2 (Rustamkulov et al. 2023) – far exceeding what had been achieved with previous ground-based and space-based observations (Nikolov et al. 2016; Sing et al. 2016). As the exoplanet community pushes JWST further to its limits towards smaller terrestrial planets (e.g. Greene et al. 2023; Zieba et al. 2023), careful treatment of systematics – astrophysical or instrumental – will become increasingly important.

Gaussian processes (GPs) were introduced in Gibson et al. (2012b) to account for the uncertainty that correlated noise – also referred to as systematics – produces in the resulting transmission spectrum in a statistically robust way. GPs have been shown to provide more reliable estimates of uncertainties when compared to other common techniques such as linear basis models (Gibson 2014). This difference may help explain contradictory results in the field. For example, WASP-31b was reported to have a strong potassium signal at 4.2σ using data from the Hubble Space Telescope (HST; Sing et al. 2015), which were analysed using linear basis models. However, follow-up measurements in Gibson et al. (2017) with the FOcal Reducer and low dispersion Spectrograph (FORS2) on the Very Large Telescope (VLT; Appenzeller et al. 1998) found no evidence of potassium. High-resolution observations in Gibson et al. (2019) using the Ultraviolet and Visual Echelle Spectrograph (UVES) on the VLT as well as low-resolution observations from the Inamori-Magellan Areal Camera and Spectrograph (IMACS) on Magellan (McGruder et al. 2020) also failed to reproduce this detection. These results are more consistent with the re-analysis of the HST data using GPs which reduced the significance of the potassium signal to 2.5σ (Gibson et al. 2017), demonstrating the importance of careful treatment of systematics.

In addition to inconsistent detections of species, conflicting measurements of the slope of the transmission spectrum are also common (e.g. Sedaghati et al. 2017, 2021; Espinoza et al. 2019). A slope in the transmission spectrum can be caused by Rayleigh scattering or by scattering from aerosols, including clouds or hazes, in the atmosphere (Lecavelier Des Etangs et al. 2008), and it is therefore often referred to as a scattering slope. However, stellar activity can also produce an apparent scattering slope, which is typically used to explain these contradictory results (Espinoza et al. 2019; McCullough et al. 2014; Rackham et al. 2018). Because it is difficult to obtain direct evidence that stellar activity is the cause of these contradictions, it is worth considering whether these effects could instead be caused by systematics. This could be particularly relevant for measurements of extreme scattering slopes such as in May et al. (2020), where the authors note that a combination of atmospheric scattering and stellar activity still struggles to explain the observed slope.

One potential issue with current data analyses is that each transit depth in the transmission spectrum is fit separately and is assumed to have an independent uncertainty. Ih & Kempton (2021) studied the effect on retrieved atmospheric parameters if this assumption is incorrect and transit depth uncertainties are correlated. The authors suggest that both instrumental and stellar systematics could cause correlations in transmission spectrum uncertainties, which they showed could significantly impact atmospheric retrievals, but they did not provide a method for retrieving these correlations. Holmberg & Madhusudhan (2023) report the presence of wavelength-correlated systematics in observations of WASP-39b and WASP-96b using the Single Object Slitless Spectroscopy (SOSS) mode of JWST’s Near Infrared Imager and Slitless Spectrograph (NIRISS). They demonstrate why these systematics should result in correlated uncertainties in the transmission spectrum and make multiple simplifying assumptions to derive an approximate covariance matrix of the transmission spectrum, although their method cannot account for both time- and wavelength-correlated systematics.

In this work, we demonstrate a statistically robust way to account for the presence of both time-correlated and wavelength-correlated systematics, and apply it to both simulated datasets and real transit observations of WASP-31b from VLT/FORS2 (originally analysed in Gibson et al. 2017). We model the noise present across the full dataset as a Gaussian process. By simultaneously fitting all spectroscopic light curves for transit depth, we can also explore the joint posterior of all transit depths using Markov chain Monte Carlo (MCMC) and recover the covariance matrix of the transmission spectrum. We present an efficient optimisation, based on work in Saatchi (2011) and Rakitsch et al. (2013), that can dramatically speed up the required log-likelihood calculation. We assume that the inputs to the kernel function of the GP lie on a 2D grid – such as a time by wavelength grid – and refer to this as a 2D GP. Intuitively, this assumption is valid for many datasets that can be neatly arranged in a 2D grid; for example, an image typically has a 2D grid structure where the inputs describing correlations could be chosen as the x and y coordinates of each pixel. It is not required that the grid is uniform; that is to say, for transmission spectroscopy neither the time points nor wavelengths need to be uniformly separated. In contrast, we refer to GPs where the input(s) all lie along a single dimension, such as time, as 1D GPs. This includes the standard approach of fitting individual transit light curves using a GP.

One example where 2D GPs have already been used is radial velocity analysis, where multiple parallel time series observations may be jointly fit with a GP (Rajpaul et al. 2015). In this case, time is one of the dimensions and we can consider the other dimension to consist of a small number (~3) of different observables stacked together, such as radial velocity and any suitable activity indicators. This is in contrast to transmission spectroscopy, where dozens or even hundreds of parallel time series may be observed simultaneously in the form of different wavelength channels and the number of observations in time may also be greater. Unfortunately, the algorithms developed for radial velocity either do not scale to the significantly larger datasets encountered in transmission spectroscopy (as noted in Barragán et al. 2022), or else they make strong assumptions about the form of correlations in both dimensions (Gordon et al. 2020; Delisle et al. 2022) which were not sufficient for our analysis of real observations.

In particular, Gordon et al. (2020) introduced a 2D GP method that could be applied to transmission spectroscopy and scales better than the method introduced here but with a much stronger assumption that the shape of the correlated noise is identical at different wavelengths but can change in amplitude. With weaker assumptions about the correlation in wavelength their method has worse scaling in the number of wavelength channels. The covariance matrix in the time dimension is also limited to celerite kernels as introduced in Foreman-Mackey et al. (2017). Similarly, the extension of the Gordon et al. (2020) method derived in Delisle et al. (2022) is still too limited to apply the kernel function used to fit the VLT/FORS2 data in this work. Our method has significant advantages for transmission spectroscopy as it is computationally tractable even when the length of both dimensions is of the order of hundreds of points and can be used with any general covariance matrices describing the noise in each dimension.

This paper is laid out as follows: Sec. 2 gives a brief introduction to Gaussian processes and the challenges of scaling them to two-dimensional datasets, and outlines the mathematics of the 2D GP method used in this paper. Sec. 3 compares our method to standard approaches on simulated datasets containing wavelength-correlated noise. Finally, Sec. 4 re-analyses archival VLT/FORS2 observations using this method, and the discussion and conclusions are presented in Sects. 5 and 6.

The code developed for this work has been made available as a Python package called LUAS1. It is available on GitHub2 along with documentation and tutorials.

2 2D Gaussian processes for transmission spectroscopy

2.1 Introduction to 2D Gaussian processes

To fit our transit observations we first model the transit using a deterministic function. This function is typically referred to as the ‘mean function’ in the context of GPs and denoted by μ. This light curve model includes a transit depth parameter for each spectroscopic light curve, which we fit for to obtain our transmission spectrum (see Sec. 3.1 for more details on the light curve model). When using 2D GPs, we combine light curve models for each spectroscopic light curve to form our 2D mean function μ(λ, t, θ), which is a function of model parameters θ as well as the wavelength channels λ and the times of our flux observations t.

We model the noise in our observations as a Gaussian process. This means that if we take any two arbitrary input locations xi and xj out of the collection of points in a dataset D, we assume that the noise observed at these points follows a multivariate Gaussian distribution with the covariance given by a kernel function of our choosing. In the case of 1D GPs fitting a single transit light curve, the covariance might be described by the commonly used squared-exponential kernel as a function of the time separation of each data point, giving covariance matrix:

$$K_{ij} = h^2 \exp\left(-\frac{(t_i - t_j)^2}{2 l_t^2}\right) + \sigma^2 \delta_{ij} \tag{1}$$

where h is the height scale or amplitude of the correlated noise, ti and tj are the times of xi and xj, lt is the length scale of the correlated noise in time and we have also included a white noise term with variance σ2 added to the diagonal (δij is the Kronecker delta). The choice of kernel function is problem-specific with a range of kernel functions to choose from (see Rasmussen & Williams 2006).

If we now consider fitting multiple spectroscopic light curves with a GP then any two flux measurements have both a time separation and a wavelength separation. We can combine correlations in both time and wavelength into a single kernel. If we choose the squared-exponential kernel for both dimensions and introduce a wavelength length scale lλ we get:

$$K_{ij} = h^2 \exp\left(-\frac{(\lambda_i - \lambda_j)^2}{2 l_\lambda^2} - \frac{(t_i - t_j)^2}{2 l_t^2}\right) + \sigma^2 \delta_{ij} \tag{2}$$

where λi and λj are the wavelength values at locations xi and xj. We can also mix kernel functions; that is, we could model time with the Matérn 3/2 kernel and wavelength with a squared-exponential kernel.

In general, the covariance matrix describing the noise in the observations K is a function of inputs λ and t and is a function of parameters such as h, lt, lλ and σ which are typically referred to as hyperparameters. We include both hyperparameters and light curve parameters in θ.
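As an illustration of how such a covariance matrix is constructed, the following sketch (plain NumPy with illustrative function and variable names, not the LUAS API) evaluates the kernel of Eq. (2) over every pair of points on a wavelength-by-time grid:

```python
import numpy as np

def squared_exp_2d_covariance(wavelengths, times, h, l_lam, l_t, sigma):
    """Covariance matrix of Eq. (2) for all MN pairs of (wavelength, time) inputs.

    wavelengths: array of M wavelength channels; times: array of N time points
    (neither needs to be uniformly spaced). Returns the (MN, MN) matrix K.
    """
    # Flatten the grid so each wavelength channel's time series is contiguous
    lam_grid, t_grid = np.meshgrid(wavelengths, times, indexing="ij")
    lam, t = lam_grid.ravel(), t_grid.ravel()

    dlam = lam[:, None] - lam[None, :]
    dt = t[:, None] - t[None, :]
    K = h**2 * np.exp(-dlam**2 / (2 * l_lam**2) - dt**2 / (2 * l_t**2))
    K += sigma**2 * np.eye(lam.size)    # white noise term on the diagonal
    return K
```

For M wavelength channels and N time points this builds the full MN × MN matrix explicitly, which quickly becomes impractical; Sec. 2.2 quantifies this cost and Sects. 2.4–2.5 describe how it is avoided.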

Overall, our dataset of flux observations y is modelled as a multivariate Gaussian distribution, with the likelihood (the probability of the data given our model) given by:

$$p(\mathbf{y}\,|\,\boldsymbol{\theta}) = \frac{1}{\sqrt{(2\pi)^{MN}|\mathbf{K}|}} \exp\left(-\frac{1}{2}(\mathbf{y}-\boldsymbol{\mu})^{\mathrm T}\mathbf{K}^{-1}(\mathbf{y}-\boldsymbol{\mu})\right) \tag{3}$$

Taking the logarithm of this (to avoid numerical errors) we get the log-likelihood of our model:

$$\log p(\mathbf{y}\,|\,\boldsymbol{\theta}) = -\frac{1}{2}\mathbf{r}^{\mathrm T}\mathbf{K}^{-1}\mathbf{r} - \frac{1}{2}\log|\mathbf{K}| - \frac{MN}{2}\log(2\pi) \tag{4}$$

where we have defined our residuals vector r = y – μ, and MN is the total number of flux measurements (M wavelength channels × N time points, as introduced in Sec. 2.2).

In accordance with Bayes’ theorem, we can add the logarithm of any prior probability to our log-likelihood to get the log-posterior of all the parameters3. This equation can then be used by inference methods such as Markov chain Monte Carlo (MCMC) to explore the marginal probability distributions of each parameter.
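For reference, Eq. (4) can be evaluated directly by factorising the full covariance matrix; the following minimal sketch (illustrative only, not the LUAS implementation) uses a Cholesky factorisation, and is the general-purpose baseline whose cost motivates the optimisations of Sects. 2.4 and 2.5:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def log_likelihood_direct(r, K):
    """Direct evaluation of Eq. (4); r = y - mu is the residual vector and K the
    full covariance matrix (MN x MN when jointly fitting all light curves)."""
    cho = cho_factor(K, lower=True)
    alpha = cho_solve(cho, r)                           # K^{-1} r
    log_det_K = 2.0 * np.sum(np.log(np.diag(cho[0])))   # log|K| from the Cholesky factor
    n = r.size
    return -0.5 * r @ alpha - 0.5 * log_det_K - 0.5 * n * np.log(2 * np.pi)
```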

2.2 The computational cost of 2D Gaussian processes

The log-likelihood function in Eq. (4) can be used for analysing any general dataset including 1D or 2D datasets. However, the computational cost becomes prohibitive for large datasets. Assume we have flux observations measured at N different times and binned into M different wavelength channels. If we move from analysing individual light curves to all spectroscopic light curves simultaneously then the covariance matrix describing the correlation will go from an N × N matrix to an MN × MN matrix to describe correlations in both time and wavelength. Studying Eq. (4), it should be noted that it is necessary to invert the covariance matrix as well as calculate its determinant. The runtime of both of these computations scales as the cube of the number of data points in the most general case. We can write this scaling of runtime in ‘big O’ notation as O(M³N³). The covariance matrix K also requires storing M²N² entries in memory. The scaling in memory is therefore O(M²N²). This poor scaling in both runtime and memory would make most transmission spectroscopy datasets unfeasible to analyse and is the motivation for introducing a new log-likelihood optimisation.

The optimised GP method introduced in this work is based on Saatchi (2011) and Rakitsch et al. (2013), which both present exact methods of optimising the calculation of Eq. (4). Both methods assume that all inputs for the GP lie on a (potentially non-uniform) 2D grid. In the context of transmission spectroscopy this simply means that if wavelength and time are used as the two inputs then the same set of wavelength bands must be chosen for all points in time (but the wavelength bands do not need to be evenly separated). The grid must also be complete with no missing data, which means outliers at a specific time and wavelength cannot simply be removed from the dataset without either removing all other data points at that specific time or that particular wavelength. However, a common approach for dealing with non-Gaussian outliers has just been to replace them using an interpolating function (e.g. Gibson et al. 2017). This approach keeps the inputs on a complete grid, and so most low-resolution transmission spectroscopy datasets likely satisfy this grid assumption.

The two methods differ in the range of kernel functions they can be used with. The Saatchi (2011) method is simpler and is described first, with the Rakitsch et al. (2013) method being a more general extension with a similar computational scaling.

2.3 Kronecker products

Some basics of Kronecker product algebra are useful to explain the optimisations used in this work. See Saatchi (2011) for more details, proofs of the results and how these results generalise to more than two dimensions.

The Kronecker product between two matrices may be written using the ⊗ symbol. It can be thought of as multiplying every element of the first matrix by every element in the second matrix, with each multiplication producing its own term in the resulting matrix. For example:

$$A \otimes B = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \otimes B = \begin{pmatrix} a_{11}B & a_{12}B \\ a_{21}B & a_{22}B \end{pmatrix} \tag{5}$$

If A is M × M and B is N × N, then A ⊗ B is MN × MN.

The following equations hold:

$$(A \otimes B)^{\mathrm T} = A^{\mathrm T} \otimes B^{\mathrm T} \tag{6}$$
$$(A \otimes B)(C \otimes D) = (AC) \otimes (BD) \tag{7}$$
$$(A \otimes B)^{-1} = A^{-1} \otimes B^{-1} \tag{8}$$

Both methods presented assume the covariance matrix of the noise can be expressed using Kronecker products. The Saatchi (2011) method assumes the covariance matrix can be expressed as:

$$\mathbf{K} = \mathbf{K}_\lambda \otimes \mathbf{K}_t + \sigma^2 I \tag{9}$$

where we have written the covariance matrix K as a Kronecker product of separate covariance matrices: Kλ constructed using a wavelength kernel function kλ(λ, λ′); and Kt constructed using the time kernel function kt(t, t′). It also assumes that white noise is constant across the full dataset with variance σ² – a limitation of the Saatchi (2011) method. An equivalent way of stating this restriction is that the kernel function can be written as:

$$k(\mathbf{x}_i, \mathbf{x}_j) = k_\lambda(\lambda_i, \lambda_j)\, k_t(t_i, t_j) + \sigma^2 \delta_{ij} \tag{10}$$

In most transmission spectroscopy datasets, the amplitude of white noise varies significantly in wavelength across the dataset, making the Saatchi (2011) method highly restrictive. In contrast, the Rakitsch et al. (2013) method is valid for any covariance matrix that is the sum of two independent Kronecker products:

$$\mathbf{K} = \mathbf{K}_\lambda \otimes \mathbf{K}_t + \mathbf{\Sigma}_\lambda \otimes \mathbf{\Sigma}_t \tag{11}$$

or equivalently any kernel function of the form:

$$k(\mathbf{x}_i, \mathbf{x}_j) = k_\lambda(\lambda_i, \lambda_j)\, k_t(t_i, t_j) + \sigma_\lambda(\lambda_i, \lambda_j)\, \sigma_t(t_i, t_j) \tag{12}$$

This method is much more general, with the second set of covariance matrices Σλ and Σt being able to account for white noise that varies in amplitude across the dataset. We can choose Kλ and Kt to include our correlated noise terms and Σλ and Σt to be diagonal matrices containing the white noise terms. This allows us to account for any correlated noise that is separable in wavelength and time as well as any white noise that is separable in wavelength and time. The white noise as a function of time and wavelength σ(λ, t) must be able to be written as:

$$\sigma(\lambda, t) = \sigma_\lambda(\lambda)\, \sigma_t(t) \tag{13}$$

However, there is no requirement that Σλ and Σt be diagonal, and the method works for any valid covariance matrices Kλ, Kt, Σλ and Σt.

Unlike the method introduced in Gordon et al. (2020), which uses a celerite kernel for the time covariance matrix and is limited to a single regression variable within that matrix, multiple regression variables may be used within any of these covariance matrices. Multiple regression variables are often used in transmission spectroscopy as it is argued they can provide extra information about how strongly correlated the noise may be for different flux observations. For example, the width of the spectral trace on the detector can change in time, and a kernel function could be chosen where the correlation in the noise between two points is an explicit function of both the time separation and the difference in trace widths (e.g. Gibson et al. 2012a; Diamond-Lowe et al. 2020), which is not possible using a celerite kernel (Foreman-Mackey et al. 2017); a sketch of such a kernel is given below. Due to the Kronecker product structure, any regression variables used must also lie on a 2D grid, which will be true if they vary in time but are constant in wavelength (or vice versa). One caveat is that multiple regression variables may be used with the Gordon et al. (2020) method by treating them as additional time series to be fit by the GP (see our discussion of Rajpaul et al. 2015 and Delisle et al. 2022 in Sect. 1).
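As a concrete (hypothetical) example of such a kernel, the sketch below builds the time-dimension covariance matrix Kt from two regression variables – the observation times and the spectral trace widths – both of which vary in time but are shared by all wavelength channels, so the inputs remain on a 2D grid:

```python
import numpy as np

def time_kernel_with_trace_width(times, trace_widths, h, l_t, l_w):
    """N x N time-dimension covariance K_t built from two regression variables:
    the observation times and the spectral trace widths (hypothetical inputs),
    both length-N series shared by every wavelength channel."""
    dt = times[:, None] - times[None, :]
    dw = trace_widths[:, None] - trace_widths[None, :]
    return h**2 * np.exp(-dt**2 / (2 * l_t**2) - dw**2 / (2 * l_w**2))
```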

2.4 Log-likelihood calculation for uniform white noise

The algorithms described in Chapter 5 of Saatchi (2011) allow for the calculation of Eq. (4) in O((M + N)MN) time after an initial matrix decomposition step that scales as O(M³ + N³). This holds for any covariance matrix expressible in the form of Eq. (9).

To understand this method, first note that it takes advantage of Kronecker product structure to speed up matrix-vector products [A ⊗ B]c for an arbitrary MN-long vector c using:

$$[A \otimes B]\,\mathbf{c} = \mathrm{vec}\left(A\,C\,B^{\mathrm T}\right) \tag{14}$$

where we reshape c into an M × N matrix C with C_ij = c_(i−1)N + j and the vec() operator reverses this such that vec(C) = c.

As A ⊗ B is an MN × MN matrix, we would expect multiplication by a vector c to take O(M²N²) operations to calculate in general. However, Eq. (14) takes O((M + N)MN) operations. We also do not need to store the full matrix A ⊗ B in memory but instead just A and B separately, reducing the memory requirement from O(M²N²) to O(M² + N²).
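A minimal sketch of this Kronecker matrix-vector product (Eq. (14)) in NumPy, using the same row-major flattening convention, with a small check against the explicit Kronecker product:

```python
import numpy as np

def kron_matvec(A, B, c):
    """Compute (A kron B) @ c without forming the MN x MN matrix (Eq. (14)).
    A is M x M, B is N x N and c has length MN, flattened so that C[i, j]
    corresponds to c[i*N + j] (the wavelength index varying slowest)."""
    M, N = A.shape[0], B.shape[0]
    C = c.reshape(M, N)
    return (A @ C @ B.T).ravel()

# Small check against the explicit Kronecker product
rng = np.random.default_rng(0)
A, B = rng.normal(size=(4, 4)), rng.normal(size=(3, 3))
c = rng.normal(size=12)
assert np.allclose(kron_matvec(A, B, c), np.kron(A, B) @ c)
```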

We combine this with Eq. (8) and find that once we perform the O(M³ + N³) operations of computing Kλ⁻¹ and Kt⁻¹, then:

$$\mathbf{K}^{-1}\mathbf{r} = \left(\mathbf{K}_\lambda^{-1} \otimes \mathbf{K}_t^{-1}\right)\mathbf{r} \tag{15}$$

can be computed in O((M + N)MN) operations and using O(M² + N²) memory.

While it would be sufficient to use any method such as Cholesky factorisation to compute Kλ⁻¹ and Kt⁻¹, using the eigendecomposition of each matrix permits white noise to be easily accounted for. We exploit a particular property of eigendecomposition: the addition of a constant to the diagonal of a matrix shifts the eigenvalues but leaves the eigenvectors unchanged. We denote the eigendecomposition of some matrix A as:

$$A = Q_A \Lambda_A Q_A^{\mathrm T} \tag{16}$$

after which the inverse can be easily computed as:

$$A^{-1} = Q_A \Lambda_A^{-1} Q_A^{\mathrm T} \tag{17}$$

For two matrices A and B it can be shown that:

$$A \otimes B = \left(Q_A \otimes Q_B\right)\left(\Lambda_A \otimes \Lambda_B\right)\left(Q_A \otimes Q_B\right)^{\mathrm T} \tag{18}$$

We can then add a white noise term to Eq. (18) simply by shifting the eigenvalues:

$$\mathbf{K}^{-1} = \left(\mathbf{K}_\lambda \otimes \mathbf{K}_t + \sigma^2 I\right)^{-1} = \left(Q_\lambda \otimes Q_t\right)\left(\Lambda_\lambda \otimes \Lambda_t + \sigma^2 I\right)^{-1}\left(Q_\lambda \otimes Q_t\right)^{\mathrm T} \tag{19}$$

where I represents the MN × MN identity matrix. We use Eq. (14) to multiply r by the eigenvector matrices Qλ and Qt. The middle term containing the product of eigenvalues is diagonal and so the inverse is easily calculated.

Calculating log |K| can be performed in O(MN) time using:

$$\log|\mathbf{K}| = \sum_{i=1}^{M}\sum_{j=1}^{N} \log\left(\left[\Lambda_\lambda\right]_{ii}\left[\Lambda_t\right]_{jj} + \sigma^2\right) \tag{20}$$
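Putting Eqs. (14)–(20) together, the log-likelihood for the uniform white noise case can be sketched as follows (illustrative NumPy code, not the LUAS implementation; r is the MN residual vector flattened so that each wavelength channel's time series is contiguous):

```python
import numpy as np

def log_likelihood_uniform_white_noise(r, K_lam, K_t, sigma):
    """Eqs. (14)-(20): log-likelihood for K = K_lam kron K_t + sigma^2 I."""
    M, N = K_lam.shape[0], K_t.shape[0]
    lam_l, Q_lam = np.linalg.eigh(K_lam)          # O(M^3) eigendecomposition
    lam_t, Q_t = np.linalg.eigh(K_t)              # O(N^3) eigendecomposition

    eigvals = np.outer(lam_l, lam_t) + sigma**2   # shifted eigenvalues of K (Eq. (19))

    # alpha = K^{-1} r using only the small matrices (each step is an instance of Eq. (14))
    R = r.reshape(M, N)
    Z = (Q_lam.T @ R @ Q_t) / eigvals             # (Q_lam kron Q_t)^T r, divided by eigenvalues
    alpha = (Q_lam @ Z @ Q_t.T).ravel()

    log_det_K = np.sum(np.log(eigvals))           # Eq. (20)
    return -0.5 * r @ alpha - 0.5 * log_det_K - 0.5 * M * N * np.log(2 * np.pi)
```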

2.5 Accounting for non-uniform white noise

The assumption of uniform white noise can be avoided by using a linear transformation presented in Rakitsch et al. (2013). Using this transformation, we can efficiently solve for the log-likelihood for any covariance matrix that is the sum of two Kronecker products – as described by Eq. (11).

First, we solve for the eigendecomposition of Σλ and Σt, notating this as follows:

$$\mathbf{\Sigma}_\lambda = Q_{\Sigma_\lambda} \Lambda_{\Sigma_\lambda} Q_{\Sigma_\lambda}^{\mathrm T} \tag{21}$$
$$\mathbf{\Sigma}_t = Q_{\Sigma_t} \Lambda_{\Sigma_t} Q_{\Sigma_t}^{\mathrm T} \tag{22}$$

In the case that Σλ is diagonal then we simply have Σλ = ΛΣλ and QΣλ = I (and similarly for Σt).

We transform Kλ and Kt using these eigendecompositions:

$$\tilde{\mathbf{K}}_\lambda = \Lambda_{\Sigma_\lambda}^{-1/2} Q_{\Sigma_\lambda}^{\mathrm T}\, \mathbf{K}_\lambda\, Q_{\Sigma_\lambda} \Lambda_{\Sigma_\lambda}^{-1/2} \tag{23}$$
$$\tilde{\mathbf{K}}_t = \Lambda_{\Sigma_t}^{-1/2} Q_{\Sigma_t}^{\mathrm T}\, \mathbf{K}_t\, Q_{\Sigma_t} \Lambda_{\Sigma_t}^{-1/2} \tag{24}$$

Doing this allows us to write our covariance matrix K as a product of three terms:

$$\mathbf{K} = \left(Q_{\Sigma_\lambda} \Lambda_{\Sigma_\lambda}^{1/2} \otimes Q_{\Sigma_t} \Lambda_{\Sigma_t}^{1/2}\right)\left(\tilde{\mathbf{K}}_\lambda \otimes \tilde{\mathbf{K}}_t + I\right)\left(\Lambda_{\Sigma_\lambda}^{1/2} Q_{\Sigma_\lambda}^{\mathrm T} \otimes \Lambda_{\Sigma_t}^{1/2} Q_{\Sigma_t}^{\mathrm T}\right) \tag{25}$$

Multiplying this out reproduces Eq. (11).

The inverse of this equation can be calculated to be:

$$\mathbf{K}^{-1} = \left(Q_{\Sigma_\lambda} \Lambda_{\Sigma_\lambda}^{-1/2} \otimes Q_{\Sigma_t} \Lambda_{\Sigma_t}^{-1/2}\right)\left(\tilde{\mathbf{K}}_\lambda \otimes \tilde{\mathbf{K}}_t + I\right)^{-1}\left(\Lambda_{\Sigma_\lambda}^{-1/2} Q_{\Sigma_\lambda}^{\mathrm T} \otimes \Lambda_{\Sigma_t}^{-1/2} Q_{\Sigma_t}^{\mathrm T}\right) \tag{26}$$

We can now solve K⁻¹r in three steps. K⁻¹ is the product of three matrices, two of which are of the form A ⊗ B and so we can use Eq. (14). The middle matrix in this product is equivalent to Eq. (19) because we have a Kronecker product plus a constant added to the diagonal being multiplied by a vector. In this case, we need the eigendecomposition of the transformed covariance matrices K̃λ and K̃t, and the ‘white noise’ term is set to σ = 1. The scaling has not changed significantly compared to the method with constant white noise, although the eigendecomposition of four matrices instead of two is now required (although this is trivial if Σλ and Σt are diagonal).

log |K| is now computed as follows:

$$\log|\mathbf{K}| = \sum_{i=1}^{M}\sum_{j=1}^{N} \log\left(\left[\tilde{\Lambda}_\lambda\right]_{ii}\left[\tilde{\Lambda}_t\right]_{jj} + 1\right) + N\sum_{i=1}^{M}\log\left[\Lambda_{\Sigma_\lambda}\right]_{ii} + M\sum_{j=1}^{N}\log\left[\Lambda_{\Sigma_t}\right]_{jj} \tag{27}$$

where Λ̃λ and Λ̃t contain the eigenvalues of the transformed matrices K̃λ and K̃t.

The log-likelihood can now be solved for any covariance matrix that can be written in the form of Eq. (11), with an overall scaling of O(2M³ + 2N³ + MN(M + N)) operations and O(M² + N²) memory. Compared to the more general approach of performing Cholesky factorisation on the full MN × MN covariance matrix K, our optimised method provides improvements in runtime of more than an order of magnitude for typical transmission spectroscopy datasets (shown in Sec. 4.5).
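The full calculation of Eqs. (21)–(27) can be sketched in the same style (again illustrative NumPy code rather than the LUAS implementation, and without the additional steps taken for numerical stability discussed in Appendix J):

```python
import numpy as np

def log_likelihood_general(r, K_lam, K_t, S_lam, S_t):
    """Eqs. (21)-(27): log-likelihood for K = K_lam kron K_t + S_lam kron S_t.
    Assumes S_lam and S_t are positive definite (e.g. diagonal white noise matrices)."""
    M, N = K_lam.shape[0], K_t.shape[0]

    # Eigendecompositions of the second pair of matrices (Eqs. (21)-(22));
    # these are trivial if S_lam and S_t are diagonal
    s_l, Q_Sl = np.linalg.eigh(S_lam)
    s_t, Q_St = np.linalg.eigh(S_t)

    # Transformation of Eqs. (23)-(24): W = Q_S diag(s^{-1/2})
    W_l = Q_Sl / np.sqrt(s_l)
    W_t = Q_St / np.sqrt(s_t)
    lam_l, Q_l = np.linalg.eigh(W_l.T @ K_lam @ W_l)   # eigendecompose transformed K_lam
    lam_t, Q_t = np.linalg.eigh(W_t.T @ K_t @ W_t)     # eigendecompose transformed K_t
    eigvals = np.outer(lam_l, lam_t) + 1.0             # the 'white noise' term is now 1

    # alpha = K^{-1} r following Eq. (26), applying Eq. (14) at each step
    U_l, U_t = W_l @ Q_l, W_t @ Q_t
    R = r.reshape(M, N)
    Z = (U_l.T @ R @ U_t) / eigvals
    alpha = (U_l @ Z @ U_t.T).ravel()

    # Eq. (27)
    log_det_K = (np.sum(np.log(eigvals))
                 + N * np.sum(np.log(s_l)) + M * np.sum(np.log(s_t)))
    return -0.5 * r @ alpha - 0.5 * log_det_K - 0.5 * M * N * np.log(2 * np.pi)
```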

2.6 Efficient inference with large numbers of parameters

As multiple parameters are typically fit for each spectroscopic light curve, parameter inference can become computationally expensive when simultaneously fitting multiple light curves and a good choice of inference method may be required. Hamiltonian Monte Carlo (HMC) was chosen as it can scale significantly better when fitting large numbers of parameters compared to other more traditional MCMC methods such as Metropolis-Hastings or Affine-Invariant MCMC (Neal 2011). HMC can make use of the gradient of the log-likelihood to take longer trajectories through parameter space for each step of the MCMC compared to these other methods. Specifically, we use No U-Turn Sampling (NUTS) because it eliminates the need to hand-tune parameters while achieving a similar sampling efficiency to HMC (Hoffman & Gelman 2014).

To provide the gradients of the log-likelihood calculation for NUTS, the above algorithms were implemented in the Python package JAX (Bradbury et al. 2018). JAX has the capability of providing the values as well as the gradients of functions, often with minimal additional runtime cost. It has an implementation of NumPy that allows NumPy code to be converted into JAX code with limited alterations. It also allows the same code to be compiled to run on either a CPU or GPU – where GPUs may provide significant computational advantages. Novel analytic expressions (that were used in combination with JAX) for the gradients and Hessian of the log-likelihood were developed to aid numerical stability – as discussed in Appendices H and I – with numerical stability further discussed in Appendix J.
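As a minimal illustration of how JAX supplies the gradients required by NUTS (using a toy white-noise likelihood purely for brevity; the 2D GP log-likelihood of Sec. 2.5 takes its place in practice):

```python
import jax
import jax.numpy as jnp

def log_likelihood(params, y):
    """Toy log-likelihood with purely white noise, used only to illustrate JAX autodiff."""
    r = y - params["mean"]                  # placeholder mean function
    var = params["sigma"] ** 2
    return -0.5 * jnp.sum(r**2 / var + jnp.log(2 * jnp.pi * var))

# value_and_grad returns both the log-likelihood and its gradient with respect to
# every entry of params, which gradient-based samplers such as NUTS require;
# jit compiles the whole calculation for either CPU or GPU
loglike_and_grad = jax.jit(jax.value_and_grad(log_likelihood))

params = {"mean": 1.0, "sigma": 0.01}
y = 1.0 + 0.01 * jax.random.normal(jax.random.PRNGKey(0), (100,))
value, grads = loglike_and_grad(params, y)
```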

3 Testing on simulated data

As we do not know a priori if wavelength-correlated systematics are present in a real dataset, this could lead to challenging model selection problems deciding whether to fit the datasets with wavelength-independent 1D GPs or a 2D GP that can account for wavelength-correlated systematics. We will demonstrate using simulated data that 2D GPs can accurately account for both wavelength-independent and wavelength-correlated systematics. This avoids the need for model selection – which could become computationally intractable for large numbers of parameters and is also known to place a heavy reliance on the choice of priors4 (Llorente et al. 2023). For example, the 2D GP method can fit for a wavelength length scale lλ – with the log-likelihood varying significantly with lλ – while the 1D GP method does not. This would result in the favoured model (e.g. measured using the Bayesian evidence) being strongly dependent on the arbitrary choice of prior on lλ. This dependence on the choice of priors can be significantly reduced by avoiding model selection and instead marginalising over both wavelength-independent 1D GPs and wavelength-correlated 2D GPs. This can be performed simply by fitting with a 2D GP using a kernel such as from Eq. (2) and choosing a prior where lλ can go to small values where the 2D GP becomes equivalent to joint-fitting wavelength-independent 1D GPs. While the choice of prior on lλ will still affect how strongly the 1D and 2D GP methods are weighted, both methods are still marginalised over, reducing the importance of the choice of priors.

We aim to validate this approach by testing both methods across different sets of simulated data containing either wavelength-independent or wavelength-correlated systematics. We will show that 2D GPs can have an added benefit of sharing hyperparameters between light curves to better constrain systematics and improve accuracy. We also use these datasets to characterise how wavelength-correlated systematics may contaminate transmission spectra when fitting with 1D GPs.

Finally, we study how we can account for common-mode systematics – systematics that are constant in wavelength – using 2D GPs. Typically, these are accounted for using a separate common-mode correction step before fitting for the transmission spectrum (described in Sec. 3.9); we show that these systematics can be accounted for using a 2D GP while simultaneously retrieving the transmission spectrum.

To summarise, we will demonstrate four major points:

  • (i)

    Sharing hyperparameters between light curves can improve the reliability of results compared to individual 1D GP fits.

  • (ii)

    2D GPs can accurately account for wavelength-independent systematics by fitting for a wavelength length scale parameter.

  • (iii)

    When systematics are correlated in wavelength, 2D GPs can accurately retrieve transmission spectra while 1D GPs may retrieve erroneous results.

  • (iv)

    Common-mode systematics can be accounted for using 2D GPs without requiring a separate common-mode correction step.

Sections 3.1–3.5 will describe how the synthetic data were generated and analysed, with Sects. 3.6–3.9 explaining how the results demonstrate points (i), (ii), (iii) and (iv).

3.1 Light curve model

In this work, our light curve model is calculated using the equations of Mandel & Agol (2002) using quadratic limb-darkening parameters c1 and c2. The other parameters describing the light curve are the central transit time (T0), period (P), system scale (a/R*) and impact parameter (b) as well as the planet-to-star radius ratio (ρ = Rp/R*) we aim to measure. We chose a standard approach of using the transit depth (ρ2 = (Rp/R*)2) to fit each light curve (e.g. Rustamkulov et al. 2023; Alderson et al. 2023). We can also fit a linear baseline function to each light curve, which introduces the flux out-of-transit parameter (Foot) and a fit to the slope of the baseline flux (Tgrad).

Our mean function parameters therefore include five parameters that may be fit for each light curve independently (ρ2, c1, c2, Foot, Tgrad) and four parameters that are shared across all light curves (T0, P, a/R*, b). For the simulations, the transit depths ρ2 were the only mean function parameters being fit for to reduce runtimes for the large number of simulations (the other parameters were kept fixed to their injected values).

We note that the 1D GP method generated light curves using BATMAN (Kreidberg 2015) to fit the data while the 2D GP method used JAXOPLANET (Foreman-Mackey & Garcia 2023). Any differences in the light curves generated between the packages were orders of magnitude below the level of white noise being added. JAXOPLANET is a Python package that can generate transit light curves similar to BATMAN but is implemented in JAX. This allows for the calculation of gradients of the log-likelihood – as required to implement No U-Turn Sampling (NUTS).

3.2 Generating the exoplanet signal

Each simulated dataset contained 100 time points and 16 wavelength channels. The parameter ranges used for generating the exoplanet signal for all sets of simulations are included in Table 1. They are similar to the ranges chosen in Gibson (2014) and are designed to represent typical hot Jupiter transits.

The time range for these observations is [-0.1, 0.1] days sampled uniformly in time. For 100 time points this gives a cadence of approximately 3 min. This cadence is longer than many real observations – such as from VLT/FORS2 or JWST observations – but is still well-sampled. This was required to reduce runtimes and allow for the analysis of a large number of simulated datasets. The parameter ranges chosen resulted in the proportion of time spent in transit varying within the range [26%, 65%].

The wavelength range considered was [3850 Å, 8650 Å] which for 16 evenly separated wavelength bins gives a bin width of 300 Å. This wide wavelength range ensures light curves with a range of limb darkening parameters were included. Limb darkening parameters were calculated by first uniformly sampling a confirmed transiting exoplanet listed on the NASA Exoplanet Archive5 and selecting the listed host star parameters. These parameters were used to compute the limb darkening parameters using PyLDTk, allowing for a realistic range of stellar limb-darkening profiles (Parviainen & Aigrain 2015).

The synthetic transmission spectra were generated using a simple model with the aim of testing the retrieval of typical sharp and broad features within transmission spectra. The planet-to-star radius ratio was randomly drawn as described in Table 2. To study transmission spectra that may have a scattering slope or flat ‘grey’ cloud deck, there was a 50% chance of a Rayleigh-scattering slope being included and a 50% chance the spectrum was flat. The Rayleigh scattering was implemented using Eq. (28), which follows Lecavelier Des Etangs et al. (2008):

$$m = \frac{\mathrm{d}(R_p/R_*)}{\mathrm{d}\ln\lambda} = \alpha\frac{H}{R_*} \tag{28}$$

where m is the slope of the transmission spectrum, H is the scale height and α describes the strength of scattering (α = −4 for Rayleigh scattering).

For simplicity, m = 0 was used for a flat ‘cloudy’ model and m = −0.005 for Rayleigh scattering. For context, WASP-31b has a reported scale height of 1220 km and stellar radius of R* = 1.12 R⊙ (Sing et al. 2015; Anderson et al. 2011), which results in a predicted slope of m = −0.0063 for Rayleigh scattering.

In order to compare how the different methods may constrain a sharp spectral feature such as a K feature (a sharp potassium absorption doublet at ~7681 Å), each dataset had a 50% probability of adding a change in radius ratio of Δρκ = 0.005 to the single wavelength bin which covers the wavelength range 7450 Å to 7750 Å. Overall this resulted in an even 25% probability of the spectrum being flat and featureless, hazy and featureless, flat with a K feature, or hazy with a K feature (see Fig. 1).
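The following sketch (with illustrative names; the reference wavelength for the slope is an arbitrary choice here) draws one of these four equally likely injected spectra:

```python
import numpy as np

def simulate_transmission_spectrum(wavelengths, rho_0, rng):
    """Draw one of the four equally likely injected spectra described above:
    flat or Rayleigh slope (m = -0.005), with or without a K feature (+0.005
    in radius ratio in the bin covering 7450-7750 A). Returns transit depths."""
    m = -0.005 if rng.random() < 0.5 else 0.0        # slope d(Rp/R*)/d(ln lambda)
    # Reference wavelength for the slope chosen arbitrarily as the mean bin centre
    rho = rho_0 + m * np.log(wavelengths / wavelengths.mean())
    if rng.random() < 0.5:                           # add a sharp K feature
        k_bin = (wavelengths > 7450) & (wavelengths < 7750)
        rho = np.where(k_bin, rho + 0.005, rho)
    return rho**2                                    # transit depths rho^2

rng = np.random.default_rng(1)
bin_centres = np.linspace(4000, 8500, 16)            # 16 bins of 300 A spanning 3850-8650 A
depths = simulate_transmission_spectrum(bin_centres, rho_0=0.1, rng=rng)
```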

Table 1

Mean function parameter ranges used for all simulations.

Table 2

Transmission spectrum parameter ranges for all simulations.

Fig. 1

Four equally likely types of transmission spectra generated for a radius ratio of ρ = 0.1, showing the spectrum in transit depths.

Table 3

Parameter ranges used to generate systematics for Simulations 1–4.

3.3 Generating systematics

Five hundred synthetic transmission spectroscopy datasets (each containing 16 light curves) were generated for each of the four sets of simulations. All sets of simulations used the same transit signal model as described in Sec. 3.2. Parameter ranges used to generate systematics for Simulations 1–4 are included in Table 3. The generated systematics were multiplied by each of the generated transit models. We chose to multiply by the transit signal because many sources of systematics would be expected to be proportional to flux (i.e. from varying instrumental or atmospheric throughput).

For Simulation 1, noise was generated using Eq. (1) to build a covariance matrix and by taking random draws from it independently for each wavelength. These datasets therefore contained wavelength-independent systematics (equivalently this is the limit of Eq. (2) when lλ → 0). For Simulations 2 and 3, noise was generated using the kernel in Eq. (2) and both represent scenarios where wavelength-correlated systematics are present but with different ranges of wavelength length scales. Simulation 4 represents contamination from common-mode systematics – which are often encountered in real observations and are relevant to the VLT/FORS2 analysis in Sec. 4. For each simulated dataset, a single draw of correlated noise was generated using the kernel in Eq. (1) (with negligible white noise σ = 10−6 for numerical stability). All wavelength channels used this same draw of correlated noise so that it was constant in wavelength (which is the limit of Eq. (2) approached as lλ → ∞). White noise was then added to the full dataset from the parameter range in Table 3.

For examples of noise-contaminated light curves generated, three synthetic light curves from a Simulation 1 dataset are plotted in the left plot of Fig. 2, while all light curves from a Simulation 3 dataset are shown in the left plot of Fig. 5. Simulation 2 datasets would appear similar to Simulation 3 but with the systematics changing shape more significantly between different light curves. Simulation 4 datasets would have the same shape systematics in all light curves (but with different white noise values).

3.4 Inference methods

Each of the fitting methods was initialised with the true values used in the simulations. A global optimiser was then used to locate the best-fit values for the parameters being fit. MCMCs were initialised by perturbing the parameters from this best-fit value and run until convergence was reached – as measured by the Gelman–Rubin statistic (Gelman & Rubin 1992) – or until a limit on the number of chains or samples was reached.

The 1D GP fits were performed using the GEEPEA (Gibson et al. 2012b) and INFERNO (Gibson et al. 2017) Python packages. GEEPEA was used for the implementation of the 1D GP with the Nelder-Mead simplex algorithm used to locate a best-fit value. Differential Evolution MCMC (DE-MC) was used to explore the posterior of each of the light curves using the implementation in INFERNO. Four parameters were fit for each MCMC: ρ2, h, lt and σ. One hundred independent chains were run in each MCMC with 200 burn-in steps and another 200 steps run for the chain. If any of the parameters had a GR statistic >1.01 then the chains were extended another 200 steps with this repeating up to a maximum chain length of 1000. This was sufficient to achieve formal convergence in over 99% of simulations performed.

In order to perform inference for the 2D GP fits, the log-likelihood calculation was implemented in JAX using the equations described in Sec. 2.5 and Appendix H. The Python package PYMC was used for its implementation of Limited-Memory BFGS for best-fits and NUTS for inference (Salvatier et al. 2016). Limited-Memory BFGS is a gradient-based optimisation method that can be used for models with large numbers (potentially thousands) of parameters (Liu & Nocedal 1989).

MCMC inference for the 2D GP method initially used two chains with 1000 burn-in steps each followed by another 1000 steps used for inference. Convergence was checked and if all parameters had not yet converged then another two chains were run. This was repeated until a maximum of ten chains were run. This was also sufficient to achieve convergence in over 99% of simulations performed.

For many MCMC methods, an approximation of the covariance matrix of the parameters being fit can be used as a transformation to reduce correlations between parameters and improve sampling efficiency. In NUTS, the so-called mass matrix performs this function (Hoffman & Gelman 2014). A Laplace approximation was performed to find an approximate covariance matrix, which requires calculating the Hessian of the log-posterior at the location of best-fit (see Appendix I). This was found to be significantly more efficient than using the samples from the burn-in period to approximate the covariance matrix.

To overcome a few other challenges in achieving convergence (see Appendix D for details), blocked Gibbs sampling (Jensen et al. 1995) was used to update some groups of parameters together. Slice sampling (Neal 2003) was also used if the time or wavelength length scales were within certain ranges close to a prior limit, as this could sometimes sample these parameters more efficiently than NUTS. Priors used for both 1D and 2D GP methods are described in Appendix C and were unchanged between Simulations 1–4.

Fig. 2

Example of a synthetic dataset analysed in Simulation 1 (containing wavelength-independent systematics) showing some simulated light curves (left) and the recovered transmission spectrum (right). Left: the three shortest wavelength light curves in the dataset being fit by the transit model combined with a 2D GP noise model (the other 13 light curves were simultaneously fit but not plotted). Right: resulting transmission spectrum from the joint fit of all 16 light curves, with the three leftmost points corresponding to the light curves in the left plot. The recovered transmission spectrum was consistent with the injected spectrum. An atmospheric retrieval was performed on the recovered transmission spectrum (with the best-fit model plotted) and was consistent with the flat slope and strong K feature of the injected spectrum.

3.5 Accuracy metrics

Multiple metrics were used to determine how accurately different methods could recover the injected transmission spectra. An example analysis of a single synthetic dataset from Simulation 1 is shown in Fig. 2, demonstrating both the fitting of the light curves and how the resulting transmission spectrum and atmospheric retrieval compares to the injected spectrum. We will study how accurately each method can extract the transmission spectrum in addition to the accuracy of the retrieved atmospheric parameters.

To measure the accuracy of the transmission spectra, we assumed that the uncertainty in the recovered transit depths was Gaussian. This implies that if we are accurately retrieving their mean and covariance, the reduced chi-squared (χ²r) of the injected values should follow a reduced χ² distribution with 16 degrees of freedom. We compute the χ²r of the injected transmission spectrum (accounting for covariance between transit depths) using:

$$\chi^2_r = \frac{1}{M}\left(\boldsymbol{\rho}^2_{\mathrm{inj}} - \bar{\boldsymbol{\rho}}^2_{\mathrm{ret}}\right)^{\mathrm T}\,\mathbf{\Sigma}_{\mathrm{ret}}^{-1}\left(\boldsymbol{\rho}^2_{\mathrm{inj}} - \bar{\boldsymbol{\rho}}^2_{\mathrm{ret}}\right) \tag{29}$$

where ρ̄²ret and Σret represent the retrieved mean and covariance of the transmission spectrum, respectively (with ρ² representing the transit depths), ρ²inj represents the injected transmission spectrum, and M is the number of light curves (M = 16 for all the simulations in this paper). As the 1D GP method fits each light curve independently, the covariance between all the transit depths is always assumed to be zero, resulting in the covariance matrices from the 1D GP method being diagonal. The 2D GP method will instead recover the full covariance matrix of transit depths from the MCMC.

Since a single synthetic dataset is not sufficient to study the accuracy of each method, we chose to generate 500 synthetic datasets for each set of simulations to provide a sufficient sample to test each method. By comparing how the resulting 500 χ²r values trace out the theoretical distribution, we can determine if a method is accurately recovering the uncertainty in the transmission spectrum. For M = 16 degrees of freedom, the reduced χ² distribution has mean μ = 1 and variance σ² = 2/M = 0.125. The mean being significantly larger than one suggests our uncertainties are too small on average, while the mean being smaller than one suggests our uncertainties are too large on average. We should expect the sample variance of 500 samples from a reduced χ² distribution with 16 degrees of freedom to be distributed as 0.125 ± 0.009; if values diverge from this it may suggest the presence of outliers.

We also performed one-sample Kolmogorov–Smirnov (K–S) tests on the distribution of recovered χ²r values, which is a method for testing if a set of samples follows a given distribution – in this case a reduced χ² distribution. It works by comparing the empirical cumulative distribution function (ECDF) of the samples to the cumulative distribution function (CDF) of the distribution chosen, returning the maximum deviation D from the desired CDF. The K–S statistic D for 500 samples and for a reduced χ² distribution with 16 degrees of freedom should have D < 0.060 at a p-value of 0.05 and D < 0.072 at a p-value of 0.01.
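A minimal sketch of these two accuracy metrics (illustrative code, assuming the retrieved mean and covariance of the transit depths are already available):

```python
import numpy as np
from scipy import stats

def reduced_chi2(depths_injected, depths_retrieved, cov_retrieved):
    """Eq. (29): reduced chi-squared of the injected transit depths given the
    retrieved mean and covariance (the 1D GP covariance matrix is diagonal)."""
    d = depths_injected - depths_retrieved
    return d @ np.linalg.solve(cov_retrieved, d) / d.size

def ks_test_reduced_chi2(chi2_r_samples, M=16):
    """One-sample K-S test of the recovered reduced chi-squared values against a
    reduced chi^2 distribution with M degrees of freedom (chi^2(M) scaled by 1/M)."""
    return stats.kstest(chi2_r_samples, stats.chi2(df=M, scale=1.0 / M).cdf)
```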

In addition to examining the χ²r values, we can also examine how accurately we can retrieve atmospheric features. For our simulations, there are three features of the injected transmission spectra: the radius ratio ρ = Rp/R*, slope m = d(Rp/R*)/d(ln λ) and change in radius ratio in the wavelength bin centred on the K feature Δρκ. For each simulation, these three features were determined by performing a simple atmospheric retrieval on the recovered transmission spectrum using the same atmospheric model which was used to generate the injected spectra. We determined the mean and uncertainty for each of these features using an MCMC (also using the NUTS algorithm). We measured the number of standard deviations – or the Z-score – each prediction was away from the true injected value for all 500 synthetic datasets (used previously in e.g. Carter & Winn 2009; Gibson 2014; Ih & Kempton 2021). If we are robustly retrieving a given atmospheric parameter, we expect these 500 Z-scores to follow a Gaussian distribution with mean μ = 0 and variance σ² = 1.

If the mean Z-score is zero, it would demonstrate that the parameter is not biased towards larger or smaller values than the true values. The Z-score values having unit variance shows that the uncertainty estimates are accurate. If the variance is less than one then that suggests the error bars are too large on average which might result in a missed opportunity to detect an atmospheric feature. If the variance is greater than one then the error bars are too small on average, which could lead to false detections of signals.

If two methods are both shown to produce reliable uncertainties based on these metrics, then we can also study which method gives smaller uncertainties on the transmission spectrum. Some methods described can produce tighter constraints than others but at the cost of making stronger assumptions about the systematics in the data.

Tables of each accuracy metric for all methods tested are included in Appendix G.

3.6 Result (i): sharing hyperparameters can make constraints more reliable

Before fitting for wavelength-correlations, we first demonstrate that simply by joint-fitting spectroscopic light curves and sharing hyperparameters between light curves, we can more reliably constrain transmission spectra. This technique has already been demonstrated on ground-based observations of WASP-94Ab (Ahrer et al. 2022), but to examine reliability we use it on simulated datasets where the transmission spectrum is known.

For this section, we used the Simulation 1 data containing only wavelength-independent systematics. We first fit each synthetic dataset with individual 1D GPs that fit a separate transit depth ρ2, height scale h, time length scale lt and white noise amplitude σ to each individual light curve. The kernel in Eq. (1) was used for the GP (the same kernel used to generate the systematics). We compared this to joint-fitting all 16 spectroscopic light curves in each dataset, where we still fit for all 16 transit depths but used single shared values for h, lt and σ. This provides more information to constrain the values of the hyperparameters and should produce better constraints than using wavelength-varying values. We note that this can be considered to be using a 2D GP, as the kernel is a function of the two dimensions time and wavelength (the correlation in wavelength is given by the Kronecker delta δij). We therefore performed this joint-fit identically to the 2D GP method described in Sec. 3.4 and used the same kernel function but fixed the wavelength length scale lλ to be negligible6. Comparing Eq. (1) and Eq. (2), it can be seen that both kernel functions are mathematically equivalent in the limit as lλ → 0. We therefore refer to this method as the Hybrid method as it shares the same kernel function as the 1D GP method but joint-fits light curves and can benefit from shared hyperparameters like the 2D GP method.

The Hybrid method had either equal or better performance compared to individual 1D GPs across every accuracy metric measured. For example, the distribution of χ²r values for the 1D GPs had a mean of 1.35 ± 0.04, compared to 1.00 ± 0.02 for the Hybrid method. Similarly, the Z-scores of all three atmospheric parameters studied had sample variances that were constrained to be greater than one for the 1D GP fits but consistent with one for the joint-fits.

Our results suggest that 1D GPs underestimated the uncertainties of the transmission spectra on average, which may appear surprising given that the systematics in Simulation 1 were generated using 1D GPs with the exact same kernel function. The different results for the two methods appear to have been largely driven by a small number of outliers where individual light curves were fit to have much weaker systematics than was present. This did not happen with the joint-fits and since both methods have mathematically equivalent kernel functions, we interpret this result as the 1D GPs not having sufficient data to consistently fit the systematics. This could happen if an individual light curve happens to have systematics that have a similar shape to a transit dip, making it difficult to determine the amplitude of the systematics. The joint-fit can instead make use of more light curves to accurately constrain the systematics. We note that the 1D GP method had metrics consistent with ideal statistics for the 130 datasets with lt < 0.01 days, which is much shorter than the minimum transit duration of 0.052 days and therefore less likely to mimic a transit dip. We conclude that sharing hyperparameters can improve the reliability of data analyses by utilising more data to account for the systematics, verifying Result (i).

We note however that for real data we need to be careful with which parameters we assume are wavelength-independent. The amplitude of systematics and white noise could vary significantly across wavelength channels. For our VLT/FORS2 analysis in Sec. 4, we only assume that the length scales of correlated noise are the same across light curves, similar to Ahrer et al. (2022).

3.7 Result (ii): 2D GPs can account for wavelength-independent systematics

As stated at the beginning of Sect. 3, 2D GPs (with the kernel given in Eq. (2)) can be used to fit wavelength-independent systematics, meaning that there is no need to use model selection to determine whether to use 1D GPs or 2D GPs. We have already demonstrated that a 2D GP with a wavelength length scale fixed to negligible values (the Hybrid method) can accurately account for wavelength-independent systematics. For this test, we wanted to demonstrate that even if we fit for the wavelength length scale with a broad prior then we can retrieve similar results. This method has the benefit of avoiding the a priori assumption that systematics are wavelength-independent.

Our results showed no statistically significant difference in any accuracy metric measured between the Hybrid method and fitting for lλ with a 2D GP. Both methods showed ideal statistics, based on both the mean and variance of the resulting χ²r distributions and the values of the K–S statistics. The Z-scores of all retrieved atmospheric parameters were consistent with the unit normal distribution for both methods. However, the average recovered transmission spectrum did have 2.4% larger uncertainties, so there may have been a minor loss in precision from the more conservative approach of fitting for the wavelength length scale.

We see that if systematics are wavelength-independent, a 2D GP can constrain the wavelength length scale to be sufficiently small to provide robust retrievals. This also has limited cost to the precision of the transmission spectrum compared to assuming the systematics are wavelength-independent a priori, confirming Result (ii).

Fig. 3

χ²r histograms for all three methods with the theoretical distribution overplotted. Both the 1D GP and Hybrid methods have many outliers that increase the variance of the retrieved χ²r values. The χ²r values of the Hybrid method have the correct mean but with greater variance than the theoretical distribution, while the 2D GP traces the theoretical distribution very closely.

3.8 Result (iii): 2D GPs can account for wavelength-correlated systematics

To study how accurately 1D GPs and 2D GPs can account for systematics correlated in both time and wavelength, we analysed Simulation 2 datasets that cover systematics correlated over wavelength length scales larger than one wavelength bin but smaller than half the wavelength range of the datasets. We fit all 500 datasets with 1D GPs and 2D GPs. However, in addition to having a different kernel to the 1D GP method, the 2D GP could also benefit from sharing hyperparameters (Result i). To isolate the effect of each of these differences, we also fit the Hybrid method to each dataset, which shares hyperparameters between light curves but uses a mathematically equivalent kernel to the 1D GP method.

We found that only the 2D GP method had correctly distributed χ²r values and Z-scores of atmospheric parameters with unit variance. The χ²r distributions of each method are shown in Fig. 3. The Hybrid method produces a distribution of χ²r values with the correct mean but with greatly increased variance, while the 1D GP method performs similarly but with outliers producing a higher mean value (similar to Sec. 3.6). We interpret the result of the Hybrid method as demonstrating that the systematics in each individual light curve are being accurately described by the kernel function of the Hybrid method (resulting in the correct mean χ²r value) but the correlation between light curves is not being accounted for (increasing the variance of the χ²r values). To understand this, consider fitting two light curves where it is incorrectly assumed that the systematics are independent, but the systematics are actually identical. If the systematics in one light curve are recovered with the correct systematics model but happen to result in a 2σ error in the measured transit depth, then the same measurement will occur in the other light curve. Both light curves analysed together would appear to result in two independent 2σ errors, resulting in a high χ²r value. However, when we account for the fact that the two light curves are perfectly correlated, this is really only a single 2σ error with a smaller χ²r value. Similarly, when the error in transit depth is very small in one light curve (i.e. 0.1σ), it will be small in both light curves, resulting in a χ²r value that is too low. The effect of this is that the χ²r distribution of two light curves can produce the correct mean but with increased variance when correlations between the light curves are not accounted for.

The 1D GP and Hybrid methods were found to poorly retrieve atmospheric parameters: on average the radius ratio and slope uncertainties were underestimated and the uncertainty in the strength of a K feature was overestimated. This is shown in Fig. 4, where it can be seen that the Z-scores have variance that is either too high or too low for the different features using these two methods. In contrast, the Z-scores for the 2D GP method are consistent with unit normal distributions.

To understand our interpretation of why the 1D GP kernel underestimates errors in broad features (such as the slope and radius ratio), but can overestimate errors for sharp features (such as the K feature), note that signals in data are most degenerate with systematics when both have similar length scales. For example, wavelength-independent systematics should be more degenerate with a sharp K feature compared to broader features. A slope in the transmission spectrum should be more degenerate with systematics which vary gradually in wavelength (i.e. with a long wavelength length scale). Both the 1D GP and Hybrid methods assume that the systematics are wavelength-independent, which could produce uncertainties that are too large when fitting the sharp K feature but too small for broader features. The 2D GP can constrain the systematics to have a longer wavelength length scale, which may increase the uncertainty in broader features while reducing uncertainty in sharp features. This result matches related work examining how different length scale correlations in exoplanet transmission or emission spectra affect atmospheric retrievals (Ih & Kempton 2021; Nasedkin et al. 2023) but it had yet to be demonstrated how these correlations can arise from wavelength-correlated systematics in the spectroscopic light curves.

3.9 Result (iv): 2D GPs can account for common-mode systematics

We first consider how to fit the Simulation 4 data containing common-mode systematics (i.e. systematics that are constant in wavelength) before analysing Simulation 3 containing long wavelength length scale systematics. Common-mode systematics are often encountered in transmission spectroscopy datasets, particularly from ground-based observations such as in the VLT/FORS2 observations analysed in Sec. 4. These systematics are often dealt with by performing a ‘common-mode correction’. First, a much broader wavelength range relative to the spectroscopic light curves is chosen to extract a ‘white’ light curve. This was replicated for the simulations by averaging all spectroscopic light curves together. The systematics in this white light curve are then fit with a 1D GP similar to the individual spectroscopic light curve fits. The GP predictive mean is used to model the systematics in this white light curve, which is assumed to describe the common-mode systematic affecting all spectroscopic light curves. Each of the spectroscopic light curves is then divided by this recovered common-mode systematic, resulting in ‘common-mode corrected’ spectroscopic light curves. Each of these corrected light curves is then fit separately to recover the transmission spectrum using the 1D GP approach already outlined, accounting for any wavelength-independent systematics in the light curves.
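A schematic outline of this common-mode correction (illustrative only; the actual fits in this work used the 1D GP packages described in Sec. 3.4):

```python
import numpy as np

def common_mode_correct(flux, common_mode_systematics):
    """Divide a common-mode systematics model out of every spectroscopic light curve.

    flux: (M, N) array of spectroscopic light curves (M wavelength channels, N times).
    common_mode_systematics: length-N model of the systematics in the white light
    curve flux.mean(axis=0), e.g. the GP predictive mean from a separate
    transit + 1D GP fit (that fit is not shown here).
    """
    corrected = flux / common_mode_systematics[None, :]
    return corrected   # each corrected light curve is then fit individually with a 1D GP
```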

There are multiple weaknesses of performing this separate common-mode correction. It is known that it can produce offsets in the resulting transmission spectrum, as the particular fit of the common-mode systematic chosen to correct the spectroscopic light curves can shift all transit depths equally. Only a single fit of the common-mode systematic is chosen to correct the light curves, so the transmission spectrum is conditioned on this particular fit instead of marginalising over all possible common-mode corrections. This likely has the effect of underestimating the uncertainty in the average radius ratio because different common-mode corrections offset the transmission spectrum differently. Finally, the assumption that a real dataset contains systematics that are constant in wavelength may be incorrect, as examined in Sec. 4. 2D GPs can avoid using a common-mode correction because they can account for common-mode systematics by using the kernel in Eq. (2) with lλ set to be effectively infinite (in practice to very large values to maintain numerical stability).

We tested three different approaches on each Simulation 4 dataset. First, all spectroscopic light curves were averaged together to perform a common-mode correction on each dataset, followed by fitting the corrected light curves with 1D GPs as performed for the other simulations. We also fit the data using a 2D GP with the wavelength length scale fixed to a very large value (50 times the wavelength range of the data) which effectively assumes a priori that the systematics present do not vary in wavelength. This is similar to the common-mode correction method but does not take into account remaining wavelength-independent systematics that were not simulated and also takes advantage of shared hyperparameters between light curves. Finally, we fit a 2D GP where the wavelength length scale was fit to the data (with a log-uniform prior from 1/4 of a wavelength bin to 50 times the total wavelength range of the dataset). This can be viewed as a more conservative approach that uses the data to constrain whether or not the systematics are constant in wavelength. Both 2D GPs used the kernel in Eq. (2).
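The two 2D GP configurations differ only in how the wavelength length scale is treated; the short sketch below shows the corresponding settings under assumed variable names (the numerical values are illustrative):

```python
import numpy as np

# Wavelength grid of a simulated dataset (illustrative values)
wavelengths = np.linspace(4000.0, 7000.0, 16)   # Angstroms
bin_width = np.diff(wavelengths).mean()
wl_range = wavelengths.max() - wavelengths.min()

# (1) Wavelength length scale fixed to a very large value, effectively
#     assuming a priori that the systematics do not vary in wavelength
l_wl_fixed = 50.0 * wl_range

# (2) Wavelength length scale fit to the data with a log-uniform prior
#     from 1/4 of a wavelength bin to 50 times the total wavelength range
log_l_wl_bounds = (np.log(0.25 * bin_width), np.log(50.0 * wl_range))
```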

We found as expected that the common-mode 1D GP method underestimated the uncertainty in the radius ratio. In contrast, both 2D GP methods were able to retrieve the radius ratio accurately with unit normal Z-score distributions, as they are not conditioned on a single fit of the common-mode systematics. All methods performed similarly at retrieving the strength of a K feature accurately.

Interestingly, the 2D GP fitting for the wavelength length scale overestimated the uncertainty in the slope of the transmission spectrum (Z-score variance of 0.67 ± 0.04), while both methods which assumed constant wavelength systematics retrieved the slope with ideal Z-score statistics and with ≈ 15% smaller average standard deviations. Fitting for the wavelength length scale appears to have reduced precision in the constraint of the scattering slope. Since systematics varying in wavelength could produce an apparent slope in the transmission spectrum, it makes sense that removing the assumption that the systematics are constant in wavelength has increased the uncertainty in our constraints.

While this loss in precision may not be desirable, we must also consider the risk of incorrectly assuming systematics are constant in wavelength. For Simulation 3, we generated systematics that gradually varied in wavelength using long wavelength length scales (>1/2 the wavelength range of the dataset but <8 times the wavelength range of the dataset). We then analysed these datasets with the same three methods as for the common-mode systematics. In this case, only the 2D GP method that fit for the wavelength length scale produced robust retrievals of the injected signals across all accuracy metrics, although the constraints on the slope may have been slightly too small, with a Z-score sample variance of s²(Zm) = 1.18 ± 0.07 (the other two methods both had s²(Zm) > 7). We note that when fitting for lλ, the slope constraints for Simulations 3 and 4 may be affected by the choice of prior bounds, as lλ was often consistent with its maximum prior bound for both sets of simulations.
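The Z-score statistics quoted here can be computed directly from the injected and retrieved feature values; a minimal sketch of this check (with array names assumed purely for illustration) is:

```python
import numpy as np

def z_score_stats(injected, retrieved_mean, retrieved_std):
    """Z-scores of retrieved feature values against the injected values.

    Well-calibrated uncertainties give a sample mean near 0 and a sample
    variance near 1; a variance above 1 indicates underestimated
    uncertainties and a variance below 1 indicates overestimated ones.
    """
    z = (retrieved_mean - injected) / retrieved_std
    return z.mean(), z.var(ddof=1)
```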

An example of one of the synthetic datasets from Simulation 3 is shown in the left plot of Fig. 5, along with the corresponding common-mode corrected light curves (right plot). We note that the systematics may appear to be constant in wavelength, despite being generated with a wavelength length scale of 4170 Å. The resulting transmission spectra for the 1D GPs with a common-mode correction compared to the 2D GP method fitting for lλ are included in Fig. 6. This was chosen as a particularly extreme example where the common-mode 1D GP method incorrectly detected a non-zero slope at 5.8σ, compared to 2.1σ for the 2D GP fitting for lλ. Out of the datasets that were simulated to have a flat spectrum, the 1D GP method with a common-mode correction retrieved a slope >3σ away from zero in 26.7% of them. In comparison, the 2D GP method produced such >3σ outliers in only 0.4% of these datasets (consistent with the expected 0.3% for a normally distributed variable). This clearly demonstrates the potential for false detections of scattering slopes if systematics are incorrectly assumed to be constant in wavelength.

Overall, the 2D GP fitting for the wavelength length scale performed well on Simulations 3 and 4 compared to methods which assumed that the systematics were common-mode. While there was some loss in precision on the scattering slope for common-mode systematics, this was likely due to a more conservative approach which performed significantly better on long wavelength length scale systematics.

Fig. 4

Z-scores of retrieved atmospheric features from Simulation 2, which had short wavelength length scale correlated noise. The three methods tested were fitting the light curves with 1D GPs (top row), the Hybrid method (middle row) and a 2D GP (bottom row). All methods have mean Z-scores consistent with zero, that is, none of them biases the results towards measuring larger or smaller values of the features. The retrieved Z-score variances for the 2D GP method are all consistent with one (matching the Gaussian distribution plotted in red), whereas those for the other methods are not, indicating overestimation or underestimation of uncertainties.

Fig. 5

Raw light curves (left) and common-mode corrected light curves (right) for one simulated dataset in Simulation 3 (containing wavelength-correlated systematics). The injected transit signal is plotted as a black dotted line. The variation of the systematics in wavelength is not obvious by eye but can be noticed by comparing the regions highlighted in red boxes. Despite the systematics varying in wavelength, the common-mode correction appears to remove visual signs of systematics.

Fig. 6

Example of retrieved transmission spectra with the 1D GP method (left) analysing common-mode corrected light curves (right plot in Fig. 5) compared with the 2D GP method (right) analysing uncorrected light curves (left plot in Fig. 5). The error bars in blue only convey the mean and standard deviation of each transit depth measurement, so random draws are taken from the covariance matrix to convey potential correlations between transit depths, helping to visualise the increased uncertainty in the offset and scattering slope of the spectrum from the 2D GP method. Only the 1D GP method erroneously detects a negative scattering slope. While it is not visually apparent, the 2D GP method gives a stronger constraint on the detection of potassium (13.9σ compared to 8.7σ).

3.10 Conclusions from simulations

2D GPs allow for hyperparameters to be shared between light curves, which we showed in Simulation 1 has the potential to improve the reliability of retrievals. 2D GPs with a broad wavelength length scale prior are able to accurately account for systematics across a wide range of wavelength length scales – including when the systematics are wavelength-independent (Simulation 1), varying in wavelength (Simulations 2 and 3), or constant in wavelength (Simulation 4).

When systematics are correlated in wavelength with a short wavelength length scale (Simulation 2), analysing these data with 1D GPs or the Hybrid method resulted in constraints on broad features (i.e. the radius ratio and scattering slope) that were too small and constraints on sharp features (e.g. a K feature) that were too large.

The effect of incorrectly assuming wavelength-correlated systematics are common-mode was examined in Simulation 3. This was also found to significantly underestimate the uncertainty in the scattering slope, which may help explain previous conflicting scattering slope measurements in the literature. We note that stellar activity (e.g. from unocculted starspots) has previously been used to explain conflicting scattering slopes between different transits (e.g. Rackham et al. 2017; May et al. 2020). Since unaccounted-for wavelength-correlated systematics also have the potential to affect the measured transmission spectrum slope, it is possible that stellar properties such as starspot covering fractions deduced from the slope of the transmission spectrum may be inaccurate.

We conclude that our 2D GP method is a much safer approach when fitting data that may contain wavelength-correlated systematics, and it was the only model tested that could accurately account for systematics across all four sets of simulations, although common-mode systematics may cause the uncertainty in the scattering slope to be slightly overestimated. As our method does not require a separate common-mode correction, the uncertainty in the radius ratio can be fully accounted for in the transmission spectrum.

4 Re-analysis of VLT/FORS2 observations

To test our method on real data, we performed a re-analysis of the VLT/FORS2 transit observations of WASP-31b first analysed in Gibson et al. (2017). We aimed to identify whether wavelength-correlated systematics are present in these observations and how the constraints on atmospheric parameters changed after accounting for these systematics. Based on the results in Sec. 3, if the 2D GP analyses show strong evidence for wavelength-varying systematics (from the recovered wavelength length scale), then we should be skeptical of the original analysis as it did not account for wavelength-varying systematics – making its constraints unreliable.

WASP-31b is a low-density hot Jupiter. It was initially reported in Anderson et al. (2011) with a mass of 0.478 ± 0.029 MJ and a radius of 1.549 ± 0.050 RJ. It has a 3.4 day orbit around a late F-type dwarf (V = 11.7).

The data consist of two transits of WASP-31b, the first using the GRIS600B grism taken on the night of 2016 February 15 and the second using the GRIS600RI grism with the GG435 order blocking filter taken on 2016 March 3. These are referred to as the 600B and 600RI datasets, with the extracted light curves covering the wavelength ranges 3868–6238 Å and 5206–8476 Å respectively. See Gibson et al. (2017) for more details about the observations and subsequent data reduction. The same wavelength calibration and spectral extraction were used as in the previous analysis and the same 150 Å wide wavelength bins were used for extracting the spectroscopic light curves.

As noted in Sec. 1, these data conflict with previous observations from HST/STIS, which identified a strong K feature at 4.2σ significance (Sing et al. 2015) that was not replicated in the original VLT analysis in Gibson et al. (2017). This conflict was the initial reason for choosing to re-analyse these observations, in order to study whether accounting for wavelength-correlated noise might resolve the discrepancy. We leave the re-analysis of the HST/STIS light curves for future work.

4.1 Previous analysis with 1D GPs

The approach to fit the light curves used in Gibson et al. (2017) was as follows: After basic data reduction steps were performed (such as background subtraction, aperture extraction and wavelength calibration), the flux from the target star was divided by flux from a comparison star to account for changes in flux due to the Earth’s atmosphere.

A common-mode correction was then performed – as described in Sec. 3.9. However, in addition to dividing the spectroscopic light curves through by the common-mode systematics, the residuals of the white light curve (found by subtracting the GP predictive mean from the white light curve) were subtracted from each spectroscopic light curve. This was to account for high-frequency systematics – which are similar to common-mode systematics but occur on a faster timescale, making them indistinguishable from white noise in the white light curve fit. The presence of high-frequency systematics can make the residuals of each spectroscopic light curve correlated – which this process aims to remove. Because the white light curve residuals are subtracted from all spectroscopic light curves, this procedure assumes that these systematics are constant in wavelength. After these corrections were performed, each spectroscopic light curve was fit separately using a transit model combined with a 1D GP – accounting for remaining systematics that are correlated in time but assumed to be wavelength-independent. The amplitude of white noise was also fit separately for each wavelength.

The exact cause of each of these systematics is unclear. In Gibson et al. (2017), the authors speculate that the common-mode systematics are most likely due to instrumentation (e.g. inhomogeneities in components such as the grism or derotator) while the high-frequency systematics could be due to varying atmospheric throughput, potentially from uneven cloud cover. Stellar activity can in principle also cause variations in flux during transit (e.g. from starspot crossings) or at any time (e.g. due to varying starspot coverage or stellar oscillations). However, photometric monitoring described in Sing et al. (2015) suggests that WASP-31 is a quiet star with little photometric variability, so we consider it unlikely that stellar activity is a significant contributor to systematics in these data. Remaining wavelength-independent systematics are often found to be particularly strong near telluric features such as the ~7550–7750 Å O2 A band (see Fig. 7 in Sedaghati et al. 2016 for an example), which suggests these systematics could be due to atmospheric variability, or perhaps that the sharp changes in flux increase instrumental systematics.

4.2 Choosing the kernel function

Examining the systematics accounted for in the previous analysis suggests that the kernel of our 2D GP should be able to account for (a) ‘common-mode’ (CM) systematics, (b) high-frequency systematics (HFS), (c) any remaining time-correlated systematics in the spectroscopic light curves (which we call wavelength-specific systematics or WSS) and (d) the changing amplitude of white noise across different wavelength channels. It should also match the form of Eq. (12) to make this analysis computationally tractable. The kernel chosen for the 2D GP analysis is given in Eq. (30) and our choice of priors for all parameters is outlined in Appendix C:

K_{(i,j),(k,l)} = \left[ h_{\mathrm{CM}}^2 \exp\left(-\frac{(\lambda_i - \lambda_k)^2}{2 l_{\lambda(\mathrm{CM})}^2}\right) + h_{\mathrm{WSS},i}^2 \delta_{ik} \right] \exp\left(-\frac{(t_j - t_l)^2}{2 l_t^2}\right) + \left[ h_{\mathrm{HFS}}^2 \exp\left(-\frac{(\lambda_i - \lambda_k)^2}{2 l_{\lambda(\mathrm{HFS})}^2}\right) + \sigma_i^2 \delta_{ik} \right] \delta_{jl} .   (30)

This equation can be multiplied out to reveal the four individual terms used to account for (a)–(d):

h_{\mathrm{CM}}^2 \exp\left(-\frac{(\lambda_i - \lambda_k)^2}{2 l_{\lambda(\mathrm{CM})}^2}\right) \exp\left(-\frac{(t_j - t_l)^2}{2 l_t^2}\right)   (30a)

h_{\mathrm{HFS}}^2 \exp\left(-\frac{(\lambda_i - \lambda_k)^2}{2 l_{\lambda(\mathrm{HFS})}^2}\right) \delta_{jl}   (30b)

h_{\mathrm{WSS},i}^2 \delta_{ik} \exp\left(-\frac{(t_j - t_l)^2}{2 l_t^2}\right)   (30c)

\sigma_i^2 \delta_{ik} \delta_{jl}   (30d)

Equation (30a) was included to account for (a), but instead of assuming a priori that the systematics are common-mode as in the previous analysis, we instead fit for a wavelength length scale with a broad prior range that can account for wavelength-varying or common-mode systematics. The right plot in Fig. 7 appears to show the presence of systematics that vary quite gradually in wavelength across the 600RI dataset, similar to the simulated light curves in Fig. 5 containing wavelength-varying systematics. Correlation in time was fit with the time length scale lt and we assumed the height scale of these systematics did not vary over the dataset and therefore fit for a single parameter hCM.

Equation (30b) accounts for (b) high-frequency systematics. We fit for a wavelength length scale lλ(HFS) and a single height scale hHFS for these systematics. The difference in our kernel choice between (a) and (b) is that we are assuming the high-frequency systematics are independent in time (as was also assumed in the original analysis). The Kronecker delta term δjl represents this independence in time in our kernel function.

Equation (30c) accounts for remaining systematics that are correlated in time (with time length scale lt) but wavelength-independent (represented by the Kronecker delta term δik). We tested replacing this Kronecker delta term with a squared-exponential kernel in wavelength to avoid assuming these systematics are wavelength-independent, but it had little effect on the results (similar to the analyses in Sec. 3.7). Wavelength-independent systematics appear as light curves having systematics of a different shape to neighbouring light curves – such as the region highlighted in the left plot of Fig. 7. The height scales of these systematics are fit independently for each wavelength (given by hWSS,i) because some wavelength channels may cover telluric features that can have particularly large amplitude systematics. The time length scale lt is shared with Eq. (30a) in order to fit the form of Eq. (12). This assumes that any common-mode or wavelength-varying systematics accounted for by Eq. (30a) have the same timescale as any wavelength-independent systematics, which is not necessarily true but is an unavoidable assumption for this method. However, as shown in Sec. 3.6, sharing hyperparameters between light curves can provide more robust constraints when it is a safe assumption that the parameters are constant.

The terms in Eq. (30d) fit for the amplitude of white noise. This is done separately for each wavelength channel as the white noise amplitude can be strongly wavelength-dependent for real observations. There is a Kronecker delta term that accounts for the noise being independent in wavelength and a second Kronecker delta term to account for the noise being independent in time.

By calculating the GP mean at the location of best-fit, but with different height scales hCM, hHFS or hWSS set to zero, it is possible to roughly visualise each type of systematics in these datasets. The common-mode systematics were visualised by setting both hHFS and hWSS to zero when calculating the GP mean, while the high-frequency and wavelength-specific systematics were visualised by examining the change in the GP mean when just hHFS is set to zero or just hWSS is set to zero respectively7. This is shown for the 600B data in Fig. 8. The autocorrelation of the residuals when each of these terms is set to zero is examined in Appendix F.

Equation (30) can also be written in the Kronecker product form of Eq. (11) by choosing:

[K_\lambda]_{ik} = h_{\mathrm{CM}}^2 \exp\left(-\frac{(\lambda_i - \lambda_k)^2}{2 l_{\lambda(\mathrm{CM})}^2}\right) + h_{\mathrm{WSS},i}^2 \delta_{ik}   (31)

[K_t]_{jl} = \exp\left(-\frac{(t_j - t_l)^2}{2 l_t^2}\right)   (32)

and similarly for ∑λ and ∑t (in this case ∑t is the identity matrix).
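As an illustration of this structure, the sketch below builds the wavelength and time covariance matrices implied by Eqs. (30)-(32) using numpy; the parameter names are illustrative and the actual interface of the LUAS package may differ:

```python
import numpy as np

def sq_exp(x, h, l):
    """Squared-exponential kernel matrix for 1D inputs x."""
    dx = x[:, None] - x[None, :]
    return h**2 * np.exp(-0.5 * (dx / l)**2)

def covariance_blocks(wl, t, h_cm, l_wl_cm, l_t, h_wss, h_hfs, l_wl_hfs, sigma):
    """Return (K_wl, K_t, S_wl, S_t) such that the full (MN x MN) covariance
    is K_wl kron K_t + S_wl kron S_t, mirroring the structure of Eq. (30).

    h_wss and sigma are length-M arrays (one value per wavelength channel);
    the remaining hyperparameters are scalars shared across light curves.
    """
    # Common-mode / gradually wavelength-varying systematics plus the
    # wavelength-specific height scales on the diagonal (Eq. (31))
    K_wl = sq_exp(wl, h_cm, l_wl_cm) + np.diag(h_wss**2)
    # Shared time correlation for both of the above terms (Eq. (32))
    K_t = sq_exp(t, 1.0, l_t)
    # High-frequency systematics (correlated in wavelength, independent
    # in time) plus per-wavelength white noise
    S_wl = sq_exp(wl, h_hfs, l_wl_hfs) + np.diag(sigma**2)
    # Independence in time for the second Kronecker term
    S_t = np.eye(len(t))
    return K_wl, K_t, S_wl, S_t
```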

4.3 Re-analysis procedure

We compare the original 1D GP analysis from Gibson et al. (2017) to multiple analyses with 2D GPs to examine how different assumptions affect the results. In addition, a re-analysis using 1D GPs was performed to enable a closer comparison of the differences between 1D and 2D GPs. This 1D GP re-analysis was similar to the original but more closely matched the 2D GP method implemented in this work, such as fitting each light curve for the transit depth ρ2 instead of the radius ratio ρ, using the same priors on corresponding parameters as the 2D GP fit and also using a squared-exponential kernel instead of the Matérn 3/2 kernel chosen in the original analysis. Log-uniform priors were used on all of the hyperparameters, which is also a slight change from the original analysis.

The spectroscopic light curves were binned using the same wavelength bins as the original analysis, although there may potentially have been more outliers clipped at the edges of the datasets. Each light curve had separate parameters for transit depth ρ2, limb-darkening parameters c1 and c2, out-of-transit flux Foot and linear trend in baseline flux Tgrad. The light curves shared the central transit time parameter T0, system scale a/R*, period P and impact parameter b. The period was fixed to the mean literature value given in Patel & Espinoza (2022) of P = 3.4059095 ± 0.0000047 days.

For the 1D GP re-analysis, a common-mode correction was performed (see Sec. 3.9) with the white light curve for each dataset extracted across the same broad wavelength ranges as in Gibson et al. (2017). It was fit for T0, a/R*, ρ2, b, c1, c2, Foot, Tgrad, h, lt and σ using DE-MC. T0, a/R* and b were fixed to their best-fit values when fitting the spectroscopic light curves. The same Gaussian priors were placed on a/R*, b and on the radius ratio of the white light curve as in Gibson et al. (2017). These priors were taken from previously reported values in Sing et al. (2015). The priors were included because the large amplitude common-mode systematics made constraining parameters that affect the overall transmission spectrum difficult. To replicate this constraint for the 2D GP analyses – which do not perform a common-mode correction – these Gaussian priors were kept on a/R* and b and also on the mean radius ratio8 when fitting the spectroscopic light curves.

As the data alone was not sufficient to accurately constrain the limb-darkening parameters, Gaussian priors were placed on c1 and c2 based on predicted values from theoretical models using PYLDTK. The mean values given by PYLDTK were set as the prior means but the prior uncertainties were increased to a standard deviation of 0.1 to account for uncertainty in the limb-darkening models (matching the procedure in Gibson et al. 2017).

Outliers in the spectroscopic light curves were clipped by performing a best-fit to the data with a GP – optimising both the transit parameters and hyperparameters – and clipping 4σ outliers (evaluated using the GP predictive mean and variance). The outliers were replaced with the GP predictive mean evaluated at the location of the points. This was performed using the 2D GP predictive mean for the 2D analysis (see Appendix A) and the 1D GP predictive means for each light curve for the 1D GP re-analysis. Replacing these outliers as opposed to removing them ensures the flux observations still lie on a complete 2D grid. This is required for our optimised 2D GP method and also matches the procedure in the original analysis.
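A minimal sketch of this clipping step is given below, assuming the GP predictive mean and variance have already been evaluated at the data locations (the function is illustrative rather than the exact procedure used):

```python
import numpy as np

def clip_outliers(flux, gp_mean, gp_var, n_sigma=4.0):
    """Replace >n_sigma outliers with the GP predictive mean.

    flux, gp_mean and gp_var are (M, N) arrays (M wavelengths, N times),
    where gp_mean and gp_var come from a best-fit GP (1D per light curve
    or the 2D GP of Appendix A).
    """
    z = np.abs(flux - gp_mean) / np.sqrt(gp_var)
    outliers = z > n_sigma
    # Replacing (rather than removing) outliers keeps the data on a
    # complete 2D grid, as required by the optimised 2D GP method
    return np.where(outliers, gp_mean, flux), outliers
```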

The 1D GP re-analysis was similar to the procedure described in Sec. 3.4 but with each (corrected) spectroscopic light curve being fit for ρ2, c1, c2, Foot, Tgrad, h, lt and σ. As more parameters were being fit, the chains were run for longer, with 100 independent chains run for 500 burn-in steps followed by a further 500 sampling steps. If necessary, chains were extended until convergence occurred (measured using the Gelman-Rubin statistic as in Sec. 3.4). The 2D GP analyses used blocked Gibbs and slice sampling in combination with NUTS sampling for inference (see Appendix E for more details). Sampling was performed with a burn-in length of 1000 and a chain length of 1000 with four independent chains. Most of the 2D GP analyses were fitting seven parameters for each light curve (ρ2, c1, c2, Foot, Tgrad, hWSS, σ) with five wavelength-independent parameters (hCM, lt, lλ(CM), hHFS, lλ(HFS)). Exceptions to this are explained in Sec. 4.4. For the 600B data with 16 light curves, this results in 117 parameters. The 600RI data had 22 light curves and therefore had 159 parameters to fit. The use of NUTS permitted convergence to occur for all parameters within this relatively small number of sampling steps.

Basic atmospheric retrievals similar to those described in Sec. 3.5 were performed. We fit a radius and slope to each transmission spectrum in addition to measuring the change in the radius ratio at the bin centred on the Na feature ΔρNa for both the 600B and 600RI data as well as the change in radius ratio at the K feature ΔρK for the 600RI data. The 600RI data have a 150 Å bin centred on 7681 Å which is wide enough to cover the full K I doublet. The 600B and 600RI datasets have 150 Å bins centred at 5893 Å and 5881 Å respectively which both cover the Na I doublet. This same atmospheric retrieval was also performed on the transmission spectra taken from the original analysis.
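For reference, a simple covariance-aware retrieval likelihood of this kind can be written as below; the specific model form (a baseline radius ratio, a gradient in the logarithm of wavelength and single-bin Na and K offsets) is an assumed stand-in for the parametrisation of Eq. (28):

```python
import numpy as np
from scipy.stats import multivariate_normal

def transmission_model(wl, rho0, m, d_na, d_k, na_bin, k_bin):
    """Illustrative transmission spectrum model: baseline radius ratio rho0,
    gradient m in ln(wavelength), and single-bin Na and K offsets."""
    rho = rho0 + m * np.log(wl / wl.mean())
    rho[na_bin] += d_na
    rho[k_bin] += d_k
    return rho

def retrieval_loglike(params, wl, rho_mean, rho_cov, na_bin, k_bin):
    """Gaussian log-likelihood of the retrieved transmission spectrum,
    using the mean vector and full covariance matrix from the 2D GP fit."""
    model = transmission_model(wl, *params, na_bin=na_bin, k_bin=k_bin)
    return multivariate_normal.logpdf(rho_mean, mean=model, cov=rho_cov)
```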

Fig. 7

600B grism (left) and 600RI grism (right) spectroscopic light curves with best-fit transit region shaded in grey. Left: the light curves appear to have very similar systematics with the exception of the top one or two light curves in which the region in the red box shows a small increase in flux at the end of the observation. This could suggest the presence of wavelength-specific systematics (WSS) in these data. Right: The slowly varying shape of the systematics – as highlighted in the red box – suggest that the ‘common-mode’ systematics are not actually common-mode but gradually vary in wavelength.

Fig. 8

Plots showing the GP predictive means fit by the 2D GP in the 600B dataset to account for (in order from left to right): (a) common-mode systematics, (b) high-frequency systematics and (c) wavelength-specific systematics. The rightmost plot shows the raw light curves, included for reference. Specifically, the middle plots show the change in the GP mean caused by including these systematics in the fit, as discussed in Sec. 4.2.

Fig. 9

600B (top) and 600RI (bottom) retrieved transmission spectra comparing the original analysis, the 1D GP re-analysis and the analogous 2D GP analysis. The 2D GP best-fit atmospheric retrieval is shown as a red dotted line and random draws taken from the 2D GP spectrum are shown in grey. Top: all three analyses show largely consistent results. Bottom: the error bars of each method may appear consistent, but the random draws from the 2D GP spectrum demonstrate a large uncertainty in the slope of this spectrum after accounting for the covariance between transit depths.

4.4 Re-analysis results

A comparison of the transmission spectra from the 1D GP analyses to our initial 2D GP analysis is plotted in Fig. 9. The results of all atmospheric retrievals performed are included in Tables 4 and 5. The scattering slope reported in these tables is given in terms of the α parameter from Eq. (28) to make it easier to compare to Rayleigh scattering (α = −4). The α values were calculated by taking the literature values of H = 1220 km for the scale height and R* = 1.12 R⊙ for the stellar radius, as reported in Sing et al. (2015) and Anderson et al. (2011) respectively. The reduced chi-squared of the best-fitting atmospheric model for each recovered transmission spectrum (accounting for covariance for the 2D GP analyses) is also included. Unlike in Sec. 3 where the true transmission spectra were known, these values can only be used to help quantify how consistent each analysis is with the specific choice of atmospheric model.

Appendix L includes visualisations of the correlation matrix for both the 600B and 600RI transmission spectra, 2D GP best-fits to the spectroscopic light curves and corner plots for two of the 2D GP analyses.

4.4.1 1D GP re-analysis and the analogous 2D GP analysis

Our initial 2D GP analysis attempted to closely resemble the 1D GP analysis and so we refer to this as the ‘analogous 2D GP analysis’. However, the 2D GP has a different kernel with wavelength length scale parameters lλ(CM) and lλ(HFS) that can account for how the ‘common-mode’ and high-frequency systematics may actually vary in wavelength. It shares the same time length scale parameter lt across all light curves, which could help provide more robust constraints (see Sec. 3.6). It also retrieves the full covariance matrix of the transmission spectrum instead of assuming all transit depths are independent. Figure 9 compares the resulting transmission spectra from the original Gibson et al. (2017) analysis, the 1D GP re-analysis and the analogous 2D GP re-analysis. Random draws taken from the 2D GP covariance matrix are included to visualise the correlated uncertainties in the 2D GP transmission spectrum (as the error bars do not convey this information).

For the 600B dataset, it was found the data were consistent with containing common-mode systematics as the posterior of the wavelength length scale was concentrated towards the maximum prior limit of 50 times the wavelength range of the data. This does not rule out that the systematics could be varying very gradually in wavelength, specifically the wavelength length scale was constrained to be > 6772 Å (≈3 times the wavelength range of the dataset) to 99% confidence. This constraint implies that systematics that vary over length scales shorter than 3 times the wavelength range of the dataset must be distinguishable from common-mode systematics and so if we constrain other wavelength length scales to be below this value we can safely assume those systematics are varying in wavelength. For example, the high-frequency systematics in this dataset were strongly constrained to be varying in wavelength with Å. The 600RI data were significantly affected by wavelength-varying systematics as Å for data with a wavelength range of 3135 Å. The high-frequency systematics in these data were also found to vary in wavelength with Å.

Since the wavelength length scale for the 600B dataset was concentrated towards an arbitrary choice of maximum prior limit, this analysis was repeated with the prior limit reduced to 15 times the wavelength range of the data. There was little difference in the results other than a slight increase (~11%) in the uncertainty of the slope of the transmission spectrum. This increase in uncertainty makes sense as the shorter prior limit forces the wavelength length scale to take smaller values, which are likely to increase uncertainty in the slope.

Note there is a risk that the wavelength-correlated systematics identified are actually from wavelength-varying inaccuracies in our transit model (e.g. due to unaccounted for limb-asymmetries; see Powell et al. 2019). We performed another test where we masked out the transit region of each dataset and performed an identical analysis on the masked datasets except without fitting for a transit model. The linear baseline parameters Foot and Tgrad were still fit for to account for changes in baseline flux. The resulting constraints on the hyperparameters that fit for wavelength correlation (hCM, hHFS and ) were all consistent with the values from the previous analyses to 2.2σ. The systematics were similarly constrained to be varying in wavelength except for the 600B wavelength length scale which was again consistent with common-mode systematics.

Our results strongly contradict the assumptions of the original analysis that only common-mode systematics and wavelength-independent systematics are present in the 600B and 600RI datasets. Based on the results of Sects. 3.8 and 3.9, these incorrect assumptions could lead uncertainties on broad features such as the scattering slope to be underestimated and could cause uncertainties on sharp features such as the strength of Na and K features to be overestimated. Tables 4 and 5 appear consistent with this interpretation as the 2D GP analyses all show increased uncertainty in the scattering slope and decreased uncertainty on the constraints of Na and K. The 600RI data were particularly affected, for example the uncertainty in the scattering slope was ~3.9 times larger compared to the original analysis. This makes sense as both the ‘common-mode’ systematics and high-frequency systematics were found to vary in wavelength for this dataset. The constraint of α = 8.09 ± 8.02 from the 2D GP analysis is weak enough to be consistent with a positive slope (α > 0), a flat spectrum (α = 0), Rayleigh-scattering (α = −4) or super-Rayleigh scattering (α < −4). This is in contrast to the 1D GP analyses that are both consistent with either a positive slope or a flat spectrum. The tighter constraints on Na and K in the 2D GP analyses (~50% smaller) still did not lead to a detection of either species in any analysis (< 2.3σ for all analyses). We note that the shared time length scale lt in the 2D GP analysis could have also tightened the constraints on Na and K – similar to results in Ahrer et al. (2022).

We found that the reduced chi-squared values of the best-fitting atmospheric models increased significantly for the 2D GP analyses, particularly on the 600RI data. We should expect the reduced chi-squared to lie in [0.46, 1.75] to 95% confidence for the 600RI data. The 1D GP analyses are on the lower end of this confidence interval while the analogous 2D GP analysis is on the higher end9. Since the 1D GP analyses do not account for wavelength-correlated systematics, they could certainly be overestimating uncertainties. However, it is not clear if the high values for the 2D GP analyses indicate that the 2D GP analyses are underestimating uncertainties (potentially due to a sub-optimal choice of kernel) or if the chosen atmospheric model was not flexible enough. This value is still close to the 95% confidence interval, so it may also be due to random chance.

The runtime of the 2D GP analysis was longer than the 1D GP analysis by about one order of magnitude. The combined runtime of the MCMCs for the 1D GP re-analysis took 28 min for the 600B data and 54 min for the 600RI data. The MCMCs for each of the analogous 2D GP analyses on the 600B and 600RI datasets took 6.5–7 h each10.

Table 4

Results of atmospheric retrievals on the 600B data analysed using different methods.

Table 5

Results of atmospheric retrievals on the 600RI data analysed by different methods.

Fig. 10

Similar to Fig. 9 but showing different 2D GP analyses tested. As error bars do not convey the correlations between the data points, it is difficult to see that the 600B fit with no mean radius prior – which appears to have much larger uncertainties – actually has a similar constraint on the slope and Na feature compared to the other methods. For the direct retrieval of atmospheric parameters from the data, a dotted line is included that shows the best-fit atmospheric model (as the transmission spectrum is not directly retrieved with this method).

4.4.2 Fitting without a mean radius prior

The original analysis placed a Gaussian prior on the radius ratio of the white light curve which the 2D GP analyses match by placing the same prior on the square root of the mean transit depth (which should approximately match the radius ratio of the white light curve). This prior is in addition to the uniform priors already used when fitting for each transit depth. Fitting the transmission spectrum without this prior on the mean radius ratio combined with the lack of a common-mode correction permits the retrieval of the overall offset of the transmission spectrum.

We repeated the analogous 2D GP analyses without this prior on the mean radius ratio. By examining Table 4, we can see for the 600B data that the uncertainty of the radius ratio increases significantly without this prior and is ~13.9 times larger than in the analogous 2D GP analysis. Other than this, however, the constraints on the other parameters are almost identical, with no significant difference in the scattering slope or the retrieval of the Na feature. See Fig. 10 for comparisons between the transmission spectra for each of the 2D GP analyses including the 600B data fit without a mean radius prior.

The corresponding analysis of the 600RI data resulted in a much weaker constraint on the offset of the transmission spectrum. The mean radius ratio was retrieved to be 0.08687 ± 0.00896, which is highly inconsistent with the value of 0.12348 ± 0.00362 from the 600B data (which shares some overlapping wavelength bins). Fitting the white light curve with a 1D GP without including the Gaussian radius prior also results in a low radius ratio of ρwhite = 0.10277 ± 0.00891. As these results are inconsistent with previous observations, the simplest explanation may be that the systematics present in this dataset happened to be particularly similar to a transit signal, as suggested in Sec. 3.7. The other atmospheric parameters were not significantly changed from the analogous 2D GP analysis – as can be seen in Table 5 – so the choice of including this prior mainly affects the constraint on the radius ratio.

4.4.3 Marginalising over uncertainty in T0, a/R* and b

Joint-fitting the spectroscopic light curves allows us to vary the central transit time T0, system scale a/R* and impact parameter b as parameters within the MCMC. We recommend this approach as it accounts for how the uncertainty in these parameters affects the transmission spectrum. We can examine if this has a significant effect by comparing our results to the analogous 2D GP analysis that matched the 1D GP procedure of fixing these parameters.

The central transit time was allowed to freely vary with a uniform prior in the 600B data, but the 600RI data struggled to tightly constrain T0 (as was also seen in the 600RI white light curve analysis). It was therefore decided to constrain T0 using the retrieved value from the 600B data as a Gaussian prior. This assumes that WASP-31b does not have significant transit timing variations (TTVs) and that the literature value used for the period is accurate. While there is some evidence to question these assumptions (e.g. Patel & Espinoza 2022; Bonomo et al. 2017; Kokori et al. 2022), we ignore these here to demonstrate the method on both datasets. The Gaussian priors on a/R* and b from Sing et al. (2015) that were used in the white light curve fitting in the original analysis were still kept on these parameters for the 600B analysis. The retrieved constraints on a/R* and b from the 600B data were then used as Gaussian priors on the 600RI data. A more rigorous but computationally expensive approach could have been to perform a joint fit on both datasets with common a/R* and b parameters and T0 values offset by 5 periods (the observations are 5 orbits apart).

The effect of marginalising over T0, a/R* and b appears to have had little effect on the retrieval of the 600B data. The mean radius would likely be the parameter most affected by varying these parameters (as these parameters are shared by all light curves) but the mean radius prior could be restricting this effect. The retrieved central transit time was T0 (HJD-2457433) = 0.753597 ± 0.000578 which is very close to the value retrieved from the white light curve of T0 (HJD-2457433) = 0.753609 ± 0.000662. The marginalisation over these parameters also had little effect on the 600RI data as the atmospheric retrieval produced similar results to the analogous 2D GP analysis.

4.4.4 Directly fitting atmospheric forward models to the data

Typically a transmission spectrum is first obtained by individually fitting each spectroscopic light curve and then an atmospheric retrieval is performed on the resulting transmission spectrum. With all spectroscopic light curves being fit simultaneously, we can perform an atmospheric retrieval while fitting the light curves. The transit depths can be generated from atmospheric forward models with the atmospheric parameters directly retrieved in the MCMC fitting the light curves, similar to how high-resolution transmission spectroscopy atmospheric retrievals are performed (e.g. Maguire et al. 2023).

This would reduce the number of parameters to fit, which may be useful if scaling this method to fit more light curves simultaneously. This also avoids approximating the posterior of the transmission spectrum. Typically the transmission spectrum is extracted by taking the mean and standard deviation (or the covariance matrix in our method) of each transit depth from the MCMC chains, which approximates the posterior as a Gaussian distribution. While different distributions could potentially model the posterior better, there would still likely be some modelling error, which could be avoided by directly fitting atmospheric models to the light curves.

The results from using this method closely match the analogous 2D GP analysis for the 600B and 600RI data, with the exception of the retrieved slope of the 600RI data, which has a significantly reduced mean and larger uncertainty. Explaining why this discrepancy occurred is difficult, as we do not recover individual transit depths with this method and no reduced chi-squared can be retrieved to determine the goodness-of-fit. The increased complication in interpreting the results is a major downside of this approach.

With this method, the systematics may be fit differently depending on the choice of atmospheric model. This is quite concerning as imperfections in our models could result in the GP fitting larger systematics to the data. While it is beneficial to avoid approximating the posterior of the transmission spectrum, this method appears to complicate the choice of atmospheric model and makes it harder to visualise if the data are consistent with a model. This method is therefore not recommended for testing out different atmospheric retrievals but could be an interesting test if a specific atmospheric model has already been chosen.

4.5 Benchmarking

We tested how the runtime for this method would be affected as a function of the number of time exposures and wavelength bins. The kernel function used to fit the VLT datasets was used (Eq. (30)) with the kernel hyperparameters set to the best-fit values of the 600RI analogous 2D GP analysis. As some hyperparameters (hWSS and σ) were fit separately for each wavelength, changing the number of wavelength bins required us to interpolate these best-fit values. One of the two time covariance matrices produced by this kernel function is the identity matrix. This makes the eigendecomposition of this matrix trivial, which was exploited in these calculations. For a large number of time points relative to wavelength bins, the eigendecomposition of these two time covariance matrices is typically the bottleneck in the log-likelihood calculation, so avoiding one eigendecomposition can almost halve the runtime.
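For reference, a compact implementation of this style of log-likelihood, following the eigendecomposition approach of Rakitsch et al. (2013), is sketched below; it is a simplified stand-in for the LUAS implementation and assumes all covariance blocks are positive definite:

```python
import numpy as np
from scipy.linalg import eigh

def kron_sum_loglike(R, K_wl, K_t, S_wl, S_t):
    """Log-likelihood of residuals R (shape M x N, wavelength index first)
    under a zero-mean GP with covariance K_wl kron K_t + S_wl kron S_t.
    The eigendecompositions scale as O(M^3 + N^3) rather than the
    O(M^3 N^3) of Cholesky factorisation on the full matrix."""
    M, N = R.shape

    # Eigendecompose the second Kronecker term (when S_t is the identity,
    # as for Eq. (30), this step is trivial and can be skipped)
    s_wl, U_wl = eigh(S_wl)
    s_t, U_t = eigh(S_t)

    # Whiten the first Kronecker term by the second
    B_wl = (U_wl / np.sqrt(s_wl)).T
    B_t = (U_t / np.sqrt(s_t)).T
    lam_wl, Q_wl = eigh(B_wl @ K_wl @ B_wl.T)
    lam_t, Q_t = eigh(B_t @ K_t @ B_t.T)

    # Rotate the residuals into the joint eigenbasis
    Z = (Q_wl.T @ B_wl) @ R @ (Q_t.T @ B_t).T
    D = np.outer(lam_wl, lam_t) + 1.0   # eigenvalues of the whitened covariance

    log_det = np.sum(np.log(D)) + N * np.sum(np.log(s_wl)) + M * np.sum(np.log(s_t))
    chi2 = np.sum(Z**2 / D)
    return -0.5 * (chi2 + log_det + M * N * np.log(2.0 * np.pi))
```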

For each number of wavelengths and time points tested, a random draw of noise described by this kernel function was first taken. Runtime of the mean function was not included so the log-likelihood calculation was performed using this random draw of noise as the residuals. The mean function used in this paper would have had little effect as even for N = 512 and M = 256 the runtime for the mean function was only ~3.5 ms.

The gradient calculation of the log-likelihood was not performed for this test. However, the runtime would likely not be significantly affected as eigendecomposition is generally the computational bottleneck regardless of whether the gradients are calculated and the same eigendecomposition can be stored to calculate both the log-likelihood and gradients of the log-likelihood (see Appendix H).

Log-likelihood calculations were performed using both our method as well as using Cholesky factorisation on the full MN × MN covariance matrix. The benefit of the Cholesky method is that it would place no restrictions on the kernel function or on the data needing to lie on a complete grid. However, its poor runtime scaling of O(M³N³) would make it unfeasible for most real datasets.

Figure 11 shows the results of this runtime comparison performed on a quad-core Intel Core i5-6500 CPU at 3.20 GHz. Calculations where MN ≥ 32768 were avoided for the Cholesky factorisation as they required ≥8GB of memory to store the MN × MN covariance matrix. This was not an issue for our new method as it only needs to store separate time and wavelength covariance matrices.

The Kronecker product sum method has a runtime scaling of O(M³ + N³) due to the expensive eigendecomposition step, which has the effect that when one dimension is much longer than the other it tends to act as the computational bottleneck. The 600B and 600RI datasets analysed had dimensions of (M = 16, N = 262) and (M = 22, N = 319) respectively, so the number of time points far exceeded the number of wavelength channels chosen. As can be seen in the central plot of Fig. 11, increasing the number of wavelength channels would have little effect on the log-likelihood calculation runtime due to the eigendecomposition of the time covariance matrix dominating the runtime when N ≈ 256. There were, however, seven wavelength-dependent parameters fit individually for each light curve, so the convergence time of the MCMC would still increase as the number of parameters increases. Convergence time with No U-Turn Sampling has been shown to scale as O(d^(5/4)) for d parameters under certain assumptions (Neal 2011).

As each MCMC chain is run independently with NUTS, parallelisation is trivial to implement. The same JAX code can also be run on a GPU, which could significantly improve performance. Utilising parallelisation and GPU acceleration could permit this method to be scaled to larger datasets from JWST.

Fig. 11

Comparison of log-likelihood calculation runtime (in seconds on logarithmic scale) between general approach of Cholesky factorisation (left) compared to our method (centre) as well as relative performance improvement (right). Cholesky factorisation was not calculated when MN ≥ 32768 (shown in white) due to memory limitations.

5 Discussion

The re-analysis of the VLT/FORS2 data confirms that systematics that vary in wavelength can be present in real observations and can be mistaken for common-mode systematics. As demonstrated in Sec. 3.9, this can significantly impact atmospheric retrievals and so we recommend a principled method to account for them using 2D GPs. Both the 600B and 600RI datasets contained high-frequency systematics that were found to gradually vary in wavelength. The re-analysis of the 600RI data also revealed that time-correlated systematics previously assumed to be common-mode were strongly constrained to vary in wavelength. Accounting for this led to a weaker constraint on the scattering slope but produced tighter constraints on the Na and K features, consistent with the results of Sects. 3.8 and 3.9.

Multiple 2D GP analyses were performed either with a different choice of priors or different parameters being fit. For the 600B data, each of these analyses were consistent with Rayleigh scattering and no strong evidence for Na was found (<2.3σ for all analyses). The results of the 600RI analyses were less consistent and show that different assumptions made when fitting these data can significantly affect the retrieved radius and slope of the transmission spectrum, although evidence for sodium or potassium always remained below 2σ. This means that there is still a conflict on the presence of potassium between the VLT/FORS2 data and the original analysis of the HST observations from Sing et al. (2015). We leave the re-analysis of the HST data using our new method for future work.

We have demonstrated that our method can avoid the need for a common-mode correction, which avoids artificially reducing uncertainty in the offset of the transmission spectrum. This is even more important when considering eclipse spectra, where the overall offset is critical to the interpretation of the spectrum (e.g. Evans et al. 2013; Bell et al. 2017). Joint-fitting spectroscopic light curves also permitted us to marginalise over uncertainty in the central transit time, system-scale and impact parameter – helping to improve the rigour of transmission spectroscopy. Approximating the posterior of the transmission spectrum for atmospheric retrievals can also be avoided if atmospheric forward models are fit directly to the data.

One disadvantage of assuming noise follows a Gaussian process is that it makes the analysis very sensitive to outliers. Student-t processes have been applied to transmission spectroscopy as they are less sensitive to outliers (Wilson et al. 2021). The optimisations presented in this work could also permit the application of 2D Student-t processes to transmission spectroscopy as both methods have similar likelihood functions.

Our method could be used to account for wavelength-correlated systematics in JWST observations. For example, NIRCam observations are contaminated by 1/f correlated readout noise that is read parallel to the spectral trace (in the dispersion direction) on the detector, introducing wavelength-correlated systematics that are visible in the residuals of the light curves (Ahrer et al. 2023). In addition, wavelength-correlated systematics have been identified in NIRISS/SOSS observations (Holmberg & Madhusudhan 2023). Scaling our method to larger datasets typical of JWST observations could be enabled by utilising parallelisation and using GPUs.

In addition, there has been recent work attempting to fit low-resolution ground-based transmission spectroscopy datasets without the use of a comparison star (Panwar et al. 2022; Spyratos et al. 2023). Combining 2D GPs with this approach could be beneficial due to the wavelength-dependent effects of atmospheric extinction no longer being accounted for with a comparison star.

GPs have a broad variety of uses within astronomy (Aigrain & Foreman-Mackey 2023) and astronomical data often include noise that is correlated in two dimensions (e.g. x and y pixel positions on a detector), so there could be many other useful applications for this method. Our optimisation should have significant performance advantages over other 2D GP optimisations (e.g. Gordon et al. 2020; Delisle et al. 2022) when there are non-trivial correlations across both input dimensions and neither dimension is very large. This could benefit fields that already use these other GP optimisations (e.g. exoplanet detection). GPs have also been proposed for use in high-resolution time series spectroscopy (e.g. Meech et al. 2022), and our method may enable much faster application to large datasets. Interpolation of 2D datasets can also be performed efficiently with this method (see Appendix A), which could have very broad applications. For example, 2D GPs have already been used to interpolate supernova light curves (e.g. Fakhouri et al. 2015; Boone 2019), which this method might help optimise.

6 Conclusions

We have developed a new method that can reliably recover transmission spectra in the presence of both time- and wavelength-correlated systematics, tested it on a range of simulated data to show its advantages over current approaches, and demonstrated it on real VLT/FORS2 observations of WASP-31b despite the presence of large-amplitude systematics. To summarise our conclusions:

  • We developed a 2D GP framework exploiting the Kronecker product structure of the kernel for data described on a (potentially non-uniform) 2D grid;

  • Our method is capable of handling correlations in time and wavelength when extracting the transmission spectra, enabling us to recover the covariance matrix of the spectra, which in turn can be used for atmospheric retrievals;

  • Simulations show that our method provides robust retrievals in the presence of wavelength-correlated systematics, whereas these same systematics can cause 1D GPs to overestimate uncertainties on sharp spectral features and recover erroneous scattering slopes;

  • Previous detections of extreme scattering slopes that are difficult to give physical explanations for could be explained by unaccounted for wavelength-correlated systematics;

  • For the two VLT/FORS2 datasets analysed, both datasets were found to contain wavelength-correlated systematics – contradicting the assumptions of their original analysis. Our method recovered significantly weaker constraints on the scattering slope from the 600RI data but tighter constraints on sodium for both datasets as well as on potassium for the 600RI data;

  • Our method removes the need for the common-mode correction – which normally introduces arbitrary offsets to the transmission spectra – making the comparison between independent transit observations simpler;

  • Other benefits of joint-fitting spectroscopic light curves include the ability to marginalise over the uncertainty in the central transit time T0, system scale a/R* and impact parameter b. The simulations demonstrated that sharing hyperparameters between light curves may also result in more robust retrievals.

Overall, our method presents several clear advantages over the use of 1D GPs with the potential to be applied to a wide variety of datasets in transmission and emission spectroscopy.

Acknowledgements

We thank Soichiro Hattori for careful reading of the manuscript and providing helpful comments. We thank the team who led the observations analysed in this work (PI: Nikolov). The observations are publicly available in the ESO Science Archive Facility (http://archive.eso.org) under ESO programme 096.C-0765. Many of the calculations in this work were performed on the astro01 system maintained by the Trinity Centre for High Performance Computing (Research IT). This system is co-funded by the School of Physics and by the Irish Research Council grant award IRCLA/2022/3788. This research has made use of the NASA Exoplanet Archive, which is operated by the California Institute of Technology, under contract with the National Aeronautics and Space Administration under the Exoplanet Exploration Program. We are grateful to the developers of NUMPY, SCIPY, MATPLOTLIB, ASTROPY, PANDAS, IPYTHON, BATMAN, ARVIZ, CORNER, PYLDTK, JAX, PyMC, JAXOPLANET and EXOPLANET as these packages were used extensively in this work (Harris et al. 2020; Virtanen et al. 2020; Hunter 2007; Astropy Collaboration 2022; Pandas development team 2020; Pérez & Granger 2007; Kreidberg 2015; Kumar et al. 2019; Foreman-Mackey 2016; Parviainen & Aigrain 2015; Bradbury et al. 2018; Salvatier et al. 2016; Foreman-Mackey & Garcia 2023; Foreman-Mackey et al. 2021).

Appendix A Gaussian process regression

This method can be used to efficiently perform Gaussian process regression (as demonstrated in Rakitsch et al. 2013), although the full predictive covariance matrix can be expensive to compute. The predicted mean and covariance of the GP conditioned on the data were used in this work to help visualise the fit of the systematics (Fig. 8) and also to clip outliers and replace with interpolated values (see Sec. 4.3).

Suppose we have a set of observations y ~ \mathcal{N}(μ(t, λ, θ), K(t, λ, θ)), where we are using the best-fit values for all parameters θ. We can compute the predictive distribution for points y* – which are located on a grid of λ* wavelengths and t* times – using:

E[y_*] = \mu_* + K_*^T K^{-1} (y - \mu)   (A.1)

Var[y_*] = K_{**} - K_*^T K^{-1} K_*   (A.2)

where μ* is the mean function computed at the locations of y*, K* is the covariance between y and y* and K** is the covariance of y* with itself.

Calculating E[y*] is straightforward, as we can break down the calculation of K−1r using Eq. (26), and K* is in general a sum of two Kronecker products similar to K, so we can use Eq. (14) to perform this matrix-vector multiplication.
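The key operation here is a Kronecker-structured matrix-vector product, which never requires building the full MN × MN matrix; a minimal sketch (with illustrative function names) is:

```python
import numpy as np

def kron_matvec(A, B, R):
    """Compute (A kron B) r without forming the Kronecker product, where r
    is the row-major flattening of the (M, N) array R (wavelength index
    first), A acts on the wavelength axis and B acts on the time axis."""
    return A @ R @ B.T

def kron_sum_matvec(A1, B1, A2, B2, R):
    """Matrix-vector product for a sum of two Kronecker products,
    (A1 kron B1 + A2 kron B2) r, as needed for terms such as K*."""
    return kron_matvec(A1, B1, R) + kron_matvec(A2, B2, R)
```

Because A and B may be rectangular, the same functions also cover the cross-covariance term K* between the observed and predicted grids.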

Calculating all of the terms in the covariance matrix Var[y*] would be computationally expensive and memory-intensive as it would require the calculation and storage of M*N* × M*N* terms. If it is desired to take random draws from the predictive distribution then no method was found to efficiently do this, as it would likely require the calculation of the full covariance matrix Var[y*], followed by Cholesky factorisation or eigendecomposition of that matrix. However, if we are just interested in calculating the diagonal terms of the covariance matrix (such as for identifying outliers) then an optimisation was found which does not require the storage of any M* N* × M* N* matrices in memory. This method has no calculations that would scale worse in runtime than the cube of any of the dimensions M, N, M* or N*. The details of this algorithm have been omitted for brevity but the implementation is available on the LUAS GitHub repository.

Appendix B Error propagation with a covariance matrix

While it is well-known how to propagate uncertainties when they are independent, propagating uncertainties given any general covariance matrix is less well-known. It is included here for reference.

Given a vector x that follows a multivariate normal distribution x ~ \mathcal{N}(μ_x, Σ_x) and values y we wish to calculate which are a linear transformation of x (i.e. y = Tx for some matrix T), then y follows a normal distribution y ~ \mathcal{N}(μ_y, Σ_y) and we can calculate μ_y and Σ_y using the following equations (see Brandt 2014 for derivations):

\mu_y = T \mu_x   (B.1)

\Sigma_y = T \Sigma_x T^T   (B.2)

For example, to compute the average transit depth given a transmission spectrum of M transit depths described by x = ρ2 ~ \mathcal{N}(μ_x, Σ_x), we must take the mean of ρ2. We can do this by taking T = 1^T/M, where 1 is an M-long vector in which every element is equal to one. If the covariance matrix Σ_x is diagonal with constant variance σ2 then this reduces to the familiar form for the standard error of the mean.

This can also be used to bin a transmission spectrum to larger bin sizes. For example, if binning each pair of neighbouring light curves together then the vector T = (1/2, 1/2, 0, . . ., 0) could be used to propagate the uncertainty for binning the first pair of bins. The covariance matrix of the full (binned) transmission spectrum could be obtained by using the (M/2) × M matrix T whose ith row contains 1/2 in columns 2i − 1 and 2i and zeros elsewhere (assuming an even number of bins) for Eqs. (B.1) and (B.2).
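A short numerical example of Eqs. (B.1) and (B.2), applied both to the average transit depth and to pairwise binning, is given below (the input values are placeholders):

```python
import numpy as np

def propagate(T, mu_x, cov_x):
    """Linear error propagation: y = T x => mu_y = T mu_x, cov_y = T cov_x T^T."""
    return T @ mu_x, T @ cov_x @ T.T

M = 6                                          # number of transit depths (even)
mu_rho2 = 0.015 + 1e-4 * np.random.rand(M)     # placeholder transit depths
cov_rho2 = 1e-8 * (np.eye(M) + 0.3)            # placeholder correlated covariance

# Average transit depth: T is a single row of 1/M entries
T_mean = np.full((1, M), 1.0 / M)
mean_depth, mean_depth_var = propagate(T_mean, mu_rho2, cov_rho2)

# Binning pairs of neighbouring depths: each row averages two adjacent bins
T_pairs = np.zeros((M // 2, M))
for i in range(M // 2):
    T_pairs[i, 2 * i:2 * i + 2] = 0.5
binned_mu, binned_cov = propagate(T_pairs, mu_rho2, cov_rho2)
```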

Appendix C Priors for the simulations and the VLT/FORS2 analyses

Priors used for the simulations are included in Table C.1. These same priors were used for all methods where the parameter is relevant, that is the 1D GP method did not use the parameter lλ but used the same priors on the other parameters.

The default priors used for both VLT/FORS2 datasets are given in Table C.2. The central transit time priors are given relative to the predicted central transit times from Patel & Espinoza (2022). Where analyses in Sec. 4.4 diverged from these default values, this is explicitly stated. Where parameters are shared between the 1D GP re-analysis and the 2D GP analyses, the same priors were used for both. The differences for the 1D GP re-analysis are that: it used the mean radius ratio prior during the white light curve fit instead of during the spectroscopic fit; the high-frequency systematics were recovered from the residuals of the white light curve fit and therefore hHFS was never fit for; and finally the wavelength length scales lλ(CM) and lλ(HFS) were not fit for.

We note that while prior bounds were included for all uniform and log-uniform priors, the only parameters that had MCMC values close to these bounds were the wavelength length scale lλ for the 600B data and the wavelength-specific systematics height scale hWSS for many of the wavelength channels (it was fit independently for each wavelength). The implications for lλ were discussed in Sect. 4.4.1, while for hWSS (which could sometimes be consistent with the minimum prior bound) this suggests that some wavelength channels were consistent with having no wavelength-specific systematics present in them at all. The choice of minimum prior bound could therefore have affected the results, with lower minimum prior bounds more strongly weighting the probability of negligible systematics being present and higher minimum prior bounds forcing the probability of non-negligible systematics to be higher. The rest of the parameters were constrained within the prior bounds (see the corner plots for some of these parameters in Figs. L.5 and L.6) and were therefore likely unaffected by them, other than by the choice of a uniform prior (for most mean function parameters) or a log-uniform prior (for hyperparameters).

Gaussian priors were also taken from Sing et al. (2015) for the mean radius ratio, a/R* and b, as was also done in Gibson et al. (2017). Gaussian priors on the limb-darkening parameters c1 and c2 were calculated using PyLDTk, as discussed in Sec. 4.3.

For the direct atmospheric retrievals in Sect. 4.4.4, the Gaussian prior on the mean radius ratio was still used, and the remaining parameters (m, ΔρNa, ΔρK) were varied uniformly with the constraint that the radius ratio was positive for all wavelengths.

Table C.1

Priors used when fitting parameters in Simulations 1-4.

Table C.2

Priors used for fitting the 600B and 600RI VLT/FORS2 datasets.

Appendix D MCMC sampling for the simulations

We initially found that some simulations were not efficiently sampled when varying all parameters together in a single No U-Turn Sampling (NUTS) step. As a result, blocked Gibbs sampling was used to speed up convergence. The approach used for all the simulations was to vary all transit depths simultaneously with NUTS in one blocked Gibbs sampling step, followed by varying the hyperparameters together in one or more further blocked Gibbs sampling steps. Separately varying the transit depths and the hyperparameters would not be efficient if significant correlations existed between them, but for the simulations this does not appear to have been a significant issue. For real datasets such as the VLT/FORS2 data, a different approach was used because the correlations between the parameters and hyperparameters were found to be more significant.

Work performed after the simulations revealed that the improved efficiency when varying parameters separately was likely caused by the mass matrix we generated from a Laplace approximation not accounting for the transformations performed by PyMC to keep parameters within the desired prior bounds. As different parameters can have significantly different prior bounds, this likely had a significant impact on sampling efficiency, and future updates to the LUAS package may be able to account for this without using blocked Gibbs sampling. However, blocked Gibbs sampling is a valid MCMC method (Jensen et al. 1995), and the choice of mass matrix should only affect the sampling efficiency, not whether the sampling is valid. We present in this section the procedure used for the simulations in this work, noting that it may be improved upon in future work.

A Laplace approximation was used to generate the mass matrix for each blocked Gibbs step of NUTS. However, it was found that parameters being fit close to a prior limit would often converge slowly with NUTS, likely because of the unaccounted-for transformations performed by PyMC and also because they were poorly approximated by a Gaussian (a Laplace approximation works by approximating the posterior as Gaussian). For the simulations performed in this work, only the wavelength and time length scales lλ and lt were occasionally fit close to their prior limits, so these were the parameters most affected. To deal with this issue, if the best-fit values of lt and lλ were identified to be outside of a certain range (given in Table D.1), they were instead sampled in a separate Gibbs step using slice sampling (see Neal 2003 for an explanation). Slice sampling draws samples from under the probability density by alternating between vertical and horizontal ‘slices’. Univariate slice sampling – as implemented in PyMC – was often found to converge faster than NUTS for these length scale parameters when they were close to prior limits. Future work may avoid the need for slice sampling and improve sampling efficiency. However, blocked Gibbs sampling alternating between NUTS and slice sampling is valid11 and the results of the simulations demonstrate that the retrievals using this inference approach were generally consistent with the true simulated values.
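As an illustration of how such a compound (blocked Gibbs) scheme can be set up in PyMC, the toy model below assigns NUTS to one group of parameters and univariate slice sampling to another. The model itself is only a stand-in and is not the transit plus 2D GP likelihood used in this work.

    import numpy as np
    import pymc as pm

    y_obs = np.random.default_rng(3).normal(loc=1.0, scale=0.5, size=50)

    with pm.Model() as model:
        # Stand-ins for a group of transit depths and a length-scale-like hyperparameter
        depth = pm.Normal("depth", mu=0.0, sigma=10.0)
        log_l = pm.Uniform("log_l", lower=-3.0, upper=3.0)

        pm.Normal("y", mu=depth, sigma=pm.math.exp(log_l), observed=y_obs)

        # Blocked Gibbs sampling: NUTS for one group, slice sampling for the other
        step = [pm.NUTS([depth]), pm.Slice([log_l])]
        idata = pm.sample(1000, tune=1000, step=step, chains=2)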

Table D.1 summarises the 2D GP sampling steps used depending on the best-fit values of the hyperparameters. All hyperparameters were drawn from log-uniform distributions within the ranges listed. NUTS 1 and NUTS 2 refer to two different blocked Gibbs steps that use NUTS; that is, all parameters being varied in step ‘NUTS 2’ are varied together. The transit depth parameters were always varied together in a single step of NUTS. If the best-fit values of lt and/or lλ prior to running the MCMC were not within the ranges listed, they were independently varied with univariate slice sampling in additional Gibbs sampling steps (resulting in a maximum of four Gibbs sampling steps). Both the Hybrid and 2D GP methods followed this description, with the exception that the Hybrid method did not have a wavelength length scale.

Table D.1

Overview of MCMC blocked Gibbs sampling steps for Simulations 1-4.

Appendix E MCMC sampling for the VLT/FORS2 analyses

Fitting both VLT/FORS2 datasets with 2D GPs required fitting over 100 parameters with an MCMC. The approach was the same for both datasets and was similar to the method followed for the simulations, although many more parameters were being fit in this case, with more parameters combined into the same blocked Gibbs steps. It was found that adjusting the mass matrix from the Laplace approximation by scaling the rows and columns relating to certain parameters could improve sampling efficiency. It was not known at the time why this was the case, but the transformations performed by PyMC to keep parameters within certain prior bounds are the likely cause. We could still account for this by running NUTS on each group of parameters (i.e. ρ², Foot, Tgrad, etc.) individually and using dual averaging (as described in Hoffman & Gelman 2014 and implemented in PyMC) to optimise the step size ε for each group of parameters. This was performed for each group of parameters for 200 steps with a single chain of NUTS, which was long enough for the step sizes to converge near to ‘optimal’ values. The approximate covariance matrix of the posterior obtained using the Laplace approximation was scaled using these values, which significantly improved sampling efficiency, likely because it accounted for the different transformations applied to each group of parameters. This procedure allowed all the mean function parameters, the common-mode height scale hCM and the length scale in time lt to be varied within a single blocked Gibbs step of No U-Turn Sampling (112 parameters for the 600RI data). A simpler procedure identified later would have been to take the Laplace approximation with respect to the log-posterior of the transformed parameters which PyMC uses. However, as this only affects the choice of mass matrix used to tune NUTS, it should not impact the validity of our results but allowed us to improve sampling efficiency and reduce runtimes.

Similar to what was discussed in Appendix D, for parameters close to a prior limit it was often more efficient to sample them independently using slice sampling rather than NUTS. This was the case for the wavelength length scale lλ and for some of the wavelength-specific systematics height scales hWSS. In particular, hWSS values which were largely consistent with zero and had probabilities that dropped off sharply for larger values could be efficiently sampled with NUTS, and values that were constrained to be larger than zero and had approximately Gaussian distributions were also typically sampled well with NUTS. However, parameters that were consistent with zero but also had significant probability mass at larger values appeared to have long correlation lengths with NUTS and were sampled more efficiently with slice sampling (e.g. see the posterior of hWSS;5743Å in Fig. L.5). Since it was difficult, before running an MCMC, to distinguish between the parameters that were constrained to be greater than zero and the parameters that were better sampled with slice sampling, all of the larger hWSS parameters were fit with slice sampling. The threshold chosen was that if the best-fit value of hWSS was > 5 × 10−5 then it would be fit with slice sampling. This is not a particularly efficient method but it avoided handpicking particular terms and was still faster to converge than using NUTS for these parameters. For larger datasets – such as from JWST – a more efficient approach may need to be found if a similar kernel function is used.

Table E.1

Overview of MCMC blocked Gibbs sampling steps for all 2D GP analyses of both VLT/FORS2 datasets.

The remaining parameters that were below the 5 × 10−5 threshold were all fit together in a single blocked Gibbs step of NUTS. However, using a mass matrix calculated from a Laplace approximation was found to be slow for these parameters, so the mass matrix was instead automatically adapted by PyMC during the burn-in phase.

Finally, the white noise parameters and high-frequency systematics parameters were sampled together in a separate blocked Gibbs sampling step with a mass matrix taken from the Laplace approximation. Table E.1 summarises which parameters were sampled in which blocked Gibbs sampling steps. For the other 2D GP analyses performed, any additional mean function parameters (such as T0, a/R*, b, or the atmospheric parameters in the direct retrieval) were always varied in the first blocked Gibbs sampling step with NUTS.

Appendix F Examining the choice of kernel for the VLT/FORS2 data using autocorrelation

A useful way to visualise correlated noise in the VLT/FORS2 data is to calculate the autocorrelation of the residuals. The autocorrelation of a signal is its correlation with an offset copy of itself, and it can be used to identify correlated structure. In this case, we want to identify correlations in the noise of the data, so our signal is the flux observations minus our best-fit mean function. We consider correlations in two dimensions – both in time and in wavelength.
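A minimal sketch of how such a two-dimensional autocorrelation map can be computed from a (wavelength × time) grid of residuals; the normalisation and array shapes are illustrative rather than taken from the analysis code.

    import numpy as np
    from scipy.signal import correlate2d

    def autocorrelation_2d(residuals):
        # 2D autocorrelation of a (wavelength x time) residual grid.
        # Zero offset lies at the centre of the returned map.
        r = residuals - residuals.mean()
        acf = correlate2d(r, r, mode="full")
        return acf / acf.max()            # normalise so zero offset equals one

    # With pure white noise the map should be near zero away from the centre
    rng = np.random.default_rng(4)
    acf = autocorrelation_2d(rng.normal(size=(16, 100)))
    centre = ((acf.shape[0] - 1) // 2, (acf.shape[1] - 1) // 2)
    print(acf[centre])                    # equals 1.0 at zero offset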

The left-most plot in Fig. F.1 shows the autocorrelation of the data minus the best-fit transit mean function. This should look approximately like a heat map of the kernel function. Note that the autocorrelation at zero offset has been set to zero in all of these plots to increase the contrast for the rest of the values.

The second plot shows the autocorrelation of the residuals after subtraction of the best-fit GP mean if we set hHFS = 0. This visualises the approximate correlation in the high-frequency systematics – which appears as a vertical line. The correlation appears to reduce with increasing wavelength separation, suggesting a finite wavelength length scale. The third plot is similar to the second but instead all of the hWSS terms are set to zero. If our assumption that these systematics are fully wavelength-independent is true, then we should only expect to find correlations at zero wavelength separation (i.e. a horizontal line in the middle of the plot), with other wavelength separations being independent of each other.

Fig. F.1

Autocorrelation plots of residuals minus GP mean if (from left to right): no correlated noise is fit, no high-frequency systematics are fit, no wavelength-specific systematics are fit, or if the full kernel in Eq. (30) is fit.

In the right-most plot, we see the autocorrelation of the residuals after subtraction of the best-fit GP mean including all the terms in our kernel function. The resulting plot shows almost no correlation, with the exception of an apparent purple cross in the middle of the plot, implying some residual correlation from both high-frequency (wavelength-correlated but time-independent) systematics and wavelength-specific (wavelength-independent but time-correlated) systematics. This could suggest that the squared-exponential kernel function used for these systematics was not the ideal shape to account for them, or perhaps that the best fit did not converge to the optimal values. It could also be due to the restrictions made on the kernel, such as the wavelength-specific systematics being forced to share the same length scale in time as the common-mode systematics. However, the remaining correlation is low in magnitude compared to the other autocorrelation plots, so it is possible that any remaining correlation has a negligible impact on the resulting transmission spectrum.

Appendix G Tables of results for Simulations 1-4

The accuracy metrics from Sec. 3.5 calculated for Simulations 1–4 are included in this section, with Table G.1 explaining how to read and interpret each table. An interpretation of the results of each table is included here for convenience, although it repeats many of the points made in Sec. 3.

Table G.2 presents the results of Simulation 1, which contained wavelength-independent systematics, tested on three methods. Firstly, 1D GPs fitting each spectroscopic light curve separately were tested but did not produce reliable constraints, as the mean and the variance of the Z-scores for each atmospheric parameter are all larger than one. This suggests that the 1D GPs underestimate uncertainty on average despite having the correct kernel to fit these systematics. This can be explained by having a very flexible noise model that fits for a separate height scale, length scale and white noise amplitude for each spectroscopic light curve. The Hybrid method shares the same hyperparameters for all light curves and performs reliably across all accuracy metrics measured, despite using the same kernel function as the 1D GPs. The 2D GP method is similar to the Hybrid method but fits for a wavelength length scale – a more conservative approach, as it does not assume a priori that systematics are wavelength-independent. This still results in reliable retrievals of all parameters measured and no statistically significant increase in the mean uncertainties on the parameters, so the more conservative analysis does not result in a significant loss in precision (although there was a minor loss in precision on the transmission spectrum, as noted in Sec. 3.7).

Table G.3 shows the results of Simulation 2, which contained wavelength-correlated systematics with a relatively short wavelength length scale (300 Å < lλ < 2250 Å). The same three methods are run for Simulation 2 as for Simulation 1, demonstrating that the 1D GP and Hybrid methods – which assume wavelength-independent systematics – perform very unreliably when systematics are correlated in wavelength. They underestimate the uncertainty in the radius ratio and slope (Z-score variances > 1) and overestimate the uncertainty in the strength of the K feature (Z-score variance < 1). The K-S statistics show that the distribution of the samples does not match the correct distribution, showing that the Hybrid method does not reliably retrieve the transmission spectrum even though the mean chi-squared value is consistent with one. The 2D GP method performs reliably across all accuracy metrics – showing that our method is robust at accounting for wavelength-correlated systematics within this range of wavelength length scales (although the sample variance of the χ² values is slightly too high, which could suggest a few outliers are present). The uncertainty in the strength of the K feature is both more reliable and much smaller than for the 1D GP and Hybrid methods, lending support to our result from the VLT/FORS2 analyses, which produced tighter constraints on Na and K in all 2D GP analyses compared to the 1D GP analyses.

Table G.4 shows the results of Simulation 3, which contained systematics with longer wavelength length scales (2250 Å < lλ < 36000 Å). While the 2D GP method was identical to that of Simulations 1 and 2, in these simulations we were investigating how the assumption of systematics being constant in wavelength performs when the systematics actually vary gradually in wavelength. We tested our 2D GP method against the 1D GP method with an initial common-mode (CM) correction, as well as against the same 2D GP method but with the wavelength length scale fixed to a large value, which is close to assuming systematics are common-mode (labelled as 2D GP (CM) in the table, although note this method does not perform a common-mode correction). Both the 1D GP (CM) and 2D GP (CM) methods were unreliable across almost all metrics, whereas the 2D GP that fit for the wavelength length scale produced robust retrievals across all metrics (the variance in the slope Z-scores may be slightly too high, although this is only a 2.6σ result). The asterisk for the 1D GP (CM) values notes that these values were calculated after subtracting the offset between the recovered transmission spectrum and the injected spectrum, because this offset does not have a large effect on atmospheric retrievals but was significantly affecting the χ² values. This offset is the result of the common-mode correction and so did not affect the other two methods, which did not have this offset correction performed for their χ² values.

Table G.1

Explanation of the tables that report the results for each set of simulations.

Table G.2

Results of Simulation 1 containing wavelength-independent systematics.

Table G.3

Results of Simulation 2 with short wavelength length scale (300Å < lλ < 2250Å) systematics.

Table G.4

Results of Simulation 3 containing long wavelength length scale (2250Å < lλ < 36000Å) systematics.

Table G.5

Results of Simulation 4 containing common-mode systematics.

Table G.5 shows the results of Simulation 4, which contained common-mode systematics. The same methods as for Simulation 3 were tested, although the assumption that the systematics are common-mode is correct for these simulations, so both the 1D GP (CM) and 2D GP (CM) methods perform much more reliably. The 1D GP (CM) method is slightly more conservative than the 2D GP (CM) method on both the constraint of the slope and the strength of the K feature; this may be because the 1D GPs are still accounting for wavelength-independent systematics in each spectroscopic light curve, while the 2D GP (CM) method assumes that only common-mode systematics and white noise are present in the data. The 2D GP method that fits for a wavelength length scale performs reliably on all metrics except the constraint of the slope, where it overestimates the uncertainties on average. This is likely due to the challenge of constraining systematics to be fully common-mode, as the 2D GP (CM) method is much better at constraining the slope. However, since we do not know for real data whether systematics are constant in wavelength or gradually varying in wavelength, the safest approach is to fit for the wavelength length scale and accept some loss in precision in the slope of the transmission spectrum if the systematics are common-mode.

Appendix H Gradient calculations

No U-Turn Sampling requires the computation of the gradient of the log-likelihood with respect to each parameter being varied. One of the benefits of JAX is that it can compute the gradient of functions with limited modification to NumPy code. Using jaxoplanet in combination with implementing the algorithms in Sec. 2.5 in JAX was sufficient to efficiently and accurately compute the gradient of the log-likelihood with respect to any of the mean function parameters.

However, these algorithms were not sufficient for computing the gradients with respect to many of the hyperparameters: the gradients JAX computed from these equations were found to be inconsistent with finite-difference estimates. It is likely that the numerical stability of the eigendecomposition was an issue when computing the gradients of hyperparameters from these formulae.
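A simple way to catch this kind of problem is to compare automatic-differentiation gradients against central finite differences. The sketch below shows such a check for a generic scalar function of several hyperparameters; the placeholder function is not the 2D GP log-likelihood.

    import jax
    import jax.numpy as jnp

    jax.config.update("jax_enable_x64", True)

    def check_gradient(fn, theta, eps=1e-6):
        # Compare jax.grad(fn) with central finite differences at theta
        auto = jax.grad(fn)(theta)
        fd = jnp.array([
            (fn(theta.at[i].add(eps)) - fn(theta.at[i].add(-eps))) / (2 * eps)
            for i in range(theta.size)
        ])
        return auto, fd

    # Placeholder scalar function standing in for the log-likelihood
    def fake_log_likelihood(theta):
        return -0.5 * jnp.sum(jnp.exp(theta) * jnp.array([1.0, 2.0, 3.0]))

    auto, fd = check_gradient(fake_log_likelihood, jnp.array([0.1, -0.3, 0.5]))
    print(jnp.max(jnp.abs(auto - fd)))    # should be small if the gradients agree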

Fortunately, it is possible to rewrite the gradient of the log-likelihood in a way that is much less sensitive to numerical stability issues compared to the default method used by JAX. A method for analytically calculating the gradient of the log-likelihood with respect to hyperparameters was demonstrated in Rakitsch et al. (2013). The method is included here, although the calculation of the derivative of the log-determinant of the covariance matrix has been changed. The method presented in Rakitsch et al. (2013) would require additional eigendecompositions of the matrices Kλ and Kt, as well as of the transformed matrices, in order to obtain the gradient with respect to all hyperparameters. It also would have no way of obtaining the gradient of a hyperparameter which appears in both Kronecker product terms, for example in both Kλ and Σλ. The method presented here is therefore faster, more general and more numerically stable.

Suppose θ is a hyperparameter of the covariance matrix K, with the residuals r independent of θ (i.e. ∂r/∂θ = 0). We can find ∂ log L/∂θ by calculating:

∂ log L/∂θ = −(1/2) ∂(rᵀ K⁻¹ r)/∂θ − (1/2) ∂ log|K|/∂θ (H.1)

= (1/2) rᵀ K⁻¹ (∂K/∂θ) K⁻¹ r − (1/2) tr(K⁻¹ ∂K/∂θ), (H.2)

where we can use the following two identities from matrix calculus:

∂(K⁻¹)/∂θ = −K⁻¹ (∂K/∂θ) K⁻¹, (H.3)

∂ log|K|/∂θ = tr(K⁻¹ ∂K/∂θ). (H.4)

The advantage of these identities is that they allow us to rewrite the gradient of the log-likelihood in terms of ∂K/∂θ instead of ∂(K⁻¹)/∂θ or ∂ log|K|/∂θ (which by default JAX fails to calculate in a numerically stable way). It was found that this approach results in gradients of the log-likelihood that are consistent with finite-difference methods.

Suppose θ is a parameter of Kλ and of none of the other matrices in K; then:

K = Kλ ⊗ Kt + Σλ ⊗ Σt, (H.5)

∂K/∂θ = (∂Kλ/∂θ) ⊗ Kt. (H.6)

This allows us to efficiently calculate the first term in Eq. (H.2) using the identity in Eq. (H.3):

∂(rᵀ K⁻¹ r)/∂θ = −rᵀ K⁻¹ (∂K/∂θ) K⁻¹ r (H.7)

= −αᵀ [(∂Kλ/∂θ) ⊗ Kt] α, (H.8)

where α = K⁻¹ r can be efficiently computed using Eq. (26), and [(∂Kλ/∂θ) ⊗ Kt] α can be computed using Eq. (14) as it is of the form [A ⊗ B]c.

The second term in Eq. (H.2) may be computed as follows. We will use the identity in Eq. (H.4), but first it is helpful to expand out the K⁻¹ term using the factorisation described in Sec. 2.5:

K⁻¹ = (Qλ Λλ^(−1/2) ⊗ Qt Λt^(−1/2)) (K̃λ ⊗ K̃t + I)⁻¹ (Λλ^(−1/2) Qλᵀ ⊗ Λt^(−1/2) Qtᵀ), (H.9)

where Σλ = Qλ Λλ Qλᵀ and Σt = Qt Λt Qtᵀ are the eigendecompositions of Σλ and Σt, and K̃λ = Λλ^(−1/2) Qλᵀ Kλ Qλ Λλ^(−1/2) and K̃t = Λt^(−1/2) Qtᵀ Kt Qt Λt^(−1/2) are the transformed matrices.

We then factorise the middle term similarly to Eq. (19), using the eigendecompositions K̃λ = Q̃λ Λ̃λ Q̃λᵀ and K̃t = Q̃t Λ̃t Q̃tᵀ of the transformed matrices:

(K̃λ ⊗ K̃t + I)⁻¹ = (Q̃λ ⊗ Q̃t) (Λ̃λ ⊗ Λ̃t + I)⁻¹ (Q̃λᵀ ⊗ Q̃tᵀ). (H.10)

Plugging this in and using the properties of Kronecker product algebra:

K⁻¹ = (Wλ ⊗ Wt) D⁻¹ (Wλᵀ ⊗ Wtᵀ), (H.11)

where we have defined:

Wλ = Qλ Λλ^(−1/2) Q̃λ, (H.12)

Wt = Qt Λt^(−1/2) Q̃t, (H.13)

D = Λ̃λ ⊗ Λ̃t + I. (H.14)

We note that while it may look as if we have found the eigendecomposition of the covariance matrix K, the matrices Wλ and Wt can clearly be shown not to be eigenvector matrices, as they are not orthogonal (i.e. the product Wλᵀ Wλ does not equal the identity matrix, unlike for an actual eigendecomposition). Similarly, the diagonal matrix D does not contain the eigenvalues of K, as can be seen by noting that the product of its diagonal entries would not in general produce the same determinant as given in Eq. (27). Nonetheless, it is a useful way of writing K⁻¹, which we can now plug into Eq. (H.4):

tr(K⁻¹ ∂K/∂θ) = tr[(Wλ ⊗ Wt) D⁻¹ (Wλᵀ ⊗ Wtᵀ) ((∂Kλ/∂θ) ⊗ Kt)]. (H.15)

We can use the cyclic property of the trace:

tr(K⁻¹ ∂K/∂θ) = tr[D⁻¹ (Wλᵀ ⊗ Wtᵀ) ((∂Kλ/∂θ) ⊗ Kt) (Wλ ⊗ Wt)]. (H.16)

This allows us to simplify further:

tr(K⁻¹ ∂K/∂θ) = tr[D⁻¹ ((Wλᵀ (∂Kλ/∂θ) Wλ) ⊗ (Wtᵀ Kt Wt))]. (H.17)

Finally, D⁻¹ is a diagonal matrix and the trace only depends on the diagonal of the overall matrix, so we can equivalently write this expression as a dot product of the diagonals of the two matrices and make use of the fact that the diagonal of a Kronecker product matrix is the Kronecker product of the diagonals:

tr(K⁻¹ ∂K/∂θ) = diag(D⁻¹)ᵀ [diag(Wλᵀ (∂Kλ/∂θ) Wλ) ⊗ diag(Wtᵀ Kt Wt)], (H.18)

where diag(A) denotes the vector formed from the diagonal of some matrix A.

Equations H.8 and H.18 permit the efficient calculation of the gradient of the log-likelihood with respect to a hyperparameter of Kλ. By taking advantage of the Kronecker product structure, the memory requirement is still O(M² + N²) and we do not require the calculation of any additional eigendecompositions beyond those needed for the log-likelihood calculation. The mathematics follows similarly for calculating the gradients of parameters of Kt, Σλ or Σt.
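To make the bookkeeping concrete, the sketch below evaluates a trace of the form in Eq. (H.18) using only the diagonals of the two small ‘sandwich’ products, and checks the result against the dense expression. The matrix names and shapes are illustrative and not tied to the LUAS implementation.

    import numpy as np

    def trace_term(W_lam, W_t, dK_lam, K_t, D_diag):
        # Evaluate tr(K^-1 dK/dtheta) via the diagonal form of Eq. (H.18):
        # only diag(W^T dK W) and diag(W^T K_t W) are required.
        d_lam = np.einsum("ij,jk,ki->i", W_lam.T, dK_lam, W_lam)
        d_t = np.einsum("ij,jk,ki->i", W_t.T, K_t, W_t)
        return np.dot(1.0 / D_diag, np.kron(d_lam, d_t))

    # Consistency check against the dense expression on small random matrices
    rng = np.random.default_rng(5)
    M, N = 4, 6
    W_lam, W_t = rng.normal(size=(M, M)), rng.normal(size=(N, N))
    dK_lam = rng.normal(size=(M, M)); dK_lam = dK_lam + dK_lam.T
    K_t = rng.normal(size=(N, N)); K_t = K_t + K_t.T
    D_diag = 1.0 + rng.random(M * N)

    dense = np.trace(
        np.kron(W_lam, W_t) @ np.diag(1.0 / D_diag) @ np.kron(W_lam.T, W_t.T)
        @ np.kron(dK_lam, K_t)
    )
    assert np.isclose(trace_term(W_lam, W_t, dK_lam, K_t, D_diag), dense)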

Appendix I Hessian calculations

Similar calculations to those in Appendix H can be performed in order to efficiently calculate the Hessian of the log-likelihood with respect to all of the mean function parameters and hyperparameters. This is implemented in the LUAS GitHub repository but is excluded here for brevity.

The benefit of computing the Hessian of the log-likelihood is that we can use it to calculate a Laplace approximation of the posterior. This is an analytic way of estimating the posterior using the best-fit values of all the parameters and the Hessian at the location of best fit. It works by taking a second-order Taylor expansion of the log-posterior around the mode of the posterior (see Bishop & Nasrabadi 2007 for more information about the Laplace approximation); that is, it approximates the posterior as the Gaussian which has the same mode and the same Hessian at that mode. It was found that this approximation produces an accurate enough fit to the posterior that using it to calculate the mass matrix of the No U-Turn Sampler significantly improved the efficiency of sampling.
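A minimal sketch of a Laplace approximation built with JAX automatic differentiation; the placeholder log-posterior and the optimiser settings are illustrative and do not reproduce the procedure used for the VLT/FORS2 fits.

    import numpy as np
    import jax
    import jax.numpy as jnp
    from scipy.optimize import minimize

    jax.config.update("jax_enable_x64", True)

    # Placeholder log-posterior standing in for the transit + 2D GP log-posterior
    def log_posterior(p):
        return -0.5 * jnp.sum((p - jnp.array([1.0, -2.0, 0.5])) ** 2
                              / jnp.array([0.1, 0.4, 0.2]))

    neg_lp = jax.jit(lambda p: -log_posterior(p))
    neg_lp_grad = jax.jit(jax.grad(neg_lp))

    # Find the posterior mode (MAP) with a gradient-based optimiser
    res = minimize(lambda p: float(neg_lp(jnp.array(p))),
                   x0=np.zeros(3),
                   jac=lambda p: np.asarray(neg_lp_grad(jnp.array(p))),
                   method="L-BFGS-B")
    p_map = jnp.array(res.x)

    # Laplace approximation: Gaussian centred on p_map with covariance equal to
    # the inverse of the Hessian of the negative log-posterior at the mode
    cov_laplace = jnp.linalg.inv(jax.hessian(neg_lp)(p_map))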

While the Laplace approximation is useful for tuning No U-Turn Sampling, it can also be seen as a very computationally inexpensive way of approximating the posterior (taking < 10 seconds for the VLT/FORS2 datasets) which avoids expensive inference techniques such as MCMC. While the results are not as robust as a full MCMC retrieval, making this approximation may permit the application of 2D GPs to very large datasets for which a full exploration of the posterior would otherwise be unfeasible. It is also a useful check before running an MCMC (which may take hours) to see whether the Laplace approximation produces a reasonable fit to the data.

Appendix J Numerical stability with eigendecomposition

The numerical stability of eigendecomposition is dependent on the condition number of the matrix in question. For any covariance matrix, the condition number is given by the ratio of the largest to smallest eigenvalue of that matrix. Matrix inversion is generally more prone to numerical errors for matrices with large condition numbers.

This is often not a major issue for 1D GPs because it is common to have terms in the kernel that are added to the diagonal, such as white noise terms. Adding to the diagonal generally reduces the condition number: if a constant c is added to all of the diagonal elements of a covariance matrix, then all of its eigenvalues increase by c, which always decreases the ratio of the largest to smallest eigenvalue. However, due to the way this method splits up the covariance matrix into different sums of Kronecker products, the component wavelength and time covariance matrices may not all have terms added to their diagonals. For the kernel function used to fit the VLT/FORS2 data (shown in Eq. 30), the matrix accounting for time-correlated systematics Kt is calculated from a squared-exponential kernel without any terms added to the diagonal, potentially resulting in a high condition number which may produce significant numerical errors.
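A quick numerical illustration of this effect for a squared-exponential kernel matrix; the grid, length scale and regularisation constant below are arbitrary choices for demonstration only.

    import numpy as np

    t = np.linspace(0.0, 1.0, 100)
    K_t = np.exp(-0.5 * (t[:, None] - t[None, :]) ** 2 / 0.3 ** 2)   # squared-exponential kernel

    eig = np.linalg.eigvalsh(K_t)
    eig_reg = np.linalg.eigvalsh(K_t + 1e-5 * np.eye(t.size))        # diagonal increased by c = 1e-5

    # Without regularisation the smallest eigenvalue is close to (or below) machine precision
    print("smallest eigenvalue before/after:", eig.min(), eig_reg.min())
    print("condition number after adding c:", eig_reg.max() / eig_reg.min())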

The approach used to deal with these numerical errors was a form of regularisation, sometimes referred to as ridge regression or Tikhonov regularisation (Hoerl & Kennard 1970). A constant was added to the diagonal elements of the time covariance matrix Kt in order to increase all of its eigenvalues, often referred to as regularising the matrix. The value of the constant used was likely larger than was necessary to guarantee stability, as it was unclear how to determine an optimal value. The diagonal values – which were all equal to one – were increased by 10⁻⁵. To understand the effect this could have had, we can treat the added constant like any other term in the kernel and consider what type of noise it accounts for. The actual kernel function used for the VLT analysis was therefore effectively:

K = Kλ ⊗ (Kt + c I) + Σλ ⊗ Σt, (J.1)

with c = 10⁻⁵.

This new term multiplies both the hCM and hWSS terms in the wavelength kernel function Kλ that it is multiplying. Since Σt is the identity matrix in time (the high-frequency systematics are time-independent and the white noise is uncorrelated in time), we can rearrange this kernel function as:

K = Kλ ⊗ Kt + (c Kλ + Σλ) ⊗ Σt. (J.2)

This rearrangement of the equation shows that by adding the constant c to the diagonal of the first covariance matrix for the time dimension, we can think of it as instead adding a second high-frequency systematics term with height scale √c hCM and the same wavelength length scale as the common-mode term, as well as changing the amplitude of the white noise terms.

Changing the amplitude of the white noise terms should not be an issue since we freely fit for each white noise term. The effect is simply that the retrieved white noise values are slightly smaller than the actual level of white noise the kernel is fitting. For the best-fit values of the 600RI and 600B analogous 2D GP fits, the maximum decrease for any retrieved white noise value would be 8.8 × 10⁻⁹ with c = 10⁻⁵.

The only negative effect regularisation might have had is from this additional high-frequency systematics term, which has height scale √c hCM and a different wavelength length scale to the actual high-frequency systematics term in the kernel. Suppose the common-mode height scale hCM is ten times greater than the high-frequency systematics height scale hHFS. The additional term in the kernel would then have a height scale of √c hCM = √(10⁻⁵) × 10 hHFS ≈ 0.032 hHFS, that is about 3.2% of hHFS. If both wavelength length scales were identical, this would result in the MCMC retrieving hHFS to be only ≈ 0.05% smaller, but would have no effect on the retrieved transmission spectrum. Since the two wavelength length scales were different, this essentially resulted in the kernel function used for the high-frequency systematics very slightly deviating from a squared-exponential kernel, as it instead becomes the sum of two squared-exponential kernels with different length scales. Overall, this should have a very small effect.

For future work, it may be possible to avoid the addition of a constant to the diagonal. This is because, to calculate the log-likelihood, the eigenvalues of the Kλ and Kt matrices are not used directly; instead, the eigenvalues of the transformed matrices with the identity matrix added (as in K̃λ ⊗ K̃t + I) are used, and this addition of the identity matrix should regularise the matrix. The eigenvalues of the matrices Σλ and Σt will still need to be calculated directly, however; but as white noise terms are likely needed in any suitable kernel, these matrices can be chosen to include the white noise terms, which naturally regularises them.

Appendix K Extension of method to higher dimensions

The optimisations developed in this paper could be extended to higher dimensions with limited extra work. For the case of a 3D Gaussian process, given that the data lie on a complete 3D grid and the covariance matrix is of the form:

K = Kλ ⊗ Kt ⊗ Kn + Σλ ⊗ Σt ⊗ Σn, (K.1)

then K⁻¹ could be factorised using a similar method by first taking the eigendecompositions of Σλ, Σt and Σn and using them to transform Kλ, Kt and Kn. The eigendecompositions of these transformed matrices could then be calculated and K⁻¹ could be factorised similarly to Eq. (26). Saatchi (2011) contains algorithms to efficiently calculate matrix-vector products for any number of Kronecker product dimensions, including calculating [A ⊗ B ⊗ C]d for three dimensions (analogous to Eq. 14). Although this was not implemented, any of the algorithms described in this work could conceivably be generalised to higher dimensions using this approach.
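A sketch of how such a multi-factor Kronecker matrix-vector product can be evaluated by successive reshapes, following the general pattern described in Saatchi (2011); the function below is illustrative and is not part of LUAS.

    import numpy as np

    def kron_mvm(mats, x):
        # Compute (mats[0] kron mats[1] kron ...) @ x without forming the product.
        # Each factor is applied in turn via a reshape, a matrix product and a
        # transpose, so the cost scales with the sizes of the small factors only.
        for A in mats:
            n = A.shape[0]
            x = (A @ x.reshape(n, -1)).T.reshape(-1)
        return x

    # Check against the dense Kronecker product for three small factors
    rng = np.random.default_rng(6)
    K_lam, K_t, K_n = (rng.normal(size=(d, d)) for d in (3, 4, 5))
    d_vec = rng.normal(size=3 * 4 * 5)
    assert np.allclose(kron_mvm([K_lam, K_t, K_n], d_vec),
                       np.kron(np.kron(K_lam, K_t), K_n) @ d_vec)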

An example within transmission spectroscopy where this could be useful is to study whether the systematics from different transit observations using the same instrument are correlated. In that case, the three dimensions could be wavelength, time and transit number. Many hyperparameters would need to be shared between different transit observations, however, so this may not be a reasonable approach for most datasets.

Appendix L Visualisations of 2D GP analyses of the VLT/FORS2 data

Fig. L.1

Heat map of correlation matrices of transmission spectra for 600B data. Left: From the analogous 2D GP analysis, visualising how the uncertainty in each transit depth is correlated across the spectrum. Right: For the 2D GP analysis with no added prior on the mean radius (from Sect. 4.4.2). In both plots, the strength of correlation is not significantly affected by the wavelength separation between light curves for most wavelengths. The right plot in particular shows significant correlations that are mostly independent of wavelength-separation, consistent with significant uncertainty in the offset of the transmission spectrum as all wavelengths are approximately affected equally.

Fig. L.2

Same as Fig. L.1 but for 600RI data. Both plots demonstrate that transit depths at large wavelength separations in the dataset are less correlated than for closer separations, which likely explains the significant uncertainty in the slope of the transmission spectrum for this analysis.

Fig. L.3

Fitted transit light curves for 600B data, showing GP predictive mean in red as well as 1σ and 2σ uncertainty in mean shaded in grey. Left: Fitting the raw light curves. Middle: The same fit minus the best-fit transit model, displaying just the GP fit to the systematics. Right: The same fit minus both the best-fit transit model and the GP mean (fitting the systematics) subtracted, showing the residual white noise.

Fig. L.4

Same plot as Fig. L.3 but for 600RI data.

Fig. L.5

Corner plot for 600B data fit with a 2D GP while marginalising over T0, a/R* and b (from Sect. 4.4.3). Only the wavelength-dependent parameters from two of the light curves are included (from the wavelength bands immediately left of and centred on the Na feature), in addition to any parameters shared by all light curves. Note that one of the wavelength length scales is consistent with the maximum prior limit, showing that these systematics may be constant in wavelength, while the other wavelength length scale is constrained within the prior bounds.

Fig. L.6

Corner plot for 600RI data for the analogous 2D GP fit (from Sect. 4.4.1). The wavelength-dependent parameters included are from the light curves centred on the Na and K features. We note that both of the wavelength length scales are constrained within the prior bounds for this dataset.

References

1. Ahrer, E., Wheatley, P. J., Kirk, J., et al. 2022, MNRAS, 510, 4857
2. Ahrer, E.-M., Stevenson, K. B., Mansfield, M., et al. 2023, Nature, 614, 653
3. Aigrain, S., & Foreman-Mackey, D. 2023, ARA&A, 61, 329
4. Alderson, L., Wakeford, H. R., Alam, M. K., et al. 2023, Nature, 614, 664
5. Anderson, D. R., Collier Cameron, A., Hellier, C., et al. 2011, A&A, 531, A60
6. Appenzeller, I., Fricke, K., Fürtig, W., et al. 1998, The Messenger, 94, 1
7. Astropy Collaboration (Price-Whelan, A. M., et al.) 2022, ApJ, 935, 167
8. Barragán, O., Aigrain, S., Rajpaul, V. M., & Zicher, N. 2022, MNRAS, 509, 866
9. Bell, T. J., Nikolov, N., Cowan, N. B., et al. 2017, ApJ, 847, L2
10. Bishop, C. M., & Nasrabadi, N. M. 2007, J. Electron. Imaging, 16, 049901
11. Bonomo, A. S., Desidera, S., Benatti, S., et al. 2017, A&A, 602, A107
12. Boone, K. 2019, AJ, 158, 257
13. Bradbury, J., Frostig, R., Hawkins, P., et al. 2018, JAX: composable transformations of Python+NumPy programs
14. Brandt, S. 2014, Data Analysis (Springer International Publishing)
15. Brown, T. M. 2001, ApJ, 553, 1006
16. Carter, J. A., & Winn, J. N. 2009, ApJ, 704, 51
17. Charbonneau, D., Brown, T. M., Noyes, R. W., & Gilliland, R. L. 2002, ApJ, 568, 377
18. Delisle, J. B., Unger, N., Hara, N. C., & Ségransan, D. 2022, A&A, 659, A182
19. Diamond-Lowe, H., Berta-Thompson, Z., Charbonneau, D., Dittmann, J., & Kempton, E. M. R. 2020, AJ, 160, 27
20. Espinoza, N., Rackham, B. V., Jordán, A., et al. 2019, MNRAS, 482, 2065
21. Evans, T. M., Pont, F., Sing, D. K., et al. 2013, ApJ, 772, L16
22. Fakhouri, H. K., Boone, K., Aldering, G., et al. 2015, ApJ, 815, 58
23. Foreman-Mackey, D. 2016, J. Open Source Softw., 1, 24
24. Foreman-Mackey, D., & Garcia, L. J. 2023, jaxoplanet: Astronomical time series analysis with JAX
25. Foreman-Mackey, D., Agol, E., Ambikasaran, S., & Angus, R. 2017, AJ, 154, 220
26. Foreman-Mackey, D., Luger, R., Agol, E., et al. 2021, exoplanet: Gradient-based probabilistic inference for exoplanet data & other astronomical time series
27. Gelman, A., & Rubin, D. B. 1992, Stat. Sci., 7, 457
28. Gibson, N. P. 2014, MNRAS, 445, 3401
29. Gibson, N. P., Aigrain, S., Pont, F., et al. 2012a, MNRAS, 422, 753
30. Gibson, N. P., Aigrain, S., Roberts, S., et al. 2012b, MNRAS, 419, 2683
31. Gibson, N. P., Nikolov, N., Sing, D. K., et al. 2017, MNRAS, 467, 4591
32. Gibson, N. P., de Mooij, E. J. W., Evans, T. M., et al. 2019, MNRAS, 482, 606
33. Gordon, T. A., Agol, E., & Foreman-Mackey, D. 2020, AJ, 160, 240
34. Greene, T. P., Bell, T. J., Ducrot, E., et al. 2023, Nature, 618, 39
35. Harris, C. R., Millman, K. J., van der Walt, S. J., et al. 2020, Nature, 585, 357
36. Hoerl, A. E., & Kennard, R. W. 1970, Technometrics, 12, 55
37. Hoffman, M. D., & Gelman, A. 2014, J. Mach. Learn. Res., 15, 1593
38. Holmberg, M., & Madhusudhan, N. 2023, MNRAS, 524, 377
39. Hunter, J. D. 2007, Comput. Sci. Eng., 9, 90
40. Ih, J., & Kempton, E. M. R. 2021, AJ, 162, 237
41. Jensen, C. S., Kjærulff, U., & Kong, A. 1995, Int. J. Hum.-Comput. Stud., 42, 647
42. Kokori, A., Tsiaras, A., Edwards, B., et al. 2022, ApJS, 258, 40
43. Kreidberg, L. 2015, PASP, 127, 1161
44. Kumar, R., Carroll, C., Hartikainen, A., & Martin, O. 2019, J. Open Source Softw., 4, 1143
45. Lecavelier Des Etangs, A., Pont, F., Vidal-Madjar, A., & Sing, D. 2008, A&A, 481, L83
46. Liu, D. C., & Nocedal, J. 1989, Math. Program., 45, 503
47. Llorente, F., Martino, L., Curbelo, E., López-Santiago, J., & Delgado, D. 2023, WIREs Computat. Stat., 15, e1595
48. Maguire, C., Gibson, N. P., Nugroho, S. K., et al. 2023, MNRAS, 519, 1030
49. Mandel, K., & Agol, E. 2002, ApJ, 580, L171
50. May, E. M., Gardner, T., Rauscher, E., & Monnier, J. D. 2020, AJ, 159, 7
51. McCullough, P. R., Crouzet, N., Deming, D., & Madhusudhan, N. 2014, ApJ, 791, 55
52. McGruder, C. D., López-Morales, M., Espinoza, N., et al. 2020, AJ, 160, 230
53. Meech, A., Aigrain, S., Brogi, M., & Birkby, J. L. 2022, MNRAS, 512, 2604
54. Nasedkin, E., Mollière, P., Wang, J., et al. 2023, A&A, 678, A41
55. Neal, R. M. 2003, Ann. Stat., 31, 705
56. Neal, R. 2011, in Handbook of Markov Chain Monte Carlo, 113
57. Nikolov, N., Sing, D. K., Gibson, N. P., et al. 2016, ApJ, 832, 191
58. The pandas development team 2020, https://doi.org/10.5281/zenodo.3509134
59. Panwar, V., Désert, J.-M., Todorov, K. O., et al. 2022, MNRAS, 510, 3236
60. Parviainen, H., & Aigrain, S. 2015, MNRAS, 453, 3821
61. Patel, J. A., & Espinoza, N. 2022, AJ, 163, 228
62. Pérez, F., & Granger, B. E. 2007, Comput. Sci. Eng., 9, 21
63. Powell, D., Louden, T., Kreidberg, L., et al. 2019, ApJ, 887, 170
64. Rackham, B., Espinoza, N., Apai, D., et al. 2017, ApJ, 834, 151
65. Rackham, B. V., Apai, D., & Giampapa, M. S. 2018, ApJ, 853, 122
66. Rajpaul, V., Aigrain, S., Osborne, M. A., Reece, S., & Roberts, S. 2015, MNRAS, 452, 2269
67. Rakitsch, B., Lippert, C., Borgwardt, K., & Stegle, O. 2013, in Advances in Neural Information Processing Systems, 26
68. Rasmussen, C. E., & Williams, C. K. I. 2006, Gaussian Processes for Machine Learning, Adaptive Computation and Machine Learning (MIT Press)
69. Rustamkulov, Z., Sing, D. K., Mukherjee, S., et al. 2023, Nature, 614, 659
70. Saatchi, Y. 2011, PhD thesis, University of Cambridge
71. Salvatier, J., Wiecki, T. V., & Fonnesbeck, C. 2016, PeerJ Comput. Sci., 2, e55
72. Seager, S., & Sasselov, D. D. 2000, ApJ, 537, 916
73. Sedaghati, E., Boffin, H. M. J., Jeřabková, T., et al. 2016, A&A, 596, A47
74. Sedaghati, E., Boffin, H. M. J., MacDonald, R. J., et al. 2017, Nature, 549, 238
75. Sedaghati, E., MacDonald, R. J., Casasayas-Barris, N., et al. 2021, MNRAS, 505, 435
76. Sing, D. K., Wakeford, H. R., Showman, A. P., et al. 2015, MNRAS, 446, 2428
77. Sing, D. K., Fortney, J. J., Nikolov, N., et al. 2016, Nature, 529, 59
78. Spyratos, P., Nikolov, N. K., Constantinou, S., et al. 2023, MNRAS, 521, 2163
79. Virtanen, P., Gommers, R., Oliphant, T. E., et al. 2020, Nat. Methods, 17, 261
80. Wilson, J., Gibson, N. P., Lothringer, J. D., et al. 2021, MNRAS, 503, 4787
81. Zieba, S., Kreidberg, L., Ducrot, E., et al. 2023, Nature, 620, 746

1

luas is the word for speed in the Irish language, pronounced like ‘Lewis’

3

Ignoring the constant normalisation term which is not required for MCMC inference.

4

Model selection may still be required for the choice of kernel function (e.g. squared exponential) but this is an issue with GPs in general.

5

List downloaded on 2022 October 19.

6

For numerical reasons we could not set lλ = 0 but instead set lλ = 0.1 Å.

7

The common-mode systematics have a much larger amplitude than the other systematics, requiring these different approaches.

8

Specifically the square root of the mean transit depth as we were fitting for transit depths.

9

If this is surprising note that the correlations in the uncertainties restricts the space of models consistent with the recovered spectra.

10

Running all analyses on a 2020 M1 MacBook Pro using the CPU.

11

Alternating between NUTS and other MCMC methods is especially common when sampling both continuous and discrete variables (Salvatier et al. 2016).

