Inferring high-redshift large-scale structure dynamics from the Lyman-α forest

One of the major science goals over the coming decade is to test fundamental physics with probes of the cosmic large-scale structure out to high redshift. Here we present a fully Bayesian approach to infer the three-dimensional cosmic matter distribution and its dynamics at z > 2 from observations of the Lyman-α forest. We demonstrate that the method recovers the unbiased mass distribution and the correct matter power spectrum at all scales. Our method infers the three-dimensional density ﬁeld from a set of one-dimensional spectra, interpolating the information between the lines of sight. We show that our algorithm provides unbiased mass proﬁles of clusters, becoming an alternative for estimating cluster masses complementary to weak lensing or X-ray observations. The algorithm employs a Hamiltonian Monte Carlo method to generate realizations of initial and evolved density ﬁelds and the three-dimensional large-scale ﬂow, revealing the cosmic dynamics at high redshift. The method correctly handles multi-modal parameter distributions, which allow constraining the physics of the intergalactic medium with high accuracy. We performed several tests using realistic simulated quasar spectra to test and validate our method. Our results show that detailed and physically plausible inference of three-dimensional large-scale structures at high redshift has become feasible.


Introduction
Currently, cosmology is at the crossroads.While the standard model of cosmology ΛCDM fits the bulk of cosmological observations to extraordinary accuracy, some tensions between the model and observations seem to persist and increase (Planck Collaboration et al. 2016b, 2019).Amongst those are the H 0 and σ 8 tensions (e.g.Planck Collaboration et al. 2016a;Riess et al. 2016;Köhlinger et al. 2017;Riess et al. 2018;Abbott et al. 2018;Rusu et al. 2019).Additionally, Lyman-α auto-correlation and cross-correlations with quasar data reported a 2.3σ tension with the flat ΛCDM prediction obtained from Planck observations (see e.g.Delubac et al. 2015;du Mas des Bourboux et al. 2017).
The resolution of these tensions may encompass systematic effects but may also be the first signs of new physics indicated by novel cosmological data of increasing quality.New observations and better control on systematic effects in data analyses are inevitable to gain new insights into the physical processes driving the evolution of the universe.For this reason, in this work, we present a novel and statistically rigorous approach to extract cosmologically relevant and significant information from highredshift Lyman-α forest observations tracing the dynamic evolution of cosmic structures.
Detailed analyses of the spatial distribution of matter and growth of cosmic structures can provide significant information to discriminate between models of homogeneous dark energy and modifications of gravity, determine neutrino masses and their mass hierarchy and investigate the dynamical clustering behaviour of warm or cold dark matter (e.g.Colombi et al. 1996;Huterer et al. 2015;Basilakos & Nesseris 2016;Frenk & White 2012;Boyle & Komatsu 2018;Mishra-Sharma et al. 2018).
To harvest this information, cosmology now turns to analyze the inhomogeneous matter distribution with next-generation galaxy surveys such as the Large Synoptic Survey Telescope (LSST) and the Euclid satellite mission probing the galaxy distribution out to redshifts z ∼ 3 and beyond (LSST Science Collaboration et al. 2009a;Refregier et al. 2010;Laureijs et al. 2011;Alam et al. 2016).
While most of this research focuses on studying the cosmic large-scale structure with galaxy clustering and weak lensing observations, the Lyman-α (Ly-α) forest has the potential to provide important complementary information.Firstly, the Ly-α forest probes the matter distribution at higher redshift with higher spatial resolution than can be achieved with galaxy sampling rates.By probing scales down to 1 Mpc, the Ly-α forest is sensitive to neutrino masses and dark matter models (Viel et al. 2013;Rossi 2014;Palanque-Delabrouille et al. 2015;Rossi et al. 2015;Yèche et al. 2017).Secondly, while galaxy clustering probes high-density regions, the Ly-α forest is particularly sensitive to under-densities in the matter distribution (Peirani et al. 2014;Sorini et al. 2016).In addition, the Ly-α at z > 2 is redshifted to an optical band for which the atmosphere is transparent.The Ly-α emission becomes then one of the more powerful Fig. 1: Flow chart depicting the iterative block sampling approach of the BORG inference framework.As can be seen, the BORG algorithm first infers three-dimensional initial and evolved density fields from quasar spectra using assumed astrophysical parameters of the FGPA model.Then, using realizations of the evolved density field, the algorithm will update these astrophysical parameters and a new iteration of the process begins.Iterating this procedure results in a valid Markov Monte Carlo Chain that will correctly explore the joint posterior distribution of the three-dimensional matter distribution and astrophysical properties underlying Ly-α forest observations.probes at redshifts that are otherwise challenging to access with ground-based galaxy surveys (Lee 2016).
Specifically, auto-correlations of the Ly-α forest along the line of sight have been used to infer cosmological parameters which agree by and large with CMB observations (e.g Seljak et al. 2006;Viel et al. 2006;Slosar et al. 2011;Busca et al. 2013;Bautista et al. 2017;Blomqvist et al. 2019).Ly-α forest data has also been used to constrain neutrino masses (Palanque-Delabrouille et al. 2015;Rossi et al. 2015;Yèche et al. 2017), test warm dark matter models (Viel et al. 2013) and study the thermal history of the intergalactic medium (Nasir et al. 2016;Boera et al. 2019).
In line with these promises, a large number of ongoing surveys mapping the Ly-α forest at higher redshifts has been started.The eBOSS observations in the SDSS DR14 are expected to provide the spectra of 435 000 quasars over 7500 deg 2 and re-observe the lines of sight from the previous data release with low signal-to-noise (SNR<3, Myers et al. 2015).DESI (Dark Energy Spectroscopic Instrument, Levi et al. 2013;DESI Collaboration et al. 2016) will map the Ly-α forest over a larger fraction of the sky (14 000 deg 2 ), targeting 50 high-redshift quasars per deg 2 .Increasing the density of tracers, the ongoing CLAMATO survey (COSMOS Lyman Alpha Mapping and Tomography, Lee et al. 2017) maps the Ly-α forest over a small part of the sky (600 arcmin 2 ) with a separation between lines of sight of 2.4 h −1 Mpc, 20 times smaller than the BOSS survey (Eisenstein et al. 2011;Dawson et al. 2013)  At present, major analyses of the Ly-α forest focus only on the analysis of the matter power spectrum (e.g.Croft et al. 1998;Seljak et al. 2006;Viel et al. 2006;Bird et al. 2011;Palanque-Delabrouille et al. 2015;Rossi et al. 2015;Nasir et al. 2016;Yèche et al. 2017;Boera et al. 2019).However, these approaches ignore significant amounts of information contained in the higher-order statistics of the matter density field as generated by non-linear gravitational dynamics in the late time universe (He et al. 2018).
Capturing the full information content of the cosmic largescale structure requires a field-based approach to infer the entire three-dimensional cosmic large-scale structure from observations.This poses a particular challenge for the analyses of Ly-α forest observations, which provide sparse inherently onedimensional information along the lines of sight.Various approaches to perform three-dimensional density reconstructions from one-dimensional Ly-α forests have been proposed in the literature (e.g.Kitaura et al. 2012;Cisewski et al. 2014;Stark et al. 2015b;Ozbek et al. 2016;Horowitz et al. 2019).Gallerani et al. (2011) and Kitaura et al. (2012) proposed a Gibbs sampling scheme to jointly infer density and velocity fields and corresponding power-spectra.However, these approaches assume matter density amplitudes to be log-normally distributed.The log-normal distribution reproduces one-and two-point statistics but fails to reproduce higher-order statistics associated with the filamentary dark matter distribution.In an attempt to extrapolate information from one-dimensional quasar spectra into the three-dimensional volume, Cisewski et al. (2014) applied a local polynomial smoothing method.Ozbek et al. (2016) and Stark et al. (2015b) employed a Wiener filtering approach to reconstruct the three-dimensional density field between lines of sight of Ly-α forest data.In order to reproduce higher order statistics, Horowitz et al. (2019) recently used a large-scale optimization approach to fit a gravitational structure growth model to Ly-α data, showing that this approach allows recovering the more filamentary structure of the cosmic web.Although the approach improves over linear and isotropic Wiener filtering approaches, it shows systematic deviations of reconstructed matter power-spectra and underestimates density amplitudes at scales corresponding to the mean separation between lines of sight (Horowitz et al. 2019).
To go beyond previous approaches, in this work, we present a fully Bayesian and statistically rigorous approach to perform dynamical matter clustering analyses with high redshift Ly-α forest data while accounting for all uncertainties inherent to the observations.The aim of this work is to provide a novel and fully Bayesian approach to infer physically plausible three-dimensional density and velocity fields from Ly-α forest data.Our approach builds upon the algorithm for Bayesian Origin Reconstruction from Galaxies (BORG, Jasche & Wandelt 2013; Jasche & Lavaux 2019), which employs physical models of structure formation and sophisticated Markov Chain Monte Carlo techniques to optimally extract large-scale structure information from data and quantify corresponding uncertainties.
To make such inferences feasible, we developed a likelihood based on the fluctuating Gunn-Peterson approximation (FGPA, Gunn & Peterson 1965) and jointly constrained the astrophysical properties of the intergalactic medium.We tested and validated our approach with simulated data emulating the CLAM-ATO survey, showing that the algorithm recovers the unbiased dark-matter field.
The paper is organized as follows.Section 4 provides a brief overview of our Bayesian inference framework, BORG, as required for this work.In Section 2, we discuss the physics of the Ly-α forest underlying the data model described in Section 3. Section 5 describes the generation of simulated data employed Fig. 2: Burn-in of the posterior power spectra from the inferred initial conditions from a BORG analysis.The colour scale shows the evolution of the power spectrum with the sample number.The dashed lines indicate the underlying power spectrum and the 1-and 2-σ uncertainty limits.After the warm-up phase, the algorithm recovers the true matter power spectrum in all range of Fourier modes.
to test and validate the algorithm.The performance of the algorithm is tested in Section 6 and the results are shown in Section 7, where we also discuss possible applications of the method to scientifically exploit the Ly-α forest.Finally, Section 8 summarizes the results and discusses further extensions of the algorithm.

The physics of the Lyman-α forest
Developing a Bayesian framework to infer the matter distribution from the Ly-α forest requires to incorporate the corresponding light absorption physics into the data model.
When traversing the universe, photons emitted from quasars will be scattered by neutral hydrogen (HI) gas residing inside cosmic large-scale structures.Scattering of quasar light results in observed spectra covered with absorption features referred to as the Ly-α forest.While high-density regions obscure the light and attenuate quasar fluxes, under-dense regions are almost transparent.Therefore, the signal comes from the under-dense regions.Due to the cosmological redshift, different HI regions absorb photons from different wavelengths in the quasar spectrum (see e.g.Mo et al. 2010).Consequently, every HI absorber leaves an absorption feature to the spectrum, permitting to trace the distance and density of HI regions along the line of sight.The Ly-α forest, therefore, provides a formidable probe of the cosmic large-scale structure along the observers past light cone.
Since the Ly-α forest generated at z > 2 is redshifted to the optical atmospheric window, it can be observed by ground-based telescopes, becoming one of the more relevant probes at redshifts that are otherwise challenging to access with galaxy surveys.Observations of the Ly-α forest at lower redshift, from z = 1.5 down to z = 0 (Davé et al. 1999;Davé 2001;Williger et al. 2010), require UV space-based spectrographs as Faint Object Spectrograph in the Hubble Space Telescope (Bahcall et al. 1993(Bahcall et al. , 1996;;Weymann et al. 1998).
In contrast to galaxy surveys, requiring assumptions on galaxy formation to model galaxy biases, modelling the Ly-α forest does not involve complicated models.At z > 2, HI gas, producing the Ly-α forest, is in radiative equilibrium of photoionization due to ultraviolet (UV) background and adiabatic cooling due to the cosmic expansion (Hui & Gnedin 1997).This yields a tight relation between the temperature of the IGM and the dark matter density ρ where T 0 and γ are constants that depend on the re-ionization history and the spectral shape of the UV background sources (Hui & Gnedin 1997).
In thermal equilibrium, HI number density can be expressed as (Hui & Gnedin 1997) where x is the comoving position, α(T ) ∝ T −0.7 is the radiative recombination rate of HI, Γ is the photoionization rate of neutral hydrogen, n 0 (1 + z) 3 is the mean baryon number density at redshift z and δ is the dark matter density contrast.On scales larger than Jeans' scale (100 kpc), thermal pressure in the gas is negligible and the IGM traces the dark matter distribution with high accuracy (Peirani et al. 2014).
Combining equations ( 1) and ( 2) leads to the fluctuating Gunn-Peterson approximation (FGPA Gunn & Peterson 1965) for the transmitted flux: with z the considered absorption redshift, x the corresponding comoving distance, x the associated unit vector, β = 2−0.7(γ−1)and A ∝ (1 + z) 6 T −0.7 0 Γ −1 .Therefore, the astrophysical properties of HI gas are encoded in the parameters A and β of the FGPA model.

A data model for the Ly-α forest
To incorporate Ly-α forest observations into the Bayesian framework, we will now build a data model using the light absorption physics described in Sect. 2. Specifically, we assume the FGPA transmission model and Gaussian pixel noise for measured quasar spectra.A corresponding likelihood distribution can then be expressed as: where n labels respective quasar spectra, x indexes different volume elements intersected by the n-th quasar line of sight and δ f is the non-linear final density contrast evaluated at z = 2.5.
This likelihood is then implemented into the large-scale structure sampler of the BORG framework.The corresponding physical forward modelling approach will then proceed as follows.Using realizations of the three-dimensional field of primordial fluctuations, the dynamical structure formation model will evaluate non-linear realizations of the dark matter distribution at z = 2.5.Using these matter field realizations and the FGPA data model, BORG will predict quasar spectra that are compared to actual observations via the Gaussian likelihood of equation ( 4).More details on the sampling procedure are given in Appendix A. To efficiently implement this non-linear and non-Gaussian forward modeling approach into an MCMC framework, we employed a Hamiltonian Monte Carlo (HMC) method, as described in Appendix B.
In this work, we focus on the inference of the spatial dark matter distribution and its dynamics.For this reason, we will treat the parameters A and β mostly as nuisance parameters.Fol-lowing similar approaches described in previous works (Jasche & Wandelt 2013;Jasche & Lavaux 2017;Kodi Ramanah et al. 2018;Jasche & Lavaux 2019), we will use efficient slice sampling techniques to sample these parameters.We note that these parameters can carry valuable information on the astrophysical properties of the IGM.As illustrated in section 7.6, our inference can be used to also make statements on the astrophysics of the IGM.
The entire iterative sampling approach is depicted in Figure 1.As can be seen, the method iteratively infers threedimensional matter density fields and parameters of the FGPA model, resulting in a valid MCMC approach to jointly explore density fields and astrophysical parameters.

Method
As mentioned above, this work extends our previously developed BORG algorithm in order to analyze the spatial matter distribution underlying Ly-α forest observations.Here we will provide only a brief summary of the algorithm and corresponding concepts, but interested readers will find more detailed descriptions in our previous works (Jasche & Wandelt 2013;Jasche et al. 2015;Lavaux & Jasche 2016;Jasche & Lavaux 2017, 2019).
The BORG algorithm is a large-scale Bayesian inference framework aiming at inferring the non-linear spatial dark matter distribution and its dynamics from cosmological data sets.The underlying idea is to fit full dynamical gravitational structure formation models to observations tracing the spatial distribution of matter.Using non-linear structure growth models, the BORG algorithm can exploit the full statistical power of higherorder statistics imprinted to the matter distribution by gravitational clustering.In fitting dynamical structure growth models to data, the task of inferring non-linear matter density fields turns into a statistical initial conditions problem aiming at inferring the spatial distribution of primordial matter fluctuations from which present structures formed via gravitational structure growth.As such the BORG algorithm naturally links primordial initial conditions to late time observations and permits to infer dynamical properties as well as the structure formation history from present galaxy observations (Jasche & Wandelt 2013;Jasche et al. 2015;Lavaux & Jasche 2016;Jasche & Lavaux 2017, 2019).
The BORG algorithm employs a large-scale structure posterior distribution based on the well-developed prior understanding of almost Gaussian primordial density fluctuations and gravitational structure formation to predict physically plausible realizations of present matter density fields.More specifically BORG encodes a Gaussian prior for the initial density contrast at an initial cosmic scale factor of a = 10 −3 .Initial and evolved density fields are linked by deterministic gravitational evolution mediated by various physics models of structure growth.Specifically, BORG incorporates physical models based on Lagrangian Perturbation Theory (LPT) but also fully non-linear particle mesh (PM) models.
Our Bayesian inference approach has been previously shown to perform accurate dynamic mass estimation and provide mass measurements that agree well with complementary standard weak lensing and X-ray observations (Jasche & Lavaux 2019).More recently, we have shown, that BORG can exploit the geometric shapes of the cosmic large-scale structure to significantly improve constraints on cosmological parameters via the Alcock-Paczynski test (Kodi Ramanah et al. 2018).Additional projects, conducted with the BORG algorithm, can be found at our webpage of the Aquila consortium1 .
In this work, we will rely on an LPT approach to structure formation since the Ly-α forest mostly arises from under-dense regions that can be conveniently modelled by perturbation theory (Peirani et al. 2014;Sorini et al. 2016).The dynamical model permits to recover the non-linear higher-order statistics associated with the filamentary matter distribution of the cosmic largescale structure.Our approach also immediately provides detailed inferences of large-scale velocity fields and structure growth information at high redshifts as will be demonstrated in the remainder of this work.
A feature of particular relevance to this work is that BORG employs a modular statistical programming engine that executes a statistically rigorous Markov Chain to marginalize out any nuisance parameters associated to the data model, such as unknown biases or systematic effects.This statistical programming engine permits us to easily and straightforwardly implement complex data models.In particular, in this work, we will perform joint analyses of the cosmological large-scale structure and unknown astrophysical properties of the IGM medium described by the parameters of the FGPA model.The corresponding iterative block sampling inference approach is illustrated in Fig. 1.

Generating artificial Ly-α forest observations
To test our Ly-α forest inference framework we will generate artificial mock observations emulating the CLAMATO survey (Stark et al. 2015a;Lee et al. 2017).
Mock data is constructed by first generating Gaussian initial conditions on a cubic Cartesian grid of side length of 256h −1 Mpc with a resolution of 1h −1 Mpc.To generate primordial Gaussian density fluctuations we use a cosmological matter power-spectrum including the Baryonic wiggles calculated according to the prescription provided by (Eisenstein & Hu 1998, 1999).We further assume a standard ΛCDM cosmology with the following set of parameters: To generate realizations of the non-linear density field, we evolve these Gaussian initial conditions via Lagrangian Perturbation Theory.This involves simulating displacements for 512 3 particles in the LPT simulation.Final density fields are constructed by estimating densities via the cloud-in-cell scheme from simulated particles on a Cartesian equidistant grid with 256 3 volume elements.We further apply a Gaussian smoothing kernel of σ = 0.5h −1 Mpc to the fields to simulate the difference between dark matter and gas density fields (see e.g.Peirani et al. 2014;Stark et al. 2015a).A three dimensional quasar flux field is generated by applying the FGPA model, given in equation (3), to the final density field and assuming constant parameters A = 0.35 and β = 1.56 at z = 2.5, corresponding to the values in Stark et al. (2015a).
From this three-dimensional quasar flux field, we generate individually observed skewers by tracing lines of sight through the volume.Specifically, we generate a total of 1024 lines of sight parallel to the z-axis of the box.The separation between lines of sight is the most limiting factor of Ly-α surveys.Although the CLAMATO survey achieves a separation of 2.4 h −1 Mpc, the lines of sight in this work are separated by 8 h −1 Mpc.This gives us the opportunity to explore the potential of our method to reconstruct the three-dimensional density field between lines of sight.Finally, we added Gaussian pixel-noise to the flux with a signal-to-noise ratio (SNR) of ≈ 5, which corresponds to a value in the SNR range of the CLAMATO survey ( SNR min ≈ 1.4, SNR min ≈ 10 Lee et al. 2017).An example for one of these 1024 mock quasar spectra is illustrated in Fig. 3.

Testing sampler performance
In this section, we present a series of tests to evaluate the performance of our method.In particular, we focus on the convergence (Section 6.1) and efficiency (Section 6.2) to infer the underlying dark matter density field.We provide tests of the posterior power-spectra and perform posterior predictive tests for quasar spectra to check that the inferred model agrees with the data.

The warm-up phase of the sampler
In the large sample limit, any properly set up Markov chain is guaranteed to approach a stationary distribution that provides an unbiased estimate of the target distribution.While Markov chains are typically started from a place remote from the target distribution after a finite amount of transition steps, the chain acquires a stationary state.Once the chain is in a stationary state, we may start recording samples to perform statistical analyses of the inference problem.To test when the Markov sampler has passed its initial warm-up phase, we follow a similar approach as described in previous works (Jasche & Wandelt 2013;Jasche & Lavaux 2017;Kodi Ramanah et al. 2018;Jasche & Lavaux 2019;Porqueres et al. 2019), by initializing the Markov chain with an over-dispersed state and trace the systematic drift of inferred quantities towards their preferred regions in parameter space.Specifically, we initialized the Markov chain with a random Gaussian cosmological density field scaled by a factor 10 −3 and monitored the drift of corresponding posterior power-spectra during the initial warm-up phase.The results of this exercise are presented in Fig. 2. As can be seen, successive measurements of the posterior power-spectrum during the initial warm-up phase show a systematic drift of power-spectrum amplitudes towards their fiducial values.The sampler, therefore, correctly recovers the power of the initial density field and moves the chain towards regions of high-probability in the parameter space.By the end of the warm-up phase, the sampler has found an unbiased representation of the matter distribution at all Fourier modes considered in this work.Starting the sampler from an over-dispersed state, therefore, provides us with an important diagnostics to test the validity of the sampling algorithm.

Statistical efficiency of sampler
By design, subsequent samples in Markov chains are correlated.The statistical efficiency of an MCMC algorithm is determined by the number of independent samples that can be drawn from a chain of a given length.To estimate the statistical efficiency of the sampler, we estimate the correlation length of the astrophysical parameters A and β along the Markov chain.For a parameter θ the auto-correlation for samples with a given lag in the chain can be estimated as where n is the lag in MCMC samples, θ is the mean and Var(θ) is the variance.We typically determine the correlation length by estimating the lag n C at which the auto-correlation C n dropped below 0.1.The number n C therefore present the number of transitions required to produce one independent sample.The results of this test are presented in Fig. 5.As can be seen, correlation lengths for parameters A and β amount to 640 and 830 samples, respectively.To significantly improve statistical efficiency, in the future, we will use similar strategies to obtain a faster mixing of the chain as described in Kodi Ramanah et al. (2018).

Posterior predictions for quasar spectra
To test whether inferred density fields provide accurate explanations for the observations, we perform a simple posterior predictive test (see e.g.Gelman et al. 2004).Generally, posterior predictive tests provide good diagnostics about the adequacy of data models in explaining observations and to identify possible systematic problems with the inference.
Here we will use density fields and the astrophysical parameters A and β, inferred by BORG, to predict expected quasar fluxes.If posterior predictions agree with actual observations within the uncertainty bounds, then the data model can be considered to be sufficient to analyze the data.In contrast, any misfits or systematic deviations would indicate a problem of the data model or the inference process.Specifically, we applied the FGPA model given in equation 3 to inferred density fields: where F pp is the posterior-predicted flux, i labels the samples and x labels the volume elements.The result of this test is presented in Fig. 3.As can be seen, the posterior predicted quasar spectrum nicely traces the data input within the observational 1σ uncertainty region.This demonstrates that the method correctly locates absorber positions and corresponding amplitudes of the underlying density field.
While Fig. 3 shows results only for a single line of sight, more generally, we also explored the flux errors for all lines of sight.The corresponding distribution of flux errors for posterior predictions is presented in Fig. 4. The distribution of flux errors corresponds to the distribution of pixel-noise in the spectra, demonstrating that our method is close to the theoretical optimum.We note that this plot can also be compared to Fig. 14a in Horowitz et al. (2019).This comparison indicates that our method exhibits better control of the data model, including the handling of nuisance parameters and uncertainties.

Isotropy of the velocity field
One-dimensional Ly-α forest data introduces a particular geometry of the data into the inference problem.In particular, data is only available along one-dimensional lines of sight.In this work, we are interested in recovering the three-dimensional density field from Ly-α forest data by fitting a dynamical model.This model describes the formation of structures by displacing matter from its initial conditions to their final Eulerian positions.As such, we have found, that the large-scale velocity field is particularly sensitive to systematic effects introduced by the data, in particular, the survey geometry.Since the distribution of lines of sight defines a preferred axis, we thus test the isotropy of the velocity field to ensure that the algorithm correctly recovered a physically plausible three-dimensional velocity and corresponding density fields.The results of this test are shown in Figure 6, showing that there is no preferred component of the three-dimensional velocity field.All velocity components have a similar distribution indicating that the algorithm correctly accounted and corrected for the geometry of the survey.This result is further supported by the recovery of an isotropic primordial power-spectrum as described in section 7.2.

Analyzing the LSS in Ly-α forest data
In this section, we present the results of applying our algorithm to simulated Ly-α forest data.We show that our method infers physically plausible and unbiased density fields and corresponding power-spectra at all scales considered in this work.We further demonstrate that the algorithm accurately extrapolates information into unobserved regions between lines of sight.We will further illustrate the performance of the algorithm by recovering mass and velocity profiles of clusters as well as voids.

Inference of matter density fields at high-redshift
As discussed above, the BORG algorithm performs a full-scale Bayesian analysis of the cosmological large-scale structure in Ly-α forest data.This is achieved by using the physical forward modeling approach to fit a dynamical model of structure growth to the data.As a result, we simultaneously obtain inferences of the primordial field of fluctuations from which structures formed, the non-linear spatial matter distribution at a redshift of z = 2.5 and corresponding velocity fields.More specifically, the BORG algorithm provides us with an ensemble of realizations of these fields drawn from the corresponding largescale structure posterior distribution discussed in Appendix A.3.This ensemble of realizations from the Markov Chain enables us to estimate any desired statistical summary and quantify corresponding uncertainties.
As an example, in Figure 7, we illustrate ensemble mean and variances for primordial and final density fields.Specifically, Figure 7 shows slices through the true density, and the ensemble mean and variances of inferred three-dimensional density fields, computed from 12600 samples.A first visual comparison between ground truth and the inferred ensemble mean final density fields in the lower panels of figure 7 illustrates that the algorithm correctly recovered the three-dimensional large-scale structure from Ly-α forest data.On first sight, both fields are visually almost indistinguishable, indicating the high quality of the inference.The lower right panel of figure 7 shows the corresponding standard deviations for density amplitudes for respective volume elements as estimated from the Markov Chain.It can be seen that the estimated density standard deviations correlate with the inferred density field.This is expected for a nonlinear data model, which couples signal and noise.In particular, one can observe that uncertainties are lowest for under-dense regions where Ly-α forest data provides high signal-to-noise, as quasar light is simply transmitted by structures.In contrast, one can observe the highest uncertainties for over-dense structures.Since over-dense structures absorb quasar light, this decreases the signal in these regions: the absorption is saturated at highdensities.Even more, if structures are sufficiently dense, then they will absorb all quasar light and the absorption becomes saturated.Once light absorption is saturated, the data provides no further information about the actual density amplitude of the absorber and data will only provide information on a minimally lowest density threshold required to explain the observations.Figure 7 therefore, clearly demonstrates that our method correctly accounts for and quantifies uncertainties inherent to Ly-α Fig. 7: Slices through ground truth initial (left upper panel), final density field (left lower panel), inferred ensemble mean initial (middle upper panel) and ensemble mean final (middle lower panel) density field computed from 12600 MCMC samples.Comparison between these panels shows that the method recovers the structure of the true density field with high accuracy.Note that the algorithm infers the correct amplitudes of the density field.Right panels show standard deviations of inferred amplitudes of initial (upper right panel) and final density fields (lower right panel).The standard deviation of the final density field shows lower values at the position of the lines of sight.Note that the uncertainty of δ f presents a structure that correlates with the density field.Particularly, the variance is higher in high-density regions due to the saturation of the absorbed flux.In contrast, the standard deviation of the initial conditions are homogeneous and show no correlation with the initial density field due to the propagation of information from the final to the initial density field via the dynamical model.forest observations.It is interesting to remark, that while observations of galaxy clustering are most informative in high-density regions, the Ly-α forest is particularly informative for underdense regions and therefore provides complementary information to galaxy surveys.Figure 7 demonstrates that, besides inferring the correct filamentary structure, our method clearly infers the correct amplitudes of the density field.Our results can also be compared to Fig. 2 in Horowitz et al. (2019), which presents a systematic underestimation of the density amplitude.

Analyzing posterior power-spectra
As demonstrated in the previous section, the algorithm provides accurate reconstructions of final density fields.To provide more quantitative tests and to estimate whether inferred density fields are physically plausible, we perform analyses of posterior power-spectra measured from inferred density fields.
Reconstructing three-dimensional density fields from onedimensional Ly-α data is technically challenging.For example, Kitaura et al. (2012) used a Gibbs sampling approach to sample the large-and small-scales of the density field separately with a lognormal prior for the evolved density field.This approach inferred correct power-spectrum amplitudes at large scales k < 0.1 h Mpc −1 but obtained erroneous excess power at smaller scales, which was attributed to the inadequacy of the log-normal approximation or complex correlations in the data (Kitaura et al. 2012).Horowitz et al. (2019) used an optimization approach to fit a dynamical forward model to the data, but the method obtains power-spectra that severely underestimate the power of density amplitudes.
As we aim to optimally extract cosmologically relevant information from Ly-α forest data, we are particularly interested in recovering physically plausible density fields from observations.Erroneous power in inferred power-spectra is often a sign of untreated or unknown survey systematics and indicates a break down of the assumptions underlying the data model or the inference approach.In this work, we use a sophisticated MCMC framework to marginalize out nuisance parameters and quantify uncertainties inherent to the observations.As already indicated in section 6.1, our approach is able to identify the correct power distribution of density amplitudes from observations.To further quantify this result, we estimate the mean and variance of posterior power-spectra measured from the ensemble of Markov samples.The results are presented in Fig. 8.As can be seen, our method recovers the correct fiducial cosmological power-spectrum within the 1-σ cosmic variance uncertainty at all Fourier modes considered in this work.The unbiased recovery of the power-spectrum clearly indicates that the method correctly accounted for uncertainties and systematic effects of the data.
Systematic effects, such as survey geometries, often introduce spurious correlations between power-spectrum modes, Fig. 8: Mean posterior matter power-spectrum.Although the standard deviation is plotted, it is too small to be seen, showing the stability of the posterior power-spectrum.The dashed line indicates the underlying power spectrum and the 1-and 2-σ uncertainty limit.The mean and standard deviation are computed from 12600 samples of the Markov chain.The algorithm recovers the power-spectrum amplitudes within the 1 − σ cosmic variance uncertainty limit throughout the entire range of Fourier modes considered in this work.Fig. 9: Estimated correlation matrix of power spectrum amplitudes with the mean value, normalized using the variance of amplitudes of the power spectrum modes.The low off-diagonal contributions are a clear indication that our method correctly accounted and corrected for otherwise erroneous mode coupling, typically introduced by survey systematic effects and uncertainties.The expected correlations correspond to the lines of sight grid.The Pearson coefficient is > 0.9 at any location in the density field.This, combined with Fig. 7 showing that the algorithm recovers the correct amplitudes, indicates that the algorithm can interpolate the information between lines of sight and correctly recover the structures in unobserved regions.
when not accounted for properly.This is of particular relevance for Ly-α data, which introduces a particular survey geometry through the set of one-dimensional lines of sight.To test for residual correlations between Fourier modes, we estimated the covariance matrix of power-spectrum amplitudes from our ensemble of Markov samples.As shown in Fig. 9 this covariance matrix shows a clear diagonal structure with the expected correlations at k ∼ 0.5 − 0.7 h Mpc −1 due to the lines of sight grid.This test shows that our method correctly accounted for survey geometries and other systematic effects that could introduce erroneous mode coupling.
In summary, these tests demonstrate that our method is capable of inferring physically plausible matter density fields with correct power distribution from noisy Ly-α data.

Recovering information between lines of sight
The previous section describes that inferred density fields are physically plausible realizations of the matter field underlying the Ly-α observations.Here we want to quantify to what accuracy the method recovers the underlying ground truth density field.A particular challenge is to correctly recover the density field in between one-dimensional lines of sight.As illustrated above, our method is capable of recovering the threedimensional density field from a set of one-dimensional quasar spectra using the Bayesian physical forward modelling approach.In particular, we determine the performance of our algorithm to recover the density field from data by estimating the Pearson correlation coefficient (see Appendix D) between the inferred and ground truth density fields.We estimate the Pearson correlation coefficient as a function of the position on the x-axis, which permits us to track the correlations on and in-between observed lines of sight.Fig. 10 shows the Pearson coefficient averaged over 300 samples of the chain together with its 1-σ credibility interval.It can be clearly seen that inferred density fields correlate with the ground truth density typically to more than 98% The algorithm recovers the correct mass profile from the simulation.This demonstrates that the method provides unbiased mass estimates, correcting for the astrophysical bias of the IGM.Bottom: Spherically averaged density (left) and velocity (right) profiles of a void.The method recovers the underlying density and velocity profiles of the simulation.
along the lines of sight.But even in-between lines of sight the correlation is on average larger than 90%.These high Pearson correlations demonstrate the capability of our method to recover the cosmic large-scale structure in unobserved regions between observed lines of sight.

Cluster and void profiles
Above we showed that the BORG algorithm recovers physically plausible density fields from Ly-α data that agree with the expected statistical behaviour of a ΛCDM density field.Here we want to test if the algorithm also correctly recovers the properties of individual cosmic structures at their respective locations in the three-dimensional volume.To illustrate this fact, we will study the properties of inferred clusters and voids.In particular, we will show that the algorithm correctly recovers mass density and velocity profiles of cosmic structures.
To test the properties of an inferred cluster, we randomly chose a peak in the final density field.Then we determined the mass and velocity profiles in spherical shells around the identified center of the cluster.In particular, we estimated the cumulative radial mass profiles as: where r is the radial distance from the cluster center, i labels the Markov samples, N is the total number of samples and K i part (< r) is the number of simulation particles of Markov sample i inside a sphere of radius r around the cluster center.Finally, m p = 287 × 10 6 M is the particle mass of the simulation, which is obtained as where ρ is the mean cosmic density, V box is the volume of the inferred density field and N total part is the total number of particles used in the simulation.Fig. 11 shows the cluster mass profile measured from the inferred and true density fields.The inferred mass profile shown here is the average over 60 samples.Fig. 11 shows that our method provides unbiased mass estimates of cluster mass profiles, becoming an alternative to measuring cluster masses, complementary to weak lensing or X-ray observations.
The method also provides inferred density and velocity profiles of voids.Fig. 11 shows the density and velocity profiles measured from the inferred and true density fields.The void is defined spherically and the velocity and density profiles are spherically averaged: where i runs over the samples, N is the total number of samples, r is the radius of the sphere centered in the volume element with the lowest density, m p is the particle mass described in eq. 8, K i part (< r) is the number of particles inside the sphere and v p are the velocities of the particles provided by the dynamical model.The profiles shown in Fig. 11 are obtained from averaging over 60 samples.Voids constitute the dominant volume fraction of the Universe.Since the effect of matter in voids is mitigated, they are ideal to study the diffuse components of the Universe such as dark energy (Granett et al. 2008;Biswas et al. 2010;Lavaux & Wandelt 2012;Bos et al. 2012) and gravity (Li 2011;Clampitt et al. 2013;Spolyar et al. 2013;Hamaus et al. 2016).Voids are also interesting tools to constrain the neutrino mass (Massara et al. 2015;Kreisch et al. 2018;Sahlén 2019;Schuster et al. 2019): the neutrino free-streaming length falls within the range of typical void size.Therefore, the void profiles obtained with this framework provide an alternative tool to constrain the neutrino mass.

Velocity field and structure formation
The dynamical model employed in BORG allows to naturally infer the velocity field since it derives from the initial perturbations.The velocity information allows to discriminate between peculiar velocities and the Hubble flow.The combination of velocity and density fields can provide significant information on the formation of structures and galaxies.
Our method provides velocity fields at z = 2.5, where this kind of data is usually hard to obtain.Fig. 12 shows the velocity along the line of sight.The line-of-sight velocity provides a tool to measure the kinetic Sunyaev-Zeldovich effect (Sunyaev & Zeldovich 1972, 1980;Vishniac 1987) by cross-correlating the velocity field with the CMB and study the expansion of the Universe.
While the hierarchical structure formation model has been tested in the nearby Universe (Bond et al. 1982;Blumenthal et al. 1984;Davis et al. 1985), our method provides a framework to test it at high redshift.The method provides consistent velocity and density fields, as shown in Fig. 13, which shows a zoom-in on the inferred mean density field and the corresponding velocity components (v x , v y ).Therefore, we can test the hierarchical structure formation model at high-redshift by combining the density and velocity information and testing the predic- tions of the method with observations.Additionally, these density and velocity fields can provide some insights into galaxy formation and evolution by studying the effect of large-scale structures in galaxy populations (Porqueres et al. 2018).Therefore, this method for the Ly-α forest allows to extend investigations of the nature of galaxies or AGN to the high-redshift Universe.

The astrophysical parameters
Current efforts of the science community aim at constraining the astrophysics of the intergalactic medium (see e.g. Lee 2012;Rorai et al. 2017;Eilers et al. 2017) since the thermal history of the IGM holds the key to understanding hydrogen reionization at z > 6 (Lee 2012).Particularly, analysis of the Ly-α forest attempt to measure the parameters T 0 and γ.The parameter T 0 is related to the spectral shapes of the ultraviolet background.Therefore, it contains information on the intensity of the ionizing background, which is relevant to understand the nature of the first luminous sources and their impact on the subsequent generation of galaxies (Becker et al. 2015).The parameter γ defines the temperature-density relation indicating whether overdense regions are hotter than the underdense regions (γ > 1) or vice versa (γ > 1).While Becker et al. (2007); Bolton et al. (2008); Viel et al. (2009); Rorai et al. (2017) found evidence for γ < 1, Calura et al. (2012); Garzilli et al. (2012) found γ ≈ 1.Other approaches (Schaye et al. 2000;Theuns & Zaroubi 2000;Mc-Donald et al. 2001;Lidz et al. 2010;Rudie et al. 2012;Bolton et al. 2014) found γ > 1, which is more in agreement with the theoretically predicted γ ≈ 1.6 for a post-reionization IGM (Hui & Gnedin 1997).However, the value of γ is still an open debate since the different values are obtained from different kind of data, affected by systematic uncertainties in a different way (Eilers et al. 2017).
The algorithm presented in this work infers the astrophysical parameters of the FGPA, which are directly related to T 0 and γ.Fig. 14 shows the posterior distribution and correlations of these parameters and the noise variance σ, obtained from 6800 samples.This shows that the method recovers the true value of the posterior distribution of the FGPA parameters, corresponding to T 0 = (206.5 ± 0.1)10 3 Γ −1/0.7 K and γ = 1.614 ± 0.004, where the uncertainty is derived from the standard deviation of A and β.While the more accurate measurements of γ have an uncertainty above 7% (Rudie et al. 2012;Eilers et al. 2017), this test shows that our method can constrain γ with an uncertainty below 1%.The reason for this tight constraint of γ is probably the fact that higher-order statistics of the density field can break parameter degeneracy (Schmidt et al. 2019).Although constraining γ from real Ly-α forest data will require to model the continuum flux of the quasar, our method holds the promise to shed new light on the temperature-density debate.
Note in Fig. 14 that parameters A and β do not show a strong correlation with the noise variance σ, which is jointly sampled.It is also worth noticing that β presents a bimodal distribution.Optimization methods (see e.g.Horowitz et al. 2019) can only explore local extremes of the distribution, becoming sub-optimal to explore multi-modal distributions.However, our MCMC approach can characterize the bimodal distribution and provide the corresponding uncertainties, which are necessary to interpret the data correctly.Therefore, our algorithm provides a framework to constrain the astrophysics of the intergalactic medium jointly with the density field.To the knowledge of the authors, this work presents the first approach that attempts to jointly quantify the astrophysical parameters and the 3d cosmic structure.

Summary and discussion
Observations of the late time cosmic large-scale structure can provide information on fundamental physics driving the dynamic evolution of our Universe only if we manage to accurately account for non-linear data models and systematic effects of data and connect theory to observations.Detailed physical understanding of our universe, therefore, requires to accompany the current increase of high-quality cosmological data with the development of novel analysis methods capable of making the most of such observations.Traditional analyses of cosmic structures are limited to exploiting only one-and two-point statistics but ignore significant information encoded in higher-order statistics associated with the filamentary features of the late time cosmic Porqueres et al.: Inferring high redshift large-scale structure dynamics from the Lyman alpha forest matter distribution.To fully account for the information content of the cosmic large-scale structure, we need to follow a fieldbased approach to extract the entire three-dimensional matter distribution from observations.In this work we are particularly interested in studying the cosmic matter distributions at redshifts z > 2 with one-dimensional Lyman-α forest observations.
To extract the full three-dimensional information content from the Ly-α forest, we propose to use a Bayesian physical forward modelling approach to fit a numerical model of gravitational structure formation to data.Building upon the previously presented BORG algorithm, our fully probabilistic framework infers the three-dimensional matter density field and its dynamics from the Lyman-α forest.The method improves considerably over previous approaches (e.g.Kitaura et al. 2012;Horowitz et al. 2019) by inferring unbiased dark matter fields from the Lyα forest for the first time.While the previous approaches failed to recover the matter-power spectrum, our method provides the unbiased matter power spectrum at all range of Fourier modes.In particular, the approach followed by Horowitz et al. (2019) produces an underestimation of the power spectrum, corresponding to too low amplitudes of the inferred density field.We have shown that our method recovers the unbiased dark matter distribution and physically plausible density and velocity fields from Lyman-α data.
To make such inferences feasible, we used a likelihood function based on the fluctuating Gunn-Peterson approximation (FGPA).Besides modelling the dynamical evolution and spatial distribution of cosmic matter, our approach also accounts for systematic effects arising from the astrophysical properties of the intergalactic medium, by self-consistently marginalizing out corresponding nuisance parameters of the FGPA model.We have shown that this approach can provide tight constraints on the astrophysical parameters T 0 and γ, which provide significant information about the reionization history of the Universe and the early luminous sources.
Our hierarchical Bayesian framework encodes non-linear dynamical models.This work is based on the Lagrangian Perturbation Theory (LPT).As demonstrated by hydrodynamical simulations, the Ly-α forest arises from regions where the density of matter is within a factor ten of the cosmic mean density and thus is still close to the linear regime of gravitational collapse (Peirani et al. 2014;Sorini et al. 2016).
We tested our approach using realistic mock observations emulating the CLAMATO survey to infer the dark matter density and the velocity field at redshift z = 2.5 with a resolution of 1h −1 Mpc.These tests demonstrate that our method recovers unbiased reconstructions of the non-linear spatial matter distribution and its power-spectrum from observations of the Ly-α forest.The inferred mean density field presents a high correlation with the true density at any location.This indicates that our method can interpolate the information between lines of sight.While previous approaches tried to address this challenge with polynomial smoothing (Cisewski et al. 2014) or Wiener filtering (Ozbek et al. 2016;Stark et al. 2015b), the dynamical model in our method interpolates the information by accounting for the high-order statistics of the density field.
The dynamical model naturally infers velocity fields jointly with the density field.Therefore, our method provides velocity fields at z > 2, where this kind of data is usually hard to obtain.From the velocity fields, we can derive the velocities along the lines of sight or radial velocities.These velocity fields can be cross-correlated with the CMB to detect the kinetic Sunyaev-Zeldovich effect.Additionally, the consistent velocity and density fields provide a framework to test hierarchical structure formation and galaxy formation models at high redshift.
We have shown that the method provides accurate mass and velocity profiles for cosmic structures, such as voids and clusters.Therefore, this method provides an alternative to measure cluster masses, complementary to X-ray and lensing measurements.The method also provides the velocity and density profile of voids.Since the non-linear effects are mitigated in voids, they are sensitive to the diffuse components of the Universe such as dark energy.Therefore, the velocity and density profiles of voids can be used to discriminate between homogeneous dark energy and modifications of gravity.
Since the Ly-α probes the matter distribution down to few megaparsecs, it is sensitive to neutrino masses.This high resolution and the void profiles provided by our algorithm can constrain neutrino masses.Additionally, this high resolution of the Ly-α forest allows inferring the density field at the 1 Mpc scale.By applying new techniques compatible with Kodi Ramanah et al. ( 2018), this high-resolution density field can provide tight constraints of the cosmological parameters by studying structure geometry.
Summarizing, our method clearly demonstrates the feasibility of detailed and physically plausible inferences of threedimensional large-scale structures at high redshift from the Ly-α forest observations.The proposed approach, therefore, opens a new window to study cosmology and structure formation at high redshifts.

Fig. 3 :
Fig. 3: Top panel: Posterior prediction tests to test the model.These tests check whether the data model can accurately account for the observations.Any significant mismatch would immediately indicate a breakdown of the applicability of the data model or error of the inference framework.Our method recovers the transmitted flux fraction correctly, confirming that the data model can accurately account for the observations.Bottom panel: Comparison of the inferred ensemble mean density field along the line of sight to the ground truth.It can be seen that high-density regions yield a suppression of transmitted flux, while underdense regions transmit the quasar signal, in agreement with the FGPA model.

Fig. 4 :
Fig.4: Histogram of the error in the fractional transmitted flux, which is computed as the difference between posterior-predicted fluxes and input spectra.The distribution of flux error matches the distribution of pixel noise in the data, indicating that the method is close to the theoretical optimum.This distribution of error flux can be compared to the one obtained in previous works, demonstrating that our method recovers the fluxes with significantly higher accuracy.

Fig. 5 :
Fig. 5: Autocorrelation of the parameters as a function of the sample lag in the Markov chain.The correlation length of the sampler can be estimated by determining the point when correlations drop below 0.1 for the first time.These parameters have a correlation length of 830 samples for β and 640 for A.

Fig. 6 :
Fig. 6: Distribution of the three Cartesian components of the velocity.The three distributions have the same mean and variance, indicating that the density field is isotropic as expected.

Fig. 10 :
Fig. 10: Pearson coefficient of the inferred and true density field as a function of the position across lines of sight.The dashed lines indicate the positions of the lines of sight.The Pearson coefficient is > 0.9 at any location in the density field.This, combined with Fig.7showing that the algorithm recovers the correct amplitudes, indicates that the algorithm can interpolate the information between lines of sight and correctly recover the structures in unobserved regions.
Fig. 11: Top: Cluster mass (left) and radial velocity (right) profiles from the inferred ensemble mean density field and the simulation.The algorithm recovers the correct mass profile from the simulation.This demonstrates that the method provides unbiased mass estimates, correcting for the astrophysical bias of the IGM.Bottom: Spherically averaged density (left) and velocity (right) profiles of a void.The method recovers the underlying density and velocity profiles of the simulation.

Fig. 12 :
Fig. 12: Projection of the velocity along lines of sight, which are parallel to the x-axis of the plot.

Fig. 13 :
Fig.13: Zoom-in on the density field.The vector field shows the velocities on top of the inferred density field, showing matter flowing out of the void and falling into the gravitational potential of the cluster.Therefore, the method provides consistent velocity and density fields.

Fig. 14 :
Fig. 14: Corner plot for the unknown parameters A and β of the FGPA model as well as the standard deviation of the Gaussian noise σ.As indicated in the plot, different panels show different marginal posterior distributions for respective parameters.Black solid lines in uni-variate marginal distributions indicate the true fiducial parameter values.It can be seen that the method correctly infers the underlying parameters and quantifies corresponding uncertainties.It is interesting to remark, that the astrophysical parameters A and β show no a posteriori correlations with the noise standard deviation σ.
and smaller than the expected separation in DESI (3.7h −1 Mpc, Horowitz et al. 2019).Future surveys like MSE (Maunakea Spectrographic Explorer McConnachie et al. 2016) and LSST (LSST Science Collaboration et al. 2009b) will increase the amount of available Ly-α forest observations by a factor of 10.