Optimal neural summarization for full-field weak lensing cosmological implicit inference

Denise Lanzieri; Justine Zeghal; T. Lucas Makinen; Alexandre Boucaud; Jean-Luc Starck; François Lanusse

doi:10.1051/0004-6361/202451535

Home

All issues

Volume 697 (May 2025)

A&A, 697 (2025) A162

Full HTML

Open Access

Issue		A&A Volume 697, May 2025


Article Number		A162
Number of page(s)		15
Section		Cosmology (including clusters of galaxies)
DOI		https://doi.org/10.1051/0004-6361/202451535
Published online		16 May 2025

A&A, 697, A162 (2025)

Optimal neural summarization for full-field weak lensing cosmological implicit inference

Denise Lanzieri¹^,5^,⋆, Justine Zeghal²^,8^,9^,10^,⋆^,⋆⋆, T. Lucas Makinen³, Alexandre Boucaud², Jean-Luc Starck⁴^,6 and François Lanusse⁴^,7

¹ Université Paris Cité, Université Paris-Saclay, CEA, CNRS, AIM, F-91191 Gif-sur-Yvette, France
² Université Paris Cité, CNRS, Astroparticule et Cosmologie, F-75013 Paris, France
³ Imperial Centre for Inference and Cosmology (ICIC) & Astrophysics Group, Imperial College London, Blackett Laboratory, Prince Consort Road, London SW7 2AZ, United Kingdom
⁴ Université Paris-Saclay, Université Paris Cité, CEA, CNRS, AIM, 91191 Gif-sur-Yvette, France
⁵ Sony Computer Science Laboratories – Rome, Joint Initiative CREF-SONY, Centro Ricerche Enrico Fermi, Via Panisperna 89/A, 00184 Rome, Italy
⁶ Institutes of Computer Science and Astrophysics, Foundation for Research and Technology Hellas (FORTH), Heraklion, 70013, Greece
⁷ Center for Computational Astrophysics, Flatiron Institute, 162 5th Ave, New York, NY 10010, USA
⁸ Department of Physics, Université de Montréal, Montréal H2V 0B3, Canada
⁹ Mila – Quebec Artificial Intelligence Institute, Montréal H2S 3H1, Canada
¹⁰ Ciela – Montreal Institute for Astrophysical Data Analysis and Machine Learning, Montréal H2V 0B3, Canada

^⋆⋆ Corresponding author.

Received: 16 July 2024
Accepted: 18 February 2025

Abstract

Context. Traditionally, weak lensing cosmological surveys have been analyzed using summary statistics that were either motivated by their analytically tractable likelihoods (e.g., power spectrum) or by their ability to access some higher-order information (e.g., peak counts), but at the cost of requiring a simulation-based inference approach. In both cases, even if the statistics can be very informative, they are not designed nor guaranteed to be statistically sufficient (i.e., to capture all the cosmological information content of the data). With the rise of deep learning, however, it has becomes possible to create summary statistics that are specifically optimized to extract the full cosmological information content of the data. Yet, a fairly wide range of loss functions have been used in practice in the weak lensing literature to train such neural networks, leading to the natural question of whether a given loss should be preferred and whether sufficient statistics can be achieved in theory and in practice under these different choices.

Aims. We compare different neural summarization strategies that have been proposed in the literature to identify the loss function that leads to theoretically optimal summary statistics for performing full-field cosmological inference. In doing so, we aim to provide guidelines and insights to the community to help guide future neural network-based cosmological inference analyses.

Methods. We designed an experimental setup that allows us to isolate the specific impact of the loss function used to train neural summary statistics on weak lensing data at fixed neural architecture and simulation-based inference pipeline. To achieve this, we developed the sbi_lens JAX package, which implements an automatically differentiable lognormal weak lensing simulator and the tools needed to perform explicit full-field inference with a Hamiltonian Monte Carlo (HMC) sampler over this model. Using sbi_lens, we simulated a wCDM LSST Year 10 weak lensing analysis scenario in which the full-field posterior obtained by HMC sampling gives us a ground truth that can be compared to different neural summarization strategies.

Results. We provide theoretical insight into the different loss functions being used in the literature, including mean squared error (MSE) regression, and show that some do not necessarily lead to sufficient statistics, while those motivated by information theory, in particular variational mutual information maximization (VMIM), can in principle lead to sufficient statistics. Our numerical experiments confirm these insights, and we show on our simulated wCDM scenario that the figure of merit (FoM) of an analysis using neural summary statistics optimized under VMIM achieves 100% of the reference Ω_c−σ₈ full-field FoM, while an analysis using summary statistics trained under simple MSE achieves only 81% of the same reference FoM.

Key words: gravitational lensing: weak / methods: statistical / large-scale structure of Universe

^⋆

Equal contribution.

© The Authors 2025

Open Access article, published by EDP Sciences, under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This article is published in open access under the Subscribe to Open model. Subscribe to A&A to support open access publication.

1. Introduction

Weak gravitational lensing by large-scale structures (LSS) is caused by the presence of foreground matter bending the light emitted by background galaxies. As it is sensitive to the LSS of the Universe, weak gravitational lensing is one of the most promising tools for investigating the nature of dark energy, the origin of the accelerating expansion of the Universe and estimating cosmological parameters. Future cosmological surveys, such as the Legacy Survey of Space and Time (LSST) of the Vera C. Rubin Observatory (Ivezić et al. 2019), Nancy Grace Roman Space Telescope (Spergel et al. 2015), and the Euclid Mission (Laureijs et al. 2011), will rely on weak gravitational lensing as one of the principal physical probes to address unresolved questions in current cosmology. As these surveys become deeper, they will be able to access more non-Gaussian features of the matter fields. This makes the standard weak-lensing analyses, that rely on two-point statistics such as the two-point shear correlation or the angular power spectrum, suboptimal. These analyses are unable to fully capture the non-Gaussian information imprinted in the lensing signal and can only access Gaussian information.

To overcome this limitation, several higher-order statistics have been introduced. These include weak-lensing peak counts (Liu et al. 2015a, b; Lin & Kilbinger 2015; Kacprzak et al. 2016; Peel et al. 2017; Shan et al. 2018; Martinet et al. 2018; Ajani et al. 2020; Harnois-Déraps et al. 2022; Zürcher et al. 2022), wavelet and scattering transform (Ajani et al. 2021; Cheng & Ménard 2021), the one-point probability distribution function (PDF; Liu & Madhavacheril 2019; Uhlemann et al. 2020; Boyle et al. 2021), Minkowski functionals (Kratochvil et al. 2012; Petri et al. 2013), moments of mass maps (Gatti et al. 2022), and three-point statistics (Takada & Jain 2004; Semboloni et al. 2011; Rizzato et al. 2019; Halder et al. 2021). Although these methods have been proven to provide additional cosmological information beyond the power spectrum, they do not necessarily exhaust the information content of the data.

In recent years, the idea of performing full-field inference (i.e., analyzing the full information content of the data at the cosmological field level) has gained traction. Contrary to previous approaches and by design, full-field inference is directly aimed at achieving information-theoretically optimal posterior contours.

Within this category, a further distinction can be made between explicit methods that rely on a Bayesian hierarchical model (BHM) describing the joint likelihood p(x|θ, z) of the field-level data x, cosmological parameters θ, and latent parameters z of the model (e.g., Alsing et al. 2017; Porqueres et al. 2021, 2023; Fiedorowicz et al. 2022a, b; Junzhe Zhou et al. 2023), and implicit methods (a.k.a. simulation-based inference, or likelihood-free inference) that rely on neural networks to directly estimate the full-field marginal likelihood p(x|θ) or posterior p(θ|x) on cosmological parameters from a black-box simulation model (Gupta et al. 2018; Fluri et al. 2018, 2019, 2021, 2022; Ribli et al. 2019; Matilla et al. 2020; Jeffrey et al. 2021, 2024; Lu et al. 2022, 2023; Kacprzak & Fluri 2022; Akhmetzhanova et al. 2024). Finally, another recently proposed flavor of implicit methods relies on directly estimating the full-field likelihood (Dai & Seljak 2024), which is in contrast to the previous methods that only target a marginal likelihood.

Despite the differences among these works in terms of the physical models they assume or the methodology employed, they all demonstrate that conducting a field-level analysis results in more precise constraints compared to a conventional two-point function analysis. However, open questions and challenges remain with both categories of methods. Explicit inference methods have not yet been successfully applied to data in order to constrain cosmology and are particularly challenging to scale to the volume and resolution of modern surveys. Implicit inference methods have proven to be much easier to deploy in practice and are already leading to state-of-the-art cosmological results (Jeffrey et al. 2024; Lu et al. 2023; Fluri et al. 2022), but the community has not yet fully converged on a set of best practices for this approach, which transpires in the variety of strategies found in the literature.

In this paper, we aim to clarify part of the design space of implicit inference methods related to how to design optimal data compression procedures with the aim of deriving low-dimensional summary statistics while minimizing information loss. Such neural summarization of the data is typically the first step in the two-step strategies found in most implicit inference work. The first step is the neural compression of the data, in which a neural network is trained on simulations to compress shear or convergence maps down to low-dimensional summary statistics. The second step is density estimation, in which either the likelihood of these summary statistics or directly the posterior distribution is estimated from simulations.

To quantify the extraction power of neural compression schemes found in the literature, we designed an experimental setup that allows us to isolate the impact of the loss function used to train the neural network. Specifically, we developed the sbi_lens package, which provides a differentiable lognormal forward model from which we can simulate a wCDM LSST Year 10 weak lensing analysis scenario. The differentiability of our forward model enables us to conduct an explicit full-field inference through the use of the HMC sampling scheme. This explicit posterior serves as our ground truth against which we can quantify the quality of the benchmarked compression procedures according to the sufficiency definition. To only assess the impact of the loss function of the compression step, we fixed both the inference methodology and the neural compressor architecture. In addition to empirical results, we provide some theoretical insight into the different losses employed in the literature. We explain why MSE, mean absolute error (MAE), or gaussian negative log-likelihood (GNLL) losses might lead to suboptimal summary statistics while information theory motivated losses such as VMIM theoretically achieve sufficient statistics.

The paper is structured as follows: In Section 2, we illustrate the motivation behind this work. In Section 3 we provide a detailed theoretical overview of the loss functions commonly used for compression. In Section 4, we introduce the sbi_lens framework and describe the simulated data used in this work. In Section 5, we detail the inference strategy and the three different approaches we used: the power spectrum, explicit full-field inference, and implicit full-field inference. In Section 6, we discuss the results and validate the implicit inference approaches. Finally, we conclude in Section 7.

2. Motivation

With the increased statistical power of stage IV surveys, conventional summary statistics such as the power spectrum but also higher-order statistics such as peak counts may not fully capture the non-Gaussian information present in the lensing field at the scales accessible to future surveys. In this paper, we focus on full-field inference methods that aim to preserve all available information and facilitate the incorporation of systematic effects and the combination of multiple cosmological probes through joint simulations.

In a forward model context, the simulator of the observables serves as our physical model. These models, often referred to as probabilistic programs, as illustrated by Cranmer et al. (2020), can be described as follows: the models take as input a vector parameter θ. Then, they sample internal states z, dubbed latent variables, from the distribution p(z|θ). These states can be directly or indirectly related to a physically meaningful state of the system. Finally, the models generate the output x from the distribution p(x|θ, z), where x represents the observations.

The ultimate goal of Bayesian inference in cosmology is to compute the posterior distribution:

$p (θ | x) = \frac{p (x | θ) p (θ)}{\int d θ' p (x | θ') p (θ')} \cdot$ $p({\boldsymbol {\theta }}|{\boldsymbol {x}})= \frac {p({\boldsymbol {x}}|{\boldsymbol {\theta }})p({\boldsymbol {\theta }})} {\int {\mathrm {d}}{\boldsymbol {\theta }}'p({\boldsymbol {x}}|{\boldsymbol {\theta }}')p({\boldsymbol {\theta }}')}\cdot$ (1)

However, a problems arises because the marginal likelihood p(x|θ) is typically intractable,

$p (x | θ) = \int p (x, z | θ) d z = \int p (x | z, θ) p (z | θ) d z,$ $p({\boldsymbol {x}}|{\boldsymbol {\theta }})=\int p({\boldsymbol {x}},{\boldsymbol {z}}|{\boldsymbol {\theta }}) {\mathrm {d}}{\boldsymbol {z}}=\int p({\boldsymbol {x}}|{\boldsymbol {z}},{\boldsymbol {\theta }})p({\boldsymbol {z}}|{\boldsymbol {\theta }}) {\mathrm {d}}{\boldsymbol {z}},$ (2)

since it involves integrating over all potential paths through the latent space. To overcome this limitation while still capturing the full information content of the data, two different approaches have been proposed in the literature. Although these approaches are often referred to by different names, hereinafter we make the following distinction:

Explicit inference. The explicit inference is referring to all likelihood-based inference approaches. In the context of full-field inference, this approach can be used when the simulator is build as a tractable probabilistic model. This probabilistic model provides a likelihood p(x|z, θ) that can be evaluated. Hence, the joint posterior

$p (θ, z | x) \propto p (x | z, θ) p (z | θ) p (θ)$ $p({\boldsymbol {\theta }},{\boldsymbol {z}}|{\boldsymbol {x}}) \propto p({\boldsymbol {x}}|{\boldsymbol {z}},{\boldsymbol {\theta }}) p({\boldsymbol {z}}|{\boldsymbol {\theta }})p({\boldsymbol {\theta }})$ (3)

can be sampled through Markov chain Monte Carlo (MCMC) schemes. In other words, this approach involves using a synthetic physical model to predict observations and then comparing these predictions with real observations to infer the parameters of the model.

Implicit inference. The implicit inference is referring to the approaches that infer the distributions (posterior, likelihood, or likelihood ratio) from simulations only. This second class of approaches can be used when the simulator is a black box with only the ability to sample from the joint distribution:

$(x, θ) \sim p (x, θ) .$ $({\boldsymbol {x}}, {\boldsymbol {\theta }})\sim p({\boldsymbol {x}}, {\boldsymbol {\theta }}).$ (4)

Within this class of methods, we can differentiate between traditional methods such as approximate Bayesian computation (ABC) and neural-based density estimation methods. ABC, in its simplest form, employs rejection-criteria-based approaches to approximate the likelihood by comparing simulations with the observation. In this work, our focus is on the second class of methods.

The standard deep learning based approach for implicit inference can be described as two distinct steps: (1) learning an optimal low-dimensional set of summary statistics; (2) using neural density estimator (NDE) in low dimensions to infer the target distributions. In the first step, we introduce a parametric function F_φ such that t=F_φ(x), which aims at reducing the dimensionality of the data while preserving information, or, in other words, aims at summarizing the data x into sufficient statistics t. By definition (Bernardo & Smith 2001), a statistic t is said to be sufficient for the parameters θ if p(θ|x) = p(θ|t). Meaning that the full data x and the summary t lead to the same posterior. Typically, the sufficient statistics t are assumed to have the same dimension as θ. In the second step, NDE can either target building an estimate p_φ of the likelihood function p(x|θ) (referred to as neural likelihood estimation (NLE, Papamakarios et al. 2018; Lueckmann et al. 2018)), targeting the posterior distribution p(θ|x), (known as neural posterior estimation (NPE, Papamakarios & Murray 2018; Lueckmann et al. 2017; Greenberg et al. 2019)), or the likelihood ratio r(θ, x) = p(x|θ)/p(x) (neural ratio estimation (NRE, Izbicki et al. 2014; Cranmer et al. 2015; Thomas et al. 2016)).

The main motivation of this work is to evaluate the impact of compression strategies on the posterior distribution and determine the ones that provide sufficient statistics. It is important to consider that different neural compression techniques may not lead to the same posterior distribution. Indeed, according to the definition, only a compression scheme that builds sufficient statistics leads to the true posterior distribution. We summarize the various neural compression strategies found in the literature in Table 1. Many of these papers have used neural compression techniques that rely on optimizing the MSE or MAE. As we demonstrate in the following sections, this corresponds to training the model to respectively estimate the mean and the median of the posterior distribution. Other papers rely on assuming proxy Gaussian likelihoods and estimate the mean and covariance of these likelihoods from simulations. Such compression of summaries could be suboptimal in certain applications, resulting in a loss of information.

Table 1.

Table summarizing the different neural compression schemes used for weak-lensing applications.

Ultimately, we should keep in mind that, given a simulation model, if a set of sufficient statistics is used, the two approaches should converge to the same posterior. Therefore, the goals of this work are to (1) find an optimal compression strategy for implicit inference techniques and (2) show that by using this optimal compression strategy, both the implicit and explicit full-field methods yield comparable results.

3. Review of common compression loss functions

To ensure the robustness of implicit inference techniques in cases where forward simulations are high dimensional, it becomes necessary to employ compression techniques that reduce the dimensionality of the data space into summary statistics. Specifically, we try to find a function t=F(x), where t represents low-dimensional summaries of the original data vector x. The objective is to build a compression function F(x) that captures all the information about θ contained in the data x while reducing dimensionality. Previous studies (e.g., Heavens et al. 2000; Alsing & Wandelt 2018) have demonstrated that a compression scheme can be achieved where the dimension of the summaries dim(t) is equal to the dimension of the unknown parameters dim(θ) without any loss of information at the Fisher level. Although these proofs rely on local assumptions, the guiding principle is that by matching the dimensions of t and θ, we ensure that t has just the right “space” to capture the variation in θ embedded in x. Furthermore, regression loss functions require that the summary statistics share the same dimensionality as the parameters θ. Thus, to best isolate the influence of the loss function on the construction of summary statistics, we use this dimensional convention for all benchmarked compression schemes.

There exist multiple approaches to tackling this dimensionality reduction challenge. This section aims to provide an overview of the various neural compression-based methods employed in previous works.

Mean squared error. One of the commonly used techniques for training a neural network is by minimizing the L₂ norm or MSE. This methods has been widely adopted in various previous studies (Ribli et al. 2018; Lu et al. 2022, 2023), where the loss function is typically formulated as follows:

$L_{MSE} = \frac{1}{N_{θ}} \sum_{i = 1}^{N_{θ}} (t_{i} - θ_{i})^{2} .$ ${\cal {{L_{{\textrm {MSE}}}}}}= \frac {1}{N_{\theta }}\sum _{i=1}^{N_{\theta }}(t_i-\theta _i)^2.$ (5)

Here N_θ represents the number of cosmological parameters, t denotes the summary statistics, and θ corresponds to the cosmological parameters. Minimizing the L₂-norm is equivalent to training the model to estimate the mean of the posterior distribution. We prove this statement in Appendix A. However, it is important to note that this approach does not guarantee the recovery of maximally informative summary statistics as it ignores the shape, spread, and correlation structure of the posterior distribution. Indeed, two posteriors can have the same mean while exhibiting different posterior distributions: the posterior mean is not guaranteed to be a sufficient statistic.

Mean absolute error. Another commonly used approach involves minimizing the L₁-norm or MAE. In this approach, the loss function is defined as:

$L_{MAE} = \frac{1}{N_{θ}} \sum_{i = 1}^{N_{θ}} | t_{i} - θ_{i} |,$ ${\cal {{L_{{\textrm {MAE}}}}}} = \frac {1}{N_{\theta }}\sum _{i=1}^{N_{\theta }}|t_i-\theta _i|,$ (6)

where t represents the summary statistics and θ denotes the cosmological parameters. In Appendix B, we demonstrate that minimizing this loss function is equivalent to training the model to estimate the median of the posterior distribution. While extensively employed in various previous studies (Gupta et al. 2018; Fluri et al. 2018; Ribli et al. 2019), it is important to note that this loss suffers the same pathology as the MSE loss.

Variational mutual information maximization. This technique was first introduced for cosmological inference problems by Jeffrey et al. (2021). This approach aims to maximize the mutual information I(t, θ) between the cosmological parameters θ and the summary statistics t. Maximizing this mutual information helps construct sufficient summary statistics, as t is sufficient for the parameters θ if and only if I(x, θ) = I(t, θ) by definition.

In the VMIM approach, the loss function is defined as:

$L_{VMIM} = - log q (θ | F_{φ} (x); φ') .$ ${\cal {{L_{{\textrm {VMIM}}}}}}=- \log {q({\boldsymbol {\theta }} |F_{{\boldsymbol {\varphi }}}({\boldsymbol {x}}); {\boldsymbol {\varphi }}')}.$ (7)

Here, q(θ|F_φ(x);φ′) represents a variational conditional distribution, where θ corresponds to the data vector of the cosmological parameters, and φ′ to the parameters characterizing the variational conditional distribution itself. F_φ denotes the compression network of parameters φ, used to extract the summary statistics t from the original high-dimensional data vector x, such that t=F_φ(x). In order to understand the significance of this loss function, it is necessary to start by considering the mathematical definition of mutual information I(t, θ):

$I (t, θ) = D_{KL} (p (t, θ) | | p (t) p (θ))$ $I({\boldsymbol {t}}, {\boldsymbol {\theta }}) = D_{\mathrm {KL}}(p({\boldsymbol {t}}, {\boldsymbol {\theta }})||p({\boldsymbol {t}})p({\boldsymbol {\theta }}))$ (8)

$\begin{matrix} = \int d θ d t p (t, θ) log (\frac{p (t, θ)}{p (t) p (θ)}) \\ = \int d θ d t p (t, θ) log (\frac{p (θ | t)}{p (θ)}) \\ = \int d θ d t p (t, θ) log p (θ | t) - \int d θ d t p (t, θ) log p (θ) \\ = \int d θ d t p (t, θ) log p (θ | t) - \int d θ p (θ) log p (θ) \\ = E_{p (t, θ)} [log p (θ | t)] - E_{p (θ)} [log p (θ)] \\ = E_{p (t, θ)} [log p (θ | t)] - H (θ); \end{matrix}$ $\begin{aligned}&= \int {\mathrm {d}}{\boldsymbol {\theta }} {\mathrm {d}}{\boldsymbol {t}} p({\boldsymbol {t}}, {\boldsymbol {\theta }})\log {\left (\frac {p({\boldsymbol {t}}, {\boldsymbol {\theta }})}{p({\boldsymbol {t}}) p({\boldsymbol {\theta }})}\right )} \\ &= \int {\mathrm {d}}{\boldsymbol {\theta }} {\mathrm {d}}{\boldsymbol {t}} p({\boldsymbol {t}}, {\boldsymbol {\theta }})\log {\left (\frac {p({\boldsymbol {\theta }} | {\boldsymbol {t}})}{p({\boldsymbol {\theta }})}\right )} \\ &= \int {\mathrm {d}}{\boldsymbol {\theta }} {\mathrm {d}}{\boldsymbol {t}} p({\boldsymbol {t}}, {\boldsymbol {\theta }})\log {p({\boldsymbol {\theta }} | {\boldsymbol {t}})} - \int {\mathrm {d}}{\boldsymbol {\theta }} {\mathrm {d}}{\boldsymbol {t}} p({\boldsymbol {t}}, {\boldsymbol {\theta }})\log {p({\boldsymbol {\theta }})} \\ &= \int {\mathrm {d}}{\boldsymbol {\theta }} {\mathrm {d}}{\boldsymbol {t}} p({\boldsymbol {t}}, {\boldsymbol {\theta }})\log {p({\boldsymbol {\theta }} | {\boldsymbol {t}})} - \int {\mathrm {d}}{\boldsymbol {\theta }} p({\boldsymbol {\theta }})\log {p({\boldsymbol {\theta }})} \\ &= {\mathbb {E}}_{p({\boldsymbol {t}}, {\boldsymbol {\theta }})} [\log {p({\boldsymbol {\theta }} | {\boldsymbol {t}})}]- {\mathbb {E}}_{p({\boldsymbol {\theta }})} [\log {p({\boldsymbol {\theta }})}] \\ &= {\mathbb {E}}_{p({\boldsymbol {t}}, {\boldsymbol {\theta }})} [\log {p({\boldsymbol {\theta }} | {\boldsymbol {t}})}]- H({\boldsymbol {\theta }}); \end{aligned}$

in the above equation, D_KL is the Kullback-Leibler divergence (Kullback & Leibler 1951), p(t, θ) is the joint probability distribution of summary statistics and cosmological parameters, and H(θ) represents the entropy of the distribution of cosmological parameters. Essentially, mutual information measures the amount of information contained in the summary statistics t about the cosmological parameters θ. The goal is to find the parameters of the network φ that maximize the mutual information between the summary and cosmological parameters:

$φ^{*} = \underset{φ}{argmax} I (F_{φ} (x), θ) .$ ${\boldsymbol {\varphi }}^* = \mathop {\,\mathrm {argmax}\,}\limits _{{\boldsymbol {\varphi }}} I(F_{{\boldsymbol {\varphi }}}({\boldsymbol {x}}), {\boldsymbol {\theta }}).$ (9)

However, the mutual information expressed in Equation (8) is not tractable since it relies on the unknown posterior distribution. To overcome this limitation, various approaches that rely on tractable bounds have been developed, enabling the training of deep neural networks to maximize mutual information. In this study, we adopt the same strategy used by Jeffrey et al. (2021), that involves using the variational lower bound (Barber & Agakov 2003):

$I (t, θ) \geq E_{p (t, θ)} [log q (θ | t; φ')] - H (θ) .$ $I({\boldsymbol {t}}, {\boldsymbol {\theta }}) \ge {\mathbb {E}}_{p({\boldsymbol {t}}, {\boldsymbol {\theta }})} [\log {q({\boldsymbol {\theta }} |{\boldsymbol {t}}; {\boldsymbol {\varphi }}')}]- H({\boldsymbol {\theta }}).$ (10)

Here, the variational conditional distribution log q(θ|t;φ′) is introduced to approximate the true posterior distribution p(θ|t). As the entropy of the cosmological parameters remains constant with respect to φ, the optimization problem based on the lower bound in Equation (10) can be formulated as:

$\underset{φ, φ'}{argmax} E_{p (x, θ)} [log q (θ | F_{φ} (x); φ')],$ $\mathop {\,\mathrm {argmax}\,}\limits _{{\boldsymbol {\varphi }}, {\boldsymbol {\varphi }}'}{\mathbb {E}}_{p({\boldsymbol {x}}, {\boldsymbol {\theta }})} [\log {q({\boldsymbol {\theta }} |F_{{\boldsymbol {\varphi }}}({\boldsymbol {x}}); {\boldsymbol {\varphi }}')}],$ (11)

yielding Equation (7).

Given the fundamental principles of deep learning, this method is capable of constructing sufficient statistics by design. This relies on having a sufficiently flexible neural network and a sufficiently large dataset.

One may see a similarity between VMIM and NPE with an information bottleneck (IB) (Tishby et al. 2000; Alemi et al. 2019), as both approaches aim to maximize the mutual information I(t;θ) between parameters and summary statistics. However, while IB introduces an adaptive trade-off between relevance and compression,

$\underset{t}{argmax} \underset{relevance}{\underset{︸}{I (t; θ)}} - β \underset{compression}{\underset{︸}{I (t; x)}},$ $\mathop {\,\mathrm {argmax}\,}\limits _t\underbrace {I({\boldsymbol {t}}; {\boldsymbol {\theta }})}_{{\textrm {relevance}}} - \beta \underbrace {I({\boldsymbol {t}}; {\boldsymbol {x}})}_{{\textrm {compression}}},$ (12)

VMIM achieves a similar effect by directly minimizing I(t;x) through dimensional constraints. By enforcing dim(t)≪dim(x), VMIM ensures that the representation selectively retains information, implicitly reducing irrelevant details without the need to fine-tune the trade-off parameter β.

Gaussian negative log-likelihood. Recognizing that the aleatoric uncertainty on different cosmological parameters varies, a third class of inverse variance weighted MSE was proposed in Fluri et al. (2018) with the idea of ensuring that each parameter contributes fairly to the overall loss by taking into account its uncertainty. The loss function typically takes the following form:

$L_{GNLL} = \frac{1}{2} log (| Σ |) + \frac{1}{2} (t - θ)^{⊤} Σ^{- 1} (t - θ),$ ${\cal {{L_{{\textrm {GNLL}}}}}} = \frac {1}{2} \log (|{\boldsymbol {\Sigma }}|) + \frac {1}{2}({\boldsymbol {t}} - {\boldsymbol {\theta }})^{\top } \Sigma ^{-1} ({\boldsymbol {t}} - {\boldsymbol {\theta }}),$ (13)

where t is the summary statistics and Σ is the covariance matrix representing the uncertainty on the cosmological parameters θ. Both t and Σ can be outputs of the compression network, that is, F_φ(x) = (t, Σ). But only the mean is kept as summary statistics.

One recognizes here the expression of a Gaussian probability function, and this expression can thus be related to the VMIM case by simply assuming a Gaussian distribution as the variational approximation for the posterior $q (θ | x) = N (θ; t, Σ)$ $q({\boldsymbol {\theta }} | {\boldsymbol {x}}) = {\cal {{N}}}({\boldsymbol {\theta }}; {\boldsymbol {t}}, {\boldsymbol {\Sigma }})$ . We demonstrate in Appendix C that under this loss function the summary t extracted by the neural network is, similarly to the MSE case, only an estimate of the mean of the posterior distribution, which is not guaranteed to be sufficient. We note that the summary statistics obtained by GNLL should thus be the same as the ones obtained from MSE but at a greater cost since GNLL requires optimizing the mean and the coefficients of the covariance matrix of the Gaussian jointly. We further find in practice that optimizing this loss instead of the MSE is significantly more challenging (as we illustrate in the results section), which we attribute to the log determinant term of the loss being difficult to optimize.

Information maximizing neural networks. A different approach has been proposed by Charnock et al. (2018) and further explored in Makinen et al. (2021, 2022). They implemented the information maximizing neural networks (IMNN), a neural network trained on forward simulations designed to learn optimal compressed summaries, in circumstances where the likelihood function is intractable or unknown. Specifically, they propose a new scheme to find optimal non-linear data summaries by using the Fisher information to train a neural network. Inspired by the MOPED algorithm (Heavens et al. 2000), the IMNN is a transformation f that maps the data to compressed summaries: f:x→t while conserving the Fisher information. The loss function takes the following form:

$L_{IMNN} = - ln det (F) + r_{Σ},$ ${\cal {{L}}}_{{\textrm {IMNN}}} = -\ln {\textrm {det}}({\textbf{F}})+ r_{\Sigma },$ (14)

where F is the Fisher matrix, and r_Σ is a regularization term typically dependent on the covariance matrix, introduced to condition the variance of the summaries. Since computing the Fisher matrix requires a large number of simulations, they proceed as follows: a large number of simulations with the same fiducial cosmology but different initial random conditions are fed forwards through the network. The summaries from these simulations are combined to compute the covariance matrix. Additionally, the summaries from simulations created with different fixed cosmologies are used to calculate the derivative of the mean of the summary with respect to the parameter¹. Finally, the covariance and the mean derivatives are combined to obtain the Fisher matrix. We leave the comparison of this scheme on our weak lensing simulations to a future work, due to the memory constraints of available devices and the large batch sizes required for computing Equation (14).

Fluri et al. (2021, 2022) make use of a re-factored Fisher-based loss in their weak lensing compression. They rearrange Equation (14) such that the output summary does not follow a Gaussian sampling distribution:

$L_{Fluri} = log det (Σ_{θ} (t)) - 2 log | det (\frac{\partial Ψ_{θ} (t)}{\partial θ}) |,$ ${\cal {{L_{{\textrm {Fluri}}}}}}=\log {{\textrm {det}}({\boldsymbol {\Sigma }}_{{\boldsymbol {\theta }}} ({\boldsymbol {t}})})-2\log {\left |{\textrm {det}}\left (\frac {\partial \Psi _{{\boldsymbol {\theta }}} ({\boldsymbol {t}})}{\partial {\boldsymbol {\theta }}}\right )\right |},$ (15)

where Σ is the covariance matrix, and $Ψ_{θ} (t) = E_{p (x | θ)} [t]$ $\Psi _{{\boldsymbol {\theta }}}({\boldsymbol {t}})={\mathbb {E}}_{p({\boldsymbol {x}}|{\boldsymbol {\theta }})}[{\boldsymbol {t}}]$ . This loss function is equivalent to the log-determinant of the inverse Fisher matrix used in Equation (14). Implementing this loss in Fluri et al. (2021, 2022) did not provide large improvements in cosmological constraints beyond the power spectrum, likely due to the smaller number of training simulations.

4. The `sbi_lens` framework

To investigate the questions above, we developed the Python package sbi_lens, that provides a weak-lensing differentiable simulator based on a lognormal model. sbi_lens enables the sampling of convergence maps in a tomographic setting while considering the cross-correlation between different redshift bins.

4.1. Lognormal modeling

For various cosmological applications, the non-Gaussian field can be modeled as a lognormal field (Coles & Jones 1991; Böhm et al. 2017). This model offers the advantage of generating a convergence field rapidly while allowing the extraction of information beyond the two-point statistics. Although studies demonstrated that this model fails in describing the 3D field (Klypin et al. 2018), it properly describes the 2D convergence field (Clerkin et al. 2017; Xavier et al. 2016). Assuming a simulated Gaussian convergence map κ_g, whose statistical properties are fully described by its power spectrum C_ℓ, we know that this model is not a suitable representation of late-time and more evolved structures. One potential solution is to find a transformation f(κ_g) of this map that mimics the non-Gaussian features in the convergence field. In doing so, it is crucial to ensure that the transformed map maintains the correct mean and variance, effectively recovering the correct two-point statistics. Denoting μ and $σ_{g}^{2}$ $\sigma _{\mathrm {g}}^2$ the mean and covariance matrix of κ_g, respectively, we can define the transformed convergence κ_ln as a shifted lognormal random field:

$κ_{\ln} = e^{κ_{g}} - λ,$ $\kappa _{\mathrm {ln}}=e^{\kappa _{g}}-\lambda ,$ (16)

where λ is a free parameter that determines the shift of the lognormal distribution. The convergence κ in a given redshift bin is fully determined by the shift parameter λ, the mean μ of the associated Gaussian field κ_g, and its variance $σ_{g}^{2}$ $\sigma _{g}^2$ . The correlation of the lognormal field, denoted as ξ_ln, is also a function of these variables and is related to $ξ_{g}^{ij}$ $\xi ^{ij}_{\mathrm {g}}$ through the following equations:

$\begin{matrix} ξ_{\ln}^{ij} (θ) & \equiv λ_{i} λ_{j} (e^{ξ_{g}^{ij} (θ)} - 1) \\ ξ_{g}^{ij} (θ) & = log [\frac{ξ_{\ln}^{ij} (θ)}{λ_{i} λ_{j}} + 1] . \end{matrix}$ $\begin{aligned}\xi ^{ij}_{\mathrm {ln}}(\theta ) &\equiv \lambda _i \lambda _j (e^{\xi ^{ij}_{\mathrm {g}}(\theta )}-1) \\ \xi ^{ij}_{\mathrm {g}}(\theta )&=\log {\left [\frac {\xi ^{ij}_{\mathrm {ln}}(\theta )}{\lambda _i \lambda _j}+1\right ]}. \end{aligned}$ (17)

Here i and j define a pair of redshift bins. The parameter λ, also known as minimum convergence parameter, defines the lowest values for all possible values of κ. The modeling of the shift parameter can be approached in various ways. For example, it can be determined by matching moments of the distribution (Xavier et al. 2016) or by treating it as a free parameter (Hilbert et al. 2011). In general, the value of λ depends on the redshift, cosmology, and the scale of the field at which smoothing is applied.

While it is straightforward to simulate a single map, if we want to constrain the convergence map in different redshift bins, an additional condition must be met. The covariance of the map should recover the correct angular power spectrum:

$〈 {\tilde{κ}}_{\ln}^{(i)} (ℓ) {\tilde{κ}}_{\ln}^{* (j)} (ℓ') 〉 = C_{\ln}^{ij} (ℓ) δ^{K} (ℓ - ℓ'),$ $\left \langle {\tilde {\kappa }}^{(i)}_{\mathrm {ln}} (\ell ){\tilde {\kappa }}^{*(j)}_{\mathrm {ln}}(\ell ')\right \rangle =C^{ij}_{\mathrm {ln}}(\ell )\delta ^{K}(\ell -\ell '),$ (18)

where $C_{\ln}^{ij} (ℓ)$ $C^{ij}_{\mathrm {ln}}(\ell )$ is the power spectrum of κ_ln in Fourier space, defined as:

$C_{\ln}^{ij} (ℓ) = 2 π \int_{0}^{π} d θ sin θ P_{ℓ} (cos θ) ξ_{\ln}^{ij} (θ)$ $C^{ij}_{\mathrm {ln}}(\ell )=2\pi \int _0^{\pi } {\mathrm {d}}\theta \sin {\theta }P_{\ell }(\cos {\theta })\xi ^{ij}_{\mathrm {ln}}(\theta )$ (19)

and P_ℓ is the Legendre polynomial of order ℓ. Using the lognormal model, we can simultaneously constrain the convergence field in different redshift bins while considering the correlation between the bins, as described by Equation (17).

In the sbi_lens framework, the sampling of the convergence maps can be described as follows. First, we define the survey in terms of galaxy number density, redshifts, and shape noise. Then, we compute the theoretical auto-angular power spectrum Cⁱⁱ(ℓ) and cross-angular power spectrum C^ij(ℓ) for each tomographic bin. These theoretical predictions are calculated using the public library jax-cosmo (Campagne et al. 2023). Next, we project the one-dimensional C(ℓ) onto two-dimensional grids with the desired final convergence map size. Afterwards, we compute the Gaussian correlation functions $ξ_{g}^{ij} (θ)$ $\xi ^{ij}_{\mathrm {g}}(\theta )$ using Equation (17). To sample the convergence field in a specific redshift bin while considering the correlation with other bins, we use Equation (18). We construct the covariance matrix Σ of the random field κ, where κ represents the vector of convergence maps at different redshifts as follows:

$Σ = (\begin{matrix} C_{ℓ}^{11} & C_{ℓ}^{12} & \dots & C_{ℓ}^{1 n} \\ C_{ℓ}^{21} & C_{ℓ}^{22} & \dots & C_{ℓ}^{2 n} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ C_{ℓ}^{n 1} & C_{ℓ}^{n 2} & \dots & C_{ℓ}^{nn} \end{matrix}) .$ ${\boldsymbol {\Sigma }}= \left (\begin {array}{cccc} C_{\ell }^{11} & C_{\ell }^{12} & \cdots & C_{\ell }^{1n} \\ C_{\ell }^{21} & C_{\ell }^{22} & \cdots & C_{\ell }^{2n} \\ \vdots & \vdots & \ddots & \vdots \\ C_{\ell }^{n1} & C_{\ell }^{n2} & \cdots & C_{\ell }^{nn} \end {array}\right ).$ (20)

To sample more efficiently, we perform an eigenvalue decomposition of Σ to obtain a new matrix $\tilde{Σ}$ ${\tilde {{\boldsymbol {\Sigma }}}}$ :

$\tilde{Σ} = Q Λ^{1 / 2} Q^{T}$ ${\tilde {{\boldsymbol {\Sigma }}}}={\boldsymbol {Q}}{\boldsymbol {\Lambda }}^{1/2}{\boldsymbol {Q}}^{T}$ (21)

where Q and Λ are the eigenvectors and eigenvalues of Σ, respectively. Next, we sample the Gaussian random maps κ_g using the equation:

$κ_{g} = \hat{Z} \cdot \tilde{Σ}$ ${\boldsymbol {\kappa }}_{g}={\hat {{\boldsymbol {Z}}}}\cdot {\tilde {{\boldsymbol {\Sigma }}}}$ (22)

where $\hat{Z}$ ${\hat {{\boldsymbol {Z}}}}$ represents the Fourier transform of the latent variables of the simulator. Finally, we transform the Gaussian map κ_g into a lognormal field using Equation (16).

To ensure that we recover the correct auto- and cross-power spectra, we compared the results from our simulations to theoretical predictions for different tomographic bin combinations. We show the results in Figure D.1.

4.2. Data generation

Our analysis is based on a standard flat wCDM cosmological model, that includes the following parameters: the baryonic density fraction Ω_b, the cold dark matter density fraction Ω_c, the Hubble parameter h₀, the spectral index n_s, the amplitude of the primordial power spectrum σ₈ and the dark energy parameter w₀. The priors used in the simulations and in the inference process are listed in Table 2, following Zhang et al. (2022). To simulate our data, we developed the sbi_lens package, which employs a lognormal model to represent the convergence maps, as explained in the previous section. Specifically, the package uses the public library jax-cosmo to compute the theoretical power- and cross-spectra. The computation of the lognormal shift parameter was performed using the Cosmomentum code (Friedrich et al. 2018, 2020), which utilizes perturbation theory to compute the cosmology-dependent shift parameters. In Cosmomentum the calculation of the shift parameters assumes a cylindrical window function, while our pixels are rectangular. Following Boruah et al. (2022), we computed the shift parameters at a characteristic scale, R=ΔL/π, where ΔL represents the pixel resolution. For each redshift bin, we tested the dependency of the shift parameter λ on various cosmological parameters. Specifically, we investigated how the value of λ changed when varying a specific cosmological parameter while keeping the others fixed. Our findings reveal that the parameters Ω_b, h₀, and n_s have almost no significant impact on λ. As a result, we computed the shift parameters for each redshift using the fiducial cosmology values of Ω_b, h₀, and n_s. To account for the cosmology dependence of λ on Ω_c, σ₈, and w₀, we calculated the shift for various points in the cosmological parameter space and then interpolated the shift values for other points in the parameter space. Each map is reproduced on a regular grid with dimensions of 256×256 pixels and covers an area of 10×10 deg². An example of a tomographic convergence map simulated using the sbi_lens package is shown in Figure 1.

Fig. 1.

Example of convergence maps simulated using the sbi_lens package.

Table 2.

Prior and fiducial values used for the analyses.

4.3. Noise and survey setting

We conducted a tomographic study to reproduce the redshift distribution and the expected noise for the LSST Y10 data release. Following Zhang et al. (2022), we modeled the underlying redshift distribution using the parametrized Smail distribution (Smail et al. 1995):

$n (z) \propto z^{2} exp - (z / z_{0})^{α},$ $n(z) \propto z^2 \exp {-(z/z_0)^{\alpha }},$ (23)

with z₀ = 0.11 and α = 0.68. We also assumed a photometric redshift error σ_z = 0.05(1+z) as defined in the LSST DESC Science Requirements Document (SRD, Mandelbaum et al. 2018). The galaxy sources are divided into five tomographic bins, each containing an equal number of sources N_s, computed from the average galaxy number density n_gal and the survey pixel area A_pix. For each redshift bin, we assumed Gaussian noise with mean zero and variance given by

$σ_{n}^{2} = \frac{σ_{e}^{2}}{N_{s}} \cdot$ $\sigma ^2_n= \frac {\sigma _e^2}{N_{\mathrm {s}}}\cdot$ (24)

Figure 2 illustrates the resulting source redshift distribution, and Table 3 provides a summary of the survey specifications.

Fig. 2.

Source sample redshift distributions for each tomographic bin for LSST Y10. The number density on the y-axis is shown in arcminutes per square degree.

Table 3.

LSST Y10 source galaxy specifications in our analysis.

5. Experiment

In the following section, we illustrate the inference strategy for the two approaches under investigation: the Bayesian forward modeling and the map-based inference based on implicit inference. Additionally, we conduct a power spectrum study. Indeed, as discussed in Section 4.1, lognormal fields offer the advantage of rapidly generating convergent fields while accounting for non-Gaussianities. To emphasize this claim further, along with the full-field analysis, we include a power spectrum analysis. This analysis demonstrates that there is indeed a gain of information when adopting a full-field approach.

5.1. Explicit inference

5.1.1. Full-field inference with BHMs

The explicit joint likelihood p(x|z, θ) provided by an explicit forward model is the key ingredient for conducting explicit full-field inference. In the following, we describe how we built this likelihood function based on the forward model described in Section 4.1.

The measurement of convergence for each pixel and bin differs from real observations due to noise. This is taken into consideration in the likelihood. Specifically, for LSST Y10, the number of galaxies for each pixel should be sufficiently high so that, according to the central limit theorem, we can assume the observation is characterized by Gaussian noise, with $σ_{n}^{2} = σ_{e}^{2} / N_{s}$ $\sigma _n^2=\sigma _e^2/N_{\mathrm {s}}$ , where N_s represents the total number of source galaxies per bin and pixel. Given $σ_{n}^{2}$ $\sigma _n^2$ the variance of this Gaussian likelihood, its negative log-form can be expressed as:

$L (θ, z) = - \sum_{i}^{N_{pix}} \sum_{j}^{N_{bins}} log p (κ_{i, j}^{obs} | κ_{i, j}, θ) \propto \sum_{i}^{N_{pix}} \sum_{j}^{N_{bins}} \frac{[κ_{i, j} - κ_{i, j}^{obs}]^{2}}{2 σ_{n}^{2}},$ ${\cal {{L}}}({\boldsymbol {\theta }}, {\boldsymbol {z}}) = -\sum _i^{N_{\mathrm {pix}}} \sum _{j}^{N_{\mathrm {bins}}} \log {p(\kappa ^{\mathrm {obs}}_{i,j}|\kappa _{i,j},{\boldsymbol {\theta }})} \propto \sum _i^{N_{\mathrm {pix}}} \sum _{j}^{N_{\mathrm {bins}}}\frac {[\kappa _{i,j}-\kappa ^{\mathrm {obs}}_{i,j}]^2}{2\sigma _n^2},$ (25)

where κ^obs refers to the observed noisy convergence maps. This map is fixed for the entire benchmark.

Since the explicit full-field approach involves sampling the entire forward model, it typically leads to a high-dimensional problem, requiring more sophisticated statistical techniques. To sample the posterior distribution p(θ, z|x), we used a HMC algorithm. Specifically, we employed the NUTS algorithm (Hoffman & Gelman 2014), an adaptive variant of HMC implemented in NumPyro (Phan et al. 2019; Bingham et al. 2019).

The HMC algorithm is particularly helpful in high-dimensional spaces where a large number of steps are required to effectively explore the space. It improves the sampling process by leveraging the information contained in the gradients to guide the sampling process. As the code is implemented within the JAX framework, the gradients of the computation are accessible via automatic differentiation.

5.1.2. Power spectrum

To obtain the posterior distribution of the cosmological parameters given the angular power spectra C_ℓ, we assumed a Gaussian likelihood with a cosmological-independent covariance matrix:

$L (θ) = - \frac{1}{2} [d - μ (θ)]^{T} C^{- 1} [d - μ (θ)] .$ ${\cal {{L}}}({\boldsymbol {\theta }})=-\frac {1}{2}[{\boldsymbol {d}}-{\boldsymbol {\mu }} ({\boldsymbol {\theta }})]^{T}{\boldsymbol {C}}^{-1}[{\boldsymbol {d}}-{\boldsymbol {\mu }}({\boldsymbol {\theta }})].$ (26)

To compute the expected theoretical predictions μ(θ) we used jax-cosmo. The covariance matrix C of the observables is computed at the fiducial cosmology, presented in Table 2, using the same theoretical library. Specifically, in jax-cosmo, the Gaussian covariance matrix is defined as:

$Cov (C_{ℓ}, C_{ℓ'}) = \frac{1}{f_{sky} (2 ℓ + 1)} (C_{ℓ} + \frac{σ_{ϵ}^{2}}{2 n_{s}}) δ^{K} (ℓ - ℓ'),$ ${\textrm {Cov}}(C_{\ell },C_{\ell '})=\frac {1}{f_{\mathrm {sky}}(2 \ell +1)}\left (C_{\ell }+\frac {\sigma _{\epsilon }^2}{2n_{\mathrm {s}}}\right )\delta ^{K}(\ell -\ell '),$ (27)

where f_sky is the fraction of sky observed by the survey, and n_s is the number density of galaxies. To obtain the data vector d, containing the auto- and the cross-power spectra for each tomographic bin, we used the LensTools package (Petri 2016) on the fixed observed noisy map. To constrain the cosmological parameters, we sampled from the posterior distribution using the NUTS algorithm from NumPyro.

5.2. Implicit inference

5.2.1. Compression strategy

We propose here to benchmark four of the most common loss functions introduced in Section 3, that is, MSE, MAE, GNLL, and VMIM. For all of them, we used the same convolutional neural network architecture for our compressor: a ResNet-18 (He et al. 2016). The ResNet-18 is implemented using Haiku (Hennigan et al. 2020), a Python deep learning library built on top of JAX.

While the different compression strategies share the same architecture, the training strategy for VMIM involves an additional neural network. To train the neural compressor under VMIM, we jointly optimized the weights φ of the neural network F_φ (the compressor) and the parameters φ′ of the variational distribution q_φ′. For VMIM, the variational distribution q(θ|t;φ′) is modeled using a normalizing flow (NF). After training, we exported the results of the neural compressor F_φ but discarded the results from the density estimator. Then, we trained a new NF to approximate the posterior distribution. Indeed, as mentioned before, we performed the implicit inference as a two-step procedure. This choice is motivated by the fact that it is difficult to train a precise conditional density estimator when the compressor can still change from iteration to iteration. Hence, the choice to split the problem into two steps: first, the dimensionality reduction, where the density estimation part does not need to be perfect; second, the density estimation itself, which needs to be done very carefully now, but it is much easier because we are in low dimension.

5.2.2. Inference strategy

Based on previous compression we now aim to perform inference. We use implicit inference methods to bypass the problem of assuming a specific likelihood function for the summary statistics t. As mentioned before, implicit inference approaches allow to perform rigorous Bayesian inference when the likelihood is intractable by using only simulations from a black box simulator. We focus on neural density estimation methods where the idea is to introduce a parametric distribution model q_φ′ and learn the parameters φ′ from the dataset (θ_i, x_i)_{i = 1…N} so that it approximates the target distribution. In particular, we focus on NPE, that is, we aim to directly approximate the posterior distribution.

We used a conditional NF to model the parametric conditional distribution q_φ′(θ|t), and we optimized the parameters φ′ according to the following negative log-likelihood loss function:

$L_{NLL} = - log q_{φ'} (θ | t) .$ ${\cal {{L}}}_{{\textrm {NLL}}}=-\log {q_{{\boldsymbol {\varphi }}'}({\boldsymbol {\theta }}|{\boldsymbol {t}})}.$ (28)

In the limit of a large number of samples and sufficient flexibility, we obtained

$q_{φ^{' *}} (θ | t) \approx p (θ | t),$ $q_{{\boldsymbol {\varphi }}^{\prime \ast }}({\boldsymbol {\theta }} | {\boldsymbol {t}}) \approx p({\boldsymbol {\theta }} | {\boldsymbol {t}}),$ (29)

where we indicate with $φ^{' *}$ ${\boldsymbol {\varphi }}^{\prime \ast }$ the values of φ′ minimizing Equation (28). Finally, the target posterior p(θ|t=t₀) is approximated by $q_{φ^{' *}} (θ | t = t_{0})$ $q_{{\boldsymbol {\varphi }}^{\prime \ast }}({\boldsymbol {\theta }} | {\boldsymbol {t}} = {\boldsymbol {t}}_{0})$ , where t₀=F_φ(x₀), that is, the compressed statistics from the fixed fiducial convergence map x₀ for a given compression strategy.

For each of the summary statistics, we approximated the posterior distribution using the same NF architecture, namely a RealNVP (Dinh et al. 2017) with 4 coupling layers. The shift and the scale parameters are learned using a neural network with 2 layers of 128 neurons with Sigmoid-weighted Linear Units (SiLU) activation functions (Elfwing et al. 2017).

6. Results

We present the results of the constraints on the full wCDM parameter space expected for a survey such as LSST Y10. The results are obtained using the simulation procedure outlined in Section 4 and the parameter inference strategy described in Section 5.2.2. The same fiducial map is used for all inference methods. We begin by presenting the estimators we use to quantify our results and to compare the different approaches. Then we compare the outcomes of three inference procedures: the two-point statistics, the explicit full-field statistics, and the implicit full-field statistics using CNN summaries. Subsequently, we analyze the impact of the different compression strategies on the final cosmological constraints.

6.1. Result estimators

We quantify the results by computing the figure of merit defined as follows:

${FoM}_{α β} = \sqrt{det ({\tilde{F}}_{α β})} .$ ${\textrm {FoM}}_{\alpha \beta }=\sqrt {{\textrm {det}}({\tilde {F}}_{\alpha \beta })}.$ (30)

Here, α and β represent a pair of cosmological parameters, and ${\tilde{F}}_{α β}$ ${\tilde {F}}_{\alpha \beta }$ refers to the marginalized Fisher matrix. We calculate ${\tilde{F}}_{α β}$ ${\tilde {F}}_{\alpha \beta }$ as the inverse of the parameter space covariance matrix C_αβ, which is estimated from the NF for the implicit inference or the HMC samples for the explicit inference. Under the assumption of a Gaussian covariance, the FoM defined in Equation (30) is proportional to the inverse of the 2-σ contours in the two-dimensional marginalized parameter space of the α and β pair.

In addition, from the definition of a sufficient statistic, we can determine whether a compression strategy is sufficient or not. This can be done by comparing the posterior contours to the ground truth (the explicit full-field approach).

6.2. Power spectrum and full-field statistics

We now compare the constraining power of the three approaches described in Section 5: the standard two-point statistics and two map-based approaches, the explicit inference and the implicit inference strategy. As outlined before, our interest is to prove that the two map-based approaches lead to very comparable posterior distributions. We present the 68.3% and 95.5% confident regions from the fixed fiducial map for the full wCDM parameters in Figure 3. The contours obtained by the angular C_ℓ analysis are plotted in blue, the ones for the explicit full-field inference (HMC) in yellow, and those for the implicit full-field inference (VMIM compression and NPE) in black. The results are presented in Table 4. The remarkably strong agreement between the two posteriors confirms that the two map-based cosmological inference methods yield the same results. The ratio of their FoM, corresponding to 1.00, 1.04, 1.03, in the (Ω_c−σ₈);(Ω_c−w₀);(σ₈−w₀) planes, establishes the validity of the full-field implicit inference strategy. We note that h and Ω_b are prior dominated and hence are constrained by none of the three approaches. Additionally, for the full-field strategies we find that the size of the contours is significantly smaller than the size of the prior distributions adopted. Moreover, we find that the two-point statistic is suboptimal in constraining Ω_c, σ₈, and w₀, while the map-based approaches yield much tighter constraints on these parameters. We find that the map-based explicit and implicit strategies lead to an improvement in the FoM of 2.06×, 1.98×, 2.28× and 2.06×, 1.90×, 2.22×, respectively, in the (Ω_c−σ₈);(Ω_c−w₀);(σ₈−w₀) plane.

Fig. 3.

Constraints on the wCDM parameter space as found in the LSST Y10 survey setup. The constraints are obtained by applying the C_ℓ (blue contours), the full-field explicit inference (yellow contours), and the full-field implicit inference strategy using the VMIM compression (black dashed contours), described in Section 5. The contours show the 68% and the 95% confidence regions. The dashed lines define the true parameter values.

Table 4.

Figure of merit (FoM).

6.3. Optimal compression strategy

Figure 4 shows our 68.3% and 95.5% constraints from the fixed fiducial map using different compressed summaries. We compare the constraints obtained from MSE (yellow contours), MAE (blue contours), and VMIM (black dashed contours). We note that the results obtained using different summaries are generally in agreement with each other, and there are no tensions present for any of the cosmological parameters. In Figure 5 we show the difference we obtain between MSE and GNLL compression, which as explained in Section 3 should yield the same summary statistics and thus the same constraints. This difference is due to optimization issues. As explained before, the GNLL loss function aims to optimize both the mean and the covariance matrix, which makes the training very unstable compared to the training of MSE. After investigating different training procedures we conclude that the GNLL loss, because of the covariance matrix, is hard to train and recommend to use of MSE loss if only the mean of the Gaussian is of interest.

Fig. 4.

Constraints on the wCDM parameter space as found in the LSST Y10 survey setup. The constraints are obtained from three CNN map compressed statistics: the MSE (yellow contours), the MAE (blue contours), VMIM (black dashed contours), described in Section 5. The same implicit inference procedure is used to get the approximated posterior from these four different compressed data. The contours show the 68% and the 95% confidence regions. The dashed lines define the true parameter values.

Fig. 5.

Constraints on the wCDM parameter space as found in the LSST Y10 survey setup. We compare the constraints obtained with the GNLL compression (green contours) and MSE compression (yellow contours). These two compressions are supposed to yield the same constraints. The difference is due to optimization issues while training using the GNLL loss function. The same implicit inference procedure is used to get the approximated posterior from these two different compressed data. The contours show the 68% and the 95% confidence regions. The dashed lines define the true parameter values.

We report the marginalized summary constraints in Table 5. The results concern the cosmological parameters that are better constrained from weak-lensing: Ω_c, σ₈, w₀. We note that the VMIM compressed summary statistics prefer values of Ω_c, σ₈, and w₀ that are closer to our fiducial cosmology than those inferred by MSE and MAE. To further quantify these outcomes, we consider the FoM described in Equation (30). The results are presented in Table 4. We can see that VMIM yields more precise measurements than MSE, and MAE for all considered parameters. In particular, the FoM of (Ω_c, σ₈) is improved by 1.24×, and 1.1× compared to MSE, and MAE respectively; the FoM of (Ω_c, w₀) is improved by 1.25×, and 1.17× from the MSE, and MAE; the FoM of (σ₈, w₀) is improved by 1.23×, and 1.15× from the MSE, and MAE.

Table 5.

Summary of the marginalized parameter distributions.

7. Conclusion

Implicit inference typically involves two steps. The first step is the automatic learning of an optimal low-dimensional summary statistic, and the second step is the use of a neural density estimator in low dimensions to approximate the posterior distribution. In this work, we evaluated the impact of different neural compression strategies used in the literature on the final posterior distribution. We demonstrated in particular that sufficient neural statistics can be achieved using the VMIM loss function, in which case both implicit and explicit full-field inference yield the same posterior.

We created an experimental setup specifically designed to assess the effects of compression methods. Our custom sbi_lens package features a JAX-based lognormal weak lensing forward model that allows us to use the forward model for explicit full-field inference and simulate the mock data required to train the implicit model. It is designed for inference applications that need access to the model's derivatives. Our analysis is based on synthetic weak-lensing data with five tomographic bins, mimicking a survey such as LSST Y10.

After providing an overview of the different compression strategies adopted in the literature for implicit inference strategies, we compared the impact of some of those on the final constraints on the cosmological parameters for a wCDM model. We also performed the classical power spectrum analysis and the full-field explicit analysis which consists of directly sampling the BHM. We found the following results:

The marginalized summary statistics indicate that VMIM produces better results for Ω_c, w₀, and σ₈ in terms of agreement with the fiducial value. However, it is important to note that the results from MSE, and MAE are not in tension with the fiducial parameters. Furthermore, we quantified the outcomes by examining the FoM and found that VMIM provides more precise measurements compared to MSE, and MAE.
When using the VMIM to compress the original high-dimensional data, we compared the posterior obtained in the implicit inference framework with those obtained from Bayesian hierarchical modeling and the power spectrum. We demonstrated that both map-based approaches lead to a significant improvement in constraining Ω_c, w₀, σ₈ compared to the two-point statistics. However, n_s is not more constrained with full-field inference than with two-point inference and h, Ω_b are not constrained by either and are prior-dominated.
When using the VMIM to compress the original high-dimensional data the two methods, that is, Bayesian hierarchical inference and implicit inference, lead to the same posterior distributions.

It is important to consider the potential limitations of the current implementation and highlight particular strategies for future extensions and applications of this project. In this work, we employed a physical model based on a lognormal field, which is notably faster than N-body simulation-based methods. Although we have shown that this description accounts for additional non-Gaussian information, as evidenced by the fact that we obtained different posteriors from the full-field and power spectrum methods, it is important to note that this is a good approximation for the convergence at intermediate scales but may not be appropriate for analyzing small scales. Furthermore, the lognormal shift parameters are computed using the Cosmomentum code (Friedrich et al. 2018, 2020), which employs perturbation theory. As mentioned by Boruah et al. (2022), the perturbation theory-based approach may not provide accurate results at small scales². Additionally, we did not include any systematics in the current application, although previous studies demonstrated that the map-based approaches help to dramatically improve the constraints on systematics and cosmological parameters (Kacprzak & Fluri 2022). Hence, the natural next step is to implement an N-body model as the physical model for the sbi_lens package (Lanzieri et al. 2023), and to improve the realism of the model by including additional systematics such as redshift uncertainties, baryonic feedback, and a more realistic intrinsic alignment model. However, regardless of the complexity of the simulations, VMIM is theoretically capable of constructing sufficient statistics, as it maximizes the mutual information I(t, θ) between the summary statistics and the parameters θ, and by definition, a statistics is sufficient if and only if I(t, θ) = I(x, θ). Therefore, in theory, VMIM should also generate sufficient statistics when using more realistic simulations, however, empirical confirmation is still required, which we leave for future work.

Regarding inference methodologies, we highlight that in this work we did not take into consideration the cost of the number of simulations as an axis in this comparison. Our main goal was to bring theoretical and empirical understanding of the impact of different loss functions in the asymptotic regime of a powerful enough neural compressor, and limitless number of simulations. One possible strategy to learn statistics at a low number of simulations is the IMNN, as it only requires gradients of simulations around the fiducial cosmology. Another potential strategy first explored in Sharma et al. (2024) is to use transfer learning and pre-train a compressor on fairly inexpensive simulations (e.g., FastPM dark matter only simulations), and fine tune the compression model on more realistic and expensive simulations. Given that, as illustrated in this work, we know that full-field implicit inference under VMIM compression is theoretically optimal, the remaining question on the methodology side is to optimize the end-to-end simulation cost of the method.

Regarding the metrics, the FoM reported in Table 4 was computed from a single map. Ideally, the FoM should be averaged over multiple realizations to provide more robust results. While obtaining posterior distributions for multiple observations would be straightforward using the implicit full-field inference approach specifically through NPE, as it only requires evaluating the learned NF on new observations, applying the explicit full-field inference method would be computationally intensive. The latter would require sampling the entire forward model from scratch for each new observation. Additionally, it is worth noting that a similar FoM for VMIM and explicit inference is necessary but not sufficient to prove that the statistics are sufficient. To provide a more comprehensive comparison, additional metrics, such as contour plots, means, and standard deviations, which complement the FoM and align with the conventional set of metrics in the field, should be considered. We also note that in our companion paper (Zeghal et al. 2024), we use the classifier two-sample test (C2ST) to assess the similarity between posterior distributions from VMIM and explicit inference. The C2ST score reflects the probability that samples are drawn from the same distribution, with a score of 0.5 indicating similarity and 1.0 indicating total divergence. Our analysis shows a C2ST score of 0.6, suggesting that while the distributions differ somewhat, they are not completely dissimilar. This observed difference could be due to various factors, including imperfections in explicit inference, slight insufficiencies in the summary statistics, training issues in the NF, or biases in the C2ST metric. Although achieving exact sufficiency is challenging, the near-sufficiency of VMIM makes it a strong candidate for data compression compared to other schemes.

Data availability

To facilitate further experimentation and benchmarking by the community we make our simulation framework sbi_lens publicly available. All codes associated with the analyses presented in this paper are available at this link.³

Acknowledgments

This work was granted access to the HPC/AI resources of IDRIS under the allocations 2022-AD011013922 and 2023-AD010414029 made by GENCI. This research was supported by the Munich Institute for Astro-, Particle and BioPhysics (MIAPbP), which is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy – EXC-2094 – 390783311. This work was supported by the TITAN ERA Chair project (contract no. 101086741) within the Horizon Europe Framework Program of the European Commission, and the Agence Nationale de la Recherche (ANR-22-CE31-0014-01 TOSCA). This work was supported by the Data Intelligence Institute of Paris (diiP), and IdEx Université de Paris (ANR-18-IDEX-0001).

¹

The method of finite differences is necessary when the framework in which the code is implemented does not support automatic differentiation.

²

However, as the main objective of this project is to compare the different inference strategies, we are not very concerned with the potential implications of this approximation at this stage.

³

https://github.com/dlanzieri/WL_Implicit-Inference

⁴

For notational simplicity, we demonstrate this statement in one dimension; the generalization to N-dimensions is straightforward.

References

Ajani, V., Peel, A., Pettorino, V., et al. 2020, Phys. Rev. D, 102, 103531 [Google Scholar]
Ajani, V., Starck, J. -L., & Pettorino, V. 2021, A&A, 645, L11 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
Akhmetzhanova, A., Mishra-Sharma, S., & Dvorkin, C. 2024, MNRAS, 527, 7459 [Google Scholar]
Alemi, A. A., Fischer, I., Dillon, J. V., & Murphy, K. 2019, ArXiv e-prints [arXiv:1612.00410] [Google Scholar]
Alsing, J., & Wandelt, B. 2018, MNRAS, 476, L60 [NASA ADS] [CrossRef] [Google Scholar]
Alsing, J., Heavens, A., & Jaffe, A. H. 2017, MNRAS, 466, 3272 [Google Scholar]
Barber, D., & Agakov, F. 2003, Advances in Neural Information Processing Systems, 16 [Google Scholar]
Bernardo, J. M., & Smith, A. F. M. 2001, Meas. Sci. Technol., 12, 221 [Google Scholar]
Bingham, E., Chen, J. P., Jankowiak, M., et al. 2019, J. Mach. Learn. Res., 20, 28:1 [Google Scholar]
Böhm, V., Hilbert, S., Greiner, M., & Enßlin, T. A. 2017, Phys. Rev. D, 96, 123510 [CrossRef] [Google Scholar]
Boruah, S. S., Rozo, E., & Fiedorowicz, P. 2022, MNRAS, 516, 4111 [NASA ADS] [CrossRef] [Google Scholar]
Boyle, A., Uhlemann, C., Friedrich, O., et al. 2021, MNRAS, 505, 2886 [NASA ADS] [CrossRef] [Google Scholar]
Campagne, J. -E., Lanusse, F., Zuntz, J., et al. 2023, Open J. Astrophys., 6, 15 [NASA ADS] [CrossRef] [Google Scholar]
Charnock, T., Lavaux, G., & Wandelt, B. D. 2018, Phys. Rev. D, 97, 083004 [NASA ADS] [CrossRef] [Google Scholar]
Cheng, S., & Ménard, B. 2021, MNRAS, 507, 1012 [NASA ADS] [CrossRef] [Google Scholar]
Clerkin, L., Kirk, D., Manera, M., et al. 2017, MNRAS, 466, 1444 [NASA ADS] [CrossRef] [Google Scholar]
Coles, P., & Jones, B. 1991, MNRAS, 248, 1 [NASA ADS] [CrossRef] [Google Scholar]
Cranmer, K., Pavez, J., & Louppe, G. 2015, ArXiv e-prints [arXiv:1506.02169] [Google Scholar]
Cranmer, K., Brehmer, J., & Louppe, G. 2020, Proc. Natl. Acad. Sci., 117, 30055 [Google Scholar]
Dai, B., & Seljak, U. 2024, Proc. Natl. Acad. Sci., 121, e2309624121 [Google Scholar]
Dinh, L., Sohl-Dickstein, J., & Bengio, S. 2017, ArXiv e-prints [arXiv:1605.08803] [Google Scholar]
Elfwing, S., Uchibe, E., & Doya, K. 2017, ArXiv e-prints [arXiv:1702.03118] [Google Scholar]
Fiedorowicz, P., Rozo, E., & Boruah, S. S. 2022a, ArXiv e-prints [arXiv:2210.12280] [Google Scholar]
Fiedorowicz, P., Rozo, E., Boruah, S. S., Chang, C., & Gatti, M. 2022b, MNRAS, 512, 73 [NASA ADS] [CrossRef] [Google Scholar]
Fluri, J., Kacprzak, T., Refregier, A., et al. 2018, Phys. Rev. D, 98, 123518 [NASA ADS] [CrossRef] [Google Scholar]
Fluri, J., Kacprzak, T., Lucchi, A., et al. 2019, Phys. Rev. D, 100, 063514 [Google Scholar]
Fluri, J., Kacprzak, T., Refregier, A., Lucchi, A., & Hofmann, T. 2021, Phys. Rev. D, 104, 123526 [NASA ADS] [CrossRef] [Google Scholar]
Fluri, J., Kacprzak, T., Lucchi, A., et al. 2022, Phys. Rev. D, 105, 083518 [NASA ADS] [CrossRef] [Google Scholar]
Friedrich, O., Gruen, D., DeRose, J., et al. 2018, Phys. Rev. D, 98, 023508 [Google Scholar]
Friedrich, O., Uhlemann, C., Villaescusa-Navarro, F., et al. 2020, MNRAS, 498, 464 [NASA ADS] [CrossRef] [Google Scholar]
Gatti, M., Jain, B., Chang, C., et al. 2022, Phys. Rev. D, 106, 083509 [NASA ADS] [CrossRef] [Google Scholar]
Greenberg, D. S., Nonnenmacher, M., & Macke, J. H. 2019, ArXiv e-prints [arXiv:1905.07488] [Google Scholar]
Gupta, A., Matilla, J. M. Z., Hsu, D., et al. 2018, Phys. Rev. D, 97, 103515 [NASA ADS] [CrossRef] [Google Scholar]
Halder, A., Friedrich, O., Seitz, S., & Varga, T. N. 2021, MNRAS, 506, 2780 [CrossRef] [Google Scholar]
Harnois-Déraps, J., Martinet, N., & Reischke, R. 2022, MNRAS, 509, 3868 [Google Scholar]
He, K., Zhang, X., Ren, S., & Sun, J. 2016, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770 [Google Scholar]
Heavens, A. F., Jimenez, R., & Lahav, O. 2000, MNRAS, 317, 965 [NASA ADS] [CrossRef] [Google Scholar]
Hennigan, T., Cai, T., Norman, T., Martens, L., & Babuschkin, I. 2020, Haiku: Sonnet for JAX, http://github.com/deepmind/dm-haiku [Google Scholar]
Hilbert, S., Hartlap, J., & Schneider, P. 2011, A&A, 536, A85 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
Hoffman, M. D., & Gelman, A. 2014, J. Mach. Learn. Res., 15, 1593 [Google Scholar]
Ivezić, Ž., Kahn, S. M., Tyson, J. A., et al. 2019, ApJ, 873, 111 [Google Scholar]
Izbicki, R., Lee, A. B., & Schafer, C. M. 2014, ArXiv e-prints [arXiv:1404.7063] [Google Scholar]
Jaynes, E. T. 2003, Probability Theory: The Logic of Science (Cambridge: Cambridge University Press) [CrossRef] [Google Scholar]
Jeffrey, N., Alsing, J., & Lanusse, F. 2021, MNRAS, 501, 954 [Google Scholar]
Jeffrey, N., Whiteway, L., Gatti, M., et al. 2024, MNRAS, 536, 1303 [NASA ADS] [CrossRef] [Google Scholar]
Junzhe Zhou, A., Li, X., Dodelson, S., & Mandelbaum, R. 2023, ArXiv e-prints [arXiv:2312.08934] [Google Scholar]
Kacprzak, T., & Fluri, J. 2022, Phys. Rev. X, 12, 031029 [NASA ADS] [Google Scholar]
Kacprzak, T., Kirk, D., Friedrich, O., et al. 2016, MNRAS, 463, 3653 [Google Scholar]
Klypin, A., Prada, F., Betancort-Rijo, J., & Albareti, F. D. 2018, MNRAS, 481, 4588 [Google Scholar]
Kratochvil, J. M., Lim, E. A., Wang, S., et al. 2012, Phys. Rev. D, 85, 103513 [Google Scholar]
Kullback, S., & Leibler, R. A. 1951, Ann. Math. Stat., 22, 79 [CrossRef] [Google Scholar]
Lanzieri, D., Lanusse, F., Modi, C., et al. 2023, A&A, 679, A61 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
Laureijs, R., Amiaux, J., Arduini, S., et al. 2011, ArXiv e-prints [arXiv:1110.3193] [Google Scholar]
Lin, C. -A., & Kilbinger, M. 2015, A&A, 583, A70 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
Liu, J., & Madhavacheril, M. S. 2019, Phys. Rev. D, 99, 083508 [Google Scholar]
Liu, J., Petri, A., Haiman, Z., et al. 2015a, Phys. Rev. D, 91, 063507 [Google Scholar]
Liu, X., Pan, C., Li, R., et al. 2015b, MNRAS, 450, 2888 [Google Scholar]
Lu, T., Haiman, Z., & Zorrilla Matilla, J. M. 2022, MNRAS, 511, 1518 [NASA ADS] [CrossRef] [Google Scholar]
Lu, T., Haiman, Z., & Li, X. 2023, ArXiv e-prints [arXiv:2301.01354] [Google Scholar]
Lueckmann, J. M., Goncalves, P. J., Bassetto, G., et al. 2017, ArXiv e-prints [arXiv:1711.01861] [Google Scholar]
Lueckmann, J. M., Bassetto, G., Karaletsos, T., & Macke, J. H. 2018, ArXiv e-prints [arXiv:1805.09294] [Google Scholar]
Makinen, T. L., Charnock, T., Alsing, J., & Wandelt, B. D. 2021, JCAP, 2021, 049 [CrossRef] [Google Scholar]
Makinen, T. L., Charnock, T., Lemos, P., et al. 2022, Open J. Astrophys., 5, 18 [NASA ADS] [CrossRef] [Google Scholar]
Mandelbaum, R., Eifler, T., Hložek, R., et al. 2018, ArXiv e-prints [arXiv:1809.01669] [Google Scholar]
Martinet, N., Schneider, P., Hildebrandt, H., et al. 2018, MNRAS, 474, 712 [Google Scholar]
Matilla, J. M. Z., Sharma, M., Hsu, D., & Haiman, Z. 2020, Phys. Rev. D, 102, 123506 [Google Scholar]
Papamakarios, G., & Murray, I. 2018, ArXiv e-prints [arXiv:1605.06376] [Google Scholar]
Papamakarios, G., Sterratt, D. C., & Murray, I. 2018, ArXiv e-prints [arXiv:1805.07226] [Google Scholar]
Peel, A., Lin, C. -A., Lanusse, F., et al. 2017, A&A, 599, A79 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
Petri, A. 2016, Astron. Comput., 17, 73 [NASA ADS] [CrossRef] [Google Scholar]
Petri, A., Haiman, Z., Hui, L., May, M., & Kratochvil, J. M. 2013, Phys. Rev. D, 88, 123002 [Google Scholar]
Phan, D., Pradhan, N., & Jankowiak, M. 2019, ArXiv e-prints [arXiv:1912.11554] [Google Scholar]
Porqueres, N., Heavens, A., Mortlock, D., & Lavaux, G. 2021, MNRAS, 502, 3035 [NASA ADS] [CrossRef] [Google Scholar]
Porqueres, N., Heavens, A., Mortlock, D., Lavaux, G., & Makinen, T. L. 2023, ArXiv e-prints [arXiv:2304.04785] [Google Scholar]
Ribli, D., Ármin Pataki, B., & Csabai, I. 2018, ArXiv e-prints [arXiv:1806.05995] [Google Scholar]
Ribli, D., Pataki, B. Á., Zorrilla Matilla, J. M., et al. 2019, MNRAS, 490, 1843 [CrossRef] [Google Scholar]
Rizzato, M., Benabed, K., Bernardeau, F., & Lacasa, F. 2019, MNRAS, 490, 4688 [NASA ADS] [CrossRef] [Google Scholar]
Semboloni, E., Schrabback, T., van Waerbeke, L., et al. 2011, MNRAS, 410, 143 [Google Scholar]
Shan, H., Liu, X., Hildebrandt, H., et al. 2018, MNRAS, 474, 1116 [Google Scholar]
Sharma, D., Dai, B., & Seljak, U. 2024, ArXiv e-prints [arXiv:2403.03490] [Google Scholar]
Smail, I., Hogg, D. W., Yan, L., & Cohen, J. G. 1995, ApJ, 449, L105 [Google Scholar]
Spergel, D., Gehrels, N., Baltay, C., et al. 2015, ArXiv e-prints [arXiv:1503.03757] [Google Scholar]
Takada, M., & Jain, B. 2004, MNRAS, 348, 897 [Google Scholar]
Thomas, O., Dutta, R., Corander, J., Kaski, S., & Gutmann, M. U. 2016, ArXiv e-prints [arXiv:1611.10242] [Google Scholar]
Tishby, N., Pereira, F. C., & Bialek, W. 2000, ArXiv e-prints [arXiv:physics/0004057] [Google Scholar]
Uhlemann, C., Friedrich, O., Villaescusa-Navarro, F., Banerjee, A., & Codis, S. 2020, MNRAS, 495, 4006 [NASA ADS] [CrossRef] [Google Scholar]
Xavier, H. S., Abdalla, F. B., & Joachimi, B. 2016, MNRAS, 459, 3693 [NASA ADS] [CrossRef] [Google Scholar]
Zeghal, J., Lanzieri, D., Lanusse, F., et al. 2024, A&A, submitted [arXiv:2409.17975] [Google Scholar]
Zhang, Z., Chang, C., Larsen, P., et al. 2022, MNRAS, 514, 2181 [NASA ADS] [CrossRef] [Google Scholar]
Zürcher, D., Fluri, J., Sgier, R., et al. 2022, MNRAS, 511, 2075 [CrossRef] [Google Scholar]

Appendix A: Mean squared error

In this section, as well as the next one, we derive established results concerning the regression loss functions and what they actually learn. Additional details can be found in Jaynes (2003).

Here, we demonstrate that minimizing the L₂ norm is equivalent to training the model to estimate the mean of the posterior distribution. Namely,

${〈 θ 〉}_{p (θ | x)} = \underset{F (x)}{argmin} E_{p (θ, x)} [{‖ θ - F (x) ‖}_{2}^{2}],$ $\left \langle {\boldsymbol {\theta }} \right \rangle _{p({\boldsymbol {\theta }}|{\boldsymbol {x}})}=\mathop {\,\mathrm {argmin}\,}\limits _{F({\boldsymbol {x}})} {\mathbb {E}}_{p({\boldsymbol {\theta }},{\boldsymbol {x}})}[\left \Vert {\boldsymbol {\theta }}-F({\boldsymbol {x}})\right \Vert _{2}^{2}],$ (A.1)

where the posterior mean ${〈 θ 〉}_{p (θ | x)}$ $\left \langle {\boldsymbol {\theta }} \right \rangle _{p({\boldsymbol {\theta }}|{\boldsymbol {x}})}$ , is calculated as follows:

${〈 θ 〉}_{p (θ | x)} = E_{p (θ | x)} [θ] .$ $\left \langle {\boldsymbol {\theta }} \right \rangle _{p({\boldsymbol {\theta }}|{\boldsymbol {x}})}= {\mathbb {E}}_{p({\boldsymbol {\theta }}|{\boldsymbol {x}})}[{\boldsymbol {\theta }}].$ (A.2)

To demonstrate this statement, we need to minimize the expected value of the L₂ norm with respect to F(x). We considered its derivative:

$\frac{\partial}{\partial F (x)} E_{p (θ | x))} [(θ - F (x))^{2}] =$ $\frac {\partial }{\partial F({\boldsymbol {x}})} {\mathbb {E}}_{p({\boldsymbol {\theta }}|{\boldsymbol {x}}))} [({\boldsymbol {\theta }}-F({\boldsymbol {x}}))^2] =$ (A.3)

$\begin{matrix} \frac{\partial}{\partial F (x)} E_{p (θ | x)} [θ^{2} + F (x)^{2} - 2 θ F (x)] = \\ \frac{\partial}{\partial F (x)} [E_{p (θ | x)} [θ^{2}] + F (x)^{2} - 2 F (x) E_{p (θ | x)} [θ]] = \\ 2 F (x) - 2 E_{p (θ | x)} [θ] . \end{matrix}$ $\begin{aligned}& \frac {\partial }{\partial F({\boldsymbol {x}})} {\mathbb {E}}_{p({\boldsymbol {\theta }}|{\boldsymbol {x}})} [{\boldsymbol {\theta }}^2+F({\boldsymbol {x}})^2-2{\boldsymbol {\theta }}F({\boldsymbol {x}})] = \\ & \frac {\partial }{\partial F({\boldsymbol {x}})} [{\mathbb {E}}_{p({\boldsymbol {\theta }}|{\boldsymbol {x}})} [{\boldsymbol {\theta }}^2]+F({\boldsymbol {x}})^2-2F({\boldsymbol {x}}){\mathbb {E}}_{p({\boldsymbol {\theta }}|{\boldsymbol {x}})} [{\boldsymbol {\theta }}]] = \\ &2F({\boldsymbol {x}})-2 {\mathbb {E}}_{p({\boldsymbol {\theta }}|{\boldsymbol {x}})}[{\boldsymbol {\theta }}]. \end{aligned}$

Setting it equal to zero, we obtained the critical value:

$F (x) = E_{p (θ | x)} [θ] .$ $F({\boldsymbol {x}})= {\mathbb {E}}_{p({\boldsymbol {\theta }}|{\boldsymbol {x}})}[{\boldsymbol {\theta }}].$ (A.4)

Considering the second-order derivative:

$\frac{\partial^{2}}{\partial^{2} F (x)} E_{p (θ | x)} [(θ - F (x))^{2}] = 2 > 0,$ $\frac {\partial ^2}{\partial ^2 F({\boldsymbol {x}})} {\mathbb {E}}_{p({\boldsymbol {\theta }}|{\boldsymbol {x}})}[({\boldsymbol {\theta }}-F({\boldsymbol {x}}))^2]=2>0,$ (A.5)

we can assert that this critical value is also a minimum. From Equation A.4 and Equation A.2, we obtain Equation A.1.

Appendix B: Mean absolute error

In this section, we demonstrate that minimizing the L₁ norm is equivalent to training the model to estimate the median of the posterior distribution. Namely:

$θ_{p (θ | x)}^{M} = \underset{F (x)}{argmin} E_{p (θ, x)} [| θ - F (x) |] .$ ${\boldsymbol {\theta }}^M_{p({\boldsymbol {\theta }}|{\boldsymbol {x}})}=\mathop {\,\mathrm {argmin}\,}\limits _{F({\boldsymbol {x}})} {\mathbb {E}}_{p({\boldsymbol {\theta }},{\boldsymbol {x}})}[|{\boldsymbol {\theta }}-F({\boldsymbol {x}})|].$ (B.1)

By definition, the median of a one-dimensional⁴ probability density function p(x) is a real number m that satisfies:

$\int_{\infty}^{m} p (x) dx = \int_{m}^{\infty} p (x) dx = \frac{1}{2} .$ $\int _{\infty }^{m} p(x)dx=\int _{m}^{\infty }p(x)dx=\frac {1}{2}.$ (B.2)

The expectation value of the mean absolute error is defined as:

$E_{p (x)} [| x - m |] = \int_{\infty}^{\infty} p (x) | x - m | dx$ ${\mathbb {E}}_{p(x)}[|x-m|]= \int _{\infty }^{\infty }p(x)|x-m|dx$ (B.3)

which can be decomposed as

$\int_{\infty}^{m} p (x) | x - m | dx + \int_{m}^{\infty} p (x) | x - m | dx .$ $\int _{\infty }^{m}p(x)|x-m|dx +\int _{m}^{\infty }p(x)|x-m|dx.$ (B.4)

To minimize this function with respect to m, we need to compute its derivative:

$\frac{d E [| x - m |]}{dm} = \frac{d}{dm} \int_{\infty}^{m} p (x) | x - m | dx + \frac{d}{dm} \int_{m}^{\infty} p (x) | x - m | dx .$ $\frac {d{\mathbb {E}}[|x-m|]}{dm}= \frac {d}{dm}\int _{\infty }^{m}p(x)|x-m|dx +\frac {d}{dm}\int _{m}^{\infty }p(x)|x-m|dx.$ (B.5)

Considering that |x−m|=(x−m) for m≤x and |x−m|=(m−x) m≥x, we can write Equation B.5 as:

$\frac{d E [| x - m |]}{dm} = \frac{d}{dm} \int_{\infty}^{m} p (x) (m - x) dx + \frac{d}{dm} \int_{m}^{\infty} p (x) (x - m) dx .$ $\frac {d{\mathbb {E}}[|x-m|]}{dm}= \frac {d}{dm}\int _{\infty }^{m}p(x)(m-x)dx +\frac {d}{dm}\int _{m}^{\infty }p(x)(x-m)dx.$ (B.6)

Using the Leibniz integral rule, we obtained

$\frac{d E [| x - m |]}{dm} =$ $\frac {d{\mathbb {E}}[|x-m|]}{dm}=$ (B.7)

$\begin{matrix} p (x) (m - x) \frac{dm}{dm} + \int_{\infty}^{m} \frac{\partial}{\partial m} [p (x) (m - x)] dx \\ + p (x) (x - m) \frac{dm}{dm} + \int_{m}^{\infty} \frac{\partial}{\partial m} [p (x) (x - m)] dx . \end{matrix}$ $\begin{aligned}& p(x)(m-x)\frac {dm}{dm}+\int _{\infty }^{m}\frac {\partial } {\partial m}[p(x)(m-x)]dx \\ & + p(x)(x-m)\frac {dm}{dm}+\int _{m}^{\infty }\frac {\partial }{\partial m}[p(x)(x-m)]dx. \end{aligned}$

Setting the derivative to zero, we obtained

$\frac{d E [| x - m |]}{dm} = \int_{\infty}^{m} p (x) dx - \int_{m}^{\infty} p (x) dx = 0 .$ $\frac {d{\mathbb {E}}[|x-m|]}{dm}= \int _{\infty }^{m} p(x)dx-\int _{m}^{\infty }p(x)dx =0.$ (B.8)

Thus,

$\int_{\infty}^{m} p (x) dx = \int_{m}^{\infty} p (x) dx .$ $\int _{\infty }^{m} p(x)dx=\int _{m}^{\infty }p(x)dx.$ (B.9)

Considering that

$\int_{\infty}^{m} p (x) dx + \int_{m}^{\infty} p (x) dx = 1,$ $\int _{\infty }^{m} p(x)dx+\int _{m}^{\infty }p(x)dx=1,$ (B.10)

we obtain Equation B.2.

Appendix C: Gaussian negative log-likelihood

In this section, we demonstrate that the summary statistics obtained from the GNLL minimization corresponds to the mean of the posterior exactly as the minimization of the MSE loss function.

The minimization of the GNLL corresponds to the minimization of the forward Kullback Leiber Divergence (DKL):

$\hat{φ} = arg min_{φ} D_{KL} (p (x) | | q_{φ} (x))$ ${\hat {\varphi }} = \arg \min _{\varphi } D_{KL}(p(x) \; || \; q_{\varphi }(x))$ (C.1)

$\begin{matrix} = arg min_{φ} E_{p (x)} [log (\frac{p (x)}{q_{φ} (x)})] \\ = arg min_{φ} \underset{constant w . r . t φ}{\underset{︸}{E_{p (x)} [log (p (x))]}} - E_{p (x)} [log (q_{φ} (x))] \\ = arg min_{φ} - E_{p (x)} [log (q_{φ} (x))] \end{matrix}$ $\begin{aligned}& = \arg \min _{\varphi } {\mathbb {E}}_{p(x)}\Big [\log \left (\frac {p(x)} {q_{\varphi }(x)}\right )\Big ]\\ & = \arg \min _{\varphi } \underbrace {{\mathbb {E}}_{p(x)}\left [\log \left (p(x)\right ) \right ]}_{{\textrm {constant w.r.t}}\ \varphi } - {\mathbb {E}}_{p(x)}\left [ \log \left (q_{\varphi }(x)\right )\right ] \\ & = \arg \min _{\varphi } - {\mathbb {E}}_{p(x)}\left [\log \left (q_{\varphi }(x)\right )\right ] \end{aligned}$

where we let q be a Gaussian distribution

$q (x) = \frac{1}{\sqrt{2 π σ_{q}^{2}}} exp (- \frac{(x - μ_{q})^{2}}{2 σ_{q}^{2}}),$ $q(x) = \frac {1}{\sqrt {2\pi \sigma _q^2}} \exp \left (-\frac {(x - \mu _q)^2}{2\sigma _q^2}\right ),$ (C.2)

and φ=(μ_q, σ_q) be the mean and covariance of this distribution that we aim to optimize to minimize this forward DKL.

Replacing q by the Gaussian distribution in the forward DKL expression yields

$\begin{matrix} D_{KL} (p | | q) & = \int p (x) log p (x) dx + \int p (x) log (\sqrt{2 π σ_{q}^{2}}) dx \\ + \int p (x) \frac{(x - μ_{q})^{2}}{2 σ_{q}^{2}} dx . \end{matrix}$ $\begin{aligned}D_{KL}(p || q) &= \int p(x) \log p(x) \, dx + \int p(x) \log \left (\sqrt {2\pi \sigma _q^2} \right ) \, dx \\ &\qquad + \int p(x) \frac {(x - \mu _q)^2}{2\sigma _q^2} \, dx. \end{aligned}$

The first term of this integral is independent of the parameters φ and corresponds to the entropy H(p). The second term simplifies to $log (\sqrt{2 π σ_{q}^{2}})$ $\log (\sqrt {2\pi \sigma _q^2})$ , as the integral of p(x) over its domain is 1. The third term necessitates a bit more work:

$\int p (x) \frac{(x - μ_{q})^{2}}{2 σ_{q}^{2}} dx$ $\int p(x) \frac {(x - \mu _q)^2}{2\sigma _q^2} \, dx$ (C.3)

$= \frac{1}{2 σ_{q}^{2}} \int p (x) ((x - E_{p} [X]) + (E_{p} [X] - μ_{q}))^{2} dx$ $=\frac {1}{2\sigma _q^2} \int p(x) ((x - {\mathbb {E}}_p[X]) + ({\mathbb {E}}_p[X] - \mu _q))^2 \, dx$ (C.4)

$= \frac{1}{2 σ_{q}^{2}} \int p (x) (x - E_{p} [X])^{2} dx$ $= \frac {1}{2\sigma _q^2}\int p(x) (x - {\mathbb {E}}_p[X])^2 \, dx$ (C.5)

$+ \frac{1}{σ_{q}^{2}} (E_{p} [X] - μ_{q}) \int p (x) (x - E_{p} [X]) dx$ $\qquad + \frac {1}{\sigma _q^2}({\mathbb {E}}_p[X] - \mu _q) \int p(x) (x - {\mathbb {E}}_p[X]) \, dx$ (C.6)

$+ \frac{1}{2 σ_{q}^{2}} (E_{p} [X] - μ_{q})^{2} \int p (x) dx,$ $\qquad + \frac {1}{2\sigma _q^2} ({\mathbb {E}}_p[X] - \mu _q)^2 \int p(x) \, dx,$ (C.7)

and by definition of the expected value

$\begin{matrix} \int p (x) (x - E_{p} [X]) dx & = - E_{p} [X] + \int xp (x) dx \\ = - E_{p} [X] + E_{p} [X] dx \\ = 0, \end{matrix}$ $\begin{aligned}\int p(x) (x - {\mathbb {E}}_p[X]) \, dx &= - {\mathbb {E}}_p[X] + \int xp(x) \, dx \\ &= - {\mathbb {E}}_p[X] + {\mathbb {E}}_p[X] \, dx \\ &= 0, \end{aligned}$

the middle term vanishes yielding to

$\frac{1}{2 σ_{q}^{2}} [\int p (x) (x - E_{p} [X])^{2} dx + (E_{p} [X] - μ_{q})^{2}]$ $\frac {1}{2\sigma _q^2} \left [\int p(x) (x - {\mathbb {E}}_p[X])^2 \, dx + ({\mathbb {E}}_p[X] - \mu _q)^2\right ]$ (C.8)

$= \frac{1}{2 σ_{q}^{2}} [{Var}_{p} [X] + (E_{p} [X] - μ_{q})^{2}] .$ $= \frac {1}{2\sigma _q^2} \left [{\textrm {Var}}_p[X] + ({\mathbb {E}}_p[X] - \mu _q)^2\right ].$ (C.9)

Finally, putting all this together we have

$D_{KL} (p | | q) = H (p) + log (\sqrt{2 π σ_{q}^{2}}) + \frac{1}{2 σ_{q}^{2}} [{Var}_{p} [X] + (E_{p} [X] - μ_{q})^{2}] .$ $D_{KL}(p || q) = H(p) + \log \left (\sqrt {2\pi \sigma _q^2} \right ) + \frac {1}{2\sigma _q^2} \left [{\textrm {Var}}_p[X] + ({\mathbb {E}}_p[X] - \mu _q)^2\right ].$ (C.10)

Let us now find the minimum of this DKL. Computing the derivatives

$\frac{\partial D_{KL} (p | | q)}{\partial μ_{q}} = \frac{1}{σ_{q}^{2}} (E_{p} [X] - μ_{q})$ $\frac {\partial D_{KL}(p || q)}{\partial \mu _q} = \frac {1}{\sigma _q^2}\left ({\mathbb {E}}_p[X] - \mu _q \right )$ (C.11)

$\frac{\partial D_{KL} (p | | q)}{\partial σ_{q}} = \frac{1}{σ_{q}^{2}} - \frac{1}{σ_{q}^{3}} [{Var}_{p} [X] + (E_{p} [X] - μ_{q})^{2}]$ $\frac {\partial D_{KL}(p || q)}{ \partial \sigma _q} = \frac {1}{\sigma _q^2} - \frac {1}{\sigma _q^3}\left [{\textrm {Var}}_p[X] + ({\mathbb {E}}_p[X] - \mu _q)^2\right ]$ (C.12)

and setting it to zero, we can derive:

${\begin{matrix} μ_{q} & = E_{p} [X] \\ σ_{q} & = {Var}_{p} [X] \end{matrix}$ $\begin {cases}\mu _q &= {\mathbb {E}}_p[X]\\ \sigma _q &= {\textrm {Var}}_p[X] \end {cases}$ (C.13)

which is the only solution of these two equations system. To confirm that this point minimizes the DKL we derive the Hessian

$H (μ_{q}, σ_{q}) = (\begin{matrix} \frac{1}{σ_{q}^{2}} & - \frac{2 (E_{p} [X] - μ_{q})}{σ_{q}^{3}} \\ - \frac{2 (E_{p} [X] - μ_{q})}{σ_{q}^{3}} & - \frac{1}{σ_{q}^{2}} + \frac{3 ({Var}_{p} [X] + (E_{p} [X] - μ_{q})^{2})}{σ_{q}^{4}} \end{matrix}),$ $H(\mu _q, \sigma _q) = \left (\begin {array}{cc} \frac {1}{\sigma _q^2} &-\frac {2({\mathbb {E}}_p[X] - \mu _q)}{\sigma _q^3} \\ -\frac {2({\mathbb {E}}_p[X] - \mu _q)}{\sigma _q^3} &-\frac {1}{\sigma _q^2} + \frac {3({\textrm {Var}}_p[X] + ({\mathbb {E}}_p[X] - \mu _q)^2)}{\sigma _q^4} \end {array}\right ),$ (C.14)

evaluating it on the critical point we have

$H (E_{p} [X], {Var}_{p} [X]) = (\begin{matrix} \frac{1}{{Var}_{p} [X]} & 0 \\ 0 & \frac{2}{{Var}_{p} [X]} \end{matrix}) .$ $H({\mathbb {E}}_p[X], {\textrm {Var}}_p[X]) = \left (\begin {array}{cc} \frac {1}{{\textrm {Var}}_p[X]} &0 \\ 0 &\frac {2}{{\textrm {Var}}_p[X]} \end {array}\right ).$ (C.15)

To check if this symmetric matrix is positive definite we can derive

$x^{T} Hx = \frac{x_{1}^{2}}{{Var}_{p} [X]} + \frac{2 x_{2}^{2}}{{Var}_{p} [X]},$ $x^T H x = \frac {x_1^2}{{\textrm {Var}}_p[X]} + \frac {2 x_2^2}{{\textrm {Var}}_p[X]},$ (C.16)

which is strictly positive for all $x = (x_{1}, x_{2}) \in ℝ^{2} ∖ {0}$ $x = (x_1, x_2) \in {\mathbb {R}}^2 \backslash {\{ }0{\} }$ , and according to the definition this matrix is positive definite.

Hence, the critical point (Equation C.13) corresponds to a local minimum and because it is the unique solution of the system, it is the unique minimizer of the DKL.

Appendix D: Validation of `sbi_lens`'s forward model

Fig. D.1.

Convergence power spectra for different tomographic bin combinations. The solid yellow line shows the measurement from 20 simulated maps using the survey setting described in section 4, while the black dashed line shows the theoretical predictions computed using jax-cosmo. In this figure, the shaded regions represent the standard deviation from 20 independent map realizations.

All Tables

Table 1.

Table summarizing the different neural compression schemes used for weak-lensing applications.

In the text

Table 2.

Prior and fiducial values used for the analyses.

In the text

Table 3.

LSST Y10 source galaxy specifications in our analysis.

In the text

Table 4.

Figure of merit (FoM).

In the text

Table 5.

Summary of the marginalized parameter distributions.

In the text

All Figures

	Fig. 1. Example of convergence maps simulated using the `sbi_lens` package.
In the text

	Fig. 2. Source sample redshift distributions for each tomographic bin for LSST Y10. The number density on the y-axis is shown in arcminutes per square degree.
In the text

Fig. 3.

Constraints on the wCDM parameter space as found in the LSST Y10 survey setup. The constraints are obtained by applying the C_ℓ (blue contours), the full-field explicit inference (yellow contours), and the full-field implicit inference strategy using the VMIM compression (black dashed contours), described in Section 5. The contours show the 68% and the 95% confidence regions. The dashed lines define the true parameter values.

In the text

Fig. 4.

Constraints on the wCDM parameter space as found in the LSST Y10 survey setup. The constraints are obtained from three CNN map compressed statistics: the MSE (yellow contours), the MAE (blue contours), VMIM (black dashed contours), described in Section 5. The same implicit inference procedure is used to get the approximated posterior from these four different compressed data. The contours show the 68% and the 95% confidence regions. The dashed lines define the true parameter values.

In the text

Fig. 5.

Constraints on the wCDM parameter space as found in the LSST Y10 survey setup. We compare the constraints obtained with the GNLL compression (green contours) and MSE compression (yellow contours). These two compressions are supposed to yield the same constraints. The difference is due to optimization issues while training using the GNLL loss function. The same implicit inference procedure is used to get the approximated posterior from these two different compressed data. The contours show the 68% and the 95% confidence regions. The dashed lines define the true parameter values.

In the text

	Fig. D.1. Convergence power spectra for different tomographic bin combinations. The solid yellow line shows the measurement from 20 simulated maps using the survey setting described in section 4, while the black dashed line shows the theoretical predictions computed using jax-cosmo. In this figure, the shaded regions represent the standard deviation from 20 independent map realizations.
In the text

Current usage metrics show cumulative count of Article Views (full-text article views including HTML views, PDF and ePub downloads, according to the available data) and Abstracts Views on Vision4Press platform.

Data correspond to usage on the plateform after 2015. The current usage metrics is available 48-96 hours after online publication and is updated daily on week days.

Initial download of the metrics may take a while.

[1] Ajani, V., Peel, A., Pettorino, V., et al. 2020, Phys. Rev. D, 102, 103531 [Google Scholar]

[2] Ajani, V., Starck, J. -L., & Pettorino, V. 2021, A&A, 645, L11 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]

[3] Akhmetzhanova, A., Mishra-Sharma, S., & Dvorkin, C. 2024, MNRAS, 527, 7459 [Google Scholar]

[4] Alemi, A. A., Fischer, I., Dillon, J. V., & Murphy, K. 2019, ArXiv e-prints [arXiv:1612.00410] [Google Scholar]

[5] Alsing, J., & Wandelt, B. 2018, MNRAS, 476, L60 [NASA ADS] [CrossRef] [Google Scholar]

[6] Alsing, J., Heavens, A., & Jaffe, A. H. 2017, MNRAS, 466, 3272 [Google Scholar]

[7] Barber, D., & Agakov, F. 2003, Advances in Neural Information Processing Systems, 16 [Google Scholar]

[8] Bernardo, J. M., & Smith, A. F. M. 2001, Meas. Sci. Technol., 12, 221 [Google Scholar]

[9] Bingham, E., Chen, J. P., Jankowiak, M., et al. 2019, J. Mach. Learn. Res., 20, 28:1 [Google Scholar]

[10] Böhm, V., Hilbert, S., Greiner, M., & Enßlin, T. A. 2017, Phys. Rev. D, 96, 123510 [CrossRef] [Google Scholar]

[11] Boruah, S. S., Rozo, E., & Fiedorowicz, P. 2022, MNRAS, 516, 4111 [NASA ADS] [CrossRef] [Google Scholar]

[12] Boyle, A., Uhlemann, C., Friedrich, O., et al. 2021, MNRAS, 505, 2886 [NASA ADS] [CrossRef] [Google Scholar]

[13] Campagne, J. -E., Lanusse, F., Zuntz, J., et al. 2023, Open J. Astrophys., 6, 15 [NASA ADS] [CrossRef] [Google Scholar]

[14] Charnock, T., Lavaux, G., & Wandelt, B. D. 2018, Phys. Rev. D, 97, 083004 [NASA ADS] [CrossRef] [Google Scholar]

[15] Cheng, S., & Ménard, B. 2021, MNRAS, 507, 1012 [NASA ADS] [CrossRef] [Google Scholar]

[16] Clerkin, L., Kirk, D., Manera, M., et al. 2017, MNRAS, 466, 1444 [NASA ADS] [CrossRef] [Google Scholar]

[17] Coles, P., & Jones, B. 1991, MNRAS, 248, 1 [NASA ADS] [CrossRef] [Google Scholar]

[18] Cranmer, K., Pavez, J., & Louppe, G. 2015, ArXiv e-prints [arXiv:1506.02169] [Google Scholar]

[19] Cranmer, K., Brehmer, J., & Louppe, G. 2020, Proc. Natl. Acad. Sci., 117, 30055 [Google Scholar]

[20] Dai, B., & Seljak, U. 2024, Proc. Natl. Acad. Sci., 121, e2309624121 [Google Scholar]

[21] Dinh, L., Sohl-Dickstein, J., & Bengio, S. 2017, ArXiv e-prints [arXiv:1605.08803] [Google Scholar]

[22] Elfwing, S., Uchibe, E., & Doya, K. 2017, ArXiv e-prints [arXiv:1702.03118] [Google Scholar]

[23] Fiedorowicz, P., Rozo, E., & Boruah, S. S. 2022a, ArXiv e-prints [arXiv:2210.12280] [Google Scholar]

[24] Fiedorowicz, P., Rozo, E., Boruah, S. S., Chang, C., & Gatti, M. 2022b, MNRAS, 512, 73 [NASA ADS] [CrossRef] [Google Scholar]

[25] Fluri, J., Kacprzak, T., Refregier, A., et al. 2018, Phys. Rev. D, 98, 123518 [NASA ADS] [CrossRef] [Google Scholar]

[26] Fluri, J., Kacprzak, T., Lucchi, A., et al. 2019, Phys. Rev. D, 100, 063514 [Google Scholar]

[27] Fluri, J., Kacprzak, T., Refregier, A., Lucchi, A., & Hofmann, T. 2021, Phys. Rev. D, 104, 123526 [NASA ADS] [CrossRef] [Google Scholar]

[28] Fluri, J., Kacprzak, T., Lucchi, A., et al. 2022, Phys. Rev. D, 105, 083518 [NASA ADS] [CrossRef] [Google Scholar]

[29] Friedrich, O., Gruen, D., DeRose, J., et al. 2018, Phys. Rev. D, 98, 023508 [Google Scholar]

[30] Friedrich, O., Uhlemann, C., Villaescusa-Navarro, F., et al. 2020, MNRAS, 498, 464 [NASA ADS] [CrossRef] [Google Scholar]

[31] Gatti, M., Jain, B., Chang, C., et al. 2022, Phys. Rev. D, 106, 083509 [NASA ADS] [CrossRef] [Google Scholar]

[32] Greenberg, D. S., Nonnenmacher, M., & Macke, J. H. 2019, ArXiv e-prints [arXiv:1905.07488] [Google Scholar]

[33] Gupta, A., Matilla, J. M. Z., Hsu, D., et al. 2018, Phys. Rev. D, 97, 103515 [NASA ADS] [CrossRef] [Google Scholar]

[34] Halder, A., Friedrich, O., Seitz, S., & Varga, T. N. 2021, MNRAS, 506, 2780 [CrossRef] [Google Scholar]

[35] Harnois-Déraps, J., Martinet, N., & Reischke, R. 2022, MNRAS, 509, 3868 [Google Scholar]

[36] He, K., Zhang, X., Ren, S., & Sun, J. 2016, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770 [Google Scholar]

[37] Heavens, A. F., Jimenez, R., & Lahav, O. 2000, MNRAS, 317, 965 [NASA ADS] [CrossRef] [Google Scholar]

[38] Hennigan, T., Cai, T., Norman, T., Martens, L., & Babuschkin, I. 2020, Haiku: Sonnet for JAX, http://github.com/deepmind/dm-haiku [Google Scholar]

[39] Hilbert, S., Hartlap, J., & Schneider, P. 2011, A&A, 536, A85 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]

[40] Hoffman, M. D., & Gelman, A. 2014, J. Mach. Learn. Res., 15, 1593 [Google Scholar]

[41] Ivezić, Ž., Kahn, S. M., Tyson, J. A., et al. 2019, ApJ, 873, 111 [Google Scholar]

[42] Izbicki, R., Lee, A. B., & Schafer, C. M. 2014, ArXiv e-prints [arXiv:1404.7063] [Google Scholar]

[43] Jaynes, E. T. 2003, Probability Theory: The Logic of Science (Cambridge: Cambridge University Press) [CrossRef] [Google Scholar]

[44] Jeffrey, N., Alsing, J., & Lanusse, F. 2021, MNRAS, 501, 954 [Google Scholar]

[45] Jeffrey, N., Whiteway, L., Gatti, M., et al. 2024, MNRAS, 536, 1303 [NASA ADS] [CrossRef] [Google Scholar]

[46] Junzhe Zhou, A., Li, X., Dodelson, S., & Mandelbaum, R. 2023, ArXiv e-prints [arXiv:2312.08934] [Google Scholar]

[47] Kacprzak, T., & Fluri, J. 2022, Phys. Rev. X, 12, 031029 [NASA ADS] [Google Scholar]

[48] Kacprzak, T., Kirk, D., Friedrich, O., et al. 2016, MNRAS, 463, 3653 [Google Scholar]

[49] Klypin, A., Prada, F., Betancort-Rijo, J., & Albareti, F. D. 2018, MNRAS, 481, 4588 [Google Scholar]

[50] Kratochvil, J. M., Lim, E. A., Wang, S., et al. 2012, Phys. Rev. D, 85, 103513 [Google Scholar]

[51] Kullback, S., & Leibler, R. A. 1951, Ann. Math. Stat., 22, 79 [CrossRef] [Google Scholar]

[52] Lanzieri, D., Lanusse, F., Modi, C., et al. 2023, A&A, 679, A61 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]

[53] Laureijs, R., Amiaux, J., Arduini, S., et al. 2011, ArXiv e-prints [arXiv:1110.3193] [Google Scholar]

[54] Lin, C. -A., & Kilbinger, M. 2015, A&A, 583, A70 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]

[55] Liu, J., & Madhavacheril, M. S. 2019, Phys. Rev. D, 99, 083508 [Google Scholar]

[56] Liu, J., Petri, A., Haiman, Z., et al. 2015a, Phys. Rev. D, 91, 063507 [Google Scholar]

[57] Liu, X., Pan, C., Li, R., et al. 2015b, MNRAS, 450, 2888 [Google Scholar]

[58] Lu, T., Haiman, Z., & Zorrilla Matilla, J. M. 2022, MNRAS, 511, 1518 [NASA ADS] [CrossRef] [Google Scholar]

[59] Lu, T., Haiman, Z., & Li, X. 2023, ArXiv e-prints [arXiv:2301.01354] [Google Scholar]

[60] Lueckmann, J. M., Goncalves, P. J., Bassetto, G., et al. 2017, ArXiv e-prints [arXiv:1711.01861] [Google Scholar]

[61] Lueckmann, J. M., Bassetto, G., Karaletsos, T., & Macke, J. H. 2018, ArXiv e-prints [arXiv:1805.09294] [Google Scholar]

[62] Makinen, T. L., Charnock, T., Alsing, J., & Wandelt, B. D. 2021, JCAP, 2021, 049 [CrossRef] [Google Scholar]

[63] Makinen, T. L., Charnock, T., Lemos, P., et al. 2022, Open J. Astrophys., 5, 18 [NASA ADS] [CrossRef] [Google Scholar]

[64] Mandelbaum, R., Eifler, T., Hložek, R., et al. 2018, ArXiv e-prints [arXiv:1809.01669] [Google Scholar]

[65] Martinet, N., Schneider, P., Hildebrandt, H., et al. 2018, MNRAS, 474, 712 [Google Scholar]

[66] Matilla, J. M. Z., Sharma, M., Hsu, D., & Haiman, Z. 2020, Phys. Rev. D, 102, 123506 [Google Scholar]

[67] Papamakarios, G., & Murray, I. 2018, ArXiv e-prints [arXiv:1605.06376] [Google Scholar]

[68] Papamakarios, G., Sterratt, D. C., & Murray, I. 2018, ArXiv e-prints [arXiv:1805.07226] [Google Scholar]

[69] Peel, A., Lin, C. -A., Lanusse, F., et al. 2017, A&A, 599, A79 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]

[70] Petri, A. 2016, Astron. Comput., 17, 73 [NASA ADS] [CrossRef] [Google Scholar]

[71] Petri, A., Haiman, Z., Hui, L., May, M., & Kratochvil, J. M. 2013, Phys. Rev. D, 88, 123002 [Google Scholar]

[72] Phan, D., Pradhan, N., & Jankowiak, M. 2019, ArXiv e-prints [arXiv:1912.11554] [Google Scholar]

[73] Porqueres, N., Heavens, A., Mortlock, D., & Lavaux, G. 2021, MNRAS, 502, 3035 [NASA ADS] [CrossRef] [Google Scholar]

[74] Porqueres, N., Heavens, A., Mortlock, D., Lavaux, G., & Makinen, T. L. 2023, ArXiv e-prints [arXiv:2304.04785] [Google Scholar]

[75] Ribli, D., Ármin Pataki, B., & Csabai, I. 2018, ArXiv e-prints [arXiv:1806.05995] [Google Scholar]

[76] Ribli, D., Pataki, B. Á., Zorrilla Matilla, J. M., et al. 2019, MNRAS, 490, 1843 [CrossRef] [Google Scholar]

[77] Rizzato, M., Benabed, K., Bernardeau, F., & Lacasa, F. 2019, MNRAS, 490, 4688 [NASA ADS] [CrossRef] [Google Scholar]

[78] Semboloni, E., Schrabback, T., van Waerbeke, L., et al. 2011, MNRAS, 410, 143 [Google Scholar]

[79] Shan, H., Liu, X., Hildebrandt, H., et al. 2018, MNRAS, 474, 1116 [Google Scholar]

[80] Sharma, D., Dai, B., & Seljak, U. 2024, ArXiv e-prints [arXiv:2403.03490] [Google Scholar]

[81] Smail, I., Hogg, D. W., Yan, L., & Cohen, J. G. 1995, ApJ, 449, L105 [Google Scholar]

[82] Spergel, D., Gehrels, N., Baltay, C., et al. 2015, ArXiv e-prints [arXiv:1503.03757] [Google Scholar]

[83] Takada, M., & Jain, B. 2004, MNRAS, 348, 897 [Google Scholar]

[84] Thomas, O., Dutta, R., Corander, J., Kaski, S., & Gutmann, M. U. 2016, ArXiv e-prints [arXiv:1611.10242] [Google Scholar]

[85] Tishby, N., Pereira, F. C., & Bialek, W. 2000, ArXiv e-prints [arXiv:physics/0004057] [Google Scholar]

[86] Uhlemann, C., Friedrich, O., Villaescusa-Navarro, F., Banerjee, A., & Codis, S. 2020, MNRAS, 495, 4006 [NASA ADS] [CrossRef] [Google Scholar]

[87] Xavier, H. S., Abdalla, F. B., & Joachimi, B. 2016, MNRAS, 459, 3693 [NASA ADS] [CrossRef] [Google Scholar]

[88] Zeghal, J., Lanzieri, D., Lanusse, F., et al. 2024, A&A, submitted [arXiv:2409.17975] [Google Scholar]

[89] Zhang, Z., Chang, C., Larsen, P., et al. 2022, MNRAS, 514, 2181 [NASA ADS] [CrossRef] [Google Scholar]

[90] Zürcher, D., Fluri, J., Sgier, R., et al. 2022, MNRAS, 511, 2075 [CrossRef] [Google Scholar]

Optimal neural summarization for full-field weak lensing cosmological implicit inference

1. Introduction

2. Motivation

3. Review of common compression loss functions

4. The sbi_lens framework

4.1. Lognormal modeling

4.2. Data generation

4.3. Noise and survey setting

5. Experiment

5.1. Explicit inference

5.1.1. Full-field inference with BHMs

5.1.2. Power spectrum

5.2. Implicit inference

5.2.1. Compression strategy

5.2.2. Inference strategy

6. Results

6.1. Result estimators

6.2. Power spectrum and full-field statistics

6.3. Optimal compression strategy

7. Conclusion

Data availability

Acknowledgments

References

Appendix A: Mean squared error

Appendix B: Mean absolute error

Appendix C: Gaussian negative log-likelihood

Appendix D: Validation of sbi_lens's forward model

All Tables

All Figures

4. The `sbi_lens` framework

Appendix D: Validation of `sbi_lens`'s forward model