Probabilistic Mass Mapping with Neural Score Estimation

Weak lensing mass-mapping is a useful tool to access the full distribution of dark matter on the sky, but because of intrinsic galaxy ellipticies and finite fields/missing data, the recovery of dark matter maps constitutes a challenging ill-posed inverse problem. We introduce a novel methodology allowing for efficient sampling of the high-dimensional Bayesian posterior of the weak lensing mass-mapping problem, and relying on simulations for defining a fully non-Gaussian prior. We aim to demonstrate the accuracy of the method on simulations, and then proceed to applying it to the mass reconstruction of the HST/ACS COSMOS field. The proposed methodology combines elements of Bayesian statistics, analytic theory, and a recent class of Deep Generative Models based on Neural Score Matching. This approach allows us to do the following: 1) Make full use of analytic cosmological theory to constrain the 2pt statistics of the solution. 2) Learn from cosmological simulations any differences between this analytic prior and full simulations. 3) Obtain samples from the full Bayesian posterior of the problem for robust Uncertainty Quantification. We demonstrate the method on the $\kappa$TNG simulations and find that the posterior mean significantly outperfoms previous methods (Kaiser-Squires, Wiener filter, Sparsity priors) both on root-mean-square error and in terms of the Pearson correlation. We further illustrate the interpretability of the recovered posterior by establishing a close correlation between posterior convergence values and SNR of clusters artificially introduced into a field. Finally, we apply the method to the reconstruction of the HST/ACS COSMOS field and yield the highest quality convergence map of this field to date.


Introduction
The weak gravitational lensing effect provides a direct probe of the large scale matter distribution in the Universe.This lensing effect generates minute deformations of the apparent shapes of distant galaxies, a so-called shear, in the presence of massive structures along the line of sight.Due to its ability to directly probe the matter field, weak lensing is found at the heart of present and upcoming wide-field optical galaxy surveys including the ESA Euclid mission (Laureijs et al. 2011), the Vera C. Rubin Observatory Legacy Survey of Space and Time (Ivezić et al. 2019), and the Roman Space Telescope (Spergel et al. 2015).
While most of the cosmological analysis of weak lensing focuses on 2pt functions, the reconstruction of maps of the matter distribution opens up alternative ways to analyse data, including giving access to higher order statistics, such as the peak count statistic which has been applied to most existing weak lensing Contact: benjamin.remy@cea.frsurveys (Liu et al. 2015a,b;Kacprzak et al. 2016;Shan et al. 2018;Martinet et al. 2018;Harnois-Déraps et al. 2020).In addition, novel higher-order statistics such as the wavelet peak counts and 1 -norm (Ajani et al. 2021), the scattering transform statistics (Cheng et al. 2020), or neural summaries (e.g.Ribli et al. 2019;Jeffrey et al. 2020a) have recently been shown to be even more sensitive to cosmology.
This process of reconstructing maps of the matter distribution from measured galaxy ellipticities is known as weak lensing mass-mapping.Because of noise and missing data, the weak lensing mass-mapping problem is ill-posed; in other words, the matter density field is not uniquely determined by the observed shear.This implies that any mass-mapping method will rely on different prior assumptions to regularize this problem and yield a different map estimate.The most standard approach is the Kaiser-Squires method (Kaiser & Squires 1993) which is based on a direct inversion of the lensing operator along with some amount of Gaussian smoothing.A number of other methods have since been proposed, using various types of regularisation schemes such as maximum entropy (Marshall 2001), Gaussian prior (Wiener filter) (Simon et al. 2012;Horowitz et al. 2018), Sparsity (Starck et al. 2006;Leonard et al. 2012;Lanusse et al. 2016;Price et al. 2018;Starck et al. 2021).These techniques usually yield a point estimate of a convergence map, and usually lack proper uncertainty quantification, which makes interpreting the resulting maps difficult.
A number of Bayesian methods have been proposed to recover a posterior probability estimate for the unknown mass map, but these are usually limited by restrictive prior assumptions.For instance, using hierarchical Bayesian modeling Alsing et al. (2015) proposed an approach for sampling mass-maps, but which relied on a Gaussian prior for the unknown map.More recently, Porqueres et al. (2021) extended this work and demonstrated mass-map posterior sampling with a non-linear prior defined by a forward gravity model, but limited to very low angular resolution.Other Bayesian posterior sampling approaches include Schneider et al. (2016) which relied on a Gaussian Process prior, Price et al. (2018) which proposed a proximal MCMC approach to accommodate non-differentiable sparsity priors, and Fiedorowicz et al. (2021) which relies on a log-normal prior.
More recently, with the rise of deep learning, a number of methods relying on deep neural networks have been proposed to address the mass-mapping problem.The strength of these approaches is that they provide a practical way to leverage simulations as a prior to solve the mass-mapping problem.In particular, the DeepMass method (Jeffrey et al. 2020b) uses a U-net (Ronneberger et al. 2015) to recover an estimate of the mean posterior convergence map, with a prior defined by a set of simulations.Besides, Shirasaki et al. (2021) have proposed a model based on a Generative Adversarial Network (GAN) (Goodfellow et al. 2014) which is able to to denoise weak lensing mass maps.
In this paper, we propose a new approach to mass-mapping that combines elements of deep learning with Bayesian inference and provides a tractable way to sample from the full highdimensional posterior distribution of convergence maps.
The limiting factor in all Bayesian approaches mentioned above are the simplifications needed to achieve a tractable prior (which leads to Gaussian, log-Normal, sparse, or simplified hierarchical priors).These approximations are needed because the distribution of mass-maps is not tractable analytically, and only accessible as an implicit distribution, i.e. a distribution without an explicit likelihood, and that can only be sampled from.Sampling from such an implicit simulation is otherwise known as running a simulator.
Our approach DeepPosterior is based on learning a prior from samples drawn from such an implicit distribution (i.e. from simulated convergence maps), and use this prior for sampling the full Bayesian posterior.
Instead of learning a full probability density function p(x) over mass-maps with likelihood-based Deep Generative Models such Normalizing Flows (Rezende & Mohamed 2015) or Pixel-CNNs (Oord et al. 2016;Salimans et al. 2017), following recent developments in the field of Diffusion-based models, we target instead the score function ∂ log p(x)   ∂x .This score function is estimated using the Denoising Score Matching (DSM) technique (Vincent 2011), which relies on training a neural network under a simple denoising task, and yet leads asymptotically to an unbiased estimate of the score function.With this Neural Score Estimation in hand, we then demonstrate that one can sample from the posterior of the mass-mapping problem using an efficient gradient-based sampling techniques: Annealed Hamiltonian Monte-Carlo.Contrary to most similar Deep Learning ap-proaches to solve inverse problems, this method is stable and scalable, we achieve one independent sample from the massmapping posterior for maps of size 360 × 360 pixels in 10 GPUminutes.
We demonstrate our proposed methodology on simulations, using the high resolution κTNG convergence maps (Osato et al. 2021), based on the IllustrisTNG simulations (Nelson et al. 2019;Pillepich et al. 2018;Nelson et al. 2018;Springel et al. 2018;Naiman et al. 2018;Marinacci et al. 2018).We compare the method against other standard methods (Kaiser-Squires, DeepMass, GLIMPSE (Lanusse et al. 2016), MCALens (Starck et al. 2021)) and find and improvement in terms of the pixel reconstruction error, the Pearson correlation and the convergence power spectrum.We also investigate the interpretation of the recovered posterior and demonstrate a direct correlation between signal to noise of structures like galaxy clusters and the posterior distribution.
Following these validation tests on simulations, we apply the method to the reconstruction of the HST/COSMOS field (Scoville et al. 2007) based on the shape catalog from Schrabback et al. (2010).We obtain the highest quality mass map of the HST/COSMOS field, alongside uncertainty quantification.Our result improves over the previously published COSMOS map Massey et al. (2007) both from using a more recent shape catalog, and from our much improved methodology, revealing much finer structures and providing uncertainty quantification.
The paper is organized as follows: After providing some general background on gravitational lensing in section 2, and describing a unified view of mass-mapping and related works in section 3, we introduce in section 5 the methodology.We first describe how to build a prior from cosmological simulations and how to sample from the posterior distribution.Then in section 6, we describe the simulations we used to train our model.In section 7 we validate our method, showing improvement in point estimate reconstruction against other methods, and present a detection experiment using the posterior samples.Finally, in section 8, we apply our method a real data, reconstructing a very high quality map of the HST/ACS COSMOS field.

Weak Gravitational Lensing Formalism
In this section, we present an overview of weak gravitational lensing and mass-mapping.
Observed galaxy shapes are affected by the gravitational shearing effect that occurs in the presence of massive structures, acting as lenses, along the line of sight.This distortion can be described by a coordinate transformation (Bartelmann 2010) between unlensed coordinates β and observed image coordinates θ where ψ is known as the lensing potential, and is sourced by the projected matter density on the sky.
To first order, the resulting distortions affecting galaxy images can be described in terms of a simple linear Jacobian matrix A, known as the amplification matrix : In this expression, which only holds in the weak lensing regime (i.e.κ 1), the convergence κ, translates into an isotropic dilation of the source, while the shear γ causes anisotropic stretching of the image.
This convergence κ can be directly related to the projected mass density on the sky, which leads to typically using the denomination convergence map or mass-map interchangeably.Indeed, considering a sample of lensing sources distributed in redshift according to some distribution n(z), one can relate the convergence κ to the 3D matter overdensity δ according to: where H 0 is the Hubble constant, c the speed of light, q(χ) χ ) is the lensing efficiency, f K is the comoving angular distance, δ the over density, a is the scale factor and Ω m is the matter density (Kilbinger 2015).
While shear can be measured by the spatially coherent correlations it induces on galaxy shapes, convergence is typically not directly observable, as its magnification effect is much more difficult to disentangle from intrinsic galaxy sizes.Therefore, the problem of mass-mapping is generally to recover an estimate of the convergence κ from measurements of the shear γ.This is made possible by the following equations that tie convergence and shear to the lensing potential: Combining these equations, one can recover a minimum variance estimator for the convergence as: which constitutes the basis for the Kaiser-Squires reconstruction technique.This equation can be solved in practice most efficiently using a Fourier transform in the flat sky limit, or spherical harmonics transform in the spherical setting.The Fourier solution of the Kaiser-Squires estimator can be written as: where Note that the solution is not defined for k = 0, which means that the mean of the convergence field cannot be directly constrained from shear, which is usually known as the mass-sheet degeneracy.
One particularly remarkable property of the Kaiser-Squires estimator ( 6) is that it defines a unitary operation.In other words, let us introduce the linear operator P as: where κ = κE + iκ B and γ = γ1 + iγ 2 are complex representations of the convergence E and B modes, and of the two shear components in Fourier space.Then the operator P verifies P † P = I d .

A Unified View of Mass-Mapping
This mass-mapping problem can be reformulated as a probabilistic inference problem from a Bayesian perspective: where the posterior distribution p(κ|γ) models the probability of the signals κ conditioned on the observations γ, the likelihood distribution p(γ|κ) encodes the forward process of the model in equation Equation 7, the prior distribution p(κ) encodes the knowledge about the signal κ, and the Bayesian evidence p(γ) is the marginal density of the observations.The evidence is a constant if we assume a given model, and will be ignored in the rest of this work as we do not consider Bayesian model comparison.
In our work, the forward process encoded in Equation 7 returns a binned shear map γ, where each pixel value corresponds to the average shear in the pixel area and therefore takes as input a pixelized convergence map κ.Working with real data, it is assumed that the measurement for each pixel is degraded by shape noise n s due to the finite average of galaxy intrinsic ellipticities in the bin.This noise is assumed to be white Gaussian, i.e. n s ∼ N(0, Σ n ), where Σ n is the noise covariance matrix of the shear map.Moreover, because of missing data due to survey measurement masks, we need to explicitly consider that there is no measurement in some pixel regions.In practice, we set a very high variance, such as 10 10 , for these pixels in the covariance matrix Σ n .More specific information on the strategy we follow to emulate the COSMOS shape catalog can be found in subsection 6.3.
Thus, the log-likelihood takes the following form: where F and F * are respectively direct and inverse Fourier transform.
All existing mass-mapping techniques can be understood under the lens of this Bayesian formulation and will generally mostly differ in their choice of prior, and in the specific algorithm used to recover a point estimate of the convergence map.As in practice this problem is ill-posed due to noise corruption and missing data in Equation 7, the posterior p(κ|γ) can be both wide and heavily prior dependent, which explains why all these different techniques yield different answers.Below we describe several methods we use in this paper for comparison.

Kaiser-Squires Reconstruction
The Kaiser-Squires method (Kaiser & Squires 1993) can be seen as a simple maximum likelihood estimate (MLE) of the convergence map, typically followed by a certain amount of Gaussian smoothing.
where s is a Gaussian smoothing kernel of a given scale, and P † is a pseudo-inverse of the operator P typically achieved by direct Fourier inversion.While this method is the fastest, it does not take into account masks and leads to leakage between E and B modes of the convergence field.For Kaiser-Squires, the heteroscedasticity does not impact the solution, whereas it can do for certain extensions such as the Generalized Kaiser-Squires method (GKS) Starck et al. (2021) (Appendix B.1), in which an iterative approach with little regularization take the mask and noise heterescedasticity into account.

Wiener Filter
The Wiener filter approach assumes a Gaussian random field prior on κ and takes advantage of the fact that the power spectrum of the convergence can be analytically predicted from cosmological models, and accurately describes the field on large scales.This prior on the convergence can be expressed as a Gaussian distribution with a diagonal covariance matrix S in Fourier space: where S is the convergence power spectrum.The solution of the inverse problem can be formulated as: This Wiener solution corresponds to the maximum a posteriori (MAP) solution under this Gaussian prior, and also matches the mean of the Gaussian posterior.An appealing property of this estimator is that the solution can be easily recovered analytically in the case where the noise is homoscedastic, as both signal and noise covariance matrices become diagonal in Fourier space.The Wiener Filter reconstruction (Lahav et al. 1994;Zaroubi et al. 1995) is given in Fourier space by: where S and N are respectively the signal and noise covariance matrix in Fourier space.
In more complex cases, where the noise covariance is not diagonal in Fourier space (for instance because of a mask in pixel space), the solution can still be recovered efficiently by optimization, using the proximal method (Bobin et al. 2012;Starck et al. 2021), or its related messenger field alternative (Elsner & Wandelt 2013).One can also draw samples from the Wiener posterior with the messenger field algorithm (Jeffrey et al. 2018).

Sparse priors
Convergence maps contain non-Gaussian features, that are not well recovered with the methods described above.Several massmapping algorithms have been proposed, relying on a wavelet sparsity prior (Starck et al. 2006;Leonard et al. 2012;Lanusse et al. 2016;Price et al. 2018;Starck et al. 2021), which can be formulated as: with p < 2, and where Φ is a wavelet dictionary and .p is a sparsity promoting p norm.The convergence map is the solution of the following sparse recovery optimization problem: where λ is the regularisation parameter, weighting the sparse regularisation constraint.The GLIMPSE method (Lanusse et al. 2016), allows in addition to take into account masks, nonuniform noise, flexion data if they are available, and also does not require to transform the shear catalog on pixelized map.

DeepMass
While all the priors described above have closed-form expressions, it is also possible to design a mass-mapping method where the prior is defined implicitly.The first dark matter map reconstruction from weak lensing observational data using deep learning was shown in Jeffrey et al. (2020b).DeepMass is a Convolutional Neural Network (CNN) trained on pairs of simulated pixelized shear and convergence maps.One can show that under some assumptions, a deep learning model can estimate the mean of the posterior distribution p(κ|γ).In a nutshell, the network needs to be trained to minimize the the mean-squared-error (MSE) of the output convergence κ and the training convergence and shear maps must be drawn respectrively from the prior p(κ) and likelihood distribution p(γ|κ).
While the Wiener Filter assumes a Gaussian prior over the convergence, simulated training data for DeepMass are drawn from the "true" prior p(κ) and thus improve the accuracy of the reconstruction.DeepMass is therefore able to recover the nonlinear structures of the convergence better than the Wiener filter and to reduce the MSE of the reconstruction.
Even if DeepMass reconstructs high quality convergence maps, DeepMass alone only provides the mean posterior and cannot quantify the uncertainties of the reconstruction.Furthermore, as any direct inversion method based on neural networks, the likelihood Equation 9 is learned implicitly by the model, and does not explicitly constrains the solution at inference time.
Meaning that although we have not found in our experiments obvious failures, the CNN may in theory fail in ways that would lead to a map not actually consistent with observations, for instance creating spurious artifacts, or missing structures present in the true map.Another side effect of implicitly learning the likelihood during training is that the model is trained for a specific survey configuration, and retraining is required if either the mask or the noise is different.
In this work, we propose a new approach which estimates the full posterior distribution p(κ|γ), being able to not only recover the posterior mean, but also to quantify the uncertainties of the reconstruction.In addition, in our method the likelihood is explicit, meaning that it does not require a retraining for a new survey configuration.

Primer on Neural Score Estimation and Sampling
In this section, we review the technical aspects of the machine learning methodology we will employ in the mass-mapping problem.We begin by detailing how the score function can be estimated for an implicit distribution.We then describe our strategy to sample any distribution from the knowledge of its score.
All of these methods aim at modeling the probability distribution underlying some data, but not in the same way.On one hand, GANs or VAEs learn implicitly the probability distribution, which do not fit into our framework because we want to leverage the closed-form expression of the likelihood distribution.On the other hand, Normalizing Flows and EBMs can learn explicit forms of probability distribution, but do not scale well in dimension.Therefore we resort to a recent and promising class of generative models introduced in Song & Ermon (2019) based on learning an explicit model of the gradient log-probability distribution, also known as the score function, which is all we need in our framework to build the posterior distribution as described in subsection 5.3.

Denoising Score Matching
As originally identified by Vincent (2011) and Alain & Bengio (2013), the gradient w.r.t. the data of a log distribution, which is called the score function, can be modeled using a Denoising Auto-Encoder (DAE).This method is also known as Denoising Score Matching (DSM).
Let us introduce an auto-encoding function where N is the signal dimension, trained to reconstruct a true signal x following the probability distribution p, given a noisy version x = x + n, with n ∼ N(0, σ 2 I N ), under an 2 loss.In this case, the function r θ is called a denoiser, parametrized by a noise level σ and is trained minimizing the following criterion: where p σ 2 = p * N(0, σ 2 ), i.e. corresponds to the data distribution we want to model, convolved by a multivariate Gaussian with a diagonal covariance matrix σ 2 I N .Alain & Bengio (2013) showed that in this setting, an optimal denoiser r would then be achieved for: Figure 1 illustrates how one can denoise a 2-dimensional point knowing the noise level and the score function.We can clearly see that the score multiplied by the variance gives the right shift between the noisy point (in red) and the high density region of the distribution.Applying this shift to the noisy sample gives the projected sample in orange.
Rewriting this equation gives an estimator of the score function: The optimal denoiser is thus related to the score we wish to learn, and exactly matches the score of the probability distribution underlying data when the noise variance σ 2 goes to zero.Following the loss function design in Lim et al. (2020), the optimization objective to train the denoiser is: where u is sampled from a standard multivariate Gaussian, r θ is a denoiser parametrized by neural network weights θ, optimized to estimate r * .The modifications between Equation 17and Equation 20 resolve some numerical instabilities and reduce the error of approximation of the precedent objective.In particular, the learned denoiser r θ now directly models the score function, without the need to divide by the noise level anymore, i.e. r θ (x, σ 2 ) = ∇ x log p σ 2 (x), as σ 2 → 0. Moreover rescaling by σ and decoupling the noise level from an isotropic Gaussian noise prevents the gradient from vanishing.
To summarize, DSM provides a tractable way of estimating a score function, from only having access to samples from an implicit distribution.We will now be able to use this score estimate for sampling from said distribution, as detailed in the next section.

Score-based sampling
Given the score function ∇ log p(x) (either directly available, or learned by DSM), it is possible to sample from the distribution p(x) by Markov Chain Monte Carlo (MCMC).Indeed, the Langevin Dynamics (LD) or Hamiltonian Monte Carlo (HMC) updates, described in Neal (2011) and Betancourt (2017), only require the evaluation of the score function ∇ log p(x).For instance, the HMC update, based on the leapfrog integrator, requires the evaluation of the score function only: where α is the step size, m is the auxiliary momentum and M is a preconditioning matrix that could take into account the space metric, but in our case the identity matrix.Following this procedure, HMC is supposed to sample from p(x), but as explained in Betancourt (2017), the discretization induces a small error that will bias the resulting transition and requires a correction.In order to correct this bias, every sample is considered as a Metropolis-Hastings (MH) (Metropolis et al. 1953;Hastings 1970) proposal and is accepted or rejected according to an acceptance probability.This acceptance probability is designed from the Hamiltonian transition and is also only score-dependent as we show in Appendix A.
However, in most, if not all but the most trivial, cases sampling in high-dimension by Langevin or Hamiltonian dynamics is made very difficult by the fact that the distribution manifold is never Euclidean, meaning that assuming a diagonal noise covariance for LD, or diagonal momentum matrix for HMC, leads to extremely inefficient sampling.
To take a concrete example, let us consider the case of distribution of handwritten digits in the MNIST dataset, and imagine a Langevin Dynamics chain exploring this distribution.To transition from a 1 to a 7 for instance, the chain running in pixel space will try to add some white Gaussian noise at each update.However, MNIST digits are binary, so any addition of noise is bound to kick the chain out of the data distribution, and be rejected in a Metropolis-Hastings step.Only the smallest step sizes (which also tunes the amount of noise applied on the chain) will have a non-zero chance of acceptance, meaning the chain is practically never moving.
To summarize, having access to the score function is all we need to sample from a distribution, but this remains difficult for non-trivial high dimensional distributions.We describe in the next subsection a strategy to circumvent this issue.

Efficient sampling in high-dimensions with annealing
Various approaches have been suggested to increase the sampling efficiency of HMC and LD for complex distributions, which generally aim at reframing the chain in a space closer to the intrinsic manifold of the distribution rather than pixel space.For instance, Girolami & Calderhead (2011) exploits the Riemannian geometry of parameters space to define a Metropolis-Hastings (Metropolis et al. 1953;Hastings 1970) proposal, enabling high efficiency sampling in high dimensions.
In our work, we follow another direction which recently gained momentum from the literature on Denoising Diffusion Models (Sohl-Dickstein et al. 2015;Song & Ermon 2019;Ho et al. 2020).Instead of trying to follow the non-Euclidean latent manifold of the distribution, why not transforming this distribution so that it becomes easier to travel in pixel-space?This can be done very simply by convolving the data distribution with noise.
To go back to our MNIST thought experiment, if we convolve our binary handwritten digits with Gaussian noise, now it becomes very easy for a LD chain to move in pixel space, as the noise added by the chain can be made to match that present in the distribution.The higher the noise, the largest pixel-wise transition will be possible, and the faster the chain can transition from one digit to the other.Another view of this effect is that given enough noise, all of the distinct modes of a distribution will begin to merge with each other.Figure 2 illustrates the same two-moons distribution convolved with varying amount of Gaussian noise.At high noise values (top left corner), the distribution becomes close to a diagonal Gaussian and thus extremely easy to sample to explore all possible regions of the distribution.
We will call temperature T the variance σ 2 of this Gaussian kernel convolved with the target distribution.
Following our preliminary work (Remy et al. 2020), we adopt in this paper a two-step sampling procedure, based on an annealed version the Hamiltonian Monte Carlo (HMC) (Neal 2011;Betancourt 2017), similar to the annealed Langevin Dynamics proposed by Song & Ermon (2019).
In the first step of our sampling procedure, we initialize a chain using white Gaussian noise with a high temperature σ 2 T .Leveraging the conditional noise property of the score function described in section 4.1 we have direct access to the score function of the convolved distribution ∇ log p σ 2 T , which we can use in a score-based HMC.We then let the chain evolve under Hamiltonian dynamics, and progressively lower the temperature σ 2 of the conditional score.Each time the temperature σ 2 is decreased, the MCMC chain thermalizes to the new temperature in a few HMC steps.As described in Song et al. (2019), small enough steps are important to ensure non-zero transition probabilities between adjacent temperature values.Once the chain has reached sufficiently low temperature (ideally σ 2 = 0) we stop the chain and only retrieve the last sample.Multiple independent samples are obtained by running multiple independent annealed HMC chains in parallel.One must not think that having one chain per sample makes the process much longer than having multiple sample per chain.It is indeed much more efficient, because in practice it is very long to ensure sample independence within the same chain.
In practice however, it has proven to be difficult to anneal chains all the way to zero temperature.We find that at low temperatures, the chains do not properly thermalize at each step and 1 2 3 4 Fig. 2. Example of annealing on the two moons distribution.The distribution is convolved with a multivariate Gaussian of variance σ.This variance is decreasing over the sampling procedure so that the MCMC is always over high density regions, i.e. σ 4 > σ 3 > σ 2 > σ 1 .The chain is initialized from a wide multivariate Gaussian, as on the top left panel, and converges to the data distribution, as on the bottom right panel.In blue is the density of the convolved distribution and the arrows represent its score function.
residual noise remains in our samples.This was also observed by other authors using annealed Langevin Dynamics instead of HMC (Jolicoeur-Martineau et al. 2020;Song & Ermon 2019).In the application in this paper, we can reach σ = 10 −3 by finetuning our annealing scheme.
Therefore, in a second step of our sampling procedure, we propose a strategy to transport the annealed HMC samples obtained in step 1 at finite temperature all the way to σ 2 = 0. To achieve this, we use remarkable results from Song et al. (2021) which establish a parallel between denoising diffusion models (generative models based on random walks) and Stochastic Differential Equations.We refer the interested reader to Appendix B for more mathematical details and derivations, but the main result from that paper that we use here is the following Ordinary Differential Equation: where σ 2 (t) is the increasing variance schedule of the Gaussian distribution we convolve the data distribution with.In our particular implementation, we used a linear schedule t → σ 2 (t) = t.This ODE describes a deterministic process {x(t)} T t=0 indexed by a continuous time variable t ∈ [0, T ], such that x(0) ∼ p 0 and x(T ) ∼ p T , where p 0 denotes the data distribution and p T the convolution between the data distribution and a multivariate Gaussian of variance T .This ODE can be solved by any blackbox ODE solver provided that the convolved score is available.In particular, it means that if we start the ODE at a point in p t , where t is the intermediate temperature at which we have stopped our HMC chains, we can transport that point to p 0 .This proce-dure very effectively removes residual noise while making sure we reach a point in the target distribution at zero temperature.
We summarize our full sampling procedure as follows: 1. Initialize with white Gaussian noise x init ∼ N(0, T • I d ) 2. Sample the posterior distribution with the annealed HMC algorithm which requires at each step to: (a) evaluate the convolved score ∇ log p σ (γ|κ) using Equation 19, (b) compute the MH proposal using Equation 21 and accept it according to Equation A.1, (c) anneal if there is enough samples for a given temperatures.3. Project the last sample on the posterior distribution with the ODE described in Equation 22.

Mass-Mapping by Neural Score Estimation
Having laid out the machine learning notions needed to implement our method, we now describe how to apply these techniques to the weak lensing mass-mapping problem.Our strategy will be to use Denoising Score Matching to learn a high fidelity prior over convergence maps, and use this prior in combination with the analytic likelihood function Equation 9to access the full posterior of the mass-mapping problem by annealed HMC sampling.

Hybrid analytic and generative modeling of the prior
One of the main limiting factors of previous mass-mapping approaches is the limited complexity of the prior used in the inversion problem (e.g.Gaussian, or sparse).Our first step towards our mass-mapping technique is to build a prior that can fully capture the non-Gaussian nature of convergence maps.To achieve this goal, we propose an hybrid prior relying on a Gaussian analytic model, complemented by a Denoising Score Matching approach to learn on simulations the delta between this Gaussian prior and fully non-Gaussian convergence map distribution, accessible through simulations.
We indeed know that analytic cosmological models provide a reliable model of the 2pt functions of the convergence field, and can thus be used to define a Gaussian prior which enjoys a closed-form expression (Equation 12), as discussed in subsection 3.2 in the context of the Wiener filter.Moreover, because of its tractable expression we directly have access to its score function ∇ log p Gaussian (κ).This Gaussian prior is particularly adapted on large scales, where the matter distribution is well modelled by a Gaussian Random Field, but does not capture the significantly non-Gaussian nature of the convergence field on small scales.
On the other hand, we do have access to physical models that can capture the full statistics of the convergence field, in the form of numerical simulations.In this case however, the simulator gives us access to an implicit distribution i.e. we can only draw samples from the distribution, but we do not have access to the likelihood of a given sample under the model.We cannot therefore directly use black-box simulators as priors for sampling the Bayesian posterior.This is where we can leverage deep generative models, to turn samples from an implicit distribution into a model with a tractable likelihood that can be used for inference.
In this work, we propose to use Denoising Score Matching for this task, but we also want to capitalize as much as possible on the analytic Gaussian prior.Consequently, we aim to use the neural network to model the higher order moments of the convergence field that the Gaussian prior cannot capture.In a DSM framework, it is straightforward to implement such a network in practice by feeding the neural network denoiser with (1) the noisy input image (2) the noise level and (3) the Gaussian score.The full prior p(κ) is then computed as: where r θ is the DAE trained to model only the residuals from the Gaussian prior.During learning, a first step of denoising is performed using ∇ log p Gaussian (κ), then the neural network r θ is optimized to denoise from this Gaussian guess, hence referred as the residual score.
Just like for conventional Denoising Score Matching, we can train the model on simulations by adapting Equation 20to Residual Denoising Score Matching: where p G is the Gaussian prior.
Figure 3 provides an illustration of this hybrid prior.From the noisy input map (b), we first compute the Gaussian score function.Panel (c) shows the Gaussian-denoiser estimate computed with Equation 18 from the Gaussian score function.Then, feeding the DAE with the noisy input, the input noise level and the Gaussian score function, we get the full score function.Panel (d) shows the complete denoising estimate computed with Equation 18 again from the score function.Figure 3 exemplifies the ability of the neural network to capture complementary scales to the Gaussian prior.
Building the prior with this hybrid design reduces the complexity of the modelling problem, and limits the reach of the neural network model by only modeling a correction to the analytical prior.

Neural Score Estimator architecture for mass-mapping
The Denoising Score Matching approach described in subsection 4.1 was so far completely generic and agnostic of the actual implementation of the parametric denoiser r θ .We describe here the concrete neural network architecture we will use throughout this work, fine-tuned to image data of size 360×360 pixels.
Following Equation 19, we need to train a DAE in order to model the prior score function.We used a U-net, inspired from Ronneberger et al. (2015), with ResNet building blocks of convolutions from He et al. (2016) followed by Batch Normalization.We also used Spectral Normalization, using power iteration method on the neural network weights Gouk et al. (2020), to improve the regularity of the network in out-of-distribution regions, where the score is not constrained by the data.Indeed, as discussed in Appendix C, regularizing the spectral norm lowers the Lipschitz constant of the network and thus enforces the score field to be aligned for close inputs.Note that the network is noise conditional, which means that it takes as inputs both the image to denoise and the noise level.The noise level was drawn from a normal distribution of standard deviation σ = 0.2.Our U-net denoiser was trained using the Adam optimizer (Kingma & Welling 2014) starting with a learning rate of 10 −4 decreasing

Sampling from the posterior
The prior learned with DSM is naturally defined as a convolved version of the data distribution.Besides, in order to use annealing on the posterior distribution, it is necessary to convolve also the likelihood.Under the assumption that the likelihood is Gaussian, convolving with a multivariate Gaussian on x gives the following expression: where σ 2 L is the variance of the convolved likelihood, σ 2 n is the noise variance of the measurement and σ 2 is the variance of the convolved Gaussian, also called the temperature.We provide a demonstration of Equation 25 in Appendix D.
Note that the MCMC algorithm does not navigate the posterior distribution convolved at temperature σ 2 at the same rate than the likelihood or the prior.Indeed, convolving the product of two distributions is not equivalent to convolving each distribution independently.However, we found empirically that the annealed HMC converges towards the posterior distribution p(x|y) at zero temperature.We first demonstrate it with a Gaussian posterior in subsection 7.1 and then with the full posterior distribution in subsubsection 7.2.1, subsubsection 7.2.2 and subsection 7.3.This way, the complete sampling procedure consists of running a MCMC on a proxy of the posterior distribution, that is gradually annealed to low temperature, with chains progressively moving toward the posterior distribution.Each chain is initialized by sampling a multivariate normal distribution of unit variance and we leverage the usage of GPUs by running several chains in batches.

Simulations and data
In this section we describe the simulated data we used to validate our method, designed to emulate the COSMOS field on which we aim to apply our algorithm.2010).The redshifts for bright galaxies (i + < 25) are derived from the COSMOS-30 catalog (Ilbert et al. 2009), while the n(z) distribution for faint galaxies (i + > 25) is based on an extrapolation to fainter magnitudes as described in Schrabback et al. (2010).ƭ

COSMOS shear catalog
The COSMOS Survey (Scoville et al. 2007) is a contiguous 1.64 deg 2 field imaged with the Advanced Camera for Surveys (ACS) in the F814W band.In this paper, we use the shape catalog obtained in Schrabback et al. (2010) (hereafter S10) by reduction of this survey using the KSB+ method (Schrabback et al. 2007;Erben et al. 2001;Kaiser et al. 1995;Hoekstra et al. 1998), applying modelling for the variable HST/ACS point-spread function using principal component analysis and a correction for multiplicative shear measurement bias, which was calibrated as a function of the galaxy signal-to-noise ratio using image simulations.
The S10 catalog is divided into a bright i + < 25 (Subaru SExtractor MAG_AUTO magnitude) and a faint i + > 25 galaxy sample.For the bright sample, individual high quality photometric redshifts are available by cross-matching against the COSMOS-30 catalog (Ilbert et al. 2009) In this analysis, we combine both bright and faint galaxy samples into a single shape catalog, and will assume the combined redshift distribution illustrated on Figure 4.The only cut we apply is to reject galaxies in the bright sample with z phot < 0.6 and i + > 24.The photometric redshift cut is motivated by S10 finding indications that many of these galaxies are truly at high redshifts.Thus, their inclusion would imply that the used estimate of the redshift distribution is inaccurate.This yields a total of 417117 galaxies, which translates into a mean number of galaxies 64.2 per arcmin 2 6.2.κTNG simulations After having described the COSMOS field, we now present the κTNG suite of simulations we used to create mock data.We use the κTNG simulated convergence maps (Osato et al. 2021), generated from the IllustrisTNG (TNG) hydrodynamic simulations (Nelson et al. 2019;Pillepich et al. 2018;Nelson et al. 2018;Springel et al. 2018;Naiman et al. 2018;Marinacci et al. 2018).The TNG simulations are a set of cosmological, largescale gravity and magneto-hydrodynamical simulations, where baryonic processes such as stellar evolution, chemical enrichment, gas cooling, supernovae and black hole feedback are incorporated as subgrid models.
The κTNG suite is generated through tracing the trajectories of light-rays from z = 0 to the target source redshift.A large number of realizations are generated by randomly translating and rotating the snapshots.Specifically, the full set includes 10,000 realisations of 5 × 5 deg 2 convergence maps for 40 source redshifts up to z s = 2.6, at a 0.29 arcmin pixel resolution.A flat Λ-cold dark matter cosmology is assumed for both TNG and κTNG: Hubble constant H 0 = 67.74km s −1 Mpc −1 , baryon density Ω b = 0.0486, matter density Ω m = 0.3089, spectral index of scalar perturbations n s = 0.9667, and amplitude of matter fluctuations at 8 h −1 Mpc σ 8 = 0.8159.

Mock COSMOS data
To act as our prior for the reconstruction of the COSMOS field, and for validation tests, we emulate from κTNG simulations mock COSMOS lensing catalogs.
Using the binned galaxy distribution from the S10 catalog at the same 0.29 arcmin resolution, we define a binary mask that captures the limits of the survey as well as missing data within the survey area due to bright stars or image artifacts that prevent the reliable measurement of galaxy shapes in some regions of the survey.This binning also yields noise variance maps, defined by the standard deviation of intrinsic galaxy ellipticities, rescaled by the number of galaxies per pixels N i : for each pixel i.Note that in empty pixels, we assume a large variance instead of infinity.κTNG provides finely sampled convergence source planes in the range 0 < z s < 2.6.We combine these source planes to match the redshift distribution of the survey illustrated in Figure 4. To handle source redshifts higher than z s ≥ 2.6 we resorted to a redshift recycling approach in which the last source plane at z = 2.6 was reused, with an appropriate weight designed to match the expected lensing kernel.This procedure yields a total of 10,000 convergence maps, now properly weighted to match the redshift distribution of the binned data.
Mock observations of COSMOS shear maps are obtained by first computing the shear from simulated convergence maps, sampling spatially variant noise according to Σ, and applying the binary mask to mask out unobserved regions.

Tests with simulated data
In this section, we validate our sampling procedure on mock COSMOS data generated as described in the previous section.We will begin by demonstrating our sampling method in an analytically tractable Gaussian case, before testing full posterior sampling using the neural score matching approach introduced in section 5.

Sampling validation with analytic Gaussian prior
In this section, we assume that the convergence κ is a Gaussian Random Field and therefore associated to a Gaussian prior.We show that one can compute the Wiener filter estimate using our posterior sampling method.
Our method, based on the evaluation of the gradient of the log posterior enables to recover both the maximum a posteriori (MAP) and the Mean Posterior, which should match.
We used the halofit matter power spectrum from jax-cosmo 1 , using Planck Collaboration et al. ( 2015) (Table 4 final column) results, to build the Gaussian prior covariance matrix.In the following, we also refer to it as the fiducial power spectrum.From the score function of the Gaussian posterior, i.e. ∇ log p(κ|γ), one can compute the MAP using the gradient descent algorithm.As the posterior distribution is Gaussian, there is only one minimum so we are sure to obtain the MAP.The MAP matches the posterior mean, so this also provides a check of the validity of our sampling procedure.
Figure 5 compares the Wiener Filter computed with the messenger field method, from Elsner & Wandelt (2013), and our Score-based MAP and Posterior Mean reconstructions for the same input shear field.The residuals between the MAP and the Wiener filter are very low (two orders of magnitude smaller than the signal) and the residals are also unstructured as expected, both are computed with the same input data and fiducial cosmology and should mathematically match.We also show the relative error between power spectrum of the Wiener filter and the MAP, and the relative error between the Wiener filter and the posterior mean, which is almost zero at every scale.The posterior mean is computed by averaging 1500 posterior samples, sampled with annealed HMC presented in subsection 4.2, reaching 10 −3 temperature and then projected on the posterior distribution with the ODE sampler.
This experiment with a Gaussian prior validates that our sampling procedure is well adapted to sample from the posterior distribution in this analytically tractable case.

Sampling with a neural prior
In the last section, we showed that with the gradient of the log-Gaussian posterior can sample the posterior distribution and that we can therefore recover classical results such as the Wiener filter estimate.In this section, we apply the same sampling procedure, using a simulation-based prior.We saw in Figure 3 that denoising a convergence map with a deep neural network trained on κTNG simulated maps (using the full prior) is much more effective for small scales structures inference than using the Gaussian assumption alone.We will now show that the same results hold for posterior samples (using the posterior) with a neural prior.The likelihood remains exactly the same as the one used in the former section, i.e. from Equation 25.Likewise, the sampling procedure is unchanged, only the prior is extended with a neural network as defined in Equation 23.
Convergence maps sampled from the full posterior are shown in the second row of Figure 7.One can observe two regimes in these samples.On one hand, sampling in the unobserved regions (outside of the white boundary of the survey) is only driven by the prior.These regions have therefore high variance between samples because of the different chain initialization maps.On the other hand, in the data region, the sampling is driven by both the likelihood and the prior.Therefore, there is a high correla-Fig.6.Comparison of posterior samples for the Gaussian prior and our hybrid prior.The dashed black curve is the fiducial power spectrum, computed using jax-cosmo.The red curve is computed from the κTNG simulated map.The green curve is the power spectra of a white Gaussian noise realization we use to initialize the chains.The blue curve is the mean power spectrum of posterior samples assuming our proposed hybrid prior, and the blue interval represent the standard deviation of the power spectra computed over 400 samples.The p.s. of a posterior sample assuming a Gaussian prior is shown in orange.Both posteriors consider the same noisy input shear taken from our test set.ƭ tion between the samples, and with the ground truth convergence map, within the white contours.
Figure 6 illustrates that the power spectra of our DLPosterior samples is consistent with the ground truth κTNG simulated convergence map power spectrum, validating that we are able to sample maps with the correct statistics.This is also apparent by visual inspection, posterior samples shown in Figure 7 are very similar to convergence maps from the κTNG simulations, illustrating that we are able to sample κ maps at the simulation resolution.In addition, Figure 6 demonstrates that the prior choice has a strong influence on the map statistics.Using the Gaussian prior alone leads to sampling a map that matches the fiducial power spectrum, which has more power at high due to the limited resolution of the simulation.

Posterior mean
As for the Wiener filter, a way to validate the posterior distribution is to look at the posterior mean solution.In this section, we compare to the state of the art method to date, DeepMass, which is an other deep-learning based method.We trained DeepMass on examples drawn from our prior and likelihood model with the exact same U-net architecture as our denoiser.Using the convergence maps from our simulated dataset, we mocked noisy shear maps according to the likelihood in Equation 25.DeepMass  1 shows a quantitative comparison of the Kaiser-Squires, the Wiener Filter, MCALens, GLIMPSE, DeepMass and our mass-mapping method, based on two metrics: -The root mean square error (RMSE) defined as: where i is the pixel index, κ the mean-subtracted estimated convergence map, κ gt the mean-subtracted ground truth convergence map we aim to recover.Note that we use here the mean-subtracted convergence map.This is motivated by the mass-sheet degeneracy, which does not constrain the mean of the convergence map.
-The Pearson correlation coefficient r defined as: where Cov is the covariance and σ κ is the standard deviation of the convergence map κ.
On the top row of figure Figure 7 we display our Posterior mean against DeepMass, the ground truth and the Wiener filter.Qualitatively, we can clearly observe that both DeepMass and our posterior mean recover the convergence map with striking visual similarity, and at a much higher resolution than the Wiener Filter applied to the same field (see Figure 5).We can indeed identify the presence of clusters, which are corresponding to true clusters in the ground-truth map.For further visual comparison, Figure G.1 shows Kaiser-Squires, Wiener Filter, MCALens, GLIMPSE, DeepMass, and our Posterior mean and samples convergence estimates from the same input shear field.
Just like in the case of Gaussian prior explored in the previous section, one can observe that computing the mean of the posterior samples cancels out the signal outside the survey mask, where the reconstruction is only constrained by the prior.This is expected since sampling with a random initialization yield a random sample on the posterior.This is moreover confirmed by the standard deviation map, showing that the uncertainty of the reconstruction is much higher in masked regions than regions constrained by data.
Quantitatively, we also compare our reconstruction to methods based on other priors, which are MCAlens and GLIMPSE.These methods are based on sparse priors, which are described in subsection 3.3.It turns out that MCAlens, DeepMass and our posterior mean both reach the lowest RMSEs and highest Pearson correlations.This results can be expected since both DeepMass and DLPosterior mean are data-driven.DeepMass is explicitly trained to minimize the pixel reconstruction error leading to an estimate of the posterior mean, which our method is also targeting.This result shows that our method reaches the state of the art reconstruction of convergence maps.30.Background corresponds to the coefficient of samples where we did not add any cluster.The background coefficients follow a Gaussian distribution, we can thus define a standard deviation σ for detection.We can see that the lower the cluster mass -thus the lower the SNR, the closer histograms are to the background.ƭ

Method
The fact that MCAlens leads to metrics similar to DeepMass and DLPosterior mean could be explained by the fact that MCAlens modeling, (the convergence map being modeled as the sum of a Gaussian Random Field and a non-Gaussian one being sparse in the wavelet domain) is able to capture well enough the main properties of the convergence map.

Demonstration with cluster detection
In a Bayesian perspective, we expect that there is a relation between the number of times a cluster appears within posterior samples and its signal-to-noise ratio (SNR).We show in this section to which extent having access to posterior samples can allow us to quantify the uncertainty of a cluster detection and how it relates to the cluster SNR.
In order to compare the occurrence of a cluster in the posterior and its SNR, we insert a Navarro-Frenk-White (NFW) lensing signal, described in Navarro et al. (1997) and Takada & Jain (2003), of a given mass, redshift, and concentration, into a mocked shear field.We consider a cluster at redshift z h = 0.5, concentration c = 1, lensing a source at redshift s = 1.We run the experiments at different halo masses {9 × 10 13 , 3 × 10 14 , 6 × 10 14 } M that simulates an SNR increase for a given noise level in the data.We used the code from the lenspack2 repository to 6 × 10 14 9 × 10 14 Cluster Mass [M ] Fig. 9. Comparison of the probability of cluster detection given the input data and the detection method with respect to the SNR.The ratio of detection corresponds the the number of samples above 5σ over the total number of samples.The SNR was computed by dividing the maximum κ value of the NFW profile by the noise in the κ field.We considered clusters at redshift z h = 0.5 and concentration c = 1 varying the halo mass.ƭ build the NFW shear map, that is summed to the mocked shear field from our test dataset.
In this section, all the maps are filtered with an aperture mass filter with starlets functions (Leonard et al. 2012).Starlets are well localized in real space, hence it is well very adapted to cluster detection (Starck et al. 1998).In the following, we either refer as pixel or coefficient for the filtered convergence maps.
Therefore, for a given NFW cluster, its intrinsic SNR, that we also call the input SNR, is defined as: where κNFW denotes the convergence NFW profile that has been filtered and the maximum is taken over the pixels of the map, κbackground is the background convergence map, i.e. the mocked convergence field over which we add the NFW profile, that is being filtered.The mocked convergence field is computed from COSMOS-like data, similar to the fields described in subsection 6.3.We now describe the detection procedure.Given a shear field, we sample convergence maps with our method and each posterior sample is filtered with the aperture mass.Given one posterior sample, the detection criteria defined as: which means that we consider that a cluster is detected when the convergence is over 3 times the noise background amplitude.The distribution of the convergence background is illustrated as the hashed histogram in Figure 8. Similar histograms were computed from posterior samples with an added cluster in the input shear and plotted besides the background.In Figure Figure 8, we can see that when decreasing the input SNR, by reducing the halo mass, the distribution of the cluster maximum coefficient shifts towards lower levels, approaching the background distribution.The background distribution histogram in hatched black in Figure 8 is the histogram of the κbackground map. Figure F.1 shows stamps coefficients after filtering.
Figure 9 shows the frequency of the clusters detection in our posterior samples as a function of the input SNR.

Reconstruction of the COSMOS field
In this section we now apply our full methodology to the reconstruction of the COSMOS field, using the catalog described in subsection 6.1.The likelihood convariance and the simulationbased prior remain the same as in the validation section.
As a baseline, we show the Kaiser-Squires reconstruction of the COSMOS convergence field in the top-left panel of Figure 10.We applied a Gaussian-smoothing with a variance σ smooth = 1 arcmin, chosen such that the RMSE is minimized and Pearson correlation coefficient is maximized on simulated data.Although large scale and small scale structures can already be observed on this map, its power spectrum does not correspond to the fiducial matter power spectrum, and no feature at scales smaller that 1 arcmin can be observed.Moreover, there is not any uncertainty quantification on the reconstruction.
In Figure 10 we also presents our reconstruction of the COS-MOS convergence field, alongside uncertainty quantification.The top-right panel shows a posterior sample that is looking very similar to a posterior sample from simulated input from κTNG simulations.Notice that although there is only input shear within the white contours, i.e. in the survey footprint, a complete convergence map is sampled from the posterior distribution.We furthermore validate, in Figure 11, the quality of COSMOS posterior samples reconstruction by showing that their power spectra are in good agreement with those of κTNG convergence maps.
As proposed in subsection 3.4 we choose the posterior mean as our estimate for the convergence reconstruction.In the Bottom-left panel of Figure 10 shows the average of 400 samples making the DLPosterior mean of the COSMOS field.One can visually observe the similar large scale structures of the field between the Kaiser-Squires and the DLPosterior mean, while the latter contains much more resolved features, especially concerning the clusters shapes.
Another way to examine our COSMOS convergence map is to compare to another probe for cluster mass detection.So in the bottom-left and top-right panels, we overlay a subset of X-ray clusters from the Finoguenov et al. (2006) catalog.Most of the X-ray clusters match to a resolved peak in the convergence field, which is in a way expected since we selected the most massive X-ray clusters.
Alongside the DLPosterior mean, we also provide the DLPosterior standard deviation, in the bottom-right panel of Figure 10.One can observe again that the variance is the highest outside the survey contours since there is no data.Another interesting behaviour of the posterior is that the location where the uncertainty is also high is where the signal is the most intense.
The former mass-map of the COSMOS field was published in Massey et al. (2007), using an generalized version of the Kaiser-Squires method.When comparing the two maps, although we do not share the same shape catalog nor redshift distribution, they clearly share similar mass distributions.However, the COSMOS-DLPosterior mean is much more resolved.In particular, we can identify several clusters in what looks like one coarse cluster in the Massey et al. (2007) mass-map at coordinates RA=150.4,DEC=2.5.

Discussion
Having presented the method and results, we propose in this section a discussion on some important points including scientifically relevant use-cases, limitations, and possible extensions.
As mentioned earlier in this paper, taking the average over posterior samples should reduce to the DeepMass results.One may wonder about the tradeoff between the two approaches if the results should be the same.We would argue that the two methods are complementary.DeepMass is much faster in producing a convergence map as it only requires a forward pass of the U-net.However, DeepMass remains an amortized solution, with no strong guarantees on the solution it recovers in practice, due to not having an explicit likelihood.This also means that the entire model needs to be retrained for any change in the lensing catalog (to account for variations in the mask or noise).
We expect that the method presented in this paper will find its most compelling applications in the study of localized structures through the weak lensing effect, where the benefits of a full pixel-level posterior are the strongest.As a particularly relevant example, we will mention the discussion surrounding the potential detection in the Abell 520 (A520) cluster of a dark clump (Jee et al. 2012;Clowe et al. 2012), a localized peak visible in mass-maps but with no optical counterpart.Quantifying the significativity of such a structure using non-linear mass-mapping algorithms targeting a Maximum A Posterior solution (for instance based on sparse regularisation) is a difficult task.It was attempted for this particular field using different techniques in Peel et al. (2017) and Price et al. (2018), but without strong quantitative statements.The method developed in this paper however would be able to access the full posterior of the problem.
For cosmological applications, it is however likely that the use cases of the method presented in this work will remain limited.Indeed, for Higher Order Statistics relying on simulations to evaluate their likelihood, the particular mass-mapping technique used is not critical as any systematics due to reconstruction errors induced by the algorithm are calibrated on simulations.A simple Kaiser-Squires inversion should in principle suffice in most cases of interest.From an information theoretic point of view, the information is preserved by a Kaiser-Squires inversion (in the presence of masks, keeping both E-and B-modes) while any posterior summary may discard some cosmological information.Nevertheless, the ability to sample constrained realisations may find very useful applications such as inpainting masked regions for the purpose of facilitating the computation of 3pt functions, or void detection algorithms.
We also want to highlight that the method presented in this paper can be extended in multiple directions.We are considering here a prior at a fixed cosmology, primarily due to the high computational cost of high resolution hydrodynamical simulations at different points in cosmological parameter space.Concretely, this means that our solution is heavily biased towards the fiducial cosmology on scales poorly constrained by the data.Note however that the hybrid deep learning + physical gaussian prior we have introduced in this work provides a natural framework for extending our method to include cosmological dependence.On large scales, the prior is mostly driven by the analytic power spectrum and thus easy on condition on cosmology, while on small scales, only the non-Gaussian residuals are captured by the neural network.This implies that the cosmology-dependent part of the model that needs to be learned on simulations is mostly on small scales.This makes it very likely that in the close future one could develop such a cosmology-dependent residual prior from suites of numerical simulations of smaller cosmological volume, would be an ideal dataset for this purpose.Another avenue for future work would be extending the method to the sphere, which is simply a matter of defining a U-net for instance using a Deep-Sphere (Perraudin et al. 2019) approach for convolutions on a spherical domain.

Conclusion
In this paper, we have presented a unified view of the massmapping problem as a Bayesian posterior estimation problem.
Most existing methods either rely on simple priors (i.e.Gaussian prior for the Wiener filter) and/or only return a point estimate of the mass-mapping posterior.Instead, we have proposed a framework which allows us to 1) use numerical simulations to provide a full non-Gaussian prior on the convergence field; 2) sample from the full Bayesian posterior of the mass-mapping problem under this simulation-driven prior and a physical likelihood.
The proposed approach, dubbed DLPosterior, relies first on using a Denoising Score Matching technique to learn a prior from high resolution hydrodynamical simulations (the κTNG dataset (Osato et al. 2021)), an approach that has proven to be extremely scalable and easy to implement.And as a second ingredient, we have introduce an annealed Hamiltonian Monte-Carlo approach that allows us to sample the high-dimensional Bayesian posterior of the problem with high efficiency.We demonstrate that we are able to achieve on average an independent posterior sample in 10 GPU-minute on an Nvidia Tesla V100 GPU.Moreover, the annealing scheme ensures that we sample independent samples with the correct weights of the posterior modes.
We first validated the sampling approach in an analytically tractable fully Gaussian case where we recovered the Wiener filter.We then validated the entire method on mock observations, by comparing the posterior mean obtained by our method to a Deep Learning estimate of this posterior mean using the DeepMass method.We found excellent agreement between these Fig. 11.Power spectra comparison between the COSMOS field reconstruction and simulations.We compare power spectra of posterior samples for the COSMOS field in blue and for a simulated field in red.We also display the power spectrum of samples from the training set in orange to show the matching.ƭ two independently estimated posterior summaries, with near identical Pearson correlation coefficient and RMSE.This consistency gives us high confidence that the full posterior is correctly sampled even in the non-Gaussian case.
In further comparisons, DLPosterior, demonstrated quantitative improvement on convergence reconstructions against other standard methods, based on a large class of priors such as the Kaiser-Squires inversion, the Wiener Filter, or GLIMPSE2D.In addition, contrary to most of these methods, DLPosterior provides uncertainty quantification with the posterior samples, such as the posterior variance.In this respect, we also showed that the recovered posterior can be interpreted to quantify uncertainties, by establishing a close correlation between posterior convergence values and the SNR for clusters artificially introduced into a field.
Finally, we have applied the validated method to the reconstruction of the HST/ACS COSMOS field based on the shape catalog from Schrabback et al. (2010), and produced the highest quality convergence map of this field to date.

Fig. 1 .
Fig. 1.Illustration of the score field (represented by the vector field) on the simple two-dimensional "two moons" distribution.The score field points in the direction of regions of higher probability.A noisy sample from the distribution is shown in red, while the denoised sample computed using equation Equation 18 is shown in orange.This illustrates how the variance multiplied by the score function, represented by the black arrow, can project the sample onto the high density of the distribution.ƭ

Fig. 3 .
Fig. 3. Denoising of a 100 × 100 arcmin 2 simulated convergence map (a) which was corrupted with a noise level of σ = 0.15 per pixel (b).The Gaussian denoiser (c), computed with equations 18 and 12, retrieves large scales of the convergence map.Adding the DAE residuals to the Gaussian denoiser (d) allows us to recover both large and small scales of the κ map.This illustrates the Gaussian (analytic) and non-Gaussian (learned) parts of our hybrid score function Equation 19.ƭ

Fig. 4 .
Fig. 4. Redshift distribution of the COSMOS catalog fromSchrabback et al. (2010).The redshifts for bright galaxies (i + < 25) are derived from the COSMOS-30 catalog(Ilbert et al. 2009), while the n(z) distribution for faint galaxies (i + > 25) is based on an extrapolation to fainter magnitudes as described inSchrabback et al. (2010).ƭ , while for the faint sample Schrabback et al. (2010) propose a functional form for the overall n(z) based on an extrapolation to fainter magnitudes of the i 814 -redshift relation observed in the 23 < i 814 < 25 range.The redshift distribution of both samples is illustrated on Figure 4.

Fig. 5 .
Fig. 5. Comparison between the Wiener filter (top left) and our scorebased maximum-a-posteriori (not shown) and posterior mean reconstructions (top right, average of 1500 samples), assuming a Gaussian prior.The MAP reconstruction matches the Wiener filter both in terms to absolute pixel error and power spectrum ratio (bottom right, blue line).Our posterior mean is also in excellent agreement with the Wiener filter as illustrated by the pixel absolute error (bottom left) and the power spectrum relative error (bottom right, orange line).Bottom panel shows the relative error between the Gaussian posterior samples power spectra and the theoretical power spectrum.ƭ

Fig. 7 .
Fig. 7. Comparison of deep learning-based mass-mapping methods on one mocked COSMOS field.First row: the ground truth κTNG convergence map, the mean of our posterior samples (over 400 samples), DeepMass and the standard deviation of our posterior samples.All the maps use the same colorbar, except for the standard deviation, which is displayed in the range [0, 0.035].Second row: samples from the posterior distribution, using the hybrid Gaussian-neural score prior.The input noise level is computed according Equation 26 and the mask contours corresponds to the COSMOS survey boundary mask.ƭ

Fig. 8 .
Fig.8.Histograms of coefficient amplitudes.A wavelet coefficient amplitude corresponds to the coefficient computed in Equation30.Background corresponds to the coefficient of samples where we did not add any cluster.The background coefficients follow a Gaussian distribution, we can thus define a standard deviation σ for detection.We can see that the lower the cluster mass -thus the lower the SNR, the closer histograms are to the background.ƭ

Fig. 10 .
Fig. 10.HST/ACS COSMOS field reconstructions along known X-ray clusters from the XMM-Newton survey.Top-left: Kaiser-Squires (with a Gaussian smoothing of σ = 1 arcmin), top-right: sample from the posterior distribution, bottom-left: mean of the posterior distribution, bottomright: standard deviation of the posterior distribution (over 400 samples, shown in the clipped range [0, 0.035]).ƭ

Fig. C. 1 .
Fig. C.1.Comparison of the score function learning with and without spectral normalization.In blue is the density of the swiss roll distribution and the black vector field is the learned score function evaluated on a grid.ƭ

Table 1 .
Metrics comparison between different mass mapping methods.RMSE (the lower the better) is computed according to eq. 27 and the Pearson Correlation coefficient r (the higher the better) according to eq. 28.For a fair comparison to Kaiser-Squires, we kept the smoothing coefficient that minimized the RMSE.