Open Access
A&A, Volume 671, March 2023, Article Number A147, 16 pages
Section: Cosmology (including clusters of galaxies)
DOI: https://doi.org/10.1051/0004-6361/202244325
Published online: 17 March 2023

© The Authors 2023

Licence: Creative Commons. Open Access article, published by EDP Sciences, under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This article is published in open access under the Subscribe to Open model.

Open access funding provided by Max Planck Society.

1. Introduction

Gravitational lensing has become a very powerful tool in astrophysics, especially in combination with other probes, such as lens velocity dispersion measurements (e.g., Barnabè et al. 2011, 2012; Yıldırım et al. 2020) and galaxy rotation curves (e.g., Strigari 2013; Hashim et al. 2014), which both help to probe the mass structure of galaxies. In particular, gravitational lensing allows us to measure the total mass of the lens (Dye & Warren 2005; Treu 2010) and thus to study the fraction and distribution of dark matter (DM; e.g., Schuldt et al. 2019; Baes & Camps 2021; Shajib et al. 2021; Wang et al. 2022), as well as its nature (e.g., Basak et al. 2022; Gilman et al. 2021). Moreover, thanks to the lensing magnification, one can study high-redshift sources (e.g., Dye et al. 2018; Lemon et al. 2018; McGreer et al. 2018; Rubin et al. 2018; Salmon et al. 2018; Shu et al. 2018) by reconstructing the surface brightness of the source, which can reveal information about the evolution of galaxies at higher redshifts (e.g., Warren & Dye 2003; Suyu et al. 2006; Nightingale et al. 2018; Rizzo et al. 2018; Chirivì et al. 2020).

In special cases where a variable source, such as a quasar (e.g., Lemon et al. 2017, 2018, 2019; Ducourant et al. 2019; Khramtsov et al. 2019; Chan et al. 2020; Chao et al. 2020, 2021) or supernova (SN), is present as a background object (Kelly et al. 2015; Goobar et al. 2017; Rodney et al. 2021), one can measure the time delays (e.g., Millon et al. 2020a,b; Huber et al. 2022) and then constrain the Hubble constant H0 (e.g., Refsdal 1964; Chen et al. 2019; Rusu et al. 2020; Wong et al. 2020; Shajib et al. 2020, 2022), helping to clarify the current discrepancy between the cosmic microwave background (CMB) measurements (Planck Collaboration VI 2020) and measurements using the local distance ladder (e.g., the SH0ES project, Riess et al. 2019, 2021). Furthermore, if one detects the first image of a lensed SN and predicts the location and time of the next appearing image(s), the SN can be observed at earlier phases than without lensing, which would help to shed light on open questions regarding the SN progenitor system(s).

For this reason, and also because such lensing systems with time-variable background sources are very rare, great effort has been made in recent years to run several large and dedicated surveys, such as the Sloan Lens ACS (SLACS) survey (Bolton et al. 2006; Shu et al. 2017), the CFHTLS Strong Lensing Legacy Survey (SL2S; Cabanac et al. 2007; Sonnenfeld et al. 2015), the Sloan WFC Edge-on Late-type Lens Survey (SWELLS; Treu et al. 2011), the BOSS Emission-Line Lens Survey (BELLS; Brownstein et al. 2012; Shu et al. 2016; Cornachione et al. 2018), and the Survey of Gravitationally-lensed Objects in HSC Imaging (SuGOHI; Sonnenfeld et al. 2018; Wong et al. 2018; Chan et al. 2020; Jaelani et al. 2020a). In addition, several teams have used the imaging data from large surveys, such as the Dark Energy Survey (DES; e.g., Jacobs et al. 2019; Rojas et al. 2022), the Panoramic Survey Telescope and Rapid Response System (Pan-STARRS; e.g., Lemon et al. 2018; Cañameras et al. 2020), the Kilo Degree Survey (KiDS; e.g., Petrillo et al. 2017, 2019; Li et al. 2020, 2021), and the surveys with the Hyper Suprime-Cam (HSC; e.g., Cañameras et al. 2021; Shu et al. 2022; Jaelani et al., in prep.), to identify additional lens systems. In total, a few hundred spectroscopically confirmed lenses and many more promising lens candidates have been found so far, but the sample of lens candidates is expected to grow by a factor of around 20 (Collett 2015) with upcoming surveys, such as the Rubin Observatory Legacy Survey of Space and Time (LSST, Ivezic et al. 2008), which will observe around 20 000 deg² of the southern hemisphere in six different filters (u, g, r, i, z, y), or the Euclid imaging survey operated by the European Space Agency (ESA; Laureijs et al. 2011).

To find these strong galaxy–galaxy scale lenses, huge efforts are currently being made to develop fast and automated algorithms to classify billions of observed galaxies. Different methods are available, such as geometrical quantification (Bom et al. 2017; Seidel & Bartelmann 2007), spectroscopic analysis (Baron & Poznanski 2017; Ostrovski et al. 2017), arc finders that include color cuts (Gavazzi et al. 2014; Maturi et al. 2014), or machine learning techniques (e.g., Jacobs et al. 2017; Petrillo et al. 2017; Lanusse et al. 2018; Schaefer et al. 2018; Metcalf et al. 2019; Cañameras et al. 2020, 2021, in prep.; Huang et al. 2020; Rojas et al. 2022; Savary et al. 2022; Jaelani et al., in prep.; Shu et al. 2022).

After detecting the lens candidates, a model describing their total mass distribution is required for nearly all applications; this also helps to reject some false-positive candidates (Marshall et al. 2009; Sonnenfeld et al. 2013; Chan et al. 2015; Taubenberger et al., in prep.). As the sample of known lenses is increasing rapidly, current Markov chain Monte Carlo (MCMC) based techniques (e.g., Jullo et al. 2007; Suyu & Halkola 2010; Sciortino et al. 2020; Fowlie et al. 2020) are no longer sufficient to model all of them, because the MCMC sampling is very time consuming and resource dependent, and requires a lot of human input. Hezaveh et al. (2017) therefore proposed and demonstrated the feasibility of using convolutional neural networks (CNNs) to predict the mass model parameter values for high-resolution and pre-processed, lens-light-subtracted images. This was further improved and explored by the same team (Perreault Levasseur et al. 2017; Morningstar et al. 2018, 2019), and Pearson et al. (2019, 2021) also presented a CNN for modeling high-resolution lens images. For the LSST Dark Energy Science Collaboration, Wagner-Carena et al. (2021) developed so-called Bayesian neural networks to model HST-like lenses after lens light subtraction in analogy to Hezaveh et al. (2017), while in their follow-up work, they extended their approach to HST-like images without prior lens light subtraction (Park et al. 2021). To complement this, as part of our ongoing Highly Optimized Lensing Investigations of Supernovae, Microlensing Objects, and Kinematics of Ellipticals and Spirals (HOLISMOKES, Suyu et al. 2020) programme, in Schuldt et al. (2021a, hereafter S21a) we presented a CNN designed to model strongly lensed galaxy images. This CNN predicts the five singular isothermal ellipsoid (SIE) mass model parameter values (lens center x and y, complex ellipticity ex and ey, and the Einstein radius θE) for the lens galaxy mass distribution using ground-based HSC images.

In the present work, which builds upon S21a, we adopt an SIE profile plus an external shear component and include an uncertainty prediction for each parameter, as has already been demonstrated on high-resolution HST-like images (Perreault Levasseur et al. 2017; Park et al. 2021; Wagner-Carena et al. 2021). As this task is much more complex than that in S21a, it requires a deeper network to cope with the small distortions induced by the external shear, such that we now rely on residual blocks in the network architecture (ResNet, He et al. 2016). Moreover, we improved our simulation pipeline, which uses real observed galaxy images as well as corresponding redshift and velocity dispersion measurements (S21a), to obtain even more realistic training data. In detail, we add Poisson noise to the simulated arcs, add the external shear component, and improve the automated computation of the first and second brightness moments used to obtain the lens light center and ellipticity. Our mock images of lenses also contain other galaxies in the image cutout, allowing our ResNet to cope with realistic lens systems that often have other nearby objects along the line of sight. While we use only simulated data to test the network performance in this work, we apply this network to a smaller sample of 31 real HSC lenses in a companion paper (Schuldt et al. 2022) and compare the obtained models to traditional ones obtained through MCMC sampling. The good agreement found there shows that our mock images are realistic and that our demonstrated network performance is trustworthy.

As one of the main advantages of our ResNet is the computational speed, it is perfectly suited to predict the lens mass model parameters of a lensing system with a transient host galaxy as the background source, which can then be used to predict the next appearing image(s) and time delay(s) of the transient. Therefore, in analogy to S21a, we compare the predicted image positions and time delays using (1) the ground-truth mass model from our simulation pipeline and (2) the predicted model from the network.

The outline of the paper is as follows. We summarize in Sect. 2 the simulation of our training data and changes compared to S21a. In Sect. 3, we describe our network architectures and in Sect. 4 we present our results. The dedicated tests for the network are summarized in Sect. 5. Our comparisons of the image positions and time delays are reported in Sect. 6, before we present our conclusions in Sect. 7.

Throughout this work, we assume a flat ΛCDM cosmology with a Hubble constant H0 = 72 km s−1 Mpc−1 (Bonvin et al. 2017) and ΩM = 1 − ΩΛ = 0.32 (Planck Collaboration VI 2020) in analogy to S21a.

2. Simulation of mock lenses

Training a supervised neural network requires a good-quality, representative, and sufficiently large sample of input data together with the corresponding output, the so-called ground truth. For our purpose, we require a sample containing around 100 000 lens systems, each in four filters, which forces us to rely on simulations given that the number of real known lenses is < 10⁴ and their mass models would need to be known. In order for our network to achieve optimal performance on real data, it is important that the mock images in our training sample are as realistic and as representative of real data as possible. We therefore use real observed images of galaxies as lenses and as background sources, instead of producing a completely simulated sample as is often done (e.g., Hezaveh et al. 2017; Perreault Levasseur et al. 2017; Pearson et al. 2019, 2021). We improve on the pipeline¹ described in S21a for simulating images of mock lenses, and briefly summarize here the procedure, highlighting the new features we adopt in this work. A diagram of the pipeline is shown in Fig. 1.

Fig. 1. Flow chart of the simulation pipeline used to create our training data in the four filters griz. Figure adapted from Fig. 3 of S21a.

As lenses, we again use HSC images (Aihara et al. 2019) of luminous red galaxies (LRGs) with available spectroscopic redshifts and velocity dispersion measurements from the Sloan Digital Sky Survey (SDSS) DR14 (Abolfathi et al. 2018). In addition to our criteria in S21a, we now set a lower limit of 100 km s−1 on the velocity dispersion, as very low-mass galaxies do not yield a strong lensing configuration with spatially resolved multiple images (compare Fig. 1 in Schuldt et al. 2021a). This speeds up the selection of suitable lens–source pairs. As background sources, we again use images from the Hubble Ultra Deep Field (HUDF, Beckwith et al. 2006; Inami et al. 2017). Each background galaxy is then lensed with the software GLEE (Suyu & Halkola 2010; Suyu et al. 2012), convolved with the (subsampled) HSC PSF, and binned to the HSC pixel size of 0.168″. The obtained arcs are then added to the lens image. With this procedure, we include real line-of-sight objects and light distributions, as shown in Fig. 2.

Fig. 2. Example images of our mock lenses (left panels) and the corresponding frames with only the lensed source (right panels). Each image has a size of 64 × 64 pixels, corresponding to around 10.75″ × 10.75″.

To calculate the deflection angles, we assume a singular isothermal ellipsoid (SIE) profile and, in contrast to S21a, an external shear described in complex notation by $\gamma_{\rm ext,1}$ and $\gamma_{\rm ext,2}$ to account for additional mass outside of the cutout. The convergence (also called dimensionless surface mass density) of the adopted SIE profile (Barkana 1998)² can be expressed as

$$ \kappa(r) = \frac{\theta_{\rm E}}{(1+q)\,r}, $$ (1)

with an elliptical radius

$$ r = \sqrt{x^2 + \frac{y^2}{q^2}}, $$ (2)

and the axis ratio

$$ q = \sqrt{\frac{1 - \sqrt{e_x^2 + e_y^2}}{1 + \sqrt{e_x^2 + e_y^2}}}, $$ (3)

where

$$ e_x = \frac{1-q^2}{1+q^2} \cos(2\phi) $$ (4)

and

$$ e_y = \frac{1-q^2}{1+q^2} \sin(2\phi) $$ (5)

are the complex ellipticities. The external shear parameters can be converted to a total shear strength

$$ \gamma_{\rm ext} = \sqrt{\gamma_{\rm ext,1}^2 + \gamma_{\rm ext,2}^2}, $$ (6)

which is rotated by

$$ \phi_{\rm ext} = \begin{cases} \frac{s}{2} & \mathrm{if}\ \gamma_{\rm ext,1} \ge 0 \ \mathrm{and}\ \gamma_{\rm ext,2} \ge 0 \\ \frac{\pi - s}{2} & \mathrm{if}\ \gamma_{\rm ext,1} < 0 \ \mathrm{and}\ \gamma_{\rm ext,2} \ge 0 \\ \frac{\pi + s}{2} & \mathrm{if}\ \gamma_{\rm ext,1} < 0 \ \mathrm{and}\ \gamma_{\rm ext,2} < 0 \\ \frac{2\pi - s}{2} & \mathrm{if}\ \gamma_{\rm ext,1} \ge 0 \ \mathrm{and}\ \gamma_{\rm ext,2} < 0 \end{cases} $$ (7)

with

$$ s = \arcsin\left(\frac{|\gamma_{\rm ext,2}|}{\gamma_{\rm ext}}\right). $$ (8)
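For readers implementing these conversions, the following is a minimal Python sketch of Eqs. (3)–(8); the function names are ours and not part of the paper's pipeline.

```python
import numpy as np

def axis_ratio(ex, ey):
    """Axis ratio q from the complex ellipticity (Eq. 3)."""
    e = np.hypot(ex, ey)
    return np.sqrt((1.0 - e) / (1.0 + e))

def complex_ellipticity(q, phi):
    """Complex ellipticity (e_x, e_y) from axis ratio and position angle (Eqs. 4-5)."""
    f = (1.0 - q**2) / (1.0 + q**2)
    return f * np.cos(2.0 * phi), f * np.sin(2.0 * phi)

def shear_polar(g1, g2):
    """Shear strength and orientation from (gamma_ext,1, gamma_ext,2)
    (Eqs. 6-8); assumes a nonzero shear so the arcsin argument is defined."""
    gamma = np.hypot(g1, g2)
    s = np.arcsin(abs(g2) / gamma)
    if g1 >= 0 and g2 >= 0:
        phi = s / 2.0
    elif g1 < 0 and g2 >= 0:
        phi = (np.pi - s) / 2.0
    elif g1 < 0 and g2 < 0:
        phi = (np.pi + s) / 2.0
    else:
        phi = (2.0 * np.pi - s) / 2.0
    return gamma, phi
```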

Given our results in S21a, we again use a flat distribution of Einstein radii up to ≃2″ in order to allow the network to better learn the full parameter space. Analogously, we use a flat distribution for γext in the range between 0 and 0.1, which also covers the realistically expected shear strengths (Faure et al. 2011; Wong et al. 2011).

Moreover, we improve our simulation code by including Poisson noise for the arcs approximated as

$$ \sigma_{\mathrm{poisson,arc},i} = \sqrt{\alpha \times I_{\mathrm{arc},i}^{+}}, $$ (9)

where Iarc is the lensed source image (arcs) and

$$ I_{\rm arc}^{+} = \max\{I_{\rm arc}, 0\}. $$ (10)

To compute the scaling factor α, we start from the HSC variance map with value $v_i^{+}$ corresponding to the ith lens image pixel³ and define

$$ \sigma_i = \sqrt{v_i^{+}}. $$ (11)

We then compute the lens background noise σbkgr defined as the minimal root-mean-square (rms) of the four corners of the lens image to exclude contributions from line-of-sight objects. With those two quantities, we then approximate the Poisson noise map of the lens as

$$ \sigma_{\mathrm{poisson},i} = \sqrt{\sigma_i^2 - \sigma_{\rm bkgr}^2} $$ (12)

in order to obtain the Poisson scaling factor map,

$$ \alpha_i = \sigma_{\mathrm{poisson},i}^2 / I_{\mathrm{lens},i}, $$ (13)

where Ilens is the lens image from HSC. As this scaling factor α should be constant, we approximate it as the median over the 10 × 10 pixel central region, because the lens intensity is highest there and the mapping is therefore more precise than in the outer parts. From the resulting arc noise map (Eq. (9)), we draw Gaussian variations representing the additional Poisson noise, which are added on top of the simulated images.
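The noise procedure of Eqs. (9)–(13) can be sketched in Python as follows; the 8-pixel corner size and the handling of zero-flux pixels are our assumptions, not specified in the text.

```python
import numpy as np

def add_arc_poisson_noise(mock, arcs, lens, variance, rng=None):
    """Approximate Poisson noise for the simulated arcs (Eqs. 9-13).
    mock: lens image plus arcs; arcs: lensed source only;
    lens: HSC lens image; variance: HSC variance map of the lens cutout."""
    rng = rng or np.random.default_rng()
    sigma = np.sqrt(np.clip(variance, 0.0, None))                    # Eq. (11)
    # Background rms: minimum over the four corners (8x8 pixels assumed)
    # to exclude contributions from line-of-sight objects.
    c = 8
    corners = [sigma[:c, :c], sigma[:c, -c:], sigma[-c:, :c], sigma[-c:, -c:]]
    sigma_bkgr = min(np.sqrt(np.mean(s**2)) for s in corners)
    sigma_poisson = np.sqrt(np.clip(sigma**2 - sigma_bkgr**2, 0.0, None))  # Eq. (12)
    alpha_map = sigma_poisson**2 / np.clip(lens, 1e-12, None)        # Eq. (13)
    # Constant alpha: median over the central 10x10 pixels, where the
    # lens is brightest and the mapping most precise.
    n = lens.shape[0] // 2
    alpha = np.median(alpha_map[n - 5:n + 5, n - 5:n + 5])
    sigma_arc = np.sqrt(alpha * np.clip(arcs, 0.0, None))            # Eqs. (9)-(10)
    return mock + rng.normal(size=mock.shape) * sigma_arc
```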

Additionally, we improve the lens centering and ellipticity estimation in the simulation code. For this, the code now accepts a mask of the lens, obtained for example with Source Extractor (Bertin & Arnouts 1996), resulting in a more accurate isolation of the lens light. As that mask might exclude parts of the lens due to overlapping line-of-sight objects, we additionally apply a circular mask with a radius of 20 pixels centered at the image center when determining the lens center. From the resulting masked image, we estimate the lens center through the first brightness moments and re-center the image on the pixel closest to that center. The remaining sub-pixel offset is propagated through the simulation as the lens light center. The axis ratio qll and position angle ϕll of the lens light are determined through the second brightness moments using the provided mask.
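A minimal sketch of the standard first- and second-moment estimates (our own illustration; the pipeline's exact weighting and masking may differ):

```python
import numpy as np

def light_moments(img, mask):
    """Lens center, axis ratio, and position angle from the first and
    second brightness moments of the masked lens image."""
    w = np.clip(img, 0.0, None) * mask          # nonnegative, masked weights
    y, x = np.indices(img.shape)
    tot = w.sum()
    xc, yc = (w * x).sum() / tot, (w * y).sum() / tot        # first moments
    qxx = (w * (x - xc)**2).sum() / tot                      # second moments
    qyy = (w * (y - yc)**2).sum() / tot
    qxy = (w * (x - xc) * (y - yc)).sum() / tot
    phi = 0.5 * np.arctan2(2.0 * qxy, qxx - qyy)             # position angle
    root = np.sqrt(0.25 * (qxx - qyy)**2 + qxy**2)
    q = np.sqrt((0.5 * (qxx + qyy) - root) / (0.5 * (qxx + qyy) + root))
    return (xc, yc), q, phi                                  # q = axis ratio b/a
```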

As we now re-center the lens cutout, we randomly shift the final mock image by up to ±3 pixels in the x and y directions, such that the lens light and mass centers, which differ from each other, do not coincide with the cutout center. This allows the network to learn the lens center instead of the cutout center and ensures that the network can handle images that are not perfectly centered on the lens mass.

Similar to S21a, we set some criteria on the brightness of the arcs to ensure their visibility. Specifically, we request that the brightest pixel of the arcs is above 5σbkgr in either the g or i band, choosing the band in which the source is brightest. In the same filter, we further require that this pixel is a factor of 1.5 brighter than the lens image at the same position, which helps to avoid drastic blending with the lens. To help systems pass these criteria, we allow a slight boost of the source brightness in up to six steps of −0.5 mag. This affects neither the shape of the arcs, nor the lens, nor the line-of-sight objects, but helps to obtain detectable lenses.
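In code, these visibility criteria could look as follows (a sketch; the variable names are ours, and the per-band background rms values are assumed to be precomputed):

```python
import numpy as np

def arcs_visible(arcs_g, arcs_i, lens_g, lens_i, bkgr_g, bkgr_i):
    """Check the arc visibility criteria in the band (g or i) in which
    the lensed source is brightest."""
    if arcs_g.max() >= arcs_i.max():
        arcs, lens, sigma_bkgr = arcs_g, lens_g, bkgr_g
    else:
        arcs, lens, sigma_bkgr = arcs_i, lens_i, bkgr_i
    k = np.unravel_index(np.argmax(arcs), arcs.shape)   # brightest arc pixel
    return arcs[k] > 5.0 * sigma_bkgr and arcs[k] > 1.5 * lens[k]

# If the check fails, the source flux is boosted in up to six steps of
# -0.5 mag, i.e., the arcs are rescaled by a factor of 10**0.2 per step.
```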

With this method, we generate around 165 000 mocks for training, validating, and testing the network, where we limit the Einstein radius to the range between 0.5″ and 5″ and additionally cap the number of systems at 3000 per 0.05″ bin to obtain a flat distribution at least up to θE ∼ 2″. As we use real measurements of the velocity dispersion and redshifts, lensing systems with an Einstein radius above ∼2.5″ are very rare, even when increasing the number of iterations for testing different lens–source alignments or lens–source pairs. As a result, the number of systems drops by two orders of magnitude towards θE ∼ 3″ compared to the plateau. Example images generated with our upgraded simulation pipeline are displayed in Fig. 2; the left panels show the final mock images and the right panels show the lensed sources alone, which were added to the HSC lens images. This figure demonstrates the variety of the mocks and how realistic they are.

3. Neural networks and their architecture

Convolutional neural networks are very powerful tools in image recognition tasks, especially if an autonomous and fast method is required to cope with huge numbers of images. This property has already led to many different applications in astrophysics (e.g., Paillassa et al. 2020; Tohill et al. 2021; Wu 2020; Schuldt et al. 2021b). As there are already thousands of known lens candidates in HSC (e.g., Wong et al. 2018; Sonnenfeld et al. 2019, 2020; Jaelani et al. 2020a,b, and in prep.; Cañameras et al. 2021; Shu et al. 2022), and we expect hundreds of thousands more from LSST and Euclid (Collett 2015), CNNs are perfectly suited for analyzing this amount of data in an acceptable amount of time.

While in S21a we used a CNN based on the LeNet (Lecun et al. 1998) architecture to predict the five SIE parameters, deeper networks help to capture the small features of the more complex mocks used in this work. Therefore, we now make use of residual neural networks (He et al. 2016), a specific type of CNN whose architecture includes so-called residual blocks, typically two convolutional (conv.) layers with a kernel size of 3 × 3 connected via a skip connection (or shortcut) for the back-propagation. Therefore, the back-propagating gradient does not vanish easily over the multiple convolutional layers, which allows the neurons and weights of the first layers to be properly updated even if the network architecture is very deep. A sketch of our network architecture is given in Fig. 3.

Fig. 3. Overview of our ResNet architecture. The input consists of the lens images in the four filters g, r, i, and z, which are passed through a series of convolutional (conv.) layers and fully connected (FC) layers. After a sigmoid layer that maps all output values to the range [0, 1], we split them into the median values η and uncertainties σ.

3.1. Error estimation and loss function

In analogy to S21a, we started with a network⁴ predicting one point estimate per parameter, which amounts to seven values: five for the SIE parameters plus two for the external shear. Here we again used a mean-square-error (MSE) loss function. We then modified the network such that it predicts two values per parameter, that is, 14 values in total. Providing not only a point estimate but rather a median-like value with a 1σ uncertainty makes the network much more powerful and helps to exclude unreliable parameter estimates. Of the 14 predicted values, we now interpret one value per parameter as the median (like the previous point estimate) and the other value as the standard deviation of a Gaussian function describing the error on that parameter for that specific image. Here, we follow Perreault Levasseur et al. (2017) and use a loss L given by

$$ L = \sum_{k=0}^{N} \sum_{l=0}^{p} \left[ -w_l \times P\!\left(\eta^{\rm pred}_{k,l}, \eta^{\rm tr}_{k,l}, \sigma_{k,l}\right) + \epsilon_l \times \log\left(\sigma_{k,l}^2\right) \right], $$ (14)

with a regularization term to minimize the errors and a log-probability term that is defined in PyTorch for a Gaussian distribution, as

$$ P\!\left(\eta^{\rm pred}_{k,l}, \eta^{\rm tr}_{k,l}, \sigma_{k,l}\right) = -\frac{\left(\eta_{k,l}^{\rm tr} - \eta_{k,l}^{\rm pred}\right)^2}{2\sigma_{k,l}^2} - \ln(\sigma_{k,l}) - \ln\left(\sqrt{2\pi}\right). $$ (15)

The loss for a given system is the sum over the p different parameters $\boldsymbol{\eta} = (x, y, e_x, e_y, \theta_{\rm E}, \gamma_1, \gamma_2)$ and corresponding errors $\boldsymbol{\sigma} = (\sigma_x, \sigma_y, \sigma_{e_x}, \sigma_{e_y}, \sigma_{\theta_{\rm E}}, \sigma_{\gamma_1}, \sigma_{\gamma_2})$ with index l ∈ [0, p]. For the total loss L, we additionally sum over each image k in the batch or sample of size N.
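In PyTorch, Eqs. (14)–(15) correspond to the following sketch, using the built-in Gaussian log-probability mentioned in the text; the tensor shapes are our assumptions.

```python
import torch

def loss_fn(eta_pred, sigma_pred, eta_true, w, eps):
    """Loss of Eqs. (14)-(15): eta_pred, sigma_pred, eta_true have shape
    (N, 7); w and eps hold the per-parameter weights and regularization
    constants with shape (7,)."""
    log_prob = torch.distributions.Normal(eta_pred, sigma_pred).log_prob(eta_true)
    return (-w * log_prob + eps * torch.log(sigma_pred**2)).sum()

# Final setup (Sects. 3.1 and 3.3): w_l = 5 for the Einstein radius,
# 1 otherwise; eps_l = 0.5 for all parameters.
w = torch.tensor([1.0, 1.0, 1.0, 1.0, 5.0, 1.0, 1.0])
eps = torch.full((7,), 0.5)
```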

In analogy to the weighting factors wl introduced in S21a to improve the prediction of the Einstein radius, we have introduced weighting factors in our new loss function, as well as a regularization constant ϵl that can differ for each parameter. With these factors, we can control the contribution of the different parameters to the loss and thus choose which ones to optimize more strongly. For our CNNs in S21a, we found it helpful to increase the contribution of the Einstein radius to the loss, as this is the key quantity of the SIE profile. Although it remains the key parameter in an SIE+γext setup, we also tested up-weighting the external shear, as these two parameters are the most problematic ones. However, we discarded this option from all further tests, as we found only a minor improvement on the external shear but a notable loss of performance on the other parameters.

In addition to the weighting factors, we modified the loss function by introducing the uncertainty prediction, such that σ also contributes to the optimization of the weights and neurons during the training process. Here, we tested different possible regularization terms, such as using the absolute value of σ instead of the squared term, or leaving out the additional regularization term completely. Moreover, we varied the regularization constant ϵ. As changing the loss function changes the loss value for a given network, we cannot compare the obtained loss values directly. Based on a more quantitative comparison, we found no notable difference in the performance on the median predictions η. Given that we use a scaling to match the expected 68.3% confidence intervals (CIs) instead of tuning dropout for this, we finally adopted the regularization function proposed by Perreault Levasseur et al. (2017), which uses a squared term and a regularization constant of 0.5.

For the predicted σ values to be interpretable as 1σ Gaussian widths, Perreault Levasseur et al. (2017) suggest tuning the network through the dropout rate (Hinton et al. 2012; Srivastava et al. 2014) such that the predicted errors match the expected CIs of 68.3% for $1\bar{\sigma}$, 95.4% for $2\bar{\sigma}$, and 99.7% for $3\bar{\sigma}$. Here, the bar indicates that the standard deviation is computed from the medians of a full sample, for example the test set, and does not correspond to the uncertainty σ predicted by the network for an individual lens system and parameter. This idea was also adopted by Pearson et al. (2021) for their CNNs. As there is only one dropout rate, both teams average over their three or four parameters when trying to match the expected percentiles, such that the individual errors still differ from the expected CIs (compare Fig. 2 of Perreault Levasseur et al. 2017 and Fig. 4 of Pearson et al. 2021). This would be even more difficult for us with seven parameters. Instead of using dropout with the same rate for both convolutional and fully connected (FC) layers, we tested the effect of dropout only for the FC layers, as the dropout rate typically differs significantly between convolutional and FC layers (Hinton et al. 2012; Srivastava et al. 2014). For these reasons, we finally do not use dropout for the error scaling and instead incorporate a direct scaling of each parameter uncertainty such that the uncertainties match the CIs on the test set as closely as possible.

3.2. Network architecture

In general, to find the best network architecture and set of hyper-parameters, defined as the combination with minimal mean validation loss, we carried out extensive tests, including the hyper-parameter search discussed in Sect. 3.3. For the architecture, we varied the number of residual blocks between 2 and 6, the number of FC layers between 1 and 3, the number of feature maps and strides in the convolutional layers, and the number of neurons in the different FC layers. We also tested different kernel sizes for the convolutional layers, but obtained the smallest average validation loss with the standard 3 × 3 kernel, which is understandable given the small size of our ground-based images.

As shown in Fig. 3, the final network architecture contains one convolutional layer with kernel size 3, followed by a batch normalization layer and three residual blocks of two convolutional layers each, again with kernel size 3. The three residual blocks have strides of two, one, and one, and 24, 32, and 64 feature maps, respectively. The first layer before the residual blocks has 16 feature maps, while the input has 4 channels corresponding to the four different filters. After pooling and flattening, the output of the convolutional sequence is passed through one FC layer connecting 1024 to 14 neurons.

During our tests on the network architecture, we also adjusted the pooling layer applied before flattening for the FC layers (compare Fig. 3). Here, we tried an average pooling layer, a max pooling layer, or no pooling at all. For the two pooling layers, we tested kernel sizes of 8 × 8, 4 × 4, and 2 × 2. We found the best performance using an average pooling layer with a kernel size of 8 × 8, as indicated in Fig. 3.
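Putting the quoted numbers together, the architecture can be sketched in PyTorch as below (our reconstruction from the description; details such as padding and the exact shortcut layout are assumptions). The shapes are self-consistent: the 64 × 64 inputs are halved by the first residual block and reduced to 4 × 4 by the 8 × 8 average pooling, giving 64 × 4 × 4 = 1024 neurons.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Two 3x3 conv layers with a skip connection; a 1x1 conv matches
    shapes when the channel count or stride changes."""
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1)
        self.bn1 = nn.BatchNorm2d(c_out)
        self.conv2 = nn.Conv2d(c_out, c_out, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(c_out)
        self.skip = (nn.Identity() if c_in == c_out and stride == 1
                     else nn.Conv2d(c_in, c_out, 1, stride=stride))

    def forward(self, x):
        out = self.bn2(self.conv2(torch.relu(self.bn1(self.conv1(x)))))
        return torch.relu(out + self.skip(x))

class LensModelNet(nn.Module):
    """Sketch of the adopted layout: 4 input filters (griz), a 3x3 conv
    layer with 16 feature maps, three residual blocks (24/32/64 maps,
    strides 2/1/1), 8x8 average pooling, and one FC layer 1024 -> 14."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
            ResBlock(16, 24, stride=2),   # 64x64 -> 32x32
            ResBlock(24, 32),
            ResBlock(32, 64),
            nn.AvgPool2d(8),              # 32x32 -> 4x4
        )
        self.fc = nn.Linear(64 * 4 * 4, 14)

    def forward(self, x):
        out = torch.sigmoid(self.fc(self.features(x).flatten(1)))  # in [0, 1]
        return out[:, :7], out[:, 7:]   # median values eta, uncertainties sigma
```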

3.3. Hyper-parameter search

In addition to testing different architecture layouts, the values of the hyper-parameters need to be optimized. Throughout these tests, we adopted a weight decay of 0.0005, a momentum of 0.9, and, after some short tests regarding the effect on the performance, a batch size of 32. Apart from those, the learning rate is one of the key quantities to test. Here, we typically tested learning rates $r_{\rm learn} \in \{0.01, 0.001, 0.0001, 0.00001, 0.000001\}$ when changing any other hyper-parameter or layer, covering a good range of plausible values. As the changes to the weights are expected to drop over the training, we also tested the option of a "decreasing learning rate", meaning that we divide the learning rate by a factor of two every twentieth epoch. Because the best epoch was typically below 100, we had only a few relevant decreasing steps during training, such that we finally adopted a constant learning rate for our further tests. In the presented network, we finally assumed a learning rate of $r_{\rm learn} = 0.000001$, a regularization constant ϵ of 0.5 for all parameters, and a weighting factor w of 5 for the Einstein radius in analogy to S21a, and otherwise 1.
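For concreteness, the adopted hyper-parameters translate to the following PyTorch configuration (a sketch; `model` and `train_set` are placeholders for the network above and the mock data set):

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=1e-6,
                            momentum=0.9, weight_decay=5e-4)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)
```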

Although the initialization is not a hyper-parameter that typically gets tuned, we tested the effect of a different network initialization for a few given setups. This demonstrates how independent the final, trained network is of the initial values. We find no preference for a specific seed, and the overall performance is unaffected, but each network gives a slightly different loss. These slight changes are due to the stochastic learning process and are therefore commonly observed. In a few instances, the best hyper-parameter value, for example the learning rate, changed with the seed, indicating the importance of optimizing the hyper-parameters for a given network.

To mitigate these changes, so-called ensemble learning methods can be used, where essentially the same network, that is, with fixed architecture and hyper-parameters, is trained with different initializations and the predictions are combined afterwards. We performed such tests for a few given setups, predicted the 14 values with each network, and compared their average to the ground truth on the test set. In our case, this does not help to eliminate outliers or decrease the scatter, which demonstrates the similarity of the networks regardless of the initialization.

3.4. Parameter normalization

Because different parameters cover different ranges, we include a scaling to map them consistently to the range [0, 1]. This ensures an equal contribution from the different parameters to the loss, resulting in a better optimization of all parameters. We assume the following input ranges [al, bl]: lens center x ∈ [ − 0.6″, 0.6″] and y ∈ [ − 0.6″, 0.6″], complex ellipticity ex ∈ [ − 1, 1] and ey ∈ [ − 1, 1], Einstein radius θE ∈ [0.5″, 5″] as already in the simulation procedure, and complex external shear γext, 1 ∈ [ − 0.1, 0.1] and γext, 2 ∈ [ − 0.1, 0.1]. This means the ground truth is scaled componentwise as

$$ \boldsymbol{\eta}^{\rm scaled,tr} = \frac{\boldsymbol{\eta}^{\rm tr} - \boldsymbol{a}}{\boldsymbol{b} - \boldsymbol{a}}, $$ (16)

and the output of the network is scaled back to the original ranges through

$$ \boldsymbol{\eta} = (\boldsymbol{b} - \boldsymbol{a})\, \boldsymbol{\eta}^{\rm scaled,pred} + \boldsymbol{a}, $$ (17)

and for the uncertainty,

$$ \boldsymbol{\sigma} = (\boldsymbol{b} - \boldsymbol{a})\, \boldsymbol{\sigma}^{\rm scaled}. $$ (18)

The uncertainties are not shifted by a, as those are considered with respect to the predicted median values η. To ensure that all predicted values are between 0 and 1, we include a sigmoid layer in the network architecture before splitting it up into seven values for the median η and seven values for the uncertainty σ. The network parameter optimization is performed with a ReLU (Nair & Hinton 2010) activation function and a stochastic gradient descent algorithm, adjusting the weights to minimize the loss.
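A minimal sketch of this componentwise scaling (Eqs. (16)–(18)), with the ranges quoted above:

```python
import numpy as np

# Ranges [a_l, b_l] for (x, y, e_x, e_y, theta_E, gamma_ext,1, gamma_ext,2);
# x, y, and theta_E in arcsec.
a = np.array([-0.6, -0.6, -1.0, -1.0, 0.5, -0.1, -0.1])
b = np.array([ 0.6,  0.6,  1.0,  1.0, 5.0,  0.1,  0.1])

def scale_truth(eta_tr):
    return (eta_tr - a) / (b - a)                      # Eq. (16)

def unscale_output(eta_scaled, sigma_scaled):
    # Eq. (17) for the medians; Eq. (18) for the (unshifted) uncertainties.
    return (b - a) * eta_scaled + a, (b - a) * sigma_scaled
```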

3.5. Cross-validation

To avoid an unbalanced optimization of the network, we follow S21a and use five-fold cross-validation, splitting our set of 165 374 mocks into roughly 56% training, 14% validation, and 30% testing data (with small rounding effects from the batch size), and train each run over 500 epochs. This also allows us to better determine the hyper-parameters (e.g., learning rate, number of neurons, or feature maps) and the final stopping epoch, which is defined through the minimal average validation loss.
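A sketch of this split (the exact fold assignment in the pipeline may differ):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
idx = rng.permutation(165_374)
n_test = int(0.3 * idx.size)
test_idx, rest = idx[:n_test], idx[n_test:]    # 30% held-out test set
folds = np.array_split(rest, 5)                # five folds of ~14% each
for k in range(5):
    val_idx = folds[k]                                       # 14% validation
    train_idx = np.concatenate([folds[j] for j in range(5) if j != k])  # 56%
    # ... train for 500 epochs and record the validation loss per epoch ...
```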

The lowest mean validation loss is −593.71, obtained in epoch 230, such that the final run is trained over 230 epochs. The loss curve reveals a relatively large generalization gap, meaning that after a certain number of epochs the network performs significantly better on the training data than on the validation set. We therefore tested the effect of dropout on the FC layers, despite their relatively small number. This helped to reduce the generalization gap, but also resulted in a higher average validation loss, which is the quantity used to select the best network. Therefore, we use no dropout for the final network; dropout was suggested by Perreault Levasseur et al. (2017) and also used by Pearson et al. (2021) to scale the uncertainties such that their distribution matches the expected Gaussian percentiles.

4. Network results and performance

In this section, we present the SIE+γext parameters and corresponding uncertainties σ predicted with our residual neural network. This network was trained, validated, and tested on 165 374 realistic mock images created with our upgraded simulation procedure described in Sect. 2. The optimized network architecture is shown in Fig. 3. In Fig. 4 we show a comparison between ground truth and predicted values on the test set, that is, images that the network has not seen before during training or in the cross-validation procedure. Specifically, for every parameter, we show the distribution of values as a histogram as well as a direct comparison between predicted and true values. Figure 4 also shows the median predicted values per bin (red line) and the $1\bar{\sigma}$ and $2\bar{\sigma}$ ranges (gray shaded). The bar indicates that these quantities are computed from the whole sample, to distinguish them from the specific uncertainties σ predicted by the network.

Fig. 4. Comparison between the ground truth and the predictions of our final network. In the left panel of each column we show histograms of the ground truth (red) and predicted values (blue). In the right panel we plot the predicted value (y-axis) against the true value, showing the median as a red line with the $1\bar{\sigma}$ and $2\bar{\sigma}$ ranges (gray shaded) inferred from the distribution of median values of the test set.

The network performs very well on the lens center, especially in contrast to the CNNs presented in S21a, which had difficulty in recovering the mass center. Here, we clearly see the improvement resulting from the re-centering of the lens galaxy, from taking the fractional pixel offset into account in the simulation (which was neglected in S21a), and from the random shift of the final mock image by up to ±3 pixels. The medians with $1\bar{\sigma}$ ranges for $x^{\rm pred} - x^{\rm tr}$ and $y^{\rm pred} - y^{\rm tr}$ are, respectively, $0.04^{+0.50}_{-0.40}$ and $0.04^{+0.46}_{-0.41}$ pixels. The complex ellipticity components ex and ey are also well recovered, with $0.02^{+0.12}_{-0.09}$ and $0.01^{+0.11}_{-0.10}$, respectively, although the predictions still have a tendency to underestimate the absolute values. In other words, the network predicts the galaxies to be rounder than they are. One reason for this is the much higher number of training systems with values around zero, towards which the network might be biased.

The Einstein radius θE is very well recovered, with a median and $1\bar{\sigma}$ value of $\theta_{\rm E}^{\rm pred} - \theta_{\rm E}^{\rm tr} = 0.003^{+0.21}_{-0.24}$. This can also be seen in Fig. 4, where the median line closely follows the 1:1 line between our lower limit of 0.5″ and around 1.5″, dropping for systems with very large image separations. The $1\bar{\sigma}$ and $2\bar{\sigma}$ ranges indicate very precise predictions between ∼1″ and ∼2″, while beyond this range the performance is not as good, partly due to the low number of such systems in the training set. The lower precision at the low end is most likely due to blending with the lens: because of the small image separations, the counter images are more strongly blended with the lenses, such that the network cannot sufficiently deblend the arcs from the lens, which is important for predicting the Einstein radius. This is confirmed by a test using the arcs alone for the network training (see Sect. 5.4 for details). At the other end of the range (i.e., high $\theta_{\rm E}^{\rm tr}$), the performance drops significantly due to the underrepresentation of such systems in the data set. As shown with CNNs in S21a, the distribution of the Einstein radii is crucial for the performance, and a uniformly distributed sample yields the best performance over the full considered range. As mentioned in Sect. 2, we therefore set a maximum number of mocks per Einstein radius bin. As we use real measurements of the velocity dispersion and redshifts, lensing systems with an Einstein radius above ∼2″ are very rare, such that the number of systems drops by more than one order of magnitude towards θE ∼ 3″ compared to the plateau at θE ≤ 1.5″, resulting in decreasing performance. Given the very low number of systems within each bin for θE > 3″, we do not show them in Fig. 4, but keep our limit of 5″ as the largest image separation accepted by our simulation pipeline. Even if the network shows a lower performance in this range, it has seen some of these systems and, because of our introduced scaling, it is in principle able to predict such large Einstein radii. As demonstrated in S21a, we can also train a dedicated network on a smaller sample, for example for systems with θE ≥ 2″, to achieve a better performance for systems with such large image separations.

As we see from Fig. 4, the network is so far not able to accurately predict the external shear components γext,1 and γext,2, although the means with $1\bar{\sigma}$ ranges for the whole test set are $0.002^{+0.04}_{-0.04}$ and $-0.001^{+0.04}_{-0.04}$, respectively. It tends to predict values closer to zero, resulting in a lower predicted shear strength than the true one. We tested many different possibilities to improve on the external shear, as described in Sect. 5, and found that blending with the lens is not the main reason for this issue. The current network apparently cannot sufficiently generalize to new systems based on these very minor distortions of the arcs. This might be because of the PSF varying from system to system, or because of the image resolution, given that shear prediction works relatively well with more idealized, lens-light-subtracted, high-resolution images (Morningstar et al. 2018). Further investigation beyond the tests summarized in Sect. 5 is therefore necessary for a better estimate of the external shear. Whether a precise estimate of the external shear is crucial depends on the science case behind the modeling; for statistical studies of the lenses, for instance regarding the lens mass, the external shear is expected to have only a negligible influence.

Figure 5 shows the differences between the predicted values and the ground truth for the seven parameters, as well as the correlations between them. We find no strong correlations, not even between typically degenerate quantities such as ellipticity and external shear. Comparing to Fig. 7 of S21a, where the plotting ranges of the SIE parameters are kept the same for ease of comparison, we find a generally better performance on the Einstein radius, but with the same kind of diamond-shaped $2\bar{\sigma}$ contour. On the other hand, the scatter on the lens mass center is slightly larger with the presented ResNet. This is most likely because of our newly introduced random ±3 pixel shift of the final mocks, which was implemented to ensure that the network predicts the lens mass center instead of the image center, but means that we cannot directly compare the performance on the lens center here. Instead, Fig. 4 reveals the much improved performance on the lens center of the ResNet compared to the CNNs in S21a (Figs. 8 and 10). Given the numerous changes between these networks, namely the new external shear parameters, the uncertainty prediction, the changed network architecture, and the addition of Poisson noise in the simulations, it is difficult to interpret the differences in performance. As the ResNet has the much more complex task of additionally predicting the external shear and the uncertainties of each parameter, which it does well overall, it is definitely the more powerful network.

Fig. 5. Histograms (bottom row) and 2D density plots of the differences between prediction and ground truth for our final ResNet applied to the test set.

Besides the median values, the network also predicts an uncertainty σ for each parameter. To interpret this as the width of a Gaussian distribution, it has to match the statistical expectations of $1\bar{\sigma}$ corresponding to a CI of 68.3%, $2\bar{\sigma}$ to 95.4%, and $3\bar{\sigma}$ to 99.7%. As explained in Sect. 3, we do not use dropout, as in Perreault Levasseur et al. (2017) and Pearson et al. (2021), and instead incorporate a direct scaling per parameter. As our predictions do not perfectly match a Gaussian distribution, we scale the predicted values for the different parameters, $\boldsymbol{\sigma} = (\sigma_x, \sigma_y, \sigma_{e_x}, \sigma_{e_y}, \sigma_{\theta_{\rm E}}, \sigma_{\gamma_{\rm ext,1}}, \sigma_{\gamma_{\rm ext,2}})$, by $\boldsymbol{s} = (1.25, 1.32, 1.19, 1.20, 1.08, 1.21, 1.21)$. This minimizes the quadratic sum of the differences for the three σ intervals of each parameter ηj, that is, it minimizes

$$ d_{1,j}^2 + d_{2,j}^2 + d_{3,j}^2, $$ (19)

with

$$ d_{1,j} = \left| \frac{100}{N} \times \sum_k^T u_{1,j,k} - 68.3 \right| $$ (20)

$$ d_{2,j} = \left| \frac{100}{N} \times \sum_k^T u_{2,j,k} - 95.4 \right| $$ (21)

$$ d_{3,j} = \left| \frac{100}{N} \times \sum_k^T u_{3,j,k} - 99.7 \right| $$ (22)

where T denotes the size of the test set over which we sum, and

$$ u_{1,j,k} = \begin{cases} 1 & \mathrm{if}\ \left( |\eta_{j,k}^{\rm tr} - \eta_{j,k}^{\rm pred}| - s\,\sigma_{j,k} \right) < 0 \\ 0 & \mathrm{otherwise} \end{cases} $$ (23)

$$ u_{2,j,k} = \begin{cases} 1 & \mathrm{if}\ \left( |\eta_{j,k}^{\rm tr} - \eta_{j,k}^{\rm pred}| - 2s\,\sigma_{j,k} \right) < 0 \\ 0 & \mathrm{otherwise} \end{cases} $$ (24)

$$ u_{3,j,k} = \begin{cases} 1 & \mathrm{if}\ \left( |\eta_{j,k}^{\rm tr} - \eta_{j,k}^{\rm pred}| - 3s\,\sigma_{j,k} \right) < 0 \\ 0 & \mathrm{otherwise.} \end{cases} $$ (25)

This is a good compromise between matching the commonly used 68.3% CI and the 95.4% and 99.7% CIs. The result is visualized in Fig. 6, where we show the coverage of the scaled uncertainty values for each parameter (gray bars) as absolute values (top) and as differences from the expectations (bottom). The top panel demonstrates the close match between the scaled uncertainties and the expected CI levels (blue dashed), especially for the mean over all seven parameters (red dotted), and can be directly compared to the results of Perreault Levasseur et al. (2017, Fig. 2) and Pearson et al. (2021, Fig. 4). The bottom panel highlights the small deviations between the achieved and expected CIs; it shows the good match of the 1σ values of all individual parameters achieved through the scaling, at the cost of visible deviations in the 2σ and 3σ lines. In particular, the distribution for the Einstein radius is sharper than a Gaussian.
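The coverage computation and the choice of the scale factor s (Eqs. (19)–(25)) can be sketched per parameter as follows; the grid search is our choice of implementation.

```python
import numpy as np

def coverage(eta_tr, eta_pred, sigma, s):
    """Percentage of systems with the truth inside 1, 2, and 3 scaled
    sigma of the predicted median (Eqs. 20-25), for one parameter."""
    dev = np.abs(eta_tr - eta_pred)
    return np.array([100.0 * np.mean(dev < n * s * sigma) for n in (1, 2, 3)])

def best_scale(eta_tr, eta_pred, sigma, grid=np.linspace(0.8, 1.6, 81)):
    """Scale factor s minimizing d1^2 + d2^2 + d3^2 (Eq. 19)."""
    target = np.array([68.3, 95.4, 99.7])
    cost = [np.sum((coverage(eta_tr, eta_pred, sigma, s) - target)**2)
            for s in grid]
    return grid[int(np.argmin(cost))]
```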

Fig. 6. Coverage of the predicted uncertainties, shown as absolute values (top) and relative to the expectations (blue dashed lines; bottom). The network-predicted uncertainties σ of the test set were scaled such that roughly 68.3%, 95.4%, and 99.7% of the true values ηtr are contained within the 1σ (dark gray bar), 2σ (medium gray bar), and 3σ (light gray bar) ranges of a Gaussian distribution centered on the predicted median values ηpred. With this scaling, we can indeed interpret the predicted uncertainty of each individual parameter as the width of a Gaussian distribution. We additionally show in red, for comparison, the means over all seven σ values, which are, respectively, 68.05%, 95.72%, and 99.46%.

A comparison of the performance with other modeling networks is difficult given the differences in assumptions. Hezaveh et al. (2017), who originally proposed this novel idea, presented a network predicting ex, ey, and θE of an SIE profile for HST-like images after lens-light subtraction to demonstrate the feasibility. Perreault Levasseur et al. (2017) further included uncertainty predictions and an external shear component. Similarly, Pearson et al. (2021) presented a network to predict ex, ey, and θE of an SIE profile for mock images with 0.1″ resolution in preparation for the Euclid space mission. These authors also included error estimations inspired by Perreault Levasseur et al. (2017) and explored the possibility of a hybrid code by combining their network with PYAUTOLENS (Nightingale et al. 2018), a fully automated modeling software that does not rely on machine learning, for further refinement of the parameter predictions. The differences in image resolution, number of filters, and quality of the training and test data make a comparison difficult. Moreover, the different numbers of predicted parameters complicate the comparison, given the degeneracies between the parameters of a given lensing system. The closest work in terms of image quality and number of filters was presented by Pearson et al. (2019), who considered CNNs to predict ex, ey, and θE of an SIE profile for Euclid, LSST r-band, and LSST gri-band data. The latter is the best match to our networks and is comparable in performance, as mentioned in S21a. There is also currently much effort in automated modeling without machine learning (e.g., Nightingale et al. 2018, 2021; Rojas et al. 2022; Savary et al. 2022; Ertl et al. 2023; Etherington et al. 2022; Gu et al. 2022; Schmidt et al. 2023), which typically performs better than neural networks but requires significantly longer run times of hours to days. We refer to Schuldt et al. (2022) for a direct comparison between the network presented here and traditionally obtained models of real HSC lenses.

5. Additional tests on the network

In this section, we summarize additional tests carried out mostly to address the remaining difficulty in the prediction of the external shear.

5.1. Tests on the network architecture

Beyond our general tests on the network architecture described in Sect. 3, we further tested the possibility of splitting the network into multiple branches after flattening and before the FC layers, as sketched in Fig. 7. Each branch consists of n FC layers (possibly just one) and predicts only specific parameters, which allows us to optimize the weights of each branch for the parameters of that branch. The input of the first FC layer in each branch is the full flattened data cube obtained after pooling, which is the same for each branch. We considered the three versions visualized in Fig. 7, although others are possible. In version 1, we split the network into two branches of seven values each: one branch for the median values η and one for the uncertainties σ. In version 2, we split the network into seven branches, each predicting the median and uncertainty of one parameter. The third option includes 14 branches, each predicting just one value. As we found no improvement through these tests, we did not further explore the splitting into multiple branches. This might be because the parameters share information and are degenerate, such that a single branch is helpful, especially for the uncertainty prediction. However, even though the multiple branches did not improve the performance compared to an architecture with a single branch of FC layers, they could be helpful when tuning the errors through the dropout rate as suggested by Perreault Levasseur et al. (2017), as one can adopt a specific dropout rate for each branch, allowing an individual tuning for each parameter.

Fig. 7. Sketch of the tested multibranch network architectures. We split the FC layers into either 2, 7, or 14 branches after the convolutional layers and flattening (compare Fig. 3).

5.2. Over-fitting tests

To test whether a particular network architecture is promising, we performed so-called over-fitting tests, which means that we trained and evaluated the network on a very small sample with around 1000 mock images. This shows whether the network is able to “memorize” the task perfectly, including predicting the external shear. As these were only short tests, we performed no cross-validation. Given our difficulties with the external shear, it is important to remark that our network is indeed able to learn the external shear for that very small sample. This shows that the network is in general able to extract features of the images and to connect them to all seven parameters, albeit imperfectly. On the shear in particular we see some scatter, which indicates that the network does not just remember the exact images and output the stored values. This demonstrates that the baseline network architecture is likely not the main reason for the failure in the external shear prediction. However, these tests have no implications as to whether the network performs well on new, unseen data from the test set. We further performed such over-fitting tests with networks that just predict the external shear. This helped us to significantly improve on the training data.

5.3. Test with fixed lens–source pairs

As our over-fitting tests presented in Sect. 5.2 only show that the network is able to predict the external shear well for images in a small training set, we further tested whether it can predict the shear on new images if we simplify the task. For this, we considered three different stages of simplification and created samples of 1000 mocks each. First, we always used the same lens, but different background sources from the HUDF, placed randomly behind the lens. In the second scenario, we kept the same lens light and source light distributions as well as the same redshifts, but varied the position of the background source with respect to the lens and the mass-to-light offset of the lens, as in our general training data set. In the third option, we kept everything fixed, including the source position, and varied only the external shear. This means that, in the third option, the arcs would always look identical in the absence of external shear, such that any distortions arise solely from the shear. As the lens did not vary in these tests, we excluded the SIE parameters from the prediction and trained the networks to predict only the external shear.

In the first and second scenarios, the network is able to predict the external shear better but not perfectly. In the third scenario, the network yields a nearly perfect prediction of the external shear on the test data (Fig. 8). This demonstrates the ability of the network to transfer the shear extraction to completely new images.

Fig. 8. Comparison of the ground truth and predictions of the external shear on the test set for a network trained on 1000 mocks, each with the same lens and source pair. Under this extreme simplification, the network is able to predict the external shear perfectly, even on the new images of the test set.

As the lens and background source are always the same in the last scenario, we can exclude the possibility that the network connects other features of the image with the external shear parameters. On the other hand, we also exclude possible degeneracies between different parameters, such as the ellipticity and the external shear, that might explain the difficulties with the external shear. One reason for the good performance in these tests might be the matching PSF for all systems, given that we are using the same lens. Normally, different lenses have slightly different PSFs, meaning that the arcs appear different after the convolution. These variations can introduce difficulties for the network, as it does not know the PSF of a given lens system. Particularly for the external shear, detecting small distortions on the arcs is necessary, which can be highly influenced by different PSF shapes. However, passing the PSFs together with the images to the network did not help (see Sect. 5.4).

5.4. Variations of the input data

As we achieved better performance in our lens search projects by applying a square-root stretching (Cañameras et al. 2021, in prep.; Shu et al. 2022), we also tested this for our modeling network. Here, we no longer passed the images themselves to the network, but rather the square root of the images after setting all negative background pixels to zero. This enhances the faint arcs compared to the brighter lens in the center, but also changes the flux ratios between pixels on the arcs in any given filter, as well as the ratios between the different filters where the color information is encoded. In the end, we found no improvement for the modeling network.

Another test of this kind was to subsample the images by linear interpolation to increase the number of pixels, as our 64 × 64 pixel images are very small compared to images from, for example, ImageNet. Even though no information is added in this process, the hope was that a higher pixel count might help the network to pick up small features and would allow for deeper networks with larger kernel sizes or strides in the convolutional layers. We tested subsampling factors of 2, 3, and 4, but the lowest mean validation loss was obtained without subsampling.

Given our difficulty in recovering the external shear, we tried to help the network with additional information. First, we provided the full-width at half-maximum (FWHM) values of the point spread function (PSF) frames in addition to the normal set of images. These values were appended to the flattened output of the convolutional layers and processed through the FC layers. As this led to no improvement, we considered networks accepting eight frames instead of four, adding the PSF images directly as input. To this end, we subsampled the PSF with a linear interpolation to 64 × 64 pixels, matching the size of the images, and passed them independently through a mirrored branch of convolutional layers, combining the intermediate outputs directly before the FC layers, as suggested by Maresca et al. (2021) and Li et al. (2022). Again, no improvement was found, possibly because such networks, with their relatively small kernel sizes, perform well in pattern recognition but not as well in analyzing very similar and completely smooth images like a PSF. Another possible reason is that the PSFs are likely imperfect because of the complex stacking procedure of ∼70 candidate stars per CCD (Bertin 2011; Aihara et al. 2018). These small errors are estimated to reach at most the 1% level (Aihara et al. 2018, 2019), but are possibly relevant for the relatively small effects of the external shear on the arcs.

To further investigate the external shear, we trained networks on images containing only the arcs. This has proven helpful in other modeling networks (e.g., Hezaveh et al. 2017; Perreault Levasseur et al. 2017; Morningstar et al. 2018, 2019; Pearson et al. 2021). By removing the lens light, the remaining information from the arcs becomes more easily accessible. We assumed perfect lens-light subtraction, which is not achievable in reality, but is the best-case scenario when assessing whether or not the network can then better predict the external shear, which is encoded only in the arcs. As expected, the new network predicts the Einstein radius with nearly no scatter over the full range, and it also performs well on the lens center. The good performance on the systems with small image separations confirms that the lower performance of our normal network in that regime is due to blending issues with the lens. In this test, we notably lose performance on the lens mass ellipticity, which is to some extent expected, as the network retrieves this information partially from the lens light, which differs slightly from the lens mass distribution. However, even in this case, the network cannot accurately predict the external shear, which suggests that this information is very hard to access from ground-based imaging and that a generalization over the whole sample is currently not possible.

To answer the question of the minimum number of filters required, we trained networks on fewer filters: either only the g band, the g and r bands, or the g and i bands. Although we performed no real optimization of the network architecture and hyper-parameters, we found notably poorer performance with one or two filters than with four. However, even with just one filter, the network is able to predict the SIE parameters reasonably well, with the greatest difficulty being the ellipticity. This confirms that it is possible to train CNNs or ResNets on single-band images, for example from Euclid, where the much better resolution will compensate for the missing color information to some extent. As expected, the external shear is not predictable at all with just one filter of HSC image quality, at least within our few tests in this direction.

6. Image positions and time-delay comparison

Given the extremely low computational time needed by the trained network to predict the SIE+γext parameters, the network would be ideal for predicting the mass model of a lensing system with a lensed transient. This mass model could then be used to predict, based on the first detection, the next appearing images with the corresponding time delays, and therefore help to plan follow-up observations. To test the precision and accuracy of the network, we performed a test similar to that in S21a on our test set. To this end, we computed the image positions $(\theta_x^{\mathrm{tr}}, \theta_y^{\mathrm{tr}})$ and time delays $\Delta t^{\mathrm{tr}}$ of the background source center given the true mass model from our simulation pipeline. Assuming that the first appearing true image position $(\theta_{x,\mathrm{A}}^{\mathrm{tr}}, \theta_{y,\mathrm{A}}^{\mathrm{tr}})$ is the first observed image and is therefore coincident with $(\theta_{x,\mathrm{A}}^{\mathrm{pred}}, \theta_{y,\mathrm{A}}^{\mathrm{pred}})$, we computed the source position using the mass model predicted by the network. From this predicted source position, we then obtained all other image positions $(\theta_x^{\mathrm{pred}}, \theta_y^{\mathrm{pred}})$ and the corresponding time delays $\Delta t^{\mathrm{pred}}$. We finally compared the true and predicted values and directly computed the differences for lens systems with a matching number of images.
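The delay computation follows the standard Fermat-potential formalism. Below is a minimal sketch assuming the deflection angle alpha and lens potential psi of the SIE+γext model are evaluated elsewhere (in our case by GLEE), with all angles in arcseconds and the time-delay distance D_dt in Mpc; the numerical solution of the lens equation for the remaining image positions is likewise assumed to be available and is omitted here:

```python
import numpy as np
from astropy import units as u
from astropy.constants import c

def source_position(theta_A, alpha):
    """Ray-trace the first observed image back to the source plane:
    beta = theta_A - alpha(theta_A)."""
    return np.asarray(theta_A) - alpha(theta_A)

def time_delays(thetas, beta, psi, D_dt):
    """Relative time delays (in days) of the multiple images `thetas`,
    via the Fermat potential tau = |theta - beta|^2 / 2 - psi(theta)."""
    taus = np.array([0.5 * np.sum((np.asarray(th) - beta) ** 2) - psi(th)
                     for th in thetas])              # arcsec^2
    taus *= (np.pi / 180.0 / 3600.0) ** 2            # arcsec^2 -> rad^2
    prefac = (D_dt * u.Mpc / c).to(u.day).value      # D_dt / c in days
    dt = prefac * taus
    return dt - dt.min()                             # relative to first image
```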

To visualize the results, we show the differences of the x- and y-coordinates of the multiple image positions in Fig. 9 as a function of the Einstein radius. We find essentially no bias and a scatter similar to that in S21a, which increases with the Einstein radius, as expected from Fig. 4. The small bias and fluctuations in the last bins (θE ≥ 2″) are due to low-number statistics (compare also the histogram of the Einstein radius distribution in Fig. 4).

Fig. 9.

Differences between the true and predicted x-coordinates (top) and y-coordinates (bottom) of the image positions as a function of the predicted Einstein radii.

Figure 10 shows the time-delay differences as a histogram (left panel) and as a function of the predicted Einstein radius (right panel). We find a small bias in Δt, and the scatter again increases with the Einstein radius, as expected given the performance on the image positions. As we train our ResNet on a sample with equally distributed Einstein radii up to ∼1.5″, we can compare the bias and scatter to those of the similarly constructed CNN of S21a (see Sect. 4.3 in S21a). These networks show an overall comparable bias towards longer time delays, from which we can infer that the external shear has only a minor effect on the time-delay predictions and that the bias mainly originates from the Einstein radius offsets. While the ResNet equally under- and overpredicts many time delays with a typical scatter of $\bar{\sigma} \sim 20$ days, the CNN has a slight tendency to overpredict. To correct the ResNet predictions for the bias with the Einstein radius, we fit a linear function of the form

$$ a \times \theta_{\mathrm{E}}^{\mathrm{pred}} + b \qquad (26) $$

Fig. 10.

Difference between the predicted and true time delays as a histogram (left) and as a function of the predicted Einstein radii (right). Given the bias with the Einstein radius, even if relatively small, we fit a linear function $a \times \theta_{\mathrm{E}}^{\mathrm{pred}} + b$ in the range [0.5″, 2″], as shown in blue.

to the individual differences, restricting the fitting range to [0.5″, 2″]. We obtain a = 0.909 and b = 1.928. The corrected distribution is shown in the left panel of Fig. 11, together with the corrected time-delay difference Δtpred − Δttr as a function of the difference in Einstein radius (right panel). For systems with $\theta_{\mathrm{E}}^{\mathrm{pred}} \lesssim 2''$, we can control the bias to within ∼3 days, although the scatter remains large, limiting the usefulness of the network for predicting precise time delays. This performance is clearly not as good as that of traditional MCMC sampling on ground-based images with an SIE+γext mass distribution, but given the extreme difference in computational time, it may still serve as a reasonable first estimate for systems with θE ≤ 2″.
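The correction itself is a simple one-dimensional least-squares fit. A minimal sketch, with hypothetical array names for the predicted Einstein radii and time delays, could be:

```python
import numpy as np

def delay_bias_correction(theta_E_pred, dt_pred, dt_true, lo=0.5, hi=2.0):
    """Fit dt_pred - dt_true = a * theta_E_pred + b over the Einstein-radius
    range [lo, hi] arcsec and return the bias-corrected predicted delays
    together with the fitted coefficients (a, b)."""
    diff = dt_pred - dt_true
    sel = (theta_E_pred >= lo) & (theta_E_pred <= hi)
    a, b = np.polyfit(theta_E_pred[sel], diff[sel], deg=1)
    return dt_pred - (a * theta_E_pred + b), (a, b)
```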

Fig. 11.

Difference between the predicted and true time delays after correction of the linear bias, as a function of the predicted Einstein radius (left panel) and of the difference between the predicted and true Einstein radius (right panel).

7. Summary and conclusion

S21a demonstrated the possibility of modeling ground-based images of strongly lensed galaxy-scale systems with a relatively simple CNN inspired by the LeNet architecture. Building upon these results, we developed a way of modeling such systems that includes the external shear component in addition to the SIE mass distribution of the lens. Moreover, we now predict a 1σ uncertainty for each parameter and lensing system. To this end, we make use of a residual neural network, a specific type of CNN that includes so-called residual blocks with skip connections. A diagram of our final network architecture is shown in Fig. 3. Because of the included error prediction, we changed the loss function from an MSE loss to a log-probability function with a regression term, inspired by Perreault Levasseur et al. (2017). To ensure that all parameters contribute equally to the loss, we scaled each parameter to the range [0, 1] and therefore included the sigmoid function as the last layer.
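For concreteness, a loss of this family can be written as a Gaussian negative log-likelihood over the scaled parameters. The sketch below illustrates the general form under our assumptions about the parametrization (predicting log σ² for numerical stability); it is not a verbatim copy of our training code:

```python
import torch

def log_prob_loss(eta_pred, log_var_pred, eta_true):
    """Negative Gaussian log-likelihood per parameter, in the spirit of
    Perreault Levasseur et al. (2017). All eta values are scaled to
    [0, 1] by the sigmoid output layer; predicting log(sigma^2)
    keeps the variance positive."""
    inv_var = torch.exp(-log_var_pred)
    return torch.mean(0.5 * inv_var * (eta_pred - eta_true) ** 2
                      + 0.5 * log_var_pred)
```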

The network was trained on mock images created from real observed data by simulating only the lensing effect, resulting in realistic mocks, as shown in Fig. 2. We used images of LRGs observed with HSC as lenses, together with velocity dispersion and redshift measurements from SDSS. The background sources were images from the HUDF with known redshifts.

The training data were created with a flat distribution in the Einstein radius between 0.5″ and ∼2″ and a flat distribution in the external shear strength γext between 0 and 0.1. Moreover, we included Poisson noise on the arcs, which we approximated using the variance map provided by HSC, and improved the lens center and ellipticity estimation using a dedicated mask for each individual lens. To make sure that the network predicts the lens mass center, we applied random shifts to the final mock images of up to three pixels in both the x and y directions.
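The noise and shift augmentation can be sketched as follows, with a Gaussian draw from the HSC variance map standing in for the approximated Poisson noise (negative variances reset to zero, see footnote 3), and with np.roll as a simplified stand-in for the actual shift implementation, which would crop or pad rather than wrap around:

```python
import numpy as np

def augment(mock, variance_map, max_shift=3, rng=None):
    """Sketch of the two augmentation steps described above: add noise on
    the arcs from the (clipped) variance map and shift the image by up to
    `max_shift` pixels in x and y so the lens center is not always at the
    image center. `mock` and `variance_map` have shape (4, 64, 64)."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = mock + rng.normal(0.0, np.sqrt(np.maximum(variance_map, 0.0)))
    dx, dy = rng.integers(-max_shift, max_shift + 1, size=2)
    return np.roll(noisy, shift=(dy, dx), axis=(-2, -1))
```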

With this procedure, we created ∼165 000 mock systems in four filters, which we then used to train our neural network. Through extensive tests, we found a network that is able to accurately predict the SIE parameters of the lens mass profile (Figs. 4 and 5). The external shear, however, is very hard to predict accurately. We carried out many tests on the external shear, such as up-weighting its loss contribution, predicting only the external shear, adding further information such as the FWHM values or PSF images, or applying subsampling. Only when training on a very small sample with always the same lens and source pair, as shown in Fig. 8, is the network also able to predict the external shear well for new images in the test set. This demonstrates that the network indeed obtains its external shear prediction from the arcs, which differ between these mocks only because of the variable external shear. Therefore, it seems that the network is in general able to extract the external shear information, but cannot generalize well to other systems, probably as a result of a combination of the complexity of the lensing system, the image resolution, inaccurate masking of the sources affecting the mocks, the PSF being unknown to the network, correlations between the external shear and other parameters, and/or other reasons. This is supported by the fact that the external shear can be predicted relatively well by CNNs trained on more idealized, lens-light-subtracted, and high-resolution images (Morningstar et al. 2018).

In analogy to S21a, we used the predicted SIE+γext mass parameters to predict the next appearing image(s) and the corresponding time delay(s) given the first appearing image of the true, simulated mass model. Although we observed stronger discrepancies in the external shear, the scatter in the image positions and time delays is comparable to that obtained with our CNNs of S21a. We find a bias in the predicted time delays as a function of the predicted Einstein radius, which can be compensated for up to $\theta_{\mathrm{E}}^{\mathrm{pred}} \sim 2''$ by applying a linear correction function. Through a comparison with the performance of the CNN presented in S21a, we see that the external shear has only a minor effect on the time-delay prediction.

The greatest strengths of our network are certainly its low computational cost and its high degree of automation. Once trained, it predicts the mass parameters in fractions of a second, while state-of-the-art methods such as GLEE and GLAD require at least days and substantial user input for the same task. With this network, we are able to predict the SIE+γext values with uncertainties for all known HSC lenses or lens candidates – which already number a few thousand – within minutes. Given the good match of the HSC images to the expected quality of LSST, the performance of our network is expected to hold for LSST as well. Nonetheless, we propose to generate dedicated mocks and train a separate network as soon as data from the first LSST data release are available, in order to avoid a possible degradation in performance due to slightly different image characteristics.


1. The simulation pipeline will be made available upon request, as well as the needed nonpublicly available software GLEE (Suyu & Halkola 2010; Suyu et al. 2012).

2. The SIE mass profile introduced by Barkana (1998) allows for an additional core radius, which we set to $10^{-4}$, yielding effectively a singular mass distribution without numerical issues at the lens center.

3. In the very rare case of negative values in the variance map $v_i$ provided by HSC, we reset them to zero, i.e., $v_i^+ = \max\{v_i, 0\}$.

4. The code used for the network training is based on a nonpublicly available code and will be made available upon request.

Acknowledgments

We thank D. Sluse and the anonymous referee for helpful comments. S.S., S.H.S., and R.C. thank the Max Planck Society for support through the Max Planck Research Group for SHS. This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (LENSNOVA: grant agreement No. 771776). This research is supported in part by the Excellence Cluster ORIGINS, which is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy – EXC-2094 – 390783311. Y.S. acknowledges support from the Alexander von Humboldt Foundation in the framework of the Max Planck-Humboldt Research Award endowed by the Federal Ministry of Education and Research. Based on observations made with the NASA/ESA Hubble Space Telescope, obtained from the data archive at the Space Telescope Science Institute. STScI is operated by the Association of Universities for Research in Astronomy, Inc. under NASA contract NAS 5-26555. The Hyper Suprime-Cam (HSC) collaboration includes the astronomical communities of Japan and Taiwan, and Princeton University. The HSC instrumentation and software were developed by the National Astronomical Observatory of Japan (NAOJ), the Kavli Institute for the Physics and Mathematics of the Universe (Kavli IPMU), the University of Tokyo, the High Energy Accelerator Research Organization (KEK), the Academia Sinica Institute for Astronomy and Astrophysics in Taiwan (ASIAA), and Princeton University. Funding was contributed by the FIRST program from the Japanese Cabinet Office, the Ministry of Education, Culture, Sports, Science and Technology (MEXT), the Japan Society for the Promotion of Science (JSPS), the Japan Science and Technology Agency (JST), the Toray Science Foundation, NAOJ, Kavli IPMU, KEK, ASIAA, and Princeton University. This paper makes use of software developed for the Rubin Observatory Legacy Survey of Space and Time (LSST). We thank the LSST Project for making their code available as free software at http://dm.lsst.org. This paper is based in part on data collected at the Subaru Telescope and retrieved from the HSC data archive system, which is operated by the Subaru Telescope and the Astronomy Data Center (ADC) at the National Astronomical Observatory of Japan. Data analysis was in part carried out with the cooperation of the Center for Computational Astrophysics (CfCA), National Astronomical Observatory of Japan. Funding for the Sloan Digital Sky Survey IV has been provided by the Alfred P. Sloan Foundation, the US Department of Energy Office of Science, and the Participating Institutions. SDSS-IV acknowledges support and resources from the Center for High-Performance Computing at the University of Utah. The SDSS website is www.sdss.org. 
SDSS-IV is managed by the Astrophysical Research Consortium for the Participating Institutions of the SDSS Collaboration including the Brazilian Participation Group, the Carnegie Institution for Science, Carnegie Mellon University, the Chilean Participation Group, the French Participation Group, Harvard-Smithsonian Center for Astrophysics, Instituto de Astrofísica de Canarias, The Johns Hopkins University, Kavli Institute for the Physics and Mathematics of the Universe (IPMU)/University of Tokyo, the Korean Participation Group, Lawrence Berkeley National Laboratory, Leibniz Institut für Astrophysik Potsdam (AIP), Max-Planck-Institut für Astronomie (MPIA Heidelberg), Max-Planck-Institut für Astrophysik (MPA Garching), Max-Planck-Institut für Extraterrestrische Physik (MPE), National Astronomical Observatories of China, New Mexico State University, New York University, University of Notre Dame, Observatório Nacional/MCTI, The Ohio State University, Pennsylvania State University, Shanghai Astronomical Observatory, United Kingdom Participation Group, Universidad Nacional Autónoma de México, University of Arizona, University of Colorado Boulder, University of Oxford, University of Portsmouth, University of Utah, University of Virginia, University of Washington, University of Wisconsin, Vanderbilt University, and Yale University. Software citations: This work uses the following software packages: https://github.com/astropy/astropy (Astropy Collaboration 2013, 2018), https://github.com/matplotlib/matplotlib (Hunter 2007), https://github.com/numpy/numpy (van der Walt et al. 2011; Harris et al. 2020), https://www.python.org/ (Van Rossum & Drake 2009), https://github.com/scipy/scipy (Virtanen et al. 2020), https://pytorch.org (Paszke et al. 2019).

References

Abolfathi, B., Aguado, D. S., Aguilar, G., et al. 2018, ApJS, 235, 42
Aihara, H., Arimoto, N., Armstrong, R., et al. 2018, PASJ, 70, S4
Aihara, H., AlSayyad, Y., Ando, M., et al. 2019, PASJ, 71, 114
Astropy Collaboration (Robitaille, T. P., et al.) 2013, A&A, 558, A33
Astropy Collaboration (Price-Whelan, A. M., et al.) 2018, AJ, 156, 123
Baes, M., & Camps, P. 2021, MNRAS, 503, 2955
Barkana, R. 1998, ApJ, 502, 531
Barnabè, M., Czoske, O., Koopmans, L. V. E., Treu, T., & Bolton, A. S. 2011, MNRAS, 415, 2215
Barnabè, M., Dutton, A. A., Marshall, P. J., et al. 2012, MNRAS, 423, 1073
Baron, D., & Poznanski, D. 2017, MNRAS, 465, 4530
Basak, S., Ganguly, A., Haris, K., et al. 2022, ApJ, 926, L28
Beckwith, S. V. W., Stiavelli, M., Koekemoer, A. M., et al. 2006, AJ, 132, 1729
Bertin, E. 2011, in Astronomical Data Analysis Software and Systems XX, eds. I. N. Evans, A. Accomazzi, D. J. Mink, & A. H. Rots, ASP Conf. Ser., 442, 435
Bertin, E., & Arnouts, S. 1996, A&AS, 117, 393
Bolton, A. S., Burles, S., Koopmans, L. V. E., Treu, T., & Moustakas, L. A. 2006, ApJ, 638, 703
Bom, C. R., Makler, M., Albuquerque, M. P., & Brandt, C. H. 2017, A&A, 597, A135
Bonvin, V., Courbin, F., Suyu, S. H., et al. 2017, MNRAS, 465, 4914
Brownstein, J. R., Bolton, A. S., Schlegel, D. J., et al. 2012, ApJ, 744, 41
Cañameras, R., Schuldt, S., Suyu, S. H., et al. 2020, A&A, 644, A163
Cañameras, R., Schuldt, S., Shu, Y., et al. 2021, A&A, 653, L6
Cabanac, R. A., Alard, C., Dantel-Fort, M., et al. 2007, A&A, 461, 813
Chan, J. H. H., Suyu, S. H., Chiueh, T., et al. 2015, ApJ, 807, 138
Chan, J. H. H., Suyu, S. H., Sonnenfeld, A., et al. 2020, A&A, 636, A87
Chao, D. C. Y., Chan, J. H. H., Suyu, S. H., et al. 2020, A&A, 640, A88
Chao, D. C. Y., Chan, J. H. H., Suyu, S. H., et al. 2021, A&A, 655, A114
Chen, G. C. F., Fassnacht, C. D., Suyu, S. H., et al. 2019, MNRAS, 490, 1743
Chirivì, G., Yıldırım, A., Suyu, S. H., & Halkola, A. 2020, A&A, 643, A135
Collett, T. E. 2015, ApJ, 811, 20
Cornachione, M. A., Bolton, A. S., Shu, Y., et al. 2018, ApJ, 853, 148
Ducourant, C., Krone-Martins, A., Delchambre, L., et al. 2019, in SF2A-2019: Proceedings of the Annual meeting of the French Society of Astronomy and Astrophysics, eds. P. Di Matteo, O. Creevey, A. Crida, et al., 179
Dye, S., & Warren, S. J. 2005, ApJ, 623, 31
Dye, S., Furlanetto, C., Dunne, L., et al. 2018, MNRAS, 476, 4383
Ertl, S., Schuldt, S., Suyu, S. H., et al. 2023, https://doi.org/10.1051/0004-6361/202244909
Etherington, A., Nightingale, J. W., Massey, R., et al. 2022, MNRAS, 517, 3275
Faure, C., Anguita, T., Alloin, D., et al. 2011, A&A, 529, A72
Fowlie, A., Handley, W., & Su, L. 2020, MNRAS, 497, 5256
Gavazzi, R., Marshall, P. J., Treu, T., & Sonnenfeld, A. 2014, ApJ, 785, 144
Gilman, D., Bovy, J., Treu, T., et al. 2021, MNRAS, 507, 2432
Goobar, A., Amanullah, R., Kulkarni, S. R., et al. 2017, Science, 356, 291
Gu, A., Huang, X., Sheu, W., et al. 2022, ApJ, 935, 49
Harris, C. R., Millman, K. J., van der Walt, S. J., et al. 2020, Nature, 585, 357
Hashim, N., De Laurentis, M., Zainal Abidin, Z., & Salucci, P. 2014, ArXiv e-prints [arXiv:1407.0379]
He, K., Zhang, X., Ren, S., & Sun, J. 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770
Hezaveh, Y. D., Perreault Levasseur, L., & Marshall, P. J. 2017, Nature, 548, 555
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. R. 2012, ArXiv e-prints [arXiv:1207.0580]
Huang, X., Storfer, C., Ravi, V., et al. 2020, ApJ, 894, 78
Huber, S., Suyu, S. H., Ghoshdastidar, D., et al. 2022, A&A, 658, A157
Hunter, J. D. 2007, Comput. Sci. Eng., 9, 90
Inami, H., Bacon, R., Brinchmann, J., et al. 2017, A&A, 608, A2
Ivezic, Z., Axelrod, T., Brandt, W. N., et al. 2008, Serb. Astron. J., 176, 1
Jacobs, C., Glazebrook, K., Collett, T., More, A., & McCarthy, C. 2017, MNRAS, 471, 167
Jacobs, C., Collett, T., Glazebrook, K., et al. 2019, ApJS, 243, 17
Jaelani, A. T., More, A., Oguri, M., et al. 2020a, MNRAS, 495, 1291
Jaelani, A. T., More, A., Sonnenfeld, A., et al. 2020b, MNRAS, 494, 3156
Jullo, E., Kneib, J. P., Limousin, M., et al. 2007, New J. Phys., 9, 447
Kelly, P. L., Rodney, S. A., Treu, T., et al. 2015, Science, 347, 1123
Khramtsov, V., Sergeyev, A., Spiniello, C., et al. 2019, A&A, 632, A56
Lanusse, F., Ma, Q., Li, N., et al. 2018, MNRAS, 473, 3895
Laureijs, R., Amiaux, J., Arduini, S., et al. 2011, ArXiv e-prints [arXiv:1110.3193]
Lecun, Y., Bottou, L., Bengio, Y., & Haffner, P. 1998, Proc. IEEE, 86, 2278
Lemon, C. A., Auger, M. W., McMahon, R. G., & Koposov, S. E. 2017, MNRAS, 472, 5023
Lemon, C. A., Auger, M. W., McMahon, R. G., & Ostrovski, F. 2018, MNRAS, 479, 5060
Lemon, C. A., Auger, M. W., & McMahon, R. G. 2019, MNRAS, 483, 4242
Li, R., Napolitano, N. R., Tortora, C., et al. 2020, ApJ, 899, 30
Li, R., Napolitano, N. R., Spiniello, C., et al. 2021, ApJ, 923, 16
Li, R., Napolitano, N. R., Roy, N., et al. 2022, ApJ, 929, 152
Maresca, J., Dye, S., & Li, N. 2021, MNRAS, 503, 2229
Marshall, P. J., Hogg, D. W., Moustakas, L. A., et al. 2009, ApJ, 694, 924
Maturi, M., Mizera, S., & Seidel, G. 2014, A&A, 567, A111
McGreer, I. D., Clément, B., Mainali, R., et al. 2018, MNRAS, 479, 435
Metcalf, R. B., Meneghetti, M., Avestruz, C., et al. 2019, A&A, 625, A119
Millon, M., Courbin, F., Bonvin, V., et al. 2020a, A&A, 642, A193
Millon, M., Courbin, F., Bonvin, V., et al. 2020b, A&A, 640, A105
Morningstar, W. R., Hezaveh, Y. D., Perreault Levasseur, L., et al. 2018, ArXiv e-prints [arXiv:1808.00011]
Morningstar, W. R., Perreault Levasseur, L., Hezaveh, Y. D., et al. 2019, ApJ, 883, 14
Nair, V., & Hinton, G. E. 2010, in ICML, eds. J. Fürnkranz, & T. Joachims (Omnipress), 807
Nightingale, J. W., Dye, S., & Massey, R. J. 2018, MNRAS, 478, 4738
Nightingale, J., Hayes, R., Kelly, A., et al. 2021, J. Open Sour. Softw., 6, 2825
Ostrovski, F., McMahon, R. G., Connolly, A. J., et al. 2017, MNRAS, 465, 4325
Paillassa, M., Bertin, E., & Bouy, H. 2020, A&A, 634, A48
Park, J. W., Wagner-Carena, S., Birrer, S., et al. 2021, ApJ, 910, 39
Paszke, A., Gross, S., Massa, F., et al. 2019, Advances in Neural Information Processing Systems 32 (Curran Associates, Inc.), 8024
Pearson, J., Li, N., & Dye, S. 2019, MNRAS, 488, 991
Pearson, J., Maresca, J., Li, N., & Dye, S. 2021, MNRAS, 505, 4362
Perreault Levasseur, L., Hezaveh, Y. D., & Wechsler, R. H. 2017, ApJ, 850, L7
Petrillo, C. E., Tortora, C., Chatterjee, S., et al. 2017, MNRAS, 472, 1129
Petrillo, C. E., Tortora, C., Chatterjee, S., et al. 2019, MNRAS, 482, 807
Planck Collaboration VI. 2020, A&A, 641, A6
Refsdal, S. 1964, MNRAS, 128, 307
Riess, A. G., Casertano, S., Yuan, W., Macri, L. M., & Scolnic, D. 2019, ApJ, 876, 85
Riess, A. G., Casertano, S., Yuan, W., et al. 2021, ApJ, 908, L6
Rizzo, F., Vegetti, S., Fraternali, F., & Di Teodoro, E. 2018, MNRAS, 481, 5606
Rodney, S. A., Brammer, G. B., Pierel, J. D. R., et al. 2021, Nat. Astron., 5, 1118
Rojas, K., Savary, E., Clément, B., et al. 2022, A&A, 668, A73
Rubin, D., Hayden, B., Huang, X., et al. 2018, ApJ, 866, 65
Rusu, C. E., Wong, K. C., Bonvin, V., et al. 2020, MNRAS, 498, 1440
Salmon, B., Coe, D., Bradley, L., et al. 2018, ApJ, 864, L22
Savary, E., Rojas, K., Maus, M., et al. 2022, A&A, 666, A1
Schaefer, C., Geiger, M., Kuntzer, T., & Kneib, J. P. 2018, A&A, 611, A2
Schmidt, T., Treu, T., Birrer, S., et al. 2023, MNRAS, 518, 1260
Schuldt, S., Chirivì, G., Suyu, S. H., et al. 2019, A&A, 631, A40
Schuldt, S., Suyu, S. H., Meinhardt, T., et al. 2021a, A&A, 646, A126
Schuldt, S., Suyu, S. H., Cañameras, R., et al. 2021b, A&A, 651, A55
Schuldt, S., Suyu, S. H., Cañameras, R., et al. 2022, A&A, submitted (Paper X) [arXiv:2207.10124]
Sciortino, F., Howard, N. T., Marmar, E. S., et al. 2020, Nucl. Fusion, 60, 126014
Seidel, G., & Bartelmann, M. 2007, A&A, 472, 341
Shajib, A. J., Birrer, S., Treu, T., et al. 2020, MNRAS, 494, 6072
Shajib, A. J., Treu, T., Birrer, S., & Sonnenfeld, A. 2021, MNRAS, 503, 2380
Shajib, A. J., Wong, K. C., Birrer, S., et al. 2022, A&A, 667, A123
Shu, Y., Bolton, A. S., Kochanek, C. S., et al. 2016, ApJ, 824, 86
Shu, Y., Brownstein, J. R., Bolton, A. S., et al. 2017, ApJ, 851, 48
Shu, Y., Marques-Chaves, R., Evans, N. W., & Pérez-Fournon, I. 2018, MNRAS, 481, L136
Shu, Y., Cañameras, R., Schuldt, S., et al. 2022, A&A, 662, A4
Sonnenfeld, A., Gavazzi, R., Suyu, S. H., Treu, T., & Marshall, P. J. 2013, ApJ, 777, 97
Sonnenfeld, A., Treu, T., Marshall, P. J., et al. 2015, ApJ, 800, 94
Sonnenfeld, A., Chan, J. H. H., Shu, Y., et al. 2018, PASJ, 70, S29
Sonnenfeld, A., Jaelani, A. T., Chan, J., et al. 2019, A&A, 630, A71
Sonnenfeld, A., Verma, A., More, A., et al. 2020, A&A, 642, A148
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. 2014, J. Mach. Learn. Res., 15, 1929
Strigari, L. E. 2013, Phys. Rep., 531, 1
Suyu, S. H., & Halkola, A. 2010, A&A, 524, A94
Suyu, S. H., Marshall, P. J., Hobson, M. P., & Blandford, R. D. 2006, MNRAS, 371, 983
Suyu, S. H., Hensel, S. W., McKean, J. P., et al. 2012, ApJ, 750, 10
Suyu, S. H., Huber, S., Cañameras, R., et al. 2020, A&A, 644, A162
Tohill, C., Ferreira, L., Conselice, C. J., Bamford, S. P., & Ferrari, F. 2021, ApJ, 916, 4
Treu, T. 2010, ARA&A, 48, 87
Treu, T., Dutton, A. A., Auger, M. W., et al. 2011, MNRAS, 417, 1601
van der Walt, S., Colbert, S. C., & Varoquaux, G. 2011, Comput. Sci. Eng., 13, 22
Van Rossum, G., & Drake, F. L. 2009, Python 3 Reference Manual (Scotts Valley, CA: CreateSpace)
Virtanen, P., Gommers, R., Oliphant, T. E., et al. 2020, Nat. Meth., 17, 261
Wagner-Carena, S., Park, J. W., Birrer, S., et al. 2021, ApJ, 909, 187
Wang, H., Cañameras, R., Caminha, G. B., et al. 2022, A&A, 668, A162
Warren, S. J., & Dye, S. 2003, ApJ, 590, 673
Wong, K. C., Keeton, C. R., Williams, K. A., Momcheva, I. G., & Zabludoff, A. I. 2011, ApJ, 726, 84
Wong, K. C., Sonnenfeld, A., Chan, J. H. H., et al. 2018, ApJ, 867, 107
Wong, K. C., Suyu, S. H., Chen, G. C. F., et al. 2020, MNRAS, 498, 1420
Wu, J. F. 2020, ApJ, 900, 142
Yıldırım, A., Suyu, S. H., & Halkola, A. 2020, MNRAS, 493, 4783

