HOLISMOKES -- IV. Efficient Mass Modeling of Strong Lenses through Deep Learning

Modelling the mass distributions of strong gravitational lenses is often necessary to use them as astrophysical and cosmological probes. With the high number of lens systems ($>10^5$) expected from upcoming surveys, it is timely to explore efficient modeling approaches beyond traditional MCMC techniques that are time consuming. We train a CNN on images of galaxy-scale lenses to predict the parameters of the SIE mass model ($x,y,e_x,e_y$, and $\theta_E$). To train the network, we simulate images based on real observations from the HSC Survey for the lens galaxies and from the HUDF as lensed galaxies. We tested different network architectures, the effect of different data sets, and using different input distributions of $\theta_E$. We find that the CNN performs well and obtain with the network trained with a uniform distribution of $\theta_E$ $>0.5"$ the following median values with $1\sigma$ scatter: $\Delta x=(0.00^{+0.30}_{-0.30})"$, $\Delta y=(0.00^{+0.30}_{-0.29})"$, $\Delta \theta_E=(0.07^{+0.29}_{-0.12})"$, $\Delta e_x = -0.01^{+0.08}_{-0.09}$ and $\Delta e_y = 0.00^{+0.08}_{-0.09}$. The bias in $\theta_E$ is driven by systems with small $\theta_E$. Therefore, when we further predict the multiple lensed image positions and time delays based on the network output, we apply the network to the sample limited to $\theta_E>0.8"$. In this case, the offset between the predicted and input lensed image positions is $(0.00_{-0.29}^{+0.29})"$ and $(0.00_{-0.31}^{+0.32})"$ for $x$ and $y$, respectively. For the fractional difference between the predicted and true time delay, we obtain $0.04_{-0.05}^{+0.27}$. Our CNN is able to predict the SIE parameters in fractions of a second on a single CPU and with the output we can predict the image positions and time delays in an automated way, such that we are able to process efficiently the huge amount of expected lens detections in the near future.


Introduction
Strong gravitational lensing has become a very powerful tool for probing various properties of the Universe. For instance, galaxygalaxy lensing can help to constrain the total mass of the lens and moreover, assuming a mass-to-light (M/L) ratio for the baryonic matter, also its dark matter (DM) fraction. By combining lensing with other methods like measurements of the lens' velocity dispersion (e.g., Barnabè et al. 2011Barnabè et al. , 2012Yıldırım et al. 2020) or the galaxy rotation curves (e.g., Hashim et al. 2014;Strigari 2013), the dark matter can be better disentangled from the baryonic component and a 3D (deprojected) model of the mass density profile can be obtained. Such profiles are very helpful for probing cosmological models (e.g., Davies et al. 2018;Eales et al. 2015;Krywult et al. 2017).
Another application of strong lensing is to probe high redshift sources thanks to the lensing magnification (e.g., Dye et al. 2018;Lemon et al. 2018;McGreer et al. 2018;Rubin et al. 2018;Salmon et al. 2018;Shu et al. 2018). In the last years, huge efforts were spent in reconstructing the surface brightness distribution of lensed extended sources. Together with redshift and kinematic measurements, these observations contain information about the evolution of galaxies at higher reshifts. If the mass profile of the lens is well constrained, the original unlensed morphology can be reconstructed (e.g., Warren & Dye 2003;Suyu et al. 2006;Nightingale et al. 2018;Rizzo et al. 2018;Chirivì et al. 2020).
In the case of a transient source like a quasar or supernova (SN), measurements of the time delay between multiple images can be used to constrain the value of the Hubble constant H 0 (e.g., Refsdal 1964;Chen et al. 2019;Rusu et al. 2020;Wong et al. 2020;Shajib et al. 2020) and thus help to assess the 4.4σ tension between the cosmic microwave background (CMB) analysis that gives H 0 = (67.36 ± 0.54) km s −1 Mpc −1 for flat Λ cold dark matter (ΛCDM;Planck Collaboration et al. 2018) and the local distance ladder with H 0 = (74.03 ± 1.42) km s −1 Mpc −1 (SH0ES programme; Riess et al. 2019).
Since these strong lens observations are very powerful, several large surveys including the Sloan Lens ACS (SLACS) survey Shu et al. 2017), the CFHTLS Strong Lensing Legacy Survey (SL2S; Cabanac et al. 2007;Sonnenfeld et al. 2015), the Sloan WFC Edge-on Late-type Lens Survey (SWELLS; Treu et al. 2011), the BOSS Emission-Line Lens Survey (BELLS; Brownstein et al. 2012;Shu et al. 2016;Cornachione et al. 2018), the Dark Energy Survey (DES; Dark Energy Survey Collaboration et al. 2005;Tanoglidis et al. 2020), the Survey of Gravitationally-lensed Objects in HSC Imaging (SuGOHI; Sonnenfeld et al. 2018a;Wong et al. 2018;Chan et al. 2020;Jaelani et al. 2020), and surveys in the Panoramic Survey Telescope and Rapid Response System (Pan-STARRS; e.g., Lemon et al. 2018;Cañameras et al. 2020) have been conducted to find lenses. So far we have several thousand lenses detected but mainly from the lower redshift regime. However, based on newer upcoming surveys like the Rubin Observatory Legacy Survey of Space and Time (LSST, Ivezic et al. 2008) located in Chile, which will target around 20, 000 deg 2 of the southern hemisphere in six different filters (u, g, r, i, z, y), together with the Euclid imaging survey from space operated by the European Space Agency (ESA, Laureijs et al. 2011), we expect billions of galaxy images containing on the order of a hundred thousand lenses (Collett 2015).
To deal with this huge amount of images, there are ongoing efforts to develop fast and automated algorithms to find lenses in the first place. These methods are based on different identification properties, for instance on geometrical quantification (Bom et al. 2017;Seidel & Bartelmann 2007), spectroscopic analysis (Baron & Poznanski 2017;Ostrovski et al. 2017) or color cuts (Gavazzi et al. 2014;Maturi et al. 2014). Moreover, Convolutional Neural Networks (CNNs) have also been extensively used in gravitational lens detection (e.g., Jacobs et al. 2017;Petrillo et al. 2017;Schaefer et al. 2018;Lanusse et al. 2018;Metcalf et al. 2019;Cañameras et al. 2020;Huang et al. 2020) as these do not require any measurements of the lens' properties. Once a CNN is trained, it can classify huge amounts of images in a very short time and is thus very efficient. Nontheless, such CNNs have limitations like completeness or an accurate grading, and the performance strongly depends on the training set design as it encodes an effective prior (in the case of supervised learning). In this regard unsupervised or active learning might be promising future avenues for lens finding.
However, these methods are only for finding the lenses, and a mass model is necessary for further studies. Mass models of gravitational lenses are often described by parametrized profiles, where the parameters are optimized for instance via Markov-Chain Monte-Carlo (MCMC) sampling (e.g., Jullo et al. 2007;Suyu & Halkola 2010;Sciortino et al. 2020;Fowlie et al. 2020). Such techniques are very time and resource consuming and are thus difficult to scale up for the upcoming amount of data. With the success of CNNs in image processing, Hezaveh et al. (2017) showed the use of CNNs to estimate the mass model parameters of a Singular Isothermal Ellipsoid (SIE) profile and investigated further error estimations , analyzing interferometric observations (Morningstar et al. 2018), and source surface brightness reconstruction with reccurent inference machines (RIM Morningstar et al. 2019). While they are mainly considering single band images and subtracting the lens light before processing the image with the CNN, Pearson et al. (2019) presented a CNN to model the image without lens light subtraction. However, for all deep learning approaches one needs a data set, containing the images and the corresponding parameter values, for training, validation, and testing the network. As there are not that many real lensed galaxies known, both groups mock up lenses for their CNNs.
We recently initiated the Highly Optimized Lensing Investigations of Supernovae, Microlensing Objects, and Kinematics of Ellipticals and Spirals (HOLISMOKES) programme (Suyu et al. 2020, hereafter HOLISMOKES I). After presenting our lens search project , hereafter HOLISMOKES II), we present in this paper a CNN for modeling strong gravitationally lensed galaxies with ground-based imaging, taking the advantage of four different filters and not applying lens light subtraction beforehand. In contrast to Pearson et al. (2019) we are using a mocked up data set based on real observed galaxy cutouts, since the performance of the CNN on real systems will be optimal when the mock systems used for training are as close to real lens observations as possible. Our mock lens images contain, by construction, realistic line-of-sight objects as well as realistic lens and source light distributions in the image cutouts. We are using the Hyper Suprime-Cam (HSC) Subaru Strategic Program (SSP) images together with redshift and velocity dispersion measurements from the Sloan Digital Sky Survey (SDSS) for the lens galaxies, and images together with redshifts from the Hubble Ultra Deep Field (HUDF) survey for the sources (Beckwith et al. 2006;Inami et al. 2017).
The outline of the paper is as follows. We describe in Sec. 2 how we simulate our training data, and we give a short introduction and overview of the used network architecture in Sec. 3. The main networks are presented in Sec. 4, and we give details of further tests in Sec. 5. We also consider the image position and time delay differences in Sec. 6 for a performance test and compare to other modeling techniques in Sec. 7. We summarize and conclude our results in Sec. 8. Throughout this work, we assume a flat ΛCDM cosmology with a Hubble constant H 0 = 72 km s −1 Mpc −1 (Bonvin et al. 2017) and Ω M = 1 − Ω Λ = 0.32 (Planck Collaboration et al. 2018). Unless specified otherwise, each quoted parameter estimate is the median of its onedimensional marginalized posterior probability density function, and the quoted uncertainties show the 16 th and 84 th percentiles (that is, the bounds of a 68% credible interval).

Simulation of strongly lensed images
For training a neural network one needs, depending on the network size, tens of thousands up to millions of images together with the expected network output, which are, in our case, the values of the SIE profile parameters corresponding to each image. Since there are far too few known lens systems, we need to mock up lens images. While previous studies are based on partly or fully generated light distributions (e.g., Hezaveh et al. 2017;Perreault Levasseur et al. 2017;Pearson et al. 2019), we aim to produce more realistic lens images by using real observed images of galaxies and only simulate the lensing effect with our own routine. We work with the four HSC filter, g, r, i, z (matched to HST filters F435W (λ = 4343.4Å), F606W (λ = 6000.8Å), F775W (λ = 7702.2Å) and F850LP (λ = 9194.4Å), respectively), to give the network color information to distinguish better between lens and source galaxies. The images of HSC for those filters are very similar to the expected image quality of LSST such that our tests and findings will also hold for LSST. Therefore, this work is a direct preparation and an important step for modeling the expected 100,000 lens systems which will be detected with LSST in the near future.

Lens galaxies from HSC
For the lenses, we use HSC SSP 1 images from the second public data release (PDR2; Aihara et al. 2019) with a pixel size of 0.168 . For calculating the axis ratio q light and position angle θ light of the lens, we use the second brightness moments calculated for the i band since redder filters follow better the stellar mass but the S/N is substantially lower in the z band compared to the i band. We crossmatch the HSC catalog with the SDSS 2 catalog to use only images of galaxies where we have SDSS spectroscopic redshifts and velocity dispersions. With this selection, we end up with a sample containing 145,170 galaxies that is dominated by LRGs. We show in Figure 1 in gray a histogram of the lens redshifts used for the simulation. We already overplot the distribution of the mock samples discussed in Sec. To describe the mass distribution of the lens, we adopt a SIE profile (Barkana 1998) such that the convergence (dimensionless surface mass density) can be expressed as with elliptical radius where x and y are angular coordinates on the lens plane, with respect to the lens center. In this equation θ E denotes the Einstein radius and q the axis ratio. 3 The mass distribution is rotated by the position angle θ. The Einstein radius is obtained from the velocity dispersion v disp with where c is the speed of light, and D ds and D s are the angular diameter distances between the lens (deflector) and source and the observer and source, respectively. The distribution of the velocity dispersion is shown in Figure 1 (bottom pannel, gray histogram). We compute the deflection angles of the SIE with the lensing software Glee (Suyu & Halkola 2010;Suyu et al. 2012). Based on the second brightness moments of the lens light distribution in the i band, the axis ratio q light and position angle θ light are obtained internally in our simulation code. Based on several studies (e.g, Sonnenfeld et al. 2018b;Loubser et al. 2020), the light traces the mass relatively well but not perfectly. Therefore, we add randomly-drawn gaussian perturbations on the light parameters, with a gaussian width of 0.05 for the lens center, of 0.05 for the axis ratio, and 0.17 radians (10 degrees) for the position angle, and adopt the resulting parameter values for the lens mass distribution. In case the axis ratio of the mass q (i.e. with gaussian perturbation) is above 1, we draw a second realization of the gaussian noise and otherwise set it to exactly 1.
While the simulation code assumes a parametrization in terms of axis ratio q and position angle θ, we parametrize for our network in terms of complex ellipticity e c which we define as e c = A e 2iθ = e x + ie y with e x = 1 − q 2 1 + q 2 cos(2θ) The back transformation is given by This is in agreement with previous CNN applications to lens modeling Pearson et al. 2019).

Sources from HUDF
The images for the sources are taken from HUDF 4 where also the spectroscopic redshifts are known (Beckwith et al. 2006;Inami et al. 2017). The cutouts are 10 × 10 with a pixel size of 0.03 . This survey is chosen for its high spatial resolution and we can adopt the images without point-spread function (PSF) deconvolution. Moreover it contains high redshift galaxies such that we can achieve a realistic lensing effect. The 1,323 relevant galaxies are extracted with Source Extractor (Bertin & Arnouts 1996) since the lensing effect is redshift dependent and we would otherwise lens the neighboring objects as if they were all at the same redshift, which would lead to incorrect lensing features. We show a histogram of the source redshifts in Figure 2 (gray histogram). Since we select randomly a background source (see Sec. 2.3 for details), the source galaxies can be used multiple times for one mock sample and thus the redshift distribution varies slightly between the different samples (colored histograms, see details in Sec. 4).

Mock lens systems
For training our networks we use mocked-up images based on real observed galaxies and only generate the lensing effect. We use HSC galaxies as lenses (see Sec. 2.1 for details) and HUDF galaxies as background objects (see Sec. 2.2) to obtain mocks that are as realistic as possible. Figure 3 shows a diagram of the simulation pipeline. The input has three images: the lens, the (unlensed) source, and the lens PSF image (top row). Together with the provided redshifts of source and lens, as well as the velocity dispersion for calculating the Einstein radius with equation 3, the source image can be lensed onto the lens plane (second row). For this we place a random source from our catalog randomly in a specified region behind the lens and accept this position if we obtain a strongly lensed image. Since the source images have previously been extracted, we use the brightest pixel in the i band to center the source. We have also implemented the option to just keep one of the two strong lens configurations, either quadruply or doubly imaged galaxies, classified based on the image multiplicity of the lensed source center. We also set a peak brightness threshold for the arcs in comparison to the background noise, which is the lowest root-mean-square (RMS) value of a 10%×10% square (rounded to an integer of pixel) placed in the four corners of the whole HSC cutout. The reason for calculating the RMS for each corner separately and then picking the lowest value is that there might be line-of-sight objects in the corners which would raise the RMS values. To avoid contamination to the background estimation from the lens, we use 40 × 40 image cutouts such that each corner is 4 × 4 .  In the next step, the lensed source image with high resolution is convolved with the sub-sampled PSF of the lens, which is provided by HSC SSP PDR2 for each image separately. After binning up the high-resolution lensed, convolved source image to the HSC pixel size and accounting for the different photometric zeropoints of the source telescope zp sr and lens telescope zp ls , which gives a factor of 10 0.4(zp ls −zp sr ) , the lensed source image is obtained as if it had been observed through the HSC instrument (third row in Figure 3), i.e. on the HSC 0.168 /pixel resolution. We neglect at this point the additional Poisson noise for the lensed arcs. Finally, the original lens and the mock lensed source images can be combined which results in the final image (fourth row) that is croped to a size of 64 × 64 pixels (10.8 × 10.8 ). For better illustration, a color image based on the filters g, r, and i is also shown, but we generate all mock images in four bands which we use for the network training. We show more example images based on gri filters in Figure 4.
We test the effect of different assumptions on the data set, like splitting up in quads-only or doubles-only, or different as- Fig. 4: Examples of strong gravitational lens systems mocked up with our simulation code by using HUDF galaxies as sources behind HSC galaxies as lenses. Each image cutout is 10.8 × 10.8 .
sumptions on the distribution of the Einstein radii since we found this to be crucial for the network performance. For this we generate with this pipeline new, independent mock images which are based on the same lens and source images, but different combinations and alignments. The details of the different samples and the network trained on them will be discussed further in Sec. 4. For the set of quads-only and higher limit on the Einstein radius of 2 we use a modification of the conventional data augmentation in deep learning. In particular, we rotate only the lens image before adding the random lensed source image, but not the whole final image (which is done normally for data augmentation). Thus the ground truth values are also not exactly the same values given the change in position angle and another background source with different location and redshift.

Neural Networks and their architecture
Neural networks (NN) are extremely powerful tools for a wide range of tasks and thus in the recent years broadly used and explored. Additionally, the computational time can be reduced notably compared to other methods. There are generally two types of NN: (1) classification, where one has as ground truth different labels to distinguish between the different classes, or (2) regression, where the ground truth consists of a set of parameters with specific values. The latter is the kind we use here, i.e., the network shall predict a numerical value for each of the five different SIE parameters (x, y, e x , e y and θ E ).
Depending on the problem the network shall solve, there are several different types of networks. Since we are using images as data input, typically convolutional layers followed by fully connected (FC) layers are used (e.g., Hezaveh et al. 2017;Perreault Levasseur et al. 2017;Pearson et al. 2019). The detailed architecture depends on attributes such as the specific task, the size of the images or the size of the data set. We have tested different architectures and found an overall good network performance with two convolutional layers followed by three FC layers but no significant improvement for the other network architectures. A sketch of this is shown in Figure 5. The input has four different filter images for each lens system and each image a size of 64 × 64 pixels. The convolutional layers have stride of 1 and a kernel size of 5 × 5 × C, with C = 4 for the first layer, and C = 6 for the second layer, respectively. Each convolutional layer is followed by max pooling of size f × f = 2 × 2 and stride 2. After the two convolutional layers, we obtain a data cube of size 13 × 13 × 16, which is then passed through the FC layers after flattening to finally obtain the five output values.
Independent of the exact network architecture, the network can contain hundreds of thousands (or more) neurons. While initially the values of weight parameters and bias of each neuron are random, they will be updated during the training. To see the network performance after the training, one splits the data set into three samples: the training, the validation and the test sets. We further divide those sets into random batches of size N. In each iteration the network predicts the output values for one batch (forward propagation) and after running over all batches from the training and validation sets, one epoch is finished. The error, which is called loss, is obtained for each batch with the loss function, where we use the mean-square-error (MSE) defined as where η tr k,l and η pred k,l denotes, respectively, the l th true and predicted parameter, in our case from {x, y, e x , e y , θ E }, of lens system k, and p denotes the number of output parameters. We incorporated in our loss function L weighting factors w l , which are normalized such that p l=1 w l = p holds. This gives a weighting factor of 1 for all parameters if they are all weighted equally.
The loss value of that batch is then propagated to the weights and biases (back propagation) for an update based on a stochastic gradient descent algorithm to minimize the loss. This procedure is repeated in each epoch first for all batches of the training set and an average loss is obtained for the whole training set. Afterwards those steps are repeated for all batches of the validation set, while no update of the neurons is done, and an average loss for the validation set is obtained as well. The validation loss shows whether the network improved in that epoch or if a decreasing training loss is related to overfitting the neurons. A network is overfitting if it predicts better the values for the data from the training set compared to the data from the validation set. After each epoch we reshuffle our whole training data to obtain a better generalization. This concludes one epoch and is repeated iteratively to obtain a network with optimal accuracy. This whole training correspond to one so-called crossvalidation run, where several cross-validation runs are performed by exchanging the validation set with another subset of the training set. For example, if the training set and validation set form 5 subsets {A, B, C, D, and E}, then we can have 5 independent runs of training where in each run, the validation set is one of these 5 subsets and the training set contains the remaining 4 subsets. After the multiple runs, one can determine the optimal number of epochs for training by locating the epoch with the minimal average validation loss across the multiple runs. This procedure helps to minimize potential bias to certain types of lenses for a potentially unbalanced single split. The neural network trained on all five sets {A, B, C, D, E} up to that epoch can then be applied to the test set, which contains data the network has never seen before. In our case we, used ∼56% of the data set as training set, ∼14% as validation set, and ∼30% as test set 5 , such that we have a 5-fold cross-validation for each network.

Results
For training our modeling network we mock up lensing systems based on real observed galaxies with our simulation pipeline described in Sec. 2. Each lensing system is simulated in the four different filter griz of HSC to give the network color information to distinguish better between lens galaxy and lensed arcs. The network architecture assumes, as described in Sec. 3 in detail, images with size 64 × 64 pixels, which corresponds to a size of around 10 × 10 .
During our network testing, we found that the distribution of Einstein radii in the training set is very important, especially as this is a key parameter of the model. Therefore we trained a network under the assumption of different underlying data sets, e.g., a lower limit of the Einstein radius for the simulations or a different distribution of Einstein radii. We further tested the network performance by limiting to a specific configuration i.e. only doubles or quads. We give an overview of the different data set assumptions in Tab. 1.
For finding the best network for our specific problem, we test the network performance with several different variations of the hyperparameters of the network. Independent of the data set, we train each cross validation run for 300 epochs, and apart from a few checks with different values, we fix the weight decay to 0.0005 and the momentum to 0.9. For the learning rate, batch size, and the initializations of the neurons, we have done a grid search, varying the learning rate r learn ∈ [0.1, 0.05, 0.01, 0.008, 0.005, 0.001, 0.0005, 0.0001], and batch size as 32 or 64 images per batch, and exploring three different network initializations. For the weighting factors of the contribution to the loss we test mainly two options, either all parameters contribute equally (i.e. w l = 1 ∀ l in eq. 6) or the contribution of the Einstein radius is a factor 5 higher (w θ E = 5). The best hyperparameter values depend on the assumed data set and these values are listed in Tab. 1.
We present in the following subsections our CNN modelling results for various data sets.
4.1. Naturally distributed Einstein radii with lower limit 0.5 For this network we use 65,472 mock lens images simulated following the procedure described in Sec. 2. Here we assume a lower limit of the Einstein radii of 0.5 as otherwise the lensed source is totally blended with the lens and not resolvable given the average seeing and image quality. The resulting redshift distributions are shown as the blue histograms for the lens in Figure 1 (top panel) and for the source in Figure 2. The lens redshift peaks at z d ∼ 0.5. Concerning the possible strong lensing configurations, the data set is dominated by doubles as expected. In addition, systems with smaller Einstein radii are more numerous than those with larger Einstein radii as expected given the lens mass distribution, although the velocity dispersion, which is shown in Figure 1 (bottom panel), peaks at around v disp ∼ 280 km s −1 and thus tends to include more massive galaxies than the input catalog (gray histogram). The distribution of the Einstein radius is shown in Figure 6 on the left panel; the red histogram depicts the true Einstein radii and the blue one the predicted distribution. On the right panel we show the correlation between the true Einstein radius θ tr E (x-axis) and the predicted Einstein radius θ pred E (y-axis). The red line shows the median and the gray bands marks the 1σ (16 th to 84 th percentile) and 2σ (2.5 th to 97.5 th percentile) ranges, respectively. The dashed black line is the 1:1 line for reference. We do not show this for the other parameters (lens center and ellipticity) as the performance of those parameters is very similar to that presented in Sec. 4.3 where we assume a uniform distribution of Einstein radii. The red line shows the median of the distribution and the gray bands mark the 1σ (16 th to 84 th percentile) and 2σ ( 2.5 th to 97.5 th percentile) ranges. Note. The first and second columns indicate if quads and/or doubles are included in the data set. The parameter θ E,min represents the lower limit on the Einstein radius in the simulation, and w θ E is the weighting factor of the Einstein radius in the loss function. The other parameters (lens center, ellipticity) are always weighted by a factor of 1 and the sum of all five weighting factors is normalized to the number of parameters. The fifth and sixth columns give the value of the loss of the test set and the epoch with the best validation loss. This is followed by the specific hyperparameters: learning rate r learn , batch size N, and seed for the random number generator.
We see that the network recovers the Einstein radius better for lens systems with lower image separation than with high image separation (θ E 2 ), which is in the first instance counterintuitive. If the lensed images are further separated, they are better resolved and less strongly blended with the lens, and we would expect better recovery of Einstein radii from the network. The worse network performance at larger Einstein radii can therefore only be explained by the relatively low numbers of these systems in the training data. We have more than two orders of magnitude more lens systems with θ E ∼ 0.5 than with θ E ∼ 2.0 . Therefore the network is trained to predict often a small Einstein radius and just in a negligible fraction of times a larger Einstein radius. Since the lens systems with larger image separation are very interesting for a wide range of scientific applications, it is desirable to improve the network performance on specifically those lens systems. Therefore we test a network with the same data set where the Einstein radius difference contributes a factor of 5 more to the loss than the other parameters. In case of this weighted network, the prediction performance is very similar for the lens center and ellipticity, but slightly better for the Einstein radius. If we increase the contribution of the Einstein radius further, we worsen notably the performance on the other parameters.
As a further comparison of the ground truth with the predicted values of the test set, we show in Figure 7 the difference as normalized histograms (bottom row) and the 2D probability distributions (blue) where we find no strong correlation between the five parameters. The obtained median values with 1σ uncertainties for the different parameters are, respectively, (0.00 +0.31 −0.30 ) for ∆x, (−0.01 +0.29 −0.31 ) for ∆y, 0.00 +0.08 −0.09 for ∆e x , 0.01 +0.09 −0.08 for ∆e y , and (0.02 +0.21 −0.18 ) for ∆θ E , where ∆ denotes the difference between the predicted and ground truth values. As an example, a shift of e x = 0.3 to e x = 0.15 with fixed e y = 0 results in a shift from q = 0.73 to q = 0.86.
Finally, we show in Figure 8 the difference in Einstein radii as a function of the logarithm of the ratio between lensed source intensity I s and lens intensity I l determined in the i band, which we hereafter refer to as the brightness ratio. In the top right panel, we show the distribution of the brightness ratio. The lens intensity is defined as the sum of all the pixel values in the 64 pixels × 64 pixels cutout of the lens such that it is slightly overestimated due to light contamination from surrounding objects. The distribution peaks around −2 in logarithm to basis 10, which means that the lensed source flux is a factor 100 below that of the lens. The bottom-left plot shows the median with 1σ values of the Einstein radius differences for each brightness ratio bin. Focussing on the blue curve for this section, we find a bias in the Einstein radius which is driven by the small lensing systems with θ E 0.8 (compare Figure 6). Excluding these small lensing systems, we show the corresponding plot on the lower right panel. With this limitation, we find no bias anymore and obtain a median with 1σ values of 0.00 +0.17 −0.14 for the Einstein radius difference. We find a slight improvement of the performance with increasing brightness ratio for both the full sample (bottom-left panel) and the sample with θ E > 0.8 (bottom-right panel).
To further improve the network performance for wideseparation lenses, we train separate networks for lens systems with Einstein radius θ E > 2.0 in Sec. 4.2, and for lens systems where we artificially boost the number of lenses on the high-end of θ E in Sec. 4.3.
4.2. Naturally distributed Einstein radii with lower limit 2.0 Since the network presented in Sec. 4.1 cannot recover well large Einstein radius (θ E 2 ), we test the performance of a network specialized for the high end of the distribution and set the lower limit to θ E,min = 2 . Because of the higher limit on the Einstein radii, the velocity dispersion (see bottom, orange histogram in E, min = 0.5'', equally distributed E, min = 0.5'', naturally distributed E, min = 2.0'', naturally distributed Fig. 7: Comparison of the performance of the three networks described in Sec. 4. All samples include doubles and quads and a weighting factor of w θ E = 5, but different Einstein radius distributions or lower limits on the Einstein radius as indicated in the legend. In the lowest row we show the normalized histograms of the difference between predicted values and ground truth for the five parameters and above the 2D correlations distribution (1σ contour in solid and the 2σ contour in dotted).
Figure 1), is shifted towards the high end which corresponds to more massive galaxies. We also find that the lens and source redshifts, which are shown as orange histograms in Figure 1 and Figure 2, respectively, tend to slightly higher values. Since we use the natural distribution of Einstein radii as in Sec. 4.1, the image-separation distribution is again bottom-heavy and the number of mock lens systems (25,623) is smaller, as shown in Figure 9. From the blue (predicted) histogram, we see that the true distribution (red histogram) is well recovered. On the right panel in Figure 9, we show the correlation of predicted and true Einstein radii. The red line, which follows quite well the diagonal dashed line, shows the median. The gray shaded regions visualize the 1σ and 2σ regions. We find that the network performs much better for θ E ∼ 2 than the network trained in the full range (Sec. 4.1). However, this is again due to the dropping number of lens systems towards θ E ∼ 4 , and the scatter increases dramatically for the high-end of the data set.
We further show 1D and 2D probability distributions for this network in Figure 7 (orange) as well as the histogram of the brightness ratio, and the difference of the Einstein radii as function of the brightness ratio in Figure 8. While the performance for the lens center and complex ellipticity is very similar to the network presented in Sec. 4.1, we achieve an improvement for the Einstein radius. This is expected as the network is specifically trained for lens systems with large image separation. As we see from Figure 8, the larger systems do not have a higher brightness ratio on average as one might expect. As we saw already, the network performs notably better on the Einstein radii over the whole brightness ratio range. We no longer overpredict the Einstein radius for log I s I l −2.5, and also the 1σ values are smaller.
4.3. Uniformly distributed Einstein radii with lower limit 0.5 Because of the extreme decrease in the number of systems towards large image separation, we test a network trained on a more uniformly distributed sample. For this, we generate more lens systems with high image separation by rotating the lens image by nπ/2 with n ∈ [0, 1, 2, 3]. Here we do not reuse the same lens in the same rotation to avoid producing multiple images of lens systems that are too similar. We note that the background  The upper-right panel shows the histogram of the brightness ratio of lensed source and lens. The bottom panel shows for the full sample (left) and limited to θ E > 0.8 (right) the difference in Einstein radius as a function of the brightness ratio with the 1σ values. We show the Einstein radius difference in the range −3 < log I s I l < −1 (white area in the histogram) where we have enough data points and shift the blue/orange bars slightly to the right side for better visualization. source and position are always different such that the lensing effect varies (see Sec. 2 for further details on the simulation procedure). We limit to a maximum of 8,000 lens systems per 0.1 bin resulting in a sample of 140,812 lens systems. This results in a more uniform distribution, though the largest-image-separation bins still have fewer lens systems since it is very difficult but also very seldom to obtain a lensing configuration with an image separation above ∼2.5 due to the mass distribution of galaxy-scale lenses. The biggest image separation within this sample is ∼4.5 while we set an upper limit to 5 corresponding to the size of the biggest Einstein radius so far observed from galaxy-galaxy lensing (Belokurov et al. 2007). The redshift distributions, shown as green histograms in Figure 1 and Figure 2, are similar to that of the naturally distributed sample (blue), whereas the lens velocity dispersions (Figure 1, bottom panel) tend to be higher (i.e., more massive galaxies), as expected.
Similar to the networks trained with natural Einstein radius distribution (see Sec. 4.1 and Sec. 4.2), we show in Figure 10 histograms (left column) and a 1:1 comparison (right column) but now for all five SIE parameters, i.e. from top to bottom for the lens center x and y, the complex ellipticity e x and e y , and the Einstein radius θ E . For this network we obtain a median value with 1σ scatter of (0.00 +0.30 −0.30 ) for ∆x, (0.00 +0.30 −0.29 ) for ∆y, −0.01 +0.08 −0.09 for ∆e x , 0.00 +0.08 −0.09 for ∆e y , and (0.07 +0.29 −0.12 ) for the Einstein radius ∆θ E . Comparing the performance on the Einstein radius to the network from Sec. 4.1 with a natural Einstein radius distribution, we see a significant improvement on the systems with larger image separation. Therefore we can confirm that the underprediction of the Einstein radius in Sec. 4.1 is due to the relatively small number of large-θ E systems in the training data. On the other hand, based on this plot the new network seems to be slightly worse on the low-image separation systems. It tends to overpredict the Einstein radius at θ E 2.0 such that when we limit to θ E > 0.8 as in Sec. 4.1, we get only a slight improvement in reducing the scatter and obtain ∆θ E = (0.07 +0.25 −0.08 ) . Therefore, it turns out that the performance depends sensitively on the training data distribution. If we look at the performance on the lens center, which is measured in units of pixels with respect to the image cutout center, it seems as if the network fails totally in the first instance. However, one has to recall how we obtain the lens mass center. In the simulation, we assume the lens light center to be the image center and add a gaussian variation on top (with standard deviation of 0.05 ) to shift to the lens mass center. Thus the ground truth (red histogram in Figure 10) follows a gaussian distribution while the predicted lens center distribution (blue) is peakier. This suggest that the network does not obtain enough information from the slight shift or distortion in the lensed arcs to predict correctly the lens mass center. The network has further difficulties on this parameter because all systems have the exact same lens light center (which is at the center of the image). If we would assume that the lens mass perfectly follows the light distribution and the lens light center is always the same, the lens (mass) center ground truth would become a delta distribution, and the network would perform much better. Accordingly, in many automated lens modeling architectures (e.g., Pearson et al. 2019) the lens center is not even predicted. Since the difference of the center is for nearly all lens systems smaller than ±1 pixel, it does not affect the model noticeably. We nonetheless keep five parameters for generality and suggest to investigate in future work more in this direction by relaxing the strict assumption of coincidence centers of image cutout and of lens light.
We also tested networks by weighting the contribution of the lens center to the loss with a higher fraction which results in a better performance on these two parameters but then the performance on the other parameter starts to deteriorate. We thus refrain from up-weighting the lens center.
If we now look at the performance on the ellipticity, it turns out that most of the lens systems are roundish, i.e. e x ∼ e y ∼ 0 and that the network can recover them very well. In case the lens is more elliptical, the network performance starts to drop. This might be an effect of the lower number of such lens systems in the sample especially as here the position angle becomes relevant and thus the number of systems in a particular direction is again lower. Note that e x = ±0.3 and e y = 0 corresponds to an axis ratio q = 0.73, i.e. quite elliptical. If the absolute value of e x or e y would be higher, the axis ratio would be even lower which occurs relatively seldom in nature.
With the 1D and 2D probability contours in Figure 7 (green) one can see that this network performs overall very similarly compared to the network trained on the naturally distributed sample (blue). For all three networks we find minimal correlation between the different parameters.
In analogy to the previously presented networks, we show in Figure 8 the histogram of the brightness ratio and the Einstein radius differences as function of the brightness ratio for this network. While the distribution matches that from the sample with naturally distributed Einstein radius, we overpredict the Einstein radius more than before. This is related to the overprediction at smaller Einstein radii (see Figure 10) which comes from weighting higher the fraction of systems with larger image separation. We still underestimate the Einstein radius at the very high end as already noted, but this is negligible for the overall performance compared to the amount of overestimated systems as we still have a factor of ∼ 100 more of them in our sample. This is the reason why the network tends to overpredict more strongly than that trained on the naturally distributed sample (Sec. 4.1, and blue lines in Figure 7 and Figure 8).
Finally, we show the loss curve in Figure 11. The training losses (dotted lines) and validation losses (solid lines) in different colors correspond to the five different cross-validation runs. Additionally, we give the mean of the validation curves with a black solid line. This line is used to obtain the best epoch, which is in this specific case epoch 122 that is marked with a vertical line. The corresponding loss is 0.0528 obtained with Eq. 6.  In the corresponding colors, we plot the validation loss as solid lines together with the black curve that is the average of the five validation curves from the crossvalidation runs. From the minimum in the black curve, which is marked with the vertical gray line, we find the best epoch.
From the loss curve we see that the network is not overfitting much to the training set since the validation curves do not increase much for higher epochs, but still enough to define an optimal epoch to terminate the final training. This is a sign that drop-out, i.e. the omission of random neurons in every iteration, is not needed, which is supported by additional test in Sec. 5.3.

Further network tests
In addition to the networks described in Sec. 4, where we mainly investigated the effect of the Einstein radius distribution, we discuss here further tests on the training data set.

Data set containing double or quads only
We consider a specialized network for one of the two strong lensing options and limit our sample to either doubles or quads, where the image multiplicity is based on the centroid of the source (as the spatially extended parts of the source could have different image multiplicities depending on their positions with respect to the lensing caustics). In the case where we limit to doubles only, we have done our standard grid search for the different hyperparameter combinations for two samples with naturally distributed Einstein radii above 0.5 and above 2.0 . With these networks we got no notable difference compared to the sample containing both doubles and quads (see Sec. 4.1 and Sec. 4.2), which is expected as the doubles are dominating the sample including both doubles and quads by a factor of around 20-30 (for the different networks depending on the lower limit of the Einstein radii).
In case we limit the sample to quads only, we have done again our grid search for the different hyperparameter combinations for the both samples with naturally distributed Einstein radii above 0.5 and above 2.0 and also with equally distributed Einstein radii. Since the chance to obtain four images is smaller than the chance to observe two images based on the necessary lensing configuration probability, the sample sizes are smaller with, respectively, 42,063, 19,176, and 28,398 lensing systems. Therefore the output has to be considered with care as this is much lower than typically used for such a network.
It turns out that this networks performs equally well on the lens center and ellipticity but better for the Einstein radius shown in Figure 12. By comparing this plot to Figure 10, we find the main improvement that the 1σ and 2σ scatters are substantially reduced and with smaller bias for systems with larger θ E . An improvement on the Einstein radius is expected as the network get the same information of the lens but more on the lensed arcs. Even if now one image is too faint to be detected or too blended with the lens there are three images from the quad left over to provide information on the Einstein radius.
To increase the sample, we simulated a new quads only batch with the source brightness boosted by one magnitude which resulted in a ∼ 1.5 times larger sample as before. This is still small compared to the other double or mixed samples. Now we have a brightness ratio peak at log I s I l ∼ −1.5 instead of ∼ −2.0 (compare Figure 8). The obtained performance of this trained network (loss is 0.0673 for the network with w θ E = 5) is still similar as for the quads-only network without magnitude boost (loss is 0.0688) and no significant performance difference is observed for the individual parameters.

Comparison to lens galaxy images only
As further proof for the network performance on the Einstein radius, we test how well the network is able to predict the parameters from images of only the lens galaxies, i.e. without lensed arcs. As expected, the network performs similarly well for the lens center and axis ratio, but much worse for the Einstein radius with a 1σ value of 0.41 . This shows us that the arcs are bright enough and sufficiently deblended from the lens galaxies to be detectable by the CNN.

Different network architectures
For each of the networks presented in Sec. 4, we have done a grid search to find the best hyperparameters. We have consid- ered eight different values for the learning rate, three different network initializations, two different batch sizes, and two different sets of weights for the loss contribution. This gives already 96 different combinations of hyperparameters which we tested with cross-validation and early stopping. For a subset of the hyperparameter combinations, we test further possibilities. In particular, we explore the effect of drop-out with a drop-out rate p ∈ [0.1, 0.3, 0.5, 0.7, 0.9] but find no improvement. We further test different network architectures by adding an additional convolutional layer or fully connected layer, or varying the number of neurons in the different layers. We further test the effect of five different scaling options of the input images for our data set described in Sec. 4.1, but assume here the learning rate r learn = 0.001 for simplicity. First, we boost the r band only by a factor of 10. Since the network is still able to recover the parameter values, we see that the network performance is not heavily affected by the absolute value of the images. On the other hand, in case we normalize each filter of one lens system independently of the other filter, the network has huge difficulties to infer the correct parameter values. This shows us that the network is indeed able to extract the color information and need the different filter. In the fourth and fifth option, we normalize the images with the peak value of each filter or with the mean peak value. Lastly, we also rescale the images by shifting them by the mean value and dividing by the standard deviation 6 . Since we obtained no notable improvement with any one of these scalings, we use the images without rescaling for obtaining our final networks. 6 The four individual images are rescaled as with mean the number of filter f , and σ = f k=0 p1,p2 l,m=0,0 and p1 and p2 as image dimensions in pixels for the x and y axis, respectively. In our case we have f = 4 and p1 = p2 = 64.

Prediction of image position(s) and time delay(s)
After obtaining a network for different data sets (see Tab. 1), we compared the true and predicted parameter values directly. Since the main advantage of the network is the computational speed-up compared to recent methods and the fully automated application, the network is very useful for planning follow-up observations as this needs to be done relative quickly in case there is for instance a supernova (SN) or a short lived transient occurring in the background source.
Lensed SNe in addition to lensed quasars are very powerful cosmological probes. By measuring the time delays of a lensing system with an object that is variable in brightness, one can among others use it to constrain the Hubble constant H 0 . Such applications are mainly based on lensed quasars as the chance of a lensed SN is substantially lower. So far there are two lensed SNe known; one core-collapse SN named SN Refsdal behind a strong lensing cluster MACS J1149.5+222.3 (SN Refsdal;Kelly et al. 2015) and one SN type Ia behind an isolated lens galaxy (iPTF16geu; Goobar et al. 2017). Thanks to upcoming wide surveys in the next decades like LSST, this will change. LSST is expected to detect hundreds of lensed SNe (e.g., Goldstein et al. 2019;Wojtak et al. 2019). Therefore, it is important to be prepared for such exciting transient events in a fully automated and fast way. In particular, a fast estimation of time delay(s) is important for optimizing the observing/monitoring strategy for timedelay measurements.
Beside time-delay measurements, observing lensed SNe type Ia can help to answer outstanding questions about their progenitor systems. The basic scenario is the single degenerate case where a white dwarf (WD) is stable until it reaches the Chandrasekhar mass limit (Whelan & Iben 1973;Nomoto 1982) by accreting mass from a nearby star. Today there are also alternative scenarios considered where the WD explodes before reaching the Chandrasekhar mass, the so-called sub-Chandrasekhar detonations (Sim et al. 2010). Another possibility for a SN Ia is the double degenerated scenario where the companion is another WD (e.g., Pakmor et al. 2010) and both are merging to exceed the Chandrasekhar mass limit. It is still unclear which one, or both, main scenarios is correct to describe the SN Ia formation. To shed light on this debate, one possibility is to observe the SN Ia spectroscopically at very early stages which is normally difficult because of the SN detection close to peak luminosity, past the early phase. In case this SN is lensed, we can use the position of the first appearing image, together with a mass model of the underlying lens galaxy, which we can obtain with our network using "reference" images taken in an earlier epoch before SN explosion that show the lens galaxy with the lensed SN host, to predict the position and time when the next images will appear. Here it is very important to react fast as the time delays of galaxy-galaxy strong lensing are typically on the order of weeks. Our CNN can indeed provide a mass model within seconds of analyzing the reference image. We explore below how accurate we can predict the positions and time delays of the next appearing SN images.
To further test our model networks, we use the predicted SIE parameters from the networks to predict the image positions and time delays and compare them to those obtained with the ground-truth SIE model parameter values. This will give us a better understanding of how well the network performs and if the obtained accuracy is sufficient for such an application. For this comparison we compute the image positions of the true source center based on the true SIE parameters obtained by the simulation for the sytsems of the test set (hereafter true image posi-tions). After removing the "central" highly-demagnified lensed image as this would not be observable (given its demagnification and the presence of the lens galaxy in the optical/infrared), we compute the time delays for these systems (hereafter true time delays ∆t tr ) by using the known redshifts and our assumed cosmology. Based on these true image positions and time delays, we can select the first-appearing image and use its true image position to predict the source position with our predicted SIE mass model. This source position is then used to predict the image positions (hereafter predicted image positions) of the next-appearing SN images based on the SIE parameter values predicted with our modeling network. The predicted image positions are then used to predict the time delays (hereafter predicted time delays ∆t pred ) with the network predicted SIE parameter. We compare directly the image positions and time delays that we obtain with the true and with the network predicted SIE parameters, when we have the same number of multiple images. In case the number of image do not match, which happened for 7.8% for the network with equally balanced Einstein radii distribution containing double and quads, we omit the candidate in this analysis as a fair comparison is not possible. Since we always remove the central image, we obtain for a double and quad, respectively, two and four images and one and three time delays. Since the time delays can be very different, we also compare the fractional difference between the true and predicted time delays with respect to the true time delays.
We choose again the three main networks from Sec. 4 for this comparison and show them in Figure 13. All three sets contain quads and doubles, and assume a loss weighting factor of 5 for the Einstein radius. The first set assumes a lower limit on the Einstein radius of 0.5 (blue), the second a lower limit of 2 (yellow), and the third a lower limit of again 0.5 but with a uniform distribution on the Einstein radii instead of the natural distribution following the lensing probability (green). We plot the quantities as function of the brightness ratio log I s I l in analogy to Figure 7 and Figure 8.
In detail, Figure 13 contains in the upper row the median difference in the image position for the x coordinate (left) and y coordinate (right) with the 1σ value per brightness ratio bin, where only the additional image positions are taken into account as the first reference image is known and thus do not need to be predicted. We obtain for all three networks a median offset of nearly zero independent of the brightness ratio and whether we limit further in Einstein radii or not. The 1σ values are around 0.25 , corresponding to ∼ 1.5 pixels. Explicitly, we find for the equally distributed sample applied to θ E > 0.8 a median image position offset of (0.00 +0.29 −0.29 ) and (0.00 +0.32 −0.31 ) for the x and y coordinate, respectively. Interestingly the 1σ values are slightly larger for quads than doubles as we would have expected that quads provide more information to constrain the SIE parameter values and thus predict better the image positions. The reason for this is probably because quads have generally higher image magnification as doubles, and image offsets are larger with higher magnification.
The middle row of Figure 13 shows the legend (left) and a histogram of the difference between the predicted time delay ∆t pred and the true time delay ∆t true . The bottom row shows the difference in time delay divided by the absolute value of the true time delay per brightness ratio bin (left) and the difference of the time delays again per brightness ratio bin (right). In terms of time delay difference, the network trained on the naturally distribution (blue) performs better than that with uniform distribution (green), but especially for the network trained for lens systems with large Einstein radius (orange) we obtain notable differences. In detail, we obtain a median with 1σ value for the naturally distributed sample (blue, Sec. 4.1) for the time-delay difference of 2 +18 −6 days and a fractional time-delay difference of 0.05 +0.47 −0.09 . Since we find a strong correlation between the offset in the Einstein radius and the time delay offset as shown in Figure 14, we exclude again the very small Einstein radii systems (θ tr E < 0.8 ) and obtain then for the time-delay difference 1 +18 −11 days and for the fractional difference 0.01 +0.19 −0.12 . For the equally distributed sample (green, Sec. 4.3) we obtain, with θ E > 0.5 and θ E > 0.8 , respectively, for the time delay difference 7 +38 −6 and 6 +36 −8 days and for the fractional time delay difference 0.06 +0.45 −0.05 and 0.04 +0.27 −0.05 . This restriction is easily applicable in practice, since one will follow up only individual lensing systems at a given time, and one can check by looking at the image of the individual system whether the Einstein radius is >0.8 . Depending on the predicted time delay, one could also improve the model further by using traditional manual maximum likelihood modeling methods to verify the predicted time delay.
The fractional offset in the predicted time delays of 0.04 +0.27 −0.05 that we achieve with our CNN for systems with θ E > 0.8 (for the uniformly distributed θ E sample), i.e. with a symmetrized scatter of ∼16%, is close to the limit that would be achievable even with detailed/time-consuming MCMC models of groundbased images. This is because the assumption of the SIE introduces additional uncertainties on the predicted time delays in practice, even though detailed MCMC models of images would typically yield more precise and accurate estimates for the SIE parameters than our CNN. While galaxy mass profiles are close to being isothermal, the intrinsic scatter in the logarithmic radial profile slope γ (where the three dimensional mass density ρ(r 3D ) ∝ r −γ 3D ) is around ±0.15, translating to ∼ 15% scatter in the time delays (e.g., Koopmans et al. 2006;Auger et al. 2010;Barnabè et al. 2011). In other words, if a lens galaxy has a powerlaw mass slope of γ = 2.1, then our assumed SIE mass profile (with γ = 2.0) for it would predict time delays that are ∼10% too high (e.g., Wucknitz 2002; Suyu 2012). While constraining the profile slope γ with better precision than the intrinsic scatter for individual lenses is possible, this would require high-resolution imaging from space or ground-based adaptive optics (e.g., Dye & Warren 2005;Chen et al. 2016). Given the difficulties of measuring the power-law mass slope γ from seeing-limited ground-based images of lens systems (although see Meng et al. 2015, for the optimistic scenario when various inputs are known perfectly such as the point spread function), we conclude that our network prediction for the delays has comparable uncertainties as that due to the unknown γ . We expect these two sources of uncertainties to be the dominant ones in ground-based images.
We also find a decrease of the performance with increase of brightness ratio which is in the first instance counterintuitive. If we consider the fractional offset in the left panel, we see a better performance for the sample with an Einstein radius lower limit of θ E,min = 2 (orange), especially in terms of the 1σ scatter, when compared to the other two networks. This θ E,min = 2 network also has minimal bias, as shown by the median line. This is understandable as the time delays are longer for systems with a bigger Einstein radius, and therefore the fractional uncertainty is smaller. The accuracy in time delay difference (lower right plot) is good, although 1σ scatter is quite large, ∼20 days. With this reasoning, we can also understand the worse performance of the equally distributed sample (green) compared to the naturally distributed sample (blue) as it contains a much higher fraction of  systems with bigger image separation. As higher brightness ratio (log(I s /I l )) tends to be associated with systems with higher θ E , the prediction of delays thus has larger scatter as shown in the bottom-right panel. Moreover, we have to note that we find a bet-ter performance for doubles than quads, which might be because of smaller image separation and shorter time delays of quads.
During this evaluation of the networks we have to keep in mind that the main advantage of this networks is the run time: we need only a few seconds to estimate the SIE model param- eters, the image positions, and the corresponding time delays. Therefore it is expected that we do not reach the accuracy of current modeling techniques using MCMC sampling which can take weeks. Nonetheless, the network results can serve as input to conventional modelling and help speed up the overall modelling.

Comparison to other modeling codes
There are already several modeling codes developed, and one can separate them into two main groups. The state-of-the-art codes which rely on MCMC sampling are widely tested and were used for most of the modeling so far. The advantage of such codes are their flexibility in image cutout size or pixel size and also in terms of profiles to describe the lens light or mass distribution. With the advantage of the variety of profiles comes the disadvantage that the codes require a lot of user input which limit the applicability of such codes to a very small sample, or specific lensing systems that are modeled. Moreover, based on the MCMC sampling of the parameter space is very computational intensive and thus can take up to weeks per lens system although some steps can be parallelized and run on multiple cores.
Since the number of known lens systems has grown in the past few years and will increase substantially with upcoming surveys like LSST and Euclid, those codes to analyze individual lens systems will not be enough anymore. Thus the modeling process must be more automated and a speed-up will be necessary. While some newer codes (e.g. Nightingale et al. 2018;Shajib et al. 2019;Ertl et al. in prep.) are automating the modeling steps to minimize the user input, they still rely on sampling the parameter space such that the run time remains on the order of days and some user input per lens system.
The second, new kind of modeling are based on machine learning such as that used in this work. The first network for modeling strong lens systems was first presented by Hezaveh et al. (2017). While they use Hubble Space Telescope data quality, we use with images from HSC ground based quality similar to Pearson et al. (2019) as most of the newly detected lens systems will be in first instance observed with ground-based facilities. Moreover, Hezaveh et al. (2017) suggest to first remove the lens light and then model with the network only the arcs. Therefore we cannot further compare the performance fairly. Pearson et al. (2019) consider both modeling with or without lens light subtraction but found no notable difference such that we only consider modeling the lens system without an additional step to remove the lens light. Since we provide the image in four different filters, the network is able to distinguish internally between the lens galaxy and the surrounding arcs. In contrast to Pearson et al. (2019), we use the SIE profile with all five different parameters while they assume a fixed lens center. Moreover, they mock up completely their training data, assume a very conservative threshold of S/N¿20 in at least one band and do not include neighbouring galaxies which can confuse the CNN while we are more realistic by using real observed images as input for the simulation pipeline. This way we have more realistic lens light distributions and also include neighboring objects which the network has to learn to distinguish from the lensing system. Pearson et al. (2019) make use of the same type of network as Hezaveh et al. (2017) and us, a CNN, but they use slightly smaller input cutouts (57 × 57 pixels) and a different network architecture (6 convolutional layers and 2 FC layers) than ours. Since they investigated mostly the effect of using multiple images of different filters and whether to use lens light subtraction or not, whereas we investigate the effect on the underlying samples and a simulation with real observed images, we do not have a scenario that assumes the exact same properties. The closest scenario, from Pearson et al. (2019) the results from LSST-like gri images including lens light and our results based on HSC griz images with naturally distribution of the Einstein radii, show that both networks are very similar in their overall performance. The reason that they do not suffer from the same biases in θ pred E , even with a non-flat θ tr E distribution in their simulations, is perhaps because they use idealised, simplistic simulations (high S/N, wellresolved systems, no neighbours).
There are also other recent publications related to strong lens modeling with machine learning. Bom et al. (2019) present a new idea by suggesting a network which predicts four parameters, the Einstein radius θ E , the lens redshift z d , the source redshift z s , and the related quantity of the lens velocity dispersion v disp . They adapt similar to us a SIE profile to mock up their training data with an image quality similar to that from the Dark Energy Survey. Since this code provides only the Einstein radius instead of a full SIE model, the applicability is somewhat limited.
Madireddy et al. (2019) suggest a modular network to combine lens detection and lens modeling which have been done so far with complete independent networks. In detail, they have four steps, the first one is to reduce the background noise (socalled image denoising), followed by a lens light subtracting step (so called "deblending" step), before the next network decides whether this is a lens system or not. If it detects the input image to be a lens, the module is called to predict the mass model parameter values. Each module of the network is a very deep network and both modules for detection and modeling make use of the residual neural network (ResNet) approach. They use a sample of 120,000 images, with 60,000 lenses and 60,000 nonlenses, and split this into 90% and 10% for the training and test set, respectively, without making use of the cross-validation procedure. Madireddy et al. (2019) uses similar to Pearson et al. (2019) completely mocked-up images based on a SIE profile with fixed centroid to the image center such that the modeling module predict three quantities, Einstein radius, and the two components of the complex ellipticity. Based on the different assumptions a direct comparison of the performance is not possible. However, we see that the performance is typical for the current state of CNNs based on Pearson et al. (2019).

Summary and Conclusion
In this paper, we present a Convolutional Neural Network to model fully automated and very quickly the mass distribution of galaxy-scale strong lens systems by assuming a SIE profile. The network is trained on images of lens systems generated with our newly developed code that takes real observed galaxy images as input for the source galaxy (in our case from the Hubble Ultra Deep Field), lenses the source onto the lens plane, and adds it to another real observed galaxy image for the lens galaxy (in our case from the HSC SSP survey). We choose the HSC images as lenses and adopt their pixel sizes of 0.168 as this is similar to the data quality expected from LSST. With this procedure we simulate different samples to train our networks where we distinguish between the lens types (quads+doubles, doubles-only, and quads-only) as well as on the lower limit of the Einstein radius range. Since we find a strong dependence on the Einstein radius distribution, we also consider a uniformly distributed sample and also a weighting factor of 5 for the Einstein radius' contribution to the loss. With this we obtain eight different samples for each of the two different weighting assumptions summarized in Tab.

1.
For each sample we then perform a grid search to test different hyper-parameter combinations to obtain the best network for each sample although we find that the CNN performance depends much more critically on the assumptions of the mock training data (like quads/doubles/both or Einstein radius distribution) rather than fine tuning of hyper-parameters. From the different networks presented in Tab. 1, we find a good improvement for the networks trained with quads-only compared to the networks trained on both quads and doubles. If the system type is known, we therefore recommend using the corresponding network. Since the Einstein radius is a key parameter, we weighted its loss higher than for the others and, although the minimal validation loss is higher, we advocate these networks for modeling HSC-like lenses.
After comparing the network performance on the SIE parameter level, we test the network performance on the image and time-delay level. For this we use the first appearing image of the true mass model to predict the source position based on the predicted SIE parameter. From this source position and the network predicted SIE parameters, we then predict the other image position(s) and time delay(s). We find for the sample with double and quads, a uniform distribution in Einstein radii and a weighting factor w θ E of five by applying the network to θ E > 0.8 an average image offset of ∆θ x = (0.00 +0.29 −0.29 ) and ∆θ y = (0.00 +0.32 −0.31 ) while we achieve the fractional time-delay difference of 0.04 +0.27 −0.05 . This is very good given that we use a simple SIE profile and need only a few seconds per lens system in comparison to current state-of-the-art methods which require at least days and some user input per lens system. We anticipate that fast CNN modeling such as the one developed here will be crucial for coping with the vast amount of data from upcoming imaging surveys. For fure work, we suggest to investigate further into creating even more realistic training data (e.g., allowing for an external shear component in the lens mass model) and also in exploring the effect of deeper or more complex network architectures. The outputs of even the network presented here can be used to prune down the sample for specific scientific studies, and followed up with more detailed conventional mass modeling techniques.