Euclid preparation. XXV. The Euclid Morphology Challenge -- Towards model-fitting photometry for billions of galaxies

The ESA Euclid mission will provide high-quality imaging for about 1.5 billion galaxies. A software pipeline to automatically process and analyse such a huge amount of data in real time is being developed by the Science Ground Segment of the Euclid Consortium; this pipeline will include a model-fitting algorithm, which will provide photometric and morphological estimates of paramount importance for the core science goals of the mission and for legacy science. The Euclid Morphology Challenge is a comparative investigation of the performance of five model-fitting software packages on simulated Euclid data, aimed at providing the baseline to identify the best suited algorithm to be implemented in the pipeline. In this paper we describe the simulated data set, and we discuss the photometry results. A companion paper (Euclid Collaboration: Bretonni\`ere et al. 2022) is focused on the structural and morphological estimates. We created mock Euclid images simulating five fields of view of 0.48 deg2 each in the $I_E$ band of the VIS instrument, each with three realisations of galaxy profiles (single and double S\'ersic, and 'realistic' profiles obtained with a neural network); for one of the fields in the double S\'ersic realisation, we also simulated images for the three near-infrared $Y_E$, $J_E$ and $H_E$ bands of the NISP-P instrument, and five Rubin/LSST optical complementary bands ($u$, $g$, $r$, $i$, and $z$). To analyse the results we created diagnostic plots and defined ad-hoc metrics. Five model-fitting software packages (DeepLeGATo, Galapagos-2, Morfometryka, ProFit, and SourceXtractor++) were compared, all typically providing good results. (cut)


Introduction
The European Space Agency's Euclid mission (Laureijs et al. 2011, mission RedBook), due to start operations in 2023, is designed to provide accurate photometric, spectroscopic and morphological data (in particular cosmic shear and clustering distributions) for billions of galaxies across 15 000 deg 2 of sky, using them as tracers to study the properties of the dark components of the Universe.
To this end, a processing pipeline is being assembled by the Science Ground Segment, a team that is in charge of releasing the data to the community.This pipeline is ready to ingest, process and analyse the raw imaging data from the satellite on a daily basis; optical data from external ground-based instruments (Rubin/LSST, DECAM, CHFT, Pan-STARRS, Omega-CAM, Subaru; see Euclid Collaboration: Scaramella et al. 2022) will also be used to complement the optical and near-infrared images obtained by the two satellite photometers VIS (observing in I E , a broad optical band; see Cropper et al. 2016) and NISP-P (observing in the three near-infrared -NIR -bands Y E , J E , and H E ; Maciaszek et al. 2016;Euclid Collaboration: Schirmer et al. 2022), allowing for high-quality photometric redshift estimates.The final step of the image analysis pipeline will produce a global catalogue containing all the astrometric, photometric and morphological information about each source detected in the I E images (plus an additional sample of NIR-detected sources).This catalogue will then be exploited for scientific use by the Euclid Collaboration, and it will also be released to the community for legacy use.
The pipeline currently implements two photometric techniques (aperture and template-fitting, performed with Euclidspecific versions of two public software tools, a-phot and t-phot respectively; see Merlin et al. 2015Merlin et al. , 2016Merlin et al. , 2019)), and a module to estimate so-called CAS morphological parameters (Concentration/Asymmetry/Smoothness: non-parametric morphological features that can be used to distinguish between discs, ellipticals, compact, diffuse, symmetric/asymmetric or clumpy objects by means of a dimensional reduction, see Conselice 2003).However, the pipeline is foreseen to also include a profile model-fitting algorithm.The Euclid Morphology Challenge (EMC) was organized with the aim of analysing and comparing the performance of various model-fitting software tools on Euclid data, in order to establish the foundations for choosing the tools that will be integrated into the official processing pipeline.The final choice will be driven by many factors, including computational performance, robustness of the algorithm, and compatibility with the current version of the pipeline; however, the accuracy of the parameter estimates will of course be the main driver.Therefore, assessing the performance of the different software packages on simulated data, for which the ground truth is known, is a necessary and fundamental step for a sound selection.Eight development teams of model-fitting software packages were invited to participate to the challenge, and five provided at least partial results.
In this paper we present the data set created for the EMC, and we discuss the results concerning photometry.In fact, albeit not being the central focus of the challenge, flux measurements obtained via model-fitting techniques will have great relevance, providing a crucial complement to the more straightforward methods already included in the pipeline.A companion paper is dedicated to the analysis of such morphological estimates (Euclid Collaboration: Bretonnière et al. 2022, EMC2022b hereafter).e-mail: emiliano.merlin@inaf.itThis paper is structured as follows.Section 2 describes the technique used to create the simulated data set for the Challenge, with some technical details given in Appendix A. In Sect. 3 we briefly present the software tools taking part in the Challenge, and in Sect. 4 we describe the methods used to analyse and rank the data provided by the participants.The results are then presented in Sect.5, where we investigate the general accuracy of the photometric measurements, and the reliability of the estimated uncertainty budgets, with a further focus on each software package's performance given in Appendix B. Finally, Sect.6 presents a summary of the work and provides conclusions.
All magnitudes are given in the AB system.

Simulating the Euclid universe
Simulated data sets are being produced and used by the Euclid Science Ground Segment to test the full processing pipeline from image reduction to data analysis.These simulations consist of raw single exposures, including observational features and defects, and they must be processed and stacked to reach the nominal depth and be ready for scientific analysis, with background light and defects removed.To simplify this complex procedure, and to have all details under control, for the EMC we decided to produce a tailored data set, directly simulating backgroundsubtracted images at the expected nominal depths of the final stacked mosaics in all bands.With this approach, we were also free to try different options, producing simulations with single and double Sérsic analytical profiles, and also with realistic morphologies.In this section we explain the procedure we followed to obtain all these simulated data sets.

Catalogues and images creation
We started by creating mock cosmological catalogues with the code Egg (v1.3.1, Schreiber et al. 2017).Egg uses the statistical distributions of real galaxies as detected and classified in the five CANDELS fields (Grogin et al. 2011;Koekemoer et al. 2011) to build a simulated catalogue of a patch of the sky, complete with the properties of the objects as observed by a chosen set of passband filters, with a chosen pixel resolution, and to a chosen limiting magnitude.We refer the reader to the paper describing the code for a detailed description of its workflow; here we provide a short summary.Egg draws redshifts and stellar masses from observed galaxy stellar mass (M * ) functions, and subsequently attributes a star-formation rate (SFR) to each galaxy from the observed SFR-M * main sequence; dust attenuation, optical colours and simple disc-plus-bulge morphologies are obtained from empirical relations established from the high-quality Hubble and Herschel observations of the CANDELS fields.Random scatter is introduced in each step to reproduce the observed distributions of each parameter.Finally, based on these observables, a suitable panchromatic spectral energy distribution (SED) is selected for each galaxy and synthetic photometry is produced by integrating the redshifted SED over the chosen broad-band filters.The galaxies are created as two-component objects, with a bulge and a disc both described by a Sérsic (1968) profile, where Sérsic indices for the bulge and disc components are n bulge = 4 and n disc = 1.The output catalogue contains the physical and observed properties of the galaxies within a field of view (FoV) corresponding to the chosen area; the objects are placed at random positions with a fixed angular two-point correlation function, neglecting large-scale clustering beyond 3 (i.e., beyond ∼1 Mpc at z > 0.5).
Fig. 1.Filter transmission curves used to simulate the images (normalised to arbitrary units) in this work.The four filters I E , Y E , J E , and H E form the Euclid set, while u, g, r, i, and z are external complementary filters (from LSST/VRO).The NIR curves are early estimates of the actual transmission functions, which are described in their latest updated version in Euclid Collaboration: Schirmer et al. (2022); however, since we only use these data to compute the input fluxes, the differences are not relevant for the present work.
We created five catalogues, each one with a size of 0.482 deg 2 (41.66 per side, which is comparable to the typical area on which each single photometric catalogue will be extracted from real data), with limiting magnitude I E = 27.1 (the nominal 1σ limit as given in the mission RedBook).The total number of simulated galaxies was about 1.5 million.We then used an Egg built-in script to obtain all the observational properties of the sources.In particular, for each galaxy the following parameters are given: position of the centroid in pixels; total flux in the simulated band; bulge-to-total flux ratio (b/t); scale length of the bulge and of the disc (defined as the radius at which the component is a factor of e less bright than it is at its center); axis ratio for both components; and position angle for both components.For the first of the five fields (F0), we produced nine lists, to include a full multi-wavelength realisation of a Euclidian sky patch: one for each of the four Euclid bands I E , Y E , J E and H E , plus five for the Rubin/LSST bands u, g, r, i and z.The filter transmission curves are shown in Fig. 1.For the other four fields (F1-4) we only produced the I E list, since the main purpose of these simulations is the morphological analysis, which with real data will mostly be performed on the I E images, given that it will be the band with the highest resolution and depth.We point out that in the multi-band realisation the morphological parameters do not change across the spectrum, while total fluxes and b/t do; this information was not explicitly shared with the participants.
We fed these catalogues to GalSim (Rowe et al. 2015, v2.2.1), a Python package that produces simulated astronomical images (it is also used for the official Euclid simulations).We created the images at their expected native pixel scale: 0 . 1 for I E (25 000 × 25 000 pixels); 0 .3 for NIR bands; and 0 . 2 for LSST bands.After the procedure described in the following paragraphs, we resampled all the images to the I E pixel scale, since this is the procedure that will be followed in the real pipeline.To simulate the effects of point spread functions (PSFs), we provided GalSim with the Euclid Mission Database models for the I E and NIR bands (as provided by the corresponding Euclid working groups; both of them are over-sampled by a factor of 6), while for LSST we provided custom simulated PSFs created using PhoSim (Peterson et al. 2015), at the expected observed pixel scale of 0 . 2 (with no over-sampling).The approximate FWHMs of these PSFs are 0 .17 (I E ), 0 .54 (NIR), and 1 .00-1 .12 (LSST).
We produced noiseless galactic profiles with GalSim, with pixel values in µJy/pixel.We simulated three different sets of images, all having identical sets of coordinates and total fluxes of the galaxies, which we describe in the following; see also EMC2022b for further details on this.
-Double Sérsic profiles (DS), directly using the standard output of Egg, which consists of a catalogue formatted to be used with SkyMaker (Bertin 2009).In particular, this means that the dimensions of the objects are given as scale lengths.
On the contrary, GalSim requires half-light radii; the two values coincide for the bulges, while the conversion factor is 1.678 for the discs (see e. -Realistic morphologies (RM), in which galaxy stamps are created by means of a neural network using a variational auto-encoder trained on observed COSMOS galaxies, as described in full detail in Lanusse et al. (2021) and Euclid Collaboration: Bretonnière et al. (2022); each simulated galaxy mimics the properties of its corresponding analytical realisation.In this data set, the biggest and brightest objects (r e > 0 .2, I E < 20.5) are not simulated due to technical limitations; the list of excluded sources was provided to the participants, and accounts for approximately 1% of the total simulated galaxies.Also, the position angles of the galaxies are not constrained to be close to those of the Egg catalogue (and therefore they were not considered in the final analysis of the results).We point out that this is the first time that such a demanding test has been performed: the codes must provide an analytical fit on non-analytical shapes for which a ground-truth value is known.This is inherently a very challenging task.Moreover, the method used to create the images is not perfect.The conditioning of the latent space with galaxy morphology is not always exact, and this can introduce a systematic bias with respect to the input values (see the discussion in Euclid Collaboration: Bretonnière et al. 2022), although the consistency is fully guaranteed in a statistical sense; some level of scatter remains on a objectby-object basis, meaning that the comparison with the input catalogue must be taken with caution.For more details, see EMC2022b.Figure 2 shows the redshift distribution as a function of I E input magnitude.Egg outputs the redshift of each simulated galaxy, and although this information is not explicitly used in the present work, it might nevertheless be useful to have an idea of the global distribution; indeed, most of the analysis and figures will be presented as a function of the input I E magnitude, which correlates with redshift.For example, looking at the plot one can see that galaxies at z = 3 typically begin to be detectable at I E = 23.Figure 3 shows the distribution of simulated galaxies in the magnitude-size-b/t space for the same field (note how there is a non-negligible fraction of bulge-only objects with b/t = 1).Finally, in Fig. 4 we show the distributions of various input parameters for the I E band components and realisations (magnitude, effective radius, axis ratio, Sérsic index, bulge-to-total ratio and redshift), for the full data set (the five fields), down to the nominal 5σ limit I E = 25.35 (Laureijs et al. 2011); we also show the COSMOS distributions (Mandelbaum et al. 2012) for reference.As expected, simulations and observations agree remarkably well, with the exception of the b/t distribution, which is more skewed towards disc galaxies in the Egg catalogue.This might have some impact on the analysis of the results (see Sect. 5).By construction, we could not simulate irregulars, which are estimated to constitute less than 10% of the galaxies at z < 1, but up to 70% at z = 3 (e.g.Huertas-Company et al. 2015, , although recent preliminary results from the James Webb Space Telescope seem to indicate a lower number).This is an obvious but unavoidable limitation of this work.The RM realisation can provide a hint about how model-fitting codes can deal with nonanalytical shapes.
We then added a field of stars, to include the effects of their presence as contaminants in the fitting procedures.To obtain a realistic distribution, we took their celestial coordinates (converted to pixel positions) from one FoV of the official Euclid simulations used for Scientific Challenges,1 and simply placed PSF stamps at the positions of the sources, scaling their flux to match the catalogue magnitudes.We excluded very bright stars (I E < 15), in order to avoid that large regions of the simulations were affected by their presence, and also because -given the limited extension of the PSF stamps -they would saturate creating artificial defects on the images.The fraction of pixels significantly contaminated by stellar light (that is, where the surface brightness from stellar light is more than the 1σ surface brightness per pixel) is approximately 1% in the I E images, 5% in the NIR bands, and from 1% to 7% in LSST bands (u and z respectively).

Observational noise
Once the seed images containing the sources were produced, we proceeded to add simulated observational noise.First of all we replaced the smooth analytical profiles with stochastic realisations from a Poissonian distribution, to simulate the effects of photon shot noise.We used the Python module scipy.stats.poisson.rvsfor this purpose.We paid particular attention to keep the units of the images always consistent during the whole process: we first converted the noiseless images from µJy/pixel to observational units, using the correct image observational ZP at 1 second, to obtain consistent Poissonian realisations, which depend on the total exposure time.Only after this step did we convert the images back to µJy.To calculate the ZPs we followed the procedure described in Euclid Collaboration: Martinet et al. (2019), which we summarize in Appendix A.1.The same method was also used to produce an empty sky map containing only Gaussian noise, simulating the observational background at the desired depth.Since the images were simulated with zero background light, the Gaussian noise must have zero mean, and the standard deviation of the pixel values defines the depth of the final simulated image.The values used in this procedure are summarized in Table 2.1.1.The exposure times used in our simulations the for I E and for the NIR bands were taken from Laureijs et al. (2011), and it is worth pointing out that the actual in-flight values will be slightly different (4 × 560 seconds for I E and 4 × 88 seconds for all NIR bands).The exposure times for the LSST bands were estimated from simulated data created for one of the internal validation Scientific Challenges, and are representative of an early release (LSST final data after ten years of observations will of course be much deeper).The limiting magnitudes and background surface brightness values, which are needed in the computations, are set to be consistent with the expected values for the Euclid Wide Survey, and they were taken from a dedicated study by J. C. Cuillandre (priv.comm.);these too are slightly different from the current best estimated values, which can be found in Euclid Collaboration: Scaramella et al. (2022).These small inconsistencies are due to the fact that the present work began before the latest estimates had been made available; however, they have negligible impact on the scientific results of the EMC.It is worth pointing out that the images were produced with homogeneous noise levels, i.e we did not simulate regions of different depths.
Finally, we summed each image containing the Poissonian realisations of the galaxies and stars with the corresponding 'empty' Gaussian sky noise.

Rebinning
As mentioned, we simulated all the images at their native pixel scales (0 . 1 for I E , 0 .3 for NIR, and 0 . 2 for LSST), and we then rebinned F0 NIR and LSST images to the I E pixel scale using Swarp (Bertin et al. 2002).This is consistent with the real pipeline workflow, with some differences: in the pipeline, a newer version of the software named Swarp++ is used, and single-exposure images are combined to create the mosaics, allowing information to be gained in the process.We also rebinned the PSFs accordingly.All the rebinning processes were performed using the BILINEAR interpolation mode.We note that this resampling procedure introduces artifacts in the noise map (in particular, pixel correlations) that alter the apparent signalto-noise ratio (S/N) of the map, so the actual uncertainties of the measurements must be computed using a dedicated RMS map, which we discuss next.

RMS maps
It is common practice to assign uncertainties to the measurements performed on scientific images by means of a weight or an RMS map, which is often obtained from first principles during the data reduction chain.When this is not possible, it can be easily determined, at least to a first approximation, by measuring the RMS of the pixel values in 'empty' regions of the (non-rebinned) science frame -although such a measurement only provides information on the noise due to the unresolved sky background.For the sake of the EMC goals, we wanted to factorise out any possible source of complication, and therefore ready-to-use RMS maps were provided to the participants, along with the scientific images.This is also again consistent with the pipeline architecture.The procedure to build the RMS map is descibed in Appendix A.2.
The RMS maps were produced at the native pixel scales of the scientific images, and we checked that the pre-resampling S/Ns are consistent with the expected values.This is shown in Fig. 5, where we plot the S/N estimated for each simulated source and check that it is equal to the expected value at the limiting magnitude (star symbols); for this test we used a-phot, forcing the measurements within 2 apertures (as for the definition of S/N adopted in this work) at the true input positions of the sources.Note that some distributions of points are overlapping (the three NIR bands have the same expected depth, and so do two of the LSST bands).The overall agreement with the expected values is very accurate.Finally, we proceeded to resample the maps along with the scientific images, again using Swarp, checking that the S/N values of the resampled images are correct when the RMS maps are used to estimate uncertanties.

Challenge set-up
The scientific and RMS images were finally uploaded to a private online repository for the participants to download, together with lists containing the IDs and input positions of the centroids of the simulated objects down to various nominal S/N levels (100, 10, 5, 1σ).This was done to factor out possible inaccuracies in object detection and deblending, so that the challenge could be actually focused on the accuracy of fitting photometry and mor- phology without adding any further possible source of error.We point out that to obtain the lists we simply applied different cuts to the input I E magnitudes, so they must be considered as coarse reference levels rather than accurate estimations of detection significance.
As a visual example, Fig. 6 shows a small crop of the DS F0 images in all nine bands, while Fig. 7 shows a small crop of the DS, SS and RM F0 I E images.A sample of the ground-truth input values of the simulated galaxies was also provided to the participants (in particular, a small portion of the multi-band F0 data set and the whole F4 I E data set), allowing for a check that their procedures were reasonably correct without evident errors.
The output requested from each participant consisted of the estimates of (i) flux, (ii) Sérsic index (in the SS and RM realisations), (iii) half-light semi-major axis, (iv) axis ratio, and (v) position angle, each with a corresponding 1σ uncertainty, for each component of the simulated objects.In particular, while the SS and RM realisations only required a single fit with free n, for the DS realisation we asked for two estimates, namely: one fit with fixed indices (n disc = 1 and n bulge = 4, consistent with the way the images were simulated); and another with n disc = 1 and n bulge left free to vary.
As mentioned, the output estimates were required for the objects belonging to the list including only S/N > 5 (in I E ) sources.Analysis of objects at lower S/N were explicitly mentioned as an optional output that would not influence the final comparison among the software packages.

Model-fitting software packages
Eight development teams of model-fitting software packages were invited to participate to the Challenge; of them, five (DeepLeGATo, Galapagos-2, Morfometryka, ProFit, and SourceXtractor++) provided at least partial results.All but one are based on parametric methods, using functional forms to fit the observed light distributions, the exception being DeepLeGATo, which exploits a convolutional neural network (CNN).
Here we briefly summarize the basic properties and features of these five software packages, and point out a few important details about the procedures each one followed.It is instructive and important to notice how various subtleties in the interpretation of the requests, usage of the provided input data sets, and processing methods used by each participant led to differences in the format and accuracy of the provided outputs.

DeepLeGATo
DeepLeGATo (Tuccillo et al. 2018) is a software package for estimating galaxy structure based on a supervised deep-learning approach.The code uses CNNs to perform a simple regression between an image centered on a given galaxy and its structural parameters, providing results in very short times (see Appendix B.1).In the version used in this work, the training was performed with images of fixed size (128×128 pixels) independently of the galaxy effective radius; this likely caused sub-optimal performance on the largest and brightest galaxies.The images used for training were not the ones provided with the EMC data-set; instead, they were idealised analytic 1-or 2-component Sérsic profiles, convolved with the PSF of the corresponding band, to which noise having similar properties to that in the EMC data was added.A different training was performed for each structural parameter with slightly different architectures, which are variations of standard CNNs (see Tuccillo et al. 2018, for more details).The fits were performed at the positions of the sources as given in the input files.
Because the loss function used for training is a standard mean square error, only point estimates were provided; therefore in the implementation used for this work the uncertainty budget (i.e. the uncertainties on the estimates) was not computed.Noticeably, the 5σ input source lists were subdivided into S/N bins using the other input catalogs (10 and 100σ), and different parameters were used for the fits in different bins; this can be seen in the distributions of the points in the plots (see Appendix B).While this makes the version of the software used in the Challenge not directly suitable for an implementation in the Euclid pipeline, because the S/N of real sources cannot be known a priori, it is worth pointing out that the code remains under development, and more recent releases work without the need for this fine-tuning of the parameters for different input data.Only I E SS and DS fits were provided.

Galapagos-2
Galapagos-2 (Häußler et al. 2022) 2 is an updated and enhanced version of Galapagos (Barden et al. 2012).It provides a wrapper around either Galfit (Peng et al. 2002(Peng et al. , 2010) ) for singleband fits, or GalfitM (Häußler et al. 2013) for either singleor multi-band fits, and it is specifically designed to carry out fully automated fitting on all objects in a large survey.Starting from the input images and a simple setup file, it employs SExtractor (Bertin & Arnouts 1996) for object detection, and then uses this information to automatically set up the fits.The postage stamp size used for each fit/object depends on the estimated size of the object, set up by the user.Using some limited input from the setup, e.g.enlargement factors to conservatively increase the size and shape of the object estimated by SExtractor, Galapagos-2 takes care of neighbouring objects (deblending and fitting of bright nearby objects, masking of fainter and more distant objects), estimates the sky background level with a sophisticated and robust scheme (see Barden et al. 2012, for details), and sets up the fit using SExtractor values as initial guesses to run the fitting algorithm GalfitM for all ob-  jects, starting from the brightest objects and using the PSF provided.Once an object has been fit, it is merely subtracted from the fits of nearby objects.This significantly speeds up the process overall, as these bright, large objects take the longest to fit, but this only needs to be done once.In a fully automated pipeline, it then reads out the result and provides one final catalogue, which contains fitting information for all bands.GalfitM itself uses a Levenberg-Marquardt minimisation to derive the best-fit parameters and uncertainties.In the multi-band realisations of the EMC, all bands are connected via physically reasonable polynomials and fitted simultaneously, to reduce the degrees of freedom of the fit and make full use of the multi-wavelength information.The software requires one additional input image compared to what was provided, namely a weight image to flag bad pixels.Since no bad pixels were in the data, this was trivially created as a uniform image of the correct size.
All the requested outputs were provided; however, for DS F0 only a simultaneous fit was run, therefore including I E in the multi-band fitting process -in other words, there is no isolated fit for I E DS F0.This causes the I E fit for F0 to be substantially different from that of other fields; for this reason, it was decided not to include F0 in the analysis of I E -only results for all codes (see below).

Morfometryka
Morfometryka (Ferrari et al. 2015), written in Python, was primarily designed to measure non-parametric morphological quantities, but as a bonus it performs single Sérsic model fitting.The software takes as input a galaxy stamp (plus the PSF model), estimates the background with an iterative algorithm, segments the sources and defines the target.Then, it filters out external sources using the code GalClean (de Albernaz Ferreira & Ferrari 2018).From the segmented region it calculates basic geometrical parameters (e.g.centroid, position angle, axis ratio) using light-profile moments.Then it performs photometry, measuring fluxes within ellipses with the aforementioned parameters (contextually masking out point sources over the ellipse annulus with a sigma clipping criterion).From the luminosity growth curve it establishes the Petrosian radius, inside which all the measurements are made.The Sérsic fit is performed on the 1-D luminosity profile; for robustness, the 1-D outputs are used as inputs for a 2-D Sérsic fit of the galaxy pixels.Finally, it measures several morphometric parameters, e.g.concentrations, asymmetries, Gini and M20 (the former is a coefficient quantifying the inequality among values of a frequency distribution, in this case of pixel values; the latter is the second order moment, i.e. the flux values weighted by the their square distance to the center, of the 20% brightest pixels; see Lotz et al. 2004), entropy, spirality, curvature among others).In a forthcoming version, the luminosity profile curvature (Lucatelli & Ferrari 2019) will be used to provide a more robust input to a parametric model-based fit of the light profile, eventually replacing the 1-D Sérsic fit as a metric, mainly to mitigate a long lasting problem of Sérsic index determination (see discussion in EMC2022b).Only the I E SS fit was provided.

ProFit
Profit (Robotham et al. 2017) is a software package designed to perform Bayesian two-dimensional photometric galaxy pro-file modelling.It consists of a low-level C++ library accessible via a command-line interface and documented API, along with high-level R (v.3.6.1) and Python interfaces.The fitting process for each object starts running the source finder ProFound (Robotham et al. 2018) 3 on a 500 × 500 pixel cutout centred on each target, to create a segmentation map and find nearby sources requiring simultaneous modelling; the output also provides some reasonable initial guess for the profile solution.The actual fitting is then performed by the Highlander core software 4 , which combines a genetic algorithm step with a CHARM (Turchin 1971;Smith 1984) Markov chain Monte-Carlo (MCMC) process, repeated twice; each one is run for 100 steps (where model realisations are modified by the number of free parameters also).The CHARM algorithm is particularly useful on highly covariant parameter search, but it is computationally expensive, because a single iteration requires sampling all parameters.Since the kind of fitting used for ProFit is relatively low in the number of parameters, but sometimes quite highly covariant in the posterior, CHARM has proven to be a powerful exploration tool.The provided solution is the combination of parameters that generate the maximum likelihood given the per-pixel Data -Model residual.The parameter priors are implicitly assumed to be uniform.Errors are estimated from the final MCMC run, with full covariance matrix information available.
All requested data was provided.Partially building upon the effort put in the EMC, the whole ProFit pipeline has recently been developed into a new package, ProFuse (Robotham et al. 2022).

SourceXtractor++
SourceXtractor++ (Bertin et al. 2020;Kümmel et al. 2020) 5is a ground-up re-write of the widely used SExtractor2 software (Bertin & Arnouts 1996), written in C++ with a strong focus on extensibility and model-fitting photometry; the software is under active development and the results submitted to the challenge represent a snapshot in this process (the version of SourceXtractor++ used in the EMC is 0.12).
Each SourceXtractor++ run includes two stages, detection and measurement.The detection stage follows the same procedure as used in SExtractor2.Detection parameters need to be optimised for a compromise between the completeness of the true object list and the number of spurious objects extracted or deblended; over-extraction of sources impacts the performance of the run-time required, and may also reduce the accuracy of morphological measurements if objects are over-deblended.The parameters for the EMC were tuned aiming for good overall performance, and therefore not for reaching 100% completeness of the input 5σ source list.The SS and DS simulations produce slightly different distributions in apparent extent and surface brightness for the galaxy images, and so the parameters governing detection were optimised separately for the different simulations.
Measurement in one or several bands is controlled via a Python configuration file with flexible model fitting at its core, which allows for the simultaneous source analysis over a large number of FITS files with different pixel grids.Various components (point source, exponential disc, free Sérsic , etc.) can be used individually or in combination; reasonable priors must therefore be provided to the fitting engine in order to cover the range of parameter values and provide sensible fits.The chosen priors for the EMC are described in Appendix B.5.2.
Noticeably, the SourceXtractor++ pipeline used for the EMC includes a pre-processing of the images, namely the extraction and usage of PSFs from images, performed using the PSFEx software (Bertin 2011), and a re-downsampling to the original pixel scales of the NIR and LSST images (0 . 3 and 0 .2, respectively).This procedure was allowed by the guidelines of the EMC, given that no additional input data were used.However, given that no other participants did anything similar, we decided to also check the performance of the package on standard, non-pre-processed data, finding overall good agreement with a few differences that should be taken into account when considering the overall processing cost of the pipeline.We discuss this in Appendix B.5 (see also EMC2022b).Additionally, SourceXtractor++ priors were obtained by comparing the output distribution of a given morphological parameter with the equivalent distribution for the provided samples of the input true catalogues; the priors on parameters were iteratively adjusted under the constraint that a simple analytical transfer function is required to map each distribution to a Gaussian.Each parameter was calibrated independently, without including covariances; only the statistical distributions were used (i.e.there was no object-by-object comparison in the process).A detailed description of the calibration of the priors is given in Appendix B.5.2.All requested data were provided, except for the multiband DS fit with free n bulge .

Diagnostic metric
Given the high dimensionality of the output data, a straightforward comparison of the results was not feasible.In order to obtain a reasonably comprehensive overview of the quality of the performance, we defined an ad-hoc metric.The participants provided catalogues that were matched to the input ones by means of the unique ID of each source.Then for each run we proceeded to estimate the difference between the input and the measured fluxes of each object, and computed averaged statistical diagnostics.
Importantly, to compute such statistics we used a subset of sources from the 5σ-limit list, including only those for which all software packages provided a meaningful fit.Recall that the default request was to provide a fit for all the sources in the nominal 5σ list; with some minor caveats and exceptions mentioned in Sect.3, all participants obliged to this, and some also provided results for lower S/N sources.However, not all the fits were successful (i.e.some were given as NaN or default values in the output catalogues), and some were flagged as 'bad' or 'unreliable' in one or more codes.In general, these sources were not included in the lists used to evaluate the accuracy of the results, which therefore only included the objects for which all the software packages had provided a reliable fit.An important exception is the case of Galapagos-2, which outputs several quality flags describing which component of a galaxy can be considered as reliably fit.In particular, for the DS runs a total of five flags were provided: USE_FLAG_SS indicates that the single Sérsic fit is usable, as it did not run into fitting constraints; USE_FLAG_BULGE_CONSTR and USE_FLAG_DISK_CONSTR serve the same purpose for the bulge and disc components, respectively; in addition, USE_FLAG_BULGE_BRIGHT and USE_FLAG_DISK_BRIGHT indicate whether the bulge and the disc are relatively bright enough (b/t > 0.2 and b/t < 0.8, respectively), that their fit could in general be trusted, with the additional difficulty that b/t itself is defined via such a fit.However, all of these flags were ignored in our analysis, to avoid excessive complications in the definition of the common set of fitted sources within the submissions.While this choice certainly impacts the statistics of the results, because galaxies are taken into account that are known to violate fitting constraints (they are 4% and 13% of the total number of objects, for single and double component fits, respectively), we found that the effects on the overall analysis was marginal.We stress that a general user of Galapagos-2 shall consider these flags, according to their purposes; a thorough description of the flags is provided in Häußler et al. (2022).See the Appendix of EMC2022b for a comprehensive discussion on this topic.
To estimate the impact for each software tool, an additional term evaluating the completeness fraction of the output catalogues with respect to the full 5σ list was included in the global metric, as described below.
To build the metric, we started considering the relative flux difference of each object with respect to the input true flux, i.e. δ f = ( f meas − f true )/ f true .We then used δ f to evaluate three diagnostics: the bias B; the dispersion D; and the outlier fractions O.In summary, the three diagnostics were first averaged over the sources belonging to bins of input magnitude (we used 15 bins to divide the full interval of simulated magnitudes, from 14 to 28), to quantify the impact of S/N; then these averages (normalised with weighting factors) were summed, and further combined with the completeness C diagnostic, finally yielding a global score S for each field and realisation.For I E , the values computed for each simulated field were finally averaged to obtain a single figure; while this is not strictly correct from a statistical point of view, given that the results in the different fields were very similar we assume that the outcome is sufficiently accurate.In more detail, the four quantities were defined as follows.
-The bias B bin is the median of δ f in each bin of input magnitude, computed considering only the objects having |δ f | ≤ 5σ true , where σ true is the standard deviation of δ f for an ideal distribution of fluxes, that we obtained by perturbing the input true values with a random realisation of observational Gaussian noise consistent with the expected depth of each image (we imposed a minimum value corresponding to δ f = 0.02 (2%), to avoid unrealistically small values of σ true at the bright end).An unbiased measurement would yield B bin = 0. We then define the average value of the bias as the weighted mean of its values across the magnitude bins, B = bins w bin |B bin |, where w bin is a weighting factor given by the fraction of objects in each bin of true magnitude (to give more weight to highly populated bins) multiplied by the logarithm of the median S/N in that bin (to give more weight to the fit of bright objects).Note that while the values of B bin can be positive or negative, B is defined to be positive.-The dispersion D bin is the ratio between σ meas , i.e. the standard deviation of the distribution of δ f (again only including objects within 5σ true ) and σ true , in each magnitude bin.The average dispersion is defined as D = bins w bin D bin .-The outlier fraction O bin is the number of objects having |δ f | > 5σ true divided by the total number of fitted objects in each bin.These objects fall outside the expected distribution, and we assume that their large bias is due to some systematic error in their measurement (e.g.strong contamination from neighbours, or catastrophic failure of the fit).Therefore they were not included in the statistics of "well-behaved" sources, and were instead isolated into a separate diagnostic.The average outlier fraction is defined as O = bins w bin O bin .
-The completeness C is simply the number of objects for which a successful fit was provided, divided by the total number of objects in the input list of S/N > 5 sources (we do not weight this quantity by the magnitude bins).A galaxy is considered not to be fit if there is no entry in the provided output catalogue, or if the challenge participant flagged that galaxy as a 'bad fit' (see discussions above).Each software package has different ways of identifying unreliable fits, and we refer the reader to the publications describing each code for additional information.Here, we simply trusted the participants' verdict on the reliability of their fits.
We point out that the definitions used in EMC2022b are very similar, but since it is difficult to construct meaningful expectations for ideal perturbed distributions of morphological parameters (corresponding to the σ true we use here for the fluxes), some differences were introduced.The interested reader should therefore pay attention to these details.
Finally, the global diagnostic S for each run is defined as where we subtract 1 from D when computing the final global statistics, because when the dispersion is 'ideal' the ratio with σ true is 1, and we want the value of all diagnostics to be close to zero for ideal fits.In this expression, the k factors are multiplicative constants assigned to each of the three diagnostics in an attempt to reasonably weight their relative contributions.
We chose k B = 20.0,k D = 0.6667 and k O = 5.0.While these choices are to some extent arbitrary, it is worth pointing out that the large differences in these values do not reflect the actual weight given to each diagnostic; on the contrary, they were chosen exactly to try and reach a reasonable balance between the three weighted quantities.We argue that a fit might be defined as 'optimal' if it has e.g.C = 1.0 (100% completeness above 5σ), B < 0.015 (1.5% median bias), D < 1.33 (dispersion no larger than 4/3 of that from the perturbed true fluxes), and O < 0.1 (10% of outliers) in all bins of magnitude; in the case of these exact values, applying the chosen weights one gets S 0.3 + 0.22 + 0.5 = 1.02 (the bin weights w bin are not relevant here).So, we see that if S ≤ 1 the fit can be considered as optimal; S ≤ 1.33 is very good; 1.33 < S ≤ 1.67 is good; and 1.667 < S ≤ 2.0 is acceptable.Finally, with this metric values of S much larger than 2 indicate a bad overall fit.Note that when marginalizing the contributions for a 100% complete fit, S = 2.0 can be due to: a 10% overall offset; a 3σ true standard deviation; or a 40% outlier fraction.
The diagnostics were evaluated automatically by means of Python scripts, but the results were also visualized graphically, to allow for sanity checks and for a quick grasp of any particular features.Figure 8 shows an example of a diagnostic plot that we used to analyse one of the provided output catalogues; similar plots were created for the outputs for each field and realisation that each participant provided.Each dot is a single fitted galaxy, and its δ f is plotted against its true input magnitude in the considered band.The dots are colour-coded by the true bulge-tototal ratio (which for the SS and RM realisations is a proxy for the Sérsic index, n = 3 b/t + 1).For each bin of magnitude, the median, standard deviation, and outlier fraction of the distribution were computed, and the values were then used to compute the diagnostics described above; the dotted lines show the 1σ true and 5σ true levels.Specific examples are given in Appendix B.
Because of a technical problem not related to the performance of the code and gone unnoticed during the run, the processing of F1 by DeepLeGATo was interrupted before the end   5.
of the input list, and the corresponding catalogue was therefore incomplete.To ease the comparisons, and considering that the problem was not caused by a bug in the code (since the processing of all the other fields ended smoothly), we decided to remove F1 entirely from the analysis process, after checking that this would not favour one of the codes with respect to the others.
Finally, as already mentioned the Galapagos-2 runs on DS F0 were performed in a simultaneous multi-band fit, including I E with the other bands; this caused the results to be significantly different from those in the other four I E fields.To avoid any impact on the evaluation, and considering the many different approaches of the other participants (DeepLeGATo only provided the I E fits, Morfometryka did not provide the fit, ProFit only provided non-simultaneous fits, and SourceXtractor++ provided both a simultaneous and a separated fit), we decided to remove F0 from the analysis of I E DS.
In summary, I E SS and RM were analysed by averaging F0, F2, F3, and F4 (resulting in 212 000 objects for SS, and 204 229 for RM, where the bright galaxies were not simulated); I E DS by averaging F2, F3 and F4 (207 064 objects); and the other bands on DS F0 alone (because it was the only field simulated with a multi-band data set; it contains 70 700 objects).

Results
In this section we discuss the results obtained with the different software packages.First of all, it is worth pointing out again that the complexity of the challenge caused a significant scatter in the interpretation of its goals and spirit by the participants.This caused substantial differences in the adopted approaches, level of processing and output formats between them.Together with the high dimensionality of the data set, this makes a direct and comprehensive comparison of the results very challenging.In other words, the strategies and techniques adopted by the participants influenced the overall accuracy of the provided output, and this must be taken into account in the analysis, to ensure a fair overview of each code's capabilities and limitations.Nevertheless, we believe it is possible to draw some interesting general conclusions from the comparison.We will describe the overall outcomes in the following, with some particular cases discussed in more detail, when necessary.
Individual diagnostic plots for all the different runs are available in an on-line interactive tool. 6Further discussions on the results provided by each participating team are provided in Appendix B, together with a summary of the computational times and memory workload required by each software package.

Global outcome
In the following we separately analyse the three realisations SS, DS, and RM.We separate the multi-band data set from the I Eonly DS fits, since the results are significantly different (we identify the multi-band case with the addition of the letter 'M' to the acronym DS whenever necessary).The values of the global metric S for each realisation are listed in Table 5 (where the values of the completeness factor C are also reported, to allow for a better evaluation of its impact), and shown in Fig. 9.A visual summary of the values of each diagnostic used to compute S (i.e.B, D and O), averaged over the magnitude bins and fields, is given in Fig. 10.For the multi-band case, these values are the average of the ones obtained in each of the nine individual bands.Note that in these plots the values are shown before the weighting by the k factors in Eq. 2, so the relative difference between the values obtained by any two codes for the three diagnostics does not straightforwardly reflect their final difference in S.
Because the global diagnostic quantities only provide a crude overview of the results, being averages over the input parameter space obtained with arbitrary weights, in Sect.5.2 we also present a collection of summary plots (Figs. 11 to 14), showing the trends of the diagnostics as a function of the input magnitudes in all the cases of interest.
We want to begin emphasizing that each code proved to have points of strength and of weakness, so the comparison of the global score S and of its factors is only intended as a quick overview, and should by no means be taken as a rigorous evaluation and ranking of the software packages.We fully acknowledge that this is a simplified view and therefore alternative metrics, tailored to specific science cases or assigning different weights to the considered diagnostics, could result in different conclusions.That said, we can claim that with our metric all software packages provided acceptable to good results in at least some of the realisations.Some differences are present in a few cases (e.g. a particular realisation or band, the faint end of the simulated distribution of galaxies, etc.), but the outputs provided   are typically fairly accurate.Unsurprisingly, in the DS realisation the best results were obtained in the I E runs, for most of the software packages, given its high resolution and depth.The multi-band data set proved to be more demanding, because of the lower S/N and resolution of the images, and also of the resampling procedure which introduced noise correlations.All the participants reached at least 95% global completeness in the I E DS data set; in the other realisations, some lower scores were obtained (see the numbers in parenthesis in Table 5).
In the I E -only runs, the best global results by all codes were obtained on the SS realisation, with all software packages reaching values of S between 1.0 and 1.7, with the exception of Morphometryka (S = 2.54), penalised by a significant bias B caused by systematic underestimation of fluxes of bright bulgedominated galaxies, and of all objects at faint magnitudes (see discussions in Sect.5.2 and in Appendix B.3).On the DS realisation of the I E band, SourceXtractor++ and DeepLeGATo reached S < 2.0, although it must be recalled that the scheme adopted by the latter (dividing the sources according to their nominal S/N bin) introduces peculiar features in the distribution of δ f (see Appendix B.1); Galapagos-2 reached S 2.0, while ProFit was penalised by a large B caused by a strong overestimation of fluxes at faint magnitudes (see Appendix B.4).Interestingly, the fits with free n bulge were in general slightly better than those with fixed n bulge = 4 (except for Galapagos-2), despite the fact that the simulated galaxies had indeed n bulge = 4.
Finally, the results for the RM realisation were less accurate than the ones for the other I E images.This should be expected, given the inherently more difficult task of fitting analytical profiles on complex realistic morphologies.Interestingly, here the three codes that provided results have very similar trends; this seems to imply that when dealing with realistic galaxy shapes, the impact of prior calibrations, pre-processing of the images, and robustness of the algorithm become of secondary importance with respect to the inherent difficulty of the task.
The results for the multi-band data set show the evident impact of the different strategies followed by the participants.SourceXtractor++ (which included an image and PSF preprocessing pipeline, calibrated its priors on the provided example ground-truth data, and fitted all bands simultaneously) obtained optimal results, comparable to those for the I E band fitted alone.Galapagos-2 did perform a simultaneous fit, but without the image and PSF pre-processing obtained sub-optimal results.ProFit, which performed a separate fit on each band, had weaker performance due to not using information from the I E image in the multiband fits: faint galaxies detected in I E are likely to have very low S/N or could even be undetected in most NIR and LSST bands, and without the I E parameter to constrain the fit, it is very difficult to properly model their light profiles in these bands.This outcome is important for highlighting how the synergy with Euclid can significantly improve the accuracy of Rubin/LSST measurements, as already pointed out by several studies (see e.g.Rhodes et al. 2017;Capak et al. 2019).
It is worth mentioning here that any statistical result showing a dependence on the bulge fraction of the sources is biased by the low fraction of simulated bulge-dominated galaxies with respect to the real Universe distribution (see Sect. 2.1), so the overall performance on real data might be worse.

Trends of the diagnostics with input magnitudes
In Fig. 11 the trends of completeness C are shown for each software package as a function of I E in bins of magnitude, for the three realisations DS, SS and RM (we remind the reader that for the latter only objects with I E > 20.5 were simulated).
As one can expect, the fraction of successfully fitted objects decreases with increasing magnitude for all codes, since fainter objects are generally harder to detect and fit; the only exception is DeepLeGATo, for which bright objects are more often prone to failure.This is a consequence of the fixed stamp sizes that are used in the current version of the software, so that very bright (and large) objects may be larger than the stamp size.
Overall, down to I E 23.5 all codes successfully fit more than 95% of the galaxies; the completeness then typically decreases, with the noticeable exception of ProFit which always stays close to 100% for the SS and DS realisations.This is likely due to the different thresholds used to perform detection in the software packages (we recall that Galapagos-2 uses SExtractor and ProFit uses ProFound), leading to different efficiencies in the actual detection of sources.
In each panel of Figs. 12 to 14, the trends of one of the diagnostics B bin , D bin , and O bin are shown as a function of magnitude in the considered band, for all the software packages that have provided a fit in the relevant realisations.Some individual cases of particularly notable behaviour are described in more detail in Appendix B.

Bias in I E runs
In the I E DSb4 runs, for Galapagos-2 and ProFit, the typical absolute bias in the measured flux B bin is below 1% for bright sources (I E < 23), increasing to 10-15% at the faint end.This is likely due to contamination from nearby brighter sources, and/or to the inherent difficulty in fitting low S/N objects with analytical profiles.DeepLeGATo has a less stable trend, with slightly larger values of bias for intermediate and bright sources, but a lower bias at the faint end.Finally, SourceXtractor++ shows the most stable trend, without a strong overestimation at the faint end, despite a more pronounced average bias of 2-3%.
In the SS runs, in most cases we see more stable trends, with monotonic trends towards overestimation of fluxes with decreasing brightness reaching about 5% bias at the faint end.Noticeably, ProFit behaves differently, slightly underestimating intermediate magnitude sources before starting to overestimate at the faint end.An even more striking exception is given by Morphometryka (SS was the only provided output data set), which has a clearly declining monotonic trend, reaching The situation is completely different in the RM realisation, where all three of the codes that provided results have similar declining trends in the bias B, with faint sources typically having fluxes underestimated by about 10-15% at I E = 24-25.This is at odds with the results from the other realisations.We checked that this is not an issue in the simulations: while minor inconsistencies between the input true fluxes in the catalogue and the actual realisations of the sources in the images can be present because of the simulation method (see Sect. 2.1), we found that the impact of this is negligible, with a typical mean offset of about 0.05% and some scatter in the values that is not sufficient to explain the global trends of the measurements.We postpone further investigation on this topic to future work, given that it does not strongly impact the present analysis; the trend is almost identical for the three considered software packages, leaving the comparison among them essentially unchanged.

Dispersion in I E runs
The dispersion D bin for all codes is typically comparable to σ true at the faint end in all realisations.In DSb4, at I E < 23 there is a hierarchy of performance with DeepLeGATo reaching values around 3.0 (again probably because of the limited dimensions of the stamps), Profit reaching around 2.0, SourceXtractor++ 1.5 and Galapagos-2 going from 0.5 to 1.5.In SS this hierarchy is less pronounced with all codes, including Morfometryka, staying below 2.0 at all magnitudes; the evident exception is again DeepLeGATo.In the RM case again we see similar (and sub-optimal) trends for the three codes that provided results.

Outliers in I E runs
The outlier fraction O bin in DSb4 stays below 10% at all magnitudes for SourceXtractor, and goes from very low values to 20% at faint magnitudes for Galapagos-2.Profit reaches 30%, while again DeepLeGATo suffers from the limited dimensions of the stamps, reaching 100% outliers at the bright end, while remaining close to zero at I E > 23.In SS, SourceXtractor and Galapagos-2 stay below 10% at all magnitudes, while Morfometryka has large values (around 50%) at the bright end, likely because of a sub-optimal estimate of bulge-dominated sources (see Appendix B.3). Again in the RM case there are no major differences between the quality of the performance, except for SourceXtractor performing better on I E < 22 objects.

The free Sérsic bulge case
There is no substantial difference between the fixed and free n bulge cases in the DS realisation.We do not show the trends for the free n bulge case; as mentioned, the latter yields slightly better results than the fixed n bulge = 4 case for SourceXtractor++, reducing the bias at all magnitudes, and for ProFit, which has better trends for all diagnostics; for Galapagos-2, the bias is almost identical in the two runs, while the dispersion and the outlier fraction have opposite trends (the free case having many more bright outliers but, simultaneously, a lower dispersion ratio for the few "well-behaved" sources), resulting in a similar final score (see Fig. 10).

The multi-band data set
In the multi-band data set (for which only Galapagos-2, ProFit, and SourceXtractor++ provided results; see Figs. 13  and 14) the trends are generally similar to those of the I E case, with larger values of bias at the faint end for all codes in the NIR and LSST bands.In particular, Profit reaches B bin 1.5 in the NIR bands, and approximately 4 in the LSST u and g bands, likely because of the independent fits performed on low S/N sources (we do not show the corresponding points in the plots, for readability); there is also a particular trend in the LSST bands, with sources being underestimated at intermediate magnitudes before turning to a strong overestimation at the faint end (we were not able to find an easy explanation for this).Galapagos-2 and SourceXtractor obtain better (and similar) results in NIR thanks to the simultanous fit, reaching B bin 0.1; in LSST they behave similarly as well, with SourceXtractor++ performing clearly better only in the u band.Here the free n bulge fit generally yields slightly worse results in dispersion and outlier fraction for Galapagos-2, and slightly better results for ProFit, than the fixed n bulge fit (not shown in these plots; see again Fig. 10).

Separated bulge and disc estimates
So far we have considered total fluxes, but it is also instructive to investigate how the software packages performed in the separate flux measurements of the two components of each galaxy (bulge and disc) in the DS realisation.It is worth stressing that both estimates can be individually worse than the total flux one, if their sum is close to the true total value, but the partition among bulge and disc is not well recovered; this is linked to the accuracy of the morphological parameter estimates, discussed in EMC2022b.
In Figs. 15 to 18 we show the summary plots for bulges and discs separately.In I E , evident features are the bias trends for Galapagos-2 and ProFit, both overestimating the flux of bulges at the faint end and underestimating that of discs at all magnitudes, by approximately 5% down to I E 22 and then getting worse; DeepLeGATo and SourceXtractor++ show a strong underestimating trend for faint bulges, while showing a reasonable accuracy for discs at all magnitudes (DeepLeGATo overestimating their flux by a few percent).The dispersion and the outlier fraction are generally larger for bulges (often reaching very high values) than for discs, in particular at intermediate and faint magnitudes, where discs show nearly optimal values.For bulges, the dispersion has very similar trends for all codes, while for discs SourceXtractor++ performs better, followed by Galapagos-2, while ProFit and DeepLeGATo suffer at the bright end.SourceXtractor++ also yields the best performance concerning the outlier fraction, both for bulges and for discs.
Similar observations can be made for the other bands.In the NIR bands, the bias of bulges for Galapagos-2 and SourceXtractor++ show a similar underestimation of 3-5% at intermediate magnitudes, and all codes show a strong overestimation at the faint end, particularly dramatic for ProFit.For discs, SourceXtractor++ stays close to zero quite firmly, Galapagos-2 shows an underestimation at the faint end balancing out the bulge overestimation, while ProFit again shows an overestimation.In LSST bands, Galapagos-2 shows the same opposite trend for bulges and discs at faint magnitudes; interestingly, SourceXtractor++ tends towards the reverse (underestimating bulges and counterbalancing with an overestimation of discs).ProFit still shows the strong rising trend in all cases.
The trend of the dispersion for bulges is very similar for the three codes, all having D 2.0-3.0 for the bulk of the bins in all bands; for discs, better values are found towards the faint end, again with SourceXtractor++ performing slightly better than the other two codes.Finally, bulges typically have very high outlier fractions, typically above 50% in all bins in the NIR bands, while having lower values at the bright end in the LSST bands, but always peaking at values above 75% at intermediate magnitudes; on the other hand, discs have a lower fraction of outliers at the faint end.

Colours in the multi-band data set
Thus far, we have focused on the accuracy of the flux estimates.However, for the multi-band data set it is also interesting to examine how accurately the three software packages that pro- vided results were able to recover flux ratios among the bands, i.e. colours.Here too, the multi-dimensionality of the data requires some effort to summarize the results in a few informative plots.In each small sub-panel of Figs.19 (Galapagos-2), 20 (ProFit), and 21 (SourceXtractor++), we show the comparison between a measured colour and its corresponding true value.For each fitted galaxy (colour-coded by its true I E magnitude) the measured colour is the y-axis coordinate, and the true colour is the x-axis coordinate (so a perfect colour estimate would result in a diagonal line from the bottom left to the top right corners; for reference, we plot this line in each panel).The figures show all possible combinations between the nine bands, giving an overview of the results.In each small sub-panel, the considered colour is given by the x-axis band label and the y-axis band labels, so for example the first sub-panel in the upper left corner is the I E − Y E colour, the one below it is the I E − J E colour, and so on.A distribution of points with a narrow vertical strip, which is often seen (e.g. in colours including I E ), imply that while the true colours are close to zero, the measured ones have a large dispersion; if the distribution of points follows the diagonal the colour is estimated with good accuracy.
In each panel the values of two statistical diagnostics are also shown.Defining δc = |colour meas − colour true |/(1 + colour true ), the first number is the normalised median absolute deviation, NMAD = 1.48M(δc) considering the sources with δc ≤ 0.2 (M is the median of the distribution); and the second one is η, the percentage of outliers having δc > 0.2.The upper bigger panel in the figures shows these two values for the colours including I E , as reported in the legend.As a further example, the lower bigger panels show a colour-colour diagram, I E −g versus I E − H E ; again, plotted are all measured colours of individual galaxies, colourcoded by their input I E magnitudes; the density contours shown as solid lines represent the true colour distribution.
Obviously, fainter sources have more scattered distributions.In all cases, the colours including the u band are clearly the least accurate, with large NMAD and η; this is easily understandable given the fact that deep-sky galaxies are typically fainter in blue bands, because of intrinsic properties, dust absorption, and cosmological redshifting; also, the band itself is quite shallow (m lim = 23.6 at 10σ).
These diagrams again show how the impact of the different approaches used by the three participants affect the results.Profit, using independent fits in each band, obtains the least accurate results both in terms of NMAD and η.Galapagos-2 and SourceXtractor++ perform significantly better, probably because of the multi-band simultaneous fit.Between the two, SourceXtractor++ yields slightly better results, most likely due to the careful prior calibration used (see Appendix B.5.2).

Uncertainty budgets
We computed a final quantity of interest, the fraction of sources that have a measurement of flux consistent with the true value within their nominal uncertainty budget, as provided in the output catalogues: i.e., | f meas − f true | < σ meas (e.g.Häußler et al. 2007).This fraction should in principle be 68.3%(assuming a Gaussian distribution of the measurements, a reasonable assumption given that the noise map is predominantly Gaussian at least when the photon noise is not relevant, i.e. for intermediate and faint galaxies).
For the codes that provided magnitudes, we computed the errors on fluxes as σ f = 0.921 f σ m .For the double component runs, we estimated a global uncertainty with the usual approach, i.e. σ tot = b/t σ bulge + (1 − b/t) σ disc , since we cannot consider the errors on the two components as independent.We show the results in Figs.22 (I E ) and 23 (multi-band DSb4).It is evident that uncertainties are typically underestimated for bright sources despite their per pixel RMS being enhanced by the contribution of photon noise, while for faint sources they are mostly correct, in a statistical sense.
A possible explanation for these trends is that bright sources are fit with less uncertainty because of their high S/N, but the uncertainty budget (which is based on the information in the RMS map) does not account for systematic errors that cause the actual offsets in the measurements.Note that in the evaluation of the bias B we considered the relative offset δ f , which is typically small for bright sources even when the absolute offset | f meas − f true | is large.At the faint end, ProFit yields substantially less accurate results than Galapagos-2 and SourceXtractor++, in all bands.It must be pointed out that the version of DeepLeGATo that was used for the EMC did not include a consistent algorithm for the uncertainty budget estimation.Therefore, we have not included DeepLeGATo in this analysis.A new version including error estimation is under development.The uncertainty estimation in Morfometryka takes into account the RMS map, however it only represents the fit parameters uncertainty, since this is extracted from the covariance matrix of the fit.For fluxes and magnitudes, the quoted uncertainty is only due to the fitted radius and underlying region used to integrate the flux.

Conclusions
This is the first of two papers presenting and discussing the Euclid Morphology Challenge, a project aimed at comparing the results of model-fitting software packages on a set of simulated Euclid observations, with the aim of providing a first quantitative study to define the best-suited algorithm to be included in the Euclid Science Ground Segment processing pipeline.Five model-fitting tools (DeepLeGATo, Galapagos-2, Morfometryka, ProFit, and SourceXtractor++) were tested in the challenge, providing partial or complete results.It is worth stressing that the results obtained for the EMC and discussed in this paper and in EMC2022b were used by the developers to improve and in some cases significantly upgrade the software packages, which is per se a relevant outcome of this work.
While the companion paper (Euclid Collaboration: Bretonnière et al. 2022) focuses on results concerning the morphological parameters, in this work we have presented the simulated data set, and focused on the results concerning photometric estimates only.We used the code Egg (Schreiber et al. 2017) to produce mock galaxy catalogues, which were then exploited to create synthetic images in three different realisations: two using GalSim (Rowe et al. 2015), to generate single and double component Sérsic analytical profiles (SS and DS respectively, the latter with n bulge =4 and n disc =1 and varying bulge-to-total luminosity ratios); and one with the method described in Euclid Collaboration: Bretonnière et al. ( 2022) and Lanusse et al. (2021) to obtain realistic morphologies by means of a CNN technique (RM).We did not simulate irregular galaxies, which constitute a substantial fraction of the high-redshift population, and we did not include AGNs.Furthermore, the distribution of the bulgeto-total ratios of the simulated galaxies is more skewed towards disc-dominated objects than in the real Universe (see Fig. 4); this implies that any result showing a dependence on this feature must be considered as statistically optimistic.
We created five fields of view in the Euclid I E band with an area of 0.482 deg 2 each, and for one of them we also simulated eight additional bands in the double Sérsic realisation (Euclid NIR Y E , J E , and H E , plus five Rubin/LSST bands u, g, r, i and z).The images were produced with zero background flux (i.e., already 'backgroud subtracted').Gaussian noise realisations mimicking the expected depths of the Euclid Wide Survey were added to the noiseless images.The analytical profiles of the galaxies were finally randomized, following a consistent Poissonian distribution.RMS maps including photon noise were also simulated, and provided to the participants along with the PSFs of each field, and lists of objects having nominal S/N > 5 in I E , for which the actual centroid positions in pixel coordinates were given.The requested output consisted of the photometric (and morphological) estimates for all the objects in these lists.
To allow for a quantitative analysis of the results and for a comparison of the accuracy of the codes, we defined a metric as described in Sect.4, computing three quantities related to, respectively, the median bias in the flux measurements (B), the mean dispersion of the measured flux distribution (D), and the outlier fraction (O); these quantities were evaluated on a subset of the data set including only the sources for which all participants provided a fit, and we also computed a further value quantifying the fraction of sources of the input list for which a software package succeeded to provide a successful fit (C).Finally, we combined them in a single figure of merit S; the lower the value of S, the better the global fit, with S ≤ 1 being considered optimal performance.The global results are summarised in Table 5 and Figs. 9 to 14.We also analysed how accurately colours were retrieved, and the reliability of the uncertainty estimates.The computational times and some in-depth analysis for each individual software package are discussed in the Appendix.
In general, all the participants provided acceptable to good results in at least some of the tests, with a few differences from case to case.A thorough comparison could only be performed with the results of ProFit, Galapagos-2, and SourceXtractor++ (the first two provided all the requested outputs, and the third only lacked one part of a set, namely the free n bulge run in the double Sérsic multi-band realisation).DeepLeGATo provided fits only for the I E analytical realisations, yielding good results with the caveat of a S/N-dependent calibration of the models, while Morfometryka provided data only for the single Sérsic set; therefore, their contribution to the challenge can only be considered as indicative of their potential.
Considering the S values, among the three codes that provided the complete output the best performance was obtained by SourceXtractor++, in all realisations.It is worth stressing again that the pipeline it adopted for the EMC was the most tuned on the data set, with the inclusion of a substantial pre-processing of the data, and the priors being modeled using the samples of true values that were provided along with the simulated images.In the analytical realisations (SS and DS), SourceXtractor++ and Galapagos-2 reached typical values in the bias of the measured fluxes with respect to the true values below 1% at S/N above 10 in all bands, and above 5 in I E (in at least one of the possible configurations of the codes for which results were provided).For these two codes, the dispersion is typically slightly  larger than, but comparable to, the one expected from a theoretical distribution obtained by perturbing the input true fluxes with simulated observational noise consistent with the nominal depths of the images; a small fraction of outliers (objects having a relative error in the fitted flux larger than 5 times the expected theoretical distribution) is always present.ProFit also yielded good results in SS, but was less accurate on the DS realisation (Galapagos-2 also being sub-optimal at the faint end in I E for this realisation).The three software packages performed similarly on the RM data set, and in all cases results were poorer with respect to the analytical realisations; this seems to imply that the overall quality of the fits might be sub-optimal when dealing with real data.However, further investigation is needed to assess to what extent the quality of the simulation might have impacted the results.
All codes tend to underestimate the uncertainties at the bright end, but statistically provide a reasonable estimate for intermediate and faint sources.Finally, a relevant outcome is the importance of the simultaneous fitting of the multi-band data set: using the information provided by the deep and high resolution I E band helps to obtain more accurate measurements on the shallower NIR bands, and noticeably also on the ground-based LSST bands, both in terms of absolute fluxes and colours.Once again this highlights how the synergy between the two surveys will be of paramount important for the success of both.
We did not attempt to investigate the performance of the model-fitting algorithms in the context of the Euclid Deep Survey (see Laureijs et al. 2011).It is reasonable to expect that the results should be similar to those obtained for the Wide Survey at equivalent S/N, although the increased level of contamination and blending might worsen the performance.Dedicated simulations would be required to quantitatively confirm this hypothesis, but this is beyond the scope of the present study.
The results of the EMC set the baseline to decide which algorithm is to be implemented in the Euclid pipeline.One possible way to proceed might be to use the computational power of DeepLeGATo to provide fast and reliable priors to SourceXtractor++; we will investigate this option in the next future.However, given the many different approaches and techniques adopted by the participants (different use of priors, different pre-processing strategies, different approaches for multiband fitting, etc.) and the other parameters not included in the metric that should be taken into account (such as the computing time, required resources, compatibility with the current pipeline, accuracy of uncertainty budget estimates etc.), it is important to stress that the results presented in this work and in EMC2022b must be interpreted with caution.We followed one possible approach to analyse in a compact way a very complex and multilayered data set; and we emphasise that some of the software packages might be better suited than the ones obtaining the best score here, for other specific science cases.We invite the interested reader to check the full set of results using the online tool.
Future work will include testing at least some of the software packages in more realistic environments using Euclid official simulations, and finally the implementation in the Euclid pipeline of the chosen algorithms.We stress again that the subdivision of the catalogue into subsamples fitted with different models caused evident features in the global distribution of δ f , though the average trends and the final S values were generally quite good.In particular, the bright sources in each section were systematically underestimated while the faint ones were overestimated.There were also many bright outliers especially in the DS runs, with bulges typically being underestimated and discs being overestimated; this is likely to be a consequence of the CNN fitting technique (fixed dimensions of the stamps, and less training at the bright end).

B.2. Galapagos-2
Galapagos-2 employs the down-hill gradient minimisation method of Galfit/GalfitM, so it is typically fast, and can be reasonably parallelized to use 16 cores (due to an IDL limitation; ways around this limit exist, as it can also run several sessions in parallel).For the EMC, the code was run on a machine with 48 cores (2.3 GHz each) at 96 threads and ample (512 GB) memory, using 12 threads (at 1.15 GHz) at a time (and 4 sessions in parallel) for multi-band fits.The run times of the Galfit fits themselves were measured and stored, and account for the majority of the time used, with the wrapper not adding significant computational time.The typical GalfitM fitting times were approximately as follows: -DS F1 single-band single Sérsic fit, 160 hours (about 5.7 seconds per galaxy); -DS F1 single-band double Sérsic fit, 150 hours (about 5.4 seconds per galaxy); -DS F0 multi-band single Sérsic fit, 8000 hours (about 288 seconds per galaxy); -DS F0 multi-band double Sérsic fit, 10 000 hours (about 360 seconds per galaxy).
Therefore the multi-band fitting, while providing the results for nine bands in parallel, took significantly more than the time elapsed to fit individual bands.However, as shown, it clearly provides better results for the shallower bands.
The overall results were good, with no evident pathological behaviour.The performance on I E in the multi-band fit DS F0, i.e. obtained simultaneously with the other bands, is significantly worse than those on other fields, where I E is measured alone (see Fig. B.2); this is not surprising, because when fitting one band individually all the parameters get optimised to minimise that band, while fitting more bands in parallel the structural parameters are minimized averaging over the wavelengths (and this is particularly true if the band is the deepest and the one with the higher resolution, as is the case for I E ).However, it should be pointed out that multi-band fits "fail" less often than the combination of several single-band fits, which have independent constraints (see Häußler et al. 2013).
The total magnitude estimated with a single Sérsic fit (which is always provided by the code) is typically better than the one from the double Sérsic in I E , even on the DS simulation (Fig. B.3).On the contrary, the double Sérsic magnitude estimate is better for the other bands (Fig. B.4).While the latter result is expected, given that the code is fitting a double profile with two separate analytical functions (and, the shallower the band the more the multi-band fit should improve the results, so the NIR and LSST bands should benefit more than I E from it), the opposite outcome found on I E images is more surprising.We point out that since the other teams did not provide single Sérsic flux estimates for DS runs, we could not check whether this particular result is a general one, or only concerns Galapagos-2.
Interestingly, the fit of the LSST i band on the DS F0 data set by Galapagos-2 is substantially worse than the fits provided for the other bands.After an in-depth analysis, we concluded that this is actually due to the polynomial fitting method used in the software.In short, Galapagos-2 works with effective central wavelengths, and when performing a multi-band fit on a data set including two bands with a very close central wavelength, the shallower of the two bands might get a sub-optimal fit with a systematic offset.This is exactly the case here, with the LSST i band having a central wavelength very close to the I E one (λ I E = 710 nm and λ i = 754 nm), and very different filter widths; see Sect.2.1 and Fig. 1.Noticeably, this does not affect the whole catalogue, but only a fraction of the sources.Despite trying to find a correlation with some other input or output parameter of the data set, we were not able to identify the criteria resulting in a good or a bad fit for a given source.Nevertheless, we verified the reliability of our conclusion by picking one source having a bad fit, and looking at the residual images produced by Galapagos-2 in the full multi-band run, and in a test run in which the I E band information was not used to fit the i band.The result of the test is shown in Fig. B.5: while the other LSST bands are well-fitted in both cases, when the I E band is used the i band has an evident residual, which flattens out when excluding I E .Of course, this cannot be considered a viable, definitive 'solution' to the issue, since I E is the most important and deepest band, so it cannot be excluded from the fit to improve the fit in another band; on the contrary, one might want to exclude the i band from the fit,   although this would imply entirely discarding a band and its information.Therefore, for the sake of comparison with the other software packages we agreed to keep the delivered data set including this issue, although it non-negligibly worsens the statistics.Finally, we note that the issue is more evident when using the fluxes of the single Sérsic fit, while it is somehow mitigated if the double Sérsic fit is considered (Galapagos-2 provides both for the DS realisation, which is the only one including LSST bands).

B.3. Morfometryka
Morfometryka runs were performed on an Intel(R) Xeon(R) CPU E5-2640 v3 2.60 GHz shared 32-core workstation, running 16 jobs in parallel at any given time.The wall time was roughly 920 minutes per field, which is roughly 10 seconds per source.For each run, Morfometryka spends around 30% of its running time for the Sérsic profile fits, and the remaining time is dedicated to other measurements; however, runtimes strongly depend on the input image size.
There is an evident dependence on morphological type in the results.Bulge-dominated galaxies (having high Sérsic index, which in the SS realisation is obtained from the bulge-to-total ratio given in the Egg catalogue, see Sect.    in Western Australia), which is a multi-node cluster where each node is 24 core (48 thread), comprised of two Intel Xeon E5-2690 v3 (Haswell) 12-core CPUs.
The typical single Sérsic profile fit took a little under a minute per object, and a double Sérsic fit around two minutes on average; this included the whole processing, but in practice the vast majority of the time was spent doing the ProFit optimisation, with the ProFound detection stage usually taking about one second.This was deemed enough time to find a good global solution per fit, but it was not long enough to thoroughly explore uncertainties, as shown in Sect.5.5.
Good results were provided for the SS realisation, but much less so on the DS and RM ones.The faint end is typically strongly biased (fluxes are overestimated); LSST bands also have a 'U-shaped' trend (see Figs. B.8), for which we were not able to find a simple explanation.

B.5. SourceXtractor++
Single-band runs (SS, RM, DS on I E ) were performed on a small cluster at LMU Munich, with 56 cores and 2 GB of RAM per core.Computational times were of about 50 minutes per field for single Sérsic runs, and 2 hours per field for double Sérsic runs.
Multi-band runs were performed on a local server at the Institut Astrophysique de Paris, with 24 AMD EPYC 7402P cores (ignoring hyperthreading) running at 2.8 GHz, using an improved version of the code with increased parallelisation efficiency.The computational time was 45 hours (it is worth recalling that the NIR and LSST bands were rebinned to their original pixel scales, so their sizes were smaller than the I E one -respectively 12 500 and 8333 pixels per side).
The results were generally good, with no evident trends or pathological behaviour.Among the provided output, SourceXtractor++ typically obtained the best values for the global diagnostics.Interestingly, in the DS realisation the runs with free n bulge typically yielded slightly better results than those with fixed n bulge = 4, in terms of average bias (see Fig. B.9).
As mentioned, the processing pipeline adopted for the EMC included a substantial pre-processing of the data set, and the priors used for the measurements were calibrated on the provided samples of true values, albeit only in a statistical sense (see below).

B.5.1. Re-runs on raw data
The NIR and LSST images were rebinned back to their original pixel scales (while the ones provided in the data set shared with the participants had been rebinned to the I E pixel scale of 0.1 ).Additionally, the PSFs used in the fitting process were not those provided in the data set; instead, they were extracted from the images using PSFEx.This did not require any additional input, and therefore it was not outside the guidelines of the challenge.Still, we felt it was necessary to check the performance of the software on the non-processed images and using the official PSFs, for a fairer comparison with the other participants.To this end, a few additional runs were performed by the SourceXtractor++ team (using version 0.16 of the code instead of 0.12), and we compared the results with the ones officially provided for the EMC.The values of the four diagnostics C, B, D, and O, and of the global S, for both the EMC runs and the re-run, are given in Table B.5.1.Some trends for the DSb4 The results do not seem to significantly improve because of the preprocessing; most of the values and trends look similar, and in some cases even slightly better in the re-run.The evident exception is given by the NIR bands, in particular concerning the median bias; including the weighting factors, the values of S are significantly worse in the re-run, albeit still reasonably good (i.e.close to 1).
These re-runs required 40 minutes per field for the single Sérsic runs (SS and RM), and 1 hour and 20 minutes per field for the double Sérsic I E runs (using 28 cores).Finally, the multiband re-run was performed on a different set of nodes, with 32 AMD EPYC 7302 16-Cores at 3.2 GHz, and required 67 hours.

B.5.2. SourceXtractor++ priors
As mentioned, SourceXtractor++ priors were obtained by determining an appropriate transfer function to map each prior to a Gaussian mimicking the distributions of the provided samples of the input true catalogues.Each parameter was calibrated independently, without including covariances; only the statistical distributions were used (i.e.there was no object-by-object comparison in the process).We have shown how this choice likely had a significant impact on the accuracy of the results.
Even though SourceXtractor++ models source ellipticity internally, with the standard axis ratio q and position angle ϕ variables, it was found useful to express the priors needed for these two degrees of freedom in terms of two e 1 , e 2 Cartesian  ellipticity parameters (that usually enter weak lensing studies).More specifically, in complex notation, e = 1−q 1+q exp(2iϕ) ≡ e 1 + ie 2 .Hence, a Normal prior on both e 1 and e 2 was set, centred on zero and with σ = 0.3.This was in broad agreement with the population distribution of axis ratios in the SS F4 simulation, for which ground truth was provided.A similar prior was built for the Sérsic index, as well as for the effective radius.
In multi-component models, priors were set under the assumption that the bulge and disc of each source share a common position angle, but axis ratios are not correlated (this is not actually the case, since axis ratios do not vary across the spectrum in the Egg catalogues).Therefore the default q and ϕ variables were used instead of the (e 1 , e 2 ) ellipticities.A single fit across all bands was run for bulge and disc effective radii, ellipticity, and position angle; the amplitudes of the bulge and disc components were left free to vary.Here again, the fitting procedure prefers the definition of a set of total magnitudes and a set of b/t (with their own associated priors), instead of using the flux of each component in each band as a set; in practice, b/t is expressed via the transform X b/t = log 10 (b/t + 0.01)/(1.01− b/t).The priors on X b/t were set to be band dependent, but with little variation.
g.Graham & Driver 2005), so we applied this correction before simulating the images.Also, GalSim requires that the fluxes of the two components are given separately, while Egg outputs a total magnitude and a b/t; therefore, to assign a flux to each component we simply used the relations f bulge = b/t f tot and f disc = (1 − b/t) f tot , where f tot = 10 −0.4(m−ZP) (where m is the magnitude of the sources as given in the Egg catalogue and ZP is the zeropoint of the image, see Sect.2.1.1).-Single Sérsic profiles (SS), in which galaxies are modeled with a single Sérsic index, defined using the b/t values from the Egg catalogue as n total = (1 − b/t) n disc + b/t n bulge = 3 b/t+1.To compute the single effective radius r e,tot from the two values given in the 2-component catalogue, we used the following formula, calibrated empirically to obtain a good visual match between the two realisations: r tot = [b/t r e,b ] α + [(1 − b/t) r e,d ] α , where α = 0.8 if r e,b < r e,d (98% of the cases), and α = 2.0 otherwise.Finally, position angles and axis ratios already had the same values for bulges and discs, so we simply kept them unchanged.

Fig. 2 .
Fig. 2. Distribution of redshifts as a function of I E magnitude (input values) in F0.The orange line is the running mean, the vertical dashed line the 5σ limit.

Fig. 3 .
Fig. 3. Distribution of bulge-to-total ratios as a function of I E magnitude in the input catalogue of F0; the colours encode the global scale length, defined here as r = b/t r e,bulge + (1 − b/t) r e,disc .

Fig. 4 .
Fig.4.Distributions of various input parameters for the I E band, in all five simulated fields, for all the separate components and realisations in the Challenge, as described in the legend and in the labels of the panels.For comparison, we also show the corresponding parameter distribution in COSMOS.

Fig. 5 .
Fig. 5. Signal-to-noise ratios of the simulated images at the native pixel scales, measured with forced photometry in 2 diameter apertures at the input positions of the sources.Star symbols show the expected values, which are also reported in the legend.

Fig. 6 .
Fig. 6.A small region of the DS F0 realisation in the nine bands; all are shown with the same colour scale.

Fig. 7 .
Fig. 7.A small crop of the three realisations of I E images.Left to right: DS, SS, RM.Notice that in RM the orientations of the galaxies are not the same as in the other two realisations.

Fig. 8 .
Fig. 8. Example of a diagnostic plot used to analyse each output catalogue provided by the participants to the EMC.Each dot shows the δ f value of a fitted galaxy, colour-coded by the b/t value from the original Egg catalogue (colour-bar on the right), as a function of the input true magnitude in the considered band.The black solid line is the running median of the distribution, which should be identically zero for a perfect fit.The dotted lines show the (positive and negative) 1 and 5 σ true levels, used to compute the diagnostics described in the text and reported in the top left corner of the plot (see Sect. 4).

Fig. 9 .
Fig. 9. Visual summary of the S-values listed in Table5.

Fig. 10 .
Fig. 10.Visual summary of the diagnostic quantities: (left to right) absolute value of the median bias (B), average ratio of σ meas and σ true (D), and outlier fraction (O).In the bias panel, the points corresponding to ProFit DS multi-band runs are not shown, being off the scale (their values are 0.49 and 0.44, respectively).See text for details.

Fig. 11 .
Fig. 11.Trends of the completeness C with input I E magnitude of the output provided by the five participants to the challenge, for (top to bottom) DS, SS, and RM realisations (in the latter, points start at I E = 20.5 because bright objects were not included in the simulated images).

Fig. 12 .
Fig. 12. DSb4 (top), SS (center), and RM (bottom) summary plots for I E .Left to right: median bias B; dispersion (standard deviation) D; and fraction of outliers O.Each line and colour correspond to a different code as indicated in the legend; note the different y-axis scales.

Fig. 13 .
Fig. 13.Multi-band summary plots for I E and the NIR bands.Left to right: median bias B; dispersion (standard deviation) D; and fraction of outliers O. Top to bottom: I E , Y E , J E , and H E .Each line and colour correspond to a different code as indicated in the legends; note the different y-axis scales.

Fig. 15 .
Fig. 15.Trends for bulges only, in the multi-band DSb4 fit, for I E and the NIR bands.Left to right: median bias B; dispersion (standard deviation) D; and fraction of outliers O. Top to bottom: I E , Y E , J E , and H E .Each line and colour correspond to a different code as indicated in the legends; note the different y-axis scales.

Fig. 17 .
Fig. 17.Trends for bulges only, in the multi-band DSb4 fit (as in Fig. 15), for LSST bands.Top to bottom: u, g, r, i, and z.

Fig. 19 .
Fig. 19.Colours estimates for Galapagos-2.Each small sub-panel on the left part of the plot shows the measured versus true colour (magnitude difference) for a pair of bands: the colour is always x-axis label − y-axis label.The numbers in each panel are the NMAD and η, as defined in the text.The upper larger panel shows the values of NMAD and η for the colours of all bands with respect to I E , as reported in the legend.The rightmost larger panel shows the I E − g versus I E − H E colour-colour diagram; points are the measured colours for individual galaxies, while the black lines are density contour levels of the corresponding true colours distribution.

Fig. 22 .
Fig. 22. Summary plots showing the fraction of sources satisfying the relation | f meas − f true | < σ meas in each magnitude bin, for the I E band runs.Expected fractions should be equal to 0.683 (horizontal dashed line).Left to right: DSb4, SS, RM.Each symbol and colour correspond to a different code, as indicated in the legends.

Fig. 23 .
Fig. 23.Same as Fig. 22, for the DSb4 multi-band run.Each panel refers to a band, as indicated in the titles (note that I E is the last panel, for readability of the plot).

Fig. B. 1 .
Fig. B.1.Examples of diagnostic plots for DeepLEGaTo.Top, I E SS F4 run; bottom, I E DS F4 run.See Fig. 8 for a description of this and the subsequent similar plots.Note that the colour coding is given by the value of the bulge fraction b/t in the original Egg double-Sersic catalogue; it can be considered as a proxy for the value of the Sérsic index, n = 3 b/t + 1.
Figure B.1 provides an example of these features.

Fig
Fig. B.2. Galapagos-2 results on F0 (top) and F1 (bottom) DS I E band (one component fits).The fit on F1, performed on the I E image only rather than in multi-band mode, is more accurate.See text for details.

Fig. B. 3 .
Fig. B.3.Galapagos-2 results on F1 DS I E band, one (top) and two (bottom; n bulge fixed) component fits.The one-component (i.e.single Sérsic) fit is more accurate in I E -only runs.See text for details.

Fig. B. 4 .
Fig. B.4. Galapagos-2 results on F0 DS LSST g band, one (top) and two (bottom; n bulge fixed) component fits.The two-component (double Sérsic) fit is more accurate for non-I E bands in the multi-band simultaneous fit.See text for details.
2.1) are strongly biased (B 0.15), and therefore are considered outliers in the computation of S, thus significantly affecting the accuracy and resulting in a sub-optimal overall performance (see Fig. B.7).

Fig. B. 5 .
Fig. B.5. Test on the Galapagos-2 LSST i band issue.In both sets of panels, for a chosen badly fitted source (ID 2158 in F0), top to bottom we show scientific image; Galapagos-2 models; and residuals.Left to right are LSST g, r, i, and z bands.Top set: reference run including I E information in the fit.Bottom set: test run without including I E information in the fit.The residual in the i band, evident in the upper panel, vanishes in the bottom panel, when I E is not included.See text for details.

Fig
Fig. B.6.Galapagos-2 results on F0 LSST i band.The evident double trend is due to the multi-band simultaneous fitting (see Sect.B.2).

Fig. B. 7 .
Fig. B.7.Example of diagnostic plot for Morfometryka.This plot refers to the I E SS F2 run.The colour coding clearly shows that the bimodal distribution of points correlates with the input bulge-to-total fraction, with bulge-dominated galaxies (bluer points) being fitted with less accuracy than disc-dominated ones (redder points).

Fig
Fig. B.8. Example of diagnostic plot for ProFit.Top to bottom: I E SS F2; I E DS F2 (fixed bulge); and LSST g F0.The I E SS fit is substantially better than the DS fit, and there is a 'U-shaped' trend in the LSST bands.

Fig. B. 9 .
Fig. B.9. Example of diagnostic plot for SourceXtractor++.These plots refers to the I E DS F2 run; the upper panel shows the fit with n bulge = 4, while the bottom panel shows the one with n bulge left free to vary.

Fig. B. 10 .
Fig. B.10.SourceXtractor++ re-runs on non-pre-processed (top to bottom) I E , H E , and LSST g F0 DS images (empty symbols, solid lines), compared to output provided for the EMC, obtained with pre-processing of the images (full symbols, dashed lines).Note the different limits on the axes.

Table 2 .
Software package, runs and S-values considering the common list of sources (see text for details).Lower S means better performance.The acronyms in the columns refer to the different realisations described in Sect.2.1, with the following additional specifications for the DS runs: b4 = fixed Sérsic index for bulge (n = 4); bf = free Sérsic index for bulge; M = multi-band (i.e.NIR and LSST bands, excluding I E ; no M is for I E band only).For the Galapagos-2 DS runs, we give the numbers of the best and worst performance (first and second number, respectively), corresponding to either the double Sérsic fit or a single Sérsic fit, which the software produces in all cases (see Appendix B.2).The numbers in parenthesis are the completeness C, i.e. the fraction of sources from the input 5σ list having a successful measurement in the output catalogue.