Euclid preparation: XI. Mean redshift determination from galaxy redshift probabilities for cosmic shear tomography

The analysis of weak gravitational lensing in wide-field imaging surveys is considered to be a major cosmological probe of dark energy. Our capacity to constrain the dark energy equation of state relies on accurate knowledge of the galaxy mean redshift $\langle z \rangle$. We investigate the possibility of measuring $\langle z \rangle$ with an accuracy better than $0.002\,(1+z)$ in ten tomographic bins spanning the redshift interval $0.2<z<2.2$, as required for the cosmic shear analysis of Euclid. We implement a sufficiently realistic simulation to understand the advantages and complementarity, but also the shortcomings, of two standard approaches: the direct calibration of $\langle z \rangle$ with a dedicated spectroscopic sample, and the combination of the photometric redshift probability distribution functions (zPDFs) of individual galaxies. We base our study on the Horizon-AGN hydrodynamical simulation, which we analyse with a standard galaxy spectral energy distribution template-fitting code. This procedure produces photometric redshifts with realistic biases, precision, and failure rates. We find that the current Euclid design for direct calibration is sufficiently robust to reach the requirement on the mean redshift, provided that the purity level of the spectroscopic sample is maintained at an extremely high level of $>99.8\%$. The zPDF approach could also be successful if we debias the zPDFs using a spectroscopic training sample. This approach requires deep imaging data, but is weakly sensitive to spectroscopic redshift failures in the training sample. We improve the debiasing method and confirm our findings by applying it to real-world weak-lensing data sets (COSMOS and KiDS+VIKING-450).


Introduction
Understanding the late, accelerated expansion of our Universe (Riess et al. 1998; Perlmutter et al. 1999) is one of the most important challenges in modern cosmology. A key discriminant between the candidate explanations is the dark energy equation-of-state parameter $w$: the value $w = -1$ is compatible with a cosmological constant, and therefore any deviation from this value would invalidate the standard $\Lambda$ cold dark matter ($\Lambda$CDM) model in favour of dark energy. This makes the precise measurement of $w$ a key component of future cosmological experiments such as Euclid (Laureijs et al. 2011), the Vera C. Rubin Observatory Legacy Survey of Space and Time (LSST; LSST Science Collaboration et al. 2009), or the Nancy Grace Roman Space Telescope (Spergel et al. 2015).
Cosmic shear (see e.g. Kilbinger 2015; Mandelbaum 2018, for recent reviews), which is the coherent distortion of galaxy images by large-scale structures via weak gravitational lensing, offers the potential to measure $w$ with great precision: the Euclid survey, in particular, aims at reaching 1% precision on the measurement of $w$ using cosmic shear. One advantage of using lensing to measure $w$, compared to other probes, is that there exists a direct link between the geometrical distortions of galaxy images (i.e. the shear) and the gravitational potential of the intervening structures. When the shapes of, and distances to, galaxy sources are known, gravitational lensing allows one to probe the matter distribution of the Universe.
This capability has led to the rapid growth of interest in using cosmic shear as a key cosmological probe, as evidenced by its successful application to several surveys. Constraints on the matter density parameter $\Omega_{\rm m}$ and the normalisation of the linear matter power spectrum $\sigma_8$ have been reported by the Canada-France-Hawaii Telescope Lensing Survey (CFHTLenS, Kilbinger et al. 2013), the Kilo-Degree Survey (KiDS, Hildebrandt et al. 2017), the Dark Energy Survey (DES, Troxel et al. 2018), and the Hyper Suprime-Cam Survey (HSC, Hikage et al. 2019). These studies typically utilise so-called cosmic shear tomography (Hu 1999), whereby the cosmic shear signal is obtained by measuring the cross-correlation between galaxy shapes in different bins along the line of sight (i.e. tomographic bins). Large forthcoming surveys, also utilising cosmic shear tomography, will enhance the precision of cosmological parameter measurements (e.g. $\Omega_{\rm m}$, $\sigma_8$, and $w$), while also enabling the measurement of any evolution in the dark-energy equation of state, such as that parametrised by Caldwell et al. (1998): $w = w_0 + w_a\,(1 - a)$, where $a$ is the scale factor.
Tomographic cosmic shear studies require accurate knowledge of the galaxy redshift distribution. The estimation and calibration of the redshift distribution has been identified as one of the most problematic tasks in current cosmic shear surveys, as systematic bias in the distribution calibration directly influences the resulting cosmological parameter estimates. In particular, Joudaki et al. (2020) show that the $\Omega_{\rm m}$-$\sigma_8$ constraints from KiDS and DES can be fully reconciled under consistent redshift calibration, thereby suggesting that the different constraints from the two surveys can be traced back to differing methods of redshift calibration.
In tomographic cosmic shear, the signal is primarily sensitive to the average distance of sources within each bin. Therefore, for this purpose, the redshift distribution of an arbitrary galaxy sample can be characterised simply by its mean $\langle z \rangle$, defined as

$\langle z \rangle = \frac{\int z\, N(z)\, {\rm d}z}{\int N(z)\, {\rm d}z}\,,$    (1)

where $N(z)$ is the true redshift distribution of the sample. Furthermore, in cosmic shear tomography it is common to build the required tomographic bins using photo-z (see Salvato et al. 2019, for a review), which can be measured for large samples of galaxies with observations in only a few photometric bandpasses.
However, these photo-z are imperfect (due to, for example, photometric noise), resulting in tomographic bins whose true $N(z)$ extends beyond the bin limits. These 'tails' of the redshift distribution are important, as they can significantly shift the distribution mean (Ma et al. 2006).
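As an illustration of how both the core of the distribution and its tails enter this statistic, the mean $\langle z \rangle$ can be evaluated numerically from a tabulated $N(z)$. This is a minimal sketch with an arbitrary toy $N(z)$, not the Euclid pipeline:

```python
import numpy as np

def mean_redshift(z_grid, n_of_z):
    """<z> = integral(z N(z) dz) / integral(N(z) dz), computed with the
    trapezoidal rule on a discrete redshift grid."""
    return np.trapz(z_grid * n_of_z, z_grid) / np.trapz(n_of_z, z_grid)

# Toy N(z): a Gaussian core at z = 0.9 plus a small high-z tail at z = 1.8
z = np.linspace(0.0, 3.0, 3001)
core = np.exp(-0.5 * ((z - 0.9) / 0.1) ** 2)
tail = 0.02 * np.exp(-0.5 * ((z - 1.8) / 0.1) ** 2)
no_tail = mean_redshift(z, core)          # ~0.9
with_tail = mean_redshift(z, core + tail)  # shifted upward by the tail
```

In this toy case a tail carrying only 2% of the galaxies shifts $\langle z \rangle$ by roughly 0.02, an order of magnitude above the $0.002\,(1+z)$ requirement, which is why the tails must be calibrated.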
For a Euclid-like cosmic shear survey, Laureijs et al. (2011) predict that the mean redshift $\langle z \rangle$ of each tomographic bin must be known with an accuracy better than $\sigma_{\langle z \rangle} = 0.002\,(1 + z)$ in order to meet the required precision on $w_0$ ($\sigma_{w_0} = 0.015$) and $w_a$ ($\sigma_{w_a} = 0.15$). Given the importance of measuring the mean redshift for cosmic-shear surveys, numerous approaches have been devised in the last decade. A first family of methods, usually referred to as 'direct calibration', involves weighting a sample of galaxies with known redshifts such that they match the colour-magnitude properties of the target galaxy sample, thereby leveraging the relationship between galaxy colours, magnitudes, and redshifts to reconstruct the redshift distribution of the target sample (e.g. Lima et al. 2008; Cunha et al. 2009; Abdalla et al. 2008). A second approach is to utilise redshift probability distribution functions (zPDFs), obtained per target galaxy and subsequently stacked to reconstruct the target population $N(z)$. The galaxy zPDF is typically estimated either by model fitting or via machine learning. A third family of methods uses galaxy spatial information, specifically galaxy angular clustering, cross-correlating target galaxies with a large spec-z sample to retrieve the redshift distribution (e.g. Newman 2008; Ménard et al. 2013). New methods are continuously being developed, for instance by modelling galaxy populations and using forward modelling to match the data (Kacprzak et al. 2020).
In this paper we evaluate our capacity to measure the mean redshift in each tomographic bin at the precision level required for Euclid, based on realistic simulations.
We base our study on a mock catalogue generated from the Horizon-AGN hydrodynamical simulation, as described in Dubois et al. (2014) and Laigle et al. (2019). The advantage of this simulation is that the produced spectra encompass all the complexity of galaxy evolution, including rapidly varying star-formation histories, metallicity enrichment, mergers, and feedback from both supernovae and active galactic nuclei (AGN). By simulating galaxies at the imaging sensitivity expected for Euclid, we retrieve the photo-z with a standard template-fitting code, as done in existing surveys. We therefore produce photo-z with realistic biases, precision, and failure rates, as shown in Laigle et al. (2019). The simulated galaxy zPDFs appear as complex as the ones observed in real data.
We further simulate realistic spectroscopic training samples, with selection functions similar to those of the samples currently being acquired in preparation for Euclid and other dark energy experiments (Masters et al. 2017). We introduce possible incompleteness and failures as they occur in actual spectroscopic surveys.
We investigate two of the methods envisioned for the Euclid mission: direct calibration and zPDF combination. We also propose a new method to debias the zPDF, based on Bordoloi et al. (2010). We quantify the performance of these methods in estimating the mean redshift of tomographic bins, and isolate the relevant factors that could impact our ability to fulfil the Euclid requirement. We also provide recommendations on the imaging depth and training sample necessary to achieve the required accuracy on $\langle z \rangle$.
Finally, we demonstrate the general utility of each of the methods presented here, not just to future surveys such as Euclid but also to current large imaging surveys. As an illustration, we apply these methods to the COSMOS survey and to the fourth data release of KiDS (Kuijken et al. 2019).
The paper is organised as follows. In Sect. 2 we describe the Euclid-like mock catalogues generated from the Horizon-AGN hydrodynamical simulation. In Sect. 3 we test the precision reached on $\langle z \rangle$ when applying the direct calibration method. In Sect. 4 we measure $\langle z \rangle$ in each tomographic bin using the zPDF debiasing technique. We discuss the advantages and limitations of both methods in Sect. 5. We apply these methods to the KiDS and COSMOS data sets in Sect. 6. Finally, we summarise our findings and provide closing remarks in Sect. 7.

A Euclid mock catalogue
In this section we present the Euclid mock catalogue used in this analysis, which is constructed from the Horizon-AGN hydrodynamical simulated lightcone and includes photometry and photometric redshift information. A full description of this mock catalogue can be found in Laigle et al. (2019). Here we summarise its main features and discuss the construction of several simulated spectroscopic samples, which reproduce a number of expected spectroscopic selection effects.

Horizon-AGN simulation
Horizon-AGN is a cosmological hydrodynamical simulation run in a box of $100\,h^{-1}$ Mpc per side, with a dark matter mass resolution of $8 \times 10^{7}\,M_\odot$ (Dubois et al. 2014). A flat $\Lambda$CDM cosmology with $H_0 = 70.4$ km s$^{-1}$ Mpc$^{-1}$, $\Omega_{\rm m} = 0.272$, $\Omega_\Lambda = 0.728$, and $n_{\rm s} = 0.967$ (compatible with WMAP-7, Komatsu et al. 2011) is assumed. Gas evolution is followed on an adaptive mesh, whereby an initial coarse $1024^3$ grid is refined down to 1 physical kpc. The refinement procedure leads to a typical number of $6.5 \times 10^{9}$ gas resolution elements (called leaf cells) in the simulation at $z = 1$. Following Haardt & Madau (1996), heating of the gas by a uniform ultraviolet background radiation field takes place after $z = 10$. Gas in the simulation is able to cool down to temperatures of $10^4$ K through H and He collisions, with a contribution from metals as tabulated in Sutherland & Dopita (1993). Gas is converted into stellar particles in regions where the gas particle number density surpasses $n_0 = 0.1$ H cm$^{-3}$, following a Schmidt law, as explained in Dubois et al. (2014). Feedback from stellar winds and supernovae (both Types Ia and II) is included in the simulation, comprising mass, energy, and metal releases. Black holes (BHs) in the simulation can grow by gas accretion, at a Bondi accretion rate that is capped at the Eddington limit, and are able to coalesce when they form a sufficiently tight binary. They release energy in either the quasar or radio (i.e. heating or jet) mode when the accretion rate is respectively above or below one per cent of the Eddington ratio. The efficiencies of these energy release modes are tuned to match the observed BH-galaxy scaling relations at $z = 0$ (see Dubois et al. 2012, for more details).
The simulation lightcone was extracted as described in Pichon et al. (2010). Particles and gas leaf cells were extracted at each time step depending on their proper distance to the observer at the origin. In total, the lightcone contains roughly 22 000 portions of concentric shells, which are taken from about 19 replications of the Horizon-AGN box up to $z = 4$. We restrict ourselves to the central 1 deg$^2$ of the lightcone. Laigle et al. (2019) extracted a galaxy catalogue from the stellar particle distribution using the AdaptaHOP halo finder (Aubert et al. 2004), where galaxy identification is based exclusively on the local stellar particle density. Only galaxies with stellar masses $M_\star > 10^{9}\,M_\odot$ (corresponding to around 500 stellar particles) are kept in the final catalogue, resulting in more than $7 \times 10^{5}$ galaxies in the redshift range $0 < z < 4$, with a spatial resolution of 1 kpc.
A full description of the per-galaxy spectral energy distribution (SED) computation within Horizon-AGN is presented in Laigle et al. (2019); in the following we only summarise the key details of the SED construction process. Each stellar particle in the simulation is assumed to behave as a single stellar population, and its contribution to the galaxy spectrum is generated using the stellar population synthesis models of Bruzual & Charlot (2003), assuming a Chabrier (2003) initial mass function. As each galaxy is composed of a large number of stellar particles, the galaxy SEDs naturally capture the complexities of unique star-formation and chemical-enrichment histories. Additionally, dust attenuation is modelled for each stellar particle individually, using the mass distribution of the gas-phase metals as a proxy for the dust distribution and adopting a constant dust-to-metal mass ratio. Dust attenuation (neglecting scattering) is therefore inherently geometry-dependent in the simulation. Finally, absorption of SED photons by the intergalactic medium (i.e. Hi absorption in the Lyman series) is modelled along the line of sight to each galaxy, using our knowledge of the gas density distribution in the lightcone. This introduces variation in the observed intergalactic absorption across individual lines of sight. Flux contamination by nebular emission lines is not included in the simulated SEDs. While emission lines could add some complexity to a galaxy's photometry, their contribution can be modelled in template-fitting codes. Moreover, their impact is most crucial at high redshift (Schaerer & de Barros 2009) and when using medium bands (e.g. Ilbert et al. 2009).

Kaviraj et al. (2017) compare the global properties of the simulated galaxies with statistical measurements available in the literature (such as the luminosity functions, the star-forming main sequence, and the mass functions). They find an overall fairly good agreement with observations. Still, the simulation over-predicts the density of low-mass galaxies, and the median specific star-formation rate falls slightly below the literature results, a common trend in current simulations.

Fig. 2. Examples of galaxy likelihoods $\mathcal{L}(z)$ (dashed red lines) and debiased posterior distributions (solid black lines). The spec-z (photo-z) are indicated with green (magenta) dotted lines. These galaxies are selected in the tomographic bin $0.4 < z_{\rm p} < 0.6$ for the DES/Euclid (top panels) and LSST/Euclid (bottom panels) configurations. These likelihoods are not a random selection of sources, but illustrate the variety of likelihoods present in the simulations.

Simulation of Euclid photometry and photometric redshifts
As described in Laureijs et al. (2011), the Euclid mission will measure the shapes of about 1.5 billion galaxies over 15 000 deg$^2$. The visible (VIS) instrument will obtain images taken in one very broad filter (VIS), spanning 3500 Å. This filter allows extremely efficient light collection, and will enable VIS to measure the shapes of galaxies as faint as 24.5 mag with high precision. The near-infrared spectrometer and photometer (NISP) instrument will produce images in three near-infrared (NIR) filters. In addition to these data, Euclid satellite observations are expected to be complemented by large samples of ground-based imaging, primarily in the optical, to assist the measurement of photo-z.
Euclid imaging has an expected sensitivity, over 15 000 deg$^2$, of 24.5 mag (at 10σ) in the VIS band, and 24 mag (at 5σ) in each of the Y, J, and H bands (Laureijs et al. 2011). We associate the Euclid imaging with two possible ground-based visible imaging datasets, which correspond to two limiting cases for photo-z estimation performance.
- DES/Euclid. As a demonstration of photo-z performance when combining Euclid with a considerably shallower photometric dataset, we combine our Euclid photometry with that from DES (Abbott et al. 2018). DES imaging is taken in the g, r, i, and z filters, at 10σ sensitivities of 24.33, 24.08, 23.44, and 22.69 mag, respectively.
- LSST/Euclid. As a demonstration of photo-z performance when combining Euclid with a considerably deeper photometric dataset, we combine our Euclid photometry with that from the Vera C. Rubin Observatory LSST (LSST Science Collaboration et al. 2009). LSST imaging will be taken in the u, g, r, i, z, and y filters, at 5σ (point source, full depth) sensitivities of 26.3, 27.5, 27.7, 27.0, 26.2, and 24.9 mag, respectively.
DES imaging is completed and meets these expected sensitivities. Conversely, LSST will not reach the quoted full-depth sensitivities before its tenth year of operation (starting in 2021), and even then it is possible that the northern extension of LSST might not reach the same depth. Still, LSST will already be extremely deep after two years of operation, being only 0.9 mag shallower than the final expected sensitivity (Graham et al. 2020). Therefore, these two cases (and their assumed sensitivities) should comfortably encompass the possible photo-z performance of any future combined optical and Euclid photometric data set.
In order to generate the mock photometry in each of the Euclid, DES, and LSST surveys, each galaxy SED is first 'observed' through the relevant filter response curves. In each photometric band, we generate Gaussian distributions of the expected signal-to-noise ratios (S/N) as a function of magnitude, given both the depth of the survey and a typical S/N-magnitude relation in the same wavelength range (see appendix A of Laigle et al. 2019). We then use these distributions, per filter, to assign each galaxy an S/N given its magnitude. The S/N of each galaxy determines its 'true' flux uncertainty, which is then used to perturb the photometry (assuming Gaussian random noise) and produce the final flux estimate per source. This process is then repeated for all desired filters.
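The perturbation step can be sketched as follows. This is a simplified, sky-limited version (a constant flux error per band set by the quoted depth, rather than the full magnitude-dependent S/N distributions of Laigle et al. 2019), and all numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)

def perturb_photometry(true_mag, depth_mag, depth_nsigma=5.0):
    """Perturb noise-free magnitudes with Gaussian flux noise whose
    amplitude is set by the survey depth (an n-sigma limiting magnitude).
    Returns noisy fluxes and per-source flux uncertainties, in arbitrary
    flux units where mag = -2.5 log10(flux)."""
    true_flux = 10.0 ** (-0.4 * true_mag)
    # 1-sigma flux error implied by the n-sigma limiting magnitude
    sigma_f = 10.0 ** (-0.4 * depth_mag) / depth_nsigma
    noisy_flux = true_flux + rng.normal(0.0, sigma_f, size=true_flux.shape)
    return noisy_flux, np.full_like(true_flux, sigma_f)

mags = np.array([22.0, 24.0, 25.5])                    # bright to faint
flux, err = perturb_photometry(mags, depth_mag=26.3)   # LSST-like u depth
snr_true = 10.0 ** (-0.4 * mags) / err                 # decreases with mag
```

Faint sources near the limiting magnitude end up with S/N of a few, so their noisy fluxes can scatter strongly, which is what drives the realistic photo-z failures described in the text.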
The galaxy photo-z are derived in the same manner as with real-world photometry. We use the method detailed in Ilbert et al. (2013), based on the template-fitting code LePhare (Arnouts et al. 2002; Ilbert et al. 2006). We adopt a set of 33 templates from Polletta et al. (2007), complemented with templates from Bruzual & Charlot (2003). Two dust attenuation curves are considered (Prevot et al. 1984; Calzetti et al. 2000), allowing for a possible bump at 2175 Å. Neither emission lines nor adaptation of the zero-points are considered, since they are not included in the simulated galaxy catalogue. The full redshift likelihood, $\mathcal{L}(z)$, is stored for each galaxy, and the photo-z point estimate, $z_{\rm p}$, is defined as the median of $\mathcal{L}(z)$. The distributions of (derived) photometric redshift versus (intrinsic) spectroscopic redshift for mock galaxies (in both our DES/Euclid and LSST/Euclid configurations) are shown in Fig. 1. Several examples of redshift likelihoods are shown in Fig. 2. We can see realistic cases with multiple modes in the distribution, as well as asymmetric distributions around the main mode. The photo-z used to select galaxies within the tomographic bins are indicated by the magenta lines; note that they can differ significantly from the spec-z (green lines).
We wish to remove galaxies with a broad likelihood distribution (i.e. galaxies with truly uncertain photo-z) from our sample. In practice, we approximate the breadth of the likelihood distribution using the photo-z uncertainties produced by the template-fitting procedure to clean the sample. LePhare produces a redshift confidence interval $[z_{\rm p}^{\rm min}, z_{\rm p}^{\rm max}]$, per source, which encompasses 68% of the redshift probability around $z_{\rm p}$. We remove galaxies with $\max(z_{\rm p} - z_{\rm p}^{\rm min},\, z_{\rm p}^{\rm max} - z_{\rm p}) > 0.3$, which we denote $\sigma_{z_{\rm p}} > 0.3$ in the following for simplicity. We investigate the impact of this choice on the number of galaxies available for cosmic shear analyses, and also quantify the impact of relaxing this limit, in Sect. 5.2.
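This cleaning step amounts to a simple cut on the half-width of the 68% confidence interval; a minimal sketch with made-up interval values:

```python
import numpy as np

def pass_quality_cut(z_p, z_min, z_max, limit=0.3):
    """Keep sources whose 68% confidence interval [z_min, z_max] around
    the point estimate z_p satisfies
    max(z_p - z_min, z_max - z_p) <= limit (i.e. sigma_zp <= limit)."""
    return np.maximum(z_p - z_min, z_max - z_p) <= limit

z_p   = np.array([0.50, 1.10, 0.80])   # photo-z point estimates
z_min = np.array([0.45, 0.60, 0.70])   # lower 68% bounds
z_max = np.array([0.60, 1.20, 1.30])   # upper 68% bounds
keep = pass_quality_cut(z_p, z_min, z_max)   # only the first source passes
```

Note that the cut is deliberately one-sided per bound: a likelihood that is narrow below $z_{\rm p}$ but extended above it (third source) is still rejected.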
Finally, we generate 18 photometric noise realisations of the mock galaxy catalogue. While the intrinsic physical properties of the simulated galaxies remain the same under each of these realisations, the differing photometric noise allows us to quantify the role of photometric noise alone on our estimate of $\langle z \rangle$. We adopt only 18 realisations due to computational limitations; however, our results are stable to the addition of more realisations.

Definition of the target photometric sample and the spectroscopic training samples
All redshift-calibration approaches discussed in this paper utilise a spec-z training sample to estimate the mean redshift of a target photometric sample. In practice, such a spectroscopic training sample is rarely a representative subset of the target photometric sample, but is often composed of bluer and brighter galaxies. Therefore, to properly assess the performance of our tested approaches, we must ensure that the simulated training sample is distinct from the photometric sample. To do this, we separate the Horizon-AGN catalogue into two equally sized subsets: we define the first half of the photometric catalogue as our target sample, and draw variously defined spectroscopic training samples from the second half of the catalogue. We test each of our calibration approaches with three spectroscopic training samples, designed to mimic different spectroscopic selection functions: a uniform training sample, a SOM-based training sample, and a COSMOS-like training sample.
The uniform training sample is the simplest, most idealised training sample possible. We sample 1000 galaxies with VIS < 24.5 mag (i.e. the same magnitude limit as in the target sample) in each tomographic bin, independently of all other properties. While this sample is ideal in terms of representation, the sample size is set to mimic a realistic training sample that could be obtained from dedicated ground-based spectroscopic follow-up of a Euclid-like target sample.
Our second training sample follows the current Euclid baseline for building a training sample. Masters et al. (2017) endeavour to construct a spectroscopic survey, the Complete Calibration of the Colour-Redshift Relation survey (C3R2), which completely samples the colour-magnitude space of cosmic shear target samples. This sample is currently being assembled by combining data from ESO and Keck facilities (Masters et al. 2019; Guglielmo et al. 2020). The target selection is based on an unsupervised machine-learning technique, the self-organising map (SOM, Kohonen 1982), which is used to define a spectroscopic target sample that is representative of the Euclid cosmic shear sample in terms of galaxy colours. The SOM allows the projection of a multi-dimensional distribution onto a lower-dimensional (here two-dimensional) map. The utility of the SOM lies in its preservation of higher-dimensional topology: neighbouring objects in the multi-dimensional space fall within similar regions of the resulting map. This allows the SOM to be utilised as a multi-dimensional clustering tool, whereby discrete map cells associate sources within discrete voxels in the higher-dimensional space. We utilise the method of Davidzon et al. (2019) to construct a SOM, which involves projecting the observed (i.e. noisy) colours of the mock catalogue onto a map of 6400 cells (with dimension 80 × 80). We construct our SOM using the LSST/Euclid simulated colours, assuming implicitly that the spec-z training sample is defined using deep calibration fields. If the flux uncertainty is too large ($\Delta m_x^i > 0.5$, for object $i$ in filter $x$), the observed magnitude is replaced by that predicted from the best-fit SED template, which is estimated while preparing the SOM input catalogue. This procedure allows us to retain sources that have non-detections in some photometric bands. We then construct our SOM-based training sample by randomly selecting $N_{\rm train}$ galaxies from each cell in the SOM. The C3R2 survey expects to have at least one spectroscopic galaxy per SOM cell available for calibration by the time that the Euclid mission is active. For our default SOM coverage, we invoke a slightly more idealised situation of two galaxies per cell, and we impose that these two galaxies belong to the considered tomographic bin. This procedure ensures that all cells are represented in the spectroscopy. In reality, a fraction of cells will likely not contain spectroscopy. However, when treated correctly, such misrepresented cells act only to decrease the target sample number density, and do not bias the resulting estimates of the redshift distribution mean (Wright et al. 2020). We therefore expect that this idealised treatment will not produce results that are overly optimistic.
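Once cell assignments are known, constructing the SOM-based training sample reduces to drawing a fixed number of galaxies per occupied cell. A hedged sketch (the SOM training itself is omitted; `cell_id` is a hypothetical array of per-galaxy cell indices):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_training_per_cell(cell_id, n_per_cell=2):
    """Draw up to n_per_cell galaxies from every occupied SOM cell
    (two per cell in the idealised default coverage described above).
    Returns indices into the spectroscopic candidate pool."""
    selected = []
    for cell in np.unique(cell_id):
        members = np.flatnonzero(cell_id == cell)
        n_draw = min(n_per_cell, members.size)
        selected.extend(rng.choice(members, size=n_draw, replace=False))
    return np.sort(np.array(selected))

# Toy pool: six candidates spread over three SOM cells
cells = np.array([0, 0, 0, 1, 1, 2])
train_idx = sample_training_per_cell(cells)  # two from cells 0 and 1, one from cell 2
```

Cells with fewer candidates than `n_per_cell` simply contribute what they have, which mimics the realistic case of under-sampled regions of colour space.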
Finally, the COSMOS-like training sample mimics a typical heterogeneous spectroscopic sample, as currently available in the COSMOS field. We first simulate the zCOSMOS-like spectroscopic sample (Lilly et al. 2007), which consists of two distinct components: a bright and a faint survey. The zCOSMOS-Bright sample is selected such that it contains only galaxies at $z < 1.2$, while the zCOSMOS-Faint sample contains only galaxies at $z > 1.7$ (with a strong bias towards selecting star-forming galaxies). To mimic these selections, we construct a mock sample whereby half of the sources are brighter than $i = 22.5$ (the bright sample) and half of the galaxies reside at $1.7 < z < 2.4$ with $g < 25$ (the faint sample). We then add to this compilation a sample of 2000 galaxies randomly selected at $i < 25$, mimicking the low-z VUDS sample (Le Fevre et al. 2015), and a sample of 1000 galaxies randomly selected at $0.8 < z < 1.6$ with $i < 24$, mimicking the sample of Comparat et al. (2015). By construction, this final spectroscopic redshift compilation exhibits low representation of the photometric target sample in the redshift range $1.3 < z < 1.7$.
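The zCOSMOS-like part of this compilation can be mimicked with simple boolean selections; a sketch of the bright and faint components only (the randomly drawn VUDS-like and Comparat-like subsamples are omitted), acting on hypothetical catalogue columns:

```python
import numpy as np

def zcosmos_like_mask(z, i_mag, g_mag):
    """Union of a bright selection (i < 22.5) and a faint selection
    (1.7 < z < 2.4 and g < 25), mimicking zCOSMOS-Bright and -Faint."""
    bright = i_mag < 22.5
    faint = (z > 1.7) & (z < 2.4) & (g_mag < 25.0)
    return bright | faint

# Four toy galaxies: bright low-z, faint high-z, intermediate-z, too faint
z     = np.array([0.5, 2.0, 1.5, 2.0])
i_mag = np.array([21.0, 24.0, 23.5, 24.5])
g_mag = np.array([22.0, 24.5, 24.0, 25.5])
mask = zcosmos_like_mask(z, i_mag, g_mag)
```

The third toy galaxy, at $z = 1.5$, is rejected by both components, illustrating how the compilation ends up under-representing the intermediate redshift range.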
Overall, our three training samples exhibit (by design) differing redshift distributions and galaxy number densities. We investigate the sensitivity of the estimated $\langle z \rangle$ to the size of the training sample in Sect. 5.3.

Direct calibration
Direct calibration is a fairly straightforward method that can be used to estimate the mean redshift of a photometric galaxy sample, and is currently the baseline method planned for Euclid cosmic shear analyses.In this section we describe our implementation of the direct calibration method, apply this method to our various spectroscopic training samples, and report the resulting accuracy of our redshift distribution mean estimates.

Implementation for the different training samples
Given our different classes of training samples, we are able to implement slightly different methods of direct calibration.We detail here how the implementation of direct calibration differs for each of our three spectroscopic training samples.
The uniform sample. In the case where the training sample is known to uniformly sparse-sample the target galaxy distribution, an estimate of $\langle z \rangle$ can be obtained by simply computing the mean redshift of the training sample.
The SOM sample. By construction, the SOM training sample uniformly covers the full n-dimensional colour space of the target sample. The method relies on the assumption that galaxies within a cell share the same redshift (Masters et al. 2015), which can be labelled with the training sample. Therefore, we can estimate the mean redshift of the target distribution $\langle z \rangle$ by computing the weighted mean of each cell's average redshift, where the weight is the number of target galaxies per cell:

$\langle z \rangle = \frac{1}{N_{\rm t}} \sum_{i=1}^{N_{\rm cells}} N_i\, \langle z \rangle^{i}_{\rm train}\,,$    (2)

where the sum runs over the $i \in [1, N_{\rm cells}]$ cells in the SOM, $\langle z \rangle^{i}_{\rm train}$ is the mean redshift of the training spectroscopic sources in cell $i$, $N_i$ is the number of target galaxies (per tomographic bin) in cell $i$, and $N_{\rm t}$ is the total number of target galaxies in the tomographic bin. A shear weight associated with each galaxy can be introduced in this equation (e.g. Wright et al. 2020). As described in Sect. 2.3, our SOM is consistently constructed by training on LSST/Euclid photometry, even when studying the shallower DES/Euclid configuration. We adopt this strategy since the training spectroscopic samples in Euclid will be acquired in calibration fields (e.g. Masters et al. 2019) with deep dedicated imaging. This assumption implies that the target distribution $\langle z \rangle$ is estimated exclusively in these calibration fields, which are covered with photometry from both our shallow and deep setups, and therefore increases the influence of sample variance on the calibration.
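Equation (2) is simply a cell-weighted mean; a minimal numerical sketch with made-up per-cell statistics:

```python
import numpy as np

def som_mean_redshift(z_train_cell, n_target_cell):
    """Eq. (2): <z> = (1/N_t) sum_i N_i <z>^i_train, where N_i is the
    number of target galaxies in SOM cell i and <z>^i_train is the mean
    spec-z of the training sources in that cell."""
    z_i = np.asarray(z_train_cell, dtype=float)
    n_i = np.asarray(n_target_cell, dtype=float)
    return np.sum(n_i * z_i) / np.sum(n_i)

# Three toy cells: mean training redshifts and target-galaxy counts
z_cell = [0.40, 0.55, 0.70]
n_cell = [100, 300, 100]
z_mean = som_mean_redshift(z_cell, n_cell)   # (40 + 165 + 70) / 500 = 0.55
```

Because the weights are the target counts $N_i$, a cell whose training redshifts are biased matters only in proportion to how many target galaxies it holds.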
The COSMOS-like sample. Applying direct calibration to a heterogeneous training sample is less straightforward than in the above cases, as the training sample is not representative of the target sample in any respect. Weighting of the spectroscopic sample, therefore, must correct for the mix of spectroscopic selection effects present in the training sample, as a function of magnitude (from the various magnitude limits of the individual spectroscopic surveys), colour (from their various preselections in colour and spectral type), and redshift (from dedicated redshift preselection, such as that in zCOSMOS-Faint). Such a weighting scheme can be established efficiently with machine-learning techniques such as the SOM. To perform this weighting, we train a new SOM using all the information that has the potential to correct for the selection effects present in our heterogeneous training sample: apparent magnitudes, colours, and template-based photo-z. We create this SOM using only the galaxies from the COSMOS-like sample that belong to the considered tomographic bin, and reduce the size of the map to 400 cells (20 × 20, because the tomographic bin itself spans a smaller colour space). Finally, we project the target sample into the SOM and derive weights for each training-sample galaxy, such that they reproduce the per-cell density of target-sample galaxies. This process follows the same weighting procedure as Wright et al. (2020), who extend the direct calibration method of Lima et al. (2008) to include source groupings defined via the SOM. In this method, the estimate of $\langle z \rangle$ is also inferred using Eq. (2).
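The per-galaxy weighting can be sketched as follows: each training galaxy in SOM cell $c$ receives weight $N_{\rm target}(c)/N_{\rm train}(c)$, so that the weighted training sample reproduces the per-cell density of the target sample. This is a simplified version of the Lima et al. (2008) scheme with SOM-cell groupings as in Wright et al. (2020); cells without training galaxies are simply dropped here:

```python
import numpy as np

def som_cell_weights(cell_train, cell_target, n_cells):
    """Per-galaxy weights for the training sample: the weight of a
    training galaxy in cell c is N_target(c) / N_train(c). Cells with no
    training galaxy contribute nothing (their target galaxies are lost)."""
    n_train = np.bincount(cell_train, minlength=n_cells).astype(float)
    n_target = np.bincount(cell_target, minlength=n_cells).astype(float)
    cell_w = np.divide(n_target, n_train,
                       out=np.zeros(n_cells), where=n_train > 0)
    return cell_w[cell_train]

cell_train  = np.array([0, 0, 1, 2])          # SOM cells of training galaxies
cell_target = np.array([0, 1, 1, 1, 2, 2])    # SOM cells of target galaxies
z_train     = np.array([0.30, 0.35, 0.60, 1.00])

w = som_cell_weights(cell_train, cell_target, n_cells=3)
z_mean = np.sum(w * z_train) / np.sum(w)      # weighted estimate of <z>
```

By construction the weights sum to the number of target galaxies, and the weighted mean is algebraically identical to applying Eq. (2) to the same cell statistics.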

Results
We apply the direct calibration technique to the mock catalogue, split into ten tomographic bins spanning the redshift interval $0.2 < z_{\rm p} < 2.2$. To construct the samples within each tomographic bin, training and target samples are selected based on their best-estimate photo-z, $z_{\rm p}$. We quantify the performance of the redshift calibration procedure using the measured bias in $\langle z \rangle$, defined as

$\Delta \langle z \rangle = \langle z \rangle_{\rm estimated} - \langle z \rangle_{\rm true}\,,$    (3)

and evaluated over the target sample. We present the values of $\Delta \langle z \rangle$ obtained with direct calibration in Fig. 3, for each of the ten tomographic bins. The figure shows, per tomographic bin, the population mean (points) and 68% population scatter (error bars) of $\Delta \langle z \rangle$ over the 18 photometric noise realisations of our simulation. The solid lines and yellow region indicate the $|\Delta \langle z \rangle| \leq 2 \times 10^{-3}$ requirement stipulated by the Euclid mission.
Given our limited number of photometric noise realisations, estimating the population mean and scatter directly from the 18 samples is not sufficiently robust for our purposes. We thus use maximum-likelihood estimation, assuming Gaussianity of the ∆⟨z⟩ distribution, to determine the underlying population mean and scatter. We define these underlying population statistics as µ_∆z and σ_∆z for the mean and the scatter, respectively. We find that, when using a uniform or SOM training sample, direct calibration is consistently able to recover the target sample mean redshift to |µ_∆z| < 2 × 10⁻³. In the case of the shallow DES/Euclid configuration, however, the scatter σ_∆z exceeds the Euclid accuracy requirement in the highest and lowest tomographic bins. The DES/Euclid configuration is, therefore, technically unable to meet the Euclid precision requirement on ⟨z⟩ in the extreme bins. In the LSST/Euclid configuration, conversely, the precision and accuracy requirements are both consistently satisfied. We hypothesise that this difference stems from the deeper photometry having higher discriminatory power in the tomographic binning itself: the N(z) distribution of each tomographic bin is intrinsically broader for bins defined with shallow photometry, and therefore has the potential to exhibit greater complexity (such as colour-redshift degeneracies) that reduces the effectiveness of direct calibration.
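For a purely Gaussian model, the maximum-likelihood estimates reduce to the sample mean and the (biased) root-mean-square scatter. A minimal sketch, with names of our own choosing:

```python
import numpy as np

def gaussian_mle(delta_z):
    """Maximum-likelihood estimates of the underlying population mean
    (mu) and scatter (sigma) of Delta<z>, assuming the values over the
    noise realisations are drawn from a single Gaussian."""
    delta_z = np.asarray(delta_z, dtype=float)
    mu = delta_z.mean()                            # MLE of the mean
    sigma = np.sqrt(np.mean((delta_z - mu) ** 2))  # MLE of the scatter
    return mu, sigma
```

With only 18 realisations the scatter estimate is itself noisy, which is why a likelihood-based treatment, rather than naive sample statistics, is preferred in the small-sample regime.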
The direct calibration with the SOM relies on the assumption that galaxies within a cell share the same redshift (Masters et al. 2015). Noise and degeneracies in the colour-redshift space introduce a redshift dispersion within each cell, which impacts the accuracy of ⟨z⟩. Even with the diversity of SEDs generated with Horizon-AGN, and with noise introduced in the photometry, we find that direct calibration with a SOM sample is sufficient to reach the Euclid requirement.
We find that the COSMOS-like training sample is unable to reach the required accuracy of Euclid. This behaviour is somewhat expected, since the COSMOS-like sample contains selection effects that are not cleanly accessible to the direct calibration weighting procedure. The mean redshift is particularly biased in the bin 1.6 < z < 1.8, where there is a dearth of spectra: the Comparat et al. (2015) sample is limited to z < 1.6, while the zCOSMOS-Faint sample resides exclusively at z > 1.7, thereby leaving the range 1.6 < z < 1.7 almost entirely unrepresented. In this circumstance, our SOM-based weighting procedure is insufficient to correct for the heterogeneous selection, leading to bias. This is typical of cases where the training sample is missing certain galaxy populations that are present in the target sample (Hartley et al. 2020). We note, though, that it may be possible to remove some of this bias via careful quality control during the direct calibration process, as demonstrated in Wright et al. (2020). Whether such quality control would be sufficient to meet the Euclid requirements, however, is uncertain.
We note that, although we are utilising photometric noise realisations in our estimates of ⟨z⟩, the underlying mock catalogue remains the same. As a result, our estimates of µ_∆z and σ_∆z are not impacted by sample variance. In reality, sample variance affects the performance of the direct calibration, particularly when assuming that the training sample is directly representative of the target distribution (as we do with our uniform training sample). For fields smaller than 2 deg², Bordoloi et al. (2010) showed that Poisson noise dominates over sample variance (in mean redshift estimation) when the training sample consists of fewer than 100 galaxies. Above this size, sample variance dominates the calibration uncertainty. This means that, in order to generate an unbiased estimate of ⟨z⟩ using a uniform sample of 1000 galaxies, a minimum of 10 fields of 2 deg² would need to be surveyed.
The SOM approach is less sensitive to sample variance, as over-densities (and under-densities) in the target sample population relative to the training sample are essentially removed in the weighting procedure (provided that the population is present in the training sample; Lima et al. 2008; Wright et al. 2020). In the cells corresponding to an over-represented target population, the relative importance of training sample redshifts is similarly up-weighted, thereby removing any bias in the reconstructed N(z). Therefore, sample variance should have only a weak impact on the global derived N(z) in this method. Nonetheless, sample variance may still be problematic if, for example, under-densities result in entire populations being absent from the training sample.
Finally, it is worth emphasising that these results are obtained assuming perfect knowledge of the training set redshifts. We study the impact of failures in spectroscopic redshift estimation in Sect. 5.

Estimator based on redshift probabilities
In this section we present another approach to redshift distribution calibration that uses the information contained in the galaxy redshift probability distribution function, available for each individual galaxy of the target sample. Photometric redshift estimation codes typically provide approximations to this distribution based solely on the available photometry of each source. We study the performance of methods utilising this information in the context of Euclid and test a method to debias the zPDF.

Formalism
Given the relationship between galaxy magnitudes and colours (denoted o) and redshift z, one can utilise the conditional probability p(z|o) to estimate the true redshift distribution N(z), using an estimator such as that of Sheth (2007) and Sheth & Rossi (2010):

N(z) = ∫ p(z|o) N(o) dⁿo = Σ_i p_i(z|o),    (4)

where N(o) is the joint n-dimensional distribution of colours and magnitudes. As made explicit in the above equation, the N(z) estimator reduces simply to the sum of the individual (per-galaxy) conditional redshift probability distributions, p_i(z|o). A shear weight associated with each galaxy can be introduced in this equation (e.g. Wright et al. 2020). It is worth noting that this summation over conditional probabilities is conceptually similar to the summation of SOM-cell redshift distributions presented previously; in both cases, one effectively builds an estimate of the probability p(z|o), and uses this to estimate ⟨z⟩. Indeed, it is clear that the SOM-based estimate of ⟨z⟩ presented in Eq. (2) in fact follows directly from Eq. (4). Generally, photometric redshift codes provide as output a normalised likelihood function that gives the probability of the observed photometry given the true redshift, L(o|z), or sometimes the posterior probability distribution P(z|o) (e.g. Benítez 2000; Bolzonella et al. 2000; Arnouts et al. 2002; Cunha et al.). The two are related via Bayes' theorem,

P(z|o) ∝ L(o|z) Pr(z),    (5)
where Pr(z) is the prior probability. Photometric redshift methods that invoke template fitting, such as the LePhare photo-z estimation code, generally explore the likelihood of the observed photometry given a range of theoretical templates T and true redshifts, L(o|T, z). The full likelihood, L(o|z), is then obtained by marginalising over the template set:

L(o|z) = Σ_T L(o|T, z) Pr(T).    (6)

In the full Bayesian framework, however, we are interested in the posterior probability rather than the likelihood. In the formulation of this posterior, we first make explicit the dependence between galaxy colours c and magnitude in one (reference) band m_0: o = {c, m_0}. Following Benítez (2000), we can then define the posterior probability distribution function

P(z|c, m_0) ∝ Σ_T L(c|T, z) Pr(z|T, m_0) Pr(T|m_0),    (7)

where Pr(z|T, m_0) is the prior conditional probability of redshift given a particular galaxy template and reference magnitude, and Pr(T|m_0) is the prior conditional probability of each template at a given reference magnitude. Under the approximation that the redshift distribution does not depend on the template, and that the template distribution is independent of the magnitude (i.e. the luminosity function does not depend on the SED type),

Pr(z|T, m_0) ≃ Pr(z|m_0)  and  Pr(T|m_0) ≃ Pr(T),    (8)

one obtains

P(z|c, m_0) ∝ Pr(z|m_0) Σ_T L(c|T, z) Pr(T).    (9)

Adding the template dependency in the prior would improve our results, but is impractical with the iterative method presented in Sect. 4, given the size of our sample.
The posterior probability P(z|o) is a photometric estimate of the true conditional redshift probability p(z|o) in Eq. (4), and thus we are able to estimate the target sample N(z) via stacking of the individual galaxy posterior probability distributions:

N(z) = Σ_i P_i(z|o),    (10)

and therefore

⟨z⟩ = ∫ z N(z) dz / ∫ N(z) dz.    (11)
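Eqs. (10)-(11) amount to summing per-galaxy posteriors sampled on a common redshift grid and taking the first moment of the stack. A minimal sketch (the grid and function name are our own choices):

```python
import numpy as np

def stacked_nz_and_mean(z_grid, posteriors):
    """Stack per-galaxy posteriors P_i(z), tabulated on a common uniform
    grid, into N(z) (Eq. 10) and return the mean redshift (Eq. 11)."""
    nz = np.sum(posteriors, axis=0)            # Eq. (10): sum over galaxies
    mean_z = np.sum(z_grid * nz) / np.sum(nz)  # Eq. (11) on a uniform grid
    return nz, mean_z
```

On a uniform grid the integration measure cancels in the ratio, so plain sums suffice for the mean.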

Initial results
In this analysis we use the LePhare code, which outputs L(o|z) for each galaxy as defined in Eq. (6). The redshift distribution (and thereby its mean) is obtained by summing galaxy posterior probabilities, which are derived as in Eq. (9). This raises, however, an immediate concern: in order to estimate the N(z) using the per-galaxy likelihoods, we require a prior distribution of magnitude-dependent redshift probabilities, Pr(z|m_0), which naturally requires knowledge of the magnitude-dependent redshift distribution.
We test the sensitivity of our method to this prior choice by considering priors of two types: a (formally improper) 'flat prior' with Pr(z|m_0) = 1; and a 'photo-z prior' that is constructed by normalising the redshift distribution, estimated per magnitude bin, as obtained by summation over the likelihoods (following Brodwin et al. 2006). Formally, this photo-z prior is defined as

Pr(z|m_0) ∝ (1/N_t) Σ_i Θ(m_{0,i}|m_0) L_i(o|z),    (12)

where Θ(m_{0,i}|m_0) is unity if m_{0,i} falls inside the magnitude bin centred on m_0 and zero otherwise, and N_t is the number of galaxies in the tomographic bin. We estimate ⟨z⟩ in the previously defined tomographic bins using Eq. (11). In the upper-left panel of Fig. 4, we show the estimated (and true) N(z) for one tomographic bin with 1.2 < z_p < 1.4, estimated using DES/Euclid photometry. We annotate this panel with the estimated ∆⟨z⟩ obtained when utilising our two different priors. It is clear that the choice of prior, in this circumstance, can have a significant impact on the recovered redshift distribution. We also find an offset in the estimated redshift distributions with respect to the truth, as confirmed by the associated mean redshift biases being considerable: |∆⟨z⟩| > 0.012, or roughly six times larger than the Euclid accuracy requirement.
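The photo-z prior of Eq. (12) can be sketched as below: per magnitude bin, sum the gridded per-galaxy likelihoods and normalise to unit area. The function name and binning are our own illustrative choices.

```python
import numpy as np

def photoz_prior(z_grid, likelihoods, m0, mag_edges):
    """Photo-z prior Pr(z|m0) in the spirit of Eq. (12): within each
    magnitude bin, sum the per-galaxy likelihoods L_i(o|z) (tabulated on
    a uniform z_grid) and normalise each stack to unit area."""
    dz = z_grid[1] - z_grid[0]
    priors = np.zeros((len(mag_edges) - 1, len(z_grid)))
    for k, (lo, hi) in enumerate(zip(mag_edges[:-1], mag_edges[1:])):
        in_bin = (m0 >= lo) & (m0 < hi)   # plays the role of Theta(m_0i|m_0)
        if in_bin.any():
            stack = likelihoods[in_bin].sum(axis=0)
            priors[k] = stack / (stack.sum() * dz)
    return priors
```

Each row of the returned array is the normalised prior for one magnitude bin; empty bins are left at zero and would need a fallback (e.g. a flat prior) in practice.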
The resulting biases estimated for this method in all tomographic bins, averaged over all noise realisations, are presented in the left-most panels of Fig. 5 (for both the DES/Euclid and LSST/Euclid configurations). Overall, we find that this approach produces mean biases of |µ_∆z| > 0.02 (1 + z) and |µ_∆z| > 0.01 (1 + z), roughly ten and five times larger than the Euclid accuracy requirement, for the DES/Euclid and LSST/Euclid cases respectively. Such bias is created by the mismatch between the simple galaxy templates included in LePhare (in a broad sense, including dust attenuation and IGM absorption) and the complexity and diversity of the galaxy spectra generated in the hydrodynamical simulation. Such biases are in agreement with the usual values observed in the literature with broad-band data (e.g. Hildebrandt et al. 2012).
We therefore conclude that use of such a redshift calibration method is not feasible for Euclid, even under optimistic photometric circumstances.

Redshift probability debiasing
In the previous section we demonstrated that the estimation of galaxy redshift distributions via summation of individual galaxy posteriors P(z), estimated with a standard template-fitting code, is too inaccurate for the requirements of the Euclid survey. The cause of this inaccuracy can be traced to a number of origins: colour-redshift degeneracies, template set non-representativeness, redshift prior inadequacy, and more. However, it is possible to alleviate some of this bias, statistically, by incorporating additional information from a spectroscopic training sample. In particular, Bordoloi et al. (2010) proposed a method to debias P(z) distributions, using the probability integral transform (PIT; Dawid 1984). The PIT of a distribution is defined as the value of the cumulative distribution function evaluated at the ground truth. In the case of redshift calibration, the PIT per galaxy is therefore the value of the cumulative P(z) distribution evaluated at the source spectroscopic redshift z_s:

PIT_i = ∫_0^{z_s,i} P_i(z) dz.    (13)

If all the individual galaxy redshift probability distributions are accurate, the PIT values for all galaxies should be uniformly distributed between 0 and 1. Therefore, using a spectroscopic training sample, any deviation from uniformity in the PIT distribution can be interpreted as an indication of bias in the individual estimates of P(z) per galaxy. We define N_P as the PIT distribution of all the galaxies within the training spectroscopic sample, in a given tomographic bin. Bordoloi et al. (2010) demonstrate that the individual P(z) can be debiased using N_P as

P_deb(z) = P(z) N_P[PIT(z)] / ∫ P(z') N_P[PIT(z')] dz',    (14)

where P_deb(z) is the debiased posterior probability, PIT(z) = ∫_0^z P(z') dz', and the last term ensures correct normalisation. This correction is performed per tomographic bin. This method assumes that the correction derived from the training sample can be applied to all galaxies of the target sample. As with the direct calibration method, such an assumption is valid only if the training sample is representative of the target sample, i.e.
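The PIT of Eq. (13) is straightforward to compute for posteriors tabulated on a grid; a sketch with our own function name:

```python
import numpy as np

def pit_values(z_grid, posteriors, z_spec):
    """PIT per galaxy: the cumulative P(z), tabulated on a uniform
    z_grid, evaluated at the spectroscopic redshift (Eq. 13)."""
    dz = z_grid[1] - z_grid[0]
    cdf = np.cumsum(posteriors, axis=1) * dz
    cdf /= cdf[:, -1:]                  # enforce unit normalisation
    idx = np.clip(np.searchsorted(z_grid, z_spec), 0, len(z_grid) - 1)
    return cdf[np.arange(len(z_spec)), idx]
```

A histogram of these values over the training sample gives N_P; for well-calibrated posteriors it should be flat between 0 and 1.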
in the case of a uniform training sample, but not in the case of the COSMOS-like and SOM training samples. In these latter cases, we weight each galaxy of the training sample in a manner equivalent to the direct calibration method (see Sect. 3), in order to ensure that the PIT distribution of the training sample matches that of the target sample (which is of course unknown). As for direct calibration, a completely missing population (in redshift or spectral type) could impact the results in an unknown manner, but such a case should not occur for a uniform or SOM training sample.
Until now we have considered two types of redshift prior (defined in Sect. 4.2): (1) the flat prior and (2) the photo-z prior. We have shown that the choice of prior can have a significant impact on the recovered ⟨z⟩ (Sect. 4.2). However, as already noted by Bordoloi et al. (2010), the PIT correction has the potential to account for the redshift prior implicitly. In particular, if one uses a flat redshift prior, the correction essentially modifies L(z) to match the true P(z) (assuming the various assumptions stated previously are satisfied). This is because the redshift prior information is already contained within the training spectroscopic sample. Nonetheless, rather than assuming a flat prior to measure the PIT distribution, one can also adopt the photo-z prior (as in Eq. 12). This approach has two advantages: (1) it allows us to start with a posterior probability that is intrinsically closer to the truth, and (2) it includes the magnitude dependence of the redshift distribution within the prior, which is of course not reflected in the case of the flat prior.
Therefore, we improve the debiasing procedure of Bordoloi et al. (2010) by including such a photo-z prior. We add an iterative process to further ensure the correction's fidelity and stability. In this process, the PIT distribution is iteratively recomputed by updating the photo-z prior. We compute the PIT for galaxy i as

PIT_i^n = ∫_0^{z_s,i} P_i^n(z) dz,  with  P_i^n(z) ∝ L_i(o|z) Pr_n(z|m_0),    (15)

where Pr_n(z|m_0) is the prior computed at step n. We can then derive the debiased posterior as

P_deb,i^n(z) = P_i^n(z) N_P^n[PIT^n(z)] / ∫ P_i^n(z') N_P^n[PIT^n(z')] dz',    (16)

with N_P^n the PIT distribution at step n. The prior at the next step is

Pr_{n+1}(z|m_0) ∝ Σ_i Θ(m_i|m_0) P_deb,i^n(z),    (17)

with m_i the magnitude of galaxy i. Note that at n = 0 we assume a flat prior; the step n = 0 of the iteration therefore corresponds to the debiasing assuming a flat prior, as in Bordoloi et al. (2010). We also note that the prior is computed from the N_T galaxies of the training sample in the debiasing procedure, while it is computed over all galaxies of the tomographic bin for the final posterior. As an illustration, Fig. 2 shows the debiased posterior distributions as black lines, which can differ significantly from the original likelihood distribution. We find that this procedure converges quickly: typically, the mean redshift measured at step n + 1 differs from that measured at step n by less than 10⁻³ after 2-3 iterations.
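A single debiasing step, in the spirit of Eqs. (14) and (16), can be sketched as follows: re-weight the gridded posterior by the training-sample PIT density evaluated at its own CDF, then renormalise. The histogram binning and function name are our own choices.

```python
import numpy as np

def debias_posterior(z_grid, posterior, pit_train, n_bins=20):
    """Debias one gridded posterior P(z) with the training-sample PIT
    values (Bordoloi et al. 2010): multiply P(z) by the PIT density N_P
    evaluated at the local CDF value, then renormalise."""
    dz = z_grid[1] - z_grid[0]
    # N_P as a density histogram of the training-sample PIT values
    n_p, edges = np.histogram(pit_train, bins=n_bins,
                              range=(0.0, 1.0), density=True)
    cdf = np.cumsum(posterior) * dz
    cdf /= cdf[-1]
    # look up N_P at the CDF value of each grid point
    weight = n_p[np.clip(np.digitize(cdf, edges) - 1, 0, n_bins - 1)]
    deb = posterior * weight
    return deb / (deb.sum() * dz)
```

Iterating as in Eqs. (15)-(17) then means rebuilding the photo-z prior from the debiased posteriors and recomputing the PIT values until ⟨z⟩ stabilises. A perfectly uniform PIT distribution leaves the posterior unchanged, as it should.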
As described in Appendix A, we also find that the debiasing procedure is considerably more accurate when the photo-z uncertainties are over-estimated rather than under-estimated. Such a condition can be enforced for all galaxies by artificially inflating the source photometric uncertainties by a constant factor in the input catalogue, prior to the measurement of photo-z. In our debiasing analysis, we therefore inflate the photometric uncertainties by a factor of two before measuring the photo-z.

Final results
We illustrate the impact of the P(z) debiasing on the recovered redshift distribution in the lower panels of Fig. 4. This figure presents the case of the redshift bin 0.8 < z_p < 1 in the DES/Euclid configuration. The N(z) and PIT distributions, as computed with the initial posterior distribution, are shown in the upper panels (for both of our assumed priors). The distributions after debiasing are shown in the bottom panels. We can see the clear improvement provided by the debiasing procedure in this example, whereby the redshift distribution bias ∆⟨z⟩ (annotated) is reduced by a factor of ten. We also observe a clear flattening of the target sample PIT distribution.
We present the results of debiasing on the mean redshift estimation for all tomographic bins in Fig. 5. The three right-most panels show the mean redshift biases recovered by our debiasing method, averaged over the 18 photometric noise realisations, for our three training samples. The accuracy of the mean redshift recovery is systematically improved compared to the case without P(z) debiasing (shown in the left column). In the DES/Euclid configuration for instance (shown in the upper row), the improvement is better than a factor of ten at z > 1.
In the LSST/Euclid configuration (shown in the bottom row), we find that the results do not depend strongly on the training set used: the accuracy of ⟨z⟩ is similar for the three training samples, showing that stringent control of the representativeness of the training sample is not necessary in this case. In the DES/Euclid case, however, the SOM training sample clearly outperforms the other training samples, especially at low redshifts. Finally, we note that the iterative procedure using the photo-z prior improves the results when using the SOM training sample and the DES/Euclid configuration.
Overall, the Euclid requirement on redshift calibration accuracy is not reached by our debiasing calibration method in the DES/Euclid configuration. The values of µ_∆z at z < 1 reach five times the Euclid requirement, represented by the yellow bands in Fig. 5. At best, an accuracy of |µ_∆z| ≤ 0.004 (1 + z) is reached for the SOM training sample with the photo-z prior. Conversely, the Euclid requirement is largely satisfied in the LSST/Euclid configuration. In this case, biases of |µ_∆z| ≤ 0.002 (1 + z) are observed in all but the two most extreme tomographic bins: 0.2 < z < 0.4 and 2 < z < 2.2. We therefore conclude that, for this approach, deep imaging data are crucial to reach the required accuracy on mean redshift estimates for Euclid.

Discussion on key model assumptions
In this section, we discuss how some important parameters and assumptions impact our results. We start by discussing the impact of catastrophic redshift failures in the training sample, the impact of our preselection on photometric redshift uncertainty, and the influence of the size of the training sample on our conclusions. We also discuss some remaining limitations of our simulation in the last subsection.

Impact of catastrophic redshift failures in the training sample
For all results presented in this work so far, we have assumed that spectroscopic redshifts perfectly recover the true redshift of all training sample sources. However, given the stringent limit on the mean redshift accuracy in Euclid, deviations from this assumption may introduce significant biases. In particular, mean redshift estimates are extremely sensitive to redshifts far from the main mode of the distribution, and therefore catastrophic redshift failures in spectroscopy may present a particularly significant problem. For instance, if 0.5% of a galaxy population with true redshift z = 1 is erroneously assigned z_s > 2, then this population will exhibit a mean redshift bias of |µ_∆z| > 0.002 under direct calibration. Studies of duplicated spectroscopic observations in deep surveys have shown that there exist, typically, a few percent of sources that are assigned both erroneous redshifts and high confidences (e.g. Le Fèvre et al. 2005). Such redshift measurement failures can be due to misidentification between emission lines, incorrect associations between spectra and sources in photometric catalogues, and/or incorrect associations between spectral features and galaxies (due, for example, to the blending of galaxy spectra along the line of sight; Masters et al. 2017; Urrutia et al. 2019). Of course, the fraction of redshift measurement failures depends on the observational strategy (e.g. spectral resolution) and the measurement technique (e.g. the number of reviewers per observed spectrum). Incorrect association of stars and galaxies can also create difficulties. Furthermore, the frequency of redshift measurement failures is expected to increase as a function of source apparent magnitude; a particular problem for the faint sources probed by Euclid imaging (VIS < 24.5).
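The worked figure above follows from simple bookkeeping of the shift in the mean; a one-line sketch (the helper name is our own):

```python
def failure_mean_z_bias(f_fail, z_true, z_assigned):
    """Bias in the mean redshift, normalised by (1 + z_true), when a
    fraction f_fail of a population at z_true is erroneously assigned
    the redshift z_assigned."""
    return f_fail * (z_assigned - z_true) / (1.0 + z_true)
```

For example, sending 0.5% of a z = 1 population to z_s = 2 already gives 0.0025, above the 0.002 tolerance; larger misassigned redshifts only worsen the bias.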
As we cannot know a priori the number (nor the location) of catastrophic redshift failures in a real spectroscopic training set, we instead estimate the sensitivity of our results to a range of catastrophic failure fractions and modes. We assume a SOM-based training sample and an LSST/Euclid photometric configuration, and distribute various fractions of spectroscopic failures throughout the training sample, simulating both random and systematic failures. Generally though, because these failures occur in the spectroscopic space, the recovered calibration biases are largely independent of the depth of the imaging survey and the method used to build the training sample.
We start by testing the simplest possible mechanism for distributing the failed redshifts, assigning them uniformly within the interval 0 < z < 4. The resulting calibration biases for this mode of catastrophic redshift failure are presented in the left panels of Fig. 6. We find that, for the direct calibration approach (top panel), a failure fraction as low as 0.2% in the training sample is sufficient to bias the mean redshift by |µ_∆z| > 0.002 at low redshifts (by definition, flag 3 in the VVDS could include 3% of failures; Le Fèvre et al. 2005). We also find that the bias decreases with redshift and reaches zero at z = 2. This is a statistical effect: our assumed uniform distribution has a mean of z = 2, and so random catastrophic failures scattered about this point induce no shift in a z ≈ 2 tomographic bin. For the same reason, biases would be significant in the two extreme tomographic bins if we were to assume a catastrophic failure distribution that followed the true N(z) (which peaks at z ≈ 1). In contrast, our debiased zPDF approach is found to be resilient to catastrophic failure fractions as high as 3.0% (bottom panel). In that case, only an unlikely failure fraction of 10% biases the mean redshift by |µ_∆z| ≥ 0.002 (1 + z). We interpret this result as demonstrating the low sensitivity of the PIT distribution to redshift failures in the training sample, which is related to the fact that the PIT distribution provides a global statistical correction that is only weakly sensitive to individual galaxy redshifts.
In the previous test we assigned the failed redshifts uniformly within the interval 0 < z < 4, which is not the expected distribution when redshift failures occur through misidentification of spectral emission lines (e.g. Le Fèvre et al. 2015; Urrutia et al. 2019). This mode of failure leads to a highly non-uniform distribution of failed redshifts, due to the interplay between the location of spectral emission lines and the redshift distribution of training sample galaxies. If a line emitted at λ_true is misclassified as a different emission line at λ_wrong, the assigned redshift is

z_wrong = (1 + z_true) λ_true / λ_wrong − 1.    (18)

We study the impact of such line misidentifications on our estimates of ⟨z⟩ by introducing redshift failures in the simulation with the following assumptions: if z_true < 0.5, we assume that the Hα emission line can be misclassified as [Oii]; if 0.5 < z_true < 1.4, we assume that [Oii] can be misclassified as Hα (for bright sources) or Lyα (for faint sources, using i = 23.5 as a limit); at 1.4 < z_true < 2.0, we assume that the redshift is estimated using NIR spectra, and therefore that the Hα line can be misclassified as [Oii]; and for sources at z > 2, we assume that Lyα can be misclassified as [Oii].
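The misidentification modes listed above can be made concrete with standard rest-frame wavelengths; the helper name is our own, and the wavelength table is the only external input.

```python
# rest-frame wavelengths in Angstroms (standard values)
LAMBDA = {"Halpha": 6562.8, "OII": 3727.0, "Lyalpha": 1215.7}

def misidentified_z(z_true, line_true, line_wrong):
    """Redshift assigned when a line emitted at lambda_true is taken for
    lambda_wrong: the observed wavelength (1 + z_true) * lambda_true is
    read as (1 + z_wrong) * lambda_wrong."""
    return (1.0 + z_true) * LAMBDA[line_true] / LAMBDA[line_wrong] - 1.0
```

For instance, Hα at z = 0.3 misread as [Oii] lands near z ≈ 1.29: because λ_true > λ_wrong in most of these confusions, the failed redshifts pile up systematically above the true ones, which is exactly the asymmetry discussed below.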
The same fraction of misclassifications is assumed in all the redshift intervals. The result of this experiment is shown in the right panels of Fig. 6, and demonstrates that this (more realistic) mode of catastrophic failure produces levels of bias equivalent to those seen in our simple (uniform) mode, albeit in different tomographic bins. This confirms that the sensitivity of the direct calibration to catastrophic redshift failures exists across both simplistic and complex failure modes. In this mode, a failure fraction of 0.2% is sufficient to bias direct calibration at |µ_∆z| ≥ 0.002 (1 + z) in all tomographic bins with z_p > 0.6. This highlights that the calibration bias depends on the exact distribution of failed redshifts: in the case of line misidentification, incorrectly assigned redshifts consistently bias spectra to higher redshift, causing ⟨z⟩ to be affected more heavily over the full redshift range. We compare our result to the simulation of Wright et al. (2020), who investigate the impact of catastrophic spec-z failures on the estimate of ⟨z⟩ (for KiDS cosmic shear analyses) in the MICE2 simulation (Fosalba et al. 2015). They introduce 1.03% of failed redshifts following various distributions. In particular, they test the case of a uniform distribution within 0 < z < 1.4, where z = 1.4 is the limiting redshift of the MICE2 simulation. They report a bias in their direct calibration of ∆⟨z⟩ = 0.0029 for their lowest redshift tomographic bin, and smaller biases for higher redshift tomographic bins. In our lowest redshift bin, we observe a bias of ∆⟨z⟩ = 0.01 for a similar analysis. We argue that this is entirely consistent with the results of Wright et al. (2020), given that our considered redshift range is almost three times larger. Wright et al.
(2020) conclude that spec-z failures are unlikely to influence cosmic shear analyses with the KiDS survey, which are limited to z < 1.2, but may be significant for Euclid-like analyses. In this way, our results also agree: it is clear that direct calibration for next-generation (so-called 'Stage-IV') cosmic shear surveys like Euclid will require careful consideration of the influence of catastrophic spectroscopic failures.
The training sample for Euclid is currently being built with the C3R2 survey (Masters et al. 2019; Guglielmo et al. 2020). This sample results from a combination of spectra coming from numerous instruments installed on 8-metre-class telescopes (e.g. VIMOS, FORS2, KMOS, DEIMOS, LRIS, MOSFIRE), including data from previous spectroscopic surveys (e.g. Lilly et al. 2007; Le Fèvre et al. 2015; Kashino et al. 2019). The most robust spec-z acquired in the Euclid Deep fields with the NISP instrument will also be included. Given the diversity of observations, a careful assessment of the sample purity is necessary to limit the fraction of failures to below 0.2%. Encouragingly, Masters et al. (2019) do not find any redshift failures within the 72 C3R2 spec-z with duplicated observations. Nonetheless, a larger sample of confirmed spectra is necessary to demonstrate that less than 0.2% of spectroscopic redshift measurements suffer from catastrophic failure. Finally, it is possible that improved reliability of both direct calibration methods and spectroscopic confidence could decrease the effects seen here: Wright et al. (2020), for example, advocate a means of cleaning cosmic shear photometric samples of sources with poorly constrained mean redshifts, demonstrating that this can considerably reduce calibration biases. The problem could also be alleviated by improving the reliability of the training sample, for instance by only including spec-z with corroborative evidence from high-precision photo-z derived from deep photometry in the calibration fields.

Relaxing the photo-z σ_zp preselection
Estimates of the redshift distribution mean are also sensitive to the presence of secondary modes in the redshift distribution, and to our ability to reconstruct them. As described in Sect. 2.2, all results presented thus far have invoked a selection on the photometric redshift uncertainty of σ_zp < 0.3, which reduces the likelihood of secondary redshift distribution peaks in our analysis.
Here we discuss the impact of this adopted threshold both on the accuracy of our estimates of ⟨z⟩ and on the fraction of photometric sources that satisfy this selection (and so are retained for subsequent cosmic shear analysis). We apply several σ_zp thresholds in the range σ_zp ∈ [0.15, 0.6] to the full photo-z catalogue. For the training sample, we consider the SOM configuration with two galaxies per cell. The results are shown in Fig. 7 for the DES/Euclid (left) and LSST/Euclid (right) configurations. We find that the σ_zp threshold does not influence our conclusions regarding the direct calibration approach, which is largely insensitive to variations in this threshold. We note, however, that the scatter in the mean redshift (σ_∆z, shown by the error bars) increases well above the Euclid requirement (for the DES/Euclid configuration) when selecting photo-z with σ_zp < 0.15; this is primarily because such a selection drastically reduces the size of the training sample at z > 1.2, increasing the influence of Poisson noise. Therefore, given the insensitivity of the direct calibration to this threshold, it is advantageous to keep galaxies with broad redshift likelihoods in the target sample when using this method. Conversely, σ_zp has a decisive impact on the accuracy of mean redshift estimates inferred from the debiased zPDF approach. For instance, in the DES/Euclid configuration, |µ_∆z| is strongly degraded when applying a threshold of σ_zp < 0.6. Such a threshold on σ_zp could be relaxed in the LSST/Euclid configuration, however, primarily because that sample is already dominated by galaxies with a narrow zPDF.
Not considered above, however, is the important role that the target sample number density plays in cosmic shear analyses. Cosmological constraints from cosmic shear are approximately proportional to the square root of the size of the target galaxy sample, and to the mean redshift. Therefore, optimal lensing surveys require a sufficiently high surface density of sources, preferentially at high redshift. In the Euclid project, 30 galaxies per arcmin² are required to reach the planned scientific objectives (Laureijs et al. 2011). As shown in the top panels of Fig. 7, however, applying a threshold on σ_zp naturally reduces the size of the target sample. For instance, we keep less than 10% of the galaxies at z > 1.4 by selecting a sample at σ_zp < 0.15 in the DES/Euclid configuration. In the LSST/Euclid case, a threshold of σ_zp < 0.3 has a significant impact only in the redshift bins at z > 1.6. A compromise is therefore needed between the number of sources retained in the target sample and the accuracy of the mean redshift that we estimate for these sources (when using the debiasing technique). We do not attempt to estimate this optimal selection using our simulations, as the luminosity function predicted by Horizon-AGN does not perfectly reproduce what is found in real data. Nonetheless, we note that the fraction of galaxies removed from the target sample is likely overestimated here: modern cosmic shear analyses typically introduce a weight associated with the accuracy of each source's shape measurement (the 'shear weight', which is not included in our simulations), which systematically decreases the contribution of low signal-to-noise galaxies to the analysis. As these fainter sources have intrinsically broader photo-z distributions, they will be the most heavily affected by our cuts on σ_zp.

Size of the training sample
The size of the training sample is naturally of greatest importance when using the direct calibration approach (e.g. Newman 2008); the debiased zPDF approach, though, is also sensitive to statistical noise in the PIT distribution. As some ongoing spectroscopic surveys are designed to produce the training samples for Stage IV weak-lensing experiments (e.g. Masters et al. 2017), we explore here the minimal size of these samples required for accurate redshift calibration. To do this, we modify the size of the training samples, limiting our analysis to the uniform and SOM training sample cases: we do not consider the COSMOS-like case, which is a patchwork of existing surveys and is not specifically designed for weak-lensing experiments. For the uniform training samples, we test cases with 500, 1000, or 2000 galaxies per tomographic bin. For the SOM training samples, we test cases corresponding to cells filled with one, two, or three galaxies.
Figure 8 shows the impact of the training sample size on ∆⟨z⟩. We find that the mean bias µ_∆z always remains within the Euclid requirements for the direct calibration approach. The scatter σ_∆z of the bias exceeds the Euclid requirements in a few tomographic bins, however, only when considering the smallest training samples: the Euclid requirements are fully satisfied in all tomographic bins when assuming a training sample with more than 1000 galaxies per bin, or more than two galaxies per SOM cell. With the debiased zPDF approach, we find that increasing the size of the training sample is not sufficient to reduce the residual bias of the method; rather, deeper photometry is preferable, to improve the quality of the initial zPDF.
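The two training-sample constructions compared above can be illustrated with a toy sketch. Here the SOM is replaced by a simple rectangular grid in a two-colour space, which is only a stand-in for a true self-organising map, and all data (redshifts, colours) are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy target sample: redshifts plus two colours (stand-ins for SOM inputs).
n = 50_000
z = rng.uniform(0.2, 2.2, n)
colours = np.column_stack([z + rng.normal(0, 0.3, n),
                           0.5 * z + rng.normal(0, 0.3, n)])

def uniform_training(z, n_per_bin, edges):
    """Draw a fixed number of spec-z per tomographic bin."""
    idx = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        members = np.flatnonzero((z >= lo) & (z < hi))
        idx.append(rng.choice(members, size=min(n_per_bin, members.size),
                              replace=False))
    return np.concatenate(idx)

def cell_training(colours, k_per_cell, n_cells=20):
    """SOM-like colour-space coverage, approximated with a quantile grid."""
    cells = [np.digitize(c, np.quantile(c, np.linspace(0, 1, n_cells + 1)[1:-1]))
             for c in colours.T]
    cell_id = cells[0] * n_cells + cells[1]
    idx = []
    for cid in np.unique(cell_id):
        members = np.flatnonzero(cell_id == cid)
        idx.append(rng.choice(members, size=min(k_per_cell, members.size),
                              replace=False))
    return np.concatenate(idx)

train_u = uniform_training(z, 1000, np.linspace(0.2, 2.2, 11))  # 1000/bin
train_c = cell_training(colours, k_per_cell=2)                  # 2/cell
print(train_u.size, train_c.size)
```

The uniform scheme fixes the redshift coverage directly, while the cell scheme fixes the colour-space coverage; the paper tests both at several sampling densities.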

Catastrophic failures within the photo-z sample
Catastrophic failures in the photo-z sample are a concern for both methods described in this paper. We discuss here their impact, as well as the remaining limitations of our simulation.
As shown in Fig. 1, our simulated sample already includes a significant fraction of photo-z outliers, defined such that |z_p − z_s| > 0.15 (1 + z_s). We find 16.24% and 0.70% of outliers at VIS < 24.5 in the DES/Euclid and LSST/Euclid configurations, respectively. These fractions fall to 1.82% and 0.04% when applying a selection on the photometric redshift uncertainty of σ_zp < 0.3. The largest fraction of these outliers is due to degeneracies in the colour–redshift space inherent to the use of low signal-to-noise photometry in several bands. However, less trivial catastrophic failures are also present in the simulation. In particular, the diversity of spectra generated by the complex physical processes in Horizon-AGN is not fully captured by the limited set of SED templates used in LePhare. This misrepresentation of the galaxy SEDs creates a significant fraction of zPDFs that are not compatible with the spec-z; an example of such a likelihood L(z) is shown in the bottom right panel of Fig. 2. Despite the presence of such failures, our results show that the Euclid requirement is fulfilled.
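The outlier criterion used above is a simple function of the photometric and spectroscopic redshifts, and can be written directly; the catalogue below is synthetic, with a 2% catastrophic-failure rate injected by hand.

```python
import numpy as np

def outlier_fraction(z_p, z_s, threshold=0.15):
    """Fraction of catastrophic photo-z outliers,
    defined as |z_p - z_s| > threshold * (1 + z_s)."""
    return np.mean(np.abs(z_p - z_s) > threshold * (1 + z_s))

rng = np.random.default_rng(2)
z_s = rng.uniform(0.2, 2.2, 10_000)
z_p = z_s + rng.normal(0.0, 0.03 * (1 + z_s))   # well-behaved core scatter
bad = rng.random(10_000) < 0.02                 # 2% catastrophic failures
z_p[bad] = rng.uniform(0.0, 4.0, bad.sum())     # scattered anywhere in z
print(f"outlier fraction: {100 * outlier_fraction(z_p, z_s):.2f}%")
```

The measured fraction lands slightly below the injected 2%, since some randomly scattered failures fall within the threshold by chance.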
Several factors that can potentially create more catastrophic failures in the photo-z were ignored. Galaxies with extreme properties, such as submillimetre galaxies (SMGs), are known to be under-represented in simulations (e.g. Hayward et al. 2020). If galaxies with an extreme dust attenuation fall within the cosmic shear selection at VIS < 24.5 and are selected in one tomographic bin, they could have an impact on our results. Nonetheless, nothing indicates that their zPDFs cannot be established correctly from template fitting, or that such a population cannot be isolated in the multi-colour space with a SOM.
The presence of AGN could also be a problem. These sources can be isolated from their SEDs (Fotopoulou & Paltani 2018), identified as point-like sources in the case of quasi-stellar objects, or identified as X-ray sources with eROSITA (Merloni et al. 2012). We would, however, fail to isolate AGN with an extended morphology, or AGN that are too faint to be detected in X-rays. Salvato et al. (2011) find, however, that standard galaxy SED libraries are sufficient to obtain accurate photo-z for such sources.
Residual contamination from stars could also bias ⟨z⟩. This population preferentially contaminates specific tomographic bins; in particular, stars may bias the mean redshift towards higher values, for both the direct calibration and debiased zPDF methods. A morphological selection based on high-resolution VIS images, combined with a colour selection including near-infrared photometry (e.g. Daddi et al. 2004), is efficient at isolating them (Fotopoulou & Paltani 2018). Even a minimal contamination could bias the mean redshift at a level similar to the one discussed in Sect. 5.1. Future simulations therefore need to include stellar and AGN populations to better assess the level of contamination of the galaxy sample and its impact on the Euclid requirement.
Finally, Laigle et al. (2019) show that the fraction of outliers in Horizon-AGN remains underestimated in comparison to real datasets. One source of discrepancy originates from not taking into account the uncertainties induced by source extraction from images. Bordoloi et al. (2010) estimate that 10% of sources could potentially be blended, and that the likelihood of two blended galaxies with a magnitude difference lower than two is affected in an unpredictable way. In the last decade, numerous source extraction methods have been developed to perform photometry in crowded fields (De Santis et al. 2007; Laidler et al. 2007; Merlin et al. 2016; Lang et al. 2016), which could mitigate the impact of blending. Therefore, a new set of simulations that includes images and such source extraction tools should be considered in the future.

Application to real data
In this section, we apply the two approaches presented in Sect. 3 and Sect. 4 to real data. We use existing imaging surveys and their associated photo-z to define several tomographic bins. In each tomographic bin, we select a subsample of spec-z for which the true mean redshift ⟨z⟩_true is known. We refer to this sample as the target sample; the goal is to retrieve its mean redshift using only the photometric catalogue and an independent training sample. As previously, we measure ∆⟨z⟩, as defined in Eq. (3), in each tomographic bin.
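The figure of merit used throughout can be written compactly. The function below assumes that Eq. (3) takes the standard normalised form (⟨z⟩_est − ⟨z⟩_true)/(1 + ⟨z⟩_true); the worked numbers are purely illustrative.

```python
import numpy as np

def delta_mean_z(z_estimated, z_true):
    """Normalised bias on the mean redshift of a tomographic bin,
    assuming Eq. (3) has the standard form
    (<z>_est - <z>_true) / (1 + <z>_true)."""
    mean_est, mean_true = np.mean(z_estimated), np.mean(z_true)
    return (mean_est - mean_true) / (1 + mean_true)

# Example: a 0.003 systematic shift at <z>_true = 1 gives Delta<z> = 0.0015,
# within the |Delta<z>| < 0.002 (1 + z) Euclid requirement.
z_true = np.full(1000, 1.0)
z_est = z_true + 0.003
print(f"{delta_mean_z(z_est, z_true):.4f}")  # -> 0.0015
```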

The COSMOS survey
We first investigate a favourable configuration, where the photometric survey is much deeper than the target sample. We aim at measuring the mean redshift of the LEGA-C galaxies (van der Wel et al. 2016) selected in the tomographic bin at 0.7 < z_p < 0.9. We base our estimate of ⟨z⟩ on the COSMOS broad-band photometry and the associated zPDFs. The imaging sensitivity is three magnitudes deeper than that of the target sample. All the spec-z available in the COSMOS field (excluding the LEGA-C ones) are used for the training. For the direct calibration approach, we obtain a bias of µ_∆z = 0.00032 and a scatter of σ_∆z = 0.00135: an accuracy well within the Euclid requirement. Secondly, we debias the zPDFs using the PIT distribution, as discussed in Sect. 4.3. In that case, we obtain a mean redshift with a bias of µ_∆z = −0.00046 and a scatter of σ_∆z = 0.00073. In the case of a target sample associated with much deeper photometry, we thus reach the 0.002 (1 + z) accuracy requirement of Euclid using either the direct calibration or the debiased zPDF approach. The details of this measurement are given in Appendix B.

The KiDS+VIKING-450 survey
We now study a less favourable case, where the photometric survey has a depth similar to that of the target sample. We measure the mean redshift in five tomographic bins extracted from the KiDS+VIKING-450 imaging survey, which covers 341 deg² (Wright et al. 2019). The survey combines the ugri-band photometry from KiDS with the ZYJHK_s bands from the VISTA Kilo-degree Infrared Galaxy (VIKING) survey. We adopt the method described in Sect. 2.2 to measure the photo-z. This leads to a photo-z quality comparable to that obtained by Wright et al. (2019), with σ_NMAD ∼ 0.045 at z < 0.9 and σ_NMAD ∼ 0.079 at z > 0.9. These photo-z are used to define five tomographic bins over the photometric redshift interval 0.1 < z < 1.2, as in Hildebrandt et al. (2020).
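The σ_NMAD statistic quoted above follows the standard photo-z convention: 1.4826 times the median absolute deviation of the normalised residuals, which is robust to catastrophic outliers. A minimal sketch on synthetic data:

```python
import numpy as np

def sigma_nmad(z_p, z_s):
    """Normalised median absolute deviation of the photo-z scatter:
    sigma_NMAD = 1.4826 * median(|dz - median(dz)| / (1 + z_s)),
    with dz = z_p - z_s (standard photo-z convention)."""
    dz = z_p - z_s
    return 1.4826 * np.median(np.abs(dz - np.median(dz)) / (1 + z_s))

# Synthetic check: Gaussian scatter of 0.05 (1 + z) should be recovered.
rng = np.random.default_rng(5)
z_s = rng.uniform(0.1, 1.2, 20_000)
z_p = z_s + rng.normal(0.0, 0.05, z_s.size) * (1 + z_s)
print(f"sigma_NMAD = {sigma_nmad(z_p, z_s):.3f}")
```

For a pure Gaussian the statistic matches the standard deviation; in the presence of outliers it tracks the core scatter instead.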
The KiDS+VIKING-450 survey encompasses the VVDS (Le Fèvre et al. 2005) and DEEP2 (Newman et al. 2013) fields, which contain spectroscopic redshifts. We aim at retrieving the mean redshift of the VVDS/DEEP2 galaxies. By selecting only galaxies with secure spectroscopic redshifts and counterparts in the KiDS+VIKING-450 catalogue, we build a target sample of 5794 galaxies. The DEEP2 sample is selected at R < 24.1 and z > 0.7, while the VVDS sample is purely magnitude-limited at i < 24. Our target sample covers the full redshift range of interest, 0.1 < z < 1.2, with magnitude limits similar to those used for the KiDS+VIKING-450 cosmic shear analysis (Hildebrandt et al. 2020).
The KiDS+VIKING-450 imaging survey also covers the COSMOS field, and we use the existing spec-z in the COSMOS field as the training sample. We note that the training and target samples are located in different fields; therefore, sample variance may impact our results. The COSMOS training sample contains 13 817 galaxies from the KiDS+VIKING-450 survey, after applying a redshift confidence selection. This highly heterogeneous sample combines various spectroscopic surveys covering a large range of magnitudes and redshifts. We present our results in Table 1 for the five considered tomographic bins. The upper section of the table shows the fiducial case, where a σ_zp < 0.3 photo-z uncertainty selection is applied. The direct calibration produces a bias of |∆⟨z⟩| < 0.01 (1 + z), except in the lowest tomographic bin (0.1 < z < 0.3), where it reaches |∆⟨z⟩| = 0.02 (1 + z). Using the debiased zPDF method, we find |∆⟨z⟩| ≲ 0.01 (1 + z). In that case, the σ_zp < 0.3 selection removes between 20% and 44% of the full KiDS+VIKING-450 sample. If we relax the selection on the photo-z error, as presented in the lower section of Table 1, the bias ∆⟨z⟩ increases with the debiased zPDF approach, as found in the simulation. Nonetheless, ∆⟨z⟩ remains around 1%, an accuracy comparable to that obtained with the direct calibration. We note that the zPDF debiasing technique performs significantly better with the photo-z prior than with the flat prior. Figure 9 illustrates the impact of the photo-z prior in recovering the shape of the redshift distribution, with a clear improvement below the main mode (bottom left panel). This result is confirmed in the other tomographic bins.
The depth of the KiDS imaging survey is similar to the one we simulate for DES (5σ sensitivities between 23.6 and 25.1), while the VIKING photometry is much shallower than the Euclid one (between 21.2 and 22.7 for VIKING). It is therefore encouraging to find a bias similar to that expected from the simulation in the DES/Euclid configuration, even with shallower imaging. We emphasise that our estimate is performed in the worst possible conditions: (1) our training sample does not cover the same colour/magnitude space as our target sample, as shown in Wright et al. (2020); (2) the photometric calibration could vary from field to field; and (3) some failures in the spec-z target sample could bias the mean redshift considered as the truth. Indeed, flag-3 redshifts in VVDS and DEEP2 are expected to be 97% and 95% reliable, respectively, suggesting that a few percent of failures may be present in those samples, thereby biasing the true mean redshift ⟨z⟩_true by more than 0.01, according to Fig. 7. The presence of such a fraction of failures remains difficult to verify; a comparison between duplicated observations in DEEP2 shows that the fraction of failures should be at most 1.6% (Newman et al. 2013).
Finally, we note that our various selections on σ_zp prevent us from directly comparing the recovered redshift distributions with those published in Wright et al. (2019) and Joudaki et al. (2020). Indeed, the selection on σ_zp preferentially removes the faintest galaxies from the sample, thus shifting the intrinsic redshift distribution towards lower redshifts than expected for the full KiDS+VIKING-450 sample.

Summary and conclusion
This paper investigates the possibility of measuring the mean redshift ⟨z⟩ of a target sample of galaxies, in ten tomographic bins from z = 0.2 to z = 2.2, with an accuracy of |∆⟨z⟩| < 0.002 (1 + z), as stipulated by the Euclid mission requirements on the cosmic shear analysis. Naturally, the conclusions presented here are equally applicable to all current and future surveys for which redshift calibration is a relevant challenge.
We apply two approaches foreseen for the Euclid mission: a direct calibration of ⟨z⟩ with a spectroscopic training sample, and the combination of individual zPDFs to reconstruct the underlying redshift distribution. This paper analyses in detail several factors that could impact these approaches, and provides recommendations to make them successful.
We use the Horizon-AGN hydrodynamical simulation (Dubois et al. 2014), which produces a large diversity of modelled SEDs, and create 18 Euclid-like mock catalogues with different realisations of the photometric noise. We simulate two possible configurations, which should encompass the range of sensitivities of the future imaging available to Euclid: (1) a shallow configuration combining DES and Euclid, and (2) a deep configuration combining LSST and Euclid. We measure the photo-z of the simulated galaxies using the template-fitting code LePhare, as in Laigle et al. (2019). This procedure produces photometric redshifts with complex zPDFs, realistic biases, and catastrophic failures. We also assume different characteristics for the spectroscopic training samples associated with the mock catalogues, considering several selection functions and sample sizes, and including possible failures in the spec-z.
We first test the direct calibration approach, where the redshift distribution is directly estimated from existing spectroscopic redshifts in a training sample, applying the weights necessary to match this distribution to the target sample. We find that this approach is efficient in recovering the mean redshift with an accuracy of 0.002 (1 + z). The method is successful when based on a representative spectroscopic coverage (uniform or SOM), but the weighting scheme is not sufficient to correct for the heterogeneity of the COSMOS-like training sample at the level required by Euclid. The method is stable and robust, and does not require deep photometry such as that from LSST. However, we find that the recovered mean redshift is extremely sensitive to the presence of catastrophic failures in the spectroscopic redshift measurements: to recover unbiased estimates of ⟨z⟩, a careful quality assessment of the spectroscopic redshifts must guarantee a fraction of failures below 0.2%.
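The reweighting at the heart of the direct calibration can be sketched as follows. The single 'magnitude' feature and quantile cells below are a drastic simplification of the multi-dimensional SOM cells used in the paper, and the data are synthetic: the training set deliberately over-represents bright (low-z) galaxies, as real spec-z samples do.

```python
import numpy as np

def direct_calibration_mean_z(z_train, feat_train, feat_target, n_cells=15):
    """Direct-calibration estimate of the target mean redshift.

    Each training galaxy is weighted by the ratio of target to training
    counts in its cell of a (here 1-D, e.g. magnitude) feature space, so
    that the reweighted spec-z sample matches the target distribution.
    """
    edges = np.quantile(feat_target, np.linspace(0.0, 1.0, n_cells + 1))
    n_target, _ = np.histogram(feat_target, edges)
    n_train, _ = np.histogram(feat_train, edges)
    cell = np.clip(np.digitize(feat_train, edges[1:-1]), 0, n_cells - 1)
    w_cell = np.where(n_train > 0, n_target / np.maximum(n_train, 1), 0.0)
    w = w_cell[cell]
    return np.sum(w * z_train) / np.sum(w)

# Toy data: magnitude tracks redshift; training keeps bright galaxies
# preferentially, biasing its raw mean redshift low.
rng = np.random.default_rng(4)
z_target = rng.uniform(0.0, 2.0, 50_000)
mag_target = 20.0 + 2.0 * z_target + rng.normal(0.0, 0.2, z_target.size)
keep = rng.random(z_target.size) < np.exp(-1.5 * z_target)
z_train, mag_train = z_target[keep], mag_target[keep]

mz = direct_calibration_mean_z(z_train, mag_train, mag_target)
print(f"true {z_target.mean():.3f}, raw training {z_train.mean():.3f}, "
      f"reweighted {mz:.3f}")
```

The reweighted estimate recovers most of the selection bias; the residual within-cell bias is the analogue of the heterogeneity the paper reports for the COSMOS-like training sample.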
We then investigate the possibility of reconstructing the redshift distribution from the zPDFs produced by a template-fitting photo-z code. As expected, we find that the quality of the initial zPDFs is not sufficient to measure ⟨z⟩ with an accuracy better than 0.01. We therefore test the method of Bordoloi et al. (2010) to debias the zPDFs, improving it by taking into account an appropriate prior, combined with an iterative correction of the zPDF. Our results are summarised below.
- The mean redshift accuracy inferred from the debiased zPDF is systematically improved compared to that inferred from the initial zPDF (by up to a factor of ten).
- The method is weakly sensitive to the fraction of spec-z failures.
- Imaging depth is the primary factor determining the effectiveness of the debiasing technique. We reach the Euclid requirement when combining Euclid and LSST ground-based images.
- Insufficient imaging depth can be compensated for by selecting well-peaked zPDFs, but this introduces considerable losses in the target sample number density. A balance must therefore be struck between the accuracy of ⟨z⟩ and the statistical power of the cosmic shear analysis.
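The core of the debiasing step, forcing the probability integral transform (PIT) of the training sample to be uniform, can be sketched as below. This is a single multiplicative correction in the spirit of Bordoloi et al. (2010), without the photo-z prior or the iterative refinement introduced in the paper, applied to synthetic Gaussian zPDFs with deliberately overestimated widths.

```python
import numpy as np

def pit_values(pdfs, z_grid, z_spec):
    """PIT of each galaxy: the CDF of its zPDF evaluated at its spec-z."""
    cdfs = np.cumsum(pdfs, axis=1)
    cdfs = cdfs / cdfs[:, -1:]
    return np.array([np.interp(zs, z_grid, c) for zs, c in zip(z_spec, cdfs)])

def debias_pdfs(pdfs, z_grid, pit, n_bins=20):
    """Multiply each zPDF by the training PIT density evaluated at the
    PDF's own CDF, so that recomputed PITs move towards uniformity
    (multiplicative correction in the spirit of Bordoloi et al. 2010)."""
    hist, edges = np.histogram(pit, bins=n_bins, range=(0.0, 1.0), density=True)
    out = np.empty_like(pdfs)
    for i, p in enumerate(pdfs):
        cdf = np.cumsum(p)
        cdf = cdf / cdf[-1]
        corr = hist[np.clip(np.digitize(cdf, edges[1:-1]), 0, n_bins - 1)]
        q = p * corr
        out[i] = q / q.sum()
    return out

# Toy training set: PDF widths of 0.15 versus a true scatter of 0.05,
# so the initial PITs pile up around 0.5 (over-broad, "biased" zPDFs).
rng = np.random.default_rng(3)
n, z_grid = 2000, np.linspace(0.0, 3.0, 301)
z_peak = rng.uniform(0.5, 1.5, n)
z_spec = z_peak + rng.normal(0.0, 0.05, n)
pdfs = np.exp(-0.5 * ((z_grid - z_peak[:, None]) / 0.15) ** 2)
pdfs /= pdfs.sum(axis=1, keepdims=True)

pit = pit_values(pdfs, z_grid, z_spec)
pdfs_debiased = debias_pdfs(pdfs, z_grid, pit)
pit_new = pit_values(pdfs_debiased, z_grid, z_spec)
print(f"PIT variance: {pit.var():.4f} -> {pit_new.var():.4f} "
      f"(uniform: 0.0833)")
```

After the correction, the PIT distribution is much closer to uniform, meaning the recalibrated zPDFs are statistically consistent with the spec-z, which is what allows their sum to recover an unbiased N(z).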
We test the two approaches on real datasets from COSMOS and KiDS+VIKING-450, and confirm that a high signal-to-noise ratio in the photometry is essential for an accurate estimate of ⟨z⟩ using the debiased zPDF approach. In the less favourable case, where the photometric sample and the spec-z target sample are of approximately equal depth, we reach an accuracy of around 0.01 (1 + z) on ⟨z⟩, as expected from the simulation and from other works (e.g. Wright et al. 2020). We confirm the trends observed in the simulation and find that including the prior in the debiasing technique produces significantly better results.
We conclude that both methods could foreseeably provide independent and accurate inferences of the tomographic bin mean redshifts for Euclid. We find that the current Euclid baseline, measuring ⟨z⟩ with a direct calibration approach and a SOM training sample, is robust with respect to the imaging survey depth. However, we recommend that training samples, such as C3R2 (Masters et al. 2019), ensure a purity level above 99.8%. We also find that the sum of the debiased zPDFs could be sufficient to measure ⟨z⟩ at the Euclid requirement, with currently ongoing spectroscopic surveys; however, we recommend this method only in areas covered by deep optical data. The two methods should be applied simultaneously within the current planning of the Euclid survey, providing complementary and independent estimates of ⟨z⟩.
Finally, our work still suffers from several limitations that need to be investigated. We neglect the catastrophic failures within the photo-z sample created by misclassified stars or AGN, or by galaxy blending. A residual contamination of these populations in the tomographic bins could affect both approaches to redshift calibration. Moreover, we do not consider sample variance effects, since the Horizon-AGN simulation covers only 1 deg². We would benefit from a larger simulated area to test the impact of sample variance. Nonetheless, our results present a largely positive outlook for the challenge of tomographic redshift calibration within Euclid.

Fig. 1. Comparison between the photometric redshifts (z_p) and spectroscopic redshifts (z_s) for the Horizon-AGN simulated galaxy sample. Each panel shows a two-dimensional histogram with logarithmic colour scaling, and is annotated with both the 1:1 equivalence line (red) and the |z_p − z_s| = 0.15 (1 + z_s) outlier thresholds (blue), for reference. Photometric redshifts are computed using both DES/Euclid (left) and LSST/Euclid (right) simulated photometry, assuming a Euclid-based magnitude-limited sample with VIS < 24.5.

Fig. 3. Bias on the mean redshift (see Eq. 3) averaged over the 18 photometric noise realisations. The mean redshifts are measured using the direct calibration approach. The tomographic bins are defined using the DES/Euclid and LSST/Euclid photo-z in the top and bottom panels, respectively. The yellow region represents the Euclid requirement of 0.002 (1 + z) on the mean redshift accuracy, and the blue dashed lines correspond to a bias of 0.005 (1 + z). The symbols represent the results obtained with different training samples: (a) selecting 1000 galaxies uniformly per tomographic bin (black circles); (b) selecting two galaxies per cell in the SOM (red squares); and (c) selecting a sample that mimics real spectroscopic survey compilations in the COSMOS field (green triangles).

Fig. 4. Examples of redshift distributions (left) and PIT distributions (right; see text for details) for a tomographic bin selected at 0.8 < z_p < 1 using DES/Euclid photo-z. In these examples, we assume a training sample extracted from a SOM, with two galaxies per cell. The top and bottom panels show the results before and after zPDF debiasing, respectively. Redshift distributions and PITs are shown for the true redshift distribution (blue), and for redshift distributions estimated using the zPDF method when incorporating photo-z (red) and uniform (black) priors.

Fig. 5. Bias on the mean redshift (see Eq. 3), estimated using the zPDF method and averaged over the 18 photometric noise realisations. The top and bottom panels correspond to the DES/Euclid and LSST/Euclid mock catalogues, respectively. Note the differing scales on the y-axes of the two panels. The left panels are obtained by summing the initial zPDFs, without any attempt at debiasing. The other panels show the results of summing the zPDFs after debiasing, assuming (from left to right) a uniform, SOM, and COSMOS-like training sample. The yellow region represents the Euclid requirement of |∆⟨z⟩| ≤ 0.002 (1 + z). The red circles and black triangles in each panel correspond to the results estimated using photo-z and flat priors, respectively.

Fig. 6. Bias on the mean redshift averaged over the 18 photometric noise realisations in the LSST/Euclid case. We assume a SOM training sample, and the different symbols correspond to various fractions of failures introduced in the spec-z training sample. The left and right panels correspond to different assumptions on how the catastrophic spec-z failures are distributed: uniformly between 0 < z < 4 (left), or caused by misclassified emission lines (right). The upper and lower panels correspond to the direct calibration and debiasing methods, respectively.

Fig. 7. Bias on the mean redshift (see Eq. 3), averaged over the 18 photometric noise realisations, under different σ_zp selection thresholds. Top panels: fraction of the sample retained after applying different σ_zp thresholds. The middle and bottom panels show the bias on the mean redshift using the direct calibration and debiasing techniques, respectively. The left and right panels correspond to the DES/Euclid and LSST/Euclid configurations, respectively. We assume a SOM training sample with two galaxies per cell.

Fig. 8. Bias on the mean redshift (see Eq. 3) averaged over the 18 photometric noise realisations, showing the impact of the training sample size on the mean redshift accuracy in the LSST/Euclid case. The left and right panels correspond to a uniform and a SOM spectroscopic coverage, respectively. The top panels show the number of galaxies used for training in the three considered cases. The middle and bottom panels show the mean redshift accuracy using the direct calibration and the debiased zPDF, respectively.

Fig. 9. Same as Fig. 4, but for real data from the KiDS+VIKING-450 photometric survey and the VVDS/DEEP2 target sample. The sample is selected with a σ_zp < 0.6 threshold on the photo-z uncertainties.

Fig. A.1. Example of a PIT distribution (left) and redshift distribution (right) for a tomographic bin selected at 0.6 < z_p < 0.8. The top and bottom panels assume photo-z errors that are under-estimated (A = 0.7) and over-estimated (A = 1.5), respectively. The PIT distribution used to correct the zPDF is shown with the solid black line; the resulting PIT distribution, after debiasing, is shown in dashed red. The inset shows an example of the debiased zPDF for one randomly selected galaxy. The true N(z) is shown with the blue histogram in the right panels. The N(z) reconstructed using the initial and the debiased zPDFs are shown with black solid and red dashed lines, respectively.

Table 1. Differences between the mean redshifts reconstructed with the different methods (direct calibration and debiased zPDF) and ⟨z⟩_true, divided by (1 + ⟨z⟩_true). The KiDS+VIKING-450 survey is split into five tomographic bins. We use VVDS/DEEP2 as the target sample and COSMOS as the training sample. In the top part of the table, photo-z are selected with σ_zp < 0.3, while the bottom parts show selections at σ_zp < 0.6 and σ_zp < 1.2. The fraction of galaxies kept after each selection is also shown ('% kept'); we apply the same definition as Wright et al. (2020) for the loss of photometric sources (their Eq. 1), including shear weights.