A&A 480, 703-714 (2008)
DOI: 10.1051/0004-6361:20077107
H. Hildebrandt1,2 - C. Wolf3 - N. Benítez4
1 - Argelander-Institut für Astronomie, Auf dem Hügel 71, 53115 Bonn, Germany
2 - Sterrewacht Leiden, Leiden University, Niels Bohrweg 2, 2333 CA Leiden, The Netherlands
3 - Department of Physics, University of Oxford, DWB, Keble Road, Oxford OX1 3RH, UK
4 - Instituto de Matemáticas y Física Fundamental (CSIC), C/Serrano 113-bis, 28006 Madrid, Spain
Received 16 January 2007 / Accepted 14 January 2008
Abstract
Aims. Several photometric redshift (photo-z) codes are discussed in the literature and some are publicly available to be used by the community. We analyse the relative performance of different codes in blind applications to ground-based data. In particular, we study how the choice of the code-template combination, the depth of the data, and the filter set influences the photo-z accuracy.
Methods. We performed a blind test of different photo-z codes on imaging datasets with different depths and filter coverages and compared the results to large spectroscopic catalogues. We analysed the photo-z error behaviour to select cleaner subsamples with more secure photo-z estimates. We consider Hyperz, BPZ, and the code used in the CADIS, COMBO-17, and HIROCS surveys.
Results. The photo-z error estimates of the three codes do not correlate tightly with the accuracy of the photo-z's. While very large errors sometimes indicate a true catastrophic photo-z failure, smaller errors are usually not meaningful. For any given dataset, we find significant differences in redshift accuracy and outlier rates between the different codes when compared to spectroscopic redshifts. However, different codes excel in different regimes. The agreement between different sets of photo-z's is better for the subsample with secure spectroscopic redshifts than for the whole catalogue. Outlier rates in the latter are typically larger by at least a factor of two.
Conclusions. Running today's photo-z codes on well-calibrated ground-based data can lead to reasonable accuracy. The actual performance on a given dataset is largely dependent on the template choice and on realistic instrumental response curves. The photo-z error estimation of today's codes from the probability density function is not reliable, and reported errors do not correlate tightly with accuracy. It would be desirable to improve this aspect for future applications so as to get a better handle on rejecting objects with grossly inaccurate photo-z's. The secure spectroscopic subsamples commonly used for assessments of photo-z accuracy may be biased toward objects for which the photo-z's are easier to estimate than for a complete flux-limited sample, resulting in very optimistic estimates.
Key words: techniques: photometric - galaxies: distances and redshifts - galaxies: photometry
Users of photo-z's are often concerned with three main performance issues: the mean redshift error, the rate of catastrophic failures, and the validity of the probability density function (PDF) in a frequentist interpretation. The PDF may be correct in a Bayesian interpretation, when systematic uncertainties are included in the model fitting, and may then correctly express a degree of uncertainty. However, given the non-statistical nature of systematic uncertainties, a frequentist PDF that correctly describes the redshift distribution in the real experiment is necessarily different, unless such systematics can be excluded.
The process depends on three ingredients: model, classifier, and
data. A basic issue at the heart of problems with the PDF is the
match between data and model, since best-fitting parameters and
confidence intervals in χ²-fitting are only reliable when the
model is appropriate. The choice of the type of data is important
for breaking degeneracies between ambiguous model
interpretations. Finally, the classifiers are expected to produce
similar results, although they may produce them at dramatically
different speeds. Artificial Neural Nets, hereafter ANNs, are
especially fast once training has been accomplished
(Firth et al. 2003).
There are many cases in the literature where the precision of photo-z's has been improved after recalibrating the match between data and model (see e.g. Benítez et al. 2004; Coe et al. 2006; Capak et al. 2007; Ilbert et al. 2006; Csabai et al. 2003), although this process requires a large, representative training set of spectroscopic redshifts from the pool of data that is to be photo-z'ed. If ANNs are trained with sufficiently large training samples they can achieve the highest accuracies within the training range as a mismatch between data and model is ruled out from the start.
The literature reports several different photo-z estimators in use across the community, some of which use different template models and some of which allow implementation of user-defined template sets. Assuming a modular problem, where model (templates), classifier, and data can be interchanged, it is interesting to test how comparable the results of different combinations are. In this spirit, we have started the work presented in this paper, where we analyse photo-z performance on real ground-based survey data as a function of magnitude, depth of data, filter coverage, redshift region, and choice of photo-z code. We concentrate on the blind performance of photo-z's, which is the most important benchmark for any study that cannot rely on recalibration, e.g. in the absence of spectroscopic redshifts. We choose to focus on ground-based datasets because many codes have already been tested on the Hubble Deep Field, for which results can be found in the literature (see e.g. Benítez 2000; Hogg et al. 1998; Bolzonella et al. 2000).
Meanwhile, a much larger initiative has formed to investigate all
(even subtle) differences in workings and outcomes among codes and
models. This initiative called
PHAT
(PHoto-z Accuracy Testing) engages a world-wide community of
photo-z developers and users and will hopefully develop our
understanding of photo-z's to a reliably predictive level.
The paper is organised as follows. In Sect. 2 the imaging and spectroscopic datasets are presented. The photo-z codes used for this study are described in Sect. 3. Section 4 presents our approach for describing photo-z accuracy. The results are presented and discussed in Sect. 5. The different photo-z estimates are compared to each other in Sect. 6. A final summary and general conclusions are given in Sect. 7.
Throughout this paper we use Vega magnitudes if not otherwise mentioned.
In order to measure unbiased object colours, the GaBoDS images were
filtered to the seeing of the U-band, and the photometric catalogue
was created with SExtractor (Bertin & Arnouts 1996) in
dual-image mode with the unfiltered R-band image as the detection
image.
The broad-band data from COMBO-17 resemble a medium-deep wide-field survey, while the full 17-filter data are presently unique of their kind. However, we can use them to investigate whether additional telescope time should be spent on increasing depth as in GaBoDS or on obtaining additional SED information as in COMBO-17.
In contrast to GaBoDS, the COMBO-17 photometry was measured directly on unfiltered images. The photometry was obtained in Gaussian apertures whose width was adapted to compensate seeing variations between the frames. Provided the convolution of aperture and PSF yields the same result for each frame, this procedure is mathematically identical to filtering all frames to a final constant seeing and extracting fluxes with Gaussian apertures at the end.
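The equivalence invoked here can be checked numerically. The following sketch (plain NumPy, a 1-D toy "image", all names are ours) verifies that weighting a PSF-convolved image with a Gaussian aperture gives the same flux as weighting the raw image with the seeing-convolved aperture:

```python
import numpy as np

# 1-D toy "image": a compact source well inside the frame
x = np.arange(200)
image = np.exp(-0.5 * ((x - 100) / 2.0) ** 2)

# symmetric kernel that degrades the frame to the common seeing
d = np.arange(-15, 16)
kernel = np.exp(-0.5 * (d / 3.0) ** 2)
kernel /= kernel.sum()

# Gaussian weighting aperture centred on the source
aperture = np.exp(-0.5 * ((x - 100) / 6.0) ** 2)

# route 1: convolve the image to the common seeing, then apply the aperture
flux1 = np.sum(aperture * np.convolve(image, kernel, mode="same"))
# route 2: apply the seeing-convolved aperture to the raw image
flux2 = np.sum(np.convolve(aperture, kernel, mode="same") * image)

print(np.isclose(flux1, flux2))  # → True
```

The identity holds exactly for a symmetric kernel (up to negligible edge truncation), which is why the per-frame Gaussian apertures of COMBO-17 are mathematically equivalent to seeing-matching all frames first.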
The calibration of the CDFS field of COMBO-17 has however changed
since the original publication of the data in 2004. COMBO-17 is
calibrated by two spectrophotometric standard stars in each of its
fields. However, the two stars on the CDFS suggested calibrations
that were inconsistent in colour at the 0.15 mag level from B
to I. Both were marginally consistent with the colours of the
Pickles atlas, so the choice was
unconstrained. Wolf et al. (2004) ended up trusting the
wrong star and introducing a colour bias towards the blue. The
calibration has since been changed to follow the other star, and is
now consistent with both the GaBoDS and MUSYC (Multiwavelength
Survey by
Yale-Chile)
calibration. The consequences of the calibration change for the
photo-z's are small in the 17-filter case, but large when only
broad bands are used. Broad-band photo-z's hinge more on colours than
on features that are traced in medium-band photo-z's.
Table 1: Properties of the imaging data.
Figure 1: Photometric errors in the I-band as a function of I-magnitude (upper panel) and as a function of spectroscopic redshift (lower panel) for the COMBO-17 data (left), the GaBoDS data (middle), and the FDF data (right); see text for information on how the errors were estimated. Note that the errors of the photometric zeropoints are larger than the purely statistical errors plotted here.
The properties of these three imaging datasets are summarised in
Table 1. Since the limiting magnitudes are estimated
in completely different ways in the three data release papers, we
decided to calculate hypothetical 10σ limiting magnitudes
with the GaBoDS values as a reference. These
correspond to the 10σ sky noise under the following assumption:

[Eq. (1)]
The FDF limiting magnitudes in the J and Ks bands are those given in
Heidt et al. (2003) and Gabasch (2004),
corresponding to 50% completeness.
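Equation (1) is not legible in this version of the text. Purely as an illustration of the idea, a common textbook definition of an N-sigma limiting magnitude from the per-pixel sky rms (our own generic formulation, not necessarily the paper's Eq. (1)) is:

```python
import math

def limiting_magnitude(zeropoint, sigma_sky, npix, nsigma=10.0):
    """N-sigma limiting magnitude for an aperture of npix pixels with
    per-pixel sky rms sigma_sky (counts); generic textbook form, shown
    only to illustrate the idea behind a sky-noise-based limit."""
    return zeropoint - 2.5 * math.log10(nsigma * sigma_sky * math.sqrt(npix))

# e.g. ZP = 30.0 mag, sky rms 10 counts/pixel, 100-pixel aperture
print(limiting_magnitude(30.0, 10.0, 100))  # → 22.5
```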
The dependence of photometric errors on magnitude and redshift in the three datasets is shown in Fig. 1. The errors for the COMBO data are derived from multiple measurements of the same sources, where photon shot-noise is assumed to be a lower limit. The GaBoDS and FDF errors are purely derived from shot-noise as no multiple measurements were made.
We compare the colour measurements in the COMBO and the GaBoDS catalogue and find very good agreement (see Fig. 2). Thus, the different ways of correcting for the PSF variations from band to band deliver consistent results. We carried out another comparison between the COMBO data and the CDFS catalogue from the MUSYC collaboration (Taylor, private communication) and the agreement is similar. We conclude that the colour measurement cannot be a dominant source of systematic error in the following.
Figure 2: Comparison of colour measurements for objects in the GaBoDS and the COMBO catalogues. Note that the U-band filters in the two datasets are different, with the GaBoDS filter being broader and bluer. The star symbols represent objects selected by the SExtractor CLASS_STAR parameter in the magnitude range 17<R<20.
Figure 3: Spectroscopic redshift distributions of the comparison samples. Left: the VVDS-CDFS spectroscopic data.
The empirical approach can lead to very precise results if an
extensive, complete spectroscopic catalogue with colour information is
available. But it is not as flexible as the ``SED-fitting'' method
because for every new filter set or camera the colour catalogue must
be recreated. Moreover, it is essentially limited to the
magnitude range where spectra are available in large numbers,
and implicit priors are driven by the spectroscopic
sample selection.
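To make the contrast with "SED-fitting" concrete, a minimal empirical estimator can be written as a nearest-neighbour regression in colour space (a toy stand-in for the polynomial, kernel, or ANN regressions actually used; all names here are ours):

```python
import numpy as np

def knn_photoz(train_colours, train_z, query_colours, k=5):
    """Average the spectroscopic redshifts of the k training galaxies
    closest in colour space; only valid inside the colour/magnitude
    range covered by the spectroscopic training sample."""
    d2 = np.sum((query_colours[:, None, :] - train_colours[None, :, :]) ** 2, axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    return train_z[nearest].mean(axis=1)

# toy training set with a perfectly monotonic colour-redshift relation
z_train = np.linspace(0.0, 1.0, 101)
col_train = np.column_stack([z_train, 2.0 * z_train])
zp = knn_photoz(col_train, z_train, np.array([[0.5, 1.0]]))  # ≈ [0.5]
```

The estimator has no notion of objects bluer, redder, or fainter than its training set, which is exactly the extrapolation limitation described above.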
The ``SED-fitting''-method, however, can be applied to every dataset for which the filter transmission curves are known. Since we want to give guidance for blind applications of photo-z's here we concentrate on this approach in the following.
In practice, a photo-z analysis often involves aspects of both approaches. Empirical colour-redshift relations can certainly be extrapolated in magnitude or redshift. Also, a spectroscopic catalogue can help to optimise parts of an ``SED-fitting'' approach. Ilbert et al. (2006), e.g., present a method to improve the photo-z estimates in the CFHT Legacy Survey. They adjust the photometric zeropoints of their images and optimise the template SEDs with the help of more than 3000 spectroscopically observed galaxies in the range 0<z<5. The optimisation of templates was already used for improving template-based photo-z estimates in the SDSS (Csabai et al. 2003). Gabasch et al. (2004) claim to obtain highly accurate photo-z's in the FDF by constructing semi-empirical template SEDs from 280 spectroscopically observed galaxies in the FDF and the Hubble Deep Field. In the following we describe the three codes used for this study.
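Before turning to the individual codes, the common core of the ``SED-fitting'' approach can be sketched in a few lines. This is a deliberately stripped-down illustration with an error floor (as applied to Hyperz below); the grid layout, the names, and the mean-subtraction shortcut for the free template amplitude are our simplifications:

```python
import numpy as np

def chi2_photoz(obs_mags, obs_errs, template_mags, z_grid, err_floor=0.1):
    """Minimise chi^2 between observed magnitudes and a grid of template
    magnitudes with shape (n_z, n_templates, n_bands). The free overall
    scaling is removed by comparing mean-subtracted magnitudes, i.e.
    only the colours enter (exact only for uniform errors)."""
    errs = np.maximum(obs_errs, err_floor)          # photometric error floor
    obs = obs_mags - obs_mags.mean()
    tmpl = template_mags - template_mags.mean(axis=2, keepdims=True)
    chi2 = np.sum(((tmpl - obs) / errs) ** 2, axis=2)
    iz, itmpl = np.unravel_index(np.argmin(chi2), chi2.shape)
    return z_grid[iz], itmpl

# one toy template whose colours redden with redshift
z_grid = np.array([0.0, 0.5, 1.0])
grid = np.array([[[1.0, 2.0, 3.0]], [[1.0, 3.0, 5.0]], [[1.0, 4.0, 7.0]]])
z_best, _ = chi2_photoz(np.array([10.0, 12.0, 14.0]), np.full(3, 0.05), grid, z_grid)
print(z_best)  # → 0.5
```

All three codes below are elaborations of this scheme, differing mainly in templates, priors, and how the redshift probability distribution is post-processed.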
Hyperz comes with two different template SED sets, the mean
observed spectra of local galaxies by Coleman et al. (1980),
hereafter CWW, and synthetic spectra created from the spectral
evolution library of Bruzual & Charlot (1993), hereafter BC. We use
the BC templates for Hyperz since for all tested setups
performance with the CWW templates is worse. Different reddening laws
are implemented to account for the effect of interstellar dust on the
spectral shape. By default we use the reddening law of
Calzetti et al. (2000) derived for local star-forming galaxies.
The damping by the Lyman-α forest, increasing with redshift, is
modelled according to Madau (1995). Another important
option influencing performance strongly is the application of a prior
on the absolute magnitude. For a given cosmology the absolute
magnitude of an object is calculated from the apparent magnitude in a
reference filter for every redshift step. The user can specify limits
to exclude unrealistically bright or faint objects. In the following
we assume a ΛCDM cosmology and allow galaxies to have absolute
I-band magnitudes in a range around the local SDSS value of M*
from Blanton et al. (2001).
Besides reporting the most probable redshift estimate as a primary
solution Hyperz can also store the redshift probability
distribution giving the probability associated with the -value
for every redshift step. Furthermore, the width of this distribution
around the primary solution is provides a confidence interval, which
allows the user to identify objects with very uncertain estimates.
We choose a minimum photometric error of 0.1 mag for Hyperz to avoid unrealistically small errors in some of the bands.
The COMBO code uses a 2D age × extinction grid of templates produced
with the PEGASE population synthesis code
(Fioc & Rocca-Volmerange 1997) and an external SMC reddening law. For
all template details we refer the reader to
Wolf et al. (2004). No explicit redshift-dependent prior is
used, however, for the shallow purely optical datasets of COMBO-17 and
GaBoDS, only galaxy redshifts up to 1.4 are considered, while for the
FDF dataset the whole range from z=0 to z=7 is allowed.
The SED fitting is done in colour space rather than in magnitude space. Similar to Hyperz a lower error threshold is applied (0.05 mag) but here for the colour indices.
The code determines the redshift probability distribution p(z) and reports the mean of this distribution as a Minimum-Error-Variance (MEV) redshift and its rms as an error estimate. The code also tests the shape of p(z) for bimodality, and determines redshift and error from the mode with the higher integral probability (for all details see Wolf et al. 2001).
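The MEV estimate can be illustrated on a gridded p(z). This is a toy version of what the code reports; the bimodality test and mode selection of Wolf et al. (2001) are omitted:

```python
import numpy as np

def mev_redshift(z_grid, pz):
    """Mean of the normalised p(z) as the MEV redshift and its rms as
    the error estimate (no bimodality handling in this sketch)."""
    w = pz / pz.sum()
    z_mev = np.sum(z_grid * w)
    sigma = np.sqrt(np.sum((z_grid - z_mev) ** 2 * w))
    return z_mev, sigma

z = np.linspace(0.0, 2.0, 2001)
pz = np.exp(-0.5 * ((z - 0.7) / 0.05) ** 2)   # single Gaussian mode
z_mev, sigma = mev_redshift(z, pz)
print(round(z_mev, 3), round(sigma, 3))  # → 0.7 0.05
```

For a unimodal p(z) the MEV redshift coincides with the peak; the bimodality test matters precisely when p(z) has two competing modes, for which the global mean would fall between them.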
The redshift likelihood is calculated by BPZ in a similar way
as by Hyperz, minimising the χ² between observed and predicted
colours. However, in contrast to Hyperz, no reddening is applied
to the templates, relying on the completeness of the given set. After
the calculation of the likelihood, Bayes' theorem is applied,
incorporating the prior probability. The actual shape of this prior is
dependent on template type and I-band magnitude and was derived from
the observed redshift distributions of different galaxy types in the
Hubble Deep Field. By applying this prior the rate of outliers with
catastrophically wrong photo-z assignments can be reduced. For
details on the procedure see Benítez (2000).
BPZ has been extensively used in the ACS GTO program, the GOODS and COSMOS surveys and others.
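The Bayesian step can be sketched as follows. The prior shape and the ODDS window below are placeholders of our own choosing, not the actual magnitude- and type-dependent prior of Benítez (2000):

```python
import numpy as np

def bayesian_pz(z_grid, likelihood, prior):
    """Posterior ∝ likelihood x prior; ODDS = posterior probability in a
    window around the peak (the window width 0.12(1+z) is an arbitrary
    placeholder, not the BPZ default)."""
    post = likelihood * prior
    post /= post.sum()
    z_best = z_grid[np.argmax(post)]
    odds = post[np.abs(z_grid - z_best) < 0.12 * (1.0 + z_best)].sum()
    return z_best, odds

z = np.linspace(0.0, 4.0, 4001)
# degenerate likelihood with a low-z and a high-z solution
like = np.exp(-0.5 * ((z - 0.3) / 0.05) ** 2) + np.exp(-0.5 * ((z - 2.5) / 0.05) ** 2)
prior = np.exp(-z)   # toy prior favouring low redshift
z_best, odds = bayesian_pz(z, like, prior)
# the prior down-weights the high-z solution and resolves the degeneracy
```

A high ODDS value then flags a posterior concentrated in a single mode, which is exactly the quantity used below to reject uncertain objects.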
The mean, ⟨Δz⟩, and the standard deviation, σ_Δz, of the
following quantity are calculated:

Δz = (z_phot − z_spec) / (1 + z_spec)    (2)
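In code, these benchmarks amount to a few lines (our own condensation; here bias and scatter are computed after clipping |Δz| > 0.15 outliers, matching the f0.15 outlier rate quoted in the tables):

```python
import numpy as np

def photoz_stats(z_phot, z_spec, outlier_cut=0.15):
    """Bias (mean), scatter (std), and outlier rate of
    Delta z = (z_phot - z_spec) / (1 + z_spec)."""
    dz = (z_phot - z_spec) / (1.0 + z_spec)
    out = np.abs(dz) > outlier_cut
    return dz[~out].mean(), dz[~out].std(), out.mean()

z_spec = np.array([1.0, 1.0, 1.0, 1.0, 1.0])
z_phot = np.array([1.02, 0.98, 1.02, 0.98, 3.0])   # one catastrophic failure
bias, scatter, f_out = photoz_stats(z_phot, z_spec)
print(round(bias, 3), round(scatter, 3), f_out)  # → 0.0 0.01 0.2
```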
When using Hyperz and the COMBO code all objects with a
probability vs. redshift distribution that is too wide are rejected
by the following criterion:

σ_z > A · (1 + z_phot)    (3)

In BPZ we reject all objects with:

ODDS < A    (4)
In this way diagrams showing the bias, the scatter, and the outlier
rate vs. completeness are created. While the bias is almost independent
of completeness, the dependencies of the scatter and the outlier rate on
completeness for selected setups are shown in Fig. 4.
Figure 4: Characteristic lines showing completeness vs. 3σ outlier rate and scatter for selected setups.
Figure 5: Photo-z's from the CDFS-COMBO data vs. spectroscopic redshifts.
Figure 6: Same as Fig. 5 but for the GaBoDS data.
Investigating many of these characteristic lines, we find that
the most obvious feature is that the scatter as well as the bias
are often insensitive to a tightening of the cut criterion. This
immediately tells us that the errors from the photo-z codes are
not proportional to the real errors on an object-by-object basis and
are thus of limited use. The real accuracy of the photo-z's is not
tightly correlated with the error estimate.
The curves corresponding to Hyperz (dotted lines) show
some dependence of the outlier rate on a tightening
of the cut. At some point around 80% completeness a saturation
behaviour sets in and a further tightening does not decrease the
outlier rates anymore. BPZ shows a similar but less pronounced
behaviour. Thus, very large confidence intervals or very low ODDS
values indicate that the photo-z estimation failed indeed. We assume
that at this point the width of the confidence interval is not
dominated by the photometric errors but becomes influenced by
systematic uncertainties in the photometric calibration, the template
set, the filter curves or the code itself.
From the preceding paragraphs it should be clear that the choice of A in Eqs. (3) and (4) as
a criterion for a reliable redshift estimate is somewhat arbitrary.
After careful investigation of all characteristic line plots for all
setups we decided to fix the cut for the rejection of uncertain
objects in Hyperz and in the COMBO code at fixed values of A,
and in BPZ at ODDS<0.95. This appears to
eliminate the most uncertain objects in the datasets studied
here. Furthermore, the error distribution of these remaining,
secure samples is close to a Gaussian for most setups if 3-sigma
outliers are rejected, i.e. roughly 68% of the objects lie within
their error estimate around the mean.
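The 3-sigma rejection used here is the standard iterative clipping; a minimal version (our own helper, not code from any of the three packages) looks like this:

```python
import numpy as np

def sigma_clipped_stats(dz, nsigma=3.0, max_iter=10):
    """Iteratively reject points more than nsigma standard deviations
    from the mean and recompute mean and std on the survivors."""
    keep = np.ones(dz.size, dtype=bool)
    for _ in range(max_iter):
        mean, std = dz[keep].mean(), dz[keep].std()
        new_keep = np.abs(dz - mean) <= nsigma * std
        if np.array_equal(new_keep, keep):
            break
        keep = new_keep
    return mean, std, keep

# a tight, symmetric core plus one catastrophic outlier
dz = np.append(np.linspace(-0.01, 0.01, 99), 1.0)
mean, std, keep = sigma_clipped_stats(dz)
print(keep.sum())  # → 99  (the outlier is clipped)
```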
There is clearly some amount of degeneracy between the quantities defined in this section. If the photo-z error distribution were purely Gaussian, scatter and bias would be sufficient to characterise the accuracy of one particular setup. As described above, this is not the case for real data (see also Figs. 5 and 6). Usually, there is a core, which might be offset by some bias, and there are very extended wings containing catastrophic outliers. This complex error distribution is not easily described by a few numbers, and a specific choice must be a compromise between clarity and degeneracy.
For example, a smaller core scatter will probably produce more
outliers than a larger core scatter. With no alternative at
hand to condense the performance of one particular setup into a
handful of numbers we can only refer to the
vs.
plots shown in the following which give an
uncompressed view of the data.
Table 2: Photo-z errors and outlier rates for selected setups on the COMBO-CDFS data (bright sample left, faint sample right).
Table 3: Same as Table 2 but for the GaBoDS-CDFS data.
The different setups are named with three-letter acronyms with the
first letter denoting the code (``H'' for Hyperz, ``C'' for the
COMBO code, and ``B'' for BPZ), the second letter denoting the
dataset (``B'' for GaBoDS and ``C'' for COMBO), and the digit at the
third position denoting the filter set (``5'' for ``UBVRI'',
``4'' for ``BVRI'', and ``17'' for the full COMBO-17 filter set
including medium-band filters).
BPZ and Hyperz produce some negative biases. The cross-calibration between templates and photometry is obviously more accurate for the PEGASE templates used by the COMBO code than for the CWW, Kinney, and BC templates used by BPZ and Hyperz, respectively. Similar negative biases are found by Csabai et al. (2003) using the CWW and BC templates for photo-z estimates on SDSS data.
The COMBO code shows the expected behaviour that the photo-z accuracy decreases when further filters are excluded. Completeness decreases while outlier rate and scatter increase. No large biases are produced in any setup.
A very interesting fact concerning Hyperz and BPZ is that the exclusion of the U-band decreases the bias in the bright bin. Clearly, the photo-z results with the COMBO code as well as the comparisons between the different datasets in Sect. 2.1 show that this behaviour is not caused by a badly calibrated U-band.
The best results in both magnitude bins are certainly achieved with the full 17-filter set of COMBO-17. Especially in the bright magnitude bin the scatter and the outlier fractions are very small compared to all 4- or 5-filter-setups. In the fainter bin, however, the difference is not as dramatic due to the lack of depth in many of the medium-bands. Hyperz also shows relatively accurate results for the 17-filter set (HC17) but not as accurate as the CC17. BC17 performs in between. In the bright bin, the proper modelling of emission lines in the PEGASE templates that can affect the flux in the medium-band filters considerably pays off for the COMBO code resulting in a very small scatter on the 0.02 level. Emission lines are not included in the BC93 templates used by Hyperz and less pronounced in the observed CWW + Kinney templates of BPZ.
The negative biases in the photo-z estimation with BPZ and Hyperz is also present in GaBoDS setups with the U-band included. At this point, it is important to mention again that the GaBoDS U-band filter is different from the COMBO U-band filter. The GaBoDS filter is wider and bluer.
For the COMBO code, the 4- and 5-filter results are nearly
indistinguishable. Only in the faint bin do the outlier rates increase
slightly when the U-band is excluded. Hyperz shows the
unexpected feature that most statistics become more accurate when
going from five to four filters. BPZ shows a similar behaviour
as the COMBO code. The statistics are nearly independent of the choice
between the 4- and 5-filter set. Even the negative bias in the faint
bin is this time present when using just BVRI.
The biases for the setups including the U-band may well be due to
the very blue U-band filter used for the GaBoDS data. The filter curve
entering the photo-z code is less well defined because of the
strongly varying spectral throughput of the atmosphere in the
near-UV and the large chip-to-chip variations in differential CCD
efficiency at these wavelengths. We tried to shift the blue-cutoff
of the transmission curve of the atmosphere in a reasonable
range. This can slightly reduce the photo-z bias but might not be
reproducible. This problem is also present in the COMBO data but
less severe due to the redder COMBO U-band.
Table 4: Same as Tables 2 and 3 but for the FDF data (low-z sample left, high-z sample right).
The outlier-excluded scatter values do not show a clear
trend, with every code being the most accurate in at least one setup.
There is clearly some amount of degeneracy between completeness,
bias, scatter, and outlier rate.
The plots in
Figs. 5 and 6 provide a more
complete view of the performance.
Figure 7: Photometric vs. spectroscopic redshifts for the FDF full 8-filter set imaging data. The left diagram shows results for BPZ, the middle diagram for the COMBO code, and the right diagram for Hyperz.
Remarkably, the negative biases introduced by BPZ and Hyperz as reported above are much smaller or negligible for the COMBO code. This suggests that the consistent photometric calibration of the two surveys (note that the photometry is also consistent with the MUSYC survey) is not the source of the biases. Rather the combination of these ground-based photometric datasets with particular template sets seems to be problematic. BPZ and Hyperz together with the supplied template sets are tested in their release papers (Benítez 2000; Bolzonella et al. 2000) only against real data from the Hubble Deep Field, besides simulations. BPZ now incorporates a new template set (see Sect. 3.3) that was specially calibrated for HST photometry. The COMBO code, however, was originally designed for the ground-based survey CADIS (Wolf et al. 2001), where colours were measured bias-free from seeing adaptive photometry, and included photo-z's for point-source QSOs.
In general, photo-z biases can be removed by a recalibration procedure with a spectroscopic training sample. Fixing the redshift for the training set objects one can fit for zeropoint offsets in the different filters that minimise the magnitude differences between the observed object colours and best-fit template colours. We developed such recalibration methods for BPZ and Hyperz making use of the spectroscopic redshifts of the VVDS. In this way we can decrease or completely remove the biases which are still present in the blind setups. A more advanced technique incorporating also a recalibration of the template set after recalibrating the photometric zeropoints can lead to even more accurate results (see e.g. Benítez et al. 2004; Ilbert et al. 2006). We don't refer to the recalibrated photometry in the remainder of this paper and instead focus on blind applications.
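For a fixed set of best-fit templates, the zeropoint-offset step of such a recalibration reduces to a per-band least-squares problem. The sketch below shows one iteration only; the alternation with re-fitting the templates at the known spectroscopic redshifts, as in Ilbert et al. (2006), is omitted, and the function name is ours:

```python
import numpy as np

def fit_zeropoint_offsets(obs_mags, model_mags):
    """Per-band zeropoint offsets minimising the residuals between
    observed and best-fit template magnitudes, both (n_gal, n_band);
    for a pure offset the least-squares solution is the mean residual."""
    return np.mean(obs_mags - model_mags, axis=0)

model = np.array([[20.0, 21.0], [22.0, 23.5], [19.5, 20.0]])
obs = model + np.array([0.05, -0.10])   # simulated calibration offsets
offsets = fit_zeropoint_offsets(obs, model)  # recovers the injected offsets
```

The fitted offsets are then subtracted from the photometry and the template fit is repeated until convergence.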
One of the biggest differences between the codes is the template set
chosen and one might presume that most of the difference in
performance originates from this point. However, we run Hyperz
with the PEGASE templates used by the COMBO code as well as with the
CWW templates plus two Kinney starburst templates originally used by
BPZ in Benítez (2000). We switch off the
Hyperz internal reddening because it is already included in the
BPZ templates and the PEGASE age × extinction grid used by
the COMBO code. The results can neither compete with the best
Hyperz setups incorporating the BC templates nor with the COMBO
code plus PEGASE templates. Hence, the implementation of user-defined
templates appears not to be straightforward, and results may not be
competitive with the template sets that are shipped with the code and
were tested and optimised by the author.
Another interesting point is the comparison of the CC17 setup with the CB5 setup. While the total exposure time with WFI is lower for CC17, the performance of CC17 is better in all statistics described here. It is clear that for the particular application of photo-z's for bright objects, the exposure time was well spent on more filters (which is an important result for future surveys). However, the GaBoDS data of the CDFS are completely based on archive data and no specific observing programme was proposed to create these deep images. Furthermore, for deeper applications, such as Lyman-break galaxy studies, where one simply needs a very deep colour index between three bands, the GaBoDS data are certainly highly superior to the COMBO data.
In the high redshift domain, however, the COMBO code does not perform well, with an outlier rate and scatter twice as large as the ones produced by Hyperz and with a considerable bias. BPZ performs not very differently from Hyperz. Apparently, the COMBO code in combination with the PEGASE templates has problems when the Lyman break enters the filter set: many objects appear at too low redshifts, hence the large negative bias (see also Fig. 7). The inferior performance in the high redshift domain can then be attributed to colour-redshift degeneracies described in detail in Benítez (2000). Basically, a larger number of templates can lead to better low-z performance with the tradeoff of poorer high-z performance due to increasing degeneracies. Designed for medium-deep surveys, the COMBO code was naturally not optimised to work at high redshifts, in contrast to BPZ and Hyperz. There, the application of a Bayesian prior on the apparent magnitude combined with a sparse template set (BPZ) or a top-hat prior on the absolute magnitude (Hyperz) delivers significantly better results.
The dependence of photo-z performance on the filter set
is also shown in Table 4. In the lower redshift
interval the outlier rate nearly doubles as soon as the NIR filters J
and Ks are dropped. The scatter, however, remains nearly constant.
Without near-infrared data a larger negative bias is introduced which
was already present in all VVDS-Hyperz setups (see
Tables 2 and 3). The
exclusion of the peculiar U-band reduces this bias again with the
drawback of increased scatter. In the higher redshift domain results
get much worse when near-infrared data are dropped.
Figure 8: Photo-z vs. photo-z for the COMBO code run on the two datasets of different depth (CC5 vs. CB5).
Figure 9: Same as Fig. 8 but with photo-z vs. photo-z for the COMBO code (CB5) and Hyperz (HB5) run on the GaBoDS data.
Table 5: Statistics for the comparison between the different photo-z's.
We define similar quantities as in Sect. 4 but now with the spectroscopic redshift replaced by another photo-z. Since neither of the two photo-z's is superior to the other in general, the interpretation of the statistics then changes. For example, a catastrophic disagreement between two photo-z estimates just means that at least one of the two is wrong, but it can also be the case that both are wrong.
Due to these complications we can learn most from comparing the photo-z vs. photo-z benchmarks for different subsamples. In the following, we will look at the complete CC5 catalogue and compare these redshift estimates with CB5. Moreover, we compare CB5 to HB5. Thus, we study how the performance is affected either by additional depth or by using a different code. Two samples are considered, the whole catalogue with 17<R<23 and the subsample with secure spectroscopic redshifts used before. Any significant statistical difference in these photo-z vs. photo-z comparisons can then be interpreted as being due to selection effects in the spectroscopic subsample.
Figures 8 and 9 show the results for a comparison of photo-z's from data of different depths and from different codes, respectively. The statistics are summarised in Table 5. We require an object to meet the criteria defined in Sect. 4.1 for both photo-z setups entering the comparison.
There are some distinctive features visible in
Fig. 8. The ones labelled ``A'' and ``B'' can
be found in both panels, and just the overall number density in the
left panel is larger by a factor of 14. Two other features, one clump
and a couple of outlier objects, are however only found in the
photometric sample and not in the spectroscopic one.
Even more striking is the difference in the overall distributions of objects when the two codes are compared in Fig. 9.
This is also reflected in the numbers. The outlier rates, f0.15,
for the full sample are larger by a factor of about 2 when comparing
data depth (CC5 vs. CB5) and about 4 when comparing codes on identical
data (CB5 vs. HB5). Completeness and scatter are essentially the same
for both subsamples, while the bias slightly increases from ``CB5
vs. HB5, all'' to ``CB5 vs. HB5, spectro''.
This means that the properties of the core of the distribution are quite similar for both samples but that the wings are
more pronounced when all objects are considered. Apparently, the
secure spectroscopic subsample represents an intrinsically different
galaxy population than the full, purely magnitude-limited sample. The
rejection of objects with bad spectroscopic flags introduces a bias in
the spectroscopic sample so that it is no longer purely
magnitude-limited and artificially reduces the apparent outlier rates.
We have shown that photo-z's estimated with today's tools can reach reasonable accuracy. The performance of a particular photo-z code, however, cannot easily be characterised by a mere two numbers such as scatter and global outlier rate. The benchmarks are rather sensitive functions of filter set, depth, redshift range, and code settings. Moreover, there are differences of at least a factor of two in performance between different codes, which again are not stable across all setups but can vary considerably from one setup to another. There are, for example, redshift ranges where one code clearly beats another in terms of accuracy, only to lose at other redshifts. We give estimates of the performance for a number of codes in some practically relevant cases.
The estimation of photo-z's from different ground-based datasets is not straightforward, and results should not be expected to be identical to simulated photo-z estimates. Rather, photo-z simulations often seem to circumvent critical steps in ground-based photo-z estimation. Most importantly, the match between observed colours and some commonly used template sets may be suboptimal.
In the preceding sections we have identified several aspects which are relevant to future optimisations of photo-z codes. The photo-z error estimation is one of the most unsatisfying aspects to date, with error values often only very weakly correlated with real uncertainties. This is likely due to the insufficient inclusion of systematics, since very low S/N objects, for which the errors should be dominated by photon shot-noise, show a tighter correlation. Chip-to-chip sensitivity variations, especially in the UV, could either be taken into account more accurately within the photo-z codes or could be tackled by improved instrument design, survey strategy, and data reduction. The optimisation of template sets can be expected to be done successfully with the ever larger spectroscopic catalogues becoming available.
In general, biases can be removed by a recalibration which requires an extensive spectroscopic training set. Another proven successful route to better photo-z's is improving the spectral resolution of the data, instead of their depth, as demonstrated by the COMBO-17 survey. This approach is also taken by the new ALHAMBRA survey (Moles et al. 2005; Benítez et al. 2007, A&A, submitted) and COSMOS-21.
A general problem for all studies comparing photo-z's to
spectroscopic redshifts is our finding that secure spectroscopic
samples can be biased. While surveys like the VVDS are highly
complete in obtaining spectra for galaxy samples, the redshifts that
are claimed to be secure amount to only a fraction of these.
This subsample obviously
consists of galaxies for which the photo-z estimation works better
than for the whole sample. In the future, it is desirable to put
effort into spectroscopic surveys with secure redshift measurements
for virtually every galaxy down to the same flux limit that is used
for the analysis of photo-z samples.
Several questions raised in this work will be tackled by the PHAT initiative mentioned above. PHAT aims to understand the issues presented here in a systematic and quantitative way in order to give guidance for better photo-z's in the future.
Acknowledgements
This work was supported by the German Ministry for Education and Science (BMBF) through the DLR under the project 50 OR 0106, by the BMBF through DESY under the project 05 AV5PDA/3, and by the Deutsche Forschungsgemeinschaft (DFG) under the projects SCHN342/3-1 and ER327/2-1. HH was supported by the European DUEL RTN, project MRTN-CT-2006-036133. CW was supported by a PPARC Advanced Fellowship.