Estimating photometric redshifts for X-ray sources in the X-ATLAS field using machine-learning techniques

G. Mountrichas; A. Corral; V. A. Masoura; I. Georgantopoulos; A. Ruiz; A. Georgakakis; F. J. Carrera; S. Fotopoulou

doi:10.1051/0004-6361/201731762

Volume 608 (December 2017)

A&A, 608 (2017) A39

Issue		A&A Volume 608, December 2017


Article Number		A39
Number of page(s)		10
Section		Catalogs and data
DOI		https://doi.org/10.1051/0004-6361/201731762
Published online		04 December 2017

Estimating photometric redshifts for X-ray sources in the X-ATLAS field using machine-learning techniques⋆

G. Mountrichas¹, A. Corral²,¹, V. A. Masoura¹,³, I. Georgantopoulos¹, A. Ruiz¹, A. Georgakakis¹, F. J. Carrera² and S. Fotopoulou⁴

¹ National Observatory of Athens, V. Paulou & I. Metaxa, Athens 11532, Greece
e-mail: gmountrichas@gmail.com
² Instituto de Fisica de Cantabria (CSIC-Universidad de Cantabria), 39005 Santander, Spain
³ Section of Astrophysics, Astronomy and Mechanics, Department of Physics, Aristotle University of Thessaloniki, 54 124 Thessaloniki, Greece
⁴ Department of Astronomy, University of Geneva, ch. d’Ecogia 16, 1290 Versoix, Switzerland

Received: 11 August 2017
Accepted: 3 October 2017

Abstract

We present photometric redshifts for 1031 X-ray sources in the X-ATLAS field using the machine-learning technique TPZ. X-ATLAS covers 7.1 deg² observed with XMM-Newton within the Science Demonstration Phase of the H-ATLAS field, making it one of the largest contiguous areas of the sky with both XMM-Newton and Herschel coverage. All of the sources have available SDSS photometry, while 810 additionally have mid-IR and/or near-IR photometry. A spectroscopic sample of 5157 sources primarily in the XMM/XXL field, but also from several X-ray surveys and the SDSS DR13 redshift catalogue, was used to train the algorithm. Our analysis reveals that the algorithm performs best when the sources are split, based on their optical morphology, into point-like and extended sources. Optical photometry alone is not enough to estimate accurate photometric redshifts, but the results greatly improve when at least mid-IR photometry is added in the training process. In particular, our measurements show that the estimated photometric redshifts for the X-ray sources of the training sample have a normalized absolute median deviation, nmad ≈ 0.06, and a percentage of outliers, η = 10–14%, depending upon whether the sources are extended or point like. Our final catalogue contains photometric redshifts for 933 out of the 1031 X-ray sources with a median redshift of 0.9.

Key words: X-rays: general / galaxies: active / catalogs / techniques: photometric

⋆ The table of the photometric redshifts is only available at the CDS via anonymous ftp to cdsarc.u-strasbg.fr (130.79.128.5) or via http://cdsarc.u-strasbg.fr/viz-bin/qcat?J/A+A/608/A39

© ESO, 2017

1. Introduction

Current and future surveys (e.g. XMM, eROSITA, DES, and Euclid) will provide us with large datasets that contain hundreds of thousands of sources. Spectroscopy is expensive in telescope time and challenging to complete for large samples, thus photometric redshift (photo-z) estimations have become a necessity in observational astronomy today. Although photo-z estimations are cheaper and the only means to estimate distances for large samples, they are also subject to systematics and higher uncertainties than spectroscopic redshift estimations (spec-z).

The pursuit of accurate photometric redshifts has led to the development of many photo-z estimation methods that can be divided into two main categories: template-fitting (e.g. Brammer et al. 2008) and machine-learning (e.g. Carrasco Kind & Brunner 2013) techniques, although there are some hybrid methods as well (e.g. Beck et al. 2017). The template-fitting techniques determine the photometric redshifts by fitting spectral templates, either empirical or synthesized from stellar population models, to the observed spectral energy distribution of each source. A number of variations of this technique exist in the literature, such as the Bayesian photometric redshifts (BPZ; Benitez 2000) and Easy and Accurate photo-Z from Yale (EAZY; Brammer et al. 2008). Machine-learning techniques, also known as empirical methods, use a spectroscopic dataset to train an algorithm, which is then applied to a photometric sample to estimate photometric redshifts. Examples of empirical methods include the Artificial Neural Network (ANNz; Collister & Lahav 2004; Lahav & Collister 2012) and random forest techniques, for example, Trees for photo-Z (TPZ; Carrasco Kind & Brunner 2013).

Each of these techniques has its own advantages and disadvantages. Beck et al. (2017) compared the performance of eight photo-z estimation methods (four template-fitting techniques and four machine-learning techniques). Their analysis revealed that all methods perform adequately when the training set coverage is sufficient, but their performance deteriorates when extrapolation is required. Random forest techniques in particular are not expected to perform well beyond the boundaries of the training set. On the other hand, the latter techniques perform better than the other techniques when the photometric measurement errors increase. Beck et al. (2017) concluded that none of the methods is superior to the others and that a trade-off has to be made depending on the available training set, that is to say, its photometric accuracy and coverage.

The machine-learning methods have been successfully applied to derive photometric redshifts for galaxies (e.g. SDSS; Beck et al. 2016) and optical quasi-stellar objects (QSOs) (e.g. Brescia et al. 2015; Cavuoti et al. 2017). However, for X-ray AGN, only spectral energy distribution (SED) fitting techniques have been used (Salvato et al. 2009; Hsu et al. 2014). AGN SEDs are more complicated than galaxy SEDs because of, for instance, contamination from the host galaxy, intrinsic obscuration, variability, and the dominance of different components in different spectral bands. Thus, photo-z estimation for AGN through SED fitting is difficult. On the other hand, machine-learning methods require large spectroscopic training samples to perform well, and X-ray datasets that are suitable to be used as training sets are rare.

We here use X-ray sources detected in the XMM-XXL survey (Liu et al. 2016; Georgakakis et al. 2017) to train, for the first time, a machine-learning algorithm (TPZ; Carrasco Kind & Brunner 2013) to estimate photometric redshifts for X-ray AGN in the X-ATLAS field. Our goal is to use these photo-z estimates in a future paper to estimate the star formation rate (SFR) and stellar mass of these sources and study the connection between the AGN activity and the environment of their host galaxy. In this paper, we check the accuracy of the photo-z estimates. The structure of the paper is as follows: in Sect. 2 we describe the X-ray sources for which we estimate photo-z, in Sect. 3 we briefly describe the TPZ algorithm and provide information for the training sample. The results are presented in Sect. 4, while we discuss and summarize the main conclusions of this work in Sect. 5.

2. X-ray sample

The Herschel Astrophysical Terahertz Large Area Survey (H-ATLAS) is the largest Open Time Key Project carried out with the Herschel Space Observatory (Eales et al. 2010), covering an area of 550 deg² in five far-infrared and sub-millimeter (submm) bands (100, 160, 250, 350, and 500 μm). An area of 16 deg² was presented in the Science Demonstration Phase (SDP) catalogue (Rigby et al. 2011) and lies within one of the regions observed by the Galaxy And Mass Assembly (GAMA) survey (Driver et al. 2011; Baldry et al. 2010). XMM-Newton observed 7.1 deg² with a total exposure time of 336 ks (in the MOS1 camera) within the H-ATLAS SDP area, making the XMM-ATLAS one of the largest contiguous areas of the sky with both XMM-Newton and Herschel coverage. The X-ray catalogue contains 1816 unique sources (Ranalli et al. 2015).

To obtain optical, mid-IR, and near-IR photometry for the XMM-ATLAS sources, we cross-matched the X-ray catalogue with the SDSS-DR13 (Albareti et al. 2015), the WISE (Wright et al. 2010), and the VISTA-VIKING catalogues (Emerson et al. 2006; Dalton et al. 2006) with the ARCHES cross-correlation tool xmatch, which symmetrically matches an arbitrary number of catalogues, providing a Bayesian probability of association or non-association (Pineau 2016). xmatch associates one or more tuples with each X-ray source, including possible counterparts in VISTA and/or WISE, with the corresponding probability. When a given X-ray source had more than one associated tuple, we first selected those with a probability >0.68, then, of these, those that were included in most catalogues, and finally, the one with the highest probability. The cross-match revealed 1031 sources with at least optical photometry. Using the association probabilities derived by xmatch, fewer than 10% of the counterparts in our catalogue are mismatches (≈85 sources). Of the 1031 sources, 848 have mid-IR counterparts, while 589 also have near-infrared (NIR) counterparts (Table 1). Of the 1031 sources, 174 have spectroscopic redshifts from either the SDSS or the GAMA surveys.
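The three-step selection of a counterpart tuple described above can be sketched as follows. This is only an illustration: the record layout (`Association`) and the function name are hypothetical and are not part of the xmatch tool, and the handling of sources with a single tuple, or with no tuple above the probability cut, is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Association:
    """One candidate tuple for an X-ray source (hypothetical record layout)."""
    probability: float      # association probability reported by the matcher
    catalogues: frozenset   # catalogues contributing a counterpart


def select_counterpart(candidates, min_prob=0.68):
    """Probability cut, then most catalogues, then highest probability."""
    if len(candidates) == 1:
        # Assumption: a single tuple is kept as it is.
        return candidates[0]
    reliable = [c for c in candidates if c.probability > min_prob]
    if not reliable:
        # Assumption: no counterpart is assigned in this case.
        return None
    # Tuples are compared first by number of catalogues, then by probability.
    return max(reliable, key=lambda c: (len(c.catalogues), c.probability))


candidates = [
    Association(0.90, frozenset({"SDSS"})),
    Association(0.75, frozenset({"SDSS", "WISE"})),
    Association(0.70, frozenset({"SDSS", "WISE"})),
    Association(0.50, frozenset({"SDSS", "WISE", "VISTA"})),
]
best = select_counterpart(candidates)
```

In this toy case the tuple with three catalogues is rejected by the probability cut, and the more probable of the two SDSS+WISE tuples is preferred over the SDSS-only tuple.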

Table 1

Number of X-ATLAS X-ray AGN divided based on their available photometry and optical morphology.

3. Analysis

3.1. Method

To estimate the photometric redshifts for the X-ray AGN in the X-ATLAS field, we used the publicly available algorithm TPZ. The technique is described in detail in Carrasco Kind & Brunner (2013). In brief, TPZ is a parallel machine-learning algorithm that uses prediction trees and random forest techniques to generate photometric redshift probability density functions (PDFs) by incorporating measurement errors in the calculation while also efficiently accounting for missing values in the data.

Random forest is an ensemble-learning method for classification, regression, and other tasks. The method generates prediction trees and then combines their predictions. Prediction trees are built by asking questions that split the data until a stopping criterion is met that creates a terminal leaf. The leaf contains a subsample of the data with similar properties, and by applying a model within the leaf, a prediction is made.
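The idea of prediction trees combined into a forest can be illustrated with the short sketch below. It is a generic regression forest written for this illustration, not the TPZ implementation: TPZ additionally perturbs the inputs according to their measurement errors and builds a PDF from the redshifts in the leaves, which is omitted here. All function names, the squared-error split criterion, and the parameter values are assumptions.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def _sse(values):
    """Sum of squared deviations from the mean."""
    m = _mean(values)
    return sum((v - m) ** 2 for v in values)


def build_tree(rows, z, min_leaf=5):
    """Grow one regression tree on feature rows and known redshifts z."""
    # Stopping criterion: few objects left, or nothing left to separate.
    if len(rows) <= min_leaf or len(set(z)) == 1:
        return {"leaf": _mean(z)}
    best = None
    for j in range(len(rows[0])):
        # Thresholds: every distinct value except the smallest, so that
        # both sides of the split are non-empty.
        for t in sorted({r[j] for r in rows})[1:]:
            left = [zi for r, zi in zip(rows, z) if r[j] < t]
            right = [zi for r, zi in zip(rows, z) if r[j] >= t]
            cost = _sse(left) + _sse(right)
            if best is None or cost < best[0]:
                best = (cost, j, t)
    if best is None:  # all features are constant: no split is possible
        return {"leaf": _mean(z)}
    _, j, t = best
    lo = [(r, zi) for r, zi in zip(rows, z) if r[j] < t]
    hi = [(r, zi) for r, zi in zip(rows, z) if r[j] >= t]
    return {
        "feature": j,
        "threshold": t,
        "left": build_tree([r for r, _ in lo], [zi for _, zi in lo], min_leaf),
        "right": build_tree([r for r, _ in hi], [zi for _, zi in hi], min_leaf),
    }


def predict_tree(tree, row):
    """Follow the splits down to a terminal leaf and return its mean redshift."""
    while "leaf" not in tree:
        side = "left" if row[tree["feature"]] < tree["threshold"] else "right"
        tree = tree[side]
    return tree["leaf"]


def build_forest(rows, z, n_trees=5, min_leaf=5, seed=0):
    """Train each tree on a bootstrap resample of the training set."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(build_tree([rows[i] for i in idx], [z[i] for i in idx], min_leaf))
    return forest


def predict_forest(forest, row):
    """Combine the trees by averaging their individual predictions."""
    return _mean([predict_tree(tree, row) for tree in forest])


# Toy training set: the first feature tracks redshift, the second is noise.
toy_rows = [[i / 119, (i * 7 % 13) / 13] for i in range(120)]
toy_z = [2.0 * row[0] for row in toy_rows]
toy_forest = build_forest(toy_rows, toy_z)
toy_prediction = predict_forest(toy_forest, [0.5, 0.3])
```

Because each leaf only averages training redshifts, such a forest cannot return values outside the range of its training set, which is why extrapolation is a known weakness of this family of methods.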

TPZ is an empirical technique and therefore required a dataset with spectroscopically measured redshifts to train the algorithm before it was applied to our photometric X-ray sample. The spectroscopic training sample we used in our analysis is described in the following section.

3.2. Training sample

The X-ray catalogue we used to train the TPZ algorithm comes from the XMM-XXL survey. XMM-XXL covers a total area of about 50 deg² with an exposure time of about 10 ks per XMM pointing (Liu et al. 2016; Georgakakis et al. 2017). In the north, 8445 X-ray sources are detected (XXL-N). This region extends to about 25 deg². Of these sources, 5294 have optical (SDSS) photometry. Reliable spectroscopy from SDSS-III/BOSS is available for 2512 AGN (Menzel et al. 2016). To increase the size of our training sample, we also included sources from the XWAS (XMM-Newton Wide Angle Survey; Esquej et al. 2013), XBS (Della Ceca et al. 2004), XMS (Barcons et al. 2007), and COSMOS (Brusa et al. 2010) surveys. We also added ~1500 optically selected X-ray AGN with spectroscopic redshifts from the SDSS-DR13 dataset by cross-matching the 3XMM-DR5 catalogue with SDSS, UKIDSS (Hambly et al. 2008; Irwin 2008), 2MASS (Skrutskie et al. 2006), and WISE. This increased the total number of sources in our training sample to 5157 (Table 2). Testing the performance of TPZ (see next section) with and without the optically selected X-ray AGN revealed that the inclusion of these extra sources marginally but systematically improved the training process of the TPZ code. Specifically, the outlier percentage (see next section) decreased by 2–3% in all cases. Therefore, the results we present next were estimated using the training sample described above.

In addition to the photometric bands of SDSS (u, g, r, i, and z), we included mid-IR (W1, W2) and near-IR (J, H, K) bands in the training process of TPZ to determine whether its performance improved. For this purpose we cross-matched the 5157 sources with the WISE catalogue and near-IR catalogues, that is, with VISTA, UKIDSS, or 2MASS. The cross-match was performed using the xmatch cross-correlation tool and following the same analysis as described in the previous section for the ATLAS sources. The number of sources we obtained and the available photometry is presented in Table 2. Although TPZ can infer missing photometry, in our validation tests and the estimation of the photometric redshifts of the X-ATLAS sources, only the available photometric bands were used for each subsample.

The redshift distribution of the training set is presented in Fig. 1.

Table 2

Number of sources used to train TPZ, with the corresponding available photometry.

Fig. 1

Redshift distribution of the 5157 sources used to train the TPZ algorithm (black solid line). The dashed and dotted lines present the redshift distribution when we split the training sources into extended and point like based on their optical classification.

Fig. 2

Point-like sources. Left: importance of attributes as a function of redshift. Right: RMS importance factor as a function of the attributes computed using the bias and its scatter.

Fig. 3

Same measurements as presented in Fig. 2, but for extended sources.

Fig. 4

Left: u−g vs. g−r colour distribution of the training sample (black circles) and the X-ATLAS sources (blue triangles). Right: z−W1 vs. J−H colour distribution of the training sample (black circles) and the X-ATLAS sources (blue triangles). The fraction of the X-ATLAS sources that is well covered by the training set is different for different colour combinations. This is quantified in Table 4.

Table 3

Performance of the TPZ algorithm, estimated by splitting our spectroscopic sample (see Sect. 3.2) into train and test files.

Fig. 5

Performance of TPZ using the ten available photometric bands (SDSS+WISE+near-IR). The training sample has been split into train and test files to compare the estimated photometric redshifts with the spectroscopic redshifts of the sources. The dashed lines correspond to Δz_norm = ± 0.15. Based on our analysis, the number of outliers is η = 9% and η = 13%, for the extended and point-like sources, respectively. The normalized absolute median deviation is σ_nmad ≈ 0.04–0.05.

Fig. 6

Examples of PDFs produced by TPZ during the validation process. The top panels present results for extended sources and the bottom panels for point-like sources. In the left panels, the estimated photo-z (dotted line) is in agreement with the spectroscopic redshift (solid line) of the source. In the right panels, the estimated photo-z differs significantly from the spectroscopic redshift. These measurements are also characterized by a low confidence level of the photometric redshift.

Fig. 7

Left: r−i vs. u−g colour distribution of the training sample (black circles) and the outliers (blue triangles). Right: H−W1 vs. K−W1 colour distribution of the training sample (black circles) and the outliers (blue triangles).

3.3. Checking the performance of TPZ using the training set

To check the performance of TPZ in estimating accurate photometric redshifts, we split our training set into two subsamples. One was used to train the algorithm, and the other subsample was used as a test case for which we estimated photometric redshifts. This is an ideal scenario since both subsamples share the same region of the parameter space and the same quality of (spectroscopic) data, that is, the same distribution in redshift and magnitude as well as the same photometric errors. To account for the fainter magnitudes of our photometric X-ATLAS sources compared to the spectroscopic training sample and to facilitate a more accurate check of the TPZ performance, in this test we trained TPZ using colours instead of magnitudes. Figure 4 presents two examples of the colour distribution of the training sources (black circles).
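The two preparation steps described above, converting magnitudes into colours and splitting the spectroscopic sample into train and test files, can be sketched as below. The choice of all pairwise colours, the band ordering by wavelength, the test fraction, and the function names are assumptions made for this illustration; the text does not specify which colours or which split fraction were used.

```python
import random
from itertools import combinations

# Bands ordered by increasing wavelength (assumed ordering).
BANDS = ["u", "g", "r", "i", "z", "J", "H", "K", "W1", "W2"]


def colours_from_magnitudes(mags):
    """Return every bluer-minus-redder colour available for one source.

    `mags` maps band name to magnitude; missing bands are simply skipped.
    """
    available = [band for band in BANDS if band in mags]
    return {f"{a}-{b}": mags[a] - mags[b] for a, b in combinations(available, 2)}


def split_train_test(sources, test_fraction=0.5, seed=0):
    """Randomly split a list of sources into a training and a test subsample."""
    shuffled = list(sources)
    random.Random(seed).shuffle(shuffled)
    n_test = round(test_fraction * len(shuffled))
    return shuffled[n_test:], shuffled[:n_test]


example_colours = colours_from_magnitudes({"u": 20.0, "g": 19.5, "r": 19.0})
train, test = split_train_test(range(10), test_fraction=0.3, seed=1)
```

Working with colours rather than magnitudes removes the overall brightness of a source from the feature set, so a fainter target sample can still fall inside the region covered by the training set.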

The accuracy of the photometric redshifts estimated by TPZ was quantified by two widely used statistical parameters, the normalized absolute median deviation, σ_nmad, and the percentage of outliers, η. σ_nmad is defined as

$$\begin{aligned} \Delta(z_{\rm norm}) &= \frac{z_{\rm spec}-z_{\rm phot}}{1+z_{\rm spec}}, \\ {\rm MAD}(\Delta(z_{\rm norm})) &= {\rm Median}(|\Delta(z_{\rm norm})|), \\ \sigma_{\rm nmad} &= 1.4826 \times {\rm MAD}(\Delta(z_{\rm norm})). \end{aligned} \qquad (1)$$

The percentage of outliers, η, is defined as

$$\eta = \frac{100}{N} \times ({\rm Number\ of\ sources\ with\ } |\Delta(z_{\rm norm})| > 0.15). \qquad (2)$$

Since the near-IR data come from different surveys, the training sample was used to calibrate any possible dependencies on the different filters used, that is, the differences between the K filter on UKIDSS and the K_s filter on VISTA and 2MASS. Our tests revealed that there are no differences, regardless of whether we ignored the different filters or scaled K magnitudes to K_s. For example, using the SDSS+NIR sample for point-like and extended sources, the percentage of outliers differs by <± 0.8% and the difference in σ_nmad is negligible. Therefore, we ignored this difference in filters in our analysis.
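The two statistics defined in Eqs. (1) and (2) can be evaluated with a few lines of code. The sketch below follows the definitions term by term; the function names are chosen for this illustration only.

```python
from statistics import median


def delta_z_norm(z_spec, z_phot):
    """Normalized residuals (z_spec - z_phot) / (1 + z_spec), as in Eq. (1)."""
    return [(zs - zp) / (1.0 + zs) for zs, zp in zip(z_spec, z_phot)]


def sigma_nmad(z_spec, z_phot):
    """1.4826 times the median of the absolute normalized residuals, Eq. (1)."""
    return 1.4826 * median([abs(d) for d in delta_z_norm(z_spec, z_phot)])


def outlier_percentage(z_spec, z_phot, limit=0.15):
    """Percentage of sources with |delta z_norm| above `limit`, Eq. (2)."""
    residuals = delta_z_norm(z_spec, z_phot)
    n_outliers = sum(1 for d in residuals if abs(d) > limit)
    return 100.0 * n_outliers / len(residuals)


z_spec = [1.0, 1.0, 1.0, 1.0]
z_phot = [1.0, 0.9, 1.1, 2.0]
example_nmad = sigma_nmad(z_spec, z_phot)
example_eta = outlier_percentage(z_spec, z_phot)
```

In this toy case one source out of four has |Δ(z_norm)| = 0.5, so η = 25%, while σ_nmad is barely affected by it, which illustrates why the two numbers are quoted together.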

Our initial tests revealed that the performance of the TPZ algorithm in estimating photometric redshifts improved when we split the sources based on their morphology (Salvato et al. 2011). Using the SDSS photometric bands and estimating photometric redshifts without dividing the sources into point like and extended, we obtained σ_nmad = 0.12 and η = 35%. These numbers are higher than those derived when splitting the sources based on their optical morphology (see Table 3). We also tried to use the morphology as one of the features used to train the algorithm. Our tests revealed that there is no improvement in the accuracy of the photo-z estimations. For example, using ten photometric bands, σ_nmad = 0.05 and η = 11.8%. These estimates are in between the values obtained when the sources are split based on their morphology (Table 3). We therefore split the training sources into point like and extended, using their SDSS classification. The number of sources in each subsample is shown in Table 2. Their redshift distribution is presented in Fig. 1. Based on the two distributions, we can reach redshifts of up to 3.5 and 2.5 for point-like and extended sources, respectively.

Table 3 presents the values for the various parameters of TPZ we used to estimate photometric redshifts for each subsample. Nrandom is the number of random realizations that TPZ performs, NTrees is the number of trees used, and Natt the number of attributes for TPZ. The number of the bins used was 50 in the case of extended sources and 70 for the point-like sources. To estimate the PDFs and the confidence level of the estimated photometric redshifts (see Carrasco Kind & Brunner 2013), the rms factor was set to 0.06. The same values for each parameter were used to estimate the photo-z for the 1031 X-ray sources in the ATLAS field (next section).

Figure 2 presents the importance of some of the attributes we used in the training process of the TPZ algorithm. The left panel presents the importance of the attribute as a function of redshift for the point-like sources when ten photometric bands are available. A factor of one in the importance implies that the attribute acts as a random variable (for more details see Sect. 4.1.1. in Carrasco Kind & Brunner 2013). The right panel presents the RMS importance factor as a function of the attributes computed using the bias, defined as Δz = z_spec−z_phot, and its scatter. Figure 3 shows the same measurements for the extended sources.

The left panels of Figs. 2 and 3 show that the importance of each attribute is different at different redshifts. In the case of point-like sources, the z−W1 colour is the most important attribute up to redshift 2.5, but its importance significantly drops at z = 3. Similarly, the importance of the H−K colour in the case of extended sources significantly drops at z > 1.4. Moreover, the same colours have a different importance for point-like and extended sources, as can be more clearly seen in the right panels of the two figures. For instance, the z−W1 colour is the most important attribute for the point-like sources, but is least important in the case of extended sources. Therefore, the importance of the colours used to estimate photometric redshifts for X-ray sources strongly depends on the morphology of the source and the redshift range of interest.

The results of our measurements are presented in Table 3. When we use optical photometry alone (SDSS), the number of outliers is high, especially in the case of point-like sources. When we add mid-IR colours (WISE), the results improve significantly, while TPZ performs best when we also include near-IR magnitudes in the training process of the algorithm. Figure 5 compares the estimated photometric redshifts with the available spectroscopic redshifts of the sources. Figure 6 presents examples of photometric redshift PDFs produced by TPZ.

The number of outliers drops to 9–14% when ten bands are used for the photo-z estimation (Table 3). Although this number is significantly lower than the outlier percentage that we obtained when we used fewer photometric bands, there is a non-negligible number of outliers even among our best photo-z estimates. Figure 7 presents the colour space occupied by the training sample (black circles) for different colour combinations. Outliers (blue triangles) lie within the boundaries of the training set. Therefore, their existence cannot be attributed to the extrapolation in colour space that TPZ may be required to perform. Although the cause of these outliers is uncertain, their percentage can be significantly reduced by applying a cut in the confidence level, z_conf (Carrasco Kind & Brunner 2013), of the photo-z. For example, for z_conf > 0.6, η = 4.5% in the case of point-like sources. The percentage further decreases (η = 2.4%) when we consider only photo-z estimated using ten photometric bands. When we apply a z_conf > 0.5 cut for the extended sources, the corresponding numbers are η = 4.0% and η = 1.2%. Figure 8 presents the distribution of z_conf for point-like and extended sources.
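A minimal sketch of how such a confidence cut could be applied is given below. It assumes that z_conf is the fraction of the PDF enclosed within z_phot ± rms × (1 + z_phot), which is our reading of Carrasco Kind & Brunner (2013), with the rms factor of 0.06 used in this work. The function names, the uniform redshift grid, and the toy records are illustrative assumptions and are not part of the TPZ code.

```python
def z_conf(z_grid, pdf, z_phot, rms_factor=0.06):
    """Fraction of the PDF within z_phot +/- rms_factor * (1 + z_phot).

    Assumes `pdf` is sampled on the uniform grid `z_grid`.
    """
    half_width = rms_factor * (1.0 + z_phot)
    inside = sum(p for z, p in zip(z_grid, pdf) if abs(z - z_phot) <= half_width)
    return inside / sum(pdf)


def outliers_after_cut(records, min_conf, limit=0.15):
    """Outlier percentage among sources with z_conf above `min_conf`.

    `records` holds (z_spec, z_phot, z_conf) tuples. Returns the percentage
    of outliers and the number of sources that survive the cut.
    """
    kept = [(zs, zp) for zs, zp, conf in records if conf > min_conf]
    n_out = sum(1 for zs, zp in kept if abs(zs - zp) / (1.0 + zs) > limit)
    return 100.0 * n_out / len(kept), len(kept)


grid = [k / 10 for k in range(21)]                    # z = 0.0, 0.1, ..., 2.0
peaked_pdf = [1.0 if k == 10 else 0.0 for k in range(21)]
flat_pdf = [1.0] * 21
conf_peaked = z_conf(grid, peaked_pdf, z_phot=1.0)
conf_flat = z_conf(grid, flat_pdf, z_phot=1.0)

records = [(1.0, 1.0, 0.9), (1.0, 2.0, 0.2), (0.5, 0.52, 0.7), (2.0, 1.0, 0.65)]
eta_all, n_all = outliers_after_cut(records, min_conf=0.0)
eta_cut, n_cut = outliers_after_cut(records, min_conf=0.6)
```

A sharply peaked PDF yields a confidence close to one, whereas a flat PDF yields a low value, so a cut on z_conf preferentially removes sources with broad or multi-peaked PDFs at the price of a smaller sample.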

Variability of AGN can affect the accuracy of the estimated photometric redshifts (Simm et al. 2015). This is not a problem for the optical bands of SDSS we used, since all bands have been observed simultaneously. Variability is also minimal in the mid-IR photometric bands. No estimate of the variable sources in our sample can be made for the near-IR bands, however. We would expect most of these sources to be excluded when a z_conf cut is applied, as discussed above, but a flag cannot be assigned to indicate these sources in the full catalogue.

Table 4

Fraction of the X-ATLAS sample that is well covered in all possible combinations of colours as well as in at least one colour-colour combination.

Fig. 8

Distribution of z_conf for the extended (dashed line) and the point-like (dotted line) sources in our training sample.

Fig. 9

i−z vs. g−i colour space diagram. Black dots present the sources in our training sample. The black solid line defines the region of the colour space that contains 90% of the training sources as estimated by the KDE test. Green dots are the sources from the X-ATLAS sample inside the 90% region, and red crosses present the remaining X-ATLAS sources.

Fig. 10

Redshift distribution of the 933 X-ATLAS sources taking into account the full PDF of each source. Photo-z are estimated using the TPZ algorithm.

4. Results

Following the results of the tests during the validation process (see previous section), we split the 1031 X-ATLAS X-ray AGN into point-like and extended sources using their SDSS classification. The number of sources divided based on their optical morphology as well as the available photometry is presented in Table 1.

Machine-learning methods, such as TPZ, are known to perform poorly when no training set coverage is available and extrapolation must be performed (Beck et al. 2017). Figure 4 compares the colour distribution of the X-ATLAS AGN (blue triangles) with that of the training sample (black circles). In both examples, the coverage of the training set seems sufficient to properly train TPZ to estimate the photometric redshift of the X-ATLAS sources. To quantify the differences among the colours between the training and the X-ATLAS samples, we performed a kernel-density estimation (KDE) test. Using KDE, we defined the region in colour space that contained 90% of the training sample. Then we estimated the fraction of the X-ATLAS sources that were contained in that region, that is, the sources that are well covered by the training sample. This is illustrated in Fig. 9 for the g−i vs. r−z colours. Table 4 presents the fraction of X-ATLAS sample that is well covered in all possible combinations of colours as well as in at least one colour-colour combination.
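The coverage test described above can be sketched as follows. This is an illustration under stated assumptions: a Gaussian kernel with a fixed, hand-picked bandwidth is used, the 90% region is defined as the set of points whose estimated density exceeds the level reached by 90% of the training sources, and the function names are hypothetical. The actual kernel and bandwidth used in the analysis are not specified in the text.

```python
import math


def kde_density(point, sample, bandwidth):
    """Gaussian kernel density estimate at `point`, up to a constant factor."""
    x, y = point
    total = 0.0
    for sx, sy in sample:
        r2 = ((x - sx) / bandwidth) ** 2 + ((y - sy) / bandwidth) ** 2
        total += math.exp(-0.5 * r2)
    return total / len(sample)


def coverage_fraction(training, targets, bandwidth=0.2, enclosed=0.90):
    """Fraction of `targets` inside the region holding `enclosed` of `training`.

    Both inputs are lists of (colour_1, colour_2) pairs.
    """
    train_density = sorted(kde_density(p, training, bandwidth) for p in training)
    # Density level exceeded by (at least) the requested share of the
    # training sources; this level traces the boundary of the region.
    level = train_density[int((1.0 - enclosed) * len(train_density))]
    inside = sum(1 for p in targets if kde_density(p, training, bandwidth) >= level)
    return inside / len(targets)


# Toy example: training colours on a compact grid, one target far outside.
toy_training = [(i * 0.1, j * 0.1) for i in range(-5, 6) for j in range(-5, 6)]
toy_targets = [(0.0, 0.0), (0.1, -0.1), (5.0, 5.0)]
toy_coverage = coverage_fraction(toy_training, toy_targets)
```

Repeating this calculation for each pair of colours would give one coverage fraction per colour-colour combination, which is the kind of information summarized in Table 4.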

TPZ estimated photo-z for 933 out of the 1031 sources. Most of the remaining 98 sources have missing photometry, that is, only SDSS bands are available, and therefore the algorithm cannot be properly trained to give a photometric redshift estimate. The distribution of the photometric redshifts for the 933 X-ATLAS X-ray sources, estimated by TPZ and taking into account the full PDF of each source, is shown in Fig. 10. Of the 933 AGN, 174 have available spectroscopic redshifts from the SDSS and GAMA surveys. In Fig. 11 we compare our photometric redshifts, estimated using TPZ, with the available spectroscopic redshifts. Table 5 presents the median error and the median confidence level, z_conf, of the photometric redshifts, calculated by TPZ as a function of the available photometric bands. The full catalogue with the estimated photometric redshifts is available at the CDS¹.
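A redshift distribution that takes into account the full PDF of each source can be obtained by summing the individual PDFs instead of histogramming point estimates. The sketch below assumes that all PDFs are sampled on a common redshift grid and that each one is normalized to unit sum before stacking; the function name is illustrative.

```python
def stacked_nz(pdfs):
    """Sum per-source PDFs, each normalized to unit sum, on a common grid."""
    nz = [0.0] * len(pdfs[0])
    for pdf in pdfs:
        total = sum(pdf)
        for k, p in enumerate(pdf):
            nz[k] += p / total
    return nz


# Two toy sources on a three-bin grid: one sharply peaked, one bimodal.
example_nz = stacked_nz([[0.0, 1.0, 0.0], [0.0, 1.0, 1.0]])
```

Each source contributes a total weight of one, so the stacked distribution sums to the number of sources, while a source with an uncertain redshift spreads its weight over several bins.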

To verify how many of the X-ATLAS sources are AGN (L_X > 10⁴² erg s⁻¹), we used the X-ray fluxes provided by the XMM-ATLAS catalogue (Ranalli et al. 2015) and the estimated photometric redshifts to calculate the X-ray luminosities. This information is available for 894 sources. Our calculations show that 883 of the sources have L_X > 10⁴² erg s⁻¹.
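The conversion from flux and redshift to luminosity can be sketched as L_X = 4π d_L² f_X, with d_L the luminosity distance. The cosmology (flat ΛCDM with H0 = 70 km s⁻¹ Mpc⁻¹ and Ω_m = 0.3), the absence of a K-correction, and the function names are assumptions made for this illustration; they are not stated in the text.

```python
import math

C_KM_S = 299792.458    # speed of light in km/s
MPC_IN_CM = 3.0857e24  # one megaparsec in centimetres


def luminosity_distance_mpc(z, h0=70.0, omega_m=0.3, steps=1000):
    """Luminosity distance in Mpc for a flat LambdaCDM model (assumed)."""
    omega_l = 1.0 - omega_m

    def inv_e(zp):
        return 1.0 / math.sqrt(omega_m * (1.0 + zp) ** 3 + omega_l)

    # Trapezoidal integration of dz / E(z) from 0 to z.
    dz = z / steps
    integral = 0.5 * (inv_e(0.0) + inv_e(z))
    integral += sum(inv_e(k * dz) for k in range(1, steps))
    integral *= dz
    return (1.0 + z) * (C_KM_S / h0) * integral


def log_lx(flux_cgs, z):
    """log10 of L_X in erg/s from a flux in erg/s/cm^2 (no K-correction)."""
    d_cm = luminosity_distance_mpc(z) * MPC_IN_CM
    return math.log10(4.0 * math.pi * d_cm ** 2 * flux_cgs)


def is_agn(flux_cgs, z, threshold=42.0):
    """Apply the log L_X > 42 criterion used in the text."""
    return log_lx(flux_cgs, z) > threshold


example_log_lx = log_lx(1e-14, 1.0)
```

Since the luminosity depends on the square of the luminosity distance, an outlier in photo-z translates directly into a wrong luminosity, which is one more reason to keep track of z_conf for each source.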

Table 5

Median error of the photometric redshifts and their median confidence level, estimated by TPZ, for each subsample of the X-ATLAS dataset based on the available photometry.

5. Summary and discussion

We presented a catalogue with photometric redshift estimates for 933 X-ray AGN in the X-ATLAS field. For the first time, we used the largest available X-ray sample to train a machine-learning technique (TPZ) and estimate photo-z for X-ray sources. Our analysis shows that our redshift estimates are accurate when optical photometry is combined with mid-IR photometry in the training process of the algorithm. When additional photometric bands (near-IR) are used, the precision of the photometric redshifts is further improved. Our photo-z estimates have a normalized absolute median deviation, σ_nmad ≈ 0.06, and the percentage of outliers is η = 10–14%, depending on whether the sources are extended or point like. These numbers significantly improve when a cut in the confidence level of the photometric redshift is applied (z_conf > 0.5–0.6).

Fig. 11

Comparison of the photometric redshifts estimated using TPZ with the spectroscopic redshifts from the SDSS and GAMA surveys for 174 of the 933 sources in the X-ATLAS field. The left panel shows the comparison for 55 extended sources and the right panel for 119 point-like sources. The median error of the photo-z varies from 0.19 to 0.26 and the median confidence level from 0.36 to 0.49, depending on the morphology of the source and the available photometric bands (Table 5). A significant fraction of outliers exists in the case of the point-like sources, even when seven or even ten photometric bands are used. This number can be greatly reduced when a cut is applied on the confidence level of the photometric redshift, as discussed in the text (z_conf > 0.6).

Valiante et al. (2016) and Bourne et al. (2016) presented a catalogue of 120 230 sources with identifications of optical counterparts to submm sources in Data Release 1 (DR1) of the H-ATLAS sample. The sources are located in three fields on the celestial equator, covering a total area of 161.6 deg², which was previously observed in the GAMA spectroscopic survey. The catalogue contains photometric redshifts (Smith et al. 2011) measured from the SDSS ugriz and UKIDSS YJHK photometry using the neural network technique of ANNz (Collister & Lahav 2004). Photometric redshifts have been estimated using a training sample constructed from spectroscopic redshifts from GAMA I, SDSS DR7, 2SLAQ (Cannon et al. 2006), AEGIS (Davis et al. 2007), and zCOSMOS (Lilly et al. 2009), covering redshifts z < 1. Of these sources, 5500 lie in the X-ATLAS region, and 3515 have a photometric redshift estimate using ANNz. Sixty-five of these sources are common between the two samples.

Figure 12 presents the redshift distribution of the 3515 sources (solid line) and that of the 65 common sources, based on our TPZ photo-z estimations (dashed line). The vast majority of the ANNz photo-z estimates are at z < 1 because of the galaxy training sample used for ANNz. In Fig. 13 we compare our photometric redshift estimates using TPZ with those using the ANNz method. Most of the discrepancy between the two photo-z estimates is located in the upper left part of the plot, that is, ANNz computes lower redshift values than we find from our TPZ measurements. Most of this difference is likely due to the different training sets used in the two methods. The training sample of ANNz was constructed to better suit their test sample, the vast majority of which consists of galaxies. Our training sample (Sect. 3.2) consists of X-ray AGN and extends to higher redshifts (up to z ~ 3.5; see Fig. 1). Our analysis has shown (Figs. 2, 3, and 9 and Table 4) that the coverage of our training set in feature space, that is to say, in colours, is also sufficient at high redshifts (z > 1). The result of this comparison is not an indication that ANNz generally performs more poorly than TPZ, but rather that our X-ray training set is probably better suited to these specific X-ray sources.

Fig. 12

Redshift distribution of the 3515 sources with ANNz photo-z estimates in the X-ATLAS field (solid line) and the N(z) using TPZ (normalized to the number of sources with ANNz estimates) of the 65 sources that also belong to our X-ray AGN sample. The redshift distribution of the photo-z estimated by ANNz peaks at low redshifts (z ~ 0.3), and very few sources have z > 1 (solid line). This is expected since ANNz has been trained to estimate photometric redshifts for galaxies. TPZ, in contrast, has been specifically trained to estimate photo-z for X-ray sources, and the resulting N(z) presents a second peak at z ~ 1.5 (dotted line).

Fig. 13

Comparison of our photometric redshifts estimated using TPZ and those estimated using ANNz (Smith et al. 2011) for the 65 common sources with our X-ATLAS X-ray AGN catalogue and the submm catalogue described in Valiante et al. (2016) and Bourne et al. (2016). Most of the discrepancy between the two photo-z estimates is located in the upper left part of the plot, i.e., ANNz computes lower redshift values than are obtained with our TPZ measurements. Most of this difference is likely due to the different training sets used in the two methods. The training sample of ANNz is constructed to better suit their test sample, the vast majority of which consists of galaxies. Our training sample (Sect. 3.2) consists of X-ray AGN (see text for more details).

Large-scale structure studies (e.g. weak lensing, gravitational waves, clustering) require accurate redshifts in their analysis. Georgakakis et al. (2014) examined the effect of the accuracy of photometric redshifts on the estimation of the correlation function in clustering measurements. They concluded that a σ ~ 0.04 (standard deviation of the photo-z) is required in photo-z estimations that are to be used to calculate the AGN correlation function in clustering studies. This accuracy is challenging to obtain, although Georgakakis et al. argued that the clustering signal can be recovered even if the normalized absolute median deviation is σ = 0.08, when the AGN/galaxy cross-correlation function is measured and the galaxy sample has very accurate photometric redshifts (σ ≈ 0.01). Their analysis takes the error of the photometric redshifts into consideration, but does not account for outliers. Even our best photometric redshift measurements (extended sources with ten available photometric bands) have a considerable percentage of outliers (~9–10%). Our preliminary results (Mountrichas et al., in prep.) indicate that the clustering signal can be recovered using photometric redshifts derived by TPZ when a cut is applied on the confidence level of the photometric redshift.
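As an illustration of the confidence cut discussed above, the following Python sketch computes the normalized median absolute deviation and the outlier fraction before and after a cut on z_conf. It is a minimal sketch under stated assumptions, not the code used in this work: the function name photoz_quality, the toy arrays, and the convention σ_nmad = 1.4826 × median(|Δz_norm|) with Δz_norm = (z_phot − z_spec)/(1 + z_spec) are assumptions made for the example. The outlier threshold |Δz_norm| > 0.15 follows the dashed lines of Fig. 5, and the cut value z_conf > 0.6 is the one quoted in the caption of Fig. 11.

```python
import numpy as np


def photoz_quality(z_spec, z_phot, z_conf=None, conf_cut=None,
                   outlier_threshold=0.15):
    """Return (sigma_nmad, outlier_fraction, n_used) for a photo-z sample.

    Assumed convention (not taken from the paper's code):
      dz_norm    = (z_phot - z_spec) / (1 + z_spec)
      sigma_nmad = 1.4826 * median(|dz_norm|)
      outliers   = sources with |dz_norm| > outlier_threshold
    """
    z_spec = np.asarray(z_spec, dtype=float)
    z_phot = np.asarray(z_phot, dtype=float)

    # Optionally keep only sources whose confidence exceeds the cut.
    keep = np.ones(z_spec.shape, dtype=bool)
    if conf_cut is not None:
        if z_conf is None:
            raise ValueError("z_conf is required when conf_cut is given")
        keep = np.asarray(z_conf, dtype=float) > conf_cut

    dz_norm = (z_phot[keep] - z_spec[keep]) / (1.0 + z_spec[keep])
    sigma_nmad = 1.4826 * np.median(np.abs(dz_norm))
    outlier_fraction = np.mean(np.abs(dz_norm) > outlier_threshold)
    return sigma_nmad, outlier_fraction, int(keep.sum())


# Toy values, invented purely for illustration.
z_spec = [0.5, 1.0, 1.5, 2.0]
z_phot = [0.53, 1.0, 1.4, 3.2]
z_conf = [0.9, 0.8, 0.7, 0.3]

all_stats = photoz_quality(z_spec, z_phot)
cut_stats = photoz_quality(z_spec, z_phot, z_conf, conf_cut=0.6)
print("no cut      :", all_stats)
print("z_conf > 0.6:", cut_stats)
```

In this toy sample the single catastrophic estimate also has the lowest confidence, so the cut removes it at the cost of one source. On real data the trade-off between completeness and outlier fraction would have to be evaluated on the spectroscopic test sample.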

The 3XMM catalogue is the largest available X-ray catalogue, containing about 470 000 unique sources covering a total area of 1000 deg² on the sky. XMMFITCAT-Z² (Corral et al. 2015) is a spectral fit database for 124 000 sources with good photon statistics in 3XMM. The potential of these catalogues will increase significantly with the addition of distance information for their sources. We will apply the analysis presented in this work to the 3XMM catalogue to estimate photometric redshifts for all the X-ray sources with at least optical photometry. In the 3XMM-DR5 catalogue, 42 697 sources have available SDSS photometry and 22 619 also have WISE counterparts. 3XMM-DR6 and the use of PanSTARRS in the southern sky will increase the number of available X-ray sources. The resulting X-ray catalogue will exceed any other current X-ray catalogue with available redshift information by an order of magnitude.

¹ And at http://xraygroup.astro.noa.gr/atlas/atlas-photoz-online.dat

² http://xraygroup.astro.noa.gr/Webpage-prodec/xmmfitcatz.html

Acknowledgments

The authors thank the anonymous referee for their careful reading of the paper and their constructive comments. The research leading to these results has received funding from the European Union’s Horizon 2020 Programme under the AHEAD project (grant agreement No. 654215). G.M. acknowledges financial support from the AHEAD project, which is funded by the European Union as Research and Innovation Action under Grant No: 654215. F.J.C. and A.C.R. acknowledge financial support through grant AYA2015-64346-C2-1-P (MINECO/FEDER). A.C.R. also acknowledges financial support by the European Space Agency (ESA) under the PRODEX program.

References

Albareti, F. D., Comparat, J., Gutiérrez, C. M., et al. 2015, MNRAS, 452, 4153
Baldry, I. K., Robotham, A. S. G., Hill, D. T., et al. 2010, MNRAS, 404, 86
Barcons, X., Carrera, F. J., Ceballos, M. T., et al. 2007, A&A, 476, 1191
Beck, R., Dobos, L., Budavári, T., Szalay, A. S., & Csabai, I. 2016, MNRAS, 460, 1371
Beck, R., Lin, C.-A., Ishida, E. E. O., et al. 2017, MNRAS, 468, 4323
Benitez, N. 2000, ApJ, 536, 571
Bourne, N., Dunne, L., Maddox, S. J., et al. 2016, MNRAS, 462, 1714
Brammer, G. B., van Dokkum, P. G., & Coppi, P. 2008, ApJ, 686, 1503
Brescia, M., Cavuoti, S., & Longo, G. 2015, MNRAS, 450, 3893
Brusa, M., Civano, F., Comastri, A., et al. 2010, ApJ, 716, 348
Cannon, R., Drinkwater, M., Edge, A., et al. 2006, MNRAS, 372, 425
Carrasco Kind, M., & Brunner, R. J. 2013, MNRAS, 432, 1483
Cavuoti, S., Amaro, V., Brescia, M., et al. 2017, MNRAS, 465, 1959
Collister, A. A., & Lahav, O. 2004, PASP, 116, 345
Corral, A., Georgantopoulos, I., Watson, M. G., et al. 2015, A&A, 576, A61
Dalton, G. B., Caldwell, M., Ward, A. K., et al. 2006, SPIE, 6269, 62690X
Davis, M., Guhathakurta, P., Konidaris, N. P., et al. 2007, ApJ, 660, L1
Della Ceca, R., Maccacaro, T., Caccianiga, A., et al. 2004, A&A, 428, 383
Driver, S. P., Hill, D. T., Kelvin, L. S., et al. 2011, MNRAS, 413, 971
Eales, S., Dunne, L., Clements, D., et al. 2010, PASP, 122, 499
Emerson, J., McPherson, A., & Sutherland, W. 2006, The Messenger, 126, 41
Esquej, P., Page, M., Carrera, F. J., et al. 2013, A&A, 557, 11
Georgakakis, A., Mountrichas, G., Salvato, M., et al. 2014, MNRAS, 443, 3327
Georgakakis, A., Salvato, M., Liu, Z., et al. 2017, MNRAS, 469, 3232
Hambly, N. C., Collins, R. S., Cross, N. J. G., et al. 2008, MNRAS, 384, 637
Hsu, L.-T., Salvato, M., Nandra, K., et al. 2014, ApJ, 796, 22
Irwin, M. J. 2008, in Processing Wide Field Imaging Data (Berlin Heidelberg: Springer-Verlag), 541
Lahav, O., & Collister, A. A. 2012, Astrophysics Source Code Library [record ascl:1209.009]
Lilly, S. J., Le Brun, V., Maier, C., et al. 2009, ApJS, 184, 218
Liu, Z., Merloni, A., Georgakakis, A., et al. 2016, MNRAS, 459, 1602
Menzel, M.-L., Merloni, A., Georgakakis, A., et al. 2016, MNRAS, 457, 110
Pineau, D. C. 2016, ArXiv e-prints [arXiv:1609.03457]
Ranalli, P., Georgantopoulos, I., Corral, A., et al. 2015, A&A, 577, 10
Rigby, E. E., Maddox, S. J., Dunne, L., et al. 2011, MNRAS, 415, 2336
Salvato, M., Hasinger, G., Ilbert, O., et al. 2009, ApJ, 690, 1250
Salvato, M., Ilbert, O., Hasinger, G., et al. 2011, ApJ, 742, 61
Simm, T., Saglia, R., Sabato, M., et al. 2015, A&A, 584, 22
Skrutskie, M. F., Cutri, R. M., Stiening, R., et al. 2006, AJ, 131, 1163
Smith, D. J. B., Dunne, L., Maddox, S. J., et al. 2011, MNRAS, 416, 857
Valiante, E., Smith, M. W. L., Eales, S., et al. 2016, MNRAS, 462, 3146
Wright, E. L., Eisenhardt, P. R. M., Mainzer, A. K., et al. 2010, AJ, 140, 1868

All Tables

Table 1

Number of X-ATLAS X-ray AGN divided based on their available photometry and optical morphology.

Table 2

Number of sources used to train TPZ, with the corresponding available photometry.

Table 3

Performance of the TPZ algorithm, estimated by splitting our spectroscopic sample (see Sect. 3.2) into train and test files.

Table 4

Fraction of the X-ATLAS sample that is well covered in all possible combinations of colours as well as in at least one colour-colour combination.

Table 5

Median error of the photometric redshifts and their median confidence level, estimated by TPZ, for each subsample of the X-ATLAS dataset based on the available photometry.

All Figures

Fig. 1

Redshift distribution of the 5157 sources used to train the TPZ algorithm (black solid line). The dashed and dotted lines present the redshift distribution when we split the training sources into extended and point like based on their optical classification.

Fig. 2

Point-like sources. Left: importance of attributes as a function of redshift. Right: RMS importance factor as a function of the attributes computed using the bias and its scatter.

Fig. 3

Same measurements as presented in Fig. 2, but for extended sources.

Fig. 4

Left: u−g vs. g−r colour distribution of the training sample (black circles) and the X-ATLAS sources (blue triangles). Right: z−W1 vs. J−H colour distribution of the training sample (black circles) and the X-ATLAS sources (blue triangles). The fraction of the X-ATLAS sources that is well covered by the training set is different for different colour combinations. This is quantified in Table 4.

Fig. 5

Performance of TPZ using the ten available photometric bands (SDSS+WISE+near-IR). The training sample has been split into train and test files to compare the estimated photometric redshifts with the spectroscopic redshifts of the sources. The dashed lines correspond to Δz_norm = ± 0.15. Based on our analysis, the number of outliers is η = 9% and η = 13%, for the extended and point-like sources, respectively. The normalized absolute median deviation is σ_nmad ≈ 0.04–0.05.

Fig. 6

Examples of PDFs produced by TPZ during the validation process. The top panels present results for extended sources and the bottom panels for point-like sources. In the left panels, the estimated photo-z (dotted line) is in agreement with the spectroscopic redshift (solid line) of the source. In the right panels, the estimated photo-z differs significantly from the spectroscopic redshift. These measurements are also characterized by a low confidence level of the photometric redshift.

Fig. 7

Left: r−i vs. u−g colour distribution of the training sample (black circles) and the outliers (blue triangles). Right: H−W1 vs. K−W1 colour distribution of the training sample (black circles) and the outliers (blue triangles).

Fig. 8

Distribution of z_conf for the extended (dashed line) and the point-like (dotted line) sources in our training sample.

Fig. 9

i−z vs. g−i colour space diagram. Black dots present the sources in our training sample. The black solid line defines the region of the colour space that contains 90% of the training sources as estimated by the KDE test. Green dots are the sources from the X-ATLAS sample inside the 90% region, and red crosses present the remaining X-ATLAS sources.

Fig. 10

Redshift distribution of the 933 X-ATLAS sources taking into account the full PDF of each source. Photo-z are estimated using the TPZ algorithm.

Fig. 11

Comparison of the photometric redshifts estimated using TPZ with the spectroscopic redshifts from the SDSS and GAMA surveys for the 174 of the 933 sources in the ATLAS field. The left panel shows the comparison for 55 extended sources and the right panel for 119 point-like sources. The median error of the photo-z varies from 0.19 to 0.26 and the median confidence level from 0.36 to 0.49, depending on the morphology of the source and the available photometric bands (Table 5). A significant fraction of outliers exists in the case of the point-like sources, even when seven or even ten photometric bands are used. This number can be greatly reduced when a cut is applied on the confidence level of the photometric redshift, as discussed in the text (z_conf > 0.6).
