Photometric redshifts from SDSS images using a Convolutional Neural Network

We developed a Deep Convolutional Neural Network (CNN), used as a classifier, to estimate photometric redshifts and associated probability distribution functions (PDF) for galaxies in the Main Galaxy Sample of the Sloan Digital Sky Survey at z<0.4. Our method exploits all the information present in the images without any feature extraction. The input data consist of 64x64 pixel ugriz images centered on the spectroscopic targets, plus the galactic reddening value on the line-of-sight. For training sets of 100k objects or more ($\geq$ 20% of the database), we reach a dispersion $\sigma_{MAD}$<0.01, significantly lower than the current best one obtained from another machine learning technique on the same sample. The bias is lower than 0.0001, independent of photometric redshift. The PDFs are shown to have very good predictive power. We also find that the CNN redshifts are unbiased with respect to galaxy inclination, and that $\sigma_{MAD}$ decreases with the signal-to-noise ratio (SNR), achieving values below 0.007 for SNR>100, as in the deep stacked region of Stripe 82. We argue that for most galaxies the precision is limited by the SNR of SDSS images rather than by the method. The success of this experiment at low redshift opens promising perspectives for upcoming surveys.


Introduction
Panoramic imaging surveys for cosmology are underway or in preparation phase (HSC, LSST, Euclid, WFIRST). They will deliver multi-band photometry for billions of galaxies for which reliable redshifts are necessary to study the large scale structure of the universe and to constrain the dark energy equation-ofstate using weak gravitational lensing. However spectroscopic redshifts are extremely time-intensive and it has become a necessity to use photometric redshifts. The projections for cosmic shear measurements estimate that the true mean redshift of objects in each photometric redshift bin must be known to better than ∼0.002(1+z) (Knox et al. 2006) with stringent requirements on the fraction of unconstrained catastrophic outliers (Hearin et al. 2010). Another challenge is the derivation of robust redshift probability distribution functions (PDFs, Mandelbaum et al. 2008) for a complete understanding of the uncertainties attached to any measurements in cosmology (e.g. galaxy clustering, weak lensing tomography, baryon acoustic oscillations) or galaxy evolution (e.g. luminosity and stellar mass functions, galaxy density field reconstruction, cluster finders).
Two main techniques are traditionally used to perform this task: template fitting and machine learning algorithms. The template fitting codes (e.g., Arnouts et al. 1999;Benítez 2000;Brammer et al. 2008) match the broad band photometry of a galaxy to the synthetic magnitudes of a suite of templates across a large redshift interval 1 . This technique does not require a large spectroscopic sample for training: when a representative set of galaxy template has been found it can be applied to different sur-veys and redshift range. However, they are often computationally intensive due to the brute-force approach to explore the pregenerated grid of model photometry. Moreover poorly known parameters, such as dust attenuation, can lead to degeneracies in color − redshift space. On the other hand, the machine learning methods, such as artificial neural network (Collister & Lahav 2004), k−nearest neighbors ( KNN Csabai et al. 2007) or random forest techniques (Carliles et al. 2010) were shown to have similar or better performances when a large spectroscopic training set is available. However, they are only reliable within the limits of the training set and the current lack of spectroscopic coverage in some color space regions and at high redshift remains a major issue for this approach. For these reasons, template-fitting methods are still considered and new approaches have emerged which combine several techniques to maximize the photometric redshift PDF estimations (e.g., Carrasco Kind & Brunner 2014; Cavuoti et al. 2017).
One limiting factor of all the above methods is the information used as input. The accuracy of the output photometric redshifts is limited by that of the photometric measurements (Hildebrandt et al. 2012). Magnitudes or colors are affected by the choice of aperture size, Point Spread Function (PSF) variations, overlapping sources, and even modeled magnitudes (accounting for PSF and galaxy luminosity profiles) capture only a fraction of the information present in the images.
Over the past few years, Deep Learning techniques have revolutionized the field of image recognition. By bypassing the condensed information of manual feature extraction required by previous methods they can offer unprecedented image classification performance in a number of astronomical problems, including galaxy morphological classification (e.g., Dieleman et al. 2015;Huertas-Company et al. 2015), lensed images (Lanusse et al. Article number, page 1 of 14 arXiv:1806.06607v2 [astro-ph.IM] 19 Jun 2018 A&A proofs: manuscript no. aanda 2018), classification of light curves (Charnock & Moss 2017;Pasquet-Itam & Pasquet 2017). Thanks to the speed boost from Graphic Processing Units (GPU) technology and large galaxy spectroscopic sample such as the SDSS survey, Hoyle (2016); D' Isanto & Polsterer (2018) showed that Deep Convolutional neural network (CNN) were able to provide accurate phototometric redshifts from multichannel images, instead of extracted features, taking advantage of all the information contained in the pixels, such as galaxy surface brightness and size, disk inclination, or the presence of color gradients and neighbors. To do so, Hoyle (2016) used a Deep CNN inspired by the architecture of Krizhevsky et al. (2012a), on 60×60 RGBA images, encoding colors (i − z, r − i, g − r) in RGB layers and r band magnitudes in the alpha layer. The use of only four bands (griz) allowed them to mimic the DES experiment. They divided the spectroscopic redshift distribution into bins and extracted the redshift bin that the galaxy was most likely to be in. To obtain a PDF, they then randomly extracted 100 60×60 stamps from an original 72 ×72 image stamp, and rerun the CNN algorithm. D'Isanto & Polsterer (2018) used a Deep CNN model, based on a LeNet-5 architecture (LeCun et al. 1998), using 28×28 pixel images in the 5 SDSS bands (ugriz) as well as all the color combinations as input. They modified the fully connected part of the CNN to include a Mixture Density Network (with one hidden layer) to describe the output PDF as a Gaussian mixture model. These two first-of-their-kind analyses based on existing architectures, achieved competitive photometric redshift accuracies compared to other machine learning techniques based on boosted decision tree or random forest.
In this paper, we present a Deep Learning model for the estimation of photometric redshifts and their associated PDF using the TensorFlow framework. In contrast to previous studies, our input consists of ugriz images only (no color images are produced), from the flux-limited spectroscopic Main Galaxy Sample (MGS) of the SDSS. The paper is organized as follows. In Section 2, we describe the data used in this study. In Section 3, we introduce the CNN concepts and the particular CNN architecture we are proposing. In Section 4, we present our results for the estimation of photometric redshifts and associated PDFs. In Section 5, we investigate the impact of reducing the size of the training set. In Section 6, we analyze the behavior of the CNN with respect to a number of galaxy properties and observing conditions. Our results are summarized in Section 7.

Data
The SDSS is a multi-band imaging and spectroscopic redshift survey using a dedicated 2.5-meter telescope at Apache Point Observatory in New Mexico. It provides deep photometry (r < 22.5) in ugriz passbands. Our input data are selected from the data release 12 (DR12, Alam et al. 2015) by using the SDSS CasJob website (the MySQL query is given in Appendix A). From the SDSS database, we retrieved 516,525 sources classified as galaxy, with dereddened petrosian magnitudes r ≤ 17.8 and spectroscopic redshifts z ≤ 0.4. For all we queried the equatorial coordinates (RA, Dec), dereddened Petrosian magnitudes, ellipticities (b/a), galactic extinction (Schlegel et al. 1998), PSF Full-Widths-at-Half-Maximum (FWHMs) and sky background values in all bands. The spatial distribution of the final sample and its redshift distribution are shown in Figure 1. The color code indicates the mean galactic reddening excess in each cell, which increases sharply in the vicinity of the galactic plane.
We also retrieved the photometric redshifts of Beck et al. (2016, hereafter B16), which are the only such redshifts avail-able for comparison in DR12. They were computed with a k-NN method (Csabai et al. 2007, local linear regression) that included five dimensions (r magnitude and u − g, g − r, r − i, i − z colors) and trained with deep and high redshift spectroscopic surveys in addition to the SDSS. These photometric redshifts have similar or better accuracies than those inferred from random forests or prediction trees on the MGS sample (Carrasco Kind & Brunner 2013;Carliles et al. 2010), and may thus serve as reference for machine learning photometric redshifts based on photometric measurements.

Image data
We downloaded all the "corrected frames" image data of the SDSS data release 8 from the SDSS Science Archive Server, including the 118 runs on Stripe 82 that are part of the release (Aihara et al. 2011a). The image headers include the astrometric fix applied to the SDSS data release 9 (Aihara et al. 2011b). The frames come background-subtracted and photometrically calibrated with an identical magnitude zero-point (Padmanabhan et al. 2008;Blanton et al. 2011). All 4,689,180 frames and their celestial coordinates are indexed in a table for query with a homemade query tool 2 . The purpose of this tool is to provide the list of all images likely to overlap a given sky area. For every entry in the galaxy sample we query the list of overlapping frames for each of the 5 SDSS filters and run the SWarp tool 3 (Bertin et al. 2002) to resample to a common pixel grid and stack all the available image data. We rely on the WCS parameters in the input image headers (Calabretta & Greisen 2002) for the astrometry. The result is a 64×64×5 pixel datacube in a gnomonic projection centered on the galaxy coordinates, and aligned with the local celestial coordinate system. The output pixel scale is identical to that of the input images (0.396 arcsec). We choose a Lánczos-3 resampling kernel (Duchon 1979) as a compromise between preserving image quality and minimizing the spread of possible image artifacts. Other SWarp settings are also set to default, except for background subtraction, which is turned off.
The number of overlapping frames contributing to a given output pixel in a given filter ranges from 1 or 2 for "regular" SDSS images, to 64 for some of the galaxies in Stripe 82.
No attempt is made to remove singular images, to mask out neighbors or to subtract light contamination from close bright sources, hence all galaxy images are included in the final dataset unmodified.
Machine learning algorithms dealing with pixel data are generally trained with gamma-compressed images (e.g., in JPEG or PNG format). Yet our tests did not show any improvement of the classifier performance after applying dynamic range compression. We therefore decided to leave the dynamic range of the images intact. Fig. 2 shows a few cutouts from the dataset.

Neural networks
Artificial Neural Networks (NNs) are made of interconnected, elementary computing units called neurons. As in biology where a natural neuron is an electrically excitable cell that processes and transmits information via synapses connected to other cells, an artificial neuron receives a vector of inputs, each with a different weight, and computes a scalar value sent to other neurons. Once the type of neuron and the network topology have been defined, the network behavior relies solely on the strength of the connections (synaptic weights) between neurons. The purpose of learning (or training) is to iteratively adjust the synaptic weights until the system behaves as intended.
Multilayered neural network models are currently the basis of the most powerful machine learning algorithms. In this type of NN, neurons are organized in layers. Neurons from a layer are connected to all or some of the neurons from the previous layer.
Convolutional Neural Networks (CNNs) are a special type of multilayered NNs. CNN classifiers are typically composed of a number of convolutional and pooling layers followed by fully connected layers.

Convolutional layers
A convolutional layer operates on a data cube, and computes one or several feature maps, also stored as a datacube. For the first convolution layer the input datacube is typically a multispectral image (64 × 64 × 5 pixels in the case of our images taken through the ugriz SDSS filters). Subsequent layers operate on feature maps from the previous layers.
The feature maps inside a convolutional layer are generally computed independently of one another. In our CNN, any given feature map relies on a set of adjustable convolution kernels, each of which is applied to the corresponding input data plane separately (with a stride of 1). The convolved arrays plus an adjustable offset are then summed and an activation function applied element-wise to form the feature map.
Non-linear activation functions introduce non-linearity into the network. The most commonly used activation function is the ReLU (Rectified Linear Unit, Nair & Hinton 2010) defined by f (x) = max(x, 0). More recently, He et al. (2015) introduced a Parametric ReLU (PReLU) with parameter α adjusted through basic gradient descent: (1)

Pooling layers
Pooling layers reduce the size of input feature maps through down-sampling along the spatial dimensions (2 × 2 elements in our case). The down-sampling operation typically consists in taking the mean or the max of the elements. Pooling allows feature maps to be computed on multiple scales at a reduced computational cost while providing some degree of shift invariance.

Fully connected layers
Contrary to their convolution layer equivalents, the neurons of a fully connected layer are connected to every neuron of the previous layer, and do not share synaptic weights. In conventional CNNs, the fully connected layers are in charge of further processing the features extracted by the convolutional layers upstream, for classification or regression.

Output layer
We chose to handle the estimation of photometric redshifts as a classification problem, as opposed to using (non-linear) regression. Previous studies (e.g., Pasquet-Itam & Pasquet 2017) demonstrated the benefits of this approach. Each class corresponds to a narrow redshift bin δz. The galaxy at the center of the input multispectral image belongs to a single class (i.e. is at a single redshift). The classes being mutually exclusive, we impose that the sum of the outputs be 1 by applying the softmax activation function (Bridle 1990) to the output layer. The network is trained with the cross-entropy loss function (Baum & Wilczek 1987;Solla et al. 1988). The output of this type of classifier was shown, both theoretically and experimentally, to provide good estimates of the posterior probability of classes in the input space (Richard & Lippmann 1991;Rojas 1996). This of course requires the neural network to be properly trained and to have enough complexity.

Our CNN architecture
The overall architecture 4 is represented in Figure 4. The network takes as input a batch of images of size 64 × 64, centered on the galaxy coordinates, in five channels corresponding to the five bands (u, g, r, i, z). The first layer performs a convolution with a large kernel of size 5 × 5. The pooling layer that follows reduces the size of the image by a factor of 2. All the poolings in the network are computed by selecting the mean value in the sliding window. Although many studies use max pooling for classification problems, our experience with SDSS images is that average pooling performs better. The fact that most image pixels are dominated by background noise may explain why max pooling does not work as reliably.
We also found the PReLU activation function to perform better than the traditional ReLU in convolutional layers. One possible explanation may be that the negative part of the PReLU does Apply activation function Fig. 3. Representation of the first convolution layer with two kernels of convolution to simplify the schema. The multi-spectral image of a galaxy in the five SDSS filters is fed to the network. Each kernel is convolved with the multi-channel images. The convolved images are summed, an additive bias is added and a non-linear function is applied, to produce one feature map.
not saturate, allowing faint signals below the threshold (e.g., those dominated by background noise) to propagate throughout the layer.
The remaining convolution part of the network is organized in multi-scale blocks called inception modules . Each inception module is organized in two stages. In the first stage, the feature maps are convolved by three "1 × 1" convolution layers. These layers are used to combine input feature maps and reduce their number before the more computationally expensive convolutions of the second stage. In this second stage, feature maps are processed in parallel in a pooling layer and a pair of larger convolution layers with size 3 × 3 and 5 × 5 to help identify larger patterns. Note that the last inception module does not include the 5 × 5 convolution layer as the last feature maps (8 × 8) have become too small to generate a useful output.
All the feature maps coming out from the last multi-scale block are concatenated and sent to a fully connected layer of 1096 neurons. One extra input containing the Galactic reddening value for the current galaxy is also added at this stage (Section 6.1).
After a second fully connected layer of 1096 neurons comes the output softmax layer. We set the number of classes to N c = 180 redshift bins over the range 0 -0.4, with constant width δz = 2.2 × 10 −3 . We believe that this sampling offers a reasonable compromise between the number of galaxies in each bin and redshift quantization noise. Such a large number of classes is not uncommon in modern classification challenges (e.g., Russakovsky et al. 2015). As we shall see, the resulting distribution (a vector of 180 posterior probabilities) provides a reliable estimate of the photometric redshift PDF.
The number and sizes of the feature maps are provided in Table A1. Appendix B also details the computational time needed for the various training steps.
In total the number of adjustable parameters is 27,391,464. A common concern with large neural networks is overfitting, which happens whenever a model with too many free parameters learns irrelevant and noisy details of the training data to the extent that it negatively impacts its performance. To make sure that our classifier was not overfitting, we monitored the loss on both the training and validation sets during the training stage. The PIT distribution (Section 4.3) also proved a valuable monitoring tool. Other methods we tested for addressing overfitting are batch normalization (Ioffe & Szegedy 2015) and dropouts (Srivastava et al. 2014). However they did not improve the performance of our classifier.

Photometric redshift estimation
Although photometric redshift PDFs may be directly incorporated into larger Bayesian schemes for inferring model parameters (e.g., cosmological parameters), they may also be used to compute point estimates of the individual redshifts. We estimate the photometric redshift of a given galaxy by computing its expected value from its PDF P(z k ): where z k is the midpoint value of the kth redshift bin. Assuming that P(z k ) is reliable over the whole redshift range (see section 4.3), Eq. 2 provides a minimum mean square error estimation of the photometric redshift given the data.  Fig. 4. Classifier architecture (see Section 3.6 for a detailed description). The neural network is composed of a first convolution layer, a pooling layer and five inception blocks followed by two fully connected layers and a softmax layer. Further details can be found Table A1. Alternatively, we tested non-linear regression models with a multilayered neural network to estimate photometric redshifts directly from the softmax layer PDF output, using both a quadratic and Median Absolute Deviation (MAD) cost functions (see Section 4.1). After training with the spectroscopic redshifts, both models performed almost identically and did not provide any significant accuracy improvement over Eq. 2.

Experimental protocol
We divided the database into a training sample containing 80% of the images and a testing sample composed of the remaining 20%. To ensure that the CNN is not affected by galaxy orientation, we augmented the training set with randomly flipped and rotated (with 90 deg steps) images. We also selected 20,000 images in the training database to create a validation sample that allows us to control the performance of the model.
To increase the performance, we trained an ensemble of classifiers as it was shown to be more accurate than individual classifiers (e.g. Krizhevsky et al. 2012b). Moreover the generalization ability of an ensemble is usually stronger than that of base learners. This step involves training N times one model with the same training database but a different initialization of the weights. As a compromise between computation time and accuracy, we chose N = 6. The individual decisions were then averaged out to obtain the final values of the photometric redshifts. We also averaged the PDFs, although we are aware that the result is a pessimistic estimate of the true probability distribution function. Other combination methods will be investigated in a future analysis. In the following sections, the terms photometric redshift and PDF refer to averages over the 6 trainings. We also processed five crossvalidations of the database to evaluate the stability of the network. This involves performing five learning phases, each with its own training sample (80% of the initial database) and testing sample (the remaining 20%), so as to test each galaxy once. Table 1 shows the performance of each cross-validation, using the metrics (bias, σ MAD and η) defined in Section 4.1. The bias varies slightly but the standard deviation and the fraction of outliers do not change significantly. Therefore we find the Article number, page 5 of 14 network to be statistically robust. In the following, the quoted values for the bias, standard deviation and fraction of outliers are the average values over 5 cross-validations, unless otherwise stated. Figure 5 summaries the protocol of the training process.

Results
In this section, we present the overall performance of our method for the estimation of photometric redshifts, and test the statistical reliability of the PDFs.

Metrics
To assess the quality of the photometric redshifts, we adopt the following commonly used statistics: the residuals, ∆z = (z phot − z spec )/(1 + z spec ), following the definition of Cohen et al. (2000); the prediction bias, < ∆z >, defined as the mean of the residuals; the MAD deviation σ MAD = 1.4826 × MAD, where MAD (Median Absolute Deviation) is the median of |∆z − Median(∆z)| 5 ; the fraction η of outliers with |∆z| > 0.05, chosen to be ∼5 times the σ MAD achieved by the CNN. 5 Our definition of σ MAD differs from the modified version adopted by Ilbert et al. (2006), in which the median residual is not subtracted.
The statistics for the network described in the previous section are reported in Table 1, as well as in the first row of Table 2 (mean values over the 5 cross-validations). The normalized distribution of ∆z is shown in Figure 6. The green line is a Gaussian distribution with the inferred sigma and bias. The hatched zones define the catastrophic outliers.

Photometric redshifts
The distribution of the photometric vs spectroscopic redshifts obtained from the CNN (Fig. 7, left panel) shows a striking decrease of the dispersion around the truth value compared to B16 (right panel), the latest and only comparable study. This is reflected in the scores for the three statistics: a factor of 1.5, 6 and 4 improvement for the σ MAD , bias and outlier rate respectively (the B16 values are listed in parenthesis in Table 2). However we observe a plateau near z∼0.3 for the CNN, where high redshift objects are significantly under-represented (galaxies with z∼0.3 represent 0.1% of the training database) 6 . This trend is not observed in B16 as they use a larger training set that extend to much higher redshift. Figure 8 shows the bias as a function of both spectroscopic redshift and photometric redshift, with the corresponding redshift distributions and the B16 results for comparison. The photometric redshifts predicted by the CNN tend to be slightly over(under)-estimated below(above) the median redshift of the training sample, i.e. biased towards the most highly populated redshift bins. However the bias remains small and never exceeds 1σ. It is also significantly smaller than the bias induced by the B16 method. Most importantly, none is found as a function of photometric redshifts. This is particularly noteworthy for future large missions. The photometric redshift requirement for Euclid is ∆z ≤ 0.002 in photometric redfshift bins (yellow shaded zone), well achieved by the present method.
To understand the origin of the spectroscopic redshift dependent bias, one must realize that neural networks are naturally sensitive to priors in the training sample. In the case of our classifier, the PDF should be a good estimation of the redshift posterior probability density, given an input datacube x (Section 3.5). In virtue of the Bayes rule, and in the continuous limit, the PDF writes: where p(z) is the normalized redshift distribution of input galaxies in the training set, and p(x|z) the normalized distribution of galaxies with true redshift z in multispectral image space. The net effect of p(z) on the estimated photometric redshift z phot = E[z|x] is to "push" z phot in the direction of increasing p(z). The detailed analysis of redshift-dependent biases will be the subject of a future paper.

Probability Distribution Functions
The PDFs of a subset of randomly selected galaxies from the test set are shown in Fig. 9, together with the galaxy RGB images. As proposed by Polsterer et al. (2016), we use two statistics to 6 A preliminary test showed that the size of the training batches of images, progressively fed to the CNN, may be at least partly responsible for this plateau: doubling the size improved the prediction at the highest redshifts and moved the plateau upward. Better balancing the redshifts inside each batch may be necessary. This point will be addressed in a future analysis. The "suspect zone" (SZ) was identified as a small region of the SDSS with above average bias (see Section 6.4). It has been removed by default in all other cases below the "SZ removed" line.
assess the overall prediction quality of our PDFs: the Probability Integral Transform (PIT) and the Continuous Ranked Probability Score (CRPS).
The PIT statistic (Dawid 1984) is based on the histogram of the cumulative probabilities (CDF) at the true value, i.e. the spectroscopic redshift. For galaxy i at spectroscopic redshift z i , with redshift probability distribution function PDF i , the PIT value is: A flat PIT distribution indicates that the PDFs are not biased with respect to the spectroscopic redshifts and are neither too narrow nor too wide, whereas convex or concave distributions point to under or over-dispersed PDFs, respectively (Polsterer et al. 2016). Excessively narrow PDFs will often miss the target, overproducing PIT values close to 0 or 1, whereas PDFs that are too wide will encompass the true redshifts more often than not and therefore favor intermediate PIT values. The PIT distribution for each of the six models in our ensemble of classifiers, and for the final PDFs (see Section 3.8) are shown in Figure 10. Each individual model exhibits a nearly flat PIT distribution, indicating well behaved probability distribution functions. The PIT distri-  Fig. 8. Mean residuals as a function of spectroscopic redshifts (gray) and CNN redshifts (red). The dashed lines show the corresponding results for B16. The histograms are the respective redshift distributions. The CNN redshifts tend to be slightly over(under)-estimated below(above) the most populated redshift bins of the training sample, i.e. are biased towards the region of highest training. The effect is larger for B16, however no bias is found as a function of photometric redshift in either case. bution of the final (averaged) PDFs is slightly over dispersed, as expected from our pessimistic choice of combination.
The CRPS is a performance score (well known in meteorological predictions, Hersbach 2000) that quantifies how well the predicted PDF represents the true spectroscopic redshift. For galaxy i, it is defined as: The mean CRPS (∼ 0.007) is reported in the last column of Table 2. This value is significantly lower than the CRPS quoted by D' Isanto & Polsterer (2018) or Tanaka et al. (2018) (∼ 0.1 and 0.02 respectively), although a fair comparison is difficult as these studies encompass larger redshift domains. However, the small mean CRPS and the nearly flat PIT distribution reflect the high reliability of our PDFs. To further assess the quality of the PDFs, we measure their "widths", defined as the redshift interval resulting from chopping off their left and right wings in equal measure, so as to keep 68% of the probability distribution 7 . Removing the widest 10% (20%) of the PDFs (width > 0.0383 (0.0335)) significantly improve σ MAD and η, as reported in Table 2. These improvements reinforce our confidence that our PDFs properly reflect the photometric redshift uncertainties.

Size of the training database
As acquiring large spectroscopic samples for training is very observing time intensive, it is crucially important to assess the performance of our CNN as a function of training size. Our baseline network made use of 400,000 galaxies (80% of the database). We trained the same model on 250,000 and 100,000 galaxies (50% and 20% of the database respectively), and also adapted the network by reducing its depth and width for a 10,000 galaxy training sample (2% of the database).
The statistics of these three trials are reported in Table 2. We find practically no fall in performance between the training on 400,000 objects and that on 100,000, which is a particularly encouraging result. Although the global statistics deteriorate significantly with 10,000 training sources, they remain comparable to the results of B16, which is also remarkable. Moreover, for all 3 trials including the 10,000 sources training sample, all the trends, or lack thereof, plotted in the next section remain nearly indistinguishable from our baseline 400,000 sources training case.

Further behavioral analysis of the CNN
In this section, we study how the performance of the CNN varies with a number of characteristics of the input images relating to galaxy properties or observing conditions.

Galactic reddening
As illustrated in Fig. 1, the SDSS spans a large range in Galactic extinction. The effect of E(B − V) on the observed colors of a galaxy can mimic that of a redshift increase. The impact of including the reddening information into the training, or not, is illustrated on Fig. 11 (left panel). Without this information provided to the classifier, a strong reddening-dependent bias is observed (orange line). A weaker trend is observed for B16, who use SDSS de-reddened magnitudes. The small size of the sample at high galactic extinction most likely prevents the CNN from properly accounting for this degeneracy, hence our choice to include the reddening information into our model, which successfully removes the trend (red line).

Galaxy inclination
As Galactic reddening, the inclination of a galaxy reddens its color but also affects the shape of the attenuation curve in a complex way . It has been a major issue for the estimation of photometric redshifts, especially with SED fitting codes (Arnouts et al. 2013). Figure 11 (right panel) shows that the CNN is very robust to galaxy inclination, unlike the B16 method, which is strongly biased, especially at high inclination. The network is able to account for this effect thanks to the large training sample, helped by the data augmentation process that further expanded it by rotating and flipping the images. While B16 only uses the photometric information, Yip et al. (2011) showed that machine learning methods can better handle this bias if the inclination information is included into their training.

Neighboring galaxies
Using the SDSS neighbors catalog, we investigated the performance of the CNN on crowded images (i.e. with at least one neighbor within 17 arcsec of the spectroscopic target). No statistical effect was found on the photometric redshift accuracy of the central targets. A few examples can be seen in Fig. 9. Figure 12 shows the spatial variations of the bias and of the PDF widths on the celestial sphere (the color code refers to the mean quantities per cell). Overall, both quantities, which we find to be uncorrelated, show little variation throughout the surveyed area. However we identified a small region of the SDSS (∼ 2.4%) where both are puzzlingly below average in quality (red patch towards the pole).

Variations throughout the surveyed area
This "suspect zone" (SZ) appears to coincide with Stripe 38 and 39, but with no evidence of sky background, PSF or photometric calibration issues. Although it is in a region of high Galactic extinction, excess reddening doesn't seem to cause the problem as it is not detected in the other regions of equally high galactic extinction (Fig. 1). The bias on this patch alone is ∼30 times larger than on the full test sample (+0.0033 versus +0.0001), and also ∼10 times larger for B16. Removing the region from the test sample reduces the bias by a factor of 2.5, while σ MAD is unaffected (see Table 2).
Also noteworthy is the Stripe 82 region, which exhibits narrower than average PDFs (dark blue stripe in the right panel of Fig. 12). This point is addressed in the next section. Figure 13 shows the mean PDF in these two atypical regions compared to that of the full sample.

Effect of noise
The Stripe 82 region, which combines repeated observations of the same part of the sky, gives us the opportunity to look into the impact of Signal-to-Noise Ratio (SNR) on our photometric redshift estimations. The statistics for this region alone are reported in Table 2. The resulting σ MAD outperforms that of the other tests (and can be further reduced by removing the widest PDFs). Thus increasing the SNR improves the performance of the classifier, even though the training was predominantly done using images with lower SNRs.
We further tested the impact of SNR by training the same model (Fig. 4) on two different datasets: one with Stripe 82 images only, one without Stripe 82. The statistics are reported in Table 2. Removing Stripe 82 from the training set has no impact on the performance of the network outside of Stripe 82, unsurprisingly given the small perturbation it induces, and only slightly degrades σ MAD on Stripe 82. This confirms that a CNN network mostly trained on low SNR images performs better on higher SNR images.
Evaluating whether training on high SNR images improves the performance on low and/or high SNR images is more difficult. Training the network on Stripe 82 images reduces the training set to only ∼ 16, 000 galaxies, a very small sample for Deep Learning. The testing on Stripe 82 shows that σ MAD is slightly higher and the bias lower, compared to training with the full dataset: a better match between the training and test sets may be compensating for the reduced training size. The performance outside of Stripe 82 is degraded; this may be due to mismatched datasets (Stripe 82 images are too deep for the CNN to be able to classify shallow images well enough) and/or to the small size of the training sample. We can only conclude that small training sets with higher SNR images do not help the performance on low SNR images.
A more detailed analysis of this effect is presented in Fig. C.1, which shows the behavior of the redshift bias and σ MAD as a function of SNR in all 5 bands. The SNRs were derived simply from the Petrosian magnitude errors quoted in the SDSS catalog. Stripe 82 galaxies were removed from this analysis, as the SDSS DR12 catalog does not provide photometric errors for the stacked images used in this work (see Section 2.1). As may be expected from the results on Stripe 82, σ MAD gradually improves towards the highest SNR in all bands, with values lower than ∼0.006 for SNR≥200 (∼17,000 sources).
σ MAD seems to plateau at the low-end of SNRs, an effect that is not seen with B16. This is perhaps surprising but good news for the faintest sources, although it must be taken with a grain of salt as our SNR derived from Petrosian magnitude errors may be unreliable for faint objects. The redshift bias shows no clear trend within the uncertainties at low SNR but increases at high SNR. As high SNR objects are preferentially at low redshift (z ≤ 0.1), it probably simply reflects the bias discussed in Section 4.2, where galaxies with spectroscopic redshifts below the peak of the training set distribution have their photometric redshifts slightly over-estimated.
6.6. Influence of the PSF As Fig. C.1 shows, σ MAD appears to be relatively immune to PSF variations, with only a slight increase for the worst observing conditions in individual bands (middle panel) or in case of large PSF variations between 2 bands (right panel). On the other hand, the redshift bias shows some trend with seeing (middle and right panels), similar to what is seen in B16, but with opposite signs. Larger PSFs generate an apparent decrease of the apparent surface brightness of galaxies that are not well resolved. Note that SDSS observations are carried out through the different filters within a few minutes of interval, and therefore under very similar atmospheric conditions. This situation is likely to be worse for imaging surveys where the different channels are acquired during separate nights or even runs. Such datasets may require PSF information to be explicitly provided as input to the classifier in addition to the pixel and extinction data.

Summary and Discussion
In this paper we have presented a Deep Convolutional Neural Network (CNN) used as a classifier, that we trained and tested on the Main Galaxy Sample of the SDSS at z ≤ 0.4, to estimate photometric redshifts and their associated PDFs. Our challenge was to exploit all the information present in the images without relying on pre-extracted image or spectral features. The input data consisted of 64×64 pixel ugriz images centered on the spectroscopic target coordinates, and the value of galactic  reddening on the line-of-sight. We tested 4 sizes of training set: 400k, 250k, 100k and 10k galaxies (80%, 50%, 20% and 2% of the full database, respectively).
In all cases but the last, we obtain a MAD dispersion σ MAD = 0.0091. This value is significantly lower than the best one published so far, obtained from another machine learning technique (KNN) applied to photometric measurements by Beck et al. (2016) for the same galaxies (σ MAD = 0.0135). Restricting the training set to only 10,000 sources (although the CNN was not optimized for such a small number) increases dispersion by 60%, but is still competitive with the current state-of-the-art.
The bias shows a quasi-monotonic trend with spectroscopic redshift, largely due to the prior imposed by the training set redshift distribution, as expected from PDFs behaving as posterior probabilities. However, the bias is independent of photometric redshift and lower than 10 −4 at z ≤ 0.25, far below the 0.002 value required for the scientific goals of the future Euclid mission.
We also find that: 1/ our photometric redshifts are essentially unbiased with respect to galactic extinction and galaxy inclination; 2/ the PDFs have very good predictive power, with a nearly flat distribution of the Probability Integral Transforms (PIT). Removing the widest PDFs improves the already small σ MAD and fraction of outliers; 3/ σ MAD decreases with the signal-to-noise ratio (SNR), achieving values below 0.007 for SNR > 100, as in the deep stacked region of Stripe 82; 4/ Variations of the PSF FWHM induce a small but measurable amount of systematics on the estimated redshifts, which prompts for the inclusion of PSF information into future versions of the classifier.   Table A1. Characteristics of each layer of the CNN architecture: name of the layer, input layer, size of the convolution kernel (in pixels), size (height×width in pixels) and number of the resulting feature maps.