A&A, Volume 666, October 2022
Article Number: A87
Number of page(s): 10
Section: Numerical methods and codes
DOI: https://doi.org/10.1051/0004-6361/202243135
Published online: 13 October 2022
Photometric redshift-aided classification using ensemble learning
1 Faculdade de Ciências da Universidade do Porto, Rua do Campo de Alegre, 4150-007 Porto, Portugal
e-mail: pedro.cunha@astro.up.pt
2 Instituto de Astrofísica e Ciências do Espaço, University of Porto, CAUP, Rua das Estrelas, 4150-762 Porto, Portugal
e-mail: andrew.humphrey@astro.up.pt
Received: 17 January 2022 / Accepted: 21 April 2022
We present SHEEP, a new machine learning approach to the classic problem of astronomical source classification, which combines the outputs of the XGBoost, LightGBM, and CatBoost learning algorithms to create stronger classifiers. A novel step in our pipeline is that, prior to performing the classification, SHEEP first estimates photometric redshifts, which are then placed into the data set as an additional feature for classification model training; this results in significant improvements in the subsequent classification performance. SHEEP contains two distinct classification methodologies: (i) multi-class and (ii) one versus all with correction by a meta-learner. We demonstrate the performance of SHEEP for the classification of stars, galaxies, and quasars using a data set composed of SDSS and WISE photometry of 3.5 million astronomical sources. The resulting F1-scores are as follows: 0.992 for galaxies; 0.967 for quasars; and 0.985 for stars. In terms of the F1-scores for the three classes, SHEEP is found to outperform a recent Random Forest-based classification approach using an essentially identical data set. Our methodology also facilitates model and data set explainability via feature importances; it also allows the selection of sources whose uncertain classifications may make them interesting targets for follow-up observations.
Key words: methods: data analysis / methods: statistical / catalogs / stars: general / Galaxy: general / quasars: general
© P. A. C. Cunha and A. Humphrey 2022
Open Access article, published by EDP Sciences, under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
This article is published in open access under the Subscribe-to-Open model.
1 Introduction
Imaging and spectroscopic surveys are one of the main resources for the understanding of the baryonic content of the Universe. The data from these surveys enables statistical studies of stars (e.g. Bensby et al. 2014), quasars (hereafter QSOs; e.g. Richards et al. 2006), and galaxies (e.g. Kauffmann et al. 2003), and the discovery of more peculiar objects such as the previously elusive Type 2 quasars (e.g. Alexandroff et al. 2013; Zakamska et al. 2016).
Wide-area surveys, both ground- and space-based, have yielded high volumes of data, revolutionising the field of astronomy (e.g. Sloan Digital Sky Survey (SDSS): Gunn et al. 1998; York et al. 2000). Future surveys will continue to give us more detailed imaging, at a range of wavelengths (e.g. the Large Synoptic Survey Telescope (LSST): Ivezić et al. 2019; the Dark Energy Spectroscopic Instrument (DESI): Dey et al. 2019; Euclid: Euclid Collaboration et al. 2022; the James Webb Space Telescope: Gardner et al. 2006).
While high quality spectroscopic observations of sources are desirable, they can be time consuming even with modern multiplexing multi-object spectrographs. Conversely, photometry measured from images allows the efficient assembly of spectral energy distributions (SEDs) for very large samples of objects, and thus continues to play an important role in source classification, and in the estimation of redshifts and physical properties (e.g. Benítez 2000).
Among the simplest use-cases for photometric SEDs are single-colour or colour-colour selection techniques, which function as lossy dimensionality reduction methods and allow the separation of some source classes (e.g. Haro 1956; Bell et al. 2004). While intuitive and usually simple to apply, colour-colour methods are often crude, and typically only use a small subset of the information available in the SED.
On the other hand, spectral template fitting techniques make use of a wider range of features within the SED to derive various physical properties and to estimate its photometric redshift (e.g. Bolzonella et al. 2000; Laigle et al. 2016). Although template fitting methods generally outperform colour-colour methods, they can be computationally expensive to apply to very large samples since the sources are typically modelled on an individual basis.
Machine learning offers a promising alternative to the colour-colour and template fitting methods, for two main reasons: (i) the full range of photometric information in a source SED can be made use of and (ii) once a machine learning model is trained, it can be computationally inexpensive to apply it to new samples. While there are various ways in which machine learning can be applied to astronomical data, the three most common applications are source classification (e.g. Elting et al. 2008; Kurcz et al. 2016; Krakowski et al. 2016; Bai et al. 2019; Clarke et al. 2020), physical property estimation (e.g. Ucci et al. 2017; Bonjean et al. 2019; Delli Veneri et al. 2019; Simet et al. 2019; Mucesh et al. 2021), and photometric redshift estimation (e.g. Fotopoulou & Paltani 2018; Sadeh et al. 2019; Salvato et al. 2019; Carvajal et al. 2021; Nakoneczny et al. 2021).
In this work, we describe SHEEP, a supervised machine learning pipeline that estimates photometric redshifts and uses this information when subsequently classifying the sources into the galaxy, QSO, and star classes. In Sect. 2, we describe the data set used in this work. In Sect. 3, we present the main statistical metrics used to evaluate the performance of our pipeline. Section 4 presents the results from a proof of concept task for object type classification using redshift as a feature. In Sect. 5, we describe the SHEEP pipeline, the feature engineering process, the pipeline architecture and motivations, and the selection of features. In Sect. 6, we apply SHEEP to an SDSS and WISE photometric data set, and discuss the performance. Finally, in Sect. 7 we provide a summary and the conclusions of this work.
Fig. 1 Histogram of the modelMag_r for the extracted SDSS DR15 sources. Missing values are not included. Each source is colour-coded: orange for the star class; blue for the QSO class; and green for the galaxy class.
2 Data
In this work, an SQL query1 was used to extract photometric data from the SDSS Data Release 15 (Aguado et al. 2019) and WISE (Wright et al. 2010) catalogues. The source ID was used to ensure the cross-match between the catalogues. From the SDSS catalogue we extracted the psfMag, petroMag, cModelMag, and modelMag magnitudes for the five optical bands (u, g, r, i, z). The distribution of the modelMag magnitude in the r filter, by class, is shown in Fig. 1. The spectroscopic redshift distribution for the different source classes is shown in Fig. 2.
From the WISE catalogue, the four infrared bands (W1 3.4 µm, W2 4.6 µm, W3 12 µm, W4 22 µm) were extracted. In total, we extracted photometric data for 3 497 864 sources: 2 401 787 classified as galaxies, 473 954 as QSOs, and 599 177 as stars2. The SDSS spectroscopic classifications are template-based for all three classes; the galaxy class also contains some AGN whose detectable emission lines are consistent with Seyfert or LINER galaxies.
No pre-processing tasks were performed to remove missing values from the data set. All missing values are set as −9999.0, as in SDSS DR15. The properties of the missing values can be seen in Table 1.
3 Statistical Metrics
To evaluate the performance of any machine learning algorithm, either for classification or regression, statistical metrics are essential. In this section, we give an overview of the different metrics used in this work. The statistical metrics for classification tasks vary between 0 and 1, where 1 is the perfect score.
Fig. 2 Histogram of the redshift distribution for the extracted SDSS DR15 sources. Each source is colour-coded: green for the galaxy class and blue for the QSO class. The distribution for the star class is not represented since it is concentrated at z = 0.
Table 1. Fraction of the missing data in this work.
3.1 Classification Metrics
For the classification tasks we used the following statistical metrics. Precision is the fraction of correct predictions for a given class compared to the overall number of predictions for that class,
Precision = TP / (TP + FP), (1)
where TP is the number of true positives in a given class and FP is the number of false positives in a given class. In astronomy, this metric is also called “purity”.
Recall is the fraction of correct predictions for a given class, compared to the overall number of positives cases for that class,
Recall = TP / (TP + FN), (2)
where FN is the number of false negatives in a given class. In other words, recall indicates the fraction of a class in the data set that the model correctly identifies. In astronomy this metric is also called “completeness”.
The F1-score is the harmonic mean of the precision and recall,
F1 = 2 × (Precision × Recall) / (Precision + Recall), (3)
where equal weights are given to precision and recall.
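The three classification metrics in Eqs. (1)-(3) can be sketched directly from the TP, FP, and FN counts; the counts in the example below are illustrative, not taken from the paper:

```python
# Minimal sketch of the classification metrics in Eqs. (1)-(3),
# computed for a single class from its confusion-matrix counts.

def precision(tp, fp):
    """Eq. (1): 'purity' -- fraction of predictions for a class that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Eq. (2): 'completeness' -- fraction of a class's true members recovered."""
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    """Eq. (3): harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# Illustrative counts: 90 true positives, 10 false positives, 5 false negatives
print(round(f1_score(90, 10, 5), 3))  # → 0.923
```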
Table 2. Comparison of the classification metrics for each learning model in the POC tasks.
3.2 Regression Metrics
For the regression tasks (i.e. redshift estimation), the following statistical metrics were adopted for consistency with Euclid Collaboration et al. (2020). The normalised median absolute deviation (NMAD) provides a measure of the variability in the sample,

NMAD = 1.48 × median( |zpred − zref| / (1 + zref) ), (4)
where zpred is the predicted redshift and zref is the ground-truth redshift given by the SDSS pipeline.
The fraction of catastrophic outliers, fout (e.g. Hildebrandt et al. 2010) is a quality control metric defined as
fout = N( |zpred − zref| / (1 + zref) > 0.15 ) / Ntotal. (5)
The bias in the predicted redshift can be estimated by studying the fluctuation of the predicted values,
bias = ⟨ (zpred − zref) / (1 + zref) ⟩. (6)
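Under the convention of Euclid Collaboration et al. (2020), with Δz = (zpred − zref)/(1 + zref) and a catastrophic-outlier threshold of |Δz| > 0.15, the three regression metrics of Eqs. (4)-(6) can be sketched as follows (the 0.15 threshold and the use of the mean for the bias are common conventions, assumed here rather than quoted from the text):

```python
import numpy as np

def photoz_metrics(z_pred, z_ref, outlier_threshold=0.15):
    """NMAD, catastrophic-outlier fraction, and bias for photo-z predictions."""
    z_pred, z_ref = np.asarray(z_pred, float), np.asarray(z_ref, float)
    dz = (z_pred - z_ref) / (1.0 + z_ref)           # normalised residuals
    nmad = 1.48 * np.median(np.abs(dz))             # Eq. (4)
    f_out = np.mean(np.abs(dz) > outlier_threshold) # Eq. (5)
    bias = np.mean(dz)                              # Eq. (6)
    return nmad, f_out, bias

# Toy example: one of four sources is a catastrophic outlier
nmad, f_out, bias = photoz_metrics([0.12, 0.48, 1.6, 2.05],
                                   [0.10, 0.50, 1.0, 2.00])
print(f_out)  # → 0.25
```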
4 Proof of Concept
Proof of concept (POC) tasks are essential to understand the viability of the classification and/or regression task. We designed three POC tasks:
Reproducing the results obtained by Clarke et al. (2020), using the same machine learning algorithm as the cited work (Random Forest3; Breiman 2001);
Testing the gradient-boosting algorithm XGBoost (Mitchell et al. 2018) on the same data, for benchmarking;
Exploring the impact of adding redshift as a feature to the SDSS and WISE photometric data by studying the change in the classification metrics.
The same internal configuration of each classification algorithm (Random Forest and XGBoost) was used throughout this analysis. Fixing the hyper-parameters is important for an unbiased performance comparison. The classification metrics for the POC tasks are combined in Table 2. The column spec-z is binary: the value No indicates that the spectroscopic redshift was not added to the data; the value Yes indicates that it was.
For the POC tasks (1) and (2), we used the same data and feature engineering as described in Clarke et al. (2020). We reproduced the results when using the Random Forest algorithm, achieving identical statistical metrics. With XGBoost, we obtained an increase of 0.004 in accuracy, 0.003 in precision, 0.01 in recall, and 0.006 in F1-score, compared to the metrics obtained using Random Forest.
In the third POC task, we used the previously described photometric data, with the spectroscopic redshift added. The same algorithms and internal configurations from the previous POC task were applied to the modified data. In the internal classification metrics analysis, XGBoost showed a slight improvement in overall performance compared with Random Forest, with an increase of 0.005 in accuracy, 0.002 in precision, 0.01 in recall, and 0.007 in F1-score. The overall increase in performance is consistent with that observed in the first POC task. To understand how the performance of the classification algorithms changes between the POC tasks, we compared the best algorithm from each task. An increase of 0.007 in accuracy, 0.009 in precision, 0.012 in recall, and 0.011 in F1-score was achieved with the addition of the redshift.
In summary, the main conclusions from the exploratory POC tasks are that XGBoost outperforms Random Forest for the POC classification tasks and the inclusion of the spectroscopic redshift as a feature significantly increases the performance of both classification algorithms. To take advantage of the results from the third POC task, photo-z will be estimated and used in the classification tasks. The detailed description is found in Sect. 5.3.
5 The SHEEP Pipeline
We describe SHEEP4, a machine learning pipeline built to perform regression and classification tasks using tabular data. It is designed to perform photo-z estimation and automated source classification for tabular data. While in this work SDSS and WISE photometry are used, other types of tabular data are compatible with our architecture, such as radio fluxes. In the following sections, we present the general structure of our pipeline to allow reproducibility and further improvements. Since we are working with a large data set, only gradient-boosting decision tree algorithms5 with GPU compatibility were explored. The Gradient platform6 was used to enable GPU acceleration with the NVIDIA RAPIDS framework (Raschka et al. 2020).
5.1 Feature Engineering
All unique permutations of broad-band colours were calculated from the following parameters: cModelMag; modelMag; psfMag; WISE; psfMag-WISE. Additional features using broad-band magnitudes were engineered as follows: psfMag-modelMag; psfMag-cModelMag. The latter allows the study of the variance between different magnitude measurements for the same filter, which can help the model to differentiate between compact and extended objects. Missing values for the magnitudes, colours, and engineered features were set as −9999.0. For the regression tasks, the spectroscopic redshift was set as the target; for the classification tasks the source class was set as the target. Depending on the task being performed, additional targets may be defined, as explained in Sect. 5.2.
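The colour feature engineering above amounts to taking all unique pairwise differences within a set of magnitude columns. The sketch below is illustrative (not the authors' code); the column names follow the SDSS convention of suffixing the filter:

```python
from itertools import combinations

# Illustrative sketch of the colour feature engineering: all unique
# pairwise differences (colours) within a list of magnitude columns.
# Column names are assumed, following the SDSS psfMag_<band> convention.

bands = ["u", "g", "r", "i", "z"]

def colour_features(mag_cols):
    """Names of all unique colours (pairwise differences) for the given columns."""
    return [f"{a}-{b}" for a, b in combinations(mag_cols, 2)]

psf_cols = [f"psfMag_{b}" for b in bands]
colours = colour_features(psf_cols)
print(len(colours))  # → 10, i.e. C(5, 2) colours for the five psfMag bands
```

In a real pipeline each name would correspond to a new column computed as the difference of the two magnitude columns, with −9999.0 sentinels propagated for missing values.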
Decision tree ensembles are insensitive to monotonic rescaling of the input features, and thus the use of scaling does not affect the performance of the algorithm. For this reason we did not perform any feature scaling.
Fig. 3 Flow chart describing the photo-z regression learning algorithm used in the SHEEP pipeline.
5.2 Learning Algorithm
The SHEEP pipeline is divided into two branches: a regression model that estimates the photo-z and a classification model that identifies sources as galaxies, QSOs, and stars. We use the following algorithms as our base learners: XGBoost7 (Mitchell et al. 2018), LightGBM8 (Ke et al. 2017), and CatBoost9 (Prokhorenkova et al. 2017). While these gradient-boosting decision tree algorithms have similarities, each one has unique advantages that can be leveraged for studies such as ours. XGBoost presents some innovative features, in particular the ability to deal with sparse data, which is particularly relevant for astronomical data sets. LightGBM introduces a histogram-based approach for continuous variables, which improves training time, as well as leaf-wise tree growth, whereas XGBoost uses level-wise tree growth. CatBoost relies on an ordered boosting approach, which prevents the overfitting caused by using residuals from previous fits, also called prediction shift. To make use of the different methodologies, we use ensemble methods to combine them: soft-voting for regression and hard-voting for classification.
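The two combination rules can be illustrated with a minimal sketch: soft-voting averages the continuous predictions of the base learners, while hard-voting takes the majority class. The base-learner outputs below are placeholders, not real model predictions:

```python
import numpy as np

# Minimal illustration of the two ensemble rules used in SHEEP:
# soft voting (averaging) for regression, hard voting (majority) for
# classification. Inputs are placeholder base-learner outputs.

def soft_vote(predictions):
    """Average the continuous predictions of the base learners."""
    return np.mean(np.asarray(predictions, dtype=float), axis=0)

def hard_vote(class_predictions):
    """Majority class across base learners (ties go to the smallest label)."""
    preds = np.asarray(class_predictions)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, preds)

# Three base learners, two sources: photo-z estimates and class labels
zs = [[0.51, 1.20], [0.49, 1.10], [0.50, 1.30]]
labels = [[0, 1], [0, 2], [1, 1]]   # 0 = galaxy, 1 = QSO, 2 = star

print(soft_vote(zs))      # ensemble photo-z per source
print(hard_vote(labels))  # final class label per source
```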
The base learners were trained and optimised using FLAML (Wang et al. 2019), a Python library that explores the hyperparameter space for each learner. FLAML10 is built to provide cost-effective hyper-parameter optimisation. To ensure the generalisation of our model, we applied a data partitioning strategy called k-fold cross-validation (where k defines the number of partitions), with k set to five. In summary, the full data set is randomly split into k folds (or parts) of similar size and with a balanced number of target classes. The model is trained using k − 1 folds and validated on the remaining fold, also known as the out-of-fold (OOF) set. The model performance is computed by averaging the predictions over the k iterations. This procedure prevents overfitting by evaluating the model on k different validation sets, and allows the computation of OOF predictions for the entire data set.
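The OOF scheme above can be sketched with scikit-learn's StratifiedKFold, which also balances the target classes across folds; a plain decision tree stands in for the gradient-boosted base learners, and the data are synthetic:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Sketch of k-fold cross-validation with out-of-fold (OOF) predictions:
# every source is predicted by a model that never saw it during training.
# A DecisionTreeClassifier stands in for the gradient-boosted learners.

X, y = make_classification(n_samples=500, n_classes=3, n_informative=6,
                           random_state=0)
oof_pred = np.empty_like(y)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X, y):
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X[train_idx], y[train_idx])         # train on k-1 folds
    oof_pred[val_idx] = model.predict(X[val_idx]) # predict the held-out fold

print(oof_pred.shape)  # one OOF prediction per source
```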
5.3 Photo-z Estimation
Motivated by our finding that including redshift as a feature can improve classification accuracy, we implemented a photo-z prediction algorithm as part of our pipeline to predict the spectroscopic redshift of galaxies, QSOs, and stars from SDSS. In Fig. 3, we show a flow chart describing the regression learning algorithm. After the feature engineering task, the data is split using a train-to-test ratio of 70:30. This split is motivated by the need to increase the training set, while still having a representative hold-out test set for model evaluation. The base learners are optimised and trained to calculate the first level of the photo-z predictions.
5.4 Classification
In Fig. 4, we show a flow chart describing the classification learning algorithm. After the pre-processing task, the data is split using a train-to-test ratio of 50:50. The algorithm contains two approaches: multi-class and one versus all.
5.4.1 Multi-class
A model is trained to output multiple class predictions, while optionally removing potential catastrophic outliers from the photo-z estimation. We refer to this option as the monitoring model; it must be activated in the early stages of the pipeline. The monitoring model is a binary classifier, trained and optimised using the FLAML library, that identifies potential catastrophic outliers (e.g. Singal et al. 2022). If the monitoring model is requested, the photo-z estimates flagged as potential catastrophic outliers are removed. The training data remain unchanged regardless of whether the monitoring model is activated. Finally, a classification model is used to predict the testing data. The final classification model is the same for both approaches.
Fig. 4 Flow chart describing the classification learning algorithm used in the SHEEP pipeline.
5.4.2 One Versus all
The one versus all approach divides the classification problem into multiple binary classification problems: galaxy versus all; QSO versus all; star versus all. The three resulting models are each specialised for one source class, allowing deeper connections in the data to be captured. The one versus all approach presented here was inspired by Logan & Fotopoulou (2020); we add a more detailed exploration of this concept, with alternative solutions and analysis. The most interesting, and often overlooked, aspect of this approach is that a source can be classified with only one class: for example, a source is classified as a galaxy by the galaxy versus all model, and not as a QSO or a star by the other two.
When only one class label is assigned to a source by the three binary classification models, we denominate it a confident prediction. If the source is assigned multiple labels, or none, we denominate it an uncertain prediction. The uncertain predictions can highlight interesting cases for study, since they can originate from (i) a lack of generalisation in the model itself or (ii) similarities in the photometric data confusing the models. Case (ii) can be physically motivated (e.g. rare objects, transitional sources).
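The confident/uncertain split described above reduces to counting how many of the three one-versus-all models claim each source; the boolean claim matrix below is a placeholder for real model outputs:

```python
import numpy as np

# Illustrative sketch: a prediction is confident only when exactly one of
# the three one-versus-all models claims the source. The claim matrix is
# a placeholder, not real model output.

# rows = sources, columns = (galaxy-vs-all, QSO-vs-all, star-vs-all)
claims = np.array([
    [1, 0, 0],   # confident: galaxy only
    [0, 1, 1],   # uncertain: claimed as both QSO and star
    [0, 0, 0],   # uncertain: claimed by no model
])
confident = claims.sum(axis=1) == 1
print(confident)  # → [ True False False]
```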
Two solutions were designed to deal with the uncertain predictions: multi-class and meta-learner.
The multi-class model is built on the assumption that most of the incorrect predictions are due to limitations in the pipeline models. The training data are used by our base learner algorithms to perform a multi-class classification, with the uncertain predictions as the test data. The main idea is to make use of the generalisation strength to provide a final predicted label for the test data, while allowing parameter tuning for the classification probability threshold. The decision tree ensemble used for the uncertain prediction corrections is the same as that previously computed in the main-branch multi-class approach. To avoid introducing biases from separate approaches, and to fairly compare the efficiency of two different and independent methodologies, this solution was not implemented.
The meta-learner approach uses generalised stacking (Wolpert 1992), an ensemble method that takes multiple weak learners and combines them, either by building a new predictive meta-model or by combining the weak learner predictions. As in the multi-class solution, we assume that the uncertain predictions are caused by a lack of generalisation. The generalised stacking with meta-learner method labels the uncertain predictions by stacking the predictions from the base learners, for each of the classes, in the one versus all starting step. A meta-learner is trained on the classification probabilities of the confident predictions, and is then used to make the final predictions for the uncertain ones. This process allows the meta-learner to learn the inner structure of the predictions and how to improve them, correcting potential misclassifications made by the initial stacking. By setting a probability threshold, the number of misclassifications can be controlled and further studied. In this work, the task-oriented AutoML from the FLAML library was used to select the most accurate of our base learners to serve as the meta-learner. The LightGBM algorithm was selected and trained using the prediction probabilities from the confident predictions given by the base learners.
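A hedged sketch of the meta-learner step: the stacked class probabilities from the base learners become the input features, the model is trained on the confidently labelled sources, and is then applied to the uncertain ones. The features here are random placeholders, and scikit-learn's LogisticRegression stands in for the LightGBM meta-learner selected by FLAML in the paper:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sketch of generalised stacking: base-learner probabilities (placeholders
# here) are the meta-features; the meta-model learns from the confident
# predictions and labels the uncertain ones.

rng = np.random.default_rng(0)
n_confident, n_uncertain = 200, 20
n_meta_features = 9  # e.g. 3 base learners x 3 one-vs-all probabilities

X_confident = rng.random((n_confident, n_meta_features))
y_confident = rng.integers(0, 3, n_confident)   # 0=galaxy, 1=QSO, 2=star
X_uncertain = rng.random((n_uncertain, n_meta_features))

meta = LogisticRegression(max_iter=1000).fit(X_confident, y_confident)
final_labels = meta.predict(X_uncertain)        # labels for uncertain sources
print(final_labels.shape)  # → (20,)
```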
Fig. 5 Importance values of the top 20 features for the best models in the multi-class and one vs all approaches: (a and b) XGBoost; (c and d) LightGBM. The importance values (x-axis) are presented on a logarithmic scale for ease of visualisation. The x-axis scale also varies based on the algorithm being used. The feature names are shown on the y-axis. The variable imp_z is the final photo-z prediction value and oof_flag_outlier is the binary variable that identifies the out-of-fold photo-z catastrophic outliers.
5.5 Target Variable
Inside the SHEEP pipeline, the target feature is set dynamically, depending on the task. For photo-z regression the target is continuous and is set as 1 + z. For the classification tasks the target feature is always discrete. For the multi-class models the target variable is set to 0 for galaxies, 1 for QSO, and 2 for stars. In the one versus all approach the class to be identified (galaxy, QSO, or star) is assigned a target value of 0, while the remaining sources are assigned a target value of 1. For the outlier detection the target feature is set as 0 for ‘Not outlier’ and 1 for ‘Outlier’.
5.6 Feature Selection and Importance
For model explainability, decision tree-based algorithms provide the user with feature importance values. This is relevant for understanding the importance of the individual features used in the model. However, each learning algorithm has its own way of calculating feature importance. For example, in the XGBoostClassifier the default option is the gain, which represents the improvement in accuracy that each feature provides. In Fig. 5, the top 20 features are shown for the multi-class learning algorithm. This analysis reveals some intriguing results, such as the presence of the oof_flag_outlier binary feature, which denotes whether the photo-z is expected to be an outlier. This is also seen for LightGBM and CatBoost, even though the feature importance values vary between the learning algorithms.
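Extracting a top-features ranking of this kind can be sketched as follows; scikit-learn's GradientBoostingClassifier stands in for XGBoost, and the feature names and data are synthetic, not the SHEEP feature set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Sketch of ranking features by importance, as in Fig. 5. The model and
# feature names are stand-ins (sklearn instead of XGBoost, synthetic data).

feature_names = [f"colour_{i}" for i in range(10)]
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X, y)
top = sorted(zip(feature_names, model.feature_importances_),
             key=lambda t: t[1], reverse=True)[:5]
for name, imp in top:
    print(f"{name}: {imp:.3f}")  # top five features, most important first
```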
The presence of the photometric redshift in Fig. 5 is physically explainable (see Sect. 4). For example, individual stars are only detected at redshifts very close to zero, and QSOs have a bias towards higher redshifts than galaxies. The learning algorithm is able to pick up those relationships and correlate them to physical features, such as psfMag_r-modelMag_r and w1mpro-w2mpro.
A/B tests were conducted to understand whether the coordinates of each source, right ascension (RA) and declination (Dec), would help the learning algorithm. No significant improvement was observed for the multi-class approach; these features were, however, used in the one versus all approach.
In Fig. 5, the top 20 features are represented for the three one versus all models. As expected, the photo-z related features, imp_z and oof_flag_outlier, and the photometric colours are among the most important features, independently of the task. Infrared information, either magnitudes or colours, is also relevant, in particular for QSO and star classification.
Interestingly, the RA and Dec values are among the most important features in the QSO versus all and star versus all LightGBM models. Stars are distributed closer to the Galaxy disk, while QSOs are easier to find away from the dusty Milky Way disk. Since the relative density varies across the SDSS survey area, the models can take into account the local density of these classes when estimating the probability that an unresolved source is a star or a QSO.
Table 3. Regression metrics for the level-one and level-two photo-z predictions (top rows), and regression metrics for the SDSS photo-z predictions and this work, for galaxy sources only (bottom rows).
Fig. 6 Distribution of the spectroscopic redshifts (zspec) as a function of the predicted photometric redshifts (zphoto). The central dashed line indicates the y = x relation, while the solid lines delimit the outlier region. In the star-class panel, only the outlier line is represented, as a solid line. The colour bar is normalised: blue represents denser regions and yellow represents rarer regions.
6 Results and Discussion
6.1 Photometric Redshifts
The photo-z regression learning algorithm provides predictions at two levels. For the first level, the photo-z is predicted using photometric data, as described in Sect. 5.1. The metrics for the first- and second-level predictions are shown in Table 3. We also extracted the SDSS photometric redshift estimates, derived by a robust fit to the nearest neighbours in a reference set, for 2 473 520 galaxy sources, and added them for comparison.
The regression metrics for level two are significantly improved, when compared with the predictions for level one. The addition of the chain regressor approach helps the new model to recognise and correct some of the inconsistencies in the level one predictions. In a comparison of our work with the SDSS photo-z predictions, we outperform the SDSS algorithm.
In Fig. 6, the predicted photo-z for each class is plotted against the ground-truth spectroscopic redshift. For the galaxy and QSO classes, a clear linear relation is seen, with most sources lying close to the 1:1 relation. For the star class, a more dispersed distribution is observed, owing to the difficulty of distinguishing stars from QSOs. The distribution of the difference between the spectroscopic redshift and the photo-z is shown in Fig. 7. The galaxy and star sources present a peak around zero; the catastrophic outliers are due to overestimations by the prediction model. For QSO sources, both over- and underestimations are detected, with a skew towards underestimation.
6.2 Galaxy, QSO, Star Classifications
We present the results from our unique architecture (see Fig. 4). The photo-z predictions were added to the photometric data and included in the SHEEP pipeline.
Table 4. Classification metrics for the XGBoost, LightGBM, and CatBoost models for the multi-class and one versus all approaches.
Fig. 7 Distribution of the difference between the spectroscopic redshift (zgroundtruth) provided by SDSS spectroscopy and the photometric redshift predicted in our work (zpredicted). The histogram is colour-coded by spectroscopic label: orange for stars, blue for QSOs, and green for galaxies. The black dashed lines delimit the area where the difference between redshifts is 1 or −1.
6.2.1 Multi-class Approach
In Table 4, the final classification metrics are presented for the case where the photo-z catastrophic-outlier correction is not applied. The best-performing algorithm was XGBoost. The impact of the photo-z prediction quality, using the XGBoost algorithm, is shown in Table 5. The classification metrics improve when potential catastrophic outliers are removed from the data set. Nonetheless, the improvement is also correlated with the efficiency of the monitoring model. In our case we obtained the following F1-scores: 0.56 for catastrophic outliers and 0.99 for non-outliers. Only 57% of the catastrophic outliers were detected and removed from our sample. By improving the detection of these outliers, the model could be further improved, either by removing them or by correcting them.
6.2.2 One Versus all Approach
The classification metrics for the individual learning algorithms in the one versus all approach are compiled in Table 4. All base learners perform similarly well. It is important to remember that even a small increase of 0.002 in recall is relevant when dealing with big data: for example, in QSO versus all, this increase in recall results in more than 1800 additional sources being correctly classified. This matters both for improving our machine learning models and for reducing the need for further human analysis. The XGBoost algorithm performed better for the galaxy data, while the LightGBM algorithm performed better with the QSO and star data.
To directly compare the predictions from the two main approaches, using only the sources with a single class designation, we computed the F1-scores: multi-class, 0.98389; one versus all, 0.98398. The two methodologies have similar classification metrics, but the one versus all approach is slightly better.
There were 5285 uncertain predictions out of the total test data set of 1 748 932 sources, representing around 0.3% of the test data. A total of 801 sources were classified as both galaxy and QSO; 814 as galaxy and star; 288 as QSO and star; and 3382 sources received no class. The meta-learner algorithm, LightGBM, predicted the classes of the uncertain predictions with the following metrics: precision, 0.585; recall, 0.592; F1-score, 0.583.
Table 5. Classification metrics for the XGBoost classification model in the multi-class approach, with the monitoring model for the photo-z predictions.
6.3 Comparison with Similar Studies
This study was designed to allow a direct comparison with the Random Forest-based classification methodology of Clarke et al. (2020). A histogram combining the F1-scores for the different classes is shown in Fig. 8. The precision, recall, and F1-score metrics are given in Table 6. Our method outperforms the approach of Clarke et al. (2020), in terms of the F1-score, for all three classes. In general, we obtain improved precision and recall values compared to those obtained by Clarke et al. (2020). Two exceptions are the precision for the star class, which is marginally below that of Clarke et al. (2020), and the recall for the galaxy class, where a similar result was obtained.
Our classification method also performs well when compared to other similar studies (Logan & Fotopoulou 2020; Li et al. 2021; Nakazono et al. 2021, see Fig. 8), although a direct comparison is not possible since significantly different data sets were used.
Fig. 8 F1-score metrics from this work (one vs. all with meta-learner corrections), Clarke et al. (2020), Li et al. (2021), Nakazono et al. (2021), and Logan & Fotopoulou (2020). The first three plots represent the metrics for each individual source class: galaxy, QSO, and star. The fourth and fifth plots represent the average F1-score for each work.
Comparison of the classification metrics from our methodology, multi-class and one versus all, with Clarke et al. (2020).
7 Conclusions and Final Remarks
We have described SHEEP, a machine learning pipeline for the classification of galaxies, QSOs, and stars from tabular photometric data, which also accepts sparse data with missing values. The pipeline first estimates the photo-z and adds it to the data set as an additional feature for the subsequent training tasks. It offers two approaches, multi-class with photo-z outlier detection and one versus all with meta-learner correction, both of which combine the outputs from gradient boosting decision trees into a single, stronger classifier.
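The photo-z augmentation step described above can be sketched in a few lines. The names `augment_with_photoz` and `photoz_model` are hypothetical and only illustrate the idea of feeding the regressor's output to the classifiers as an extra column; they are not part of the SHEEP code base:

```python
import numpy as np

def augment_with_photoz(features, photoz_model):
    """Append a photo-z prediction column to the feature matrix.

    `photoz_model` is any callable mapping an (n_sources, n_features)
    array to an (n_sources,) array of redshift estimates. The returned
    matrix is what the downstream classifiers would be trained on.
    """
    z_pred = photoz_model(features)  # one photo-z per source
    return np.column_stack([features, z_pred])
```

The design point is that the classifier never sees the spectroscopic redshift, only the (possibly imperfect) photometric estimate, so the same augmentation can be applied at inference time.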
From our best approach, one versus all with meta-learner corrections for the uncertain predictions, we obtained a precision of 0.985, a recall of 0.979, and an F1-score of 0.982, using photometric data from SDSS and WISE together with the predicted photo-z. Our F1-scores are significantly higher than those obtained with the random forest method of Clarke et al. (2020). For the photo-z estimation, we obtained an NMAD of 0.010, a bias of 0.007, a catastrophic outlier fraction of 0.02, and an R2 score of 0.916.
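As a point of reference, photo-z metrics of this kind can be computed with a short numpy sketch. This assumes the usual conventions in the photo-z literature (dz = (zphot − zspec)/(1 + zspec), NMAD = 1.4826 × median(|dz − median(dz)|), and a |dz| > 0.15 threshold for catastrophic outliers); the paper's exact definitions may differ:

```python
import numpy as np

def photoz_metrics(z_spec, z_phot, outlier_threshold=0.15):
    """Common photo-z quality metrics (assumed conventions, see above)."""
    z_spec = np.asarray(z_spec, dtype=float)
    z_phot = np.asarray(z_phot, dtype=float)
    # Normalised residuals, scaled by (1 + z) as is standard for photo-z.
    dz = (z_phot - z_spec) / (1.0 + z_spec)
    nmad = 1.4826 * np.median(np.abs(dz - np.median(dz)))
    bias = np.mean(dz)
    outlier_frac = np.mean(np.abs(dz) > outlier_threshold)
    # Coefficient of determination of z_phot as a predictor of z_spec.
    ss_res = np.sum((z_spec - z_phot) ** 2)
    ss_tot = np.sum((z_spec - z_spec.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return {"nmad": nmad, "bias": bias,
            "outlier_frac": outlier_frac, "r2": r2}
```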
In future work we will adapt the SHEEP pipeline for application to data sets from upcoming surveys such as Euclid, where machine learning techniques will be indispensable due to the expected very large volumes of data. The inclusion of additional data will likely lead to further improvements in the classification performance of SHEEP; promising additions include (i) wider wavelength coverage (e.g. near-infrared photometry from Euclid); (ii) narrower band filters (e.g. from J-PAS); (iii) morphological information from images; and (iv) spectral information. In addition, an exploration of the application of active learning with a hybrid methodology is deferred to a future publication (Cunha et al., in prep.).
The SQL query can be found at https://github.com/pedro-acunha/SHEEP/blob/main/sql_query_casjobs
An interesting benchmark study of gradient boosting decision tree algorithms can be found in Anghel et al. (2018).
Acknowledgements
PACC dedicates this work in memory of Prof. Eduardo Pereira. The authors thank Ana Afonso and Tom Scott for their comments and suggestions. PACC acknowledges financial support by Centro de Astrofísica da Universidade do Porto through grant CIAAUP-12/2021-BI-D from UIDB/04434/2020. AH acknowledges support from NVIDIA through an NVIDIA Academic Hardware Grant Award. AH acknowledges financial support by Fundação para a Ciencia e a Tecnologia (FCT) through grants UID/FIS/04434/2019, UIDB/04434/2020, UIDP/04434/2020 and PTDC/FISAST /29245/2017, and an FCT-CAPES Transnational Cooperation Project. Funding for the Sloan Digital Sky Survey IV has been provided by the Alfred P. Sloan Foundation, the U.S. Department of Energy Office of Science, and the Participating Institutions. SDSS-IV acknowledges support and resources from the Center for High-Performance Computing at the University of Utah. The SDSS website is http://www.sdss.org/. SDSS-IV is managed by the Astrophysical Research Consortium for the Participating Institutions of the SDSS Collaboration including the Brazilian Participation Group, the Carnegie Institution for Science, Carnegie Mellon University, the Chilean Participation Group, the French Participation Group, Harvard-Smithsonian Center for Astrophysics, Instituto de Astrofísica de Canarias, The Johns Hopkins University, Kavli Institute for the Physics and Mathematics of the Universe (IPMU)/University of Tokyo, Lawrence Berkeley National Laboratory, Leibniz Institut für Astrophysik Potsdam (AIP), Max-Planck-Institut für Astronomie (MPIA Heidelberg), Max-Planck-Institut für Astrophysik (MPA Garching), Max-Planck-Institut für Extraterrestrische Physik (MPE), National Astronomical Observatories of China, New Mexico State University, New York University, University of Notre Dame, Observatário Nacional/MCTI, The Ohio State University, Pennsylvania State University, Shanghai Astronomical Observatory, United Kingdom Participation Group, Universidad Nacional Autónoma 
de México, University of Arizona, University of Colorado Boulder, University of Oxford, University of Portsmouth, University of Utah, University of Virginia, University of Washington, University of Wisconsin, Vanderbilt University, and Yale University.
References
- Aguado, D. S., Ahumada, R., Almeida, A., et al. 2019, ApJS, 240, 23
- Alexandroff, R., Strauss, M. A., Greene, J. E., et al. 2013, MNRAS, 435, 3306
- Anghel, A., Papandreou, N., Parnell, T., et al. 2018, arXiv e-prints [arXiv:1809.04559]
- Bai, Y., Liu, J., Wang, S., et al. 2019, AJ, 157, 9
- Baldry, I. K., Glazebrook, K., Brinkmann, J., et al. 2004, ApJ, 600, 681
- Baum, W. A. 1957, AJ, 62, 6
- Bell, E. F., Wolf, C., Meisenheimer, K., et al. 2004, ApJ, 608, 752
- Benítez, N. 2000, ApJ, 536, 571
- Bensby, T., Feltzing, S., & Oey, M. S. 2014, A&A, 562, A71
- Bolzonella, M., Miralles, J.-M., & Pelló, R. 2000, A&A, 363, 476
- Bonjean, V., Aghanim, N., Salomé, P., et al. 2019, A&A, 622, A137
- Borucki, W. J., Koch, D. G., Lissauer, J. J., et al. 2003, Proc. SPIE, 4854, 129
- Breiman, L. 2001, Mach. Learn., 45, 5
- Carvajal, R., Matute, I., Afonso, J., et al. 2021, Galaxies, 9, 86
- Clarke, A. O., Scaife, A. M. M., Greenhalgh, R., et al. 2020, A&A, 639, A84
- Delli Veneri, M., Cavuoti, S., Brescia, M., et al. 2019, VizieR Online Data Catalog: J/MNRAS/486/1377
- Dey, A., Schlegel, D. J., Lang, D., et al. 2019, AJ, 157, 168
- Elting, C., Bailer-Jones, C. A. L., & Smith, K. W. 2008, Classif. Discov. Large Astron. Surv., 1082, 9
- Euclid Collaboration (Desprez, G., et al.) 2020, A&A, 644, A31
- Euclid Collaboration (Scaramella, R., et al.) 2022, A&A, 662, A112
- Fotopoulou, S., & Paltani, S. 2018, A&A, 619, A14
- Gardner, J. P., Mather, J. C., Clampin, M., et al. 2006, Space Sci. Rev., 123, 485
- Gomes, J. M., & Papaderos, P. 2017, A&A, 603, A63
- Gunn, J. E., Carr, M., Rockosi, C., et al. 1998, AJ, 116, 3040
- Haro, G. 1956, Bol. Observ. Tonantzintla Tacubaya, 2, 8
- Hernán-Caballero, A., Varela, J., López-Sanjuan, C., et al. 2021, A&A, 654, A101
- Hildebrandt, H., Arnouts, S., Capak, P., et al. 2010, A&A, 523, A31
- Ivezić, Ž., Kahn, S. M., Tyson, J. A., et al. 2019, ApJ, 873, 111
- Kauffmann, G., Heckman, T. M., White, S. D. M., et al. 2003, MNRAS, 341, 33
- Ke, G., Meng, Q., Finley, T., et al. 2017, Adv. Neural Inform. Process. Syst., 30, 3146
- Krakowski, T., Małek, K., Bilicki, M., et al. 2016, A&A, 596, A39
- Kurcz, A., Bilicki, M., Solarz, A., et al. 2016, A&A, 592, A25
- Laigle, C., McCracken, H. J., Ilbert, O., et al. 2016, ApJS, 224, 24
- Li, C., Zhang, Y., Cui, C., et al. 2021, MNRAS, 506, 1651
- Logan, C. H. A., & Fotopoulou, S. 2020, A&A, 633, A154
- Mitchell, R., Adinets, A., Rao, T., et al. 2018, arXiv e-prints [arXiv:1806.11248]
- Mucesh, S., Hartley, W. G., Palmese, A., et al. 2021, MNRAS, 502, 2770
- Nakazono, L., Mendes de Oliveira, C., Hirata, N. S. T., et al. 2021, MNRAS, 507, 5847
- Nakoneczny, S. J., Bilicki, M., Pollo, A., et al. 2021, A&A, 649, A81
- Prokhorenkova, L., Gusev, G., Vorobev, A., et al. 2017, arXiv e-prints [arXiv:1706.09516]
- Puschell, J. J., Owen, F. N., & Laing, R. A. 1982, ApJ, 257, L57
- Raschka, S., Patterson, J., & Nolet, C. 2020, arXiv e-prints [arXiv:2002.04803]
- Richards, G. T., Lacy, M., Storrie-Lombardi, L. J., et al. 2006, ApJS, 166, 470
- Sadeh, I., Abdalla, F. B., & Lahav, O. 2019, Astrophysics Source Code Library [record ascl:1910.014]
- Salvato, M., Ilbert, O., & Hoyle, B. 2019, Nat. Astron., 3, 212
- Simet, M., Chartab, N., Lu, Y., et al. 2019, ApJ, 908, 47
- Singal, J., Silverman, G., Jones, E., et al. 2022, ApJ, 928, 6
- Stevens, G., Fotopoulou, S., Bremer, M., et al. 2021, J. Open Source Softw., 6, 3635
- Ucci, G., Ferrara, A., Gallerani, S., et al. 2017, MNRAS, 465, 1144
- Wang, C., Wu, Q., Weimer, M., et al. 2019, arXiv e-prints [arXiv:1911.04706]
- Wolpert, D. H. 1992, Neural Netw., 5, 241
- Wright, E. L., Eisenhardt, P. R. M., Mainzer, A. K., et al. 2010, AJ, 140, 1868
- York, D. G., Adelman, J., Anderson, J. E., et al. 2000, AJ, 120, 1579
- Zakamska, N. L., Hamann, F., Pâris, I., et al. 2016, MNRAS, 459, 3144
All Tables
Comparison of the classification metrics for each learning model in the POC tasks.
Regression metrics for the level-one and level-two photo-z predictions (top rows), and regression metrics for the SDSS photo-z prediction and for this work, for galaxy sources only (bottom rows).
Classification metrics for the XGBoost, LightGBM, and CatBoost models for the multi-class and one versus all approaches.
Classification metrics for the XGBoost classification model in the multi-class approach, with the monitoring model for the photo-z predictions.
Comparison of the classification metrics from our methodology, multi-class and one versus all, with Clarke et al. (2020).
All Figures
Fig. 1 Histogram of the modelMag_r for the extracted SDSS DR15 sources. Missing values are not included. Each source is colour-coded: orange for the star class; blue for the QSO class; and green for the galaxy class.
Fig. 2 Histogram of the redshift distribution for the extracted SDSS DR15 sources. Each source is colour-coded: green for the galaxy class and blue for the QSO class. The distribution for the star class is not represented since it is concentrated at z = 0.
Fig. 3 Flow chart describing the photo-z regression learning algorithm used in the SHEEP pipeline.
Fig. 4 Flow chart describing the classification learning algorithm used in the SHEEP pipeline.
Fig. 5 Importance values of the top 20 features for the best models in the multi-class and one versus all approaches: (a and b) XGBoost; (c and d) LightGBM. The importance values (x-axis) are presented on a logarithmic scale for ease of visualisation; the x-axis scale also varies with the algorithm being used. The feature names are shown on the y-axis. The variable imp_z is the final photo-z prediction value and oof_flag_outlier is the binary variable that identifies the out-of-fold photo-z catastrophic outliers.
Fig. 6 Distribution of the spectroscopic redshifts (zspec) as a function of the predicted photometric redshifts (zphoto). The central dashed line indicates y = x, while the solid lines delimit the outlier region. In the star class plot, only the outlier line is represented, as a solid line. The colour bar is normalised: blue represents the denser regions and yellow the rarer regions.
Fig. 7 Distribution of the difference between the spectroscopic redshift (zgroundtruth) provided by SDSS spectroscopy and the photometric redshift predicted in our work (zpredicted). The histogram is colour-coded by spectroscopic label: orange for stars, blue for QSOs, and green for galaxies. The black dashed lines delimit the area where the difference between redshifts is 1 or −1.
Fig. 8 F1-score metrics from this work (one vs. all with meta-learner corrections), Clarke et al. (2020), Li et al. (2021), Nakazono et al. (2021), and Logan & Fotopoulou (2020). The first three plots represent the metrics for each individual source class: galaxy, QSO, and star. The fourth and fifth plots represent the average F1-score for each work.