Optimised sampling of SDSS-IV MaStar spectra for stellar classification using supervised models

R. I. El-Kholy; Z. M. Hayman

doi:10.1051/0004-6361/202451309

Open Access

Issue: A&A Volume 693, January 2025
Article Number: A300
Number of page(s): 17
Section: Numerical methods and codes
DOI: https://doi.org/10.1051/0004-6361/202451309
Published online: 27 January 2025

A&A, 693, A300 (2025)

R. I. El-Kholy^★ and Z. M. Hayman

Department of Astronomy, Space Science, and Meteorology, Faculty of Science, Cairo University, Giza 12613, Egypt

^★ Corresponding author

Received: 29 June 2024
Accepted: 17 December 2024

Abstract

Context. Supervised machine learning models are increasingly used for the stellar classification of spectroscopic data. However, training these models requires a large number of labelled instances, and collecting them is usually costly in both time and expertise.

Aims. Active learning (AL) algorithms minimise training dataset sizes by keeping only the most informative instances. This paper explores the application of AL to sampling stellar spectra using data from a highly class-imbalanced dataset.
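As a rough illustration of what "most informative" can mean in practice, the sketch below ranks unlabelled instances by model uncertainty (least-confidence scoring) and by committee disagreement (vote entropy). These are two common AL query measures, not necessarily the exact ones used in the paper; the function names, class labels, and toy numbers are all hypothetical.

```python
import math


def least_confident_index(probabilities):
    """Index of the pool instance whose highest class probability is lowest."""
    # probabilities: one list of predicted class probabilities per instance
    return min(range(len(probabilities)), key=lambda i: max(probabilities[i]))


def vote_entropy(votes):
    """Entropy of the labels that committee members assign to one instance."""
    counts = {}
    for label in votes:
        counts[label] = counts.get(label, 0) + 1
    total = len(votes)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def most_disputed_index(committee_votes):
    """Index of the pool instance on which the committee disagrees most."""
    return max(
        range(len(committee_votes)),
        key=lambda i: vote_entropy(committee_votes[i]),
    )


# Hypothetical pool of three unlabelled spectra with three candidate classes.
pool_probabilities = [
    [0.90, 0.05, 0.05],  # confident prediction
    [0.40, 0.35, 0.25],  # least confident prediction
    [0.70, 0.20, 0.10],
]
# Hypothetical votes from a three-member committee on the same spectra.
committee_votes = [
    ["hot", "hot", "hot"],    # unanimous
    ["hot", "cool", "cool"],  # partial disagreement
    ["hot", "warm", "cool"],  # full disagreement
]

uncertain = least_confident_index(pool_probabilities)
disputed = most_disputed_index(committee_votes)
print(uncertain, disputed)
```

In an iterative setting, the selected instance would be labelled, moved from the pool to the training set, and the model or committee retrained before the next query.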

Methods. We utilised the MaStar Stellar Library from the SDSS DR17, along with its associated stellar parameter catalogue. We applied a preprocessing pipeline comprising feature selection, scaling, and dimensionality reduction to the data. Using different AL algorithms, we iteratively queried the instances on which a single model exhibited the highest uncertainty or a committee of models showed the greatest disagreement. We assessed the effectiveness of the sampling techniques by comparing several performance metrics of supervised-learning models trained on the queried samples with those of models trained on randomly sampled counterparts. Evaluation metrics included specificity, sensitivity, and the area under the curve. In addition, we used the Matthews correlation coefficient, which accounts for class imbalance. We applied this procedure to the effective temperature, surface gravity, and iron metallicity separately.
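To make the role of the class-imbalance-aware metric concrete, here is a minimal sketch that computes sensitivity, specificity, accuracy, and the Matthews correlation coefficient from confusion-matrix counts. It assumes a binary problem for simplicity (the paper's classification tasks may be multi-class), and the counts are invented for illustration only.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, accuracy and MCC from binary confusion counts."""
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical imbalanced test set: 990 majority-class (negative) and
# 10 minority-class (positive) instances.

# A classifier that assigns every instance to the majority class.
trivial = binary_metrics(tp=0, fp=0, tn=990, fn=10)

# A classifier that recovers 8 of the 10 minority instances
# at the cost of 20 false alarms.
useful = binary_metrics(tp=8, fp=20, tn=970, fn=2)

for name, metrics in (("trivial", trivial), ("useful", useful)):
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

With these made-up counts, the trivial classifier has the higher accuracy yet an MCC of zero, whereas the classifier that actually detects the minority class scores a clearly positive MCC. This is the sense in which the coefficient accounts for class imbalance.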

Results. Our results demonstrate that AL algorithms select samples that yield better performance metrics than random and even stratified samples, while using fewer training instances.

Conclusions. We recommend AL for prioritising which instances in astronomical-survey data are labelled by experts or through crowdsourcing, in order to mitigate the high time cost. Its effectiveness can be further exploited in selecting targets for follow-up observations in automated astronomical surveys.

Key words: methods: data analysis / methods: statistical / techniques: spectroscopic / surveys / stars: general

Open Access article, published by EDP Sciences, under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This article is published in open access under the Subscribe to Open model.
