Evaluating the feasibility of interpretable machine learning for globular cluster detection

Dominik Dold; Katja Fahrion

doi:10.1051/0004-6361/202243354

A&A, 663 (2022) A81

Issue: A&A Volume 663, July 2022
Article Number: A81
Number of page(s): 18
Section: Extragalactic astronomy
DOI: https://doi.org/10.1051/0004-6361/202243354
Published online: 14 July 2022

Dominik Dold¹,⋆ and Katja Fahrion²,⋆

¹ European Space Agency, European Space Research and Technology Centre, Advanced Concepts Team, Keplerlaan 1, 2201 AZ Noordwijk, The Netherlands
² European Space Agency, European Space Research and Technology Centre, Keplerlaan 1, 2201 AZ Noordwijk, The Netherlands

Received: 17 February 2022
Accepted: 31 March 2022

Abstract

Extragalactic globular clusters (GCs) are important tracers of galaxy formation and evolution because their properties, luminosity functions, and radial distributions hold valuable information about the assembly history of their host galaxies. Obtaining GC catalogues from photometric data involves several steps that will likely become too time-consuming to perform on the large data volumes expected from upcoming wide-field imaging projects such as Euclid. In this work, we explore the feasibility of various machine learning methods to aid the search for GCs in extensive databases. We use archival Hubble Space Telescope data in the F475W and F850LP bands of 141 early-type galaxies in the Fornax and Virgo galaxy clusters. Using existing GC catalogues to label the data, we obtained an extensive data set of 84929 sources containing 18556 GCs, and we trained several machine learning methods on both image data and tabular data containing physically relevant features extracted from the images. We find that the evaluated machine learning models are capable of producing catalogues of similar quality to the existing ones, which were constructed from mixture modelling and structural fitting. The best-performing methods, namely ensemble-based models (such as random forests) and convolutional neural networks, recover ∼90−94% of GCs while producing an acceptable number of false detections (∼6−8%), with some falsely detected sources being identifiable as GCs that were not labelled as such in the catalogues used. In the magnitude range 22 < m4_g ≤ 24.5 mag, 98−99% of GCs are recovered. We find such high performance levels even when training on Virgo and evaluating on Fornax data (and vice versa), illustrating that the models are transferable to environments with different conditions, such as distances different from those in the training data.
Beyond performance metrics, we demonstrate how interpretable methods can be used to better understand model predictions, confirming that magnitudes, colours, and sizes are important properties for identifying GCs. Moreover, the colour distributions of our detected sources agree closely with the reference distributions from the input catalogues, and the mean colour is recovered even for systems with fewer than 20 GCs. These are encouraging results, indicating that similar methods trained on an informative sub-sample can be applied to create GC catalogues for a large number of galaxies, with tools available for increasing the transparency and reliability of these methods.
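
The two headline numbers in the abstract, the fraction of GCs recovered and the fraction of false detections, can be illustrated with a short calculation. The sketch below is not the authors' code. It assumes that "recovered" means the true positive rate, TP / (TP + FN), and that "false detections" means the share of predicted GCs that are not catalogue GCs, FP / (TP + FP); the abstract does not spell out either definition. The function name and the toy labels are hypothetical and chosen only for illustration.

```python
from typing import Sequence, Tuple


def recovery_and_false_detection(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float]:
    """Return (recovery rate, false-detection fraction) for binary labels.

    A label of 1 marks a GC, 0 marks any other source.
    Assumed definitions (not stated in the abstract):
      recovery rate            = TP / (TP + FN)
      false-detection fraction = FP / (TP + FP)
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # GCs found
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # GCs missed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-GCs flagged
    # Guard against empty denominators (no GCs, or no detections at all).
    recovery = tp / (tp + fn) if (tp + fn) else 0.0
    false_detection = fp / (tp + fp) if (tp + fp) else 0.0
    return recovery, false_detection


# Toy labels, invented for illustration only (not data from the paper).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
recovery, false_detection = recovery_and_false_detection(y_true, y_pred)
print(f"recovery = {recovery:.2f}, false detections = {false_detection:.2f}")
```

Under these assumed definitions the two quantities have different denominators: recovery is measured against the catalogue GCs, while the false-detection fraction is measured against the model's own detections. A source counted as a false detection may therefore still be a genuine GC that the reference catalogue did not label, as the abstract notes.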

Key words: galaxies: star clusters: general / methods: data analysis / galaxies: formation / galaxies: evolution

⋆ Both authors contributed equally to this work.

© ESO 2022
