A&A
Volume 687, July 2024
Article Number A24
Number of page(s) 25
Section Numerical methods and codes
DOI https://doi.org/10.1051/0004-6361/202348239
Published online 26 June 2024

© The Authors 2024

Licence: Creative Commons. Open Access article, published by EDP Sciences, under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


1 Introduction

Galaxy mergers play a crucial role in galaxy formation and evolution in the hierarchical paradigm of structure formation (White & Rees 1978; Fakhouri & Ma 2008; Conselice et al. 2014). For example, mergers are expected to assemble stellar mass in addition to what is produced from star formation alone (Rodriguez-Gomez et al. 2016; Mundy et al. 2017; Duncan et al. 2019; Martin et al. 2021), transform morphology (Dubois et al. 2016; Rodriguez-Gomez et al. 2017; Martin et al. 2018, 2021) and kinematics (Berg et al. 2014; Clauwens et al. 2018; Hani et al. 2018), and trigger starburst activity (Cortijo-Ferrero et al. 2017; Pearson et al. 2019a; Patton et al. 2020) and active galactic nucleus (AGN) activity (Di Matteo et al. 2012; Satyapal et al. 2014; Ellison et al. 2019). However, the relative importance of mergers compared to other physical processes such as smooth gas accretion is still much debated (Rodriguez-Gomez et al. 2016; Fitts et al. 2018; McAlpine et al. 2020; Jackson et al. 2022), and the details of how mergers are connected to specific phases along galaxy evolution histories (e.g. the triggering of the starburst and AGN phases) are not well understood (Martin et al. 2022). One of the main challenges in better understanding the role of mergers in galaxy evolution is detecting them with sufficient reliability and completeness in a large enough sample across a wide redshift range. According to numerical simulations, a typical major merger between galaxies of similar masses can take several Gyr to complete (Kitzbichler & White 2008; Lotz et al. 2010; Huško et al. 2022). A merger sequence includes both a pre-merger phase and a post-merger phase, to which several galaxy physical properties, such as the star formation rate (SFR), are sensitive. The pre-merger phase is typically defined as the period during which the two gravitationally interacting galaxies approach each other, fly apart, and come close together again; the post-merger phase is defined as the period during which the colliding galaxies coalesce and form a single, more massive galaxy (Jiang et al. 2014; Snyder et al. 2017; Moreno et al. 2019). The wide diversity in disturbed appearances and merging features (e.g. tidal tails, bridges, double nuclei) associated with different merger stages and merger types (e.g. with different mass ratios and gas content) along the relatively long merging sequence makes mergers difficult to characterise (Martin et al. 2022; Desmons et al. 2023).

Traditionally, there are two categories of merger detection methods. The first relies on pair selection, in which galaxies close on the sky and in redshift are identified as mergers (e.g. Woods & Geller 2007; Ellison et al. 2013; Bustamante et al. 2020). Ideally, spectroscopic redshifts are required to select genuine pairs. However, spectroscopic observations are very resource-intensive and time-consuming, and so photometric redshifts have been used in some cases (López-Sanjuan et al. 2015; Mundy et al. 2017; Duncan et al. 2019), with increased contamination due to projection effects. Galaxy pairs can also suffer from flyby contamination (i.e. galaxies that interact briefly but do not merge in the end), even with the availability of spectroscopy. In addition, this method, by design, only selects pre-mergers, when the interacting galaxies are still physically separated. The second set of methods involves identifying morphological disturbances in imaging, either by visual inspection or using non-parametric morphological statistics, such as the CAS parameters (concentration, asymmetry, and smoothness; Conselice et al. 2003) or the combination of the Gini coefficient and M20 (the second-order moment of the brightest 20% of the light; Lotz et al. 2004). These methods require good-quality images in terms of spatial resolution and depth to identify merging features. However, even though CAS, Gini, and M20 have been used to detect mergers (Conselice et al. 2003; Lotz et al. 2004; Conselice 2009; Mundy et al. 2017; Duncan et al. 2019), they have been shown to yield significantly incomplete samples, which can also contain disturbed galaxies that do not correspond to mergers (e.g. Pearson et al. 2019b; Snyder et al. 2019; Bickley et al. 2021; Wilkinson et al. 2022). One can also visually identify mergers, and the largest such dataset comes from Galaxy Zoo (Lintott et al. 2008; Darg et al. 2010). However, the main issues here are the difficulties in reproducibility and feasibility for large datasets. Visual classification can also suffer from low accuracy and incompleteness, particularly at high redshifts (Huertas-Company et al. 2015). On the other hand, the most visually conspicuous mergers are expected to be very reliable.

The use of machine learning (ML) in astronomy has exploded in recent decades (e.g. Dieleman et al. 2015; Huertas-Company et al. 2018; Walmsley et al. 2020; Margalef-Bentabol et al. 2020; Zanisi et al. 2021; Karsten et al. 2023; Huertas-Company & Lanusse 2023), and can be divided into traditional ML and deep learning (DL). Traditional ML algorithms have simpler structures (e.g. linear regression or decision trees) and typically rely on hand-crafted features for training (Domingos 2012; Goulding et al. 2018; Martin et al. 2020; Lazar et al. 2023). DL methods, based on artificial neural networks, are more sensitive to higher-order features. However, this comes at the cost of a much larger number of parameters, making them harder to train and requiring more data for optimisation (Pascanu et al. 2012; Schmidhuber 2015; Goodfellow et al. 2016; Tan & Le 2020). Convolutional neural networks (CNNs; Fukushima 1988; LeCun et al. 2015) are a particular type of DL architecture that is extremely well suited for image classification. Their key features are convolutional layers that can identify patterns on different scales and extract relevant features, and a head of fully connected layers that performs the classification task (O'Shea & Nash 2015; Albawi et al. 2017). In recent years, CNNs have shown great success in morphological galaxy classification, accurately reproducing visual labels (Dieleman et al. 2015; Huertas-Company et al. 2015; Domínguez Sánchez et al. 2018; Cheng et al. 2020; Walmsley et al. 2022a). They overcome problems such as reproducibility and applicability to large datasets. However, as they are designed to learn from visual labels in the training data, they can inherit biases from visual classification.

Deep learning has also been used for merger detection, with some studies successfully reproducing visual merger classifications (Ackermann et al. 2018; Walmsley et al. 2019; Pearson et al. 2019b, 2022). However, discerning mergers visually is much harder than discerning other, more regular morphological classes (e.g. spirals and ellipticals), leading to incomplete and unreliable merger samples. To mitigate these issues, hydrodynamical simulations can be used to train DL algorithms. The main advantage of using simulations is the knowledge of the ground truth on whether a galaxy is in the process of merging (within a specific pre-defined time-frame). Pearson et al. (2019b) trained a CNN on simulated galaxies from EAGLE (Schaye et al. 2015), processed to mimic Sloan Digital Sky Survey (SDSS) observations. Their classifier achieved an accuracy of 65.2% on the mock SDSS data and 64.6% when applied to the real SDSS observations (compared to visual classifications). Bottrell et al. (2019) trained a CNN on mock SDSS galaxies generated from binary merger simulations (Moreno et al. 2019) with the FIRE-2 physical model (Hopkins et al. 2018). The trained model was shown to achieve 87.1% classification accuracy, discerning between isolated galaxies, pre-mergers, and post-mergers. Ćiprijanović et al. (2020b) used a CNN to distinguish between mergers and non-mergers in simulated images from Illustris-1 (Vogelsberger et al. 2014a,b) at z = 2, reaching 79% accuracy, which is slightly reduced to 76% when noise mimicking real observations with the Hubble Space Telescope is added. Ferreira et al. (2020) trained a CNN on simulated galaxies at z = 0–3 from IllustrisTNG (Nelson et al. 2019), further processed to resemble observations from the CANDELS survey. Using Bayesian optimisation, their model achieves 90% accuracy when classifying mergers from the simulation, and can even distinguish between pre-mergers and post-mergers (with 87% and 78% accuracy, respectively). Finally, Bickley et al. (2021, 2022, 2023) trained a CNN for post-merger classification with images generated from IllustrisTNG and further processed to mimic the Canada-France Imaging Survey (Ibata et al. 2017) for galaxies up to z = 1, and achieved a classification accuracy of 88%. Even though many works on merger detection have focused on CNNs, traditional ML methods have also proven to be viable. Snyder et al. (2019) used non-parametric morphology statistics extracted from Illustris as features to train a random forest (RF) classifier. They achieved a completeness of 70% at 0.5 < z < 3, and a purity ranging from 10% at z = 0.5 to 60% at z = 3. Guzmán-Ortega et al. (2023) and Rose et al. (2023) obtained comparable performance at low and high z, respectively, with RF classifiers applied to IllustrisTNG. Nevin et al. (2019), on the other hand, used linear discriminant analysis in combination with non-parametric statistics to classify mergers in simulated images, achieving an accuracy of 85% and a precision of 97% for major mergers.

All of these studies show that combining ML with simulations is a promising approach for merger detection. However, these studies usually have very different set-ups. First, they may differ in the choice of simulation and galaxy formation physics; thus, the impact of merging on galaxies may differ between simulations. Secondly, these studies differ in which observational survey they try to mimic (if any) and in the chosen redshift range. Thirdly, they may differ in the definition of major mergers (with a mass or flux ratio of 1:4 or 1:3) or focus on different merger stages. Lastly, there are differences in the construction of the major merger and non-merger samples, which generally should be complementary (i.e. galaxies are either in one category or the other). However, this is often not the case, as minor mergers with low mass ratios may be excluded from the non-merger sample. All of these differences make it very difficult to fully understand the relative performance of these methods. Another important aspect that has not been sufficiently studied is how well methods trained on simulations perform on real observations; there are few studies that focus on application to real surveys (Pearson et al. 2019a; Wang et al. 2020). Some attempts have been made to compare ML predictions with visual classifications, even though the latter do not necessarily represent the truth. Ideally, the source domain (the data used to train the model) and the target domain (the data to which the model will be applied) should be as similar as possible: the more the two domains differ, the less reliable the predictions will be in the target domain (Bottrell et al. 2019; Ćiprijanović et al. 2020a). One way to mitigate this is to generate simulated data that are as similar as possible to the observations. However, it is not always possible to fully recreate observational effects, and simulated galaxies may be intrinsically different from real galaxies. Therefore, models trained on simulations are expected to perform worse on observations (Domínguez Sánchez et al. 2023).

In this paper, we have several aims: (i) to apply leading ML-based methods to the same datasets, quantitatively comparing their performance on major merger identification; (ii) to assess whether the same performance is maintained when we apply classifiers trained on one simulation to another; (iii) to compare classifiers trained on simulations with visual labels for real observations. To achieve these goals, we utilised three different datasets. The first comes from the IllustrisTNG simulations, the second from the Horizon-AGN simulations, and the last from the Hyper Suprime-Cam Subaru Strategic Program (HSC-SSP) survey (Aihara et al. 2018).

The paper is structured as follows. In Sect. 2, we briefly describe the different datasets used in this work, including the two cosmological simulations of galaxy formation and evolution and real observations from the HSC-SSP. In Sect. 3, we define the merger challenge and goals, explain how training and test datasets are created (including the whole process of generating mock images from simulations), and describe the metrics used for evaluation. In Sect. 4, we outline the six merger identification methods explored in this work. In Sect. 5, we first compare the performance of the methods for the binary and multi-class merger classification tasks on the training data (TNG), and then we explore how the trained methods perform on a second set of simulations (Horizon-AGN). Finally, we investigate how they perform on real data and compare with visual labels. In Sect. 6, we present our conclusions and future directions.

2 Data

To study how different methods compare, we made use of two cosmological simulations of galaxy formation and evolution (IllustrisTNG and Horizon-AGN) and real observations from the HSC-SSP survey. The IllustrisTNG data were used to train all merger identification algorithms. The Horizon-AGN data were used for testing and quantifying how these methods perform when applied to a different dataset constructed in a similar way to the training data. This approach allows us to better understand how methods trained on simulations will behave when applied to observations. Below, we explain the main characteristics of the simulated and observational datasets used in this study.

2.1 IllustrisTNG

The IllustrisTNG project (Nelson et al. 2018, 2019; Pillepich et al. 2018a; Springel et al. 2018; Naiman et al. 2018; Marinacci et al. 2018) is a series of cosmological magnetohydrodynamical simulations of galaxy formation and evolution that includes three runs spanning a range of volume and resolution, TNG50, TNG100, and TNG300, with comoving box sizes of 50, 100, and 300 Mpc h⁻¹, respectively. For this work, we used TNG300, due to the large number of galaxies it comprises, and TNG100, to extend the sample to lower-mass galaxies. The initial conditions for both runs are drawn from Planck results (Planck Collaboration XIII. 2016). Both runs follow dark matter (DM) particles, gas cells, and stellar and supermassive black hole (SMBH) particles. TNG100 contains 1820³ DM particles with a mass resolution of M_DM,res = 7.5 × 10⁶ M⊙, while TNG300 contains 2500³ DM particles with M_DM,res = 6 × 10⁷ M⊙. The baryonic mass resolution is M_baryon,res = 1.4 × 10⁶ M⊙ and M_baryon,res = 1.1 × 10⁷ M⊙ for TNG100 and TNG300, respectively. Metal-enriched gas cools radiatively in the presence of a redshift-dependent, spatially uniform UV background. The cooling of gas is also affected by radiation from nearby SMBHs. Gas above a density threshold of 0.1 H cm⁻³ forms stars following the Kennicutt–Schmidt relation (Kennicutt 1998) and a Chabrier (2003) initial mass function (IMF). Stellar populations evolve via Type Ia supernovae, Type II supernovae, and asymptotic giant branch stars, returning mass and metals to the interstellar medium. Accreting SMBHs release energy via the 'quasar mode' at high accretion rates, with thermal feedback heating the gas surrounding the SMBH, and via the 'kinetic wind mode' at low accretion rates, producing SMBH-driven winds. We refer to Pillepich et al. (2018b) for more details about IllustrisTNG. In this work, we only selected galaxies from simulation snapshots 50–91, which correspond to redshifts from z = 1 down to z = 0.1. The time step between snapshots is roughly 160 Myr over this redshift interval.

In the TNG simulations, DM halos are extracted using the friends-of-friends (FoF) approach (Davis et al. 1985), and only structures with more than 32 DM particles are considered DM halos. Substructures within the FoF groups (galaxies) are further extracted using a modified version of the SUBFIND algorithm (Springel et al. 2001; Dolag et al. 2009), which calculates the density field for all particles and cells; only substructures with at least 20 resolution elements (stars and gas) are considered galaxies. Finally, the SUBLINK algorithm is used to construct the merger trees of galaxies, following the star particles and star-forming gas elements. Merger trees are therefore constructed from baryon-based structures. This approach yields more accurate results than using DM-based structures when studying mergers, since the baryonic structures follow the visible components, resulting in a definition closer to the observational one. Furthermore, the merger time and mass ratio are calculated from the stellar masses of the two merging galaxies at the time when the secondary reached its maximum stellar mass (Rodriguez-Gomez et al. 2015).

For TNG100, we selected galaxies with stellar mass M* > 10⁹ M⊙, and for TNG300 with M* > 8 × 10⁹ M⊙, to ensure that most galaxies have a sufficient number of stellar particles (and hence are reasonably well resolved): the lowest-mass galaxies in TNG100 (M* = 10⁹ M⊙) consist of ~714 particles, and in TNG300 (M* = 8 × 10⁹ M⊙) of ~727 particles.

2.2 Horizon-AGN

Horizon-AGN is a cosmological hydrodynamical simulation of galaxy formation and evolution (Dubois et al. 2014) with a comoving box size of 100 Mpc h⁻¹. The initial conditions are drawn from the WMAP-7 cosmology (Komatsu et al. 2011). The total volume contains 1024³ DM particles with a mass resolution of M_DM,res = 8 × 10⁷ M⊙ (similar to TNG300 but an order of magnitude lower than TNG100). The baryonic mass resolution is M_baryon,res = 2 × 10⁶ M⊙ (similar to TNG100 and better than TNG300). This simulation uses the adaptive mesh refinement code RAMSES (Teyssier 2002), with a uniform grid that is refined down to a minimum cell size of 1 kpc, constant in physical length. Gas cooling proceeds in the presence of a uniform UV background (Haardt & Madau 1996) via H, He, and metal-enriched gas down to 10⁴ K (Sutherland & Dopita 1993). At densities above 0.1 H cm⁻³, star formation proceeds with a fixed 2% efficiency (Kennicutt 1998). Chemical enrichment and kinetic energy injection into the gas are modelled via continuous stellar feedback from Type II SNe, Type Ia SNe, and stellar winds (Leitherer et al. 1999, 2010; Girardi et al. 2000; Nomoto et al. 2007). Additionally, SMBHs impart feedback on the gas via the 'quasar mode' (at Eddington ratios χ > 0.01), where thermal energy is injected isotropically into the surrounding gas with 1.5% efficiency, and the 'radio mode' (at χ < 0.01), where kinetic energy is injected via bipolar outflows with jet velocities of 10⁴ km s⁻¹. For more details on the physical processes in Horizon-AGN, we refer to Dubois et al. (2014).

In the Horizon-AGN simulation, DM halos are identified using the AdaptaHOP halo finder (Aubert et al. 2004). Only structures with more than 100 particles and with a density larger than 80 times the total matter density are considered DM halos. The AdaptaHOP finder is also applied to the stellar distribution to identify galaxies with more than 50 particles. Merger trees of galaxies are built using the TREEMAKER algorithm (Tweed et al. 2009). Merger times and merger mass ratios are calculated when the minor companion is at its maximum mass (i.e. before it starts to lose any mass to the main galaxy), following the same approach described in Rodriguez-Gomez et al. (2015). Similarly to TNG, the merger trees are derived using baryon-based structures (in this case, stars) rather than DM-based ones.

We selected galaxies with M* > 10⁹ M⊙ (which ensures reasonably resolved galaxies, with a minimum of ~500 particles for galaxies with M* = 10⁹ M⊙) and with redshifts between 0.1 and 1, to match the galaxies selected from TNG.

Even though the two simulations use different methods for galaxy identification and merger-tree construction, we do not expect these differences to be significant, as both track the baryonic components of galaxies and use the same definition of the merger mass ratio. Moreover, Srisawat et al. (2013) compared different simulations and found that the choice of merger tree algorithm does not have a significant impact on the resulting merger trees. In particular, SUBLINK and TREEMAKER (the codes used for TNG and Horizon-AGN, respectively) produce comparable merger trees. The choice of sub-grid physics in each simulation could result in somewhat different galaxy populations (we explore the differences between the two simulations in Appendix A), since TNG is designed to reproduce some observed trends, such as the galaxy mass–size relation, the galaxy stellar mass function, and the SFR density (Pillepich et al. 2018b), whereas in Horizon-AGN only the AGN feedback is tuned to reproduce the M–σ relation (Dubois et al. 2014). Despite these differences, both simulations produce optical morphologies that are in relatively good agreement with observations (e.g. Rodriguez-Gomez et al. 2019; Dubois et al. 2016).

2.3 HSC Subaru Strategic Program

The HSC-SSP survey is a wide-field optical imaging survey covering ~1200 deg², conducted with the Hyper Suprime-Cam (HSC; Miyazaki et al. 2018) imaging camera on the Subaru telescope, with a pixel scale of 0.168 arcsec/pixel (Aihara et al. 2018). We chose the GAMA-09 field of 60 deg², which spans 129° ≤ RA ≤ 141° and −2° ≤ Dec ≤ 3° (Liske et al. 2015). We focused our study on the i-band, given its depth of ~26 mag at 5σ for point sources and seeing of 0.61″ (Aihara et al. 2018). Stellar masses and photometric redshifts are derived from the KiDS-VIKING (Kuijken et al. 2019; Edge et al. 2013) photometry (see Wright et al. 201 for details). In summary, photometric redshifts (or spectroscopic redshifts when available) are derived from the KiDS-VIKING 9-band photometry using the Bayesian Photometric Redshift code (BPZ; Benítez 2000). The normalised median absolute deviation of (z_phot − z_spec)/(1 + z_spec) achieved is 0.061. We then constructed our HSC sample by randomly selecting ~120 000 galaxies in GAMA-09, matching the redshift (0.1 < z < 1) and stellar mass (M* > 10⁹ M⊙) ranges of the simulated samples. Stellar masses are estimated using the template-fitting code LE PHARE (Ilbert et al. 2006), which models the photometry with a library of Bruzual & Charlot (2003) stellar population models, a Chabrier (2003) IMF, the Calzetti et al. (1994) dust extinction law, and exponentially declining star formation histories. We downloaded the i-band coadd images of our sample from the HSC-SSP Data Release 3 (Aihara et al. 2022) using the DAS cutout facility.

We included a further ~2000 galaxies with visual classification labels from Goulding et al. (2018). These galaxies come from an initial random sample of 5900 star-forming galaxies (according to their position in the UVJ diagram) and have been visually classified using K-corrected 3-colour images of size 50 × 50 kpc. They fall into one of two classes:

  • Visual major mergers: strongly interacting massive galaxy pairs or post-mergers, including galaxies that have double nuclei. In the case of clear evidence of interaction with a distinct companion galaxy, only systems in which the flux ratio between the two interacting galaxies is >1:4 are included.

  • Disturbed–minor-mergers: galaxies that do not have clear signs of a major merger but show irregular, disturbed, asymmetrical, or torqued morphologies, along with galaxies considered to be minor mergers (galaxies with a companion but with a flux ratio between the systems <1:4).

Additionally, we created a visually identified sample of clear non-mergers, that is, galaxies with no signs of merging or any type of disturbed morphology. From our HSC sample, we visually inspected randomly selected galaxies and kept those deemed clear non-mergers, building a sample of ~1000 galaxies to match the number of galaxies in the previous two groups (visual major mergers and disturbed–minor mergers). Examples of the three visually classified groups are shown in Fig. 1.

Fig. 1

Example real HSC galaxies in the three visually classified groups: major mergers (two leftmost columns), disturbed or minor mergers (two middle columns), and non-mergers (two rightmost columns). The first two groups are from Goulding et al. (2018), and the last group is from this work. Images have a physical size of ~160 kpc, displayed with an arcsinh inverted grey scale.

3 The merger challenge

3.1 Goals

One of the main purposes of building a merger classifier is to apply it to real observations to measure merger fractions and merger rates. However, it is impossible to quantify the performance of a method trained on simulations on real observations, at least not in the same way such performance is assessed in the simulations, because we do not have access to the ground-truth merger history of real galaxies. Instead, the classification labels that we rely on in observations can be subjective, highly uncertain, and/or biased towards the most visually conspicuous mergers. Therefore, in this work, we used a second simulation for the verification of merger classification methods. This gives us a better understanding of what could happen when applying classifiers trained on simulations to real observations. We used TNG, and more specifically the combination of TNG300 and TNG100, as our main training sample for two reasons. Firstly, TNG300 has the largest volume among the TNG and Horizon-AGN simulations, which results in the largest sample (together with TNG100, which adds lower-mass galaxies to our training sample). Secondly, we want the secondary simulation, used only for testing, to have a similar or better resolution than the training data. Horizon-AGN has a baryonic particle resolution better than TNG300 and similar to TNG100 (even though its DM particle resolution is an order of magnitude lower than TNG100). In this work, we explore and compare in detail how six leading ML classification methods perform on the three datasets (TNG, Horizon-AGN, and HSC) for two different tasks:

  1. Binary classification between major mergers and non-mergers. Definitions of major merger and non-merger can be found in Sect. 3.3.

  2. Multi-class classification into four classes (non-merger, pre-merger, ongoing-merger, post-merger). Definitions of these four classes can be found in Sect. 3.3.

3.2 Mock images

We explain here how synthetic images of simulated galaxies from both simulations were processed to generate mock images as if they were observed by HSC. For each mock galaxy image, all stellar particles around the main galaxy (within the extent of the mock image) were used, to ensure that secondary galaxies that will eventually merge with the main one are visible in the image. Each stellar particle from the simulations contributes its own spectral energy distribution, derived from the Bruzual & Charlot (2003) stellar population synthesis models depending on its mass, age, and metallicity. The summed contribution of all stars is passed through the desired filter to create a smoothed 2D projected map (Rodriguez-Gomez et al. 2019; Martin et al. 2022). These maps do not include a full radiative transfer treatment and, therefore, do not account for dust. For this work, the simulated images were produced in the i-band, with the HSC pixel scale, and had a physical size of 160 × 160 kpc. We chose this size as it is the maximum separation between merging galaxies based on binary merger simulations (Qu et al. 2017; Moreno et al. 2019). Each image was then convolved with the i-band PSF, retrieved from the HSC-SSP database. The third step was to add Poisson noise. Lastly, each image was injected into a cutout of real HSC observations.

For the final step, we needed cutouts of the real HSC sky without bright sources in the centre, where the synthetic images are injected. To ensure this, we constructed a catalogue of low-z and bright sources to be avoided, using the following criteria: 129° ≤ RA ≤ 180°, −2° ≤ Dec ≤ 2°, z ≤ 1, g_cModel ≤ 26.0, r_cModel ≤ 25.6, i_cModel ≤ 25.4, z_cModel ≤ 24.2, y_cModel ≤ 23.4. This way, cutouts are still allowed to contain possible faint sources and higher-redshift (z > 1) background galaxies, as would be the case for real observations. We generated sky cutouts centred on random sky coordinates, keeping only those without any catalogued bright or low-z source within 21″ (based on the surface density of the sources to be avoided). After that, we discarded cutouts that, according to the mask flags (Bosch et al. 2018), contain bad pixels, saturated pixels, unmasked NaN values, or possible missed bright objects. The whole process is illustrated in Fig. 2, from the raw simulated image to the PSF-convolved image, to the addition of Poisson noise, and finally to the injection into the real sky from the HSC survey. As stated before, this procedure has the limitation of not including a full radiative transfer treatment and not including the effects of dust. However, given the wavelength used in this study (i-band), we are probing the rest-frame optical, except for galaxies at redshift >0.9; only for galaxies above this redshift are we probing the rest-frame ultraviolet, where the effect of dust is of greater importance. Furthermore, Bottrell et al. (2019) show that radiative transfer effects for gas-rich, star-forming galaxies (which will be most affected by dust obscuration) do not have a significant impact on the performance of ML models trained to classify merger galaxies, as the models appear to focus on broad morphological features (such as tidal features) rather than variations due to dust obscuration. Dust effects, therefore, produce only a slight improvement in performance, while realistic instrumental effects, such as the PSF resolution and realistic noise and crowding of nearby sources, are much more important.
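The last three steps of this pipeline are simple image operations. The following is a minimal Python sketch under stated assumptions: the inputs (a noiseless simulated image, the HSC i-band PSF, and an empty sky cutout) are toy arrays here, and the function name and gain value are illustrative, not taken from the actual pipeline.

```python
import numpy as np
from astropy.convolution import Gaussian2DKernel, convolve_fft

def make_mock_hsc_image(raw_image, psf, sky_cutout, gain=3.0, seed=None):
    """Toy version of the final steps: PSF convolution, Poisson noise,
    and injection into a real HSC sky cutout (all 2D arrays)."""
    rng = np.random.default_rng(seed)
    # Step 2: convolve the noiseless simulated image with the PSF.
    blurred = convolve_fft(raw_image, psf, normalize_kernel=True)
    # Step 3: Poisson (shot) noise, via an assumed counts-to-electrons gain.
    noisy = rng.poisson(np.clip(blurred * gain, 0, None)) / gain
    # Step 4: inject into the sky cutout, which already carries real
    # background noise and faint (z > 1) sources.
    return noisy + sky_cutout

# Toy usage: Gaussian 'galaxy', Gaussian PSF, flat noisy 'sky'.
rng = np.random.default_rng(1)
y, x = np.mgrid[0:128, 0:128]
galaxy = 50 * np.exp(-((x - 64)**2 + (y - 64)**2) / 100)
mock = make_mock_hsc_image(galaxy, Gaussian2DKernel(2).array,
                           rng.normal(0, 0.5, (128, 128)))
```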

We split the training set into four redshift bins: z1 → [0.1, 0.31), z2 → [0.31, 0.52), z3 → [0.52, 0.76), and z4 → [0.76, 1.0). The bins were chosen so that each has a similar redshift span and a similar number of galaxies. At different redshifts, a physical size of 160 kpc corresponds to a different image size in pixels. For each redshift bin, we used the image size, in pixels, that corresponds to 160 kpc at the midpoint of the bin's redshift range, resulting in image sizes of 320, 192, 160, and 128 pixels. The final adopted sizes for each method may vary, as some use smaller sizes than the ones provided. Figure 3 shows examples of mock HSC images of simulated galaxies from TNG and Horizon-AGN in the four redshift bins. We can clearly see how much more difficult it is to discern features at higher redshifts, at least visually.
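The conversion from a fixed proper size to pixels can be reproduced with a few lines of astropy; a sketch follows. The choice of cosmology (Planck15 here) is an assumption, and the quoted sizes above also reflect rounding conventions, so the computed values land near, but not exactly on, 320, 192, 160, and 128 pixels.

```python
import astropy.units as u
from astropy.cosmology import Planck15

PIXEL_SCALE = 0.168 * u.arcsec  # HSC pixel scale

def image_size_pixels(z_mid, physical_size=160 * u.kpc):
    """Pixels subtended by a fixed proper size at redshift z_mid."""
    # Proper transverse scale at this redshift, in kpc per arcsec.
    scale = Planck15.kpc_proper_per_arcmin(z_mid).to(u.kpc / u.arcsec)
    return float(physical_size / scale / PIXEL_SCALE)

# Midpoints of the four redshift bins defined above.
for z_mid in (0.205, 0.415, 0.64, 0.88):
    print(f"z = {z_mid}: {image_size_pixels(z_mid):.0f} pix")
```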

Fig. 2

Steps used to create mock images for four randomly selected galaxies from TNG. From left to right are shown the raw simulated images, convolution with the HSC PSF, addition of Poisson noise, and injection into real sky background from HSC. Images have a physical size of 160 kpc, displayed with an arcsinh inverted grey scale.

3.3 Training and testing datasets

For IllustrisTNG, a complete merger history is available through the merger trees of each galaxy (Rodriguez-Gomez et al. 2015), and similarly for Horizon-AGN. We used these trees to construct our samples from both simulations. For the first task (binary classification), we needed a sample of mergers and a sample of non-mergers. For the multi-class task, we needed to subdivide the merger class into three subclasses (pre-mergers, ongoing-mergers, and post-mergers). Following the merger trees, we constructed a sample of mergers, selected to be galaxies that had a merger event in the last 0.3 Gyr or will have one in the following 0.8 Gyr. Only major mergers with stellar mass ratios >1:4 were included. Furthermore, mergers that are −0.8 to −0.1 Gyr, −0.1 to 0.1 Gyr, and 0.1 to 0.3 Gyr away from coalescence (dt = 0 Gyr) were classified as pre-mergers, ongoing-mergers, and post-mergers, respectively. Figure 4 shows examples of simulated galaxies from TNG at different merger stages. A control sample of non-mergers for each simulation consists of galaxies that do not satisfy the criteria above. These galaxies are much more numerous, so we selected a random subsample to roughly match the merger sample size, with the same mass and z criteria. The stellar mass and redshift distributions of the TNG and Horizon-AGN datasets are shown in Fig. 5.
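This labelling rule reduces to a few comparisons on the time to coalescence. A minimal sketch, assuming dt is read from the merger trees in Gyr (negative before coalescence) and that the major-merger mass-ratio cut has already been applied; the exact treatment of the window boundaries is an assumption here.

```python
def merger_stage(dt_gyr):
    """Assign a merger-stage label from the time to coalescence
    (dt = 0 Gyr at coalescence, negative beforehand)."""
    if -0.8 <= dt_gyr < -0.1:
        return "pre-merger"
    if -0.1 <= dt_gyr <= 0.1:
        return "ongoing-merger"
    if 0.1 < dt_gyr <= 0.3:
        return "post-merger"
    return None  # outside the merger window: candidate non-merger
```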

The IllustrisTNG dataset is split into a training sample (TNG-training) and a testing sample (TNG-test). The former comprises 90% of the whole sample and was used for training the different ML methods. The latter comprises the remaining 10% and was used to measure the performance of the methods on images that have not been seen by the algorithms during training. Galaxies belonging to the same merger tree end up in only one of these splits, which ensures that the test sample cannot be learned by interpolation from the training sample (Eisert et al. 2023). Along with TNG-test, Horizon-AGN was also used to quantify the performance of the methods. The Horizon-AGN samples have labels (i.e. merger–non-merger for the binary task and non-merger–pre-merger–ongoing-merger–post-merger for the multi-class task) and were only used to evaluate the performance metrics (so not used in training). As seen in Fig. 5, the Horizon-AGN stellar mass distribution does not follow that of TNG100 (or TNG300, where the lower mass limit is 8 × 10⁹ M⊙). The differences may arise from the different design choices of the simulations. The difference in the stellar mass distributions is not necessarily a problem for training ML models, as long as the training set has enough galaxies covering the whole stellar mass range of the test set. However, the lower number of galaxies at the low-mass end, compared to the total number of galaxies, may have an impact on the performance of the models for low-mass galaxies, as discussed in Sects. 5.1.1 and 5.2.1.
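A tree-aware split of this kind can be implemented with a grouped splitter, so that all snapshots from one merger tree fall on the same side. A sketch using scikit-learn, where the images, labels, and tree identifiers are toy placeholders:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Placeholder data: one feature row, label, and merger-tree ID per image.
rng = np.random.default_rng(0)
images = rng.random((1000, 64))       # e.g. flattened images or features
labels = rng.integers(0, 2, 1000)     # 1 = merger, 0 = non-merger
tree_id = rng.integers(0, 200, 1000)  # merger-tree identifier

# 90/10 split that keeps every merger tree in a single partition.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.10, random_state=0)
train_idx, test_idx = next(splitter.split(images, labels, groups=tree_id))
X_train, y_train = images[train_idx], labels[train_idx]
X_test, y_test = images[test_idx], labels[test_idx]
```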

The HSC sample, constructed to have similar stellar mass and z ranges to the simulations, consists of randomly selected galaxies in GAMA-09. The HSC images (with a physical size of 160 kpc) were split into the four redshift bins and resized according to the image size (in pixels) of each bin to match the simulated image sizes. The HSC sample has no true classification labels, as explained before, so we included the 2111 galaxies visually classified by Goulding et al. (2018; see Sect. 2.3). Of these, 1243 galaxies are classified as clear major mergers, while the other 868 are classified as disturbed–minor mergers. Labels for this visual sample were not shared with the participants. In Table 1 we summarise the three samples from IllustrisTNG, Horizon-AGN, and HSC. There are more than twice as many objects in TNG as in Horizon-AGN. In both simulations, the merger and non-merger sample sizes are similar to each other by construction. Among the different merger stages, the sample sizes of ongoing- and post-mergers are much smaller (due to the shorter timescales) than that of pre-mergers, which could have an impact on the performance of the classifiers in detecting these later merger stages.

Fig. 3

Example mock HSC images of simulated galaxies from TNG and Horizon-AGN. Images have sizes in pixels of 320 (at 0.1 < z < 0.31), 192 (at 0.31 < z < 0.52), 160 (at 0.52 < z < 0.76), and 128 (at 0.76 < z < 1), corresponding to ~160 kpc at a given redshift, displayed using an arcsinh inverted grey scale.

Fig. 4

Example mergers from TNG at different merger stages (obtained from the corresponding merger trees in the simulation): pre-mergers (−0.8 < dt < −0.1 Gyr), ongoing-mergers (−0.1 < dt < 0.1 Gyr), and post-mergers (0.1 < dt < 0.3 Gyr). Each row shows a galaxy along its merger sequence. Images have an approximate physical size of 160 kpc, displayed using an arcsinh inverted grey scale.

Fig. 5

Stellar mass (left) and redshift (right) distributions of the different simulated datasets (red: TNG300; yellow: TNG100; blue: Horizon-AGN). By combining TNG100 and TNG300, a broader range of stellar masses is covered. The redshift distributions of the simulations are not designed to follow the observations.

3.4 Evaluation metrics

To evaluate the performance of different methods we used several popular metrics for classification problems, including accuracy, precision, recall, F1-score, receiver operating characteristic (ROC) curve, and area under the ROC curve (AUC).

Accuracy is the fraction of correctly classified examples. However, it does not distinguish between the fractions of correctly identified examples from each class. In some cases, correctly identifying a particular class is more important, so accuracy may not be the most appropriate metric. Precision is the fraction of examples predicted to belong to a class that truly belong to it (the reliability for that class), and recall is the fraction of examples from a class that are correctly identified (the completeness). Unfortunately, when the classes are difficult to separate (e.g. mergers and non-mergers), it is normally not possible to have both high precision and high recall. This is called the precision–recall trade-off: increasing precision will decrease recall, and vice versa. Precision and recall can be calculated for each class as follows,

$\mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}$, (1)

and

$\mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}$. (2)

For mergers, TP (true positives) is the number of mergers correctly classified as mergers. FP (false positives) is the number of non-mergers incorrectly classified as mergers. FN (false negatives) is the number of mergers classified as non-mergers. The F1 score is the harmonic mean of the precision and the recall,

$F_1 = \frac{2}{\mathrm{Recall}^{-1}+\mathrm{Precision}^{-1}}$. (3)

The ROC curve shows the performance of a classifier by plotting the true positive rate (TPR; a synonym for recall) versus the false positive rate (FPR) at different classification thresholds, where

$\mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}$, (4)

and

$\mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP}+\mathrm{TN}}$. (5)

A perfect classifier would yield a point in the upper left corner, with coordinates (0, 1), while a random classifier would produce a diagonal line. Lowering the classification threshold classifies more items as positive, thus increasing both FP and TP. Lastly, the AUC corresponds to the area underneath the ROC curve, and measures how well predictions are ranked rather than their absolute values. The AUC quantifies the model's performance irrespective of the classification threshold, ranging from 0 (all predictions are incorrect) to 1 (all predictions are correct).
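All of these metrics are available in scikit-learn. A self-contained sketch with toy labels and scores (stand-ins for a test set and classifier outputs), illustrating the threshold-dependent metrics of Eqs. (1)–(3) and the threshold-free ROC/AUC of Eqs. (4) and (5):

```python
import numpy as np
from sklearn.metrics import (f1_score, precision_score, recall_score,
                             roc_auc_score, roc_curve)

# Toy data: 1 = merger, 0 = non-merger, with imperfect scores.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
y_prob = np.clip(rng.normal(0.35 + 0.3 * y_true, 0.2), 0, 1)

y_pred = (y_prob >= 0.5).astype(int)          # classification threshold
precision = precision_score(y_true, y_pred)   # Eq. (1)
recall = recall_score(y_true, y_pred)         # Eq. (2)
f1 = f1_score(y_true, y_pred)                 # Eq. (3)

fpr, tpr, thresholds = roc_curve(y_true, y_prob)  # Eqs. (4) and (5)
auc = roc_auc_score(y_true, y_prob)           # threshold-independent
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f} AUC={auc:.2f}")
```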

Table 1

Total number of galaxies in the training sample (TNG-train) and testing sample (TNG-test, Horizon-AGN, and HSC).

4 Machine learning-based merger detection methods

We explored six different methods from two categories: traditional feature-based ML and image-based DL methods. The first method employs a RF algorithm, while the rest use CNNs. In this section, we briefly describe the main characteristics of each method, such as the structure; the number of images used for training, validation, and testing; image pre-processing (if any); whether different redshift bins are analysed separately or combined; whether the multi-class classification is provided in addition to the binary classification; and how thresholds for the different classes are chosen. In Table B.1 we summarise the methods explored in this study, highlighting the main differences and similarities (for more details on each method, see the listed references).

4.1 Method-1 (RF)

This method performs the binary and multi-class classification tasks using the RF algorithm (Ho 1995), an ensemble learning method that fits multiple decision tree classifiers to various subsamples of the dataset. The final classification for a particular example is obtained by averaging the classifications from all the individual trees. For data pre-processing, we first performed source deblending on each image in all datasets to separate overlapping but distinct sources, in order to isolate the galaxy of interest and remove unwanted or contaminating sources from the calculation of the morphological diagnostics. This procedure is described in Guzmán-Ortega et al. (2023). Secondly, we ran statmorph (Rodriguez-Gomez et al. 2019), a code for calculating non-parametric morphological diagnostics of galaxy images (such as the Gini coefficient, M20, asymmetry, and concentration), as well as fitting 2D Sérsic profiles. The resulting statistics from applying statmorph to each galaxy image were subsequently employed as model features for the classifier. In particular, we used the measurements of 32 parameters (including Gini, concentration, and asymmetry; for a complete list, see Rodriguez-Gomez et al. 2019) as features, ensuring that only reliable measurements are used.
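Feature extraction with statmorph follows a standard pattern; a hedged sketch is shown below, where a toy Gaussian blob stands in for a deblended galaxy cutout (the actual deblending procedure of Guzmán-Ortega et al. 2023 is not reproduced here) and the detection threshold and gain are illustrative.

```python
import numpy as np
import statmorph
from photutils.segmentation import detect_sources

# Toy image: a single Gaussian blob plus noise, standing in for a
# galaxy cutout after the deblending step described above.
rng = np.random.default_rng(0)
y, x = np.mgrid[0:100, 0:100]
image = (100 * np.exp(-((x - 50)**2 + (y - 50)**2) / 50)
         + rng.normal(0, 1, (100, 100)))

# Segmentation map isolating the source, then morphology measurement.
segmap = detect_sources(image, threshold=5.0, npixels=5)
morphs = statmorph.source_morphology(image, segmap, gain=1.0)
m = morphs[0]
if m.flag == 0:  # keep only reliable measurements, as in the text
    features = [m.gini, m.m20, m.concentration, m.asymmetry, m.smoothness]
```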

For the binary task, we used the features extracted from TNG-training. Using the scikit-learn library (Pedregosa et al. 2011), we performed a 5-fold cross-validation strategy together with 200 iterations of a randomised search to tune the RF hyper-parameters n_estimators (the number of trees in the forest), max_depth (the maximum depth of a tree), min_samples_split (the minimum number of samples required to split an internal node), and min_samples_leaf (the minimum number of samples required at a leaf node). We then took the set of parameters that maximises the accuracy score over all combinations in the random search and used it to refit the RF classifier on the corresponding training set for each redshift interval. The resulting model was applied to the corresponding testing sets to obtain merger predictions. The multi-class task was carried out in a similar fashion, except that we employed the balanced RF algorithm from the imblearn library (Lemaître et al. 2017), which uses random undersampling to deal with class imbalance.
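This tuning loop maps directly onto scikit-learn's RandomizedSearchCV. A minimal sketch follows; the search ranges and the toy features/labels are illustrative assumptions, not the values used for this method.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Toy stand-ins for the statmorph features and labels of one z bin.
rng = np.random.default_rng(0)
X_train = rng.random((500, 32))    # 32 morphological features
y_train = rng.integers(0, 2, 500)  # 1 = merger, 0 = non-merger

# Illustrative search space over the four hyper-parameters named above.
param_dist = {
    "n_estimators": randint(100, 1000),
    "max_depth": randint(3, 30),
    "min_samples_split": randint(2, 20),
    "min_samples_leaf": randint(1, 10),
}
search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_dist, n_iter=200, cv=5,
                            scoring="accuracy", random_state=0)
search.fit(X_train, y_train)       # refits the best model on the data
best_rf = search.best_estimator_
```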

4.2 Method-2 (Swin)

This is a DL method that uses a Swin Transformer architecture (Liu et al. 2021) pre-trained on ImageNet-1K data, with an additional fully connected layer of 256 neurons and an output layer of four neurons, one for each of the four classes in the multi-class task (pre-merger, ongoing-merger, post-merger, and non-merger). In terms of data pre-processing, the mock images were first cropped to 112 × 112 pixels and linearly scaled between 0 and 1. Each image was then stacked with itself to form a 3-channel image (to be compatible with the expected input of the architecture) and resized to 224 × 224 pixels using nearest-neighbour interpolation.

For the binary classification task, we froze the parameters of the Swin Transformer during training, so only the fully connected layer and the output layer were trained on the TNG-training data. Data augmentation was performed during training, with each image randomly rotated by multiples of 90°, randomly flipped horizontally, and randomly flipped vertically. TNG-training was split into training and validation datasets (80–20 split), ensuring that members of the same merger tree are only found in the training or the validation set, not both. The output of the network is a 4-element vector, where each element represents the probability assigned to that class, such that the sum of all four elements is 1. For the binary task, an image was classified as a merger if the probability of any of the merger classes (pre-merger, ongoing-merger, or post-merger) is higher than the probability of being a non-merger. For the multi-class task, the same trained network as for the binary task was used, and an image is assigned the class with the highest probability.
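The two decision rules described above can be written compactly; the following sketch assumes a length-4 softmax output ordered as (non-merger, pre-merger, ongoing-merger, post-merger).

```python
import numpy as np

CLASSES = ["non-merger", "pre-merger", "ongoing-merger", "post-merger"]

def classify(probs):
    """probs: length-4 softmax output, ordered as CLASSES, summing to 1."""
    # Binary task: merger if any merger class beats the non-merger class.
    is_merger = probs[1:].max() > probs[0]
    # Multi-class task: the single most probable class.
    stage = CLASSES[int(np.argmax(probs))]
    return is_merger, stage

print(classify(np.array([0.40, 0.35, 0.15, 0.10])))  # (False, 'non-merger')
```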

4.3 Method-3 (Zoobot)

Zoobot (Walmsley et al. 2023) is a Python package for measuring the detailed appearance of galaxies using DL. Zoobot includes CNN and vision transformer models pre-trained on the responses of Galaxy Zoo volunteers to (real) images (Willett et al. 2013, 2017; Simmons et al. 2017; Walmsley et al. 2022a). These models are designed to be adapted to new tasks and surveys using minimal new labels. Here, a pre-trained Zoobot CNN model is adapted to perform the merger challenge tasks. For classifying simulated images, we added a custom head and loss designed to jointly predict the answers to both tasks. Specifically, our head is a single dense layer with five outputs, as follows.

  • Two outputs classify whether a galaxy is a merger or non-merger.

  • Three outputs classify the subclasses (pre-merger, ongoing-merger, and post-merger).

The outputs use a cross-entropy loss, and no activation functions are applied. Each component of the loss is weighted to be roughly equal and only applied when the task is relevant (e.g. there is no additional loss for predicting the merger subclass wrongly when the galaxy is not a merger).
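This masked joint loss is straightforward to express; below is a minimal PyTorch sketch under stated assumptions (the column ordering, equal weights, and masking rule are illustrative, and this is not the exact Zoobot training code).

```python
import torch
import torch.nn.functional as F

def joint_loss(head_out, is_merger, stage, w_binary=1.0, w_stage=1.0):
    """head_out: (N, 5) dense-layer logits; columns 0-1 score
    non-merger/merger, columns 2-4 the subclasses (pre-, ongoing-,
    post-merger). is_merger: (N,) long 0/1 labels; stage: (N,) long
    subclass labels in 0-2 (ignored for non-mergers)."""
    loss = w_binary * F.cross_entropy(head_out[:, :2], is_merger)
    mask = is_merger.bool()
    if mask.any():
        # Subclass term only where the task is relevant (true mergers).
        loss = loss + w_stage * F.cross_entropy(head_out[mask, 2:],
                                                stage[mask])
    return loss

# Toy usage with random logits and labels.
logits = torch.randn(8, 5)
is_merger = torch.randint(0, 2, (8,))
stage = torch.randint(0, 3, (8,))
print(joint_loss(logits, is_merger, stage))
```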

Regarding data pre-processing, all images were cropped to a smaller physical size of 100 kpc, with an arcsinh scaling. Hyper-parameters (batch size, image physical size, and AdamW weight decay) were optimised based on a grid search training only on the lowest-z subset (for speed). Experiments showed that the best performance was obtained when all model parameters were set as trainable, not just the head or final convolutional layers. Initialising from the pre-trained Zoobot model significantly outperformed initialising an otherwise identical model from random.

4.4 Method-4 (CNN1)

This method uses the CNN architecture described in Bickley et al. (2021), which has been used for post-merger versus non-merger classification. We trained four networks, one for each redshift bin. The networks have four convolutional layers with 32, 64, 128, and 128 filters, respectively, all of size 7 × 7, followed by two dense layers of 512 and 128 neurons, and finally by a 1-neuron dense layer. Each convolutional layer is followed by a max-pooling layer and a dropout layer; dropout is also applied to the dense layers. For data pre-processing, the images were first cropped to 120 × 120, 96 × 96, 80 × 80, and 64 × 64 pixels, for z-bins 1, 2, 3, and 4, respectively. This corresponds to a physical size of approximately 80 kpc in all redshift bins except the first, where it corresponds to 60 kpc. This was done to reduce the size of the network and the memory and time needed for training. We used an arcsinh scaling, along with clipping, to maximise the contrast in the central region. Finally, we normalised the images between 0 and 1.
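A hedged Keras sketch of this architecture is given below; the dropout rates and pooling sizes are assumptions (they are not specified above), so this should be read as indicative rather than the exact network of Bickley et al. (2021).

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_cnn1(input_size, dropout=0.2):
    """Four 7x7 conv blocks (32/64/128/128 filters), each with
    max-pooling and dropout, then 512- and 128-neuron dense layers
    and a single sigmoid output (merger probability)."""
    model = keras.Sequential([keras.Input((input_size, input_size, 1))])
    for n_filters in (32, 64, 128, 128):
        model.add(layers.Conv2D(n_filters, 7, activation="relu",
                                padding="same"))
        model.add(layers.MaxPooling2D(2))
        model.add(layers.Dropout(dropout))
    model.add(layers.Flatten())
    for n_units in (512, 128):
        model.add(layers.Dense(n_units, activation="relu"))
        model.add(layers.Dropout(dropout))
    model.add(layers.Dense(1, activation="sigmoid"))
    return model

model = build_cnn1(120)  # z-bin 1; use 96, 80, 64 for the other bins
model.compile(optimizer=keras.optimizers.Adadelta(),
              loss="binary_crossentropy", metrics=["accuracy"])
```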

For the binary task, we trained each redshift bin separately. In each bin, TNG-training was split into training and validation datasets (90–10 split), making sure that the same merger histories were not split between the sets. Each network was optimised with the ADADELTA optimiser (Zeiler 2012). Data augmentation was used during training, including random rotation between 0 and 90°, random horizontal and vertical flips, and random zoom with a factor between 0.7 and 1.3. The hyper-parameters of the network (such as batch size, learning rate, and optimiser) were fine-tuned to find the best performance on the first redshift bin, and the same set-up was used for the other z bins. We used early stopping in each network to ensure that there is no over-fitting. After finding the best models, we applied them to TNG-test, Horizon-AGN, and HSC. The output of the network is a value between 0 and 1, and we used a threshold of 0.5 to separate non-mergers and mergers. For the multi-class task, we first balanced the datasets by performing data augmentation for the ongoing-merger and post-merger classes. The same networks and hyper-parameters as for the binary task were used, but the last layer was replaced by a four-neuron softmax layer (one neuron for each class). The classification corresponds to the class with the highest probability (given by the corresponding neuron in the last layer).

4.5 Method-5 (CNN2)

This method uses a CNN architecture (Chudy et al., in prep.) with four convolutional layers of 128 filters each, with sizes 13 × 13, 11 × 11, and 11 × 11. After each convolutional layer, a ReLU activation function, batch normalisation, dropout (of 20%), and a max-pooling layer are applied. After the convolutions, the output is flattened before being passed to two fully connected layers of 512 and 128 neurons, respectively. After each dense layer, activation, batch normalisation, and dropout (of 20%) are performed. The output of the network is a single neuron activated by a softmax function, providing a value that represents the probability of being a merger. The loss of the network is the binary cross-entropy. The CNN has 8 186 113 trainable parameters in total. Regarding data pre-processing, all images were first resized to 128 × 128 pixels and normalised between 0 and 1. Images were randomly divided using a 75:15:10 ratio into three sets: the training set, used to fit the parameters; the validation set, used to evaluate the model while tuning its hyper-parameters; and the test set, used for an unbiased evaluation of the final model.

For the binary task, the network was trained on all TNG-training data from the four redshift bins together. It was optimised with the ADAM algorithm with a learning rate α = 3 × 10⁻⁴ and trained for 112 epochs, with the model from the epoch that provided the highest validation accuracy used for classification. The threshold for classification is set to 0.51, at which TPR = true negative rate (TNR). This method has a similar architecture to Method-4 (CNN1), but differs in the number and size of filters, resulting in a higher number of trainable parameters. Another key difference is the scaling used for the images (linear scaling in this case). While for Method-4 (CNN1) the images are cropped to a physical size of 80 kpc to focus more on the central region, for this method the original size of 160 kpc is used. Lastly, for Method-4 (CNN1) we trained four networks, one for each redshift bin, while for this method we trained a single network for all redshift bins. This means that the lower the redshift bin, the more compressed the images are, which could result in a loss of information.
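The balance threshold quoted above (the value at which TPR = TNR, i.e. TPR = 1 − FPR) can be located from the ROC curve; a sketch with toy validation labels and scores as placeholders:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Toy validation labels and predicted merger probabilities.
rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, 2000)
p_val = np.clip(rng.normal(0.35 + 0.3 * y_val, 0.2), 0, 1)

# Locate the balance point where TPR = TNR, i.e. tpr = 1 - fpr.
fpr, tpr, thresholds = roc_curve(y_val, p_val)
idx = np.argmin(np.abs(tpr - (1.0 - fpr)))
threshold = thresholds[idx]  # ~0.51 for this method, per the text
```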

4.6 Method-6 (CNN3)

This method uses a CNN architecture (Walmsley et al. 2019) consisting of three convolutional layers. The first convolutional layer has 32 filters of size 3 × 3, the second has 38 filters of size 3 × 3, and the last has 64 filters of size 2 × 2. Each convolutional layer is followed by a pooling layer. The convolutional part is followed by a 64-neuron dense layer and a 1-neuron output layer, activated by a sigmoid function, providing an output between 0 and 1 for each image. The loss of the network is the binary cross-entropy. The network has a total of 616 497 trainable parameters. Regarding data pre-processing, all images were resized to 100 × 100 pixels. To normalise the images, we applied the transform AsinhStretch(0.1) + PercentileInterval(97), where AsinhStretch(0.1) performs the operation asinh(x/0.1)/asinh(1/0.1) on an image x, and PercentileInterval(97), based on the cumulative pixel distribution, sets the lowest 1.5% of pixels to 0 and the highest 1.5% to 1, while the remaining 97% are normalised between 0 and 1.
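This normalisation uses the composable transforms from astropy.visualization; a minimal sketch follows (the image is a random placeholder array; in composed transforms the interval is applied before the stretch).

```python
import numpy as np
from astropy.visualization import AsinhStretch, PercentileInterval

# Clip to the central 97% of the pixel distribution, then apply an
# arcsinh stretch; the result is mapped to the [0, 1] range.
transform = AsinhStretch(0.1) + PercentileInterval(97.0)

image = np.random.default_rng(0).random((100, 100))  # placeholder image
normalised = transform(image)
```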

For the binary task, the network was trained on TNG-training data from all four redshift bins. The training data were split into training and validation sets with a 90:10 split, ensuring that no merger history was split between the sets. The network was optimised with the ADAM algorithm with a learning rate α = 0.001. Data augmentation was used during training, including random rotation between −45° and 45°, random vertical flips, random horizontal flips, random translations (±5%), and random zoom (with a factor between 0.75 and 1.3). The threshold for classification is set to 0.51, which maximises the TPR while minimising the FPR. This method has a similar architecture to Method-4 (CNN1) and Method-5 (CNN2); however, the architecture in this case is simpler, with a smaller number of layers and trainable parameters. Similarly to Method-5 (CNN2), a single network was trained for all z bins combined. The images have a physical size of 160 kpc, which is twice that used in Method-4 (CNN1) but the same as in Method-5 (CNN2), albeit with a smaller image size (in pixels) and potentially greater loss of information in the highest z bins. It is interesting to investigate these three CNNs to determine whether differences such as the number of layers, the image scaling, or the use of data augmentation can have a significant impact.

Table 2

Performance metrics as percentages (accuracy, precision, recall, and F1-score for the merger class and AUC) of the different methods on the TNG-test set for the binary classification task.

5 Results

5.1 TNG

In this section, we show the performance of the six different methods on the TNG-test dataset using a variety of metrics for the binary task and the multi-class task. First, we check the overall performance by combining all redshift bins. Then, we examine in detail how performance changes as a function of redshift (for both tasks) and stellar mass (only for the binary task).

5.1.1 TNG: binary classification

In Table 2, we summarise the performance of the different methods on the whole TNG-test sample over the entire redshift range explored, using various metrics, including accuracy, precision, recall, F1-score, and AUC. We note that precision and recall are only calculated for the merger class. Overall, all methods show a similar performance on TNG-test, with a maximum difference of 12% in precision (ranging from ~69% to 81%) and 15% in recall (ranging from ~67% to 82%). However, these methods rely on choosing a threshold on the output probability to classify mergers and non-mergers. A common value of this threshold is 0.5, but different thresholds will yield different precision and recall. Another common choice of threshold is the value at which TPR = TNR. In general, one can increase precision at the cost of reducing recall, and vice versa. Metrics that are independent of the threshold choice are the ROC and AUC, which give an overall view of the performance of the model. Figure 6 shows the ROC for all six methods. Method-3 (Zoobot) shows the highest AUC, at just over 85%, while also having the highest accuracy (78%), precision (80%), and F1-score (77%). Method-2 (Swin) has the highest recall (82%) and the second-highest F1-score and AUC, but its precision (69%) is the lowest. Method-4, -5, and -6, which use similar CNNs (with small differences in the exact architecture, image pre-processing, and augmentation), achieve similar performance in all metrics, varying by only a few per cent. The performance of Method-1, the only traditional ML method in this study, is actually quite similar to that of the worst-performing DL-based methods.

We also evaluated the performance of the models in each redshift bin, as shown in Table C.1. Precision and recall are plotted as a function of redshift in Fig. 7 as filled symbols. Method-3 (Zoobot) has the highest precision at all redshifts, varying between 76% in z-bin 4 and 84% in z-bin 1. Method-3 also has the highest F1-score in all redshift bins, except in z-bin 4, where it is slightly overtaken by Method-2. On the other hand, Method-2 (Swin) has the highest recall at all redshifts, but again its precision is the lowest (except in z-bin 4, where the traditional ML method, Method-1, has the lowest precision). There is a mild downward trend with increasing z in both precision and recall for most methods, with an average drop from the lowest to the highest redshift bin of ~5% in precision and ~9% in recall.

Figure 8 shows how precision and recall change with stellar mass (again plotted as filled symbols for the results on TNG-test). Precision remains more or less constant at M* < 10^11 M⊙ and decreases at higher masses; the best results are obtained for galaxies with M* < 10^11 M⊙. However, the average recall at M* < 10^10 M⊙ is only ~50%. Galaxies with 10^10 M⊙ < M* < 10^11 M⊙ (where most galaxies are located) show the best balance between precision and recall. The low precision and high recall at M* > 10^11 M⊙ can be explained, first, by the small number of galaxies in that mass range and, secondly, by the small fraction of non-mergers compared to mergers (two times fewer at 10^11 M⊙ < M* < 10^11.5 M⊙ and five times fewer at M* > 10^11.5 M⊙), which may result in the models learning to predict the most massive galaxies as mergers. In future work, we could try to better balance the number of galaxies in different stellar mass ranges in the training data.

Fig. 6

ROC for the TNG test set. The ROC curves show the overall performance of each method independently of the chosen classification threshold. The farther the curve is from the 1:1 line (which represents a random classifier) or the greater the area under the curve, the better the model. Method-3 (Zoobot) shows the best performance in terms of ROC.

5.1.2 TNG: multi-class classification

Here we present the results obtained for the multi-class classification task. Of the six methods, only four were trained for this task: Method-1 (RF), Method-2 (Swin), Method-3 (Zoobot), and Method-4 (CNN1). This task aims to predict the four classes corresponding to non-merger, pre-merger, ongoing-merger, and post-merger. All methods show considerable confusion between ongoing-mergers and post-mergers, probably due to the relatively low number of galaxies in these two classes (corresponding to 7% and 8% of the training set for ongoing- and post-mergers, respectively). Additionally, the morphologies in these two classes tend to be similar, making them harder to separate. Therefore, in this section we show the results after combining these two classes into a single post-merger class; the results for the full four-class classification are given in Appendix D.

Figure 9 shows the confusion matrices for the four methods for all redshift bins combined. The matrices are normalised to show the precision of each class (with recall values shown in brackets). The overall performance is worse than for the binary task, as shown by the lower precision and recall values. An ideal classifier would show values close to 100 along the diagonal and 0 elsewhere; instead, all methods show high percentages of misclassifications. For most methods (Method-1, -2, and -4) the easiest class to identify seems to be pre-mergers, with precisions of 60%, 72%, and 65%, respectively, but recall is considerably lower (42%, 65%, and 18%, respectively). Method-3 (Zoobot) has the highest precision for post-mergers (81%), while for pre-mergers its precision is 65%, comparable to the other methods. The recall of Method-3 for post-mergers and pre-mergers is 38% and 74%, respectively.
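
The class merging and the normalisation convention of Fig. 9 can be written compactly. The following is a schematic illustration with an assumed label encoding and random stand-in labels, not the authors' code:

import numpy as np
from sklearn.metrics import confusion_matrix

# Assumed label encoding (an illustration): 0 = non-merger, 1 = pre-merger,
# 2 = ongoing-merger, 3 = post-merger. Random labels keep the sketch self-contained.
rng = np.random.default_rng(1)
y_true = rng.integers(0, 4, size=2000)
y_pred = rng.integers(0, 4, size=2000)

def merge_ongoing_into_post(y):
    # Relabel ongoing-mergers (2) as post-mergers (3), as done for Fig. 9.
    return np.where(y == 2, 3, y)

# Rows are the true class, columns the predicted class (scikit-learn convention).
cm = confusion_matrix(merge_ongoing_into_post(y_true), merge_ongoing_into_post(y_pred))

# Column-normalised matrix: the diagonal elements are the precision of each class.
precision_matrix = cm / cm.sum(axis=0, keepdims=True)

# Row-normalised diagonal: the recall of each class (the bracketed values in Fig. 9).
recall = np.diag(cm) / cm.sum(axis=1)
print(np.round(precision_matrix, 2), np.round(recall, 2))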

In Table C.2 we show the precision and recall for the pre-mergers and post-mergers as a function of z for each method. The performance of all methods decreases with increasing z (~10% and ~11% decreases in precision and recall, respectively), more than for the binary classification task, demonstrating the greater difficulty in distinguishing merger stages with increasing z. Some studies in the literature show better performance (Bottrell et al. 2019; Ferreira et al. 2020); however, the comparison is not straightforward (as explained in the introduction), as different studies have different definitions of the merger and non-merger classes, or may use better-quality data. In particular, Ferreira et al. (2020) use mock images that mimic data from the Hubble Space Telescope, which may indicate that deeper data and better spatial resolution can improve performance in distinguishing the different merger stages (as one would naturally expect). Additionally, increasing the relative fraction of post-mergers compared to pre-mergers in the training data could also lead to better performance in merger stage classification.

Fig. 7

Precision (left) and recall (right) of the merger class as a function of redshift. The filled symbols correspond to the performance of the methods on the TNG dataset and the empty symbols correspond to those on the Horizon-AGN dataset. There is a slight downward trend in precision and a more significant drop in recall with increasing redshift for TNG. These trends are stronger for Horizon-AGN. While precision and recall are both relatively high for TNG, for Horizon-AGN recall drops much more than precision. All methods were trained on TNG and then applied to the Horizon-AGN dataset.

Fig. 8

Precision (left) and recall (right) of the merger class as a function of stellar mass for each method. The filled symbols show the metrics for the TNG dataset and the empty symbols for the Horizon-AGN dataset. For most methods and in both simulations, precision remains constant with mass, but then decreases with increasing mass at M* > 10^11 M⊙. There is a sharp downward trend in recall with decreasing mass for both datasets.

5.2 Horizon-AGN

This section presents the results of applying the models trained on TNG-training to Horizon-AGN. The use of the second simulation allows us to quantitatively assess how the performance of the various classifiers changes when transferred to a different dataset. This exercise is useful as it gives us an idea of what may happen to the performance of the classifiers when applied to real observations (for which we have no ground-truth labels).

5.2.1 Horizon-AGN: binary classification

In Table 3, we summarise the performance of the six methods for all redshift bins combined. The precision for the merger class in Horizon-AGN (~70%) does not decrease very much compared to the results on TNG. Method-3 (Zoobot) still achieves the best precision at 72%, which is 8% lower than its performance on TNG. Method-1 (RF) and Method-4 (CNN1) are both a close second to Zoobot, achieving a precision of over 71%. However, we see a much more significant drop in recall relative to TNG. The best-performing method for Horizon-AGN in this metric is Method-5 (CNN2) with a recall of 47%, a 35% drop compared to the best performance on TNG. The worst-performing method in recall for Horizon-AGN is Method-6 (CNN3) with just over 12%. Figure 10 shows the ROC curves for all the models, which are fairly similar to one another (as also indicated by the AUC values in Table 3).

In Table C.3 we summarise the performance of the methods for each redshift bin, and in Fig. 7 we show (in open symbols) precision and recall as a function of z. While there is a decrease in the overall precision with respect to TNG, in the first redshift bin the precisions for both datasets are very similar. Only as z increases does the difference in precision between TNG and Horizon-AGN grow. While in TNG-test the drop in precision with increasing z is <10%, in Horizon-AGN precision drops by ~10-30%, depending on the method. The recall for all methods decreases more sharply as z increases. Only Method-2 (Swin) and Method-5 (CNN2) have recalls comparable to TNG in the first redshift bin, but the difference rapidly increases with z. The rest of the methods have recalls <40% already in the first redshift bin. In Fig. 8, we show (in open symbols) how precision and recall vary with stellar mass. Precision remains more or less the same in all mass bins as for TNG, except for Method-6 (CNN3), which has lower precision at M* < 10^11 M⊙ than it had on TNG. Similar to TNG, recall drops rapidly towards lower mass, but in this case the drop in recall from the highest to the lowest stellar mass bin is even bigger.

The discrepancy between the results obtained on TNG and Horizon-AGN may arise from a number of factors (or combinations thereof), such as the different underlying galaxy physics implemented, the difference in effective resolution, and how galaxies are identified and linked through time. We expect the latter two factors to play a smaller role as, on one hand, both simulations use stellar particles to find galaxies (and both have comparable baryonic matter resolution), and on the other hand, they use the same methodology to define mergers through the merger trees. The different sub-grid physics may result in different galaxy populations (see Appendix A for a comparison of the galaxy populations), which could have a bigger impact on the difference in performance between the two simulations. All these dissimilarities together could lead to differences in the mock images of mergers produced by each simulation, with our results suggesting that Horizon-AGN produces types of merger mock images that are not found in TNG.

Table 3

Performance metrics as percentages (accuracy, precision, recall, and F1-score for the merger class, and AUC) for the different methods (trained on TNG) applied to the Horizon-AGN set, for the binary classification task.

Fig. 9

Confusion matrices for Method-1 (top left), Method-2 (top right), Method-3 (bottom left), and Method-4 (bottom right) for the multi-class classification task on TNG. The data from all four redshift bins are combined. The post-merger class includes the ongoing-mergers. The confusion matrices are normalised vertically, and therefore the diagonal elements represent the precision of each class. The recall of each class is shown in brackets.

Fig. 10

ROC for the different methods (trained on TNG) applied to Horizon-AGN set. Method-1 (RF) and Method-2 (Swin) show the best performance in terms of ROC. However, the differences with the other methods are small.

Fig. 11

Confusion matrices for Method-1 (top left), Method-2 (top right), Method-3 (bottom left), and Method-4 (bottom right) for the multi-class classification task on Horizon-AGN (from methods trained on TNG). The data from all four redshift bins are combined. The post-merger class includes the ongoing-mergers. The confusion matrices are normalised vertically, and therefore the diagonal represents the precision of each class. The recall is shown in brackets.

5.2.2 Horizon-AGN: multi-class classification

In this section, we present the results of applying to the Horizon-AGN dataset the four models trained for the multi-class task. As in Sect. 5.1.2, we show the results after combining the ongoing-merger and post-merger classes into a single post-merger class. Figure 11 shows the four confusion matrices for all the Horizon-AGN redshift bins combined. None of the models performs well on this dataset. This could be at least partly because the baseline performance (on the training dataset, as seen in Fig. 9) is not high enough to begin with, and performance is expected to drop when the models are applied to a different domain (as seen in the previous section).

5.3 HSC

In this section, we explore the application of the models to real observations from HSC. To evaluate the performance on the HSC dataset we cannot rely on true labels, as they do not exist. Instead, we compare the classification results from the different methods against visual labels. Obviously, the performance obtained in this way cannot be directly compared with that in previous sections, as visual inspection is biased towards the most conspicuous mergers and non-mergers. Nonetheless, we expect a good classifier trained on simulations to correctly classify the majority of the most obvious mergers and non-mergers.

We use the visual classification labels for the subsample of ~2000 galaxies from Goulding et al. (2018) that fall into one of the two categories corresponding to major mergers and disturbed galaxies (possible minor mergers in some cases), along with the sample of ~1000 visually identified clear non-mergers from this work. Examples of these three classes can be found in Fig. 1. We first investigate how the methods classify the galaxies with visual labels of major mergers and non-mergers, treating these labels as if they were true labels. Figure 12 shows the ROC for all six methods, with Method-3 (Zoobot) having the overall best performance in terms of ROC and AUC. The computed performance metrics for each method, after combining all redshift bins, are shown in Table 4, and per redshift bin in Table C.4. Overall, precisions are high, and even higher than for TNG and Horizon-AGN for all methods, which is expected as this classification task should be easier (due to the greater distinction between the two visually identified classes). Precision for the whole sample ranges from ~71% to ~85%, depending on the method. Recall, however, is generally lower than for TNG (but higher than for Horizon-AGN), ranging from ~58% to ~75%. This means that a significant fraction of these obvious mergers are misclassified as non-mergers. In Fig. 13 we plot precision and recall as a function of z. As in Fig. 7, we see a downward trend in precision with increasing z. Recall, on the other hand, increases with z for most methods, with the exception of Method-3 (Zoobot), which shows a downward trend. In Fig. 14, we show the galaxies for which all methods fail: false positives are galaxies visually classified as non-mergers that all methods predict as mergers, and false negatives are galaxies visually classified as mergers but labelled as non-mergers by all methods. There are only 17 galaxies for which all methods fail. The false negatives are generally not among the most obvious mergers and tend to show faint features, while the false positives tend to have a small companion that is not obviously interacting.
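
The selection of the systematically misclassified galaxies shown in Fig. 14 amounts to intersecting the failure sets of all six methods. A minimal sketch, with hypothetical prediction arrays standing in for the real classifier outputs, is:

import numpy as np

# Hypothetical arrays: preds[m, i] is the binary prediction of method m for galaxy i
# (1 = merger), and y_vis holds the visual labels used as ground truth.
rng = np.random.default_rng(2)
preds = rng.integers(0, 2, size=(6, 300))
y_vis = rng.integers(0, 2, size=300)

# False positives: visual non-mergers that every method predicts as mergers.
all_fp = (y_vis == 0) & np.all(preds == 1, axis=0)
# False negatives: visual mergers that every method predicts as non-mergers.
all_fn = (y_vis == 1) & np.all(preds == 0, axis=0)
print(all_fp.sum(), "common false positives;", all_fn.sum(), "common false negatives")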

Visual classifications of mergers and non-mergers do not represent the real universe, as they are likely to include only the most obvious mergers and non-mergers. In the real universe, not all galaxies fall into either category: there will be galaxies that are neither clear mergers nor clear non-mergers (visually). Therefore, these results represent the best-case scenario, or upper limits, for the performance. To explore this further, we use the galaxies that have been visually classified as disturbed (which could include minor mergers). Galaxies in this class are expected not to be mergers by our definition, but they are neither clear mergers nor clear non-mergers visually. Comparing the performance of the models using this disturbed class as the non-mergers represents the worst-case scenario. However, this probably does not reflect the real universe either, as not all galaxies fall in a disturbed-but-not-major-merger category. That is why we then combine the clear non-mergers with these disturbed galaxies in the non-merger class; this represents a more realistic worst-case scenario. The results of these two scenarios are summarised in Table 5. For the scenario in which we compare mergers versus non-mergers+disturbed, the precision ranges from around 62% to 81% (for the merger class). When considering only disturbed galaxies as the non-merger class, the precision drops a further ~5%, on average. In Fig. 15, we show the precision for each method and for the three scenarios. The precision drops, for all methods, when there is less separation between the classes, which favours more misclassifications. However, for the most realistic worst-case scenario (mergers vs. non-mergers+disturbed), the precision only drops by ~5% on average, for all methods, with respect to the best-case scenario (mergers vs. non-mergers). We note that, by definition (see Eq. 2), the recall does not change in any of these scenarios, as neither TP (mergers correctly identified as mergers) nor FN (mergers incorrectly identified as non-mergers) changes.
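
The invariance of recall across the three scenarios follows directly from the definitions of the metrics. In the notation used throughout (TP: mergers correctly identified; FN: mergers missed; FP: galaxies from the chosen negative class flagged as mergers):

\mathrm{precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \qquad \mathrm{recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}.

Changing the composition of the negative class (clear non-mergers, disturbed galaxies, or their union) alters only FP, and hence only precision; TP and FN are computed from the merger class alone, so recall is unchanged by construction.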

Finally, in Fig. 16 we explore the fraction of mergers as a function of redshift and stellar mass for the different methods. The solid and dashed black lines show the fraction of major mergers found in TNG100 and TNG300, respectively, from which we constructed the training sample. The dotted black line shows the fraction of major mergers found in Horizon-AGN, compatible with the results from TNG. The grey band represents the range of merger fractions found in different observational studies. Most observational studies find merger fractions, on average, between 0.03 and 0.08 for galaxy samples within our redshift and stellar mass range (in agreement with the fraction of major mergers found in the TNG and Horizon-AGN simulations). These studies calculate the merger fractions through galaxy pairs (e.g. Duncan et al. 2019; Ventou et al. 2017; Mundy et al. 2017; Man et al. 2012; Williams et al. 2011), from morphological parameters (Whitney et al. 2021), or from DL methods trained on simulations (Ferreira et al. 2020). The fraction of major mergers can vary by more than an order of magnitude between methods, and it shows very different trends with z. None of the methods reproduces the trend with redshift seen in the simulations. Furthermore, all methods except Zoobot find higher major merger fractions in HSC than the average values found in the literature.
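
The merger fractions of Fig. 16 are simply the number of galaxies classified as mergers over the total number in each redshift (or stellar-mass) bin. A schematic computation, with hypothetical arrays in place of the real catalogue and using the four redshift bins adopted in this work, is:

import numpy as np

# Hypothetical inputs: redshifts of the HSC galaxies and the binary classifier output.
rng = np.random.default_rng(3)
z = rng.uniform(0.1, 1.0, size=5000)
is_merger = rng.random(5000) < 0.06  # stand-in predictions (True = classified as merger)

# The four redshift bins used throughout this work.
z_edges = np.array([0.1, 0.31, 0.52, 0.76, 1.0])
idx = np.digitize(z, z_edges) - 1

# Detected major-merger fraction per bin: N(classified as merger) / N(total).
frac = np.array([is_merger[idx == i].mean() for i in range(len(z_edges) - 1)])
print(frac)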

Part of these differences with previous literature results may be explained by how major merger fractions are affected by the definition of a merger. For example, in galaxy pair studies the major merger fraction will increase with the maximum spatial separation adopted in the definition (de Ravel et al. 2009). Furthermore, in galaxy-pair studies the sample of major mergers used to calculate merger fractions will be biased towards what we refer to as pre-mergers, while morphology-based methods, such as those based on CAS or imaging, will preferentially find post-mergers (Desmons et al. 2023). In our study, the merger definition spans a longer timescale, and the methods are trained to select both pre-mergers and post-mergers, making a direct comparison non-trivial. Sample selection can also impact major merger fractions. For example, de Ravel et al. (2009), with a luminosity-selected sample, found major merger fractions ranging from about 0.03 up to 0.7, depending on the galaxy pair definition. On the other hand, all methods find a trend with stellar mass similar to that in TNG, in agreement with observational studies that find an increase in major merger fraction with stellar mass (Ventou et al. 2017; de Ravel et al. 2009).

Based on Fig. 16, it may seem that major mergers do not play a significant role. However, as seen in Fig. 7, we may expect the recall to drop significantly, particularly at higher redshift, indicating that the fraction of mergers could be much larger than what we observe. All methods show a very similar trend with stellar mass (an upward trend, also observed in TNG100 and TNG300), albeit with different absolute fractions. However, this trend may not reflect an intrinsic behaviour. As seen in Fig. 8, the massive end tends to be more complete but less precise, while the lower-mass end is more precise but less complete. This may be, in part, due to a bias in the training sample, in which the majority of the most massive galaxies (M* > 10^11 M⊙) are mergers, which may translate into the methods classifying the most massive galaxies as mergers regardless of the presence (or absence) of merger signatures.

Fig. 12

ROC for the different methods (trained on TNG) applied to the HSC set. Method-4 (CNN1) shows the best performance in terms of ROC.

Table 4

Performance metrics as percentages (accuracy, precision, recall, and F1-score for the merger class and AUC) of the different methods (trained on TNG) on the visual HSC set, for the binary classification task.

Fig. 13

Precision (left) and recall (right) as a function of redshift, using HSC visual classifications of major mergers and non-mergers as true labels. For all methods, precision is higher than for the TNG training sample because in this case there is a clearer distinction between the two classes. All methods were trained on TNG and then applied to the HSC dataset.

Fig. 14

Galaxies for which all methods (trained on TNG) predict the wrong class when applied to HSC. False positives (left) are those visually classified as non-mergers, but that all methods predict as mergers. False negatives (right) are galaxies visually classified as mergers, while all methods predict them as non-mergers.

Fig. 15

Precision for the HSC set using visual labels, for different definitions of the negative (non-merger) class: visually classified non-mergers (clear separation between the classes), visually classified non-mergers plus disturbed galaxies, and disturbed galaxies only. The precision for all methods drops when the separation between the classes is smaller. The recall is not shown as, by definition, it does not vary. All methods were trained on TNG and then applied to the HSC dataset.

Table 5

Summary of the performance metrics as percentages (accuracy, precision, recall, and F1-score for the merger class) of the different methods (trained on TNG) on the visual HSC set, considering different scenarios for the non-merger class.

Fig. 16

Fraction of major mergers as a function of redshift (left) and stellar mass (right), for each method on real HSC observations. The solid (dashed) black line shows the fraction of major mergers found in TNG100 (TNG300). The dotted black line shows the fraction of mergers for Horizon-AGN. All methods were trained on TNG and then applied to the HSC dataset.

6 Discussion and conclusions

In this paper, we benchmarked the performance of six ML-based merger detection methods that were trained on the same data. The training dataset (TNG-train) was constructed from IllustrisTNG by creating mock images that mimic the HSC survey in terms of PSF, resolution, filter, and sky background, but that do not include the effects of dust. We first evaluated the performance of all methods on test data from the same simulation (TNG-test) and then on mock images from a different simulation (Horizon-AGN), constructed in a similar way. Finally, we used all methods to make predictions for real galaxies from HSC and compared them with visual classification labels. Our conclusions are summarised below:

  • When we have representative data, all methods achieve fairly good performance in the binary merger versus non-merger classification task, with precision ~70-80% and recall ~70-77%. For the binary task, the best overall method in terms of AUC (Method-3, Zoobot) has a precision of 80% and a recall of 74% on TNG-test. Zoobot is the only method pre-trained with galaxy images, which may indicate that transfer learning is important, even when the classification tasks are different. Traditional ML methods (trained on morphological and structural parameters) can be competitive in some cases, and they have the advantage of being easier and quicker to use and interpret. Method-4, -5, and -6 use similar CNNs, but differ in the specific architecture, image pre-processing, and augmentation. Their performance varies by a few percentage points, indicating that a given method can be further improved by fine-tuning these aspects. Interestingly, the performance in the binary classification does not decrease much with increasing redshift (~5% decrease in precision and ~9% decrease in recall). Stellar mass has a bigger impact, with lower precision towards higher-mass galaxies and rapidly decreasing recall towards lower-mass galaxies.

  • The multi-class classification task is, as expected, much more challenging for all methods, as we try to discern more subtle differences. It is particularly difficult to distinguish between ongoing-mergers and post-mergers (which can have similar features). In addition, due to the relatively short timescale of ongoing- and post-mergers, the number of galaxies in these two categories is very small, making it harder for the classifiers to learn their representation. Most methods find it easier to classify pre-mergers. Method-3 (Zoobot) has very high precision for post-mergers.

  • When we apply the trained classifiers to the second simulation (Horizon-AGN), we obtain similar precision to TNG, with a slightly bigger difference at higher redshift. However, recall in Horizon-AGN is much worse than in TNG, particularly at high redshift and low stellar masses. This may be due to intrinsic differences between the two simulations, such as the effective resolution or the sub-grid physics, resulting in different types of merger mock images. The implication is that we can classify with high precision the types of mergers that are present in both simulations, but types of mergers not included in the training cannot be identified as easily. It is important to realise that this may also be the case when we apply simulation-trained classifiers to real observations, leading to detected merger samples that can be very incomplete.

  • When comparing the models' predictions with visual classifications of clear HSC mergers and clear non-mergers (two very distinct classes), the precision ranges from ~71% to 85%, depending on the method. These values are slightly higher than those obtained for TNG-test, which is expected, as in this case the mergers are the most obvious ones and there is less confusion between the classes. However, when the classes are less distinct (i.e. when the non-merger class includes disturbed or minor-merger galaxies), the precision drops by ~5% on average.

  • The fraction of detected major mergers in the HSC survey does not agree with the fraction of major mergers found in the TNG and Horizon-AGN simulations; most methods find higher major merger fractions in HSC than in the simulations. Moreover, the fraction of detected major mergers can differ by more than an order of magnitude among the various methods, which also exhibit very different trends with redshift. All methods show a fairly flat relation between major merger fraction and stellar mass, with a slight increase towards more massive galaxies; this increase is also observed in the TNG and Horizon-AGN simulations. However, it is not straightforward to translate this observed trend into an intrinsic relation, because of the competing effects of precision and recall. Our work demonstrates that, without a detailed quantitative understanding of precision and recall, it is very challenging to understand the role of mergers in galaxy evolution.

Detecting mergers is a challenging task; current methods achieve accuracies of ~80% at best in simulated data. Galaxy properties such as stellar mass and redshift have an impact on the performance and may introduce biases in the detected mergers. A good understanding of these dependences is extremely important, not only in terms of constructing sufficiently reliable and complete merger samples, but also in terms of recovering the intrinsic relations of merger fraction versus mass and redshift. Other properties not investigated here, such as the mass ratio between merging galaxies and the gas content, could also have an impact and will be investigated in the future. In this paper we only used galaxy images in a single filter. However, it is reasonable to expect that combining images in different filters can improve performance. Better-quality data in terms of depth and spatial resolution (e.g. from JWST and Euclid) should also lead to better results (particularly for distinguishing between different merger stages).

In this study, we showed that the performance of classifiers trained on one simulation worsens when applied to a different simulation, and it is expected to decrease for real observations as well. Domain adaptation techniques that focus on domain-invariant learning may be a promising approach to alleviating this problem (Ćiprijanović et al. 2020a). Without domain adaptation, we showed that the detected merger sample can be very incomplete. However, by training on mock images that are made to resemble the real observations as closely as possible (in terms of resolution, noise, and background), we can obtain similar precision for the merger class, highlighting the importance of using realistic and representative training samples (Bottrell et al. 2019; Ćiprijanović et al. 2020b).

Acknowledgements

We would like to thank the anonymous referee for their insightful comments that have improved the quality of this paper. This publication is part of the project "Clash of the Titans: deciphering the enigmatic role of cosmic collisions" (with project number VI.Vidi.193.113 of the research programme Vidi, which is (partly) financed by the Dutch Research Council (NWO)). This work has made use of the Horizon cluster, on which the Horizon-AGN simulation was post-processed, hosted by the Institut d'Astrophysique de Paris. We warmly thank S. Rouberol for running it smoothly. We thank the Center for Information Technology of the University of Groningen for their support and for providing access to the Hábrók high-performance computing cluster. CB acknowledges support from the Forrest Research Foundation. H.D.S. acknowledges support by the PID2020-115098RJ-I00 grant from MCIN/AEI/10.13039/501100011033 and from the Spanish Ministry of Science and Innovation and the European Union – NextGenerationEU through the Recovery and Resilience Facility project ICTS-MRR-2021-03-CEFCA. W.J.P. has been supported by the Polish National Science Center project UMO-2020/37/B/ST9/00466. MW acknowledges funding from the Science and Technology Facilities Council (STFC) Grant Code ST/R505006/1. CBP acknowledges support by grant CM21_CAB_M2_01 from the programme "Garantía Juvenil" of the "Comunidad de Madrid" 2021.

Appendix A Comparison between TNG and Horizon-AGN galaxy populations

We explored some of the differences between the galaxy populations produced by the TNG and Horizon-AGN simulations. These differences may arise from the sub-grid physics that each simulation implements. All morphological statistics shown here were obtained with Statmorph, as explained in Sect. 4.1. In Fig. A.1, we show the distributions of different galaxy properties (stellar mass, circularised half-light radius, and Sérsic index) for mergers and non-mergers in both simulations. It appears that Horizon-AGN produces fewer massive galaxies than TNG, and the difference is larger for the mergers. In addition, Horizon-AGN produces, in general, smaller galaxies with lower Sérsic indices. In Fig. A.2 we show how Horizon-AGN and TNG galaxies populate the Gini-M20 plane, with some dissimilarities between them. Therefore, Horizon-AGN and TNG produce slightly different galaxy populations, which may play a role in the differences observed in performance between Horizon-AGN and TNG (as discussed in Sect. 5.2).

Fig. A.1

Distributions of stellar mass, circularised half-light radius, and Sérsic index for each simulation (TNG100 and TNG300 combined, and Horizon-AGN). The left column shows the mergers (as determined by the merger trees in the simulations), and the right column shows the non-mergers. The vertical lines show the mean value of each distribution.

Fig. A.2

Gini-M20 relationship for mergers (left) and non-mergers (right), for the two simulations (TNG100 and TNG300 combined in yellow, and Horizon-AGN in purple).

Appendix B Summary of the methods

Table B.1 summarises the methods described in Sect. 4, to show the main differences between them. In particular, we present the name of each architecture, the number of trainable parameters, whether one single network was trained on the whole dataset or four networks were used (one for data in each redshift bin), the final size of the images (in physical size and in pixels), the scaling applied, data augmentation and any extra data that was used. Finally, we give a reference for each method.

Table B.1

Summary of the ML and DL methods: Name or architecture, number of trainable parameters, whether one network is trained in the whole dataset or four networks are trained (one for each redshift bin), the size of the image in pixels, the size of the image in kpc, the scaling used in the images, the threshold used for the binary classification task, the kind of data augmentation used during training, any extra data that was used, whether the method performed the multi-class classification task, and a paper reference for the method.

Appendix C Performance of ML/DL-methods as a function of redshift

Here we summarise the performance of the methods for the different datasets as a function of redshift. In Table C.1 we show the performance (accuracy, precision, recall, and F1-score) of all six methods on the TNG-test dataset. As discussed in Sect. 5.1, Zoobot shows the overall best performance over the whole redshift range. Table C.2 shows the precision and recall for pre-mergers and post-mergers as a function of redshift. Tables C.3 and C.4 show the performance obtained on Horizon-AGN and HSC, respectively, as a function of redshift.

Table C.1

Performance metrics as percentages (accuracy, precision, recall, and F1-score for the merger class) of the different methods on the TNG-test set per redshift bin, for the binary classification task. The best performance in each metric is highlighted in bold.

Table C.2

Performance metrics as percentages (precision and recall for the pre-mergers and post-mergers) of the different methods on the TNG-test set per redshift bin, for the multi-class classification task. The post-merger class includes the ongoing-mergers. The merger classes or stages are determined by the merger trees from the TNG simulation. The best performance in each metric is highlighted in bold.

Table C.3

Performance metrics as percentages (accuracy, precision, recall, and F1-score for the merger class) for the different methods (trained on TNG) applied to the Horizon-AGN set per redshift bin, for the binary classification task. The best performance in each metric is highlighted in bold.

Table C.4

Performance metrics as percentages (accuracy, precision, recall, and F1-score for the merger class) of the different methods (trained on TNG) on the HSC set per redshift bin, for the binary classification task. The best performance in each metric is highlighted in bold.

Appendix D Merger stages: Four classes

Here we present the results of the four-class classification for Methods 1-4. The methods are trained to classify galaxies into four classes: non-merger, pre-merger, ongoing-merger, and post-merger. These four merger stages were determined from the merger trees of the TNG simulation. Figure D.1 shows the confusion matrices for the four methods for all redshift bins combined. The matrices are normalised to show the precision of each class (with recall values shown in brackets). The performance is relatively low, with only some methods achieving more than 50% precision for some of the classes. In particular, it is clear that all methods fail at distinguishing between ongoing-mergers and post-mergers. This may be because these two classes have the lowest numbers of galaxies and can present very similar features. In Sect. 5.1.2, we show the performance of the methods when combining these two classes into one.

Fig. D.1

Confusion matrices for Method-1 (top left panel), Method-2 (top right panel), Method-3 (bottom left panel), and Method-4 (bottom right panel) for the four-class classification task on TNG. The four classes are non-merger, pre-merger, ongoing-merger, and post-merger (the different classes determined by the merger trees in the simulation). The data from all four redshift bins are combined. The confusion matrices are normalised vertically; therefore, the diagonal represents the precision of each class. The recall is shown in brackets.

References

1. Ackermann, S., Schawinski, K., Zhang, C., Weigel, A. K., & Turp, M. D. 2018, MNRAS, 479, 415
2. Aihara, H., Arimoto, N., Armstrong, R., et al. 2018, PASJ, 70, S4
3. Aihara, H., AlSayyad, Y., Ando, M., et al. 2022, PASJ, 74, 247
4. Albawi, S., Mohammed, T. A., & Al-Zawi, S. 2017, in 2017 International Conference on Engineering and Technology (ICET), 1
5. Aubert, D., Pichon, C., & Colombi, S. 2004, MNRAS, 352, 376
6. Benítez, N. 2000, ApJ, 536, 571
7. Berg, T. A. M., Simard, L., Mendel, T. J., & Ellison, S. L. 2014, MNRAS, 440, L66
8. Bickley, R. W., Bottrell, C., Hani, M. H., et al. 2021, MNRAS, 504, 372
9. Bickley, R. W., Ellison, S. L., Patton, D. R., et al. 2022, MNRAS, 514, 3294
10. Bickley, R. W., Ellison, S. L., Patton, D. R., & Wilkinson, S. 2023, MNRAS, 519, 6149
11. Bosch, J., Armstrong, R., Bickerton, S., et al. 2018, PASJ, 70, S5
12. Bottrell, C., Hani, M. H., Teimoorinia, H., et al. 2019, MNRAS, 490, 5390
13. Bruzual, G., & Charlot, S. 2003, MNRAS, 344, 1000
14. Bustamante, S., Ellison, S. L., Patton, D. R., & Sparre, M. 2020, MNRAS, 494, 3469
15. Calzetti, D., Kinney, A. L., & Storchi-Bergmann, T. 1994, ApJ, 429, 582
16. Chabrier, G. 2003, PASP, 115, 763
17. Cheng, T.-Y., Conselice, C. J., Aragón-Salamanca, A., et al. 2020, MNRAS, 493, 4209
18. Ćiprijanović, A., Kafkes, D., Jenkins, S., et al. 2020a, arXiv e-prints [arXiv:2011.03591]
19. Ćiprijanović, A., Snyder, G. F., Nord, B., & Peek, J. E. G. 2020b, Astron. Comput., 32, 100390
20. Clauwens, B., Schaye, J., Franx, M., & Bower, R. G. 2018, MNRAS, 478, 3994
21. Conselice, C. J. 2009, MNRAS, 399, L16
22. Conselice, C. J., Bershady, M. A., Dickinson, M., & Papovich, C. 2003, AJ, 126, 1183
23. Conselice, C. J., Bluck, A. F. L., Mortlock, A., Palamara, D., & Benson, A. J. 2014, MNRAS, 444, 1125
24. Cortijo-Ferrero, C., González Delgado, R. M., Pérez, E., et al. 2017, A&A, 607, A70
25. Darg, D. W., Kaviraj, S., Lintott, C. J., et al. 2010, MNRAS, 401, 1043
26. Davis, M., Efstathiou, G., Frenk, C. S., & White, S. D. M. 1985, ApJ, 292, 371
27. de Ravel, L., Le Fèvre, O., Tresse, L., et al. 2009, A&A, 498, 379
28. Desmons, A., Brough, S., Martínez-Lombilla, C., et al. 2023, MNRAS, 523, 4381
29. Dieleman, S., Willett, K. W., & Dambre, J. 2015, MNRAS, 450, 1441
30. Di Matteo, T., Khandai, N., DeGraf, C., et al. 2012, ApJ, 745, L29
31. Dolag, K., Borgani, S., Murante, G., & Springel, V. 2009, MNRAS, 399, 497
32. Domingos, P. 2012, Commun. ACM, 55, 78
33. Domínguez Sánchez, H., Huertas-Company, M., Bernardi, M., Tuccillo, D., & Fischer, J. L. 2018, MNRAS, 476, 3661
34. Domínguez Sánchez, H., Martin, G., Damjanov, I., et al. 2023, MNRAS, 521, 3861
35. Dubois, Y., Pichon, C., Welker, C., et al. 2014, MNRAS, 444, 1453
36. Dubois, Y., Peirani, S., Pichon, C., et al. 2016, MNRAS, 463, 3948
37. Duncan, K., Conselice, C. J., Mundy, C., et al. 2019, ApJ, 876, 110
38. Edge, A., Sutherland, W., Kuijken, K., et al. 2013, The Messenger, 154, 32
39. Eisert, L., Pillepich, A., Nelson, D., et al. 2023, MNRAS, 519, 2199
40. Ellison, S. L., Mendel, J. T., Scudder, J. M., Patton, D. R., & Palmer, M. J. D. 2013, MNRAS, 430, 3128
41. Ellison, S. L., Viswanathan, A., Patton, D. R., et al. 2019, MNRAS, 487, 2491
42. Fakhouri, O., & Ma, C.-P. 2008, MNRAS, 386, 577
43. Ferreira, L., Conselice, C. J., Duncan, K., et al. 2020, ApJ, 895, 115
44. Fitts, A., Boylan-Kolchin, M., Bullock, J. S., et al. 2018, MNRAS, 479, 319
45. Fukushima, K. 1988, Neural Networks, 1, 119
46. Girardi, L., Bressan, A., Bertelli, G., & Chiosi, C. 2000, A&AS, 141, 371
47. Goodfellow, I., Bengio, Y., & Courville, A. 2016, Deep Learning (MIT Press)
48. Goulding, A. D., Greene, J. E., Bezanson, R., et al. 2018, PASJ, 70, S37
49. Guzmán-Ortega, A., Rodriguez-Gomez, V., Snyder, G. F., Chamberlain, K., & Hernquist, L. 2023, MNRAS, 519, 4920
50. Haardt, F., & Madau, P. 1996, ApJ, 461, 20
51. Hani, M. H., Sparre, M., Ellison, S. L., Torrey, P., & Vogelsberger, M. 2018, MNRAS, 475, 1160
52. Ho, T. K. 1995, in Proceedings of 3rd International Conference on Document Analysis and Recognition, 1, IEEE, 278
53. Hopkins, P. F., Wetzel, A., Kereš, D., et al. 2018, MNRAS, 480, 800
54. Huertas-Company, M., & Lanusse, F. 2023, PASA, 40, e001
55. Huertas-Company, M., Gravet, R., Cabrera-Vives, G., et al. 2015, ApJS, 221, 8
56. Huertas-Company, M., Primack, J. R., Dekel, A., et al. 2018, ApJ, 858, 114
57. Huško, F., Lacey, C. G., & Baugh, C. M. 2022, MNRAS, 509, 5918
58. Ibata, R. A., McConnachie, A., Cuillandre, J.-C., et al. 2017, ApJ, 848, 128
59. Ilbert, O., Arnouts, S., McCracken, H. J., et al. 2006, A&A, 457, 841
60. Jackson, R. A., Kaviraj, S., Martin, G., et al. 2022, MNRAS, 511, 607
61. Jiang, C. Y., Jing, Y. P., & Han, J. 2014, ApJ, 790, 7
62. Karsten, J., Wang, L., Margalef-Bentabol, B., et al. 2023, A&A, 675, A159
63. Kennicutt, Jr., R. C. 1998, ApJ, 498, 541
64. Kitzbichler, M. G., & White, S. D. M. 2008, MNRAS, 391, 1489
65. Komatsu, E., Smith, K. M., Dunkley, J., et al. 2011, ApJS, 192, 18
66. Kuijken, K., Heymans, C., Dvornik, A., et al. 2019, A&A, 625, A2
67. Lazar, I., Kaviraj, S., Martin, G., et al. 2023, MNRAS, 520, 2109
68. LeCun, Y., Bengio, Y., & Hinton, G. 2015, Nature, 521, 436
69. Leitherer, C., Schaerer, D., Goldader, J. D., et al. 1999, ApJS, 123, 3
70. Leitherer, C., Ortiz Otálvaro, P. A., Bresolin, F., et al. 2010, ApJS, 189, 309
71. Lemaître, G., Nogueira, F., & Aridas, C. K. 2017, J. Mach. Learn. Res., 18, 559
72. Lintott, C. J., Schawinski, K., Slosar, A., et al. 2008, MNRAS, 389, 1179
73. Liske, J., Baldry, I. K., Driver, S. P., et al. 2015, MNRAS, 452, 2087
74. Liu, Z., Lin, Y., Cao, Y., et al. 2021, arXiv e-prints [arXiv:2103.14030]
75. López-Sanjuan, C., Cenarro, A. J., Varela, J., et al. 2015, A&A, 576, A53
76. Lotz, J. M., Primack, J., & Madau, P. 2004, AJ, 128, 163
77. Lotz, J. M., Jonsson, P., Cox, T. J., & Primack, J. R. 2010, MNRAS, 404, 575
78. Man, A. W. S., Toft, S., Zirm, A. W., Wuyts, S., & van der Wel, A. 2012, ApJ, 744, 85
79. Margalef-Bentabol, B., Huertas-Company, M., Charnock, T., et al. 2020, MNRAS, 496, 2346
80. Marinacci, F., Vogelsberger, M., Pakmor, R., et al. 2018, MNRAS, 480, 5113
81. Martin, G., Kaviraj, S., Devriendt, J. E. G., Dubois, Y., & Pichon, C. 2018, MNRAS, 480, 2266
82. Martin, G., Kaviraj, S., Hocking, A., Read, S. C., & Geach, J. E. 2020, MNRAS, 491, 1408
83. Martin, G., Jackson, R. A., Kaviraj, S., et al. 2021, MNRAS, 500, 4937
84. Martin, G., Bazkiaei, A. E., Spavone, M., et al. 2022, MNRAS, 513, 1459
85. McAlpine, S., Harrison, C. M., Rosario, D. J., et al. 2020, MNRAS, 494, 5713
86. Minghao, C., Kan, W., Bolin, N., et al. 2021, arXiv e-prints [arXiv:2111.14725]
87. Miyazaki, S., Komiyama, Y., Kawanomoto, S., et al. 2018, PASJ, 70, S1
88. Moreno, J., Torrey, P., Ellison, S. L., et al. 2019, MNRAS, 485, 1320
89. Mundy, C. J., Conselice, C. J., Duncan, K. J., et al. 2017, MNRAS, 470, 3507
90. Naiman, J. P., Pillepich, A., Springel, V., et al. 2018, MNRAS, 477, 1206
91. Nelson, D., Pillepich, A., Springel, V., et al. 2018, MNRAS, 475, 624
92. Nelson, D., Springel, V., Pillepich, A., et al. 2019, Computat. Astrophys. Cosmol., 6, 2
93. Nevin, R., Blecha, L., Comerford, J., & Greene, J. 2019, ApJ, 872, 76
94. Nomoto, K., Saio, H., Kato, M., & Hachisu, I. 2007, ApJ, 663, 1269
95. O'Shea, K., & Nash, R. 2015, arXiv e-prints [arXiv:1511.08458]
96. Pascanu, R., Mikolov, T., & Bengio, Y. 2012, arXiv e-prints [arXiv:1211.5863]
97. Patton, D. R., Wilson, K. D., Metrow, C. J., et al. 2020, MNRAS, 494, 4969
98. Pearson, W. J., Wang, L., Alpaslan, M., et al. 2019a, A&A, 631, A51
99. Pearson, W. J., Wang, L., Trayford, J. W., Petrillo, C. E., & van der Tak, F. F. S. 2019b, A&A, 626, A49
100. Pearson, W. J., Suelves, L. E., Ho, S. C. C., et al. 2022, A&A, 661, A52
101. Pedregosa, F., Varoquaux, G., Gramfort, A., et al. 2011, J. Mach. Learn. Res., 12, 2825
102. Pillepich, A., Nelson, D., Hernquist, L., et al. 2018a, MNRAS, 475, 648
103. Pillepich, A., Springel, V., Nelson, D., et al. 2018b, MNRAS, 473, 4077
104. Planck Collaboration XIII. 2016, A&A, 594, A13
105. Qu, Y., Helly, J. C., Bower, R. G., et al. 2017, MNRAS, 464, 1659
106. Rodriguez-Gomez, V., Genel, S., Vogelsberger, M., et al. 2015, MNRAS, 449, 49
107. Rodriguez-Gomez, V., Pillepich, A., Sales, L. V., et al. 2016, MNRAS, 458, 2371
108. Rodriguez-Gomez, V., Sales, L. V., Genel, S., et al. 2017, MNRAS, 467, 3083
109. Rodriguez-Gomez, V., Snyder, G. F., Lotz, J. M., et al. 2019, MNRAS, 483, 4140
110. Rose, C., Kartaltepe, J. S., Snyder, G. F., et al. 2023, ApJ, 942, 54
111. Russakovsky, O., Deng, J., Su, H., et al. 2015, Int. J. Comput. Vis., 115, 211
112. Satyapal, S., Ellison, S. L., McAlpine, W., et al. 2014, MNRAS, 441, 1297
113. Schaye, J., Crain, R. A., Bower, R. G., et al. 2015, MNRAS, 446, 521
114. Schmidhuber, J. 2015, Neural Networks, 61, 85
115. Simmons, B. D., Lintott, C., Willett, K. W., et al. 2017, MNRAS, 464, 4420
116. Snyder, G. F., Lotz, J. M., Rodriguez-Gomez, V., et al. 2017, MNRAS, 468, 207
117. Snyder, G. F., Rodriguez-Gomez, V., Lotz, J. M., et al. 2019, MNRAS, 486, 3702
118. Springel, V., White, S. D. M., Tormen, G., & Kauffmann, G. 2001, MNRAS, 328, 726
119. Springel, V., Pakmor, R., Pillepich, A., et al. 2018, MNRAS, 475, 676
120. Srisawat, C., Knebe, A., Pearce, F. R., et al. 2013, MNRAS, 436, 150
121. Sutherland, R. S., & Dopita, M. A. 1993, ApJS, 88, 253
122. Tan, M., & Le, Q. V. 2020, arXiv e-prints [arXiv:1905.11946]
123. Teyssier, R. 2002, A&A, 385, 337
124. Tweed, D., Devriendt, J., Blaizot, J., Colombi, S., & Slyz, A. 2009, A&A, 506, 647
125. Ventou, E., Contini, T., Bouché, N., et al. 2017, A&A, 608, A9
126. Vogelsberger, M., Genel, S., Springel, V., et al. 2014a, Nature, 509, 177
127. Vogelsberger, M., Genel, S., Springel, V., et al. 2014b, MNRAS, 444, 1518
128. Walmsley, M., Ferguson, A. M. N., Mann, R. G., & Lintott, C. J. 2019, MNRAS, 483, 2968
129. Walmsley, M., Smith, L., Lintott, C., et al. 2020, MNRAS, 491, 1554
130. Walmsley, M., Lintott, C., Géron, T., et al. 2022a, MNRAS, 509, 3966
131. Walmsley, M., Slijepcevic, I., Bowles, M. R., & Scaife, A. 2022b, in Machine Learning for Astrophysics, proceedings of the Thirty-ninth International Conference on Machine Learning (ICML 2022), https://ml4astro.github.io/icml2822, 29
132. Walmsley, M., Allen, C., Aussel, B., et al. 2023, J. Open Source Softw., 8, 5312
133. Wang, L., Pearson, W. J., & Rodriguez-Gomez, V. 2020, A&A, 644, A87
134. White, S. D. M., & Rees, M. J. 1978, MNRAS, 183, 341
135. Whitney, A., Ferreira, L., Conselice, C. J., & Duncan, K. 2021, ApJ, 919, 139
136. Wilkinson, S., Ellison, S. L., Bottrell, C., et al. 2022, MNRAS, 516, 4354
137. Willett, K. W., Lintott, C. J., Bamford, S. P., et al. 2013, MNRAS, 435, 2835
138. Willett, K. W., Galloway, M. A., Bamford, S. P., et al. 2017, MNRAS, 464, 4176
139. Williams, R. J., Quadri, R. F., & Franx, M. 2011, ApJ, 738, L25
140. Woods, D. F., & Geller, M. J. 2007, AJ, 134, 527
141. Wright, A. H., Hildebrandt, H., Kuijken, K., et al. 2019, A&A, 632, A34
142. Zanisi, L., Huertas-Company, M., Lanusse, F., et al. 2021, MNRAS, 501, 4359
143. Zeiler, M. D. 2012, arXiv e-prints [arXiv:1212.5701]

1

Some attempts have been made to solve the reproducibility problem, but human classification is still needed (Walmsley et al. 2022a).

2

Bayesian optimisation is a hyper-parameter optimisation method that takes into account past evaluations when choosing the hyper-parameter set to evaluate next, and therefore, focuses on areas of the parameter space that will most likely produce the best validation scores.

3

Dataset containing a collection of 1.2 million labelled images with one thousand object categories (Russakovsky et al. 2015), from animals to everyday objects.

4

EfficientnetB0, pre-trained on GZ Evo as described in Walmsley et al. (2022b). This earlier GZ Evo version did not include classifications from HSC images, i.e. GZ Cosmic Dawn, which was not yet complete at the time of writing.

5

We applied weightings of [1, 1, 10, 1, 3] to each group of outputs (with respect to the list above).

All Tables

Table 1

Total number of galaxies in the training sample (TNG-train) and testing sample (TNG-test, Horizon-AGN, and HSC).

Table 2

Performance metrics as percentages (accuracy, precision, recall, and F1-score for the merger class and AUC) of the different methods on the TNG-test set for the binary classification task.

Table 3

Performance metrics as percentages (accuracy, precision, recall, and F1-score for the merger class, and AUC) for the different methods (trained on TNG) applied to the Horizon-AGN set, for the binary classification task.

Table 4

Performance metrics as percentages (accuracy, precision, recall, and F1-score for the merger class and AUC) of the different methods (trained on TNG) on the visual HSC set, for the binary classification task.

Table 5

Summary of the performance metrics as percentages (accuracy, precision, recall, and F1-score for the merger class) of the different methods (trained on TNG) on the visual HSC set, considering different scenarios for the non-merger class.

Table B.1

Summary of the ML and DL methods: Name or architecture, number of trainable parameters, whether one network is trained in the whole dataset or four networks are trained (one for each redshift bin), the size of the image in pixels, the size of the image in kpc, the scaling used in the images, the threshold used for the binary classification task, the kind of data augmentation used during training, any extra data that was used, whether the method performed the multi-class classification task, and a paper reference for the method.

Table C.1

Performance metrics as percentages (accuracy, precision, recall, and F1-score for the merger class) of the different methods on the TNG-test set per redshift bin, for the binary classification task. The best performance in each metric is highlighted in bold.

Table C.2

Performance metrics as percentages (precision and recall for the pre-mergers and post-mergers) of the different methods on the TNG-test set per redshift bin, for the multi-class classification task. The post-merger class includes the ongoing-mergers. The merger classes or stages are determined by the merger trees from the TNG simulation. The best performance in each metric is highlighted in bold.

Table C.3

Performance metrics as percentages (accuracy, precision, recall, and F1-score for the merger class) for the different methods (trained on TNG) applied to the Horizon-AGN set per redshift bin, for the binary classification task. The best performance in each metric is highlighted in bold.

Table C.4

Performance metrics as percentages (accuracy, precision, recall, and F1-score for the merger class) of the different methods (trained on TNG) on the HSC set per redshift bin, for the binary classification task. The best performance in each metric is highlighted in bold.

All Figures

thumbnail Fig. 1

Example real HSC galaxies in the three visually classified groups: major mergers (two leftmost columns), disturbed or minor mergers (two middle columns), and non-mergers (two rightmost columns). The first two groups are from Goulding et al. (2018), and the last group is from this work. Images have a physical size of ~160 kpc, displayed with an arcsinh inverted grey scale.

In the text
thumbnail Fig. 2

Steps used to create mock images for four randomly selected galaxies from TNG. From left to right are shown the raw simulated images, convolution with the HSC PSF, addition of Poisson noise, and injection into real sky background from HSC. Images have a physical size of 160 kpc, displayed with an arcsinh inverted grey scale.

In the text
thumbnail Fig. 3

Example mock HSC images of simulated galaxies from TNG and Horizon-AGN. Images have sizes in pixels of 320 (at 0.1 < ɀ < 0.31), 192 (at 0.31 < ɀ < 0.52), 160 (at 0.52 < ɀ < 0.76), and 128 (at 0.76 < ɀ < 1), corresponding to ~160 kpc at a given redshift, displayed using an arcsinh inverted grey scale.

In the text
thumbnail Fig. 4

Example mergers from TNG at different merger stages (obtained from the corresponding merger trees in the simulation): pre-mergers (−0.8 < dt < −0.1 Gyr), ongoing-mergers (−0.1 < dt < 0.1 Gyr), and post-mergers (0.1 < dt < 0.3 Gyr). Each row shows a galaxy along its merger sequence. Images have an approximate physical size of 160 kpc, displayed using an arcsinh inverted grey scale.

Fig. 5

Stellar mass (left) and redshift (right) distributions of the different simulated datasets (red: TNG300; yellow: TNG100; blue: Horizon-AGN). Combining TNG100 and TNG300 covers a broader range of stellar masses. The redshift distributions of the simulations are not designed to follow the observations.

Fig. 6

ROC for the TNG test set. The ROC curves show the overall performance of each method independently of the chosen classification threshold. The farther the curve is from the 1:1 line (which represents a random classifier) or the greater the area under the curve, the better the model. Method-3 (Zoobot) shows the best performance in terms of ROC.
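A curve like those in Fig. 6 is built by sweeping the classification threshold over the predicted merger probabilities; a minimal scikit-learn sketch with synthetic placeholder scores:

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)          # placeholder labels
p_merger = np.clip(0.25 + 0.5 * y_true + 0.3 * rng.normal(size=1000), 0, 1)

fpr, tpr, thresholds = roc_curve(y_true, p_merger)
print(f"AUC = {roc_auc_score(y_true, p_merger):.3f}")  # 0.5 = random, 1.0 = perfect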

Fig. 7

Precision (left) and recall (right) of the merger class as a function of redshift. The filled symbols correspond to the performance of the methods on the TNG dataset and the empty symbols correspond to those on the Horizon-AGN dataset. There is a slight downward trend in precision and a more significant drop in recall with increasing redshift for TNG. These trends are stronger for Horizon-AGN. While precision and recall are both relatively high for TNG, for Horizon-AGN recall drops much more than precision. All methods were trained on TNG and then applied to the Horizon-AGN dataset.

Fig. 8

Precision (left) and recall (right) of the merger class as a function of stellar mass for each method. The filled symbols show the metrics for the TNG dataset and the empty symbols for the Horizon-AGN dataset. For most methods and in both simulations, precision remains roughly constant with mass but decreases at M* > 10¹¹ M⊙. There is a sharp downward trend in recall with decreasing mass for both datasets.

Fig. 9

Confusion matrices for Method-1 (top left), Method-2 (top right), Method-3 (bottom left), and Method-4 (bottom right) for the multi-class classification task on TNG. The data from all four redshift bins are combined. The post-merger class includes the ongoing-mergers. The confusion matrices are normalised vertically, and therefore the diagonal elements represent the precision of each class. The recall of each class is shown in brackets.
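The vertical normalisation described above corresponds to normalising over the predicted labels, which places precision on the diagonal; normalising over the true labels would place recall there instead. In scikit-learn terms (class labels and arrays are placeholders):

from sklearn.metrics import confusion_matrix

classes = ["non-merger", "pre-merger", "post-merger"]
y_true = ["non-merger", "pre-merger", "post-merger", "pre-merger", "non-merger"]
y_pred = ["non-merger", "pre-merger", "pre-merger", "post-merger", "non-merger"]

# normalize="pred": columns sum to 1, diagonal = precision of each class.
cm_precision = confusion_matrix(y_true, y_pred, labels=classes, normalize="pred")
# normalize="true": rows sum to 1, diagonal = recall of each class.
cm_recall = confusion_matrix(y_true, y_pred, labels=classes, normalize="true")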

Fig. 10

ROC for the different methods (trained on TNG) applied to the Horizon-AGN set. Method-1 (RF) and Method-2 (Swin) show the best performance in terms of ROC, although the differences from the other methods are small.

Fig. 11

Confusion matrices for Method-1 (top left), Method-2 (top right), Method-3 (bottom left), and Method-4 (bottom right) for the multi-class classification task on Horizon-AGN (from methods trained on TNG). The data from all four redshift bins are combined. The post-merger class includes the ongoing-mergers. The confusion matrices are normalised vertically, and therefore the diagonal represents the precision of each class. The recall is shown in brackets.

Fig. 12

ROC for the different methods (trained on TNG) applied to the HSC set. Method-4 (CNN1) shows the best performance in terms of ROC.

Fig. 13

Precision (left) and recall (right) as a function of redshift, using HSC visual classifications of major mergers and non-mergers as true labels. For all methods, precision is higher than on the TNG training sample because in this case there is a clearer distinction between the two classes. All methods were trained on TNG and then applied to the HSC dataset.

Fig. 14

Galaxies for which all methods (trained on TNG) predict the wrong class when applied to HSC. False positives (left) are galaxies visually classified as non-mergers that all methods predict as mergers. False negatives (right) are galaxies visually classified as mergers that all methods predict as non-mergers.

Fig. 15

Precision for the HSC set using visual labels, for different definitions of the negative (non-merger) class: visually classified non-mergers (clear separation between the classes), visually classified non-mergers plus disturbed galaxies, and disturbed galaxies only. The precision for all methods drops when the separation between the classes is smaller. Recall is not shown as by definition it does not vary. All methods were trained on TNG and then applied to the HSC dataset.

Fig. 16

Fraction of major mergers as a function of redshift (left) and stellar mass (right), for each method on real HSC observations. The solid (dashed) black line shows the fraction of major mergers found in TNG100 (TNG300). The dotted black line shows the fraction of mergers for Horizon-AGN. All methods were trained on TNG and then applied to the HSC dataset.

Fig. A.1

Stellar mass distribution, circularised half-light radius, and Sérsic index for each simulation (TNG100 and TNG300 combined, and Horizon-AGN). The left column shows the mergers (as determined by the merger trees in the simulations) and the right column the non-mergers. The vertical lines show the mean value of each distribution.

Fig. A.2

Gini–M20 relationship for mergers (left) and non-mergers (right), for the two simulations (TNG100 and TNG300 combined, in yellow; Horizon-AGN in purple).
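The Gini statistic follows the definition of Lotz et al. (2004) and measures how evenly a galaxy's flux is distributed over its pixels; a minimal numpy sketch over the (hypothetical) segmentation-map pixel fluxes, noting that full implementations of both Gini and M20 exist in packages such as statmorph:

import numpy as np

def gini(pixel_fluxes):
    # Lotz et al. (2004): G = sum_i (2i - n - 1) |x_i| / (mean|x| * n * (n - 1)),
    # with the |x_i| sorted in increasing order.
    x = np.sort(np.abs(np.asarray(pixel_fluxes, dtype=float)))
    n = x.size
    i = np.arange(1, n + 1)
    return np.sum((2 * i - n - 1) * x) / (x.mean() * n * (n - 1))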

Fig. D.1

Confusion matrices for Method-1 (top left panel), Method-2 (top right panel), Method-3 (bottom left panel), and Method-4 (bottom right panel) for the four-class classification task on TNG. The four classes are non-merger, pre-merger, ongoing-merger, and post-merger (as determined by the merger trees in the simulation). The data from all four redshift bins are combined. The confusion matrices are normalised vertically; therefore, the diagonal represents the precision of each class. The recall is shown in brackets.
