Issue |
A&A
Volume 667, November 2022
|
|
---|---|---|
Article Number | A144 | |
Number of page(s) | 16 | |
Section | Numerical methods and codes | |
DOI | https://doi.org/10.1051/0004-6361/202244116 | |
Published online | 21 November 2022 |
Random Forest classification of Gaia DR3 white dwarf-main sequence spectra: A feasibility study
1
Departament de Física, Universitat Politècnica de Catalunya,
c/Esteve Terrades 5,
08860
Castelldefels, Spain
e-mail: santiago.torres@upc.edu
2
Institute for Space Studies of Catalonia,
c/Gran Capità 2–4, Edif. Nexus 104,
08034
Barcelona, Spain
Received:
25
May
2022
Accepted:
18
August
2022
Aims. The third Gaia data release provides low-resolution spectra for around 200 million sources. It is expected that a sizeable fraction of them contain a white dwarf (WD), neither isolated, or in a binary system with a main-sequence (MS) companion, that is a white dwarf-main sequence (WDMS) binary. Taking advantage of a consolidated Random Forest algorithm used in the classification of WDs, we extend it to study the feasibility of classifying Gaia WDMS binary spectra.
Methods. The Random Forest algorithm is first trained with a set of synthetic spectra generated by combining individual WD and MS spectra for the full range of effective temperatures and surface gravities. Moreover, with the aid of a detailed population synthesis code, we simulate the Gaia spectra for the abovementioned populations. For evaluating the performance of the models, a set of metrics are applied to our classifications.
Results. Our results show that for resolving powers above ~300 the accuracy of the classification depends exclusively on the S/R of the spectra, while below that value the S/R should be increased as the resolving power is reduced to maintain a certain accuracy. The algorithm is then applied to the already classified SDSS WDMS catalog, revealing that the automated classification exhibits an accuracy comparable (or even higher) to previous classification methods. Finally, we simulate the Gaia spectra, showing that our algorithm is able to correctly classify nearly 80% the synthetic WDMS spectra.
Conclusions. Our algorithm represents a useful tool in the analysis and classification of real Gaia WDMS spectra. Even for those spectra dominated by the flux of the MS stars, the algorithm reaches a high degree of accuracy (60%).
Key words: white dwarfs / binaries: general
© D. Echeverry et al. 2022
Open Access article, published by EDP Sciences, under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
This article is published in open access under the Subscribe-to-Open model. Subscribe to A&A to support open access publication.
1 Introduction
The advent of large astronomical databases has surpassed the human capability of their direct analysis. Since the arrival of automated missions such as Hipparcos in the 1990’s (Turon et al. 1992), large-scale projects such as the SuperCosmos Sky Survey (Hambly et al. 1998), the Sloan Digital Sky Survey (SDSS) (York et al. 2000), the Pan-STARRS collaboration (Kaiser et al. 2002), the RAVE Survey (Zwitter et al. 2008) or the Large Sky Area Multi-Object Fibre Spectroscopic Telescope (LAMOST) (Zhao et al. 2012), among other examples, have provided an unprecedented wealth and quantity of information. In particular, the current Gaia mission in its data release 3 (DR3; Gaia Collaboration 2021a), provides information for about nearly 2 billion sources. Data mining and machine learning methods and, in general, artificial intelligence techniques have therefore become an essential tool for handling such a large amount of data. In this sense, we can mention some pioneering works using these techniques in a wide variety of subjects, such as automatically discrimination of stars from galaxies (Bazell & Peng 1998), classification of galaxies according to their morphology (Naim et al. 1995), estimation of the fraction of binaries in stars clusters (Serra-Ricart et al. 1996), classification of main-sequence populations in the Hipparcos Input Catalogue (Hernandez-Pajares & Floris 1994) or white dwarf (WD) classification in its Galactic components (Torres et al. 1998). Since then, numerous approaches have been proposed, mainly based on artificial neural networks, decision trees, discriminant analysis, and many others (see, Ball & Brunner 2010, for a recent review). Among them, the Random Forest algorithm is one of the most promising techniques due to its versatility, robustness, and easiness of implementation as we will demonstrate along this paper.
On the other hand, WDs are the most common remnant of intermediate stars (≤8–11 M⊙). Their theoretical properties are reasonably well understood (e.g., Althaus et al. 2010) and, as WDs can be very old objects, they can be used as reliable cosmochronometers carrying out a valuable information about the past history and evolution of our Galaxy (see, García-Berro & Oswalt 2016, for a review). However, two major drawbacks arise from the fact that WDs are high dense stellar objects. First, their large surface gravity leads to a broadening of the Balmer’s spectral lines thus hindering accurate radial velocity determinations. Second, this same high surface gravity causes the sink of metals in the deep interior of WD atmospheres erasing any trace of the metallicity of the star. However, this lack of information can be retrieved if the WD is in a binary system, in particular, in a white dwarf plus a main-sequence (WDMS) star. In these cases, radial velocities or metallicities can be indirectly derived from the WD companions (Rebassa-Mansergas et al. 2016a, 2021; Raddi et al. 2022). Moreover, WDMS systems that arised from common envelope evolution are key to understand several important phenomena of our Galaxy, such as supernova type I progenitors, cataclysmic variables, low mass X-ray binaries and, in general, close binary evolution.
From an observational point of view, there exist several techniques for the identification of WDMS systems. If both objects are physically enough separated, a fact that happens in wide systems that usually avoided mass transfer episodes, they can be spatially resolved. Common-proper motion pairs identification is a widely use technique that can be applied in that situation. For instance, in the recent data provided by the Gaia mission in its early data release three (EDR3), around 16 000 WDMS systems have been identified in this way within 1 kpc from the Sun (El-Badry et al. 2021). However, if the systems are close enough, indicating that they likely evolved through a common-envelope episode, they are hard to identify as long as the brightness of one of the components overwhelms the other. In a recent study, Rebassa-Mansergas et al. (2021) identified 112 WDMS candidates in a nearly-complete volume sample up to 100 pc from the Sun. The identification, based on the photometry of the objects and the accurate parallaxes provided by Gaia allowed them to fit the Spectral Energy Distribution (SED) of both components and, thus, to derive its main stellar parameters. Although with this method is expected to retrieve 80% of the unresolved WDMS binary underlying population, it is only circumscribed to a particular region of the Hertzsprung-Russell (HR) diagram. Outside this region, one of the components dominates over the other and it becomes impossible to individually identify the two components. The same authors estimate that this happens for ~91% of the systems. One way to overcome this problem is to make use of eclipsing systems, which reveal the presence of out-shined companions through eclipses in their light curves (Pyrzas et al. 2012; Parsons et al. 2012, 2017; Casewell et al. 2020). Unfortunately, the percentage of eclipsing systems is low among WDMS systems (≃ 10% Parsons et al. 2013). Although spectra of unresolved WDMS are in principle also subject to these drawbacks, the use of artificial intelligence algorithms may help in differentiating single stars from binaries in which one of the components is out-shined.
In this sense, the largest spectroscopic WDMS binary catalog to date has been obtained from SDSS and LAMOST data (Rebassa-Mansergas et al. 2016b; Ren et al. 2018, and references therein). It contains ≈4100 systems (≈3200 from SDSS and ≈900 from LAMOST). The identification, mainly based on a χ2-fit and a wave-let transform with a human visual inspection of individual spectra, results to be a robust and efficient method. However, it is expected that Gaia will provide an equivalent number of WDMS spectra, thus implying a huge consumption of computational and human time. In this sense, automated classification methods become essential. Moreover, as mentioned above, WDMS binary spectra in which one of the components dominates the SED are nearly impossible to identify via χ2-fit/wavelet plus visual inspection methods, and it becomes important to evaluate whether or not artificial intelligence methods can help in overcoming this problem.
Machine learning techniques have been applied to spectral analysis (e.g., Bailer-Jones et al. 1998) and, in particular, Random Forest algorithms to the identification of stellar objects (Pérez-Ortiz et al. 2017; Plewa 2018). Regarding WDs, the Random Forest algorithm has been proved as an efficient tool in the classification of their Galactic components (Torres et al. 2019) or their differentiation from MS stars (Gaia Collaboration 2021b). Here we move a step forward and we aim to explore the capabilities of the Random Forest algorithm when applied to a continuous system, such as it is an spectrum, with a large (theoretically nearly infinite) number of features. Thus, in this paper we explore the possibilities of the Random Forest algorithm in the identification of WDMS systems through their spectra within the Gaia DR3 context.
The paper is organized as follows. In Sect. 2 we introduce the Random Forest algorithm and define its main parameters. Section 3 is devoted to a general theoretical analysis of the Random Forest algorithm. The method is then compared and checked in Sect. 4 with a sample of already labeled objects provided by SDSS. In Sect. 5 we study the capabilities of the method by emulating synthetic spectra for Gaia mission. Finally, the major conclusions and results are summarized in Sect. 6.
2 The Random Forest algorithm
The Random Forest algorithm is a well known supervised ensemble machine learning method, widely used for classification purposes (Breiman 2001). An initial labeled sample is needed in order to train the model. Once trained, the algorithm is tested in the labeled sample and if the accuracy of the model is good enough it can be applied to an unlabeled sample. The base of the Random Forest algorithm is the Decision Tree algorithm, in which the different branches are weighted with a certain probability. The algorithm chooses the attribute from the labeled data that best splits the subset by minimizing a certain function (typically an entropy or Gini index). As a result of this training process, branches are weighted, thus being possible to apply the algorithm to an unlabeled sample for its classification. A detailed explanation of how the Random Forest works in the classification of the WD population can be found in Torres et al. (2019). Additionally, for the reader who is not familiarized with Random Forest hyperparameters and definitions associated to the algorithm, we provide in Appendix A a short description of the most relevant ones used in the present work.
3 Theoretical classification of white dwarf-main sequence systems
3.1 The synthetic sample
In order to train the Random Forest algorithm we built synthetic samples of WDs, MS stars, and WDMS binaries. We used the Phoenix MS star library (Husser et al. 2013) and collected spectra of three different surface gravity values: 4.5, 5, and 5.5 dex. Their temperatures ranged from 2500 to 4200 K in steps of 100 K, thus representing a total of 54 MS star model spectra that perfectly cover the stellar parameters for M-type and late K-type stars which are the typical companions of WDs. Hotter MS stars are disregarded given that they are expected to outshine the WDs at optical wavelengths, leading to a practical impossibility to disentangle the two components in case of WDMS binary systems.
Regarding WDs, we used the spectral library from Koester (2010), containing thirteen surface gravity values, from 6.5 to 9.5 dex in 0.25 dex steps. Their temperatures ranged from 6000 K to 10 000 K in 250 K steps, from 10 000 to 30 000 K in 1000 K steps, from 30 000 to 70 000 K in 5000 K steps, and from 70 000 to 10 0000 K in 10 000 K steps. In total, we ended up with 611 WD model spectra that represents the space parameter of practically all observable WDs.
The best performance of the Random Forest is achieved for balanced data sets (Breiman 2001). Consequently, we generated additional MS spectra by interpolating the temperatures for each surface gravity ranging from 2500 to 4200 K in 6.25 K steps. In total we thus obtained 819 MS spectra, from which 615 uniformly distributed were selected.
For WDMS binaries there are not available simulated spectra, hence we created them by just adding the fluxes from the MS and the WD model spectra previously introduced rescaled at a fixed distance of 200 pc. In Fig. 1 we show a typical example of a WD spectra (effective temperature and surface gravity, 9250 K and 7.75 dex, respectively; green line), a MS spectra (2700 K and 5.5 dex; red line) and the resulting WDMS combined spectra (blue line). As seen, the blue part of the spectrum is dominated by the WD, while the red part has clear M star spectral lines. With the objective of having a balanced data set, only 729 were generated with 27 MS and 27 WD stars uniformly distributed along all the range of temperatures and surface gravities.
![]() |
Fig. 1 Example of a WDMS binary spectrum (blue line) obtained from the combination of the individual spectra of a WD (green line) and a MS star (red line). The effective temperatures and surface gravities are. respectively, 9250 K and 7.75 dex for the WD, and 2700 K and 5.5 dex for the MS star. As seen, the blue part of the spectrum is dominated by the WD, while the red part has clear M star spectral lines. |
3.2 Pre-procesing and data preparation
The WD, MS, and WDMS synthetic spectra created so far are virtually at an infinite resolution and are not affected by noise. Given that this is not the case for real spectra, we need to degrade our synthetic spectra according to a certain resolution and a given signal-to-noise ratio (S/R).
When expressing the spectral resolution of an instrument, the concept of resolving power, R, is widely used, defined as:
(1)
where λ is a considered wavelength, and Δλ is the difference between peaks that can be distinguished, that is the resolution – this is for example the measure used by the SDSS spectrographs (York et al. 2000), where the resolution Δλ is calculated at full width at half maximum.
The noise was introduced in the synthetic spectra by applying a white noise affecting all of the wavelengths and having a constant power spectral density (although a more specific noise model will be introduced for the SDSS case, see Sect. 4.1, or the Gaia case, see Sect. 5.1 and Appendix B). For this initial analysis, we adopted an additive white Gaussian noise (AWGN) centered at the signal value, µsignal, and with standard deviation σnoise which depends on the S/R. That is, the S/R for our synthetic model spectra is calculated as:
(2)
Taking the previous definitions into account, we varied the resolving power from 1 to 83 000, and the S/R from 0.25 to 100. The resolution values are set to cover a wide range of resolving powers that could be possibly achieved by real spectrographs. The lower limits on the values of S/R and resolving power are chosen to analyze how the Random Forest performs with poorly resolved and noisy data. This analysis can therefore serve to future studies as an estimation of the possible classification results. A summary of the models with the corresponding range of parameters analyzed in this work is shown in Table 1.
Summary of the number of models and the range of parameters used to generate our synthetic spectra for the MS, WD, and WDMS populations.
3.3 Cross-validation and hyperparameter adjustment
Before building the Random Forest model, we must establish the method of crossvalidation to be used as well as to set the hyperparameters of the model. In our case we apply the ski-kit tool train_test_split which automatically and randomly divides the input data set and its classes in two groups, training set and testing or validation set. We adopt a training to testing ratio of 70:30 and an average of 100 simulations is done, ensuring an excellent performance (i.e., Accuracy > 0.99) and avoiding cross-validation effects.
Regarding the set of best hyperparameters to be used, we start with the default values and individually analyze the performance as a function of each hyperparameter. It is noticeable that some of them depend on the number of features that the spectra have (i.e., their resolution). Consequently, it would not be effective to study the hyperparameters for each one of the resolutions, as they could change the interpretability of the results. Therefore, we choose as our reference case the maximum resolving power corresponding to 375 000 flux values. It is also worth saying that other procedures such as grid search lead to similar results, although implying a higher computational cost and a slighter improvement of the performance. Further information of the set of hyperparameters used is described in Appendix A and the values adopted are shown in Table A.1.
3.4 General performance as a function of R and S/R
First of all we analyzed the performance of the Random Forest algorithm when applied to the simulated synthetic spectra for different R and S/R values. In particular, we varied R from 1 to 83 000 for 30 evaluated points, and S/R in the range from 0.25 to 100 for 21 points. In total, we obtained 630 cases where the Random Forest performance was assessed, being the result of each one of those cases the average of 100 realizations.
The first metric to be analyzed is the general accuracy of the Random Forest algorithm. In Fig. 2, we show a density map of the accuracy (gray scale quantified in 5% steps) as a function of R and S/R.
An initial inspection of Fig. 2 reveals that the accuracy ranges from a minimum value of 37.5% for low R and low S/R to a maximum of 96% for high values of R and S/R. In a closer look, we observe that for values of R below ≈300, we need to increase the S/R as we decrease R if we want to maintain a constant value of the accuracy. This implies that for low resolution spectrographs the higher S/R as possible is needed, otherwise the Random Forest algorithm will not be able to achieve an acceptable performance. On the contrary, for values of R above 300 the accuracy score mainly depends on the S/R. This is a remarkable result, because if a Machine Learning algorithm is used for spectral classification purposes similar to those treated here in future scientific space missions, the use of a very high resolution spectrograph would not be required, since with a low resolution the same results seemed to be assured.
Complementary to the accuracy metric, in Fig. 3 we represent the F1-score (also known as balanced F-score) as function of R and S/R for the MS, WD, and WDMS populations (left, middle, and right panels, respectively). This metric is the weighted average of the recall and the precision (see the Appendix A for further details). In principle, we observe the same general trend in all three populations. That is, for R values above ≈300 the performance of the Random Forest is independent of the S/R. In particular, the WD population is the best classified by the algorithm.
![]() |
Fig. 2 Accuracy density map as a function of R and S/R. Minimum and maximum values of the accuracy are indicated in red in the gray scale. |
3.5 Performance at specific values of R and S/R
After analysing the overall performance of the Random Forest as a function of a wide range of R and S/R values, we focused on three specific cases1:
- (i)
Case 1, high resolving power and high S/R: R = 80 000 and S/R = 100.
- (ii)
Case 2, intermediate resolving power and medium S/R: R = 1800 and S/R = 10.
- (iii)
Case 3, low resolving power and low S/R: R = 40 and S/R = 1.
Case 1 represents the most ideal scenario in which we not only have high-resolution spectra but also high S/R. This resolving power can be achieved, for instance, at the Galileo Telescope located in La Palma using the HARPS spectrograph. The intermediate case (Case 2) is very similar to the spectra provided by the SDSS public data base (see Sect. 4). Case 3 covers the worse case scenario of both very low R and S/R. This combination is unlikely in real observations but it is indicative of the lowest expectations the user can get from the Random Forest performance.
In Fig. 4 we show the confusion matrices (top panels) for the three cases under study and the corresponding ROC curves (bottom panels). The confusion matrices have been normalized by each population, that is, the numbers shown are the ratios of samples per population (the colour bar in the right provides this ratio). For the Case 1 (left panel), high resolving power and high S/R, the algorithm performance is excellent. The population with the worst percentage of well-classified objects in this case is the MS population, with 93.7% of the samples being correctly classified and only 6.35% wrongly assigned to the WDMS population. Attending to WD stars, we have an almost perfect classification, with only 2% of the population being mistaken as WDMS. For the WDMS population, the ratio of well-classified objects is also very high, with only 5% of misclassified samples: 3.5% as MS and 1.17% as WDs. This excellent performance is reinforce by the ROC curve (bottom-left panel), which we recall represents the True Positive and False Positive rates (TPR and FPR, respectively; see Appendix A for further details). These rates are the recall or sensitivity of the algorithm and the probability of false alarm, respectively. We can check that for all three populations the curve is almost a 90-degree angle, which represents the perfect case.
Case 2 (middle panel), which represents a more common situation, continues to provide an excellent classification for MS and WD stars, 93 and 99.3% correctly classified, respectively. However, the percentage of well-classified WDMS decreases to 74%. The missclassified spectra is almost equally distributed between the MS and WD populations, with around 12–13% for each. This deterioration of the results is probably due to the lower S/R, according to the accuracy and F1-score maps of the previous section (see Figs. 2 and 3). Regarding the ROC curves for this case (bottom-middle panel), we can observe very sharp curves for the MS and WD populations as well as for the macro-average, but the AUC scores denounce substantial worse scores than in the previous case (see Appendix A for a definition of the previous parameters). This is because even though the populations are very well classified, there is a certain rate of False Positives (WDMS classified as single stars). The WDMS curve is now visibly weaker than the others due to the worse performance of the Random Forest for this population.
Finally, we analyzed the worst possible scenario, Case 3, with a low R and S/R. Even in this case the confusion matrix (top-right panel) reveals for the single population a very high ratio of well-classified objects, with 80% for the MS and a stunning 91.6% for the WD population. This implies that even though the data are affected by a strong noise, the algorithm is still able to recover the single populations most of the times. It is not the same for the WDMS binaries, where the recall collapses to 50%. However, this value is still larger than a random classification (showed in the ROC curves by a diagonal dashed line), which indicates that the algorithm is still able to extract some information from the spectra. It is worth noting that 30% of miss-classified WDMS objects are labeled as WDs, while only 20% are assigned to the MS category. This fact is indicative of how well the WDs are classified, prioritizing this population at the expense of the WDMS binary population, which is often mistaken.
![]() |
Fig. 3 F1-score density maps as a function of R and S/R for the MS (left panels), WD (centrαl pαnel), and WDMS (right pαnel) populations. Minimum and maximum values of the respective score are marked in red in the gray scale. |
![]() |
Fig. 4 Confusion matrices (top panels) for the three cases under study and the corresponding ROC curves (bottom panels). |
3.6 WDMS binary performance depending on the intrinsic stellar parameters
The Random Forest algorithm can provide us with a valuable bunch of information. For instance, we can recover the information of how WDMS binary systems are classified (or misclassified) according to the stellar parameters of their components, that is for example, the WD and MS effective temperatures and surface gravities.
In the top panels of Fig. 5 we show the WDMS recall density map as a function of the MS component (left panel) and WD component (right panel) stellar parameters for our Case 2. The gray scale on the right of the panels indicates the WDMS recall (from 0 to 1 in steps of 5%) and the minimum and maximum values are marked in red. In general, the recall score ranges between 50 and 100% depending on the MS star parameters and between 10 and 100% depending on the WD parameters. The maximum WDMS recall is achieved when the MS component reaches effective temperatures above 4100 K regardless of the surface gravity. This is because at these effective temperature values the main sequence components are K-type stars or earlier, which implies the resulting WDMS binary spectra resemble the spectra of single MS stars. Since the training sample contains WDMS spectra dominated by the flux of the main sequence stars, all WDMS binaries with MS components hotter than 4100 K ended up being correctly classified. On the other hand, WDMS are better recovered when the WD components have effective temperatures above 30 000 K and surface gravities over 8.5 dex. This combination of parameter values, result in WD spectra with fluxes that are of similar contribution to most of their MS companions, thus implying that the two components contribute significantly to the spectral energy distribution. The same situation takes place when considering WD components of effective temperatures between ≃ 10 000 and 20 000 K, and surface gravities between ≃ 7.5 and 8.5 dex.
The darker areas, indicative of low recall, are concentrated at MS star effective temperatures below 2700 K and WD effective temperatures below 10 000 K and surface gravities under 9 dex. In these cases one of the components is completely outshined by the other, leading to binary system spectra of very similar features as single MS stars or WDs. It is also interesting to see that for MS temperatures between 3200 and 3500 K, the fluxes are on average comparable to the WD ones, so the resulting WDMS spectra are well classified. However, from 3600 to 4000 K the fluxes of the main sequence spectra start to overtake the WD and, as a consequence, the algorithm miss-classifies the WDMS as MS stars.
Complementary to the previous analysis, we show in the bottom panels of Fig. 5 a density map indicating to which wrongly class WDMS systems have been assigned. That is, in the bottom left panel we can confirm that WDMS are miss-classified as MS stars when their effective temperatures are above 3200 K, and as WDs for temperatures below that value regardless the surface gravity. On the bottom right panel, we observe in this case that the miss-classification depends as well on the WD surface gravity. When the WD component is cold, the WDMS systems are miss-classified as MS stars, however for effective temperatures above 10 000 K and low surface gravity values, the binary system can be wrongly identified as a WD.
Finally, it is noticeable that the MS surface gravity does not play an important role in the WDMS classification process. This fact is in part due to the small range of surface gravity values, only 1 dex for MS stars, in comparison with 3.0 dex for WD stars. Moreover, the mass-radius relationship is practically linear for MS stars and inversely proportional to the cube of the mass for degenerate objects. Thus, for a fixed effective temperature, a stronger dependence on the surface gravity is expected in WD spectra rather than in MS spectra.
![]() |
Fig. 5 Parameters density maps. Top panels: WDMS recall density map as a function of the MS (left panel) and WD component (right panel) stellar parameters. Bottom panels: WDMS miss-classification as a function of the MS (left panel) and WD (right panel) stellar parameters. |
3.7 Feature importance
The Random Forest algorithm through its ensemble of decision trees assigns a Gini function (or also equivalently an entropy function; see Appendix A) to each feature (in our case, wavelength). Thus, after the training process has been accomplished, we can deduce which of these features have been mastered the classification process, that is, the feature importance. In Fig. 6 we plot the Gini function importance as a function of wavelength for our Case 2. The first thing to remark is the higher importance of the extreme wavelengths, corresponding to the blue and red limits of the spectra. This is a consequence that both types of stars, WD and MS, generally emit more at red and blue wavelengths, respectively. However, the larger importance value is achieved for wavelengths around 5000 Å. This wavelength can be related to a common transition found on MS stars: the doubly ionized oxygen transition or O III. Additionally, some other relevant wavelengths have been found in the range from 4000 to 5000 Å. We clearly identify the Hδ, Hγ, and Hβ Balmer lines, located at ≃4100, ≃4300, and ≃4900 Å, respectively.
Finally, and additional implication can be derived from Fig. 6. As observed, there exist three ranges of wavelengths located at 3500, 5000, and 11 000 Å, which account for the maximum information in the classification process. This would imply that three photometric pass-bands covering these wavelengths – for instance, Sloan (u′, g′, z′), Johnson-Cousin (U, B, I) or Hubble Space Telescope (F336W, F450W, F814W) –, may also provide a good solution in the classification of WD, MS, and WDMS populations.
![]() |
Fig. 6 Feature importance as function of the wavelenght for our Case 2. As a visual reference we marked as vertical dashed lines the doubly ionized oxygen transition O III, and the Hδ, Hγ, and Hβ Balmer lines transitions on WDs. |
4 Random Forest classification of SDSS spectra
So far we have analyzed the performance of the Random Forest for classifying synthetic spectra. The next step is to test the algorithm with observed, already labeled, data. We take advantage of the observed data collected by the SDSS belonging to MS, WD, and WDMS populations.
4.1 The SDSS spectra
The Sloan Digital Sky Survey (SDSS) (York et al. 2000; Gunn 2006; Wilson 2019) is one of the most detailed and largest astronomical surveys ever made, reaching more than three million of observed spectra. SDSS spectra are obtained with a 2.5 m optical telescope located at the Apache Point Observatory in New Mexico using a pair of fiber-fed double spectrograph covering the wavelengths from ~3800 to ~9200 Å with a resolving power of 1800 ~ 2200. The SDSS contains the most complete spectroscopic catalogs of MS, WD, and WDMS binaries (West et al. 2011; Kepler et al. 2021), which were classified in the case of WDMS binaries thanks to different techniques that involve human supervision (Rebassa-Mansergas et al. 2007, 2010, 2012, 2016b).
From the available catalogs, we extracted a practically balanced sample containing 2340 MS, 2031 WD, and 2552 WDMS binary spectra - that is, a total of 6923 labeled spectra. These samples were selected in a way that contained spectra representative of all possible stellar parameter values. The resolving power is around 1800 ~ 2200 depending on the used spectrograph, while the wavelength range of the spectra can vary slightly among the spectra. In order to normalize the sample we introduced a cut from 3850 to 9150 Å. Hence, we obtain a reliable sample where to validate the performance of our Random Forest algorithm. All SDSS spectra have associated flux errors, which can be used to derive an approximate value of the S/R for each spectrum as follows:
(3)
where i refers to each one of the wavelengths available in the spectrum, fluxi is the value of the flux for that wavelength and errori the error estimation at that certain wavelength. The total number of wavelengths is represented by n, where n = 3760. The range of S/R for the chosen SDSS spectra goes from 0.45 in the worst case to 100. In Fig. 7 we depict the S/R distribution for our selected SDSS objects. Most of them have S/Rs below 50, with an average at around 11. It is worth noting that there is a large fraction of objects with a S/R as low as 5. This fact could be a problem when classifying the spectra, as we have seen in Sect. 3.4 that with a resolving power higher than 300 the performance of the algorithm only depends on the S/R.
![]() |
Fig. 7 S/R histogram distribution of the SDSS spectra. |
4.2 Classification results
All the previously observed selected objects constitute the testing set of the classification process. That is, they are already labeled objects that we want to reclassify using our Random Forest algorithm. We adopted a training set formed by simulated spectra at a resolving power of 1800, typical of SDSS, and a set of 10 different values of S/R, 1, 3, 5, 7, 10, 15, 20, 40, 60, and 100, in coherence with the observed spectra. Therefore, from the starting 615 MS, 611 WD, and 729 WDMS synthetic spectra – 1955 in total –, we obtained a training set composed of 19 550 simulated spectra. By doing this, we assure that the algorithm is fed with all of the possible spectra that are expected to be found in the SDSS observed sample. In each case we added white Gaussian noise as explained in Sect. 3.2, with a null mean and a standard deviation according to the S/R.
The Random Forest algorithm is set with the adopted default parameters applied in Sects. 3.4 and 3.5. Averaged values are obtained through 100 Random Forest predictions, thus minimizing deviated classifications and allowing us to compute the probability estimates needed for the ROC curves.
The first metric to be analyzed is the confusion matrix shown in Fig. 8. The three rows indicate the label assigned by the SDSS spectra, while in the columns we find the Random Forest predicted label. The WD population shows a nearly perfect agreement between the Random Forest and the SDSS classification. The WDMS has also a higher recall, almost a 90% with most objects being miss-classified as WD. Finally, the MS population is the one that presents the lowest agreement with the SDSS labels, with almost 15% of the objects assigned to the WDMS binary population.
It is important to remark here that the SDSS spectra classified as WDMS binaries clearly display in the vast majority of cases the two components and, as a consequence, they are undoubtedly WDMS stars. Thus we can assume that these objects are correctly classified by SDSS and take them as a reference. We expect that our algorithm can also distinguish the two components. However, there are around 10% of the objects miss-classified mainly as WDs.
Complementary to the accuracy metric, in the right panel of Fig. 8 we show the ROC curve for the three populations under study and the corresponding macro-average curve. For the MS population (red line), we find that with a low threshold we immediately get a very high recall. This means that the objects that have been classified nearly 100% of the times as MS, will surely coincide with the SDSS label. For the WD population (blue line) the algorithm and the SDSS label agree in all cases, however, the Random Rorest classifies some WDMS binaries as single WDs. Lastly, the WDMS (green line) curve has the worst performance, and only reaches a recall of around 90% but with a high FPR. That is, if we want the Random Forest to correctly identified 90% of WDMS binaries, we must give up on almost 25% of other objects that will be miss-classified as WDMS.
![]() |
Fig. 8 Confusion matrix (left panel) and ROC curves (right panel) of the SDSS spectra set for the initial Random Forest model. |
4.3 Classification improvements
So far, the overall performance of the Random Forest algorithm can be considered as notably good, with an accuracy of 0.904. In order to achieve a higher accuracy and thus a better agreement between the SDSS labels and those provided by the Random Forest, we performed an analysis of the hyperparameters and tested different configurations of the Random Forest. A summary of the modifications of the initial model is detailed as follows:
- (i)
Probability estimates threshold. We introduced a 99% threshold for the MS population, and a 95% for both WDs and WDMS systems. Setting a threshold of the probability estimates for each population improves the classification at the expense that some spectra remain unlabeled. From the 6923 spectra that were in the test set, 6126 have been classified – almost 800 do not accomplish the threshold condition. From those that remained unlabeled, 487 were MS stars, 60 were WDs, and 250 were WDMS.
- (ii)
Limiting the stellar parameter range. Our simulated spectra cover all possible values of effective temperatures as defined from our model grids (see Table 1). However, that is not the case for the observed WDMS SDSS sample since it is heavily affected by selection effects (Rebassa-Mansergas et al. 2010). We thus limited our synthetic sample to the range of values covered by the observed SDSS WDMS population, that is, effective temperatures from 2500 to 3800 K for MS stars and from 7000 to 80 000 K for WDs.
- (iii)
Noise distribution. In real SDSS spectra, the noise contribution does not affect all the wavelengths in the same way. Conversely, the noise level at the blue and red edges of the spectra are generally higher due to the throughput of the spectrographs, which have less efficiency in those regions. Thus, we applied a grey type noise so that its contribution is stronger at the blue and end limits, featuring in this way the observed spectra.
In the left panel of Fig. 9 we present the confusion matrix obtained when the previous improvements have been taken into account in our Random Forest algorithm. The overall accuracy increases to 0.972 (close to a perfect agreement with the SDSS classification labels), with very low off-diagonal percentages (the maximum is 3.5%). Due to the limiting parameter range, the MS stars reach a 98% recall, and the new noise distribution allows the WDMS objects to be retrieved 96% of the times. All the populations are over the 2σ confidence level, reassuring the validity of the Random Forest algorithm. The corresponding ROC curves are shown in the right panel of Fig. 9. The MS population (red line) reaches the perfect case, while the others have areas around 0.98 and 0.99 for the WDMS (green line) and WDs (blue line), respectively.
Of particular interest among the few disagreements are 75 spectra classified as WDMS by the Random Forest that are previously labeled as single WDs (40) or as single MS stars (35). The obvious question is then whether these spectra belong to true single WDs and MS stars or they are WDMS binaries dominated by the flux of one of its components (as indicated by the Random Forest label). A visual inspection of the 75 SDSS spectra2 reveals that, in few cases, the two components can actually be discerned. For the remaining ones we do not have sufficient information at hand that robustly allows us to confirm or disprove either hypothesis. Follow-up observations, for example, for testing radial velocity variations, would be desired for these objects to test their possible WDMS binary nature.
The perfect agreement achieved, after some improvements introduced in our Random Forest algorithm, leads us to conclude that the artificial classification model performance can be totally equivalent to the previous classification methods applied to SDSS spectra which require a certain level of human expert visualization.
![]() |
Fig. 9 Same as Fig. 8 but for the final Random Forest model. As shown, the agreement between our Random Forest algorithm and the SDSS classification is almost perfect. |
5 Gaia DR3 white dwarf-main sequence spectra classification
In this section we study the feasibility of our Random Forest algorithm to classify Gaia WDMS binary spectra. To that end we first generate synthetic spectra featuring the Gaia performance for WD, MS, and WDMS binary systems. Once our Random Forest is trained, it is then applied to a simulated population of single and binary WD and MS stars within 100 pc from the Sun. This exercise will give us an idea of how reliable the Random Forest classification applied to real spectra obtained by Gaia can be, a goal that we will be addressed in a forthcoming publication.
5.1 Gaia DR3 synthetic spectra
Our first task is to create a set of synthetic spectra emulating the Gaia performance. As it is well known, the Gaia spectrograph consists of two low-resolution prisms, one for the blue wavelengths (Blue Photometer or BP) from 3300 to 6800 Å and the other for the red wavelengths (Red Photometer or RP), between 6400 and 10 000 Å (Gaia Collaboration 2016). The expected resolving power is relatively low and not constant, being it a function of the wavelength (see Fig. 3 from Carrasco et al. 2021). The mean R is around 66 and the expected S/R of the spectra depends on the stellar object and its magnitude. S/R estimates for the MS and WD stars with a magnitude of 13 are shown in Table 2.
In Fig. 10 we show the process that leads from our theoretical initial model to our final Gaia simulated spectrum (for a mathematical description of the process see Appendix B). As an illustrative example we take a WDMS system located at distance d = 5 pc from the Sun, with G = 16 mag that implies an associated spectrum with S/Rrp = 35 and S/Rbp = 18. The initial noiseless and virtually infinite resolution spectrum is plotted on the top left panel of Fig. 10. Firstly, as the initial models are computed at a fixed distance of 200 pc, we rescaled the spectrum to the particular distance of the system. Then, the spectrum is downgraded according to the Gaia nominal spectral resolution relationship from (Carrasco et al. 2021). The resulting spectrum is shown on the top right panel. In the next step the spectrum is filtered through the BP and RP passbands. In order to avoid filter overlaps and without losing generalization in our procedure, we consider that filters BP and RP, end and start, respectively, at 6350 Å. The filtered spectrum is plotted in the left bottom panel of Fig. 10.
Finally, we add noise to our filtered spectrum. As previously stated, the estimated S/R for each filter and kind of object (WD or MS) for a fixed G magnitude of 13 mag is presented in Table 2. For a different G magnitude the S/R is computed assuming that it is proportional to the flux. In case of binary systems, the corresponding WD and MS star fluxes are added and the S/R is recalculated to the corresponding G magnitude of the system (see Eqs. (B.8) and (B.9) from Appendix B). According to Gaia performance, it can be assumed that Poisson noise dominates for faint sources (typically G > 12 mag; as it is the case of WDMS systems), while CCD readout noise can be disregarded. Hence, the filtered spectrum, which is provided in flux units, is converted to number of electrons per wavelength interval. That way a Poissonian noise proportional to the number of electrons can be naturally introduced in the modelling. In this process we assume that the conversion factor from photons to electrons is constant all along the wavelength range of the filter. Additionally, the specific characteristics of the CCD device are engulfed in a normalization constant. Thus, to each wavelength interval of the filtered spectrum we add a Gaussian noise whose deviation follows a Poissonian noise, that is, it is proportional to the square root of the number of electrons in that interval. The normalization constant is calculated in such a way that the full spectrum has the S/R previously derived for that system. The final resulting spectrum once the noise has been added and then defiltered is shown in the bottom right panel of Fig. 10. As can be observed, it results in a realistic gray noise model of the Gaia spectrum.
Expected S/R for Gaia WD and MS star spectra for an apparent magnitude of G = 13 mag (Carrasco, priv. comm.).
![]() |
Fig. 10 Example of the building process of a simulated Gaia WDMS spectrum. The initial noiseless infinite-resolution theoretical spectrum (top left panel) is rescaled to an specific distance of 5 pc and downgraded to the Gaia resolution (top right panel). It is then filtered through the BP and RP passbands (bottom left panel). A Gaussian noise with a deviation per wavelength interval proportional to the Poisson noise of the number of electrons in that interval is added and, once defiltered, the final synthetic spectrum (bottom right panel) is obtained. See text for details. |
5.2 Gaia white dwarf-main sequence synthetic population
Once WD, MS, and WDMS Gaia spectra are modeled, we take advantage of a detailed population synthesis modeling of the Galaxy widely used in the analysis of the WD population (see Torres et al. 2019, 2022, and references therein). The simulator, based on Monte Carlo techniques, generates the expected stellar parameters of the population of single WDs, single MS stars, and WDMS binaries in the Galaxy that can be accessible by the Gaia satellite. For the analysis presented here we adopt a standard model (a full description of the model, named as Model 1, can be found in Torres et al. 2022). The physical inputs consist in a flat initial-mass-relation distribution, n(q) ∝ 1, an inverse initial orbital separation, f(a) ∝ a−1, and a binary fraction of fb = 0.50. The synthetic population is mixed with a ratio 74:25:1 for the thin and thick disk, and halo Galactic components, respectively, within 100 pc from the Sun. Photometric and astrometric errors have also been introduced for each object of the simulated sample according to Gaia’s performance.
Our synthetic sample initially contains around 133 000 MS stars, 23 000 WDs, and 1850 WDMS binary systems. In Fig. 11 we display the distribution of stellar parameters, namely their effective temperatures and surface gravities, for the different synthetic populations generated by our simulator in the region between WD and MS stars (see Sect. 5.3 for further details). The result shows that the percentages of the different populations are clearly unbalanced in this region, resulting in 5992:126:167, that is 95.34% of single MS stars, 2% of WDs, and 2.66% of binaries. Based on this simulation we build the training and testing sets (≅70 and 30%, respectively, of the simulated sample) by applying a bivariate interpolation of the temperatures and surface gravities on the three spectral grids of MS, WDs, and WDMS (see Sect. 3.1.), thus obtaining the corresponding synthetic spectra for each simulated star of the sample following the procedure described in the previous section.
5.3 Classification of Gaia white dwarf-main sequence systems
In Fig. 12 we show the results of our Random Forest algorithm when applied to the simulated spectra of the Gaia 100 pc synthetic population of WDs, MS stars, and WDMS systems. Following Gaia performance, we have selected only those spectra with G < 18 mag. Moreover, as a starting point to test our algorithm, we used the region within the HR diagram (solid red line in left panels of Fig. 12) defined by Rebassa-Mansergas et al. (2021) to select unresolved WDMS. Within this region a certain level of contamination from single low-mass WDs and especially from single low-metallicity MS stars is expected as shown by our simulations (blue open circles and red open circles, respectively in Fig. 12). However, as revealed by the confusion matrix on the top middle panel of Fig. 12, the performance of the Random Forest can be considered as excellent, achieving a practically perfect accuracy of 0.99. This is due to the fact that the sample within our selected region is highly unbalanced, representing the MS star population 97.6% of the whole sample, while WDs and WDMS systems represent only 1 and 1.4%, respectively, of the objects. Even so, practically all single WDs and MS stars and nearly 80% of the WDMS systems are correctly classified. Only in the region where WDMS and single low-metallicity MS stars overlap, WDMS binaries are miss-classified as MS stars (see gray scale on top left panel of Fig. 12) and, at the same time, some MS stars are erroneously assigned as WDMS systems. The performance of the algorithm for each sub-population can be analyzed in the ROC curves illustrated in the top right panel of Fig. 12. As shown, the performance of the Random Forest algorithm for correctly classifying single WDs is perfect, as it is the case for MS stars when adopting a FPR threshold larger than ≈0.17. The performance of the algorithm in classifying WDMS is also notably good, with a TPR always above 0.80 and a total ROC area of 0.90.
However, it is important to emphasize that only 9% of the total underlying population of unresolved WDMS binaries falls within the region defined by Rebassa-Mansergas et al. (2021), while the remaining 91% lay above this region (that is the area where single MS stars of solar metallicity are located). From the point of view of the spectral classification, we have to face two problems in this region. On one hand, the large quantity of MS stars. And, on the other hand, the SEDs of WDMS systems which overlap in the HR diagram with single MS stars are normally dominated by the flux of the MS companion, being this the reason why the vast majority of such WDMS binaries remain observationally undetected. Nevertheless, motivated by the excellent results achieved by of our Random Forest algorithm, we extend our analysis to test whether or not it can help in identifying these elusive binaries. For that purpose we extend 1 magnitude up the upper border of the WDMS region defined by Rebassa-Mansergas et al. (2021). The new selection region is marked by a dashed-red line in the bottom left panel of Fig. 12. As shown, a large number of WDMS, as well as MS stars, are now included in the extended region. The sample is now totally dominated by MS stars (99%), while WDs and WDMS binaries are just 0.3 and 0.7% of the sample. The corresponding confusion matrix is shown in the bottom middle panel of Fig. 12. The performance of the algorithm, even slightly worse that in the previous case, can still be considered as notably good. Practically all WDs, and nearly 93% of the MS stars are correctly classified. Regarding WDMS binary systems, 77% are properly identified, being those miss-classified assigned to MS stars. It is important to remark that in this extended region due to the large quantity of MS stars the identification of WDMS systems through their SEDs is extremely difficult. However, the Random Forest algorithm is able to correctly identify nearly 60% of WDMS binary systems. However, the drawback is that even though the percentage of MS stars miss-classified as WDMS is small (≈7%), that represents a decrease in the number of TPR of the WDMS population, as shown in the ROC curves (bottom right panel of Fig. 12). The performance of the algorithm in correctly classifying WDMS is reduced to a ROC area of 0.87, which is also related to a slight decrease in the TPR of the MS star population.
We conclude that, even with the intrinsic difficulties for disentangling WDMS binaries from single MS stars in the extended region, our results demonstrate that a classification of Gaia WDMS spectra based on a Random Forest algorithm is feasible and can be considered as a powerful tool for identifying binary WDMS systems, even those that so far have remained elusive.
![]() |
Fig. 11 Distribution of stellar parameters of the Gaia accessible population of single MS stars (top left), single WDs (top right), MS star components in WDMS (bottom left), and WD components in WDMS (bottom right). |
![]() |
Fig. 12 HR-diagram, confusion matrices, and ROC curves. Top panels: HR-diagram (left panel) of WDs (blue open circles), MS stars (red open circles), and WDMS systems (grey solid circles) for those spectra with G < 18 accessible by Gaia within the WDMS region (red solid line) defined by Rebassa-Mansergas et al. (2021). The performance of our Random Forest in classifying these population is shown in the corresponding confusion matrix (middle panel) and the ROC curve (right panel). Bottom panels: same as previous panels when the region defined by Rebassa-Mansergas et al. (2021) is extended 1 mag up into the MS region (red dashed line). Even in those extreme conditions the algorithm is able to identify ~60% of WDMS binary systems within the extended region (see text for details). |
6 Conclusions
We have proven that a Random Forest algorithm is an optimal tool for classifying single WDs or MS stars, as well as WDMS binaries among large spectroscopic databases such as the SDSS or Gaia. We first analyzed the capabilities of the algorithm by classifying synthetic spectra for the full space-parameter range of effective temperatures and gravities of WDs and MS stars. The synthetic spectra were generated combining the individual WD and MS spectra, which cover a wide range of temperatures and surface gravities. Our results show that the accuracy of the classification depends only on the S/R of the spectra for resolving powers R above ≈300. For a typical case with a resolving power of 1800 and S/R = 10 the algorithm is able to recover practically all MS stars and WDs, 93 and 99.3%, respectively, while up to 74% of WDMS systems can be correctly classified.
The Random Forest algorithm was also tested using the SDSS catalog where nearly 7000 selected WD, MS, and WDMS stars were already labeled. The agreement between our algorithm and the SDSS classification was practically total, achieving 97% of match. Taking into account that, in our previous theoretical analysis we found a recall ratio of WDMS systems of 74% by our algorithm, we can hypothesize that a similar ratio is presented in the SDSS labeling. This would imply that not all WDMS systems are properly recovered by the SDSS classification. Among the discrepancies, we found 75 spectra classified by the Random Forest as WDMS that were initially labeled by SDSS as WDs (40) or MS stars (35). Visual inspection of these spectra reveals that, for some of them, the two components can actually be discerned, while for others further follow-up observations are required to confirm whether or not they are single stars or binaries. This result gives indications that the Random Forest algorithm is capable to some extent to correctly classify WDMS binaries even in the case that the flux is dominated by one of their components.
Once our algorithm has been proven to be a useful tool for classifying WDMS spectra from actual surveys, we explored the possibilities of applying it to a synthetic sample of spectra mimicking Gaia observations. First, we introduced a realistic modeling of Gaia spectral noise taking into account, among other things the S/R as a function of the G magnitude for different sources such as WDs or MS stars, as well as the error that arises during the filtering through the Gaia BR and RP passbands. Then, with the aid of a detailed Monte Carlo simulator of the Galactic WD population, we generated the parameter space for WDs, MS stars, and WDMS systems observable by Gaia within 100 pc from the Sun. For each object of the simulated sample we assigned a synthetic Gaia spectrum with their corresponding S/Rs. We chose as our initial area for evaluating the Random Forest algorithm the region within the HR-diagram defined by Rebassa-Mansergas et al. (2021) for selecting unresolved WDMS binary objects. The performance of the algorithm within this region is notably good, being able to identify nearly 80% of WDMS systems. The analysis has been then extended to regions of the HR-diagram where the presence of MS stars is dominant and, at the same time, WDMS systems are practically overshined by the MS components. Even though from an observational point of view the detection of WDMS in these areas is extremely difficult, the algorithm was able to correctly identify nearly 60% of the WDMS systems in that extended region, which further supports the potential of the Random Forest to identify these elusive objects.
Along this work we have shown that the Random Forest algorithm is a powerful tool for classifying large samples of spectra, specifically those of WDs, MS stars, and WDMS binary systems. In particular, once provided with a realistic modeling of spectral noise and the population parameters, we have proven that the Random Forest is a feasible tool for identifying and correctly classifying WD, MS, and WDMS systems observed by Gaia.
Acknowledgements
This work was partially supported by the MINECO grant PID2020-117252GB-I00. A.R.M. acknowledges support from grant RYC-2016-20254 funded by MCIN/AEI/10.13039/501100011033 and by ESF Investing in your future. We would also like to acknowledge the fruitful information provided by M. Carrasco and J. de Bruijne.
Appendix A Random Forest: hyperparameters, metrics, and scoring
Appendix A.1 Hyperparameters definitions
For the present project we have used the scikit-learn3 library for Python, which already includes a Random Forest classifier (Pedregosa et al. 2011). Some of the most relevant hyperparameters are defined and in Table A.1 a list of the adopted values is presented.
Number of estimators: the number of Decision Trees that form the Random Forest.
-
Criterion: the function that is used in order to calculate the quality of the split. Generally, Gini index or the entropy index are used. In our case, Gini impurity index has been adopted, defined as:
(A.1)
where (p(i|j) is the probability of picking an element of class i at certain node j, so it will be 0 when all the samples of a branch are in the same class and 0.5 when the mixture has the same elements of each class.
Minimum number of samples required to split a node: if there are more samples than this value, the algorithm will continue the splitting process; if not, the current node will be considered a Leaf Node.
Minimum number of samples required to be at a Leaf Node: the current node will be considered to be a Leaf Node if there are at least a number of samples equal to this value in the node.
Maximum depth: the algorithm finishes a Decision Tree when there are no more mixed subsets with different classes, when the number of samples is less than the minimum number of samples required to split a node, or when the tree reaches the maximum number of Decision Nodes set by this hyperparameter.
Maximum number of features to consider when looking for the best split: the number of random features that the algorithm takes into account when doing a split can be controlled with this parameter.
Maximum samples to draw from the training set to train each base estimator: if bootstrap is enabled, this value sets the number of samples used in order to build each estimator. If there is no set value, the algorithm will use all the samples.
Class weights: parameter that sets weights to the different classes. This is useful in imbalanced datasets where there are many objects of one population and few of another, as it biases the calculation of the Gini index in favour of the minority class at the cost of allowing some errors in the majority one.
Hyperparameters optimal values adopted in the present work. Default values are marked with asterisk.
Appendix A.2 Metrics and scoring
Some of the most relevant metrics and scoring used along this work are presented:
Confusion matrix, M: an element of the matrix Mi,j corresponds to all the objects of true class i and labeled as j. The diagonal indicates the number of objects that were actually classified according to its true label, while the off-diagonal elements represent the misclassified objects.
-
Accuracy score: defined as
(A.2)
represents the fraction of correctly classified elements divided by the whole population. Its maximum value 1 indicates that all the elements are correctly classified, while its minimum 0 that all the elements are incorrectly classified.
-
Recall score: it measures the fraction of correctly labeled as class i from the whole set of objects that actually belong to that class. Defined as
(A.3)
where the sum term refers to the sum of the elements of the row i.
It is also called sensitivity or probability of detection, because it indicates the True Positive samples, that is, the hits of the algorithm for a certain class. In a normalized confusion matrix, the recall scores directly correspond to the diagonal values.
-
Precision score: indicates the amount of objects correctly classified as i in comparison with the total amount of objects labeled as i. This value quantifies the ability of the classifier to not label as class j an element of class i, which would be:
(A.4)
where the sum term refers to the sum of the elements of the column i.
F1 score: also named F-score is a weighted average of both, the precision and recall scores. Its mathematical expression is given by:
(A.5)
-
ROC curve: the Receiver Operating Characteristic (ROC) curve, although it is primarily intended for binary classifications, it can be extended to multi-class problems such as the present one. This curve represents the performance of a classifier while varying its discrimination threshold. The resulting points are the True Positive Rate (TPR), also known as recall or probability of detection, versus the False Positive Rate (FPR), also named as probability of false alarm. We could express those three values as:
(A.6)
In a multi-class problem, the Positives would be all those elements that actually belong to a class i, whether they are correctly classified or not, and the True Positives would be those who are correctly classified as a class i. On the other hand, the Negatives would be all the other samples that do not belong to a class i, while the False Positives are the elements that do not belong to a class i but are classified in it. To sum up:
AUC curve: the Area Under the Receiver Operating Characteristic Curve (ROC AUC), which is the area under the ROC curve. This value basically summarizes the information of the curve in a number that goes from 0.5 in the worst case, to 1 if the classifier does not make mistakes.
macro-average curve: metric average weighting each classes equally. In contrast, micro-average adds the contribution of each object equally.
Appendix A.3 Hyperparameters adjustment
Starting from the default values, we individually analyzed the performance of the Random Forest for a set of the most relevant hyperparameters (i.e., number of estimators, maximum depth, minimum number of samples required to split a node, minimum number of samples required to be at a Leaf Node, and maximum number of features to consider when looking for the best split). The score adopted is the accuracy. In Figure A.1 we show the results for the training sample (red line) and the testing sample (blue line) for the different hyperparameters under study. Additionally we studied the performance for different scalers and normalized (top left panel of Fig. A.1).
In our selection of the best hyperparameters values we have adopted two basically criteria: have an accuracy greater than 90% and a minimum processing time. A summary of the best values adopted is presented in Table A.1. Finally, once all the hyperparameters have been analyzed, a test has been made with both settings, default and optimal values, in order to compare the computational time and the accuracy. For the default hyperparameters, we obtained an accuracy of 95.61% with a computation time of 19.51 s for one loop. For the optimal setting, the accuracy obtained changed in the thousandths but with a computation time of 33.07 s, an increase of almost 70%. Although, optimal values have been used along this work, default values are also acceptable.
![]() |
Fig. A.1 Performance of the accuracy score as a function of different hyperparameters for the training sample (red line) and testing sample (blue line). In the top left panel we show the accuracy for different scalers: standard (1), MinMax (2), MaxAbs (3), robust (4), power transformer (5), quantile transformer (6), normalizer L1 (7), normalizer L2 (8), and normalizer-max (9). Marked in red (7) the one adopted in the present work. |
Appendix B Gaia spectral noise modeling
Let ϕ be the flux of a noiseless spectrum measured in units of erg cm−2 s−1 Å−1. The initial flux, ϕ0, computed at distance d0 (in our case, d0 = 200 pc) can be directly rescaled at distance d as ϕd = ϕ0 (200/d[pc])2. Taking into account the resolving power, the flux is discretized in n intervals of length ∆λ, ϕi, with i = 1…n. If the wavelength interval ∆λ is small enough, we can estimate the number of electrons generated in interval i, Ei, as a linear function from the number of photons derived from the flux, ϕi as
(B.1)
where ∆A is the area covered by the CCD camera, ∆t the time interval, єphoton the energy of the photon of wavelength λi, and f a conversion function from photons to electrons. Assuming that f is independent of λi (or at least for the wavelength range of the spectrum) and recalling that the resolving power is R = λ/∆λ, the previous equation leads to:
(B.2)
where C is a constant that encompasses the rest of parameters.
The signal S and the noise N are defined in the standard way as
(B.3)
where is the error associated to the flux, ϕi.
For each interval i, we adopt a Gaussian noise in the number of estimated electrons with a deviation following a Poissonian distribution, that is . The error in the number of electrons is propagated through equation B.2 to the error in the flux as
(B.5)
For a fixed S/R defined as the quotient of B.3 over B.4, we can derived the normalization constant C as
(B.6)
Assuming that the S/R is proportional to the flux, S/R ∝ ϕ and using the relation between magnitude (in our case the Gaia magnitude G) and flux,
(B.7)
where C is a constant of calibration, we can easily derive the relation between S/R and magnitude:
(B.8)
From Table 2 the values of the S/R for G = 13 are known. Thus, by adopting G1=13 and the corresponding S/R1 depending on the type of star and the range of wavelengths (BP or RP), we can estimate the S/R2 for each one of the objects of magnitudes G2. For binary stars we only have approximations of S/R for their MS and WD components individually. In this case, we assume that individual fluxes are of the same order, ϕ1 ≈ ϕ2, which leads to signals of the same order, S1 ≈ S2 and results in a combined S/R of the system as
(B.9)
where S/R1 and S/R1 are the individual S/R.
Appendix C Accuracy and F1-score for each model under study
In Table C.1 we summarize the accuracy and the F1-score for the different training models analyzed in this work.
Accuracy and F1-score for each model analyzed in this study.
References
- Althaus, L. G., Córsico, A. H., Isern, J., & García-Berro, E. 2010, A&Amp;ARv, 18, 471 [NASA ADS] [CrossRef] [Google Scholar]
- Bailer-Jones, C. A. L., Irwin, M., & von Hippel, T. 1998, MNRAS, 298, 361 [NASA ADS] [CrossRef] [Google Scholar]
- Ball, N. M., & Brunner, R. J. 2010, Int. J. Mod. Phys. D, 19, 1049 [Google Scholar]
- Bazell, D., & Peng, Y. 1998, ApJS, 116, 47 [NASA ADS] [CrossRef] [Google Scholar]
- Breiman, L. 2001, Mach. Learn., 45, 5 [Google Scholar]
- Carrasco, J. M., Weiler, M., Jordi, C., et al. 2021, A&A, 652, A86 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
- Casewell, S. L., Belardi, C., Parsons, S. G., et al. 2020, MNRAS, 497, 3571 [NASA ADS] [CrossRef] [Google Scholar]
- El-Badry, K., Rix, H.-W., & Heintz, T. M. 2021, MNRAS, 506, 2269 [NASA ADS] [CrossRef] [Google Scholar]
- Gaia Collaboration (Brown, A. G. A., et al.) 2016, A&A, 595, A2 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
- Gaia Collaboration (Brown, A. G. A., et al.) 2021a, A&A, 649, A1 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
- Gaia Collaboration (Smart, R. L., et al.) 2021b, A&A, 649, A6 [EDP Sciences] [Google Scholar]
- García-Berro, E., & Oswalt, T. D. 2016, New A Rev., 72, 1 [CrossRef] [Google Scholar]
- Gunn, J. E., Siegmund, W., Mannery, E. J., et al. 2006, AJ, 131, 2332 [NASA ADS] [CrossRef] [Google Scholar]
- Hambly, N. C., Miller, L., MacGillivray, H. T., Herd, J. T., & Cormack, W. A. 1998, MNRAS, 298, 897 [NASA ADS] [CrossRef] [Google Scholar]
- Hernandez-Pajares, M., & Floris, J. 1994, MNRAS, 268, 444 [NASA ADS] [Google Scholar]
- Husser, T. O., Wende-von Berg, S., Dreizler, S., et al. 2013, A&A, 553, A6 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
- Kaiser, N., Aussel, H., Burke, B. E., et al. 2002, Proc. SPIE, 4836, 154 [NASA ADS] [CrossRef] [Google Scholar]
- Kepler, S. O., Koester, D., Pelisoli, I., Romero, A. D., & Ourique, G. 2021, MNRAS, 507, 4646 [NASA ADS] [CrossRef] [Google Scholar]
- Koester, D. 2010, Mem. Soc. Astron. It., 81, 921 [NASA ADS] [Google Scholar]
- Naim, A., Lahav, O., Sodre, L. J., & Storrie-Lombardi, M. C. 1995, MNRAS, 275, 567 [Google Scholar]
- Parsons, S. G., Gänsicke, B. T., Marsh, T. R., et al. 2012, MNRAS, 426, 1950 [Google Scholar]
- Parsons, S. G., Gänsicke, B. T., Marsh, T. R., et al. 2013, MNRAS, 429, 256 [Google Scholar]
- Parsons, S. G., Hermes, J. J., Marsh, T. R., et al. 2017, MNRAS, 471, 976 [NASA ADS] [CrossRef] [Google Scholar]
- Pedregosa, F., Varoquaux, G., Gramfort, A., et al. 2011, J. Mach. Learn. Res., 12, 2825 [Google Scholar]
- Pérez-Ortiz, M. F., García-Varela, A., Quiroz, A. J., Sabogal, B. E., & Hernández, J. 2017, A&A, 605, A123 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
- Plewa, P. M. 2018, MNRAS, 476, 3974 [NASA ADS] [CrossRef] [Google Scholar]
- Pyrzas, S., Gänsicke, B. T., Brady, S., et al. 2012, MNRAS, 419, 817 [NASA ADS] [CrossRef] [Google Scholar]
- Raddi, R., Torres, S., Rebassa-Mansergas, A., et al. 2022, A&A, 658, A22 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
- Rebassa-Mansergas, A., Gänsicke, B. T., Rodríguez-Gil, P., Schreiber, M. R., & Koester, D. 2007, MNRAS, 382, 1377 [CrossRef] [Google Scholar]
- Rebassa-Mansergas, A., Gänsicke, B. T., Schreiber, M. R., Koester, D., & Rodríguez-Gil, P. 2010, MNRAS, 402, 620 [Google Scholar]
- Rebassa-Mansergas, A., Nebot Gómez-Morán, A., Schreiber, M. R., et al. 2012, MNRAS, 419, 806 [NASA ADS] [CrossRef] [Google Scholar]
- Rebassa-Mansergas, A., Anguiano, B., García-Berro, E., et al. 2016a, MNRAS, 463, 1137 [NASA ADS] [CrossRef] [Google Scholar]
- Rebassa-Mansergas, A., Ren, J. J., Parsons, S. G., et al. 2016b, MNRAS, 458, 3808 [NASA ADS] [CrossRef] [Google Scholar]
- Rebassa-Mansergas, A., Maldonado, J., Raddi, R., et al. 2021, MNRAS, 505, 3165 [NASA ADS] [CrossRef] [Google Scholar]
- Ren, J. J., Rebassa-Mansergas, A., Parsons, S. G., et al. 2018, MNRAS, 477, 4641 [NASA ADS] [CrossRef] [Google Scholar]
- Serra-Ricart, M., Aparicio, A., Garrido, L., & Gaitan, V. 1996, ApJ, 462, 221 [NASA ADS] [CrossRef] [Google Scholar]
- Torres, S., García-Berro, E., & Isern, J. 1998, ApJ, 508, L71 [NASA ADS] [CrossRef] [Google Scholar]
- Torres, S., Cantero, C., Rebassa-Mansergas, A., et al. 2019, MNRAS, 485, 5573 [NASA ADS] [CrossRef] [Google Scholar]
- Torres, S., Canals, P., Jiménez-Esteban, F. M., Rebassa-Mansergas, A., & Solano, E. 2022, MNRAS, 511, 5462 [NASA ADS] [CrossRef] [Google Scholar]
- Turon, C., Gomez, A., Crifo, F., et al. 1992, A&A, 258, 74 [NASA ADS] [Google Scholar]
- West, A. A., Morgan, D. P., Bochanski, J. J., et al. 2011, AJ, 141, 97 [NASA ADS] [CrossRef] [Google Scholar]
- Wilson, J. C. 2019, PASP, 131, 055001 [NASA ADS] [CrossRef] [Google Scholar]
- York, D. G., Adelman, J., Anderson, J. E., Jr., et al. 2000, AJ, 120, 1579 [NASA ADS] [CrossRef] [Google Scholar]
- Zhao, G., Zhao, Y.-H., Chu, Y.-Q., Jing, Y.-P., & Deng, L.-C. 2012, Res. Astron. Astrophys., 12, 723 [Google Scholar]
- Zwitter, T., Siebert, A., Munari, U., et al. 2008, AJ, 136, 421 [Google Scholar]
In Table C.1 the reader can find a summary of the accuracy and F1-score obtained for all the models analyzed along this work.
They are accessible at: http://wdmsrf.epizy.com/
All Tables
Summary of the number of models and the range of parameters used to generate our synthetic spectra for the MS, WD, and WDMS populations.
Expected S/R for Gaia WD and MS star spectra for an apparent magnitude of G = 13 mag (Carrasco, priv. comm.).
Hyperparameters optimal values adopted in the present work. Default values are marked with asterisk.
All Figures
![]() |
Fig. 1 Example of a WDMS binary spectrum (blue line) obtained from the combination of the individual spectra of a WD (green line) and a MS star (red line). The effective temperatures and surface gravities are. respectively, 9250 K and 7.75 dex for the WD, and 2700 K and 5.5 dex for the MS star. As seen, the blue part of the spectrum is dominated by the WD, while the red part has clear M star spectral lines. |
In the text |
![]() |
Fig. 2 Accuracy density map as a function of R and S/R. Minimum and maximum values of the accuracy are indicated in red in the gray scale. |
In the text |
![]() |
Fig. 3 F1-score density maps as a function of R and S/R for the MS (left panels), WD (centrαl pαnel), and WDMS (right pαnel) populations. Minimum and maximum values of the respective score are marked in red in the gray scale. |
In the text |
![]() |
Fig. 4 Confusion matrices (top panels) for the three cases under study and the corresponding ROC curves (bottom panels). |
In the text |
![]() |
Fig. 5 Parameters density maps. Top panels: WDMS recall density map as a function of the MS (left panel) and WD component (right panel) stellar parameters. Bottom panels: WDMS miss-classification as a function of the MS (left panel) and WD (right panel) stellar parameters. |
In the text |
![]() |
Fig. 6 Feature importance as function of the wavelenght for our Case 2. As a visual reference we marked as vertical dashed lines the doubly ionized oxygen transition O III, and the Hδ, Hγ, and Hβ Balmer lines transitions on WDs. |
In the text |
![]() |
Fig. 7 S/R histogram distribution of the SDSS spectra. |
In the text |
![]() |
Fig. 8 Confusion matrix (left panel) and ROC curves (right panel) of the SDSS spectra set for the initial Random Forest model. |
In the text |
![]() |
Fig. 9 Same as Fig. 8 but for the final Random Forest model. As shown, the agreement between our Random Forest algorithm and the SDSS classification is almost perfect. |
In the text |
![]() |
Fig. 10 Example of the building process of a simulated Gaia WDMS spectrum. The initial noiseless infinite-resolution theoretical spectrum (top left panel) is rescaled to an specific distance of 5 pc and downgraded to the Gaia resolution (top right panel). It is then filtered through the BP and RP passbands (bottom left panel). A Gaussian noise with a deviation per wavelength interval proportional to the Poisson noise of the number of electrons in that interval is added and, once defiltered, the final synthetic spectrum (bottom right panel) is obtained. See text for details. |
In the text |
![]() |
Fig. 11 Distribution of stellar parameters of the Gaia accessible population of single MS stars (top left), single WDs (top right), MS star components in WDMS (bottom left), and WD components in WDMS (bottom right). |
In the text |
![]() |
Fig. 12 HR-diagram, confusion matrices, and ROC curves. Top panels: HR-diagram (left panel) of WDs (blue open circles), MS stars (red open circles), and WDMS systems (grey solid circles) for those spectra with G < 18 accessible by Gaia within the WDMS region (red solid line) defined by Rebassa-Mansergas et al. (2021). The performance of our Random Forest in classifying these population is shown in the corresponding confusion matrix (middle panel) and the ROC curve (right panel). Bottom panels: same as previous panels when the region defined by Rebassa-Mansergas et al. (2021) is extended 1 mag up into the MS region (red dashed line). Even in those extreme conditions the algorithm is able to identify ~60% of WDMS binary systems within the extended region (see text for details). |
In the text |
![]() |
Fig. A.1 Performance of the accuracy score as a function of different hyperparameters for the training sample (red line) and testing sample (blue line). In the top left panel we show the accuracy for different scalers: standard (1), MinMax (2), MaxAbs (3), robust (4), power transformer (5), quantile transformer (6), normalizer L1 (7), normalizer L2 (8), and normalizer-max (9). Marked in red (7) the one adopted in the present work. |
In the text |
Current usage metrics show cumulative count of Article Views (full-text article views including HTML views, PDF and ePub downloads, according to the available data) and Abstracts Views on Vision4Press platform.
Data correspond to usage on the plateform after 2015. The current usage metrics is available 48-96 hours after online publication and is updated daily on week days.
Initial download of the metrics may take a while.