A&A, Volume 656, December 2021, Article Number A93, 12 pages
Section: Galactic structure, stellar clusters and populations
DOI: https://doi.org/10.1051/0004-6361/202141589
Published online: 08 December 2021

© ESO 2021

1. Introduction

Since its launch in 2013, the Gaia satellite has been observing the stars of the Milky Way (Gaia Collaboration 2016). While the main objective of Gaia is astrometry, it also collects stellar spectra using the red photometer (RP), the blue photometer (BP), and the radial velocity spectrometer (RVS). The combined BP/RP spectra cover the wavelength range 3300−10 500 Å with a resolution between 13 and 85 (Carrasco et al. 2021). The BP/RP spectra are observed for all targets, and a large subset will be released in Gaia Data Release 3 (GDR3). They can be used to estimate the stellar parameters effective temperature Teff, surface gravity log g, and metallicity [Fe/H], as well as the extinction along the line of sight A0 (Liu et al. 2012). The RVS spectra can be used to estimate the same parameters, and also [α/Fe] for cool stars. However, the RVS requires the stars to be relatively bright, so this will only be done for a few tens of millions of stars out of the Gaia sample (Recio-Blanco et al. 2016). We aim to investigate whether the BP/RP spectra can also be used to estimate the α abundance [α/Fe] using machine learning.

Estimating [α/Fe] from BP/RP spectra is very much the type of problem that machine learning is intended for. We expect the shape of the BP/RP spectra to depend on [α/Fe] in a way that cannot easily be handled with conventional spectroscopic methods that measure line depths: the individual α-sensitive spectral lines and bands cannot be resolved, but in large numbers they should still affect the continuum shape of the spectra. We also have known values of [α/Fe], as well as of other stellar parameters, for some subsets of the stars observed by Gaia, from the surveys Galactic Archaeology with HERMES (GALAH) and the Apache Point Observatory Galactic Evolution Experiment (APOGEE) (Buder et al. 2019; Majewski et al. 2017). Put in machine learning terms, we have features for a sample of objects in the form of BP/RP spectra, and we have known labels for a subset of those objects in the form of GALAH and APOGEE estimates of [α/Fe]. We also have a theoretical reason to believe that the labels correlate with the features in some predictable way, but we lack a detailed model of what that relationship is.

Despite the argument outlined above, it could turn out that estimating [α/Fe] from BP/RP spectra is impossible in practice, due to degeneracies between [α/Fe] and other parameters, or due to the effect of [α/Fe] simply being swamped by noise. If it is possible, a second issue arises: even if the algorithm can estimate [α/Fe] well on average, the basis for this estimate may or may not allow it to tell apart two stars that differ only in [α/Fe]. Ideally, we would want an algorithm that estimates [α/Fe] based on its direct causal effect on the shape of the spectrum, since this could be used to identify stars or populations of stars with unusual [α/Fe]. By contrast, an algorithm that estimates [α/Fe] based on indirect correlations with other parameters – at the most trivial, the Galactic trend of [α/Fe] as a function of [Fe/H] – could not be used for this purpose, but may still be used cautiously in other contexts.

The accuracy with which [α/Fe] can be estimated is expected to depend partly on the other stellar parameters. A priori, one would expect it to be easier to estimate for cooler stars, in particular for cool dwarf stars, due to their prominent titanium oxide molecular bands, which react quadratically to changes in α-element abundance since both titanium and oxygen are α-elements. However, even if machine learning can be reliably used on cool dwarfs, this is of limited use for Galactic population studies, since Gaia can only observe those stars at high S/N if they are very nearby.

Section 2 describes the machine learning algorithm used, the extremely randomized trees (ExtraTrees) algorithm. Section 3 describes the four samples of spectra that we use. Section 4 describes the results of training an ExtraTrees model on a sample of synthetic spectra. Section 5 describes the results of instead training an ExtraTrees model on a sample of spectra with parameters known through the GALAH survey. Section 6 summarises our results.

2. ExtraTrees algorithm

For our machine learning algorithm we used the ExtraTrees algorithm, modified to allow simultaneous estimation of several outputs (Geurts et al. 2006; Dumont et al. 2009). The ExtraTrees algorithm is similar to the random forest algorithm, in that it is an ensemble learning method using decision trees as the base estimator. We used the implementation of the algorithm in the Python module scikit-learn (Pedregosa et al. 2011). The algorithm has some advantages and some disadvantages, which led us to prefer it over other algorithms.

One simple but important advantage is ease of use. It does not require lengthy fine-tuning of meta-parameters, and it is computationally simple enough to run very quickly. There are only three parameters that need to be defined when training the model. The first parameter is the maximum number of features K to use to split a node when defining the decision trees (Geurts et al. 2006, Sect. 2.1). We choose to always allow the full set of features to be used, since this is recommended for regression problems (Scikit-learn developers 2020, Sect. 1.11.2.3). The second parameter is the minimum sample size nmin to split a node when defining the decision trees. When training on the synthetic sample (described in Sect. 3.1) we choose nmin = 2 since this typically gives the best results (Scikit-learn developers 2020, Sect. 1.11.2.3). When training on the GALAH sample (described in Sect. 3.2) we choose nmin = 10, since we otherwise ran into memory issues. The third parameter is the number of trees ntrees to include in the ensemble. In principle, the algorithm will always perform better the higher ntrees is, making the optimal value a trade-off with speed and memory. We choose ntrees = 200, since at this value we found that the uncertainties in the [α/Fe] estimates due to the internal scatter among the base estimates were of the order of 0.01–0.03 dex, making them safely negligible compared to other sources of uncertainty.
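The settings described above amount to only a few lines of scikit-learn code. The following is a minimal sketch under the choices named in this section (ntrees = 200, K equal to the full feature set, and nmin = 2 or 10); the arrays X_train, y_train, and X_test are placeholders for the spectra and labels rather than variables from our actual pipeline.

```python
from sklearn.ensemble import ExtraTreesRegressor

# Multi-output regressor: each row of y_train holds the labels
# (Teff, log g, [Fe/H], [alpha/Fe]); each row of X_train holds the
# pixel fluxes of one BP/RP spectrum.
model = ExtraTreesRegressor(
    n_estimators=200,     # n_trees: size of the ensemble
    max_features=None,    # K: consider the full set of features at every split
    min_samples_split=2,  # n_min: 2 for the synthetic sample, 10 for GALAH
    n_jobs=-1,            # parallelise the trees over available cores
    random_state=0,
)
model.fit(X_train, y_train)
estimates = model.predict(X_test)   # shape: (n_spectra, 4)
```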

The algorithm has the second advantage that it can handle multi-output problems. This means that we could to some extent break the degeneracy between [α/Fe] and other parameters by training the models to simultaneously estimate Teff, log g, [Fe/H], and [α/Fe], instead of [α/Fe] alone (Dumont et al. 2009). When relevant, we show the estimates for the other parameters.

The algorithm has the third advantage that it is not very susceptible to overfitting: many algorithms, such as neural nets or individual decision trees, are prone to learning features of the training sample that will not recur in any other sample drawn from the same distribution, such as random noise. Ensemble learning methods are in general fairly robust to this issue, and the ExtraTrees algorithm was designed specifically to that end (Geurts et al. 2006).

One disadvantage of the algorithm is that, being as simple as it is, it is likely to have slightly worse performance than a carefully calibrated version of some other algorithm, such as a neural network. However, we believe that the algorithm is close enough to optimal that it is sufficient for the proof (or disproof) of concept that this study is intended to be. That is, if a measurement is possible with the ExtraTrees algorithm, it may be possible to do it better with another algorithm – but we do not believe that a measurement that is categorically impossible with the ExtraTrees algorithm will become possible with another algorithm.

A second disadvantage of the algorithm is that it is essentially incapable of extrapolating: as soon as it is used on data with features falling outside the range represented in the training sample, the results are likely to be physically meaningless. To make our results easier to interpret, we show, where relevant, the convex hull of the training sample in plots of the estimates made by the algorithm. In some cases we also filter our data to remove spectra that are not represented in the training sample.

The inability to extrapolate outside the training sample also means that the errors in estimates given by an ExtraTrees model are not necessarily symmetric: if a model is trained on some label covering the range [x, y] and is then used to estimate labels for samples with true label values x and y, then the former will always be overestimated and the latter will always be underestimated. This will be an issue for the synthetic sample described in Sect. 3.1, since it contains only the discrete [α/Fe] values 0.0 dex and 0.4 dex, which lie at the edges of the label range.

3. Spectral samples

We have four samples of BP/RP spectra: one sample of synthetic spectra that we dub the ‘synthetic sample’ and describe in Sect. 3.1; one sample of observed BP/RP spectra with parameters known from the GALAH survey that we dub the ‘GALAH sample’ and describe in Sect. 3.2; one sample of observed BP/RP spectra without known parameters that we dub the ‘Gaia sample’ and describe in Sect. 3.3; and one sample of observed BP/RP spectra that have parameters known from the APOGEE survey and are believed to be part of the Gaia-Enceladus structure, which we dub the ‘Gaia-Enceladus sample’ and describe in Sect. 3.4.

Any of the three samples with known or estimated parameters can be used as a training sample, and the resulting model can then be used to estimate parameters for the same sample or any of the others. Section 4 describes the results of using the synthetic sample as training sample. Section 5 describes the results of using the GALAH sample as training sample.

3.1. Synthetic sample

We used the MARCS code to model a grid of synthetic spectra covering a range of stellar parameters and α abundances (Gustafsson et al. 2008). We then extended the grid by adding an axis over the extinction parameter A0, as defined in Bailer-Jones (2011, Sect. 2.2). Finally, we applied a model of the Gaia instrumental profile (Paolo Montegriffo, priv. comm.). This model has not yet been publicly documented, but may be released as part of the upcoming GDR3.

This sample consists of 21 560 spectra in a piecewise regular grid, although a handful of spectra are missing due to specific MARCS models failing to converge. The sample contains three subsets of stars, which we dub ‘giants’, ‘dwarfs’, and ‘cool dwarfs’. The Teff–log g nodes are shown in Fig. 1. The Teff and log g values form three rectangles in Teff–log g space, one for each type of star. The giants cover Teff from 4000 K to 5500 K in steps of 250 K and log g from 1.0 dex to 3.0 dex in steps of 0.5 dex. The dwarfs cover Teff from 4000 K to 6250 K in steps of 250 K and log g from 3.0 dex to 5.0 dex in steps of 0.5 dex. The cool dwarfs cover Teff from 3500 K to 3900 K in steps of 100 K and log g from 3.0 dex to 5.5 dex in steps of 0.5 dex. For each combination of Teff and log g we calculate spectra with [Fe/H] values from −2.0 dex to −1.0 dex in steps of 0.5 dex and then up to 0.25 dex in steps of 0.25 dex; [α/Fe] values of 0.0 dex and 0.4 dex; and A0 values from 0.0 to 1.0 in steps of 0.1.

thumbnail Fig. 1.

Values of Teff and log g of the grid of synthetic spectra. For each node, spectra with nine values of [Fe/H], two values of [α/Fe], and eleven values of A0 have been calculated.
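To make the layout of the grid concrete, the sketch below enumerates the Teff–log g nodes of the three rectangular sub-grids described above. The numbers are taken directly from the text; the helper function is purely illustrative and not part of our pipeline.

```python
import numpy as np
from itertools import product

def nodes(teff_lo, teff_hi, teff_step, logg_lo, logg_hi, logg_step):
    """Return all (Teff, log g) pairs of one rectangular sub-grid."""
    teffs = np.arange(teff_lo, teff_hi + teff_step / 2, teff_step)
    loggs = np.arange(logg_lo, logg_hi + logg_step / 2, logg_step)
    return list(product(teffs, loggs))

giants      = nodes(4000, 5500, 250, 1.0, 3.0, 0.5)   # 7 x 5 nodes
dwarfs      = nodes(4000, 6250, 250, 3.0, 5.0, 0.5)   # 10 x 5 nodes
cool_dwarfs = nodes(3500, 3900, 100, 3.0, 5.5, 0.5)   # 5 x 6 nodes

# Each node is then combined with the [Fe/H], [alpha/Fe], and A0 values listed
# in the text to form the full, piecewise regular grid of synthetic spectra.
```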

Figure 2 shows synthetic BP/RP spectra for stars near the middle of each sub-sample, for [α/Fe] = 0.0 and 0.4 dex. The giant spectra have Teff = 4750 K and log g = 2.0 dex. The dwarf spectra have Teff = 5250 K and log g = 4.0 dex. The cool dwarf spectra have Teff = 3700 K and log g = 4.5 dex. Figure 3 shows the ratio in predicted flux between the α-poor and α-rich spectra. Based on this, one would expect performance to be best for the cool dwarfs, worse for the giants, and worst for the dwarfs.

thumbnail Fig. 2.

Predicted BP/RP spectra, with [α/Fe] of 0.0 and 0.4 dex, for representative spectra in the giant, dwarf, and cool dwarf sub-grids.

thumbnail Fig. 3.

Ratio between the fluxes in the synthetic spectra in Fig. 2.

For the scientific aim of this study, the coolest stars are probably of less interest than the rest of the sample, since their intrinsic faintness means that they are usually not used as tracers of Galactic trends. We originally included them because preliminary cross-validation with a smaller version of the synthetic sample indicated that the models had learned to use the edges of the grid to ‘cheat’, breaking the degeneracies of [α/Fe] with Teff and log g by assuming that Teff and log g have sharp cut-offs beyond which no stars exist. In principle this problem still exists – the grid has to end somewhere – but the grid should be large enough that the region of main scientific interest is unaffected.

3.2. GALAH sample

We constructed a sample of 188 078 observed spectra of stars with stellar parameters that were estimated as part of Data Release 2 of the GALAH survey (GALAH DR2; Buder et al. 2018). The full GALAH DR2 contains slightly fewer than 350 000 stars observed with the High Efficiency and Resolution Multi-Element Spectrograph (HERMES) at the Anglo-Australian Telescope (AAT). These have estimates of Teff, log g, [Fe/H], and [α/Fe] that were made using The Cannon (Ness et al. 2015). The Cannon was in turn trained on a representative subset of 10 605 stars, for which stellar parameters were estimated using the spectral synthesis and fitting code Spectroscopy Made Easy (SME) (Piskunov & Valenti 1996, 2017).

We have applied the following selection criteria to construct our sample (a code sketch of these cuts is given after the list):

  • The GALAH uncertainty in Teff is below 5%.

  • The GALAH uncertainties in log g and [Fe/H] are below 0.5 dex.

  • The BP/RP spectra must have at least five transits in both BP and RP.

  • The BP/RP spectra have S/N > 300.

  • They must not have been flagged by The Cannon as possibly being unusual or having poor spectral reduction.

  • The S/N in the GALAH green band must be larger than 25.

  • Teff must be below 7000 K, since stars hotter than that are underrepresented in the training sample of the Cannon.
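The sketch below shows how these criteria might be applied to a joined table of GALAH DR2 parameters and Gaia BP/RP metadata. All column names are hypothetical placeholders; in particular, the flag and S/N columns stand in for the actual catalogue fields rather than naming them.

```python
import pandas as pd

def select_galah_sample(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the selection criteria listed above to a joined table.

    df: one row per star, with GALAH DR2 parameters and Gaia BP/RP metadata.
    All column names are illustrative placeholders.
    """
    mask = (
        (df["e_teff"] / df["teff"] < 0.05)   # Teff uncertainty below 5%
        & (df["e_logg"] < 0.5)               # log g uncertainty below 0.5 dex
        & (df["e_fe_h"] < 0.5)               # [Fe/H] uncertainty below 0.5 dex
        & (df["bp_n_transits"] >= 5)         # at least five BP transits
        & (df["rp_n_transits"] >= 5)         # at least five RP transits
        & (df["xp_snr"] > 300)               # BP/RP spectrum S/N > 300
        & (df["cannon_flag"] == 0)           # not flagged by The Cannon
        & (df["snr_green"] > 25)             # GALAH green-band S/N > 25
        & (df["teff"] < 7000)                # exclude stars hotter than 7000 K
    )
    return df[mask]
```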

Figure 4 shows the distribution of the sample over [Fe/H] and [α/Fe]. The two distinct peaks correspond to the metal-rich, α-poor Thin Disk and the metal-poor, α-rich Thick Disk. Figure 5 shows the distribution of the sample over Teff and log g. A model trained on this sample will not be applicable to spectra of any star that falls outside the parameter ranges represented in these plots.

thumbnail Fig. 4.

Logarithmic-scale 2D histogram of the literature [Fe/H] and [α/Fe] for the GALAH sample.

thumbnail Fig. 5.

Logarithmic-scale 2D histogram of the literature Teff and log g for the GALAH sample.

Figure 6 shows in red a spectrum with literature parameters Teff = 4324 K, log g = 1.95 dex, [Fe/H] = 0.018 dex, and [α/Fe] = 1.6 × 10−3 dex. Plotted in blue to yellow are the corresponding synthetic spectra with Teff = 4300 K, log g = 2.0 dex, [Fe/H] = 0.0 dex, and [α/Fe] = 0.0 dex, covering A0 from 0.0 to 1.0. The observed spectrum does not quite match any of the synthetic spectra, or any interpolation between them. This implies that either the effects of the instrumental profile and extinction are not perfectly captured by our models, or the star itself is atypical or has incorrectly estimated parameters.

thumbnail Fig. 6.

Example spectrum from the GALAH sample shown in red. Shown below in blue to yellow are synthetic spectra with the same parameters, up to the step-length of the grid, and A0 varying from 0.0 to 1.0.

3.3. Gaia sample

This sample is taken from the spectra currently being evaluated for inclusion in the upcoming GDR3, although some of them may end up being filtered out by the Gaia internal validation process. Most of the spectra do not have known stellar parameters. This means that they cannot be used to directly test how well a model performs by comparing the parameter estimates to literature values. However, the sample can be used for indirect tests, by checking how well the parameter estimates reproduce known Galactic structure.

The spectra were originally selected based on the following criteria:

  • They belong to the December 2019 version of the validation source table (VST) sample that was assembled by coordination unit 8 (CU8) of Gaia to validate the results of the Gaia astrophysical parameters inference system (APSIS).

  • The uncertainty in the parallax is better than 20%.

  • The BP/RP spectra have at least five transits in both BP and RP.

Since the ExtraTrees algorithm is incapable of extrapolating outside the range of the labels it has been trained on, we also apply the following cuts in magnitude to limit the sample to stars resembling those represented in the GALAH DR2 sample:

$$0.5 < G_{\text{BP}} - G_{\text{RP}} < 2.1, \qquad (1)$$

$$0 < G + 5 \left( \log_{10}\left( \varpi / \left[\mathrm{mas}\right] \right) + \left( G_{\text{BP}} - G_{\text{RP}} \right) \right) < 9, \qquad (2)$$

where GBP is the integrated BP magnitude, GRP is the integrated RP magnitude, G is the G-band magnitude, and ϖ is the Gaia parallax. Figure 7 shows this sample, with the cuts superimposed.

thumbnail Fig. 7.

Logarithmic-scale 2D histogram of the sum of the G magnitude and the 10-logarithm of the Gaia parallax, and the BP-RP colour for the Gaia sample. Cuts given by Eqs. (1) and (2) superimposed.
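As an illustration, the two cuts reduce to boolean masks on the photometry and parallax. The function below is a sketch: the argument names are placeholders, and the parallax is assumed to be in milliarcseconds, matching the [mas] scaling in Eq. (2).

```python
import numpy as np

def gaia_sample_cuts(g, g_bp, g_rp, parallax_mas):
    """Boolean mask implementing Eqs. (1) and (2).

    g, g_bp, g_rp : apparent magnitudes; parallax_mas : parallax in mas.
    """
    colour = g_bp - g_rp
    colour_cut = (0.5 < colour) & (colour < 2.1)                  # Eq. (1)
    pseudo_abs = g + 5.0 * (np.log10(parallax_mas) + colour)      # left side of Eq. (2)
    magnitude_cut = (0.0 < pseudo_abs) & (pseudo_abs < 9.0)
    return colour_cut & magnitude_cut
```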

3.4. Gaia-Enceladus sample

Gaia-Enceladus is a kinematically and chemically distinct population of stars in the Galaxy. This structure was discovered in the kinematic data from Gaia DR2, together with the chemical data from the APOGEE survey. It is believed to be the remains of a dwarf galaxy slightly larger than the Small Magellanic Cloud, which was absorbed early in the history of the Galaxy (Helmi et al. 2018; Gaia Collaboration 2018; Majewski et al. 2017).

Gaia-Enceladus is chemically different from the rest of the Galaxy: it is more α-rich than the Thin Disk and more metal-poor than either the Thick or the Thin Disk (Helmi et al. 2018, Fig. 2). Because of this, it can be used as a test of what a model estimating [α/Fe] is actually measuring. If a model is measuring the direct, causal effect of [α/Fe], it should give estimates for Gaia-Enceladus that follow the true trend of [α/Fe] as a function of [Fe/H] in that structure. If the model is simply assuming that [α/Fe] follows the Galactic trend, then the estimates will follow the same trend for Gaia-Enceladus as for the rest of the Galaxy. If the model is using indirect correlations between [α/Fe] and several other parameters, then the estimates can be expected to depart from those of the Galaxy without necessarily following the true trend for Gaia-Enceladus.

We built a sample of 5783 stars (Carine Babusiaux, priv. comm.), selected as described in Helmi et al. (2018). For those stars, we collected BP/RP spectra which will be published in the upcoming GDR3. These spectra have parameters and abundances estimated as part of the APOGEE survey. Figure 8 shows the literature [Fe/H] and [α/Fe] of the sample.

thumbnail Fig. 8.

Logarithmic-scale 2D histogram of the literature [Fe/H] and [α/Fe] for the Gaia-Enceladus sample.

4. Training on synthetic sample

We made a regression model by training the ExtraTrees algorithm on the synthetic sample. In Sect. 4.1 we perform cross-validation to verify that the model can reconstruct its own training sample. In Sect. 4.2 we investigate to what extent the reconstructed [α/Fe] can be used to distinguish the α-rich and α-poor populations in the training sample. In Sect. 4.3 we apply the model to the GALAH sample. In Sect. 4.4 we calculate the permutation feature importance when applying the model to either sample. We do not show the results of applying the model to the Gaia and Gaia-Enceladus samples, since they lead to the same conclusions as the results for the GALAH sample.

4.1. Cross-validation

We performed ten-fold cross-validation on the training sample. This means that instead of training a model on the full sample, as we did in the rest of our analyses, we divided the sample into ten random subsets of approximately equal size. Then, for each subset, we trained a model on the other nine and used that model to estimate the parameters for that subset. This essentially gave a best-case estimate of how well a model trained on the full sample can possibly be expected to perform, since it showed how well a model trained on a sample with that statistical distribution over features and labels can do on a sample with the exact same distribution.
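A minimal sketch of this procedure using scikit-learn's KFold splitter is shown below. X and y stand for the full feature matrix (spectra) and label matrix; they are placeholders rather than variables from our code.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import KFold

def ten_fold_predictions(X, y):
    """Out-of-fold estimates: each spectrum is predicted by a model
    that never saw it during training."""
    predictions = np.empty_like(y, dtype=float)
    for train_idx, test_idx in KFold(n_splits=10, shuffle=True,
                                     random_state=0).split(X):
        model = ExtraTreesRegressor(n_estimators=200, max_features=None,
                                    min_samples_split=2, n_jobs=-1)
        model.fit(X[train_idx], y[train_idx])
        predictions[test_idx] = model.predict(X[test_idx])
    return predictions
```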

By construction, the synthetic spectra are noise-free. This means that even if the ExtraTrees algorithm is robust to overfitting as such, there is a risk that models trained on these spectra may pick up on features in the data that could not be distinguished in any observed spectrum. For our cross-validation, we therefore added Gaussian pixel noise to the spectra that were not currently used for training. To get realistic noise, we drew the standard deviations from the uncertainty vector of the GALAH spectrum with the most similar parameters. We took the most similar spectrum to be the one with the smallest distance d, defined as

$$d \equiv \sqrt{ \sum \left( \frac{ p_{\text{GALAH}} - p_{\text{synth.}} }{ \sigma_{\text{GALAH}} } \right)^{2} }, \qquad (3)$$

where pGALAH is the literature value of parameter p for the spectrum in the GALAH sample, psynth. is the value of the same parameter in the synthetic sample, and σGALAH is the uncertainty in pGALAH; the summation runs over the parameters Teff, log g, [Fe/H], and [α/Fe]. As described in Appendix B, not including noise in the spectra that the model is applied to gave rise to unphysical artefacts in the estimates. On the other hand, also including noise in the training spectra turned out to have a completely negligible effect.
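The nearest-neighbour matching and noise injection can be sketched as follows. The array names are placeholders: galah_params and galah_sigmas hold the literature parameters and their uncertainties for the GALAH stars, galah_flux_err holds their per-pixel flux uncertainties, and synth_params and synth_flux describe the synthetic grid. This illustrates the procedure described in the text rather than reproducing the code we ran.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_matched_noise(synth_params, synth_flux,
                      galah_params, galah_sigmas, galah_flux_err):
    """Add Gaussian pixel noise to synthetic spectra, with per-pixel standard
    deviations taken from the GALAH spectrum whose parameters
    (Teff, log g, [Fe/H], [alpha/Fe]) are closest in the sense of Eq. (3)."""
    noisy = np.empty_like(synth_flux)
    for i, p_synth in enumerate(synth_params):
        # Eq. (3): uncertainty-normalised distance in parameter space.
        d = np.sqrt(np.sum(((galah_params - p_synth) / galah_sigmas) ** 2, axis=1))
        nearest = np.argmin(d)
        noisy[i] = synth_flux[i] + rng.normal(0.0, galah_flux_err[nearest])
    return noisy
```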

Figure 9 shows the difference between the estimated [α/Fe] for the α-rich and α-poor spectra at each Teff–log g node, averaged over [Fe/H] and A0. For the hottest stars, the spectra are practically indistinguishable; for some grid nodes the difference Δ[α/Fe] even drops slightly below zero, reaching as low as −0.03 dex. As the temperature drops, Δ[α/Fe] rises above 0.2 dex. This is considerably less than the actual difference of 0.4 dex, as expected due to the asymmetric errors described in Sect. 2. Even so, some actual signal is clearly present.

thumbnail Fig. 9.

Difference Δ[α/Fe] between the α-rich and α-poor stars in the synthetic sample, when performing cross-validation. For each Teff–log g node, the average is taken over [Fe/H] and A0.

4.2. Categorisation

To quantify the practical usefulness of the [α/Fe] estimates shown in Fig. 9, we made a test that can serve as a toy model of a spectroscopic study. The assumption is that the spectroscopist is interested in distinguishing stellar populations based on the [α/Fe] estimates, without being interested in the numeric estimates themselves.

In reality, each spectrum in the synthetic sample belongs to either an α-poor population with [α/Fe] = 0.0 dex or an α-rich population with [α/Fe] = 0.4 dex. At each node in the Teff–log g grid we introduced a threshold value of [α/Fe], and assigned all spectra below that threshold to the α-poor population and all spectra above it to the α-rich population. We placed this threshold so that the fraction of misclassified spectra is the same for both populations. (If the threshold were set low enough, all α-rich and no α-poor spectra would be correctly classified, and vice versa.) This crossover error rate (CER) is shown in Fig. 10. For the hottest stars the populations are almost indistinguishable, but as the temperature drops, the stars have a probability of around 80–90% of being correctly classified. The fact that the CER has a minimum around 4000 K, rather than decreasing monotonically as the temperature drops, shows that the results are not an artefact of the grid edge. Based on this, it seemed that at least in an ideal case the ExtraTrees algorithm could be useful in a study attempting to assign stars to populations, even if the numeric [α/Fe] can be difficult to interpret.

thumbnail Fig. 10.

Crossover error rate (CER) when attempting to distinguish α-rich and α-poor spectra when imposing a threshold [α/Fe] on the estimates derived during cross-validation of the synthetic sample. The black region around 6000 K has a CER near 50%, indicating that the α-rich and α-poor spectra are practically indistinguishable. The pale yellow region around 4000 K has a CER below 10%, indicating that the α-rich and α-poor spectra can easily be told apart.
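A simple way to compute the crossover error rate at one grid node is to sweep candidate thresholds over the [α/Fe] estimates and pick the one where the two misclassification rates are as close to equal as the finite sample allows. The function below is an illustrative implementation of that idea, not the exact code used for Fig. 10.

```python
import numpy as np

def crossover_error_rate(est_alpha_poor, est_alpha_rich):
    """CER for the [alpha/Fe] estimates at one Teff-log g node.

    est_alpha_poor: estimates for spectra with true [alpha/Fe] = 0.0 dex
    est_alpha_rich: estimates for spectra with true [alpha/Fe] = 0.4 dex
    """
    thresholds = np.sort(np.concatenate([est_alpha_poor, est_alpha_rich]))
    best_cer, best_gap = 1.0, np.inf
    for t in thresholds:
        miss_rich = np.mean(est_alpha_rich < t)    # alpha-rich classified as poor
        miss_poor = np.mean(est_alpha_poor >= t)   # alpha-poor classified as rich
        gap = abs(miss_rich - miss_poor)
        if gap < best_gap:                         # closest to equal error rates
            best_gap, best_cer = gap, 0.5 * (miss_rich + miss_poor)
    return best_cer
```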

4.3. Application on GALAH sample

We used the model trained on the synthetic sample to estimate [α/Fe] for the GALAH sample. This resulted in a root mean squared error (RMSE) of 0.2 dex when applied to the entire sample, and 0.1 dex when limited to giants.

To evaluate the results in more detail we looked at the normalised difference:

$$\Delta^{\text{norm}}_{p} \equiv \frac{ p_{\text{GALAH}} - p_{\text{ExtraTrees}} }{ \sigma_{\text{GALAH}} }, \qquad (4)$$

where pGALAH is the parameter estimate from GALAH, pExtraTrees is the corresponding estimate made with the ExtraTrees model, and σGALAH is the stated uncertainty in the GALAH estimate. In the ideal case, the GALAH estimates would scatter around the true parameter values following normal distributions with standard deviations equal to the literature uncertainties, and the estimates by the ExtraTrees model would always be exactly equal to the true parameter values. In that case, Eq. (4) would tend towards a standard normal distribution in the limit of a large number of estimates. However, in Appendix A we find that, at least for Teff, the GALAH survey has slightly overestimated errors, which causes our estimates to seem slightly more accurate than they actually are.
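For reference, the normalised difference of Eq. (4) and the RMSE quoted above reduce to a couple of numpy expressions; the arrays below are placeholders for the GALAH values, their uncertainties, and the ExtraTrees estimates of one parameter.

```python
import numpy as np

def normalised_difference(p_galah, p_extratrees, sigma_galah):
    """Eq. (4): GALAH minus ExtraTrees estimate, in units of the GALAH uncertainty."""
    return (p_galah - p_extratrees) / sigma_galah

def rmse(p_galah, p_extratrees):
    """Root mean squared error of the ExtraTrees estimates relative to GALAH."""
    return np.sqrt(np.mean((p_galah - p_extratrees) ** 2))
```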

Figure 11 shows histograms of $\Delta^{\text{norm}}_{[\alpha/\mathrm{Fe}]}$ for the full sample and for giant stars. It also shows the same for the cross-validation on the GALAH sample, discussed in Sect. 5.1. While there is some ability to determine Teff, log g, and [Fe/H] (discussed in Appendix A), the values of [α/Fe] are essentially random. It appears that the features of the synthetic spectra that the model has learned to use in the determination of [α/Fe] are not robust enough to be usable in real-world spectra.

thumbnail Fig. 11.

Normalised difference, as defined in Eq. (4), between GALAH [α/Fe] estimates and our estimates using the ExtraTrees algorithm. If our performance was perfect, the distributions would in the limit of infinite data tend towards the standard normal distribution shown in grey.

4.4. Permutation feature importance

While a model using the ExtraTrees algorithm is mostly opaque, there are a few tools for estimating the relative importance of different features. We used the permutation feature importance, which, for a particular model and a sample with known labels, estimates the importance of each feature by testing how much randomly permuting the values of that feature across the sample lowers the quality of the estimates. However, this method has the disadvantage that it can only be used on models that estimate a single parameter (Scikit-learn developers 2020, Sect. 4.2). Hence, we trained a model that only estimates [α/Fe] and used it for this test, rather than the model simultaneously estimating Teff, log g, [Fe/H], and [α/Fe] that we used in the rest of the article. The difference in performance between the models is small enough that conclusions based on the smaller model are likely to apply to the full model as well.
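In scikit-learn this corresponds to the permutation_importance helper, sketched below for a single-output model; the array names are placeholders for the spectra and the [α/Fe] labels.

```python
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.inspection import permutation_importance

# Single-output model: y_alpha_train contains only the [alpha/Fe] labels.
model = ExtraTreesRegressor(n_estimators=200, max_features=None, n_jobs=-1)
model.fit(X_train, y_alpha_train)

# Importance of each BP/RP pixel: drop in the default R^2 score when the
# values of that pixel are randomly permuted across the evaluation sample.
result = permutation_importance(model, X_eval, y_alpha_eval,
                                n_repeats=10, random_state=0, n_jobs=-1)
pixel_importance = result.importances_mean   # one value per spectral pixel
```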

Figure 12 shows the permutation feature importance when applying the model trained on the synthetic sample on either the synthetic sample itself or the GALAH sample, together with a spectrum for comparison. There are two large peaks and one small peak where the permutation importance is higher when applying the model to the GALAH sample than when applying it to the synthetic sample. This shows that those pixels are genuinely important to getting useful estimates for the GALAH sample – more so than in the synthetic sample. Over the rest of the spectrum the permutation importance is at best close to zero for the GALAH sample. In several places it is even negative, meaning that those pixels actively make the estimates worse.

thumbnail Fig. 12.

Permutation feature importance when applying a model trained on the synthetic sample to the synthetic sample itself and to the GALAH sample. For comparison, the spectrum of the α-rich dwarf star from Fig. 2 is shown in black, normalised to match the highest peak of the feature importances.

5. Training on GALAH sample

We made a second regression model by training the ExtraTrees algorithm on the GALAH sample. In Sect. 5.1 we perform cross-validation. In Sect. 5.2 we use the model to estimate [α/Fe] for the synthetic sample. In Sect. 5.3 we calculate the permutation feature importance when applying the model to either sample. In Sect. 5.4 we use the model to estimate [α/Fe] for the Gaia sample. In Sect. 5.5 we use the model to estimate [α/Fe] for the Gaia-Enceladus sample.

5.1. Cross validation

We performed ten-fold cross-validation analogously to Sect. 4.1, except that we did not add synthetic noise. This resulted in an RMSE of 0.06 dex when applied to the entire sample, and 0.07 dex when limited to giants.

To evaluate the results in more detail, we again looked at the normalised difference as defined in Eq. (4). Figure 11 shows histograms of $\Delta^{\text{norm}}_{[\alpha/\mathrm{Fe}]}$ for the full sample and for giant stars. It also shows the same for a model trained on the synthetic sample, as discussed in Sect. 4.3. Our [α/Fe] estimates are less accurate than those of GALAH, but there is still clearly a signal. Our estimates for Teff, log g, and [Fe/H] (discussed in Appendix A) are of similar accuracy to those of GALAH.

5.2. Application on synthetic sample

We used the model trained on the GALAH sample to estimate [α/Fe] for the synthetic sample. Figure 13 shows the difference in estimated [α/Fe] for the α-rich and α-poor spectra, at each Teff–log g node and averaged over [Fe/H] and A0, analogously to Fig. 9. The convex hull of the training sample is overplotted. There is almost no difference between the two groups of spectra, with the best performance being just below Δ[α/Fe] = 0.05 dex. This reveals that the information that the model has learned to pick out of the observed spectra is not actually present in the synthetic spectra.

thumbnail Fig. 13.

Difference Δ[α/Fe] between estimated average [α/Fe] for the α-rich and α-poor stars in the synthetic sample, after training on the GALAH sample. For each Teff–log g node, the average is taken over [Fe/H] and A0. Convex hull of training sample shown as dashed line. We note that if this plot had used the same colour scale as in Fig. 9, it would seem almost uniform to the eye.

For the cool dwarfs, Δ[α/Fe] changes sign, so that the estimated [α/Fe] is actually slightly higher for the α-poor spectra than for the α-rich ones. This reflects the fact that those spectra fall outside the range of features represented in the training sample: while the model detects that the α-rich and α-poor spectra are different, it cannot use that information in any meaningful way. Otherwise we might have expected better performance for them, based on Fig. 10.

5.3. Permutation feature importance

Similarly to the test on the model trained on the synthetic sample described in Sect. 4.4, we used the GALAH sample to train a model to only estimate [α/Fe]. We then calculated the permutation feature importance for this model with respect to the synthetic sample and to the GALAH sample itself.

Figure 14 shows the permutation feature importance, together with a spectrum for comparison. When applying the model to the GALAH training sample, there are several peaks that show what pixels are important and useful in the estimates. When applying the same model to the synthetic sample there are no major peaks, reflecting the fact that the sample contains very little information that the model is able to use. When comparing to Fig. 12, it is clear that the two models have learned to look at very different parts of the spectra.

thumbnail Fig. 14.

Permutation feature importance when applying a model trained on the GALAH sample to the synthetic sample and to the GALAH sample itself. For comparison, the spectrum of the α-rich dwarf star from Fig. 2 is shown in black, normalised to match the highest peak of the feature importances.

5.4. Application on Gaia sample

We used the model trained on the GALAH sample to estimate [α/Fe] for the Gaia sample. We applied a cut at log g = 3.3, since we found that the performance is different for giant stars. Figure 15 shows our average estimated [α/Fe] as a function of Galactic position, for giant stars1. This reveals a qualitatively realistic Galactic structure, with an [α/Fe]-poor Thin Disk and an [α/Fe]-rich Thick Disk. There is also flaring of the Disk at increasing Galactocentric distance. To verify this result, we show the corresponding estimated [Fe/H] in Fig. 16. This again shows a qualitatively realistic structure, with a [Fe/H]-rich Thin Disk and a [Fe/H]-poor Thick Disk.

thumbnail Fig. 15.

Estimated [α/Fe] for the Gaia sample, using the GALAH sample as training sample, averaged over Galactocentric distance r and height z over the Galactic plane. Position of the Sun shown for reference as a ⊙.

thumbnail Fig. 16.

Estimated [Fe/H] for the Gaia sample. Interpretation otherwise the same as in Fig. 15.

This demonstrated that an ExtraTrees model trained on observed Gaia BP/RP spectra can make qualitatively realistic [α/Fe] estimates for a different sample of observed Gaia BP/RP spectra. However, since it was unable to do so on the synthetic spectra, it appears to be doing this by using indirect correlations between [α/Fe] and other stellar properties that have an effect on the spectrum, rather than the direct effect of [α/Fe] on the spectrum. Such correlations exist in all observed samples, but by construction are missing from the synthetic sample.

This conclusion is slightly tentative, since it could in principle instead be that shortcomings in the modelling of the synthetic spectra made them unusable to a model trained on observed spectra. However, we do not believe so: While we know that there are shortcomings in our spectral synthesis, we do not expect this to result in information that is present in the observed spectra simply disappearing in the synthetic spectra. Rather, we would expect it to cause offsets in the estimates derived by a model trained on observed spectra, which is not what we see.

5.5. Application on Gaia-Enceladus

Proceeding on the assumption that the model uses correlations between [α/Fe] and other properties of the stars, we attempted to constrain what those properties are. At its most trivial, the model could simply have been using [Fe/H] as a proxy for [α/Fe], exploiting the Galactic trend of [α/Fe] as a function of [Fe/H]. To test this, we used the model to estimate parameters for the Gaia-Enceladus sample, since Gaia-Enceladus contains a significant fraction of stars that do not follow the Galactic trend in [α/Fe].

Figure 17 shows the estimated parameters for the Gaia-Enceladus sample, marking the metal-poor giants in blue and other stars in red. The literature parameters are shown in grey, together with the convex hull of the training sample. It is apparent that the model is not able to reproduce the true distribution over [α/Fe] and [Fe/H] in Gaia-Enceladus.

thumbnail Fig. 17.

Estimated parameters for Gaia-Enceladus using GALAH DR2 as training sample. α-poor giant stars ([α/Fe] < 0.1, log g < 3.3) are shown in blue, other stars in red. True parameters shown in the background in grey. Convex hull of training sample shown as dashed line.

Figure 18 shows the trend of [α/Fe] as a function of [Fe/H] for the GALAH DR2 and Gaia-Enceladus samples. The trends are estimated by taking the [Fe/H] and [α/Fe] estimates for each star and smoothing them with a kernel one hundredth of the width of the sample (a sketch of this smoothing is given after Fig. 18). The estimated trends are different, as they should be. Unfortunately, since we have seen that the models have difficulty discriminating between stars that only differ in [α/Fe], it is unlikely that the models have genuinely captured the difference in [α/Fe] between the samples. Instead, the difference between the estimated trends shows that the model does not simply use [Fe/H] as a proxy for [α/Fe], but takes other correlations into account as well.

thumbnail Fig. 18.

Trend in [α/Fe] as a function of [Fe/H] for the GALAH DR2 and Gaia-Enceladus samples, using GALAH DR2 as training sample.
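The smoothing step can be illustrated as a kernel-weighted running mean of [α/Fe] as a function of [Fe/H]. The sketch below is one possible implementation under the assumption that the kernel is Gaussian with a width of one hundredth of the [Fe/H] range of the sample; it is meant to show the idea rather than to reproduce Fig. 18 exactly.

```python
import numpy as np

def smoothed_trend(fe_h, alpha_fe, n_points=200):
    """Kernel-smoothed trend of [alpha/Fe] as a function of [Fe/H].

    fe_h, alpha_fe: per-star estimates; returns (grid, trend) arrays.
    """
    bandwidth = (fe_h.max() - fe_h.min()) / 100.0   # one hundredth of the sample width
    grid = np.linspace(fe_h.min(), fe_h.max(), n_points)
    trend = np.empty(n_points)
    for i, x in enumerate(grid):
        weights = np.exp(-0.5 * ((fe_h - x) / bandwidth) ** 2)  # Gaussian kernel
        trend[i] = np.sum(weights * alpha_fe) / np.sum(weights)
    return grid, trend
```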

6. Summary and conclusion

We have attempted to find out if it is possible to use the ExtraTrees algorithm to estimate [α/Fe] from Gaia BP/RP spectra. In our study we used four samples of spectra: The ‘synthetic sample’, consisting of simulated spectra covering a grid of parameters; the ‘GALAH sample’, consisting of observed spectra with parameters known from the GALAH survey; the ‘Gaia sample’, consisting of observed spectra without known parameters; the ‘Gaia-Enceladus sample’, consisting of observed spectra that are part of the Gaia-Enceladus structure and have known parameters from the APOGEE survey.

We first trained a model on the synthetic sample. When applied to synthetic spectra, the model could estimate [α/Fe] with enough discrimination to allow distinguishing model populations of α-rich and α-poor stars. We then applied the model to the GALAH sample and found that it was unable to estimate [α/Fe] to any useful extent. Since models using the ExtraTrees algorithm are not very transparent, it was not possible to directly tell what information the model had learned to use from the synthetic sample, but this showed that that information is not actually present in observed spectra.

Next, we trained a model on the GALAH sample of observed spectra. We found that it was unable to estimate [α/Fe] for the synthetic sample, but for the Gaia sample it did so well enough to reconstruct a realistic Galactic structure. Based on this we tentatively concluded that while the model could estimate [α/Fe], it did so by using indirect correlations between [α/Fe] and other properties that have an effect on the spectrum, rather than the direct, causal effect of [α/Fe] on the spectrum. We then applied the model to the Gaia-Enceladus sample, demonstrating that while the model does make use of indirect correlations, it does not merely treat [Fe/H] as a proxy for [α/Fe].

In the process of testing the GALAH sample we also found that, at least for the parameter Teff, the GALAH survey has slightly overestimated their own random errors. However, the discrepancy is likely small enough not to affect our conclusions.

In conclusion, we find that ExtraTrees models trained on observed spectra can make estimates of [α/Fe] that are relatively close to the true values. However, those estimates are indirect – based on the correlation between [α/Fe] and other parameters including but not limited to [Fe/H] – rather than the direct causal effect of [α/Fe] on the BP/RP spectra. This implies that this method cannot be used to distinguish stars that only differ in [α/Fe]. Finally, there are indications that cool dwarf stars may be an exception to this, but we cannot empirically verify this at present as our samples of observed Gaia BP/RP spectra do not cover that parameter range.


1. Figures 15 and 16 have previously been released with minor differences as a Gaia image of the week (Gavel et al. 2020).

Acknowledgments

AG and AJK acknowledge support from the Swedish National Space Agency (SNSA). RA and MF contributions were funded in part by the DLR (German space agency) via grant 50 QG 1403. RS is supported by the Agenzia Spaziale Italiana (ASI) through contracts I/037/08/0, I/058/10/0, 2014-025-R.0, 2014-025-R.1.2015, and 2018-24-HH.0 to the Italian Istituto Nazionale di Astrofisica (INAF). We thank Bengt Edvardsson for calculating synthetic spectra. We thank Carine Babusiaux for making the Gaia-Enceladus sample available to us. We thank Ulrike Heiter and Coryn Bailer-Jones for helpful feedback. We thank our colleagues from Gaia DPAC CU5, especially Dafydd Wyn Evans, Francesca De Angeli and Paolo Montegriffo, for their continuous support. This work has made use of data from the European Space Agency (ESA) mission Gaia (http://www.cosmos.esa.int/gaia), processed by the Gaia Data Processing and Analysis Consortium (DPAC, http://www.cosmos.esa.int/web/gaia/dpac/consortium). Funding for the DPAC has been provided by national institutions, in particular the institutions participating in the Gaia Multilateral Agreement.

References

  1. Bailer-Jones, C. A. L. 2011, MNRAS, 411, 435
  2. Buder, S., Asplund, M., Duong, L., et al. 2018, MNRAS, 478, 4513
  3. Buder, S., Lind, K., Ness, M. K., et al. 2019, A&A, 624, A19
  4. Carrasco, J. M., Weiler, M., Jordi, C., et al. 2021, A&A, 652, A86
  5. Dumont, M., Marée, R., Wehenkel, L., & Geurts, P. 2009, Proc. Fourth Int. Conf. Comput. Vis. Theory Appl., 2, 196
  6. Gaia Collaboration (Prusti, T., et al.) 2016, A&A, 595, A1
  7. Gaia Collaboration (Brown, A. G. A., et al.) 2018, A&A, 616, A1
  8. Gavel, A., Korn, A. J., Andrae, R., & Fouesneau, M. 2020, The chemical trace of Galactic stellar populations as seen by Gaia, https://www.cosmos.esa.int/web/gaia/iow_20200320
  9. Geurts, P., Ernst, D., & Wehenkel, L. 2006, Mach. Learn., 63, 3
  10. Gustafsson, B., Edvardsson, B., Eriksson, K., et al. 2008, A&A, 486, 951
  11. Helmi, A., Babusiaux, C., Koppelman, H. H., et al. 2018, Nature, 563, 85
  12. Liu, C., Bailer-Jones, C. A. L., Sordo, R., et al. 2012, MNRAS, 426, 2463
  13. Majewski, S. R., Schiavon, R. P., Frinchaboy, P. M., et al. 2017, AJ, 154, 94
  14. Ness, M., Hogg, D. W., Rix, H.-W., Ho, A. Y. Q., & Zasowski, G. 2015, ApJ, 808, 16
  15. Pedregosa, F., Varoquaux, G., Gramfort, A., et al. 2011, J. Mach. Learn. Res., 12, 2825
  16. Piskunov, N. E., & Valenti, J. A. 1996, A&AS, 118, 595
  17. Piskunov, N. E., & Valenti, J. A. 2017, A&A, 597, A16
  18. Recio-Blanco, A., de Laverny, P., Allende Prieto, C., et al. 2016, A&A, 585, A93
  19. Scikit-learn developers 2020, User Guide, https://scikit-learn.org/stable/user_guide.html

Appendix A: Performance on the GALAH sample for parameters other than [α/Fe]

In the main body of the article, we focus on estimates of [α/Fe], but the models simultaneously estimate the parameters Teff, log g, and [Fe/H]. To evaluate the trustworthiness of the models, we also looked at the quality of these estimates.

Figure A.1 shows the normalised residuals, as defined by Eq. (4), for the parameter Teff, when applying models trained on either sample to the GALAH sample. For the cross-validation using the GALAH sample, the performance is slightly better than the theoretical maximum described in Sect. 4.1. This cannot happen purely as a result of good performance on the part of the model. Rather, it shows that the errors reported by the GALAH survey are slight overestimates, at least if they are interpreted as standard deviations of normally distributed random scatter. Limiting our results to the giants in the GALAH sample, our performance is slightly lower. With training on the synthetic sample, the scatter is considerably larger, with an offset for the entire sample but not for the subset of giants.

thumbnail Fig. A.1.

Normalised difference, as defined in Eq. (4), between GALAH Teff estimates and our estimates using the ExtraTrees algorithm. Standard normal distribution shown in grey.

Figure A.2 shows the normalised residuals for log g. The performance is close to ideal during cross-validation on the entire GALAH sample, but slightly lower for the subset of giants. With training on the synthetic sample, there is larger scatter as well as a considerable offset, which gets noticeably worse for the subset of giants.

thumbnail Fig. A.2.

Normalised difference, as defined in Eq. (4), between GALAH log g estimates and our estimates using the ExtraTrees algorithm. Standard normal distribution shown in grey.

Figure A.3 shows the normalised residuals for [Fe/H]. The performance is again close to ideal during cross-validation on the entire GALAH sample and slightly lower for the subset of giants. With training on the synthetic sample, there is larger scatter as well as a considerable offset, but no large difference between the full sample and the subset of giants.

thumbnail Fig. A.3.

Normalised difference, as defined in Eq. (4), between GALAH [Fe/H] estimates and our estimates using the ExtraTrees algorithm. Standard normal distribution shown in grey.

Appendix B: Cross-validation on the synthetic sample with and without simulated noise

When performing cross-validation on the synthetic spectra in Sect. 4.1, we trained the models on the synthetic spectra as they are, but then applied the model to synthetic spectra with simulated noise added. We believe this is the optimal test, since it trains the model on the highest-quality spectra we have, but does not exaggerate the performance of the model by testing it on spectra of impossibly high quality. For completeness, we also tested applying the model to spectra without simulated noise, as well as training a model on spectra with noise and then applying it to spectra with noise.

Figure B.1 shows the Δ[α/Fe] when cross-validation is performed without adding synthetic noise to the application sample. The performance is better than that with synthetic noise in the application sample, as shown in Fig. 9. However, this is unlikely to reflect any performance that could be achieved with real spectra: In the dwarf sample, at Teff = 5000 K there is a sudden rise in Δ[α/Fe], which continues along the boundary to the giant sample at log g = 3.0 dex. This is unlikely to have any physical basis, and might indicate that the model is making use of artefacts in the synthetic spectra to estimate [α/Fe]. When noise is added to the spectra, this feature disappears.

thumbnail Fig. B.1.

Difference Δ[α/Fe] between the α-rich and α-poor stars in the synthetic sample, when performing cross-validation without adding synthetic noise. For each Teff-log g node, the average is taken over [Fe/H] and A0.

Figure B.2 shows the Δ[α/Fe] when cross-validation is performed with the addition of synthetic noise to both the training and application sample. The performance is almost identical to that shown in Fig. 9, with Δ[α/Fe] shifting less than 0.03 dex in any bin.

thumbnail Fig. B.2.

Difference Δ[α/Fe] between the α-rich and α-poor stars in the synthetic sample, when performing cross-validation adding synthetic noise to both training sample and application sample. For each Teff-log g node, the average is taken over [Fe/H] and A0.

