Machine learning in APOGEE

Rafael Garcia-Dias; Carlos Allende Prieto; Jorge Sánchez Almeida; Ignacio Ordovás-Pascual

doi:10.1051/0004-6361/201732134

Volume 612 (April 2018)

A&A, 612 (2018) A98

Issue: A&A Volume 612, April 2018
Article Number: A98
Number of page(s): 56
Section: Numerical methods and codes
DOI: https://doi.org/10.1051/0004-6361/201732134
Published online: 04 May 2018

Unsupervised spectral classification with K-means★

Rafael Garcia-Dias¹,², Carlos Allende Prieto¹,², Jorge Sánchez Almeida¹,² and Ignacio Ordovás-Pascual³

¹ Instituto de Astrofísica de Canarias, 38200 La Laguna, Tenerife, Spain
e-mail: rafaelagd@gmail.com
² Departamento de astrofísica, Universidad de La Laguna, Tenerife, Spain
³ Instituto de Física de Cantabria (CSIC-UC), 39005 Santander, Spain

Received: 19 October 2017
Accepted: 23 January 2018

Abstract

Context. The volume of data generated by astronomical surveys is growing rapidly. Traditional analysis techniques in spectroscopy either demand intensive human interaction or are computationally expensive. In this scenario, machine learning, and unsupervised clustering algorithms in particular, offer interesting alternatives. The Apache Point Observatory Galactic Evolution Experiment (APOGEE) offers a vast data set of near-infrared stellar spectra, which is perfect for testing such alternatives.

Aims. Our research applies an unsupervised classification scheme based on K-means to the massive APOGEE data set. We explore whether the data are amenable to classification into discrete classes.

Methods. We apply the K-means algorithm to 153 847 high resolution spectra (R ≈ 22 500). We discuss the main virtues and weaknesses of the algorithm, as well as our choice of parameters.

Results. We show that a classification based on normalised spectra captures the variations in stellar atmospheric parameters, chemical abundances, and rotational velocity, among other factors. The algorithm is able to separate the bulge and halo populations, and distinguish dwarfs, sub-giants, RC, and RGB stars. However, a discrete classification in flux space does not result in a neat organisation in the parameters’ space. Furthermore, the lack of obvious groups in flux space causes the results to be fairly sensitive to the initialisation, and disrupts the efficiency of commonly-used methods to select the optimal number of clusters. Our classification is publicly available, including extensive online material associated with the APOGEE Data Release 12 (DR12).

Conclusions. Our description of the APOGEE database can help greatly with the identification of specific types of targets for various applications. We find a lack of obvious groups in flux space, and identify limitations of the K-means algorithm in dealing with this kind of data.

Key words: methods: data analysis / methods: numerical / catalogs / surveys / techniques: spectroscopic / Galaxy: stellar content

★ Full Tables B.1–B.4 are only available at the CDS via anonymous ftp to cdsarc.u-strasbg.fr (130.79.128.5) or via http://cdsarc.u-strasbg.fr/viz-bin/qcat?J/A+A/612/A98

© ESO 2018

1 Introduction

The volume of data generated by many existing and forthcoming astronomical instruments is simply too large for traditional analysis techniques. Two extreme cases are the Large Synoptic Survey Telescope (LSST; Gressler et al. 2014) and the Gaia mission (Gaia Collaboration 2016).

Optimal use of modern astronomical instrumentation requires open and efficient access to the resulting observations. Such access is provided by large and well-organised databases (e.g. the Hubble Space Telescope (HST), Gaia, or the Sloan Digital Sky Survey archives). As happens with data reduction, the exploitation of these vast data sets cannot be carried out with traditional tools (see e.g. the discussion by Bailer-Jones 2002). Classification is the first step in any automated analysis. It can be used to identify and discard noisy data or to group like objects to follow a common interpretation pipeline. It is certainly needed when exploring new types of data and it is also an invaluable tool to identify rare objects, usually the most telling from a scientific point of view.

Numerous works have explored the performance of automatic MK classification of spectra (see e.g. Bailer-Jones et al. 1998; Singh et al. 1998; Bailer-Jones 2002; Rodríguez et al. 2004; Giridhar et al. 2006; Manteiga et al. 2009; Navarro et al. 2012). The main approach followed in these works was supervised learning, training on labelled data. The unsupervised approach was also applied in works such as Vanderplas & Connolly (2009), Daniel et al. (2011) and Reis et al. (2018). In this work we focus on an unsupervised approach that does not aim to reproduce the MK classification.

Among all unsupervised classification methods, K-means (e.g. MacQueen et al. 1967; Everitt & Dunn 1992; Jain 2010) is a flexible clustering algorithm that has been extensively used in the literature. We have already employed K-means in several applications, including the identification of similar targets to average and reduce noise (Sánchez Almeida et al. 2009), the classification of one million galaxy spectra representative of the local universe (Sánchez Almeida et al. 2010), a systematic search for rare extremely metal-poor galaxies (Morales-Luis et al. 2011; Sánchez Almeida et al. 2016), and the classification of the large stellar spectra data set available from the Sloan Digital Sky Survey, in particular data from the Sloan Extension for Galactic Understanding and Exploration (SEGUE; Sánchez Almeida & Allende Prieto 2013). In this work we show the virtues and limitations of K-means in this context, making a first step in the search for alternatives. This work is also the first to perform classification on data from the Apache Point Observatory Galactic Evolution Experiment (APOGEE).

In this paper, we turn our attention to high-resolution stellar spectroscopy, and in particular to APOGEE, part of the Sloan Digital Sky Survey (Eisenstein et al. 2001; Blanton et al. 2017). We examine whether or not the massive APOGEE data set is amenable to a sensible unsupervised classification scheme based on K-means. Section 2 describes the APOGEE spectroscopic data in detail, including the APOGEE Stellar Parameters and Chemical Abundances Pipeline (ASPCAP; García Pérez et al. 2016). Section 3 is devoted to the details of the classification algorithm, and Sect. 3.1 describes its application to the APOGEE data, preceded by numerical experiments based on simulated data. Section 4 discusses the main results, and Sect. 5 summarises the conclusions.

2 Data set

APOGEE makes use of a novel fibre-fed high-resolution H-band spectrograph to obtain simultaneously up to 300 stellar spectra (Wilson et al. 2010, 2012). The APOGEE spectrograph is usually coupled to the Sloan Foundation 2.5-m telescope at the Apache Point Observatory, but has also been linked to the New Mexico State University 1-m telescope at the same location. The project has already obtained spectra for more than 300 000 stars in the Milky Way, focusing on red giants and therefore covering a broad range of galactocentric distances. Working in the near-IR, between 1.5 and 1.7 μm, APOGEE can access regions of the Galaxy heavily obscured by dust, such as the mid-plane of the Galaxy, or the bulge and the Galactic bar near the centre (Majewski et al. 2017).

APOGEE spectra are processed by a custom-made data pipeline that extracts the spectra, calibrates them, and corrects telluric absorption and sky emission lines before measuring radial velocities (Nidever et al. 2015). The pipeline ASPCAP performs an automated analysis based on model atmospheres, delivering atmospheric parameters and chemical abundances for the majority of the observed stars¹. The atmospheric model grid boundaries in effective temperature are 3500 and 8000 K, in log g the boundaries are 0 and 5, and in [M/H] they are –2.5 and 0.5 dex. More details about the grid can be found in Table 2 of Holtzman et al. (2015).

The APOGEE pipelines are in constant evolution and the data set continues to grow. In this work, we have adopted the data made publicly available in DR12², the final data release from SDSS-III (Alam et al. 2015; Holtzman et al. 2015). This data set includes over 150 000 stars observed between 2011 and 2014. The resolving power of the APOGEE data is R ≡ λ∕δλ ≃ 22 500, and the typical signal-to-noise ratio (S/N) exceeds 100 per half a resolution element. In addition, we used quality and target flags³, and the uncalibrated parameters derived by ASPCAP⁴ in order to evaluate the result of the classification (Sect. 4). Besides sky coordinates and atmospheric parameters (temperature, surface gravity, and micro turbulence), the data set includes metallicities, α-element abundance, and individual chemical abundances for 15 elements⁵. As described in Holtzman et al. (2015), the DR12 results were calibrated using star clusters’ data in order to eliminate abundance trends with temperature and systematic differences with the literature. Since calibrated parameters are not available for all stars in DR12, we chose to use the uncalibrated parameters and chemical abundances. This choice should not affect the interpretation of our results; we are not interested in absolute values for each object, but in relative differences among spectra with intrinsically different shapes. In addition, using the uncalibrated data we can arguably better understand ASPCAP.

3 Classification algorithm

Cluster analysis aims to organise a collection of objects into classes based on a similarity criterion, such that objects in the same class are more alike than objects in different classes. Numerous clustering algorithms are available in the literature (e.g. Everitt & Dunn 1992), but in general all involve the following main steps: (1) feature selection, the identification of the features that best represent the objects in the data set; (2) choosing a feature proximity indicator, the figure of merit that optimally defines the similarity between objects in the data set; (3) establishing the grouping criterion, meaning the clustering algorithm itself; and (4) cluster validation, an evaluation of the output quality.

In the feature selection phase, we excluded all pixels potentially affected by sky emission and telluric absorption. Standard K-means algorithms are designed such that all input objects must have the same dimensions, and therefore we have to consider the same pixels in all spectra. For the vast majority of APOGEE observations, 35 fibres are devoted to observing warm stars to measure telluric absorption, 35 fibres observe the sky by pointing at blank regions, and 230 fibres acquire science spectra. To determine which pixels are most affected by sky emission and telluric absorption, we took the average of the normalised sky and telluric spectra for all fields in APOGEE DR12, and used them to identify and exclude from our analysis all pixels for which the mean sky count is above 1 per cent of the maximum mean normalised sky count. We also excluded all pixels at which the mean normalised telluric spectrum falls more than five per cent below the continuum.

Figure 1 shows the mean sky and telluric spectra, the cuts applied, and the regions excluded from the spectra used in the K-means classification. In this figure we have displaced the mean sky spectrum vertically for clarity. Since stars have different heliocentric velocities, the spectra were corrected for Doppler shifts, and therefore they are affected by sky emission and telluric absorption at different wavelengths for different stars. This can be seen in Fig. 1 from the width of the mean normalised telluric lines and sky emission lines. From the 8575 original wavelength pixels, we kept 4838 pixels, or 56 per cent of the APOGEE spectral coverage. All the spectra were also normalised using a fourth-degree polynomial regression for each of the three chips in the APOGEE spectrograph. We also clipped values of the normalised flux higher than 1.02 (i.e. two per cent above the pseudo-continuum level), setting the flux value to 1.02 to avoid any remaining problem with sky emission lines.
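The pixel selection and flux clipping described above can be sketched in a few lines of NumPy. This is only an illustration and not the code used for the paper: the function names `select_pixels` and `clip_flux` and the toy arrays are hypothetical, while the thresholds (1 per cent of the maximum mean sky count, five per cent below the continuum, and the 1.02 ceiling) are those quoted in the text.

```python
import numpy as np

def select_pixels(mean_sky, mean_telluric, sky_frac=0.01, telluric_depth=0.05):
    """Boolean mask of the pixels kept for the classification.

    mean_sky      : mean normalised sky emission spectrum (1D array)
    mean_telluric : mean normalised telluric spectrum, continuum at 1 (1D array)
    """
    mean_sky = np.asarray(mean_sky, dtype=float)
    mean_telluric = np.asarray(mean_telluric, dtype=float)
    # Keep pixels whose mean sky count does not exceed 1% of the maximum.
    sky_ok = mean_sky <= sky_frac * mean_sky.max()
    # Keep pixels whose telluric absorption is at most 5% below the continuum.
    telluric_ok = mean_telluric >= 1.0 - telluric_depth
    return sky_ok & telluric_ok

def clip_flux(normalised_flux, ceiling=1.02):
    """Set normalised fluxes above the ceiling to the ceiling value."""
    return np.minimum(normalised_flux, ceiling)

# Toy example with five pixels and two spectra (made-up values).
mean_sky = np.array([0.0, 0.5, 100.0, 0.2, 3.0])
mean_telluric = np.array([1.0, 0.99, 1.0, 0.90, 0.97])
keep = select_pixels(mean_sky, mean_telluric)

spectra = np.array([[1.00, 0.98, 1.50, 0.91, 1.05],
                    [0.95, 1.03, 1.20, 0.88, 0.99]])
features = clip_flux(spectra[:, keep])   # same pixels for every spectrum
```

Applying one common mask to every spectrum is what guarantees that all objects passed to K-means have the same dimension.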

The chosen feature proximity metric was the Euclidean distance. That is the most straightforward possibility, since the objects to be classified are normalised spectra, which can be regarded as data points in an N-dimensional space. It also has the advantage of being easily interpreted and having a low computational cost.

The grouping criterion is the way one assigns each object to a certain cluster, and it determines how groups are designed. For example, clustering can be partitional, producing a single partition in which all clusters are hierarchically equivalent, or hierarchical, producing a structure of clusters and sub-clusters. Furthermore, clustering is said to be hard if it assigns each object to a single cluster, as opposed to soft clustering, where objects are assigned a non-zero probability of belonging to more than one cluster.

In this work we explore the use of K-means (MacQueen et al. 1967), a partitional hard clustering algorithm. It is one of the most popular clustering algorithms, mainly because it is easy to implement and its computational cost scales linearly with the number of objects to be classified. The fundamental steps in K-means are (1) choose the number of clusters K; (2) define K initial cluster centres; (3) assign each object in the sample to the closest cluster; (4) recompute cluster centres as the centroid of the objects assigned to each cluster; (5) repeat steps 3 and 4 until a convergence criterion is met. Usually the convergence criterion is either a decrease of the within-cluster variance below a threshold or a minimal number of reassignments between two consecutive iterations. Here we adopt the criterion of having fewer than one per cent of the objects reassigned between two consecutive iterations.
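The five steps above, with the one per cent reassignment criterion, can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the names `assign` and `kmeans`, the `max_iter` safeguard, and the choice to leave an empty cluster's centre unchanged are all hypothetical. The distance computation uses broadcasting for clarity only; for a sample of roughly 150 000 spectra it would have to be done in chunks to fit in memory.

```python
import numpy as np

def assign(data, centres):
    """Index of the closest centre (Euclidean distance) for each object."""
    # Squared distances, shape (n_objects, n_clusters).
    d2 = ((data[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(data, centres, max_reassigned=0.01, max_iter=100):
    """Hard partitional K-means starting from the given initial centres."""
    centres = np.array(centres, dtype=float)       # steps 1-2: K = len(centres)
    labels = assign(data, centres)                 # step 3
    for _ in range(max_iter):
        for k in range(len(centres)):              # step 4
            members = data[labels == k]
            if len(members) > 0:                   # assumption: empty clusters keep their centre
                centres[k] = members.mean(axis=0)
        new_labels = assign(data, centres)         # step 3 again
        changed = np.mean(new_labels != labels)    # fraction of reassigned objects
        labels = new_labels
        if changed < max_reassigned:               # step 5: convergence criterion
            break
    return labels, centres

# Toy example: two well-separated blobs in three dimensions.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0.0, 0.1, (50, 3)),
                  rng.normal(5.0, 0.1, (50, 3))])
labels, centres = kmeans(data, centres=data[[0, 99]])
```

Each iteration costs one distance evaluation per object and per cluster, which is the linear scaling with the number of objects mentioned in the text.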

Initialisation can also be done in different ways. The simplest is to randomly choose objects from the entire sample, but if the data set has an over-abundance of a particular kind of object, the clusters would over-sample those objects. In order to avoid this, we initialise in an iterative fashion: we carry out a couple of K-means iterations with K = 10, randomly choose an object in the most abundant cluster as an initial centre, discard all objects in this cluster, and repeat the process until the desired number of initial cluster centres is reached. During the process, if more than 95 per cent of the objects are discarded, we select the remaining cluster centres randomly from the whole sample. In this work we have translated the algorithm presented by Sánchez Almeida et al. (2010) from IDL⁶ to Python⁷. Besides serial and parallel performance optimisation, no major modifications were made. Using Python we achieved a simpler and faster code, which also has the advantage of being available on an open-source platform.
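The iterative initialisation can be sketched as below. This is an illustrative reading of the description, not the translated IDL code: the function names are hypothetical, "a couple of iterations" is taken to mean two, and two safeguards are assumptions that the text does not mention (falling back to random picks when fewer than ten objects remain, and excluding already chosen objects from the random picks).

```python
import numpy as np

def few_kmeans_iterations(data, k, rng, n_iter=2):
    """A couple of plain K-means iterations from randomly chosen objects."""
    centres = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        d2 = ((data[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centres[j] = data[labels == j].mean(axis=0)
    return labels

def initial_centres(data, n_centres, k_aux=10, max_discarded=0.95, seed=0):
    """Pick initial centres without over-sampling the most common objects."""
    rng = np.random.default_rng(seed)
    remaining = np.arange(len(data))     # indices of objects not yet discarded
    chosen = []
    while len(chosen) < n_centres:
        discarded = 1.0 - len(remaining) / len(data)
        # Second condition is an assumed safeguard, not part of the description.
        if discarded > max_discarded or len(remaining) < k_aux:
            # Remaining centres are drawn at random from the whole sample
            # (assumption: objects already chosen are excluded).
            pool = np.setdiff1d(np.arange(len(data)), np.array(chosen, dtype=int))
            extra = rng.choice(pool, size=n_centres - len(chosen), replace=False)
            chosen.extend(extra.tolist())
            break
        labels = few_kmeans_iterations(data[remaining], k_aux, rng)
        biggest = np.bincount(labels, minlength=k_aux).argmax()
        members = remaining[labels == biggest]
        chosen.append(int(rng.choice(members)))    # random object of the largest cluster
        remaining = remaining[labels != biggest]   # discard that whole cluster
    return data[chosen]

# Toy example: 20 initial centres for 500 random objects in four dimensions.
toy = np.random.default_rng(1).normal(size=(500, 4))
start = initial_centres(toy, n_centres=20)
```

Because the largest cluster is removed after each pick, over-abundant kinds of objects contribute at most one centre per pass, which is the purpose stated in the text.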

We have compared our results with those obtained using the SciPy⁸ and scikit-learn⁹ algorithms. The results are qualitatively equivalent. The advantage of using our own code is that we remain consistent with previous works in the literature (Sánchez Almeida et al. 2009, 2010, 2016; Morales-Luis et al. 2011; Sánchez Almeida & Allende Prieto 2013).

A major drawback of any clustering classification is that the algorithm will always return partitions regardless of whether clusters exist. In addition, the algorithm does not guarantee convergence to a global solution. Moreover, many implementations require choosing the number of clusters. In order to overcome these problems, or even just to find out how serious they are, we apply cluster validation techniques. We are interested in verifying whether the data have intrinsic clusters, whether there is an optimal number of clusters, and whether the clusters derived in flux space exist in parameters’ space.

Fig. 1

Mean sky normalised emissions (blue line) and telluric absorption (red line) spectra for the 153 847 spectra in the sample. Mean sky normalised emissions fluxes are displaced by one unit to help visualisation. Black lines define the cut applied to each spectrum. Grey shades highlight the areas excluded from the K-means classification.

3.1 Choosing the number of clusters

Choosing the optimal number of clusters is a critical step in K-means classification. There is no universal criterion to do it, although many heuristic criteria have been developed over the last fifty years (Tibshirani et al. 2001). In an attempt to select the most suitable criteria for our problem we built a testbed data set with 6900 synthetic spectra spread over 69 well-defined clusters in surface gravity (log g, in cgs units), temperature (T_eff), α abundance ([α/M]), and metallicity ([M/H]), as shown in Fig. 2. For two given elements X and Y, the bracket notation [X/Y] is defined as $\begin{equation*} [\mathrm{X}/\mathrm{Y}] = \log_{10}{\left(\frac{N_{\mathrm{X}}}{N_{\mathrm{Y}}}\right)_{\mathrm{star}}} - \log_{10}{\left(\frac{N_{\mathrm{X}}}{N_{\mathrm{Y}}}\right)_{\odot}}, \end{equation*}$ (1)

where N_X and N_Y are the number of X and Y nuclei per unit volume, respectively. Metallicity is a measure of all the chemical elements heavier than He, assuming they vary in the same proportions with respect to the solar values. Analogously, [α/M] is a measure of all α-elements (O, Ne, Mg, Si, S, Ar, Ca, and Ti) assuming they vary in unison. The centres of the clusters were chosen based on the densest regions in the HR diagram of the empirical data set with parameters from DR12. The parameters for each spectrum were randomly chosen around each cluster centre, following a normal distribution with $\sigma_{T_{\mathrm{eff}}} = 50$ K and σ_{log g} = σ_[M/H] = σ_[α/M] = 0.05. The synthetic spectra were built using the code FERRE¹⁰, interpolating in a grid of theoretical models (Allende Prieto et al. 2004, 2006; Zamora et al. 2015). We use a model grid with seven parameters per spectrum: microturbulence velocity (ξ_v), carbon abundance ([C/M]), nitrogen abundance ([N/M]), mean α-element abundance ([α/M]), metallicity ([M/H]), surface gravity (log g), and effective temperature (T_eff). However, the parameters ξ_v, [C/M], and [N/M] were fixed to the mean values¹¹ of the stars in the DR12 sample for all spectra. In order to explore the best-case scenario we have not added any noise to the spectra.

We applied K-means to the simulated data set ten times, with K varying from 5 to 100. We then applied four different statistical criteria trying to recover the optimal number of clusters, knowing that the actual number is 69. We tried the KL index (Gordon 1998), the gap statistic (Tibshirani et al. 2001), the CH index (Caliński & Harabasz 1974), and the silhouette index (Rousseeuw & Kaufman 1990). These indexes were selected for being the most widely and successfully used in the literature. None of the chosen criteria was able to identify the right number of clusters, with the CH index being the only one capable of giving consistent results over different initializations, finding K = 9 ± 1.8, far from the true value of 69. The other methods found a σ_K > 12 over the ten different runs, while randomly selecting ten numbers in this range would result in σ_K ≈ 25. A possible explanation for this failure is that, despite the clusters being well-defined in parameters’ space, the classification is made in flux space, where the separation between classes seems to be more subtle.

In the absence of better criteria, we have chosen the number of clusters based on the within-class standard deviation of the atmospheric parameters and chemical abundances. Figure 3 shows the variation of the median σ values for each of the four main input parameters. We use the notation $\widehat{X}$ for the median of X. It is important to use medians instead of means in order to avoid the predominance of the few classes that gather faulty and unusual spectra, especially when we start to work with the observed data set, which has classes with few spectra (<30) and a large dispersion in atmospheric parameters and chemical abundances. We see a decrease in $\widehat{\sigma}_{X}$ as K grows for all quantities. This means that dividing the spectra in flux space into more classes also results in finer partitions in the atmospheric parameter and abundance spaces. Therefore, we can choose K based on a threshold value for σ. The extreme case would be to increase K until there is one star per class, reaching the minimum variation. However, since the computational cost scales with K and we also lose generality when increasing K, we should choose K as a compromise between accuracy, agility, and generality.
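The statistic used here, the median over classes of the within-class standard deviation of one parameter, can be sketched as follows. The function name `median_within_class_std` and the toy values are hypothetical, and the population standard deviation (NumPy's default) is assumed since the text does not specify the estimator.

```python
import numpy as np

def median_within_class_std(values, labels):
    """Median over classes of the within-class standard deviation."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    sigmas = [values[labels == k].std() for k in np.unique(labels)]
    return float(np.median(sigmas))

# Toy example: three tight classes and one class of outlying objects.
values = [1.0, 3.0, 10.0, 10.0, 0.0, 4.0, 0.0, 200.0]
labels = [0, 0, 1, 1, 2, 2, 3, 3]
sigma_hat = median_within_class_std(values, labels)
```

In this toy case the within-class standard deviations are 1, 0, 2, and 100, so the median is 1.5 while the mean would be 25.75, which illustrates why the median is less affected by a few classes of unusual spectra.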

We know the $\widehat{\sigma}_{X}$ values and K for the synthetic data set. Therefore we can verify how much we can trust the variance for the choice of K. Figure 3 shows that when K = 69 we have exactly the input metallicity dispersion; $\widehat{\sigma}_{\log g}$ is well above the input level, while $\widehat{\sigma}_{\alpha}$ and $\widehat{\sigma}_{T_{\mathrm{eff}}}$ are both below the input level. The figure also shows that the slope ($|\partial \widehat{\sigma}_{X}/\partial K|$) of the curves decreases rapidly for K ≳ 50. Therefore, increasing K does not produce a significant change in $\widehat{\sigma}_{T_{\mathrm{eff}}}$ and $\widehat{\sigma}_{\log g}$ for K ≳ 50. The plots also reveal a different sensitivity for each parameter.

The actual APOGEE data set behaves in a similar way. Figure 4 shows how K affects the median of the standard deviation of T_eff, log g, [M/H] and the abundances of carbon, nitrogen, and α-elements with respect to metallicity, and the same for the abundances of the chemical elements Al, Ca, C, Fe, K, Mg, Mn, Na, Ni, N, O, Si, S, Ti, and V. From these plots we have chosen K = 50 as the number of clusters to be used throughout the paper, since beyond that value increasing K does not significantly reduce the within-cluster dispersion of the parameters.

Fig. 2

Atmospheric parameters for the synthetic data set. Left panel: effective temperature and surface gravity for the synthetic spectra. The right top panel presents the projection of the clusters in the T_eff − [M∕H] plane, while the right bottom panel shows the plane T_eff − [α∕M].

Fig. 3

Variation of the median standard deviation as a function of the number of clusters K in the synthetic data set. The solid black lines represent the median standard deviation of the classes in a run. The solid horizontal grey lines show the median standard deviation at K = 69. Dashed black lines show the input standard deviation. The top left panel refers to α abundances [α/M], the top right to metallicity [M/H], bottom left to logarithmic surface gravity log g, and the bottom right to effective temperature T_eff.

Fig. 4

Variation of the median standard deviation as a function of the number of clusters K in the real data set. The top panel refers to effective temperature T_eff, while the bottom panel shows the variation of the median standard deviation for the other 20 parameters available in DR12, as indicated in the legend box.

3.2 Repeatability of the classification

The randomised initialisation of K-means implies that different runs generate slightly different results. In order to evaluate the repeatability of the process we define a coincidence index ε, which measures the ratio of coincidence between two different classifications based on the number of spectra in equivalent classes, as described in Sánchez Almeida et al. (2010). We note that the label assigned to a class can vary between classifications, even when the class retains essentially the same objects. Therefore, when comparing two different classifications, we first need to cross-identify the classes. For example, let X be a set of N objects, X = {x₀, x₁, …, x_{N−1}}, classified in K clusters with two different initialisations. Each initialisation generates a set of clusters, say Ω = {ω₀, ω₁, …, ω_{K−1}} in one classification and Γ = {γ₀, γ₁, …, γ_{K−1}} in a second classification. In each classification we label the classes ensuring that the number of objects in the ith class (n_i) follows the rule n_i ≥ n_{i+1}. To build a comparison between clusters we define a K × K coincidence matrix A, with the elements a_{i,j} being the number of objects in cluster ω_i that are also in cluster γ_j: $\begin{equation*} a_{i,j} = \sum_{\iota \in \omega_i} \delta_{\iota}^{j}, \quad \textrm{where} \quad \delta_{\iota}^{j} = \begin{cases} 1, & \textrm{if $\vec{x}_{\iota}$ is in cluster $\gamma_j$} \\ 0, & \textrm{if it is not.} \end{cases} \end{equation*}$ (2)

Thus, we match the jth cluster in Γ to the cluster in Ω having the maximum number of coincidences with it, $j_{\textrm{match}} = \mathrm{argmax}_i \{a_{0,j}, a_{1,j}, \ldots, a_{K-1,j}\}$, always ensuring that no cluster in Ω is assigned to more than one cluster in Γ. Then we use the matches to transform the matrix A into A′ by permuting its columns so that their largest numbers lie on the diagonal. The elements of the diagonal of A′ ($a'_{i,j}$, with i = j) count the agreements between the two classifications, while the other elements ($a'_{i,j}$, with i ≠ j) count the confusions between the two classifications. The trace of A′ divided by the total number of classified objects gives an estimate of the mean overall coincidence rate between the two classifications, $\bar{\varepsilon}_{\textrm{total}} = \textrm{Tr} \left\{\mathbf{A}'\right\}/N$. By defining the mean normalised coincidence matrix ($\bar{\mathbf{A}}'_{\textrm{chosen}}$) between a chosen classification and a set of η classifications with the same K, the diagonal elements give the mean coincidence ratio of each class over the η classifications, which is a measure of how stable the classes in the chosen classification are. Likewise, the off-diagonal elements measure the mean confusion ratio between different classes.
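The coincidence matrix, the class matching, and the overall coincidence rate can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the order in which matches are made (largest overlaps first, a greedy scheme) is an assumption, since the text only requires that the matching be one-to-one.

```python
import numpy as np

def coincidence_matrix(labels_a, labels_b, k):
    """a[i, j] = number of objects in class i of run A and in class j of run B."""
    a = np.zeros((k, k), dtype=int)
    np.add.at(a, (np.asarray(labels_a), np.asarray(labels_b)), 1)
    return a

def match_classes(a):
    """Greedy one-to-one matching: the largest overlaps are paired first."""
    k = a.shape[0]
    work = a.astype(float)
    perm = np.full(k, -1)                  # perm[i] = class of B matched to class i of A
    for _ in range(k):
        i, j = np.unravel_index(work.argmax(), work.shape)
        perm[i] = j
        work[i, :] = -1.0                  # row i and column j are no longer available
        work[:, j] = -1.0
    return perm

def coincidence_rate(labels_a, labels_b, k):
    """Trace of the column-permuted matrix divided by the number of objects."""
    a = coincidence_matrix(labels_a, labels_b, k)
    a_prime = a[:, match_classes(a)]       # matched classes now sit on the diagonal
    return np.trace(a_prime) / len(labels_a)

# Toy example: six objects, three classes, one object classified differently.
run_a = [0, 0, 0, 1, 1, 2]
run_b = [2, 2, 0, 0, 0, 1]
rate = coincidence_rate(run_a, run_b, k=3)
```

In the toy example five of the six objects fall in matched classes, so the rate is 5/6; two runs that differ only by a relabelling of the classes give a rate of 1.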

3.2.1 Synthetic data set

We performed a series of classifications for the synthetic data set varying the number of clusters from K = 5 to 100. For each value of K we initialised the classification with ten different random seeds, the same ten seeds for all values of K. In order to avoid any possible bias caused by choosing a particular reference, the coincidence ratio was measured for every pair of classifications having the same K. For the expected number of clusters in the synthetic data set (K = 69) the mean coincidence ratio is $\bar{\varepsilon}(K=69) = 74.7 \pm 6.2$ per cent. The mean coincidence ratio computed for all runs with K = 5 to 100 for the synthetic data set is 75.1 ± 8.4 per cent.

3.2.2 DR12 data set

Under equivalent conditions, that is, comparing all combinations of the ten classifications per value of K, with K from 5 to 100, the DR12 data set had a mean coincidence ratio of $\bar{\varepsilon} = 77.9 \pm 7.8$ per cent. When we consider only K = 50, for which we performed 100 classifications with different random initialisations, and use the chosen classification (see Sect. 3.3) as reference, the mean coincidence ratio is found to be $\bar{\varepsilon}(K=50) = 79.6 \pm 2.6$ per cent.

To understand what a mean coincidence ratio of 79.6 per cent means, we measured the mean difference between the matching classes over the 100 classifications, and compared this with the mean within-cluster dispersion of the chosen classification (see Appendix A for more details). We found that the variations of the class centroids over the 100 classifications amount to 6.4 ± 3.3 per cent of the mean internal dispersion of the corresponding class in the chosen classification. That is to say, even between runs whose classifications differ for about 25 per cent of the spectra (a coincidence of 75 per cent), the main classes end up having their centres displaced by only about 6 per cent of the internal dispersion of their class in the 4838-dimensional flux space. As we show in Sect. 3.3, the confusion occurs mainly between classes sharing borders in the space T_eff − log g − [M/H]. Except for some outlier classes, the shapes of the classes are very similar over different classifications.

3.3 Chosen classification

After running K-means a hundred times with K = 50, we chose the classification with the lowest sum of squared errors (SSE). As we are working with the Euclidean metric, the SSE is computed as $\begin{equation*} \mathrm{SSE} = \sum_{i=0}^{K-1} \sum_{\iota \in \omega_i} \|\vec{x}_{\iota} - \vec{\mu}_i\|^2, \quad \textrm{where} \quad \vec{\mu}_i = \frac{1}{n_i} \sum_{\iota \in \omega_i} \vec{x}_{\iota}, \end{equation*}$ (3)

where x_ι is the ιth spectrum in cluster ω_i and μ_i the centroid of class i. The chosen run has an SSE 9 per cent smaller than the average SSE over all classifications. As mentioned in Sect. 3.2, the coincidence ratio is measured by the number of spectra sharing the same class over two distinct classifications. Comparing the chosen classification with the other 99 runs, the average coincidence ratio is 79.6 ± 2.6 per cent, which can be considered a high repeatability rate. Also, the mean variation of the centres of the most populated classes, containing 99 per cent of the objects, is ≈2.4 per cent of the mean within-cluster variation of the classes in the chosen classification. Again, this is a comparison between the standard deviation of the centroids over the 100 classifications and the internal standard deviation of the main classes in the chosen classification. In this case the number falls from 6.4 to 2.4 per cent because we are only taking into account the classes containing 99 per cent of the spectra in the sample, classes 0 to 31.
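Selecting the run with the lowest SSE can be sketched as follows. The function name `sse` and the toy data are hypothetical; the quantity computed is the sum of squared Euclidean distances from each object to the centroid of its class, as in Eq. (3).

```python
import numpy as np

def sse(data, labels):
    """Sum of squared Euclidean distances of each object to its class centroid."""
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for k in np.unique(labels):
        members = data[labels == k]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return float(total)

# Toy example: four objects in two dimensions and two candidate classifications.
data = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [10.0, 4.0]])
runs = [np.array([0, 0, 1, 1]),    # groups nearby objects together
        np.array([0, 1, 0, 1])]    # mixes distant objects
scores = [sse(data, labels) for labels in runs]
best_run = int(np.argmin(scores))  # index of the classification that would be kept
```

For these toy values the two runs have SSE = 10 and SSE = 90, so the first classification is the one retained.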

In Fig. 5 we plot $\mathbf{A}'_{\textrm{chosen}}$, comparing the chosen classification with the other 99 classifications. The elements of $\mathbf{A}'_{\textrm{chosen}}$ are represented by a colour scale in a 2D histogram; the bottom panel in this figure shows a histogram with the main diagonal values of $\mathbf{A}'_{\textrm{chosen}}$. This plot will be useful in Sect. 4, where we will describe each group of classes and comment on the stability of each class. From now on, we will refer to the elements in the main diagonal of A′ as coincidence rates and to its other elements as confusion rates.

In Fig. 6 we compare, for each spectrum in the sample, the root mean squared distance to its best-fit spectrum with the distance to the centroid of its assigned class. The plot shows that the centroids are on average approximately five times closer to the spectra than their best fits. These larger distances between the spectra and the models are due to systematic differences between synthetic spectra based on model atmospheres and real spectra¹².

Table 1 shows a comparison between the standard deviation within clusters ($\widehat{\sigma}$) and the overall standard deviation (σ_random), corresponding to clusters randomly built. For example, T_eff and log g have a $\widehat{\sigma}$ about 3.6 and 4.2 times smaller than their corresponding σ_random, respectively. This means that the algorithm is especially sensitive to T_eff and log g. In Table 1 we also highlight the parameters that present a $\widehat{\sigma}$ at least two times smaller than their σ_random. They are T_eff, log g, [M/H], [Ca/H], [C/H], [Mg/H], [N/H], [Si/H], [S/H] and [Ti/H]. Since these are the parameters to which K-means is most sensitive, we will focus mainly on them in order to interpret the classes in the next section.

Fig. 5

Top panel: mean coincidence matrix comparing the chosen classification with the other 99 performed classifications. Elements on the main diagonal represent the coincidence ratio of a class and can be interpreted as the stability of the class. The elements in the diagonal are labelled with their corresponding class number and highlighted in green if the class has a coincidence ratio above 75 per cent or in red if the class has a coincidence ratio below 25 per cent. Elements off the diagonal can be interpreted as the confusion rate between two classes. We highlight confusion rates above 25 per cent with white stars. Bottom panel: histogram of the coincidence ratios corresponding to the diagonal of the coincidence matrix. A green dashed line marks the 75 per cent level, while a red line marks the 25 per cent level.

Fig. 6

Two-dimensional histogram comparing the distance from each spectrum to its best fit with the distance from each spectrum to its class centroid. Each pixel in the image is colour-coded according to the number of spectra in that region, as indicated by the colour bar.

Table 1

Comparison of the internal median standard deviation (third column) with the overall standard deviation (second column) for each parameter.

4 Results

After visual inspection we divided all the classes into nine groups sharing similar properties. Here we describe each group in detail, giving a summary of their classes’ mean properties. In Fig. 7 we present contour plots in T_eff −[M∕H] space. We highlight regions enclosing progressively 15, 30, 45, and 68.3 per cent of the stars in each class, with the colour shades varying from strong to light respectively. Class 21 is too concentrated to have its contours seen at this scale, so it is represented by purple dots in the figure. In some cases the separation of the contours is too tight and only three contours are visible. The figure is divided into three panels, aiming to minimise the superposition of classes. We use different colours to help identify borders between classes. Some classes have the same colour, but there is no overlap between classes with the same colour. Classes are identified with labels. For the labels we use the abbreviations G for group and C for its associated classes. Classes in group 8 have few objects, which are sparsely distributed in the T_eff −[M∕H] plane, making this plot very noisy and hard to read; for these objects we present a scatter plot in Fig. 8. Figure 9 shows the main distribution of the groups in the T_eff − log g plane. Besides the differences found in the T_eff −[M∕H] space, we also found some other particularities in the classes and groups, some of them based on the spatial distribution (RA–DEC), global chemical abundances, or spectral fluxes.
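One common way to obtain contour levels enclosing given fractions of the points is sketched below. It assumes that the regions are the highest-density bins of a 2D histogram, which is an assumption of this example and not necessarily how the figures were produced; the function name is hypothetical.

```python
def enclosing_levels(counts, fractions=(0.15, 0.30, 0.45, 0.683)):
    """For each fraction, return the bin count above which the densest bins
    of a 2D histogram enclose at least that fraction of the points."""
    flat = sorted((c for row in counts for c in row), reverse=True)
    total = sum(flat)
    levels = []
    for frac in fractions:
        running = 0
        level = 0
        for c in flat:
            running += c
            level = c  # count of the last (least dense) bin included so far
            if running >= frac * total:
                break
        levels.append(level)
    return levels


# Toy 2x2 histogram (hypothetical) with one strongly dominant bin.
counts = [[50, 20], [20, 10]]
levels = enclosing_levels(counts)
```

In this concentrated toy case the three innermost levels coincide, the situation in which fewer than four contours can be distinguished.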

In Fig. 10 we present the mean spectra, in a limited spectral window, for all classes in groups 0 to 7. Each panel in this figure shows the mean spectrum of the classes in each group colour-coded as in Fig. 7. In order to offer the highest contrast between the classes’ mean spectra, we chose the spectral coverage which maximises the cumulative variance over the first 32 classes in a 150-pixel-long window. The grey shades in the background of these plots highlight the masked pixels (those discarded from the classification, as discussed in Sect. 3). Besides the description presented in this section, we include complementary material with detailed plots for many of the DR12 available features in the supplementary material described in the Appendix. Table B.1 gives a short description for each class and provides links to the Appendix material. With these figures the reader can find more details about the atmospheric parameters, spatial distributions, and chemical abundances for each class presented.
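The window selection can be sketched as a sliding sum over the per-pixel variance of the class mean spectra. We interpret the cumulative variance as the variance among class means summed over the pixels of the window, which is an assumption; masked pixels are ignored here and the function name is hypothetical.

```python
def best_window(mean_spectra, width):
    """Start index of the window of `width` pixels that maximises the summed
    per-pixel variance among the class mean spectra."""
    n_cls = len(mean_spectra)
    n_pix = len(mean_spectra[0])
    variance = []
    for p in range(n_pix):
        column = [s[p] for s in mean_spectra]
        mu = sum(column) / n_cls
        variance.append(sum((v - mu) ** 2 for v in column) / n_cls)
    window_sum = sum(variance[:width])
    best_start, best_sum = 0, window_sum
    for start in range(1, n_pix - width + 1):
        # Slide the window by one pixel: add the new pixel, drop the old one.
        window_sum += variance[start + width - 1] - variance[start - 1]
        if window_sum > best_sum:
            best_start, best_sum = start, window_sum
    return best_start


# Toy mean spectra (hypothetical) of two classes differing in pixels 3 and 4.
mean_spectra = [[1.0, 1.0, 1.0, 0.9, 0.5, 1.0],
                [1.0, 1.0, 1.0, 0.7, 0.9, 1.0]]
start = best_window(mean_spectra, width=2)
```

For the case described in the text one would call it with the 32 class mean spectra and `width=150`.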

Tables 2 and 3 present the median values for the atmospheric parameters and all the individual chemical elements in each class. The error bars presented in the tables, as well as those shown in the next sections, were calculated by taking the interval around the median, which encloses 68.3 per cent of the points in each class.
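A median with such an interval can be sketched as follows. We assume a central interval, bounded by the 15.85th and 84.15th percentiles, and a linearly interpolated percentile; both choices, as well as the function names, are assumptions of this example.

```python
def percentile(sorted_values, q):
    """Linearly interpolated percentile (q in [0, 100]) of a pre-sorted list."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def median_with_interval(values, coverage=68.3):
    """Return (median, lower error, upper error) for a central interval
    enclosing `coverage` per cent of the points."""
    data = sorted(values)
    tail = (100.0 - coverage) / 2.0
    med = percentile(data, 50.0)
    lower = percentile(data, tail)
    upper = percentile(data, 100.0 - tail)
    return med, med - lower, upper - med


# Toy sample (hypothetical): the integers from 0 to 100.
med, err_minus, err_plus = median_with_interval(list(range(101)))
```

The two errors differ for skewed distributions, which is why the tables quote separate upper and lower values.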

Fig. 7

Contour diagrams in the T_eff −[M∕H] plane. Different colours are used to distinguish different classes. Each class is represented by four colour shades; from dark to light, the shades enclose 15, 30, 45, and 68.3 per cent of the data points in the class. The groups are separated into three panels minimising the superposition of classes. Panel a: groups 0, 1, 4, 5, and two classes of group 7; panel b: groups 2, 3, and three classes of group 7; and panel c: group 6. In these panels each class is flagged with a floating label in the form GXCXX, C referring to class and G to its group. Class 21 is represented as a scatter plot, since it is too concentrated to present visible contours on this scale.

Fig. 8

Scatter plot of T_eff against [M/H] for classes in group 8. The classes are identified as shown in the legend. The stars in this group are scattered throughout the plane.

Fig. 9

Contour diagram for the groups in the T_eff − log g plane. Each group is represented by a different colour. Colour shades enclose, from dark to light, 15, 30, 45, and 68.3 per cent of the objects in each group.

Table 2

Median atmospheric parameters and chemical abundances for the 32 most populated classes.

Table 3

Median chemical abundances for the 32 most populated classes.

Fig. 10

Mean spectra of the classes in the wavelength range from 16 178 to 16 222 Å, where the differences among classes are particularly enhanced. Top to bottom: mean spectra for classes belonging to groups from 0 to 7. Each mean spectrum is drawn with the same colours used in Fig. 7. In all panels we plot the spectral windows used in ASPCAP to determine the chemical abundances of stars. Each set of element windows is colour-coded as indicated in the legend of the first panel.

4.1 Metal-rich RC/warm RGB – Group 0 (Classes 2, 4, 6, 8, and 9)

From the distribution of log g and T_eff values in Fig. 11 one can spot this group among the red clump (RC) stars and at the warmest end of the red giant branch (RGB; Binney & Merrifield 1998). Comparing these classes with Bovy et al.’s (2014) catalogue of red clump stars, we found that 31, 26, 26, 1, and 21 per cent of the stars in classes 2, 4, 6, 8, and 9, respectively, belong to the red clump. The classes increase in metallicity in the sense $-0.07\pm^{0.10}_{0.11} = \widehat{\mathrm{[M/H]}}_{c2} < \widehat{\mathrm{[M/H]}}_{c8} < \widehat{\mathrm{[M/H]}}_{c4} < \widehat{\mathrm{[M/H]}}_{c6} < \widehat{\mathrm{[M/H]}}_{c9} = 0.30\pm^{0.09}_{0.12}$. As metallicity increases, the position of the RC moves towards cooler regions in the plane T_eff – log g, as shown in Fig. 11. Chemical abundances for individual elements also vary inside this group; for example, [Si/H] varies as follows: $-0.22\pm^{0.20}_{0.30} = \widehat{\mathrm{[Si/H]}}_{c2} < \widehat{\mathrm{[Si/H]}}_{c8} < \widehat{\mathrm{[Si/H]}}_{c4} < \widehat{\mathrm{[Si/H]}}_{c6} < \widehat{\mathrm{[Si/H]}}_{c9} = 0.26 \pm^{0.18}_{0.19}$. This group is similar to group 5 in terms of atmospheric parameters, but classes here are more metal rich. For this group there is some confusion among classes, as shown in Fig. 5. About 30 per cent of the spectra belonging to class 4 in the chosen classification are assigned to class 2 in other classifications.

Classes 2 and 8 are similar in chemical abundances, but differ in log g. Besides metallicity differences, classes 8 and 9 also differ in their spatial distribution over the Galactic plane, as shown in Fig. 12. While stars in class 8, with lower [M/H], lie preferentially at higher Galactic longitudes, stars in class 9, which are cooler and more metal rich, are mainly towards the Galactic centre. In general the fits for class 9 are poor; the observed spectral lines are deeper than those in the chosen models. Classes 2, 4, and 6 follow approximately the same spatial distribution as the APOGEE sample.

In the top panel of Fig. 10 we have a comparison of the mean spectra for all the classes in this group. For group 0, we see that their mean spectra are very similar in shape, but with different line strengths (s). The intensity of lines grows in the sense s_c8 < s_c2 < s_c4 < s_c6 < s_c9, following their median temperatures. Together, these classes include ≈27 per cent of the spectra in DR12.

Fig. 11

T_eff − log g distribution for classes in group 0. The same rules and colours from Fig. 7 were applied to contours here. Top and right panels: histograms of the distributions of T_eff and log g, respectively. The histogram line colours match the colours of the contours.

Fig. 12

Mollweide’s projection of the Galactic coordinates distribution of classes 8 and 9. Yellow and blue contours enclose 68.3 per cent of the stars in classes 8 and 9, respectively. Yellow squares represent stars in class 8 and blue squares represent stars in class 9 out of the regions containing 68.3 per cent of the points. The contour shades follow the same rule as in Fig. 7.

4.2 Metal poor cool RGB – Group 1 (Classes 7, 14, 19, 25, 26 and 28)

As shown in Fig. 13, the classes in group 1 are composed of cooler stars in the RGB ($3500 \lesssim \widehat{T_{\mathrm{eff}}} \lesssim 4200$ K and $0.79 \lesssim \widehat{\log g} \lesssim 2.03$) (Binney & Merrifield 1998). All classes are mainly formed of low latitude stars, composed of a mixture of thin and thick disk population, except for class 28 which is mainly projected towards the Galactic centre and with high α abundances, $\widehat{[\alpha/\mathrm{M}]} = 0.24\pm^{0.04}_{0.11}$. All of them are classes composed of stars in the RGB, but with increasing metallicities¹³, surface gravities¹⁴, and temperatures¹⁵.

Concerning the stability of the classes, class 25 is very stable, having a mean coincidence ratio of 82 per cent. As shown in Figs. 7 and 13, this class consists of giant stars at the tip of the RGB. Confusion higher than 10 per cent occurs between classes inside the group. The highest confusion rates are 12 per cent and 16 per cent between class 7 and classes 14 and 28, respectively, 16 per cent between classes 14 and 26, 16 per cent between classes 19 and 26, and 30 per cent between classes 26 and 28. Again, classes overlapping in the 3D space T_eff − log g −[M∕H] present the highest degrees of confusion. Between classes in this group and other classes out of the group, the confusion rate is above 5 per cent only between class 14 and class 22 (10 per cent).

Tables 2 and 3 show the classes in this group are selecting stars within narrow distributions of the parameters, including the abundances. They typically have $\bar{\sigma}_{T_{\mathrm{eff}}} \approx 100$ K, $\bar{\sigma}_{\log g} \approx 0.30$, and, for example, in class 14, the within-class dispersion of the parameter can reach $\bar{\sigma}_{X} \leq 0.1$ for [α/M], [N/M], [C/M], [Na/H], [Mn/H] and [K/H].

Class 28 is particularly spread in $\widehat{\mathrm{[C/M]}} = -0.09 \pm^{0.15}_{0.30}$, $\widehat{\mathrm{[Fe/H]}} = -1.14 \pm^{0.42}_{0.73}$, and $\widehat{\mathrm{[Al/H]}} = -0.10 \pm^{0.16}_{0.31}$. In Fig. 10, second panel from top to bottom, we see the mean spectra of the stars in this group. As in group 0, we see very similar spectral shapes, but with different line strengths.

Fig. 13

Distribution in T_eff − log g for classes in group 1. The same rules and colours from Fig. 7 were applied to the contours here. Top and right panels: histograms of the distributions of T_eff and log g, respectively. The colours of the histogram match the colours of the contours.

4.3 Warm stars – Group 2 (Classes 3, 11, and 13)

This group assembles the warmest stars in DR12. The sample includes 15 233 spectra flagged as telluric standards, warm objects ideal for characterising the telluric lines that plague the IR, of which 67 per cent are in class 3, 16 per cent in class 11, and 12 per cent in class 13. According to target-selection flags, 96 per cent of the 10 628 objects in class 3 are telluric standards, while classes 11 and 13 have up to 50 per cent of stars of this kind. The differences between the classes in this group are mainly found in T_eff and [M/H], as seen in panel b of Fig. 7; class 3 is the warmest, containing A and B type stars, according to a match with the SIMBAD catalogue (Wenger et al. 2000), while classes 11 and 13 are RGB stars, cooler and richer in metals compared with class 3 (see Table 2). The third panel in Fig. 10 shows the differences between the mean spectra of the classes in group 2. The mean spectrum of class 3 is almost featureless, while the mean spectrum in class 13 has the strongest lines in the group. Moreover, there is a difference in their spatial distribution; while class 3 mainly occupies low latitudes, classes 11 and 13 are found primarily out of the Galactic plane and towards the Galactic centre.

As we would expect, since they are easily distinguishable even by eye, classes in this group are among the most stable classes in the classification, with mean coincidence rates of 94 per cent, 73 per cent, and 80 per cent for classes 3, 11, and 13, respectively. As class 11 is cooler than classes 3 and 13, it has the highest mean confusion rate with other classes (for example it has about 10 per cent mean confusion with classes 5 and 24). Classes 5 and 24 are among the most metal-poor in the classification, emphasising the role that the degeneracy between T_eff and [M/H] plays in the determination of the stellar parameters.

All the chemical elements have very wide distributions except for [K/H] in class 11. Nevertheless, the atmospheric parameters of the stars in this group are out of the DR12 model grid, and thus it should be seen as a failure of the model fittings, as suggested by the ASPCAP flag star warn found in ≈35 per cent of the objects in this class.

4.4 Fast rotators – Group 3 (Classes 27 and 29)

This group is formed by fast rotating stars. For both classes the ASPCAP models poorly fit their spectra. As a consequence of this, some artefacts are observed in their abundances; for example, the abundances of [C/M], [α/M], [Al/H], [K/H], [Na/H], and [Si/H] are not continuous; they appear in clumps, having gaps of at least 0.2 in abundance between them.

In terms of atmospheric parameters, this group is very close to group 6 (dwarfs), but their spectra are remarkably different. The spectra of group 3 have fewer, shallower, and broader lines than those found in group 6, as can be seen in the fourth and seventh panels in Fig. 10. This shows that the algorithm is sensitive to rotation, since it is able to split the stars affected by log g line broadening from those affected by rotational line broadening. On the other hand, ASPCAP determines that the great majority of the stars in this group have log g greater than 4.9 (see Fig. 14), but since the rates of stars flagged with a fast rotation warning are 81 per cent and 93 per cent for classes 27 and 29, respectively, we cannot trust these determinations. The rate of stars flagged with a rotation warning in the entire DR12 data set is 7 per cent.

Class 29 is the most unstable of the classes, excluding the outliers (see Sect. 4.9). It has a confusion rate of 62.8 per cent with class 27, which means that for some classifications class 29 dissolves mainly in classes 13, 23, 27, and 29. Class 27 is more stable, with 63 per cent of coincidence, having some degree of confusion with class 10 (13 per cent), which has the shallowest lines in group 6.

About one quarter of the stars in class 27 and about half of the stars in class 29 are either young embedded cluster members or known calibration cluster members. Statistically we expect fast rotating stars to be younger than those that rotate more slowly (van Saders & Pinsonneault 2013). In addition, the great majority of stars form in star clusters, dispersing later on, and thus the fastest rotating stars are expected to be in young embedded clusters.

Fig. 14

Scatter plot for T_eff − log g distributions of the classes in groups 2 and 3. Top and right panels: histograms of the distributions of T_eff and log g, respectively. To aid visualisation, both panels are split into two plots with different scales. The histogram line colours match the colours of the scatter plot.

4.5 Metal-rich cool RGB – Group 4 (Classes 16, 18 and 22)

Group 4 classes include metal rich stars covering the RGB with effective temperatures from 3620 to 4140 K, and with metallicities from 0.17 to 0.22 in the order $\widehat{\mathrm{[M/H]}}_{c16} < \widehat{\mathrm{[M/H]}}_{c18} < \widehat{\mathrm{[M/H]}}_{c22}$. Some stars in this group are near the edge of the model grid, at [Fe∕H] = 0.50 (36 per cent in class 16, 26 per cent in class 18, and 24 per cent in class 22). That also happens in T_eff for class 16, which has 43 per cent of the stars cooler than 3600 K.

The stars in these classes are very concentrated in the Galactic disk, with [α/M] close to the solar value. As shown in Fig. 15, the spatial distribution of class 16 is more concentrated towards the Galactic centre than classes 18 and 22.

Classes 16 and 18 are very stable, with a coincidence rate of 91 per cent and 80 per cent, respectively. Class 22 is much less stable having a coincidence rate of 29 per cent. The highest degree of confusion for class 22 occurs with class 9 (38 per cent), but classes 14 and 18 also contaminate class 22. Those three classes, 9, 14, and 18, share borders with class 22 in the space T_eff −[M∕H], as shown in Fig. 7, and also with superposition in log g, as can be seen by comparing Figs. 13 and 16. Once again, we see that the overlap in the space T_eff −[M∕H] − log g is the main cause of confusion between classes. The abundance distributions for these classes are narrow, as reflected in Tables 2 and 3.

Fig. 15

Galactic coordinates in Mollweide’s projection for objects in classes 22 (orange triangles and contours), 18 (purple triangles and contours), and 16 (grey circles and contours), all belonging to group 4. The contour shades follow the same rule as in Fig. 7.

Fig. 16

T_eff − log g distribution for classes in group 4. The same rules and colours from Fig. 7 were applied to contours here. Top and right panels: histograms of the distributions of T_eff and log g, respectively. The colours of the histograms match the colours of the contours.

4.6 Metal-poor RC/warm RGB – Group 5 (Classes 0, 1 and 5)

Just like group 0, this group is made of classes that include stars from the RC and the warmest end of the RGB. For classes 0, 1, and 5 the ratios of red clump stars are 30, 31, and 16 per cent according to a comparison with Bovy et al. (2014). In comparison with group 0, this group is more metal-poor, with $-0.45 \lesssim \widehat{\mathrm{[M/H]}} \lesssim -0.22$. The group lacks stars in the direction of the Galactic centre, being homogeneously distributed in all other directions. Relative to group 0, group 5 is more dense in regions with Galactic latitudes higher than 30 deg. All three classes are a mixture of thin and thick disk populations, but class 5 is more populated by high [α/M] stars than other classes in the group, as shown in Fig. 17.

As shown in Fig. 18, class 5 almost completely overlaps with classes 0 and 1 in T_eff − log g space. The median temperatures of class 0 stars are about 150 K warmer than class 1 stars. Class 5 is particularly broad in T_eff and log g, covering temperatures from 4125 to 7170 K, with a median value of $\widehat{T_{\mathrm{eff}}} = 4942 \pm^{584}_{202}$ K and $\log g = 3.16 \pm^{+1.04}_{-0.38}$. Figure 17 shows the distribution of the stellar parameters in the planes T_eff −[M∕H], T_eff − [α∕M] and [α∕M] −[M∕H]. The dispersion there is likely to be an artefact due to the degeneracy between T_eff and [M/H] in the ASPCAP parameter determination pipeline. Also the class is broadly spread in $\widehat{\mathrm{[Si/H]}} = -1.38 \pm^{0.96}_{1.38}$, which may also be an artefact of ASPCAP. In this range of atmospheric parameters the pipeline is probably confusing warmer temperatures with lower metallicities, as discussed in Holtzman et al. (2015).

Fig. 17

Properties of class 5 (group 5), which contains 9144 stars (N_⋆). The panels in the uppermost diagonal contain histograms for T_eff, [M/H] and [α/M], from left to right, respectively. In these plots vertical black dashed lines show the median value and the limits enclosing 68.3 per cent of the data points around the median value. The green histograms correspond to the objects in class 5 and the grey histogram shows the distribution of the whole group 5. As indicated by labels in the axes, the other three panels show 2D histograms for T_eff − [M∕H], T_eff − [α∕M] and [α∕M] −[M∕H]. From outside to inside the contours enclose 68.3, 45, 30, and 15 per cent of the objects in the class.

Fig. 18

T_eff − log g distribution for classes in group 5. The same rules and colours from Fig. 7 were applied to contours here. Top and right panels: histograms of the distributions of T_eff and log g, respectively. The colours of the histograms match the colours of the contours.

4.7 Dwarf stars – Group 6 (Classes 10, 12, 15, 17 and 20)

With log g ranging from 4.23 to 4.35, group 6 has only dwarf stars. The classes differ because of their different temperatures and abundance patterns. Figure 19 shows the distribution of log g and T_eff for this group.

Class 12 is over-abundant in Mg ($\widehat{\mathrm{[Mg/H]}} = +0.38 \pm^{0.32}_{0.28}$), and classes 15 and 20 have low [α/M], especially in [Ca/H] and [O/H]. Some bimodality is found for [Al/H] and [K/H] for classes 15 and 20. However, 99 per cent of the objects in the group have their chemical abundances flagged with a warning and are not reliable, so this strange behaviour is likely to be an artefact of ASPCAP.

In Fig. 10 we see the FeI line around 16 210 Å is blended with the CN and CO lines near it for classes 15 and 20. In other regions of the spectra, blends like this are present. This is caused by the enhancement of molecular lines at low T_eff values.

Class 20 presents two separate blobs of [α/M] abundances, one around solar values and the other around $\widehat{[\alpha/\mathrm{M}]} = -0.3$, but almost 70 per cent of the stars in this class are flagged with the star warning, so abundance determination for these stars is not reliable. The abundance distributions of these classes are very narrow, as shown in Tables 2 and 3.

The classes here are relatively stable. Class 17 is the most unstable (50 per cent of mean coincidence rate), but has a significant degree of confusion only with classes 10, 12, and 15. Class 20 is the most stable in the group with a mean coincidence rate of 81 per cent. Other significant confusion rates are found only between classes inside the group, showing that the classes are stable as a group.

Fig. 19

T_eff − log g distribution for classes in group 6. The same rules and colours from Fig. 7 were applied to contours here. Top and right panels: histograms of the distributions of T_eff and log g, respectively. The histogram line colours match the colours of the contours.

4.8 Sparse classes – Group 7 (Classes 21, 23, 24, 30 and 31)

This group is formed by the most peculiar classes, with a number of objects corresponding to at least 0.5 per cent of the whole DR12 sample. The group is very diverse, so in this case we describe each class individually. All classes that represent less than 0.5 per cent of the sample are treated as outliers and are discussed in Sect. 4.9. Figure 20 shows the T_eff – log g distribution for the group.
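The population criterion can be illustrated with a short sketch that separates classes holding at least 0.5 per cent of the sample from the smaller ones. The function name and the toy labels are hypothetical; only the threshold comes from the text.

```python
from collections import Counter


def split_by_population(labels, min_fraction=0.005):
    """Split class labels into those holding at least `min_fraction` of the
    sample and those below that threshold (treated as outlier classes)."""
    counts = Counter(labels)
    n = len(labels)
    main = sorted(k for k, c in counts.items() if c / n >= min_fraction)
    sparse = sorted(k for k, c in counts.items() if c / n < min_fraction)
    return main, sparse


# Toy labels (hypothetical): 1000 spectra, class 2 holds only 0.3 per cent.
labels = [0] * 600 + [1] * 397 + [2] * 3
main_classes, outlier_classes = split_by_population(labels)
```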

Fig. 20

Scatter plot for T_eff vs. log g of the classes in group 7. Top and right panels: histograms of the distributions of T_eff and log g, respectively. The top panel is divided into two plots with different scales. The colours of the lines in the histograms match the colours of the scatter plot, as indicated in the legends.

4.8.1 M-giants/bulge – Class 21

Ninety-seven per cent of the stars in class 21 are at the edge of the model grid in T_eff. That is to say, their temperatures are likely to be lower than the minimum T_eff of the models in the spectral library. The class presents other anomalies; except for [C/M], [N/M], [α/M], [Al/H], [K/H], [Mn/H] and [Na/H], all other abundances are also at the edge of the model grid. Lacking sufficiently cool spectra, ASPCAP probably tries to change the abundances until reaching its limits. For these stars, the problem has been corrected in DR13 (SDSS Collaboration 2017). This class is the most stable class with a coincidence rate of 95 per cent. Figure 10, bottom panel, shows that the mean spectrum of this class looks totally different from those of the other classes, with very strong molecular bands, so K-means easily identifies these spectra as a class. Spatially, the stars are concentrated at low latitude, especially towards the Galactic centre, as shown in Fig. 21. This class also gathers 23 per cent of the bulge targets in DR12, according to its target flags.

Fig. 21

Galactic coordinates distribution of classes 21 (purple triangles), 23 (orange triangles) and 31 (blue circles).

4.8.2 Metal-poor M dwarfs – Class 23

This class is dominated by metal-poor ($\widehat{\mathrm{[M/H]}} \approx -0.54$) M dwarfs. The distribution of [α/M] is divided into four clumps, showing there is some problem with the determination of these abundances, since very similar spectra correspond to differences of 0.25 in [α/M]. The mean spectrum is similar to that of class 20, but with cooler stars; here more than 60 per cent of the stars are at the minimum T_eff = 3500 K. This similarity in their spectra causes a mean confusion rate with class 20 of 12 per cent. However, class 23 is quite stable, with a mean coincidence rate of 87 per cent. Similar to what happened to class 21, this class has many anomalies in its parameters, gaps in chemical abundances, and a high concentration at the borders of the model grid. This can also be related to limitations in ASPCAP. As shown in Fig. 21, there seems to be no anisotropy in this class. It approximately follows the spatial distribution of APOGEE.

4.8.3 K-giants from the Halo – Class 24

This is a very metal-poor class with stars lying over the whole RGB, $\widehat{T_{\mathrm{eff}}} = 4583 \pm^{322}_{330}$ K and $\widehat{\log g} = 2.22 \pm^{0.60}_{0.54}$, as shown in Fig. 20. With a median metallicity of $\widehat{\mathrm{[M/H]}} = -1.20 \pm^{0.22}_{0.25}$ it is one of the most metal-poor classes in the classification, certainly the most well-defined class among the metal-poor ones. This class is also α enhanced, with $\widehat{[\alpha/\mathrm{M}]} = 0.24 \pm 0.07$. We find that 593 out of 2388 (≈25 per cent) of these objects are globular cluster members used in APOGEE’s calibration. Its spatial distribution is more dense in Galactic latitudes above 30°. Class 24 has a very low stability, having a coincidence rate of 18 per cent. Its stars are classified as class 11 members 59 per cent of the time.

4.8.4 M 31 GCs – Class 30

In APOGEE DR12, 236 integrated spectra of Globular Clusters (GCs) in M 31 were observed; each of these spectra appears as a duplicate in the dataset. In order to remove the contamination from the unresolved M 31 stellar population in these spectra, 141 background spectra near the clusters were obtained (Zasowski et al. 2013). Altogether they add up to 613 spectra in the region of M 31. This class has the largest number of objects in this region, 171, with 33 background spectra and 69 duplicated GCs spectra. In general the spectra present high absorption in the continuum, as shown for the mean behaviour by the yellow line in the bottom panel of Fig. 10. Its spectra are poorly fitted by ASPCAP, and their wide chemical abundances and atmospheric parameters distributions (see Tables 2 and 3) should not be trusted since they are all flagged with ASPCAP warnings. Sakari et al. (2016) have determined the abundances for 25 of the GCs in DR12 (eight are in this class) and we refer to their work as a better source of chemical abundances for these objects. This class also has 62 stars in embedded clusters, two member candidates of the GC Palomar 1, six bulge giants, and many metal-poor RGB stars. The class has 562 spectra, of which 93 per cent are flagged with star warnings, so the ASPCAP values cannot be trusted.

4.8.5 M 31 GCs/high persistence – Class 31

Class 31 also has some spectra in the region of M 31 (84 out of 613), of which 20 are background spectra and 64 are duplicated spectra of 32 clusters. In this class the spectra seem to be less affected by continuum absorption. As shown by the light blue circles in Fig. 21, this class has a peculiar spatial distribution, being more dense in 60° ≤ l ≤ 90° and 0° ≤ b ≤ 45°. Further investigation is needed to determine why the stars in that direction have these characteristics. In this class there are 88 calibration cluster members and 38 spectra that overlap with the Kepler mission sample. Comparing the position of the stars of this class in Fig. 21 with Fig. 2 in Zasowski et al. (2013) one sees that the positions of these objects match the locus of observation targets of the halo population, the Kepler mission, and some of the calibration clusters. Thirty-five per cent (170) of the spectra in this class are flagged with a warning.

Thirty-one per cent of the stars in this class are flagged as high persistence observations. Persistence refers to the latent image of a previous exposure appearing in subsequent images, due to a slow release of an appreciable fraction of accumulated charge in the previous exposure over the subsequent ones. It affects the bluest chip particularly (Nidever et al. 2015). The intensity of the persistence effect depends on the brightness of the spectra and their history of previous observations. In DR12, a flag is used to indicate the relevance of the persistence effect on each spectrum (Holtzman et al. 2015). Some of the spectra affected by persistence present an obvious excess/deficit of flux in the blue chip. This behaviour is flagged as a positive/negative jump in blue chip.

4.9 Group 8 and outliers

Ninety-nine per cent of the stars in APOGEE are in the classes presented in Sects. 4.1 to 4.8. We briefly discuss the remaining 1 per cent. In addition, we also investigate the outliers of the main classes, that is, those spectra in classes from 0 to 31 for which the distance to the class mean spectrum is larger than 3-σ. Figure 22 shows the number of spectra in the classes of group 8. Figure 23 shows the spatial distribution of these classes. In this figure the classes are represented by different symbols and colours. Figure 24 shows the spectra in classes from 32 to 38 in the same wavelength window as Fig. 10; we plot the spectra as semi-transparent black lines to highlight the locations where the spectra are more similar to each other. In Fig. 24 the mean spectrum of each class is drawn as a white dashed line.
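The outlier selection within a class can be sketched as below. The text does not specify how σ is computed, so we assume here that it is the rms distance of the class members to the class mean spectrum; this definition, the function names, and the toy data are assumptions of the example.

```python
from math import sqrt


def distance(spectrum, centroid):
    """Euclidean distance between a spectrum and the class mean spectrum."""
    return sqrt(sum((f - m) ** 2 for f, m in zip(spectrum, centroid)))


def flag_outliers(spectra, centroid, n_sigma=3.0):
    """Flag spectra whose distance to the class mean exceeds n_sigma * sigma."""
    distances = [distance(s, centroid) for s in spectra]
    # Assumed definition of sigma: rms distance of the members to the class mean.
    sigma = sqrt(sum(d ** 2 for d in distances) / len(distances))
    return [d > n_sigma * sigma for d in distances]


# Toy class (hypothetical): 20 spectra close to the mean and one very deviant.
centroid = [1.0, 1.0]
spectra = [[1.0, 0.9]] * 20 + [[1.0, 0.0]]
flags = flag_outliers(spectra, centroid)
```

With this definition only the last toy spectrum is flagged.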

Fig. 22

Number of objects in outlier classes.

Fig. 23

Galactic coordinates distribution of targets in group 8.

Fig. 24

Spectra of the objects in classes from 32 to 38. Each spectrum is plotted as a semi-transparent line, so that the darkest regions represent the densest regions in this flux window. The wavelength coverage here is the same as in Fig. 10.

4.9.1 Bulge giants – Class 32

This class has 269 spectra, of which 71 are of supergiant stars in the bulge, 33 are of bulge giants, and 44 are spectra in the region of M 31 (20 of the background and 24 of 12 duplicated GCs). Forty-one per cent of the spectra in this class are flagged as having a negative jump in blue chip, 19 per cent of them as having high persistence, and 99 per cent of them are flagged as star bad, a flag assigned if there is a warning about any of the following issues: T_eff, log g, model fitting χ², rotation, S/N, or a difference between photometric and spectroscopic temperature greater than 500 K.

4.9.2 M 31 GCs/high persistence – Class 33

Class 33 has 116 spectra in the region of M 31: 18 background spectra and 98 spectra from 49 GCs. There are 39 spectra flagged as emission line stars in DR12, eight of which are in this class. The second panel from the top in Fig. 24 shows all 232 spectra of the class overlapped. Emission lines are not visible in this figure because all spectra were truncated at 1.02 of the normalised flux. In spite of this constraint, the algorithm is able to identify emission lines since they affect the form of the continuum around them. Ninety-five per cent of the 232 spectra in this class are flagged as star bad, 56 per cent are flagged with the rotation warning, and 33 per cent of these spectra are flagged as high persistence spectra. In the same panel we see there is no clear resemblance between the spectra in the class. The pixels with lower dispersion seem to be dominated by emission lines, suggesting the spectra are either actually emission line stars or have some problem with the sky subtraction.

4.9.3 Bad pixels – Class 34

Seventy-six per cent of the 170 stars in this class are flagged as high persistence observations. In Fig. 24, third panel from the top, we see they are mainly giant stars whose spectra have sequences of bad pixels, such as those seen between 16 205 and 16 220 Å.

4.9.4 M 31 GCs/high persistence – Class 35

Class 35 has 88 spectra in the region of M 31: 38 background spectra and 50 spectra from 25 duplicated GCs. These 88 spectra are 70 per cent of the 126 spectra in the class. In this class, 99.2 per cent of the objects are flagged as star bad, and 94 per cent of them have a signal-to-noise ratio lower than 30. As we see in Fig. 24, central panel, all the spectra are very noisy.

4.9.5 1m Telescope – Class 36

There are 817 spectra observed with the 1 m telescope in DR12, and 93 of these are in this class. They correspond to 76 per cent of the 123 spectra in the class. Apart from a few cases, the spectra seem to contain sequences of a few bad pixels like the ones seen in class 34, Fig. 24, but in different regions of the spectrum.

4.9.6 Emission line stars/M 31 GCs – Class 37

This class has 13 emission line stars. There are also 11 spectra in the M 31 region: one spectrum from the background and ten spectra of five GCs. Six objects are identified by SIMBAD as galaxies.

4.9.7 Negative flux – Class 38

This class has 36 spectra, of which eight are embedded cluster members, four are Sagittarius dwarf galaxy members, and one is an integrated spectrum of the Pal1 GC. Eighty-three per cent of the spectra in the class have pixels with negative counts.

4.9.8 Classes from 39 to 49

Except for class 42, all classes here have extreme negative flux values in some pixels. These negative counts imply high Euclidean distances between these spectra and those restricted to positive fluxes. Therefore they are segregated within these classes. Here we give a brief description of these objects:

Class 39: Three noisy spectra, one of them flagged as an embedded cluster member;
Class 40: Two duplicated spectra of a globular cluster in M 31 and one spectrum of the background in the M 31 region;
Class 42: Two stars with a very similar pattern of sequences of pixels with flux equal to zero;
Class 43: One spectrum of the Pal1 globular cluster. This spectrum has deep asymmetric lines;
Class 44: One noisy spectrum with negative spikes;
Class 45: One background spectrum in the region of M 31;
Class 46: One stellar spectrum with broad absorption lines;
Class 47: One spectrum with great negative spikes;
Class 48: One spectrum with high persistence and a positive jump in the blue chip;
Class 49: One noisy spectrum with wide absorption lines.
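
The effect of extreme negative pixels on the Euclidean distance can be illustrated with a toy example. The sketch below uses synthetic data, not APOGEE spectra; the array names, the noise level, and the value of the negative spike are assumptions chosen only for illustration.

```python
import numpy as np

# Synthetic stand-ins for normalised spectra (fluxes close to 1).
rng = np.random.default_rng(0)
n_pix = 200
centroid = np.ones(n_pix)                      # hypothetical class mean
normal = 1.0 + 0.02 * rng.standard_normal(n_pix)

# Same spectrum with a single extreme negative pixel (assumed value).
spiky = normal.copy()
spiky[50] = -40.0

# Euclidean distances to the hypothetical class mean.
d_normal = np.linalg.norm(normal - centroid)
d_spiky = np.linalg.norm(spiky - centroid)

print(f"distance without spike: {d_normal:.2f}")
print(f"distance with spike:    {d_spiky:.2f}")
```

A single pixel of this kind dominates the distance, so such a spectrum lies far from every class built from positive fluxes and tends to end up in a class of its own.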

4.9.9 Outliers in classes

For the first 32 classes, we define the outliers as the spectra whose distance lies outside the 3σ interval around the mean spectrum of their class. They correspond, on average, to 1.7 ± 0.6 per cent of the objects in the classes. Exploring their target flags, we notice phenomena such as high persistence, a positive jump in blue chip, emission lines, sequences of bad pixels, and many stars with signal-to-noise ratio below 70.
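
A selection of this kind can be sketched as follows. The text does not spell out how σ is computed, so the sketch assumes that σ is the root-mean-square distance of the class members to their mean spectrum; the function name and the synthetic data are likewise hypothetical.

```python
import numpy as np

def flag_outliers(spectra, n_sigma=3.0):
    """Flag spectra farther than n_sigma * sigma from the class mean.

    Assumption: sigma is the RMS Euclidean distance of the members
    to the class mean spectrum.
    """
    centroid = spectra.mean(axis=0)
    dist = np.linalg.norm(spectra - centroid, axis=1)
    sigma = np.sqrt(np.mean(dist ** 2))
    return dist > n_sigma * sigma, dist, sigma

# Synthetic class: 200 similar spectra plus one deviant spectrum.
rng = np.random.default_rng(0)
spectra = 1.0 + 0.01 * rng.standard_normal((200, 50))
odd = np.ones(50)
odd[:10] = 2.0                     # strong deviation in ten pixels
spectra = np.vstack([spectra, odd])

is_outlier, dist, sigma = flag_outliers(spectra)
print("number of flagged spectra:", int(is_outlier.sum()))
```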

5 Summary and conclusion

5.1 Main results

We performed an automated unsupervised classification of 153 847 APOGEE spectra included in SDSS DR12, using K-means. We classified the spectra into 50 classes, which were afterwards sorted manually into nine major groups. By construction, each class collects spectra that are very similar. The resulting classes and groups are interpreted using the physical parameters inferred by the ASPCAP. We found that classes were divided mainly according to their T_eff, log g and [M/H], and less strongly by other characteristics, such as elemental abundances or the quality of the spectra. Groups from 0 to 7 include 32 classes containing 99.3 per cent of the spectra in DR12. The identified groups can be described as follows:

Group 0: Includes five classes dominated by red clump (RC) stars and the warmest end of the red giant branch (RGB) with different chemical abundances;
Group 1: Composed of six classes with stars from the RGB, cooler than those in group 0, and mainly separated from each other by their chemical abundances;
Group 2: Made up of three classes mainly populated by warm dwarfs, warm subgiant stars, and some A- and B-type stars used for telluric correction;
Group 3: Composed of two classes with fast rotating stars. Due to the strong line broadening, they are among the most poorly-fitted spectra in the survey;
Group 4: Has two classes covering almost the same range of T_eff and log g as group 1, RGB stars, but with higher metallicities;
Group 5: Contains three classes formed by stars from the RC and the warm end of RGB, with stellar populations from both the thin and thick disk;
Group 6: Formed of five classes composed of dwarf stars over a wide range of temperatures;
Group 7: Includes five classes with peculiar stars;
Group 8: Collects 18 classes with all the outliers of the classification, less than 1 per cent of the spectra in SDSS DR12.

5.2 Uses of the classification

As with any classification, this work can be used to provide an overview of the APOGEE DR12 data set, which simplifies the visualisation and highlights some features of the survey. For example, we can easily see that class 3, composed of very warm stars with almost featureless spectra, has an unexpectedly well-behaved distribution of values for [C/M], [N/M], [α/M], [Mn/H] and [Na/H]. It also easily identifies strange behaviours such as the bimodality in [K/H] for class 15, the gaps in metallicity found in class 11, and the similarity in parameters of stars with very different spectra, as is the case for classes 20 and 27.

We provide extensive supplementary material in order to encourage the search for features that may be interesting for specific purposes. For example, the catalogue provides a set of standard spectral templates that could be applied in stellar population synthesis for galaxies. The mean spectra (centroids) of the classes are arguably more reliable templates than the traditional synthetic models of standard MK type stars. However, the application should be restricted to those classes with a high number of members and low internal dispersion. Moreover, calibration of the atmospheric parameters and abundances is required, since the ones presented here are based on uncalibrated parameters.

The centroids of the classes are also useful to find substantial differences between the spectra and their best fit model found by ASPCAP. Since the classes are a collection of very similar spectra, the comparison between the class’ mean and the mean of their best fit model can underline systematic differences between spectra and models. This comparison will be implemented soon and made available in a future publication.

Some classes have a different spatial distribution without an obvious reason. For example, classes in group 2 differ in their spatial distribution, something unexpected since the main difference among them is the T_eff of their member stars. Class 31 has an especially peculiar distribution, occupying mainly the region with 60° ≤ l ≤ 90° and 0° ≤ b ≤ 45°. The reason is unclear. Further investigations must be carried out to find out the cause of this spatial segregation. Other spatial distributions are less surprising. For example, classes in group 4 are concentrated in the disk. This is to be expected, since their metallicity and [α/M] distributions match those expected for red giants that are part of the thin disk population. Classes 24 and 28, formed by metal-poor stars with high α-element abundances, corresponding to the halo population, are expected to be out of the galactic disk, as we found. Class 21 can be interpreted as the population of the bulge, with high α-element abundances and high metallicity, and is also expected to have a preferential spatial distribution like the one observed. These are the most evident examples of spatial segregation, but others can be found among the classes.

Finally, the extensive supplementary material we provide can be used to explore deeper aspects of DR12 APOGEE. We encourage the use of Tables B.1 and B.2 for the reader to explore the results of the classification.

5.3 Additional issues

In this work we face the problem of determining the optimal number of clusters for the K-means classification. In our case, none of the standard criteria provided a reliable answer. That is probably a consequence of the continuous nature of the data set. In general, there are no sharp changes in the spectral properties of the stars. Indexes like CH and KL are mathematically proven to work in data sets with well separated clusters, but perform poorly in overlapping clusters or continuous distributions. In this case, K-means provides a way of artificially dividing a continuous space into meaningful slices, maximising the similarity among objects in the same class. Thus, the number of classes can be tuned according to the degree of within-class compactness we are interested in, as shown in Sect. 3.1.
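
The trade-off between the number of classes and the within-class compactness can be illustrated with a minimal sketch of the standard K-means (Lloyd) iterations applied to a continuous toy distribution. This is not the implementation used in this work; the function name, the uniform synthetic data, and the values of k are assumptions made for illustration.

```python
import numpy as np

def kmeans(data, k, seed=0, n_iter=100):
    """Plain Lloyd iterations. Returns centroids, labels and the RMS
    distance of the points to their assigned centroid."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids with k distinct random data points.
    centroids = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distance of every point to every centroid: shape (n, k).
        d2 = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        updated = centroids.copy()
        for j in range(k):
            members = data[labels == j]
            if len(members):       # keep the old centroid if a class empties
                updated[j] = members.mean(axis=0)
        if np.allclose(updated, centroids):
            break
        centroids = updated
    # Final assignment, consistent with the returned centroids.
    d2 = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    rms = np.sqrt(d2.min(axis=1).mean())
    return centroids, labels, rms

# Continuous toy data set with no natural groups (uniform cube).
rng = np.random.default_rng(1)
data = rng.uniform(size=(600, 3))

compactness = {k: kmeans(data, k)[2] for k in (2, 5, 10, 20)}
for k, rms in compactness.items():
    print(f"k = {k:2d}   RMS distance to centroid = {rms:.3f}")
```

In a distribution like this one there are no natural groups for a criterion to detect, so the choice of k is set by the compactness one is willing to accept.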

Another consequence of applying K-means to a continuous data set is a significant degree of confusion observed between classes sharing borders in the space T_eff − log g − [M/H]. However, these issues are not restricted to K-means. Any analysis tool, independently of whether it is supervised or not, will face the intrinsic degeneracy of these quantities in the stellar spectra. Soft clustering algorithms such as fuzzy K-means or density-based algorithms such as Gaussian mixture models or DBSCAN could provide a more natural way to deal with this kind of problem, but would not solve the overlap of the classes in the space of parameters. We have shown how the random seed used by the algorithm affects its solution. Although there is no unique solution, the variations are negligible compared to the internal dispersion of the classes. In addition, we show how the centroids of the classes are much closer to the spectra in the class than their corresponding best fit models. This suggests that K-means can be used to identify the systematic deficiencies of the modelling adopted in the determination of physical parameters and abundances with ASPCAP, and improve the agreement with the data.
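
To illustrate what a soft assignment looks like for objects near a class border, the sketch below computes fuzzy memberships for fixed centroids using the standard fuzzy c-means weighting. It is not part of the analysis presented here; the function name, the fuzziness exponent m = 2, and the toy centroids are assumptions.

```python
import numpy as np

def fuzzy_memberships(points, centroids, m=2.0):
    """Fuzzy c-means memberships for fixed centroids.

    u[i, j] is proportional to d[i, j] ** (-2 / (m - 1)),
    normalised so that each row sums to one.
    """
    d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)       # avoid division by zero at a centroid
    w = d ** (-2.0 / (m - 1.0))
    return w / w.sum(axis=1, keepdims=True)

# Two hypothetical class centres and three test points:
# one on a centre, one on the border, one in between.
centroids = np.array([[0.0, 0.0], [2.0, 0.0]])
points = np.array([[0.0, 0.0], [1.0, 0.0], [1.5, 0.0]])

u = fuzzy_memberships(points, centroids)
print(np.round(u, 3))
```

A point on the border receives equal membership in both classes, rather than being forced into one of them as in the hard assignment of K-means.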

Although the within-class dispersions in the parameter space are larger than the typical uncertainties derived from this kind of data, K-means provides good insight into the general characteristics of the spectra in the data set. In this sense, K-means is not the optimal algorithm to be used for parameter determination, but can be useful in an early analysis of the data, helping to design solutions and map the general behaviour of the data set.

K-means essentially performs hyperspherical cuts in the N-dimensional space. Future works in unsupervised spectral classification should address the issues presented in this section and search for algorithms that can more generically divide the space taking into account its density distribution. Also a soft clustering approach can arguably produce a more reliable classification. However, more complex algorithms are also more computationally expensive, therefore any further application has to address the scalability problem.

5.4 Conclusions

As exemplified in this work, K-means provides an easy way to divide complex problems into smaller pieces, which are simpler to solve. The version of ASPCAP used in DR12 was designed to work optimally on K and early-M giant stars. For dwarfs, warmer (T_eff > 6000 K), cooler (T_eff < 3800 K), or metal-poor stars ([M/H] < −1), the results are less accurate. Prior to a model-atmospheres spectral analysis, K-means can provide guidance on the most natural groups in the data set. This can be very useful to design a pipeline that treats the distinct groups of objects differently, which is necessary for groups such as 2, 6, 7, and 8, for example.

Wolpert (1996) puts forward what is known as the “no free lunch” theorem for machine learning. That is to say, there is no best machine learning algorithm; it is always a matter of which one is better suited to the specific features of a given problem. Knowing the problem, we can only presume which kind of algorithm is most suitable for solving it, but finding the best solution always requires testing some algorithms and tuning their parameters. This work adds to previous applications of K-means (Sánchez Almeida et al. 2009, 2010; Morales-Luis et al. 2011; Sánchez Almeida & Allende Prieto 2013; Sánchez Almeida et al. 2016), consolidating a guideline for the use of this algorithm in the analysis of spectroscopic data, and providing a new perspective on the APOGEE data.

In this work we made a serious effort to organise the spectra into classes and groups according to the similarity within their spectra. This classification is completely independent of any atmospheric and spectroscopic model. It provides a useful way to explore the data in APOGEE, since it allows a quick identification of the main different types of objects in the survey.

Acknowledgements

We thank Diogo Souto and Dante Minniti for their help and comments. We thank the anonymous referee for useful and precise comments that helped improve the readability of the paper. We acknowledge financial support through grants AYA2014-56359-P, AYA2017-86389-P, and AYA2016-79724-C4-2-P (MINECO/FEDER). The research that led to this article was partially funded by the Brazilian National Research Council (CNPq) through a scholarship of the CSF program. CAP is thankful to the Spanish Government for funding for his research through grant AYA2014-56359-P. Funding for SDSS-III has been provided by the Alfred P. Sloan Foundation, the Participating Institutions, the National Science Foundation, and the U.S. Department of Energy Office of Science. The SDSS-III web site is http://www.sdss3.org/. SDSS-III is managed by the Astrophysical Research Consortium for the Participating Institutions of the SDSS-III Collaboration including the University of Arizona, the Brazilian Participation Group, Brookhaven National Laboratory, Carnegie Mellon University, University of Florida, the French Participation Group, the German Participation Group, Harvard University, the Instituto de Astrofisica de Canarias, the Michigan State/Notre Dame/JINA Participation Group, Johns Hopkins University, Lawrence Berkeley National Laboratory, Max Planck Institute for Astrophysics, Max Planck Institute for Extraterrestrial Physics, New Mexico State University, New York University, Ohio State University, Pennsylvania State University, University of Portsmouth, Princeton University, the Spanish Participation Group, University of Tokyo, University of Utah, Vanderbilt University, University of Virginia, University of Washington, and Yale University.

Appendix A: Hint to repeatability index interpretation

We define the centroid of class i as $\begin{equation*} \vec{\mu}_i = \frac{1}{n_i} \sum_{\iota \in \omega_i} \vec{x}_{\iota}\,, \end{equation*}$ (A.1)

where ω_i is the set of spectra x_ι assigned to class i, and n_i is the number of spectra in the class. So the mean difference between the classes in a particular classification c compared with the chosen classification is given by $\begin{equation*} \vec{\sigma}_{c} = \sqrt{ \frac{ \sum_{i=0}^{49} ||\vec{\mu}_{i,c} - \vec{\mu}_{i,\textrm{chosen}}||^2 } {50} }\,. \end{equation*}$ (A.2)

Therefore, when we refer to mean difference between the matching classes over the 100 classifications we mean $\begin{equation*} \langle \vec{\sigma}_{\textrm{compare}} \rangle = \frac{1}{100}\sum_{c=0}^{99}\vec{\sigma}_{c}\,. \end{equation*}$ (A.3)

This is the mean pixel by pixel difference between the 99 classifications as compared with the chosen one. This vector can be compared with the mean within-cluster dispersion of the chosen classification, $\begin{equation*} \langle \vec{\sigma}_{\textrm{within}} \rangle = \frac{1}{50}\sum_{i=0}^{49} \sqrt{\frac{\sum_{\iota \in \omega_i}||\vec{x}_{\iota} -\vec{\mu}_i||^2}{n_i}}\,, \end{equation*}$ (A.4)

giving the mean ratio between these quantities over the 4838 pixels of the spectra: $\begin{equation*} \langle \sigma_{\textrm{ratio}} \rangle = \frac{1}{4838} \sum_{j=0}^{4837} \frac{\sigma_{j, \textrm{compare}}}{\sigma_{j, \textrm{within}}} \approx 0.064. \end{equation*}$ (A.5)

The standard deviation of 3.3% is given by the standard deviation of σ_{j,compare}/σ_{j,within} over the 4838 pixels.
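
The quantities in Eqs. (A.2) to (A.5) can be sketched numerically as follows. The sketch assumes that the norms are evaluated pixel by pixel, so that the σ quantities are vectors over pixels, and that the classes of each run have already been matched to those of the chosen classification. The function name, the array layout, and the toy data are hypothetical.

```python
import numpy as np

def repeatability_ratio(centroids_runs, centroids_chosen, spectra, labels):
    """Mean and standard deviation over pixels of sigma_compare / sigma_within.

    centroids_runs   : (n_runs, n_classes, n_pix), classes already matched
    centroids_chosen : (n_classes, n_pix)
    spectra          : (n_spec, n_pix)
    labels           : (n_spec,), class of each spectrum in the chosen run
    Every class is assumed to have at least one member.
    """
    # Eq. (A.2): per-pixel RMS difference over classes, one vector per run.
    sigma_c = np.sqrt(((centroids_runs - centroids_chosen) ** 2).mean(axis=1))
    # Eq. (A.3): average over runs.
    sigma_compare = sigma_c.mean(axis=0)
    # Eq. (A.4): per-pixel within-class dispersion, averaged over classes.
    n_classes = centroids_chosen.shape[0]
    sigma_within = np.zeros_like(sigma_compare)
    for i in range(n_classes):
        members = spectra[labels == i]
        sigma_within += np.sqrt(((members - centroids_chosen[i]) ** 2).mean(axis=0))
    sigma_within /= n_classes
    # Eq. (A.5): mean ratio over pixels, plus its standard deviation.
    ratio = sigma_compare / sigma_within
    return ratio.mean(), ratio.std()

# Toy input: 3 classes, 4 pixels, 4 runs offset by 0.01 from the chosen
# centroids, and members scattered by +/- 0.1 around their centroid.
chosen = np.array([[1.0, 0.9, 0.8, 1.0],
                   [0.7, 0.7, 0.9, 1.0],
                   [0.5, 0.6, 0.7, 0.8]])
runs = np.stack([chosen + 0.01] * 4)
spectra = np.concatenate([chosen + 0.1, chosen - 0.1])
labels = np.array([0, 1, 2, 0, 1, 2])

mean_ratio, std_ratio = repeatability_ratio(runs, chosen, spectra, labels)
print(f"mean ratio = {mean_ratio:.3f}, standard deviation = {std_ratio:.3f}")
```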

Appendix B: Classes summary and supplementary material

In Table B.1 we present a summary of the 32 classes containing ≈99 per cent of the spectra in the data set. In this table the first column is the group and the second column is a hyper-link for the appendix supplementary figures for each class. The third column gives the main stellar type found in each class. This information was inferred based only on the range of atmospheric parameters covered by each class (Payne 1925) and should be taken just as an idea of what kind of object is dominant in each class. The fourth column gives information about the main spatial distribution of each class. It is also a simple approximation based on their distribution of Galactic coordinates and [α/M] – [M/H] (see Bensby et al. 2003, 2007). Finally, the fifth column presents some extra comments about the main features of the class.

Table B.1

Summary of the classes and complementary material.

The complete information about the classification is also available at the CDS in the form of three tables: Table B.2 presents the classification for each spectrum, with APOGEE ID and class; Table B.3 gives the mean spectrum for each class, in the form of normalised fluxes and wavelengths; and Table B.4 contains the spectral within-class standard deviation for each class, with normalised fluxes and wavelengths. In both Tables B.3 and B.4 the last column gives the mask applied to the spectra: a binary index, where zero means the wavelength was not considered during classification and one means it was included in the classification procedure.
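
As a sketch of how the mean spectra and the mask could be combined, the example below assigns a spectrum to its nearest class using only the pixels with mask equal to one. It does not read the actual CDS tables; the arrays are small synthetic stand-ins, and the function and variable names are hypothetical.

```python
import numpy as np

def nearest_class(spectrum, centroids, mask):
    """Index of the nearest class mean, using only pixels with mask == 1."""
    use = mask.astype(bool)
    dist = np.linalg.norm(centroids[:, use] - spectrum[use], axis=1)
    return int(dist.argmin()), dist

# Synthetic stand-ins for three class mean spectra and the binary mask.
centroids = np.array([[1.0, 1.0, 1.0, 1.0, 1.0],
                      [0.9, 0.8, 1.0, 0.8, 0.9],
                      [0.5, 0.5, 0.5, 0.5, 0.5]])
mask = np.array([1, 1, 0, 1, 1])

# A spectrum identical to class 1 except in the masked-out pixel.
spectrum = centroids[1].copy()
spectrum[2] = -100.0

best, dist = nearest_class(spectrum, centroids, mask)
print("nearest class:", best)
```

Because the deviant pixel is excluded by the mask, it does not influence the assignment.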

Table B.2

Spectral classification.

Table B.3

Mean spectra of the 50 classes.

Table B.4

Within-class spectral standard deviation for the 50 classes.

Appendix C: Appendix images

Fig. C.1

Panel A: 2D histogram in the T_eff – log g plane; panel B: galactic coordinates distribution; panel C: 2D histogram of the class in the T_eff – [M/H] plane; panel D: 2D histogram in the [M/H] – [α/M] plane; panel E: parallel plot for all the atmospheric parameters and individual chemical elements available in DR12 for all classes in each group, depicting the main class as a solid line and the other classes as dashed lines. The colours used here are the same as those used in Fig. 7; and panel F: comparison of the mean spectrum of the class (blue line) with the mean of the best fit models of the spectra in the class (red line), with the Arcturus spectrum (grey line) shown for comparison. The second figure in the file is a corner plot for T_eff, log g, [M/H], [C/H], [N/H] and [α/M]. The figure contains 15 panels comparing these quantities with each other in 2D histograms. Four contours mark the levels enclosing 15, 30, 45, and 68.3 per cent of the points in each class. The top panel in each column gives the histogram comparing the class parameter distribution (using the same colours used in Fig. 7) with the distribution in its corresponding group (grey bars); the median values and the region enclosing 68.3 per cent of the points around the mean are marked by vertical lines and the values are shown above the top panels.