Asteroid Taxonomy from Cluster Analysis of Spectrometry and Albedo

The classification of the minor bodies of the Solar System based on observables has been continuously developed and iterated over the past 40 years. While prior iterations followed either the availability of large observational campaigns or new instrumental capabilities opening new observational dimensions, we see the opportunity to improve primarily upon the established methodology. We developed an iteration of the asteroid taxonomy which allows the classification of partial and complete observations (i.e. visible, near-infrared, and visible-near-infrared spectrometry) and which reintroduces the visual albedo into the classification observables. The resulting class assignments are given probabilistically, enabling the uncertainty of a classification to be quantified. We built the taxonomy based on 2983 observations of 2125 individual asteroids, representing an almost tenfold increase of sample size compared with the previous taxonomy. The asteroid classes are identified in a lower-dimensional representation of the observations using a mixture of common factor analysers model. We identify 17 classes split into the three complexes C, M, and S, including the new Z-class for extremely-red objects in the main belt. The visual albedo information resolves the spectral degeneracy of the X-complex and establishes the P-class as part of the C-complex. We present a classification tool which computes probabilistic class assignments within this taxonomic scheme from asteroid observations. The taxonomic classifications of 6038 observations of 4526 individual asteroids are published. The ability to classify partial observations and the reintroduction of the visual albedo provide a taxonomy which is well suited for the current and future datasets of asteroid observations, in particular provided by the Gaia, MITHNEOS, NEO Surveyor, and SPHEREx surveys.


Introduction
The minor planets of the Solar System exhibit a wide range of surface compositions as outcomes of their diverse formation histories. Mineralogical insights into the main asteroid belt gained from observing the bodies' exteriors serve to constrain the dynamic evolution scenarios of our planetary environment (Morbidelli et al. 2015), to establish relationships in asteroid families , and to identify the parent bodies of the members of the meteorite collection (Burbine et al. 2002;Granvik and Brown 2018). The conclusion of a static Solar System formation history (Gradie and Tedesco 1982) has since been discarded in favour of a dynamical version Morbidelli et al. 2005;Tsiganis et al. 2005) following the increasing resolution of the compositional distribution of asteroids in the main belt and in near-Earth orbits thanks to a growing number of minor bodies characterised by dedicated observational efforts (e.g. Bus and Binzel 2002a;Devogèle et al. 2019;Xu et al. 1995). Today, the majority of the mass in the main belt is thought to have been dynamically implanted during a later evolutionary stage of the Solar System (DeMeo and Carry 2014; Gradie and Tedesco 1982), including some of the largest members of the main belt (Bottke et al. 2006;De Sanctis et al. 2015;Vernazza et al. 2021;Vokrouhlický et al. 2016). Evidence of a dichotomous meteorite population further strengthens this interpretation of a large compositional variability among minor bodies as result of early-stage formation processes in the Solar System (Warren 2011).
To describe the compositional distribution, a classification scheme based on the observable features of asteroids is required. A common device used in the interpretation of observations is asteroid taxonomy. Taxonomic classification refers to the grouping of objects with shared characteristics (Candolle 1813). For asteroids, these characteristics are the observable surface properties, such as the absorption bands imprinted into their reflectance spectra or the surface albedos. The implicit assumption is that the observables are related to the minor planets' surface mineralogy (Gaffey and McCord 1979), though this is not a prerequisite for a practical taxonomy.
Schemes for the compositional classification of minor planets have been devised and iterated regularly since the 1970s (e.g. Bowell et al. 1978;Chapman et al. 1971;McCord and Chapman 1975). The initial division into carbonaceous C-types and silicaceous S-types was readily apparent in different observables, even for a small number of observed objects and limited observational detail. However, with an increasing number of smaller objects observed, the underlying continuum distribution between these complexes has been revealed (Bus and Binzel 2002b).
The most commonly used taxonomies for minor bodies are the Tholen system (Tholen 1984) and the Bus-DeMeo system (Bus and Binzel 2002b;DeMeo et al. 2009). While the latter offers a feature-based classification which encompasses a wide range of the variability observed in spectral observations and has been adapted to visible and near-infrared (NIR) photometric observations (Carvano et al. 2010;De-arXiv:2203.11229v2 [astro-ph.EP]

Sep 2022
Asteroid Taxonomy from Cluster Analysis of Spectrometry and Albedo 2 Meo and Carry 2013; Popescu et al. 2018), the former has not been fully replaced, in part due to two advantages of the used asteroid observables: the visual albedo p V and spectrophotometric observations down to ultraviolet (UV) wavelengths. Both features increase in particular the resolution of classes which only show faint features in the visible and NIR wavelength regimes.
In this work we aim to methodologically improve upon the existing taxonomic schemes for minor bodies with regard to three aspects. First, we introduce a method which enables the classification of complete and partial observations. This offers consistent class definitions across the visible-nearinfrared (VisNIR) region. Second, the visual albedo p V is reintroduced into the taxonomy observables, solving the degeneracy of the X-complex as a primary consequence. Third, asteroids are classified in a probabilistic model, yielding a vector of class probabilities rather than a definite class assignment, which enables taxonomic outliers and transitional populations to be identified.
In addition to the methodological advancement, we further aim to align the scheme of taxonomic classes with advancements in the understanding of asteroid surface compositions acquired over the last decade. Studies such as , Vernazza et al. (2014Vernazza et al. ( , 2015, and Shepard et al. (2015) have combined observational evidence for several asteroid and meteorite connections which show that the classes in the current schemes do not reflect mineralogical groups. While this is acceptable a priori as taxonomies are built on spectroscopic data alone, by taking into account the multi-observable studies we believe that a correction is acceptable and necessary.
In Sect. 2 we outline the collection of the observational data used in this study, as well as the methodological advancement of the clustering strategy with respect to previous taxonomies. In Sect. 3 we outline the clustering results and the strategy of identifying compositional classes. These classes are then discussed in detail in Sect. 4. In Sect. 5 we investigate degeneracies between the classes in this taxonomy and compare the classifications of asteroids in this study to those in the literature. The classy tool to classify asteroid observations in the framework of this taxonomy is presented. Finally, we draw conclusions and give an outlook in Sect. 6.

Method
In this section we describe the compilation and preprocessing of the asteroid spectra and albedos for the cluster analysis, an overview of which is provided in Fig. 1. It is followed by a description of the issues that arise when working with partial observations (i.e. missing data). After motivating the split of the dataset into clustering and classification data, the section concludes with a description of our approach to the dimensionality reduction and clustering problem at hand. Figure 1: Overview of preprocessing the input observations. The preprocessing steps encompassed in the dashed rectangle can be performed using the classy python package described in Sect. 5.

Selecting the observables
The selection of asteroid observables to be included in a taxonomical system is a crucial decision in its design. A broad set of observables ensures its applicability to a large number of asteroids and high compositional resolution; however, it complicates the derivation of the classification scheme and limits the number of available observations as only the intersection in terms of observed asteroids can be considered when combining different datasets. 1 This first led Tholen (1984) to apply the albedo only in a secondary classification step before the observable was completely dropped by Bus and Binzel (2002b).
One of our main goals for this iteration of the taxonomy is the possibility to classify partial observations; we are a priori accepting gaps in the input data, and are thus not limiting the sample size when combining datasets and can use the union rather than the intersection of observations. Nevertheless, while we first considered a classification system based on spectrometric and photometric observations, and on visual albedos and phase curve coefficients, we found that including photometric observations and phase curve coefficients did not add to the compositional resolution of the resulting scheme as they are effectively low-resolution versions of the former (DeMeo and Carry 2013; Mahlke et al. 2021;Shevchenko et al. 2016). Therefore, we chose to build the taxonomy from VisNIR spectra and visual albedos.

Spectra
Spectrometric observations are the most compositionally informative asteroid features accessible via remote sensing.
In preparing this work we focused both on building a large repository of asteroid spectra and on curating the data. In total, we acquired over 7500 spectra from online repositories, archived publications, and directly from the observers. The majority of spectra are unpublished spectra from the Small Main-belt Asteroid Spectroscopic Survey (SMASS) (Xu et al. 1995) and MIT-Hawaii Near-Earth Object Spectroscopic Survey (MITHNEOS) (Binzel et al. 2019;Marsset et al. 2022) surveys available online. 2 Literature sources of the spectra are given in Table E.3. After several iterations of visual inspection and rejection of low-quality data and duplicated observations, 6038 spectra of 4526 individual asteroids remained. About 50 % of spectra cover the visible wavelength range only, while the sample of VisNIR spectra is about three times as large as in DeMeo et al. (2009) (see Fig. 2 in this paper).
The collection of spectra is heterogeneous in numerous aspects, including but not limited to their wavelength coverage, resolution, and sampling patterns. However, a consistent sampling pattern between all spectra is required for the following numerical analyses. We define this pattern in close resemblance to the one used by DeMeo et al. (2009), though we halve the sampling step size in the visible wavelength range as we find that possible superpositions of absorption features due to mafic minerals around 1 µm are better described by the finer sampling.
totaling 53 wavelengths. In the following cluster analysis, each of these wavelengths represents one data dimension.
Before resampling the spectra, we apply a filter (Savitzky and Golay 1964) to smoothen features present in the spectra (e.g. telluric absorption features). The filter consists of applying least-squares fits of polynomials to a window of adjacent data points. The window size in units of data points and the degree of the polynomial dictate the amount of smoothing that is applied. We set these two parameters for each spectrum separately by visual inspection of the results. The smoothened spectra are then linearly interpolated and resampled to the pattern in Eq. (1). We then transform the spectra using the natural logarithm, which serves to approximate a zero mean and uniform standard deviation of the input spectra as they are generally normalised to unity at either 0.55 µm or 1.25 µm. This standardisation transform is generally beneficial to clustering and dimensionality reduction methods (Bouveyron et al. 2019).
The inclusion of missing data in the analysis poses a new challenge when it comes to normalising the spectral data. The common approach of multiplicatively setting the reflectance to unity at a shared wavelength is not possible as no single wavelength is shared among all spectra, as can

Number of Observations
More than 60% missing data p V Figure 2: Input data shown as a matrix. The columns represent the asteroid observables (i.e. the spectral wavelength bins and the visual albedo p V ) and each row represents one observation. The density of sampled wavelength bins is doubled in the visible compared to the near-infrared region. The cells are white if the corresponding value was not observed. The black cells indicate the samples used in the clustering analysis; the grey cells are samples that are classified but not used to build the taxonomy itself, due to the large degree of missing information in these spectra. 2983 observations are at least 40 % complete and were used to train the clustering model. The matrix is sorted by increasing completeness of the asteroid spectra from top to bottom.
be seen in Fig. 2. Furthermore, this approach would artificially decrease the variance in the wavelength chosen for normalisation and the neighbouring wavelength bins, causing the subsequent clustering analysis to effectively ignore the normalisation region.
Instead, we prepare the spectra in a way which benefits the following analysis most by employing a Gaussian mixture model (GMM). We assume that each spectrum can be written αy, where α ∈ R is a normalisation constant that depends on the considered spectrum, and y ∈ R 53 is a normalised spectrum. Further assuming that y follows a mixture of k log-normal distributions with diagonal covariances, all parameters of the models can be estimated from an incomplete data set via an expectation-maximisation algorithm (Dempster et al. 1977). This allows in particular to estimate the normalisation constants of all spectra, and to finally normalise them. By trial and error, we find that k = 30 mixture components result in a satisfying normalisation. As outlined in Sect. 2.2, the assumption of a normal distribution of the input samples in data space is also made in the clustering analysis.
Finally, we note that DeMeo et al. (2009) removed the slope component of the spectra to decrease the influence of space weathering on the taxonomy and to increase the depth of features present in the data. We cannot do this due to the missing data; however, we consider the presence of spectralweathering effects in the taxonomy a beneficial rather than Asteroid Taxonomy from Cluster Analysis of Spectrometry and Albedo 4 unfavourable aspect, as we further outline in Sect. 4.

Albedo
The visual albedos used in this study were compiled for the IMCCE's Solar system Open Database Network (SsODNet, 3 Berthier et al.,in prep.). The main contributors in this compilation are the Infrared Astronomical Satellite (IRAS) (Tedesco et al. 2002), the Wide-field Infrared Survey Explorer (WISE) , AKARI , and Spitzer . We use the SsOD-Net service to collect 4704 albedo measurements for the 3543 asteroids of which we have spectral observations using the rocks 4 python-interface (Berthier et al., in prep.).
When possible, we make use of several albedo measurements per asteroid when combining the input features (Sect. 2.1.4).
To get the most accurate available value for each asteroid, we first compute the albedo based on the weighted averages of the asteroid's diameter and absolute magnitude provided by SsODNet following Harris and Lagerros (2002). These averages consist of the subjectively best available observations (Berthier et al., in prep.). In a second step we compute the weighted mean of any albedo measurement available in the literature for the given asteroid and use this value as the second available albedo observation in the input data. Finally, the non-aggregated albedo observations are appended as additional available measurements. The literature sources we used to compile available albedo values and recompute updated ones from absolute magnitude and diameter are given in Table E.4 As for the spectra, we aim to have Gaussian distributions in the albedo data. Wright et al. (2016) shows that the distribution of albedos follows a double-Rayleigh distribution with a dark peak and a bright peak. To get a Gaussian distribution, we pass the log 10 -transform of the albedos to the clustering algorithm after limiting all albedo values to the interval [0.01, 1). We note that the albedo represents a single data dimension in the following analysis, compared to the 53 spectral data dimensions.

Merging of data samples
Previous taxonomies were derived based on photometry or spectrometry from a single dataset, for example the Eight Color Asteroid Survey (ECAS) (Zellner et al. 1985) for Tholen (1984) or the SMASS survey for Bus and Binzel (2002b). Individual observations of a single asteroid were combined in these datasets to give the single best-possible observation. In this work we do not combine the observations as they come from numerous different sources; for example, 549 of the 2125 asteroids have more than one observation in the input data.
When merging the asteroid spectra and albedo observations for each asteroid, we aim to create as many complete rows as possible. The array of albedo values is merged with the asteroid's spectral values in order of most complete observations. If there are more albedo observations than spectra, we discard the remaining values. We further set an upper  Figure 3: Distribution of the 2125 individual asteroids used to build the taxonomy over orbital class and absolute magnitude. MB refers to the main belt. The number N of asteroids in the orbital population is given below each orbital class. The bin size of the histograms varies with N .
limit of five spectra for each asteroid, removing any excess ones in order of the fewest observed wavelength bins. Many spectra of a single asteroid may cause artificial clusters or trends in the resulting taxonomy. This reduces the number of available spectra from 6038 to 5906. Figure 2 shows the final matrix of observations, colour-coded to differentiate partial and complete observations.

Clustering versus classification
When clustering the entire input dataset described above with the method outlined below, we note a population of clusters which contains mostly visible-only spectra. The intuitive explanation of these clusters is that when computing dimensionality reduction and projecting the spectra into a lower-dimensional space, they will be distributed over a smaller volume than the VisNIR spectra due to their large degree of missing information. This artificial accumulation of input data in the latent space disturbs the identification of real clusters in the data. We therefore set an upper limit Asteroid Taxonomy from Cluster Analysis of Spectrometry and Albedo 5 of 60 % of missing values per spectrum to be included into the clustering input data, which includes 2983 of the 5906 samples in the input data (see Fig. 2). The remaining 2923 are not used to derive the taxonomy; instead, they are classified in the resulting scheme and used to cross-validate the classification, as outlined in Sect. 5. In Fig. 3 we display the distribution of the 2125 individual asteroids in the data with which we derive the taxonomy in the sample over orbital classes and absolute magnitude.

Data availability
The dataset containing the input data samples, asteroid metadata, and the resulting classifications as outlined in the next sections is available at the Centre de Données astronomiques de Strasbourg (CDS). 5

Dimensionality reduction and clustering
The derivation of a taxonomy falls into the realm of unsupervised machine learning. In the context of minor bodies, the approach consists of two steps: dimensionality reduction followed by clustering. Previous taxonomies have predominantly chosen principal component analysis (PCA) for the former and visual clustering for the latter. Given our goals for this new iteration of the current taxonomy as stated in Sect. 1, we need to evolve the established method, in particular to allow for the classification of partial observations. In the following we outline this method evolution, which arises naturally when challenging the PCA method with the requirements of our input data and the prior knowledge from previous taxonomies. The description of the resulting model is kept concise; the reader is referred to Tipping and Bishop (1999), Baek et al. (2010), and Montanari and Viroli (2010) for detailed explanations, and to Casey et al. (2019) for an example application of the same model but with a different treatment of missing data in the field of stellar physics.

Dimensionality reduction
The necessity of dimensionality reduction derives mainly from the spectrometric observations, where each bin of the sampling pattern represents one of the 54 data dimensions. Clustering in such high-dimensional space is not feasible as any model would be overparametrised given the limited sample size of the input data. Reducing the dimensionality of the observed data space is achieved by building linear combinations of the observed variables, referred to as latent variables, 6 and projecting the input data into the space spanned by the latent variables, referred to as latent space.
We assume that we have N observations of a p-dimensional observable. The input data Y is thus of shape N ×p, denoted Y N ×p . 7 PCA can be described as eigendecomposition of 5 The table of asteroid classifications and the templates of the defined taxonomic classes are available in electronic form at the CDS via anonymous ftp to cdsarc.u-strasbg.fr (130.79.128. 5) or via http://cdsweb.u-strasbg.fr/cgi-bin/qcat?J/A+A/ 665/A26 6 Latent can here be understood as a synonym for hidden or underlying as these variables are not directly observable.
7 In the following we state the shape of the tensors in this the covariance matrix Σ p×p of Y, where W and Λ are the eigenvectors and eigenvalues of Σ (Pearson 1901). Expressing the p-dimensional Σ by the q-eigenvectors corresponding to the largest eigenvalues of Σ, where q < p, leads to dimensionality reduction while retaining the largest possible amount of variance within the data. The lower-dimensional representation Z N ×q of the input data Y is given by the matrix product of Y with the matrix of the subset of eigenvectors W p×q . In the following latent components are denoted W and latent scores Z. The p elements of each latent component are referred to as latent loadings. They are the coefficients of the linear combination of dimensions of the input data. Latent components are constrained to unit L2-norm (i.e. the square root of the sum of the squared latent loadings is one). We note that the latent components and their loadings are determined from the data alone; no a priori information is used to influence the matrix.
PCA does not allow for missing data as it relies on the eigendecomposition of Σ. This limitation is overcome by reformulating PCA as a latent generative variable model. In essence, while computing the latent components and scores from the input data, we are making the assumption that there exists a Gaussian-distributed variable z in the latent space which causes the variance observed in the higherdimensional data (i.e. that the resulting latent scores are normal distributed). A general model of this approach can be expressed as (Tipping and Bishop 1999) where f is a function of the latent scores Z and the latent components W and p×p is a noise matrix independent of Z. The reformulation of PCA in this model framework is referred to as probabilistic PCA (PPCA) (Tipping and Bishop 1999). The model parameters are fit using an expectationmaximisation algorithm (Dempster et al. 1977) and, when the input data is complete, gives the same solution as the conventional PCA.
PPCA assumes that the noise matrix is isotropic (i.e. all data dimensions carry the same noise). This is not necessarily the case for our observations; the uncertainties between visible and NIR spectra may differ from one another and from that of the visual albedo. Factor analysis (FA) is another latent generative variable model analogous to PPCA except that the noise matrix is assumed to be diagonal rather than isotropic (Rubin and Thayer 1982). The noise matrix is referred to as uniqueness as it captures the variance that is unique to each data dimension, effectively decoupling the measurement uncertainties from the data covariance.
In FA, the observations Y are modeled as where µ p×1 is a p-dimensional vector containing the mean values of Y along the feature dimensions; W is the matrix manner when we first introduce them, and drop the notation afterwards.  of the latent components, as above; and is a diagonal Gaussian noise matrix, ∼ N (0, Ψ), where Ψ p×p is diagonal. The latent variables Z follow a normal distribution with zero mean and unit covariance, Z ∼ N (0, I). The model parameters can be determined by a maximum-likelihood approach, even in the case of missing data, under the assumption that the data is missing at random (i.e. its probability of being missing is independent of its value) (Little and Rubin 2019).

Clustering
Using the FA model given in Eq. (4), we assume that the distribution of the latent scores (i.e. the asteroid observations mapped into the latent space) follows a single Gaussian distribution. However, we know a priori from the previous taxonomic efforts that this is not the case; the C-and S-complexes form separate distributions, and endmember classes such as A, K, and V follow separate trends in the latent scores (see Figure 2 in DeMeo et al. 2009). Instead of a single Gaussian, we therefore model the data as a mixture of g Gaussian distributions, an approach referred to as mixture of common factor analysers (MCFA, Baek et al. 2010) in the case where the model components are fit in the same latent space as is the case here. MCFA can be expressed as specialisation of the FA model in Eq. (4) using (Baek et al.

2010)
where i ∈ (1, . . . , g), A is the common subspace of the mixture components (i.e. the matrix of latent components), ξ i is the mean value of the ith mixture components in latent space, and Ω i is its variance. The noise matrix retains its definition as above, meaning that all mixture components share the same noise.
In MCFA, dimensionality reduction and clustering are achieved concurrently during the model training. Starting from an initial set of model parameters as outlined in Sect. 3.1.2, at each training epoch (i.e. the optimisation of the log-likelihood of the model against the entire input dataset), this model searches for the q-dimensional latent space and divides the input samples into g components, which gives the most likely projection of the input data assuming that it follows the mixture of g Gaussian distributions in the reduced space.
The hyperparameters in the model are the number g of clusters and the number q of latent components.

Model implementation and availability
Implementations of the MCFA mixture-model approach are available in the R programming language 8 by Baek et al. (2010) and in the python language 9 by Casey et al. (2019). Nevertheless, we chose to write an alternative implementation in python as the implementation by the latter imputes the missing data via mean imputation before training the model using an expectation-maximisation algorithm. Mean imputation is not appropriate for our dataset as we know that the spectra of different asteroid classes may appear entirely different in terms of absorption features and slope. Inserting the mean column value in each empty cell thus does not represent the missing data well. Instead, we use the tensorflow library (Abadi et al. 2015) to implement a stochastic gradient descent learning strategy which maximises the log-likelihood of the model given the observed data only, which is statistically sound under the missing-atrandom assumption (Little and Rubin 2019), contrarily to using mean imputation. The stochastic gradient descent is of particular interest here as it estimates the model parameters based on batches of the input data, meaning that it can scale easily with an increasing number of observations. This MCFA implementation is independent of the taxon-7 omy itself and may be applied in different studies. The implementation and documentation are available online. 10

Results
In this section we present the results of fitting the MCFA model outlined in Sect. 2.3 to the input dataset described in Sect. 2.1. After depicting the latent space and the structure of the latent dimensions, we explain how the asteroid classes building this taxonomy are derived from the modelled Gaussian clusters. An overview of the clustering steps is given in Fig. 4.

Parameters
We choose to cluster the asteroid observations in q = 4 latent dimensions using g = 50 Gaussian clusters. Both numbers are selected from a wide range of values after assessing the resulting model fits. Larger values retain and describe more variability in the data, and at the same time increase the number of free parameters in the model, hence a trade-off is made in both cases. The model fits obtained with four or five latent factors were comparable in terms of captured variability in the cluster, thus we opted for the smaller number of model parameters.
The large initial number of 50 clusters accounts for the model assumption of Gaussianity in the latent space. We have no reason to expect a Gaussian distribution of the asteroid classes; therefore, we model them as superpositions of one or more Gaussian clusters. The modelled clusters are later joined and mapped to build the asteroid classes using a many-to-many relationship and following a decision tree.

Initialisation and training
The latent loadings and cluster assignments of each observation have to be initialised at the start of the gradient descent algorithm to train the MCFA model. The initialisation dictates the global position in the Hamiltonian which is sampled by the training and thus has a significant impact on the final result.
A practical issue when reducing the dimensionality of asteroid data made up by different observables is the feature weighting. In our case the spectra contribute 53 data dimensions compared to the single dimension of the albedo. The summed variance in the former is much larger than the variance in the latter, resulting in a negligible contribution of the albedo to the latent space computation which does not reflect its actual information content. Tholen (1984) therefore chose not to include the albedo in the dimensionality reduction, using it in a subsequent manual clustering step instead. We employ an alternative strategy outlined below which allows us to account for the albedo while building the latent space.
We initialise the latent loadings using PPCA. This approach has two advantages. First, the latent loadings are set to the axes of largest variance in the data, ensuring a high 10 https://github.com/maxmahlke/mcfa resolution in the latent space, and second, PPCA is variant to feature scaling (i.e. data dimensions are weighted with respect to their variance when computing the dimensionality reduction). An effective way to increase the importance of the albedo information is hence to increase the variance of albedo values by some transformation prior to model training. We achieve this by means of the log 10 transformation described in Sect. 2.1.3, which increases the variance in the albedo dimension by a factor of 6.8. During the gradientdescent model training, we monitor the log-likelihood of the model given the data. As opposed to PPCA, MCFA is invariant to factor scaling, which leads to a decrease in the albedo loadings with each training step. Therefore, we do not train until the model has fully converged, instead stopping the training when a good balance between the weight of the albedo and of the spectra has been achieved. This subjective choice of training epochs is a concession we make to the challenge of combining different observables in the same model.
The latent cluster memberships are initialised by fitting a Gaussian mixture model with 50 components to the principal scores of the PPCA and assigning each sample to its most probable cluster. We train the MCFA model on the 2983 observations of 2125 individual asteroids as outlined in Sect. 2.1.4.

Latent space
During the model training the latent components matrix W is derived based on the covariance of the input observations. Each latent component contains one linear coefficient for each input data dimension (i.e. the latent loading). The absolute value of a loading indicates the degree to which the latent component responds to variance in the corresponding data dimension. Positive loadings lead to an increase in the latent scores z with increasing value in the data dimension, negative values to a decrease in z. The latent scores Z are essentially a vector product of the input data with the latent components. As such, both the spectra and the visual albedo of the observations influence the latent scores Z simultaneously.
The latent components resulting from the model training are depicted in Fig. 5, with the spectral loadings given on the left side and the latent loadings corresponding to the albedo dimension on the right side. We note that they are displayed separately only for visualisation purposes; for the clustering model itself, there is no principal distinction between the latent loadings corresponding to the spectra and that corresponding to the albedo.
The spectral loadings in Fig. 5 resemble different mineralogical features commonly present in asteroid spectra. The first component approximates a positive slope, 11 with an inflection point around 1 µm. Components two and three resemble the spectra of pyroxene minerals due to their bands at 1 µm and 2 µm, though the band minima and depths differ between the components. The strongest distinction between these two components is the visible slope, which is positive for component two and negative for component The latent scores of the asteroid observations are shown in Figs. 6 and 7. The input data depicts a larger variance when projected along the first two components rather than along the last two due to the initialisation of the latent components with PPCA. It is clear that the featureless spectra will show little variance when projected along the pyroxene-and olivine-like axes z 2 , z 3 , and z 4 .
Figures 6 and 7 additionally indicate the mean latent scores of all asteroids assigned to a given asteroid class, designated by the class letter and derived in the following sections.
As an example for the interpretation of latent scores, we point out here that the degeneracy between classes E and S in the latent scores depicted in Fig. 6 is the result of the large loading of the albedo in the second component, which offsets the generally featureless E-types with respect to the feature-rich but darker S-types. This degeneracy is resolved in other latent components, as can be seen in Fig. 7.

Clusters
Concurrent with the dimensionality reduction, the input data is divided into 50 Gaussian clusters during the model training. The clusters are not constrained in their covariances, yielding a wide range of cluster shapes and orientations in the latent space. Illustrating the distribution of the clusters in the four-dimensional latent space is not practical due to their large number; instead, we show the distributions of input spectra and albedos over the clusters in Figs. 8 and A.1 respectively.
Most clusters occupy a narrow volume in the latent space and encompass Gaussian populations in previously recog- Figure 6: Latent scores of the input data projected along the first two latent components (grey circles). The mean score of all asteroids assigned to a given class is indicated by the class letter. For better readability, the mean score of class C has been shifted by -0.1 in z 1 .
nised classes such as S and V. When building the asteroid classes from the clustering, we map the probability of any sample to belong to either of these narrow clusters one-toone to the respective asteroid class. As an example, for all observations the probability of belonging to cluster 0 is added to the probability p S of belonging to the S-types (see Fig. 8). Additional S-like clusters such as cluster 6 further add to p S ; 33 clusters are mapped to a single asteroid class in this manner.
Other clusters either capture continuous trends between classes or the diffuse background population. An example of the former is cluster 22, containing spectra from both Mand P-types, and of the latter cluster 13, containing observations with varying spectral characteristics and albedos. For these clusters, we implement decision trees to separate the observations into mostly two or three distinct classes. These decision trees are described in Sect. 4 on a per class basis at the end of each class description. The probability of belonging to either of these clusters is split and added to the respective class probabilities following the decision trees.
As an example, cluster 22 is resolved via the albedo. If no albedo is present in the observation, the cluster probability is added entirely to p X , otherwise it is split between p M and p P proportionally based on the albedo distribution of M-and P-types, derived in Sect. 3.4.2. For clusters 13, 29, and 41, we note that they capture objects with high variability in their spectral and albedo features. These are either unique objects, such as the only O-types (3628) Boznemcova and (7472) Kumakiri in cluster 13, or spectra of questionable quality. We resolve these clusters with decision trees based on GMMs into different classes: cluster 13 into C, O, Q; cluster 41 into B and V; and cluster 29 into every class except for E, K, L, O, R, X, and Z (see Table D.2). Objects in either of the three clusters are flagged in the classification output as DIFFUSE and should undergo visual scrutiny.  Fig. 6, but giving the scores in the third and fourth latent components. For better readability, the mean score of class S has been shifted by -0.02 in z 3 and of classes C and P by 0.04 in z 4 .

Class continuity
When deriving the mapping of the Gaussian clusters to the asteroid classes, we strive to maximise the resemblance of the resulting taxonomy to the established system by Tholen (1984) and the Bus-DeMeo system. For any change in the classes, we weigh the evidence in the data to necessitate the change against the overall practicality of class continuity, opting for the latter when in doubt. Furthermore, we also take into account mineralogic and meteoritic interpretations established in the literature using observables outside this feature space, allowing us to derive classes which are more useful for communicating class properties within the community. These influences from outside the data-driven approach are stated in the description of the respective class in Sect. 4.
The main drivers for the evolution of the class scheme are twofold. The first is the fundamental difference between the probabilistic clustering employed here and the visual clustering used in previous schemes, affecting specifically classes that reflect continuous trends in the asteroid population. The second is the reintroduction of the albedo to the observables of the taxonomy.
The fundamental division of asteroids into feature-poor and feature-rich populations, the C-and S-complexes, is the baseline of our scheme, as it has been since the first taxonomic efforts by Chapman et al. (1975). A small population of asteroids with faint features occupies the space between these complexes in DeMeo et al. (2009), separated into the classes K, L, Xc, Xe, Xk, and T. Thanks to the taxonomic information provided by the albedo and targeted campaigns of these populations (e.g. , this population has grown considerably, to the point that we recognise it as a third complex, which we dub the M-complex based on its most populous class.
Taxonomic constants such as the A-and V-types represent no challenge in identification. It is more difficult to prove the definition of the Q-types, which represent a continuous trend towards smaller slopes compared to the S-types and as such does not separate clearly in the latent space. In favour of class continuity, we still identify a population in the S-complex as Q-types. Subclasses such as Sa, Sq, and Sr are not identified, however, as we observe numerous clusters with varying slopes and mineralogies in the S-complex. Labelling each cluster with a secondary letter would increase the entropy of the taxonomic system, and would lead to more confusion than resolution. Furthermore, we note an overall large variability between observations of single asteroids which often exceeds the variability between these subclasses. Instead, we highlight the different mineralogical interpretations of these clusters in Sect. 4.5.1.

Resolving the X-complex
Solving the spectral degeneracy of the X-complex in the Bus-DeMeo scheme is the main motivation to reintroduce the albedo to the taxonomic system. We employ the system established in Tholen (1984); asteroids in the X-complex are differentiated based on their albedo values and are labelled P, M, and E in ascending order of albedo, while the letter X is retained for observations without albedo. However, instead of applying strict limits, 12 we model the joint albedo distribution of all observations in clusters that we consider to be X-like based on their spectral appearance: clusters 17, 22, 35, 37, and 46. The employed model is a GMM with three components. In Fig. 9 we show the model fit to the albedo distribution of the X-complex, as well as the derived mean and standard deviations in p V for classes E, M, and P. Any asteroid that falls into one of the these clusters and has an albedo observation is assigned based on its probability in this model to the respective class. The subclassification indicating the presence of features in the spectra (e and k) is retained and discussed in the following subsection.

Feature flags
The Bus-DeMeo system recognises four classes which are based on the presence of distinct absorption features in addition to the overall shape of the spectra: (1) Ch, exhibiting a feature around 0.7 µm associated with possible surface hydration (e.g. Rivkin et al. 2015); (2) Xe, showing a narrow feature at 0.5 µm (Bus and Binzel 2002b); (3) Xk, depicting a faint broad feature between 0.8 -1.0 µm (Bus and Binzel 2002b); and (4) Xn, with a feature around 0.9 µm (Binzel et al. 2019). Example spectra carrying the e-, h-, and k-features are shown in Fig. 10.
The identification and flagging of these features by use of the secondary letters in the class designation is carried over in this scheme with a slight modification. First, we do not differentiate between the k-and the n-feature. Both are centred around 0.9 µm and after slope removal, we find no appreciable systematic difference between the features in a sample of spectra previously classified as Xk or Xn. We do not rule out that these features are imprinted by different surface mineralogies; however, we chose the evidence in the data over class continuity, we decided to drop the n-feature, Figure 8: Overview of asteroid spectra assigned to each cluster, including the number N of spectra and the asteroid classes to which the cluster contributes, excluding classes with fewer than three contributed observations except for cluster 25 which only has three members. The classes are sorted by the total number of observations the cluster contributed. The dotted line gives the mean value of the spectra per cluster except for diffuse clusters (defined in Sect. 3.3) and cluster 25. The mean spectra are normalised to unity at 0.55 µm. The y-axis limits change in each row. We note that these values slightly change later as other class members are added from clusters which are not assigned purely to the X-complex. The final albedo distributions are given in Table 3.
and continued with only the k-feature.
Second, we do not reserve unique classes for observations depicting the e-or the k-feature. As discussed in Sect. 4.3, both features are prevalent in members of the X-complex showing a variety of spectral slopes and albedos. We judge these two properties to be more important when deriving classes than the presence of a single feature. Furthermore, we note that e and k are not mutually exclusive; for example, (2035) Stearns depicts both features as shown in Fig. 10. We thus decided to flag the presence of these features by appending the respective letter to the class designation without considering the resulting combinations such as Mk or Eek as proper classes.
On the other hand, the h-feature is treated consistently with the Bus-DeMeo system. It is exclusive to the members of the C-complex and displays a much narrower, continuous distribution than the other two features, as shown in Sect. 4.1. Any sample depicting the 0.7 µm band is assigned to the Ch-class, regardless of the subclass in the C-complex that the spectra falls in.
The features are identified in a semi-automated manner. For each feature we defined a wavelength interval around the band centre in which the spectral continuum is removed and the reflectance is fitted using a polynomial of fourth degree, following Fornasier et al. (2014). Both the interval and the expected band centre were defined heuristically using a training sample of visually identified features, and are  given in Table B.1. Using the polynomial fit, we estimate the band depth with respect to the continuum, the band centre, and its signal-to-noise ratio. The last is given by the ratio of the band depth to the reflectance uncertainty, which is estimated using the residuals of the polynomial fit. The band is considered to be present if the band centre is within three standard deviations of the expected position derived from the training sample and the signal-to-noise ratio is higher than one.
The fitting procedure is run automatically to identify the h-feature in spectra classified as members of the C-complex (B, C, P, and the degenerate class X; see Sect. 4.1) and the e-and k-features for those belonging to the X-complex (E, M, P, and X). In practice, we find that relying on the automated band identification yields many false positives given the low threshold of one in the signal-to-noise ratio and the general uncertainty of the expected wavelength of the band centre. For example, Cloutis et al. (2018) give band centres between 0.6 µm-0.75 µm for the h-feature. Hence, we recommend a semi-automated approach where the bands are fitted automatically and the observer visually confirms the quality of the fit and the presence or absence of the band. The fitting and confirmation are handled by the classification tool presented in Sect. 5. In the 2983 spectra classified during the clustering, 13 (144, 135) carry the efeature (h-feature, k-feature). For 392 spectra (361, 360), no conclusion could be made as the spectral region is missing.
The k-feature is particularly challenging to observe as it falls in the transition of visible and near-infrared spectra, which are acquired using different instruments. Merging the spectral parts is non-trivial and several subjective decisions have to be made, as outlined in . The unknown offsets between visible and near-infrared can give rise to an artificial feature when joining the observations. In the case of the e-feature, Bus and Binzel (2002a) point out a systematic feature between 0.515 µm and 0.535 µm in the SMASS spectra, which are frequently used to complement Table 1: Distribution of observations and asteroids over taxonomic classes and orbital populations. The second column gives the number of observations assigned to each class, while the third and all following columns refer to the number of individual asteroids assigned to the class. DM09 refers to DeMeo et al. (2009). The fractions in this column do not add up to 100 %, due to the missing T-class in this scheme. The orbital classes use the following acronyms: NEA -near-Earth asteroids; MC -Mars-crosser; H -Hungaria; IMB -inner main belt; MMB -middle main belt; OMB -outer main belt; Cyb -Cybele; JT -Jovian trojans.

Fraction
Orbital acquired NIR-only spectra. Hence, we note here that the e-feature should only be considered present if its band centre is well below this wavelength range.

Class per asteroid
A total of 549 of 2125 asteroids in the input data have more than one sample in the input data. These observations may or may not have been assigned to the same class, opening the possibility that asteroids have different classes assigned. We resolve these ambiguities by computing the sum of the class probabilities across all observations of the asteroid, weighted by the fraction of observed data dimensions. Observations with albedo values received an additional weight corresponding to 25 data dimensions, meaning that a visibleonly spectrum including albedo has approximately as much weight as a VisNIR spectrum without albedo. If one of the e-, h-, or k-features is detected in any of the observations, the final class of the asteroid carries the respective suffix letter.
In Table 1, we report the total number of observations per taxonomic class, followed by the number of distinct asteroids in the class. The latter number only includes asteroids which were assigned to the class after the merging procedure outlined above in the case of multiple observations.

Discussion
In the following, we discuss the main properties of the 17 classes defined in this taxonomy in data and latent space, structured into three complexes: C, M, and S. We give our motivation for class scheme and point out where it aligns with or deviates from the existing classifications, in particular the taxonomy by Tholen (1984) and the Bus-DeMeo system (Bus and Binzel 2002b;DeMeo et al. 2009), which are the closest predecessors in terms of the observables. We further outline the decision tree used to derive the classes from the 50 clusters that were fit to the input observations in the previous section. An overview of this decision tree is given in Table D.2 A general overview of the class properties in data space is given in Figs. C.2 and C.3. Table 1 gives an overview of the number of samples and asteroids per taxonomical and orbital class. Tables 2 and 3 at the end of this section show an evolution of the taxonomic scheme and describe the classes defined in this taxonomy, including an overview of the spectra of class prototype asteroids, most of which are discussed in the text. The mean spectra and albedos for each class ('class templates') are available in the CDS repository. The X-class is not discussed separately as its members are covered by classes E, M, and P.

C-complex: B, C, Ch, P
The members of the C-complex are found throughout the main belt and dominate the regions past the 3:1 mean-Asteroid Taxonomy from Cluster Analysis of Spectrometry and Albedo 13 motion resonance in terms of number and mass (DeMeo and Carry 2014; Vernazza et al. 2017). Their spectral appearance is generally feature-poor apart from the h-feature at 0.7 µm observed in about one-third of the population and associated with phyllosilicates present on the surface . Instead, the diversity of the complex constituents is present in the slope and in the shape of the spectra, the former ranging from blue over neutral to red and the latter from overall linear to a concave appearance attributed to a carbonaceous surface composition including magnetite Cloutis et al. 1990b;Gaffey and McCord 1979) Common meteorite linkages to the population of the Ccomplex involve carbonaceous chondrites such as CI, CK, CM, and CO with different degrees of thermal metamorphism or aqueous alteration (Clark et al. 2010;Cloutis et al. 2011;Hiroi et al. 1996;León et al. 2012). However, the paucity of these meteorite groups among the falls even after bias-correction is difficult to reconcile with the abundance of the complex members in the main belt, leading Vernazza et al. (2015) to suggest interplanetary dust particles (IDPs) as analogues for the non-hydrated asteroids. Using a radiative transfer model, the spectral appearance of most C-complex asteroids is well matched using constituents of chondritic-porous IDPs. The open question on the surface composition is decisive for the behaviour of the asteroids under the influence of spectral weathering. Laboratory irradiation experiments (Lantz et al. 2018(Lantz et al. , 2017 and statistical approaches ) both show opposite trends for different initial surface compositions: while high-albedo material exhibits spectral reddening and surface darkening, low-albedo assemblages become bluer in slope and brighter.
Apart from the C-types, Tholen (1984) defined three smaller classes based on the albedo and UV distributions: B-types are 'bright-C' types with visual albedos around 10 %, while Fand G-types are characterised by their behaviour in the UV wavelength region (the former flat, the latter showing strong absorption behaviour). The Bus-DeMeo system retained classes B and C and extended the taxonomy by addition of the Ch-class for hydrated C-type asteroids, as well as the classes Cb, Cg, and Cgh, which describe different slope behaviours in different wavelength regions. Neither system counts the members of P-class as members of C, but rather as member of the X-complex.
In this taxonomy, we divide the C-complex into four classes: B, C, Ch, P. The P-class is here defined for the first time in both albedo and spectral appearance, allowing us to move it from the X-complex and firmly establish it as part of the C-complex. Any object within the complex that exhibits the h-feature is classified as a Ch-type, even if it falls in B or P.
The distribution of reflectance spectra and visual albedos for each class is shown in Fig. 11. The heterogeneous yet continuous distribution of the C-complex members in latent space is illustrated in Fig. 12. As change in slope and a broad feature around 1 µm-1.3 µm are the main differentiators, the complex members split best in the z 1 and z 4 latent dimensions. 14 We note that the apparent diagonal gaps between the C-and Ch-class members in the lower part of 14 We consider an absorption feature to be concave, while other works such as Lantz et al. (2018) define it as convex.

B-types
The B-class was first defined in Tholen (1984) based on their average albedo, which is higher in comparison to the other members of the C-complex. With the disappearance of the UV wavelength region from taxonomy, F-types are no longer distinguishable from B-types, and the distribution of generally high albedos of the latter has become a broad distribution with a standard deviation of around 10 %, see Fig. 11. This distribution is further visible in the large variance in the z 2 -scores of B-types (Fig. 12). Instead, the bright C (Tholen 1984) are best identified by another common interpretation of the class mnemonic, their blue slope longwards of ∼0.7 µm, causing a readily apparent distinction from other classes specifically in the z 1 latent score. Nevertheless, the B-types do not separate entirely from the neighbouring C-types and form a diffuse but continuous B Ch Ch Ch Ch Ch Ch Ch P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P B Ch P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P The latent scores of all samples outside these classes are shown as grey circles. Some outliers in z 2 and z 4 are not shown for readability. The five subpanels above each panel show regions of interest where a selection of asteroids are highlighted by replacing the symbol with the respective asteroid's number. If more than one spectrum of the asteroid is in the input data, its number may appear several times.
branch of the complex, as shown in Fig. 12.
The class variance in z 1 -z 2 indicates that bluer B-types also tend to be brighter. As shown in subpanel (a) in Fig. 12, the archetype B-type ( A total of 45 asteroids (2.1 %) are classified as B-types in this study. The B-class is made up of a single cluster (2) and is not subject to any decision tree. We note that the Themis-like B-types with a neutral-to-reddish slope in the NIR, as described in Clark et al. (2010) and León et al. (2012), are C-types in this taxonomy, in agreement with their classification in the Bus-DeMeo system (see subpanel (h) in Fig. 12).

C-types
The carbonaceous C-types present spectra with a neutral to small red slope and are generally featureless except for a broad feature around 1.3 µm, which may give the spectrum an overall concave shape. In the upper part of Fig. 12, we observe a uniform distribution of the C-types in z 1 -z 2 with the class variance aligned with the z 1 axis; z 2 is not a suitable projection for the C-types as they are featureless and present a narrow albedo distribution, as shown in Fig. 11. Instead, the concave feature shape is captured in z 4 , hence in the lower part of Fig. 12 we observe a more structured Asteroid Taxonomy from Cluster Analysis of Spectrometry and Albedo 15 clustering. The positive correlation of z 1 and z 4 scores among the C-types indicates that the spectra on average get more concave as they get redder. Nevertheless, the wide and continuous distribution around this general trend prevents us from defining analogues to the classes Cb, Cg, and Cgh in the Bus-DeMeo system as we aim to refrain from subjectively partitioning the latent space.
Both (1) Ceres and (10) Hygiea are members of the C-class (see subpanel (b) in Fig. 12). In subpanel (h), we highlight (24) Themis, (45) Eugenia, and (52) Europa. All these asteroids are well matched by the models composed of IDP constituents as described in (Vernazza et al. 2015) and have on average higher z 4 scores than the Ch-class members of comparable slope.
A total of 221 asteroids (10.4 %) are classified as C-types in this study. C-types are present in three different clusters (5, 19, 26, where the first two are the two largest of the 50 clusters in the model). Cluster 19 contains both prominent C-types such as (45) Eugenia and (52) Europa as well as prominent P-types such as (65) Cybele and (87) Sylvia, as shown in subpanel (c) in Fig. 12. The cluster resembles the Cb-class from the Bus-DeMeo system. We split this cluster into two components (C and P) using a GMM in z 1 -z 4 . While we generally aimed to keep the number of post-clustering decision trees to a minimum, we make the choice here to follow the mineralogical interpretation of the C-complex given in Vernazza et al. (2015) and Marsset et al. (2016), among others, and to increase class continuity for the objects in these clusters.

Ch-types
Unlike for the other feature flags outlined in Sect. 3.4.3, we reserve a unique class for the 0.7 µm h-feature, following the convention of the Bus-DeMeo system. We observe the continuous and narrow distribution of samples carrying this feature similar to the other classes in the C-complex. Furthermore, as above for the C-types, we recognise the mineralogical and meteoritic interpretation of the C-complex members in the literature (e.g. Cloutis et al. 2011;Marsset et al. 2016;Vernazza et al. 2015).
While degenerate with the distribution of C-types in z 1 -z 2 , the Ch-types generally have lower scores in z 4 than the Ctypes, corresponding to linear rather than concave spectra. In subpanels (g) and ( The actual number is likely higher as 12.1 % of samples in the C-complex are missing the visible wavelength range, for example a NIR-only spectrum of (41) Daphne indicated as C-type in subpanel (g) of Fig. 12.

P-types
The P-types have been absent from the taxonomic schemes since Bus and Binzel (2002b), and thus no definition of the VisNIR behaviour exists. As part of the X-complex, the 'pseudo-M' types (Gradie and Tedesco 1982) are spectrally degenerate to the E-and M-types in the visible wavelength range, specifically, the ECAS colours. In the NIR, P-types show a red linear slope (see Fig. 11). We find that the spectral degeneracy between P and M continues in the NIR, while E-types differentiate by showing overall neutral slopes. Classes P and M have to be distinguished by visual albedo observations, which is about 5 % for P-types.
As the X-complex is dissolved in this taxonomy, we assign the class to the C-complex following the proximity to the other classes in Fig. 12. This assignment is also in line with the IDP interpretation (Marsset et al. 2016;Vernazza et al. 2015). In z 1 -z 4 space we observe a high-density cluster of P-types immediately adjacent to C-types. These samples are spectrally similar to the Cb-class in the Bus-DeMeo system. Furthermore, there is a more diffuse population of P-types building a bridge between the C-complex and the D-class.
The P-class is part of the former X-complex and of the new C-complex. Observations assigned to the P-class are thus inspected for all three features. While 19.2 % of samples in the P-class present the h-feature, we note that no sample carries the k-feature, which is most prominent in the Mclass. Three samples assigned to P show the e-feature, yet they belong to asteroids which are later assigned to the M-class: (4660) Nereus and (5645) 1990 SP. The k-feature may thus be a reliable differentiator between the spectrally degenerate M and P. The distribution of these features is discussed further in Sect. 4.3.
A total of 135 asteroids (6.4 %) are classified as P-types in this study. Class P is built primarily from clusters 17, 19, and 22, where cluster 19 entails the continuous transition to class C and the first and third M-types. As mentioned above, we used the prototypes (65) Cybele and (87) Sylvia to differentiate between the classes, though assigning both to the C-types would have been justified as well given the cluster trend depicted in Fig. 12 (see subpanel (c)).

Endmembers: D, Z
We refer to D and Z as endmembers, due to the visible gap between their members and the C-complex in the latent space in Fig. 12; however, some of the P-types form a bridge population between the two classes.

D-types
The defining property of dark D-type asteroids is their featureless and strongly red-sloped spectrum both in the visible and in the NIR (DeMeo et al. 2009;Tholen 1984). They are predominantly found beyond the outer main belt, especially among the Jupiter trojan population, where they dominate the region in terms of mass (DeMeo and Carry 2013, 2014).
D-types form a homogeneous population in spectral and in albedo space, as shown in Fig. 11. This homogeneity is mirrored in the latent scores z 1 and z 4 as well (see Fig. 12), where in subpanel (d) we show the positions of (911) Agamemnon and (1143) Odysseus. In the second latent score, the D-types appear to split into a blue and a red population. We attribute this again to the normalisation of the spectra, which can cause these spurious offsets. Comparing the samples in the clusters showed no significant difference in the observables, and (2246) Bowell and (2674) Pandarus are present in the two clusters. Nevertheless, this serves as an example that the normalisation algorithm we devised for the partial observations may require further improvement. Furthermore, all clusters in latent space have to be verified by comparing the members in the observed features.
A total of 82 asteroids (3.9 %) are classified as D-types in this study. D-types appear predominantly in two clusters, the homogeneous main cluster 1 and a more diffuse cluster 34, which may contain interlopers of classes P and M. Furthermore, there are two small clusters containing both Dand S-types. Cluster 8 has 16 VisNIR spectra of D-types and strongly-sloped S-types, which are separated using a twocomponent GMM in z 2 -z 4 , where the feature-rich S-types have higher scores in z 2 . Cluster 43 contains 14 spectra, which are mainly visible-only S-types but include five visibleonly D-types, which we separate in the same way as in cluster 8.

Z-types
The clustering revealed a low-number diffuse cluster of featureless extremely red objects, showing larger slopes than the D-types. Figure 12 shows that in z 1 these objects form a continuum with the D-types; however, the classes show different variances in the z 1 -z 4 space: unlike D-types, Z-types show a clear trend towards a more convex shape with increasing slope. In addition, the classes show distinct orbital distributions, as illustrated in Fig. 13. While D-types are mostly situated among the Jupiter trojan population and the Hildas, these extremely red objects are largely scattered over the main belt. Three members of this population, The distinct distributions in latent and in orbital space prompt us to define a new class for this group of minor bodies. We propose the letter Z, which had previously been suggested by Mueller et al. (1992) for the extremely red Centaur (5145) Pholus. The 23 asteroids in the Z-class show overall low albedos, though we note the presence of outliers in Fig. 11.
The two reddest objects in this new class, (203) Pompeja and (269) Justitia, have been proposed as implanted trans-Neptunian objects by Hasegawa et al. (2021b). The authors suggest that complex organic material on the surface of these objects leads to the extremely red appearance. The prevalence of Z-types in the inner and middle main belt orbits of the objects could also indicate that a surface process such as spectral weathering is responsible.
A total of 23 asteroids (1.1 %) are classified as Z-types in this study. They fall exclusively into cluster 36. Even so, we expect a certain number of D-type interlopers in this class as we observe an overlap in the latent space (see Fig. 12) and in subpanel (j), where we have highlighted the Trojan asteroids (1172) Aneas, (1542) Schalen, (4035) Thestor, and (7641) Cteatus, which spectrally match D-types.

M-complex: K, L, M
The M-complex comprises classes that fall in terms of spectra and albedo between the C-and the S-complex. Compositionally, it is the most diverse complex. For C and S the ensemble properties can be regarded as carbonaceous, primitive for the former and silicaceous, in part thermally metamorphosed for the latter Cloutis et al. 1990a;Vernazza et al. 2014), while the likely mineralogical properties of any M-complex member cannot be given based solely on its complex membership. Meteorite analogues range from most carbonaceous chondrite clans in the meteorite collection to stony-iron and iron meteorites Eschrig et al. 2021;Shepard et al. 2010;Sunshine et al. 2008). Indeed, the only unifying property of these objects appears to be the spectral appearance with absent or generally faint features around 0.9 µm or 1.9 µm and an albedo around 15 % with the exception of the endmember class E.
Devising the cluster-to-class decision tree proved challenging in this complex. In combination with the faint features, we observe slight variations in the slope in the NIR, and class degeneracies appear when the visible information is missing. Furthermore, we cannot rely as much on previously established terminology as this is a new complex in terms of taxonomic systems, replacing the X-complex as a third complex in previous taxonomic systems. Both the K-and the L-types are more recent than the Tholen (1984) taxonomy, which introduced the X-complex (Bell 1988;Bus and Binzel 2002b). The Bus-DeMeo system captures the diversity in the NIR in part in the form of the X-and Xk classes; however, no clear separation between the X-and the C-complex is achieved due to the lack of albedo information.
We split the complex into the three classes K, L, and M, shown in Fig. 14. Class M, in particular, contains a wide distribution of spectral appearances and likely mineralogical compositions. Nevertheless, we opt against a division of this class as no clear separation presents itself in this study, and we advocate for a division based on observables not included in this taxonomy. The T-class, which was tentatively introduced by Tholen (1984) and carried over in the Bus-DeMeo taxonomy, is dropped as prototypes (114) Kassandra and (308) Polyxo are well described by classes P and M.

K-types
Members of the K-class exhibit a red slope in the visible region with a 1 µm band associated with forsteritic olivine The class-averaged slope is neutral to slightly red in the NIR. However, some members, including the class archetype (221) Eos and (3028) Zhangguoxi, have a blue NIR slope, indicated by their low z 1 scores in subpanels (a) and (b) of Fig. 15. As the NIR spectrum is featureless above ∼1 µm, this leads to a spectral degeneracy with the B-types, and the brighter part of the B-population requires the visible wavelength range information to be separated from the K-class. In subpanel (a) of Fig. 15, we see that (2100) Ra-Shalom is classified as a K-type, based on two NIR spectra. The only VisNIR sample of (2100) Ra-Shalom in this study is classified as a B-type. We note that (2100) Ra-Shalom is classified both as B and as K in the literature, based on its VisNIR spectrum (B: Binzel et al. 2019;León et al. 2012 andK: Shepard et al. 2008a). The same degeneracy has been reported for B-and K-types in NIR spectra ) and in the colour-space of the VISTA MOVIS survey ).
The distribution in z 1 -z 2 shows a considerable overlap between M and K, with a slight gap between the populations around z 1 = 0.3. We considered whether the redder K-types may be Mk instead; however, among them are Eos family members such as (579) Sidonia and (653) Berenike, and thus we consider this slope variability to indicate K-types. The overlap is further resolved in z 3 -z 4 , where the K-class forms a denser population than the sparsely distributed Mk-types (not shown).
A total of 42 asteroids (2.0 %) are classified as K-types in this study. K-types are found in two clusters, neither of which they populate entirely on their own. Cluster 24 is shared with M-types with neutral NIR slopes, while cluster 31 contains NIR-only observations of B-types as well as L-types. We resolve cluster 24 into K-and M-types using a two-component GMM fit to the cluster distribution in z 2 -z 3 , where K-types separate due to the large 1 µm band. Cluster

L-types
L-type asteroids are associated with large abundances of spinel-bearing calcium-aluminium-rich inclusions due to a wide absorption feature around 2 µm (Sunshine et al. 2008). This composition would imply that the L-type parent bodies were among the first planetesimals to form in the accretion disk, making them of high interest for formation scenario studies . However, in addition to the 2 µm feature, L-types are spectrally heterogeneous in slope and shape of the visible and 1 µm region and in their albedo distribution, shown in Fig. 14. The diversity of L-types makes it difficult to reliably identify them in a taxonomy based on spectral features and opens up degeneracies with a handful of neighbouring classes, such as K, M, and S.
We find that many previously classified L-types cluster in dimensions z 2 -z 4 , where they branch off of the M-complex below the S-complex together with the E-types (see Fig. 15). The second latent component matches the spinel-associated 2 µm band best, giving L-types higher z 2 scores compared to the other classes in the complex, while compared to the S-types the 0.9 µm contribution to the z 4 score is missing. In Fig. 15 we see that the L-types identified in z 2 -z 4 exhibit  Fig. 15). We regard the slope variability of the L-types classified here as intrinsic to the class, supported by (599) Luisa, which has both a blue and a red spectrum (see subpanels (b), (d), and (i) in Fig. 15).
Of particular interest among the L-types are the subgroup members referred to as Barbarians after (234) Barbara, which show anomalously high inversion angles in their negative polarisation branch (Cellino et al. 2014;Devogèle et al. 2018). We find that this group of asteroids has a large variance in latent space. In z 1 -z 2 , Barbarians such as (234) Barbara,  Fig. 15). We also do not find a reliable clustering in the remaining latent scores. The spectral L-types do not include all Barbarians, among which we observe a diversity that is too large to derive a unique class in this taxonomy. Of the 16 Barbarians from Devogèle et al. (2018), 7 are L-types and 5 are M-types. An extension of the taxonomy observables with polarimetric observations is required to reliably identify Barbarians.
A total of 58 asteroids (2.7 %) are classified as L-types in this study. L-types occur predominantly in clusters 4 and 23. As for the K-class, these two clusters are populated by members from other classes as well. For cluster 4 we split the L-and S-types based on a two-component GMM in z 3 -z 4 trained on the distribution of the members of cluster 23 and cluster 40 in this space. For cluster 23, we split the L-and M-types based on a two-component GMM in z 1 -z 4 . A small fraction of L-types are also in cluster 37, which consists largely of M-types. The L-types are recovered using a two-component GMM in z 2 -z 4 .

M-types
The M-class is one of the oldest asteroid designations (Zellner and Gradie 1976). Originally introduced to describe asteroids representing presumably metallic cores of disrupted planetesimals (Bell et al. 1989;Gaffey and McCord 1979), dedicated observational efforts have revealed a variety of objects based on their densities (Carry 2012;, hydration (Rivkin 1995(Rivkin , 2000, radar albedos , and silicate spectral features Fornasier et al. 2010;).
In spectral space, M-types asteroids are red with either linear or convex shapes, as shown in Fig. 14. The convex trend may even result in an overall blue slope longwards of 1.5 µm, as is the case for (21)  Silicate features at 0.9 µm or 1.9 µm are likely more common than an entirely featureless spectrum among M-types, with 40.9 % of M-type samples exhibiting the k-feature. Of the samples, 30.2 % lack the corresponding wavelength region observed. In Fig. 16, we display the first two latent scores of samples with the e-and k-feature. The latter feature is ubiquitous among M-types, and a concentration in latent space around (16) Psyche is visible. (55) Pandora, (129) Antigone, and (201) Penelope further show the k-feature in one or several samples, and are highlighted in subpanels (c), (f), and (g) in Figure 15. The bands are linked to different pyroxenes , and the presence of the 1.9 µm band is accompanied by the 0.9 µm band, but not vice versa .
The distribution of M-types in latent space and the results acquired in the studies cited above suggest that there are at least two populations of M-types, the chondritic population, of which (21) Lutetia may be the archetype, and the metallic population, of which (16) Psyche is the prototype Viikinkoski et al. 2017). We see this as a reasonable division of the M-class to further dissolve the compositional degeneracy of the X-complex. However, this division cannot be done based on spectra alone. To not increase the entropy of the taxonomy in a false direction, we refrain here from dividing the M-class.
A total of 142 asteroids (6.7 %) are classified as M-types in this study. The main clusters containing M-types are clusters 22, 37, and 46. Smaller contributors are clusters 17 and 35. All these clusters make up the X-complex in this taxonomy, and the spectra are split into E, M, and P as described in Sect. 3.4.2. Additional members of the M-class are found in clusters 23 and 24, which are spectrally close to L-and K-types.

Endmembers: E-types
E-type asteroids are linked to the enstatite achondrites (Gaffey et al. 1992). Their standout feature is a visual albedo generally above 50 %, see Fig. 14. This unique property makes them easy to recognise in the reduced latent space, where they exhibit large absolute values in z 2 and z 3 , with the former shown in Fig. 15.
Spectrally, E-types have a steep visible slope before flattening out in the NIR. In the case where the albedo observation is missing, E-types are degenerate with all classes of the M-complex. As an example, we observe samples of (44) Nysa located in subpanels (a) and (d), around the K-and the L-types, correctly identified as an E-type, due to the albedo observation. However, the third sample in subpanel (b) lacks an associated albedo value and is classified as an M-type.
As for L and M, we find a large intrinsic variability of the samples of individual asteroids in the E-class.
Most E-types in Tholen (1984) are classified as Xe in the Bus-DeMeo system due to the presence of the e-feature at 0.5 µm. In Fig. 16, we see that the e-feature is overall sparse compared to the k-feature. Thirteen samples in the M-complex exhibit the feature, while 65.4 % of samples lack the corresponding wavelength region observed. Of the 13 samples, 4 are classified as E-type. Considering the relative sizes of the M-and E-class, the latter are hence more likely to exhibit the feature. We do not observe a clustering of the e-feature.
The bias towards E over M for e-feature presence may be of observational nature. As an abundance of metal on the surface of M-types may lead to a drop-off of the spectral reflectance in the UV, the 0.5 µm feature might not be observed as the reflectance does not increase again towards smaller wavelengths. The band is associated with the sulfide mineral oldhamite present in aubrites (Watters and Prinz 1979) or to titanium-bearing pyroxene (Shestopalov et al. 2010). The prototype for this feature is (64) Angelina, while the E-class archetype (434) Hungaria does not present it.
Asteroid Taxonomy from Cluster Analysis of Spectrometry and Albedo 20 The k-feature is present in 30.8 % of E-type samples, while 36.9 % of samples lack the corresponding wavelength region observed.
(214) Aschera highlights the benefit of resurrecting the visual albedo. Since its classification as E-type in Tholen and Barucci (1989), it has been classified as X, B, Cgh, and C in different works (DeMeo et al. 2009;Lazzaro et al. 2004;León et al. 2012). With a visual albedo above 50 %, (214) Aschera is here classified as Ek-type and concludes its spin through the proverbial 'alphabet soup'. Observations of (64) Angelina, (214) Aschera, and (434) Hungaria are highlighted in subpanel (j) of Fig. 15.
A total of 46 asteroids (2.2 %) are classified as E-types in this study. They are predominantly located in cluster 35, though other clusters of the M-complex may also contain single samples of E-types. These are identified and assigned to the E-class in a late branch of the decision tree using the albedo distributions of E, M, and P given in Fig. 9. E-types also appear in cluster 44 among the S-types, where they are identified based on a two-component GMM fitted to the albedo distribution of the cluster.

S-complex: S, Q
The S-complex is by far the largest complex in terms of individual asteroids, in this work and in previous taxonomies. This can be attributed to observational biases such as the numeric dominance of the S-types in the inner main belt and near-Earth space (Binzel et al. 2019;DeMeo andCarry 2013, 2014) and the high average albedo of more than 20 %.
The abundance of S-types makes their homogeneity both in spectra and albedos as shown in Fig. 17 even more remarkable. While trends in the slope and the silicate features at 0.9 µm, 1.0 µm, and 1.9 µm are observable, these are primarily continuous trends and well explained by variations in the mineral composition, in particular olivine and pyroxene, as well as trends of thermal alteration in ordinary chondrites (Eschrig et al. 2022;Vernazza et al. 2014), ). S-types are one of two classes of asteroids that have an established meteorite analogue; they were linked to ordinary chondrites by the JAXA Hayabusa mission (Nakamura et al. 2011). This linkage in combination with the wealth of data on ordinary chondrites and S-types gives a solid understanding of the spectral weathering processes occurring on the surfaces of the minor bodies (Brunetto and Strazzulla 2005;Chrbolková et al. 2021;Thomas et al. 2012), which, unlike the C-complex members, shows a universal trend of surface darkening and spectral reddening with the surface age.
We divide the S-complex into two classes: S and Q. Including the endmember classes A, R, and V, we establish all classes defined in the Tholen (1984) system while extending it with the O-class. Compared to the Bus-DeMeo system, we reduce the taxonomy by subclasses of the S-class, as we explain in the following.

S-types
While class C has been split into subclasses since early taxonomic efforts (Gradie and Tedesco 1982;Tholen 1984), the S-class was not divided until Bus and Binzel (2002b) as  Figure 17: As in Fig. 11, but for the data space properties of the S-complex. The albedo distribution of the S-complex is 0.24 +0.10 −0.07 . the silicaceous surfaces are particularly subject to changes in slope and band structure induced by phase-angle effects ) and space weathering ).

Reflectance
The Bus-DeMeo system accounts for these effects by subtracting the spectral slope before classification; however, as outlined in previous sections, the partial observations prevent us from applying this taxonomy. Instead, we rely on the interpretation of the latent components to serve as vectors in the compositional analysis of the S-types.
The second and third latent components both resemble pyroxene as this mineral dominates the S-class, in addition to the large contribution in terms of variance provided by the V-types. The first component resembles the slope, hence we can approximate the vector of space weathering within the Scomplex with it (e.g. Brunetto et al. 2006). S-types denoted with the w-suffix for weathered in the Bus-DeMeo system exhibit higher z 1 scores than their class siblings with fresh surfaces. The degeneracy between a weathered S-type and an olivine-rich S-type (Sa in the Bus-DeMeo system), which is redder by mineralogy rather than by surface alteration, is resolved in the third and fourth latent component, which separates the pyroxene-olivine composition of objects.
As a practical example, in subpanel (d) in Fig. 18 we show the Bus-DeMeo Sa-types (984) Gretia and (5131) 1990 BG and the Sw-types (1019) Strackea and (4713) Steel. The subpanel (h) shows that both Sw-types have below average olivine components, indicating that the red surface is indeed due to weathering; also shown in this subpanel is the S-type (1036) Ganymed. (984) Gretia is classified as A-type in this study due to its high z 4 score (see subpanel (f) in Fig. 18).   A total of 898 asteroids (42.3 %) are classified as S-types in this study. The class is made up of several clusters: 0, 3, 6,11,14,20,21,30,33,38,39,40,42,and 47. Clusters 4,8,10,43,and 44 contain members from other classes, which we divide via GMMs, as described in the respective class descriptions (L, D, R, D, and E, in order of the clusters).

Q-types
Q-type asteroids are mostly found in the near-Earth asteroid population and resemble spectrally the ordinary chondrites in the meteorite collection (Binzel et al. 2004c). Compared to S-types, Q-types have a wider 1 µm band and a neutral to blue slope over the whole spectral range (see Fig. 17). The albedo distribution is more extended towards higher albedo values, with values of 20 %-35 %, in agreement with space weathering models predicting darkening of silicaceous asteroids with increasing surface age (Brunetto et al. 2006).
In latent space, Q-types occupy the blue end of the S-complex in z 1 . They are also distinguished from the less weathered S-types in the z 2 -z 3 space based on their high z 3 scores due to the wide 1 µm band. The archetype (1862) Apollo and class member (54827) Kurpfalz are highlighted in subpanel (b) in Fig. 18.
A total of 107 asteroids (5.0 %) are classified as Q-types in this study, 83.2 % of which are near-Earth asteroids, which is considerably higher than the average of 34.4 % over all asteroids in the input data. They populate clusters 16 and 48, as well as the diffuse cluster 13, further outlined in Sect. 4.6.3. We considered merging the Q-class into the S-class as it represents the overall continuity in the S-complex. However, as for the Z-class, the orbital distribution of the Q-types convinced us to keep this class.

Endmembers: A, O, R, V
The endmembers of the S-complex are the well-established classes A and V and the two classes that were initially built around single objects, O and R. Their distribution in the third and fourth latent scores is given in Fig. 19.

A-types
A-type asteroids are differentiated asteroids linked to brachinite achondrites (Burbine et al. 2002;Cruikshank and Hartmann 1984;DeMeo et al. 2019) and are easily recognised in spectral space by their strong red slope and deep olivine imprint at 1 µm (see Fig. 17). The albedo is within the complex average of about 20 %-30 %.
In latent space, the red colour of A-types leads to a high score in z 1 , forming a diffuse branch off the S-type population. We highlight the prototypes (246) Asporina, (354) Eleonora, and (446) Aeternitas in subpanel (e) in Fig. 18. Further characteristic of A-types is a high z 4 score due to the high olivine content (see Fig. 19). Of all the classified asteroids, A-types have the highest z 1 and z 4 scores. We note that all three spectra of Mars-Crosser (1951) Lick are exceptionally red, even among A-types (Brunetto et al. 2007).
A total of 32 asteroids (1.5 %) are classified as A-types in this study. They fall into clusters 9, 12, 27, and 49.

O-types
The class O was introduced in 1993 for supposedly ordinarychondritic (3628) Boznemcova . Its noteworthy characteristic is the wide round 1 µm feature as shown in Fig. 17, placing it between the known A-, Q-, and V-types. The albedo is close to the S-complex average at 25 %.
None of the previously classified O-types, except for archetype (3628) Boznemcova remains as an O, and a comparison of these objects in spectral space showed little resemblance. While we assign with (7472) Kumakiri a second asteroid to the class, we find in this work that (3628) Boznemcova remains without a true spectral sibling. (7472) Kumakiri was previously classified as V (Solontoi et al. 2012); however, its spectral resemblance to (3628) Boznemcova has been pointed out by Burbine et al. (2011).
The unique appearance of the O-types can be seen by their position in the latent space shown in Figs. 18 and 19. The depth and shape of the 1 µm band in combination with the lack of overall slope place the O-types (3628) Boznemcova and (7472) Kumakiri between the classes Q and V in z 1 -z 2 (see subpanel (a) in Fig. 18), while in z 3 -z 4 , they are closest to A-types.
Two asteroids (0.1 %) are classified as O-types in this study. We debated whether keeping the O-class in the taxonomy is compatible with the overall approach of data-driven clustering. In the end, the unique feature and position of (3628) Boznemcova convinced us, although an argument  Figure 19: Distribution of the S-complex endmember classes A, O, R, and V in the last two components of the latent space.
against single-object classes can be made. The O-class was difficult to carve out from the clusters using the given method. It is derived from a three-component mixture model of the already diffuse cluster 13, which is split into C, O, and Q. Any assignment of the O-class by the classification tool should undergo visual scrutiny and direct comparison to the spectrum of (3628) Boznemcova.

R-types
The R-types are the second niche class of this taxonomy, built around (349) Dembowska. The unique nature of (349) Dembowska is recognised jointly with that of (4) Vesta in early works of taxonomy Zellner and Gradie 1976) and the R-class was introduced in Bowell et al. (1978). However, the A-class, which was split off the R-class in Veeder et al. (1983), has since been absorbed into most R-types. The continuity between A and R is visible in Fig. 19.
R-types show 1 µm and 2 µm features which are deeper than those in S-types. The width of the 1 µm is between the Vand the Q-types. They have albedos at the upper end of the S-complex distribution, around 28 % (see Fig. 17). The spectral appearance is associated with low-iron ordinary chondrites (Zellner and Gradie 1976). We note that of the four samples of (349) Dembowska two are classified as R and another two as V (see subpanel (c) in Fig. 18, where we also give the position of R-class member (8693) Matsuki).
A total of 10 asteroids (0.5 %) are classified as R-types in this study. The class is derived from cluster 10 in a twocomponent GMM fit in z 1 -z 2 , where objects with lower z 2 scores are assigned to the S-class.

V-types
(4) Vesta was the first asteroid to be observed spectrophotometrically (McCord et al. 1970) and the V-types have been an established and easily-recognizable class in all asteroid taxonomies since Tholen (1984). They are the second class, in addition to S, with an established meteoritic analogue, the HED meteorites (e.g. Kelley et al. 2003). The class makes no exception here; its members are differentiated easily in Table 2: Evolution of taxonomic scheme from Tholen (1984) to Bus-DeMeo to this work. Arrows are used to indicate the overall evolution of each class. The T-class is not present in this taxonomy and the feature characteristic of the Xn has been grouped into the k-feature. The evolution of the X-complex between the taxonomies is unclear as the visual albedo is not taken into account in the Bus-DeMeo system. No analogues for K and L were defined in Tholen (1984). Tholen Bus-DeMeo This Work both z 2 and z 3 due to the large contribution of pyroxene to the spectral appearance giving rise to the characteristic deep 1 µm and 2 µm features (see Figs. 17 to 19). The class archetype (4) Vesta is highlighted in subpanel (a) in Fig. 18.
The large class variance in z 2 and z 3 represents high variability in terms of band depth and position in the 0.9 µm and 2.0 µm features. However, we do not identify a subpopulation based on the band parameters, as was suggested by Binzel and Xu (1993).
A total of 142 asteroids (6.7 %) are classified as V-types in this study. V-types populate clusters 7, 15, 18, 28, 32, and 45. V-types with a blue slope in the NIR further share the diffuse cluster 41 with the B-types.

Classification
In this section we introduce the classification tool described in this work. We demonstrate the probabilistic classification results using asteroid observations with different wavelength regions covered. We further investigate degeneracies in the classification space. Finally, we compare the results obtained in this taxonomy to the previous systems.

Classification tool: classy
To facilitate the classification of asteroid observations within the framework of this taxonomy, we provide the CLAssification of a Solar System bodY (classy 15 ) tool written in Python. It is able to interactively smooth the input spectral observations prior to resampling them to the required wavelength grid, to automatically apply the necessary preprocessing steps outlined in Sect. 2 to both spectra and albedo, to identify features in the spectra as outlined in Sect. 3.4.3 (either fully automated or guided by the user), to execute the cluster-to-class decision tree, and to return the probabilistic classifications for each observation.
The classy tool provides a command-line interface written in Python and is available for Windows, MacOS, and Linux. The software is actively maintained and developed by the authors.

Class degeneracies
The probabilistic nature of the classifications in this taxonomy allow the degeneracies between classes to be quantified in certain wavelength regions and in albedo. One example is given in Sect. 4.1.1, where we point out the degeneracy of B and K in the case of a NIR-only observation.
We can quantify class degeneracies for three datasets in this work, with the aim of reflecting the most commonly available observation ranges of asteroid spectra: the 2983 spectra used to devise the clustering, the 2923 visible-only spectra shown in grey in Fig. 2 with 81.4 % albedos observed, and the 2813 spectra from the clustering sample which have NIR information. For the last we remove all observations of wavelengths below 0.8 µm and the albedo information present in the samples. We refer to these samples as the complete, the visible-only, and the NIR-only datasets; however, this wording is not entirely accurate as more than 50 % of the samples in the complete sample are NIR-only spectra and the visible-only sample contains more than 80 % of albedo observations.

Complete sample
To estimate the class degeneracy in the complete dataset we compute the average probability of belonging to any other class for all samples assigned to a given class. This comparison is given in Fig. 20. The Ch-class is missing as it relies on the detection of the h-feature, and as such does not have an associated class probability. Larger matrix element values indicate a higher degeneracy between the classes. A large sum per matrix row indicates that the class assignment is overall less certain. Figure 20 shows the intuitive result that endmember classes such as A, V, and Z are assigned with a large probability. The largest degeneracies in pairs of classes are between Band C-types and R-and S-types. Neither result is surprising as they overlap in latent space, and even in visual inspection  these classes can be difficult to tell apart. The largest uncertainty overall for a single class (given by the sum per row in Fig. 20) is around 20 % for K, L, and M and also for the R-types. For the first three, we already pointed out in Sect. 4.3 the similarity in data space between these classes, hence this result is again expected.

Visible-only sample
The estimation of the class degeneracy is repeated for the visible-only dataset after classifying the samples therein using the classy tool. Figure 21 shows a result similar to that for the complete dataset, except that the overall values of uncertainty increase. Instead of 80 %-99 % certainty in the class assignment, we obtain values between 63 %-91 %. Except for this overall change in scale, we do not observe significant differences between the results for the visible-only and the complete datasets. K, L, and M are among the leastcertain classes, while O-types have the largest uncertainty due to the missing 1 µm band. For classes from the M-and S-complex, we see an overall increasing probability to be classified as S-type.

NIR-only sample
The class degeneracy is next calculated for the NIR-only spectra that are part of the input observations used to train the MCFA model. We remove the albedo information present in 78.5 % of the samples prior to classifying them. The confusion matrix is shown in Fig. 22. The overall scale of the uncertainty in the class assignment is between the results for the complete and the visible-only dataset, with the maximum average degeneracy between two classes just over 25 % between P and C, likely due to both the missing albedo information and the truncation of the broad 1.3 µm feature in the C-types. We note that no sample is classified as E-type, due to the missing albedo information, and no sample is classified as O-type, as both (3628)   and (7472) Kumakiri are classified as Q without the visiblewavelength information. The bowl-shaped 1 µm band of the O-types extends below the 0.8 µm limit we apply to this dataset, hence this misclassification is acceptable. The expected degeneracy between B and K in NIR-only data is not visible in Fig. 22 due to the presence of the 1 µm information. While visible-only spectra lead to uncertainty among the Mand S-complexes, in particular with respect to the S-types, this calculation shows that NIR-only spectra lead to greater confusion between the C-and the M-complexes.

Visible-Only Dataset Classification
We conclude that the class degeneracies in the complete, visible-only, and NIR-only samples follow an intuitive behaviour: the largest classes in terms of number of samples (S, C, and M) become more probable with decreasing observational data. This is in line with the established classification guideline that, when in doubt, assignment to small classes should only be done on the basis of convincing observational evidence.

Complete versus visible-only sample
Another way to investigate class degeneracies is the comparison of classifications resulting from samples with different wavelength regions observed. There are 267 asteroids present in both the complete and the visible-only datasets with a total of 328 observations. For these asteroids, we compare the resulting classifications based on the samples in both datasets, shown in Fig. 23. Each row gives for each class in the taxonomy the fraction of asteroids classified as any class  Figure 23 shows that Ch, S, and V are the most reliable when classified using visible-only data. Ch benefits from the binary classification which takes place once the h-feature is observed. The members of the M-complex show increasing degeneracy with the S-class with decreasing near-infrared coverage. The least-expected degeneracies are Z and C, as well as E and B, however, they are all based on a single sample.
Both results in Figs. 21 and 23 show that visible-only spectra in combination with the albedo place a strong constraint on the taxonomic class, as is well-established from the previous taxonomies which relied on the visible wavelength ranges exclusively. This highlights the strengths of the new method employed here: NIR-spectra are not strictly necessary to derive a classification as incomplete observations can be classified and the albedo as an accessible observable is accounted for.
We do not repeat this comparison for the complete and the NIR-only samples as the latter make up a significant fraction of the former, hence the agreement between the samples would be overestimated.

Comparison to previous taxonomies
Class continuity was one of the aspects which we considered when designing the scheme of classes in this taxonomy. We quantify this goal as above using a confusion matrix, except that we compare the classes assigned based on the complete dataset to the most-probable previous classification of the asteroid in the literature, retrieved for 2676 samples of 1852 individual asteroids from the SsODNet database. We convert the previous classifications done mostly in the Bus-DeMeo scheme to this scheme using the mapping given in Table 2. Figure 24 shows an overall good agreement of the classes assigned in this work with the ones from the literature. Notable exceptions are the O-type, which has no legacy members apart from (3628) Boznemcova as pointed out in Sect. 4.6.2, and the new Z-class, which hosts almost exclusively previous D-types. Furthermore, L loses members to S as well as O to V.

Conclusion
The taxonomic scheme for minor body classification has been in development for close to 50 years. During this time, numerous efforts to categorise the observational properties of asteroids have been driven forwards through dedicated observational campaigns and instrumental advancement. We focused on the methodology and statistical foundation, allowing us to increase the sample size by an order of magnitude compared to the previous taxonomy by DeMeo et al. (2009) and to reintroduce the albedo into the classifying observables as done in Tholen (1984).
The dimensionality reduction and clustering applied to 2983 spectra of 2125 asteroids revealed three main complexes: the well established C-and S-complexes and a restructured M-complex. While the S-complex is well understood in terms of mineralogy and meteoritic analogue material, both the C-complex and M-complex show a large degree of variability of so far unknown origin. We derive 17 classes from the three complexes, where the data-driven clustering is guided by the previous taxonomies and the goal of class continuity.
A classification tool named classy is available online and allows the user to classify asteroid observations covering the spectral VisNIR region and the visual albedo either completely or partially. The resulting array of class probabilities for each sample serves to estimate classification uncertainty and possible taxonomic trends.
We established a methodology for asteroid taxonomy which is well suited for the current and future datasets of asteroid observations. The ongoing MITHNEOS survey, the upcoming Gaia Data Release 3 (including visible spectra, Delbó et al. 2012), and the planned NEO Surveyor mission  and SPHEREx survey (Ivezić et al. 2022) will provide or continue to provide spectral and albedo observations of asteroids in different wavelengths, which are able to be classified within the framework of this taxonomy.
The dimensionality reduction and clustering are able to resolve more features and find more meaningful clusters when fed with more data. It may be worthwhile exploring how the model properties described in Sect. 3 change when fed with significantly more data. Nevertheless, during this work, we found that the latent space properties show little change whether we train with 500, 1000, or all samples in the dataset. Instead, we anticipate that a future taxonomy-revision will benefit more from an increased feature set. In particular the UV information offered by the Gaia data may solve degeneracies in the C-and M-complex. A further improvement should be the addition of polarimetric data, provided the amount of observations is comparable to the availability of the other features. The M-complex could benefit, and we consider that most work is left to be done in this complex. Extension of the spectral space into the 3 µm region is promising as well. Overview of the albedo distribution per cluster, including the number N of albedos and the asteroid classes to which the cluster contributes , excluding classes with fewer than three contributed observations except for cluster 25 which has only three observations. The classes are sorted by the total number of observations the cluster contributed. The dotted line gives the mean value of the albedos per cluster except for diffuse clusters and cluster 25. The y-axis limit is different in each row. C Distribution of spectra and albedos in classes D Cluster-to-class decision tree Table D.2: Cluster-to-class decision tree. Overview of the computation of the asteroid-class probability for each observation based on its cluster probabilities. The upper part of the table contains clusters whose members are mapped to a single asteroid class. The lower part contains clusters where the resulting asteroid class probabilities depend on the criterion given in the middle column. GMM(z x , z y ) means that the cluster probability is split based on a Gaussian mixture model with N components fit to all cluster members in z x and z y , where N is equal to the number of possible outcome classes (i.e. each mixture component represents one candidate class). P X (y) refers to the probability of belonging to the class or cluster X given the value of y. The last line gives the definition of the Ch-class, which is the last step of the classification.  ( GMM(z 1 , z 2 ) B, V 44 P E (p V )/P M (p V ) E, S Class is B, C, P, or X and h-feature is present Ch