Issue 
A&A
Volume 677, September 2023



Article Number  A59  
Number of page(s)  59  
Section  Galactic structure, stellar clusters and populations  
DOI  https://doi.org/10.1051/00046361/202243690  
Published online  06 September 2023 
Significance mode analysis (SigMA) for hierarchical structures
An application to the ScoCen OB association^{⋆,}^{⋆⋆}
^{1}
University of Vienna, Department of Astrophysics, Türkenschanzstraße 17, 1180 Vienna, Austria
email: sebastian.ratzenboeck@univie.ac.at
^{2}
University of Vienna, Research Network Data Science at Uni Vienna, Kolingasse 1416, 1090 Vienna, Austria
^{3}
University of Vienna, Faculty of Computer Science, Währinger Straße 29/S6, 1090 Vienna, Austria
^{4}
University of Vienna, ISOR/VCOR, OskarMorgensternPlatz 1, 1090 Vienna, Austria
Received:
31
March
2022
Accepted:
16
May
2023
We present a new clustering method, significance mode analysis (SigMA), for extracting cospatial and comoving stellar populations from largescale surveys such as ESA Gaia. The method studies the topological properties of the density field in the multidimensional phase space. We validated SigMA on simulated clusters and find that it outperforms competing methods, especially in cases where many clusters are closely spaced. We applied the new method to Gaia DR3 data of the closest OB association to Earth, ScorpioCentaurus (ScoCen), and find more than 13 000 comoving young objects, about 19% of which have a substellar mass. SigMA finds 37 comoving clusters in ScoCen. These clusters are independently validated by their narrow HertzsprungRussell diagram sequences and, to a certain extent, by their association with massive stars too bright for Gaia, and are hence unknown to SigMA. We compared our results with similar recent work and find that the SigMA algorithm recovers richer populations, is able to distinguish clusters with velocity differences down to about 0.5 km s^{−1}, and reaches cluster volume densities as low as 0.01 sources pc^{−3}. The 3D distribution of these 37 coeval clusters implies a larger extent and volume for the ScoCen OB association than typically assumed in the literature. Additionally, we find the association more actively starforming and dynamically complex than previously thought. We confirm that the starforming molecular clouds in the ScoCen region, namely, Ophiuchus, L134/L183, Pipe Nebula, Corona Australis, Lupus, and Chamaeleon, are part of the ScoCen association. The application of SigMA to ScoCen demonstrates that advanced machine learning tools applied to the superb Gaia data allows an accurate census of the young populations to be constructed, which in turn allows us to quantify their dynamics and recreate the recent star formation history of the local Milky Way.
Key words: methods: data analysis / open clusters and associations: individual: ScoCen / solar neighborhood / ISM: clouds
Full Table E.1 is only available at the CDS via anonymous ftp to cdsarc.cds.unistra.fr (130.79.128.5) or via https://cdsarc.cds.unistra.fr/vizbin/cat/J/A+A/677/A59
Interactive Figs. 10, 12, and 13 are available at https://www.aanda.org.
© The Authors 2023
Open Access article, published by EDP Sciences, under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
This article is published in open access under the Subscribe to Open model. Subscribe to A&A to support open access publication.
1. Introduction
The ESA Gaia mission (Gaia Collaboration 2016, 2018, 2021, 2023a) is transforming our knowledge of the local Milky Way, particularly in regards to the distribution of young stellar populations. However, disentangling and extracting coeval populations remains notoriously difficult. This is reflected in the wide variety of methods applied to the Gaia data (e.g., Oh et al. 2017; Kushniruk et al. 2017; Zari et al. 2017, 2019; CastroGinard et al. 2018; CantatGaudin et al. 2018a; Galli et al. 2018; Damiani et al. 2019; Meingast et al. 2019, 2021; Kounkel & Covey 2019; Chen et al. 2020; Hunt & Reffert 2021; Olivares et al. 2021). This wide range reflects the rather complex feature space^{1} from where the stellar populations are extracted. Firstly, these initially compact objects are stretched into elongated, sometimes nonconvex structures in position space as a consequence of interactions with the Milky Way potential, spiral arms, and giant molecular clouds (e.g., Kamdar et al. 2021). This “galactic stretching” leads to a variety of cluster^{2} shapes, from compact (when young) to lowcontrast, spreadout, sometimes Sshaped clusters dominated by Milky Way tidal forces (e.g., Meingast & Alves 2019; Röser et al. 2019; Meingast et al. 2019, 2021; Beccari et al. 2020; Kounkel & Covey 2019; Jerabkova et al. 2019, 2021; Ratzenböck et al. 2020; Kerr et al. 2021; Kamdar et al. 2021). Secondly, due to the low number of available radial velocities, about 2% in the Gaia Data Release 3 (DR3) database (Gaia Collaboration 2023a; Katz et al. 2023), one is, for the most part, restricted to two tangential velocity axes plus the spatial threecoordinate axes derived from Gaia positions, parallaxes, and proper motions (5D phase space). Thus, even under the assumption of perfectly Gaussiandistributed 3D velocities within clusters, the projection on the sky distorts the multivariate Gaussian (5D space) into arbitrary shapes depending on the orientation, distance, and size of the stellar cluster. To make matters worse, stellar cluster members constitute a minute subset of the Gaia data, with unrelated field stars creating background noise that is not easily removable in the 5D space. Thus, the feature space consists of stellar clusters of various shapes and densities embedded in a sea of noise.
To tackle the challenge of identifying subpopulations in a starforming region, we have developed a method that analyzes the topological structure of the 5D density field spanned by 3D positions and 2D tangential velocities. We applied a fast modality test procedure that introduces a measure of significance to peaks in the density distribution, thus providing an interpretable cluster definition. This clustering method is called significance mode analysis, or SigMA, and it is designed to extract cospatial and comoving stellar populations from largescale surveys such as ESA Gaia.
The goal of this paper is to present the SigMA method and apply it to the ScorpiusCentaurus (ScoCen) OB association (Kapteyn 1914; Blaauw 1946, 1952, 1964a; Blaauw 1964b) to identify the different subpopulations and compare results to recent papers with similar goals. ScoCen is the closest and best studied OB stellar association (e.g., de Geus et al. 1989; de Geus 1992; de Bruijne 1999; Preibisch & Zinnecker 1999; de Zeeuw et al. 1999; Lépine & Sartori 2003;; Preibisch & Mamajek 2008; Makarov 2007a,b; Diehl et al. 2010; Pöppel et al. 2010; Rizzuto et al. 2011; Pecaut et al. 2012; Pecaut & Mamajek 2016; Forbes et al. 2021) and has an age of ≲20 Myr (Pecaut et al. 2012). These and many other papers in the literature have established ScoCen as an important laboratory for star formation, for the characterization of stellar associations, and for understanding the impact of massive stars on the interstellar medium and planet formation. Since the advent of largescale astrometric data from the ESA Gaia mission, which started to become available in 2016 (Gaia Collaboration 2016), there has been a renewed interest in this benchmark region, with a focus on the kinematics and 3D structure of the association (Villa Vélez et al. 2018; Wright & Mamajek 2018; Goldman et al. 2018; Damiani et al. 2019; Luhman & Esplin 2020; Grasser et al. 2021; Squicciarini et al. 2021; Kerr et al. 2021; Luhman 2022; Schmitt et al. 2022; MiretRoig et al. 2022a; BriceñoMorales & Chanamé 2023).
In this paper we present the method SigMA in Sect. 3, using Gaia DR3 data (Sect. 2), and validate it in Sect. 4. In Sect. 5 we present an application of SigMA to ScoCen, including comparisons to previous work (Sect. 5.2). In Sect. 6 we summarize our findings.
2. Data
In this work we apply the newly developed method presented in this paper, SigMA, to Gaia DR3 data at and around the ScoCen OB association. To this end, we select a box of about 1.5 × 10^{7} pc^{3} from the Gaia DR3 Archive (Gaia Collaboration 2023a), which extends well beyond the traditional and wellstudied ScoCen regions. Several hints in the literature suggest that the ScoCen OB association is a larger complex than traditionally defined by Blaauw (1964a). It includes several starforming regions that have originally not been assigned to ScoCen (e.g., Lépine & Sartori 2003; Sartori et al. 2003; Bouy & Alves 2015; Kerr et al. 2021; Zucker et al. 2022). The box is defined in a heliocentric Galactic Cartesian coordinate frame (XYZ) within
$$\begin{array}{cc}& 50\phantom{\rule{0.166667em}{0ex}}\mathrm{pc}<\mathrm{X}<250\phantom{\rule{0.166667em}{0ex}}\mathrm{pc}\hfill \\ \hfill & 200\phantom{\rule{0.166667em}{0ex}}\mathrm{pc}<\mathrm{Y}<50\phantom{\rule{0.166667em}{0ex}}\mathrm{pc}\hfill \\ \hfill & 95\phantom{\rule{0.166667em}{0ex}}\mathrm{pc}<\mathrm{Z}<100\phantom{\rule{0.166667em}{0ex}}\mathrm{pc}.\hfill \end{array}$$(1)
The 3D space positions (XYZ) are derived from the Gaia DR3 positions right ascension (α, deg) and declination (δ, deg), and the parallax (ϖ, mas). The distance (d, pc) is estimated from the inverse of the parallax, which is a fairly good approximation of the distance for sources within 200 pc and with low uncertainties (see also Appendix A). The box contains in total 5 587 760 Gaia sources when additionally requiring ϖ > 0 mas. To reduce the influence of spurious measurements, we applied the following quality criteria to the Gaia DR3 data within the selected box:
$$\begin{array}{cc}& \mathrm{fidelity}\_\mathrm{v}2>0.5\hfill \\ \hfill & \varpi /{\sigma}_{\varpi}\equiv \phantom{\rule{0.333333em}{0ex}}\text{parallax signaltonoise}\phantom{\rule{0.166667em}{0ex}}(S/{N}_{\varpi})\hfill \\ \hfill & {S/N}_{\varpi}>4.5.\hfill \end{array}$$(2)
The parameter fidelity_v2 is a classifier to identify spurious sources in the Gaia DR3 and EDR3 (Gaia Collaboration 2021) catalogs, developed by Rybizki et al. (2022), which can be used to select high fidelity astrometry. The parallax_over_error cut (similar to an S/N threshold) reduces additional uncertainties in distance. This leaves 980 348 sources inside the box out of 5 587 760 (∼18%) to which we applied the SigMA clustering algorithm, which we describe in Sect. 3. In Appendix A we give more details on data retrieval and the choice of the quality criteria. In this paper, the methodology is validated using data with the mentioned quality criteria applied; therefore, if using different criteria, the completeness and contamination estimates (see Sects. 3.5.4, 4.2, and Appendix D) could change as well. Using sources not fulfilling these quality criteria would require updated validation alongside new completeness and contamination estimates.
The clustering is primarily done in the 5D phase space, using the 3D spatial coordinates XYZ in parsecs, and the 2D tangential velocities v_{α} and v_{δ} in km s^{−1}, as derived from the observed proper motions (${\mu}_{\alpha}^{*}={\mu}_{\alpha}cos(\delta )$, μ_{δ}) and parallaxes (see Appendix A). The Sun’s reflex motion strongly influences the tangential velocities v_{α}, v_{δ}. If this is not accounted for, the distribution of ScoCen members in tangential velocity space is strongly correlated and depends on a cluster’s position and apparent size (see Fig. C.1). Such nonlinear relationships contradict a central underlying assumption of many clustering algorithms. These clustering methods assume a universally valid metric, which implies a global correlation behavior (e.g., the commonly used Euclidean norm assumes no correlation between input features). To avoid formulating a locally adaptive metric, we transform tangential velocities to velocities relative to the local standard of rest (LSR), written as v_{α, LSR} and v_{δ, LSR} in km s^{−1}. This transformation reduces the influence of the reflex motion of the Sun.
We used the barycentric standard solar motion relative to the LSR from Schönrich et al. (2010), while there are different values in the literature, which would give slightly different resulting motions (e.g., Kerr & LyndenBell 1986; Francis & Anderson 2009, see also Appendix C in Großschedl et al. 2021). However, the differences are only marginal and irrelevant for our purposes since the main goal is to reduce the strong positional correlation, which is applied consistently to all stars when deciding on one standard Solar motion correction. The effect of the transformation on tangential velocities is highlighted in Appendix C. The LSR conversion of the proper motions is accomplished with Astropy, as outlined in Appendix A. Finally, the different dimensions are scaled to each other, as described in the methods in Sect. 3.3.3.
The final clustering result is obtained from the 5.5D space since we include Gaia radial velocities (Katz et al. 2023), when available, to remove possible field star contamination, as outlined in Sect. 3.5. In this cleaning step, we assign approximate radial velocities (v_{r}) to all stars in a minimization procedure involving the hypothetical cluster bulk motion. While this procedure also works without access to v_{r} measurements (see details in Sects. 3.5.2 and 3.5.3), we find in simulations that ≥5% of v_{r} measurements are necessary to achieve no loss in accuracy (see Appendix B.4). Hence, we refer to the used dimensions as the 5.5D space, since v_{r} are added if available in Gaia DR3. Gaia DR3 only includes v_{r} for about 2% of the sources with parallaxes (or about 20% in the selected ScoCen box if considering sources with σ_{vr} < 2 km s^{−1}). Adding auxiliary v_{r} data from other surveys would improve the statistics but lead to a very inhomogeneous data sample with 6D phase space information. Therefore, we restrict our clustering procedure to the socalled 5.5D phase space, as provided by Gaia, allowing us to create a homogeneous and more complete overview of the existing clusters in regions like ScoCen. Moreover, by focusing on the 5.5D phase space, we can create a method that does not strongly rely on radial velocity information, which can be used more widely on larger data samples.
3. Methods
In this section we first give a brief overview of the basic definitions of several widely used clustering algorithms, which leads to detailed explanations on the buildup of the SigMA clustering algorithm in Sect. 3.2, as developed in this work^{3}. An indepth description of related work underlying SigMA can be found in Appendix B.1.
3.1. Clustering algorithms: A brief review
Understanding the Milky Way, or any object in the Universe, is directly linked to the quantity and quality of the available data. Nowadays, the biggest effort is no longer data collection, but the large sample sizes and high dimensionality significantly impact all parts of the analysis pipeline – storage, processing, modeling, and interpretation. “Big data” usually contain extensive information, diversity, and complexity; thus, we require more complex methods to model its observations. However, many traditional analysis techniques have time and memory complexities that fail to perform under millions or even billions of data samples (Ashok Kumar 2020). Consequently, many studies start with an exhaustive prefilter step to improve downstream analyses (e.g., Zari et al. 2019; Kerr et al. 2021).
To deal with large complex data, new interpretive methods need to be tailored to the particular scientific question, in our case, identifying comoving and coeval clusters of stars inside the 1+ billion stars in the Gaia archive. Clustering analysis, or unsupervised machine learning, has recently become essential to identifying coeval stellar structures. Clustering aims to obtain an organization of data points into meaningful clusters. However, due to the lack of labeled data, partitioning into “meaningful” clusters is generally an illposed problem (Cornuéjols et al. 2018). The algorithm’s choice and parameters must match the problem at hand. Clustering methods can generally be split into space partitioning algorithms (e.g., KMeans, MacQueen 1967), hierarchical algorithms (e.g., single linkage clustering, Johnson 1967), densitybased techniques (e.g., DBSCAN; Ester et al. 1996), and modelbased or parametric clustering algorithms (e.g., expectation maximization; Dempster et al. 1977). An introduction to cluster analysis and classical methods is available, for example, in Jain et al. (1999).
From this list of clustering methods, parametric clustering algorithms and densitybased methods are commonly used on astronomical data sets (Kuhn & Feigelson 2019; Hunt & Reffert 2021, 2023). In the following, we give an overview of modelbased clustering algorithms in Sect. 3.1.1 with an extension to Bayesian formulation in Sect. 3.1.2. In Sect. 3.1.3 we present densitybased clustering methods, which build the foundation for SigMA, discussed in Sect. 3.2.
3.1.1. Parametric clustering
Parametric clustering algorithms are appealing because of the probabilistic interpretation of the clusters these algorithms generate. The modelbased approach introduces a finite mixture of density functions of a given parametric class. The clustering problem reduces to the parameter estimation of the mixture components, typically done using the expectationmaximization (EM) algorithm (Dempster et al. 1977). The EM algorithm tries to find maximum likelihood estimates of given parameters iteratively. A popular approach is to model the mixture components as multivariate Gaussian density (e.g., Gagné et al. 2018a; CantatGaudin et al. 2019b; Kuhn & Feigelson 2019).
A considerable downside limiting parametric clustering algorithms’ versatility is their dependence on the unknown number k of mixture components. Depending on data characteristics such as dimensionality, the number of samples per cluster, and cluster separation, determining k is a difficult problem. Addressing this problem is paramount, as the resulting model is very sensitive to the choice of k (Celeux et al. 2019).
Although model selection methods such as the Akaike information criterion (AIC; Akaike 1974) and the Bayesian information criterion (BIC; Schwarz 1978) aim to provide a principled approach to selecting k, they make the somewhat restricting assumption that the data are sampled from a model within the collection to be tested. This assumption can result in overestimating the number of k in practical situations (Celeux et al. 2019). Further, BIC and AIC only work well in cases with plenty of data samples, wellseparated clusters, and a wellbehaved background distribution (Hu & Xu 2003). These circumstances make extracting clusters with a low signaltonoise ratio difficult, especially in the lowdensity regime. Many of the abovepresented problems can be mitigated using a Bayesian analysis approach.
3.1.2. Bayesian clustering approach
The deviance information criterion (DIC) for missing data models (Celeux et al. 2006) is commonly mentioned as a Bayesian model selection criterion. However, in a Bayesian setting, the unknown number k of mixture components can naturally be treated as a random variable estimated jointly with the componentspecific parameters.
Notably, two Bayesian formulations of selecting k exist, finite and infinite mixtures models. Finite mixture models often rely on the reversible jump Markov chain Monte Carlo (RJMCMC) technique (Richardson & Green 1997), which can navigate between finite mixture densities with variable k. Similarly, sparse finite Gaussian mixtures (MalsinerWalli et al. 2016) involve specifying sparse priors on the mixture parameters and can be performed using classical Markov chain Monte Carlo (MCMC) methods. In contrast, nonparametric Bayesian approaches are based on mixture models with a countably infinite number of components. In this case, the prior over the mixing distribution typically takes the form of a Dirichlet process (Müller & Mitra 2013).
Although Bayesian analysis methods provide wellestablished methods for identifying the number of clusters k, fitting models to data requires a statistical model that can generate the data set reasonably well (Hogg et al. 2010). The observed morphological structure of comoving stellar systems in position space is significantly more intricate than simple multivariate Gaussians. In recent years many new cluster shapes such as tidal tails (Meingast & Alves 2019; Röser et al. 2019; Jerabkova et al. 2021), streams (Meingast et al. 2019), strings (Kounkel & Covey 2019), rings (CantatGaudin et al. 2019a), snakes (Wang & Ge 2021), and pearls (Coronado et al. 2022) have been identified. Additionally, the projection of 3D space velocity onto the celestial sphere provides another complexity requiring a flexible clustering scheme.
Nonparametric models are frequently employed when the process that generates data is intricate, and the distribution form is unclear or hard to define. Nonparametric models, unlike parametric models, do not impose strict assumptions about the shape or features of the underlying distribution. Instead, they aim to learn the underlying pattern straight from the data. This allows us to make predictions for exceedingly complex distributions without having to know or presume the shape of the distribution.
3.1.3. Nonparametric, densitybased clustering
The premise of nonparametric densitybased methods states that the observed data points^{4}X = {x_{1}, …, x_{N}} with x_{i} ∈ ℝ^{d} are drawn from an unknown density function f. The goal of nonparametric cluster analysis is then to understand the structure of the underlying density function, which is estimated from data. In one of the earliest formulations, Wishart (1969) argues that clusters are data samples associated with modes in f. The work proposed by Koontz et al. (1976) and the widely used meanshift algorithm and its variants (Cheng 1995; Comaniciu & Meer 2002; Vedaldi & Soatto 2008) are examples of this “modeseeking” category.
Modeseeking methods proceed to cluster the data by locating local peaks in f and their corresponding attraction basins. Attraction basins are regions in which all gradient trajectories converge into one single peak. However, the gradients and modes are highly dependent on the density function approximation $\widehat{f}$. To increase the robustness of the result, MeanShift, for example, seeks to reduce random fluctuations by employing a smoothing kernel to $\widehat{f}$. The introduction of an extra parameter shifts the issue to the user, who is tasked to carefully select the nonintuitive smoothing factor in order to obtain a satisfying clustering result. Moreover, the time complexity of at least 𝒪(N^{2}) makes them not great candidates for application to astronomical data sets.
Hartigan (1975) proposed a similar definition of clustering in which a cluster is defined as the connected components of the level sets^{5} of f. Given a data set, X, drawn from an unknown density function, f, that has compact support, 𝒳, we can formally write the resulting level sets for the threshold λ as
$$\begin{array}{c}\hfill L(\lambda ):=\{\mathit{x}\in \mathcal{X}\phantom{\rule{3.33333pt}{0ex}}:\phantom{\rule{3.33333pt}{0ex}}f(\mathit{x})\ge \lambda \}.\end{array}$$(3)
Thus, L(λ) constitutes a set of connected components that we identify as clusters. Varying the parameter λ from ∞ to −∞ gives a hierarchical data summary, called the merge tree. Figure 1 highlights the generation of such a merge tree, which builds the basis for hierarchical densitybased clustering. For more details see Appendix B.1.
Fig. 1. Levelset method generating a hierarchical merge tree. Via a continuous change of λ from ∞ to −∞, a new component is created at each maximum (white points). At each saddle point (black points), components are merged. The merge tree is fully computed when λ reaches the global minimum. 
In the levelset framework, popular clustering algorithms such as DBSCAN can be simply thought of as a single level that is obtained by fixing λ. DBSCAN avoids estimating the data density explicitly, by employing a radius parameter, usually called ϵ, along with a minimum number of points parameter, min_points. Clusters are defined as connected regions of points that contain at least min_points within ϵsized shells around them.
The connected components of the level set L(c) are the resulting clusters while the remaining data are treated as noise. However, the choice of the parameter λ, which is related to DBSCAN’s ϵ parameter, is ambiguous, a task that gets especially challenging when the number of clusters varies greatly between levels. We find a reflection of this difficulty in choosing the right parameters in the astronomical literature, which employs a variety of different heuristics to select the parameter ϵ (e.g. CastroGinard et al. 2018; Zari et al. 2019; Fürnkranz et al. 2019; Hunt & Reffert 2021).
For many data sets containing clusters with variable densities, employing a single threshold λ cannot reveal all peaks in f. A hierarchy of clustering solutions can be obtained by considering all possible threshold values at once (see Fig. 1).
3.2. SigMA: Significance mode analysis
This section describes our clustering pipeline, SigMA, which builds on several established methods from data mining and statistics. We discuss those methods in further detail in Appendix B.1.
SigMA is tuned to astrometric data provided by Gaia and aims at producing astrophysically meaningful clustering results. Our technique seeks to identify modal regions in the data set (5D phase space) that are separated by dips. By applying a modality test for each pair of neighboring modes, we obtain a clustering result with measures of significance. The workflow is schematically highlighted in Fig. 2. A modal region is defined as the set of points that all end in a particular mode when following the path tangent to the gradient field at each point. It is important to note that modal regions fully segment the data set, as seen in Fig. 2 (panel 6). Thus, modal regions are a mixture of cluster members and field stars, while the field stars will be removed as noise as outlined in Sect. 3.5.
Fig. 2. Proposed clustering process SigMA, highlighted on a 2D toy data set of three Gaussians with variable covariance matrices and means. (1) The generated toy data set consisting of three bivariate Gaussians shown in white alongside 2σ confidence ellipses in color. (2) The clustering procedure starts off by estimating the density of the input data. (3) Next, a graphbased hill climbing step is performed in which points are propagated along gradient lines toward local peaks. (4) This gradient propagation results in a preliminary segmentation of input samples that typically is far too finegrained. (5) These segmented regions are iteratively merged with a parent mode if a modality test along the MEP detects no significant density dip. (6) The final segmentation retains all three clusters. 
3.2.1. A fast modality test procedure
We considered the hypothesis test introduced by Burman & Polonik (2009), which examines the modality structure of a path between two peaks in the density. Conceptually two neighboring peaks are “true” clusters in the data if there exists no path between them that does not undergo a significant dip in density.
Given the ddimensional data X = {x_{1}, …, x_{N}} drawn from f and any point r on a path connecting two modes c_{i}, c_{j} in f, Burman & Polonik (2009) show that
$$\begin{array}{c}\hfill \widehat{\mathrm{SB}}(\mathit{r})=\mathrm{d}\sqrt{k/2}[\mathrm{log}\phantom{\rule{3.33333pt}{0ex}}{d}_{k}(\mathit{r})\mathrm{max}(\mathrm{log}\phantom{\rule{3.33333pt}{0ex}}{d}_{k}({\mathit{c}}_{i}),\phantom{\rule{3.33333pt}{0ex}}\mathrm{log}\phantom{\rule{3.33333pt}{0ex}}{d}_{k}({\mathit{c}}_{j})]\end{array}$$(4)
is asymptotically standard normally distributed. Here d_{k}(z) denotes the distance to the kth nearest neighbor of the point z. The null hypothesis of unimodality is rejected at significance level α if
$$\begin{array}{c}\hfill \widehat{\mathrm{SB}}(\mathit{r})\ge {\mathrm{\Phi}}^{1}(1\alpha ),\end{array}$$(5)
where Φ is the standard normal cumulative distribution function (CDF). For a more thorough derivation of Eqs. (4) and (5) see Appendix B.2.
Since Eq. (4) processes a single point rather than a complete path, the modality test in Eq. (5) describes a pointwise procedure. Burman & Polonik (2009) employ the test with samples generated along the straight line connecting two modal candidates to determine the modality for an entire path. The null hypothesis is rejected if any single test fulfills Eq. (5). However, this procedure only applies to convex clusters and does not scale well as tens to hundreds of distance computations along each path increase the runtime drastically.
Instead of computing the test statistic $\widehat{\mathrm{SB}}(\mathit{r})$ for multiple values of r, we propose limiting the calculation to only a single realization. Importantly, reducing the number of pointwise evaluations of pointwise tests does not interfere with the distributional assumption of the test statistic itself. Burman & Polonik (2009) show that the test statistic $\widehat{\text{SB}}$ along the entire path p with p = [r_{1}, …r_{N}] follows an Ndimensional multivariate normal distribution with zero mean and identity covariance matrix under the null hypothesis. Thus, the null hypothesis is independent of the number N of pointwise tests performed.
Modifying the modality test procedure to a single evaluation of the test statistics reduces the overall statistical power of the test. Because we undersample the path between two modes, we decrease the chance of sampling in places where a significant drop in density occurs. Therefore, the probability of type II errors increases, that is, the null hypothesis is not rejected even though it is false. To maintain statistical power while also extending the test procedure to nonconvex cluster shapes, we analyze the nature of possible connections between modal candidates in the data.
Of all possible paths between two peaks, only the minimum energy path (MEP) needs to be considered. The MEP is the optimal solution for the problem of finding the continuous path from one peak to another through input space 𝒳 with the highest minimal density. Thus, the density dip along the MEP is the minimal possible dip that can exist between two neighboring peaks.
Given a set of initial modal candidate regions in $\widehat{f}$ the MEP leads over the connecting saddle point when moving from one mode to another. At the saddle point position, the path reaches its global density minimum. Figure 2 (panel 5) schematically illustrates two possible paths, the MEP and a second arbitrary path.
To effectively reduce the number of pointwise tests without losing all statistical power we need to evaluate Eq. (4) in areas close to the maximum density dip while ignoring other areas irrelevant to the rejection decision. This maximum density dip at point s maximizes the test statistic and, thus, dominates the test procedure. Due to the test statistics’ proportionality to the distance d_{k}(s), its value is maximal when the density is minimal.
For two neighboring modal regions, the modality test procedure can, therefore, be reduced to a single pointwise test at the saddle point s connecting the two peaks. As the saddle point governs the modality test, we can assign a pvalue that takes the following form:
$$\begin{array}{c}\hfill p=1\mathrm{\Phi}\left(d\sqrt{k/2}[\phantom{\rule{0.333333em}{0ex}}\text{log}\phantom{\rule{3.33333pt}{0ex}}{d}_{k}(\mathit{s})\phantom{\rule{0.333333em}{0ex}}\text{max}(\phantom{\rule{0.333333em}{0ex}}\text{log}\phantom{\rule{3.33333pt}{0ex}}{d}_{k}({\mathit{c}}_{i}),\phantom{\rule{3.33333pt}{0ex}}\phantom{\rule{0.333333em}{0ex}}\text{log}\phantom{\rule{3.33333pt}{0ex}}{d}_{k}({\mathit{c}}_{j})]\right).\end{array}$$(6)
At the end of Appendix B.2, we empirically show that these assumptions hold and introduce a small correction factor to the variance of the standard normally distributed test statistic under H_{0} that is valid for Gaia phase space data.
Determining the saddle point is discussed in the following section. If all density minima lie on the boundary of modal regions, the saddle point of two neighboring modes lies at their shared border. Using this monotonous property assumption, we aim to provide a fast yet accurate test procedure to examine the modality structure of the data.
3.2.2. Identifying and pruning modal candidates
To identify modal regions from the data set X, we implement a graphbased, hillclimbing algorithm analogous to Koontz et al. (1976) where the vertex set of the graph G represents the data X. The initial modal search is performed in one pass over the vertices of G sorted in descending $\widehat{f}$order.
A data point becomes a local mode of $\widehat{f}$ if all its neighbor connections have lower densities. Alternatively, points are propagated according to their slope in $\widehat{f}$. Each point is iteratively assigned to neighbors with maximum $\widehat{f}$value (see Fig. 2, panel 3, for a schematic illustration). After this pass the data are separated into m disjoint modal sets M = {M_{1}, …, M_{m}}.
Since graphbased hillclimbing procedures are susceptible to perturbations in $\widehat{f}$, a second pass is needed to merge insignificant modal regions into their stable parent mode. To determine the merge order, we computed the cluster tree of M. As described in Sect. 3.1.3, the cluster tree is obtained by varying the density threshold λ from ∞ → −∞ and registering modal regions when λ passes through a peak in $\widehat{f}$ and their unification when λ passes through the respective saddle point. To finalize the cluster tree, we need to identify the saddle points between modal regions of M.
We determine the saddle point between two modes via an edge search in G. Specifically, we consider edges that connect vertices that lie in different modal sets. We assume extracted modal regions are proper ascending manifolds. Thus, the modal regions are devoid of local minima on the inside, which only lie on the border; consequently, saddle points are found at the common boundary of both regions. The “saddle edge” represents the bridge between two modal regions where the density is maximal. We define edge density as the minimum density along the connecting line segment. To account for density dips along the edge path while limiting the number of distance computations, the edge density is set to be the minimum density between its two vertices and the density at the geometric mean of the vertex positions. This edge density approximates the corresponding saddle point density between two adjacent modal regions.
The merging of spurious modes then proceeds by iterating over the set of predetermined saddle points sorted in descending $\widehat{f}$value order. At each step, the unimodality test in Eq. (6) is evaluated, and neighboring modal regions are merged if the respective pvalue exceeds the significance level α. Therefore, the significance level α provides an immediate and meaningful way to simplify the initial cluster tree.
3.3. Parameter selection
In the following, we discuss various parameter choices that affect the final clustering result. The presented modeseeking methodology is agnostic to the choice of the (1) graph used in the hillclimbing step, (2) density estimator, and (3) scaling factors between positional and velocity features. In the following we explain our decisions on these three algorithmic aspects.
3.3.1. Graph
The choice of the graph directly affects the gradient approximation. For instance, in a complete graph where every pair of vertices are connected via an edge, the graphbased gradient approximation loses its locality meaning entirely. In this case, the hillclimbing algorithm merges each vertex with the densest point in the data set on the first pass. Thus, overconnected graphs lead to clusters that falsely merge numerous distinct modes in the data set.
Conversely, underconnected graphs such as minimum spanning trees restrict the gradient estimation too much, producing vast amounts of spurious clusters. Furthermore, the low number of neighboring vertices dramatically restricts the possible paths between two initially formed modes. Thus, underconnected graphs introduce significant errors in determining saddle points, which drastically compromises the validity of extracted modal regions. We consider empty region graphs (ERGs) to strike a balance between over and underconnecting points in the data set X. In an ERG, a vertex between two points is created if a given region around them does not contain any other point (see Jaromczyk & Toussaint 1992, for a review).
The βskeleton (Kirkpatrick & Radke 1985) is a oneparameter generalization of an ERG where β determines the size of the empty region. For β = 1, the graph becomes the Gabriel graph (Gabriel & Sokal 1969), while for β < 1 and β > 1, edges are added or removed, respectively. Correa & Lindstrom (2011) find that critical point searches (necessary for topological decomposition, clustering, and gradient estimation) are more accurate with βskeletons, with β < 1 compared to knearest neighbor (kNN) graphs and the Gabriel graph. Since the number of vertices grows very fast as β gets smaller, we chose a value of β = 0.99.
Adopting a βskeleton on our 5D data, we find that points have, on average, approximately 50 neighbors. To reduce the chance of separate modal regions being connected via vertices and, thus, erroneously merging in the first hillclimbing step, we pruned the initially computed graph in a postprocessing step. We removed vertices that show a significant density dip as one moves from one vertex to another. For simplicity, we assumed that the saddle point lies at the arithmetic mean of the two vertex points.
3.3.2. Density estimation
As the graph choice, density estimation is a core part of the algorithmic pipeline that affects gradient propagation and, consequently, the initial mode finding step (see panels 2 and 3 in Fig. 2). Since we cannot describe the complex stellar distribution via parametric models, we employed a modelagnostic, nonparametric estimator for the underlying density.
The most popular nonparametric density descriptors are kernel density estimation (KDE) and kNN. The KDE technique estimates the density f by convolving the data with a symmetric kernel function. The bandwidth parameter can be thought of as the standard deviation of the kernel, which determines the smoothing effect of convolution. A gradual increase in bandwidth and its impact on the density is shown in Fig. 3. The kNN method takes a more naive approach to estimating the underlying density. The density value at any given point in the phase space is inversely proportional to the distance to its kth nearest neighbor.
Fig. 3. Schematic figure linking the cluster number to the density estimation process. Applying a smoothing operator generates a family of density fields. This hierarchical family of functions is called a scale space. 
The KDE inherits the smoothness properties of the kernel. Thus, the density becomes infinitely differentiable for a Gaussian kernel. Conversely, the kNN density estimate is not smooth and, in fact, not even continuous. Despite its noncontinuous nature, the kNN density estimation method has several advantages for modal clustering. Notably, Dasgupta & Kpotufe (2014) show that point modes of a kNN density estimate approximate the true modes of the underlying density function. The nearest neighbor method is also able to provide a more accurate estimate of highdensity regions compared to the kernel method (Burman & Nolan 1992).
In contrast to KDE, the computational cost of nearest neighbor methods is highly efficient due to the use of kdtree^{6} queries that provide desirable memory complexity (Bentley 1975). Further, choosing the number of neighbors k is more straightforward than the bandwidth parameter for KDE. Finally, the locality of the kNN approach provides a versatile method for determining densities when structures exist at different densities scales. Since KDE employs a constant bandwidth, it can only adapt to a single characteristic density scale. A fixed, “intermediate” bandwidth may adequately resolve mediumdensity clusters when structures are present at various scales. However, finegrained and largescale patterns will be oversmoothed or undersmoothed, respectively.
We employed a kNN estimator to approximate the density function considering these advantages. Specifically, we used a density estimator based on the distance to an empirical measure described by Biau et al. (2011). It is a weighted kNN estimate, which incorporates distances d_{1}, …, d_{k} to all nearest neighbors up to k. The distance to an empirical measure is a distancelike function robust to the addition of noise and is used to recover geometric and topological features such as level sets. It is defined as follows,
$$\begin{array}{c}\hfill {d}_{m}(\mathit{x})=\sqrt{\frac{1}{k}{\displaystyle \sum _{{\mathit{y}}_{i}\in {N}_{k}(\mathit{x})}}{\mathit{y}}_{i}\mathit{x}{}^{2}},\end{array}$$(7)
where N_{k}(x) is the neighborhood point set of x of size k. In other words, the distance to empirical measure takes the form of a mean distance from the point x to its kNNs. The density estimator is defined via the inverse of this quantity,
$$\begin{array}{c}\hfill {\widehat{f}}_{m}(\mathit{x})=\frac{1}{n{V}_{d}}{\left(\frac{{\sum}_{j=1}^{k}{j}^{2/d}}{k{d}_{m}^{2}(\mathit{x})}\right)}^{d/2},\end{array}$$(8)
where V_{d} denotes the volume of the ddimensional unit ball and n is the number of data points. Since in our use case the order of density values is important, we can ignore constant normalization terms in Eq. (8).
The kNN algorithm is not only used to estimate the density but also during the modality test procedure (see Sect. 3.2.1). Since classical kNN, as employed in the modality test, automatically ignores points within its kdistance, SigMA has a builtin limit to the size of structures it can resolve. This allows us to determine a lower bound on the velocity dispersion of a population that SigMA can identify. We find the minimally resolvable tangential velocity dispersion to be 0.5 km s^{−1} by analyzing the distribution of kdistances with a lower bound on k = 15, which we also assume to be the minimum cluster size. Clusters with lower velocity dispersion get smoothed to at least this minimum dispersion. This value increases as k gets larger.
3.3.3. Scaling factors
The clustering analysis of comoving populations in position and velocity occurs in a combined positional and kinematic phase space. Distance relationships among stars are needed to express densities and build a graph from the input data. Since tangential velocities are measured in km s^{−1} and galactic coordinates in pc, both subspaces have different ranges. Significant range discrepancies between dimensions influence the clustering process as it directly impacts the distance function. Individual 1D distance contributions along feature axes with narrow ranges can be ignored when features with large standard deviations are present. Hence, we consider scaling factors between positional and kinematic feature subspaces.
Scaling factors, c_{i}, put weight on specific subspaces to increase or decrease their importance in the clustering process. The multiplicative factor affects the range of feature axes impacting the distance function. Thus, scaling factors c_{i} > 1 increase the distance to objects in a given dimension i, increasing their importance in the process. We applied the same scaling c_{v} to both tangential velocity axes while leaving the positional axes unchanged with c_{x} = 1. SigMA is applied to the following set of dimensions, 𝒟:
$$\begin{array}{c}\hfill \mathcal{D}=\{X,Y,Z,{c}_{\mathrm{v}}\times {v}_{\alpha ,\mathrm{LSR}},{c}_{\mathrm{v}}\times {v}_{\delta ,\mathrm{LSR}}\}.\end{array}$$(9)
Theoretical considerations of the scaling relationship c_{x}/c_{v} depend on various initial cloud and cluster configurations and interactions. However, the estimation of these influences is plagued by substantial uncertainties. Instead, we aim to determine a suitable scaling factor empirically by considering successful past extractions. Since the tangential velocity is inversely proportional to parallax, we aim to extract a relationship between a stellar cluster’s distance and its scaling factor.
The ScoCen association is at a distance of about 100–200 pc from us. To model the empirical distancescaling relationship and subsequently apply it to ScoCen, we used data on stellar clusters within 500 pc. CantatGaudin & Anders (2020) have compiled a list of open clusters in the Milky Way disk. However, using a single cluster census can introduce a bias in the resulting scaling factor as only a single member selection function was used to obtain the sample. Thus, we substitute and add clusters covered by Gagné et al. (2018b), who have used a multivariate Bayesian model to identify members of young associations within 150 pc^{7}.
The scaling fraction should account for the distance differences between positional and kinematic subspaces. To quantify this idea, we consider the distance distribution of sources to the cluster’s center in each subspace. Specifically, we compare the median absolute deviation of sources from their centers in position and velocity space, providing a robust statistic for statistical dispersion. We refer to this ratio of observed dispersion in the respective subspaces as the x–v dispersion ratio. To estimate the uncertainties in the x–v dispersion ratio, we perform bootstrap sampling with 100 resamples for each cluster and compute the mean and standard deviation of computed dispersion ratios across the ensemble.
The dispersion of cluster members in a given feature provides a measure of the scale of stellar populations in that dimension. Since the x–v dispersion ratio is not one (see Fig. 4), we have to prevent an unequal emphasis of one subspace against the other. To compensate for the bias toward positional axes during clustering, the velocity features must be scaled by a factor c_{v} equal to the observed x–v dispersion ratio.
Fig. 4. Empirical distancescaling relationship using data from Gagné et al. (2018b) and CantatGaudin & Anders (2020). The xaxis represents the distance to stellar clusters; the yaxis shows the dispersion ratio of positional over kinematic subspaces. This dispersion directly corresponds to the velocity scaling factor, c_{v}, as discussed in Sect. 3.3.3. The gray line represents the best fit linear regression line, while the gray band indicates the 1σ highest density interval obtained from sampling the posterior predictive distribution. 
Figure 4 shows the relation between a cluster’s distance and its x–v dispersion ratio (alongside determined uncertainties), which equivalently is our choice of c_{v}. We identify an approximately linear trend and fit a Bayesian linear model to the data (see Appendix B.3 for a detailed description of the model choice and fitting procedure).
Using this empirical mean model, we find mean suitable scaling factors, c_{v}, between approximately 5–10, assuming the clusters of ScoCen are at a distance of about 100–200 pc. Without the LSR correction, these values translate to a range of 4–7, which are comparable to the correction factors by Kerr et al. (2021, using v_{α}, v_{δ}) who used the values 5 and 6 in their clustering approach (see Appendix C for further discussion on these two velocity spaces). At first glance, the model suggests sampling values in the range of 5–10 or using the mean of 7.5. However, we also observe a significant scatter around the model that we need to consider. Instead of a single mean scaling factor, we aim to obtain a distribution of values from a given range of distances to the clusters we aim to find.
As discussed in Appendix B.3, possible scaling factors can be expressed by the conditional probability integrated over a range of distance values. Given the linear model and associated model uncertainties obtained from the posterior predictive distribution, we find a resulting distribution of scaling factors within distances of 100–200 pc. Keeping the number as small as possible is essential since we must perform a separate clustering run for each sample we draw from the distribution. We generate ten samples, which try to cover the sample space while keeping the underlying probability distribution in mind. The resulting samples can be seen in Fig. B.4^{8}.
We run the clustering pipeline for each scaling fraction sample, creating an ensemble of 10 clustering solutions. By summarizing the (potentially conflicting) results, we obtain a single consensus clustering solution. The consensus result is more robust against noisy data by aggregating multiple clustering solutions. This aggregation technique creates a metasolution that usually provides better accuracy than any single clustering result can (Strehl & Ghosh 2002; VegaPons & RuizShulcloper 2011).
A consensus function aims to produce a result that shares as much information as possible with individual clustering results among the ensemble. In particular, we are interested in robust cluster solutions that exist through multiple velocity scales while ignoring unstable solutions where clusters randomly break apart or merge into others. We outline our consensus clustering approach in Appendix B.5 where Fig. B.6 shows a schematic of the proposed pipeline.
3.4. Measurement uncertainties and multiple hypothesis testing
Rigorous integration of measurement uncertainties into the modality testing procedure of Burman & Polonik (2009) is a highly complex task, primarily due to the heteroscedastic nature of the uncertainties. Instead, we used a Monte Carlo approach that attempts to approximate the sensitivity of the modality structure under statistical uncertainty. We did this by resampling the data N times using a Gaussian distribution centered on each point with an appropriate covariance matrix obtained from Gaia data. The resampling procedure creates N “merge” or “do not merge” decisions at each saddle point location. In the following Sect. 3.4.1, we discuss methods for combining the Npvalues into a single merge decision. In Sect. 3.4.2 we aim to reduce the chance of falsely rejecting the null hypothesis and thus making incorrect do not merge decisions, the likelihood of which increases as the number of hypothesis tests (i.e., saddle points) grows.
3.4.1. Single merge decision: Combining multiple pvalues
Recomputing the modal structure on each resampled data set individually is computationally expensive. Therefore, we aim to study the effect of deviations on the initially computed modal layout instead. Since every merge decision impacts the final modal structure, we must evaluate the impact of uncertainty locally at each saddle point. While looping through all saddle points, we reevaluate the hypothesis test for each resampled modal and saddle point density. However, testing each hypothesis multiple times increases the likelihood of rejecting an individual null hypothesis. Instead of focusing on single tests, we need to combine these individual tests and simultaneously test the global null hypothesis that no pvalue is significant. As a result, a global hypothesis test can “borrow” information from the other test statistics to gain significance.
A popular method of computing the global pvalue is Fisher’s method (Fisher 1934). Fisher’s method assumes that individual pvalues are uniformly distributed in the interval [0,1]. Consequently, the negative logarithm follows an exponential distribution: −log p_{i} ∼ Exp(1). The test statistic, t, then becomes the sum of the negative logsum of npvalues, which follows a χ^{2} distribution with 2n degrees of freedom:
$$\begin{array}{c}\hfill t=2{\displaystyle \sum _{i=1}^{n}\phantom{\rule{0.333333em}{0ex}}\text{log}\phantom{\rule{0.166667em}{0ex}}{p}_{i}\sim {\chi}^{2}(2n).}\end{array}$$(10)
Fisher’s method is especially attractive for densely packed cluster agglomerates such as ScoCen. If clusters have only marginally different velocities and positions, our pointwise test might produce a pvalue slightly larger than the rejection threshold. In such cases finding the precise saddle point position is challenging, leading to a type II error. Although no single test (or very few) can reject the null hypothesis, many small effects in Eq. (10) enable us to reject the global null hypothesis. However, Fisher’s method assumes statistical independence between individual tests. Since the resampled data sets are not independent, this assumption is somewhat violated (for a more detailed discussion, see Appendix B.6).
Instead, we employed the Cauchy combination test (CCT; Liu & Xie 2020), which is similar to Fisher’s method in the sense that it is also able to combine multiple individual pvalues that aggregate multiple small effects. Compared to Fisher’s method, the authors show that CCT is still powerful under arbitrary dependence structures among pvalues. The test statistic t – which is asymptotically standard Cauchy distributed – is defined in the following:
$$\begin{array}{c}\hfill t={\displaystyle \sum _{i=1}^{n}{w}_{i}\phantom{\rule{3.33333pt}{0ex}}\phantom{\rule{0.333333em}{0ex}}\text{tan}[(0.5{p}_{i})\pi ].}\end{array}$$(11)
The weights w_{i} must sum to one and can reflect the power of respective hypothesis tests. Since all tests are performed equally, we distribute the weights evenly by choosing w_{i} = 1/n.
In practice, we computed distances to the kth neighbor of each initial modal candidate and corresponding saddle points across resampled data sets. The number of resampled data sets limits the proposed procedure, as data generation is costly. Thus, we restricted the number of samples to n = 50, rejecting the global null hypothesis at a 5% significance level, hence t < 0.05.
3.4.2. Multiple merge decisions: Reducing false discoveries
The number of random spurious modes and associated saddle points is proportional to the data set size. For large data sets, such as the Gaia data, we obtain hundreds of modal candidates, which we prune in a second pass (see Sect. 3.2.2). As the number of modal candidates grows, the chance of falsely rejecting the null hypothesis (a false do not merge decision) increases.
To conceptualize the probability of making one or more type I errors when performing multiple hypothesis tests across a data set, the concepts of familywise error rate (FWER) and false discovery rate (FDR) have been introduced. FWER aims to control the probability of at least one false positive in the sample, while FDR aims to control the expected proportion of false positives among all positives. Tests from both families involve modifying the significance threshold of statistical tests. FWER correction typically involves more conservative adjustments, such as the Bonferroni correction (Bonferroni 1936), that increase the threshold for statistical significance, while FDR correction typically involves less stringent adjustments. As the number of tests grows, FDR correction provides greater statistical power at the cost of an increased number of type I errors (Shaffer 1995).
Since we aim to identify a small number of significant results among a large number of statistical tests instead of aiming to minimize the total risk of any false positives, tests from the FDR family are preferred over controlling the FWER. The BenjaminiHochberg (BH; Benjamini & Hochberg 1995) method is a widely used statistical method for controlling the FDR in multiple hypothesis testing. The BH procedure guarantees that the FDR is controlled at the desired significance level α, assuming that the null hypotheses are independent or positively correlated. Since individual tests are performed on spatially disconnected saddlepoint and mode pairs, we did not assume any correlation structure among individual tests on a given data set. Thus, once we obtain revised pvalues for each saddlepoint – mode pair through resampling (see Sect. 3.4.1), we applied the BH procedure to adjust the pvalues globally and ensure FDR control.
3.5. Noise removal
Following the procedures described above, we obtain a data set segmentation into prominent peaks by iteratively merging modal regions separated by insignificant dips in density. This segmentation yields a list of nonoverlapping areas in the data set without a noise characterization in mind. In principle, each modal region contains a dense core and background population corresponding to the stellar clusters and field content. This section aims to remove the field star component from the modal region to obtain a final clustering result.
In the following, we discuss the noise removal pipeline schematically highlighted in Fig. 5. The noise removal scheme is based on a densitybased member selection technique, which we motivate in Sect. 3.5.1. The pipeline is roughly split into two main parts. In the first part, we aim to assign a radial velocity to each source to transform the data into 6D Galactic Cartesian coordinates (XYZUVW). We determine the cluster’s 3D space motion to estimate missing radial velocity information, discussed in Sect. 3.5.2. In the second part, we describe the automated cluster member selection using socalled clusternoise classifiers (see Sect. 3.5.3). Finally, we discuss contamination and completeness estimates of our member selection procedure in Sect. 3.5.4.
Fig. 5. Noise removal pipeline separating cluster members from field stars. The pipeline is split into two main parts. In the first part, we aim to assign a radial velocity to each source to transform the data into Galactic Cartesian coordinates. In the second step, we used the transformed data to fit several clusternoise classifiers to separate the signal from the background. Blue represents data products at various steps, green denotes processes, and red shows decisions. For more details, see Sect. 3.5. 
3.5.1. Densitybased member selection
Identifying signal and background sources can be formulated using a mixture model approach in which cluster and field star populations are modeled directly in phase space. However, due to complex cluster shapes found in the literature (see Sect. 3.1.2), we cannot create a generative model of the data set at hand. Instead, we return to the densitybased formalism of clustering, where we treat clusters as an enhancement of density over the background. To select cluster members as overdensities in phase space, we reduce the 5D phase space information to the univariate density information. A single density threshold in this univariate space corresponds to an isosurface in the original phase space.
We aim to describe the univariate density distribution as a mixture model to automatically obtain a suitable isosurface threshold to separate the signal from the background. This model should be able to capture the pointwise density distribution of field stars and cluster members. Before fitting a mixture model to data, we must define the number of mixture components and distributions we used.
A mixture of two components in a first approximation seems plausible as the algorithm divides the input space into regions containing a signal and background component. By design, each region consists of a singledensity peak in phase space. This distributional condition forces the field star component to lie locally around the cluster while exhibiting no extra peaks. To concur with unimodality, the background distribution is restricted to uniformity or exactly one density peak that coincides with the signal mode. As the former is more likely, we assume the field component to be approximately uniform in phase space around a cluster. Uniform distributions in Ndimensional feature spaces translate to a single Gaussian in the univariate density space.
The distribution of cluster star densities is harder to model. Cluster members are commonly modeled as multivariate Gaussians (e.g., Gagné et al. 2014; Sarro et al. 2014; Crundall et al. 2019; Riedel et al. 2017). As discussed in Appendix B.7, given a kNN density estimator, a multivariate Gaussian in phase space approaches a Gaussian distribution in univariate density space as the dimensionality grows. However, observational findings point to more complex morphologies (e.g., Meingast & Alves 2019; Röser et al. 2019; Meingast et al. 2019; Kounkel & Covey 2019; CantatGaudin et al. 2019a; Jerabkova et al. 2021; Wang & Ge 2021; Coronado et al. 2022) and significant mass is contained in the lowdensity region outside the cluster core (Meingast et al. 2021). As discussed in Sect. 3.1.2, we lack critical information to formulate a generative model for the signal distribution in phase space and, consequently, in univariate density space. Instead of explicitly building a univariate signal model, we employed multiple Gaussian mixture components to describe the pointwise kNN density distribution. Thus, we did not restrict the number of Gaussian components during the fitting procedure to provide flexibility to capture more complex distributions.
To decide on a proper density threshold ρ_{0}, we determine the number of mixture components by minimizing the BIC. The background is automatically identified as the Gaussian component with a low relative mean (i.e., lower pointwise densities), small variance (uniform background component has less relative variance around its mean density than the signal), and large weight (the number of field stars exceeds cluster members by about 100:1). This procedure can be seen in Fig. 6 where we show an example of two Gaussians fitted to the univariate density data (denoted by ρ) of one modal region.
Fig. 6. Noise reduction schematic. We fit the observed univariate density distribution, ρ, with a mixture of two Gaussians that model the cluster (red line) and field star (black line) population, respectively. We obtained an approximation to the field star contamination and incompleteness rate in the cluster sample by considering the clusternoise classifier’s confusion matrix entries, particularly false positives, false negatives, and true positives. 
3.5.2. Bulk velocity estimation
The Gaussianity assumption of density components is appropriate only in the original Cartesian coordinate system. Densities computed from tangential velocities suffer from perspective effects, leading to deviations from normality due to the nonlinearity of projections onto the celestial sphere. We find such distortions empirically when analyzing distributions of various modal regions in projected 2D (see the tangential velocity space in Fig. C.1) compared to Cartesian 3D velocities. This effect is reduced by correcting the observed tangential velocities for the Sun’s motion, yielding motions relative to the LSR (see Appendix C). However, very nearby clusters, which cover large areas in the sky, are still affected by the observer’s point of view from Earth. We move to the Galactic Cartesian coordinate system to eliminate these observational effects. Thus, we transform the data into the 6D space (XYZUVW) to facilitate efficient signal and background models.
A transformation from proper motion space to a 3D Cartesian velocity space is only possible if radial velocity information is available. However, the majority of radial velocity measurements of sources are missing (∼62% in our box, 80% if sources with σ_{vr} > 2 km s^{−1} are removed). Nevertheless, we can exploit the comoving property of stellar populations. We aim to adopt a similar strategy to Meingast et al. (2021), inspired by convergence point ideas (e.g., van Leeuwen 2009). The expected radial velocity value can be determined when the 3D bulk motion of stars alongside their positions is known. However, some clusters lack enough statistics to compute their bulk motion. Before determining individual radial velocities of member stars, we have to estimate the space motion of individual populations. In the following, we describe bulk motion estimation, which provides a mean to estimate an optimal radial velocity. We summarize the process in the first part of Fig. 5.
We determined the space motion, $\stackrel{\mathbf{\sim}}{\mathit{v}}$, of individual populations of size n by minimizing the following loss function, henceforth called membership loss:
$$\begin{array}{cc}\hfill L(\stackrel{\mathbf{\sim}}{\mathit{v}})& =\sum _{i=1}^{n}(\frac{\mathrm{\Delta}{v}_{\alpha ,i}^{2}}{{\sigma}_{{v}_{\alpha ,i}}^{2}}+\frac{\mathrm{\Delta}{v}_{\delta ,i}^{2}}{{\sigma}_{{v}_{\delta ,i}}^{2}}+\frac{\mathrm{\Delta}{v}_{r,i}^{2}}{{\sigma}_{{v}_{r,i}}^{2}})\hfill \end{array}$$(12)
$$\begin{array}{cc}\hfill \mathrm{\Delta}{v}_{x,i}& ={v}_{x,i}^{\phantom{\rule{0.333333em}{0ex}}\text{obs.}}{\stackrel{\sim}{v}}_{x,i}.\hfill \end{array}$$(13)
The minimization is done over the tangential (v_{α}, v_{δ}) and radial (v_{r}) velocities^{9}.
The delta terms in the membership loss describe the offset between observed and computed values at the specified velocity $\stackrel{\mathbf{\sim}}{\mathit{v}}$. Although we introduce an additional observational error via the parallax uncertainty, we chose the tangential velocities to match the unit of radial velocities, the essential component in the sum in Eq. (12). Each term in the sum is weighted by its respective uncertainty, which decreases the influence of observations with large measurement errors on the membership loss. If all observations lack radial velocities, the last term is set to zero; if only a subset of v_{r}’s is missing, their values are imputed with the average of its complement.
For a perfectly comoving population, the membership loss has a global minimum with a value of 0 at the cluster motion. Observational uncertainties, contamination from field stars, and a nonzero velocity dispersion will increase the minimum value accordingly. To search the 3D bulk motion that minimizes the membership loss, we used the quasiNewton Broyden, Fletcher, Goldfarb, and Shanno method (Nocedal & Wright 1999) with an initial guess of the mean 3D velocity^{10}. We denote the velocity, which minimizes the membership loss (Eq. (12)), as the optimal bulk motion (v_{OBM}).
To determine the cluster motion of the comoving population via our minimization approach, finding v_{OBM} needs a large and pure selection of cluster sources, meaning truly comoving stars. We attempt to obtain a relatively clean sample of cluster stars via the aforementioned mixture model approach (see Fig. 6, and “Likely cluster members” in Fig. 5). By fitting a mixture of univariate Gaussians to the density distribution of a modal region, we get a classifier that separates the cluster from field stars^{11}. Since the input density is 1D, the classifier, also called a clusternoise classifier, becomes a simple threshold classifier.
Sources with a density greater than the threshold ρ_{0} are likely cluster members. As the classifier is trained on densities determined in the 5D space, which experiences projection distortion, we only use the 80% most dense stars in the cluster sample to determine v_{OBM}. This density filter is designed to remove likely field star contaminants (false positives), which are typically expected to be less dense than cluster members. Figure 6 shows an example of the contamination estimation.
The optimal bulk motion v_{OBM} is used to infer an “ideal” radial velocity. The ideal radial velocity minimizes the Euclidean distance between v_{OBM} and the velocity vector constrained by the measured proper motions. We refer to the computed 3D space motion, which is a combination of measured proper motions and the inferred radial velocity, as the minimally different velocity (v_{MDV}). On the one hand, genuine cluster members should receive an estimated space velocity close to their true motion (assuming low intracluster velocity dispersion of a few km s^{−1}). Field stars (if not an interloper in phase space) show incompatible observations with the comoving population and are, on average, assigned a different space velocity. Together with sparseness in positional space, field stars consequently show lower densities in phase space.
We infer v_{MDV} for sources without v_{r} measurements as well as for sources with large uncertainties of σ_{vr} > 2 km s^{−1}. After this step, all sources have an associated radial velocity, either measured or inferred. We provide these inferred radial velocities (${\widehat{v}}_{\mathrm{r}}$) in our final catalog^{12}. Thus, compared to the clustering step of the SigMA pipeline removing the field background explicitly requires radial velocity estimates for all sources (regardless of uncertainty). In Appendix B.4 we estimate the necessary fraction of sources with radial velocity measurements without losing accuracy in ${\widehat{v}}_{\mathrm{r}}$. Using two simulated data sets, we find that SigMA requires at least 5% of radial velocity measurements in the input catalog to keep errors to a minimum. Due to the need for some v_{r} measurements, we refer to the SigMA pipeline as operating in 5.5D.
The bulk velocity estimate correlates directly with observed proper motions and radial velocities. Therefore, systematic errors, as in the case of binary or multiple stellar systems, introduce corrupted measurements that potentially bias the final result. However, directly flagging binary stars and removing them from the inference process does not significantly alter the results as only a tiny fraction (0.05%) of Gaia sources are identified as multiples (Gaia Collaboration 2023b). Since the work by Gaia Collaboration (2023b) does not represent the entire binary population, we investigated another indicator for multiples in the Gaia catalog. For example, the renormalized unit weight error parameter (Lindegren et al. 2018, 2021) can be used as a discriminator, which measures how well the astrometric solution is fitted to a single star model, as also discussed in Penoyre et al. (2022a,b). These authors also show that binaries, which have been observed with longer time baselines (e.g., comparing HIPPARCOS, DR2, and DR3), could still deliver parallaxes and proper motions that are close to the true values (see also Kervella et al. 2022). As a consequence, binaries can still be selected as truepositive members of a cluster if selected with 5D Gaia astrometry, as can be seen, for example, by the clear binary sequences in HertzsprungRussell diagrams (HRDs; e.g., Meingast et al. 2021). However, even in these cases, the multiple systems do not comprise a significant fraction of the cluster selection. Consequently, we assume that binaries contribute only marginally to the bulk velocity computation.
We applied the presented method to clusters in the ScoCen OB association (see Sect. 5) to validate the bulk velocity and 3D velocity estimation procedure. During inference, we randomly remove 95% of radial velocities to facilitate a comparison with observed values. The absolute Δv_{r} and relative errors δv_{r} to Gaia measurements with σ_{vr} < 2 km s^{−1} are shown in Fig. 7. The absolute error is defined as the difference between the estimated radial velocity, ${\widehat{v}}_{\mathrm{r}}$, and the observed value, v_{r}:
$$\begin{array}{c}\hfill \mathrm{\Delta}{v}_{\mathrm{r}}={\widehat{v}}_{\mathrm{r}}{v}_{\mathrm{r}}.\end{array}$$(14)
Fig. 7. Absolute and relative error of inferred radial velocities compared to observed values in Gaia DR3 with radial velocity errors below 2 km s^{−1}. We randomly removed 95% of available radial velocities during inference to facilitate this comparison. Only inferred values where the Gaia observable has been removed are shown. We highlight the 1σ and 2σ percentiles and find that the majority (68%) of absolute errors are within ±2.35 km s^{−1} and have relative errors below 0.55. 
The relative error expresses the magnitude of the absolute error compared with its measured magnitude:
$$\begin{array}{c}\hfill \delta {v}_{\mathrm{r}}=\left\frac{{\widehat{v}}_{\mathrm{r}}{v}_{\mathrm{r}}}{{v}_{\mathrm{r}}}\right.\end{array}$$(15)
We find that 68% of sources (1σ) have absolute errors of less than ±2.35 km s^{−1} and 95% (2σ) of absolute errors are within ±5.66 km s^{−1}. Thus, the average error is close to the large statistical uncertainties (2 km s^{−1}) in the sample, constituting an approximation for the lower bound for the mean estimation error. The majority (1σ) of relative errors are below 0.55. Thus, inferred radial velocities are in good agreement with observations, validating our method and highlighting its robustness to (not yet fully understood) binary effects and contamination.
The following steps use the space velocity information to determine cluster membership. We prefilter unlikely members via a kinematic selection before applying the clusternoise classifier (see Sect. 3.5.3) to a complete 6D phase space, including the computed ${\widehat{v}}_{\mathrm{r}}$ estimates. The prefilter removes possible contaminant stars that have vastly different 3D motion, namely sources that differ by more than 10 km s^{−1} from the determined bulk motion; hence, v_{MDV} − v_{OBM} < 10 km s^{−1}.
3.5.3. Removing field star contamination
Figure 5 shows the noise removal pipeline, which consists of two main parts. In the previous section, we discussed the first part in which we impute missing radial velocities by assuming the presence of a single dense comoving population in the input data. In the following, we discuss the second part, in which we aim to fit clusternoise classifiers to remove the field star content in the combined space of heliocentric Galactic Cartesian position and determined v_{MDV}’s.
Combining positional and kinematic spaces (here XYZUVW) directly puts an emphasis on one of the subspaces (position or velocity) due to different value ranges (see Sect. 3.3.3 for more details). Large axis ranges automatically dominate the extraction as distances along these dimensions are penalized, more drastically impacting the density estimation. Instead of selecting a single scaling factor, we chose multiple plausible scaling factors and compute a univariate density distribution ρ for each one. We obtain scaling factors c_{1}, …c_{N} by repeating the procedure discussed in Sect. 3.3.3 in Galactic Cartesian phase space (XYZUVW), setting N to 10 (see forking into N subprocesses in Step II in Fig. 5).
We separate the stellar population from the field star component using a clusternoise classifier (the classifier is motivated and discussed in detail in Sect. 3.5.2). This classifier is applied to the 1D density estimation ρ determined in 6D phase space, using measured and estimated radial velocities (see the xaxis in Fig. 6). This thresholding method results in a global isosurface selection that is independent of positional information of sources in the original feature space (see Fig. 1). To reduce the contamination of random field star components, we employed the βskeleton as a localityaware neighborhood graph from which we delete vertices that fall below the computed density threshold ρ_{0}, as shown in Fig. 6. More details on this graphbased approach are provided in the related work discussion in Appendix B.1.
Field stars account for the majority of sources in the given samples. Thus, the number of vertices that fall below the density threshold makes up most of the graph. Removing them disconnects the graph and splits it into multiple connected components. We define sources within the densest (and typically the largest) connected component as cluster members. To extract cluster members more robustly, we computed one extraction for a range of scaling parameters (see Sect. 3.3.3). We obtain a final cluster catalog by removing unlikely members that appear in less than half of the N extractions when using different scaling factors, c_{i} (see Fig. 5).
3.5.4. Contamination and completeness estimate
The clusternoise classifier is a discriminative model, whose conditional densities (or mixture components) describe the cluster and field star distributions. In combination with the decision threshold ρ_{0}, we can internally compute estimates for the field star contamination fraction f_{cont} and the incompleteness fraction f_{inc} for each cluster sample.
In the fitting procedure, the number of mixture components k is a free parameter determined by minimizing the BIC. Thus, the discriminative model can have a different number k of Gaussian components for each cluster. To formalize a consistent definition across a different number of components, we introduce the following notations. We denote the set variables specifying the identity of the mixture component by Z; its k components are then Z = {z_{1}, …, z_{k}}. Individual Gaussian components can then be formulated as densities conditioned on the mixture component:
$$\begin{array}{c}\hfill p(x\mid {z}_{i})=\mathcal{N}({\mu}_{i},{\sigma}_{i}).\end{array}$$(16)
The mixture density can then be written as
$$\begin{array}{c}\hfill p(x)={\displaystyle \sum _{z\in Z}}p(z)p(x\mid z),\end{array}$$(17)
where p(z) is the prior probability of the mixture component z, also called mixture weight. The k mixture weights must add up to one.
The subset of components that describe the distribution of cluster members is denoted with S. It encompasses all mixture components whose mean (or expected value) exceeds the threshold ρ_{0}, and thus S = {z ∈ Z : 𝔼[p(x ∣ z)] > ρ_{0}}. The relative complement of S with respect to Z then contains all components describing the field star distribution; we denote this set as B, defined by B = Z \ S.
Finally, with this formulation in mind, we can express both the incompleteness fraction f_{inc} and contamination fraction f_{cont}. The incompleteness fraction, as shown in Fig. 6, is the probability of observing a sample from the cluster distribution with a value less than ρ_{0}:
$$\begin{array}{c}\hfill {f}_{\mathrm{inc}}=\frac{{\sum}_{s\in S}{\int}_{\infty}^{{\rho}_{0}}p(s)p(x\mid s)\phantom{\rule{0.166667em}{0ex}}\mathrm{d}x}{{\sum}_{s\in S}p(s)}.\end{array}$$(18)
The contamination fraction is defined as the fraction of false positive samples in the overall cluster sample. Thus, f_{cont} can be expressed as the probability of observing a sample from the field star distribution among all samples with a value larger than the threshold ρ_{0}:
$$\begin{array}{c}\hfill {f}_{\mathrm{cont}}=\frac{{\sum}_{b\in B}{\int}_{{\rho}_{0}}^{\infty}p(b)p(x\mid b)\phantom{\rule{0.166667em}{0ex}}\mathrm{d}x}{{\sum}_{z\in Z}{\int}_{{\rho}_{0}}^{\infty}p(z)p(x\mid z)\phantom{\rule{0.166667em}{0ex}}\mathrm{d}x}.\end{array}$$(19)
Both f_{inc} and f_{cont} are schematically shown in Fig. 6 for the example of a twocomponent mixture.
The contamination and incompleteness are computed for each cluster. We obtain a mean contamination estimate across all clusters in ScoCen of 5.3% with a standard deviation of 3.1% across clusters. This value agrees well with photometric contamination estimates via the Gaia HRD, as described in Sect. 5 and Appendix D.1. Although we find good agreement, we do not completely trust the stated values due to a lack of knowledge of systematic uncertainties.
The major source of systematic uncertainty is a deviation from Gaussianity of any of the mixture components. Especially, in the case of more than two mixture components, we expect that these internal estimations have increased error rates. By departing from the paradigm of “one mixture component per signal and background” we increase the accuracy of the model (and ideally the obtained cluster members) at the cost of direct model interpretability (and all of its consequences). Further uncertainty is added via the density estimation to which the mixture model is fit. Since we do not have access to the true underlying density f, we inevitably make mistakes by substituting it with our estimate $\widehat{f}$.
We find a mean completeness across clusters of approximately 89.2% with a standard deviation of 8.3%. Similarly to the contamination fraction, determining the incompleteness depends on the mixture components and density approximation. Still, compared to the contamination fraction, the incompleteness estimate is relatively high. A caveat of our noise reduction procedure is that we reduce highdimensional phase space information into a univariate variable that is used to filter the data. This univariate formulation lacks descriptions of local positional and kinematic relationships that might help increase the completeness of our catalog. Further, we estimate the actual value even lower, as we find multiple connected components in the neighborhood graph from which we only extract the main component. We also only admit stars that pass a threshold of 50% across different scaling fractions. All these decisions increase the precision of our sample at the cost of a reduced recall. Additionally, we evaluate the estimated completeness fraction by comparing our sample to past extractions in the literature in Sect. 5.2. These comparisons suggest a sample completeness of about 90% (e.g., when compared to Damiani et al. 2019, Luhman 2022, or Schmitt et al. 2022), which agrees with our estimate. However, these surveys can also not be considered complete. On the contrary, a direct comparison shows (see Sect. 5.2) that other applications on ScoCen are missing sources that SigMA is able to uncover.
Instead of comparing estimated values to past extractions, we aim to evaluate the accuracy of internal contamination and completeness estimates of SigMA in Sect. 4.2. Using simulations, we find that on average the contamination from field stars can be approximated quite well using our univariate mixture model approach. However, especially in dense cluster environments, our approach seems to overestimate completeness. If a large portion of the cluster exists in the lowdensity environment outside the cluster core the density approach fails to adequately capture the true number of missing cluster members (see Sect. 4.2.2 for a detailed discussion).
3.6. Multiscale clustering
The density field is the main parameter of the proposed clustering method. Its topology is affected by the estimation process, which impacts the final result. Especially the smoothing parameter can create, on the one hand, a very rough and, on the other hand, an oversimplified density field. The schematic Fig. 3 illustrates the dependence of the cluster number on the density estimation process. Applying a smoothing operator generates a family of density fields, called a scale space (Witkin 1987). We used this scalespace concept to study the dependence of extracted clusters on density estimation. Clusters with a long lifetime in the scale space are preferred over, for example, “shortlived” children.
We approximate the scale space by running SigMAN times obtained by progressively smoothing the initial density field. Given an ensemble of N density estimates ${\{\widehat{f}\}}_{i},i\in [0,N]$, we track clusters through various density filters. To track clusters through different levels of scale space, we used three cluster connection rules based on cluster modes, which we approximate by the densest point in a modal region. The connections we define are the following: direct link, merge, and split.
A direct link connection denotes a connection between two modal regions whose Jaccard similarity is larger than 50% and both cluster modes lie in the intersection set. A merge connection is a weaker condition and is only placed if no direct link can be established. A merge link is made when a parent cluster^{13} contains the cluster mode of its child. If both conditions for direct and merge links are not satisfied, a split connection is placed between a parent and child cluster if the child contains the cluster mode from its parent.
The emergence of critical points, or additional clusters, in smoother versions of the scale space, is a result of the inexact nature of our density estimation (Reininghaus et al. 2011; Lifshitz & Pizer 1990) as well as due to randomness introduced by our Monte Carlo strategy. In the absence of noise, smoother density filters result in a simplified topology. Thus, we applied the pruning strategy introduced by Reininghaus et al. (2011) to the resulting mergesplit graph, which generates a simplified merge tree. The merge tree for our running toy data set is schematically illustrated in Fig. 3.
We extract stable components from the resulting merge tree via a consensus clustering approach, discussed in Sect. 3.3.3. In total, we find 60 stable clusters in the search box, while 37 are discussed in more detail in Sect. 5 as being part of the ScoCen association. We find the 23 remaining clusters to be unrelated to the young ScoCen association (see Sect. 5). Often these appear as incomplete (or truncated) cluster extractions; in particular, the shape and position of the majority of these clusters suggest that they extend beyond the defined search box.
The consensus approach also lets us characterize how often individual sources appear throughout the cluster ensemble^{14}. We report this value as “stability” in our cluster catalog. The stability criterion can be used as an effective measure to remove spurious sources from the catalog. Figure 8 highlights the effect of the stability criterion on sources when empirically estimating the contamination from older sources (hence likely unrelated sources) via an HRD. The xaxis shows a given stability criterion where we filter sources with stability > x. The yaxis shows the empirical contamination estimate, determined via positions of filtered (older) sources in the HRD. The fraction of sources to the left of the 25 Myr isochrone is used to estimate the false positive rate while sources to the right account for true positives (see further details in Sect. 5 and Appendix D). Based on this result, we recommend a quality criterion of stability > 11%.
Fig. 8. Stability versus estimated contamination rate. The contamination estimate was determined via source positions in the HRD relative to the 25 Myr isochrone as discussed in Sect. 5, selecting potential contaminants from older populations. We identify a sharp drop in contamination for low stability values that levels off at around 11%. 
However, due to the densitybased nature of SigMA, the stability criterion is strongly correlated with density. Especially clusters with minor density enhancement over the field background are shortlived in scale space. Thus, although SigMA detects them clearly, some clusters contain members with overall disproportionately small stability values causing them to fall out of the sample for relatively low stability values (e.g., the cluster OphNorthFar; see Sect. 5.1.7). Therefore, we do not recommend generally using the stability quality criterion, but to investigate its behavior per cluster, to get potentially cleaner cluster samples.
3.7. Removing spurious cluster solutions
Unstable cluster solutions are automatically filtered out in our scale space approach in Sect. 3.6. However, the distribution of sources in the Milky Way in phase space is far from uniform. SigMA’s job is to separate the input data space into unimodal regions. This segmentation does not distinguish between compact, clustered overdensities and longrange, lowdensity modes inherent to the Milky Way distribution. We aim to remove the latter from our cluster sample.
We assume a natural clustering of “real” and “spurious” clusters in 1D density space. That means if the density of all members across the N extracted clusters are plotted we expect a bimodal distribution. The modes of these distributions then correspond to the real and spurious clusters.
To classify each cluster into any one category, we employed an iterative approach that starts off by assigning all clusters to the category real. Subsequently, we loop through all clusters sorted by their median member density (computed in 5D phase space) in ascending order. In each iteration i, with i ∈ {1, …, N}, the first i clusters (i clusters with lowest median density) are classified as spurious. At each step, we track the separation and compactness in the 1D density space of individual members across both groups of clusters. The more compact and well separated the members of both groupings are – measured by the CalińskiHarabasz score (Caliński & Harabasz 1974) – the more the classification at step i agrees with our bimodal assumption.
We used the classification at step i, which maximizes the CalińskiHarabasz score to characterize each cluster as real or spurious. This classification is directly applied to each SigMA clustering solution, hence before obtaining a consensus result across scaling factors and scale space (see last process in SigMA core in Fig. 9).
Fig. 9. Schematic illustration of the SigMA pipeline. The pipeline consists of two main parts, the SigMA core and two consensus clustering steps. For a detailed explanation, see the main text of Sect. 3.8 and the references therein. 
3.8. The SigMA pipeline
The proposed clustering method SigMA has many components that require sensible choices in order to work together properly. In Sects. 3.2–3.7 we justify and discuss the parameter choices that yield the final analysis pipeline. Figure 9 shows an overview of the full pipeline. It consists of two main parts, the SigMA core, and two consensus clustering steps. In the following, we briefly summarize how these individual components come together.
The SigMA core outputs a clustering result for a given density level i and velocity scale c_{j}. It iterates through the following steps: First, a kNN density estimation is computed (see Sect. 3.3.2). Second, a gradientbased hillclimbing step is performed that produces individual modes and saddle point locations. The modal candidates (or clusters) at this point oversegment the data set (see Sect. 3.2.2). Third, at each saddle point, a modality test is applied that determines whether two neighboring modal candidates should be merged or not (see Sect. 3.2.1). Fourth, the field star background is removed from each modal region (see Sect. 3.5). Fifth, spurious clusters are removed from the extraction (see Sect. 3.7).
To guarantee stable results against velocity scaling factors (see Sect. 3.3.3) and density estimation (see Sect. 3.6), we employed a consensus clustering approach, discussed in Sect. 3.3.3. From the cluster ensemble, we can extract a stability value for each source in our final catalog.
4. Validation
We followed a twopronged approach to verify the proposed clustering method SigMA. First, in Sect. 4.1 we validate our clustering technique qualitatively in a case study on the ScoCen OB association. We describe the results and comparisons to other studies in Sect. 5. Second, in Sect. 4.2, the algorithm is validated quantitatively on simulated data, where we compare SigMA to other established clustering methods used to identify comoving clusters.
4.1. Validation using astrophysical knowledge
Two direct observables that can be identified in our application on ScoCen (Sect. 5) serve as a validation test of the method. First, and apart from the youngest clusters that are affected by dust extinction, the Gaia colorabsolutemagnitude diagrams (CMDs; equivalent to observational HRDs) for the stars in each cluster show a narrow (coeval) distribution (see the followup work in Ratzenböck et al. 2023). There is no procedural reason why this should be the case, the method does not know about the brightness and colors of the stars. Only a meaningful selection of comoving stellar siblings can produce the observed narrow sequences in the HRDs.
Another observable that serves as a test is the prominence of massive stars associated in 2D projection with the SigMA identified clusters, while they are often located at a central position within the concerned clusters (e.g., α Sco, β Sco, δ Sco, and ν Sco; see Sect. 5 and Tables 3–5). Nearby massive stars are often too bright to have reliable measurements in the Gaia archive, and they are also often part of multiple stellar systems, further reducing the quality of the astrometry. The brightest are not even in Gaia (like Antares, Ohnaka et al. 2013), while most have been observed by HIPPARCOS (Perryman et al. 1997). Still, the method finds clusters around many of the massive stars, particularly in the UpperSco region. Based on HIPPARCOS astrometry (see Sect. 5), we find strong evidence that many of these bright stars share similar parallaxes and proper motions as the clusters they seem to belong to in projection. This is an astrophysically relevant result (massive stars do not form alone and are often found at central positions), and it serves as another direct validation of the method.
4.2. Validation using simulations
To objectively investigate the effectiveness of new clustering algorithms, synthetic data with known ground truth information facilitates the comparison to other clustering techniques. Since SigMA is tuned to astrophysical phase space data, we want to test its efficacy on simulations that approximate observational data as closely as possible. Therefore, simulated data should replicate the data model, content, volume, uncertainties, and selection effects of Gaia data as closely as possible.
To our knowledge, the Gaia Early Data Release 3 (EDR3) mock catalog by Rybizki et al. (2020) best meets these criteria. In particular, the catalog contains a large, realistic open cluster sample with internal rotation and corresponding uncertainties. Although the cluster structure of the open cluster sample differs from ScoCen, access to ground truth data and realistic Gaia selection effects and content provides a firm validation basis. To test SigMA’s ability to separate clusters in dense cluster environments, commonly found in young OB associations and starforming regions such as ScoCen (Sect. 5) or Orion (e.g., Chen et al. 2020), we aim to create a derived data set from the original mock catalog that mimics these conditions.
We applied the same error cuts to mock and real data to increase comparability between our qualitative and quantitative tests, as described in Eq. (2). We emphasize that the following results are, thus, conditioned on the given quality criteria. Hence, the reported classifier performances and empirically determined contamination and completeness scores should always be understood in the context of these quality criteria. Therefore, future use of SigMA should aim to meet the same criteria or, in case of modifications, follow the validation protocol discussed in Sects. 4.2.1 and 4.2.2 for the new filter setup.
In the following sections we describe the comparison of results in detail, particularly the data used, the algorithms against which we compare SigMA, and the validation results.
4.2.1. Open cluster sample
The Gaia EDR3 mock catalog (Rybizki et al. 2020) is extensive. As it aims to reproduce Gaia data realistically, it contains simulated measurements on over 1.5 billion sources and over 1000 open clusters compiled from the catalogs of Kharchenko et al. (2013) and CantatGaudin et al. (2018b). We need to reduce the data size to a manageable subset to validate SigMA and compare different clustering algorithms. Therefore, we limit the data to a range of 200 pc around the Sun, which yields uncertainty characteristics similar to our ScoCen box sample. Applying the quality criteria Eq. (2) results in the final test catalog size of 2 682 883 samples, of which 18 682 are part of 12 open clusters^{15}. The mock catalog does not contain the full 5D astrometric uncertainty covariance matrix that SigMA uses. We substituted missing values with real measurements from Gaia DR3. Each mock sample was randomly paired with a source within our ScoCen box (see Sect. 2), whose values it adopts.
This simulation allows our proposed analysis method to be tested for accuracy. Importantly, we must highlight its capacity to identify clusters in positional and kinematic data in contrast to established analysis methods. We limit our comparison to the two relevant clustering methods (DBSCAN, HDBSCAN) as listed in Table 1, which are considered among the most promising candidates for stellar cluster analysis in a metastudy by Hunt & Reffert (2021).
Overview of the clustering algorithms that we test with synthetic data (simulated cluster samples) and compare to SigMA.
Since SigMA’s parameters are tuned to deal specifically with Gaia data, we employed a grid search to find suitable parametrizations for each of the three clustering algorithms. This strategy measures the peak performance these clustering methods can achieve. A comparison against the best performance results allows for a discussion of methodological advantages and disadvantages rather than reflecting poor parameter selection. A detailed discussion on the parameter search is provided in Appendix B.8.
We report the performance of the best model across our search to facilitate a fair comparison. The performance itself is measured using the following clustering validation metrics: the normalized mutual information (NMI) score (Strehl & Ghosh 2002), adjusted mutual information (AMI; Vinh et al. 2010), and adjusted rand index (ARI; Hubert & Arabie 1985). We also report on classification metrics that are easier to interpret, such as true positive rate or recall, precision, accuracy (henceforth denoted as ACC), balanced accuracy (BACC; Brodersen et al. 2010), and the Matthews correlation coefficient (MCC; Matthews 1975). In addition, we report the total number of identified clusters, N_{tot}, as well as the number of nonnoise clusters, N_{cluster}, that are found to coincide with a toy cluster instead of field members. Similar to precision and recall, we also report contamination and completeness. In contrast to precision and recall, we computed the average cluster contamination and completeness only for clusters that coincide with a true cluster (i.e., for the N_{cluster} identified nonnoise clusters). Thus, these measures are not influenced by large false positives. When N_{cluster} = N_{tot} the completeness is exactly equal to recall and contamination becomes 1 – precision. The resulting numbers are summarized in Table 2.
Test results on simulated cluster samples.
Only a fraction of sources, less than 1%, are located in clusters. Hence, many of the aboveproposed validation metrics will report high values as long as most field stars are clustered in the same group. We remove correctly identified field stars before computing the validation metrics to prevent reporting on artificially inflated clustering scores. By removing this “true negative” component without removing field stars labeled as cluster members (false positives) and cluster members identified as field components (false negatives), the reported scores are a conservative estimate of the algorithms’ actual performances.
The results are summarized in Table 2. All algorithms can recover the 12 clusters within the data set. We find that the performance of SigMA and DBSCAN are essentially equal – with relatively high evaluation scores – while outperforming HDBSCAN. We find that HDBSCAN (within the parameters we searched) identifies fewer members while also identifying two false positives (i.e., two large extra clusters that entirely contain field stars).
The access to ground truth data also allows us to test internal measurements of contamination and completeness estimates. We find a true mean contamination rate of 2.6 ± 0.7% across the twelve identified clusters. SigMA’s internal estimation is slightly lower than that at a mean contamination rate of 1.1 ± 0.4%. We find an even better agreement between true and estimated completeness values. The true mean completeness rate is 98.3 ± 0.7%, almost identical to SigMA’s internal estimation of 98.4 ± 0.2%.
Although the true contamination value is outside the 1σ confidence interval, the estimated value is still very close to the true one in absolute terms. In the open cluster sample, the internal measurements provide a surprisingly good approximation given that we have not explicitly modeled signal and background in univariate density distribution but assumed a simple mixture of Gaussians.
The high reported accuracy across all clustering methods highlights the nature of open clusters. They appear as salient overdensities in phase space, making their detection fairly easy. This situation contrasts with the complex structure that constitutes ScoCen. Distinguishing more densely packed clusters from each other is a nontrivial task. SigMA was created with the intention of an interpretable cluster definition, which is put to the test, especially in such environments. Therefore, we aim to create a test data set reproducing densely packed piles to put our analysis tool through its paces.
4.2.2. Tightly packed cluster environment
To our knowledge, there is no realistic (replicating data model, content, volume, uncertainties, and selection effects of Gaia data) simulation of ScoCenlike, densely packed associations that can be used to validate SigMA in densely packed cluster environments. Moreover, there are no similar (or any) starforming regions where a consensus has been reached on the number of true clusters along with their members. Hence, we created a derivative toy data set from the EDR3 mock catalog, which simulates groupings in tightly packed arrangements. We refer to this newly generated mock catalog as the “compact cluster sample”.
The biggest unknown in creating this sample concerns the cluster details. In particular, their number, location, extent, and respective size. However, the application to ScoCen has already produced a cluster sample that can be considered when generating toy data, as it provides a candidate set of these quantities.
Two conflicting objectives pose challenges for sampling with known cluster sizes. On the one hand, the cluster sample should avoid strong correlations with results on ScoCen (for a discussion of results, see Sect. 5). We want to avoid reproducing previous results as it would favor the SigMA clustering objective over other alternative formulations and inhibit an objective comparison across methods. On the other hand, the compact cluster sample should aim to represent reality faithfully. Since no ground truth exists either on stellar clusters or in the form of dedicated simulations that reproduce such dense cluster structures, we anchor our simulations on reality by considering our ScoCen extraction and other literature results. To balance overconfidence in the given results with an accurate description of reality, we perturb the obtained cluster sizes to avoid an unduly high correlation with SigMA and literature results.
In the following we describe how we use these general cluster details as a starting point to generate the compact cluster sample: (1) number of sources, (2a) mean position (in heliocentric Galactic Cartesian coordinate frame, XYZ), (2b) mean space velocity (in the heliocentric Galactic Cartesian coordinate frame, UVW), mean statistical dispersion of objects in the cluster in (3a) positional and (3b) velocity space.
To introduce small and mediumsized deviation from the selected sample, we treat the number of sources and statistical dispersion as a normal distribution centered on the measured value with a relative variance of 25%. We employed a different strategy to sample new cluster means in position and velocity. Typically, neighboring cluster centroids in the combined positional and velocity space are way within the relative variance of 25%. Thus, updated centroid positions would commonly lie outside the original cluster boundary, drastically interfering with the initial cluster distribution. Instead, we sample centroid positions from a 50% subset of each cluster, introducing variations that guarantee to retain the overall structure.
After sampling a set of cluster quantities, we pair each of the 37 extracted ScoCen clusters (see Sect. 5) with an open cluster from the mock EDR3 catalog. We used 15 open clusters within 250 pc from the Sun. This selection provides access to a slightly more diverse cluster sample showing measurement uncertainties similar to the initial ScoCen clusters. As different cluster sizes show distinct morphological features – a small cluster can typically not be reproduced by downsampling a large one to its size – we aim to pair clusters based on member size. A given ScoCen cluster, c_{s}, with n_{s} sources has the following probability of being paired with one of the N synthetic mock clusters, c_{k}, with n_{k} sources, where k ∈ [1, …, N]:
$$\begin{array}{c}\hfill p({c}_{k}{c}_{\mathrm{s}})=\frac{{({n}_{\mathrm{s}}{n}_{k})}^{2}}{{\sum}_{i}{({n}_{\mathrm{s}}{n}_{i})}^{2}}.\end{array}$$(20)
Thus, on average ScoCen clusters are paired with similarsized clusters from the mock catalog while maintaining a nonzero probability of being paired with more, unlike clusters. To better understand variations in the number of clusters across random instances, we maintain a fixed cluster count of 37.
For each cluster, the mock counterpart is scaled to the randomly sampled dispersion in position and velocity space (separately), randomly downsampled to its corresponding (randomly determined) size, and positioned at the corresponding mean in 6D phase space. Subsequently, the synthetic clusters are embedded into the remaining field distribution. Finally, we project the space velocities to the tangential velocity plane, compute right ascension (α, deg), declination (δ, deg), and parallax (ϖ, mas), randomly remove about 62% of radial velocity measurements and apply the coordinate and quality criteria^{16} from Eq. (1)–(2) to reflect the clustering conditions of ScoCen, as described in Sect. 2.
To evaluate and compare SigMA’s clustering performance to alternative algorithms, we generate 10 compact cluster samples and report mean performance scores across these realizations. The results are summarized in Table 2. Compared to results on the open cluster sample, SigMA shows a significantly higher score than competing algorithms, achieving only half of SigMA’s performance on average. The performance of DBSCAN and HDBSCAN on the compact cluster sample is approximately similar. Compared to DBSCAN, we find that topperforming HDBSCAN runs again falsely identify clusters of field stars as clusters in the data set. On average, HDBSCAN finds as many false positives as true positives. SigMA, on the other hand, can, on average, identify about twice as many subgroups as other algorithms while keeping the relative number of false positives low.
Although we can highlight SigMA’s performance in these compact cluster agglomerates compared to DBSCAN and HDBSCAN, the performance values are drastically worse than in the open cluster sample. We can partially attribute the poor performance to the tough clustering challenge created by the randomized process. By randomly perturbing mean cluster positions, neighboring clusters easily merge, decreasing the maximally achievable performance. We also find that some clusters (on average, two to three populations) can no longer be identified as their density is indistinguishable from the field. Further, the clusters’ extent in 5D is scaled to approximate ScoCen deviations, which reliably reproduces the cluster core. However, in some cases, a considerable part of the cluster extends far (up to over 100 pc) beyond traditional ScoCen boundaries. The density of these stars compared to the field and their distance to the cluster core makes them impossible to detect with the three tested algorithms. Although a very tough clustering challenge, it still provides a good test bed for algorithms applied to ScoCenlike cluster environments.
Besides the clustering performance, we again compare true to estimated contamination and completeness estimates. We find a true mean contamination rate of 23.7 ± 13.1% across the, on average, 24 identified clusters. SigMA’s internal estimation on the compact cluster sample is significantly lower with a mean contamination rate of 6.8 ± 3.4% (although ∼7 times higher than in the open cluster sample). As discussed above, merging nearby clusters into a single indistinguishable cluster drastically increases the contamination, a factor that SigMA cannot account for. In contrast, SigMA can only control the contamination of lowdensity field stars and not crosscontamination (the major contributor) from other clusters. When we ignore crosscontamination from other clusters and focus purely on contamination from field stars, the true contamination fraction becomes 8.2 ± 4.1%, close to the internal value of 6.8 ± 3.4%. SigMA’s internal estimate on the real Gaia DR3 data is about 5.3 ± 3.1% (see Sect. 3.5.4), which is well within 1σ uncertainties of both estimated and true values determined from mock data.
The true mean completeness rate is 76.2 ± 15.2% while SigMA’s internal estimation is 89.1 ± 2.0%. Ignoring crosscontamination between clusters, the total fraction of cluster members SigMA is able to pick up is 81.4 ± 1.2. Although the mean total completeness is only slightly above the average per cluster completeness rate and does not match internal predictions, the result is relatively stable across different resampled data sets. On the compact cluster sample SigMA (and to an even greater extent, DBSCAN and HDBSCAN) cannot find the large source fraction far outside the central cluster region. This fraction is possibly exaggerated, considering the young nature of ScoCen sources (because we used opencluster analogs to build the compact cluster catalog). SigMA’s internal completeness estimate on real Gaia DR3 data toward ScoCen is approximated with 89.2 ± 8.3% (Sect. 3.5.4), which is slightly overestimating completeness when compared to the true value of the simulated data, while still comparable within the uncertainties.
The substantial agreement of SigMA’s internal metrics with true values in the open cluster sample is in stark contrast to the large discrepancy between estimated and true values in the compact cluster sample^{17}. While the contamination estimate yields satisfying results if only field star contamination is considered, the completeness estimate likely systematically underestimates the lowdensity cluster component of stellar clusters. Thus, internal contamination and completeness estimates should only be used as rough first approximations of the stellar content of detected clusters. To obtain a better understanding, especially of the completeness fraction, we call to consider additional membership analysis tools such as UPMASK (KroneMartins & Moitinho 2014), BANYAN (Gagné et al. 2018a), or Uncover (Ratzenböck et al. 2020).
5. Application to ScoCen
We applied SigMA to Gaia DR3 data inside a box of about 10^{7} pc^{3} containing the ScoCen OB association, as defined in Sect. 2. The box was chosen to include the classical Blaauw definition of ScoCen, including the classical subgroups UpperScorpius (US), UpperCentaurusLupus (UCL), and LowerCentaurusCrux (LCC), and to go beyond them and include the molecular cloud complexes of Pipe, Corona Australis (CrA), Chameleon (Cham), and three stellar clusters to the Galactic northeast of ScoCen, which we put in the separate Northeast group (NE). Some of these regions were tentatively associated with ScoCen in the past (e.g., Lépine & Sartori 2003; Sartori et al. 2003; Preibisch & Mamajek 2008; Bouy & Alves 2015; Kerr et al. 2021).
In this paper, we discuss the SigMA extracted young stellar clusters in ScoCen, which are part of the ≲20 Myr ScoCen star formation event (Pecaut et al. 2012), and their connection to previous work. In a followup study (Ratzenböck et al. 2023) we discuss in more detail the ages of the individual SigMA clusters and the star formation history of the ScoCen complex.
In total SigMA extracts 60 clusters inside the defined search box. Of these, 23 clusters are older populations with ages > 20 Myr, or which are kinematically unrelated. These older clusters include for example, the wellstudied IC 2602 (∼30 Myr; e.g., Randich et al. 1995; Stauffer et al. 1997; Dobbie et al. 2010; Damiani et al. 2019; Meingast et al. 2021), or the Hyades, β Pictoris, Platais 8, Platais 9, Platais 10, IC 2391, Alessi 9, Alessi 13, TucanaHorologium, ComaBerenices, VolansCarina, or NGC 2451A (e.g., Riedel et al. 2017; Gagné et al. 2018b,c; Gagné & Faherty 2018; Sim et al. 2019; Fürnkranz et al. 2019; CantatGaudin & Anders 2020; Meingast et al. 2021; Kerr et al. 2021; Galli et al. 2021a; He et al. 2022). These clusters generally occupy distinct velocity spaces, different from the bulk motion of ScoCen. Moreover, the majority of these clusters are truncated by the borders of our defined box; hence, they are incomplete, which is of no consequence to this study. In this work, we focus solely on the young ScoCen complex (1) to get a more complete picture of the substructure of this important nearby association, (2) to evaluate the differences to previous studies on ScoCen (Sect. 5.2), and (3) to highlight the capability of SigMA to untangle distinct clusters in a dense environment containing overlapping populations in space, which is especially true for young stellar associations like ScoCen. The 23 older or unrelated clusters are not discussed further here, although they might be related, or not, to ScoCen at larger scales (e.g., “blue streams”; Bouy & Alves 2015). We will discuss these older clusters in future work.
We find that 37 stellar clusters are associated spatially and kinematically with the ScoCen OB association, containing in total 13 103 stellar cluster members, which will be discussed in more detail in this section. Figures 10 and 11 show the distribution of the 37 ScoCen SigMA clusters projected in Galactic coordinates. Figure 12 shows the distribution of the clusters in 3D space in a heliocentric Galactic Cartesian coordinate frame (see also the interactive 3D version^{18} online and Figs. E.1–E.5 for a better appreciation of individual clusters). The 37 clusters seem to form the continuous body of the ScoCen association, beyond Blaauw’s original three subgroups’ boundaries.
Fig. 10. Distribution of the 37 SigMA clusters in ScoCen, projected in Galactic coordinates. Traditionally, the ScoCen OB association is separated into US, UCL, and LCC, marked with gray dashed lines. The clusters extracted with SigMA reveal a more complex substructure of ScoCen than initially proposed by Blaauw (1946), and they show a more extended spatial distribution that includes the CrA, Pipe, and Cham regions and additional stellar clusters toward the northeast (NE). The clusters are ordered in the legend by region, as given in Table 3. See the interactive 2D version online or Fig. 11 for a separate view of each cluster. For a better visualization of the clusters’ distribution, see the interactive 3D version online (Fig. 12). 
Fig. 11. Distribution of SigMA clusters in ScoCen, projected in Galactic coordinates, stratified by cluster membership. Compared to Fig. 10, the small multiples highlight the distribution of individual clusters in the ScoCen complex. The color coding represents the seven regions: US (orange), UCL (blue), LCC (red), Pipe (green), CrA (magenta), Cham (cyan), and NE (yellow). 
Fig. 12. 3D distribution of the 37 SigMA ScoCen clusters in heliocentric Galactic Cartesian coordinates. The Sun is at (0,0,0). Colors and labels are as in Fig. 10. See also the interactive 3D version online and Figs. E.1–E.5, which allow a better appreciation of individual cluster properties. By doubleclicking on a cluster in the legend of the interactive version, the selected cluster can be isolated; by hovering over data points, the cluster membership and observed l, b, d position of a source becomes visible. 
Overview of the 37 SigMA clusters in ScoCen, assigned to seven subregions (Col. 2).
Figure 13 shows the location of the SigMA clusters in the tangential velocity plane as observed from the Sun (v_{α}/v_{δ}) and also relative to the LSR (v_{α, LSR}/v_{δ, LSR}). Since the clusters partially occupy similar velocity spaces in the velocity planes, we also provide online an interactive 2D version of this figure, allowing a better appreciation of 2D kinematical properties of the clusters in ScoCen (see also Figs. E.1–E.5). The 37 young clusters all fall on a connected looplike pattern in tangential velocity space (Fig. 13, left panel), a pattern largely created by the reflex motion of the Sun. This is highlighted in Fig. C.1, showing that these projected motions are expected for clusters at ScoCen positions and distances. To avoid this pattern caused by the Sun’s motion, we additionally transform the tangential velocities to velocities relative to the LSR, using the standard solar motion from Schönrich et al. (2010; see Sect. 2). This is shown in Fig. 13 (right panel), where we can see that the clusters now occupy a more compact velocity space. In particular larger clusters, which are stretched over larger areas in the sky, show a smaller velocity dispersion after the LSR conversion (see Appendix C). The 2D kinematical properties of individual clusters can be better appreciated when investigating the online interactive 2D version of the figure, where both velocity spaces can be compared directly.
Fig. 13. Tangential velocity distribution of the 37 SigMA clusters. Colors and labels are as in Fig. 10. Left: Observed tangential velocities along α and δ are strongly influenced by the Sun’s reflex motion, while stellar clusters at similar distances and with similar space motions are arranged in a looplike pattern. Sources at l ∼ 0° are located in the lowerright part of the figure, and sources at l ∼ 290° in the upperleft part of the figure (see Fig. C.1). Right: Tangential velocities corrected for the Sun’s motion, and hence relative to the LSR. The correction reduces the projection effects of the observed tangential stellar motions. See the interactive 2D version online and Figs. E.1–E.5 for a better appreciation of the 2D kinematical properties of the clusters in ScoCen. 
The ScoCen association, as extracted with SigMA, reaches well below the Galactic plane, as was indicated by previous works (e.g., Kerr et al. 2021) and is now further confirmed here. This includes regions not traditionally associated with ScoCen, like Pipe, CrA, Cham, and clusters toward the Galactic northeast (NE), including a cluster connected to the L134/L183 clouds. Moreover, other wellknown stellar clusters, traditionally not assigned to ScoCen but later suggested to be associated with it, were picked up by SigMA, like ϵ Cham and η Cham (e.g., Mamajek et al. 1999, 2000; Fernández et al. 2008), which are added here to the ScoCen complex.
The relatively young β Pictoris stellar cluster (β Pic; e.g., Fernández et al. 2008; Crundall et al. 2019; MiretRoig et al. 2020, age ∼18–20 Myr) was also picked up by SigMA in the selection box. As mentioned above, we decided not to include this young local association as part of our final sample of 37 stellar clusters. The SigMA extraction of β Pic covers only one side of the known population as defined in MiretRoig et al. (2020). This is likely due to the larger extent of β Pic in the sky (partially outside of our box boundaries) and due to the relatively close distance to the Sun (average distance of about 40 pc), which makes it more difficult to extract members from the 5D (or 5.5D) phase space as used by SigMA in this work as the stars are distributed across the sky as seen from Earth (the Solar system is located inside some of these nearby young associations).
The majority of the 37 clusters can be related to previously identified clusters from the literature, which are often larger scale structures containing several of the SigMA clusters (see comparisons to the literature in Sect. 5.2). The rich substructure identified by SigMA also includes clusters with no clear counterpart in previous works. We decided to name such clusters after their location in the sky (based on constellation) or after the brightest star that is seen in projection to a cluster and where we feel confident that it is part of a cluster (see Sect. 5.1 and Tables 3 and 5). We often find bright Bstars toward cluster centers at approximately the same distance and proper motion, in itself a validation of the SigMA algorithm, as many of these bright stars are not in Gaia (but only in HIPPARCOS). We used HIPPARCOS astrometry (van Leeuwen 2007) to tentatively associate bright Bstars to the new clusters and list them and their astrometric properties in Table 5, showing the HIP ID and the HIPPARCOS astrometry. This table allows a direct comparison with the average properties of the SigMA clusters in Tables 3–4. For the cases where there is a reasonable match, we name the cluster with the name of the bright Bstar^{19}. In some cases, we name the clusters after their location in constellations. Additionally, we index the stellar clusters within this work from 1 to 37 as given in Col. “SigMA” in Table 3.
Median velocity parameters for the 37 SigMA clusters in ScoCen.
Figure 14 shows the SigMA cluster members in a Gaia CMD (similar to the HRD), confirming the youth of most sources. In Appendix D.1, we give more details on the chosen photometric quality criteria and the selection conditions to estimate the contamination from older populations or field stars. We find an excess of older lowmass sources that visibly separate from the ScoCen population, potentially false positive ScoCen members. We used a 25 Myr isochrone (to allow for random scatter) to separate “younger” ScoCen members from “older” populations or field stars as shown in Fig. 14 (middle panel). This gives a rough estimate for a contamination fraction of about 4–10%, depending on the photometric quality criteria (see details in Appendix D.1 and Table D.1). This contamination fraction is similar to the estimate in Sect. 3.5.4. The influence of the stability that SigMA assigns each cluster member can be seen in Fig. 8, where we show how different stability cuts influence the fraction of older sources, as estimated in Fig. 14. The trend in Fig. 8 suggests that a cut at about 11% would give a cleaner cluster membership selection and a lower contamination fraction (Appendix D.1) since such a cut overproportionally influences the older sources. However, a stability cut would also deliver an overall less complete sample.
Fig. 14. Gaia CMD using M_{G} versus G_{BP} − G_{RP} showing the SigMAselected ScoCen members. Left: SigMA cluster members that pass the photometric quality criteria as given in Eq. (D.2). Middle: Potential contamination from older sources (orange), selected with a 25 Myr isochrone from PARSEC (black line) and Baraffe et al. (2015, dashed black line) and an additional cut at M_{G} > 3 mag (dasheddotted black line), which excludes the UMS. The combined conditions indicate contamination from older sources of about 6–7% when using the given photometric quality criteria and no stability cut. Right: Substellar candidates (red dots) selected with a 0.08 M_{⊙} isomass line from Baraffe et al. (2015, dark red line) using only the younger source from the middle panel. This cut indicates that roughly 19% of substellar sources are within the SigMA ScoCen members when the mentioned cuts and photometric quality criteria are applied and extinction effects are ignored. More details on the quality criteria, the selection borders, and the isochrone models are given in Appendix D. 
Figure 14 (right panel) shows that there could be up to 19% of substellar candidates in the SigMA ScoCen sample, selected with an isomass line at 0.08 M_{⊙} from Baraffe et al. (2015, hereafter BHAC15; see Appendix D.2). In the future, more complete samples of the individual clusters can be obtained by using the known members as training sets (e.g., using Uncover, Ratzenböck et al. 2020). Knowing the brown dwarf population will allow the construction of more complete initial mass functions beyond the hydrogen burning limit and a better characterization of the mass of the individual clusters (e.g., MiretRoig et al. 2022b).
5.1. Overview of the seven subregions in ScoCen
In the following, to help compare SigMA results with the literature, we give a brief overview for each subregion within ScoCen (US, UCL, LCC, Pipe, CrA, Cham, and NE). We then give a moreetailed comparison to recent works in Sect. 5.2. The listed seven subregions include four regions that are not a traditional part of the ScoCen OB association, namely CrA, Pipe, Cham, and NE clusters, while we find them to be part of ScoCen because they are comoving with the complex itself, and are likely within the 20 Myr age cut we used for the association. Even if we assign each stellar cluster to one of the seven subregions, we stress that this classification should not be seen as physically distinct regions inside ScoCen, but simply to help compare our results with the literature.
5.1.1. Upper Scorpius (US)
Toward US we identify nine clusters containing in total 3596 stellar sources, which are partially extending beyond the traditional borders (Fig. 10). Of these nine clusters, seven appear higher surface density and tend to be associated with prominent Bstars, as already pointed out above, namely ρ Oph/L1688, β Sco, δ Sco, ν Sco, σ Sco, Antares, and ρ Sco (see Tables 3–5). The remaining two clusters appear more extended, which we name USforeground and ScorpioBody.
HIPPARCOS astrometry from van Leeuwen (2007) of bright stellar members in ScoCen.
The clusters ρ Oph/L1688, Antares, and ρ Sco show significant overlap in the same volume in space, while separating in velocity space. In a recent paper (Grasser et al. 2021) we studied the ρ Oph/L1688 cluster with Gaia EDR3 data and identified two kinematically distinct populations within the same volume (Pop 1 and Pop 2). These two populations coincide with the ρ Oph/L1688 and Antares clusters, respectively. In detail, the crossmatched Pop 1 sample contains ∼93% of the ρ Oph/L1688 group and few matches with other clusters (Antares, σ Sco, β Sco, δ Sco). The crossmatched Pop 2 sample contains ∼75% of the Antares group and few matches with other clusters (ρ Sco, σ Sco, USforeground). Luhman (2022) point out that “new” ρ Oph/L1688 members in Grasser et al. (2021) have already been identified previously by other literature as being part of US. We clarify here that the new sources in Grasser et al. (2021) refer to sources not previously assigned as members of the young ρ Oph/L1688 starforming event. The two intertwining distinct populations (both containing new sources) within the same volume have been first studied in detail in Grasser et al. (2021). In this work, we identify another stellar population, ρ Sco, which also seems to occupy a similar volume in space, partially overlapping with the two populations while having distinct velocities from these.
The group USforeground is located in front of the more compact clusters, visible in 3D space (Fig. 12), hence the chosen name. Finally, the ScorpioBody group extends from US toward the Galactic South, beyond the traditional borders of US, with a significant fraction located in UCL and in the direction of CrA (Sect. 5.1.5). It spans the Scorpius constellation’s central body, hence the name. The nine clusters toward US reveal a complex star formation history, which will be further discussed in future work.
5.1.2. Upper Centaurus Lupus (UCL)
We identify rich substructure within UCL separated into 11 SigMA clusters (5935 stellar sources), as listed in Table 3. The most prominent cluster in the region is V1062 Sco (Röser et al. 2018), lying toward the far side of ScoCen. This cluster was picked up easily by visual selection methods (e.g., by Damiani et al. 2019 or Luhman 2022; see Sects. 5.2.1, 5.2.4). We identify a second cluster close to V1062 Sco, which we call μ Sco, since its members are scattered around the bright Bstar * mu01 Sco. We find that the positions and velocities of the two SigMA clusters are very similar, and members of both clusters are part of V1062Scoselections in previous work (Sect. 5.2), also named UPK 640 in CantatGaudin & Anders (2020). The star * mu01 Sco, which is the namegiver of μ Sco lies in the center of the cluster, while the star * mu02 Sco is part of the SigMAselected members for V1062 Sco, located at the periphery of this cluster. This suggests a possible connection between the two clusters, but this statement is tentative.
Lupus 1–4 appears correlated with regions of high dust columndensity, matching with previous selections of Lupus–3 and 4 stellar members (e.g., Damiani et al. 2019; Kerr et al. 2021), which are merged in our SigMA selection. The average distance to Lupus 1–4 matches well with cloud distance estimates from Zucker et al. (2021, derived from Leike et al. 2020), who report a distance between 155–198 pc, or an average of about 165 pc for the Lupus 1–4 clouds. Similar distances have been reported in Teixeira et al. (2020) or Galli et al. (2020a).
At the heart of UCL lie the clusters η Lup, ϕ Lup, and e Lup, which likely belong to the oldest parts of ScoCen, probably the clusters where the first supernovae in ScoCen originated from (Zucker et al. 2022). To the north of the traditional UCL borders, we find a clustering, which has not been isolated in previous works, named LibraSouth, based on its location within that constellation.
There is one cluster slightly in front and to the south of the main UCL body, called NormaNorth, named after its location in that constellation. This is a new clustering, which does not have a clear counterpart in the literature. Another SigMA cluster lies to the far side of UCL and the Galactic west of the Lupus constellation. This cluster correlates with UPK 606 when compared to CantatGaudin & Anders (2020; see also Kerr et al. 2021 and Table E.3). Finally, to the Galactic west, UCL connects with LCC via the clusters ν Cen and ρ Lup.
5.1.3. Lower Centaurus Crux (LCC)
We find five SigMA clusters (2363 stellar sources) toward the LCC region (see Table 3), which is now reaching farther below the Galactic plane compared to earlier definitions in the literature. For the SigMA extraction, the young local associations ϵ Cham and η Cham are part of LCC, located at the Southern most tip, confirming the results of Mamajek et al. (1999, 2000) or Fernández et al. (2008). Consequently, the main body of LCC is composed of five subgroups, which seem to constitute an age gradient (e.g., Kerr et al. 2021) from north to south (σ Cen, Acrux, Muscaforeground, ϵ Cham, and η Cham), which we analyze in a followup study (Ratzenböck et al. 2023).
5.1.4. Pipe Nebula
Although not traditionally considered part of the ScoCen association, we find three SigMA clusters toward the Pipe nebula (172 stellar sources), including B59, PipeNorth, and θ Oph. The group B59 seems to be closely related to the star forming B59 cloud (e.g., Lombardi et al. 2006; Brooke et al. 2007; RománZúñiga et al. 2007, 2010). This is supported not only by projection in the sky toward cluster and cloud but also by the cloud distance between 147–163 pc (Zucker et al. 2021), compatible with the cluster distance of about 160 pc. The θ Oph cluster, surrounding the B2 star * tet Oph, is located at about the same distance to B59 and is close to the stem of the Pipe Nebula cloud, giving ground to studies of a possible interaction between the B2 star and the cloud (Gritschneder & Lin 2012). PipeNorth lies slightly in front of the other two clusters (at about 130 pc) and to the Galactic north of the Pipe Nebula, as the name suggests.
5.1.5. Corona Australis (CrA)
The possible physical connection between CrA and the ScoCen association was already pointed out in previous studies (e.g., Mamajek & Feigelson 2001; Preibisch & Mamajek 2008; Kerr et al. 2021) and confirmed by our work. We count three SigMA clusters to the CrA region, containing a total of 579 sources. We identify a distinct cluster projected on top of the CrA molecular cloud and the embedded Coronet clusters, which we call the CrAMain group. The cluster distance fits the cloud distance of about 136–179 pc (Zucker et al. 2021). To the Galactic North, we identify a second and more extended group, called CrANorth, which was already discussed in Galli et al. (2020b) or Esplin & Luhman (2022). Additionally, we identify a third group to the Galactic northwest of the two other clusters, apparently building a bridge to the main body of ScoCen. This group we name ScorpioSting since its projected location matches the sting of the Scorpio constellation. ScoSting has only one clear counterpart in the literature, namely the TLC22/EOM7 group in Kerr et al. (2021; see Sect. 5.2.2 and Table E.3), while they identify a smaller subsample of this group (12 members in Kerr et al. 2021 versus 132 members in this work).
5.1.6. Chamaeleon (Cham)
The wellknown starforming molecular clouds of Chamaeleon are seen through the same lineofsight as the southern tip of LCC but lie clearly toward the back of LCC when seen in 3D (Fig. 12). We identify two clusters likely associated with the clouds with a total of 246 stellar sources in Chamaeleon 1 & 2, which have already been characterized with Gaia (e.g., Roccatagliata et al. 2018; Galli et al. 2021b; Kerr et al. 2021, see also Sect. 5.2.2).
In addition, toward the middleeastern part of the traditional LCC borders, SigMA extracts another cluster that seems unrelated to the main body of LCC, which we name CentaurusFar (containing 99 sources), since it lies about 60 pc to the back of it, at a distance similar to that of the Chamaeleon clouds. This cluster was already identified in Kerr et al. (2021), as part of the TLC21 group (Chamgroup) as EOM3, and named CenSouth (see Sect. 5.2.2 and Table E.3). Consistent with Kerr et al. (2021), we count this cluster to the Cham subregion.
Due to their youth, position, and tangential velocities, we assume that the three Cham clusters and the two clouds are part of the ScoCen star formation event, but this must be confirmed by tracebacks of the young populations (see, e.g., Großschedl et al. 2021). Similar suggestions appear in Lépine & Sartori (2003) or Sartori et al. (2003).
5.1.7. Northeast clusters (NE)
We identify three extra clusters to the Galactic north and east of ScoCen, which we discuss separately in this section: L134/L183, Ophsoutheast, and OphNorthFar. We assigned these clusters to a separate region, which we call the Northeast clusters (NE), based on their location relative to ScoCen in Galactic coordinates, since they do not fit any other of the ScoCen subregions.
The cluster L134/L183 is a small, newly identified stellar group to the Galactic north of US (with 24 stellar members). This stellar group is likely associated with the small molecular clouds L134 and L183 (or MBM 36 and 37, Magnani et al. 1985), which are currently nonstarforming (Pagani et al. 2003, 2004, 2005). The distances to the clouds in Zucker et al. (2019) are about 105–120 pc, which matches the cluster distance of about 114 pc. The presence of the young stellar group close by the clouds suggests that (1) the clouds are remnants of a larger cloud that formed the newly identified SigMA cluster and (2) that the newly identified sources might be playing a role in the observed “cloudshine” phenomenon toward this cloud (Steinacker et al. 2010, 2015).
The cluster Ophiuchus Southeast (OphSE, 61 members) lies at a similar projected position as PipeNorth, while being at a farther distance, about 50 pc in the back (hence, we did not count it to the Pipe region). This stellar group was already selected by Kerr et al. (2021) as TLC 4 with 31 members (Sect. 5.2.2). Finally, the group OphiuchusNorthFar (OphNF, 28 members) appears to be a newly identified stellar group, located at a similar distance as OphSE. This new group needs more investigations in the future, since the stability of the selected members, as determined by the SigMA algorithm, is generally very low (stability < 11%).
5.2. Comparison with previous work
In the following we compare the SigMAselected stellar clusters with recent results from the literature (Table 6), including eight publications. The studies by Damiani et al. (2019), Schmitt et al. (2022), Luhman (2022), and Žerjal et al. (2023) discuss the whole ScoCen region, slightly extending beyond the traditional ScoCen borders while excluding the regions to the Galactic South (CrA and Cham). The first three of these studies select members within broad selection borders decided by hand, which we call in this paper visual selection methods. Squicciarini et al. (2021), MiretRoig et al. (2022a), and BriceñoMorales & Chanamé (2023) focus only on the US region and extract clusters using a combination of Gaia astrometry and radial velocities. Kerr et al. (2021) present an allsky study of young stars within 333 pc, hence covering the new extended view of the ScoCen association, using an unsupervised machine learning approach, which is more similar to our work than the aforementioned studies.
Overview of the recent literature to which we compare the SigMA ScoCen clustering results in more detail.
The literature samples are crossmatched with the SigMA clusters using the Gaia DR3 source_id, as specified in Appendix A. We provide: an overview of the discussed literature samples in Table 6, giving the total number of sources of each literature sample; the total number of sources of SigMA ScoCen cluster members within the respective studied areas, the field of view (FOV); and the number of total matches. Finally, we list the fraction of sources that we recover or reject (or miss) when compared to the individual literature samples, and the fraction of new sources when comparing the matches to the SigMA sample.
5.2.1. Comparison with Damiani et al. (2019)
Damiani et al. (2019, hereafter DPP19) analyze ScoCen using Gaia DR2 data and a traditional approach, selecting by hand overdensities in velocity and position space, followed by selecting pre–mainsequence (PMS) stars from an HRD. Their FOV goes slightly beyond the traditional borders of the association (see Table 6). They discuss eight compact clusters, which are prominently peaked in projection and in velocity space (hence, easier to identify with visual selection methods); these are UCL1, UCL2, UCL3, Lupus 3, LCC1, USfar, USnear, and the wellstudied IC 2602. Although SigMA easily detects IC 2602, we do not discuss this cluster since its age (∼30 Myr) excludes it as a part of the recent ScoCen star formation event, as mentioned above, and it also has distinctly different tangential velocities compared to the bulk motion of ScoCen (see Fig. 6 in DPP19). DPP19 also discuss four diffuse populations (D1, D2a, D2b, USD2), which are generally distributed across large parts of the traditional Blaauw ScoCen OB association. Moreover, their catalog includes sources, which have not been assigned to any group (labeled with “N” in Table E.2).
The DPP19 catalog contains in total 14 437 sources, of which 1734 are in their seven clustered ScoCen populations (350 in IC 2602), 8727 are in their four diffuse populations, and the rest 3626 have not been assigned to any population (labeled with “N”). When crossmatching the DPP19 Gaia DR2 sample with DR3 astrometry, we find that 201 stars (1.4%) are rejected when applying the distance criteria from DPP19 (d < 200 pc), due to updated parallaxes in DR3. The majority of these sources have not been assigned to any group or belong to one of the diffuse populations. When now considering only the sources in the clustered and diffuse populations within 200 pc (and without IC2602), then there are 10,425 potential ScoCen members in DPP19, or 10,421 when additionally applying the box and quality criteria from Sect. 2.
There are in total 9635 crossmatches between the SigMA clusters and DPP19, while 9328 of these (89.5% out of 10,421) belong to either the clustered or diffuse populations (307 are not assigned, “N”). Of the 9328 crossmatches, 7609 belong to one of the four diffuse populations. Comparing this number to their total diffuse population (8689 within 200 pc), we find that about 88% are a match with SigMA clusters. The fact that the majority of sources from socalled diffuse populations are now in clustered populations (at least in the statistical sense) is interesting and calls for future investigations to better understand if physically meaningful “young diffuse stellar populations” actually exist within young stellar associations like ScoCen. In most cases, more than one DPP19 group (both clustered or diffuse) fits one of our clusters, and vice versa (see Table E.2). In particular, their diffuse groups each contain subparts of about 10–20 of the SigMA clusters.
Focusing on the 1732 DPP19 sources in compact clusters, there are 1719 matches with SigMA clusters (99%) within 200 pc. The better consensus considering their compact samples highlights the higher robustness of these samples (see also Table 6). For individual samples toward UpperSco, we find that their USnear and USfar cannot be assigned clearly to only one of the SigMA clusters (see Table E.2). USnear correlates best with ρ Oph/L1688 (containing fractions of δ Sco, ν Sco, Antares, and β Sco), and USfar with σ Sco (containing fractions of Antares, β Sco, δ Sco, and ρ Sco). In particular, Antares is distributed almost equally among these two clusters. The Antares group is partially occupying the same volume as ρ Sco and in particular ρ Oph/L1688 (see Sect. 5.1.1 and Grasser et al. 2021). This highlights the capability of SigMA to untangle young populations that share the same volume but have different space motions. The rest of the DPP19 compact clusters correlate best with SigMA clusters as follows: UCL1 with V1062 Sco and μ Sco, UCL2 with UPK 606, UCL3 with ϕ Lup, LCC1 with Acrux, and Lup III with Lupus 1–4. Finally, about 9% of the SigMA cluster members correlate with unassigned sources in DPP19 (N), within their FOV and our box criteria.
Concerning the different approaches, comparing the DPP19 visual selection method and the SigMA unsupervised clustering method, we first note that the method used by DPP19 starts with a selection of stars byhand in velocity space, followed by a selection byhand of PMS stars on the HRD. Such an approach will deliver the most prominent clusters. However, somewhat less dense clusters cannot be identified easily when compared to unsupervised machine learning tools, such as SigMA, and their method is less sensitive to possible spatial and kinematical structure in the ScoCen population. For example, a look at Figs. 2, 3, and 4 in DPP19 will make clear that the total number of member candidates using this approach is a strong function of the size of the selectionshapes used in tangential velocity space and the HRD. These selection borders will select a larger number of candidates than a finetuned machine learning classifier with also a likely higher number of contaminants.
When focusing on a comparison of the total number of ScoCen members in DPP19 stellar clusters (compact and diffuse within 200 pc, 10,421 sources) to the number of matched SigMA cluster members (9328 within 200 pc; see Table 6), we find that there are 1093 sources only in DPP19, implying that we could be missing about 10% of possible members if all 10,421 sources were good members. We did not perform a detailed comparison but find that the 1093 sources also contain sources that seem to be older than the SigMA clusters when investigated in a CMD (as in Fig. 14); hence, the difference based on this comparison is likely lower than 10% (see also the comparison with Luhman 2022). As mentioned above, we expect SigMA to be missing possible candidates when compared with a method that selects broad regions in various 2D planes of the phase space, but also expect the SigMA sample to be less contaminated. Nevertheless, the SigMA sample contains in total 11,796 sources inside the DPP19 FOV, implying that, in the end, we find about 20% more clustered ScoCen members. This could partially be caused by the different data sets, DR2 versus DR3, while the different methodologies likely cause more severe disagreements. A deeper analysis is needed, although not warranted in this paper.
5.2.2. Comparison with Kerr et al. (2021)
Recently, Kerr et al. (2021, hereafter KRK21) presented a study of nearby young stellar populations within 333 pc from the Sun. They use the HDBSCAN clustering algorithm (see Appendix B.1) on Gaia DR2 parallaxes and proper motions on a preselected sample of PMS stars with ages ≲50 Myr. They identify 27 toplevel clusters (TLCs), including Chameleon as TLC 21 and the ScoCen association as TLC 22. The latter was further broken down into another 27 subgroups based on the excess of mass (EOM) method, selecting the most persistent clusters in the clustering tree. Three of these EOM subgroups (EOM 12 Lupus; EOM 17 US; and EOM 27 LCC) were further broken down into leaves, which are nodes of the clustering tree.
TLC 22 covers the main ScoCen association, and TLC 21 the Chamaeleon region. Additionally, there were crossmatches with members of the group TLC 4, which is called Ophiuchus Southeast in KRK21. These three TLC groups combined show a similar extent to our ScoCen extraction. SigMA finds in total a slightly lower number of groups toward ScoCen (37 in this work versus 45 in KRK21), while the TLC 22 subgroups in KRK21 also include older or unrelated populations (e.g., β Pic, IC 2602, Platais 8, and EOM2 & 5), which are not included in our final ScoCen sample, as outlined above. Consequently, only 39 of the KRK21 groups toward ScoCen fall within the 37 selected SigMA clusters from this work.
In Table E.3, we show an overview of the matches of SigMA groups with corresponding KRK21 groups. Overall, the SigMA ScoCen groups are more richly populated compared to the KRK21 groups. In most cases, there is at least some overlap between our groups and the TLC 22 main ScoCen group (and with TLC 21, Cham; or TLC 4, OphSE), while some of our groups also distinctly correspond to EOM subgroups (or leaves). For about 40% of the SigMA groups, clear accordance with a single EOM group (or leaves group) is not possible due to overlaps with more than one SigMA group or due to no or only insignificant overlap (see also Table E.3).
Some differences between the SigMA and KRK21 clustering results might arise from the different data input since we used Gaia DR3 and KRK21 used DR2, while this would only create minor deviations. Although both HDBSCAN and SigMA approximate the hierarchical cluster tree, we expect discrepancies in clustering results. The primary reason for this difference is the cluster tree pruning strategy discussed in Appendix B.1. The EOM heuristic prioritizes large clusters over their children when they maintain a long lifetime in the density hierarchy. The resulting children fail to exceed the parent’s EOM. Conversely, our pruning strategy does not depend on cluster lifetimes but only cares about substantial density valleys between neighboring density peaks.
The additional leaf separations in KRK21 were applied to the Lupus, US, and LCC regions (in TLC 22, EOM12,17,27) since they found that there are substructures that have not been identified by the EOM method. Some leaf clusters match quite well with SigMA clusters; in contrast, the SigMA clusters are significantly richer and mostly more extended, and in some cases, they are differently separated (see details in Table E.3). Compared to the EOM heuristic or SigMA’s multimodality considerations, leaf clusters do not come with statistical guarantees. The clustering result is highly susceptible to random density fluctuations since leaf nodes are extracted only considering the minimum cluster size criterion (Stuetzle & Nugent 2010); see Appendix B.1 for more details. Without any additional pruning strategy, which deals with spurious clusters, leaf clustering results need to be taken with a grain of salt. Nevertheless, some of the leaves in US (ρ Oph/L1688, ν Sco, δ Sco, β Sco) show good agreement with the SigMA US cluster separations, indicating the robustness of these clusters (see also Sect. 5.2.7).
When comparing all members in the TLC22 group (7394), which contains the main ScoCen association, with our SigMA extraction (12 669 members without Cham, OphNE, and OphSE) we find 6270 crossmatches in total (Table 6). Hence, 1124 (∼15% of TLC22) sources are only in TLC22, and 6399 are only in SigMA (∼50% of SigMA). We find that the KRK21 TLC22 sample contains at least 256 sources from older stellar groups, which gets apparent from their Table 6 (EOM 1–5, including β Pic, IC 2602, and Platais 8), and 456 sources of TLC22 match with sources that are in older SigMA clusters. Combined, this leaves 6895 potentially younger TLC22 ScoCen members, and hence 625 possible extra sources (∼8% of TLC22). For these extra sources, a clear separation of the younger ScoCen stellar groups as discussed in this work, and the somewhat older groups is not straightforward, since about 50% of the sources in the TLC22 group have not been assigned to a separate subcluster (EOM or leaf). The somewhat older sources can also be estimated when investigating the CMD or the velocity space. In the CMD no clear separation of older or younger sources can be identified. In tangential velocity space, there are sources that have slightly deviating motions from expected ScoCen motions or which coincide with velocity spaces of the KRK21 older EOM groups or older SigMA groups. Taking all this into account, the fraction of young TLC22only sources is likely below 8%.
The reason for these extra potential ScoCen members in the KRK21 TLC22 group is similar to the mentioned reasons above (e.g., in Sect. 5.2.1). The TLC22 group represents a cluster root, enveloping the whole ScoCen region and somewhat beyond, and no additional substructure was extracted (yet). In the following step KRK21 use the EOM and leaf methods to identify individual clusters, while in this step they lose almost 50% of the original TLC22 group, as mentioned above. Focusing only on the TLC22 members that are in one of the 22 younger EOM subgroups, we find that we recover 99.8% of these sources as ScoCen members. Finally, the TLC22 group seems to be overall more incomplete compared to the SigMA ScoCen extraction, since we find in total more members (∼42%, 12 669 versus 7394), and also somewhat different substructure. At the same time, the subclusters themselves are significantly richer compared to KRK21.
In conclusion, the comparison with KRK21 highlights the differences that can arise with different unsupervised machine learning tools. Compared to applications of HDBSCAN, we find that SigMA is able to extract similar substructure, however, with only one clustering step (no substeps like EOM or leaves are needed), while at the same time extracting significantly higher numbers of members per cluster. The modeling results in Sect. 4.2.2, where we compare the performance of DBSCAN, HDBSCAN, and SigMA, also suggest that SigMA outperforms HDBSCAN. However, a comparison of the KRK21 HDBSCAN results with the model HDBSCAN results cannot be done at face value. In Table 2 we list the performance results for the bestperforming HDBSCAN model configuration, where the model parameters were optimized in a grid search, as described in Appendix B.8. The HDBSCAN configuration used by KRK21 is not identical to these model runs; hence, we cannot directly use the performance numbers in Table 2 for a comparison of the two ScoCen cluster catalogs (SigMA versus KRK21). Moreover, we did not combine multiple results (KRK21 combines results from EOM and leaf runs) to maximize the performance. Regardless of the difficulties to compare the performances, there is sufficient evidence to conclude that both the direct comparison of the two ScoCen cluster catalogs and also the modeling runs indicate that SigMA outperforms HDBSCAN in ScoCenlike environments.
5.2.3. Comparison with Schmitt et al. (2022)
Recently, Schmitt et al. (2022, hereafter SCF22) used eROSITA^{20} (Merloni et al. 2020) to search for lowmass ScoCen members by crosscorrelating the eRASS1 source catalog with the Gaia EDR3 catalog. They discuss 6190 Xray observed sources within the traditional borders (Blaauw 1964a; de Zeeuw et al. 1999), which are ScoCen candidate members. They include sources within a distance range of 60 to 200 pc (Table 6), restricted to lowmass stars (G_{BP} − G_{RP} > 1, following Pecaut & Mamajek 2013). The 6190 sources include 40 double Gaia sources that match with two different eROSITA sources. We only discuss the 6150 single Gaia sources.
Since Xray emitting sources are expected to be young (e.g., Schmitt 1997; Neuhäuser 1997; Feigelson & Montmerle 1999; Bouvier et al. 2014), the sources detected by eROSITA in the direction of ScoCen, as discussed in SCF22, are potential members of ScoCen. They found Xray sources down to about 0.1 M_{⊙}, and, unexpectedly, they also found the existence of a population of young Xray emitting stars that appear to be more diffuse in velocity space^{21}, calling into question search approaches relying on kinematic selections.
We crossmatched the 6150 SCF22 Xrayselected sources with the SigMA selection in the same FOV (containing 11 348 SigMA sources; see Table 6). We find a total of 3385 crossmatches, while none of these belong to their velocitydiffuse population. This is expected since SigMA only selects clusters confined in positionvelocity space, which naturally excludes any such velocitydiffuse sources. SCF22 claim that the diffuse population is largely composed of young stars, only somewhat older compared to the kinematically confined ScoCen members. We confirm the general youth of the sources by inspecting the two populations in the CMD. However, we see a relatively clear age separation between the velocityclustered and velocitydiffuse populations. Xray sources that occupy similar velocity spaces as the ScoCen members have ages between 0.1–20 Myr, while Xray sources that are velocitydiffuse have ages between 10–1000 Myr, with the majority at about 30–100 Myr, when compared to PARSEC model isochrones. While these are technically young stars, they seem too old to be related to the ScoCen association.
The origin of this cospatial but velocity diffuse population remains mysterious. Since these sources are older than ScoCen, they are unlikely to result from stellar interactions in ScoCen (an a priori unlikely process given the low stellar density of ScoCen). The diffuse population, or the coeval part, could be related to a relatively old starformation episode, sharing today the volume space of ScoCen, a plausible scenario in the Milky Way (Fürnkranz et al. 2019). We posit here that the SCF22 velocitydiffuse young sources are unlikely to be part of ScoCen, but represent a mystery that needs to be solved. As SCF22 points out, the sensitivity of eROSITA will allow virtually all young ScoCen lowmass members to be detected in the near future. A combination of eROSITA future releases and Gaia data in ScoCen will be crucial to increase statistics and better understand the relation between observed Xray luminosity with distance, age, stellar masses, and the origin of the velocitydiffuse population.
When concentrating on the velocitycoherent sample in SCF22 (without IC 2602), we find that there are about 20% in the whole SCF22 sample that could be additional ScoCen candidate members. These are only in SCF22 and have similar velocities as SigMA ScoCen members. When investigating these 20% in the CMD to check possible contamination from older stars (similar to Fig. 14), this fraction would reduce to about 10%. These extra potential members might result from the broad selection conditions in SCF22, based on all Xray detected sources within the Blaauw borders in a distance range of 60 to 200 pc (Table 6). These broad conditions, which do not attempt to identify any underlying clustered structure, will naturally pick up more candidate members, although more false positives too, as discussed in Sects. 5.2.1 and 5.2.9, also because the Xray detection is not a guarantee to only pick up the youngest sources below 20 Myr.
The mean completeness rate of SigMA in an environment with densely packed clusters as estimated in Sect. 4.2.2 (obtained with mock data), is about 81%, which could explain the missed 10% to 20% of SCF22 sources. On the other hand, there is no completeness estimate given in SCF22, while this sample is likely also incomplete, in particular regarding Xray faint sources, since we find about 70% more sources in the same FOV. This is an indication that the Xray luminosity of all young sources is not in general bright enough to be picked up by eROSITA, while future data releases might increase the numbers of observed Xray ScoCen members.
5.2.4. Comparison with Luhman (2022)
Luhman (2022, hereafter L22A) recently investigated the ScoCen region using Gaia EDR3 data to identify 10 509 kinematic candidate members (see Table 6). L22A includes selections for US, UCL/LCC, V1062 Sco, Ophiuchus, and Lupus (the Southern parts of ScoCen are not discussed in L22A), and concentrates on established stellar groups in ScoCen to guide the selection. The visual selection approach of L22A is not suitable to separate the underlying kinematical substructure of the ScoCen population. For example, it is clear from Fig. 4 in L22A (bottom panel) that the UCL/LCC group contains several overdensities in l/b space, but these are not extracted or identified. The L22A selection is based on global kinematic criteria, extracting candidates exhibiting proper motions similar to expected proper motions of known members.
Crossmatching the 10 509 L22A ScoCen candidate members with the SigMA clusters gives a total of 9838 matches (93.6%), 671 L22A only sources (6.4%), and 2377 SigMA only sources within the L22A studied area (Table 6), where the SigMA sample contains 12 215 sources in total. A more detailed comparison of the SigMA clusters with the L22A subgroups^{22}, which are generally larger scale groups, shows no clear correlation between single groups. Virtually each SigMA cluster has several matches with various L22A subgroups (and vice versa). When investigating the 671 L22A only sources, we find that about 50% of these sources do not show significant signs of being older than 20 Myr, and the majority of the sources do not show significant deviating motions from SigMA ScoCen cluster velocities. These extra L22A sources, or part of them, could be ScoCen members, meaning we might be missing up to about 6% of the candidates in L22A. This is not surprising because methods based on visual selection, using broad selection borders, will naturally find more candidates, as also discussed in Sect. 5.2.1. Nevertheless, the SigMA samples contain in total more ScoCen member candidates within the same FOV.
5.2.5. Comparison with Žerjal et al. (2023)
Žerjal et al. (2023, hereafter ZIC21) present another clustering result for the ScoCen association using CHRONOSTAR, a clustering tool developed by Crundall et al. (2019). This is a Bayesian tool to kinematically decompose stellar groups using the full 6D kinematic data, also performing a kinematic age determination. They identify eight distinct kinematic components containing in total 9556 sources^{23}. The 9556 stellar members are both within dense and also diffuse stellar groups. They also include two known clusters that we excluded from the final ScoCen sample, which are IC 2602 and Platais 8 (H and I in ZIC23). Without these clusters, their sample contains 8185 stars, of which 7671 (94%) match with the SigMA groups.
The groups are C–US, E–USmulti^{24}, D–UCLV1062Sco, F–UCLV1062Sco, G–UCLEast, T–UCLWest, A–LCCNorth, and U–LCCSouth. Hence, with their method, they are splitting US, UCL, LCC, and also V1062 Sco, each into two parts. We list all matches of SigMA clusters with ZIC23 in Table E.2. Generally, it can be seen that there is significant mixing of various groups in both the ZIC23 and the SigMA groups. In particular, the ZIC23 groups encompass larger areas, often containing several or up to 20 of the SigMA groups. Concerning the groups D and F, we find that both match with V1062Sco and μ Sco, while D has a slightly higher correlation with V1062Sco and F with μ Sco.
Within their FOV also other groups exist (see Fig. 8 in ZIC23), like CrA or the Cham cloud regions, while they were not selected as kinematic members of ScoCen. We speculate that their initial sample by Gagné et al. (2018a), which includes sources from US, UCL, and LCC, limits their ability to select these additional groups. Their low signaltonoise ratio, positional distance, and slightly deviating motions from bulk ScoCen sources^{25} likely prevented their classification as ScoCen groups.
Their method fits a mixture of Gaussians to data. Instead of allowing arbitrary covariance matrices, CHRONOSTAR constrains 6D Gaussian distributions in XYZUVW to the following form. Presentday observations are assumed to follow a ballistically evolved Gaussian in Galactic potential. The freefitting parameters are a cluster’s mean birth position in phase space, birth positional and kinematic variance (the covariance matrix is assumed uncorrelated), and age. The fitting is done via a modified EM algorithm where the number of components is determined via the BIC. ZIC23 state that they see evidence for substructure in several groups. Thus, it has to be investigated if the groups found in ZIC23 will eventually break into subgroups. When compared to SigMA we can already see that the individual ZIC23 groups generally contain more than one SigMA cluster (see Table E.2).
5.2.6. Comparison with Squicciarini et al. (2021)
Squicciarini et al. (2021, hereafter SGB21) studied 2745 potential US members (see Table 6) by selecting subgroups solely based on kinematics. They cluster the US region into eight groups that they call the clustered population (1442 stars), and into one older diffuse population (1303), which is, however, differently defined than the diffuse or velocitydiffuse populations in DPP19 or SCF22.
When comparing the SGB21 selection to the SigMA clusters, we find that there are 2575 crossmatches (∼94%) in total out of the 2745 sources in SGB21. Hence, we miss 170 SGB21 US candidate members, while 13 sources are lost due to our box and quality criteria (Sect. 2). Focusing on the SGB21 candidate members in the clustered populations (1442), there are only seven sources that are only in SGB21 and not in SigMA, while we miss 163 of the SGB21 diffuse members, which are overall more uncertain members. In total, we find almost a factor of two more clustered sources with SigMA in the same FOV as SGB21, 2717 versus 1442 (see Table 6). The 2575 sources match with nine of the SigMA clusters, while only seven SigMA clusters have a significant number of matches.
We list the crossmatches of SigMA with SGB21 in Table E.4. We highlight more significant matches here. Groups 1, 2, 3, and 4 match best with ρ Oph/L1688, ν Sco, δ Sco, and β Sco, respectively, while Group 6 also has significant matches with β Sco. Group 5 matches best with σ Sco, while the majority of σ Sco is in the SGB21 diffuse population. Group 7 and Group 8 match best with Antares, while the majority of Antares is also in their diffuse population. Generally, the Antares group seems to split up into more than one cluster, also in other previous work. The two groups USforeground and ϕ Lup have only a few matches with the diffuse population. The SGB21 diffuse population is largely contained within the SigMA groups σ Sco, Antares, ρ Sco, and δ Sco, with some diffuse members distributed among each mentioned group (see Table E.4). This indicates that the diffuse population is maybe not a separate older group but it contains stars that were not clustered by the methodology in SGB21, while they are clustered in SigMA.
The differences in the final cluster definition in US likely arise from the different clustering methodologies. To better understand the SGB21 approach we outline the basics here. SGB21 use a semiautomated approach based on iterative kmeans clustering on a 4D sample, using 2D sky positions and 2D tangential velocities. The authors propagate the sky positions 15 Myr into the past and future, producing a new 4D data set at each step; tangential velocities are constant throughout individual data sets. By studying the sky distribution of each slice, SGB21 visually identify overdensities. These overdensities are extracted via kmeans clustering in 4D space at a given time step. Subsequently, the clustered data points are removed from the data set, and the process of looking for overdensities starts anew. The clustering process terminates when the authors cannot find any apparent density peaks in the sky distribution.
Besides the feature space difference, SigMA has significant differences compared to the SGB21 iterative clustering approach. First, the kmeans algorithm cannot deal with the observed nonconvex cluster shapes in projected coordinates. The extracted clusters are 4D Voronoi cells^{26} that can have very elongated shapes. Second, SGB21 analyze 2D projections of the highdimensional data to identify clusters visually. Thus, cluster selection is influenced by projection effects and human judgment. Conversely, SigMA employs a modality test directly in the highdimensional phase space, taking into account multidimensional relationships between data axes. These rather different approaches to extracting clusters in US make it clear that the results cannot be compared at face value, while fractions of the most robust clusters (ρ Oph/L1688, ν Sco, δ Sco, β Sco, σ Sco, and Antares) have been identified by either method.
5.2.7. Comparison with MiretRoig et al. (2022b)
MiretRoig et al. (2022a, hereafter MR22) recently applied a Gaussian mixture model (GMM) to Gaia DR3 data of the US region. They include radial velocities, when available, to select the stellar groups. They identify seven stellar groups within a FOV of about 220 deg^{2} that is centered on US, containing 2810 sources (see also Table 6). A crossmatch with the SigMA members gives 2683 matches; hence, we are missing 127 sources, of which 23 are outside of our box and quality criteria (Sect. 2). This leaves 104 potential members that the SigMA algorithm did not select. Within the same FOV SigMA contains 3089 ScoCen members.
When comparing the SigMA groups to the individual seven MR22 groups in more detail, we find largely good agreement, especially for ρ Oph, ν Sco, δ Sco, and β Sco (see also Table A.1 in MR22 and Table E.4). The SigMA Antares and σ Sco groups seem to be mixed in MR22, with significant fractions in both the MR22 α Sco and σ Sco. Finally, the MR22 π Sco group, which lies largely in the foreground of the other US groups, coincides largely with two of the SigMA groups: USforeground (which we identified as a foreground population to the traditional US) and ρ Sco. The latter is overlapping in space with Antares and ρ Oph. The larger volume investigated with SigMA allows a more complete sample of the USforeground to be selected since the MR22 FOV cuts off the outer edges of what they call π Sco.
Since MR22 also use radial velocity information for a subsample (∼30%), the classification of these 6Dselected sources might be more robust compared to the 5.5D selection as used by SigMA. Investigating the bona fide 6Dselected members in MR22, we still find some mixing of SigMA clusters within MR22 clusters (and vice versa), especially concerning Antares and σ Sco. These differences need more investigations in the future, and dedicated cluster studies are called for (e.g., with Uncover, Ratzenböck et al. 2020).
5.2.8. Comparison with BriceñoMorales & Chanamé (2023)
BriceñoMorales & Chanamé (2023, hereafter BMC23) present another clustering study on US using Gaia EDR3 data. They obtain a clustering solution by first combining the convergent point method (Perryman et al. 1998) with a Gaussian Mixture fit (Pedregosa et al. 2011) to identify kinematic groups with a Bayesian approach. Second, they use OPTICS (Ankerst et al. 1999) to identify spatial substructure in XYZ. This is complemented with age estimates based on Gaia photometry. Their astrometrically clean sample obtained in the first step contains 3004 sources (USco kinematic group, KG). In the same FOV, we find 3439 clustered ScoCen members (see Table 6).
The USco KG is first broken down into three KGs in a “corrected tangential velocity space^{27}”: ρ Oph KG, USco Young KG, and USco Old KG. This suggests that USco comprises three main kinematic components, with ρ Oph KG mainly correlating with the traditional ρ Oph region, and USco Young KG with the traditional USco region. They argue that their USco Old KG is a new yet unstudied population while pointing out that large fractions of this KG are likely interlopers from UCL. Next, the USco KG is independently substructured with OPTICS in the XYZ space, delivering eight spatial clusters and one diffuse population. The latter are sources that OPTICS did not assign to a cluster. In fact, the majority of sources in the USco KG are contained in this diffuse population (∼67%).
Using OPTICS, it can be challenging to identify noise points because there is no explicit noise cluster or appropriate noise threshold. Instead, the noise specification is prone to 1D projection effects of the ordering algorithm (e.g., loss of highdimensional structure, small perturbations in data point positions produce different orderings), which eventually produces the reachability plot (Rplot) from which clusters and noise are determined. In particular, the presence of noise can lead to the creation of many small clusters in the reachability plot, making it challenging to identify clusters.
When crossmatching the BMC23 and SigMA samples, we find 2720 sources in common, 90.5% out of the 3004 BMC23 sources or about 79% out of the 3439 SigMA sources in the same FOV. Hence, we find about 21% additional candidate ScoCen members compared to BMC23 (see also Table 6). A detailed overview of the matched clusters is given in Table E.4. We summarize the best matches as follows.
There are five SigMA clusters that have relatively clear matches with BMC23 spatial clusters, which are ρ Oph, β Sco, ν Sco, σ Sco (named α Sco in BMC23), USforeground (named π Sco in BMC23), and η Lup (named UCL in BMC23). However, all of these clusters have more or less significant fractions contained in their diffuse population (Table E.4). Interestingly, our δ Sco is equally contained within the BMC23 δ Sco and ω Sco, while the largest part of our δ Sco is in their diffuse population. They also compare these two clusters to KRK21, where they find that both, δ Sco and ω Sco, are contained within EOM17H (which matches with our δ Sco). In this case, we find that SigMA and KRK21 agree while BMC23 find further substructure in δ Sco. At this stage, we do not know which result is more physical, and future work is needed to understand this conundrum.
The majority of SigMA’s Antares, ρ Sco, and LibraSouth are contained within their diffuse population; hence, these clusters have not been identified as individual clusterings by BMC23. As mentioned above, their α Sco seems to match best with our σ Sco cluster (not with Antares), while they mention that the two stars * alf Sco and * sig Sco are likely members of this cluster. This connection is of interest since also other works seem to mix the SigMAAntares and SigMAσ Sco clusters, which needs further investigations in the future.
In their conclusions, BMC23 find an agevelocity correlation, suggesting that older clusters have increasing deviating tangential motions compared to ρ Oph (which they use as a starting point) as a function of age. Moreover, they find a correlation between cluster density and cluster age, suggesting that these substructures expand at a measurable rate. Finally, they argue that four potential supernovae happened in the US region based on literature information (e.g., Breitschwerdt et al. 2016; Neuhäuser et al. 2020; Forbes et al. 2021), cluster mass estimates, and on certain “voids” that are identified in the 3D distribution of their cluster extraction. Notably, we do not see such voids in the distribution of the SigMA clusters.
5.2.9. Concluding remarks on the comparison with the literature
In Table 6, we summarize the comparisons with the literature, stating the overlap size and the percentages of recovered, rejected, and new sources when comparing our sample to the literature samples within the same FOV used in respective publications. The rejected sources can also be seen as missing since they are still potential candidate members of ScoCen. We find that we miss sources on the order of a few percent up to about 45%, while this depends on the subsamples we compare to. For instance, some listed publications include compact clusterings and also diffuse populations, which are generally differently defined (e.g., velocity diffuse, remaining sources after clustering steps, etc.). Compared to the more compact clusters, we only miss sources on the order of < 1% to a few percent. This underlines the robustness of the more compact clusterings, while the socalled diffuse populations need to be further investigated, particularly in light of their physical nature. The velocityclustered Xray observed sources, discussed in SCF22, contain the largest fraction of missed sources. However, as mentioned in Sect. 5.2.3, some of these members seem to have older ages in the CMD, leaving only about 10% of potentially missed young ScoCen members in SCF22.
Compared to other methods described in the literature, SigMA consistently identifies more ScoCen candidate members, surpassing other studies by 5–70%. We considered various estimates of contamination, including 4–10% determined by older sources in the CMD (Appendix D.1), 5.3 ± 3.1% determined by SigMA’s internal approximation (Sect. 3.5.4), and 8.2 ± 4.1% field star contamination determined by ScoCenlike cluster simulations (Sect. 4.2.2). Based on these estimates, the true field star contamination rate likely lies between 2 and 12%. Given this contamination estimate and the rate of yet undetected members in comparison to prior work, we expect SigMA has contributed a significant fraction of new, real member stars throughout ScoCen (only exceptions are comparable US samples from SGB21 and possibly MR22).
The visual selection methods (e.g., Damiani et al. 2019; Luhman 2022) contain about 6–10% candidate members not detected by SigMA (Table 6). This is mainly due to these methods’ “selectbyeye” approach in projected subspaces of the multidimensional phase space to identify ScoCen candidates. However, unsupervised machine learning methods find more spatial and kinematical substructures in the ScoCen population and arguably produce samples with lower contamination levels compared to visual selection methods. More importantly, the SigMA method reveals a more complex velocity structure across the entire ScoCen, critical for a physical description of the formation process of OB associations such as ScoCen. When comparing SigMA to other unsupervised methods, which studied the whole ScoCen area (in particular Kerr et al. 2021), we find a compatible substructure. In contrast, the SigMA clusters are generally richer in members. The additional sources cannot be explained with our contamination estimate (about 2–12%) since this lies below the fraction of new sources when compared to Kerr et al. (2021; about 50% new sources).
Focusing on the US region, we generally find good agreement between cluster selections from SGB21, KRK21, MR22, BMC23, and SigMA. Not surprisingly, the denser clusters in US (ρ Oph, ν Sco, δ Sco, and β Sco) have been recovered well by the different approaches. The Antares, σ Sco, ρ Sco, and USforeground clusters are slightly more dispersed and have less clear matches across the methods (e.g., merged or distributed differently in different samples). As revealed by Gaia data, the newly identified velocity substructure in the US region is relevant to understanding the star formation processes at play in ScoCen and will continue to be an obvious target for future studies and surveys. Adding radial velocity information will be critical to characterize these clusters further, as already presented, for example, by MiretRoig et al. (2022a). Considering the five mentioned studies, the somewhat different clustering results for the US region call for a reanalysis of this important young region to better understand its star formation history.
When considering the crosscontamination between clusters in general, we find that the clustered substructure within ScoCen as extracted with SigMA is generally differing to some degree from other clustering results. The deviations have a different extent depending on which literature sample we compare to (see Tables E.2–E.4). There are some clusters, in particular in USco, where several studies agree, highlighting the robustness of these clusters. However, we rarely find a perfect match, and the final cluster membership of an individual source is often highly uncertain, mainly if located at the outskirts of a cluster’s center in a tightly packed environment surrounded by other clusters. Such behavior is expected since we find in Sect. 4.2.2 that crosscontamination leads to a percluster completeness estimate of about 76% of recovered sources when testing on mock data (while this fraction also considers sources lost to the field). Nevertheless, the substructure of ScoCen as identified with SigMA is a good starting point for future studies, supported by the narrow CMD sequences per cluster (see the followup study, Ratzenböck et al. 2023).
In conclusion, SigMA finds significant numbers of sources not present in other samples (often more than 10% new sources, Table 6) with a contamination fraction of about 2–12%, while missing candidates when compared with visual selection methods (on the order of 10%). More work is needed to understand the sources SigMA misses. A possible way forward toward a most complete sample of ScoCen members is to use SigMA cluster members and 3D velocities (by including radial velocities to select the most robust members) as training sets to the Uncover method, a validated bagging classifier of oneclass support vector machines (see the application in Ratzenböck et al. 2020 to the Meingast1 stellar stream, Meingast et al. 2019). In the near future, improved membership lists will allow a more precise analysis of the star formation history of ScoCen, the initial mass function of each cluster, and the dynamic state of the ScoCen complex.
6. Summary
In this paper we present SigMA, a method that explores the topological properties of a density field to define significant substructure. To test and validate SigMA, we applied it to Gaia DR3 data of the nearest OB association to Earth, ScoCen. The main results of this work can be summarized as follows:

SigMA is a novel clustering method that interprets density peaks separated by density dips as significant clusters. Using a graphbased approach, the technique detects peaks and dips directly in the multidimensional phase space.

SigMA is finetuned to largescale surveys in astrophysics. In this context, this new method is able to identify cospatial and comoving groups with nonconvex shapes and variable densities with a measure of significance. SigMA is able to appropriately incorporate 5D astrometric uncertainties alongside radial velocity uncertainties, does not require any photometric prefiltering of stellar populations, and scales to millions of points.

SigMA is capable of finding clusters in Gaia DR3 data, reaching stellar volume densities as low as 0.01 sources pc^{−3} and tangential velocity differences between clusters of about 0.5 km s^{−1}.

We validated SigMA on two simulated data sets and highlight its merits in relation to established clustering techniques. Our comparison shows that SigMA can significantly outperform competing methods, especially in environments where clusters are densely packed, such as the ScoCen OB association. In these dense cluster situations, our simulations show that SigMA has a mean contamination rate of 23.7 ± 13.1% and a mean completeness rate of 76.2 ± 15.2. Considering only the field star influence on contamination and completeness (i.e., ignoring crosscontamination from neighboring clusters), these scores improve to 8.2 ± 4.1% and 81.4 ± 2.0%.

SigMA identifies more than 13 000 ScoCen members located in 37 individual clusters of cospatial and comoving young stars. The CMD for each cluster shows a welldefined sequence. Because SigMA is not aware of a star’s brightness or color, the welldefined sequences constitute a validation test for the ability of SigMA to extract coeval populations. A large fraction of clusters are seen toward wellknown ScoCen massive stars, too bright to be in Gaia DR3, and we (tentatively) associated respective clusters with them. Because SigMA is not aware of these massive stars, their association with SigMA clusters also constitutes a validation test for SigMA.

When comparing the 37 SigMA stellar populations in ScoCen to previous results from the literature, we find mostly agreement; however, several discrepancies exist. When compared to visual selection methods used recently on Gaia data of ScoCen, we find that we might be missing roughly 10% of candidates, while at the same time finding a higher total number of stellar members. Unsupervised methods such as SigMA find more spatial and kinematical substructure for the same data set and produce samples with lower contamination levels.
In the future, in particular when combined with auxiliary radial velocity surveys, a detailed comparative study of the different clustering methods is fully warranted. The application of SigMA to upcoming Gaia data releases promises^{28} the unveiling of detailed cluster distributions such as the one presented here but for all the near starforming regions. Reconstructing an accurate and highspatialresolution star formation history of the last 50 Myr in the Local Milky Way with Gaia data is within reach.
Interactive Figures
Interactive 2D and 3D images associated with Figs. 10, 12, and 13 Access here
The source code is publicly available via GitHub under: https://github.com/ratzenboe/SigMA.
We crossmatched the Gaia DR1 sources identified by Gagné et al. (2018b) and DR2 sources from CantatGaudin & Anders (2020) with DR3 for more precise astrometry. We opted for the Gagné et al. (2018b) census if a cluster appears in both surveys. Compared to densitybased methods, the mixture of Gaussian densities deals naturally with scaling factors (provided the Gaussian assumption holds). The scaling factors can be compensated (to some extent) in the covariance matrix of the individual Gaussian components.
Compared to the clustering analysis, which assumes a universally valid metric implying a global correlation behavior (see Sect. 2), the optimization procedure is not affected by nonlinear relationships between input features. Thus, to avoid propagating errors through the LSR conversion, we stick to observed tangential and radial velocities.
Given as v_ERV (estimated radial velocity) in Table E.1.
See also Sect. 4.2 for a brief discussion on quality filters and future use.
We want to emphasize here once again that these results are conditioned on a given set of quality criteria. We strongly suggest reaffirming internal contamination and completeness estimates through simulations (as discussed in Sect. 4.2) when modifying these quality criteria in future uses of the SigMA software.
This approach seems valid in particular for US, since also other authors, like MiretRoig et al. (2022a) or BriceñoMorales & Chanamé (2023), independently decided for similar cluster names.
Possibly caused by internal feedback mechanisms in the history of ScoCen (e.g., Zucker et al. 2022).
See Sect. 4.2 for a brief discussion on quality filters and future use.
https://dc.zah.uniheidelberg.de/, German Astrophysical Virtual Observatory
While varying one parameter, the remaining ones are set to their default values as described in Sect. 3.3
The uncertainties in distance and tangential velocity have been obtained by propagating the uncertainties in parallax and proper motions. We restrict the sample to sources with small relative uncertainties to guarantee minor uncertainty approximation errors, using quality criteria discussed in Eq. (2). Modifying the quality criteria affects the range of y values but does not eliminate the linear trend between distance and the presented uncertainty ratio when analyzing the rolling median.
http://stev.oapd.inaf.it/cgibin/cmd; assuming solar metallicity (metal fraction z = 0.0152) and no extinction.
Acknowledgments
We thank the anonymous referee for their detailed and very helpful comments. We thank Núria MiretRoig, Maruša Žerjal, Vito Squicciarini, and J. H. M. M. Schmitt for providing their data ahead of final publication. S. Ratzenböck acknowledges funding by the Austrian Research Promotion Agency (FFG, https://www.ffg.at/) under project number FO999892674. Additionally, S. Ratzenböck extends his gratitude to the research network Data Science at Uni Vienna and Petra Schönfelder for excellent administrative support. J. Großschedl acknowledges funding by the Austrian Research Promotion Agency (FFG) under project number 873708. This work has used data from the European Space Agency (ESA) mission Gaia (https://www.cosmos.esa.int/gaia), processed by the Gaia Data Processing and Analysis Consortium (DPAC, https://www.cosmos.esa.int/web/gaia/dpac/consortium). Funding for the DPAC has been provided by national institutions, in particular, the institutions participating in the Gaia Multilateral Agreement. This research has made use of Python, https://www.python.org, of Astropy, a communitydeveloped core Python package for Astronomy (Astropy Collaboration 2013, 2018), NumPy (van der Walt et al. 2011), Matplotlib (Hunter 2007), Galpy (Bovy 2015), and Plotly (Plotly 2015). This research has used the SIMBAD database operated at CDS, Strasbourg, France (Wenger et al. 2000), of the VizieR catalog access tool, CDS, Strasbourg, France (Ochsenbein et al. 2000), and of “Aladin sky atlas” developed at CDS, Strasbourg Observatory, France (Bonnarel et al. 2000; Boch & Fernique 2014). This research has made use of TOPCAT, an interactive graphical viewer and editor for tabular data (Taylor 2005).
References
 Akaike, H. 1974, IEEE Trans. Autom. Control, 19, 716 [Google Scholar]
 Ankerst, M., Breunig, M. M., Kriegel, H.P., & Sander, J. 1999, SIGMOD Rec., 28, 49 [CrossRef] [Google Scholar]
 Ashok Kumar, G. 2020, Int. J. Sci. Technol. Res., 9, 6 [Google Scholar]
 Astropy Collaboration (Robitaille, T. P., et al.) 2013, A&A, 558, A33 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Astropy Collaboration (PriceWhelan, A. M., et al.) 2018, AJ, 156, 123 [Google Scholar]
 Azzalini, A., & Torelli, N. 2007, Stat. Comput., 17, 71 [CrossRef] [Google Scholar]
 BailerJones, C. A. L., Rybizki, J., Fouesneau, M., Demleitner, M., & Andrae, R. 2021, AJ, 161, 147 [Google Scholar]
 Baraffe, I., Chabrier, G., Allard, F., & Hauschildt, P. H. 1998, A&A, 337, 403 [Google Scholar]
 Baraffe, I., Homeier, D., Allard, F., & Chabrier, G. 2015, A&A, 577, A42 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Beccari, G., Boffin, H. M. J., & Jerabkova, T. 2020, MNRAS, 491, 2205 [Google Scholar]
 Benjamini, Y., & Hochberg, Y. 1995, J. R. Stat. Soc. Ser. B (Methodological), 57, 289 [Google Scholar]
 Bentley, J. L. 1975, Commun. ACM, 18, 509 [CrossRef] [Google Scholar]
 Biau, G., Chazal, F., CohenSteiner, D., Devroye, L., & Rodríguez, C. 2011, Electron. J. Stat., 5, 204 [CrossRef] [Google Scholar]
 Blaauw, A. 1946, Publ. Kapteyn Astron. Lab. Groningen, 52, 1 [NASA ADS] [Google Scholar]
 Blaauw, A. 1952, Bull. Astron. Inst. Netherlands, 11, 414 [NASA ADS] [Google Scholar]
 Blaauw, A. 1964a, ARA&A, 2, 213 [Google Scholar]
 Blaauw, A. 1964b, in The Galaxy and the Magellanic Clouds, ed. F. J. Kerr 20, 50 [NASA ADS] [Google Scholar]
 Boch, T., & Fernique, P. 2014, in Astronomical Data Analysis Software and Systems XXIII, eds. N. Manset, & P. Forshay, ASP Conf. Ser., 485, 277 [NASA ADS] [Google Scholar]
 Bonferroni, C. 1936, Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commericiali di Firenze, 8, 3 [Google Scholar]
 Bonnarel, F., Fernique, P., Bienaymé, O., et al. 2000, A&AS, 143, 33 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Bouvier, J., Matt, S. P., Mohanty, S., et al. 2014, in Protostars and Planets VI, eds. H. Beuther, R. S. Klessen, C. P. Dullemond, & T. Henning, 433 [Google Scholar]
 Bouy, H., & Alves, J. 2015, A&A, 584, A26 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Bovy, J. 2015, ApJS, 216, 29 [NASA ADS] [CrossRef] [Google Scholar]
 Breitschwerdt, D., Feige, J., Schulreich, M. M., et al. 2016, Nature, 532, 73 [NASA ADS] [CrossRef] [Google Scholar]
 Bressan, A., Marigo, P., Girardi, L., et al. 2012, MNRAS, 427, 127 [Google Scholar]
 BriceñoMorales, G., & Chanamé, J. 2023, MNRAS, 522, 1288 [CrossRef] [Google Scholar]
 Brodersen, K. H., Ong, C. S., Stephan, K. E., & Buhmann, J. M. 2010, in 2010 20th International Conference on Pattern Recognition, 2010, 3121 [CrossRef] [Google Scholar]
 Brooke, T. Y., Huard, T. L., Bourke, T. L., et al. 2007, ApJ, 655, 364 [NASA ADS] [CrossRef] [Google Scholar]
 Burman, P., & Nolan, D. 1992, J. Multivariate Anal., 40, 132 [CrossRef] [Google Scholar]
 Burman, P., & Polonik, W. 2009, J. Multivariate Anal., 100, 1198 [CrossRef] [Google Scholar]
 Burrows, A., Hubbard, W. B., Lunine, J. I., & Liebert, J. 2001, Rev. Modern Phys., 73, 719 [NASA ADS] [CrossRef] [Google Scholar]
 Caliński, T., & Harabasz, J. 1974, Commun. Stat., 3, 1 [Google Scholar]
 Campello, R. J., Moulavi, D., & Sander, J. 2013, in PacificAsia Conference on Knowledge Discovery and Data Mining (Springer), 160 [Google Scholar]
 CantatGaudin, T., & Anders, F. 2020, A&A, 633, A99 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 CantatGaudin, T., Jordi, C., Vallenari, A., et al. 2018a, A&A, 618, A93 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 CantatGaudin, T., Vallenari, A., Sordo, R., et al. 2018b, A&A, 615, A49 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 CantatGaudin, T., Mapelli, M., BalaguerNúñez, L., et al. 2019a, A&A, 621, A115 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 CantatGaudin, T., KroneMartins, A., Sedaghat, N., et al. 2019b, A&A, 624, A126 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 CastroGinard, A., Jordi, C., Luri, X., et al. 2018, A&A, 618, A59 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 CastroGinard, A., Jordi, C., Luri, X., CantatGaudin, T., & BalaguerNúñez, L. 2019, A&A, 627, A35 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 CastroGinard, A., Jordi, C., Luri, X., et al. 2020, A&A, 635, A45 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 CastroGinard, A., Jordi, C., Luri, X., et al. 2022, A&A, 661, A118 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Celeux, G., Forbes, F., Robert, C. P., & Titterington, D. M. 2006, Bayesian Anal., 1, 651 [Google Scholar]
 Celeux, G., FrühwirthSchnatter, S., & Robert, C. P. 2019, Handbook of Mixture Analysis, Chapman& Hall/CRC Handbooks of Modern Statistical Methods (CRC Press), 117 [Google Scholar]
 Chaudhuri, K., & Dasgupta, S. 2010, in Advances in Neural Information Processing Systems, eds. J. Lafferty, C. Williams, J. ShaweTaylor, R. Zemel, & A. Culotta (Curran Associates, Inc.), 23 [Google Scholar]
 Chaudhuri, K., Dasgupta, S., Kpotufe, S., & von Luxburg, U. 2014, IEEE Trans. Inf. Theor., 60, 7900 [CrossRef] [Google Scholar]
 Chazal, F., Guibas, L. J., Oudot, S. Y., & Skraba, P. 2013, J. ACM, 60 [CrossRef] [Google Scholar]
 Chen, B., D’Onghia, E., Alves, J., & Adamo, A. 2020, A&A, 643, A114 [EDP Sciences] [Google Scholar]
 Cheng, Y. 1995, IEEE Trans. Pattern Analy. Mach. Intell., 17, 790 [CrossRef] [Google Scholar]
 Chen, Y., Girardi, L., Bressan, A., et al. 2014, MNRAS, 444, 2525 [Google Scholar]
 Chen, Y., Bressan, A., Girardi, L., et al. 2015, MNRAS, 452, 1068 [Google Scholar]
 Chronis, P., Athanasiou, S., & Skiadopoulos, S. 2019, in 2019 IEEE International Conference on Data Mining (ICDM), 91 [CrossRef] [Google Scholar]
 Comaniciu, D., & Meer, P. 2002, IEEE Trans. Pattern Anal. Mach. Intell., 24, 603 [CrossRef] [Google Scholar]
 Cornuéjols, A., Wemmert, C., Gançarski, P., & Bennani, Y. 2018, Inf. Fusion, 39, 81 [CrossRef] [Google Scholar]
 Coronado, J., Fürnkranz, V., & Rix, H.W. 2022, ApJ, 928, 70 [NASA ADS] [CrossRef] [Google Scholar]
 Correa, C., & Lindstrom, P. 2011, IEEE Trans. Visual. Comput. Graphics, 17, 1852 [CrossRef] [Google Scholar]
 Crundall, T. D., Ireland, M. J., Krumholz, M. R., et al. 2019, MNRAS, 489, 3625 [Google Scholar]
 Damiani, F., Prisinzano, L., Pillitteri, I., Micela, G., & Sciortino, S. 2019, A&A, 623, A112 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Dasgupta, S., & Kpotufe, S. 2014, in Advances in Neural Information Processing Systems, eds. Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, & K. Q. Weinberger (Curran Associates, Inc.), 27 [Google Scholar]
 de Bruijne, J. H. J. 1999, MNRAS, 310, 585 [NASA ADS] [CrossRef] [Google Scholar]
 de Geus, E. J. 1992, A&A, 262, 258 [NASA ADS] [Google Scholar]
 de Geus, E. J., de Zeeuw, P. T., & Lub, J. 1989, A&A, 216, 44 [NASA ADS] [Google Scholar]
 Dempster, A. P., Laird, N. M., & Rubin, D. B. 1977, J. R. Stat. Soc. Series B Stat. Methodol., 39, 1 [Google Scholar]
 de Zeeuw, P. T., Hoogerwerf, R., de Bruijne, J. H. J., Brown, A. G. A., & Blaauw, A. 1999, AJ, 117, 354 [Google Scholar]
 Diehl, R., Lang, M. G., Martin, P., et al. 2010, A&A, 522, A51 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Dieterich, S. B., Henry, T. J., Jao, W.C., et al. 2014, AJ, 147, 94 [NASA ADS] [CrossRef] [Google Scholar]
 Ding, J., Shah, S., & Condon, A. 2016, Bioinformatics, 32, 2567 [CrossRef] [Google Scholar]
 Dobbie, P. D., Lodieu, N., & Sharp, R. G. 2010, MNRAS, 409, 1002 [Google Scholar]
 Edelsbrunner, H., Letscher, D., & Zomorodian, A. 2000, in Proceedings 41st Annual Symposium on Foundations of Computer Science, 454 [CrossRef] [Google Scholar]
 Esplin, T. L., & Luhman, K. L. 2022, AJ, 163, 64 [NASA ADS] [CrossRef] [Google Scholar]
 Ester, M., Kriegel, H. P., Sander, J., & Xu, X. 1996, in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD’96 (AAAI Press), 226 [Google Scholar]
 Evans, D. W., Riello, M., De Angeli, F., et al. 2018, A&A, 616, A4 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Feigelson, E. D., & Montmerle, T. 1999, ARA&A, 37, 363 [Google Scholar]
 Feng, Y., & Hamerly, G. 2007, in Advances in Neural Information Processing Systems, eds. B. Schölkopf, J. Platt, & T. Hoffman, 19 (MIT Press) [Google Scholar]
 Fernández, D., Figueras, F., & Torra, J. 2008, A&A, 480, 735 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Fisher, R. A. 1934, Breakthroughs in Statistics (Springer), 66 [Google Scholar]
 Forbes, J. C., Alves, J., & Lin, D. N. C. 2021, Nat. Astron., 5, 1009 [NASA ADS] [CrossRef] [Google Scholar]
 Francis, C., & Anderson, E. 2009, New A, 14, 615 [NASA ADS] [CrossRef] [Google Scholar]
 Freytag, B., Allard, F., Ludwig, H. G., Homeier, D., & Steffen, M. 2010, A&A, 513, A19 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Freytag, B., Steffen, M., Ludwig, H. G., et al. 2012, J. Comput. Phys., 231, 919 [Google Scholar]
 FrühwirthSchnatter, S., Celeux, G., & Robert, C. 2019, in Handbook of Mixture Analysis, Chapman& Hall/CRC Handbooks of Modern Statistical Methods (CRC Press) [Google Scholar]
 Fürnkranz, V., Meingast, S., & Alves, J. 2019, A&A, 624, L11 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Gabriel, K. R., & Sokal, R. R. 1969, System. Biol., 18, 259 [Google Scholar]
 Gagné, J., & Faherty, J. K. 2018, ApJ, 862, 138 [Google Scholar]
 Gagné, J., Lafrenière, D., Doyon, R., Malo, L., & Artigau, É. 2014, ApJ, 783, 121 [Google Scholar]
 Gagné, J., Mamajek, E. E., Malo, L., et al. 2018a, ApJ, 856, 23 [Google Scholar]
 Gagné, J., RoyLoubier, O., Faherty, J. K., Doyon, R., & Malo, L. 2018b, ApJ, 860, 43 [Google Scholar]
 Gagné, J., Faherty, J. K., & Mamajek, E. E. 2018c, ApJ, 865, 136 [Google Scholar]
 Gaia Collaboration (Brown, A. G. A., et al.) 2016, A&A, 595, A2 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Gaia Collaboration (Brown, A. G. A., et al.) 2018, A&A, 616, A1 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Gaia Collaboration (Brown, A. G. A., et al.) 2021, A&A, 649, A1 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Gaia Collaboration (Vallenari, A., et al.) 2023a, A&A, 674, A1 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Gaia Collaboration (Arenou, F., et al.) 2023b, A&A, 674, A34 [CrossRef] [EDP Sciences] [Google Scholar]
 Galli, P. A. B., Joncour, I., & Moraux, E. 2018, MNRAS, 477, L50 [Google Scholar]
 Galli, P. A. B., Bouy, H., Olivares, J., et al. 2020a, A&A, 643, A148 [EDP Sciences] [Google Scholar]
 Galli, P. A. B., Bouy, H., Olivares, J., et al. 2020b, A&A, 634, A98 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Galli, P. A. B., Bouy, H., Olivares, J., et al. 2021a, A&A, 654, A122 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Galli, P. A. B., Bouy, H., Olivares, J., et al. 2021b, A&A, 646, A46 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Ghrist, R. 2008, Bull. Am. Math. Soc., 45, 61 [Google Scholar]
 Goldman, B., Röser, S., Schilbach, E., Moór, A. C., & Henning, T. 2018, ApJ, 868, 32 [Google Scholar]
 Grasser, N., Ratzenböck, S., Alves, J., et al. 2021, A&A, 652, A2 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Gritschneder, M., & Lin, D. N. C. 2012, ApJ, 754, L13 [NASA ADS] [CrossRef] [Google Scholar]
 Großschedl, J. E., Alves, J., Meingast, S., & HerbstKiss, G. 2021, A&A, 647, A91 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Hamerly, G., & Elkan, C. 2004, in Advances in Neural Information Processing, eds. S. Thrun, L. Saul, & B. Schölkopf, 16 (MIT Press) [Google Scholar]
 Hartigan, J. A. 1975, Clustering Algorithms, 99th edn. (USA: John Wiley& Sons, Inc.) [Google Scholar]
 Hartigan, J. A., & Hartigan, P. M. 1985, Ann. Stat., 13, 70 [Google Scholar]
 He, Z., Wang, K., Luo, Y., et al. 2022, ApJS, 262, 7 [NASA ADS] [CrossRef] [Google Scholar]
 Hogg, D. W., Bovy, J., & Lang, D. 2010, ArXiv eprints [arXiv:1008.4686] [Google Scholar]
 Hu, X., & Xu, L. 2003, in Intelligent Data Engineering and Automated Learning, eds. J. Liu, Y. M. Cheung, & H. Yin (Berlin, Heidelberg: Springer), 195 [CrossRef] [Google Scholar]
 Hubert, L., & Arabie, P. 1985, J. Classification, 2, 193 [CrossRef] [Google Scholar]
 Hunt, E. L., & Reffert, S. 2021, A&A, 646, A104 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Hunt, E. L., & Reffert, S. 2023, A&A, 673, A114 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Hunter, J. D. 2007, Comput. Sci. Eng., 9, 90 [Google Scholar]
 Jain, A. K., Murty, M. N., & Flynn, P. J. 1999, ACM Comput. Surveys, 31, 264 [CrossRef] [Google Scholar]
 Jaromczyk, J., & Toussaint, G. 1992, Proc. IEEE, 80, 1502 [CrossRef] [Google Scholar]
 Jerabkova, T., Boffin, H. M. J., Beccari, G., & Anderson, R. I. 2019, MNRAS, 489, 4418 [Google Scholar]
 Jerabkova, T., Boffin, H. M. J., Beccari, G., et al. 2021, A&A, 647, A137 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Johnson, S. C. 1967, Psychometrika, 32, 241 [CrossRef] [Google Scholar]
 Kalogeratos, A., & Likas, A. 2012, in Advances in Neural Information Processing Systems, eds. F. Pereira, C. J. C. Burges, L. Bottou, & K. Q. Weinberger (Curran Associates, Inc.), 25 [Google Scholar]
 Kamdar, H., Conroy, C., Ting, Y.S., & ElBadry, K. 2021, ApJ, 922, 49 [NASA ADS] [CrossRef] [Google Scholar]
 Kapteyn, J. C. 1914, ApJ, 40, 43 [NASA ADS] [CrossRef] [Google Scholar]
 Katz, D., Sartoretti, P., Guerrier, A., et al. 2023, A&A, 674, A5 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Kerr, F. J., & LyndenBell, D. 1986, MNRAS, 221, 1023 [CrossRef] [Google Scholar]
 Kerr, R. M. P., Rizzuto, A. C., Kraus, A. L., & Offner, S. S. R. 2021, ApJ, 917, 23 [NASA ADS] [CrossRef] [Google Scholar]
 Kervella, P., Arenou, F., & Thévenin, F. 2022, A&A, 657, A7 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Kharchenko, N. V., Piskunov, A. E., Schilbach, E., Röser, S., & Scholz, R. D. 2013, A&A, 558, A53 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Kirkpatrick, D. G., & Radke, J. D. 1985, in Machine Intelligence and Pattern Recognition, ed. G. T. Toussaint (NorthHolland), Comput. Geom., 2, 217 [Google Scholar]
 Koontz W. L. G., Narendra P. M., & Fukunaga K. 1976, IEEE Trans. Comput., 25, 936 [CrossRef] [Google Scholar]
 Kounkel, M., & Covey, K. 2019, AJ, 158, 122 [Google Scholar]
 Kounkel, M., Covey, K., & Stassun, K. G. 2020, AJ, 160, 279 [NASA ADS] [CrossRef] [Google Scholar]
 Kpotufe, S., & von Luxburg, U. 2011, in Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML’11 (Madison, WI, USA: Omnipress), 225 [Google Scholar]
 KroneMartins, A., & Moitinho, A. 2014, A&A, 561, A57 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Kuhn, M. A., & Feigelson, E. D. 2019, Handbook of Mixture Analysis, Chapman& Hall/CRC Handbooks of Modern Statistical Methods (CRC Press), 463 [Google Scholar]
 Kushniruk, I., Schirmer, T., & Bensby, T. 2017, A&A, 608, A73 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Lange, K. L., Little, R. J. A., & Taylor, J. M. G. 1989, J. Am. Stat. Assoc., 84, 881 [Google Scholar]
 Leike, R. H., Glatzle, M., & Enßlin, T. A. 2020, A&A, 639, A138 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Lépine, J. R. D., & Sartori, M. J. 2003, Astrophys. Space Sci. Libr., 299, 63 [Google Scholar]
 Lifshitz, L., & Pizer, S. 1990, IEEE Trans. Pattern Anal. Mach. Intell., 12, 529 [CrossRef] [Google Scholar]
 Lindegren, L., Hernández, J., Bombrun, A., et al. 2018, A&A, 616, A2 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Lindegren, L., Klioner, S. A., Hernández, J., et al. 2021, A&A, 649, A2 [EDP Sciences] [Google Scholar]
 Liu, Y., & Xie, J. 2020, J. Am. Stat. Assoc., 115, 393 [CrossRef] [Google Scholar]
 Lombardi, M., Alves, J., & Lada, C. J. 2006, A&A, 454, 781 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Luhman, K. L. 2022, AJ, 163, 24 [NASA ADS] [CrossRef] [Google Scholar]
 Luhman, K. L., & Esplin, T. L. 2020, AJ, 160, 44 [Google Scholar]
 Luri, X., Brown, A. G. A., Sarro, L. M., et al. 2018, A&A, 616, A9 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 MacQueen, J. B. 1967, in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, eds. L. M. L. Cam, & J. Neyman (University of California Press), 1, 281 [Google Scholar]
 Magnani, L., Blitz, L., & Mundy, L. 1985, ApJ, 295, 402 [NASA ADS] [CrossRef] [Google Scholar]
 Makarov, V. V. 2007a, ApJS, 169, 105 [NASA ADS] [CrossRef] [Google Scholar]
 Makarov, V. V. 2007b, ApJ, 670, 1225 [NASA ADS] [CrossRef] [Google Scholar]
 MalsinerWalli, G., FrühwirthSchnatter, S., & Grün, B. 2016, Stat. comput., 26, 303 [CrossRef] [Google Scholar]
 Mamajek, E. E., & Feigelson, E. D. 2001, in Young Stars Near Earth: Progress and Prospects, eds. T. R. Jayawardhana, & T. Greene, ASP Conf. Ser., 244, 104 [NASA ADS] [Google Scholar]
 Mamajek, E. E., Lawson, W. A., & Feigelson, E. D. 1999, ApJ, 516, L77 [Google Scholar]
 Mamajek, E. E., Lawson, W. A., & Feigelson, E. D. 2000, ApJ, 544, 356 [NASA ADS] [CrossRef] [Google Scholar]
 Marigo, P., Girardi, L., Bressan, A., et al. 2017, ApJ, 835, 77 [Google Scholar]
 Matthews, B. 1975, Biochim. Biophys. Acta (BBA)  Protein Structure, 405, 442 [CrossRef] [Google Scholar]
 Maurus, S., & Plant, C. 2016, in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16 (New York, NY, USA: Association for Computing Machinery), 1055 [CrossRef] [Google Scholar]
 McInnes, L., Healy, J., & Astels, S. 2017, J. Open Source Softw., 2, 205 [NASA ADS] [CrossRef] [Google Scholar]
 Meingast, S., & Alves, J. 2019, A&A, 621, L3 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Meingast, S., Alves, J., & Fürnkranz, V. 2019, A&A, 622, L13 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Meingast, S., Alves, J., & Rottensteiner, A. 2021, A&A, 645, A84 [EDP Sciences] [Google Scholar]
 Merloni, A., Nandra, K., & Predehl, P. 2020, Nat. Astron., 4, 634 [Google Scholar]
 MiretRoig, N., Galli, P. A. B., Brandner, W., et al. 2020, A&A, 642, A179 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 MiretRoig, N., Galli, P. A. B., Olivares, J., et al. 2022a, A&A, 667, A163 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 MiretRoig, N., Bouy, H., Raymond, S. N., et al. 2022b, Nat. Astron., 6, 89 [NASA ADS] [CrossRef] [Google Scholar]
 Muller, D. W., & Sawitzki, G. 1991, J. Am. Stat. Assoc., 86, 738 [Google Scholar]
 Müller, P., & Mitra, R. 2013, Bayesian Anal., 8, 269 [Google Scholar]
 Neuhäuser, R. 1997, Science, 276, 1363 [CrossRef] [Google Scholar]
 Neuhäuser, R., Gießler, F., & Hambaryan, V. V. 2020, MNRAS, 498, 899 [CrossRef] [Google Scholar]
 Nocedal, J., & Wright, S. J. 1999, in Numerical Optimization (New York, NY: Springer) [CrossRef] [Google Scholar]
 Ochsenbein, F., Bauer, P., & Marcout, J. 2000, A&AS, 143, 23 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Oh, S., PriceWhelan, A. M., Hogg, D. W., Morton, T. D., & Spergel, D. N. 2017, AJ, 153, 257 [NASA ADS] [CrossRef] [Google Scholar]
 Ohnaka, K., Hofmann, K. H., Schertl, D., et al. 2013, A&A, 555, A24 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Olivares, J., Bouy, H., Sarro, L. M., et al. 2021, A&A, 649, A159 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Pagani, L., Lagache, G., Bacmann, A., et al. 2003, A&A, 406, L59 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Pagani, L., Bacmann, A., Motte, F., et al. 2004, A&A, 417, 605 [CrossRef] [EDP Sciences] [Google Scholar]
 Pagani, L., Pardo, J. R., Apponi, A. J., Bacmann, A., & Cabrit, S. 2005, A&A, 429, 181 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Pecaut, M. J., & Mamajek, E. E. 2013, ApJS, 208, 9 [Google Scholar]
 Pecaut, M. J., & Mamajek, E. E. 2016, MNRAS, 461, 794 [Google Scholar]
 Pecaut, M. J., Mamajek, E. E., & Bubar, E. J. 2012, ApJ, 746, 154 [Google Scholar]
 Pedregosa, F., Varoquaux, G., Gramfort, A., et al. 2011, J. Mach. Learn. Res., 12, 2825 [Google Scholar]
 Penoyre, Z., Belokurov, V., & Evans, N. W. 2022a, MNRAS, 513, 2437 [NASA ADS] [CrossRef] [Google Scholar]
 Penoyre, Z., Belokurov, V., & Evans, N. W. 2022b, MNRAS, 513, 5270 [Google Scholar]
 Perryman, M. A. C., Lindegren, L., Kovalevsky, J., et al. 1997, A&A, 323, L49 [Google Scholar]
 Perryman, M. A. C., Brown, A. G. A., Lebreton, Y., et al. 1998, A&A, 331, 81 [NASA ADS] [Google Scholar]
 Plotly, T. I. 2015, Collaborative Data Science, Montreal, QC, https://plot.ly [Google Scholar]
 Pöppel, W. G. L., Bajaja, E., Arnal, E. M., & Morras, R. 2010, A&A, 512, A83 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Preibisch, T., & Mamajek, E. 2008, in Handbook of Star Forming Regions, Volume II, ed. B. Reipurth, 5, 235 [NASA ADS] [Google Scholar]
 Preibisch, T., & Zinnecker, H. 1999, AJ, 117, 2381 [NASA ADS] [CrossRef] [Google Scholar]
 Randich, S., Schmitt, J. H. M. M., Prosser, C. F., & Stauffer, J. R. 1995, A&A, 300, 134 [NASA ADS] [Google Scholar]
 Ratzenböck, S., Meingast, S., Alves, J., Möller, T., & Bomze, I. 2020, A&A, 639, A64 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Ratzenböck, S., Großschedl, J. E., Alves, J., et al. 2023, A&A, in press, https://doi.org/10.1051/00046361/202346901 [Google Scholar]
 Reininghaus, J., Kotava, N., Guenther, D., et al. 2011, IEEE Trans. Visual. Comput. Graphics, 17, 2045 [CrossRef] [Google Scholar]
 Richardson, S., & Green, P. J. 1997, J. R. Stat. Soc. Ser. B (Stat. Methodol.), 59, 731 [CrossRef] [Google Scholar]
 Riedel, A. R., Blunt, S. C., Lambrides, E. L., et al. 2017, AJ, 153, 95 [NASA ADS] [CrossRef] [Google Scholar]
 Riello, M., De Angeli, F., Evans, D. W., et al. 2021, A&A, 649, A3 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Rizzuto, A. C., Ireland, M. J., & Robertson, J. G. 2011, MNRAS, 416, 3108 [Google Scholar]
 Roccatagliata, V., Sacco, G. G., Franciosini, E., & Randich, S. 2018, A&A, 617, L4 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 RománZúñiga, C. G., Lada, C. J., Muench, A., & Alves, J. F. 2007, ApJ, 664, 357 [CrossRef] [Google Scholar]
 RománZúñiga, C. G., Alves, J. F., Lada, C. J., & Lombardi, M. 2010, ApJ, 725, 2232 [Google Scholar]
 Röser, S., Schilbach, E., Goldman, B., et al. 2018, A&A, 614, A81 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Röser, S., Schilbach, E., & Goldman, B. 2019, A&A, 621, L2 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Rybizki, J., Demleitner, M., BailerJones, C., et al. 2020, PASP, 132, 074501 [Google Scholar]
 Rybizki, J., Green, G. M., Rix, H.W., et al. 2022, MNRAS, 510, 2597 [NASA ADS] [CrossRef] [Google Scholar]
 Salvatier, J., Wiecki, T. V., & Fonnesbeck, C. 2016, PeerJ Comput. Sci., 2, e55 [Google Scholar]
 Sarro, L. M., Bouy, H., Berihuete, A., et al. 2014, A&A, 563, A45 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Sartori, M. J., Lépine, J. R. D., & Dias, W. S. 2003, A&A, 404, 913 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Schmitt, J. H. M. M. 1997, A&A, 318, 215 [NASA ADS] [Google Scholar]
 Schmitt, J. H. M. M., Czesla, S., Freund, S., Robrade, J., & Schneider, P. C. 2022, A&A, 661, A40 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Schönrich, R., Binney, J., & Dehnen, W. 2010, MNRAS, 403, 1829 [Google Scholar]
 Schwarz, G. 1978, Ann. Stat., 6, 461 [Google Scholar]
 Shaffer, J. P. 1995, Ann. Rev. Psychol., 46, 561 [CrossRef] [Google Scholar]
 Sim, G., Lee, S. H., Ann, H. B., & Kim, S. 2019, J. Korean Astron. Soc., 52, 145 [NASA ADS] [Google Scholar]
 Squicciarini, V., Gratton, R., Bonavita, M., & Mesa, D. 2021, MNRAS, 507, 1381 [NASA ADS] [CrossRef] [Google Scholar]
 Stauffer, J. R., Hartmann, L. W., Prosser, C. F., et al. 1997, ApJ, 479, 776 [NASA ADS] [CrossRef] [Google Scholar]
 Steinacker, J., Pagani, L., Bacmann, A., & Guieu, S. 2010, A&A, 511, A9 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Steinacker, J., Andersen, M., Thi, W. F., et al. 2015, A&A, 582, A70 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Strehl, A., & Ghosh, J. 2002, J. Mach. Learn. Res., 3, 583 [Google Scholar]
 Stuetzle, W., & Nugent, R. 2010, J. Comput. Graphical Stat., 19, 397 [CrossRef] [Google Scholar]
 Taylor, M. B. 2005, in Astronomical Data Analysis Software and Systems XIV, eds. P. Shopbell, M. Britton, & R. Ebert, ASP Conf. Ser., 347, 29 [Google Scholar]
 Teixeira, P. S., Scholz, A., & Alves, J. 2020, A&A, 642, A86 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Torra, F., Castañeda, J., Fabricius, C., et al. 2021, A&A, 649, A10 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 van der Walt, S., Colbert, S. C., & Varoquaux, G. 2011, Comput. Sci. Eng., 13, 22 [Google Scholar]
 van Leeuwen, F. 2007, A&A, 474, 653 [CrossRef] [EDP Sciences] [Google Scholar]
 van Leeuwen, F. 2009, A&A, 497, 209 [CrossRef] [EDP Sciences] [Google Scholar]
 Vedaldi, A., & Soatto, S. 2008, in Computer Vision  ECCV 2008, eds. D. Forsyth, P. Torr, & A. Zisserman (Berlin, Heidelberg: Springer), 705 [CrossRef] [Google Scholar]
 VegaPons, S., & RuizShulcloper, J. 2011, Int. J. Pattern Recogn. Artif. Intell., 25, 337 [CrossRef] [Google Scholar]
 Vehtari, A., Gelman, A., & Gabry, J. 2017, Stat. Comput., 27, 1413 [Google Scholar]
 Villa Vélez, J. A., Brown, A. G. A., & Kenworthy, M. A. 2018, Res. Notes Am. Astron. Soc., 2, 58 [Google Scholar]
 Vinh, N. X., Epps, J., & Bailey, J. 2010, J. Mach. Learn. Res., 11, 2837 [Google Scholar]
 Wang, K., & Ge, Y. 2021, Astrophysics Source Code Library [record ascl:2102.002] [Google Scholar]
 Watanabe, S. 2010, J. Mach. Learn. Res., 11, 3571 [Google Scholar]
 Wenger, M., Ochsenbein, F., Egret, D., et al. 2000, A&AS, 143, 9 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Wishart, D. 1969, in Proceedings of the Colloquium in Numerical Taxonomy, ed. A. J. Cole (New York: Academic Press), 282 [Google Scholar]
 Witkin, A. P. 1987, in Readings in Computer Vision, eds. M. A. Fischler, & O. Firschein (San Francisco, CA: Morgan Kaufmann), 329 [Google Scholar]
 Wright, N. J., & Mamajek, E. E. 2018, MNRAS, 476, 381 [Google Scholar]
 Zari, E., Brown, A. G. A., de Bruijne, J., Manara, C. F., & de Zeeuw, P. T. 2017, A&A, 608, A148 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Zari, E., Brown, A. G. A., & de Zeeuw, P. T. 2019, A&A, 628, A123 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Zari, E., Rix, H. W., Frankel, N., et al. 2021, A&A, 650, A112 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Žerjal, M., Ireland, M. J., Crundall, T. D., Krumholz, M. R., & Rains, A. D. 2023, MNRAS, 519, 3992 [CrossRef] [Google Scholar]
 Zomorodian, A., & Carlsson, G. 2005, Discrete Comput. Geom., 33, 249 [CrossRef] [Google Scholar]
 Zucker, C., Speagle, J. S., Schlafly, E. F., et al. 2019, ApJ, 879, 125 [Google Scholar]
 Zucker, C., Goodman, A., Alves, J., et al. 2021, ApJ, 919, 35 [NASA ADS] [CrossRef] [Google Scholar]
 Zucker, C., Goodman, A. A., Alves, J., et al. 2022, Nature, 601, 334 [NASA ADS] [CrossRef] [Google Scholar]
Appendix A: Gaia DR3 data retrieval and details on quality criteria
The Gaia DR3 data were downloaded from the Gaia Archive^{29} using the following ADQL query:
SELECT * FROM gaiadr3.gaia_source WHERE (1000./parallax*COS(l*PI()/180)*COS(b*PI()/180))>50. AND (1000./parallax*COS(l*PI()/180)*COS(b*PI()/180))<250. AND (1000./parallax*SIN(l*PI()/180)*COS(b*PI()/180))>200. AND (1000./parallax*SIN(l*PI()/180)*COS(b*PI()/180))<50. AND (1000./parallax*SIN(b*PI()/180))>95. AND (1000./parallax*SIN(b*PI()/180))<100. AND parallax_over_error>4.5
The first six expressions give the XYZ box conditions, while X is positive toward the Galactic center, Y is positive in the direction of Galactic rotation, and Z points toward the Galactic northpole. XYZ can also be calculated using astropy.coordinates.SkyCoord from Astropy v4.0. The parameter fidelity_v2 from Rybizki et al. (2022) was retrieved with the following ADQL query, using the Topcat TAP Query and the GAVO service^{30}:
SELECT gaia.* FROM gedr3spur.main AS gaia JOIN tap_upload.t1 AS mine USING (source_id)
We support our quality criteria choices from Sect. 2 as follows. Using fidelity_v2 > 0.5 is suggested as separator into “good” and “bad” sources by Rybizki et al. (2022) (see also Zari et al. 2021). Using a threshold of 0.9 would give slightly cleaner data; however, few sources lie in the range between 0.5 and 0.9 (∼2% in the box), and we opted for the less conservative value. The additional cut using parallax_over_error, which is similar to the signaltonoise ratio (S/N_{ϖ}), is used to reduce further parallax uncertainties. The choice of the threshold S/N_{ϖ} > 4.5 is further supported by Rybizki et al. (2022), where they apply different classifiers (high and lowS/N classifiers) for sources above and below this threshold. To avoid inhomogeneous data, we decided to include this S/N threshold. Moreover, we want to avoid too high parallax errors. As mentioned, we used the inverse of the parallax to estimate the distance to a source, which gets unreliable if the uncertainties are too large. For more distant sources, or intrinsically faint sources with high parallax errors, the distance estimate becomes a nontrivial inference problem (e.g., Luri et al. 2018; BailerJones et al. 2021). In Table A.1 we list the typical uncertainties of the various parameters for sources inside the box after the applied quality criteria, and also separately for sources that are selected as members of the 37 ScoCen clusters (given in brackets in Table A.1). It can be seen that the majority of the sources in the box (2σ) have parallax uncertainties below 0.6 mas.
Typical parameter uncertainties for sources used in our analysis.
For SigMA we used the 5.5D phase space, as mentioned in Sect. 2, for which we used tangential velocities in km s^{−1}. The proper motions (${\mu}_{\alpha}^{*}$ = μ_{α}cos(δ), μ_{δ}) are transformed from mas yr^{−1} to tangential velocities in km s^{−1} as follows, where the conversion constant 4.74047 is in units of km yr s^{−1}:
$$\begin{array}{cc}\hfill {v}_{\alpha}& =4.74047\xb7{\mu}_{\alpha}^{\ast}/\varpi \hfill \\ \hfill {v}_{\delta}& =4.74047\xb7{\mu}_{\delta}/\varpi .\hfill \end{array}$$(A.1)
Moreover, we corrected the tangential motions for the Sun’s reflex motion, resulting in tangential motions relative to the LSR, using the values by Schönrich et al. (2010). This conversion is accomplished with the help of Astropy, by defining the below sky coordinates. For a conversion of heliocentric proper motions to proper motions relative to the LSR, the radial velocity can be set to an arbitrary value in the skycoordinate definition of Astropy (here set to 0), since different RV values do not change the outcome of this conversion.
from astropy.coordinates import ICRS, LSR from astropy import units as u skyicrs = ICRS(ra = ra * u.deg, dec = dec * u.degree, distance = 1000./parallax * u.pc, pm_ra_cosdec = pmra * u.mas/u.yr, pm_dec = pmdec * u.mas/u.yr, radial_velocity = 0. * u.km/u.s) pma_lsr = skyicrs.transform_to(LSR()).pm_ra_cosdec.value pmd_lsr = skyicrs.transform_to(LSR()).pm_dec.value v_a_lsr = 4.74047 * pma_lsr / parallax v_d_lsr = 4.74047 * pmd_lsr / parallax
In Sect. 5.2 we compare the SigMA clusters with recent literature samples. To this end, we crossmatch the samples using the Gaia DR3/EDR3 source_id. This crossmatch is straight forward for the samples of Schmitt et al. (2022), Squicciarini et al. (2021), Luhman (2022), and MiretRoig et al. (2022a) who used Gaia DR3/EDR3 data, which allows a direct match with the source_id. In the case of Damiani et al. (2019), Kerr et al. (2021), and Žerjal et al. (2023), who used Gaia DR2 data, we first retrieve the Gaia DR3 source_id using the gaiadr3.dr2_neighbourhood catalog from the Gaia Archive, since the DR2 and DR3 source_ids are not generally the same. Such a crossmatch delivers a few sources that have several possible matches of DR3 with DR2 sources (see Torra et al. 2021; Gaia Collaboration 2021). In such cases, we chose the closest match, using the provided angular_distance parameter.
Appendix B: Detailed information and background on the methods
In this appendix, we give indepth and background information on some aspects of our methodology from Sect. 3.
B.1. Related work: Cluster analysis
In the following, we present a cross section of related work upon which SigMA rests. From the vast corpus of data mining and statistics literature, we focus specifically on identifying stable groups using densitybased clustering methods.
Hierarchical, densitybased clustering
The strength of levelset formulation (see Eq. (3) and further discussions in Sect. 3.1.3) lies in the natural emergence of a cluster tree, a clustering hierarchy that arises from sweeping the density threshold λ from ∞ → −∞. With a continuous change in λ, the number of connected components changes when the threshold passes through a critical point in f, and thus ∇f = 0. A new cluster is born when λ reaches the height of a mode in f. On the other hand, a cluster dies when λ traverses a saddle point or a local minimum, in which case the two connected components merge into a single one. The cluster creation and merging process is schematically shown in Fig. 1.
However, estimating the connected components of level sets, while easy in one dimension, gets nontrivial in higher dimensions. Consequently, algorithmic realizations of the Hartigan (1975) levelset idea rely on graph heuristics and graph theory in which connected components arise naturally. Early implementations by Azzalini & Torelli (2007) and Stuetzle & Nugent (2010) and subsequent theoretical analyses (Chaudhuri & Dasgupta 2010; Kpotufe & von Luxburg 2011; Chaudhuri et al. 2014) adopt a graph G(λ) over the data samples where vertices and/or edges are filtered according to λ, and thus $\{\mathit{x}\in X\phantom{\rule{3.33333pt}{0ex}}:\phantom{\rule{3.33333pt}{0ex}}\widehat{f}(\mathit{x})\ge \lambda \}$^{31}.
However, the use of graphs to represent the connectivity comes with its own limitations. This scheme guarantees that two samples from one connected component of G(λ) are to be found in a connected component in L(λ). However, as Stuetzle & Nugent (2010) point out, the reverse implication is not necessarily given. This means samples from the same connected component in L(λ) may end up in different connected components of G(λ). Since density estimates are inherently noisy, usually, too many clusters arise from this iterative filtration procedure. To counteract this overclustering, the resulting graph cluster tree is usually pruned in a postprocessing step during which spurious clusters are identified and merged back into the “mother cluster” (Stuetzle & Nugent 2010; Kpotufe & von Luxburg 2011; Chaudhuri et al. 2014).
The HDBSCAN algorithm
A wellknown algorithm belonging to the family of hierarchical levelset methods is the HDBSCAN (Hierarchical DensityBased Spatial Clustering of Applications with Noise) algorithm (Campello et al. 2013), which recently has been gaining attention in the astronomical community (e.g., Kounkel & Covey 2019; Kounkel et al. 2020; Hunt & Reffert 2021; Kerr et al. 2021). In order to prevent overclustering, the authors introduce the minimum cluster size parameter that provides an interpretable pruning strategy.
At each cluster split decision, the smaller cluster created is merged back into the mother cluster if it has fewer than minimum cluster size points; otherwise, a new cluster is created. To obtain a flat clustering result from the cluster tree, HDBSCAN estimates the stability of a cluster in the hierarchy via the concept of relative EOM. Similar to the concept of excess mass (Muller & Sawitzki 1991), it measures the lifetime and size of a cluster. The heuristic favors more prominent and stable clusters that live longer in the cluster tree. For example, a group that persists for a long time as a single connected component should be preferred over the two small clusters it breaks into and which quickly vanish.
However, the EOM criterion tends to produce too large clusters in practice. If a large group persists in the hierarchy for a long enough time, its children are unlikely to exceed the parent’s EOM. Alternatively, the HDBSCAN implementation by McInnes et al. (2017) offers the opportunity to extract the leaf nodes from the cluster tree. Since the leaf nodes are extracted only considering the minimum cluster size criterion, the resulting clusters lack any stability guarantee; thus, the clustering result is highly susceptible to random density fluctuations. In general, these methods suffer from complex and hardtointerpret pruning procedures and parameters, which affect the confidence and interpretability of the clustering result.
Topological methods
Extracting a flat clustering from the cluster tree requires a notion of cluster stability. As discussed, the concept of relative EOM, which inherently depends on the pruning process, can lead to too coarse clusters. A related pruning heuristic comes from considering the topological persistence of each mode in $\widehat{f}$, introduced by Chazal et al. (2013). Persistence is defined as the lifespan of each connected component. The notion of persistence is shown to be stable under small perturbations to the initial density f (Edelsbrunner et al. 2000; Zomorodian & Carlsson 2005; Ghrist 2008).
A variation on the persistence formulation is proposed by Ding et al. (2016), who instead of thresholding the cluster lifetime use cluster saliency, ν, defined by the ratio of birth and death density, as a cluster stability criterion. By varying ν between 0 and 1 the cluster tree is revealed and the most stable and longlived configuration is chosen as an appropriate clustering result.
While easy to interpret, these stability parameters can get quite tedious to select in practice. In the large data and cluster regime, the separation between stable and unstable clusters becomes less apparent. In these limiting cases, selecting the input parameters again warrants a proper parameter search.
Extracting stable and significant clusters
Compared to the notion of persistence, there is also growing research to apply statistical methods that test the modality structure of the data. These methods offer the advantage of an interpretable and meaningful parameter α, defining the significance level of a corresponding hypothesis test. The null hypothesis H_{0} commonly assumes that the data, or subsets of it, are sampled from a unimodal density, whereas the alternative hypothesis H_{1} suggests multimodality. The null hypothesis is rejected at a significance level α if the pvalue from the corresponding test procedure exceeds this significance level.
We identify first applications of hypothesis test procedures in the clustering literature in the context wrapper methods around the kmeans and EM frameworks. Gmeans (Hamerly & Elkan 2004) employs the AndersonDarling statistic to test the hypothesis that each cluster is generated from a Gaussian distribution. Instead of testing on a percluster basis, Pgmeans (Feng & Hamerly 2007) tests the whole GMM at once. Dip means (Kalogeratos & Likas 2012) proposes an incremental clustering scheme for selecting k in kmeans that employs Hartigan’s dip statistic (Hartigan & Hartigan 1985). If the distance distribution of one or more points to their cocluster members exhibits a significant multimodal structure, the cluster is split.
Skinnydip (Maurus & Plant 2016) also implements Hartigan’s dip test and applies it to 1D linear projections of the data set. Distinct density peaks are to be identified based on the gradient of the projected CDF. By projecting the data iteratively into multiple axes, the samples are partitioned into clusters. Skinnydip is specifically able to handle background noise very well; however, it considers noise samples to be uniformly distributed and clusters to be axisparallel.
These algorithms, however, are intrinsically tied to convex or Gaussian cluster assumptions. The recently proposed Mdip (Chronis et al. 2019) is able to deal with arbitrarily oriented and shaped clusters, which applies a simulation strategy to approximate values for the smallest density dips of unimodal data sets of the same size and density. However, we do not want to depend on simulations but directly obtain a measure of significance from given data.
B.2. Testing for unimodality
Here we highlight the work of Burman & Polonik (2009) more closely, whose modality test procedure we adopt in this work. The modality procedure is tied to the notion of a density dip along a path between two points in the data set. In the following, we aim to formally define the concept of such a path.
A formal description of the test procedure
We consider directed, continuous paths from x_{1} to x_{2} through input space 𝒳. By assuming there exists a parametrization r(t), with t ∈ [0, 1], the path becomes the image of r(t). With this map, we can uniquely express every point on the path via the parameter t. For example, its start and endpoints are given by x_{1} = r(0) and x_{2} = r(1), respectively.
Let f be the underlying density function and x_{1} and x_{2} two candidate modes of f. We assume, without loss of generality, that f(x_{1}) < f(x_{2}). If all possible paths undergo a density dip when moving from x_{1} to x_{2}, both points are found in two distinct modal regions:
$$\begin{array}{c}\hfill \exists \phantom{\rule{0.166667em}{0ex}}t\in (0,1):f(\mathit{r}(t))<f({\mathit{x}}_{1})\end{array}$$(B.1)
Conversely, if we can find a path between x_{1} and x_{2} where all points have a higher density than x_{1}, both points are part of the same modal region:
$$\begin{array}{c}\hfill f(\mathit{r}(t))\ge f({\mathit{x}}_{1})\phantom{\rule{1em}{0ex}}\forall \phantom{\rule{0.166667em}{0ex}}t\in (0,1].\end{array}$$(B.2)
Eq. (B.2) describes the case of singlemodality, which constitutes the null hypothesis we aim to reject. For general pairs of modal candidates, it becomes
$$\begin{array}{c}\hfill f(\mathit{r}(t))\ge \phantom{\rule{0.333333em}{0ex}}\text{min}(f({\mathit{x}}_{1}),f({\mathit{x}}_{2}))\phantom{\rule{1em}{0ex}}\forall \phantom{\rule{0.166667em}{0ex}}t\in (0,1].\end{array}$$(B.3)
An equivalent and useful formulation is obtained by taking the logarithm on both sides; after that, the left side is subtracted from the inequality:
$$\begin{array}{c}\hfill \text{SB}(t):=\phantom{\rule{0.333333em}{0ex}}\text{log}f(\mathit{r}(t))+\phantom{\rule{0.333333em}{0ex}}\text{min}(\phantom{\rule{0.333333em}{0ex}}\text{log}f({\mathit{x}}_{1}),\phantom{\rule{3.33333pt}{0ex}}\phantom{\rule{0.333333em}{0ex}}\text{log}f({\mathit{x}}_{2}))\\ \hfill .\end{array}$$(B.4)
Using the variable SB(t), we can formulate the null hypothesis as follows:
$$\begin{array}{c}\hfill \phantom{\rule{0.333333em}{0ex}}{\text{H}}_{0}:\phantom{\rule{0.333333em}{0ex}}\text{SB}(t)\le 0\phantom{\rule{2em}{0ex}}\forall \phantom{\rule{3.33333pt}{0ex}}t\in (0,1).\end{array}$$(B.5)
Rather than testing H_{0} across the entire path, a pointwise test H_{0, t} : SB(t)≤0 for some values of t is employed.
Since we do not have access to the underlying density f, we cannot directly test the hypothesis in Eq. (B.5). Instead, we have a data set of ddimensional random variables drawn from f. Given proper normalization of the coordinate axes (see Sect. 3.3.3), Burman & Polonik (2009) show that the following expression is asymptotically standard normal distributed and converges – up to a constant factor – to SB(t) as the number data samples approaches infinity:
$$\begin{array}{c}\hfill \widehat{\text{SB}}(t)=d\sqrt{k/2}[\phantom{\rule{0.333333em}{0ex}}\text{log}\phantom{\rule{3.33333pt}{0ex}}{d}_{k}(\mathit{r}(t))\phantom{\rule{0.333333em}{0ex}}\text{max}(\phantom{\rule{0.333333em}{0ex}}\text{log}\phantom{\rule{3.33333pt}{0ex}}{d}_{k}({\mathit{x}}_{1}),\phantom{\rule{3.33333pt}{0ex}}\phantom{\rule{0.333333em}{0ex}}\text{log}\phantom{\rule{3.33333pt}{0ex}}{d}_{k}({\mathit{x}}_{2})].\end{array}$$(B.6)
Here d_{k}(x) denotes the distance to the kth nearest neighbor of the point x. The distance is an approximation to the density f. Due to their inverse proportionality, the sign is flipped between Eq. (B.4) and Eq. (B.6), and the minimum is replaced with the maximum function.
Since the corresponding test statistic $\widehat{\text{SB}}(t)$ is approximately standard normally distributed, the null hypothesis is rejected at significance level α if
$$\begin{array}{c}\hfill \widehat{\text{SB}}(t)\ge {\mathrm{\Phi}}^{1}(1\alpha ),\end{array}$$(B.7)
where Φ is the standard normal CDF. Therefore, if any t ∈ (0, 1) fulfills condition (B.7), H_{0} is rejected.
Due to the employment of the kNN technique, this test procedure applies naturally to multivariate data without the need to project the data onto a 1D line, as is the case for most modality tests. Furthermore, nearest neighbor queries have access to very efficient algorithms such as the kdtree (Bentley 1975), which reduces neighbor searches to only 𝒪(log N) distance computation. Thus, these considerations allow us to study the modality structure of the data set at Gaia data scales without careful projection loss considerations.
Empirical results
Burman & Polonik (2009) describe the iterative application of the test procedure to modal candidates to cluster the data into significant modal regions. However, the test is employed along the straight line path connecting two modes, which limits the procedure to convex cluster shapes only. Moreover, enough samples must be tested along the path to detect significant dips reliably.
We provide a natural extension to the presented procedure, which applies to arbitrary cluster shapes while reducing the number of test evaluations to one. In Sect. 3.2, we describe the modification in more detail and argue that reducing the original method to a single pointwise evaluation at the saddle point leaves the test unchanged.
To substantiate this statement, we empirically validate our method on simulated data. In particular, we aim to show that these changes do not affect the test statistic distribution under the null hypothesis. According to Burman & Polonik (2009), the test statistic $\widehat{\text{SB}}$ is standard normally distributed under H_{0}, which assumes unimodality between two points in the input space.
Figure B.1 shows the test statistic distribution $\widehat{\text{SB}}$ of our reduced test procedure on a unimodal data set. To faithfully test the distribution under H_{0}, we require a unimodal data set with Gaia and ScoCenlike positions and kinematic properties, realistic errors, and suitable size. Since the Gaia EDR3 mock catalog (Rybizki et al. 2020) reproduces kinematic features found in the Milky Way, its unimodality is guaranteed.
Fig. B.1. Test statistic distribution, $\widehat{\text{SB}}$, of our reduced test procedure on a unimodal data set. The distribution is stable to variations in different parameters and respective values (as shown in the corresponding rows and columns). 
Instead, we directly use data within the box defined around ScoCen in Eq.(1). We randomly shuffle the observations in each feature to remove any local overdensities. This procedure leaves marginal distributions unchanged while fully decorrelating the data. Since all marginal distributions are unimodal, the joint distribution (implicitly constructed as a factorization of marginals) is equally unimodal.
To gauge parameter effects on the test statistic distribution, we vary the sample size and three SigMA parameters across different values^{32}. The parameters are the overall sample size, the k parameter of kNN density estimation method (see Sect. 3.3.2), the β parameter of the underlying βskeleton graph (see Sect. 3.3.1), and the velocity scaling factor (see Sect. 3.3.3).
As shown in Fig. B.1, varying the given parameters within a sensible range does not modify the test statistic distribution. Due to its stability, a single test statistic distribution under H_{0} can be universally assumed over different parametrizations of SigMA.
The $\widehat{\text{SB}}$ distribution on unimodal data closely follows a zeromean Gaussian. However, in our tests, the standard deviation differs slightly from unity as stated by Burman & Polonik (2009). We update its value to the average standard deviation across our tests of σ = 0.78.
B.3. Determining velocity scaling factors
Here we discuss the derivation of the scaling factor distribution, which we used to weigh the velocity subspace in the clustering process. A more detailed justification is provided in Sect. 3.3.3.
We replace the scaling factor variable c_{v} with y to simplify and shorten the reading flow. Additionally, compared to the main text, we denote the distance to a cluster with r instead of d. This notation makes the integration alongside the differential dr easier to read (otherwise, the differential would be dd).
B.3.1. Statistical model
Figure 4 shows the relation between a cluster’s distance and the scaling factor y alongside determined uncertainties. We observe an approximately linear relationship between the cluster distance and the scale factor that we aim to model. We observe that determined uncertainties for the ith data point σ_{i} cannot account for deviations from any hypothetical regression line. Thus, we include the simple assumption in the model that reported variances are underestimated by some factor, f. The scaled variance for the ith sample becomes the following:
$$\begin{array}{c}\hfill {s}_{i}^{2}={\sigma}_{i}^{2}+{f}^{2}(m{x}_{i}+b).\end{array}$$(B.8)
Here the parameters m and b represent the slope and intercept of the regression line, respectively.
We assume Cauchy distributed deviations from the regression line to further reduce the influence of outliers. The Cauchy distribution provides a robust statistical model that naturally handles observations with considerable deviations from the mean with its longerthannormaltails (Lange et al. 1989). The scale parameter becomes the adjusted variance in Eq. (B.8). In this model, given a distance to a stellar cluster, r_{i}, a modified scaling factor uncertainty, s_{i}, a fractional amount, f, a slope, m, and an intercept, b, the density of observed velocity scaling factors p(y_{i} ∣ r_{i}, s_{i}, f, m, b) becomes
$$\begin{array}{c}\hfill p({y}_{i}\mid {r}_{i},{s}_{i},m,b)={\left[\pi {s}_{i}(1+\frac{{({y}_{i}m{r}_{i}b)}^{2}}{{s}_{i}^{2}})\right]}^{1}.\end{array}$$(B.9)
By factoring these conditional probabilities for the N data points (assuming statistical independence between them), we obtain the likelihood:
$$\begin{array}{c}\hfill \mathcal{L}={\displaystyle \prod _{i=1}^{N}p({y}_{i}\mid {r}_{i},{s}_{i},m,b).}\end{array}$$(B.10)
Using Bayes’ theorem, we obtain the posterior probability density function (PDF):
$$\begin{array}{c}\hfill p(m,b,f\mid {\{{y}_{i}\}}_{i=1}^{N},I)\propto p({\{{y}_{i}\}}_{i=1}^{N}\mid m,b,f,I)\phantom{\rule{0.166667em}{0ex}}p(m,b,f\mid I).\end{array}$$(B.11)
The PDF p(m, b, f ∣ I) describes our prior knowledge of the line parameters (m, b) and the scaled variance f, while I summarizes all the prior knowledge of the r_{i} and σ_{i}. We employed weakly informative priors (as the problem is relatively low dimensional, this does not affect the inference outcome) with normal, zeromean prior densities for m and b with a standard deviation of 20. The prior PDF for f follows a halfCauchy distribution with β = 10. The HalfCauchy distribution provides a suitable prior PDF as it is truncated to have nonzero probability density on ℝ^{+}.
To obtain samples from the posterior PDF, we used PyMC (Salvatier et al. 2016), a publicly available code that implements the MCMC method. For each line parameter (m, b), we computed its maximum a posteriori position, representing the bestfitting line. We generate samples from the posterior predictive distribution to estimate the uncertainty in y around the mean at each location r. Figure 4 and B.3 show the 1σ credible interval determined via computing the 68% highdensity interval (HDI) of generated posterior predictive samples.
B.3.2. Model caveat
The relationship between the dispersion in position and velocity space is, in principle, not affected by the observer. We conjecture that the observed correlation likely stems from different magnitudes of observational uncertainties between tangential velocities and the parallax. Figure B.2 shows the distribution of the ratio between measured distance and tangential velocity uncertainties as a function of distance^{33}. We identify an almost perfect linear trend when plotting the rolling median (window size 5 pc) for sources within 500 pc. This trend suggests that, on average, the distance error (i.e., parallax error) grows faster than the tangential velocity error when moving away from the observer (which is to be expected). This relationship suggests a faster growth rate in positional cluster dispersion than in onsky motions caused by the convolution with (on average) larger uncertainties. It directly affects the density by smearing out the unobserved source distribution in respective positional and kinematic subspaces at different scales as a function of distance, which we aim to counteract using scaling factors conditioned on distance.
Fig. B.2. Frequency distribution of the ratio between measured distance and tangential velocity uncertainties as a function of distance. We observe a linear trend (suggested by the rolling median, window size 5 pc) between this ratio and distance. We hypothesize that the empirical distancescaling relationship (see Fig. 4) is caused by this trend. 
This relationship suggests deriving the scaling factors directly from the observed error ratio distribution. However, this entails propagating the uncertainties through the complex selection functions of clustering algorithms. Instead, we aim to derive the relationship directly from data from extracted clusters in the Gaia data set (see Sect. 3.3.3 and Fig. 4).
As a final validation to choosing a linear model (assuming a correlation between distance and the dispersion relation) over a constant one (assuming variable independence), we fit a constant and linear model to the data. As the linear model, the constant model assumes Cauchy distributed deviations from the mean and assumes that reported variances are underestimated by some factor f. We employed the same weakly informative priors for the intercept and factor parameters (b, f). Using samples drawn from the posterior distribution and the computed log pointwise posterior predictive density, we applied leaveoneout crossvalidation (Vehtari et al. 2017) and the widely applicable information criterion (Watanabe 2010) to perform model selection. We find that both criteria significantly favor the linear over the constant model.
B.3.3. A distribution over scaling factors
We aim to obtain the distribution f(y ∣ r_{0} ≤ r ≤ r_{1}), which describes the scaling factor (y) behavior for a given range of distances to clusters of interest. A simple way to find this distribution is to interpret the empirical linear model g(r) and associated Gaussian uncertainties as an improper probability function f(r, y)^{34}.
As we are dealing with an improper PDF, we consider the following proportionality condition and handle the normalization of the lefthand side later:
$$\begin{array}{cc}\hfill f(y\mid {r}_{0}\le r\le {r}_{1})& \propto {\displaystyle {\int}_{{r}_{0}}^{{r}_{1}}f(r,y)\phantom{\rule{0.166667em}{0ex}}dr}\hfill \\ \hfill & \propto {\displaystyle {\int}_{{r}_{0}}^{{r}_{1}}f(y\mid r)f(r)\phantom{\rule{0.166667em}{0ex}}dr.}\hfill \end{array}$$(B.12)
Since f(r)∝1 is approximately independent of the distance r, we can add it to the yet unknown constant normalization factor and move it out of the integral. Hence, we can write the target distribution as
$$\begin{array}{c}\hfill f(y\mid {r}_{0}\le r\le {r}_{1})\propto {\displaystyle {\int}_{{r}_{0}}^{{r}_{1}}f(y\mid r)\phantom{\rule{0.166667em}{0ex}}dr.}\end{array}$$(B.13)
Thus, to obtain a solution to Eq. (B.13), we need an expression for the conditional PDF f(y ∣ r). We assume that inlier data are Gaussian distributed around the linear model with a variable standard deviation σ(r), which we estimate from the 1σ HDI of generated posterior predictive samples. The distribution of scaling parameters y conditioned on the distance r can then be approximated via the following expression:
$$\begin{array}{c}\hfill f(y\mid r)\propto \phantom{\rule{0.333333em}{0ex}}\text{exp}(\frac{{(yg(r))}^{2}}{2\sigma {(r)}^{2}}).\end{array}$$(B.14)
Figure B.3 schematically shows the integrating process where the conditional PDFs f(y ∣ r) are shown for r = 100 and r = 200.
Fig. B.3. Scaling factor determination via the empirical distancescaling relationship. The scaling factor distribution for clusters at a distance between 100 and 200 pc depends on the conditional distribution of scaling factors at a given distance, f(c_{x}/c_{v} ∣ r). 
Since we do not have any analytic expression for σ(r), we numerically approximate the integral in Eq. (B.14). The top part of Fig. B.4 shows the resulting PDF when we solve f(y ∣ r_{0} ≤ r ≤ r_{1}) for sources in ScoCen, where we assume a minimum distance of r_{0} = 100 and a maximum distance of r_{1} = 200. Here an immediate caveat of our simple symmetric model uncertainty assumption becomes apparent; the resulting distribution has infinite support and, thus, a nonzero probability density for f(y < 0 ∣ r). Since negative scaling is physically meaningless, we limit the distribution to ℝ^{+}, thus setting f(y < 0 ∣ r) = 0 and normalizing the PDF to integrate to unity in ℝ^{+}.
Fig. B.4. PDF and CDF of the scaling factor conditioned on a given range of distances. The ten red scatter points indicate samples drawn from the tenquantile splitting procedure. We separate the PDF into ten continuous intervals with equal probabilities from which we derive samples as the mean position of these intervals. 
We consider sampling strategies to obtain scaling factors for the clustering process. Random sampling can generate almost identical realizations, so the possible solution space might not be covered evenly. Since we need to perform a separate clustering run for each sample drawn, keeping the number as small as possible is essential. To cover the space evenly while considering the underlying probability distribution, we select a set of ten samples that represent ten quantiles of the PDF. We separate the PDF into ten continuous intervals with equal probabilities from which we derive samples as the mean position of these intervals.
To compute the quantiles, we numerically determine the CDF of f(y ∣ r_{0} ≤ r ≤ r_{1}). The CDF for r ∈ [100, 200] is shown in the bottom part of Fig. B.4. The ten red scatter points^{35} indicate samples drawn from the 10quantile splitting procedure where horizontal lines indicate equal probability intervals. To invert the CDF and obtain scaling fraction samples from F^{−1}(y ∣ r_{0} ≤ r ≤ r_{1}) we used a numerical approximation^{36}.
B.4. Influence of missing radial velocities on the bulk velocity determination
In this section we examine the influence of missing radial velocities v_{r} on the accuracy of determined bulk velocities. As discussed in Sect. 3.5.2, the bulk velocity of a sample is determined by minimizing observed proper motions and radial velocities against theoretical observables for a given cluster bulk motion. Since radial velocities are not always available (in our sample ∼20% of sources have v_{r} measurements), especially as more distant objects are studied, we estimate how the fraction of missing v_{r} affects the accuracy of determined bulk velocities and consequently inferred radial velocities of cluster members. For this purpose, we simulate a given fraction of missing v_{r} in our open cluster and compact cluster samples (see Sect. 4.2 for detailed descriptions of these samples) and compare the inferred radial velocities of cluster members to known true values. We repeat this process for several fractions of missing v_{r}, ranging from 1% missing to 100% missing, and calculate the mean relative error (and the 1σ quantile) between the resulting calculated radial velocities (via the inferred bulk motion; see Sect. 3.5.2) and the ground truth. Figure B.5 shows the influence of various fractions of missing v_{r}s on the relative error.
Fig. B.5. Simulating the impact of the fraction of missing v_{r} measurements on the relative error of inferred radial velocities. The top panel shows relative errors of inferred v_{r} on the open cluster sample as a function of missingness. The bottom panel shows the result of this study using the compact cluster sample. We find twice larger errors on the compact cluster sample. Additionally, the estimation procedure seems almost independent of the number of missing radial velocity measurements in both samples, except in a small region of the compact cluster sample, where the error increases if fewer than 5% of sources have v_{r} measurements. 
First, we find that relative errors of inferred v_{r} on the open cluster sample are, on average, around twice (1.97 ± 0.51) as low compared to inferred v_{r} from the compact cluster sample. This difference is consistent with results reported in Sect. 4.2, where we report very low contamination in clustering solutions on the open cluster sample compared to the compact cluster sample. Further, the cluster sizes in the open cluster sample are, on average, four times larger, providing significantly better statistics to determine the bulk motion from the minimization procedure in Eq. (12).
Second, we identify negligible correlation between the relative error in inferred radial velocities and the fraction of missing v_{r} measurements. In particular, the open cluster sample shows no performance loss as the number of missing radial velocities goes to 100%. From this result, we conjecture that given large enough clusters, the bulk velocity determination discussed in Sect. 3.5.2 is sufficiently constrained by the observed proper motions and, hence, independent of radial velocity measurements. Although the relative errors also seem mostly independent of the number of missing radial velocity measurements, we observe an increased error of inferred v_{r} in the compact cluster sample as the missingness goes toward 100%. When no radial velocities are available, the relative uncertainty of inferred v_{r} grows around twofold (mean relative error of 0.38 versus 0.21). At the same time, the statistical dispersion (measured via the 1 σ range) appears to be almost stable when compared to cases with access to v_{r} measurements. Since the v_{r} estimation procedure seems independent of the number of missing radial velocity measurements for less than 95% missingness, we recommend using SigMA on data with at least 5% of available v_{r} measurements to keep radial velocity estimation errors to a minimum. Since SigMA identifies clusters in 5D phase space and determines members most effectively if at least 5% of input stars have radial velocity measurements, we refer to the entire SigMA pipeline working in 5.5 dimensions (5.5D).
B.5. Consensus clustering
This section discusses algorithmic means to identify stable cluster solutions from an ensemble of clustering results. To identify robust solutions, we represent all clusters across the clustering ensemble in a graph. Each cluster of a single solution from the ensemble is represented by one node. We connect two nodes via an edge if two clusters share at least one common point. Figure B.6 highlights this step in the first two frames. The ensemble comprises three clustering solutions. The individual runs A, B, and C contain three, two, and four clusters, respectively. Each source is classified into a single cluster for a given run, resulting in disjoint sets. Thus, edges connect clusters from different runs.
Fig. B.6. Consensus clustering pipeline for a simple example using an ensemble of three clustering solutions: A, B, and C. (2) Clusters from the ensemble are linked in a graph based on overlapping points. (3) Edges between clusters are removed if the overlap between their members is insufficient to assume a common cluster solution. (4) Cliques that represent stable cluster solutions, i.e., consensus clusters, are extracted. (5) A voting strategy determines the assignment of individual sources to cliques. 
Edges in the graph are weighted by the corresponding Jaccard similarity, which measures their common overlap. The Jaccard similarity is defined as the ratio of the intersection of two sets over their union. Typically, a 0.5 or greater value indicates a high similarity between two sets. Since this linkage is fundamental for determining the consensus result, it should avoid connecting dissimilar clusters. Thus, we remove edges with a weight below 0.5 (see Fig. B.6, step 3). This threshold is quite conservative, as it can separate similar cluster solutions, for example if one cluster is a subset of the other. We relax the cut criterion in the following way to avoid overpruning the graph. Two clusters a and b with respective sizes n_{a} and n_{b}, where n_{a} > n_{b}, are linked if the n_{b} densest points of a amount to a Jaccard similarity of 0.5 or greater with points from cluster b. This criterion guarantees connectivity between cluster extractions at different isosurface thresholds while separating ties to clusters that randomly fragment into multiple subclusters.
Robust clusters throughout the ensemble will have strong connections to their counterparts from different runs while having none or very weak connections to other nodes in the graph. Thus, a robust cluster solution builds a strong clique in this graph^{37}. In contrast, unstable clusters will have many weak connections to many other clusters from different runs but none or very few strong ones; see Fig. B.6, panel 4.
To extract all stable cluster solutions, we aim to identify all strongly connected cliques in the graph (i.e., the consensus result). Since individual sources and clusters can be part of several cliques, we employed a voting strategy to determine the final data partitioning (VegaPons & RuizShulcloper 2011). Each clique is represented as a vector of length N, where N is the number of sources in the data set. The N values correspond to the sum of the individual clusters represented as 0–1 vector, where all entries are 1 for sources inside a cluster and 0 elsewhere. Each source is then associated with a single clique by maximizing the respective entry at the source’s position (in the vector). Thus, the larger a clique, the more likely it wins a vote. To favor robust cluster solutions, we multiply each vector by the median of its connection strengths. This number is maximized if the cluster is unchanged throughout different runs. This step is summarized in panel 5 in Fig. B.6.
B.6. Testing the independence assumption of resampled Gaia data
When testing for multimodality, SigMA simulates multiple realizations of the Gaia data set to see how robust density dips between pairs of neighboring peaks are. In Sect. 3.4, we discuss how these realizations are used in a global hypothesis test that determines whether samples of pairwise adjacent density enhancements are part of a single underlying mode. This global hypothesis test combines individual tests from each realization and evaluates the global null hypothesis that no pvalue is significant.
However, the choice of combining different pvalues is affected by the correlation structure between distinct tests. Typically, statistical tests require independence across tests to guarantee proper levels of specificity and sensitivity. In the case of the commonly used Fisher’s combination test (Fisher 1934), positively correlated pvalues increase the chance of type I errors (rejects a true null hypothesis). To investigate the independence assumption of pvalues across resampled data sets, we consider the effects that resampling has on the modality test procedure.
The modality test is fully described by the dimensionality of the data and, most importantly, by the kdistances of modal and saddle points. Resampling the data set influences these kdistances. The sampling of new data points is done with Gaussians centered on mean astrometric observables with heteroscedastic error covariance matrices available in the Gaia database. We aim to use the covariance matrix in relation to the nearest neighbor distance to estimate the interdependence of pvalues.
As the entries of the covariance matrix shrink to zero, the resampled data points converge to the original data points until they eventually coincide. At the same time, the pvalues across different data sets approach the same value, leading to a perfect correlation between them. On the other hand, if the standard deviation along its principle axis (or any other direction for that matter) extends far beyond a point nearest neighbor or even its kdistance, the resampled data differs substantially from the original one, decorrelating the pvalues.
Figure B.7 illustrates this relationship for data in the ScoCen box. The histogram shows the relative uncertainty of data points (i.e., the ratio of positional uncertainty given by the error covariance matrix over the nearest neighbor distance). The distances are computed in the space of observed astrometric quantities where the uncertainties are assumed Gaussian. Most data points are far below unity (i.e., errors are small relative to their absolute values). The majority (68%, i.e., 1σ) has a relative uncertainty below 0.08 (see the 68, 95, and 99.7% percentiles in Fig. B.7 marked as 1σ, 2σ, and 3σ, respectively). The concentration of relative uncertainty values at zero indicates that kdistances across resampled data sets are strongly correlated. Thus, we cannot assume independence of pvalues across different samples.
Fig. B.7. Loglog histogram showing the relative uncertainty of data points, i.e., the ratio of positional uncertainty given by the error covariance matrix over the nearest neighbor distance. The majority of data points are far below unity. The concentration of relative uncertainty values at zero indicates that kdistances across resampled data sets are strongly correlated. The 1σ, 2σ, and 3σ lines indicate the percentiles containing 68, 95, and 99.7% of the respective distribution. 
B.7. Distribution of pointwise densities
Instead of directly recovering clusters from phase space data, SigMA extracts modal regions, a mixture of field stars and cluster members. To separate signal from background, we employed a densitybased classifier that selects cluster members as an overdensity over the background. We consider the distribution of field stars and cluster members in univariate density to determine a density threshold automatically.
As discussed in Sect. 3.5, we approximately can treat the field star content as a uniform distribution locally around a cluster in phase space. A uniform distribution in phase space translates to a Gaussian distribution in 1D density space (see Sect. 3.5 for a more detailed discussion).
Cluster members are commonly modeled as multivariate Gaussians (e.g., Gagné et al. 2014; Sarro et al. 2014; Crundall et al. 2019; Riedel et al. 2017). Although observational findings point to more complex morphologies (e.g., Meingast & Alves 2019; Röser et al. 2019; Jerabkova et al. 2021; Meingast et al. 2019; Kounkel & Covey 2019; CantatGaudin et al. 2019a; Wang & Ge 2021; Coronado et al. 2022), the Gaussianity assumption provides a good starting point to consider the pointwise univariate density distribution. Figure B.8 shows multiple distributions of pointwise density estimates of 100, 000 samples drawn from an Ndimensional Gaussian^{38}. The left column shows the likelihood of individual samples. It provides an assessment of the local density under the true model. The number of samples with a relatively high likelihood decreases exponentially as the dimension N increases from one (top row) to six (bottom row).
Fig. B.8. Distribution of pointwise density estimates of samples drawn from an Ndimensional Gaussian. The left column shows the likelihood of individual samples. It provides an assessment of the local density under the true model. The right column shows pointwise densities estimated via the kNN technique. Increasing the dimensionality, N, from one to six (top to bottom row), the distribution of pointwise densities approaches a Gaussian. 
The curse of dimensionality plagues neighborhood queries in high dimensions. As the dimensionality grows, points are increasingly isolated, making neighborhoods no longer local. This effect can already be seen for moderate dimensions in the right column of Fig. B.8. It shows pointwise density estimates obtained via the kNN technique. Around the fourth dimension, the distribution of kdistances starts to converge to a normal distribution, which incorrectly suggests an underlying uniform distribution in Ndimensional input space.
We computed pointwise density estimations in five and six dimensions. Although we cannot specifically write down a generative model for stellar clusters in phase space, we can assume that neighborhood queries are strongly affected by the given dimensionality. Thus, we model background and signal contributions as Gaussians in univariate density space.
B.8. Parameter optimization
The parameter choices of our proposed SigMA analysis pipeline are tuned to Gaia data (see Sect. 3.3). In contrast, DBSCAN and HDBSCAN, which we used to compare and test our clustering technique, are general clustering techniques whose parameters we must set. Instead of using subjective, errorprone prior knowledge to determine suitable parametrizations, we search the space of possible clustering results for the best result. This strategy measures the peak performance (in case of comprehensive/absolute prior knowledge) a clustering algorithm can achieve. Thus, a comparison against the best results allows for a discussion on methodological advantages and disadvantages rather than reflecting poor parameter selection.
To search the space of possible model configurations, we employed a grid search to evaluate the clustering algorithm on a regular grid in parameters space. A grid search has significant benefits in this scenario compared to other parameter tuning methods, such as random search or Bayesian optimization. First, a large portion of the parameter space is discrete. Thus, a finite step size can cover the entire parameter space in these subspaces. Second, the parameter spaces are low dimensional (maximally 4D), allowing densely spaced samples in each parameter axis. Third, compared to random search and Bayesian optimization, a grid search is deterministic and, thus, provides reproducibility. Further, as the grid is predetermined (compared to Bayesian optimization), it allows the computation to be fully parallelized, thus guaranteeing dense sampling in reasonable times.
To evaluate the performance of a clustering model (i.e., parameter choice), a range of metrics are used, including both clustering validation metrics (such as NMI, AMI, and ARI) and classification metrics (such as recall, precision, accuracy, balanced accuracy, and MCC). The bestperforming model is determined by selecting the model with the highest score across these metrics. Since all of these metrics report on slightly different model summaries, some scores might show large outlying (either abnormally high or low scores) values. To remove their influence, the best parameter choice is determined as maximizing the median across these eight metrics.
While we determine optimal parameters on the “open cluster sample” (see Sect. 4.2.1) by simply running the grid search, the “compact cluster sample” (see Sect. 4.2.2) is a realization of a random effects model, making the model parameters themselves stochastic variables. In the latter case, we aim to find expected values for respective clustering parameters across ten realizations of the compact cluster sample. Optimal parametrization is expected to fluctuate between data realizations. However, the compact cluster sample is a single cluster region to which a unique parametrization should be applied. Thus, after searching optimal parameter sets across the ten individual data samples, we took the median (to reduce the effect ) of the resulting parameters and reran the algorithm on every sample to obtain the clustering score reported in Table 2.
In the following subsections, we describe algorithmspecific parameter choices of the employed grid search. We also search optimal values for each algorithm for the velocity scaling factor. We adopt values discussed in Appendix B.3.
DBSCAN
DBSCAN has two main parameters, epsilon and min_samples. The parameter epsilon describes a neighborhood radius; in particular, it is the maximum distance within which two samples are considered neighbors. The parameter min_samples denotes the minimum number of samples required in an epsilonneighborhood to be considered a cluster (or specifically a core point that seeds a cluster).
Since we normalize the velocities to the spatial subspace XYZ we can treat epsilon as a distance in parsecs. Together with the given minimum number of points, it defines a minimum density needed to be considered a cluster. We search for optimal results within a range of epsilon ∈[2, 25] pc with a step size of 0.5 pc. At the same time, we vary the minimum number of samples in the following range: min_samples ∈[4, 40] with a step size of 2.
We find optimal parameters for the open cluster sample to be (epsilon, min_samples, c_{v}) = (9.5, 6, 5.95). For the compact cluster sample, we find an optimal solution with (epsilon, min_samples, c_{v}) = (9.0, 20, 5.95)
HDBSCAN
HDBSCAN has three main parameters, min_cluster_size, min_samples, and cluster_selection_method. Intuitively, min_cluster_size determines the smallest cluster sizes that HDBSCAN considers. Which points are still associated with a cluster is determined by min_samples. By increasing min_samples, clusters are progressively forced into denser areas leaving more points to be declared as noise. The parameter cluster_selection_method determines how clusters are selected from the cluster tree hierarchy (see Appendix B.1 for a detailed discussion).
We search for optimal results within a range of min_cluster_size ∈[20, 100] with a step size of 2. At the same time, we vary the minimum number of samples min_samples in the same range and step size while requiring min_cluster_size ≥ min_samples^{39}. The selection method parameter can take the following values: cluster_selection_method ∈ [“leaf”, “EOM”].
We find optimal parameters for the open cluster sample to be (min_cluster_size, min_samples, cluster_selection_method, c_{v}) = (60, 60, “EOM”, 8.51). For the compact cluster sample, we find an optimal solution with (min_cluster_size, min_samples, cluster_selection_method, c_{v}) = (24, 18, “EOM”, 5.95).
Appendix C: Projected velocities
The reflex motion of the Sun influences how the observed tangential velocities are distributed in v_{α}/v_{δ} space. Figure C.1 shows the theoretical positions of objects if they follow a circular orbit around the Galactic center at given positions within the Galactic potential. The orbits are estimated within a Milky Way potential, including a disk, bulge, and halo component, using the python package galpy by Bovy (2015) (galpy.potential.MWPotential2014; galpy.potential.vcirc) and assuming the LSR velocity from Schönrich et al. (2010). The projected motions are given for all Galactic longitude (l) positions at two distances (d) of 100 pc and 200 pc and at three Galactic latitudes (b) of −20°, 0°, and 25°. These d and b ranges encompass the ScoCen region, which reaches from about l = 0° to 290°. The members of ScoCen within the selected SigMA clusters are plotted as gray dots in Fig. C.1.
Fig. C.1. Tangential velocities in the v_{α}/v_{δ} plane, showing the theoretical locations of sources with circular Galactic orbits and LSR velocities. Six different cases are shown, while each line represents sources at all l positions. The six cases are for two different distances (100 pc, dashed lines; 200 pc, dashdotted lines), and for three different b positions (b = −20°, green; b = 0°, blue; b = 25°, magenta). The indicated longitude positions at l = 0° (box symbols) and l = 290° (diamond symbols) roughly mark the eastern and western borders of ScoCen. The SigMAselected ScoCen members are shown as gray dots (without stability cut). See also Fig. 13 for a separation of the clusters and for the v_{α, LSR}/v_{δ, LSR} plane. 
Overall, the young stellar clusters in ScoCen seem to roughly follow expected motions in our Galaxy, assuming they follow the LSR velocity. The figure additionally highlights the issues with the projected tangential velocity plane v_{α}/v_{δ}, which is a function of position in the sky and the Sun’s motion. Very nearby stellar clusters, like in nearby young local associations, could cover large areas in the sky and consequently also in the tangential velocity plane while being simultaneously confined in 3D velocity space (UVW). This effect can be reduced if computing the tangential velocities relative to the LSR v_{α, LSR}/v_{δ, LSR}, which eliminates the influence of the reflex motion of the Sun (see also Sect. 2 and Appendix A). A comparison of the two velocity spaces (v_{α}/v_{δ} and v_{α, LSR}/v_{δ, LSR}) is shown in Fig. 13.
Appendix D: The Gaia DR3 CMD
In this section we first describe our procedure to estimate the fraction of possible contaminants from older stellar populations (or field stars) in the SigMAselected ScoCen sample using the Gaia CMD, and second, we estimate the fraction of substellar objects (brown dwarf candidates).
D.1. Estimating the contamination from older sources
Figure 14 in Sect. 5 shows a Gaia CMD using the magnitudes from the Gaia DR3 passbands G, G_{BP}, and G_{RP}. Not all sources have detections in all three passbands; within the SigMAselected ScoCen members, 12,724 (97%) sources have an entry in all three passbands. The absolute magnitude M_{G} is calculated with the distance modulus using the inverse of the parallax as distance. To estimate how many SigMAselected ScoCen members are consistent with the expected ages (≲20 Myr) and which sources could be contaminants from older populations (or field stars), we first need to apply photometric quality criteria. Inferior photometric quality mostly affects the fainter lowmass sources and shifts them to the left in the Gaia CMD. We used the magnitude errors and the photometric flux excess factor C, which are defined as follows:
$$\begin{array}{cc}& {G}_{\mathrm{err}}=1.0857/\mathtt{phot\_g\_mean\_flux\_over\_error}\hfill \\ \hfill & {G}_{\mathrm{RP},\mathrm{err}}=1.0857/\mathtt{phot\_rp\_mean\_flux\_over\_error}\hfill \\ \hfill & {G}_{\mathrm{BP},\mathrm{err}}=1.0857/\mathtt{phot\_bp\_mean\_flux\_over\_error}\hfill \\ \hfill & C=\mathtt{phot\_bp\_rp\_excess\_factor}=({I}_{\mathrm{BP}}+{I}_{\mathrm{RP}})/{I}_{G}\hfill \\ \hfill & {C}^{\ast}=\mathrm{corrected}\phantom{\rule{0.166667em}{0ex}}\phantom{\rule{0.166667em}{0ex}}\phantom{\rule{0.166667em}{0ex}}C.\hfill \end{array}$$(D.1)
The flux excess factor, C (Evans et al. 2018; Riello et al. 2021), gives the flux excess in the G_{BP} − G_{RP} color relative to the G band flux. It is recommended to use the corrected C (denoted as C^{*}) as given in Riello et al. (2021), reducing color dependence. Using these parameters, we applied the following quality criteria to the photometry:
$$\begin{array}{cc}& {G}_{\mathrm{err}}<0.007\phantom{\rule{0.166667em}{0ex}}\phantom{\rule{0.333333em}{0ex}}\text{mag}\hfill \\ \hfill & {G}_{\mathrm{RP},\mathrm{err}}<0.03\phantom{\rule{0.166667em}{0ex}}\phantom{\rule{0.333333em}{0ex}}\text{mag}\hfill \\ \hfill & {G}_{\mathrm{BP},\mathrm{err}}<0.15\phantom{\rule{0.166667em}{0ex}}\phantom{\rule{0.333333em}{0ex}}\text{mag}\hfill \\ \hfill & {C}^{\ast}<0.3.\hfill \end{array}$$(D.2)
These cuts reduce the number of ScoCen members from 12,724 with complete photometric information to 11,162 (leaving 88%), mainly reducing the number of the fainter lowmass stars. If not applying any quality criteria, many sources would be shifted toward older ages only because of unreliable Gaia colors.
Figure 14 shows two isochrones. The dashed line is a 25 Myr isochrone from Baraffe et al. (2015) (BHAC15^{40}) for Gaia DR3/EDR3 passbands. The solid line is a 25 Myr isochrone from PARSEC^{41} for Gaia DR3/EDR3 passbands (e.g., Bressan et al. 2012; Chen et al. 2014, 2015; Marigo et al. 2017; Riello et al. 2021), including the upper–mainsequence (UMS), which is missing in the BHAC15 models. We used both models since BHAC15 models deliver a better representation of lowmass stars.
To get a measure for the contamination from older sources (older than the expected ∼20 Myr), we select sources to the left of the two 25 Myr isochrones, allowing for random scatter around the 20 Myr isochrone (in particular young stars often show higher variability than mainsequence stars). Additionally, we did not consider sources at the UMS since there the trend reverses (younger sources are to the left of the UMS). Hence, we applied a cut at M_{G} > 3 mag, only including fainter sources to select older stellar candidates.
The combined conditions deliver 760 candidate contaminants (and 10,402 young ScoCen candidates) out of 11,162 sources with applied photometric quality criteria. This is about 6–7% possible contaminants from older populations within the SigMA clusters. Considering the chosen borders, we stress that this separation can only be seen as a rough estimate. In particular, we did not consider any possible contaminants in the UMS regime, where it is more difficult to distinguish young members from older stellar populations. Moreover, the chosen isochrone models have intrinsic uncertainties, and any change in metallicity or extinction is ignored in our test. Finally, we only examine sources from the CMD in Fig. 14 with the applied photometric quality criteria from Eq. (D.2). Hence, we cannot make any statement for sources with inferior photometric quality, which often suffer from higher astrometric uncertainties and which could be shifted in the CMD space. Consequently, such sources could also have a higher probability of having a wrong cluster membership solely due to the generally larger scatter in various dimensions.
To better understand the influence of the quality criteria, we repeat the selection of old star contamination with different photometric quality cuts. First, we consider the case of applying no cuts, using the 12,724 sources with entries in all three Gaia bands, delivering a contamination fraction from older sources of about 13%. Next, we applied somewhat looser and also stricter quality criteria than given in Eq. (D.2) as follows, first showing the looser cuts (D.3) and then the stricter cuts (D.4):
$$\begin{array}{cc}& {G}_{\mathrm{err}}<0.01\phantom{\rule{0.166667em}{0ex}}\phantom{\rule{0.333333em}{0ex}}\text{mag}\hfill \\ \hfill & {G}_{\mathrm{RP},\mathrm{err}}<0.045\phantom{\rule{0.166667em}{0ex}}\phantom{\rule{0.333333em}{0ex}}\text{mag}\hfill \\ \hfill & {G}_{\mathrm{BP},\mathrm{err}}<0.25\phantom{\rule{0.166667em}{0ex}}\phantom{\rule{0.333333em}{0ex}}\text{mag}\hfill \\ \hfill & {C}^{\ast}<0.5\hfill \end{array}$$(D.3)
$$\begin{array}{cc}& {G}_{\mathrm{err}}<0.004\phantom{\rule{0.166667em}{0ex}}\phantom{\rule{0.333333em}{0ex}}\text{mag}\hfill \\ \hfill & {G}_{\mathrm{RP},\mathrm{err}}<0.015\phantom{\rule{0.166667em}{0ex}}\phantom{\rule{0.333333em}{0ex}}\text{mag}\hfill \\ \hfill & {G}_{\mathrm{BP},\mathrm{err}}<0.05\phantom{\rule{0.166667em}{0ex}}\phantom{\rule{0.333333em}{0ex}}\text{mag}\hfill \\ \hfill & {C}^{\ast}<0.3.\hfill \end{array}$$(D.4)
With the looser cuts, we get a contamination fraction of about 10%, and with the stricter cuts about 4% (see overview in Table D.1). It gets clear that the fraction of sources to the left of the chosen isochrones decreases significantly (from about 13% to 4%) when using superior photometry, which indicates that many sources indeed get erroneously shifted to older ages when not considering the influence of photometric uncertainties. In conclusion, we estimate the contamination fraction from older populations or field stars to be between 4–10%.
Overview of the contamination fraction from older stars as estimated with different photometric quality criteria.
Additionally, we investigate the influence of the membership stability as delivered by the SigMA algorithm (Sect. 3.6). Figure 8 shows the influence of different stability cuts on the “old star contamination fraction” when using the quality criteria from Eq. (D.2). It can be seen that for low stability (≲11%) there is also a significant increase in the contamination fraction. When only using sources with stability > 11%, we would get a old star contamination fraction in the range of about 2–9% for the cases of strict to no photometric cuts (see Table D.1). Hence, 2–4% can be considered as the lower limit for contamination from older stellar populations (or field stars), while about 10% is likely the upper limit. We conclude that the majority of the SigMAselected ScoCen members (likely more than 90%) are sources with young ScoCen–typical ages and therefore likely not contaminants from interloping older populations or field stars. The stability can be used to get more reliable members, while the stability cut needs to be decided individually per cluster.
D.2. Estimating the fraction of substellar sources
To get an estimate of substellar sources in our sample (brown dwarf candidates), we used a 0.08 M_{⊙} isomass line in Fig. 14 (right panel), which is extracted from BHAC15 models using ages from 0.5–30 Myr. We select sources below 0.08 M_{⊙}, which is defined as the approximate hydrogenburning limit (e.g., Baraffe et al. 1998; Burrows et al. 2001; Freytag et al. 2010, 2012; Dieterich et al. 2014). The uncertainties at the lowmass regime make this selection only a rough estimate, in combination with the uncertainties of the stellar models (e.g., unknown metallicity, neglected extinction, different models give different results). Additionally, all the uncertainties mentioned above in Sect. D.1 (e.g., chosen error or stability cuts) should be considered.
Using a cut 0.08 M_{⊙} and the quality criteria from Eq. (D.2) we find that there are 1946 out of 10,402 (18.7%) potential substellar sources when considering only the younger sources from the middle panel in Fig. 14. This selection indicates a fraction of substellar objects of about 19% within the SigMA clusters. If applying less strict error cuts (no cuts or Eq. (D.3)), the fraction stays at about 19%, and if applying more strict error cuts (Eq. (D.4)) the fraction decreases to about 12%. This is expected since more conservative photometric error cuts affect in particular faint sources. Changing the stability criteria does not influence these different fractions significantly, since sources both in the stellar and substellar regime seem to be affected almost equally. Concluding, we estimate that there are about 19% of brown dwarf candidates in the SigMAselected ScoCen sample, which can be considered as an upper limit (if correcting for extinction would likely deliver a lower fraction).
Appendix E: Auxiliary tables and figures
In Table E.1 we give an overview of the contents of the source catalog containing all ScoCen members as selected in this work, including labels for cluster membership. The full version of the table is available online as a machinereadable version.
Catalog of the 13,103 ScoCen members labeled for cluster membership as identified with SigMA.
We provide three additional tables, giving an overview of the literature comparisons between the SigMA clusters and other ScoCen samples. In Table E.2 we compare to Damiani et al. (2019) and Žerjal et al. (2023), in Table E.3 to Kerr et al. (2021), and in Table E.4 we compare to Squicciarini et al. (2021), MiretRoig et al. (2022a), and BriceñoMorales & Chanamé (2023). More details on the comparisons can be found in Sect. 5.2.
Comparing the SigMA clusters with stellar group selections from Damiani et al. (2019) and from Žerjal et al. (2023).
Comparing the SigMA clusters with Kerr et al. (2021) clusters toward ScoCen.
Comparing the SigMA clusters with stellar group selections from Squicciarini et al. (2021), MiretRoig et al. (2022a), and BriceñoMorales & Chanamé (2023).
Finally, we provide additional figures of the position and velocity spaces of the individual 37 SigMA clusters. This allows a better appreciation of the individual cluster source distribution in each parameter space. The Figs. E.1–E.5 are constructed as follows. The cluster names are given in the left panel (l, b panel) of each row. Each column shows one of the six different parameter spaces for one cluster. The parameter spaces are the same as in Figs. 10–13, namely l versus b (deg), X versus Y (pc), X versus Z (pc), Y versus Z (pc), v_{α} versus v_{δ} (km s^{−1}), and v_{α, LSR} versus v_{δ, LSR} (km s^{−1}). These axes labels are given at the top of each column. We note that Col. 5 (tangential velocities) shows a larger velocity range than Col. 6 (tangential velocities relative to the LSR), where the clusters actually occupy a smaller velocity space and hence show a smaller velocity dispersion. All SigMAselected ScoCen members are plotted in gray in all panels and the given cluster is overplotted with red dots. See also Figs. 10–13 for an alternative view of the 37 SigMA clusters, and the interactive 2D and 3D versions https://www.aanda.org/10.1051/00046361/202243690/olm.
Fig. E.1. Six parameter spaces, with the individual clusters highlighted in red. Shown are clusters SigMA 1–8 (part of US). The gray background sources are all SigMAselected ScoCen members. Cluster names are given in the left panels of each row. The used xyaxes are given as titles at the top of each column. Tick labels are only given in the bottom row. Note that the v_{LSR} space shows a smaller velocity range compared to the tangential velocity space in Col. 5 and hence a lower velocity dispersion. See also Figs. 10–13 and the main text for more details. 
All Tables
Overview of the clustering algorithms that we test with synthetic data (simulated cluster samples) and compare to SigMA.
Overview of the 37 SigMA clusters in ScoCen, assigned to seven subregions (Col. 2).
HIPPARCOS astrometry from van Leeuwen (2007) of bright stellar members in ScoCen.
Overview of the recent literature to which we compare the SigMA ScoCen clustering results in more detail.
Overview of the contamination fraction from older stars as estimated with different photometric quality criteria.
Catalog of the 13,103 ScoCen members labeled for cluster membership as identified with SigMA.
Comparing the SigMA clusters with stellar group selections from Damiani et al. (2019) and from Žerjal et al. (2023).
Comparing the SigMA clusters with stellar group selections from Squicciarini et al. (2021), MiretRoig et al. (2022a), and BriceñoMorales & Chanamé (2023).
All Figures
Fig. 1. Levelset method generating a hierarchical merge tree. Via a continuous change of λ from ∞ to −∞, a new component is created at each maximum (white points). At each saddle point (black points), components are merged. The merge tree is fully computed when λ reaches the global minimum. 

In the text 
Fig. 2. Proposed clustering process SigMA, highlighted on a 2D toy data set of three Gaussians with variable covariance matrices and means. (1) The generated toy data set consisting of three bivariate Gaussians shown in white alongside 2σ confidence ellipses in color. (2) The clustering procedure starts off by estimating the density of the input data. (3) Next, a graphbased hill climbing step is performed in which points are propagated along gradient lines toward local peaks. (4) This gradient propagation results in a preliminary segmentation of input samples that typically is far too finegrained. (5) These segmented regions are iteratively merged with a parent mode if a modality test along the MEP detects no significant density dip. (6) The final segmentation retains all three clusters. 

In the text 
Fig. 3. Schematic figure linking the cluster number to the density estimation process. Applying a smoothing operator generates a family of density fields. This hierarchical family of functions is called a scale space. 

In the text 
Fig. 4. Empirical distancescaling relationship using data from Gagné et al. (2018b) and CantatGaudin & Anders (2020). The xaxis represents the distance to stellar clusters; the yaxis shows the dispersion ratio of positional over kinematic subspaces. This dispersion directly corresponds to the velocity scaling factor, c_{v}, as discussed in Sect. 3.3.3. The gray line represents the best fit linear regression line, while the gray band indicates the 1σ highest density interval obtained from sampling the posterior predictive distribution. 

In the text 
Fig. 5. Noise removal pipeline separating cluster members from field stars. The pipeline is split into two main parts. In the first part, we aim to assign a radial velocity to each source to transform the data into Galactic Cartesian coordinates. In the second step, we used the transformed data to fit several clusternoise classifiers to separate the signal from the background. Blue represents data products at various steps, green denotes processes, and red shows decisions. For more details, see Sect. 3.5. 

In the text 
Fig. 6. Noise reduction schematic. We fit the observed univariate density distribution, ρ, with a mixture of two Gaussians that model the cluster (red line) and field star (black line) population, respectively. We obtained an approximation to the field star contamination and incompleteness rate in the cluster sample by considering the clusternoise classifier’s confusion matrix entries, particularly false positives, false negatives, and true positives. 

In the text 
Fig. 7. Absolute and relative error of inferred radial velocities compared to observed values in Gaia DR3 with radial velocity errors below 2 km s^{−1}. We randomly removed 95% of available radial velocities during inference to facilitate this comparison. Only inferred values where the Gaia observable has been removed are shown. We highlight the 1σ and 2σ percentiles and find that the majority (68%) of absolute errors are within ±2.35 km s^{−1} and have relative errors below 0.55. 

In the text 
Fig. 8. Stability versus estimated contamination rate. The contamination estimate was determined via source positions in the HRD relative to the 25 Myr isochrone as discussed in Sect. 5, selecting potential contaminants from older populations. We identify a sharp drop in contamination for low stability values that levels off at around 11%. 

In the text 
Fig. 9. Schematic illustration of the SigMA pipeline. The pipeline consists of two main parts, the SigMA core and two consensus clustering steps. For a detailed explanation, see the main text of Sect. 3.8 and the references therein. 

In the text 
Fig. 10. Distribution of the 37 SigMA clusters in ScoCen, projected in Galactic coordinates. Traditionally, the ScoCen OB association is separated into US, UCL, and LCC, marked with gray dashed lines. The clusters extracted with SigMA reveal a more complex substructure of ScoCen than initially proposed by Blaauw (1946), and they show a more extended spatial distribution that includes the CrA, Pipe, and Cham regions and additional stellar clusters toward the northeast (NE). The clusters are ordered in the legend by region, as given in Table 3. See the interactive 2D version online or Fig. 11 for a separate view of each cluster. For a better visualization of the clusters’ distribution, see the interactive 3D version online (Fig. 12). 

In the text 
Fig. 11. Distribution of SigMA clusters in ScoCen, projected in Galactic coordinates, stratified by cluster membership. Compared to Fig. 10, the small multiples highlight the distribution of individual clusters in the ScoCen complex. The color coding represents the seven regions: US (orange), UCL (blue), LCC (red), Pipe (green), CrA (magenta), Cham (cyan), and NE (yellow). 

In the text 
Fig. 12. 3D distribution of the 37 SigMA ScoCen clusters in heliocentric Galactic Cartesian coordinates. The Sun is at (0,0,0). Colors and labels are as in Fig. 10. See also the interactive 3D version online and Figs. E.1–E.5, which allow a better appreciation of individual cluster properties. By doubleclicking on a cluster in the legend of the interactive version, the selected cluster can be isolated; by hovering over data points, the cluster membership and observed l, b, d position of a source becomes visible. 

In the text 
Fig. 13. Tangential velocity distribution of the 37 SigMA clusters. Colors and labels are as in Fig. 10. Left: Observed tangential velocities along α and δ are strongly influenced by the Sun’s reflex motion, while stellar clusters at similar distances and with similar space motions are arranged in a looplike pattern. Sources at l ∼ 0° are located in the lowerright part of the figure, and sources at l ∼ 290° in the upperleft part of the figure (see Fig. C.1). Right: Tangential velocities corrected for the Sun’s motion, and hence relative to the LSR. The correction reduces the projection effects of the observed tangential stellar motions. See the interactive 2D version online and Figs. E.1–E.5 for a better appreciation of the 2D kinematical properties of the clusters in ScoCen. 

In the text 
Fig. 14. Gaia CMD using M_{G} versus G_{BP} − G_{RP} showing the SigMAselected ScoCen members. Left: SigMA cluster members that pass the photometric quality criteria as given in Eq. (D.2). Middle: Potential contamination from older sources (orange), selected with a 25 Myr isochrone from PARSEC (black line) and Baraffe et al. (2015, dashed black line) and an additional cut at M_{G} > 3 mag (dasheddotted black line), which excludes the UMS. The combined conditions indicate contamination from older sources of about 6–7% when using the given photometric quality criteria and no stability cut. Right: Substellar candidates (red dots) selected with a 0.08 M_{⊙} isomass line from Baraffe et al. (2015, dark red line) using only the younger source from the middle panel. This cut indicates that roughly 19% of substellar sources are within the SigMA ScoCen members when the mentioned cuts and photometric quality criteria are applied and extinction effects are ignored. More details on the quality criteria, the selection borders, and the isochrone models are given in Appendix D. 

In the text 
Fig. B.1. Test statistic distribution, $\widehat{\text{SB}}$, of our reduced test procedure on a unimodal data set. The distribution is stable to variations in different parameters and respective values (as shown in the corresponding rows and columns). 

In the text 
Fig. B.2. Frequency distribution of the ratio between measured distance and tangential velocity uncertainties as a function of distance. We observe a linear trend (suggested by the rolling median, window size 5 pc) between this ratio and distance. We hypothesize that the empirical distancescaling relationship (see Fig. 4) is caused by this trend. 

In the text 
Fig. B.3. Scaling factor determination via the empirical distancescaling relationship. The scaling factor distribution for clusters at a distance between 100 and 200 pc depends on the conditional distribution of scaling factors at a given distance, f(c_{x}/c_{v} ∣ r). 

In the text 
Fig. B.4. PDF and CDF of the scaling factor conditioned on a given range of distances. The ten red scatter points indicate samples drawn from the tenquantile splitting procedure. We separate the PDF into ten continuous intervals with equal probabilities from which we derive samples as the mean position of these intervals. 

In the text 
Fig. B.5. Simulating the impact of the fraction of missing v_{r} measurements on the relative error of inferred radial velocities. The top panel shows relative errors of inferred v_{r} on the open cluster sample as a function of missingness. The bottom panel shows the result of this study using the compact cluster sample. We find twice larger errors on the compact cluster sample. Additionally, the estimation procedure seems almost independent of the number of missing radial velocity measurements in both samples, except in a small region of the compact cluster sample, where the error increases if fewer than 5% of sources have v_{r} measurements. 

In the text 
Fig. B.6. Consensus clustering pipeline for a simple example using an ensemble of three clustering solutions: A, B, and C. (2) Clusters from the ensemble are linked in a graph based on overlapping points. (3) Edges between clusters are removed if the overlap between their members is insufficient to assume a common cluster solution. (4) Cliques that represent stable cluster solutions, i.e., consensus clusters, are extracted. (5) A voting strategy determines the assignment of individual sources to cliques. 

In the text 
Fig. B.7. Loglog histogram showing the relative uncertainty of data points, i.e., the ratio of positional uncertainty given by the error covariance matrix over the nearest neighbor distance. The majority of data points are far below unity. The concentration of relative uncertainty values at zero indicates that kdistances across resampled data sets are strongly correlated. The 1σ, 2σ, and 3σ lines indicate the percentiles containing 68, 95, and 99.7% of the respective distribution. 

In the text 
Fig. B.8. Distribution of pointwise density estimates of samples drawn from an Ndimensional Gaussian. The left column shows the likelihood of individual samples. It provides an assessment of the local density under the true model. The right column shows pointwise densities estimated via the kNN technique. Increasing the dimensionality, N, from one to six (top to bottom row), the distribution of pointwise densities approaches a Gaussian. 

In the text 
Fig. C.1. Tangential velocities in the v_{α}/v_{δ} plane, showing the theoretical locations of sources with circular Galactic orbits and LSR velocities. Six different cases are shown, while each line represents sources at all l positions. The six cases are for two different distances (100 pc, dashed lines; 200 pc, dashdotted lines), and for three different b positions (b = −20°, green; b = 0°, blue; b = 25°, magenta). The indicated longitude positions at l = 0° (box symbols) and l = 290° (diamond symbols) roughly mark the eastern and western borders of ScoCen. The SigMAselected ScoCen members are shown as gray dots (without stability cut). See also Fig. 13 for a separation of the clusters and for the v_{α, LSR}/v_{δ, LSR} plane. 

In the text 
Fig. E.1. Six parameter spaces, with the individual clusters highlighted in red. Shown are clusters SigMA 1–8 (part of US). The gray background sources are all SigMAselected ScoCen members. Cluster names are given in the left panels of each row. The used xyaxes are given as titles at the top of each column. Tick labels are only given in the bottom row. Note that the v_{LSR} space shows a smaller velocity range compared to the tangential velocity space in Col. 5 and hence a lower velocity dispersion. See also Figs. 10–13 and the main text for more details. 

In the text 
Fig. E.2. Same as Fig. E.1, but for clusters SigMA 9–16 (part of US and UCL). 

In the text 
Fig. E.3. Same as Fig. E.1, but for clusters SigMA 17–24 (part of UCL and LCC). 

In the text 
Fig. E.4. Same as Fig. E.1, but for clusters SigMA 25–32 (part of LCC, Pipe, CrA, and Cham). 

In the text 
Fig. E.5. Same as Fig. E.1, but for clusters SigMA 33–37 (part of Cham and NE). 

In the text 
Current usage metrics show cumulative count of Article Views (fulltext article views including HTML views, PDF and ePub downloads, according to the available data) and Abstracts Views on Vision4Press platform.
Data correspond to usage on the plateform after 2015. The current usage metrics is available 4896 hours after online publication and is updated daily on week days.
Initial download of the metrics may take a while.