An approach to the analysis of SDSS spectroscopic outliers based on self-organizing maps

D. Fustes; M. Manteiga; C. Dafonte; B. Arcay; A. Ulla; K. Smith; R. Borrachero; R. Sordo

doi:10.1051/0004-6361/201321445

Volume 559 (November 2013)

A&A, 559 (2013) A7

Issue		A&A Volume 559, November 2013


Article Number		A7
Number of page(s)		10
Section		Numerical methods and codes
DOI		https://doi.org/10.1051/0004-6361/201321445
Published online		28 October 2013

A&A 559, A7 (2013)

Designing the outlier analysis software package for the next Gaia survey

D. Fustes¹, M. Manteiga¹, C. Dafonte¹, B. Arcay¹, A. Ulla², K. Smith³, R. Borrachero⁴ and R. Sordo⁵

¹ Universidade da A Coruña (UDC), Fac. Informática, Campus de Elviña, 15071 A Coruña, Spain
e-mail: dfustes@udc.es; manteiga@udc.es; dafonte@udc.es; cibarcay@udc.es
² Universidade de Vigo (Uvigo), Dept. Física Aplicada, Campus Lagoas-Marcosende s/n, 36310 Vigo, Spain
e-mail: ulla@uvigo.es
³ Max Planck Institute For Astronomy (MPIA), Königstuhl 17, 69117 Heidelberg, Germany
e-mail: smith@mpia-hd.mpg.de
⁴ Universitat de Barcelona (UB), Dept. Astronomia i Meteorologia ICCUB-IEEC, Martí Franquès 1, Barcelona, Spain
e-mail: rborrachero@am.ub.es
⁵ Osservatorio Astronomico di Padova (INAF), Vicolo Osservatorio 5, Padova, Italy
e-mail: rosanna.sordo@oapd.inaf.it

Received: 11 March 2013
Accepted: 31 August 2013

Abstract

Aims. A new method applied to the segmentation and further analysis of the outliers resulting from the classification of astronomical objects in large databases is discussed. The method is being used in the framework of the Gaia satellite Data Processing and Analysis Consortium (DPAC) activities to prepare automated software tools that will be used to derive basic astrophysical information that is to be included in the final Gaia archive.

Methods. Our algorithm has been tested by means of simulated Gaia spectrophotometry, which is based on SDSS observations and theoretical spectral libraries covering a wide sample of astronomical objects. Self-organizing map networks are used to organize the information in clusters of objects, as homogeneously as possible according to their spectral energy distributions, and to project them onto a 2D grid where the data structure can be visualized.

Results. We demonstrate the usefulness of the method by analyzing the spectra that were rejected by the SDSS spectroscopic classification pipeline and thus classified as “UNKNOWN”. First, our method can help distinguish between astrophysical objects and instrumental artifacts. Additionally, the application of our algorithm to SDSS objects of unknown nature has allowed us to identify classes of objects with similar astrophysical natures. In addition, the method allows for the potential discovery of hundreds of new objects, such as white dwarfs and quasars. Therefore, the proposed method is shown to be very promising for data exploration and knowledge discovery in very large astronomical databases, such as the archive from the upcoming Gaia mission.

Key words: Galaxy: general / methods: data analysis / methods: statistical / methods: miscellaneous

© ESO, 2013

1. Introduction

The ESA Gaia mission, which is now in phase D (qualification and production), is expected to be launched by September 2013. It will provide the first highly accurate 6D map of the Milky Way, measuring positions, parallaxes, and motions to the microarcsec level. The satellite’s complex instrumentation, mode of operation, astrophysical main objectives, and its expected scientific performance have been extensively reviewed elsewhere, see for example de Bruijne (2012). Since Gaia is the first unbiased survey of the entire sky down to approximately a magnitude of 20, it is raising enormous expectations from a wide range of astronomical research areas, from solar system studies to cosmology, especially after it was decided that the final archives, containing the observations and basic astrophysical products, will be made public immediately after being produced.

The spacecraft will measure every object in the sky over 80 epochs on average and over the course of its five years of operating time, allowing for variability studies, as well as an increase in the signal-to-noise ratio with time. We expect approximately observations and an extensive number of iterations, which will be required to process astrometry, photometry, and radial velocities, with the additional challenge of a data flow of approximately observations per second (Holl et al. 2012).

The main astrophysical properties of astronomical objects observed by Gaia will be derived by a software pipeline, which is being produced by an international consortium, the Gaia Data Processing and Analysis Consortium (DPAC). DPAC arose, in response to an ESA Gaia Announcement of Opportunity in March 2007, as an international collaboration with members from all over Europe, and it nowadays comprises a community of over 400 scientists and software engineers from more than 20 countries. DPAC is organized in several coordination units (CUs), each responsible for a well-defined set of tasks in the Gaia data processing effort. CU8 is in charge of classifying the observed astronomical sources by both supervised and unsupervised algorithms and of producing an outline of their main astrophysical parameters (Ordóñez-Blanco et al. 2010). CU8 is subdivided into several work packages: DSC (discrete source classifier) is the main package for classification, whereas GSP-Phot (general stellar parameterizer – photometry) and GSP-Spec (general stellar parameterizer – spectroscopy) are the main parameterization packages. There are a number of additional packages dedicated to more specific tasks, such as quasar/galaxy parameterization (QSOC and UGC, respectively) or specific stellar population parameterizers (ESP). Finally, there are two packages dedicated to the unsupervised analysis of the raw data, OCA (object cluster analysis) and OA (outlier analysis). The latter, aimed at analyzing classification outliers, is the package described in this work.

The astrophysical parameters inference system (Apsis, Bailer-Jones et al. 2013), developed by CU8, is composed of a number of algorithms to derive information on the nature of all astronomical objects that will be observed by the satellite, mainly through analyzing their astrometric properties and their spectral energy distribution (SED). The SED will be obtained for all objects observed by Gaia, using two spectrophotometers: BP (Blue Photometer, operating in the wavelength range of 300–680 nm) and RP (Red Photometer, range 640 to 1050 nm). Figures 1a,b show the normalized passbands as the instrumental response to photons and the spectral dispersion as a function of wavelength for the BP and RP instruments.

Fig. 1

Gaia spectrophotometers BP and RP properties. Credits: ESA. a) Normalized passbands of Gaia instruments as a function of wavelength. G stands for Gaia broad-band white light, RVS refers to Gaia’s Radial Velocity Spectrograph, and BP and RP refer to blue and red spectrophotometers. b) Spectral dispersion of Gaia BP and RP spectrophotometers.

The main objective of DSC consists in providing a probabilistic classification for every object observed by Gaia from among a well-defined set of astronomical classes: STAR, WD, PHYSICAL BINARY, GALAXY, QSO, and NON-PHYSICAL BINARY (composite object). White dwarf stars are considered an additional class (WD), so that STAR includes all non-WD single stars. DSC is mainly based on a supervised method, more specifically an SVM (support vector machine, Cortes & Vapnik 1995) algorithm working on Gaia spectrophotometry, which is complemented by two other subclassifiers working with astrometric data; see Smith (2012) for further information.

OCA uses an unsupervised classification algorithm called HMAC (Li et al. 2007), with the objective of determining the “natural” observed classes of astronomical objects among all Gaia observations, without any a priori hypothesis about their physical nature. The expectation of CU8 is that both general classification work packages will be able to classify approximately 95% of all Gaia observations with a reasonable level of reliability.

Our group has been participating in the CU8 DPAC activities since 2007, and is responsible, among other subjects, for analyzing the classification outliers that result from DSC and OCA operations. The knowledge of the SED of each astronomical object with its precise astrometry and information about variability will, undoubtedly, provide us with the most complete physical information for describing its astrophysical nature. However, it is important to stress that, although the physics of the stars is nowadays well understood and there are extensive archives containing information about the light distribution of most known astronomical objects, it is expected that Gaia will observe such an enormous number of sources that many of them could differ significantly from model predictions or previous observations.

Gaia will observe a significant sample of peculiar objects, such as supernovae, stars with abnormal abundance patterns, Wolf-Rayet stars, multiple systems, or high-redshift quasars, as well as, probably, new kinds of previously unseen objects. In addition, low signal-to-noise ratios, cosmic rays, instrument artifacts, and other damaged data will eventually occur, leading to classification errors. To deal with this issue, DSC is using an automatic outlier detector based on a one-class SVM that rejects objects that are far from the training data space (see for example, Schölkopf et al. 2001). It is estimated that approximately objects (5% of the total) will be marked as UNKNOWN by the DSC outlier detector, which means that some type of automatic analysis becomes mandatory. Furthermore, some objects will receive a set of probabilities that is not decisive in terms of final classification, so that their nature should be clarified by further analysis. These objects will be processed by OA.

The remainder of this paper is organized as follows. Section 2 presents the data that is being used to test the CU8 algorithms, Sect. 3 describes the algorithm used to process the outliers with OA and the resulting performance with Gaia simulations, and Sect. 4 presents the results that were obtained by the algorithm working with 10 125 spectroscopic outliers from SDSS (Sloan Digital Sky Survey), which may lead to identifying new astronomical objects. Finally, Sect. 5 outlines our conclusions and discusses the adaptation of the proposed methodology for data mining in the next Gaia mission.

2. Gaia simulated libraries

DPAC is using a powerful simulator, the Gaia Object Generator (GOG, Isasi et al. 2010), to simulate a wide variety of observations that are expected to take place during the mission. Among the data generated for testing CU8 algorithms, we used several spectra from the SDSS Data Release 7 (Abazajian et al. 2009), transformed by GOG to BP/RP low-resolution format and instrumental characteristics. We refer to such spectra as the SDSS semi-empirical libraries, which are actually three libraries containing different classes of objects: stars, quasars, and galaxies (Tsalmantza et al. 2012). In addition, to complete the set of reference spectra, we considered model-based BP/RP DPAC libraries, which were compiled in different ways from stellar synthetic spectra obtained from the MARCS and PHOENIX models, Gustafsson et al. (2008) and Brott & Hauschildt (2005). We also considered compiled libraries composed of binary stars (Castanheira et al. 2006), white dwarfs, ultra cool dwarfs (Allard et al. 2000), emission line stars, and planetary nebulae (Blomme et al. 2010). Finally, spectra for nonphysical pairs were generated by adding the spectra of either two stars, a star and a galaxy, or a star and a quasar. The Gaia stellar libraries were presented and compared in Sordo et al. (2011).

Simulated Gaia spectrophotometry, delivered by DPAC to CU8 and used by Apsis, is currently produced without any correction for instrumental sensitivity: the true response function will not be known until late in the mission, since it will be the result of photometric calibration using standard stars. This is the reason the available spectrophotometry is mostly dominated by a low-frequency signal showing a typical belt structure for each of the photometers. Figure 2 presents an SDSS spectrum for a QSO, and its GOG version (BP/RP spectrophotometry). GOG spectra are internally calibrated (i.e., the flux, magnitudes, and colors are correct). Section 4 discusses the impact of using this representation as opposed to working directly with SDSS spectra. During operations, Gaia DPAC will need to run a regular assessment of the effect of bandwidth nonuniformities by comparing accumulated spectra of sources of similar spectral types taken in different CCD rows and at different times (see Fabricius et al. 2013). It is expected that external calibration should be able to remove this.

Fig. 2

Comparison between an SDSS spectrum for a QSO and its GOG simulated BP/RP spectrophotometry. a) A quasar spectrum from SDSS. b) Quasar spectrum after simulating it with GOG.

Both the astronomical classification and the main parameters of the objects populating the previous datasets are well known, and the compiled libraries extensively cover the range of physico-chemical and evolutionary parameters expected for the main classes of astronomical objects, thus allowing for the use of this data to test or validate our method for identifying “classical” object categories (see Sect. 3). However, our algorithm is aimed at processing objects with an unknown nature, such as instrumental failures, faint observations, etc. Keeping this in mind, we compiled a new library formed by spectra from SDSS that were classified as “UNKNOWN” by the SDSS spectroscopic classification pipeline. After removing some spectra with zero or negative fluxes, we obtained a dataset composed of 10 125 objects, mostly faint objects (with a mean magnitude of 19 in G band), incomplete spectra, and unsuccessful observations. The performance of our algorithm in the SDSS outliers library is discussed in Sect. 4.

3. Outlier analysis algorithm

Any study related to the classification of extensive datasets has to address the problem of analyzing multidimensional classification outliers. Among others, this problem has recently arisen in the analysis of astronomical surveys, such as Pan-STARRS1 (Saglia et al. 2012), the Sloan Digital Sky Survey, SDSS (Dobos et al. 2012), and the Blanco Cosmology Survey (Desai et al. 2012). Since outliers are, by definition, objects that do not fit in the existing models, the analysis of large outlier datasets must be done by means of unsupervised algorithms, which do not rely on any a priori knowledge. In the data mining field, there are two main approaches to dealing with multidimensional data based on unsupervised techniques: dimensionality reduction and clustering. Dimensionality reduction tries to reduce the number of dimensions (variables, attributes) in the dataset to a level where they can be more reasonably analyzed by domain experts. Principal component analysis (PCA, see Jolliffe 2002) is the best known algorithm of this kind. On the other hand, clustering aims to group the data into a number of clusters that share similar properties. A wide variety of clustering algorithms have been proposed, since it is an ill-defined problem (see Xu & Wunsch 2005; Baraldi & Blonda 1999; Warren Liao 2005).
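As a concrete illustration of the dimensionality-reduction approach, the following minimal sketch projects a set of spectra onto their leading principal components by means of a singular value decomposition. The function name `pca_project` and its arguments are illustrative assumptions and are not part of the OA package.

```python
import numpy as np

def pca_project(data, n_components):
    """Project patterns (rows of `data`) onto their leading principal components.

    Hypothetical helper for illustration only.
    Returns the projected data and the fraction of variance explained per component.
    """
    data = np.asarray(data, dtype=float)
    # Center each variable (column) on its mean.
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, sorted by decreasing singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = (s ** 2) / (s ** 2).sum()
    return centered @ vt[:n_components].T, explained[:n_components]
```

The explained-variance fractions give a rough indication of how many components are worth keeping before an expert inspects the projection.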

Fig. 3

Distribution of astronomical object classes obtained with Gaia photometric simulations, over a computed SOM, with 30 by 30 clusters. The color assigned to each of the clusters was set as a function of the predominant class in the objects belonging to it. The black color indicates that the cluster is empty.

Our choice for analyzing outliers is a clustering algorithm based on self-organizing maps (SOM, Kohonen et al. 2001). SOM have been used extensively in a number of scientific fields. Indeed, the paper that opened the field, Kohonen (1982), currently counts more than 5000 citations. However, they have been used sparingly thus far in astronomy (Naim et al. 1997; Geach 2012; Way & Klose 2012).

The main advantage of SOMs is that they provide quality clustering and nonlinear dimensionality reduction at the same time, by projecting the data into a fixed number of clusters (called neurons or units in the neural networks field), arranged in a 2D (or 3D) structure, usually a matrix of rows and columns. Each cluster has a representative, called a prototype, which is a virtual pattern that best represents the set of input patterns belonging to that cluster. The problem to be optimized is to find the best prototypes for the SOM clusters. Since this is an NP-hard problem, an iterative optimization procedure is followed to reach an acceptable solution from a randomly selected initialization of neuron weights. First, for each input pattern, the neuron that most resembles the pattern is activated. This is calculated by means of the squared Euclidean distance between the pattern and the neuron prototype. Then, the activated neuron and its neighbors are updated according to the activating patterns. The number of neurons in the neighborhood of the activated neuron is large in the first iterations, but shrinks as the iterations proceed. In this way, the algorithm starts by ordering the neurons and then smoothly moves to focusing on the clustering procedure, thereby minimizing the residual (also called quantization error) between the neuron prototype and its activating patterns.
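The iterative procedure described above can be sketched as follows. This is a minimal online SOM, assuming a Gaussian neighborhood whose radius and learning rate shrink linearly with the iteration number; the function name `train_som`, its parameters, and the decay schedules are illustrative assumptions rather than the settings used in OA.

```python
import numpy as np

def train_som(data, rows, cols, n_iter=20, lr0=0.5, seed=0):
    """Train a small online SOM; returns prototypes of shape (rows, cols, dim).

    Hypothetical sketch: schedules and defaults are assumptions.
    """
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    # Random initialization of the neuron weights (prototypes).
    weights = rng.random((rows, cols, data.shape[1]))
    # Grid coordinates of every neuron, used to measure distances on the map.
    grid = np.stack(
        np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1
    )
    sigma0 = max(rows, cols) / 2.0
    for it in range(n_iter):
        frac = it / n_iter
        # Neighborhood radius and learning rate shrink as the iterations proceed.
        sigma = sigma0 * (1.0 - frac) + 0.5 * frac
        lr = lr0 * (1.0 - frac) + 0.01 * frac
        for x in data[rng.permutation(len(data))]:
            # Activate the neuron with the smallest squared Euclidean distance.
            d2 = ((weights - x) ** 2).sum(axis=-1)
            bmu = np.unravel_index(np.argmin(d2), d2.shape)
            # Update the activated neuron and its map neighbors toward the pattern.
            g2 = ((grid - np.array(bmu)) ** 2).sum(axis=-1)
            h = np.exp(-g2 / (2.0 * sigma ** 2))
            weights += lr * h[..., None] * (x - weights)
    return weights
```

A call such as `train_som(spectra, 30, 30)` would produce a 30 by 30 map, where `spectra` is a hypothetical array with one preprocessed spectrum per row.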

We have carried out several experiments with Gaia spectrophotometry and SOM with considerable success (see Ordóñez et al. 2012; Fustes et al. 2013). The experiments show that different object classes lie in well-defined map regions and that it is possible to compress the data objects into a reduced number of clusters without significant loss of astrophysical information. See, for instance, the distribution of the objects on the map in Fig. 3 where we computed a 30 by 30 SOM (900 clusters) for 150 417 objects simulated with GOG, covering a wide variety of astronomical classes with varying parameters described in Table 1. A confusion matrix is a useful tool for evaluating the success of a classification algorithm. It is a table in which each row represents an object class, for which we compute the percentage of objects falling into clusters where the predominant class corresponds to each of the columns. The confusion matrix corresponding to the SOM clustering of the previous experiment is presented in Table 2. The last row shows the number of objects per class in the input dataset. In this case, the achieved compression rate is 167:1, with a mean class purity around 98.5% in the SOM clusters.
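The confusion matrix described above can be computed along the following lines. The sketch assumes that each object comes with a known class label and the identifier of the SOM cluster it falls into; the function name `cluster_confusion` and the input layout are illustrative assumptions. The quoted compression rate follows directly from the numbers in the text, since 150 417 objects divided by 900 clusters is approximately 167.

```python
from collections import Counter, defaultdict

def cluster_confusion(labels, clusters):
    """Row-normalized (%) confusion matrix: true class vs. predominant class of its cluster.

    Hypothetical helper; `labels[i]` is the class of object i and
    `clusters[i]` the identifier of the SOM cluster it belongs to.
    """
    members = defaultdict(list)
    for lab, c in zip(labels, clusters):
        members[c].append(lab)
    # Predominant class of each non-empty cluster.
    predominant = {c: Counter(labs).most_common(1)[0][0] for c, labs in members.items()}
    classes = sorted(set(labels))
    counts = {t: Counter() for t in classes}
    for lab, c in zip(labels, clusters):
        counts[lab][predominant[c]] += 1
    # Each row gives the percentage of objects of one class per predominant class.
    matrix = {
        t: {p: 100.0 * counts[t][p] / sum(counts[t].values()) for p in classes}
        for t in classes
    }
    return matrix, predominant

# Compression rate for the experiment in the text: objects per cluster.
compression_rate = round(150417 / 900)
```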

It was necessary to apply some preprocessing to the BP/RP data before presenting it to the SOM. We started by joining both BP and RP in a single vector, which was then normalized to have unit area; otherwise, the SOM would only focus on apparent magnitudes. BP and RP overlap in wavelength and have different spectral sensitivities (see Fig. 2). We performed tests with two BP/RP formats: one in which the two spectra are simply concatenated one after the other, hence showing a central region with several near-zero points, and a second one with spectra obtained by matching the overlapping wavelength region between both photometers. In the first case we preserve the information on the spectral colors, but introduce several pixels that correspond to the same wavelengths. It has also been taken into account that low pixel values will not significantly affect the performance of the SOM neurons. Tests carried out with both data configurations show that only small differences were found, approximately 3% in clustering purity, which indicates that the small color differences do not introduce significant changes in the obtained groups. We decided to use merged spectra without wavelength redundancy, thereby saving some computation time.
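The first of the two formats (simple concatenation followed by normalization) can be sketched as below. The sketch assumes uniformly sampled pixels, so that the area is approximated by the sum of the pixel values; the function name `preprocess_bprp` is an illustrative assumption, and the matching of the overlapping wavelength region used in the second format is not reproduced here.

```python
import numpy as np

def preprocess_bprp(bp, rp):
    """Join BP and RP fluxes into one vector and scale it to unit area.

    Hypothetical helper; the area is approximated by the sum of pixel values.
    """
    spectrum = np.concatenate(
        [np.asarray(bp, dtype=float), np.asarray(rp, dtype=float)]
    )
    area = spectrum.sum()
    if area <= 0:
        # Spectra with zero or negative total flux cannot be normalized this way.
        raise ValueError("spectrum has non-positive total flux")
    return spectrum / area
```

After this step, a bright source and a faint source with the same spectral shape map to the same vector, which is what prevents the SOM from clustering on apparent magnitude alone.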

Table 1

Classes of objects among Gaia simulations used for testing OA algorithms.

Table 2

Confusion matrix of the SOM computed using Gaia simulations of a wide variety of astronomical objects, as explained in Sect. 3.

In general, when Gaia satellite observations are available, we do not know the physical nature of the objects that populate the obtained SOM clusters. Therefore, an identification phase will be mandatory, in order to at least provide a description of the cluster by all available means, including both Gaia internal data (such as known SEDs, object variability, astrometry, and photometry) and external data (such as information from other astronomical surveys), possibly complemented by human experts’ knowledge and additional ground observations when necessary. The topology preservation of the SOM can help in this sense, since it provides researchers with meaningful visualizations, such as the well-known U-Matrix, which serve as maps for data exploration (see Kaski 1997). Section 4 puts these visualizations into practice, along with others specifically designed for the task, to unveil the nature of the objects populating the SDSS spectroscopic outliers.

4. Unveiling the nature of SDSS outliers with the OA algorithm

This section describes the processing carried out by OA by means of the SOM algorithm mentioned above. To do so, we applied our method to a set of data of an unknown nature, specifically the SDSS outlier library described in Sect. 2, with the goal of identifying the classes of astronomical sources that populate it. This case is realistic, in the sense that the physical nature of the objects can be considered unknown since they were rejected by a classification pipeline. The following sections describe the process of computing an SOM that is suited for this dataset and the subsequent identification of the clusters populating it.

4.1. SOM learning procedure

The first step in the analysis of the SDSS outliers is to set the learning parameters for the SOM. Some of them can be fixed with simulations after some experimentation, such as the number of learning iterations and the neighborhood function. But the most important parameter to set, the size of the map (the number of clusters), is more difficult to estimate, since it strongly depends on the data. We opted for using a measure of error in the clustering, called mean quantization error (MQE), to establish the map size (Pölzlbauer 2004). MQE measures the mean distance between a cluster prototype and the objects populating it. On this basis, we established that 30 by 30 is an acceptable map size for the present experiment. Finally, we selected the batch learning mode instead of the online mode, because the batch mode has the advantage of being independent of the order in which the patterns are presented to the SOM (Fort et al. 2002).
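Both ingredients of this paragraph can be illustrated with a short sketch. It assumes one common definition of the MQE (mean Euclidean distance between each pattern and its closest prototype) and a Gaussian neighborhood for the batch update; the function names `mean_quantization_error` and `batch_epoch` are illustrative assumptions, not the OA implementation.

```python
import numpy as np

def mean_quantization_error(data, prototypes):
    """Mean Euclidean distance between each pattern and its closest prototype."""
    data = np.asarray(data, dtype=float)
    protos = np.asarray(prototypes, dtype=float).reshape(-1, data.shape[1])
    # Pairwise distances, shape (n_patterns, n_prototypes).
    dists = np.linalg.norm(data[:, None, :] - protos[None, :, :], axis=-1)
    return dists.min(axis=1).mean()

def batch_epoch(data, weights, sigma):
    """One batch-mode update: each prototype becomes a neighborhood-weighted mean.

    Hypothetical sketch; the result does not depend on the order of the patterns.
    """
    data = np.asarray(data, dtype=float)
    rows, cols, dim = weights.shape
    flat = weights.reshape(-1, dim)
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Best-matching unit of every pattern, computed with the current prototypes.
    d2 = ((data[:, None, :] - flat[None, :, :]) ** 2).sum(axis=-1)
    bmu = d2.argmin(axis=1)
    # Gaussian neighborhood between each unit and each pattern's best-matching unit.
    g2 = ((grid[:, None, :] - grid[bmu][None, :, :]) ** 2).sum(axis=-1)
    h = np.exp(-g2 / (2.0 * sigma ** 2))
    total = h.sum(axis=1, keepdims=True)
    # Units whose neighborhood weights all underflow to zero keep their old prototype.
    new = np.where(total > 0, (h @ data) / np.maximum(total, np.finfo(float).tiny), flat)
    return new.reshape(rows, cols, dim)
```

Under these assumptions, candidate map sizes can be compared by training one map per size and inspecting how the MQE decreases as the number of units grows.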

The MQE index measures the quality of the clustering, but we still do not know if the map is correctly ordered according to the input topology. It is difficult to estimate this mathematically. One way to assess the ordering is to visualize the color distribution (magnitude differences in integrated RP and BP bands) in the SOM prototypes, as shown in Fig. 4, where a general ordering in the color distribution can be visualized. Additionally, with this plot and the MQE, we were able to select the best SOM among several randomly initialized learning procedures.
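A color of this kind can be derived from a prototype with the standard magnitude relation, as in the sketch below. It assumes that the integrated flux in each band is approximated by the sum of the pixel values and it omits the photometric zero points, so the result is an instrumental color that differs from a calibrated one by a constant; the function name `instrumental_color` is an illustrative assumption.

```python
import math

def instrumental_color(bp_flux, rp_flux):
    """BP minus RP magnitude difference from integrated fluxes (zero points omitted).

    Hypothetical helper; larger values correspond to redder prototypes.
    """
    f_bp = sum(bp_flux)
    f_rp = sum(rp_flux)
    if f_bp <= 0 or f_rp <= 0:
        raise ValueError("integrated fluxes must be positive")
    # m_BP - m_RP = -2.5 * log10(F_BP / F_RP), up to a constant zero-point term.
    return -2.5 * math.log10(f_bp / f_rp)
```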

Fig. 4

Photometric color distribution (magnitude differences in integrated RP and BP bands) in the SDSS outliers SOM.

4.2. Data navigation through SOM visualizations

Once the final SOM has been obtained, several visualization tools are available with which to unveil the data’s physical nature and distribution. For instance, the color plot described in the previous section can be used as a guide for guessing stellar atmospheric temperatures.

The prototypes of each cluster in the SOM can also be visualized, provided that there is an expert who is able to interpret the BP/RP data in some way. Such an exploration can be performed region by region in the SOM instead of visualizing every cluster, since clusters that are located close to each other in the SOM are close in the input space as well. The U-Matrix, which displays the distances between clusters, is a visualization tool that can assist the expert in the analysis process. The distance between adjacent clusters is calculated and presented with different gray levels. A dark color corresponds to a large distance and thus a gap between the clusters in the input space, whereas a light color between the clusters means that they are close to each other in the input space. Light areas can be thought of as dense regions in the input space, whereas dark areas correspond to sparser ones. Figure 5 shows the U-Matrix computed from the SOM built for the SDSS outliers dataset, at several levels of contrast. This way, we can identify clusters that are outlying with respect to the others, such as the darker one in position (30,1). In addition, we can select several groups of clusters for joint identification. These groups can be further studied in order to propose a taxonomy among the unclassified objects in the SDSS survey, as is shown in the next sections.
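A simplified U-Matrix can be computed as in the following sketch, which assigns to every unit the mean distance between its prototype and those of its four grid neighbors. This per-unit average is a common simplification and an assumption made here for brevity; the function name `u_matrix` is illustrative, and the map is assumed to contain at least two units.

```python
import numpy as np

def u_matrix(weights):
    """Mean distance from each unit's prototype to those of its 4-connected neighbors.

    Hypothetical helper; `weights` has shape (rows, cols, dim).
    """
    rows, cols, _ = weights.shape
    umat = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            dists = [
                np.linalg.norm(weights[r, c] - weights[rr, cc])
                for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                if 0 <= rr < rows and 0 <= cc < cols
            ]
            # Large values mark units that are far from their neighbors in input space.
            umat[r, c] = np.mean(dists)
    return umat
```

Clipping the resulting values at different upper limits before plotting gives views at different contrast levels.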

The nature of the outliers (darker clusters) detected in the U-Matrix can vary: they can be instrumental errors, bad detections, or faint or unexpected objects. The analysis of the darker regions in the map has to be carried out in a more detailed way (even cluster by cluster), since the objects in each of the dark regions could differ considerably from each other.

Fig. 5

U-Matrix for the outlier SOM. The different plots correspond to different distance limits, which allow us to change the image contrast so as to unveil the underlying structure.

4.3. Identification of SOM clusters using spectral templates

Fig. 6

Identifications obtained for the SOM of SDSS outliers using Gaia simulations. The clusters receive a black color when the distance between the outlier prototype and the corresponding template is above the established limit.

Fig. 7

Examples of fitness between the Gaia templates and two outlier SOM clusters. a) Prototype (blue) and best-matching template (red) for unit at position (30, 1). b) Prototype and best matching template for unit at position (28, 4).

Fig. 8

Bad SDSS detections, populating the lower right region of the outliers SOM.

As stated in previous sections, we cannot rely on supervised models to classify the unknown objects, since these would probably fail. However, we can still try to “make a guess” concerning the nature of the unknown objects. This can be achieved by means of distance-based models, such as a k-nearest neighbors (KNN) classifier. We have applied this method to labeling the units in the SOM built with the SDSS Outliers Library spectra, by retrieving the closest templates for each cluster prototype. The templates compiled for this purpose were obtained from the cluster prototypes in the SOM presented in Sect. 3. Figure 6a shows the results of the matching procedure, where each cluster in the SOM of outliers was given a color depending on the class of the templates retrieved for it. From the figure, we can see that some regions in the map are filled with the same colors, thus receiving the same identification. For instance, the lower lefthand corner of the map is dominated by white dwarfs (green clusters) and the lower center by quasars (blue units), whereas normal stars and galaxies (pink and yellow clusters, respectively) are mostly located in the upper half of the map. Finally, ultra cool dwarfs are found in the upper righthand corner, in accordance with their very red colors.

Fig. 9

Diagram of the number of hits in each SOM cluster obtained for the SDSS outliers library, for different SIMBAD classes (see Sect. 4.4 for details).

This type of identification can be useful as a first approach to studying the nature of the SOM clusters, but one should be careful with the results, since these are rare or even damaged objects. Therefore, the likelihood of the identifications should be studied. One way to estimate this likelihood is to look at the Euclidean distance between the template and the outlier prototype obtained when the identification was performed. Figure 7 shows both the fitness obtained for neuron (30,1), which is the poorest one, and the fitness obtained for unit (5,4), where the prototype and the template are very close to each other. The distance can also be used to filter the regions in the map that are not likely to belong to known classes of astrophysical objects, as is shown in Fig. 6. This filtering, together with the exploration of cluster prototypes and SDSS images, has allowed us to distinguish between common astrophysical objects and instrumental artifacts. For instance, the lower righthand region of the map, which was given a dark color in Figs. 5 and 6c, is populated by bad image detections, like the ones shown in Fig. 8.
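The matching and filtering steps of this section can be sketched as follows. The sketch uses a single nearest neighbor (k = 1) and a user-supplied distance limit, both of which are assumptions made for illustration; the function name `label_units` and its arguments are hypothetical.

```python
import numpy as np

def label_units(prototypes, templates, template_classes, max_dist):
    """Label each prototype with the class of its nearest template.

    Hypothetical helper; a unit is left unlabeled (None) when the Euclidean
    distance to its best-matching template exceeds `max_dist`.
    """
    prototypes = np.asarray(prototypes, dtype=float)
    templates = np.asarray(templates, dtype=float)
    # Distances between every prototype and every template.
    dists = np.linalg.norm(prototypes[:, None, :] - templates[None, :, :], axis=-1)
    best = dists.argmin(axis=1)
    best_dist = dists[np.arange(len(prototypes)), best]
    labels = [
        template_classes[j] if d <= max_dist else None
        for j, d in zip(best, best_dist)
    ]
    return labels, best_dist
```

The returned distances can be inspected to choose the limit, for instance by comparing well-matched units with poorly matched ones.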

4.4. Cross-matching with external archives

External archives can contribute to the identification process, since they contain additional information, including other wavelength ranges, imaging, and other classifications. Usually, cross-matching is performed by using a cone search in the sky, looking for objects within a certain radius. This functionality is already included in well-known tools such as Topcat or Aladin, which are integrated with the Virtual Observatory standards. In that sense, we can take advantage of the structure provided by the SOM to enhance the data exploration.

Fig. 10

Labeling of the map according to SIMBAD object types, see Sect. 4.4 for details. a) Simbad identifications for the outliers SOM. b) Class purity among identifications in the SIMBAD database for the outliers’ SOM. Green indicates that no object in the cluster was identified.

We opted for the SIMBAD catalog to perform cross-matching with the SDSS outliers, looking for more identifications. In this case, we retrieved those objects in SIMBAD within a radius of one arcsecond from every SDSS outlier, obtaining the SIMBAD type of each match when one exists. We obtained identifications among the following SIMBAD object types: AGN, Seyfert I galaxy, Seyfert II galaxy, BL-Lac object, galaxy, QSO, radio sources (in general), white dwarfs, brown dwarfs, and low-mass stars. For simplicity, SIMBAD classes AGN, as well as Seyfert1, Seyfert2, and BL-Lac objects, were grouped together under the “AGN” label, which should be interpreted as active extragalactic objects excluding explicit SIMBAD identifications “QSO”. Note that we expect almost no normal stars to form part of the SDSS outliers dataset. Figure 9 shows the distribution of the retrieved SIMBAD identifications across the map. We can see that the different SIMBAD types populate clearly separated regions in the SOM. It is remarkable that white dwarfs, quasars, and cool dwarfs are found in similar locations when identified by means of SIMBAD and Gaia simulations (see Fig. 6), which increases our confidence in the method. A more compact view of the distribution of SIMBAD identifications is given in Fig. 10. SOM clusters in Fig. 10a receive a color as a function of the most frequent SIMBAD identification. In the figure, black clusters do not have objects identified in SIMBAD and gray is assigned to clusters with a similar frequency among two or more classes. On the other hand, Fig. 10b displays, for each cluster, the purity (in percentage) of the most frequent SIMBAD class. Green is assigned to clusters without any SIMBAD identification.
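The one-arcsecond cone search can be illustrated with a brute-force sketch based on the haversine formula for the angular separation. The function names and the in-memory catalog layout (right ascension, declination, object type) are illustrative assumptions; a real cross-match would query the archive through tools such as those mentioned above instead of looping over a local list.

```python
import math

def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation (haversine formula); inputs in degrees, output in arcsec."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (
        math.sin((dec2 - dec1) / 2) ** 2
        + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2
    )
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a)))) * 3600.0

def cross_match(sources, catalog, radius_arcsec=1.0):
    """Type of the closest catalog entry within the radius for each source, else None.

    Hypothetical helper; `sources` holds (ra, dec) pairs and `catalog`
    holds (ra, dec, object_type) triples, all angles in degrees.
    """
    matches = []
    for ra, dec in sources:
        best = None
        for cra, cdec, otype in catalog:
            sep = angular_separation_arcsec(ra, dec, cra, cdec)
            if sep <= radius_arcsec and (best is None or sep < best[0]):
                best = (sep, otype)
        matches.append(best[1] if best else None)
    return matches
```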

Table 3

Confusion matrix of the SOM, computed by means of the SDSS outliers semi-empirical library in Gaia spectrophotometer format.

Table 4

Confusion matrix of the SOM computed using roughly calibrated spectrophotometry.

Table 5

Confusion matrix of the SOM computed using SDSS full resolution spectra, classified as UNKNOWN.

Apart from inspecting the distribution of SIMBAD identifications in the SOM, we performed a more formal evaluation through confusion matrices, which measure how the different types of objects are mixed in the SOM, as shown in Table 3. The SOM is effective in separating the SIMBAD types, especially considering the uncertainty introduced by the cross-matching and the SIMBAD misclassification rate. In addition, Table 3 gives an idea of the discovery potential of our method. Of the 7898 objects without identifications in SIMBAD, 624 are candidates for new WDs, 1674 for new QSOs, and so on, following the percentages of the row corresponding to the UNKNOWN class.
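As an illustration of how such a matrix can be built, the sketch below tabulates the SIMBAD type of each object (rows) against the label of the SOM cluster it falls into (columns), with unidentified objects collected in an UNKNOWN row. This is an assumed construction, not necessarily the exact procedure behind Table 3; the function names and the toy input are hypothetical.

```python
from collections import Counter, defaultdict


def confusion_matrix(cluster_ids, simbad_types):
    """Rows: SIMBAD type of each object ("UNKNOWN" if unidentified).
    Columns: majority SIMBAD type of the SOM cluster containing the object
    ("UNKNOWN" if the cluster has no identified member)."""
    members = defaultdict(list)
    for cid, obj_type in zip(cluster_ids, simbad_types):
        if obj_type is not None:
            members[cid].append(obj_type)
    cluster_label = {cid: Counter(types).most_common(1)[0][0]
                     for cid, types in members.items()}

    counts = defaultdict(Counter)
    for cid, obj_type in zip(cluster_ids, simbad_types):
        row = obj_type if obj_type is not None else "UNKNOWN"
        col = cluster_label.get(cid, "UNKNOWN")
        counts[row][col] += 1
    return {row: dict(cols) for row, cols in counts.items()}


def row_percentages(matrix):
    """Convert each row of counts into percentages of the row total."""
    return {row: {col: 100.0 * n / sum(cols.values())
                  for col, n in cols.items()}
            for row, cols in matrix.items()}


# Toy input: eight objects in three clusters (None = no SIMBAD match).
cluster_ids = [0, 0, 0, 1, 1, 1, 1, 2]
simbad_types = ["QSO", "QSO", None, "WD", "WD", "QSO", None, None]
matrix = confusion_matrix(cluster_ids, simbad_types)
print(matrix["UNKNOWN"])
print(row_percentages(matrix)["QSO"])
```

In this construction the UNKNOWN row plays the role described in the text: each of its entries counts unidentified objects that share a cluster with a given SIMBAD type and are therefore candidates for that type.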

Table 3 can be compared with Tables 4 and 5. Table 4 shows the results obtained when the BP/RP data are divided by the response curve of the spectrophotometers, while Table 5 shows the matrix obtained when the SOM was computed from the original high-resolution SDSS spectra. In this way, we can evaluate the impact of the conversion from the SDSS spectral format to the Gaia BP/RP format. These matrices indicate that the algorithm behaves better when working with BP/RP data. Moreover, the BP/RP format without any instrumental correction, which is the one chosen by CU8 for Apsis, does not degrade the SOM performance. Given that the outlier sources are very faint, the lower resolution is compensated for by the higher signal-to-noise ratio of the spectrophotometric data.

5. Discussion and future developments

As a precise, large, and complete survey, Gaia is raising enormous expectations in the astrophysical community. To process such a tremendous amount of data, automated specialized analysis tools are being developed by the Gaia DPAC, with the aim of classifying objects and estimating their astrophysical parameters. A set of standardized classification labels is used to enable the supervised classification of approximately 95% of the observed sources. This still leaves an enormous number of expected outliers (of the magnitude of ), which will be detected as objects that cannot be reliably related to any of the predefined categories, either because their observations are complex (several superposed objects, instrumental/calibration errors, signals with high noise levels, etc.) or because they belong to a new class of objects that is rarely found. The purpose of the OA Gaia DPAC group is to prepare algorithms for analyzing such objects by means of unsupervised classification techniques.

Our work presents and discusses the application of the OA algorithms to a significant set of spectra from the well-known SDSS survey, with the aim of characterizing the spectroscopic classification outliers. Specifically designed SOMs were used to compress the dataset into a representation that is optimal in terms of Euclidean distance. This simplifies the subsequent analysis, since complex operations can then be performed on the resulting clusters rather than on the individual objects. Furthermore, the SOM algorithms project the dataset onto a 2D grid where the topological relations are preserved, which allows us to easily visualize the distribution of the dataset in order to find clusters and outliers, and facilitates data exploration and knowledge discovery.
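For readers unfamiliar with the technique, the following is a minimal sketch of a generic online Kohonen-style SOM on a rectangular grid, using Euclidean distance to select the best-matching unit. It is not the specifically designed SOM used in this work: the linear decay schedules, the Gaussian neighborhood, the initialization from random samples, and all function and parameter names are assumptions chosen for brevity.

```python
import math
import random


def best_matching_unit(weights, x):
    """Grid position of the prototype closest to x (squared Euclidean distance)."""
    return min(weights,
               key=lambda unit: sum((w - v) ** 2 for w, v in zip(weights[unit], x)))


def train_som(data, rows, cols, n_iter=500, lr0=0.5, sigma0=None, seed=0):
    """Train a rows x cols map; returns {(row, col): prototype vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    if sigma0 is None:
        sigma0 = max(rows, cols) / 2.0
    # Initialise every prototype with a randomly chosen training sample.
    weights = {(r, c): list(rng.choice(data))
               for r in range(rows) for c in range(cols)}
    for step in range(n_iter):
        frac = step / n_iter
        lr = lr0 * (1.0 - frac)                   # learning rate decays linearly
        sigma = sigma0 + (0.5 - sigma0) * frac    # neighborhood radius moves toward 0.5
        x = rng.choice(data)
        bmu = best_matching_unit(weights, x)
        for (r, c), w in weights.items():
            grid_d2 = (r - bmu[0]) ** 2 + (c - bmu[1]) ** 2
            h = lr * math.exp(-grid_d2 / (2.0 * sigma ** 2))
            for k in range(dim):
                w[k] += h * (x[k] - w[k])         # pull prototype toward the sample
    return weights


# Toy data: two small groups of three-component "spectra".
data = [[0.0, 0.0, 0.1], [0.1, 0.0, 0.0], [1.0, 0.9, 1.0], [0.9, 1.0, 1.0]]
som = train_som(data, rows=4, cols=4)
print(best_matching_unit(som, [0.0, 0.0, 0.0]),
      best_matching_unit(som, [1.0, 1.0, 1.0]))
```

Because neighboring units on the grid are updated together, nearby prototypes end up representing similar inputs, which is the property that makes the 2D visualization of the clusters meaningful.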

With the identifications obtained by means of Gaia spectrophotometric simulations and by the retrieval of external data via the SDSS SkyServer and the SIMBAD database, we were able to identify the nature of the clusters populating some of the SOM regions, including classes of unknown objects whose spectra were not included among the templates in the current Gaia simulation (such as some types of AGN, for example BL Lac objects). We also identified, among the outlying regions, several sources of errors, such as poor photometric detections or incomplete spectra, and some objects of an uncommon nature that require further identification.

The results shown in this work demonstrate that the proposed method can effectively assist researchers in distinguishing among candidate sources for completing the training sets of supervised classification algorithms, faint or damaged objects, and new classes of astronomical objects. We expect that this will help in processing the upcoming Gaia dataset by revealing, as early as possible, systematic errors in the instrument or in the pipeline procedures. Additionally, we hope that it will be possible to detect previously unseen objects, with the help of the methods presented here and of the astronomical community. However, the Gaia mission will bring new challenges, such as the processing of large amounts of data and high extinction levels, which will be addressed in the near future.

There is still work to be done in extending the OA functionality. The present paper has shown an identification process that makes use of spectrophotometric templates and semiautomated cross-matching of sources. This identification procedure will be extended in the future by incorporating additional data, such as Gaia astrometry and photometry, as well as measured spectral features, and by integrating an expert system (see Davis & Lenat 1982) that would perform inferences based on all available internal Gaia data. In addition, the cross-matching process could take into account other characteristics apart from positions in the sky, such as photometric similarities. The cross-matching method could also be further automated by building robots that navigate the Web looking for good identifications in surveys such as LSST, LAMOST, or VISTA. Finally, interesting objects could be proposed for follow-up programs and identification by new ad hoc observations.

Acknowledgments

This work was supported by the Spanish MINECO FEDER through Grants AYA2009-14648-C02-01 and 02 and AYA2012-39551-C02-02 and 01, and CONSOLIDER CSD2007-00050. The GOG simulations were run on the supercomputer MareNostrum at the Barcelona Supercomputing Center (Centro Nacional de Supercomputación). In addition, we would like to acknowledge the support of the Italian Space Agency, through ASI contract I/058/10/0, and of the German space agency (DLR). This research has made use of the SIMBAD database, operated at CDS, Strasbourg, France. The Digitized Sky Surveys were produced at the Space Telescope Science Institute under US Government grant NAG W-2166.

References

Abazajian, K. N., Adelman-McCarthy, J. K., Agüeros, M. A., et al. 2009, ApJS, 182, 543
Allard, F., Hauschildt, P. H., & Schweitzer, A. 2000, ApJ, 539, 366
Bailer-Jones, C. A. L., Andrae, R., Arcay, B., et al. 2013, A&A, in press, DOI: 10.1051/0004-6361/201322344
Baraldi, A., & Blonda, P. 1999, Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Trans., 29, 778
Blomme, R., Frémat, Y., Lobel, A., & Martayan, C. 2010, EAS Pub. Ser., 45, 373
Brott, I., & Hauschildt, P. H. 2005, in The Three-Dimensional Universe with Gaia, eds. C. Turon, K. S. O’Flaherty, & M. A. C. Perryman, ESA SP., 576, 565
Castanheira, B. G., Kepler, S. O., Handler, G., & Koester, D. 2006, A&A, 450, 331
Cortes, C., & Vapnik, V. 1995, Machine Learning, 20, 273
Davis, R., & Lenat, D. B. 1982, Knowledge-based systems in artificial intelligence, McGraw-Hill advanced computer science series (New York, St. Louis, San Francisco: McGraw-Hill)
de Bruijne, J. H. J. 2012, Astrophys. Space Sci., 341, 31
Desai, S., Armstrong, R., Mohr, J. J., et al. 2012, ApJ, 757, 83
Dobos, L., Csabai, I., Yip, C.-W., et al. 2012, MNRAS, 420, 1217
Fabricius, C., Jordi, C., Carrasco, J. M., Voss, H., & Weiler, M. 2013, in Highlights of Spanish Astrophysics VII, Proc. Meet. SEA, 880
Fort, J.-C., Letrémy, P., & Cottrell, M. 2002, in Proc. ESANN 2002, 10th European Symposium on Artificial Neural Networks, Bruges, Belgium, April 24–26, ed. M. Verleysen, 223
Fustes, D., Dafonte, C., Arcay, B., et al. 2013, Expert Syst. Appl., 40, 1530
Geach, J. E. 2012, MNRAS, 419, 2633
Gustafsson, B., Edvardsson, B., Eriksson, K., et al. 2008, A&A, 486, 951
Holl, B., Lindegren, L., & Hobbs, D. 2012, A&A, 543, A15
Isasi, Y., Figueras, F., Luri, X., & Robin, A. C. 2010, in Highlights of Spanish Astrophysics V, eds. J. M. Diego, L. J. Goicoechea, J. I. González-Serrano, & J. Gorgas, Astrophys. Space Sci. Proc. (Berlin, Heidelberg: Springer)
Jolliffe, I. T. 2002, Principal Component Analysis, 2nd edn. (Springer)
Kaski, S. 1997, Acta Polytechnica Scandinavica, Mathematics, Computing and Management in Engineering Series No. 82, DTech Thesis, Helsinki University of Technology, Finland
Kohonen, T. 1982, Biol. Cybern., 43, 59
Kohonen, T., Schroeder, M. R., & Huang, T. S. 2001, Self-Organizing Maps, 3rd edn. (Secaucus, NJ, USA: Springer-Verlag New York, Inc.)
Li, J., Ray, S., & Lindsay, B. G. 2007, J. Mach. Learn. Res., 8, 1687
Naim, A., Ratnatunga, K. U., & Griffiths, R. E. 1997, ApJS, 111, 357
Ordóñez-Blanco, D., Arcay, B., Dafonte, C., Manteiga, M., & Ulla, A. 2010, Lect. Notes Essays Astrophys., 4, 97
Ordóñez, D., Dafonte, C., Manteiga, M., & Arcay, B. 2012, Appl. Soft Comp., 12, 203
Polzlbauer, G. 2004, in Proceedings of the Fifth Workshop on Data Analysis (WDA’04), eds. J. Paralic, G. Polzlbauer, & A. Rauber (Sliezsky dom, Vysoké Tatry, Slovakia: Elfa Academic Press), 67
Saglia, R. P., Tonry, J. L., Bender, R., et al. 2012, ApJ, 746, 128
Schölkopf, B., Platt, J. C., Shawe-Taylor, J. C., Smola, A. J., & Williamson, R. C. 2001, Neural Comput., 13, 1443
Smith, K. 2012, in Springer Series in Astrostatistics, 2, Astrostatistics and Data Mining, eds. L. M. Sarro, L. Eyer, W. O’Mullane, & J. De Ridder (New York: Springer), 239
Sordo, R., Vallenari, A., Tantalo, R., et al. 2011, J. Phys. Conf. Ser., 328, 012006
Tsalmantza, P., Karampelas, A., Kontizas, M., et al. 2012, A&A, 537, A42
Warren Liao, T. 2005, Pattern Recogn., 38, 1857
Way, M. J., & Klose, C. D. 2012, PASP, 124, 274
Xu, R., & Wunsch, D. I. 2005, Neural Networks, IEEE Trans., 16, 645

