Open Access
Issue
A&A
Volume 693, January 2025
Article Number A95
Number of page(s) 19
Section Extragalactic astronomy
DOI https://doi.org/10.1051/0004-6361/202451323
Published online 07 January 2025

© The Authors 2025

Licence Creative CommonsOpen Access article, published by EDP Sciences, under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This article is published in open access under the Subscribe to Open model. Subscribe to A&A to support open access publication.

1. Introduction

One of the most challenging and important topics in astrophysics is the activity classification of galaxies. Galaxies can be separated into categories based on their prevailing gas excitation mechanism. Typically, the identification of the radiation source that excites the gas is determined from the observed spectrum of a galaxy (e.g., atomic emission lines; Baldwin et al. 1981).

In a galaxy, three primary gas excitation mechanisms can occur. The first involves the presence of young stars that emit substantial amounts of intense UV radiation, thereby ionizing the surrounding gas cloud. Another distinct mechanism arises from an active black hole situated at the galaxy’s center, where circumnuclear material undergoes accretion and produces intense UV and X-ray radiation. Finally, hot evolved stellar populations can generate sufficient UV radiation to excite the surrounding gas in their vicinity. Furthermore, the spectrum of a galaxy cannot be exclusively linked to a specific excitation mechanism. For instance, a population of post-asymptotic giant branch (post-AGB) stars can emulate an active galaxy (Stasińska et al. 2008). This intricate nature of galaxies complicates the characterization of their activity.

Moreover, the development of highly effective activity diagnostic tools, in terms of completeness and reliability, are critical for demographic surveys of galactic activity. These are key elements for understanding the interplay among the interstellar medium, star formation, stellar populations, and active galactic nucleus (AGN) activity in galactic cores, and hence the general galactic evolution process (e.g., Ho et al. 1997b).

Numerous approaches have been taken in this direction, leading to the creation of a multitude of diagnostic tools, in an endeavor to address this problem, which is both highly important and complex. Some of these approaches rely on atomic emission-line fluxes (e.g., ratios of Balmer hydrogen lines with forbidden emission lines; Baldwin et al. 1981), while others use infrared colors (e.g., Donley et al. 2012; Mateos et al. 2012; Stern et al. 2012; Daoutis et al. 2023), aiming to pinpoint the primary source of radiation driving the observed galaxy spectrum, and thus categorize its activity class.

One of the most successful and widely used diagnostics is the Baldwin et al. (1981) diagram (hereafter the BPT diagram). The most commonly used version employs the ratios of the first two Balmer series lines (Hα and Hβ) and two forbidden emission lines ([O III] λ5007 and [N II] λ6584) to construct a two-dimensional plot of [O III] λ5007/Hβ versus [N II] λ6584/Hα. Based on their position on this diagram, galaxies are categorized as active galactic nuclei (AGN), star-forming (SF), or transition objects (TO or composite galaxies; e.g., Ho et al. 1997b). The galaxies on this diagram form a seagull-like shape, with star-forming galaxies situated in the left wing, composite galaxies comprising its body, and AGN positioned in the right wing. Similarly, additional diagrams involving the [S II] λλ6717, 6731/Hα and [O I] λ6300/Hα ratios have been utilized, expanding the usefulness of emission-line ratios to objects hosting low-ionization sources. As a result, a new activity class was introduced, the low-ionization nuclear emission-line region (LINER; Heckman 1980).

Another activity diagnostic tool for galaxies in the optical spectrum was introduced by Cid Fernandes et al. (2010). This approach, which can be viewed as a modified BPT diagram, employs optical emission lines and the Hα equivalent width to classify galaxies. An advantage of this diagram is its versatility beyond emission spectra allowing the inclusion of objects with weaker emission lines.

Machine learning algorithms are powerful tools for discerning intricate relationships and patterns within data. They have already been applied to numerous problems across various scientific domains. In astrophysics, specifically, examples of their application ranges from stellar classification tasks (e.g., Kyritsis et al. 2022) to galaxy classification challenges, such as classification based on the BPT diagram (Stampoulis et al. 2019), the identification of AGN properties (e.g., Pennock et al. 2021), and the classification of galaxy morphology (e.g., Domínguez Sánchez et al. 2018).

Up to this point, we have described the activity of galaxies that are exclusively characterized by a singular gas excitation mechanism. However, a galaxy spectrum is rarely the result of only one gas excitation mechanism. Composite and LINER are two galaxy classes introduced as distinct classes, but in practice they are characterized by different sources of excitation or a combination of excitation mechanisms. Identifying the activity source in a composite galaxy can be intricate. Positioned between star-forming and AGN galaxies in the BPT diagram, the spectra of these galaxies can be excited by AGN or young stellar populations. Moreover, recent investigations propose that not all composite galaxies host an active nucleus; instead, the added ionization source may originate from populations of hot evolved stars (Byler et al. 2017, 2019).

The mechanism driving the activity of LINER galaxies is more challenging to interpret. For several years, it was believed that their activity resulted from a low-luminosity active nucleus (Ho et al. 1997b). Some studies introduced the notion that the activity can also be attributed to post-AGB stars (Binette et al. 1994; Stasińska et al. 2008; Papaderos et al. 2013). Agostino et al. (2021) provide further evidence that LINERs can be separated into two subclasses based on the hardness of their ionization field. To date, all available diagnostic methods have converged on the conclusion that these two classes result from mixed activities, yet they fail to provide detailed insight into the characterization of the true underlying excitation mechanism.

While some of these classification methods have demonstrated success in discerning the AGN from star formation based on the hardness of the radiation source, the majority have limited applicability, complex implementation, or an inability to simultaneously include all activity classes. However, none of these methods can effectively characterize complex instances of galaxy activity where combinations of coexisting principal activity mechanisms result in degeneracy in the observed spectrum (e.g., composite galaxies). For this study we developed a machine-learning diagnostic tool that employs four key features and can effectively categorize galaxies into activity classes based on their similarities (i.e., shared properties) with the three primary excitation mechanisms: star formation, AGN, and emission from hot evolved stellar populations. To achieve this, we employed a machine-learning approach. Our choice of discriminative features includes the equivalent widths (EWs) of [O III] λ5007, [N II] λ6584, Hα, and the D4000 continuum break index (Balogh et al. 1999). Using EWs instead of actual flux values for spectral lines enables us to encompass passive galaxies within a unified classification framework. Additionally, we anticipate that the D4000 index will help to identify the excitation resulting from hot evolved stellar populations, which is frequently misidentified as emission from active galaxies. This confusion is especially pronounced in cases of mixed-activity classes (e.g., composite galaxies). Our ultimate objectives encompass the identification of the dominant gas excitation mechanism in a galaxy, as well as the recognition of combinations of gas excitation mechanisms coexisting within galaxies exhibiting mixed-activity classes.

This paper is organized as follows. In Sect. 2 we introduce the data sample, selection method of the galaxy activity classes, and data processing. In Sect. 3 we present the algorithm used for the development of our diagnostic tool, its training process, and the metrics used to evaluate its performance. In Sect. 4 we present the results and how we treat mixed-activity galaxy classes. In the same section, we introduce a new activity diagnostic diagram tailored to handle all galaxy activity types (active to passive) offering more resolution (in terms of activity) for mixed-activity classes that are not addressed in the current diagnostics. In Sect. 5 we discuss the potential limitations, and we compare our diagnostic with other diagnostic methods. In Sect. 6 we summarize our conclusions.

2. Data sample

2.1. Data acquisition

Our data sample consists of a combination of data from two sky surveys. We began with the MPA-JHU (Kauffmann et al. 2003; Brinchmann et al. 2004; Tremonti et al. 2004) DR8 release of the Sloan Digital Sky Survey (SDSS). In this respect we cross-matched the galSpecInfo, galSpecIndx, and galSpecLine catalogs. We focused on the EWs corresponding to Hα, the doubly ionized forbidden line of oxygen ([O III] λ5007), and the singly ionized forbidden line of nitrogen ([N II] λ6584). All these EW values are derived from continuum-subtracted spectra, where negative values indicate emission. We also included the D4000 index from the galSpecIndx catalog. This is the continuum break at the 4000 Å as defined by Balogh et al. (1999). For all emission lines we applied corrections on the line measurements as reported in the process mentioned in Stampoulis et al. (2019).

To identify a set of inactive galaxies, we also relied on ultraviolet photometry data from the Galaxy Evolution Explorer (GALEX; Martin et al. 2005) survey. To achieve this, we cross-matched the SDSS sample and the GALEX-SDSS-WISE Legacy Catalog (GSWLC) described by Salim et al. (2016), using a search radius of 1″.

Upon aggregating all available data based on the features of interest for each object, the resulting catalog encompassed a total of 206 476 galaxies. Next we imposed the following quality criteria. Within the MPA-JHU dataset, objects flagged with RELIABLE = 0 in the SDSS catalog were omitted from our analysis. For the remaining sample we required a signal-to-noise ratio (S/N) > 5 in the continuum around all the considered spectral lines, namely, Hα, [O III], and [N II]. To ensure the reliability of D4000 measurements, we utilized the Hγ continuum, selecting galaxies with a continuum exhibiting S/N > 5. The rationale for employing Hγ lies in its proximity to the 4000 Å, making it a suitable proxy for assessing the quality of the blue continuum.

Moreover, we excluded galaxies with D4000 values set to 0 as our visual inspection revealed erroneous measurements in spectra due to incomplete coverage of the region of the spectrum needed for the calculation of the D4000. Following the application of the aforementioned quality criteria, the resulting galaxy sample comprised 180 436 galaxies.

2.2. Multi-dimensional emission-line classification of active galaxies

For the formulation of our diagnostic, we opted to train a supervised machine learning algorithm, which requires accurate labels for the galaxies that we intended to incorporate in the training process. To obtain these labels (classifications) for our galaxy sample, we employed the Soft Data-Driven Analysis (SoDDA) classifier of Stampoulis et al. (2019), a four-dimensional diagnostic based on four distinct emission-line ratios of log10([N II]/Hα), log10([S II]/Hα), log10([O I]/Hα), and log10([O III]/Hβ). This model was formulated by fitting multivariate Gaussian distributions within this four-dimensional emission-line ratio space. This method offers a notable advantage as it concurrently considers all four important features, as opposed to their two-dimensional projection, as observed in the diagnostic introduced by Kewley et al. (2006). By doing so, we optimized the credibility of the classification outcome, while avoiding contradictory classifications. It also gives the probability for a galaxy to belong to each of four activity classes (star-forming, AGN, LINER, and composite). The class with the highest predicted probability is adopted as the activity class of each galaxy.

The SoDDA employed here is a probabilistic classifier, where the class label for each galaxy is assigned based on the highest predicted probability. To ensure high-confidence classification of every active galaxy (both star-forming and AGN), an additional cut was applied based on the predicted probabilities for each galaxy to belong to all available classes. Thus, we selected galaxies that had probability differences between the first and second predicted classes of 25%.

2.3. Classification of passive galaxies

After establishing a sample of active galaxies, we also needed to define a sample of passive galaxies. For the definition of the initial sample of passive galaxies, we selected galaxies in the red sequence of the color-magnitude diagram (CMD; e.g., Bell et al. 2004; Haines et al. 2008) of galaxies. The near-ultraviolet (NUV) and SDSS r-band photometries are based on the GSWLC and SDSS catalogs, respectively. More specifically, we used the red sequence definition of Haines et al. (2008) on the NUV − r CMD: NUV-r > 5.393 − 0.1782 (Mr + 20)−0.370. In Fig. 1 we show this selection criterion by plotting it in a NUV − r versus Mr CMD alongside all the galaxies in our sample. To ensure the robustness of the classification, we set S/N > 3 in the NUV − r color for all passive galaxies.

thumbnail Fig. 1.

CMD diagram of NUV − r vs. Mr. The gray dots and black contours represent the entire galaxy sample from which we selected our sample of passive galaxies. The sample of galaxies is from the SDSS. The black solid line is the best-fit color-magnitude relation defined by Haines et al. (2008), and the black dashed line is parallel to the former line, but displaced by a distance of 1σ.

In our sample of passive galaxies selected in this way, we find that the contamination by spectroscopically classified star-forming or AGN galaxies is negligible (∼0.5% and ∼0.7%, respectively). Therefore, we opted not to remove any emission-line objects that had been spectroscopically classified as star-forming or AGN. Furthermore, we observed a significant number of optically selected LINER and composite galaxies that are located in the red sequence on the (NUV − r)−Mr CMD. Since LINERs and composites may be powered by old stellar populations, we adopted the passive classification. We preferred the NUV − r definition for the red sequence over the definition based on the (u − r)−Mr CMD since we found that the latter has significant contamination by AGN and obscured star-forming galaxies. This approach provides a continuous mapping of the different observed features, ranging from weak emission to absorption, as a result of the diverse evolutionary stages of underlying stellar populations found within the host galaxy.

Even though the gas ionization mechanism in LINER and composite galaxies is generally considered to be an active nucleus, there is growing evidence that their emission can also be attributed to hot evolved stellar populations (post-AGB stars; e.g., Stasińska et al. 2008; Singh et al. 2013). In the case of composite galaxies, the prevailing notion is that their activity originates from both an active nucleus and a star formation component. However, as discussed in Byler et al. (2019), it is also possible that the emission lines of these galaxies may be excited by weak residual star formation aided by ionization from hot evolved stellar populations. Even though these two subpopulations of LINER and composite galaxies exhibit emission lines that, under different circumstances, would characterize them as active galaxies, the primary ionization mechanism derives from old stellar populations. By incorporating objects with faint emission lines into the training sample, we do not bias our diagnostic against galaxies with excitation by old stellar populations.

Finally, we applied extinction and k-corrections to all utilized photometry. The optical u and r SDSS colors of the galaxies were also corrected for galactic dust extinction, following the Cardelli et al. (1989) extinction law with RV = 3.1 and the E(B − V) values obtained from the dust maps of Schlegel et al. (1998). The NUV band was corrected for reddening effects using the extinction coefficients provided by Peek & Schiminovich (2013). Given the substantial range of distances covered by galaxies in our sample (up to z = 0.08), we also performed k-corrections in the NUV and r bands using the k-correction calculator1 based on the methods presented in Chilingarian et al. (2010) and Chilingarian & Zolotukhin (2012).

2.4. Data processing and final sample

Our focus is on training a diagnostic for the three principal mechanisms of gas excitation. For this reason, from the emission-line classification we only considered these three principal classes: star formation, active nucleus, and old stellar populations. All other active galaxy classes, such as composite and LINER galaxies, were intentionally excluded from the training dataset. The composition breakdown per class within the final sample, intended for algorithm training, is detailed in Table 1.

Table 1.

Composition of the final sample per galaxy class.

In Fig. 2 we depict the projection of our sample on the traditional two-dimensional BPT diagrams illustrating the distribution of the training set for each of the three classes: SF, AGN, and passive. The projections are shown with respect to log10([O III] λ5007/Hβ) versus log10([N II] λ6584/Hα), log10([S II] λλ6717, 6731/Hα), and log10([O I] λ6300/Hα).

thumbnail Fig. 2.

Projections of the training sample for the three principal galaxy activity classes. The left plot shows the standard BPT diagram, the middle plot displays log10([O III]/Hβ) vs. log10([S II]/Hα), and the right plot illustrates log10([O III]/Hβ) vs. log10([O I]/Hα). The blue dots represent SF galaxies, the green dots represent AGN galaxies, and the red dots represent passive galaxies. We note that in all three plots, only a subset of the passive galaxies is presented due to the typically poor quality of their spectra. For these plots, we selected galaxies with emission lines having S/N > 3. In all three plots, the black dashed line corresponds to Kauffmann et al. (2003), the black solid curved line represents Kewley et al. (2001), and the straight black line is Schawinski et al. (2007), which separates LINERs from AGN.

These diagrams also include the subset of passive galaxies with S/N > 3 in all emission lines used for the plots. As anticipated, a nonnegligible fraction of the passive galaxy sample is absent from these diagrams due to their minimal or nonexistent emission lines. However, the presence of passive galaxies in these diagrams ensures that our passive galaxy sample is not biased against early-type galaxies hosting an AGN, residual star formation, or photoionized regions by post-AGN stars. As expected, these are found in the locus of composite and LINER galaxies (cf. Cid Fernandes et al. 2010).

2.5. Feature selection

Activity diagnostics in the optical band are traditionally based on emission-line ratios to characterize galactic activity. However, this introduces a bias as early-type galaxies tend to have limited or negligible reserves of dust and gas, thereby hindering their ability to generate such lines, and as a result excluding them from these diagnostics. To address these limitations, we propose the utilization of equivalent width measurements of a spectral line instead of its flux. This approach aims to address these issues and establish a self-consistent diagnostic method applicable seamlessly to galaxies exhibiting emission on absorption lines.

Our next step is to determine the minimum number of spectral lines required to identify the dominant ionization mechanism within the host galaxy. In our pursuit of identifying these optimal features, we commence by contemplating both astrophysical and practical considerations for each potential feature selection. Beginning from an astrophysical standpoint, our motivation is partially rooted in the physics underlying emission-line diagnostic methods. We specifically consider the EWs of the spectral lines present in BPT diagrams: Hα, [N II] λ6584, [O III] λ5007, [S II] λλ6717,6731, and [O I] λ6300. There is a strong physical reasoning for this choice, as we anticipate that the EWs of singly or doubly ionized oxygen will be higher in an AGN environment compared to a H II region (Baldwin et al. 1981). This arises from the fact that the UV radiation in an AGN environment tends to be more energetic than that typically encountered in a H II region.

From a practical perspective, we opted for strong spectral features that are generally straightforward to observe and measure, such as the Balmer lines and strong forbidden emission lines (e.g., [O III] λ5007). In Fig. 3 we see the distributions of all the features we considered as potential discriminating features for the development of our diagnostic. However, due to the correlation between the Hα and Hβ lines, including both is redundant. Consequently, we chose to utilize Hα because of its larger EW, making it applicable to a broader range. Similarly, we observed that EW([S II] λλ6717,6731) and EW([O I] λ6300) exhibit very similar behavior to EW(Hα) and the EW([N II] λ6584). This observation suggests that the inclusion of the EW of these lines, [S II] λλ6717,6731 and [O I] λ6300, in our diagnostic scheme may not provide significant additional information for discriminating the primary activity classes. Therefore, we adopted the EW of Hα, [O III] λ5007, and [N II] λ6584 spectral lines.

thumbnail Fig. 3.

Distributions of six potential features for the three principal activity classes: star-forming (SF), AGN, and passive galaxies. In the top left, we observe the EW of Hα; in the top right, the EW of [N II] λ6584; in the middle left, the EW of [O III] λ5007; in the middle right, D4000; in the bottom left, the EW of [S II] λλ6717,6731; and in the bottom right, the EW of [O I] λ6300. These features represent the equivalent widths of the corresponding emission lines commonly utilized in galactic activity classification models.

The features discussed in the previous paragraph are good indicators of the activity of a galaxy. However, these activity indicators alone are not capable of identifying whether a galaxy hosts primarily old or young stellar populations that can produce ionizing continua with hardness between that of a young stellar population and an AGN (Cid Fernandes et al. 2010). Thus, in addition to these features, we also require a feature that conveys information regarding the age of the stellar populations. Of all stellar absorption line indicators, the D4000 is the only one that shows the weakest metallicity dependence (Poggianti & Barbaro 1997). The amplitude of the D4000 index is primarily influenced by massive stars in the main sequence (MS), and therefore it exhibits a nearly monotonic increase with the age of the stellar populations. The definition of D4000 we adopted is from the work of Balogh et al. (1999), which uses narrower bands than the one typically defined by Bruzual (1983). This enables us to include a larger number of galaxies, while simultaneously being less sensitive to reddening effects.

In conclusion, we find that by using the EW of Hα, [O III] λ5007, and [N II] λ6584 lines along with the D4000 continuum break index, we can define a diagnostic tool that fulfills all the proposed criteria. The distributions of each feature per class are presented in Fig. 3.

3. The classifier

3.1. The random forest

The complexity of this problem necessitates the utilization of a versatile algorithm capable of effectively discriminating between galaxy activity classes. We determined that the random forest algorithm (Louppe 2014) is best suited to address the requirements of this classification task. Our choice of this specific algorithm was motivated by its inherent flexibility as it allows the adjustment and fine-tuning of numerous parameters to align with the specific demands of the problem at hand. Furthermore, the random forest algorithm is well-known for producing robust results as its training process is generally unaffected by outliers. Another advantage lies in its straightforward and intuitive operation, which simplifies the implementation and interpretation of results. Along with the classification labels, the random forest also calculates the probability of the classified object to belong in each of the considered classes.

3.2. Implementation

We implemented the random forest algorithm using the RandomForestClassifier from the scikit-learn Python 3 package (Pedregosa et al. 2011), version 1.1.1. Our algorithm is provided with four features: EWs of [O III] λ5007, [N II] λ6584, and Hα, the D4000 continuum break, and the activity class of each object. These features enable the algorithm to be trained to classify galaxies into the three primary activity types: star-forming, AGN, and passive.

Following the selection of the optimal values of the hyperparameters (for more details, see Appendix A), the subsequent step was training the algorithm. We employed the data that were selected based on the specific criteria outlined in Sect. 2.4. The entire sample was divided into two subsets: the training set and the test set, with a 70%–30% split. To ensure uniformity within the two subsets, we performed a stratified split ensuring that each subset contained the same fraction of each class. Before splitting the data, we randomly shuffled them to guarantee homogeneity within the two subsets. The training set, comprising most of the data (70%), was utilized for training the algorithm, serving as the data from which the decision trees were constructed. On the other hand, the test subset, accounting for the remaining 30% of the sample, was solely used to evaluate the performance of the algorithm. This approach ensures that the algorithm generalizes well and performs effectively on unseen data, extending its utility to datasets beyond the one used for training.

3.3. Performance metrics

To visualize the performance, we calculated a confusion matrix based on a subset of objects for which we already have their ground-truth classifications (the test set). A perfect classifier is characterized by a confusion matrix with elements only along its primary diagonal, indicating that every prediction matches the true class. However, if the classifier has misclassified instances, the off-diagonal elements will be nonzero, offering a detailed view of the misclassified objects.

In addition to the confusion matrix, we used several other performance metrics (see Table 2 in Daoutis et al. 2023). In particular, we made use of the accuracy, the recall, and the F1 score.

4. Results

4.1. Performance on the principle activity classes

We evaluated the performance of our model by calculating its accuracy by splitting the whole sample into ten subsamples. Each time we used nine out of the ten subsamples (folds) to train it and one to evaluate the accuracy. We repeated the training process, each time replacing the test set with one of the training sets. This process was repeated until all the folds were in the position of the test fold once. This is known as K-fold cross-validation. We find that the overall accuracy we achieved is 0.989 ± 0.004. The low standard deviation suggests a robust classifier as its accuracy score is independent of the training data.

Table 2 provides an overview of the performance scores for each class. The scores for all classes are nearly perfect. The high recall scores for each class indicate that the classifier can correctly retrieve nearly all objects of each class, showcasing its high level of completeness. Furthermore, the high precision scores imply that there are only a few instances where objects have been incorrectly predicted to belong to a different class than their true class. This demonstrates that the contamination within each class is minimal. These results collectively indicate the effectiveness of the diagnostic in accurately identifying the galaxies that are representative of the three principal gas excitation mechanisms (i.e., young stars, active nucleus, and old stellar populations).

Table 2.

Report of performance scores calculated on the test sample for each galaxy class using three different metrics.

In Fig. 4 we present the confusion matrix for this diagnostic, which was calculated using the objects from the test subset. It is clear that the confusion matrix exhibits a nearly diagonal pattern. Misclassification instances for the star-forming and passive galaxies are negligible. Only a small fraction of AGN galaxies (1.6%) are misclassified as star-forming.

thumbnail Fig. 4.

Confusion matrix calculated on the test subset of the final sample. The color and the number in each box refer to the percentage of objects calculated relative to the total number of true instances for each class separately. The labels on the x- and y-axes indicate respectively the predicted and true class of a galaxy. We note that this matrix is almost diagonal, indicating a high-confidence classification.

4.2. Feature importance

While all initially selected features may appear relevant, some may have a greater impact than others. Identifying the most important features allows us to identify any redundant features that should be eliminated. This reduction in feature complexity not only results in a more efficient classifier, but also enhances its adaptability to a broader number of datasets.

Figure 5 provides a feature importance plot, extracted during the training process of the algorithm. Based on the feature importance we find that the EW of Hα is the most crucial feature. The second most important feature is the D4000 index. Notably, all features exhibit similar importance, indicating the robustness of the feature scheme. A closer examination of this plot reveals that the EW of [N II] is ranked last and exhibits a high standard deviation. This hints at the possibility of a feature with low significance, suggesting redundancy. However, the removal of the EW of [N II] from the feature scheme resulted in a significant decrease in performance, particularly for the AGN galaxies.

thumbnail Fig. 5.

Plot of the feature importance calculated during the training process of the algorithm. The error bars are indicative of the standard deviation, providing insight into the stability of the feature importance estimates. The feature with the highest importance score in our diagnostic is EW(Hα), while all other features exhibit similar levels of relevance.

4.3. Mixed-activity classes and predicted probabilities

So far, we have developed a diagnostic that focuses on three classes representing the principal mechanisms of gas excitation in galaxies. However, often different excitation mechanisms can co-exist (e.g., star formation, AGN, and old stellar populations). One such example is composite galaxies, where both an active nucleus and ongoing star formation processes may be present simultaneously. Another scenario for composite galaxies involves gas excitation by hot evolved stars alongside an active star formation component. Additionally, the class of LINER galaxies represents another group with mixed activity, which may be powered by an AGN (even displaying broad emission lines (Ho et al. 1997a) or old stellar populations (Binette et al. 1994; Stasińska et al. 2008; Papaderos et al. 2013).

Our analysis so far has resulted in a tool that is able to clearly separate the galaxies that are dominated by one of the principal activity classes. This is supported by the well-defined distributions in the four-dimensional feature space, showcasing excellent separation among them. However, it is important to note that there is some overlap among these distributions. This overlap is a critical aspect as it allows us to map the entire range of distributions of the different classes in the four-dimensional feature space and identify outliers within each of the three classes. It is expected that the transition and mixing from one pure principal activity class to another is a continuous process within our four-dimensional feature space. As a result, the diagnostic has the capability to deconstruct mixed-activity classes into their principal activity components based on the resemblance of the observed spectral features of their distributions for the different excitation mechanisms, as they are mapped in the four-dimensional feature space we consider. This decomposition ability is a direct consequence of the random forest algorithm as well as the chosen feature space, as mixed classes naturally exist in the intermediate region between the principal activity classes.

The random forest functions as a probabilistic classifier, employing a “voting” process where each tree in the ensemble independently decides the class of each object. Subsequently, an object is assigned to the class that most closely resembles it, along with the associated probabilities indicating its likelihood of belonging to each of the considered classes. As a result, galaxies that are dominated by a single activity mechanism will have a high first-ranking predicted probability of belonging to the assigned activity class. However, in the case of mixed-activity galaxies, where the activity cannot be uniquely attributed to a single mechanism, all predicted probabilities for each class will tend to be similar. In other words, due to the nature of mixed-activity galaxies, they are likely to share characteristics with multiple principal classes, leading to a lower maximum predicted probability for their first-ranked class.

4.4. Definition of a refined classification scheme

We can leverage the predicted probabilities generated by the random forest to expand our analysis to the mixed-activity classes. For example an object that has a nonnegligible probability to be classified into more than one class can be considered as a mixed-class object since it is located in the overlapping region between the locus of pure classes in the four-dimensional parameter space we are considering. This way we can establish selection criteria for each activity class through an analysis of the first- and second-rank predicted probabilities for each galaxy. This allows us to include the criteria in our scheme and characterize the mixed classes.

This way the activity classes from this new classification scheme (hereafter DONHa diagnostic for D4000, [O III], [N II], and Hα) can be categorized into two main groups: pure (or principal) and mixed-activity classes. Beginning with the pure-activity classes, we retain the three classes of star-forming (or starburst), AGN, and passive, introduced in Sects. 2.2 and 2.3 which serve as representatives of the three primary gas excitation mechanisms: star formation, active nucleus, and old-stellar populations, respectively. In our analysis so far, we have been interested in the predicted probabilities of each galaxy rather than just the classification output from the diagnostic. We used these in order to define new selection thresholds for each of primary activity class.

To adjust the probability selection thresholds, we analyzed the predicted probabilities for each of the three principal classes individually, utilizing the test sample (Sect. 3.2). As these galaxies were rigorously selected, we assume that they are dominated by only one activity mechanism (i.e., pure classes) and represent pure-activity classes. Subsequently we examined the probability of the first-ranked class of the population of each of the three pure classes (i.e., star-forming, AGN, and passive galaxies). We find that the lowest 90th percentile of the probability distribution for all the first-ranking classes is higher than 90%. This means that more than 90% of the objects belonging in each of the classes we consider have a classification probability higher than 90%. We denote this probability as max_pi since it is the highest probability (corresponding to the highest-ranking class) for each object.

The second group of activity classes describes the mixed-activity galaxies, characterized by a lower first-rank probability of resembling a pure class. Therefore, if the max_pi falls below 90%, we classify the galaxy as mixed-activity. For such objects, we consider not only the first-ranking, but also the second-ranking predicted probability. This way we characterize each mixed-activity galaxy as having properties of two principal activity classes that describe the two dominant gas excitation mechanisms present in each mixed-activity class. This approach is similar to how clustering algorithms such as K-means or Gaussian mixture models (GMMs) operate. These algorithms categorize objects into groups based on their relative proximity within the feature space; two examples include Mukherjee et al. (1998) and Stampoulis et al. (2019). By considering the highest and second-highest predicted probabilities, we can effectively characterize objects that fall between the locus of the considered classes (e.g., composite) based on their similarity to the main classes originally considered.

The names we assign to the additional activity classes are derived by pairing pure classes, resulting in six new activity classes through permutations. In addition, we place the principal class with the highest predicted probability in the first position of the label, and the class with the second-highest probability in the second position. For instance, in this refined classification scheme, the mixed-activity class “starburst-AGN” differs from “AGN-starburst”. Although both labels convey that these two galaxy classes are primarily characterized by star formation and AGN processes, starburst-AGN describes a galaxy in which the dominant source of excitation originates from star formation, while AGN-starburst characterizes a galaxy in which the dominant source of excitation results from an active nucleus. Comprehensive definitions and selection criteria for the classes in the refined (DONHa) activity classification scheme are provided in Table 3. It is essential to note that the dominant source of ionization is determined based on the similarity of an object to AGN, SF, and passive galaxies in the four-dimensional space considered here, rather than the predominant flux of ionizing photons determined through SED analysis.

Table 3.

Definitions and selection criteria for all activity classes based on the refined classification scheme (DONHa diagnostic).

4.5. Decomposing LINER and composite galaxies

Building upon the analysis and performance evaluation of our diagnostic described in the two previous sections, we proceed to utilize it for decomposing the classes of composite and LINER galaxies into their principal gas excitation components. We follow two approaches in classifying these galaxies and subsequently evaluate the results.

First, we apply the diagnostic tool to a sample of composite and LINER galaxies, drawn from our parent sample described in Sect. 2. They are selected by applying the SoDDA diagnostic of Stampoulis et al. (2019). We then apply our diagnostic defined in Sect. 3 to classify these objects into one of the three principal classes, allowing us to characterize them based on their similarity to one of these classes. For instance, a galaxy that closely resembles a star-forming galaxy (i.e., with EWs of the diagnostic lines and D4000 that are closer to the locus of the SF galaxies) will be classified as a SF-composite. The rest of the mixed-activity galaxies are characterized as AGN-composite, passive-composite, SF-LINER, AGN-LINER, and passive-LINER. As discussed in the previous section the first component of the name indicates the principal activity class to which the galaxy bears the greatest resemblance, while the second component indicates its spectroscopic classification from SoDDA.

Based on this analysis, for the composite galaxies, it is found that ∼51% of them are predicted as being dominated by star formation, ∼6% as being dominated by an active nucleus, and ∼29% as being dominated by hot evolved stars. For the case of LINER galaxies, we find that about ∼12% are predicted as being dominated by AGN activity and ∼80% by hot evolved stars. LINER galaxies predicted as star-forming dominated are almost nonexistent. In Tables 4 and 5 we summarize these results. A small fraction of these objects (13.6% of composite and 7.3% of LINER galaxies) had almost equal probabilities to belong to all three principal classes; we characterize the result of these classifications as inconclusive (see Sect. 5.2 for a detailed discussion).

Table 4.

Class predictions following the implementation of the new diagnostic on the composite galaxy sample.

Table 5.

Class predictions following the implementation of the new diagnostic on the LINER galaxy sample.

We can further analyze these results by exploring the locus of these different subclasses on the standard emission-line ratio diagnostic diagrams. In the ([O III] λ5007/Hβ versus [N II] λ6584/Hα) plot, the composite galaxies occupy the area between the two lines of Kauffmann et al. (2003) and Kewley et al. (2001).

Figure 6 shows that the classes assigned by the new diagnostic tool form distinct clouds, with the center of each distribution being distinguishably different from the others. In that diagram the location of the composite galaxies that are dominated by star formation processes is just above the star-forming cloud and is tangential to the Kauffmann et al. (2003) line. Correspondingly, the AGN-composite predicted galaxies are found in the upper part of the composite population, close to the theoretical line of extreme starburst defined by Kewley et al. (2001). For the passive-composite galaxies, we see that their distribution is wider, extending to the area of objects with strong low-ionization emission lines (LINERs). Interestingly, they follow the trend described in Byler et al. (2019) for galaxies with contributions from an ionizing component of older hot stellar populations.

thumbnail Fig. 6.

Two BPT plots of log10([O III] λ5007/Hβ) vs. log10([N II] λ6584/Hα) displaying the outcome of the activity decomposition on spectroscopically selected sample of composite (left) and LINER (right) galaxies. Following the application of our diagnostic to these samples, we employed a color-coding scheme based on the highest likelihood of similarity to one of the principal activity classes. On the left, the blue contours represent composites predicted as SF (SF-composite), green as AGN (AGN-composite), and red as passive (passive-composite). This plot reveals the overlap of some SF and passive predicted composites. On the right, the green contours represent LINERs predicted to be AGN and red predicted to be passive. This plot shows that there is some overlap between the LINERs predicted as AGN and as passive. The black boundary lines are the same as defined in Fig. 2.

Following this observation, we next applied the DONHa classification scheme described in Table 3 in order to further refine our classification and to identify the primary and secondary activity classes of mixed-activity objects. The results of this analysis are shown in Table 6 and Fig. 7.

thumbnail Fig. 7.

Most probable (highest similarity) class among the three principal activity classes (top) and predictions based on the DONHa classification scheme (bottom) for the subsample of composite galaxies. The bars in both histograms are color-coded based on the principal class that each galaxy is predicted to resemble most (see Sect. 5.2 for for inconclusive classifications). Labels: SF (star-forming), SB (starburst), AGN (active galactic nucleus), Pas (passive).

Table 6.

Class predictions following the application of the DONHa diagnostic to the composite and LINER galaxies.

The results of the classification of the sample of composite galaxies based on the two classification schemes are shown in Fig. 7. The top histogram of Fig. 7 shows the predictions based on the likelihood of similarity of a composite galaxy to one of the three principal activity classes. In the bottom histogram of the same figure, we show the classification based on the more refined activity classification scheme. In Fig. 8 we project these composite galaxies onto the BPT plot to observe the location of each one of the refined activity classes.

thumbnail Fig. 8.

[O III]/Hβ vs. [N II]/Hα flux ratio, illustrating the positions of each refined activity class as predicted by our diagnostic tool after applying it on a sample of spectroscopically selected composite galaxies. We utilized the DONHa classification scheme outlined in Table 3. The left, middle, and right panel indicates spectra excited by star formation (SB), AGN activity, and old stellar populations (Pas), respectively. The contours denote galaxies predicted to exclusively belong to one of the primary activity classes (max_pi > 90%), while the data points represent galaxies with mixed activities (max_pi < 90%). In each plot the symbols represent the class with the highest predicted probability, while the colors the class with second-highest probability. The black boundary lines are the same as defined in Fig. 2. The black dots represent the training sample for demonstration purposes.

We repeated this analysis on the subsample of LINER galaxies. In Table 6 we show the results obtained by the DONHa classification scheme (see Table 3). In Fig. 9 we present the results of the classification obtained by discriminating them based on their similarity to the three principal activity classes (the same approach as for composites, top histogram of Fig. 9) and with the DONHa classification scheme that also considers the mixing of the different gas excitation mechanisms present in the host galaxy (bottom histogram of Fig. 9). In Fig. 10 we plot the subsample of LINER galaxies on the standard BPT plot. It is known that the [S II] doublet and the [O I] line are good probes of low-ionization sources. For this reason, in Fig. 11 we plot the subsample of LINERs on the [O III]/Hβ versus [S II]/Hα (top row) and on the [O III]/Hβ versus [O I]/Hα (bottom row) plots. The classification labels we used were assigned by the DONHa activity model. In that figure we observe a distinct difference in the center of the distribution for LINERs predicted to be powered purely by an AGN compared to those where the activity is solely attributed to old stellar populations. Furthermore, the latter distribution is noticeably below the former in both [O III]/Hβ versus [S II]/Hα and [O III]/Hβ versus [O I]/Hα.

thumbnail Fig. 9.

Most probable (highest similarity) class among the three principal activity classes (top) and predictions based on the DONHa classification scheme (bottom) for the subsample of LINER galaxies. The bars in both histograms are color-coded based on the principal class that each galaxy is predicted to resemble most (for the inconclusive classification, see Sect. 5.2). Labels: SF (star-forming), SB (starburst), AGN (active galactic nucleus), Pas (passive).

thumbnail Fig. 10.

[O III]/Hβ vs. [N II]/Hα, illustrating the positions of each refined activity class as predicted by our diagnostic tool (DONHa) after applying it on a sample of spectroscopically selected LINER galaxies. We utilized the DONHa classification scheme outlined in Table 3. The left, middle, and right panels indicates spectra excited by star formation (SB), AGN activity, and old stellar populations (Pas), respectively. The contours, symbols, and color-coding are as in Fig. 8. The black boundary lines are the same as defined in Fig. 2. The black dots represent the training sample for demonstration purposes.

thumbnail Fig. 11.

Log10([O III]/Hβ) vs. log10([S II]/Hα) (top row) and log10([O III]/Hβ) vs. log10([O I]/Hα) (bottom row) are presented for the spectroscopically selected subsample of LINER galaxies. The left, middle, and right panels highlights LINER-like spectra that are found to be dominated by star formation (SB), AGN activity, and old stellar populations (Pas), respectively. The contours, symbols, and color-coding are as in Fig. 8. In all plots the black boundary lines are the same as defined in Fig. 2. The black dots denote the training sample for demonstration purposes.

4.6. Development of a two-dimensional diagnostic diagram

While our DONHa diagnostic provides excellent performance, for simplicity and ease of use, we present a two-dimensional diagnostic that is a projection from the original four-dimensional feature space with almost no loss in performance. This is possible due to the clear separation of the three principal activity classes in the four-dimensional feature space. We find that the best separation between the activity classes occurs when they are projected on the D4000 against the log10(EW([O IIIλ5007])2). Another reason we chose these two features is that Hα is sensitive to uncertainties from starlight subtraction and it is sometimes blended with the [N II]. In Fig. 12 we present our sample of galaxies on a plot of D4000 against log10([EW(O IIIλ5007)]2) (hereafter DO3). The labels for each class are defined by the DONHa classifier (see Table 3). Due to the small number of objects in the mixed classes as defined in Table 3, and for better clarity, we merged the mixed-activity classes involving the same two classes regardless of which one was the primary and secondary.

thumbnail Fig. 12.

Two-dimensional diagnostic diagram displaying the D4000 continuum break vs. log10(EW([O III])2), the DO3 diagnostic diagram. Within this plot we observe the spatial distribution of galaxies predominantly influenced by a single principal excitation mechanism (pure-activity classes), as well as the positions of those with mixed-activity classes. The black lines are the boundaries separating the different classes (see Sect. 4.6; for the equation of each boundary see Appendix B). The contours representing each pure-activity class are computed using a kernel density function and illustrate the 68% (inner contour) and 90% (outer contour) population density levels. Labels: pure SB (pure starburst), pure AGN (pure active galactic nucleus), pure Pas (pure passive).

The different boundaries that delineate the classes were determined through the application of the support vector machine algorithm (SVM; Cortes & Vapnik 1995). Specifically, we employed the SVM algorithm provided by the scikit-learn package in Python 3. We trained the SVM on the two-dimensional feature space comprising the D4000 and the log10([EW(O IIIλ5007)]2), adopting a one-versus-rest approach. In this approach, we treat the class (or classes) on one side of the boundary we want to define as one class, while considering the class (or classes) on the other side of the boundary as the other class. After training the SVM, we derived the boundaries delineating the classes and fit the functions to extract these boundary equations. We show the optimal boundary equations for selecting each class in the Appendix B.

We find that the majority of the boundary lines can be accurately described by simple polynomial equations. We find that more complex equations for each boundary, such as higher-order polynomials, did not provide any advantages in terms of reliability and completeness.

From the application of the random forest diagnostic described in Sect. 3, we obtained a small set of objects (13.6% of composites and 7.3% of LINERs) that have almost equal probabilities of belonging in each of the three main classes, making their classification inconclusive (see Sect. 5.2). Given the limited number of such instances and the dispersion of these objects in the DO3, it was not possible to determine a distinct region for inconclusive objects on the DO3 diagram. However, it is worth noting that the majority of inconclusive classifications are clustered at the center of the DO3, where the three mixed-activity classes intersect. We advise potential users of the DO3 diagram to exercise caution when interpreting the classification results of any object situated close to the intersection of all activity classes.

To assess the effectiveness of the DO3 diagnostic diagram, we compared the derived classification on the sample defined in Sect. 2.4 with the classifications obtained with the DONHa diagnostic and with SoDDA method described in Sect. 2.2. Table 7 shows the comparison of the DO3 diagram with the SoDDA (ground-truth) classification. We see a high degree of agreement with the two-dimensional correctly classifying the majority of galaxies, particularly in the case of the pure-activity classes.

Table 7.

Comparison of the activity classification between the DO3 and the spectroscopic classification (SoDDA).

In Table 8 we compare the DO3 diagram with the DONHa diagnostic tool, which shows a very good agreement between the two classification methodologies. In particular, we see excellent agreement for the pure-activity classes (top right quadrant) and good performance for the mixed-activity classes, apart from the starburst-AGN where 37% are classified as pure starbursts.

Table 8.

Comparison of the activity classification between the DO3 and the DONHa diagnostic tool defined in Sect. 3.

In conclusion, the DO3 diagram proves to be a simple yet effective method for classifying galaxies into the three principal activity classes and mixed-activity classes, yielding results comparable to the DONHa diagnostic. Three notable advantages are evident. First, the inclusion of passive galaxies in this DO3 diagram broadens its applicability to accommodate the full range of galaxy activity classes. Second, it successfully disentangles the ambiguity between objects whose activity results from a combination of star formation and an AGN component, or excitation from hot evolved stars, a phenomenon observed in the locus of composite galaxies on the BPT diagram. Third, the fact that our DO3 diagnostic diagram requires only a short observed rest-frame spectral range (3850–5050 Å) makes it widely applicable to high-redshift galaxies observed in wide area spectroscopic surveys or even higher redshift objects observed with the James Webb Space Telescope (JWST).

5. Discussion

In this work, we established a diagnostic tool that can characterize the activity of galaxies based on their similarity to the locus of the three fundamental excitation mechanisms in a four-dimensional plane consisting of the EW of the Hα, O III, N II, lines and the continuum break at 4000 Å (D4000). The fundamental activity classes are star-forming, AGN, and passive galaxies. The advantage of using EWs instead of the fluxes of a spectral line is twofold: first, it allows us to include galaxies that have very weak or even absent emission lines (passive galaxies), while being insensitive to reddening effects.

5.1. Dominant photoionization mechanism of mixed-activity classes

The area of the BPT enclosed by the lines of Kauffmann et al. (2003) and Kewley et al. (2001) contains galaxies whose activity is thought to be a combination of star formation and an AGN component. However, photoionization models with pure AGN (Ferland & Netzer 1983; Halpern & Steiner 1983; Stasińska 1984) or purely old stellar populations (Stasińska et al. 2008) can occupy and cover the entire area between the lines of Kauffmann et al. (2003) and Kewley et al. (2001). This suggests that in this region of the BPT diagram, these two subpopulations of composite galaxies (starburst-AGN and passive-starburst, and starburst-passive, see Fig. 8) could coexist. By including D4000 as an age indicator we can remove this degeneracy. In other words, the inclusion of D4000 in the classification scheme will identify galaxies that host old stellar populations and are located in the locus of composite galaxies. For example, Treyer et al. (2010) suggest that the younger stellar component in star-forming galaxies typically have D4000 ≲ 1.3.

In the case of LINERs, another complex activity class, we observe that they can also be divided into two distinct subclasses based on their origin of activity. As shown in the right plot of Fig. 6, LINERs can be divided into two well-defined subpopulations. Those with greater similarities to AGN galaxies are positioned near the separation line between AGN and LINER galaxies, as delineated by Schawinski et al. (2007), while a second subpopulation is located below the first at lower values of log10([O III] λ5007/Hβ). In Fig. 6 (right plot) we also see that there is some overlap between these two subpopulations of LINERs. An important point reinforces our results. While a distinct separation line exists between AGN and LINER galaxies on the BPT diagram, Ho et al. (2003) argues that it lacks absolute physical significance. In this sense, the distribution of AGN galaxies covers a broader range than is typically considered. On the other hand, the remaining LINERs located beneath the cluster of the AGN-predicted LINERs are predicted as passive and the lower values of log10([O III] λ5007/Hβ) suggests a weaker excitation mechanism than the AGN-LINERs, which could be explained by the presence of post-AGB stars. The categorization of LINERs into two subpopulations is further validated by Agostino et al. (2021), which demonstrates that LINERs can be classified as hard and soft based on the ionization field’s hardness. In addition, Agostino et al. (2021) shows that these two subcategories of LINERs exhibit distributions on a BPT diagram that are similar to those depicted in the right plot of Fig. 6.

In the left plot of Fig. 6, we observe that certain composite galaxies predicted to be powered by an old stellar component are situated just above the Kauffmann et al. (2003) line when projected onto the standard BPT diagram. Initially, this outcome may appear unexpected, but it aligns with growing evidence that an increasing contribution from older hot evolved stellar populations results in higher [N II]/Hα ratios than typical star-forming galaxies and increasingly lower [O III]/Hβ ratios shifting galaxies toward the lower region of the standard BPT plot (Stasińska et al. 2008; Cid Fernandes et al. 2010; Byler et al. 2017, 2019). This trend becomes more evident in Fig. 13, which displays the distributions of passive-composite, star-forming-composite, and AGN-composite galaxies alongside the trajectory of points from Byler et al. (2019). In this figure we observe that the increasing age of the stellar populations within a galaxy tends to shift the galaxy’s position on a BPT diagram downward toward lower [O III]/Hβ ratios, traversing the region of composite galaxies. It is worth noting that AGN-composite galaxies are positioned above the locus of passive-composite galaxies shown here, whereas SF-composite galaxies are closer to the dashed line representing the empirical line of Kauffmann et al. (2003) that delineates SF galaxies (left plot of Fig. 6) and closer to the youngest stellar populations as indicated by the colored points.

thumbnail Fig. 13.

Log10([O III]/Hβ) vs. log10([N II]/Hα). Within this plot we can see the distribution of the spectroscopically selected subsample of composite galaxies that have been classified by the new diagnostic as passive-composite galaxies, i.e., composites where the ionization source originates from hot evolved stars. The circular data points are derived from the photoionization models from the work of Byler et al. (2019), and they are overlaid to indicate the locations of galaxies with aging stellar populations spanning from 2 to 14 Gyr. The circles are color-coded to denote the age of the stellar populations. The black boundary lines are the same as defined in Fig. 2. The black dots in the top plot represent the training sample of the principal classes, included for illustrative purposes.

To further explore these results, we study the specific star formation rate (sSFR) within our sample of optically selected composite galaxies. The star formation rates and stellar masses are from the galSpecExtra catalog of the SDSS. Figure 14 shows histograms of the sSFR distribution of composite galaxies along with star-forming and passive galaxies. It is evident that composite galaxies displaying spectral characteristics similar to those of SF galaxies (i.e., starburst-composite) tend to have higher sSFR in comparison to those predicted to have similarities with passive galaxies (passive-composite). This observation signifies that our classifier can effectively discern whether the activity within a composite galaxy is attributed to young stars or old stellar populations, thereby aiding in resolving the degeneracy encountered in optical diagnostics (Kewley et al. 2001; Kauffmann et al. 2003).

thumbnail Fig. 14.

Distribution of the specific star formation rate (sSFR) is examined for composite galaxies predicted to share similar characteristics with star-forming galaxies (green) and passive galaxies (yellow). Additionally, we present the distributions of star-forming (blue) and passive (red) galaxies from our sample for illustration purposes.

In conclusion, this new diagnostic tool can be employed not only for the classification of galaxies, but also for characterizing the underlying activity of mixed-activity galaxies. However, it should be noted that the probabilities derived from the random forest classifier do not correspond to the contribution of each of the corresponding principal activity mechanisms to the observed spectrum; instead, they represent the likelihood of similarity with each of these classes.

5.2. Ambiguous classifications

As a general guideline to decide whether objects with maximum predicted probability (max_pi) < 90% are dominated by two activity mechanisms, we propose selection criteria based on each galaxy belonging in a given class. To establish these, we analyze the differences among their highest (p1), second-highest (p2), and lowest predicted (p3) probabilities. Consequently, for the mixed-activity classes of composite and LINER galaxies, we computed the Δp = p1 − p2 and the Δp′=p2 − p3, indicating the probability difference between the first and second class and the probability difference between the second and third class, respectively. Figure 15 shows Δp plotted against Δp′ for the subsample of composite (top) and LINER (bottom) galaxies. Here, we focus solely on the probability difference as a metric of the discriminating power of this method, rather than the actual classes.

thumbnail Fig. 15.

Δp vs. Δp′. In both plots each dot represents a galaxy of mixed-activity class. The top plot is populated by the composite galaxies, while the bottom plot is populated by the LINER galaxies. In both plots we observe that the lower left corner contains objects with Δp = Δp′ = 0, indicating equal probabilities of belonging to all classes, resulting in unreliable classification. The black dashed line represents the extreme reliability line (Δp = −2 ⋅ Δp′+1). The region between the back line and the yellow solid line (Δp = −2 ⋅ Δp′+0.8) contains 90% of the mixed-class objects for both composites and LINERs, while the region between the black line and the solid red line (Δp = −2 ⋅ Δp′+0.6) encloses 95% of the mixed-class objects for both composites and LINERs.

Plotting Δp against Δp′ reveals that the lower left corner of this plot is populated by objects with similar Δp and Δp′ values. A pure-activity galaxy would be situated in the upper left corner (max_pi → 1, Δp → 1, Δp′→0). On the contrary, mixed-class objects tend to be scattered below this ridge. Additionally, we observe that no object crosses the line Δp = −2 ⋅ Δp′+1. This outcome is a direct consequence of having three classes, as the sum of these three predicted probabilities must equal 1 for each object. This line serves as the upper limit for an object in the Δpp′ space and can be considered as the extreme reliability line. These two-class objects are situated below the upper left corner along the Δp + Δp′ = 1 ridge and up to the middle of the ridge. Objects located below the middle of the ridge or below the line and toward the bottom left corner have increasingly similar probabilities of belonging to all three classes, and hence their classifications are not reliable.

In order to determine the reliability threshold, we identify the area in the Δpp′ diagram (Fig. 15) that includes 90% of the galaxies below the maximum probability line. This process establishes a reliability threshold that encompasses 90% of each population, similar to the reliability thresholds set for the pure classes (see Table 3 in Sect. 4). To define this reliability criterion, we initially use the inequality Δp < m ⋅ Δp′+C, where m and C are constants. We set m = −2, which is the slope of the maximum reliability line (represented by the black dashed lines in Fig. 15), and then shift this line toward the lower left corner until 90% of the objects are enclosed, in order to find the value of C. Based on our findings, the equation that meets these criteria is Δp < −2 ⋅ Δp′+0.8. Therefore, any object adhering to the relationship Δp < −2 ⋅ Δp′+0.8 is designated as having an inconclusive classification. It is worth noting that the value of C can be adjusted based on user preference. An extreme class of inconclusive galaxies are those that have similar probabilities between all three classes.

In Sects. 4.4 and 4.6, it was found that for a small subset of galaxies the classification probabilities are equally split; this means that the probability that an object belongs in an activity class is the same or very similar across all classes, and as a result we cannot characterize the dominant activity mechanism or the two most prevailing ones for these galaxies. As anticipated, this issue primarily arises in the mixed-activity classes, specifically for galaxies spectroscopically identified as composites and LINERs.

5.3. Comparison with other methods

The use of equivalent width as an activity diagnostic has been explored in the past. Cid Fernandes et al. (2010) considered diagnostic diagrams involving the EW of Hα and the [N II]/Hα line ratio (WHAN diagrams) in order to include objects with weak emission lines, which are typically excluded from most diagnostic tools. This led to a three-class scheme consisting of star-forming, AGN, and weak emission-line objects. The last category comprises objects characterized by low S/N (below 3) in the emission lines conventionally employed in the BPT diagram, except for the emission lines of [N II] and Hα. The classification of this particular activity class bears resemblance to the categorization applied to our definition of passive galaxies, as explained in Sect. 2.3. This similarity emerges in the transition of galaxies exhibiting weak emission lines stemming from excitation by old stellar populations to a state of complete passivity, indicative of a dormant system marked solely by absorption lines. In this respect, the definition of passive galaxies presented in Sect. 2.3, which is based on photometric criteria, is more general since it encompasses completely passive galaxies that do not exhibit any Balmer or forbidden lines.

To evaluate the performance of our diagnostic tool against the WHAN diagnostic, we consider all galaxies that meet the eligibility criteria defined for our sample in Sect. 2.4, with the additional condition that the S/N of both [N II] and Hα is above 3, as required by the WHAN diagnostic. Subsequently, we consider the spectroscopic classification (SoDDA classifier) as the ground truth for star-forming and AGN galaxies, and the photometric selection as the ground truth for passive galaxies. Table 9 is segmented into three distinct (and labeled) subtables that serve as the ground truth for the classification. Within each subtable the rows correspond to the classification from our disgnostic, while the columns correspond to objects classified by the WHAN diagnostic.

Table 9.

Comparison of the activity classification between the DONHa and the WHAN diagnostic (Cid Fernandes et al. 2010).

In Table 9 we see that our diagnostic performs better as it manages to identify a larger fraction of the objects within each activity class. Notably, for the AGN class, our diagnostic accurately identifies approximately 99% of true AGN, whereas the WHAN diagnostic correctly identifies only around 76%. Another key advantage of our diagnostic is its capability to include galaxies that exhibit absorption features, and not only those with emission lines. This feature enhances the overall efficacy of our diagnostic and broadens its applicability to diverse datasets.

Another interesting fact from Table 9 is that there are 318 objects that the DONHa diagnostic has correctly identified as star-forming (according to the ground truth based on the SoDDA classifications), which were classified as AGN based on the WHAN diagnostic. After further investigation, we found that these objects have −0.8 < [O III]/Hβ < −0.2 and −0.32 < [N II]/Hα < −0.18, placing them low at the rightmost tip of the star-forming galaxy locus on a BPT diagram (close to the line of composite galaxies defined by Kauffmann et al. 2003). These galaxies have equivalent widths of Hα > 6 Å, which classifies them as AGN in the WHAN diagnostic. However, their [O III]/Hβ ratio is not strong enough to characterize them as AGN in the two-dimensional BPT diagram or the four-dimensional SoDDA diagnostic. In addition, their [N II]/Hα ratios indicate relatively high metallicities. After visual inspection of their optical SDSS images, we see that these galaxies have prominent bulges, an indication of the presence of old stellar populations (Feltzing & Gilmore 2000; Ortolani et al. 2001). In addition, their optical spectra show high Hα and [N II], but low [O III] fluxes, as well as absorption features around Hβ, which is another characteristic of evolved stellar populations.

6. Conclusions

In this study we undertook a classification task centered around the identification of the three primary excitation mechanisms governing galaxy activity: star formation, AGN activity, and photoionization by old stellar populations. Furthermore, we delved into the classification of galaxies exhibiting mixed-activity profiles, specifically composite and LINER galaxies. Our investigation has demonstrated the feasibility of characterizing these mixed-activity galaxies by determining their most probable source of activity among the three principal mechanisms. Furthermore, we introduced a new two-dimensional diagnostic diagram, the DO3, that encompasses all forms of galaxy activity. This diagnostic is highly versatile and applicable to a wide range of datasets, based on its simplicity and narrow spectral range. Below we provide a summary of the results and conclusions derived from this work:

  1. We presented a galaxy activity diagnostic tool utilizing machine-learning methods, relying solely on four spectral features (the equivalent widths of the Hα, [O III] λ5007, [N II] λ6584 lines, and the D4000 index), which allows us to seamlessly integrate excitation by star formation, AGN, and old stellar populations, with excellent reliability and completeness that outperforms the currently available methods.

  2. We introduced a novel two-dimensional activity diagnostic diagram (the DO3 diagram), involving the D4000 break versus log10([EW(O IIIλ5007)]2) line. To the best of our knowledge, this represents the first two-dimensional diagnostic plot that encompasses all activity classes, while providing insights into the primary ionization The requirement for only a brief observed rest-frame spectral range for our DO3 diagnostic greatly extends its applicability to high-redshift galaxies observed in wide area spectroscopic campaigns or even more distant galaxies observed with JWST.

  3. The application of our diagnostic method on the composite galaxies identifies two types of composite galaxies: those powered by AGN alongside star-forming activity, and those powered by star formation as well as older stellar populations. The use of the D4000 break allows us to discriminate between these two types.

  4. For LINER galaxies, our analysis enables us to discriminate between those dominated by AGN activity and those excited by old stellar populations. Our analysis suggests that the ionization mechanism of 72% of all BPT LINERS can be solely attributed to evolved-stellar populations and not AGN activity.

  5. The probabilities generated by our diagnostic can serve as an indicator for characterizing the principal activity mechanism of the host galaxy.

Data availability

The code, along with detailed application instructions, can be found in this GitHub repository2.


Acknowledgments

We sincerely thank the anonymous referee for their valuable comments and suggestions, which greatly enhanced the quality and clarity of this manuscript. We would like to express our sincere gratitude to prof. Samir Salim for generously providing GALEX UV photometry that was essential for this research. CD and EK acknowledge support from the Public Investments Program through a Matching Funds grant to the IA-FORTH. KK is supported by the project “Support of the international collaboration in astronomy (Asu mobility)” with the number: CZ 02.2.69/0.0/0.0/18_053/0016972. Funding for the Sloan Digital Sky Survey IV has been provided by the Alfred P. Sloan Foundation, the U.S. Department of Energy Office of Science, and the Participating Institutions. SDSS-IV acknowledges support and resources from the Center for High Performance Computing at the University of Utah. The SDSS website is http://www.sdss.org. SDSS-IV is managed by the Astrophysical Research Consortium for the Participating Institutions of the SDSS Collaboration including the Brazilian Participation Group, the Carnegie Institution for Science, Carnegie Mellon University, Center for Astrophysics | Harvard & Smithsonian, the Chilean Participation Group, the French Participation Group, Instituto de Astrofísica de Canarias, The Johns Hopkins University, Kavli Institute for the Physics and Mathematics of the Universe (IPMU)/University of Tokyo, the Korean Participation Group, Lawrence Berkeley National Laboratory, Leibniz Institut für Astrophysik Potsdam (AIP), Max-Planck-Institut für Astronomie (MPIA Heidelberg), Max-Planck-Institut für Astrophysik (MPA Garching), Max-Planck-Institut für Extraterrestrische Physik (MPE), National Astronomical Observatories of China, New Mexico State University, New York University, University of Notre Dame, Observatário Nacional/MCTI, The Ohio State University, Pennsylvania State University, Shanghai Astronomical Observatory, United Kingdom Participation Group, Universidad Nacional Autónoma de México, University of Arizona, University of Colorado Boulder, University of Oxford, University of Portsmouth, University of Utah, University of Virginia, University of Washington, University of Wisconsin, Vanderbilt University, and Yale University.

References

  1. Agostino, C. J., Salim, S., Faber, S. M., et al. 2021, ApJ, 922, 156 [NASA ADS] [CrossRef] [Google Scholar]
  2. Baldwin, J. A., Phillips, M. M., & Terlevich, R. 1981, PASP, 93, 5 [Google Scholar]
  3. Balogh, M. L., Morris, S. L., Yee, H. K. C., Carlberg, R. G., & Ellingson, E. 1999, ApJ, 527, 54 [Google Scholar]
  4. Bell, E. F., Wolf, C., Meisenheimer, K., et al. 2004, ApJ, 608, 752 [NASA ADS] [CrossRef] [Google Scholar]
  5. Binette, L., Magris, C. G., Stasińska, G., & Bruzual, A. G. 1994, A&A, 292, 13 [NASA ADS] [Google Scholar]
  6. Brinchmann, J., Charlot, S., White, S. D. M., et al. 2004, MNRAS, 351, 1151 [Google Scholar]
  7. Bruzual, A. G. 1983, ApJ, 273, 105 [CrossRef] [Google Scholar]
  8. Byler, N., Dalcanton, J. J., Conroy, C., & Johnson, B. D. 2017, ApJ, 840, 44 [Google Scholar]
  9. Byler, N., Dalcanton, J. J., Conroy, C., et al. 2019, AJ, 158, 2 [Google Scholar]
  10. Cardelli, J. A., Clayton, G. C., & Mathis, J. S. 1989, in Interstellar Dust, eds. L. J. Allamandola, & A. G. G. M. Tielens, 135, 5 [Google Scholar]
  11. Chilingarian, I. V., & Zolotukhin, I. Y. 2012, MNRAS, 419, 1727 [Google Scholar]
  12. Chilingarian, I. V., Melchior, A.-L., & Zolotukhin, I. Y. 2010, MNRAS, 405, 1409 [Google Scholar]
  13. Cid Fernandes, R., Stasińska, G., Schlickmann, M. S., et al. 2010, MNRAS, 403, 1036 [Google Scholar]
  14. Cortes, C., & Vapnik, V. 1995, Mach. Learn., 20, 273 [Google Scholar]
  15. Daoutis, C., Kyritsis, E., Kouroumpatzakis, K., & Zezas, A. 2023, A&A, 679, A76 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
  16. Domínguez Sánchez, H., Huertas-Company, M., Bernardi, M., Tuccillo, D., & Fischer, J. L. 2018, MNRAS, 476, 3661 [Google Scholar]
  17. Donley, J. L., Koekemoer, A. M., Brusa, M., et al. 2012, ApJ, 748, 142 [Google Scholar]
  18. Feltzing, S., & Gilmore, G. 2000, A&A, 355, 949 [NASA ADS] [Google Scholar]
  19. Ferland, G. J., & Netzer, H. 1983, ApJ, 264, 105 [NASA ADS] [CrossRef] [Google Scholar]
  20. Haines, C. P., Gargiulo, A., & Merluzzi, P. 2008, MNRAS, 385, 1201 [NASA ADS] [CrossRef] [Google Scholar]
  21. Halpern, J. P., & Steiner, J. E. 1983, ApJ, 269, L37 [CrossRef] [Google Scholar]
  22. Heckman, T. M. 1980, A&A, 87, 142 [NASA ADS] [Google Scholar]
  23. Ho, L. C., Filippenko, A. V., & Sargent, W. L. W. 1997a, ApJS, 112, 315 [NASA ADS] [CrossRef] [Google Scholar]
  24. Ho, L. C., Filippenko, A. V., & Sargent, W. L. W. 1997b, ApJ, 487, 568 [NASA ADS] [CrossRef] [Google Scholar]
  25. Ho, L. C., Filippenko, A. V., & Sargent, W. L. W. 2003, ApJ, 583, 159 [NASA ADS] [CrossRef] [Google Scholar]
  26. Kauffmann, G., Heckman, T. M., Tremonti, C., et al. 2003, MNRAS, 346, 1055 [Google Scholar]
  27. Kewley, L. J., Dopita, M. A., Sutherland, R. S., Heisler, C. A., & Trevena, J. 2001, ApJ, 556, 121 [Google Scholar]
  28. Kewley, L. J., Groves, B., Kauffmann, G., & Heckman, T. 2006, MNRAS, 372, 961 [Google Scholar]
  29. Kyritsis, E., Maravelias, G., Zezas, A., et al. 2022, A&A, 657, A62 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
  30. Louppe, G. 2014, arXiv e-prints [arXiv:1407.7502] [Google Scholar]
  31. Martin, D. C., Fanson, J., Schiminovich, D., et al. 2005, ApJ, 619, L1 [Google Scholar]
  32. Mateos, S., Alonso-Herrero, A., Carrera, F. J., et al. 2012, MNRAS, 426, 3271 [Google Scholar]
  33. Mukherjee, S., Feigelson, E. D., Jogesh Babu, G., et al. 1998, ApJ, 508, 314 [NASA ADS] [CrossRef] [Google Scholar]
  34. Ortolani, S., Barbuy, B., Bica, E., et al. 2001, A&A, 376, 878 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
  35. Papaderos, P., Gomes, J. M., Vílchez, J. M., et al. 2013, A&A, 555, L1 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
  36. Pedregosa, F., Varoquaux, G., Gramfort, A., et al. 2011, J. Mach. Learn. Res., 12, 2825 [Google Scholar]
  37. Peek, J. E. G., & Schiminovich, D. 2013, ApJ, 771, 68 [NASA ADS] [CrossRef] [Google Scholar]
  38. Pennock, C. M., van Loon, J. T., Bell, C. P. M., et al. 2021, in Nuclear Activity in Galaxies Across Cosmic Time, eds. M. Pović, P. Marziani, J. Masegosa, et al., 356, 335 [NASA ADS] [Google Scholar]
  39. Poggianti, B. M., & Barbaro, G. 1997, A&A, 325, 1025 [NASA ADS] [Google Scholar]
  40. Salim, S., Lee, J. C., Janowiecki, S., et al. 2016, ApJS, 227, 2 [NASA ADS] [CrossRef] [Google Scholar]
  41. Schawinski, K., Thomas, D., Sarzi, M., et al. 2007, MNRAS, 382, 1415 [Google Scholar]
  42. Schlegel, D. J., Finkbeiner, D. P., & Davis, M. 1998, ApJ, 500, 525 [Google Scholar]
  43. Singh, R., van de Ven, G., Jahnke, K., et al. 2013, A&A, 558, A43 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
  44. Stampoulis, V., van Dyk, D. A., Kashyap, V. L., & Zezas, A. 2019, MNRAS, 485, 1085 [NASA ADS] [CrossRef] [Google Scholar]
  45. Stasińska, G. 1984, A&A, 135, 341 [NASA ADS] [Google Scholar]
  46. Stasińska, G., Vale Asari, N., Cid Fernandes, R., et al. 2008, MNRAS, 391, L29 [NASA ADS] [Google Scholar]
  47. Stern, D., Assef, R. J., Benford, D. J., et al. 2012, ApJ, 753, 30 [Google Scholar]
  48. Tremonti, C. A., Heckman, T. M., Kauffmann, G., et al. 2004, ApJ, 613, 898 [Google Scholar]
  49. Treyer, M., Schiminovich, D., Johnson, B. D., et al. 2010, ApJ, 719, 1191 [NASA ADS] [CrossRef] [Google Scholar]

Appendix A: Algorithm optimization

Hyperparameters are parameters that enable an algorithm to be tailored to the specific needs of a given problem. Not all hyperparameters are always relevant and require optimization; some perform effectively with their default settings, while others can significantly impact the performance of the algorithm even with minor adjustments. Upon investigation, we have determined that the performance of the random forest algorithm for our problem is primarily influenced by the following hyperparameters:

  • n_estimators : The number of trees.

  • max_depth : The maximum depth of a each tree.

  • min_samples_split : The minimum number of samples required for node splitting

  • min_samples_leaf : The minimum number of samples in a leaf node.

  • max_leaf_nodes : The maximum number of final nodes in a tree.

  • max_samples : The number of samples from the training set need to build a tree.

  • criterion : Measure of the quality of a split.

Further investigation confirms that the remaining hyperparameters do not significantly affect the performance, and as such, they are left at their default values as imported with the RandomForestClassifier.

The conventional approach for determining the optimal values for the relevant hyperparametrs involves assessing the performance of the random forest across a multidimensional grid of their values. Given the multitude of hyperparameters and the extensive range of potential values, we employ validation curves to reduce the number of grid points and render the problem more manageable. These curves depict how a selected performance score, in our case accuracy, changes as a function of the various values that a specific hyperparameter can take. To elaborate, our approach to explore the hyperparameter ranges is as follows: for each hyperparameter of interest, we vary it across its feasible range of values while maintaining all other hyperparameters at their default values. Subsequently, we record the corresponding performance score (the balanced accuracy in our case) for the entire range of values explored. We repeat this procedure for each of the other hyperparameters. In Figure A.1, we present the validation curves for all significant hyperparameters that influence the performance of the classifier.

thumbnail Fig. A.1.

Accuracy scores (CV scores) vs all important random forest hyperparameters. CV scores have been computed through cross-validation, varying a single random forest hyperparameter while keeping all others constant. The red dashed line represents the average of these scores, and the gray shaded area indicates their standard deviation (1σ). By analyzing these curves, we can determine the range for each hyperparameter, which is crucial for reducing the number of hyperparameter combinations during the grid-search optimization procedure.

To evaluate the performance of the algorithm and select optimal values for each hyperparameter, we use k-fold cross-validation method. In this process, the training data are divided into k subsets, with one used for testing while the rest train the algorithm. The algorithm is trained K times, each time using a different subset as the test set. The reported scores are averages across these cycles, providing not only a performance assessment but also an estimation of score uncertainty. The algorithm utilized for the optimization task is GridSearchCV, from the scikit-learn library. After analyzing the validation curves, we determined the optimal ranges and employed them as inputs for the GridSearchCV algorithm to construct a multidimensional search grid. The ranges and the optimal hyperparameter values are presented in Table A.1.

Table A.1.

Hyperparameter search ranges and optimal values.

Appendix B: Separating boundaries between activity classes in the two-dimensional activity diagram

Here we present the selection criteria defined for the DO3 activity diagnostic diagram as described in Sect. 4.6. These equations indicate the boundaries that delineate the activity classes defined on Fig. 12. In the following equations we set y≡ D4000 and x≡ 2 log10(|EW(O IIIλ5007)|).

Pure star-forming:

y < 0.003 x 4 + 0.034 x 3 0.132 x 2 + 0.090 x 2 + 1.411 $$ \begin{aligned} y < -0.003~x^{4} + 0.034~x^{3} - 0.132~x^{2} + 0.090~x^{2}+ 1.411 \end{aligned} $$(B.1)

Pure AGN:

y < 1.476 x 0.006 $$ \begin{aligned} y < 1.476~x - 0.006 \end{aligned} $$(B.2)

and

y > 0.597 x 0.081 + 1.068 $$ \begin{aligned} y > \frac{0.597}{x-0.081} + 1.068 \end{aligned} $$(B.3)

Pure passive:

x < 0.90 $$ \begin{aligned} x < 0.90 \end{aligned} $$(B.4)

and

y > 0.119 x + 1.511 $$ \begin{aligned} y > 0.119~x + 1.511 \end{aligned} $$(B.5)

Starburst-passive:

x < 0.90 $$ \begin{aligned} x < 0.90 \end{aligned} $$(B.6)

and

y > 0.003 x 4 + 0.034 x 3 0.132 x 2 + 0.090 x + 1.411 $$ \begin{aligned} y > -0.003~x^{4} + 0.034~x^{3} - 0.132~x^{2} + 0.090~x+ 1.411 \end{aligned} $$(B.7)

and

y < 0.119 x + 1.511 $$ \begin{aligned} y < 0.119~x + 1.511 \end{aligned} $$(B.8)

Starburst-AGN:

y > 0.003 x 4 + 0.034 x 3 0.132 x 2 + 0.090 x + 1.411 $$ \begin{aligned} y > -0.003~x^{4} + 0.034~x^{3} - 0.132~x^{2} + 0.090~x + 1.411 \end{aligned} $$(B.9)

and

y < 1.476 x 0.006 $$ \begin{aligned} y < 1.476~x - 0.006 \end{aligned} $$(B.10)

and

y < 0.597 x 0.081 + 1.068 $$ \begin{aligned} y < \frac{0.597}{x-0.081} + 1.068 \end{aligned} $$(B.11)

AGN-passive:

y > 0.003 x 4 + 0.034 x 3 0.132 x 2 + 0.090 x + 1.411 $$ \begin{aligned} y > -0.003~x^{4} + 0.034~x^{3} - 0.132~x^{2} + 0.090~x + 1.411 \end{aligned} $$(B.12)

and

y > 1.476 x 0.006 $$ \begin{aligned} y > 1.476~x - 0.006 \end{aligned} $$(B.13)

and

x > 0.90 $$ \begin{aligned} x > 0.90 \end{aligned} $$(B.14)

All Tables

Table 1.

Composition of the final sample per galaxy class.

Table 2.

Report of performance scores calculated on the test sample for each galaxy class using three different metrics.

Table 3.

Definitions and selection criteria for all activity classes based on the refined classification scheme (DONHa diagnostic).

Table 4.

Class predictions following the implementation of the new diagnostic on the composite galaxy sample.

Table 5.

Class predictions following the implementation of the new diagnostic on the LINER galaxy sample.

Table 6.

Class predictions following the application of the DONHa diagnostic to the composite and LINER galaxies.

Table 7.

Comparison of the activity classification between the DO3 and the spectroscopic classification (SoDDA).

Table 8.

Comparison of the activity classification between the DO3 and the DONHa diagnostic tool defined in Sect. 3.

Table 9.

Comparison of the activity classification between the DONHa and the WHAN diagnostic (Cid Fernandes et al. 2010).

Table A.1.

Hyperparameter search ranges and optimal values.

All Figures

thumbnail Fig. 1.

CMD diagram of NUV − r vs. Mr. The gray dots and black contours represent the entire galaxy sample from which we selected our sample of passive galaxies. The sample of galaxies is from the SDSS. The black solid line is the best-fit color-magnitude relation defined by Haines et al. (2008), and the black dashed line is parallel to the former line, but displaced by a distance of 1σ.

In the text
thumbnail Fig. 2.

Projections of the training sample for the three principal galaxy activity classes. The left plot shows the standard BPT diagram, the middle plot displays log10([O III]/Hβ) vs. log10([S II]/Hα), and the right plot illustrates log10([O III]/Hβ) vs. log10([O I]/Hα). The blue dots represent SF galaxies, the green dots represent AGN galaxies, and the red dots represent passive galaxies. We note that in all three plots, only a subset of the passive galaxies is presented due to the typically poor quality of their spectra. For these plots, we selected galaxies with emission lines having S/N > 3. In all three plots, the black dashed line corresponds to Kauffmann et al. (2003), the black solid curved line represents Kewley et al. (2001), and the straight black line is Schawinski et al. (2007), which separates LINERs from AGN.

In the text
thumbnail Fig. 3.

Distributions of six potential features for the three principal activity classes: star-forming (SF), AGN, and passive galaxies. In the top left, we observe the EW of Hα; in the top right, the EW of [N II] λ6584; in the middle left, the EW of [O III] λ5007; in the middle right, D4000; in the bottom left, the EW of [S II] λλ6717,6731; and in the bottom right, the EW of [O I] λ6300. These features represent the equivalent widths of the corresponding emission lines commonly utilized in galactic activity classification models.

In the text
thumbnail Fig. 4.

Confusion matrix calculated on the test subset of the final sample. The color and the number in each box refer to the percentage of objects calculated relative to the total number of true instances for each class separately. The labels on the x- and y-axes indicate respectively the predicted and true class of a galaxy. We note that this matrix is almost diagonal, indicating a high-confidence classification.

In the text
thumbnail Fig. 5.

Plot of the feature importance calculated during the training process of the algorithm. The error bars are indicative of the standard deviation, providing insight into the stability of the feature importance estimates. The feature with the highest importance score in our diagnostic is EW(Hα), while all other features exhibit similar levels of relevance.

In the text
thumbnail Fig. 6.

Two BPT plots of log10([O III] λ5007/Hβ) vs. log10([N II] λ6584/Hα) displaying the outcome of the activity decomposition on spectroscopically selected sample of composite (left) and LINER (right) galaxies. Following the application of our diagnostic to these samples, we employed a color-coding scheme based on the highest likelihood of similarity to one of the principal activity classes. On the left, the blue contours represent composites predicted as SF (SF-composite), green as AGN (AGN-composite), and red as passive (passive-composite). This plot reveals the overlap of some SF and passive predicted composites. On the right, the green contours represent LINERs predicted to be AGN and red predicted to be passive. This plot shows that there is some overlap between the LINERs predicted as AGN and as passive. The black boundary lines are the same as defined in Fig. 2.

In the text
thumbnail Fig. 7.

Most probable (highest similarity) class among the three principal activity classes (top) and predictions based on the DONHa classification scheme (bottom) for the subsample of composite galaxies. The bars in both histograms are color-coded based on the principal class that each galaxy is predicted to resemble most (see Sect. 5.2 for for inconclusive classifications). Labels: SF (star-forming), SB (starburst), AGN (active galactic nucleus), Pas (passive).

In the text
thumbnail Fig. 8.

[O III]/Hβ vs. [N II]/Hα flux ratio, illustrating the positions of each refined activity class as predicted by our diagnostic tool after applying it on a sample of spectroscopically selected composite galaxies. We utilized the DONHa classification scheme outlined in Table 3. The left, middle, and right panel indicates spectra excited by star formation (SB), AGN activity, and old stellar populations (Pas), respectively. The contours denote galaxies predicted to exclusively belong to one of the primary activity classes (max_pi > 90%), while the data points represent galaxies with mixed activities (max_pi < 90%). In each plot the symbols represent the class with the highest predicted probability, while the colors the class with second-highest probability. The black boundary lines are the same as defined in Fig. 2. The black dots represent the training sample for demonstration purposes.

In the text
thumbnail Fig. 9.

Most probable (highest similarity) class among the three principal activity classes (top) and predictions based on the DONHa classification scheme (bottom) for the subsample of LINER galaxies. The bars in both histograms are color-coded based on the principal class that each galaxy is predicted to resemble most (for the inconclusive classification, see Sect. 5.2). Labels: SF (star-forming), SB (starburst), AGN (active galactic nucleus), Pas (passive).

In the text
thumbnail Fig. 10.

[O III]/Hβ vs. [N II]/Hα, illustrating the positions of each refined activity class as predicted by our diagnostic tool (DONHa) after applying it on a sample of spectroscopically selected LINER galaxies. We utilized the DONHa classification scheme outlined in Table 3. The left, middle, and right panels indicates spectra excited by star formation (SB), AGN activity, and old stellar populations (Pas), respectively. The contours, symbols, and color-coding are as in Fig. 8. The black boundary lines are the same as defined in Fig. 2. The black dots represent the training sample for demonstration purposes.

In the text
thumbnail Fig. 11.

Log10([O III]/Hβ) vs. log10([S II]/Hα) (top row) and log10([O III]/Hβ) vs. log10([O I]/Hα) (bottom row) are presented for the spectroscopically selected subsample of LINER galaxies. The left, middle, and right panels highlights LINER-like spectra that are found to be dominated by star formation (SB), AGN activity, and old stellar populations (Pas), respectively. The contours, symbols, and color-coding are as in Fig. 8. In all plots the black boundary lines are the same as defined in Fig. 2. The black dots denote the training sample for demonstration purposes.

In the text
thumbnail Fig. 12.

Two-dimensional diagnostic diagram displaying the D4000 continuum break vs. log10(EW([O III])2), the DO3 diagnostic diagram. Within this plot we observe the spatial distribution of galaxies predominantly influenced by a single principal excitation mechanism (pure-activity classes), as well as the positions of those with mixed-activity classes. The black lines are the boundaries separating the different classes (see Sect. 4.6; for the equation of each boundary see Appendix B). The contours representing each pure-activity class are computed using a kernel density function and illustrate the 68% (inner contour) and 90% (outer contour) population density levels. Labels: pure SB (pure starburst), pure AGN (pure active galactic nucleus), pure Pas (pure passive).

In the text
thumbnail Fig. 13.

Log10([O III]/Hβ) vs. log10([N II]/Hα). Within this plot we can see the distribution of the spectroscopically selected subsample of composite galaxies that have been classified by the new diagnostic as passive-composite galaxies, i.e., composites where the ionization source originates from hot evolved stars. The circular data points are derived from the photoionization models from the work of Byler et al. (2019), and they are overlaid to indicate the locations of galaxies with aging stellar populations spanning from 2 to 14 Gyr. The circles are color-coded to denote the age of the stellar populations. The black boundary lines are the same as defined in Fig. 2. The black dots in the top plot represent the training sample of the principal classes, included for illustrative purposes.

In the text
thumbnail Fig. 14.

Distribution of the specific star formation rate (sSFR) is examined for composite galaxies predicted to share similar characteristics with star-forming galaxies (green) and passive galaxies (yellow). Additionally, we present the distributions of star-forming (blue) and passive (red) galaxies from our sample for illustration purposes.

In the text
thumbnail Fig. 15.

Δp vs. Δp′. In both plots each dot represents a galaxy of mixed-activity class. The top plot is populated by the composite galaxies, while the bottom plot is populated by the LINER galaxies. In both plots we observe that the lower left corner contains objects with Δp = Δp′ = 0, indicating equal probabilities of belonging to all classes, resulting in unreliable classification. The black dashed line represents the extreme reliability line (Δp = −2 ⋅ Δp′+1). The region between the back line and the yellow solid line (Δp = −2 ⋅ Δp′+0.8) contains 90% of the mixed-class objects for both composites and LINERs, while the region between the black line and the solid red line (Δp = −2 ⋅ Δp′+0.6) encloses 95% of the mixed-class objects for both composites and LINERs.

In the text
thumbnail Fig. A.1.

Accuracy scores (CV scores) vs all important random forest hyperparameters. CV scores have been computed through cross-validation, varying a single random forest hyperparameter while keeping all others constant. The red dashed line represents the average of these scores, and the gray shaded area indicates their standard deviation (1σ). By analyzing these curves, we can determine the range for each hyperparameter, which is crucial for reducing the number of hyperparameter combinations during the grid-search optimization procedure.

In the text

Current usage metrics show cumulative count of Article Views (full-text article views including HTML views, PDF and ePub downloads, according to the available data) and Abstracts Views on Vision4Press platform.

Data correspond to usage on the plateform after 2015. The current usage metrics is available 48-96 hours after online publication and is updated daily on week days.

Initial download of the metrics may take a while.