Supernova search with active learning in ZTF DR3

We provide the first results from the complete SNAD adaptive learning pipeline in the context of a broad scope of data from large-scale astronomical surveys. The main goal of this work is to explore the potential of adaptive learning techniques in application to big data sets. Our SNAD team used Active Anomaly Discovery (AAD) as a tool to search for new supernova (SN) candidates in the photometric data from the first 9.4 months of the Zwicky Transient Facility (ZTF) survey, namely, between March 17 and December 31 2018 (58194<MJD<58483). We analysed 70 ZTF fields at high galactic latitude and visually inspected 2100 outliers. This resulted in 104 SN-like objects being found, 57 of which were reported to the Transient Name Server for the first time and 47 of which had previously been mentioned in other catalogues, either as SNe with known types or as SN candidates. We visually inspected the multi-colour light curves of the non-catalogued transients and fitted them with different supernova models to assign each to a probable photometric class: Ia, Ib/c, IIP, IIL, or IIn. Moreover, we also identified unreported slow-evolving transients that are good superluminous SN candidates, along with a few other non-catalogued objects, such as red dwarf flares and active galactic nuclei. Beyond confirming the effectiveness of human-machine integration underlying the AAD strategy, our results shed light on potential leaks in currently available pipelines. These findings can help avoid similar losses in future large-scale astronomical surveys. Furthermore, the algorithm enables direct searches in any type of data, based on any definition of an anomaly set by the expert.


Introduction
The advent of modern astronomical surveys, initiated by the Sloan Digital Sky Survey (SDSS, Blanton et al. 2017) and further propelled by the Zwicky Transient Facility (ZTF, Bellm et al. 2019), has popularised the use of automated machine learning methods (Baron 2019). This shift towards a data-driven approach to astronomical research has been developing swiftly for supervised learning tasks in the areas of classification (see e.g. Carleo et al. 2019; Ishida 2019; Malik et al. 2022, and references therein) and regression (e.g. Krone-Martins et al. 2014; Pasquet et al. 2019; Cabayol et al. 2021; Henghes et al. 2021; Chen et al. 2022).
Nevertheless, thanks to the availability of continuous scans of the sky with instruments that are capable of achieving unprecedented resolution, it is natural to expect that new and interesting astrophysical sources will continue to be detected. The challenge then becomes developing automated unsupervised learning strategies that can successfully identify such sources among large and complex data sets. The astronomical community has devoted significant efforts to this direction. For example, Pruzhinskaya et al. (2019) applied the isolation forest (IF, Liu et al. 2008) algorithm to identify contaminants in the Open Supernova Catalog (Guillochon et al. 2017). Malanchev et al. (2021a) used four different anomaly detection (AD) algorithms and a comprehensive feature extraction process to identify unusual light curves in the third ZTF data release (DR). In searching for changing-state active galactic nuclei (AGNs), Sánchez-Sáez et al. (2021) identified 75 promising candidates by combining dimensionality reduction via deep learning with IF. Storey-Fisher et al. (2021) applied a Wasserstein generative adversarial network on nearly one million optical galaxy images in the Hyper Suprime-Cam survey. Martínez-Galarza et al. (2021) combined tree-based AD and manifold learning to identify sets of unusual light curves in Kepler data. Chan et al. (2022) applied a similar strategy to identify anomalous periodic variables in ZTF data. Sarkar et al. (2022) used the Earth as an anomaly example in order to estimate the habitability of exoplanets using a multi-stage memetic algorithm. Kovačević et al. (2022) used self-organising maps to analyse temporal-only parameters computed from ≳10^5 sources from the Exploring the X-ray Transient and variable Sky catalogue. Aleo et al. (2022b) used simulated light curves to search for counterparts in ZTF DR4, identifying 11 non-catalogued transients.
Despite such promising results, all AD studies need to deal with the discrepancy between the statistical definition of an outlier (which directly affects the output from traditional machine learning models) and astrophysically interesting anomalies (unforeseen or yet to be confirmed events generated by unusual astrophysical phenomena). In large data sets, outliers tend to dominate the set of objects with high anomaly scores (Malanchev et al. 2021a). Adaptive learning techniques are aimed at sequentially incorporating expert knowledge in machine learning models (see e.g. Ishida et al. 2021; Lochner & Bassett 2021). The SNAD team has been consistently improving and testing such an adaptive learning strategy, whereby at each iteration, a binary reply from the expert is incorporated into the weight calculation of an IF model, producing updated anomaly scores. The active anomaly discovery (AAD, Das et al. 2017) algorithm has proven to be effective in its first application to real data (Ishida et al. 2021). In this work, we stress test the effectiveness of this strategy by applying it to light curves from ZTF DR3. Considering as anomalies any light curves that resemble those of supernovae (SNe), our experts scanned 70 ZTF fields searching for uncatalogued or anomalous transients.
This paper is organised as follows. Section 2 describes the data selection process (Sect. 2.1), learning algorithm (Sect. 2.2), and a summary of the results (Sect. 2.3). In Sect. 3, we present the results of our light-curve modelling for a subset of the newly reported transients. Section 4 presents an in-depth discussion on superluminous supernova (SLSN) candidates (Sect. 4.1), along with a complete set of labels within the SNAD viewer knowledge database (Sect. 4.2) and a description of other non-catalogued objects found during our search (Sect. 4.3). We present our conclusions in Sect. 5. Additionally, the complete SNAD catalogue of discovered transients is shown in Appendix A. Appendix B shows light curves and corresponding fit models for SNAD objects. Appendix C gives a glimpse of the domain knowledge database within the SNAD viewer.

ZTF data and field selection
We analysed photometric data from the first 9.4 months of the ZTF survey, between 2018 March 17 and December 31 (58194 ≤ MJD ≤ 58483). This period includes data from the ZTF private survey, thus offering a better cadence than the rest of DR3. However, the expert analysis of discovered SNe (see Sect. 3) used more complete light curves from ZTF DR8.
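As a quick check of the date range quoted above, the conversion from calendar dates to Modified Julian Date can be sketched with the Python standard library. The helper name `to_mjd` is purely illustrative and does not belong to any SNAD or ZTF tool; the sketch assumes UTC dates and the MJD zero point of 1858 November 17.

```python
from datetime import date

# MJD 0 corresponds to 1858-11-17 00:00 UTC.
MJD_EPOCH = date(1858, 11, 17)


def to_mjd(day: date) -> int:
    """Return the integer MJD at 00:00 UTC of the given calendar date."""
    return (day - MJD_EPOCH).days


# Boundaries of the observing period considered in this work.
mjd_start = to_mjd(date(2018, 3, 17))
mjd_end = to_mjd(date(2018, 12, 31))
span_days = mjd_end - mjd_start

print(mjd_start, mjd_end, span_days)
```

The two boundaries reproduce the MJD limits given in the text.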
Given the higher probability of finding SNe in low extinction regions, we analysed only those fields with centres at > 20° above the galactic plane. The distribution of the 70 fields considered in this work is given in Fig. 1.
Each one of the selected fields contains from a few thousand to a little more than a million objects with at least 100 photometric points in the zr-band (catflags = 0), thus comprising ∼26.5 million light curves in total. Each object is characterised by a ZTF Object ID (OID). This identifier is unique only within each field and each band; therefore, the same source observed in different fields and in different bands can have several OIDs.
For each OID, we extracted 42 zr-band light curve features including magnitude amplitude, Stetson K coefficient (Stetson 1996), standard deviation of the Lomb-Scargle periodogram (Lomb 1976; Scargle 1982), and others. A full description of all features used is given in Malanchev et al. (2021a,b).
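To make two of the named features concrete, the sketch below computes a magnitude amplitude and the Stetson K coefficient for a single light curve. It is an illustration only, not the feature extraction code used in this work: the function names are hypothetical, the amplitude is assumed to be half the peak-to-peak range, and the Stetson K uses the standard definition with residuals scaled by the photometric errors around the weighted mean.

```python
import math
from typing import Sequence


def half_amplitude(mags: Sequence[float]) -> float:
    """Half of the peak-to-peak magnitude range (assumed amplitude definition)."""
    return (max(mags) - min(mags)) / 2.0


def stetson_k(mags: Sequence[float], errs: Sequence[float]) -> float:
    """Stetson K: sum(|delta|) / sqrt(N * sum(delta^2)).

    delta_i = sqrt(N / (N - 1)) * (m_i - weighted_mean) / sigma_i
    """
    n = len(mags)
    weights = [1.0 / e ** 2 for e in errs]
    mean = sum(w * m for w, m in zip(weights, mags)) / sum(weights)
    scale = math.sqrt(n / (n - 1))
    deltas = [scale * (m - mean) / e for m, e in zip(mags, errs)]
    return sum(abs(d) for d in deltas) / math.sqrt(n * sum(d * d for d in deltas))


# Toy light curve alternating between two magnitudes with equal errors.
toy_mags = [10.0, 12.0, 10.0, 12.0]
toy_errs = [1.0, 1.0, 1.0, 1.0]
toy_amplitude = half_amplitude(toy_mags)
toy_k = stetson_k(toy_mags, toy_errs)
```

For Gaussian-distributed residuals K is close to 0.8, whereas a curve that alternates between two levels, as in the toy example, gives the maximum value K = 1.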

Active anomaly discovery
Recommendation systems are automatic algorithms whose goal is to minimise the cost of labelling tasks and, at the same time, to optimise classification or anomaly detection results. In this work, we use the AAD algorithm proposed by Das et al. (2017). It starts with a traditional IF and sequentially presents the object with the highest anomaly score to the expert. If the expert judges a particular outlier not to be interesting, the weights of the decision paths are changed to accommodate this new information and the data are passed through the slightly modified forest. The process is repeated until a certain budget has been reached. This framework was first applied to a simulated as well as a small real data set by Ishida et al. (2021).
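The query-and-update cycle described above can be illustrated with a deliberately simplified toy loop. This is not the AAD algorithm of Das et al. (2017), which re-weights the leaves of an isolation forest; here a linear score over made-up feature vectors and a perceptron-like weight update stand in for it. All names (`active_loop`, `toy_features`, `toy_expert`) and the learning rate are assumptions made for the example.

```python
from typing import Callable, List, Sequence, Tuple


def active_loop(
    features: Sequence[Sequence[float]],
    is_interesting: Callable[[int], bool],
    budget: int,
    learning_rate: float = 0.5,
) -> Tuple[List[int], List[bool]]:
    """Show the top-scored unseen object to the expert, then update the weights."""
    weights = [1.0] * len(features[0])
    seen = set()
    shown: List[int] = []
    answers: List[bool] = []
    for _ in range(min(budget, len(features))):
        # Score every object with the current weights.
        scores = [sum(w * x for w, x in zip(weights, f)) for f in features]
        candidates = [i for i in range(len(features)) if i not in seen]
        top = max(candidates, key=lambda i: scores[i])
        answer = is_interesting(top)  # binary expert feedback ("yes" or "no")
        seen.add(top)
        shown.append(top)
        answers.append(answer)
        # Reinforce directions of "yes" objects, penalise those of "no" objects.
        sign = 1.0 if answer else -1.0
        weights = [w + sign * learning_rate * x for w, x in zip(weights, features[top])]
    return shown, answers


# Toy data: column 0 mimics "artefact-like" and column 1 "SN-like" outlierness.
toy_features = [(3.0, 0.0), (2.5, 0.0), (0.0, 2.0), (0.0, 1.8), (0.1, 0.1)]
sn_like = {2, 3}


def toy_expert(index: int) -> bool:
    return index in sn_like


shown, answers = active_loop(toy_features, toy_expert, budget=3)
```

With static scores the first three queries would be objects 0, 1, and 2; after the first "no", the toy loop instead moves to the SN-like objects 2 and 3, which is the qualitative behaviour the adaptive strategy aims for.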
Here, we present the first application of AAD to a significantly larger data set of real observations (∼26.5 million light curves). Since the algorithm can adapt to the expert's opinion, it can be used for a targeted search of transients of a certain type (e.g. SNe). Therefore, in this analysis, a human expert considered only SN-like candidates as anomalies; all other objects proposed by the algorithm were rejected by the expert as 'uninteresting' (i.e. 'yes' and 'no' in the AAD interface). For each field, the expert went through a total budget of 30 objects.
In order to enable a smooth interaction between our experts and the AAD algorithm when dealing with such a large data set, we developed the SNAD knowledge database (Malanchev et al. 2023), a framework used by our experts to log their input as one entry in a tailored set of labels (see further details in Sect. 4.2). For each one of the ZTF fields, our experts went through 30 objects, registering their feedback as a binary answer. The distribution of objects by type for each of the 35 fields containing SNe or SN candidates is given in Fig. 2. Each line represents one AAD run with 30 queries in the order of appearance to the expert. The colour denotes the assigned tag, namely, whether it is a supernova, artefact, or other type of object.
In what follows, we further investigate the most interesting objects we encountered. The source code is publicly available as part of the zwad (Malanchev et al. 2021b) GitHub repository.

Results
We visually inspected 2100 (70 × 30) outliers. Among them, we found 104 SN-like objects, 57 of which were reported for the first time and 47 of which were previously mentioned in other catalogues, either as SNe of known types or as SN candidates (see Sect. 4.2 for other types of objects found). Sources which were not previously mentioned in the Transient Name Server (TNS) received an internal name, were reported to TNS, and were added to the SNAD catalogue.
Figure 1 shows the distribution of inspected fields on the sky in equatorial coordinates, along with the corresponding number of objects. There are 35 fields with detected SN candidates that are outlined in black. Naively, we would expect that fields with supernovae should be concentrated in the regions further away from the galactic plane and galactic centre. However, we observe that they are located at intermediate galactic longitudes and latitudes. This can be explained by the smaller number of observations in regions farther from the galactic plane. Moreover, the number of objects in different fields varies from a few thousand to more than a million. We did not detect any SN in fields with more than a million objects (only three fields), which are also very close to the Milky Way centre. This may indicate that the budget of 30 objects was not enough for the AAD to adapt and that, ideally, the budget should be scaled according to the number of objects in the field.
Among the previously reported supernova candidates, there are 14 SNe Ia, 13 possible SNe, 7 SNe II, 3 SNe Ic, 2 SNe IIP, and 1 SN Ib; the remaining 7 catalogued SNe belong to the rare supernova classes considered as anomalies in Pruzhinskaya et al. (2019) and Ishida et al. (2021), namely, 2 SNe IIb, 1 SN Ia Pec, 1 SN Ia-91bg, 1 SN Ic BL, 1 SN IIn, and 1 SLSN-I. To compare the efficiency of the AAD algorithm in searching for rarer and therefore potentially more interesting objects, we recorded the number of spectroscopically confirmed SNe found in this work and the number discovered by different groups in the ZTF data, according to TNS, for the same period of time (58194 ≤ MJD ≤ 58483), as shown in Table 1. The fraction of rare SN types among the total number is ∼21% for AAD discoveries and ∼10% for general TNS findings.
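The counts listed above can be cross-checked with a few lines of arithmetic. The sketch assumes that the quoted ∼21% is the ratio of rare-type SNe to all spectroscopically typed SNe among the 47 catalogued objects, that is, excluding the 13 possible SNe; this interpretation is ours and is not stated explicitly in the text.

```python
# Catalogued objects recovered by AAD, taken from the counts in the text.
common_types = {"Ia": 14, "II": 7, "Ic": 3, "IIP": 2, "Ib": 1}
rare_types = {"IIb": 2, "Ia Pec": 1, "Ia-91bg": 1, "Ic BL": 1, "IIn": 1, "SLSN-I": 1}
possible_sne = 13  # candidates without a spectroscopic type

n_common = sum(common_types.values())
n_rare = sum(rare_types.values())
n_confirmed = n_common + n_rare          # spectroscopically typed SNe
n_catalogued = n_confirmed + possible_sne

# Assumed definition of the rare fraction: rare / spectroscopically typed.
rare_fraction = n_rare / n_confirmed

print(f"{n_catalogued} catalogued, {n_confirmed} typed, "
      f"rare fraction = {100 * rare_fraction:.1f}%")
```

Under this assumption the counts add up to the 47 previously catalogued objects and the rare fraction rounds to the quoted value.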
Non-catalogued SN-like objects are listed in the beginning of Table A.1. We note that 15 SNAD possible supernovae (PSNe) are missing in the official ZTF alert stream (Table A.1, Col. 6). Missed transients have peak zr magnitude ∼19.5-20 mag, which is indeed quite faint, but still compatible with those of some other SNAD transients detected by the alert system. Furthermore, some of our candidates (e.g. SNAD128, SNAD165) have well-sampled early light curves, which is of interest for surveys such as the Young Supernova Experiment (Jones et al. 2021).

Supernova modelling
We used the PYTHON library SNCOSMO to obtain a preliminary photometric classification for SNAD objects. Their light curves were fitted with Peter Nugent's supernova models, which cover the main SN types (Ia, Ib/c, IIP, IIL, IIn). Nugent's models are simple spectral time series that can be scaled up and down. The model parameters are the redshift, z, the observer-frame time corresponding to the source's zero phase, t_0, and the amplitude. The zero phase is defined relative to the explosion moment and the observed time, t, is related to phase via t = t_0 + phase × (1 + z).
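The relation between observed time and phase can be written as a pair of small helper functions. This is a stand-alone sketch of the formula t = t_0 + phase × (1 + z) only; the function names are illustrative and are not SNCOSMO calls.

```python
def observed_time(phase: float, t0: float, z: float) -> float:
    """Observer-frame time (days) for a given rest-frame phase (days)."""
    return t0 + phase * (1.0 + z)


def rest_phase(t: float, t0: float, z: float) -> float:
    """Inverse relation: rest-frame phase (days) for an observed time (days)."""
    return (t - t0) / (1.0 + z)


# Example with made-up values: phase of 10 days, t0 = MJD 58200, z = 0.17.
t_obs = observed_time(10.0, 58200.0, 0.17)
phase_back = rest_phase(t_obs, 58200.0, 0.17)
```

The factor (1 + z) accounts for cosmological time dilation: a model phase of 10 days at z = 0.17 spans 11.7 days in the observer frame.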
In order to perform a preliminary fit, we used only the zr-band from DR8. We subtracted the reference magnitude from ZTF light curves, thus roughly accounting for the host galaxy contamination. The reference magnitude was retrieved from ZTF archival data and listed in the SNAD catalogue. We also corrected for the line-of-sight reddening in the Milky Way using Schlafly & Finkbeiner (2011) estimates. For sources with an SDSS DR16 (Ahumada et al. 2020) photometric redshift of a host galaxy at the source position, we fixed the redshift to this value. If this was not available, we adopted [−15; −22] as an acceptable range for the supernova absolute magnitude (Richardson et al. 2014) and then, using the maximum apparent magnitude, roughly transformed it to the corresponding redshift range. We applied a χ² criterion to choose the best-fit model for each SNAD object. Results of the light curve fit are given in Appendix B; the best-fit model for each SNAD transient is listed in Col. 5 of Table A.1.
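One rough way to turn an absolute-magnitude range into a redshift range is sketched below. The text does not specify how the transformation was done, so every choice here is an assumption: the distance modulus gives a luminosity distance, which is converted to redshift with the low-redshift Hubble law z ≈ H0 d / c, using H0 = 70 km/s/Mpc and ignoring K-corrections and extinction. The linear approximation becomes poor above z ∼ 0.1, so the upper bound is indicative only.

```python
C_KM_S = 299792.458   # speed of light in km/s
H0_KM_S_MPC = 70.0    # assumed Hubble constant


def rough_redshift(m_peak: float, abs_mag: float) -> float:
    """Low-z estimate of the redshift from apparent and absolute magnitude."""
    mu = m_peak - abs_mag                     # distance modulus
    d_mpc = 10.0 ** ((mu + 5.0) / 5.0) / 1e6  # luminosity distance in Mpc
    return H0_KM_S_MPC * d_mpc / C_KM_S       # Hubble law, valid for small z


def rough_redshift_range(m_peak: float, faint: float = -15.0, bright: float = -22.0):
    """Redshift bounds for the adopted absolute-magnitude range."""
    return rough_redshift(m_peak, faint), rough_redshift(m_peak, bright)


# Example for a transient peaking at 19.5 mag (illustrative value).
z_low, z_high = rough_redshift_range(19.5)
print(f"{z_low:.3f} < z < {z_high:.3f}")
```

A fit could then let z vary only within these bounds; a full treatment would replace the linear Hubble law with a cosmological distance calculation.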
It should be noted that we did not intend to make a detailed fit, but, rather, to show that the candidate light curves, selected initially by eye, can be satisfactorily fitted by different supernova models. That is why only one band (zr) has been used in the fit. Also, we did not take into account the possible extinction in host galaxies of the candidates; therefore, our fit is less accurate for highly reddened objects. Moreover, the redshift we assigned to some host galaxies is photometric, which is another source of uncertainty. Finally, the model itself is rather simple and limited in wavelength and time range. As a result of these conscious simplifications and assumptions, the obtained absolute magnitude for some of the objects is not typical for normal SNe (e.g. SNAD122, M_r(IIP) ≃ −22.6 mag) and we cannot trust the classification in those cases. However, this simple fit is enough to show that a few transients have anomalously wide light curves when compared to normal SNe, making them candidates for the SLSN class (Sect. 4.1).
Although this classification should be treated with caution, it follows closely the behaviour of light curves with a sufficient number of observations before and after maximum light. Using the SNCOSMO library, we also performed a multi-band light-curve fit for a few objects with the models suggested by the preliminary classification. The parameters of the fit are z, t_0, and the amplitude. The fits of SNAD112, SNAD142, SNAD165, and SNAD137 with Nugent's Type Ia, IIP, Ibc, and IIn models are shown in Figs. 3-6, respectively. The quality of the fits supports assigning these supernovae to the suggested types.

Superluminous supernovae candidates
Four supernova candidates from our list possess significantly broader light curves in comparison with Nugent's models and other candidates: SNAD120, SNAD121, SNAD160, and SNAD187 (see Appendix B). In this section we explore the possibility of these objects belonging to the SLSN class. SNAD120 (AT2018lxa) is located at α = 17h 00m 16.296s, δ = +70° 30′ 49.55″. In the official ZTF alert stream, it is denoted as ZTF18aazydub. According to Strotjohann et al. (2021), the transient has a spectroscopic redshift of z_sp = 0.202 and was classified as SN IIn. Assuming this redshift, the estimated absolute magnitude at maximum brightness is M_r ≃ −20.5 mag, which is slightly dimmer than the threshold of −21 mag established for SLSNe (Gal-Yam 2012).
SNAD187 (AT2018mcb, ZTF18aaqctvg) is located at α = 13h 53m 7.366s, δ = +40° 48′ 7.42″. There are several photometric redshift estimations of its possible host provided by different surveys: z_ph = 0.204 ± 0.084 by the Legacy Surveys Sky Viewer (Zhou et al. 2021), z_ph = 0.343 ± 0.128 by SDSS DR16 (Ahumada et al. 2020), and z_ph ≃ 0.201 by Gaia DR3 (Gaia Collaboration 2022). Also, according to Gaia variability classification results, there is an AGN at the transient position (Gaia Collaboration 2022). It is possible that SNAD187 is not associated with the host AGN activity and could be a SLSN. Recently, the ANTARES broker AD filter reported the discovery of a SLSN, SN 2022mnj, at the central region of an AGN (Aleo et al. 2022a; Ashall 2022; see also Moriya et al. 2017).
Figure 7 shows the observed light curves of SNAD120, SNAD121, SNAD160, and SNAD187 in the zr-band in comparison with SN 2006gy (Smith et al. 2007), one of the brightest among the well-studied SLSNe, shifted to z = 0.3 and 0.4. SN 2006gy has a very broad light curve, but it is clear that the SNAD candidates have even broader light curves, making them peculiar objects among known SNe. The discovery of four slow-evolving transients among the SNAD objects that were not reported by previous searches provides clear evidence that the AAD is efficient in searching for rare classes of astronomical objects within large and complex data sets.

SNAD knowledge database
Beyond the transient candidates discussed previously, this work also produced a valuable knowledge database incorporated within the SNAD viewer (Malanchev et al. 2023). The viewer is a specially designed web interface that allows the expert to visualise ZTF DR light curves, provides access to the individual exposure images, and performs cross-matches with different databases and catalogues. Authorised users can assign labels (tags) to ZTF objects (see Fig. C.1). We defined a system of tags that includes some general classes: variable star of unspecified type (VAR), transient (TRANSIENT), active galactic nucleus (AGN), quasar (QSO), normal star without strong variability (STAR), and galaxy (GALAXY), as well as the most popular types and subtypes of variable stars and transients, such as supernova (SN) and its subtype Type Ia supernova (SNIA). There are a few custom tags for internal purposes, such as transients with one outlier point (1-POINT) or candidates to be sent to TNS (TNS_CANDIDATE). Also, tags of non-astrophysical origin such as artefacts and their subtypes are present. Several tags can be assigned to one object, and the history of tag changes is also stored in the database (Fig. C.1, on the right).
The choice of tags is determined by the experts, based on the most frequent types of objects appearing in the output of the AD algorithms and also determined by the project needs. Therefore, we do not claim to be complete in covering all possible types of variables and transients.
During the supernova search, a total of 1482 objects were labelled. Although the ZTF data processing pipeline includes a procedure to separate astrophysical events from bogus ones, namely, false positive detections (Masci et al. 2019), artefacts make up ∼45% of the objects inspected in fields with SNe. Examples of the artefacts found are given in Fig. 8.
Among real variables, the most common types in fields containing SNe are eclipsing (N = 51, ∼5%) and pulsating (N = 53, ∼5%) variables, as well as AGNs (N = 176, ∼17%). The assigned labels can be used to further improve the ZTF pre-processing pipeline (in the case of artefacts) as well as for machine-learning classification tasks (in the case of astrophysical labels).

Other non-catalogued objects
During the supernova search, a number of interesting non-catalogued objects of other types have been found. Among those there are red dwarf flares, namely, transients caused by the sudden release of stored magnetic energy from surface magnetic loops into the outer stellar atmosphere (Pettersen 1989; Haisch et al. 1991), and AGNs. For example, a two-peak flare of a red dwarf, OID = 726209400028833, located at a distance of ∼162 pc (Bailer-Jones et al. 2018) is shown in Fig. 9. The amplitude of the flare is ∼1.8 mag, the minimum duration is ∼46 min. There are many unsolved questions related to flare physics, red dwarf distribution in the Galaxy, and habitability of host planets, which can benefit from a systematic study of a large sample of such events (e.g. Segura et al. 2010; Engle & Guinan 2011; France et al. 2013; Webb et al. 2021). Moreover, the good observational cadence of the flare (∼70 points in 46 min) also opens up a possibility to search for fast transients in ZTF data.
Another interesting object, OID = 676213300006792, located at a distance of ∼234 pc (Bailer-Jones et al. 2018), shows two outbursts, one of which is observed at a high frequency (see Fig. 10). Based on its SDSS spectrum, 676213300006792 was previously identified as a white dwarf-main sequence binary with a secondary M-dwarf companion (Liu et al. 2012). 676213300006792 is a weak UV source and does not appear in any X-ray database. Its SDSS spectrum does not show a significant Hα emission. The high cadence zg-band light curve shows a periodicity with P ≃ 4.25 min, just before the flare. We assume that there is no stable mass transfer in the system, and the M-dwarf has not overflowed its Roche lobe. We attribute the outbursts to the low accretion rate of the unstable stellar wind on the white dwarf during the increase in magnetic activity from the M-dwarf. The periodic variation before the flare may be related to a hot spot in the temporary accretion disc.
Other non-catalogued objects include candidates for AGNs (e.g. Fig. 11) and variable stars of different nature (e.g. eclipsing binary candidate in Fig. 12). All these objects can be studied separately in the future by the domain experts.

Conclusions
In this work, we provide the first results from the complete SNAD adaptive learning pipeline in the presence of big data from large-scale astronomical surveys. The SNAD team became aware of the existence of non-reported supernova candidates within the ZTF DRs once they appeared in a non-targeted anomaly detection search (Malanchev et al. 2021a). A new experiment was then designed to develop a tailored machine learning model which would explore this possibility by taking advantage of the SNAD adaptive learning pipeline (Ishida et al. 2021) and our experts' long-term experience studying supernovae. We selected 70 ZTF fields at high galactic latitude and employed a series of quality cuts followed by designed feature extraction (Sect. 2.1). The resulting homogeneous feature sets (one per field) were submitted independently to 30 iterations of the active anomaly discovery algorithm, where at each iteration, the domain expert would give positive feedback to any outlier whose light curve resembles a SN and negative feedback otherwise. During this process, human-assigned labels were added to the SNAD knowledge database, opening the way for future deeper analysis of the same data (Sect. 4.2). From the 2100 objects visually inspected, we found 104 SN-like events, 57 of which were reported for the first time. These transients received an internal name, were reported to TNS, and were added to the SNAD catalogue (see Sect. 2.3).
In order to evaluate probable classification types for the newly found transients, we performed light curve fits using different supernova models (Sect. 3). Among the newly found transients, we reported three objects (SNAD121, SNAD160 and SNAD187) with broad, slowly evolving light curves that stand as promising superluminous supernova candidates (see Fig. 7 and Pruzhinskaya et al. 2022).
Although the AAD was aimed at supernova search, other potentially interesting objects have been found, including non-catalogued AGNs and red dwarf flares. The high cadence data of discovered flares opens the possibility of searching for fast transients in ZTF. Moreover, the visual inspection of AAD outliers during the SN search led to the creation of the SNAD knowledge database that can be used for different machine learning tasks in the future.
The overall efficiency of the pipeline is highly dependent on the total number of objects being analysed, feature choices, and maximum iterations budget, among other parameters. Nevertheless, the results presented here confirm the effectiveness of adaptive learning approaches in filtering large astronomical data sets for expert analysis. They reveal important characteristics of ZTF data releases that ought to be further scrutinised to avoid similar losses in the future (Aleo et al. 2022b).

Appendix A: AAD results
We report the complete set of SN-like transients shown to the expert by the AAD pipeline below.

Appendix B: Light curves of the SNAD supernova candidates
We present below a subset of SNAD candidates and their respective light curve fits (Section 2.3).

SNAD185: fits with Nugent's models: Ia, z = 0.17, M = −19.9; Ibc, z = 0.17, M = −19.9; IIP, z = 0.17, M = −19.7; IIL, z = 0.17, M = −20.2; IIn, z = 0.17, M = −19.9.

Open Access article, published by EDP Sciences, under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. This article is published in open access under the Subscribe to Open model. Subscribe to A&A to support open access publication.

Fig. 1. Sky map in equatorial coordinates with plotted positions of ZTF fields analysed in this work; the colour bar shows the number of objects in each field. Fields with detected supernova candidates are highlighted with bold black boundaries. The blue curve denotes the galactic plane. The black triangle marks the galactic centre and the black circle corresponds to the position of the Andromeda galaxy.
Fig. 7. Light curves of SNAD SLSN candidates in zr-band in comparison with the R-band light curve of well-studied SLSN SN 2006gy shifted to z = 0.3 (black pluses) and z = 0.4 (black crosses).The observed magnitudes of SN 2006gy are taken from Smith et al. (2007).All the light curves are shown relative to the maximum light.

Fig. B.4. Light curves of SNAD supernova candidates in zr-band and the results of their fit by Nugent's supernova models.
Fig. B.9. Light curves of SNAD supernova candidates in zr-band and the results of their fit by Nugent's supernova models.

Table 1. Sub-populations of spectroscopically confirmed supernovae found in this work (AAD) and the total reported in TNS (TNS) for the same time period.

Table A.1. Complete list of supernovae and supernova candidates found by the active anomaly discovery algorithm in ZTF DR3.

Notes to Table A.1. For the SNAD candidates, the type corresponds to the best-fit model according to the fit with Nugent's supernova templates.