Active Anomaly Detection for time-domain discoveries

We present the first evidence that adaptive learning techniques can boost the discovery of unusual objects within astronomical light curve data sets. Our method follows an active learning strategy in which the learning algorithm chooses objects that could potentially improve the learner if additional information about them were provided. This new information is subsequently used to update the machine learning model, allowing its accuracy to evolve with each new piece of information. For the case of anomaly detection, the algorithm aims to maximize the number of scientifically interesting anomalies presented to the expert by slightly modifying the weights of a traditional Isolation Forest (IF) at each iteration. In order to demonstrate the potential of such techniques, we apply the Active Anomaly Discovery (AAD) algorithm to two data sets: simulated light curves from the PLAsTiCC challenge and real light curves from the Open Supernova Catalog. We compare the AAD results to those of a static IF. For both methods, we performed a detailed analysis of all objects with the ~2% highest anomaly scores. We show that, in the real data scenario, AAD was able to identify ~80% more true anomalies than the IF. This result is the first evidence that AAD algorithms can play a central role in the search for new physics in the era of large-scale sky surveys.


Introduction
The detection of new astronomical sources is one of the most anticipated outcomes of the next generation of large-scale sky surveys. Experiments such as the Vera Rubin Observatory Legacy Survey of Space and Time (LSST; https://www.lsst.org/) are expected to continuously monitor large areas of the sky with remarkable cadence, which will undoubtedly lead to the detection of unforeseen astrophysical phenomena. At the same time, the volume of data gathered every night will also increase to unprecedented levels, rendering serendipitous discoveries unlikely. In the era of big data, most detected sources will never be visually inspected, and the use of automated algorithms is unavoidable.
The task of automatically identifying peculiar objects within a large set of normal instances has been highly explored in many areas of research (Aggarwal 2016). This has led to the development of a number of machine learning (ML) algorithms for anomaly detection (AD) with a large range of applications (Mehrotra et al. 2017). In astronomy, these techniques have largely been applied to areas such as the identification of anomalous galaxy spectra (Baron & Poznanski 2017), problematic objects in photometric redshift estimation tasks (Hoyle et al. 2015), characterization of light curves (LCs) of transients (Zhang & Zou 2018; Pruzhinskaya et al. 2019), and variable stars (e.g., Rebbapragada et al. 2009; Nun et al. 2014; Giles & Walkowicz 2019; Malanchev et al. 2021), among others.
Despite encouraging results, the application of traditional AD algorithms to astronomical data scenarios is far from simple. Most of these strategies involve constructing a statistical model for the nominal data and identifying objects that significantly deviate from this model as anomalous. Once identified, these sources are subjected to further scrutiny by an expert, who confirms (or not) the discovery of a new phenomenon. However, a statistical anomaly is often the result of observational defects or other spurious interference that is not scientifically interesting, leading to a high rate of candidates that turn out to be of a well-known nature despite their high anomaly scores. This incorrect identification results in a proportional fraction of resources, and research time, being spent on investigating these non-peculiar objects.
Since measuring the details of a new source often requires the allocation of spectroscopic follow-up resources, the development of AD strategies able to deliver a low rate of objects from scientifically well-known categories is an exceedingly important task. This task will be made more crucial in the light of the upcoming generation of telescopes, which will drastically increase the volume of nominal data and, in the process, engender a challenging AD task. In ML jargon, this would require an adaptive recommendation system that is able to optimally exploit a given ML model by carefully choosing objects that can significantly influence the results if more information about them is provided.
Active learning (AL) is a subclass of ML algorithms designed to guide such an optimal allocation of labeling resources in situations where labels are expensive and/or time consuming (Settles 2012). It has been widely applied in many real-world situations and research fields, for example, natural language processing (Thompson et al. 1999), spam classification (DeBarr & Wechsler 2009), cancer detection (Liu 2004), and sentiment analysis (Kranjc et al. 2015). In the context of large-scale photometric surveys, this translates into a recommendation system for planning the distribution of follow-up resources, given a particular scientific goal. Prototypes using this underlying philosophy for supervised learning tasks have been applied to the determination of stellar population parameters (Solorio et al. 2005), the supervised classification of variable stars (Richards et al. 2012), microlensing events (Xia et al. 2016), photometric redshift estimation (Vilalta et al. 2017), supernova (SN) photometric classification (Ishida et al. 2019; Kennamer et al. 2020), and the determination of galaxy morphology (Walmsley et al. 2020).
In this work, we present the first application of AL for AD in astronomical data. Similar strategies have already been reported, with encouraging results, in the identification of anomalous behavior dangerous to web services (Fan 2012), intrusion identification in cloud systems (Ibrahim & Zainal 2019), and the detection of anomalous features in building construction (Wu & Ortiz 2019), to cite a few. Despite this successful track record, the particular characteristics of astronomical data, more specifically of astronomical transients (errors in measurements, influence of observation conditions, sparse non-periodic and non-homogeneous time series, etc.), make this demonstration an important milestone in the exploitation of such techniques by the astronomical community. As a proof of concept, we applied the active anomaly detection (AAD) strategy proposed by Das et al. (2017) to two different data sets: simulated LCs from the Photometric LSST Astronomical Time-series Classification Challenge (PLAsTiCC; The PLAsTiCC team 2018) and real LCs from the Open Supernova Catalog (OSC). Our goal with the real data analysis is to lower the burden inflicted by the ML algorithms on domain experts and to propose a strategy that would improve the results presented in Pruzhinskaya et al. (2019) while requiring the expert to confirm a smaller number of sources. Used in combination with a traditional isolation forest (IF) algorithm, the method allows an increasingly large incidence of true positives (scientifically interesting anomalies) among the objects presented to the expert, in turn enabling a better allocation of resources over the course of a given survey.
This paper is organized as follows. We present the data in Sect. 2 and the preprocessing analysis in Sect. 3. Section 4 describes the AAD algorithm and its implementation, and the results are presented in Sect. 5. Finally, we present our conclusions in Sect. 6.

Data
This work focuses on finding anomalies within transient LC data sets. Our experiments were performed in both simulated and real data sets.
Our real data sample comes from the OSC. This is a public repository containing SN LCs, spectra, and metadata from a range of sources. The catalog is built using data from different sources whose labels can sometimes be contradictory. This includes preliminary or "fast" classifications that need further confirmation. The catalog is constantly evolving, but at any moment in time it is also known to contain some percentage of non-SN contaminants (Guillochon et al. 2017; Pruzhinskaya et al. 2019), which makes it well suited for our purposes. The real data analysis is based on the data set first presented in Pruzhinskaya et al. (2019); therefore, the detailed description of the quality cuts, the data selection process, and the preprocessing pipeline is given there. For clarity, we describe the main steps of the data preparation below.
From the OSC catalog, we extracted objects with LCs in the BRI (Bessell 1990), g′r′i′, or gri filters. We assumed that the g′r′i′ filters are very similar to gri and that the coefficients of their transformation equations are quite small (Fukugita et al. 1996; Tucker et al. 2006; Smith et al. 2007). Light curves originally observed in the BRI filters were converted to gri using Lupton's transformation equations.
The simulated data used in this work are a subsample of the LCs prepared for the PLAsTiCC data challenge, which was constructed to mimic the data scenario that we will encounter after 3 years of LSST observations. In order to build a data environment similar to the one we found in the OSC, we restricted our sample to six classes: SN Ia, SN II, SN Ibc, SN Ia-91bg, binary microlensing, and pair-instability SN (PISN). The entire PLAsTiCC test set was subjected to the LC fitting procedure described in Sect. 3.

Light curve fit
In order to obtain a homogeneous input data matrix for the ML algorithms, all LCs were submitted to a Multivariate Gaussian Process pipeline. Instead of approximating the LCs in different filters independently, Multivariate Gaussian Process takes the correlation between different bands into account, approximating the data via a Gaussian process (GP) in all filters with one global fit. The kernel used in our implementation is composed of three radial-basis functions, one per photometric band i, whose length scales l_i are parameters of the GP to be found from the LC approximation. In addition, Multivariate Gaussian Process includes six constants, three of which are unit variances of the basis processes while the other three describe their pairwise correlations. In total, the model has nine parameters to be fitted. The approximation procedure was done in flux space. For each object, we only took those epochs that lie within the interval [−240, +240] days since the maximum in the r-band, averaging measurements within a 1-day time bin. Each object was characterized by 374 features. The feature set included ten parameters of Multivariate Gaussian Process (nine fitted parameters of the kernel and the final log-likelihood), the LC maximum flux, and the normalized GP results within [−20, +100] days since maximum brightness in the r-band in steps of 1 day, concatenated according to their effective wavelength. After applying these steps to the OSC, we visually inspected the results and eliminated bad fits, obtaining a final set of 1999 objects.
For the PLAsTiCC data, we automatically removed all objects for which the GP fit was unsuccessful (i.e., the likelihood maximization procedure was unable to converge). A total of 7223 objects survived this preprocessing pipeline.
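As an illustration of this kind of GP-based feature extraction, the sketch below fits a toy single-band light curve with an RBF kernel using scikit-learn and collects the fitted kernel parameters, the log-likelihood, the maximum flux, and the normalized fit on a 1-day grid. Note that the actual pipeline fits one multivariate GP across all three bands with the nine-parameter kernel described above; all data and parameter values here are invented.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)

# Toy single-band light curve: flux vs. time (days from maximum), with noise.
t = np.sort(rng.uniform(-50, 100, size=40))
flux = 100.0 * np.exp(-0.5 * (t / 15.0) ** 2) + rng.normal(0.0, 2.0, size=t.size)

# RBF kernel with a length scale to be fitted, plus a white-noise term.
kernel = 1.0 * RBF(length_scale=10.0) + WhiteKernel(noise_level=1e-2)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(t[:, None], flux)

# Features in the spirit of Sect. 3: fitted kernel parameters, the final
# log-likelihood, the light-curve maximum, and the normalized GP prediction
# on a uniform 1-day grid around maximum brightness.
grid = np.arange(-20, 101)[:, None]          # [-20, +100] days, 121 epochs
curve = gp.predict(grid)
features = np.concatenate([
    gp.kernel_.theta,                        # fitted (log-)kernel parameters
    [gp.log_marginal_likelihood_value_],
    [flux.max()],
    curve / np.abs(curve).max(),             # normalized GP light curve
])
```

In the multi-band case, the normalized grids for the three bands would be concatenated in order of effective wavelength, yielding the 374-element feature vector described above.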
The two approaches described above illustrate the flexibility demanded from any feature extraction procedure aimed at preparing astronomical data to be used in standard ML environments (with the exception of a few deep learning techniques). For future large-scale surveys, such as LSST, a numerical criterion such as the one we employ for PLAsTiCC is advised since visual inspection will not be possible. Regardless of the feature extraction choice, our goal is to highlight how AAD can improve upon results from IF, given that a small fraction of scientifically interesting anomalies are present in the final data set.

Methodology
In order to compare the AAD results with those obtained with a traditional AD method and with a blind search, we performed a detailed analysis of all instances with the ∼2% highest anomaly scores (145 objects for the simulated data and 40 objects for the real data). In the simulated data, this process was automatic: once we selected the classes that represented anomalies (see Sect. 5), the algorithm was able to read the labels directly from the data file. For the OSC data, we recruited a team of two human experts (MP and AV), with extensive experience in observational and theoretical aspects of SN science, to carefully analyze each of the 40 candidates. These specialists performed a thorough investigation of each candidate, including consultation of external literature, and were not involved in the development or implementation of the AAD strategy. In what follows, we only considered as confirmed anomalies the objects flagged as scientifically interesting by both experts.
Anomaly scores were obtained according to three different strategies: random sampling (RS), IF (Sect. 4.1), and AAD (Sect. 4.2). The screening described above allowed us to coherently estimate the rate of scientifically interesting candidates for all these strategies. Each candidate was considered anomalous or nominal according to the guidelines described in Sect. 4.3.
In essence, we followed a methodological strategy similar to that used in internet search engines, where the relevance of a document is judged with respect to the information the user needs, that is, its capability of solving the user's real-world problem, not just the presence of the queried words (see, e.g., Manning et al. 2008). Similarly, when evaluating the different algorithms, we started from a statistically identified anomaly candidate but left the final judgment to the experts, allowing the system to learn the connection between the data and the user-specific interests.
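The screening described above amounts to measuring, for each strategy, the fraction of confirmed anomalies among the ∼2% highest-scored objects. A minimal sketch of this bookkeeping, with invented scores and labels standing in for the expert confirmation step:

```python
import numpy as np

rng = np.random.default_rng(42)

n = 2000                                        # catalog size (OSC-like)
is_anomaly = np.zeros(n, dtype=bool)
is_anomaly[rng.choice(n, size=80, replace=False)] = True   # hypothetical 4% rate

# Invented scores: a detector that mildly favors true anomalies,
# and a random-sampling baseline.
detector_scores = rng.normal(size=n) + 2.0 * is_anomaly
random_scores = rng.normal(size=n)

def top_fraction_confirmed(scores, labels, frac=0.02):
    """Fraction of confirmed anomalies among the `frac` highest-scored
    objects (the screening applied to the ~2% top candidates)."""
    k = int(round(frac * len(scores)))
    top = np.argsort(scores)[::-1][:k]          # indices of the top-k scores
    return labels[top].mean()

p_detector = top_fraction_confirmed(detector_scores, is_anomaly)
p_random = top_fraction_confirmed(random_scores, is_anomaly)
```

In the real analysis the labels come from the experts rather than from a simulation, but the comparison metric is the same: the purity of the top-ranked candidate list at a fixed labeling budget.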

Isolation forest
Anomalies are identified as patterns or individual objects within a data set that do not conform to a well-defined notion of "normal" (Chandola et al. 2009). Starting from this definition, popular AD techniques begin by modeling the nominal data (defining what is normal) and subsequently identifying anomalies as samples that are unlikely to be generated by the determined model. In real data problems, this task is non-trivial since the underlying statistical distribution guiding the data generation process can be quite complex. It is possible to avoid the need for modeling the nominal data by using distance-based techniques. In this paradigm, one starts with the hypothesis that anomalous instances are likely to be far from the normal ones in the input feature space. Thus, by calculating the distance between every possible pair of objects in the data set, it is possible to select samples that, on average, are farther from the bulk of the data set. Such a strategy avoids the need for defining a complete statistical model for the normal data, but it can still be computationally very expensive for large data volumes (Taha & Hadi 2019).
The IF method is a tree-based ensemble method first proposed by Liu et al. (2008). It was inspired by distance-based techniques and thus considers anomalies to be data instances that are isolated from the bulk of the data set in a given feature space. However, this isolation is determined locally by training a randomized decision tree (Louppe 2015). In a sequence of steps, the algorithm randomly selects a subset of the data, input features, and split points (decision boundaries or nodes). The feature space is then sequentially subdivided into cells, with the number of sequential cuts determining the path length from the initially large feature space (root) to each final cell (leaf or external node). In this context, anomalies are identified as objects with the smallest path length between the root and an external node; in other words, anomalies are identified as objects that become isolated in a cell more quickly. The combination of results from a number of trees built with different subsamples makes the method robust to over-fitting. By exploiting the fact that anomalies are, by definition, rare and prone to isolation, this method avoids the need for expensive distance calculations or statistical modeling of the normal instances.
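The behavior described above can be illustrated with the IsolationForest implementation in scikit-learn, applied to a toy two-dimensional feature space rather than the 374 light-curve features of Sect. 3 (all data here are invented):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Nominal objects: a dense cluster in a toy 2D feature space.
nominal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
# Anomalies: a handful of points far from the bulk of the data.
anomalies = rng.normal(loc=8.0, scale=0.5, size=(5, 2))
X = np.vstack([nominal, anomalies])

forest = IsolationForest(n_estimators=100, random_state=0).fit(X)

# score_samples assigns higher values to "more normal" points, so sorting
# in ascending order ranks the most easily isolated objects first.
scores = forest.score_samples(X)
ranking = np.argsort(scores)
```

With this degree of separation, the five injected outliers (indices 500-504) occupy the top of the ranking, mirroring the short root-to-leaf path lengths described above.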

Active anomaly detection
Active learning algorithms allow expert feedback to be incorporated into the learning model in an iterative manner and, consequently, improve the accuracy of the predicted results. As such, they work in conjunction with a traditional ML strategy, which must either be sensitive to small changes in input information (adding or subtracting a small number of objects from the training set) or allow the incorporation of such knowledge in the subsequent fine tuning of the model. Decision trees fulfill these requirements (see, e.g., Loh 2014). Moreover, for the specific case of AL for AD tasks, ensemble methods are especially significant.
Ensemble methods for AD rely upon the assumption that anomalies will have a higher anomaly score across the entire ensemble, while nominal samples will be assigned lower ones, despite the values of the scores themselves differing among ensemble members. This allows us to define a weight vector, w, whose elements denote the impact of different members of the ensemble on the final anomaly score. In the case of N members with perfect predictions, this will be a uniform vector, w_i = 1/N. In a more realistic scenario, certain members will be better predictors than others, and we can translate this behavior by assigning larger weights to more accurate predictors and lower ones to noisier members of the ensemble (see Fig. 1 of Das et al. 2018).
Active Anomaly Discovery (an AAD algorithm proposed by Das et al. 2017) exploits this adaptability in order to fine-tune the ensemble according to a specific definition of an anomaly, as pointed out by the expert through a series of labeled examples. The algorithm starts by training a traditional IF and then presents the candidate with the highest anomaly score to a human annotator for classification. If the expert judges the candidate to be an anomaly, the state of the model does not change and the candidate with the next highest score is presented. Whenever a given candidate is flagged as nominal, the model is updated by rescaling the contribution of each leaf node (changes in w) to the final anomaly score. This slight modification preserves the structure of the original forest while adapting the weights to ensure that labeled anomalies are assigned higher anomaly scores than labeled nominal instances.
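A heavily simplified sketch of this reweighting idea is shown below. The actual Active Anomaly Discovery update solves a constrained optimization over the leaf-node weights (Das et al. 2017); here we replace it with a perceptron-style ranking step on a toy two-member ensemble, purely to illustrate how expert labels pull the weight vector toward members that agree with the expert. All scores are invented.

```python
import numpy as np

# Toy per-member anomaly scores for six objects (rows: objects, columns:
# ensemble members). Member 0 agrees with the expert's interests; member 1
# ranks a statistically odd but scientifically uninteresting object highest.
scores = np.array([
    [0.9, 0.1],   # object 0: true anomaly (interesting to the expert)
    [0.1, 0.9],   # object 1: statistical outlier, but nominal to the expert
    [0.2, 0.2],
    [0.1, 0.3],
    [0.3, 0.1],
    [0.2, 0.3],
])

w = np.full(2, 1 / np.sqrt(2))   # uniform initial weights

def rank_update(w, s_anom, s_nom, lr=0.3, margin=0.1):
    """Perceptron-style ranking step (a simplification of the Das et al.
    2017 update): if a labeled anomaly does not outscore a labeled
    nominal by `margin`, move the weights toward the anomaly's profile."""
    if w @ s_anom < w @ s_nom + margin:
        w = w + lr * (s_anom - s_nom)
        w = np.clip(w, 0.0, None)        # keep weights non-negative
        w = w / np.linalg.norm(w)        # renormalize to unit length
    return w

# Expert feedback: object 0 is a real anomaly, object 1 is not.
for _ in range(10):
    w = rank_update(w, scores[0], scores[1])

final = scores @ w                       # reweighted combined anomaly scores
```

After a few feedback steps the weight on member 0 exceeds that on member 1, and the expert-confirmed anomaly rises to the top of the combined ranking, which is the qualitative behavior AAD exploits at the level of individual tree leaves.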
In summary, scores from AAD have two biases: bias from the unsupervised IF, which increases scores for objects coming from isolated regions of the parameter space; and bias from previously known labels, which increases (or decreases) scores for candidates coming from sparse (or crowded) regions of the parameter space. The strength of AAD is that it is able to help discover both unknown-unknowns, through the first bias, and known-unknowns, through the second. Further details about the algorithm are given in Appendix A and in Das et al. (2017, 2018).

Defining anomalies
The definition of an anomaly strongly depends upon the goals and objectives of the researcher. In this work, we are mainly interested in identifying non-SN contamination and/or SNe with unusual properties (Milisavljevic & Margutti 2018). Non-SN objects can be divided into cases of misclassification (quasars, binary microlensing events, novae, etc.) or completely new classes of objects. We did not consider as anomalies cases of possible misclassification that were due to signals that were too weak to allow a confident conclusion regarding the nature of the transient. These cases cannot be carefully studied due to low signal-to-noise ratios and, therefore, are not astrophysically interesting.
We consider as unusual SNe those objects that were proved to be peculiar by previous studies. These could be any kind of peculiarities: a signature of interaction with the circumstellar medium (CSM), an unusual rise or decline in the LC rate, or any other features that are not representative of the corresponding SN type.
The anomalous cases included in our simulated data were chosen to represent different classes of anomalies: SN Ia-91bg as an example of a rare type of SN (47 objects), binary microlensing events as examples of misclassifications (45 objects), and PISNe as representatives of "new physics" (184 objects). In summary, the simulated data contain ∼4% (275) anomalies and ∼96% (6958) nominal objects.
For data from the OSC, we consider SLSNe and SNe of rare types as anomalous. Super-luminous SNe (SLSNe; Gal-Yam 2012) have an absolute peak magnitude of M < −21 mag, which is 10-100 times brighter than standard SNe. They are sometimes divided into three broad classes: SLSNe-I, without hydrogen in their spectra; hydrogen-rich SLSNe-II, which often show signs of interaction with the CSM; and, finally, SLSNe-R, a rare class of hydrogen-poor events with slowly evolving LCs, which are powered by the radioactive decay of ⁵⁶Ni. Due to their anomalous luminosity, SLSNe are becoming important probes of massive star formation in the high-redshift Universe and may be important cosmological probes, similar to Type Ia SNe (Inserra & Smartt 2014), although only a couple dozen events have been observed so far (Moriya et al. 2018). The physics that drives this diverse class of SNe is not clearly understood, making it paramount to increase the number of observations.
As examples of rare SN types, we considered: Ibn (Pastorello et al. 2008), II-pec (Langer 1991), broad-lined Ic SNe associated with gamma-ray bursts (Cano et al. 2017), and low-luminosity IIP SNe (Lisakov et al. 2018). We also added 91T-like and 91bg-like Type Ia SNe, as well as extreme thermonuclear SNe (e.g., Type Iax SNe; Foley et al. 2013; Taubenberger 2017), to this category. Type 1991bg SNe are characterized by a red color at maximum light and low luminosity. Type 1991T SNe, on the other hand, show a slow decline after maximum light and high peak luminosity. The contamination due to the presence of 1991bg-like and 1991T-like SNe in cosmological samples can affect the measurements of dark energy parameters. This is extremely important for large surveys such as the LSST, which aims to constrain cosmological parameters using the bulk of normal Type Ia SNe. No non-physical effects (e.g., artifacts of interpolation) were considered anomalies.
The above criteria were designed to serve as examples of the kinds of requirements one might impose on the AAD algorithm. These will certainly vary depending on the research goal, available labeling resources, and the data at hand. However, for the purposes of this work, the exact anomaly definition serves merely to illustrate the flexibility of our framework. The global behavior of exercises using different anomaly criteria should resemble those presented in Sect. 5.

Results
We first report the results from applying our method to the subset of the PLAsTiCC data described in Sect. 2. Figure 1 (left panel) shows the fraction of identified anomalies as a function of the number of proposed candidates. This figure was created considering objects in decreasing order of anomaly scores (for IF) and following the order in which they were presented as candidates (for AAD and RS). In order to account for the random nature of the IF algorithm, we performed the experiment 2000 times using different random seeds. The plot shows the mean behavior of all these experiments as solid lines, and the shaded areas mark the 5-95 percentiles of all results. After a total of 145 candidates were proposed (∼2% of the entire data set), we confirmed that, on average, RS found four PISNe, one binary microlensing event, and one SN-91bg (∼4% of all available anomalies), and IF detected eight binary microlensing events and four PISNe (∼8%) among the objects with the highest anomaly scores. Meanwhile, the mean results from AAD after 2000 experiments flagged eight binary microlensing events and 112 PISNe (∼83% of all available anomalies in the data set). (We emphasize that we cannot calculate such percentages for the OSC data, since this would require our experts to perform a detailed analysis of all 1999 objects.) Considering that in the real case the analysis of each anomaly candidate would require the use of expensive spectroscopic telescope time, these results demonstrate how AAD can be a valuable tool in the allocation of such resources.
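The construction of the curves in Fig. 1 (a mean curve with a 5-95 percentile band over repeated experiments) can be sketched as follows, with invented per-candidate outcomes standing in for the 2000 real runs:

```python
import numpy as np

rng = np.random.default_rng(7)

n_runs, n_candidates = 200, 145
# Invented per-run outcomes: True if the i-th proposed candidate in a given
# run was a confirmed anomaly (drawn at random purely for illustration).
hits = rng.random((n_runs, n_candidates)) < 0.3

# Cumulative number of anomalies found after each proposed candidate,
# then the mean curve and the 5-95 percentile band across runs.
cumulative = np.cumsum(hits, axis=1)
mean_curve = cumulative.mean(axis=0)
lo, hi = np.percentile(cumulative, [5, 95], axis=0)
```

Plotting `mean_curve` as a solid line with the band between `lo` and `hi` shaded, for each strategy, reproduces the structure of the figure; dividing by the total number of available anomalies converts the counts into the fractions shown.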
In order to demonstrate the flexibility of the AAD algorithm to adapt to the anomaly definition set by the expert, as stated in Sect. 4, we also ran the AAD algorithm with a different anomaly definition. In the case where the expert would flag only binary microlensing events as anomalous, the AAD algorithm returned, on average, 15 true positives (in comparison with eight returned using the broader anomaly definition) -almost doubling the success rate of a very narrow search. This confirms that the method is able to adapt to the type of anomaly that is interesting to the expert and increase the fraction of candidates worthy of being investigated further.
The analysis of real data presents a much more complex scenario. In order to confirm whether the AAD performance holds when dealing with real observations, we performed the same analysis on data from the OSC. Results are presented in the right panel of Fig. 1. In this scenario, 2% of the entire data set corresponded to ∼40 objects. Random sampling achieved a maximum AD rate of ∼5% (two objects). The IF method was able to boost this to ∼15% (six objects), while ∼27% of the objects identified by AAD were true positives (11 objects). This represents an increase of ∼80% in the number of true anomalies detected for the same amount of resources spent in scrutinizing candidates. Moreover, similar to what we found in the simulated data, although both strategies require a "burn-in" period before they start identifying interesting sources, AAD presented its first anomaly much earlier (14th place, in comparison to 20th place for IF). The full list of identified anomalies is provided in Table 1, and a subset of their LCs is presented in Appendix B.
A more detailed comparison between the IF and AAD results is displayed in Fig. 2. The diagram shows the identification of candidates presented to the expert by IF (top) and AAD (bottom). The first two objects are the same for both algorithms, with a discrepancy starting only from the third one. Candidates are ordered by their scores for IF, from left to right. For AAD, they correspond to the highest anomaly score at successive iterations of the AL loop. Anomalies confirmed by the experts are highlighted in yellow. The plot clearly illustrates not only the higher incidence of anomalies for AAD versus IF (11 vs. 6), but also the higher density of anomalies among the later AAD candidates. The lines connecting objects that are present in both branches show that the first half of the list contains many objects in common between the two algorithms. On the other hand, the second half of the AAD list contains anomalies that are absent from the upper branch. This demonstrates that the algorithm is also able to adapt to the definition of an anomaly according to the feedback received from the expert in a real data scenario. Moreover, one of the most obviously peculiar objects in our sample is a binary microlensing event, Gaia16aye. It was assigned the 33rd highest anomaly score by the IF but was the first real anomaly presented by AAD (in the 14th iteration). These results provide the first pieces of evidence that adaptive learning algorithms can be important tools in planning the optimized distribution of resources in the search for peculiar astronomical objects.

Conclusions
The next generation of large-scale sky surveys will certainly detect a variety of new astrophysical sources. However, since every photometrically observed candidate requires further investigation via spectroscopy, the development of automated AD algorithms with low incidences of false positives is crucial. Moreover, such algorithms must be able to detect scientifically interesting anomalies -as opposed to spurious features due to observing conditions or errors in the data processing pipeline. Active learning methods are known to perform well in such data scenarios. They represent a class of adaptive learning strategies where expert feedback is sequentially incorporated into the ML model, allowing high accuracy in prediction while keeping the distribution of analysis resources under control.
We report results supporting the use of AL algorithms in the allocation of resources for astronomical discovery. We use simulated and real LCs as benchmarks to compare the rate of true anomalies discovered by a traditional IF algorithm to those identified by Active Anomaly Discovery (Das et al. 2017). We show that Active Anomaly Discovery is able to increase the incidence of true anomalies in real data by 80% when compared to static IF. Moreover, the algorithm can adapt to the definition of anomaly imposed by the expert, which leads to a higher density of true positives in later iterations. This not only ensures a larger number of peculiar objects in total, but also guarantees that each new scrutinized source will, in the long run, contribute to the improvement of the learning model. In this context, not even the resources spent in analyzing false positives, in the beginning of the survey, are wasted.
In order to ensure a reliable estimation of true positive rates, we presented a controlled real data scenario in the form of a catalog containing 1999 fully observed SN LCs. This allowed visual confirmation of all the objects within the 2% highest anomaly scores for all the algorithms. As an example of the potential of AL techniques for extracting useful information from legacy data, we highlight that the discovery of an important astrophysical contaminant (the binary microlensing event Gaia16aye) was presented to the expert much earlier by the active strategy than by its static counterpart (14th vs. 33rd highest anomaly score). Moreover, results from simulated data confirmed that the algorithm is flexible enough to allow the adaptation of the anomaly definition according to the interests of the expert, something that is not possible within the traditional AD paradigm. We acknowledge that important issues still need to be addressed (e.g., the variability of results for different feature extraction methods, stream mode learning, and scalability). Nevertheless, the results presented here support the hypothesis that adaptive techniques can play important roles in the future of astronomy.