The ROAD to discovery: machine learning-driven anomaly detection in radio astronomy spectrograms

As radio telescopes increase in sensitivity and flexibility, so do their complexity and data rates. For this reason, automated system health management approaches are becoming increasingly critical to ensure nominal telescope operations. We propose a new machine learning anomaly detection framework for classifying commonly occurring anomalies in radio telescopes as well as detecting unknown rare anomalies that the system has potentially not yet seen. To evaluate our method, we present a dataset consisting of 7050 autocorrelation-based spectrograms from the Low Frequency Array (LOFAR) telescope and assign 10 different labels relating to the system-wide anomalies from the perspective of telescope operators. This includes electronic failures, miscalibration, solar storms, and network and compute hardware errors, among many more. We demonstrate how a novel Self Supervised Learning (SSL) paradigm, which utilises both context prediction and reconstruction losses, is effective in learning normal behaviour of the LOFAR telescope. We present the Radio Observatory Anomaly Detector (ROAD), a framework that combines SSL-based anomaly detection and supervised classification, thereby enabling both classification of commonly occurring anomalies and detection of unseen anomalies. We demonstrate that our system is real-time in the context of the LOFAR data processing pipeline, requiring <1 ms to process a single spectrogram. Furthermore, ROAD obtains an anomaly detection F-2 score of 0.92 while maintaining a false positive rate of ~2%, as well as a mean per-class classification F-2 score of 0.89, outperforming other related works.


Introduction
Radio telescopes are getting bigger and generating increasing amounts of data to improve their sensitivity and resolution (Norris 2010; van Haarlem et al. 2013; Foley et al. 2016; Nan et al. 2011). The growing system size and resulting complexity increase the likelihood of unexpected events occurring, thereby resulting in datasets that contain anomalies. These anomalies include failures in instrument electronics, miscalibrated observations, environmental events such as lightning, astronomical effects including solar storms, as well as problems in data processing systems, among many others. We consider Radio Frequency Interference (RFI) unavoidable and therefore do not consider it an anomaly in this context. Currently, efforts to detect and mitigate these anomalies are performed by human operators, who manually inspect intermediate data products to determine the success or failure of a given observation. The accelerating data rates, coupled with the lack of automation, result in operator-based data quality inspection becoming increasingly infeasible (Mesarcik et al. 2020).
In the context of low-frequency radio astronomy, scientific data processing has been successfully automated by running complex workflows that perform calibration and imaging of interferometric data (de Gasperin et al. 2019; Weeren et al. 2016; Tasse et al. 2018; Wijnholds et al. 2010), radio frequency interference (RFI) mitigation (Offringa et al. 2010), and dedispersion (Barsdell et al. 2012; Bassa et al. 2022) of time-domain data, among many more. Additionally, continuous effort is being made to create high-performance real-time algorithms to improve the quality and reliability of the scientific data (Sclocco et al. 2019, 2016; van Nieuwpoort & Romein 2011; La Plante et al. 2021; Broekema et al. 2018). However, as of yet, there have been no attempts to fully automate the system health management (SHM) pipeline, and by virtue of the lack of work on this topic, no real-time implementations exist. This is in part due to the complexity of the challenge as well as the unavailability of SHM-specific datasets. Furthermore, the successes of SHM-based anomaly detection systems have been extremely impactful in fields ranging from industrial manufacturing (Bergmann et al. 2019a) to spacecraft system health (Baireddy et al. 2021; Spirkovska et al. 2010), thereby motivating this study.
the feature compounding problem (Mesarcik et al. 2020). It must be noted that this work makes use of upstream data products in the form of spectrograms, which are produced by all radio telescopes, thereby enabling its applicability to other instruments.
The SHM anomaly detection problem differs from existing work for several reasons. Firstly, the data inspection performed by telescope operators involves analysing both known and unknown anomalies, where known anomalies should be classified into their respective classes and unknown anomalies should be differentiated from all other existing classes. This is in contrast with typical anomaly detection, which is normally posed as a one-class-classification problem. Furthermore, we find that class imbalance not only exists between the normal and anomalous classes (which is common for anomaly detection), but there is also strong imbalance between the anomalous classes. For these reasons, we propose a new framework for detecting and classifying SHM-based anomalies that is capable of distinguishing both regularly occurring and rare events.
We find the multi-class classification approach more appropriate as it gives more flexibility to telescope operators. This is because the anomalousness of particular events entirely depends on the context of the science goals of an observation. For example, in observations relating to the epoch of reionisation (EoR) (Yatawatta et al. 2013), the signal-to-noise ratio (S/N) is a major concern, and as such any high-power anomalies such as solar storms should be identified and removed from the data. In contrast, in solar-physics-based observations by Vocks et al. (2018), the high-power solar events should implicitly be kept within the data and should not be flagged as anomalous. Therefore, by including a classification step within the anomaly detector system we offer greater flexibility to telescope operators in data quality inspection. Furthermore, we consider classification to be the first step in anomaly mitigation; for example, given the detection of a data loss event, a telescope operator may need to reestablish a network connection.
Fundamentally, all anomaly detection approaches rely on learning representations of normal data and then measuring some difference between the learnt representations of normal and anomalous data (Chandola et al. 2009). Recent developments in machine learning leverage pre-trained networks by fine-tuning them on specific classes of anomaly detection datasets (Roth et al. 2021; Reiss & Hoshen 2021; Tack et al. 2020). However, we show that it is not possible to directly apply these pretrained networks to astronomical data, because spectrograms in the time-frequency domain differ greatly from the natural images used for pretraining. This being said, efforts have been made in pretraining paradigms for astronomical data (Hayat et al. 2021; Walmsley et al. 2022); however, similarly to anomaly detection applied to astronomy, these methods are implemented with imaged galaxy data and not the dynamic spectra necessary for SHM. For this reason, we propose a new self-supervised learning paradigm that combines context prediction and reconstruction error (Doersch et al. 2015) as a learning objective and show that it is effective in learning robust representations of non-anomalous time-frequency data.
With this work, we make the following contributions: (1) a new dataset consisting of 6708 manually labelled autocorrelation-based spectrograms consisting of ten different feature classes; (2) a generic self-supervised learning (SSL) framework that is effective in learning representations of time-frequency data with a high dynamic range; (3) a generic anomaly detection framework that can classify both commonly occurring known anomalies and detect unknown anomalies with a high precision; and (4) real-time performance for LOFAR with our implementation. This paper continues with an analysis of existing literature concerning anomaly detection in astronomy in Section 2, and Section 3 documents our data selection strategy and outlines the labelling process used for evaluation of this work. In Section 4, we show the proposed SSL and anomaly detection frameworks. Finally, our results and conclusions are documented in Sections 5 and 6.

Related work
Recent works that apply machine-learning-based anomaly detection to astronomy have so far focused only on scientific discovery, using galaxy images, transient signals or light curves. In this work, we apply machine-learning-based anomaly detection to autocorrelation-based spectrograms obtained from the LOFAR telescope. This section unpacks the current landscape of machine-learning-based anomaly detection and the recent developments in applying it to astronomy-related fields.

Machine-learning-based anomaly detection
Machine-learning-based anomaly detection relies on modelling normal data and then classifying abnormality by using a discriminative distance measure between the normal training data and anomalous samples (Chandola et al. 2009). Autoencoding models are a popular approach for learning latent distributions of normal data (Bergmann et al. 2019a,b; Pidhorskyi et al. 2018; An & Sungzoon 2015). Anomaly detection using autoencoders can be performed either in the latent space using techniques such as one-class support vector machines (OC-SVM) (Schölkopf et al. 1999), k-nearest-neighbours (KNN) (Bergman et al. 2020), or isolation forest (IF) (Tony Liu et al. 2008), or by using the reconstruction error (Mesarcik et al. 2022b). The use of pretrained networks to obtain latent representations of normal data has also been successful in anomaly detection (Bergman et al. 2020; Reiss & Hoshen 2021; Roth et al. 2021). By first training these models on an objective such as ImageNet classification (Fei-Fei et al. 2010), they are able to generalise to other tasks such as anomaly detection. Additionally, self-supervised learning (SSL) has been shown to be invaluable for finding meaningful representations of normal data (Yi & Yoon 2021; Li et al. 2021; Tack et al. 2020). Here, pretext tasks, which allow the model to learn useful feature representations or model weights that can then be used for other (downstream) tasks, are defined as learning objectives on the normal data such that the model can be fine-tuned for the downstream task of anomaly detection. In both the SSL and pretrained cases, KNN-based measures can be used to distinguish anomalous samples from the normal training data (Bergman et al. 2020; Yi & Yoon 2021).
In most machine-learning-based anomaly detection, performance is evaluated according to the single-inlier-multiple-outlier (SIMO) or multiple-inlier-single-outlier (MISO) (Burlina et al. 2019) settings on natural image datasets such as MVTecAD (Bergmann et al. 2019a). With this paradigm in mind, we find that anomaly detection in the radio astronomical context is a multiple-inlier-multiple-outlier (MIMO) problem. In effect, anomaly detection formulations that make a strong assumption about the number of inliers or outliers are not directly applicable to the radio observatory setting due to the increased problem complexity. Furthermore, we find methods that rely on pretraining with natural images to be ill-suited to the spectrograms used in this work, due to differences in dynamic range and S/N, as shown by Mesarcik et al. (2022a).
Efforts have been made to detect anomalies in light curves and spectra in works such as Astronomaly (Lochner & Bassett 2021) and in transients in Malanchev et al. (2021). Astronomaly is an active learning framework developed for the classification of unusual events in imaged data or light curves at observatories to aid scientific discovery. This being said, it closely follows generic anomaly detection methods, where normal data are first projected to a latent representation and metrics such as IF are used to distinguish normal training samples from anomalous testing samples at inference time. Although Astronomaly assumes a MIMO context, it is still only able to detect unknown anomalies (or at least treats all anomalies as belonging to the same class). This is in contrast with our work, where ROAD is capable of both distinguishing all known anomaly classes with a high level of precision and detecting unknown or rare anomalies.
Deep generative neural networks are also used for anomaly detection. Works by Villar et al. (2021), Mesarcik et al. (2020), and Ma et al. (2023) have shown that Variational Autoencoders (VAEs) can be used for anomaly detection with astronomical data, whereas Margalef-Bentabol et al. (2020) and Storey-Fisher et al. (2021) show that Generative Adversarial Networks (GANs) are effective in learning representations of normal images of galaxies, thereby enabling reconstruction-error-based anomaly detection. In work by Zhang et al. (2019), GANs have also been shown to be effective in the Search for Extraterrestrial Intelligence (SETI) anomaly detection context. However, we find that our SSL method is more stable during training and better suited to anomaly detection using time-frequency data that have a high dynamic range, ∈ [1, 100] dB, and a low S/N for cross-polarised features, such as the Galactic plane, in the 'xy' and 'yx' Stokes parameters.

Representation learning in astronomy
As already mentioned, learning representations of high-dimensional data is essential to the anomaly detection problem. For this reason, among many others, tremendous effort has been made to find methods that learn robust projections of high-dimensional data (He et al. 2022; Chen et al. 2020; Grill et al. 2020; Doersch et al. 2015). These successes have materialised in the astronomical community with results mostly in the galaxy classification domain. Walmsley et al. (2022) showed that pretraining on the Galaxy Zoo DECaLS (Walmsley et al. 2021) dramatically improves model performance for several downstream tasks. Furthermore, Hayat et al. (2021) showed how contrastive learning can be applied to galaxy photometry from the Sloan Digital Sky Survey (SDSS) (Gunn et al. 1998). The authors show that with novel data augmentations, they can achieve state-of-the-art results on several downstream tasks. Furthermore, several additions and modifications have been made to the reconstruction-error-based loss functions of autoencoders. Mesarcik et al. (2020) showed how using both magnitude and phase information in VAEs improves performance of finding representations of astronomical data, whereas Villar et al. (2021) used a recurrent adaptation of a VAE to make training more suitable to light-curve data. Similarly, Melchior et al. (2022) showed how the inclusion of self-attention mechanisms and redshift priors into the latent projection of autoencoders can improve the learnt representations of galaxy spectra.
In this work, we demonstrate that by using a simple adaptation of a context-prediction self-supervised loss (Doersch et al. 2015) we effectively learn robust representations of spectrograms from the LOFAR telescope. Our Radio Observatory Anomaly Detector (ROAD) outperforms existing autoencoding models by a large margin on anomaly detection benchmarks.

Real-time scientific data processing
To cope with the increasing data rates from modern scientific instruments (Norris 2010; van Haarlem et al. 2013; Nan et al. 2011; La Plante et al. 2021), real-time algorithms have been developed for scientific data pipelines. Real-time methods for RFI detection (Sclocco et al. 2016, 2019; Morello et al. 2021), calibration (Prasad et al. 2014), fast radio burst (FRB) detection (Connor & van Leeuwen 2018), and correlation (van Nieuwpoort & Romein 2011; Romein et al. 2010) have been essential to modern radio telescope operations. However, very few machine learning techniques have been shown to be effective in real time. In a seminal work by George & Huerta (2018), machine learning gravitational wave detection algorithms were implemented in real time. Furthermore, Muthukrishna et al. (2022) showed that temporal convolutional networks (TCNs) can be implemented to detect transient anomalies in real time. To demonstrate the effectiveness of our work in the context of radio observatories, we investigated the computational performance and throughput of the proposed system. We show that our system operates in real time in the context of the LOFAR telescope data processing pipeline.

Dataset
We created a new dataset for anomaly detection in radio observatories and document the data selection, preprocessing and labelling strategy used in this section. Applying machine learning to radio astronomical datasets poses a significant challenge, particularly when using time-frequency data. Methods for preprocessing and data selection need to be carefully considered, due to issues such as high dynamic range (due to RFI among other events), combining thousands of stations for a single observation with complex-valued data and multiple polarisations, feature compounding, and many more. An additional challenge with applying machine learning to radio astronomy is the lack of labelled time-frequency datasets from radio telescopes, as well as the limited availability of expert knowledge and the cost associated with creating a dataset.

Observation selection and preprocessing
The ROAD dataset is made up of observations from the Low Frequency Array (LOFAR) telescope (van Haarlem et al. 2013). LOFAR comprises 52 stations across Europe, where each station is an array of 96 dual-polarisation low-band antennas (LBA) in the 10-90 MHz range and 48 or 96 dual-polarisation high-band antennas (HBA) in the 110-250 MHz range. The signals received by each antenna are coherently added in the station beamformer, resulting in each sub-band being approximately 200 kHz wide. These signals are then transported to the central processor to be correlated with a minimum channel width of about 0.7 kHz. This data product is referred to as a visibility and is the data representation used in this work. In contrast, other radio astronomical use-cases where machine-learning-based anomaly detection has been applied (such as detecting unusual galaxy morphologies) use an additional calibration step as well as a 2D Fourier transform and gridding to obtain sky maps.
The visibility data are four dimensional, with the dimensions corresponding to time, frequency, polarisation, and station. Different science cases result in different observing setups, which dictate the array configuration (i.e. the number of stations used), the number of frequency channels ($N_f$), the time sampling, as well as the overall integration time ($N_t$) of the observing session. Furthermore, the dual-polarisation of the antennas results in a correlation product ($N_{pol}$) of size 4. In this work, we only made use of the autocorrelations produced by LOFAR. We did this to minimise the labelling overhead and data size, as well as to simplify the potential feature compounding problem (Mesarcik et al. 2020).
Table 1: Categorisation of data processing, electronic, astronomical, and environmental anomalies in the ROAD dataset. A-team sources refer to the four brightest persistent radio sources in the northern sky. We note that each spectrogram may contain multiple anomalies; hence, the number of samples stated is greater than the overall dataset size.
As already mentioned, the required resolution of modern instruments causes the data products to be relatively large. The data size of an observation that consists only of autocorrelations is given by $N_{auto} = N_t N_f N_{st} N_{pol} N_{bits}$, where $N_{st}$ is the number of stations. This means that a ten-hour observation with a one-second integration time, a 1 kHz channel resolution with a 50 MHz bandwidth, and a 32-bit resolution can result in observation sizes of the order of several terabytes. As this is orders of magnitude larger than the amount of memory available on modern GPUs that are used for training machine-learning algorithms, the data are sub-sampled in time and frequency according to Mesarcik et al. (2020) to result in observations of the order of one gigabyte.
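As a rough check of the size quoted above, the formula can be evaluated directly. The sketch below is illustrative only: the function name is hypothetical, and the station count of 52 (every LOFAR station taking part) is an assumption, since the example in the text does not state the number of stations.

```python
def autocorrelation_size_bytes(hours, integration_s, bandwidth_hz, channel_hz,
                               n_stations, n_pol=4, bits=32):
    """Size in bytes of an autocorrelation-only observation:
    N_t * N_f * N_st * N_pol * N_bits, converted from bits to bytes."""
    n_t = round(hours * 3600 / integration_s)  # number of time samples
    n_f = round(bandwidth_hz / channel_hz)     # number of frequency channels
    return n_t * n_f * n_stations * n_pol * bits // 8


size = autocorrelation_size_bytes(
    hours=10, integration_s=1.0, bandwidth_hz=50e6, channel_hz=1e3,
    n_stations=52,  # assumption: all 52 LOFAR stations are used
)
print(f"{size / 1e12:.2f} TB")  # prints "1.50 TB"
```

Under these assumptions the observation is on the terabyte scale, far beyond the memory of a single GPU, which is what motivates the sub-sampling step.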
Deep learning architectures typically require equally sized inputs; however, LOFAR observations can have a varying number of time samples and/or frequency bands. Therefore, additional resizing of the intermediate visibilities is done by resizing all observations to (256, 256) bins in time and frequency. This means that observations with fewer than 256 time samples are interpolated and those with more are down-sampled. Furthermore, as the autocorrelations contain no phase information, we only used the magnitude component of each spectrogram.
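The resizing step can be illustrated with a small dependency-free sketch. It assumes separable linear interpolation along each axis; the interpolation scheme is not specified in the text, and the function names are hypothetical.

```python
def resample_1d(values, n_out):
    """Linearly interpolate a sequence onto n_out evenly spaced points."""
    n_in = len(values)
    if n_in == 1 or n_out == 1:
        return [float(values[0])] * n_out
    out = []
    for k in range(n_out):
        pos = k * (n_in - 1) / (n_out - 1)  # fractional index into the input
        lo = int(pos)
        hi = min(lo + 1, n_in - 1)
        frac = pos - lo
        out.append((1 - frac) * values[lo] + frac * values[hi])
    return out


def resize_spectrogram(spec, shape=(256, 256)):
    """Resize a 2-D list of magnitudes to `shape` = (rows, columns)."""
    rows = [resample_1d(row, shape[1]) for row in spec]           # along columns
    cols = [resample_1d(col, shape[0]) for col in zip(*rows)]     # along rows
    return [list(row) for row in zip(*cols)]                      # transpose back


# A spectrogram with 100 rows is interpolated up, one with 300 columns is reduced.
spec = [[float(r + c) for c in range(300)] for r in range(100)]
resized = resize_spectrogram(spec)
print(len(resized), len(resized[0]))  # prints "256 256"
```

Plain linear interpolation applies no anti-aliasing when reducing the resolution, which is one way the morphology of short-lived features can change during preprocessing.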
It must be noted that this processing does modify the morphologies of certain features, particularly those present with a low time resolution. However, as this preprocessing step is consistent across all spectrograms, the overall effects on the anomaly detector and classifier are negligible. In future work, we plan to associate the labels with the full resolution LOFAR data from the Long Term Archive (LTA; https://lta.lofar.eu/) and apply it to (256, 256) crops of the full resolution spectrograms.
We selected 110 observations from the LOFAR LTA comprising a broad set of science use cases and the corresponding observing setups. Of the selected observations, we used the autocorrelations from 2431 LBA stations and 4277 HBA stations from an observation period between 2019 and 2022.

Labelling methodology
The ROAD dataset contains ten classes that describe various system-wide phenomena and anomalies from data obtained by the LOFAR telescope. These classes are categorised into four groups: data processing system failures, electronic anomalies, environmental effects, and unwanted astronomical events. Table 1 shows the classes used as well as a description of the events and the band and polarisation in which they occur. We note that the term 'anomaly' is used liberally in this context, as low-power effects (that are only present in the cross polarisations) such as the Galactic plane passing through an observation are somewhat unavoidable. Nonetheless, for observations with extremely low S/N, such as those of the Epoch of Reionisation (EoR) (Yatawatta et al. 2013), the Galactic foreground signals need to be identified and removed. For this reason, we include such events in the ROAD dataset. Furthermore, we do not consider classes that track the systematic corruptions caused by ionospheric disturbances. This is because the ROAD dataset was created using data from the period of the minimum of the past solar cycle. Thus, the statistics for corruption effects such as scintillation are poorly represented in high-band and low-band data (although low-band data track these events better due to the frequency dependence of the signals). In future work, we plan to extend the dataset to include classes relating to more ionospheric disturbances.
Our labelling approach took into consideration anomalies that occurred at both the station and observation levels. For example, events such as lightning storms and high-noise events can look fairly similar, especially in the down-sampled context. However, lightning storms are geographically bound to affect all stations in a certain region and therefore occur only at the station level. Additionally, lightning is highly correlated across stations in time, with minimal delay between the recorded events in each station. On the other hand, high-noise events usually affect only a single antenna at a time with no time dependency between antennas and stations. By this logic, all stations bound to the same geographic location with broad-band high-power events across all polarisations that are correlated in time were considered to be corrupted by lightning storms, whereas individually affected stations were labelled as high-noise events.
We make a distinction between first- and second-order events; for example, the first-order data-loss event corresponds to dropped information from consecutive time samples and/or frequency bands, and the second-order event corresponds to a single dropped time sample or frequency band. We find this a useful distinction as the root cause of these events is different. In the case of first-order data-loss events, the problem can be traced to the correlator pipeline, whereas the second-order events are most likely from conversion overflows due to strong RFI. Additionally, we note some overlap between class labels; for example, it is common for a high-power noise event to trigger instability in an amplifier, causing it to oscillate. However, the precise point of transition that distinguishes these events from each other is often hard to find.
We labelled the dataset using LOFAR observations that were down-sampled and preprocessed as described in Section 3.1. We made multiple train-test splits during our experimentation to ensure consistent performance across models. Furthermore, the ROAD dataset is publicly available. The file is in the hdf5 format and consists of fields corresponding to the raw data, labels, frequency band information, station name, and source observation. Figure 1 illustrates all classes labelled in the available dataset.

Class imbalance
Due to the nature of anomaly detection, the number of normal samples greatly outnumbers anomalous ones. In the case of the ROAD dataset, and the LOFAR telescope more generally, we find there is a class imbalance not only between normal and anomalous classes, but also among the anomalous classes. For example, commonly occurring astronomical signals, such as the Galactic plane, are far better represented in the observations than unlikely events such as the amplifiers oscillating. Practically, this means that when we separate the samples into testing and training sets we also need to maintain the same occurrence rates with respect to the rates in the original dataset. We effectively down-sampled the testing data such that the occurrence rate (shown in the second-to-last column of Table 1) is maintained for evaluation. This means that each model needs to be tested multiple times with new samples taken from the testing pool of anomalous samples to effectively evaluate its performance. We evaluated each model ten times with different random seeds to ensure accurate reporting.
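The repeated, rate-preserving evaluation described above can be sketched as follows. This is a minimal illustration, not the evaluation code used for the results: the function name, the class names, the pool sizes, and the rates are hypothetical, and the occurrence rate is assumed to be expressed relative to the number of normal test samples.

```python
import random


def subsample_test_set(n_normal, anomalous_pool, occurrence_rate, seed):
    """Draw anomalous test samples so that each class keeps its target
    occurrence rate relative to the number of normal test samples."""
    rng = random.Random(seed)
    chosen = {}
    for cls, pool in anomalous_pool.items():
        n_target = min(round(occurrence_rate[cls] * n_normal), len(pool))
        chosen[cls] = rng.sample(pool, n_target)
    return chosen


# Hypothetical sample identifiers and rates, for illustration only.
pool = {
    "galactic_plane": list(range(1000, 1100)),
    "oscillating_tile": list(range(2000, 2005)),
}
rates = {"galactic_plane": 0.05, "oscillating_tile": 0.005}

# Ten evaluations, each with a fresh draw from the anomalous testing pool.
draws = [subsample_test_set(200, pool, rates, seed) for seed in range(10)]
print([len(draws[0][cls]) for cls in pool])  # prints "[10, 1]"
```

Fixing the seed per run keeps each draw reproducible while still exposing every model to different anomalous samples across the ten evaluations.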

Radio frequency interference considerations
As previously mentioned, we consider RFI to be unavoidable, and thus we deem it a normal class. A key problem with using the RFI masks associated with the spectrograms from LOFAR (as done in Mesarcik et al. (2022a)) is that they are generated using AOFlagger. AOFlagger is a SumThreshold-based algorithm that indiscriminately flags all high-power events as interference. In the context of this work, such an approach would result in many of the high-power anomalies such as lightning, solar storms, high-noise events, and oscillating tiles being flagged as RFI. In effect, if we were to blank the RFI before training our models, we would likely remove many of these high-power features, thereby decreasing the efficacy of the model. This would in turn make our end goal of mitigating the anomalies more difficult to achieve, since different classes should lead to different actions by the telescope operators. For this reason, we did not use the RFI masks associated with the spectrograms.

The radio observatory anomaly detector
As outlined in preceding sections, ROAD is designed to detect previously unseen system behaviours and classify known anomalies observed by the LOFAR telescope. To accommodate these requirements, we find it necessary to combine two approaches: supervised classification and self-supervised anomaly detection. This section outlines the motivations and design decisions made for the implementation of ROAD.

Problem formulation
Given the $i$th spectrogram $V_i(\nu, \tau, b, p)$ from the dataset and model $m$ with parameters $\theta_m$, we would like to predict whether an anomaly is present and which class it belongs to, if it is a known event, such that
$$m_{\theta_m}(V_i(\nu, \tau, b, p)) = \begin{cases} 0, & \text{if normal} \\ n \in [1, N], & \text{if known anomaly} \\ N + 1, & \text{if unknown anomaly} \end{cases} \quad (1)$$
where $\nu$, $\tau$, $b$, and $p$ are the indexes corresponding to frequency band, time sample, baseline, and polarisation, respectively, and $N$ is the number of known anomaly classes. Supervised approaches assume that each class is represented in the training set and try to minimise the following loss function:
$$\mathcal{L}_{sup} = \min_{\theta_m} \sum_i H(m_{\theta_m}(V_i(\nu, \tau, b, p)), l_i), \quad (2)$$
where $H$ is an entropy-based measure of similarity and $l$ is the encoded vector of labels corresponding to the contents of $V$. During inference, the supervised classifier produces an estimate of which classes are most probable in a given spectrogram, and the 'argmax' function selects the most likely classification, as shown in the bottom half of Figure 3. However, as illustrated by the results in Section 5, the performance of such a supervised classifier severely deteriorates when exposed to unseen or out-of-distribution (OOD) classes during testing. To remedy this, we disentangled the two model objectives; namely, we used a supervised classifier to identify the known classes present in the training set and a self-supervised anomaly detector to classify unseen anomalies.
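The supervised inference step described above can be illustrated with a toy example. This is a sketch that assumes $H$ is the usual cross-entropy over softmax outputs and a single label per spectrogram; the logits are made-up values and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw classifier outputs into class probabilities."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Entropy-based loss for one sample whose true class index is `label`."""
    return -math.log(probs[label])


logits = [0.2, 2.5, -1.0, 0.3]  # made-up outputs for class 0 (normal) and 3 anomaly classes
probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)  # the 'argmax' step
loss = cross_entropy(probs, label=1)
print(predicted)  # prints "1"
```

Because the softmax always assigns the full probability mass to the known classes, an unseen anomaly is forced into one of them, which is the failure mode that motivates the separate anomaly detector.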

Self-supervised representation learning
Self-supervised learning (SSL) methods learn useful feature representations by training on secondary objectives called 'pretext tasks', so that once trained, the model weights can be utilised for downstream applications. We define two pretext tasks that allow the model to learn useful representations for anomaly detection in astronomical data: context prediction and reconstruction error.
Context prediction is a pretext task that makes a model classify the positional relationship between two patches taken from the same image. The two patches are projected to some latent representations, $z_0$ and $z_1$, using a backbone network, $f$, while keeping track of their position label, $c$, on a 3 × 3 grid as proposed by Doersch et al. (2015). Then, using $g$, a two-layer multi-layer perceptron (MLP), we classify the positional relationship from the latent representations, as given by $H(g(z_{i,j,0}, z_{i,j,1}), c_j)$, where $i$ corresponds to the index of each spectrogram, $j$ is the index of each context-prediction pair in a single spectrogram, and $c_j$ is the positional label. Additionally, to ensure the model does not learn positional relationships based purely on the bordering values of each patch, we augment each neighbour in the training process. In the implementation, we randomly crop the patches between 100% and 75% of their original size, followed by resizing them to their original dimensions. We illustrate the context prediction loss and patch selection in Figure 2.
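The patch selection for this pretext task can be sketched as below. It is an illustration under stated assumptions, not the training code: the patch size of 64, the label ordering, and all function names are hypothetical, and the resizing of the cropped patch back to its original dimensions is left out.

```python
import random

# Offsets (row, column) of the eight neighbours around the centre of a 3 x 3 grid.
# The index into this list serves as the positional label c (an assumed ordering).
NEIGHBOUR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
                     (0, -1),           (0, 1),
                     (1, -1),  (1, 0),  (1, 1)]


def crop(image, top, left, size):
    """Cut a square patch out of a 2-D list."""
    return [row[left:left + size] for row in image[top:top + size]]


def sample_context_pair(image, patch, rng):
    """Return (centre_patch, neighbour_patch, label) for one context-prediction pair."""
    n_rows, n_cols = len(image), len(image[0])
    # Choose a centre whose eight neighbours all lie inside the image.
    top = rng.randrange(patch, n_rows - 2 * patch + 1)
    left = rng.randrange(patch, n_cols - 2 * patch + 1)
    label = rng.randrange(len(NEIGHBOUR_OFFSETS))
    d_row, d_col = NEIGHBOUR_OFFSETS[label]
    centre = crop(image, top, left, patch)
    neighbour = crop(image, top + d_row * patch, left + d_col * patch, patch)
    return centre, neighbour, label


def random_crop_box(patch, rng, min_scale=0.75):
    """Pick a random crop covering 75% to 100% of a patch (the augmentation step)."""
    size = rng.randint(round(min_scale * patch), patch)
    return rng.randint(0, patch - size), rng.randint(0, patch - size), size


rng = random.Random(0)
image = [[r * 256 + c for c in range(256)] for r in range(256)]
centre, neighbour, label = sample_context_pair(image, patch=64, rng=rng)
top, left, size = random_crop_box(64, rng)
augmented = crop(neighbour, top, left, size)  # would then be resized back to 64 x 64
```

Cropping the neighbour before resizing removes the exact border alignment between adjacent patches, so the positional label has to be inferred from content rather than from matching edge values.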
Furthermore, to enforce consistency across the representations of similar-looking patches, we use reconstruction error. Reconstruction error maintains consistency by ensuring that two patches with common features in visibility space should occupy nearby locations in the latent space and therefore should be reconstructed similarly. The reconstruction loss measures the difference between each patch and its reconstruction from the latent space by $d$, a de-convolutional decoder that should have significantly fewer parameters than the backbone network $f$. We do this to ensure that the model has more capacity to learn suitable representations instead of prioritising reconstruction. For completeness, the full SSL learning objective combines the context-prediction and reconstruction losses, where $\lambda$ is a hyper-parameter that changes the influence of each component of the loss. Additionally, we use regularisation in the form of minimising the squared size of the latent projections $z$. Regularisation is used in order to enforce the most compact representations in $z$. We experimentally select $\lambda = 0.5$ and $\lambda_{reg} = 1 \times 10^{-6}$ and illustrate the impact of $\lambda$ in Section 5.
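One way to write out the objective described above is sketched below. The context-prediction term follows the expression given for the positional classification; the $\ell_1$ patch-wise reconstruction error, the convex weighting by $\lambda$, and the symbol names $\mathcal{L}_{con}$, $\mathcal{L}_{recon}$, $\mathcal{L}_{reg}$, and $\mathcal{L}_{ssl}$ are assumptions made for illustration, not forms confirmed by the text.

```latex
\begin{align}
  \mathcal{L}_{con}   &= \sum_i \sum_j H\big(g(z_{i,j,0}, z_{i,j,1}),\, c_j\big), \\
  % assumption: absolute (l1) error between each patch and its decoded version
  \mathcal{L}_{recon} &= \sum_i \sum_j \sum_{k \in \{0, 1\}}
                         \big\lVert V_{i,j,k} - d(z_{i,j,k}) \big\rVert_1, \\
  % squared size of the latent projections, as described in the text
  \mathcal{L}_{reg}   &= \sum_i \sum_j \sum_{k \in \{0, 1\}} \lVert z_{i,j,k} \rVert_2^2, \\
  % assumption: lambda trades the two pretext tasks off against each other
  \mathcal{L}_{ssl}   &= \lambda\, \mathcal{L}_{con} + (1 - \lambda)\, \mathcal{L}_{recon}
                         + \lambda_{reg}\, \mathcal{L}_{reg}.
\end{align}
```

Under this weighting, $\lambda = 0.5$ gives the two pretext tasks equal influence, while $\lambda_{reg} = 1 \times 10^{-6}$ keeps the regulariser a small correction.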

Distinguishing normal from anomalous samples
Although we have described a method for learning representations of normal data, the model is incapable of accurately distinguishing between normal and anomalous samples. Several options exist for anomaly detection when utilising the learnt representations of normal training data. The simplest involves measuring the distance between a given sample and the normal training data (Bergman et al. 2020) using a K-nearest-neighbour (KNN) lookup. This assumes that larger distances correspond to more anomalous samples. However, as we already made use of some of the labelled data for the supervised classifier, we find it beneficial to fine-tune a shallow MLP on top of SSL representations to perform anomaly detection. As the SSL backbone learns representations on the patch level and ROAD dataset labels are on the spectrogram level, we first need to concatenate the latent representations of each patch to return to the correct dimensionality before training the MLP. Notably, we propagate the gradients during the fine-tuning step through both the MLP and the backbone network, $f$, such that the distance between normal and anomalous representations at the spectrogram level is consolidated. We show, in Section 5, that fine-tuning dramatically outperforms random initialisation and KNN-based anomaly detection. Furthermore, we find that using the fine-tuning approach dramatically improves the time complexity of the system. Additionally, we need to determine how to threshold the anomaly scores produced by either the fine-tuned models or the KNN-distance-based approach. Here, we utilise the threshold from the area under the precision-recall curve (AUPRC) that results in the maximum F-β score. A discussion on the evaluation metrics used can be found in Section 5, and the results pertaining to a change of this threshold can be found in Figure 9.
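The KNN baseline mentioned above can be sketched in a few lines. This is an illustrative example only: the Euclidean distance, the choice of k, the two-dimensional toy latents, and the function name are assumptions, and it represents the distance-based baseline rather than the fine-tuned detector.

```python
import math


def knn_anomaly_score(z, normal_latents, k=2):
    """Mean Euclidean distance from latent vector z to its k nearest
    neighbours among the latent vectors of the normal training data."""
    distances = sorted(math.dist(z, t) for t in normal_latents)
    return sum(distances[:k]) / k


# Toy latent vectors of normal training samples.
normal_latents = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]

near = knn_anomaly_score((0.5, 0.5), normal_latents)  # lies among the normal samples
far = knn_anomaly_score((5.0, 5.0), normal_latents)   # lies far from all of them
print(near < far)  # prints "True"
```

Each query has to be compared against the stored normal representations, so the cost grows with the size of the training set, whereas a fine-tuned MLP needs only a single forward pass.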

Combining classification with anomaly detection
The final consideration when constructing ROAD is how to effectively combine the fully supervised classifier $y_{\mathrm{sup}} \in [0, N]$ and the fine-tuned anomaly detector $y_{\mathrm{ssl}} \in [0, 1]$. Simply put, we consider normal predictions from the detector more likely to be correct, and if there is a disagreement between the two models then we flag the sample as an unknown class of anomalies that the classifier may not have seen. The overall method is shown in Figure 3 and is summarised by
$$
y = \begin{cases}
0, & \text{if } y_{\mathrm{ssl}} = 0 \\
y_{\mathrm{sup}}, & \text{if } y_{\mathrm{ssl}} = 1 \text{ and } y_{\mathrm{sup}} \neq 0 \\
N + 1, & \text{if } y_{\mathrm{ssl}} = 1 \text{ and } y_{\mathrm{sup}} = 0
\end{cases}
\tag{6}
$$
We validate this approach in Section 5 by showing that it is optimal when assuming that normality is better defined by the SSL output.
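The decision rule of Equation 6 can be sketched in a few lines. The sketch assumes that label 0 denotes the normal class, that a detector output of 1 marks a sample as anomalous, and that a normal prediction from the detector always takes precedence, as described in the text. The function and argument names are hypothetical.

```python
def combine_predictions(y_ssl, y_sup, n_classes):
    """Combine detector output y_ssl (0 or 1) with classifier label y_sup (0..N).

    Returns 0 for normal, a known anomaly label in 1..N, or N + 1 for an
    anomaly that the classifier does not recognise.
    """
    if y_ssl == 0:
        return 0              # the detector defines normality
    if y_sup != 0:
        return y_sup          # both models agree that the sample is anomalous
    return n_classes + 1      # detector flags it, classifier calls it normal


# With N = 9 known anomaly classes, the unknown class receives label 10.
print(combine_predictions(1, 0, n_classes=9))
```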

Experiments
We evaluate the performance of ROAD using the dataset described in Section 3. The evaluation considers both the computational and model performance using both the binary anomaly detection as well as the multi-class classification results. In all cases, we use the F-β score to evaluate the model performance. The F-β score is the weighted harmonic mean of precision and recall. In the context of this work, precision is the anomaly detection performance that is sensitive to the number of false positives, and recall is the detection performance relative to the number of false negatives. Moreover, in the context of telescope operations it is necessary to minimise the number of false negatives. In other words, it is more acceptable to classify some normal samples as anomalous than to classify anomalous samples as normal. Following this logic and work by Kerrigan et al. (2019), we consider β = 2 to be the most appropriate as it weighs recall more heavily than precision. For all evaluations we use the threshold on the precision-recall curve, from which the area under the precision-recall curve (AUPRC) is computed, that maximises the F-2 score.
We do not quantify the benefits of ROAD with regard to imaging. The purpose of ROAD is to provide an efficient preview of large interferometric data products to telescope operators, thereby informing scientists of how best to post-process the data in the presence of instrumental and environmental anomalies. It is anticipated that leveraging the outputs of the model would facilitate the elimination of samples containing anomalies, thus enhancing the overall image fidelity. However, for the sake of brevity and focus of this paper, we leave the actual quantification of the improvements to the imaging to future work.
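To make the metric and the threshold selection concrete, the sketch below computes the standard F-β score, F_β = (1 + β²)·P·R / (β²·P + R), and scans the candidate thresholds of a set of anomaly scores for the one that maximises it. The function names and the convention that a score at or above the threshold counts as anomalous are assumptions for illustration.

```python
def f_beta(precision, recall, beta=2.0):
    """Weighted harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


def best_threshold(scores, labels, beta=2.0):
    """Return (threshold, F-beta) for the threshold with the highest F-beta.

    `labels` uses 1 for anomalous and 0 for normal samples.
    """
    best = (None, -1.0)
    for threshold in sorted(set(scores)):
        predicted = [s >= threshold for s in scores]
        tp = sum(p and l == 1 for p, l in zip(predicted, labels))
        fp = sum(p and l == 0 for p, l in zip(predicted, labels))
        fn = sum((not p) and l == 1 for p, l in zip(predicted, labels))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        score = f_beta(precision, recall, beta)
        if score > best[1]:
            best = (threshold, score)
    return best
```

With β = 2, a detector with precision 0.5 and recall 1.0 scores about 0.83, whereas one with precision 1.0 and recall 0.5 scores about 0.56, which reflects the preference for avoiding false negatives.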

Model parameters and training
To validate our approach, we experimented with several modern machine-learning architectures of various model sizes. In all cases, we used the same backbone architecture for both the supervised classifier and the SSL models; furthermore, we utilised the same two-layer MLP for position classification. Additionally, the decoder used for the SSL-reconstruction loss is a five-layer architecture with strided de-convolution and batch normalisation.
For every experiment, each model is trained three times while randomising input seeds on each run. As already mentioned in Section 3.3, the low occurrence rates of some anomalous features mean we need to sub-sample the anomalous classes in the test data to ensure comparable occurrences relative to normal LOFAR telescope operations. This means we run ten separate evaluation loops for the sub-sampled test data. The results shown in this section reflect the mean and standard deviations from 30 runs of each model (three training runs, each evaluated ten times). The SSL and the supervised models are trained for 100 epochs, while fine-tuning using the two-layer MLP is done for only 20 epochs to prevent over-fitting. We use a batch size, patch size, and latent dimensionality of 64 across all experiments, utilising the Adam optimiser with a learning rate of $1 \times 10^{-3}$ to maintain consistency. In all cases, we use the official PyTorch-based implementations of the various backbones, with the exception of ViT, for which we utilise an open-source implementation. The code, experiments, and model weights are available online.
Furthermore, to ensure no vanishing or exploding gradients while training, we clip each autocorrelation to the 1st and 99th percentiles and take its natural log. Additionally, we normalise each magnitude-based autocorrelation between 0 and 1.
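A minimal sketch of this pre-processing for a single autocorrelation is given below. It assumes strictly positive magnitudes supplied as a flat list and applies the steps in the order clip, logarithm, min-max normalisation. The function name and the exact percentile interpolation are assumptions for illustration.

```python
import math
import statistics


def preprocess(magnitudes):
    """Clip to the 1st/99th percentiles, take the natural log, scale to [0, 1]."""
    # With n=100 the 99 cut points are the 1st to 99th percentiles.
    cuts = statistics.quantiles(magnitudes, n=100, method="inclusive")
    low, high = cuts[0], cuts[-1]
    clipped = [min(max(value, low), high) for value in magnitudes]
    logged = [math.log(value) for value in clipped]
    lo, hi = min(logged), max(logged)
    if hi == lo:
        # A constant input carries no contrast; map it to zeros.
        return [0.0] * len(logged)
    return [(value - lo) / (hi - lo) for value in logged]
```

Clipping before the logarithm keeps a few extreme samples, for example strong interference, from dominating the dynamic range of the normalised spectrogram.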

Anomaly detection and classification
To maximise the model performance relative to the problem specification shown in Equation 1, we find the best mean performance of several different backbones. These are different sizes of ResNet (He et al. 2016), ConvNeXt (Liu et al. 2022), and ViT (Dosovitskiy et al. 2021). Notably, our method is agnostic to the backbone and could easily be extended to include other architectures or model sizes. In Table 2, we present the per-class results after applying the combination of the supervised classifier and the fine-tuned anomaly detector specified by Equation 6. Furthermore, we plot the mean performance of each model in Figure 4 to facilitate comparison. We note that all evaluated anomaly detection models utilised fine-tuning to ensure they had been exposed to the same amount of data. Additionally, ROAD-KNN utilises a KNN lookup to determine the distances in the latent space rather than using the MLP prediction.
We find that the ResNet34 exhibits the overall best average performance on the classification task, giving an average increase in F-2 score of 1% relative to the purely supervised model. We note that the performance of ROAD is directly dependent on the supervised performance. We show that the SSL pre-training is highly influential to the overall model performance as it gives a < 5% increase over the randomly initialised (random init) model without pre-training. Furthermore, we find that our SSL-based approach outperforms the variational autoencoder-based model with fine-tuning (VAE) by < 5%, as well as being < 3% better than KNN-based anomaly detectors (ROAD-KNN). Finally, we show that using pre-trained weights from ImageNet classification with fine-tuning (ImageNet) results in a 2% decrease in performance relative to our SSL pre-training paradigm. Across all experiments, it is clear that the high-noise element and oscillating tile classes have the highest standard deviation. We attribute this to the small number of examples present in both the testing and training set after adjusting for occurrence rates. In addition to this, the features represented in these classes can vary significantly from sample to sample and band to band.
To simulate a real-world setting where many unknown anomalies can be present in a given observation, we remove several classes from the training set and test the models' performance on the original test set. We refer to these removed classes as out-of-distribution (OOD). The objective of this experiment is to see how well the model will react to OOD anomalies and whether it can correctly classify them as anomalous. To effectively simulate this scenario, we randomly remove between one and seven classes and do this ten times while training a model for each removal step. Figure 5 shows the average model performance from the ten runs for both the supervised classifier as well as the fine-tuned SSL anomaly detector when removing a number of classes from the training set. Here, it is clear that the supervised model suffers much more from the OOD effects than the SSL-pre-trained one, exhibiting a performance drop of between 5% and 18%, thereby illustrating the benefit of using ROAD when both a classifier and detector are in the loop.
We illustrate the t-distributed stochastic neighbour embedding (t-SNE) projections of the latent dimensions from each model in Figure 7 to gain an intuition about the model performance. The same random seed and perplexity parameters are used for all plots shown; here, the perplexity estimates the number of neighbours each point should have (for more information, see Wattenberg et al. (2016)). In the leftmost plot the non-fine-tuned SSL model is shown; we can see that both normal and anomalous classes are grouped closely together, with the exception of clusters pertaining to 'first-order data loss', 'ionospheric RFI reflections', and 'solar storms'. Furthermore, we find the normal data are distributed across two clusters, these being LBA and HBA features. It is interesting that even with no explicit training signals the SSL model without fine-tuning is still capable of distinguishing a variety of classes and phenomena. The middle plot shows the effects of fine-tuning on the SSL representations. The fine-tuned SSL model is significantly better at distinguishing normal from anomalous samples, with the LBA/HBA separation in the normal samples completely disappearing. Furthermore, the clusters corresponding to features that were once well separated, such as 'solar storm', are now better grouped with the anomalous samples. Finally, in the rightmost plot we can see the learnt supervised representations of the test data. Here, it is clear that the supervised model is the most capable of separating both anomalous and normal classes alike. It must be noted, however, that the classes relating to 'Galactic plane', 'source in the sidelobes', and 'normal' are overlapping. Therefore, by combining the boundary related to the SSL fine-tuned embedding with the specificity of the supervised model, we are able to better detect anomalies.
An interesting consequence of the class imbalance and the small number of samples of certain events such as 'oscillating tile' is that ROAD benefits from fewer backbone parameters and does not scale with model size, as it over-fits to the training data. This is illustrated in Figure 6, where it is also shown that ResNets offer the best performance. This being said, we expect that with more samples from the infrequent classes the model performance should scale proportionally with its number of parameters. This is further validated by Figure 8, where we plot the model performance relative to the amount of training data. Here, it is clear that the model performance scales linearly with training data size. Furthermore, the fine-tuned model outperforms its purely supervised counterpart for all training set sizes.
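The class-removal protocol could be set up along the following lines. The sketch assumes one interpretation of the procedure: for each number of removed classes from one to seven, ten random subsets of the anomaly classes are drawn, and a model would be trained on the remaining classes for each draw. The class identifiers follow the panel titles of the dataset figure, and the function name and seed handling are hypothetical.

```python
import random

# Identifiers are illustrative; the names follow the dataset figure panels.
ANOMALY_CLASSES = [
    "first_order_data_loss", "oscillating_amplifier", "high_noise_element",
    "lightning_storm", "solar_storm", "galactic_plane",
    "source_in_sidelobes", "ionospheric_rfi_reflection",
    "second_order_data_loss",
]


def sample_ood_splits(classes, max_removed=7, n_repeats=10, seed=0):
    """Map each removal count to a list of (removed, kept) class splits."""
    rng = random.Random(seed)
    splits = {}
    for n_removed in range(1, max_removed + 1):
        splits[n_removed] = []
        for _ in range(n_repeats):
            removed = rng.sample(classes, n_removed)   # OOD classes
            kept = [c for c in classes if c not in removed]
            splits[n_removed].append((removed, kept))
    return splits
```

Each `kept` list defines the anomaly classes available during training, while the test set would still contain every class so that the removed ones act as unseen anomalies.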

Model ablations
To validate the correctness of the SSL-model training objective, we perform several ablations. In Table 3, we show the effect of only using the reconstruction term, $\mathcal{L}_{\mathrm{recon}}$, or only the context prediction term, $\mathcal{L}_{\mathrm{con}}$, or using the combined loss $\mathcal{L}_{\mathrm{recon}} + \mathcal{L}_{\mathrm{con}}$. We show that the combination of the two terms improves both the anomaly detection and the average classification performances by 2%, which at the scale of the LOFAR science data processing pipeline results in a significant improvement.
Furthermore, in order to determine the relative contribution of each of the losses to the overall performance of ROAD, we modify the λ hyper-parameter and measure the overall model performance. Figure 10 shows that with 0.3 ≤ λ ≤ 0.7 the SSL anomaly detection obtains optimal performance.
In addition to the loss-function-based ablations, we also consider the effect of changing the combination function used between the supervised and SSL models shown in Equation 6. These results are shown in Figure 9, where we vary both the anomaly detection threshold set by the maximum F-β score as well as the combination function. In the plot, 'combination function #1' uses the definition expressed in Equation 6, where the anomaly detector defines both normality and the unknown anomaly events. We define 'combination function #2' such that $y_{\mathrm{ssl}}$ is only used to define unknown anomalous events.
In the leftmost plot, we can see that combination function #1 consistently offers the best precision level, yet this is at the cost of marginally decreasing the recall (<0.4%). The effect of this is that combination function #1 results in optimal F-2 score performance when β is greater than 1. Furthermore, we evaluate the false positive rate using combination function #1 and find that it results in a false-positive rate of approximately 2%.
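One possible reading of combination function #2 is sketched below. It assumes that the classifier label is always kept, except when the detector flags a sample that the classifier considers normal, in which case the unknown-anomaly label N + 1 is returned. This interpretation, the label convention (0 for normal), and the function name are assumptions for illustration.

```python
def combination_function_2(y_ssl, y_sup, n_classes):
    """Use the detector output only to introduce the unknown-anomaly label."""
    if y_ssl == 1 and y_sup == 0:
        return n_classes + 1   # unknown anomaly seen only by the detector
    return y_sup               # otherwise trust the supervised classifier
```

Under this reading, the two combination functions differ only when the detector predicts normal and the classifier predicts a known anomaly: combination function #1 returns normal, whereas this variant returns the classifier label. This is consistent with #1 favouring precision at a small cost in recall.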

Computation performance analysis
We evaluate the computational performance of ROAD during inference on an Nvidia A10 GPU using CUDA 11.7 and driver release 515.65.01. The KNN-based experiments utilise the GPU-based implementation of FAISS. We use a batch size of 1024, with a patch size and a latent dimensionality of 64. Furthermore, for the case of the KNN search we assume 1000 normal training samples to populate the search space. In all cases we use bfloat16 representations of the input data so as to ensure the tensor cores are fully utilised. With these settings, we performed 1000 forward passes and measured the resulting latency, throughput in spectrograms per second, and peak memory allocation.
The computational performance of the respective models can be seen in Table 4, where it is clear that the supervised model has the lowest computational overhead. We attribute the difference in performance between the supervised and SSL models to the dimensionality of the models' inputs and the required concatenation of the patches on each forward pass. As the SSL operates on the patch level, there are substantially fewer convolution operations that need to be applied (approximately 16), resulting in decreased peak memory performance. ROAD consists of both the supervised and SSL models, and as such the overall performance is given by the addition of the respective values, such that it takes less than 1 ms to predict the normality of a given spectrogram. This is more than 1000x faster than the existing correlator implementations on the IBM Blue Gene/P supercomputer (Romein et al. 2010). Notably, however, the KNN-based model performs significantly worse, suggesting that density-based KNN anomaly detectors are less suitable for real-time applications at observatories.
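The timing part of such a measurement can be sketched as follows. The sketch runs on the CPU with a stand-in `forward` callable, reports the latency per spectrogram as the total time divided by the number of processed spectrograms, and omits peak memory. These choices and the function name are assumptions for illustration. On a GPU, the device would additionally need to be synchronised before each clock reading.

```python
import time


def benchmark(forward, batch, n_passes=1000):
    """Return (latency in ms per spectrogram, throughput in spectrograms/s)."""
    forward(batch)                       # warm-up pass, excluded from timing
    start = time.perf_counter()
    for _ in range(n_passes):
        forward(batch)
    elapsed = time.perf_counter() - start
    n_spectrograms = n_passes * len(batch)
    latency_ms = 1e3 * elapsed / n_spectrograms
    throughput = n_spectrograms / elapsed
    return latency_ms, throughput


def toy_forward(batch):
    """Stand-in for a model forward pass over a batch of 64-value inputs."""
    return [sum(sample) for sample in batch]


latency_ms, throughput = benchmark(toy_forward, [[0.0] * 64] * 8, n_passes=50)
```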

Conclusions and future work
In this paper, we present the first real-time anomaly detector for system-wide anomalies in spectrographic data from radio telescopes. We produced a freely available dataset that contains 7050 autocorrelation-based spectrograms from the LOFAR telescope with labels relating to both commonly occurring anomalies and rare events. This work provides a formulation of anomaly detection in the SHM context of telescope operations and illustrates how purely supervised models are ill-suited to the problem. Furthermore, we propose a new SSL paradigm for learning normal representations of spectrographic data. We combine the SSL and supervised models and demonstrate how this combination remedies the shortcomings of supervised methods. We demonstrated that even with limited examples of anomalous data, our fine-tuned SSL model can significantly outperform its supervised counterpart. ROAD and its dataset are the first major effort to address the system health management problem in radio telescopes, and their potential benefit to all radio observatories is very promising. We expect that, by providing open-source access to both our models and dataset, the continued efforts of the wider community will increase the amount of training data from scarce events, thereby enabling other training paradigms such as contrastive learning with larger models that are currently unsuited to the highly imbalanced problem. Furthermore, we identify several directions for future work in the area of radio observatory anomaly detection, namely using the cross-correlations to enhance training by using radio interferometer-specific losses. Another interesting direction would be to use Bayesian deep learning to give uncertainty estimates from the classifier such that samples with low confidence would rely on the detector output. Finally, we would like to propagate the labels from the down-sampled data to the full-resolution data from the LOFAR Long Term Archive, such that the performance could be better evaluated in the context of the full LOFAR data-processing pipeline.
In future work, we would like to see ROAD tested with data from different radio telescopes. We expect that instruments with roughly the same operating bands and time resolution would be good candidates. In previous work (Mesarcik et al. 2022a), we show that unsupervised machine-learning-based methods for RFI detection are directly transferable between the simulated data from the HERA telescope and real data from LOFAR. One potential problem is that there may be a domain shift between the ROAD dataset and data produced by another instrument. This could be addressed by labelling a few examples of anomalies in other instruments' spectrograms and fine-tuning the ROAD model using the supplied weights and the new small dataset. In this manner, the overhead of extensive labelling would be avoided. However, in principle, ROAD can be applied to any radio telescope provided that a new labelled dataset is produced for the specific instrument. We expect that the anomaly categorisation used for the ROAD dataset is generic enough to be directly transferred to other instruments. However, we note that features such as 'oscillating tile' are LOFAR-specific.
Furthermore, we propose investigating how best to integrate RFI detection and self-supervised anomaly detection for radio telescopes. Foundation models (Bommasani et al. 2021) offer a promising direction. Here, a single self-supervised model could be trained on the normal data and then fine-tuned on both RFI-segmentation and anomaly detection tasks. In this manner, a model would be able to learn both representations of anomalous samples as well as RFI-contaminated data, which may improve model performance, generalisability, and false positive rates. We would thus avoid the problem of potentially classifying RFI as anomalies and vice versa.
(a) First-order data loss (b) Oscillating amplifier (c) High noise element (d) Lightning storm (e) Solar storm with ionospheric effects (f) Galactic plane (g) Source in the antenna side-lobes (h) Low frequency ionospheric RFI reflection (i) Second order data loss (j) Normal

Fig. 2: Illustration of self-supervised training procedure used in ROAD; we used random cropping for augmentation.

Fig. 3: Illustration of inference pipeline of ROAD; we combine both supervised and self-supervised learning to effectively detect radio-observatory-based anomalies.

Fig. 4: Per-class mean F-2 score-based performance of each model shown in Table 2.

Fig. 5: One-class anomaly detection performance for a purely supervised model and the fine-tuned SSL anomaly detector when removing a number of classes from the training set. The ResNet34 backbone is used for both training paradigms.

Fig. 6: One-class anomaly detection performance after fine-tuning of various backbone networks when varying the number of available parameters.

Fig. 8: Binary anomaly detection performance when changing the amount of supervision used to train a ResNet-34 backbone for each training paradigm.

Fig. 9: Mean classification performance of the ResNet-34 backbone after fine-tuning when changing the threshold used for anomaly detection as well as the combination function. Combinations #1 and #2 correspond to Equations 6 and 7, respectively.
A&A proofs: manuscript no. output
Fig. 7: t-SNE projections of test data from the ROAD dataset using the representation from the final layer of the SSL-pre-trained ResNet-34 with and without fine-tuning as well as the supervised classifier.

Table 4: Computational performance of anomaly detectors, where spec/s refers to the number of spectrograms processed per second by the respective algorithm.