Issue 
A&A
Volume 670, February 2023



Article Number  A54  
Number of page(s)  13  
Section  Numerical methods and codes  
DOI  https://doi.org/10.1051/0004-6361/202243928  
Published online  03 February 2023 
ASTROMER
A transformer-based embedding for the representation of light curves^{★}
^{1}
Department of Computer Science, Universidad de Concepcion,
Concepcion
4070386, Chile
email: cridonoso@inf.udec.cl
^{2}
Department of Computer Science, Pontificia Universidad Catolica de Chile, Macul,
Santiago
7820436, Chile
^{3}
Inst. for Applied Computational Science, Harvard University,
Cambridge, MA
02138, USA
^{4}
Millennium Institute of Astrophysics (MAS),
Nuncio Monsenor Sotero Sanz 100,
Providencia, Santiago, Chile
^{5}
Univ. AI,
Singapore
050531, Singapore
Received: 2 May 2022
Accepted: 31 October 2022
Taking inspiration from natural language embeddings, we present ASTROMER, a transformer-based model to create representations of light curves. ASTROMER was pretrained in a self-supervised manner, requiring no human-labeled data. We used millions of R-band light sequences to adjust the ASTROMER weights. The learned representation can be easily adapted to other surveys by retraining ASTROMER on new sources. The power of ASTROMER consists in using the representation to extract light curve embeddings that can enhance the training of other models, such as classifiers or regressors. As an example, we used ASTROMER embeddings to train two neural-based classifiers that use labeled variable stars from MACHO, OGLE-III, and ATLAS. In all experiments, ASTROMER-based classifiers outperformed a baseline recurrent neural network trained on light curves directly when limited labeled data were available. Furthermore, using ASTROMER embeddings decreases the computational resources needed while achieving state-of-the-art results. Finally, we provide a Python library that includes all the functionalities employed in this work.
Key words: methods: statistical / stars: statistics / techniques: photometric
The library, main code, and pretrained weights are available at https://github.com/astromer-science
© The Authors 2023
Open Access article, published by EDP Sciences, under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
This article is published in open access under the Subscribe-to-Open model. Subscribe to A&A to support open access publication.
1 Introduction
Over the past few decades, efforts have been made to develop machine learning tools to analyze and discover variable phenomena in the sky. These tools will face a challenge with the construction of the new generation of telescopes, which will generate substantially more data than the old generation (Kremer et al. 2017). Moreover, the observations will be deeper and more precise than ever.
With the upcoming telescopes such as the Vera C. Rubin Observatory (Ivezić et al. 2019), periodic observations of a significant portion of the sky every few days will become the norm. They will produce light curves for all the objects in the observed fields.
Traditional machine learning methods rely on features to explore the variable behavior of light curves. They are based on quantities such as amplitude, period, and color information, among others (Sánchez-Sáez et al. 2021). The expected volume of data will present a significant challenge to feature-based methods if applied to every measured object.
Deep learning techniques have an advantage over traditional machine learning in that they do not need precalculated features: they extract informative representations of the data automatically (LeCun et al. 2015). Also, deep learning methods leverage GPU parallelization, which accelerates information extraction from the light curves.
In recent years, some of these methods have been developed with promising results. Naul et al. (2018) presented a recurrent autoencoder to extract features from folded periodic light curves. After training, the authors fitted a random forest classifier on the embedded space, outperforming models trained on predefined features. This model was one of the first approaches in astronomy that used unlabeled light curves to extract characteristics from an unsupervised deep learning model. A similar idea was proposed a year later by Tsang & Schultz (2019), who used an autoencoder not only for classification, but also for novelty detection. These approaches remain supervised, which is problematic for training on small datasets (Charnock & Moss 2017).
Self-supervised models are ideal for overcoming the limitations of labeled datasets, as they can generate the target values automatically without human-generated labels (Liu et al. 2021). Most of these methods learn representations by solving auxiliary tasks that do not require any label, such as infilling of time series. The learned representations can be used downstream to solve other tasks, such as the classification of variable stars or the regression of physical parameters, such as temperature or redshift. This learning scheme has achieved impressive results in natural language processing (NLP), with Bidirectional Encoder Representations from Transformers (BERT, Devlin et al. 2018) being one of the most notable examples. BERT-like models learn to extract contextual representations using two auxiliary tasks from a dataset of raw text. In this stage, known as pretraining, the model learns a general representation of the data, without being specific to any particular task. This process is the most resource-intensive part and usually takes weeks to complete, as both the models and the datasets are large. After this stage is complete, the model only needs a few hours of further training to adjust the weights to a more specific domain, in a stage known as fine-tuning. Once finalized, BERT is used as the initial stage of a new model. This scheme enables BERT-based models to achieve state-of-the-art performance.
The main components of BERT are the self-attention layers (Vaswani et al. 2017), which codify similarities between words in the input sentence. These layers can be computed efficiently in parallel, in contrast to the sequential behavior of recurrent neural networks. Recently, some attention-based works have been proposed in astronomy, showing competitive results with state-of-the-art models (Allam Jr & McEwen 2021; Pimentel et al. 2023; Pan et al. 2022).
Following the advances in NLP to treat sequential data, representation learning, and self-supervised training strategies, we present ASTROMER, a self-supervised model for extracting a general representation of astronomical light curves. In this work we show that, by using light curve embeddings to train other models, such as classifiers of variable stars, we can match or decrease the number of epochs needed to get the best evaluation metrics. We empirically demonstrate the benefits of using pretrained representations when training classifiers on small datasets. Using fewer than 100 samples per class, we significantly outperform a long short-term memory (LSTM) network trained on light curves directly. Additionally, we provide a Python library that includes pretrained models and the necessary infrastructure to fine-tune and obtain domain-specific embeddings. Sharing pretrained models within the community helps save computational resources and decreases CO_{2} emissions while improving the performance of automatic learning models. Furthermore, we aim to create different versions of ASTROMER, just as BERT has many variations depending on the target (Polignano et al. 2019; Liu et al. 2019; de Vries et al. 2019; Moradshahi et al. 2019; Masala et al. 2020; Vunikili et al. 2020).
2 Problem statement
Let X ∈ ℝ^{L × d_{x}} be a single-band light curve with L observations over time, where d_{x} is the number of features per observation. In this case, every observation consists of d_{x} = 2 values: the magnitude and the modified Julian date (MJD) at which the observation occurs. Naturally, the number of observations and their times vary between different stars, and strongly depend on the survey science goals. In this representation, the maximum number of points in the light curves remains fixed, even if some of them are shorter than L, in which case we perform zero padding and masking.
The main objective is to train a model f(X, 𝒟_{A}, θ) on a massive set of light curves 𝒟_{A} from a given survey A, with trainable parameters θ. In particular, we propose using learned representations of a transformer-based encoder to create embeddings representing the objects' variability in d_{k}-dimensional space. We can fine-tune the model's weights to adapt to other surveys and use it to solve downstream tasks, such as classification or regression.
3 Methods
In this section, we describe the main components of ASTROMER, which belongs to the family of transformer neural networks proposed by Vaswani et al. (2017). In particular, we focus on two processes: the self-attention block and the positional encoding (PE). Both methods are effective for capturing relationships between observations to encode light curves.
Fig. 1 Self-attention block diagram. Each observation vector x_{l} ∈ X (denoted by solid circles) is projected into three vectors: (k)ey, (q)uery, and (v)alue. Then their similarities are computed by the scaled dot-product between vectors according to Eq. (1). 
3.1 Self-attention block
Vaswani et al. (2017) introduced the self-attention mechanism as an alternative to the classical attention technique based on recurrent neural networks (RNNs). The idea was to quantify relationships between observations without conditioning the operations to follow a sequential order. Thus, unlike RNNs, self-attention blocks can be executed in parallel, being more efficient and faster to train.
An attention block consists of multiple self-attention heads that compute the similarities of every input within the sequence to each other. In other words, every head measures the relationship between pairs of observations; the more an observation affects the other, the more attention the model pays. As shown in Eq. (1), we perform a weighted sum over the input values V_{i} to obtain the attention matrix Z_{i} using learnable query-key compatibilities Q_{i}K_{i}^{T}:
$$Z_{i} = \mathrm{softmax}\left(\frac{Q_{i}K_{i}^{\top}}{\sqrt{d_{k}}}\right)V_{i}. \tag{1}$$
Queries, keys, and values (Q_{i}, K_{i}, V_{i}) are input transformations such that
$$Q_{i} = XW_{i}^{Q}, \quad K_{i} = XW_{i}^{K}, \quad V_{i} = XW_{i}^{V}, \tag{2}$$
with W_{i}^{Q}, W_{i}^{K}, and W_{i}^{V} being the trainable weight matrices of the ith head, and d_{k} a hyperparameter specifying the embedding size of each self-attention head. The normalization factor √d_{k} in the denominator scales the variance of the product, while the softmax bounds the values to represent a multinomial probability distribution.
The final attention vector is the normalized concatenation of {Z_{0}, ..., Z_{i}, ..., Z_{#heads}}, the output from every ith head in the block. Figure 1 illustrates the self-attention block.
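As an illustration of Eqs. (1) and (2), the following NumPy sketch computes a few self-attention heads and concatenates their outputs. It is not the ASTROMER implementation: the toy sizes, the random weight matrices, and the helper names `softmax` and `attention_head` are assumptions made for this example, and the final normalization step mentioned above is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_q, W_k, W_v):
    # Eq. (2): project the input into queries, keys, and values.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    # Eq. (1): scaled dot-product compatibilities, then a weighted sum of values.
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights

rng = np.random.default_rng(0)
L, d_model, d_k, n_heads = 10, 16, 4, 4   # toy sizes, not the paper's
X = rng.normal(size=(L, d_model))         # stand-in for the encoded input

heads = []
for _ in range(n_heads):
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
    Z_i, A_i = attention_head(X, W_q, W_k, W_v)
    heads.append(Z_i)

Z = np.concatenate(heads, axis=-1)        # shape (L, n_heads * d_k)
```

Each row of `A_i` sums to one, so every output vector is a convex combination of the value vectors of the whole sequence.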
3.2 Positional encoding
Light curves are sequences of brightness measurements as a function of time, generally not equally spaced. The cadence of a survey is given by the time interval between observations, which is often not periodic.
Temporal information is essential for characterizing light curves since it brings information about the variability of a particular object. For example, RR Lyrae and Delta Scuti stars have short periods, ranging from hours to days, while the periods of type II Cepheids and long-period variable stars can be months or years.
By construction, self-attention layers do not take into account the temporal information of light curves when learning similarities between the observations. In other words, any reordering of the magnitudes within a sequence would produce the same attention vector.
A traditional way to include temporal information during self-attention learning consists in modifying the input to incorporate the relative positions into the same vector. We create a representation for the time and add that information to the brightness representation. It should be noted that the positions range from 0 to L − 1, where L is the length of the light curve.
In this work, we adjust the classical positional encoder used by Vaswani et al. (2017) to work with the time domain of the light curves^{1}. The positional encoding (PE) projects an embedding that encodes relative positions and it is consistent to any transformation of them, no matter the number of observations or the first day in MJD where the object began to be observed. It consists of trigonometric functions that codify each observational time t_{l} directly, from the lth observation at different w_{j} frequencies:
$$\mathrm{PE}_{j,t_{l}} = \begin{cases} \sin(t_{l}\,w_{j}) & j \text{ even} \\ \cos(t_{l}\,w_{j}) & j \text{ odd} \end{cases} \tag{3}$$
In Eq. (3), j ∈ [0, …, d_{pe} − 1], where d_{pe} is the PE dimensionality and w_{j} is the angular frequency defined as
$$w_{j} = \frac{1}{1000^{2j/d_{pe}}}. \tag{4}$$
Trigonometric functions are a natural choice for capturing periodic behavior while bringing unique values within [−1, 1], which work efficiently on GPUs (Patel 2020). The sine and cosine functions take d_{pe}/2 different angular frequencies, spanning wavelengths from 2π to 2π × 1000^{2}. We make the same remarks as in the original paper regarding the function used. A nontrainable PE is enough to achieve reasonable results, simplifying the network. The value of 1000 gave the best results, rather than the original 10 000. The angular frequencies are indexed by the PE dimension number. Larger frequencies are given by the smaller PE dimensions, while the smaller frequencies are given by the larger PE dimensions. Figure 2 shows the resulting PE given the sequence of times of the MACHO light curve F_1.4175.3433. At high frequencies, the PE shows high variability across all the time steps. In contrast, lower frequencies show low variability.
The PE relates directly to the objects under study. Since the periods of variable stars typically range from 0.1 to 1000 days (Catelan & Smith 2015), the PE should be sensitive to periods in this range. To capture short periods, the minimum wavelength should be 2 × 0.1. In practice, the survey's cadence conditions the shortest period we can study. Moreover, most objects will not show variability for periods larger than 1000 days. As shown in Fig. 2, at the lowest frequencies the PE does not show any change across the time steps.
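The time-based positional encoding can be sketched in a few lines of NumPy. This is a minimal illustration, not the library code: it assumes the angular frequencies w_{j} = 1/1000^{2j/d_{pe}}, a form consistent with the wavelength range quoted above, with sine on even dimensions and cosine on odd ones. The function name `positional_encoding` and the synthetic observation times are hypothetical.

```python
import numpy as np

def positional_encoding(times, d_pe=256, base=1000.0):
    # Assumed angular frequencies w_j = 1 / base**(2j / d_pe), j = 0..d_pe-1.
    j = np.arange(d_pe)
    w = 1.0 / base ** (2.0 * j / d_pe)
    angles = times[:, None] * w[None, :]   # shape (L, d_pe)
    # Sine on even dimensions, cosine on odd dimensions.
    return np.where(j % 2 == 0, np.sin(angles), np.cos(angles))

# Synthetic, irregularly sampled observation times in days (illustrative only).
times = np.sort(np.random.default_rng(1).uniform(0.0, 500.0, size=200))
times = times - times.mean()
pe = positional_encoding(times)            # shape (200, 256)
```

Because the encoding is evaluated at the observation times themselves rather than at integer positions, irregular cadence is handled without resampling the light curve.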
Fig. 2 Resulting positional embedded space from the F_1.4175.3433 MACHO object. 
4 Proposed solution
Here we introduce and describe ASTROMER, a transformer-based model capable of producing embeddings from light curves. Embeddings summarize photometric data according to a learned representation that can be easily adapted to new data by retraining (or fine-tuning) the model's weights. In particular, we take inspiration from BERT (Devlin et al. 2018), a self-supervised trained model that does not need human-based annotations to learn representations of the input data.
In the following sections, we explain the main modules and objectives of the proposed model. Before describing the components of the model, we motivate the architecture by describing the target task in Sect. 4.1. Then, Sect. 4.2 presents a general overview of the architecture, where we introduce the ASTROMER modules. Finally, the last sections detail the inner mechanisms of encoding (Sect. 4.3) and decoding (Sect. 4.4) light curves.
4.1 Target task: Modeling of masked light curves
ASTROMER aims to learn representations that summarize and characterize the light curve's domain. It does so by reconstructing the input light curve from a subsample of observations. The embedding should be as general as possible to help downstream tasks, such as classification or regression. In this sense, the model needs to adjust its weights to create a vector that is informative enough to recover the input light curve.
In order to force the model to learn contextual information, we hide a random subset of magnitudes within the sequence, and reconstruct them using the remaining part of the light curve. Hidden elements are informed to the model by a mask vector m_{i} that is different for every ith light curve sample. The loss function then takes the form
$$\mathcal{L} = \sqrt{\frac{1}{N}\sum_{i=0}^{N-1}\sum_{l=0}^{L-1} m_{il}\,\left(x_{il} - \hat{x}_{il}\right)^{2}}, \tag{5}$$
where N is the number of examples, and L is the length of the input sequences, which is equal for all examples. We note that the loss function, which contains the actual observation x_{il} and the prediction x̂_{il}, is equivalent to the root-mean-square error (RMSE). In Eq. (5), the mask vector m_{i} only permits summing over the masked values' error; nevertheless, the model reconstructs the entire sequence.
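A masked RMSE of this kind can be written compactly. The sketch below is an assumption-laden illustration rather than the authors' code: it normalizes by the number of light curves N, as in the description of Eq. (5), and the function name `masked_rmse` and the toy arrays are invented for the example.

```python
import numpy as np

def masked_rmse(x_true, x_pred, mask):
    # mask is 1 where the magnitude was hidden from the model, 0 elsewhere,
    # so only masked positions contribute to the squared error.
    sq_err = mask * (x_true - x_pred) ** 2
    # Assumed normalization: average over the N light curves in the batch.
    return np.sqrt(sq_err.sum() / x_true.shape[0])

x_true = np.array([[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]])
x_pred = np.array([[1.5, 2.0, 1.0], [0.0, 3.0, 9.0]])
mask = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
loss = masked_rmse(x_true, x_pred, mask)
```

The large error at the last position of the second light curve does not affect the loss because that position is not masked, even though the model still outputs a value there.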
Fig. 3 ASTROMER architecture. 
4.2 Model architecture
ASTROMER has an encoder-decoder architecture, as shown in Fig. 3. Input samples are fixed-length light curves of L = 200 observations with two descriptors each: observation times and magnitudes. The first part of the encoder is tasked to compute the PE, which receives the time values at each step and projects them into vectors of 256 dimensions. The next step consists in adding magnitudes to the PE without interfering with the temporal encoding. As shown in Fig. 3, we train a feed-forward network (FFN) without hidden layers that transforms each magnitude to a vector of size 256. We note that the dimension of the FFN matches the size of the positional embedding. In order to not alter the temporal encoding, the FFN should learn to project brightness information in a way that can be summed with the constant part of the PE, that is to say, after the 100th dimension in Fig. 2^{2}. After adding the projected magnitudes and the PE, the new input is a matrix of dimension 200 × 256.
The core of ASTROMER takes place on the two self-attention blocks. Each block contains four heads with 64 neurons each. The final representation is the normalized concatenation of the heads of the last block. Then, the resulting embedding is a collection of 200 vectors of size 256, describing the attention of every observation to each other. We note that d_{pe} = d_{k} × #heads = 256.
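The shapes involved in building the encoder input can be traced with a short NumPy sketch. It only illustrates the bookkeeping: the weights are random rather than trained, the positional encoding is replaced by a random stand-in, and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d_model = 200, 256

magnitudes = rng.normal(size=(L, 1))             # one magnitude per observation
pe = rng.uniform(-1.0, 1.0, size=(L, d_model))   # stand-in for the positional encoding

# FFN without hidden layers: a single affine map from 1 to 256 dimensions.
W = rng.normal(scale=0.02, size=(1, d_model))
b = np.zeros(d_model)
projected = magnitudes @ W + b                   # shape (200, 256)

# The projected magnitudes and the PE are summed element-wise.
encoder_input = projected + pe                   # shape (200, 256)
```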
In the decoder, we take the representation to build the model's output, which during the pretraining consists of a linear FFN with no hidden layers, which reconstructs the input magnitudes. Although the decoder presented in Fig. 3 is exclusive for pretraining ASTROMER, we can use different decoding layers that focus on other downstream tasks.
Fig. 4 Preliminary input example with 50% masking. ASTROMER aims to predict masked observations (denoted by “1”s) using only the attention associated with zero-tagged elements. 
4.3 Learning representations
ASTROMER uses self-attention blocks to transform light curves into embeddings via learnable parameters. As mentioned in Sect. 4.1, we train ASTROMER to predict a subset of masked observations using the concatenated attention vector at the end of the encoder. The embeddings must capture enough information to reconstruct magnitudes using only the surrounding local context of the hidden observations.
In order to exclude the masked observations from the self-attention blocks, we convert all the dimensions associated with the hidden elements to zero values. Therefore, by modifying Eq. (1), we obtain
$$Z_{i} = \mathrm{softmax}\left(\frac{Q_{i}K_{i}^{\top} - \infty M}{\sqrt{d_{k}}}\right)V_{i}, \tag{6}$$
where M is a binary matrix that is one on masked positions and zero otherwise. The M matrix is dynamically created before the encoder. We note that for the points in the light curve that are hidden (i.e., for which M is 1), the value of the softmax function will be close to zero, making Z_{i} close to zero. In other words, when the observation is masked, its assigned predicted value will be zero, which represents no attention.
In this work, we mask 50% of the total number of observations per light curve. It is important to note that M should be reshaped to match the square matrix Q_{i}K_{i}^{T}. Figure 4 shows an example of a ten-observation input.
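One common way to realize this kind of masking is to add a large negative number to the attention logits of the hidden positions before the softmax. The NumPy sketch below takes that route and should be read as an assumption about the mechanism, not as the ASTROMER code: it masks the key columns, uses -1e9 in place of minus infinity, and the function name `masked_attention` is hypothetical.

```python
import numpy as np

def masked_attention(Q, K, V, hidden, neg=-1e9):
    # hidden[l] = 1 if observation l is masked. Broadcasting over rows gives
    # the square matrix M, so every query ignores the masked keys.
    M = np.broadcast_to(hidden[None, :], (Q.shape[0], K.shape[0]))
    logits = (Q @ K.T + neg * M) / np.sqrt(Q.shape[-1])
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    w = np.exp(logits)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V, w

rng = np.random.default_rng(0)
L, d_k = 10, 4
Q, K, V = (rng.normal(size=(L, d_k)) for _ in range(3))
hidden = np.array([0, 1] * 5)          # ten observations, 50% masked
Z, w = masked_attention(Q, K, V, hidden)
```

After the softmax, the attention weights of the hidden columns are numerically zero and the remaining weights in each row still sum to one.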
During training, the mask M provides the model with information about the observations to be predicted. The model can only focus on the target magnitudes and forgets contextual information about unmasked inputs. To avoid learning biased representations conditioned by the masking information only, Devlin et al. (2018) proposed replacing a subset of masked values with random and real elements. In practice, we turn 20% of the masked values from one to zero, while replacing their associated magnitudes with random values from the same sequence. Similarly, we make another 20% of the masked values visible but this time without changing their magnitudes. The unmasked values are still considered in the loss function as they are part of 50% of the initial masked elements. Figure 5 shows the final composition of the input.
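The composition of the input described above can be sketched as follows. The function `build_masks` and its argument names are hypothetical, and details such as how random replacement magnitudes are drawn are assumptions. The sketch keeps two masks: one for the self-attention and one, holding the initial 50%, for the loss.

```python
import numpy as np

def build_masks(magnitudes, rng, frac_masked=0.5, frac_random=0.2, frac_visible=0.2):
    L = len(magnitudes)
    # Step 1: pick 50% of the positions as prediction targets.
    targets = rng.choice(L, size=int(frac_masked * L), replace=False)
    loss_mask = np.zeros(L)
    loss_mask[targets] = 1.0
    att_mask = loss_mask.copy()        # 1 = hidden from self-attention
    inputs = magnitudes.copy()

    # Step 2: split the targets into random, visible, and still-hidden groups.
    targets = rng.permutation(targets)
    n_rand = int(frac_random * len(targets))
    n_vis = int(frac_visible * len(targets))
    rand_idx = targets[:n_rand]
    vis_idx = targets[n_rand:n_rand + n_vis]

    # 20% become visible but carry a random magnitude from the same sequence.
    inputs[rand_idx] = rng.choice(magnitudes, size=n_rand)
    att_mask[rand_idx] = 0.0
    # Another 20% become visible with their true magnitude.
    att_mask[vis_idx] = 0.0
    return inputs, att_mask, loss_mask

rng = np.random.default_rng(0)
mags = rng.normal(size=200)
inputs, att_mask, loss_mask = build_masks(mags, rng)
```

For a 200-observation window this leaves 100 positions in the loss mask, of which 60 remain hidden from the attention.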
Having defined the mask vector, we initiate the forward pass of the encoder by transforming input times and magnitudes to a 256-dimensional vector that mixes temporal and brightness information. Finally, we pass the combined input sequences via two self-attention blocks that capture similarities between observations and then output the attention-based embeddings.
Fig. 5 Final input composition. Following the example in Fig. 4, 20% of the masked values are replaced by random magnitudes while changing the “1”s in the attention mask vector to “0”s. Similarly, we make another 20% of the masking visible, keeping the actual observations. Dotted-line squares indicate both random and real observations. At this point, we should keep a second mask containing the initial 50% masked values to evaluate the loss function. 
4.4 Decoding embeddings
The decoder is essential to adjust the encoder weights. The output conditions the purpose of the embedded representation. Despite considering only masked values in the loss function, the decoder reconstructs the entire sequence of magnitudes. As we explained in Sect. 4.1, using a mask vector, we give a value of zero to the observations that are not part of the loss.
The embedding is a collection of 200 vectors (the length of the input sequence) of size 256 (model dimensionality). Every vector describes the relationship between a particular observation and the rest of the sequence. In order to recover the original magnitude, we apply an FFN without hidden layers and no activation on all the vectors of size 256 separately. Although we transform every vector separately, the network weights remain the same.
Finally, it is important to mention that the decoder only works during the pretraining and fine-tuning of ASTROMER. After that, we usually use the encoder along with a new decoder specializing in another specific task, such as the classification of variable stars. At this step, the encoder can also be trained together with the new decoder. However, the encoder cannot change drastically, as it would forget the learned representations.
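A linear decoder with shared weights amounts to a single matrix product. The sketch below uses a random stand-in for the encoder output and untrained weights, so only the shapes and the weight sharing are meaningful; the variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
embedding = rng.normal(size=(200, 256))      # stand-in for the encoder output

# One weight vector and bias shared by all 200 positions, no activation.
w = rng.normal(scale=0.02, size=(256, 1))
b = np.zeros(1)
reconstruction = (embedding @ w + b)[:, 0]   # one magnitude per observation

# Applying the same layer to a single position gives the same value.
single = embedding[17] @ w[:, 0] + b[0]
```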
5 Data description
This section describes the data involved in the pretraining and fine-tuning of the ASTROMER model. Every light curve contains magnitudes and observation times in MJD.
ASTROMER takes advantage of a massive set of light curves, which are not necessarily labeled. However, to evaluate a downstream task (i.e., classification), we use catalogs of variable stars with their corresponding labels. This work simulates the science case of having a small subset of labeled samples and employs them to fine-tune the model, even when the class information is unnecessary.
5.1 Unlabeled dataset
R-band light curves of the Massive Compact Halo Object survey (MACHO; Alcock et al. 2000) were collected from the Galactic bulge (fields 1 and 10), and the Large Magellanic Cloud (LMC, fields 101–104). The photometry was taken as it was provided, without any additional processing.
Given the nature of photometric observations, the observed flux of every star will exhibit some kind of variability, such as the intrinsic observational noise. As such, we cannot expect every star to be variable. Even though non-variable and noisy data can be helpful to regularize weights (Bishop 1995), we discard some of the light curves that show white noise behavior (i.e., Kurtosis > 10, Skewness > 1, and Std > 0.1). Removing white noise samples avoids training on a dataset dominated by irrelevant information while keeping variable objects that might contain informative variability. After this filtering process, 1 529 386 light curves are left, with a mean cadence of 2.9 ± 17.3 days, which fits the lower bound of the PE wavelengths mentioned at the end of Sect. 3.2.
Table 1 MACHO catalog distribution.
Fig. 6 Comparison between the labeled and unlabeled pretraining MACHO datasets. The continuous lines represent the labeled MACHO dataset containing variable stars from Alcock et al. (2003). Left: Distribution of light curve magnitudes. Right: Distribution of the differences between observation times. 
5.2 Labeled datasets
We employ 20 894 labeled variable stars from the MACHO survey (Alcock et al. 2003) to fine-tune and evaluate the downstream task. Table 1 shows the class distribution.
Similar to the unlabeled dataset, this catalog contains objects from the LMC. Consequently, it contains light curves quite similar to the pretraining ones. However, labeled objects come from all the survey's fields, while the pretraining dataset is a subset (fields 1, 10, and 101–104). Figure 6 shows the distribution of magnitudes and the observational time difference (∆time) from the labeled and unlabeled (pretraining) datasets. In the left histogram, we can see the pretraining dataset overlaps one of the modes from the labeled dataset distribution. Similarly, distributions of delta times follow the same trend along the x-axis. However, they are slightly different for values of delta time greater than one day, where a greater density from the labeled dataset is observed. The labels were updated to conform to modern classifications. In particular, we group the LPV classes into one category and remove the categories “RR Lyrae and GB blends” and “RRe”. The former are excluded because of their small sample size, and the latter because RRe is not considered a subclass on its own, based on later studies (Catelan 2004).
We also tested ASTROMER on the Optical Gravitational Lensing Experiment (OGLE-III; Udalski 2003) and ATLAS (Heinze et al. 2018) catalogs of variable stars. The OGLE-III data used were selected by Becker et al. (2020), contain 358 288 labeled variable stars, and correspond to I-band light curves with a mean cadence of 3.8 ± 14.6 days. Table 2 shows their class distribution. The OGLE-III catalog is chosen because its observations are not captured in the same filter as MACHO, but close to its wavelength range, as shown in Table 3.
The ATLAS dataset, published by Heinze et al. (2018), contains labeled and unclassified objects, as well as a dubious class, which amounts to roughly 10% actual variable stars and 90% instrumental noise, according to their estimates. From this dataset, 422 630 light curves are used, measured in the orange passband, as seen in Table 4, with a mean cadence of 4.7 ± 19.1 days. The reported classes are grouped to obtain labels similar to the other datasets. In particular, we merge the close eclipsing binaries identified with either a full or a half period into a single class of close binaries (CB), and do the same for detached binaries (DB). We do not use the remaining objects, as their labels are based on Fourier analysis and do not correspond exactly to astrophysical categories.
Table 2 OGLE-III catalog distribution.
Table 3 Surveys and the filters used in this work.
Table 4 ATLAS catalog distribution.
Fig. 7 Windows sampling diagram. The rectangle represents the entire light curve whose length is L_{real}. The dashed line denotes the sampled windows of size 200 observations. Windows are randomly generated along the light curve. However, we constrain the starting point to be smaller than L_{real} − 200. 
5.3 Preprocessing
Neural networks must work with tensors holding equal-length samples. This is not the case for light curves, which differ in the number of measurements. As such, it is necessary to set a maximum number of observations for all sequences, padding with zero values if necessary. Since we do not want to constrain the encoder to learn from long series exclusively, and 99.52% of the unlabeled dataset is longer than 200 (see Appendix B), we set 200 as the maximum light curve length to be fed to the model. If the light curve is longer than 200, we sample temporal windows starting from a random position, at least 200 observations behind the last measurement. Figure 7 shows a diagram that exemplifies the process. On the other hand, if the light curve has fewer than 200 observations, we pad it with zero values after the last point in order to obtain a sequence of length 200. Those filler observations are masked during training to be excluded from both the attention vector and the loss function.
After generating windows, we independently subtract the mean from each sample. This implies that both magnitude and time vectors have zero mean. We scale neither the magnitude nor time windows by their standard deviation. Scaling by time dispersion would lose some of the interpretability for the positional encoder. Standardizing magnitudes may lose amplitude-related information that is important for discriminating between some classes.
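The windowing, centering, and padding steps can be combined into one small function. This is a sketch under stated assumptions, not the library's preprocessing: the function name `sample_window` is hypothetical, and computing the mean before padding (so that filler zeros do not shift it) is a choice made for this example.

```python
import numpy as np

def sample_window(times, mags, rng, max_len=200):
    n = len(times)
    if n > max_len:
        # Random start, constrained to be smaller than n - max_len.
        start = rng.integers(0, n - max_len)
        times = times[start:start + max_len]
        mags = mags[start:start + max_len]
        n = max_len
    # Subtract the window mean from times and magnitudes (no std scaling).
    times = times - times.mean()
    mags = mags - mags.mean()
    # Zero-pad short light curves and flag the filler positions with 1.
    pad = max_len - n
    padding_mask = np.concatenate([np.zeros(n), np.ones(pad)])
    return np.pad(times, (0, pad)), np.pad(mags, (0, pad)), padding_mask

rng = np.random.default_rng(0)
long_t = np.sort(rng.uniform(0.0, 1000.0, size=500))
long_m = rng.normal(15.0, 0.3, size=500)
short_t = np.sort(rng.uniform(0.0, 300.0, size=120))
short_m = rng.normal(15.0, 0.3, size=120)

t1, m1, p1 = sample_window(long_t, long_m, rng)
t2, m2, p2 = sample_window(short_t, short_m, rng)
```

The returned `padding_mask` would be combined with the attention and loss masks so that filler observations are ignored in both.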
6 Training strategy
The following sections describe the training strategy consisting of two steps: pretraining and fine-tuning. In the pretraining, 60% of the unlabeled samples are used for training, 20% for validation, and 20% for testing. Alternatively, we set 100 samples per class of the labeled dataset to create the test set for evaluating the fine-tuning step. The rest of the labeled dataset is divided into 80% and 20% for training and validation, respectively. We recall that the pretraining dataset is different from the labeled one, so the testing samples used to evaluate the fine-tuning are never seen by ASTROMER.
Additionally, we study the scientific case of having small target datasets to fine-tune and perform downstream tasks. Therefore, we do not change our 100 samples per class test set, but only the training and validation subsets. First, we sample 20, 50, 100, and 500 objects per class from each labeled dataset. Then we divide them by 80/20 for training and validation. We aim to see the impact of the number of samples when adjusting pretrained weights on a different survey.
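The per-class splitting scheme can be illustrated as follows. The function `split_labeled`, its arguments, and the synthetic labels are hypothetical; the sketch only shows one way to hold out 100 objects per class for testing and then draw a small training and validation subset from the remainder.

```python
import numpy as np

def split_labeled(labels, rng, n_test=100, n_per_class=50, val_frac=0.2):
    test, train, val = [], [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        test.extend(idx[:n_test])                    # fixed test set per class
        subset = idx[n_test:n_test + n_per_class]    # small labeled subset
        n_val = int(round(val_frac * len(subset)))
        val.extend(subset[:n_val])                   # 20% for validation
        train.extend(subset[n_val:])                 # 80% for training
    return np.array(train), np.array(val), np.array(test)

labels = np.repeat(np.arange(3), 1000)   # three synthetic classes
train, val, test = split_labeled(labels, np.random.default_rng(0), n_per_class=50)
```

Reusing the same seed for every value of `n_per_class` keeps the test indices unchanged, which matches the requirement that the test set is independent of the training subset size.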
Fig. 8 Pretraining learning curves using the MACHO unlabeled dataset. 
6.1 Pretraining
We implemented ASTROMER using TensorFlow 2 (Abadi et al. 2015). Both pretraining and the other experiments can be reproduced by following the official implementation on GitHub^{3}.
The pretraining constitutes the first stage of representation learning. It defines the preliminary data requirements, such as cadence and filter, which condition other target domains. In this case, we use the massive unlabeled dataset from the MACHO survey presented in Sect. 5.1.
ASTROMER weights are initialized using the Xavier uniform initializer (Glorot & Bengio 2010). Training big models such as ASTROMER uses a large amount of hardware resources and computational time. Despite this, once the weights are adjusted, they can be shared, avoiding having to retrain ASTROMER from scratch.
As mentioned in Sect. 5.3, the light curves are split into windows of 200 consecutive observations, increasing the effective number of training samples to 6 201 030. An epoch is completed after training all samples once, in batches of 5000. We used Adam (Kingma & Ba 2014) with a learning rate of 10^{−3}.
The training performance is evaluated on the validation dataset and the results on the test set. Early stopping is used to stop the training after 40 epochs without improving the validation loss. Once the training is finished, the weights corresponding to the lowest RMSE are saved.
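The stopping rule can be expressed independently of any deep learning framework. The following pure-Python sketch is illustrative only: the function `early_stopping_epoch` and the synthetic loss sequence are assumptions, and in practice the weights would be checkpointed at each improvement.

```python
def early_stopping_epoch(val_losses, patience=40):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses."""
    best, best_epoch = float("inf"), -1
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch    # weights would be saved here
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch          # no improvement for `patience` epochs
    return best_epoch, len(val_losses) - 1

# Synthetic validation curve: improves for three epochs, then plateaus.
losses = [1.0, 0.5, 0.4] + [0.45] * 100
result = early_stopping_epoch(losses)
```

With a patience of 40, training on this synthetic curve stops 40 epochs after the best validation loss, and the weights from the best epoch are the ones kept.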
Figure 8 shows the learning curves of the training process associated with the architecture in Fig. 3. We also trained on smaller models of 64 and 128 attention sizes. However, we chose the 256dimensional setting as it reached the minimum validation loss on the unlabeled dataset. The model achieved the lowest RMSE close to epoch 1000^{4}, obtaining an RMSE of 0.148 on the testing samples (dotted line).
6.2 Finetuning
After pretraining on the MACHO unlabeled dataset, the model has learned much of the variability patterns of the light curves. However, it is still necessary to incorporate downstream taskrelated information. In this work, we use catalogs of variable stars described in Sect. 5.2 to finetune the attentionbased embeddings.
The loss function and selfsupervised strategy remain the same (i.e., RMSE and masked selfattention technique). Similarly, the same hyperparameters from the pretraining stage are used.
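As a reminder of what this loss measures, the sketch below computes an RMSE restricted to the masked observations. It assumes a binary mask in which 1 marks positions that contribute to the loss; the exact normalization used in the official implementation may differ, and the function name `masked_rmse` is ours.

```python
import numpy as np

def masked_rmse(y_true, y_pred, mask):
    """RMSE over the positions where mask == 1 (the masked observations)."""
    y_true, y_pred, mask = (np.asarray(a, dtype=float) for a in (y_true, y_pred, mask))
    sq_err = mask * (y_true - y_pred) ** 2
    return float(np.sqrt(sq_err.sum() / mask.sum()))

y_true = np.array([0.0, 1.0, 2.0, 3.0])   # actual magnitudes
y_pred = np.array([0.0, 2.0, 2.0, 1.0])   # reconstructed magnitudes
mask = np.array([0, 1, 0, 1])             # only positions 1 and 3 are scored
loss = masked_rmse(y_true, y_pred, mask)
```

Only the two masked positions enter the average, so the squared errors 1 and 4 give a loss of sqrt(2.5).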
Table 5 shows the number of epochs and total time to finetune ASTROMER on each labeled dataset using all samples, comparing it to the pretraining stage. A significant difference to the pretraining time is evident. The finetuned models converge faster depending on their similarities and the number of examples. For instance, the model finetuned on ATLAS light curves takes more time than the one trained on the MACHO labeled dataset, principally due to the survey differences and the number of samples that allow the model to get a much lower RMSE than the MACHO labeled dataset. At the same time, similar behavior is observed on the OGLE finetuning. However, the minor improvements in OGLE's RMSE suggest the training time is highly dominated by the significant number of samples in the catalog (see Table 2).
On the other hand, Fig. 9 shows the RMSE for the models finetuned on smaller datasets of 20, 50, 100, and 500 labels per class. The testing set contains the same 100 objects per class, independent of the training subsets. We can see a descending trend in the RMSE while increasing the number of samples in the finetuning, which is expected. However, there is not much improvement along the training subsets in the MACHO labeled dataset since it comes from the same survey as the pretraining light curves. Validation learning curves associated with the training process in all datasets can be found in Appendix C.
Table 5 Finetuning results.
Fig. 9 Test RMSE from models finetuned on subsets of labeled datasets. 
Fig. 10 Classifier architectures. Architecture names with the +ATT tag refer to models trained on ASTROMER embeddings. In contrast, the baseline uses only the light curves for training. The number in parentheses is the number of units in the corresponding layer. We note that the 256 in the LSTM corresponds to the state size of the RNN. ReLU and Linear correspond to the activations of each feedforward layer. The dimensionality of the last linear layer depends on the number of classes in each labeled dataset. The total number of parameters at the bottom does not include the ASTROMER size. 
7 Specific task: Classification
ASTROMER is designed as a general embedding extractor that can be finetuned to solve downstream tasks such as classification or regression. In this work, we cover the problem of classifying variable stars from different surveys. As explained in Sect. 6, the model is finetuned in subsets of 20, 50, 100, and 500 samples per class, as well as using the full catalog (see Sect. 5). Finetuned models are then used to transform light curves into attentionbased embeddings. In practice, we use ASTROMER at the beginning of the classifiers as an additional layer (see Fig. 10).
We selected a baseline model from the literature (DonosoOliva et al. 2021) to compare the benefits of using the embeddings as opposed to other deep learning approaches. It consists of two layers of LSTM with a memory cell and hidden state of 256 units each. Following the original architecture, we normalize each recurrent layer's output (Ba et al. 2016) and then apply a dropout (Semeniuta et al. 2016) of 0.3 over them. We do not validate this architecture since it was already tested in the previous work. However, we looked for the best learning rate on the MACHO light curves since we changed conditions, such as the batch size and normalization techniques. As shown in Appendix D, we confirm that a 10^{−3} learning rate achieved the best performance in terms of the validation error. In the results shown in Fig. 10, we employed the same LSTM architecture but training with ASTROMER embeddings.
In addition to the LSTM classifier, we tested a multilayer perceptron (MLP+ATT in Fig. 10) with three feedforward hidden layers of 1024, 512, and 256 neurons, and used a rectified linear activation function (ReLU) in each of them. Since MLPs cannot process time, we use the average along the step dimension (L = 200) of the embedding matrix^{5}. By doing this, we collapse all the encoded observations in a single vector of size 256 that fits the input dimensionality of the first hidden layer of the MLP+ATT classifier (feedforward layer with 1024 units from Fig. 10).
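The averaging step amounts to a mean over the first axis of the embedding matrix. The sketch below uses random numbers as a stand-in for an ASTROMER embedding of shape (L, 256); the variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
embedding = rng.normal(size=(200, 256))   # stand-in for one embedding matrix (L = 200 steps)
mlp_input = embedding.mean(axis=0)        # collapse the step dimension into a single vector

# For a batch of embeddings of shape (B, 200, 256), the step axis is axis 1.
batch = rng.normal(size=(4, 200, 256))
batch_input = batch.mean(axis=1)
```

The resulting vector of size 256 carries no information on the ordering of the observations, which is why the MLP+ATT relies entirely on the encoder for temporal structure.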
For small datasets, freezing the weights of ASTROMER allows us to optimize a classifier without training a huge number of weights. The ASTROMER encoder can also be trained in tandem with the classification network, to better tune weights to solve a specific task, which can be seen as another finetuning step. The following experiments evaluate these two approaches, that is, training and nontraining of the encoder layer of ASTROMER.
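To measure the classification performance, we use the F1 score metric,
$$F_1 = \frac{1}{K} \sum_{i=1}^{K} 2 \cdot \frac{\mathrm{precision}_i \times \mathrm{recall}_i}{\mathrm{precision}_i + \mathrm{recall}_i}, \tag{7}$$
where K is the number of classes, and
$$\mathrm{precision}_i = \frac{TP_i}{TP_i + FP_i}, \qquad \mathrm{recall}_i = \frac{TP_i}{TP_i + FN_i}. \tag{8}$$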
In Eq. (8), TP, FN, and FP are true positive, false negative, and false positive cases, respectively. Intuitively, precision is the fraction of objects assigned to a class that actually belong to it, and recall is the fraction of objects of a class in the testing set that the model identified correctly.
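For concreteness, the metric can be computed as in the sketch below. It assumes that Eq. (7) averages the per-class F1 over the K classes with equal weight (macro averaging) and that classes with no predictions or no members score zero; both conventions, and the function name `macro_f1`, are our assumptions.

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """Average of the per-class F1 scores built from TP, FP, and FN counts."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precision = tp / (tp + fp) if tp + fp > 0 else 0.0
        recall = tp / (tp + fn) if tp + fn > 0 else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom > 0 else 0.0)
    return float(np.mean(scores))

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = macro_f1(y_true, y_pred, n_classes=3)
```

In this toy example the per-class F1 scores are 0.5, 0.8, and 2/3, so every class weighs equally regardless of how many objects it contains.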
Figure 11 shows the scores obtained by the classifiers by (a) freezing the encoder and (b) training the encoder. In addition, a third case (c) is included where all the light curves in the dataset were used to perform finetuning, but still doing the classification on the smaller subsets. As in experiment (b), the third column allows gradients to flow into the ASTROMER encoder while training the classifiers. These three scenarios explore oftenused strategies when implementing pretrained models.
The results show that models trained on ASTROMER embeddings perform better than the baseline in MACHO and OGLEIII across all the experiments. For ATLAS, the difference is smaller but not worse than the baseline. In general, the LSTM trained on attention vectors performs better than the MLP+ATT. However, the MLP+ATT outperforms the baseline and approaches the LSTM performance as the number of samples per class is increased, in all datasets.
It should be noted that in (c), even when finetuning with all the light curves, the score improvement is marginal compared to (b). Similarly, minor improvements can be seen in the classification metrics when optimizing the encoder layer instead of freezing it (i.e., Cols. (b) and (c) in Fig. 11). The MLP+ATT classifier shows the most notable gain when training with more than 100 samples.
We evaluate the improved learning speed when using ASTROMER. Figure 12 shows the validation loss associated with the best classifier among the three folds for each dataset and the training scenarios (i.e., freezing (a) and training (b) the ASTROMER encoder). We evaluate learning speed on the most significant subset, which contains 500 objects per class.
Attention-based models generally take fewer epochs to achieve better results than the baseline. When the encoder stays frozen, the validation curve of the LSTM+ATT is significantly shorter than those of the other methods, while achieving a high F1 score, as shown in Fig. 11. In contrast, when training the encoder, the LSTM+ATT takes almost the same number of epochs as the baseline. However, according to Fig. 11, training the encoder using the LSTM+ATT does not improve the F1 score. The findings are reversed when using the MLP+ATT: in this case, training the encoder decreases the number of epochs while improving the F1 score in larger subsets.
Fig. 11 Testing F1 scores for an LSTM trained on light curves directly (baseline) and models trained on ASTROMER embeddings (LSTM+ATT and MLP+ATT). The gray shading represents the standard deviation of the three cross-validation splits. Each row corresponds to the experiments on each survey, MACHO, OGLE, and ATLAS, respectively. In (a), we finetune ASTROMER and optimize classifiers on smaller subsets of 20, 50, 100, and 500 samples per class. The weights of ASTROMER are kept frozen when classifying. However, in (b), we allow gradients to flow into ASTROMER. The third case, (c), shows the results of finetuning with the entire set of light curves and classifying on smaller subsets, training ASTROMER simultaneously. 
8 Discussion
Our results have shown the benefits of using pretrained models to learn representations of light curves. We successfully adapted and applied an NLP selfsupervised technique (Devlin et al. 2018) to the domain of light curves. It offers a solution for learning representations when labels are in limited quantity. Furthermore, it is an alternative to fully unsupervised models that focus more on clustering than making predictions from the data.
Our representation can be extended to light curves coming from different surveys. Using ASTROMER on new data requires a short training process called finetuning to adjust the representation and minimize the RMSE. A lower RMSE implies a better representation and, subsequently, a better result in downstream tasks.
The size of the finetuning dataset depends on the similarity between the pretraining and finetuning magnitudes and cadence distributions. For illustration, Fig. 11c shows no difference in the F1 scores when all the light curves in the dataset were used to finetune the encoder. Indeed it is more important to bring the classifier more labels than more light curves in the finetuning. As shown in Table 5, minor improvements in the RMSE were obtained when finetuning ASTROMER on MACHO and OGLEIII, both of which are more similar to the pretraining dataset than ATLAS. In this scenario, the finetuning stage can even be omitted without affecting the results.
Regarding the downstream task, we demonstrated the power of using embeddings to train classifiers, instead of directly using light curves to predict classes. We evaluated the classification performances in three commonly used scenarios using limited labeled data. As seen in Fig. 11, we outperform a recurrent neural network classifier trained on light curves (baseline) in all the experiments. The improvements in the F1 score became more significant when fewer than 500 labels per class were provided.
ASTROMER can learn discriminative features from the pretraining and finetuning steps using only a selfsupervised task. As shown in Fig. 11, the LSTM+ATT can predict classes with high F1 scores without the need to train the encoder along the classifier. Furthermore, its performance remains the same if the encoder is trained in tandem, independent of the number of samples per class.
While the LSTM+ATT uses its hidden state to capture long- and short-term dependencies on the attention vectors over time, the MLP+ATT deals with their average, thus losing global context information. However, when training the encoder along with MLP+ATT using more than 100 samples per class, the encoder is able to counteract this effect, as shown in Fig. 11.
In both models, the discriminative features learned by the ASTROMER encoder allow the classifiers to converge faster^{6}. From the best performing scenarios, the LSTM+ATT converges faster than the baseline when freezing the encoder weights, as it only needs to recognize discriminative patterns in the embeddings. On the other hand, the MLP+ATT only needs to mitigate the effect of averaging the attention vectors. This task is simpler than learning a representation from scratch, which is the case of the baseline classifier. Specifically, the baseline has to learn how to extract an informative representation and learn discriminative patterns simultaneously. In the training process, any change in the representation will impact the subsequent layers, increasing the number of epochs to converge. This behavior is related to the internal covariate shift (Ioffe & Szegedy 2015).
Finally, even though this work's main contribution is to help train downstream models on small datasets, we achieved competitive results against DonosoOliva et al. (2021) using OGLEIII unfolded light curves (88.1% vs. 88.0% ours). For the MACHO catalog, we compare ASTROMERbased classifiers against the best (78.1% accuracy) singleband nonperiod informed model from Jamal & Bloom (2020). We achieved similar results, 78.2% and 76.7% classification accuracy for the LSTM+ATT and MLP+ATT, respectively. It is important to mention that all models from the literature use random smaller partitions, while we use a balanced testing set of 100 objects per class, which is more representative.
Fig. 12 Best model validation learning curves for classifiers trained on 500 samples per class on each catalog. The columns show the science cases, i.e., freezing (a) and training (b) ASTROMER when classifying. 
9 Python package
We provide a Python package, ASTROMER, which includes the pretrained weights obtained in Sect. 6.1. More pretrained models will be uploaded in the future, either from the ASTROMER team or the community.
The stable version 0.x.y matches the code of this paper. Variables x and y refer to the minor changes and patches of the current version 0. Major changes will not be related to this work but must be duly evaluated to guarantee at least the same performance. ASTROMER v0 can be easily installed from the Python Package Index (PyPI) repository as follows:
pip install ASTROMER
The principal module ASTROMER.models includes the models associated with the different encoders. So far, we only have the SingleBandEncoder, corresponding to the ASTROMER model trained on the MACHO unlabeled dataset. To import and use ASTROMER, we can type:
from ASTROMER.models import SingleBandEncoder
model = SingleBandEncoder()
where model is a new instance of ASTROMER. To load pretrained weights, we must use the from_pretraining() method from the model object,
model = model.from_pretraining("macho")
The name in parentheses matches the zip file name in the public repository^{7}.
The simplest way to obtain embeddings is via a collection of NumPy arrays, including light curve information in the form:
import numpy as np
data = [
    np.array([[5200, 0.3, 0.2], [5300, 0.5, 0.1], [5400, 0.2, 0.3]]),
    np.array([[4200, 0.3, 0.1], [4300, 0.6, 0.3]]),
]
where the columns (from left to right) of each NumPy array are the times, magnitudes, and standard deviations of the magnitudes. Then, to get the embeddings, we use the encode method:
att_vectors = model.encode(data)
where att_vectors is a list containing the embeddings for each sample in data.
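When the times, magnitudes, and uncertainties are stored as separate sequences, the expected three-column arrays can be assembled with NumPy alone. The helper below is a hypothetical convenience, not part of the ASTROMER package; sorting by time is our assumption about what a well-formed light curve looks like.

```python
import numpy as np

def to_light_curve(times, mags, errs):
    """Stack three equal-length 1D sequences into an (N, 3) array ordered by time."""
    times, mags, errs = (np.asarray(a, dtype=float) for a in (times, mags, errs))
    if not (times.ndim == 1 and times.shape == mags.shape == errs.shape):
        raise ValueError("times, mags and errs must be 1D and of equal length")
    order = np.argsort(times)
    return np.column_stack([times, mags, errs])[order]

# Same values as in the example above; the first curve is given out of order.
data = [
    to_light_curve([5300, 5200, 5400], [0.5, 0.3, 0.2], [0.1, 0.2, 0.3]),
    to_light_curve([4200, 4300], [0.3, 0.6], [0.1, 0.3]),
]
```

The resulting list has the same layout as the collection of arrays passed to the encode method.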
ASTROMER can be easily trained using the fit() method from the SingleBandEncoder instance:
model.fit(training_data, validation_data, epochs=10)
Within the fit method, training_data and validation_data are TensorFlow datasets. We provide a function in the ASTROMER.preprocessing module to format datasets,
from ASTROMER.preprocessing import load_numpy
training_data = load_numpy(data, batch_size=2, msk_frac=.5, rnd_frac=.2, same_frac=.2, max_obs=200)
It should be noted that data is the same collection of NumPy arrays we used in the previous examples. The load_numpy() arguments are:
batch_size: Number of samples to process in one forward pass of the training;
msk_frac: Fraction of the sequence to be masked;
rnd_frac: Fraction of the mask to be replaced with random values (see Sect. 4.3);
same_frac: Fraction of the mask to be replaced with actual values (see Sect. 4.3);
max_obs: Maximum window length (see Sect. 5.3).
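To make the role of the three fractions concrete, the sketch below builds the two masks described in Figs. 4 and 5 for a single window. It is an illustration written in plain NumPy under our own assumptions (rounding, random selection, function name); it is not the code executed inside load_numpy().

```python
import numpy as np

def build_masks(n_obs, msk_frac=0.5, rnd_frac=0.2, same_frac=0.2, seed=0):
    """Return (loss_mask, att_mask, random_idx, same_idx) for one window."""
    rng = np.random.default_rng(seed)
    n_masked = int(round(msk_frac * n_obs))
    masked = rng.choice(n_obs, size=n_masked, replace=False)
    n_rnd = int(round(rnd_frac * n_masked))
    n_same = int(round(same_frac * n_masked))

    loss_mask = np.zeros(n_obs, dtype=int)
    loss_mask[masked] = 1                  # every initially masked position is scored
    att_mask = loss_mask.copy()            # 1 = hidden from the attention
    random_idx = masked[:n_rnd]
    same_idx = masked[n_rnd:n_rnd + n_same]
    att_mask[random_idx] = 0               # visible, magnitude replaced by a random value
    att_mask[same_idx] = 0                 # visible, actual magnitude kept
    return loss_mask, att_mask, random_idx, same_idx

loss_mask, att_mask, random_idx, same_idx = build_masks(200)
```

With the default fractions and a window of 200 observations, 100 positions enter the loss, of which 20 receive random magnitudes, 20 keep their actual values, and 60 remain hidden from the attention.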
For more information, the project repositories^{8} are open and contain data, tutorials, and documentation. Contributions, in the form of new functionalities or pretrained models, are welcome via pull requests.
10 Conclusion
We present ASTROMER, a single-band embedding for light curve representation. It is based on the BERT NLP model, which codifies observations into attention vectors via self-supervised learning, taking advantage of the massive unlabeled volume of data. In this work, we pretrain ASTROMER on millions of R-band light curves from the MACHO survey.
ASTROMER can be finetuned on specific domain datasets to solve downstream tasks, such as classification or regression. Here, we use labeled catalogs from the MACHO, OGLEIII, and ATLAS surveys to evaluate the effect of using embeddings on the classification of variable stars. By training MLP and LSTM classifiers on embeddings, we outperform a baseline LSTM network trained on light curves. We evaluated classification performances of finetuned models using 20, 50, 100, and 500 samples per class, obtaining better F1 scores in all experiments. Additionally, we showed that the embeddingsbased classifiers achieved competitive scores against stateoftheart solutions. In terms of training times, models trained on ASTROMER representations took fewer epochs than the baseline LSTM classifier to reach the lowest validation loss.
Our selfsupervised approach only considers the reconstruction of magnitudes, not including other tasks, such as the next sentence prediction applied in BERT. As a result, the final representation may not satisfy other downstream tasks correctly. For instance, the consequences of losing global context information were evidenced in the difference between the MLP and the LSTM classifiers, both using attention vectors. In order to make embeddings more informative, we will explore adding new tasks during the pretraining and finetuning steps in a future work.
Finally, we provided a Python library including MACHO pretraining weights and finetuned models used in this work. Moreover, our library allows users to pretrain, finetune, and get embeddings on new single-band datasets. We aim to create a collaborative research environment where pretrained ASTROMER weights can be shared, saving computational resources and improving state-of-the-art models.
Acknowledgements
This research was supported by the ANID Millennium Science Initiative ICN12 009, awarded to the Millennium Institute of Astrophysics; FONDECYT Initiation No. 11191130 (G.C.V.); the Patagón supercomputer of Universidad Austral de Chile (FONDEQUIP EQM180042) and CONICYTPFCHA/DoctoradoNacional/201821181990.
Appendix A Summing magnitudes to positional encoding
According to the classical approach (Vaswani et al. 2017; Devlin et al. 2018), magnitudes and times should be added to build the input of self-attention blocks. However, summing magnitudes to the PE information is not straightforward, as we can interfere with the encoded variability of times. In principle, most of the values in the last dimensions of the PE are constant, so we can sum magnitudes on the unused part (see Fig. 2). Instead of imposing the above assumption, we let the model learn the way to sum magnitudes from the data. We use a single FFN, transforming magnitude scalars to 256-dimensional vectors that match the PE dimensionality. After training, we confirmed that most of the projected dimensions are zero, except for one close to the 200th dimension (see Fig. A.1). We note that the resulting transformation is almost a one-hot encoding. Thus, we showed that the magnitudes do not interfere with the PE information when joining both vectors.
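The input construction discussed in this appendix can be sketched as below. The sinusoidal encoding follows the standard form of Vaswani et al. (2017) evaluated at the observation times; the base of 10000, the interleaving of sines and cosines, and the random projection weights `W` and `b` are placeholders we chose for illustration, not the trained ASTROMER parameters.

```python
import numpy as np

def positional_encoding(times, d_model=256, base=10000.0):
    """Sinusoidal encoding of observation times, shape (L, d_model)."""
    times = np.asarray(times, dtype=float)[:, None]      # (L, 1)
    i = np.arange(d_model // 2)[None, :]                  # (1, d_model / 2)
    angles = times / base ** (2 * i / d_model)            # (L, d_model / 2)
    pe = np.empty((times.shape[0], d_model))
    pe[:, 0::2] = np.sin(angles)                          # even dimensions
    pe[:, 1::2] = np.cos(angles)                          # odd dimensions
    return pe

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(1, 256))   # hypothetical weights of the single FFN
b = np.zeros(256)

times = np.array([0.0, 1.5, 4.0])
mags = np.array([0.3, 0.5, 0.2])
# Project each magnitude scalar to 256 dimensions and sum it to the PE.
x = positional_encoding(times) + mags[:, None] @ W + b
```

If the learned `W` were close to a one-hot vector, as reported above, the sum would alter a single dimension of the PE and leave the time information in the remaining dimensions untouched.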
Fig. A.1 Feedforward network weights that project magnitude scalars to 256dimensional vectors. The new magnitude vector is then summed to the PE information of equal size. 
Appendix B Looking for a window size
Our preprocessing pipeline consists of windows that move along the light curve, sampling a subset of observations. Using this technique, we deal with the variable length problem while generating more samples for training. Figure B.1 shows the distribution of the lengths of the light curves. Considering that most samples are longer than 200 observations, we defined 200 as the window size.
Fig. B.1 Distributions of light curves lengths from the MACHO pretraining dataset. 
Appendix C Finetuning validation light curves
Figure C.1 shows the validation learning curves associated with ASTROMER finetuned on different training sets. The figure rows are associated with subsets of 20, 50, 100, and 500 samples per class. We also show the learning curves for the finetuning using all light curves in the respective catalogs in the last row. The different lines within each subplot show the RMSE of the crossvalidation splits.
Fig. C.1 Validation learning curves from all datasets during finetuning. 
Appendix D Looking for the LSTM optimum learning rate
In order to evaluate the effectiveness of ASTROMER embeddings, we selected a stateoftheart light curve classifier from the literature (DonosoOliva et al. 2021). The classifier consists of two layers of LSTM with 256 units each. The idea is to compare how the model performs when changing the input from light curves to light curve embeddings. We kept hyperparameters as they were defined in the previous work. However, since we changed the batch size and normalization technique, we looked for the best learning rate on MACHO light curves. Figure D.1 summarizes the crossvalidated results showing that a rate of 0.001 performs better in terms of minimal validation loss.
Fig. D.1 Validation learning curves from the baseline classifier (LSTM) using different learning rates. 
References
Abadi, M., Agarwal, A., Barham, P., et al. 2015, TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, software available from tensorflow.org
Alcock, C., Allsman, R., Alves, D. R., et al. 1999, PASP, 111, 1539
Alcock, C., Allsman, R., Alves, D. R., et al. 2000, ApJ, 542, 281
Alcock, C., Allsman, R., Alves, D., et al. 2003, VizieR Online Data Catalog: II/247
Allam Jr, T., & McEwen, J. D. 2021, ArXiv e-prints [arXiv:2105.06178]
Ba, J. L., Kiros, J. R., & Hinton, G. E. 2016, ArXiv e-prints [arXiv:1607.06450]
Becker, I., Pichara, K., Catelan, M., et al. 2020, MNRAS, 493, 2981
Bishop, C. M. 1995, Neural Comput., 7, 108
Catelan, M. 2004, in International Astronomical Union Colloquium, 193 (Cambridge University Press), 113
Catelan, M., & Smith, H. A. 2015, Pulsating Stars (John Wiley & Sons)
Charnock, T., & Moss, A. 2017, ApJ, 837, L28
de Vries, W., van Cranenburgh, A., Bisazza, A., et al. 2019, ArXiv e-prints [arXiv:1912.09582]
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. 2018, ArXiv e-prints [arXiv:1810.04805]
Dhar, P. 2020, Nat. Mach. Intell., 2, 423
Donoso-Oliva, C., Cabrera-Vives, G., Protopapas, P., Carrasco-Davis, R., & Estevez, P. 2021, MNRAS, 505, 6069
Glorot, X., & Bengio, Y. 2010, in Proceedings of the thirteenth international conference on artificial intelligence and statistics, JMLR Workshop and Conference Proceedings, 249
Heinze, A., Tonry, J. L., Denneau, L., et al. 2018, AJ, 156, 241
Ioffe, S., & Szegedy, C. 2015, in International conference on machine learning, PMLR, 448
Ivezic, Ž., Kahn, S. M., Tyson, J. A., et al. 2019, ApJ, 873, 111
Jamal, S., & Bloom, J. S. 2020, ApJS, 250, 30
Kingma, D. P., & Ba, J. 2014, ArXiv e-prints [arXiv:1412.6980]
Kremer, J., Stensbo-Smidt, K., Gieseke, F., Pedersen, K. S., & Igel, C. 2017, IEEE Intell. Syst., 32, 16
LeCun, Y., Bengio, Y., & Hinton, G. 2015, Nature, 521, 436
Liu, Y., Ott, M., Goyal, N., et al. 2019, ArXiv e-prints [arXiv:1907.11692]
Liu, X., Zhang, F., Hou, Z., et al. 2021, IEEE Trans. Knowl. Data Eng., 1
Masala, M., Ruseti, S., & Dascalu, M. 2020, in Proceedings of the 28th International Conference on Computational Linguistics, 6626
Moradshahi, M., Palangi, H., Lam, M. S., Smolensky, P., & Gao, J. 2019, ArXiv e-prints [arXiv:1910.12647]
Naul, B., Bloom, J. S., Pérez, F., & van der Walt, S. 2018, Nat. Astron., 2, 151
Pan, J., Ting, Y.-S., & Yu, J. 2022, ArXiv e-prints [arXiv:2207.02787]
Patel, S. 2020, PhD thesis, The Cooper Union for the Advancement of Science and Art
Pimentel, Ó., Estévez, P. A., & Förster, F. 2023, AJ, 165, 18
Polignano, M., Basile, P., De Gemmis, M., Semeraro, G., & Basile, V. 2019, in 6th Italian Conference on Computational Linguistics, CLiC-it 2019, 2481, CEUR, 1
Sánchez-Sáez, P., Reyes, I., Valenzuela, C., et al. 2021, AJ, 161, 141
Semeniuta, S., Severyn, A., & Barth, E. 2016, ArXiv e-prints [arXiv:1603.05118]
Szymanski, M., Udalski, A., Soszyński, I., et al. 2011, Acta Astron., 61, 83
Tensorflow 2022, Positional encoding Transformer model for language understanding
Tonry, J., Denneau, L., Heinze, A., et al. 2018, PASP, 130, 064505
Tsang, B. T.-H., & Schultz, W. C. 2019, ApJ, 877, L14
Udalski, A. 2003, Acta Astron., 53, 291
Udalski, A., Szymanski, M., & Szymanski, G. 2015, Acta Astron., 65, 1
Vaswani, A., Shazeer, N., Parmar, N., et al. 2017, in Advances in Neural Information Processing Systems, 5998
Vunikili, R., Supriya, H., Marica, V. G., & Farri, O. 2020, in IberLEF@ SEPLN, 505
We confirm the assumption in Appendix A, where the model after training learned to project magnitudes on the ~200th dimension.
It is important to note that shorter training times using shared pretrained representations imply the use of less powerful hardware and fewer resources, decreasing the time and energy consumption when training deep learning models (Dhar 2020).