A&A
Volume 586, February 2016
Article Number A36
Number of page(s) 14
Section Numerical methods and codes
DOI https://doi.org/10.1051/0004-6361/201526733
Published online 22 January 2016

© ESO, 2016

1. Introduction

The tremendous development in astronomical instrumentation and automation during the last few decades has given rise to several questions about how to analyse and synthesize the growing amount of data. Recently, various dedicated telescope systems, both on the ground and in space, have been used for shallow, low-resolution, wide-field, multi-epoch imaging surveys that scan the sky in different wavebands with aims ranging from comprehensive stellar variability searches to exoplanet hunting; e.g. PanSTARRS, Kaiser et al. (2002); OGLE, Udalski (2003); SUPERWASP, Pollacco et al. (2006); CoRoT, Baglin et al. (2007); NSVS, Hoffman et al. (2009); Kepler, Borucki et al. (2010). These data have led to many discoveries in several areas of modern astronomy: asteroseismology, exoplanets, and stellar evolution (e.g. Huber et al. 2012; De Medeiros et al. 2013; Walkowicz & Basri 2013; Paz-Chinchón et al. 2015). The next generation of these surveys, such as Gaia (Bailer-Jones et al. 2013) and the VISTA Variables in Vía Láctea survey (VVV; Minniti et al. 2010), are providing a high data flow for a wide range of science applications aimed at understanding the dynamics and stellar variability of the Milky Way galaxy.

The first step in investigating time varying data is the detection of any reliable changes in star brightness (e.g. Welch & Stetson 1993; Stetson 1996; Wozniak 2000; Shin et al. 2009; Ferreira Lopes et al. 2015). This step is crucial to decreasing the running time because it reduces the number of sources on which slower steps, such as period finding and classification, are run. Stochastic variations are mainly associated with very bright sources, where saturation of the detector makes the flux within the aperture bleed into nearby pixels so that the measured magnitude becomes dependent on the sky brightness and seeing, and with very faint sources, where the sky noise dominates, increasing the uncertainty of the measurements and again introducing a dependency on the sky brightness and seeing. Variability indices and their combinations have been used to identify variability patterns and to select non-stochastic variations (e.g. Damerdji et al. 2007; Shin et al. 2009; Ferreira Lopes et al. 2015), but the separation of true variables from noisy data is hindered by wavelength-correlated systematics of instrumental and atmospheric origin, or by possible data reduction anomalies. Detection methods have been optimized for specific variability signals to detect supernovae, microlensing, transits, and other variable sources (e.g. Alard & Lupton 1998; Wozniak 2000; Gössl & Riffeser 2002; Becker et al. 2004; Corwin et al. 2006; Yuan & Akerlof 2008; Renner et al. 2008). An important step towards optimising this process is to review the current inventory of variability indices and determine how efficiently each selects non-stochastic variations in photometric data.

There are additional second order calibration effects, such as the Forbes Effect (e.g. Pakštienė 2003; Süveges 2014), which can produce apparent waves in the light curve due to variable extinction corrections for different coloured stars and may be difficult to distinguish from real quasi-periodic variations. This does not have a detectable effect in the public WFCAM Calibration (WFCAMCAL – Hodgkin et al. 2009; Cross et al. 2009) data used in this work because the extinction corrections are very low at the near-infrared wavelengths of UKIRT-WFCAM (0.9−2.5 μm), the wavelength dependence is also much lower than at ultraviolet-optical wavelengths, and the colours of stars also have a narrower range.

Table 1

Variability indices analysed in the present work.

The second step is to determine the main periods. There are various methods used in astronomy for frequency analysis, including the Deeming method (Deeming 1975), PDM-Jurkevich (Stellingwerf 1978; Dupuy & Hoffman 1985), string length minimization (Lafler & Kinman 1965; Stetson 1996; Clarke 2002), information entropy (Cincotta et al. 1995), the analysis of variance (ANOVA; Schwarzenberg-Czerny 1996), and the Lomb-Scargle method and its extension using error bars (Lomb 1976; Scargle 1982; Zechmeister & Kürster 2009). These methods are based on the fact that the phase diagram of a light curve (LC) is smoothest when it is folded on its real frequency. Assessing the significance of these frequencies is a pertinent problem due to non-Gaussianity, multi-periodicity, non-periodic variations, and the manner in which these should be taken into account (Süveges 2014). From this viewpoint, variability indices are a fundamental part of variability analysis, saving running time and decreasing the number of miscalculations in the frequency analysis. The detection of non-periodic variables and transients, and other aspects regarding the significance of peaks in a periodogram, have not been completely solved yet.

The last point is that the variability classification is intrinsically related to the determination of reliable periods and of a set of parameters that allows us to distinguish all variability types. Automatic classifiers based on machine learning have been applied to several large time-series data sets (e.g. Woźniak et al. 2004; Debosscher et al. 2007; Sarro et al. 2009; Blomme et al. 2010; Richards et al. 2011; Dubath et al. 2011). The inclusion of periodic and non-periodic features, statistics, and more sophisticated model parameters has improved automatic classifiers (e.g. Richards et al. 2011). Misclassification, fuzzy boundaries between variable star classes, mis-labelled training sets, as well as the full processing of terabytes of data are current scientific challenges (Eyer 2006).

The present paper is the first in a series covering different aspects of variable star selection and classification. The first two articles concern the selection of variable stars using variability indices: in this paper we discuss selection using correlated variability indices, while in the second of the series we will discuss non-correlated variability indices; Paper III will cover periodicity search methods, and Paper IV the variable star classifier. In this work, we perform a comprehensive stellar variability analysis on time varying data. In Sect. 2, we describe the data used to compare each index, using a preselected catalogue of known variable stars to test how well each index selects these stars, with the efficiency of the selection measured by how few additional stars are selected at the same cut-off value. In Sect. 3, we present an overview of commonly used correlated variability indices and propose five new variability indices. Next, in Sect. 4, we analyse the limits of correlated variability indices and propose a false alarm probability for variability indices, and in Sect. 5 we describe our data analysis. We present our results and discussion in Sect. 6. Finally, in Sect. 7, we draw our conclusions and discuss some future perspectives.

2. Data

2.1. WFCAMCAL database

WFCAMCAL is a unique programme that is well suited to testing the panchromatic variability indices and our assumptions. This programme contains panchromatic data for 58 different pointings distributed over the full range in right ascension and spread over a range of declinations. These were used to calibrate the UKIDSS surveys (Lawrence et al. 2007). The pointing closest to the zenith was chosen whenever a calibration field was observed. This was typically every hour early on in the UKIDSS observations and later every two hours, and some early nights had many additional observations (up to 40 in a night). During each visit the fields were usually observed with a sequence of filters, either JHK or ZYJHK, within a few minutes. This led to an irregular sampling with fields observed again roughly on a daily basis, although longer time gaps are common and, of course, large seasonal gaps are also present in the data set.

The WFCAMCAL data are archived in the WFCAM Science Archive (WSA; Hambly et al. 2008). The data are processed by the Cambridge Astronomy Survey Unit (CASU; Irwin et al. 2004) and by the Wide Field Astronomy Unit (WFAU) in Edinburgh, which also publishes the data through the WSA. The design of the WSA, the details of the data curation procedures, and the layout of the database are described in detail in Hambly et al. (2008) and Cross et al. (2009). We use data from the WFCAMCAL08B release (observations up to the end of UKIRT semester 08B).

2.2. The WFCAMCAL variable star catalogue

Ferreira Lopes et al. (2015) performed a comprehensive stellar variability analysis of the WFCAMCAL database and presented the photometric data and characteristics of the identified variable stars as the WFCAM Variable Star Catalogue (WVSC1). The authors used standard data-mining methods and introduced new variability indices designed for multiband data with correlated sampling. To summarize, the authors performed a careful analysis via cut-off surfaces to obtain a preselection of 6651 stars based on criteria established by numerical tests of the noise characteristics of the data. Next they combined four frequency analysis methods to search for the real frequencies in the LCs in each waveband and in the chromatic LC, i.e. the sum of all broadband filters. They then obtained a ranked list of the best periods for each method and selected the very best period, which gave the minimum χ2, to cope with aliasing. Finally, the authors visually inspected all the phase diagrams of the 6651 stars and produced a catalogue containing 319 stars, of which 275 are classified as periodic variable stars and 44 objects as suspected variables or apparently aperiodic variables.

In this paper, we analyse this same sample from Ferreira Lopes et al. (2015). First, we selected all sources classified as a star or probable star having at least ten unflagged epochs in any of the five filters. This selection was performed from an initial database of 216,722 stars. Next, we test the efficiency of selection of variable stars with the variability indices presented in Sect. 3.

3. Variability indices

Table 1 summarizes 12 variability indices, of which five are new indices proposed in this work. The present work discusses the efficiency of selection of each one and the best way to select variable stars with the current tool inventory. In the last few years, sets of variability indices have been used, instead of a single one, to improve the selection process (e.g. Shin et al. 2009). Indeed, automatic classifiers also use these parameters to facilitate the classification of variable stars (e.g. Richards et al. 2011). Variability indices are a fundamental tool for improving all steps of time domain analysis.

Currently, the Welch-Stetson indices (e.g. Welch & Stetson 1993; Stetson 1996, i.e. the IWS, JWS, KWS, and LWS indices) are found to be significantly more sensitive than the “traditional” χ2-test for single variance, which uses the magnitude-rms scatter distribution of the data as a predictor (e.g. Pojmanski 2002). The improvements proposed by Stetson (1996) and incorporated in the JWS index allow us to compare wavebands with different numbers of epochs on an equal basis. The author uses the Bessel correction (n/(n − 1)) to reduce the bias related to the sample size, even though the index is the square of the correlation rather than the mean variance. The IWS index was modified by Ferreira Lopes et al. (2015) to quantify panchromatic flux correlations, providing a new set of variability indices (I_pfc^(s)). Moreover, the authors proposed a new set of flux independent variability indices (I_fi^(s)). These were the first variability indices developed to analyse panchromatic surveys.
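To make the correlation idea behind these indices concrete, here is a minimal single-band sketch in the spirit of the Stetson (1996) J index, pairing consecutive observations; the function name and the pairing scheme are our simplifications, not the exact published multi-band definition.

```python
import numpy as np

def stetson_j(mags, errs):
    """Simplified, single-band sketch of a Stetson (1996) J-like index.

    Pairs consecutive observations; the real index pairs observations
    that are close in time, possibly across bands."""
    n = len(mags)
    # Bessel-corrected, error-normalized residuals (Stetson's delta)
    delta = np.sqrt(n / (n - 1.0)) * (mags - np.mean(mags)) / errs
    # product of deltas for consecutive pairs of measurements
    p = delta[:-1] * delta[1:]
    # sign-preserving, outlier-damped average of the pair products
    return float(np.mean(np.sign(p) * np.sqrt(np.abs(p))))
```

Correlated signals give consecutive δ values of the same sign, so the statistic grows large and positive, while for pure noise it averages near zero.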

The statistical period search based on the analysis of variance (ANOVA; Schwarzenberg-Czerny 1996) has been used to select non-stochastic variations. Nevertheless, this method is limited to the identification of periodic variations and requires more running time, since its significance level is determined from phase diagrams for each test frequency. Using variability indices, we can identify non-stochastic variations independently of their nature and thereby reduce the running time. The main goal of this work is to determine the best way to select variable stars without computing the variability periods. In the following subsection, we summarize the I_pfc^(s) and I_fi^(s) variability indices as well as improvements on these indices with a new approach.

3.1. The I_pfc^(s) and I_fi^(s) panchromatic variability indices

The current tool inventory was extended by Ferreira Lopes et al. (2015) with a new set of variability indices to separate LCs that are dominated by correlated variations from those that are noise-dominated. The authors introduced a set of variability indices designed for multi-band data with correlated sampling, which included one index that is highly insensitive to the presence of outliers in the time-series data. First, the authors extended IWS to create the I_pfc^(s) index, defined by Eq. (1), where m is the number of filters, s is the combination order (the number of epochs combined), ns is the total number of correlations, the Λ(s) correction factor is given by Eq. (2), and Γ is given by Eq. (3), where uijs is the measured magnitude at the ith epoch in the jsth combination. These indices allow us to compute correlations among s epochs (for more detail see Appendix A). As shown by Ferreira Lopes et al. (2015), increasing the number of correlated wavebands (s) makes the separation between correlated and uncorrelated variables more evident. Next, the authors proposed a new index, I_fi^(s), built from Eq. (2) as the sum of the discrete values 1 or −1; this index is defined by Eq. (4). Finally, Ferreira Lopes et al. (2015) proposed a general expression, Eq. (5), to determine the probability of a random event leading to a positive index in the case of statistically independent events. The expected value of the index for a random distribution is about 0; meanwhile, the number of sources with negative values increases with s because there is an increase in the number of possible combinations that give a negative correlation.

3.2. Improvements on panchromatic and flux independent indices

The I_pfc^(s) indices (see Eq. (1)) would work equally well if a set of observations were in the same bandpass and we correlated groups of observations taken over a short interval. Therefore, we modify the indices to make them still more robust against different numbers of observations in each group. Similarly, we propose a new panchromatic, flux independent variability index, and additionally combine these indices to create two new indices. To provide an expression that can be used on multi- or single waveband data, we propose the following procedure:

1. First, we compute the values of δi, given by

δi = sqrt(nx/(nx − 1)) × (xi − x̄x)/σx,i, (6)

where nx is the number of epochs in waveband x, xi are the flux measurements, x̄x is the mean flux, and σx,i denotes the flux errors. This parameter is the same as that used by Stetson (1996) to improve the IWS index and, according to that work, measuring correlations using δi allows us to compare data from different wavebands with unequal numbers of observations on an equal basis.

2. Next, the δi values are computed for all measurements in each waveband using the respective values of nx. As a result, we obtain a vector δ with N measurements collected in any waveband. In addition, we save the observation time for each δi value.

3. We determine the time interval (ΔT) within which measurements are considered to be at virtually the same epoch. The chosen ΔT value comes from the arrangement of epochs and thus sets the minimum period (see Sect. 4.3). The total number of boxes (Nbox) is given by Ttot/ΔT, where Ttot is the total time span. The accuracy of the index increases as ΔT decreases.

4. Next, we compute the value of the variability index of order s in the kth box, Eq. (7). This equation performs all possible combinations without repetition among the nk values in the box, so the total number of combinations calculated is

Ns = Σk nk! / (s! (nk − s)!). (8)

5. Now, we can express the flux independent indices with a simple expression, Eq. (9), where the numerator is the number of positive correlations according to Eq. (2). These indices can include measurements obtained in one filter, in contrast to the I_fi^(s) indices, where measurements are obtained in different filters.

6. Next, we can compute the index given by Eq. (10), which reduces to the previous expression in the case where we only have measurements in different filters. These indices can be used to perform comparisons on an equal basis between stars with different numbers of epochs, as well as to use measurements obtained in one filter, in contrast to the I_pfc^(s) and I_fi^(s) indices.

7. An alternative way to compute the characteristic correlation value is to take the median of the correlations, Eq. (11), where Q(s) encloses all the Q(s,k) correlations. The median may provide a more robust value than the mean in the presence of outliers.

8. Finally, we can fold these indices into a correction factor related to instrument properties and outliers. Such a factor can be defined as in Eq. (12), where Ps is the expected value for pure noise. F(s) ranges from 0 to 2 × (1 − 2/s²), so its weight increases with s: the maximum value of F(s) is 1 for s = 2 and ~1.6 for s = 3. F(s) is more efficient than the raw indices because it increases the difference between the values of correlated and uncorrelated data and concentrates the pure-noise values about zero. The factor F(s) is used to define a new set of indices, FL(s) and FM(s) (see Table 1).
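The steps above can be sketched in a few lines of code. This is a hedged illustration only: the function and variable names are ours, and for simplicity boxes are delimited by gaps larger than ΔT rather than by the fixed grid of step 3.

```python
import numpy as np
from itertools import combinations

def flux_independent_index(times, mags, errs, dt, s=2):
    """Illustrative boxed sign-correlation index (names ours).

    Steps: Stetson-like delta values, grouping of measurements closer
    than dt, then the fraction of s-tuples whose deltas share a sign."""
    mags = np.asarray(mags, dtype=float)
    n = len(mags)
    delta = np.sqrt(n / (n - 1.0)) * (mags - np.mean(mags)) / np.asarray(errs)
    order = np.argsort(times)
    t, d = np.asarray(times)[order], delta[order]
    # start a new box wherever the gap to the previous point exceeds dt
    box_id = np.concatenate(([0], np.cumsum(np.diff(t) > dt)))
    n_pos = n_tot = 0
    for b in np.unique(box_id):
        vals = d[box_id == b]
        if len(vals) < s:
            continue
        for combo in combinations(vals, s):
            n_tot += 1
            # a combination counts as positive when all s deltas share a sign
            if all(v > 0 for v in combo) or all(v < 0 for v in combo):
                n_pos += 1
    return n_pos / n_tot if n_tot else float("nan")
```

For a genuine variable, measurements within a box sit at almost the same phase, so most combinations share a sign and the fraction approaches one; for noise it settles at the random value.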

The variability indices proposed above are determined from properties related to the magnitude and signal correlation values. Stellar variability searches based on such indices follow some general assumptions: (i) intrinsic stellar variability is typically identifiable from analysis of correlation measures observed in multiple or single wavebands; (ii) there is a minimum number of correlations required to discriminate between stochastic and non-stochastic variations (see Sect. 4.1); (iii) the interval between any two observations (τ) used to compute the correlations must be short enough that the measurements are effectively at the same phase (see Sect. 4.3); and (iv) non-intrinsic variations are typically stochastic. Measurements affected by systematics of instrumental or atmospheric origin, or by possible data reduction anomalies, may nevertheless display correlated properties and thus decrease the confidence level of the variability indices. Such measurements are mainly related to temporary saturation of bright objects as well as systematic variations in the sky noise for faint stars.

4. Detection limits of correlated variability indices

The number of measurements, and how many of those measurements are close, i.e. within a time span much shorter than the period of any variability, are fundamental pieces of information for setting up good variability indices. The number of measurements determines how stringent the cut-off values must be, while the number of close measurements sets the most appropriate variability index. To determine which measurements are close, we need a ΔT value that is a compromise between the number of correlated measurements and the minimum period that we are searching for. For instance, variability indices computed in boxes of ΔT greater than the period return values close to those expected for noise, while lower values of ΔT lead to higher accuracy for variability indices that use correlation measurements.

Moreover, variability indices can be used in all stages of time-series analysis, as discussed in previous sections. Therefore, we must find new ways to increase the precision and reliability of these indices, and to clarify their connections with the different types of variability. In Sect. 3, we described the current tool inventory and proposed new variability indices with new correction factors to reduce bias. In the present section, we propose ways to increase the precision of these variability indices and to evaluate their reliability.

4.1. Number of correlated measurements

The minimum number of measurements sufficient for the use of a variability index is determined by its capability of separating variable and non-variable stars. Statistical properties like the mean, standard deviation, skewness, and kurtosis do not depend systematically on the number of measurements, although their errors increase greatly as the number of measurements approaches the minimum needed to define each of them. On the other hand, we need contemporaneous (close) measurements to use correlated variability indices. These requirements may change for each variability index.

Consider two cases: one with Ns correlated data points and the other with Ns points of pure noise, for s = 2. In the case of pure noise, the numbers of positive and negative correlations must be the same on average, while for correlated data the positive correlations dominate. Using the flux independent indices, we can determine the minimum number of correlations necessary to separate a purely correlated signal from pure noise, assuming that there is an uncertainty in the sign of some correlations. We assume that this uncertainty (or fluctuation) in the number of positive correlations, given by nf, increases the variability index of pure noise and decreases it otherwise. The minimum separation between correlated and uncorrelated data is therefore given by

(N2 − nf)/N2 > (N2/2 + nf)/N2, (13)

where nf is an integer with values less than N2/2. According to this relation, the minimum number of correlations must be at least five, given a single sign error (nf = 1). The general expression is given by

(14)

where the minimum value of Ns needed to validate this relation is given by

(15)

where nf/Ns is the fractional fluctuation of positively correlated measures (ffluc). This equation analytically explains the increase in precision, and the growing detachment between correlated and non-correlated data, with increasing s (see Fig. 8 of Ferreira Lopes et al. 2015); correlated data thus become more easily separable from pure noise as s increases.
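The s = 2 counting argument can be verified numerically. This toy helper (ours, not from the paper) finds the smallest N2 for which a fully correlated signal with nf sign errors still beats pure noise with nf upward fluctuations.

```python
def min_correlations(n_f=1):
    """Smallest N2 satisfying the s = 2 separation condition from the
    text: a correlated signal yields N2 - n_f positive correlations,
    pure noise yields N2/2 + n_f, so separation requires N2 > 4*n_f."""
    n = 1
    while not (n - n_f) > (n / 2 + n_f):
        n += 1
    return n
```

With nf = 1 this returns five, matching the minimum number of correlations quoted above.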

4.2. False alarm probability on variability indices

The statistical significance of a value is associated with the false alarm probability (FAP), the probability that the observed value was caused by random fluctuations. The smaller the FAP, the larger the statistical significance of the measurement; the tolerance usually adopted is about 1%. The determination of the FAP in period searches is hindered by non-Gaussian distributions, observations scattered irregularly over a long time span, the unclear meaning of the number of independent frequencies, and the manner in which these should be taken into account (e.g. Terebizh 1998; Andronov 1994; Süveges 2014). Moreover, strong peaks in the periodogram can add spurious periods to the real periodicities, since multiple periodic variations can interfere with each other and with the periodic patterns. A more detailed discussion of how to account for this will be given in a forthcoming paper.

Additionally, significance values may depend on the function employed in the periodicity search; in some cases this motivates the use of different techniques for different types of variable stars (Templeton 2004). For instance, periodicity search methods based on Fourier series are less sensitive to non-sinusoidal and aperiodic signals, while methods based on the analysis of variance and multi-harmonic periodograms allow us to assess more complex signals more easily (e.g. Wehlau & Leung 1964; Andronov 1994, 2003; Schwarzenberg-Czerny 1996; Baluev 2009, 2013). A reliable period determination is critical to establishing the nature of the variability: from this analysis we may determine whether there is multi- or single periodicity, the complexity of the signal, and the weight of each contribution to the light curve. This process is facilitated if we can first determine whether the time series contains reliable variability or not.

We can assess the significance of the variability indices by testing the null hypothesis H0, that the observed time series is pure noise, against the alternative H1, that there is a correlated signal in it. One way to evaluate statistical fluctuations of variability indices is to generate a large number of test time series by shuffling the observation times, i.e. bootstrapping. Following this approach, we keep part of the correlated nature of the noise intrinsic to the data, as opposed to numerical tests based on pure Gaussian noise (Ferreira Lopes et al. 2015). However, many iterations are needed to provide accurate FAP values, i.e. we need to compute the variability indices n times, implying a longer running time.
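A sketch of the time-shuffling bootstrap might look as follows; bootstrap_fap and its arguments are illustrative names, and any correlation-based index function of (times, mags, errs) can be plugged in.

```python
import numpy as np

def bootstrap_fap(times, mags, errs, index_fn, n_iter=1000, seed=0):
    """Time-shuffling bootstrap sketch: the FAP of the observed index is
    the fraction of shuffled (time-decorrelated) light curves whose
    index is at least as large as the observed one."""
    rng = np.random.default_rng(seed)
    observed = index_fn(times, mags, errs)
    hits = 0
    for _ in range(n_iter):
        shuffled = rng.permutation(times)  # break only the time ordering
        if index_fn(shuffled, mags, errs) >= observed:
            hits += 1
    return hits / n_iter
```

Because only the times are permuted, the magnitude distribution (and hence part of the non-Gaussian noise character) is preserved, unlike tests against synthetic Gaussian noise.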

On the other hand, an analytical expression for the FAP of a variability index may depend on the deviation from the mean, which varies with the survey analysed, since it depends on detector efficiency, number of measurements, magnitude, etc. However, the flux independent indices have only a weak dependence on the properties of the survey, in which case Eq. (15) provides a value above which correlated sources may be distinguished from noise. The only term to be determined is the fractional fluctuation in positively correlated measurements (ffluc = nf/Ns). From our results, we propose the following empirical equation:

(16)

i.e. a constant (α > 0) plus a term related to the number of correlations. The second term in this equation decays quickly to zero as Ns increases, and it provides high cut-off values for data with few epochs. We require α > 0 since ffluc must be greater than 0 for any number of correlations. Large values of α give a more complete selection, while smaller values result in a more reliable sample. The lower limit was established in Sect. 4.1; for instance, β = 4α² is the lower limit for s = 2. Figure 1 shows the number of sources selected using Eq. (16) for s = 2 (black lines) and s = 3 (grey lines) as a function of α. Solid lines indicate the number of WVSC1 stars recovered, while dashed lines show the efficiency of the selection (the ratio of the total number of sources to the number of known variables). The total number of sources is normalized by the total number of WVSC1 stars (319) to give an efficiency metric (Etot). Table 2 shows the total number of sources selected (Etot) and the fraction of WVSC1 stars (EWVSC1) for some α values. For instance, to select about 90% of WVSC1 stars we need a subsample of 3.77 × 319 stars using α = 0.30 for s = 2; to select about 92% of WVSC1 stars, we need a subsample of 3.71 × 319 stars using α = 0.48 for s = 3.
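The two selection metrics can be computed directly from source lists; this small helper (names ours) mirrors the definitions of Etot and EWVSC1 used here.

```python
def selection_efficiency(selected_ids, known_variable_ids):
    """Etot: size of the selection normalized by the number of known
    variables; EWVSC1: fraction of the known variables recovered."""
    selected = set(selected_ids)
    known = set(known_variable_ids)
    e_tot = len(selected) / len(known)
    e_wvsc1 = len(selected & known) / len(known)
    return e_tot, e_wvsc1
```

A good cut-off keeps EWVSC1 near one while driving Etot as low as possible.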

Fig. 1

Efficiency metric (Etot: the ratio of total number of sources in the selection to good known variables in WVSC1) using Eq. (16) for s = 2 (black lines) and for s = 3 (grey lines). Solid lines denote EWVSC1, the fraction of the good variables in the selection, while the dashed lines denote Etot. A good choice of α returns a high fraction of good variables EWVSC1 for a low value of the efficiency metric Etot.


Table 2

Efficiency metric for some α values.

4.3. Estimate of ΔT and correlated observations

The variation of the variability indices with ΔT depends on many factors, such as the variability period (P), the shape of the light curve, the signal-to-noise ratio, and outliers. To estimate the influence of ΔT, we simulate a pure sinusoidal variation with period P. Next, we compute the indices and examine how varying ΔT as a fraction of P affects how well we can separate a sinusoidal signal from random noise.

Figure 2 shows the indices as a function of ΔT/P. The s = 2 index decreases quickly to the value expected for random variations, while the s = 3 index remains higher than the expected noise value for all ΔT/P. This result helps us to understand and choose ΔT values. For instance, if ΔT < 0.1P, we obtain large index values, clearly separated from noise, and thus detect variability more easily.

According to our simulations (see Fig. 2), the interval between two measurements of a sinusoidal variation must be much less than ΔT/P = 0.5 to provide a large number of positive correlation values and allow variable stars to be separated from noise; preferably, ΔT/P ≤ 0.1. An interval of ΔT/P = 0.5 corresponds to the Nyquist frequency, so higher frequency signals cannot be properly reconstructed because of aliasing, but it is imperative to sample at a much higher rate than the Nyquist frequency to obtain the well-correlated signals necessary for discrimination from noise. The Running Parabola method (Andronov 1997) determines an optimal time interval to describe light curves: their Fig. 4 is very similar to our Fig. 2, because both methods do better when sampling a greater range of phases of periodic or semi-periodic signals.
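A toy version of this experiment, assuming nothing beyond a pure sinusoid and random pair lags (all names ours), reproduces the qualitative behaviour: same-sign pair fractions near one for ΔT ≪ P, falling towards the random value of 0.5 once the lag approaches or exceeds the period.

```python
import numpy as np

def sign_correlation_vs_dt(period=1.0, dt_over_p=0.1, n_pairs=500, seed=1):
    """Fraction of same-sign pairs of a pure sinusoid sampled twice
    with a random lag up to dt = dt_over_p * period (toy Fig. 2)."""
    rng = np.random.default_rng(seed)
    t1 = rng.uniform(0.0, 100.0 * period, n_pairs)
    t2 = t1 + rng.uniform(0.0, dt_over_p * period, n_pairs)
    y1 = np.sin(2 * np.pi * t1 / period)
    y2 = np.sin(2 * np.pi * t2 / period)
    return float(np.mean(y1 * y2 > 0))
```

For a lag x, a sign flip occurs with probability ~2x/P, so ΔT/P = 0.05 gives a same-sign fraction near 0.95, while ΔT/P ≫ 1 averages out to ~0.5.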

Fig. 2

Variability index versus ΔT/P for s = 2 (black lines) and for s = 3 (grey lines). The dashed lines indicate the expected values for random variations.


Fig. 3

Histogram of the intervals between observations for the WFCAMCAL08B data; τ is the interval between any two observations regardless of filter, binned logarithmically. The peaks at ~10⁻³ days are intervals within a ZYJHK sequence, and the peaks at ~1 day and multiples of 1 day are repeat observations on subsequent nights.


We determine ΔT predominantly from the cadence of the data. For the WFCAMCAL data, a sequence of three to five filters was observed over a period of ~0.005 day, and each pointing was then reobserved roughly one day later, with longer intervals because of weather or seasonal limits on the observations (see Sect. 2.1). This is shown in Fig. 3, the histogram of times between subsequent observations. There is a strong peak at τ ~ 10⁻³ day, which corresponds to a ZYJHK sequence of duration ~0.005 day, and a second peak at ~1 day. There are also a variety of smaller peaks at other durations, often a few days apart (bad weather, non-photometric nights), a few small peaks at long durations of tens or hundreds of days (field not observed because it was too close to the Sun), and some on timescales of 0.01 to 0.1 days, from nights when many calibration fields were taken for test purposes. The best choice of ΔT should be the minimum duration that encloses the correlated observations, up to the end of the last integration. ΔT = 0.01 day is a sensible choice: it is slightly wider than the typical visit duration, so no observation is missed, and it allows us to look for variables with P ≥ 0.1 day, although the main sampling peak at ~1 day might be expected to limit us to P > 0.5 days from the Nyquist frequency. Since the sampling is not rigidly at one-day intervals, shorter periods are accessible. Since we have correlated sampling at more than 20 times the frequency of the main sampling rate, we do not place additional constraints on the range of well sampled periods.
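One possible way to automate this choice of ΔT from the cadence is sketched below; the gap threshold and the 1.5 safety margin are our assumptions, not the paper's procedure.

```python
import numpy as np

def choose_dt(times, gap_split=0.1):
    """Split consecutive gaps into within-visit and between-visit at
    gap_split (days), then set dt slightly above the longest visit
    duration so that no observation in a filter sequence is missed."""
    t = np.sort(np.asarray(times, dtype=float))
    gaps = np.diff(t)
    breaks = np.where(gaps > gap_split)[0]       # last index of each visit
    starts = np.concatenate(([0], breaks + 1))
    ends = np.concatenate((breaks, [len(t) - 1]))
    durations = t[ends] - t[starts]              # duration of each visit
    return float(durations.max() * 1.5)          # 1.5 = safety margin (ours)
```

For a WFCAMCAL-like cadence (five filters over ~0.005 day, revisits ~1 day apart) this lands close to the ΔT = 0.01 day adopted in the text.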

However, if we have equally spaced data, we are severely restricted. For s = 2 correlations and a spacing of τ, we need ΔT ≥ 2τ and hence P ≥ 20τ. Given that at least two full periods are required for a confident identification of periodic behaviour, at least 40 observations would be required to constrain a very narrow range of periods, and any high-frequency behaviour with f > 1/20τ would be in danger of being aliased, unless we took a more relaxed constraint on the correlation strength between each pair of measurements. Of course, that would reduce the power of the index to discriminate between noise and real variables. These indices do not require periodic signals: all they require is a high degree of correlation between consecutive measurements, so quasi-periodic and semi-regular variables and even transients can be selected, as long as they do not have frequencies higher than 0.1 × the sampling rate.
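The arithmetic in this paragraph can be packaged as a worked example (function name ours):

```python
def min_equally_spaced_obs(tau=1.0):
    """For equally spaced data with spacing tau and s = 2: boxes must
    span two points (dt >= 2*tau), Sect. 4.3 requires dt <= 0.1*P
    (so P >= 20*tau), and two full periods are needed for a confident
    detection, giving a 40*tau span, i.e. about 40 observations."""
    dt = 2 * tau
    p_min = 10 * dt           # from dt <= 0.1 * P
    span = 2 * p_min          # two full periods
    return int(span / tau)    # number of spacings ~ observations
```

The result, 40 observations for the shortest usable period, is independent of τ itself.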

Thus, correlation indices become very inefficient for equally spaced data. Some deep extragalactic surveys have observations designed to maximize depth, so the observations are taken when seeing and sky levels are best. Hence, the observation structure can be pseudo-correlated, but not on fixed timescales. In Paper II of this series we will discuss indices that work better with uncorrelated observations.

A correlated data set may be expected to have at least half of the τ values in a peak, or a small set of peaks (if several filters have slightly different exposure times), at around the correlation frequency, and then the main sampling peaks at τ_samp > 20 τ_cor.
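The qualitative criterion above can be phrased as a quick programmatic check. The function below is our own sketch of that criterion, not a routine from the paper; the threshold choices come directly from the sentence above:

```python
import numpy as np

def is_correlated_cadence(times, tau_cor_max):
    """Heuristic check that a cadence suits correlation indices:
    at least half of the intervals lie below tau_cor_max, and the
    typical sampling interval exceeds 20x the typical correlated
    interval (our sketch of the criterion in the text)."""
    tau = np.diff(np.sort(np.asarray(times, dtype=float)))
    cor = tau[tau <= tau_cor_max]     # intra-burst (correlated) intervals
    samp = tau[tau > tau_cor_max]     # main sampling intervals
    if len(cor) < 0.5 * len(tau) or len(samp) == 0:
        return False
    return bool(np.median(samp) > 20.0 * np.median(cor))
```

A burst cadence (several exposures per night) passes this test, while strictly equally spaced data fail it, matching the discussion of the two regimes above.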

When we consider VISTA surveys, e.g. the VVV (Minniti et al. 2010), the data are observed as pawprints, which are then co-added into tiles (Cross et al. 2012), so there are always repeat observations on a short timescale compared to repeat epochs. These observations are also taken almost contemporaneously, with some tiling patterns jumping between the same jitter on different pawprints before moving on to the next jitter, so the time between pawprints is usually shorter than the integration time of the pawprint. These data are therefore ideal for correlation indices applied to the pawprints.

Gaia (Bailer-Jones et al. 2013) is another mission where correlated indices are extremely valuable. The main astrometric instrument observes stars as they transit across nine strips of detectors with τ ~ 5s. Stars are then re-observed by a second field of view 2h later or on the next revolution 6h later or on longer timescales due to the orbit and precession of the spacecraft.

Fig. 4

Histograms of the number of correlations Ns for s = 2 and s = 3 using bins of width 1.


5. Data analysis

5.1. Broad selection and bias

From Sects. 4.1–4.3, we can determine the main constraints on the variability analysis. We assume that there is at least one incorrect correlation measurement (nf = 1) in each LC; therefore, we limit our analyses to sources with more than four correlation measurements, according to Eq. (14). This is the minimum number of correlation measurements adopted in our analysis. Moreover, we adopted ΔT = 0.01 days based on the duration of the ZYJHK sequences. Under these constraints, we consider all LCs that can possibly discriminate a correlated signal from noise with periods greater than 0.1 days (see Sect. 4.3). We revisited the WFCAMCAL data instead of testing the variability indices with simple sinusoidal light curves, since this gives a more realistic test, with correlated observations, real noise values, and a range of variable types.

Next, we compute the Kfi, Lpfc, , FL(s), and FM(s) variability indices using a multi-waveband approach, as discussed in Sect. 3.2, on the data described in Sect. 2.1. Figure 4 shows the histogram of the number of correlations Ns, ranging from 1 to 1467 for s = 2 and from 1 to 863 for s = 3. These numbers differ because the ZYJHK measurements are obtained within a few minutes of each other, but not necessarily in all filters. Additionally, the number of correlation measurements decreases quickly for very faint objects around the detection threshold. The total baseline varies from a few months up to three years, and the cadence in a single passband can be considered quasi-stochastic, with rather irregular gaps (see Hodgkin et al. 2009; Cross et al. 2009; Ferreira Lopes et al. 2015, for a fuller discussion).
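One plausible way to count the correlation measurements Ns can be sketched as follows; the boxing scheme in `count_correlations` is our reading of the ΔT grouping described earlier, applied to a toy cadence rather than the WFCAMCAL data:

```python
import numpy as np
from math import comb

def count_correlations(times, delta_t, s):
    """Count correlation measurements of order s, assuming that
    observations closer than delta_t in time form one box and each
    box of n >= s points contributes C(n, s) correlations (our
    reading of the Delta-T grouping, not the authors' exact code)."""
    t = np.sort(np.asarray(times, dtype=float))
    # split wherever the gap to the next observation exceeds delta_t
    boxes = np.split(t, np.where(np.diff(t) > delta_t)[0] + 1)
    return sum(comb(len(b), s) for b in boxes)

# Toy cadence: 30 nightly ZYJHK-like bursts of 5 exposures each
times = np.concatenate([n + np.arange(5) * 0.001 for n in range(30)])
n2 = count_correlations(times, 0.01, 2)   # C(5, 2) = 10 per burst
n3 = count_correlations(times, 0.01, 3)   # C(5, 3) = 10 per burst
```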

5.2. Searching for periodic variations

To search for the best period, a Lomb-Scargle periodogram (Lomb 1976; Scargle 1982) was computed for each LC. We set the low-frequency limit (f0) for each periodogram to f0 = 2/Ttot days⁻¹, where Ttot is the total time spanned by the LC. The high-frequency limit was fixed to days⁻¹ and the periodogram size was scaled to 10⁵ elements. Initially, we compute the Lomb-Scargle periodogram independently for each broadband filter (Y, Z, J, H, and K) as well as for the chromatic light curve (as described in Ferreira Lopes et al. 2015). The use of all broadband filters allows us to find variability periods in cases where the photometry is of high quality in some filters but not in others; e.g. in some filters the object may be saturated (sometimes leading to a non-detection at the correct location), very faint and close to the detection limit, or even too faint to be detected.
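The periodogram step can be illustrated with a direct numpy transcription of the classic Lomb-Scargle formula. The frequency grid mirrors the text (f0 = 2/Ttot up to a fixed upper limit), though with an assumed upper limit of 5 days⁻¹ and far fewer elements than the 10⁵ used above, and the light curve is synthetic:

```python
import numpy as np

def lomb_scargle(t, y, freqs):
    """Classic Lomb-Scargle periodogram (Lomb 1976; Scargle 1982),
    written out with numpy; a sketch, not the authors' implementation."""
    y = y - np.mean(y)
    power = np.empty(len(freqs))
    for i, f in enumerate(freqs):
        w = 2.0 * np.pi * f
        # time offset tau that makes the sine and cosine terms orthogonal
        tau = np.arctan2(np.sum(np.sin(2.0 * w * t)),
                         np.sum(np.cos(2.0 * w * t))) / (2.0 * w)
        c = np.cos(w * (t - tau))
        s = np.sin(w * (t - tau))
        power[i] = 0.5 * ((y @ c) ** 2 / (c @ c) + (y @ s) ** 2 / (s @ s))
    return power

# Synthetic LC: a 2.5 d sinusoid sampled irregularly over 100 d
rng = np.random.default_rng(1)
t = np.sort(rng.uniform(0.0, 100.0, 300))
y = np.sin(2.0 * np.pi * t / 2.5) + 0.1 * rng.standard_normal(300)
f0 = 2.0 / (t.max() - t.min())            # low-frequency limit, as in the text
freqs = np.linspace(f0, 5.0, 5000)
best_period = 1.0 / freqs[np.argmax(lomb_scargle(t, y, freqs))]
```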

For each broadband filter, as well as for the chromatic light curve, we retain the ten periods corresponding to the highest peaks. These periods were then refined following De Medeiros et al. (2013), namely by maximizing the ratio of the variability amplitude to the minimum dispersion in the phase diagram given by Dworetsky (1983). The period giving the minimum dispersion is taken as the refined period, and the full width at half maximum (FWHM) of the corresponding peak in the periodogram was adopted as its uncertainty. Finally, we use the χ² test described in Ferreira Lopes et al. (2015) to select the best period.
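The refinement via Dworetsky's (1983) phase-dispersion statistic can be sketched as a string-length minimization over a fine grid around the trial period. This is a minimal illustration on synthetic data, not the authors' exact pipeline:

```python
import numpy as np

def string_length(t, y, period):
    """Dworetsky (1983) string-length statistic: fold on `period`, sort
    by phase, and sum the segment lengths joining consecutive points in
    the (phase, scaled magnitude) plane; smaller means less dispersion."""
    phase = (t / period) % 1.0
    order = np.argsort(phase)
    p = phase[order]
    # scale magnitudes to [0, 0.5] so both axes contribute comparably
    m = 0.5 * (y[order] - y.min()) / (y.max() - y.min())
    return np.sum(np.hypot(np.diff(p), np.diff(m)))

# Refine a trial period on a fine local grid around a periodogram peak
t = np.sort(np.random.default_rng(2).uniform(0.0, 30.0, 200))
y = np.sin(2.0 * np.pi * t / 1.7)          # true period 1.7 d
grid = np.linspace(1.6, 1.8, 401)
refined = grid[np.argmin([string_length(t, y, p) for p in grid])]
```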

Fig. 5

Correlation variability indices versus K magnitude (left diagram) and versus number of correlations (right diagram) in each panel. The maximum number of sources per pixel is shown in brackets in each panel. The black circles denote the WVSC1 sources, and the solid and dashed lines indicate the values that enclose 90% and 80% of them, respectively.


6. Results and discussions

We analyse the efficiency of the variability indices for selecting variable stars in the WFCAMCAL database, evaluating the response of each index as a function of magnitude and of the number of correlations. This study allows us to draw important conclusions about the most efficient way to select variable stars. The most efficient index is the one that encloses the majority of the WVSC1 stars together with the fewest stars that do not belong to the WVSC1 catalogue, most of which are misclassifications. We computed the correlation variability indices using the panchromatic approach described in Sect. 3.2. Below we present our results, using stars from the WVSC1 catalogue as a comparison.

6.1. Efficiency of variability indices

Figure 5 shows the distribution of Kfi, Lpfc, , FL(s), and FM(s) variability indices as a function of magnitude and the number of correlations Ns, for s = 2 and s = 3. The solid and dashed lines are set to the values that must be adopted if we want to select 90% and 80% of the final WVSC1 catalogue, respectively. Figure 6 shows the histogram of these indices as a function of magnitude for stars selected using the cut-off value that includes 90% of the WVSC1 catalogue.

Fig. 6

Normalized histograms of the sources selected with a constant cut-off value for the (black line), FL(s) (red line), and FM(s) (blue line) indices. The Etot value for each index is shown in parentheses. The upper panel shows the histograms for s = 2, the lower panel those for s = 3. The cut-off values were determined by requiring that 90% of WVSC1 stars are included (see Table 3).


  • presents a clear separation of WVSC1 stars from the other stars for K ≤ 17 mag. The lines that appear for K ≥ 17 mag are due to the high number of sources with just a few epochs (typically Ns < 20): this index produces discrete values, which are more evident for small numbers of epochs. The right panel shows a higher dispersion for low Ns, as expected. Statistical fluctuations may produce high contamination in this region, even though the number of correlations is above the minimum that allows us to discriminate, according to Eq. (14). The index shows similar behaviour, although it displays a greater separation of WVSC1 stars from the other stars.

  • is equivalent to two previous indices under some constraints: it is equal to JWS for s = 2 and equivalent to when the correlations are obtained in different filters. JWS has been used in the selection criteria of several surveys (e.g. Christiansen et al. 2008; McCommas et al. 2009; Morales-Calderón et al. 2009; Bhatti et al. 2010; Shappee & Stanek 2011; Pasternacki et al. 2011). provides a higher selection efficiency than the previous indices, and it combines measurements within ΔT intervals rather than across wavelengths. The indices present a clear separation of WVSC1 stars from other stars. If we adopt a constant value of as a selection criterion, most of the non-variable stars selected are faint. The separation between WVSC1 and other stars is clearer for s = 3 than for s = 2: indeed, the number of sources that must be preselected to enclose 90% of WVSC1 stars is about 30% smaller for s = 3 than for s = 2. Two features expected with increasing s are observed: an increase in the number of stars with negative index values, and a better discrimination between variable and non-variable stars.

  • calculates the median of the correlation values, in contrast to the mean encapsulated by . Both mean and median are used to estimate the central, or typical, value of a statistical distribution, but the median gives less weight to outliers. Outliers in photometric data are commonly associated with variations in brighter stars caused by detector non-linearity and saturation; on the other hand, an increasing dominance of correlated noise is expected for faint stars, as the errors become dominated by sky noise rather than photon statistics. Figure 6 shows an increase in the fraction of stars selected at both the bright and faint ends for the and indices, implying an increase in the misclassification rate. The misclassification rate is higher for faint stars when using than when using , and vice versa for bright stars. Therefore, stars that match both criteria should have a lower misclassification rate at both the bright and faint ends, so agreement between these indices can be used as a selection criterion. Larger s values give less weight to outliers and partially correlated noise, leading to better estimates of the centre of the distribution for both the mean and the median. Therefore, the efficiency of and increases, and they tend towards similar Etot values for higher s, especially if sources with small numbers of correlations are removed, as seen in Table 3 for s = 3 and Ns > 20. Indeed, the are the best indices for selecting variable stars when we consider only higher s values and only sources with Ns > 20 (see Table 3).

  • FL(s) and FM(s) provide the best efficiency values (Etot) among the indices computed from correlation magnitudes (see Table 3). The F(s) factor concentrates the non-variables around zero and reduces the spread of bright sources. We observe a reduction by a factor of more than three in the number of sources preselected by the and indices for s = 2, and of about 20% for s = 3 (see Fig. 6). The large reduction is not found for s = 3 because these indices become more accurate with increasing s, which decreases the weight of F(s). On the other hand, only a slight decrease in Etot is observed when we use KWS/0.789 as the correction factor (see Table 3). This factor is used to build the LWS Stetson index (Stetson 1996), which can be expressed by .
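The mean-versus-median point in the discussion above can be demonstrated with a toy set of correlation values; the numbers below are invented solely to show the effect of outliers on the two central-value estimators:

```python
import numpy as np

# Hypothetical correlation values for a non-variable star: noise-like
# values near zero, plus a few strong outliers such as those produced
# by detector non-linearity and saturation in bright stars.
rng = np.random.default_rng(3)
lam = 0.05 * rng.standard_normal(200)
lam[:5] = 3.0                      # inject 5 saturation-like outliers

mean_estimate = np.mean(lam)       # pulled well above zero by the outliers
median_estimate = np.median(lam)   # robust, stays near zero
```

The mean-based estimator reports spurious correlation for this non-variable star, while the median-based one does not, which is exactly the trade-off between the two indices discussed above.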

Fig. 7

LCs and phase diagrams for the C1 catalogue. The identifiers and periods are shown above each panel.


Fig. 8

Examples of LCs showing instrumental bias. The identifiers are shown above each panel.


Table 3

Efficiency metric (Etot) for variability indices analysed to select 90% and 80% of WVSC1 stars for N(s)> 4 and N(s)> 20, respectively.

In summary, the correlation variability indices discriminate between uncorrelated and correlated data, the latter being a typical signature of variable stars. We can enclose almost all WVSC1 stars in a sample of fewer than about 1500 sources. These indices still show a low discrimination efficiency at the faint and bright ends, and the root cause in each case is likely different: excess high values for bright sources are due to temporary saturation, while for faint stars few epochs and increased measurement uncertainties make the variability indices more sensitive to statistical fluctuations and systematics.

Figure 6 shows the histograms of the sources selected with a constant cut-off value for the highest ranked indices. These indices return most of the WVSC1 stars with the fewest non-variable stars (see Table 3). However, they show a clear selection bias with magnitude, since the selected sources are not evenly distributed over all K values. For s = 2, (black line) and FL(s) (red line) display a prominent over-selection of faint stars, while FM(s) (blue line) is biased at both the bright and faint ends. On the other hand, the three indices have similar biases for s = 3. This result indicates that the efficiencies of these indices may converge for higher s values, since the difference in Etot between them is smaller for s = 3. Indeed, more than 60% of stars with K > 17.5 have Ns < 20. This low detection efficiency of the instrument in this regime sets a maximum magnitude limit to which these indices can sensibly be used.

On the other hand, if we use the analytical expression for ffluc,s (see Sect. 4.2), we obtain a higher efficiency. This function allows us to analyse stars with lower Ns values (above the minimum number of correlations, Ns > 4, of course) with an efficiency similar to that obtained for FL(s) and FM(s) under the stricter selection Ns > 20. Using ffluc,s, we can enclose a greater number of WVSC1 stars with fewer contaminating sources (see Table 2). ffluc,s is an empirical relation and may be adapted according to the purpose at hand.

6.2. Searching for variable stars

We use the α values (see Sect. 4.2) to select at least 90% of the WVSC1 stars. We therefore adopt α = 0.30 for s = 2 and α = 0.46 for s = 3, which return a combined sample of 1133 variable star candidates that were not included in the WVSC1 catalogue. The periods and their uncertainties were computed as in Sect. 5.2, and each star was inspected visually. The period uncertainties are of the order of 10%. According to our analysis, these stars can be divided into five main groups: (a) variable stars measured in few epochs; (b) variable stars with low signal-to-noise and low-confidence periods; (c) variable stars with amplitudes close to the noise level; (d) aperiodic variables, or variables with periods so long that these data were insufficient to derive them; and (e) false variables due to instrumental or reduction problems.

Our procedure resulted in a catalogue of four new sources (C1). Figure 7 shows the C1 stars, which may belong to groups (a)-(c), with variability index values near those expected from noise. Figure 8 shows some instrumental variations that can give false positives for variable stars, including false variables due to instrumental saturation (left and middle panels) and data reduction problems (right panel). Separating stars of groups (a)-(c) from groups (d)-(e) is not possible using variability indices alone; discriminating between these sources using statistical analyses will be discussed in a forthcoming paper. Table 4 lists the coordinates, periods, mean magnitudes, and number of epochs in each filter for this sample. WVSC-336 has a period estimate of 192 days, but its true period may be longer, since we do not observe repeatable phases close to the minimum. This star was previously identified as a young stellar object in the Serpens molecular cloud (Oliveira et al. 2009). The set of stars identified in the WVSC1 catalogue in the dark nebula (Ferreira Lopes et al. 2015), as well as WVSC-336, have a peculiar variability signature. These are almost certainly young stellar objects (YSOs) and will be investigated in forthcoming works.

6.3. Two-dimensional view of correlated data

Table 4

Periodic objects in the WFCAM Variable Star Catalogue (C1).

Fig. 9

FL(s), FM(s), and the logarithm of FL(s) × Ns versus the variability indices for stars selected by the ffluc expression, using α = 0.30 for s = 2 (left panels) and α = 0.46 for s = 3 (right panels); see Sect. 6.2. The WVSC1 stars are indicated with open black circles and the colours are set by the K magnitude. Objects shown with crosses do not have a K-band magnitude because they are either too faint or saturated; these objects have measurements in other bands.


The ffluc threshold defined for the variability indices, using the expression in Eq. (16), provides the best efficiency for selecting WVSC1 stars (see Tables 2 and 3). However, it returns many false positives when there are few correlations. Combining the indices based on correlation signals () with those based on correlation values (FL(s) and FM(s)) provides a two-dimensional view of the correlated data.

Figure 9 shows FL(s), FM(s), and the logarithm of FL(s) × Ns as a function of (the KFLs diagram) for s = 2 (left panels) and s = 3 (right panels). The KFLs diagrams allow us to discriminate between two main groups: (G1) faint stars, about 90% of which have due to a small number of correlations; and (G2) stars with K ≤ 16.5, which includes 91% of WVSC1 stars. These groups display a clear separation if FL(s) is multiplied by the number of correlations Ns: G1 has log FL(s) × Ns < 1.5, while G2 is delimited by log FL(s) × Ns > 1.5. However, this last diagram is biased by Ns, which makes comparisons between different sources difficult.
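The G1/G2 split quoted above reduces to a one-line rule; the helper below is a hypothetical implementation using the log FL(s) × Ns = 1.5 boundary from the text:

```python
import numpy as np

def classify_kfls(fl, ns, threshold=1.5):
    """Separate sources into the two groups seen in the KFLs diagrams
    using log10(FL(s) * Ns): 'G1' below the threshold (mostly faint
    stars with few correlations), 'G2' above it. The 1.5 boundary is
    the one quoted in the text; fl and ns are array-like."""
    score = np.log10(np.asarray(fl, dtype=float) * np.asarray(ns, dtype=float))
    return np.where(score > threshold, "G2", "G1")
```

For example, a source with FL(s) = 0.1 and Ns = 10 (score 0) falls in G1, while one with FL(s) = 10 and Ns = 100 (score 3) falls in G2.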

Stars included in G2 with low values of are mainly bright, saturated stars showing a low level of variability, plus false positives due to instrumental bias. The boundary between WVSC1 and other stars is not well defined in this diagram. Nevertheless, the KFLs diagram helps us enclose about 90% of WVSC1 stars with Etot ~ 2 in G2.

7. Conclusions

From our results we conclude that the analysis of databases with fewer than four correlated measurements is not possible using and related indices (using the factor F(s)) when we allow for one wrong value. In such cases, we may use the or indices to discriminate correlated from uncorrelated data. On the other hand, when we have enough correlation measurements, the variability indices provide unique features for time domain analysis, which allows us to define a general method applicable to any survey with correlated epochs: the has a low sensitivity to outliers and does not vary strongly with magnitude. In addition, the has a clear interpretation, a theoretically defined value expected for noise, and a well-defined range of values from 0 to 1; this range does not depend on the error bars, so a selection can be made that is independent of the number of measurements. It may therefore be used as a universal method to select correlated variations. Moreover, the KFLs diagrams display two complementary variability features, related to the intensity and the number of positive correlated measurements, which allow us to improve Etot by at least 40% (see Sect. 6.3).

The FL(s) and FM(s) values in the KFLs diagrams (see Fig. 9) may vary between surveys. However, is not strongly dependent on instrumental features, and its cut-off values can be adopted as universal, as can ffluc,s. The ffluc,s values are used to discriminate between correlated variations (variable stars) and uncorrelated variations (noise), and the fraction of variables selected is unbiased with respect to magnitude or observed wavelength. After selecting the variable star candidates using ffluc, we can improve the selection with the KFLs diagrams; next, we can remove the G1 stars and use the significance levels of periodicity methods to determine which of the remaining sources display periodic variations.

This work is the first in a series making a detailed analysis of all stages of processing variable photometric data. In this first paper, we investigated which indices give the most efficient selection when the observations are correlated. In the second paper of this series, we will consider uncorrelated observations and determine the best indices for selecting variable stars in that regime. In the coming years, we will apply these methods to very large surveys of the Milky Way, e.g. VVV, PanSTARRS, and Gaia, and in the longer term to LSST. These studies will provide fast and reliable classifications of variable stars within the Milky Way, which will improve our understanding of the evolution of different stellar populations and thus the formation of structures within our Galaxy.


Acknowledgments

C. E. F. L. acknowledges a post-doctoral fellowship from the CNPq. N. J. G. C. acknowledges support from the UK Science and Technology Facilities Council. We would like to thank Lorenzo Rimoldini for his helpful comments which have improved this paper.

References

Appendix A: Examples of panchromatic indices

Consider multi-wavelength data with measurements obtained contemporaneously in the F1, F2, F3, and F4 filters, given by F1 = [a1, a2, ..., an], F2 = [b1, b2, ..., bn], F3 = [c1, c2, ..., cn], and F4 = [d1, d2, ..., dn]. We build a vector with all measurements, U = [u11, u12, u13, u14, ..., un1, un2, un3, un4], where the sub-indices set the ith measurement and the respective filter. Thereafter, we can compute the index of order 2 by expanding Eq. (1): (A.1) and, for s = 3, (A.2) where the s last sub-indices of each Λ function set the filters being combined. Indeed, the procedure proposed in Sect. 3.2 provides an easier and more general way to compute the panchromatic correlation indices.

The is equivalent to a geometric mean; the term in the s-root is always positive because of the modulus function. On the other hand, the and of order 2 are greater than the and of order 3 if all Λ values are positive. This can be proved using the inequality of arithmetic and geometric means. Consider a set of multi-wavelength data obtained at the same time in the F1, F2, and F3 filters, i.e. U = [u11, u12, u13]. We can write (A.3). This equation is equivalent to the inequality of arithmetic and geometric means, since we can make the substitutions , , and . Therefore, the panchromatic variability indices must decrease with increasing s if we assume that the most important contributions to the sum are those in which all Λ values are positive or all are negative.
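The AM-GM step can be checked numerically. This is a generic verification of the inequality in the form used for Eq. (A.3) — the mean of the three pairwise (order-2) products dominates the order-3 product term for positive values — not a computation of the indices themselves:

```python
import numpy as np

# For positive x, y, z, AM-GM applied to the three pairwise products
# xy, xz, yz gives (xy + xz + yz) / 3 >= (xy * xz * yz)^(1/3) = (xyz)^(2/3),
# with equality only when x = y = z.
rng = np.random.default_rng(4)
for _ in range(1000):
    x, y, z = rng.uniform(0.1, 10.0, 3)
    am = (x * y + x * z + y * z) / 3.0
    gm = (x * y * z) ** (2.0 / 3.0)
    assert am >= gm - 1e-9   # holds for every random positive triple
```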

All Tables

Table 1

Variability indices analysed in the present work.

Table 2

Efficiency metric for some α values.


All Figures

Fig. 1

Efficiency metric (Etot: the ratio of total number of sources in the selection to good known variables in WVSC1) using Eq. (16) for s = 2 (black lines) and for s = 3 (grey lines). Solid lines denote EWVSC1, the fraction of the good variables in the selection, while the dashed lines denote Etot. A good choice of α returns a high fraction of good variables EWVSC1 for a low value of the efficiency metric Etot.

Fig. 2

versus ΔT/P for s = 2 (black lines) and for s = 3 (grey lines). The dashed lines indicate the expected values for random variation.

Fig. 3

Histogram of the interval between observations for the WFCAMCAL08B data; τ is the interval between any two observations regardless of filter, and the intervals are binned logarithmically. The peaks at ~10⁻³ day are intervals within a ZYJHK sequence, and the peaks at ~1 day and multiples of 1 day are repeat observations on subsequent nights.

