EDP Sciences
Issue
A&A
Volume 604, August 2017
Article Number A121
Number of page(s) 19
Section Astronomical instrumentation
DOI https://doi.org/10.1051/0004-6361/201630109
Published online 25 August 2017

© ESO, 2017

1. Introduction

Statistical analysis is vital in many areas of life because it helps us to understand what is happening and thereby to make decisions. These kinds of analyses are also used to assess theoretical models through experiments that are limited by experimental factors, which leads to uncertainty. Measurements are usually performed many times to increase the confidence level. The results are summarized by statistical parameters in order to communicate the largest amount of information as simply as possible. Statisticians commonly describe the observations by averages (e.g. arithmetic mean, median, mode, and interquartile mean), dispersion (e.g. standard deviation, variance, range, interquartile range, and absolute deviation), shape of the distribution (e.g. skewness and kurtosis), and a measure of statistical dependence (e.g. Spearman’s rank correlation coefficient). These parameters are used in finance, weather forecasting, industry, experiments, science, and several other areas to characterize probability distributions. New insights on this topic should be valuable in natural sciences, technology, economics, and quantitative social science research.

Improvements to data analysis methods are mandatory to analyze the huge amount of data collected in recent years. Large volumes of data with potential scientific results are left unexplored, or their analysis is delayed, because current inventory tools are unable to produce clean samples. In fact, we risk wasting the potential of a large part of these data despite the efforts that have been undertaken (e.g. von Neumann 1941, 1942; Welch & Stetson 1993; Stetson 1996; Enoch et al. 2003; Kim et al. 2014; Sokolovsky et al. 2017). Current techniques of data processing can be improved considerably. For instance, the flux independent index that we proposed in a previous paper reduces the mis-selection of variable sources by about 250% (Ferreira Lopes & Cross 2016). A reliable selection from astronomical databases allows scientific results to be delivered faster, such as those enclosed in many current surveys (e.g. Kaiser et al. 2002; Udalski 2003; Pollacco et al. 2006; Baglin et al. 2007; Hoffman et al. 2009; Borucki et al. 2010; Bailer-Jones et al. 2013; Minniti et al. 2010). Reducing misclassification at the selection step is crucial to keep pace with the development of the instruments themselves.

Fig. 1

Uniform (black squares), normal (green asterisks), Ceph (blue diamonds), RR (red triangles) and EB (grey plus signs) distributions with 1000 measurements as a function of the number of elements. The same distributions are shown with different arrangements (see Sect. 2 for more details).

The current project discriminates between correlated and non-correlated observations to set the best efficiency for selecting variable objects in each data set. Moreover, Ferreira Lopes & Cross (2016) established criteria that allow us to compute confident variability indices if the interval between the measurements used to compute statistical correlations is a small fraction of both the variability periods and the interval between correlated groups of observations. On the other hand, the confidence level of statistical parameters increases with the number of measurements. Improvements to statistical parameters for which there are few measurements are crucial to analyze surveys such as PanSTARRS, with a mean of about 12 measurements in each filter, and the extended VVV project (VVVX), with between 25 and 40 measurements (Chambers et al. 2016; Minniti et al. 2010). Sokolovsky et al. (2017) tested 18 parameters (8 scatter-based parameters and 10 correlation-based parameters), comparing their performance. According to the authors, the correlation-based indices are more efficient in selecting variable objects than the scatter-based indices for data sets containing hundreds of measurement epochs or more. The authors proposed a combination of the interquartile range (IQR; Kim et al. 2014) and the von Neumann ratio (1/η; von Neumann 1941, 1942) as a suitable way to select variable stars. A maximum interval of two days between measurements is used to select those used to compute the correlated indices. This value is greater than the limit required to compute well-correlated measurements. Moreover, its efficiency should take into account the number of well-correlated measurements instead of simply the number of epochs. Indeed, the authors may not have accounted for our correlated indices (Ferreira Lopes & Cross 2016) because that would require a bin of shorter interval than the smallest variability period. This allows us to get accurate correlated indices for time series with few correlated measurements. The constraints and data used by Sokolovsky et al. (2017) limit a straightforward comparison between correlated and non-correlated indices performed by the authors. Therefore, the approach used to perform the variability analysis may only be chosen after examining the time interval among the measurements (see Sect. 8).

Statistical parameters such as standard deviation and kurtosis as a function of magnitude have been used as the primary way to select variable stars (e.g. Cross et al. 2009). This method assumes that, for the same magnitude, stochastic and non-stochastic variations have different statistical properties. To compute all current dispersion parameters and almost all shape parameters, the averages must also be calculated, which increases both the uncertainties and the processing time. Indeed, statistical properties still exist even where averages are unknown. In this fashion, Brys et al. (2004) proposed a robust measure of skewness called the “medcouple”, based on a comparison of quartiles and pairs of measurements, that allows us to compute these parameters without using averages. However, the medcouple measure has a long running time since the number of possible combinations increases factorially with the number of measurements. Nevertheless, we can use a similar idea to propose new averages, dispersions, and shape parameters that have a smaller running time.

This work is the second in a series about new insights into time series analysis. In the first paper we assessed the discrimination of variable stars from noise for correlated data using variability indices (Ferreira Lopes & Cross 2016). In this work, we analyze new statistical parameters and their accuracy in comparison with previous parameters to increase the capability to discriminate between stochastic and non-stochastic distributions. We also look into their dependence on the number of epochs to determine statistical weights that improve the selection criteria. Lastly, we use a noise model to propose a new non-correlated variability index. Forthcoming papers will study how to use the full current inventory of period finding methods to clean the sample selected by variability indices.

The notation used is described in Sect. 2 and next we suggest a new set of statistics in Sect. 3. In Sects. 5 and 6 the new parameters are tested and a new approach to model the noise and select variable stars is proposed. Next, the selection criteria are tested on real data in Sect. 7. Finally, we summarize and provide our conclusions in Sect. 9.

2. Notation

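Let $Y' := y'_1 \leq y'_2 \leq \dots \leq y'_{N'}$ be a sample of $N'$ measurements sorted in ascending order, from where the kernel function is defined by

$$Y = \begin{cases} Y' & \text{if } N' \text{ is even,} \\ Y' \setminus \{y'_{\mathrm{Int}(N'/2)+1}\} & \text{if } N' \text{ is odd,} \end{cases} \qquad (1)$$

where $\mathrm{Int}(N'/2)$ means the integer part (floor) of half the number of measurements. The smallest contribution to the statistical parameters of symmetric distributions is given by $y'_{\mathrm{Int}(N'/2)+1}$ since it is the nearest measurement to the average. The value Y is a sample that has an even number of measurements (N) and that can also be divided into the subsamples Y− and Y+, composed of the measurements $y_i \leq y_{N/2}$ and $y_i > y_{N/2}$, respectively. Different arrangements can be taken into consideration, such as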

  • P1

    – unsystematic yi values;

  • P2

    – increasing order of Y− and Y+;

  • P3

    – decreasing order of Y− and increasing order of Y+;

where the measurements of Y− and Y+ for P2 and P3 are assigned the same positions on the x-axis only to provide a better display.
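The kernel and the arrangements can be illustrated with a short sketch. It assumes, following the description in this section, that the sample is sorted and that the central measurement is dropped when the sample size is odd; the function names are hypothetical and are not taken from the paper.

```python
def even_kernel(values):
    """Sort the sample and, if its size is odd, drop the central element.

    The dropped element is the 1-based y'_{Int(N'/2)+1} of the sorted
    sample, i.e. 0-based index len(y) // 2.
    """
    y = sorted(values)
    if len(y) % 2 == 1:
        del y[len(y) // 2]
    return y


def split_halves(y):
    """Split an even-sized sorted sample into its lower and upper halves."""
    half = len(y) // 2
    return y[:half], y[half:]


def arrangements(values):
    """Return the P2 and P3 arrangements as (lower half, upper half) pairs."""
    lower, upper = split_halves(even_kernel(values))
    p2 = (lower, upper)        # both halves in increasing order
    p3 = (lower[::-1], upper)  # lower half decreasing, upper half increasing
    return p2, p3


if __name__ == "__main__":
    print(even_kernel([5, 1, 4, 2, 3]))   # central value 3 is dropped
    print(arrangements([5, 1, 4, 2, 3]))
```

The P1 arrangement is simply the sample in its original, unsorted order, so it needs no dedicated function.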

Five distributions were used to test our approach. These distributions were generated to model both variable stars and noise. Uniform and normal distributions that can mimic noise were generated by the IDL function RANDOMU. This function returns pseudo-random numbers that are either uniformly distributed or drawn from a normal distribution; a mean value of 0.5 and a full width at half maximum (FWHM) equal to 0.1 were assumed for the normal distributions to provide a range of values from about 0 to 1. On the other hand, we generated Cepheid (Ceph), RR-Lyrae (RR), and eclipsing-binary (EB) distributions that are similar to typical variable stars. The Ceph, RR, and EB models were based on the OGLE light curves OGLE LMC-SC14 109671, OGLE LMC-SC21 59535, and OGLE LMC-SC2 180186, respectively. These models were generated in two steps: first, a harmonic fit was used to create a model and, second, these distributions were sampled at random points to get the measurements.
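A minimal Python sketch of this two-step generation is given below, as a stand-in for the IDL routine. The conversion from FWHM to standard deviation is the standard Gaussian relation; the harmonic coefficients are hypothetical placeholders, since the fitted OGLE coefficients are not listed here.

```python
import math
import random

# Standard Gaussian relation: FWHM = 2 * sqrt(2 * ln 2) * sigma
FWHM_TO_SIGMA = 1.0 / (2.0 * math.sqrt(2.0 * math.log(2.0)))


def uniform_sample(n, rng):
    """Uniform noise in [0, 1)."""
    return [rng.random() for _ in range(n)]


def normal_sample(n, rng, mean=0.5, fwhm=0.1):
    """Normal noise with the mean and FWHM quoted in the text."""
    sigma = fwhm * FWHM_TO_SIGMA
    return [rng.gauss(mean, sigma) for _ in range(n)]


def harmonic_model(phase, coeffs):
    """Sum of sine harmonics; coeffs is a list of (amplitude, phase offset)."""
    return sum(
        amp * math.sin(2.0 * math.pi * (k + 1) * phase + offset)
        for k, (amp, offset) in enumerate(coeffs)
    )


def variable_sample(n, rng, coeffs):
    """Sample the harmonic model at n random phases in [0, 1)."""
    return [harmonic_model(rng.random(), coeffs) for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(1)
    toy_coeffs = [(0.30, 0.0), (0.10, 1.0), (0.05, 2.0)]  # hypothetical values
    print(normal_sample(5, rng))
    print(variable_sample(5, rng, toy_coeffs))
```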

Figure 1 shows the P1–P3 arrangements for uniform (black squares), normal (green asterisks), Ceph (blue diamonds), RR (red triangles) and EB (grey plus signs) distributions. The colours and symbols used in this diagram were adopted throughout the paper to facilitate their identification. The symmetry found in P2 and P3 yields unique information about the shape and dispersion parameters. Therefore, the measurements of Y− and Y+ are combined to propose a new set of statistical parameters (see Table 1).

3. Even-statistic (E)

Non-parametric statistics are not based on parametrized probability distributions, so their interpretation does not depend on fitting such distributions. The typical parameters used for descriptive and inferential purposes are the mean, median, standard deviation, skewness, and kurtosis, among others. These parameters are defined to be a function of a sample that has no dependency on a parameter, i.e. their values are the same for any type of arrangement, for instance P1–P3 (see Fig. 1). All dispersion parameters and almost all shape parameters are dependent on some kind of average, i.e. they describe distributions around average values. For instance, the standard deviation and absolute deviation provide an estimate of dispersion about the mean value. However, the dispersion and shape of a distribution still exist even where the averages are unknown. Therefore, we propose the even-statistic as an alternative tool to assess dispersion and shape parameters based only on the measurements, i.e. unbound estimates.

Shape parameters are most appropriately computed with an even number of measurements, if we consider that the same number of measurements should appear on both sides of the distribution for an appropriate comparison. For instance, consider a distribution with an even number of measurements, where the mean value can be computed as $A_\mu = \frac{1}{N}\sum_{i=1}^{N/2} y_i + \frac{1}{N}\sum_{i=N/2+1}^{N} y_i$; the first and second terms are the weights of the left and right sides of the distribution, which have the same number of elements. On the other hand, for odd numbers of measurements, the weights of the two sides of the distribution are not equivalent. The single measurement that can be withdrawn to correct such variation is $y_{\mathrm{Int}(N/2)+1}$, since withdrawing any of the other $y_i$ measurements would increase the difference. The kernel given by Eq. (1) provides samples with even numbers of measurements and hence the same weight for both sides. The parameters proposed with Eq. (1) are called “even parameters”. Moreover, the even number of measurements allows us to compare single pairs of measurements among Y− and Y+ and therefore to estimate dispersion and shape values without taking the average into account. The new statistical parameters to compute averages (see Sect. 3.1), dispersions (see Sect. 3.2), and shapes (see Sect. 3.3) are described below.

Table 1

Variability statistical analyses in the present work.

3.1. Averages (A)

Considering the kernel given by Eq. (1) (see Sect. 2), we propose new average statistics given by

$$EA_\mu = \frac{1}{N}\sum_{i=1}^{N} y_i \qquad (2)$$

and

$$EA_m = \frac{y_{N/2} + y_{N/2+1}}{2}, \qquad (3)$$

where EAμ and EAm are named the even-mean and even-median. These expressions mimic the mean (Aμ) and median (Am). Moreover, EAμ = Aμ and EAm = Am when the number of measurements is even. A comparison between these measurements is performed in Sect. 5.1.
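As an illustration, the sketch below computes both averages. It assumes that the even-mean is the arithmetic mean of the kernel sample and that the even-median is the midpoint of its two central values, which is consistent with the statement that both coincide with the usual mean and median for even sample sizes; the function names are hypothetical.

```python
from statistics import mean, median


def even_kernel(values):
    """Sort the sample and drop the central element if the size is odd."""
    y = sorted(values)
    if len(y) % 2 == 1:
        del y[len(y) // 2]
    return y


def even_mean(values):
    """Arithmetic mean of the kernel (even-sized) sample."""
    return mean(even_kernel(values))


def even_median(values):
    """Midpoint of the two central values of the kernel sample."""
    y = even_kernel(values)
    half = len(y) // 2
    return 0.5 * (y[half - 1] + y[half])


if __name__ == "__main__":
    odd_sample = [1, 2, 3, 4, 100]
    print(mean(odd_sample), even_mean(odd_sample))      # 22 and 26.75
    print(median(odd_sample), even_median(odd_sample))  # 3 and 3.0
```

For an odd-sized sample the two definitions differ only through the removed central measurement, so the difference shrinks as the number of measurements grows.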

3.2. Dispersion parameters (D)

Statistical dispersion is used to measure the amount of sample variance and it is computed using the absolute or squared value of the distance between the measurements and the average. Improving the estimation of averages can provide better accuracy for dispersion parameters such as the mean standard deviation (Dσμ), median standard deviation (Dσm), mean absolute standard deviation (Dμ), and median absolute standard deviation (Dm). Therefore we propose even-dispersion parameters that are computed using the even-averages (see Table 1): the even-mean standard deviation (EDσμ), the even-median standard deviation (EDσm), the even-mean-absolute deviation (EDμ), and the even-median-absolute deviation (EDm). The accuracy of these parameters is assessed in Sect. 5.1. We caution that these parameters return the same values as the previous parameters for even numbers of measurements.

Moreover, using the single combination of the measurements of Y− with those of Y+, we can also estimate the amount of variation or dispersion of a sample. From the kernel given by Eq. (1) we propose the dispersion parameter ED, given by Eq. (4), whose two parts are equal. The value ED is named the even-absolute deviation because such a sum is always positive, i.e. $(y_{N-i} - y_i) \geq 0$. Moreover, a simple identity, Eq. (5), is found for distributions with an even number of measurements, since $(y_{N-i} - EA_m) \geq 0$ and $(y_i - EA_m) \leq 0$. Indeed, we can also mimic the standard deviation by proposing two new even-dispersion parameters, ED(1) and ED(2), given by Eqs. (6) and (7). The even-dispersion parameters ED, ED(1), and ED(2) are unbound, i.e. they are not dependent on the average. These parameters allow us to describe the dispersion of a distribution instead of the dispersion about an average. Moreover, a strict relationship of ED(1) and ED(2) with EDσm is found when we have even numbers of measurements, given by Eq. (8) for ED(1) and by Eq. (9) for ED(2), where Cov denotes covariance. Indeed, the second term in these equations is additive since the covariance among Y− and Y+ is negative.

Asymptotically the identities given by Eqs. (5), (8), and (9) are also valid for odd numbers of measurements since EDm ≈ Dm and EDσm ≈ Dσm. Similar identities that link even-dispersion parameters with their counterparts can be found using Aμ, Am, EAμ, and EAm.

The dispersion of a distribution given by Eqs. (6) and (7) is the standard deviation about the averages minus two times the covariance among Y− and Y+. Moreover, for symmetric distributions where $(y_{N-i} - EA_m) = -(y_i - EA_m)$ and EAm = EAμ, we can write the identity given by Eq. (10). The ratio of ED(1) to EDσm can be used to estimate whether the measurements are symmetrically distributed.
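The idea of an unbound dispersion can be sketched numerically. The example below is not the paper's Eq. (4): it assumes that each measurement of the lower half is paired with its mirror measurement in the upper half (smallest with largest, and so on) and that the sum of the differences is normalized by N. Under these assumptions the result equals the mean absolute deviation about the even-median for even-sized samples, although no average is used to compute it. All function names are hypothetical.

```python
def even_kernel(values):
    """Sort the sample and drop the central element if the size is odd."""
    y = sorted(values)
    if len(y) % 2 == 1:
        del y[len(y) // 2]
    return y


def even_absolute_deviation(values):
    """Mean of mirrored pair differences (assumed pairing and 1/N factor)."""
    y = even_kernel(values)
    n = len(y)
    # Pair the i-th smallest value with the i-th largest value.
    return sum(y[n - 1 - i] - y[i] for i in range(n // 2)) / n


def mean_abs_dev_about(values, centre):
    """Mean absolute deviation about a given centre, for comparison."""
    return sum(abs(v - centre) for v in values) / len(values)


if __name__ == "__main__":
    sample = [1, 2, 4, 7]
    even_median = 0.5 * (2 + 4)
    print(even_absolute_deviation(sample))          # 2.0
    print(mean_abs_dev_about(sample, even_median))  # 2.0
```

The equality holds because every upper-half value lies at or above the even-median and every lower-half value at or below it, so the absolute values can be dropped pair by pair.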

3.3. Shape parameters (S)

In a similar fashion to the dispersion parameters, we can also improve the accuracy of skewness (SS) and kurtosis (SK) using the even-averages. Therefore, we propose even-skewness (ES) and even-kurtosis (EK) to estimate the distribution shape (see lines 11–12 of Table 1). Moreover, we also propose higher moments of ED(1) and ED(2) as new even-shape parameters, given by Eqs. (11)–(14), where ES(1–2) and EK(1–2) are unbound parameters, i.e. they are independent of the average. The values ES(1–2) mimic the skewness while EK(1–2) mimic the kurtosis. A strict relationship of ES(1–2) with ES and of EK(1–2) with EK is very complicated since such definitions use distinct dispersion parameters. Indeed, we can use other dispersion parameters to broaden the list of even-shape parameters.

3.4. Excess shape

The integration of the Gaussian distribution returns a kurtosis equal to 3 as N → ∞. An adjusted version of Pearson’s kurtosis, the excess kurtosis, which is the kurtosis minus 3, is most commonly used. Some authors refer to the excess kurtosis as simply the “kurtosis”. For example, the kurtosis function in the IDL language actually returns the excess kurtosis. The excess values for even-shape parameters were determined in the same fashion as the excess kurtosis, i.e. by shifting these values to zero for normal distributions. Therefore, $10^{5}$ Monte Carlo simulations using normal distributions with $10^{6}$ measurements were performed to determine the excess shape.
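The procedure can be sketched for the ordinary kurtosis, for which the expected offset (3) is known. This is a scaled-down illustration with far fewer and much smaller simulations than those quoted above, and the function names are hypothetical; the same loop could be applied to any shape statistic passed in as `statistic`.

```python
import random


def pearson_kurtosis(values):
    """Fourth standardized moment (equal to 3 for a Gaussian as N grows)."""
    n = len(values)
    centre = sum(values) / n
    m2 = sum((v - centre) ** 2 for v in values) / n
    m4 = sum((v - centre) ** 4 for v in values) / n
    return m4 / (m2 * m2)


def excess_offset(statistic, n_points, n_trials, rng):
    """Average value of a shape statistic over simulated normal samples."""
    draws = [
        statistic([rng.gauss(0.0, 1.0) for _ in range(n_points)])
        for _ in range(n_trials)
    ]
    return sum(draws) / n_trials


if __name__ == "__main__":
    rng = random.Random(0)
    offset = excess_offset(pearson_kurtosis, 2000, 50, rng)
    print(offset)  # close to 3; subtracting it gives the excess value
```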

Table 2

Excess shape coefficients for even-shape parameters.

Table 2 shows the averages for the even-shape parameters and their errors. The amount of variation is less than 0.1%. The excess shape values were added to the equations for even-shape parameters (see Table 1), rounding to two decimal places.

4. Even interquartile range

The interquartile range (IQR – Kim et al. 2014) is also included in our analysis since it was reported as one of the best statistical parameters to select variable stars (Sokolovsky et al. 2017). The IQR uses the inner 50% of measurements, excluding the 25% brightest and 25% faintest flux measurements, i.e. first the median value is computed in order to divide the set of measurements into upper and lower halves and then the IQR is given by the difference between the median values of the upper and lower halves.

The median value is improved using the even-median (see Sect. 5.1) and hence the IQR can also be improved using the even-median instead of the median. Therefore, using the kernel defined by Eq. (1), an even interquartile range (EIQR) is proposed as given by Eq. (15), where Y+ and Y− are the measurements of Y above and below the median, respectively (see Sect. 2 for a fuller description). Indeed, EIQR provides an adjustment for distributions with an even or odd number of measurements. Three, two, or zero adjustments are performed for distributions whose number of measurements is odd, even but not a multiple of 4, or a multiple of 4, respectively.
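A sketch of both quantities follows. The first function implements the IQR as described above (the difference between the medians of the upper and lower halves), with the assumption that the central value is excluded from both halves when the sample size is odd. The second function is one possible reading of the even variant, not a transcription of Eq. (15): the kernel is applied to the full sample and again to each half before the medians are taken. Function names are hypothetical.

```python
from statistics import median


def even_kernel(values):
    """Sort the sample and drop the central element if the size is odd."""
    y = sorted(values)
    if len(y) % 2 == 1:
        del y[len(y) // 2]
    return y


def iqr_by_halves(values):
    """Median of the upper half minus median of the lower half."""
    y = sorted(values)
    half = len(y) // 2
    lower, upper = y[:half], y[len(y) - half:]
    return median(upper) - median(lower)


def even_iqr(values):
    """Assumed even variant: the kernel is applied before every median."""
    y = even_kernel(values)
    if len(y) < 4:
        raise ValueError("at least four measurements are needed")
    half = len(y) // 2
    lower, upper = even_kernel(y[:half]), even_kernel(y[half:])
    return median(upper) - median(lower)


if __name__ == "__main__":
    sample = [1, 2, 3, 10, 20, 21, 22, 40, 41, 42]  # 10 values, halves of 5
    print(iqr_by_halves(sample))  # 37
    print(even_iqr(sample))       # 25.5
```

With this reading, a sample whose size is a multiple of 4 needs no adjustment and both functions agree, while a sample of 10 values needs one adjustment in each half.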

5. Simulating distributions

Monte Carlo simulations were used to test the even-statistical parameters for a number of measurements (number of epochs) varying from 10 to 100. We performed $10^{5}$ simulations for a given number of measurements. Indeed, the range of measurements tested typifies light curves from current large wide-field, multi-epoch surveys such as Pan-STARRS, VVV, and Gaia. These simulations were performed for the five distributions described in Sect. 2 that mimic noise and variable stars; the colours and symbols in Fig. 1 are also adopted in the present section to facilitate the identification of these distributions. The statistical parameters have a higher statistical significance for distributions that have a large number of measurements, where the addition of measurements only leads to small fluctuations. Therefore, the adopted “true parameter” values (Ptrue) are those computed using $10^{6}$ measurements. We performed $10^{5}$ simulations to compute the Ptrue values. All parameters analyzed are listed in Table 3; their errors were computed as the standard deviation. This value is used as a reference to analyze the error eP given by Eq. (16), where P is the statistical parameter. This expression provides the mean error for P. In order to avoid singularities we shift the skewness and kurtosis values by one, i.e. $P = P' + 1$ (and likewise for Ptrue), since they have Ptrue ~ 0.
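The simulation loop can be sketched as follows. The error measure used here, the mean absolute difference from Ptrue divided by Ptrue, is an assumption standing in for Eq. (16), and the number of trials is far smaller than in the simulations described above; the function names are hypothetical.

```python
import random
from statistics import mean


def uniform_sampler(n_points, rng):
    """Draw n_points values from a uniform distribution in [0, 1)."""
    return [rng.random() for _ in range(n_points)]


def relative_error(statistic, sampler, n_points, n_trials, p_true, rng):
    """Mean absolute relative deviation of a statistic from its true value."""
    errors = [
        abs(statistic(sampler(n_points, rng)) - p_true) / abs(p_true)
        for _ in range(n_trials)
    ]
    return sum(errors) / n_trials


if __name__ == "__main__":
    rng = random.Random(42)
    for n_points in (10, 30, 100):
        err = relative_error(mean, uniform_sampler, n_points, 1000, 0.5, rng)
        print(n_points, round(err, 3))
```

Passing a different `statistic` or `sampler` reuses the same loop for any parameter and distribution.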

Table 3

Ptrue for all statistical parameters analyzed in the present work.

Fig. 2

Parameter eP (left panels) and its comparison with previous statistical parameters (right panels) as a function of the number of measurements (see Table 1). The colours and symbols are the same as those adopted in Fig. 1. In the left panels the results for the full range of measurements are used, while the right panels only show those for odd numbers of measurements, except for EIQR, where only the results with numbers of measurements that are not multiples of 4 are plotted. The solid lines in the even-dispersion diagrams show the models described in Sect. 5.3.

5.1. Bound even-statistical parameters

The bound even-statistical parameters are those dependent on the average. These parameters only differ from the previous parameters in that the mean and median are replaced by the even-mean and even-median. Figure 2 shows eP (see Eq. (16)) for the even-parameters (see Table 1, lines 1–6 and 10–12) and its comparison with previous parameters (mean, median, mean standard deviation, median standard deviation, mean absolute deviation, median absolute deviation, skewness, and kurtosis) as a function of the number of measurements in the left and right panels, respectively. The left panels include the results of simulations for the whole range of measurements, while the right panels only show the results for odd numbers of measurements (i.e. 11, 13, 15, ...) because for even numbers of measurements the current and even-statistical parameters have the same values. In the right panels, ePeven/eP < 1 means a higher accuracy for the new parameters compared to current parameters, while ePeven/eP > 1 means no improvement with the new parameters. The simulations were performed as described in Sect. 5, from which we can observe that

  • EAμ – The normal and EB distributions behave similarly, and separately the Ceph, RR, and uniform distributions are also similar. The EAμ returns the lowest errors for the normal distribution. Indeed, eEAμ/eAμ ≃ 1 for normal distributions while eEAμ/eAμ ≃ 1.07 (for 10 measurements) for the EB distribution. This happens because the dispersion about the mean is symmetric for the normal distribution and extremely asymmetric for the EB distribution. The error for 10 measurements for all distributions is about twice that found for 100 measurements on average. The even-mean is more accurate than the mean for all distributions except the EB distribution. For instance, the even-mean returns an improved accuracy of between ~4% and ~8% for Ceph, RR, and uniform distributions over that found by the mean. On the other hand, the mean is better than the even-mean by a similar rate for the EB distribution.

  • EAm – The eEAm for normal and EB distributions are similar but offset by roughly a multiplicative factor. The EB distribution eEAm values are about twice those found for the normal distribution for the same reason as discussed for eEAμ. The even-mean for the RR distribution has eEAμ/eAμ < 0.87 for N < 30 measurements, i.e. an increase in the accuracy of about 7%. Indeed, EAm along with EAμ are more accurate than their previous definitions for the whole range of measurements and for four out of the five distributions analyzed.

  • EDσμ – The EB distribution has the highest eEDσμ values whereas the uniform, Ceph, and RR distributions show similar values. The same behaviour is also observed in the eEDσm diagram. The Dσμ is more accurate than EDσμ for the whole range of measurements and distributions analyzed despite the improvement in the estimation of the mean, but the difference is less than ~0.2% for the Ceph distribution and less than ~0.1% otherwise.

  • EDσm – This value is more accurate than Dσm for the whole range of measurements and all distributions analyzed. The normal and EB distributions show eEDσm/eDσm ≃ 1 for the whole range of measurements analyzed. On the other hand, the Ceph and EB distributions show eEDσm/eDσm ≃ 0.95 for fewer than 30 measurements. The Ceph distribution has the lowest value of eEDσm/eDσm, the opposite of that found for EDσμ.

  • EDμ and EDm – The eEDμ and eEDm diagrams are similar, with the EB distribution returning the highest values, while the other distributions return similar values to each other. The eEDμ/eDμ shows values of less than one for all distributions except the EB distribution. On the other hand, eEDm/eDm shows values of less than one for uniform and Ceph distributions, about one for the normal distribution, and greater than one for RR and EB distributions. The largest values are eEDμ/eDμ ≃ 1.005 and eEDm/eDm ≃ 1.023.

  • EIQR – The greatest variation in the accuracy of statistical dispersion parameters is found for EIQR. An improvement of about 10% is found for uniform, normal, RR, and Ceph distributions and, conversely, a decrease in accuracy of 10% is found for the EB distribution. All distributions with more than ~30 measurements show an improvement over, or a similar accuracy to, that found for IQR. The repeating pattern found among each consecutive set of three measurements is related to the number of adjustments performed by the even-median (see Sect. 4).

  • ES and EK – The shape parameters have the highest uncertainties among the statistical parameters analyzed. The lowest values for eES and the highest values for eEK are both found for the RR distribution. The eES/eS < 1 for all distributions except the EB and RR distributions, while eEK/eK > 1 for all distributions except the Ceph distribution. Moreover, eES and eEK have an uncertainty greater than ~10% up to 50 measurements.

To summarize, the accuracy of statistical parameters has a strong dependence on the number of measurements and on the distribution type. The even-mean and even-median are more accurate than the mean and median for all distributions analyzed except for the EB distribution. The improvements in the estimation of averages by even-statistics allow us to improve the estimation of dispersion and shape parameters for many of the distributions analyzed. This improvement is mainly observed for distributions where the probability of finding measurements near Ptrue is lower. As a result, ePeven/eP ≃ 1 for the normal distribution. The even-statistical parameters are strongly dependent on the distribution shape and so they can be useful for discriminating distribution types. Therefore a study of how to classify distributions using even-statistical parameters will be performed in a later paper from this project.

Fig. 3

Parameter eP as a function of the number of measurements for the free even-parameters (see Table 1). The colours are the same as in Fig. 2.

5.2. Unbound even-statistical parameters

Unbound even-statistical parameters keep some relations with their counterparts for particular limits and distributions (see Sect. 3.2). Such relations are not valid in general and these parameters therefore have no counterpart for comparison. Monte Carlo simulations, such as those performed in Sect. 5.1, were used to estimate the relative error of unbound even-statistical parameters.

The unbound even-statistical parameters display eP values similar to those found for previous parameters (see Fig. 3): bound (see Sect. 5.1) and unbound even-dispersion parameters show similar eP values, while the even-shape parameters return smaller errors than the skewness and kurtosis. This means that the errors are comparable with those found for the common statistical parameters used to describe distributions. Moreover, the even-statistical parameters vary among the different distributions analyzed and so these parameters can be used to discriminate between such distributions (see Table 3). Of course, these values are for the distributions described in Sect. 2, which were generated to have the same amplitude. The values are different if the amplitude is modified, for instance.

The bound and unbound even-statistical parameters (see Table 1) have a similar accuracy to previous statistical parameters and hence they can be used to characterize statistical distributions in a similar fashion to previous parameters. Such parameters can be used to describe and differentiate distribution types. A fuller investigation of how to use these parameters to describe various distributions will be performed in a forthcoming paper of this project.

5.3. Coefficients adjusted for sample size

The adjusted coefficients for sample size are used because samples with few measurements have larger fluctuations in the estimated parameters. For instance, the Fisher-Pearson coefficient for a sample with 10 and 100 measurements is 1.054 and 1.005, respectively. As a result, this correction increases the value if the skewness is positive, and makes the value more negative if the skewness is negative. It cannot be used for parameters that only assume positive values, such as the standard deviation. Therefore other adjusted coefficients have been proposed in a similar fashion. These coefficients increase the dispersion in a population since they enlarge the range of values.

A single equation to create coefficients to adjust for sample size has been used for all statistical parameters. However, the best adjustment is found using a specific equation for each statistical parameter, since they each have different accuracies (see Sect. 5). The simulations described in Sect. 5 were used to determine a model for each dispersion statistical parameter, given by Eq. (17), where b(P) is a real constant (see Table 4). For unknown distributions, i.e. those not included in our analysis, the even-mean value may be used.
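The two quoted coefficient values can be checked numerically. The expression for the coefficient is not reproduced in the text above, so both candidates below are assumptions: the factor sqrt(N/(N−1)), which reproduces the quoted 1.054 and 1.005, and the factor sqrt(N(N−1))/(N−2) commonly used in the adjusted Fisher-Pearson skewness, which gives larger values. Function names are hypothetical.

```python
import math


def root_ratio_factor(n):
    """sqrt(N / (N - 1)); matches the values quoted in the text."""
    return math.sqrt(n / (n - 1))


def adjusted_fisher_pearson_factor(n):
    """sqrt(N (N - 1)) / (N - 2); the usual adjusted-skewness factor."""
    return math.sqrt(n * (n - 1)) / (n - 2)


if __name__ == "__main__":
    for n in (10, 100):
        print(
            n,
            round(root_ratio_factor(n), 3),
            round(adjusted_fisher_pearson_factor(n), 3),
        )
    # 10  -> 1.054 and 1.186
    # 100 -> 1.005 and 1.015
```

Both factors tend to 1 as the number of measurements grows, which is the behaviour the paragraph describes.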

Table 4

Coefficients (b) of Eq. (17).

Fig. 4

Dispersion and shape parameters as a function of magnitude where the black dots indicate the WVSC1 stars. The red and dashed black lines indicate the Strateva-modified and Strateva functions, respectively. The maximum number of sources per pixel is shown in brackets in each panel.

Table 5

Strateva and Strateva modified parameters (see Eq. (18)) for all dispersion parameters analyzed in the present work for BA.

Table 6

Efficiency metric Etot, i.e. the ratio of the number of selected sources to the total number of WVSC1 variable stars, and EWVSC1, i.e. the ratio of the number of WVSC1 stars selected to the total number of WVSC1 stars, and α values computed from Xf(ED) for each waveband and for using all ZYJHK wavebands using β = 4.

6. Modelling the noise

Cross et al. (2009) used the Strateva function (see Strateva et al. 2001; Sesar et al. 2007, for more details) to fit the standard deviation as a function of magnitude to estimate a noise model (ζ). This method assumes that the majority of the sample are point sources, where the variability measurements are dominated by noise rather than astrophysical variations. The method provides a suitable model for photometric surveys at optical wavelengths if they have a single component of noise that increases in relative magnitude from bright to faint stars. However, the brightest stars can show much greater variation, which comes from saturation and non-linearity of the detectors, providing a source of variation that cannot be fitted by these models. Such a situation is rare at optical wavebands but is very frequent for NIR data (see Fig. 4). The sky foreground emitted by the atmosphere is highly variable in the NIR. For this reason, the sky foreground causes a highly time-varying saturation limit, which can affect large parts of otherwise highly accurate time-series data for bright stars with substantial outliers that have very small formal error estimates (Ferreira Lopes et al. 2015a). These outliers probably have a spurious impact upon the statistical parameters. Therefore, we propose a modification to the Strateva function that allows us to model such variations, i.e. the increase in the standard deviation for both bright (saturated) and faint (photon-noise limited) stars, given by Eq. (18), where all the coefficients are real numbers. Indeed, Strateva et al. (2001) and Sesar et al. (2007) proposed a noise model using three terms where the second and third coefficients are 0.4 and 0.8, respectively. These powers continued to be used in Cross et al. (2009) but the optimal coefficients were never tested. For more details see Sect. 7.2.
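The shape of such noise models can be sketched as follows. The first function uses the three-term form commonly quoted for the Strateva function, a + b·10^(0.4 m) + c·10^(0.8 m). The second is only a hypothetical generalization with free exponents, used here to show how a negative exponent lets the model rise towards bright magnitudes; it is not a transcription of Eq. (18), and all coefficient values are invented for illustration.

```python
def strateva_noise(mag, a, b, c):
    """Three-term noise model with fixed exponents 0.4 and 0.8."""
    return a + b * 10.0 ** (0.4 * mag) + c * 10.0 ** (0.8 * mag)


def generalized_noise(mag, c0, c1, c2, c3, c4):
    """Hypothetical variant in which the exponents c2 and c4 are free."""
    return c0 + c1 * 10.0 ** (c2 * mag) + c3 * 10.0 ** (c4 * mag)


if __name__ == "__main__":
    # Invented coefficients: a negative first exponent raises the bright end.
    for mag in (10.0, 15.0, 19.0):
        print(mag, generalized_noise(mag, 0.01, 1e3, -0.4, 1e-14, 0.8))
```

With these illustrative numbers the model is lowest at intermediate magnitudes and increases for both the brightest and the faintest stars, which is the behaviour the modified function is meant to capture.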

6.1. Non-correlated indices

The selection of non-stochastic variations can be performed by one or more dispersion parameters. In order to combine a set of dispersion parameters (see Table 1) a non-correlated index is proposed as follows; (19)where w(P) and ζw(P)P are given by Eqs. (17) and (18), respectively. This equation provides an index value that takes into the sample size adjustment coefficient and noise model into account. For instance, I(P) ~ 1 for stochastic variation.

Distinct statistical parameters have different capabilities of discriminating distributions (see Sects. 5.1 and 5.2). Such differences are highlighted when the eP values or sample size adjustment coefficients are compared. A sample composed mainly of stochastic variations has a different dispersion of I(P) values. Therefore an appropriate combination of the results from different dispersion parameters is given by (20) where f is the waveband used, ωPj is a weight associated with each dispersion parameter, v is the number of parameters used, and IPj is given by Eq. (19). Indeed, IPj provides a normalized index, allowing us to combine distinct dispersion parameters and the results from various wavebands.

6.2. Broadband selection

Variable star candidates for non-correlated data are usually selected from the noise model (see Sect. 6). Stars with values above n × Dσμ are selected for further analyses. This approach assumes that a few percent of the entire sample are variable stars and have statistical values above the noise. The noise samples present distributions such as uniform, normal, or distributions in between, while variable stars are more similar to the Ceph, RR, and EB distributions. Therefore, the dispersion parameters assume a different range of values for variable and non-variable stars, a difference that is mainly highlighted for samples with a high number of measurements (typically higher than 50). Indeed, such a difference must increase for amplitudes higher than that found for the noise. For few measurements (typically fewer than 20) stochastic and non-stochastic variations have large uncertainties, increasing the mis-selection rate (see Sect. 5.1). We find similar behaviour for correlated indices. For instance, Ferreira Lopes et al. (2015a) use cut-off surfaces linking magnitude, number of epochs, and variability indices to improve the selection criteria of variable stars, while Ferreira Lopes & Cross (2016) use flux independent indices to propose an empirical relationship between cut-off values and the number of measurements without taking into account magnitude.

The adjusted coefficients for sample size, as presented in Sect. 5.3, reduce the population dispersion. Meanwhile, uncertainties about the range of values assumed by stochastic and non-stochastic variations also vary with the number of measurements. For non-stochastic variations with a good signal-to-noise and a large number of measurements, such a range is different from that produced by stochastic variations. On the other hand, for distributions with just a few measurements, the ranges of values can significantly overlap. In the same fashion as the empirical selection criteria proposed by Ferreira Lopes & Cross (2016, see Eq. (16)), we propose the following criterion (21) where α and β are real positive values and N is the number of measurements. In cases where α is bigger than 1, we may find stochastic variations. Higher values of β provide a higher cut-off for small numbers of measurements or correlations. For instance, f(1,4) for N equal to 10, 30, and 50 are 1.63, 1.36, and 1.28, respectively. Indeed, lower values of α provide a more complete selection while higher values provide a more reliable selection.
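The example values quoted above can be checked numerically. The sketch below assumes a cut-off of the form α + √(β/N); this form is an assumption inferred only from the three quoted values of f(1,4) and is not a reproduction of Eq. (21). The function name `cutoff` is hypothetical.

```python
import math


def cutoff(alpha, beta, n):
    """Assumed cut-off form: alpha + sqrt(beta / n).

    alpha and beta are real positive values and n is the number of
    measurements. The functional form is an assumption, not Eq. (21).
    """
    if n <= 0:
        raise ValueError("n must be a positive number of measurements")
    return alpha + math.sqrt(beta / n)


for n in (10, 30, 50):
    print(n, round(cutoff(1.0, 4.0, n), 3))
# prints 1.632, 1.365, and 1.283 for N = 10, 30, and 50
```

Under this assumed form the results agree with the quoted 1.63, 1.36, and 1.28 to within rounding, and the cut-off approaches α as N grows, so the extra margin matters mostly for short time series.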

7. Real data

We used the WFCAM Calibration 08B release (WFCAMCAL08B – Hodgkin et al. 2009; Cross et al. 2009) as a test database, as we did in the first paper of this series. To summarize, this programme contains panchromatic data for 58 different pointings distributed over the full range in right ascension and spread over declinations of and . These data have been used to calibrate the UKIDSS surveys (Lawrence et al. 2007). During each visit the fields were usually observed with a sequence of filters, either JHK or ZYJHK, within a few minutes. This led to an irregular sampling with fields reobserved roughly on a daily basis, although longer time gaps are common and large seasonal gaps are also present in the data set. The design, the data curation procedures, the layout, and the variability analysis of this database are described in detail in Hambly et al. (2008), Cross et al. (2009), and Ferreira Lopes et al. (2015a).

The multi-waveband data are well suited to testing the statistical parameters using different wavebands (ZYJHK). Moreover, Ferreira Lopes et al. (2015a) and Ferreira Lopes & Cross (2016) performed a comprehensive stellar variability analysis of the WFCAMCAL08B, characterizing the photometric data and identifying 319 stars (WVSC1), of which 275 are classified as periodic variable stars and 44 objects as suspected variables or apparent aperiodic variables. In this paper we analyze the same sample as Ferreira Lopes et al. (2015a) and Ferreira Lopes & Cross (2016). First, we selected all sources classified as a star or probable star with at least 10 unflagged epochs in any of the five filters. This selection was performed from an initial database of 216 722 stars. Next, we tested the efficiency of selection of variable stars using the statistical parameters presented in Sect. 3.

We compute all statistical parameters listed in Table 1 by the following algorithm: the photometry measured by the best aperture was selected and then the measurements with flags (ppErrBits) higher than 256 were removed. The analysis of these data was performed using the current and earlier approaches, and the two were compared using the following equation: (22) where G(P) means the percentage of upgrade (G> 0) or downgrade (G< 0) provided by the parameter tested (P). For example, P = Etot (ratio of the total number of sources selected to the total number of variable stars in the WVSC1 catalogue) and P = EWVSC1 (ratio of the number of selected variable stars in WVSC1 to the total number of variable stars in WVSC1) computed from previous (P) and current (P) statistics. This allows us to estimate the improvement (G> 0) or deterioration (G< 0) provided by the methods proposed in the current work (see Sect. 7.4). The statistical parameters were computed for each waveband as well as considering all wavebands (ZYJHK). Table 6 lists α and its respective efficiency metric values. Such parameters are used to analyze the efficiency of selection of variable stars from noisy data in the WFCAMCAL database using the WVSC1 catalogue as a comparison.
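The two efficiency metrics defined in words above can be sketched as follows. This is a minimal illustration in which sources are represented by hypothetical integer identifiers; the function name `efficiency_metrics` is an assumption, and the metric G of Eq. (22) is not reproduced.

```python
def efficiency_metrics(selected_ids, wvsc1_ids):
    """Return (E_tot, E_WVSC1) for a selection of sources.

    E_tot   : number of selected sources / number of WVSC1 variables.
    E_WVSC1 : number of selected WVSC1 variables / number of WVSC1 variables.
    Identifiers are hypothetical; any hashable source label works.
    """
    selected = set(selected_ids)
    known = set(wvsc1_ids)
    if not known:
        raise ValueError("the comparison catalogue is empty")
    e_tot = len(selected) / len(known)
    e_wvsc1 = len(selected & known) / len(known)
    return e_tot, e_wvsc1


# Six selected sources, four catalogued variables, three recovered.
print(efficiency_metrics([1, 2, 3, 4, 5, 6], [1, 2, 3, 7]))  # (1.5, 0.75)
```

A perfect selection would give E_tot = E_WVSC1 = 1; values of E_tot well above E_WVSC1 indicate that many selected sources are not catalogued variables.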

7.1. Testing even-statistical parameters

Figure 4 shows even-statistical parameters and the standard deviation as a function of the K-band magnitude. The variable stars in WVSC1 are denoted by large black dots and the noise model (Strateva) functions are indicated by lines. The main results can be summarized as follows:

  • The dispersion even-parameters have a similar range to that found for the standard deviation. Moreover, the majority of WVSC1 stars have values above the stochastic variations and therefore these parameters can be used in the same fashion as the standard deviation to discriminate variable stars from noise. As expected, the diagram of ED is equal to that of EDM and similar to that of EDμ.

  • The Strateva and the modified-Strateva functions show similar values for almost all ranges of magnitude. The difference is a slope at lower magnitudes (bright stars) found for the modified-Strateva function. This allows us to reduce the mis-selection but we also remove some bright variable stars that have small amplitude variations. We caution that Strateva and the modified-Strateva functions can present an incorrect model for very faint magnitudes since a small decrease in the dispersion is found. In these cases a magnitude limit can be adopted (Cross et al. 2009).

  • The shape even-parameters give a good discrimination for many variable stars particularly for bright stars (see Fig. 4). However, almost all faint stars (magnitudes greater than ~16) have values near those found for stochastic variations. In this sense, the dispersion parameters are better than shape parameters at discriminating non-stochastic variations since we can see a clearer separation among them for all ranges of magnitude. The shape parameters may be useful to discriminate different kinds of light curve signatures and this will be addressed in a future paper in this series.

In summary, the even-statistical parameters can be used in the same fashion as previous parameters. The main goal of this paper is to study the criteria of selection of variable stars from noise, but these parameters may also be useful for many other purposes in different branches of science and technology.

7.2. Finding the best noise model

We tested 82 159 models to find the best Strateva-modified function (ζP(m) – see Eq. (18)). All combinations of three power terms in a range from 4 to − 4 were tested using a step of 0.1. This range covers all previous values used in the noise model. The procedure adopted to find the best model to fit the standard deviation as a function of magnitude is similar to that used by Cross et al. (2009). The EAM and EDM are computed for bins with a width of 0.1 mag or at least 100 objects. For this step, we only consider those stars with more than 20 measurements. Next, we compute ζP(m) from a non-linear least-squares minimization using the Levenberg-Marquardt method (Levenberg 1944; Marquardt 1963). The WVSC1 catalogue of variables represents 0.01% of WFCAMCAL stars. Nevertheless, they were removed from the sample before fitting to get a better noise model.
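The binning step described above can be sketched as follows. This is one possible reading of "a width of 0.1 mag or at least 100 objects", namely that a bin is closed only once it is at least the minimum width and holds at least the minimum number of objects; this reading, the function name `adaptive_bins`, and the merging of an under-populated final bin are all assumptions.

```python
def adaptive_bins(mags, min_width=0.1, min_count=100):
    """Group magnitudes into consecutive bins.

    A bin is closed when it spans at least min_width and already holds
    at least min_count objects (an assumed interpretation of the text).
    A trailing bin with too few objects is merged into the previous one.
    """
    bins, current = [], []
    for m in sorted(mags):
        if current and len(current) >= min_count and m - current[0] >= min_width:
            bins.append(current)
            current = []
        current.append(m)
    if current:
        if bins and len(current) < min_count:
            bins[-1].extend(current)
        else:
            bins.append(current)
    return bins


# Toy example with integer "magnitudes" so that the result is easy to follow.
print(adaptive_bins(range(10), min_width=2, min_count=3))
# [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]
```

The central value and dispersion of each bin would then be passed to the non-linear least-squares fit; that step is omitted here because the functional form of Eq. (18) is not reproduced.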

About 39% of the models tested converge for all statistical parameters and wavebands observed. The model with the lowest χ2 in ZYJHK wavebands was taken as the best noise model given by Eq. (18). Table 5 shows the parameters obtained for both Strateva and Strateva-modified functions and the metric to measure the improvement or deterioration provided by the latter. The dispersion of residuals (G(R)) is about 1% lower than that found for Strateva functions for almost all statistical parameters except for IQR and EIQR. A similar behaviour is found for G(χ2) where an improvement of about 85% is found. Indeed, the largest improvement is found for the K filter. However, a deterioration is found for IQR and EIQR in the ZYJ wavebands. Moreover, Strateva-modified functions do not turn down at faint magnitudes as the Strateva function does sometimes (see Fig. 4). The new noise model provides a more restrictive cut for both the brightest and faintest stars.

7.3. Testing photometric apertures and wavebands

To test how the selection criteria depend on the photometric aperture and on extreme measurements in the WFCAM data, seven different analyses were performed:

  • A1–5: photometric measurements using a standard photometric aperture from 1 to 5 (0.5″, , 1″, , and 2″ radius, respectively)

  • BA: photometric measurements using the best aperture (see Cross et al. 2009)

  • BAS: all measurements enclosed in 2 × EDσμ about EAM of BA photometry are used.
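The BAS selection in the last item amounts to keeping only the measurements within a fixed multiple of a dispersion estimate about a central value. The sketch below illustrates the idea with the ordinary median and population standard deviation as stand-ins for EAM and EDσμ, whose definitions are given elsewhere in the paper; the stand-ins and the function name `clip_about_centre` are assumptions.

```python
import statistics


def clip_about_centre(mags, n_sigma=2.0):
    """Keep measurements within n_sigma * spread of the central value.

    The median and population standard deviation are stand-ins
    (assumptions) for the even-statistics used in the text.
    """
    centre = statistics.median(mags)
    spread = statistics.pstdev(mags)
    return [m for m in mags if abs(m - centre) <= n_sigma * spread]


# Nine measurements at 15.0 mag and one outlier at 18.0 mag.
light_curve = [15.0] * 9 + [18.0]
print(clip_about_centre(light_curve))  # the 18.0 mag point is dropped
```

The same mechanism that removes outliers would also remove a small number of genuine eclipse measurements, which is the trade-off of this kind of clipping.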

In these analyses, the measurements with flags greater than 256 were removed. The third aperture (A3) corresponds to the default 1″ aperture, whose radius is slightly larger than the typical seeing FWHM, so an aperture centred on a point source should contain >95% of the light in the ideal Gaussian case; in reality there is much more light in the lower surface brightness wings. Increasing the aperture size increases the amount of signal, but at the expense of increasing the amount of sky too, such that the signal-to-noise decreases. Decreasing the aperture reduces the signal too much, also reducing the signal-to-noise ratio. Usually A3 gives the optimal signal-to-noise. Sometimes, however, nearby stars affect the measurements by adding a noise component from deblending, which relies on imperfect modelling. In such cases a smaller aperture that includes less signal from the neighbour gives better results, which is why a variable aperture was selected by Cross et al. (2009).

Figure 5 shows the result for different apertures (top panel) and different wavebands (bottom panel). The BAS returns the best results, i.e. the lowest values of Etot for all values of EWVSC1. The BAS approach allows us to achieve a better discrimination of variable stars from noise (see Table 6). This is mainly noted for those dispersion parameters that take into account the square in their definition, such as EDσμ and EDσm. On the other hand, the BAS approach can also lead to mis-selections, for instance of binary stars that have few measurements at the eclipse (see Sect. 7.3 for more details). The number of stochastic variations decreases substantially but it also means that we could miss some variable stars. Moreover, the efficiency levels for different wavebands vary significantly (see Table 6). The best result was found for the J waveband rather than for the Z and K wavebands. The efficiency decrease found for the K waveband is related to the decrease of signal-to-noise, while for the Z waveband we find that the c0 in Eq. (18) is significantly higher (~0.023; cf. ~0.014 for Y, J, H, K), which suggests greater variations in the photometry across each detector, since simple offsets in the zero point would be corrected by the recalibration carried out by Cross et al. (2009). Calibrating the Z and Y bands was trickier than calibrating the J, H, K bands because the calibration is extrapolated from 2MASS J, H, Ks (see Hodgkin et al. 2009) and is more susceptible to extinction, particularly in the Z band, which can vary on small scales in star forming regions. Indeed, 32 WVSC1 stars were found in highly reddened regions (Z − K > 3), indicating that such effects can be present in WFCAM data.

7.4. Analysis of improvements

Section 7.2 discusses the improvements made using the Strateva-modified function. The resulting fits have a smaller χ2 than those of the original Strateva function, which indicates a better noise model estimation. However, this does not inform us about the improvements to the selection of variables. Therefore, to measure the improvements or deteriorations provided by each step of our analysis, the metric G was computed for Etot and EWVSC1 using five different approaches as follows:

  • Even-statistic – The results are computed from standard dispersion parameters in comparison with their respective counterpart even-dispersion parameter for BA photometry (see Sect. 7.3).

    Fig. 5

    The top panel shows Etot as a function of EWVSC1 for all apertures, and the bottom panel shows the same for all individual wavebands and for their combination (ZYJHK wavebands) using BA. The results for each photometric aperture and waveband are shown with different colours. EWVSC1 decreases with Etot, leading to a more reliable selection (fewer misclassifications), and vice versa.

    Fig. 6

    G vs. EWVSC1 using different approaches (see Sect. 7.4). The approach used is named above in each of the upper diagrams. The colours indicate the results for different filters ZYJHK (brown, grey, red, green, and blue lines respectively) and the combination of results found in all bands (black lines). The same colours were also used in Fig. 5 (bottom panel).

    Fig. 7

    X indices using all wavebands XZYJHK for ED and Dσμ. The X(ED) index was computed using the Strateva function without sample size correction for BA photometry, while X(ED) was computed using the Strateva modified function with sample size correction for the BAS approach. The histograms of the entire sample (black lines) and WVSC1 stars (red lines) are shown at the top right. The WVSC1 stars are also represented by open black circles and the maximum number of sources per pixel is shown in brackets.

  • Sample size – The results with and without sample size corrections for BA photometry.

  • Noise model – The results using the Strateva versus Strateva modified functions for BA photometry.

  • BAS approach – The results for BA photometry versus those computed for the BAS approach.

  • All – The results computed from the previous dispersion parameter using the Strateva function without sample size corrections for BA photometry versus their respective even dispersion parameter using sample size correction, Strateva modified functions, and the BAS approach.

This analysis allows us to verify whether and by how much each approach upgrades or downgrades the selection criteria compared with previous ones. Moreover, the following combinations of dispersion parameters were also analyzed: (23) and (24) These combinations provide the same number of dispersion parameters using either the standard statistics or the even-statistics and hence give a better comparison of their efficiency level. The weight ωPj was adopted as the inverse of EDσμ for each Dj. The same tests were performed using single dispersion parameters to verify whether their combination provides better results.

Figure 6 shows a comparison between previous and current approaches. The even-statistical parameters are on average better than the standard statistical parameters (see Sect. 5.1). The results for real data show a fluctuation of about 1% on G(Etot) values. This is expected since the improvements to the estimation of standard statistical parameters only occur for those distributions that have odd numbers of measurements and decrease quickly with the number of measurements. Moreover, we also observed that the improvements vary from the Z to K waveband since the infrared light curves usually have smaller amplitudes than those at optical wavebands and therefore the improvement is more evident. The following behaviours are also observed:

  • The G(Etot) for the sample size correction varies from 2% to 7% for EWVSC1 less than ~0.8. The EWVSC1 stars outside of this limit have an XED (see Fig. 7) less than 1.5. Indeed, the X variability indices are approximately a measurement of the signal-to-noise ratio so this indicates an improvement in the signal-to-noise greater than 1.5 for EWVSC1< 0.8 and a deterioration of about 7% otherwise.

  • The improvement provided by the Strateva modified function can reach G(Etot) ≃ 22%. It only improves the selection for EWVSC1 lower than ~0.9, similar to that found for the sample size correction. Indeed, the Strateva modified function provides a fluctuation of a few percent of improvement or deterioration for the Z and Y wavebands. The increase to the total number selected provided by the sample size correction and noise model means a reduction of misclassification but this also hinders the detectability of variable stars of lower amplitudes that are mainly found at fainter magnitudes.

  • The BAS approach provides the largest improvement to the selection criteria for all dispersion parameters tested except IQR. The definition of IQR takes into account 75% of the distribution and hence the BAS approach to IQR provides a second reduction on the data used. Therefore the BAS approach to IQR is not appropriate. On the other hand, the BAS approach is suitable for all other dispersion parameters analyzed since the maximum improvement found is about 73%, where Dσμ and Dσm have the largest improvements.

  • The total improvement is dominated by the improvement from the BAS approach since the maximum improvement is not very different from that found for the BAS approach. Indeed, the BAS approach leads to a constant improvement until EWVSC1 ≃ 0.95. The decrease observed for values higher than that worsens when the sample size correction and the Strateva modified function are added.

  • We also perform the selection using the previous standard procedure to select variable stars using non-correlated data, i.e. selecting all sources with a magnitude RMS more than n sigma above the noise model function. We compute the standard deviation and the X index for the K waveband using BA photometry. At one sigma above the Strateva function ~81% of WVSC1 stars are selected but at the expense of an Etot ≃ 103. This Etot value is ~5.2 times larger than that found using our approach for the K waveband and ~12 times that considering all wavebands (see Table 6). This means that the modified-Strateva function combined with our empirical approach (see Eqs. (18) and (21)) and statistical weights (see Sect. 5) increases the selection efficiency by about 520%.

  • The performances of previous and even-statistical parameters are very similar when we use the sample size correction and Strateva-modified function with the BAS approach. Indeed, the efficiency level for EIQR is optimized if only the Strateva-modified function and sample size correction for BA photometry are used.

  • The performance obtained from single statistical parameters is very similar to that found for EDAll or DAll. Therefore the combination of several statistical parameters does not provide an improvement according to our results. Moreover, combinations with more statistical parameters were tested but no improvement was found.

  • The largest improvement is found when all wavebands are combined. This returns a set of potential variable stars that is about 2.1 times (for EWVSC1 ≃ 0.8) and 4.9 times (for EWVSC1 ≃ 0.9) smaller than that found for single wavebands.

The approaches proposed in the current work provide reasonable improvements to the selection criteria. All steps of our approach were tested allowing us to identify which parameters are improved and the range of EWVSC1 over which these improvements are valid. Such analyses allow us to define the best way to select variable stars with statistical parameters.

Fig. 8

Etot vs. EWVSC1 (left panel) for the best selection criteria and 1 /η as a function of the mean time interval between measurements ΔT (right panel). The left panel shows the statistical parameters X(EIQR) (blue lines), X(Dσμ) (red lines), and X(ED) (dark lines) for the ZYJHK (full lines) and K (dashed lines) wavebands. In the same diagram the results found using the mean (full grey line) and even-mean (dashed grey line) are also plotted. The results for normal, uniform, Ceph, RR, and EB simulated distributions are shown in the right panel. The colours indicate the different parameters or distributions analyzed, which are indicated at the top of each diagram. In the right panel we show how 1 /η varies with the selection of ΔT. The box plot indicates the quartiles of the WFCAM and WVSC1 samples, where each box encloses 50% of the sample under the grouping algorithm that defines ΔT in Sokolovsky et al. (2017).

7.5. Improvements on correlated indices

The flux independent variability indices proposed by us (for more details see Ferreira Lopes et al. 2015a; Ferreira Lopes & Cross 2016) are not dependent on the amplitude signal since they only use the correlation signal. However, they are dependent on the mean value. Therefore, the correlation values computed using even-averages are more accurate than those computed using mean values since the even-mean gives a value closer to the true centre (see Sect. 3.1). As a result, the Etot values presented in Table 7 are reduced by about 18% compared to those found in Table 2 of paper I. This reduction is related to the sources that have few correlations. Such an improvement is almost constant for EWVSC1< 0.90 (see Fig. 8) and it is this high because a small variation in the number of positive correlations provides a substantial improvement for the indices. For instance, a single correlation can change the indices by about 20% if there are only five correlation measurements. Therefore, a better mean value provides a strong correction on indices with few correlations.

Table 7

Efficiency metric Etot, EWVSC1, and αcor values computed from the analysis of BA photometry for and .

We also tested the correlated indices using BAS, i.e. all measurements enclosed in 2 × EDσμ about EAm of BA photometry (see Sect. 7.3). The results are not that different from those found for EWVSC1< 0.85 (see Table 7), while for EWVSC1> 0.85 we found an Etot about 40% higher. The measurements related to eclipsing binary stars are removed when we use BAS. The correlated and non-correlated indices can fail for low signal-to-noise variations and for non-contact binaries with few measurements at the eclipses.

The WFCAMCAL database allows us to compute correlated indices that have a number of correlations greater than for about 94% of the data. Variable stars with fewer correlations or not previously detected will be explored in the next paper of this series, in which we will propose a new periodicity search method and study selection criteria to produce a cleaner sample.

8. Summary of recommendations

Reliable selections become more important than complete selections of variable stars when confronted by a very large amount of photometric data. Visual inspection is usually performed to decide whether an object is a variable star or not (e.g. Pojmanski et al. 2005; Graczyk et al. 2011; De Medeiros et al. 2013; Ferreira Lopes et al. 2015a,b,c; Song et al. 2016). This is done even if good filtering is performed to remove image artefacts, cosmic ray hits, and point spread function (PSF) wings of bright nearby objects (e.g. Fruchter & Hook 2002; Bernard et al. 2010; Denisenko & Sokolovsky 2011; Ramsay et al. 2014; Desai et al. 2016). This is because stochastic and non-stochastic variations do not look that different from the viewpoint of statistical and correlated indices, especially for low signal-to-noise data. Indeed, at the end of this project we aim to propose a non-supervised procedure that allows us to get an unbiased sample from analyzing a large data set, i.e. without performing visual inspection.

Recently new statistical parameters were proposed that include the error bars. These may improve the statistical parameters if the error bars are well estimated. However, they can also increase the uncertainties because it is common to find outliers with smaller error bars. The performance of many of these parameters was recently tested by Sokolovsky et al. (2017). The authors used as test data a sample of 127 539 objects with more than 40 epochs, among which 1251 variable stars were confirmed. The limit in the number of measurements gives a straightforward comparison with surveys such as PanSTARRS (with about 12 measurements) and the VVVX, which will observe fewer epochs than VVV but still in the range from 25 to 40. The authors set the 1 /η index as the best way to select variable stars, but this is not true if the epoch interval (ΔT) is large. Figure 8 (right panel) shows the variation of the 1 /η index as a function of ΔT. As can be seen, the separation between stochastic variations (uniform and normal distributions) and variable stars becomes evident only for ΔT< 0.1. The grouping of observations as defined for the 1 /η index is in a single band, and hence ΔT ≃ 17.5 d for WVSC1 stars for single wavebands. Therefore the 1 /η index is not suitable for selecting variable stars from noise in the WFCAM database. The WFCAM database was analyzed by correlated indices because the multi-band observations provide a large number of measurements taken in intervals of ΔT ≃ 0.01 d. Unlike statistical parameters, the correlated indices can be computed using multiple wavebands and this is the best way to calculate these indices in this case. Indeed, more than 50% of WVSC1 variable stars could be missed if the 1 /η index was adopted to analyze the WFCAMCAL database. Such results are in agreement with those found by Ferreira Lopes & Cross (2016, see Fig. 2 Sect. 4.1), where the authors performed this analysis using indices.

Indeed, ΔT< 0.1 was found for the current test data because the simulated variable stars (Cepheid, RR Lyrae, and eclipsing binary) have a variability period equal to 1. The parameter ΔT is not set in order to choose which variability indices must be used to perform the variability analysis. The analysis of correlated observations (Ferreira Lopes & Cross 2016, see Sect. 4.3) is mandatory to determine whether correlated indices can be used and to set ΔT. Sokolovsky et al. (2017) did not take into account our correlated indices (Ferreira Lopes & Cross 2016), which have a well-defined limit and a high accuracy for only a few correlated measurements. These indices only combine those measurements that provide good information about statistical correlation. Indeed, many variable stars could be missed. Reliable correlated indices can only be computed if ΔT is a small fraction of the variability period; this is in agreement with the results found by Ferreira Lopes & Cross (2016). This aspect limits a straightforward comparison between correlated and non-correlated indices performed by the authors.
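To make the dependence on sampling concrete, the 1 /η index can be sketched from the von Neumann ratio, taken here as the mean squared difference between successive measurements divided by the sample variance. The sketch assumes time-ordered magnitudes and does not implement the grouping of observations that defines ΔT; the function name `inverse_von_neumann` is hypothetical.

```python
import statistics


def inverse_von_neumann(mags):
    """Return 1/eta for a time-ordered list of magnitudes.

    eta = (mean squared successive difference) / (sample variance).
    Smooth, well-sampled variations give large 1/eta; measurements that
    look uncorrelated from one epoch to the next give small 1/eta.
    """
    n = len(mags)
    if n < 2:
        raise ValueError("at least two measurements are required")
    variance = statistics.variance(mags)
    if variance == 0:
        raise ValueError("constant series: eta is undefined")
    delta2 = sum((b - a) ** 2 for a, b in zip(mags, mags[1:])) / (n - 1)
    return variance / delta2


smooth = [1.0, 2.0, 3.0, 4.0, 5.0]            # slowly varying trend
alternating = [1.0, 2.0, 1.0, 2.0, 1.0, 2.0]  # epoch-to-epoch jumps
print(inverse_von_neumann(smooth), inverse_von_neumann(alternating))
```

The alternating series mimics a variable star sampled at intervals comparable to or longer than its period: consecutive points are no longer correlated, so the index drops towards the values found for noise.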

Figure 8 (left panel) shows a summary of our efforts to provide the best way to select variables in correlated and non-correlated data. The values correlated indices are more efficient than previous correlated indices and should be adopted in cases in which correlated indices can be calculated sensibly, but aspects of this still need to be tested, especially in systems where the correlation order and number of permutations are very low. The flux independent indices are weakly dependent on magnitude but are strongly dependent on the time interval among correlated measurements; such indices should be used when the observations have a natural correlation interval that is shorter than the typical epoch interval (for more details, see Sect. 4.3, Paper I). Indeed, a large number of variable stars can be missed if this is not taken into account.

Variable stars are better discriminated from noise using correlated indices than non-correlated indices, and so the former should be adopted when they are available (see Sect. 8); otherwise Xf indices can be used (see Fig. 8). Indeed, this also determines how we may best perform photometric observations to maximize the performance of selection criteria. A combination of these observations can be used but this does not guarantee a high performance. The better selection performed by correlated indices is well known and therefore the correlated indices may be adopted to achieve a smaller mis-selection rate. Using all the above, the following set of procedures is recommended as the best way to select variable stars:

  • A histogram of the interval between observations must be analyzed to determine whether correlated indices can be used (see Fig. 3 of Ferreira Lopes & Cross 2016). The approach used to perform the variability analysis may only be chosen after examining the time interval among the measurements. The measurements used to compute correlated indices must be correlated over a fraction of the minimum variability period. The parameter has the highest performance among the correlated indices analyzed and hence this should be adopted as the main tool to select variable stars using correlated data. Moreover, the even-mean should be used instead of the mean to compute indices in order to improve the correlation estimation.

    Table 8

    EWVSC1 and Etot values for all dispersion parameters analyzed using ZYJHK wavebands.

  • A minimum of 5 correlated measurements must be adopted as the limit to discriminate variable stars from noise using correlated indices (see Eq. (14) of Sect. 4.1 of Ferreira Lopes & Cross 2016).

  • A constant cut-off value may be adopted if all time series can be considered on the same basis, independent of the number of correlations. A cut-off using the number of correlations provides a better selection and therefore should be adopted if there is a reasonable number of correlations (more than 10). Indeed, we suggest that correlated indices are only calculated for stars that have a number of correlations greater than 10. This increases the reliability of the correlated indices estimation and allows those stars with few correlated measurements to be analyzed by statistical parameters. Moreover, a higher order of correlated variability indices may be adopted if more than 2 measurements are available in each correlation interval.

  • The Xf index may be used for time series with fewer than 10 correlated measurements. The information of all wavebands must be combined if multiwavelength data are available. This reduces the misclassification rate by about 680% (see Sect. 7.4). A single dispersion parameter must be used to decrease the running time since the performance of a combination of dispersion parameters is similar (see Table 8). The Xf(ED), Xf(EDμ), or Xf(EDm) indices have a performance in between Xf(EIQR) and Xf(EDAll) for EWVSC1 < 0.85 and otherwise better than Xf(EIQR) (see Fig. 8). Indeed, ED, EDμ, EDm, and EIQR are not defined using squares and so they are less affected by outliers. However, sources near the noise model are better discriminated using those parameters defined with squares. The BAS approach must not be used if the EIQR parameter is used as the selection criterion. The Xf(ED), Xf(EDμ), Xf(EDm), or their combination may be adopted to get a reliable sample (EWVSC1 ~ 0.85) since they have better performance on average than all dispersion parameters tested. On the other hand, Xf(EDAll) or Xf(DAll) may be adopted to get a complete sample since they have a better performance for EWVSC1 ≳ 0.85.

  • A cut-off dependent on the number of measurements may be used as a parameter to select variable stars (see Eq. (21)).

  • The sample selected by correlated or non-correlated indices is not unbiased, i.e. several stochastic variations are enclosed in this selection. The identification of periodic or aperiodic signals may be performed by period finding methods. That will be addressed in forthcoming papers of this project.
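
As an illustration only, the first steps of the procedure above can be sketched in Python. The grouping rule (consecutive measurements separated by at most max_gap belong to one correlation interval), the function names, and the example epochs are assumptions introduced here and are not taken from the paper. The thresholds of 5 and 10 are those quoted in the list; the case of exactly 10 correlations is not specified there, so this sketch arbitrarily assigns it to the Xf index.

```python
from collections import Counter

MIN_CORRELATED = 5         # minimum quoted in the list for correlated indices
SUGGESTED_CORRELATED = 10  # correlated indices suggested only above this


def interval_histogram(times, bin_width):
    """Histogram of the intervals between consecutive observations.

    Returns a dict mapping the lower edge of each bin to its count.
    """
    ordered = sorted(times)
    intervals = [b - a for a, b in zip(ordered, ordered[1:])]
    counts = Counter(int(dt // bin_width) for dt in intervals)
    return {k * bin_width: counts[k] for k in sorted(counts)}


def count_correlation_intervals(times, max_gap):
    """Count groups of at least 2 measurements with consecutive gaps <= max_gap.

    Hypothetical grouping rule: max_gap stands for a fraction of the
    minimum variability period and must be chosen by the user.
    """
    ordered = sorted(times)
    groups = 0
    size = 1
    for a, b in zip(ordered, ordered[1:]):
        if b - a <= max_gap:
            size += 1
        else:
            if size >= 2:
                groups += 1
            size = 1
    if size >= 2:
        groups += 1
    return groups


def correlated_indices_allowed(n_correlated):
    """True if the minimum number of correlated measurements is reached."""
    return n_correlated >= MIN_CORRELATED


def choose_approach(n_correlated):
    """Pick the family of indices suggested for a given number of correlations."""
    if n_correlated > SUGGESTED_CORRELATED:
        return "correlated indices"
    return "Xf index"


if __name__ == "__main__":
    # Invented observation epochs in days, for illustration only.
    epochs = [0.0, 0.01, 1.0, 1.01, 1.02, 5.0, 9.0, 9.01]
    print(interval_histogram(epochs, bin_width=1.0))
    n = count_correlation_intervals(epochs, max_gap=0.05)
    print(n, correlated_indices_allowed(n), choose_approach(n))
```

In practice the value of max_gap would be read off the interval histogram, and the selected sample would still need to be passed to a period finding method, as noted in the last item of the list.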

9. Conclusions

Statistical parameters were analyzed as a tool to discriminate variable stars from noise. We observe that statistics based on an even number of measurements provide better estimations of statistical parameters. Therefore, we propose even-statistics, where only even numbers of measurements are considered. The even-averages gave better results than current averages for many of the distributions analyzed. Therefore, the previous shape and dispersion parameters were tested using even-averages. Next, seven unbound statistical parameters are proposed, i.e. parameters that are independent of the average. In total, 16 new statistical parameters are proposed. These parameters enlarge our inventory of tools to identify non-stochastic variations, which is the main goal of this step of our project.

The new statistical parameters were tested using Monte Carlo simulations, from which we verify that the even-statistical parameters can be used to analyze statistical distributions in the same way as their non-even counterparts. Many even-statistical parameters keep a strong relationship with their counterparts that enables a comparison. The improvement in the accuracy of statistical parameters depends on the distribution analyzed. For many of these parameters the even-parameters display better accuracy (uniform: 7/9 of the statistics improved with even-statistics; normal: 5/9; Ceph: 8/9; RR: 5/9; EB: 2/9). The simulations were also used to estimate a coefficient to adjust the sample size for each dispersion parameter to take into account the dependence of statistical parameters on the number of measurements. These coefficients are extremely important to reduce the mis-selection of sources with few measurements.

Even-statistical parameters, sample size corrections, and a new noise model were used to propose non-correlated indices that can be used on single or multiwavelength observations. The Strateva-modified function proposed in the present paper provides a better model than previous functions, and the sample size coefficients were designed for each statistical parameter to account for its susceptibility to statistical variations. Indeed, the noise characteristics of bright and faint sources are better modelled by the Strateva-modified function. This is extremely important since single or multiwavelength analyses are only possible using a noise model. The dispersion parameters provide similar information but are susceptible to statistical variations that are slightly different. However, the combinations of statistical parameters tested do not significantly improve the discrimination between variable stars and noise. Finally, the non-correlated index was tested using the WFCAMCAL database. The results were compared with those obtained with the standard deviation and Strateva function. The mis-selection rate was reduced by about 520% as a result of our approach. Moreover, the correlated indices were recomputed using the even-mean and we also find a reduction in the mis-selection rate of 18%. From all of the above, we summarize our recommendation to select variable stars from noise.

The first step of this project, where the tools and selection criteria to discriminate variable stars from noise were studied, is now concluded. The next step of this project will study period finding methods and how to use these methods to reduce or remove all mis-selected sources.

Acknowledgments

C.E.F.L. acknowledges a post-doctoral fellowship from the CNPq. N.J.G.C. acknowledges support from the UK Science and Technology Facilities Council. We thank Maria Ida Moretti who pointed us to the Sokolovsky et al. (2017) paper, which was crucial to perform a strict comparison with the most recent results. We also thank the reviewer for his/her thorough review and highly appreciate the comments and suggestions, which significantly contributed to improving the quality of the publication.

References

All Tables

Table 1

Variability statistical analyses in the present work.

Table 2

Excess shape coefficients for even-shape parameters.

Table 3

Ptrue for all statistical parameters analyzed in the present work.

Table 4

Coefficients (b) of Eq. (17).

Table 5

Strateva and Strateva modified parameters (see Eq. (18)) for all dispersion parameters analyzed in the present work for BA.

Table 6

Efficiency metric Etot, i.e. the ratio of the number of selected sources to the total number of WVSC1 variable stars, and EWVSC1, i.e. the ratio of the number of WVSC1 stars selected to the total number of WVSC1 stars, and α values computed from Xf(ED) for each waveband and for using all ZYJHK wavebands using β = 4.

Table 7

Efficiency metric Etot, EWVSC1, and αcor values computed from the analysis of BA photometry for and .

Table 8

EWVSC1 and Etot values for all dispersion parameters analyzed using ZYJHK wavebands.

All Figures

Fig. 1

Uniform (black squares), normal (green asterisk), Ceph (blue diamond), RR (red triangle) and EB (grey plus) distributions with 1000 measurements as a function of the number of elements. The same distributions are shown with different arrangements (see Sect. 2 for more details).

Fig. 2

Parameter eP (left panels) and its comparison with previous statistical parameters (right panels) as a function of the number of measurements (see Table 1). The colours and symbols are the same as those adopted in Fig. 1. In the left panels the results for the full range of measurements are used, while the right panels only show those for odd numbers of measurements, except for EIQR, where only the results with numbers of measurements that are not multiples of 4 are plotted. The solid lines in the even-dispersion diagrams show the models described in Sect. 5.3.

Fig. 3

Parameter eP as a function of the number of measurements for the free even-parameters (see Table 1). The colours are the same as in Fig. 2.

Fig. 4

Dispersion and shape parameters as a function of magnitude where the black dots indicate the WVSC1 stars. The red and dashed black lines indicate the Strateva-modified and Strateva functions, respectively. The maximum number of sources per pixel is shown in brackets in each panel.

Fig. 5

The top panel shows Etot as a function of EWVSC1 for all apertures, and the bottom panel shows the same for all individual wavebands and their combination (ZYJHK wavebands) using BA. The results for each photometric aperture and waveband are shown with different colours. EWVSC1 decreases with Etot leading to a more reliable selection (fewer misclassifications) and vice versa.

Fig. 6

G vs. EWVSC1 using different approaches (see Sect. 7.4). The approach used is named above in each of the upper diagrams. The colours indicate the results for different filters ZYJHK (brown, grey, red, green, and blue lines respectively) and the combination of results found in all bands (black lines). The same colours were also used in Fig. 5 (bottom panel).

Fig. 7

X indices using all wavebands XZYJHK for ED and Dσμ. The X(ED) index was computed using the Strateva function without sample size correction for BA photometry, while X(ED) was computed using the Strateva modified function with sample size correction for the BAS approach. The histograms of the entire sample (black lines) and WVSC1 stars (red lines) are shown at the top right. The WVSC1 stars are also represented by open black circles and the maximum number of sources per pixel is shown in brackets.

Fig. 8

Etot vs. EWVSC1 (left panel) for the best selection criteria and 1/η as a function of the mean time interval between the measurements ΔT (right panel). The left panel shows the statistical parameters X(EIQR) (blue lines), X(Dσμ) (red lines), and X(ED) (dark lines) for the ZYJHK (full lines) and K (dashed lines) wavebands. In the same diagram the results found using the mean (full grey line) and even-mean (dashed grey line) are also plotted. The results for normal, uniform, Ceph, RR, and EB simulated distributions are shown in the right panel. The colours indicate the different parameters or distributions analyzed, which are indicated at the top of each diagram. In the right panel we show how 1/η varies with the selection of ΔT. The box plot indicates the quartiles of the WFCAM and WVSC1 samples, where each box encloses 50% of the sample under the grouping algorithm that defines ΔT in Sokolovsky et al. (2017).
