New insights into time series analysis
II  Noncorrelated observations
^{1} SUPA (Scottish Universities Physics Alliance) WideField Astronomy Unit, Institute for Astronomy, School of Physics and Astronomy, University of Edinburgh, Royal Observatory, Blackford Hill, Edinburgh EH9 3HJ, UK
email: ferreiralopes1011@gmail.com
^{2} Departamento de Física, Universidade Federal do Rio Grande do Norte, RN, 59072970 Natal, Brazil
^{3} National Institute For Space Research (INPE/MCTI), Av. dos Astronautas, 1758 – SP, 12227010 São José dos Campos, Brazil
Received: 21 November 2016
Accepted: 4 June 2017
Context. Statistical parameters are used to draw conclusions in a vast number of fields such as finance, weather, industry, and science. These parameters are also used to identify variability patterns in photometric data in order to select nonstochastic variations that are indicative of astrophysical effects. New, more efficient selection methods are mandatory to analyze the huge amount of astronomical data.
Aims. We seek to improve the current methods used to select nonstochastic variations on noncorrelated data.
Methods. We used standard and new data-mining parameters to analyze noncorrelated data to find the best way to discriminate between stochastic and nonstochastic variations. A new approach that includes a modified Strateva function was used to select nonstochastic variations. Monte Carlo simulations and public time-domain data were used to estimate its accuracy and performance.
Results. We introduce 16 modified statistical parameters covering different features of statistical distributions such as average, dispersion, and shape parameters. Many dispersion and shape parameters are unbound parameters, i.e. equations that do not require the calculation of an average. Unbound parameters are computed with a single loop, which decreases the running time. Moreover, the majority of these parameters have lower errors than previous parameters, which is mainly observed for distributions with few measurements. A set of noncorrelated variability indices, sample size corrections, and a new noise model along with tests of different apertures and cutoffs on the data (BAS approach) are introduced. The number of misselections is reduced by about 520% using a single waveband and 1200% combining all wavebands. In addition, the evenmean also improves the correlated indices introduced in Paper I. The misselection rate is reduced by about 18% if the evenmean is used instead of the mean to compute the correlated indices in the WFCAM database. Evenstatistics allows us to improve the effectiveness of both correlated and noncorrelated indices.
Conclusions. The selection of nonstochastic variations is improved by noncorrelated indices. The evenaverages provide a better estimation of mean and median for almost all statistical distributions analyzed. The correlated variability indices, which are proposed in the first paper of this series, are also improved if the evenmean is used. The evenparameters will also be useful for classifying light curves in the last step of this project. We consider that the first step of this project, where we set new techniques and methods that provide a huge improvement on the efficiency of selection of variable stars, is now complete. Many of these techniques may be useful for a large number of fields. Next, we will commence a new step of this project regarding the analysis of period search methods.
Key words: methods: data analysis / methods: statistical / techniques: photometric / astronomical databases: miscellaneous / stars: variables: general / infrared: general
© ESO, 2017
1. Introduction
Statistical analysis is a vital concept in our lives because it is used to understand what is going on and thereby enables us to make decisions. These kinds of analyses are also used to assess theoretical models by experiments that are limited by experimental factors, leading to uncertainty. Measurements are usually performed many times to increase the confidence level. The results are summarized by statistical parameters in order to communicate the largest amount of information as simply as possible. Statisticians commonly describe the observations by averages (e.g. arithmetic mean, median, mode, and interquartile mean), dispersion (e.g. standard deviation, variance, range, interquartile range, and absolute deviation), shape of the distribution (e.g. skewness and kurtosis), and a measure of statistical dependence (e.g. Spearman’s rank correlation coefficient). These parameters are used in finance, weather, industry, experiments, science and in several other areas to characterize probability distributions. New insights on this topic should be valuable in natural sciences, technology, economy, and quantitative social science research.
Improvements on data analysis methods are mandatory to analyze the huge amount of data collected in recent years. Large volumes of data with potential scientific results are left unexplored or delayed owing to current inventory tools that are unable to produce clear samples. In fact, we risk wasting the potential of a large part of these data despite efforts that have been undertaken (e.g. von Neumann 1941, 1942; Welch & Stetson 1993; Stetson 1996; Enoch et al. 2003; Kim et al. 2014; Sokolovsky et al. 2017). Current techniques of data processing can be improved considerably. For instance, the flux independent index that we proposed in a previous paper reduces the misselection of variable sources by about 250% (Ferreira Lopes & Cross 2016). A reliable selection of astronomical databases allows us to put forward faster scientific results such as those enclosed in many current surveys (e.g. Kaiser et al. 2002; Udalski 2003; Pollacco et al. 2006; Baglin et al. 2007; Hoffman et al. 2009; Borucki et al. 2010; BailerJones et al. 2013; Minniti et al. 2010). The reduction of misclassification at the selection step is crucial to follow up the development on the instruments themselves.
Fig. 1 Uniform (black squares), normal (green asterisks), Ceph (blue diamonds), RR (red triangles) and EB (grey plus signs) distributions with 1000 measurements as a function of the number of elements. The same distributions are shown with different arrangements (see Sect. 2 for more details).

The current project discriminates between correlated and noncorrelated observations to set the best efficiency for selecting variable objects in each data set. Moreover, Ferreira Lopes & Cross (2016) establish criteria that allow us to compute confidence variability indices if the interval between measurements used to compute statistical correlations is a small fraction of the variability periods and of the interval between correlated groups of observations. On the other hand, the confidence level of statistical parameters increases with the number of measurements. Improvements to statistical parameters for which there are few measurements are crucial to analyze surveys such as Pan-STARRS, with a mean of about 12 measurements in each filter, and the extended VVV project (VVVX), with between 25 and 40 measurements (Chambers et al. 2016; Minniti et al. 2010). Sokolovsky et al. (2017) tested 18 parameters (8 scatter-based parameters and 10 correlation-based parameters), comparing their performance. According to the authors, the correlation-based indices are more efficient in selecting variable objects than the scatter-based indices for data sets containing hundreds of measurement epochs or more. The authors proposed a combination of the interquartile range (IQR – Kim et al. 2014) and the von Neumann ratio (1 /η – von Neumann 1941, 1942) as a suitable way to select variable stars. A maximum interval of two days between measurements is used to select those used to compute the correlated indices. This value is greater than the limit required to compute wellcorrelated measurements. Moreover, the efficiency should take into account the number of wellcorrelated measurements instead of simply the number of epochs. The authors may not have considered our correlated indices (Ferreira Lopes & Cross 2016) because these require a time bin shorter than the smallest variability period; this requirement is what allows us to get accurate correlated indices for time series with few correlated measurements.
The constraints and data used by Sokolovsky et al. (2017) limit a straightforward comparison between correlated and noncorrelated indices performed by the authors. Therefore, the approach used to perform the variability analysis may only be chosen after examining the time interval among the measurements (see Sect. 8).
Statistical parameters such as standard deviation and kurtosis as a function of magnitude have been used as the primary way to select variable stars (e.g. Cross et al. 2009). This method assumes that, for the same magnitude, stochastic and nonstochastic variations have different statistical properties. To compute all current dispersion parameters and almost all shape parameters, the averages must also be calculated, which increases the uncertainties and processing time. However, statistical properties still exist even where averages are unknown. In this fashion, Brys et al. (2004) proposed a robust measure of skewness called the “medcouple”, based on a comparison of quartiles and pairs of measurements, which allows us to compute this parameter without using averages. The medcouple has a long running time, however, since the number of pairs to be compared grows quadratically with the number of measurements in a direct implementation. Nevertheless, we can use a similar idea to propose new averages, dispersions, and shape parameters that have a smaller running time.
This work is the second in a series about new insights into time series analysis. In the first paper we assessed the discrimination of variable stars from noise for correlated data using variability indices (Ferreira Lopes & Cross 2016). In this work, we analyze new statistical parameters and their accuracy in comparison with previous parameters to increase the capability to discriminate between stochastic and nonstochastic distributions. We also look into their dependence with the number of epochs to determine statistical weights to improve the selection criteria. Lastly we use a noise model to propose a new noncorrelated variability index. Forthcoming papers will study how to use the full current inventory of period finding methods to clean the sample selected by variability indices.
The notation used is described in Sect. 2 and next we suggest a new set of statistics in Sect. 3. In Sects. 5 and 6 the new parameters are tested and a new approach to model the noise and select variable stars is proposed. Next, the selection criteria are tested on real data in Sect. 7. Finally, we summarize and provide our conclusions in Sect. 9.
2. Notation
Let Y′ := y′_{1} ≤ y′_{2} ≤ ... ≤ y′_{N′} be a sample of N′ measurements sorted in ascending order, from which the kernel function is defined by
$$Y = \begin{cases} Y' & \text{if } N' \text{ is even}, \\ Y' \setminus \{ y'_{\mathrm{Int}(N'/2)+1} \} & \text{if } N' \text{ is odd}, \end{cases} \qquad (1)$$
where Int(N′/ 2) means the integer part (floor) of half the number of measurements. The smallest contribution to the statistical parameters for symmetric distributions is given by y′_{Int(N′/ 2) + 1} since it is the nearest measurement to the average. The value Y is a sample that has an even number of measurements (N) that can also be divided into the subsamples Y^{−} and Y^{+} composed of measurements y_{i} ≤ y_{N/ 2} and y_{i}>y_{N/ 2}, respectively. Different arrangements can be taken into consideration, such as

P1 – unsystematic y_{i} values;
P2 – increasing order of Y^{−} and Y^{+};
P3 – decreasing order of Y^{−} and increasing order of Y^{+}.
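The kernel and the arrangements above can be sketched in Python as follows. This is a minimal illustration, not the implementation used in this work: the function names are hypothetical, and the split into Y^{−} and Y^{+} is done by rank (the lower and upper halves of the sorted sample), which is an assumption when tied values occur.

```python
import numpy as np

def even_kernel(y_prime):
    """Sort the sample and, if its size is odd, drop the central value
    y'_{Int(N'/2)+1}; return the lower and upper halves (Y^-, Y^+)."""
    y = np.sort(np.asarray(y_prime, dtype=float))
    n = y.size
    if n % 2 == 1:
        # 0-based index n // 2 is the (Int(N'/2) + 1)-th sorted value
        y = np.delete(y, n // 2)
    half = y.size // 2
    return y[:half], y[half:]

def arrangement_p2(y_minus, y_plus):
    """P2: increasing order of Y^- followed by increasing order of Y^+."""
    return np.concatenate([y_minus, y_plus])

def arrangement_p3(y_minus, y_plus):
    """P3: decreasing order of Y^- followed by increasing order of Y^+."""
    return np.concatenate([y_minus[::-1], y_plus])

y_minus, y_plus = even_kernel([5.0, 1.0, 4.0, 2.0, 3.0])
print(y_minus, y_plus)                  # lower and upper halves
print(arrangement_p3(y_minus, y_plus))  # mirrored arrangement
```

For the five-element example the central value 3.0 is withdrawn, leaving two halves with two measurements each.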
Five distributions were used to test our approach. These distributions were generated to model both variable stars and noise. Uniform and normal distributions that can mimic noise were generated by the IDL function RANDOMU. This function returns pseudo-random numbers that are uniformly distributed or, optionally, drawn from a normal distribution; a mean value of 0.5 and full width at half maximum (FWHM) equal to 0.1 were assumed for the normal distributions to provide a range of values from about 0 to 1. In addition, we generated Cepheid (Ceph), RR Lyrae (RR), and eclipsing-binary (EB) distributions that are similar to typical variable stars. Ceph, RR, and EB models were based on the OGLE light curves OGLE LMCSC14 109671, OGLE LMCSC21 59535, and OGLE LMCSC2 180186, respectively. These models were generated in two steps: first a harmonic fit was used to create a model and, second, these distributions were sampled at random points to get the measurements.
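A rough Python analogue of this set-up is sketched below, using NumPy instead of IDL. The FWHM is converted to a standard deviation with the usual Gaussian relation σ = FWHM / (2√(2 ln 2)). The harmonic coefficients are purely hypothetical placeholders and are not the fits to the OGLE light curves quoted above.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Noise-like distributions
uniform = rng.uniform(0.0, 1.0, size=n)
fwhm = 0.1
sigma = fwhm / (2.0 * np.sqrt(2.0 * np.log(2.0)))  # Gaussian FWHM -> sigma
normal = rng.normal(loc=0.5, scale=sigma, size=n)

def harmonic_model(phase, a0, amps, phis):
    """Fourier series a0 + sum_k amps[k] * sin(2*pi*(k+1)*phase + phis[k])."""
    amps = np.asarray(amps, dtype=float)[:, None]
    phis = np.asarray(phis, dtype=float)[:, None]
    k = np.arange(1, amps.shape[0] + 1)[:, None]
    return a0 + np.sum(amps * np.sin(2.0 * np.pi * k * phase[None, :] + phis), axis=0)

# Variable-star-like distribution: the model is sampled at random phases.
# The coefficients below are hypothetical, chosen only for illustration.
phase = rng.uniform(0.0, 1.0, size=n)
variable = harmonic_model(phase, a0=0.5, amps=[0.3, 0.1], phis=[0.0, 1.0])
```

Sampling the model at random phases mirrors the second step described above; a real test would replace the placeholder coefficients with those of an actual harmonic fit.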
Figure 1 shows the P1 − 3 arrangements for uniform (black squares), normal (green asterisks), Ceph (blue diamonds), RR (red triangles) and EB (grey plus signs) distributions. The colours and symbols used in this diagram were adopted throughout the paper to facilitate their identification. The symmetry found in the P2 and P3 yields unique information about the shape and dispersion parameters. Therefore, the measurements of Y^{−} and Y^{+} are combined to propose a new set of statistical parameters (see Table 1).
3. Evenstatistic (E)
Nonparametric statistics are not based on parametrized probability distributions, i.e. their interpretation does not depend on the fitting of such distributions. The typical parameters used for descriptive and inferential purposes are the mean, median, standard deviation, skewness, and kurtosis, among others. These parameters are defined to be a function of a sample that has no dependency on a parameter, i.e. their values are the same for any type of arrangement, for instance P1 − 3 (see Fig. 1). All dispersion and almost all shape parameters are dependent on some kind of average, i.e. they describe distributions around average values. However, the dispersion and shape parameters still exist even where the averages are unknown. For instance, the standard deviation and absolute deviation provide an estimate of dispersion about the mean value. Therefore, we propose the evenstatistic as an alternative tool to assess dispersion and shape parameters based only on the measurements, i.e. unbound estimates.
The most appropriate way to compute shape parameters is with an even number of measurements if we consider that the same number of measurements should appear on both sides of the distribution for appropriate comparison. For instance, consider a distribution with an even number of measurements, where the mean value can be computed as $A_{\mu} = \frac{1}{N}\sum_{i=1}^{N/2} y_{i} + \frac{1}{N}\sum_{i=N/2+1}^{N} y_{i}$ and the first and second terms are the weights of the left and right sides of the distribution with the same number of elements. On the other hand, for odd numbers of measurements, the weights for both sides of the distribution are not equivalent. The single measurement that can be withdrawn to correct such variation is y′_{Int(N/ 2) + 1} since if we withdraw any of the other y′_{i} measurements we would increase the difference. The kernel given by Eq. (1) provides samples with even numbers of measurements and hence the same weight for both sides. The parameters proposed with Eq. (1) are called “even parameters”. Moreover, the even number of measurements allows us to compare single pairs of measurements among Y^{−} and Y^{+} and therefore to estimate dispersion and shape values without taking the average into account. The new statistical parameters to compute averages (see Sect. 3.1), dispersions (see Sect. 3.2), and shapes (see Sect. 3.3) are described below.
Table 1 Variability statistical analyses in the present work.
3.1. Averages (A)
Considering the kernel given by Eq. (1) (see Sect. 2), we propose new average statistics given by
$$EA_{\mu} = \frac{1}{N}\sum_{i=1}^{N} y_{i} \qquad (2)$$
and
$$EA_{m} = \frac{y_{N/2} + y_{N/2+1}}{2}, \qquad (3)$$
where EA_{μ} and EA_{m} are named the evenmean and evenmedian. These expressions mimic the mean (A_{μ}) and median (A_{m}). Moreover, EA_{μ} = A_{μ} and EA_{m} = A_{m} when the number of measurements (N′) is an even number. A comparison between these measurements is performed in Sect. 5.1.
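As an illustration, the evenaverages can be sketched in Python under the assumption that the evenmean is the arithmetic mean of the kernel sample Y and the evenmedian is the mean of its two central values. The function names are hypothetical.

```python
import numpy as np

def even_sample(y_prime):
    """Kernel of Eq. (1): sorted sample with the central value removed if N' is odd."""
    y = np.sort(np.asarray(y_prime, dtype=float))
    if y.size % 2 == 1:
        y = np.delete(y, y.size // 2)
    return y

def even_mean(y_prime):
    """Assumed form: arithmetic mean of the kernel sample Y."""
    return even_sample(y_prime).mean()

def even_median(y_prime):
    """Assumed form: mean of the two central values of the kernel sample Y."""
    y = even_sample(y_prime)
    h = y.size // 2
    return 0.5 * (y[h - 1] + y[h])

odd = [1.0, 2.0, 10.0, 4.0, 5.0]
print(np.mean(odd), even_mean(odd))      # mean versus evenmean
print(np.median(odd), even_median(odd))  # median versus evenmedian
```

For an even-sized sample both functions reduce to the ordinary mean and median, in agreement with the statement above.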
3.2. Dispersion parameters (D)
Statistical dispersion is used to measure the amount of sample variance and it is computed using the absolute or squared value of the distance between the measurements and the average. Improving the estimation of averages can provide better accuracy of dispersion parameters such as the mean standard deviation (D_{σμ}), median standard deviation (D_{σm}), mean absolute standard deviation (D_{μ}), and median absolute standard deviation (D_{m}). Therefore we propose the evendispersion parameters that are computed using the evenaverages (see Table 1): the evenmean standard deviation (ED_{σμ}), the evenmedian standard deviation (ED_{σm}), the evenmeanabsolute deviation (ED_{μ}), and the evenmedianabsolute deviation (ED_{m}). The accuracy of these parameters is assessed in Sect. 5.1. We note that these parameters return the same values as the previous parameters for even numbers of measurements.
Moreover, using the single combination between measurements Y^{−} with Y^{+}, we can also estimate the amount of variation or dispersion of a sample. From the kernel given by Eq. (1) we propose the evenabsolute deviation (ED), a dispersion parameter defined by Eq. (4), whose two parts are equal. The name evenabsolute deviation reflects the fact that such a sum is never negative, i.e. (y_{N − i} − y_{i}) ≥ 0. Moreover, a simple identity, Eq. (5), is found for distributions with an even number of measurements, since (y_{N − i} − EA_{m}) ≥ 0 and (y_{i} − EA_{m}) ≤ 0. We can also mimic the standard deviation by proposing two new evendispersion parameters, ED_{(1)} and ED_{(2)}, given by Eqs. (6) and (7). The evendispersion parameters ED, ED_{(1)}, and ED_{(2)} are unbound, i.e. they are not dependent on the average. These parameters allow us to describe the dispersion of a distribution instead of the dispersion about an average. Moreover, a strict relationship between each of ED_{(1)} and ED_{(2)} and ED_{σm} is found when we have even numbers of measurements (Eqs. (8) and (9), respectively, where Cov denotes covariance). The second term in these equations is additive since the covariance among Y^{−} and Y^{+} is negative.
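The idea of an unbound dispersion can be sketched as follows. The sketch assumes that ED is the sum of the mirrored-pair differences between Y^{+} and Y^{−} divided by N; this normalization is an assumption. With that choice, ED coincides with the mean absolute deviation about the evenmedian, because each pair contributes (y_{+} − EA_{m}) − (y_{−} − EA_{m}) and the evenmedian cancels.

```python
import numpy as np

def even_sample(y_prime):
    """Kernel of Eq. (1): sorted sample with the central value removed if N' is odd."""
    y = np.sort(np.asarray(y_prime, dtype=float))
    if y.size % 2 == 1:
        y = np.delete(y, y.size // 2)
    return y

def even_absolute_deviation(y_prime):
    """Assumed form of ED: (1/N) * sum over mirrored pairs of (upper - lower).
    No average is needed; one pass over the N/2 pairs of the sorted sample."""
    y = even_sample(y_prime)
    n = y.size
    total = 0.0
    for i in range(n // 2):
        total += y[n - 1 - i] - y[i]  # never negative for a sorted sample
    return total / n

def abs_deviation_about_even_median(y_prime):
    """Bound counterpart: mean absolute deviation about the evenmedian."""
    y = even_sample(y_prime)
    h = y.size // 2
    ea_m = 0.5 * (y[h - 1] + y[h])
    return np.mean(np.abs(y - ea_m))

data = [1.0, 2.0, 10.0, 4.0, 5.0]
print(even_absolute_deviation(data), abs_deviation_about_even_median(data))
```

Both functions return the same number for any sample, which illustrates why the dispersion can be obtained without first computing an average.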
Asymptotically the identities given by Eqs. (5), (8), and (9) are also valid for odd numbers of measurements since ED_{m} ≃ D_{m} and ED_{σm} ≃ D_{σm}. Similar identities that link evendispersion parameters with their correspondents can be found using A_{μ}, A_{m}, EA_{μ}, and EA_{m}.
The dispersion of a distribution given by Eqs. (6) and (7) is the standard deviation about the averages minus two times the covariance among Y^{−} and Y^{+}. Moreover, for symmetric distributions where y_{N − i} − EA_{m} = − (y_{i} − EA_{m}) and EA_{m} = EA_{μ}, we can write the identity given by Eq. (10). The ratio of ED_{(1)} to ED_{σm} can be used to estimate whether the measurements are symmetrically distributed.
3.3. Shape parameters (S)
In a similar fashion to the dispersion parameters, we can also improve the accuracy of skewness (S_{S}) and kurtosis (S_{K}) using the evenaverages. Therefore, we propose evenskewness (ES) and evenkurtosis (EK) to estimate the distribution shape (see lines 11−12 of Table 1). Moreover, we also propose higher moments of ED_{(1)} and ED_{(2)} as new evenshape parameters, ES_{(1 − 2)} and EK_{(1 − 2)}, given by Eqs. (11)−(14). These are unbound parameters, i.e. they are independent of the average. The values ES_{(1 − 2)} mimic the skewness while EK_{(1 − 2)} mimic the kurtosis. A strict relationship between ES_{(1 − 2)} and ES, and between EK_{(1 − 2)} and EK, is very complicated since such definitions use distinct dispersion parameters. We can also use other dispersion parameters to broaden the list of evenshape parameters.
3.4. Excess shape
The integration of the Gaussian distribution returns a kurtosis equal to 3 as N → ∞. An adjusted version of Pearson’s kurtosis, the excess kurtosis, which is the kurtosis minus 3 is most commonly used. Some authors refer to the excess kurtosis as simply the “kurtosis”. For example, the kurtosis function in IDL language is actually the excess kurtosis. The excess values for evenshape parameters were determined in the same fashion as the excess kurtosis, i.e. shift these values to zero for normal distributions. Therefore, 10^{5} Monte Carlo simulations using the normal distributions with 10^{6} measurements were performed to determine the excess shape.
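The procedure can be illustrated for the ordinary kurtosis with a scaled-down Python sketch. It uses far fewer and smaller simulations than the 10^{5} runs of 10^{6} measurements quoted above, and the moment-based kurtosis estimator is an assumption about the exact definition. The mean over simulations of normal samples gives the offset to subtract, which is close to 3 in this case.

```python
import numpy as np

def kurtosis(y):
    """Moment-based (Pearson) kurtosis: m4 / m2**2."""
    d = np.asarray(y, dtype=float)
    d = d - d.mean()
    return np.mean(d**4) / np.mean(d**2) ** 2

rng = np.random.default_rng(0)
n_sim, n_meas = 200, 10_000  # reduced sizes, for illustration only

values = np.array([kurtosis(rng.normal(size=n_meas)) for _ in range(n_sim)])
offset = values.mean()                        # excess-shape coefficient
scatter = values.std() / np.sqrt(n_sim)       # uncertainty of the offset

def excess_kurtosis(y):
    """Kurtosis shifted so that normal distributions give about zero."""
    return kurtosis(y) - offset

print(offset, scatter)
```

The same recipe would apply to any evenshape parameter: replace `kurtosis` by the parameter of interest and use the resulting offset as its excess coefficient.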
Table 2 Excess shape coefficients for evenshape parameters.
Table 2 shows the averages for the evenshape parameters and their errors. The amount of variation is less than 0.1%. The excess shape values were added to the equations for evenshape parameters (see Table 1), rounding to two decimal places.
4. Even interquartile range
The interquartile range (IQR – Kim et al. 2014) is also included in our analysis since it was reported as one of the best statistical parameters to select variable stars (Sokolovsky et al. 2017). The IQR uses the inner 50% of measurements, excluding the 25% brightest and 25% faintest flux measurements, i.e. first the median value is computed in order to divide the set of measurements into upper and lower halves and then the IQR is given by the difference between the median values of the upper and lower halves.
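The median-of-halves recipe described above can be sketched in Python as follows. For an odd number of measurements the central value is left out of both halves, which is one common convention and an assumption here; the function name is hypothetical.

```python
import numpy as np

def iqr_by_halves(y):
    """IQR as the difference between the medians of the upper and lower halves."""
    y = np.sort(np.asarray(y, dtype=float))
    n = y.size
    half = n // 2
    lower = y[:half]
    upper = y[n - half:]  # for odd n the central value belongs to neither half
    return np.median(upper) - np.median(lower)

print(iqr_by_halves([1, 2, 3, 4, 5, 6, 7, 8]))
print(iqr_by_halves([1, 2, 3, 4, 5, 6, 7]))
```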
The median value is improved using the evenmedian (see Sect. 5.1) and hence the IQR can also be improved using the evenmedian instead of the median. Therefore, using the kernel defined by Eq. (1), an even interquartile range (EIQR) is proposed in Eq. (15), where Y^{+} and Y^{−} are the measurements of Y above and below the median, respectively (see Sect. 2 for a better description). The EIQR provides an adjustment for distributions with an even or odd number of measurements. Three, two, or zero adjustments are performed for distributions whose number of measurements is odd, even (but not a multiple of 4), or a multiple of 4, respectively.
5. Simulating distributions
Monte Carlo simulations were used to test the evenstatistical parameters for a number of measurements (number of epochs) varying from 10 to 100. We performed 10^{5} simulations for each number of measurements. The range of measurements tested typifies light curves from current large wide-field, multi-epoch surveys such as Pan-STARRS, VVV, and Gaia. These simulations were performed for the five distributions described in Sect. 2 that mimic noise and variable stars; the colours and symbols in Fig. 1 are also adopted in the present section to facilitate the identification of these distributions. The statistical parameters have a higher statistical significance for distributions that have a large number of measurements, where the addition of measurements only leads to small fluctuations. Therefore, the adopted “true parameter” values (P_{true}) are those computed using 10^{6} measurements. We performed 10^{5} simulations to compute the P_{true} values. All parameters analyzed are listed in Table 3; their errors were computed as the standard deviation. This value is used as a reference to analyze the error e_{P} given by Eq. (16), where P means the statistical parameter. This expression provides the mean error for P. In order to avoid singularities we shift the skewness and kurtosis values by one, i.e. P = P′ + 1 (and likewise for P_{true}), since they have P_{true} ~ 0.
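A scaled-down version of such a simulation is sketched below. It assumes that the error is the mean absolute deviation of the estimates from P_{true}, divided by P_{true}; this form, the helper names, and the reduced number of simulations are all assumptions made for illustration. The exact mean of the uniform distribution (0.5) is used as P_{true}.

```python
import numpy as np

def draw_uniform(rng, n_meas):
    """One simulated sample from the uniform distribution on [0, 1)."""
    return rng.uniform(0.0, 1.0, size=n_meas)

def relative_error(estimator, sampler, n_meas, p_true, n_sim, rng):
    """Assumed error measure: mean |P - P_true| / |P_true| over n_sim samples."""
    estimates = np.array([estimator(sampler(rng, n_meas)) for _ in range(n_sim)])
    return np.mean(np.abs(estimates - p_true)) / abs(p_true)

rng = np.random.default_rng(1)
p_true = 0.5   # exact mean of U(0, 1)
n_sim = 5000   # far fewer than the 10^5 simulations used in the text

e_10 = relative_error(np.mean, draw_uniform, 10, p_true, n_sim, rng)
e_100 = relative_error(np.mean, draw_uniform, 100, p_true, n_sim, rng)
print(e_10, e_100)
```

The error shrinks as the number of measurements grows, which is the behaviour the simulations in this section quantify for each parameter and distribution.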
Table 3 P_{true} for all statistical parameters analyzed in the present work.
Fig. 2 Parameter e_{P} (left panels) and its comparison with previous statistical parameters (right panels) as a function of the number of measurements (see Table 1). The colours and symbols are the same as those adopted in Fig. 1. In the left panels the results for the full range of measurements are used, while the right panels only show those for odd numbers of measurements except for EIQR, where only the results with numbers of measurements that are not multiples of 4 are plotted. The solid lines in the evendispersion diagrams show the models described in Sect. 5.3.

5.1. Bound evenstatistical parameters
The bound evenstatistical parameters are those dependent on the average. These parameters only differ from the previous parameters in that the mean and median are replaced by the evenmean and evenmedian. Figure 2 shows e_{P} (see Eq. (16)) for the evenparameters (see Table 1, 1−6 and 10−12) and its comparison with previous parameters (mean, median, mean standard deviation, median standard deviation, mean absolute deviation, median absolute deviation, skewness, and kurtosis) as a function of the number of measurements in the left and right panels, respectively. The left panels include the results of simulations for the whole range of measurements while the right panels only have the results for odd numbers of measurements because for even numbers of measurements the current and evenstatistical parameters have the same values. Therefore, we only use the results for odd numbers of measurements, i.e. 11,13,15,··· , where e_{P}/e_{P′}< 1 means a higher accuracy for the new parameters compared to current parameters while e_{P}/e_{P′}> 1 means no improvement with the new parameters. The simulations were performed as described in Sect. 5, from which we can observe that

EA_{μ} – The normal and EB distributions show similar behaviour, and separately the Ceph, RR, and uniform distributions are also similar to each other. The EA_{μ} returns the lowest errors for the normal distribution. Indeed, e_{EAμ}/e_{Aμ} ≃ 1 for normal distributions while e_{EAμ}/e_{Aμ} ≃ 1.07 (for 10 measurements) for the EB distribution. This happens because the dispersion about the mean is symmetric for the normal distribution and extremely asymmetric for the EB distribution. The error for 10 measurements for all distributions is about twice that found for 100 measurements on average. The evenmean is more accurate than the mean for all distributions except the EB distribution. For instance, the evenmean returns an improved accuracy of between ~4% and ~8% for Ceph, RR, and uniform distributions over that found by the mean. On the other hand, the mean is better than the evenmean by a similar rate for the EB distribution.

EA_{m}− The e_{EAm} for normal and EB distributions are similar but offset by roughly a multiplicative factor. The EB distribution e_{EAm} values are about twice those found for the normal distribution for the same reason as discussed for e_{EAμ}. The evenmean for the RR distribution has e_{EAμ}/e_{Aμ}< 0.87 for N< 30 measurements, i.e. an increase in the accuracy of about ~7%. Indeed, EA_{m} along with EA_{μ} are more accurate than their previous definitions for the whole range of measurements and for four out of five distributions analyzed.

ED_{σμ}− The EB distribution has the highest e_{EDσμ} values whereas the uniform, Ceph, and RR distributions show similar values. The same behaviour is also observed in the e_{EDσm} diagram. The D_{σμ} is more accurate than ED_{σμ} for the whole range of measurements and distributions analyzed despite the improvement on the estimation of the mean, but the difference is less than ~0.2% for Ceph distribution and less than ~0.1% otherwise.

ED_{σm}− This value is more accurate than D_{σm} for the whole range of measurements and all distributions analyzed. The normal and EB distributions show e_{EDσm}/e_{Dσm} ≃ 1 for the whole range of measurements analyzed. On the other hand, the Ceph and EB distributions show e_{EDσm}/e_{Dσm} ≃ 0.95 for fewer than 30 measurements. The Ceph distribution has the lowest value of e_{EDσm}/e_{Dσm}, the opposite of that found for ED_{σμ}.

ED_{μ} and ED_{m}− The e_{EDμ} and e_{EDm} are similar, with the EB distribution returning the highest values, while the other distributions return similar values to each other. The e_{EDμ}/e_{Dμ} shows values of less than one for all distributions except the EB distribution. On the other hand, e_{EDm}/e_{Dm} shows values of less than one for uniform and Ceph distributions, about one for the normal distribution, and greater than one for RR and EB distributions. The largest values are e_{EDμ}/e_{Dμ} ≃ 1.005 and e_{EDm}/e_{Dm} ≃ 1.023.

EIQR− The greatest variation in the accuracy of statistical dispersion parameters is found for EIQR. An improvement of about 10% is found for uniform, normal, RR, and Ceph distributions and, conversely, a decrease in accuracy of about 10% is found for the EB distribution. All distributions with more than ~30 measurements show an improvement on, or a similar accuracy to, that found for IQR. The repeating pattern found among each consecutive set of three measurements is related to the number of adjustments performed by the evenmedian (see Sect. 4).

ES and EK− The shape parameters have the highest uncertainties among the statistical parameters analyzed. The lowest values of e_{ES} and the highest values of e_{EK} are both found for the RR distribution. The e_{ES}/e_{S}< 1 for all distributions except the EB and RR distributions while e_{EK}/e_{K}> 1 for all distributions except the Ceph distribution. Moreover e_{ES} and e_{EK} have an uncertainty greater than ~10% up to 50 measurements.
To summarize, the accuracy of statistical parameters has a strong dependence on the number of measurements and distribution type. The evenmean and evenmedian are more accurate than the mean and median for all distributions analyzed except for the EB distribution. The improvements in the estimation of averages by evenstatistics allow us to improve the estimation of dispersion and shape parameters for many of the distributions analyzed. This is mainly observed for distributions where the probability of finding measurements near P_{true} is lower. As a result, e_{Peven}/e_{P} ≃ 1 for the normal distribution. The evenstatistical parameters are strongly dependent on the distribution shape and so they can be useful for discriminating distribution types. Therefore a study about how to classify distributions using evenstatistical parameters will be performed in a later paper from this project.
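The comparison between the evenmean and the mean for an odd number of measurements can be explored with a small vectorized simulation. The sketch assumes the same mean relative absolute error as a stand-in for e_{P}, uses a uniform distribution with 11 measurements, and takes the exact mean 0.5 as P_{true}; it is only illustrative and is not meant to reproduce the numbers quoted above.

```python
import numpy as np

rng = np.random.default_rng(2)
n_meas, n_sim, p_true = 11, 20_000, 0.5

# n_sim independent samples, one per row
samples = rng.uniform(0.0, 1.0, size=(n_sim, n_meas))

# Even kernel applied row by row: sort and drop the central value (N' is odd)
kernel = np.delete(np.sort(samples, axis=1), n_meas // 2, axis=1)

# Assumed error measure: mean |P - P_true| / P_true
e_mean = np.mean(np.abs(samples.mean(axis=1) - p_true)) / p_true
e_even = np.mean(np.abs(kernel.mean(axis=1) - p_true)) / p_true

ratio = e_even / e_mean  # values below one favour the evenmean
print(e_mean, e_even, ratio)
```

Repeating the run with a different sampler (for example a normal distribution) shows how the ratio depends on the distribution shape.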
Fig. 3 Parameter e_{P} as a function of the number of measurements for the free evenparameters (see Table 1). The colours are the same as in Fig. 2. 

5.2. Unbound evenstatistical parameters
Unbound evenstatistical parameters keep some relations with their counterparts for particular limits and distributions (see Sect. 3.2). Such relations are not valid in general and therefore these parameters do not have a counterpart for comparison. Monte Carlo simulations, such as those performed in Sect. 5.1, were used to estimate the relative error of the unbound evenstatistical parameters.
The unbound evenstatistical parameters display e_{P} values similar to those found for previous parameters (see Fig. 3); the bound (see Sect. 5.1) and unbound evendispersion parameters show similar e_{P} values, while the evenshape parameters return smaller errors than the skewness and kurtosis. This means that the errors are comparable with those found for the common statistical parameters used to describe distributions. Moreover, the evenstatistical parameters take different values for the different distributions analyzed and so these parameters can be used to discriminate between such distributions (see Table 3). These values are for the distributions described in Sect. 2, which were generated to have the same amplitude; the values are different if, for instance, the amplitude is modified.
The bound and unbound evenstatistical parameters (see Table 1) have a similar accuracy to previous statistical parameters and hence they can be used to characterize statistical distributions in a similar fashion to previous parameters. Such parameters can be used to describe and differentiate distribution types. A more detailed investigation of how to use these parameters to describe various distributions will be performed in a forthcoming paper of this project.
5.3. Coefficients adjusted for sample size
The adjusted coefficients for sample size are used because samples with few measurements have larger fluctuations in the estimated parameters. For instance, the Fisher-Pearson coefficient (given by $\sqrt{N/(N-1)}$) for a sample with 10 and 100 measurements is 1.054 and 1.005, respectively. As a result, this correction increases the value if the skewness is positive, and makes the value more negative if the skewness is negative. It cannot be used for parameters that only assume positive values such as the standard deviation. Therefore other adjusted coefficients have been proposed in a similar fashion. These coefficients increase the dispersion in a population since they enlarge the range of values.
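The two quoted values can be checked with a few lines of Python, assuming the coefficient has the form √(N/(N − 1)), which is the form that reproduces both numbers. The function name is hypothetical.

```python
import math

def sample_size_coefficient(n):
    """Assumed form of the coefficient: sqrt(N / (N - 1))."""
    return math.sqrt(n / (n - 1))

for n in (10, 100):
    print(n, round(sample_size_coefficient(n), 3))

# The correction is multiplicative, so it pushes a skewness value away from zero
skewness = -0.20
adjusted = sample_size_coefficient(10) * skewness
print(adjusted)
```

Because the factor is larger than one, a positive skewness grows and a negative skewness becomes more negative, as described above.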
A single equation to create coefficients to adjust for sample size has been used for all statistical parameter. However, the best adjustment is found using a specific equation for each statistical parameters, since they each have different accuracies (see Sect. 5). The simulations described in Sect. 5 were used to determine a model for each dispersion statistical parameter given by(17)where b_{(P)} is a real number constant (see Table 4). For unknown distributions, i.e. not included in our analysis, the evenmean value may be used.
Fig. 4 Dispersion and shape parameters as a function of magnitude where the black dots indicate the WVSC1 stars. The red and dashed black lines indicate the Stratevamodified and Strateva functions, respectively. The maximum number of sources per pixel is shown in brackets in each panel. 

Table 5 Strateva and Strateva modified parameters (see Eq. (18)) for all dispersion parameters analyzed in the present work for BA.
Table 6 Efficiency metric E_{tot}, i.e. the ratio of the number of selected sources to the total number of WVSC1 variable stars, and E_{WVSC1}, i.e. the ratio of the number of WVSC1 stars selected to the total number of WVSC1 stars, and α values computed from X_{f}(ED) for each waveband and for all ZYJHK wavebands combined, using β = 4.
6. Modelling the noise
Cross et al. (2009) used the Strateva function (see Strateva et al. 2001; Sesar et al. 2007, for more details) to fit the standard deviation as a function of magnitude to estimate a noise model (ζ). This method assumes that the majority of the sample are point sources, where the variability measurements are dominated by noise, rather than astrophysical variations. The method provides a suitable model for photometric surveys at optical wavelengths if they have a single component of noise that increases in relative magnitude from bright to faint stars. However, the brightest stars can show much greater variation, which comes from saturation and nonlinearity of the detectors and provides a source of variation that cannot be fit by these models. Such a situation is rare at optical wavebands but is very frequent for NIR data (see Fig. 4). The sky foreground emitted by the atmosphere is highly variable in the NIR. For this reason, the sky foreground causes a highly timevarying saturation limit, which can affect large parts of otherwise highly accurate timeseries data for bright stars with substantial outliers that have very small formal error estimates (Ferreira Lopes et al. 2015a). These outliers probably bias the statistical parameters. Therefore, we propose a modification to the Strateva function that allows us to model the increase in the standard deviation for both bright (saturated) and faint (photon noise dominated) stars, given by Eq. (18), where all the coefficients are real numbers. Indeed, Strateva et al. (2001) and Sesar et al. (2007) proposed a noise model using three terms in which the exponents of the second and third terms are 0.4 and 0.8, respectively. These powers continued to be used in Cross et al. (2009), but the optimal values were never tested. For more details see Sect. 7.2.
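For reference, the three-term form described above for the original Strateva function can be sketched as follows. The coefficient names a, b, c and their numerical values are illustrative placeholders, not fitted values from this work; only the fixed powers 0.4 and 0.8 come from the text.

```python
def strateva(m, a, b, c):
    """Three-term noise model: constant floor plus two power terms in magnitude m."""
    return a + b * 10.0 ** (0.4 * m) + c * 10.0 ** (0.8 * m)

# Illustrative coefficients only (assumed, not taken from any fit).
a, b, c = 0.01, 1.0e-8, 1.0e-15
for mag in (12.0, 15.0, 18.0):
    print(mag, strateva(mag, a, b, c))
```

With positive b and c this function increases monotonically with magnitude, so it cannot describe a rise in dispersion at the bright end, which is the behaviour the modified function is meant to capture.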
6.1. Noncorrelated indices
The selection of nonstochastic variations can be performed by one or more dispersion parameters. In order to combine a set of dispersion parameters (see Table 1), a noncorrelated index is proposed in Eq. (19), where w_{(P)} and ζ_{w(P)P} are given by Eqs. (17) and (18), respectively. This equation provides an index value that takes the sample size adjustment coefficient and the noise model into account. For instance, I_{(P)} ~ 1 for stochastic variation.
Distinct statistical parameters have different capabilities of discriminating distributions (see Sects. 5.1 and 5.2). Such differences are highlighted when the e_{P} values or sample size adjustment coefficients are compared. A sample composed mainly of stochastic variations has a different dispersion of I_{(P)} values. Therefore an appropriate combination of the results from different dispersion parameters is given by Eq. (20), where f is the waveband used, ω_{Pj} is a weight related to each dispersion parameter, v is the number of parameters used, and I_{Pj} is given by Eq. (19). Indeed, I_{Pj} provides a normalized index allowing us to combine distinct dispersion parameters and the results from various wavebands.
6.2. Broadband selection
Variable star candidates for noncorrelated data are usually selected from the noise model (see Sect. 6). Stars with values above n × D_{σμ} are selected for further analyses. This approach assumes that a few percent of the entire sample are variable stars and have statistical values above the noise. The noise samples present distributions such as uniform, normal, or distributions in between, while variable stars are more similar to the Ceph, RR, and EB distributions. Therefore, the dispersion parameters assume a different range of values for variable and nonvariable stars, which is most evident for samples with a high number of measurements (typically higher than 50). Indeed, such a difference must increase for amplitudes higher than that found for the noise. For few measurements (typically less than 20), stochastic and nonstochastic variations have large uncertainties, increasing the misselection rate (see Sect. 5.1). We find similar behaviour for correlated indices. For instance, Ferreira Lopes et al. (2015a) use cutoff surfaces linking magnitude, number of epochs, and variability indices to improve the selection criteria of variable stars, while Ferreira Lopes & Cross (2016) use flux independent indices to propose an empirical relationship between cutoff values and the number of measurements without taking into account magnitude.
The adjusted coefficients for sample size, as presented in Sect. 5.3, reduce the population dispersion. Meanwhile, uncertainties about the range of values assumed by stochastic and nonstochastic variations also vary with the number of measurements. For nonstochastic variations with a good signaltonoise and a large number of measurements, such a range is different from that produced by stochastic variations. On the other hand, for distributions with just a few measurements, the ranges of values can significantly overlap. In the same fashion as the empirical selection criteria proposed by Ferreira Lopes & Cross (2016, see their Eq. (16)), we propose the criterion given by Eq. (21), where α and β are real positive values and N is the number of measurements. In the cases where α is bigger than 1, we may find stochastic variations. Higher values of β provide a higher cutoff for small numbers of measurements or correlations. For instance, f(1,4) for N equal to 10, 30, and 50 is 1.63, 1.36, and 1.28, respectively. Indeed, lower values of α provide a more complete selection while higher values provide a more reliable selection.
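As a hedged numerical check, the three f(1,4) values quoted above are matched, to within the last quoted digit, by the expression α[1 + sqrt(β/N)]. This functional form is an assumption inferred from those numbers. In particular, treating α as an overall multiplier is a guess, since with α = 1 the quoted values cannot distinguish it from an additive form.

```python
import math

def cutoff(alpha, beta, n):
    """Assumed cutoff alpha * (1 + sqrt(beta / N)); inferred from quoted values, not from the text."""
    return alpha * (1.0 + math.sqrt(beta / n))

# Values quoted in the text for alpha = 1, beta = 4.
quoted = {10: 1.63, 30: 1.36, 50: 1.28}
for n, value in quoted.items():
    print(n, round(cutoff(1.0, 4.0, n), 3), value)
```

Under this assumed form the cutoff decreases towards α as N increases, so samples with few measurements must exceed a higher threshold to be selected.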
7. Real data
We used the WFCAM Calibration 08B release (WFCAMCAL08B – Hodgkin et al. 2009; Cross et al. 2009) as a test database as we did in the first paper of this series. To summarize, this programme contains panchromatic data for 58 different pointings distributed over the full range in right ascension and spread over declinations of and . These data have been used to calibrate the UKIDSS surveys (Lawrence et al. 2007). During each visit the fields were usually observed with a sequence of filters, either JHK or ZYJHK, within a few minutes. This led to an irregular sampling with fields reobserved roughly on a daily basis, although longer time gaps are common and large seasonal gaps are also present in the data set. The design, the details of the data curation procedures, the layout, and the variability analysis of this database are described in detail in Hambly et al. (2008), Cross et al. (2009), and Ferreira Lopes et al. (2015a).
The multiwaveband data are well suited to testing the statistical parameters using different wavebands (ZYJHK). Moreover, Ferreira Lopes et al. (2015a) and Ferreira Lopes & Cross (2016) performed a comprehensive stellar variability analysis of the WFCAMCAL08B, characterizing the photometric data and identifying 319 stars (WVSC1), of which 275 are classified as periodic variable stars and 44 objects as suspected variables or apparent aperiodic variables. In this paper we analyze the same sample as Ferreira Lopes et al. (2015a) and Ferreira Lopes & Cross (2016). First, we selected all sources classified as a star or probable star with at least 10 unflagged epochs in any of the five filters. This selection was performed from an initial database of 216 722 stars. Next we test the efficiency of selection of variable stars using the statistical parameters presented in Sect. 3.
We compute all statistical parameters listed in Table 1 by the following algorithm: the photometry measured by the best aperture was selected and then the measurements with flags (ppErrBits) higher than 256 were removed. The data were analyzed using the current and earlier approaches, and the two were compared using Eq. (22), where G(P) is the percentage of upgrade (G> 0) or downgrade (G< 0) provided by the parameter tested (P). For example, P = E_{tot} (ratio of the total number of sources selected to the total number of variable stars in the WVSC1 catalogue) and P = E_{WVSC1} (ratio of number of selected variable stars in WVSC1 to the total number of variable stars in WVSC1) are computed from previous (P′) and current (P) statistics. This allows us to estimate the improvement or deterioration provided by the methods proposed in the current work (see Sect. 7.4). The statistical parameters were computed for each waveband as well as considering all wavebands (ZYJHK). Table 6 lists α and its respective efficiency metric values. Such parameters are used to analyze the efficiency of selection of variable stars from noisy data in the WFCAMCAL database using the WVSC1 catalogue as a comparison.
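The two efficiency metrics defined above can be sketched as follows, assuming that sources are identified by unique identifiers. The function name and the toy identifiers are hypothetical and serve only to illustrate the two ratios.

```python
def efficiency_metrics(selected_ids, wvsc1_ids):
    """Return (E_tot, E_WVSC1) for a selection, given the known WVSC1 variables."""
    selected = set(selected_ids)
    known = set(wvsc1_ids)
    if not known:
        raise ValueError("the reference catalogue is empty")
    e_tot = len(selected) / len(known)            # all selected sources / WVSC1 total
    e_wvsc1 = len(selected & known) / len(known)  # recovered WVSC1 stars / WVSC1 total
    return e_tot, e_wvsc1

# Toy example: six sources selected, three of the four known variables recovered.
e_tot, e_wvsc1 = efficiency_metrics([1, 2, 3, 4, 5, 6], [1, 2, 3, 10])
print(e_tot, e_wvsc1)
```

E_{tot} can exceed unity because it counts every selected source, so for a fixed E_{WVSC1} a lower E_{tot} means fewer misselections.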
7.1. Testing evenstatistical parameters
Figure 4 shows evenstatistical parameters and the standard deviation as a function of the Kband magnitude. The variable stars in WVSC1 are denoted by large black dots and the noise model (Strateva) functions are indicated by lines. The main results can be summarized as follows:

The dispersion evenparameters have a range similar to that found for the standard deviation. Moreover, the majority of WVSC1 stars have values above the stochastic variations and therefore these parameters can be used in the same fashion as the standard deviation to discriminate variable stars from noise. As expected, the diagram of ED is equal to that of ED_{M} and similar to that of ED_{μ}.

The Strateva and the modifiedStrateva functions show similar values for almost all ranges of magnitude. The difference is a slope at lower magnitudes (bright stars) found for the modifiedStrateva function. This allows us to reduce the misselection but we also remove some bright variable stars that have small amplitude variations. We caution that Strateva and the modifiedStrateva functions can present an incorrect model for very faint magnitudes since a small decrease in the dispersion is found. In these cases a magnitude limit can be adopted (Cross et al. 2009).

The shape evenparameters give a good discrimination for many variable stars particularly for bright stars (see Fig. 4). However, almost all faint stars (magnitudes greater than ~16) have values near those found for stochastic variations. In this sense, the dispersion parameters are better than shape parameters at discriminating nonstochastic variations since we can see a clearer separation among them for all ranges of magnitude. The shape parameters may be useful to discriminate different kinds of light curve signatures and this will be addressed in a future paper in this series.
In summary, the evenstatistical parameters can be used in the same fashion as previous parameters. The main goal of this paper is to study the criteria of selection of variable stars from noise, although these parameters may be useful for many other purposes in different branches of science and technology.
7.2. Finding the best noise model
We tested 82 159 models to find the best Stratevamodified function (ζ_{P}(m) – see Eq. (18)). All combinations of three power terms varying from 4 to − 4 were tested using a bin of 0.1. This range covers all previous values used in the noise model. The procedure adopted to find the best model to fit the standard deviation as a function of magnitude is similar to that used by Cross et al. (2009). The EA_{M} and ED_{M} are computed for bins with a width of 0.1 mag or at least 100 objects. For this step, we only consider those stars with more than 20 measurements. Next, we compute ζ_{P′}(m) from a nonlinear leastsquares minimization using the LevenbergMarquardt method (Levenberg 1944; Marquardt 1963). The WVSC1 catalogue of variables represents 0.01% of WFCAMCAL stars. Nevertheless, they were removed from the sample before fitting to get a better noise model.
About 39% of the models tested converge for all statistical parameters and wavebands observed. The model with the lowest χ^{2} in ZYJHK wavebands was taken as the best noise model given by Eq. (18). Table 5 shows the parameters obtained for both Strateva and Stratevamodified functions and the metric to measure the improvement or deterioration provided by the latter. The dispersion of residuals (G(R)) is about 1% lower than that found for Strateva functions for almost all statistical parameters except for IQR and EIQR. A similar behaviour is found for G(χ^{2}) where an improvement of about 85% is found. Indeed, the largest improvement is found for the K filter. However, a deterioration is found for IQR and EIQR in the ZYJ wavebands. Moreover, Stratevamodified functions do not turn down at faint magnitudes as the Strateva function does sometimes (see Fig. 4). The new noise model provides a more restrictive cut for both the brightest and faintest stars.
7.3. Testing photometric apertures and wavebands
In order to test the dependence of the selection criteria on the photometric aperture and on extreme measurements in the WFCAM data, seven different analyses were performed:

A1–5: photometric measurements using a standard photometric aperture from 1 to 5 (0.5″, , 1″, , and 2″ radius, respectively)

BA: photometric measurements using the best aperture (see Cross et al. 2009)

BAS: all measurements enclosed in 2 × ED_{σμ} about EA_{M} of BA photometry are used.
In these analyses, the measurements with flags greater than 256 were removed. The third aperture (A3) corresponds to the default 1″ aperture, where the radius is slightly larger than the typical seeing FWHM, so an aperture centred on a pointsource should contain >95% of the light in the ideal Gaussian case; in reality there is much more light in lower surface brightness wings. Increasing the aperture size increases the amount of signal, but at the expense of increasing the amount of sky too, such that the signaltonoise decreases. Decreasing the aperture reduces the signal too much, also reducing the signaltonoise ratio. Usually A3 gives the optimal signaltonoise. Sometimes, however, nearby stars affect the measurements by adding a noise component from deblending images, which relies on some imperfect modelling. In those cases a smaller aperture that includes less signal from the neighbour gives better results, which is why a variable aperture was selected by Cross et al. (2009).
Figure 5 shows the result for different apertures (top panel) and different wavebands (bottom panel). The BAS returns the best results, i.e. the lowest values of E_{tot} for all values of E_{WVSC1}. The BAS approach allows us to achieve a better discrimination of variable stars from noise (see Table 6). This is most notable for those dispersion parameters that take into account the square in their definition, such as ED_{σμ} and ED_{σm}. On the other hand, the BAS approach can also lead to misselections of binary stars that have few measurements at the eclipse, for instance (see Sect. 7.3 for more details). The number of stochastic variations decreases substantially, but this also means that we could miss some variable stars. Moreover, the efficiency levels for different wavebands vary significantly (see Table 6). The best result was found for the J waveband rather than for the Z and K wavebands. The efficiency decrease found for the K waveband is related to the decrease of signaltonoise, while for the Z waveband we find that the c_{0} in Eq. (18) is significantly higher (~0.023; cf. ~0.014 for Y, J, H, K), which suggests greater variations in the photometry across each detector, since simple offsets in the zero point would be corrected by the recalibration carried out by Cross et al. (2009). Calibrating the Z and Y bands was trickier than the J, H, K bands because the calibration is extrapolated from 2MASS J, H, Ks (see Hodgkin et al. 2009) and is more susceptible to extinction, particularly in the Z band, which can vary on small scales in star forming regions. Indeed, 32 WVSC1 stars were found in highly reddened regions (Z − K> 3), indicating that such effects can be present in WFCAM data.
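The BAS selection listed above can be illustrated with a minimal sketch. As an explicit simplification, the median and the sample standard deviation are used here as stand-ins for EA_{M} and ED_{σμ}; the function name and the toy magnitudes are hypothetical.

```python
import statistics

def bas_select(mags, n_sigma=2.0):
    """Keep measurements within n_sigma * spread of the centre.

    Stand-ins (assumptions): median for EA_M, sample standard deviation for ED_sigma_mu.
    """
    centre = statistics.median(mags)
    spread = statistics.stdev(mags)
    return [m for m in mags if abs(m - centre) <= n_sigma * spread]

# Toy light curve: nine measurements near 15.0 mag and one strong outlier.
mags = [15.00, 15.01, 14.99, 15.02, 14.98, 15.00, 15.01, 14.99, 15.00, 16.50]
kept = bas_select(mags)
print(len(mags), len(kept))
```

The same clipping that removes the outlier would also remove a genuine eclipse sampled by a single point, so a cut of this kind trades completeness for reliability.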
7.4. Analysis of improvements
Section 7.2 discusses the improvements made using the Stratevamodified function. This function gives a smaller χ^{2} than the original Strateva function, which indicates a better noise model estimation. However, this does not inform us about the improvements to the selection of variables. Therefore, to measure the improvements or deteriorations provided by each step of our analysis, the metric G was computed for E_{tot} and E_{WVSC1} using four different approaches as follows:

Evenstatistic – The results are computed from standard dispersion parameters in comparison with their respective counterpart evendispersion parameter for BA photometry (see Sect. 7.3).
Fig. 5 The top panel shows E_{tot} as a function of E_{WVSC1} for all apertures, and the bottom panel shows the same for all individual wavebands and their combination (ZYJHK wavebands) using BA. The results for each photometric aperture and waveband are shown with different colours. E_{WVSC1} decreases with E_{tot} leading to a more reliable selection (fewer misclassifications) and vice versa.
Fig. 6 G vs. E_{WVSC1} using different approaches (see Sect. 7.4). The approach used is named above in each of the upper diagrams. The colours indicate the results for different filters ZYJHK (brown, grey, red, green, and blue lines respectively) and the combination of results found in all bands (black lines). The same colours were also used in Fig. 5 (bottom panel).
Fig. 7 X indices using all wavebands X_{ZYJHK} for ED and D_{σμ}. The X(ED) index was computed using the Strateva function without sample size correction for BA photometry, while X(ED) was computed using the Strateva modified function with sample size correction for the BAS approach. The histograms of the entire sample (black lines) and WVSC1 stars (red lines) are shown at the top right. The WVSC1 stars are also represented by open black circles and the maximum number of sources per pixel is shown in brackets.
Sample size – The results with and without sample size corrections for BA photometry.

Noise model – The results using the Strateva versus Strateva modified functions for BA photometry.

BAS approach – The results for BA photometry versus those computed for the BAS approach.

All – The results computed from the previous dispersion parameter using the Strateva function without sample size corrections for BA photometry versus their respective even dispersion parameter using sample size correction, Strateva modified functions, and the BAS approach.
This analysis allows us to verify if and how much each approach upgrades or downgrades the selection criteria compared with previous ones. Moreover, the following combinations of dispersion parameters were also analyzed: (23) and (24). These combinations provide the same number of dispersion parameters using either the standard statistics or the evenstatistics and hence give a better comparison of their efficiency level. The weight ω_{Pj} was adopted as the inverse of ED_{σμ} for each D_{j}. The same tests were performed using single dispersion parameters to verify if their combination provides better results.
Figure 6 shows a comparison between previous and current approaches. The evenstatistical parameters are on average better than the standard statistical parameters (see Sect. 5.1). The results for real data show a fluctuation of about 1% on G(E_{tot}) values. This is expected since the improvements to the estimation of standard statistical parameters only occur for those distributions that have odd numbers of measurements and decrease quickly with the number of measurements. Moreover, we also observed that the improvements vary from the Z to K waveband since the infrared light curves usually have smaller amplitudes than optical wavebands and therefore the improvement is more evident. The following behaviours are also observed:

The G(E_{tot}) for the sample size correction varies from 2% to 7% for E_{WVSC1} less than ~0.8. The WVSC1 stars outside of this limit have an X_{ED} (see Fig. 7) less than 1.5. Indeed, the X variability indices are approximately a measurement of the signaltonoise ratio, so this indicates an improvement for sources with signaltonoise greater than 1.5 (E_{WVSC1}< 0.8) and a deterioration of about 7% otherwise.

The improvement provided by the Strateva modified function can reach G(E_{tot}) ≃ 22%. It only improves the selection for E_{WVSC1} lower than ~0.9, similar to that found for the sample size correction. Indeed, the Strateva modified function provides a fluctuation of a few percent, either improvement or deterioration, for the Z and Y wavebands. The increase to the total number selected provided by the sample size correction and noise model means a reduction of misclassification, but this also hinders the detectability of variable stars of lower amplitudes that are mainly found at fainter magnitudes.

The BAS approach provides the largest improvement to the selection criteria for all dispersion parameters tested except IQR. The definition of IQR takes into account 75% of the distribution and hence the BAS approach to IQR provides a second reduction on the data used. Therefore the BAS approach to IQR is not appropriate. On the other hand, the BAS approach is suitable for all other dispersion parameters analyzed, since the maximum improvement found is about 73%, where D_{σμ} and D_{σm} have the largest improvements.

The total improvement is dominated by the improvement from the BAS approach since the maximum improvement is not so different to that found for BAS approach. Indeed, the BAS approach leads to a constant improvement until E_{WVSC1} ≃ 0.95. The decrease observed for values higher than that worsens when the sample size correction and the Strateva modified function are added.

We also perform the selection using the previous standard procedure to select variable stars using noncorrelated data, i.e. selecting all sources with a magnitude RMS more than n sigma above the noise model function. We compute the standard deviation and the X index for the K waveband using BA photometry. At one sigma above the Strateva function ~81% of WVSC1 stars are selected but at the expense of an E_{tot} ≃ 103. This E_{tot} value is ~5.2 times larger than that found using our approach for the K waveband and ~12 times that considering all wavebands (see Table 6). This means that the modifiedStrateva function combined with our empirical approach (see Eqs. (18) and (21)) and statistical weights (see Sect. 5) increases the selection efficiency by about 520%.

The performance of previous and evenstatistical parameters is very similar when we use the sample size correction and Stratevamodified function with the BAS approach. Indeed, the efficiency level for EIQR is optimized if only the Stratevamodified function and sample size correction for BA photometry are used.

The performance obtained from single statistical parameters is very similar to that found for ED_{All} or D_{All}. Therefore the combination of several statistical parameters does not provide an improvement according to our results. Moreover, combinations with more statistical parameters were tested but no improvement was found.

The largest improvement is found when all wavebands are combined. This returns a set of potential variable stars that is about 2.1 times (for E_{WVSC1} ≃ 0.8) and 4.9 times (for E_{WVSC1} ≃ 0.9) smaller than that found for single wavebands.
The approaches proposed in the current work provide reasonable improvements to the selection criteria. All steps of our approach were tested allowing us to identify which parameters are improved and the range of E_{WVSC1} over which these improvements are valid. Such analyses allow us to define the best way to select variable stars with statistical parameters.
Fig. 8 E_{tot} vs. E_{WVSC1} (left panel) for the best selection criteria and 1 /η as a function of time interval mean among the measurements ΔT (right panel). In the left panel, the statistical parameters X(EIQR) (blue lines), X(D_{σμ}) (red lines), and X(ED) (dark lines) for ZYJHK (full lines) and K (dashed lines) wavebands, respectively. In the same diagram the results found for using the mean (full grey line) and evenmean (dashed grey line), respectively, are also plotted. The results for normal, uniform, Ceph, RR, and EB simulated distributions are shown in the right panel. The colours indicate different parameters or distributions analyzed, which are indicated at the top of each diagram. In the right panel we show how 1 /η varies with the selection of ΔT. The box plot indicates the quartiles of WFCAM and WVSC1 samples, where each box encloses 50% of the sample under the grouping algorithm that defines ΔT in Sokolovsky et al. (2017). 

7.5. Improvements on correlated indices
The flux independent variability indices () proposed by us (for more details see Ferreira Lopes et al. 2015a; Ferreira Lopes & Cross 2016) are not dependent on the amplitude signal since they only use the correlation signal. However, they are dependent on the mean value. Therefore, the correlation values computed using evenaverages are more accurate than those computed using mean values since the evenmean gives a value closer to the true centre (see Sect. 3.1). As a result, the E_{tot} values presented in Table 7 are reduced by about 18% compared to those found in Table 2 of Paper I. This reduction is related to the sources that have few correlations. Such an improvement is almost constant for E_{WVSC1}< 0.90 (see Fig. 8), and it is this high because a small variation in the number of positive correlations provides a substantial improvement for the indices. For instance, a single correlation can create a variation of about 20% in the indices if there are only five correlation measurements. Therefore, a better mean value provides a strong correction on indices with few correlations.
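The sensitivity to the adopted centre can be illustrated with a toy calculation. The sketch assumes, for illustration only, that the index is the fraction of correlated pairs whose deviations from the centre share the same sign; the function name, the pairs, and the two centre values are hypothetical.

```python
def positive_fraction(pairs, centre):
    """Fraction of pairs whose two deviations from the centre have the same sign (assumed index)."""
    positive = sum(1 for a, b in pairs if (a - centre) * (b - centre) > 0)
    return positive / len(pairs)

# Five correlated pairs of magnitudes; the last pair straddles 15.00 mag.
pairs = [(15.02, 15.03), (15.04, 15.05), (14.95, 14.96), (14.97, 14.94), (15.005, 14.99)]
index_a = positive_fraction(pairs, centre=15.00)
index_b = positive_fraction(pairs, centre=14.98)
print(index_a, index_b)
```

A shift of 0.02 mag in the centre changes the sign of one correlation, which moves the index by one fifth of its range when only five correlations are available.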
Table 7 Efficiency metric E_{tot}, E_{WVSC1}, and α_{cor} values computed from the analysis of BA photometry for and .
We also tested the correlated indices using BAS, i.e. all measurements enclosed in 2 × ED_{σμ} about EA_{m} of BA photometry (see Sect. 7.3). The results are not that different from those found for E_{WVSC1}< 0.85 (see Table 7), while for E_{WVSC1}> 0.85 we found an E_{tot} about 40% higher. The measurements related to eclipsing binary stars are removed when we use BAS. The correlated and noncorrelated indices can fail for low signaltonoise variations and for noncontact binaries with few measurements at the eclipses.
The WFCAMCAL database allows us to compute correlated indices that have a number of correlations greater than for about ~94% of data. Variable stars with fewer correlations or not previously detected will be explored in the next paper of this series, in which we will propose a new periodicity search method and study selection criteria to produce a cleaner sample.
8. Summary of recommendations
Reliable selections become more important than complete selections of variable stars when confronted by a very large amount of photometric data. Visual inspection is usually performed to designate if an object is a variable star or not (e.g. Pojmanski et al. 2005; Graczyk et al. 2011; De Medeiros et al. 2013; Ferreira Lopes et al. 2015a,b,c; Song et al. 2016). This is done even when good filtering is performed to remove image artefacts, cosmic ray hits, and point spread function (PSF) wings of bright nearby objects (e.g. Fruchter & Hook 2002; Bernard et al. 2010; Denisenko & Sokolovsky 2011; Ramsay et al. 2014; Desai et al. 2016). This is because stochastic and nonstochastic variations do not look that different from the viewpoint of statistical and correlated indices, especially for low signaltonoise data. Indeed, at the end of this project we aim to propose a nonsupervised procedure that allows us to get an unbiased sample from analyzing a large data set, i.e. without performing visual inspection.
Recently new statistical parameters were proposed that include the error bars. These may improve the statistical parameters if the error bars are well estimated. However, they can also increase the uncertainties because it is common to find outliers with smaller error bars. The performance of many of these parameters was recently tested by Sokolovsky et al. (2017). The authors used as test data a sample of 127 539 objects with more than 40 epochs, among which 1251 variable stars were confirmed. The limit in the number of measurements gives a straightforward comparison with surveys such as PanSTARRS (with about 12 measurements) and the VVVX, which will observe fewer epochs than VVV but still in the range between 25 and 40. The authors set the 1 /η index as the best way to select variable stars, but this is not true if the epoch interval (ΔT) is large. Figure 8 (right panel) shows the variation of the 1 /η index as a function of ΔT. As can be seen, the separation between stochastic (uniform and normal distributions) and variable stars becomes evident only for ΔT< 0.1. The grouping of observations as defined for the 1 /η index is in a single band, and hence ΔT ≃ 17.5 d for WVSC1 stars for single wavebands. Therefore the 1 /η index is not suitable for selecting variable stars from noise in the WFCAM database. The WFCAM database was analyzed by correlated indices because the multiband observations provide a large number of measurements taken in intervals of ΔT ≃ 0.01 d. Unlike statistical parameters, the correlated indices can be computed using multiwavebands and this is the best way to calculate these indices in this case. Indeed, more than 50% of WVSC1 variable stars could be missed if the 1 /η was adopted to analyze the WFCAMCAL database. Such results are in agreement with those found by Ferreira Lopes & Cross (2016, see Fig. 2 Sect. 4.1), where the authors performed this analysis using indices.
Indeed, ΔT< 0.1 was found for the current test data because the variable stars simulated (Cepheid, RRlyrae, and eclipsing binary) have a variability period equal to 1. The parameter ΔT is not set to choose which variability indices must be used to perform the variability analysis. The analysis of correlated observations (Ferreira Lopes & Cross 2016, see Sect. 4.3) is mandatory to determine whether correlated indices can be used and to set ΔT. Sokolovsky et al. (2017) did not take into account our correlated indices (Ferreira Lopes & Cross 2016), which have a welldefined limit and a high accuracy for only a few correlated measurements. These indices only combine those measurements that provide good information about statistical correlation. Indeed, many variable stars could be missed. Confident correlated indices can only be computed if ΔT is a small fraction of the variability period; such results are in agreement with those found by Ferreira Lopes & Cross (2016). This aspect limits a straightforward comparison between correlated and noncorrelated indices performed by the authors.
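The dependence of 1 /η on the sampling interval can be illustrated with a noise-free sinusoid of period 1. The sketch assumes the usual von Neumann definition, η = δ²/σ², with δ² the mean squared successive difference and σ² the sample variance; the two sampling intervals and the helper names are illustrative choices only.

```python
import math

def inverse_eta(values):
    """1/eta with eta = (mean squared successive difference) / (sample variance)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    delta2 = sum((values[i + 1] - values[i]) ** 2 for i in range(n - 1)) / (n - 1)
    return variance / delta2

def sampled_sine(delta_t, n=100, period=1.0):
    """Noise-free sinusoid sampled every delta_t (same time unit as the period)."""
    return [math.sin(2.0 * math.pi * i * delta_t / period) for i in range(n)]

dense = inverse_eta(sampled_sine(0.01))   # interval much shorter than the period
sparse = inverse_eta(sampled_sine(0.37))  # interval comparable to the period
print(dense, sparse)
```

For densely sampled data 1 /η is large, whereas for the sparse sampling it drops to the order of the value of about 0.5 expected for uncorrelated noise, so the same signal is no longer separated from stochastic variations.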
Figure 8 (left panel) summarizes our efforts to provide the best way to select variables in correlated and noncorrelated data. The values correlated indices are more efficient than previous correlated indices and should be adopted whenever correlated indices can be calculated sensibly, although some aspects still need to be tested, especially in systems where the correlation order and number of permutations are very low. The flux-independent indices are weakly dependent on magnitude but strongly dependent on the time interval between correlated measurements; such indices should be used when the observations have a natural correlation interval that is shorter than the typical epoch interval (for more details, see Sect. 4.3 of Paper I). Indeed, a large number of variable stars can be missed if this is not taken into account.
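The two quantities plotted in Fig. 8 (left panel) can be computed as in the following minimal sketch, which uses the definitions given in the table captions: E_{tot} is the ratio of the number of selected sources to the total number of WVSC1 stars, and E_{WVSC1} is the ratio of the number of WVSC1 stars selected to the total number of WVSC1 stars. The function name and the source identifiers are hypothetical.

```python
def efficiency_metrics(selected_ids, wvsc1_ids):
    """Return (E_tot, E_WVSC1) for a selection, given the known WVSC1 variables."""
    selected = set(selected_ids)
    known = set(wvsc1_ids)
    if not known:
        raise ValueError("the reference sample of variable stars is empty")
    # E_tot: all selected sources relative to the number of known variables.
    e_tot = len(selected) / len(known)
    # E_WVSC1: fraction of known variables recovered by the selection.
    e_wvsc1 = len(selected & known) / len(known)
    return e_tot, e_wvsc1


# Hypothetical example: 6 sources selected, 3 of the 4 known variables recovered.
e_tot, e_wvsc1 = efficiency_metrics([1, 2, 3, 4, 5, 6], [1, 2, 3, 10])
```

With these definitions, E_{tot} can exceed unity when the selection contains many sources that are not in WVSC1, so a lower E_{tot} at fixed E_{WVSC1} indicates fewer misselections.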
Variable stars are better discriminated from noise using correlated indices than noncorrelated indices, so the former should be adopted when they are available (see Sect. 8); otherwise X_{f} indices can be used (see Fig. 8). Indeed, this also determines how photometric observations may best be performed to maximize the performance of the selection criteria. A combination of these observations can be used, but this does not imply a high performance. The better selection provided by correlated indices is well known, and therefore the correlated indices may be adopted to achieve a smaller misselection rate. Using all the above, the following set of procedures is recommended as the best way to select variable stars:

A histogram of the interval between observations must be analyzed to determine whether correlated indices can be used (see Fig. 3 of Ferreira Lopes & Cross 2016). The approach used to perform the variability analysis may only be chosen after examining the time interval between the measurements. The measurements used to compute correlated indices must be correlated over a fraction of the minimum variability period. The parameter has the highest performance among the correlated indices analyzed and hence this should be adopted as the main tool to select variable stars using correlated data. Moreover, the even-mean should be used instead of the mean to compute indices in order to improve the correlation estimation.
Table 8. E_{WVSC1} and E_{tot} values for all dispersion parameters analyzed using ZYJHK wavebands.

A minimum of 5 correlated measurements must be adopted as the limit to discriminate variable stars from noise using correlated indices (see Eq. (14) of Sect. 4.1 of Ferreira Lopes & Cross 2016).

A constant cutoff value may be adopted if all time series can be considered on the same basis, independent of the number of correlations. A cutoff that depends on the number of correlations provides a better selection and therefore should be adopted if there is a reasonable number of correlations (more than 10). Indeed, we suggest that correlated indices are only calculated for stars that have a number of correlations greater than 10. This increases the reliability of the correlated indices estimation and allows those stars with few correlated measurements to be analyzed by statistical parameters. Moreover, a higher order of correlated variability indices may be adopted if more than 2 measurements are available in each correlation interval.

The X_{f} index may be used for time series with fewer than 10 correlated measurements. The information from all wavebands must be combined if multiwavelength data are available. This reduces the misclassification rate by about 680% (see Sect. 7.4). A single dispersion parameter must be used to decrease the running time, since the performance of a combination of dispersion parameters is similar (see Table 8). The X_{f}(ED), X_{f}(ED_{μ}), and X_{f}(ED_{m}) indices have a performance in between those of X_{f}(EIQR) and X_{f}(ED_{All}) for E_{WVSC1} < 0.85, and otherwise better than X_{f}(EIQR) (see Fig. 8). Indeed, ED, ED_{μ}, ED_{m}, and EIQR are not defined using squares and so they are less affected by outliers. However, sources near the noise model are better selected using the parameters defined with squares. The BAS approach must not be used if the EIQR parameter is used as the selection criterion. X_{f}(ED), X_{f}(ED_{μ}), X_{f}(ED_{m}), or their combination may be adopted to get a reliable sample (E_{WVSC1} ~ 0.85), since they perform better on average than all dispersion parameters tested. On the other hand, X_{f}(ED_{All}) or X_{f}(D_{All}) may be adopted to get a complete sample, since they perform better for E_{WVSC1} ≥ 0.85.

A cutoff dependent on the number of measurements may be used as a parameter to select variable stars (see Eq. (21)).

The sample selected by correlated or noncorrelated indices is not unbiased, i.e. several stochastic variations are still included in this selection. The identification of periodic or aperiodic signals may be performed with period-finding methods. This will be addressed in forthcoming papers of this project.
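The first steps of the procedure above can be sketched as follows. The sketch makes several assumptions that are not taken from the paper: a "correlation" is counted as a group of at least two measurements separated by gaps no larger than a correlation interval `delta_t_corr`, groups are built by simple chaining of consecutive epochs, and a time series with exactly 10 correlations is sent to the X_{f} index. All function and parameter names are hypothetical.

```python
import numpy as np


def count_correlated_groups(times, delta_t_corr):
    """Count groups of >= 2 epochs whose consecutive gaps are <= delta_t_corr."""
    t = np.sort(np.asarray(times, dtype=float))
    if t.size == 0:
        return 0
    # A new group starts wherever the gap to the previous epoch is too large.
    breaks = np.flatnonzero(np.diff(t) > delta_t_corr)
    group_sizes = np.diff(np.concatenate(([0], breaks + 1, [t.size])))
    return int(np.sum(group_sizes >= 2))


def choose_method(times, delta_t_corr, min_correlations=10):
    """Suggest correlated indices only if there are more than min_correlations groups."""
    n_corr = count_correlated_groups(times, delta_t_corr)
    method = "correlated indices" if n_corr > min_correlations else "X_f index"
    return method, n_corr


# Hypothetical samplings: 12 nights with two close epochs each, and isolated epochs.
paired = np.concatenate([[night, night + 0.005] for night in range(12)])
isolated = 17.5 * np.arange(40)

method_paired, n_paired = choose_method(paired, delta_t_corr=0.01)
method_isolated, n_isolated = choose_method(isolated, delta_t_corr=0.01)
```

In practice the correlation interval would be set from the histogram of the intervals between observations, as recommended in the first item of the list.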
9. Conclusions
Statistical parameters were analyzed as a tool to discriminate variable stars from noise. We observe that statistics based on an even number of measurements provide better estimations of statistical parameters. Therefore, we propose even-statistics, where only even numbers of measurements are considered. The even-averages gave better results than current averages for many of the distributions analyzed. Therefore, the previous shape and dispersion parameters were tested using even-averages. Next, seven unbound statistical parameters are proposed, i.e. parameters that are independent of the average. In total, we propose 16 new statistical parameters. These parameters enlarge our inventory of tools to identify nonstochastic variations, which is the main goal of this step of our project.
The new statistical parameters were tested using Monte Carlo simulations, from which we verify that the even-statistical parameters can be used to analyze statistical distributions in the same way as their non-even counterparts. Many even-statistical parameters keep a strong relationship with their counterparts, which enables a comparison. The improvement in the accuracy of statistical parameters depends on the distribution analyzed. For many of these parameters the even-parameters display better accuracy (uniform: 7/9 of the statistics improved with even-statistics; normal: 5/9; Ceph: 8/9; RR: 5/9; EB: 2/9). The simulations were also used to estimate a coefficient to adjust the sample size for each dispersion parameter, to take into account the dependence of statistical parameters on the number of measurements. These corrections are extremely important to reduce the misselection of sources with few measurements.
Even-statistical parameters, sample size corrections, and a new noise model were used to propose noncorrelated indices that can be used on single or multiwavelength observations. The Strateva-modified function proposed in the present paper provides a better model than previous functions, and the sample size coefficients were designed for each statistical parameter to account for its susceptibility to statistical variations. Indeed, the noise characteristics of bright and faint sources are better modelled by the Strateva-modified function. This is extremely important since the single or multiwavelength analyses are only possible using a noise model. The dispersion parameters provide similar information but are susceptible to statistical variations that are slightly different. However, the combinations of statistical parameters tested do not significantly improve the discrimination between variable stars and noise. Finally, the noncorrelated index was tested using the WFCAMCAL database. The results were compared with those obtained with the standard deviation and the Strateva function. The misselection rate was reduced by about 520% as a result of our approach. Moreover, the correlated indices were recomputed using the even-mean and we also find a reduction in the misselection rate of 18%. From all of the above, we summarize our recommendation to select variable stars from noise.
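As an illustration of how a noise model of this kind is used, the sketch below fits the standard Strateva function, assumed here to have the commonly quoted form ζ(m) = a + b·10^{0.4m} + c·10^{0.8m}, to the dispersion as a function of magnitude. The Strateva-modified function of Eq. (18) and the sample size corrections are not reproduced. The reference magnitude `mag_ref`, the function names, and the synthetic data are assumptions, and ordinary linear least squares is used only because the function is linear in its coefficients, without any claim about the fitting procedure adopted in this work.

```python
import numpy as np


def strateva(mag, a, b, c, mag_ref=15.0):
    """Standard Strateva-type noise model, evaluated relative to mag_ref."""
    x = np.asarray(mag, dtype=float) - mag_ref
    return a + b * 10.0 ** (0.4 * x) + c * 10.0 ** (0.8 * x)


def fit_strateva(mag, dispersion, mag_ref=15.0):
    """Linear least-squares estimate of (a, b, c) from magnitude and dispersion."""
    x = np.asarray(mag, dtype=float) - mag_ref
    # The model is linear in a, b, c, so a design matrix is sufficient.
    basis = np.column_stack(
        [np.ones_like(x), 10.0 ** (0.4 * x), 10.0 ** (0.8 * x)]
    )
    target = np.asarray(dispersion, dtype=float)
    coeffs, *_ = np.linalg.lstsq(basis, target, rcond=None)
    return tuple(float(v) for v in coeffs)


# Synthetic, noise-free example with assumed coefficients.
mag = np.linspace(12.0, 18.0, 50)
true_coeffs = (0.01, 0.005, 0.002)
dispersion = strateva(mag, *true_coeffs)
fitted = fit_strateva(mag, dispersion)

# Ratio of the measured dispersion to the expected noise; values well above 1
# would flag candidates (a simple stand-in, not the X_f index of the paper).
ratio = dispersion / strateva(mag, *fitted)
```

Subtracting a reference magnitude only rescales b and c, and it keeps the design matrix well conditioned over the magnitude range of a typical survey.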
The first step of this project, in which the tools and selection criteria to discriminate variable stars from noise were studied, is now concluded. The next step of this project will study period-finding methods and how to use these methods to reduce or remove all misselected sources.
Acknowledgments
C.E.F.L. acknowledges a postdoctoral fellowship from the CNPq. N.J.G.C. acknowledges support from the UK Science and Technology Facilities Council. We thank Maria Ida Moretti who pointed us to the Sokolovsky et al. (2017) paper, which was crucial to perform a strict comparison with the most recent results. We also thank the reviewer for his/her thorough review and highly appreciate the comments and suggestions, which significantly contributed to improving the quality of the publication.
References
Baglin, A., Auvergne, M., Barge, P., et al. 2007, in Fifty Years of Romanian Astrophysics, eds. C. Dumitrache, N. A. Popescu, M. D. Suran, & V. Mioc, AIP Conf. Ser., 895, 201
Bailer-Jones, C. A. L., Andrae, R., Arcay, B., et al. 2013, A&A, 559, A74
Bernard, E. J., Monelli, M., Gallart, C., et al. 2010, ApJ, 712, 1259
Borucki, W. J., Koch, D., Basri, G., et al. 2010, Science, 327, 977
Brys, G., Hubert, M., & Struyf, A. 2004, J. Comput. Graph. Statistics, 13, 996
Chambers, K. C., Magnier, E. A., Metcalfe, N., et al. 2016, ArXiv e-prints [arXiv:1612.05560]
Cross, N. J. G., Collins, R. S., Hambly, N. C., et al. 2009, MNRAS, 399, 1730
De Medeiros, J. R., Ferreira Lopes, C. E., Leão, I. C., et al. 2013, A&A, 555, A63
Denisenko, D. V., & Sokolovsky, K. V. 2011, Astron. Lett., 37, 91
Desai, S., Mohr, J. J., Bertin, E., Kümmel, M., & Wetzstein, M. 2016, Astron. Comp., 16, 67
Enoch, M. L., Brown, M. E., & Burgasser, A. J. 2003, AJ, 126, 1006
Ferreira Lopes, C. E., & Cross, N. J. G. 2016, A&A, 586, A36
Ferreira Lopes, C. E., Dékány, I., Catelan, M., et al. 2015a, A&A, 573, A100
Ferreira Lopes, C. E., Leão, I. C., de Freitas, D. B., et al. 2015b, A&A, 583, A134
Ferreira Lopes, C. E., Neves, V., Leão, I. C., et al. 2015c, A&A, 583, A122
Fruchter, A. S., & Hook, R. N. 2002, PASP, 114, 144
Graczyk, D., Soszynski, I., Poleski, R., et al. 2011, VizieR Online Data Catalog: I/206
Hambly, N. C., Collins, R. S., Cross, N. J. G., et al. 2008, MNRAS, 384, 637
Hodgkin, S. T., Irwin, M. J., Hewett, P. C., & Warren, S. J. 2009, MNRAS, 394, 675
Hoffman, D. I., Harrison, T. E., & McNamara, B. J. 2009, AJ, 138, 466
Kaiser, N., Aussel, H., Burke, B. E., et al. 2002, in Survey and Other Telescope Technologies and Discoveries, eds. J. A. Tyson, & S. Wolff, SPIE Conf. Ser., 4836, 154
Kim, D.-W., Protopapas, P., Bailer-Jones, C. A. L., et al. 2014, A&A, 566, A43
Lawrence, A., Warren, S. J., Almaini, O., et al. 2007, MNRAS, 379, 1599
Levenberg, K. 1944, Quart. Appl. Math., 2, 164
Marquardt, D. 1963, SIAM J. Appl. Math., 11, 431
Minniti, D., Lucas, P. W., Emerson, J. P., et al. 2010, New Astron., 15, 433
Pojmanski, G., Pilecki, B., & Szczygiel, D. 2005, Acta Astron., 55, 275
Pollacco, D. L., Skillen, I., Collier Cameron, A., et al. 2006, PASP, 118, 1407
Ramsay, G., Brooks, A., Hakala, P., et al. 2014, MNRAS, 437, 132
Sesar, B., Ivezić, Ž., Lupton, R. H., et al. 2007, AJ, 134, 2236
Sokolovsky, K. V., Gavras, P., Karampelas, A., et al. 2017, MNRAS, 464, 274
Song, F.-F., Esamdin, A., Ma, L., et al. 2016, Res. Astron. Astrophys., 16, 154
Stetson, P. B. 1996, PASP, 108, 851
Strateva, I., Ivezić, Ž., Knapp, G. R., et al. 2001, AJ, 122, 1861
Udalski, A. 2003, Acta Astron., 53, 291
von Neumann, J. 1941, Ann. Math. Stat., 12, 367
von Neumann, J. 1942, Ann. Math. Stat., 13, 86
Welch, D. L., & Stetson, P. B. 1993, AJ, 105, 1813
All Tables
Strateva and Strateva modified parameters (see Eq. (18)) for all dispersion parameters analyzed in the present work for BA.
Efficiency metric E_{tot}, i.e. the ratio of the number of selected sources to the total number of WVSC1 variable stars, and E_{WVSC1}, i.e. the ratio of the number of WVSC1 stars selected to the total number of WVSC1 stars, and α values computed from X_{f}(ED) for each waveband and for using all ZYJHK wavebands using β = 4.
Efficiency metric E_{tot}, E_{WVSC1}, and α_{cor} values computed from the analysis of BA photometry for and .
All Figures
Fig. 1 Uniform (black squares), normal (green asterisk), Ceph (blue diamond), RR (red triangle) and EB (grey plus) distributions with 1000 measurements as a function of the number of elements. The same distributions are shown with different arrangements (see Sect. 2 for more details).

Fig. 2 Parameter e_{P} (left panels) and its comparison with previous statistical parameters (right panels) as a function of the number of measurements (see Table 1). The colours and symbols are the same as those adopted in Fig. 1. The left panels show the results for the full range of measurements, while the right panels only show those for odd numbers of measurements, except for EIQR, where only the results for numbers of measurements that are not multiples of 4 are plotted. The solid lines in the even-dispersion diagrams show the models described in Sect. 5.3.

Fig. 3 Parameter e_{P} as a function of the number of measurements for the free even-parameters (see Table 1). The colours are the same as in Fig. 2.

Fig. 4 Dispersion and shape parameters as a function of magnitude, where the black dots indicate the WVSC1 stars. The red and dashed black lines indicate the Strateva-modified and Strateva functions, respectively. The maximum number of sources per pixel is shown in brackets in each panel.

Fig. 5 E_{tot} as a function of E_{WVSC1} for all apertures (top panel) and for all individual wavebands and their combination (ZYJHK wavebands) using BA (bottom panel). The results for each photometric aperture and waveband are shown with different colours. E_{WVSC1} decreases with E_{tot}, leading to a more reliable selection (fewer misclassifications), and vice versa.

Fig. 6 G vs. E_{WVSC1} using different approaches (see Sect. 7.4). The approach used is named above in each of the upper diagrams. The colours indicate the results for different filters ZYJHK (brown, grey, red, green, and blue lines respectively) and the combination of results found in all bands (black lines). The same colours were also used in Fig. 5 (bottom panel). 

Fig. 7 X indices using all wavebands X_{ZYJHK} for ED and D_{σμ}. The X(ED) index was computed using the Strateva function without sample size correction for BA photometry, while X(ED) was computed using the Strateva modified function with sample size correction for the BAS approach. The histograms of the entire sample (black lines) and WVSC1 stars (red lines) are shown at the top right. The WVSC1 stars are also represented by open black circles and the maximum number of sources per pixel is shown in brackets. 

Fig. 8 E_{tot} vs. E_{WVSC1} (left panel) for the best selection criteria and 1 /η as a function of time interval mean among the measurements ΔT (right panel). In the left panel, the statistical parameters X(EIQR) (blue lines), X(D_{σμ}) (red lines), and X(ED) (dark lines) for ZYJHK (full lines) and K (dashed lines) wavebands, respectively. In the same diagram the results found for using the mean (full grey line) and evenmean (dashed grey line), respectively, are also plotted. The results for normal, uniform, Ceph, RR, and EB simulated distributions are shown in the right panel. The colours indicate different parameters or distributions analyzed, which are indicated at the top of each diagram. In the right panel we show how 1 /η varies with the selection of ΔT. The box plot indicates the quartiles of WFCAM and WVSC1 samples, where each box encloses 50% of the sample under the grouping algorithm that defines ΔT in Sokolovsky et al. (2017). 
