Issue 
A&A
Volume 587, March 2016



Article Number  A31  
Number of page(s)  15  
Section  Stellar structure and evolution  
DOI  https://doi.org/10.1051/00046361/201526183  
Published online  12 February 2016 
A statistical test on the reliability of the noncoevality of stars in binary systems ^{⋆,}^{⋆⋆}
^{1} INAF–Osservatorio Astronomico di Collurania, via Maggini, 64100 Teramo, Italy
email: valle@df.unipi.it
^{2} INFN, Sezione di Pisa, Largo Pontecorvo 3, 56127 Pisa, Italy
^{3} Dipartimento di Fisica “Enrico Fermi”, Università di Pisa, Largo Pontecorvo 3, 56127 Pisa, Italy
Received: 25 March 2015
Accepted: 30 December 2015
Aims. We develop a statistical test on the expected difference in age estimates of two coeval stars in detached doublelined eclipsing binary systems that are only caused by observational uncertainties. We focus on stars in the mass range [0.8; 1.6] M_{⊙}, with an initial metallicity [Fe/H] from −0.55 to 0.55 dex, and on stars in the mainsequence phase.
Methods. The ages were obtained by means of the SCEPtER technique, a maximumlikelihood procedure relying on a precomputed grid of stellar models. The observational constraints used in the recovery procedure are stellar mass, radius, effective temperature, and metallicity [Fe/H]. To check the effect of the uncertainties affecting observations on the (non)coevality assessment, the chosen observational constraints were subjected to a Gaussian perturbation before applying the SCEPtER code. We defined the statistic W computed as the ratio of the absolute difference of estimated ages for the two stars over the age of the older one. We determined the critical values of this statistics above which coevality can be rejected in dependence on the mass of the two stars, on the initial metallicity [Fe/H], and on the evolutionary stage of the primary star.
Results. The median expected difference in the reconstructed age between the coeval stars of a binary system – caused alone by the observational uncertainties – shows a strong dependence on the evolutionary stage. This ranges from about 20% for an evolved primary star to about 75% for a near ZAMS primary. The median difference also shows an increase with the mass of the primary star from 20% for 0.8 M_{⊙} stars to about 50% for 1.6 M_{⊙} stars. The reliability of these results was checked by repeating the process with a grid of stellar models computed by a different evolutionary code; the median difference in the critical values was only 0.01. We show that the W test is much more sensible to age differences in the binary system components than the alternative approach of comparing the confidence interval of the age of the two stars. We also found that the distribution of W is, for almost all the examined cases, well approximated by beta distributions.
Conclusions. The proposed method improves upon the techniques that are commonly adopted for judging the coevality of an observed system. It also provides a result founded on reliable statistics that simultaneously accounts for all the observational uncertainties.
Key words: binaries: eclipsing / methods: statistical / stars: evolution / stars: lowmass
Online calculator available at http://astro.df.unipi.it/stellarmodels/W/
Full Table 2 is only available at the CDS via anonymous ftp to cdsarc.ustrasbg.fr (130.79.128.5) or via http://cdsarc.ustrasbg.fr/vizbin/qcat?J/A+A/587/A31
© ESO, 2016
1. Introduction
The impossibility to simultaneously fit the observed characteristics of stars in a doublelined detached binary system with a single isochrone is usually interpreted as the need to introduce some modifications in the current generation of evolutionary codes, such as adjustments of the external convection efficiency through the mixinglength parameter (e.g. Torres et al. 2006; Clausen et al. 2009; Morales et al. 2009). Moreover, these systems are often adopted in studies on the calibration of convective core overshooting by finetuning the isochrone fit (see, among many, Andersen et al. 1990; Ribas et al. 2000; Lacy et al. 2008; Clausen et al. 2010).
However, an apparent age difference between the binary system components could simply arise by fluctuations that arise as a result of the uncertainties in the observational constraints adopted in the estimation. This was already pointed out in the literature; as an example, Gennaro et al. (2012) conducted some tests for nine combinations of stellar masses on premainsequence stars. More recently, Valle et al. (2015b) presented some basic results about the expected differences in the age estimates of the two coeval members in doublelined detached binary systems, obtained by the gridbased technique SCEPtER (Valle et al. 2014, 2015a). We found that the differences are large and reach a median value of 60% for a mass ratio of 0.5.
Therefore, given the relevance of the coevality hypothesis, it is of paramount importance to set out accurate methods for evaluating whenever a detected difference in ages of the two stars in an observed system is “too high” to be coeval on statistical grounds. Unfortunately, the recent literature is quite inhomogeneous with regard to this problem. It is common practice to compare the observed system to sets of isochrones to determine the system age, while the bestfit ages for a single star are seldom provided. In most cases the details of the fitting techniques are scarcely discussed. Moreover, it is common to individually obtain the system age estimates for a set of different metallicities, compatible with the error on the observed [Fe/H], and then to assess the error caused by the metallicity uncertainty by considering the obtained age spread (see among many Clausen et al. 2008; Lacy et al. 2008; Torres et al. 2009; Clausen et al. 2010; Rozyczka et al. 2011; Sowell et al. 2012). The drawback of this method is that it cannot account for possible error correlations among masses, metallicities, and effective temperatures of the stars. For age estimation purposes, stellar tracks for the precise observed masses were calculated in other cases (e.g. Vos et al. 2012). The literature also reports a twostage fitting method, which first estimates the bestfit stellar model metallicity that is compatible with the spectroscopic determination, and then finds the bestfitting isochrone for this fixed metallicity (see e.g. Sandberg Lacy et al. 2010; Welsh et al. 2012). Overall, each author relies on different methods and different stellar tracks, computed with chemical and physical input that results in age estimates that sometimes are vastly different. The worst problem is that a reliable and consistent treatment of the errors, at least those arising from the uncertainties in the observed quantities adopted in the fitting, is generally lacking, and a reliable error on the age estimates is rarely provided. This makes evaluating the reliability of the deviation from coevality and comparing the results of different authors very challenging.
The aim of this paper is to partially alleviate this problem by providing a statistical test of the coevality hypothesis that supplements the currently adopted techniques. Our intent is to develop a reliable method that consistently treats the observational errors, and to show its usage in practice. To this aim, we evaluate the dependence of the expected age differences on the mass of the two stars, on their metallicity, and on their relative age^{1} on the main sequence. We provide the critical values of the expected differences in age to be used to asses if the reconstructed noncoevality is simply the result of a random fluctuation. We also show the differences between the coevality test and the commonly adopted comparison of the age confidence interval of the observed stars.
The work developed in this paper is framed in the theory of the frequentistic statistical hypothesis testing (see among many Snedecor & Cochran 1989; Feigelson & Babu 2012, for details). In the presentday formulation, this sound mathematical theory is rooted in the first decades of the nineteenth century and is mainly based on the theoretical studies of Pearson, Gosset (Student), Neyman, and Fisher (see e.g. Fisher 1925; Neyman & Pearson 1933). Without any claim at completeness, we recall that the theory requires formulating a scientific hypothesis to check (in our case the stellar coevality) that is often referred to as the null hypothesis H_{0}, to define a parent population on which the hypothesis should be verified, to define a statistic to be adopted in the hypothesis testing, and to derive the distribution of under H_{0}. From this distribution is then identified a rejection region of H_{0}; for an unilateral test (as the one in the present paper) this traditionally corresponds to values of the statistics above the 95th quantile of the distribution (the value corresponding to the 95th quantile is called critical value). By this choice the experimenter sets the socalled level α of the test to a value of 0.05, corresponding to the probability of type I errors (i.e. rejecting H_{0} when it is indeed true). After these theoretical derivations, the test can be applied to a random sample drawn from the parent population to verify H_{0}. The reference statistic is computed on this sample and compared with the chosen quantile of its distribution. A sample value of above the critical value leads to rejecting H_{0}.
2. Methods
Stellar ages were determined by means of the SCEPtER pipeline, a maximumlikelihood technique relying on a precomputed grid of stellar models and a set of observational constraints (see e.g. Valle et al. 2015a). A first application to eclipsing binary systems has been extensively described in Valle et al. (2015b). The grid of models covers the evolution from the zeroage mainsequence (ZAMS) up to the central hydrogen depletion of stars with masses in the range [0.8; 1.6] M_{⊙} and initial metallicities [Fe/H] from −0.55 dex to 0.55 dex. The grid was computed by means of the FRANEC stellar evolutionary code (Degl’Innocenti et al. 2008; Tognelli et al. 2011), in the same configuration adopted to compute the Pisa Stellar Evolution Data Base^{2} for lowmass stars (Dell’Omodarme et al. 2012; Dell’Omodarme & Valle 2013). The details of the standard grid of stellar models, the sampling procedure, and the age estimation technique are fully described in Valle et al. (2015a,b), while the adopted input and the related uncertainties are discussed in Valle et al. (2009, 2013a,b).
To test the ability of gridbased maximumlikelihood techniques of assessing the coevality of binary components within random error fluctuation, we chose the most favourable scenario, that is, the case in which stellar models perfectly agree with observations and the binary members are coeval by construction. To achieve this optimistic situation, we built a synthetic dataset by sampling the artificial binary systems from the same grid of stellar models as was used in the recovery procedure. The effect on the coevality assessment of the current typical observational errors was simulated by adding random Gaussian perturbations – with σ equal to the uncertainty – to the chosen observational constraints.
As already mentioned, the binary components are coeval by construction because they have been generated with the same age in the sampling stage. However, as a consequence of the perturbation procedure mimicking the observational errors, the recovery might provide two different age estimates for the binary members.
Our aim is to devise a criterion founded on statistics to evaluate the likelihood that the recovered noncoevality is genuine and not merely the result of observational errors. More formally, the null hypothesis H_{0} for which we wish to develop a statistical test is that the stars are coeval within the random variability caused by the observational errors. We aim to define a statistics to test H_{0}, obtaining a rejection region at level α, and thus a critical value identifying it. We define A_{1} and A_{2} as the estimated ages of the two members, with A_{1}>A_{2}. We focus on the statistics W defined as (1)W then varies from 0 (when the stars are estimated to be coeval) to 1 (when A_{1} ≫ A_{2}). High values of W lead to the rejection of the null hypothesis, implying that it is very unlikely that standard stellar models might account for the coevality of the stars with the assumed input or parameters. The point is to establish the critical values above which the coevality rejection is statistically significant.
Fig. 1 Outline of the design to compute the critical values. Steps 1−3 are described in Sect. 2.1, steps 4−5 in Sect. 2.2. 
To do this, we evaluated the distribution of W – under the coevality hypothesis H_{0} – that arises from observational uncertainties alone and computed the critical values that define the range of values W that are compatible with the null hypothesis itself. As stated in the introduction, the 95th quantile of the distribution of the statistics under examination are typically adopted as critical values. The choice of the quantile defines the level α of the test. In the following we assume a level of the test α = 0.05 and compute the critical values W_{1−α}. For W>W_{1−α} the null hypothesis H_{0}, that is, the coevality of the binary components, is rejected.
2.1. Sampling and critical values estimation
Since the critical value W_{1−α} depends on the characteristics of the binary system, it is not possible to provide a single value suitable for all the possible binary configurations. To describe the dependence of W_{1−α} on the physical parameters that identify the binary system, a long and time consuming procedure is required. Figure 1 outlines the steps – described in detail below – that we followed in the estimation procedure.
Step 1
To produce a representative sample of synthetic binary systems spanning the possible combinations of masses, metallicity, and relative ages of the two stars, we performed a systematic sampling from the standard grid of stellar models. We selected a set of nine stellar masses from 0.8 M_{⊙} to 1.6 M_{⊙} with a step of 0.1 M_{⊙}. The metallicity was restricted to initial [Fe/H] values of −0.55, −0.25, 0.0, 0.25 and 0.55. The relative age r of the primary star was chosen to assume values from 0.1 to 0.9 with a step of 0.2. Then, for each value of the primary mass M_{1}, metallicity [Fe/H], and relative age, we selected the corresponding model from the grid. Linear interpolation in effective temperature, radius, and metallicity was performed to obtain a stellar model of the exact chosen relative age. This step fixes the age of the binary system to A, that is, the age of the selected star. This synthetic star is coupled to all the possible secondary components with masses lower than or equal to M_{1}, the same initial [Fe/H], and age exactly equal to A. Linear interpolation in effective temperature, radius, and metallicity was performed to match these requirements. The described sampling scheme produced 1125 possible combinations. However, since the grid does not contain models in the premainsequence phase, not all these combinations correspond to existing pairs in the grid. For example, a 1.6 M_{⊙} model at relative age 0.1 is too young to be matched by any model of 0.8 M_{⊙}, which is still in the premainsequence phase.
In conclusion, we found that a total number of 1087 binary systems are possible, and we estimated the corresponding critical values W_{1−α} for each of them.
Step 2
To simulate observational uncertainties, the 1087 synthetic binary systems were subjected to a Gaussian perturbation of all the observed quantities with standard deviations of 100 K in T_{eff}, 0.1 dex in [Fe/H], 1% in mass, and 0.5% in radius. To simulate realistic uncertainties, we assumed a correlation of 0.95 between the primary and secondary effective temperatures, 0.95 between the metallicities, 0.8 between the masses, and no correlation between the radii. A detailed discussion of these choices and their effect on the final results, also including the motivations for the zero correlation between the radii, can be found in Valle et al. (2015b).
As a result, for each of these 1087 cases we produced a set of N perturbed systems, where the actual value of N was properly evaluated in steps 4 and 5.
Step 3
For each of the 1087 analysed systems, we computed the N values of the statistic W corresponding to all the perturbed objects. Finally, we estimated from these 1087 samples the W_{1−α}(M_{1},M_{2}, [Fe/H] ,r) quantiles, which approximate the required critical values for the statistical test. Critical values for masses, metallicities, and relative ages different from those adopted in the computations can be obtained by interpolation.
At the end of this step, we obtained the required critical values. However, the procedure cannot stop here. In fact, a different simulation run will produce different values; it is therefore mandatory to estimate the random variability of the results. This is the purpose of the following section.
2.2. Monte Carlo variability on the critical values
The accuracy of the computed quantiles W_{0.95} depends on the size N of the Monte Carlo sample, and it is obvious that a larger sample would provide a more accurate estimate. Our aim is to reach a relative precision of 1% on the estimated critical values; therefore we have to evaluate the minimum N needed to meet this requirement.
More technically, given a sample X_{1},...,X_{N} of size N, taken from a distribution F and density f, we aimed to estimate the 1−α quantile, defined as the value for which the distribution function assumes value 1−α, that is, (2)Standard theory establishes that the 1−α sample quantile X_{⌈ N(1−α) ⌉} (the ⌈ N(1−α) ⌉th smallest observation in the sample X_{1},...,X_{N}) is a consistent estimator^{3} of the real quantile. Moreover, the asymptotic expansion for the variance of X_{⌈ N(1−α) ⌉} is (see e.g. Stuart & Ord 1994) (3)which shows that the convergence rate of Monte Carlo quantile simulations is . Unfortunately, Eq. (3) cannot be used for direct computations since it depends on the unknown density function f(W_{1−α}). An usual way to circumvent the problem is to relie on bootstrap estimates (see e.g. Davison & Hinkley 1997).
Steps 4 and 5
The approach requires sampling with replacement a number n of samples of size N from the original W values. For each of these samples, we computed the sample quantile , j = 1,...,n } and then estimated the arithmetic mean and the variance by the variance of the bootstrap quantiles, that is, (4)For our computations we adopted n = 400.
Fig. 2 Boxplot of the relative error on the W_{0.95} quantile estimates as a function of the primary relative age. Error estimates are obtained by bootstrap resampling (see text). 
Hall & Martin (1988) proved that , which means that the convergence to the true estimator is relatively slow but sufficient for our purposes. The technique also allows estimating the bias b of the sample w_{1−α}(5)The sample estimate can therefore be corrected to account for the bias assuming as best quantile estimate the value .
We therefore performed some exploratory computations to quantify the values from Eq. (4) for different N. As a result, we found that N = 50 000 allowed us to reach the required 1% relative accuracy for all the sample. The obtained relative errors on the 1087 quantile estimates are shown in Fig. 2, in dependence on the relative age of the binary system. The figure presents the boxplot^{4} of the errors for each relative age. It is apparent that the median relative errors on the estimated quantile are lower than 0.5% for all the explored relative ages.
3. Critical values W_{0.95}
Fig. 3 Left: boxplot of the W_{0.95} quantile estimates as a function of the mass of the primary star. Right: same as the left panel, but for the dependence on the relative age of the primary star. 
The computation of the 1087 critical values described in the previous section required a total of 5.4 × 10^{7} binary system age estimates. The results of this huge set of simulations are presented in Tables A.1 to A.5. Each table collects – for values of the primary relative age r from 0.1 to 0.9 with a step of 0.2 – the critical values W_{0.95} in dependence on the masses of the stars and their initial metallicity [Fe/H].
To use the test in practice, after estimating the ages of the two stars and computing the value of the statistics W, this must be compared with the appropriate critical value in the tables. If the computed W is lower than the critical value W_{0.95}, the null hypothesis of coevality cannot be rejected.
One complication arises because the values in Tables A.1−A.5 depend on the observationally unknown primary relative age r. The problem can be solved by estimating r, for example by means of the same grid technique as adopted for age estimates (more details on this topic are provided in Sect. 5).
Our main result here is that the critical values – and thus the critical age relative differences on age estimates of two stars that are coeval by construction – are indeed high despite the high precision reached by the observations. The overall median of the critical values is 0.36, with an interquartile range of [0.16; 0.61]. Restricting the test to binary systems of low or intermediate relative age of the primary star (r ≤ 0.5) results in a median critical value of 0.60 with an interquartile range of [0.45; 0.65]. This shows a general behaviour, which is that the closer the primary star is to the ZAMS, the higher is the critical value W_{0.95} and hence the expected difference between the ages of the two binary components.
Some trends in the critical values presented in Tables A.1−A.5 are apparent and easily understandable. The lowest values of W_{0.95} in each row or column of the tables are usually found for binary systems composed by equalmass stars. These are the only combinations for which the relative age of the two stars are the same, all the others provide a lower relative age for the secondary. Since it is known (see the extensive discussion in Valle et al. 2015a) that the relative error in age decreases with relative age, it is straightforward to conclude that the W_{0.95} values on the diagonal of the tables should be the lowest. From the same argument it follows that an increase of the critical values is expected farther down in the columns – that is, increasing the mass of the primary star – and leftward in the rows – that is, decreasing the mass of the secondary star. However, closer to the border of the tables, the trend is less clear since here some edge effects (see Valle et al. 2014, 2015a, for a discussion) become dominant.
The discussed trend of the critical values with the mass of the primary star is shown in the left panel of Fig. 3. The medians of the boxes increase monotonically from the 0.8 M_{⊙} models to those with 1.6 M_{⊙}. The large spread of the boxes is due to the variability of the other parameters, that is, the secondary mass, the metallicity, and the relative age of the primary star. The strong effect of the primary relative age r on critical values W_{0.95} is shown in the right panel of Fig. 3. For low values of r both stars are at a young relative age, which leads to a large dispersion of singleage estimates. Conversely, at high values of r we find systems of equal masses with high r, whose age estimates are more precise and lead to lower critical values, and unbalanced systems for which the age errors and the critical values are greater.
From the analysis of the tables in Appendix A, it appears that the effect of the initial metallicity [Fe/H] is modest, with a mean variation of W_{0.95} by about 0.03 for a change of 1.0 dex in [Fe/H]. This explains the choice of grouping the binary systems according to their initial metallicity, neglecting the change in chemical composition owing to the microscopic diffusion.
Some words of caution are needed. First of all, the computed critical values directly depend on the assumed magnitude of the observational uncertainties. A larger uncertainty produces a stronger fluctuation in age estimates and thus critical values higher than those presented here. Therefore we calibrated the uncertainties we adopted here by assuming realistic values, which are slightly higher than the average of the quoted uncertainties in some recent determinations (e.g. Yıldız 2007; Clausen et al. 2009, 2010; Southworth 2013; Torres et al. 2014).
Fig. 4 Outline of the comparison between the coevality tests based on W and confidence intervals (see text). 
Moreover, we provide an online tool^{5} that allows computing the required critical value for the supplied masses, metallicity, evolutionary phase, and observational uncertainties. This calculator can be useful when the uncertainties on the binary system observables are larger than those adopted here.
Another aspect that is worth discussing is whether the critical values strongly depend on the adopted stellar models. If there is no dependence, the critical values we computed here can be readily used regardless of the stellar models used for the age estimation. If the values do depend on the models, these critical values can be safely adopted only when stellar ages are determined by means of the SCEPtER grid. To answer this question, it is not possible to simply use the stellar model grids currently available in literature because they are too sparse. The only way is to compute fine grids of models covering the same parameter space as that of SCEPtER by means of different evolutionary codes. Unfortunately, only one code is freely available: the code MESA (Paxton et al. 2013). We therefore computed a grid of stellar models assuming default MESA input with the following exceptions: solar heavyelement mixtures from Asplund et al. (2009); solarcalibrated mixinglength α_{ml} = 1.8; including the element diffusion with the coefficients by Thoul et al. (1994) with radiation turbulence by Morel & Thévenin (2002); and the ^{14}N(p,γ)^{15}O rate from Imbriani et al. (2004). Then we repeated the previously described steps (see Fig. 1) to compute the MESAbased critical values.
The comparison between FRANEC and MESAbased critical values is quite encouraging because it shows only small variations. The median of the differences between FRANEC and MESA was 0.009, with 16th to 84th quantile range [−0.002; 0.032]. Although a more systematic exploration with other widely used stellar evolution codes would be worthwhile, this first comparison suggests that the critical values provided by our procedure are generally applicable.
SCEPtER age estimates for single stars and binary systems can be easily obtained from the R libraries SCEPtER^{6} and SCEPtERbinary^{7} that are accessible on CRAN.
4. Comparison of the W test with the individual age confidence intervals
The classical way to asses the stellar coevality in a binary system is to fit the observations with isochrones of different ages and claim coevality whenever a single isochrone is able to fit – within the errors – the whole system. Equivalently, it is possible to estimate the two stellar ages and their errors and establish the coevality by the overlap of the two error intervals.
A natural question is therefore how the W test performs with respect to the more intuitive approach of adopting the age error intervals to claim system coevality. The purpose of this section is to verify whether the two techniques are equivalent. In other words, does the rejection of the coevality by the W test imply that the ages of the single members are not compatible with each other within the errors and vice versa?
The outline of the procedure is presented in Fig. 4. It required a huge amount of Monte Carlo simulations, lasting for 12 days on a Intel Xenon machine with 32 cores. To reduce the computational burden, we restricted the calculations to solar metallicity; 217 binary systems from the possible 1087 of Sect. 2.1 were entered in the analysis.
As a first step, we generated N = 5000 Monte Carlo perturbed systems for each of the 217 possible couples (step 2 in Fig. 4) for the selected 217 ideal binary systems. We estimated the ages of the two stars for each of these N systems and computed the W values (step 3). For the 5% of these perturbed systems whose values were greater than the critical W (i.e. 250 systems for each of the 217 ideal binaries) we computed the Monte Carlo 95% confidence interval for the age. That is, for each of these critical systems we generated n = 10 000 newly perturbed systems, estimated the two stellar ages, and obtained the 95% confidence interval on the ages by computing the 2.5th and the 97.5th quantiles of the age estimates (see Valle et al. 2015a, for details). The same procedure was repeated for a random set (about 15% of the total, for computational reasons) of systems with W lower than the critical values (steps 4 and 5 in Fig. 4). The procedure required a total of 5.4 × 10^{8} binary age estimates.
The results of these computations showed only a few systems for which the W was lower than the critical values, but the age confidence intervals did not overlap. We can therefore safely disregard these cases. This means that whenever the coevality of the binary components cannot be rejected by the W test, the individual stellar ages are compatible with each other within the errors.
Conversely, we found a striking lack of agreement between the two techniques for systems with W greater than the critical values. Only in a small fraction of cases the comparison of the age confidence intervals revealed a significant difference in the two stellar ages. In the vast majority of cases in which the W test allows rejecting coevality, the estimates of the individual ages are still compatible with each other within the errors. Under the null hypothesis H_{0}, the confidence interval approach is therefore more conservative for the coevality assessment than the W test, and adopting it results in a severe underestimation of the fraction of systems for which the coevality hypothesis is questionable. In summary, the W test and the confidence interval computation have a different scope of applicability. Whenever the former is specifically developed for the task of a coevality check, the confidence interval computation lack of statistical power when adopted for this aim and its actual level is about an order of magnitude lower than the nominal α = 0.05.
The results for the 217 considered systems are summarised in Table 1, which contains the median and the 16th and 84th quantiles of the percentage of systems for which the W test and the confidence interval techniques concordantly report a significant difference between the ages of the two stars.
Medians and ranges from the 16th to 84th quantiles (in parentheses) of the percentages of nonoverlapping 95% confidence intervals for age estimates for systems with W greater than the corresponding critical value, classified according to the relative age of the primary star and the mass ratio of the system.
We found decreasing trends with increasing mass ratio of the system; while for systems with 0.5 ≤ q ≤ 0.6 the confidence interval method rejects the coevality hypothesis for a median fraction of 16.2% of the systems that are significant according to the W test, this percentage drops to zero for systems with near equal masses. This trend is expected and is mainly due to the assumed correlations in the perturbation step. Systems with q ≈ 1 , which are very near in the grid before the perturbation, will still be near after the perturbation. Therefore their age confidence intervals will always overlap.
5. Toward the W test application on real binary systems
Another problem deserving a detailed analysis before adopting the W test for real binary systems concerns measurement errors. The empirically determined values of the physical quantities of the two stellar components used in the age estimate procedure are in principle different from the real ones . Moreover, the evolutionary stage r of the primary star is not observable and has to be estimated. On the other hand, we showed in Sect. 3 that the critical values W_{0.95} depends on the true values of stellar masses, metallicity, and primary relative age r. As a consequence, the critical value Ŵ_{0.95} computed following the five steps described in Sect. 2 starting from the empirically determined values of the binary members is in principle different from the true critical value W_{0.95} computed from the real values. Because of this situation, the recovered W value for the observed system should not be compared with Ŵ_{0.95} but with W_{0.95}. The true values are not observables, however, and thus W_{0.95} cannot be directly computed. Which values of masses, metallicity, and evolutionary stage r should be adopted to properly compute a critical value that is a satisfactory approximation of the true W_{0.95}? This is a crucial question; the W test can only be applied to real systems if the correct critical value to be adopted in the age comparison can be accurately recovered.
The problem is equivalent to finding the most probable true binary system associated with the observed one, or in other words, a system composed of two coeval members that maximizes the likelihood of generating the observed binaries after a perturbation caused by the observational uncertainties. Valle et al. (2015b) showed that the best solution to this problem is provided by imposing the coevality of the two stars in the gridbased recovery procedure. Let be the best estimate of under this assumption. We therefore performed a new set of Monte Carlo simulations to compute the critical values from . To check the goodness of this approach, we tested it on a sample of synthetic binary systems for which the true values and W_{0.95} are known. Then the comparison between W_{0.95} and will prove the performance of the adopted procedure.
Adopting the same framework as described above, we obtained the best estimates and the corresponding critical values for all the systems relevant to the W test (critical systems in the step 4 of Fig. 4). A possible complication arises whenever the two stars have age estimates so different that no gridbased coeval solutions exist, respecting the observational constraints. For these extreme cases the singlestar estimates can be adopted. It is clear, however, that the impossibility of obtaining a gridbased coeval solution strongly advises against this hypothesis.
As a result, we found that the proposed estimator of the true critical value computed starting from the most probable coeval binary system associated with the observed one is good, that is, unbiased and with a small variance. The differences between and W_{0.95} are small because overall the median difference is −0.004 (16th and 84th quantile, −0.019 and 0.018, respectively). No relevant trends with the mass of the stars, the evolutionary phase, or the mass ratio were found. To quantify these variations in terms of the level of the test, they correspond to a median α of 0.048, (16th and 84th quantiles, 0.033 and 0.066).
In conclusion, the Monte Carlo simulations showed that the test can be safely applied to real systems and is more sensitive in rejecting the coevality than the simple computation of the individual age confidence intervals. To adopt the test in practice, it is therefore necessary to

1.
compute the two singleage estimates and the W value;

2.
obtain the best coeval solution of the system;

3.
adopt the masses, metallicity, and relative age of the primary stars obtained in the preceding step to find the critical value W_{0.95} by interpolating Tables A.1 to A.5; and

4.
compare the W value with W_{0.95}.
Fig. 5 Comparison of beta distribution functions (solid black line) with the observed distributions (red dashed line) of W for 25 randomly selected combinations of masses M_{1} and M_{2} of stars (in solar units) and relative age r of the primary star (see text). 
6. Analytical approximation of the W distribution
For all the combinations of masses, metallicities, and relative ages described in Sect. 2, the Gaussian perturbations – added to the values of the observables before the age estimation – cause a scatter of the values of W. The actual distribution of these values ultimately depends on the position in the estimation grid of the unperturbed values. Since the grid is irregular and the differences among near models change with the models evolutionary stage (see Valle et al. 2014, for a detailed discussion), it is impossible to explicitly derive the exact distributions of W for all the examined cases.
Nevertheless, we were able to find an empirical approximation suitable for all these distributions. In this section we show that these distributions are closely approximated by beta distributions and discuss the validity and limits of this approximation. This result is particularly useful since it could be used to estimate the critical values at different levels α.
The beta density function f(x,a,b) is defined on the interval x ∈ [0,1] and is parametrised by two positive parameters, a and b, controlling the shape of the distribution. The density has the expression (6)where Γ(.) is the gamma function. The mean μ and the variance σ^{2} of the distribution are (7)For each of the 1087 simulated binary systems of Sect. 3 we adapted a beta distribution to the N = 50 000 synthetic values of W. The parameters of the distribution were computed by equating the sample mean and variance to the theoretical values given by Eq. (7). Figure 5 shows a comparison between the theoretical and the sample distributions for 25 randomly selected binary systems. The agreement is surprisingly good. All the computed critical values, their errors, and the parameters of the beta approximating distribution function are available at CDS. Table 2 shows the first four lines of the table.
Fig. 6 Relative difference of the W_{0.95} critical values obtained from the sample data and from the approximating beta distribution, in dependence on the mass ratio q (see text). A positive value implies that the computed quantile is larger than the theoretical one. The red line is a loess smoother of the data that shows the mean trend of the relative error. 
W_{0.95} critical values and the parameters of the beta distribution, which approximate the W distribution, for the 1087 considered binary systems.
In Fig. 6 we show, in dependence on the mass ratio q, the value (W_{0.95}−B_{0.95}) /W_{0.95}, where W_{0.95} are the values computed in Sect. 3, and B_{0.95} are the corresponding quantiles obtained from the theoretical beta distribution. A positive value means that the sample quantile is larger than the corresponding theoretical value. To better show the median trend of the relative error, the figure also shows a loess smoother^{8} of the data. The data and theoretical estimates agree very well for q> 0.7, where the spread is lower than about ± 5%. For lower q there are larger differences and the beta distribution overestimates the desired quantile by as much as 15%. A global weak bias is present and the theoretical quantiles generally overestimate the observed ones, providing a more conservative test. The median differences among theoretical and empirical quantiles, marked by the loess smoother, are of about −2% at high q, and reach a −5% at q = 0.5.
Figure 6 shows that the variability in the differences between the observed and theoretical quantiles is higher at q = 0.5. The typical behaviour for these systems is shown in Fig. 7, which displays the case of worst agreement between theoretical and empirical distributions (M_{1} = 1.6 M_{⊙}, M_{2} = 0.8 M_{⊙}, r = 0.5). The empirical distribution shows a lack of values around W = 0.3 and an accumulation at W = 0.5.
Fig. 7 Comparison of beta distribution function (solid black) with the observed distribution (dashed red) for the case of worst agreement (M_{1} = 1.6 M_{⊙}, M_{2} = 0.8M_{⊙}, r = 0.5). 
7. Conclusions
We devised a statistical test on the difference in the estimated ages of two coeval stars in a binary system as a result of the fluctuations caused by observational errors. This test allows assessing on statistical grounds whether the apparent noncoevality of binary members is merely the consequence of observational uncertainties. We also provided an online tool to be used for real systems.
We introduced the statistics W, defined as the absolute value of the difference between the two estimated ages and the age of the older star. We studied how the W values are scattered as a result of the uncertainty on the observational constraints adopted in the age estimation procedure. We assumed a level of the statistical test α = 0.05, corresponding to critical values W_{0.95}. The coevality hypothesis is rejected when W>W_{0.95}. We analysed the dependence of W_{0.95} on the masses of the two stars, the initial metallicity [Fe/H], and the relative age of the primary star.
We found that the values W_{0.95} range in median from 0.65 for relative age r = 0.1 to 0.2 at relative age r = 0.9, meaning that the younger the system, the larger the expected difference between the estimated ages of the two components that is due to the observational uncertainties. Moreover, W_{0.95}, and thus the expected age discrepancy, also increases with the mass of the primary star. The dependence on the initial metallicity is negligible.
We also verified that the results are robust to a change of the adopted stellar evolution code. To this purpose, we repeated the process by using a grid of stellar models computed by the MESA evolutionary code (Paxton et al. 2013). The median difference in the critical values was about 0.01.
The magnitude of the critical values, in particular for systems near the ZAMS, should be taken into account in the analysis of observational data before concluding that the coevality of the stars cannot be accounted for by standard stellar models without changes in the input physics and/or in the adopted calibration of the free parameters, such as the mixinglength or the convective core overshooting.
We demonstrated how the test can be adapted to real binary systems in presence of measurements errors on the observed quantities. We showed that the true critical values, computed with perfect knowledge of the observables, are closely approximated by the corresponding values derived by assuming the gridbased best estimates of the masses imposing the coevality of the solution. We compared the performance of the W test in assessing the (non)coevality of the binary components with the common approach, which relies on confidence intervals of the individual age estimates. We found that the latter approach is too conservative for the assumed level α, meaning that most of the systems signalled to be noncoeval by the W test have estimated individual ages that are compatible with each other within the errors. In contrast to the W test, which is specifically developed for the task of a coevality check, the confidence interval comparison is not statistically powerful when adopted for this aim, and it has an actual level of about an order of magnitude lower than the nominal α = 0.05. More in detail, the common approach gives significant differences ranging from about 16% of the systems with a significant W value at q ≈ 0.5 to 0% for systems at q ≈ 1.
Finally, we showed that the distributions of W for the various combinations of star masses, metallicities, and primary relative ages are approximated by beta distributions with appropriate shape parameters. The approximation is very good for systems with a mass ratio higher than 0.7, while it is less accurate for more unbalanced systems.
A boxplot is a convenient way to summarize a distribution. The black thick line marks the median of the distribution, while the box extends from the 25th to the 75th quantile. The whiskers extend to the extreme values, but their lengths are limited to 1.5 times the width of the box. Points outside the extension of the whiskers are omitted from the plot.
A loess regression smoother is a nonparametric locally weighted polynomial regression technique that is often used to show the underlying trend of scattered data (see e.g. Feigelson & Babu 2012; Venables & Ripley 2002).
Acknowledgments
We thank our referee for the useful comments that helped us in clarifying and improving this paper. This work has been supported by PRINMIUR 20102011 (Chemical and dynamical evolution of the Milky Way and Local Group galaxies, PI F. Matteucci), and PRININAF 2012 (The M4 Core Project with Hubble Space Telescope, PI L. Bedin).
References
 Andersen, J., Clausen, J. V., & Nordstrom, B. 1990, ApJ, 363, L33 [NASA ADS] [CrossRef] [Google Scholar]
 Asplund, M., Grevesse, N., Sauval, A. J., & Scott, P. 2009, ARA&A, 47, 481 [NASA ADS] [CrossRef] [Google Scholar]
 Clausen, J. V., Torres, G., Bruntt, H., et al. 2008, A&A, 487, 1095 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Clausen, J. V., Bruntt, H., Claret, A., et al. 2009, A&A, 502, 253 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Clausen, J. V., Frandsen, S., Bruntt, H., et al. 2010, A&A, 516, A42 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Davison, A. C., & Hinkley, D. V. 1997, Bootstrap Methods and Their Application, Cambridge Series in Statistical and Probabilistic Mathematics (Cambridge University Press) [Google Scholar]
 Degl’Innocenti, S., Prada Moroni, P. G., Marconi, M., & Ruoppo, A. 2008, Ap&SS, 316, 25 [NASA ADS] [CrossRef] [Google Scholar]
 Dell’Omodarme, M., & Valle, G. 2013, The R Journal, 5, 108 [Google Scholar]
 Dell’Omodarme, M., Valle, G., Degl’Innocenti, S., & Prada Moroni, P. G. 2012, A&A, 540, A26 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Feigelson, E. D., & Babu, G. J. 2012, Modern Statistical Methods for Astronomy with R applications (Cambridge University Press) [Google Scholar]
 Fisher, R. A. 1925, Statistical Methods for Research Workers (Oliver and Boyd) [Google Scholar]
 Gennaro, M., Prada Moroni, P. G., & Tognelli, E. 2012, MNRAS, 420, 986 [NASA ADS] [CrossRef] [Google Scholar]
 Hall, P., & Martin, M. A. 1988, Probability Theory and Related Fields, 80, 261 [CrossRef] [Google Scholar]
 Imbriani, G., Costantini, H., Formicola, A., et al. 2004, A&A, 420, 625 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Lacy, C. H. S., Torres, G., & Claret, A. 2008, AJ, 135, 1757 [NASA ADS] [CrossRef] [Google Scholar]
 Morales, J. C., Torres, G., Marschall, L. A., & Brehm, W. 2009, ApJ, 707, 671 [NASA ADS] [CrossRef] [Google Scholar]
 Morel, P., & Thévenin, F. 2002, A&A, 390, 611 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Neyman, J., & Pearson, E. 1933, Math. Proc. Cambridge Philosophical Society, 29, 492 [NASA ADS] [CrossRef] [Google Scholar]
 Paxton, B., Cantiello, M., Arras, P., et al. 2013, ApJS, 208, 4 [NASA ADS] [CrossRef] [Google Scholar]
 Ribas, I., Jordi, C., & Giménez, Á. 2000, MNRAS, 318, L55 [NASA ADS] [CrossRef] [Google Scholar]
 Rozyczka, M., Kaluzny, J., Pych, W., et al. 2011, MNRAS, 414, 2479 [NASA ADS] [CrossRef] [Google Scholar]
 Sandberg Lacy, C. H., Torres, G., Claret, A., et al. 2010, AJ, 139, 2347 [NASA ADS] [CrossRef] [Google Scholar]
 Snedecor, G., & Cochran, W. 1989, Statistical methods, Statistical Methods No. v. 276 (Iowa State University Press) [Google Scholar]
 Southworth, J. 2013, A&A, 557, A119 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Sowell, J. R., Henry, G. W., & Fekel, F. C. 2012, AJ, 143, 5 [NASA ADS] [CrossRef] [Google Scholar]
 Stuart, A., & Ord, J. K. 1994, Kendall’s Advanced Theory of Statistics (Edward Arnold) [Google Scholar]
 Thoul, A. A., Bahcall, J. N., & Loeb, A. 1994, ApJ, 421, 828 [NASA ADS] [CrossRef] [Google Scholar]
 Tognelli, E., Prada Moroni, P. G., & Degl’Innocenti, S. 2011, A&A, 533, A109 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Torres, G., Lacy, C. H., Marschall, L. A., Sheets, H. A., & Mader, J. A. 2006, ApJ, 640, 1018 [NASA ADS] [CrossRef] [Google Scholar]
 Torres, G., Sandberg Lacy, C. H., & Claret, A. 2009, AJ, 138, 1622 [NASA ADS] [CrossRef] [Google Scholar]
 Torres, G., Vaz, L. P. R., Sandberg Lacy, C. H., & Claret, A. 2014, AJ, 147, 36 [NASA ADS] [CrossRef] [Google Scholar]
 Valle, G., Marconi, M., Degl’Innocenti, S., & Prada Moroni, P. G. 2009, A&A, 507, 1541 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Valle, G., Dell’Omodarme, M., Prada Moroni, P. G., & Degl’Innocenti, S. 2013a, A&A, 549, A50 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Valle, G., Dell’Omodarme, M., Prada Moroni, P. G., & Degl’Innocenti, S. 2013b, A&A, 554, A68 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Valle, G., Dell’Omodarme, M., Prada Moroni, P. G., & Degl’Innocenti, S. 2014, A&A, 561, A125 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Valle, G., Dell’Omodarme, M., Prada Moroni, P. G., & Degl’Innocenti, S. 2015a, A&A, 575, A12 (V15) [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Valle, G., Dell’Omodarme, M., Prada Moroni, P. G., & Degl’Innocenti, S. 2015b, A&A, 579, A59 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Venables, W., & Ripley, B. 2002, Modern applied statistics with S, Statistics and computing (Springer) [Google Scholar]
 Vos, J., Clausen, J. V., Jørgensen, U. G., et al. 2012, A&A, 540, A64 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Welsh, W. F., Orosz, J. A., Carter, J. A., et al. 2012, Nature, 481, 475 [NASA ADS] [CrossRef] [PubMed] [Google Scholar]
 Yıldız, M. 2007, MNRAS, 374, 1264 [NASA ADS] [CrossRef] [Google Scholar]
Appendix A: Tables of critical values
Critical values W_{0.95} in dependence on masses of the stars M_{1} and M_{2} (both in solar units), and on their initial metallicity [Fe/H].
All Tables
Medians and ranges from the 16th to 84th quantiles (in parentheses) of the percentages of nonoverlapping 95% confidence intervals for age estimates for systems with W greater than the corresponding critical value, classified according to the relative age of the primary star and the mass ratio of the system.
W_{0.95} critical values and the parameters of the beta distribution, which approximate the W distribution, for the 1087 considered binary systems.
Critical values W_{0.95} in dependence on masses of the stars M_{1} and M_{2} (both in solar units), and on their initial metallicity [Fe/H].
All Figures
Fig. 1 Outline of the design to compute the critical values. Steps 1−3 are described in Sect. 2.1, steps 4−5 in Sect. 2.2. 

In the text 
Fig. 2 Boxplot of the relative error on the W_{0.95} quantile estimates as a function of the primary relative age. Error estimates are obtained by bootstrap resampling (see text). 

In the text 
Fig. 3 Left: boxplot of the W_{0.95} quantile estimates as a function of the mass of the primary star. Right: same as the left panel, but for the dependence on the relative age of the primary star. 

In the text 
Fig. 4 Outline of the comparison between the coevality tests based on W and confidence intervals (see text). 

In the text 
Fig. 5 Comparison of beta distribution functions (solid black line) with the observed distributions (red dashed line) of W for 25 randomly selected combinations of masses M_{1} and M_{2} of stars (in solar units) and relative age r of the primary star (see text). 

In the text 
Fig. 6 Relative difference of the W_{0.95} critical values obtained from the sample data and from the approximating beta distribution, in dependence on the mass ratio q (see text). A positive value implies that the computed quantile is larger than the theoretical one. The red line is a loess smoother of the data that shows the mean trend of the relative error. 

In the text 
Fig. 7 Comparison of beta distribution function (solid black) with the observed distribution (dashed red) for the case of worst agreement (M_{1} = 1.6 M_{⊙}, M_{2} = 0.8M_{⊙}, r = 0.5). 

In the text 
Current usage metrics show cumulative count of Article Views (fulltext article views including HTML views, PDF and ePub downloads, according to the available data) and Abstracts Views on Vision4Press platform.
Data correspond to usage on the plateform after 2015. The current usage metrics is available 4896 hours after online publication and is updated daily on week days.
Initial download of the metrics may take a while.