Volume 588, April 2016
Article Number A19
Number of page(s) 8
Section Numerical methods and codes
Published online 11 March 2016

© ESO, 2016

1. Introduction

Bayesian statistical methods are now widely applied in astronomy. Of the new techniques thus introduced, model selection (or comparison) is especially notable in that it replaces the frequentist acceptance-rejection paradigm for testing hypotheses. Thus, given a data set D, there might be several hypotheses { Hk } that have the potential to explain D. From these, Bayesian model selection identifies the particular Hk that best explains D. The procedure is simple, though computationally demanding: starting with an arbitrary pair of the { Hk }, we apply the model selection machinery, discard the weaker hypothesis, replace it with one of the remaining Hk, and repeat.

This procedure usefully disposes of the weakest hypotheses, but there is no guarantee that the surviving Hk explains D. If the correct hypothesis is not included in the { Hk }, we are left with the “best” hypothesis but are not made aware that the search for an explanation should continue. In the context of model selection, the next step would be a comparison of this “best” Hk with the hypothesis that Hk is false. But model selection fails at this point because we cannot compute the required likelihood (Sivia & Skilling 2006, p. 84). Clearly, what is needed is a goodness-of-fit criterion for Bayesian models.

This issue is discussed by Press et al. (2007, p. 779). They note that “There are no good fully Bayesian methods for assessing goodness-of-fit ...” and go on to report that “Sensible Bayesians usually fall back to p-value tail statistics ... when they really need to know if a model is wrong”.

On the assumption that astronomers do really need to know if their models are wrong, this paper adopts a frequentist approach to testing Bayesian models. Although this approach may be immediately abhorrent to committed Bayesians, the role of the tests proposed herein is merely to provide a quantitative measure according to which Bayesians decide whether their models are satisfactory. When they are, the Bayesian inferences are presented – and with increased confidence. When not, flaws in the models or the data must be sought, with the aim of eventually achieving satisfactory Bayesian inferences.

2. Bayesian models

The term Bayesian model – subsequently denoted by ℳ – must now be defined. The natural definition of ℳ is that which must be specified in order that Bayesian inferences can be drawn from D – i.e., in order to compute posterior probabilities. This definition implies that, in addition to the hypothesis H, which introduces the parameter vector α, the prior probability distribution π(α) must also be included in ℳ. Thus, symbolically, we write

ℳ = { π, H }. (1)

It follows that different Bayesian models can share a common H. For example, H may be the hypothesis that D is due to Keplerian motion. But for the motion of a star about the Galaxy’s central black hole, the appropriate π will differ from that for the reflex orbit of a star due to a planetary companion.

A further consequence is that a failure of ℳ to explain D is not necessarily due to H: an inappropriate π is also a possibility.

To astronomers accustomed to working only with uniform priors, the notion that a Bayesian model’s poor fit to D could be due to π might be surprising. A specific circumstance where π could be at fault arises when Bayesian methods are used to repeatedly update our knowledge of some phenomenon – e.g., the value of a fundamental constant that over the years is the subject of numerous experiments (X1,...,Xi,...), usually of increasing precision. With an orthodox Bayesian approach, X_{i+1} is analysed with the prior set equal to the posterior from X_i. Thus

π_{i+1}(α) = p(α | H, D_i). (2)

This is the classical use of Bayes’ theorem to update our opinion by incorporating past experience. However, if X_i is flawed – e.g., due to an unrecognized systematic error – then this choice of π impacts negatively on the analysis of X_{i+1}, leading perhaps to a poor fit to D_{i+1}.

Now, since subsequent flawless experiments result in the decay of the negative impact of X_i, this recursive procedure is self-correcting, so a Bayesian might argue that the problem can be ignored. But scientists feel obliged to resolve such discrepancies before publishing or undertaking further experiments and therefore need a statistic that quantifies any failure of ℳ to explain D.

3. A goodness-of-fit statistic for Bayesian models

The most widely used goodness-of-fit statistic in the frequentist approach to hypothesis testing is χ², whose value is determined by the residuals between the fitted model and the data, with no input from prior knowledge. Thus,

χ²_min = χ²(α₀) (3)

is the goodness-of-fit statistic for the minimum-χ² solution α₀.

A simple analogue of χ²_min for Bayesian models is the posterior mean of χ²(α),

χ²_π = ∫ χ²(α) p(α | H,D) dα, (4)

where the posterior distribution is

p(α | H,D) ∝ π(α) ℒ(α | H,D). (5)

Here ℒ(α | H,D) is the likelihood of α given data D.

Note that since χ²_π depends on both constituents of ℳ, namely H and π, it has the potential to detect a poor fit due to either or both being at fault, as required by Sect. 2.

In Eq. (4) the subscript π is attached to χ² to stress that a non-trivial, informative prior is included in ℳ. On the other hand, when a uniform prior is assumed, the statistic is independent of the prior and is then denoted by χ²_u.
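For concreteness, Eqs. (4) and (5) can be evaluated by direct quadrature for a one-parameter Gaussian model. The following sketch computes χ²_π and χ²_u on a grid; the Gaussian prior of width 0.2 is an arbitrary assumption for illustration, not part of the paper’s prescription:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, sigma = 100, 1, 1.0
u = sigma * rng.standard_normal(n)        # data u_i = mu + sigma*z_i with mu_true = 0

# chi2(mu) and likelihood L(mu | H, D) on a uniform grid
mu = np.linspace(-1.0, 1.0, 4001)
chi2 = np.array([np.sum((u - m)**2) for m in mu]) / sigma**2
like = np.exp(-0.5 * (chi2 - chi2.min()))

def post_mean_chi2(prior):
    """Eq. (4): mean of chi2 over the posterior of Eq. (5)."""
    post = prior * like
    post /= post.sum()                    # grid normalisation (uniform spacing)
    return np.sum(chi2 * post)

chi2_u = post_mean_chi2(np.ones_like(mu))               # uniform prior
chi2_pi = post_mean_chi2(np.exp(-0.5 * (mu / 0.2)**2))  # assumed informative prior

print(chi2.min(), chi2_pi, chi2_u)        # chi2_u exceeds chi2_min by ~k
```

With a prior centred near the truth, χ²_π typically falls between χ²_min and χ²_u, as discussed in Sect. 4.1.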

The Bayesian goodness-of-fit statistic χ²_u is used in Lucy (2014; hereafter L14) to illustrate the detection of a weak second orbit in simulations of Gaia scans of an astrometric binary. In that case, H states that the scan residuals are due to the reflex Keplerian orbit caused by one invisible companion. With increasing amplitude of the second orbit, the point comes when the investigator will surely abandon H – i.e., abandon the assumption of just one companion – see Fig. 12 of L14.

3.1. p-values

With the classical frequentist acceptance-rejection paradigm, a null hypothesis H0 is rejected on the basis of a p-value tail statistic. Thus, with the χ² test, H0 is rejected if χ² > χ²_β(ν), where Prob(χ² > χ²_β(ν)) = β, and accepted otherwise. Here ν = n − k is the number of degrees of freedom, where n is the number of measurements and k is the number of parameters introduced by H0, and β is the designated p-threshold chosen by the investigator.
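As a sketch, the critical value χ²_β(ν) is the inverse survival function of the χ² distribution; the values n = 100, k = 1 correspond to the toy case used later in Sect. 5.1:

```python
from scipy.stats import chi2

n, k = 100, 1
nu = n - k                      # degrees of freedom
for beta in (0.05, 0.01, 0.001):
    crit = chi2.isf(beta, nu)   # chi2_beta(nu): Prob(chi2 > crit) = beta
    print(beta, round(crit, 1))
```

For ν = 99 this gives 123.2, 134.6 and 148.2; the thresholds 124.2, 135.6 and 149.2 quoted in Sect. 5.1 for χ²_u are these values shifted by k = 1, in line with Eq. (6).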

For a Bayesian model, a p-value can be computed from the χ²_π statistic, whose approximate distribution is derived below in Sect. 5.1. However, a sharp transition from acceptance to rejection of the null hypothesis at some designated p-value is not recommended. First, p-values overstate the strength of the evidence against H0 (e.g., Sellke et al. 2001). In particular, the value p = 0.05 recommended in elementary texts does not imply strong evidence against H0. Second, the p-value is best regarded (Sivia & Skilling 2006, p. 85) as serving a qualitative purpose, with a small value prompting us to think about alternative hypotheses. Thus, if χ²_π exceeds the chosen threshold χ²_β, this is a warning that something is amiss and should be investigated, with the degree of concern increasing as β decreases. If the β = 0.001 threshold is exceeded, then the investigator would be well-advised to suspect that ℳ or D is at fault.

Although statistics texts emphasize tests of H not D, astronomers know that D can be corrupted by biases or calibration errors. Departures from normally-distributed errors can also increase χ²_π.

If D is not at fault, then ℳ is the culprit, implying that either π or H is at fault. If the fault lies with π and not H, then we expect that χ²_u falls below the threshold even though χ²_π exceeds it.

If neither D nor π can be faulted, then the investigator must seek a refined or alternative H.

3.2. Type I and type II errors

In the frequentist approach to hypothesis testing, decision errors are said to be of type I if H is true but the test says reject H, and of type II if H is false but the test says accept H.

Since testing a Bayesian model is not concerned exclusively with H, these definitions must be revised, as follows:

  • A type I error arises when ℳ and D are flawless but the statistic (e.g., χ²_π) exceeds the designated threshold.

  • A type II error arises when ℳ or D is flawed but the statistic does not exceed the designated threshold.

Here the words accept and reject are avoided. Moreover, no particular threshold is mandatory: it is at the discretion of the investigator and is chosen with regard to the consequences of making a decision error.

4. Statistics of χ²_u

The intuitive understanding that scientists have regarding χ² derives from its simplicity and from the rigorous theorems on its distribution that allow us to derive confidence regions for multi-dimensional linear models (e.g., Press et al. 2007, Sect. 15.6.4).

Rigorous statistics for χ² require two assumptions: 1) that the model fitted to D is linear in its parameters; and 2) that measurement errors obey the normal distribution. Nevertheless, even when these standard assumptions do not strictly hold, scientists still commonly rely on χ² to gauge goodness-of-fit, with perhaps Monte Carlo (MC) sampling to provide justification or calibration (e.g., Press et al. 2007, Sect. 15.6.1).

Rigorous results for the statistic χ²_u are therefore of interest. In fact, if we add the assumption of a uniform prior to the above standard assumptions, then we may prove (Appendix A) that

χ²_u = χ²_min + k, (6)

where k is the number of parameters.

Given that Eq. (6) is exact under the stated assumptions, it follows that the quantity χ²_u − k is distributed as χ² with ν = n − k degrees of freedom.

For minimum-χ² fitting of linear models, the solution α₀ is always a better fit to D than is the true solution. In particular, ⟨χ²(α₀)⟩ = n − k, whereas ⟨χ²(α_true)⟩ = n. Accordingly, the +k term in Eq. (6) “corrects” for overfitting, and so ⟨χ²_u⟩ = n – i.e., the expected value of χ² for the actual measurement errors.
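Eq. (6) is easy to check numerically. The sketch below (a two-parameter straight-line model is an arbitrary choice) estimates χ²_u by averaging χ² over posterior samples, which for a linear model with uniform prior are draws from N(α₀, σ²(AᵀA)⁻¹):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, sigma = 100, 2, 1.0
x = np.linspace(-1.0, 1.0, n)
A = np.column_stack([np.ones(n), x])            # design matrix for y = a + b*x
y = 0.5 + 2.0 * x + sigma * rng.standard_normal(n)

# minimum-chi2 (least-squares) solution alpha_0
alpha0, *_ = np.linalg.lstsq(A, y, rcond=None)
chi2_min = np.sum((y - A @ alpha0)**2) / sigma**2

# uniform prior => posterior is N(alpha0, C) with C = sigma^2 (A^T A)^{-1}
C = sigma**2 * np.linalg.inv(A.T @ A)
draws = rng.multivariate_normal(alpha0, C, size=50_000)
chi2_u = np.mean(np.sum((y - draws @ A.T)**2, axis=1)) / sigma**2

print(chi2_u - chi2_min)        # Eq. (6): should be close to k = 2
```

The Monte Carlo average reproduces the +k offset to within sampling noise.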

4.1. Effect of an informative prior

When an informative prior is included in ℳ, an analytic formula for χ²_π is in general not available. However, its approximate statistical properties are readily found.

Consider again a linear model with normally-distributed errors and suppose further that the experiment (X2) is without flaws. The χ² surfaces are then self-similar k-dimensional ellipsoids with minimum at α₀ ≈ α_true, the unknown exact solution. Let us now further suppose that the informative prior π(α) derives from a previous flawless experiment (X1). The prior π will then be a convex (bell-shaped) function with maximum at α_max ≈ α₀. Now, consider a 1D family of such functions all centred on α₀ and ranging from the very broad to the very narrow. For the former, χ²_π ≈ χ²_u; for the latter, χ²_π ≈ χ²_min. Thus, ideally, when a Bayesian model is applied to data D we expect that

χ²_min ≲ χ²_π ≲ χ²_u. (7)

Now, a uniform prior and δ(α − α₀) are the limits of the above family of bell-shaped functions. Since the delta-function limit is not likely to be closely approached in practice, a first approximation to the distribution of χ²_π is that of χ²_u – i.e., χ²_π − k is approximately distributed as χ² with ν = n − k degrees of freedom.

The above discussion assumes faultless X1 and X2. But now suppose that there is an inconsistency between π and X2. The peak of π at α_max will then in general be offset from the minimum of χ² at α₀. Accordingly, in the calculation of χ²_π from Eq. (4), the neighbourhood of the χ² minimum at α₀ has reduced weight relative to χ²(α_max) at the peak of π. Evidently, in this circumstance, χ²_π can greatly exceed χ²_u, and the investigator is then alerted to the inconsistency.

5. Numerical experiments

Given that rigorous results for χ² are not available for informative priors, numerical tests are essential to illustrate the discussion of Sect. 4.1.

5.1. A toy model

A simple example with just one parameter μ is as follows: H states that u = μ, and D comprises n measurements u_i = μ + σz_i, where the z_i here and below are independent gaussian variates randomly sampling N(0,1). In creating synthetic data, we set μ = 0, σ = 1 and n = 100.

In the first numerical experiment, two independent data sets D1 and D2 are created comprising n1 and n2 measurements, respectively. On the assumption of a uniform prior, the posterior density of μ derived from D1 is

p(μ | H,D1) ∝ ℒ(μ | H,D1). (8)

We now regard p(μ | H,D1) as prior knowledge to be taken into account in analysing D2. Thus

π(μ) = p(μ | H,D1), (9)

so that the posterior distribution derived from D2 is

p(μ | H,D2) ∝ π(μ) ℒ(μ | H,D2). (10)

The statistic quantifying the goodness-of-fit achieved when ℳ = { π,H } is applied to data D2 is then

χ²_π = ∫ χ²(μ | H,D2) p(μ | H,D2) dμ. (11)

From N independent data pairs (D1,D2), we obtain N independent values of χ²_π, thus allowing us to test the expectation (Sect. 4.1) that this statistic is approximately distributed as χ²_u. In Fig. 1, this empirical PDF is plotted together with the theoretical PDF for χ²_u.
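The pipeline of Eqs. (8)-(11) for a single data pair can be sketched on a grid; the grid limits and resolution are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(3)
mu_true, sigma, n1, n2 = 0.0, 1.0, 100, 100
D1 = mu_true + sigma * rng.standard_normal(n1)
D2 = mu_true + sigma * rng.standard_normal(n2)

mu = np.linspace(-1.0, 1.0, 4001)               # uniform grid in mu

def chi2_grid(D):
    return np.array([np.sum((D - m)**2) for m in mu]) / sigma**2

# Eq. (8): posterior from D1 with a uniform prior -> Eq. (9): prior for D2
chi2_1 = chi2_grid(D1)
prior = np.exp(-0.5 * (chi2_1 - chi2_1.min()))

# Eq. (10): posterior from D2 with that informative prior
chi2_2 = chi2_grid(D2)
post = prior * np.exp(-0.5 * (chi2_2 - chi2_2.min()))
post /= post.sum()

# Eq. (11): goodness-of-fit statistic for M = {pi, H} applied to D2
chi2_pi = np.sum(chi2_2 * post)
print(chi2_pi)
```

Repeating this over N independent data pairs yields the empirical PDF of Fig. 1.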

Fig. 1. Empirical PDF of χ²_π derived from the analysis of 10⁶ data pairs (D1,D2) as described in Sect. 5.1. The dashed curve is the theoretical PDF for χ²_u.

The accuracy of the approximation at large values of χ² is of special importance. For n = 100 and k = 1, the 0.05, 0.01 and 0.001 critical points of χ²_u are 124.2, 135.6 and 149.2, respectively. From a simulation with N = 10⁶, the numbers of χ²_π values exceeding these thresholds are 50 177, 10 011 and 1025, respectively. Thus, the fractions of χ²_π values exceeding the critical values derived from the distribution of χ²_u are close to their predicted values. Accordingly, the conventional interpretation of these critical values is valid.

In this experiment, the analysis of X2 benefits from knowledge gained from X1. We therefore expect χ²_π to be less than χ²_u, since replacing the uniform prior with the informative π obtained from X1 should improve the fit. From 10⁶ repetitions, this proves to be so with probability 0.683. Sampling noise in D1 and D2 accounts for the shortfall.

5.2. Bias test

When X1 and X2 are flawless, the statistic χ²_π indicates doubts – i.e., type I errors (Sect. 3.2) – about the mutual consistency of X1 and X2 with just the expected frequency. Thus, with the 5% threshold, doubts arise in 5.02% of the above 10⁶ trials. This encourages the use of χ²_π to detect inconsistency.

Accordingly, in a second test, X1 is flawed due to biased measurements. Thus, now u_i = μ + σz_i + b for D1. As a result, the prior for X2 obtained from Eq. (9) is compromised, and this impacts on the statistic χ²_π from Eq. (11).

In Fig. 2, the values of χ²_π and χ²_u are plotted against b/σ. Since the compromised prior is excluded from χ²_u, its values depend only on the flawless data sets D2, and so mostly fall within the (0.05, 0.95) interval. In contrast, as b/σ increases, the values of χ²_π are increasingly affected by the compromised prior.

The mutual consistency of X1 and X2 is assessed on the basis of χ²_π, with choice of critical level at our discretion. However, when χ²_π exceeds the 0.1% level at 149.2, we would surely conclude that X1 and X2 are in conflict and seek to resolve the discrepancy. On the other hand, when inconsistency is not indicated, we may accept the Bayesian inferences derived from X2 in the confident belief that incorporating prior knowledge from X1 is justified and beneficial.

Fig. 2. Detection of bias in X1 with the χ²_π statistic when X2 is analysed with the prior derived from X1. Values of χ²_π (filled circles) and χ²_u (open circles) are plotted against the bias parameter b/σ. The dashed lines are the 5 and 95% levels.

This test illustrates the important point that an inappropriate π can corrupt the Bayesian inferences drawn from a flawless experiment. Thus, in this case, the bias in D1 propagates into the posterior p(μ | H,D2) derived from X2. This can be (and is) avoided by preferring the frequentist methodology. But to do so is to forgo the great merit of Bayesian inference, namely its ability to incorporate informative prior information (Feldman & Cousins 1998). If one does prefer Bayesian inference, it is evident that a goodness-of-fit statistic such as χ²_π is essential in order to detect errors propagating into the posterior distribution from an ill-chosen prior.

5.3. Order reversed

In the above test, the analysis of X2 is preceded by that of X1. This order can be reversed. Thus, with the same N data pairs (D1,D2), we now first analyse X2 with a uniform prior to obtain p(μ | H,D2). This becomes the prior for the analysis of X1. This analysis then gives the posterior p(μ | H,D1), from which a new value of χ²_π is obtained.

When the values of χ²_π obtained with this reversed order of analysis are plotted against b/σ, the result is similar to Fig. 2, implying that the order is unimportant. Indeed, statistically, the same decision is reached independently of order. For example, for 10⁵ independent data pairs (D1,D2) with b/σ = 1, the number with χ²_π > 124.2, the 5% threshold, is 50 267 when X1 precedes X2 and 50 149 when X2 precedes X1.

5.4. Alternative statistic

Noting that the Bayesian evidence ⟨ℒ⟩ = ∫ π(α) ℒ(α | H,D) dα is the prior-weighted mean of the likelihood, we can, under standard assumptions, write

⟨ℒ⟩ ∝ exp(−χ²_eff/2), (12)

where the effective χ² (Bridges et al. 2009) is

χ²_eff = −2 ln ⟨ℒ⟩ + const. (13)

This is a possible alternative to χ²_π defined in Eq. (4). However, in the test of Sect. 5.1, the two values are so nearly identical that it is immaterial which mean is used. Here χ²_π is preferred because it remains well-defined for a uniform prior, for which an analytic result is available (Appendix A).

Because χ²_π and χ²_eff are nearly identical, the distribution of χ²_eff should approximate that of χ²_u (Sect. 4.1). To test this, the experiment of Sect. 5.1 is repeated with χ²_eff replacing χ²_π. From a simulation with N = 10⁶, the numbers of χ²_eff values exceeding the 0.05, 0.01 and 0.001 thresholds are 50 167, 9951 and 970, respectively. Thus, if χ²_eff is chosen as the goodness-of-fit statistic, accurate p-values can be derived on the assumption that χ²_eff − k is distributed as χ² with ν = n − k degrees of freedom. From Sect. 4.1, we expect these p-values to be accurate if π(α) is not more sharply peaked than ℒ(α | H,D).

6. An F statistic for Bayesian models

Inspection of Fig. 2 shows that a more powerful test of inconsistency between X1 and X2 must exist. A systematic displacement of χ²_π relative to χ²_u is already evident even when χ²_π is below the 5% threshold at 124.2. This suggests that a Bayesian analogue of the F statistic be constructed.

6.1. The frequentist F-test

In frequentist statistics, a standard result (e.g., Hamilton 1964, p. 139) in the testing of linear hypotheses is the following: we define the statistic

F = [(χ²_c − χ²_m)/j] / [χ²_m/(n − i)], (14)

where χ²_m is the minimum value of χ² when all i parameters are adjusted, and χ²_c is the minimum value when a linear constraint is imposed on j (≤ i) parameters, so that only the remaining i − j are adjusted. Then, on the null hypothesis H that the constraint is true, F is distributed as F_{ν1,ν2} with ν1 = j and ν2 = n − i, where n is the number of measurements. Accordingly, H is tested by comparing the value F given by Eq. (14) with critical values F^β_{ν1,ν2} derived from the distribution of F_{ν1,ν2}.

Note that when j = i, the constraint completely determines α. If this value is α*, then χ²_c = χ²(α*), and H states that α* = α_true.

A particular merit of the statistic F is that it is independent of σ. However, the resulting F-test does assume normally-distributed measurement errors.
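A minimal sketch of Eq. (14) follows; the case j = i = 1, i.e. a fully determining constraint μ = 0 on the toy model of Sect. 5.1, is an illustrative assumption:

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(4)
n, sigma = 100, 1.0
u = sigma * rng.standard_normal(n)            # H: u = mu, with mu_true = 0

# chi2_m: all i = 1 parameters adjusted (minimum at the sample mean)
chi2_m = np.sum((u - u.mean())**2) / sigma**2
# chi2_c: the constraint mu = 0 fixes j = i = 1 parameters
chi2_c = np.sum(u**2) / sigma**2

nu1, nu2 = 1, n - 1
F = ((chi2_c - chi2_m) / nu1) / (chi2_m / nu2)    # Eq. (14)
p = f_dist.sf(F, nu1, nu2)                        # p-value under the null
print(F, p)
```

Because σ cancels between numerator and denominator, the same F is obtained whatever error scale is assumed, which is the merit noted above.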

6.2. A Bayesian F

In a Bayesian context, the frequentist hypothesis that α_true = α* is replaced by the statement that α_true obeys the posterior distribution p(α | H,D2). Thus an exact constraint is replaced by a fuzzy constraint.

Adopting the simplest approach, we define F̂, a Bayesian analogue of F, by taking χ²_c to be the value at the posterior mean of α,

⟨α⟩_π = ∫ α p(α | H,D2) dα, (15)

where π(α) is the informative prior. Considerations of accuracy when values of χ² are computed on a grid suggest we take χ²_m to be the value at

⟨α⟩_u = ∫ α p_u(α | H,D2) dα, (16)

the posterior mean for a uniform prior.

With χ²_c and χ²_m thus defined, the Bayesian F is

F̂ = [(χ²(⟨α⟩_π) − χ²(⟨α⟩_u))/j] / [χ²(⟨α⟩_u)/(n − i)], (17)

and its value is to be compared with the chosen threshold F^β_{ν1,ν2} when testing the consistency of X1 and X2. Since ⟨α⟩_u is independent of π(α), the statistic F̂ measures the effect of π(α) in displacing ⟨α⟩_π from ⟨α⟩_u.

In this Bayesian version of the F-test, the null hypothesis states that the posterior mean ⟨α⟩_π = α_true. This will be approximately true when a flawless Bayesian model is applied to a flawless data set D. However, if the chosen threshold on F̂ is exceeded, then one or more of π, H and D is suspect, as discussed in Sect. 3.1. If the threshold is not exceeded, then Bayesian inferences drawn from the posterior distribution p(α | H,D2) are supported.
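Under the toy model of Sect. 5, F̂ of Eqs. (15)-(17) can be sketched on a grid; the bias b = 1 injected into X1 anticipates the test of Sect. 6.4 and, like the grid limits, is an illustrative assumption:

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(5)
sigma, n1, n2, b = 1.0, 100, 100, 1.0
D1 = b + sigma * rng.standard_normal(n1)      # X1 flawed by bias b
D2 = sigma * rng.standard_normal(n2)          # X2 flawless, mu_true = 0

mu = np.linspace(-1.0, 2.0, 6001)
chi2_1 = np.array([np.sum((D1 - m)**2) for m in mu]) / sigma**2
chi2_2 = np.array([np.sum((D2 - m)**2) for m in mu]) / sigma**2

prior = np.exp(-0.5 * (chi2_1 - chi2_1.min()))     # informative prior from X1
like2 = np.exp(-0.5 * (chi2_2 - chi2_2.min()))

post_pi = prior * like2
post_pi /= post_pi.sum()
post_u = like2 / like2.sum()

mu_pi = np.sum(mu * post_pi)      # Eq. (15): mean with informative prior
mu_u = np.sum(mu * post_u)        # Eq. (16): mean with uniform prior
chi2_at = lambda m: np.sum((D2 - m)**2) / sigma**2

nu1, nu2 = 1, n2 - 1
F_hat = ((chi2_at(mu_pi) - chi2_at(mu_u)) / nu1) / (chi2_at(mu_u) / nu2)  # Eq. (17)
print(F_hat, f_dist.sf(F_hat, nu1, nu2))
```

With b/σ = 1, the prior drags ⟨μ⟩_π well away from ⟨μ⟩_u, and F̂ far exceeds typical F_{1,99} values.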

6.3. Test of p-values from F̂

If Eq. (17) gives F̂ = F̂*, the corresponding p-value is

p = Prob(F_{ν1,ν2} > F̂*), (18)

where ν1 = j and ν2 = n − i. The accuracy of these p-values can be tested with the 1D toy model of Sect. 5.1 as follows:

(i): Independent data sets D1,D2 are created corresponding to X1,X2.

  (ii): With π from X1, the quantities ⟨μ⟩_π and F̂ are calculated for X2 with Eqs. (15)–(17).

  (iii): M independent data sets are now created with μ_true = ⟨μ⟩_π.

  (iv): For each new data set, the value of χ²(⟨μ⟩_π) is calculated for the ⟨μ⟩_π derived at step (ii).

  (v): For each new data set, the value of χ²(⟨μ⟩_u) is calculated with ⟨μ⟩_u from Eq. (16).

  (vi): With these new χ² values, the statistic F̂_m is obtained from Eq. (17).

The resulting M values of F̂_m give us an empirical estimate of p, namely f, the fraction of the F̂_m that exceed F̂*. In one example of this test, a data pair (D1,D2) gives ⟨μ⟩_π = 0.089 and F̂* = 1.0021. With ν1 = 1 and ν2 = 99, Eq. (18) then gives p = 0.3192. This is checked by creating M = 10⁵ data sets with μ_true = 0.089. The resulting empirical estimate is f = 0.3189, in close agreement with p.

From 100 independent pairs (D1,D2), the mean value of |p − f| is 0.001. This simulation confirms the accuracy of p-values derived from Eq. (18) and therefore of decision thresholds F^β_{ν1,ν2}.
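Steps (iii)-(vi) vectorise readily. The sketch below reuses the example values ⟨μ⟩_π = 0.089 and F̂* = 1.0021 quoted above, with M = 2×10⁴ replications rather than 10⁵ to keep the run short:

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(6)
n, sigma, M = 100, 1.0, 20_000
mu_star, F_star = 0.089, 1.0021        # <mu>_pi and F-hat from steps (i)-(ii)

# Step (iii): M replicate data sets with mu_true = <mu>_pi
U = mu_star + sigma * rng.standard_normal((M, n))
# Steps (iv)-(v): constrained and minimum chi2 for each replicate
chi2_c = np.sum((U - mu_star)**2, axis=1) / sigma**2
ubar = U.mean(axis=1)
chi2_m = np.sum((U - ubar[:, None])**2, axis=1) / sigma**2
# Step (vi): Eq. (17) with j = i = 1
F_m = (chi2_c - chi2_m) / (chi2_m / (n - 1))

f_emp = np.mean(F_m > F_star)          # empirical estimate f of the p-value
p = f_dist.sf(F_star, 1, n - 1)        # Eq. (18)
print(round(p, 4), round(f_emp, 4))
```

The empirical fraction f agrees with the F_{1,99} tail probability to within sampling noise.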

6.4. Bias test

To investigate this Bayesian F-test’s ability to detect inconsistency between X1 and X2, the bias test of Sect. 5.2 is repeated, again with n1 = n2 = 100. The vector α of Sect. 6.1 now becomes the scalar μ, and i = 1.

In Fig. 3, the values of F̂ from Eq. (17) with j = i = 1 are plotted against the bias parameter b/σ. Also plotted are critical values derived from the distribution F_{ν1,ν2} with ν1 = 1 and ν2 = 99. In contrast to Fig. 2 for the χ²_π statistic, inconsistency between X1 and X2 is now detected down to b/σ ≈ 0.5.

In this test of inconsistency, the flaw in X1 is the bias b. Now, if it were known for certain that this was the flaw, then a Bayesian analysis with H1 changed from u = μ to u = μ + b – i.e., with an extra parameter – is straightforward. The result is the posterior density for b, allowing for correction. In contrast, the detection of a flaw with χ²_π or F̂ is cause-independent. Although Figs. 2 and 3 have b/σ as the abscissa, for real experiments this quantity is not known and conclusions are drawn just from the ordinate.

Fig. 3. Detection of inconsistency between X1 and X2. Values of F̂ are plotted against the bias parameter b/σ. The dashed lines are the 0.1, 5 and 50% levels. Note that the horizontal scale differs from that of Fig. 2.

7. Tension between experiments

In the above, the goodness-of-fit of ℳ to D is investigated taking into account the possibility that a poor fit might be due to the prior derived from a previous experiment. A related goodness-of-fit issue commonly arises in modern astrophysics, particularly cosmology, namely whether estimates of non-identical but partially overlapping parameter vectors from different experiments are mutually consistent. The term commonly used in this regard is tension, with investigators often reporting their subjective assessments – e.g., there is marginal tension between X1 and X2 – based on the two credibility domains (often multi-dimensional) for the parameters in common.

In a recent paper, Seehars et al. (2015) review the attempts in the cosmological literature to quantify the concept of tension, with emphasis on CMB experiments. Below, we develop a rather different approach based on the statistic F̂ defined in Sect. 6.2.

Since detecting and resolving conflicts between experiments is essential for scientific progress, it is desirable to quantify the term tension and to optimize its detection. The conjecture here is that this optimum is achieved when inferences from X1 are imposed on the analysis of X2.

7.1. Identical parameter vectors

A special case of assessing tension between experiments is that where the parameter vectors are identical. But this is just the problem investigated in Sects. 5 and 6.

When X1 (with bias) and X2 (without bias) are separately analysed, the limited overlap of the two credibility intervals for μ provides a qualitative indication of tension. However, if X2 is analysed with a prior derived from X1, then the statistic χ²_π – see Fig. 2 – or, more powerfully, the statistic F̂ – see Fig. 3 – provides a quantitative measure to inform statements about the degree of tension.

7.2. Non-identical parameter vectors-I

We now suppose that D1,D2 are data sets from different experiments X1,X2 designed to test the hypotheses H1,H2. However, though different, these hypotheses have parameters in common. Specifically, the parameter vectors of H1 and H2 are ξ = (α,β) and η = (β,γ), respectively, and k,l,m are the numbers of elements in α,β,γ, respectively.

If X1 and X2 are analysed independently, we may well find that ℳ1 and ℳ2 provide satisfactory fits to D1 and D2 and yet still be obliged to report tension between the experiments because of a perceived inadequate overlap of the two l-dimensional credibility domains for β.

A quantitative measure of tension between X1 and X2 can be derived via the priors, as follows: the analysis of X1 gives p(ξ | H1,D1), the posterior distribution of ξ, from which we may derive the posterior distribution of β by integrating over α. Thus

p(β | H1,D1) = ∫ p(ξ | H1,D1) dα. (19)

Now, for a Bayesian analysis of X2, we must specify π(η) throughout η-space, not just β-space. But

π(η) = π(γ | β) π(β). (20)

Accordingly, what we infer from X1 can be imposed on the analysis of X2 by writing

π(η) = π(γ | β) p(β | H1,D1). (21)

The conditional prior π(γ | β) must now be specified. This can be taken to be uniform unless we have prior knowledge from other sources – i.e., not from D2.

With π(η) specified, the analysis of X2 gives the posterior density p(η | H2,D2). As in Sect. 6.2, we now regard this as a fuzzy constraint on η from which we compute the sharp constraint ⟨η⟩_π given by Eq. (15). Now, in general, ⟨η⟩_π will be displaced from ⟨η⟩_u given by Eq. (16). The question then is: is the increment in χ²(η | H2,D2) between ⟨η⟩_π and ⟨η⟩_u so large that we must acknowledge tension between X1 and X2?

Following Sect. 6, we answer this question by computing F̂ from Eq. (17) with i = j = l + m, the total number of parameters in η. The result is then compared to selected critical values from the F_{ν1,ν2} distribution, where ν1 = l + m and ν2 = n2 − l − m. With standard assumptions, F̂ obeys this distribution if ⟨η⟩_π = η_true – i.e., if ⟨β⟩_π = β_true and ⟨γ⟩_π = γ_true – see Sect. 6.3.

7.3. A toy model

A simple example with one parameter (μ) for X1 and two (μ,κ) for X2 is as follows: H1 states that u = μ and H2 states that v = μ + κx. The data set D1 comprises n1 measurements u_i = μ + σz_i + b, where b is the bias. The data set D2 comprises n2 measurements v_j = μ + κx_j + σz_j, where the x_j are uniform in (−1, +1). The parameters are μ = 0, κ = 1, σ = 1 and n1 = n2 = 100. In the notation of Sect. 7.2, the vectors β,γ contract to the scalars μ,κ, whence l = m = 1, and α does not appear, whence k = 0.

7.4. Bias test

In the above, H1 and H2 are different hypotheses but have the parameter μ in common. If b = 0, the analyses of X1 and X2 should give similar credibility intervals for μ and therefore no tension. But with sufficiently large b, tension should arise.

This is investigated following Sect. 7.2. Applying ℳ1 to D1, we derive p(μ | H1,D1). Then, taking the conditional prior π(κ | μ) to be constant, we obtain

π(μ,κ) ∝ p(μ | H1,D1) (22)

as the prior for the analysis of X2. This gives us the posterior distribution p(μ,κ | H2,D2), which is a fuzzy constraint in (μ,κ)-space. Replacing this by the sharp constraint (⟨μ⟩, ⟨κ⟩), the constrained χ² is

χ²_c = χ²(⟨μ⟩, ⟨κ⟩ | H2,D2). (23)

Substitution in Eq. (17) with j = i = 2 then gives F̂ as a measure of the tension between X1 and X2. Under standard assumptions, F̂ is distributed as F_{ν1,ν2} with ν1 = 2, ν2 = n2 − 2 if (⟨μ⟩, ⟨κ⟩) = (μ,κ)_true.
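A grid-based sketch of this two-parameter tension test follows; the grid ranges and the bias b = 1 are illustrative assumptions:

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(7)
sigma, n1, n2, b = 1.0, 100, 100, 1.0
u = b + sigma * rng.standard_normal(n1)          # D1: biased X1 (mu_true = 0)
x = rng.uniform(-1.0, 1.0, n2)
v = 1.0 * x + sigma * rng.standard_normal(n2)    # D2: (mu, kappa)_true = (0, 1)

mu = np.linspace(-1.0, 2.0, 201)
ka = np.linspace(0.0, 2.0, 201)
MU, KA = np.meshgrid(mu, ka, indexing="ij")

# Eq. (22): prior on (mu, kappa) from X1, uniform in kappa
chi2_1 = np.array([np.sum((u - m)**2) for m in mu]) / sigma**2
prior = np.exp(-0.5 * (chi2_1 - chi2_1.min()))[:, None] * np.ones_like(KA)

# chi2 surface of X2 and the two posteriors
resid = v[None, None, :] - MU[..., None] - KA[..., None] * x[None, None, :]
chi2_2 = np.sum(resid**2, axis=-1) / sigma**2
like2 = np.exp(-0.5 * (chi2_2 - chi2_2.min()))
post_pi = prior * like2
post_pi /= post_pi.sum()
post_u = like2 / like2.sum()

chi2_at = lambda m, k: np.sum((v - m - k * x)**2) / sigma**2
mu_pi, ka_pi = np.sum(post_pi * MU), np.sum(post_pi * KA)   # sharp constraint, Eq. (15)
mu_u, ka_u = np.sum(post_u * MU), np.sum(post_u * KA)       # Eq. (16)

nu1, nu2 = 2, n2 - 2
F_hat = ((chi2_at(mu_pi, ka_pi) - chi2_at(mu_u, ka_u)) / nu1) / (chi2_at(mu_u, ka_u) / nu2)
print(F_hat, f_dist.sf(F_hat, nu1, nu2))      # Eqs. (17) and (23)
```

With b/σ = 1, the prior displaces ⟨μ⟩_π and the resulting F̂ signals tension well beyond the thresholds of Fig. 4.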

In Fig. 4, the values of F̂ are plotted against b/σ together with critical values for F_{ν1,ν2} with ν1 = 2, ν2 = 98. This plot shows that tension is detected for b/σ ≳ 0.6. This is slightly inferior to Fig. 3, as is to be expected because of the more complicated X2.

Fig. 4. Detection of tension between different experiments. Values of F̂ are plotted against the bias parameter b/σ. The dashed lines are the 0.1, 5 and 50% levels.

The importance of Fig. 4 is in showing that the statistic F̂ has the desired characteristic of reliably informing the investigator of tension between different experiments with partially overlapping parameter vectors. When the inconsistency is slight (b/σ ≲ 0.2), this statistic does not sound a false alarm. When the inconsistency is substantial (b/σ ≳ 0.6), the statistic does not fail to sound the alarm.

7.5. Non-identical parameter vectors-II

If F̂ calculated as in Sect. 7.2 indicates tension, the possible flaws include the conditional prior π(γ | β). Thus, tension could be indicated even if the prior π(β) inferred from X1 is accurate and consistent with D2.

Accordingly, we might prefer to focus just on β – i.e., on the parameters in common. If so, we compute

⟨β⟩_π = ∫ β p(η | H2,D2) dη (24)

and then find the minimum of χ²(η | H2,D2) when β = ⟨β⟩_π. Thus, the constrained χ² is now

χ²_c = min_γ χ²(⟨β⟩_π, γ | H2,D2). (25)

The F test also applies in this case – see Sect. 6.1. Thus this value is substituted in Eq. (17), where we now take j = l, the number of parameters in β. The resulting F̂ is then compared to critical values derived from F_{ν1,ν2} with ν1 = l, ν2 = n2 − l − m. With the standard assumptions, F̂ obeys this distribution if ⟨β⟩_π = β_true.

For the simple model of Sect. 7.3, the resulting plot of F̂ against b/σ is closely similar to Fig. 4 and so is omitted. Nevertheless, the option of constraining a subset of the parameters is likely to be a powerful diagnostic tool for complex, multi-dimensional problems, identifying the parameters contributing most to the tension revealed when the entire vector is constrained (cf. Seehars et al. 2015).

8. Discussion and conclusion

A legitimate question to ask about the statistical analysis of data acquired in a scientific experiment is: How well or how badly does the model fit the data? Asking this question is, after all, just the last step in applying the scientific method. In a frequentist analysis, where the estimated parameter vector α₀ is typically the minimum-χ² point, this question is answered by reporting the goodness-of-fit statistic χ²_min or, equivalently, the corresponding p-value. If a poor fit is thereby indicated, the investigator and the community are aware that not only is the model called into question but so also is the estimate α₀ and its confidence domain.

If the same data is subject to a Bayesian analysis, the same question must surely be asked: the Bayesian approach does not exempt the investigator from the obligation to report on the success or otherwise of the adopted model. In this case, if the fit is poor, the Bayesian model is called into question and so also are inferences drawn from the posterior distribution p(α | H,D).

As noted in Sect. 1, the difficulty in testing Bayesian models is that there are no good fully Bayesian methods for assessing goodness-of-fit. Accordingly, in this paper, a frequentist approach is advocated. Specifically, χ²_π is proposed in Sect. 3 as a suitable goodness-of-fit statistic for Bayesian models. Under the null hypothesis that ℳ and D are flawless, and with the standard assumptions of linearity and normally-distributed errors, then, as argued in Sect. 4.1 and illustrated in Fig. 1, χ²_π − k is approximately distributed as χ² with ν = n − k degrees of freedom, and so p-values can be computed. A p-value thus derived from χ²_π quantifies the average goodness-of-fit provided by the posterior distribution. In contrast, in a frequentist minimum-χ² analysis, the p-value quantifies the goodness-of-fit provided by the point estimate α₀.

In the above, it is regarded as self-evident that astronomers want to adhere to the scientific method by always reporting the goodness-of-fit achieved in Bayesian analyses of observational data. However, Gelman & Shalizi (2013), in an essay on the philosophy and practice of Bayesian statistics, note that investigators who identify Bayesian inference with inductive inference regularly fit and compare models without checking them. They deplore this practice. Instead, these authors advocate the hypothetico-deductive approach in which model checking is crucial. As in this paper, they discuss non-Bayesian checking of Bayesian models – specifically, the derivation of p-values from posterior predictive distributions. Moreover, they also stress that the prior distribution is a testable part of a Bayesian model.

In the astronomical literature, the use of frequentist tests to validate Bayesian models is not unique to this paper. Recently, Harrison et al. (2015) have presented an ingenious procedure for validating multidimensional posterior distributions with the frequentist Kolmogorov-Smirnov (KS) test for one-dimensional data. Their aim, as here, is to test the entire Bayesian inference procedure.
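Their multidimensional construction is beyond a short sketch, but the calibration idea underlying such tests can be illustrated in one dimension: if the entire inference procedure is correct, the posterior CDF evaluated at the true parameter value is uniformly distributed over repeated experiments, and a KS test can check this. The toy conjugate-Gaussian setup below is purely illustrative and is not the procedure of Harrison et al.:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, n_trials = 20, 500
sigma, tau = 1.0, 1.0                        # noise std, prior std

u = np.empty(n_trials)
for i in range(n_trials):
    mu_true = rng.normal(0.0, tau)           # draw the truth from the prior
    data = rng.normal(mu_true, sigma, n)
    # Conjugate posterior for the mean: N(post_mean, post_var).
    post_var = 1.0 / (1.0 / tau**2 + n / sigma**2)
    post_mean = post_var * (data.sum() / sigma**2)
    u[i] = stats.norm.cdf(mu_true, post_mean, np.sqrt(post_var))

# If the inference is correct, u is uniform on [0, 1]; KS quantifies this.
ks = stats.kstest(u, "uniform")
```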

Frequentist testing also arises in recent applications of the Kullback-Leibler divergence to quantify tension between cosmological probes (e.g. Seehars et al. 2015). For linear models, and with the assumption of Gaussian priors and likelihoods, a term in the relative entropy is a statistic that measures tension. With these assumptions, the statistic follows a generalized χ2 distribution, thus allowing a p-value to be computed.
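For reference, the relative entropy between two multivariate Gaussians has a closed form; in the sketch below (the function name and example numbers are illustrative, not from Seehars et al.), the quadratic term in the mean difference is the part that responds to a shift between two probes:

```python
import numpy as np

def kl_gaussian(mu1, cov1, mu2, cov2):
    """Relative entropy D(N1 || N2) between two multivariate Gaussians."""
    cov2_inv = np.linalg.inv(cov2)
    d = np.asarray(mu2) - np.asarray(mu1)
    k = len(d)
    return 0.5 * (np.trace(cov2_inv @ cov1)   # dispersion mismatch
                  + d @ cov2_inv @ d          # mean shift: the tension term
                  - k
                  + np.log(np.linalg.det(cov2) / np.linalg.det(cov1)))

# Illustration: identical covariances, means shifted by one sigma.
cov = np.eye(2)
dkl = kl_gaussian([0.0, 0.0], cov, [1.0, 0.0], cov)   # -> 0.5
```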

Seehars et al. (2015) also investigate various purely Bayesian measures of tension. They conclude that interpreting these measures on a fixed, problem-independent scale, e.g. the Jeffreys scale, can be misleading (see also Nesseris & García-Bellido 2013).


Acknowledgements. I thank D. J. Mortlock for comments on an early draft of this paper, and A. H. Jaffe, M. P. Hobson, and the referee for useful references.


References

  1. Bridges, M., Feroz, F., Hobson, M. P., & Lasenby, A. N. 2009, MNRAS, 400, 1075
  2. Feldman, G. J., & Cousins, R. D. 1998, Phys. Rev. D, 57, 3873
  3. Gelman, A., & Shalizi, C. R. 2013, British J. Math. Stat. Psychol., 66, 8
  4. Hamilton, W. C. 1964, Statistics in Physical Science (New York: Ronald Press)
  5. Harrison, D., Sutton, D., Carvalho, P., & Hobson, M. 2015, MNRAS, 451, 2610
  6. Lucy, L. B. 2014, A&A, 571, A86 (L14)
  7. Nesseris, S., & García-Bellido, J. 2013, J. Cosmol. Astropart. Phys., 08, 036
  8. Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. 2007, Numerical Recipes (Cambridge: Cambridge University Press)
  9. Seehars, S., Grandis, S., Amara, A., & Refregier, A. 2015, arXiv e-prints [arXiv:1510.08483]
  10. Sellke, T., Bayarri, M. J., & Berger, J. 2001, The American Statistician, 55, 62
  11. Sivia, D. S., & Skilling, J. 2006, Data Analysis: A Bayesian Tutorial, 2nd edn. (Oxford: Oxford University Press)

Appendix A: Evaluation of χ2u

If α0 denotes the minimum-χ2 solution, then

χ2(α) = χ2_min + Δχ2(a),   (A.1)

where a = α − α0 is the displacement from α0. Then, on the assumption of linearity,

Δχ2(a) = Σ_ij A_ij a_i a_j,   (A.2)

where the A_ij are the constant elements of the k × k curvature matrix (Press et al. 2007, p. 680), where k is the number of parameters. It follows that surfaces of constant χ2 are k-dimensional self-similar ellipsoids centered on α0.

Now, given the second assumption of normally-distributed measurement errors, the likelihood is

L(α) ∝ exp(−χ2(α)/2).   (A.3)

Thus, in the case of a uniform prior, the posterior mean of χ2 is

χ2u = ∫ χ2(α) exp(−χ2/2) dα / ∫ exp(−χ2/2) dα.   (A.4)

Because surfaces of constant Δχ2 are self-similar, the k-dimensional integrals in Eq. (A.4) reduce to 1D integrals.

Suppose Δχ2 = 1 on the surface of the ellipsoid with volume V. If lengths are increased by the factor λ, then the new ellipsoid has

Δχ2 = λ^2 and volume λ^k V.   (A.5)

With these scaling relations, the integrals in Eq. (A.4) can be transformed into integrals over λ. The result is

⟨Δχ2⟩ = ∫_0^∞ λ^(k+1) e^(−λ^2/2) dλ / ∫_0^∞ λ^(k−1) e^(−λ^2/2) dλ,   (A.6)

where ⟨Δχ2⟩ denotes the posterior mean of Δχ2. The integrals have now been transformed to a known definite integral,

∫_0^∞ λ^z e^(−aλ^2) dλ = Γ(x) / (2 a^x),   (A.7)

where x = (z + 1) / 2. Applying this formula, we obtain

χ2u = χ2_min + k,   (A.8)

an exact result under the stated assumptions.
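The definite integral (A.7) and the resulting posterior mean of Δχ2 are easy to check numerically; the sketch below (illustrative code, not from the paper) compares (A.7) against direct quadrature and confirms that the ratio of the two 1D integrals equals k:

```python
import numpy as np
from scipy import integrate, special

# Left-hand side of (A.7): direct numerical quadrature.
def lhs(z, a):
    val, _ = integrate.quad(lambda lam: lam**z * np.exp(-a * lam**2), 0.0, np.inf)
    return val

# Right-hand side of (A.7): Gamma(x) / (2 a^x) with x = (z + 1) / 2.
def rhs(z, a):
    x = (z + 1.0) / 2.0
    return special.gamma(x) / (2.0 * a**x)

# Posterior mean of Delta chi^2 (here a = 1/2): ratio of the 1D integrals.
k = 4
mean_dchi2 = lhs(k + 1, 0.5) / lhs(k - 1, 0.5)   # equals k to quadrature accuracy
```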

All Figures

Fig. 1. Empirical PDF of χ2π derived from the analysis of 10^6 data pairs D1, D2 as described in Sect. 5.1. The dashed curve is the theoretical PDF for χ2u.
Fig. 2. Detection of bias in X1 with the χ2π statistic when X2 is analysed with a prior derived from X1. Values of χ2π (filled circles) and χ2u (open circles) are plotted against the bias parameter b/σ. The dashed lines are the 5% and 95% levels.
Fig. 3. Detection of inconsistency between X1 and X2. Values are plotted against the bias parameter b/σ. The dashed lines are the 0.1, 5, and 50% levels. Note that the horizontal scale differs from that of Fig. 2.
Fig. 4. Detection of tension between different experiments. Values are plotted against the bias parameter b/σ. The dashed lines are the 0.1, 5, and 50% levels.
