Frequentist tests for Bayesian models
Astrophysics Group, Blackett Laboratory, Imperial College London,
Prince Consort
Road,
London
SW7 2AZ,
UK
email:
l.lucy@imperial.ac.uk
Received: 6 November 2015
Accepted: 1 February 2016
Analogues of the frequentist chisquare and F tests are proposed for testing goodnessoffit and consistency for Bayesian models. Simple examples exhibit these tests’ detection of inconsistency between consecutive experiments with identical parameters, when the first experiment provides the prior for the second. In a related analysis, a quantitative measure is derived for judging the degree of tension between two different experiments with partially overlapping parameter vectors.
Key words: methods: data analysis / methods: statistical
© ESO, 2016
1. Introduction
Bayesian statistical methods are now widely applied in astronomy. Of the new techniques thus introduced, model selection (or comparison) is especially notable in that it replaces the frequentist acceptancerejection paradigm for testing hypotheses. Thus, given a data set D, there might be several hypotheses { H_{k} } that have the potential to explain D. From these, Bayesian model selection identifies the particular H_{k} that best explains D. The procedure is simple, though computationally demanding: starting with an arbitrary pair of the { H_{k} }, we apply the model selection machinery, discard the weaker hypothesis, replace it with one of the remaining H_{k}, and repeat.
This procedure usefully disposes of the weakest hypotheses, but there is no guarantee that the surviving H_{k} explains D. If the correct hypothesis is not included in the { H_{k} }, we are left with the “best” hypothesis but are not made aware that the search for an explanation should continue. In the context of model selection, the next step would be a comparison of this “best” H_{k} with the hypothesis that H_{k} is false. But model selection fails at this point because we cannot compute the required likelihood (Sivia & Skilling 2006, p. 84). Clearly, what is needed is a goodnessoffit criterion for Bayesian models.
This issue is discussed by Press et al. (2007, p. 779). They note that “There are no good fully Bayesian methods for assessing goodnessoffit ...” and go on to report that “Sensible Bayesians usually fall back to pvalue tail statistics ... when they really need to know if a model is wrong”.
On the assumption that astronomers do really need to know if their models are wrong, this paper adopts a frequentist approach to testing Bayesian models. Although this approach may be immediately abhorrent to committed Bayesians, the role of the tests proposed herein is merely to provide a quantitative measure according to which Bayesians decide whether their models are satisfactory. When they are, the Bayesian inferences are presented – and with increased confidence. When not, flaws in the models or the data must be sought, with the aim of eventually achieving satisfactory Bayesian inferences.
2. Bayesian models
The term Bayesian model – subsequently denoted by ℳ – must now be defined. The natural definition of ℳ is that which must be specified in order that Bayesian inferences can be drawn from D – i.e., in order to compute posterior probabilities. This definition implies that, in addition to the hypothesis H, which introduces the parameter vector α, the prior probability distribution π(α) must also be included in ℳ. Thus, symbolically, we write (1)It follows that different Bayesian models can share a common H. For example, H may be the hypothesis that D is due to Keplerian motion. But for the motion of a star about the Galaxy’s central black hole, the appropriate π will differ from that for the reflex orbit of star due to a planetary companion.
A further consequence is that a failure of ℳ to explain D is not necessarily due to H: an inappropriate π is also a possibility.
To astronomers accustomed to working only with uniform priors, the notion that a Bayesian model’s poor fit to D could be due to π might be surprising. A specific circumstance where π could be at fault arises when Bayesian methods are used to repeatedly update our knowledge of some phenomenon – e.g., the value of a fundamental constant that over the years is the subject of numerous experiments (X_{1},...,X_{i},...), usually of increasing precision. With an orthodox Bayesian approach, X_{i + 1} is analysed with the prior set equal to the posterior from X_{i}. Thus (2)This is the classical use of Bayes theorem to update our opinion by incorporating past experience. However, if X_{i} is flawed – e.g., due to an unrecognized systematic error – then this choice of π impacts negatively on the analysis of X_{i + 1}, leading perhaps to a poor fit to D_{i + 1}
Now, since subsequent flawless experiments result in the decay of the negative impact of X_{i}, this recursive procedure is selfcorrecting, so a Bayesian might argue that the problem can be ignored. But scientists feel obliged to resolve such discrepancies before publishing or undertaking further experiments and therefore need a statistic that quantifies any failure of ℳ to explain D.
3. A goodnessoffit statistic for Bayesian models
The most widely used goodnessoffit statistic in the frequentist approach to hypothesis testing is χ^{2}, whose value is determined by the residuals between the fitted model and the data, with no input from prior knowledge. Thus, (3)is the goodnesoffit statistic for the minimumχ^{2} solution α_{0}.
A simple analogue of for Bayesian models is the posterior mean of χ^{2}(α), (4)where the posterior distribution (5)Here ℒ(α  H,D) is the likelihood of α given data D.
Note that since ⟨ χ^{2} ⟩ _{π} depends on both constituents of ℳ, namely H and π, it has the potential to detect a poor fit due to either or both being at fault, as required by Sect. 2.
In Eq. (4) the subscript π is attached to ⟨ χ^{2} ⟩ to stress that a nontrivial, informative prior is included in ℳ. On the other hand, when a uniform prior is assumed, ⟨ χ^{2} ⟩ is independent of the prior and is then denoted by ⟨ χ^{2} ⟩ _{u}.
The Bayesian goodnessoffit statistic ⟨ χ^{2} ⟩ _{u} is used in Lucy (2014, L14) to illustrate the detection of a weak second orbit in simulations of Gaia scans of an astrometric binary. In that case, H states that the scan residuals are due to the reflex Keplerian orbit caused by one invisible companion. With increasing amplitude of the second orbit, the point comes when the investigator will surely abandon H – i.e., abandon the assumption of just one companion – see Fig. 12, L14.
3.1. pvalues
With the classical frequentist acceptancerejection paradigm, a null hypothesis H_{0} is rejected on the basis of a pvalue tail statistic. Thus, with the χ^{2} test, H_{0} is rejected if , where , and accepted otherwise. Here ν = n − k is the number of degrees of freedom, where n is the number of measurements and k is the number of parameters introduced by H_{0}, and β is the designated pthreshold chosen by the investigator.
For a Bayesian model, a pvalue can be computed from the ⟨ χ^{2} ⟩ _{π} statistic, whose approximate distribution is derived below in Sect. 5.1. However, a sharp transition from acceptance to rejection of the null hypothesis at some designated pvalue is not recommended. First, pvalues overstate the strength of the evidence against H_{0} (e.g., Sellke et al. 2001). In particular, the value p = 0.05 recommended in elementary texts does not imply strong evidence againts H_{0}. Second, the pvalue is best regarded (Sivia & Skilling 2006, p. 85) as serving a qualitative purpose, with a small value prompting us to think about alternative hypotheses. Thus, if ⟨ χ^{2} ⟩ _{π} exceeds the chosen threshold , this is a warning that something is amiss and should be investigated, with the degree of concern increasing as β decreases. If the β = 0.001 threshold is exceeded, then the investigator would be welladvised to suspect that ℳ or D is at fault.
Although statistics texts emphasize tests of H not D, astronomers know that D can be corrupted by biases or calibration errors. Departures from normallydistributed errors can also increase ⟨ χ^{2} ⟩ _{π}.
If D is not at fault, then ℳ is the culprit, implying that either π or H is at fault. If the fault lies with π not H, then we expect that even though .
If neither D nor π can be faulted, then the investigator must seek a refined or alternative H.
3.2. Type I and type II errors
In the frequentist approach to hypothesis testing, decision errors are said to be of type I if H is true but the test says reject H, and of type II if H is false but the test says accept H.
Since testing a Bayesian model is not concerned exclusively with H, these definitions must be revised, as follows:

A type I error arises when ℳ and D are flawless but the statistic (e.g., ⟨ χ^{2} ⟩ _{π}) exceeds the designated threshold.

A type II error arise when ℳ or D are flawed but the statistic does not exceed the designated threshold.
Here the words accept and reject are avoided. Moreover, no particular threshold is mandatory: it is at the discretion of the investigator and is chosen with regard to the consequences of making a decision error.
4. Statistics of ⟨ χ^{2} ⟩
The intuitive understanding that scientists have regarding derives from its simplicity and the rigorous theorems on its distribution that allow us to derive confidence regions for multidimensional linear models (e.g., Press et al. 2007, Sect. 15.6.4).
Rigorous statistics for χ^{2} require two assumptions: 1) that the model fitted to D is linear in its parameters; and 2) that measurement errors obey the normal distribution. Nevertheless, even when these standard assumptions do not strictly hold, scientists still commonly rely on to gauge goodnessoffit, with perhaps Monte Carlo (MC) sampling to provide justification or calibration (e.g., Press et al. 2007, Sect. 15.6.1).
Rigorous results for the statistic ⟨ χ^{2} ⟩ are therefore of interest. In fact, if we add the assumption of a uniform prior to the above standard assumptions, then we may prove (Appendix A) that (6)where k is the number of parameters.
Given that Eq. (6) is exact under the stated assumptions, it follows that the quantity ⟨ χ^{2} ⟩ _{u} − k is distributed as with ν = n − k degrees of freedom.
For minimumχ^{2} fitting of linear models, the solution is always a better fit to D than is the true solution. In particular, , whereas . Accordingly, the + k term in Eq. (6) “corrects” for overfitting and so – i.e., the expected value of χ^{2} for the actual measurement errors.
4.1. Effect of an informative prior
When an informative prior is included in ℳ, an analytic formula for ⟨ χ^{2} ⟩ _{π} is in general not available. However, its approximate statistical properties are readily found.
Consider again a linear model with normallydistributed errors and suppose further that the experiment (X_{2}) is without flaws. The χ^{2} surfaces are then selfsimilar kdimensional ellipsoids with minimum at α_{0} ≈ α_{true}, the unknown exact solution. Let us now further suppose that the informative prior π(α) derives from a previous flawless experiment (X_{1}). The prior π will then be a convex (bellshaped) function with maximum at α_{max} ≈ α_{0}. Now, consider a 1D family of such functions all centred on α_{0} and ranging from the very broad to the very narrow. For the former ⟨ χ^{2} ⟩ _{π} ≈ ⟨ χ^{2} ⟩ _{u}; for the latter . Thus, ideally, when a Bayesian model ℳ is applied to data D we expect that (7)Now, a uniform prior and δ(α − α_{0}) are the limits of the above family of bellshaped functions. Since the delta function limit is not likely to be closely approached in practice, a first approximation to the distribution of ⟨ χ^{2} ⟩ _{π} is that of ⟨ χ^{2} ⟩ _{u} – i.e., that of .
The above discussion assumes faultless X_{1} and X_{2}. But now suppose that there is an inconsistency between π and X_{2}. The peak of π at α_{max} will then in general be offset from the minimum of χ^{2} at α_{0}. Accordingly, in the calculation of ⟨ χ^{2} ⟩ _{π} from Eq. (4), the neighbourhood of the χ^{2} minimum at α_{0} has reduced weight relative to χ^{2}(α_{max}) at the peak of π. Evidently, in this circumstance, ⟨ χ^{2} ⟩ _{π} can greatly exceed ⟨ χ^{2} ⟩ _{u}, and the investigator is then alerted to the inconsistency.
5. Numerical experiments
Given that rigorous results ⟨ χ^{2} ⟩ are not available for informative priors, numerical tests are essential to illustrate the discussion of Sect. 4.1.
5.1. A toy model
A simple example with just one parameter μ is as follows: H states that u = μ, and D comprises n measurements u_{i} = μ + σz_{i}, where the z_{i} here and below are independent gaussian variates randomly sampling . In creating synthetic data, we set μ = 0,σ = 1 and n = 100.
In the first numerical experiment, two independent data sets D_{1} and D_{2} are created comprising n_{1} and n_{2} measurements, respectively. On the assumption of a uniform prior, the posterior density of μ derived from D_{1} is (8)We now regard p(μ  H,D_{1}) as prior knowledge to be taken into account in analysing D_{2}. Thus (9)so that the posterior distribution derived from D_{2} is (10)The statistic quantifying the goodnessoffit achieved when ℳ = { π,H } is applied to data D_{2} is then (11)From N independent data pairs (D_{1},D_{2}), we obtain N independent values of ⟨ χ^{2} ⟩ _{π} thus allowing us to test the expectation (Sect. 4.1) that this statistic is approximately distributed as ⟨ χ^{2} ⟩ _{u}. In Fig. 1, this empirical PDF is plotted together the theoretical PDF for ⟨ χ^{2} ⟩ _{u}.
Fig. 1 Empirical PDF of ⟨ χ^{2} ⟩ _{π} derived from the analysis of 10^{6} data pairs D_{1},D_{2} as described in Sect. 5.1. The dashed curve is the theoretical PDF for ⟨ χ^{2} ⟩ _{u}. 

Open with DEXTER 
The accuracy of the approximation at large values of χ^{2} is of special importance. For n = 100 and k = 1, the 0.05,0.01 and 0.001 critical points of ⟨ χ^{2} ⟩ _{u} are 124.2, 135.6 and 149.2, respectively. From a simulation with N = 10^{6}, the number of ⟨ χ^{2} ⟩ _{π} values exceeding these thresholds are 50 177, 10 011 and 1025, respectively. Thus, the fraction of ⟨ χ^{2} ⟩ _{π} exceeding the critical values derived from the distribution of ⟨ χ^{2} ⟩ _{u} are close to their predicted values. Accordingly, the conventional interpretation of these critical values is valid.
In this experiment, the analysis of X_{2} benefits from knowledge gained from X_{1}. We expect therefore that ⟨ χ^{2} ⟩ _{π} is less than ⟨ χ^{2} ⟩ _{u}, since replacing the uniform prior with the informative π obtained from X_{1} should improve the fit. From 10^{6} repetitions, this proves to be so with probability 0.683. Sampling noise in D_{1} and D_{2} accounts for the shortfall.
5.2. Bias test
When X_{1} and X_{2} are flawless, the statistic ⟨ χ^{2} ⟩ _{π} indicates doubts – i.e., type I errors (Sect. 3.2) – about the mutual consistency of X_{1} and X_{2} with just the expected frequency. Thus, with the 5% threshold, doubts arise in 5.02% of the above 10^{6} trials. This encourages the use of ⟨ χ^{2} ⟩ _{π} to detect inconsistency.
Accordingly, in a second test, X_{1} is flawed due to biased measurements. Thus, now u_{i} = μ + σz_{i} + b for D_{1}. As a result, the prior for X_{2} obtained from Eq. (9) is compromised, and this impacts on the statistic ⟨ χ^{2} ⟩ _{π} from Eq. (11).
In Fig. 2, the values of ⟨ χ^{2} ⟩ _{π} and ⟨ χ^{2} ⟩ _{u} are plotted against b/σ. Since the compromised prior is excluded from ⟨ χ^{2} ⟩ _{u}, its values depend only on the flawless data sets D_{2}, and so mostly fall within the (0.05,0.95) interval. In contrast, as b/σ increases, the values of ⟨ χ^{2} ⟩ _{π} are increasingly affected by the compromised prior.
The mutual consistency of X_{1} and X_{2} is assessed on the basis of ⟨ χ^{2} ⟩ _{π}, with choice of critical level at our discretion. However, when ⟨ χ^{2} ⟩ _{π} exceeds the 0.1% level at 149.2, we would surely conclude that X_{1} and X_{2} are in conflict and seek to resolve the discrepancy. On the other hand, when inconsistency is not indicated, we may accept the Bayesian inferences derived from X_{2} in the confident belief that incorporating prior knowledge from X_{1} is justified and beneficial.
Fig. 2 Detection of bias in X_{1} with the ⟨ χ^{2} ⟩ _{π} statistic when X_{2} is analysed with prior derived from X_{1}. Values of ⟨ χ^{2} ⟩ _{π} (filled circles) and ⟨ χ^{2} ⟩ _{u} (open circles) are plotted against the bias parameter b/σ. The dashed lines are the 5 and 95% levels. 

Open with DEXTER 
This test illustrates the important point that an inappropriate π can corrupt the Bayesian inferences drawn from a flawless experiment. Thus, in this case, the bias in D_{1} propagates into the posterior p(μ  H,D_{2}) derived from X_{2}. This can be (and is) avoided by preferring the frequentist methodology. But to do so is to forgo the great merit of Bayesian inference, namely its ability to incorporate informative prior information (Feldman & Cousins 1998). If one does therefore prefer Bayesian inference, it is evident that a goodnessoffit statistic such as ⟨ χ^{2} ⟩ _{π} is essential in order to detect errors propagating into the posterior distribution from an illchosen prior.
5.3. Order reversed
In the above test, the analysis of X_{2} is preceded by that of X_{1}. This order can be reversed. Thus, with the same N data pairs (D_{1},D_{2}), we now first analyse X_{2} with a uniform prior to obtain p(μ  H,D_{2}). This becomes the prior for the analysis of X_{1}. This analysis then gives the posterior p(μ  H,D_{1}) from which a new value of ⟨ χ^{2} ⟩ _{π} is obtained.
When the values of ⟨ χ^{2} ⟩ _{π} obtained with this reversed order of analysis are plotted against b/σ, the result is similar to Fig. 2, implying that the order is unimportant. Indeed, statistically, the same decision is reached independently of order. For example, for 10^{5} independent data pairs (D_{1},D_{2}) with b/σ = 1, the number with ⟨ χ^{2} ⟩ _{π}> 124.2, the 5% threshold, is 50 267 when X_{1} precedes X_{2} and 50 149 when X_{2} precedes X_{1}.
5.4. Alternative statistic
Noting that the Bayesian evidence is , the priorweighted mean of the likelihood, we can, under standard assumptions, write (12)where the effective χ^{2} (Bridges et al. 2009) is (13)This is a possible alternative to ⟨ χ^{2} ⟩ _{π} defined in Eq. (4). However, in the test of Sect. 5.1, the two values are so nearly identical it is immaterial which mean is used. Here ⟨ χ^{2} ⟩ _{π} is preferred because it remains welldefined for a uniform prior, for which an analytic result is available (Appendix A).
Because ⟨ χ^{2} ⟩ _{π} and are nearly identical, the distribution of should approximate that of ⟨ χ^{2} ⟩ _{u} (Sect. 4.1). To test this, the experiment of Sect. 5.1 is repeated with replacing ⟨ χ^{2} ⟩ _{π}. From a simulation with N = 10^{6}, the number of values exceeding the 0.05, 0.01 and 0.001 thresholds are 50 167, 9951 and 970, respectively. Thus if is chosen as the goodnessoffit statistic, accurate pvalues can be derived on the assumption that is distributed as with ν = n − k degrees of freedom. From Sect. 4.1, we expect these pvalues to be accurate if π(α) is not more sharply peaked than ℒ(α  H,D).
6. An F statistic for Bayesian models
Inspection of Fig. 2 shows that a more powerful test of inconsistency between X_{1} and X_{2} must exist. A systematic displacement of ⟨ χ^{2} ⟩ _{π} relative to ⟨ χ^{2} ⟩ _{u} is already evident even when ⟨ χ^{2} ⟩ _{π} is below the 5% threshold at 124.2. This suggests that a Bayesian analogue of the F statistic be constructed.
6.1. The frequentist Ftest
In frequentist statistics, a standard result (e.g., Hamilton 1964, p. 139) in the testing of linear hypotheses is the following: we define the statistic (14)where is the minimum value of χ^{2} when all i parameters are adjusted, and is the minimum value when a linear constraint is imposed on j( ≤ i) parameters, so that only the remaining i − j are adjusted. Then, on the null hypothesis H that the constraint is true, F is distributed as F_{ν1,ν2} with ν_{1} = j and ν_{2} = n − i, where n is the number of measurements. Accordingly, H is tested by comparing the value F given by Eq. (14) with critical values F_{ν1,ν2,β} derived from the distribution of F_{ν1,ν2}.
Note that when j = i, the constraint completely determines α. If this value is α_{∗}, then and H states that α_{∗} = α_{true}.
A particular merit of the statistic F is that it is independent of σ. However, the resulting Ftest does assume normallydistributed measurement errors.
6.2. A Bayesian F
In a Bayesian context, the frequentist hypothesis that α_{true} = α_{∗} is replaced by the statement that α_{true} obeys the posterior distribution p(α  H,D_{2}). Thus an exact constraint is replaced by a fuzzy constaint.
Adopting the simplest approach, we define ℱ, a Bayesian analogue of F, by taking to be the value at the posterior mean of α, (15)where π(α) is the informative prior. Considerations of accuracy when values of χ^{2} are computed on a grid suggest we take to be the value at (16)the posterior mean for a uniform prior.
With and thus defined, the Bayesian F is (17)and its value is to be compared with the chosen threshold F_{ν1,ν2,β} when testing the consistency of X_{1} and X_{2}. Since ⟨ α ⟩ _{u} is independent of π(α), the statistic ℱ measures the effect of π(α) in displacing ⟨ α ⟩ _{π} from ⟨ α ⟩ _{u}.
In this Bayesian version of the Ftest, the null hypothesis states that the posterior mean ⟨ α ⟩ _{π} = α_{true}. This will be approximately true when a flawless Bayesian model ℳ is applied to a flawless data set D. However, if the chosen threshold on ℱ is exceeded, then one or more of π,H and D is suspect as discussed in Sect. 3.1. If the threshold is not exceeded, then Bayesian inferences drawn from the posterior distribution p(α  H,D_{2}) are supported.
6.3. Test of pvalues from ℱ
If Eq. (17) gives ℱ = ℱ_{∗}, the corresponding pvalue is (18)where ν_{1} = j and ν_{2} = n − i. The accuracy of these pvalues can be tested with the 1D toy model of Sect. 5.1 as follows:
(i): Independent data sets D_{1},D_{2} are created corresponding to X_{1},X_{2}.
(ii): With π from X_{1}, the quantities and ℱ_{∗} are calculated for X_{2} with Eqs. (15)–(17).
(iii): M independent data sets are now created with .
(iv): For each , the new value of is calculated for the derived at step (ii).
(v): For each , the new value of χ^{2}( ⟨ α ⟩ _{u}) is calculated with ⟨ α ⟩ _{u} from Eq. (16).
(vi): With these new χ^{2} values, the statistic ℱ_{m} is obtained from Eq. (17).
The resulting M values of ℱ_{m} give us an empirical estimate of p_{∗}, namely f_{∗}, the fraction of the ℱ_{m} that exceed ℱ_{∗}. In one example of this test, a data pair D_{1},D_{2} gives and ℱ_{∗} = 1.0021. With ν_{1} = 1 and ν_{2} = 99, Eq. (18) then gives p_{∗} = 0.3192. This is checked by creating M = 10^{5} data sets with the assumption that μ_{true} = 0.089. The resulting empirical estimate is f_{∗} = 0.3189, in close agreement with p_{∗}.From 100 independent pairs D_{1},D_{2}, the mean value of  p_{∗} − f_{∗}  is 0.001. This simulation confirms the accuracy of pvalues derived from Eq. (18) and therefore of decision thresholds F_{ν1,ν2,β}.
6.4. Bias test
To investigate this Bayesian Ftest’s ability to detect inconsistecy between X_{1} and X_{2}, the bias test of Sect. 5.2 is repeated, again with n_{1} = n_{2} = 100. The vector α in Sect. 6.1 now becomes the scalar μ, and i = 1.
In Fig. 3, the values of ℱ from Eq. (17) with j = i = 1 are plotted against the bias parameter b/σ. Also plotted are critical values derived from the distribution F_{ν1,ν2} with ν_{1} = 1 and ν_{2} = 99. In contrast to Fig. 2 for the ⟨ χ^{2} ⟩ _{π} statistic, inconsistency between X_{1} and X_{2} is now detected down to b/σ ≈ 0.5.
In this test of inconsistency, the flaw in X_{1} is the bias b. Now, if it were known for certain that this was the flaw, then a Bayesian analysis with H_{1} changed from u = μ to u = μ + b – i.e., with an extra parameter – is staightforward. The result is the posterior density for b, allowing for correction. In contrast, the detection of a flaw with ⟨ χ^{2} ⟩ _{π} or ℱ is causeindependent. Although Figs. 2 and 3 have b/σ as the abscissa, for real experiments this quantity is not known and conclusions are drawn just from the ordinate.
Fig. 3 Detection of inconsistency between X_{1} and X_{2}. Values of ℱ are plotted against the bias parameter b/σ. The dashed lines are the 0.1, 5 and 50% levels. Note that the horizontal scale differs from Fig. 2. 

Open with DEXTER 
7. Tension between experiments
In the above, the goodnessoffit of ℳ to D is investigated taking into account the possibility that a poor fit might be due the prior derived from a previous experiment. A related goodnessoffit issue commonly arises in modern astrophysics, particularly cosmology, namely whether estimates of nonidentical but partially overlapping parameter vectors from different experiments are mutually consistent. The term commonly used in this regard is tension, with investigators often reporting their subjective assessments – e.g., there is marginal tension between X_{1} and X_{2} – based on the two credibility domains (often multidimensional) for the parameters in common.
In a recent paper, Seehars et al. (2015) review the attempts in the cosmological literature to quantify the concept of tension, with emphasis on CMB experiments. Below, we develop a rather different approach based on the ℱ statistic defined in Sect. 6.2.
Since detecting and resolving conflicts between experiments is essential for scientific progress, it is desirable to quantify the term tension and to optimize its detection. The conjecture here is that this optimum is achieved when inferences from X_{1} are imposed on the analysis of X_{2}.
7.1. Identical parameter vectors
A special case of assessing tension between experiments is that where the parameter vectors are identical. But this is just the problem investigated in Sects. 5 and 6.
When X_{1} (with bias) and X_{2} (without bias) are separately analysed, the limited overlap of the two credibility intervals for μ provides a qualitative indication of tension. However, if X_{2} is analysed with a prior derived from X_{1}, then the statistic ⟨ χ^{2} ⟩ _{π} – see Fig. 2 – or, more powerfully, the statistic ℱ – see Fig. 3 – provide a quantitative measure to inform statements about the degree of tension.
7.2. Nonidentical parameter vectorsI
We now suppose that D_{1},D_{2} are data sets from different experiments X_{1},X_{2} designed to test the hypotheses H_{1},H_{2}. However, though different, these hypotheses have parameters in common. Specifically, the parameter vectors of H_{1} and H_{2} are ξ = (α,β) and η = (β,γ), respectively, and k,l,m are the numbers of elements in α,β,γ, respectively.
If X_{1} and X_{2} are analysed independently, we may well find that ℳ_{1} and ℳ_{2} provide satisfactory fits to D_{1} and D_{2} and yet still be obliged to report tension between the experiments because of a perceived inadequate overlap of the two ldimensional credibility domains for β.
A quantitative measure of tension between X_{1} and X_{2} can be derived via the priors, as follows: the analysis of X_{1} gives p(ξ  H_{1},D_{1}), the posterior distribution of ξ, from which we may derive the posterior distribution of β by integrating over α. Thus (19)Now, for a Bayesian analysis of X_{2}, we must specify π(η) throughout ηspace, not just βspace. But (20)Accordingly, what we infer from X_{1} can be imposed on the analysis of X_{2} by writing (21)The conditional prior π(γ  β) must now be specified. This can be taken to be uniform unless we have prior knowledge from other sources – i.e., not from D_{2}.
With π(η) specified, the analysis of X_{2} gives the posterior density p(η  H_{2},D_{2}). As in Sect. 6.2, we now regard this as a fuzzy constraint on η from which we compute the sharp constraint ⟨ η ⟩ _{π} given by Eq. (15). Now, in general, ⟨ η ⟩ _{π} will be displaced from ⟨ η ⟩ _{u} given by Eq. (16). The question then is: Is the increment in χ^{2}(η  H_{2},D_{2}) between ⟨ η ⟩ _{π} and ⟨ η ⟩ _{u} so large that we must acknowledge tension between X_{1} and X_{2}?
Following Sect. 6, we answer this question by computing ℱ from Eq. (17) with i = j = l + m, the total number parameters in η. The result is then compared to selected critical values from the F_{ν1,ν2} distribution, where ν_{1} = l + m and ν_{2} = n_{2} − l − m. With standard assumptions, ℱ obeys this distribution if ⟨ η ⟩ _{π} = η_{true} – i.e., if ⟨ β ⟩ _{π} = β_{true} and ⟨ γ ⟩ _{π} = γ_{true} – see Sect. 6.3.
7.3. A toy model
A simple example with one parameter (μ) for X_{1} and two (μ,κ) for X_{2} is as follows: H_{1} states that u = μ and H_{2} states that v = μ + κx. The data set D_{1} comprises n_{1} measurements u_{i} = μ + σz_{i} + b, where b is the bias. The data set D_{2} comprises n_{2} measurements v_{j} = μ + κx_{j} + σz_{j}, where the x_{j} are uniform in (− 1, + 1). The parameters are μ = 0,κ = 1,σ = 1 and n_{1} = n_{2} = 100. In the notation of Sect. 7.2, the vectors β,γ contract to the scalars μ,κ, whence l = m = 1, and α does not appear, whence k = 0.
7.4. Bias test
In the above, H_{1} are H_{2} are different hypotheses but have the parameter μ in common. If b = 0, the analyses of X_{1} are X_{2} should give similar credibility intervals for μ and therefore no tension. But with sufficiently large b, tension should arise.
This is investigated following Sect. 7.2. Applying ℳ_{1} to D_{1}, we derive p(μ  H_{1},D_{1}). Then, taking the conditional prior π(κ  μ) to be constant, we obtain (22)as the prior for the analysis of X_{2}. This gives us the posterior distribution p(μ,κ  H_{2},D_{2}), which is a fuzzy constraint in (μ,κ)space. Replacing this by the sharp constraint (⟨ μ ⟩ , ⟨ κ ⟩), the constrained χ^{2} is (23)Substitution in Eq. (17) with j = i = 2, then gives ℱ as a measure of the tension between X_{1} and X_{2}. Under standard assumptions, ℱ is distributed as F_{ν1,ν2} with ν_{1} = 2,ν_{2} = n_{2} − 2 if (⟨ μ ⟩ , ⟨ κ ⟩ ) = (μ,κ)_{true}.
In Fig. 4, the values of ℱ are plotted against b/σ together with critical values for F_{ν1,ν2} with ν_{1} = 2,ν_{2} = 98. This plot shows that tension is detected for b/σ ≳ 0.6. This is slightly inferior to Fig. 3 as is to be expected because of the more complicated X_{2}.
Fig. 4 Detection of tension between different experiments. Values of ℱ are plotted against the bias parameter b/σ. The dashed lines are the 0.1,5 and 50% levels. 

Open with DEXTER 
The importance of Fig. 4 is in showing that the ℱ statistic has the desired characteristic of reliably informing the investigator of tension between different experiments with partially overlapping parameter vectors. When the inconsistency is slight (b/σ ≲ 0.2), this statistic does not sound a false alarm. When the inconsistency is substantial (b/σ ≳ 0.6), the statistic does not fail to sound the alarm.
7.5. Nonidentical parameter vectorsII
If ℱ calculated as in Sect. 7.2 indicates tension, the possible flaws include the conditional prior π(γ  β). Thus, tension could be indicated even if the prior π(β) inferred from X_{1} is accurate and consistent with D_{2}.
Accordingly, we might prefer to focus just on β – i.e., on the parameters in common. If so, we compute (24)and then find the minimum of χ^{2}(η  H_{2},D_{2}) when β = ⟨ β ⟩ _{π}. Thus, the contrained χ^{2} is now (25)The F test also applies in this case – see Sect. 6.1. Thus this value is substituted in Eq. (17), where we now take j = l, the number of parameters in β. The resulting ℱ is then compared to critical values derived from F_{ν1,ν2} with ν_{1} = l,ν_{2} = n_{2} − l − m. With the standard assumptions, ℱ obeys this distribution if ⟨ β ⟩ _{π} = β_{true}.
For the simple model of Sect. 7.3, the resulting plot of ℱ against b/σ is closely similar to Fig. 4 and so is omitted. Nevertheless, the option of constraining a subset of the parameters is likely to be a powerful diagnostic tool for complex, multidimensional problems, identifying the parameters contributing most to the tension revealed when the entire vector is constrained (cf. Seehars et al. 2015).
8. Discussion and conclusion
A legitimate question to ask about the statistical analysis of data acquired in a scientific experiment is: How well or how badly does the model fit the data? Asking this question is, after all, just the last step in applying the scientific method. In a frequentist analysis, where the estimated parameter vector α_{0} is typically the minimumχ^{2} point, this question is answered by reporting the goodnessoffit statisic or, equivalently, the corresponding pvalue. If a poor fit is thereby indicated, the investigator and the community are aware that not only is the model called into question but so also is the estimate α_{0} and its confidence domain.
If the same data is subject to a Bayesian analysis, the same question must surely be asked: the Bayesian approach does not exempt the investigator from the obligation to report on the success or otherwise of the adopted model. In this case, if the fit is poor, the Bayesian model is called into question and so also are inferences drawn from the posterior distribution p(α  H,D).
As noted in Sect. 1, the difficulty in testing Bayesian models is that there are no good fully Bayesian methods for assessing goodnessoffit. Accordingly, in this paper, a frequentist approach is advocated. Specifically, ⟨ χ^{2} ⟩ _{π} is proposed in Sect. 3 as a suitable goodnessoffit statisic for Bayesian models. Under the null hypothesis that ℳ and D are flawless and with the standard assumptions of linearity and normallydistributed errors, then, as argued in Sect. 4.1 and illustrated in Fig. 1, ⟨ χ^{2} ⟩ _{π} − k is approximately distributed as , and so pvalues can be computed. A pvalue thus derived from ⟨ χ^{2} ⟩ _{π} quantifies the average goodnessoffit provided by the posterior distribution. In contrast, in a frequentist minimumχ^{2} analysis, the pvalue quantifies the goodnessoffit provided by the point estimate α_{0}.
In the above, it is regarded as selfevident that astronomers want to adhere to the scientific method by always reporting the goodnessoffit achieved in Bayesian analyses of observational data. However, Gelman & Shalizi (2013), in an essay on the philosophy and practice of Bayesian statistics, note that investigators who identify Bayesian inference with inductive inference regularly fit and compare models without checking them. They deplore this practice. Instead, these authors advocate the hypotheticodeductive approach in which model checking is crucial. As in this paper, they discuss nonBayesian checking of Bayesian models – specifically, the derivation of pvalues from posterior predictive distributions. Moreover, they also stress that the prior distribution is a testable part of a Bayesian model.
In the astronomical literature, the use of frequentist tests to validate Bayesian models is not unique to this paper. Recently, Harrison et al. (2015) have presented an ingenious procedure for validating multidimensional posterior distributions with the frequentist KolmogorovSmirnov (KS) test for onedimensional data. Their aim, as here, is to test the entire Bayesian inference procedure.
Frequentist testing also arises in recent applications of the KullbackLeibler divergence to quantify tension between cosmological probes (e.g. Seehars et al. 2015). For linear models, and with the assumption of Gaussian priors and likelihoods, a term in the relative entropy is a statistic that measures tension. With these assumptions, the statistic follows a generalized χ^{2} distribution, thus allowing a pvalue to be computed.
Seehars et al. (2015) also investigate various purely Bayesian measures of tension. They conclude that interpreting these measures on a fixed, problemindependent scale – e.g., the Jeffrey’s scale – can be misleading – see also Nesseris & GarcíaBellido (2013).
Acknowledgments
I thank D. J. Mortlock for comments on an early draft of this paper, and A. H. Jaffe, M.P.Hobson and the referee for useful references.
References
 Bridges, M., Feroz, F., Hobson, M. P., & Lasenby, A. N. 2009, MNRAS, 400, 1075 [NASA ADS] [CrossRef] [Google Scholar]
 Feldman, G. J., & Cousins, R. D. 1998, Phys. Rev. D, 57, 3873 [NASA ADS] [CrossRef] [Google Scholar]
 Gelman, A., & Shalizi 2013, British J. Math. Stat. Psychol., 66, 8 [CrossRef] [Google Scholar]
 Hamilton, W. C. 1964, Statistics in Physical Science (New York: Ronald Press) [Google Scholar]
 Harrison, D., Sutton, D., Carvalho, P., & Hobson, M. 2015, MNRAS, 451, 2610 [NASA ADS] [CrossRef] [Google Scholar]
 Lucy, L. B. 2014, A&A, 571, A86 (L14) [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Nesseris, S., & GarcíaBellido, J. 2013, J. Cosmol. Astropart. Phys., 08, 036 [NASA ADS] [CrossRef] [Google Scholar]
 Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. 2007 Numerical Recipes (Cambridge: Cambridge University Press) [Google Scholar]
 Seehars, S., Grandis, S., Amara, A., & Refregier, A. 2015, arXiv eprints [arXiv:1510.08483] [Google Scholar]
 Sellke, T., Bayarri, M. J., & Berger, J. 2001, The American Statistician, 55, 62 [CrossRef] [Google Scholar]
 Sivia, D. S., & Skilling, J. 2006, Data Analysis: A Bayesian Tutorial, 2nd Edn. (Oxford: Oxford University Press) [Google Scholar]
Appendix A: Evaluation of ⟨ χ^{2} ⟩ _{u}
If α_{0} denotes the minimumχ^{2} solution, then (A.1)where a is the displacement from α_{0}. Then, on the assumption of linearity, (A.2)where the A_{ij} are the constant elements of the k × k curvature matrix (Press et al. 2007, p. 680), where k is the number of parameters. It follows that surfaces of constant χ^{2} are kdimensional selfsimilar ellipsoids centered on α_{0}.
Now, given the second assumption of normallydistributed measurement errors, the likelihood (A.3)Thus, in the case of a uniform prior, the posterior mean of χ^{2} is (A.4)
Because surfaces of constant Δχ^{2} are selfsimilar, the kdimensional integrals in Eq. (A.4) reduce to 1D integrals.
Suppose on the surface of the ellipsoid with volume V_{∗}. If lengths are increased by the factor λ, then the new ellipsoid has (A.5)With these scaling relations, the integrals in Eq. (A.4) can be transformed into integrals over λ. The result is (A.6)where . The integrals have now been transformed to a known definite integral, (A.7)where x = (z + 1) / 2. Aplying this formula, we obtain (A.8)an exact result under the stated assumptions.
All Figures
Fig. 1 Empirical PDF of ⟨ χ^{2} ⟩ _{π} derived from the analysis of 10^{6} data pairs D_{1},D_{2} as described in Sect. 5.1. The dashed curve is the theoretical PDF for ⟨ χ^{2} ⟩ _{u}. 

Open with DEXTER  
In the text 
Fig. 2 Detection of bias in X_{1} with the ⟨ χ^{2} ⟩ _{π} statistic when X_{2} is analysed with prior derived from X_{1}. Values of ⟨ χ^{2} ⟩ _{π} (filled circles) and ⟨ χ^{2} ⟩ _{u} (open circles) are plotted against the bias parameter b/σ. The dashed lines are the 5 and 95% levels. 

Open with DEXTER  
In the text 
Fig. 3 Detection of inconsistency between X_{1} and X_{2}. Values of ℱ are plotted against the bias parameter b/σ. The dashed lines are the 0.1, 5 and 50% levels. Note that the horizontal scale differs from Fig. 2. 

Open with DEXTER  
In the text 
Fig. 4 Detection of tension between different experiments. Values of ℱ are plotted against the bias parameter b/σ. The dashed lines are the 0.1,5 and 50% levels. 

Open with DEXTER  
In the text 