Volume 588, April 2016
Article Number A19
Number of page(s) 8
Section Numerical methods and codes
Published online 11 March 2016

© ESO, 2016

1. Introduction

Bayesian statistical methods are now widely applied in astronomy. Of the new techniques thus introduced, model selection (or comparison) is especially notable in that it replaces the frequentist acceptance-rejection paradigm for testing hypotheses. Thus, given a data set D, there might be several hypotheses { Hk } that have the potential to explain D. From these, Bayesian model selection identifies the particular Hk that best explains D. The procedure is simple, though computationally demanding: starting with an arbitrary pair of the { Hk }, we apply the model selection machinery, discard the weaker hypothesis, replace it with one of the remaining Hk, and repeat.

This procedure usefully disposes of the weakest hypotheses, but there is no guarantee that the surviving Hk explains D. If the correct hypothesis is not included in the { Hk }, we are left with the “best” hypothesis but are not made aware that the search for an explanation should continue. In the context of model selection, the next step would be a comparison of this “best” Hk with the hypothesis that Hk is false. But model selection fails at this point because we cannot compute the required likelihood (Sivia & Skilling 2006, p. 84). Clearly, what is needed is a goodness-of-fit criterion for Bayesian models.

This issue is discussed by Press et al. (2007, p. 779). They note that “There are no good fully Bayesian methods for assessing goodness-of-fit ...” and go on to report that “Sensible Bayesians usually fall back to p-value tail statistics ... when they really need to know if a model is wrong”.

On the assumption that astronomers do really need to know if their models are wrong, this paper adopts a frequentist approach to testing Bayesian models. Although this approach may be immediately abhorrent to committed Bayesians, the role of the tests proposed herein is merely to provide a quantitative measure according to which Bayesians decide whether their models are satisfactory. When they are, the Bayesian inferences are presented – and with increased confidence. When not, flaws in the models or the data must be sought, with the aim of eventually achieving satisfactory Bayesian inferences.

2. Bayesian models

The term Bayesian model – subsequently denoted by ℳ – must now be defined. The natural definition of ℳ is that which must be specified in order that Bayesian inferences can be drawn from D – i.e., in order to compute posterior probabilities. This definition implies that, in addition to the hypothesis H, which introduces the parameter vector α, the prior probability distribution π(α) must also be included in ℳ. Thus, symbolically, we write

ℳ = { π, H }. (1)

It follows that different Bayesian models can share a common H. For example, H may be the hypothesis that D is due to Keplerian motion. But for the motion of a star about the Galaxy’s central black hole, the appropriate π will differ from that for the reflex orbit of a star due to a planetary companion.

A further consequence is that a failure of ℳ to explain D is not necessarily due to H: an inappropriate π is also a possibility.

To astronomers accustomed to working only with uniform priors, the notion that a Bayesian model’s poor fit to D could be due to π might be surprising. A specific circumstance where π could be at fault arises when Bayesian methods are used to repeatedly update our knowledge of some phenomenon – e.g., the value of a fundamental constant that over the years is the subject of numerous experiments (X1,...,Xi,...), usually of increasing precision. With an orthodox Bayesian approach, X_{i+1} is analysed with the prior set equal to the posterior from X_i. Thus

π_{i+1}(α) = p(α | H, D_i). (2)

This is the classical use of Bayes’ theorem to update our opinion by incorporating past experience. However, if X_i is flawed – e.g., due to an unrecognized systematic error – then this choice of π impacts negatively on the analysis of X_{i+1}, leading perhaps to a poor fit to D_{i+1}.

Now, since subsequent flawless experiments result in the decay of the negative impact of X_i, this recursive procedure is self-correcting, so a Bayesian might argue that the problem can be ignored. But scientists feel obliged to resolve such discrepancies before publishing or undertaking further experiments and therefore need a statistic that quantifies any failure of ℳ to explain D.

3. A goodness-of-fit statistic for Bayesian models

The most widely used goodness-of-fit statistic in the frequentist approach to hypothesis testing is χ², whose value is determined by the residuals between the fitted model and the data, with no input from prior knowledge. Thus,

χ²_min = χ²(α₀) (3)

is the goodness-of-fit statistic for the minimum-χ² solution α₀.

A simple analogue of χ²_min for Bayesian models is the posterior mean of χ²(α),

χ²_π = ∫ χ²(α) p(α | H,D) dα, (4)

where the posterior distribution is

p(α | H,D) ∝ π(α) ℒ(α | H,D). (5)

Here ℒ(α | H,D) is the likelihood of α given data D.

Note that since χ²_π depends on both constituents of ℳ, namely H and π, it has the potential to detect a poor fit due to either or both being at fault, as required by Sect. 2.

In Eq. (4) the subscript π is attached to χ² to stress that a non-trivial, informative prior is included in ℳ. On the other hand, when a uniform prior is assumed, the statistic is independent of the prior and is then denoted by χ²_u.
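For concreteness, Eqs. (4) and (5) can be evaluated by direct quadrature for a one-parameter Gaussian model. The following sketch computes χ²_π and χ²_u on a grid; the Gaussian prior of width 0.2 is an arbitrary assumption for illustration, not part of the paper’s prescription:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, sigma = 100, 1, 1.0
u = sigma * rng.standard_normal(n)        # data u_i = mu + sigma*z_i with mu_true = 0

# chi2(mu) and likelihood L(mu | H, D) on a uniform grid
mu = np.linspace(-1.0, 1.0, 4001)
chi2 = np.array([np.sum((u - m)**2) for m in mu]) / sigma**2
like = np.exp(-0.5 * (chi2 - chi2.min()))

def post_mean_chi2(prior):
    """Eq. (4): mean of chi2 over the posterior of Eq. (5)."""
    post = prior * like
    post /= post.sum()                    # grid normalisation (uniform spacing)
    return np.sum(chi2 * post)

chi2_u = post_mean_chi2(np.ones_like(mu))               # uniform prior
chi2_pi = post_mean_chi2(np.exp(-0.5 * (mu / 0.2)**2))  # assumed informative prior

print(chi2.min(), chi2_pi, chi2_u)        # chi2_u exceeds chi2_min by ~k
```

With a prior centred near the truth, χ²_π typically falls between χ²_min and χ²_u, as discussed in Sect. 4.1.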

The Bayesian goodness-of-fit statistic χ²_u is used in Lucy (2014; hereafter L14) to illustrate the detection of a weak second orbit in simulations of Gaia scans of an astrometric binary. In that case, H states that the scan residuals are due to the reflex Keplerian orbit caused by one invisible companion. With increasing amplitude of the second orbit, the point comes when the investigator will surely abandon H – i.e., abandon the assumption of just one companion – see Fig. 12 of L14.

3.1. p-values

With the classical frequentist acceptance-rejection paradigm, a null hypothesis H0 is rejected on the basis of a p-value tail statistic. Thus, with the χ² test, H0 is rejected if χ² > χ²_β(ν), where Prob(χ² > χ²_β(ν)) = β, and accepted otherwise. Here ν = n − k is the number of degrees of freedom, where n is the number of measurements and k is the number of parameters introduced by H0, and β is the designated p-threshold chosen by the investigator.
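As a sketch, the critical value χ²_β(ν) is the inverse survival function of the χ² distribution; the values n = 100, k = 1 correspond to the toy case used later in Sect. 5.1:

```python
from scipy.stats import chi2

n, k = 100, 1
nu = n - k                      # degrees of freedom
for beta in (0.05, 0.01, 0.001):
    crit = chi2.isf(beta, nu)   # chi2_beta(nu): Prob(chi2 > crit) = beta
    print(beta, round(crit, 1))
```

For ν = 99 this gives 123.2, 134.6 and 148.2; the thresholds 124.2, 135.6 and 149.2 quoted in Sect. 5.1 for χ²_u are these values shifted by k = 1, in line with Eq. (6).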

For a Bayesian model, a p-value can be computed from the χ²_π statistic, whose approximate distribution is derived below in Sect. 5.1. However, a sharp transition from acceptance to rejection of the null hypothesis at some designated p-value is not recommended. First, p-values overstate the strength of the evidence against H0 (e.g., Sellke et al. 2001). In particular, the value p = 0.05 recommended in elementary texts does not imply strong evidence against H0. Second, the p-value is best regarded (Sivia & Skilling 2006, p. 85) as serving a qualitative purpose, with a small value prompting us to think about alternative hypotheses. Thus, if χ²_π exceeds the chosen threshold χ²_β, this is a warning that something is amiss and should be investigated, with the degree of concern increasing as β decreases. If the β = 0.001 threshold is exceeded, then the investigator would be well-advised to suspect that ℳ or D is at fault.

Although statistics texts emphasize tests of H not D, astronomers know that D can be corrupted by biases or calibration errors. Departures from normally-distributed errors can also increase χ²_π.

If D is not at fault, then ℳ is the culprit, implying that either π or H is at fault. If the fault lies with π and not H, then we expect that χ²_u falls below the threshold even though χ²_π exceeds it.

If neither D nor π can be faulted, then the investigator must seek a refined or alternative H.

3.2. Type I and type II errors

In the frequentist approach to hypothesis testing, decision errors are said to be of type I if H is true but the test says reject H, and of type II if H is false but the test says accept H.

Since testing a Bayesian model is not concerned exclusively with H, these definitions must be revised, as follows:

  • A type I error arises when ℳ and D are flawless but the statistic (e.g., χ²_π) exceeds the designated threshold.

  • A type II error arises when ℳ or D is flawed but the statistic does not exceed the designated threshold.

Here the words accept and reject are avoided. Moreover, no particular threshold is mandatory: it is at the discretion of the investigator and is chosen with regard to the consequences of making a decision error.

4. Statistics of χ²_u

The intuitive understanding that scientists have regarding χ² derives from its simplicity and from the rigorous theorems on its distribution that allow us to derive confidence regions for multi-dimensional linear models (e.g., Press et al. 2007, Sect. 15.6.4).

Rigorous statistics for χ² require two assumptions: 1) that the model fitted to D is linear in its parameters; and 2) that measurement errors obey the normal distribution. Nevertheless, even when these standard assumptions do not strictly hold, scientists still commonly rely on χ² to gauge goodness-of-fit, with perhaps Monte Carlo (MC) sampling to provide justification or calibration (e.g., Press et al. 2007, Sect. 15.6.1).

Rigorous results for the statistic χ²_u are therefore of interest. In fact, if we add the assumption of a uniform prior to the above standard assumptions, then we may prove (Appendix A) that

χ²_u = χ²_min + k, (6)

where k is the number of parameters.

Given that Eq. (6) is exact under the stated assumptions, it follows that the quantity χ²_u − k is distributed as χ² with ν = n − k degrees of freedom.

For minimum-χ² fitting of linear models, the solution α₀ is always a better fit to D than is the true solution. In particular, ⟨χ²(α₀)⟩ = n − k, whereas ⟨χ²(α_true)⟩ = n. Accordingly, the +k term in Eq. (6) “corrects” for overfitting, and so ⟨χ²_u⟩ = n – i.e., the expected value of χ² for the actual measurement errors.
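Eq. (6) is easy to check numerically. The sketch below (a two-parameter straight-line model is an arbitrary choice) estimates χ²_u by averaging χ² over posterior samples, which for a linear model with uniform prior are draws from N(α₀, σ²(AᵀA)⁻¹):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, sigma = 100, 2, 1.0
x = np.linspace(-1.0, 1.0, n)
A = np.column_stack([np.ones(n), x])            # design matrix for y = a + b*x
y = 0.5 + 2.0 * x + sigma * rng.standard_normal(n)

# minimum-chi2 (least-squares) solution alpha_0
alpha0, *_ = np.linalg.lstsq(A, y, rcond=None)
chi2_min = np.sum((y - A @ alpha0)**2) / sigma**2

# uniform prior => posterior is N(alpha0, C) with C = sigma^2 (A^T A)^{-1}
C = sigma**2 * np.linalg.inv(A.T @ A)
draws = rng.multivariate_normal(alpha0, C, size=50_000)
chi2_u = np.mean(np.sum((y - draws @ A.T)**2, axis=1)) / sigma**2

print(chi2_u - chi2_min)        # Eq. (6): should be close to k = 2
```

The Monte Carlo average reproduces the +k offset to within sampling noise.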

4.1. Effect of an informative prior

When an informative prior is included in ℳ, an analytic formula for χ²_π is in general not available. However, its approximate statistical properties are readily found.

Consider again a linear model with normally-distributed errors and suppose further that the experiment (X2) is without flaws. The χ² surfaces are then self-similar k-dimensional ellipsoids with minimum at α₀ ≈ α_true, the unknown exact solution. Let us now further suppose that the informative prior π(α) derives from a previous flawless experiment (X1). The prior π will then be a convex (bell-shaped) function with maximum at α_max ≈ α₀. Now, consider a 1D family of such functions all centred on α₀ and ranging from the very broad to the very narrow. For the former, χ²_π ≈ χ²_u; for the latter, χ²_π ≈ χ²_min. Thus, ideally, when a Bayesian model is applied to data D we expect that

χ²_min ≲ χ²_π ≲ χ²_u. (7)

Now, a uniform prior and δ(α − α₀) are the limits of the above family of bell-shaped functions. Since the delta-function limit is not likely to be closely approached in practice, a first approximation to the distribution of χ²_π is that of χ²_u – i.e., χ²_π − k is approximately distributed as χ² with ν = n − k degrees of freedom.

The above discussion assumes faultless X1 and X2. But now suppose that there is an inconsistency between π and X2. The peak of π at α_max will then in general be offset from the minimum of χ² at α₀. Accordingly, in the calculation of χ²_π from Eq. (4), the neighbourhood of the χ² minimum at α₀ has reduced weight relative to χ²(α_max) at the peak of π. Evidently, in this circumstance, χ²_π can greatly exceed χ²_u, and the investigator is then alerted to the inconsistency.

5. Numerical experiments

Given that rigorous results for χ² are not available for informative priors, numerical tests are essential to illustrate the discussion of Sect. 4.1.

5.1. A toy model

A simple example with just one parameter μ is as follows: H states that u = μ, and D comprises n measurements u_i = μ + σz_i, where the z_i here and below are independent gaussian variates randomly sampling N(0,1). In creating synthetic data, we set μ = 0, σ = 1 and n = 100.

In the first numerical experiment, two independent data sets D1 and D2 are created comprising n1 and n2 measurements, respectively. On the assumption of a uniform prior, the posterior density of μ derived from D1 is

p(μ | H,D1) ∝ ℒ(μ | H,D1). (8)

We now regard p(μ | H,D1) as prior knowledge to be taken into account in analysing D2. Thus

π(μ) = p(μ | H,D1), (9)

so that the posterior distribution derived from D2 is

p(μ | H,D2) ∝ π(μ) ℒ(μ | H,D2). (10)

The statistic quantifying the goodness-of-fit achieved when ℳ = { π,H } is applied to data D2 is then

χ²_π = ∫ χ²(μ | H,D2) p(μ | H,D2) dμ. (11)

From N independent data pairs (D1,D2), we obtain N independent values of χ²_π, thus allowing us to test the expectation (Sect. 4.1) that this statistic is approximately distributed as χ²_u. In Fig. 1, this empirical PDF is plotted together with the theoretical PDF for χ²_u.
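The pipeline of Eqs. (8)-(11) for a single data pair can be sketched on a grid; the grid limits and resolution are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(3)
mu_true, sigma, n1, n2 = 0.0, 1.0, 100, 100
D1 = mu_true + sigma * rng.standard_normal(n1)
D2 = mu_true + sigma * rng.standard_normal(n2)

mu = np.linspace(-1.0, 1.0, 4001)               # uniform grid in mu

def chi2_grid(D):
    return np.array([np.sum((D - m)**2) for m in mu]) / sigma**2

# Eq. (8): posterior from D1 with a uniform prior -> Eq. (9): prior for D2
chi2_1 = chi2_grid(D1)
prior = np.exp(-0.5 * (chi2_1 - chi2_1.min()))

# Eq. (10): posterior from D2 with that informative prior
chi2_2 = chi2_grid(D2)
post = prior * np.exp(-0.5 * (chi2_2 - chi2_2.min()))
post /= post.sum()

# Eq. (11): goodness-of-fit statistic for M = {pi, H} applied to D2
chi2_pi = np.sum(chi2_2 * post)
print(chi2_pi)
```

Repeating this over N independent data pairs yields the empirical PDF of Fig. 1.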

Fig. 1. Empirical PDF of χ²_π derived from the analysis of 10⁶ data pairs (D1,D2) as described in Sect. 5.1. The dashed curve is the theoretical PDF for χ²_u.

The accuracy of the approximation at large values of χ² is of special importance. For n = 100 and k = 1, the 0.05, 0.01 and 0.001 critical points of χ²_u are 124.2, 135.6 and 149.2, respectively. From a simulation with N = 10⁶, the numbers of χ²_π values exceeding these thresholds are 50 177, 10 011 and 1025, respectively. Thus, the fractions of χ²_π values exceeding the critical values derived from the distribution of χ²_u are close to their predicted values. Accordingly, the conventional interpretation of these critical values is valid.

In this experiment, the analysis of X2 benefits from knowledge gained from X1. We therefore expect χ²_π to be less than χ²_u, since replacing the uniform prior with the informative π obtained from X1 should improve the fit. From 10⁶ repetitions, this proves to be so with probability 0.683. Sampling noise in D1 and D2 accounts for the shortfall.

5.2. Bias test

When X1 and X2 are flawless, the statistic χ²_π indicates doubts – i.e., type I errors (Sect. 3.2) – about the mutual consistency of X1 and X2 with just the expected frequency. Thus, with the 5% threshold, doubts arise in 5.02% of the above 10⁶ trials. This encourages the use of χ²_π to detect inconsistency.

Accordingly, in a second test, X1 is flawed due to biased measurements. Thus, now u_i = μ + σz_i + b for D1. As a result, the prior for X2 obtained from Eq. (9) is compromised, and this impacts on the statistic χ²_π from Eq. (11).

In Fig. 2, the values of χ²_π and χ²_u are plotted against b/σ. Since the compromised prior is excluded from χ²_u, its values depend only on the flawless data sets D2, and so mostly fall within the (0.05, 0.95) interval. In contrast, as b/σ increases, the values of χ²_π are increasingly affected by the compromised prior.

The mutual consistency of X1 and X2 is assessed on the basis of χ²_π, with choice of critical level at our discretion. However, when χ²_π exceeds the 0.1% level at 149.2, we would surely conclude that X1 and X2 are in conflict and seek to resolve the discrepancy. On the other hand, when inconsistency is not indicated, we may accept the Bayesian inferences derived from X2 in the confident belief that incorporating prior knowledge from X1 is justified and beneficial.

Fig. 2. Detection of bias in X1 with the χ²_π statistic when X2 is analysed with the prior derived from X1. Values of χ²_π (filled circles) and χ²_u (open circles) are plotted against the bias parameter b/σ. The dashed lines are the 5 and 95% levels.

This test illustrates the important point that an inappropriate π can corrupt the Bayesian inferences drawn from a flawless experiment. Thus, in this case, the bias in D1 propagates into the posterior p(μ | H,D2) derived from X2. This can be (and is) avoided by preferring the frequentist methodology. But to do so is to forgo the great merit of Bayesian inference, namely its ability to incorporate informative prior information (Feldman & Cousins 1998). If one does prefer Bayesian inference, it is evident that a goodness-of-fit statistic such as χ²_π is essential in order to detect errors propagating into the posterior distribution from an ill-chosen prior.

5.3. Order reversed

In the above test, the analysis of X2 is preceded by that of X1. This order can be reversed. Thus, with the same N data pairs (D1,D2), we now first analyse X2 with a uniform prior to obtain p(μ | H,D2). This becomes the prior for the analysis of X1. This analysis then gives the posterior p(μ | H,D1), from which a new value of χ²_π is obtained.

When the values of χ²_π obtained with this reversed order of analysis are plotted against b/σ, the result is similar to Fig. 2, implying that the order is unimportant. Indeed, statistically, the same decision is reached independently of order. For example, for 10⁵ independent data pairs (D1,D2) with b/σ = 1, the number with χ²_π > 124.2, the 5% threshold, is 50 267 when X1 precedes X2 and 50 149 when X2 precedes X1.

5.4. Alternative statistic

Noting that the Bayesian evidence ⟨ℒ⟩ = ∫ π(α) ℒ(α | H,D) dα is the prior-weighted mean of the likelihood, we can, under standard assumptions, write

⟨ℒ⟩ ∝ exp(−χ²_eff/2), (12)

where the effective χ² (Bridges et al. 2009) is

χ²_eff = −2 ln ⟨ℒ⟩ + const. (13)

This is a possible alternative to χ²_π defined in Eq. (4). However, in the test of Sect. 5.1, the two values are so nearly identical that it is immaterial which mean is used. Here χ²_π is preferred because it remains well-defined for a uniform prior, for which an analytic result is available (Appendix A).

Because χ²_π and χ²_eff are nearly identical, the distribution of χ²_eff should approximate that of χ²_u (Sect. 4.1). To test this, the experiment of Sect. 5.1 is repeated with χ²_eff replacing χ²_π. From a simulation with N = 10⁶, the numbers of χ²_eff values exceeding the 0.05, 0.01 and 0.001 thresholds are 50 167, 9951 and 970, respectively. Thus, if χ²_eff is chosen as the goodness-of-fit statistic, accurate p-values can be derived on the assumption that χ²_eff − k is distributed as χ² with ν = n − k degrees of freedom. From Sect. 4.1, we expect these p-values to be accurate if π(α) is not more sharply peaked than ℒ(α | H,D).

6. An F statistic for Bayesian models

Inspection of Fig. 2 shows that a more powerful test of inconsistency between X1 and X2 must exist. A systematic displacement of χ²_π relative to χ²_u is already evident even when χ²_π is below the 5% threshold at 124.2. This suggests that a Bayesian analogue of the F statistic be constructed.

6.1. The frequentist F-test

In frequentist statistics, a standard result (e.g., Hamilton 1964, p. 139) in the testing of linear hypotheses is the following: we define the statistic

F = [(χ²_c − χ²_m)/j] / [χ²_m/(n − i)], (14)

where χ²_m is the minimum value of χ² when all i parameters are adjusted, and χ²_c is the minimum value when a linear constraint is imposed on j (≤ i) parameters, so that only the remaining i − j are adjusted. Then, on the null hypothesis H that the constraint is true, F is distributed as F_{ν1,ν2} with ν1 = j and ν2 = n − i, where n is the number of measurements. Accordingly, H is tested by comparing the value F given by Eq. (14) with critical values F^β_{ν1,ν2} derived from the distribution of F_{ν1,ν2}.

Note that when j = i, the constraint completely determines α. If this value is α*, then χ²_c = χ²(α*), and H states that α* = α_true.

A particular merit of the statistic F is that it is independent of σ. However, the resulting F-test does assume normally-distributed measurement errors.
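A minimal sketch of Eq. (14) follows; the case j = i = 1, i.e. a fully determining constraint μ = 0 on the toy model of Sect. 5.1, is an illustrative assumption:

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(4)
n, sigma = 100, 1.0
u = sigma * rng.standard_normal(n)            # H: u = mu, with mu_true = 0

# chi2_m: all i = 1 parameters adjusted (minimum at the sample mean)
chi2_m = np.sum((u - u.mean())**2) / sigma**2
# chi2_c: the constraint mu = 0 fixes j = i = 1 parameters
chi2_c = np.sum(u**2) / sigma**2

nu1, nu2 = 1, n - 1
F = ((chi2_c - chi2_m) / nu1) / (chi2_m / nu2)    # Eq. (14)
p = f_dist.sf(F, nu1, nu2)                        # p-value under the null
print(F, p)
```

Because σ cancels between numerator and denominator, the same F is obtained whatever error scale is assumed, which is the merit noted above.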

6.2. A Bayesian F

In a Bayesian context, the frequentist hypothesis that α_true = α* is replaced by the statement that α_true obeys the posterior distribution p(α | H,D2). Thus an exact constraint is replaced by a fuzzy constraint.

Adopting the simplest approach, we define F̂, a Bayesian analogue of F, by taking χ²_c to be the value at the posterior mean of α,

⟨α⟩_π = ∫ α p(α | H,D2) dα, (15)

where π(α) is the informative prior. Considerations of accuracy when values of χ² are computed on a grid suggest we take χ²_m to be the value at

⟨α⟩_u = ∫ α p_u(α | H,D2) dα, (16)

the posterior mean for a uniform prior.

With χ²_c and χ²_m thus defined, the Bayesian F is

F̂ = [(χ²(⟨α⟩_π) − χ²(⟨α⟩_u))/j] / [χ²(⟨α⟩_u)/(n − i)], (17)

and its value is to be compared with the chosen threshold F^β_{ν1,ν2} when testing the consistency of X1 and X2. Since ⟨α⟩_u is independent of π(α), the statistic F̂ measures the effect of π(α) in displacing ⟨α⟩_π from ⟨α⟩_u.

In this Bayesian version of the F-test, the null hypothesis states that the posterior mean ⟨α⟩_π = α_true. This will be approximately true when a flawless Bayesian model is applied to a flawless data set D. However, if the chosen threshold on F̂ is exceeded, then one or more of π, H and D is suspect, as discussed in Sect. 3.1. If the threshold is not exceeded, then Bayesian inferences drawn from the posterior distribution p(α | H,D2) are supported.
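Under the toy model of Sect. 5, F̂ of Eqs. (15)-(17) can be sketched on a grid; the bias b = 1 injected into X1 anticipates the test of Sect. 6.4 and, like the grid limits, is an illustrative assumption:

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(5)
sigma, n1, n2, b = 1.0, 100, 100, 1.0
D1 = b + sigma * rng.standard_normal(n1)      # X1 flawed by bias b
D2 = sigma * rng.standard_normal(n2)          # X2 flawless, mu_true = 0

mu = np.linspace(-1.0, 2.0, 6001)
chi2_1 = np.array([np.sum((D1 - m)**2) for m in mu]) / sigma**2
chi2_2 = np.array([np.sum((D2 - m)**2) for m in mu]) / sigma**2

prior = np.exp(-0.5 * (chi2_1 - chi2_1.min()))     # informative prior from X1
like2 = np.exp(-0.5 * (chi2_2 - chi2_2.min()))

post_pi = prior * like2
post_pi /= post_pi.sum()
post_u = like2 / like2.sum()

mu_pi = np.sum(mu * post_pi)      # Eq. (15): mean with informative prior
mu_u = np.sum(mu * post_u)        # Eq. (16): mean with uniform prior
chi2_at = lambda m: np.sum((D2 - m)**2) / sigma**2

nu1, nu2 = 1, n2 - 1
F_hat = ((chi2_at(mu_pi) - chi2_at(mu_u)) / nu1) / (chi2_at(mu_u) / nu2)  # Eq. (17)
print(F_hat, f_dist.sf(F_hat, nu1, nu2))
```

With b/σ = 1, the prior drags ⟨μ⟩_π well away from ⟨μ⟩_u, and F̂ far exceeds typical F_{1,99} values.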

6.3. Test of p-values from F̂

If Eq. (17) gives F̂ = F̂*, the corresponding p-value is

p = Prob(F_{ν1,ν2} > F̂*), (18)

where ν1 = j and ν2 = n − i. The accuracy of these p-values can be tested with the 1D toy model of Sect. 5.1 as follows:

(i): Independent data sets D1,D2 are created corresponding to X1,X2.

  (ii): With π from X1, the quantities ⟨μ⟩_π and F̂ are calculated for X2 with Eqs. (15)–(17).

  (iii): M independent data sets are now created with μ_true = ⟨μ⟩_π.

  (iv): For each new data set, the value of χ²(⟨μ⟩_π) is calculated for the ⟨μ⟩_π derived at step (ii).

  (v): For each new data set, the value of χ²(⟨μ⟩_u) is calculated with ⟨μ⟩_u from Eq. (16).

  (vi): With these new χ² values, the statistic F̂_m is obtained from Eq. (17).

The resulting M values of F̂_m give us an empirical estimate of p, namely f, the fraction of the F̂_m that exceed F̂*. In one example of this test, a data pair (D1,D2) gives ⟨μ⟩_π = 0.089 and F̂* = 1.0021. With ν1 = 1 and ν2 = 99, Eq. (18) then gives p = 0.3192. This is checked by creating M = 10⁵ data sets with μ_true = 0.089. The resulting empirical estimate is f = 0.3189, in close agreement with p.

From 100 independent pairs (D1,D2), the mean value of |p − f| is 0.001. This simulation confirms the accuracy of p-values derived from Eq. (18) and therefore of decision thresholds F^β_{ν1,ν2}.
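Steps (iii)-(vi) vectorise readily. The sketch below reuses the example values ⟨μ⟩_π = 0.089 and F̂* = 1.0021 quoted above, with M = 2×10⁴ replications rather than 10⁵ to keep the run short:

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(6)
n, sigma, M = 100, 1.0, 20_000
mu_star, F_star = 0.089, 1.0021        # <mu>_pi and F-hat from steps (i)-(ii)

# Step (iii): M replicate data sets with mu_true = <mu>_pi
U = mu_star + sigma * rng.standard_normal((M, n))
# Steps (iv)-(v): constrained and minimum chi2 for each replicate
chi2_c = np.sum((U - mu_star)**2, axis=1) / sigma**2
ubar = U.mean(axis=1)
chi2_m = np.sum((U - ubar[:, None])**2, axis=1) / sigma**2
# Step (vi): Eq. (17) with j = i = 1
F_m = (chi2_c - chi2_m) / (chi2_m / (n - 1))

f_emp = np.mean(F_m > F_star)          # empirical estimate f of the p-value
p = f_dist.sf(F_star, 1, n - 1)        # Eq. (18)
print(round(p, 4), round(f_emp, 4))
```

The empirical fraction f agrees with the F_{1,99} tail probability to within sampling noise.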

6.4. Bias test

To investigate this Bayesian F-test’s ability to detect inconsistency between X1 and X2, the bias test of Sect. 5.2 is repeated, again with n1 = n2 = 100. The vector α of Sect. 6.1 now becomes the scalar μ, and i = 1.

In Fig. 3, the values of F̂ from Eq. (17) with j = i = 1 are plotted against the bias parameter b/σ. Also plotted are critical values derived from the distribution F_{ν1,ν2} with ν1 = 1 and ν2 = 99. In contrast to Fig. 2 for the χ²_π statistic, inconsistency between X1 and X2 is now detected down to b/σ ≈ 0.5.

In this test of inconsistency, the flaw in X1 is the bias b. Now, if it were known for certain that this was the flaw, then a Bayesian analysis with H1 changed from u = μ to u = μ + b – i.e., with an extra parameter – is straightforward. The result is the posterior density for b, allowing for correction. In contrast, the detection of a flaw with χ²_π or F̂ is cause-independent. Although Figs. 2 and 3 have b/σ as the abscissa, for real experiments this quantity is not known and conclusions are drawn just from the ordinate.

Fig. 3. Detection of inconsistency between X1 and X2. Values of F̂ are plotted against the bias parameter b/σ. The dashed lines are the 0.1, 5 and 50% levels. Note that the horizontal scale differs from that of Fig. 2.

7. Tension between experiments

In the above, the goodness-of-fit of ℳ to D is investigated taking into account the possibility that a poor fit might be due to the prior derived from a previous experiment. A related goodness-of-fit issue commonly arises in modern astrophysics, particularly cosmology, namely whether estimates of non-identical but partially overlapping parameter vectors from different experiments are mutually consistent. The term commonly used in this regard is tension, with investigators often reporting their subjective assessments – e.g., there is marginal tension between X1 and X2 – based on the two credibility domains (often multi-dimensional) for the parameters in common.

In a recent paper, Seehars et al. (2015) review the attempts in the cosmological literature to quantify the concept of tension, with emphasis on CMB experiments. Below, we develop a rather different approach based on the statistic F̂ defined in Sect. 6.2.

Since detecting and resolving conflicts between experiments is essential for scientific progress, it is desirable to quantify the term tension and to optimize its detection. The conjecture here is that this optimum is achieved when inferences from X1 are imposed on the analysis of X2.

7.1. Identical parameter vectors

A special case of assessing tension between experiments is that where the parameter vectors are identical. But this is just the problem investigated in Sects. 5 and 6.

When X1 (with bias) and X2 (without bias) are separately analysed, the limited overlap of the two credibility intervals for μ provides a qualitative indication of tension. However, if X2 is analysed with a prior derived from X1, then the statistic χ²_π – see Fig. 2 – or, more powerfully, the statistic F̂ – see Fig. 3 – provides a quantitative measure to inform statements about the degree of tension.

7.2. Non-identical parameter vectors-I

We now suppose that D1,D2 are data sets from different experiments X1,X2 designed to test the hypotheses H1,H2. However, though different, these hypotheses have parameters in common. Specifically, the parameter vectors of H1 and H2 are ξ = (α,β) and η = (β,γ), respectively, and k,l,m are the numbers of elements in α,β,γ, respectively.

If X1 and X2 are analysed independently, we may well find that ℳ1 and ℳ2 provide satisfactory fits to D1 and D2 and yet still be obliged to report tension between the experiments because of a perceived inadequate overlap of the two l-dimensional credibility domains for β.

A quantitative measure of tension between X1 and X2 can be derived via the priors, as follows: the analysis of X1 gives p(ξ | H1,D1), the posterior distribution of ξ, from which we may derive the posterior distribution of β by integrating over α. Thus

p(β | H1,D1) = ∫ p(ξ | H1,D1) dα. (19)

Now, for a Bayesian analysis of X2, we must specify π(η) throughout η-space, not just β-space. But

π(η) = π(γ | β) π(β). (20)

Accordingly, what we infer from X1 can be imposed on the analysis of X2 by writing

π(η) = π(γ | β) p(β | H1,D1). (21)

The conditional prior π(γ | β) must now be specified. This can be taken to be uniform unless we have prior knowledge from other sources – i.e., not from D2.

With π(η) specified, the analysis of X2 gives the posterior density p(η | H2,D2). As in Sect. 6.2, we now regard this as a fuzzy constraint on η from which we compute the sharp constraint ⟨η⟩_π given by Eq. (15). Now, in general, ⟨η⟩_π will be displaced from ⟨η⟩_u given by Eq. (16). The question then is: is the increment in χ²(η | H2,D2) between ⟨η⟩_π and ⟨η⟩_u so large that we must acknowledge tension between X1 and X2?

Following Sect. 6, we answer this question by computing F̂ from Eq. (17) with i = j = l + m, the total number of parameters in η. The result is then compared to selected critical values from the F_{ν1,ν2} distribution, where ν1 = l + m and ν2 = n2 − l − m. With standard assumptions, F̂ obeys this distribution if ⟨η⟩_π = η_true – i.e., if ⟨β⟩_π = β_true and ⟨γ⟩_π = γ_true – see Sect. 6.3.

7.3. A toy model

A simple example with one parameter (μ) for X1 and two (μ,κ) for X2 is as follows: H1 states that u = μ and H2 states that v = μ + κx. The data set D1 comprises n1 measurements u_i = μ + σz_i + b, where b is the bias. The data set D2 comprises n2 measurements v_j = μ + κx_j + σz_j, where the x_j are uniform in (−1, +1). The parameters are μ = 0, κ = 1, σ = 1 and n1 = n2 = 100. In the notation of Sect. 7.2, the vectors β,γ contract to the scalars μ,κ, whence l = m = 1, and α does not appear, whence k = 0.

7.4. Bias test

In the above, H1 and H2 are different hypotheses but have the parameter μ in common. If b = 0, the analyses of X1 and X2 should give similar credibility intervals for μ and therefore no tension. But with sufficiently large b, tension should arise.

This is investigated following Sect. 7.2. Applying ℳ1 to D1, we derive p(μ | H1,D1). Then, taking the conditional prior π(κ | μ) to be constant, we obtain

π(μ,κ) ∝ p(μ | H1,D1) (22)

as the prior for the analysis of X2. This gives us the posterior distribution p(μ,κ | H2,D2), which is a fuzzy constraint in (μ,κ)-space. Replacing this by the sharp constraint (⟨μ⟩, ⟨κ⟩), the constrained χ² is

χ²_c = χ²(⟨μ⟩, ⟨κ⟩ | H2,D2). (23)

Substitution in Eq. (17) with j = i = 2 then gives F̂ as a measure of the tension between X1 and X2. Under standard assumptions, F̂ is distributed as F_{ν1,ν2} with ν1 = 2, ν2 = n2 − 2 if (⟨μ⟩, ⟨κ⟩) = (μ,κ)_true.
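A grid-based sketch of this two-parameter tension test follows; the grid ranges and the bias b = 1 are illustrative assumptions:

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(7)
sigma, n1, n2, b = 1.0, 100, 100, 1.0
u = b + sigma * rng.standard_normal(n1)          # D1: biased X1 (mu_true = 0)
x = rng.uniform(-1.0, 1.0, n2)
v = 1.0 * x + sigma * rng.standard_normal(n2)    # D2: (mu, kappa)_true = (0, 1)

mu = np.linspace(-1.0, 2.0, 201)
ka = np.linspace(0.0, 2.0, 201)
MU, KA = np.meshgrid(mu, ka, indexing="ij")

# Eq. (22): prior on (mu, kappa) from X1, uniform in kappa
chi2_1 = np.array([np.sum((u - m)**2) for m in mu]) / sigma**2
prior = np.exp(-0.5 * (chi2_1 - chi2_1.min()))[:, None] * np.ones_like(KA)

# chi2 surface of X2 and the two posteriors
resid = v[None, None, :] - MU[..., None] - KA[..., None] * x[None, None, :]
chi2_2 = np.sum(resid**2, axis=-1) / sigma**2
like2 = np.exp(-0.5 * (chi2_2 - chi2_2.min()))
post_pi = prior * like2
post_pi /= post_pi.sum()
post_u = like2 / like2.sum()

chi2_at = lambda m, k: np.sum((v - m - k * x)**2) / sigma**2
mu_pi, ka_pi = np.sum(post_pi * MU), np.sum(post_pi * KA)   # sharp constraint, Eq. (15)
mu_u, ka_u = np.sum(post_u * MU), np.sum(post_u * KA)       # Eq. (16)

nu1, nu2 = 2, n2 - 2
F_hat = ((chi2_at(mu_pi, ka_pi) - chi2_at(mu_u, ka_u)) / nu1) / (chi2_at(mu_u, ka_u) / nu2)
print(F_hat, f_dist.sf(F_hat, nu1, nu2))      # Eqs. (17) and (23)
```

With b/σ = 1, the prior displaces ⟨μ⟩_π and the resulting F̂ signals tension well beyond the thresholds of Fig. 4.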

In Fig. 4, the values of F̂ are plotted against b/σ together with critical values for F_{ν1,ν2} with ν1 = 2, ν2 = 98. This plot shows that tension is detected for b/σ ≳ 0.6. This is slightly inferior to Fig. 3, as is to be expected because of the more complicated X2.

Fig. 4. Detection of tension between different experiments. Values of F̂ are plotted against the bias parameter b/σ. The dashed lines are the 0.1, 5 and 50% levels.

The importance of Fig. 4 is in showing that the statistic F̂ has the desired characteristic of reliably informing the investigator of tension between different experiments with partially overlapping parameter vectors. When the inconsistency is slight (b/σ ≲ 0.2), this statistic does not sound a false alarm. When the inconsistency is substantial (b/σ ≳ 0.6), the statistic does not fail to sound the alarm.

7.5. Non-identical parameter vectors-II

If F̂ calculated as in Sect. 7.2 indicates tension, the possible flaws include the conditional prior π(γ | β). Thus, tension could be indicated even if the prior π(β) inferred from X1 is accurate and consistent with D2.

Accordingly, we might prefer to focus just on β – i.e., on the parameters in common. If so, we compute

⟨β⟩_π = ∫ β p(η | H2,D2) dη (24)

and then find the minimum of χ²(η | H2,D2) when β = ⟨β⟩_π. Thus, the constrained χ² is now

χ²_c = min_γ χ²(⟨β⟩_π, γ | H2,D2). (25)

The F test also applies in this case – see Sect. 6.1. Thus this value is substituted in Eq. (17), where we now take j = l, the number of parameters in β. The resulting F̂ is then compared to critical values derived from F_{ν1,ν2} with ν1 = l, ν2 = n2 − l − m. With the standard assumptions, F̂ obeys this distribution if ⟨β⟩_π = β_true.

For the simple model of Sect. 7.3, the resulting plot of F̂ against b/σ is closely similar to Fig. 4 and so is omitted. Nevertheless, the option of constraining a subset of the parameters is likely to be a powerful diagnostic tool for complex, multi-dimensional problems, identifying the parameters contributing most to the tension revealed when the entire vector is constrained (cf. Seehars et al. 2015).

8. Discussion and conclusion

A legitimate question to ask about the statistical analysis of data acquired in a scientific experiment is: How well or how badly does the model fit the data? Asking this question is, after all, just the last step in applying the scientific method. In a frequentist analysis, where the estimated parameter vector α₀ is typically the minimum-χ² point, this question is answered by reporting the goodness-of-fit statistic χ²_min or, equivalently, the corresponding p-value. If a poor fit is thereby indicated, the investigator and the community are aware that not only is the model called into question but so also is the estimate α₀ and its confidence domain.

If the same data is subject to a Bayesian analysis, the same question must surely be asked: the Bayesian approach does not exempt the investigator from the obligation to report on the success or otherwise of the adopted model. In this case, if the fit is poor, the Bayesian model is called into question and so also are inferences drawn from the posterior distribution p(α | H,D).

As noted in Sect. 1, the difficulty in testing Bayesian models is that there are no good fully Bayesian methods for assessing goodness-of-fit. Accordingly, in this paper, a frequentist approach is advocated. Specifically, χ²_π is proposed in Sect. 3 as a suitable goodness-of-fit statistic for Bayesian models. Under the null hypothesis that ℳ and D are flawless, and with the standard assumptions of linearity and normally-distributed errors, then, as argued in Sect. 4.1 and illustrated in Fig. 1, χ²_π − k is approximately distributed as χ² with ν = n − k degrees of freedom, and so p-values can be computed. A p-value thus derived from χ²_π quantifies the average goodness-of-fit provided by the posterior distribution. In contrast, in a frequentist minimum-χ² analysis, the p-value quantifies the goodness-of-fit provided by the point estimate α₀.

In the above, it is regarded as self-evident that astronomers want to adhere to the scientific method by always reporting the goodness-of-fit achieved in Bayesian analyses of observational data. However, Gelman & Shalizi (2013), in an essay on the philosophy and practice of Bayesian statistics, note that investigators who identify Bayesian inference with inductive inference regularly fit and compare models without checking them. They deplore this practice. Instead, these authors advocate the hypothetico-deductive approach in which model checking is crucial. As in this paper, they discuss non-Bayesian checking of Bayesian models – specifically, the derivation of p-values from posterior predictive distributions. Moreover, they also stress that the prior distribution is a testable part of a Bayesian model.

In the astronomical literature, the use of frequentist tests to validate Bayesian models is not unique to this paper. Recently, Harrison et al. (2015) have presented an ingenious procedure for validating multidimensional posterior distributions with the frequentist Kolmogorov-Smirnov (KS) test for one-dimensional data. Their aim, as here, is to test the entire Bayesian inference procedure.
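Their multidimensional construction is beyond a short sketch, but the calibration idea underlying such tests can be illustrated in one dimension: if the entire inference procedure is correct, the posterior CDF evaluated at the true parameter value is uniformly distributed over repeated experiments, and a KS test can check this. The toy conjugate-Gaussian setup below is purely illustrative and is not the procedure of Harrison et al.:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, n_trials = 20, 500
sigma, tau = 1.0, 1.0                        # noise std, prior std

u = np.empty(n_trials)
for i in range(n_trials):
    mu_true = rng.normal(0.0, tau)           # draw the truth from the prior
    data = rng.normal(mu_true, sigma, n)
    # Conjugate posterior for the mean: N(post_mean, post_var).
    post_var = 1.0 / (1.0 / tau**2 + n / sigma**2)
    post_mean = post_var * (data.sum() / sigma**2)
    u[i] = stats.norm.cdf(mu_true, post_mean, np.sqrt(post_var))

# If the inference is correct, u is uniform on [0, 1]; KS quantifies this.
ks = stats.kstest(u, "uniform")
```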

Frequentist testing also arises in recent applications of the Kullback-Leibler divergence to quantify tension between cosmological probes (e.g. Seehars et al. 2015). For linear models, and with the assumption of Gaussian priors and likelihoods, a term in the relative entropy is a statistic that measures tension. With these assumptions, the statistic follows a generalized χ2 distribution, thus allowing a p-value to be computed.
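For reference, the relative entropy between two multivariate Gaussians has a closed form; in the sketch below (the function name and example numbers are illustrative, not from Seehars et al.), the quadratic term in the mean difference is the part that responds to a shift between two probes:

```python
import numpy as np

def kl_gaussian(mu1, cov1, mu2, cov2):
    """Relative entropy D(N1 || N2) between two multivariate Gaussians."""
    cov2_inv = np.linalg.inv(cov2)
    d = np.asarray(mu2) - np.asarray(mu1)
    k = len(d)
    return 0.5 * (np.trace(cov2_inv @ cov1)   # dispersion mismatch
                  + d @ cov2_inv @ d          # mean shift: the tension term
                  - k
                  + np.log(np.linalg.det(cov2) / np.linalg.det(cov1)))

# Illustration: identical covariances, means shifted by one sigma.
cov = np.eye(2)
dkl = kl_gaussian([0.0, 0.0], cov, [1.0, 0.0], cov)   # -> 0.5
```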

Seehars et al. (2015) also investigate various purely Bayesian measures of tension. They conclude that interpreting these measures on a fixed, problem-independent scale, e.g. the Jeffreys scale, can be misleading (see also Nesseris & García-Bellido 2013).


Acknowledgements. I thank D. J. Mortlock for comments on an early draft of this paper, and A. H. Jaffe, M. P. Hobson, and the referee for useful references.


References

  1. Bridges, M., Feroz, F., Hobson, M. P., & Lasenby, A. N. 2009, MNRAS, 400, 1075
  2. Feldman, G. J., & Cousins, R. D. 1998, Phys. Rev. D, 57, 3873
  3. Gelman, A., & Shalizi, C. R. 2013, British J. Math. Stat. Psychol., 66, 8
  4. Hamilton, W. C. 1964, Statistics in Physical Science (New York: Ronald Press)
  5. Harrison, D., Sutton, D., Carvalho, P., & Hobson, M. 2015, MNRAS, 451, 2610
  6. Lucy, L. B. 2014, A&A, 571, A86 (L14)
  7. Nesseris, S., & García-Bellido, J. 2013, J. Cosmol. Astropart. Phys., 08, 036
  8. Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. 2007, Numerical Recipes (Cambridge: Cambridge University Press)
  9. Seehars, S., Grandis, S., Amara, A., & Refregier, A. 2015, arXiv e-prints [arXiv:1510.08483]
  10. Sellke, T., Bayarri, M. J., & Berger, J. 2001, The American Statistician, 55, 62
  11. Sivia, D. S., & Skilling, J. 2006, Data Analysis: A Bayesian Tutorial, 2nd edn. (Oxford: Oxford University Press)

Appendix A: Evaluation of χ2u

If α0 denotes the minimum-χ2 solution, then

χ2(α) = χ2_min + Δχ2(a),   (A.1)

where a = α − α0 is the displacement from α0. Then, on the assumption of linearity,

Δχ2(a) = Σ_ij A_ij a_i a_j,   (A.2)

where the A_ij are the constant elements of the k × k curvature matrix (Press et al. 2007, p. 680), where k is the number of parameters. It follows that surfaces of constant χ2 are k-dimensional self-similar ellipsoids centered on α0.

Now, given the second assumption of normally-distributed measurement errors, the likelihood is

L(α) ∝ exp(−χ2(α)/2).   (A.3)

Thus, in the case of a uniform prior, the posterior mean of χ2 is

χ2u = ∫ χ2(α) exp(−χ2/2) dα / ∫ exp(−χ2/2) dα.   (A.4)

Because surfaces of constant Δχ2 are self-similar, the k-dimensional integrals in Eq. (A.4) reduce to 1D integrals.

Suppose Δχ2 = 1 on the surface of the ellipsoid with volume V. If lengths are increased by the factor λ, then the new ellipsoid has

Δχ2 = λ^2 and volume λ^k V.   (A.5)

With these scaling relations, the integrals in Eq. (A.4) can be transformed into integrals over λ. The result is

⟨Δχ2⟩ = ∫_0^∞ λ^(k+1) e^(−λ^2/2) dλ / ∫_0^∞ λ^(k−1) e^(−λ^2/2) dλ,   (A.6)

where ⟨Δχ2⟩ denotes the posterior mean of Δχ2. The integrals have now been transformed to a known definite integral,

∫_0^∞ λ^z e^(−aλ^2) dλ = Γ(x) / (2 a^x),   (A.7)

where x = (z + 1) / 2. Applying this formula, we obtain

χ2u = χ2_min + k,   (A.8)

an exact result under the stated assumptions.
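The definite integral (A.7) and the resulting posterior mean of Δχ2 are easy to check numerically; the sketch below (illustrative code, not from the paper) compares (A.7) against direct quadrature and confirms that the ratio of the two 1D integrals equals k:

```python
import numpy as np
from scipy import integrate, special

# Left-hand side of (A.7): direct numerical quadrature.
def lhs(z, a):
    val, _ = integrate.quad(lambda lam: lam**z * np.exp(-a * lam**2), 0.0, np.inf)
    return val

# Right-hand side of (A.7): Gamma(x) / (2 a^x) with x = (z + 1) / 2.
def rhs(z, a):
    x = (z + 1.0) / 2.0
    return special.gamma(x) / (2.0 * a**x)

# Posterior mean of Delta chi^2 (here a = 1/2): ratio of the 1D integrals.
k = 4
mean_dchi2 = lhs(k + 1, 0.5) / lhs(k - 1, 0.5)   # equals k to quadrature accuracy
```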

All Figures

Fig. 1. Empirical PDF of χ2π derived from the analysis of 10^6 data pairs D1, D2 as described in Sect. 5.1. The dashed curve is the theoretical PDF for χ2u.
Fig. 2. Detection of bias in X1 with the χ2π statistic when X2 is analysed with a prior derived from X1. Values of χ2π (filled circles) and χ2u (open circles) are plotted against the bias parameter b/σ. The dashed lines are the 5% and 95% levels.
Fig. 3. Detection of inconsistency between X1 and X2. Values are plotted against the bias parameter b/σ. The dashed lines are the 0.1, 5, and 50% levels. Note that the horizontal scale differs from that of Fig. 2.
Fig. 4. Detection of tension between different experiments. Values are plotted against the bias parameter b/σ. The dashed lines are the 0.1, 5, and 50% levels.
