Bayesian model checking: A comparison of tests
Astrophysics Group, Blackett Laboratory, Imperial College London,
Prince Consort Road,
London SW7 2AZ, UK
email: l.lucy@imperial.ac.uk
Received:
14
December
2017
Accepted:
7
February
2018
Two procedures for checking Bayesian models are compared using a simple test problem based on the local Hubble expansion. Over four orders of magnitude, pvalues derived from a global goodnessoffit criterion for posterior probability density functions agree closely with posterior predictive pvalues. The former can therefore serve as an effective proxy for the difficulttocalculate posterior predictive pvalues.
Key words: methods: statistical / binaries: general
© ESO 2018
1 Introduction
In a recent paper (Lucy 2018; L18), straightforward procedures are proposed for Bayesian model checking.
Because all Bayesian inferences derive from the posterior probability density function Λ(αD), where α is the parameter vector and D is the data, these tests check the global goodnessoffit (GOF) to D provided by Λ. The proposed checks take the form of pvalues closely analogous to those familiar in frequentist analyses. The first test derives from the statistic (L18, Appendix C) for uncorrelated measurement errors. The second test derives from the statistic (L18, Appendix C.1) for correlated measurement errors.
The need for readilyapplied Bayesian model checking procedures has recently been persuasively ponted out by Ford  see Sect. 4.1 in Fischer et al. (2016). Ford attributes the frequent neglect of model checking by Bayesians to the absence of a “neat and tidy” test criterion for Bayesian analyses, in contrast to the χ^{2} criterion for frequentist analyses. For Bayesian model checking, Ford recommends the book by Gelman et al. (2013), which advocates posterior predictive checks.
In Sect. 6.4 of L18, a numerical experiment is briefly reported in which pvalues from posterior predictive checking are compared to those derived from the same data with the statistic. Given the importance of establishing the merits of the simpler approach provided by the statistic, this short paper presents the numerical experiment in some detail.
2 Bayesian pvalues
In this section, the basics of the pvalues to be compared are stated.
2.1 Posterior predictive pvalue
On the assumption that χ^{2} is an appropriate test quantity  other possibilities are discussed in Gelman et al. (2013), the steps required to obtain the posterior predictive pvalue are as follows:
 1)
Compute the posterior probability density function, given by
where π is the prior, and , the likelihood, is given by (2)
 2)
Randomly select a point α′ from the posterior density Λ(αD).
 3)
Compute a simulated data set D′ at α′ by randomly sampling Pr(Dα′).
 4)
Compute and record the values of χ^{2}(α′D′) and χ^{2}(α′D).
 5)
Repeat steps 2) to 5) times.
The resulting Bayesian pvalue is then (3)
2.2 pvalue from the statistic
Following earlier work (Lucy 2016; L16) on the statistical properties of the posterior mean of χ^{2} , the general form of the statistic is (Appendix C in L18) (4)
where k is the number of parameters and (5)
The pvalue corresponding to the statistic is derived from the χ^{2} distribution with ν = n − k degrees of freedom, where n is the number of measurements. Specifically, (6)
The theoretical basis of this pvalue is as follows: in the majority of astronomical applications of Bayesian methods, the priors are weak and noninformative. Typically, π(α) defines a box in αspace representing a search volume that the investigator is confident includes all points α with significant likelihood. In consequence, a weak prior has neglible effect on the posterior density Λ and therefore also on any resulting inferences.
In the weak prior limit, Eq. (1) simplifies to (7)
and the resulting posterior mean of χ^{2} is written as.
Now, in Appendix A of L16, it is proved that, for linearity in α and normallydistributed measurement errors, (8)
where is the minimum value of χ^{2} (αD). Since, under the stated assumptions, the minimumχ^{2} solution has a distribution with ν = n − k degrees of freedom, it follows that, with the extra assumption of a weak prior, the statistic defined in Eq. (4) is distributed as . This is the theoretical basis of the pvalue given in Eq. (6).
2.3 pvalue from the statistic
The statistic of Sect. 2.2 is appropriate when the measurement errors are uncorrelated. When errors are correlated, χ^{2} (αD) is replaced by (see Appendix C.1 in L18) (9)
where C is the covariance matrix and v is the vector of residuals. The statistic ψ^{2} reduces to χ^{2} when the offdiagonal elements of C^{−1} are zero – i.e., no correlations.
By analogy with the statistic , a statistic is derived from the posterior mean of ψ^{2}. Specifically, (10)
with corresponding pvalue (11)
The theoretical basis for this pvalue is essentially identical to that for the pvalue derived from . If we have linearity in α and normallydistributed errors, then in the weak prior limit (12)
and is distributed as with ν = n − k degrees of freedom.
Note that posterior predictive pvalues (Sect. 2.1) can similarly be generalized to treat correlated measurement errors. The quantity ψ^{2} replaces χ^{2} , and the random sampling to create simulated data sets can be accomplished with a Cholesky decompostion of the covariance matrix (e.g., Gentle 2009).
3 Test model
A simple 1D test model is now defined. This allows numerous simulated data sets to be created so that the pvalues computed according to Sects. 2.1 and 2.2 can be compared and any differeces established with high statistical confidence.
3.1 Hubble expansion
The null hypothesis is that the local Universe is undergoing an isotropic, cold Hubble expansion with Hubble parameter h_{0} = 70 km s^{−1} Mpc^{−1}. Moreover throughout the local Universe there exists a population of perfect standard candles of absolute bolometric magnitude M = −19.0. Finally, we suppose that the redshifts z of these standard candles are measured exactly, but that their apparent bolometric magnitudes m have normallydistributed measurement errors with σ_{m} = 0.3 mag.
3.2 Ideal samples
In order to test pvalues when the null hypothesis is correct, we need data sets consistent with the above description of the Hubble flow.
We define the local Universe as extending to z_{max} = 0.2, so that to a good approximation the geometry is Euclidean. If we suppose the standard candles to be uniformly distributed in space, a sample {z_{i}} is created with the formula (13)
where the x_{i} are independent random numbers ∈ (0, 1). The distances of these n standard candles are (14)
and their measured apparent bolometric magnitudes are (15)
where the z_{G} are independent gaussian variates from .
The n pairs comprise a data set D from which the posterior probability density Λ(hD) of the Hubble parameter h can be inferred. Moreover, the global GOF between Λ(hD) and D can be tested with the pvalues of Sects. 2.1 and 2.2.
3.3 Imperfect samples
In order to test pvalues when the null hypothesis is false, we need to modify either the model or the data. If we continue to make inferences based on the Hubble flow defined in Sect. 3.1, we can introduce a violation of the null hypothesis by contaminating the data sample with an unrecognized second class of standard candles with a different absolute bolometric magnitude. Thus the data set D now comprises n_{1} pairs with given by Eq. (15) and n_{2} pairs with given by (16)
and where n = n_{1} + n_{2}. Thus, the sample is now contaminated with a second population of standard candles that are brighter by Δ M magnitudes. The investigator is unaware of this, and so carries out a Bayesian analysis assuming an ideal sample.
4 Comparison of pvalues
The Bayesian pvalues defined in Sect. 2 are now computed for simulated data samples derived as described in Sect. 3.
Fig. 1 Comparison of pvalues. For J = 200 ideal samples D_{j}, is plotted against log p_{B}. 

Open with DEXTER 
4.1 Null hypothesis correct
In order to compare pvalues in this case, J independent ideal samples D_{j} are created as described in Sect. 3.2, and then pvalues for each D_{j} computed as described in Sects. 2.1 and 2.2.
With a constant prior π(h) in the interval (h_{1}, h_{2}) with h_{1} = 64 and h_{2} = 76 km s^{−1} Mpc^{−1}, the posterior density Λ(hD_{j}) is obtained from Eq. (1) with the scalar h replacing the vector α.
The posterior predictive pvalue for D_{j} is computed as described in Sect. 2.1. First, a value h′ is derived byrandomly sampling Λ(h, D_{j}). This is achieved by solving the equation (17)
where x_{ℓ} is a random number ∈ (0, 1). Second, an ideal sample D′ for Hubble parameter h′ is computed according to Sect. 3.2. Third, the GOF values χ^{2} (h′D′) and χ^{2} (h′D_{j}) are computed, where (18)
where the predicted apparent bolometric magnitude is (19)
These steps are repeated times with independent random numbers x_{ℓ} in Eq. (17). The posterior predictive pvalue for the ideal sample D_{j} is then given by Eq. (3).
The corresponding pvalue for D_{j} from the statistic is given by Eq. (4), where (20)
with χ^{2}(h, D) from Eq. (18).
The calculations described in Sects. 4.1 and 4.2 are carried out for J = 200 ideal samples D_{j} , with D_{j} comprising n = 200 pairs . In calculating posterior predictive pvalues, we take randomly selected Hubble parameters h′ for each D_{j} in order to achieve high precision for p_{B}.
The results are plotted in Fig. 1. This reveals superb agreement between the two pvalues with no outliers. The mean value of Δlog p = 2.3 × 10^{−3} confirms that there is almost exact agreement. Accordingly, this experiment indicates that the readilycalculated is an almost exact proxy for the posterior predictive pvalue p_{B} , the calculation of which is undoubtedly cumbersome.
Fig. 2 Comparison of pvalues for imperfect samples. Each imperfect sample contains 180 standard candles of magnitude M and 20 of magnitude M −ΔM with Δ M = 0.5. 

Open with DEXTER 
Fig. 3 pvalues for imperfect samples. Each imperfect sample contains 180 standard candles of magnitude M and 20 of magnitude M −ΔM. For each ΔM, the p values are computed for 200 independent samples. 

Open with DEXTER 
4.2 Null hypothesis false
In order to compare pvalues when p≲0.001, J = 200 independent imperfect samples D_{j} are created as described in Sect. 3.3. Again and n = 200, but now n =n_{1} + n_{2}, with n_{1} = 180 perfect standard candles and n_{2} = 20 belonging to a second population brighter by ΔM = 0.5 mag. With these changes, the calculations described above are repeated.
The results are plotted in Fig. 2. Because the null hypothesis is false, pvalues < 0.01 occur frequently, so that the comparison provided by Fig. 1 now extends to p≲10^{−4}. However, at these small pvalues, the values of p_{B} are subject to noticeable Poisson noise even with . Because of this, Δlog p = 3.2 × 10^{−2}, somewhat larger than in Sect. 4.1 but neverthless again confirming the excellent agreement between the two pvalues.
In Fig. 3, the behavior of the pvalues derived fromthe statistic as a function of ΔM is plotted. With increasing ΔM, the points shift to smaller pvalues, increasing the likelihood that the investigator will conclude that the Bayesian model is suspect.
5 Conclusion
The aim of this paper is to investigate the performance of a recenly proposed global GOF criterion for posterior probability densities functions. To do this, a toy model of the local Hubble expansion is defined (Sect. 3) for which both ideal (Sect. 3.2) and flawed (Sect. 3.3) data samples can be created . The pvalues derived for such samples from are then compared to the pvalues obtained with the wellestablished procedure of posterior predictive checking. The two pvalues are found to be in close agreement (Figs. 1 and 2) throughout the interval (0.0001, 1.0), thus covering the range of pvalues expected when a Bayesian model is supported (p≳0.01) to those when the model is suspect (p≲0.01).
Because carrying out a posterior predictive check is cumbersome or even infeasible, the pvalue derived from can serve asan effective proxy for posterior predictive checking. Of course, this recommendation derives from just one statistical experiment. Other investigators may wish to devise experiments relevant to different areas of astronomy.
References
 Fischer, D. A., AngladaEscude, G., Arriagada, P., et al. 2016, PASP, 128, 6001 [NASA ADS] [CrossRef] [Google Scholar]
 Gelman, A., Carlin, J. B., Stern, H. S., et al. 2013, Bayesian Data Analysis (Boca Raton, FL: CRC Press) [Google Scholar]
 Gentle, J. E. 2009, Computational Statistics (New York: Springer) [CrossRef] [Google Scholar]
 Lucy, L. B. 2016, A&A, 588, A19 (L16) [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Lucy, L. B. 2018, A&A, in press, DOI: 10.1051/00046361/201732145 (L18) [Google Scholar]
All Figures
Fig. 1 Comparison of pvalues. For J = 200 ideal samples D_{j}, is plotted against log p_{B}. 

Open with DEXTER  
In the text 
Fig. 2 Comparison of pvalues for imperfect samples. Each imperfect sample contains 180 standard candles of magnitude M and 20 of magnitude M −ΔM with Δ M = 0.5. 

Open with DEXTER  
In the text 
Fig. 3 pvalues for imperfect samples. Each imperfect sample contains 180 standard candles of magnitude M and 20 of magnitude M −ΔM. For each ΔM, the p values are computed for 200 independent samples. 

Open with DEXTER  
In the text 