B eyond P lanck IV. On end-to-end simulations in CMB analysis — Bayesian versus frequentist statistics

,


Introduction
High-fidelity end-to-end simulations play a critical role in the analysis of any modern CMB experiment for at least three important reasons.Firstly, during the design phase of the experiment, simulations are used to optimize and forecast the performance of a given experimental design, and ensure that the future experiment will achieve its scientific goals (e.g., LiteBIRD Collaboration et al. 2022).Secondly, simulations are essential for validation purposes, as they may be used to test data-processing techniques as applied to a realistic instrument model.Thirdly, realistic end-to-end simulations play an important role in bias and error estimation for traditional CMB analysis pipelines.
Simulations played a particularly important role in the data reduction of Planck, and massive efforts were invested in implementing efficient and re-usable analysis codes that were generally applicable to a wide range of experiments.This work started with the LevelS software package (Reinecke et al. 2015) and culminated with the Time Ordered Astrophysics Scalable Tools1 (TOAST), which was explicitly designed to operate in a massively parallel high-performance computing environment.TOAST was used to produce the final generations of the Planck Full Focal Plane (FFP) simulations (Planck Collaboration XII 2016), which served as the main error propagation mechanism in the Planck 2015 and 2018 data releases (Planck Collaboration I 2016, 2020).
For Planck, generating end-to-end simulations represented by far the dominant computational cost of the entire experiment, accounting for 25 million CPU in the 2015 data release alone.In addition, the production phase required massive amounts of human effort, in terms of preparing the inputs, executing the runs, and validating the outputs.It is of great interest for any future experiment to optimize and streamline this simulation process, and reuse both validated software and human work whenever possible.
In this respect, the BeyondPlanck end-to-end Bayesian analysis framework (BeyondPlanck 2022) offers a novel approach to generating CMB simulations.While the primary goal of this framework is to draw samples from a full joint posterior distribution for analysis purposes, it is useful to note that the foundation of this approach is simply a general and explicit parametric model for the full time-ordered data (TOD).When exploring the full joint posterior distribution, this model is compared with the observed data in TOD space.The analysis phase is as such numerically equivalent to producing a large number of TOD simulations, and comparing each of these with the actual observed data.In this framework, each step of the analysis and simulation pipelines are thus fully equivalent, and the primary difference is simply whether the input model parameters are assumed to be constrained by the data or not.This latter observation is in fact a key point regarding endto-end simulations for CMB experiments in general, and a main goal of the current paper is to clarify the importance of choosing input parameters for a given simulation appropriately.Specifically, we argue in this paper that two fundamentally different choices are available; one can either choose parameters that are constrained directly by the observed data (as is traditionally done for the CMB Solar dipole or astrophysical foregrounds), or one can choose parameters that are independent from the observed data (as is traditionally done for CMB fluctuations or instrumental noise).We further argue that this choice will have direct consequences for what scientific questions the resulting simulations are optimized to address.
It is important to note that these ideas were discussed broadly, but not systematically, within the Planck community before building the FFP simulations.For instance, one proposal was to base the large-scale CMB temperature fluctuations at ≤ 70 from constrained WMAP realizations (Bennett et al. 2013), and thereby integrate knowledge about the real sky into the simulations.Another proposal was to use the actually observed LFI gain measurements to generate the simulations.A third and long-standing discussion revolved around which values to adopt for the CMB Solar dipole.
The BeyondPlanck framework offers a novel systematic view on these questions, as our Bayesian approach provides for the first time statistically well-defined constrained realizations for all parameters in the sky model, and not only a small subset.Furthermore, when comparing the correlation structures that arise from the posterior samples with those derived from traditional simulations, obvious and important differences appear, both in terms of frequency maps (Basyrov et al. 2022) and CMB maps (Colombo et al. 2022).
The first main goal of the current paper is to explain these differences intuitively, and in that process we introduce the concepts of "Bayesian simulations" and "frequentist simulations".Bayesian simulations are identical to the posterior samples described by BeyondPlanck (2022), and represent simulations that are constrained by the observed data.In contrast, frequentist simulations are unconstrained by the data.
The second main goal of this paper is simply to demonstrate in practice how the BeyondPlanck machinery may be used to generate frequentist simulations, on a similar footing as TOAST, and we will use these simulations for one important application, namely code validation; as discussed by Galloway et al. (2022a) and Gerakakis et al. (2022), the Commander code that forms the computational basis of the BeyondPlanck pipeline is explicitly designed to be re-used for a wide range of experiments.It is therefore critically important that this implementation is thoroughly validated with respect to statistical bias and uncertainties, and we do that by analyzing well-controlled simulations in this paper.
The rest of the paper is organized as follows.We first provide a brief overview of the BeyondPlanck framework and data model in Sect. 2. In Sect.3, we introduce the concept of Bayesian and frequentist simulations, and we discuss their difference.In Sect. 4 we describe the input parameters and simulation configuration used in this paper, before using these simulations to validate the BeyondPlanck implementation in Sect. 5. We conclude in Sect.6.

BeyondPlanck data model and Gibbs sampler
As described in BeyondPlanck (2022) and its companion papers, the single most fundamental component of the BeyondPlanck framework is an explicit parametric model that is to be fitted to raw TOD that includes instrumental, astrophysical, and cosmological parameters.For the current analysis, this model takes the Article number, page 2 of 15 Planck Collaboration: Bayesian versus frequentist simulations following form, sample, p denotes a single pixel on the sky, and c represents one single astrophysical signal component.Furthermore, d j,t denotes the measured data; g j,t denotes the instrumental gain; P tp, j is a pointing matrix; B pp , j denotes beam convolution with either the (symmetric) main beam, the (asymmetric) far sidelobes, or the full 4π beam response; M c j (β p , ∆ bp ) denotes the so-called mixing matrix, which describes the amplitude of component c as seen by radiometer j relative to some reference frequency when assuming some set of bandpass correction parameters ∆ bp ; a c p is the amplitude of component c in pixel p; s orb j,t is the orbital CMB dipole signal, including relativistic quadrupole corrections; s fsl j,t denotes the contribution from far sidelobes; s 1Hz j,t denotes the contribution from electronic 1 Hz spikes; n corr j,t denotes correlated instrumental noise; and n w j,t is uncorrelated (white) instrumental noise.The sky model, denoted by the sum over components, c, in the above expression may be written out as an explicit sum over CMB, synchrotron, free-free, AME, thermal dust, and point source emission, as described by Andersen et al. (2022);Svalheim et al. (2022b).
On the instrumental side, the correlated noise is associated with a covariance matrix, N corr = n corr (n corr ) T , which may be approximated as piecewise stationary, and with a Fourier space power spectral density (PSD), N f f = P( f )δ f f , that for Beyond-Planck consists of a sum of a classic 1/ f term and a log-normal term (Ihle et al. 2022), We define ξ n = {σ 0 , α, f knee , A p } as a composite parameter that is internally sampled iteratively through an individual Gibbs step, as described by Ihle et al. (2022); the peak location and width parameters of the log-normal term, f p and σ dex , are currently fixed at representative values.
Denoting the set of all free parameters in Eqs. ( 1)-( 2) by ω, we can simplify Eq. (1) symbolically to The BeyondPlanck approach to CMB analysis simply amounts to mapping out the posterior distribution as given by Bayes' theorem, where P(d | ω) ≡ L(ω) is called the likelihood, P(ω) is some set of priors, and P(d), the so-called evidence, is effectively a normalization constant for purposes of evaluating ω.The likelihood is easily defined, and given by Eq. (3) under the assumption that n w j is Gaussian distributed, The prior is less well-defined, and we adopt in practice a combination of informative and algorithmic priors in the Beyond-Planck analysis (see BeyondPlanck (2022) for an overview).
To explore this distribution by Markov Chain Monte Carlo, we use the following Gibbs sampling chain (BeyondPlanck 2022), where the symbol ← denotes setting the variable on the left-hand side equal to a sample from the distribution on the right-hand side.

Bayesian versus frequentist simulations
End-to-end TOD simulations have become the de-facto industry standard for producing robust error estimates for high-precision experiments (e.g., Planck Collaboration XII 2016), and the data model defined in Eqs. ( 1)-( 2) represents a succinct simulation recipe for producing such simulations: If ω is assumed to be perfectly known, then these equations can be evaluated in a forward manner without the need for parameter estimation or inversion algorithms, and the only stochastic terms are the correlated and white noise, both of which can be easily generated by a combination of standard random Gaussian number generators and Fourier transforms.However, in practice ω is of course not perfectly known, and precisely how ω is specified has direct and strong implications regarding what the resulting simulations can inform the user about; for an example of this within the context of Planck LFI, see Basyrov et al. (2022).In short, the key discriminator is whether ω is defined using real observed data (and in practice drawn from the posterior distribution, P(ω | d)) or whether it is drawn from a data-independent hyper-distribution, for instance informed by theoretical models and/or ground-based laboratory measurements.We will refer to these two approaches as "Bayesian" and "frequentist" respectively, indicating whether or not they condition on the true data in question.
We note that both Bayesian and frequentist simulations specifically refer to time-ordered data in the current paper, not pixelized maps or higher-level products.That is, we distinguish between simulation pipelines, which transform ω into timelines, and analysis pipelines, which transform timelines into higherordered products, such as maps and power spectra.

Bayesian versus frequentist statistics
Before comparing the two simulation types through a few worked examples, it is useful to recall the fundamental difference between Bayesian and frequentist statistics, which may be summarized as follows: In frequentist statistics, the model M and its parameters ω are considered to be fixed and known, while the data d are considered to be the main uncertain quantity.In Bayesian statistics, on the other hand, d is assumed to be perfectly known, and essentially defined by a list of numbers recorded by a measuring device, while ω is assumed to be the main unknown quantity.This difference has important consequences for how each framework typically approaches statistical inference, and which questions they are most suited to answer.This is perhaps most easily illustrated through their most typical mode of operations.First, the classical frequentist approach to statistical inference is to construct an ensemble of simulated data sets, d i , each with parameters drawn independently from M(ω).The next step is to define some statistic, γ(d i ) : R N → R, that isolates and highlights the important piece of information that the user is interested in; widely used CMB examples include χ 2 statistics, angular power spectrum statistics, or non-Gaussianity statistics.Finally, one computes γ both for the simulations and the actual data, and determines the relative frequency for which γ(d real ) < γ(d i ), which is often called the p-value or "probability-to-exceed" (PTE).Values between, say, 0.025 and 0.975 are taken to suggest that the data are consistent with the model, while more extreme values indicate a discrepancy.
Given this prescription, it is clear that the frequentist approach is particularly suited for model testing applications; it intrinsically and directly addresses the question of whether the data are consistent with the model.As such it has been widely used in the CMB field for instance for studies of non-Gaussianity and isotropy.In this case, the null-hypothesis is easy to specify, namely that the universe is isotropic and homogeneous, and filled with Gaussian random fluctuations drawn from a ΛCDM universe with given parameters.Establishing some statistic that shows that the observed data are inconsistent with this hypothesis would constitute evidence of new physics, and is as such a high-priority scientific target.
In contrast, Bayesian statistics takes a fundamentally different approach to statistical inference.In this case, we consider ω to be a stochastic and unknown quantity, and want to understand how the observed data constrains ω.The most succinct summary of this is the posterior probability distribution itself, P(ω | d), and the starting point for this framework is therefore Bayes' theorem as given in Eq. ( 4).The majority of applications of modern Bayesian statistics thus simply amounts to mapping out P(ω | d) as a function of ω by any means necessary.
At the same time, it is important to note that the likelihood L(ω) = P(d | ω) on the right-hand side of Eq. ( 4) is a fully classical frequentist statistic, in which ω is assumed to be perfectly known, and the data are uncertain.Still, it is important to note that the free parameter in L(ω) is indeed ω, not d, and L itself is really just a frequentist statistic that measures the overall goodness-of-fit between the data and the model.This statistic may then be used to estimate ω within a strictly frequentist framework; one popular example of this within the CMB field are so-called profile likelihoods.
Likewise, the Bayesian approach is also able to address the model selection problem, and this is most typically done using the evidence factor, P(d), in Eq. 4. The importance of this factor becomes obvious when explicitly acknowledging that all involved probability distributions in Eq. ( 4) actually depend on the overall model M, and not only the individual parameter values, Mathematically, P(d | M) is simply given by the average likelihood integrated over all allowed parameter values, and classical Bayesian model selection between models M 1 and M 2 proceeds simply by evaluating P(d | M 1 )/P(d | M 2 ); the model with the higher evidence is preferred.
In summary, the foundational assumptions underlying frequentist and Bayesian methods are different and complementary, and they fundamentally address different questions.Frequentist statistics are ideally suited to address model testing problems (e.g., "is the observed CMB sky Gaussian and isotropic?"),while Bayesian statistics are ideally suited to address parameter estimation problems (e.g., "what are the best-fit ΛCDM parameters?").At the same time, this dichotomy is by no means absolute, and either framework is fully capable of addressing both types of questions if they are carefully posed.Fig. 2: Single column of the low-resolution 30 (top section), 44 (middle section), and 70 GHz (bottom section) frequency channel covariance matrix, as estimated from 300 LFI DPC FFP10 frequentist simulations (left column); from 300 PR4 frequentist simulations (middle column); and from 3200 BeyondPlanck Bayesian simulations (right column).The selected column corresponds to the Stokes Q pixel number 100 marked in gray, which is located in the top right quadrant.All covariance matrices are constructed at N side = 8.Note that the Planck PR4 30 GHz covariance slice has been divided by a factor of 5, and it is therefore even stronger than the color scale naively implies.

Constrained versus random input parameters in CMB simulations
We now return to the issue raised in the introduction to this section, namely how to properly choose ω for CMB inference based on end-to-end simulations.As discussed by Basyrov et al. (2022), essentially all CMB analysis pipelines prior to Beyond-Planck have adopted a mixture of data-constrained and dataindependent parameters for this purpose.Key examples of the former are the CMB Solar dipole and Galactic foregrounds, both of which are strongly informed by real measurements.Correspondingly, classical examples of the latter are CMB fluctuations, which are typically drawn as Gaussian realizations from a ΛCDM power spectrum, and instrumental noise, which is often based on laboratory measurements.In our notation, these simulations qualify thus neither as pure Bayesian nor pure frequentist, but rather as a mixture of the two.
In contrast, each sample of ω produced by the BeyondPlanck Gibbs chain summarized in Eqs. ( 6)-( 13) represents one possible simulated realization in which all sub-parameters in ω are determined exclusively by the real posterior distribution; not only the CMB dipole and Galactic model, but also those parameters that are traditionally chosen from external sources in classical pipelines, such as the CMB anisotropies and the specific noise realization.
The difference between these two types of simulation inputs is illustrated in Fig. 1 which compares ten independent frequentist time-domain realizations (red curves) with ten independent Bayesian realizations (black curves).The top and bottom panels show the correlated noise n corr and the sky model s sky , respectively, both plotted as a function of time.Starting with the frequentist simulations, we see that these are entirely uncorrelated between realizations, and scatter randomly with some model-specific mean and variance.In particular, the frequentist simulations include so-called cosmic variance, i.e., independent realizations have different CMB and noise amplitudes and phases, even if they are drawn from the same underlying stochastic model.In contrast, Bayesian simulations do not include cosmic variance, but rather focus exclusively on structures in the real data.For the sky signal component shown in the top panel of Fig. 1, this is seen in terms of two different aspects.First, the structure of all ten realizations follow very closely the same overall structure, and this is defined by the specific CMB pattern of the real sky.However, they also explicitly account for the uncertainty in the sky value at each pixel, and this is seen by the varying width of the black band; in the middle of the plot, the width is small, and this implies that the sky has been well measured here (due to deep scanning), while along the edges of the plot the width is larger, and this implies that the sky as been less well measured.The variation between Bayesian simulations thus directly quantify the uncertainty of the true data.Intuitively speaking, this point may be summarized as follows: Uncertainties measured by frequentist simulations quantify the expected variations as observed with a random instrument in a random universe, while Bayesian simulations quantify the expected variations of the real instrument in the real universe.
These intuitive differences translate directly into both qualitatively and quantitatively different ensemble properties for the resulting simulations, and correspondingly also into different resulting error estimates.As a real-world illustration of this, Fig. 2 shows slices through the empirical low-resolution polarization covariance matrix computed for each of the three Planck LFI frequency channels using three different generations of LFI simulations, namely (from left to right columns) Planck 2018 (Planck Collaboration II 2020), Planck PR4 (Planck Collaboration Int.LVII 2020), and BeyondPlanck (BeyondPlanck 2022).Row sections show results for the 30, 44, and 70 GHz channels, respectively, and within each section the two rows show the QQ and UQ segments of the full matrix, sliced through Stokes Q pixel number 100, marked in gray in the upper right quadrant.Each covariance matrix is computed by first downgrading each simulation to a HEALPix2 (Górski et al. 2005) resolution of N side = 8, and averaging the outer product over all available realizations; see Basyrov et al. (2022);Colombo et al. (2022) for further details.Effectively, these matrices visually summarize the mapspace uncertainty estimates predicted by each simulation set.
Starting with the Planck 2018 simulations, the most striking observation is that these empirical matrices are very noisy for all three frequency channels.This is partly a reflection of the fact that only 300 simulations were actually constructed, and this leads to a high Monte Carlo uncertainty.However, it is also a reflection of the fact that these simulations are largely frequentist based, with both random CMB and noise properties.Furthermore, the gains that were assumed when generating these simulations exhibited significantly less structure than the real observations.In sum, there are relatively little common structures between the various realizations, either from the astrophysical sky, the instrumental noise, or the gain, and the corresponding covariance structures are therefore weak.Visually speaking, perhaps the most notable feature is a positive correlation from correlated noise along the scanning direction that passes through the sliced pixel seen in the upper right quadrant, but these are significantly obscured by Monte Carlo uncertainties.
Proceeding to the Planck PR4 simulations summarized in the middle column, we now see very strong coherent structures for the 30 GHz channel, while the 44 and 70 GHz channels behave similarly to the 2018 case.The explanation for this qualitative difference is the Planck PR4 calibration algorithm; in this pipeline, the 30 GHz channel is calibrated independently without the use of supporting priors, while the 44 and 70 GHz channels are calibrated by using the 30 GHz channel as a polarized foreground prior.The net effect of this independent calibration procedure is a very high calibration uncertainty for the 30 GHz channel, and these couple directly to the true CMB dipole, which is kept fixed between all simulations.The result is the familiar large-scale pattern seen in this figure, which has been highlighted by several previous analyses as a particularly difficult mode to observe with Planck (e.g., Planck Collaboration II 2020; Gjerløw et al. 2022;Watts et al. 2022).
Turning to the BeyondPlanck simulations summarized in the right column, we now see coherent and signal-dominated structures across the full sky in all frequency channels.A part of this is simply due to more realizations than for the other two pipelines -in this case 3200 -but even more importantly is the fact that the simulations are now entirely data-driven.That is, they correspond to the black curves in Fig. 1, while the previous pipelines correspond to the red curves.In practice, this has two main effects.First, it implies that the total parameter volume that needs to be explored by Monte Carlo sampling is intrinsically smaller, simply because the posterior distribution does not include cosmic variance; the simulations only need to describe our instrument and universe, not any instrument and universe, and this is a much smaller sub-set.Second, and even more importantly, the Bayesian simulations account naturally for nonlinearity between the various parameters, and these are very often the dominant contributions in these distributions.As a con-crete example, if the gain happens to scatter either high or low during a given time period, then the total uncertainty estimate will be particularly sensitive to the CMB dipole during the same time period, and it will excite a correlation structure in these plots that is intimately connected to the satellite scanning strategy.Thus, if one chooses a gain profile that is independent of other parameters, then those real uncertainties will not be properly accounted for in the simulation set: Intuitively speaking, the hot and cold spots in the covariance matrices shown in Fig. 2 will either appear in the wrong places, or be suppressed when averaging over independent realizations.In general, specifying the instrumental model at a sufficiently realistic level represents a real challenge for frequentist simulations, and great care is required in order to capture the full error budget.This task is considerably simplified in the Bayesian approach, as each instrumental parameter is defined directly from the data themselves.

Simulation specification
Returning to the data model summary in Sect.2, we note that the Commander3 code described by Galloway et al. (2022a), and used by the BeyondPlanck project to perform Bayesian end-toend analysis of the Planck LFI data, is able to produce both frequentist and Bayesian simulations essentially without modifications; the only question is whether the parameters used to generate the TOD, ω, are drawn from the posterior distribution, or whether they are selected from a data-independent hyperdistribution.Choosing which type of simulations to generate is thus only a matter of selecting proper initialization values in the Commander3 parameter file.
In this paper, we demonstrate the frequentist mode of operation by generating a set of classical frequentist simulations with Commander3, and we then use these to validate the novel low-level processing algorithms introduced by Keihänen et al.We note that the original BeyondPlanck analysis required 670 000 CPU-hours to generate 4000 full Gibbs samples for the full LFI dataset, which took about three months of runtime to complete.In the current paper, we are primarily interested in validating the low-level algorithms themselves, and we therefore choose to consider only one year of 30 GHz observations in the following (corresponding to about 10,000 Planck pointing periods (PIDs), each lasting for about one hour; Planck Collaboration I 2014), rather than the full LFI dataset, and this reduces the computational cost from 169 to 2.5 CPU-hours per Gibbs sample (Galloway et al. 2022a).As a result, we are able to produce individual chains with 10 000 samples within a matter of days, rather than months or years, which is useful for convergence analyses.This also reduces the total volume of the TOD themselves (not including pointing, flags, etc.) from 638 GB to 22 GB, and the simulations may therefore be run on a much broader range of hardware.In fact, subsets of the following simulations have been produced on more than ten different computing systems all over the world, using both AMD and Intel processors (e.g., Intel E5-2697v2 2.7 GHz, Intel Xeon E5-2698 2.3 GHz, Intel Xeon W-2255 3.7 GHz, AMD Ryzen 9 3950X 2.2 GHz), with between 128 GB and 1.5 TB RAM per node, and using both Intel and GNU compilers.From top to bottom, the various panels show 1) one pixel value of the CMB component map m CMB ; 2) one pixel of the correlated noise map m n corr ; 3) the temperature quadrupole moment, a 2,0 ; 4) the PID-averaged total gain g; and 5-8) the PID-averaged noise PSD parameters σ 0 , f knee , α, and A p /σ 0 .In panels with multiple lines, the various colors show Stokes T , Q, and U parameters.
In panels with gray bands, the black line shows results averaged over all PIDs, and the band shows the 1 σ variation among PIDs.The dashed red line marks a correlation coefficient of 0.1, which is used to define the typical correlation length of each parameter.
Given that we will only consider low-level processing of the 30 GHz channel, we simplify the data model in Eq. ( 1 = s tot j,t + n corr j,t + n w j,t .
That is, we only include one single sky component, namely the CMB, and we ignore sub-dominant effects such as far sidelobe corrections, 1 Hz electronic spikes, etc.As such, this configuration provides a test of the gain, noise estimation, and mapmaking parts of the full algorithm, but not the component separation or cosmological parameter estimation.
The CMB sky realizations used in the following analysis are drawn from the best-fit Planck 2018 ΛCDM model (Planck Collaboration V 2020) using the HEALPix 4 (Górski et al. 2005) synfast utility.All instrumental parameters are drawn from different realizations of the BeyondPlanck ensemble presented in BeyondPlanck (2022), and these are taken as true input values in the following.
For the noise terms, we draw a random Gaussian realization of n j,t = n corr j,t + n w j,t with the noise PSD model given in Eq. ( 2).This is done independently for each Planck pointing ID (PID), and the noise PSD parameters are thus varying in time with the same structure as the real observations.

Validation of low-level processing algorithms
To validate the noise and gain estimation and mapmaking steps in Commander3 , we analyze the frequentist simulations described above with the same Bayesian framework as used for the main BeyondPlanck processing, and compare the output marginal posterior distributions with the known true inputs.To quantify both biases and the accuracy of the uncertainty estimates, we adopt the following normalized residual, where µ ω and σ ω are the posterior mean and standard deviation for parameter ω.For a truly Gaussian posterior distribution with no bias and perfect uncertainty estimation, this quantity should be distributed according to a standard normal distribution with zero mean and unit variance, N(0, 1), while a non-zero value of δ indicates a bias measured in units of σ.It is of course important to note that the full data model in Eq. ( 1) is highly nonlinear due to the presence of the gain; therefore, the deviations from N(0, 1) at some level are fully expected, in particular for signal-dominated quantities.Still, we find that δ serves as a useful quality monitor.Unless otherwise noted, the main results presented in the following are derived from a single Markov chain comprising 10 000 samples.Where useful for convergence and mixing assessment, we will also use shorter and independent chains, typically with 1000 samples in each chain.

Markov auto-correlations
We are also interested in studying the statistical properties of individual Markov chains in terms of correlation lengths, degeneracies, and convergence.We define the Markov chain autocorrelation for a given chain as where i denotes Gibbs sample number, and ∆ is a chain lag parameter which denotes sample separation.
Figure 3 shows the auto-correlation for a typical set of parameters.The top four panels show (1) a single CMB map pixel (in T , Q, and U); (2) a single correlated noise map pixel (in T , 4 http://healpix.jpl.nasa.govQ, and U); (3) the CMB temperature quadrupole moment, a 2,0 ; and (4) the gain for a single PID.These all have relatively short correlation lengths, which indicates that we are likely to produce robust results for these parameters.
In contrast, the parameters in the bottom four panels have very long correlation lengths, and these correspond to the four correlated noise PSD parameters within a single PID; σ 0 , f knee , α, and A p /σ 0 .As discussed by Ihle et al. (2022), the introduction of the log-normal noise term greatly increases degeneracies and correlations among these parameters as compared to a standard 1/ f noise profile, and this makes proper estimation of these parameters much more expensive.However, it is also important to note that this is only a challenge regarding the estimation of the individual noise PSD parameters; the full PSD as a function of frequency, P n ( f ), is insensitive to these degeneracies, and that function is the only thing that is actually propagated to the rest of the system.This explains why the long correlations seen in the lower half of the plot do not excite long correlations also among the (far more important) parameters in the top half of the plot.
In fact, the single most important parameter in the entire system is the CMB map, shown in the first (for individual pixels) and third (for the quadrupole moment, a 2,0 ) panels.Indeed, the correlation length is very short or even non-existent for single pixels.This is primarily due to the fact that this map is strongly dominated by white noise on a single-pixel scale for the setup we consider here.As seen in the third panel, the same does not hold true for the quadrupole moment, in which case the correlation is in fact higher than 0.3 at a lag of ∆ = 25.The main driver for this is the gain, as shown in the fourth panel.While the gain is dominated by white noise on short time-scales (as seen by the quick drop-off between lags of 1 and 2), there is a slow drift at higher lags.This is caused by a partial degeneracy between the CMB map (which acts as a calibration source in this framework, anchored by the orbital dipole) and the overall gain.In the real BeyondPlanck analysis, this degeneracy is mitigated to a large extent by analyzing all LFI channels jointly, and also by including WMAP observations to break important low-polarization degeneracies (Gjerløw et al. 2022;Basyrov et al. 2022).Still, even with those additions there are important long-term drifts in the largest CMB temperature scales, and these have non-negligible consequences for the statistical significance of low-CMB anomalies (Colombo et al. 2022).

Posterior distribution overview
Next, to build intuition regarding the full set of recovered parameters, we show in Fig. 4 marginal one-and two-dimensional posterior distributions for a small set of parameters for two different PIDs.In each panel, the true input values are shown as dashed lines.The bottom triangle (blue) show posterior results for one well-behaved PID with good goodness-of-fit statistics, while the top triangle (orange) shows a less well behaved case in which the true input values are at the edge of recovered distributions.Together, these two cases represent the majority of all PIDs in terms of overall behaviour.
Overall, the true input parameters are recovered reasonably well in most cases.One of the parameters that is less well recovered is the white noise amplitude, σ 0 .This parameter is a special case due to the sampling algorithm currently used in the BeyondPlanck pipeline.As described by Ihle et al. (2022), σ 0 is currently determined as the standard deviation of all pairwise differences between neighboring time samples divided by √ 2. While this is a commonly used technique in radio astronomy to 0 .9 0 0 .7 5 0 .6 0 0 .8 1 .6 A p / 2 0 0 .9 0 .7 0 .6 0 . 1 PID 3003 PID 5515  9)-( 12) the four correlated noise parameters, ξ n = {σ 0 , f knee , α, A p }.Note that the one-dimensional histograms of the first seven parameters are completely overlapping since these parameters are independent of PID.
derive an estimate of the white noise that is highly robust against unmodelled systematic errors, it does not correspond to a proper sample from the true conditional distribution P(σ 0 | d, g, . ..).
In particular, this approach underestimates the true fluctuations of σ 0 , which in turn results in the overall uncertainties being slightly underestimated.This is one of several examples in the pipeline in which robustness to systematic effects comes at a cost of statistical rigor.At the same time, it is important to note that the absolute white noise level is in general very well determined in these data (Ihle et al. 2022), and a slight under-estimation of the uncertainty in σ 0 has little practical impact on other parameters in the model.
0 .9 0 0 .7 5 0 .8 1 .6 A p / 2 0 0 .9 0 Fig. 5: Comparison of partial posteriors distributions from multiple short chains for the quadrupole amplitude a 2,0 , the gain g, white noise level σ 0 , knee frequency f knee , correlated noise spectral index α, and log-normal noise amplitude A p .Each chain consists of 1000 samples.The posterior contours only span the range of the underlying samples, wherefore some are not closed.Looking more broadly at the two-dimensional distributions in this figure, we see that the parameters split naturally into two groups, defined by the short and long correlation lengths

Density
Fig. 8: Aggregate standard deviation normalized differences between the gain sample mean and the input gain values.For each PID t and detector i we calculate (g in t,i − g t,i )/σ t,i where g in is the input gain, g is the mean sample value and σ is the sample standard deviation.We then aggregate all of these values into the appropriate histogram.The red lines are ideal Gaussian distribution for comparison.Each subplot also lists the aggregate deviation from the expected mean of 0 with error bounds.discussed above.That is, the CMB, correlation noise, and gain parameters generally exhibit more symmetric distributions than the noise PSD distributions, which are highly correlated and non-Gaussian.Once again, this reflects the internal degeneracies among the noise PSD parameters.
To further illustrate the impact of the slow convergence rate for several of these parameters, Fig. 5 shows four partial chains, each with only 1000 samples, for a sub-set of these parameters.Once again, we see that the input values are reasonably well recovered for most cases, but each colored sub-distribution only cover a modest part of the full posterior volume.

Gain validation
Going into greater detail on individual parameters, we show in Fig. 6 a subset of the estimated gain as a function of Gibbs iteration for four selected PIDs, one for each radiometer.The red lines show the true input values.Here we visually observe the same behaviour as discussed above; on short time scales, these trace plots are dominated by random fluctuations, while on long time-scales there are still obvious significant drifts.sual level, the two curves agree well, without obvious evidence of systematic biases, and the uncertainties appear reasonable.These observations are made more quantitative in Fig. 8, which shows histograms of normalized residuals, δ g , over all PIDs.Red lines indicate the standard Gaussian N(0, 1) reference distribution.Once again, we see that the reconstruction appears good, as the nominal bias is at most 0.36 σ, and the maximum posterior width is 1.36 σ.From the shape of the histograms, it is also clear that a significant fraction of these variations is due Monte Carlo sample variance from the long gain correlation lengths.
Once again, we note that such deviations will decrease as the number of frequency bands included in the analysis increases, since the Solar CMB dipole, which is the main calibrator, will be much better constrained with more observations; the actual gain correlation lengths found for the real BeyondPlanck analysis are shown by Gjerløw et al. (2022), and are notably shorter than those of this reduced simulation.

Correlated noise posterior validation
Next we turn to the correlated noise component, and we start with the specific noise realization, n corr ; the correlated noise PSD parameters will be discussed separately in Sect.5.6.To simplify the visualization, we bin the correlated noise TOD into a sky map, as illustrated in Fig. 9.The top left panel shows the true input correlated noise map (temperature component only), while A visual inspection of the simulation input and posterior mean correlated noise maps indicates no obvious differences.In fact, the normalized residual map in the bottom right panel of Fig. 9 appears fully consistent with white noise.Once again, this observation is quantified more accurately in Fig. 10, where we compare the histogram of δ corr over all pixels with the usual N(0, 1) distribution for each of the three Stokes parameters; in each case, the agreement is excellent.

CMB map validation
Figures 11 and 12 show similar plots for the CMB sky map component.Once again, the normalized residual in the bottom right panel appears fully consistent with white noise over most of the sky -but this time, we actually see a power excess in δ CMB around the Ecliptic poles.These features correspond to regions of the sky that are particularly deeply observed by the Planck scanning strategy (Planck Collaboration I 2014).As a result of these deep measurements, the white noise in these regions is very low, and the total error budget per pixel is far more sensitive to the non-linear contributions in the system, in particular the coupling between the gain and the Solar dipole.
This effect does of course not only apply to the Ecliptic "deep fields", but to all signal-dominated map pixels at some level, and it therefore also applies to the full-sky CMB map in temperature.This statement is made more quantitative in the left panel of Fig. 12, where we see that the temperature histogram is very slightly wider than the reference N(0, 1) distribution.To be specific, the standard deviation of this distribution is about 1.15; at the same time, the mean of the distribution is consistent with zero, the non-linear couplings therefore do not introduce a bias, but only a higher variance.For the noise-dominated Stokes Q and U parameters, for which gain couplings are negligible on a per-pixel level, both distributions are perfectly consistent with N(0, 1).
Figure 13 shows Pearson's correlation coefficients between the CMB and correlated noise components for three selected pixels.Two of the pixels, marked '1' and '2', are located along the same Planck scanning ring near the Ecliptic plane, where the Planck scanning strategy is particularly poor.The third pixel is located far away from these, and on a different scanning ring.Here we see that correlations are very strong for Stokes parameters of the same type along the same ring, with correlation coefficients ranging between 0.5 and 0.8.These correlations are induced both by gain and correlated noise fluctuations, which are tightly associated with the Planck scanning rings.Stokes parameters of different types (e.g.I and Q) are significantly less correlated, typically with anti-correlation coefficients of ρ −0.25.Correlations between widely separated pixels are practically negligible in the current simulation setup, although for the real analysis, this is no longer true due to additional couplings from, for instance, astrophysical foregrounds, bandpass corrections, and sidelobes (Galloway et al. 2022b;Svalheim et al. 2022a;Basyrov et al. 2022;Colombo et al. 2022;Andersen et al. 2022;Svalheim et al. 2022b).

Correlated noise PSD validation
Finally, we consider the noise PSD parameters, σ 0 , f knee , α, and A p /σ 0 , and, as already noted, these are significantly harder to estimate individually than the previous parameters due to the strong correlation between the 1/ f and log-normal terms in Eq. (2).
As usual, we plot the reduced residual, δ, for each parameter type in Fig. 14, and in this case we see that the posterior distributions are significantly wider than standard Gaussian distribution, by as much as a factor of two.The distributions are also clearly non-Gaussian, with notable skewness and kurtosis.Both the excess variance and non-Gaussianity stem from the same degeneracies as discussed above, and are partially due to intrinsic non-Gaussianities in the model, and partially due to incomplete Monte Carlo convergence and very long correlation lengths.On the other hand, the mean bias in these distribution is small, and the estimated posterior distributions do provide a useful summary of each parameter individually.
As mentioned above, however, other parameters in the model are not sensitive to individual ξ n values, but only to the total noise PSD, P corr ( f ).This function is plotted in the top two panels of Fig. 15 for the same two PIDs and radiometers as shown in Fig. 4; the blue curves correspond to the well-measured PID, while the orange curve corresponds to the PID with the marginal fit.Faint lines in the top two panels show individual Gibbs samples, corresponding to different combinations of ξ n .By eye, the sampled values appear to span the true input reasonably well, although the orange line is on the lower edge of the estimated posterior distribution.
These visual observations are made more quantitative in the bottom two panels, where the third panel shows the fractional difference between the output and input PSD functions in units of percent, and the fourth panel shows the same in units of standard deviation of the PSD across Gibbs samples, σ.For the wellbehaved (blue) pixel, we see that the posterior mean matches the true input everywhere to within a few percent; in units of standard deviations, this is typically less than 2.5 σ for most of the region, except at frequencies above 10 Hz, where the estimated standard deviation is very small due, and the underestimation of the uncertainty in σ 0 becomes noticeable.For the less wellbehaved case, the recovered PSD is within 2 σ at all frequencies in units of standard deviations, or within 5 %.Overall, the PSD itself is recovered very well in both cases in absolute terms.

Conclusions
End-to-end time-ordered simulations play a key role in estimating both biases and uncertainties for current and future CMB experiments.To date, no other practical method has been able to account for the full and rich set of systematic errors that affect modern high-precision measurements.
As detailed by BeyondPlanck ( 2022) and its companion papers, the BeyondPlanck project has implemented a new approach to end-to-end CMB analysis in which a global parametric model is fitted directly to the time-ordered data, allowing for joint estimation of instrumental, astrophysical, and cosmological parameters with true end-to-end error propagation.This approach relies strongly on a sampling algorithm called Gibbs sampling, which allows the user to draw joint samples from a complex posterior distribution.Each of these Gibbs samples correspond essentially to one end-to-end TOD simulation, similar to those produced by classical CMB simulation pipelines, for instance the Planck full focalplane (FFP; Planck Collaboration XII 2016) simulations.
The fundamental difference between these two simulation pipelines lies in how to define the input parameters used to generate the simulation.In the BeyondPlanck approach, all parameters are constrained directly from the true data, and correspond as such to samples drawn from the full joint posterior distribution.In contrast, traditional pipelines uses parameters that are a mixture of data-constrained and data-independent parameters.Typical examples of the former include the CMB Solar dipole and Galactic foregrounds, while typical examples of the latter include CMB anisotropies and instrumental noise.In this paper, we have the two types of simulations for "Bayesian" or "frequentist", respectively, indicating whether or not they condition on the true data.
The difference between these two types of simulations has direct real-world consequences for what applications each simulation type is suitable for.As first argued by Basyrov et al. (2022), this may be intuitively understood through the following line of reasoning: Suppose one is tasked with constructing a new end-to-end simulation for a given experiment.Among the first decisions that needs to taken concerns the CMB Solar dipole: Should this correspond to the true dipole, or should it have a random amplitude and direction?If it is chosen randomly, then the hot and cold spots in the correlation matrices shown in Fig. 2 in this paper will appear at random positions on the sky, and eventually be washed out in an ensemble average.In practice, all current pipelines adopt the true CMB Solar dipole as an input.The next question is, what Galactic model should be used?Once again, if this is selected randomly, then the Galactic plane will move around on the sky from realization to realization.In practice, all current pipelines adopt a model of the true Galactic signal as an input.
The third question is, what CMB anisotropies should be used?At this point, all pipelines prior to BeyondPlanck have in fact adopted random CMB skies drawn from a theoretical ΛCDM model.This has two main effects: On the one hand, in the same way that randomizing the CMB dipole signal would average out any coherent correlations between the sky signal and the gain, randomizing the CMB anisotropies also average out, and non-linear correlations between these structures and the instrumental parameters are not accounted for.On the other hand, the resulting simulations do actually include so-called cosmic variance, i.e., for the scatter between individual CMB realizations.
Fourth and finally, the same question apply to all the instrumental parameters, perhaps most notably correlated noise and gain fluctuations: Should these be constrained by the real data, or should they be drawn randomly from a laboratory-determined hyper-distribution?
It is important to stress that none of these four questions have a "right" or "wrong" answer.However, whatever choice one makes, that choice will have direct consequences for what correlation structures appear among the resulting simulations, and therefore also for which applications they are suitable for.In particular, if the primary application is traditional frequentist model testing -for instance asking whether the CMB sky is Gaussian and isotropic -then it is critical to account for cosmic variance among the CMB realizations.For those applications, one must choose data-independent CMB inputs in order to capture the full uncertainties, and the appropriate choice are frequentist data-independent simulation inputs.
If, on the other hand, the main application is traditional parameter estimation, for instance as constraining the ΛCDM model, then the important point is to properly estimate the total CMB uncertainty per-pixel on the sky.In this case, it is critical to properly model all non-linear couplings between the actual sky signal, the true gain, the true correlated noise, etc.In this case, the appropriate choice are Bayesian data-dependent simulation inputs.
In this paper, we note that the novel Commander3 software is able to produce both frequentist and Bayesian simulations, simply by adjusting the inputs that are used to initialize the code.While the Bayesian simulation process has been described in detail in most of the BeyondPlanck companion papers, we present in the current paper a first application of the frequentist mode of operation by producing a data-independent time-ordered simulation corresponding to one year of 30 GHz data, and we then use this to validate three important low-level steps in the full Be-yondPlanck Gibbs samples, namely gain estimation, correlated noise estimation, and mapmaking.Doing so, we find that the recovered posterior distribution matches the true input parameters well.

3 3
The research presented in this paper was undertaken as a part of the Master-and PhD-level course called "AST9240 -Cosmological component separation" in 2021 at the University of Oslo, and individual students produced and analyzed simulations in their home institutions.

Fig. 3 :
Fig.3: Auto-correlation function, ρ, for selected parameters in the model, as estimated from a single chain with 10 000 samples.From top to bottom, the various panels show 1) one pixel value of the CMB component map m CMB ; 2) one pixel of the correlated noise map m n corr ; 3) the temperature quadrupole moment, a 2,0 ; 4) the PID-averaged total gain g; and 5-8) the PID-averaged noise PSD parameters σ 0 , f knee , α, and A p /σ 0 .In panels with multiple lines, the various colors show Stokes T , Q, and U parameters.In panels with gray bands, the black line shows results averaged over all PIDs, and the band shows the 1 σ variation among PIDs.The dashed red line marks a correlation coefficient of 0.1, which is used to define the typical correlation length of each parameter.

Fig. 4 :
Fig. 4: Recovered posterior distributions for a selected set of parameters from two PIDs and detectors.The contours indicate 68 and 95 % confidence regions, while the dashed lines (in the respective color of the contours) show the true input value of each of the PIDs.The contours below (blue) and above (orange) the diagonal correspond to PIDs 3003 and 5515, respectively.From left to right along the horizontal axis, columns show (1)-(3) one arbitrary CMB map pixel in Stokes I, Q, and U; (4)-(6) correlated noise for the same pixel and Stokes parameters; (7) the CMB intensity quadrupole amplitude a 2,0 ; (8) gain g; and (9)-(12) the four correlated noise parameters, ξ n = {σ 0 , f knee , α, A p }.Note that the one-dimensional histograms of the first seven parameters are completely overlapping since these parameters are independent of PID.

Fig. 6 :
Fig. 6: Trace plots of gain values as a function of chain iteration (blue) compared to their input values (red) for selected PIDs, in order from left to right: 349, 9847, 4298 and 1993.

Fig. 7 :
Fig. 7: The input gain values (red) over-plotted on the output gain values (blue).The width of the blue line indicates the sample standard deviation of the PID in question.

Fig. 13 :
Fig.13: Correlation matrix for selected pixel values of the CMB map, m CMB , and the correlated noise map, m n corr , for all three Stokes parameters I, Q, and U. Pixels 1 and 2 are selected to be neighboring pixels along the same Planck scanning ring and located near the Ecliptic plane, while pixel 3 is an arbitrarily selected pixel not spatially associated with the other two.

Fig. 14 :
Fig.14: Histograms of the noise parameters over all PIDs and 10 000 samples for radiometer 27M.We show the white noise level σ 0 , knee frequency f knee , correlated noise spectral index α, and log-normal noise amplitude A p .For reference we show the standard normal distribution as a black dashed line.

Fig. 15 :
Fig. 15: Comparison of recovered correlated noise PSD in terms of the functional form, P n ( f ).The top two panels show results for the same PIDs as in Fig. 4; faint lines indicate individual Gibbs samples, while the dashed lines show the true input functions.The bottom two panels show the difference between the posterior mean function and the true input in units of percent and posterior rms, respectively.