Unified Radio Interferometric Calibration and Imaging with Joint Uncertainty Quantification

The data reduction procedure for radio interferometers can be viewed as a combined calibration and imaging problem. We present an algorithm that unifies cross-calibration, self-calibration, and imaging. Being a Bayesian method, the algorithm not only calculates an estimate of the sky brightness distribution, but also provides an estimate of the joint uncertainty, which comprises both the uncertainty of the calibration and that of the actual observation. The algorithm is formulated in the language of information field theory and uses Metric Gaussian Variational Inference (MGVI) as the underlying statistical method. So far, only direction-independent antenna-based calibration is considered. This restriction may be lifted in future work. An implementation of the algorithm is contributed as well.


Introduction
Radio astronomy is thriving. State-of-the-art telescopes such as MeerKAT, the Australian Square Kilometer Array Pathfinder (ASKAP), the Very Large Array (VLA), and the Atacama Large Millimeter Array (ALMA) are operating, and the Square Kilometer Array (SKA) is in the planning stages. All these telescopes provide high-quality data on an unprecedented scale, and much instrumental progress is being made, which facilitates enormous improvements in sensitivity and survey speed.
Impressed by these novel facilities, we turn our attention to the calibration and imaging algorithms that are fed by the data from these telescopes. The amount of scientific insight that can be extracted from a given telescope is limited by the capability of the employed data reduction algorithm. We argue that there is room for improvement in the calibration and imaging procedure: the most widely applied algorithms view calibration and imaging as separate problems and are not able to provide uncertainty information. The latter is urgently needed to quantify the level of trust a scientist can put in any result based on radio observations. Furthermore, a statistically sound confrontation of astrophysical models with radio data requires reliable uncertainty quantification. Finally, treating calibration and imaging as separate steps ignores their tight interdependence.
The algorithmic idea presented in this work is an advancement of the original resolve algorithm (Radio Extended SOurces Lognormal deconvolution Estimator; Junklewitz et al. 2016) and retains its name. The resolve algorithm is formulated in the language of information field theory (IFT; Enßlin et al. 2009; Enßlin 2018), which is a view on Bayesian statistics applicable wherever (physical) fields are to be inferred. From a Bayesian point of view, the question when reducing radio data is the following: given prior knowledge as well as measurement information about the brightness distribution of a patch of the sky, what knowledge does the observer have after obtaining the data? This question is answered by Bayes' theorem in terms of a probability distribution over all possible sky brightness distributions conditional on the data.
Reconstruction algorithms may be judged by their statistical integrity or by their performance. The first perspective ultimately leads to purely Bayesian algorithms, which are computationally too expensive for typical problems. The second often leads to ad hoc algorithms that may perform well in applications but can have major shortcomings, such as missing uncertainty quantification or negative-flux pixels, which is the case, for example, for CLEAN (Högbom 1974). The resolve algorithm attempts a compromise between these two objectives. It is based on purely statistical arguments, and the necessary operations are approximated such that they can be implemented efficiently on a computer and be used for actual imaging tasks. Thus, the approximations and (prior) assumptions on which resolve is based can be written down explicitly.
resolve is reasonably fast but cannot compete in pure speed with algorithms like the Cotton-Schwab algorithm (Schwab 1984) as implemented in CASA. This is rooted in the fact that resolve not only provides a single sky brightness distribution but needs to update the sky prior probability distribution according to the raw data in order to properly state how much the data has constrained the probability distribution and how much uncertainty is left in the final result. This uncertainty is defined in a fashion such that it can encode the posterior variance and also cross-correlations. Thus, the uncertainty is quantified by O(n²) pieces of information, where n is the number of pixels in the image. Given this massive number of degrees of freedom, it may be surprising that resolve is able to return its results after a sensible amount of time. Having said this, there is still potential for improvement. The technical cause for the long runtime is the complexity of the gridding/degridding operation, which needs to be called orders of magnitude more often than in conventional algorithms. This problem may be tackled from an information-theoretic perspective in the future.
Turning to the specific subject of the present publication, the data reduction pipeline of modern radio telescopes consists of numerous steps. In this paper, we focus on the calibration and imaging part. Calibration is necessary because the data is corrupted by a variety of effects, including antenna-based, baseline-based, and direction-dependent or direction-independent effects (Smirnov 2011). For the scope of this paper only antenna-based calibration terms are considered, a simplification that is sensible for telescopes with a small field of view such as ALMA or the VLA. The crucial idea of this paper is to view the amplitude and phase corrections for each antenna as one-dimensional fields that are defined over time. These fields are discretized and regularized by a prior which states that the calibration solution for a given antenna is smooth over time. This removes the ambiguity of an interpolation scheme between the calibrator observations and the subsequent application of self-calibration. Because resolve is an IFT algorithm, there is no notion of solution intervals, the time bins in which traditional calibration algorithms bin the data (see, e.g., Kenyon et al. 2018). Instead, IFT takes care of a consistent discretization of the principally continuous fields. Similarly, the sky brightness distribution is defined on a discretized two-dimensional space; only single-channel imaging is performed in this work.
In practice, the current approach in the IFT community is to define a generative model that turns the degrees of freedom, which are learned by the algorithm, into synthetic data that can be compared to the actual data in a squared-norm fashion (in the case of additive Gaussian noise). This approach is similar to the so-called radio interferometric measurement equation (RIME; Hamaker et al. 1996; Perkins et al. 2015; Smirnov 2011). Therefore, our notation closely follows that defined in Smirnov (2011). Calibration effects that are part of the RIME but left out here for simplicity could in principle be integrated into the resolve framework.
The resolve approach may be classified according to the notion of first-, second-, and third-generation calibration established in Noordam & Smirnov (2010): it unifies cross-calibration (1GC), self-calibration (2GC), and imaging. Still, it is to be strictly distinguished from existing approaches like Kenyon et al. (2018), Cai et al. (2018), and Salvini & Wijnholds (2014). This is because it focuses on a strict Bayesian treatment combined with consistent discretization (one of the main benefits of IFT) and does not use computational speed as an argument to drop Bayesian rigor.
The actual posterior probability distribution of the joint imaging and calibration problem is highly non-Gaussian and therefore not easily storable on a computer. To overcome this problem, the posterior is approximated by a multivariate Gaussian with full covariance matrix. The algorithm prescribes minimizing the Kullback-Leibler (KL) divergence between the actual posterior and the approximate one, which quantifies the information gain between the two probability distributions. We use the variant of this approach known as Metric Gaussian Variational Inference (MGVI; Knollmüller & Enßlin 2019).
Together with this publication we contribute an implementation of resolve that is available under the terms of the GPLv3 (https://gitlab.mpcdf.mpg.de/ift/resolve). It is based on the Python library NIFTy (Arras et al. 2019), which is freely available as well.
The paper is structured as follows. Section 2 discusses the structure of the likelihood and prior for the statistical problem at hand. This defines an algorithm that is verified on synthetic data in Sect. 3 and afterward applied to real data from the VLA in Sect. 4. Section 5 comments on performance and scalability, and Sect. 6 finishes the paper with conclusions and an outlook on future work.

Bayes' theorem
Every reconstruction algorithm needs a prescription of how the quantity of interest s affects the data d. This prescription is called the data model. Combined with statistical information, this model defines the likelihood P(d|s), which is a probability distribution on data realizations conditioned on a given realization of the signal s. Bayes' theorem,

P(s|d) = P(d|s) P(s) / P(d),

requires us to supplement the likelihood with a prior probability distribution P(s), which assigns a probability to each signal realization s. This distribution encodes the knowledge the scientist has prior to looking at the data. Since it is virtually impossible to visualize the posterior probability distribution P(s|d) in the high-dimensional setting of Bayesian image reconstruction, we may compute the posterior mean and posterior variance:

m := ⟨s⟩_{P(s|d)} := ∫ Ds P(s|d) s,
⟨|m − s|²⟩_{P(s|d)} := ∫ Ds P(s|d) |m − s|².
The notation ∫ Ds is borrowed from statistical physics and means integrating over all possible configurations s. For a discussion on this measure in the continuum limit see Enßlin (2018, Sect. 1.8). In practice, this integral is discretized as ∫ Ds = ∫ ∏_i ds_i, where s_i refers to the pixel values of the discretized quantity s. The term P(d) is independent of s and serves as a normalization factor. It expresses the probability to obtain the data irrespective of what the signal is, i.e., P(d) = ∫ Ds P(d, s).
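As a toy illustration (not part of resolve), the following Python snippet evaluates these discretized integrals by brute force for a one-pixel signal with an assumed Gaussian prior and likelihood; all names and numbers are purely illustrative.

```python
import numpy as np

d, noise_var, prior_var = 1.3, 0.5, 2.0          # assumed toy values
s_grid = np.linspace(-10, 10, 4001)              # discretized signal configurations
ds = s_grid[1] - s_grid[0]

likelihood = np.exp(-0.5 * (d - s_grid)**2 / noise_var)  # P(d|s), up to a constant
prior = np.exp(-0.5 * s_grid**2 / prior_var)             # P(s), up to a constant
evidence = np.sum(likelihood * prior) * ds               # P(d) = integral of P(d,s)
posterior = likelihood * prior / evidence                # Bayes' theorem

m = np.sum(posterior * s_grid) * ds                      # posterior mean
var = np.sum(posterior * (s_grid - m)**2) * ds           # posterior variance
print(m, var)
```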
In the following we describe the data model and implied likelihood employed by resolve. This includes the assumptions resolve makes about the measurement process. Afterward, resolve's prior is discussed. For definiteness the notation established in Smirnov (2011) is used.

Data model and likelihood
The measurement equation of a radio interferometer can be understood as a modified Fourier transform followed by an application of data-corrupting terms, the terms which need to be solved for in the calibration procedure. Assume that the data is only corrupted by so-called antenna-based direction-independent effects. Then Smirnov (2011, Eq. (18)) is written as

V_pq = G_p ( ∫∫ [B(l, m) / n(l, m)] e^{−2πi [u_pq l + v_pq m + w_pq (n(l,m) − 1)]} dl dm ) G_q^†,   (4)

where
- l, m: direction cosines on the sky and n(l, m) = √(1 − l² − m²).
- p, q ∈ {1, ..., N_a}: antenna indices, where N_a is the total number of antennas of the interferometer.
- V_pq ∈ C^{2×2}: visibility for the antenna pair (p, q).
- (u_pq, v_pq, w_pq): vector that connects antenna p with antenna q. The coordinates u_pq and v_pq are aligned with l and m, respectively. The value w_pq is perpendicular to both and points from the interferometer toward the center of the field of view.
- G_p ∈ C^{2×2}: antenna-based direction-independent calibration effect.
- B ∈ R^{2×2}: intrinsic sky brightness matrix. Since only the Stokes I component is considered in this publication, this matrix is proportional to the identity matrix.

Equation (4) can be understood as a Fourier transform of the sky brightness distribution that is distorted by the terms involving n(l, m) and corrupted by the calibration terms G_p. For the purpose of this publication we make the following simplifying assumptions. First, only the total intensity I is reconstructed. Second, G_p is assumed to be diagonal, which states that there is no significant polarization leakage and especially no time-variable leakage. Finally, the temporal structure of the data is needed for the construction of the prior. Therefore, a time index is added to the above expression, which is written as

V_pqt = G_p(t) ( ∫∫ [B(l, m) / n(l, m)] e^{−2πi [u_pqt l + v_pqt m + w_pqt (n(l,m) − 1)]} dl dm ) G_q(t)^†,   (5)

where G_p(t) are diagonal matrices and B(l, m) is a diagonal matrix proportional to unity in polarization space. We note that G_p(t) needs to absorb the V-term from Eq. (4), which is possible as long as polarization leakage is not too time variable. The w-term can be taken care of by w-stacking (Offringa et al. 2014), which means that the range of possible values for w_pq is binned linearly such that the integral becomes an ordinary Fourier transform. Technically, the non-equidistant Fourier transform in Eq. (5) is carried out by the NFFT library (Keiner et al. 2009) in our resolve implementation. All in all, Eq. (5) prescribes how to simulate data V_pqt given calibration solutions G_p(t) and an intrinsic sky brightness distribution B(l, m), which is what we wanted. To declutter the notation in the following, let us denote the quantities of interest by s = (G_p(t), B(l, m)) and the map R such that V_pqt = R(s).
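To make the structure of R explicit, the following deliberately slow NumPy sketch implements a scalar (single-polarization) version of Eq. (5) as a direct Fourier sum with subsequent gain corruption. Function and variable names are our own illustrative choices; the actual implementation delegates the transform to NFFT and handles the w-term by w-stacking.

```python
import numpy as np

def response(b, l, m, gains, ant1, ant2, u, v):
    """Toy R(s): b is the sky brightness per pixel, l and m the pixel direction
    cosines, gains[p, t] the complex gain of antenna p at the time of
    visibility t, ant1/ant2 the antenna indices per visibility, and u/v the
    baseline coordinates in wavelengths."""
    n = np.sqrt(1.0 - l**2 - m**2)
    # Modified Fourier transform of the sky; the w-term is assumed to be
    # handled separately, e.g., by w-stacking.
    phase = -2j * np.pi * (np.outer(u, l) + np.outer(v, m))
    uncalibrated = np.exp(phase) @ (b / n)
    # Corrupt the model visibilities with the antenna-based gains g_p g_q*.
    t = np.arange(len(u))
    return gains[ant1, t] * uncalibrated * np.conj(gains[ant2, t])
```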
The commonly used data model is d = R(s) + n, which assumes additive Gaussian noise (Thompson et al. 1986). Let N be a diagonal noise covariance matrix with the noise variances on its diagonal, and let G(s − m, S) denote a Gaussian random field with mean m and covariance matrix S. Then the additive noise can be marginalized over to arrive at an expression for the likelihood:

P(d|s) = G(d − R(s), N).

The likelihood distribution P(d|s) contains all information about the measurement device and the measurement process that the inference algorithm will take into account.
We conclude the discussion of data and likelihood with three remarks. First, the likelihood does not depend on the statistical method at hand. All simplifications made are rooted in practical considerations in the implementation process; there is no fundamental reason for not taking, for instance, a more accurate noise model or a more sophisticated calibration structure into account. Second, the employed notation already hints at the goal of describing an algorithm that jointly calibrates and images: the generalized response function R takes both the calibration parameters G_p(t) and the intrinsic sky brightness distribution B as arguments.
Finally, consider what happens if the telescope alternates between observing the science target and observing a calibration source. Then both the data set and the intrinsic sky brightness consist of two parts, and the likelihood separates into

P((d_t, d_c) | s) = P(d_t | s) P(d_c | s),   (9)

where the indices t and c refer to the science target and the calibrator, respectively. From the likelihood perspective, calibration source and science source are two separate things. However, as soon as the one-dimensional calibration fields are supplemented by a prior that imposes temporal smoothness, the degrees of freedom regarding the science target and the calibration target interact. This solves the problem of applying interpolated calibration solutions in traditional cross-calibration in a natural way.
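In code, this likelihood and its factorization are straightforward; the following sketch (our own notation, assuming a diagonal noise covariance) evaluates the negative log-likelihood up to an s-independent constant.

```python
import numpy as np

def gaussian_nll(d, model_vis, noise_var):
    """-log G(d - R(s), N) up to a constant, for diagonal N and complex data."""
    residual = d - model_vis
    return 0.5 * np.sum(np.abs(residual)**2 / noise_var)

# For alternating observations the joint negative log-likelihood is a sum:
# nll = gaussian_nll(d_t, R_t(s), N_t) + gaussian_nll(d_c, R_c(s), N_c)
```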

Prior
Turning to the prior probability distribution, we note that the technical framework in which resolve is implemented allows for a variety of different priors, which may supersede the ones presented in this paper.
As stated before, G_p(t) is assumed to be diagonal,

G_p(t) = diag( g_p^(1)(t), g_p^(2)(t) ).

The elements of this matrix are functions defined over time that take complex nonzero values, g_p^(i): R → C^*. The natural way of parameterizing a function taking values in C^* is in polar coordinates, i.e.,

g_p^(i)(t) = e^{λ_p^(i)(t) + i φ_p^(i)(t)},

where λ_p^(i), φ_p^(i): R → R. The modulus and phase of the complex gains g_p^(i) have different physical origins. The modulus describes a varying amplification of the signal in the antenna electronics, which is rooted, among other things, in fluctuating temperatures of the receiver system. Varying phases stem from fluctuations in the atmosphere. Therefore, these two ingredients of g_p have differing typical timescales a priori.
The prior knowledge on λ_p^(i) and φ_p^(i) is the following: the amplitudes and phases, respectively, share a typical behavior over time for all antennas p, both of which are not known a priori and need to be inferred from the data as well. This typical behavior does not change over time. Additionally, all λ_p^(i), φ_p^(i) evolve smoothly over time. Mathematically, this can be captured by Gaussian random fields,

P(λ_p^(i) | Λ) = G(λ_p^(i), Λ),  P(φ_p^(i) | Φ) = G(φ_p^(i), Φ),

where Λ and Φ are defined such that the Gaussian random fields obey homogeneous but otherwise unknown statistics. This means that not only the calibration solutions themselves but also their prior correlation structure is inferred. For this, priors on the covariances need to be supplemented: P(Λ) and P(Φ). In Sect. 2.4 we describe how to set up the priors on Λ and Φ such that they implement homogeneous statistics and which parameters they take.
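The polar parameterization and smoothness prior can be illustrated with a few lines of NumPy; the random-walk fields below are crude stand-ins for samples from G(·, Λ) and G(·, Φ), chosen only to show how smooth real fields turn into nonzero complex gains.

```python
import numpy as np

rng = np.random.default_rng(42)
n_ant, n_time = 4, 256
# Random-walk stand-ins for the smooth Gaussian fields lambda_p(t), phi_p(t);
# in resolve these are drawn from G(., Lambda) and G(., Phi) instead.
lam = np.cumsum(0.01 * rng.standard_normal((n_ant, n_time)), axis=1)
phi = np.cumsum(0.05 * rng.standard_normal((n_ant, n_time)), axis=1)
# Polar parameterization: the gains are guaranteed to be nonzero.
gains = np.exp(lam + 1j * phi)   # g_p(t) = exp(lambda_p(t) + i*phi_p(t))
```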
Next, let us discuss the prior on the sky brightness distribution B(l, m). We recall that the matrix B(l, m) is assumed to be diagonal and proportional to unity, i.e.,

B(l, m) = b(l, m) · 1,

where b(l, m): [l_min, l_max] × [m_min, m_max] → R_{>0} maps the field of view to the set of positive real numbers, since sky brightness is an inherently positive quantity. For the scope of this publication, the sky brightness contains only a diffuse component. It is modeled similarly to the modulus of the calibration terms: it is strictly positive a priori, smooth over its domain, and may vary over large scales. Therefore, we define b(l, m) = e^{ψ(l, m)} and let ψ(l, m) be a two-dimensional Gaussian random field with correlation structure Ψ, which is inferred as well:

P(ψ | Ψ) = G(ψ, Ψ).

All in all, the basic structure of the priors on all terms appearing in Eq. (5) has been described, apart from the construction of the prior on the covariance matrices, which is the objective of the next section.

Correlated fields
To account for correlations of a Gaussian distributed field ψ, the following statements are assumed to be true:
1. The autocorrelation of ψ can be characterized by a power spectrum P_Ψ(|k|), where k is the coordinate of the Fourier transformed field.
2. The power spectrum P_Ψ(|k|) is a positive quantity that can vary over many orders of magnitude.
3. Physical power spectra are falling with |k|, typically according to a power law.
4. Given enough data, it is possible to infer any kind of differentiable power spectrum.
Note that the first assumption is equivalent to the seemingly weaker assumptions:
- In the absence of data, there is no special direction in space or time, i.e., a priori the correlation of the field is invariant under rotations.
- In the absence of data, there is no special point in space or time, i.e., a priori the correlation of field values is invariant under shifts in space or time.
The fact that homogeneous and isotropic correlation matrices are diagonal in Fourier space and can be fully characterized by a power spectrum is known as the Wiener-Khinchin theorem (Wiener 1949; Khintchin 1934); a minimal illustration is sketched below.
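The following snippet illustrates this: by the Wiener-Khinchin theorem, a sample of a homogeneous Gaussian random field can be drawn by coloring white noise in Fourier space with the square root of its power spectrum. The power law used here is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
npix = 256
k = np.fft.fftfreq(npix) * npix                     # Fourier mode index
power = np.where(k == 0, 1.0, np.abs(k)**-3.0)      # assumed falling spectrum P(|k|)
amplitude = np.sqrt(power)                          # square-root power spectrum
xi = rng.standard_normal(npix)                      # white Gaussian excitation
psi = np.fft.ifft(amplitude * np.fft.fft(xi)).real  # homogeneous correlated sample
```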
It is assumed that ψ as well as its power spectrum P_Ψ(|k|) are unknown. Therefore, both need a prior, which may be formulated as a generative model: an operator that generates samples of ψ and of its square-root power spectrum (henceforth called the amplitude spectrum) from one or multiple white Gaussian fields. Formulating a prior as a generative model has several theoretical and practical advantages.
We propose the following ansatz for an operator that converts independent normally distributed fields φ and τ into the amplitude spectrum of the correlated field ψ. This operator is called the amplitude operator A_C (see Fig. 1 for an illustrative example), i.e.,

A_C(φ, τ) = (Exp* ∘ Exp) [ sym( F_{log(k) t}(cp · φ) ) + (m̄ + σ_m τ_m) log|k| + (ȳ_0 + σ_{y_0} τ_{y_0}) ],   (16)

where C = (a, t_0, m̄, ȳ_0, σ_m, σ_{y_0}, α, β) denotes the tuple of parameters (all real numbers) and τ = (τ_m, τ_{y_0}). Exp* denotes the pullback of a field by the exponential function acting on log(|k|), Exp denotes exponentiation of the field values, F_{log(k) t} denotes the Fourier transform from a space with coordinate t to the logarithmic coordinate log(k) of the power spectrum, m̄ and ȳ_0 are the slope and the y-intercept of the a priori mean power law, and sym is an (anti-)symmetrizing operation defined to operate on a field φ over the interval (t_0, 2t_1 − t_0) as

sym(φ)(x) := φ(x) − φ(2t_1 − x) for x ∈ [t_0, t_1].

In words, sym mirrors the field and subtracts it from itself, then restricts the domain to half the original size. Finally, cp is the log-cepstrum,

cp(t) = a² (1 + (t/t_0)²)^{−2}.

Let us show that Eq. (16) meets the requirements stated at the beginning of Sect. 2.4. Requirement 1 is trivially met. Requirement 2 is met since the amplitude spectrum is constructed by applying an exponential function to a Gaussian field; thus, all values are positive and can vary over several orders of magnitude.
To requirement 3: in the absence of data, the mean of the inferred white fields φ and τ, to which the amplitude operator is applied, remains zero. For φ = 0 and τ = 0, Eq. (16) becomes

A_C(0, 0) = e^{ȳ_0} |k|^{m̄},

which is the equation for a power law with spectral index m̄. A preference for falling spectra can be encoded by choosing the hyperparameter m̄ to be negative. To requirement 4: let us show that any differentiable function lies in the image space of the amplitude operator. This implies that any differentiable amplitude spectrum can be inferred given enough data. Let φ be an arbitrary smooth field over the interval [t_0, t_1] and φ_sym be a smooth field that has a point symmetry at (t_1, φ(t_1)), coincides with φ on [t_0, t_1], and is defined on the interval [t_0, 2t_1 − t_0]. The function φ_sym is a continuous and differentiable continuation of φ at t_1. Now, we decompose φ_sym into a linear part and a residual term,

φ_sym = φ_lin + φ_res,

where φ_lin is the straight line through the endpoints of φ_sym. The residual term is a differentiable periodic function, i.e.,

φ_res(t_0) = φ_res(2t_1 − t_0),  φ'_res(t_0) = φ'_res(2t_1 − t_0).

Thus, φ_res can be represented in Fourier space by a field that falls off at least with second order. This is exactly how φ_res is represented in Eq. (16). Assuming that the mean and the slope of the linear part are well represented by their prior distributions, it is indeed possible to represent any kind of differentiable amplitude spectrum. All in all, all four requirements are met by Eq. (16).
There remains one unconstrained degree of freedom, the value of the power spectrum at |k| = 0, the zero mode. As the zero mode describes the magnitude of the overall logarithmic flux, it is decoupled from the remaining spectrum and should have its own prior. This value is fixed by imposing an inverse gamma prior on the zero mode, which restricts it to be a positive quantity, while still allowing for large deviations.
To sum up, the amplitude operator depends on the following eight hyperparameters:
- a, t_0: the amplitude parameter and cutoff scale of the log-cepstrum.
- m̄, ȳ_0: the prior means for the slope and the height of the power law.
- σ_m, σ_{y_0}: the corresponding standard deviations.
- α, β: the shape and scale parameters of the inverse gamma prior for the zero mode.
We note that the assumptions made at the beginning of Sect. 2.4 apply to a wide variety of processes, regardless of their dimensionality. This generic correlated field model has already been successfully used in a number of synthetic and real applications. In resolve, the amplitude operator is used as a prior for the amplitude spectra of the antenna calibration fields and the image itself.
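Under the reconstruction of Eq. (16) given above, a toy version of the amplitude operator can be sketched in a few lines of NumPy. All parameter values, and the exact form of the cepstrum kernel, are illustrative assumptions rather than the tested defaults of resolve.

```python
import numpy as np

rng = np.random.default_rng(1)
a, t0 = 1.0, 2.0                          # cepstrum amplitude and cutoff scale
m_bar, y0_bar, sig_m, sig_y0 = -2.0, 0.0, 0.2, 0.5

n = 128
logk = np.linspace(0.0, 5.0, n)           # logarithmic coordinate log|k|
t = np.fft.fftfreq(2 * n, d=logk[1] - logk[0])
cp = (a / (1.0 + (t / t0)**2))**2         # log-cepstrum kernel (assumed form)

phi = rng.standard_normal(2 * n)          # white excitation field in t-space
tau = rng.standard_normal(2)              # excitations for slope and intercept

colored = np.fft.fft(cp * phi).real       # F_{log(k) t}(cp * phi), doubled domain
sym = colored[:n] - colored[::-1][:n]     # mirror, subtract, keep half the domain
slope = (m_bar + sig_m * tau[0]) * logk   # power-law part with Gaussian priors
amplitude_spectrum = np.exp(sym + slope + y0_bar + sig_y0 * tau[1])
```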

Full algorithm
In the foregoing sections, the full likelihood and prior have been described. Now we stack all the ingredients together to build the full algorithm. Let us assume that the data set consists of two alternating observations: observations of a calibrator source and observations of the science target. This means that the likelihood splits into two parts as indicated in Eq. (9). In contrast to the sky brightness distribution of the science target, that of the calibrator B_c is known: it is a point source in the middle of the field of view. The sky brightness distribution of the science target is reconstructed.
The full likelihood takes the form

P(d | ξ) = G(d_t − R_t(ξ), N_t) G(d_c − R_c(ξ), N_c),   (27)

where the responses R_t and R_c implement Eq. (5) for the science target and the calibrator, respectively. Inside the responses, the sky brightness matrix of the science target is expressed in terms of white excitation fields as B = exp(Zψ) · 1, with the logarithmic sky brightness ψ drawn via the amplitude operator A_{C_B}; analogous expressions with amplitude operators A_{C_λ} and A_{C_φ} hold for the calibration fields λ_p^(i) and φ_p^(i). Here, C_x denotes the tuple of parameters of the respective amplitude operator and Z is a padding operator. The unit matrix in Eq. (27) is a 2 × 2 matrix acting on the same space as the sky brightness matrix B. The tuple of all excitation fields is called ξ, where

ξ = (ξ_B, τ_B, {ξ_{λ_p^(i)}}, τ_λ, {ξ_{φ_p^(i)}}, τ_φ).

As discussed before, this model is set up such that the excitation fields ξ have white Gaussian statistics a priori,

P(ξ) = G(ξ, 1).

The posterior probability distribution is then given by

P(ξ | d) ∝ P(d | ξ) P(ξ).   (38)

With this, the statistical model employed in this publication is fully defined.
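The following self-contained toy pulls the pieces together: white excitations ξ are mapped through stand-in priors to smooth gains and a positive sky, pushed through a one-dimensional direct-Fourier response, and compared to data in the Gaussian likelihood. Every function and number here is an illustrative assumption, not resolve's actual operator stack.

```python
import numpy as np

rng = np.random.default_rng(3)
n_pix, n_vis, n_ant, n_time = 32, 200, 4, 64

def smooth_field(xi, strength):
    """Color white noise with a falling spectrum (toy prior sample)."""
    k = np.abs(np.fft.fftfreq(xi.size) * xi.size)
    amp = np.where(k == 0, 1.0, k**-2.0) * strength
    return np.fft.ifft(amp * np.fft.fft(xi)).real

# Latent white excitations xi = (xi_sky, xi_lam, xi_phi).
xi_sky = rng.standard_normal(n_pix)
xi_lam = rng.standard_normal((n_ant, n_time))
xi_phi = rng.standard_normal((n_ant, n_time))

b = np.exp(smooth_field(xi_sky, 1.0))                    # sky: b = e^psi > 0
lam = np.apply_along_axis(smooth_field, 1, xi_lam, 0.1)  # log-amplitudes
phi = np.apply_along_axis(smooth_field, 1, xi_phi, 0.5)  # phases
gains = np.exp(lam + 1j * phi)                           # g_p(t)

# Toy 1D "interferometer": random baselines, random time bin per visibility.
l = np.linspace(-0.1, 0.1, n_pix)
u = rng.uniform(-200, 200, n_vis)
p, q = rng.integers(0, n_ant, (2, n_vis))
tbin = rng.integers(0, n_time, n_vis)
model = np.exp(-2j * np.pi * np.outer(u, l)) @ b
vis = gains[p, tbin] * model * np.conj(gains[q, tbin])   # R(xi)

d = vis + 0.1 * (rng.standard_normal(n_vis) + 1j * rng.standard_normal(n_vis))
nll = 0.5 * np.sum(np.abs(d - vis)**2 / 0.01)            # Gaussian likelihood
```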

Inference algorithm
The probability distribution (38) has too many degrees of freedom to be represented on a computer. The resolve algorithm solves this problem by approximating the full posterior distribution with a multivariate Gaussian distribution whose covariance is equated with the inverse Fisher information metric. The latter can be represented symbolically, alleviating the need for explicit storage and handling of otherwise prohibitively large matrices. This algorithm is called MGVI; it is described in full length in Knollmüller & Enßlin (2019) and implemented in NIFTy. The following is a non-exhaustive outline of Knollmüller & Enßlin (2019).
The algorithm MGVI prescribes minimizing the KL divergence between the actual posterior and the approximate posterior,

KL(P_1 || P_2) = ∫ Ds P_1(s) log[ P_1(s) / P_2(s) ],

where P_1 is the more informed distribution compared to P_2. However, it is virtually impossible to perform this integration with the posterior distribution as integration measure. Therefore, MGVI exchanges the order of the arguments of the KL divergence such that the integral can be approximated by samples from the approximate posterior, i.e.,

F[ξ̄] = ⟨ H(ξ̄ + ξ', d) ⟩_{G(ξ', D(ξ̄))} ≈ (1/n_s) Σ_i H(ξ̄ + ξ'_i, d),

where H(ξ, d) := − log P(ξ, d) is the information Hamiltonian and D(ξ̄) the covariance of the approximate Gaussian, given by the inverse Fisher information metric. F[ξ̄] is a cost function that can be minimized with respect to ξ̄; suitable (second-order) minimizers are provided by NIFTy. With the help of the above approximation scheme we get a computational handle on the posterior. The drawbacks of this approach are that the uncertainty estimate of MGVI sets a lower bound on the variance of the posterior and that it is not suited for extremely non-Gaussian and especially multimodal probability distributions. We note, however, that the posterior is approximated with a Gaussian only in the space on which the parameters are defined. After processing the parameters through nonlinearities as discussed in this section, the actual quantities of interest, such as the sky brightness distribution, are no longer Gaussian distributed and may even have multiple modes. A detailed discussion of the abilities of MGVI is provided in Knollmüller & Enßlin (2019).
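The following toy gradient-descent loop illustrates the sampled KL of the equation above on a trivial linear model; the identity metric used for drawing the residual samples is a stand-in for D(ξ̄), which NIFTy handles implicitly without ever storing it.

```python
import numpy as np

rng = np.random.default_rng(7)
d = np.array([1.0, 2.0, 0.5])            # toy data for the model d = 2*xi + n

def grad_hamiltonian(xi, d):
    """Gradient of H(xi, d) = 0.5*|d - 2*xi|^2 + 0.5*|xi|^2 (toy model)."""
    return -2.0 * (d - 2.0 * xi) + xi

xi_bar, n_samples = np.zeros(3), 5
for _ in range(100):
    # Residual samples around the current mean; identity metric as stand-in.
    samples = rng.standard_normal((n_samples, 3))
    # Gradient of the sampled KL: the average of Hamiltonian gradients.
    kl_grad = np.mean([grad_hamiltonian(xi_bar + s, d) for s in samples], axis=0)
    xi_bar -= 0.05 * kl_grad             # simple gradient step on F[xi_bar]
print(xi_bar)                            # approaches the posterior mean 2d/5
```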

Verification on synthetic data
This section is devoted to the verification of the algorithm, i.e., the reconstruction of a synthetic sky brightness distribution from a simulated observation with artificial noise. The setup is described, followed by a comparison of the ground truth and the reconstruction. Application to real data, where effects that are not modeled may occur and the ground truth is unknown, is presented in Sect. 4.
We employ a realistic uv coverage from an L-band observation of the source G327.6+14.6, also known as SN1006. For the purpose of this paper we randomly select 30 000 visibilities from this data set (see Fig. 2) to demonstrate that joint calibration and imaging is possible even without much data. We use the field shown in Fig. 3 as the synthetic sky brightness distribution. It is a random sample drawn assuming the power spectrum shown in orange in Fig. 4. The noiseless simulated visibilities are corrupted by noise whose level is visualized in Fig. 5. The resulting information source, i.e., the naturally weighted dirty image, is shown in Fig. 6. This synthetic observation is set up such that the calibration artifacts are stronger and the noise level is higher than in real data (see Sect. 4) in order to demonstrate the capability of resolve in bad data situations. The calibration artifacts that have been applied are visualized in Fig. 7.
The resolve algorithm is run on this synthetic data to compare its output and uncertainty estimate to the known ground truth. The prior parameters are listed in Table 1. Additionally, we choose a resolution of 64² pixels for the sky brightness distribution with a field of view of 60, and 256 pixels for the calibration fields, which are defined on a temporal domain. As the total length of the observation was ∼220 min, one temporal pixel is ∼50 s long. These temporal pixels should not be confused with the solution intervals of traditional calibration schemes, where the data is binned on a grid and the calibration parameters are then solved for. In IFT, fields are by their nature continuous quantities that are discretized on an arbitrary grid; for convenience, a regular grid was chosen. The data provides information on each pixel, which is propagated to the neighboring pixels through the prior, since the calibration fields are assumed to be smooth over time. Therefore, the user is free to choose the resolution of the fields in IFT algorithms as long as it is finer than the finest structure that is to be reconstructed.
As pointed out, resolve is a Bayesian algorithm whose output is not a single image of the observed patch of the sky but rather a probability distribution over all possible sky configurations. The MGVI algorithm approximates this non-Gaussian probability distribution with a Gaussian in the space of ξ, i.e., the eigenspace of the prior covariance. This again implies non-Gaussian statistics for quantities such as b(l, m), λ_p^(i), and φ_p^(i), since they depend in a nonlinear fashion on ξ. The only practical way to visualize this probability distribution is to analyze a finite number of samples from it, which resolve can generate. A given set of samples can then be analyzed with standard statistical means such as the pixel-wise mean and variance, as sketched below. Figures 8-10 show the posterior mean, the absolute value of the residual, the standard deviation of the sky brightness distribution, and a histogram of the residual divided by the standard deviation, all computed from 100 posterior samples. The algorithm has managed to perform the calibration correctly and to reconstruct the sky brightness distribution. The total flux of the ground truth (Fig. 3) could not be fully recovered because of the noise on the synthetic measurement. The proposed uncertainty is slightly too small compared to the residuals, which is to be expected from MGVI.
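Schematically, such a sample-based analysis looks as follows; the lognormal random fields below merely stand in for actual posterior sky samples drawn by resolve.

```python
import numpy as np

rng = np.random.default_rng(5)
samples = rng.lognormal(size=(100, 64, 64))  # stand-in for 100 posterior sky samples
ground_truth = np.ones((64, 64))             # stand-in for the known true sky

mean = samples.mean(axis=0)                  # posterior mean image
std = samples.std(axis=0)                    # pixel-wise posterior standard deviation
residual = np.abs(mean - ground_truth)       # absolute residual map
# A histogram of residual/std assesses whether the uncertainty map is calibrated.
normalized = (residual / std).ravel()
```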
Since resolve does not assume a specific power spectrum as a prior for the reconstruction but rather infers it together with the sky brightness from the data, it also provides the user with an uncertainty estimate on the power spectrum (see Fig. 4). We note that the posterior variance of the power spectrum increases toward the boundaries of the plot. This is because interferometers do not provide information on scales larger than those corresponding to the shortest baseline. On small scales, an interferometer is limited by the noise level, which leads to an increased variance in the power spectrum on the right-hand side of the plot. Next, we turn to the calibration solutions. Figure 7 shows a comparison of the ground truth and the posterior provided by resolve. Since both polarizations (LL and RR) are considered for both the amplitude and the phase of the antenna-based calibration term, Fig. 7 has four rows. At first sight, the posterior mean and the ground truth are indistinguishable by eye, and the residuals and posterior standard deviation fit together nicely. There is a significant increase in the uncertainty, for example for antenna 2, toward the end of the observation. This is because a flagged data set was used, and all data points involving this antenna have been flagged from the beginning of the observation up to ∼2 h.
To illustrate this more explicitly, Figs. 11 and 12 each show the calibration solution for one antenna. The ground truth lies within the uncertainty bounds indicated by the samples. We note that all data points on the left-hand side of Fig. 11 have been flagged. Since no information about the phase is available there, the only constraint is the prior, which enforces temporal smoothness. Consistently, the uncertainty increases where no information is available.
Finally, we demonstrate what other kind of information posterior samples can reveal. Suppose a scientist is interested in the flux integrated over a certain region. In addition to the image, this integrated flux comes with an uncertainty that can be calculated by evaluating the integrated flux on each posterior sample of the sky brightness distribution, as sketched below. An example is shown in Fig. 13. The scatter of the histogram is caused by the noise on the data, the (un)certainty of the calibration solutions, and ultimately the uv coverage. We are not aware of any other radio aperture synthesis algorithm that is able to provide this kind of probabilistic posterior information. All in all, the proposed method is able to recover the ground truth and to supplement it with an appropriate uncertainty estimate.
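Propagating the posterior samples to such a derived quantity takes only a few lines; the mask and samples below are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(9)
samples = rng.lognormal(size=(100, 64, 64))  # stand-in posterior sky samples
mask = np.zeros((64, 64), dtype=bool)
mask[20:30, 35:50] = True                    # assumed region of interest

fluxes = samples[:, mask].sum(axis=1)        # integrated flux per sample
print(fluxes.mean(), fluxes.std())           # flux estimate with uncertainty
# A histogram of `fluxes` is the probabilistic statement shown in Fig. 13.
```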

Application to VLA data
We continue with an application of resolve to real data. To this end, we take the VLA data set whose uv coverage and time stamps have already been used in the preceding section. The resolution of all spaces is also taken to be the same. Starting from raw data, the first thing to look at is the information source (see Fig. 14). No structure of the supernova remnant is visible whatsoever since the data is not calibrated yet. This illustrates that resolve is able to operate on raw (but already flagged) visibilities that have not been processed further. Table 2 summarizes the prior parameters for the following reconstruction.
All calibration solutions are shown in Fig. 15, together with two exemplary plots in Figs. 16 and 17. The major characteristics of these solutions are visible in the right-hand column of Fig. 15: the uncertainty of the calibration decreases whenever the calibrator source is observed, as expected. Additionally, the uncertainty increases dramatically where the data has been flagged. The amplitude solutions are surprisingly stable over time, although the prior would allow for more variance in the solution, as can be seen from Sect. 3, where the same prior parameters have been used. There is a systematic difference between the samples for the amplitude solutions and those for the phases. The former vary around a mean solution, whereas the latter exhibit a certain global offset. This is explained by the fact that the likelihood is invariant under a global phase shift, which is broken by the prior down to a global phase shift applied to all temporal pixels at once. This residual symmetry is in turn broken by the prior on the zero-mode variance of the phase solutions. However, this prior is very weak in order to allow for phase solutions of arbitrary magnitude. Therefore, the phase solutions cannot have an arbitrarily large offset but can still vary globally to some degree, as shown in Fig. 17.
Next, the posterior sky brightness is discussed. Figures 18 and 19, along with Figs. 20 and 21, show the posterior mean and pixel-wise standard deviation of b(l, m). The posterior standard deviation is higher wherever more flux is detected. Therefore, Fig. 21 provides a descriptive visualization of the posterior uncertainty of the sky brightness distribution.
Last but not least, the power spectrum of the logarithmic sky brightness distribution also needs to be reconstructed; it is shown in Fig. 22. The power spectrum is more constrained than that of Sect. 3 since the noise level of this data set is much lower than that of the synthetic one. One might expect the posterior power spectrum to feature nodes or distinct minima, because the Fourier transforms of compact objects typically exhibit such features. This is suppressed by the smoothness prior on the power spectrum. However, we note that this does not mean that the algorithm cannot reconstruct such an object, because it can still choose not to excite the respective modes in ξ_B.
All in all, this demonstrates that resolve is not only able to operate on synthetic data but is actually capable of solving for the sky brightness distribution and the calibration terms at the same time on real data sets.

Performance and scalability
Performance and scalability are crucial for the applicability of algorithms. The expensive part of the evaluation of the sky model is a fast Fourier transform (FFT), which is in O(n log n), where n is the total number of pixels of the sky model. For real-world data sets the cost of the (de)gridding exceeds that of the FFTs by far, such that one likelihood evaluation is in O(N), where N is the number of data points, which need to be degridded once for each polarization. To compute the sampled KL divergence, the likelihood needs to be evaluated n_s times, where n_s is the number of samples (typically 3-20). The memory consumption scales linearly with the number of samples used to approximate the KL divergence, the number of pixels, and the number of data points. This is possible since NIFTy is designed such that no explicit matrices need to be stored.
Both reconstructions in this paper took ≈60 min each to compute on a mobile CPU (Intel(R) Core(TM) i5-4258U CPU @ 2.40 GHz) with 4 GB of main memory. The response and its adjoint each needed to be called ≈30 000 times.
These values might improve in the future. Barnett et al. (2018) have proposed a novel gridding kernel that achieved speedups of several times in first experiments; it requires relatively small support and can be computed on the fly. Also, the structure of the algorithm allows for various forms of parallelization. The gridding/degridding can be computed in parallel with OpenMP. Moreover, the data set could be split into several parts and distributed on a cluster. This is a general feature of Bayesian statistics: a likelihood can be split into the product of two likelihoods, each of which contains only a subset of the data. Additionally, the evaluation of the KL divergence, which is a sum of few but expensive independent summands, can be distributed. Finally, NIFTy offers the (experimental) feature of distributing large fields on a cluster. Orthogonal to these computational speedup ideas, the algorithm might also benefit from compressing the likelihood itself such that fewer (de)gridder calls are necessary.

Conclusions
We have presented the probabilistic resolve algorithm for simultaneous calibration and imaging. After a derivation from first principles of the full posterior probability distribution for the joint calibration and imaging problem, it has been shown how this distribution can be approximated by a multivariate Gaussian probability distribution to render the problem computationally solvable. This method, called MGVI, provides a prescription for drawing samples from the approximate posterior distribution. The algorithm resolve has been verified on synthetic data. The results indicate that the uncertainty quantification is qualitatively sensible but should be taken with a grain of salt since MGVI systematically underestimates the posterior variance. Furthermore, it has been demonstrated that the algorithm is capable of reconstructing the sky brightness distribution of an intricate source, the supernova remnant SN1006, together with uncertainty information, from raw VLA L-band data.
There are many open ends from which to continue the investigation started in this paper. First, the model for the sky brightness distribution may be extended to include point sources and multifrequency correlations. On top of that, the response may be described more thoroughly: direction-dependent calibration and nontrivial primary beam effects may be taken into account. Moreover, we performed the flagging with a standard CASA flagging algorithm; this could be replaced by an algorithm rooted in information theory that unifies flagging with calibration and imaging. Additionally, a major/minor cycle scheme similar to that of CLEAN may be introduced to avoid too frequent (de)gridding operations, which is necessary to apply resolve to big data sets from telescopes such as MeerKAT. Finally, resolve can be extended to polarization imaging. On an orthogonal track, resolve may be used for imaging a variety of sources from different telescopes, including ALMA and especially the Event Horizon Telescope.