Issue 
A&A
Volume 566, June 2014



Article Number  A127  
Number of page(s)  11  
Section  Astronomical instrumentation  
DOI  https://doi.org/10.1051/00046361/201423503  
Published online  24 June 2014 
Nonlinear Kalman filters for calibration in radio interferometry
^{1} GEPI, Observatoire de Paris, CNRS, Université Paris Diderot, 5 place Jules Janssen, 92190 Meudon, France
email: cyril.tasse@obspm.fr
^{2} SKA South Africa, 3rd Floor, The Park, Park Road, 7405 Pinelands, South Africa
^{3} Department of Physics & Electronics, Rhodes University, PO Box 94, 6140 Grahamstown, South Africa
Received: 24 January 2014
Accepted: 20 March 2014
The data produced by the new generation of interferometers are affected by a wide variety of partially unknown complex effects such as pointing errors, phased array beams, ionosphere, troposphere, Faraday rotation, or clock drifts. Most algorithms addressing directiondependent calibration solve for the effective Jones matrices, and cannot constrain the underlying physical quantities of the radio interferometry measurement equation (RIME). A related difficulty is that they lack robustness in the presence of low signaltonoise ratios, and when solving for moderate to large numbers of parameters they can be subject to illconditioning. These effects can have dramatic consequences in the image plane such as source or even thermal noise suppression. The advantage of solvers directly estimating the physical terms appearing in the RIME is that they can potentially reduce the number of free parameters by orders of magnitudes while dramatically increasing the size of usable data, thereby improving conditioning. We present here a new calibration scheme based on a nonlinear version of the Kalman filter that aims at estimating the physical terms appearing in the RIME. We enrich the filter’s structure with a tunable data representation model, together with an augmented measurement model for regularization. Using simulations we show that it can properly estimate the physical effects appearing in the RIME. We found that this approach is particularly useful in the most extreme cases such as when ionospheric and clock effects are simultaneously present. Combined with the ability to provide prior knowledge on the expected structure of the physical instrumental effects (expected physical state and dynamics), we obtain a fairly computationally cheap algorithm that we believe to be robust, especially in low signaltonoise regimes. Potentially, the use of filters and other similar methods can represent an improvement for calibration in radio interferometry, under the condition that the effects corrupting visibilities are understood and analytically stable. Recursive algorithms are particularly well adapted for precalibration and sky model estimate in a streaming way. This may be useful for the SKAtype instruments that produce huge amounts of data that have to be calibrated before being averaged.
Key words: instrumentation: interferometers / methods: data analysis / techniques: interferometric
© ESO, 2014
1. Introduction
The new generation of interferometers is characterized by very wide fields of view, large fractional bandwidth, high sensitivity, and high resolution. At low frequency, such as with the LowFrequency Array (LOFAR), or the Murchison Widefield Array (MWA), the crosscorrelation between voltages from pairs of antenna (the visibilities) are affected by severe complex baselinetimefrequency direction dependent effects (DDE) such as the complex phased array beams, the ionosphere and its associated Faraday rotation, the station’s clock drifts, and the sky structure. At higher frequency, the interferometers using dishes are less affected by ionosphere, whereas troposphere, pointing errors, and dish deformation play an important role.
1.1. Direction dependent effects and calibration problems
A wide variety of solvers has been developed to tackle the directiondependent calibration problems of radio interferometry. In this paper, for the clarity of our discourse, we classify them in two categories. The first and most widely used family of algorithms (later referred as the Jonesbased solvers) aim at estimating the apparent net product of various effects discussed above. The output solution is a Jones matrix per timefrequency bin per antenna per direction (see Yatawatta et al. 2008; Noordam & Smirnov 2010, and references therein). Sometimes the solutions are used to derive physical parameters (e.g., Intema et al. 2009; Yatawatta 2013 in the cases of ionosphere and beam shape, respectively). The second family of solvers estimate directly from the data the physical terms mentioned above that give rise to a set of visibility (later referred as the continuous or physicsbased solvers). These solvers are used for directionindependent calibration in the context of fringefitting for VLBI (Cotton 1995; Walker 1999, and references therein) to constrain the clock states and drifts (also referred as delays and fringe rates). Bhatnagar et al. (2004) and Smirnov (2011b) have presented solutions to the directiondependent calibration problem of pointing error. It is important to note that deconvolution algorithms are also physicsbased solvers estimating the sky brightness, potentially taking the DDE calibration solution into account (Bhatnagar et al. 2008, 2013; Tasse et al. 2013). The latest imaging solvers can also estimate spectral energy distribution parameters (Rau & Cornwell 2011; Junklewitz et al. 2014). Most of these imaging algorithms are now well understood in the framework of compressed sensing theory (see McEwen & Wiaux 2011, for a review). Their goals, constrains, and methods are, however, very different from purely calibrationrelated algorithms, and we will not discuss them further in this paper.
Jonesbased and physicsbased solvers have both advantages and disadvantages. The main problem with using physicsbased solvers is that the system needs to be modeled accurately, while analytically complex physics can intervene before measuring a given visibility. Jonesbased algorithms solving for the effective Jones matrices are not subject to this problem, because no assumptions have to be made about the physics underlying the building of a visibility (apart from the sky model that is assumed).
However, one important disadvantage of Jonesbased solvers over physicsbased solvers for DDE calibration is that they lack robustness when solving for a large number of parameters and can be subject to illconditioning. This can have dramatic effects in the image plane, such as source suppression. In the most extreme case these algorithms can artificially decrease noise in the calibrated residual maps by overfitting the data. This easily drives artificially high dynamic range estimates. Hundreds of parameters (i.e., of directions) per antenna, polarization, can correspond to tens of thousands of free parameters per time and frequency bin. The measurement operator being highly nonlinear, for each given data set and process space, it is often hard to know whether illconditioning is an issue. Simulations can give an answer in individual cases, and a minimum time and frequency interval for the solution estimate can be estimated. However, this time interval can be large, and the true underlying process can vary significantly within that interval.
1.2. Tracking versus solving
Another important consideration is the statistical method used by the algorithm to estimate the parameters. Most existing Jonesbased and physicsbased solvers minimize a chisquare. This is done by using the GaussNewton, gradient descent, or LevenbergMarquardt algorithm. More recently, in order to solve for larger systems in the context of calibration of DDEs, this has been extended using expectation maximization, and SAGE algorithms (Yatawatta et al. 2008; Kazemi et al. 2011). One wellknown problem is that conventional leastsquares minimization and maximum likelihood solvers lack robustness in the presence of low signaltonoise ratios (S/N; the estimated minimum chisquare jumps in between adjacent data chunks, while this behaviour is nonphysical). In most cases, a filtering of the estimated solutions (Box car, median, etc) or an interpolation might be necessary (see e.g., Cotton 1995). In practice, situations of low S/N combined with the need to perform DDE calibration are not rare (in the case of LOFAR, the ionosphere varies on scale of 30 s while not much flux is available in the field).
In this paper we present a new calibration algorithm whose structure is that of a Kalman filter. Our main aim is to address the stability and illconditioning issues discussed above by using a physicsbased approach, which (i) decreases the number of free parameters in the model and (ii) increases the amount of usable data; the algorithm structure allows us to (iii) use additional physical priors (time/frequency process smoothness for example), while (iv) keeping the algorithm computationally cheap. We note that we do not do any quantitative comparison between leastsquares solvers and our approach. Instead, we focus on describing an implementation of a nonlinear Kalman filter for radio interferometry and we study its robustness.
While nonlinear leastsquares solvers are iterative, our algorithm uses a nonlinear Kalman filter, which is a recursive sequence (see Sect. 2). Kalman filters are referred in the literature as minimum mean square error estimators, and instead of fitting the data at best (leastsquares solver), they minimize the error on the estimate given information on previous states. In other words, they can be viewed as tracking the process rather than solving for it. An estimated process state vector^{1} built from previous recursions, together with a covariance matrix prior are specified. This way, the filter allows us to constrain the expected location of the true process state along the recursion. Even when the location of the minimum chisquare jumps between data chunks, the posterior estimate stays compatible with the prior estimate and with the data (under the measurement and evolutionary models). As more data goes though the filter, the process state and its covariance are updated (and the trace of the covariance matrix decreases in general).
An interesting aspect of our approach is that we use alternative data domains (Sect. 3), which amounts to conducing the calibration in the image domain. We show that this approach provides higher robustness. We discuss the details of the implementation and algorithmic costs in Sect. 4. An important step for the feasibility of the approach is to refactor the filtering steps using the Woodbury matrix identity (Sect. 4.1). We demonstrate the efficiency of our algorithms in Sect. 5, based on simulations of the clock/ionosphere (Sect. 5.1) and pointing error (Sect. 5.2) problems. An extended discussion on the differences between our algorithm and other existing techniques is given in Sect. 6. An overview of the mathematical notation is given in Table 1.
Overview of the mathematical notations used throughout this paper.
1.3. Radio interferometry measurement equation
To model the complex DDE (station beams, ionosphere, Faraday rotation, etc.), we make extensive use of the radio interferometry measurement equation (RIME) formalism, which provides a model of a generic interferometer (for extensive discussions on the validity and limitations of the measurement equation see Hamaker et al. 1996; Smirnov 2011a). Each of the physical phenomena that transforms or converts the electric field before the correlation is modeled by linear transformations (2×2 matrices). If is a sky direction, and M^{H} stands for the Hermitian transpose operator of matrix M, then the 2 × 2 correlation matrix V_{(pq)tν} between antennas p and q at time t and frequency ν can be written as where x is a vector containing the parameters of the given system (ionosphere state, electronics, clocks, etc.), D_{pstν} is the product of directiondependent Jones matrices corresponding to antenna p (e.g., beam, ionosphere phase screen, and Faraday rotation), G_{ptν} is the product of directionindependent Jones matrices for antenna p (like electronic gain and clock errors), and X_{s} is referred as the sky term^{2} in the direction s, and is the true underlying source coherency matrix [[,], [, ]]. The scalar term describes the effect of the array geometry and correlator on the observed phase shift of a coherent plane wave between antennas p and q. We have with [u,v,w] ^{T} as the baseline vector between antennas p and q in wavelength units and φ(u,v,w,s) = ul + vm + w^{(}n − 1^{)}.
Although the detailed structure of Eq. (1) is of fundamental importance, throughout this paper it is reduced to a nonlinear operator h:R^{N} → R^{M}, where N is the number of free parameters and M is the number of data points. The operator h therefore maps a vector x_{k} parameterizing the Jones matrices and/or sky terms appearing on the righthand side of Eq. (1) (the states of the beam, the ionosphere, the clocks, the sky, etc.), and maps it to a vector of visibilities y_{k} such that y_{k} = h(x_{k}). In the following, y_{k} is the set of visibilities at the time step k for all frequencies, all baselines, and all polarizations. The choice of mapping the state space to the measurement space for all frequencies for a limited amount of time (time step k) is motivated by the fact that regularity is much stronger on the frequency axis. For example, the Jones matrices associated with the ionosphere or clocks, although they greatly vary in time, have a very stable frequency behaviour at any given time.
2. Kalman filter for nonlinear systems
Nonlinear leastsquares algorithms only consider the χ^{2} value for the given data chunk. As mentioned above, this is a problem in (i) low S/R and (ii) illconditioned regimes. For example for (i), if one considers a noisy χ^{2} valley, the leastsquares solution will jump between each timefrequency bin because of noise – while this behaviour is obviously nonphysical. Effect (ii) will bring instability as a result of the χ^{2} valley having multiple local minima, or a flat minima. As explained in Sect. 1.2, Kalman filters provide a number of advantages allowing us in principle to significantly improve robustness, and to minimize the impact of illconditioning.
In the following, we assume an evolution operator x_{k} = f(x_{k − 1}) + w_{k} describing the evolution of the physical quantities underlying the RIME, and a measurements operator y_{k} = h(x_{k}) + v_{k} generating the set of measurement for a given process vector x_{k} (examples for both f and h are given in Sect. 5). The random variables v_{k} and w_{k} model the noise and are assumed to follow normal distributions and , where R_{k} and Q_{k} are the data and process covariance matrix respectively. In the following, we name the predictedprocess and data domains the codomains of f and h, respectively.
2.1. Kalman filter
The traditional Kalman filter (Kalman 1960) assumes (a) f and h to be linear operators (written F and H below for f and h respectively). If the process vector x_{k − 1} for the timestep k − 1 has estimated mean and P_{k − 1  k − 1} estimated covariance from the data at step k − 1, assuming (b) Gaussian noise in the process and data domains, x_{k − 1  k − 1} is distributed following .
Under the conditions (a) and (b), operators F and H yield Gaussian distributions in the predictedprocess and data domains respectively. Given and P_{k − 1  k − 1} the Kalman filter (i) predicts and P_{k  k − 1} through F; and (ii) updates those to and P_{k  k} through H given the data y_{k}.
It can be shown that the mean and covariance of x_{k − 1} can be evolved through F giving and P_{k  k − 1} as follows:
Taking into account the data vector y_{k} at step k, the updated mean and covariance and P_{k  k} of x_{k} are estimated through the calculation of the Kalman gain K_{k}, and are given by
The estimate is optimal in the sense that P_{k  k} is minimized. This approach is extremely powerful for linearsystems, but the radio interferometry measurement equation is highly nonlinear (the operator h in Eq. (1)). This makes the traditional Kalman filters unpractical for the radio interferometry calibration problem.
2.2. Unscented Kalman filter
The Kalman filters fail at properly estimating the statistics of x essentially because f and h are nonlinear, and lead to nonGaussian distributions in the predictedprocess and data domains described above.
The unscented Kalman filter (UKF, Julier & Uhlmann 1997; Wan & van der Merwe 2000) aims at properly estimating the mean and covariance in both these domains by directly applying the nonlinear operators f and h to deform the initial Gaussian distribution of x. In practice, instead of selecting a large number of process vectors built at random as is done in Monte Carlo particle filters for example, the unscented transform (UT) scheme selects a much smaller set of 2N + 1 sigmapoints in the process domain in a deterministic manner. Each point is characterized by a location in the process space and by a corresponding weight. The set is built in such a way that its mean and covariance match the statistics of the random variable x. The points are propagated through the nonlinear operators f and h in the predictedprocess and data domains, respectively, and the corresponding mean and covariance are estimated based on their evolved positions and associated weights. Using Taylor expansions of f and h, it can be shown that the mean and covariance of the evolved random variable are correct up to the third order of the expansion (Julier & Uhlmann 1997). Errors are introduced by higher order terms, but partial knowledge about these can be introduced using proper weighting schemes. It is important to note, however, that even though the mean and covariance can be properly estimated after applying nonlinear operators through the UT, the Kalman filter still assumes Gaussian statistics of all random variables to estimate the statistics of x.
2.2.1. Sigmapoints and associated weights
Given a multivariate distribution with covariance P of size N × N, the set of sigmapoints are generated in the following way (10)Here N is the number of parameters, P is the process covariance matrix, and [M] _{i} is the ith column of the matrix M. The realvalued scalar w_{0} controls the distance of the sigmapoints to the origin. As N increases, the radius of the sphere that contains the sigmapoints increases as well. As shown in Julier & Uhlmann (1997), for the errors to be minimized on the mean and covariance estimate, the sigmapoints should stay in the neighbourhood of the origin. The sigmapoint locations are scaled by a parameter α giving: (11)When estimating the mean of the evolved distribution, the weights associated with the sigmapoints are (12)where μ is a normalizing constant appearing while computing the Taylor expansion of the nonlinear operator. When computing the covariance of the sigmapoints, the weights are given by (13)where β is an extra parameter that can be used to incorporate additional knowledge on the fourthorder term of the Taylor expansion of the covariance.
2.2.2. Filtering steps
A set of sigmapoints is generated assuming , following the scheme outlined above (Sect. 2.2.1). The sigmapoints are then propagated through the nonlinear evolution operator f (14)and the mean and covariance are estimated as follows: If μ = α^{2} the expressions of and x_{k  k − 1} agree up to the third order of the Taylor expansion. We note that Eqs. (15) and (16) are the UKF versions of Eqs. (3) and (4). Once and P_{k  k − 1} are estimated, we assume , and a new set of sigmapoints is generated following the scheme outlined in Eqs. (10) and (11). The new set of sigmapoints are propagated onto the measurement domain through the nonlinear observation function h (17)where is the measurement vector corresponding to each process vector . As in Eqs. (15) and (16) the measurement mean ŷ_{k}, measurement covariance P_{ykyk}, and statemeasurement crosscovariance P_{xkyk} are then estimated:
We note again that Eqs. (18) and (19) mirror the behaviour of Eq. (5), while the term P_{xkyk} (Eq. (20)) is similar to in Eq. (6); ŷ_{k} has size M, P_{ykyk} has size M × M and P_{xkyk} has size N × M, where N and M are the dimensions of the process and measurement spaces respectively. The Kalman gain is then (21)and the updated estimates and P_{k  k} can be computed:
Fig. 1 This figure shows how the sigmapoint statistics compares to the true statistics. Specifically, we consider the case of a frequencydependent phase gradient (clock offset, ionospheric effect, or source position offset). The true covariance corresponds to the gray area, while the sigma points appear as dotted lines, together with their associated estimated covariance (thick dashed line). The noise in the data is represented by the dashdotted line. Qualitatively, the goodness of the description depends on the type of the data representations operator. 
3. Data representation
In this section we describe how we can modify the measurement operator together with the raw data to improve robustness. Using the operator discussed in Sect. 3.2 in combination with the Kalman filter discussed above, this is equivalent to an imageplane calibration.
3.1. Robustness with large process covariance
As explained in Sects. 2.2 and 4, the unscented transform correctly approximates the evolved covariance up to the third order. When the radius of the sphere containing σpoints in the process domain increase, and depending on the strength of nonlinearities of the evolution and measurement operators f and h, the estimated mean and covariance can be affected by large errors. This is the case when the multivariate ellipsoid is too deformed, and the statistics of the σpoints in the evolved domain no longer capture the true statistics.
Here, we introduce another layer of nonlinear operators that transform the sets of visibilities into another measurement domain. We define various simple representation operators as where is the operator described in Sect. 3.2 transforming a set of visibilities into another set of 1D images (the Fourier transform along the frequency axis, also referred to later as the pseudoimage domain). The operators ., φ(.), and Re(.) take the complex norm, the phase, and the real part, respectively; M is the number of visibility data, and M_{i} the number of pixels in the image domain. Our goal is to obtain a system that has less nonlinearity, so that the σpoints statistics still properly match the true statistics, even when the volume within the multivariate ellipsoid is large in the process domain.
In order to illustrate this idea, we compare the evolved covariance estimated using Eq. (19) to the true evolved mean and covariance as estimated by running Monte Carlo simulations. The system is made of one point source at the phase center, the bandpass goes from 30 to 70 MHz, and our interferometer consists of one baseline. We consider the ellipsoid of the clockoffset parameter (therefore corresponding to a large 10 × 10^{9} s estimated error), and inspect how it is reflected in the data domains after applying (the symbol ° is used here for the function composition). For this system, Eq. (1) becomes , where Δt_{01} is our random variable. Figure 1 shows how the statistics of the σpoints compare to the true statistics when using different data representations ℛ. When ℛ picks the real part Re or the phase φ of ( and ), the σpoints statistics are obviously wrong.
3.2. Onedimensional image domain for calibration
Although a single transformation separate the uvplane from the image domain, it seems the latter is sometimes more suited for calibration. Intuitively, in the uv domain clock shifts, source position, and ionospheric disturbance will wrap the phases of the complexvalued visibilities everywhere, and the strong nonlinearities sometimes present in h make the distribution strongly nonGaussian. On the contrary, in the image domain, the same type of perturbations only affect the data locally, and will move the flux from one pixel to its neighborhood.
We cannot use the 2D image domain, as it is built from the superposition of all baselines, which would lead to an effective loss of information. Instead, similarly to what is done for VLBI delays and fringe rates calibration (Cotton 1995), we build a 1D image per baseline (see Appendix A for details). As shown in Fig. 1, when going into the pseudoimage domain, the power is concentrated in a few pixels. Using Monte Carlo simulations Fig. 1 shows the ability of the unscented transform to properly describe the data statistics after the nonlinear measurement operator and the different representation operators ℛ (Sect. 3.1) have been applied to the process domain. We have subtracted the true mean ⟨ℛ⟩_{MC} from each quantity plotted in this figure. The real part of individual pixels gives a better match, but is still biased. Taking the norm of the imageplane complex pixel seems to behave well in all conditions. This is discussed in detail in Appendix A.
3.3. The augmented measurement state for regularization
As explained above, one of the aims of the work presented in this paper is to address the illconditioning issues related to the large inverse problem underlying the use of modern interferometers. This is done by analytically specifying the physics underlying the RIME (using a physicsbased approach, see Sect. 1.1), and by using the Kalman filter mechanism, able to constrain the location of the true process state through the transmission of previous estimated state and associated covariance. Yet, in some situations, and particularly when two variables are analytically degenerate (such as the clock shifts and the ionosphere when the fractional bandwidth is small), the robustness of the scheme presented is not strong enough to guarantee the regularity of solutions, and the estimated process can drift to a domain where solutions are nonphysical.
In order to take into account external constrains while still properly evolving the process covariance P_{k  k}, we introduce an augmented measurement model (see e.g., Henar 2011; Hiltunen et al. 2011). Using the idea underlying Tikhonov regularization, if x_{0} is the expected value of x, and the covariance of x_{0}, our cost function becomes where ∥ x ∥ _{C} = x^{T}C^{1}x is the norm of vector x for the metric C, with C the covariance matrix of x (or the Mahalanobis distance). The operator is the augmented version of h and is the block diagonal covariance matrix of the augmented process vector. The parameter γ allows us to control the strength of the Tikhonov regularization, and is such that .
4. Implementation for radiointerferometry
As explained above, radiointerferometry deals with large inverse problems, made of millions or billions of nonlinear equations. This poses a few deep problems including (i) numerical cost and (ii) numerical stability. In this section, we describe our UKF implementation.
4.1. Woodbury matrix identity
The first matter is the size of the matrices involved in the UKF recursion steps presented in Sect. 2.2. Specifically, in the case of LOFAR, we have n_{bl} ~ 1500 baselines and n_{ν} ~ 250 frequencies. This gives a number of dimensions M for the measurement space of M ~ 1.5 × 10^{6} per recursion (taking into account the fourpolarization visibilities). The predicted measurement covariance matrix P_{ykyk} has size M × M, and in practice becomes impossible to store and invert directly (~8 petabytes of memory). Fortunately we can refactor Eq. (21) so that we do not have to explicitly calculate each cell of P_{ykyk}. We can see that Eq. (19) can be rewritten as (27)where S_{k} is a matrix of size M × (2N + 1), [S_{k}] _{i} is the ith column of S_{k}, and W is a diagonal matrix of size (2N + 1) × (2N + 1) containing the weights on its diagonal. Using the Woodbury matrix identity^{3} (Hager 1989), we can express the Kalman gain K_{k} as (28)This relation (Eq. (28)) is quite remarkable, as it allows us to apply the Kalman gain without explicitly calculating it, and without estimating either P_{ykyk} or its inverse. Instead, the inverse W^{1} of the diagonal weight matrix of size (2N + 1) × (2N + 1), and the inverse of the data covariance matrix of size M × M have to be estimated. Even though R_{k} has large dimensions, if the noise is uncorrelated only the diagonal has to be stored and the inverse can be computed elementbyelement. Similarly, the inner product of matrices with R_{k} are computationally cheap. At each recursion k we have to explicitly estimate the σpoints evolved through the measurement operator h and contained in S_{k}.
4.2. Adaptive step
While P_{k  k} characterizes the posterior process covariance, the matrix Q (Eqs. (4) and (16)) characterizes the intrinsic process covariance through time. It can, for example, describe the natural timevariability of the ionosphere, the speed of the clock drift, or the beam stability. In addition, in a strong nonlinear regime, it is well known that the Kalman filters can underestimate P_{k  k}, and thereby drive biases in the estimate of x_{k}. This would typically happen when is too far from x, or when the process covariance is changing (for example a changing and increasing ionospheric disturbance for a given timeperiod). Although the Kalman filter does not produce an update Q_{k} of Q, based on the residual data we can externally update it and write Q_{k} = κQ. The scaling factor κ is useful for estimating whether the model properly fits the data at any time step k. Following Ding et al. (2007), we can write where tr{A} is the operator computing the trace of a matrix A. The weights are designed to take into account past residual values, and τ_{Qk} is a timetype constant. Here, estimating κ is computationally cheap as tr only accesses the diagonal of the input matrix.
Fig. 2 Snapshot residual image estimated at different recursion times (the color scale is identical in each panel). As recursion time evolves, more data cross the Kalman filter, the ionospheric and clock parameters estimates become more accurate, and the residual noise level decreases (see also Fig. 4). 
4.3. Computational cost
In this section, we discuss the computational cost of the proposed algorithm. Our concern is to show that the approach is feasible, and we do not intend to show that it is faster than other existing approaches. However, we discuss the issues of the scaling relations and parallelizability of various parts of the algorithm. We argue that using the refactorization described in Sect. 4.1, our algorithm should be compatible with existing hardware, even for the data sets produced by the most modern radiotelescopes.
As explained above, we adopt a physicsbased approach to reduce the number of degrees of freedom by orders of magnitudes while using more data at a time (see Sect. 1.1 for a discussion on Jonesbased versus physicsbased approach). The Kalman filter is fundamentally recursive (i.e., it has only one iteration), while tens to hundreds of iterations are needed to reach local convergence with LevenbergMarquardt for example. This means that the equivalent gain by the proposed approach on the model estimation side is the number of iteration. This gain might be balanced in some cases by the larger data chunks processed at a time by the Kalman filter itself. We give here the scaling relations for our implementation of the filter scheme.
The predict step (controlled by operator f in Sect. 2.2) always represents a relatively small number of steps, as it scales with the number of parameters N in the process vector. The update step, however, is the costly part of the computation (Sect. 2.2). It consists of (i) estimating the data corresponding to different points in the process domain (applying the operator h, see Eq. (1)) and (ii) computing the updated estimate of the process vector and associated covariance. Step (i) is common to all calibration algorithms, and in most cases this is the expensive part of the calculation as we map N parameters to M data points, M having relatively high values of ~10^{4} − 10^{6}. Within our framework, most of the computation is spent in the estimate of S_{k} of size M × (2N + 1) (Eq. (17)), which, compared to the Jacobian equivalent that would have size M × N, represents a cost of a factor of ~2 in computational time. It is worth noting that this step is heavily parallelizable. The operations in (ii) to apply the Kalman gain (Eq. (28)) are negligible in terms of computing time for the example described in Sect. 5.1. From the scaling associated with the use of the Woodbury matrix identity, following the work presented in Mandel (2006), we estimate this operation should scale as .
We believe the algorithm should be practical with large data sets. For the few test cases we have been working on (with a 4core CPU) with ~20 to ~100 parameters, and moderate to large data set (such as the LOFAR HBA data set containing ~1500 baselines, 30 subbands, and a corresponding M ~ 1.8 × 10^{5} points per recursion, see Sect. 5.1.2), the algorithm was always sveral times faster than real time. On the LOFAR CEP1 cluster, a distributed software would work with up to M ~ 10^{6} points per recursion and N ~ 100. Preliminary tests showed that the Kalman filter’s most consuming steps appearing in Eq. (28) are computed within a few seconds (computing inner products with numpy/ATLAS, on an 8core CPU).
5. Simulations
In this section, we present simulations for (i) the pointing error calibration problem (also addressed using physicsbased algorithms in Bhatnagar et al. 2004; Smirnov 2011b) and (ii) the clock/ionosphere problem.
5.1. Clock drifts and ionosphere
LOFAR raw data sets are characterized by a few dominating directionindependent and DDEs, including clocks and the ionosphere. While direction dependent calibration is known to be difficult, clock errors and ionosphere effects combined with a limited – even though large – bandwidth make the problem partially illconditioned.
5.1.1. Evolution and measurement operators
The error due to the clock offset of antenna p produces a linearly frequencydependent phase . The time delay introduced by the ionosphere is frequency dependent , where T_{p,d} is the total electron content (TEC), given in TECunits (TECU), and seen by station p in direction d. This gives a phase , with k = 8.44 × 10^{9} m^{3} s^{2}.
The measurement operator h (Eq. (1)) is built from the direction independent G_{ptν} and direction dependent terms as where I is the 2 × 2 unity matrix and δt_{p} is a simple linear operator unpacking the clock value of antenna p from the process vector x. For this test, we choose to model the ionosphere using a Legendre polynomial function (see e.g., Yavuz 2007). Assuming a single ionospheric screen at a height of 100 km, the nonlinear operator extracts the Legendre coefficients, and returns the TEC value seen by antenna p in direction d. For this simulation, we are using the representation presented in Sect. 3.
Fig. 3 Estimated clock errors (full line) and posterior covariance (dotted line) as a function of time for different LOFAR stations in the simulation presented in Sect. 5.1. The dashed line shows the true underlying clock offset. To improve convergence speed, before t = 6 min, the evolutionary model is stochastic. 
The operator f that describes the dynamics of the system typically contains a lot of physics. Clock offsets drift linearly with time, while the ionosphere has a nontrivial behaviour (defined in Sect. 5.1.2). We configure the filter to consecutively use two types of evolution operator f. The first is used for the first ~6 min, and f is the identity function, which corresponds to a stochastic evolution. This appears useful when the initial process estimate starts far from the true process state. This way, the filter’s state estimate gets closer to the true state without assuming any physical evolutionary dynamics. The convergence speed and accuracy are then both controlled by the covariance matrices P_{k} and Q_{k} described above. We set the second evolutionary operator f to be an extrapolating operator. It computes an estimated process vector value from the past estimates. At time step k, for the ith component of x, solutions are estimated as follows: where is the operator computing the polynomial interpolation, is the degree of the polynomial used for the interpolation, w_{i} are the weights associated with each point x_{k − m} at t_{k − m}, and τ_{i} gives a timescale beyond which past information is tapered.
5.1.2. Simulation for LOFAR
An important consideration is that at any given time our algorithm needs to access all frequencies simultaneously. With 250 subbands (16bit mode), 1 channel per subband, 1500 baselines, 4 polarization, this gives a number of visibilities per recursion of 1.5 × 10^{6}. The data is currently distributed and stored per subband, so our software needs to deal with a number of technical problems for a realistic full size simulation. For this simulation, we scale down the problem by a factor ~8 in terms of number of frequency points per recursion. Assuming NVSS source counts (Condon et al. 1998), a spectral index of −0.8, and a field of view of 8 degrees in diameter, we estimate a total of ~20 Jy of signal per pointing at ~150 MHz. Inspecting the cross polarization visibilities of a LOFAR calibrated data set with Δν = 0.2 MHz and Δt = 6 s gives an estimated noise of ~2 Jy per visibility bin. We work on 30 subbands only, with frequencies linearly distributed between 100 MHz and 150 MHz, and scale the signal to match S/N ~ 10 per visibility per subband. We distribute the corresponding flux density on a 3 × 3 rectangular grid of sources with a step of 3 degrees in RA and 3 degrees in Dec.
For the dynamics of the underlying physical effects, we apply a linearly drifting clock offset taken at random with ns. As mentioned above, we model the ionosphere with a 2DLegendre polynomial basis function. For this simulation, each Legendre l_{ij} coefficient varies following l_{ij} = sin(t/τ_{ij} + dτ_{ij}), with τ_{ij} and dτ_{ij} taken at random. Along the lines discussed above, we set and for the clock and for the ionosphere, respectively. These orders for the extrapolating polynomials are in agreement with the linear clock drift and the nonlinear behaviour of the ionosphere.
5.1.3. Results
The filter and simulation configuration are described in Sects. 5.1.1 and 5.1.2, respectively. At each recursion step, ~180.000 complex visibilities cross the filter and the process state (clock and ionosphere) as well as its covariance are estimated.
First, in order to inspect the match to the data we derive the frequencybaselinedirection dependent Jones matrices at the discrete locations of the sources in our skymodel, and subtract the visibilities corresponding to the given discrete directions. We then grid the residual visibilities and compute the snapshot images (see Fig. 2). Figure 4 shows the minimum and maximum residual as well as the standard deviation as a function of time. Very quickly the visibilities are correctly matched down to the thermal noise.
Fig. 4 Top panel: maximum and minimum residual values in the snapshot image as a function of recursion time (full line). Bottom panel: standard deviation in the residual snapshot maps. The expected thermal noise is shown in the bottom figure as the dotted line. 
Fig. 5 Ionosphere state described using a Legendre polynomial basis function. The estimated and true state of the ionosphere (left and right panels) are shown at the beginning and at the end of the filter’s recursion. The open circles show the location of the pierce points in the ionosphere. The spatial coordinates are given in kilometers from the array center projected on the ionosphere screen. 
In Fig. 3 we show the clock offsets estimates as a function of time compared to the true clock errors. The clock offset estimates seem to converge asymptotically to the true underlying states. The ionospheric parameter estimates are more subject to illconditioning, as some parts of the TECscreen are not pierced by any projected station. However, plotting the TECscreen corresponding to the individual Legendre coefficients gives a good qualitative match to the true TEC values (Fig. 5).
5.2. Pointing errors
One of the dominating calibration errors for interferometers using dishes are the individual antenna pointing errors. They start to have a significant effect even at moderate dynamic range, and can be severe for nonsymmetric primary beams with azimuthal dish mounts. Bhatnagar et al. (2004) and Smirnov (2011b) have presented a physicsbased calibration scheme to specifically solve for pointing errors, using a leastsquares minimization technique combined with a beam model.
Here, we simulate a Westerbork Synthesis Radio Telescope (WSRT) data set. As in Sect. 5.1, we first define a measurement equation (operator h). We only consider the direction dependent primary beam effect, using the WSRT cos^{3} beam model
where δl_{pt} and δm_{pt} are the operators unpacking the pointing error values δl and δm for antenna p at time t. The f evolution operator is the same as in Sect. 5.1.1, with τ_{i} = 5 min (Eq. (33)). We simulate a dataset containing 64 channels centered at ~1.3 GHz and channel width δν = 0.3 MHz. The sky model has 20 sources with a total flux density of ~10 Jy, with a noise of 0.2 Jy per visibility. The simulated pointing errors have an initial global offset distributed as arcmin, and the pointing offsets evolve periodically as δl(t) = l_{0} + a_{l}cos(2πt/τ_{l} + φ_{l}), with arcmin, min, and φ_{l} uniformly distributed between 0 and 2π. The same scheme is used to generate the evolution law for δm.
Figure 6 shows the comparison between the estimated pointing errors and the true pointing errors for a few antennas. The filter’s estimate rapidly converges to the true pointing offset, and properly tracks its state within the estimated uncertainty.
Fig. 6 True pointing errors for a WSRT simulation (dashed line) together with the estimated state as a function of time (full line). Our algorithm properly tracks the time dependent pointing errors within the estimated covariance (dotted line). 
6. Discussion and conclusion
6.1. Overview: pros, cons, and potential
As discussed throughout this paper, it is important to obtain robust algorithms that do not affect the scientific signal. In this paper, we have presented a method that aims at improving robustness along the following lines:

1.
The Kalman filter presented here is fundamentally recursive, and information from the past is transferred along the recursion, thereby constraining the expected location of the underlying true state. This is fundamentally different from minimizing a chisquare, and then smoothing or interpolating this solution, especially since we can assume a physical measurement and evolutionary model.

2.
Contrary to the Jonesbased algorithms that have to deal with hundreds of degrees of freedom, our algorithm follow a physicsbased approach (see Sect. 1.1 for other physicsbased methods). It aims at estimating the true underlying physical term, potentially describing the Jones matrices of individual effects everywhere in the baselinedirectionfrequency space. The very inner structure of the RIME can be used to constrain the solutions. This feature allows us to take into account much bigger data chunks. Typically, most effects have a very stable frequency behaviour, and the data in the full instrumental bandwidth can be simultaneously used at each recursion. This improves conditioning.

3.
The measurement operator is nonlinear, and combining (1) with (2) is made possible by using a modern nonlinear version of the Kalman filter, together with the representation operator presented in Sect. 3.

4.
Illconditioning can still be significant if effects are analytically degenerate to some degree. We can modify the measurement operator to take external prior information into account (see Sect. 3.3), and reject solutions that are considered to be nonphysical. For example, this can allow the user to provide the filter with an expected ionospheric power spectrum of the Legendre coefficients.

5.
One of the benefits of using filters is that they produce a posterior covariance matrix on the estimated process state. The covariance estimate should be reliable assuming the nonlinearities are not too severe.
Given the large size of our inverse problem, and in order to make any algorithm practical, optimizing the computational cost is of prime importance. Using the Woodbury matrix identity (Sect. 4.1), we refactor the unscented Kalman filter steps to make the algorithm practical. Even for the moderately large simulations described in Sect. 5.1, a 4core CPU is able to constrain solutions faster than real time. The need to access the data of all frequencies simultaneously represents some technical problems, as these are distributed over different cluster nodes.
An important potential problem with physicsbased approaches is that the system needs to be described analytically, while algorithms solving for the effective Jones matrices do not use any assumptions about the physics underlying the building of a visibility (apart from the sky model that is assumed). This would cause problems in particular if the model encapsulated in the operator h misses physical ingredients that are present in reality, and would probably drive biases in the estimates. Furthermore, the unscented Kalman Filter used and adapted to the context of radiointerferometry in this paper deals with nonlinearities only up to a certain level. This means in practice that the process a priori covariance has a certain maximum size for a given type of nonlinearities.
6.2. Conclusion
The use of filters and similar methods can potentially improve radio interferometric calibration. As with leastsquares minimization techniques, our approach is guaranteed to work only if nonlinearities are not too severe in the neighborhood of the true process state. Other algorithms dealing with nonlinearities are known to provide higher robustness, such as more general particle filters, or Markov chains Monte Carlo algorithms. The latter is guaranteed to provide a correct estimated posterior distribution. However, most of these methods are computationally expensive because of the many predict steps that have to be computed, and this fact could make them impractical, given the large size of our problem. Recursive algorithms are well adapted to streaming precalibration, and based on preliminary simulations, our algorithm seems to be robust enough to solve for the sky term (positions, flux densities, spectral indices, etc.) in a streaming way.
We have not yet demonstrated the efficiency of our algorithm with real data sets essentially because of its complexity and novelty. Indeed, our software needs to deal with a number of technical issues as well as more fundamental problems. Specifically in the case of the newest interferometers such as LOFAR, (i) we have to deal with large quantities of distributed data, and the algorithm has to access all frequencies simultaneously. Beyond these technical aspects, because we solve for the underlying physical effects; (ii) we need to build pertinent physical models for the various effects we are solving for, such as the ionosphere or phased array beams. An application of this algorithm to real data sets will be presented in a future paper.
Acknowledgments
I thank Ludwig Schwardt for helping me understand some important aspects of Kalman filters. These openended discussions were very helpful to developing the framework presented in this paper. Thanks go to Trienko Grobler and Oleg Smirnov for giving useful comments on the paper draft.
References
 Bhatnagar, S., Cornwell, T. J., & Golap, K. 2004, EVLA Memo 84. Solving for the antenna based pointing errors, Tech. Rep., NRAO [Google Scholar]
 Bhatnagar, S., Cornwell, T. J., Golap, K., & Uson, J. M. 2008, A&A, 487, 419 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Bhatnagar, S., Rau, U., & Golap, K. 2013, ApJ, 770, 91 [NASA ADS] [CrossRef] [Google Scholar]
 Condon, J. J., Cotton, W. D., Greisen, E. W., et al. 1998, AJ, 115, 1693 [NASA ADS] [CrossRef] [Google Scholar]
 Cotton, W. D. 1995, in Very Long Baseline Interferometry and the VLBA, eds. J. A. Zensus, P. J. Diamond, & P. J. Napier, ASP Conf. Ser., 82, 189 [Google Scholar]
 Ding, W., Wang, J., Rizos, C., & Kinlyside, D. 2007, J. Navigation, 60, 517 [NASA ADS] [CrossRef] [Google Scholar]
 Hager, W. W. 1989, SIAM Rev., 31, 221 [CrossRef] [Google Scholar]
 Hamaker, J. P., Bregman, J. D., & Sault, R. J. 1996, A&AS, 117, 137 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Henar, F. E. 2011, Master thesis, Institut für Biomedizinische Technik [Google Scholar]
 Hiltunen, P., Särkkä, S., Nissilä, I., Lajunen, A., & Lampinen, J. 2011, Inverse Problems, 27, 025009 [NASA ADS] [CrossRef] [Google Scholar]
 Intema, H. T., van der Tol, S., Cotton, W. D., et al. 2009, A&A, 501, 1185 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Julier, S. J., & Uhlmann, J. K. 1997, in Signal Processing, Sensor Fusion, and Target Recognition VI, ed. I. Kadar, SPIE Conf. Ser., 3068, 182 [NASA ADS] [Google Scholar]
 Junklewitz, H., Bell, M. A., & Enßlin, T. 2014, A&A, submitted [arXiv:1401.4711] [Google Scholar]
 Kalman, R. E. 1960, J. Fluids Eng., 82, 35 [Google Scholar]
 Kazemi, S., Yatawatta, S., Zaroubi, S., et al. 2011, MNRAS, 414, 1656 [NASA ADS] [CrossRef] [Google Scholar]
 Mandel, J. 2006, in CCM Report 231, University of Colorado at Denver and Health Sciences Center [Google Scholar]
 McEwen, J. D., & Wiaux, Y. 2011 [arXiv:1110.6137] [Google Scholar]
 Noordam, J. E., & Smirnov, O. M. 2010, A&A, 524, A61 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Rau, U., & Cornwell, T. J. 2011, A&A, 532, A71 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Smirnov, O. M. 2011a, A&A, 527, A106 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Smirnov, O. M. 2011b, presentation at CALIM2011 Conf. [Google Scholar]
 Tasse, C., van der Tol, S., van Zwieten, J., van Diepen, G., & Bhatnagar, S. 2013, A&A, 553, A105 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Walker, R. C. 1999, in Synthesis Imaging in Radio Astronomy II, eds. G. B. Taylor, C. L. Carilli, & R. A. Perley, ASP Conf. Ser., 180, 433 [Google Scholar]
 Wan, E. A., & van der Merwe, R. 2000, The Unscented Kalman Filter for Nonlinear Estimation [Google Scholar]
 Yatawatta, S. 2013, Exp. Astron., 35, 469 [NASA ADS] [CrossRef] [Google Scholar]
 Yatawatta, S., Zaroubi, S., de Bruyn, G., Koopmans, L., & Noordam, J. 2008 [arXiv:0810.5751] [Google Scholar]
 Yavuz, E., Arıkan, F., Arıkan, O., Erol, C. B. 2007, in 13th European Signal Processing Conference, EUSIPCO 2005, Antalya, Turkey, 2394 [Google Scholar]
Appendix A: Onedimensional images properties
In this section we discuss in more detail the 1D image representations introduced in Sect. 3. Aiming at being as conservative as possible, but still working in the image domain, we define g_{1D} to be the operator that builds one 1D image per baseline (pq), and polarization i as
where c is the speed of light, rect is the rectangular function, ν_{m} = (ν_{0} + ν_{1}) / 2, and Δν = ν_{1} − ν_{0} with ν_{0} and ν_{1} the minimum and maximum available frequencies. In order to align the ucoordinate with the frequency extent of the baseline, we rotate b_{pq} and s and s_{d} with 3 × 3 rotation matrix U_{θφ} such that where (l,m,n) are the image plane coordinates, (u_{pq},v_{pq}) are the uvcoordinates of baseline (pq), (ν_{0},ν_{1}) are the lower and higher frequencies values of the interferometer’s bandpass. It is unitary so . We can write the complex 1D image as (A.5)where ∗ is the convolution product and S_{d,i} is the apparent flux of the source in direction d for polarization i.
More intuitively, this means that is obtained by projecting the sky on the baseline, and convolving it with the 1D point spread function (PSF) of the given baseline. The term PSF_{1D} is obtained by computing the inverse Fourier transform of the uvdomain sampling function, where sinc is the cardinal sine function. We can see in Eq. (A.7) that PSF_{1D} contains both low and high spatial frequency terms (u′Δν/c and u′ν_{m}/c, respectively, whose ratio equals the fractional bandwidth).
Therefore, while still contains the high frequency fringe, taking the complex norm using eliminates the high spatial frequency term, and PSF_{1D} under is smoother ( extracts the envelope of PSF_{1D}). This intuitively explains why the representation seems to provide a better match between the true and the σpoints statistics. As the clock and ionospheric displacements mostly amount to apparent shifts in source locations, smoothness of provides stability in the σpoints statistics in the data domain.
Applying any of the representation operators presented above modifies the properties of the input data covariance matrix R_{k} (Eq. (5), (19), and (27)). Assuming the noise in the visibilities is uncorrelated, R_{k} is diagonal matrix. Assuming the noise is not baseline or frequency dependent we have R_{k} = σ^{2}ℐ, for and , and R_{k} = (σ^{2}/n_{ν}) ℐ for . The statistics of are nonGaussian since it is the norm of a complex number. In that case the random variable is which follows a noncentral χdistribution with two degrees of freedom, and mean and covariance (A.8)\newpage
where is a generalized Laguerre polynomial.
All Tables
All Figures
Fig. 1 This figure shows how the sigmapoint statistics compares to the true statistics. Specifically, we consider the case of a frequencydependent phase gradient (clock offset, ionospheric effect, or source position offset). The true covariance corresponds to the gray area, while the sigma points appear as dotted lines, together with their associated estimated covariance (thick dashed line). The noise in the data is represented by the dashdotted line. Qualitatively, the goodness of the description depends on the type of the data representations operator. 

In the text 
Fig. 2 Snapshot residual image estimated at different recursion times (the color scale is identical in each panel). As recursion time evolves, more data cross the Kalman filter, the ionospheric and clock parameters estimates become more accurate, and the residual noise level decreases (see also Fig. 4). 

In the text 
Fig. 3 Estimated clock errors (full line) and posterior covariance (dotted line) as a function of time for different LOFAR stations in the simulation presented in Sect. 5.1. The dashed line shows the true underlying clock offset. To improve convergence speed, before t = 6 min, the evolutionary model is stochastic. 

In the text 
Fig. 4 Top panel: maximum and minimum residual values in the snapshot image as a function of recursion time (full line). Bottom panel: standard deviation in the residual snapshot maps. The expected thermal noise is shown in the bottom figure as the dotted line. 

In the text 
Fig. 5 Ionosphere state described using a Legendre polynomial basis function. The estimated and true state of the ionosphere (left and right panels) are shown at the beginning and at the end of the filter’s recursion. The open circles show the location of the pierce points in the ionosphere. The spatial coordinates are given in kilometers from the array center projected on the ionosphere screen. 

In the text 
Fig. 6 True pointing errors for a WSRT simulation (dashed line) together with the estimated state as a function of time (full line). Our algorithm properly tracks the time dependent pointing errors within the estimated covariance (dotted line). 

In the text 
Current usage metrics show cumulative count of Article Views (fulltext article views including HTML views, PDF and ePub downloads, according to the available data) and Abstracts Views on Vision4Press platform.
Data correspond to usage on the plateform after 2015. The current usage metrics is available 4896 hours after online publication and is updated daily on week days.
Initial download of the metrics may take a while.