Imaging swiFTly: Streaming widefield Fourier Transforms for large-scale interferometry

P. Wortmann; J. Kent; B. Nikolic

doi:10.1051/0004-6361/202142040

Home

All issues

Volume 687 (July 2024)

A&A, 687 (2024) A152

Full HTML

Open Access

Issue		A&A Volume 687, July 2024


Article Number		A152
Number of page(s)		13
Section		Numerical methods and codes
DOI		https://doi.org/10.1051/0004-6361/202142040
Published online		05 July 2024

A&A, 687, A152 (2024)

Imaging swiFTly: Streaming widefield Fourier Transforms for large-scale interferometry

P. Wortmann, J. Kent and B. Nikolic

Cavendish Astrophysics Group, University of Cambridge, JJ Thomson Avenue, Cambridge CB3 0HE, UK
e-mail: peter.wortmann@skao.int; jameschristopherkent@gmail.com; bn204@cam.ac.uk

Received: 16 August 2021
Accepted: 29 January 2024

Abstract

Aims. We describe a scalable distributed imaging algorithm framework for next-generation radio telescopes, managing the Fourier transform from apertures to sky (or vice versa) with a focus on minimising memory load, data transfers, and computation.

Methods. Our algorithm uses smooth window functions to isolate the influence between specific regions of spatial-frequency and image space. This allows the distribution of image data between nodes and the construction of segments of frequency space exactly when and where needed.

Results. The developed prototype distributes terabytes of image data across many nodes, while generating visibilities at throughput and accuracy competitive with existing software. Scaling is demonstrated to be better than cubic in problem complexity (for baseline length and field of view), reducing the risk involved in growing radio astronomy processing to large telescopes like the Square Kilometre Array.

Key words: methods: data analysis / methods: numerical / techniques: image processing / techniques: interferometric

© The Authors 2024

Open Access article, published by EDP Sciences, under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This article is published in open access under the Subscribe to Open model. Subscribe to A&A to support open access publication.

1 Introduction

Upcoming large-scale radio telescopes like the Square Kilometre Array (SKA) (Dewdney et al. 2009) are designed for deep observations of large areas of the sky. These observations require many long baselines, wide bandwidths, and high frequency resolution. This translates to demanding computing requirements; for SKA imaging observations, more than a petabyte of visibility data might be generated per hour. Pipelines will need to perform approximately 40 exa-operations to produce images that are terabytes in size (Bolton et al. 2016). This means that efficient use of high-performance computing facilities will be essential, which will require algorithms that distribute computation, visibility data, as well as image data evenly across many compute nodes.

This distribution is not trivial because radio astronomy imaging requires a Fourier transform where every measured visibility impacts every image pixel. Conventional strategies for distributing radio-interferometric imaging split visibilities by frequency or by observation time, but replicate full copies of the image to every node. A well-known alternative approach is to instead split the image into ‘facets’, previously used for non-coplanarity (Cornwell & Perley 1992), direction-dependent calibration, and deconvolution (Van Weeren et al. 2016). This requires visibilities to be gridded to a separate low-resolution grid for every facet, which combined with averaging can be a viable scaling strategy (Bolton et al. 2016; Tasse et al. 2018). On the other hand, repeated gridding is inefficient, phase rotation can become a bottleneck, and averaging introduces inherent inaccuracies into the result.

Therefore, this paper focuses on the Fourier transform specifically, describing a scalable algorithm that addresses the core challenges of large-scale interferometric processing. Our method fully distributes the image data as well as the computation, never needing to assemble the entire image or uv(w)-grid at any stage. As with previous work, this is achieved using a distributed facet approach, so different parts of the image are held and processed in memory separately. However, our method solves the conceptually easier problem of transforming them from and to cut-outs of the full-resolution spatial-frequency grid, which we call ‘sub-grids’. Hence, in our approach both the image-plane and the aperture-plane are divided into smaller parts that can be processed in parallel. This can be used to distribute conventional visibility (de)gridding algorithms as a separate step, as shown in Fig. 1. This would process every visibility only once on a single node, with a roughly predictable processing order.

All we need is an efficient way to transform between such subgrids and facets, which would make it a form of distributed Fourier transform. Distributed fast Fourier transform (FFT) algorithms (e.g. Frigo & Johnson 1998) might seem the obvious choice; however, they are not well suited for our purposes. For non-coplanarity corrections we are often only interested in local regions of the uvw spatial-frequency space, and therefore we need to minimise not only the computational work spent on unneeded uvw regions, but also the time the full grid or image data is held in memory. Additionally, we would like to apply extra image-space factors per subgrid, for example for widefield corrections, which further pushes us towards a different class of algorithm.

Our solution is therefore to stream portions of spatial-frequency space to worker nodes (or, during de-gridding, from them) exactly as and when required to cover the telescope baselines: an incremental semi-sparse Fourier transform algorithm for contiguous partitions of image and frequency space. In Sec. 2, we define the core algorithm, which we then extend in Sec. 3 to cover the effects of Fresnel diffraction when observing widefield. In Sec. 4, we demonstrate how to assemble and parametrise the algorithm, which allows us to perform scaling and performance tests using SKA-scale parameters in Sec. 5. We finally wrap things up with a discussion in Sec. 6.

Fig. 1

Distribution concept sketch of our algorithm. The visibilities are gridded to subgrids (solid lines), and the contributions are accumulated to facets. The dashed lines indicate the reverse direction: the contributions are extracted from the facets, and the visibilities are degridded from the subgrids.

2 Core algorithm

Interferometric imaging measures complex visibilities V, and relates them to the planar projection of the sky intensity distribution I as $V (u, v, w) = \iint I (l, m) e^{- 2 π i w (\sqrt{1 - l^{2} - m^{2}} - 1)} e^{- 2 π i (u l + v m)} d l d m,$ $\[V(u, v, w)=\iint I(l, m) \mathrm{e}^{-2 \pi i w\left(\sqrt{1-l^2-m^2}-1\right)} \mathrm{e}^{-2 \pi i(u l+v m)} \mathrm{d} l \mathrm{~d} m,\]$ (1)

where l and m are sky direction cosines, and u, v, and w are coordinate components of telescope baselines (compare e.g. Cornwell & Perley 1992). The right-hand side involves a Fourier transform, which is why an efficient approximation of I from visibility samples V (or vice versa) requires discrete Fourier transforms.

2.1 Notation

For simplicity we limit ourselves to one-dimensional functions for the moment. By convention, all named functions are in visibility space (i.e. using spatial-frequency u coordinate). The image or gridded visibilities to be transformed are represented by a function Υ, no matter the transformation direction considered. Furthermore, all functions are discrete using a given step rate of 2l_Υ and repeating with a period of 2u_Υ satisfying 4u_Υl_Υ ∈ ℤ. This allows us to define a ‘scaled’ discrete Fourier transform as $F [f] (l) = \frac{1}{2 u_{Υ}} \sum_{\begin{matrix} - u_{Υ} \leq u < u_{Υ} \\ 2 u l_{Υ} \in Z \end{matrix}} f (u) e^{- 2 π i u l} for 2 l u_{Υ} \in Z .$ $\[\mathcal{F}[\mathrm{f}](l)=\frac{1}{2 u_{\Upsilon}} \sum_{\substack{-u_{\Upsilon} \leq u_1<u_{\mathrm{r}} \\ 2 u l_{\mathrm{\Upsilon}} \in \mathbb{Z}}} \mathrm{f}(u) \mathrm{e}^{-2 \pi i u l} \quad\qquad \text { for } 2 l u_{\Upsilon} \in \mathbb{Z} \text {. }\]$ (2)

This makes ℱf an image-space function (l coordinate) that is sampled with a step rate of 2u_Υ and repeats with a period of 2l_Υ. We use the following notations and properties: $(f + g) (u) = f (u) + g (u) \Rightarrow f + g = F^{- 1} [F f + F g],$ $\[(\mathrm{f}+\mathrm{g})(u)=\mathrm{f}(u)+\mathrm{g}(u) \qquad \Rightarrow \quad\mathrm{f}+\mathrm{g}=\mathcal{F}^{-1}[\mathcal{F} \mathrm{f}+\mathcal{F} \mathrm{g}],\]$ (3) $(f g) (u) = f (u) g (u) \Rightarrow f g = \frac{1}{2 l_{Υ}} F^{- 1} [F f * F g],$ $\[(\mathrm{fg})(u)=\mathrm{f}(u) \mathrm{g}(u) \qquad\qquad \Rightarrow \qquad \mathrm{fg}=\frac{1}{2 l_{\mathrm{\Upsilon}}} \mathcal{F}^{-1}[\mathcal{F} \mathrm{f} * \mathcal{F} \mathrm{g}],\]$ (4) $(f * g) (u) = \sum_{\begin{matrix} - u_{Υ} \leq τ < u_{Υ} \\ 2 τ l_{Υ} \in Z \end{matrix}} f (τ) g (u - τ) \Rightarrow f * g = 2 u_{Υ} F^{- 1} [F f F g] .$ $\[(\mathrm{f} * \mathrm{~g})(u)=\sum_{\substack{-u_{\mathrm{\Upsilon}} \leq \tau<u_{\mathrm{\Upsilon}} \\ 2 \tau l_{\mathrm{\Upsilon}} \in \mathbb{Z}}} \mathrm{f}(\tau) \mathrm{g}(u-\tau) \quad \Rightarrow \quad \mathrm{f} * \mathrm{~g}=2 u_\Upsilon \mathcal{F}^{-1}[\mathcal{F} \mathrm{f} \mathcal{~F} \mathrm{g}].\]$ (5)

For this paper, we often need to deal with functions that are sampled more coarsely than the base u_Υ/l_Υ discretisation. We use Kronecker comb functions to represent this: ${I I I}_{2 u_{f}} (u) = \sum_{k = - \infty}^{\infty} δ (u - k 2 u_{f}) \Rightarrow F [{I I I}_{2 u_{f}}] = \frac{1}{2 u_{f}} {I I I}_{{(2 u_{f})}^{- 1}} .$ $\[{\rm III}_{2 u_{\mathrm{f}}}(u)=\sum_{k=-\infty}^{\infty} \delta\left(u-k 2 u_{\mathrm{f}}\right) \Rightarrow \mathcal{F}\left[{\rm III}_{2 u_{\mathrm{f}}}\right]=\frac{1}{2 u_{\mathrm{f}}} {\rm III}_{\left(2 u_{\mathrm{f}}\right)^{-1}}.\]$ (6)

Here $2 u_{f} l_{Υ}, 2 u_{f} u_{Υ}^{- 1} \in Z$ $\[2 u_{\mathrm{f}} l_{\mathrm{\Upsilon}}, 2 u_{\mathrm{f}} u_\Upsilon^{-1} \in \mathbb{Z}\]$ . Multiplication with ${I I I}_{2 u_{F}}$ $\[{\rm III}_{2 u_F}\]$ samples frequency space at a rate of (2u_f)⁻¹, while convolution with ${I I I}_{2 u_{F}}$ $\[{\rm III}_{2 u_F}\]$ samples image space at a rate of 2u_f. With $2 u_{Υ} l_{f}, 2 l_{f} l_{Υ}^{- 1}, 4 u_{f} l_{f} \in Z$ $\[2 u_{\Upsilon} l_{\mathrm{f}}, 2 l_{\mathrm{f}} l_{\Upsilon}^{-1}, 4 u_{\mathrm{f}} l_{\mathrm{f}} \in \mathbb{Z}\]$ we obtain $f = \frac{u_{Υ}}{u_{f}} {I I I}_{2 u_{f}} * {I I I}_{{(2 l_{f})}^{- 1}} f \Leftrightarrow F f = \frac{l_{f} u_{Υ}}{u_{f} l_{Υ}} ({I I I}_{2 l_{f}} * {I I I}_{{(2 u_{f})}^{- 1}} F f);$ $\[\mathrm{f}=\frac{u_{\mathrm{\Upsilon}}}{u_{\mathrm{f}}} {\rm III}_{2 u_{\mathrm{f}}} * {\rm III}_{\left(2 l_{\mathrm{f}}\right)^{-1}} \mathrm{f} ~~\Leftrightarrow~~ \mathcal{F} \mathrm{f}=\frac{l_f u_{\mathrm{\Upsilon}}}{u_{\mathrm{f}} l_{\mathrm{\Upsilon}}}\left({\rm III}_{2 l_{\mathrm{f}}} * {\rm III}_{\left(2 u_{\mathrm{f}}\right)^{-1}} \mathcal{~F} \mathrm{f}\right);\]$ (7)

in other words, where a function f repeats with a period of 2u_f and is non-zero at a rate of 2l_f, its Fourier transform repeats with a period of 2l_f and has a rate of 2u_f. For numerical purposes, Eq. (7) furthermore shows that we can represent f or ℱf entirely using exactly |4u_fl_f| samples.

2.2 Problem statement

The distributed Fourier transform problem is now given by mask functions A_i and B_j, which encode how we wish to partition Υ in spatial-frequency and image space: $\sum_{i} A_{i} Υ = Υ and \sum_{j} (B_{j} * Υ) = Υ$ $\[\sum_i \mathrm{~A}_i \Upsilon=\Upsilon \quad \qquad\text { and } \qquad \sum_j\left(\mathrm{~B}_j * \Upsilon\right)=\Upsilon\]$ (8)

with u_A < u_Υ and l_B < l_Υ such that we can represent subgrids A_iΥ using 4u_Al_Υ samples (subgrid size) and facets B_j * Υ using 4u_Υl_B samples (facet size). We expect A_i to be boxcar filters in frequency space and B_j boxcar filters in image space (blue graphs in Fig. 2), and therefore with $u_{Υ} u_{A}^{- 1}, l_{Υ} l_{B}^{- 1} \in Z$ $\[u_{\Upsilon} u_{\mathrm{A}}^{-1}, l_{\Upsilon} l_{\mathrm{B}}^{-1} \in \mathbb{Z}\]$ $A_{i} Υ = A_{i} ({I I I}_{2 u_{A}} * A_{i} Υ), B_{j} * Υ = \frac{l_{Υ}}{2 u_{Υ} l_{B}} B_{j} * {I I I}_{{(2 l_{B})}^{- 1}} (B_{j} * Υ)$ $\[\mathrm{A}_i \Upsilon=\mathrm{A}_i\left({\rm III}_{2 u_{\mathrm{A}}} * \mathrm{~A}_i \Upsilon\right), \mathrm{B}_j * \Upsilon=\frac{l_{\mathrm{\Upsilon}}}{2 u_{\mathrm{\Upsilon}} l_{\mathrm{B}}} \mathrm{B}_j * {\rm III}_{\left(2 l_{\mathrm{B}}\right)^{-1}}\left(\mathrm{~B}_j * \Upsilon\right)\]$ (9)

by Whittaker–Shannon interpolation. As illustrated in Fig. 1, now the goal is to reconstruct B_j * Υ given all A_iΥ. On paper this is straightforward due to linearity: $B_{j} * Υ = B_{j} * \sum_{i} A_{i} Υ = \sum_{i} (B_{j} * A_{i} Υ) .$ $\[\mathrm{B}_j * \Upsilon=\mathrm{B}_j * \sum_i \mathrm{~A}_i \Upsilon=\sum_i\left(\mathrm{~B}_j * \mathrm{~A}_i\Upsilon\right).\]$ (10)

So the contribution of subgrid i to facet j is simply B_j * A_iΥ. For an efficient distributed algorithm, this contribution would have to be computed and exchanged between nodes. Unfortunately, naïve attempts to represent B_j * A_iΥ using 4u_Al_B samples would fail, as we cannot apply Whittaker–Shannon simultaneously in image and spatial-frequency space: $\frac{l_{Υ}}{2 u_{Υ} l_{B}} B_{j} * A_{i} {I I I}_{{(2 l_{B})}^{- 1}} ({I I I}_{2 u_{A}} * B_{j} * A_{i} Υ =) \neq B_{j} * A_{i} Υ .$ $\[\frac{l_{\mathrm{\Upsilon}}}{2 u_{\mathrm{\Upsilon}} l_{\mathrm{B}}} \mathrm{B}_j * \mathrm{~A}_i {\rm III}_{\left(2 l_{\mathrm{B}}\right)^{-1}}\left({\rm III}_{2 u_{\mathrm{A}}} * \mathrm{~B}_j * \mathrm{~A}_i ~\Upsilon=\right) \neq \mathrm{B}_j * \mathrm{~A}_i ~\Upsilon.\]$ (11)

The reason is that B_j never approaches zero in frequency space (see Fig. 2, bottom left plot), and therefore neither does B_j * A_iΥ. Hence, tiling it with a period of 2u_A effectively loses information.

Fig. 2

Key functions and relationships at a glance.

2.3 Approximation

The key idea is to use a different representation. We suppose a function n_j that falls close to zero in both frequency and image space (with associated limits u_n and l_n respectively). Then the same should be true for n_j * A_iΥ, so we should be able to say $m_{i} (n_{j} * A_{i} Υ) \approx n_{j} * A_{i} Υ,$ $\[\mathrm{m}_i\left(\mathrm{n}_j * \mathrm{A}_i \Upsilon)\approx \mathrm{n}_j * \mathrm{A}_i \Upsilon\right. \text {, }\]$ (12)

where m_i is a boxcar filter with u_m> u_A + u_n (Fig. 2, left plots). If we furthermore find an ‘inverse’ function b_j that gets us back to B_j = b_j * n_j (which will require l_n ≥ l_B), it follows that $B_{j} * Υ = b_{j} * \sum_{i} (n_{j} * A_{i} Υ) \approx b_{j} * \sum_{i} m_{i} (n_{j} * A_{i} Υ) .$ $\[\mathrm{B}_j * \Upsilon=\mathrm{b}_j * \sum_i\left(\mathrm{n}_j * \mathrm{~A}_i \Upsilon\right) \approx \mathrm{b}_j * \sum_i \mathrm{~m}_i\left(\mathrm{n}_j * \mathrm{~A}_i \Upsilon\right).\]$ (13)

This approach mirrors visibility gridding. If A_i were delta functions identifying positions of visibilities (oversampled at a rate of 2l_Υ), then n_j would be the gridding kernel, m_i the grid environment around each visibility, and b_j the gridding correction. What we are doing is simply ‘bulk re-gridding’ to coarser facet grids.

This suggests that window functions like prolate spheroidal wave functions (PSWFs; see Fig. 2) are a good choice for n_j. Unfortunately, there is no window function that is perfectly limited in both image and frequency space (uncertainty principle), so there will always be errors. We can quantify the effects by subtracting both sides from above for a subgrid i: $b_{j} * (1 - m_{i}) (n_{j} * A_{i} Υ) \approx 0.$ $\[\mathrm{b}_j *\left(1-\mathrm{m}_i\right)\left(\mathrm{n}_j * \mathrm{~A}_i \Upsilon\right) \approx 0.\]$ (14)

Figure 3 shows what these errors look like in image space for a centred facet. Before convolution with b_j (dark blue line) the absolute error is mostly constant with small peaks at ±l_n. As the light blue line shows, this only becomes slightly worse for a worst-case Υ (single source at −l_n). On the other hand, convolution with b_j introduces a distinct U-shaped error pattern. The reason is that for n_j to be approximately limited in spatial-frequency space, it needs to be smooth in image space, and as we require ℱn_j to fall to zero, the inverse |ℱn_j|⁻¹ tends to infinity at ±l_n (see also Fig. 2). This is why we need to somewhat over-dimension n_j in image-space (l_n > l_B) to keep error magnification in check.

Fig. 3

Image space error (light colours: worst case Υ).

2.4 Method

At this point, we know that n_j * A_iΥ is approximately limited to an u_m region in frequency space, so assuming $u_{Υ} u_{m}^{- 1}, 4 u_{m} l_{n} \in Z$ $\[u_{\Upsilon} u_{\mathrm{m}}^{-1}, 4 u_{\mathrm{m}} l_{\mathrm{n}} \in \mathbb{Z}\]$ : $b_{j} * \frac{l_{Υ}}{l_{n}} {I I I}_{{(2 l_{n})}^{- 1}} (n_{j} * A_{i} Υ) = B_{j} * A_{i} Υ$ $\[\mathrm{b}_j * \frac{l_{\mathrm{\Upsilon}}}{l_{\mathrm{n}}} {\rm III}_{\left(2 l_{\mathrm{n}}\right)^{-1}}\left(\mathrm{n}_j * \mathrm{~A}_i \Upsilon \mathcal{}\right)=\mathrm{B}_j * \mathrm{~A}_i \Upsilon\]$ (15) $\Rightarrow b_{j} * m_{i} \frac{l_{Υ}}{l_{n}} \begin{matrix} {I I I}_{(2 l_{n})^{- 1}} ({I I I}_{2 u_{m}} * n_{j} * A_{i} Υ) \end{matrix} \approx B_{j} * A_{i} Υ$ $\[\Rightarrow \mathrm{b}_j * \mathrm{~m}_i \frac{l_{\mathrm{\Upsilon}}}{l_{\mathrm{n}}} \begin{array}{|c|}\hline{\rm III}_{(2 l_{\mathrm{n}})^{-1}}({\rm III}_{2 u_{\mathrm{m}}} * \mathrm{n}_j * \mathrm{~A}_i \mathcal{} \Upsilon)\\\hline\end{array} \approx \mathrm{B}_j * \mathrm{~A}_i \Upsilon\]$ (16)

by associativity of multiplication. Therefore n_j * A_iΥ can be represented approximately using 4u_ml_n samples (boxed expression), solving the core challenge from Sect. 2.2.

If we define $n_{j}^{0}$ $\[\mathrm{n}_j^0\]$ as a mask representing the image-space extent of $n_{j} (F n_{j}^{0} (l) = 1$ $\[\mathrm{n}_j~(\mathcal{F} \mathrm{n}_j^0(l)=1\]$ exactly where ℱn_j(l) ≠ 0), we can also show how to compute this contribution efficiently, $\begin{aligned} {I I I}_{{(2 l_{n})}^{- 1}} ({I I I}_{2 u_{m}} * n_{j} * A_{i} Υ) \\ = \frac{u_{m} l_{Υ}}{2 u_{Υ}^{2} l_{n}} {I I I}_{{(2 l_{n})}^{- 1}} ({I I I}_{2 u_{m}} * n_{j}) * {I I I}_{{(2 l_{n})}^{- 1}} (n_{j}^{0} * {I I I}_{2 u_{m}} * A_{i} Υ), \end{aligned}$ $\[\begin{aligned}& {\rm III}_{\left(2 l_{\mathrm{n}}\right)^{-1}}\left({\rm III}_{2 u_{\mathrm{m}}} * \mathrm{n}_j * \mathrm{~A}_i \Upsilon\right) \\& =\frac{u_{\mathrm{m}} l_{\mathrm{\Upsilon}}}{2 u_{\mathrm{\Upsilon}}^2 l_{\mathrm{n}}} {\rm III}_{\left(2 l_{\mathrm{n}}\right)^{-1}}\left({\rm III}_{2 u_{\mathrm{m}}} * \mathrm{n}_j\right) * {\rm III}_{\left(2 l_{\mathrm{n}}\right)^{-1}}\left(\mathrm{n}_j^0 * {\rm III}_{2 u_{\mathrm{m}}} * \mathrm{~A}_i \Upsilon \mathcal{}\right),\end{aligned}\]$ (17)

as ${I I I}_{2 u_{m}} = \frac{u_{m}}{u_{Υ}} {I I I}_{2 u_{m}} * {I I I}_{2 u_{m}}, n_{j} = \frac{1}{2 u_{Υ}} n_{j} * n_{j}^{0}$ $\[{\rm III}_{2 u_{\mathrm{m}}}=\frac{u_{\mathrm{m}}}{u_{\mathrm{\Upsilon}}} {\rm III}_{2 u_{\mathrm{m}}} * {\rm III}_{2 u_{\mathrm{m}}}, \mathrm{n}_j=\frac{1}{2 u_{\mathrm{\Upsilon}}} \mathrm{n}_j * \mathrm{n}_j^0\]$ , and III $_{{(2 l_{n})}^{- 1}} (f * g) = \frac{l_{Υ}}{l_{n}} {I I I}_{{(2 l_{n})}^{- 1}} f * {I I I}_{{(2 l_{n})}^{- 1}} g$ $\[{\rm III}_{\left(2 l_{\mathrm{n}}\right)^{-1}}(\mathrm{f} * \mathrm{g})=\frac{l_{\mathrm{\Upsilon}}}{l_{\mathrm{n}}} {\rm III}_{\left(2 l_{\mathrm{n}}\right)^{-1}} \mathrm{f} * {\rm III}_{\left(2 l_{\mathrm{n}}\right)^{-1}} \mathrm{~g}\]$ , where f and g are zero in image space outside a 2/_n region. This means that if we have padded subgrid data ${I I I}_{2 u_{m}} * A_{i} Υ$ $\[\amalg_{2 u_{\mathrm{m}}} * \mathrm{~A}_i \Upsilon\]$ (4u_mlΥ samples; u_m > u_A) in image space, then we just need to select 4u_ml_n samples where we know ℱn_j(l) ≠ 0, multiply by ℱn_j (sampled at the same points), and we have calculated our subgrid contribution representation.

Applying the contribution is equally straightforward: ${I I I}_{{(2 l_{n})}^{- 1}} \sum_{i} (B_{j} * A_{i} Υ)$ $\[{\rm III}_{\left(2 l_{\mathrm{n}}\right)^{-1}} \sum_i\left(\mathrm{~B}_j * \mathrm{~A}_i \Upsilon\right)\]$ $\approx {I I I}_{{(2 l_{n})}^{- 1}} \sum_{i} b_{j} * m_{i} \frac{l_{Υ}}{l_{n}} {I I I}_{{(2 l_{n})}^{- 1}} ({I I I}_{2 u_{m}} * n_{j} * A_{i} Υ)$ $\[\approx {\rm III}_{\left(2 l_{\mathrm{n}}\right)^{-1}} \sum_i \mathrm{~b}_j * \mathrm{~m}_i \frac{l_{\mathrm{\Upsilon}}}{l_{\mathrm{n}}} {\rm III}_{\left(2 l_{\mathrm{n}}\right)^{-1}}\left({\rm III}_{2 u_{\mathrm{m}}} * \mathrm{n}_j * \mathrm{~A}_i \Upsilon\right)\]$ (18) $= \frac{l_{Υ}}{l_{n}} {I I I}_{{(2 l_{n})}^{- 1}} b_{j} * \sum_{i} ({I I I}_{(2 l_{n}) - 1} m_{i}) \begin{matrix} {I I I}_{{(2 l_{n})}^{- 1}} ({I I I}_{2 u_{m}} * n_{j} * A_{i} Υ) \end{matrix}$ $\[=\frac{l_{\mathrm{\Upsilon}}}{l_{\mathrm{n}}} {\rm III}_{\left(2 l_n\right)^{-1}} \mathrm{~b}_j * \sum_i\left({\rm III}_{\left(2 l_n\right)-1} \mathrm{~m}_i\right) \begin{array}{|c|}\hline{\rm III}_{\left(2 l_n\right)^{-1}}\left({\rm III}_{2 u_m} * \mathrm{n}_j * \mathrm{~A}_i \mathrm{\Upsilon}\right)\\\hline\end{array}\]$ (19)

substituting in the term from Eq. (16), then using ${I I I}_{{(2 l_{n})}^{- 1}} = {I I I}_{{(2 l_{n})}^{- 1}} {I I I}_{{(2 l_{n})}^{- 1}}$ $\[{\rm III}_{\left(2 l_{\mathrm{n}}\right)^{-1}}={\rm III}_{\left(2 l_{\mathrm{n}}\right)^{-1}} {\rm III}_{\left(2 l_n\right)^{-1}}\]$ and ${I I I}_{{(2 l_{n})}^{- 1}} (f * {I I I}_{{(2 l_{n})}^{- 1}} g) = {I I I}_{{(2 l_{n})}^{- 1}} f * {I I I}_{{(2 l_{n})}^{- 1}} g$ $\[{\rm III}_{\left(2 l_{\mathrm{n}}\right)^{-1}}\left(\mathrm{f} * {\rm III}_{\left(2 l_{\mathrm{n}}\right)^{-1}} \mathrm{~g}\right)={\rm III}_{\left(2 l_{\mathrm{n}}\right)^{-1}} \mathrm{f} * {\rm III}_{\left(2 l_{\mathrm{n}}\right)^{-1}} \mathrm{~g}\]$ , where f is zero in image space outside a 2l_n region. This means that we need to pad in frequency space from 4u_ml_n to 4u_ml_Υ samples, then multiply the sum in image space by an equivalently sampled b_j.

Fig. 4

2D algorithm sketch of both transformation directions, highlighting the symmetry. This also illustrates buffer sizes, showing how padding facet data to 4u_Υl_n tends to dominate memory consumption both in the horizontal (u) and vertical (υ) axes. This is why it is important to share intermediate buffers for subgrids from the same ‘column’ $(m_{i}^{(u)} = m_{i^{'}}^{(u)})$ $\[\left(\mathrm{m}_i^{(\mathrm{u})}=\mathrm{m}_{i^{\prime}}^{(\mathrm{u})}\right)\]$ .

2.5 Shifts

For simplicity let us assume that all m_i and n_j are the same functions, just shifted by subgrid or facet offsets ${\hat{u}}_{i} / {\hat{l}}_{j}$ $\[\hat{u}_i / \hat{l}_j\]$ as follows: $m_{i} = δ_{{\hat{u}}_{i}} * m,$ $\[\mathrm{m}_i=\delta_{\hat{u}_i} * \mathrm{~m},\]$ (20) $n_{j} = (F^{- 1} δ_{{\hat{l}}_{j}}) n = F^{- 1} [δ_{{\hat{l}}_{j}} * F n] .$ $\[\mathrm{n}_j=\left(\mathcal{F}^{-1} \delta_{\hat{l}_j}\right) \mathrm{n}=\mathcal{F}^{-1}\left[\delta_{\hat{l}_j} * \mathcal{F} \mathrm{n}\right].\]$ (21)

As $δ_{{\hat{u}}_{i}} * f g = (δ_{{\hat{u}}_{i}} * f) (δ_{{\hat{u}}_{i}} * g), F^{- 1} δ_{{\hat{l}}_{j}} (f * g) = F^{- 1} δ_{{\hat{l}}_{j}} f * F^{- 1} δ_{{\hat{l}}_{j}} g$ $\[\delta_{\hat{u}_i} * \mathrm{fg}=\left(\delta_{\hat{u}_i} * \mathrm{f}\right)\left(\delta_{\hat{u}_i} * \mathrm{~g}\right), \mathcal{F}^{-1} \delta_{\hat{l}_j}(\mathrm{f} * \mathrm{~g})=\mathcal{F}^{-1} \delta_{\hat{l}_j} \mathrm{f} * \mathcal{F}^{-1} \delta_{\hat{l}_j} \mathrm{~g}\]$ , and $δ_{- {\hat{u}}_{i}} * F^{- 1} δ_{- {\hat{l}}_{j}} = F^{- 1} δ_{- {\hat{l}}_{j}}$ $\[\delta_{-\hat{u}_i} * \mathcal{F}^{-1} \delta_{-\hat{l}_j}=\mathcal{F}^{-1} \delta_{-\hat{l}_j}\]$ (assuming ${\hat{u}}_{i} {\hat{l}}_{j} \in Z$ $\[\hat{u}_i \hat{l}_j \in \mathbb{Z}\]$ ), we can derive $F^{- 1} δ_{- {\hat{l}}_{j}} (B_{j} * Υ),$ $\[\mathcal{F}^{-1} \delta_{-\hat{l}_j}\left(\mathrm{~B}_j * \Upsilon\right),\]$ (22) $\approx F^{- 1} δ_{- {\hat{l}}_{j}} b_{j} * \sum_{i} (δ_{{\hat{u}}_{i}} * m (n * (δ_{- {\hat{u}}_{i}} * F^{- 1} δ_{- {\hat{l}}_{j}}) (δ_{- {\hat{u}}_{i}} * A_{i} Υ))),$ $\[\approx \mathcal{F}^{-1} \delta_{-\hat{l}_j} \mathrm{~b}_j * \sum_i\left(\delta_{\hat{u}_i} * \mathrm{~m}\left(\mathrm{n} *\left(\delta_{-\hat{u}_i} * \mathcal{F}^{-1} \delta_{-\hat{l}_j}\right)\left(\delta_{-\hat{u}_i} * \mathrm{~A}_i \mathcal{\Upsilon}\right)\right)\right),\]$ (23) $= F^{- 1} δ_{- {\hat{ı}}_{j}} b_{j} * \sum_{i} (δ_{{\hat{u}}_{i}} * m (n * F^{- 1} δ_{- {\hat{ı}}_{j}} (δ_{- {\hat{u}}_{i}} * A_{i} Υ))) .$ $\[=\mathcal{F}^{-1} \delta_{-\hat{\imath}_j} \mathrm{~b}_j * \sum_i\left(\delta_{\hat{u}_i} * \mathrm{~m}\left(\mathrm{n} * \mathcal{F}^{-1} \delta_{-\hat{\imath}_j}\left(\delta_{-\hat{u}_i} * \mathrm{~A}_i \Upsilon\right)\right)\right).\]$ (24)

This tells us that we only need to have one sampled representation of m and n because the function offsets applied to the data simply correspond to index shifts for image or frequency space samples as long as $2 u_{m} {\hat{l}}_{j}, 2 {\hat{u}}_{i} l_{n} \in Z$ $\[2 u_{\mathrm{m}} \hat{l}_j, 2 \hat{u}_i l_{\mathrm{n}} \in \mathbb{Z}\]$ . We revisit these side conditions in Sec. 4.1.

2.6 Two dimensions

We can easily generalise our reasoning to images simply by redefining Υ, A, B, m, b, and n to be two-dimensional functions. We additionally assume m_i and b_j to be separable, $m_{i} = m_{i}^{(u)} m_{i}^{(v)}, b_{j} = b_{j}^{(u)} * b_{j}^{(v)},$ $\[\mathrm{m}_i=\mathrm{m}_i^{(u)} \mathrm{m}_i^{(v)}, \quad \mathrm{b}_j=\mathrm{b}_j^{(u)} * \mathrm{~b}_j^{(v)},\]$ (25)

such that m^(u) and m^(v) are constant along the v- and u-axes, respectively, and b^(u)(u, v) = 0 for v ≠ 0 and b^(v)(u, v) = 0 for u ≠ 0. Then we can re-order as follows: $B_{j} * Υ \approx b_{j}^{(u)} * \sum_{i} m_{i}^{(u)} (b_{j}^{(v)} * m_{i}^{(v)} (n_{j} * A_{i} Υ)) .$ $\[\mathrm{B}_j * \Upsilon \approx \mathrm{b}_j^{(u)} * \sum_i \mathrm{~m}_i^{(u)}\left(\mathrm{b}_j^{(v)} * \mathrm{~m}_i^{(v)}\left(\mathrm{n}_j * \mathrm{~A}_i \Upsilon\right)\right).\]$ (26)

Convolution with $b_{j}^{(v)}$ $\[2 u_{\mathrm{m}} \hat{l}_j, 2 \hat{u}_i l_{\mathrm{n}} \in \mathbb{Z}\]$ straight after multiplication with $m_{i}^{(v)}$ $\[2 u_{\mathrm{m}} \hat{l}_j, 2 \hat{u}_i l_{\mathrm{n}} \in \mathbb{Z}\]$ is a small optimisation. An implementation can discard 4u_Υ(l_n − l_B) rows where $F b_{j}^{(v)}$ $\[\mathcal{F} b_j^{(v)}\]$ is zero by doing the Fourier transform ℱ^(v) along the v-axis right away. The sum can be pulled further inwards if we assume that some subgrids are in the same column (i.e. share a certain value of $m_{i}^{(u)}$ $\[\mathrm{m}_i^\mathrm{(u)}\]$ ). Then with M^(u) the set of all $m_{i}^{(u)}$ $\[\mathrm{m}_i^\mathrm{(u)}\]$ : $B_{j} * Υ \approx b_{j}^{(u)} * \sum_{m^{(u)} \in M^{(u)}} m^{(u)} (b_{j}^{(v)} * \sum_{i, m_{i}^{(u)} = m^{(u)}} m_{i}^{(v)} (n_{j} * A_{i} Υ)) .$ $\[\mathrm{B}_j * \Upsilon \approx \mathrm{b}_j^{(u)} * \sum_i \mathrm{~m}_i^{(u)}\left(\mathrm{b}_j^{(v)} * \mathrm{~m}_i^{(v)}\left(\mathrm{n}_j * \mathrm{~A}_i \Upsilon\right)\right).\mathrm{B}_j * \Upsilon \approx \mathrm{b}_j^{(u)} * \sum_{\mathrm{m}^{(u)} \in M^{(u)}} \mathrm{m}^{(u)}\left(\mathrm{b}_j^{(v)} * \sum_{i, \mathrm{~m}_i^{(u)}=\mathrm{m}^{(u)}} \mathrm{m}_i^{(v)}\left(\mathrm{n}_j * \mathrm{~A}_i \Upsilon\right)\right).\]$ (27)

As illustrated in Fig. 4a, this means we can accumulate contributions from entire columns of subgrids, reducing the number of Fourier transforms needed.

2.7 Dual variant

As degridding relates to gridding, there is a similar dual algorithm variant for going from facets to subgrids. Analogously to the steps taken in Sec. 2.1, we start with the observation that $A_{i} Υ = A_{i} (\sum_{j} B_{j} * Υ) = \sum_{j} A_{i} (B_{j} * Υ),$ $\[\mathrm{A}_i \Upsilon=\mathrm{A}_i\left(\sum_j \mathrm{~B}_j * \Upsilon\right)=\sum_j \mathrm{~A}_i\left(\mathrm{~B}_j * \Upsilon\right),\]$ (28)

and we then decompose B_j = n_j * b_j and A_i = A_im_i as done in Sect. 2.3 to approximate the sum term $A_{i} (B_{j} * Υ) = A_{i} m_{i} (n_{j} * b_{j} * Υ) \approx A_{i} (n_{j} * m_{i} (b_{j} * Υ))$ $\[\mathrm{A}_i\left(\mathrm{~B}_j * \Upsilon\right)=\mathrm{A}_i \mathrm{~m}_i\left(\mathrm{n}_j * \mathrm{~b}_j * \Upsilon\right) \approx \mathrm{A}_i\left(\mathrm{n}_j * \mathrm{~m}_i\left(\mathrm{~b}_j * \Upsilon\right)\right)\]$ (29)

with the error term A_i(n_j * (1 − m_i)(b_j * Υ)), which behaves equivalently to the errors discussed in Sect. 2.3. This also works in 2D: $A_{i} Υ \approx A_{i} \sum_{j} (n_{j} * m_{i}^{(v)} (b_{j}^{(v)} * m_{i}^{(u)} (b_{j}^{(u)} * Υ))) .$ $\[\mathrm{A}_i \Upsilon \approx \mathrm{A}_i \sum_j\left(\mathrm{n}_j * \mathrm{~m}_i^{(v)}\left(\mathrm{b}_j^{(v)} * \mathrm{~m}_i^{(u)}\left(\mathrm{b}_j^{(u)} * \Upsilon\right)\right)\right).\]$ (30)

The optimisation from the previous section also has an equivalent here: $A_{i} Υ \approx A_{i} \sum_{j} (n_{j} * T_{j, m_{i}^{(u)}}), T_{j, m^{(u)}} = m_{i}^{(v)} (b_{j}^{(v)} * m^{(u)} (b_{j}^{(u)} * Υ)) .$ $\[\mathrm{A}_i \Upsilon \approx \mathrm{A}_i \sum_j\left(\mathrm{n}_j * \mathrm{~T}_{j, \mathrm{~m}_i^{(u)}}\right), \mathrm{T}_{j, \mathrm{~m}^{(u)}}=\mathrm{m}_i^{(v)}\left(\mathrm{b}_j^{(v)} * \mathrm{~m}^{(u)}\left(\mathrm{b}_j^{(u)} * \Upsilon\right)\right).\]$ (31)

So we only need to calculate $T_{j, m^{(u)}}$ $\[\mathrm{T}_{j, \mathrm{m}^\mathrm{(u)}}\]$ once for every combination of facet and subgrid column, as shown in Fig. 4b.

3 Handling widefield effects

A general-purpose discrete Fourier transform algorithm cannot quite cover the full complexity of the measurement equation from Sect. 2 as wide fields of view require correction for telescope non-coplanarity (w ≠ 0)/sky curvature (n ≠ 1). As noted by Cornwell et al. (2008), we can represent this using a convolution with a w-specific Fresnel diffraction pattern: $g_{w} = F_{l, m}^{- 1} [e^{2 π w (\sqrt{1 - l^{2} - m^{2}} - 1)}] .$ $\[\mathrm{g}_w=\mathcal{F}_{l, m}^{-1}\left[\mathrm{e}^{2 \pi w\left(\sqrt{1-l^2-m^2}-1\right)}\right].\]$ (32)

This convolution effectively introduces a third dimension into the problem, but only on the spatial frequency side. To represent this, we modify our subgrid definition such that each subgrid i has an arbitrary w-offset ${\hat{w}}_{i}$ $\[\hat{w}_i\]$ associated with it: $A_{i} (g_{{\hat{ω}}_{i}} * Υ) .$ $\[A_i\left(\mathrm{~g}_{\hat{\omega}_i} * \Upsilon\right).\]$ (33)

In this section, we consider the problem variant of generating such w-shifted subgrids from facets, or vice versa. We note that even with w-offsets, satisfying $\sum_{i} A_{i} (g_{{\hat{ω}}_{i}} * Υ) = Υ$ $\[{\sum}_i \mathrm{A}_i\left(\mathrm{~g}_{\hat{\omega}_i} * \Upsilon\right)=\Upsilon\]$ (compare Sect. 2.2) is still possible by weighting A_i. This is well understood in radio astronomy, but is a distraction for the purpose of this paper, so we stick to the dual algorithm (facet to subgrid; see Sect. 2.7) for this section.

3.1 Implementation options

If we simply replace Υ by $(g_{{\hat{w}}_{i}} * Υ)$ $\[\left(\mathrm{g}_{\hat{w}_i} * \Upsilon\right)\]$ in our subgrid formula from Sect. 2.7 we arrive at $A_{i} (g_{{\hat{w}}_{i}} * Υ) \approx A_{i} \sum_{j} (n_{j} * m_{i} (b_{j} * g_{{\hat{w}}_{i}} * Υ)) .$ $\[\mathrm{A}_i\left(\mathrm{~g}_{\hat{w}_i} * \Upsilon\right) \approx \mathrm{A}_i \sum_j\left(\mathrm{n}_j * \mathrm{~m}_i\left(\mathrm{~b}_j * \mathrm{~g}_{\hat{w}_i} * \Upsilon\right)\right).\]$ (34)

So the obvious way to obtain w-shifted subgrids with our algorithm is to multiply facets by $F g_{{\hat{w}}_{i}}$ $\[\mathcal{F} ~\mathrm{g}_{\hat{w}_i}\]$ , which is known as w-stacking (e.g. Offringa et al. 2014). This is simple and precise no matter the field of view or value of $| {\hat{w}}_{i} |$ $\[\left|\hat{w}_i\right|\]$ .

On the other hand, this forces us to repeat the entire data re-distribution for every needed value of ${\hat{w}}_{i}$ $\[\hat{w}_i\]$ . Sharing intermediate subgrid column buffers, as shown in Fig. 4, would now especially require ${\hat{u}}_{i}$ $\[\hat{u}_i\]$ and ${\hat{w}}_{i}$ $\[\hat{w}_i\]$ to match. We can still minimise the amount of subgrid data to reproduce for any given ${\hat{w}}_{i}$ $\[\hat{w}_i\]$ level by skipping generation of unneeded subgrids, yet this is still not particularly efficient.

Fortunately we have additional options. After all, we can treat $B_{j, {\hat{w}}_{i}}^{'} = B_{j} * g_{{\hat{w}}_{i}}$ $\[\mathrm{B}_{j, \hat{w}_i}^{\prime}=\mathrm{B}_j * \mathrm{~g}_{\hat{w}_i}\]$ like a modified facet mask, which means that we can decompose, as in Sect. 2.3, into $B_{j, {\hat{w}}_{i}}^{'} = n_{j, {\hat{w}}_{i}}^{'} * b_{j}$ $\[\mathrm{B}_{j, \hat{w}_i}^{\prime}=\mathrm{n}_{j, \hat{w}_i}^{\prime} * \mathrm{~b}_j\]$ using a modified window function $n_{j, {\hat{w}}_{i}}^{'} = n_{j} * g_{{\hat{w}}_{i}}$ $\[\mathrm{n}_{j, \hat{w}_i}^{\prime}=\mathrm{n}_j * \mathrm{~g}_{\hat{w}_i}\]$ . This yields $A_{i} \sum_{j} ((B_{j} * g_{{\hat{w}}_{i}}) * Υ)) \approx A_{i} \sum_{j} ((n_{j} * g_{{\hat{w}}_{i}}) * m_{i} (b_{j} * Υ)) .$ $\[\mathrm{A}_i \sum_j\left((\mathrm{~B}_j * \mathrm{~g}_{\hat{w}_i}) * \Upsilon)\right)\approx \mathrm{A}_i \sum_j\left(\left(\mathrm{n}_j * \mathrm{~g}_{\hat{w}_i}\right) * \mathrm{~m}_i\left(\mathrm{~b}_j * \Upsilon\right)\right).\]$ (35)

This is attractive because m_i(b_j * Υ) is the distributed contribution term, so this introduces w-shifts after data re-distribution and all image-size FFTs. However, this can only work as long as $n_{j, {\hat{w}}_{i}}^{'}$ $\[\mathrm{n}_{j, \hat{w}_i}^{\prime}\]$ still acts like a window function for our purposes: $A_{i} ((n_{j} * g_{{\hat{w}}_{i}}) * (1 - m_{i})) \approx 0 .$ $\[\mathrm{A}_i\left(\left(\mathrm{n}_j * \mathrm{~g}_{\hat{w}_i}\right) *\left(1-\mathrm{m}_i\right)\right) \approx 0.\]$ (36)

So for this approach to hold up, we must now identify masks m_i compatible not only with A_i and all n_j, but also given ${\hat{w}}_{i}$ $\[\hat{w}_i\]$ .

Fig. 5

Window interaction with non-coplanarity term.

Fig. 6

Non-coplanarity margin relative to highest chirp frequency for l_max = 0.2 (solid: facet size l_n = 0.1, dotted: l_n = 0.2).

3.2 Margins

Fortunately, it is not unreasonable to expect $n_{j} * g_{{\hat{w}}_{i}}$ $\[\mathrm{n}_j * \mathrm{~g}_{\hat{w}_i}\]$ to be limited in frequency space. Using the definition of g_w from Sect. 3, we can determine the frequency by derivation of the exponent: $\frac{δ}{δ l} w \sqrt{1 - l^{2}} = \frac{- w l}{\sqrt{1 - l^{2}}} .$ $\[\frac{\delta}{\delta l} w \sqrt{1-l^2}=\frac{-w l}{\sqrt{1-l^2}}.\]$ (37)

Now we can use the fact that n_j is an image-space filter. If l_max is the maximum distance from the phase centre in the facet n_j, then the effective size of g_w for the purpose of n_j * g_w should be about $u_{g_{w}} = | w | l_{max} {(1 - l_{max}^{2})}^{- \frac{1}{2}}$ $\[u_{\mathrm{g}_w}=|w| l_{\max }\left(1-l_{\max }^2\right)^{-\frac{1}{2}}\]$ in frequency space. As Fig. 5 shows, g_w * n° (with n° again the mask version of n) indeed falls off, though it is hard to spot due to n⁰ causing sinc-like ripples (green line). Fortunately, convolution with n suppresses this by design, although while adding signal spread of its own (red line).

As a result, if we define $u_{g_{w} n}$ $\[u_{\mathrm{g}_w \mathrm{n}}\]$ as the point where $n * g_{{\hat{w}}_{i}}$ $\[\mathrm{n} * \mathrm{~g}_{\hat{w}_i}\]$ reaches the base error level associated with n, we would generally expect $u_{g_{w} n} - u_{g_{w}} \approx u_{n}$ $\[u_{\mathrm{g}_w \mathrm{n}}-u_{\mathrm{g}_w} \approx u_{\mathrm{n}}\]$ . As Fig. 6 shows, things are not quite that simple: $u_{g_{w} n}$ $\[u_{\mathrm{g}_w \mathrm{n}}\]$ first grows a bit more than expected, then eventually even falls below u_n for some parameters. These interactions are due to n_j not just limiting, but also weighting different frequencies of the chirp differently. In practice, this means that we have to numerically or experimentally determine the maximum acceptable w from u_A and u_m.

Fig. 7

Generating a subgrid w-stack (w-tower).

3.3 w-towers

The w-distance limit is not a problem in practice, as due to $g_{w_{1} + w_{2}} = g_{w_{1}} * g_{w_{2}}$ $\[\mathrm{g}_{w_1+w_2}=\mathrm{g}_{w_1} * \mathrm{~g}_{w_2}\]$ we can use w-stacking to reach a sufficiently close ${\bar{w}}_{i}$ $\[\bar{w}_i\]$ first: $A_{i} (g_{{\hat{w}}_{i}} * Υ) \approx A_{i} (g_{{\hat{w}}_{i} - {\bar{w}}_{i}} * \sum_{j} (n_{j} * m_{i} (b_{j} * g_{{\bar{w}}_{i}} * Υ))) .$ $\[\mathrm{A}_i(\mathrm{~g}_{\hat{w}_i} * \Upsilon) \approx \mathrm{A}_i\left(\mathrm{~g}_{\hat{w}_i-\bar{w}_i} * \sum_j\left(\mathrm{n}_j * \mathrm{~m}_i(\mathrm{~b}_j * \mathrm{~g}_{\bar{w}_i} * \Upsilon)\right)\right).\]$ (38)

This combination is much more efficient than pure w-stacking, as we now only have to repeat data re-distribution for every unique value of ${\bar{w}}_{i}$ $\[\bar{w}_i\]$ . Furthermore, in summing up the $n_{j} * m_{i} (b_{j} * g_{{\bar{w}}_{i}} * Υ)$ $\[\mathrm{n}_j * \mathrm{~m}_i\left(\mathrm{~b}_j * \mathrm{~g}_{\bar{w}_i} * \Upsilon\right)\]$ contributions for some w-stacking plane ${\bar{w}}_{i}$ $\[\bar{w}_i\]$ , we naturally end up in image space (see last step in Fig. 4 and the right side of Fig. 7). Therefore, image-space multiplication with $F g_{{\hat{w}}_{i} - {\bar{w}}_{i}}$ $\[\mathcal{F} ~\mathrm{g}_{\hat{w}_i-\bar{w}_i}\]$ is basically free. In the special case that we are interested in subgrids that are equidistant in w (and have matching m_i), we can iteratively build a ‘w-tower’ of subgrids using only one constant ℱ g_Δw, as illustrated in Fig. 7. We still need one Fourier transform for every ‘storey’ of the tower; however, these are quite a bit smaller than the image size, and can thus be computed relatively efficiently.

For subgrids at maximum height $| {\hat{w}}_{i} - {\bar{w}}_{i} |$ $\[\left|\hat{w}_i-\bar{w}_i\right|\]$ , only the central A_i region (of maximum size u_A) will be accurate. This is subject to $u_{m} > u_{g_{w} n} + u_{A}$ $\[u_{\mathrm{m}}>u_{\mathrm{g}_w \mathrm{n}}+u_{\mathrm{A}}\]$ as established in the last section, effectively sacrificing $4 {(u_{g_{w} n} - u_{n})}^{2}$ $\[4\left(u_{\mathrm{g}_w \mathrm{n}}-u_{\mathrm{n}}\right)^2\]$ subgrid area to deal with non-coplanarity. Thus, covering larger w-ranges either means reducing the usable uv-area per subgrid u_A¹ or introducing more w-stacking planes ( ${\bar{w}}_{i}$ $\[\bar{w}_i\]$ values), and therefore full data re-distributions. We show in Sect. 4.2 how we can handle this particular trade-off.

3.4 Optimisation of w-distribution

The number of subgrids or w-tower storeys to generate is still a major cost factor, so optimising the visibility uvw distribution for a minimum w-range is highly desirable. For instance, we can use a shear transformation as proposed in Cornwell et al. (2012) to transform visibility (u, v, w) coordinates to (u, v, w′) as follows: $w^{'} = w - h_{u} u - h_{v} v,$ $\[w^{\prime}=w-h_u u-h_v v,\]$ (39) $l^{'} = l + h_{u} n = l + h_{u} (\sqrt{1 - l^{2} - m^{2}} - 1),$ $\[l^{\prime}=l+h_u n=l+h_u\left(\sqrt{1-l^2-m^2}-1\right),\]$ (40) $m^{'} = m + h_{v} n = m + h_{v} (\sqrt{1 - l^{2} - m^{2}} - 1) .$ $\[m^{\prime}=m+h_v n=m+h_v\left(\sqrt{1-l^2-m^2}-1\right) \text {. }\]$ (41)

This allows us to choose h_u and h_v to minimise |w′|, while solving the same problem due to ul + vm + wn = ul′ + vm′ + w′n. We still have $n = \sqrt{1 - l^{2} - m^{2}} - 1$ $\[n=\sqrt{1-l^2-m^2}-1\]$ , and therefore the definition of g_w from Sect. 3 remains unchanged, except that we now sample in l′ and m′ for the purpose of w-stacking or w-towers. Using computer algebra we can solve for the original l and m: $l = \frac{l^{'} (1 + h_{v}^{2}) + h_{u} (1 - m^{'} h_{v} - \sqrt{c})}{h_{u}^{2} + h_{v}^{2} + 1},$ $\[l=\frac{l^{\prime}\left(1+h_{\mathrm{v}}^2\right)+h_u\left(1-m^{\prime} h_v-\sqrt{c}\right)}{h_u^2+h_v^2+1},\]$ (42) $m = \frac{m^{'} (1 + h_{u}^{2}) + h_{v} (1 - l^{'} h_{u} - \sqrt{c})}{h_{u}^{2} + h_{v}^{2} + 1},$ $\[m=\frac{m^{\prime}\left(1+h_u^2\right)+h_v\left(1-l^{\prime} h_u-\sqrt{c}\right)}{h_u^2+h_v^2+1},\]$ (43) $\begin{array}{r} where c = 1 - 2 l^{'} h_{u} - 2 m^{'} h_{v} + 2 l^{'} h_{u} m^{'} h_{v} \\ - l^{' 2} (h_{v}^{2} + 1) - m^{' 2} (h_{u}^{2} + 1) \end{array} .$ $\[\begin{array}{r}\text { where } c=1-2 l^{\prime} h_u-2 m^{\prime} h_v+2 l^{\prime} h_u m^{\prime} h_v \\-l^{\prime 2}\left(h_v^2+1\right)-m^{\prime 2}\left(h_u^2+1\right)\end{array}.\]$

To first order l ≈ l′ and m ≈ m′, which means that this transformation will change neither l_max nor the required subgrid margin $u_{g_{w} n}$ $\[u_{\mathrm{g}_w \mathrm{n}}\]$ from Sect. 3.2 significantly until quite close to the horizon, so there is little efficiency loss associated with this optimisation. This in contrast to the variant described in Offringa et al. (2014), where flat offsets in l and m directly increase both l_max and $u_{g_{w} n}$ $\[u_{\mathrm{g}_w \mathrm{n}}\]$ , reducing the maximum height of w-towers and therefore efficiency much more rapidly.

Fig. 8

Subgrid and facet spacings for convolutional gridding.

3.5 Convolutional (de)gridding

Using w-stacking and w-towers we can efficiently generate visibility values for regularly spaced uvw subgrids, as shown in Fig. 8a. This is exactly what we need for gridding algorithms like w-projection (Cornwell et al. 2008), image domain gridding (van der Tol et al. 2018), or uvw-(de-)gridding (Ye et al. 2022). These algorithms allow us to work with irregularly positioned visibilities as long as we can provide sufficient samples of certain 2u_r × 2u_r × 2w_r environments around their locations (e.g. dot and box in Fig. 8a).

Gridding typically uses spatial-frequency space convolutions, which means that just as when we split B_j = b_j * n_j in Sec. 2.3, it must be cancelled out by multiplication with a gridding correction function in image space. Fortunately, our algorithm works well with image-space factors. The natural approach is to combine it with b_j to apply it to facets (like w-stacking), but theoretically we could also merge it with n_j to apply it per subgrid (like w-towers). This means that we can introduce the gridding correction function into our framework efficiently, possibly even supporting different gridding algorithms in different grid regions.

In practice, gridding efficiency depends greatly on how finely the grid is sampled: the more we oversample the spatial-frequency grid by padding the image or introducing finer w-tower storeys, the smaller the 2u_r × 2u_r × 2w_r region we need to sample (Tan 1986; Ye et al. 2020; also compare Fig. 9). Our algorithm allows us to adjust both parameters easily. Due to linearity we can simply ignore irrelevant facets, and therefore we can freely increase the image size and therefore spatial-frequency sampling rate l_Υ, while only covering a given l_FoV area with facets (see Fig. 8b). Extra w-tower storeys are also easily added. On the other hand, either measure will increase computation, leading to a trade-off against gridding complexity.

Fig. 9

Amount of facet space useable per desired base error for prolate spheroidal wave function (PSWF).

4 Algorithm

We assemble the complete algorithms, both for going from facets to w-towers of subgrids (Algorithm 1) and its dual for going from subgrids or w-towers back to facets (Algorithm 2). Separable shifts δ^(u), δ^(v) and ${\hat{u}}_{i}, {\hat{v}}_{i}, {\hat{l}}_{j}, {\hat{m}}_{j}$ $\[\hat{u}_i, \hat{v}_i, \hat{l}_j, \hat{m}_j\]$ work analogously to what was shown in Sects. 2.5 and 2.6. Apart from adding w-towers, this closely follows Fig. 4.

Both directions are linear, meaning that we can skip sub-grids or w-towers without visibility coverage, just as we can skip facets outside the field of view, as shown in Fig. 8b. Furthermore, all ‘foreach’ loops can be executed in parallel. This especially applies to w-towers, as we can parallelise per w-tower, w-tower storey, and/or visibility chunk. Even the outer w-stacking loop can be parallelised to some degree, memory availability permitting.

4.1 Dimensions and shifts

Once the structure of the algorithm has been established, actually making it work depends entirely on suitable parameter choices. Fast Fourier transforms form the backbone of the algorithm, so we start by looking at their sizes. In Algorithms 1 and 2, we perform Fourier transforms at the padded subgrid size 4u_ml_Υ (line 9 and 18), padded facet size 4u_Υl_n (line 23, 25 and 3, 5), and finally 4u_ml_n (line 21, 23 and 5, 7), which derives from the previous values. From Sec. 2.4 we know two more things: Firstly, the total image size u_Υl_Υ must be divisible by both the padded subgrid size u_ml_Υ and padded facet sizes u_Υl_n; Secondly, as u_ml_n = (u_ml_Υ)(u_Υl_n)(u_Υl_Υ)⁻¹, the product of padded subgrid and facet size must be divisible by the image size. This clearly favours simple factorisations, for example powers of two, which are also optimal for fast Fourier transforms.

Alg. 1w-towers → facets method

Alg. 2Facets → w-towers method (dual)

As established in Sec. 2.5, these factorisations also restrict permissible subgrid and facet positions ${\hat{u}}_{i} / {\hat{v}}_{i} / {\hat{l}}_{j} / {\hat{m}}_{j}$ $\[\hat{u}_i / \hat{v}_i / \hat{l}_j / \hat{m}_j\]$ . We assume certain base shift steps Δu and Δl such that all ${\hat{u}}_{i} / {\hat{v}}_{i}$ $\[\hat{u}_i / \hat{v}_i\]$ and ${\hat{l}}_{j} / {\hat{m}}_{j}$ $\[\hat{l}_j / \hat{m}_j\]$ , respectively, are multiples. As we require all ${\hat{u}}_{i} {\hat{l}}_{j}, {\hat{v}}_{i} {\hat{m}}_{j} \in Z$ $\[\hat{u}_i \hat{l}_j, \hat{v}_i \hat{m}_j \in \mathbb{Z}\]$ we can assume ΔuΔl = 1 without loss of generality, and because 4u_Υl_Υ = (2Δul_Υ)(2u_ΥΔl) this means that the product of base subgrid and facet shifts in pixels must equal the image size.

To ensure that 2u_mΔl, 2Δul_n ∈ ℤ we can derive the following, $2 u_{m} Δ l = \frac{2 u_{m}}{Δ u} = \frac{4 u_{m} l_{Υ}}{2 Δ u l_{Υ}} \in Z, 2 Δ u l_{n} = \frac{2 l_{n B}}{Δ l} = \frac{4 u_{Υ} l_{n}}{2 u_{Υ} Δ l} \in Z,$ $\[2 u_{\mathrm{m}} \Delta l=\frac{2 u_{\mathrm{m}}}{\Delta u}=\frac{4 u_{\mathrm{m}} l_{\Upsilon}}{2 \Delta u l_{\Upsilon}} \in \mathbb{Z}, 2 \Delta u l_{\mathrm{n}}=\frac{2 l_{\mathrm{nB}}}{\Delta l}=\frac{4 u_{\Upsilon} l_{\mathrm{n}}}{2 u_{\Upsilon} \Delta l} \in \mathbb{Z},\]$ (44)

so additionally base subgrid and facet shifts must evenly divide the padded subgrid and padded facet sizes, respectively.

Fig. 10

Parameter efficiencies depending on PSWF and precision for 4u_Υl_Υ = 8192,4u_Υl_n = 2048, u_ml_Υ = 1024, 4u_Υl_fov = 6656 (top: efficiency; bottom: w-tower efficiency; dotted: upper efficiency bound; marked: best found parameter combination).

4.2 Errors and margins

In addition to data sizes, the other fundamental parameter is what level of approximation we are willing to accept. As established in Sect. 2.3, this is mostly determined by two factors. The first is the base error level, which effectively derives from the PSWF parameter W = 4u_nl_n. From this we can especially derive 4u_nl_Υ, the minimum subgrid margin size. The second factor is the base error magnification due to b, which depends directly on $l_{B} l_{n}^{- 1}$ $\[l_{\mathrm{B}} l_{\mathrm{n}}^{-1}\]$ (i.e. how much facet space we are willing to sacrifice).

We know that the worst-case error magnification happens at the facet borders, and therefore we just need to determine how large we can make l_B until the base error multiplied by |ℱn(l_B)|⁻¹ becomes larger than the acceptable error level (compare Fig. 2). This means that for any target error, we have a range of PSWF parameters W to choose from, as illustrated in Fig. 9.

Facet–subgrid contributions have size 4u_ml_n, yet effectively sample only 4u_Al_B points of image–frequency space. Therefore, we can calculate an upper bound on communication efficiency: $efficiency = \frac{u_{A} l_{B}}{u_{m} l_{n}} \leq \frac{(u_{m} - u_{n}) l_{B}}{u_{m} l_{n}} .$ $\[\text { efficiency }=\frac{u_{\mathrm{A}} l_{\mathrm{B}}}{u_{\mathrm{m}} l_{\mathrm{n}}} \quad \leq \frac{\left(u_{\mathrm{m}}-u_{\mathrm{n}}\right) l_{\mathrm{B}}}{u_{\mathrm{m}} l_{\mathrm{n}}} \text {. }\]$ (45)

This already shows a fundamental trade-off between facet and subgrid margins, as shown in Fig. 10 (top, dotted). For larger W we have to give up subgrid area, whereas for small W we lose facet area to compensate for the worse base error.

For a measure of w-tower efficiency, we observe that in Sect. 3.2 there is to first order a linear dependency between maximum tower size |w| and the effective margin reserved for the w-term $u_{g_{w}}^{'} = u_{g_{w} n} - u_{n}$ $\[u_{\mathrm{g}_w}^{\prime}=u_{\mathrm{g}_w \mathrm{n}}-u_{\mathrm{n}}\]$ . Therefore, we can use a l′_max-normalised uvw volume as an efficiency measure subject to u_A ≤ u_m − u_n − $u_{g_{w}}^{'}$ $\[u_{\mathrm{g}_w}^{\prime}\]$ : $w -tower efficiency = \frac{u_{g_{w}}^{'}}{u_{m}} {(\frac{u_{A} l_{B}}{u_{m} l_{n}})}^{2} \leq \frac{4 {(u_{m} - u_{n})}^{3} l_{B}^{2}}{27 u_{m}^{3} l_{n}^{2}} .$ $\[w \text {-tower efficiency }=\frac{u_{\mathrm{g}_w}^{\prime}}{u_{\mathrm{m}}}\left(\frac{u_{\mathrm{A}} l_{\mathrm{B}}}{u_{\mathrm{m}} l_{\mathrm{n}}}\right)^2 \leq \frac{4\left(u_{\mathrm{m}}-u_{\mathrm{n}}\right)^3 l_{\mathrm{B}}^2}{27 u_{\mathrm{m}}^3 l_{\mathrm{n}}^2} \text {. }\]$ (46)

Optimally $u_{g_{w}}^{'} \approx \frac{1}{3} (u_{m} - u_{n})$ $\[u_{\mathrm{g}_w}^{\prime} \approx \frac{1}{3}\left(u_{\mathrm{m}}-u_{\mathrm{n}}\right)\]$ , which again yields an upper bound, as shown in Fig. 10 (bottom, dotted). As should be expected, larger W are somewhat less efficient here as the PSWF and g_w compete for limited subgrid space.

4.3 Parameter search

Unfortunately, the side conditions from Sec. 4.1 mean that at best we can get close to those limits, especially when we try to realise facet and subgrid partitioning as shown in Fig. 8a: To account for subgrid or facet overlaps due to offset restrictions (see Sec. 2.5) for the purpose of efficiency calculation, 2u_A and 2l_B should be multiples of Δu and Δl, respectively. As we require the same of 2u_m and 2l_n, respectively, this extends to the subgrid and facet margins u_m − u_A and l_n − l_B.

We would also like the facets to cover the field of view efficiently. If we need k × k facets to cover the field of view, this means that on average every facet only contains $2 l_{B}^{'} = 2 l_{fov} k^{- 1}$ $\[2 l_{\mathrm{B}}^{\prime}=2 l_{\text {fov }} k^{-1}\]$ useable data. Even numbers of facets are a special case, as with a centred field of view this requires placing facets at an offset of l_B. Therefore in this case l_B must be a multiple of Δl as well.

In practice, finding good parameters boils down to trying all combinations of W = 4u_nl_n, Δu and Δl, u_A, and l_B exhaustively. After that, the smallest possible l_n and subsequently all remaining parameters and efficiency measures can be determined. If we plot the best solution of an example parameter search for every value of W (Fig. 10, top solid), we see that the achievable efficiency can be quite unpredictable. For w-towers we gain another degree of freedom in the w-term margin $u_{g_{w}}^{'}$ $\[u_{\mathrm{g}_w}^{\prime}\]$ , which makes efficiency better behaved (Fig. 10, bottom solid). This is mainly because the requirement that (u_A − u_m) must be a multiple of Δu is much less restrictive when we can still derive efficiency from extra subgrid space by making w-towers taller.

Table 1 shows a number of algorithm parameter sets for a fixed image size 4u_Υl_Υ = 8192, optimised for w-towers efficiency at 10⁻⁵ target error. Configurations with matching contribution sizes 4u_ml_n generally have the same properties as they arrive at the same balance between base error (i.e. PSWF parameter W = u_nl_n) and error magnification (i.e. facet margin $l_{B} l_{n}^{- 1}$ $\[l_{\mathrm{B}} l_{\mathrm{n}}^{-1}\]$ ) independently of the concrete values of l_n or u_m (compare Sec. 4.2). This works as long as we can find equivalent facet splits (and base offsets) for them. For instance, in Table 1 4u_Υl_n = 1280, 4u_ml_Υ = 1024 is in the same family as 4u_Υl_n = 5120, 4u_ml_Υ = 256, but as a facet split of k = 1.5 would be meaningless a different PSWF parameter W must be used. Following the same logic we can also scale the image size l_Υ or resolution u_Υ, so the configuration from Table 1 with facet size 4u_Υl_n = 1024 could be scaled into a configuration with 4u_Υl_Υ = 524288 (4.4 TB of data) and 4u_Υl_n = 65 536 (69 GB of data), which is where a k × k = 64-way facet split would start to seem more appropriate.

5 Results

We investigate scaling properties by posing an SKA-sized ‘forward’ imaging problem: predicting visibilities for the SKA1-Mid layout with 197 dishes, assuming a snapshot duration of 29 minutes, a dump time of 0.142 s (12 288 dumps), a frequency window of 350-472.5 MHz split into 11 264 channels, resulting in 2.6 × 10¹² visibilities (or 42.3 TB data at double precision). The uv-coverage for such an observation is shown in Fig. 13. The phase centre was chosen at a declination of 0 with hour angles centred around transit (i.e. 0 hour angle). This means around 60 degrees elevation for SKA Mid, which we can compensate for with shear factor h_v = 0.5936 (see Sec. 3.4). For the de-gridding we used a 8 × 8 × 4 deconvolution using Sze Meng Tan’s kernels (Ye et al. 2020), optimised for x₀ = 0.40625 (i.e. 1-2×0.40625 =18.75% image margin) and x₀ = 0.1875, respectively, resulting in ~10⁻⁵ expected degridding accuracy. The test image contained ten on-grid sources with the same intensity, all placed randomly on facet borders to maximise errors, as explained in Sec. 2.3. To verify correctness, we randomly sampled about 6 × 10⁸ visibilities per run and compared them against a direct evaluation of the measurement equation.

To make configurations comparable, all tests were conducted with scaled variants of the first configuration from Table 1. We specifically held facet count and subgrid size constant, but scaled both image and facet size as well as subgrid count. We configured our implementation (Wortmann 2019, 2023) so it distributed work across 16 nodes and 49 processes (1 scheduler, 4 × 4 = 16 facet workers and 32 w-towers / subgrid workers). The tests were run on 16 dual Intel® Xeon® Platinum 8368Q (2 × 76 cores) with 256 GiB RAM each.

Table 1

Sample of possible algorithm parameters for a fixed image size (columns W = 4u_nl_n, Δu & Δl, u_A, and l_B) and the corresponding efficiencies (last two columns).

5.1 Scaling resolution

The image size in pixels scales as the product of the field of view (and therefore l_Υ) and the resolution u_Υ. For this section we consider the effects of scaling u_Υ, while holding the field of view at 2l_fov = 0.322 (~18.5 degrees; see middle column of Table 2). This corresponds to roughly the third null of SKA1 Mid’s beam at the finest tested image resolution of 4u_Υl_Υ = 196608. For smaller grid sizes, this results in coarser image resolution and truncation of the uv-coverage (Fig. 13). This does not decrease degridding work substantially due to the high number of short baselines (Fig. 11, solid red line).

On the other hand, higher image resolution increases the cost for facet Fourier transform work (i.e. line 3, 5, and 7 in Algorithm 2) roughly as $O (u_{Υ}^{2})$ $\[O\left(u_{\Upsilon}^2\right)\]$ per w-stacking plane. For a baseline distribution where maximum |w| grows linearly with uv distance (i.e. as long as the Earth’s rotation dominates the Earth’s curvature) the number of w-stacking planes required to cover the truncated uv-distribution will also grow linearly with u_Υ. This means the expected worst-case scaling of facet work is $O (u_{Υ}^{3})$ $\[O\left(u_{\Upsilon}^3\right)\]$ , and as Fig. 11 shows (solid orange line) we indeed find scaling slightly better than $O (u_{Υ}^{3})$ $\[O\left(u_{\Upsilon}^3\right)\]$ .

For w-towers, changing image resolution increases the number of subgrids required to cover all of the baselines. Therefore, we would expect the growth of subgrid FFT work (i.e. line 18 in Algorithm 2) as well as the data transfer amount to grow with the covered uvw volume, which increases as $O (u_{Υ}^{3})$ $\[O\left(u_{\Upsilon}^3\right)\]$ . Compared against Fig. 11 (solid green and blue lines), the scaling is significantly more efficient. This is clearly because, as first discussed in Sec. 2.6, we can skip more uvw volume the longer the baselines get, especially as we approach the rare longest baselines. By varying subgrid sizes and margins we can trade w-stacking and w-tower work, so there is reason to suspect that we will be able to adapt to bigger telescopes with significantly better-than-cubic net growth in complexity.

Fig. 11

Complexity scaling (dotted lines indicate cubic growth).

5.2 Scaling the field of view

To investigate field of view scaling, let us now hold the grid dimension u_Υ constant such that it fits all baselines from Fig. 13 exactly. Changing image size 4u_Υl_Υ then will change l_Υ, and therefore the effective field of view size 2l_fov. In our test scenarios shown in Table 2 we again scale from the maximum case of 16.8 degrees down to 0.7 degrees.

The reasoning about computational complexity is more complex in this case. To determine the number of w-stacking planes, we first note that scaling l_Υ means scaling frequency-space resolution; therefore, a given subgrid size 4u_ml_Υ and its margins 4(u_m − u_A)l_Υ now cover less frequency space. This means that for the purpose of the discussion in Sect. 3.2, the permissable $u_{g_{w} n}$ $\[u_{\mathrm{g}_w \mathrm{n}}\]$ decreases, while the relevant field of view and therefore l_max increases. It follows that the number of required w-stacking planes increases as $O (l_{Υ}^{2})$ $\[O\left(l_{\Upsilon}^2\right)\]$ to first order. As each plane is also $O (l_{Υ}^{2})$ $\[O\left(l_{\Upsilon}^2\right)\]$ larger, the total facet Fourier transform complexity growth is about $O (l_{Υ}^{4})$ $\[O\left(l_{\Upsilon}^4\right)\]$ . The cost for subgrid Fourier transforms will simply increase with the number of subgrids we need to produce. As the subgrids cover less spatial-frequency space each, we need to increase their number by $O (l_{Υ}^{2})$ $\[O\left(l_\Upsilon^2\right)\]$ to cover the telescope layout in uv. However, to implement (de-)gridding (e.g. Ye et al. 2022) we also need to decrease the w distance between w-tower planes as the image curvature $n \sim l_{fov}^{2}$ $\[n \sim l_{\text {fov }}^2\]$ increases. The result is an overall scaling of $O (l_{Υ}^{4})$ $\[O\left(l_\Upsilon^4\right)\]$ again. Interestingly enough, this means that the number of storeys in a w-tower is constant. The tower layout shrinks linearly along the u- and v-axes, and about quadratically along the w-axis.

As Fig. 11 shows, the experiments again demonstrate slightly better than expected scaling behaviour; facet work grows supercubically once we get to larger image sizes, but the subgrid Fourier transform work and data transfer volume stay at or slightly below $O (l_{Υ}^{3})$ $\[O\left(l_\Upsilon^3\right)\]$ . This is because a larger number of fine-grained subgrids and w-towers allow us to more closely match the baseline distribution. In a way we are observing the fractal dimension of the telescope layout. We note that existing widefield imaging algorithms would show the same $O (l_{Υ}^{4})$ $\[O\left(l_{\Upsilon}^4\right)\]$ scaling (e.g. using constant-size facets would see a $O (l_{Υ}^{2})$ $\[O\left(l_\Upsilon^2\right)\]$ growth both in the number of visibilities and facet count for a combined scaling of $O (l_{Υ}^{4})$ $\[O\left(l_{\Upsilon}^4\right)\]$ of phase rotation cost growth).

Table 2

Test configuration parameters and results (left: scaling baseline length u_Υ; right: scaling field of view (FoV) size l_Υ).

Fig. 12

Benchmark results. ducc results scaled as if distributed perfectly to 16 nodes.

5.3 Communication

In the largest configuration we are dealing with an image of size 196 608², which at double precision translates to about 618.5 GB of data. To ensure the accuracy of the gridder we leave 18.75% margins, which means that the effective field of view size corresponds to about 408.3 GB of data. The results in Table 2 show that 997.0 GB of communication were exchanged, so a bit more than twice the size of the information-containing part of the image.

This is quite efficient, as the configuration has 33 w-stacking planes, with w-towers decomposing them into 154 subplanes each, so we are effectively sampling a uvw grid 33 · 154 · 618.5 GB = 2.6 EB in size. We can also contrast against the visibility volume of 42.3 TB, which puts the network overhead at about 1.8 % relative to the throughput required to load visibilities from distributed storage.

Fig. 13

Visibilities per subgrid position (SKA1-Mid layout, summed over all w-stacking planes and w-tower storeys).

5.4 Performance

Figure 12 shows the run-time performance of tests in Table and compares them to equivalent benchmark runs of the ducc degridding kernel (Arras et al. 2021). The ducc module uses the same class of degridding algorithm (Ye et al. 2022) and is in regular scientific use as a (de)gridding kernel for the WS Clean widefield imager (Offringa et al. 2014). For this test, image data was passed to ducc as one merged effective image, as in Table 2, which was then internally padded according to ducc’s heuristic (target precision ϵ = 10⁻⁴). We predicted visibilities as 598 chunks of 32 baselines each (~70 GB). To minimise redundant FFT work due to repeating w-stacking planes, baselines were ordered by median w before chunking. All chunks were predicted sequentially on a single CPU (76 cores), with performance linearly extrapolated to 16 dual-CPU nodes for Fig. 12.

Our implementation significantly outperforms ducc in raw degridding speed (red lines) despite the similar degridding approach. There are a number of possible explanations. Our kernel was hand-optimised for AVX2, uses a pre-tabulated kernel, and also convolves across all w-tower storeys in one go. Meanwhile, ducc has a more generic implementation that generates the convolution kernel on the fly. This calculation even has to be repeated multiple times per visibility, due to the high memory pressure of w-stacking. Our implementation likely further benefits from executing degridding and w-towers FFTs in parallel as these workloads have very different operational intensity, and therefore work well with hyper-threading.

For the traditional w-stacking method used by ducc, FFT work dominates past images sizes of 32768² due to unmitigated $O (l_{Υ}^{4})$ $\[O\left(l_{\Upsilon}^4\right)\]$ complexity growth (see Sect. 5.2, green dashed line). In comparison, the w-towers approach together with the shear transformation from Sect. 3.4 substantially blunt the impact of larger fields of view in terms of number and size of required FFTs (green dotted line). Time spent on FFTs still eventually overtakes degridding despite the lower operation count (compare Fig. 11) due to lower operational intensity.

Finally, we consider the cost of distribution, represented through the facet FFT costs (orange lines) and scheduling inefficiencies (violet lines). The former are basically insignificant, while the latter eventually start dominating. This is so because while there is an abundance of parallel tasks, they vary greatly in terms of computational and communication complexity. This makes saturating 4864 threads quite challenging in practice.

Our simple scheduling algorithm approaches this by alternating ‘expensive’ (many visibilities) and ‘cheap’ (few visibilities) w-stacking planes to balance FFT and degridding work as evenly as possible throughout. It then dynamically schedules subgrid tasks to subgrid workers from a pre-populated work queue. For the most densely populated subgrids it splits the degridding work further to allow multiple subgrid workers to work on them in parallel. Queue sizes were optimised for the large-scale case, and kept constant for all test runs. This also explains why small configurations sometimes have worse scheduling efficiency, as due to the small number of subgrids work is scheduled before imbalances can become apparent. As a result, some workers end up running idle towards the end of the run. It is clear that effective work scheduling is one of the main remaining challenges with this algorithm.

6 Conclusion

We have presented a scalable way to distribute interferometric imaging calculations by using spatial distribution and window functions. This results in a streaming widefield Fourier transform algorithm that enables interferometry imaging to be parallelised to thousands of distributed threads, while taking full advantage of both image and uvw sparseness to reduce transfer and compute costs.

6.1 Future improvements

There are still a number of improvements and generalisations that can be made to the algorithm. Centring contribution terms as in Sec. 2.5 is not actually necessary, and removing the ${\hat{u}}_{i} {\hat{l}}_{j} \in Z$ $\[\hat{u}_i \hat{l}_j \in \mathbb{Z}\]$ side condition allows more flexible subgrid and facet placement. Additional efficiency could be gained by combining different parameters within the same imaging task (e.g. adjust subgrid size per w-level), or even introducing additional layers between w-stacking and w-towers on either side of the distribution pattern (for dealing with very large images). There is good reason to believe that such techniques could improve performance further.

6.2 Implementation

Despite our efforts, the MPI implementation (Wortmann 2023) is still severely limited by its primitive work scheduling. A more sophisticated work-stealing mechanism will likely be required as data flow predictability will only get worse from here. For instance, (de-)gridding from and to subgrids could be tackled by accelerators, and storage might struggle to keep pace with visibility data flows: processing ~40 TB in under 4 min means 10 GB/s/node, with non-obvious access patterns.

Additionally, to integrate calibration as well as corrections for various instrumental and environmental effects, our algorithm will have to work in concert with other state-of-the-art radio astronomy algorithms. These methods come with significant complexity of their own, which makes integration even trickier. We are planning to investigate use of a specialised execution framework to take over the data transfer and task scheduling functions of the prototype code.

6.3 Imaging and calibration

Transformations between the sky and aperture planes are common and well-defined operations in radio-astronomy pipelines, which means that the presented algorithm can help distribute a variety of functions relevant to radio astronomy. In Sect. 3.5, we discuss how it can work with a variety of gridding approaches; extensions to cover polarisation, Taylor terms, or point-spread function generation are straightforward. Furthermore, as every subgrid gets represented in image space along the way, we can introduce slow-changing image-space multipliers (e.g. corrections for primary beam effects) very cheaply and at good resolution. This might complement approaches that deal with fast-changing direction-dependent effects such as IDG (van der Tol et al. 2018).

On the other hand, pipelines that require traversing visibilities in a specific ordering may need substantial rework to make use of the presented algorithm because much of its efficiency derives from traversing the grid in a very particular order: iterating over w-stacking planes, then over u-columns, and finally visiting individual subgrids. This is not an issue for a CLEAN major loop iteration that predicts visibilities to be subtracted for an imaging step; the forward and backward directions of the algorithm can be combined so that processing proceeds from a model image to the current residual dirty image, while touching every observed visibility exactly once. Combined with a distributed by-facet image-based ‘minor loop’ deconvolution step, this should yield a scalable CLEAN pipeline.

The presented algorithm is however a worse fit for standard antenna-based calibration approaches as they require combining data from all baselines to reliably solve for antenna-based gain terms. The algorithm is designed such that visibilities from different w-stacking planes would basically never meet in the same memory space. A possible solution would be to reduce the size of the calibration problem representation enough so that we could make progress even with an incomplete view (e.g. Yatawatta 2015). Alternatively, one might attempt to devise a distributed algorithm that re-aggregates optimisation problems, similarly to how this distributed Fourier transform algorithm re-assembles a complete image.

6.4 Outlook

The presented algorithm significantly improves on existing radio-astronomy imaging algorithms by enabling the distribution of both the computation and the working memory load. This enables imaging in less wall-clock time by distribution over relatively small general-purpose (and therefore cost effective) nodes. Reducing wall-clock time to process data sets is a key requirement for telescopes like SKA where storage of visibilities during processing is a major cost driver.

Acknowledgements

Tim Cornwell originally suggested the possibility for such an approach at one of the SKA SDP face-to-face meetings, and provided a lot of helpful discussion along the way. The work by Bas van der Tol and Bram Veen-boer on IDG was a major inspiration, with Bram originally suggesting subgrids as a unit of distribution. Steve Gull and Sze Meng Tan helped me understand how to perform accurate wide-field degridding. Thanks to Martin Reinecke for helping us understand ducc’s performance. We also thank the anonymous referee who’s comments helped us improve the presentation substantially. Finally we would like to thank the SKA SDP Consortium, the SKA Organisation and SKA Observatory for providing us the means to do this work.

Appendix A Reference implementation

The one-dimensional Fourier transform algorithm can be demonstrated quite succinctly, as the following snippet shows. We note that this is just the base algorithm, without non-coplanarity corrections.

References

Arras, P., Reinecke, M., Westermann, R., & Enßlin, T. A. 2021, A&A, 646, A58 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
Bolton, R., et al. 2016, Parametric models of SDP compute requirements, Tech. Rep. SKA-TEL-SDP-0000040, SDP Consortium, dated 2015-03-24 [Google Scholar]
Cornwell, T., & Perley, R. 1992, A&A, 261, 353 [NASA ADS] [Google Scholar]
Cornwell, T. J., Golap, K., & Bhatnagar, S. 2008, IEEE J. Selected Top. Signal Process., 2, 647 [NASA ADS] [CrossRef] [Google Scholar]
Cornwell, T., Voronkov, M., & Humphreys, B. 2012, in Image Reconstruction from Incomplete Data VII, 8500, International Society for Optics and Photonics, 85000L [NASA ADS] [CrossRef] [Google Scholar]
Dewdney, P. E., Hall, P. J., Schilizzi, R. T., & Lazio, T. J. L. W. 2009, Proc. IEEE, 97, 1482 [NASA ADS] [CrossRef] [Google Scholar]
Frigo, M., & Johnson, S. G. 1998, in Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP’98 (Cat. No. 98CH36181), Vol. 3, IEEE, 1381 [Google Scholar]
Offringa, A., McKinley, B., Hurley-Walker, N., et al. 2014, MNRAS, 444, 606 [Google Scholar]
Tan, S. M. 1986, Ph.D. Thesis, University of Cambridge [Google Scholar]
Tasse, C., Hugo, B., Mirmont, M., et al. 2018, A&A, 611, A87 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
van der Tol, S., Veenboer, B., & Offringa, A. R. 2018, A&A, 616, A27 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
Van Weeren, R., Williams, W., Hardcastle, M., et al. 2016, ApJS, 223, 2 [CrossRef] [Google Scholar]
Wortmann, P. 2019, Distributed Predict I/O Prototype, Tech. Rep. SKA-TEL-SDP-0000203, SDP memo 102, SDP Consortium, revision 1 [Google Scholar]
Wortmann, P. 2023, SDP Exec - Imaging IO Test, https://gitlab.com/ska-telescope/sdp/ska-sdp-exec-iotest [Google Scholar]
Yatawatta, S. 2015, MNRAS, 449, 4506 [Google Scholar]
Ye, H., Gull, S. F., Tan, S. M., & Nikolic, B. 2020, MNRAS, 491, 1146 [Google Scholar]
Ye, H., Gull, S. F., Tan, S. M., & Nikolic, B. 2022, MNRAS, 510, 4110 [NASA ADS] [CrossRef] [Google Scholar]

¹

The value of $u_{g_{w} n}$ $\[u_{\mathrm{g}_w \mathrm{n}}\]$ depends on $| {\hat{w}}_{i} - {\bar{w}}_{i} |$ $\[\left|\hat{w}_i-\bar{w}_i\right|\]$ , so the usable uvw-volume actually has the shape of a bipyramid. This is not very useful for filling uvw-volumes though, so we assume w-towers to cover ‘boxes’.

All Tables

Table 1

Sample of possible algorithm parameters for a fixed image size (columns W = 4u_nl_n, Δu & Δl, u_A, and l_B) and the corresponding efficiencies (last two columns).

In the text

Table 2

Test configuration parameters and results (left: scaling baseline length u_Υ; right: scaling field of view (FoV) size l_Υ).

In the text

All Figures

	Fig. 1 Distribution concept sketch of our algorithm. The visibilities are gridded to subgrids (solid lines), and the contributions are accumulated to facets. The dashed lines indicate the reverse direction: the contributions are extracted from the facets, and the visibilities are degridded from the subgrids.
In the text

	Fig. 2 Key functions and relationships at a glance.
In the text

	Fig. 3 Image space error (light colours: worst case Υ).
In the text

Fig. 4

2D algorithm sketch of both transformation directions, highlighting the symmetry. This also illustrates buffer sizes, showing how padding facet data to 4u_Υl_n tends to dominate memory consumption both in the horizontal (u) and vertical (υ) axes. This is why it is important to share intermediate buffers for subgrids from the same ‘column’ $(m_{i}^{(u)} = m_{i^{'}}^{(u)})$ $\[\left(\mathrm{m}_i^{(\mathrm{u})}=\mathrm{m}_{i^{\prime}}^{(\mathrm{u})}\right)\]$ .

In the text

	Fig. 5 Window interaction with non-coplanarity term.
In the text

	Fig. 6 Non-coplanarity margin relative to highest chirp frequency for l_max = 0.2 (solid: facet size l_n = 0.1, dotted: l_n = 0.2).
In the text

	Fig. 7 Generating a subgrid w-stack (w-tower).
In the text

	Fig. 8 Subgrid and facet spacings for convolutional gridding.
In the text

	Fig. 9 Amount of facet space useable per desired base error for prolate spheroidal wave function (PSWF).
In the text

	Fig. 10 Parameter efficiencies depending on PSWF and precision for 4u_Υl_Υ = 8192,4u_Υl_n = 2048, u_ml_Υ = 1024, 4u_Υl_fov = 6656 (top: efficiency; bottom: w-tower efficiency; dotted: upper efficiency bound; marked: best found parameter combination).
In the text

	Fig. 11 Complexity scaling (dotted lines indicate cubic growth).
In the text

	Fig. 12 Benchmark results. ducc results scaled as if distributed perfectly to 16 nodes.
In the text

	Fig. 13 Visibilities per subgrid position (SKA1-Mid layout, summed over all w-stacking planes and w-tower storeys).
In the text

Current usage metrics show cumulative count of Article Views (full-text article views including HTML views, PDF and ePub downloads, according to the available data) and Abstracts Views on Vision4Press platform.

Data correspond to usage on the plateform after 2015. The current usage metrics is available 48-96 hours after online publication and is updated daily on week days.

Initial download of the metrics may take a while.

[1] Arras, P., Reinecke, M., Westermann, R., & Enßlin, T. A. 2021, A&A, 646, A58 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]

[2] Bolton, R., et al. 2016, Parametric models of SDP compute requirements, Tech. Rep. SKA-TEL-SDP-0000040, SDP Consortium, dated 2015-03-24 [Google Scholar]

[3] Cornwell, T., & Perley, R. 1992, A&A, 261, 353 [NASA ADS] [Google Scholar]

[4] Cornwell, T. J., Golap, K., & Bhatnagar, S. 2008, IEEE J. Selected Top. Signal Process., 2, 647 [NASA ADS] [CrossRef] [Google Scholar]

[5] Cornwell, T., Voronkov, M., & Humphreys, B. 2012, in Image Reconstruction from Incomplete Data VII, 8500, International Society for Optics and Photonics, 85000L [NASA ADS] [CrossRef] [Google Scholar]

[6] Dewdney, P. E., Hall, P. J., Schilizzi, R. T., & Lazio, T. J. L. W. 2009, Proc. IEEE, 97, 1482 [NASA ADS] [CrossRef] [Google Scholar]

[7] Frigo, M., & Johnson, S. G. 1998, in Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP’98 (Cat. No. 98CH36181), Vol. 3, IEEE, 1381 [Google Scholar]

[8] Offringa, A., McKinley, B., Hurley-Walker, N., et al. 2014, MNRAS, 444, 606 [Google Scholar]

[9] Tan, S. M. 1986, Ph.D. Thesis, University of Cambridge [Google Scholar]

[10] Tasse, C., Hugo, B., Mirmont, M., et al. 2018, A&A, 611, A87 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]

[11] van der Tol, S., Veenboer, B., & Offringa, A. R. 2018, A&A, 616, A27 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]

[12] Van Weeren, R., Williams, W., Hardcastle, M., et al. 2016, ApJS, 223, 2 [CrossRef] [Google Scholar]

[13] Wortmann, P. 2019, Distributed Predict I/O Prototype, Tech. Rep. SKA-TEL-SDP-0000203, SDP memo 102, SDP Consortium, revision 1 [Google Scholar]

[14] Wortmann, P. 2023, SDP Exec - Imaging IO Test, https://gitlab.com/ska-telescope/sdp/ska-sdp-exec-iotest [Google Scholar]

[15] Yatawatta, S. 2015, MNRAS, 449, 4506 [Google Scholar]

[16] Ye, H., Gull, S. F., Tan, S. M., & Nikolic, B. 2020, MNRAS, 491, 1146 [Google Scholar]

[17] Ye, H., Gull, S. F., Tan, S. M., & Nikolic, B. 2022, MNRAS, 510, 4110 [NASA ADS] [CrossRef] [Google Scholar]