A&A, Volume 687, July 2024
Article Number: A152
Number of page(s): 13
Section: Numerical methods and codes
DOI: https://doi.org/10.1051/0004-6361/202142040
Published online: 05 July 2024
Imaging swiFTly: Streaming widefield Fourier Transforms for large-scale interferometry
Cavendish Astrophysics Group, University of Cambridge, JJ Thomson Avenue, Cambridge CB3 0HE, UK
e-mail: peter.wortmann@skao.int; jameschristopherkent@gmail.com; bn204@cam.ac.uk
Received: 16 August 2021 / Accepted: 29 January 2024
Aims. We describe a scalable distributed imaging algorithm framework for next-generation radio telescopes, managing the Fourier transform from apertures to sky (or vice versa) with a focus on minimising memory load, data transfers, and computation.
Methods. Our algorithm uses smooth window functions to isolate the influence between specific regions of spatial-frequency and image space. This allows the distribution of image data between nodes and the construction of segments of frequency space exactly when and where needed.
Results. The developed prototype distributes terabytes of image data across many nodes, while generating visibilities at throughput and accuracy competitive with existing software. Scaling is demonstrated to be better than cubic in problem complexity (for baseline length and field of view), reducing the risk involved in growing radio astronomy processing to large telescopes like the Square Kilometre Array.
Key words: methods: data analysis / methods: numerical / techniques: image processing / techniques: interferometric
© The Authors 2024
Open Access article, published by EDP Sciences, under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
This article is published in open access under the Subscribe to Open model.
1 Introduction
Upcoming large-scale radio telescopes like the Square Kilometre Array (SKA) (Dewdney et al. 2009) are designed for deep observations of large areas of the sky. These observations require many long baselines, wide bandwidths, and high frequency resolution. This translates to demanding computing requirements; for SKA imaging observations, more than a petabyte of visibility data might be generated per hour. Pipelines will need to perform approximately 40 exa-operations to produce images that are terabytes in size (Bolton et al. 2016). This means that efficient use of high-performance computing facilities will be essential, which will require algorithms that distribute computation, visibility data, as well as image data evenly across many compute nodes.
This distribution is not trivial because radio astronomy imaging requires a Fourier transform where every measured visibility impacts every image pixel. Conventional strategies for distributing radio-interferometric imaging split visibilities by frequency or by observation time, but replicate full copies of the image to every node. A well-known alternative approach is to instead split the image into ‘facets’, previously used for non-coplanarity (Cornwell & Perley 1992), direction-dependent calibration, and deconvolution (Van Weeren et al. 2016). This requires visibilities to be gridded to a separate low-resolution grid for every facet, which combined with averaging can be a viable scaling strategy (Bolton et al. 2016; Tasse et al. 2018). On the other hand, repeated gridding is inefficient, phase rotation can become a bottleneck, and averaging introduces inherent inaccuracies into the result.
Therefore, this paper focuses on the Fourier transform specifically, describing a scalable algorithm that addresses the core challenges of large-scale interferometric processing. Our method fully distributes the image data as well as the computation, never needing to assemble the entire image or uv(w)-grid at any stage. As with previous work, this is achieved using a distributed facet approach, so different parts of the image are held and processed in memory separately. However, our method solves the conceptually easier problem of transforming them from and to cut-outs of the full-resolution spatial-frequency grid, which we call ‘sub-grids’. Hence, in our approach both the image-plane and the aperture-plane are divided into smaller parts that can be processed in parallel. This can be used to distribute conventional visibility (de)gridding algorithms as a separate step, as shown in Fig. 1. This would process every visibility only once on a single node, with a roughly predictable processing order.
All we need is an efficient way to transform between such subgrids and facets, which would make it a form of distributed Fourier transform. Distributed fast Fourier transform (FFT) algorithms (e.g. Frigo & Johnson 1998) might seem the obvious choice; however, they are not well suited for our purposes. For non-coplanarity corrections we are often only interested in local regions of the uvw spatial-frequency space, and therefore we need to minimise not only the computational work spent on unneeded uvw regions, but also the time the full grid or image data is held in memory. Additionally, we would like to apply extra image-space factors per subgrid, for example for widefield corrections, which further pushes us towards a different class of algorithm.
Our solution is therefore to stream portions of spatial-frequency space to worker nodes (or, during de-gridding, from them) exactly as and when required to cover the telescope baselines: an incremental semi-sparse Fourier transform algorithm for contiguous partitions of image and frequency space. In Sec. 2, we define the core algorithm, which we then extend in Sec. 3 to cover the effects of Fresnel diffraction when observing widefield. In Sec. 4, we demonstrate how to assemble and parametrise the algorithm, which allows us to perform scaling and performance tests using SKA-scale parameters in Sec. 5. We finally wrap things up with a discussion in Sec. 6.
Fig. 1 Distribution concept sketch of our algorithm. The visibilities are gridded to subgrids (solid lines), and the contributions are accumulated to facets. The dashed lines indicate the reverse direction: the contributions are extracted from the facets, and the visibilities are degridded from the subgrids.
2 Core algorithm
Interferometric imaging measures complex visibilities V, and relates them to the planar projection of the sky intensity distribution I as
V(u, v, w) = ∫∫ I(l, m) (1 − l² − m²)^−1/2 e^−2πi(ul + vm + w(√(1 − l² − m²) − 1)) dl dm (1)
where l and m are sky direction cosines, and u, v, and w are coordinate components of telescope baselines (compare e.g. Cornwell & Perley 1992). The right-hand side involves a Fourier transform, which is why an efficient approximation of I from visibility samples V (or vice versa) requires discrete Fourier transforms.
2.1 Notation
For simplicity we limit ourselves to one-dimensional functions for the moment. By convention, all named functions are in visibility space (i.e. they use the spatial-frequency u coordinate). The image or gridded visibilities to be transformed are represented by a function Υ, no matter the transformation direction considered. Furthermore, all functions are discrete with a given step rate of 2lΥ and repeat with a period of 2uΥ, satisfying 4uΥlΥ ∈ ℤ. This allows us to define a ‘scaled’ discrete Fourier transform as
(2)
This makes ℱf an image-space function (l coordinate) that is sampled with a step rate of 2uΥ and repeats with a period of 2lΥ. We use the following notations and properties:
(3)
(4)
(5)
For this paper, we often need to deal with functions that are sampled more coarsely than the base uΥ/lΥ discretisation. We use Kronecker comb functions to represent this:
(6)
Multiplication with a comb samples frequency space at a rate of (2uf)−1, while convolution with a comb samples image space at a rate of 2uf. For a function f sampled and periodised in this way, we obtain
(7)
in other words, where a function f repeats with a period of 2uf and is non-zero at a rate of 2lf, its Fourier transform repeats with a period of 2lf and has a rate of 2uf. For numerical purposes, Eq. (7) furthermore shows that we can represent f or ℱf entirely using exactly |4uflf| samples.
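This duality is easy to check with a standard FFT. The following numpy sketch (all sizes are arbitrary choices of ours) builds a signal that is non-zero only at every s-th sample and repeats with period p, and verifies that its transform is non-zero only at every (N/p)-th sample and repeats with period N/s, leaving p/s independent values:

```python
import numpy as np

N, s, p = 64, 4, 16            # total samples, sample spacing, period
rng = np.random.default_rng(1)

# Build f: periodic with period p, non-zero only at multiples of s.
base = np.zeros(p, dtype=complex)
base[::s] = rng.standard_normal(p // s) + 1j * rng.standard_normal(p // s)
f = np.tile(base, N // p)

F = np.fft.fft(f)

# The transform is non-zero only at multiples of N // p ...
support = np.zeros(N, dtype=bool)
support[:: N // p] = True
assert np.allclose(F[~support], 0)

# ... and repeats with a period of N // s.
assert np.allclose(F, np.roll(F, N // s))

# Either side is therefore fully described by p // s independent samples.
```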
2.2 Problem statement
The distributed Fourier transform problem is now given by mask functions Ai and Bj, which encode how we wish to partition Υ in spatial-frequency and image space:
(8)
with uA < uΥ and lB < lΥ such that we can represent subgrids AiΥ using 4uAlΥ samples (subgrid size) and facets Bj * Υ using 4uΥlB samples (facet size). We expect Ai to be boxcar filters in frequency space and Bj boxcar filters in image space (blue graphs in Fig. 2), and can therefore recover them exactly from their samples with
(9)
by Whittaker–Shannon interpolation. As illustrated in Fig. 1, now the goal is to reconstruct Bj * Υ given all AiΥ. On paper this is straightforward due to linearity:
(10)
So the contribution of subgrid i to facet j is simply Bj * AiΥ. For an efficient distributed algorithm, this contribution would have to be computed and exchanged between nodes. Unfortunately, naïve attempts to represent Bj * AiΥ using 4uAlB samples would fail, as we cannot apply Whittaker–Shannon simultaneously in image and spatial-frequency space:
(11)
The reason is that Bj never approaches zero in frequency space (see Fig. 2, bottom left plot), and therefore neither does Bj * AiΥ. Hence, tiling it with a period of 2uA effectively loses information.
Fig. 2 Key functions and relationships at a glance.
2.3 Approximation
The key idea is to use a different representation. We suppose a function nj that falls close to zero in both frequency and image space (with associated limits un and ln respectively). Then the same should be true for nj * AiΥ, so we should be able to say
nj * AiΥ ≈ mi (nj * AiΥ), (12)
where mi is a boxcar filter with um > uA + un (Fig. 2, left plots). If we furthermore find an ‘inverse’ function bj that gets us back to Bj = bj * nj (which will require ln ≥ lB), it follows that
Bj * AiΥ = bj * nj * AiΥ ≈ bj * (mi (nj * AiΥ)). (13)
This approach mirrors visibility gridding. If Ai were delta functions identifying positions of visibilities (oversampled at a rate of 2lΥ), then nj would be the gridding kernel, mi the grid environment around each visibility, and bj the gridding correction. What we are doing is simply ‘bulk re-gridding’ to coarser facet grids.
This suggests that window functions like prolate spheroidal wave functions (PSWFs; see Fig. 2) are a good choice for nj. Unfortunately, no window function is perfectly limited in both image and frequency space (uncertainty principle), so there will always be errors. We can quantify them by subtracting the two sides of the approximation above for a subgrid i:
Bj * AiΥ − bj * (mi (nj * AiΥ)) = bj * ((1 − mi)(nj * AiΥ)). (14)
Figure 3 shows what these errors look like in image space for a centred facet. Before convolution with bj (dark blue line) the absolute error is mostly constant with small peaks at ±ln. As the light blue line shows, this only becomes slightly worse for a worst-case Υ (single source at −ln). On the other hand, convolution with bj introduces a distinct U-shaped error pattern. The reason is that for nj to be approximately limited in spatial-frequency space, it needs to be smooth in image space, and as we require ℱnj to fall to zero, the inverse |ℱnj|−1 tends to infinity at ±ln (see also Fig. 2). This is why we need to somewhat over-dimension nj in image-space (ln > lB) to keep error magnification in check.
Fig. 3 Image space error (light colours: worst case Υ).
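The approximation of Sects. 2.2–2.3 can be checked numerically in one dimension. The sketch below is illustrative rather than the paper's prototype: it uses scipy's discrete prolate spheroidal (dpss) window as a stand-in for the PSWF, keeps all data at full resolution (so it tests only the approximation, not the data-size reduction), and all sizes, margins, and the error bound are arbitrary choices of ours:

```python
import numpy as np
from scipy.signal.windows import dpss

rng = np.random.default_rng(0)
N = 1024                          # full 1D grid / image size

# Grid data Υ in spatial-frequency space; image space = FFT of it.
grid = rng.standard_normal(N) + 1j * rng.standard_normal(N)
img = np.fft.fft(grid)

# Window n: its image-space transform F(n) is a dpss bell on 256 samples.
L, lo = 256, (N - 256) // 2
Fn = np.zeros(N)
Fn[lo:lo + L] = dpss(L, 8)        # NW = 8 -> frequency spread ~ NW*N/L = 32

# Facet mask B: boxcar strictly inside the window support (l_B < l_n).
B = np.zeros(N)
B[lo + 40:lo + L - 40] = 1.0

# "Inverse" window b with F(b) = B / F(n).
Fb = np.where(B > 0, B / np.where(Fn > 0, Fn, 1.0), 0.0)

# Subgrid masks A_i (boxcars partitioning u-space), padded masks m_i.
idx = np.arange(N)
acc_img = np.zeros(N, dtype=complex)
for i in range(4):
    A = ((idx // 256) == i).astype(float)
    m = np.zeros(N)
    m[np.arange(i * 256 - 64, (i + 1) * 256 + 64) % N] = 1.0  # u_m > u_A + u_n

    t = A * grid                          # A_i Υ
    t_img = Fn * np.fft.fft(t)            # n * (A_i Υ), image representation
    t = m * np.fft.ifft(t_img)            # m_i (n * (A_i Υ))
    acc_img += Fb * np.fft.fft(t)         # b * (...), accumulated per facet

# Reconstruction should match the facet B * Υ up to the window leakage.
rel_err = np.max(np.abs(acc_img - B * img)) / np.max(np.abs(img))
assert rel_err < 1e-2
```

Shrinking the m_i margin or pushing the facet boundary towards the window edge makes rel_err blow up, mirroring the error magnification discussed above.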
2.4 Method
At this point, we know that nj * AiΥ is approximately limited to a um region in frequency space, so assuming compatible sampling:
(15)
(16)
by associativity of multiplication. Therefore nj * AiΥ can be represented approximately using 4umln samples (boxed expression), solving the core challenge from Sect. 2.2.
If we define n°j as a mask representing the image-space extent of nj (non-zero exactly where ℱnj(l) ≠ 0), we can also show how to compute this contribution efficiently,
(17)
using the comb sampling identities from Sect. 2.1, which hold where the functions involved are zero in image space outside a 2ln region. This means that if we have padded subgrid data (4umlΥ samples; um > uA) in image space, then we just need to select the 4umln samples where we know ℱnj(l) ≠ 0, multiply by ℱnj (sampled at the same points), and we have calculated our subgrid contribution representation.
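The sample-selection step rests on a standard DFT identity: keeping every k-th sample of a transform is the same as transforming the signal folded (periodically summed) down to length N/k. A numpy sketch (sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
N, k = 1024, 4                  # full size and subsampling factor
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)

# Keeping every k-th frequency sample ...
coarse = np.fft.fft(x)[::k]

# ... equals the FFT of the signal summed over its k folds of length N/k.
folded = x.reshape(k, N // k).sum(axis=0)
assert np.allclose(coarse, np.fft.fft(folded))
```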
Applying the contribution is equally straightforward:
(18)
(19)
substituting in the term from Eq. (16), then again using the comb sampling identities from Sect. 2.1, which hold where the function involved is zero in image space outside a 2ln region. This means that we need to pad in frequency space from 4umln to 4umlΥ samples, then multiply the sum in image space by an equivalently sampled bj.
Fig. 4 2D algorithm sketch of both transformation directions, highlighting the symmetry. This also illustrates buffer sizes, showing how padding facet data to 4uΥln tends to dominate memory consumption along both the horizontal (u) and vertical (v) axes. This is why it is important to share intermediate buffers for subgrids from the same ‘column’.
2.5 Shifts
For simplicity let us assume that all mi and nj are the same functions, just shifted by subgrid or facet offsets as follows:
(20)
(21)
Using the Fourier shift theorem (and assuming the offsets are compatible with the sampling grids), we can derive
(22)
(23)
(24)
This tells us that we only need to have one sampled representation of m and n, because the function offsets applied to the data simply correspond to index shifts for image or frequency space samples, as long as the offsets are multiples of the respective sample steps. We revisit these side conditions in Sect. 4.1.
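The correspondence between function offsets and index shifts is the discrete Fourier shift theorem, easily checked with numpy (sizes arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
N, s = 256, 16                 # grid size and shift in whole samples
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)
X = np.fft.fft(x)

# Shifting the sampled function is an index roll ...
shifted = np.roll(x, s)

# ... equivalent to a phase ramp on the other side's samples.
ramp = np.exp(-2j * np.pi * np.arange(N) * s / N)
assert np.allclose(np.fft.fft(shifted), X * ramp)
```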
2.6 Two dimensions
We can easily generalise our reasoning to images simply by redefining Υ, A, B, m, b, and n to be two-dimensional functions. We additionally assume mi and bj to be separable,
(25)
such that m^(u) and m^(v) are constant along the v- and u-axes, respectively, and b^(u)(u, v) = 0 for v ≠ 0 and b^(v)(u, v) = 0 for u ≠ 0. Then we can re-order as follows:
(26)
Performing the v-axis convolution straight after the corresponding v-axis multiplication is a small optimisation: an implementation can discard the 4uΥ(ln − lB) rows where the v-axis factor is zero by doing the Fourier transform ℱ^(v) along the v-axis right away. The sum can be pulled further inwards if we assume that some subgrids are in the same column (i.e. share a certain u-offset value). Then, with M^(u) the set of all such column offsets:
(27)
As illustrated in Fig. 4a, this means we can accumulate contributions from entire columns of subgrids, reducing the number of Fourier transforms needed.
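The column optimisation rests on two elementary facts: the 2D FFT factors into one FFT per axis, and FFTs are linear, so contributions sharing a column can be summed before the remaining axis is transformed. A numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
a = rng.standard_normal((64, 64))
b = rng.standard_normal((64, 64))

# The 2D transform factors into one FFT per axis ...
two_step = np.fft.fft(np.fft.fft(a, axis=1), axis=0)
assert np.allclose(two_step, np.fft.fft2(a))

# ... so contributions sharing a column can be accumulated first,
# and transformed along the remaining axis only once.
assert np.allclose(np.fft.fft2(a) + np.fft.fft2(b),
                   np.fft.fft(np.fft.fft(a, axis=1) + np.fft.fft(b, axis=1),
                              axis=0))
```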
2.7 Dual variant
As degridding relates to gridding, there is a similar dual algorithm variant for going from facets to subgrids. Analogously to the steps taken in Sect. 2.2, we start with the observation that
(28)
and we then decompose Bj = nj * bj and Ai = Aimi as done in Sect. 2.3 to approximate the sum term
(29)
with the error term Ai(nj * (1 − mi)(bj * Υ)), which behaves equivalently to the errors discussed in Sect. 2.3. This also works in 2D:
(30)
The optimisation from the previous section also has an equivalent here:
(31)
So we only need to calculate the facet-dependent factor once for every combination of facet and subgrid column, as shown in Fig. 4b.
3 Handling widefield effects
A general-purpose discrete Fourier transform algorithm cannot quite cover the full complexity of the measurement equation from Sect. 2, as wide fields of view require correcting for telescope non-coplanarity (w ≠ 0) and sky curvature (n ≠ 1). As noted by Cornwell et al. (2008), we can represent this using a convolution with a w-specific Fresnel diffraction pattern:
ℱgw(l, m) = e^−2πiw(√(1 − l² − m²) − 1) (32)
This convolution effectively introduces a third dimension into the problem, but only on the spatial frequency side. To represent this, we modify our subgrid definition such that each subgrid i has an arbitrary w-offset associated with it:
(33)
In this section, we consider the problem variant of generating such w-shifted subgrids from facets, or vice versa. We note that even with w-offsets, satisfying the partition property from Sect. 2.2 is still possible by weighting Ai. This is well understood in radio astronomy, but is a distraction for the purposes of this paper, so we stick to the dual algorithm (facet to subgrid; see Sect. 2.7) for this section.
3.1 Implementation options
If we simply replace Υ by gwi * Υ in our subgrid formula from Sect. 2.7, we arrive at
(34)
So the obvious way to obtain w-shifted subgrids with our algorithm is to multiply the facets by ℱgwi in image space, which is known as w-stacking (e.g. Offringa et al. 2014). This is simple and precise no matter the field of view or the value of wi.
On the other hand, this forces us to repeat the entire data re-distribution for every needed value of wi. Sharing intermediate subgrid column buffers, as shown in Fig. 4, would now especially require both the column offsets and the w-offsets to match. We can still minimise the amount of subgrid data to reproduce for any given w level by skipping generation of unneeded subgrids, yet this is still not particularly efficient.
Fortunately, we have additional options. After all, we can treat gwi * Bj like a modified facet mask, which means that we can decompose it, as in Sect. 2.3, into bj * (gwi * nj) using a modified window function gwi * nj. This yields
(35)
This is attractive because mi(bj * Υ) is the distributed contribution term, so this introduces w-shifts after data re-distribution and all image-size FFTs. However, this can only work as long as gwi * nj still acts like a window function for our purposes:
(36)
So for this approach to hold up, we must now identify masks mi compatible not only with Ai and all nj, but also with all required w-offsets.
Fig. 5 Window interaction with non-coplanarity term.
Fig. 6 Non-coplanarity margin relative to highest chirp frequency for lmax = 0.2 (solid: facet size ln = 0.1, dotted: ln = 0.2).
3.2 Margins
Fortunately, it is not unreasonable to expect nj * gw to be limited in frequency space. Using the definition of gw from Sect. 3, we can determine the local chirp frequency by differentiating the exponent:
∂/∂l [w(√(1 − l² − m²) − 1)] = −wl (1 − l² − m²)^−1/2 (37)
Now we can use the fact that nj is an image-space filter. If lmax is the maximum distance from the phase centre in the facet nj, then the effective size of gw for the purpose of nj * gw should be about wlmax(1 − lmax²)^−1/2 in frequency space. As Fig. 5 shows, gw * n° (with n° again the mask version of n) indeed falls off accordingly, though this is hard to spot due to n° causing sinc-like ripples (green line). Fortunately, convolution with n suppresses these by design, albeit adding signal spread of its own (red line).
As a result, if we define un*gw as the point where gw * n reaches the base error level associated with n, we would generally expect un*gw ≈ un + wlmax(1 − lmax²)^−1/2. As Fig. 6 shows, things are not quite that simple: the margin first grows a bit more than expected, then eventually even falls below un for some parameters. These interactions are due to nj not just limiting, but also weighting different frequencies of the chirp differently. In practice, this means that we have to numerically or experimentally determine the maximum acceptable w from uA and um.
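This behaviour can be probed numerically. The following sketch is illustrative: it assumes the usual 1D form ℱgw(l) = e^−2πiw(√(1−l²)−1), uses a dpss window in place of the PSWF, and the sizes, w value, and detection threshold are arbitrary choices of ours:

```python
import numpy as np
from scipy.signal.windows import dpss

N = 4096
l = (np.arange(N) - N // 2) / N          # image coordinate, dl = 1/N
w, lmax = 2000.0, 0.125                  # w-offset and window half-width

# Window F(n): dpss bell covering |l| <= lmax, zero elsewhere.
Fn = np.zeros(N)
half = int(lmax * N)
Fn[N // 2 - half:N // 2 + half] = dpss(2 * half, 16)

# Windowed Fresnel chirp and its normalised spectrum.
gw = np.exp(-2j * np.pi * w * (np.sqrt(1 - l ** 2) - 1))
spec = np.abs(np.fft.fft(Fn * gw))
spec /= spec.max()

# Measured half-width of the significant frequency support.
u = np.abs(np.fft.fftfreq(N, d=1 / N))   # cycles per unit l
measured = u[spec > 1e-6].max()

# Simple estimate: chirp frequency plus window bandwidth.
expected = w * lmax / np.sqrt(1 - lmax ** 2) + 16 / (2 * lmax)

# The window tapers the chirp edges down, so the measured support
# typically stays below the simple sum (compare the caveats of Fig. 6).
assert 0.3 * expected < measured < 1.2 * expected
```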
Fig. 7 Generating a subgrid w-stack (w-tower).
3.3 w-towers
The w-distance limit is not a problem in practice, as due to ℱgw1+w2 = ℱgw1 ℱgw2 we can use w-stacking to reach a sufficiently close w-stacking plane first:
(38)
This combination is much more efficient than pure w-stacking, as we now only have to repeat the data re-distribution for every unique w-stacking plane. Furthermore, in summing up the facet contributions for some w-stacking plane, we naturally end up in image space (see the last step in Fig. 4 and the right side of Fig. 7). Therefore, the image-space multiplication with the plane's Fresnel pattern is basically free. In the special case that we are interested in subgrids that are equidistant in w (and have matching mi), we can iteratively build a ‘w-tower’ of subgrids using only one constant ℱgΔw, as illustrated in Fig. 7. We still need one Fourier transform for every ‘storey’ of the tower; however, these are quite a bit smaller than the image size, and can thus be computed relatively efficiently.
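That a single constant factor suffices to climb a w-tower follows from the Fresnel patterns composing multiplicatively in image space. A numpy sketch, assuming the usual form ℱgw(l) = e^−2πiw(√(1−l²)−1) (all values arbitrary):

```python
import numpy as np

l = np.linspace(-0.2, 0.2, 501)
Fg = lambda w: np.exp(-2j * np.pi * w * (np.sqrt(1 - l ** 2) - 1))

dw = 8.0
# Climbing three storeys with the same constant factor F(g_dw) ...
tower = Fg(100.0) * Fg(dw) * Fg(dw) * Fg(dw)

# ... lands exactly on the pattern for the summed w-offset.
assert np.allclose(tower, Fg(100.0 + 3 * dw))
```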
For subgrids at the maximum height of a tower, only the central Ai region (of maximum size uA) will be accurate. This is subject to the margin condition established in the last section, effectively sacrificing subgrid area to deal with non-coplanarity. Thus, covering larger w-ranges either means reducing the usable uv-area per subgrid uA or introducing more w-stacking planes, and therefore full data re-distributions. We show in Sect. 4.2 how we can handle this particular trade-off.
3.4 Optimisation of w-distribution
The number of subgrids or w-tower storeys to generate is still a major cost factor, so optimising the visibility uvw distribution for a minimum w-range is highly desirable. For instance, we can use a shear transformation as proposed in Cornwell et al. (2012) to transform visibility (u, v, w) coordinates to (u, v, w′) as follows:
(39)
(40)
(41)
This allows us to choose hu and hv to minimise |w′|, while solving the same problem due to ul + vm + wn = ul′ + vm′ + w′n. The dependence on n is unchanged, and therefore the definition of gw from Sect. 3 remains unchanged as well, except that we now sample in l′ and m′ for the purpose of w-stacking or w-towers. Using computer algebra we can solve for the original l and m:
(42)
(43)
To first order l ≈ l′ and m ≈ m′, which means that this transformation will change neither lmax nor the required subgrid margin from Sect. 3.2 significantly until quite close to the horizon, so there is little efficiency loss associated with this optimisation. This is in contrast to the variant described in Offringa et al. (2014), where flat offsets in l and m directly increase both lmax and the required margin, reducing the maximum height of w-towers and therefore efficiency much more rapidly.
Fig. 8 Subgrid and facet spacings for convolutional gridding.
3.5 Convolutional (de)gridding
Using w-stacking and w-towers we can efficiently generate visibility values for regularly spaced uvw subgrids, as shown in Fig. 8a. This is exactly what we need for gridding algorithms like w-projection (Cornwell et al. 2008), image domain gridding (van der Tol et al. 2018), or uvw-(de-)gridding (Ye et al. 2022). These algorithms allow us to work with irregularly positioned visibilities as long as we can provide sufficient samples of certain 2ur × 2ur × 2wr environments around their locations (e.g. dot and box in Fig. 8a).
Gridding typically uses spatial-frequency space convolutions with a kernel, which means that, just as when we split Bj = bj * nj in Sect. 2.3, the kernel must be cancelled out by multiplication with a gridding correction function in image space. Fortunately, our algorithm works well with image-space factors. The natural approach is to combine the correction with bj to apply it to facets (like w-stacking), but theoretically we could also merge it with nj to apply it per subgrid (like w-towers). This means that we can introduce the gridding correction function into our framework efficiently, possibly even supporting different gridding algorithms in different grid regions.
In practice, gridding efficiency depends greatly on how finely the grid is sampled: the more we oversample the spatial-frequency grid by padding the image or introducing finer w-tower storeys, the smaller the 2ur × 2ur × 2wr region we need to sample (Tan 1986; Ye et al. 2020; also compare Fig. 9). Our algorithm allows us to adjust both parameters easily. Due to linearity we can simply ignore irrelevant facets, and therefore we can freely increase the image size and therefore spatial-frequency sampling rate lΥ, while only covering a given lFoV area with facets (see Fig. 8b). Extra w-tower storeys are also easily added. On the other hand, either measure will increase computation, leading to a trade-off against gridding complexity.
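Increasing the image size to oversample the spatial-frequency grid uses the standard zero-padding property of the DFT: padding one space samples the other more finely while retaining the original samples. A numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(5)
N = 256
x = rng.standard_normal(N)

# Doubling the transform size by zero-padding oversamples the spectrum 2x;
# the original spectral samples survive as every second output.
fine = np.fft.fft(x, n=2 * N)
assert np.allclose(fine[::2], np.fft.fft(x))
```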
Fig. 9 Amount of facet space useable per desired base error for prolate spheroidal wave function (PSWF).
4 Algorithm
We assemble the complete algorithms, both for going from facets to w-towers of subgrids (Algorithm 1) and its dual for going from subgrids or w-towers back to facets (Algorithm 2). Separable shifts δ^(u), δ^(v), and δ^(w) work analogously to what was shown in Sects. 2.5 and 2.6. Apart from adding w-towers, this closely follows Fig. 4.
Both directions are linear, meaning that we can skip sub-grids or w-towers without visibility coverage, just as we can skip facets outside the field of view, as shown in Fig. 8b. Furthermore, all ‘foreach’ loops can be executed in parallel. This especially applies to w-towers, as we can parallelise per w-tower, w-tower storey, and/or visibility chunk. Even the outer w-stacking loop can be parallelised to some degree, memory availability permitting.
4.1 Dimensions and shifts
Once the structure of the algorithm has been established, actually making it work depends entirely on suitable parameter choices. Fast Fourier transforms form the backbone of the algorithm, so we start by looking at their sizes. In Algorithms 1 and 2, we perform Fourier transforms at the padded subgrid size 4umlΥ (lines 9 and 18), the padded facet size 4uΥln (lines 23 and 25, and 3 and 5), and finally 4umln (lines 21 and 23, and 5 and 7), which derives from the previous values. From Sect. 2.4 we know two more things: firstly, the total image size 4uΥlΥ must be divisible by both the padded subgrid size 4umlΥ and the padded facet size 4uΥln; secondly, as umln = (umlΥ)(uΥln)(uΥlΥ)−1, the product of padded subgrid and facet sizes must be divisible by the image size. This clearly favours simple factorisations, for example powers of two, which are also optimal for fast Fourier transforms.
As established in Sect. 2.5, these factorisations also restrict the permissible subgrid and facet positions. We assume certain base shift steps Δu and Δl, such that all subgrid offsets and all facet offsets are multiples of them, respectively. We can assume ΔuΔl = 1 without loss of generality, and because 4uΥlΥ = (2ΔulΥ)(2uΥΔl) this means that the product of the base subgrid and facet shifts in pixels must equal the image size.
To ensure that 2umΔl, 2Δuln ∈ ℤ we can derive the following,
(44)
so additionally base subgrid and facet shifts must evenly divide the padded subgrid and padded facet sizes, respectively.
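The size conditions of this section can be bundled into a quick validity check. The function below is an illustrative sketch (the name and example sizes are ours; the powers-of-two sizes echo Fig. 10):

```python
def check_sizes(image_size, padded_facet, padded_subgrid):
    """Check the Sect. 4.1 divisibility conditions on sample counts
    (image_size = 4uYlY, padded_facet = 4uYln, padded_subgrid = 4umlY)."""
    ok = (image_size % padded_facet == 0 and
          image_size % padded_subgrid == 0 and
          (padded_facet * padded_subgrid) % image_size == 0)
    # The contribution size 4umln derives from the other sizes.
    contribution = padded_facet * padded_subgrid // image_size
    return ok, contribution

assert check_sizes(8192, 2048, 1024) == (True, 256)   # powers of two work
assert check_sizes(8192, 2048, 768) == (False, 192)   # 8192 % 768 != 0
```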
Fig. 10 Parameter efficiencies depending on PSWF and precision for 4uΥlΥ = 8192, 4uΥln = 2048, 4umlΥ = 1024, 4uΥlfov = 6656 (top: efficiency; bottom: w-tower efficiency; dotted: upper efficiency bound; marked: best found parameter combination).
4.2 Errors and margins
In addition to data sizes, the other fundamental parameter is what level of approximation we are willing to accept. As established in Sect. 2.3, this is mostly determined by two factors. The first is the base error level, which effectively derives from the PSWF parameter W = 4unln. From this we can especially derive 4unlΥ, the minimum subgrid margin size. The second factor is the base error magnification due to b, which depends directly on the facet margin ln − lB (i.e. how much facet space we are willing to sacrifice).
We know that the worst-case error magnification happens at the facet borders, and therefore we just need to determine how large we can make lB until the base error multiplied by |ℱn(lB)|−1 becomes larger than the acceptable error level (compare Fig. 2). This means that for any target error, we have a range of PSWF parameters W to choose from, as illustrated in Fig. 9.
Facet–subgrid contributions have size 4umln, yet effectively sample only 4uAlB points of image–frequency space. Therefore, we can calculate an upper bound on communication efficiency:
efficiency ≤ (4uAlB) / (4umln) = uAlB / (umln) (45)
This already shows a fundamental trade-off between facet and subgrid margins, as shown in Fig. 10 (top, dotted). For larger W we have to give up subgrid area, whereas for small W we lose facet area to compensate for the worse base error.
For a measure of w-tower efficiency, we observe that per Sect. 3.2 there is, to first order, a linear dependency between the maximum tower size |w| and the effective margin uw reserved for the w-term. Therefore, we can use an l′max-normalised uvw volume as an efficiency measure, subject to uA ≤ um − un − uw:
(46)
Optimally equality holds, which again yields an upper bound, as shown in Fig. 10 (bottom, dotted). As should be expected, larger W are somewhat less efficient here, as the PSWF and gw compete for the limited subgrid space.
4.3 Parameter search
Unfortunately, the side conditions from Sec. 4.1 mean that at best we can get close to those limits, especially when we try to realise facet and subgrid partitioning as shown in Fig. 8a: To account for subgrid or facet overlaps due to offset restrictions (see Sec. 2.5) for the purpose of efficiency calculation, 2uA and 2lB should be multiples of Δu and Δl, respectively. As we require the same of 2um and 2ln, respectively, this extends to the subgrid and facet margins um − uA and ln − lB.
We would also like the facets to cover the field of view efficiently. If we need k × k facets to cover the field of view, then unless klB closely matches lfov, on average every facet only partially contains useable data. Even numbers of facets are a special case, as with a centred field of view this requires placing facets at an offset of lB. Therefore, in this case lB must be a multiple of Δl as well.
In practice, finding good parameters boils down to trying all combinations of W = 4unln, Δu and Δl, uA, and lB exhaustively. After that, the smallest possible ln and subsequently all remaining parameters and efficiency measures can be determined. If we plot the best solution of an example parameter search for every value of W (Fig. 10, top solid), we see that the achievable efficiency can be quite unpredictable. For w-towers we gain another degree of freedom in the w-term margin, which makes efficiency better behaved (Fig. 10, bottom solid). This is mainly because the requirement that (um − uA) must be a multiple of Δu is much less restrictive when we can still derive efficiency from extra subgrid space by making w-towers taller.
Table 1 shows a number of algorithm parameter sets for a fixed image size 4uΥlΥ = 8192, optimised for w-tower efficiency at a 10−5 target error. Configurations with matching contribution sizes 4umln generally have the same properties, as they arrive at the same balance between base error (i.e. PSWF parameter W = 4unln) and error magnification (i.e. facet margin ln − lB) independently of the concrete values of ln or um (compare Sect. 4.2). This works as long as we can find equivalent facet splits (and base offsets) for them. For instance, in Table 1 the configuration 4uΥln = 1280, 4umlΥ = 1024 is in the same family as 4uΥln = 5120, 4umlΥ = 256, but as a facet split of k = 1.5 would be meaningless, a different PSWF parameter W must be used. Following the same logic we can also scale the image size lΥ or resolution uΥ, so the configuration from Table 1 with facet size 4uΥln = 1024 could be scaled into a configuration with 4uΥlΥ = 524288 (4.4 TB of data) and 4uΥln = 65 536 (69 GB of data), which is where a k × k = 64-way facet split would start to seem more appropriate.
5 Results
We investigate scaling properties by posing an SKA-sized ‘forward’ imaging problem: predicting visibilities for the SKA1-Mid layout with 197 dishes, assuming a snapshot duration of 29 minutes, a dump time of 0.142 s (12 288 dumps), and a frequency window of 350–472.5 MHz split into 11 264 channels, resulting in 2.6 × 1012 visibilities (or 42.3 TB of data at double precision). The uv-coverage for such an observation is shown in Fig. 13. The phase centre was chosen at a declination of 0° with hour angles centred around transit (i.e. 0 hour angle). This means around 60 degrees elevation for SKA1-Mid, which we can compensate for with a shear factor hv = 0.5936 (see Sect. 3.4). For the degridding we used an 8 × 8 × 4 kernel following Sze Meng Tan (Ye et al. 2020), optimised for x0 = 0.40625 (i.e. 1 − 2 × 0.40625 = 18.75% image margin) and x0 = 0.1875, respectively, resulting in ~10−5 expected degridding accuracy. The test image contained ten on-grid sources with the same intensity, all placed randomly on facet borders to maximise errors, as explained in Sect. 2.3. To verify correctness, we randomly sampled about 6 × 108 visibilities per run and compared them against a direct evaluation of the measurement equation.
To make configurations comparable, all tests were conducted with scaled variants of the first configuration from Table 1. We specifically held the facet count and subgrid size constant, but scaled both the image and facet size as well as the subgrid count. We configured our implementation (Wortmann 2019, 2023) so it distributed work across 16 nodes and 49 processes (1 scheduler, 4 × 4 = 16 facet workers, and 32 w-tower/subgrid workers). The tests were run on 16 nodes with dual Intel® Xeon® Platinum 8368Q processors (2 × 38 cores) and 256 GiB RAM each.
Table 1. Sample of possible algorithm parameters for a fixed image size (columns W = 4unln, Δu & Δl, uA, and lB) and the corresponding efficiencies (last two columns).
5.1 Scaling resolution
The image size in pixels scales as the product of the field of view (and therefore lΥ) and the resolution uΥ. For this section we consider the effects of scaling uΥ while holding the field of view at 2lfov = 0.322 (~18.5 degrees; see middle column of Table 2). This corresponds to roughly the third null of SKA1-Mid’s beam at the finest tested image resolution of 4uΥlΥ = 196608. For smaller grid sizes, this results in coarser image resolution and truncation of the uv-coverage (Fig. 13). This does not decrease degridding work substantially, due to the high number of short baselines (Fig. 11, solid red line).
On the other hand, higher image resolution increases the cost of facet Fourier transform work (i.e. lines 3, 5, and 7 in Algorithm 2) roughly as uΥ² log uΥ per w-stacking plane. For a baseline distribution where the maximum |w| grows linearly with uv distance (i.e. as long as the Earth’s rotation dominates the Earth’s curvature), the number of w-stacking planes required to cover the truncated uv-distribution will also grow linearly with uΥ. This means the expected worst-case scaling of facet work is cubic in uΥ (up to logarithmic factors), and as Fig. 11 shows (solid orange line) we indeed find scaling slightly better than cubic.
For w-towers, changing image resolution increases the number of subgrids required to cover all of the baselines. Therefore, we would expect subgrid FFT work (i.e. line 18 in Algorithm 2) as well as the data transfer amount to grow with the covered uvw volume, which increases cubically in uΥ. Compared against Fig. 11 (solid green and blue lines), the scaling is significantly more efficient. This is clearly because, as first discussed in Sect. 2.6, we can skip more uvw volume the longer the baselines get, especially as we approach the rare longest baselines. By varying subgrid sizes and margins we can trade w-stacking and w-tower work, so there is reason to suspect that we will be able to adapt to bigger telescopes with significantly better-than-cubic net growth in complexity.
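The sparsity argument can be illustrated with a toy experiment (our own construction; the sampling distribution below is made up, loosely mimicking a dense core with rare long baselines): for a fixed set of uvw samples, the number of *occupied* cells grows far more slowly than the total cell count as the grid is refined.

```python
# Toy illustration of why subgrid work can grow slower than the covered
# uvw volume: occupied cells are bounded by the sample count, while the
# total cell count grows cubically with resolution.
import numpy as np

rng = np.random.default_rng(42)
# Dense core plus sparse long "baselines" (illustrative distribution only).
uvw = np.concatenate([rng.normal(0.0, 0.05, (20000, 3)),
                      rng.normal(0.0, 0.30, (2000, 3))])
uvw = np.clip(uvw, -0.999, 0.999)  # keep all samples inside the unit cube

for cells_per_axis in (8, 16, 32, 64):
    idx = np.floor((uvw + 1.0) / 2.0 * cells_per_axis).astype(int)
    occupied = len(np.unique(idx, axis=0))
    total = cells_per_axis ** 3
    print(f"{cells_per_axis:3d}^3 grid: {occupied:6d} / {total:7d} cells occupied")
```

Each doubling of the resolution multiplies the total cell count by 8, while the occupied count (and hence subgrid work in this toy model) grows by far less.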
Fig. 11 Complexity scaling (dotted lines indicate cubic growth).
5.2 Scaling the field of view
To investigate field of view scaling, let us now hold the grid dimension uΥ constant such that it fits all baselines from Fig. 13 exactly. Changing the image size 4uΥlΥ then changes lΥ, and therefore the effective field of view size 2lfov. In the test scenarios shown in Table 2 we again scale from the maximum case of 16.8 degrees down to 0.7 degrees.
The reasoning about computational complexity is more involved in this case. To determine the number of w-stacking planes, we first note that scaling lΥ means scaling frequency-space resolution; a given subgrid size 4umlΥ and its margins 4(um − uA)lΥ therefore cover less frequency space. This means that for the purpose of the discussion in Sect. 3.2, the permissible w-distance decreases, while the relevant field of view and therefore lmax increases. It follows that the number of required w-stacking planes increases quadratically in lΥ to first order. As each plane is also quadratically larger, the total facet Fourier transform complexity grows about as lΥ⁴. The cost for subgrid Fourier transforms will simply increase with the number of subgrids we need to produce. As the subgrids each cover less spatial-frequency space, we need to increase their number quadratically in lΥ to cover the telescope layout in uv. However, to implement (de-)gridding (e.g. Ye et al. 2022) we also need to decrease the w distance between w-tower planes as the image curvature (∝ lmax²) increases. The result is an overall scaling of about lΥ⁴ again. Interestingly enough, this means that the number of storeys in a w-tower is constant. The tower layout shrinks linearly along the u- and v-axes, and about quadratically along the w-axis.
As Fig. 11 shows, the experiments again demonstrate slightly better than expected scaling behaviour; facet work grows supercubically once we get to larger image sizes, but the subgrid Fourier transform work and data transfer volume stay at or slightly below cubic growth. This is because a larger number of fine-grained subgrids and w-towers allows us to match the baseline distribution more closely. In a way we are observing the fractal dimension of the telescope layout. We note that existing widefield imaging algorithms would show the same lΥ⁴ scaling (e.g. using constant-size facets would see a quadratic growth both in the number of visibilities and in the facet count, for a combined lΥ⁴ growth in phase rotation cost).
Table 2. Test configuration parameters and results (left: scaling baseline length uΥ; right: scaling field of view (FoV) size lΥ).
Fig. 12 Benchmark results. ducc results scaled as if distributed perfectly to 16 nodes.
5.3 Communication
In the largest configuration we are dealing with an image of size 196608², which at double precision translates to about 618.5 GB of data. To ensure the accuracy of the gridder we leave 18.75% margins, which means that the effective field of view corresponds to about 408.3 GB of data. The results in Table 2 show that 997.0 GB of communication were exchanged, a bit more than twice the size of the information-containing part of the image.
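Both quoted data volumes follow from the image side length, assuming complex double-precision pixels (16 bytes) and an 18.75% margin on each image axis:

```python
# Checking the quoted image data volumes, assuming complex double-precision
# pixels (16 bytes) and an 18.75% margin along each axis.

pixels = 196608
image_gb = pixels ** 2 * 16 / 1e9          # full image
useful_fraction = (1 - 0.1875) ** 2        # margins leave 81.25% per axis
effective_gb = image_gb * useful_fraction

print(f"image:     {image_gb:.1f} GB")      # ~618.5 GB
print(f"effective: {effective_gb:.1f} GB")  # ~408.3 GB
```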
This is quite efficient: the configuration has 33 w-stacking planes, with w-towers decomposing each of them into 154 subplanes, so we are effectively sampling a uvw grid of 33 · 154 · 618.5 GB ≈ 3.1 PB. We can also contrast this against the visibility volume of 42.3 TB, which puts the network overhead at about 1.8% relative to the throughput required to load visibilities from distributed storage.
Fig. 13 Visibilities per subgrid position (SKA1-Mid layout, summed over all w-stacking planes and w-tower storeys).
5.4 Performance
Figure 12 shows the run-time performance of the tests in Table 2 and compares them to equivalent benchmark runs of the ducc degridding kernel (Arras et al. 2021). The ducc module uses the same class of degridding algorithm (Ye et al. 2022) and is in regular scientific use as a (de)gridding kernel for the WSClean widefield imager (Offringa et al. 2014). For this test, image data was passed to ducc as one merged effective image, as in Table 2, which was then internally padded according to ducc’s heuristic (target precision ϵ = 10−4). We predicted visibilities as 598 chunks of 32 baselines each (~70 GB). To minimise redundant FFT work due to repeating w-stacking planes, baselines were ordered by median w before chunking. All chunks were predicted sequentially on a single CPU (76 cores), with performance linearly extrapolated to 16 dual-CPU nodes for Fig. 12.
Our implementation significantly outperforms ducc in raw degridding speed (red lines) despite the similar degridding approach. There are a number of possible explanations. Our kernel was hand-optimised for AVX2, uses a pre-tabulated kernel, and also convolves across all w-tower storeys in one go. Meanwhile, ducc has a more generic implementation that generates the convolution kernel on the fly. This calculation even has to be repeated multiple times per visibility, due to the high memory pressure of w-stacking. Our implementation likely further benefits from executing degridding and w-towers FFTs in parallel as these workloads have very different operational intensity, and therefore work well with hyper-threading.
For the traditional w-stacking method used by ducc, FFT work dominates past image sizes of 32768² due to unmitigated complexity growth (see Sect. 5.2, green dashed line). In comparison, the w-towers approach together with the shear transformation from Sect. 3.4 substantially blunts the impact of larger fields of view in terms of the number and size of required FFTs (green dotted line). Time spent on FFTs still eventually overtakes degridding despite the lower operation count (compare Fig. 11), due to lower operational intensity.
Finally, we consider the cost of distribution, represented by the facet FFT costs (orange lines) and scheduling inefficiencies (violet lines). The former are basically insignificant, while the latter eventually start to dominate. This is because, while there is an abundance of parallel tasks, they vary greatly in computational and communication complexity, which makes saturating 4864 threads quite challenging in practice.
Our simple scheduling algorithm approaches this by alternating ‘expensive’ (many visibilities) and ‘cheap’ (few visibilities) w-stacking planes to balance FFT and degridding work as evenly as possible throughout the run. It then dynamically schedules subgrid tasks to subgrid workers from a pre-populated work queue. For the most densely populated subgrids it splits the degridding work further, allowing multiple subgrid workers to work on them in parallel. Queue sizes were optimised for the large-scale case and kept constant for all test runs. This also explains why small configurations sometimes show worse scheduling efficiency: with few subgrids, work is scheduled before imbalances can become apparent, so some workers end up running idle towards the end of the run. It is clear that effective work scheduling is one of the main remaining challenges for this algorithm.
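The plane-alternation heuristic can be sketched in a few lines (our own minimal reconstruction; the function name and example costs are illustrative, not from the prototype code):

```python
# Minimal sketch of the plane-ordering heuristic described above: alternate
# 'expensive' (many visibilities) and 'cheap' (few visibilities) w-stacking
# planes so that degridding load stays roughly even over the run.

def alternate_by_cost(costs):
    """Order plane indices by alternating most and least expensive."""
    order = sorted(range(len(costs)), key=lambda i: costs[i], reverse=True)
    result = []
    lo, hi = 0, len(order) - 1
    while lo <= hi:
        result.append(order[lo])      # most expensive remaining plane
        if lo != hi:
            result.append(order[hi])  # least expensive remaining plane
        lo, hi = lo + 1, hi - 1
    return result

# Example: visibility counts per w-stacking plane (made up).
costs = [100, 10, 80, 20, 60, 40]
print(alternate_by_cost(costs))  # → [0, 1, 2, 3, 4, 5]: pairs 100 with 10,
                                 #   80 with 20, and 60 with 40
```

In the prototype, the resulting order would feed the pre-populated work queue from which subgrid tasks are dynamically dispatched.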
6 Conclusion
We have presented a scalable way to distribute interferometric imaging calculations by using spatial distribution and window functions. This results in a streaming widefield Fourier transform algorithm that enables interferometry imaging to be parallelised to thousands of distributed threads, while taking full advantage of both image and uvw sparseness to reduce transfer and compute costs.
6.1 Future improvements
There are still a number of improvements and generalisations that can be made to the algorithm. Centring contribution terms as in Sect. 2.5 is not actually necessary, and removing the side condition allows more flexible subgrid and facet placement. Additional efficiency could be gained by combining different parameters within the same imaging task (e.g. adjusting subgrid size per w-level), or even introducing additional layers between w-stacking and w-towers on either side of the distribution pattern (for dealing with very large images). There is good reason to believe that such techniques could improve performance further.
6.2 Implementation
Despite our efforts, the MPI implementation (Wortmann 2023) is still severely limited by its primitive work scheduling. A more sophisticated work-stealing mechanism will likely be required as data flow predictability will only get worse from here. For instance, (de-)gridding from and to subgrids could be tackled by accelerators, and storage might struggle to keep pace with visibility data flows: processing ~40 TB in under 4 min means 10 GB/s/node, with non-obvious access patterns.
Additionally, to integrate calibration as well as corrections for various instrumental and environmental effects, our algorithm will have to work in concert with other state-of-the-art radio astronomy algorithms. These methods come with significant complexity of their own, which makes integration even trickier. We are planning to investigate use of a specialised execution framework to take over the data transfer and task scheduling functions of the prototype code.
6.3 Imaging and calibration
Transformations between the sky and aperture planes are common and well-defined operations in radio-astronomy pipelines, which means that the presented algorithm can help distribute a variety of functions relevant to radio astronomy. In Sect. 3.5, we discuss how it can work with a variety of gridding approaches; extensions to cover polarisation, Taylor terms, or point-spread function generation are straightforward. Furthermore, as every subgrid gets represented in image space along the way, we can introduce slow-changing image-space multipliers (e.g. corrections for primary beam effects) very cheaply and at good resolution. This might complement approaches that deal with fast-changing direction-dependent effects such as IDG (van der Tol et al. 2018).
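The cheapness of such image-space multipliers comes down to a single elementwise multiply while the subgrid passes through image space. A minimal sketch, assuming made-up data (the `beam` taper below is a stand-in for a real primary-beam estimate, not from the paper):

```python
# Toy: a subgrid passes through image space on its way through the pipeline,
# where a slow-changing correction can be applied by one elementwise multiply.
import numpy as np

n = 64
rng = np.random.default_rng(2)
subgrid = np.fft.fft2(rng.standard_normal((n, n)))

# Stand-in for a smooth image-space correction (e.g. a primary-beam model);
# the +0.1 keeps it invertible for this illustration.
beam = np.outer(np.hanning(n), np.hanning(n)) + 0.1

img = np.fft.ifft2(subgrid)          # subgrid in image space
corrected = np.fft.fft2(img * beam)  # correction applied, back to uv space
```

Because the multiply is O(n²) per subgrid while the FFTs are needed anyway, the correction is essentially free at subgrid resolution.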
On the other hand, pipelines that require traversing visibilities in a specific ordering may need substantial rework to make use of the presented algorithm because much of its efficiency derives from traversing the grid in a very particular order: iterating over w-stacking planes, then over u-columns, and finally visiting individual subgrids. This is not an issue for a CLEAN major loop iteration that predicts visibilities to be subtracted for an imaging step; the forward and backward directions of the algorithm can be combined so that processing proceeds from a model image to the current residual dirty image, while touching every observed visibility exactly once. Combined with a distributed by-facet image-based ‘minor loop’ deconvolution step, this should yield a scalable CLEAN pipeline.
The presented algorithm is however a worse fit for standard antenna-based calibration approaches as they require combining data from all baselines to reliably solve for antenna-based gain terms. The algorithm is designed such that visibilities from different w-stacking planes would basically never meet in the same memory space. A possible solution would be to reduce the size of the calibration problem representation enough so that we could make progress even with an incomplete view (e.g. Yatawatta 2015). Alternatively, one might attempt to devise a distributed algorithm that re-aggregates optimisation problems, similarly to how this distributed Fourier transform algorithm re-assembles a complete image.
6.4 Outlook
The presented algorithm significantly improves on existing radio-astronomy imaging algorithms by enabling the distribution of both the computation and the working memory load. This enables imaging in less wall-clock time by distribution over relatively small general-purpose (and therefore cost effective) nodes. Reducing wall-clock time to process data sets is a key requirement for telescopes like SKA where storage of visibilities during processing is a major cost driver.
Acknowledgements
Tim Cornwell originally suggested the possibility of such an approach at one of the SKA SDP face-to-face meetings, and provided a lot of helpful discussion along the way. The work by Bas van der Tol and Bram Veenboer on IDG was a major inspiration, with Bram originally suggesting subgrids as a unit of distribution. Steve Gull and Sze Meng Tan helped us understand how to perform accurate wide-field degridding. Thanks to Martin Reinecke for helping us understand ducc’s performance. We also thank the anonymous referee, whose comments helped us improve the presentation substantially. Finally, we would like to thank the SKA SDP Consortium, the SKA Organisation and SKA Observatory for providing us the means to do this work.
Appendix A Reference implementation
The one-dimensional Fourier transform algorithm can be demonstrated quite succinctly, as the following snippet shows. We note that this is just the base algorithm, without non-coplanarity corrections.
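A minimal one-dimensional illustration of the windowing principle behind the algorithm (our own toy sketch, not the paper's reference implementation: it uses a raised-cosine partition of unity instead of the PSWF, and omits the subsampled contribution extraction that makes the real algorithm efficient):

```python
# Toy 1D demonstration: smooth facet windows that sum to unity let frequency
# space be assembled from independently transformed facets, and any subgrid
# is then just a slice of the assembled grid.
import numpy as np

N = 256
rng = np.random.default_rng(1)
image = rng.standard_normal(N)

# Two overlapping facet windows forming a smooth partition of unity.
x = np.arange(N)
ramp = 0.5 * (1 + np.cos(np.pi * np.clip((x - 96) / 64, 0, 1)))
w0, w1 = ramp, 1.0 - ramp
assert np.allclose(w0 + w1, 1.0)

# Frequency space assembled facet-by-facet equals the direct transform.
grid = np.fft.fft(image * w0) + np.fft.fft(image * w1)
assert np.allclose(grid, np.fft.fft(image))

# A 'subgrid' is a slice of the assembled frequency space; each facet's
# contribution to it can be produced and transferred independently.
subgrid = grid[32:64]
print(np.allclose(subgrid, np.fft.fft(image)[32:64]))  # True
```

The real algorithm additionally exploits the windows' frequency-space concentration to extract each facet's contribution at subgrid rather than full-image size, which is where the memory and transfer savings come from.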
References
- Arras, P., Reinecke, M., Westermann, R., & Enßlin, T. A. 2021, A&A, 646, A58
- Bolton, R., et al. 2016, Parametric models of SDP compute requirements, Tech. Rep. SKA-TEL-SDP-0000040, SDP Consortium, dated 2015-03-24
- Cornwell, T., & Perley, R. 1992, A&A, 261, 353
- Cornwell, T. J., Golap, K., & Bhatnagar, S. 2008, IEEE J. Selected Top. Signal Process., 2, 647
- Cornwell, T., Voronkov, M., & Humphreys, B. 2012, in Image Reconstruction from Incomplete Data VII, 8500, International Society for Optics and Photonics, 85000L
- Dewdney, P. E., Hall, P. J., Schilizzi, R. T., & Lazio, T. J. L. W. 2009, Proc. IEEE, 97, 1482
- Frigo, M., & Johnson, S. G. 1998, in Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP’98, Vol. 3, IEEE, 1381
- Offringa, A., McKinley, B., Hurley-Walker, N., et al. 2014, MNRAS, 444, 606
- Tan, S. M. 1986, Ph.D. Thesis, University of Cambridge
- Tasse, C., Hugo, B., Mirmont, M., et al. 2018, A&A, 611, A87
- van der Tol, S., Veenboer, B., & Offringa, A. R. 2018, A&A, 616, A27
- Van Weeren, R., Williams, W., Hardcastle, M., et al. 2016, ApJS, 223, 2
- Wortmann, P. 2019, Distributed Predict I/O Prototype, Tech. Rep. SKA-TEL-SDP-0000203, SDP memo 102, SDP Consortium, revision 1
- Wortmann, P. 2023, SDP Exec - Imaging IO Test, https://gitlab.com/ska-telescope/sdp/ska-sdp-exec-iotest
- Yatawatta, S. 2015, MNRAS, 449, 4506
- Ye, H., Gull, S. F., Tan, S. M., & Nikolic, B. 2020, MNRAS, 491, 1146
- Ye, H., Gull, S. F., Tan, S. M., & Nikolic, B. 2022, MNRAS, 510, 4110
All Figures
Fig. 1 Distribution concept sketch of our algorithm. The visibilities are gridded to subgrids (solid lines), and the contributions are accumulated to facets. The dashed lines indicate the reverse direction: the contributions are extracted from the facets, and the visibilities are degridded from the subgrids.

Fig. 2 Key functions and relationships at a glance.

Fig. 3 Image space error (light colours: worst case Υ).

Fig. 4 2D algorithm sketch of both transformation directions, highlighting the symmetry. This also illustrates buffer sizes, showing how padding facet data to 4uΥln tends to dominate memory consumption both in the horizontal (u) and vertical (υ) axes. This is why it is important to share intermediate buffers for subgrids from the same ‘column’.

Fig. 5 Window interaction with non-coplanarity term.

Fig. 6 Non-coplanarity margin relative to highest chirp frequency for lmax = 0.2 (solid: facet size ln = 0.1, dotted: ln = 0.2).

Fig. 7 Generating a subgrid w-stack (w-tower).

Fig. 8 Subgrid and facet spacings for convolutional gridding.

Fig. 9 Amount of facet space useable per desired base error for prolate spheroidal wave function (PSWF).

Fig. 10 Parameter efficiencies depending on PSWF and precision for 4uΥlΥ = 8192, 4uΥln = 2048, umlΥ = 1024, 4uΥlfov = 6656 (top: efficiency; bottom: w-tower efficiency; dotted: upper efficiency bound; marked: best found parameter combination).