Issue 
A&A
Volume 646, February 2021



Article Number  A58  
Number of page(s)  13  
Section  Numerical methods and codes  
DOI  https://doi.org/10.1051/00046361/202039723  
Published online  05 February 2021 
Efficient widefield radio interferometry response
^{1}
MaxPlanck Institut für Astrophysik, KarlSchwarzschildStr. 1, 85748 Garching, Germany
email: parras@mpagarching.mpg.de
^{2}
LudwigMaximiliansUniversität München (LMU), GeschwisterSchollPlatz 1, 80539 München, Germany
^{3}
Technische Universität München (TUM), Boltzmannstr. 3, 85748 Garching, Germany
Received:
20
October
2020
Accepted:
1
December
2020
Radio interferometers do not measure the sky brightness distribution directly, but measure a modified Fourier transform of it. Imaging algorithms therefore need a computational representation of the linear measurement operator and its adjoint, regardless of the specific chosen imaging algorithm. In this paper, we present a C++ implementation of the radio interferometric measurement operator for widefield measurements that is based on socalled improved wstacking. It can provide high accuracy (down to ≈10^{−12}), is based on a new gridding kernel that allows smaller kernel support for given accuracy, dynamically chooses kernel, kernel support, and oversampling factor for maximum performance, uses piecewise polynomial approximation for cheap evaluations of the gridding kernel, treats the visibilities in cachefriendly order, uses explicit vectorisation if available, and comes with a parallelisation scheme that scales well also in the adjoint direction (which is a problem for many previous implementations). The implementation has a small memory footprint in the sense that temporary internal data structures are much smaller than the respective input and output data, allowing inmemory processing of data sets that needed to be read from disk or distributed across several compute nodes before.
Key words: instrumentation: interferometers / techniques: interferometric / methods: numerical / methods: data analysis
© P. Arras et al. 2021
Open Access article, published by EDP Sciences, under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Open Access funding provided by Max Planck Society.
1. Introduction
The central data analysis task in radio interferometry derives the locationdependent sky brightness distribution I(l, m) from a set of complexvalued measured visibilities d_{k}. In the noiseless case they are related by the expression (e.g. Richard Thompson et al. 2017)
Here, l, m, and are direction cosines with respect to the central observation axis, while , , and are the coordinates of the baselines in metres and λ_{k} are the observation wavelengths. When we assume that I(l, m) is approximated by discretised values on a Cartesian (l, m) grid, the double integral corresponds to a discrete Fourier transform. The entries d_{k} of the data vector d correspond to delta peak readouts of the threedimensional Fourier transformed sky brightness at Fourier location (u_{k}, v_{k}, w_{k}), which are commonly called ‘visibilities’. It suffices to discuss the noiseless case here. While taking the noise into account is the task of the chosen imaging algorithm, all such algorithms need an implementation of Eq. (1).
Typical problem sizes range from 10^{6} to beyond 10^{9} visibilities, fields of view can reach significant fractions of the hemisphere, and image dimensions exceed 10 000 × 10 000 pixels. It is evident that naïve application of Eq. (1) becomes prohibitively expensive at these parameters; a single evaluation would already require ≈10^{17} calls to the complex exponential function.
Massive acceleration can be achieved by using ‘convolutional gridding’ (in other fields often called ‘nonuniform fast Fourier transform’; Dutt & Rokhlin 1993). Here, the information contained in the d_{k} is transferred onto a regular Cartesian grid by convolving the delta peak readouts at (u_{k}, v_{k}, w_{k}) with an appropriately chosen kernel function, which is evaluated at surrounding (u, v) grid points. Transformation between u, v and l, m can now be carried out quickly by means of a twodimensional fast Fourier transform (FFT; Cooley & Tukey 1965), and the smoothing caused by the convolution with the kernel is compensated for by dividing the I(l, m) by the Fouriertransformed kernel.
When the term is very close to 1, no further optimisation steps are required. This criterion is not fulfilled for nonplanar instruments and for widefield observations. Therefore the visibilities need to be gridded onto several uvplanes with different w, which are Fouriertransformed and corrected separately. Perley (1999) has pointed out that Eq. (1) can be written as a threedimensional Fourier transform. Based on this idea, Ye (2019) applied the convolutional gridding algorithm not only for the uvcoordinates, but also for the wdirection. Because this approach naturally generalises wstacking (Offringa et al. 2014) to use gridding in the wdirection as well, we propose the term ‘wgridding’ instead of the term ‘improved wstacking’ (Ye 2019, Ye et al. 2021).
This paper does not present any new insights into the individual components of the radio interferometric measurement operator implementation (except for the introduction of a tuned gridding kernel in Sect. 3.2); our code only makes use of algorithms that are already publicly available. Instead, our main point is to demonstrate how significant advances in performance and accuracy can be achieved by appropriate selection of individual components and their efficient implementation. Our implementation has been integrated into the wellknown imaging tool wsclean^{1} (Offringa et al. 2014) since version 2.9, where it can be selected through the usewgridder flag, and the imaging toolkit codexafricanus^{2}. Furthermore, the implementation presented here has been used in Arras et al. (2020, 2021), for instance.
Section 2 introduces the notation used in this paper and summarises the algorithmic approach to numerically approximate Eq. (1) and its adjoint. Section 3 describes all algorithmic components in detail from a computational point of view, and Sect. 4 lists the design goals for the new code, which influence the choice of algorithmic components from the set given in Sect. 3. Here we also list a number of additional optimisations to improve overall performance. The new code is validated against discrete Fourier transforms in Sect. 5, and an analysis of its scaling behaviour as well as a performance comparison with other publicly available packages is presented in Sect. 6.
2. Notation and formal derivation of the algorithm
The data that are taken by radio interferometers are called ‘visibilities’. Equation (1) already shows that the operation that is to be implemented is similar to a Fourier transform modulated by a phase term. In the following, we introduce all notation that is required to describe the algorithm and present the threedimensional (de)gridding approach from Ye (2019) in this notation.
Let be the vector of observing wavelengths in metres and the coordinates of the baselines in metres, each of which are elements of . In other words, n_{ν} and n_{r} are the number of observing wavelengths and number of rows of the data set, respectively. Then, the effective baseline coordinates (u, v, w) are defined as
These are the effective locations of the sampling points in Fourier space. To simplify the notation, we view the above three coordinates as elements of a simple vector space, for example, with n_{d} = n_{r}n_{ν}. Because the measurement Eq. (1) is to be evaluated on a computer, it needs to be discretised,
where R_{0} is the (accurate) response operator defined by the righthand side of the equation, and L, M are the sets of direction cosines of the pixels of the discretised sky brightness distribution in the two orthogonal directions. (Δl, Δm) are the pixel sizes, and (n_{l}, n_{m}) are the number of pixels. Then, formally, L and M can be defined as
It is apparent that computing Eq. (3) is prohibitively expensive because the whole sum needs to be performed for each data index k individually. As a solution, the convolution theorem can be applied in order to replace the Fourier transform by an FFT that can be reused for all data points. As it stands, Eq. (3) is not a pure Fourier transform because of the phase term w_{k}(n_{lm} − 1). As discussed above, we follow Perley (1999) and introduce an auxiliary Fourier integration in which w and n_{lm} − 1 are viewed as Fourierconjugate variables,
The next goal is to replace the above threedimensional nonequidistant Fourier transform by an equidistant one. This can be done by expressing a visibility d_{k} as a convolution of a field defined on a grid with a convolution kernel. This convolution is undone by dividing by its Fourier transform in sky space.
For this, we need to define the convolution kernel. Let be a function that is pointsymmetric around 0 and has compact support with the kernel support size . In other words, the kernel function is zero outside a symmetric integerlength interval around zero. In practice, this means that every visibility is gridded onto the α × αuvgrid points that are closest to it. We use ϕ as convolution kernel to interpolate all three grid dimensions. Let be its Fourier transform: . ψ needs to be defined only on because ϕ is evaluated on a grid with pixel size 1.
Now, the (discrete) convolution theorem can be applied to turn the sums in Eq. (6) into a discrete Fourier transform followed by a periodic convolution on an oversampled grid (with oversampling factor σ) and to turn the integral over into a regular convolution. Some degree of oversampling (σ > 1) is required to lower the error of the algorithm to the required levels; ultimately, the error depends on σ, α, and the kernel ϕ. Specifically for the wdirection, using the coordinate transform c(x) = w_{k} + xΔw and the definition of ψ,
with W = {w_{°} + jΔw  j ∈ {0, …, N_{w} − 1}}, it follows that
This expression replaces the wterm in Eq. (6) below. The idea of rewriting the wterm as a convolution was first presented in Ye (2019). w_{°}, N_{w}, and Δw denote the as yet unspecified position of the wplane with the lowest wvalue, the number of wplanes, and the distance between neighbouring wplanes, respectively. The approximation Eq. (9) is only sensible for all l ∈ L, m ∈ M and all k if Δw is small enough. The proper condition is given by the NyquistShannon sampling theorem (Ye 2019),
The factor σ appears because the accuracy of a given gridding kernel ϕ depends on the oversampling factor. Therefore, the optimal, that is, largest possible, Δw is
For a given σ this determines Δw. w_{°} and N_{w} are still unspecified. Combining Eqs. (6) and (10) leads to the final approximation of the measurement equation,
where R is the linear map that approximates R_{0} in our implementation, and Φ and Ψ are the threefold outer product of ϕ and ψ, respectively,
with χ(a) = a − ⌊a⌋−0.5 where . To define the sets U, V, and W, the discretisation in uvwspace needs to be worked out. The number of pixels in discretised uvspace is controlled by the oversampling factor σ,
where . Thus, the set of pixels of the discretised uvwspace is given by
For the wdimension we can assume w_{k} ≥ 0 for all k without loss of generality because the transformation
leaves Eq. (3) invariant individually for each k. Because of this Hermitian symmetry, only half of the threedimensional Fourier space needs to be represented in computer memory.
For a given Δw, the first wplane is located at
that is, half of the kernel width subtracted from the minimum wvalue, and the total number of wplanes N_{w} is
because below the minimum and above the maximum wvalue, half a kernel width needs to be added in order to be able to grid the respective visibilities with extreme wvalues.
In Eq. (13), we can view the sky brightness distribution I as element of and . Then Eq. (13) can be written as d = R(I) with being a linear map. In imaging algorithms this linear map often appears in the context of functionals that are optimised, for example, a negative loglikelihood or a simple χ^{2} = d − R(I)^{2} functional between data and expected sky response. To compute the gradient (and potentially higher derivatives) of such functionals, not only R, but also R^{†}, the adjoint, is needed. It can be obtained from Eq. (13) by reversing the order of all operations and taking the conjugate of the involved complex numbers. In the case at hand, it is given by
Here we can already observe that parallelisation over the data index k is more difficult in Eq. (22) than in Eq. (13). In Eq. (22), the grid in Fourier space is subject to concurrent write accesses, whereas in Eq. (13), it is only read concurrently, which is less problematic. In Sect. 4.4 we discuss this in more detail and present a parallelisation strategy that scales well in both directions.
All in all, the scheme Eq. (13), which approximates the discretised version Eq. (3) of the radio interferometric response function Eq. (1), has been derived. That it can be computed efficiently is shown in the subsequent sections. The choice of the gridding kernel function ϕ, the kernel support α, and the oversampling factor σ have not yet been discussed. Tuning these three quantities with respect to each other controls the achievable accuracy and the performance of the algorithm (see Sect. 3.2).
3. Algorithmic elements
Equation (13) prescribes a nonequidistant Fourier transform that is carried out with the help of the as yet unspecified gridding kernel Φ. Its choice is characterised by a tradeoff between accuracy (larger kernel support α and/or oversampling factor σ) and computational cost. As a criterion for assessing the accuracy of a kernel, we use a modified version of the leastmisfit function approach from Ye et al. (2020).
3.1. Gridding and degridding and treatment of the wterm
For the implementation, Eq. (13) is reordered in the following way:
In other words, first the geometric term n and the gridding correction Ψ are applied to the input I_{lm} Eq. (25). Then, the wplanes are handled one after another. For every wplane the phase term , called wscreen, is applied to the image Eq. (24). This is followed by the Fourier transform and the degridding procedure with Φ (bracketed term in Eq. (23)). Finally, the contributions from all wplanes are accumulated by the sum over c ∈ W to obtain the visibility d_{k}.
For the adjoint direction, Eq. (22) is reordered to
In words, the wplanes are handled one after another again. First, the visibilities that belong to the current wplane are gridded onto a twodimensional grid with Φ and the twodimensional Fourier transform is applied Eq. (27). Then, its result H_{c} is multiplied with the complex conjugate wscreen and the contributions from wplanes to the image are accumulated by the sum over c ∈ W Eq. (26). Finally, the gridding correction Ψ_{lm} and the geometric factor n_{lm} are applied.
The number of iterations in the loop over the wplanes W can be reduced by up to a factor of two by restricting the w coordinate to w ≥ 0 with the help of the Hermitian symmetry Eq. (19). The implementation scheme described above highlights that the choice of the kernel shape ϕ and its evaluation are crucial to the performance of the algorithm: The support α should be small in order to reduce memory accesses and kernel evaluations. At the same time, the oversampling factor σ needs to be small such that the Fourier transforms do not dominate the run time. Additionally, the kernel itself needs to be evaluated with high accuracy, while at the same time, its computation should be very fast.
3.2. Kernel shape
As already mentioned, the shape of the employed kernel function ϕ has a strong effect on the accuracy of the gridding and degridding algorithms. The historical evolution of preferred kernels is too rich to be discussed here in full, but see Ye et al. (2020) for an astronomycentred background and Barnett et al. (2019) for a more engineeringcentred point of view.
It appears that the kernel shape accepted as ‘optimal’ amongst radio astronomers is the spheroidal function as described by Schwab (1980). This function maximises the energy in the main lobe of the Fouriertransformed kernel compared to the total energy, which is essential to suppress aliasing artefacts.
However, this concept of optimality only holds under the assumption that gridding and degridding are carried out without any oversampling of the uvgrid and the corresponding trimming of the resulting dirty image. While this may have been the default scenario at the time this memorandum was written, most currently employed gridding algorithms use some degree of oversampling and trimming (i.e. σ > 1), which requires restating the optimality criterion: instead of trying to minimise the errors over the entire dirty image, the task now is to minimise the error only in the part of the dirty image that is retained after trimming, while errors in the trimmed part may be arbitrarily high. More quantitatively: Given a kernel support of α cells and an oversampling factor of σ, a kernel shape is sought that produces the lowest maximum error within the parts of the dirty image that are not trimmed.
Ye et al. (2020) demonstrated an approach to determine nonanalytic optimal kernels. However, very good results can also be obtained with rather simple analytical expressions. Barnett et al. (2019) presented the oneparameter kernel called ‘exponential of a semicircle kernel’ or ‘ES kernel’,
for β > 0. In the following, we use a twoparameter version derived from this,
for β > 0 and μ > 0 and call it ‘modified ES kernel’.
To determine optimal values for the two parameters for given α and σ, we use the prescription described in Ye et al. (2020). The idea is to consider the squared difference between the outputs of the accurate and the approximate adjoint response operator R_{0} and R. Without loss of generality, we restrict the following analysis to the case of a onedimensional nonequidistant Fourier transform. For readability, we define and and
Using the CauchySchwarz inequality, the squared error can be bounded from above with
The first term of the righthand side of the inequality is purely data dependent and therefore not relevant in quantifying the (upper limit of the) approximation error of the linear map R^{†}. The actual approximation error does depend on the data d, and for a given data vector, more accurate approximation schemes could be derived in principle. However, because generic statements about d are difficult to make and a dataindependent generic kernel is desired here, we optimise the righthand side of the inequality. If the number of visibilities is large (tests have shown that in generic setups already > 10 visibilities suffice), the values of (a − u_{k})x mod 2π sample the interval [0, 1) sufficiently uniformly. Then the second term is approximately proportional to the dataindependent integral
Because the actual error is quantified by l(x), we call l(x) the ‘map error function’ in contrast to Ye et al. (2020), who used this name for l^{2}(x). l(x) depends on the choice of the functional form of ϕ, the kernel support α, and the oversampling factor σ. Ye et al. (2020) used Eq. (34) in a leastsquares fashion to determine the ‘optimal gridding kernel’ for a given α and σ.
We propose to use Eq. (34) slightly differently. Instead of the L2norm, we use the supremum norm to minimise it because the error should be guaranteed to be below the accuracy specified by the user for all x. Additionally, we use the twoparameter modified ES kernel. The parameters that result from a twodimensional parameter search are hardcoded into the implementation. For explicitness, a selection of parameters is displayed in Appendix A.
As an example, Fig. 1 shows the map error function of the modified ES kernel in dependence on the oversampling factor σ and for fixed α. Increasing the oversampling factor allows a reduction of the convolution kernel support size while keeping the overall accuracy constant, which reduces the time required for the actual gridding or degridding step. At the same time, however, an increase in σ implies both a larger uvgrid and a higher number of wplanes. The former aspect leads to increased memory consumption of the algorithm, and both aspects increase the total cost of FFT operations. As a consequence, for a given number of visibilities, dirty image size, w range, and desired accuracy, it is possible to minimise the algorithm runtime by finding the optimal tradeoff between oversampling factor and kernel support size. The sweet spot for most applications lies in the range 1.2 to 2.0 for the oversampling factor. Our chosen functional form of the gridding kernel naturally leads to higher accuracy towards the phase centre, that is, x = 0.
Fig. 1. Map error function for kernel support α = 6 for a varying oversampling factor σ. The horizontal dotted lines display the advertised accuracy of the kernel. 
For the comparison of our modified ES kernel and the leastmisfit kernel, we note that the kernels are designed to minimise the supremum norm and the L2norm, respectively, of the map error function. All leastmisfit kernels in the following were computed using the software released alongside Ye et al. (2020). For given α and σ, the leastmisfit kernel is therefore not necessarily optimal in our metric and vice versa, and comparison becomes nontrivial. Figure 2 displays the map error function for the modified ES kernel and the leastmisfit kernel with the same α and σ and compares it to the leastmisfit kernel with σ = 1.45. The steep increase in the map error function of the leastmisfit kernel for σ = 1.5 significantly affects the supremum norm but still leads to a lower value for the L2norm because the function is considerably smaller for small x. For the following comparison we select the leastmisfit kernel for σ = 1.45 by hand. It is optimal under the L2norm for σ = 1.45, but still performs better than the modified ES kernel even at σ = 1.5 under the supremum norm. It is to be assumed that with a more systematic search, even better leastmisfit kernels can be found, so that the selected one should be regarded only in a qualitative sense.
Fig. 2. Comparison of the map error function for leastmisfit kernels with different oversampling factor and modified ES kernel. The kernel support size is α = 6 for all three kernels. The dashed lines denote the supremum norm of the respective functions. We display only positive x (in contrast to Fig. 4). All map error functions are symmetric around x = 0. 
Figures 3 and 4 display a comparison for given oversampling factor and kernel width of different gridding kernels. For all kernels (except for the leastmisfit kernel) the same hyperparameter search for optimal parameters given α and σ was performed. The ES kernel (Barnett et al. 2019) is less accurate than the optimal KaiserBessel kernel, while our modified ES kernel exceeds both other kernels in terms of accuracy. Figure 4 again shows that it is possible to find a kernel shape with this code that leads to more accurate transforms than our modified ES kernel. We also plot the spheroidal function kernel, which evidently performs much worse than the other kernels within the retained part of the image. The comparison with this particular error function illustrates that the other kernels, which are chosen based on the knowledge that a part of the image will be trimmed, produce lower errors inside the final image in exchange for much higher errors in the trimmed regions.
Fig. 3. Optimal kernel shapes for α = 1.5 and α = 6 with achieved accuracy ϵ. 
Fig. 4. Map error function of different kernel shapes for α = 1.5 and α = 6. A leastmisfit kernel for a slightly lower oversampling factor is added for qualitative comparison (see the main text for a discussion of this choice), as well as the classic spheroidal function kernel. The arrows highlight the differences of the supremum norm of map error function of the different kernels with respect to our modified ES kernel. 
Although the leastmisfit kernel achieves a slightly more accurate gridding, we used the modified ES kernel for our implementation because only two real numbers are needed to specify the kernel for given α and σ in contrast to much larger tables for the leastmisfit kernel. Additionally, it is nontrivial to minimise the supremum norm of Eq. (34) for a general leastmisfit kernel. With only two parameters, a brute force parameter search is affordable, but this does not work for the many more degrees of freedom of the leastmisfit kernels.
3.3. Kernel evaluation
In addition to choosing a kernel function that yields low errors, for the design of a practical algorithm it is also crucial to have a highly efficient way of evaluating this chosen function. Because for every visibility processed it is necessary to evaluate the kernel at least 3α times (α times each in u, v, and wdirection), this is definitely a computational hot spot, and therefore a single evaluation should not take more than a few CPU cycles.
From the candidate functions listed in Sect. 3.2, it is obvious that this rules out direct evaluation in most cases. The only exception here is the original ES kernel Eq. (28), which can be evaluated up to several hundred million times per second on a single CPU core using vector arithmetic instructions. To retain high flexibility with respect to the choice of kernel function, some other approach is therefore needed.
Traditionally, this problem is often addressed using tables of precomputed function values evaluated at equidistant points, from which the desired kernel values are then obtained by interpolation. Typically, zerothorder (i.e. nearestneighbour selection) and linear interpolation are used.
Interpolation at low polynomial degree soon leads to lookup tables that no longer fit into the CPU Level1 and Level2 caches when the required accuracy is increased, thus leading to high load on the memory subsystem, especially when running on multiple threads. To overcome this, we adopted an approach presented by Barnett et al. (2019) and approximated the kernel in a piecewise fashion by several higher order polynomials. Barnett et al. (2019) reported that for a given desired accuracy ϵ, it is sufficient to represent a kernel with support α by a set of α polynomials of degree α + 3. This means that a kernel evaluation can be carried out using only α + 3 multiplyandadd instructions, and the total storage requirement for the polynomial coefficients is α(α + 4) floating point numbers, which is negligible compared to the traditional lookup tables and much smaller than the CPU cache.
Because this approach is applicable to all kernel shapes discussed above, has sufficient accuracy (which can even be tuned by varying the degree of the polynomials), and has very low requirements on both CPU and memory, we used it in our implementation. Details on the construction of the approximating polynomials are given in Sect. 4.2.
4. Implementation
4.1. Design goals and highlevel overview
In order to make our code useful (and easy to use) in the broadest possible range of situations, we aim for the library to have minimum external dependences (to simplify installation), have a minimum simple interface and be easily callable from different programming languages (to allow convenient use as a plugin for existing radioastronomical codes), be wellsuited for a broad range of problem sizes and required accuracies, have a very low memory footprint for internal data structures, and reach very high performance, but not at the cost of significant memory consumption. We decided to provide the functionality as a component of the ducc^{3} collection of numerical algorithms. Because this package already provides support for multithreading, SIMD data types, FFTs, and all other algorithmic prerequisites, the code does not have external dependences and only requires a compiler supporting the C++17 language standard. Its interface only consists of two functions (to apply the gridding operator and its adjoint), which take a moderate number of parameters (scalars and arrays). For illustration purposes, we list the interface documentation for the Python frontend of the library in Appendix B. Similar to many other gridder implementations, the interface allows specifying individual weights for each visibility, as well as a mask for flagging arbitrary subsets of the measurement set; both of these parameters are optional, however (see Appendix B). For an easy explicit understanding of the algorithm, we provide a compact Python and a slightly optimized Numpy and a Numba implementation of the wgridding^{4}.
One important motivation for choosing C++ was its ability to separate the highlevel algorithm structure from lowlevel potentially architecturedependent implementation details. As an example, while the algorithm is written only once, it is instantiated twice for use with singleprecision and doubleprecision data types. The singleprecision version is faster, requires less memory, and may be sufficient for most applications, but the user may choose to use double precision in particularly demanding circumstances. Similarly, advanced templating techniques allow us to make transparent use of vector arithmetic instructions available on the target CPU, be it SSE2, AVX, AVX2, FMA3/4, or AVX512F; this is invaluable to keep the code readable and easy to maintain. The SIMD class of ducc supports the x86_64 instruction set, but could be extended to other instruction sets (such as ARM64) if needed.
Especially due to the necessity of having a low memory footprint, the wplanes are processed strictly sequentially. For the gridding direction (the degridding procedure is analogous), this means that for every wplane, all relevant visibilities are gridded onto the uvgrid, weighted accordingly to their wcoordinate, the appropriate wscreen is applied to the grid, the grid is transformed into the image domain via FFT, and the resulting image is trimmed and added to the final output image. This approach is arguably suboptimal from a performance point of view because it requires recomputing the kernel weights in u and vdirection for every visibility at each wplane it contributes to: the number of kernel evaluations necessary to process a single visibility increases from 3α to α(2α + 1). On the other hand, processing several planes simultaneously would increase the memory consumption considerably, and at the same time the speedup would probably not be very significant because kernel computation only accounts for a minor fraction of the overall run time (⪅20%).
Overall, our approach requires the following auxiliary data objects: a twodimensional complexvalued array for the uvgrid (requiring 2σ^{2} times the size of the dirty image), a temporary copy of the dirty image (only for degridding), and a data structure describing the processing order of the visibilities (see Sect. 4.3 for a detailed description and a size estimate). Processing only a single wplane at a time implies that for parallelisation the relevant visibilities need to be subdivided into groups that are gridded or degridded concurrently onto/from that plane by multiple threads. To obtain reasonable scaling with such an approach, it is crucial to process the visibilities in an order that is strongly coherent in u and v; in other words, visibilities falling into the same small patch in uvspace should be processed by the same thread and temporally close to each other. This approach optimises both cache reuse on every individual thread as well as (in the gridding direction) minimising concurrent memory writes. However, finding a closetooptimal ordering for the visibilities in short time, as well as storing it efficiently, are nontrivial problems; they are discussed in Sect. 4.3.
As mentioned initially, parameters for interferometric imaging tasks can vary extremely strongly: the opening angle of the field of view can lie between arcseconds and π, visibility counts range from a few thousands to many billions, and image sizes start below 10^{6} pixels and reach 10^{9} pixels for current observations, with further increases in resolution to be expected. Depending on the balance between these three quantities, the optimal choice (in terms of CPU time) for the kernel support α, and depending on this the choice, of other kernel parameters and the oversampling factor σ, can vary considerably, and choosing these parameters badly can result in run times that are several times slower than necessary. To avoid this, our implementation picks nearoptimal α and σ depending on a given task’s parameters, based on an approximate cost model for the individual parts of the computation. For all available α (α ∈ {4, …, 16} in the current implementation), the code checks the list of available kernels for the one with the smallest σ that provides sufficient accuracy and predicts the total runtime for this kernel using the cost model. Then the kernel, α, and σ with the minimum predicted cost are chosen.
4.2. Gridding kernel
Our code represents the kernel function by approximating polynomials as presented in Sect. 3.3. A kernel with a support of α grid cells is subdivided into α equallength parts, one for each cell, which are approximated individually by polynomials of degree α + 3. When the kernel is computed in u and vdirections, evaluation always takes place at locations spaced with a distance of exactly one grid cell, a perfect prerequisite for using vector arithmetic instructions. As an example, for α = 8 and single precision, all eight necessary kernel values can be computed with only 11 FMA (floatingpoint multiplyandadd) machine instructions on any reasonably modern x86 CPU.
We used the family of modified ES kernels introduced in Sect. 3.2. They are convenient because an optimised kernel for given α and σ is fully characterised by only two numbers β and μ, and therefore it is simple and compact to store a comprehensive list of kernels for a wide parameter range of α, σ and ϵ directly within the code. This is important for the choice of nearoptimal gridding parameters described in the preceding section.
When a kernel has been picked for a given task, it is converted to approximating polynomial coefficients. For maximum accuracy, this should be done using the Remez algorithm (Remez 1934), but we found that evaluating the kernel at the respective Chebyshew points (for an expansion of degree n, these are the roots of the degree (n + 1) Chebyshev polynomial, mapped from [ − 1; 1] to the interval in question) and using the interpolating polynomial through the resulting values produces sufficiently accurate results in practice while at the same time being much simpler to implement. Chebyshew abscissas are used because the resulting interpolants are much less prone to spurious oscillations than those obtained from equidistant abscissas^{5} (Runge 1901).
Even better accuracy could be obtained by switching from modified ES kernels to leastmisfit kernels, but there is a difficult obstacle to this approach: determining a leastmisfit kernel for a given α and σ, which is optimal in the supremumnorm sense instead of the L2norm sense, may be possible only by a bruteforce search, which may be unaffordably expensive. Because the obtainable increase in accuracy is probably modest, we decided to postpone this improvement to a future improved release of the code.
4.3. Optimising memory access patterns
With the highly efficient kernel evaluation techniques described above, the pure computational aspect of gridding and degridding no longer dominates the run time of the algorithm. Instead, most of the time is spent reading from and writing to the twodimensional uvgrid. Processing a single visibility requires α^{3} read accesses to this grid, and for the gridding direction, the same number of additional write accesses. While it is not possible to reduce this absolute number without fundamentally changing the algorithm (which in turn will almost certainly lead to increasing complexity in other parts), much can be gained by processing the visibilities in a cachefriendly order, as was already pointed out in Sect. 4.1. Making the best possible use of the cache is also crucial for good scaling behaviour because every CPU core has its own L1 and L2 caches, whereas there is only a small number of memory buses (with limited bandwidth) for the entire compute node. For multithreaded gridding operations, this optimisation is even more important because it decreases the rate of conflicts between different threads trying to update the same grid locations; without this measure, R^{†} would have extremely poor scaling behaviour.
Reordering the visibility and/or baseline data is not an option here because this would require either creating a rearranged copy of the visibilities (which consumes an unacceptable amount of memory) or, in the gridding direction, manipulating the input visibility array inplace (which is fairly poor interface design). Consequently, we rather used an indexing data structure describing the order in which the visibilities should be processed.
For this purpose, we subdivided the uvgrid into patches of 16 × 16 pixels, which allowed us to assign a tuple of tile indices (t_{u}, t_{v}) to every visibility. The patch dimension was chosen such that for all supported α and arithmetic data types, the ‘hot’ data set during gridding and degridding fit into a typical Level1 data cache. In wdirection, the index of the first plane onto which the visibility needs to be gridded is called t_{w}. For compact storage, we used the fact that the uvwlocations of the individual frequency channels for a given row of the measurement set tend to be very close to each other. In other words, it is highly likely that two visibilities that belong to the same row and neighbouring channels are mapped to the same (t_{u}, t_{v}, t_{w}) tuple.
The resulting data structure is a vector containing all (t_{u}, t_{v}, t_{w}) tuples that contain visibilities. The vector is sorted lexicographically in order of ascending t_{u}, ascending t_{v}, and finally ascending t_{w}. Each of the vector entries contains another vector, whose entries are (i_{row}, i_{chan,begin}, i_{chan,end}) tuples, where i_{row} is the row index of the visibility in question, and i_{chan,begin} and i_{chan,end} represent the first and oneafterlast channel in the range, respectively. Each of these vectors is sorted lexicographically in order of ascending i_{row} and ascending i_{chan,begin}.
While fairly nontrivial, this storage scheme is extremely compact: for a typical measurement set, it consumes roughly one bit per nonflagged visibility and is therefore much smaller than the visibility data themselves (which use eight bytes for every visibility, even the flagged ones). In the most unfavourable case (which occurs, e.g., when the measurement set only contains a single channel or when every other frequency channel is flagged), the memory consumption will be around eight bytes per nonflagged visibility.
Processing the visibility data in this new ordering leads to a more random access pattern to the visibility array itself. This is only a small problem, however, because entries for neighbouring channels are still accessed together in most cases, and also because the number of data accesses to the visibility array is lower by a factor of α^{2} than the one to the uvgrid in our algorithm.
4.4. Parallelisation strategy
Our code supports shared memory parallelisation by standard C++ threads, that is, it can be run on any set of CPUs belonging to the same compute node. To achieve good scaling, all parts of the algorithm that contribute noticeably to the run time need to be parallelised. In our case these parts are: building the internal data structures, performing the (de)gridding process, applying wscreens, evaluating Fourier transforms, and evaluating and applying kernel corrections.
For the construction of the data structures (discussed in Sect. 4.3), we subdivided the measurement set into small ranges of rows that are processed by the available threads in a firstcomefirstserve fashion. The threads concurrently update a global sorted data structure (using mutexes to guard against write conflicts), which is finally converted into the desired index list in a singlethreaded code section. While considerable speedups can be achieved by this approach compared to a purely singlethreaded computation, this part of the algorithm does not scale perfectly and can become a bottleneck at very high thread counts.
With the list of work items in hand, parallelising the actual gridding and degridding steps is straightforward: first, the list is subdivided into a set of roughly equalsized chunks with n_{chunks} ≫ n_{threads}. Each thread fetches the first chunk that has not been processed yet, performs the necessary operations, and then requests the next available chunk, until all chunks have been processed. This kind of dynamic work balancing is required here because it is difficult to determine a priori how much CPU time a given chunk will require.
The way in which the list was constructed ensures that each chunk is confined to a compact region of the uvplane and therefore reduces potential write conflicts between threads during gridding. Still, it might happen that different threads try to update the same pixel in the uvgrid simultaneously, which would lead to undefined program behaviour. To avoid this, each thread in both gridding and degridding routines employs a small buffer containing a copy of the uvregion it is currently working on, and when the gridding routine needs to write this back to the global uvgrid, this operation is protected with a locking mechanism. In practice, the amount of time spent in this part of the code is very small, so that lock contention is not an issue.
Furthermore, the application of the wscreens and the kernel correction factors are parallelised by subdividing the array in question into equalsized slabs, which are simultaneously worked on by the threads. The FFT component has a builtin parallelisation scheme for multidimensional transforms that we make use of.
As mentioned above, the provided parallelisation can only be used on a single sharedmemory compute node. A further degree of parallelism can be added easily, for example by distributing the measurement set data evenly over several compute nodes, performing the desired gridding operation independently on the partial data sets, and finally summing all resulting images. Analogously, for degridding, the image needs to be broadcast to all nodes first, and afterwards, each node performs degridding for its own part of the measurement set. How exactly this is done will most likely depend on the particular usage scenario, therefore we consider distributed memory parallelisation to be beyond the scope of our library. A distribution strategy over several compute nodes will increase the relative amount of time spent for computing the FFTs. Still, our implementation partially compensates for this effect by picking a combination of α, σ, and kernel shape that is optimised for the changed situation.
5. Accuracy tests
This section reports the accuracy tests that we have performed to validate our implementation. The tests can be subdivided into two major parts: the accuracy with respect to the direct evaluation of the adjoint of Eq. (3),
and the adjointness consistency between the forward and backward direction of the different calls.
5.1. Adjointness consistency
First, the degridding and the gridding calls were checked for their consistency. This is possible because mathematically, the two calls are the adjoint of each other. Therefore
where ⟨a, b⟩_{(1)} := a^{†}b and ⟨a, b⟩_{(2)} := a^{T}b are the dot products of and , respectively. On the lefthand side of the equation, the real part needs to be taken because R maps from an  to a vector space. Still, ℑ𝔪(R(I)) is tested by Eq. (36) because evaluating the scalar product involves complex multiplications. Therefore the real part of the scalar product also depends on ℑ𝔪(R(I)).
For the numerical test, we chose n_{l} = n_{m} = 512 and a field of view of 15 ° ×15°. The observation was performed at 1 GHz with one channel. The synthetic uvwcoverage consisted of 1000 points sampled from a uniform distribution in the interval [ − a, a], where a = pixsize/2/λ, pixsize is the length of one pixel and λ is the observing wave length. The real and the imaginary parts of the synthetic visibilities d were drawn from a uniform distribution in the interval [ − 0.5, 0.5]. Analogously, we drew the pixel values for the dirty image I from the same distribution. We consider this setup to be generic enough for accuracy testing purposes.
As discussed above, our implementation supports applying or ignoring the wcorrection and can run in single or double precision. This gives four modes that are tested individually in the following. Moreover, the kernel sizes and the oversampling factor were chosen based on the intended accuracy ϵ, specified by the user. As a criterion for the quality of the adjointness, we use
For all four modes and for all tested ϵ in the supported range (≥10^{−5} for single precision, ≥10^{−14} for double precision), this quantity lay below 10^{−7} and 10^{−15} for single and double precision, respectively.
5.2. Accuracy of R^{†}
Second, we compared the output of our implementation to the output of the direct Fourier transform with and without wcorrection. It suffices to test only R^{†} and not also R because the consistency of the two directions was already verified in Sect. 5.1. The error is quantified as rms error,
As testing setup, the same configuration as above was employed. Figure 5 shows the results of the (approximate) gridding implementation against the exact DFT. It is apparent that single precision transforms reach the requested accuracy for ϵ ⪆ 3 × 10^{−5}, while double precision transforms are reliably accurate down to ϵ ≈ 10^{−13}. We deliberately also show results for ϵ outside this safe range to demonstrate how the resulting errors grow beyond the specified limit due to the intrinsic inaccuracy of floatingpoint arithmetics. Inside the safe region, the achieved accuracy typically lies in the range between 0.03ϵ and ϵ, which indicates that the estimation in Eq. (33) is not overly pessimistic.
Fig. 5. Accuracy of R^{†}. The ratio of measured root mean square error to the requested accuracy ϵ is plotted as a function of ϵ itself. The grey line denotes the identity function. Points lying in the region below the line represent configurations that are more accurate than specified by the user. 
The sawtooth pattern of the measured errors is caused by the dynamic parameter selection during the setup process of each gridding operation mentioned near the end of Sect. 4.1: Moving from higher to lower accuracies, a fixed combination of α, σ, and the corresponding kernel shape results in decreasing ϵ_{rms}/ϵ, which is indicated by the individual descending line segments. At some point, a new parameter combination (lower σ, or lower α with increased σ) with sufficiently high accuracy and lower predicted run time becomes available. This is then selected and the relative error jumps upwards, while still remaining well below the specified tolerance.
6. Performance tests
The tests in this section were performed on a 12core AMD Ryzen 9 3900X CPU with 64GB main memory attached. g++ 10.2 was used to compile the code, with notable optimisation flags including march=native, ffastmath, and O3. The system supports two hyperthreads per physical CPU core, so that some of the tests were executed on up to 24 threads. As test data we used a MeerKAT (Jonas & MeerKAT Team 2016) Lband measurement set corresponding to an 11hour synthesis with 8s integration time and 2048 frequency channels, using 61 antennas (824476 rows in total, project id 201804260018). We worked on the sum of XX and YY correlations only, ignoring polarisation, and after selecting only unflagged visibilities with nonvanishing weights, roughly 470 million visibilities need to be processed for each application of the gridding or degridding operator. The size of the dirty image was 4096 × 4096 pixels, and the specified field of view was 1.6 ° ×1.6°. Unless mentioned otherwise, computations were executed in singleprecision arithmetic and with a requested accuracy of ϵ = 10^{−4}. We compared the timings of our implementation to the standard radio software wsclean and the generalpurpose library FINUFFT^{6}.
6.1. Strong scaling
First, we investigated the strongscaling behaviour of our implementation. Figure 6 shows the timings of this problem evaluated with a varying number of threads. The ideal scaling would of course be , but this cannot be fully reached in practice. As mentioned in Sect. 4.4, the setup part of the algorithm does not scale perfectly, and the same is true for the FFT operations because of their complicated and not always cachefriendly memory access patterns.
Fig. 6. Strongscaling scenario. The vertical dotted gray line indicates the number of physical cores on the benchmark machine. Efficiency is the theoretical wall time with perfect scaling divided by the measured wall time and divided by the singlethread timing of ‘R^{†} ducc’. 
Still, the implementation scales acceptably well, reaching a speedup of roughly 8.0 when running on 12 threads. While the further improvements are much lower when scaling beyond the number of physical cores, as has to be expected, a total speedup of around 9.6 is reached when using all hyperthreads available on the system.
In this test, degridding is slightly, but consistently slower than gridding, which appears counterintuitive because degridding only requires roughly half the number of memory accesses. We assume that this is due to the horizontal addition of vector registers that has to be performed when a computed visibility value is written back to the measurement set. This kind of operation is notoriously slow on most CPUs, while the corresponding broadcast operation that is needed during gridding is much faster. If this interpretation is correct, it indicates that in the selected regime (single precision with an accuracy of 10^{−4}) memory accesses do not completely dominate computation. For higher accuracies this is no longer true, as shown in Sect. 6.3.
Figure 6 also shows analogous timings for the standard gridder in wsclean, but it is important to note that these cannot be directly compared to those of our code. While we tried to measure the timings with as little overhead as possible (we used the times reported by wsclean itself for the operations in question), the wsclean default gridder always interleaves I/O operations (which do not contribute at all to our own measurements) with the actual gridding and degridding, so there is always an unknown, nonscaling amount of overhead in these numbers. Additionally, the accuracy of wsclean cannot be set explicitly; based on experience, we expect it to be close to the target of 10^{−4} near the image center, but somewhat worse in the outer regions.
6.2. Comparison to nonequidistant FFT
As mentioned in the introduction, gridding or degridding without the wterm can be interpreted as a special case of the nonuniform FFT, where the uv coordinates of the individual points are not independent, but vary linearly with frequency in each channel. For this reason we also performed a direct comparison of our implementation with the FINUFFT library (Barnett et al. 2019). We still used the same measurement set as above, but performed a gridding step without the w term, using double precision and requiring ϵ = 10^{−10}.
Because a general nonuniform FFT algorithm cannot be informed about the special structure of the uv coordinates, we supplied it with an explicit coordinate pair for every visibility. This implies that a much larger amount of data is passed to the implementation, and it also increases the cost of the preprocessing step. To allow a fairer comparison, we also ran ducc on an equivalent flattened data set, which only contained a single frequency channel and therefore as many uv coordinates as there are visibilities. We verified that both implementations returned results that are equal to within the requested tolerance. The performance results are shown in Fig. 7. In contrast to our implementation, FINUFFT features a separate planning phase that can be timed independently, so we show FINUFFT timings with and without the planning time, in addition to ducc timings for processing the original and flattened measurement set.
Fig. 7. Comparison to FINUFFT. The vertical dotted grey line indicates the number of physical cores on the benchmark machine. Efficiency is the theoretical wall time with perfect scaling divided by the measured wall time and divided by the singlethread timing of ‘ducc’. 
To a large extent, the results confirm the expectations. FINUFFT is always slower than ducc when ducc works on the unflattened data. This can be attributed to the slightly higher accuracy of the ducc kernels and/or to its advantage of knowing the internal structure of the uv data, which reduces setup time and the amount of memory accesses considerably. Furthermore, ducc performs rather poorly on the flattened data compared to its standard operation mode, especially with many threads. Here it becomes obvious that the index data structure, which has many benefits for multichannel data, slows the code down when it is not used as intended by providing only a single channel. Finally, preplanned FINUFFT performs worse than ducc with flattened data at low thread counts, but has a clear speed advantage on many threads; again, this is probably due to the ducc data structures, which are suboptimal for this scenario.
Memory consumption also behaves as expected, meaning that ducc without flattening requires the least amount of memory (because it does not need to store the redundant uv data), followed by both FINUFFT runs, while ducc with flattening consumes the most memory because it stores the full uv coordinates as well as a really large index data structure. Overall, we consider it very encouraging that despite differences in details, the performance and scaling behaviour of these two independent implementations are fairly similar to each other.
6.3. Run time vs. accuracy
For the following tests, we again used the setup described at the beginning of this section, but we fixed the number of threads to six and varied the requested accuracy ϵ as well as the data type of the input (single or double precision). Figure 8 shows the expected decrease in wall time for increasing ϵ, that is, lower accuracy. In singleprecision mode the evaluation is indeed slightly faster than in double precision, most probably because more visibilities and grid points can be communicated per second between CPU and RAM for a given memory bandwidth. Moreover, the number of elements in the CPU vector registers is twice as large for singleprecision variables.
Fig. 8. Wall time vs. specified accuracy ϵ measured with six threads. 
In analogy to the observations in Sect. 6.1, degridding is slightly slower than gridding for these measurements. For double precision, the same is only true at very low accuracies; for ϵ ⪅ 10^{−3}, gridding becomes the more expensive operation, and this trend becomes very pronounced at the lowest reachable ϵ values. In these runs, the kernel support α is quite large and most of the runtime is presumably spent on data transfer from/to main memory. The results also show that while certainly attainable, high accuracy comes at a significant cost: going from a typical ϵ of 10^{−4} to 10^{−12} increases the runtime by about an order of magnitude.
7. Discussion
We have presented a new implementation of the radio interferometry gridding and degridding operators, which combines algorithmic improvements from different sources: an accurate and efficient treatment of the wterm for widefield observations published by Ye (2019) and Ye et al. (2021), an easytouse, highaccuracy, functional form for the gridding kernels presented by Barnett et al. (2019), with some slight improvements, a piecewise polynomial approximation method for arbitrary kernels (also published by Barnett et al. 2019), which is very well suited for the task at hand), a parallelisation strategy, dynamic parameter selection, and indexing data structure of our own design. To the best of our knowledge, the resulting code compares favourably to other existing Fourierdomain gridders (both for wide and narrowfield data) in terms of accuracy, memory consumption, singlecore performance, and scalability. Our implementation is designed to have minimum dependences (only a C++17 compiler is needed), and it is free and opensource software. Therefore it may be advantageous to add it as an alternative option to existing radio interferometry imaging packages, as was already done in the wsclean code.
Compared with the fairly recent imagedomain gridding approach (IDG, van der Tol et al. 2018), it appears that our implementation currently has a performance advantage when both algorithms are run on CPUs, but the GPU implementation of IDG easily outperforms all competitors on hardware of comparable cost. Furthermore, IDG can incorporate directiondependent effects (DDEs) in a straightforward manner, which are difficult and costly to treat with Fourierdomain gridding algorithms.
However, it may be possible to address this within the wgridding framework. The Astacking algorithm (Young et al. 2015) might be combined with wgridding, for instance. This would imply approximating all possible DDE patterns as linear combinations of a small set of N_{b} basis functions f_{b}(l, m), computing (for every visibility) the projection of its particular DDE pattern onto this set of functions, running the wgridder N_{b} times with the appropriate sets of weights, multiplying each result with the corresponding basis function, and finally adding everything together. Investigating the actual feasibility and performance of such an approach is left for future studies.
https://github.com/flatironinstitute/finufft, type 1, twodimensional transform.
Acknowledgments
We thank Landman Bester, Simon Perkins, Wasim Raja and Oleg Smirnov for testing and feedback on the interface, Alex Barnett, Vincent Eberle, Torrance Hodgson and Haoyang Ye for feedback on drafts of the manuscript, and SARAO for providing access to MeerKAT data for our algorithmic testing purposes. Philipp Arras acknowledges financial support by the German Federal Ministry of Education and Research (BMBF) under grant 05A17PB1 (Verbundprojekt DMeerKAT).
References
 Arras, P., Frank, P., Haim, P., et al. 2020, ArXiv eprints [arXiv:2002.05218] [Google Scholar]
 Arras, P., Perley, R. A., Bester, H. L., et al. 2021, A&A, in press, https://doi.org/10.1051/00046361/202039258 [Google Scholar]
 Barnett, A. H., Magland, J., & af Klinteberg, L., 2019, SIAM. J. Sci. Comput., 41, C479 [Google Scholar]
 Cooley, J. W., & Tukey, J. W. 1965, An algorithm for the machine calculation of complex fourier series, 19, 297 [Google Scholar]
 Dutt, A., & Rokhlin, V. 1993, SIAM J. Sci. comput., 14, 1368 [Google Scholar]
 Jonas, J., & MeerKAT Team 2016, MeerKAT Science: On the Pathway to the SKA, 1 [Google Scholar]
 Offringa, A., McKinley, B., HurleyWalker, N., et al. 2014, MNRAS, 444, 606 [Google Scholar]
 Perley, R. A. 1999, Synthesis imaging in radio astronomy II, Vol. 180, 383 [Google Scholar]
 Remez, E. Y. 1934, Compt. Rend. Acade. Sc., 199, 337 [Google Scholar]
 Richard Thompson, A., Moran, J. M., & Swenson, Jr., G. W. 2017, Interferometry and Synthesis in Radio Astronomy (Springer Nature) [Google Scholar]
 Runge, C. 1901, Zeitschrift für Mathematik und Physik, 46, 224 [Google Scholar]
 Schwab, F. R. 1980, VLA Scientific Memoranda, 132 [Google Scholar]
 van der Tol, S., Veenboer, B., & Offringa, A. R. 2018, A&A, 616, A27 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Ye, H. 2019, Accurate Image Reconstruction in Radio Interferometry, 139, https://www.repository.cam.ac.uk/handle/1810/292298 [Google Scholar]
 Ye, H., Gull, S. F., Tan, S. M., & Nikolic, B. 2020, MNRAS, 491, 1146 [Google Scholar]
 Ye, H., Gull, S. F., Tan, S. M., & Nikolic, B. 2021, arXiv eprints [arXiv:2101.11172] [Google Scholar]
 Young, A., Wijnholds, S. J., Carozzi, T. D., et al. 2015, A&A, 577, A56 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
Appendix A: Kernel parameters
Optimal kernel parameters and associated accuracy ϵ for the modified exponential semicircle kernel Eq. (29) given the oversampling factor σ and the kernel support size α. Larger σ and larger α lead to smaller ϵ. Larger σ and smaller α increase the fraction of the FFT of the total computation time. FFT and gridding costs are represented in our implementation with a simple cost model such that the algorithm can choose optimal α and σ automatically. For brevity, we display only the tables for α ∈ {4, 7, 8, 12, 16}. The rest can be looked up in the ducc code repository. The leastmisfit kernels (Ye et al. 2020) achieve an accuracy ϵ = 10^{−7} for α = 7 and σ = 2.
Optimal parameters for α = 4.
Optimal parameters for α = 7.
Optimal parameters for α = 8.
Optimal parameters for α = 12.
Optimal parameters for α = 16.
Appendix B: Python interface documentation
def ms2dirty(uvw, freq, ms, wgr, npix_x, npix_y, pixsize_x, pixsize_y, nu, nv, epsilon, do_wstacking, nthreads, verbosity, mask): """ Converts an MS object to dirty image. Parameters  uvw: numpy.ndarray((nrows, 3), dtype=numpy.float64) UVW coordinates from the measurement set freq: numpy.ndarray((nchan,), dtype=numpy.float64) channel frequencies ms: numpy.ndarray((nrows, nchan), dtype=numpy.complex64 or numpy.complex128) the input measurement set data. Its data type determines the precision in which the calculation is carried out. wgt: numpy.ndarray((nrows, nchan), float with same precision as `ms'), optional If present, its values are multiplied to the input before gridding. npix_x, npix_y: int dimensions of the dirty image (must both be even and at least 32) pixsize_x, pixsize_y: float angular pixel size (in radians) of the dirty image nu, nv: int obsolete, ignored epsilon: float accuracy at which the computation should be done. Must be larger than 2e13. If `ms' has type numpy.complex64, it must be larger than 1e5. do_wstacking: bool if True, the full wgridding algorithm is carried out, otherwise the w values are assumed to be zero nthreads: int number of threads to use for the calculation verbosity: int 0: no output 1: some output 2: detailed output mask: numpy.ndarray((nrows, nchan), dtype=numpy.uint8), optional If present, only visibilities are processed for which mask!=0 Returns  numpy.ndarray((npix_x, npix_y), dtype=float of same precision as `ms') the dirty image Notes  The input arrays should be contiguous and in C memory order. Other strides will work, but can degrade performance significantly. """ def dirty2ms(uvw, freq, dirty, wgr, pixsize_x, pixsize_y, nu, nv, epsilon, do_wstacking, nthreads, verbosity, mask): """ Converts a dirty image to an MS object. Parameters  uvw: numpy.ndarray((nrows, 3), dtype=numpy.float64) UVW coordinates from the measurement set freq: numpy.ndarray((nchan,), dtype=numpy.float64) channel frequencies dirty: numpy.ndarray((npix_x, npix_y), dtype=numpy.float32 or numpy.float64) dirty image Its data type determines the precision in which the calculation is carried out. Both dimensions must be even and at least 32. wgt: numpy.ndarray((nrows, nchan), same dtype as `dirty`), optional If present, its values are multiplied to the output. pixsize_x, pixsize_y: float angular pixel size (in radians) of the dirty image nu, nv: int obsolete, ignored epsilon: float accuracy at which the computation should be done. Must be larger than 2e13. If `dirty` has type numpy.float32, it must be larger than 1e5. do_wstacking: bool if True, the full wgridding algorithm is carried out, otherwise the w values are assumed to be zero nthreads: int number of threads to use for the calculation verbosity: int 0: no output 1: some output 2: detailed output mask: numpy.ndarray((nrows, nchan), dtype=numpy.uint8), optional If present, only visibilities are processed for which mask!=0 Returns  numpy.ndarray((nrows, nchan), dtype=complex of same precision as `dirty`) the measurement set data. Notes  The input arrays should be contiguous and in C memory order. Other strides will work, but can degrade performance significantly. """
All Tables
All Figures
Fig. 1. Map error function for kernel support α = 6 for a varying oversampling factor σ. The horizontal dotted lines display the advertised accuracy of the kernel. 

In the text 
Fig. 2. Comparison of the map error function for leastmisfit kernels with different oversampling factor and modified ES kernel. The kernel support size is α = 6 for all three kernels. The dashed lines denote the supremum norm of the respective functions. We display only positive x (in contrast to Fig. 4). All map error functions are symmetric around x = 0. 

In the text 
Fig. 3. Optimal kernel shapes for α = 1.5 and α = 6 with achieved accuracy ϵ. 

In the text 
Fig. 4. Map error function of different kernel shapes for α = 1.5 and α = 6. A leastmisfit kernel for a slightly lower oversampling factor is added for qualitative comparison (see the main text for a discussion of this choice), as well as the classic spheroidal function kernel. The arrows highlight the differences of the supremum norm of map error function of the different kernels with respect to our modified ES kernel. 

In the text 
Fig. 5. Accuracy of R^{†}. The ratio of measured root mean square error to the requested accuracy ϵ is plotted as a function of ϵ itself. The grey line denotes the identity function. Points lying in the region below the line represent configurations that are more accurate than specified by the user. 

In the text 
Fig. 6. Strongscaling scenario. The vertical dotted gray line indicates the number of physical cores on the benchmark machine. Efficiency is the theoretical wall time with perfect scaling divided by the measured wall time and divided by the singlethread timing of ‘R^{†} ducc’. 

In the text 
Fig. 7. Comparison to FINUFFT. The vertical dotted grey line indicates the number of physical cores on the benchmark machine. Efficiency is the theoretical wall time with perfect scaling divided by the measured wall time and divided by the singlethread timing of ‘ducc’. 

In the text 
Fig. 8. Wall time vs. specified accuracy ϵ measured with six threads. 

In the text 
Current usage metrics show cumulative count of Article Views (fulltext article views including HTML views, PDF and ePub downloads, according to the available data) and Abstracts Views on Vision4Press platform.
Data correspond to usage on the plateform after 2015. The current usage metrics is available 4896 hours after online publication and is updated daily on week days.
Initial download of the metrics may take a while.