Improved cosmic microwave background (de-)lensing using general spherical harmonic transforms

Deep cosmic microwave background polarization experiments allow a very precise internal reconstruction of the gravitational lensing signal in pricinple. For this aim, likelihood-based or Bayesian methods are typically necessary, where very large numbers of lensing and delensing remappings on the sphere are sometimes required before satisfactory convergence. We discuss here an optimized piece of numerical code in some detail that is able to efficiently perform both the lensing operation and its adjoint (closely related to delensing) to arbitrary accuracy, using nonuniform fast Fourier transform technology. Where applicable, we find that the code outperforms current widespread software by a very wide margin. It is able to produce high-resolution maps that are accurate enough for next-generation cosmic microwave background experiments on the timescale of seconds on a modern laptop. The adjoint operation performs similarly well and removes the need for the computation of inverse deflection fields. This publicly available code enables de facto efficient spherical harmonic transforms on completely arbitrary grids, and it might be applied in other areas as well.


Introduction
Weak gravitational lensing by large-scale structure affects radiation from the cosmic microwave background (CMB) in subtle but important ways (Lewis & Challinor 2006) by distorting and smoothing the primordial near isotropic two-point statistics and introducing a large trispectrum that can now easily be detected with very high significance (Planck Collaboration VIII 2020;Carron et al. 2022;Qu et al. 2023).Extraction of the signal is now an important piece of the science case of many CMB experiments (Abazajian et al. 2016;Challinor et al. 2018;Ade et al. 2019) because the lensing potential power spectrum, which probes the formation of structure at high redshift, is thought to be a particularly clean probe of the neutrino mass scale (Lesgourgues & Pastor 2006;Hall & Challinor 2012;Allison et al. 2015).Another reason is that removal of the lensing signal (delensing) from the CMB polarization B-mode (Zaldarriaga & Seljak 1998;Knox & Song 2002;Kesden et al. 2002) is now compulsory in order to place the best constraints on a primordial background of gravitational waves from inflation (Ade et al. 2021;Tristram et al. 2022).
It has long been known that deep high-resolution observations of the CMB polarization in principle allow an extremely good internal reconstruction of the lensing signal (Hirata & Seljak 2003b).Recent years have seen works trying to achieve this goal of capturing a signal-to-noise ratio that is as high as possible in realistic experimental configurations (Carron & Lewis 2017;Millea et al. 2019Millea et al. , 2021;;Millea & Seljak 2022;Legrand & Carron 2022, 2023;Aurlien et al. 2023), and some attempts were made on data as well (Adachi et al. 2020;Millea et al. 2021).These methods have in common that they use a likelihood model of the lensing signal, which allowed them to outperform the now standard quadratic estimators (Hu & Okamoto 2002;Okamoto & Hu 2003;Maniyar et al. 2021), which are limited by the amount of lensing B-mode power in the data.
These likelihood-based methods form the main motivation for this work.They are typically much more expensive than current quadratic estimator analysis: Properly modeling the subtle, ∼2 arcmin remapping effect of gravitational lensing or delensing on CMB maps requires working at high resolution, and these operations must be performed many times before convergence.Numerically speaking, harmonic transforms of data distributed on the sphere are significantly costlier than in flat space.Only a handful of works were so far able to run such optimized reconstructions on large sky fractions and take the sky curvature into account (Aurlien et al. 2023;Legrand & Carron 2022, 2023).
In order to be slightly more concrete, we consider for example the problem of recovering the unlensed CMB from data and an estimate of the lensing deflection.The unlensed CMB likelihood conditioned on the lensing map is Gaussian, but with a covariance that is anisotropic owing to the deflection.If no further approximation is made, recovery of the delensed CMB will involve the inverse covariance (Wiener-filtering), which today can only be performed with iterative methods such as a conjugate gradient, and needs approximately ten iterations to converge.Each iteration requires two remapping operations (one forward operation, and the second operation is the adjoint operation).The construction of optimal lensing mass maps internally from the CMB proceeds by iteratively applying these delensing steps and measuring residual lensing (Hirata & Seljak 2003a,b), and at best, it requires about approximately ten delensing iterations.Hence (again, in the absence of approximations), it seems difficult to reconstruct a single best-lensing mass map with fewer than approximately 200 remapping operations.Sampling methods (Millea et al. 2019(Millea et al. , 2021) ) are in principle orders of magnitude costlier still.
For these reasons, we have developed an optimized piece of code that is able to perform the deflection operation and its adjoint (the operations are properly defined below) efficiently.While several pieces of software are publicly available to perform the forward-deflection operation, we find that our implementation outperforms them by far, and the adjoint operation (equally as important for likelihood-based methods) appears to be new.

Lensing and delensing the cosmic microwave background
To a very good approximation that is valid for next-generation CMB experiments such as CMB-S4 1 (Abazajian et al. 2016), the effect of gravitational lensing is that of a remapping of points on the sphere.The observed CMB intensity signal at position n′ is related to that of an unlensed position by the relation where n′ is located at a distance α from n along the great circle generated by the deflection field α. Figure 1 shows the geometry and our notation.In polarization, and more generally, for any spin-weighted field s T , there is an additional phase factor that is sourced by the change in the local basis between the deflected and undeflected position (hence mostly relevant only near the poles; Challinor & Chon 2002).This may be written where χ = β − β ′ in Fig. 1.We are primarily interested in efficient implementations of both the deflection operator, D α , which from a band-limited set of harmonic modes results in the lensed map on some arbitrary locations, or pixels, ni , and of its adjoint D † α .
1 www.cmb-s4.org The forward operation can be written explicitly (for spin-0 fields) as and is thus closely related to the problem of finding an efficient forward spherical harmonic transform to an irregular grid.The adjoint takes a map as input, together with a set of deflected coordinates, to produce harmonic coefficients as follows (again, here for a spin-0 field): for |m| ≤ ℓ ≤ ℓ max .In the most general situation, the points ni and n′ i are completely arbitrary, such that the code presented here forms in fact a spherical harmonic transform (SHT) pair that works on any pixelization of the sphere.
In situations like those encountered in CMB lensing, the points ni cover the sphere according to a reasonable sampling scheme (e.g., a Healpix; Górski et al. 2005 or Gauss-Legendre grid), and n′ i are the deflected coordinates given by Fig. 1.When quadrature weights are added to Eq. ( 4), the sum becomes an approximation to an integral over the observed coordinate, This is different to the operation inverse to Eq. ( 3), (delensing), as we discuss now.
If the remapping of the sphere is invertible (which is always the case in the weak-lensing regime), we can perform a variable transformation to the unlensed coordinate n′ and obtain where |A(n)| = |d 2 n ′ /d 2 n| is the Jacobian (magnification) determinant of the lensing remapping.Equation ( 6) now has the form of a standard SHT of (T/|A|)(n), where n is matched to n′ as in Fig. 1 (hence, T len /A is first delensed, and then mapped back to harmonic space).The choice of an isolatitude grid for n′ provides one way to calculate this integral quickly with a standard backward SHT (Aurlien et al. 2023;Legrand & Carron 2022, 2023) implemented D † in this way.However, significant overhead can remain with this method because it requires calculating the inverse deflection angles n(n ′ ) on this grid.In a standard situation, the angles n′ (n) are easily obtained from a standard SHT of the deflection field on an isolatitude grid sampling the observed coordinate n, which does not provide n(n ′ ) when the unlensed coordinate n′ itself is sampled on such a grid.Moreover, usage of Eq. ( 6) requires the additional calculation of the magnification determinant, which has the cost of several forward SHTs (see Appendix B).The algorithm presented here bypasses this additional work and drops the requirement of an invertible deflection field.
For spin-weighted fields, the situation is almost identical.The harmonic space coefficients are split into a gradient (G ℓm ) and a curl term (C ℓm ), and the deflection operation is defined through A165, page 2 of 8 Reinecke, M., et al.: A&A, 678, A165 (2023) for s > 0. The sign is not consistent with the spin-0 case to accommodate for the most prevalent conventions in the community.This creates a complex map of spin weight s, whose complex conjugate can be referred to with the subscript −s.
The adjoint takes this complex map as input and calculates the two sets of coefficients which are decomposed as usual into gradient and curl modes, The phase χ is quite specific to CMB lensing applications and is absent in the general-purpose interpolation routines.The relation between the adjoint and inverse is unchanged from the case of spin-0 fields.

Implementation
Of the many implementations of the forward operation that were tested over the years, our approach is closest to that of Basak et al. (2008).The fundamentals are quite straightforward: the key point is that on the sphere parameterized by co-latitude θ and longitude φ, a band-limited function can be exactly written as a two-dimensional discrete Fourier series in θ and φ.While any function on the sphere is naturally periodic in the longitude coordinate, we must artificially extend the θ range to [0, 2π) to obtain a doubly periodic function.We can then apply nonuniform fast Fourier transform (NUFFT) techniques (Barnett et al. 2019) to this function to perform the desired interpolation.
The main steps of the forward operations can be summarized as follows: -synthesis: Synthesis of the map from T unl ℓm on a rectangular equidistant grid with n θ ≥ ℓ max + 2 and n φ ≥ 2ℓ max + 2 points with a standard SHT, with one isolatitude ring on each pole.Both dimensions are chosen in a way to make FFTs of lengths n φ and 2n θ − 2 efficient; in addition, n φ must be an even number to enable the subsequent doubling step.For a standard CMB lensing application requiring ℓ max ∼ 5000, this corresponds to a sampling resolution close to 2 arcmin.The asymptotic complexity is O(ℓ 3 max ).-doubling: Doubling the θ-range from [0, π] to [0, 2π), by creating a (2n θ − 2, n φ ) map, with the original map in the first half, and mirrored image θ → 2π − θ and φ → φ + π in the second half.In the case of odd spin values, the mirrored image also takes a minus sign.
-FFT: Going to Fourier space with the help of a standard forward 2D FFT.These coefficients by construction contain all information necessary to evaluate the map perfectly at any location.The asymptotic complexity is O(ℓ 2 max log ℓ max ).-NUFFT: Finally, from these Fourier coefficients and the coordinates n′ , interpolation proper, performed with a uniformto-nonuniform (or type 2) NUFFT.The asymptotic complexity is O(ℓ 2 max log ℓ max ) + O(n points ), where n points is the number of points on the irregular grid.This is the only step that incurs inaccuracies beyond those introduced by the finite precision of floating-point arithmatics.These inaccuracies are controlled by a user-specified parameter ϵ that was described in some detail in Arras et al. (2021).
The first three steps given here construct the 2D Fourier coefficients of the doubled-sphere representation of the same map from the spherical harmonic coefficients.Our approach to performing this (synthesis-doubling-FFT) starts with an SHT (synthesis).Consistent with the given asymptotic complexities, we find that this typically dominates the execution time overall.There are alternative approaches, however by manipulating Wigner small-d matrices, we can build a more explicit representation of the relation between spherical and 2D Fourier harmonics that can be implemented via a well-behaved three-term recurrence formula without any Fourier transforms and with a similar O(ℓ 3 max ) theoretical complexity.This is how Huffenberger & Wandelt (2010) implemented their SHTs, for example, and how Basak et al. (2008) implemented their CMB remapping.In place of a Legendre transform for each θ coordinate, a Legendre transform is required only at the equator, but one spin-weighted transform is required for each spin between 0 and ℓ max .In our case, previous measurements (see, e.g., Sect. 2 of Galloway et al. 2023) showed that it was difficult to bring this recursion to speeds on CPUs that were comparable to the highly optimized standard Legendre transforms derived from the libsharp library (Reinecke & Seljebotn 2013) we are using.While we cannot exclude that there is some room for improvements provided the recursion can be optimized in a similar fashion (or, possibly, on GPUs), these are likely to be minor.
For the adjoint operation, the steps naturally go backwards (the individual complexities stay unchanged): -NUFFT: From the input map and deflected coordinates, we perform a nonuniform-to-uniform (or type 1) NUFFT, resulting in the 2D FFT Fourier coefficients.
-FFT: We remap them to position space using a standard backward FFT on the same doubled Fourier sphere of shape (2n θ − 2, n φ ) as for the forward operation.
-undoubling: The doubling of the Fourier sphere is undone by adding its mirror image to (or, for odd spins, subtracting from) the part [0, π].
-adjoint synthesis: We perform a standard adjoint SHT on this new map of shape (n θ , n φ ).This gives us the desired spherical harmonic coefficients.
Most of these steps are well established 2 in the astrophysical community or can be understood intuitively, with the possible exception of the NUFFT, whose purpose and structure we therefore outline (for more theoretical and technical details, see Potts et al. 2001;Greengard & Lee 2004;Barnett et al. 2019).When given the Fourier coefficients of an n-dimensional function, it is trivial to obtain the function values on a regular grid in real space, in (almost) linear time.This is achieved using the fast Fourier transform, potentially after zero-padding the Fourier coefficients to increase the grid resolution.If the function values are required at irregularly spaced positions, however, this is not possible, and naive calculation of the Fourier series at each point is typically prohibitively slow in practice.One mathematically correct approach is to perform an FFT to a regular grid and then convolve these points with a sinc kernel centered on the desired locations (sinc-interpolation).This is just as prohibitive, 2 Fast Fourier transforms are handled with code derived from the pocketFFT library (https://gitlab.mpcdf.mpg.de/mtr/pocketfft), which in turn is a descendant of FFTPACK (https:// netlib.org/fftpack/),which was heavily modified for an improved accuracy and performance.
A165, page 3 of 8 Reinecke, M., et al.: A&A, 678, A165 (2023) but approximate solutions can be found by choosing an alternative and more suitable convolution kernel.In particular, a kernel with compact support will be chosen (in our case, it takes the form of Eq. ( 29) of Arras et al. 2021), and we divide the Fourier coefficients of the function by the Fourier coefficients of the kernel.This is the deconvolution step, and is followed by zeropadding the Fourier coefficients which increases the size of each dimension by a factor of roughly 1.2 to 2, and perform an FFT of the resulting array.At last comes the convolution step: for each irregularly spaced point, we perform a sum over its neighborhood and weight it by the kernel function.
Depending on the chosen kernel shape, zero-padding factor, and kernel support, it is possible to achieve accuracies close to machine precision; tuning the algorithm for higher tolerances is also possible and will improve the run time considerably.Good kernels compromise between the conflicting properties of being fast to evaluate and having small support and a quickly decaying Fourier transform.We used the variations of the kernel proposed in Barnett et al. (2019), as discussed in Arras et al. (2021).
Because all steps mentioned above are linear operations, the adjoint of the described NUFFT is obtained by executing the adjoint steps in reverse order, which is the nonuniform-touniform, or type 1, NUFFT.
Finally, for CMB-lensing applications, the unlensed angles n′ can easily be gained from the representation where at each point, the deflection vector is obtained from the spin-1 transform of the gradient (ϕ) and curl (Ω, if present) lensing potentials, Here, we follow the convention of using capital letters LM to refer to the spherical harmonic coefficients of the lensing potential.

Benchmark
This section discusses the computational cost, the scaling with threading, and memory usage of the forward and adjoint operation, and it compares our work to a few publicly available implementations of the forward operation, such as LensPix A typical task in the CMB lensing context is to compute lensed CMB spherical harmonic coefficients (a len ℓm ) starting from unlensed ones (a unl ℓm ).This can be achieved by first performing a forward remapping onto a suitable pixelization of the sphere, and computing the spherical harmonics coefficients by a standard backward SHT.The adjoint of this entire operation is first built out of a forward SHT, followed by the adjoint remapping.Key parameters impacting the execution time are the band limit ℓ unl max of a unl ℓm and the requested maximum multipole ℓ len max of the lensed CMB.In the applications that motivated this work, ℓ len max is at most the multipole above which the information on the lensing signal becomes negligible.For example, Planck reconstructions (Planck Collaboration VIII 2020; Carron et al. 2022) used ℓ len max = 2048, and the recent ACT results (Madhavacheril et al. 2023) ℓ len max = 3000.For a very deep future polarization experiment such as the CMB-S4 deep survey (Abazajian et al. 2016), this is closer to ℓ len max = 4096, a number we use often as reference in this section.Then ℓ unl max must typically be taken slightly higher than ℓ len max in order to account for the mode mixing by lensing.When lensed spherical harmonic coefficients are built in this way, another important parameter is the intermediate grid pixelization.Because the lensed CMB is not band limited, there is no exact quadrature rule, and this choice can strongly impact the accuracy of the recovered a len ℓm at high multipoles.We show in Fig. 2 the execution time of these tasks for the forward (upper panel) and adjoint (lower panel) cases as a function of the number of threads (strong scaling).We picked ℓ unl max = 5120, ℓ len max = 4096, and a longitude-thinned Gauss-Legendre grid with 6144 rings as intermediate grid (∼5 × 107 pixels).This is a conservative configuration that in our experience allows a robust and accurate lensing reconstruction for very deep stage IV CMB observations.This was performed on a CSCS Piz Daint XC50 compute node 7 , with 12 physical cores.Results on more recent processors can sometimes be up to twice as fast in our experience.The top panel shows the result for the synthesis operation, while the lower panels shows the adjoint of it, and we see that each operation scales almost perfectly.We have excluded the cost of building the undeflected angles from the timing, which we briefly discuss below, as is suitable in the context of optimal lensing reconstruction, where many maps are deflected with the same set of angles.The accuracy of the interpolation was set to 10 −5 , which value is good enough for most purposes, as also discussed in more detail below.The interpolation is very efficient.Only a minor part of the total time is dominated by the pair of SHTs that is involved.We now discuss the accuracy and memory usage of our implementation.Our code uses simple heuristics to allow the user to choose a target-relative accuracy, ϵ target , of at least 10 −13 .For the forward operation, schematically a unl lm → m, we calculated a map-level effective relative accuracy, ϵ eff , as follows: We took the difference between the true and estimated lensed maps, mtrue , mest , respectively, and normalized by the total power of the true lensed maps, Here, P true = i (m true i ) 2 , and the sum index i run over pixels as well as the map components for nonzero spin (Q and U in polarization).The maps mtrue were determined by a brute-force approach: We remapped the unlensed map to the desired grid by calculating Eq. ( 3) explicitly.As this can be fairly expensive, we calculated 10 5 exactly lensed pixels at most for a few numbers of isolatitude rings close to the equator.
The right panel in Fig. 3 shows the computational cost of the lensing routine for spin-2 fields at ℓ unl max = 4096, using four CPUs, mapping onto a Healpix pixelization with N side = 2048, as a function of ϵ eff (bottom axis) and ϵ target (top axis).The effective accuracy is typically slightly higher than requested, except in the vicinity of 10 −13 .This is not a problem of the NUFFT interpolation, however, but rather of the true SHT calculations perfect scaling spin-0  2. Execution time of different transforms as a function of number of threads used for the calculation.Upper panel: scaling of a single forward-lensing execution time, producing lensed spherical harmonic coefficients from their unlensed counterparts.The solid blue line shows the total time of producing a len ℓm from a unl ℓm in the polarized spin-2 case, and the solid orange, green, and red curves show the synthesis, interpolation proper and final backward SHT contributions.The black line shows a perfect scaling with the inverse thread number for comparison and the scaling of each and every operation is almost perfect.Dashed lines show the spin-0 results, for which harmonic transforms are substantially faster.The forward-lensing operation is built out of remapping on a pixelized sphere (orange and green, the latter being the remapping step, properly speaking), and sending it back to harmonic space with a standard spin-2 spherical harmonic transform (red).See the text for the precise specifications.These curves do not include the cost of calculating the deflected angles from the deflection field spherical harmonic coefficients, which is comparable to that of a spin-1 forward spherical harmonic transform.Lower panel: corresponding results for the adjoint operation (note that while the adjoint operation is closely related to delensing, this is not the operation inverse to forward lensing; see text).The dashed lines show the corresponding results for the spin-0 case.themselves, which lose several digits in accuracy at this band limit.The diagram shows the split of the total computational cost into its most relevant steps described earlier.The SHT steps for the angle and doubled Fourier sphere always dominate the cost in this configuration, and the choice of accuracy has a fairly minor impact.If relevant, the SHTs can be performed with a gradientonly setting for nonzero spin fields (an implementation specific to the case of a vanishing curl component), further reducing the computational cost of these tasks by about 25%.The scale of the Fig. 3. Full-sky forward operation execution times for spin-2 fields of our implementation compared with lenspyx, which uses a popular bicubic spline interpolation technique.The top panels show lenspyx (left) and the new implementation (right) as a function of the effective relative accuracy for ℓ max = 4096, producing lensed CMB maps in a Healpix geometry with N side = 2048.The effective accuracy is calculated for a few isolatitude rings close to the equator.All calculations were made with four CPUs.We indicate the execution time (shaded areas, left y-axes), peak memory usage (green data points, right y-axis), and the accuracy parameters (target resolution, target accuracy) at the top x-axes.For lenspyx, we divided the full sky map into two bands to reduce the memory consumption.The differently colored areas show the main steps of the operation.The execution time and memory consumption both grows rapidly for lenspyx because it requires high-resolution grids for an accurate interpolation.peak memory usage is indicated on the right y-axis of the right panel, and is shown as connected green square data points.The timings, memory, and accuracy calculations were made for a set of five analyses, and their standard deviation is negligible.
The left panel shows the corresponding results for a code using a popular interpolation scheme, lenspyx.This code (just as LensPix and pixell) uses bicubic spline interpolation on a intermediate grid obtained by cylindrical projection.The resolution of this grid then essentially sets the accuracy of the result and memory usage.While these interpolation schemes are perfectly fine for the purpose of producing a set of lensed CMB maps, they have strong limitations for more intensive tasks, or when higher accuracies are imperative.For low target resolutions of about 1.4 arcmin, the new implementation speeds up the execution time approximately 7 times for a similarly effective accuracy of about ϵ eff ≈ 10 −2 .This increases to a speed-up of 30 times for an effective accuracy of 10 −5 .Higher accuracies are almost not manageable for lenspyx in both time and memory, whereas the new implementation can easily reach accuracies as low as 10 −12 .
Figure 4 shows the execution time and peak memory usage of the new implementation for various N side and ℓ unl max configurations and a target accuracy of ϵ target = 10 −13 .The memory usage for the most challenging benchmarks is still below 64 GB, and we would like to note that the memory consumption is only slightly larger than the memory needed for the three maps (the deflection angles) for the respective N side .
A165, page 5 of 8 Reinecke, M., et al.: A&A, 678, A165 (2023)   We now compare our results to other implementations that benchmark the spin-0 forward operation.The effective accuracy was calculated as in the above case, replacing the data by the spin-0 true and estimated temperature maps.For LensPix and Flints, we estimated the costs by using the natively Cimplemented lensing routine, and with OpenMP support and four CPUs.We measured the tasks directly related to lensing, such as the calculation of deflection angles, the deflection, the interpolation, and rotation.To farily compare all implementations, we did not cache any calculations in Flints, even though they are available.For pixell, we used the full-sky lensing routine implemented in Python, and measured the costs within Python.
Almost all implementations provide an accuracy parameter in form of the intermediate grid pixel resolution.We chose them to provide approximately comparable results.
For Flints, LensPix, and lenspyx, we generated input CMB and lensing potential realizations and the true and estimated lensed maps on a Healpix grid.For pixell, the full-sky lensing routine was evaluated onto a CAR geometry, and we calculated the effective accuracy using our own CAR geometry to calculate the true lensed map.
We used ℓ unl max = 4096 and an HPC machine with four CPUs and 60 GB of memory.To reduce memory usage, the data for LensPix were split into sets and were calculated individually.
Table 1 summarizes the computational costs and memory usage for the pixel resolution and effective accuracy.LensPix and Flints are well-established algorithms and perform the task in reasonable time and for very reasonable memory usage.LensPix allows us to control the intermediate grid pixel resolution, whereas with Flints, this can be indirectly controlled by the choice of N side .If needed, Flints execution time could be reduced by caching some of the calculations, making them available for repeated runs.pixell supports splitting of the full sky data into bands, which can be used to reduce the memory usage, if needed.The algorithm in this paper, shown in the bottom row, is vastly more efficient both in execution time and memory consumption, even more so in the high-accuracy regime.
Finally, we comment more specifically on the relevance of this code for a maximum a posteriori lensing reconstruction (Hirata & Seljak 2003b;Carron & Lewis 2017).In these methods, the reconstruction proceeds by finding an approximate solution to the Wiener-filtered delensed CMB using a conjugate gradient solver at each iteration that involves applying the forward and adjoint operation.As mentioned earlier, this typically involves several hundred such operations.The benefits for a reconstruction strategy like this are twofold.Faster operations directly speed up the lensing reconstruction, and accurate operations can prevent the conjugate gradient solver from building up errors or showing instability.
We saw this explicitly by testing the full-sky lensing reconstruction with CMB-S4-like settings ((ℓ unl max , ℓ len max ) = (4500, 4000), a polarization noise level of √ 2 arcmin, and 4 CPUs), reconstructing the lensing potential using polarizationonly maps, using both lenspyx, and the new implementation for the forward and adjoint operations.With lenspyx, we observed that a target resolution of about 1.0 arcmin is beneficial for a stable reconstruction of the very largest modes of the lensing potential in experimental configurations like these, resulting in execution times for a single adjoint operation of about 3 min.This is significantly longer than for the new implementation, for which we find execution times of only 8 s for better accuracy.Analogously, we find for the execution time of a typical fulliteration step of a lensing reconstruction ∼25 min, or ∼4 min, respectively, and the speed-up is even larger for lower ℓ max analyses (∼24 min vs. ∼2 for ((ℓ unl max , ℓ len max ) = (3200, 3000)).It is worth mentioning that with lenspyx at this resolution, outliers can occasionally occur.They can take more than 1 h or do not converge at all.This is directly due to the lower accuracy, in particular very close to the poles, where the bicubic spline method is less accurate.This has not been observed with the new implementation, whose error map is uniform across the entire sphere.We observed something similar for temperatureonly reconstruction, where an effective relative accuracy of 10 −7 generally appears to be required for a successful convergence of A165, page 6 of 8 Reinecke, M., et al.: A&A, 678, A165 (2023) the lensing-map search on the very largest scales.This accuracy is accessible only with the implementation discussed here.

Conclusion
We have described an optimized implementation of the spherical transform pair of an arbitrary spin-weight that can be used on any pixelization of the sphere, such as regular grids distorted by CMB lensing.A C++ implementation and comprehensive Python front-end is available, together with the low-level algorithms (FFT, NUFFT, and SHT), under the terms of the GNU General Public License and named DUCC 8 .The code is written with the goal of portability and does not depend on external libraries.It supports multithreading via the C++ standard threading functionality and will make use of CPU vector instructions on x86 and ARM hardware, if the compiler supports the respective language extensions.The Python interface is kept deliberately general and flexible to allow use in the widest possible range of scientific applications.As a consequence, parts of the interface are somewhat complex and are perhaps best used by higher-level more application-specific packages to hide unnecessary details from the end user.
For users interested in applications specific to CMB lensing, the lenspyx 9 Python package has been updated in this spirit to include these developments and provide additional wrappers to these routines.This results in an improvent of some orders of magnitude in execution time and accuracy over currently publicly available tools.

Fig. 1 .
Fig. 1.Lensing geometry and notation near the north pole.The sky curvature is suppressed for clarity.The deflection vector α(n) lies in the plane tangent to the observed coordinate n at latitude θ and longitude φ, and points toward the unlensed coordinate n′ , lying a distance α = |α(n)| away along the great circle generated by α.The lensing remapping for parallel-transported spin-weighted fields like the dashed green vector or ellipse receives a phase correction e is(β−β ′ ) from the rotation of the local θ and φ basis axes.

Fig. 4 .
Fig. 4. Benchmark of the execution time and peak memory consumption for a wide range of Healpix resolution N side and ℓ max , for a target accuracy of 10 −13 , and for our new implementation.The cell highlighted in magenta is the configuration we used for the accuracy benchmark in Fig. 3.The execution time ranges from 0.55 s (N side = 512, ℓ max = 1024) to 336 s (N side = 8192, ℓ max = 12 288), while the memory consumption increases from 0.4 GB to 51 GB.The memory consumption closely follows the memory needed to store the Healpix maps of that particular size.

Table 1 .
spin-0, ℓ max = 4096, and N side = 2048 (if applicable) forward-operation execution time and memory consumption for different implementations on the full sky, using 4 CPUs.Implementation Grid resolution Effective accuracy Computation time Memory peak usage Additionally, we split the full sky map for pixell and lenspyx into two bands to reduce memory consumption. Notes.