Learning sparse representations on the sphere
^{1} Laboratoire AIM, CEA, CNRS, Université ParisSaclay, Université Paris Diderot, Sorbonne Paris Cité, 91191 GifsurYvette, France
email: florent.sureau@cea.fr
^{2} Institut für Mathematik, Technische Universität Berlin, 10623 Berlin, Germany
^{3} Lehrstuhl für Wissenschaftliches Rechnen, Katholische Universität EichstättIngolstadt, Ostenstraße 26, 85072 Eichstätt, Germany
Received:
8
August
2018
Accepted:
27
October
2018
Many representation systems on the sphere have been proposed in the past, such as spherical harmonics, wavelets, or curvelets. Each of these data representations is designed to extract a specific set of features, and choosing the best fixed representation system for a given scientific application is challenging. In this paper, we show that one can directly learn a representation system from given data on the sphere. We propose two new adaptive approaches: the first is a (potentially multiscale) patchbased dictionary learning approach, and the second consists in selecting a representation from among a parametrized family of representations, the αshearlets. We investigate their relative performance to represent and denoise complex structures on different astrophysical data sets on the sphere.
Key words: methods: data analysis / methods: statistical / methods: numerical
© ESO 2019
Open Access article, published by EDP Sciences, under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. Introduction
Wavelets on the sphere (Starck et al. 2015) are now standard tools in astronomy and have been widely used for purposes such as Fermi Large Area Telescope data analysis (Schmitt et al. 2010; McDermott et al. 2016), the recovery of the cosmic microwave background (CMB) intensity and polarized CMB maps (Bobin et al. 2015, 2016), string detection (McEwen et al. 2017), point source removal in CMB data (Sureau et al. 2014), the detection of CMB anomalies (Naidoo et al. 2017; Rassat et al. 2014), or stellar turbulent convection studies (Bessolaz & Brun 2011). While wavelets are well suited for representing isotropic components in an image, they are far from optimal for analyzing anisotropic features such as filamentary structures. This has motivated in the past the construction of socalled multiscale geometric decompositions such as ridgelets, curvelets (Candès & Donoho 2004; Starck et al. 2003), bandelets (Le Pennec & Mallat 2005), or shearlets (Labate et al. 2005b). Extensions to the sphere of ridgelets and curvelets were already presented in Starck et al. (2006), Chan et al. (2017) and McEwen (2015), and also for spherical vector field data sets in Starck et al. (2009) and Leistedt et al. (2017).
For a given data set, we therefore have the choice between many fixed representation spaces (such as pixel domain, harmonics, wavelets, ridgelets, curvelets), which are also called dictionaries. A dictionary is a set of functions, named atoms, and the data can be represented as a linear combination of these atoms. The dictionary can be seen as a kind of prior (Beckouche et al. 2013), and the best representation is the one leading to the most compact representation, one in which the maximum of information is contained in few coefficients. For the previously mentioned fixed dictionaries, there exist fast operators for decomposing the data into the dictionary, and fast operators for reconstructing the image from its coefficients in the dictionary (Starck et al. 2015).
In some cases, it is not clear which dictionary is the best, or even if the existing dictionaries are good enough for a given scientific application. Therefore, new strategies were devised in the Euclidean setting to construct adaptive representations. Among them, sparse dictionary learning (DL) techniques (Engan et al. 1999; Aharon et al. 2006) have been proposed to design a dictionary directly from the data, in such a way that the data can be sparsely represented in that dictionary. DL has been used in astronomy for image denoising (Beckouche et al. 2013), stellar spectral classification (DíazHernández et al. 2014), and morphological galaxy classification (DíazHernández et al. 2016).
An alternative approach for adaptively choosing a dictionary is to start with a large parametrized family of dictionaries, and then to choose the parameter(s), either based on simulations or directly from the data. An example of such a parametrized family of dictionaries is the family of αshearlets (Labate et al. 2005; Grohs et al. 2016; Voigtlaender & Pein 2017).
In this paper, we propose to extend to the sphere both adaptive representation methods, DL and αshearlets, and we compare the performance of the two approaches. More precisely, we are concerned with adaptive sparsifying representation systems for data defined on the sphere. In Sect. 2, we present our approach for performing DL on the sphere, while Sect. 3 is devoted to our extension of the αshearlet transform to data defined on the sphere. We present the scenarios for our comparison of the two approaches in Sect. 4; the results of this comparison are presented in Sect. 5. Finally, we conclude the paper in Sect. 6. The necessary background related to αshearlets in the Euclidean setting is covered in Appendix A.
2. Dictionary learning on the sphere
Dictionary learning techniques were proposed in the early 2000s (Olshausen & Field 1996; Engan et al. 1999; Aharon et al. 2006) to build adapted linear representations that yield sparse decompositions of the signals of interest. Contrary to fixed dictionaries, in dictionary learning the atoms are estimated from the data (or a proxy, such as simulations or exemplars of the data), and can therefore model more complex geometrical content, which could ultimately result in sparser (and typically redundant) representations. DL techniques have proved their efficiency in many inverse problems in restoration, classification, and texture modeling (see, e.g., Elad & Aharon 2006; Mairal et al. 2008a, 2009; Peyré 2009; Zhang & Li 2010) with improved performance compared to fixed representations (see Beckouche et al. 2013 for denoising astrophysical data). A wide variety of dictionary learning techniques have been proposed to process multivariate data (Mairal et al. 2008a,b); to construct multiscale (Mairal et al. 2008b), translationinvariant (Jost et al. 2006; Aharon & Elad 2008), or hierarchical representations (Jenatton et al. 2011); to estimate coupled dictionaries (Rubinstein & Elad 2014); or to build analysis priors (Rubinstein et al. 2013). Also, online algorithms for dictionary learning have been considered (Mairal et al. 2010).
While fixed structured representations typically have fast direct and inverse transforms, dictionary learning techniques become computationally intractable even for signals of moderate size. Based on the observation that natural images exhibit nonlocal selfsimilarities, this computational problem is typically overcome by performing dictionary learning on patches extracted from the images that one wants to model. In this section we focus on this patchbased dictionary learning approach, and extend it for signals living on the sphere.
2.1. Sparse representation with patchbased dictionary learning
Given an n × n = N image represented as a vector X ∈ ℝ^{N}, we consider square overlapping patches x_{ij} in ℝ^{Q}, with Q = q × q, where q is typically small; in fact, in the present work we will always have q ≤ 12. Formally,
where the matrix R_{ij} ∈ ℝ^{Q × N} extracts a patch with its upper left corner at position (i, j).
From a training set 𝒯 of such patches {x_{ij}}_{(i, j)∈𝒯}, a dictionary with M atoms D ∈ ℝ^{Q × M} is then learned such that the codes Λ = {λ_{ij}}_{(i, j)∈𝒯} satisfying x_{ij} = Dλ_{ij} are sparse. To perform the training, one typically considers the following inverse problem, or one of its variants:
where 𝒟 (respectively 𝒞) is a nonempty convex set enforcing some constraints on the dictionary D (respectively the codes Λ), and μ ⋅ ∥λ_{ij}∥_{0} is the weighted ℓ_{0} pseudonorm, which enforces sparsity of the codes. To remove the scale indeterminacy in such a minimization problem – that is, if (D, Λ) is a solution, then so is (αD, α ^{−1}Λ), at least if αD ∈ 𝒟 and α^{−1}Λ ∈ 𝒞 – the set 𝒟 typically enforces each atom (column) of the dictionary to belong to a unit ℓ_{2} ball, while 𝒞 can enforce constraints in the code (e.g., nonnegativity in nonnegative matrix factorization). More details can be found in Starck et al. (2015).
2.2. Extension of patchbased dictionary learning to the sphere
To extend patchbased dictionary learning to data defined on the sphere, we first need to specify how to construct patches on the sphere. We do so by introducing local charts on the sphere. Specifically, in this work we propose to consider the HEALPix framework (Górski et al. 1999, 2005), widely used in astronomy, to construct these charts.
2.2.1. Defining patches on the sphere
HEALPix partitions the sphere into equal area pixels with curvilinear boundaries, defined hierarchically from a set of twelve base quadrilaterals (see Fig. 1). These twelve base elements (or faces) form an atlas of the sphere, and are further partitioned dyadically to obtain finer discretization levels. Consequently, each of the twelve faces is typically considered as a chart with HEALPix pixel positions mapped onto a square grid in [0, 1]×[0, 1].
Fig. 1. HEALPix grid (visualizing N_{side} = 16) in orthographic projection on the left and Mollweide projection on the right. Faint lines indicate the circles of latitude . The right image also introduces the numbering of the faces, used in the following illustrations. 

Open with DEXTER 
Using these charts to perform usual Euclidean patchbased dictionary learning is straightforward, and would have the main advantage of applying dictionary learning directly onto the pixel values, without requiring any interpolation. This comes, however, with two drawbacks: first, this approach introduces boundary issues even when using overlapping patches on each face; second, sampling on the sphere leads to patches with local characteristics (e.g., the pixel shape varies along the latitude in HEALPix). The first of these two problems can be overcome by creating the patches based on local neighbors, as defined by HEALPix. Because of the regularity of the HEALPix sampling scheme, all pixels have eight neighbors, except for eight pixels on the sphere that are located at the vertices in between equatorial and polar faces, which only have seven neighbors. The second problem, however, implies that the same signal defined continuously on the sphere, but centered at different patch centers, will likely lead to different patches being extracted (e.g., for a patch in the equatorial region or in the polar caps). We do not take this effect into account, so that these patches may have a different sparse decomposition or different approximation errors. HEALPix is also not suited to efficiently represent band limited signals, since only approximated quadrature rules are then available to compute spherical harmonic coefficients (Doroshkevich et al. 2005).
Provided some care is taken on defining the respective position of each neighbor to a central pixel across the sphere, overlapping patches can be created – even in between the twelve HEALPix faces – without any interpolation, except at the patches crossing the specific points on the HEALPix grid, which only have seven neighbors. Interpolation strategies to compensate for these “missing” neighbors can be envisioned; but in this work we choose not to interpolate, which implies that for a few pixels around these points, we do not construct all overlapping patches. The final covering of the map is illustrated in Fig. 2, also including patches randomly selected on the sphere. Once these patches are extracted, classical dictionary learning techniques can be used to learn a sparse adapted representation.
Fig. 2. Example of our covering of the sphere with overlapping patches based on HEALPix neighborhoods for N_{side} = 128 and patch width q = 8 (note that in our numerical experiments, N_{side} = 2048 and the patch width is either q = 8 or q = 12). Several randomly selected patches on the sphere are also represented in color. The plotted value in gray indicates the number of overlapping patches including each pixel. Because the patch width is usually small with respect to the number of pixels per face, the number of overlapping patches varies in small regions around the pixels that only have seven neighbors. 

Open with DEXTER 
2.2.2. Learning a multiscale representation on the sphere
Our proposed approach for dictionary learning on the sphere can be extended to capture multiscale information as proposed in Ophir et al. (2011), namely, by learning a dictionary from patches extracted from a multiscale decomposition of the data. At lower scales, capturing meaningful information would require an increase in the patch size, and would ultimately lead to a computational burden impossible to handle. To capture this information without increasing the patch size, the decomposition is subsampled.
In this work, we use the Starlet decomposition for data on the sphere (Starck et al. 2006), with one dictionary learned per wavelet scale. Since all scales except the last one are bandlimited, subsampling can be performed without loosing information by adapting the N_{side} parameter to the maximal multipole at the level considered (typically dyadically decreasing, as illustrated in Table 1).
Parameters used for learning the multiscale dictionary for thermal dust data.
The resulting minimization problem for the multiscale dictionary learning problem reads
where X is the signal on the sphere, 𝒲^{(s)} extracts the scale s of the wavelet transform on the sphere according to the N_{side} chosen for that scale, R_{ij} is now extracting patches according to neighbors on the sphere for the patch indexed by (i, j) at scale s in training set 𝒯^{(s)}, and S is the total number of wavelet scales. For each scale s = 1, …, S, a dictionary D^{(s)} is therefore learned, giving coefficients collected in Λ^{(s)}; the hyperparameter μ^{(s)} is also allowed to change with the scale. Because the cost function is separable per scale, the minimization problem Eq. (3) is equivalent to solving S dictionary learning subproblems associated to each wavelet scale.
2.3. Our algorithm for patchbased dictionary learning on the sphere
In the training phase, the joint nonconvex problems described in Eqs. (2)–(3) are typically handled by alternating sparse coding steps and dictionary update steps. Here, a sparse coding step means that one minimizes Eq. (2) (resp. Eq. (3)) with respect to Λ (resp. Λ^{(s)}), with a fixed previously estimated dictionary. Similarly, a dictionary update step means that one minimizes Eq. (2) (resp. Eq. (3)) with respect to D (resp. D^{(s)}), with the fixed, previously estimated codes. Standard algorithms were proposed for both subproblems. In this work, we will use the classical dictionary learning technique KSVD (Aharon et al. 2006) with Orthogonal Matching Pursuit (OMP; Mallat & Zhang 1993; Pati & Krishnaprasad 1993) as a sparse coder. For denoising applications, the sparse coding step will encompass both a maximal sparsity level, and an approximation threshold based on the ℓ_{2} norm of the residual, similar to the approach in Elad & Aharon (2006). This approach resulted in adapted sparse representations, while not being sensitive to small fluctuations below the targeted level of approximation, and in practice led to faster algorithms. The resulting multiscale dictionary learning algorithm is described in Algorithm ^{1}, from which its variant without the multiscale transform can be obtained for S = 1 and 𝒲^{(1)} = Id.
The first critical choice for this dictionary learning technique is to adapt the patch size q to capture information at the scale of the patch without impacting too much the computational burden of the algorithm (q is at most 12 in this work). The maximal sparsity degree K^{(s)} and the number of atoms M^{(s)} should be selected so that the dictionary leads to small approximation errors, while being able to capture the important features with only a few atoms, in particular for denoising applications. The parameter ϵ^{(s)} is the noise level expected in the denoising application at the considered wavelet scale, and the number of iterations is in practice chosen to be sufficiently large so that the average approximation error does not change with iterations. Because this problem is nonconvex, it is crucial to initialize the algorithm with a meaningful dictionary; in our case, the initial dictionary is chosen to be an overcomplete discrete cosine transform (DCT) dictionary as in Elad & Aharon (2006).
3. Extension of αshearlets to the sphere
3.1. Euclidean αshearlets
Adaptive dictionaries can also be derived from a parametrized family of representations such as the αshearlets that generalizes wavelets and shearlets and are indexed by the anisotropy parameter α ∈ [0, 1]. To each parameter α corresponds a dictionary characterized by:

atoms with a “shape” governed by height ≈ width^{α} (see Fig. A.2);

a directional selectivity: on scale j, an αshearlet system can distinguish about 2^{(1 − α)j} different directions (see Fig. A.3);

a specific frequency support for the atoms (see Fig. A.3).
A key result (Voigtlaender & Pein 2017) is that αshearlets are almost optimal for the approximation of socalled C^{β}cartoonlike functions, a model class for natural images. More precisely, the Nterm αshearlet approximation error (that is, the smallest approximation error that can be obtained using a linear combination of Nαshearlets) for a C^{β}cartoonlike function is decreasing at (almost) the best rate that any dictionary can reach for the class of such functions. For this to hold, the anisotropy parameter α needs to be adapted to the regularity β, that is, one needs to choose α = 1/β. For more details on this, we refer to Appendix A.
In general, given a certain data set, or a certain data model, different types of αshearlet systems will be better adapted to the given data than other α′shearlet systems. Thus, having such a versatile, parametrized family of representation systems is valuable to adapt to a variety of signals to recover.
3.2. Defining αshearlet transforms on the sphere
In order to define the αshearlet transform on the sphere, similarly to what was discussed for the dictionary learning approach, we need to define the charts on which the Euclidean αshearlet transform will be applied. HEALPix faces are again an obvious candidate since these base resolution pixels can be interpreted as squares composed of N_{side} by N_{side} equally spaced pixels, although their shape is contorted in different ways on the sphere (see Fig. 1).
We could map the sphere to these twelve square faces and then take the αshearlet transform on every one of them individually. However, as for dictionary learning, this approach to the processing of HEALPix data (e.g., for the task of denoising) is deemed to introduce boundary artifacts for this partition of the sphere. An example of such artifacts can be seen in the upperleft part of Fig. 18 shown in Sect. 5. Besides, contrary to patchbased dictionary learning where the patch size remains typically small compared to a face size, the increasing size of the αshearlet atoms when going to lower scales can introduce large border effects.
In the following two subsections, we discuss two approaches for handling this problem. Similarly to dictionary learning in Sect. 2.2.1, we do not take into account the variation of the pixel shapes along the sphere when extending αshearlets to the sphere.
3.2.1. The rotationbased approach
The first strategy to alleviate the block artifacts was proposed for curvelets in Starck et al. (2006). This approach relies on considering overlapping charts that are obtained by considering HEALPix faces after resampling the sphere through a small number of rotations. More precisely, for a given Euclidean αshearlet system, a HEALPix face f, and a rotation r, the redundant coefficients are obtained by
where ℛ_{r} computes the resampled map by a rotation r of the sphere, H_{f} is a matrix extracting the pixels that belong to the HEALPix face f, and 𝒮_{α} computes the Euclidean αshearlet transform on this face. In practice, a bilinear interpolation is performed by the HEALPix rotation routines that are used for the resampling.
The reconstruction is performed using a partition of unity on the sphere (see Fig. 3), which is obtained from weights that are smoothly decaying from 1 in a central region of the faces to 0 at their borders and therefore mitigating border effects. Formally, the reconstruction reads
Fig. 3. Partition of unity for the rotationbased reconstruction. The weights smoothly decaying toward the border are presented in the top left panel and are copied to each HEALPix face in the top right panel. In the bottom left panel, resampling was first performed using a rotation and bilinear interpolation, and the image shows the weights that would be applied in the original reference coordinates. The resulting covering of the sphere using five rotations is illustrated in the bottom right panel. 

Open with DEXTER 
where ℛ_{−r} resamples the sphere with the inverse rotation matrix, 𝒯_{α} computes the inverse αshearlet transform, M applies weights, and the normalization matrix N is simply a pointwise multiplication with weights chosen such that where 1 is a vector with all entries equal to 1. An example of the weights and normalization maps used to construct this partition of unity is illustrated in Fig. 3.
Since the rotations ℛ_{r} and ℛ_{−r} are implemented using interpolation, it is not true exactly that ℛ_{−r}ℛ_{r}X = X. Therefore, even if the coefficients λ_{α, r, f} are obtained through Eq. (4), the reconstruction in Eq. (5) will only satisfy , not . However, the error introduced by the inexact inverse rotation is often negligible, at least for sufficiently smooth signals; Sect. 5.2.4 offers further comment on this.
3.2.2. The “patchwork” approach
The “patchwork” approach is another strategy to eliminate artifacts that would arise if one naively used the disjoint HEALPix faces. Contrary to the rotationbased technique, where an interpolation is performed during the resampling, the patchwork approach is based on extending the HEALPix faces using parts of the surrounding faces so as to avoid interpolation. Similar to the rotationbased approach, the six resulting extended faces (see Fig. 4) form a redundant covering of the sphere, which is beneficial for avoiding boundary artifacts. Once these six extended faces are computed, the αshearlet transform and all further processing are performed on these faces. Of course, for the reconstruction, the last step consists in combining the redundant faces to get back a proper HEALPix map.
Fig. 4. Left panel: twelve squares corresponding to the faces of the HEALPix framework (see Fig. 1) arranged as a net in the plane. The areas that are covered by multiple of the extended faces – the transition zones – are displayed in gray. The areas where pixels are “missing” are displayed in red. Right panel: six extended faces produced by the patchwork procedure. The two polar faces form the top row, followed by the four equatorial faces below. The shaded area around the transition zone of each composite face indicates the margin, which is later discarded. 

Open with DEXTER 
Formally, the decomposition can be described as
where 𝒫_{f} is now the operator that extracts the extended face f from the HEALPix map X. Similarly, the reconstruction reads
where ℳ is the operator that reconstructs a HEALPix map from data on the six extended faces.
The rest of this section explains how precisely the extended faces are obtained from the original HEALPix faces, and conversely how a HEALPix map can be obtained from data on these six extended faces. For an accompanying visual explanation of the procedure, the reader should consult Figs. 1, 4, and 5.
Fig. 5. Detailed view of two of the six extended faces. The dark outer boundary with width c_{m} is the margin that is discarded after the reconstruction step, and the two dark squares in the corners of the equatorial face on the right are treated likewise. The remaining part of the extended faces has a gray outer boundary of width 2c_{t}. In conjunction with the gray squares in the corners of the equatorial face, this boundary forms the transition zone that contains the values shared with the neighboring extended faces. 

Open with DEXTER 
Each of the six extended faces consists of an inner square with HEALPix pixels that are unique to this extended face, and a border zone with HEALPix pixels that appear in several of the extended faces. The border itself is again subdivided into an outer margin that is disregarded after the reconstruction step so that the artifacts at the boundary are cut off (not mapped to the sphere), and an inner part that forms a transition zone, where the values of neighboring faces are blended together to prevent visible discontinuities between them.
Instead of extending all twelve original faces, we combine them to six bigger composite faces and extend those. This reduces the number of additional pixels that have to be processed (when using a border of the same size), at the cost of increased memory requirements. The first two composite faces cover the bulk of the north and south polar regions, and particularly the poles itself. Since the four faces of each polar region meet at the poles, we can arrange those four faces to form a square around the pole. It only remains to clip this area to the requested size. Although there is much freedom to set the extent of the individual composite faces, we prefer all squares to be of equal size, so that they can be processed without distinction. The remaining four composite faces are obtained by expanding the equatorial faces. An expansion of the equatorial faces by in each direction results in areas of width , which each contain a fourth of every surrounding polar face. By removing those parts from the polar areas, constructed earlier, those are truncated to the same width (see Fig. 5). Thus, we get six areas of equal size that cover the sphere. Chosen this way, there is still no overlap between the polar and equatorial composite faces; therefore we extend each face further by half the requested width of the transition zone. We chose an extension of width (that is c_{t} in Fig. 5). Since each face enters its neighbors territory by that amount, this results in a transition zone of width between each face. Additionally each face is extended by a margin (that is c_{m} in Fig. 5) to avoid border artifacts. Here, a margin of width was chosen.
However, to extend the equatorial faces, we have to address the problem that there are eight vertexes where two faces of a polar region meet a face of the equatorial region (located on the circles of latitude θ = cos^{−1}(±2/3), depicted in Fig. 1). By arranging the twelve faces as a net in the plane – as illustrated in Fig. 4 – it becomes clear that there are gaps between the polar faces, where no values exist; these areas are marked in red in Fig. 4. We need to fill those gaps in order to obtain rectangular extended faces, to which we can apply the αshearlet transform. In the end, these parts will be cut away and disregarded like the outer margin of the extension, so the filledin values will not actually be used for the reconstruction. Nevertheless, we need to be careful, since otherwise we might introduce additional artifacts like the ones at the boundary.
For the sake of simplicity, we will describe the situation at the edge between faces 1 and 2 (see Figs. 1, 4, and 6), which is exemplary for all gaps. From the perspective of face 2, the missing square is expected to feature a rotated copy of face 1, while conversely face 1 expects a rotated copy of face 2. To fabricate a weighted blending of those anticipated values, we divide the empty square, interpreted as [0, 1]^{2}, along the lines 2x = y, x = y, and x = 2y, into quarters, as demonstrated in Fig. 6. On both outer quarters the full weight is assigned to the face which the adjoining face expects, while the two middle quarters serve to produce a smooth transition. All weights are normalized in such a way that every pixel is a convex combination of the pixels of the two faces; that is, the weights are nonnegative and their sum is one at each pixel.
Fig. 6. “Missing” square between faces 1 and 2 is divided into four triangles of equal size, separated by the lines 2x = y, x = y, and x = 2y, as seen on the left. The two images in the middle reveal how the rotated faces 1 and 2 are separately weighted along those segments. The data of face 1 have full weight (black) on the outer triangle adjacent to face 2, and no weight (white) on the other outer triangle, while the data of face 2 are treated conversely. A smooth transition is provided by the weights on the triangles in between. The sum of the weighted faces is used to fill the gap, as demonstrated in the rightmost illustration. 

Open with DEXTER 
With this process, we fill the vertex regions with values. We do not actually need to fill the whole square, but only the corner needed for the expansion (the red part in Fig. 4). Having done this, we can piece the equatorial faces together from the various parts of the six surrounding faces and two filler squares. Figure 4 shows the resulting extended faces on the right.
We have now described the operators 𝒫_{f} appearing in Eq. (6), which assign to a given HEALPix map X the six extended faces 𝒫_{1}(X),…,𝒫_{6}(X). On these rectangular faces, we can then apply the usual αshearlet transform, and do any further processing that is desired (for instance, we can denoise the six extended faces by thresholding the αshearlet coefficients).
After the processing is done on the six extended faces, the outer margin and filler values are disregarded and the remnant is separated along the boundaries of the original faces. From these pieces, the original faces are put back together. While doing so, all pixels that were part of a transition zone are weighted, similarly to above, as a convex combination of the pixels of the (up to four) involved extended faces.
Since we use only the values provided by the HEALPix grid, and instead of interpolating between pixels use convex combinations of pixel values in the transition zones, the patchwork procedure is invertible, with Eq. (7) describing a left inverse to the “patchwork αshearlet coefficient operator” described in Eq. (6). Thus, the patchworkbased αshearlets form a frame. We emphasize, however, that the reconstruction procedure described in Eq. (7) is not necessarily identical to the one induced by the canonical dual frame of the patchworkbased αshearlet frame.
4. Experiments
To evaluate αshearlets and dictionary learning, we have selected two different simulated data sets on the sphere:

Thermal dust map: a full sky thermal dust map from the Planck Sky Model (100 GHz map) (Planck Collaboration XII 2016), obtained through the Planck Legacy Archive^{1}.

Horizon full sky maps: a series of full sky maps from the Horizon Nbody simulations describing the dark matter halo distribution between redshift 0 and 1 (Teyssier et al. 2009)^{2}.
While in the former scenario, the signal is smooth and expected to be best represented by multiscale transforms, in the latter the signal is more discontinuous and geometrically composed of filamentary structures joining clusters, with density changing with redshift. These two simulations are therefore illustrative of different scenarios where such adaptive transforms would be useful.
To evaluate the respective performance of DL and αshearlets for denoising, we have added to the thermal dust map an additive white Gaussian noise with standard deviation 45 μK, which corresponds to the expected level of CMB at such frequency. The resulting map can be seen in Fig. 7.
Fig. 7. Thermal dust simulation map (at 100 GHZ) without (top panel) and with the additive white Gaussian noise added (bottom panel), for evaluation of the methods. The colorscale has been stretched to illustrate the challenge of recovering structures at intermediate latitude. Units are in μK. 

Open with DEXTER 
The galactic mask used for quantitative comparisons to separate regions of high dust amplitude from regions with lower values at higher galactic latitude is displayed in Fig. 8, along with the location of a region close to the galactic plane where the differences between the methods could be better visualized.
Fig. 8. Left panel: galactic mask used for thermal dust quantitative evaluation, covering 70% of the sky. Right panel: region close to galactic plane where methods are inspected. 

Open with DEXTER 
For the dark matter halo distribution, we select the first slice of the data cube, and adjust the white noise level to 5, so that filamentary structures are of a similar amplitude to the noise, as can be observed in Fig. 9. This noise does not correspond to something realistic in our actual experiments, but our goal here is only to evaluate how different adaptive representations behave when extracting features embedded in Gaussian noise.
Fig. 9. Dark matter halo distribution for the first slice, without (top panel) and with the additive white Gaussian noise added (bottom panel), for evaluation of the methods. The colorscale has been stretched to visualize filamentary structures. 

Open with DEXTER 
In the following two subsections, we outline the precise choice of the hyperparameters that we used for the αshearlets and for the dictionary learning based denoising, respectively.
4.1. Parameters for αshearlets
For the two αshearlet approaches, we used 11 values of α, sampled uniformly with a density of 0.1 ranging from 0 to 1. We used four scales of decomposition, using either the rotationbased approach (Eq. (4)), or the patchwork approach (Eq. (6)). For the actual denoising, we performed a hard thresholding of the αshearlet coefficients. For this, we used different detection thresholds on different scales. To be precise, we used a 4σ detection threshold for the last scale with a lower signal to noise ratio, and a detection threshold of 3σ for the other scales; for the coarse scale, however, we did not do any thresholding. The reconstruction was then performed using either Eq. (5) or (7).
For the rotationbased approach, five rotations were selected as a balance between having “more uniform” weights and the computational burden of this approach. The weight maps were built using a margin and transition (smooth trigonometric variation in between 0 and 1) of size .
For the patchwork approach, we set the size of both the utilized extension and the margin to , which results in increasing the number of pixels that have to be processed by about half (53.1%). A little less than half of the added pixels are used for the sake of redundancy, and the rest is disregarded.
4.2. Dictionary learning parameters
For the thermal dust data where the information is present at several scales, we chose the multiscale dictionary learning technique. Three wavelet scales of the Starlet transform on the sphere (Starck et al. 2006) were first computed from the input simulated dust map without noise. To avoid artifacts for a non bandlimited signal, the finest wavelet scale has not been directly computed through its spherical harmonic decomposition. We followed Algorithm ^{1} for the learning procedure, with the parameters listed in Table 1. The patch size, the number of atoms, and the maximal sparsity were selected experimentally by choosing values that lead to the lowest average approximation error during the training phase.
An example of a dictionary learned for this adaptive multiscale representation of thermal dust is shown in Fig. 10. The dictionaries have captured at various scales both directional and more isotropic structures.
Fig. 10. Atoms learned in the multiscale dictionary learning approach: on the left, scale 3, on the right, scale 2. The dictionaries have departed from the original redundant DCT dictionary and have learned specific features related to the scale. Due to the change of the N_{side} parameter with the scale, the actual distance between two adjacent pixels has increased, and the atoms for scale 2 are indeed smoother than those for scale 3. 

Open with DEXTER 
In the second scenario, because information is localized in space, the dictionary was learned directly on patches extracted from the first slice describing the dark matter halo distribution, from a training set of 200 000 patches of size 8 × 8. As in the previous experiment, a stopping criterion was set for the approximation error (which should be less than the targeted level of noise), and a maximal sparsity of 7 was set for OMP. KSVD was then run for 100 iterations. The learned dictionary is presented in Fig. 11. The atoms essentially contain high frequency information in this case, in contrast to the previously learned distribution on thermal dust.
Fig. 11. Atoms learned in the dictionary learning approach, applied to the dark matter halo distribution data. The dictionary elements are composed of pointlike structures and edges. 

Open with DEXTER 
Once these dictionary are learned, the sparse decomposition step with this representation is used for denoising. The same parameters as above were used for the sparse coding, except for the targeted approximation error, which was set to a value that would not be exceeded by a patch of pure noise with a probability of 0.9545.
5. Results
5.1. Denoising experiments
We tested our adaptive approaches to denoise the data in the two denoising scenarios presented in the previous section, using the parameters described in Sects. 4.1 and 4.2. For the thermal dust simulation, the full sky denoised maps using the three approaches are displayed in Fig. 12, with a zoom to a region close to the galactic plane in Fig. 13, to visually inspect the differences between methods. Residuals on the full sphere are also shown in Fig. 14, and the performance of the different approaches is quantitatively evaluated in Table 2 in the full sky as well as in regions defined by the galactic mask.
Fig. 12. Denoised thermal dust maps for all three approaches. Top and middle panels: αshearlet denoising with rotationbased (top) or patchwork (middle) approach, both for α = 0.6. Bottom panel: representation learned with dictionary learning. Units are in μK. 

Open with DEXTER 
Fig. 13. Zoom on a region close to the galactic plane to visualize the respective denoising performance of the methods. From top to bottom panels: input map, noisy map (with own colorscale), rotationbased approach with α = 0.6, patchwork approach with α = 0.6, sparse representation learned from data. All units are in μK. 

Open with DEXTER 
Fig. 14. Residuals for the maps displayed in Fig. 12. Units are in μK. 

Open with DEXTER 
Statistics on the recovery of spherical thermal dust maps with the proposed approaches.
Similarly, for the dark matter halo distribution, the full sky denoised maps are displayed in Fig. 15 and the residuals are presented in Fig. 16. To better inspect the recovery of the filamentary structures as well as the core regions, a zoomin was also performed for this dataset; this is shown in Fig. 17. Finally, the results are quantitatively evaluated in Table 3.
Fig. 15. Denoised dark matter maps for all three approaches. Top and middle panels: αshearlet denoising with rotationbased (top) or patchwork (middle) approach, both with α = 1. Bottom panel: representation learned with dictionary learning. 

Open with DEXTER 
Fig. 16. Amplitude of the residuals for all three approaches, for the dark matter map scenario. Top and middle panels: αshearlet denoising with rotationbased (top) or patchwork (middle) approach, both with α = 1. Bottom panel: representation learned with dictionary learning. 

Open with DEXTER 
Fig. 17. Dark matter map amplitudes for all three approaches in a zoomed region. From top to bottom and left to right panels: original map, noisy map, rotationbased approach with α = 1, patchwork approach with α = 0, patchwork approach with α = 1, representation learned from data. 

Open with DEXTER 
Statistics on the recovery of dark matter halo distribution with the proposed approaches.
To inspect the impact of the anisotropy parameter on the recovery of geometrical structures in the different redshift slices, we also computed for the patchwork approach the nonlinear approximation curves that display the evolution of the RMSE as a function of given thresholds. This allows for a more comprehensive view of the best α for different density level thresholds. These nonlinear approximation curves are illustrated in linear and log scale in Figs. 19 and 20, respectively.
5.2. Discussion
In the following, we discuss several questions concerning the results; in particular, we analyze the relative performance of our different approaches to sparsifying representations on the sphere.
5.2.1. Block artifacts
The first challenge in extending the representation from the Euclidean framework to data defined on the sphere was to avoid the border effects due to considering disjoint charts processed independently. Figure 18 illustrates that all our proposed redundant representations, based on different overlapping charts, are free of these block artifacts when denoising the thermal dust map. A similar result is obtained for denoising the dark matter maps.
Fig. 18. Cartesian projection of the denoised thermal dust maps centered at the intersection of four faces. From left to right and top to bottom panels: denoising each face independently using αshearlets with α = 1, restoration via the rotationbased approach, patchwork approach with α = 1, dictionary learning with patch width of 12. The colorscale has been stretched to visualize the artifacts seen as a crossshape discontinuity at the boundaries of the four HEALPix faces in the upper left panel. All of our proposed approaches are free from these artifacts. Units are in μK. 

Open with DEXTER 
5.2.2. Visual inspection
Qualitatively, Figs. 13 and 17 illustrate the different shapes captured by αshearlets and dictionary learning atoms. In particular, for the thermal dust maps, the noise appears as curveletlike structures for the αshearlet approaches, while for the dictionary learning approach, the noise appears both as isotropic and as directional structures.
For the first slice of the dark matter halo distribution simulations, the dictionary learning approach visually seems to best recover the structures in the data, in particular the filamentary structures and the compact cores.
5.2.3. Which approach is best?
This is confirmed quantitatively in Tables 2 and 3 where the dictionary learning approach outperforms both αshearlet techniques in the denoising of thermal dust (with a multiscale approach) and dark matter halo distribution. For thermal dust, when looking at specific regions (region inside or outside the galactic mask), the rotationbased approach gives, however, the lowest residuals in the galactic region, while using the learned representation gave the best results outside this region. This can be explained by the wide diversity of amplitudes in the galactic plane, not captured in our training set of 200 000 patches for the first wavelet scale, which corresponds only to 0.4% of the total number of patches over the full sky. Improving performance for dictionary learning in the galactic region would require us either to train the dictionary with a larger training set so that it encompasses more patches from the galactic center, or to sample more densely the galactic region than higher galactic latitudes in this training set.
5.2.4. Is the rotationbased or the patchwork approach preferable?
The rotationbased approach outperforms the patchwork approach in the thermal dust denoising scenario, but conversely the patchwork approach outperforms the rotationbased technique in the dark matter halo distribution scenario. The last result is due to the bilinear interpolation performed when resampling the sphere with rotations, which leads to severe approximation errors when the signal varies greatly at the scale of a few pixels.
5.2.5. What is the best αvalue?
Tables 2 and 3 show that for αshearlets in the denoising of thermal dust, α = 0.6 (system close to the curvelets) gives the best performance, while for the dark matter halo distribution scenario, α = 1.0 (system close to the wavelets) gave the best performance.
However, the second scenario displays a diversity of structures with both high density cores and numerous less dense filaments, with distribution changing in different slices of data corresponding to different redshifts. It would therefore be reductive to investigate a single noise level scenario to set a best α for one of these slices of the data.
We therefore computed for the patchwork approach the nonlinear approximation curves for the different slices in redshift. These nonlinear approximation curves are illustrated in linear and log scale in Figs. 19 and 20, respectively. These curves illustrate that for large threshold values, corresponding to selected dense core regions, the α = 0.9shearlet system is most suitable. For slice 600 and 605 (higher redshift), when decreasing the threshold, there is a transition from α = 0.9 to α = 0 (very elongated shearlets) for the best α value. This can be understood as including more and more filamentary structures when the threshold decreases.
Fig. 19. Normalized nonlinear approximation curves for four different slices of the dark matter distribution. For each threshold, the α value corresponding to the lowest approximation error is displayed on the top. 

Open with DEXTER 
Fig. 20. Normalized nonlinear logapproximation curves for four different slices of the dark matter distribution. For each threshold, the α value corresponding to the lowest approximation error is displayed on the bottom. 

Open with DEXTER 
For lower redshift slices on the other hand, the best values are obtained more consistently across thresholds for α = 0.9 or α = 1 because more core structures and less filaments are visible in the data. Overall, this illustrates how adaptive to diverse structures in the data the αshearlets can be. Furthermore, it shows that the anisotropy parameter α can be used to characterize different types of structure present in the data.
5.3. Computing requirements
All codes were run on the same cluster so that we can assess the relative computing time requirements for the three approaches. For the rotationbased approach, on the current python implementation using pyFFTW^{3} and also based on a parallelized transform using six cores, denoising a N_{side} = 2048 map using five rotations and four scales of decomposition takes about 35 min for α = 1 and 1 h for α = 0 (the most redundant transform). The time needed to perform the rotationbased approach scales linearly with the number of rotations. In comparison, denoising with the patchwork approach a N_{side} = 2048 map using four scales of decomposition (with the same parallelization of the transform as for the rotationbased approach) takes about 9 min for α = 1 and 20 min for α = 0.
For the multiscale dictionary learning algorithm, computing time for the learning phase ranged from about 2.5 h for scale 3 to about 3.5 h for scale 1, when using our C++ code with four cores for the sparse coding. This increase is due to the low value for ϵ^{(1)} and large value for the maximal sparsity K^{(1)}, even though the training set is smaller than for scale 3. Learning these dictionaries can be performed in parallel, which was done in practice. For the dark matter scenario, the learning took about 65 min. Once the dictionary was learned, sparse coding of all patches took typically from 15 min (scale 1) to about 22 min (scale 3) for the thermal dust map, and 9 min for the dark matter halo distribution, using 24 cores. Overall, the two αshearlet approaches are therefore easier to set up, with less parameters to optimize that depend directly on the data, and result in faster denoising than the dictionary learning based approach.
6. Conclusions
We have proposed two new types of adaptive representations on the sphere: a patchbased dictionary learning approach and choosing among a parametrized family of representations, the αshearlets. To extend these constructs from the Euclidean setting to data defined on the sphere, we proposed to use overlapping charts based on the HEALPix framework. For the dictionary learning technique, a possible multiscale extension was presented by learning dictionaries on each scale after performing a subsampled wavelet decomposition on the sphere. For the αshearlets, we proposed two approaches to construct the charts: resampling the sphere according to various rotations associated with a partition of unity not sensitive to border effects, or constructing six overlapping charts based on composite extended HEALPix faces.
We evaluated all three approaches by conducting denoising experiments on thermal dust maps, and dark matter maps. Our main findings are as follows:

thanks to the use of overlapping charts, all of our proposed approaches are free of the block artifacts that typically appear if one naively uses the disjoint HEALPix faces for doing denoising;

in both scenarios investigated, the dictionary learning approach gave the best performance by providing atoms adapted to the structure present in the images, for a given noise level;

the performance of the dictionary learning approach depends on setting several hyperparameters that depend on the signal observed (multiscale or not), and on the training set. This approach therefore requires more computing and tuning time than the other approaches;

which of the two αshearlet approaches performed better depended on the chosen scenario; the rotationbased approach involves interpolation, which is detrimental to capturing signals that vary significantly on the scale of just a few pixels, but it achieved better results for the thermal dust simulations;

for different values of the anisotropy parameter α, the αshearlet system is adapted to different structures (filaments, dense cores) present in the dark matter halo distribution simulation.
The respective performance of these approaches depends on the criteria used: the dictionary learning approach provided the best denoising results in both scenarios, but has a higher number of parameters to set and requires more computing time; among the αshearlets, the rotationbased approach is best for smooth signals, but the converse is true for signals with significant variation on the scale of a few pixels. The three proposed approaches can therefore be used to process data living on the sphere, and choosing the “best” approach will depend on the scenario considered as well as the computing resources available.
Reproducible research. In the spirit of reproducible research, we make public our codes for sparse representation systems on the sphere on the common repository^{4}. The dictionary learning and alphashearlet codes on the sphere are associated with tutorial jupyter notebooks illustrating how to use them for denoising.
Acknowledgments
This work is funded by the DEDALE project, contract no. 665044, within the H2020 Framework Program of the European Commission. The authors thank the Horizon collaboration for making their simulations available.
References
 Aharon, M., & Elad, M. 2008, SIAM J. Imaging Sci., 1, 228 [Google Scholar]
 Aharon, M., Elad, M., & Bruckstein, A. 2006, Int. Trans. Sig. Proc., 54, 4311 [Google Scholar]
 Beckouche, S., Starck, J. L., & Fadili, J. 2013, A&A, 556, A132 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Bessolaz, N., & Brun, A. S. 2011, ApJ, 728, 115 [NASA ADS] [CrossRef] [Google Scholar]
 Bobin, J., Sureau, F., & Starck, J.L. 2015, A&A, 583, A92 [NASA ADS] [EDP Sciences] [Google Scholar]
 Bobin, J., Sureau, F., & Starck, J.L. 2016, A&A, 591, A50 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Candès, E., & Donoho, D. 2004, Comm. Pure Appl. Math., 57, 219 [Google Scholar]
 Candès, E., Demanet, L., Donoho, D., & Ying, L. 2006, Multiscale Model. Simul., 5, 861 [Google Scholar]
 Chan, J. Y. H., Leistedt, B., Kitching, T. D., & McEwen, J. D. 2017, IEEE Trans. Signal Process., 65, 5 [NASA ADS] [Google Scholar]
 Christensen, O. 2016, in An Introduction to Frames and Riesz Bases, 2nd edn. (Cham: Birkhäuser/Springer), Appl. Numer. Harmonic Anal., XXV [Google Scholar]
 Daubechies, I. 1992, in Ten Lectures on Wavelets (Philadelphia, PA: Society for Industrial and Applied Mathematics (SIAM)), CBMSNSF Regional Conf. Ser. Appl. Math., 61, XX [Google Scholar]
 DíazHernández, R., PeregrinaBarreto, H., AltamiranoRobles, L., GonzálezBernal, J. A., & OrtizEsquivel, A. E. 2014, Exp. Astron., 38, 193 [NASA ADS] [Google Scholar]
 DíazHernández, R., OrtizEsquivel, A., PeregrinaBarreto, H., AltamiranoRobles, L., & GonzálezBernal, J. 2016, Exp. Astron., 41, 409 [NASA ADS] [Google Scholar]
 Doroshkevich, A. G., Naselsky, P. D., Verkhodanov, O. V., et al. 2005, Int. J. Mod. Phys. D, 14, 275 [NASA ADS] [CrossRef] [Google Scholar]
 Elad, M., & Aharon, M. 2006, IEEE Trans. Image Process., 15, 3736 [NASA ADS] [Google Scholar]
 Engan, K., Aase, S. O., & Husoy, J. H. 1999, in ICASSP 1999 Proceedings, IEEE, 5, 2443 [Google Scholar]
 Górski, K. M., Hivon, E., & Wandelt, B. D. 1999, Proc. MPA/ESO Conf., Evol. Large Scale, Struct, 37 [Google Scholar]
 Górski, K. M., Hivon, E., Banday, A. J., et al. 2005, ApJ, 662, 759 [NASA ADS] [CrossRef] [Google Scholar]
 Grohs, P., Keiper, S., Kutyniok, G., & Schäfer, M. 2016, Appl. Comput. Harmon. Anal., 41, 297 [Google Scholar]
 Guo, K., & Labate, D. 2007, SIAM J. Math. Anal., 39, 298 [Google Scholar]
 Guo, K., Kutyniok, G., & Labate, D. 2006, Wavelets and splines: Athens 2005 (Brentwood, TN: Nashboro Press), Mod. Methods Math., 189 [Google Scholar]
 Jenatton, R., Mairal, J., Obozinski, G., & Bach, F. 2011, J. Mach. Learn. Res., 12, 2297 [Google Scholar]
 Jost, P., Vandergheynst, P., Lesage, S., & Gribonval, R. 2006, ICASSP 2006 Proceedings, IEEE, 5, V [Google Scholar]
 Kutyniok, G., & Lim, W. 2011, J. Approx. Theor., 163, 1564 [Google Scholar]
 Kutyniok, G., & Labate, D. 2012, Applied and Numerical Harmonic Analysis (New York: Birkhäuser/Springer), XX [Google Scholar]
 Labate, D., Lim, W., Kutyniok, G., & Weiss, G. 2005, in Optics and Photonics 2005, Int. Soc. Opt. Photonics, 59140U [Google Scholar]
 Labate, D., Lim, W.Q., Kutyniok, G., & Weiss, G. 2005b, Wavelets XI, Vol. 5914 (SPIE), 254 [NASA ADS] [Google Scholar]
 Le Pennec, E., & Mallat, S. 2005, IEEE Trans. Image Process., 14, 423 [NASA ADS] [Google Scholar]
 Leistedt, B., McEwen, J. D., Büttner, M., & Peiris, H. V. 2017, MNRAS, 466, 3728 [NASA ADS] [Google Scholar]
 Mairal, J., Elad, M., & Sapiro, G. 2008a, IEEE Trans. Image Process., 17, 53 [NASA ADS] [Google Scholar]
 Mairal, J., Sapiro, G., & Elad, M. 2008b, Multiscale Model. Simul., 7, 214 [Google Scholar]
 Mairal, J., Ponce, J., Sapiro, G., Zisserman, A., & Bach, F. R. 2009, Adv. Neural Inf. Process. Syst., 1033 [Google Scholar]
 Mairal, J., Bach, F., Ponce, J., & Sapiro, G. 2010, J. Mach. Learn. Res., 11, 19 [Google Scholar]
 Mallat, S., & Zhang, Z. 1993, IEEE Trans. Signal Process., 41, 3397 [NASA ADS] [CrossRef] [Google Scholar]
 McDermott, S. D., Fox, P. J., Cholis, I., & Lee, S. K. 2016, JCAP, 7, 045 [NASA ADS] [Google Scholar]
 McEwen, J. D. 2015, IEEE Trans. Sig. Proc., submitted, [arXiv:1510.01595] [Google Scholar]
 McEwen, J. D., Feeney, S. M., Peiris, H. V., et al. 2017, MNRAS, 472, 4081 [NASA ADS] [Google Scholar]
 Naidoo, K., BenoitLévy, A., & Lahav, O. 2017, MNRAS, 472, L65 [NASA ADS] [Google Scholar]
 Olshausen, B., & Field, D. 1996, Vision Res., 37, 3311 [Google Scholar]
 Ophir, B., Lustig, M., & Elad, M. 2011, IEEE Sel. Sign. Process. Top., 5 [Google Scholar]
 Pati, Y. C., & Krishnaprasad, P. S. 1993, IEEE Trans. Neural Networks, 4, 73 [Google Scholar]
 Peyré, G. 2009, J. Math. Imaging Vision, 34, 17 [Google Scholar]
 Planck Collaboration XII. 2016, A&A, 594, A12 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Rassat, A., Starck, J.L., Paykari, P., Sureau, F., & Bobin, J. 2014, JCAP, 8, 006 [NASA ADS] [CrossRef] [Google Scholar]
 Rubinstein, R., & Elad, M. 2014, IEEE Trans. Signal Process., 62, 5962 [NASA ADS] [Google Scholar]
 Rubinstein, R., Peleg, T., & Elad, M. 2013, IEEE Trans. Signal Process., 61, 661 [NASA ADS] [Google Scholar]
 Schmitt, J., Starck, J. L., Casandjian, J. M., Fadili, J., & Grenier, I. 2010, A&A, 517, A26 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Starck, J.L., Candès, E., & Donoho, D. 2003, A&A, 398, 785 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Starck, J.L., Moudden, Y., Abrial, P., & Nguyen, M. 2006, A&A, 446, 1191 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Starck, J.L., Moudden, Y., & Bobin, J. 2009, A&A, 497, 931 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Starck, J.L., Murtagh, F., & Fadili, M. J. 2015, Sparse Image and Signal Processing: Wavelets and Related Geometric Multiscale Analysis (Cambridge University Press) [Google Scholar]
 Sureau, F. C., Starck, J.L., Bobin, J., Paykari, P., & Rassat, A. 2014, A&A, 566, A100 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Teyssier, R., Pires, S., Prunet, S., et al. 2009, A&A, 497, 335 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 van der Walt, S., Colbert, S. C., & Varoquaux, G. 2011, Comput. Sci. Eng., 13, 22 [Google Scholar]
 Van Rossum, G., & Drake, Jr., F. L. 1995, Python Tutorial (The Netherlands: Centrum voor Wiskunde en Informatica Amsterdam) [Google Scholar]
 Voigtlaender, F., & Pein, A. 2017, ArXiv eprints [arXiv:1702.03559v1] [Google Scholar]
 Woiselle, A. 2010, PhD Thesis Paris 7 [Google Scholar]
 Woiselle, A., Starck, J.L., & Fadili, J. 2011, J. Math. Imaging Vision, 39, 121 [Google Scholar]
 Zhang, Q., & Li, B. 2010, in 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE), 2691 [Google Scholar]
Appendix A: Review of Euclidean αshearlets
The αshearlet family of representations generalizes wavelets and shearlets. Like shearlets – originally introduced in Labate et al. (2005) and Guo et al. (2006) – they are a directionally sensitive multiscale system in ℝ^{2} improving upon wavelets when it comes to handling data that is governed by directional features like edges. They are characterized by an anisotropy parameter α ∈ [0, 1], and were designed to yield optimally sparse representations for the class of C^{β}cartoonlike functions (Kutyniok & Labate 2012; Kutyniok & Lim 2011; Guo & Labate 2007; Voigtlaender & Pein 2017), a model class for natural images (Candès & Donoho 2004) as illustrated in Fig. A.1.
In the remainder of this section, we briefly explain our motivation for choosing αshearlet systems, discuss the most important mathematical properties of αshearlet systems, and then comment on the implementation that we used.
Fig. A.1. Example of a cartoonlike function. Such a function f is smooth, apart from a jump discontinuity along a curve γ. Even though f might be discontinuous along γ, the boundary curve γ itself is required to be smooth. 

Open with DEXTER 
A.1. Motivation
Before giving a formal definition of (α)shearlet systems, it is instructive to roughly compare the operations used for their construction to the ones used for defining wavelet systems (Daubechies 1992). We recall (see, e.g., Daubechies 1992) that for a scaling function ϕ ∈ L^{2}(ℝ^{d}) and a mother wavelet ψ ∈ L^{2}(ℝ^{d}), the associated (discrete) wavelet system with sampling density δ > 0 is given by
In other words, the wavelet system consists of all translates of the scaling function ϕ along the lattice δℤ^{d}, together with certain translates of the isotropically dilated scaling functions ψ_{j} := 2^{dj/2} ψ(2^{j} • ). Here, the wavelet ψ_{j} on the jth scale is translated along the lattice δ ⋅ 2^{−j}ℤ^{d}, which is adapted to the “size” of ψ_{j}.
It is crucial to note that even in dimension d > 1, wavelets use the isotropic dilations x ↦ 2^{j}x, which treat all directions in the same way. Therefore, wavelet systems are not optimally suited for representing functions governed by features with different directions. Admittedly, instead of using a single mother wavelet ψ, it is common to employ wavelet systems that use finitely many mother wavelets ψ^{(1)}, …, ψ^{(N)}; usually these are obtained by choosing each ψ^{(j)} as a certain tensor product of onedimensional scaling functions and mother wavelets. However, such a modified wavelet system is again only able to distinguish a fixed number of directions, independent of the scale j, and therefore does not allow a satisfactory directional sensitivity.
To overcome this problem, shearlets (like curvelets) use the parabolic dilation matrices . More generally, αshearlets employ the αparabolic dilation matrices
As shown in Fig. A.2, dilating a function ψ with these matrices produces functions that are more elongated along the x_{2}axis than along the x_{1}axis, where the anisotropy is more pronounced for larger values of α or j. The support of the dilated function satisfies 2^{−jα} ≈ height ≈ width^{α}.
Fig. A.2. Effect of dilating a “prototype function” ψ (shown at the top of each column) with the matrices D_{j}^{(α)} to obtain ψ(D_{j}^{(α)} • ), for different values of the scale j (going from j = 0 (top panels) to j = 2 (bottom panels)) and of the “anisotropy parameter” α ∈ [0, 1]. 

Open with DEXTER 
It is apparent from Fig. A.2 that for α < 1 and large j ∈ ℕ_{0}, the functions have a distinguished direction. More precisely, if (as in the figure) ψ oscillates along the x_{1}axis, then is similar to a sharp jump along the x_{2}axis. Since we want our dictionary to be able to represent jumps along arbitrary directions, we have to allow for some way of changing the direction of the elements . The most intuitive way for achieving this is to use rotations, as was done in the construction of (second generation) curvelets (Candès & Donoho 2004). However, later on it was noted in Labate et al. (2005) and Guo et al. (2006) that from an implementation point of view, rotations have the disadvantage that they do not leave the digital grid ℤ^{2} invariant. Therefore, instead of rotations, (α)shearlets use the shearing matrices
to adjust the direction of the functions . However the shearing matrices S_{x}, x ∈ ( − ∞, ∞) can never cause an effect similar to a rotation with angle θ for θ> 90^{°}. Therefore, for the definition of a coneadapted shearlet system, one only uses shearings corresponding to rotations with angle θ≤45^{°}, and one then uses a modified mother shearlet ψ^{♮} to cover the remaining directions.
Collecting all previously described constructs, the coneadapted αshearlet system with sampling density δ > 0, associated to a lowpass filter φ ∈ L^{2}(ℝ^{2}) and mother shearlet ψ ∈ L^{2}(ℝ^{2}), is defined as
with , and
For brevity, let us set , and observe with this notation that
with .
A.2. Mathematical properties
The most basic property of αshearlets that we are interested in is that they indeed form a (redundant) representation system for L^{2}(ℝ^{2}). In mathematical terms, this means that the αshearlet system forms a frame (Christensen 2016), for a suitable choice of the generators φ, ψ. In particular, if φ, ψ ∈ L^{2}(ℝ^{2}) have compact support and satisfy certain decay and smoothness conditions (see Voigtlaender & Pein 2017, Theorem 5.10 for details), then there is a “minimal sampling density” δ_{0} > 0, such that the αshearlet system is indeed a frame for L^{2}(ℝ^{2}), for all 0 < δ ≤ δ_{0}.
The main motivation for introducing (α)shearlets was the need for a representation system better adapted to data governed by directional features, which are often present in natural and in astronomical images. One key result relates (α)shearlets to C^{1/α}cartoonlike functions. Roughly speaking, a function f ∈ L^{2}(ℝ^{2}) is called a C^{β}cartoonlike function, written f ∈ ℰ^{β}(ℝ^{2}) (with β ∈ (1, 2]), if f = f_{1} + f_{2} ⋅ 𝟙_{B} for certain f_{1}, f_{2} ∈ C_{c}^{β}([0, 1]^{2}) and such that the set B ⊂ [0, 1]^{2} has a boundary curve of regularity C^{β}. For a more formal definition, we refer to Voigtlaender & Pein (2017, Definition 6.1).
Using this notion, we have the result that the best Nterm approximation error with such a frame of αshearlets (that is, the smallest approximation error obtained by a linear combination of Nαshearlets) is decaying at (almost) the best rate that any dictionary Ψ can reach for C^{β}cartoonlike functions (see Voigtlaender & Pein (2017, Theorem 6.3) for a more precise formulation of this result). To obtain this optimal approximation rate, the anisotropy parameter α needs to be adapted to the regularity β of the C^{β}cartoonlike functions, that is, α = 1/β. In general, given a certain data set, or a certain data model, different types of αshearlet systems will be better adapted to the given data than other α′shearlet systems.
We close our discussion of the mathematical properties of αshearlet systems with a brief discussion of the frequency concentration of such systems. To this end, assume for the moment that the “mother shearlet” ψ is concentrated in frequency to the set
which is a union of two opposing “wedges” (highlighted in green in Fig. A.3). From elementary properties of the Fourier transform, one then sees that each αshearlet has frequency support in , where we denote by A^{T} the transpose of a matrix A. The resulting coverings of the frequency plane for different values of the anisotropy parameter α are shown in Fig. A.3.
Fig. A.3. Frequency concentration of αshearlets for different values of α. One sees that each “dyadic annulus” {ξ : ξ≍2^{j}} is split into a number of “wedges” representing the different directions. In fact, . 

Open with DEXTER 
Together, Figs. A.2 and A.3 show that the parameter α has three different, but related effects:

It affects the “shape” of the elements of the αshearlet system. Indeed, Fig. A.2 shows that height ≈ width^{α}.

It affects the directional selectivity: as seen in Fig. A.3, on scale j, an αshearlet system can distinguish about 2^{(1 − α)j} different directions.

It affects the frequency support of the elements of the αshearlet system (see Fig. A.3).
A.3. Implementation
The git repository of our implementation of the Euclidean αshearlet transform can be found online^{5}, with extensive documentation^{6}. Our software package is implemented in Python3 (Van Rossum & Drake 1995), using NumPy (van der Walt et al. 2011).
In this section, we give a rough overview over what the transform computes, and how it can be used. Our software package implements two different versions of the αshearlet transform: a fullysampled (nondecimated) version, and a subsampled (decimated) version. For the fullysampled version, the computed coefficients are the (discrete) convolutions φ * f and (for a certain range of scales j = 0, …, j_{max}), where the filters φ and are chosen as in Eqs. (A.1) and (A.2). Thus, for a given input image f ∈ ℂ^{N × N}, the resulting coefficients form a threedimensional tensor of dimension N_{α, jmax} × N × N, where the integer N_{α, jmax} is the total number of αshearlet filters that is used, and where each N × N component of the tensor is the discrete convolution of f with one of the αshearlet filters. When considering j_{max} many scales (i.e., j = 0, …, j_{max} − 1) and if α < 1, then
In particular, for α = 0, N_{0, jmax} ≍ 2^{jmax}, so that the redundancy of the fully sampled αshearlet frame grows very quickly when increasing the number of scales.
To motivate the subsampled transform, we note that according to Eq. (A.1), the αshearlet system does not contain all translations of the functions φ and . Rather, φ is shifted along the lattice δℤ^{2}, and – as seen in Eq. (A.2) – is shifted along the lattice , with . Effectively, this means that the full convolution is only sampled at certain points, where the sampling density gets more dense as the scale j increases. The subsampled version of the αshearlet transform computes these coefficients. Internally, this is achieved by using the “frequency wrapping” approach outlined in Candès et al. (2006, Sects. 3.3 and 6), Woiselle (2010, Chapter 4), and Woiselle et al. (2011) for the case of the curvelet transform. Since each convolution is sampled along a different lattice, the subsampled transform of a given image f is a list of rectangular matrices of varying dimension. This will become clearer in the example below. One can show for the subsampled transform that the total number M = M(α, j_{max}, N) of αshearlet coefficients for an N × N image is bounded, that is, M(α, j_{max}, N)≤M_{0} ⋅ N^{2}, with M_{0} independent of α, j_{max}, N. This is in stark contrast to the fully sampled transform (at least for α < 1), where the total number of coefficients is ≈2^{(1 − α)jmax} ⋅ N^{2} (see Eq. (A.3)).
The main effect of choosing the fully sampled transform is that one gets a translationinvariant transform (i.e., taking the transform of a shifted image is the same as shifting each component of the coefficient tensor), and the increased redundancy. This increased redundancy can actually be beneficial for certain tasks like denoising, but it can greatly impact the memory footprint and the runtime: computations using the subsampled transform are usually much faster and require much less memory, but yield slightly worse results.
We close this section with a short IPython session showing how our implementation of the αshearlet transform can be used.
In the line marked with #1, we set up the αshearlet transform object trafo. Roughly speaking, this precomputes all necessary αshearlet filters, which are stored in the trafo object. The first two parameters of the constructor simply determine the shape of the images for which the trafo object can be used, while the third parameter determines the number of scales j_{max} to be used, as well as the value of the anisotropy parameter α. Passing [alpha_0] * N will construct an αshearlet transform with N scales (plus the lowpass) and with α given by alpha_0. The verbose parameter simply determines how much additional output (like a progress bar) is displayed. The subsampled parameter determines whether the nondecimated, or the decimated transform is used. Finally, the real parameter determines whether realvalued or complexvalued αshearlet filters are used. Essentially, realvalued filters have frequency support in the union of two opposing wedges (as shown in Fig. A.3), while for complexvalued filters, one gets two filters for each realvalued one: one complexvalued filter has frequency support in the “left” wedge, while the other one is supported in the “right” wedge.
In line #2, we use the transform() method of the constructed trafo object to compute the αshearlet transform of im. As seen, the result is an ordinary NumPy array of dimension N_{α, jmax} × N_{1} × N_{2}, where the input image has dimension N_{1} × N_{2}, and where N_{α, jmax} is the total number of αshearlet filters used by the transform.
The indices property of the trafo object (see line #3) can be used to determine to which αshearlet filter the individual components of the coeff array are associated. The value 1 represents the lowpass filter, while a tuple of the form (j, l, c) represents the shearlet filter as in Eq. (A.1), where ι = 0 if c is (which stands for the horizontal frequency cone), and where ι = 1 if c is (vertical frequency cone).
To explain the differences between the fully sampled and the subsampled transform, in line #5, we set up a subsampled transform object trafo2 The only difference to the construction of the trafo object is that we pass subsampled=True, and real=False. The reason for this second change is that – at least with the current implementation – the subsampled transform can only be used with complexvalued shearlet filters. We then compute the coefficients (see line #6) just as for the fully sampled transform. We note, however, that the coefficients for the fully sampled transform were a single threedimensional NumPy array. For the subsampled transform, however, the coefficients are a list of twodimensional NumPy arrays. The reason for this is that the number of coefficients varies from scale to scale for the subsampled transform.
The indices property (see line #7) for the subsampled transform also differs from that of the fully sampled transform. The reason for this is that we use complex shearlets; therefore, the frequency plane is divided into four cones (top, or ; right, or ; bottom, or ; and left, or ), instead of the two cones that are used for realvalued shearlet filters.
The main advantage of the subsampled transform is revealed in line #8: the redundancy (that is, the number of αshearlet coefficients divided by the number of pixels of the input image) for the subsampled transform is much lower, which leads to a lower memory consumption and faster computation times. While the advantage of the subsampled transform might not be overwhelming in the given example, it becomes more pronounced if one uses a larger number of scales. For instance, if we use four scales instead of three, then the redundancy of the fully sampled transform is 41, while that of the subsampled transform is only ≈11.4.
All Tables
Statistics on the recovery of spherical thermal dust maps with the proposed approaches.
Statistics on the recovery of dark matter halo distribution with the proposed approaches.
All Figures
Fig. 1. HEALPix grid (visualizing N_{side} = 16) in orthographic projection on the left and Mollweide projection on the right. Faint lines indicate the circles of latitude . The right image also introduces the numbering of the faces, used in the following illustrations. 

Open with DEXTER  
In the text 
Fig. 2. Example of our covering of the sphere with overlapping patches based on HEALPix neighborhoods for N_{side} = 128 and patch width q = 8 (note that in our numerical experiments, N_{side} = 2048 and the patch width is either q = 8 or q = 12). Several randomly selected patches on the sphere are also represented in color. The plotted value in gray indicates the number of overlapping patches including each pixel. Because the patch width is usually small with respect to the number of pixels per face, the number of overlapping patches varies in small regions around the pixels that only have seven neighbors. 

Open with DEXTER  
In the text 
Fig. 3. Partition of unity for the rotationbased reconstruction. The weights smoothly decaying toward the border are presented in the top left panel and are copied to each HEALPix face in the top right panel. In the bottom left panel, resampling was first performed using a rotation and bilinear interpolation, and the image shows the weights that would be applied in the original reference coordinates. The resulting covering of the sphere using five rotations is illustrated in the bottom right panel. 

Open with DEXTER  
In the text 
Fig. 4. Left panel: twelve squares corresponding to the faces of the HEALPix framework (see Fig. 1) arranged as a net in the plane. The areas that are covered by multiple of the extended faces – the transition zones – are displayed in gray. The areas where pixels are “missing” are displayed in red. Right panel: six extended faces produced by the patchwork procedure. The two polar faces form the top row, followed by the four equatorial faces below. The shaded area around the transition zone of each composite face indicates the margin, which is later discarded. 

Open with DEXTER  
In the text 
Fig. 5. Detailed view of two of the six extended faces. The dark outer boundary with width c_{m} is the margin that is discarded after the reconstruction step, and the two dark squares in the corners of the equatorial face on the right are treated likewise. The remaining part of the extended faces has a gray outer boundary of width 2c_{t}. In conjunction with the gray squares in the corners of the equatorial face, this boundary forms the transition zone that contains the values shared with the neighboring extended faces. 

Open with DEXTER  
In the text 
Fig. 6. “Missing” square between faces 1 and 2 is divided into four triangles of equal size, separated by the lines 2x = y, x = y, and x = 2y, as seen on the left. The two images in the middle reveal how the rotated faces 1 and 2 are separately weighted along those segments. The data of face 1 have full weight (black) on the outer triangle adjacent to face 2, and no weight (white) on the other outer triangle, while the data of face 2 are treated conversely. A smooth transition is provided by the weights on the triangles in between. The sum of the weighted faces is used to fill the gap, as demonstrated in the rightmost illustration. 

Open with DEXTER  
In the text 
Fig. 7. Thermal dust simulation map (at 100 GHZ) without (top panel) and with the additive white Gaussian noise added (bottom panel), for evaluation of the methods. The colorscale has been stretched to illustrate the challenge of recovering structures at intermediate latitude. Units are in μK. 

Open with DEXTER  
In the text 
Fig. 8. Left panel: galactic mask used for thermal dust quantitative evaluation, covering 70% of the sky. Right panel: region close to galactic plane where methods are inspected. 

Open with DEXTER  
In the text 
Fig. 9. Dark matter halo distribution for the first slice, without (top panel) and with the additive white Gaussian noise added (bottom panel), for evaluation of the methods. The colorscale has been stretched to visualize filamentary structures. 

Open with DEXTER  
In the text 
Fig. 10. Atoms learned in the multiscale dictionary learning approach: on the left, scale 3, on the right, scale 2. The dictionaries have departed from the original redundant DCT dictionary and have learned specific features related to the scale. Due to the change of the N_{side} parameter with the scale, the actual distance between two adjacent pixels has increased, and the atoms for scale 2 are indeed smoother than those for scale 3. 

Open with DEXTER  
In the text 
Fig. 11. Atoms learned in the dictionary learning approach, applied to the dark matter halo distribution data. The dictionary elements are composed of pointlike structures and edges. 

Open with DEXTER  
In the text 
Fig. 12. Denoised thermal dust maps for all three approaches. Top and middle panels: αshearlet denoising with rotationbased (top) or patchwork (middle) approach, both for α = 0.6. Bottom panel: representation learned with dictionary learning. Units are in μK. 

Open with DEXTER  
In the text 
Fig. 13. Zoom on a region close to the galactic plane to visualize the respective denoising performance of the methods. From top to bottom panels: input map, noisy map (with own colorscale), rotationbased approach with α = 0.6, patchwork approach with α = 0.6, sparse representation learned from data. All units are in μK. 

Open with DEXTER  
In the text 
Fig. 14. Residuals for the maps displayed in Fig. 12. Units are in μK. 

Open with DEXTER  
In the text 
Fig. 15. Denoised dark matter maps for all three approaches. Top and middle panels: αshearlet denoising with rotationbased (top) or patchwork (middle) approach, both with α = 1. Bottom panel: representation learned with dictionary learning. 

Open with DEXTER  
In the text 
Fig. 16. Amplitude of the residuals for all three approaches, for the dark matter map scenario. Top and middle panels: αshearlet denoising with rotationbased (top) or patchwork (middle) approach, both with α = 1. Bottom panel: representation learned with dictionary learning. 

Open with DEXTER  
In the text 
Fig. 17. Dark matter map amplitudes for all three approaches in a zoomed region. From top to bottom and left to right panels: original map, noisy map, rotationbased approach with α = 1, patchwork approach with α = 0, patchwork approach with α = 1, representation learned from data. 

Open with DEXTER  
In the text 
Fig. 18. Cartesian projection of the denoised thermal dust maps centered at the intersection of four faces. From left to right and top to bottom panels: denoising each face independently using αshearlets with α = 1, restoration via the rotationbased approach, patchwork approach with α = 1, dictionary learning with patch width of 12. The colorscale has been stretched to visualize the artifacts seen as a crossshape discontinuity at the boundaries of the four HEALPix faces in the upper left panel. All of our proposed approaches are free from these artifacts. Units are in μK. 

Open with DEXTER  
In the text 
Fig. 19. Normalized nonlinear approximation curves for four different slices of the dark matter distribution. For each threshold, the α value corresponding to the lowest approximation error is displayed on the top. 

Open with DEXTER  
In the text 
Fig. 20. Normalized nonlinear logapproximation curves for four different slices of the dark matter distribution. For each threshold, the α value corresponding to the lowest approximation error is displayed on the bottom. 

Open with DEXTER  
In the text 
Fig. A.1. Example of a cartoonlike function. Such a function f is smooth, apart from a jump discontinuity along a curve γ. Even though f might be discontinuous along γ, the boundary curve γ itself is required to be smooth. 

Open with DEXTER  
In the text 
Fig. A.2. Effect of dilating a “prototype function” ψ (shown at the top of each column) with the matrices D_{j}^{(α)} to obtain ψ(D_{j}^{(α)} • ), for different values of the scale j (going from j = 0 (top panels) to j = 2 (bottom panels)) and of the “anisotropy parameter” α ∈ [0, 1]. 

Open with DEXTER  
In the text 
Fig. A.3. Frequency concentration of αshearlets for different values of α. One sees that each “dyadic annulus” {ξ : ξ≍2^{j}} is split into a number of “wedges” representing the different directions. In fact, . 

Open with DEXTER  
In the text 