Issue 
A&A
Volume 638, June 2020



Article Number  A73  
Number of page(s)  16  
Section  Numerical methods and codes  
DOI  https://doi.org/10.1051/00046361/202037687  
Published online  23 June 2020 
Accelerating linear system solvers for timedomain component separation of cosmic microwave background data
^{1}
INRIA Paris, Sorbonne Université, Université ParisDiderot SPC, CNRS, Laboratoire JacquesLouis Lions, ALPINES Team, France
email: jan@papez.org
^{2}
Currently at Institute of Mathematics, Czech Academy of Sciences, Prague, Czech Republic
^{3}
Université de Paris, CNRS, AstroParticule et Cosmologie, 75013 Paris, France
^{4}
CNRSUCB International Research Laboratory, “Centre Pierre Binétruy”, UMI2007, CPBIN2P3, France
Received:
7
February
2020
Accepted:
12
April
2020
Component separation is one of the key stages of any modern cosmic microwave background data analysis pipeline. It is an inherently nonlinear procedure and typically involves a series of sequential solutions of linear systems with similar but not identical system matrices, derived for different data models of the same data set. Sequences of this type arise, for instance, in the maximization of the data likelihood with respect to foreground parameters or sampling of their posterior distribution. However, they are also common in many other contexts. In this work we consider solving the component separation problem directly in the measurement (time) domain. This can have a number of important benefits over the more standard pixelbased methods, in particular if nonnegligible timedomain noise correlations are present, as is commonly the case. The approach based on the timedomain, however, implies significant computational effort because the full volume of the timedomain data set needs to be manipulated. To address this challenge, we propose and study efficient solvers adapted to solving timedomainbased component separation systems and their sequences, and which are capable of capitalizing on information derived from the previous solutions. This is achieved either by adapting the initial guess of the subsequent system or through a socalled subspace recycling, which allows constructing progressively more efficient twolevel preconditioners. We report an overall speedup over solving the systems independently of a factor of nearly 7, or 5, in our numerical experiments, which are inspired by the likelihood maximization and likelihood sampling procedures, respectively.
Key words: cosmic background radiation / methods: numerical
© J. Papež et al. 2020
Open Access article, published by EDP Sciences, under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. Context and motivation
Measurements registered by cosmic microwave background (CMB) experiments contain, in addition to the soughtafter signal of cosmological origin, contributions from astrophysical sources. These are generically called foregrounds and can be of either galactic or extragalactic origins and be either diffuse or pointsourcelike morphologically. A separation of the foreground signals from each other and, specifically, from the CMB signal is therefore an essential step of any modern CMB data analysis. This step is referred to as component separation. It is performed by capitalizing on either different electromagnetic frequency dependence and/or statistical properties of different signals (e.g., Planck Collaboration X 2016, and references therein). In polarization the foreground signals tend to dominate the CMB signal over a broad range of angular scales and observational frequencies. The next generation of CMB observatories will therefore be only capable of delivering its science in full if highprecision statistically sound and reliable component separation techniques and their numerically efficient implementations are available.
Component separation is a nonlinear operation. Based on data measured at multiple different frequency bands, it aims to simultaneously recover the frequency dependence of the foregrounds as well as their spatial morphology. It is commonly performed in a pixel domain and uses maps of the sky estimated for each frequency band and their statistical uncertainties as inputs. These objects are assumed to have been obtained in a preceding step of the data analysis that is called mapmaking.
For concreteness, in this work we focus on the socalled parametric component separation approach (e.g., Brandt et al. 1994; Eriksen et al. 2008; Stompor et al. 2009), where the frequencyscaling relations for each considered sky component are assumed to be given up to a limited number of unknown parameters, called foreground spectral parameters. However, the numerical techniques discussed here are more general and should be found useful also in other component separation methods.
The component separation is typically performed in two steps. In the first step, the spectral parameters, or more generally, the mixing matrix elements, are estimated from the data, and in the second step, they are used to recover maps of sky components from the frequency maps. This approach is conceptually simple and potentially very efficient computationally. The input frequency maps preserve essentially all the information present in a typically much larger initial raw data set, and their smaller sizes make them easier to store and operate on. For the next generation of CMB experiments, we expect to have as many as n_{t} ∼ 𝒪(10^{13}−10^{15}) raw measurements, but only n_{pix} ∼ 𝒪(10^{5}−10^{8}) sky pixels.
The pixeldomain component separation approaches can ensure satisfactory performance but require a sufficiently precise statistical description of the frequency maps. This has to be derived from the raw measurements, which we refer to hereafter as timedomain data. In practice, this is often difficult because storage and computational cycles are limited. A general full covariance matrix of a single frequency map contains elements, which would need to be stored in memory. Computing these elements costs at least 𝒪(λ) 𝒪(n_{t}) floating point operations (flops). (Here λ is a timedomain noise correlation length and can reach many thousands of samples.) The full computations therefore quickly become prohibitively expensive. This is the case even if the explicit inversion of the covariance matrix is replaced by some iterative procedure, which typically requires flops, where the number of iterations, n_{iter} is usually on the order of 10^{2}. Consequently, the best way forward in practice may be to invoke some approximations. This is often problematic as well, however, because a successful approximation needs to ensure sufficient accuracy to avoid introducing systematic biases in the estimated foreground parameters and later also in the component maps.
A general solution to the problem would be to avoid relying on the frequency maps at all and to perform all the calculation directly on the timedomain data. This would typically require memory on the order of 𝒪(n_{t}) and 𝒪(p n_{iter} n_{t} ln λ) flops. The prefactor p is on the order of unity for a typical mapmaking run, but in our case, it can vary widely between a few tens and many thousands. This highlights the challenge faced by the proposed approach. We note that while this is certainly very demanding, it is not necessarily prohibitive. Some of the best optimized existing mapmaking codes can already perform many hundreds of runs, for instance, as required in massive Monte Carlo simulations. The proposed approach may not only be more robust, but may be the only way forward if significant timedomain noise correlations are present, λ ≫ 1. This is commonly the case in the CMB experiments, in particular, those operating from the ground.
In this work, we explore some of the avenues that might render this approach tractable. We first identify the main computationheavy step that unavoidably appears in any implementation of this technique. We then investigate how it might be accelerated by employing better and more advanced methods and their implementations.
The plan of this paper is as follows. In Sect. 2 we present the component separation problem and the numerical challenges it poses. In Sect. 3 we describe the proposed solution and in Sect. 4 the results of the numerical tests. Section 5 provides a brief summary and outlines prospects. Material that is more technical in nature or that is added for completeness is as usual deferred to the appendices.
2. Problem description and setting
2.1. Preliminaries
Hereafter, we consider polarized signals and assume, for simplicity and definiteness, that in every sky pixel the signal is characterized by two Stokes parameters, Q and U. Extensions to include total intensity, I, are straightforward. Consequently, hereafter, every considered sky map consists of two maps corresponding to the two Stokes parameters. They are concatenated in a single map vector,
Hereafter, partial brackets, ⌈…⌋, denote a vertical object. Examples of the sky maps as discussed in the following are singlefrequency maps storing information about the sky signal as observed at a given frequency, or singlecomponent maps containing information about a sky signal of some specific physical origin. We refer to the ordering defined above as Stokeswise because a complete sky map of one Stokes parameter is followed up by another. In addition, we also consider a pixelwise ordering, which for single maps reads
where the Q and U parameters of a signal in one pixel are stored consecutively and are followed by those in another.
The goal of the component separation procedure is to estimate all assumed sky component signals given multiple frequency data. Therefore, we commonly deal with multiple maps of the same type, such as multiple singlefrequency maps or multiple singlecomponent maps. We concatenate them in a single multifrequency or multicomponent vector. For definiteness, in this work we fix the number of components to n_{comp} = 3 and consider three different sky components: CMB, dust, and synchrotron. A multicomponent vector, 𝕤, therefore contains information about the Q and U Stokes parameters of all three components. Such a vector can be ordered in multiple ways. Most commonly, we assume that it is ordered either in a componentwise way, when
or in a pixelwise way, where for each pixel all Stokes parameters follow consecutively for all considered components, that is,
Multifrequency vectors can be ordered in analogous manners.
The choice of the ordering in general depends on the specific context and is obviously of key importance for the numerical implementation of the mapmaking or component separation procedures. Nonetheless, mathematically, switching the ordering from one to another is described by a linear, orthonormal, fullrank operator, U. This operator is conceptually trivial to apply, and its application commutes with other matrix operations such as a matrix inversion because
for any invertible matrix M. Consequently, a matrix can be inverted using one ordering, for instance, computing M^{−1}, and the result can later be reordered to obtain the inverse in the other ordering scheme, that is, (U M U^{t})^{−1}. For this reason, we freely switch between the different orderings depending on the context in the following in order to highlight specific structures of the matrices, which may be more apparent for one choice than the other.
2.2. Data model
As mentioned earlier, we consider a component separation procedure performed directly on the timedomain data as measured by the instrument. Thus we do not invoke any prior explicit mapmaking procedure. We therefore need to relate the timedomain measurements directly to the component maps because these maps are the intended outcome of the component separation procedure. We assume that for each frequency the timedomain data are made of sequences of consecutive observations registered by all detectors operating at this frequency and concatenated, we can write
Here 𝕤_{⋆} is the unknown vector of the component amplitudes, and the star indicates that those are their actual values. n_{f} denotes an (unknown) noise vector. The number of the frequency channels, n_{freq}, is assumed to be larger than that of the components, n_{comp}, set to 3 in this work, to ensure that the problem is well defined. The matrix P_{β⋆, f} in Eq. (6) combines the information about the instrument operations and the sky properties. It can be expressed as
where M_{β⋆, f} ∈ ℝ^{2npix × 6npix} is a socalled mixing matrix, and it determines how different sky components mix at all observed frequencies to yield the observed signal. The mixing matrix explicitly depends on the foreground scaling parameters, which we denote as β^{⋆}, and the frequency of the observation, f. P_{f} ∈ ℝ^{nt × 2npix} is in turn a pointing matrix defining which pixel of the sky each detector operating at a given frequency observed at every time. While it does not explicitly depend on frequency or scaling parameters, it therefore is in principle different for different frequencies because it encodes pointing of detectors specific to this frequency. This is highlighted by the subscript f. We have
where is a singlefrequency map expressing the combined sky signal at frequency f. The data vector, d_{f}, is timeordered because its elements are indexed by the time at which the measurement was taken.
2.3. Component separation
The goal of the component separation procedure is to solve an inverse problem, Eq. (6), and estimate the components, 𝕤_{⋆}, given the full data set, d (:={d_{f}}), made of data taken at all observational frequencies. This is typically solved by assuming that the noise, n_{f}, is Gaussian, with a zero mean and a known variance, N_{f}, and writing a data likelihood,
Here we have dropped the star to distinguish an estimate from the true value, and we have introduced a tilde to denote multifrequency objects. We have
which follows from Eq. (7), and
which assumes no noise correlations between different frequency channels. In addition, throughout this work we also assume that while the component mixing represented by M_{β} may involve (potentially) all components, it is always done on a pixelbypixel basis, so that all the elements of M_{β} corresponding to different pixels vanish. Similarly, and in agreement with assumptions made in mapmaking procedures, we assume that the noise matrices, N_{f}, are block diagonal, with each block representing a banded Toeplitz matrix.
The standard twostep component separation procedure proceeds by first estimating for each frequency band, f, a singlefrequency map, m_{f}, and its covariance, . These are given by
The followup component separation step is then performed assuming that the singlefrequency maps yielded by the first step can be represented as
where stands for a pixeldomain noise and is a Gaussian variable with variance . We can therefore write the corresponding likelihood as
This procedure is equivalent to directly solving the maximum likelihood problem defined by Eq. (9). However, it requires an explicit calculation of that for the current and forthcoming experiment is typically prohibitive because of restrictions on both the available computer memory and computational cycles. An alternative might be solving the original problem directly without explicitly invoking any pixeldomain objects. This is the option we study in this work. We note here in passing that intermediate approaches are also possible: for instance, one that relies on the likelihood in Eq. (15), but does not assume that is given explicitly. Instead, it computes a product of the covariance and a vector using an iterative procedure, which only requires applying the inverse covariance to a vector. This is performed using its implicit representation, Eq. (13), as is done in the mapmaking solvers. On the algorithmic level, such approaches are equivalent to solving the problem in the time domain, and the methods considered hereafter would be applicable to that approach as well.
To estimate β and 𝕤 directly from Eq. (9), we may either maximize this likelihood or sample from a posterior derived from it assuming some priors on the spectral parameters^{1}. Alternatively, a socalled spectral likelihood may be used (Stompor et al. 2009), where 𝕤 is already either marginalized or maximized over, that is,
which again can be either minimized or sampled from.
In both these cases, a key operation is a solution of a linear system of equations given by
for a sequence of tentative values of the spectral parameters, β_{i}. These can be either a chain produced as a result of sampling, or a sequence of values obtained in the course of a minimization. We note that Eq. (17) is essentially a socalled mapmaking equation (e.g., Natoli et al. 2001; Szydlarski et al. 2014; Puglisi et al. 2018), but with a pointing matrix now replaced by . We can thus hope that as in the mapmaking problem, we can capitalize on special structures of the involved matrices and very efficient iterative solvers for solving linear systems to render the problem feasible. We point out that in the applications considered here, a subsequent value of the parameter β, that is, β_{i + 1}, can only be known after the system for the current value, β_{i}, is fully resolved. A simultaneous computation of all systems for all values of β_{i} is therefore not possible, and any computational speedup has to come from using better solvers for the linear systems and/or their numerical implementations.
When we separate parts that are dependent and independent of β, the system in Eq. (17) can also be written as
The approach we propose here is based on two observations. First, our system has some essential similarities to that of the mapmaking problem, we should therefore be able to capitalize on novel iterative techniques proposed in that case. Second, we expect that consecutive values of β_{i} in realistic sequences should not vary arbitrarily, and therefore subsequent linear systems (17) should show some resemblance. Consequently, it should be possible to shorten the time to solution for the next value of β_{i + 1} by capitalizing on the solution for the current one, β_{i}.
2.4. Blockdiagonal preconditioner
The blockdiagonal preconditioner is the most common preconditioner used in the preconditioned conjugate gradient solvers applied in the context of the CMB mapmaking problem (Natoli et al. 2001), which has demonstrated a very good performance in a number of applications. It is also the basis for the construction of more advanced preconditioners (e.g., Szydlarski et al. 2014). The blockdiagonal preconditioner is derived by replacing the noise covariance in Eq. (13) by its diagonal. In the mapmaking case, when pixelwise ordering is assumed, this leads to a blockdiagonal matrix with the blocksize defined by the number of the considered Stokes parameters. In the component separation case, this preconditioner is given by , and in the pixelwise ordering, it is blockdiagonal. The diagonal block size is now equal to the product of the number of Stokes parameters and the number of sky components, that is, 6 × 6 in the specific case considered here. Consequently, the preconditioner can easily be inverted in any ordering scheme adapted.
Hereafter, we denote the βindependent part of the preconditioner as so that
By preconditioning the system (17) from the left, we obtain
To simplify the notation in the following, we define
2.5. Component mixing
For concreteness, we assume throughout the paper the following component mixing scheme:
which follows the standard assumptions that there is no Q and U mixing, and that the scaling laws for the Stokes parameters Q and U of each components are the same. In the componentwise ordering, such mixing corresponds to the mixing matrix of the form (I is the identity matrix, 2 × 2 in this case)
The coefficients α_{f, i} encode the assumed scaling laws of the CMB, i = 1, dust, i = 2, and synchrotron, i = 3, where the last two depend on unknown scaling parameters, β_{d} and β_{s}. This matrix can be rewritten with the help of the Kronecker product as
Hereafter, we drop the explicit dependence of the mixing coefficients on β, denoting them simply as α_{f, k}.
3. Solution procedure for the parametric component separation problem
A complete solution to the component separation problem has to successfully address two aspects. First, it needs to propose an efficient approach to solving the sequences of linear systems as in Eq. (20). Second, it has to combine it with an optimized procedure for the efficient determination of the new values of the parameters β. This study addresses the former problem and focuses on the solution of a sequence of linear systems obtained for some sequences of the spectral parameters. In order to provide a fair comparison of various proposed techniques, we generate a sequence {β_{i}} beforehand and therefore, unlike in the actual applications, in our experiments, β_{i + 1} is in fact independent of the results of the preceding solution. This ensures that the performance of all the considered solvers is evaluated on the identical sequences of linear systems.
The overall solution scheme we adapt here is then as follows:
(0) Initialize β_{0} and (typically , set i := 0.
(1) Given β_{i} and the initial guess , solve the preconditioned problem, Eq. (20), deriving the current approximation .
(2a) Determine the new parameters β_{i + 1}.
(2b) Compute a new deflation space for the system associated with β_{i + 1} using a recycling technique (see details below). This should not involve the value of β_{i + 1} so that this step can be made in parallel with (2a).
(3) Compute the initial guess .
(4) Set i := i + 1 and go to (1).
In the subsections below, we discuss steps (1), (2b), and (3) in more detail.
3.1. PCG with deflation and twolevel preconditioners
Although the blockdiagonal preconditioner has been shown to ensure good performance in the mapmaking experience, it has been pointed out that even better performance can often be achieved by employing socalled twolevel preconditioners (Szydlarski et al. 2014; Puglisi et al. 2018). Such preconditioners are built from the blockdiagonal preconditioner, constituting the first level, and the second level is constructed from a limited number of vectors that are to be deflated (i.e., suppressed in the operator) in order to accelerate the convergence. These vectors are typically taken to be approximate eigenvectors of the system matrix corresponding to its smallest eigenvalues, which often hamper the convergence of PCG with the blockdiagonal preconditioner.
We start from the case of deflation for the (unpreconditioned) conjugate gradient (CG) method. CG applied to a linear system 𝔸𝕤 = 𝕓 with a given initial vector 𝕤^{(0)} and an initial residual 𝕣^{(0)} := 𝕓 − 𝔸𝕤^{(0)} builds implicitly orthogonal (residuals) and 𝔸orthogonal (search directions) bases of the Krylov subspace,
and the jth CG approximation 𝕤^{(j)} ∈ 𝕤^{(0)} + 𝒦_{j}(𝔸, 𝕣^{(0)}) is determined by the orthogonality condition on the jth residual 𝕣^{(j)} := 𝕓 − 𝔸𝕣^{(j)},
For a given set of deflation vectors, that is, the vectors to be suppressed, we denote by 𝒰 the subspace spanned by these vectors. The deflation techniques replace the original operator 𝔸 : ℝ^{n} → ℝ^{n} by a deflated operator . The approximation is then sought over the augmented subspace (see, e.g., Gaul et al. 2013),
and the jth residual is required to be orthogonal to . This effectively prevents the solver from exploring the subspace 𝒰.
An extension of this for the PCG with the (firstlevel) preconditioner 𝔹 is straightforward because we can use the PCG to implicitly build the Krylov subspace 𝒦_{j}(𝔹^{−1}𝔸, 𝔹^{−1}𝕣^{(0)}). In the considered application, the preconditioner 𝔹 is the blockdiagonal preconditioner. There are many variants of twolevel preconditioners, and we summarize them briefly in Appendix B.2. A more thorough survey can be found in Tang et al. (2009), for example.
Each iteration of a deflated (P)CG, that is, with or without the first level, is more costly than a single iteration of a standard (P)CG. The additional cost primarily depends on the number of deflated vectors, that is, the dimension of 𝒰, but also on the deflation variant. Building the subspace 𝒰 typically requires some preliminary computations, which can be as costly as solving the system (see, e.g., Szydlarski et al. 2014; Puglisi et al. 2018). Another approach, applicable to the cases when multiple systems need to be solved, is to construct the vectors “on the fly” during the solution of the systems themselves, thus hiding the additional cost. This is the approach we detail in the next sections.
3.2. Subspace recycling
Several constructions of the deflation space have been adapted to solving a sequence of linear systems, for instance, those of Saad et al. (2000), Parks et al. (2006), Kilmer & de Sturler (2006), O’Connell et al. (2017), and Jolivet & Tournier (2016). In this work, where the system matrix is symmetric positive definite (SPD), we follow Saad et al. (2000). We build a subspace by storing some of the vectors computed during the previous run of (P)CG solver and determine the slowest eigenvectors of the operator 𝔸 restricted on the subspace 𝒰 ∪ 𝒵. These are taken as the deflation vectors for the next solution. The resulting algorithm is given in Appendix C.
We can determine the subspace 𝒵 using either the residual or the search direction vectors forming (assuming the exact arithmetic) the orthogonal or an orthogonal basis of . Following Saad et al. (2000), we choose here to use the search direction vectors. We retain the first dim_{p} search direction vectors, where dim_{p} defines the dimension of the socalled recycle subspace. We use the first vectors because the orthogonality among the computed vectors is gradually lost in CG; it is therefore better preserved in the initial iterations.
The techniques for approximating k eigenvectors of the operator over a given subspace are well established. Among them, we note the Ritz and harmonic Ritz projections, which are described in detail in Appendix B.1.1. They lead to solving a (generalized) eigenvalue problem of small dimension, in our case, of dim(𝒰)+dim(𝒵). While Saad et al. (2000) suggested using the harmonic Ritz projection, we found the Ritz projection slightly more efficient in our numerical experiments, and we therefore include this in the full algorithm described in Appendix C. In another difference with Saad et al. (2000), we assemble the (small) generalized eigenvalue problem matrices in the harmonic Ritz projection using the matrixmatrix products (see Algorithm 2 in Appendix B.1.1) instead of the optimized algorithm from (Sect. 5.1, Saad et al. 2000). This is because we expect that the additional computational cost in our application is negligible and we therefore opted for simplicity.
There is no general recommendation for the choice of the number of deflation vectors, k, and the dimension of the recycling subspace, dim_{p}. Higher k may result in an increase of the overall number of matrixvector products (the system matrix has to be applied to k deflation vectors before the deflated (P)CG is started for each system) and high dim_{p} may cause numerical instabilities in solving the eigenvalue problems that determine the new deflation vectors. On the other hand, the low values of k and dim_{p} may not speed up the process sufficiently. We test this numerically in Sect. 4.
3.3. Effect of the eigenvalue multiplicity
One limiting factor to the efficiency of this approach, and more generally, to the performance of any twolevel preconditioner with the deflation space estimated using standard iterative techniques such as Arnoldi or Lanczos iterations, comes from a higher multiplicity of eigenvalues, that is, multiple eigenvectors with the same corresponding eigenvalue. This can arise either as a result of some symmetries in the scanning strategies in the case of the mapmaking systems of equations, or as similarities in the noise covariances of the different singlefrequency maps in the case of the component separation problem as studied here; see Appendix A for an example. Admittedly, such symmetries and/or similarities are not typically expected in the cases of real data analysis, but they can arise in the cases of simulated data, in particular if simplifying assumptions are adopted in order to speed up and/or simplify the simulation process.
To shed light on this problem, we consider an SPD matrix A and assume that λ is an eigenvalue with multiplicity higher than one. This means that there exists a subspace V with dim(V) > 1 such that
Let w be an arbitrary vector used to initiate the construction of a Krylov subspace and w_{V} its projection onto V, that is,
Then
Therefore, the jth Krylov subspace satisfies
and the intersection of 𝒦_{j}(A, w) and V is at most a onedimensional subspace spanned by w_{V},
Consequently, methods based on the Krylov subspace approximation, therefore including Lanczos and Arnoldi iterations (see Appendix B.1.2 for more details), can recover one vector at most from the subspace spanned by multiple eigenvectors with the same eigenvalue. This may not be sufficient to allow for a construction of an efficient twolevel preconditioner, however, in particular if the eigenvalue with many corresponding eigenvectors happens to be small: with the single precomputed vector we can only deflate a onedimensional subspace of the entire multidimensional space as spanned by all these eigenvectors, and the remainder may continue hampering the convergence.
This problem can be overcome by using a more advanced eigenvalue solver that can detect and handle the higher multiplicity of eigenvalues. An efficient implementation of such a solver is for instance provided by the ARPACK library (Lehoucq et al. 1998). In this case, the preconditioner may need to be precomputed with the help of such advanced routines, instead of constructing it on the fly as proposed here. If the presence of the eigenvalue multiplicity and the corresponding eigenvectors is known ahead of time, these vectors can be accomodated in the onthefly procedure proposed here. This is indeed the case we have encountered in one of the test cases discussed below.
We point out that the multiplicity of the eigenvalues is in principle advantageous for the standard (P)CG. In exact arithmetic, the effect of the whole subspace might be then eliminated at the cost of a single iteration. This fact is often used in the analysis of preconditioning methods based on preconditioners shifting some, possibly many, eigenvalues to the same value.
Last but not least, we emphasize that an eigenvalue multiplicity implies neither any indistinguishability of the corresponding eigenvectors nor a presence of degenerate modes in the solution, at least as long as the eigenvalue is not (numerically) zero. If the eigenvalue is not zero, the multiplicity only tells us that components of the righthand side of the linear system that belong to the subspace spanned by the corresponding eigenvectors are merely weighted by the inverse system matrix in exactly the same way.
3.4. Choice of the initial guess
The simplest and standard way to solve the sequence is to run the PCG method with the initial guess set to zero. However, some evidence exists showing that at least in the mapmaking case, this may not always be the best choice (Papež et al. 2018), in particular in cases with high signaltonoise ratios. In the case of a sequence of linear systems, all the systems involve the same initial data set with the same signal and noise content. Even in data with a low signaltonoise ratio, it may therefore be expected that adapting the initial guess following previous results may speed the process up in an interesting way. Consequently, we explore here two alternatives and show by numerical experiments that they are indeed much more efficient.
3.4.1. Previous solution as the initial guess
A natural idea is to run the PCG for the (new) problem corresponding to β_{i + 1} starting with the computed approximation ,
This can be in particular efficient when the parameters β_{i} and β_{i + 1} do not significantly differ and it is expected that so do 𝕤_{βi} and 𝕤_{βi + 1}.
3.4.2. Adapted previous solution as the new initial guess
Equation (33) can be further adapted by capitalizing on the particular structure of the mixing matrix. To start, we rewrite Eq. (17) as
If the matrix were square (and nonsingular), then
would be the vector independent of β. The solution 𝕤_{β} might then be interpreted as the coefficients with respect to the basis given by the columns of . Therefore we would have
for arbitrary (for which is nonsingular).
In our case, matrix is rectangular of size 2 n_{freq} n_{pix} × 6 n_{pix} and has full column rank. When the number of frequencies n_{freq} is not significantly higher than 3, we can generalize the above idea and use as the initial guess for the new system the vector
where M^{†} is the (Moore–Penrose) pseudoinverse of M,
We recall our assumption that M is of full column rank. Clearly, for β_{i + 1} = β_{i},
Finally, we note that the computation of the vector in Eq. (37) is very cheap because of the Kronecker structure (24) of the matrices . Writing , we obtain
in other words, only the matrices of size 2 n_{freq} × 6 need to be handled, and the cost of the proposed adaptation is nearly negligible.
4. Numerical experiments
4.1. Simulated data
For our numerical tests we use a simulated data set composed of timeordered multifrequency observations with a correlated, “1/f”, noise. The characteristics of this data set are as follows.
4.1.1. Pointing matrix
We adopt the simple scanning strategy used in Papež et al. (2018). The entire timeordered data set is composed of 𝒪(10^{8}) measurements per frequency and divided into four consecutive subsets. The pointing is assumed to be the same for each frequency. The underlying sky is pixelized using the Healpix pixelization scheme (Górski et al. 2005) with the resolution parameter n_{side} set to 1024. The scan consists of repetitive scanning of a rectangular patch made of 256 pixel rows and columns. The scanning is either horizontal, that is, along the pixel rows, for the first and third subset, or vertical, that is, along the pixel columns for the second and fourth subset. During a single lefttoright, or bottomup sweep, each sky pixel is sampled only once, and the direction of the polarizer, φ_{t}, is fixed for each of the four subsets and is equal, with respect to the sky, to 0, π/4, π/2, and, 3π/4.
The sky signal contribution to every measurement is modeled as
where p(t) denotes the pixel observed at time t, we do not include the total intensity, and Q_{c} and U_{c} stand for Q and U Stokes parameters of the combined, CMB + foregrounds, sky signal observed at frequency ν_{c}.
4.1.2. Sky maps
We assume six frequency channels that approximately correspond to those accessible for a groundbased experiment. These are
The sky signal is composed of emissions from three sources: CMB, dust, and synchrotron. The CMB signal is simulated using the current bestfit CMB model (Planck Collaboration XIII 2016), while we use the socalled COMMANDER templates (Planck Collaboration X 2016) to model the dust and synchrotron signals that we scale to our reference frequency, ν_{ref} = 150 GHz, using Planck’s fiducial laws.
For the scaling laws we take a blackbody for the CMB component (T_{CMB} = 2.7525 K), a power law for the synchrotron, and a modified blackbody for the dust, therefore
where the star distinguishes the true values of the parameters,
and B(ν, T) denotes a blackbody at temperature, T. The simulated maps are expressed in thermodynamic units and are given by
for each frequency ν = ν_{c} and each observed sky pixel p. An analogous formula holds for the Stokes U parameter. Here, Γ_{RJ}(ν) stands for a conversion factor from RayleighJeans to thermodynamic units. This expression is consistent with Eq. (24) upon a suitable definition of the coefficients α.
In our numerical experiments, we fix the dust temperature, T_{d}, to its true value and assume that only the spectral indices, β = [β_{s}, β_{d}], are determined from the data. We assume that these are estimated by maximizing the spectral likelihood, Eq. (16), using a truncated Newton maximization procedure. We use this approach to generate a single sequence of {β_{i}}, which, as explained in Sect. 3, we adopt consistently in all our runs. The sequence is made of 26 values and is shown in Fig. 1. In Appendix D we show for completeness the results of a similar test, but performed for a sequence of β derived by sampling of the spectral likelihood. The main conclusions derived in these two examples are consistent.
Fig. 1. Sequence of the spectral parameters β_{i} = [β_{i, s}, β_{i, d}] used in our experiments and derived from the maximization of the spectral likelihood, Eq. (16). The panels in clockwise order show consecutive zooms on the sequence that converged on the likelihood peak values of β = [− 3.006, 1.584], slightly off the true values marked by the blue solid point in the top left panel and given by Eq. (45) (with T_{d} fixed in our test cases). The sequence was generated as described at the end of Sect. 4.1.2. 
4.1.3. Noise
We assume a correlated noise in the time domain with a spectrum given by
where f is the timedomain frequency. The values of f_{knee} adopted here are different for different frequency channels and taken to be such that there are strong noise correlations within a single sweep. They span the range from 0.5 up to 3 Hz from the lowest to the highest frequency channel, respectively. The noise is apodized at very low frequencies, so that the noise power is finite. is taken to be about 30 μK per sample, reflecting the fact that each measurement effectively corresponds to the combined measurement of many modern detectors operating at the same frequency. This leads to sky maps with a noise K per pixel.
4.2. Multiplicity of the eigenvalues as a result of the particular scanning strategy
In this section we address the issue of multiple eigenvectors with the same eigenvalues, which we have identified in our numerical tests. In agreement with Sect. 3.3, these were found to have significant effect on the performance of the proposed solvers. We show here how they can be traced back to the specific scanning strategy adopted in our simulations. We then describe corrective measures we included in order to minimize their effect and to ensure that our results are indeed representative of more typical cases. These considerations are given here for completeness and as a potentially useful example. However, as the proposed measures may not be necessary in most of the cases of realistic data, this section can be skipped without affecting the further reading.
We denote two pointing matrices for two horizontal scans as P_{0} and P_{π/2} and two pointing matrices for two vertical scans as P_{π/4} and P_{3π/4}, where the subscript stands for the polarizer angle in the sky coordinates. They are related as
where ℛ is a 12 n_{pix}by12 n_{pix} blockdiagonal matrix with each diagonal block equal to a 2by2 spin2 rotation matrix for each pixel of the six frequency maps. While this is clearly a result of the simplifying assumption made in the simulations, this example may be of interest also in more realistic circumstances where certain relations of this sort may happen to be fulfilled approximately following from some common but inexact symmetries of typically adopted scans. We therefore briefly explore the consequences of this here.
In the case at hand, we can represent the total pointing matrix as
given that all four scan subsets observe exactly the same sky.
When in addition the noise covariance for each of the four scan subsets is exactly the same, we have
We note that this holds if the noise properties vary from one frequency channel to another, as is indeed the case in our simulations.
If now v is an eigenvector of the matrix with a corresponding eigenvalue, λ_{v}, then
and therefore,
This means that also ℛ_{π/4} v is an eigenvector of the matrix A with the same eigenvalue, λ_{v}. Because this reasoning applies as much to the matrix A as the matrix , we have
In general, this does not yet imply that the component separation system matrix preconditioned with the blockdiagonal preconditioner, Eq. (21), given by
has two eigenvectors corresponding to the same eigenvalue related by the rotation operator acting in the component space. This is the case when the subspace spanned by v and ℛ_{π/4} v is contained in the subspace spanned by the columns of the mixing matrix, . Otherwise, the preconditioned system matrix may have a single (when these two subspaces merely intersect) or no corresponding eigenvectors (when these two subspaces are disjoint). Which of these situations is actually realized is case dependent and in general also depends on the value of β.
We found that in our numerical experiments the eigenvalue multiplicity of the preconditioned system matrix due to the assumed scan symmetries was sufficiently common that we opted to account for it explicitly in our analysis. Consequently, we use the subspace recycling to approximate one of the eigenvectors, and we compute the other by applying the rotation operator. We then use both vectors to construct the deflation operator. Given that ℛ_{π/4} = −ℛ_{−π/4}, there is no ambiguity because we do not know a priori which of the two vectors we estimate directly through the subspace recycling, and this approach leads to the same deflation space, regardless of the rotation that is applied. This technique has led to a significant speedup in the cases studied below.
4.3. Results
We compare the convergence using the relative norm of the jth residual,
The iterations for each system are stopped when this value drops below tolerance TOL = 10^{−8}, but we always perform at least one iteration.
We first show that the systems corresponding to different βs from the sequence are indeed “close to each other” in the sense that they display a similar behavior during the solution process. We illustrate this by showing the convergence of PCG with zero initial guess in Fig. 2. We find that all the convergence curves are indeed very similar, even for the initial elements of the sequence where the values of β parameters continue to change quite significantly.
Fig. 2. Convergence of PCG with zero initial guess and blockJacobi preconditioner for all 26 systems in the sequence shown in Fig. 1. 
We can therefore focus on the “true” system corresponding to β = β^{⋆} in order to investigate the improvement caused by deflating the eigenvectors corresponding to the smallest eigenvalues of the system matrix. This is depicted in Fig. 3. Here, the eigenvectors are computed using the ARPACK eigensolver (Lehoucq et al. 1998), and as expected, we find that significant improvement is achieved by the deflation.
Fig. 3. Convergence of PCG with the deflation applied to 2, 6, and 10 slowest eigenvectors for the true system with β = β^{⋆}. 
Then, we illustrate the effect of the deflation space built by recycling. To do this, we first consider six systems and start the iterations always with zero initial guess, 𝕤^{(0)} = 0. The result for k = 10 eigenvectors approximated using the dimension of the subspace, dim_{p} = 100, is given in Fig. 4.
Fig. 4. Convergence of deflated PCG with the deflation subspace build by recycling. Here we consider the first six systems from the sequence and start the iterations always with zero initial guess. k = 10 eigenvectors are approximated using dim_{p} = 100. 
In Fig. 5 we compare the convergence of the full sequence of the 26 systems using the techniques of setting the initial guess proposed in Eqs. (33) and (37) and using subspace recycling. We recall that standard PCG with zero initial vector takes more than 6000 iterations; cf. Fig. 2. Although the subspace recycling variant requires 25 × 10 matrixvector products^{2} more than the PCG with blockJacobi preconditioner for any choice of the initial guess, it still provides an interesting increase in speed.
Fig. 5. Comparison of the PCG with different choices of initial guess (as in Eqs. (33) and (37)) and the PCG with the subspace recycling (together with the choice of the initial guess as in Eq. (37)). For the recycling we consider k = 10 eigenvectors approximated using dim_{p} = 100. 
We also compare the time required by one iteration of the PCG with a deflation with the time required by a standard PCG iteration. In Table 1 we list the relative times of a single iteration of a deflated PCG with 2, 6, and 10 deflation vectors in our experiments (taking an average from five runs, each with ten iterations). The small and negligible overhead introduced by the twolevel preconditioner indicates that most of the time in the code is spent on the standard mapmaking operations, such as (de)projection operators and noise weighting (e.g., Cantalupo et al. 2010). We emphasize that these timings depend on the implementation and the hardware on which it is executed. For massively parallel codes, the cost of a deflated PCG iteration compared to a standard PCG iteration may increase somewhat, but we expect that the proposed algorithm should nevertheless still be offering an attractive increase in speed.
Timings of a single deflated PCG iteration with 2, 6, and 10 deflation vectors.
As noted above, the performance of the PCG with the deflation space built by recycling is affected by the number of the deflation vectors, k, and the dimension of the recycling subspace, dim_{p}. There is no general recommendation for their choice. Higher k may result in an increase of the overall number of matrixvector products (the system matrix has to be applied to k deflation vectors before the deflated PCG is started for each system) and high dim_{p} may cause numerical instabilities in solving the eigenvalue problems that determine the new deflation vectors. On the other hand, the low values of k and dim_{p} may not increase the speed sufficiently. We compare the convergence of PCG with some choices of k and dim_{p} in Fig. 6 and in Table 2. The deflation clearly has a small effect for small dim_{p}, that is, when the eigenvectors are not approximated accurately.
Fig. 6. Comparison of the PCG with different choices of k and dim_{p} for the first ten systems in the sequence. The initial guess is the same as in Eq. (37). The iteration counts are also listed in Table 2. Because the convergence for the first system is independent of dim_{p} and k, the xaxis is depicted starting from the iteration 200. 
Number of PCG iterations and matrixvector products (MatVecs) for different choices of k and dim_{p} for the first ten systems in the sequence.
5. Conclusions and further perspectives
We have presented a procedure for efficiently solving a sequence of linear systems arising from the CMB parametric component separation problem. Two main novelties are the proposed choice of the initial vectors and the recycling technique used to determine the deflation space. Motivated by our simulated data, we also emphasized and elaborated on the role of the multiplicity of the eigenvalues, in particular in the context of their effect on the performance of twolevel preconditioners.
The overall speedup factor we obtained, ∼5−7, is significant. The bulk of the improvement comes from reusing the solution of an earlier system as the initial guess of the next solution – a simple but not trivial observation owing to the fact that this is the same data set being used in all the solutions. However, other proposed amendments provide a significant additional performance boost on the order of ∼2. This is particularly significant in the case of the samplingbased application. Further improvements and optimizations are clearly possible. For instance, the number of required matrixvector products can be decreased by not using the twolevel preconditioner for all the systems. As the experiments showed, when two consecutive system solutions are very similar, the PCG with the diagonal preconditioner and a proper initial guess (e.g., as proposed in Sect. 3.4.2) can already be sufficient for convergence in a few iterations.
We emphasize that in practice, we will be only able to capitalize on this type of approach when they are implemented in a form of highly efficient highperformance massively parallel numerical algorithms and codes. We leave a full demonstration of this to future work, noting here only that many of the relevant techniques have been studied in the past and recent literature, showing that this should indeed be feasible (e.g., Cantalupo et al. 2010; Sudarsan et al. 2011; Szydlarski et al. 2014; Papež et al. 2018; Seljebotn et al. 2019).
The techniques we presented, rendered in the form of efficient highperformance codes, should allow for the efficient maximization of the data likelihood or the posterior distributions in the component separation problems and produce reliable sky component estimates for at least some of the forthcoming data sets. However, in the cases of sampling algorithms, when many thousand linear systems may need to be solved, this still remains to be demonstrated, and further improvements will likely be required. They will depend in general on specific details of the employed sampling technique, however, and we leave them here as future work.
The techniques discussed here can also be used in other problems in CMB data analysis that require solving a sequence of linear systems. In particular, they should be directly relevant for applications that estimate instrumental parameters, which commonly have to be included in more realistic data models and estimated from the data.
The codes used in this work are available from a GitHub repository^{3}.
It is necessary to apply the matrix to deflation vectors at the beginning of the deflated PCG to build the projection matrices, see Algorithm 1 in Appendix C.
FGBUSTER: https://github.com/fgbuster
Acknowledgments
We thank Olivier Tissot for insightful discussions and Dominic Beck and Josquin Errard for their help with numerical experiments. The first two authors’ work was supported by the NLAFET project as part of European Union’s Horizon 2020 research and innovation program under grant 671633. This work was also supported in part by the French National Research Agency (ANR) contract ANR17C23000201 (project B3DCMB). This research used resources of the National Energy Research Scientific Computing Center (NERSC), a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DEAC0205CH11231.
References
 Brandt, W. N., Lawrence, C. R., Readhead, A. C. S., Pakianathan, J. N., & Fiola, T. M. 1994, ApJ, 424, 1 [Google Scholar]
 Calvetti, D., Reichel, L., & Sorensen, D. C. 1994, Electron. Trans. Numer. Anal., 2, 1 [Google Scholar]
 Cantalupo, C. M., Borrill, J. D., Jaffe, A. H., Kisner, T. S., & Stompor, R. 2010, ApJS, 187, 212 [NASA ADS] [CrossRef] [Google Scholar]
 Dostál, Z. 1988, Int. J. Comput. Math., 23, 315 [CrossRef] [Google Scholar]
 Eriksen, H. K., Jewell, J. B., Dickinson, C., et al. 2008, ApJ, 676, 10 [NASA ADS] [CrossRef] [Google Scholar]
 Gaul, A., Gutknecht, M. H., Liesen, J., & Nabben, R. 2013, SIAM J. Matrix Anal. Appl., 34, 495 [CrossRef] [Google Scholar]
 Górski, K. M., Hivon, E., Banday, A. J., et al. 2005, ApJ, 622, 759 [NASA ADS] [CrossRef] [Google Scholar]
 Grigori, L., Stompor, R., & Szydlarski, M. 2012, SC 2012: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, 1 [Google Scholar]
 Hestenes, M. R., & Stiefel, E., 1952, J. Res. Natl. Bureau Stand., 49, 409 [Google Scholar]
 Jolivet, P., & Tournier, P. H. 2016, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’16 (Piscataway, NJ, USA: IEEE Press), 17:1 [Google Scholar]
 Kilmer, M. E., & de Sturler, E. 2006, SIAM J. Sci. Comput., 27, 2140 [CrossRef] [Google Scholar]
 Lehoucq, R. B., Sorensen, D. C., & Yang, C. 1998, Software, Environments, and Tools. ARPACK Users’ Guide (Philadelphia, PA: Society for Industrial and Applied Mathematics (SIAM)), 6, xvi+142 [Google Scholar]
 Morgan, R. B. 1995, SIAM J. Matrix Anal. Appl., 16, 1154 [CrossRef] [Google Scholar]
 Natoli, P., de Gasperis, G., Gheller, C., & Vittorio, N. 2001, A&A, 372, 346 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Nicolaides, R. 1987, SIAM J. Numer. Anal., 24, 355 [NASA ADS] [CrossRef] [MathSciNet] [Google Scholar]
 O’Connell, M., Kilmer, M. E., de Sturler, E., & Gugercin, S. 2017, SIAM J. Sci. Comput., 39, B272 [CrossRef] [Google Scholar]
 Paige, C. C., Parlett, B. N., & van der Vorst, H. A. 1995, Numer. Linear Algebra Appl., 2, 115 [Google Scholar]
 Papež, J., Grigori, L., & Stompor, R. 2018, A&A, 620, A59 [CrossRef] [EDP Sciences] [Google Scholar]
 Parks, M. L., de Sturler, E., Mackey, G., Johnson, D. D., & Maiti, S. 2006, SIAM J. Sci. Comput., 28, 1651 [CrossRef] [Google Scholar]
 Planck Collaboration X. 2016, A&A, 594, A10 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Planck Collaboration XIII. 2016, A&A, 594, A13 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Puglisi, G., Poletti, D., Fabbian, G., et al. 2018, A&A, 618, A62 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Saad, Y., & Schultz, M. H. 1986, SIAM J. Sci. Stat. Comput., 7, 856 [Google Scholar]
 Saad, Y., Yeung, M., Erhel, J., & Guyomarc’h, F. 2000, SIAM J. Sci. Comput., 21, 1909 [CrossRef] [Google Scholar]
 Seljebotn, D. S., Bærland, T., Eriksen, H. K., Mardal, K. A., & Wehus, I. K. 2019, A&A, 627, A98 [CrossRef] [EDP Sciences] [Google Scholar]
 Sleijpen, G. L. G., & Van der Vorst, H. A. 2000, SIAM Rev., 42, 267 [Google Scholar]
 Sorensen, D. C. 1992, SIAM J. Matrix Anal. Appl., 13, 357 [CrossRef] [MathSciNet] [Google Scholar]
 Sorensen, D. C. 2002, Acta Numer., 11, 519 [CrossRef] [Google Scholar]
 Stewart, G. W. 2001/02, SIAM J. Matrix Anal. Appl., 23, 601 [CrossRef] [Google Scholar]
 Stompor, R., Leach, S., Stivoli, F., & Baccigalupi, C. 2009, MNRAS, 392, 216 [Google Scholar]
 Sudarsan, R., Borrill, J., Cantalupo, C., et al. 2011, Proceedings of the International Conference on Supercomputing, ICS ’11 (New York, NY, USA: Association for Computing Machinery), 305 [CrossRef] [Google Scholar]
 Szydlarski, M., Grigori, L., & Stompor, R. 2014, A&A, 572, A39 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Tang, J. M., Nabben, R., Vuik, C., & Erlangga, Y. A. 2009, J. Sci. Comput., 39, 340 [CrossRef] [Google Scholar]
 Wu, G. 2017, SIAM J. Matrix Anal. Appl., 38, 118 [CrossRef] [Google Scholar]
 Wu, K., & Simon, H. 2000, SIAM J. Matrix Anal. Appl., 22, 602 [CrossRef] [MathSciNet] [Google Scholar]
Appendix A: Eigenvalue multiplicity in the component separation problem. A worked example
In this section we discuss the eigenvector and eigenvalue structure of the preconditioned matrix defined in Eq. (20) in the context of eigenvalue multiplicity in some specific setup that in particular assumes that the pointing matrix is the same for each frequency and that the noise has the same covariance (up to a scaling factor) for each frequency, that is,
While these requirements are not very likely to be strictly realized in any actual data set, they can be fulfilled approximately, leading to near multiplicities of the eigenvalues. If these are not accounted for, they may be as harmful to the action of the preconditioner as the exact multiplicities. Moreover, this worked example demonstrates that the component separation problem is in general expected to be more affected by this type of effect than the standard mapmaking solver, for instance, therefore emphasizing that due diligence is necessary in this former application.
First, let (λ_{i}, v_{i}) = (λ_{i}, ⌈v_{i, q},v_{i, u}⌋) be an eigenpair of the mapmaking matrix, that is, there holds
We note that
because of the form of the mixing we assumed in Eq. (22). Consequently, using Eq. (A.2),
Because the matrix is assumed to be nonsingular (equivalently, because is of full column rank), we can multiply this equation from the left by showing that (λ_{i}, ⌈v_{i},0,0⌋) is the eigenpair of the matrix . We can proceed analogously for the vectors ⌈0,v_{i},0⌋ and ⌈0,0,v_{i}⌋, with replacing in Eq. (A.3) α_{f, 1} by α_{f, 2} and α_{f, 3}, respectively.
There are 2 n_{pix} eigenpairs for (P^{⊤} diag(N^{−1})P)^{−1}(P^{⊤}N^{−1}P). As we showed above, each of them generates three eigenpairs (with the same eigenvalue) of . This gives together 6 n_{pix} eigenpairs, in other words, we have described the full spectrum of the preconditioned system matrix in Eq. (20).
Finally, we remark that all the eigenpairs of the preconditioned matrix in the simplified setting are independent of the parameters β. In this case, we suggest using a specialized eigensolver (e.g., ARPACK, Lehoucq et al. 1998) to compute the eigenpairs from Eq. (A.2), build the triplets of eigenvectors ⌈v_{i},0,0⌋, ⌈0,v_{i},0⌋, and ⌈0,0,v_{i}⌋, and then use the deflated PCG with these vectors.
Figure A.1 is the same as Fig. 5, but for the simplified setting. Here two triplets of the eigenvectors are constructed following the procedure described above.
Fig. A.1. Same as Fig. 5 for the simplified setting of Appendix A. Comparison of the PCG with different choices of the initial guess (as in Eqs. (33) and (37)) and the deflated PCG with 2 × 3 vectors. 
Appendix B: Ingredients of the proposed procedure
We present in this section two ingredients of the proposed procedure in more detail. Namely, we discuss approaches for estimating the eigenpairs from the computed basis of the Krylov subspace and approaches for combining the deflation of the approximate eigenvectors with another preconditioner. To facilitate presentation, we simplify the notation in this section.
B.1. Approximating the eigenvalues using Krylov subspace methods
We present first the Rayleigh–Ritz approximation, which is used in the Arnoldi and Lanczos algorithms to approximate the eigenvalues of a general nonsingular or, respectively, a hermitian matrix. Then, we recall the Arnoldi and Lanczos algorithms, and finally, we briefly comment on their restarted variants.
The methods discussed below do not represent an exhaustive overview of methods for approximating several eigenvalues and the associated eigenvectors. The omitted methods include the Jacobi–Davidson method (Sleijpen & Van der Vorst 2000), for example, which proved to be particularly efficient for approximating the inner part of the spectrum. For a survey of the methods and a list of references, see Sorensen (2002), for instance.
B.1.1. Ritz value and harmonic Ritz value approximations
For a subspace 𝒮 ⊂ ℂ^{n}, we call y ∈ 𝒮 a Ritz vector of A with Ritz value θ if
When a (computed) basis V_{j} of 𝒮 is used and y = V_{j}w is set, the above relation is equivalent to solving
Ritz values are known to approximate the extreme eigenvalues of A well. When an approximation to the interior eigenvalues is required, it can be preferable to compute the harmonic Ritz values. The term harmonic Ritz values was introduced in Paige et al. (1995), where references to previous works using this approximation can be found. Following Parks et al. (2006), we define harmonic Ritz values as the Ritz values of A^{−1} with respect to the space A𝒮,
We call a harmonic Ritz value and a harmonic Ritz vector. When V_{j} is a basis of 𝒮 and , this relation can be represented as
For the properties of the harmonic Ritz value approximations and the relationship with the iteration polynomial in MINRES method, see Paige et al. (1995).
There are various presentations and definitions in the literature of the harmonic (Rayleigh–)Ritz procedure; it is often introduced to approximate eigenvalues close to a target τ ∈ ℂ. For example, Wu (2017) prescribes the procedure by
where I is the identity matrix. With , this corresponds to the generalized eigenvalue problem
which becomes for τ = 0 exactly the righthand side equality in Eq. (B.4).
We note that harmonic Ritz approximation is also often used in the Krylov subspace recycling methods to approximate the smallest (in magnitude) eigenvalues and the associated eigenvectors.
Finally, we comment on the (harmonic) Ritz approximation in the case when we wish to compute the eigenvalues of the matrix A preconditioned from the left by M. In the general case, when only A and M are assumed to be nonsingular, the Ritz and harmonic Ritz approximation are applied as above by just replacing in the formulas A by M^{−1}A. When the matrix A is hermitian and the preconditioner M is SPD, there is also another option. First, we note that the matrix M^{−1}A is not hermitian, but is selfadjoint with respect to the inner product induced by M, that is,
where (v, w)_{M} ≡ v^{⊤}Mw. This allows in the definition of Ritz and harmonic Ritz approximation replacing A by M^{−1}A and the standard inner product by the inner product induced by the matrix M, giving
or
respectively. The corresponding algebraic problems with y = V_{j}w and are
and
respectively. The problems above involve hermitian matrices only.
B.1.2. Arnoldi and Lanczos methods
Arnoldi and Lanczos algorithms for approximating the eigenvalues of a general nonsingular or a hermitian matrix are based on a Ritz approximation with setting 𝒮 = 𝒦_{j}(A, v_{1}) = span(v_{1}, Av_{1}, …, A^{j − 1}v_{1}), the jth Krylov subspace. The methods compute an orthogonal basis V_{j} of 𝒮 such that
where e_{j} is the last column vector of the identity matrix (of size j) and , . Consequently, the eigenvalue problem in Eq. (B.2) corresponding to the Ritz approximation reads
The matrix T_{j} is available during the iterations. The standard use of the Arnoldi and Lanczos method for eigenvalue approximation consists of solving the above problem and setting the pairs (θ, V_{j}w) as the computed approximations.
The Ritz approximation can be replaced by the harmonic Ritz approximation. Then, the matrices in Eq. (B.4) become
The Lanczos algorithm is a variant of the Arnoldi algorithm for a hermitian A. The matrix , which is in the Arnoldi method upper Hessenberg, is then also hermitian. Consequently, it is tridiagonal, which means that in each step of the Lanczos method, we orthogonalize the new vector only against the two previous vectors. This ensures that the computational cost of each iteration is fixed, and only when the eigenvalues are to be approximated, storing three vectors v_{j − 1}, v_{j} and v_{j + 1} is sufficient instead of handling the full matrix V_{j}. The assumption on exact arithmetic is crucial here, however. In finite precision computations, the global orthogonality is typically quickly lost, which can cause several stability issues.
As noted above, an orthonormal basis V_{j} of 𝒮 is advantageous for the Ritz approximation. For the harmonic Ritz approximation applied to an SPD matrix A, an Aorthonormal basis can instead be constructed, which ensures that the matrix on the righthand side of Eq. (B.4) is equal to the identity. An Aorthonormal basis of a Krylov subspace can be constructed within the iterations of conjugate gradient method using the search direction vectors.
The Arnoldi method can also be applied to the preconditioned matrix M^{−1}A to compute an orthonormal basis V_{j} of the associated Krylov subspace 𝒦_{j}(M^{−1}A, M^{−1}v_{1}), giving
For a hermitian A and an SPD preconditioner M, we can apply the Lanczos method following the comment made above, using the matrix M^{−1}A and the inner product (⋅, ⋅)_{M} induced by M instead of the standard euclidean (⋅, ⋅), giving
The computed basis V_{j} is therefore Morthonormal.
B.1.3. Restarted variants
The number of iterations necessary to converge is not a priori known in Arnoldi and Lanczos algorithms, and it can in general be very high. High iteration counts require a large memory to store the basis vectors, and when a full reorthogonalization is used, a high computational effort because of the growing cost of the reorthogonalization in each step. The idea behind implicitly restarted variants is to limit the dimension of the search space 𝒮. This means that the iterations are stopped after a (fixed) number of steps, the dimension of the search space is reduced while maintaining its (Krylov) structure, and the Arnoldi/Lanczos iterations are resumed.
Several restarted variants are described in the literature (a detailed description is beyond the scope of this paper, however): the implicitly restarted Arnoldi (IRA, Sorensen 1992), the implicitly restarted Lanczos (IRL, Calvetti et al. 1994), or the Krylov–Schur method (Stewart 2001/02; Wu & Simon 2000).
The estimation of the spectrum of A is possible within the GMRES, MINRES, and CG iterations (applied to solve a system with A) because they are based on Arnoldi (GMRES and MINRES) or Lanczos (CG) algorithms. In contrast, a combination of restarted variants with solving a linear algebraic system is, to the best of our knowledge, not described in the literature.
B.2. Deflation and twolevel preconditioners
In this section we first discuss a deflation preconditioner for Krylov subspace methods that can be regarded as eliminating the effect of several (given) vectors from the operator or, equivalently, augmenting by these vectors the space in which we search for an approximation. Then we describe a combination of the deflation preconditioner with another preconditioner that is commonly used in practice.
The Krylov subspace methods (in particular CG, Hestenes 1952, and GMRES, Saad & Schultz 1986) are wellknown for their minimization (optimal) properties over the consecutively built Krylov subspace,
A question then arises: given some other subspace 𝒰, can we modify the methods such that they have the same optimal properties over the union of 𝒦_{j}(A, v) and 𝒰 (which is often called an augmented Krylov subspace)? The answer is positive and the implementation differs according to the method: it is straightforward for GMRES and requires more attention for CG. Hereafter, we denote by I the identity matrix and by Z the basis of 𝒰.
The deflation in GMRES method is often (see, e.g., GCROT by Morgan 1995) considered as a remedy to overcome the difficulties caused by restarts: for computational and memory restrictions, only a fixed number of GMRES iterations is typically performed, giving an approximation that is then used as the initial vector for a new GMRES run. In GCROT, several vectors are saved and used to augment the Krylov subspace built after the restart. The GMRES method with the deflation was used to solve a sequence of linear algebraic systems in Parks et al. (2006), for example.
The augmentation of the Krylov subspace in CG is more delicate because the original CG method can only be applied to an SPD matrix. The first such algorithm was proposed in Nicolaides (1987) and Dostál (1988). We note that it includes the construction of the conjugate projector
and in each iteration, the computation of the preconditioned search direction q_{i} = (I − P_{c.proj.}) p_{i} and of the vector Aq_{i}. The latter can be avoided at the price of storing ZandAZ and performing additional multiplication with AZ. In both variants, the cost of a single iteration is higher than the cost of one standard CG iteration.
The combination of a preconditioner with a deflation is widely studied in the literature and therefore we present this only briefly; more details and an extensive list of references can be found in the review paper by Tang et al. (2009), for instance. The preconditioner stemming from the combination of a (typically relatively simple) traditional preconditioner with the deflation is called a twolevel preconditioner. As shown in Tang et al. (2009), this shows an analogy with multilevel (multigrid) and domain decomposition methods. While the traditional preconditioner aims at removing the effect of the largest (in magnitude) eigenvalues, the deflation (projectiontype preconditioner) is intended to remove the effect of the smallest eigenvalues. Common choices for the traditional preconditioner are block Jacobi, (restricted) additive Schwarz method, and incomplete LU or Cholesky factorizations. In many applications, twolevel preconditioners proved to be efficient in the CMB data analysis (see, e.g., Grigori et al. 2012; Szydlarski et al. 2014).
We now present the combination of the traditional and projectiontype (deflation) preconditioners following the discussion and notation of Tang et al. (2009). Hereafter, we assume that the system matrix A and the traditional preconditioner M are SPD. We note that some of the preconditioners 𝒫_{□} mentioned below are not symmetric. However, their properties allow us to use them (with possible modification of the initial vector) as left preconditioners in PCG; see Tang et al. (2009) for details.
Let the deflation space span the columns of the matrix Z. We denote
Twolevel preconditioners based on the deflation are given as
Other preconditioners can be determined using the additive combination of two (SPD) preconditioners C_{1}, C_{2} as
or, using the multiplicative combination of the preconditioners, as
Three variants of twolevel preconditioners are derived by choosing aan dditive or multiplicative combination and setting C_{1} = M^{−1}, C_{2} = Q, or C_{1} = Q, C_{2} = M^{−1}. Other preconditioners can be derived using the multiplicative combination of three SPD matrices (see Tang et al. 2009).
The variants of the twolevel preconditioner mentioned above differ in the implementation cost and also in the numerical stability; see Tang et al. (2009). The variant 𝒫_{DEF1}, which is often used in the procedures for solving the sequences of linear systems (see, e.g., Saad et al. 2000), was found to be cheap but less robust, especially with respect to the accuracy of solving the coarse problem with the matrix Z^{⊤}AZ and with respect to the demanded accuracy. The conclusion drawn in Tang et al. (2009) is that “ADEF2 seems to be the best and most robust method, considering the theory, numerical experiments, and the computational cost”. Therefore the preconditioner 𝒫_{A − DEF2},
is of interest, in particular in the cases where the dimension of the deflation space (equal to the number of columns in Z) is high and/or the matrix M^{−1}A is illconditioned.
As noted in Saad et al. (2000), the gradual loss of orthogonality of the computed residuals with respect to the columns of Z can cause stagnation, divergence, or erratic behaviour of errors within the iterations (see also the comment in Tang et al. 2009). The suggested remedy in this case consists of reorthogonalizing the computed residuals as
However, no such instabilities were observed in our experiments, and the results depicted throughout the paper are for the standard (nonreorthogonalized) variant.
Appendix C: Full algorithm
In this section we provide the pseudocodes for the deflated PCG (Algorithm 1), the subspace recycling (Algorithm 2), and for the full procedure (Algorithm 3) proposed in this paper and used in the numerical experiments in Sect. 4.
function DEFLPCG(𝔸, 𝔹, 𝕓, 𝕤^{(0)}, Z, dim_{p}, j_{max})
Q = Z(Z^{⊤}𝔸Z)^{−1}Z^{⊤} ⊳ in practice, we save 𝔸Z to use later
𝕣^{(0)} = (I − 𝔸Q)(𝕓 − 𝔸𝕤^{(0)}) ⊳ with saved 𝔸Z, Q𝔸 and 𝔸Q
can be computed without applying 𝔸
𝕡^{(0)} = 𝕣^{(0)}
for j = 0, …, j_{max} do
𝕨^{(j)} = (I − 𝔸Q)(𝔸𝕡^{(j)})
if j ≤ dim_{p} then
save (I − Q𝕤)𝕡^{(j)} into ⊳ in practice, we also
save 𝕨^{(j)} to avoid computing later
end if
𝕤^{(j + 1)} = 𝕤^{(j)} + γ^{(j)}𝕡^{(j)}
𝕣^{(j + 1)} = 𝕣^{(j)} − γ^{(j)}𝕨^{(j)}
check the stopping criterion
end for
𝕤^{(final)} := Q𝕓 + (I − Q𝕤)𝕤^{(j)}
return 𝕤^{(final)};
⊳ to be efficient, we also return 𝔸Z and the vectors {𝕨^{(j)}}
end function
function SUBSPREC(𝔸, 𝔹, U, k
F = U^{⊤}𝔹U
G = U^{⊤}𝔸U ⊳ in practice, we reuse 𝔸U saved during the deflated PCG
solve the generalized eigenvalue problem GY = diag(λ_{i})FY
take k smallest λ_{i} and the respective columns of Y, Y_{k}
return Z = UY_{k}
end function
Require: β _{0},
Require: k, dim_{p}
set Z := []
for i = 0, …, i_{max} do
assembly the system matrix 𝔸, righthand side 𝕓, and the preconditioner 𝔹 corresponding to β_{i}
if i > 0 then
DEFLPCG𝔸, 𝔹, 𝕓, , Z, dim_{p}, j_{max} → (𝕤, )
check the stopping criterion for β_{i}, exit if i = i_{max}
(determine β_{i + 1}) ⊳considered here as a black box
SUBSPREC(𝔸, 𝔹, , k) → Z
end for
Appendix D: Results for an alternative sequence of mixing parameters from a Monte Carlo sampling
In this section we provide results for a sequence of spectral parameters generated by a Monte Carlo sampling algorithm. In realistic circumstances, these sequences may contain up to many thousand samples, but for computational efficiency, we here restrict ourselves to a subsequence made of merely 30 elements. We use them in order to demonstrate the performance of the proposed approach on a sequence with realistic properties but sufficiently different than those of the sequences encountered in the likelihood maximization process. We emphasize that it is not yet our purpose here to validate the method on the actual application, and more work is needed to achieve this goal, see Sect. 5. The sequence is depicted in Fig. D.1. It was produced using the publicly available software FGBUSTER^{4} and indeed shows qualitatively a very different behavior than that of our standard case displayed in Fig. 1.
Fig. D.1. Plot of a sequence of the spectral parameters β_{i} = [β_{i, s}, β_{i, d}] drawn through a Monte Carlo sampling technique and used as an alternative test case in the numerical experiments described in Appendix D. 
In Fig. D.2 and in Table D.1, we compare the results obtained in this case by applying the various techniques discussed and proposed in this work.
Fig. D.2. Comparison of the PCG with different choices of initial guess (as in Eqs. (33) and (37)) and the PCG with the subspace recycling (together with the choice of the initial guess as in Eq. (37)). For the recycling, we consider k = 10 eigenvectors approximated using dim_{p} = 100. The convergence for the whole sequence when the initial guess is as in Eq. (33) (the yellow line) requires 4010 iterations. 
Number of matrixvector products (MatVecs) for different techniques as in Fig. D.2.
All Tables
Number of PCG iterations and matrixvector products (MatVecs) for different choices of k and dim_{p} for the first ten systems in the sequence.
Number of matrixvector products (MatVecs) for different techniques as in Fig. D.2.
All Figures
Fig. 1. Sequence of the spectral parameters β_{i} = [β_{i, s}, β_{i, d}] used in our experiments and derived from the maximization of the spectral likelihood, Eq. (16). The panels in clockwise order show consecutive zooms on the sequence that converged on the likelihood peak values of β = [− 3.006, 1.584], slightly off the true values marked by the blue solid point in the top left panel and given by Eq. (45) (with T_{d} fixed in our test cases). The sequence was generated as described at the end of Sect. 4.1.2. 

In the text 
Fig. 2. Convergence of PCG with zero initial guess and blockJacobi preconditioner for all 26 systems in the sequence shown in Fig. 1. 

In the text 
Fig. 3. Convergence of PCG with the deflation applied to 2, 6, and 10 slowest eigenvectors for the true system with β = β^{⋆}. 

In the text 
Fig. 4. Convergence of deflated PCG with the deflation subspace build by recycling. Here we consider the first six systems from the sequence and start the iterations always with zero initial guess. k = 10 eigenvectors are approximated using dim_{p} = 100. 

In the text 
Fig. 5. Comparison of the PCG with different choices of initial guess (as in Eqs. (33) and (37)) and the PCG with the subspace recycling (together with the choice of the initial guess as in Eq. (37)). For the recycling we consider k = 10 eigenvectors approximated using dim_{p} = 100. 

In the text 
Fig. 6. Comparison of the PCG with different choices of k and dim_{p} for the first ten systems in the sequence. The initial guess is the same as in Eq. (37). The iteration counts are also listed in Table 2. Because the convergence for the first system is independent of dim_{p} and k, the xaxis is depicted starting from the iteration 200. 

In the text 
Fig. A.1. Same as Fig. 5 for the simplified setting of Appendix A. Comparison of the PCG with different choices of the initial guess (as in Eqs. (33) and (37)) and the deflated PCG with 2 × 3 vectors. 

In the text 
Fig. D.1. Plot of a sequence of the spectral parameters β_{i} = [β_{i, s}, β_{i, d}] drawn through a Monte Carlo sampling technique and used as an alternative test case in the numerical experiments described in Appendix D. 

In the text 
Fig. D.2. Comparison of the PCG with different choices of initial guess (as in Eqs. (33) and (37)) and the PCG with the subspace recycling (together with the choice of the initial guess as in Eq. (37)). For the recycling, we consider k = 10 eigenvectors approximated using dim_{p} = 100. The convergence for the whole sequence when the initial guess is as in Eq. (33) (the yellow line) requires 4010 iterations. 

In the text 
Current usage metrics show cumulative count of Article Views (fulltext article views including HTML views, PDF and ePub downloads, according to the available data) and Abstracts Views on Vision4Press platform.
Data correspond to usage on the plateform after 2015. The current usage metrics is available 4896 hours after online publication and is updated daily on week days.
Initial download of the metrics may take a while.