Accelerating linear system solvers for time domain component separation of cosmic microwave background data

Component separation is one of the key stages of any modern, cosmic microwave background (CMB) data analysis pipeline. It is an inherently non-linear procedure and typically involves a series of sequential solutions of linear systems with similar, albeit not identical system matrices, derived for different data models of the same data set. Sequences of this kind arise for instance in the maximization of the data likelihood with respect to foreground parameters or sampling of their posterior distribution. However, they are also common in many other contexts. In this work we consider solving the component separation problem directly in the measurement (time) domain, which can have a number of important advantageous over the more standard pixel-based methods, in particular if non-negligible time-domain noise correlations are present as it is commonly the case. The time-domain based approach implies, however, significant computational effort due to the need to manipulate the full volume of time-domain data set. To address this challenge, we propose and study efficient solvers adapted to solving time-domain-based, component separation systems and their sequences and which are capable of capitalizing on information derived from the previous solutions. This is achieved either via adapting the initial guess of the subsequent system or through a so-called subspace recycling, which allows to construct progressively more efficient, two-level preconditioners. We report an overall speed-up over solving the systems independently of a factor of nearly 7, or 5, in the worked examples inspired respectively by the likelihood maximization and likelihood sampling procedures we consider in this work.


Context and motivation
Measurements registered by the cosmic microwave background (CMB) experiments in addition to the sought-after signal of cosmological origin contain contributions generated by astrophysical sources. These are generically called foregrounds and can be of either galactic or extra-galactic origins, either diffuse or point-source like. An essential step of the CMB data analysis is therefore that of separating the foreground signals from each other and, specifically, from the CMB one, capitalizing on their different electromagnetic frequency dependence and/or statistical properties than those of the CMB (e.g., Planck Collaboration et al. 2016a, and references therein). This process is referred to as component separation. As in polarization the foreground signals tend to dominate the cosmological one over a broad range of angular scales and observational frequencies the next generation of the CMB observatories will be only capable of delivering its science in full if a high precision, statistically sound and reliable component separation techniques and their numerically efficient implementations are available.
The component separation is a non-linear procedure, which from the data measured at multiple different frequency bands aims at estimating simultaneously properties of the foregrounds as well as the signals themselves. This is commonly accomplished in a pixel domain assuming that maps of the sky for e-mail: jan@papez.org each frequency band have been already reconstructed and their statistical uncertainties somehow estimated as part of the initial procedure commonly called a map-making. Then the component separation is typically performed in two steps. On the first step, the foreground parameters are estimated, and, on the second step, they are used to disentangle the frequency maps into maps of sky components. For concreteness, we focus in this work on the socalled parametric component separation approach (e.g., Brandt et al. 1994;Eriksen et al. 2008;Stompor et al. 2009), where the frequency scaling relations for each considered sky component are assumed to be given and known up to a limited number of unknowns, called foreground spectral parameters, which are then estimated from the data on the first step of the component separation. However, the numerical techniques discussed here are more general and should be found useful also in alternative component separation methods.
The two-step, pixel-domain based approach outlined above is conceptually simple and computationally potentially very efficient as the maps conveniently compress the information contained in a typically much larger initial raw data set, referred to hereafter as the time domain data set, and do so in a nearly lossless fashion. Indeed, for the next generation of the CMB experiments we expect to have as many as n t ∼ O(10 13 − 10 15 ) raw measurements but only n pix ∼ O(10 5 − 10 8 ) sky pixels. This compression however can only work satisfactorily if a manageable, but sufficiently precise, statistical description of these maps Article number, page 1 of 17 arXiv:2002.02833v1 [astro-ph.CO] 7 Feb 2020 A&A proofs: manuscript no. main can be derived. In practice, this is often difficult due to storage and computational cycles limitations. A general, full covariance matrix of a single frequency map contains n pix 2 ∼ O(10 10 − 10 16 ) elements, all of which need to be first computed, at the cost of at least of O(λ) O(n t ) floating point operations (flops), and later stored in a computer memory. (Here λ is a time-domain noise correlation length and can reach many thousands of samples.) The full computations therefore quickly become prohibitively expensive even if an explicit inversion of the covariance matrix is replaced by some iterative procedure which typically requires O(n iter n pix 2 ) flops, where the number of iterations, n iter is usually on order of 10 2 . Consequently, in practice the best way forward may be to invoke some approximations. This however is often problematic as well, as a successful approximation needs to ensure sufficient accuracy to avoid introducing systematic biases in the estimated foreground parameters and later also in the component maps, while being computationally feasible.
A general solution to the problem would be to avoid resorting to the frequency maps at all and to perform all the calculation in the time data domain. This would typically require O(n iter n t ln λ) flops and the memory on order of O(n t ) and would not only be more robust but also more manageable whenever significant time-domain noise correlations are present, λ 1. Nevertheless, this approach also faces significant numerical challenges. In this work, we explore some of the avenues which could render this approach feasible. We first identify the major, computation-heavy step which unavoidably appears in any implementation of this technique and then investigate how it could be accelerated by employing better, more advanced methods and their implementations.
The plan of this paper is as follows. In Section 2 we present the component separation problem and the numerical challenges it poses. In Section 3 we describe the proposed solution and in Section 4 results of the numerical tests. Section 5 provides a brief summary and outlines prospects for the future. Material which is more technical in nature or which is added for completeness is as usually deferred to the Appendices.

Preliminaries.
Hereafter, we consider polarized signals and assume, for simplicity and definiteness, that in every pixel on the sky the signal is characterized by two Stokes parameters, Q and U. Extensions to include total intensity, I, are straightforward. Consequently, hereafter every considered sky map consists of two maps each corresponding to one of the two Stokes parameters and which are concatenated together in a single map vector, (2.1) Hereafter, partial brackets, . . . , denote a vertical object. Examples of the sky maps as discussed in the following are single frequency maps storing information about the sky signal as observed at a given frequency or single sky component maps containing information about a sky signal of some specific physical origin. We refer to the ordering defined above as Stokes-wise, as a complete sky map of one Stokes parameter is followed up by another. In addition, we will also consider a pixel-wise ordering which for single maps reads, where Q and U parameters of a signal in one pixel are stored consecutively and followed by those in another. The goal of the component separation procedure is to estimate all assumed sky component signals given multiple frequency data. Therefore, commonly, we will be dealing with multiple maps of the same type, say, multiple single frequency maps or multiple single component maps, and will concatenate them in a single, multi-frequency or multi-component vector. For definiteness, in this work we fix the number of components to n comp = 3 and consider three different sky components, CMB, dust, and synchrotron. A multi-component vector, s, will therefore contain information about Q and U Stokes parameters of all three components. Such a vector can be ordered in multiple ways and most commonly we will assume to be ordered either in a component-wise way, when, s ≡ s cmb , s dust , s sync = s cmb,q , s cmb,u , s dust,q , s dust,u , s sync,q , s sync,u ∈ R 6n pix , (2.3) or a pixel-wise way, where for each pixel all Stokes parameters follow consecutively for all considered components, i.e., s ≡ s cmb,q (1), s cmb,u (1), . . . , s sync,q (1), s sync,u (1), . . . (2.4) s cmb,q (n pix ), s cmb,u (n pix ), . . . , s sync,q (n pix ), s sync,u (n pix ) .
Multi-frequency vectors can be ordered in analogous manners.
The choice of the ordering will in general depend on the specific context and is obviously of a key importance for a numerical implementation of the map-making or component separation procedures. Nonetheless, mathematically, switching the ordering from one to another, is described by a linear, orthonormal, full-rank operator, U, which is conceptually trivial to apply and an application of which commutes with other matrix operations such as a matrix inversion, given that, (2.5) for any invertible matrix, M. Consequently, one can perform a matrix inversion using one ordering, say, computing M −1 , and reorder the result afterwards to obtain the inverse in the other ordering scheme, i.e., (U M U t ) −1 . For this reason, in the following we will freely switch between the different orderings depending on the context in order to highlight specific structures of the matrices, which may be more apparent for one choice than the other.

Data model.
As mentioned earlier we consider a component separation procedure performed directly on the time-domain data as measured by the instrument thus avoiding performing any prior, explicit map-making procedure. For this we need to relate the data directly to amplitudes of the component maps in the predefined sky pixels as these maps are the targeted outcome of the component separation procedure. We assume that for each frequency the time-domain data are made of sequences of consecutive observations registered by all detectors operating at this frequency and concatenated together, we can write, where s is the unknown vector of the component amplitudes and the star indicates that those are their actual values, n f is an (unknown) noise vector, and the number of the frequency channels, n freq , is assumed to be larger than that of the components, n comp , set to 3 in this work, to ensure that the problem is welldefined. The matrix, P β , f , appearing in Eq. (2.6) combines the information about the instrument operations and the sky properties, and can be expressed as, where M β , f ∈ R 2n pix ×6n pix is a so-called mixing matrix, which determines how different sky components mix together at all observed frequencies to yield the observed signal. The mixing matrix depends explicitly on the foreground scaling parameters, which we denote as β , and frequency of the observation, f . P f ∈ R n t ×2n pix is in turn a pointing matrix defining which pixel of the sky each detector operating at a given frequency observed at every time. So while it does not explicitly depend on frequency or scaling parameters it is in principle different for different frequencies as it encodes pointing of detectors specific to this frequency, the fact which we highlight by the subscript, f . We have, where s f is a single frequency map expressing the combined sky signal at frequency f . The data vector, d f , is time-ordered as its elements are indexed by the time at which the measurement was taken.

Component separation.
The goal of the component separation procedure is to solve an inverse problem, Eq. (2.6), and estimate the components, s , given the full data set, d (:= {d f }), made of data taken at all observational frequencies. This is typically solved by assuming that the noise, n f , is Gaussian, with a vanishing mean and a known variance, N f , and writing a data likelihood, where we have dropped to distinguish between an estimate and the true value and we use tilde to denote multifrequency objects. We have, (2.10) which follows from Eq. (2.7), and, which assumes no noise correlations between different frequency channels. In addition, throughout this work we also assume that while the component mixing represented by M β may involve (potentially) all components, it is always done on a pixelby-pixel basis so all the elements of M β which corresponds to different pixels vanish. Similarly, and in agreement with assumptions made in map-making procedures, we will assume that the noise matrices, N f , are block diagonal with each block representing a banded Toeplitz matrix. The standard, two-step component separation procedure proceeds by first estimating for each frequency band, f , a single frequency map, m f , and its covariance,N f , given by, and then by performing component separation by modeling the single frequency maps yielded by the first step as, (2.14) wheren f stands for a pixel-domain noise and is a Gaussian variable with varianceN f . We can therefore write the corresponding likelihood as, This procedure is equivalent to solving the maximum likelihood problem defined by Eq. (2.9) directly. However it requires an explicit calculation ofN −1 f , which for the current and forthcoming experiment is typically prohibitive due to restrictions on both the available computer memory and computational cycles. An alternative could be to solve the original problem directly without invoking explicitly any pixel-domain objects. This is the option we study in this work. We note here in passing that intermediate approaches are also possible as for instance one which relies on the likelihood in Eq. (2.15), but does not assume thatN −1 f is given explicitly. Instead, it computes a product of the covariance and a vector using an iterative procedure which only requires applying the inverse covariance to a vector, what is performed using its implicit representation, Eq. (2.13), as it is done in the map-making solvers. On the algorithmic level such approaches are equivalent to solving the problem in the time domain and the methods considered hereafter would be applicable to that approach as well.
To estimate β and s directly from Eq. (2.9) we may either maximize this likelihood or sample from a posterior derived from it assuming some priors on the spectral parameters 1 . Alternately, one may resort to a so-called spectral likelihood (Stompor et al. 2009), where s is already either marginalized or maximized over, i.e., (2.16) which can be either minimized or sampled from. In both these cases, a key operation is a solution of a linear set of equations given by, for a sequence of tentative values of the spectral parameters, β i . Those can either constitute a chain produced as a result of sampling or a sequence of values obtained in the course of a minimization. We note that Eq. (2.17) is essentially a so-called mapmaking equation (e.g., Natoli et al. 2001;Szydlarski et al. 2014;Puglisi et al. 2018) but with a pointing matrix now replaced by P β i . We can thus hope that, as in the map-making problem, we can capitalize on special structures of the involved matrices and very efficient iterative solvers of linear system to render the problem feasible. We point out that in the applications considered here a subsequent value of the parameter β, i.e., β i+1 , can be only known once the system for the current value, β i , is fully resolved. A simultaneous computation of all systems for all values of β i is not therefore possible and any computational speedup has to come from using better solvers for the linear systems and/or their numerical implementations.
Separating parts dependent and independent on β, the system in Eq. (2.17) can also be written as, The approach we propose here is based on two observations. First, our system has some essential similarities to that of the map-making problem, we should be therefore able to capitalize on novel iterative techniques proposed in that case. Second, we expect that consecutive values of β i in realistic sequences should not vary arbitrarily and therefore subsequent linear systems (2.17) should show some resemblance. Consequently, it should be possible to shorten time to solution for the next value of β i+1 capitalizing on the solution for the current one, β i .

Block-diagonal preconditioner
The block-diagonal preconditioner is the most common preconditioner used in the preconditioned conjugate gradient solvers applied in the context of the CMB map-making problem (Natoli et al. 2001), which has demonstrated a very good performance in the number of applications. It is also a basis for a construction of more advanced preconditioners (e.g. Szydlarski et al. 2014). The block-diagonal preconditioner is derived by replacing the noise covarianceN −1 f in Eq. (2.12) by its diagonal. In the map making case, assuming the pixel-wise ordering, this leads to a block-diagonal matrix with the blocksize defined by the number of the considered Stokes parameters. The generalization of this preconditioner to the component separation case is straightforward and it is given by P β diag( N −1 ) P β . Again in the pixel-wise ordering the matrix constructed in such a way is block-diagonal with the block size now equal to the product of the number of the Stokes parameters and the number of the sky components, i.e., 6 × 6 in the specific case considered here. Consequently, the preconditioner is easily invertible in any ordering scheme adapted.
Hereafter, we denote the β-independent part of the preconditioner as B := P diag( N −1 ) P so, By preconditioning the system (2.17) from the left, we get (2.19) To simplify the notation in the following, we define,

Component mixing
For concreteness throughout the paper we assume the following component mixing scheme, which follows the standard assumptions that there is no Q and U mixing and that the scaling laws for the Stokes parameters Q and U of each components are the same. In the component-wise ordering such mixing corresponds to the mixing matrix of the form, (I is the identity matrix, 2 × 2 in this case), The coefficients α f,i encode the assumed scaling laws of the CMB, i = 1, dust, i = 2, and synchrotron, i = 3 where the last two depend on unknown scaling parameters, β d and β s . This matrix can be rewritten with help of the Kronecker product as, Hereafter, we will drop the explicit dependence of the mixing coefficients on β denoting them simply as α f,k .

Solution procedure for parametric component separation problem
A complete solution to the component separation problem will have to address two aspects successfully. First, it will need to propose an efficient approach to solving the sequences of linear systems as in Eq. (2.19) and, second, it will have to combine it with an optimized procedure for an efficient determination of the new values of the parameters β. This study aims at addressing the former problem and focuses on the solution of a sequence of the linear systems obtained for some sequences of the spectral parameters. In order to provide a fair comparison of various solution techniques for the linear systems, we generate a sequence {β i } a priori, i.e., in our experiments β i+1 is in fact in- , what ensures that performance evaluation of all the considered solvers is based on solving identical sequence of linear systems. The overall solution scheme we adapt here is then as follows, 0) Initialize β 0 and s (0) β 0 (typically s (0) 1) Given β i and the initial guess s (0) , in our experiments β i+1 is given a priori. 2b) Compute a new deflation space for the system associated with β i+1 using a recycling technique (see details below). This should not involve the value of β i+1 so that this step can be done in parallel with 2a).
In the subsections below, we discuss the steps 1), 2b), and 3) in more details.

PCG with deflation and two-level preconditioners
Though the block-diagonal preconditioner has been shown to ensure good performance in the map-making experience, it has been pointed out that even better performance can be often achieved by employing so-called two-level preconditioners (Szydlarski et al. 2014;Puglisi et al. 2018). Such preconditioners are built from the block-diagonal preconditioner (referred to as the first level) and the second level based on several vectors that are deflated (i.e., suppressed in the operator) in order to accelerate the convergence. These vectors are typically taken to be approximate eigenvectors of the system matrix corresponding to its smallest eigenvalues, which often hamper the convergence of PCG with the block-diagonal preconditioner.
We start from the case of deflation for (unpreconditioned) conjugate gradient (CG) method. CG applied to a linear system As = b with a given initial vector s (0) and an initial residual r (0) := b − As (0) , builds implicitly an orthogonal (residuals) and an A-orthogonal (search directions) basis of the Krylov subspace, Given a set of deflation vectors, i.e., the vectors to be suppressed, denote by U the subspace spanned by these vectors. The deflation techniques replace the original operator A : R n → R n by a deflated operator A : (R n \ U) → (R n \ U). The approximation is then sought over the augmented subspace (see, e.g., Gaul et al. (2013)) s ( j) ∈ s 0 + K j A, r (0) ∪ U and the jth residual is required to be orthogonal to K j ( A, r (0) ) ∪ U. This effectively prevents the solver from exploring the subspace U.
Extension of the above discussion for PCG with a (first-level) preconditioner B is straightforward as we can use PCG to build implicitly the Krylov subspace K j (B −1 A, B −1 r (0) ). In the considered application, the preconditioner B is the block diagonal preconditioner. There are many variants of two-level preconditioners and we summarize them briefly in Appendix B.2, a more thorough survey can be found, e.g., in Tang et al. (2009). Each iteration of a deflated (P)CG, i.e., with or without the first level, is more costly than a single iteration of standard (P)CG. The additional cost depends, primarily, on the number of deflated vectors, i.e., the dimension of U, but also on a deflation variant. Building the subspace U typically requires some preliminary computations, which can be as costly as solving the system (see e.g., Szydlarski et al. 2014;Puglisi et al. 2018). Another approach, applicable to the cases when multiple systems need to be solved, is to construct the vectors "on the fly" during the solution of the systems themselves thus hiding the additional cost. This is the approach we detail in the follow-up sections.

Subspace recycling
There already exist several constructions of the deflation space adapted to solving a sequence of linear systems such as Saad et al. (2000); Parks et al. (2006); Kilmer & de Sturler (2006); O'Connell et al. (2017); Jolivet & Tournier (2016). In this work, where the system matrix is SPD, we follow Saad et al. (2000) and build a subspace Z ⊂ K j ( A, r (0) ) by storing some of the vectors computed during the previous run of (P)CG solver and determine the slowest eigenvectors of the operator A restricted on the subspace U ∪ Z. These are taken as the deflation vectors for the next solution. The resulting algorithm is given in Appendix C.
We can determine the subspace Z by using either the residual or the search direction vectors forming (assuming the exact arithmetic) the orthogonal or, respectively, an A-orthogonal basis of K j ( A, r (0) ). Following Saad et al. (2000), we choose here to use the search direction vectors. We retain the first dim p search direction vectors, where dim p defines the dimension of the socalled recycle subspace. Using the first vectors is motivated by the fact that the orthogonality among the computed vectors is gradually lost in CG; it is therefore preserved better in the initial iterations.
The techniques for approximating k eigenvectors of the operator over a given subspace are well-established. Among them, we note the Ritz and harmonic Ritz projections, which are described in detail in Appendix B.1.1. They lead to solution of (generalized) eigenvalue problem of a small dimension -in our case of dim(U)+dim(Z). While Saad et al. (2000) suggest using the harmonic Ritz projection, we found Ritz projection slightly more efficient in our numerical experiments and we therefore include this in the full algorithm described in Appendix C. In another difference with Saad et al. (2000) we assemble the (small) generalized eigenvalue problem matrices in the harmonic Ritz projection using the matrix-matrix products (see Algorithm 2 in Appendix B.1.1) instead of the optimized algorithm of Saad et al. (2000) (Section 5.1). This is because we expect that the extra computational cost in our application is negligible and we therefore have opted for the simplicity.
There is no general recommendation for the choice of the number of the deflation vectors, k, and the dimension of the recycling subspace, dim p . Higher k may result in an increase of the overall number of matrix-vector products (one has to apply the system matrix to k deflation vectors before starting deflated (P)CG for each system) and high dim p may cause numerical instabilities in solving the eigenvalue problems determining the new deflation vectors. On the other hand, the small values of k and dim p may not bring enough speed up. We test this numerically in Section 4.

Impact of the eigenvalue multiplicity.
One limiting factor to the efficiency of this approach, and more generally to performance of any two-level preconditioner whenever the deflation space is estimated with help of standard iterative techniques, such as Arnoldi or Lanczos iterations, comes from a higher multiplicity of eigenvalues, i.e., presence of multiple eigenvectors with the same corresponding eigenvalue. This can arise either due to some symmetries in the scanning strategies, in the case of the map-making systems of equations, and similarities in the noise covariances of the different single frequency maps, in the case of the component separation problem as studied here, see Appendix A for a worked example. Admittedly, such symmetries and/or similarities are not typically expected in the cases of real data analysis, however they can arise in the cases of simulated data in particular if simplifying assumptions are adopted in order to speed and/or simplify the simulation process.
A&A proofs: manuscript no. main To shed a light on this problem we consider an SPD matrix A and assume that λ is an eigenvalue with multiplicity higher than one. This means that there exists a subspace V with dim(V) > 1 such that Let w be an arbitrary vector and w V its projection onto V, i.e.
This decomposition is unique for any normal matrix A (in particular, an SPD matrix is normal). Then Therefore, the jth Krylov subspace satisfies Consequently, whenever we use Lanczos (or Arnoldi) method, which is based on Krylov subspace approximation (see Appendix B.1.2 for more details), for approximating the eigenpairs, we cannot recover neither multiplicity of the eigenvalues (without using some additional techniques) nor the full subspace associated to an eigenvalue with higher multiplicity which will always be approximated by a single vector. If the corresponding eigenvalue happens to be small, this approximation may not be sufficient hampering the performance of the preconditioner constructed with help of the combination of Lanczos/Arnoldi method.
This issue can be overcome by using a more advanced eigenvalue solver, which can detect and handle the higher multiplicity of eigenvalues. An efficient implementation of such a solver is for instance provided by the ARPACK library (Lehoucq et al. 1998). In this case one may need to resort to a precomputation of the preconditioner with help of such advanced routines instead of the constructing it on the fly as proposed here. If the presence of the eigenvalue multiplicity and the corresponding eigenvectors is known ahead of time, one can accommodate these vectors in the on-the-fly procedure proposed here. This is indeed the case we have encountered in one of the test cases discussed later in this work.
We point out that the multiplicity of the eigenvalues is in principle advantageous for the standard (P)CG. In exact arithmetic, the effect of the whole subspace could be eliminated at the cost of a single iteration. This fact is often used in the analysis of preconditioning methods where proper preconditioners shift some, possibly many, eigenvalues to the same value.
Last but not least, we emphasize that an eigenvalue multiplicity does not imply any indistinguishability or degeneracy of the corresponding eigenmodes, unless the corresponding, common eigenvalue happens to be (numerically) zero. If this is not the case, the different eigenmodes, as present in the right hand side of the linear system, will be merely weighted by the inverse system matrix in the same way, even though their contributions to the right hand side may be very different start with.

Choice of the initial guess
The simplest and the standard way to solve the sequence is to run the PCG method with the initial guess set to zero. However, some evidence exists showing that at least in the map-making case this may not always be the best choice (Papež et al. 2018) in particular in the high signal-to-noise cases. In the cases of the sequences all the systems involve the same initial data set with the same signal and noise content so even in the low signal-tonoise data one may expect that adapting the initial guess following previous results may bring some interesting speed-up. Consequently, we explore here two alternative choices and show via numerical experiments that they are indeed much more efficient.

Previous solution as the initial guess
A natural idea is to run PCG for the (new) problem corresponding to β i+1 starting with the computed approximation s ( f inal) This can be in particular efficient if the parameters β i and β i+1 do not significantly differ and one can expect that so do s β i and s β i+1 .

Adapted previous solution as the new initial guess
Eq. (3.1) can be further adapted capitalizing on the particular structure of the mixing matrix. To start with we rewrite Eq. (2.17) as, In our case, matrix M β is rectangular of size 2 n freq n pix × 6 n pix and has full column rank. When the number of frequencies n freq is not significantly higher than 3, we can then generalize the above idea and use as the initial guess for the new system the vector, where M † is the (Moore-Penrose) pseudoinverse 2 of M, Clearly, such vector satisfies a natural requirement: for β i+1 = β i there holds, Finally, we note that the computation of the vector in (3.3) is very cheap due to the Kronecker structure (2.22) of the matrices M β . Writing M β = K β ⊗ I, we obtain, in other words, only the matrices of size 2 n freq × 6 need to be handled and the cost of the proposed adaptation is nearly negligible.

Simulated data
For our numerical tests we use a simulated data set composed of time-ordered, multi-frequency observations with a correlated, '1/f', noise. The characteristics of this data set are so follows.

Pointing matrix
We adopt a simple scanning strategy used in Papež et al. (2018). The entire time-ordered data set is composed of O(10 8 ) measurements per frequency and divided into four, consecutive subsets. The pointing is assumed to be the same for each frequency. The underlying sky is pixelized using Healpix pixelization scheme (Górski et al. 2005) with the resolution parameter, n side set to 1024. The scan consists of repetitive scanning of a rectangular patch made of 256 pixel rows and columns. The scanning is either horizontal, i.e., along the pixel rows, for the first and third subset, or vertical, i.e., along the pixel columns for the second and fourth subset. During a single left-to-right, or bottom-up sweep, each sky pixel is sampled only once and the direction of the polarizer, ϕ t , is fixed for each of the four subsets and equal, with respect to the sky, to 0, π/4, π/2, and, 3π/4. The sky signal contribution to every measurement is modeled as, where p(t) denotes the pixel observed at time t, we do not include the total intensity, and Q c and U c stand for Q and U Stokes parameters of the combined, CMB + foregrounds, sky signal observed at frequency ν c .

Sky maps
We assume 6 frequency channels roughly corresponding to those accessible for a ground based experiment. These are, For the scaling laws we take a black-body for the CMB component (T CMB = 2.7525K), a power-law for the synchrotron, and a modified black-body for the dust, hence, and B(ν, T ) denotes a black-body at temperature, T . The simulated maps are expressed in the thermodynamic units and given by, for each frequency ν = ν c and each observed sky pixel, p. An analogous formula holds for the Stokes U parameter. Here, Γ RJ (ν) stands for a conversion factor from the Rayleigh-Jeans to thermodynamic units. This expression is consistent with Eq. (2.22) upon a suitable definition of the coefficients α.
In our numerical experiments, we fix the dust temperature, T d , to its true value and assume that only the spectral indices, β = [β s , β d ], are determined from the data. We assume that these are estimated via a maximization of the spectral likelihood, Eq. (2.16), using a truncated Newton maximization procedure. We use this approach to generate a single sequence of {β i }, which, as explained in Sect. 3, we adopt consistently in all our runs. The sequence is made of 26 values and is shown in Fig. 1. In appendix D we show for completeness results of the similar test but performed for a sequence of β derived via sampling of the spectral likelihood. The main conclusions derived in both these examples are consistent.

Noise
We assume a correlated noise in the time domain with a spectrum given by where f is the time-domain frequency. The values of f knee adopted here are different for different frequency channels and taken to be such that there are strong noise correlations within a single sweep. They span the range from 0.5 up to 3Hz from the lowest to the highest frequency channel, respectively. The noise is apodized at very low frequencies so the noise power is finite. σ 2 rms is taken to be around 30µK per sample reflecting the fact that each measurements effectively corresponds to the combined measurement of many modern detectors operating at the same frequency. This leads to sky maps with the noise, σ Q/U rms ∼ 2−3µK per pixel.

Multiplicity of the eigenvalues due to the particular scanning strategy
We denote two pointing matrices for two horizontal scans as P 0 and P π/2 and two pointing matrices for two vertical scans as P π/4 and P 3π/4 , where the subscript stands for the polarizer angle in the sky coordinates. Those are related as follows, P π/2 = P 0 R 2 (π/4), P 3π/4 = P π/4 R 2 (π/4), where R is a 12 n pix -by-12 n pix block diagonal matrix with each diagonal block equal to a 2-by-2, spin-2 rotation matrix for each pixel of the 6 frequency maps. While this is clearly a result of the simplifying assumption made in the simulations, this example may be of interest also in more realistic circumstances where certain relations of this sort may happen to be fulfilled approximately following from some common, albeit inexact, symmetries of typically adopted scans. We will therefore explore here briefly consequences of this fact. In the case at hand we can represent the total pointing matrix as given that all four scan subsets observe exactly the same sky.
In the case when moreover the noise covariance for each of the four scan subsets is exactly the same, then P N −1 P = P N −1 P = R π/4 P N −1 P R π/4 . We note that this holds if the noise properties vary from one frequency channel to another as it is indeed the case in our simulations.
If now v is an eigenvector of the matrix A = P N −1 P with a corresponding eigenvalue, λ v , then and therefore, meaning that also R π/4 v is an eigenvector of the matrix A with the same eigenvalue, λ v . As this reasoning applies as much to the matrix A as the matrix B = P diag N −1 P, we have Whether this implies that the component separation system matrix preconditioned with the block-diagonal preconditioner, Eq. (2.20), which is given by, has two eigenvectors corresponding to the same eigenvalue related by the rotation operator acting in the component space, will depend whether the subspace spanned by v and R π/4 v is contained in the subspace spanned by the columns of the mixing matrix, M β . Otherwise, it may have a single (in the case of the intersecting subspaces) or no eigenvectors (in the case of the disjoint subspaces) with the corresponding eigenvalue. Which situation is actually realized in practice will be case-dependent and will in general depend on the value of β. We found that the former case is common enough that in our numerical experiments we have opted to account explicitly on the possibility of having duplicity of the eigenvectors due to the scanning. Consequently, while we use the subspace recycling to approximate only one eigenvector we compute the other one by applying the rotation operator. We then use both vectors to construct the deflation operator. Given that R π/4 = −R −π/4 , there is no ambiguity due to the fact that a priori we do not know which of the vectors we estimate via the subspace recycling and this approach leads to the same deflation space, whichever rotation is applied. This technique has led to an important speed-up in the cases studied hereafter.

Results
We compare the convergence using the relative norm of the jth residual The iterations for each system are stopped once this value drops below tolerance TOL = 10 −8 , but we always perform at least one iteration.
We first show that the systems corresponding to different βs from the sequence are indeed "close to each other" in a sense that they display a similar behavior during the solution process. We illustrate this by showing the convergence for PCG with zero initial guess in Figure 2. We find that all the convergence curves are indeed very similar even for the initial elements of the sequence, where the values of β parameters keep on changing quite significantly. We can therefore focus on the "true" system corresponding to β = β in order to investigate improvement due to the deflation of the eigenvectors corresponding to the smallest eigenvalues of the system matrix. This is depicted in Figure 3. Here, the eigenvectors are computed using the ARPACK eigensolver (Lehoucq et al. 1998) and, as expected, we find a significant improvement thanks to the deflation. Fig. 3. Convergence of the PCG with the deflation applied to 2, 6, and 10 slowest eigenvectors for the true system with β = β .
Then, we illustrate the effect of the deflation space built by recycling. For that, we consider first six systems and we start the iterations always with zero initial guess, s (0) = 0. The result for k = 10 eigenvectors approximated using the dimension of the subspace, dim p = 100, is given in Figure 4.  4. Convergence of deflated PCG with the deflation subspace build by recycling. Here we consider the first six systems from the sequence and we start the iterations always with zero initial guess. k = 10 eigenvectors are approximated using dim p = 100.
In Figure 5 we compare the convergence for the full sequence of the 26 systems using the techniques for setting the initial guess proposed in Eq. (3.1) and Eq. (3.3) and using the subspace recycling. We recall that standard PCG with zero initial vector takes more than 6000 iterations; cf. Figure 2. Despite the fact that the subspace recycling variant requires additional 25 × 10 matrixvector products 3 as compared to the PCG with the block-Jacobi preconditioner for any choice of the initial guess, it still provides an interesting speed up.
As noted above, the performance of the PCG with the deflation space built by recycling is affected by the number of the deflation vectors, k, and the dimension of the recycling subspace, dim p . There is no general recommendation for their choice. Higher k may result to increase of the overall number of matrixvector products (one has to apply the system matrix to k deflation vectors before starting deflated PCG for each system) and high dim p may cause numerical instabilities in solving the eigenvalue problems determining the new deflation vectors. On the other hand, the small values of k and dim p may not bring enough speed up. We compare the convergence of PCG with some choices of k and dim p in Figure 6 and

Conclusions and further perspectives
We have presented a procedure for an effective solution of sequences of the linear systems arising in CMB parametric com-   Table 1. Since the convergence for the first system is the same for arbitrary dim p and k, the x-axis is depicted starting from the iteration 200. ponent separation problem. Two main novelties are in the proposed choice of the initial vectors and the recycling technique used to determine the deflation space. These techniques can be also used in other problems in CMB data analysis which lead to a sequence or a set of linear systems. Motivated by our simulated data, we have also emphasized and elaborated on the role of the multiplicity of the eigenvalues, in particular in the context of their impact on performance of two-level preconditioners.
The overall speed-ups we have recovered, ∼ 5-7, are quite impressive. While a bulk of the improvement comes from reusing the solution of an earlier system as the initial guess of the follow-up one, a simple but not trivial observation owing to the fact that this is the same data set which is being used in all the solutions, other proposed amendments provide important additional performance boost on order of ∼ 2, particularly significant in the case of the sampling-based application. Further improvements and optimizations are clearly possible. For instance, the number of required matrix-vector products can be decreased by not using the two-level preconditioner for all the systems as the experiments showed that when two consecutive system solutions are very close to each other, the PCG with the diagonal preconditioner and a proper initial guess (e.g. as proposed in Section 3.4.2) can be already sufficient converging in few iterations.
We emphasize that in practice we will be only able to capitalize on this kind of approaches if they are implemented in a form of highly efficient, high performance massively parallel numerical algorithms and codes. We leave a full demonstration of this to future work, noting only here that many of the relevant techniques have been already studied in the past and recent literature showing that this indeed should be plausible (e.g., Cantalupo et al. 2010;Sudarsan et al. 2011;Szydlarski et al. 2014;Papež et al. 2018;Seljebotn et al. 2019).
The presented techniques, rendered in a form of efficient, high performance codes, should allow for the efficient maximization of the data likelihood or the posterior distributions in the component separation problems producing the reliable sky component estimates for at least some of the forthcoming data sets. However, in the cases of sampling algorithms, when many thousands of the linear systems may need to be solved, this still remains to be demonstrated and further improvements will likely be required. Those will however in general depend on specific details of the employed sampling technique and we leave them here to future work.

Appendix A: Eigenvalue multiplicity in component separation problem -a worked example.
In this section we discuss the eigenvector and eigenvalue of the preconditioned matrix defined in Eq. (2.19) in the context of eigenvalue multiplicity in some specific set-up, which in particular assumes that the pointing matrix is the same for each frequency and that the noise has the same covariance (up to a scaling factor) for each frequency, i.e., that P f = P, N f = N, f = 1, . . . , n freq. . While such requirements are not very likely to be realized strictly in any actual data set, they can be fulfilled approximately leading to near multiplicities of the eigenvalues. Those if not accounted for may be as harmful to the action of the preconditioner as the exact ones. Moreover, this worked example demonstrates that the component separation problem is in general expected to be more affected by this kind of effects than for instance the standard map-making solver, therefore emphasizing that a due diligence is necessary in this former application.
be an eigenpair of the mapmaking matrix, i.e., there holds due to the assumed form of the mixing Eq. (2.21). Consequently, using Eq. (A.1), Since the matrix M β A M β is assumed to be non-singular (equivalently, since M β is of full column rank), we can multiply the above equation from the left by M β showing that (λ i , v i , 0, 0 ) is the eigenpair of the matrix ( M β B M β ) −1 M β A M β . We can proceed analogously for the vectors 0, v i , 0 and 0, 0, v i , with replacing in (A.2) α f,1 by α f,2 and α f,3 , respectively. There are 2 n pix eigenpairs for (P diag(N −1 )P) −1 (P N −1 P). As we have seen above, each of them generates three eigenpairs (with the same eigenvalue) of ( M t β B M β ) −1 M β A M β . This gives together 6 n pix eigenpairs, in other words, we have described the full spectrum of the preconditioned system matrix in (2.19).
Finally, we remark that all the eigenpairs of the preconditioned matrix are, in the simplified setting, independent of the parameters β. In such case, we suggest to use a specialized eigensolver (as, e.g., ARPACK Lehoucq et al. (1998)) to compute the eigenpairs from (A.1), build the triplets of eigenvectors v i , 0, 0 , 0, v i , 0 , 0, 0, v i , and then use the deflated PCG with those vectors. Figure A.1 is as Figure 5 but for the simplified setting. Here two triplets of the eigenvectors are constructed following the above procedure.

Appendix B: Ingredients of the proposed procedure
We present in this section in more detail two ingredients of the proposed procedure. Namely, the ways how the eigenpairs are estimated using the computed basis of Krylov subspace and the ways how the deflation of the approximate eigenvectors can be combined with another preconditioner. For the ease of presentation, we simplify the notation in this section.
Appendix B.1: Approximating the eigenvalues using Krylov subspace methods Arnoldi and Lanczos algorithms for approximating the eigenvalues of a general nonsingular or, respectively, a hermitian matrix, are based on Rayleigh-Ritz approximation that will be presented first. Then, we recall the Arnoldi and Lanczos algorithms and, finally, we briefly comment on their restarted variants. The methods discussed below do not represent an exhaustive overview of methods for approximating several eigenvalues and the associated eigenvectors. Among the omitted methods, there is, e.g., the Jacobi-Davidson method (Sleijpen & Van der Vorst 2000), which proved to be particularly efficient for approximating an inner part of the spectrum. For a survey on the methods and a list of references see, e.g., Sorensen (2002).

Appendix B.1.1: Ritz values and harmonic Ritz values approximations
For a subspace S ⊂ C n , we call y ∈ S a Ritz vector of A with Ritz value θ if Ay − θy ⊥ S.
Using a (computed) basis V j of S and setting y = V j w, the above relation is equivalent to solving Ritz values are known to approximate well the extreme eigenvalues of A. If an approximation to the interior eigenvalues is required, computing the harmonic 4 Ritz values can be preferable. Following Parks et al. (2006), we define harmonic Ritz values as the Ritz values of A −1 with respect to the space AS, y ∈ AS, A −1 y − µ y ⊥ AS.
We call θ ≡ 1/ µ a harmonic Ritz value and y a harmonic Ritz vector. If V j is a basis of S and y = V j w, the above relation can be represented as For the properties of the harmonic Ritz values approximations and the relationship with the iteration polynomial in MINRES method, see Paige et al. (1995).
Remark 1. There are various ways in the literature how the harmonic (Rayleigh-)Ritz procedure is presented and defined; often it is introduced to approximate eigenvalues close to a target τ ∈ C. For example, Wu (2017)  where I is the identity matrix. With y = V j w this corresponds to the generalized eigenvalue problem which becomes for τ = 0 exactly the right-hand side equality in Eq. (B.2).
We note that harmonic Ritz approximation is often used in the Krylov subspace recycling methods also for approximating the smallest (in magnitude) eigenvalues and the associated eigenvectors.
Finally, we comment on the (harmonic) Ritz approximation in the case we want to compute the eigenvalues of the matrix A preconditioned from the left by M. In a general case, assuming only that A, M are nonsingular, the Ritz and harmonic Ritz approximation are applied as above just replacing in the formulas A by M −1 A. When the matrix A is hermitian and the preconditioner M is SPD, there is also another option. First, we note that the matrix M −1 A is not hermitian but it is self-adjoint with respect to the inner product induced by M, that is The corresponding algebraic problems with y = V j w, y = V j w, are V j AV j w = θV j MV j w, respectively, Note that the problems above involve hermitian matrices only.

Appendix B.1.2: Arnoldi and Lanczos methods
Arnoldi and Lanczos algorithms for approximating the eigenvalues of a general nonsingular or, respectively, a hermitian matrix, are based on Ritz approximation with setting S = K j (A, v 1 ) = span(v 1 , Av 1 , . . . , A j−1 v 1 ), the jth Krylov subspace. The methods compute an orthogonal basis V j of S such that, where e j is the last column vector of the identity matrix (of size j) and V j V j = I, V j v j+1 = 0. Consequently, the eigenvalue problem (B.1) corresponding to the Ritz approximation reads The matrix T j is available during the iterations. The standard use of Arnoldi and Lanczos method for eigenvalue approximation consists of solving the above problem and setting the pairs (θ, V j w) as the computed approximations. One can naturally replace the Ritz approximation by the harmonic Ritz approximation. Then, the matrices in Eq. (B.2) become, Remark 2. The Lanczos algorithm is a variant of the Arnoldi algorithm for a hermitian A. The matrix T j = V j AV j , which is in Arnoldi method upper Hessenberg, is then also hermitian. Consequently, it is tridiagonal, which means that in each step of Lanczos method we orthogonalize the new vector only against two previous. This assures that the computational cost of each iteration is fixed and, if one is interested in approximating the eigenvalues only, storing three vectors v j−1 , v j and v j+1 is sufficient instead of handling the full matrix V j . The assumption on exact arithmetic is, however, crucial here. In finite precision computations, the global orthogonality is typically quickly lost, which can cause several stability issues.
As noted above, an orthonormal basis V j of S is advantageous for the Ritz approximation. For the harmonic Ritz approximation applied to an SPD matrix A, one can instead aim at constructing an A-orthonormal basis, which assures that the matrix V j A V j = V j AV j on the right-hand side of Eq. (B.2) is equal to the identity. An A-orthonormal basis of a Krylov subspace can be constructed within the iterations of conjugate gradient method by using the search direction vectors.
The Arnoldi method can be naturally applied also to the preconditioned matrix M −1 A to compute an orthonormal basis V j of the associated Krylov subspace K j (M −1 A, M −1 v 1 ), giving, For a hermitian A and an SPD preconditioner M, we can apply the Lanczos method following the comment made above -using the matrix M −1 A and the inner product (·, ·) M induced by M instead of the standard euclidean (·, ·), giving, The computed basis V j is therefore M-orthonormal.

Appendix B.1.3: Restarted variants
The number of iterations necessary to converge is not a priori known in Arnoldi and Lanczos algorithms and, in general, it can be very high. High iteration counts require a large memory to store the basis vectors and, whenever a full-reorthogonalization is used, a high computational effort because of the growing cost of the reorthogonalization in each step. The idea behind implicitly restarted variants is to limit the dimension of the search space S. This means that the iterations are stopped after a (fixed) number of steps, the dimension of the search space is reduced while maintaining its (Krylov) structure, and the Arnoldi/Lanczos iterations are resumed. There are several restarted variants described in the literature (a detailed description is, however, beyond the scope of this paper): the implicitly restarted Arnoldi (IRA, Sorensen (1992)), the implicitly restarted Lanczos (IRL, Calvetti et al. (1994)), or the Krylov-Schur method (Stewart (2001/02);Wu & Simon (2000)).
The estimation of the spectrum of A is possible within the GMRES, MINRES and CG iterations (applied to solve a system with A) because they are based on Arnoldi, respectively Lanczos algorithms. In contrast, a combination of restarted variants with solving a linear algebraic system is, to the best of our knowledge, not described in the literature.

Appendix B.2: Deflation and two-levels preconditioners
In this section we first discuss a deflation preconditioner for Krylov subspace methods that can be regarded as eliminating the effect of several (given) vectors from the operator or, equivalently, augmenting by these vectors the space where we search for an approximation. Then we describe a combination of the deflation preconditioner with another preconditioner that is commonly used in practice.
A question then arises whether given some other subspace U, we can modify the methods such that they have the same optimal properties over the union of K j (A, v) and U, which is often called an augmented Krylov subspace. The answer is positive and the implementation differs on the method -it is straightforward for GMRES and it requires more attention for CG. Hereafter, we denote by I the identity matrix and by Z the basis of U.
The deflation in GMRES method is often (see, e.g., GCROT by Morgan (1995)) considered as a remedy to overcome the difficulties caused by restarts: for computational and memory restrictions, only a fixed number of GMRES iterations is typically performed giving an approximation that is then used as the initial vector for a new GMRES run. In GCROT, several vectors are saved and used to augment the Krylov subspace built after the restart. The GMRES method with the deflation was used for solving a sequence of linear algebraic systems, e.g., in Parks et al. (2006).
The augmentation of the Krylov subspace in CG is more delicate, since the original CG method can only be applied to an SPD matrix. The first such algorithm was proposed in Dostál (1988) and Nicolaides (1987). We note that it includes the construction of the conjugate projector P c.pro j. = Z(Z AZ) −1 Z A and, in each iteration, the computation of the preconditioned search direction q i = (I − P c.pro j. ) p i and of the vector Aq i . The latter can be avoided at the price of storing Z and AZ and performing additional multiplication with AZ. In both variants, the cost of a single iteration is higher than the cost of one standard CG iteration.
The combination of a preconditioner with a deflation is widely studied in the literature and therefore we present this only briefly; more details and extensive list of references can be found, e.g., in the review paper by Tang et al. (2009). The preconditioner stemming from the combination of a (typically relatively simple) traditional preconditioner with the deflation is called a two-level preconditioner 5 . While the traditional preconditioner aims at removing the effect of the largest (in magnitude) eigenvalues, the deflation (projection-type preconditioner) is intended to get rid of the effect of the smallest eigenvalues. Common choices for the traditional preconditioner are block Jacobi, (restricted) additive Schwarz method, and incomplete LU or Cholesky factorizations. Among many applications, two-level preconditioners proved to be efficient in the CMB data analysis; see, e.g., Grigori et al. (2012); Szydlarski et al. (2014).
We now present the combination of the traditional and projection-type (deflation) preconditioners following the discussion and notation of Tang et al. (2009). Hereafter, we assume that the system matrix A and the traditional preconditioner M are SPD. We note that some of the below mentioned preconditioners P are not symmetric. However, their properties allow us to use them (with possible modification of the initial vector) as left preconditioners in PCG; see Tang et al. (2009) for details.
Let the deflation space spans the columns of the matrix Z. We denote Two-level preconditioners based on the deflation are given as Other preconditioners can be determined using the additive combination of two (SPD) preconditioners C 1 , C 2 as or, using the multiplicative combination of the preconditioners, as P mult ≡ C 1 + C 2 − C 2 AC 1 .
Three variants of two-level preconditioners are derived by choosing additive or multiplicative combination and setting C 1 = M −1 , C 2 = Q, or C 1 = Q, C 2 = M −1 . Other preconditioners can be derived using the multiplicative combination of three SPD matrices; see (Tang et al. 2009, Section 2.3.4).
The variants of two-level preconditioner mentioned above differ in the implementation cost and also in the numerical stability; see Tang et al. (2009). The variant P DEF1 , which is often used in the procedures for solving the sequences of linear systems (see, e.g., Saad et al. (2000)), was found cheap but less robust, especially with respect to the accuracy of solving the coarse problem with the matrix Z AZ and with respect to the demanded accuracy. The conclusion drawn in Tang et al. (2009) is that "A-DEF2 seems to be the best and most robust method, considering the theory, numerical experiments and the computational cost". Therefore the preconditioner P A-DEF2 , is of interest, in particular in the cases where the dimension of the deflation space (equal to the number of columns in Z) is high and/or the matrix M −1 A is ill-conditioned, As noted in Saad et al. (2000), the gradual lost of orthogonality of computed residuals with respect to the columns of Z can cause stagnation, divergence or erratic behaviour of errors within the iterations; see also the comment in (Tang et al. 2009, Section 4.7). The suggested remedy in that case consists in the reorthogonalization of computed residuals as r j := Wr j , However, such instabilities were not observed in our experiments and the results depicted throughout the paper are for the standard (non-reorthogonalized) variant.

Appendix C: Full algorithm
In this section we provide the pseudocodes for the deflated PCG (Algorithm 1), the subspace recycling (Algorithm 2), and for the full procedure (Algorithm 3) proposed in this paper and used in the numerical experiments in Section 4.

Appendix D: Results for additional sequence of mixing parameters from a Monte Carlo sampling
In this section we provide results for a sequence of spectral parameters generated by a Monte Carlo sampling algorithm. In realistic circumstances such sequences contain possibly up to many thousands of samples, however here for computational efficiency we restrict ourselves to a sub-sequence made of merely 30 elements and use them in order to demonstrate performance of the proposed approach on a sequence with realistic properties but sufficiently different than those of the sequences encountered in the likelihood maximization process. We emphasize that it is not our purpose here to validate the method on the actual application yet and more work is needed to achieve this goal, see Sect. 5. The sequence is depicted in Figure D.1. It was produced using a publicly available software fgbuster 6 and indeed shows qualitatively a very different behavior than that of our standard case displayed in Fig. 1.   (3.1) and (3.3)) and the PCG with the subspace recycling (together with the choice of the initial guess as in (3.3)). For the recycling we consider k = 10 eigenvectors approximated using dim p = 100. The convergence for the whole sequence when using the initial guess (3.1) (the yellow line) requires 4010 iterations.