Learning mid-IR emission spectra of polycyclic aromatic hydrocarbon populations from observations

The JWST will deliver large data sets of high-quality spectral data over the 0.6-28 $\mu$m range. It will combine sensitivity, spectral and spatial resolution. Specific tools are required to provide efficient scientific analysis of such large data sets. Our aim is to illustrate the potential of unsupervised learning methods to get insights into chemical variations in the populations that carry the aromatic infrared bands (AIBs), more specifically PAH species and carbonaceous very small grains (VSGs). We present a method based on linear fitting and blind signal separation (BSS) for extracting representative spectra for a spectral data set. The method is fast and robust, which ensures its applicability to JWST spectral cubes. We tested this method on a sample of ISO-SWS data, which resemble most closely the JWST spectra in terms of spectral resolution and coverage. Four representative spectra were extracted. Their main characteristics appear consistent with previous studies with populations dominated by cationic PAHs, neutral PAHs, evaporating VSGs, and large ionized PAHs, known as the PAH$^x$ population. In addition, the 3 $\mu$m range, which is considered here for the first time in a BSS method, reveals the presence of aliphatics connected to neutral PAHs. Each representative spectrum is found to carry second-order spectral signatures (e.g. small bands), which are connected with the underlying chemical diversity of populations. However, the precise attribution of theses signatures remains limited by the combined small size and heterogeneity of the sample of astronomical spectra available in this study. The upcoming JWST data will allow us to overcome this limitation. The large data sets of hyperspectral images provided by JWST analysed with the proposed method, which is fast and robust, will open promising perspectives for our understanding of the chemical evolution of the AIB carriers.


Introduction
The aromatic infrared bands (AIBs) have held the attention of a large community of astronomers since their discovery in the 1970s (Gillett et al. 1973). At first named unidentified infrared bands (UIBs), they obtained the current denomination of AIBs after Leger & Puget (1984) and Allamandola et al. (1985) proposed that they can be emitted by large carbonaceous molecules, namely polycyclic aromatic hydrocarbons (PAHs). Their limited size (20 to more than 100 carbon atoms) implies that they are easily heated to high temperatures by the absorption of individual UV photons in photo-dissociation regions (PDRs) where the AIBs dominate the mid-IR spectrum. Once the molecule absorbs a UV photon, it can dissociate, be ionized or relax through infrared emission (e.g., Montillaud et al. 2013). The strongest AIBs, at 3. 30, 6.20, 7.70, 8.60, 11.30, and 12.70 µm correspond to the cooling of the molecule through vibrational stretching and bending modes of its C-H and C-C bonds. To study this emission the community has counted on several space observatories: the Infrared Astronomical Satellite (IRAS, Neugebauer et al. 1984), the Infrared Space Observatory (ISO, Kessler et al. 1996), the Spitzer Space Telescope (Spitzer, Werner et al. 2004), and AKARI (Murakami et al. 2007). The upcoming James Webb Space Telescope (JWST, Gardner et al. 2006) has inspired a lot of hope as its capabilities will surpass all former space observatories.
One of the main findings of observational studies is the discovery that the AIB spectrum shows significant variability, depending on the astrophysical sources (Peeters et al. 2002). Variations within individual objects have also been observed (e.g., Joblin et al. 1996a;Sloan et al. 1997;Giard et al. 1992;Cesarsky et al. 2000;Rapacioli et al. 2005). These spectral variations can consist of changes in the band positions and widths or in band relative intensities. To analyze and interpret these variations, several approaches have been developed over the years. The first is a decomposition method, which dates back to the 1990s and consists in fitting multiple Lorentzian profiles to the AIBs (Boulanger et al. 1998). This was extended in the Spitzer era by Smith et al. (2007) who proposed a fitting tool, PAHfit, that automatically decomposes the AIB spectrum into several individual bands (Gaussian or Drude profiles) in the Spitzer range (5−35 µm). These fitting methods have been used widely to extract band intensities and relate them to star-formation rates in galaxies (Galliano et al. 2018), for example. They have also been used to study the variations of the AIB spectrum, for instance by Galliano et al. (2008) who were able to show that the 6.2 µm over 11.3 µm band ratio, expected to trace ionization degree of PAHs, correlates with the ionization balance param- Peeters et al. (2017) used Gaussian profiles to fit individual bands from spectral maps of NGC 2023 PDRs (north and south) and discussed in detail the possible assignments of these bands.
Another well-known approach for AIB analysis is the classification of Peeters et al. (2002). Using continuum-subtracted ISO-SWS spectra in the 6−9 µm range, the authors defined the A, B, and C classes based on the peak positions of the 6.2, 7.7, and 8.6 µm bands. Some objects show spectra that belong only to one class (e.g., HII regions are always class A), while other objects can show mid-IR spectra belonging to different classes (e.g., post-AGB stars can be found in Class A, B, and C). Hony et al. (2001) proposed a classification for the 10−15 µm range. This empirical method has the advantage of being simple, but is difficult to apply automatically to large samples of observations. In addition, the strategies of Peeters et al. (2002) and Hony et al. (2001) rely on subregions of the spectrum, and do not consider the full AIB spectrum.
A more detailed analysis of the AIB spectrum consists in using IR absorption spectra of PAHs that have been either measured or computed, as available in the NASA Ames PAH IR Spectroscopic Database (PAHdb, Bauschlicher et al. 2010Bauschlicher et al. , 2018Boersma et al. 2014) or the theoretical spectral database of PAHs (Malloci et al. 2007). Using this approach provides further insights into the chemical diversity of the interstellar PAH population (e.g., Hudgins et al. 2005;Boersma et al. 2009;Simon & Joblin 2010;Bauschlicher et al. 2009;Candian & Sarre 2015). The limitation of this approach concerns the challenge of modeling, in a precise way, the emission spectra of PAHs in astrophysical environments (Mulas et al. 2006a;Boersma et al. 2013), especially when band profiles need to be simulated (Pech et al. 2002;Mulas et al. 2006b).
The last type of analysis approach involves blind signal separation (BSS) methods. Rapacioli et al. (2005) analyzed ISOCAM observations of NGC 7023 and ρ-Ophiucus by using singular value decomposition (SVD) coupled with a Monte Carlo search algorithm and a positivity criterion. Three elementary spectra were extracted and attributed to cationic PAHs, neutral PAHs, and carbonaceous very small grains (PAH + , PAH 0 , and VSGs, respectively). The PAH + spectrum exhibits mainly the 6.2, 7.7, 8.6, 11.3, and 12.7 µm bands. It is dominated by the 7.7 µm bands (hereafter the 7.7 µm complex). The PAH 0 spectrum is similar, but with a 11.3 µm much stronger than that of the PAH + spectrum. The VSG population shows a spectrum with a strong contribution of the continuum at wavelengths longer than 10 µm. Its 7.7 µm complex is redshifted to 7.8 µm with respect to the first two spectra. The 8.6 µm band is blended with the 7.8 µm band.
Subsequently, Berné et al. (2007) used non-negative matrix factorization (NMF) on the Spitzer-IRS spectro-imagery observation of NGC 7023, ρ-Ophiucus, and Ced 201. They extracted three representative spectra consistent with those obtained by Rapacioli et al. (2005), but over a wider wavelength range and with higher spectral resolution. Rosenberg et al. (2011) conducted a coupled NMF-database fitting analysis of NGC 7023 NW Spitzer IRS-SH spectra, from 10 to 15 µm where they also identified these three populations, and assigned the 11.0, 11.2, and 11.3 µm bands to PAH + , PAH 0 , and VSGs, respectively. Joblin et al. (2008) averaged the representative spectra extracted by Rapacioli et al. (2005) and Berné et al. (2007) to build template spectra for each of the three mean populations, i.e., PAH 0 , PAH + , and VSGs. Then they used them to fit the spectra of the HII regions and planetary nebulae (PNe). They found that the three template spectra were not sufficient to fit PNe spectra in which the 7.7 µm complex is redshifted to ∼7.8 µm. The authors then invoked a fourth template spectrum to better fit this spectrum type. Guided by the PAH + spectrum and calculated IR absorption spectra available in the databases, they constructed the PAH x spectrum and attributed it to large ionized PAHs. Later, these four templates were incorporated into the PAH Toulouse Astronomical Templates (PAHTAT) tool 1 (Pilleri et al. 2012), which can be used to fit Spitzer-IRS spectra with these four template spectra. So far, much of the success of the BSS methods in providing a scenario of evolution of the AIB populations with UV processing resides in the study of NGC 7023, a nebula that is sufficiently extended to observe chemical variations even at the spatial scale of a few arcseconds sampled by the previous IR missions.
The JWST will provide hyperspectral images with unprecedented details combining middle spectral (R ∼ 3000) and high spatial (up to 0.1 ) resolutions over the full 0.6 to 28 µm wavelength range. The high sensitivity of the instruments will increase the number of observable astrophysical objects (including fainter regions of PDRs toward molecular clouds, the diffuse interstellar medium, and distant galaxies). These improvements will increase the quantity of the collected hyperspectral image information, with respect to previous missions, by several orders of magnitude. Dedicated methods for analyzing AIB emission in these large data sets are thus required.
The present study aims to provide a fast and robust method based on BSS for the analysis of these data sets. In Sect. 2 we propose such a method. In Sect. 3 we present our test of this method on ISO-SWS data and the obtained set of representative spectra. In Sect. 4 we discuss how this method can help interpret the observations in terms of chemical populations. We conclude in Sect. 5.

Challenges of large JWST data sets
The next generation of space infrared spectroscopic observations will be provided by the JWST, whose launch is planned in 2021. It will deliver mid-IR spectral cubes of unprecedented quality over the 0.6 to 28 µm range once the data from the near infrared spectrometer (NIRSpec) andmid-IR imager (MIRI) instruments are combined.
A summary of its performance is available in Table A.1. Overall, the MIRI and NIRSpec integral field unit (IFU) mode characteristics combine and surpass the advantages of former space observatories as ISO, Spitzer, and AKARI. The achieved medium spectral resolution will be well adapted to the study of the AIBs. MIRI cubes will have different sizes depending on the channel, with typical values between 17 × 19 × 3000 and 26 × 29 × 3000, whereas NIRSpec IFU cubes will typically be 30 × 30 × 12 000. The memory needed for one data cube will reach several hundred megabytes. Efficient algorithms will thus be required to analyze these large data sets.
In this paper we propose an approach based on linear methods allowing a reduction in the dimensionality of data to that of a few representative spectra using a BSS approach. The first step of the method (which is also an improvement over earlier BSS applications to mid-IR spectra) is to extract the AIB emission from the observations (i.e., without continuum emission, and gas lines). This is done with the linear fitting method described in Sect. 2.2. The BSS method used to perform dimensionality reduction on the AIB extracted spectra is described in Sect. 2.3.
The initialization process of NMF algorithms is described in Sect. 2.4. The method is summarized in Sect. 2.5.

AIB extraction
In this section we present a method for extracting the AIB emission from observations using a linear model to maintain a low computational time. We note that the model presented here is not physically motivated, and only extracts the AIB emission and reduces the impact of noise. Thus, the degeneracy (i.e., that many solutions can provide a good fit) induced by the use of a large catalog is of limited concern, as the objective is mainly to have a good fit to extract a clean AIB spectrum and not to use fit parameters for further analysis. The proposed algorithm uses predominantly relatively localized spectral structures like Gaussians to model bands, lines, and plateaus, in particular in the 6−9 µm region (Peeters et al. 2002), whereas much broader components like blackbodies are used for large-scale spectral structures like the continuum. Figure 1 shows typical ISO-SWS spectra. They contain broad AIBs, gas lines, continuum, and instrumental noise. Our aim was to extract the AIB contribution. We can model the observation vector s obs by s mod as

Model
where s AIB , s cont , s gas , and s inst correspond respectively to the model of AIBs, continuum, gas lines, and instrumental noise. Each component (but not the instrumental noise) is composed of a list of spectra. Section 2.2.2 describes how the lists are constructed.

Model components
The objective of the present model is only to represent the data mathematically in a way that allows us to extract easily an estimation of an AIB spectrum from an observation. The first component of Eq.
(1), s AIB , corresponds to the AIB emission expressed as where g is a vector of coefficients and A a matrix whose lines are Gaussians.
To build matrix A, we first define the intervals of wavelength over which AIBs can exist. For each interval we define a number of Gaussians with a given width σ (Table 3). We note that for a given interval σ can take several discrete values. Each Gaussian position is spread out evenly over the interval of wavelengths.
The continuum emission model, s cont , is expressed as where k is a vector of coefficients and B a matrix whose lines are defined as follows. Each continuum B λ,i is defined by We use discrete values of T in an interval of temperature between T min and T max with a temperature step of ∆T . Likewise, several discrete values of N H can be taken into account. The value of C ext is taken from Weingartner & Draine (2001) for the Milky Way extinction curve.
The third component of the model is the gas emission, s gas , described as where p is a vector of coefficients and L a matrix whose lines are Gaussians. The same principle as for the AIB emission is applied. However, the lines are much narrower. Each set of Gaussians is distributed over an interval ∆λ around the central wavelength µ of the line. It must be noted that all matrices A, B, and L are pre-computed. Thus, in the fitting optimization only the coefficients in vectors g, k, and p have to be adjusted. This implies that the model is fully linear since Eqs.

Optimization
To estimate s mod we minimized the L2 norm between s obs and s mod : To perform this minimization, we used the non-negative least square (nnls) function of the scipy package implemented in Python. This is a critical aspect of the method. The fully linear model described in Sect. 2.2.2 allows us to use the nnls function, which is much faster than the Levenberg-Marquardt method used in PAHfit. As an illustration, we compared the computational cost between a Python version of PAHfit and our method (also in Python) when applied to a single Spitzer-IRS spectrum of the NGC 7023 NW PDR, from 5 to 14 µm. The fitting part in PAHfit (algorithm setup and the fit itself) runs for 194 s, while our code lasts 0.47 s. Compared to PAHfit, the main advantage of our method is that it provides a good AIB extraction 400 times faster. This suggests that PAHfit in its current form would be hardly applicable on a 0.6−28 µm IFU cube from JWST (containing 900 spectra of 23 k spectral points), while the linear method would still run over a reasonable timescale.
This method can be applied to each of the observed spectra, yielding for each source an estimate of s mod (i.e.,ŝ mod ). For each model, this allows us to isolate an estimate of s AIB (i.e.,ŝ AIB ) to then construct X aib , the matrix containing the modeled contribution of AIB for each object. In other words, a line of X aib is thê s AIB vector obtained for a given object.

Choice of BSS method
The BSS methods are based on the assumption that the signal measured (mid-IR spectra in our case) is a mixture of elementary sources (spectra). The goal of these methods is to extract each of these sources from the data. They have been used in several fields, such as acoustics (Abrard & Deville 2005) and IR astronomical spectroscopy (Berné et al. 2007;Rosenberg et al. 2011). A common assumption on the data mixture is the linear instantaneous model, which implies that all the n spectra of the data set are a linear combination of r elementary sources. This is expressed as where X is the n × m data matrix whose lines are observed spectra, M the n × r weight matrix, and S the r × m source matrix. There are four main classes of methods used to perform BSS: independent component analysis (ICA), sparse component analysis (SCA), non-negative matrix factorization (NMF), and Bayesian methods. Here we used NMF since it requires that the elementary sources are non-negative and the data satisfy the linear instantaneous mixture with non-negative coefficients. These conditions are verified in principle in the present case except for linearity when extinction becomes important (see Berné et al. 2007). NMF provides an estimation of X by the product of the non-negatives matrices of weights W and of sources H as where W is an estimate of M and H of S. The lines of H contain an approximation of the elementary spectra which we call "representative spectra". An algorithm for obtaining W and H was introduced by Lee & Seung (1999) (see details of the algorithm in Berné et al. 2007). Lin (2007) proposed an algorithm for performing NMF that converges faster using the projected gradient method instead of multiplicative updates. We used the latter method, which minimize the Euclidean distance between matrix X aib and the matrix product W × H, implemented in MATLAB.

Choice of r
The choice of r (i.e., the number of representative spectra) is a crucial and non-trivial problem in BSS. As it is unknown, a criterion is needed to fix it. Several methods were proposed in the literature. Rapacioli et al. (2005) and Berné et al. (2012) used a method based on the standard deviation σ r of the reconstruction residual matrix with a varying number r. The number r is increased until σ r reaches the noise level of the observed data.
To apply this method, a correct measure of the data sample noise level is needed (e.g., an empty pixel).
Other methods are based on the dimension of the vector space spanned by the data set. The method of singular value decomposition (SVD, Luo & Zhang 2000) applied to a set of spectra can help us find the dimension of the vector space in which all spectra can be expressed. Boulais et al. (2015) used this method on astrophysical data. A steep transition in the eigenvalues of the covariance matrix of X indicate the dimension of its vector space.
Finally, Berné et al. (2007) proposed a user-based method to identify r, which consists of increasing r until artifacts or physically non-relevant spectra are extracted.

Variability of the NMF solutions
Both the Lee & Seung (2001) and Lin (2007) algorithms use iterative updates of matrices W and H. The most common method used to initialize W and H is to set their values randomly. Earlier studies (Berné et al. 2007;Rosenberg et al. 2011) show that solutions of NMF applied to mid-IR astronomical with random initialization show variability in the results. To overcome this issue, Rosenberg et al. (2011) used a Monte Carlo (MC) analysis to extract an average solution and error bars. This method is costly in terms of time, however, as it multiplies the total time cost by the number of MC-NMF runs. This is an issue because the convergence time increases with the number of points, and JWST data matrices will contain a hundred times more points.
The robustness of NMF representative spectra extraction relies on how these mathematical variabilities can be reduced.
In the following we present a method that allows us to obtain a unique solution for a given data set. Using a different data set, i.e., a different X matrix, implies a change in the solution, i.e., a change in W and H matrices.

Initialization of H by MASS
To circumvent the issues associated with random initialization, Boulais (2017) developed an innovative way to initialize W or H using a geometrical method, the Maximum Angle Signal Separation (MASS) algorithm. The MASS algorithm extracts from the data the observed spectra that are the farthest from an angular point of view in the space vector generated by the data set. Practically, MASS computes the scalar product between each pair of spectra of the data set and, in a first step, isolates the pair that minimize it and creates a subspace vector with them. Then each remaining spectrum is projected on this subspace and the one that maximizes the angle between itself and its orthogonal projection on the first subspace is isolated. This spectrum is added to the first two spectra and the operation starts again until the algorithm isolates r spectra. The output is an estimation of the matrix H called S mass based on the observed spectra.
Based on tests performed on synthetic spectra, Boulais (2017) showed that initializing H by S mass significantly reduces the variability of the solution of NMF, although W is still initialized randomly. It was also shown using these synthetic data that this form of initialization provides a solution that is more reliable compared to what can be obtained using a MC approach. For example, when H is initialized by S mass , the normalized root mean square error measured is 7% against 18% with MC-NMF alone.
Overall, since the dispersion of the solution is significantly reduced, only one run of MASS-NMF allows us to obtain a more accurate result than MC-NMF and a significant gain of time. A detailed comparison of the MC method results and initialization with MASS will be provided in a subsequent paper (Boulais et al. 2019), following Boulais (2017). We therefore included this initialization scheme in the algorithm presented in this study.

Summary of the MASS-NMF method
In summary, the proposed method for extracting representative spectra from the JWST data is the following: -Spectral resolution and sampling are homogenized over the whole data sample; -From the results of the fit, a data matrix, X aib , whose lines are the AIB spectra extracted from observations is created (Sect. 2.2.3); -The number of representative spectra r is defined by the user (Sect. 2.3.2); -The MASS algorithm is applied to X aib , yielding S mass , a first gross estimate of the results (Sect. 2.4); -NMF, with H initialized with the values of S mass , is applied to X aib until convergence is reached (Sect. 3).

Data set and test parameters
In this section, we describe the ISO-SWS data on which the present method is tested in Sect. 3.1.1. The model parameters used are described in Sect. 3.1.2. The choice of the number on representative spectra is described in Sect. 3.1.3. Notes. All the data come from the Sloan et al. (2003) catalog and are taken in SWS01 mode. The last column gives the scan speed used for the corresponding spectrum. (a) For precise details about the position of the M 17 spectra, see Verstraete et al. (1996).

Data
The ISO-SWS observations cover the 2.4−45 µm range, which overlaps that of the JWST-NIRSpec-MIRI combined data (0.6−28 µm, Gardner et al. 2006). We limit our study to the 2.6−15 µm range over which the major PAH bands are located.
Since there is no spatial information, we gathered the spectra of different astrophysical objects from the Sloan et al. (2003) catalog of highly processed data product spectra. More specifically, we selected spectra showing clear AIB emission and no noticeable extinction (see Sect. 2.3). The compiled list of objects and their types is given in Table 1. In total, n = 31 spectra were used coming from reflection nebulae (RN), PNe, HII, and Compact HII (CHII) regions, as well as one post-AGB star and one starburst galaxy (i.e., Red Rectangle and M 82, respectively). It appears that some spectra have a negative part due to calibration issues. For example, the IRAS 19442+2427 spectrum has negative values over the wavelength region from 2.6 to ∼6.1 µm. This is circumvented by adding a constant to the spectrum corresponding to the absolute value of the minimum intensity of this spectral region. It has no impact on the analysis since the aim of the present paper is to study the AIB emission that will be extracted using a non-negative leastsquares algorithm (see Sect. 2). Notes. This is applicable for the observing mode considered in this paper (i.e., SWS 01 scan speed 2).
The selected observations came from different proposals and involve various observing modes. An ISO-SWS spectrum is made of the combination of several discrete segments. To homogenize the spectra, we selected the spectral resolution of the mode with the lowest spectral resolution present in our data sample, which corresponds to the scan speed 2 of mode 01. Then we applied Gaussian convolution using a kernel with width corresponding to the lowest spectral resolution of the same segment in the observing mode 01 in scan speed 2. Another necessity is to have dimensional consistency (i.e., the same spectral sampling for all spectra). A wavelength vector with step size of ∆λ 10 (where ∆λ = λ ref R ) was generated, and the data was interpolated onto this new grid.
This yields n = 31 spectra of m = 6799 points with a spectral resolution constant over each segment, as given in Table 2. Figure 1 shows an example of spectra from each type of object.

AIB extraction model parameters
The parameters of the AIB extraction model discussed in Sect. 2.2 are presented as follows. Table 3 shows the intervals of wavelengths on which we place the catalog of Gaussians modeling the AIBs. No intervals are considered between 3.55 µm and 5.195 µm because of the lack of known AIBs in this region. Table 4 shows the characteristics of the modified blackbody catalog. Table 5 shows the characteristics of the Gaussian catalog corresponding to the gas lines (i.e., the identification of the transitions and the Gaussian parameters). Figure 2 shows two examples of extraction of the AIB emission. All our fits are found to be of good quality.

Choice of r
Several ways to determine r are discussed in Sect. 2.3.2. The method used by Rapacioli et al. (2005) and Berné et al. (2012) is not usable here due the impossibility of having a correct measure of the noise. The observations were not done with the same instrumental modes, which means the noise values of the spectra are not uniform. Figure 3 shows the eigenvalues of the covariance matrix obtained by applying SVD to the data sample used in the present study. No sharp step is observed, which precludes the choice of r by this method. The Berné et al. (2007) method is also not successful here, as these artifacts never appear when increasing r. Since all of the methods known to us for identifying r automatically failed, we relied on previous studies. Rapacioli et al. (2005) found r = 4, with three physical spectra and one spectrum dominated by artifacts. Berné et al. (2007) and Rosenberg et al. (2011) Table 1 for the complete list of objects. both found r = 3. However, the following studies show that an additional spectrum was needed (PAH x , see Sect. 1) in particular when PNe are included (which is the case here). Thus, r = 4 is also likely and is used for the following test.

Results and performance
The output of the present method is described in Sect. 3.2.1. The reconstruction of each spectrum of the data set using the set of we discuss the influence of the flux of each spectrum on the output of the method. Finally, we briefly discuss the time cost of this method in the perspective of its application on JWST data sets. Figure 4 displays the four representative spectra obtained by applying the method described in Sect. 3, with r = 4, on the ISO-SWS data set. The dashed lines in this figure show the template spectra from the PAHTAT fitting tool (Pilleri et al. 2012, see Sect. 1). The first conclusion that can be drawn from this figure is that the extracted spectra using MASS-NMF applied to the ISO data are qualitatively in good agreement with those of Pilleri et al. (2012), and in particular we recognize the four chemical populations presented in this earlier study and ascribed to cationic PAHs (PAH + ), neutral PAHs (PAH 0 ), evaporating very small grains (eVSGs), and PAH x (i.e., large ionized PAHs). These assignments are discussed in further detail in Sect. 4.1; however, for the sake of simplicity we use this chemical denomination from now on.

Spectral reconstruction of observations with representative spectra
One way to assess the quality of the decomposition is to reconstruct the observed spectra using the extracted representative spectra described in the previous section. To perform this reconstruction (or fit), we used the algorithm described in Sect. 2.2 (see Eq. (6)). We used the same catalogs of blackbodies and Gaussian functions to model the continuum and gas line emissions; however, the AIBs are now reconstructed using the four representative spectra. Figure 5 shows two examples of spectral reconstruction. The fits for all the objects reported in Table 1 are given in Appendix D, and the proportions of each spectrum for each reconstruction is given in Table 6. The fits are generally found to be of good quality. To estimate this more quantitatively, we compared the variance of the residual to the noise of the observations. A good reconstruction implies that the two quantities are of the same order. However we do not have a reliable measure of the data noise since each spectrum was obtained at a different time and not all of them were recorded with the same setting. A good measure of the noise in a hyperspectral image would be from a pixel with no AIB signal, which we do not have here. In the absence of an empirical noise estimate, we use the uncertainties provided by the pipeline, as described in Sloan et al. (2003). These are instrumental uncertainties related to the measurement method, not a measure of the total noise, and therefore likely an underestimate of the total noise. The ratio of the variance of the residuals of the reconstruction to the mean of the instrumental uncertainties is compared in Fig. 6 for each source. The values are between 0.34 and 2.5, which shows that the variance of the residuals is on the order of the instrumental uncertainty, implying a good quality of the reconstruction. Joblin et al. (2008) used the four PAHTAT template spectra reported in Fig. 4 and two broad features at 8.2 and 12.3 µm to fit the 5.5-14 µm spectra of a set of PNe and CHII regions after smoothing the observations to the spectral resolution of the templates (λ/∆λ = 45). All SWS spectra used in this earlier study are also included in our sample. We can thus compare the results obtained in each study from the spectral reconstruction in terms of the contributions derived for each population (see Table A Although differences in detail can be seen (e.g., a larger contribution of eVSGs in the new study compared to the previous one), an overall agreement is found. This demonstrates that the inclusion of higher resolution data in the decomposition method has preserved the general spectral information in terms of the defined PAH 0 , PAH + , eVSG, and PAH x contributions. However, the representative spectra obtained here contain more spectral information and therefore possibly chemical information, as is discussed in Sect. 4.2.   3.29, 6.20, 7.60, 8.60, 11.22, and 12.70 µm).

Importance of flux in the extraction of the PAH x spectrum
The last line of Table 6 shows the sum of the flux emitted by each population computed from the reconstruction of the observed spectra presented in Sect. 3.2.2. The prominent emission of the PAH x population compared to others is noticeable. This is due to the presence in the sample of the Red Rectangle (RR), whose emission accounts for 55% of the total PAH x flux. This source is 12 times stronger in flux than the average of the rest of the sample, and 3 times stronger than the second brightest spectrum (IRAS 07027−7934). This is important to keep in mind since the version of NMF used in this paper aims to reduce the Euclidean distance between the data matrix X and the estimated W and H matrices (see Sect. 2.3). This implies that the algorithm will provide one vector in the solution that is close (in terms of Euclidean distance) to this observed spectrum. This is what happens here in the extraction, where the PAH x spectrum is very similar to the RR spectrum. If a normalization of the observed spectra to an integrated flux density of one is performed (and the RR contribution is thus artificially reduced) before running NMF, then the population identified as PAH x in Fig. 4 (spectrum 4) is not extracted by the algorithm. This underlines the fact that the presence of the RR in our sample is crucial to the extraction of the PAH x spectrum.

Computation time
On the ISO-SWS data set of dimensions (31 × 6799), the method runs in 10 min on a laptop computer. We generated, based on the ISO data and using interpolations, a data set with the dimensions corresponding to an observation of one field of view of NIRSpec and MIRI combined (i.e., 30 × 30 × 23 000) and applied the proposed method to it. We find that the computation time on a server is on the order of one hour.

Towards a chemical interpretation of extracted spectra
In this section we briefly discuss the characteristics of each representative spectrum. We compare them to spectra obtained in earlier studies to propose a possible chemical interpretation of these spectra. This is done to illustrate the methodology, thanks  to the global similarities we observed in the representative spectra in the studies and to the second-order variations due to the specificity of the data set used. These assignments should not be considered definitive, in particular given the limited size of the sample and the absence of spatial information. Hopefully, the use of JWST data will allow us to achieve a more secure chemical assignment thanks to larger samples and spatially resolved observations. We also discuss how the classes of Peeters et al. (2002) can be connected to these extracted spectra. In Sect. 4.2, we present some considerations on chemical diversity and on the possibilities that a MASS-NMF analysis can open when dealing with the data sets that will be provided by JWST.

Chemical assignment of extracted spectra and their connection to standard classes
Improvements in the extracted spectra can be clearly seen in Fig. 4 compared to the spectra presented in Pilleri et al. (2012). These concern both the wavelength coverage (i.e., inclusion of the 3 µm region) and the spectroscopic details that are present in the new spectra. Table B.1 lists the positions of the main detected bands, and of some more minor bands where we are confident that they are associated with real bands and not artifacts of the method. Some bands are well known, such as the satellite bands in the 3.4−3.5 µm range. Other bands are identified because they are observed in several representative spectra, at close positions. The positions of the major bands observed in spectrum 2 (3. 29, 6.20, 7.61, 8.60, 11.22, and 12.73 µm) are used as arbitrary reference positions and are indicated by vertical dashed lines in Fig. 4. Overall, these results demonstrate that an application of the MASS-NMF method to the JWST data sets is promising, and likely to provide a scheme of analysis providing unprecedented detail. We describe here the spectral characteristics of the representative spectra and compare them to earlier studies to interpret their chemical origin.
Spectrum 1: PAH cations. Spectrum 1 (Fig. 4) is dominated by two bands centered at 6.20 µm and 7.61 µm. The other major bands peak at 8.53, 11.24, and 12.75 µm. Spectrum 1 shares some characteristics with spectrum 2, but has a higher value of the band intensity ratio of the 7.7 µm complex (C-C stretch) to the 11.20 µm band (C-H out-of-plane bend). Another major difference concerns the 3 µm region where spectrum 1 has no significant bands, contrary to spectrum 2. All these characteristics point to PAH cations (PAH + ) being the carriers of spectrum 1, in line with experimental and theoretical studies (e.g., DeFrees et al. 1993;Szczepanski & Vala 1993;Hudgins & Allamandola 1999) and the study by Boersma et al. (2013) using the NASA Ames PAH database. We note that the 6.20 µm band is blueshifted compared to the PAHTAT template, 6.20 µm relative to 6.27 µm. However, the position of this peak varies in the various BSS extractions of the PAH + spectrum; for instance, it is reported at 6.28 µm in Rapacioli et al. (2005) and 6.24 µm in the extraction performed by Berné et al. (2007). However, the shift of 0.08 µm in previous templates relative to spectrum 1 is less than the spectral resolution of the ISO-CAM data (∼0.14 µm) that was used in these studies. Our results can also be compared to the study of Rosenberg et al. (2011) on NGC 7023. The authors applied NMF to Spitzer spectra of NGC 7023 obtained over the 10−15 µm range at R ∼ 600. They found a PAH cation spectrum dominated by a band at 11.00 µm, albeit with relatively large uncertainties (see lower left panel of their Fig. 3). A band at 11.0 µm is indeed present in spectrum 1, Notes. (a) From Peeters et al. (2002). (b) Represents the ratio of the total AIB flux of an astrophysical object to the continuum flux of the same object. although the main one is located at 11.24 µm. This indicates that the representative spectra that can be extracted depend on the considered spectral range and objects (see discussion on chemical diversity in Sect. 4.2).
Spectrum 2: PAH neutrals. This spectrum presents clear peaks at 3.29, 6.20, 7.61, 8.59, 11.22, and 12.73 µm. In addition to the 3.29 µm band, there is a weaker satellite band at 3.40 µm over a plateau, which extends from 3.4 to 3.6 µm and comprises very weak bands at 3.43, 3.47, and 3.51 µm. The 11.22 band is much stronger than in spectrum 1, being close to the 7.7 µm complex emission, which points to neutral species (PAH 0 ) being the dominant carriers (Compiègne et al. 2007;Joblin et al. 1996b). This is also consistent with the previous decomposition studies (Rapacioli et al. 2005;Berné et al. 2007;Pilleri et al. 2012). The position of the 11.2 µm C-H band observed here (11.22 µm) is in good agreement with that of the neutral PAH spectrum (11.20 µm) derived by Rosenberg et al. (2011).
Spectrum 3: eVSGs. This spectrum exhibits bands that are redshifted compared to the bands of the PAH + and PAH 0 spectra and are found to be superimposed on a plateau (Fig. 4). Although there are notable differences, in particular in the 7.7 µm complex, we associate this spectrum with the population of eVSGs described in Pilleri et al. (2012). This population spectrum has a dominant contribution in the PNe spectra. We note in addition that the position of the 11.20 µm band is shifted to ∼11.35 µm, which is similar to the value of the position observed for this A84, page 10 of 30 band in the VSG spectrum extracted by Rosenberg et al. (2011), i.e., ∼11.40 µm (see their Fig. A1).
Spectrum 4: PAH x ? Spectrum 4 shares some similarities with the PAH + spectrum (spectrum 1), but with a significant redshift of the main bands corresponding to the C-C stretch and C-H in-plane bend motions. The shift is the strongest for the band at 7.85 µm, compared with a position at 7.61 µm in the PAH + and PAH 0 spectra. This is the characteristic put forward by Joblin et al. (2008) for the so-called PAH x , which would be large ionized PAHs (cations or anions of more than 100 carbon atoms). We note that the latter authors had constructed the PAH x spectrum, whereas here this spectrum results from the extraction procedure. We found the presence of a strong 3.3 µm band associated with the PAH x component, also slightly redshifted compared to the band of PAH 0 . However, we should consider that the extraction of the PAH x spectrum is only possible due to the presence of the intense RR spectrum in the data set (see Sect. 3.2.3). Its extraction is promising but still to be confirmed on a sample containing more RR-like objects.
Finally, although the spectrum of every object is a mix of the four representative spectra, we note that the class A objects of Peeters et al. (2002) are dominated by the PAH 0 and PAH + populations, whereas the class B objects are dominated by eVSG and PAH x populations.

Chemical diversity
The improved spectral resolution and larger spectral range of our representative spectra can provide new information on the emitting populations. Table B.1 reports the positions of the main detected bands and of some minor bands where we are confident that they are associated with real bands and not artifacts of the method. In the following we more specifically focus on the differences that can be noted for a given type of spectral feature in the representative spectra.
We first consider the 3−4 µm range, which is studied here for the first time using BSS methods. Figure 4 shows that all populations except the carriers of spectrum 1 have a contribution in this spectral range. For PAHs, it is known that ionization induces a collapse of the intensity of the C-H stretch (Pauzat et al. 1995). The effect decreases with increasing size. For large species containing 100 carbons or more, the 3.30 µm band has a comparable intensity for neutrals and cations ). On the other hand, the 3.30 µm band is in general more intense for anions than for neutrals, as shown by density functional theory calculations, which are gathered in the Theoretical PAH database 2 and the NASA Ames PAH IR Spectroscopic Database 3 . These results strengthen the assignment we made for spectra 1, 2, and 4. Spectrum 1 is dominated by PAH cations, whereas spectrum 2 is dominated by neutrals. The presence of a strong 3.30 µm band in spectrum 4 supports the idea that their carriers, some large ionized PAHs, are anions rather than cations, whereas both species were proposed in Joblin et al. (2008). Even so, the new data raise a number of questions. The absence of the 3.30 µm band in spectrum 1 would go against the fact that the carriers are large PAH cations, unless they are not sufficiently excited to emit at 3.30 µm. The fact that spectrum 3 attributed to eVSGs exhibits a 3.30 µm band is also surprising considering that eVSGs are expected to be larger species and to then emit at lower temperatures than PAHs (Rapacioli et al. 2005). It should be kept in mind, however, that the retrieved representative spec-tra reflect the dominant contributions in the sample of spectra and do not have the ability to separate chemical populations that are present at an intensity that is too low compared to other populations. Spectrum 3 may therefore contain a mix of eVSGs and smaller PAH-like species that could be produced by their evaporation (Pilleri et al. 2015).
In addition to the 3.30 µm band, spectrum 2 also exhibits the 3.40 µm band, as well as minor bands at 3.43, 3.47, and 3.51 µm. All these bands have been seen in several sources (e.g., Pilleri et al. 2015;Sloan et al. 1997;Ohsawa et al. 2016). They have been attributed to aliphatic bonds in methyl side groups attached to PAHs (Joblin et al. 1996a) or in superhydrogenated PAHs (Bernstein et al. 1996). These species are expected to have another characteristic band at ∼6.9 µm (Steglich et al. 2013), which is indeed present in the PAH 0 spectrum. There is also a significant band at 6.9 µm in the eVSG spectrum, which might be related to the aliphatic content of this population (Pilleri et al. 2015). For spectra 1, 3, and 4, no band at 3.40 µm is detected, but a broad plateau centered around 3.45 µm is present. Pilleri et al. (2015) detected this plateau in NGC 7023 and concluded from a spatial analysis that it is related to the emission from hot bands and combination bands involving aromatic bonds.
We see in Sect. 3.2.1 that the 6.20 µm band peak position varies significantly, from 6.20 to 6.27 µm, between spectra 1, 2, and 3, 4. In addition, spectrum 4 has a satellite band at 6.00 µm. This feature has been observed in a number of objects and was previously attributed to a C=O stretching mode in oxygenated PAHs (Peeters et al. 2002), C-C mode in complexes of PAHs with Si atoms (Joalland et al. 2009) or C=C mode from an olefinic side group (Hsia et al. 2016). The range of the 7.70 µm complex indicates some diversity between the populations, although the PAH 0 and PAH + spectra appear relatively close with common bands at 7.61, 7.52/7.53, and 7.76/7.77.
There are significant differences in the 11−15 µm range within the PAH spectra. The band at 11.20 µm is the dominant one followed by the 12.70 µm band. However the 11.20 µm band is much more intense in spectrum 2 (PAH 0 ) than in spectrum 1 (PAH + ), where it is accompanied by other features at ∼12 µm. This would indicate more chemical diversity in the PAH + population.
To conclude, we can see that the representative spectra carry some information on the chemical diversity involved in each population. Even so, our sample is limited and rather heterogeneous. We therefore have to be cautious with any detailed band assignment. More robust information will be obtained while analyzing data obtained at high spatial resolution on specific regions in which the chemical evolution of the populations is driven by the local physical conditions such as the UV radiation field. Our four template spectra are consistent with previous studies showing the importance of UV processing. Among the PAH populations, the PAH 0 population is the least processed, which is illustrated by the presence of more fragile aliphatic side groups. The PAH + population is expected to be more processed and ionized. It is possible that the trends observed in the 3.30 and 11−13 µm ranges (see above) reveal some first steps of dehydrogenation, although not much data are available to support this scenario (Pauzat et al. 1995). The PAH x population indicates the presence of large PAH anions. The spectrum of eVSGs is more difficult to comment and this also reflects our poor knowledge on the nature of this population.

Conclusions
In this paper, we present a new linear (thus fast) and robust method for the analysis of the AIB spectra in preparation of the large data sets that will be provided by the JWST. After an automatic extraction of the AIBs from each observed spectrum (see Sect. 2.2), a hybrid blind signal separation method (see Sect. 2.3), i.e., NMF initialized using the MASS algorithm (see Sect. 2.4), is applied on the extracted AIBs, and provides a set of representative spectra (here, r = 4). The method reduces the dimensionality of the study of a data set to that of a reduced number of representative spectra, physically and chemically interpretable in terms of populations of PAHs and VSGs. The speed of the method allows us to analyze large sets of data in a limited time compared to non-linear methods or hybrid Monte Carlo-NMF methods. As a test, we used ISO-SWS archival data (see Sect. 3.1.1) which most closely approached the spectral range and resolution of JWST data. Four representative spectra shown in Fig. 4 were extracted, and they demonstrate that the method is applicable. Based on comparison with previous studies, we propose the following interpretation: -Three representative spectra were assigned to PAH 0 , PAH + , and eVSG spectra, in agreement with previous studies. -We attributed the fourth spectrum to the PAH x population, initially invoked and constructed by Joblin et al. (2008) and extracted here automatically for the first time. However, this spectrum should be confirmed on a larger sample because of its specific behavior. -We found that the spectral classes of Peeters et al. (2002) can be explained by a dominant PAH 0 and PAH + emission for class A, and by eVSG and PAH x emission for class B. -The 3 µm range was included for the first time in such a study, and reveals the presence of aliphatics connected to neutral PAHs, and that the anionic character of the PAH x population is supported by this new analysis. -Second-order spectral signatures (e.g., small bands) have been identified, showing that more chemical diversity can be investigated thanks to an improved spectral resolution. This list should however be taken with caution, considering the combined small size and heterogeneity of the sample of astronomical spectra used in this study. The method presented in this study is readily applicable to JWST hyperspectral images that will be provided by MIRI and NIRSpec IFUs, and we expect that it will allow us to refine the scenarios of PAH formation, evolution, and destruction in galaxies. Notes. Sensitivities are calculated in order to have a 10σ detection with a 10 k s exposure for MIRI and 13. (a) At 3.30 µm with a G395H-F290LP disperser-filter configuration (highest spectral resolution). (b) At 7.70 µm with MRS mode (highest spectral resolution). Combination bands/aromatic 5.68 5.69 5.79 5.70 6.0 ---6.02 6.02 6.7

Appendix A: Summary of MIRI and NIRSpec characteristics
Aromatic C-C stretch? 6.66 6.73 -6.62 6.9 Aliphatic C-H deformation? -6.92 6.88 6.87 7.0 ? --7.04 7.09 8.3 Aromatic C-H in-plane bend? 8.32 8.34 8.45 8.35 11.0 Aromatic C-H out-of-plane bend 11.03 11.02 11.09 11.07 11.5 Aromatic C-H out-of-plane bend --11.58,11.81 -12.0 Aromatic C-H out-of-plane bend 11.92 -11.96 11.97 12.0 Aromatic C-H out-of-plane bend 12.13 -12.09 -12.4 Aromatic C-H out-of-plane bend --12.42 -13.5 Aromatic Notes. Values in italics correspond to mean value of complexes of two bands whose estimated peak positions are shown in parentheses. The sign "-" is used when no significant band could be extracted.  Table B.1 gives the position and the assignment of the bands extracted from the different representative spectra. They are classified as main peaks, minor bands, and shoulders, which are bands blended with a nearby main band.