
1. Introduction

The interstellar medium (ISM) is a key element of the Milky Way and subject to both astrophysical and cosmological studies. It consists of localized components such as molecules and interstellar dust in cold clouds, atomic and ionized hydrogen, and hot plasma in the Galactic halo, as well as components which pervade the entire ISM, such as cosmic rays and magnetic fields. Our present knowledge about the existence of these components is based on the fact that they all contribute to the interstellar radiation field (e.g., Draine 2011). Each component, or the interplay of several components, generates radiation of a specific spectrum and can be reconstructed by component separation algorithms.

There are components showing very characteristic emission lines, such as CO or neutral (HI) and ionized (HII) atomic hydrogen, which permit their Galactic distributions to be determined very precisely. Other components, however, can generate radiation in widely separated regions of the electromagnetic spectrum. For example, the interaction of cosmic rays and magnetic fields generates synchrotron radiation and can be observed in the radio regime, as in the 408 MHz radio map (Haslam et al. 1982), while the interaction of cosmic rays with interstellar matter imprints on the γ-ray regime through hadron-nucleon collisions producing pions (e.g., Mannheim & Schlickeiser 1994). Another example is hot ionized plasma, which radiates in the X-ray regime when generated by supernovae, whereas hot gas ionized by collisions emits in the UV regime (e.g., Ferriere 2001). In this specific case, both the UV and soft X-ray photons are again absorbed by interstellar dust, which prevents observations in regions of high dust density such as the Galactic plane. This interplay shows the high complexity of radiative extinction, which needs to be taken into account when reconstructing single emission components.

The cosmic microwave background (CMB), for example, is not affected by dust extinction, but it is superimposed by Galactic foregrounds. Cosmological studies aim to extract the CMB radiation from multiple frequency channels by identifying and systematically removing these Galactic foregrounds. At frequencies below 100 GHz, Galactic synchrotron and free-free emission contaminate the CMB, whereas above 100 GHz, thermal dust emission and the cosmic infrared background dominate (Gold et al. 2011; Planck Collaboration X 2016). To distinguish between different sources of emission, members of the Planck Collaboration IV (2020) developed several component separation algorithms, for example the COMMANDER (Eriksen et al. 2008), SEVEM (Leach et al. 2008), and SMICA (Delabrouille et al. 2003; Cardoso et al. 2008) codes. The results obtained from those algorithms contain, among others, all-sky maps of the CMB as well as of synchrotron, free-free, thermal and spinning dust, and CO line emission (Planck Collaboration X 2016). The approaches are, however, based on cosmological, astrophysical, and instrumental parameters, and they require preprocessed spectral templates or explicit knowledge about physical correlations. The COMMANDER code, for example, describes the data at a given frequency as the linear sum of several signal components and a noise term. The signal terms contain a number of astrophysical spectra describing zodiacal light, thermal and spinning dust emission, or synchrotron radiation, which depend on physical parameters, such as the dust temperature, and which are processed by an instrument response function that depends on frequency calibration factors and instrumental bandpass integration effects (Planck Collaboration X 2016). To verify component separation results in a manner independent of such extensive modeling, approaches allowing automated component identification have been increasingly pursued in recent years (Longo et al. 2019).

A broad range of machine learning algorithms has been applied to cosmological and astrophysical problems (Fluke & Jacobs 2020). For example, Beaumont et al. (2011) employed the Support Vector Machine (SVM) algorithm to classify structures in the ISM. The algorithm was able to identify a supernova remnant behind a molecular cloud based on a sample of manually classified data. Ucci et al. (2018a) examined the ISM composition of dwarf galaxies by processing the available spectral information in a machine learning code. The so-called GAME algorithm was trained on a large library of synthetic data (Ucci et al. 2018b) and recovered ISM properties such as metallicity, gas densities, and their respective correlations on the basis of spectral emission lines. By training a convolutional neural network on synthetic spectra, Murray & Peek (2019) were able to decompose the thermal phases of neutral hydrogen HI. This selection of algorithms represents supervised learning approaches, which require labeled or preclassified data in order to be trained. We investigate to what extent unsupervised approaches can be applied to Galactic observations.

Our study is based on the work of Müller et al. (2018). The authors applied Gaussian mixture models (GMMs) to Galactic all-sky data in order to augment pixels with missing measurement data by learning spectral pixel-wise relations. They verified the prestated hadronic and leptonic components of the γ-ray sky (Selig et al. 2015) and presented a higher-resolution hadronic component as well as a completion of unobserved information. In their work, the authors had hoped to correlate the main component maps of the GMM with the different phases of the ISM, which, however, turned out not to be the case (see Appendix F in Müller et al. 2018). The idea of classifying astrophysical data in an unsupervised manner motivated the approach in the present study to explore other latent variable models, such as autoencoders, which are able to encode relevant representations of the data in their latent spaces. Autoencoders belong to a class of unsupervised machine learning approaches called representation learning (Hinton 1990; Bengio et al. 2013; Goodfellow et al. 2016). Representation learning methods are constructed to extract the underlying causes that generated the observed data, in other words they calculate the most informative representation of the observation (Bengio et al. 2013; Goodfellow et al. 2016). Autoencoders reduce the dimension of the input data to the so-called underlying, explanatory factors of variation in their latent space (Hinton & Salakhutdinov 2006; Bengio et al. 2013; Goodfellow et al. 2016). They are trained to regenerate their initial input as accurately as possible from their lower-dimensional latent space, meaning a generative process is modeled. We translated this functionality to the task of astrophysical component separation: Since astrophysical data display the superposition of several radiative processes, we modeled this superposition by the generative part of an autoencoder and extracted the underlying radiative processes as “features” in the latent space. In contrast to component separation algorithms similar to the COMMANDER code, we did not predefine the components to be separated, but rather constructed the algorithm to learn a lower-dimensional representation of the data and analyzed the resulting features after training. We posed a forward model similar to that of the COMMANDER code, but by using neural networks, we built our algorithm to learn the generative signal interactions from the data. This unsupervised and generic approach to classifying astrophysical data could be used as a preprocessing step for subsequent analyses. In particular, considering the large quantities of astrophysical data produced every day by surveys such as the Sloan Digital Sky Survey (SDSS), effective preprocessing strategies are needed to analyze these data volumes (Kremer et al. 2017; Reis et al. 2019).

The learning process of autoencoders, and in general of neural networks, is guided by so-called loss functions. In this study, we developed a specific loss function for spectral component separation. We used a Bayesian framework to formulate the signal reconstruction problem and approximated the resulting posterior distribution using variational inference. Autoencoders consisting of a probabilistic inference and generation process are called variational autoencoders (VAE; Kingma & Welling 2013). This approach was, for example, used in the context of the Sloan Digital Sky Survey, as Portillo et al. (2020) successfully applied VAEs to the SDSS data. The authors efficiently reduced the dimension of 1000 input pixels to six latent components, with the VAE outperforming principal component analysis at spectral dimensionality reduction. Their latent space separates galaxy types and detects outliers, making it a very useful preprocessing step for large astrophysical data sets. However, the authors note that the uncertainty quantification could be improved and suggest including the pixel-level noise as a separate feature to improve latent variance representations. In contrast to this study, we fully derived the loss function describing the component separation process, which, compared to a standard VAE, contains additional terms guiding the learning process. These additional terms describe the influence of an adaptive model noise introduced by our forward model and weights introduced by our specific generative process. The Bayesian framework further allows us to calculate uncertainties and to quantify the significance of each result.

2. Data and methods

In the following, we describe how we constructed a loss function for the problem of spectral component separation. The loss function guided the learning process of our algorithm and was based on statistical assumptions rather than knowledge about physical correlations. In Sect. 2.2, we state a generative model describing how 35 Galactic all-sky maps were generated by a smaller number of signals. We combined the generative model with statistical prior knowledge on the signals and applied Bayes’ theorem to calculate an analytic solution to the signal reconstruction problem. Using variational inference, we derived a minimization objective approximating the signal reconstruction, which we optimized with neural networks. The approximated signals are called “features” in this study, following machine learning terminology. The architecture of our algorithm was based on autoencoders, which are trained to reconstruct their input data from a lower-dimensional, latent representation. After training, this latent representation contained our resulting features. In Sect. 2.3 we explain how the derived minimization objective was translated to a variational autoencoder architecture.

2.1. Data

Our goal was to determine a compressed spectral representation of the Galactic sky. We built an unsupervised algorithm to learn this representation from the data itself rather than defining specific components to be reconstructed. To give the algorithm a broad range of data to learn from, we chose a data set consisting of 39 Galactic all-sky maps compiled by Müller et al. (2018), covering frequencies from the radio to γ-ray regime. To generate the data set, the authors assembled information from all-sky surveys (Müller et al. 2018, and references therein) with at least 80% spatial coverage in HEALPix format (Gorski et al. 2005), and used the highest resolution map for each frequency. The UV regime is not part of their data set, since the respective GALEX survey (Bianchi et al. 2017) shows too many unobserved regions. To homogenize and unify the data, they (1) converted the sky brightness values in the original maps to flux magnitude values, (2) reduced the noise level in X-ray data by smoothing with a Gaussian kernel, (3) removed extra-Galactic sources and calibration artifacts from all maps except the already cleaned γ-ray data from Selig et al. (2015), and (4) unified the resolution of all maps to a HEALPix pixelization of Nside = 128.

Based on the resulting 39 Galactic all-sky maps, we built an indexed data set D (see Table A.1) for the present study with three modifications: (1) We discarded the hadronic and leptonic component maps by Selig et al. (2015), since these were derived from the Fermi data and thus contain redundant information, (2) we removed the AKARI far infrared map recorded at 65 μm, which exhibits a poor spatial coverage (and we can only work with pixels that are covered by all maps), and (3) we neglected the CO line emission map, since this information is partly contained in the Planck 100, 217, and 353 GHz frequency channels (Planck Collaboration X 2016). We then defined our data set consisting of k = 35 all-sky maps as an indexed set D = (d1,…,dp), where p is the number of pixels, and the magnitude pixel vectors di ∈ Rk represent the magnitude flux values of all frequency maps for the ith pixel. In other words, we extracted the same pixel from each of the 35 all-sky maps respectively, combined the pixels to a pixel vector di, and used the pixel vectors as input data for our autoencoder.
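To make this preprocessing step concrete, the following sketch illustrates how such pixel vectors could be assembled from HEALPix maps. The file names, the use of healpy, and the handling of missing pixels via the HEALPix UNSEEN sentinel are our own assumptions, not the original pipeline:

```python
import numpy as np
import healpy as hp

# Hypothetical file names for the 35 cleaned all-sky maps (magnitude flux values, Nside = 128).
map_files = ["map_00_fermi_10GeV.fits", "map_01_rosat_1p5keV.fits"]  # ..., 35 entries in total

# Stack the maps into an array of shape (35, 12 * 128**2).
maps = np.vstack([hp.read_map(f) for f in map_files])

# Keep only pixels observed in every map (missing pixels assumed to carry the UNSEEN value).
observed = np.all(maps != hp.UNSEEN, axis=0)
D = maps[:, observed].T  # shape (p, 35); each row is one magnitude pixel vector d_i
```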

2.2. Model design

Determining a lower-dimensional representation of the spectral information in our data set is an inverse problem: We were looking for unknown quantities, our so-called features, and procedures applied to them that could have generated the full information in D. To describe this problem, we used generative modeling to define a data model with corresponding parameters Θ. We solved the inverse problem with the posterior probability distribution P(Θ ∣ D) of the model parameters Θ, which we specify in the next paragraph. Using Bayes’ theorem, this posterior can be expressed by the data likelihood and the prior distribution up to a Θ-independent factor, reading P(Θ ∣ D)  ∝  P(D ∣ Θ)  ×  P(Θ). We performed a pixel-based analysis of image data by assuming magnitude pixel vectors to be independent. This enabled us to factorize the data likelihood and prior distributions. In the following, we describe how we calculated the respective distributions per pixel and how we combined all derivations to one solution for the entire data set D.

We started by defining a generative process leading from an abstract source S to the observations in the data set D. S was not determined in the beginning, but we aimed to associate S a posteriori with relevant physical quantities. We assumed that each di ∈ D was generated from a source vector si ∈ Rl with l ≤ k, and the collection of these vectors is expressed by an indexed source set S = (s1, …, sp). This relation is described in a data model

$$ \begin{aligned} \boldsymbol{D} = \boldsymbol{\widetilde{D}} + \boldsymbol{N}_{\mathrm{model}}, \end{aligned} $$(1)

where we defined the observed data D to be composed of $\boldsymbol{\widetilde{D}} := (\widetilde{\boldsymbol{d}}_1, \ldots, \widetilde{\boldsymbol{d}}_p)$, which is the output of a generative process G : Rl → Rk, and the model noise Nmodel. The generative process G( · ), mapping the variables of S pixel by pixel to $\boldsymbol{\widetilde{D}}$, that is $\boldsymbol{\widetilde{D}} = G(\boldsymbol{S})$, was unknown at this point.

2.2.1. Prior distributions

We factorized the prior distribution of the set of source vectors S as $P(\boldsymbol{S}) = \prod^{p}_{i = 1} P(\boldsymbol{s}_i)$. The prior on P(si) can be arbitrarily complex, but without loss of generality we stated a transformation si = T(zi) of the source vectors to an indexed set of latent vectors zi ∈ Rl, providing a standardized prior distribution P(zi) = 𝒢(zi, 1). The coordinate transformation into the eigenspace of the prior (e.g., Devroye 1986; Kingma & Welling 2013; Rezende et al. 2014; Titsias & Lázaro-Gredilla 2014) enabled us to absorb all complex and unknown structures of the source space in the transformation T( · ). This transformation results in a unit Gaussian prior distribution. Using this definition, we rewrote the generative process as a parametrized function of the latent variables $\widetilde{\boldsymbol{d}}_i = G(\boldsymbol{s}_i) = G(T(\boldsymbol{z}_i)) =: f_{\boldsymbol{\theta}}(\boldsymbol{z}_i)$, with θ denoting the parameters of the transformed generative forward model. We did not model the generative process explicitly, but used a neural network to learn the function fθ from the data.

We further defined the model noise Nmodel = (n1, …, np) as an indexed set of p noise vectors ni ∈ Rk, and assumed the pixel-wise noise to be independent and identically distributed with a Gaussian distribution of zero mean and noise covariance N ∈ Rk × k, meaning $P(\boldsymbol{N}_{\mathrm{model}}) = \prod^{p}_{i = 1} P(\boldsymbol{n}_i) = \prod^{p}_{i = 1}\mathcal{G}(\boldsymbol{n}_i, N)$. This noise covariance induces a metric in data space, measuring the deviation between the two magnitude pixel vectors di and $\widetilde{\boldsymbol{d}}_i$. N thus indicates how accurately the data vector di can be reconstructed using the latent variables zi. Since this accuracy was unknown, we introduced N as an inference parameter. We again performed a transformation to the prior eigenspace of the form N = tψ(ξN) with the latent noise parameter ξN ∈ R, distributed as P(ξN) = 𝒢(ξN, 1), and with transformation parameters ψ. For our specific application, we chose the log-normal mapping

$$ \begin{aligned} N = t_{\boldsymbol{\psi}}(\xi_N) = {1}^{k \times k} \, \exp\,(\upmu_N + \sigma_N \, \xi_N), \end{aligned} $$(2)

with mean μN and standard deviation σN. In other words, this is a diagonal noise covariance matrix N with transformation parameters ψ = (μN, σN)T and a single global noise parameter ξN, which we aimed to learn.
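As a minimal illustration of Eq. (2), the transformation can be written as a one-line function; the default values below are only placeholders (the configuration in Fig. 2 uses μN = −7 and σN = 1), and the function name is our own:

```python
import torch

def noise_covariance(xi_N, mu_N=-7.0, sigma_N=1.0, k=35):
    """Log-normal transform t_psi of Eq. (2): diagonal covariance N = 1^{k x k} exp(mu_N + sigma_N * xi_N)."""
    # A single global latent parameter xi_N sets the common noise variance of all k data channels.
    return torch.exp(torch.as_tensor(mu_N + sigma_N * xi_N)) * torch.eye(k)
```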

Combining the definitions of the noise and forward model, we rewrote the data model from Eq. (1) as a pixel-wise expression:

$$ \begin{aligned} \boldsymbol{d}_i = f_{\boldsymbol{\theta}}(\boldsymbol{z}_i) + \boldsymbol{n}_i \, . \end{aligned} $$(3)

At this point, we defined the data model parameter vector Θ described in Sect. 2.2. It consists of the latent vectors zi, the latent noise parameter ξN, which indirectly defines the model noise ni, and the parameters of the generative model θ. The parameters θ are the parameters of the neural network used to approximate the generative process, and they are specified in further detail in Sect. 2.3. Since we did not have prior knowledge on these forward model parameters, we assumed a uniform prior distribution on θ (see Table 1). This resulted in Θ = (Z, θ, ξN) with the indexed latent space set Z = (z1, …, zp).

Table 1. Priors on model parameters.

2.2.2. Data likelihood

We included our forward model in the pixel-wise likelihood $P(\boldsymbol{D} \mid \boldsymbol{\Theta}) = \prod\nolimits_{i=1}^p \, P(\boldsymbol{d}_i \mid \boldsymbol{Z}, \boldsymbol{\theta}, \xi_N)$ by marginalizing over the model noise

$$ \begin{aligned} P(\boldsymbol{d}_i \mid \boldsymbol{Z}, \boldsymbol{\theta }, \xi _N)&= \int \mathrm{d}\boldsymbol{n}_i \; P(\boldsymbol{d}_i , \boldsymbol{n}_i\mid \boldsymbol{Z}, \boldsymbol{\theta }, \xi _N) \nonumber \\&=\int \mathrm{d}\boldsymbol{n}_i \, P(\boldsymbol{d}_i \mid \boldsymbol{z}_i, \boldsymbol{\theta }, \boldsymbol{n}_i) \, P(\boldsymbol{n}_i \mid \xi _N) \nonumber \\&= \int \mathrm{d}\boldsymbol{n}_i \, \delta (\boldsymbol{d}_i - f_{\boldsymbol{\theta }}(\boldsymbol{z}_i) - \boldsymbol{n}_i) \, \mathcal{G} (\boldsymbol{n}_i, t_{\boldsymbol{\psi }}(\xi _N)) \nonumber \\&= \mathcal{G} (\boldsymbol{d}_i - f_{\boldsymbol{\theta }}(\boldsymbol{z}_i), t_{\boldsymbol{\psi }}(\xi _N)) \, , \end{aligned} $$(4)

where we assumed the noise ni to be Gaussian distributed and a priori independent of Z and θ. A Gaussian distributed likelihood states that we want the squared deviation of all reconstructed brightness magnitudes compared to the original data to be small. A way to generalize this requirement in the case of more complicated and higher dimensional data distributions is a GAN-VAE (Larsen et al. 2016), where the likelihood is represented by a neural network as well.

2.2.3. Approximating the posterior distribution

Combining the prior distributions listed in Table 1 and the data likelihood in Eq. (4), the posterior distribution P(Θ ∣ D) reads:

$$ \begin{aligned} P(\boldsymbol{Z}, \boldsymbol{\theta }, \xi _N\mid \boldsymbol{D} )&\propto \prod _{i=1}^p P(\boldsymbol{d}_i \mid \boldsymbol{Z}, \boldsymbol{\theta }, \xi _N) \, \prod _{i=1}^p P(\boldsymbol{z}_i) \, \,P(\boldsymbol{\theta }) \, P(\xi _N) \nonumber \\&\propto \prod _{i=1}^p \left( \mathcal{G} (\boldsymbol{d}_i - f_{\boldsymbol{\theta }}(\boldsymbol{z}_i), t_{\boldsymbol{\psi }}(\xi _N)) \, \mathcal{G} (\boldsymbol{z}_i, {1}) \right) \, \times \nonumber \\&\quad \, \, \, \mathcal{G} (\xi _N, 1) \, , \end{aligned} $$(5)

where we used P(θ) = const. To overcome analytic limitations, we used variational inference (e.g., Blei et al. 2017) to approximate the posterior distribution P(Θ ∣  D) with an easier-to-calculate distribution QΦ(Θ ∣ D) with variational parameters Φ. We defined a suitable approximate distribution QΦ(Θ ∣ D) and used the Kullback-Leibler Divergence as a measure to evaluate the dissimilarity of P(Θ ∣  D) and QΦ(Θ ∣ D).

Assuming that Z, θ and ξN are a posteriori independent, the approximate distribution QΦ was written as

$$ \begin{aligned} Q_{\boldsymbol{\Phi }}(\boldsymbol{Z}, \boldsymbol{\theta }, \xi _N \mid \boldsymbol{D}) = Q_{\boldsymbol{\Phi }}(\boldsymbol{Z} \mid \boldsymbol{D}) \, Q_{\boldsymbol{\Phi }}( \boldsymbol{\theta }\mid \boldsymbol{D}) \, Q_{\boldsymbol{\Phi }}(\xi _N\mid \boldsymbol{D}) \, . \end{aligned} $$(6)

For $Q_{\boldsymbol{\Phi}}(\boldsymbol{Z} \mid \boldsymbol{D}) = \prod_{i=1}^{p} Q_{\boldsymbol{\Phi}}(\boldsymbol{z}_i \mid \boldsymbol{d}_i)$ we chose a pixel-wise independent Gaussian distribution 𝒢(zi − μi, Σi) based on the maximum entropy principle (Jaynes 1982). The principle states that if only the mean μ and covariance Σ of data are known, then the knowledge contained in that data set is best expressed by a Gaussian distribution with exactly this mean μ and covariance Σ (e.g., Enßlin 2019). Later, we discuss that by construction μi and Σi are functions of the input data, making the choice of a Gaussian distribution for QΦ(Z ∣ D) valid. For QΦ(θ ∣ D) and QΦ(ξN ∣ D) we chose maximum a posteriori solutions. In practice, we evaluated the approximation at the variational parameter values $\{\widehat{\boldsymbol{\theta}}, \widehat{\xi_N}\} \in \boldsymbol{\Phi}$, expressed by $Q_{\boldsymbol{\Phi}}(\boldsymbol{\theta} \mid \boldsymbol{D}) = \delta(\boldsymbol{\theta} - \widehat{\boldsymbol{\theta}})$ and $Q_{\boldsymbol{\Phi}}(\xi_N \mid \boldsymbol{D}) = \delta(\xi_N - \widehat{\xi_N})$.

To approximate P(Θ ∣  D) with QΦ(Θ ∣ D), we used the Kullback-Leibler Divergence

$$ \begin{aligned}&D_{\rm KL} [\, Q_{\boldsymbol{\Phi }}(\boldsymbol{Z}, \boldsymbol{\theta }, \xi _N \mid \boldsymbol{D}) \, \mid \mid \, P(\boldsymbol{Z}, \boldsymbol{\theta }, \xi _N \mid \boldsymbol{D}) \,] = \nonumber \\&\int \mathrm{d}\boldsymbol{Z} \, \mathrm{d}\boldsymbol{\theta } \, \mathrm{d} \xi _N \, Q_{\boldsymbol{\Phi }}(\boldsymbol{Z}, \boldsymbol{\theta }, \xi _N \mid \boldsymbol{D}) \, \ln \Bigg ( \frac{Q_{\boldsymbol{\Phi }}(\boldsymbol{Z}, \boldsymbol{\theta }, \xi _N \mid \boldsymbol{D})}{P(\boldsymbol{Z}, \boldsymbol{\theta }, \xi _N \mid \boldsymbol{D})}\Bigg ). \end{aligned} $$(7)

Inserting the derived expressions from the previous sections and using Monte Carlo methods to approximate integrals with finite sums results in

$$ \begin{aligned} D_{KL}[Q_{\boldsymbol{\Phi }}(\cdot ) \mid \mid P(\cdot )]&= \frac{1}{2} \sum _{i=1}^p \Bigg [ - \mathrm{tr} \,(\ln \Sigma _i) - l \left( 1+ \mathrm{ln} \,(2\pi ) \right) + \frac{1}{p} \, \widehat{ \xi _N}^2 \nonumber \\&\quad +\, \mathrm{tr} \, \left( \frac{1}{ t_{\boldsymbol{\psi }}(\widehat{ \xi _N})} \left(\boldsymbol{d}_i - f_{\widehat{\boldsymbol{\theta }}}(\boldsymbol{z}_i)\right) \left(\boldsymbol{d}_i - f_{\widehat{\boldsymbol{\theta }}}(\boldsymbol{z}_i)\right)^T \right) \nonumber \\&\quad + \, \mathrm{tr} \, \left( \Sigma _i + \boldsymbol{\upmu }_i \boldsymbol{\upmu }_i^T \right) + \mathrm{tr} \,\left(\ln \, t_{\boldsymbol{\psi }}(\widehat{ \xi _N})\right) \Bigg ] + \mathcal{H} _0, \end{aligned} $$(8)

where we absorbed all constant terms into ℋ0. The full calculations with all intermediate steps are presented in Appendix B.
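To illustrate how Eq. (8) could be evaluated in practice, the sketch below computes a batch estimate of the objective in PyTorch. It is our own reading of the equation, not the original implementation: the constant term −l(1 + ln 2π) is absorbed into ℋ0, Σi is assumed diagonal and represented by its log-variance vector (as described in Sect. 2.3), and averaging over the batch replaces the explicit sum over all p pixels.

```python
import torch

def neat_vae_loss(d, d_rec, mu, log_var, xi_N, mu_N=-7.0, sigma_N=1.0, n_pixels_total=1):
    """Batch estimate of the Kullback-Leibler objective in Eq. (8), up to the constant terms H_0."""
    noise_var = torch.exp(mu_N + sigma_N * xi_N)    # t_psi(xi_N) from Eq. (2), common to all channels
    k = d.shape[1]                                  # number of frequency maps (35)
    # tr( t_psi(xi_N)^-1 (d_i - f(z_i)) (d_i - f(z_i))^T )
    recon = ((d - d_rec) ** 2).sum(dim=1) / noise_var
    # tr(Sigma_i + mu_i mu_i^T) - tr(ln Sigma_i) for a diagonal Sigma_i
    latent = (torch.exp(log_var) + mu ** 2 - log_var).sum(dim=1)
    # tr(ln t_psi(xi_N)) + xi_N^2 / p
    noise_terms = k * (mu_N + sigma_N * xi_N) + xi_N ** 2 / n_pixels_total
    return 0.5 * (recon + latent + noise_terms).mean()
```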

2.3. NEAT-VAE

Our goal was to infer a lower-dimensional representation of the data, which in our case was expressed by the latent source space Z. We achieved this by minimizing Eq. (8), which describes the spurious amount of artificial information introduced by the approximation of P with QΦ. In the following, we explain how this objective function was translated into the framework of a latent variable model called variational autoencoder (Kingma & Welling 2013).

A basic autoencoder is constructed to compute a low-dimensional, latent representation of higher-dimensional input data. This is achieved by training the autoencoder to reconstruct the original input as accurately as possible from the reduced, latent representation (e.g., Rumelhart et al. 1985; Hinton & Salakhutdinov 2006; Goodfellow et al. 2016). In this context, training describes the minimization of an objective or loss function, for example the mean squared error between input data and the reconstructed output data. The dimensionality reduction of the input to the latent space occurs in the so-called encoder, the latent space itself is called bottleneck layer, and in the decoder, the latent space is translated back to data space. Variational autoencoders (VAEs) offer a probabilistic framework to jointly optimize latent variable (or generative) models and inference models (Kingma & Welling 2019). Equation (8) was transformed to such a VAE framework (see Fig. 1) as follows:

Fig. 1.

Model design of our algorithm (NEAT-VAE). (1) In the forward model, we assumed the 35 Galactic all-sky maps D to be generated by a smaller number of features Z and some additive noise N. The generative process fθ, which we approximated by the decoder neural network, was learned by the algorithm. (2) We calculated the joint posterior distribution of the features Z, the network parameters θ, and the noise parameter ξN using statistical priors and Bayes’ theorem. (3) We approximated the posterior distribution using variational inference, the maximum entropy principle, and inverse transform sampling. Batches of spatially independent pixel vectors di served as input data, where each vector contains the spectral information of the same pixel in 35 Galactic all-sky maps. The algorithm was constructed to infer a latent representation zi of the input data (encoder), and to regenerate its input di as accurately as possible from the latent space (decoder). The minimization objective, or loss function, guiding the algorithm’s learning process is Eq. (8).

– Generative Process: Neural networks are generalized function approximators. Since the exact form of the data generating process $f_{\boldsymbol{\theta}}(\boldsymbol{z}_i) = \widetilde{\boldsymbol{d}}_i$ was unknown, we used a neural network with input zi and output $\widetilde{\boldsymbol{d}}_i$ to approximate fθ, and we used back propagation to optimize the parameters θ of the network, which corresponded to the generative decoder.

– Variational Inference: The inference of the latent space variables was approximated by QΦ(zi  ∣ di) = 𝒢(zi − μi, Σi). We built a function delivering posterior samples following this variational distribution through a neural network with input di and output zi (see the sketch after this list). We let the pixel-wise mean μi ∈ Rl and covariance Σi ∈ Rl × l be determined by a parametrized function of the input data eϕ(di) = (μi, log Σi), where ϕ contains the parameters of the function eϕ and the matrix logarithm of Σi is calculated. This parametrization ensured the variance to be positive and the calculation to be numerically stable, since the logarithmic function maps the small values of the variance to a larger space. By inverse transform sampling, we defined the posterior latent space variables as $\boldsymbol{z}_i = \boldsymbol{\upmu}_i + \exp\left(\frac{1}{2} \log \Sigma_i\right) \cdot \boldsymbol{\epsilon}_i = \boldsymbol{\upmu}_i + \sqrt{\Sigma_i}\, \boldsymbol{\epsilon}_i$ with an auxiliary variable ϵi and P(ϵi) = 𝒢(ϵi, 1). In practice, we approximated the function eϕ(di) by the variational inference encoder. We took $\sqrt{\Sigma_i}$ to be diagonal and described it by its diagonal vector $\mathrm{diag}\left(\sqrt{\Sigma_i}\right)$, allowing us to calculate μi and $\mathrm{diag}\left(\sqrt{\Sigma_i}\right)$ as two distinct outputs of the encoder network.

– Independent representation: Based on the input data, the encoder network delivers latent space variables zi with mean μi and covariance vector $\widetilde{\Sigma}_i = \mathrm{diag}\left(\sqrt{\Sigma_i}\, \sqrt{\Sigma_i}^{\,\mathrm{T}}\right)$. Using this definition we found the optimal independent approximation to the posterior P(Θ ∣ D). This led to a disentanglement of the input information, meaning each dimension of zi, which is equivalent to a hidden neuron in the bottleneck layer, encoded (approximately) mutually independent features of the data.
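The following sketch shows how the pieces described above could be assembled into an encoder-bottleneck-decoder network in PyTorch. It is only an illustration under our own assumptions: the exact layer count, the ReLU activations, and all names are ours; Fig. 2 quotes 30 hidden neurons per layer and ten latent features for one configuration.

```python
import torch
import torch.nn as nn

class NEATVAE(nn.Module):
    """Illustrative encoder-bottleneck-decoder architecture for the NEAT-VAE (layer details assumed)."""

    def __init__(self, k=35, latent_dim=10, hidden=30):
        super().__init__()
        # Variational inference encoder e_phi(d_i) with two output heads: mu_i and log Sigma_i (diagonal).
        self.encoder = nn.Sequential(nn.Linear(k, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden), nn.ReLU())
        self.to_mu = nn.Linear(hidden, latent_dim)
        self.to_log_var = nn.Linear(hidden, latent_dim)
        # Generative decoder f_theta(z_i): maps the latent features back to the 35 magnitude values.
        self.decoder = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden), nn.ReLU(),
                                     nn.Linear(hidden, k))
        # Single global latent noise parameter xi_N, optimized alongside the network weights.
        self.xi_N = nn.Parameter(torch.zeros(1))

    def forward(self, d):
        h = self.encoder(d)
        mu, log_var = self.to_mu(h), self.to_log_var(h)
        # Inverse transform sampling: z_i = mu_i + sqrt(Sigma_i) * eps_i, eps_i ~ G(eps_i, 1).
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        return self.decoder(z), mu, log_var, z
```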

The minimization objective in Eq. (8) contains the loss function of a classic VAE with three modifications: (1) We included the size of the latent space l and its corresponding weight (see Appendix B for the derivation) to be able to compare different latent space sizes with each other, (2) in addition to the network weights, we aimed to optimize the noise covariance and thus the latent noise variable ξN. The prior on this latent noise added an extra term proportional to $\widehat{\xi_N}^{\,2}$ to the objective function, and (3) by including noise in the data model, the likelihood contains a factor $1/t_{\boldsymbol{\psi}}(\widehat{\xi_N})$ and contributed an additional term $\mathrm{tr}\left(\ln t_{\boldsymbol{\psi}}(\widehat{\xi_N})\right)$ from the normalization. Since we expanded the VAE framework by the adaptation to noise, we named our algorithm NEAT-VAE (NoisE AdapTing Variational AutoEncoder).

Using the transformations illustrated before, the final objective function only depends on the generative decoder parameters $\widehat{\boldsymbol{\theta}}$, the variational encoder parameters ϕ, and the latent noise parameter $\widehat{\xi_N}$. We implemented an autoencoder architecture (encoder network, bottleneck layer, decoder network) with Eq. (8) as the respective loss function in the PyTorch framework. The framework enabled us to calculate derivatives of Eq. (8) with respect to $\widehat{\boldsymbol{\theta}}$, ϕ, and $\widehat{\xi_N}$ using automated back propagation, and to minimize the loss with a built-in optimizer. The conducted experiments are described in Appendix C.
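Combining the network and loss sketches from above, a training loop might look as follows. The optimizer choice (Adam), the number of epochs, and the random placeholder data are our assumptions; the learning rates, batch size, and latent dimension follow the configuration quoted in Fig. 2.

```python
import torch

# Placeholder for the real (p, 35) array of magnitude pixel vectors from Sect. 2.1.
D_tensor = torch.randn(1000, 35)

model = NEATVAE(k=35, latent_dim=10, hidden=30)
optimizer = torch.optim.Adam([
    {"params": [p for n, p in model.named_parameters() if n != "xi_N"], "lr": 0.005},
    {"params": [model.xi_N], "lr": 0.001},   # separate learning rate for the noise parameter
])
loader = torch.utils.data.DataLoader(torch.utils.data.TensorDataset(D_tensor),
                                     batch_size=128, shuffle=True)

for epoch in range(50):
    for (d,) in loader:
        d_rec, mu, log_var, _ = model(d)
        loss = neat_vae_loss(d, d_rec, mu, log_var, model.xi_N, n_pixels_total=len(D_tensor))
        optimizer.zero_grad()
        loss.backward()      # automated back propagation of Eq. (8)
        optimizer.step()
```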

2.4. Evaluation metric

Reconstruction accuracy. To evaluate the ability of the NEAT-VAE to reconstruct its input, we defined the reconstruction accuracy as the mean squared error between input maps and reconstructed (output) maps,

$$ \begin{aligned} \mathrm{MSE} \,(\boldsymbol{d}, \widetilde{\boldsymbol{d}} ) = \frac{1}{p} \, \sum _{i=1}^p \, (\boldsymbol{d}_i - \widetilde{\boldsymbol{d}}_i )^2, \end{aligned} $$(9)

with the number of pixels p, the 35-dimensional data vectors di ∈ D, and the corresponding reconstructed vectors $\widetilde{\boldsymbol{d}}_i \in \widetilde{\boldsymbol{D}}$.

Significance of results. To investigate whether there is an order among the latent feature maps calculated by the NEAT-VAE with respect to information content, we evaluated the significance of each feature for the reconstruction of the input data by

$$ \begin{aligned} S_{\mathrm{feature} } = \frac{1}{p} \, \sum _{i=1}^p \, \frac{ \left( z_{\rm feature \, map , \, i} - \overline{z}_{\rm feature \, map } \right)^2}{ \sigma _{\mathrm{feature} , \, i}^2}, \end{aligned} $$(10)

where zfeature map, i is the intensity value at the ith pixel, $\overline{z}_{\mathrm{feature\,map}}$ is the mean intensity value of a single feature map, and $\sigma_{\mathrm{feature},\,i}^2$ denotes the posterior variance as calculated by the algorithm at the ith pixel. This means we calculated the feature map variance, which describes the fluctuations within the posterior mean map, weighted by the posterior feature variance and averaged over all p pixels. The significance expresses the ratio of the magnitude of fluctuations within a map compared to the uncertainties of the map. In this context, a high significance highlights the most relevant features for the reconstruction, while a feature significance below 1.0 corresponds to an insignificant feature, as the posterior uncertainty is larger than the fluctuations of the posterior mean map.
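In code, this significance can be computed directly from the posterior mean and variance maps of one feature; the sketch below is our own (array names assumed, restricted to the p observed pixels):

```python
import numpy as np

def feature_significance(mean_map, var_map):
    """Eq. (10): fluctuations of the posterior mean map weighted by the pixel-wise posterior variance."""
    # mean_map, var_map: 1D arrays over the p observed pixels of a single feature.
    return np.mean((mean_map - mean_map.mean()) ** 2 / var_map)
```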

Encoded Information. We investigated which data information a feature map encodes by considering the generative process of the NEAT-VAE and using the back propagation algorithm: Since the number of parameters in the latent space is much smaller than in the data space, the autoencoder is forced to extract the essential information of the input data in order to generate it again. The generative process therefore offers, at least approximately, an explanation for the information flow within the autoencoder, and we visualized to what extent specific features are generating the individual Galactic all-sky maps using sensitivity analysis (e.g., Zurada et al. 1994). Here, we computed the gradients of the reconstructed output maps with respect to the feature maps using the back propagation algorithm. The values of the resulting decoder Jacobian are displayed in HEALPix pixelization in Appendix E.
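In PyTorch, such gradients can be obtained with automatic differentiation; a minimal sketch for a single pixel, using the decoder of the NEATVAE sketch above (the function name is ours), might be:

```python
import torch

def decoder_jacobian(model, z):
    """Gradients of the 35 reconstructed magnitudes with respect to the latent features of one pixel."""
    # Returns a (35, latent_dim) matrix; evaluating it for every pixel gives the maps of Appendix E.
    return torch.autograd.functional.jacobian(model.decoder, z)
```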

We used two other measures to quantify correlations between all-sky maps. One is the Pearson correlation coefficient, for our discrete case defined by

$$ \begin{aligned} \rho (X;Y) = \frac{\mathrm{cov} (X, Y)}{\mathrm{std} (X) \, \mathrm{std} (Y)} \, , \end{aligned} $$(11)

where cov(X, Y) is the covariance of two maps X and Y, and std(⋅) denotes the standard deviation of the respective maps. A value of −1.0 corresponds to negative correlation, 0.0 means the values are not correlated, and 1.0 corresponds to positive correlation. The Pearson correlation coefficient only accounts for linear relations between two maps. To account for nonlinear relations as well, we used the mutual information

$$ \begin{aligned} I(X;Y) = \sum _{x\in X} \sum _{{ y} \in Y} \mathcal{P} (x,{ y}) \, \mathrm{ln} \left(\frac{\mathcal{P} (x,{ y})}{\mathcal{P} (x) \, \mathcal{P} ({ y})}\right), \end{aligned} $$(12)

which can be calculated from two-dimensional histograms of feature maps (X) and output maps (Y). For a given number of bins, we represented the joint probability distribution 𝒫(x, y) in an (x, y)-shaped matrix by counts per bin, while the marginalized distributions 𝒫(x) and 𝒫(y) were obtained by summing over the respective y and x-axes of this matrix. In our interpretations, we used all above measures to evaluate the encoded information in the latent space (mutual information of feature and input maps), the generative process from the latent space (decoder Jacobian maps), and the linear correlations between maps (correlation plots and Pearson coefficient) in order to determine the physical content of the resulting features.
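Both measures can be estimated from two flattened maps with a few lines of NumPy. The sketch below is our own; the 512-bin histogram follows the choice stated in the text, and the function names are assumptions, while the formulas mirror Eqs. (11) and (12):

```python
import numpy as np

def pearson(x, y):
    """Eq. (11): linear correlation of two maps given as flat arrays over the same pixels."""
    return np.mean((x - x.mean()) * (y - y.mean())) / (np.std(x) * np.std(y))

def mutual_information(x, y, bins=512):
    """Eq. (12): mutual information estimated from the 2D histogram of two maps."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = counts / counts.sum()                  # joint distribution P(x, y)
    p_x = p_xy.sum(axis=1, keepdims=True)         # marginal P(x)
    p_y = p_xy.sum(axis=0, keepdims=True)         # marginal P(y)
    nonzero = p_xy > 0                            # skip empty bins to avoid log(0)
    return np.sum(p_xy[nonzero] * np.log(p_xy[nonzero] / (p_x * p_y)[nonzero]))
```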

3. Results and discussion

We applied the NEAT-VAE to the described 35 Galactic all-sky maps with equal resolution. We trained the algorithm on magnitude pixel vectors di independently; we specifically did not include spatial correlations. Each pixel vector contains the spectral information of the same pixel in all 35 Galactic all-sky maps. After training, the NEAT-VAE yields a posterior probability distribution of the reduced spectral information of the input data in its latent space. In the following, we show all-sky representations of the latent space variables zi ∈ Z, where the dimension of the latent space corresponds to the number of hidden neurons in the bottleneck layer and was varied in different experiments. We call these hidden neurons “features” and the resulting full-sky representations “feature maps”.

3.1. Dimensionality reduction

We first analyzed how the dimension of the latent space in the NEAT-VAE, that is the number of features, correlates with the reconstruction quality of the generative process (Fig. 2). Using Eq. (9), we observe that only three features are required to achieve an MSE below 0.1, which describes an average deviation of the reconstructed values compared to the input values of a natural logarithm-based flux magnitude. For small values, the absolute uncertainty on logarithmic scale equals the relative uncertainty on linear scale, meaning MSE = 0.1 corresponds to a relative uncertainty of ≈10%. The MSE decreases for a further growing number of features, and stagnates around a value of 0.02, corresponding to a relative uncertainty of ≈2% at approximately ten features. We interpret this as an indication for a high redundancy of the information contained in the 35 Galactic all-sky maps, since increasing the number of latent features does not increase the quality of the reconstruction any further.
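The equivalence between an absolute deviation in the natural-log magnitude and a relative deviation on linear scale follows from a first-order expansion of the logarithm:

$$ \begin{aligned} \Delta(\ln F) = \ln(F + \Delta F) - \ln F = \frac{\Delta F}{F} + \mathcal{O}\!\left[\left(\frac{\Delta F}{F}\right)^{2}\right], \end{aligned} $$

so a deviation of 0.1 in the log-magnitude corresponds to a relative flux deviation of roughly 10%.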

Fig. 2.

Reconstruction mean squared error (MSE) and values of the minimized Kullback-Leibler Divergence (Loss) ΔDKL (DKL in Eq. (8) except constant terms) depending on the dimension of the latent space (also called number of features, x-axis). The values were determined using the NEAT-VAE with a configuration of six layers, 30 hidden neurons in the encoder and decoder layers, noise transformation parameters μN = −7 and σN = 1, learning rates of 0.005 for the network weights and 0.001 for ξN, and a batch size of 128. We did not track all normalization constants through the calculations, which leads to negative values for the loss.

3.2. Morphology of features

Based on the experiments with different latent space dimensions, we examined the spatial structures in the feature maps in more detail. We recognize spatially correlated structures of the input data in the feature maps. We assume this to be a meaningful result, since the autoencoder only computes spectral correlations among the magnitude flux values within one pixel vector di, and is not informed about spatial structures. We also observe feature maps with the same morphology to occur in latent spaces of different dimensions. Based on their average significance over all experiments we conducted (see Appendix C), we call the three most significant features A, B, and C.

For the configuration shown in Fig. 3, the features have significance values (Eq. (10)) SfeatureA = 1.67 × 10³, SfeatureB = 1.41 × 10², and SfeatureC = 4.68 × 10¹. The remaining features have significance values ranging from 8.75 to 3.82 × 10¹ and encode artifacts and other morphologies of the input data, as displayed in Appendix D. A similar behavior was also observed by Müller et al. (2018): The GMM components used in their study also encode artifacts of the input data when the number of components is increased. However, posterior samples of the latent space of our algorithm show that the NEAT-VAE does not always assign information to each and every feature: Starting from twelve features and adding further neurons to the latent space, the significance of the added features drops below 1.0. This means the posterior variance of a feature map is greater than the fluctuations within the map, and the resulting posterior samples show white noise statistics. On average, ten to eleven features are significant throughout our experiments with varying latent space size. We assume that from this point on, our algorithm has identified all mutually independent features of the input data. When further dimensions are added to the latent space, the algorithm “tunes out” those degrees of freedom by making them insignificant.

Fig. 3.

Main results of the NEAT-VAE: (panel a) feature A, (panel b) feature B and (panel c) feature C are the most significant representations of the input data. Left panels: posterior mean, right panels: posterior variance. The NEAT-VAE was trained to reduce the spectral information of 35 Galactic all-sky maps to ten features in the latent space with the configuration described in Fig. 2. We show the three most important feature maps in descending order of significance according to Eq. (10). The colors in the posterior mean of feature C (left panel in (c)) were inverted for illustration purposes. The gray pixels correspond to missing values in the input data.

In the next sections, we focus on the three most significant features of the configuration with ten latent space features (since from this point on the reconstruction does not change significantly, see Fig. 2), and their physical interpretation. Visual inspections of the other features (Appendix D) indicate that the separation into independent components is not fully reached, possibly due to the finite amount of optimization. For example, morphological structures similar to the North Galactic Spur, the Fermi bubbles, and measurement artifacts imprint onto several features simultaneously.

The posterior mean values in Fig. 3 reflect the optimally compressed representation of the input data as calculated by our algorithm. We observe that the overall sign of the maps changes for different hyperparameter configurations. We do not observe the change of the sign to follow a specific rule or to depend on the sign of other features from the same hyperparameter configuration. Hence, we assume the overall sign not to have any physical interpretation and only to depend on the initialization of the network parameters. We also observe that the relation of positive and negative values within single maps, that is, the overall feature map structure, is constant throughout different choices of hyperparameters. Since these signs do not change the morphology of the map, we chose the sign of the color code in the map display of Fig. 3 according to the input map which it most closely resembles. This corresponds to positive values for stronger emitting regions and negative values for weaker emitting regions.

3.3. Identifying the information encoded by feature A

Feature A, which is the most significant feature in 98% of the examined hyperparameter configurations (see Appendix C), is displayed in Fig. 3a. From a visual analysis, we recognize a positive color-coded Galactic plane and negative color-coded Galactic poles in the posterior mean. In the eastern part of the Galactic plane, we see a bright, circular structure in the Cygnus region. Further in the east, south of the Galactic plane, structures similar to the Perseus region occur. The circularly shaped structure north of the Galactic center resembles the Ophiuchus region, and southwest of the Galactic center, structures similar to the Small and Large Magellanic Clouds can be recognized. In the western part of the Galactic plane, the bright structures match the Orion region. The posterior variance shows a high certainty in the region of the Galactic plane and the structures of the surrounding latitudes, while at the southern and northern Galactic poles, the uncertainty increases.

Interpretation. We find compelling evidence that feature A traces the dense and dusty parts of the ISM: Based on calculations of the mutual information with 512 bins, the top three data sets contributing to feature A are the AKARI far-infrared 140 μm input data with mutual information I = 1.91, the IRIS infrared 100 μm input data with I = 1.84, and the AKARI far-infrared 160 μm input data, also with I = 1.84. This frequency band of the interstellar radiation field is dominated by the infrared emission of dust (e.g., Draine 2011, p.121). Feature A shares the least mutual information with the ROSAT X-ray 1.545 keV input data with I = 0.20. Next, we analyzed the decoder Jacobian maps (see Appendix E), which display the gradients of the reconstructed Galactic all-sky maps with respect to latent space features in HEALPix format. We observe the following: The lower and mid-latitudes of the γ-ray regime strongly depend on feature A, and going from higher to lower energies, the overall dependence on feature A grows. In particular, the low-energy regime of the Fermi data is dominated by hadronic interactions of cosmic rays with the interstellar medium (ISM) and thus shows the gas distribution (Selig et al. 2015). In the X-ray regime, little to no dependence is observed; however, the mid- and high latitudes of the soft X-ray data (0.212 keV and 0.197 keV) show negative gradients toward feature A. Here, negative gradients highlight an inverse-proportional relationship of feature A and the reconstructed X-ray maps, which is plausible in the physical context: X-ray radiation is absorbed by cold interstellar gas (Ferriere 2001); thus the absence of X-ray emission reveals regions where interstellar matter is present. The Hα map at 656.3 nm, which displays emission due to hydrogen transitions occurring from the second excited state n = 3 to the first excited state n = 2 (e.g., Finkbeiner 2003), positively depends on feature A. This corresponds to neutral hydrogen preferably residing in the dense ISM, as traced by feature A. We observe a very strong dependence of the Galactic planes of the 12 and 25 μm infrared data on feature A, while with further decreasing energies, the infrared data down to 545 GHz show an overall, positive dependence. In this regime, emission from thermal dust is observed (Klessen & Glover 2014; Planck Collaboration I 2020). With further decreasing energy, the Planck microwave data depend strongly on feature A, in particular the Galactic plane. The 21 cm line emission shows an overall dependence on feature A; it describes the total neutral atomic hydrogen column density, since the emission occurs due to the transition between two levels of the ground state of atomic hydrogen (Ewen & Purcell 1951). In the synchrotron radio regime at 1420 MHz and 408 MHz, both feature A and feature B play an important role in determining the emission structures, which we address in Sect. 3.6.

The positive dependencies of sky maps on feature A describe areas of our Galaxy with a high density of interstellar matter, while the negative gradients correlate with the extinction of emissions by interstellar matter. We assume positive gradients highlight pixels in the latent space used to generate the corresponding pixels in data space, while negative gradients mark an anti-correlation of feature and data pixels. On the basis of this specific combination of gradients and the mutual information, we infer that feature A encodes dense regions of the ISM.

Discussion. We can test our hypothesis by investigating the correlation of feature A with dust as a tracer for the ISM (Kennicutt & Evans 2012). Figure 4b shows the relationship of feature A and the thermal dust emission (Fig. 4a) as calculated by the Planck COMMANDER code (Eriksen et al. 2008; Planck Collaboration X 2016). The posterior mean of feature A has a strong, positive, and linear correlation (ρ = 0.98) with the thermal dust component. Both the COMMANDER and the NEAT-VAE algorithm perform a Bayesian, pixel-wise analysis based on a data model containing the linear sum of a signal function and noise, but with two main differences: First, we only employed statistical information in our forward model, while the COMMANDER code included detailed physical templates for the various emission processes contributing to the radio to far-infrared sky, as well as calibration and correction factors, and a prior for the CMB. Second, the COMMANDER algorithm has 11 million free parameters to tune (Planck Collaboration X 2016), while our model has only about 5000. The algorithms are not directly comparable, since the COMMANDER code was especially developed to separate the Galactic foregrounds in order to reconstruct the CMB, while the NEAT-VAE only seeks to find an informative representation of the input data. But in this context, the NEAT-VAE summarized the input data into categories, one of which already contains dust emission. With this result, we assume it is possible to derive the dust component based on feature A with little computational effort.

Fig. 4.

Correlation of feature A with published astrophysical components tracing interstellar matter. From top left to bottom right: (panel a) thermal dust component in logarithmic scaling (Planck Collaboration X 2016), (panel b) 2D histogram of the posterior mean intensity of feature A and the thermal dust component (mutual information I = 1.72, Pearson correlation coefficient ρ = 0.98), (panel c) hadronic component of the γ-ray spectrum in logarithmic scaling (Selig et al. 2015) with white pixels denoting missing values, (panel d) 2D histogram of the posterior mean intensity of feature A and the hadronic component (I = 1.22, ρ = 0.90). For the 2D histograms, the intensity ranges of the maps were divided into 256 equal bins and the number of intensity pairs per bin was displayed as counts on logarithmic scaling. Bright colors denote a high number of counts. Mutual information was calculated as described in Eq. (12) with 512 bins.

We investigate another tracer for interstellar matter, namely the hadronic γ-ray component derived by Selig et al. (2015), shown in Fig. 4c. It represents the γ-ray emission due to the interaction of cosmic ray protons with interstellar matter and it is positively correlated (ρ = 0.90) with feature A, see Fig. 4d. This correlation is reasonable, since Selig et al. (2015) composed the hadronic component of the low-energy γ-ray maps, on which feature A also depends. These results meet our initial aim of finding a reduced representation of the input data that combines redundant information in one feature, in this case tracers for dense regions in the ISM.

The high significance of feature A is likely to result from the choice of input data, since most of the maps in our data set D represent emission from the dense interstellar matter. We did not include the CO line emission data, but from the positive gradients of the Planck 100–353 GHz channels with respect to feature A, which contain the information of the CO line emission (Planck Collaboration X 2016), we assume that the Galactic plane of feature A most likely would generate the CO data. This would support our interpretation of feature A, since CO is a tracer for molecular interstellar gas (Scoville et al. 1987), and thus, for regions of high density.

3.4. Identifying the information encoded by feature B

Feature B, the second most significant feature in 98% of our experiments, is displayed in Fig. 3b. We see a negative color-coded equator which resembles the Galactic plane, with positive color-coded bulges north and south of the plane. In particular, the northern bulge structure resembles the morphology of the North Polar Spur. The positive color-coded, circular structure in the western part of the Galactic plane lies in the Vela region, whereas in the east, the location of the Cygnus region appears negative color-coded with a positive, ring-shaped structure surrounding it. The uncertainty of this map appears to be low at all latitudes, but is an order of magnitude higher than in the variance map of feature A.

Interpretation. Feature B shows the highest mutual information with the X-ray input data of ROSAT at 0.885 keV (I = 0.82), 0.725 keV (I = 0.78) and 1.145 keV (I = 0.67), and the lowest mutual information with the Planck 70 GHz data (I = 0.18). In general, X-ray emission is assumed to arise from hot, ionized gas in regions of the ISM with low density (e.g., Ferriere 2001). Considering the gradient maps in Appendix E, there is little to no dependence of the reconstructed γ-ray data on feature B. Starting from X-ray data at 1.545 keV, the positive gradients get stronger with decreasing energy, and especially the bulge-shaped area of the reconstructed X-ray data strongly depends on feature B. The soft X-ray regime around 0.25 keV, which coincides with hydrogen cavities (e.g., Sanders et al. 1977; Snowden et al. 1990), also shows a weak, positive dependence on feature B. The infrared and microwave regimes again show little to no dependence; we do, however, observe small negative gradients, whereas in the corresponding gradient maps of feature A, strong positive gradients occur. The reconstructed 1420 MHz and 408 MHz radio maps show positive dependence on feature B, especially in the bulge-shaped areas. This data set detects the synchrotron radiation generated by the interaction of cosmic ray electrons with magnetic fields in the ISM (Ginzburg & Syrovatskii 1965), and the intensity of this radiation depends on the density of relativistic particles and the magnetic field strength. The former is predominantly found in the hot areas of the ISM, if only for the much larger volume occupation of this phase within the Milky Way (Cox 2005). Thus, we assume feature B encodes the enhanced presence of such cosmic ray electrons. With the exception of the low-energy X-ray and radio synchrotron regimes, we observe the data predominantly generated by feature A to have little to no dependence on feature B, which we interpret as feature B being complementary to dense regions of the ISM. In combination with the indications from the mutual information with the X-ray input data and positive gradients, we assume feature B encodes tracers for dilute and hot regions of the ISM.

Discussion. Figure 5, showing the correlation between features A and B, supports our interpretation: The features, in our case the dense and dilute regions of the ISM, are basically uncorrelated with a Pearson correlation coefficient of ρ = 0.03. This relationship is reasonable since the features describe two distinct categories of the ISM: The data generated by feature A are based on emissions from cold and warm ionized gas (as observed in the 21 cm and Hα line emissions), dust, and interactions with interstellar matter (cosmic rays). The ionization of the warm gas occurs mainly due to photo-ionization by O and B stars, the hottest and most massive stars of the Milky Way. Due to their ratio of mass and luminosity, these stars have a short main sequence life cycle and are thus found near their initial birthplaces, the dense ISM (e.g., Blome et al. 1997, p.60). Feature B, however, encodes data which represent radiation of an even hotter medium, the hot ionized plasma, which is generated by supernova explosions (e.g., Kahn 1980). The radiative processes encoded by features A and B can thus be traced back to two fundamentally different origins, namely emission from interstellar matter and from stellar explosions.

Fig. 5.

Correlation of feature B and feature A with mutual information I = 0.33 and Pearson correlation coefficient ρ = 0.03. The 2D histogram was computed as described in Fig. 4.
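The correlation measures quoted here and in Fig. 4 can be reproduced from two pixel arrays alone. The following Python sketch estimates the Pearson correlation coefficient and a histogram-based mutual information between two feature maps; the function name, the stand-in random data, and the use of natural logarithms are our own illustrative assumptions, with only the 512-bin histogram taken from the description of Fig. 4.

# Hedged sketch: Pearson correlation and histogram-based mutual information
# between two feature maps (cf. Figs. 4-6). Stand-in data for illustration only.
import numpy as np

def mutual_information(x, y, bins=512):
    """Estimate I(X;Y) from a joint histogram of two pixel arrays."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()                       # joint probabilities
    px = pxy.sum(axis=1, keepdims=True)         # marginal of x
    py = pxy.sum(axis=0, keepdims=True)         # marginal of y
    nonzero = pxy > 0
    return float(np.sum(pxy[nonzero] * np.log(pxy[nonzero] / (px @ py)[nonzero])))

# feature_a, feature_b: posterior-mean pixel values of two latent feature maps
rng = np.random.default_rng(123)
feature_a = rng.normal(size=100_000)            # stand-in values for illustration
feature_b = rng.normal(size=100_000)

rho = np.corrcoef(feature_a, feature_b)[0, 1]   # Pearson correlation coefficient
mi = mutual_information(feature_a, feature_b)   # mutual information estimate
print(f"rho = {rho:.2f}, I = {mi:.2f}")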

One regime of the electromagnetic spectrum missing in our input data is the ultraviolet (UV) frequency band. In this regime, the emission of very hot gas that is ionized by collisions can be observed, but UV radiation does not penetrate dense regions of the ISM and is thus mostly absorbed at low and mid latitudes (Ferriere 2001). Due to this extinction, it is likely that UV radiation would be interpreted by the NEAT-VAE as redundant with the soft X-ray data, which show similar dust absorption patterns. This would support our interpretation that feature B encodes the hot and dilute ISM.

3.5. Identifying the information encoded by feature C

Feature C, the third most significant feature in 74% of the investigated configurations, is displayed in Fig. 3c. Here, we inverted the colors of the posterior mean to resemble the color-coding of the CMB (see Fig. 6a). The regions of the Galactic plane and Cygnus have strong, positive values, while most of the other structures fluctuate around zero. The uncertainty is highest in the Galactic plane, especially toward the Galactic center. There are bulge-like shapes in the variance map north and south of the Galactic center denoting a medium level of uncertainty. Toward higher latitudes, the uncertainty is very low.

Fig. 6.

Panel a: CMB as derived by the Planck Collaboration IV (2020), panels b–d: correlations of the posterior mean of feature C with (panel b) the CMB, mutual information I = 0.88 and Pearson correlation coefficient ρ = 0.70, (panel c) feature A, I = 0.29 and ρ = 0.01, and (panel d) feature B, I = 0.20 and ρ = 0.01. The histograms and mutual information were compiled as described in Fig. 4. Bright colors in the histograms denote a high number of counts.

Interpretation. The mutual information of feature C is highest with the Planck data at 70, 100, and 143 GHz, with values of I = 1.01, 0.95, and 0.82, respectively, and lowest with the ROSAT X-ray data at 1.545 keV with I = 0.12. According to the decoder Jacobian maps in Appendix E, feature C mostly generates the Planck 30−217 GHz channels, especially the 70 and 100 GHz data, in which the CMB can be observed (Planck Collaboration I 2020). All other reconstructed maps show little to no dependence on feature C, with the exception of a weak, positive dependence of the soft ROSAT X-ray data, which we address in Sect. 3.6. Based on the mutual information and the strong positive gradients in the microwave regime around 100 GHz, we assume feature C encodes the CMB.

Discussion. Considering our physical interpretation of features A and B, it is reasonable that the CMB emission is encoded in a separate feature, since the underlying physical processes in the early Universe shaping the CMB fluctuations are independent of the processes in the ISM. To test our hypothesis, we compare feature C to the CMB all-sky map derived by the Planck COMMANDER code in Fig. 6a. Feature C shows a positive, linear correlation (ρ = 0.70) with the CMB (Fig. 6b). A main difference between feature C and the CMB can be seen in the Galactic plane, which feature C encodes (in addition to the Cygnus region) with high intensities, but also with high uncertainty (see Fig. 3c). We assume the high intensities in the Galactic plane of feature C originate from the model’s architecture: since we do not take spatial correlations into account, the algorithm is constructed to learn the spectral relations of each pixel vector independently. The Galactic planes of the input maps are dominated by high intensity values, while at other latitudes, more low-intensity values are present. We assume this generates two distinct clusters of intensity ranges, leading to two different spectral relations learned by the algorithm. We can also observe this distinction in intensity by analyzing the contributions of pixels to the data generation (see Appendix E): the structure of the Galactic plane is recognizable in most of the decoder Jacobian maps, even though the algorithm has no information about spatial correlations. This again indicates that the relations detected in different intensity regimes represent different processes within the NEAT-VAE algorithm, which are attributable to internal computations rather than physical properties.

3.6. Nonphysical interpretation of the NEAT-VAE

We finally examine the decoder Jacobian maps that do not have a clear physical interpretation. For example, we observe feature A (the dense ISM) contributing to the 1420 and 408 MHz radio data, and feature C (the CMB) supposedly generating the soft X-ray data.

The NEAT-VAE algorithm recombines all features of the latent space in a highly nonlinear way to generate the output maps, meaning the gradients in Appendix E cannot always be considered independently or in a linear way. Especially when the absolute gradient values of the reconstructed maps are large for more than one feature, a holistic analysis is required. In most of the cases displayed in Appendix E, there is one feature that dominantly generates the reconstruction of the input data (visible as one gradient map (red and blue colors) per row showing stronger intensities than the others). One exception, however, is the radio data, where we observe positive gradients with respect to both features A and B. A speculative physical explanation is that feature A marks regions of enhanced magnetic field strength and feature B regions of enhanced relativistic electron density, two quantities that in combination determine the synchrotron emission. However, we rather assume that these two features have no physical meaning for synchrotron emission, but that the internal computations of our algorithm run most efficiently when the latent values of features A and B are used for radio data generation. One always has to consider the fact that our algorithm has no information about physical relations and primarily optimizes a statistical function. Another example of such behavior can be found in the soft X-ray data, where we observe strong negative gradients with respect to feature A and weak positive gradients with respect to both features B and C. This mixing of features in a partially contrasting manner indicates the nonlinearity of the function mapping the latent space values to the reconstructed output data: we hypothesize that some values might be used by the algorithm just to tune others out, since the decoder has to regenerate the data based on all available features. This discussion shows that the decoder Jacobians give a valid first overview of the strongest correlations, but have to be interpreted with care. Not every contribution to the gradient represents a physical causality, especially since all gradients are entangled and cannot be considered independently.
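To make the role of the decoder Jacobians concrete, the following sketch shows how such gradient maps can be obtained by automatic differentiation. The decoder defined here is a stand-in multilayer perceptron with assumed layer sizes, not the trained NEAT-VAE decoder; only the numbers of channels (35) and latent features (10) follow the text.

# Hedged sketch: per-pixel decoder Jacobians, i.e., gradients of each reconstructed
# frequency channel with respect to the latent features (as visualized in Appendix E).
import torch

k, l = 35, 10                                    # channels, latent features
decoder = torch.nn.Sequential(                   # placeholder for f_theta
    torch.nn.Linear(l, 30), torch.nn.ReLU(),
    torch.nn.Linear(30, k),
)

z = torch.randn(4, l, requires_grad=True)        # latent vectors of a few pixels
recon = decoder(z)                               # reconstructed pixel spectra, shape (4, k)

# Jacobian d(recon_channel)/d(z_feature) for every pixel: shape (4, k, l)
jac = torch.stack([
    torch.autograd.grad(recon[:, c].sum(), z, retain_graph=True)[0]
    for c in range(k)
], dim=1)

# jac[:, c, f], collected over all sky pixels, gives the HEALPix gradient map of
# channel c with respect to feature f (red = positive, blue = negative values).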

4. Conclusions

In summary, we derived a representation learning algorithm (NEAT-VAE) for spectral component separation on a pixel basis. The architecture of the NEAT-VAE was based on variational autoencoders and was constructed to learn a compressed representation of its input data. The learning process is guided by a minimization objective, which we designed to describe a component separation process following three steps: (1) We allowed the input data to be generated by a smaller set of features and some additive noise, where the generative process was automatically learned by our algorithm. (2) We calculated the posterior distribution of these features using statistical priors and Bayes’ theorem, and (3) approximated the posterior distribution using variational inference, the maximum entropy principle, and inverse transform sampling.

We ran our experiments on 35 Galactic all-sky maps ranging from radio to γ-radiation with a spatial resolution of Nside = 128. Our algorithm computed a compressed representation of the input data in the spectral direction, without knowledge of physical or spatial correlations. After training, the calculated latent space pixels were merged into all-sky maps. The most significant representation of the spectral information in our input data contains three separated astrophysical morphologies, namely emission patterns of the dense ISM, a superposition of the hot and dilute ISM, and the CMB. The association with physical quantities was verified in three ways, by (1) correlation plots, (2) calculations of the Pearson correlation coefficient, and (3) the mutual information with the dust and CMB component maps of the Planck Collaboration. Since the NEAT-VAE processes pixel vectors independently, without knowledge of their spatial correlations, we conclude that the morphologies of the dense and dilute ISM, as well as the CMB, are sufficiently defined by their pixel-wise spectral correlations.

The minimization objective of the NEAT-VAE and the associated hyperparameter tuning mainly determine the performance of the algorithm. Since the mathematical derivation was performed in a Bayesian framework, the initial signal component separation problem was described in a probabilistic manner. This enabled us to track uncertainties and to evaluate the significance of each feature map. Hyperparameter tuning was reduced by including parameters in the Bayesian inference process, such as, in our case, the noise covariance of the model noise.

Based on our results, we conclude that spectrally independent morphologies can be disentangled by representation learning algorithms; however, a more detailed separation into single astrophysical components requires spatial correlations and physical knowledge. We propose that our algorithm could be used as a preprocessing step in order to perform a component separation analysis more efficiently. For example, the task of deriving the thermal dust emission is computationally very expensive. If only the data necessary for the separation of the dust component were used, such as our feature map encoding the dense regions of the ISM, the separation process might be faster and more efficient than analyzing a larger data set with redundant and entangled information.

This work shows that Bayesian signal inference can be performed with neural networks in an unsupervised manner, and that variational autoencoders are a suitable architecture for problems consisting of generative modeling and inference tasks. It also shows that purely statistical approaches are not sufficient to replace physical analyses, but when customized to the specific problem, for example by incorporating statistical independence, they reveal meaningful relations in astrophysical data.


1

A single pixel vector reads $ {\boldsymbol{d}}_i = \left( d_i^{\,(1)}, d_i^{\,(2)}, \dots, d_i^{\,(k)}\right)^{\mathrm{T}} $, where $ d_i^{\,(\lambda)} $ is the magnitude flux value of the λth sky map at pixel i.

2

We consider the spurious amount of artificial information introduced by Q when we calculate DKL[ Q(⋅)  ∣ ∣ P(⋅) ], while DKL[ P(⋅)  ∣ ∣ Q(⋅) ] expresses the loss of information when using Q instead of P (Leike & Enßlin 2017). The latter quantity is what one ideally would like to base approximate (conservative) inference on, but unfortunately it cannot be calculated with an intractable posterior P. For this reason, we used the variational inference approach and minimize the former KL-Divergence DKL[ Q(⋅)  ∣ ∣ P(⋅) ], that is minimizing the amount of spurious information added when going from P to Q.
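As a small numerical illustration of this asymmetry, the following sketch compares the two divergence directions for a broad and a narrow one-dimensional Gaussian; the chosen widths are arbitrary and serve only to show that DKL[Q ∣∣ P] and DKL[P ∣∣ Q] generally differ.

# Hedged sketch: numerical illustration of the asymmetry between the two KL directions.
import numpy as np
from scipy.stats import norm

x = np.linspace(-10, 10, 20001)
dx = x[1] - x[0]
p = norm.pdf(x, loc=0.0, scale=3.0)      # broad "posterior" P (illustrative)
q = norm.pdf(x, loc=0.0, scale=1.0)      # narrow approximation Q (illustrative)

kl_qp = np.sum(q * np.log(q / p)) * dx   # spurious information added by Q
kl_pq = np.sum(p * np.log(p / q)) * dx   # information lost when using Q instead of P
print(kl_qp, kl_pq)                      # the two values differ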

Acknowledgments

We acknowledge Jakob Knollmüller for helpful discussions and thank an anonymous reviewer for useful comments that improved the clarity and quality of the manuscript.

References

  1. Ackermann, M., Ajello, M., Albert, A., et al. 2012, ApJS, 203, 4 [NASA ADS] [CrossRef] [Google Scholar]
  2. Atwood, W. B., Abdo, A. A., Ackermann, M., et al. 2009, ApJ, 697, 1071 [NASA ADS] [CrossRef] [Google Scholar]
  3. Beaumont, C. N., Williams, J. P., & Goodman, A. A. 2011, ApJ, 741, 14 [NASA ADS] [CrossRef] [Google Scholar]
  4. Bengio, Y., Courville, A., & Vincent, P. 2013, IEEE Trans. Pattern Anal. Mach. Intell., 35, 1798 [CrossRef] [Google Scholar]
  5. Bianchi, L., Shiao, B., & Thilker, D. 2017, ApJS, 230, 24 [NASA ADS] [CrossRef] [Google Scholar]
  6. Bishop, C. M. 2006, Pattern Recognition and Machine Learning (Springer) [Google Scholar]
  7. Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. 2017, J. Am. Stat. Assoc., 112, 859 [CrossRef] [Google Scholar]
  8. Blome, H. J., Hoell, J., & Priester, W. 1997, Bergmann-Schäfer, Bd. 8: Sterne und Weltraum [Google Scholar]
  9. Cardoso, J.-F., Le Jeune, M., Delabrouille, J., Betoule, M., & Patanchon, G. 2008, IEEE J. Sel. Top. Signal Proc., 2, 735 [NASA ADS] [CrossRef] [Google Scholar]
  10. Cox, D. P. 2005, ARA&A, 43, 337 [NASA ADS] [CrossRef] [EDP Sciences] [MathSciNet] [Google Scholar]
  11. Delabrouille, J., Cardoso, J.-F., & Patanchon, G. 2003, MNRAS, 346, 1089 [Google Scholar]
  12. Dennison, B., Simonetti, J., & Topasna, G. 1999, BAAS, 31, 1455 [Google Scholar]
  13. Devroye, L. 1986, Non-uniform Random Variate Generation (Springer) [CrossRef] [Google Scholar]
  14. Doi, Y., Takita, S., Ootsubo, T., et al. 2015, PASJ, 67, 50 [NASA ADS] [Google Scholar]
  15. Draine, B. T. 2011, Physics of the Interstellar and Intergalactic Medium [Google Scholar]
  16. Enßlin, T. A. 2019, Ann. Phys., 531, 1800127 [Google Scholar]
  17. Eriksen, H., Jewell, J., Dickinson, C., et al. 2008, ApJ, 676, 10 [NASA ADS] [CrossRef] [Google Scholar]
  18. Ewen, H. I., & Purcell, E. M. 1951, Nature, 168, 356 [NASA ADS] [CrossRef] [Google Scholar]
  19. Ferriere, K. M. 2001, Rev. Mod. Phys., 73, 1031 [NASA ADS] [CrossRef] [Google Scholar]
  20. Finkbeiner, D. P. 2003, ApJS, 146, 407 [Google Scholar]
  21. Fluke, C. J., & Jacobs, C. 2020, Wiley Interdisciplinary Reviews Data Mining and Knowledge Discovery, 10, e1349 [CrossRef] [Google Scholar]
  22. Freyberg, M. J. 1998, Astron. Nachr., 319, 93 [NASA ADS] [CrossRef] [Google Scholar]
  23. Freyberg, M. J., & Egger, R. 1999, Highlights in X-ray Astronomy, 272, 278 [Google Scholar]
  24. Gaustad, J. E., McCullough, P. R., Rosing, W., & Van Buren, D. 2001, PASP, 113, 1326 [NASA ADS] [CrossRef] [Google Scholar]
  25. Ginzburg, V. L., & Syrovatskii, S. 1965, ARA&A, 3, 297 [NASA ADS] [CrossRef] [Google Scholar]
  26. Gold, B., Odegard, N., Weiland, J. L., et al. 2011, ApJS, 192, 15 [Google Scholar]
  27. Goodfellow, I., Bengio, Y., & Courville, A. 2016, MIT Press, 521, 800 [Google Scholar]
  28. Gorski, K. M., Hivon, E., Banday, A. J., et al. 2005, ApJ, 622, 759 [Google Scholar]
  29. Haslam, C. G. T., Salter, C. J., Stoffel, H., & Wilson, W. E. 1982, A&AS, 47, 1 [NASA ADS] [Google Scholar]
  30. HI4PI Collaboration (Ben Bekhti, N., et al.) 2016, A&A, 594, A116 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
  31. Hinton, G. E. 1990, Machine Learning (Elsevier), 555 [CrossRef] [Google Scholar]
  32. Hinton, G. E., & Salakhutdinov, R. R. 2006, Science, 313, 504 [NASA ADS] [CrossRef] [MathSciNet] [PubMed] [Google Scholar]
  33. Jaynes, E. T. 1982, Proc. IEEE, 70, 939 [NASA ADS] [CrossRef] [Google Scholar]
  34. Kahn, F. D. 1980, Highlights of Astronomy, 5, 365 [CrossRef] [Google Scholar]
  35. Kennicutt, R. C., & Evans, N. J. 2012, ARA&A, 50, 531 [NASA ADS] [CrossRef] [Google Scholar]
  36. Kingma, D. P., & Ba, J. 2014, ArXiv e-prints [arXiv:1412.6980] [Google Scholar]
  37. Kingma, D. P., & Welling, M. 2013, ArXiv e-prints [arXiv:1312.6114] [Google Scholar]
  38. Kingma, D. P., & Welling, M. 2019, An Introduction to Variational Autoencoders [CrossRef] [Google Scholar]
  39. Klessen, R. S., & Glover, S. C. O. 2014, Physical Processes in the Interstellar Medium [Google Scholar]
  40. Kremer, J., Stensbo-Smidt, K., Gieseke, F., Pedersen, K. S., & Igel, C. 2017, IEEE Intell. Syst., 32, 16 [CrossRef] [Google Scholar]
  41. Larsen, A. B. L., Sønderby, S. K., Larochelle, H., & Winther, O. 2016, International Conference on Machine Learning, PMLR, 1558 [Google Scholar]
  42. Leach, S. M., Cardoso, J.-F., Baccigalupi, C., et al. 2008, A&A, 491, 597 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
  43. Leike, R., & Enßlin, T. 2017, Entropy, 19, 402 [Google Scholar]
  44. Longo, G., Merényi, E., & Tiňo, P. 2019, PASP, 131, 100101 [Google Scholar]
  45. Madsen, G. J., Haffner, L. M., & Reynolds, R. J. 2001, The Wisconsin H-Alpha Mapper Northern Sky Survey of Galactic Ionized Hydrogen [Google Scholar]
  46. Mannheim, K., & Schlickeiser, R. 1994, A&A, 286, 983 [NASA ADS] [Google Scholar]
  47. Miville-Deschênes, M.-A., & Lagache, G. 2005, ApJS, 157, 302 [Google Scholar]
  48. Müller, A., Hackstein, M., Greiner, M., et al. 2018, A&A, 620, A64 [EDP Sciences] [Google Scholar]
  49. Murray, C., & Peek, J. E. 2019, Am. Astron. Soc. Meeting Abstracts, 233, 252.09 [Google Scholar]
  50. Neugebauer, G., Habing, H. J., van Duinen, R., et al. 1984, ApJ, 278, L1 [Google Scholar]
  51. Planck Collaboration I. 2016, A&A, 594, A1 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
  52. Planck Collaboration X. 2016, A&A, 594, A10 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
  53. Planck Collaboration I. 2020, A&A, 641, A1 [CrossRef] [EDP Sciences] [Google Scholar]
  54. Planck Collaboration IV. 2020, A&A, 641, A4 [CrossRef] [EDP Sciences] [Google Scholar]
  55. Portillo, S. K. N., Parejko, J. K., Vergara, J. R., & Connolly, A. J. 2020, AJ, 160, 45 [Google Scholar]
  56. Reich, W. 1982, A&AS, 48, 219 [NASA ADS] [Google Scholar]
  57. Reich, P., & Reich, W. 1986, A&AS, 63, 205 [NASA ADS] [Google Scholar]
  58. Reich, P., Testori, J. C., & Reich, W. 2001, A&A, 376, 861 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
  59. Reis, I., Rotman, M., Poznanski, D., Prochaska, J. X., & Wolf, L. 2019, Effectively Using Unsupervised Machine Learning in Next Generation Astronomical Surveys [Google Scholar]
  60. Remazeilles, M., Dickinson, C., Banday, A. J., Bigot-Sazy, M. A., & Ghosh, T. 2015, MNRAS, 451, 4311 [Google Scholar]
  61. Rezende, D. J., Mohamed, S., & Wierstra, D. 2014, ArXiv e-prints [arXiv:1401.4082] [Google Scholar]
  62. Rumelhart, D. E., Hinton, G. E., & Williams, R. J. 1985, Learning internal representationsby error propagation, Tech. rep., California Univ San Diego La Jolla Inst for Cognitive Science [Google Scholar]
  63. Sanders, W. T., Kraushaar, W. L., Nousek, J. A., & Fried, P. M. 1977, ApJ, 217, L87 [NASA ADS] [CrossRef] [Google Scholar]
  64. Scoville, N. Z., & Sanders, D. B. 1987, in Interstellar Processes, eds. D. J. Hollenbach, & H. A. Thronson (Dordrecht, Netherlands: Springer), 21 [Google Scholar]
  65. Selig, M., Vacca, V., Oppermann, N., & Enßlin, T. A. 2015, A&A, 581, A126 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
  66. Snowden, S., Cox, D., McCammon, D., & Sanders, W. 1990, ApJ, 354, 211 [NASA ADS] [CrossRef] [Google Scholar]
  67. Snowden, S. L., Freyberg, M. J., Plucinsky, P. P., et al. 1995, ApJ, 454, 643 [NASA ADS] [CrossRef] [Google Scholar]
  68. Snowden, S. L., Egger, R., Freyberg, M. J., et al. 1997, ApJ, 485, 125 [NASA ADS] [CrossRef] [Google Scholar]
  69. Titsias, M., & Lázaro-Gredilla, M. 2014, International Conference on Machine Learning, 1971 [Google Scholar]
  70. Ucci, G., Ferrara, A., Gallerani, S., et al. 2018a, MNRAS, 483, 1295 [Google Scholar]
  71. Ucci, G., Ferrara, A., Pallottini, A., & Gallerani, S. 2018b, MNRAS, 477, 1484 [NASA ADS] [CrossRef] [Google Scholar]
  72. Zurada, J. M., Malinowski, A., & Cloete, I. 1994, Proceedings of IEEE International Symposium on Circuits and Systems-ISCAS’94 (IEEE), 6, 447 [Google Scholar]

Appendix A: Astrophysical data set

Table A.1.

Overview of the data sets used in this study.

Appendix B: Calculations

Here, we provide the analytic steps between Eqs. (7) and (8). For completeness, we start our calculations with the full posterior P(θ ∣ D) = P(D ∣ θ) P(θ)/P(D):

$$ \begin{aligned} D_{\rm KL}[Q_{\boldsymbol{\Phi }}(\cdot ) \mid \mid P(\cdot )]&= \int \mathrm{d}\boldsymbol{Z} \, \mathrm{d}\boldsymbol{\theta } \, \mathrm{d} \xi _N \, Q_{\boldsymbol{\Phi }}(\boldsymbol{Z}, \boldsymbol{\theta }, \xi _N \mid \boldsymbol{D}) \, \ln \left( \frac{Q_{\boldsymbol{\Phi }}(\boldsymbol{Z}, \boldsymbol{\theta }, \xi _N \mid \boldsymbol{D})}{P(\boldsymbol{Z}, \boldsymbol{\theta }, \xi _N \mid \boldsymbol{D})}\right) \nonumber \\&= \int \left( \prod _{i=1}^{p} \mathrm{d}\boldsymbol{z}_i \, Q_{\boldsymbol{\Phi }}(\boldsymbol{z}_i \mid \boldsymbol{d}_i) \right) \,\int \mathrm{d}\boldsymbol{\theta } \, Q_{\boldsymbol{\Phi }}( \boldsymbol{\theta }\mid \boldsymbol{D}) \, \int \mathrm{d} \xi _N \, Q_{\boldsymbol{\Phi }}( \xi _N\mid \boldsymbol{D}) \nonumber \\&\quad \times \ln \left( \frac{\prod _{i=1}^{p} Q_{\boldsymbol{\Phi }}(\boldsymbol{z}_i \mid \boldsymbol{d}_i) \, Q_{\boldsymbol{\Phi }}( \boldsymbol{\theta }\mid \boldsymbol{D}) \, Q_{\boldsymbol{\Phi }}( \xi _N\mid \boldsymbol{D}) }{\frac{P(\boldsymbol{\theta })}{P(\boldsymbol{D})} \, P(\xi _N) \, \prod _{i=1}^p \, \mathcal{G} \left(\boldsymbol{d}_i - f_{\boldsymbol{\theta }}(\boldsymbol{z}_i), t_{\boldsymbol{\psi }}( \xi _N)\right)\, \mathcal{G} (\boldsymbol{z}_i, {1})} \right) \nonumber \\&= \int \left( \prod _{i=1}^{p} \mathrm{d}\boldsymbol{z}_i \, Q_{\boldsymbol{\Phi }}(\boldsymbol{z}_i \mid \boldsymbol{d}_i) \right) \,\int \mathrm{d}\boldsymbol{\theta } \, Q_{\boldsymbol{\Phi }}( \boldsymbol{\theta }\mid \boldsymbol{D}) \, \int \mathrm{d} \xi _N \, Q_{\boldsymbol{\Phi }}( \xi _N\mid \boldsymbol{D}) \, \times \nonumber \\&\quad \Bigg [ \left(\sum _{i=1}^{p} \ln Q_{\boldsymbol{\Phi }}(\boldsymbol{z}_i \mid \boldsymbol{d}_i) \right) + \ln Q_{\boldsymbol{\Phi }}( \boldsymbol{\theta }\mid \boldsymbol{D}) + \ln Q_{\boldsymbol{\Phi }}( \xi _N\mid \boldsymbol{D}) - \ln P(\boldsymbol{\theta }) \nonumber \\&\quad - \ln P( \xi _N) + \ln P(\boldsymbol{D}) - \sum _{i=1}^{p} \ln \left(\mathcal{G} (\boldsymbol{z}_i, {1}) \, \mathcal{G} \left(\boldsymbol{d}_i - f_{\boldsymbol{\theta }}(\boldsymbol{z}_i), t_{\boldsymbol{\psi }}( \xi _N)\right)\right) \Bigg ] \nonumber \\&= \underbrace{\int \left( \prod _{i=1}^{p} \mathrm{d}\boldsymbol{z}_i \, Q_{\boldsymbol{\Phi }}(\boldsymbol{z}_i \mid \boldsymbol{d}_i) \right) \,\left( \sum _{i=1}^{p} \ln Q_{\boldsymbol{\Phi }}(\boldsymbol{z}_i \mid \boldsymbol{d}_i)\right)}_{=\sum _{i=1}^p \, \left\langle \ln Q_{\boldsymbol{\Phi }}(\boldsymbol{z}_i \mid \boldsymbol{d}_i) \right\rangle _{Q_{\boldsymbol{\Phi }}(\boldsymbol{z}_i \mid \boldsymbol{d}_i)}} \underbrace{\int \mathrm{d}\boldsymbol{\theta } \, Q_{\boldsymbol{\Phi }}( \boldsymbol{\theta }\mid \boldsymbol{D})}_{= \, 1} \, \underbrace{\int \mathrm{d} \xi _N \, Q_{\boldsymbol{\Phi }}( \xi _N\mid \boldsymbol{D})}_{= \, 1} \nonumber \\&\quad + \underbrace{\int \left( \prod _{i=1}^{p} \mathrm{d}\boldsymbol{z}_i \, Q_{\boldsymbol{\Phi }}(\boldsymbol{z}_i \mid \boldsymbol{d}_i)\right)}_{= \, 1} \,\underbrace{\int \mathrm{d}\boldsymbol{\theta } \, Q_{\boldsymbol{\Phi }}( \boldsymbol{\theta }\mid \boldsymbol{D}) \, \ln Q_{\boldsymbol{\Phi }}( \boldsymbol{\theta }\mid \boldsymbol{D})}_{= \, \int \mathrm{d}\boldsymbol{\theta } \, \delta (\boldsymbol{\theta } - \widehat{\boldsymbol{\theta }}) \, \ln \delta (\boldsymbol{\theta } - \widehat{\boldsymbol{\theta }})\, =\, \mathrm{const.} } \underbrace{\int \mathrm{d} \xi _N \, Q_{\boldsymbol{\Phi }}( \xi _N\mid \boldsymbol{D})}_{= \, 1} \nonumber \\&\quad + \underbrace{\int \left(\prod _{i=1}^{p} \mathrm{d}\boldsymbol{z}_i \, Q_{\boldsymbol{\Phi }}(\boldsymbol{z}_i \mid \boldsymbol{d}_i)\right)}_{= \, 1} \,\underbrace{\int \mathrm{d}\boldsymbol{\theta } \, Q_{\boldsymbol{\Phi }}( \boldsymbol{\theta }\mid \boldsymbol{D})}_{= \, 1} \, \underbrace{\int \mathrm{d} \xi _N \, Q_{\boldsymbol{\Phi }}( \xi _N\mid \boldsymbol{D}) \ln Q_{\boldsymbol{\Phi }}( \xi _N\mid \boldsymbol{D})}_{= \, \int \mathrm{d} \xi _N \, \delta ( \xi _N - \widehat{ \xi }_N) \, \ln \delta ( \xi _N - \widehat{ \xi }_N)\, =\, \mathrm{const.} } \nonumber \\&\quad - \underbrace{\int \left(\prod _{i=1}^{p} \mathrm{d}\boldsymbol{z}_i \, Q_{\boldsymbol{\Phi }}(\boldsymbol{z}_i \mid \boldsymbol{d}_i)\right)}_{= \, 1} \,\underbrace{\int \mathrm{d}\boldsymbol{\theta } \, Q_{\boldsymbol{\Phi }}( \boldsymbol{\theta }\mid \boldsymbol{D}) \, \ln P(\boldsymbol{\theta })}_{= \, \mathrm{const.} , \, \mathrm{since} \,P(\boldsymbol{\theta })\,=\,\mathrm{const} .} \underbrace{\int \mathrm{d} \xi _N \, Q_{\boldsymbol{\Phi }}( \xi _N\mid \boldsymbol{D})}_{= \, 1} \nonumber \\&\quad - \underbrace{\int \left(\prod _{i=1}^{p} \mathrm{d}\boldsymbol{z}_i \, Q_{\boldsymbol{\Phi }}(\boldsymbol{z}_i \mid \boldsymbol{d}_i)\right)}_{= \, 1} \,\underbrace{\int \mathrm{d}\boldsymbol{\theta } \, Q_{\boldsymbol{\Phi }}( \boldsymbol{\theta }\mid \boldsymbol{D})}_{= \, 1} \, \underbrace{\int \mathrm{d} \xi _N \, Q_{\boldsymbol{\Phi }}( \xi _N\mid \boldsymbol{D}) \ln P( \xi _N)}_{= \, \left\langle \ln \mathcal{G} ( \xi _N, 1) \right\rangle _{Q_{\boldsymbol{\Phi }}( \xi _N\mid \boldsymbol{D})}} \nonumber \\&\quad + \underbrace{\int \left(\prod _{i=1}^{p} \mathrm{d}\boldsymbol{z}_i \, Q_{\boldsymbol{\Phi }}(\boldsymbol{z}_i \mid \boldsymbol{d}_i)\right)}_{= \, 1} \,\underbrace{\int \mathrm{d}\boldsymbol{\theta } \, Q_{\boldsymbol{\Phi }}( \boldsymbol{\theta }\mid \boldsymbol{D})}_{= \, 1} \, \underbrace{\int \mathrm{d} \xi _N \, Q_{\boldsymbol{\Phi }}( \xi _N\mid \boldsymbol{D}) \ln P(\boldsymbol{D})}_{= \, \mathrm{const} ., \, \mathrm{since} \,P(\boldsymbol{D})\,=\,\mathrm{const} .} \nonumber \\&\quad - \underbrace{\int \left(\prod _{i=1}^{p} \mathrm{d}\boldsymbol{z}_i \, Q_{\boldsymbol{\Phi }}(\boldsymbol{z}_i \mid \boldsymbol{d}_i) \right)\sum _{i=1}^{p} \ln \mathcal{G} (\boldsymbol{z}_i, {1})}_{= \sum _{i=1}^{p} \left\langle \, \ln \mathcal{G} (\boldsymbol{z}_i, {1}) \right\rangle _{Q_{\boldsymbol{\Phi }}(\boldsymbol{z}_i \mid \boldsymbol{d}_i)}}\,\underbrace{\int \mathrm{d}\boldsymbol{\theta } \, Q_{\boldsymbol{\Phi }}( \boldsymbol{\theta }\mid \boldsymbol{D})}_{= \, 1} \, \underbrace{\int \mathrm{d} \xi _N \, Q_{\boldsymbol{\Phi }}( \xi _N\mid \boldsymbol{D})}_{= \, 1} \nonumber \\&\quad - \int \left(\prod _{i=1}^{p} \mathrm{d}\boldsymbol{z}_i \, Q_{\boldsymbol{\Phi }}(\boldsymbol{z}_i \mid \boldsymbol{d}_i)\right) \int \mathrm{d}\boldsymbol{\theta } \, Q_{\boldsymbol{\Phi }}( \boldsymbol{\theta }\mid \boldsymbol{D}) \int \mathrm{d} \xi _N \, Q_{\boldsymbol{\Phi }}( \xi _N\mid \boldsymbol{D}) \times \nonumber \\&\quad \quad \sum _{i=1}^{p} \ln \mathcal{G} \left(\boldsymbol{d}_i - f_{\boldsymbol{\theta }}(\boldsymbol{z}_i), t_{\boldsymbol{\psi }}( \xi _N)\right). \nonumber \end{aligned} $$

Inserting the definitions $ Q_{\boldsymbol{\Phi}}(\boldsymbol{\theta} \mid \boldsymbol{D}) = \delta(\boldsymbol{\theta} - \widehat{\boldsymbol{\theta}}) $, $ Q_{\boldsymbol{\Phi}}(\xi_N \mid \boldsymbol{D}) = \delta(\xi_N - \widehat{\xi}_N) $, and $ Q_{\boldsymbol{\Phi}}(\boldsymbol{z}_i \mid \boldsymbol{d}_i) = \mathcal{G}(\boldsymbol{z}_i - \boldsymbol{\upmu}_i, \Sigma_i) $, we get

$$ \begin{aligned} D_{\rm KL}[Q_{\boldsymbol{\Phi }}(\cdot ) \mid \mid P(\cdot )]&= \sum _{i=1}^p \, \left\langle \ln \mathcal{G} \left(\boldsymbol{z}_i - \boldsymbol{\upmu }_i, \Sigma _i \right) \right\rangle _{\mathcal{G} \left(\boldsymbol{z}_i - \boldsymbol{\upmu }_i, \Sigma _i \right)} - \left\langle \ln \mathcal{G} ( \xi _N, 1) \right\rangle _{\delta ( \xi _N - \widehat{ \xi }_N)} - \sum _{i=1}^{p} \left\langle \, \ln \mathcal{G} (\boldsymbol{z}_i, {1}) \right\rangle _{\mathcal{G} (\boldsymbol{z}_i - \boldsymbol{\upmu }_i, \Sigma _i)}\nonumber \\&\quad - \sum _{i=1}^{p} \left\langle \ln \mathcal{G} \left(\boldsymbol{d}_i - f_{\widehat{\boldsymbol{\theta }}}(\boldsymbol{z}_i), t_{\boldsymbol{\psi }}(\widehat{\xi }_N) \right) \right\rangle _{\mathcal{G} (\boldsymbol{z}_i - \boldsymbol{\upmu }_i, \Sigma _i)} + \mathcal{H} _0, \end{aligned} $$(B.1)

where ℋ0 collects all terms independent of the parameters. We renamed the subsequent terms of the Kullback-Leibler Divergence as follows and calculated each expression respectively:

  • $ \boldsymbol{\mathrm{T}1} = \left\langle \ln \mathcal{G}(\boldsymbol{z}_i - \boldsymbol{\upmu}_i, \Sigma_i) \right\rangle_{\mathcal{G}(\boldsymbol{z}_i - \boldsymbol{\upmu}_i, \Sigma_i)} $,

  • $ \boldsymbol{\mathrm{T}2} = \left\langle \ln \mathcal{G}(\xi_N, 1) \right\rangle_{\delta(\xi_N - \widehat{\xi}_N)} $,

  • $ \boldsymbol{\mathrm{T}3} = \left\langle \ln \mathcal{G}(\boldsymbol{z}_i, {1}) \right\rangle_{\mathcal{G}(\boldsymbol{z}_i - \boldsymbol{\upmu}_i, \Sigma_i)} $,

  • $ \boldsymbol{\mathrm{T}4} = \left\langle \ln \mathcal{G}\left(\boldsymbol{d}_i - f_{\widehat{\boldsymbol{\theta}}}(\boldsymbol{z}_i), t_{\boldsymbol{\psi}}(\widehat{\xi}_N)\right) \right\rangle_{\mathcal{G}(\boldsymbol{z}_i - \boldsymbol{\upmu}_i, \Sigma_i)} $.

First energy term T1:

$$ \begin{aligned} \boldsymbol{\mathrm{T} 1}&= \int \mathrm{d}{\boldsymbol{z}}_i \, \mathcal{G} ({\boldsymbol{z}}_i - {\boldsymbol{\upmu }}_i \, , \, \Sigma _i) \, \ln \left( \frac{\exp \left(-\frac{1}{2}({\boldsymbol{z}}_i - {\boldsymbol{\upmu }}_i)^T \, \Sigma _i^{-1} \, ({\boldsymbol{z}}_i - {\boldsymbol{\upmu }}_i)\right)}{\sqrt{| 2 \pi \Sigma _i|}} \right) \nonumber \\&= \int \mathrm{d}{\boldsymbol{z}}_i \, \mathcal{G} ({\boldsymbol{z}}_i - {\boldsymbol{\upmu }}_i \, , \, \Sigma _i) \, \left[ -\frac{1}{2}({\boldsymbol{z}}_i - {\boldsymbol{\upmu }}_i)^T \, \Sigma _i^{-1} \, ({\boldsymbol{z}}_i - {\boldsymbol{\upmu }}_i) \, -\frac{1}{2}\ln \big (\underbrace{|2 \pi \Sigma _i|}_{(2\pi )^l \, |\Sigma _i|}\big )\right]. \nonumber \end{aligned} $$

The trace of a scalar is the scalar itself, resulting in $ (\boldsymbol{z}_i - \boldsymbol{\upmu}_i)^T \, \Sigma_i^{-1} \, (\boldsymbol{z}_i - \boldsymbol{\upmu}_i) = \mathrm{tr}\left((\boldsymbol{z}_i - \boldsymbol{\upmu}_i)^T \, \Sigma_i^{-1} \, (\boldsymbol{z}_i - \boldsymbol{\upmu}_i)\right) $. Further, the trace is invariant under cyclic permutations: $ \mathrm{tr}\left((\boldsymbol{z}_i - \boldsymbol{\upmu}_i)^T \, \Sigma_i^{-1} \, (\boldsymbol{z}_i - \boldsymbol{\upmu}_i)\right) = \mathrm{tr}\left(\Sigma_i^{-1} \, (\boldsymbol{z}_i - \boldsymbol{\upmu}_i) (\boldsymbol{z}_i - \boldsymbol{\upmu}_i)^T\right) $. Finally, the trace is a linear operator and therefore commutes with the expectation:

$$ \begin{aligned}&= - \, \frac{1}{2} \, \mathrm{tr} \left( \int \mathrm{d}\boldsymbol{z}_i \, \mathcal{G} (\boldsymbol{z}_i - \boldsymbol{\upmu }_i \, , \, \Sigma _i) \, \underbrace{\Sigma _i^{-1} \, (\boldsymbol{z}_i - \boldsymbol{\upmu }_i) (\boldsymbol{z}_i - \boldsymbol{\upmu }_i)^T}_{= \, \Sigma ^{-1}_i \, \Sigma _i \, = \, {1}} \right) \nonumber \\&\quad - \frac{1}{2} \int \mathrm{d}\boldsymbol{z}_i \, \mathcal{G} (\boldsymbol{z}_i - \boldsymbol{\upmu }_i \, , \, \Sigma _i) \Big [ \underbrace{\ln (|\Sigma _i|)}_{= \, \mathrm{tr} (\ln \Sigma _i)} + \, l \, \ln (2 \pi )\Big ] \nonumber \\&= - \, \frac{1}{2} \Big [ \underbrace{\mathrm{tr} ({1})}_{= \, l} + \mathrm{tr} (\ln \Sigma _i) + \, l \, \ln (2 \pi ) \Big ] \nonumber \\&= - \, \frac{1}{2} \, \Big [\mathrm{tr} (\ln \Sigma _i) + l \left( 1 + \ln (2\pi ) \right) \Big ]. \end{aligned} $$(B.2)

In the above calculation, we carried the term $l(1 + \ln(2\pi))$ through our computations, since the number of latent space features l changes between our different experiments.
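The closed form in Eq. (B.2) can be checked numerically. The following sketch draws samples from a Gaussian with a randomly generated covariance (an illustrative assumption) and compares the Monte Carlo estimate of T1 with the analytic expression −½[tr(ln Σ) + l(1 + ln 2π)].

# Hedged sketch: numerical check of Eq. (B.2), the expected log-density of a
# Gaussian under itself. Dimensions and the random covariance are illustrative.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(123)
l = 3
A = rng.normal(size=(l, l))
Sigma = A @ A.T + l * np.eye(l)          # a well-conditioned covariance matrix
mu = rng.normal(size=l)

# Monte Carlo estimate of  T1 = < ln G(z - mu, Sigma) >_{G(z - mu, Sigma)}
samples = rng.multivariate_normal(mu, Sigma, size=200_000)
t1_mc = multivariate_normal(mean=mu, cov=Sigma).logpdf(samples).mean()

# Closed form of Eq. (B.2):  -1/2 [ tr(ln Sigma) + l (1 + ln 2*pi) ]
sign, logdet = np.linalg.slogdet(Sigma)  # tr(ln Sigma) = ln|Sigma| for SPD Sigma
t1_closed = -0.5 * (logdet + l * (1.0 + np.log(2.0 * np.pi)))

print(t1_mc, t1_closed)                  # the two values should agree closely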

Second energy term T2:

$$ \begin{aligned} \boldsymbol{\mathrm{T} 2}&= \int \mathrm{d} \xi _N \, \ln \Biggl ( \frac{1}{\sqrt{2 \pi }} \, \exp (- \frac{1}{2} \, \xi _N^2 ) \Biggr ) \, \delta ( \xi _N - \widehat{ \xi _N}) \nonumber \\&= -\frac{1}{2} \, \widehat{ \xi _N}^2 - \underbrace{\frac{1}{2} \ln (2 \pi )}_{= \, \mathrm{const.} } \nonumber \\&= - \frac{1}{2} \, \widehat{ \xi _N}^2 + C_2. \end{aligned} $$(B.3)

Third energy term T3:

$$ \begin{aligned} \boldsymbol{\mathrm{T} 3}&= \int \mathrm{d}\boldsymbol{z}_i \, \mathcal{G} (\boldsymbol{z}_i - \boldsymbol{\upmu }_i, \Sigma _i) \, \ln \Biggl ( \frac{\exp (- \frac{1}{2} \, \boldsymbol{z}_i^T \, {1}^{-1} \, \boldsymbol{z}_i)}{\sqrt{|2 \pi {1}|}} \Biggr ) \nonumber \\&= - \Bigg [ \int \mathrm{d}\boldsymbol{z}_i \, \mathcal{G} (\boldsymbol{z}_i - \boldsymbol{\upmu }_i \, , \, \Sigma _i) \, \frac{1}{2} \, \underbrace{\boldsymbol{z}_i^T \, {1}^{-1} \, \boldsymbol{z}_i}_{= \, \mathrm{tr} ( \boldsymbol{z}_i \, \boldsymbol{z}_i^T )} + \underbrace{\int \mathrm{d}\boldsymbol{z}_i \, \mathcal{G} (\boldsymbol{z}_i - \boldsymbol{\upmu }_i \, , \, \Sigma _i) \, \frac{1}{2} \ln (|2 \pi {1}|)}_{= \, \mathrm{const.} } \Bigg ] \nonumber \\&= - \frac{1}{2} \, \mathrm{tr} \, \Biggl ( \int \mathrm{d}\boldsymbol{z}_i \, \mathcal{G} (\boldsymbol{z}_i - \boldsymbol{\upmu }_i \, , \, \Sigma _i) \, \boldsymbol{z}_i \, \boldsymbol{z}_i^T \Biggr ) + C_3 . \end{aligned} $$(B.4)

The expression $ \boldsymbol{z}_i \, \boldsymbol{z}_i^T $ can be rewritten as $ \boldsymbol{z}_i \, \boldsymbol{z}_i^T = (\boldsymbol{z}_i - \boldsymbol{\upmu})(\boldsymbol{z}_i - \boldsymbol{\upmu})^T - \boldsymbol{\upmu} \boldsymbol{\upmu}^T + \boldsymbol{z}_i\boldsymbol{\upmu}^T + \boldsymbol{\upmu} \boldsymbol{z}_i^T $. We inserted this in the line above:

$$ \begin{aligned}&= - \frac{1}{2} \, \mathrm{tr} \, \Biggl ( \int \mathrm{d}\boldsymbol{z}_i \, \mathcal{G} (\boldsymbol{z}_i - \boldsymbol{\upmu }_i \, , \, \Sigma _i) \Bigl [ \underbrace{(\boldsymbol{z}_i - \boldsymbol{\upmu }_i)(\boldsymbol{z}_i - \boldsymbol{\upmu }_i)^T}_{= \Sigma _i} - \boldsymbol{\upmu }_i \boldsymbol{\upmu }_i^T + \boldsymbol{z}_i\boldsymbol{\upmu }_i^T + \boldsymbol{\upmu }_i \boldsymbol{z}_i^T \Bigr ] \Biggr ) + C_3 \nonumber \\&= - \frac{1}{2} \, \mathrm{tr} \, \Bigl (\underbrace{ \Sigma _i - \boldsymbol{\upmu }_i \boldsymbol{\upmu }_i^T +\boldsymbol{\upmu }_i \boldsymbol{\upmu }_i^T +\boldsymbol{\upmu }_i \boldsymbol{\upmu }_i^T}_{= \Sigma _i + \boldsymbol{\upmu }_i \boldsymbol{\upmu }_i^T} \Bigr ) + C_3 \nonumber \\&= - \frac{1}{2} \, \mathrm{tr} \, \Bigl ( \Sigma _i + \boldsymbol{\upmu }_i \boldsymbol{\upmu }_i^T \Bigr ) + C_3. \end{aligned} $$(B.5)

Fourth energy term T4:

$$ \begin{aligned} \boldsymbol{\mathrm{T} 4}&= \int \mathrm{d}\boldsymbol{z}_i \,\mathcal{G} (\boldsymbol{z}_i - \boldsymbol{\upmu }_i, \Sigma _i) \, \ln \mathcal{G} (\boldsymbol{d}_i - f_{\widehat{\boldsymbol{\theta }}}(\boldsymbol{z}_i), t_{\boldsymbol{\psi }}( \widehat{ \xi _N})) \nonumber \\&= \int \mathrm{d}\boldsymbol{z}_i \, \mathcal{G} (\boldsymbol{z}_i - \boldsymbol{\upmu }_i, \Sigma _i) \, \ln \Biggl ( \frac{\exp (-\frac{1}{2}(\boldsymbol{d}_i - f_{\widehat{\boldsymbol{\theta }}}(\boldsymbol{z}_i))^T \, t_{\boldsymbol{\psi }}(\widehat{ \xi _N})^{-1} \, (\boldsymbol{d}_i - f_{\widehat{\boldsymbol{\theta }}}(\boldsymbol{z}_i)))}{\sqrt{| 2 \pi \, t_{\boldsymbol{\psi }}(\widehat{ \xi _N})|}} \Biggr ) \nonumber \\&= -\int \mathrm{d}\boldsymbol{z}_i \, \mathcal{G} (\boldsymbol{z}_i - \boldsymbol{\upmu }_i \, , \, \Sigma _i) \, \Big [ \frac{1}{2} (\boldsymbol{d}_i - f_{\widehat{\boldsymbol{\theta }}}(\boldsymbol{z}_i))^T \, t_{\boldsymbol{\psi }}(\widehat{ \xi _N})^{-1} \, (\boldsymbol{d}_i - f_{\widehat{\boldsymbol{\theta }}}(\boldsymbol{z}_i)) \nonumber \\&\quad + \frac{1}{2}\ln (\underbrace{|2 \pi \, t_{\boldsymbol{\psi }}(\widehat{ \xi _N})|}_{= (2\pi )^k \, |t_{\boldsymbol{\psi }}(\widehat{ \xi _N})|})\Big ] \nonumber \\&= - \, \frac{1}{2}\int \mathrm{d}\boldsymbol{z}_i \, \mathcal{G} (\boldsymbol{z}_i - \boldsymbol{\upmu }_i \, , \, \Sigma _i) \, \Big [(\boldsymbol{d}_i - f_{\widehat{\boldsymbol{\theta }}}(\boldsymbol{z}_i))^T \, t_{\boldsymbol{\psi }}(\widehat{ \xi _N})^{-1} \, (\boldsymbol{d}_i - f_{\widehat{\boldsymbol{\theta }}}(\boldsymbol{z}_i)) \Big ]\nonumber \\&\quad - \frac{1}{2} \int \mathrm{d}\boldsymbol{z}_i \, \mathcal{G} (\boldsymbol{z}_i - \boldsymbol{\upmu }_i \, , \, \Sigma _i)\, \big [\underbrace{k \, \ln (2\pi )}_{= \, \mathrm{const.} } + \mathrm{tr} (\ln \, t_{\boldsymbol{\psi }}(\widehat{ \xi _N}))\big ] \nonumber \\&= - \, \frac{1}{2}\int \mathrm{d}\boldsymbol{z}_i \, \mathcal{G} (\boldsymbol{z}_i - \boldsymbol{\upmu }_i \, , \, \Sigma _i) \,\mathrm{tr} \Big [ t_{\boldsymbol{\psi }}(\widehat{ \xi _N})^{-1} \, (\boldsymbol{d}_i - f_{\widehat{\boldsymbol{\theta }}}(\boldsymbol{z}_i))\,(\boldsymbol{d}_i - f_{\widehat{\boldsymbol{\theta }}}(\boldsymbol{z}_i))^T \Big ] \nonumber \\&\quad - \,\frac{1}{2}\,\mathrm{tr} \,(\ln \, t_{\boldsymbol{\psi }}(\widehat{ \xi _N})) + C_4. \end{aligned} $$(B.6)

We cannot exactly evaluate the integral in this expression using analytic techniques. Fortunately, we can make use of Monte Carlo methods to approximate the expectation with a finite sum:

$$ \begin{aligned} \int \mathrm{d}\boldsymbol{z}_i \, \mathcal{G} (\boldsymbol{z}_i - \boldsymbol{\upmu }_i \, , \, \Sigma _i) \, \mathrm{tr} \Big [ t_{\boldsymbol{\psi }}(\widehat{ \xi _N})^{-1} \, (\boldsymbol{d}_i - f_{\widehat{\boldsymbol{\theta }}}(\boldsymbol{z}_i))\,(\boldsymbol{d}_i - f_{\widehat{\boldsymbol{\theta }}}(\boldsymbol{z}_i))^T \Big ] \approx \frac{1}{M} \, \sum _{m=1}^M \, \mathrm{tr} \,\Big ( t_{\boldsymbol{\psi }}(\widehat{ \xi _N})^{-1} \, (\boldsymbol{d}_i - f_{\widehat{\boldsymbol{\theta }}}(\boldsymbol{z}_i^{(m)}))\,(\boldsymbol{d}_i - f_{\widehat{\boldsymbol{\theta }}}(\boldsymbol{z}_i^{(m)}))^T \Big ). \end{aligned} $$(B.7)

With this method we obtain a high-accuracy estimator of the expectation value (even with a small number of samples), as long as we draw the samples $ \boldsymbol{z}_i^{(m)} $ from the distribution $ \mathcal{G}(\boldsymbol{z}_i - \boldsymbol{\upmu}_i, \Sigma_i) $ (Bishop 2006). This means the samples $\boldsymbol{z}_i$ are a function of $\boldsymbol{\upmu}_i$ and $\Sigma_i$ via $ \boldsymbol{z}_i = \boldsymbol{\upmu}_i + \sqrt{\Sigma_i} \, \boldsymbol{\epsilon}_i $ with an auxiliary variable $\boldsymbol{\epsilon}_i$ and $P(\boldsymbol{\epsilon}_i) = \mathcal{G}(\boldsymbol{\epsilon}_i, 1)$ (see Sect. 2.3 for details). We sampled once for each pixel, that is, M = 1. The full Kullback-Leibler Divergence (B.1) then reads

$$ \begin{aligned} D_{\rm KL}[Q_{\boldsymbol{\Phi }}(\cdot ) \mid \mid P(\cdot )]&= \frac{1}{2} \sum _{i=1}^p \Bigg [ - \mathrm{tr} \,(\ln \Sigma _i) - l \left( 1 + \ln (2\pi ) \right) + \frac{1}{p} \, \widehat{ \xi _N}^2 + \, \mathrm{tr} \, \Bigl ( \Sigma _i + \boldsymbol{\upmu }_i \boldsymbol{\upmu }_i^T \Bigr ) + \, \mathrm{tr} \, \left( \frac{1}{ t_{\boldsymbol{\psi }}(\widehat{ \xi _N})} (\boldsymbol{d}_i - f_{\widehat{\boldsymbol{\theta }}}(\boldsymbol{z}_i))(\boldsymbol{d}_i - f_{\widehat{\boldsymbol{\theta }}}(\boldsymbol{z}_i))^T \right) \nonumber \\&\quad + \mathrm{tr} \,(\ln \, t_{\boldsymbol{\psi }}(\widehat{ \xi _N}) ) \Bigg ] + \mathcal{H} _0, \end{aligned} $$(B.8)

where we absorbed all constant terms in a redefined ℋ0.
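For illustration, the objective of Eq. (B.8) can be written compactly as a loss function over a batch of pixel vectors. The sketch below assumes a diagonal covariance Σi parameterized by log-variances, a scalar noise covariance t_ψ(ξN) shared by all channels and modeled as a log-normal transform exp(μN + σN ξN), and reparameterized sampling with M = 1; these modeling choices and all variable names are assumptions for illustration and not the exact NEAT-VAE implementation.

# Hedged sketch of Eq. (B.8) as a loss over a batch of pixel vectors d of shape (p, k).
import torch

def neat_vae_loss(d, decoder, mu, log_var, xi_N, mu_N=-7.0, sigma_N=1.0):
    """Eq. (B.8) up to constant terms; assumes diagonal Sigma_i and scalar t_psi."""
    p, k = d.shape
    # Reparameterized single-sample draw (M = 1): z_i = mu_i + sqrt(Sigma_i) * eps_i
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * log_var) * eps
    recon = decoder(z)                                    # f_theta_hat(z_i), shape (p, k)
    t_psi = torch.exp(mu_N + sigma_N * xi_N)              # assumed log-normal noise covariance
    # -tr(ln Sigma_i) + tr(Sigma_i + mu_i mu_i^T) for diagonal Sigma_i
    latent_terms = (-log_var + log_var.exp() + mu.pow(2)).sum(dim=1)
    # tr( t_psi^{-1} (d_i - f(z_i))(d_i - f(z_i))^T ) + tr(ln t_psi)
    residual = (d - recon).pow(2).sum(dim=1) / t_psi
    noise_terms = k * torch.log(t_psi) + xi_N.pow(2) / p
    return 0.5 * (latent_terms + residual + noise_terms).sum()

# Illustrative usage with a stand-in decoder and random data:
decoder = torch.nn.Sequential(torch.nn.Linear(10, 30), torch.nn.ReLU(), torch.nn.Linear(30, 35))
d = torch.rand(128, 35)                                   # a batch of pixel vectors
mu, log_var = torch.zeros(128, 10), torch.zeros(128, 10)  # encoder outputs
xi_N = torch.zeros(1, requires_grad=True)                 # latent noise parameter
loss = neat_vae_loss(d, decoder, mu, log_var, xi_N)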

Appendix C: Hyperparameters

Hyperparameters for the NEAT-VAE are (1) the number of network layers, (2) the number of hidden neurons per layer, (3) the number of neurons in the bottleneck layer, (4) the mean μN and (5) the standard deviation σN of the log-normal model for the noise covariance transformation, the learning rates of (6) the network weights and (7) the noise parameter weights, and (8) the batch size. We examined 50 configurations of the NEAT-VAE, consisting of different combinations of hyperparameters. The values for the hyperparameters were randomly chosen from limited intervals, which we specified for each hyperparameter based on prior experiments (see Table C.1). We implemented rectified linear unit (ReLU) functions as activation functions in each layer except the output layer and used the Adam optimizer (Kingma & Ba 2014) to update the network weights ϕ and $\widehat{\boldsymbol{\theta}}$, and the latent noise value $\widehat{\xi}_N$. We trained each configuration for (10⁵ × batchsize)/p epochs, with p denoting the number of pixels. We built our NEAT-VAE as a descriptive rather than a predictive model, since we aim to learn the underlying, independent features generating one specific data set. For this reason, we did not split the data into training, validation, and test sets, nor did we analyze overfitting. For reproducibility, we fixed the random seed to 123.
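A random search over these eight hyperparameters could be set up as sketched below. The sampling intervals shown are placeholders and not the values of Table C.1; only the hyperparameter names, the number of 50 configurations, and the seed 123 follow the text (and the seed is used here for the sampling, which is our own assumption).

# Hedged sketch: random sampling of NEAT-VAE configurations (placeholder intervals).
import random

random.seed(123)

def sample_configuration():
    return {
        "n_layers":   random.choice([4, 6, 8]),          # (1) network layers
        "n_hidden":   random.choice([20, 30, 40]),       # (2) hidden neurons per layer
        "n_latent":   random.randint(3, 15),             # (3) bottleneck neurons
        "mu_N":       random.uniform(-9.0, -5.0),        # (4) noise transform mean
        "sigma_N":    random.uniform(0.5, 2.0),          # (5) noise transform std
        "lr_weights": 10 ** random.uniform(-3.5, -2.0),  # (6) network weight learning rate
        "lr_noise":   10 ** random.uniform(-3.5, -2.5),  # (7) noise parameter learning rate
        "batch_size": random.choice([64, 128, 256]),     # (8) batch size
    }

configurations = [sample_configuration() for _ in range(50)]  # 50 examined configurations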

Table C.1.

Hyperparameters of NEAT-VAE.

Appendix D: Other features

In the experiment discussed in Sect. 3, our NEAT-VAE algorithm mapped 35 Galactic all-sky maps to ten latent space features that encode the essential information required to reconstruct the input again. The latent space exhibits an ordering of the features by significance, which we measured by the ratio of mean fluctuations to the feature posterior variance (see Eq. (10)). Features A, B, and C, which have the highest significance, are shown in Fig. 3. The remaining seven features (with their mean and variance in HEALPix representation) are displayed in the following in order of descending significance. By visual analysis, we can recognize artifacts, for example of the IRIS scanning scheme, in features D, F, G, and H. Feature H also seems to encode structures near the Galactic plane of the Hα map, and the mean of feature J encodes structures similar to the Fermi Bubbles in high-energy γ-ray data. All-sky images of the 35 Galactic input data sets are displayed in the leftmost columns of Appendix E.
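A minimal sketch of how per-pixel latent values can be reassembled into such HEALPix feature maps is given below; the array names and the stand-in values are assumptions, with only Nside = 128 and the ten latent features taken from the text.

# Hedged sketch: merging per-pixel latent posterior means into HEALPix all-sky maps.
import numpy as np
import healpy as hp

nside, n_features = 128, 10
npix = hp.nside2npix(nside)                     # 196 608 pixels at Nside = 128

# latent_mean: posterior means of the latent features for every pixel, shape (npix, n_features)
latent_mean = np.zeros((npix, n_features))      # stand-in values for illustration

feature_maps = [latent_mean[:, f] for f in range(n_features)]
hp.mollview(feature_maps[0], title="Feature A (posterior mean)")  # Mollweide view of one map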

Fig. D.1.

Full latent space representation with 10 neurons in descending order of significance. Significance values are: S(feature D) = 38.18, S(feature E) = 34.07, and S(feature F) = 13.77.

Fig. D.2.

Least significant latent space feature maps. Significance values are: S(feature G) = 13.63, S(feature H) = 11.85, S(feature I) = 10.34, and S(feature J) = 8.75.

Appendix E: Decoder Jacobian maps

The following panels show the gradients of the reconstructed Galactic all-sky maps (that is, the output of our NEAT-VAE algorithm) with respect to the latent space features A, B, and C. The Galactic input data are displayed in the leftmost column, while the individual features mark the top row. The resulting gradient values in each pixel are shown as a HEALPix map connecting a Galactic map and a feature map, with red colors indicating positive gradient values and blue colors denoting negative gradient values.

Fig. E.1.

Decoder Jacobian maps. Derivatives of reconstructed Galactic all-sky maps with respect to latent feature maps A, B, and C (top panel). The figure continues over several panels.

