Inferring properties of dust in supernovae with neural networks

Context. Determining the properties of dust formed in and around supernovae from observations remains challenging. This is due both to incomplete coverage of the data in wavelength and time and to the often inconspicuous signatures of dust in the observed data. Aims. Here we address this challenge using modern machine learning methods to determine the amount, composition, and temperature of dust from a large set of simulated data. We aim to determine whether such methods are suitable to infer these properties from future observations of supernovae. Methods. We calculate spectral energy distributions (SEDs) of dusty shells around supernovae. We develop a neural network consisting of eight fully connected layers and an output layer with specified activation functions that allow us to predict the dust mass, temperature, and composition, as well as their respective uncertainties, from each SED. We conduct a feature importance analysis via SHapley Additive exPlanations (SHAP) to find the minimum set of JWST filters required to accurately predict these properties. Results. We find that our neural network predicts dust masses and temperatures with a root-mean-square error (RMSE) of $\sim$ 0.12 dex and $\sim$ 38 K, respectively. Moreover, our neural network can distinguish well between the different dust species included in our work, reaching a classification accuracy of up to 95\% for carbon and 99\% for silicate dust. Conclusions. Our analysis shows that the JWST filters NIRCam F070W, F140M, F356W, F480M and MIRI F560W, F770W, F1000W, F1130W, F1500W, F1800W are likely the most important for determining the properties of dust formed in and around supernovae from future observations. We tested this on selected optical to infrared data of SN 1987A at 615 days past explosion and find good agreement with dust masses and temperatures inferred with standard fitting methods in the literature.


Introduction
The origin of dust in galaxies in the Universe remains debated. Large amounts of dust are observed in galaxies and quasars in the early and local Universe (e.g. Bertoldi et al. 2003; Priddey et al. 2003; Michałowski et al. 2010a; Watson et al. 2015; Wang et al. 2008; Michałowski et al. 2010b; Marrone et al. 2018), some of which require a rapid and efficient dust formation process (e.g. Dwek et al. 2007; Gall et al. 2011a,b; Finkelstein et al. 2012). There is growing evidence that core collapse supernovae (CC-SNe), which mark the death of short-lived massive stars, are efficient dust producers likely responsible for the observed large amounts of dust in galaxies (Gall et al. 2011b, 2014; Ferrara et al. 2016; De Looze et al. 2020). An alternative to the rapid in situ dust production in CCSNe is grain growth in cold molecular clouds in the interstellar medium (ISM, e.g. Draine 2009) from dust grain seeds and heavy elements rapidly produced by CCSNe.
Dust masses inferred from observations of supernovae (SNe) range from less than about $10^{-4}$ M$_\odot$ in young CCSNe of a few hundred days old to about 0.1-1.0 M$_\odot$ in old CCSN remnants of a few hundred to a thousand years of age. From a handful of CCSNe that have been observationally monitored over several years, it is evident that the amount of dust gradually increases over about 25-30 years (Gall et al. 2011b, 2014; Wesson et al. 2015; Bevan & Barlow 2016). Observations of older supernova remnants (SNRs) such as Cas A (Niculescu-Duvaz et al. 2021), N49 (Otsuka et al. 2010), Sgr A East ($\sim$ 0.02 M$_\odot$ and $\sim$ 10 000 years old, Lau et al. 2015), G11.2−0.3 ($\sim$ 0.34 M$_\odot$), G21.5−0.9 ($\sim$ 0.29 M$_\odot$), and G29.7−0.3 ($\sim$ 0.51 M$_\odot$) (Chawner et al. 2019) confirm that on average about $\sim$ 0.3 M$_\odot$ of CCSN-produced dust is sustained over a period of about 3 000 years. While this is sufficient to account for the total dust mass observed in local as well as high-redshift galaxies, the final amount of dust released into the ISM may still depend on the efficiency of dust destruction and re-formation behind diverse short and long time-scale reverse shocks launched by the forward shock interaction with either the CSM (e.g. Mauerhan & Smith 2012; Matsuura et al. 2019) or ISM (e.g. Silvia et al. 2012; Micelotta et al. 2016).
Inferring dust quantities as well as properties from observations is challenging. Typically, the amount of dust and its temperature is determined by fitting the thermal dust emission in the near- to far-infrared (far-IR) wavelength range with dust models at different levels of complexity (e.g. Rho et al. 2009; Gall et al. 2011b; Wesson et al. 2015; Matsuura et al. 2015, 2019; Chen et al. 2021). However, the most common dust species are rather featureless in this wavelength range, with silicates having the most prominent emission feature at around 10-12 µm (Draine & Lee 1984; Henning 2010), which could also appear featureless for cold dust and/or dust with large grains. Due to limited computational power or insufficient data, the manifold of dust model parameters can often neither be fully explored nor constrained. This leads to dust mass estimates that may vary over an order of magnitude.
Warm and cold dust ($\lesssim$ 500 K) in nearby SNe and SNRs has been detected in the mid- to far-IR wavelength range with telescopes such as WISE, SOFIA, ALMA or the Herschel mission (2009-2013) (e.g. Gomez et al. 2012; Indebetouw et al. 2014; De Looze et al. 2019; Gall et al. 2011b; Gall & Hjorth 2018, and references therein), and notably the Spitzer Space Telescope, which observed during its cold (2003-2009) and warm (2009-2020) phases about 380 CCSNe out of about 1100 SNe in total (see Szalai et al. 2019 for a summary). The next telescope in line with the right sensitivity to observe dust that is either newly formed or heated, and possibly to constrain some dust species, is the James Webb Space Telescope (JWST, Gardner et al. 2006). With instruments onboard such as the Near-Infrared Camera and Spectrograph (NIRCam, NIRSpec), the Near-Infrared Imager and Slitless Spectrograph (NIRISS), and the Mid-Infrared Instrument (MIRI), imaging as well as spectroscopic observations of CCSNe in the wavelength range 0.6-28 µm will be possible. However, the wavelength range of JWST is shorter than that of the Spitzer Infrared Spectrograph, which extended out to $\sim$ 38 µm. JWST will therefore preferentially probe the hot and warm dust regime, but will not be suitable for probing the cold dust regime in which the majority of the large dust masses in SNRs are detected.
In this paper, we investigated whether modern machine learning algorithms can be used to determine the dust mass, temperature, and possible grain species from the signatures dust imprints in the spectral energy distributions (SEDs) of SNe. We trained a neural network to predict such dust quantities from a simulated set of SEDs of CCSNe with different dust quantities and properties. The SEDs were produced using the fully three-dimensional photoionisation and dust radiative transfer code MOCASSIN 1 (Ercolano et al. 2003a, 2005), exploring a large parameter space of dust and SN properties. Assuming that the SNe are distributed within at most 65 Mpc, we then convolved the SEDs with the suite of available JWST NIRCam (0.6-5.0 µm) and MIRI (5.0-28 µm) bandpass filters to synthesise a photometric data set. The use of simulated data was essential for this work, since the presently existing observational data of dust in and around SNe are insufficient for this purpose.
The neural network was optimised to predict the total dust mass, dust temperature, and dust species. The data input to the neural network included the entire photometric data set, which consists of 293 236 SEDs and the redshift for each SED. To obtain a practical method, we performed a feature selection method to find the minimum number of JWST filters to estimate the dust properties. Furthermore, we trained the neural network to obtain an estimate on the uncertainties of the predicted quantities (i.e. dust mass, temperature and species). We then identified the most reliable predictions using self-defined and common performance evaluation metrics, which also provide information about the overall performance of the neural network.
In Section 2 we describe the simulated data set which sets the basis of our analysis and which we used to train our machine learning algorithm, which is described in Section 3. In Section 4 we describe the metrics that we employed to evaluate the performance of the neural network, and discuss possible caveats in Section 5. We present our results in Section 6 and discuss the implications of our results on future observations and the SN dust community in Section 7. We conclude in Section 8. Throughout the paper we assume a ΛCDM model with $H_0 = 70$ km s$^{-1}$ Mpc$^{-1}$ and $\Omega_0 = 0.3$ (Abbott et al. 2017). We applied the above mentioned assumptions to our simulated data set whenever needed, via a built-in library from astropy 2 .
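The paper applies these cosmological assumptions via astropy's built-in cosmology library; the same luminosity distance can be sketched with nothing but numpy, as a minimal numerical version of the flat ΛCDM relation under the stated $H_0$ and $\Omega_0$:

```python
import numpy as np

def _trapz(y, x):
    """Trapezoidal integration (avoids numpy version differences)."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def luminosity_distance(z, H0=70.0, Om0=0.3, n_steps=10_000):
    """Luminosity distance [Mpc] in a flat Lambda-CDM cosmology:
    D_L = (1 + z) * (c / H0) * integral_0^z dz' / E(z'),
    with E(z) = sqrt(Om0 * (1 + z)^3 + 1 - Om0)."""
    c = 299792.458  # speed of light [km/s]
    zp = np.linspace(0.0, z, n_steps)
    inv_E = 1.0 / np.sqrt(Om0 * (1.0 + zp) ** 3 + (1.0 - Om0))
    return (1.0 + z) * (c / H0) * _trapz(inv_E, zp)
```

For the two anchor redshifts used later in the paper, this returns $\approx 0.43$ Mpc at $z = 0.0001$ and $\approx 65$ Mpc at $z = 0.015$, matching the scenario definitions in Section 2.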

Simulated data
Here, we describe the simulated data set, which consists of simulated SN SEDs from which we synthesised a photometric data set using the entire suite of JWST NIRCam and MIRI bandpass filters. We describe how we dealt with either exceptionally faint or bright sources with respect to the JWST detection / sensitivity limits. Furthermore, we define three different scenarios, in each of which we derived a different data set from the simulated data set to train the neural network and test its performance for predicting the SN dust quantities and properties.

MOCASSIN
MOCASSIN (Monte Carlo Simulations of Ionised Nebulae) is a fully three-dimensional radiative transfer code that propagates radiation packets using a Monte Carlo technique (Ercolano et al. 2003a, 2005). Arbitrary distributions of material can be represented within a Cartesian grid. The material can consist of gas, dust, or both. In each grid cell, the thermal equilibrium and ionisation balance equations are solved to determine the physical conditions. For dusty models, MOCASSIN uses standard Mie scattering theory to calculate the effective absorption and scattering efficiencies for a grain of radius a at wavelength λ, from the optical constants of the material. Any type of grain size distribution and mixture of materials may be specified. The material is illuminated by a radiation source or sources, which can be discrete point sources, or a diffuse source present within each grid cell. The spectral energy distribution of the illuminating source can be a simple blackbody (BB) or an arbitrary spectral shape such as a stellar atmosphere model. The radiation field is described by a composition of a discrete number of monochromatic packets of energy (Abbott & Lucy 1985) for all sources. At each location, the Monte Carlo estimator (Lucy 1999) derives the mean intensity of the radiation field. The contribution of each energy packet to the radiation field at each location is defined by its path through the grid.
To synthesise different SEDs of SNe with dusty shells (hereafter SN model SEDs) using MOCASSIN, we defined a set of parameters for the underlying radiation source (the SN), the dust itself, and its location. Specifically, in our simulations the chosen radiation source is a central blackbody, which is defined by a temperature ($T_{\rm BB}$) and a luminosity ($L_{\rm BB}$). The ranges of the two parameters follow typical measurements of SN photospheres up to a few hundred days past explosion. The range of radii used in our models covers both the expected radii of SN ejecta up to ∼1 000 days after explosion, as well as larger radii at which pre-existing dust flash-heated by a SN explosion could give rise to infrared emission. For the dust, we considered two prominent grain species, amorphous carbon and astronomical silicates, with optical constants taken from Zubko et al. (1996) and Draine & Lee (1984), respectively. For our simulations, we considered that all the dust consists of either 100% carbon, 100% silicates, or a 50:50 mixture of the two dust species. The range of initial dust masses is limited to $10^{-5}$-$10^{-1}$ M$_\odot$. The upper dust mass limit is partly motivated by the long run-time that simulations of SN model SEDs have if a lot of dust is present. Another reason is that while the mean dust mass for SNe and SNRs is 0.4 ± 0.07 M$_\odot$, the dust temperatures for large dust masses ($\gtrsim 10^{-2}$ M$_\odot$) in some SNe are $\lesssim$ 50 K (Gall et al. 2014). Even with JWST such cold dust will not be easily detected. Furthermore, we considered only single grain sizes, ranging between 0.005-5 µm. Such grain sizes are typically present in, for example, the Milky Way (e.g. Mathis et al. 1977) and observed in some SNe (e.g. Gall et al. 2014; Wesson et al. 2015; Bevan et al. 2020). In total, our SN model SEDs are composed of seven parameters, for which we defined either a set of distinct choices or a range of values (some are described above). A summary of the entire parameter space is presented in Table 1.
To finally create our data set, each SN model SED was synthesised from a set of parameters that was stochastically generated from this parameter space. This method ensures that the entire parameter space is uniformly exploited.
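The stochastic generation of parameter sets can be sketched as below. The parameter names, the illustrative ranges, and the choice of log-uniform sampling for dust mass and grain size are assumptions for this sketch; the actual choices are those listed in Table 1:

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_model_parameters():
    """Draw one stochastic parameter set for a SN model SED.
    Ranges below are illustrative stand-ins for Table 1."""
    return {
        "T_BB": rng.uniform(4_000.0, 10_000.0),      # blackbody temperature [K]
        "log_L_BB": rng.uniform(6.0, 9.0),           # log10 luminosity (assumed units)
        "log_M_dust": rng.uniform(-5.0, -1.0),       # log10 dust mass [M_sun]
        "log_a": rng.uniform(np.log10(0.005), np.log10(5.0)),  # grain radius [micron]
        "species": rng.choice(["carbon", "silicate", "mix"]),
        "log_R_in": rng.uniform(15.0, 17.0),         # inner shell radius [cm], assumed range
    }
```

Drawing each parameter independently and uniformly (in linear or log space) over its allowed range is what ensures the parameter space is exploited uniformly.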
In this work, we used MOCASSIN version 2.02.73 to synthesise 293 236 model SEDs. We constructed a cubical Cartesian grid with 11 cells on each side of the 3D grid to model the dusty shells, which are defined by an inner and outer shell radius, $R_{\rm in}$ and $R_{\rm out}$, respectively. We modelled one-eighth of the grid cube (shell) with the illuminating source in one corner. Assuming spherical symmetry, this cube segment was then scaled to the full cube for an effective resolution of $21^3$ cells (Ercolano et al. 2003b). We used $10^6$ energy packets in most of our simulations. This relatively low number (∼750 energy packets per grid cell) ensures that the MOCASSIN models run very quickly. However, at wavelengths where only a few photons are emitted, the SEDs are affected by small-number statistics and hence dominated by noise. Therefore, for MOCASSIN models with dust masses lower than $10^{-4}$ M$_\odot$, in which few photons are reprocessed to longer wavelengths, we used 10 times as many energy packets to reduce the statistical noise in the SEDs at longer wavelengths (e.g. 5-30 µm).
For efficiency reasons, we set a maximum run-time of two minutes for each model. For most regions of the investigated parameter space, the MOCASSIN models have a run-time of a few seconds, but models with both a small shell radius ($\lesssim 4\times10^{16}$ cm) and a high dust mass ($\gtrsim 10^{-2}$ M$_\odot$) have very high optical depths and thus time out. This results in a slightly non-uniform filling of the entire parameter space. Furthermore, any dust grains in a simulation that reach the sublimation temperature of their species (1 400 K for silicate dust, 2 200 K for carbon dust) are considered to have evaporated and are not included when calculating the SED. The final dust mass is then either lower than the input dust mass, or no dust may remain at all. Consequently, for mixed-chemistry models, the composition is altered from 50:50 to a higher carbon fraction due to the higher sublimation temperature of carbon dust. MOCASSIN does not directly provide the final dust mass and composition if dust evaporation occurs, but they are easily extracted from the output grid files by summing the dust masses in cells where the temperature is below the dust sublimation temperature. Some dust evaporation occurs in about 5% of our models. Figure 1 shows the final distribution of the SN model SEDs in the dust mass, temperature, and species MOCASSIN output-parameter space.
Synthetic photometry of optical and mid-IR JWST bandpass filters
The JWST is equipped with two imaging cameras, NIRCam and MIRI. The two cameras have in total six narrow and 31 broad bandpass filters available that cover the wavelength ranges 0.6-5 µm (NIRCam) and 5-30 µm (MIRI). As a next step in preparing the data set for our neural network, we convolved the SN model SEDs with both the NIRCam and MIRI bandpass filters (hereafter filters) in order to synthesise a photometric data set.
To do so, we used the python program Pyphot 3 . This program has a built-in library of transmission curves of different filters.
Since Pyphot also allows customised transmission curves, we imported transmission curves for the NIRCam and MIRI filters from the Spanish Virtual Observatory 4 . For each NIRCam and MIRI filter, we first calculated the integrated flux in units of Jansky via Pyphot, which was then converted to AB magnitudes following the definition of Hogg et al. (2002).
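A simplified version of this synthetic photometry step, assuming the SED and the filter transmission curve are given as plain arrays (the paper uses Pyphot's built-in convolution instead), could look like:

```python
import numpy as np

def _trapz(y, x):
    """Trapezoidal integration (avoids numpy version differences)."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def band_flux_jy(wave, flux_jy, t_wave, t_curve):
    """Transmission-weighted mean flux density [Jy] of an SED in one band;
    a simplified stand-in for the Pyphot convolution step."""
    T = np.interp(wave, t_wave, t_curve, left=0.0, right=0.0)
    return _trapz(T * flux_jy, wave) / _trapz(T, wave)

def jy_to_ab(flux_jy):
    """AB magnitude from a flux density in Jansky: m_AB = -2.5 log10(f / 3631 Jy)."""
    return -2.5 * np.log10(flux_jy / 3631.0)
```

As a sanity check, a flat 1 Jy SED yields a band flux of 1 Jy in any filter and an AB magnitude of $\approx 8.90$, the standard AB zero point in Jansky.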

JWST detection limits
For the final step, we considered that our synthetic photometric data set contains magnitudes in some filters that would either be too bright or too faint to be detected with JWST. In order to filter out data with magnitudes that practically cannot be observed (hereafter missing values), we adopted the pre-calculated point-source continuum detection limits (Glasse et al. 2015; Greene et al. 2017) that have been derived using the JWST exposure time calculator (ETC, Pontoppidan et al. 2016) for a signal-to-noise ratio (S/N) of 10 and exposure times of 21.4 s and 10 000 s for the saturation and sensitivity limits, respectively. A visualisation of these limits for all NIRCam and MIRI filters is shown in the appendix in Figures A.1 and A.2, respectively.

Three scenarios
Typically, CCSNe occur in different types of galaxies at different distances. Consequently, distant CCSNe appear fainter than the same nearby CCSNe because their brightness decreases with distance as
\[ F_\nu(\lambda_{\rm obs}) = \frac{(1+z)\, L_\nu(\lambda_{\rm emit})}{4\pi D_L^2}, \]
with $F_\nu(\lambda_{\rm obs})$ the observed flux as a function of the observed wavelength, $\lambda_{\rm obs}$, in units of Jy; $L_\nu(\lambda_{\rm emit})$ the emitted luminosity at the emitted wavelength, $\lambda_{\rm emit}$; $D_L$ the luminosity distance and $z$ the redshift. The observed wavelength is given by $\lambda_{\rm obs} = \lambda_{\rm emit}(1 + z)$. This implies that the SEDs of CCSNe are redshifted and a well-defined bandpass filter will sample the light from a bluer wavelength region of the intrinsic CCSN spectrum compared to the rest-frame wavelength range of the bandpass filter. In extreme cases (e.g. at high redshift) such an effect may cause a non-negligible degeneracy between dust properties and redshift.
In what follows, we define three individual scenarios that are used to test whether quantities and properties of dust formed in and around CCSNe, such as the dust mass, $M_{\rm dust}$, dust temperature, $T_{\rm dust}$, and the dust species, can be determined with neural networks.
For the first scenario, S1, we simply assumed that all CCSNe are at the same, low redshift of z = 0.0001, which corresponds to a distance of ≈ 0.43 Mpc. For comparison, the distance of SN 1987A, the closest observed extragalactic CCSN, is ≈ 0.0496 ± 0.0001 (statistical) ± 0.0005 (systematic) Mpc (Pietrzyński et al. 2019), and the next closest CCSN, SN 1885A (Fesen et al. 1989), is ∼ 0.765 Mpc away. Placing all SN model SEDs at such a short distance has the advantage that the observed model magnitudes are nearly identical to the intrinsic magnitudes of the SN model SEDs and are thus free of any possible degeneracy between dust properties and distance. Hence, we expect this scenario to be an ideal test case for the neural network. Moreover, from this scenario we can identify the smallest amount of dust detectable with JWST (see Section 2.1). For simplicity, in S1 we only considered the upper sensitivity limit of the JWST filters but did not apply the lower saturation limits.
For the second scenario, S2, we assumed that all our simulated CCSNe are uniformly distributed within the redshift range 0.0001-0.015, which corresponds to a distance range of ∼0.43-65 Mpc. The decrease in brightness and shift in wavelength with increasing distance, together with the sensitivity and saturation limits of the JWST filters (see Figures A.1 and A.2), place a limit on the distance out to which dust in CCSNe may be observed. Therefore, we chose z = 0.015 (i.e. ∼ 65 Mpc) as an upper limit. This limit is based on the SN model SEDs for which the thermal dust emission of $10^{-5}$ M$_\odot$ of carbon dust at a temperature of ∼ 2 000 K remains detectable (see Section 2.1.2) in at minimum 10 out of 28 NIRCam filters.
The data sets of scenarios S1 and S2 solely consist of synthesised magnitudes of all available JWST filters without uncertainties. Therefore, as our third test scenario, S3, we used the data set of S2 and added synthetic photometric noise. We assumed that each synthesised magnitude is 'observed' at S/N = 10, which translates into an uncertainty of 0.1 mag. This assumption is in line with what has been used to derive the detection limits (see Section 2.1.2). Hence, to create S3, we added randomly synthesised noise to the data of S2 as $m_{i,{\rm S3}} = m_{i,{\rm S2}} + \mathcal{N}(0, 0.1)$, with $m_i$ the magnitude of each JWST filter, and $\mathcal{N}(0, 0.1)$ a randomly generated number drawn from a Gaussian distribution with zero mean and $\sigma = 0.1$.
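The noise model of S3 can be sketched as follows; the function name is chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_photometric_noise(mags, sigma=0.1):
    """Scenario S3: perturb each synthesised magnitude with Gaussian noise,
    m_S3 = m_S2 + N(0, sigma); sigma = 0.1 mag corresponds to S/N = 10."""
    mags = np.asarray(mags, dtype=float)
    return mags + rng.normal(0.0, sigma, size=mags.shape)
```

Applied to the full S2 magnitude table, this produces one noisy realisation per data point while leaving the underlying target values unchanged.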

Neural networks
Our analysis is based on training a deep neural network using simulated data (see Section 2). The goal is to predict three dust quantities and properties, $T_{\rm dust}$, $M_{\rm dust}$, and the dust species, together with a prediction of their respective uncertainties. To conform with machine learning nomenclature, we refer to the set of photometric data that is synthesised from each SN model SED using the JWST filters, along with the redshift of the SN model SED, as the input features. We also refer to each SN model SED that corresponds to each set of synthesised magnitudes as a data point, since it is defined as a point in the input feature space.
In the following subsections, we describe the artificial neural networks and the corresponding hyperparameters. We also describe the specific type of neural network that we used and its corresponding optimal set of hyperparameters as well as a preprocessing method to treat the missing values in our data set. Furthermore, we describe the training process of our neural network, in which we defined target values for three dust quantities and properties. Thereafter, we explain an iterative feature selection procedure which we used to find the minimum set of the most important JWST filters, with which the dust quantities and properties can still be predicted with an acceptable accuracy.

Artificial neural networks
An artificial neural network or in short, neural network, is a set of algorithms that is used to recognise relationships in a data set, and to find patterns. The structure of a neural network is inspired by biological neurons, and thus it mimics the methodology that biological neurons use to send signals to one another. Neural networks consist of one or more layers, known as hidden layers, between an input and an output layer. Each layer contains a set of neurons. The process of training a neural network consists of transferring information from the input layer to the output layer via a set of connections. Each connection is defined between each neuron in one layer to each neuron in the next layer. There are different methods to connect neurons and to transfer information between them. In the classic framework, each neuron of a given layer is connected to all neurons in the next layer. Layers that follow this pattern are called fully connected layers. Another method for connecting neurons consists of convolutional layers, in which each neuron from a layer is only connected to a well defined set of neurons from the next layer. A neural network can be built using either one or a combination of different layers and different patterns. To transfer the information, each layer applies an activation function to a set of weights associated with a set of neurons in the layer.
The output vector of each layer is defined as
\[ a_i^l = H^l\!\left( \sum_{j=1}^{p} W_{i,j}^{l}\, a_j^{l-1} + b_i^{l} \right), \]
where $a^{l-1}$ is the input vector to layer $l$, $W_{i,j}^{l}$ is a matrix that contains the set of weights from neuron $j$ in layer $l-1$ to neuron $i$ in layer $l$, $p$ is the number of neurons in layer $l-1$, $b^{l}$ is a vector of constant values assigned to the neurons of layer $l$, known as thresholds, and $H^l$ is the activation function of layer $l$. For the input layer (i.e. $l = 0$), $a^0 = x$, where $x$ is the input feature vector of the neural network.
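This layer-by-layer propagation can be sketched for a toy fully connected network in numpy (the shapes and random weights are purely illustrative):

```python
import numpy as np

def relu(v):
    """Rectified linear unit: H(v) = max(0, v), applied element-wise."""
    return np.maximum(0.0, v)

def forward(x, weights, biases, activations):
    """Forward pass: a^l = H^l(W^l a^{l-1} + b^l) for each layer l."""
    a = x
    for W, b, H in zip(weights, biases, activations):
        a = H(W @ a + b)
    return a

# Toy network: 3 inputs -> 2 hidden neurons (ReLU) -> 1 linear output.
rng = np.random.default_rng(1)
Ws = [rng.normal(size=(2, 3)), rng.normal(size=(1, 2))]
bs = [np.zeros(2), np.zeros(1)]
y = forward(np.ones(3), Ws, bs, [relu, lambda v: v])
```

Each entry of `Ws` plays the role of $W^l$, each entry of `bs` the role of $b^l$, and the list of activations the role of $H^l$.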
The weights and the thresholds of neural networks are the model parameters that a neural network aims to optimise by improving its performance of estimating the target values. In a forward-propagation process of a neural network training, the prediction error is first calculated using random weights. The prediction errors are quantified by a 'loss function'. In a subsequent back-propagation process (e.g. Rumelhart et al. 1986), the weights are adjusted with the aim of minimising the loss. As the name suggests, the forward-propagation method iterates from the input via the hidden to the output layer, while the back-propagation proceeds in the converse direction. This combination of forward- and back-propagation takes place within one epoch of training (hereafter epoch). Typically, several epochs are required to minimise the loss function and to improve the performance of the neural network.
Since the loss function can be non-convex, and finding a global minimum of a general non-convex function is NP-hard (Murty & Kabadi 1987), a neural network can be considered optimised when the loss function is converged to a 'good' local minimum. To do so, minimisation algorithms, such as the classical gradient descent, are employed. The basic principle of such algorithms is to calculate the gradient of the loss function and step by step move in the direction as specified by the gradient, with the step size termed as the learning rate.
Choosing the right learning rate is important: with a high learning rate, the calculated loss with updated model parameters can jump over the local minimum and therefore fail to converge to it. On the other hand, with a low learning rate, the algorithm takes a long time to reach the local minimum of the loss function.
The batch gradient descent is a gradient descent optimisation method in which the neural network updates the weights only once per epoch for the entire training data set. Although this process is a fast approach for finding the local minimum of the loss function, the memory requirement for such a computational task is large. A remedy to this is to employ a mini-batch gradient descent, which allows the neural network in each epoch to update the weights for each sub-sample of the data set separately. Such a sub-sample is called a mini-batch, and the number of data points it contains is the mini-batch size.
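A mini-batch gradient descent loop can be sketched for a simple linear least-squares model (the toy model, learning rate, and batch size are illustrative, not the network used in this work):

```python
import numpy as np

def minibatch_gd(X, y, lr=0.1, batch_size=64, epochs=200, seed=0):
    """Fit a linear model y ~ X @ w by mini-batch gradient descent on the
    mean-squared-error loss: each epoch shuffles the data and performs one
    weight update per mini-batch."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            grad = 2.0 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= lr * grad  # step against the gradient, scaled by the learning rate
    return w

# Recover w_true = [2, -3] from noiseless synthetic data.
X = np.random.default_rng(2).normal(size=(500, 2))
w_hat = minibatch_gd(X, X @ np.array([2.0, -3.0]))
```

The same loop structure underlies optimisers such as ADAM, which additionally adapt the effective learning rate per parameter.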
The classical gradient descent uses a fixed learning rate for the entire process. Since this is not optimal, other types of optimisation algorithms that can adjust the learning rate, such as Adaptive Moment Estimation (ADAM, Kingma & Ba 2014) may be used instead.
Neural network parameters, such as the number of either hidden layers, neurons or epochs, the learning rate, the optimiser, the activation function for each layer and the size of the minibatch, are referred to as hyperparameters. The hyperparameters affect the efficiency and performance of the neural network and, like the model parameters, need to be optimised to reach the best possible network performance. While the process of training a neural network adjusts the model parameters, usually the hyperparameters must be manually fine-tuned for each science case and data set in question (LeCun et al. 1998;Bengio 2012;You et al. 2017;van Rijn & Hutter 2018;Weerts et al. 2020).

Our neural network
We designed a neural network to estimate a set of target values, along with their uncertainties. Our neural network aims to approximate a distribution for each target value given the input features, $x$, of each data point; the three target values correspond to the three dust properties, $y^{\rm sim}_{M_{\rm dust}}$, $y^{\rm sim}_{T_{\rm dust}}$, and $y^{\rm sim}_{\kappa}$. The neural network implements this approximation by maximising the log-likelihood of the target values under the assumption that the deviations follow a normal distribution, approximating the mean ($m_k$) and standard deviation ($\sigma^{\rm pred}_k$), which is the expected squared difference between $y^{\rm pred}$ and $y^{\rm sim}$, as
\[ \ln \mathcal{L} = -\frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} \left[ \frac{\left(y^{\rm sim}_{k,n} - m_{k,n}\right)^2}{2\,(\sigma^{\rm pred}_{k,n})^2} + \ln \sigma^{\rm pred}_{k,n} \right], \]
where $N$ is the number of data points in the data set and $K$ represents the number of target values. Therefore, each target value $y^{\rm sim}_k$ is estimated by a mean $m_k$ (hereafter $y^{\rm pred}_k$) and a standard deviation $\sigma^{\rm pred}_k$, which represents the estimated uncertainty of $y^{\rm pred}_k$.
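The corresponding training objective, the negative Gaussian log-likelihood averaged over data points, can be sketched as:

```python
import numpy as np

def gaussian_nll(y_true, mean, sigma):
    """Negative log-likelihood of targets under independent normal
    distributions N(mean, sigma^2); minimising this trains a network to
    predict both a value (mean) and its uncertainty (sigma)."""
    return np.mean(
        0.5 * np.log(2.0 * np.pi * sigma ** 2)
        + 0.5 * ((y_true - mean) / sigma) ** 2
    )
```

Note the two competing terms: the log-sigma term penalises over-estimated uncertainties, while the squared-residual term penalises under-estimated ones, so the optimum $\sigma$ matches the actual prediction error.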

Hyperparameter tuning
To find the optimal set of hyperparameters for our neural network, we first explored combinations of 3-12 convolutional and fully connected layers. Each layer can have either four, 16, 32, 64, 128, 256, or 512 neurons. We used the standard Rectified Linear Units (ReLU, Maas et al. 2013) and Parametric Rectified Linear Units (PReLU, He et al. 2015) as non-linear activation functions between the input and the hidden layers. For the output layer, we used a linear activation function to predict the mean of the target values and an exponential linear unit (ELU, Clevert et al. 2015) as activation function to predict the standard deviations of the target values. Using ELU as the activation function ensures that the estimated standard deviations are positive.
We used six different learning rates of $10^{-6}$, $5\times10^{-6}$, $10^{-5}$, $5\times10^{-5}$, $10^{-4}$, and $10^{-3}$ for the ADAM optimiser (Kingma & Ba 2014) to search for the local minimum of the loss function, with mini-batch sizes of 32 and 64 data points. By comparing the validation and training loss of the neural network with different sets of hyperparameters, we found that the optimal set of hyperparameters consists of eight fully connected layers (one input and seven hidden layers) with 512, 256, 128, 64, 32, 16, eight, and four neurons in the first to the eighth layer, respectively. Furthermore, ReLU activation functions are best used between the layers, together with a learning rate of $10^{-5}$ for the ADAM optimiser with a mini-batch size of 64 data points. The number of epochs is chosen to be 2 000 in S1 and 1 500 in S2 and S3, at which the training and validation losses have converged.

Missing data
Considering the sensitivity and saturation limits (see Section 2.2) for both the MIRI and NIRCam filters, some SN model SEDs are not detectable in all filters over the entire wavelength range. For instance, particularly bright or faint SEDs (or parts of the SEDs) result in magnitudes that either exceed the sensitivity limit or remain below the saturation limit of some filters. In reality, such cases would not lead to detections (magnitude measurements) and hence may be considered as 'missing values'. Here, for each filter, we replaced the synthesised magnitudes that fall outside the saturation and sensitivity limits with the magnitude of the sensitivity and saturation limits, respectively. This approach was inspired by the forced photometry measurements that are commonly used to study transients, for example for Pan-STARRS1 5 . In this method, when a source is detected in a filter at a specific location in the sky, photometric values are forced to be extracted in the other filters. These forced photometric values are either the actual magnitudes of the source, or the magnitude limits.
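This replacement of unobservable magnitudes by the respective limits amounts to a clipping operation per filter; a sketch with illustrative limit values:

```python
import numpy as np

def replace_missing(mags, saturation, sensitivity):
    """Replace magnitudes outside a filter's observable range with the
    corresponding limit: sources brighter than the saturation limit get the
    saturation magnitude, sources fainter than the sensitivity limit get the
    sensitivity magnitude (brighter = numerically smaller magnitude)."""
    return np.clip(mags, saturation, sensitivity)

# e.g. illustrative limits (10, 25) mag: 8 -> 10 (saturated), 30 -> 25 (undetected).
```

In practice the pair of limits differs per filter, so this is applied column by column to the photometric table.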

Neural network training preparation
To train our neural network with the set of hyperparameters defined in Section 3.3, we created a 'training-validation-test' split from each of the data sets described in Section 2.2. Specifically, out of a total of 293 236 data points, we used 70% (193 536) as the training, 15% (49 850) as the validation, and the remaining 15% as the test data set.
We normalise $y^{\rm sim}_{M_{\rm dust}}$ and $y^{\rm sim}_{T_{\rm dust}}$ of all SN model SEDs as $g(y^{\rm sim}) = y^{\rm sim} / y^{\rm sim,max}$, with $y^{\rm sim,max}_{M_{\rm dust}} = 0.1$ M$_\odot$ and $y^{\rm sim,max}_{T_{\rm dust}} = 2\,200$ K. Moreover, we define a conditional function in which we arbitrarily assign each dust species (e.g. carbon, and silicate) a target value:
\[ y^{\rm sim}_{\kappa} = \begin{cases} 1, & \text{if } y^{\rm sim} = \text{silicate} \\ 0.75, & \text{if } y^{\rm sim} = \text{a mix of carbon and silicate} \\ 0.5, & \text{if } y^{\rm sim} = \text{carbon.} \end{cases} \]
We find that inferring the dust properties from SN model SEDs that contain no dust or only very small amounts of dust at cooler temperatures (M dust < 5 × 10 −5 M and T dust < 800 K) using neural networks is challenging (see Section 5 for further explanation). Therefore, to let the neural network differentiate between these SN model SEDs and SN model SEDs that contain recognisable dust, we defined a dedicated target value for this group of 'no-dust' data points as y sim no-dust = -0.5.
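The target encoding of the last two paragraphs can be sketched as a single mapping function. This is a sketch under stated assumptions: the target value of 1 for silicate is our reading of the conditional function, and the function name and species labels are illustrative.

```python
def encode_targets(m_dust, t_dust, species):
    """Map simulated dust properties to the network's target values.

    m_dust in solar masses, t_dust in K, species one of
    'silicate', 'mix', 'carbon' (illustrative labels).
    """
    # 'No-dust' models: too little cool dust to leave a usable signature.
    if m_dust < 5e-5 and t_dust < 800.0:
        return -0.5, -0.5, -0.5
    m_norm = m_dust / 0.1       # y_max for M_dust = 0.1 M_sun
    t_norm = t_dust / 2200.0    # y_max for T_dust = 2 200 K
    kappa = {"silicate": 1.0, "mix": 0.75, "carbon": 0.5}[species]
    return m_norm, t_norm, kappa

print(encode_targets(0.05, 1100.0, "carbon"))     # (0.5, 0.5, 0.5)
print(encode_targets(1e-5, 500.0, "silicate"))    # (-0.5, -0.5, -0.5)
```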

Feature selection
SHapley Additive exPlanations (SHAP; Lundberg & Lee 2017) is a framework that uses an additive feature attribution method to evaluate the importance of a given input feature for the prediction of a neural network. In this framework, the Shapley values (Shapley 2016) are calculated for each input feature based on cooperative game theory (Nash 1953). In this theory, the contribution of each input feature to a model's output is quantified by the average marginal effect of feature i, measured over all possible coalitions of features. In an additive feature attribution method, for a model f with input feature vector x, a simplified local input feature vector x′ is defined for an explanatory model F. The simplified local input feature vector is a discrete binary vector, x′ ∈ {0, 1}^d (where d is the number of input features), meaning that each feature is either included or excluded. The explanatory model F is defined as

F(x′) = φ_0 + Σ_{i=1}^{d} φ_i x′_i ,

where φ_0 is the base value of the model in the absence of any information, defined as the average of the model's output, and φ_i is the explained effect of feature i, known as the attribution of feature i. The value of φ_i shows how much feature i changes the output of the model. The second term of the model F is the average over the marginal contributions of each feature, over all possible coalitions. The absolute value of F(x_{n,i}) − F(x_{n,−i}) indicates the importance of feature i, where x_{n,−i} represents the input feature vector x_{n,i} with feature i excluded. Therefore, the Shapley values are defined as

φ_i = Σ_{S ⊆ d \ {i}} [ |S|! (d − |S| − 1)! / d! ] [ f(S ∪ {i}) − f(S) ] ,

where the summation is over all feature subsets S ⊆ d \ {i}.
To calculate the Shapley values, all coalition values for all possible feature permutations must be sampled. Since the number of possible feature permutations grows exponentially with the number of features, for a large set of features the number of evaluations of F is immense and practically not feasible. Therefore, the SHAP framework uses a fast approximation, Deep Learning Important FeaTures (DeepLIFT; Shrikumar et al. 2016, 2017), in which a linear (first-order Taylor) approximation is used to approximate the Shapley values. In this approximation, the expectation values, E[x′], are calculated for all features and used as reference values in the input feature vector when a feature is omitted during the calculations. Since the standard error of the expectation values for N data points scales roughly as 1/√N, using approximately 1 000 data points gives an acceptable estimate of the expectation values. Therefore, in this work, for each of our three test scenarios S1, S2 and S3, we used a sample of 5 000 data points randomly chosen from the training data set to approximate the expectation values for all features (i.e. E[x′_i] for all i ∈ x′). Thereafter, we computed the Shapley values for 1 000 randomly selected data points from the validation data set (see Section B.1 for the details of the computational cost). We selected the subsamples from the validation and training data sets with a random seed that we changed for each step in the feature selection process. We then calculated the importance of each feature (i.e. filter) with index i as the mean absolute Shapley value,

φ̄_i = (1/N) Σ_{n=1}^{N} |φ_{n,i}| ,

with N = 1 000. In each step, we removed the three filters with the three lowest absolute Shapley values. Subsequently, in the next step, we trained the neural network using the reduced set of filters as the input feature vector of the entire training data set and repeated the procedure.
Considering that in each step, we removed the three least important filters, we performed the process for a total of 11 steps. Therefore, we are left with four filters out of 37 filters at the end of the process.
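The outer loop of this backward elimination can be sketched as follows. The `mean_abs_shap` stand-in returns random importances here; in the real pipeline it would retrain the network on the active filters and average |Shapley value| per filter over the validation subsample. Filter names are placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
filters = [f"F{i:03d}" for i in range(37)]   # placeholder filter names

def mean_abs_shap(active):
    """Stand-in for the SHAP step: random importances instead of the
    mean absolute Shapley values of a retrained network."""
    return {f: rng.random() for f in active}

active = list(filters)
steps = 0
while len(active) > 4:                 # stop at the final four filters
    importance = mean_abs_shap(active)
    worst = sorted(active, key=importance.get)[:3]   # 3 least important
    for f in worst:
        active.remove(f)
    steps += 1

print(steps, len(active))              # 11 4
```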

Evaluation
In this section we describe the evaluation metrics chosen to assess the performance of our trained neural network. We address how we interpreted the resulting predictions for the dust species and how we treated the no-dust models in the performance evaluation. Furthermore, we define criteria to estimate the reliability of the predictions via the predicted standard deviations that are output by the neural network. Finally, we describe the metrics for comparing the performance of the neural network in different steps of the feature selection process.

The performance evaluation of the predicted target values, y_pred,Mdust, y_pred,Tdust and y_pred,κ, is applied to the test data sets and consists of three individual metrics: root-mean-square error (RMSE), bias, and 3σ outliers. For the dust temperature, the residual of data point n is defined as Δy_Tdust,n = y_pred,Tdust,n − y_sim,Tdust,n. For the dust mass, due to the logarithmic distribution of M_dust in the simulated data set, we define the residual as Δy_Mdust,n = log10(y_pred,Mdust,n / y_sim,Mdust,n). For both M_dust and T_dust, the bias is defined as the mean of the residuals, (1/N) Σ_{n=1}^{N} Δy_n, and the RMSE is defined as

RMSE = sqrt( (1/N) Σ_{n=1}^{N} (Δy_n)^2 ) ,

where n represents each data point and N is the number of data points in the test data set. Furthermore, for M_dust and T_dust we define the 3σ outliers as the predictions with |Δy_n| > 3 × RMSE.

Moreover, due to the numeric representation of all dust species (see Section 3.5) that are fed to the neural network, numeric target values are predicted. In order to interpret these numeric target values, we define each dust species as a 'class'. This way, we have the following classes: silicate, mixed, carbon and no-dust, which we assign via a conditional function on the predicted numeric value. Furthermore, to evaluate how well the neural network predicts the dust species, we used the definitions of true and false positives and true and false negatives (e.g. Fawcett 2006) to build a confusion matrix.
The classification accuracy for each dust species class is defined as the fraction of correct predictions out of the total number of predictions of each class from the neural network.
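The regression metrics and the per-class accuracy above can be sketched directly from their definitions; the function names and example values are illustrative.

```python
import numpy as np

def evaluate(pred_m, sim_m, pred_t, sim_t):
    """RMSE, bias, and 3-sigma outlier fraction as defined in the text:
    dust-mass residuals in dex, temperature residuals in K."""
    residuals = {
        "M_dust": np.log10(pred_m / sim_m),
        "T_dust": pred_t - sim_t,
    }
    results = {}
    for name, res in residuals.items():
        rmse = np.sqrt(np.mean(res**2))
        results[name] = {
            "bias": float(np.mean(res)),
            "rmse": float(rmse),
            "outliers": float(np.mean(np.abs(res) > 3 * rmse)),
        }
    return results

def class_accuracy(pred_cls, true_cls, cls):
    """Fraction of correct predictions out of all predictions of a class."""
    predicted = [p == cls for p in pred_cls]
    correct = [p == cls == t for p, t in zip(pred_cls, true_cls)]
    return sum(correct) / sum(predicted)

m = evaluate(np.array([0.01, 0.1]), np.array([0.01, 0.1]),
             np.array([510.0, 490.0]), np.array([500.0, 500.0]))
print(m["T_dust"]["rmse"])   # 10.0
acc = class_accuracy(["carbon", "carbon", "silicate"],
                     ["carbon", "mixed", "silicate"], "carbon")
print(acc)                   # 0.5
```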
Moreover, we investigated whether the predicted uncertainties can be used to reliably filter out uncertain predictions. For this, we assumed that the errors in the predicted quantities are approximately normally distributed with the mean and variance predicted by our model. Then, given a chosen confidence level, the central confidence interval of the predicted quantity is

y_pred,k,n − a1 σ_pred,k,n < y_k,n < y_pred,k,n + a1 σ_pred,k,n ,

where y_pred,k,n and σ_pred,k,n are the predicted mean and standard deviation of the kth target value for the nth data point, and y_k,n is the unknown true value. The factor a1 is a parameter that depends on the chosen confidence level, where the values a1 = 1, 2, 3 give rise to the 68%, 95%, and 99.7% confidence levels, respectively.
With this, we define a threshold for the acceptable relative error, a2, and accept a predicted mean value as (likely) accurate if the width of the confidence interval is small compared to y_pred,k,n. For M_dust and T_dust this yields the criterion

a1 × σ_pred,k,n < a2 × y_pred,k,n , (7)

and for the dust species we use

a1 × σ_pred,κ,n < a2 . (8)

In the following, we use a2 = 0.2 and a1 = 1. If a prediction satisfies Equations 7 and 8, we say that it has a reliable standard deviation (σ_pred). To compare the performance of the neural network in each step of the feature selection process, we used two values from the neural network output: i) the values reached by the loss function (i.e. Equation 4) for the training and validation data sets at the end of the training process; ii) the ratio of the number of predictions that have σ_pred to the total number of predictions of the test data set (hereafter Rσ_pred). Since we chose a fixed set of hyperparameters (see Section 3.3), for instance a fixed number of epochs, the minimum loss achieved by the neural network in each step of the feature selection process can differ from the 'absolute or true' minimum that could be achieved if the hyperparameters were re-adjusted for each step. This is independent of the chosen subset of JWST filters and happens in all scenarios. Ideally, in order to reach the lowest possible loss, one should re-adjust the hyperparameters for each step. However, this is a very time consuming process. Additionally, it would make the entire feature selection process dependent on the training data set as well as on the subset of JWST filters, while not providing further relevant information for the steps necessary to obtain the final preferred subset of JWST filters.

Table 2. Comparison of neural network performance for estimating M_dust and T_dust in different scenarios for four different cases. In case-1 and case-2, the training data set contains the preferred subset of JWST filters. In case-3 and case-4, the data set with the minimum subset of JWST filters is used to train our neural network. In case-1 and case-3, the evaluation metrics are applied to the entire test data set. In case-2 and case-4, the evaluation metrics are applied to the predictions of the test data set that have σ_pred. Notes. With the subset of JWST filters that are selected via the feature selection procedure: (a)-(f): see the definitions in Table 2.
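The reliability criterion of Equations 7 and 8 reduces to two inequalities; a minimal sketch, with an illustrative function name and example values:

```python
def is_reliable(y_pred, sigma_pred, a1=1.0, a2=0.2, dust_species=False):
    """Accept a prediction when its a1-sigma interval is narrow enough:
    a1*sigma < a2*y_pred for M_dust and T_dust (Eq. 7), and
    a1*sigma < a2 for the dust-species output (Eq. 8)."""
    if dust_species:
        return a1 * sigma_pred < a2
    return a1 * sigma_pred < a2 * y_pred

# A temperature prediction of 1000 K with sigma = 100 K passes
# (100 < 0.2 * 1000), while sigma = 300 K does not.
print(is_reliable(1000.0, 100.0))   # True
print(is_reliable(1000.0, 300.0))   # False
```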

Caveats
Typically, very low amounts of dust (less than about 10 −5 M ) are not easily observable in SNe, since the thermal dust emission is rather weak at the expected wavelengths. This means that in some of our SN model SEDs that contain such low amounts of dust, the thermal dust emission in the simulated SEDs may either not be clearly discernible from the emission of the SN or generally remains below the detection capabilities of JWST. Such SN model SEDs that exhibit barely noticeable or no dust signatures may therefore also remain largely unrecognised by our neural network.
In what follows, we trained the neural network on the synthesised photometric data set for S1 to identify the SN model SEDs in this data set with the lowest M_dust and T_dust that can still be recognised by the neural network. We find that for SN model SEDs with M_dust < 5 × 10^−5 M⊙ and T_dust < 800 K, the predicted dust properties have very large uncertainties (i.e. |Δy_Tdust| ≳ 0.5 × y_sim,Tdust), causing so-called catastrophic outliers. Consequently, we trained the neural network again, but this time labelled such SN model SEDs as 'no-dust' data points (see Section 3.5), similar to the SN model SEDs that indeed contain no dust. Figure 2 presents an example set of such no-dust SN model SEDs with M_dust and T_dust below the aforementioned thresholds. Because the dust properties and uncertainties predicted from the no-dust SN model SEDs are highly unreliable, we did not include these models in subsequent performance evaluations of the dust properties.

Results
We investigated whether a neural network can be used as an effective tool to determine different properties of the dust that formed in and around CCSNe from their spectral energy distributions. Since the number of observed SNe is too sparse for such an endeavour, we simulated a total of 293 236 SN SEDs (referred to as SN model SEDs), each with different dust properties. Then, we convolved each SN model SED with the entire suite of JWST NIRCam + MIRI bandpass filters (see details in Section 2) to synthesise a photometric data set that is suitable for machine learning purposes.
For a step by step analysis we considered three different scenarios, which are described in more detail in Section 2.2. In short, for the first scenario, S1, all SN model SEDs are placed at the same, low redshift, z = 0.0001. In the second scenario, S2, we uniformly distributed the SN model SEDs within the redshift range 0.0001-0.015. In the third scenario, S3, we used the data set of S2 and added random noise that corresponds to a photometric uncertainty of 0.1 mag (see the details in Section 2.1.2). Comparing the outcome of these scenarios allowed us to examine how strongly the performances of the neural network and the feature importance change for our simulated data that are equipped with properties of real observations.
In our approach, we trained our neural network to predict the distribution of dust quantities given the SN model SEDs, p(y_sim|x) ≈ N(y_sim; y_pred, σ_pred). To evaluate how well our estimated uncertainties align with the prediction errors, we analysed the distribution of the normalised prediction errors, (y_pred − y_sim)/σ_pred. Under perfect neural network modelling circumstances, the distribution of these normalised values should follow a standard normal distribution. Figure 3 shows histograms of the normalised prediction errors for M_dust and T_dust of a test data set predicted by the neural network trained on the entire set of JWST filters in S3, excluding the predictions that the neural network classifies as no-dust. By fitting a normal probability distribution function to the normalised prediction errors, we infer a mean of 0.04 and a standard deviation of 0.94 for the M_dust distribution. For T_dust, the inferred mean and standard deviation are −0.001 and 0.76, respectively. The inferred standard deviations for M_dust and T_dust are thus 6% and 24% lower than for a standard normal distribution. This might indicate that the predicted uncertainties, σ_pred, overestimate the prediction errors.
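This calibration check can be sketched on synthetic stand-in data: if the predicted uncertainties were perfectly calibrated, the normalised errors would have unit standard deviation, whereas an over-estimated sigma drives it below one. All values here are illustrative, not the paper's data.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic stand-ins: true values, a constant predicted sigma of 50,
# and predictions whose actual error scatter is only 40 (i.e. the
# network over-estimates its uncertainty).
y_sim = rng.uniform(200.0, 2000.0, size=10_000)
sigma_pred = np.full_like(y_sim, 50.0)
y_pred = y_sim + rng.normal(0.0, 40.0, size=y_sim.size)

# Normalised prediction errors; std below 1 flags over-estimated sigma.
z = (y_pred - y_sim) / sigma_pred
print(round(z.mean(), 2), round(z.std(), 2))
```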
For each scenario, S1, S2, and S3, we discuss four cases of performance evaluation. For case-1 and case-2 we evaluated the performance of our neural network trained on data sets that consist of the preferred subsets of JWST filters (see Section 7 for further discussion of the selection of preferred subsets). For case-3 and case-4 we evaluated the performance of our neural network trained on data sets that are constructed with a minimum subset of JWST filters (see the definition in Section 4), with which the different dust quantities are still predicted with an acceptable level of accuracy. The latter means that the fraction of reliable predictions, out of the entire test data set, is at least 5%. Furthermore, for case-1 and case-3 we apply the evaluation metrics to the entire test data set. For case-2 and case-4, we apply the metrics only to the subsample of the test data set that satisfies the criteria for reliable predictions as defined in Section 4.
Tables 2 and 3 summarise the outcome of the case-by-case performance evaluations of our neural network for predicting M_dust and T_dust and classifying the dust species for all three scenarios S1, S2 and S3. Out of all scenarios and cases we find that in S1 and for case-2, the RMSE of both M_dust and T_dust is the smallest and Rσ_pred is maximal. For case-2, the RMSE of M_dust increases from ∼ 0.05 dex in S1 to ∼ 0.1 dex in S2 and ∼ 0.11 dex in S3. For T_dust, however, the RMSE increases from about 14 K in S1 only to ∼ 18 K in S2, and to ∼ 30 K in S3. From S1 to S3, in case-2, the fraction of 3σ outliers for the M_dust target values increases.
The bias of the T_dust predictions for most of the scenarios in case-3 and case-4 is negative. This indicates that the neural network underestimates the T_dust target values (i.e. y_pred,Tdust). For case-1 and case-2 the bias for T_dust is positive in all scenarios, indicating that the neural network overestimates the T_dust target values. For instance, in case-3 for S1, the bias of 0.013 dex for M_dust means that the average of the M_dust estimates over the entire test data set is about 10^0.013 ≈ 1.03 times larger than the simulated M_dust. For T_dust, the average of the T_dust estimates over the entire test data set is about 1 K higher than the simulated T_dust values.
As shown in Table 3, the highest classification accuracy for dust species is achieved for S2 for case-2. For this, we find a classification accuracy of 97%, 98%, and 100% for carbon, mixed, and silicate dust, respectively. Comparing the classification accuracy for each dust species, we find that for all scenarios and all cases, silicate dust is predicted with the highest accuracy. Carbon dust is predicted least accurately in all scenarios and cases, except in S3 for case-4. There, the SN model SEDs that are labelled as mixed dust are predicted with the lowest accuracy (57%). In case-4 and S3, 42% of the mixed dust species are predicted as carbon dust.
In Figures 4, 5, and 6, the performance of the neural network is shown for case-1 and case-2 for all scenarios. Overall, the performance of the neural network for case-2 is better than for case-1. As illustrated in the top panels of Figures 4 and 5, the dispersion of the predictions around the diagonal line that represents predicted values equal to simulated values, increases from S1 to S2 for both target values M dust and T dust . Moreover, as summarised in Table 3 the classification accuracy decreases for all dust species from S1 to S3 in case-1.
As shown in Figure 4, for S1 the reliable predictions for M_dust and T_dust range between about 6 × 10^−5 − 10^−1 M⊙ and 100-1 400 K, respectively. However, Figure 5 shows that in S2, the reliable predictions only range between about 10^−4 − 5 × 10^−2 M⊙ and 250-1 200 K for M_dust and T_dust, respectively. Figure 6 shows that in S3, the reliable predictions lie within 5 × 10^−4 − 10^−1 M⊙ for M_dust and 250-1 000 K for T_dust. This means that the dust mass and temperature range of the reliable predictions for all cases shrinks from S1 to S2 to S3, and thus with the increased complexity of the scenarios.

Figure 7 presents the performance of the neural network with the subsets of the JWST filters that are selected in each step of the feature selection process. The bottom panel compares the training losses obtained for the last epoch at all feature selection steps for all three scenarios. The validation losses for S3 are also included in Figure 7. It is evident that for S3, the validation loss closely follows the training loss. We find the same for the other two scenarios, although the loss values vary more drastically from step to step. The minimum of both the validation and training loss appears to be reached in step zero of the feature selection process for S3, while for both S1 and S2 it is reached in step five. However, we find for S3 that both the training and validation loss slowly increase from step zero to eight by about 5%.

Feature selection
The first three panels of Figure 7 show the performance evaluation of the neural network on the test data sets. The RMSE for T_dust varies only minimally around a mean value of about 58 ± 9 K until the last step, where it increases to about 240 K. The RMSE of M_dust remains similarly constant over the first eight steps, except for step two, and increases by about 0.45 dex from step eight to eleven. The classification accuracy for carbon dust and the mixed composition also changes only minimally over the first eight steps, but decreases from about 70% to about 50% between steps eight and eleven. For silicate dust, the classification accuracy remains nearly 100% over all steps.

Discussion
The performance evaluation of our trained neural network, which is designed to predict dust properties such as M_dust, T_dust and the different dust species, demonstrates that neural networks can be a powerful tool if a sufficiently large data set is at hand. One advantage of such a method is that it provides a good estimate of the prediction uncertainty for each dust property under consideration. For other common methods, such as fitting a simple modified black body function or combinations thereof, uncertainties on the fitted dust mass or dust temperature are often not obtained (e.g. Gall et al. 2011b, and references therein). Furthermore, because such fitting methods require assumptions about the dust composition to be made prior to fitting, the parameter range can be large and often is not explored in full detail. The reasons for this may include insufficient data quality, but also time and computational limitations. These issues also apply to more sophisticated dust models such as MOCASSIN when used to fit observational data to obtain the amount and temperature of dust in and around SNe (see e.g. Wesson et al. 2015).

Limitations of the model dataset
For the purpose of running a large number of models in a reasonable amount of time, we made some simplifying restrictions to the parameter space that our models cover. Some of these simplifications may have a significant effect on the predicted dust quantities from the SEDs. Our models used a single grain size only, selected from a uniform distribution in log-space. In the interstellar medium, the grain size distribution may be approximated by a Mathis et al. (1977, hereafter MRN) distribution, in which the number density of grains of radius a is proportional to a −3.5 . As this power-law distribution arises from collision and fragmentation processes over a long timescale, it is unlikely to be applicable to the dust grains found in and around CCSNe. A single grain size may be a more reasonable approximation.   Observational studies tend to find evidence for large grains (e.g. Gall et al. 2014;Wesson et al. 2015;Owen & Barlow 2015;Bak Nielsen et al. 2018). If a population of grains grows by accretion, then according to the standard grain growth equation, the increase in radius with time does not depend on the initial radius of the grain. A size distribution will therefore become narrower as accretion proceeds, unless fragmentation is also taking place.
In Figure 11, to illustrate the effect of using a single grain size as opposed to a distribution, we show the example SEDs for 20 models characterised by a single grain size, evenly spaced logarithmically between 0.005 µm and 0.5 µm, together with the example SED for an MRN dust distribution. It can be seen that the SED for the full grain size distribution is almost identical to the SED for a single grain size of 0.15 µm.
The calculation of a spectral energy distribution from thermal dust emission fundamentally depends on the choice of optical constants. Different literature sets of optical constants may differ significantly from each other, and the dust actually present in and around a SN may not be well represented by the materials from which optical constants have been determined. The choice of optical constants thus introduces a systematic uncertainty into the dust mass and temperature estimates.
In a future work, we plan to investigate this more thoroughly by using the neural network to classify SEDs calculated using different optical constants from those on which the network was trained. In this work, however, we used only two species of dust, and only one set of optical constants for each species. Dust in SNe is widely assumed to be either carbonaceous, silicaceous, or a mixture, and our SEDs are calculated using widely used optical constants for these species. However, different choices of optical constants can yield significantly different SEDs. To illustrate this, we show in Figure 12 the variations in example SEDs for one example model. The model has a 50:50 silicate:carbon grain mixture, and Figure 12 shows the example SEDs for a single grain size of 0.1 µm, using all possible combinations of optical constants from four sets of carbon data (Hanner 1988, hereafter H88, and the ACAR, ACH2 and BE samples from Zubko et al. 1996, hereafter Z96) and four sets of silicate constants (Draine & Lee 1984; Laor & Draine 1993, hereafter DL84 and LD93, respectively; and the oxygen-deficient and oxygen-rich constants from Ossenkopf et al. 1992, hereafter O92).
It is clear from Figure 12 that different choices of optical constants can result in significant differences in some wavelength regions of some SEDs. The 1-10 µm region appears to be particularly affected. However, the differences are largest for relatively small grains and are negligible for grains as large as 5 µm. As mentioned, many observational studies of dust in young and old SNRs have found evidence for generally large grains, which tends to reduce the uncertainty due to the choice of optical constants. Additional comparisons for more grain sizes and for pure carbon and pure silicate compositions are given in Appendix D.

Performance evaluation
Our performance evaluation demonstrates that for all scenarios and cases (see Table 2), the obtained prediction error (RMSE) for M_dust is smaller than ∼ 0.55 dex, and smaller than ∼ 78 K for T_dust. These RMSE values are obtained for case-3 and S3 and are the maximum out of all scenarios and cases. This is because in case-3 the evaluation metrics are applied to the test data sets with the minimum subsets of JWST filters of each scenario S1, S2 and S3. Moreover, for the evaluation of case-3 the test data sets are used without a prior σ_pred cut and hence contain predictions with larger uncertainties. Additionally, S3 is the most complex of all scenarios. However, compared to other works in the literature with inferred amounts and temperatures of dust from observed SNe, we find that even the worst performance in this work constitutes a very good performance. For example, we can compare to other works that estimate the amount of dust from Spitzer Space Telescope observations up to about 25 µm for SNe such as SN 2004et (Kotak et al. 2009) and SN 1987A (Ercolano et al. 2007). For SN 2004et, the estimated uncertainty ranges for dust mass and dust temperature at 300, 464 and 795 days after the explosion are about 0.37 dex and 500 K, 0.26 dex and 250 K, and 0.38 dex and 80 K, respectively. For SN 1987A, the amount of carbon dust at day 615 has been estimated with an uncertainty of 0.81 dex.

Table 4. Preferred and minimum subsets of JWST filters obtained from the feature selection process and used to estimate M_dust, T_dust, and the dust species, shown for all three scenarios. Columns termed 'Pref.' and 'Min.' stand for the preferred and minimum subset of JWST filters, respectively.

Fig. 7. Loss values achieved by the training and validation data sets at the end of each training process of the neural network in S1 (downward triangles), S2 (circles), and S3 (X symbols). The empty symbols mark the training loss for each step. The filled symbols mark the validation loss for each step. The grey shaded region represents the area between the training and validation loss in S3. The single panels show the RMSE of T_dust (K) and M_dust (M⊙), and the classification accuracy (%) for predicting the dust species for the test data sets in S3, from bottom to top. The classification accuracy for predicting carbon and silicate dust species, and a mixture of them, are shown with circles, triangles, and dashes, respectively.
We now turn to the performance evaluations of the most reliable predictions, which are drawn from the case-1 and case-3 data sets that have σ_pred, and evaluated as case-2 and case-4. Comparing the RMSE for M_dust and T_dust between case-1 and case-2 (the cases with the data sets that contain the preferred subsets of JWST filters) across the three scenarios, S1, S2 and S3, shows that the prediction errors are reduced by up to a factor of about 2-3 in case-2, where the predictions that do not have σ_pred are excluded. Since this is an expected, but not guaranteed, consequence of including only the predictions that have σ_pred, which removes 'bad' predictions that do not fulfil the criterion, the same is expected for the cases with the data sets that contain the minimum subsets of filters (case-3 and case-4). Our evaluations show that the effect of excluding the unreliable predictions is even stronger for the M_dust estimates than for T_dust: the RMSE (in dex) of the dust mass is smaller by about a factor of 4-5 in case-2 and case-4 compared to case-1 and case-3, while for T_dust the decrease is only about a factor of 2. The accuracy of classifying the different dust species shows the same behaviour, being higher for nearly all species and scenarios in case-2 and case-4 than in case-1 and case-3. Particularly for silicate dust, the classification accuracy is close to or at 100%. However, in case-4 and S3, there is a bias in predicting the mixed dust species towards the carbon dust species. The evaluation method using the σ_pred definition demonstrates that the dust mass and temperature predictions that have passed the scrutiny of the σ_pred criterion can truly be considered reliable predictions.

Table 5. Comparison of neural network performance for estimating M_dust and T_dust in scenario 3 for case-2, with the same definitions as in Table 2. Two test cases, with test data sets with S/N = 3 and S/N = 10, respectively, are evaluated. In both test cases the training data set is the same as in case-2. The evaluation metrics are applied to the reliable predictions from each test case.

Table 6. Comparison of neural network performance for classifying dust species, and the fraction of predictions of the test data set that have σ_pred relative to all predictions from the test data set (Rσ_pred), in scenario 3 for case-2 as defined in Table 2. Notes. With the subset of JWST filters that are selected via the feature selection procedure: (c): see the definitions in Table 2.
On the other hand, as shown by Rσ pred in Table 3, the number of predictions that satisfy the σ pred criteria in case-2 and case-4 is smaller than the predictions using the entire data set as in case-1 and case-3. Since there are 3σ outliers (as defined in Section 4) also for case-2 and case-4, the fractions of the best reliable predictions for S1, S2 and S3 in case-2 and case-4 are smaller than Rσ pred . For instance, in case-2 for S1 the fraction of the best reliable predictions is still about 59% while in case-4 and S3 it shrinks to only about 5.8%.
Comparing the number of 3σ outliers between M dust and T dust , we find that for nearly all setups of cases and scenarios, the M dust evaluations result in a larger number of 3σ outliers than the T dust evaluations. This is because the dispersion of M dust residuals is larger than T dust residuals.

Filter selection
Since observing SNe with all JWST filters at the same time is practically not feasible, we are interested in finding the smallest set of filters with which an acceptable performance can be achieved. To do so, we utilised the feature selection process described in Section 3.6. From this we obtain two sets of filters for each scenario: one preferred set of filters and one minimum set of filters. The preferred filter set is chosen based on the absolute minimum reached by both the training and the validation loss, while the minimum filter set is chosen based on the criterion that the fraction of the number of reliable predictions to the total number of predictions is larger than 5%. It turns out that for S1 and S2 the preferred filter set is reached early in the filter selection process, at step five, and thus still contains a large number of filters (22 filters). The minimum filter set is obtained at step eight or ten, and thus contains fewer filters, between seven and thirteen. Looking at the performance evaluation of the two filter sets (for example, case-1 and case-3 or case-2 and case-4 in Table 2), the performance of the neural network is, as expected, overall better with the preferred set of filters, while the performance with the minimum filter set is only minimally lower. Hence, as demonstrated in Table 2, accurate predictions of T_dust and M_dust can be achieved with the minimum set of filters.
For scenario S3, the preferred and the minimum subsets of filters are chosen from steps eight and nine and are thus very close to each other. It is important to note that in this case the preferred set is chosen at step eight instead of step zero, where the loss reaches its absolute minimum. However, we do not consider step zero as 'preferred': since both the training and validation loss remain rather stable until step eight, as pointed out in Section 6.1, step eight can be considered as preferred.
As illustrated in Figure 7, in S3 compared to S2 there are only insignificant changes of the loss values in each step of the feature selection process up to step nine. This stability of the performance of the neural network in S3, regardless of the number of filters used as input features, can be attributed to training the neural network with additional noise: training with added noise can be equivalent to a regularisation (Bishop 1995), which makes the neural network less sensitive to variations of the input features. Therefore, in S3 compared to S2, the training and validation losses achieved by the neural network with smaller sets of filters do not change significantly in each step of the feature selection process up to step nine.

Figure 8 visualises the resulting Shapley values obtained for each step in scenario S3. Figures C.1 and C.2 show the same for S1 and S2, respectively. It is interesting to note that for all three scenarios, none of the narrow-band JWST filters are amongst the minimum subsets of the JWST filters. For S1 and S2, two such narrow-band filters are included in the preferred filter set, albeit with small Shapley values and hence marginal importance. This implies that real observations of SNe with such JWST narrow-band filters would have the least impact on estimating dust properties with our neural network.

Fig. 8. Importance of JWST filters for estimating the amount, temperature and the dust species in S3. The normalised feature importance (φ i) of each NIRCam (blue) and MIRI (red) filter in each step of the feature selection process is shown by the size of the filled circles, which are scaled to three values in the legend. The preferred and minimum subsets of JWST filters are highlighted with boxes using a dash-dotted and a solid line, respectively.
As shown in Figure 8, and as expected, the MIRI filters covering the longer wavelength region are crucially important for estimating the dust properties, while the shorter wavelength NIRCam filters do not seem to play a significant role.
One of the most pressing questions, of course, is whether it is technically feasible to construct an observing run with the minimum subset of filters. The NIRCam instrument uses a dichroic to split the incoming radiation into two wavelength ranges, λ < 2.5 µm and λ > 2.5 µm, known as the short and long wavelength channels (Horner & Rieke 2004). This setup allows two images with two different filters, one from each channel, to be obtained simultaneously. Since, in the minimum subset for S3, two selected NIRCam filters, F070W and F140M, are in the short wavelength channel of NIRCam and two, F356W and F480M, are in the long wavelength channel, two separate runs are required to observe a SN with all four NIRCam filters. For MIRI, observations can only be conducted with one filter at a time. The entire observing time needed for all selected MIRI filters of the preferred subset may in the end depend on the brightness of the SN, the desired or best achievable signal-to-noise ratio, and the phase of the SN.
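The pairing argument above can be made concrete with a small scheduling sketch (our own illustration; the pivot wavelengths below are approximate values for the four NIRCam filters in the S3 minimum subset, and the pairing heuristic is hypothetical, not an official JWST planning tool):

```python
# NIRCam's dichroic allows one short-wavelength (<2.5 um) and one
# long-wavelength (>2.5 um) filter per exposure; MIRI images one filter
# at a time. Approximate pivot wavelengths in microns.
NIRCAM_PIVOT_UM = {"F070W": 0.70, "F140M": 1.40, "F356W": 3.56, "F480M": 4.81}

def nircam_runs(filters, pivot=NIRCAM_PIVOT_UM):
    """Group NIRCam filters into simultaneous (short, long) exposure pairs."""
    short = [f for f in filters if pivot[f] < 2.5]
    long_ = [f for f in filters if pivot[f] >= 2.5]
    runs = []
    # Pair one short-channel with one long-channel filter per run;
    # leftover filters get a run with an empty slot (None).
    while short or long_:
        runs.append((short.pop(0) if short else None,
                     long_.pop(0) if long_ else None))
    return runs
```

Applied to the S3 minimum subset, this yields two NIRCam runs, consistent with the two separate runs stated above.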

Additional testing of the performance of the neural network
Our simulated data set is simplified by various assumptions, such as a uniform S/N=10. In reality, the achieved S/N depends on several aspects, such as the brightness of the object in a given filter band, the distance to the object, or the exposure time and integration setup. The JWST Exposure Time Calculator is an ideal tool to adjust all these aspects. For bright sources a high S/N is achievable even with a short exposure, while for faint sources long exposure times may be necessary to reach just a minimum significance of S/N≈3. While simulating more realistic S/N ratios for each filter band assuming different exposure times is possible, it is computationally expensive; hence, we decided to first test the neural network performance for a simple case, a uniform S/N=10, which represents neither particularly good nor bad data. However, to better understand the effect of better or worse data, we tested the performance of our neural network for scenario S3 and case-2 on two test cases with test data sets assuming S/N=20 and S/N=3, respectively. The results are summarised in Tables 5 and 6 and presented in Figure 6.1. For the test case representing higher quality data with S/N=20, the RMSE of M dust and T dust are ∼0.12 dex and ∼32 K, respectively. For the other test case with S/N=3, the RMSE of the M dust and T dust predictions are ∼0.42 dex and ∼88 K. As expected, the performance of the neural network is worse for the test case with S/N=3 than for S/N=10, while for the test case with S/N=20 the performance remains similar. We note that, since the neural network has been trained for S/N=10, our predictions are somewhat over-confident for the S/N=3 test case and under-confident for S/N=20.
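The connection between a target S/N and the photometric uncertainty can be sketched with the standard small-error approximation σ_m ≈ 2.5/(ln 10 · S/N) ≈ 1.086/(S/N) mag (a generic numpy sketch of how noisy test photometry at a given S/N could be generated; the exact noise model of our pipeline is described in Section 2):

```python
import numpy as np

def mag_uncertainty(snr):
    """Magnitude error for a given S/N, valid for small fractional
    flux errors: sigma_m = 2.5 / (ln 10 * S/N) ~ 1.086 / S/N."""
    return 2.5 / np.log(10) / snr

def add_noise(mags, snr, rng=None):
    """Perturb an array of magnitudes with Gaussian noise at fixed S/N."""
    rng = np.random.default_rng(rng)
    sigma = mag_uncertainty(snr)
    return mags + rng.normal(0.0, sigma, size=np.shape(mags))
```

At S/N=10 this gives σ_m ≈ 0.11 mag, at S/N=20 about 0.05 mag, and at S/N=3 about 0.36 mag, which illustrates why the S/N=3 test case degrades the predictions so strongly.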
The final test of the usability of our neural network, which has been trained on a simulated data set exploring a wide, but not exhaustive, range of parameters, is to use true observational data. Hence, we used the spectrophotometric observations of SN 1987A taken with the Kuiper Airborne Observatory at 615, 632 and 638 days past explosion (referred to as the 615 day epoch) (Wooden et al. 1993; Moseley et al. 1989), as this epoch shows a clear signature of dust formation in the ejecta. The data cover a wavelength range of 0.33-29.5 µm. Furthermore, Wesson et al. (2015) also fitted MOCASSIN models to the same data; their best fit results in about 1 × 10⁻³ M⊙ of dust for a clumpy model with an 85:15 carbon:silicate ratio and temperatures of 252 ± 29 K for carbon and 316 ± 31 K for silicate dust. This is a larger dust mass than the best fitting models of Ercolano et al. (2007), who obtained ∼ 2 × 10⁻⁴ M⊙ at similar temperature, while Wooden et al. (1993) obtained about 3.1 × 10⁻⁴ M⊙ of graphite dust at about 400 K, assuming a smooth dust distribution.
Here, we created a small test data set which consists of SN 1987A data at 615 days that were replicated 500 times, each assigned a different redshift which was chosen randomly from a limited redshift range (0.0006-0.004). This ensures that the data are within the saturation and detection limits. We applied a Gaussian smoothing operator, which enabled interpolation between the data gaps at 1.02 -1.48 µm and 12.67 -17.32 µm and convolved the data with the JWST bandpass filters. We used the trained neural network of scenario S3, first, including all JWST bandpass filters (scenario S3) and second, using only the preferred set of the filters (S3, case-2) to predict the dust mass and temperature as well as the dust grain composition (carbon, silicates or 50:50 mix).
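The convolution of a (redshifted) SED with a bandpass filter amounts to a transmission-weighted average of the flux (a minimal numpy sketch of this synthetic-photometry step; the trapezoidal helper and function names are ours, and the actual pipeline may weight by photon counts rather than flux):

```python
import numpy as np

def _trapz(y, x):
    # Trapezoidal integration (avoids NumPy version differences
    # between np.trapz and np.trapezoid).
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def observed_wavelengths(wave_rest_um, z):
    """Redshift stretches the SED: lambda_obs = lambda_rest * (1 + z)."""
    return np.asarray(wave_rest_um) * (1.0 + z)

def band_flux(wave_um, flux, filt_wave_um, filt_trans):
    """Bandpass-averaged flux: int F(l) T(l) dl / int T(l) dl, with the
    transmission T interpolated onto the SED grid and zero outside."""
    t = np.interp(wave_um, filt_wave_um, filt_trans, left=0.0, right=0.0)
    return _trapz(flux * t, wave_um) / _trapz(t, wave_um)
```

Repeating `band_flux` for every JWST filter turns each of the 500 replicated, redshifted SN 1987A SEDs into the photometric feature vector fed to the trained network.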
The results are shown in Figures 13 and 14. A trend with redshift appears for all predictions in both cases. We find that with increasing redshift, the dust mass and temperature predictions increase and the predicted dust species leans more towards silicates. Using all JWST bandpass filters, we obtain dust masses that are predicted with 99.7% confidence to range between a few times 10⁻⁴ and 10⁻³ M⊙ and temperatures between ∼ 280 and 340 K, in agreement with previous estimates in the literature. The estimated dust species is carbon or a mix of carbon and silicates. In the case of using only the preferred filter set (see Section 7.3), at very nearby distances the results show that with 99.7% confidence the predicted dust mass is not larger than 2-4 × 10⁻³ M⊙, while at z ≳ 0.003 the predicted dust mass ranges from 10⁻³ to 10⁻² M⊙ for a predicted dust species that can either be mixed or silicates. For all predictions, the temperature range overlaps with that from the first case using all JWST filter bands.
This shows that our dust mass and temperature predictions for SN 1987A at 615 days are comparable to those in the literature and hence reasonable for SN 1987A-like SNe. However, we note that while the dust temperature predictions fulfil the σ_pred criterion, the dust mass predictions do not. Moreover, using the preferred JWST filter set results in silicates as the dominant dust species, which disagrees with what is found in the literature. A possible reason for this may be ascribed to our simplified training data set and limited parameter range. Despite this, our method can be a promising tool to analyse signatures of dust in and around SNe in their SEDs. In forthcoming work, we aim to use more detailed and realistic simulations to achieve more reliable predictions of the dust mass, temperature and possibly other dust properties.

Implications for future observations

Figure 10 shows the histogram of the number of test data over T dust , in S3, for the entire test data set (dashed line) and the subsample of test data with reliable estimated standard deviations (solid line). In the top panel, the distribution is shown for the SN model SEDs with R in ≥ 5 × 10¹⁶ cm, while the bottom panel represents the SN model SEDs with R in < 5 × 10¹⁶ cm. This cutoff represents an approximate division of the models into those which are closest to dust signatures from newly formed dust in young SN ejecta, and those which can be interpreted as signatures arising from pre-existing circumstellar dust, flash-heated by a SN explosion. Pre-existing dust grains at radii less than about 5 × 10¹⁶ cm are likely to be evaporated by the SN explosion (Gall et al. 2014), although some dust may survive. Meanwhile, SN ejecta expanding with a mean velocity of ∼ 6 000 km/s would reach this radius after ∼1 150 days. A SN following the bolometric evolution of SN 1987A (Seitenzahl et al. 2014) would have a luminosity of ∼10 000 L⊙, the lowest considered in our models, at a similar epoch. Therefore, any dust estimated from model SEDs with dust located at distances ≥ 5 × 10¹⁶ cm (i.e. the top panel in Figure 10) may be interpreted as pre-existing dust. Dust estimates from model SEDs with dust located at distances < 5 × 10¹⁶ cm (i.e. the bottom panel in Figure 10) could be associated with newly formed dust at early epochs in young SNe. Such newly formed dust can either be located in the SN ejecta or, in the case of Type IIn SNe such as SN 2006jc (e.g. Smith et al. 2008) or SN 2010jl (e.g. Gall et al. 2014; Bevan et al. 2020), in the cool dense shell located at a distance of about 10¹⁶ cm behind the forward shock, which propagates through the dense circumstellar material that was shed by the progenitor prior to the terminal explosion. By comparing the covered areas in both panels, we find that our neural network may be better at estimating the dust mass and temperature of model SEDs which are closer to a pre-existing dust scenario than to an ejecta dust scenario. Whether or not this is due to our chosen simplifications and the parameter space coverage of our simulations can be tested in forthcoming, more realistic simulations.
In this work we used point source continuum sensitivity limits (Glasse et al. 2015; Greene et al. 2017, described in Section 2.1.2), which are calculated assuming average zodiacal background levels. While this is a reasonable approach for young SNe that are, for example, in resolved nearby galaxies or located at the outskirts of a galaxy or in the intergalactic medium, it may be problematic for older SNRs that are more extended, diffuse sources, as well as for SNe located in crowded regions. ISM back- and foreground contamination from unresolved stars in distant galaxies can give rise to a brighter background than assumed here, changing the sensitivity limits to cover lower magnitudes. Greene et al. (2017) estimated that the sensitivity levels can worsen by up to a factor of ∼2 for NIRCam broad-band filters in the case of bright backgrounds. Contamination due to cold ISM dust with temperatures of 30-50 K could also affect the sensitivity limits. However, this is most prominent at longer wavelengths (≳ 100 µm) and thus may not significantly affect the sensitivity limits in JWST's wavelength range. Finally, the chosen observing strategy and possibilities for proper background subtractions may also shift the limits at which faint sources can still be detected. In forthcoming work we will test in more detail the impact of varying sensitivity limits on optimising the JWST filter selection to determine dust properties.

Fig. 13. Estimated amount, temperature, and composition of the dust in SN 1987A at 615 days after explosion for the entire set of JWST filters. The purple dots along with black lines represent the values and uncertainties predicted by the trained neural network. The values estimated by Wesson et al. (2015, W15) and Ercolano et al. (2007, E07) are shown as red and green solid lines (left panel) and shaded areas (middle and right panels). The blue and yellow regions in the right panel highlight the x-axis labels 'No dust' and 'Silicate'.
To use modern machine learning algorithms effectively, large data sets are essential. Presently ongoing wide-field surveys such as the Zwicky Transient Facility (Bellm 2014), the Young Supernova Experiment (Jones et al. 2021) or the SkyMapper Southern Sky Survey (Scalzo et al. 2017) are discovering hundreds to thousands of SNe and other transients per year and are building up a wealth of optical photometric as well as spectroscopic data of various different types of CCSNe, which will be further advanced by future surveys such as the Vera C. Rubin Observatory Legacy Survey of Space and Time (Ivezić et al. 2019). While near- to mid-IR observations of CCSNe will likely boom with the launch of JWST and possibly other, future instruments on ground- and space-based telescopes, they are rare at present and will most likely not reach the level required to train machine learning algorithms on observational near- to mid- to far-IR data.
The Open Supernova Catalog (https://sne.space) reported the discovery of 450 Type II, 102 Type Ib, and 60 Type Ic SNe in the year 2021, of which only a few have mid-IR data. Although this is a large number of observed CCSNe in just one year, collecting a data set of the size, wavelength range and degree of variation used in our study will also in future not be easily feasible. This 'data-size' limitation is especially important for estimating the dust properties of the types of SNe used in this work, for which we had to simulate 293 236 SN SEDs covering a wavelength range from 0.7 to 30 µm. Finally, as the dust properties and quantities cannot be measured directly from the observational data and are thus unknown, well advanced simulations with known dust properties are highly valuable. Therefore, applying a neural network that is well trained on a rich set of highly advanced simulated data exploring a large parameter space may be a promising way to determine dust quantities and properties from future observations. This work also allows testing in what detail quantities and properties of dust can be inferred from observational data. Furthermore, future observational data, if included in the training of the neural network, can be used to validate the neural network and thus improve its performance and outcome.

Conclusion
In this work, we present a first test of using neural networks to estimate different quantities and properties of dust located in and around SNe, including their predicted uncertainties. We aimed at predicting the temperature and amount of dust and at differentiating between three dust compositions. To do so, we simulated an extensive data set of 293 236 SN model SEDs using the 3D photoionisation and dust radiative transfer code MOCASSIN (Ercolano et al. 2003a, 2005). We convolved the simulated data set with JWST MIRI and NIRCam bandpass filters. We considered the instrument's detection limits as well as estimated magnitude uncertainties to make the trained neural network suitable for predicting some of the properties and quantities of dust in SNe from future observations with this instrument. We defined three different scenarios to examine the feasibility and accuracy of inferring the dust properties with our neural network. In the first scenario, we assumed that all SN model SEDs have the same low redshift. In the second and third scenarios, we distributed all SN model SEDs within the redshift range of 0.0001-0.015, in which at least seven JWST bandpass filters of all SN model SEDs are within the sensitivity and saturation limits calculated for a S/N of 10. Additionally, in the third scenario, we added random noise to the SN model SEDs distributed within the redshift range of 0.0001-0.015. Thereafter, we selected the preferred and minimum subsets of JWST filters from the feature selection process, which is based on the SHAP framework. We used these filter subsets to estimate the amount, temperature and dust species with our neural network.
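The idea of an output layer that predicts each regression target together with its uncertainty, alongside a classification head for the dust species, can be illustrated with the loss functions commonly used for such heads (a generic numpy sketch; we do not claim these are the exact losses or activations of our network, whose architecture is described in Section 3):

```python
import numpy as np

def gaussian_nll(y, mu, log_sigma):
    """Heteroscedastic Gaussian negative log-likelihood: the network
    outputs a mean mu and log standard deviation log_sigma per target,
    so each prediction carries its own uncertainty estimate."""
    sigma2 = np.exp(2.0 * log_sigma)
    return 0.5 * np.mean(np.log(2.0 * np.pi * sigma2) + (y - mu) ** 2 / sigma2)

def softmax(z):
    """Classification head turning logits into dust-species probabilities
    (e.g. carbon, silicate, 50:50 mix)."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
```

Minimising the Gaussian NLL penalises both inaccurate means and miscalibrated uncertainties, which is what makes criteria such as the σ_pred cut and the over-/under-confidence checks discussed above meaningful.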
From the outcome of our trained neural network in S3, which is the scenario closest to real observations, we find that the minimum subset of JWST filters needed to estimate dust quantities and properties consists of the NIRCam filters F070W, F140M, F356W and F480M, and the MIRI filters F560W, F770W, F1000W, F1130W, F1500W and F1800W. As presented in Table 2, our neural network can well predict the dust quantities and properties for approximately 7% of the SN model SEDs from the entire test data set. This fraction has a RMSE of ∼ 0.12 dex and ∼ 38 K for M dust and T dust , respectively. The classification accuracy is 95%, 99% and 57% for carbon, silicate and a mix of carbon and silicate dust, respectively. We find that the dust quantities and properties are best predicted by our neural network for SN model SEDs that approximately range in T dust between 250 and 1000 K and in M dust between 5 × 10⁻⁴ and 10⁻¹ M⊙, and are dominated by astronomical silicates.
Figures A.1 and A.2 represent the sensitivity and saturation limits for observing with NIRCam and MIRI filters with a minimum signal-to-noise ratio of 10.

Appendix D: Effect of optical constants on example SEDs
As discussed in the text, the choice of optical constants can have a significant effect on the model SED, and thus affect the predicted dust quantities. Here, we provide a further illustration of this, using a representative model from our data set. The model has a mixed composition with 50% carbon and 50% silicate grains. Figure 12 in the main text shows the example SEDs from all combinations of four sets of carbon and four sets of silicate optical constants, for a grain size of 0.1 µm. In Figure D.1, we show example SEDs also for grain sizes of 0.01, 1.0 and 5.0 µm, and for models with the same geometry but pure carbon and pure silicate compositions. One can see that the variation between SEDs is largest for smaller grain sizes.