Open Access
Issue
A&A
Volume 666, October 2022
Article Number A45
Number of page(s) 26
Section Atomic, molecular, and nuclear data
DOI https://doi.org/10.1051/0004-6361/202244091
Published online 04 October 2022

© T. Villadsen et al. 2022

Licence Creative CommonsOpen Access article, published by EDP Sciences, under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This article is published in open access under the Subscribe-to-Open model. Subscribe to A&A to support open access publication.

1 Introduction

One of the objectives of the field of astrochemistry is to understand the formation, destruction, and survival of molecules in astrophysical environments, such as star-forming regions or planet-forming disks (Jørgensen et al. 2020; Öberg & Bergin 2021). This concerns the molecular reactions on dust-grain surfaces, for example, where the surface acts as a catalyst (Cuppen et al. 2017) and also evaporation of ices in hot cores near young massive stars (Viti et al. 2004). To model such phenomena, certain molecule-specific parameters are needed. Both examples require a measure of how strongly a particular molecule binds to a given surface, that is, the binding energy (BE). In astrophysical environments, physisorption is the primary source of BE. The main contributors of physisorption are van der Waals forces, which arise from dipole-dipole interactions, and hydrogen bonding between the adsorbed molecule and the surface. These forces act at a distance of between 2 and 4 Å. If no dynamical barrier is present, the BE is equal to the activation energy of desorption (Edes) as the system reference energy (i.e. E = 0) corresponds to the adsorbed molecule and surface at infinite separation (Minissale et al. 2022).

Experimentally, a prominent technique used to determine the BE is based on temperature programmed desorption (TPD). This applies to studies of catalysis (Luo et al. 1997), surface science (Zhou et al. 2007), and astrochemistry (Muñoz Caro et al. 2010). The TPD process consists of three steps; first, the molecule is adsorbed to the surface at cold temperatures. The coverage may be below or above a single monolayer depending on various factors, such as the molecular flux and deposition time. If more molecules than available surface adsorption sites are deposited, the coverage is also termed multilayer. Second, the temperature is linearly increased, resulting in desorption of molecules at specific temperatures. Third, the desorbed molecules are detected, often with a mass spectrometer. This produces spectra of desorption rate as a function of temperature. The process of thermal desorption is generally described by the Polanyi-Wigner equation, a modified Arrhenius law:

dNdt=kdes(T)·Nn,$ - {{{\rm{d}}N} \over {{\rm{d}}t}} = {k_{{\rm{des}}}}\left( T \right)\,\cdot\,{N^{\rm{n}}},$(1)

dNdt=vn·Nn·exp(EdeskB·T),$ - {{{\rm{d}}N} \over {{\rm{d}}t}} = {v_{\rm{n}}}\,\cdot\,{N^{\rm{n}}}\,\cdot\,\exp \left( { - {{{E_{{\rm{des}}}}} \over {{k_{\rm{B}}}\,\cdot\,T}}} \right)\,,$(2)

where kdes(T) is the desorption rate constant in s−1 at temperature T, N the number of adsorbed molecules on a surface, n the order of desorption (usually 1 for monolayer desorption and 0 for multilayer desorption), νn the pre-exponential frequency factor with value molecules1−n s−1 (also often denoted as A), Edes the desorption energy, and kB the Boltzmann constant. For a more thorough discussion of the technique and analysis, we refer to De Jong & Niemantsverdriet (1990), Burke & Brown (2010), and Minissale et al. (2022) for contextual reviews.

While TPD has been successful in determining the BEs for many molecules that are of astrochemical relevance (e.g. Brown & Bolina 2007; Burke et al. 2015a; Smith & Kay 2018; Behmard et al. 2019; Salter et al. 2019), there are limitations to experimental BE investigations with this and other techniques. Experiments are time consuming and the focus is usually on the scientifically most impactful systems. With the vast number of known interstellar molecules, this inevitably means that some of them are not yet studied. Furthermore, certain molecular species are difficult to work with, either because they are unstable or highly reactive (e.g. vinyl alcohol, cyanopolyynes), highly toxic (e.g. methyl iso-cyanate, propyl cyanide), or simply because it is challenging to produce an ice film with them (e.g. carbamide).

Alternatively, Bayesian inference (Heyl et al. 2022) or quantum chemical computational methods can be used to determine BEs (e.g. Das et al. 2018; Rimola et al. 2018; Balbisi et al. 2022). Quantum chemical calculation can also take into account BE distributions on amorphous and highly anisotropic surfaces (e.g. Tinacci et al. 2022; Ferrero et al. 2020; Duflot et al. 2021). However, as these methods are computationally expensive, many astrochemical studies rely on the so-called linear addition method. With this method, the BE of a molecule is determined by splitting its components into atoms and molecular fragments for which the BEs are known and subsequently adding them together (e.g. Garrod & Herbst 2006; Shingledecker et al. 2020). This method is computationally inexpensive, but it is also inaccurate. Novel methods that are computationally inexpensive but more accurate are required.

Machine learning (ML) has become one of the most prominent scientific tools of the 21st century as it provides high accuracy at low computational cost. It has the ability to handle and interpret data in ways that humans cannot, which allows the discovery of unprecedented patterns (Jordan & Mitchell 2015). These properties make ML an interesting alternative to the above-mentioned theory-based approaches. With its immense versatility, ML has applications ranging from self-driving cars and social media to banking and image recognition. In recent years, ML has also made its entrance as a powerful tool in astrochemistry and astrophysics. Notable deployments include the reproduction and prediction of chemical abundances in interstellar inventories (Lee et al. 2021) and exoplanet identification (Shallue & Vanderburg 2018). ML interatomic potentials are another type of ML model, which can be directly employed as a low-cost alternative to quantum chemical calculations for investigating, for example, molecular reactivity, adsorption, and diffusion on dust grains (e.g. Mazo-Sevillano et al. 2021; Molpeceres et al. 2021; Zaverkin et al. 2021). In surface science and catalysis, ML has also been used extensively to identify predictive models for BEs (e.g. Gu et al. 2020; Fung et al. 2021; Andersen & Reuter 2021).

In the present work, we apply supervised ML to a data set of BEs obtained from literature TPD data and thereby develop a model to predict BEs between new molecules and surfaces relevant to astrochemical environments. The methods used are discussed in Sect. 2. The results and discussion of the analysis are presented in Sect. 3. Astrophysical implications are covered in Sect. 4 and the conclusions of this work are given in Sect. 5.

2 Methods and data

Supervised learning algorithms are constructed to make a model that can recognise particular patterns within the data when given training examples by the user (supervisor). A strong limitation of such algorithms is that they can only recognise data that are related to the training data, and therefore any anomaly or unseen data structures would be difficult for the model to grasp. The training data given to the model in our work are a data set of BEs obtained from TPD experiments collected from the literature as well as relevant features of each system such as the surface category, and atoms and functional groups present in the adsorbed molecule. The trained model can therefore be expected to predict BEs for new examples of molecules and surfaces that are not too different from those seen in the training data. We quantify the predictive accuracy of our model by carrying out two types of cross-validation analysis. The workflow of the process is shown in Fig. 1. In the following sections, the essential components of this workflow are described.

2.1 Gaussian process regression

BEs are predicted here using the supervised ML technique Gaussian process regression (GPR), which is a probabilistic, non-parametric supervised learning method frequently used for regression and classification problems by the ML community. Being based on Bayesian probability theory, it learns a posterior probability distribution over all admissible target functions. Here, these are functions describing the relationship between surface and molecular features (the input x) and the TPD BE (the output y). A Gaussian process prior is assumed, which means that both the prior and posterior probability distributions are Gaussian distributed (normal) and can be specified using a mean function, µ(x), and a covariance function, k(x, x′), also called the kernel function. The posterior distribution is calculated by conditioning the prior distribution on the training data set. Model predictions on new test data points with input x, are obtained from the mean of the posterior distribution, f*, given by

f¯*=μ*+k(x*,x)[ k(x,x)+σn2I ]1(yμ),${{{\bf{\bar f}}}_*} = {\mu _*} + k\left( {{x_*},x} \right){\left[ {k\left( {x,x} \right) + \sigma _n^2I} \right]^{ - 1}}\left( {{\bf{y}} - \mu } \right),$(3)

and variances are obtained from the diagonal of the covariance matrix, cov(f*), given by

cov(f*)=k(x*,x*)k(x*,x)[ k(x,x)+σn2I ]1k(x,x*).${\mathop{\rm cov}} \left( {{{\bf{f}}_*}} \right) = k\left( {{x_*},{x_*}} \right) - k\left( {{x_*},x} \right){\left[ {k\left( {x,x} \right) + \sigma _n^2I} \right]^{ - 1}}k\left( {x,{x_*}} \right).$(4)

Here µ and µ* are the mean vectors, k(x, x*) denotes the covariance matrix evaluated at all pairs of training and test points, and similarly for the other entries k(x, x), k(x*, x*), and k(x*, x). The target function is assumed to be noisy, which is accounted for by the incorporation of independently and identically distributed Gaussian noise, σn2I${\sigma _n^2I}$. In practise, a small or vanishing noise level will cause the fitted model to follow the training data points closely, whereas a higher noise level will result in a smoother model. The latter can be useful for extrapolating to unseen data because more emphasis is put on trends rather than the individual training examples seen. The variance provides an uncertainty estimate, which is an estimate of how confident we can be about the model predictions, and is also affected by the noise level. Direct access to an uncertainty estimate is a great advantage of GPR compared to other types of ML methods such as neural networks (Scalia et al. 2020).

GPR belongs to the class of kernelised ML methods that internally employ the ‘kernel trick’. A kernel is a function that corresponds to an inner product in some high-dimensional feature space, whose values can be interpreted as a similarity measure of input data points. In this way, the kernel function implicitly maps the inputs into a higher dimensional space and applies the linear algorithm there, which is how kernelised ML methods tackle the computational complexity of dealing with high-dimensional feature spaces. We refer to Rasmus sen & Williams (2006) for extensive textbook coverage of GPR and to Gibson et al. (2012) and Aigrain et al. (2012) for examples of GPR applied to exoplanet data sets.

For the actual GPR implementation, we rely in this work on the ML library Scikit-learn (Pedregosa et al. 2011), which provides a number of built-in kernels. By testing several different combinations of kernels using five-fold cross validation (cf. Sect. 3.1), we find that the best performance is achieved by using the sum of the radial basis function (RBF) kernel

kRBF(xi,xj)=exp[ 1212(xixj)2 ],${k_{{\rm{RBF}}}}\left( {{{\bf{x}}_i},{{\bf{x}}_j}} \right) = \exp \left[ {{1 \over {2\ell _1^2}}{{\left( {{{\bf{x}}_i} - {{\bf{x}}_j}} \right)}^2}} \right],$(5)

and the rational quadratic (RQ) kernel

kRQ(xi,xj)=exp[ 1+(xixj)22α22 ]α.${k_{{\rm{RQ}}}}\left( {{{\bf{x}}_i},{{\bf{x}}_j}} \right) = {\rm{exp}}{\left[ {1 + {{{{\left( {{{\bf{x}}_i} - {{\bf{x}}_j}} \right)}^2}} \over {2\alpha \ell _2^2}}} \right]^{ - \alpha }}.$(6)

The length-scale parameters, 1 and 2, indicate how quickly the correlation between two points drops as their distance increases. A higher gives a smoother function and a smaller gives a wigglier function (Rasmussen & Williams 2006). α determines the relative weighting of large-scale and small-scale variations in the RQ kernel (Duvenaud 2014). 1, ℓ2, α, and the noise level from Eqs. (3) and (4) are hyperparameters. Here we determine these parameters during the model training by maximising the marginal likelihood function using the standard limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm for bound-constrained optimisation (L-BFGS-B) Byrd et al. (1995); Zhu et al. (1997).

thumbnail Fig. 1

Schematic overview of the workflow. First the data are collected from the literature and divided into monolayer and multilayer coverage. Secondly, specific features are designated to the data, including atomic composition, functional groups, and valence electrons. Thirdly, the Gaussian process regression model is constructed and trained on the data to be able to predict BEs for new molecules.

2.2 Data set and features

We compiled the training data by analysing laboratory studies presented in the literature and extracting the relevant information. From these publications, BEs determined with TPD experiments and information about the binding surface are retrieved, which resulted in a data set that initially contained 354 (167) entries for the monolayer (multilayer) case and 117 different molecules, many of which have been detected in the interstellar medium. These include simple diatomics like n2 and CO, organic molecules like ethanol (CH3CH2OH) and glycolaldehyde (HOCH2CHO), long carbon chains such as octane (C8H18), and biomolecules, such as the nucleobase adenine.

In addition to BEs, the training data set also contains input features, which are descriptive attributes of the surfaces and molecules. It is essential to choose the best possible features as they govern how well the model predicts; too few features and the model will not be able to differentiate between different training data points (e.g. molecules), and too many features could invoke the curse of dimensionality, which is a term that expresses how increasing the volume of feature space dilutes the data (Bellman 1966). Therefore, the best approach is to minimise the number of features while maximising the amount of information they contain.

In this work, molecular features are used, such as atomic compositions (C, H, O, N, Cl) and functional groups (alcohol, –OH; carbonyl, –C(O)–; carboxyl, –COOH; ester, –C(O)O-; ether, –O–; amine, -NH2; cyanide, –CN; amide, –NC(O)–). Several features are obtained from the Python module RDkit (Landrum 2020). These are calculated by converting the molecules to SMILES1 strings, which are then fed into RDkit. The considered features are the number of valence electrons, hydrogen bond donors, hydrogen bond acceptors, and topological polar surface area (TPSA). The latter is a property defined as the molecule’s sum of surface area of polar atoms and is measured in Å2. The motivation for applying the features from RDkit is to try to encapsulate the origin of the dominating forces that govern the binding of molecules. In particular, this includes hydrogen bonding and van der Waals interactions. Finally, we also included the molecular mass and the number of atoms in the molecule as features. An overview of all molecular features used is given in Table 1 and a selection of the feature values of each molecule are presented in Tables B.1-B.3.

For the consideration of features related to the surface, we first divide our data set into two categories; monolayer and multilayer coverage. Entries recorded at monolayer coverage (i.e. an adsorbate layer of one molecule thickness) or less are assumed to have their BE dominated by the surface-molecule interaction. A wide variety of different surfaces are present in the analysed literature, the number of which greatly exceeds what is reasonable for the model to handle. To reduce this number, the surfaces are placed into four different subcategories based on their common traits. The categories and their main contributors annotated with the percentage they comprise of the total entries in the category are the following: carbon (graphene, graphite, and highly orientated pyrolytic graphite, 94% of data entries), metal (gold, 56% of data entries), silicate (amorphous silicate and forsterite, 93% of data entries), and water (amorphous solid water and crystalline water, 98% of data entries). We note that this approximation could be a source of significant noise in the training data, as different types of surfaces placed in the same category (e.g. nickel and gold for metals) may in reality bind the studied molecules with different strengths but cannot be distinguished from the used input features. Finally, in order to input the surface category feature to the ML algorithm, it is converted to numerical values using one hot encoding.

For multilayer entries, the coverage is greater than one molecule in thickness and the BE is assumed to be dominated by intermolecular interactions of the adsorbed species with itself. For this reason, we have chosen to neglect surface features for the multilayer data set.

Finally, we note that, while only the BE is used for further analysis, it is important to be aware of the influence of the preexponential factor. This value can be experimentally determined, but often (including in many of the studies that contribute to the data set in this work) an assumed value is used in the analysis to retrieve the BE. Compared to an experimentally determined preexponential factor, the assumed value may result in significantly deviating BEs and their inclusion in the training data will affect the results of the ML model. It is also worth noting that adsorbates on amorphous and highly anisotropic surfaces usually have a distribution of BEs rather than a single value (Shimonishi et al. 2018). However, as the available TPD data mainly consist of a single binding energy per surface-molecule combination, it is not possible to include this distribution in our model at present.

Table 1

Overview of the features used to describe the molecules and surfaces.

2.3 Data preparation

As ML algorithms generally do not perform well on data points that are rare or very different from the remainder of the data set, we combed our data set for outliers. This was done by applying the ‘isolation forest algorithm’ from Scikit-learn (Pedregosa et al. 2011). This algorithm identifies anomalies that are both few in number and different in feature space. We applied the isolation forest in both the mono- and multilayer cases. This resulted in the removal of three outliers for both cases: the C60 fullerene and the two polycyclic aromatic hydrocarbons (PAHs) coronene and ovalene. For the multilayer case, we also removed two other molecules (dotriacontane and guanine) because these molecules, together with the fullerene and the PAHs, are outliers in the sense that they have BEs that are three or four times as large as the other molecules in the data set.

Lastly, we addressed the issue that data points with the same feature representation can have different labels (BEs). This can arise when several different experimental measurements of the same surface-molecule combination are present in the data set, which differ only in experimental parameters that in our model have been assumed not to influence the BE, such as temperature ramp, starting temperature, and pre-exponential factor. In this case, an average over the measured BEs is used. If, on the other hand, this issue arises because of differences in parameters such as the initial coverage and the specific surface (for the monolayer data set), a more thoughtful approach is fruitful. If data points at both submonolayer and monolayer coverage are present, we use the monolayer data point, because the submonolayer case may be dominated by specific surface sites exhibiting a more favourable (stronger) binding than the others. If data points for several specific surfaces or facets are present within the same surface category, we use the surface that occurs most frequently within the category in order to obtain a more coherent data set. After this data preparation step, the final data set contains 143 (46) entries for the monolayer (multilayer) case and 114 individual molecules. After the data preparation step, the features were normalised to have zero mean and unit variance, with the exception of the one-hot-encoded surface features. This is a standard procedure used by the ML community to make the learning task easier. All data points used in this work and their actual surface and assigned surface category are given in Tables B.2 and B.3 for monolayer and multilayer coverage species, respectively2.

3 Results and discussion

In the following sections, we validate the performance of the model using two different types of assessment; five-fold cross validation and leave-one-molecule-out cross validation.

3.1 Five-fold cross validation

Five-fold cross validation is a standard approach in the ML community for comparing models and for assessing how well they can predict new data points. As illustrated in Fig. A.1 in Appendix A, it is carried out by splitting the data set into five equally sized disjoint sets. The model is then trained on four of the sets while the fifth set is used to validate the model (i.e. by comparing model-predicted BEs to actual BEs). This is repeated five times until every data point has been used for validation exactly once. In Fig. 2, we show parity plots of ML predictions on the combined validation data set from the five folds versus actual literature BEs for the monolayer and multilayer data sets. The closer the points are to the diagonal dotted line, the better the model performs. The performance is quantified using the root-mean-square error (RMSE) and more absolutely by the coefficient of determination, R2.

We find that the model has the highest R2 value for the monolayer case shown in Fig. 2a. One contributor to this could be the fact that there are more than three times as many entries for the monolayer set than for the multilayer set. However, a much more important difference between the two data sets is that, for the monolayer case, the model is exceptionally accurate for data points with a BE above 17 000 K. The reason for this behaviour is that this part of the data set is immensely uniform as it consists only of carbohydrate chains of varying lengths adsorbed on graphene surfaces. As the BE increases proportionally with the length of the carbon chain, the model is able to learn these BEs with very high accuracy. In fact, if only the entries with BE above 17 000 K are included in the analysis, the model achieves a RMSE of 25.7 K and an R2 value of approximately one. This RMSE is much lower than the overall RMSE of 879 K, especially when taking into consideration that the scale of the data set is higher. However, it should be kept in mind that the model is only as nuanced as the data it has been provided. This implies that the model has relatively little knowledge of the molecules in the high-BE regime, and therefore the predictive capability would most likely be limited for molecules that are different from the simple carbohydrates. If only the low-energy regime is considered, the R2 value decreases to 0.946. The corresponding parity plot shown in Fig. A.2 can be found in the Appendix.

thumbnail Fig. 2

Parity plots for (a) monolayer and (b) multilayer coverage comparing ML-predicted BEs against actual BEs for the combined validation set from five-fold cross validation.

3.2 Leave-one-molecule-out cross validation

As a second type of assessment, we next evaluate how well the model can predict individual new molecules. This is done by first removing a chosen molecule from the data set (including all surface categories for the monolayer case), training the model on the reduced data set, and then finally comparing the ML-predicted BE of the chosen molecule with the literature value. For this assessment, eight different molecules were chosen, namely acetone (CH3C(O)CH3), acetonitrile (CH3CN), allyl alcohol (C3H5OH), ammonia (NH3), methane (CH4), methylformate (CH3OCHO), the alkane nonacosane (C29H60), and the nucleobase thymine (C5H6N2O2), which represent diverse chemical compositions and molecular sizes.

The comparison between ML-predicted and actual BEs is shown in Tables 2 and 4. The model uncertainty estimates are obtained from the standard deviation of the posterior distribution (i.e. the square root of its variance), as described in Sect. 2.1. As seen, the overall predictive capability of the model is reasonably good. Furthermore, there is a direct correlation between how well different types of molecules are represented in the training data and how well the model predicts them. For example, it is found that nonacosane is exactly predicted, which is presumably a consequence of the many similar molecules in the data set. It is further noticeable that ammonia is predicted quite reasonably, even though the model is mostly trained on organic molecules. We can identify that the good prediction accuracy is mainly the result of the inclusion of the following molecular features: number of valence electrons, H-bond acceptors, H-bond donors, and TPSA. We come to this conclusion because the predicted BE for a monolayer of ammonia on a carbon surface is only 2070 ± 1740 K (a deviation of −31% compared to the literature value) if these features are excluded. For the multilayer case, the deviation would be even more pronounced, at -36%. The model struggles the most with allyl alcohol, acetontrile, and methyl formate, although the deviations are still relatively small at +~20%. This might be because the training data set contains few molecules like these three, or because another factor, such as the specific surface, has an influence on the BE, which the model cannot account for.

We also note here that some types of molecules are not well represented in the training data, meaning that the developed model is not suitable for predicting their BEs. This primarily concerns the types of molecules that were removed as outliers in the data-preparation step (fullerenes, PAHs, and large aromatic molecules in general). Also, it is notoriously difficult to perform TPD experiments with phenols, and therefore few data are available for the model to train on.

In general, a further limitation of the developed model is that it cannot distinguish between isomers of the same molecule where these have the same feature representation, which will always be the case for structural isomers. However, given the other uncertainties – both in the experimental TPD procedure and in the developed model – it is very likely that the difference in BE between two isomers would, in any case, be much smaller than what can be resolved with the current approach.

Another limitation of the model is that it struggles with molecules consisting of atoms other than C, H, O, N, and Cl. This is because the data set includes very few molecules that contain atoms other than these, and because we consequently do not include features representing any other atoms. This means that predicting the BEs of for example sulphurous- or phosphorous-containing molecules will be less accurate. To increase the performance and predictive capabilities of the model, the most essential future step would be to increase and diversify the number of entries in the training data set.

Table 2

Comparison between ML-predicted and literature BE values (rounded to the nearest 10 K) for leave-one-molecule-out cross validation.

4 Astrophysical implications

In recent years, the number of newly detected molecules in the interstellar medium has increased substantially, including many complex organic molecules and prebiotic species such as carbamide (NH2C(O)NH2, Belloche et al. 2019), propargylimine (HC3HNH, Bizzocchi et al. 2020), propargyl cyanide (HCCCH2CN, McGuire et al. 2020), ethanolamine (HOCH2CH2NH2, Rivilla et al. 2021), and allenyl acetylene (H2CCCHCCH, Cernicharo et al. 2021). While these detections underline the molecular complexity that is present in star-forming regions, our limited knowledge of fundamental physicochemical parameters, such as reaction rate constants, photo destruction cross sections, and BEs, prevents us from fully understanding how these species form, react, and respond to physical conditions, and, ultimately, what their place is in the interstellar chemical factory.

In this section, we first employ the ML model to predict the BEs of a number of molecules that have been detected in the interstellar medium, but for which no or limited information about their BEs can be found in the literature; see Table 3. The features of the molecules are encoded in the same way as the training data for the ML model and presented in Table C.1. Predictions of BEs for monolayer coverage are limited to two surfaces, carbon and water, for which the model shows the highest performance. For a number of species, BE estimates based on the linear addition method are available in the literature and have been used in modelling studies. These BEs are generic, meaning that they are not specific to any surface. These literature estimates are included in Table 3 for comparison to our ML predictions.

Several observations are made for the predicted BEs. The uncertainties on the predictions vary from a few percent to up to 60% for hydroxyacetone. This is a reflection of the training data and feature representation of the model, with lower uncertainties emerging when features of a certain molecule are better represented in the training data set. In particular for molecules with cyanide groups, the uncertainties on the predicted BEs are low, because a comparatively large number of cyanide molecules are included in the training data. For about half of the molecules, their predicted BE is substantially larger on a water surface than on a carbonaceous surface. All species that show this behaviour contain a cyanide group and therefore this trend is explained by the stronger polar interaction between –CN and the H2O ice surface. For a couple of molecules, all of them isomers, the model cannot differentiate the BEs, because they have the same feature representation.

A comparison between ML predictions and literature estimates shows some discrepancies. The predictions and estimates of methylcyanodiacetylene and the hCxN species on carbonaceous surfaces, as well as N-methylformamide, agree fairly well, especially within the uncertainties of the ML prediction. However, the linear addition method seems to underestimate the BE of cyanamide, albeit just within the uncertainty of the ML prediction. For most species, the linear addition method severely overestimates the BE of the molecule. In the case of propargylimine, the estimated BE is more than three times larger than the ML BE prediction.

Finally, for a number of species, we note that the training data does not contain many molecules that represent their features, such as the imine (–N=CH–) group for ethanimine and propargylimine. Predictions for these species will have a higher degree of uncertainty. Improvements on the predictions can be expected by increasing the size of the training data set, in particular with species containing relevant features, and by expanding the feature representation.

We next discuss the astrophysical implications of the ML predictions. For this, we construct a simple model for the behaviour of the predicted species during thermal heating in the interstellar medium. At the basis of this model is Eq. (2), which is used to determine the loss of material from a surface at a specified temperature. The BEs presented in Table 3 are used and the pre-exponential factor is assumed to be 1 × 1018 s−1 for all species. With transition state theory calculations, Minissale et al. (2022) showed for a selection of interstellar molecules that their prefactors are substantially larger than the canonically assumed 1 × 1012 or 1 × 1013 s−1 and that these values increase for molecules of larger size. This motivates the choice of v = 1 × 1018 s−1, as the listed molecules are relatively large in size, but we emphasise that this is a rough assumption and values will differ from molecule to molecule. We note that when a prefactor of 1 × 1012 s−1 is used, the peak desorption temperatures of molecules increase by about 20–30 K with respect to the 1 × 1018 s−1 value. For each molecule, a monolayer column density of N = 1 × 1015 molecules cm−2 is used.

Two model results are presented. Figure 3 shows the desorption profiles of the molecules versus temperature for a linear heating rate of 1 K century−1 . Figure 4 shows the same desorption profiles, but plotted against the distance from a protostar based on the following equation:

T(r)=200×r0.62,$T\left( r \right) = 200 \times {r^{ - 0.62}},$(7)

with r being the radius from the protostar in astronomical units (au). Equation (7) is derived by Andrews & Williams (2007) by averaging the observed disk temperature profiles of a sample of protoplanetary disks in the Taurus-Auriga and Ophiuchus-Scorpius star forming regions. Figures 3 and 4 provide an indication of the location of the snow lines of the molecules presented in Table 3, that is, the radius from a central protostar where volatile molecules sublimate or freeze out (Öberg & Bergin 2021). In both figures, the peak desorption temperature (97 K) and location (3.2 au) of water are indicated, which are determined for a monolayer (1 × 1015 molecules cm−2) H2O coverage on HOPG with Ebin = 5792 K and v = 4.96 × 1015 s−1 (Minissale et al. 2022).

The plots reveal a large variation in the peak desorption temperatures of the predicted molecules. For some species, such as CH3CHNH, CH3NCO, and H2CCCHCCH, peak desorption coincides with or is lower than that of water. While in this work, the binding of molecules to a water surface is considered, it is reasonable to assume that many of these species will in fact be mixed in water ice because of the large H2O abundance in the ISM (Boogert et al. 2015). Consequently, it is to be expected that these molecule will mostly co-desorb (Burke & Brown 2010) with water when this species sublimates. Snow lines of molecules with Ebin,xEbin,water therefore more likely coincide with that of water. Co-desorption with other bulk ice components, such as CO, is possible in principle, but laboratory evidence generally indicates that molecules with a significantly higher BE than the bulk medium do not co-desorb (e.g. Ligterink et al. 2018). The species considered in this work, which all have Ebin,xEbin,CO, are therefore unlikely to co-desorb with CO.

Many species, such as CH2CCHCN, HCCCHCHCN, and CH3C5N, have a high BE to water surfaces and show desorption traces at much higher temperatures than the peak desorption temperature of water itself. At these temperatures, water ice has sublimated from grain surfaces and is no longer a binding medium, and so desorption should instead occur from a surface that is probably made out of silicates, carbonaceous species, or organic residue. Taking this into account, it seems that many of these molecules will in fact desorb quite close to the water snow line, based on their carbon surface BEs. Only a handful of molecules desorb at temperatures considerably above that of the water snow line, namely NH2CN, NH2C(O)NH2, the HCxN species, and indene. From this visualisation, it becomes clear that various subgroups of these molecules will end up in distinctly different regions of the planet-forming disk as either gas or ice.

Table 3

Predictions of BEs for molecules with astrophysical relevance measured in K and rounded to the nearest 10 K.

thumbnail Fig. 3

Desorption traces of molecules for which the BE is determined in this work. The first-order desorption profiles are plotted for monolayer (1 × 1015 molecules cm−2) coverage on a water ice (blue) and carbonaceous (orange) surface. The peak desorption temperatures are indicated in the top right corners. A linear heating rate of 1 K century−1 is applied. Prefactors are assumed and set at A =1 × 1018 s−1 for all molecules. The peak desorption for water is indicated with a green dashed lines at 97 K.

thumbnail Fig. 4

Same as Fig. 3, except that the desorption trace is plotted against a median disk temperature profile as derived by Andrews & Williams (2007); see main text for more details. Shorter distances are closer to the protostar and thus correspond to higher temperatures. The peak desorption for water is indicated with a green dashed lines at 3.2 au.

Table 4

Comparison between predicted and observed values of BE obtained from the literature measured in eV and rounded to nearest 0.01 eV.

5 Conclusions

In this work, we created an ML model based on GPR and trained it to predict the BEs of molecules relevant to astrochemistry. We collected the BEs determined from laboratory experiments, categorised them according to their features (e.g. mono- or multilayer coverage, binding surface, functional groups, valence electrons, H-bond acceptors, and donors), and used them as training data for the model. We assessed the performance of the model with five-fold and leave-one-molecule-out cross validation. We find a root mean square error of 892 K and 580 K for the mono- and multilayer models, respectively. For individual molecules, the deviation between model-predicted and literature BEs is found to be within ±20%. We note that sufficient training data and accurate feature representation are essential to predict BEs. Molecules for which features are not well described or insufficient training data points are available will generally have larger uncertainties on their predictions.

The validated model was used to predict the BEs on a water and carbonaceous surface of 21 molecules that have been detected in recent years in the interstellar medium, but for which no or limited experimental information about their BEs is available. The lowest BE of 2990 K is predicted for methyl isocyanate (CH3NCO) on a water surface, while the highest BE of 12820 K is predicted for cyanopentacetylene (HC11N) bound to a water surface. Uncertainties on the predictions range from just a few percent to about 60% for hydroxyacetone (CH3C(O)CH2OH), which is presumably a reflection of the lack of training data and feature representation for these molecules. The surface can have a pronounced effect on the predicted BE, showing differences of several thousand Kelvin for some molecules. Finally, we present a comparison between the predictions of the ML model and those predicted with the widely used linear addition method. We find that the linear addition method generally overpredicts BEs, in some case by more than a factor of two.

The newly predicted BEs are put into context of interstellar environments with a simple model that shows their desorption profile with respect to a 1K century−1 temperature ramp and a protoplanetary disk temperature profile. From this simple model, the locations of the snow lines of these molecules are determined. Most of them will roughly coincide with the water snow line, but those of cyanamide (NH2CN), urea/carbamide (NH2C(O)NH2), and the cyanoacetylenes (HCxN) are located at much higher temperatures or closer to the protostar.

This work demonstrates that ML can be employed to accurately and rapidly predict BEs of molecules. The approach taken here is based on experimental training data, but we note that ML models can also be trained on BEs obtained from quantum chemical calculations, as already pursued intensively in the heterogeneous catalysis community (e.g. Gu et al. 2020; Fung et al. 2021; Andersen & Reuter 2021). In that connection, a natural extension of this work could be to also take into account BE distributions on amorphous and highly anisotropic surfaces, as this distribution is often readily available from quantum chemical calculations (e.g. Tinacci et al. 2022; Ferrero et al. 2020; Duflot et al. 2021). Overall, we believe that the work presented here could pave the way for a stronger collaboration between the communities working on quantum chemical calculations of BEs, laboratory experiments, and ML, as the various approaches complement each other. The BEs predicted here will find general use in the modelling of astrochemical and planet-forming environments, while more detailed BE distributions would be critical for more specific modelling such as the reactivity of molecules at dust grains at low temperatures.

Acknowledgements

The authors thank C.N. Shingledecker for providing BE estimates of molecules used in several chemical modelling studies. N.F.W.L. acknowledges funding by the Swiss National Science Foundation (SNSF) under Ambizione grant 193453. M.A. acknowledges funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement no. 754513, the Aarhus University Research Foundation, the Danish National Research Foundation through the Center of Excellence ‘InterCat’ (Grant agreement no.: DNRF150) and VILLUM FONDEN (grant no. 37381).

Appendix A Cross validation

Figure A.1 schematically displays how the five-fold cross validation is set up. Figure A.2 depicts the parity plot when only BEs up to 17000 K are included.

thumbnail Fig. A.1

Schematic overview of five-fold cross validation.

thumbnail Fig. A.2

Parity plot for the monolayer case and including only BEs up to 17000 K.

Appendix B Literature data and molecular features

Table B shows the most important features of molecules that are used as training data in this work. Tables B and B.3 present the literature data used to train the ML model3.

Table B.1

Features of molecules used for ML

Table B.2

BEs of molecules at monolayer coverage

Table B.3

BEs of molecules at multilayer coverage

Appendix C Astrochemically relevant species

Table C.1 presents the features of the astrochemically relevant molecules for which BEs are predicted in this work.

Table C.1

Features of astrochemically relevant molecules used for ML

References

  1. Abdulgalil, A. G. M., Marchione, D., Thrower, J., et al. 2013, Philos. Trans. Roy. Soc. A: Math. Phys. Eng. Sci., 371, 20110586 [NASA ADS] [CrossRef] [Google Scholar]
  2. Acharyya, K., Fuchs, G., Fraser, H., Van Dishoeck, E., & Linnartz, H. 2007, A&A, 466, 1005 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
  3. Aigrain, S., Pont, F., & Zucker, S. 2012, MNRAS, 419, 3147 [NASA ADS] [CrossRef] [Google Scholar]
  4. Allouche, A., Verlaque, P., & Pourcin, J. 1998, J. Phys. Chem. B, 102, 89 [CrossRef] [Google Scholar]
  5. Andersen, M., & Reuter, K. 2021, Acc. Chem. Res., 54, 2741 [CrossRef] [Google Scholar]
  6. Andrews, S. M., & Williams, J. P. 2007, ApJ, 659, 705 [NASA ADS] [CrossRef] [Google Scholar]
  7. Bahr, S., & Kempter, V. 2007, J. Chem. Phys., 127, 074707 [CrossRef] [Google Scholar]
  8. Bahr, S., Toubin, C., & Kempter, V. 2008, J. Chem. Phys., 128, 134712 [CrossRef] [Google Scholar]
  9. Balbisi, M., Horvath, R. A., Szőri, M., & Jedlovszky, P. 2022, J. Chem. Phys., 156, 184703 [CrossRef] [Google Scholar]
  10. Behmard, A., Fayolle, E. C., Graninger, D. M., et al. 2019, ApJ, 875, 73 [Google Scholar]
  11. Bellman, R. 1966, Science, 153, 34 [NASA ADS] [CrossRef] [Google Scholar]
  12. Belloche, A., Garrod, R., Müller, H., et al. 2019, A&A, 628, A10 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
  13. Bertin, M., Doronin, M., Fillion, J.-H., et al. 2017, A&A, 598, A18 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
  14. Bisschop, S., Fraser, H., Öberg, K., Van Dishoeck, E., & Schlemmer, S. 2006, A&A, 449, 1297 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
  15. Bizzocchi, L., Prudenzano, D., Rivilla, V. M., et al. 2020, A&A, 640, A98 [EDP Sciences] [Google Scholar]
  16. Bolina, A., Wolff, A., & Brown, W. 2005a, J. Chem. Phys., 122, 044713 [CrossRef] [Google Scholar]
  17. Bolina, A. S., Wolff, A. J., & Brown, W. A. 2005b, J. Phys. Chem. B, 109, 16836 [CrossRef] [Google Scholar]
  18. Boogert, A. A., Gerakines, P. A., & Whittet, D. C. 2015, Annu. Rev. Astron. Astrophys., 53, 541 [CrossRef] [Google Scholar]
  19. Borget, F., Chiavassa, T., Allouche, A., Marinelli, F., & Aycard, J.-P. 2001, J. Am. Chem. Soc., 123, 10668 [CrossRef] [Google Scholar]
  20. Brown, W. A., & Bolina, A. S. 2007, MNRAS, 374, 1006 [NASA ADS] [CrossRef] [Google Scholar]
  21. Burke, D. J., & Brown, W. A. 2010, Phys. Chem. Chem. Phys., 12, 5947 [NASA ADS] [CrossRef] [Google Scholar]
  22. Burke, D. J., Puletti, F., Brown, W. A., et al. 2015a, MNRAS, 447, 1444 [CrossRef] [Google Scholar]
  23. Burke, D. J., Puletti, F., Woods, P. M., et al. 2015b, J. Chem. Phys., 143, 164704 [NASA ADS] [CrossRef] [Google Scholar]
  24. Byrd, R. H., Lu, P., Nocedal, J., & Zhu, C. 1995, SIAM J. Sci. Comput., 16, 1190 [Google Scholar]
  25. Cernicharo, J., Cabezas, C., Agúndez, M., et al. 2021, A&A, 647, A3 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
  26. Chaabouni, H., Diana, S., Nguyen, T., & Dulieu, F. 2018, A&A, 612, A47 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
  27. Collings, M., Dever, J. W., Fraser, H. J., & McCoustra, M. R. 2003, Astrophys. Space Sci., 285, 633 [NASA ADS] [CrossRef] [Google Scholar]
  28. Collings, M. P., Frankland, V., Lasne, J., et al. 2015, MNRAS, 449, 1826 [NASA ADS] [CrossRef] [Google Scholar]
  29. Congiu, E., Chaabouni, H., Laffon, C., et al. 2012, J. Chem. Phys., 137, 054713 [Google Scholar]
  30. Corazzi, M. A., Brucato, J. R., Poggiali, G., et al. 2021, ApJ, 913, 128 [NASA ADS] [CrossRef] [Google Scholar]
  31. Couturier-Tamburelli, I., Toumi, A., Piétri, N., & Chiavassa, T. 2018, Icarus, 300, 477 [NASA ADS] [CrossRef] [Google Scholar]
  32. Cuppen, H., Walsh, C., Lamberts, T., et al. 2017, Space Sci. Rev., 212, 1 [NASA ADS] [CrossRef] [Google Scholar]
  33. Danger, G., Duvernay, F., Theulé, P., Borget, F., & Chiavassa, T. 2012, ApJ, 756, 11 [Google Scholar]
  34. Das, A., Sil, M., Gorai, P., Chakrabarti, S. K., & Loison, J.-C. 2018, ApJS, 237, 9 [NASA ADS] [CrossRef] [Google Scholar]
  35. De Jong, A., & Niemantsverdriet, J. 1990, Surf. Sci., 233, 355 [CrossRef] [Google Scholar]
  36. Demers, L. M., Östblom, M., Zhang, H., et al. 2002, J. Am. Chem. Soc., 124, 11248 [CrossRef] [Google Scholar]
  37. Dostert, K.-H., O’Brien, C. P., Mirabella, F., Ivars-Barceló, F., & Schauermann, S. 2016, Phys. Chem. Chem. Phys., 18, 13960 [NASA ADS] [CrossRef] [Google Scholar]
  38. Duflot, D., Toubin, C., & Monnerville, M. 2021, Front. Astron. Space Sci., 8, 645243 [CrossRef] [Google Scholar]
  39. Duvenaud, D. 2014, PhD thesis, University of Cambridge, Cambridge, United Kingdom [Google Scholar]
  40. Edridge, J. L., Freimann, K., Burke, D. J., & Brown, W. A. 2013, Philos. Trans. Roy. Soc. A: Math. Phys. Eng. Sci., 371, 20110578 [NASA ADS] [CrossRef] [Google Scholar]
  41. Fayolle, E. C., Balfe, J., Loomis, R., et al. 2016, ApJ, 816, L28 [Google Scholar]
  42. Ferrero, S., Zamirri, L., Ceccarelli, C., et al. 2020, ApJ, 904, 11 [NASA ADS] [CrossRef] [Google Scholar]
  43. Fraser, H. J., Collings, M. P., McCoustra, M. R., & Williams, D. A. 2001, MNRAS, 327, 1165 [NASA ADS] [CrossRef] [Google Scholar]
  44. Fuchs, G. W., Acharyya, K., Bisschop, S. E., et al. 2006, Faraday Discuss., 133, 331 [Google Scholar]
  45. Fung, V., Hu, G., Ganesh, P., & Sumpter, B. G. 2021, Nat. Commun., 12, 88 [NASA ADS] [CrossRef] [Google Scholar]
  46. Galvez, O., Ortega, I. K., Maté, B., et al. 2007, A&A, 472, 691 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
  47. Garrod, R., & Herbst, E. 2006, A&A, 457, 927 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
  48. Gellman, A. J., & Paserba, K. R. 2002, J. Phys. Chem. B, 106, 13231 [CrossRef] [Google Scholar]
  49. Gibson, N., Aigrain, S., Roberts, S., et al. 2012, MNRAS, 419, 2683 [NASA ADS] [CrossRef] [Google Scholar]
  50. Gu, G. H., Noh, J., Kim, S., et al. 2020, J. Phys. Chem. Lett., 11, 3185 [CrossRef] [Google Scholar]
  51. Guennoun, Z., Couturier-Tamburelli, I., Piétri, N., & Aycard, J.-P. 2005, J. Phys. Chem. B, 109, 3437 [CrossRef] [Google Scholar]
  52. Haynes, D., Tro, N., & George, S. 1992, J. Phys. Chem., 96, 8502 [NASA ADS] [CrossRef] [Google Scholar]
  53. He, J., Acharyya, K., & Vidali, G. 2016, ApJ, 825, 89 [Google Scholar]
  54. He, J., Emtiaz, S. M., & Vidali, G. 2017, ApJ, 837, 65 [Google Scholar]
  55. Heyl, J., Holdship, J., & Viti, S. 2022, ApJ, 931, 26 [NASA ADS] [CrossRef] [Google Scholar]
  56. Jordan, M. I., & Mitchell, T. M. 2015, Science, 349, 255 [CrossRef] [Google Scholar]
  57. Jørgensen, J. K., Belloche, A., & Garrod, R. T. 2020, ArA&A, 58, 727 [CrossRef] [Google Scholar]
  58. Kruczkiewicz, F., Vitorino, J., Congiu, E., Theulé, P., & Dulieu, F. 2021, A&A, 652, A29 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
  59. Landrum, G. 2020, RDKit: Open-Source Cheminformatics Software, https://www.rdkit.org/ [Google Scholar]
  60. Lasne, J., Laffon, C., & Parent, P. 2012, Phys. Chem. Chem. Phys., 14, 697 [NASA ADS] [CrossRef] [Google Scholar]
  61. Lattelais, M., Bertin, M., Mokrane, H., et al. 2011, A&A, 532, A12 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
  62. Lee, K. L. K., Patterson, J., Burkhardt, A., et al. 2021, ApJ, 917, L6 [NASA ADS] [CrossRef] [Google Scholar]
  63. Ligterink, N., Walsh, C., Bhuin, R., et al. 2018, A&A, 612, A88 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
  64. Luo, M.-F., Zhong, Y.-J., Yuan, X.-X., & Zheng, X.-M. 1997, Appl. Catal. A: Gen., 162, 121 [CrossRef] [Google Scholar]
  65. Maté, B., Jiménez-Redondo, M., Peláez, R. J., Tanarro, I., & Herrero, V. J. 2019, MNRAS, 490, 2936 [CrossRef] [Google Scholar]
  66. Mazo-Sevillano, P. D., Aguado, A., & Roncero, O. 2021, J. Chem. Phys., 154, 094305 [NASA ADS] [CrossRef] [Google Scholar]
  67. McGuire, B. A., Burkhardt, A. M., Loomis, R. A., et al. 2020, ApJ, 900, L10 [Google Scholar]
  68. Minissale, M., Aikawa, Y., Bergin, E., et al. 2022, ACS Earth Space Chem., 6, 597 [NASA ADS] [CrossRef] [Google Scholar]
  69. Molpeceres, G., Zaverkin, V., Watanabe, N., & Kästner, J. 2021, A&A, 648, A84 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
  70. Muñoz Caro, G., Jiménez-Escobar, A., Martín-Gago, J. Á., et al. 2010, A&A, 522, A108 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
  71. Noble, J., Congiu, E., Dulieu, F., & Fraser, H. 2012, MNRAS, 421, 768 [NASA ADS] [Google Scholar]
  72. Noble, J. A., Theule, P., Borget, F., et al. 2013, MNRAS, 428, 3262 [NASA ADS] [CrossRef] [Google Scholar]
  73. Noble, J., Diana, S., & Dulieu, F. 2015, MNRAS, 454, 2636 [NASA ADS] [CrossRef] [Google Scholar]
  74. Öberg, K. I., & Bergin, E. A. 2021, Phys. Rep., 893, 1 [Google Scholar]
  75. Öberg, K., Van Broekhuizen, F., Fraser, H., et al. 2005, ApJ, 621, L33 [CrossRef] [Google Scholar]
  76. Öberg, K. I., Garrod, R. T., Van Dishoeck, E. F., & Linnartz, H. 2009, A&A, 504, 891 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
  77. Östblom, M., Liedberg, B., Demers, L. M., & Mirkin, C. A. 2005, J. Phys. Chem. B, 109, 15150 [CrossRef] [Google Scholar]
  78. Parmeter, J., Schwalke, U., & Weinberg, W. 1988, J. Am. Chem. Soc., 110, 53 [CrossRef] [Google Scholar]
  79. Paserba, K. R., & Gellman, A. J. 2001a, J. Chem. Phys., 115, 6737 [NASA ADS] [CrossRef] [Google Scholar]
  80. Paserba, K. R. & Gellman, A. J. 2001b, Phys. Rev. Lett., 86, 4338 [NASA ADS] [CrossRef] [Google Scholar]
  81. Pedregosa, F., Varoquaux, G., Gramfort, A., et al. 2011, J. Mach. Learn. Res., 12, 2825 [Google Scholar]
  82. Quan, D., Herbst, E., Corby, J. F., Durr, A., & Hassel, G. 2016, ApJ, 824, 129 [NASA ADS] [CrossRef] [Google Scholar]
  83. Rasmussen, C. E., & Williams, C. K. 2006, Gaussian Processes for Machine Learning (MA: MIT Press) [Google Scholar]
  84. Rimola, A., Skouteris, D., Balucani, N., et al. 2018, ACS Earth Space Chem., 2, 720 [Google Scholar]
  85. Rivilla, V. M., Jiménez-Serra, I., Martín-Pintado, J., et al. 2021, Proc. Natl. Acad. Sci. U.S.A., 118 [Google Scholar]
  86. Salter, T. L., Stubbing, J.W., Brigham, L., & Brown, W. A. 2018, J. Chem. Phys., 149, 164705 [NASA ADS] [CrossRef] [Google Scholar]
  87. Salter, T. L., Wootton, L., & Brown, W. A. 2019, ACS Earth Space Chem., 3, 1524 [NASA ADS] [CrossRef] [Google Scholar]
  88. Sandford, S. A., & Allamandola, L. J. 1988, Icarus, 76, 201 [NASA ADS] [CrossRef] [Google Scholar]
  89. Sandford, S. A., & Allamandola, L. J. 1990, Icarus, 87, 188 [NASA ADS] [CrossRef] [Google Scholar]
  90. Scalia, G., Grambow, C. A., Pernici, B., Li, Y.-P., & Green, W. H. 2020, J. Chem. Inform. Model., 60, 2697 [CrossRef] [Google Scholar]
  91. Schriver, A., Coanga, J., Schriver-Mazzuoli, L., & Ehrenfreund, P. 2004, Chem. Phys., 303, 13 [NASA ADS] [CrossRef] [Google Scholar]
  92. Shallue, C. J., & Vanderburg, A. 2018, AJ, 155 [Google Scholar]
  93. Shimonishi, T., Nakatani, N., Furuya, K., & Hama, T. 2018, ApJ, 855, 27 [Google Scholar]
  94. Shingledecker, C. N., Molpeceres, G., Rivilla, V. M., Majumdar, L., & Kästner, J. 2020, ApJ, 897, 158 [NASA ADS] [CrossRef] [Google Scholar]
  95. Smith, R. S., & Kay, B. D. 2018, J. Phys. Chem. B, 122, 587 [CrossRef] [Google Scholar]
  96. Smith, R. S., & Kay, B. D. 2019, J. Phys. Chem. A, 123, 3248 [NASA ADS] [CrossRef] [Google Scholar]
  97. Smith, R. S., Matthiesen, J., & Kay, B. D. 2014, J. Phys. Chem. A, 118, 8242 [NASA ADS] [CrossRef] [Google Scholar]
  98. Smith, R. S., May, R. A., & Kay, B. D. 2016, J. Phys. Chem. B, 120, 1979 [Google Scholar]
  99. Solomun, T., Christmann, K., & Baumgärtel, H. 1989, J. Phys. Chem., 93, 7199 [CrossRef] [Google Scholar]
  100. Suhasaria, T., Thrower, J., & Zacharias, H. 2015, MNRAS, 454, 3317 [NASA ADS] [CrossRef] [Google Scholar]
  101. Suhasaria, T., Thrower, J., & Zacharias, H. 2017, MNRAS, 472, 389 [NASA ADS] [CrossRef] [Google Scholar]
  102. Tait, S. L., Dohnálek, Z., Campbell, C. T., & Kay, B. D. 2005, J. Chem. Phys., 122, 164707 [NASA ADS] [CrossRef] [Google Scholar]
  103. Takeuchi, K., Yamamoto, S., Hamamoto, Y., et al. 2017, J. Phys. Chem. C, 121, 2807 [CrossRef] [Google Scholar]
  104. Theulé, P., Borget, F., Mispelaer, F., et al. 2011, A&A, 534, A64 [Google Scholar]
  105. Tinacci, L., Germain, A., Pantaleone, S., et al. 2022, ACS Earth Space Chem., 6, 1286 [Google Scholar]
  106. Toumi, A., Piétri, N., Chiavassa, T., & Couturier-Tamburelli, I. 2016, Icarus, 270, 435 [NASA ADS] [CrossRef] [Google Scholar]
  107. Tylinski, M., Smith, R. S., & Kay, B. D. 2020, J. Phys. Chem. A, 124, 6237 [NASA ADS] [CrossRef] [Google Scholar]
  108. Ulbricht, H., Zacharia, R., Cindir, N., & Hertel, T. 2006, Carbon, 44, 2931 [CrossRef] [Google Scholar]
  109. Viti, S., Collings, M. P., Dever, J. W., McCoustra, M. R., & Williams, D. A. 2004, MNRAS, 354, 1141 [NASA ADS] [CrossRef] [Google Scholar]
  110. Wakelam, V., Herbst, E., Loison, J.-C., et al. 2012, ApJS, 199, 21 [Google Scholar]
  111. Zaverkin, V., Molpeceres, G., & Kästner, J. 2021, MNRAS, 510, 3063 [Google Scholar]
  112. Zhou, J.-H., Sui, Z.-J., Zhu, J., et al. 2007, Carbon, 45, 785 [CrossRef] [Google Scholar]
  113. Zhu, C., Byrd, R. H., Lu, P., & Nocedal, J. 1997, ACM Trans. Math. Softw., 23, 550 [Google Scholar]
  114. Zubkov, T., Smith, R. S., Engstrom, T. R., & Kay, B. D. 2007, J. Chem. Phys., 127, 184707 [NASA ADS] [CrossRef] [Google Scholar]

1

Simplified molecular-input line-entry system (SMILES) is a formalism widely used in the chemistry community to describe molecular structures as a string of ASCII characters. The SMILES string also encodes the chemical functional groups that are present in the molecule.

2

Electronic versions of the data files, along with Python scripts for producing the results presented below, can be found in the Github repository (https://github.com/TorbenVilladsen/Predicting-Binding-Energies-of-Astrochemically-Relevant-Molecules-via-Machine-Learning)

3

Electronic versions of the data files along with Python scripts for producing the results can be found in the Github repository here.

All Tables

Table 1

Overview of the features used to describe the molecules and surfaces.

Table 2

Comparison between ML-predicted and literature BE values (rounded to the nearest 10 K) for leave-one-molecule-out cross validation.

Table 3

Predictions of BEs for molecules with astrophysical relevance measured in K and rounded to the nearest 10 K.

Table 4

Comparison between predicted and observed values of BE obtained from the literature measured in eV and rounded to nearest 0.01 eV.

Table B.1

Features of molecules used for ML

Table B.2

BEs of molecules at monolayer coverage

Table B.3

BEs of molecules at multilayer coverage

Table C.1

Features of astrochemically relevant molecules used for ML

All Figures

thumbnail Fig. 1

Schematic overview of the workflow. First the data are collected from the literature and divided into monolayer and multilayer coverage. Secondly, specific features are designated to the data, including atomic composition, functional groups, and valence electrons. Thirdly, the Gaussian process regression model is constructed and trained on the data to be able to predict BEs for new molecules.

In the text
thumbnail Fig. 2

Parity plots for (a) monolayer and (b) multilayer coverage comparing ML-predicted BEs against actual BEs for the combined validation set from five-fold cross validation.

In the text
thumbnail Fig. 3

Desorption traces of molecules for which the BE is determined in this work. The first-order desorption profiles are plotted for monolayer (1 × 1015 molecules cm−2) coverage on a water ice (blue) and carbonaceous (orange) surface. The peak desorption temperatures are indicated in the top right corners. A linear heating rate of 1 K century−1 is applied. Prefactors are assumed and set at A =1 × 1018 s−1 for all molecules. The peak desorption for water is indicated with a green dashed lines at 97 K.

In the text
thumbnail Fig. 4

Same as Fig. 3, except that the desorption trace is plotted against a median disk temperature profile as derived by Andrews & Williams (2007); see main text for more details. Shorter distances are closer to the protostar and thus correspond to higher temperatures. The peak desorption for water is indicated with a green dashed lines at 3.2 au.

In the text
thumbnail Fig. A.1

Schematic overview of five-fold cross validation.

In the text
thumbnail Fig. A.2

Parity plot for the monolayer case and including only BEs up to 17000 K.

In the text

Current usage metrics show cumulative count of Article Views (full-text article views including HTML views, PDF and ePub downloads, according to the available data) and Abstracts Views on Vision4Press platform.

Data correspond to usage on the plateform after 2015. The current usage metrics is available 48-96 hours after online publication and is updated daily on week days.

Initial download of the metrics may take a while.