Issue 
A&A
Volume 645, January 2021



Article Number  A28  
Number of page(s)  28  
Section  Interstellar and circumstellar matter  
DOI  https://doi.org/10.1051/00046361/202038040  
Published online  23 December 2020 
Tracers of the ionization fraction in dense and translucent gas
I. Automated exploitation of massive astrochemical model grids
^{1}
LERMA, Observatoire de Paris, PSL Research University, CNRS, Sorbonne Universités,
92190 Meudon, France
email: emeric.bron@obspm.fr
^{2}
LERMA, Observatoire de Paris, PSL Research University, CNRS, Sorbonne Universités,
75014 Paris, France
^{3}
IRAM,
300 rue de la Piscine,
38406 Saint Martin d’Hères,
France
^{4}
Laboratoire d’Astrophysique de Bordeaux, Univ. Bordeaux, CNRS, B18N, Allee Geoffroy SaintHilaire,
33615 Pessac,
France
^{5}
Instituto de Astrofísica, Pontificia Universidad Católica de Chile,
Av. Vicuña Mackenna 4860,
7820436 Macul,
Santiago, Chile
^{6}
Chalmers University of Technology, Department of Space, Earth and Environment,
412 93 Gothenburg, Sweden
^{7}
University of Toulouse, IRIT/INPENSEEIHT, CNRS,
2 rue Charles Camichel, BP 7122,
31071 Toulouse cedex 7, France
^{8}
Univ. Lille, CNRS, Centrale Lille, UMR 9189  CRIStAL,
59651 Villeneuve d’Ascq, France
^{9}
Instituto de Física Fundamental (CSIC). Calle Serrano 121,
28006
Madrid,
Spain
^{10}
Institut de Recherche en Astrophysique et Planétologie (IRAP), Université Paul Sabatier,
Toulouse cedex 4, France
^{11}
Laboratoire de Physique de l’Ecole normale supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université ParisDiderot,
Sorbonne Paris Cité,
Paris,
France
^{12}
National Radio Astronomy Observatory,
520 Edgemont Road,
Charlottesville,
VA
22903, USA
^{13}
HarvardSmithsonian Center for Astrophysics,
60 Garden Street,
Cambridge,
MA
02138,
USA
^{14}
School of Physics and Astronomy, Cardiff University, Queen’s buildings,
Cardiff CF24 3AA, UK
^{15}
AixMarseille University, CNRS, Centrale Marseille, Institut Fresnel, Marseille, France
Received:
27
March
2020
Accepted:
2
July
2020
Context. The ionization fraction in the neutral interstellar medium (ISM) plays a key role in the physics and chemistry of the ISM, from controlling the coupling of the gas to the magnetic field to allowing fast ionneutral reactions that drive interstellar chemistry. Most estimations of the ionization fraction have relied on deuterated species such as DCO^{+}, whose detection is limited to dense cores representing an extremely small fraction of the volume of the giant molecular clouds that they are part of. As large fieldofview hyperspectral maps become available, new tracers may be found. The growth of observational datasets is paralleled by the growth of massive modeling datasets and new methods need to be devised to exploit the wealth of information they contain.
Aims. We search for the best observable tracers of the ionization fraction based on a grid of astrochemical models, with the broader aim of finding a general automated method applicable to searching for tracers of any unobservable quantity based on grids of models.
Methods. We built grids of models that randomly sample a large range of physical conditions (unobservable quantities such as gas density, temperature, elemental abundances, etc.) and computed the corresponding observables (line intensities, column densities) and the ionization fraction. We estimated the predictive power of each potential tracer by training a random forest model to predict the ionization fraction from that tracer, based on these model grids.
Results. In both translucent medium and cold dense medium conditions, we found several observable tracers with very good predictive power for the ionization fraction. Many tracers in cold dense medium conditions are found to be better and more widely applicable than the traditional DCO^{+}/HCO^{+} ratio. We also provide simpler analytical fits for estimating the ionization fraction from the best tracers, and for estimating the associated uncertainties. We discuss the limitations of the present study and select a few recommended tracers in both types of conditions.
Conclusions. The method presented here is very general and can be applied to the measurement of any other quantity of interest (cosmic ray flux, elemental abundances, etc.) from any type of model (PDR models, timedependent chemical models, etc.).
Key words: ISM: clouds / ISM: molecules / methods: statistical / methods: numerical / astrochemistry
© E. Bron et al. 2020
Open Access article, published by EDP Sciences, under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1 Introduction
The socalled neutral component of the interstellar medium, despite being shielded from EUV (13.6 to 124 eV) stellar photons able to ionize hydrogen, retains a small ionization fraction (x(e^{−}) = n(e^{−})∕n_{H}). The ionization mechanism depends on the type of region: FUV (6 to 13.6 eV) photons ionizing C and S in the low A_{V} surface layer of clouds, or cosmic rays, X rays, shocks, etc., in the densest parts. As a result, ionization fractions range between ~10^{−4} in low A_{V} cloud surfaces and down to ~10^{−9} in dense cores (e.g., Goicoechea et al. 2009; Draine 2011).
This ionization fraction controls several key aspects of neutral interstellar clouds. It determines the degree of coupling of the gas to the magnetic field: the neutrals, accounting for most of the mass of the fluid, are only indirectly sensitive to the presence of a magnetic field through their friction with the ions that remain coupled to the field, a process called ionneutral friction. This coupling can provide a significant magnetic support against gravitational collapse of dense cores despite the low ionization fractions values found there, between 10^{−9} and 10^{−7} (Mestel & Spitzer 1956; Mouschovias 1976; Basu & Mouschovias 1994). The ionization fraction also controls the onset of the magnetorotational instability (Balbus & Hawley 1991), the main mechanism of angular momentum transport in accretion disks. Moreover, the gas phase chemistry in dense molecular clouds is to a large extent driven by fast ionneutral reactions (Herbst & Klemperer 1973; Oppenheimer & Dalgarno 1974). The buildup of chemical complexity thus depends on the ionization fraction of the medium. Finally, some common molecular tracers with high dipole moments, such as HCN and HCO^{+}, have high inelastic collision cross sections with electrons, and their excitation can be significantly affected by electron collisions for ionization fractions ≳10^{−5} (Black & van Dishoeck 1991; Liszt 2012; Liszt & Pety 2016; Goldsmith & Kauffmann 2017). This makes the interpretation of their emission (e.g., to estimate gas density) sensitive to our knowledge of the local ionization fraction.
Direct observational estimation of the ionization fraction in neutral clouds is difficult, except in very specific regions (e.g., Goicoechea et al. 2009; Cuadrado et al. 2019 at the dissociation front in a photodissociation region). Direct estimation of the total charge accounted for by observable molecular ions in molecular clouds only yields a loose lower limit (e.g., Miettinen et al. 2011). Indirect methods based on tracers that are chemically sensitive to the ionization fraction have thus been commonly used. These methods have mostly involved measuring the deuterium fractionation through abundance ratios involving simple molecular ions like DCO^{+}/HCO^{+} (Guelin et al. 1977, 1982; Dalgarno & Lepp 1984; Caselli et al. 1998) or N_{2} D^{+}/N_{2}H^{+} which is less affected by depletion (Caselli 2002). The idea is that the deuterium enrichment (defined as H_{2} D^{+}/H), initiated by the exchange reaction (1)
at low temperature, is limited by electronic dissociative recombination of H_{2} D^{+}, and that the resulting ratio is transmitted (with a known prefactor) to the deuteration fraction of other molecules such as HCO^{+}. Using such tracers, the ionization fraction is deduced either by using approximate analytical formulae representing simplified networks (Caselli et al. 1998; Miettinen et al. 2011; Caselli 2002), or by adjusting an astrochemical model including a full chemical network to the observations, using stationary chemical models (Williams et al. 1998; Bergin et al. 1999; Caselli et al. 2002; Fuente et al. 2016), timedependent ones (Maret & Bergin 2007; Shingledecker et al. 2016), or PDR models (Goicoechea et al. 2009). Despite the variety of determination methods, using different deuterated molecules, only very few works have proposed using other tracers than deuterated species. For instance, Flower et al. (2007) proposed the C_{6} H^{−}/C_{6}H ratio as a tracer of the ionization fraction, and Fossé et al. (2001) have investigated the relationship between the cyclictolinear ratio of C_{3} H_{2} and the ionization fraction. Deuterationbased approaches however suffer from several limitations due to the fact that they depend on other physical or chemical parameters that need to be determined independently. The initial deuteration reaction (Eq. (1)) is sensitive not only to the gas temperature but also to the essentially unmeasurable orthotopara ratio of H_{2} (Pagani et al. 1992, 2011; Shingledecker et al. 2016). Indeed, the endothermicity of the reaction in the backward direction (192K) is very close to the J = 1 to J = 0 energy difference of H_{2} (170.5K). Then, even a small fraction of oH_{2} (J = 1) contributes to H_{2} D^{+} destruction (with a reduced endothermicity of ≃ 20 K) and restricts the deuteration process. In addition, ratios such as DCO^{+}/HCO^{+} are linked to H_{2} D^{+}/H through reactions with neutral species like CO. The estimated deuterium fraction is therefore sensitive to the depletion factors of carbon, oxygen, and nitrogen that are not easy to evaluate (Caselli 2002). Moreover, deuterated tracers such as DCO^{+} are typicallyonly detectable in cold dense cores, representing only a tiny fraction of the observable area of a giant molecular cloud (GMC). Deuterationbased approaches are thus inadequate for an unbiased characterization of the conditions in GMCs as a whole.
Despitethe common use of advanced chemical models computing the abundances of hundreds ofspecies, the observed tracers to which these models are compared to estimate x(e^{−}) (deuterated molecules such as DCO^{+}) are still those initially proposed based on analytical reasoning using simplified chemical networks. The wealth of data produced in large chemical model grids remains largely unexploited. Their exploration of wide parameter spaces might reveal less intuitive but more efficient tracers. Based on this approach, we propose here a general and largely automatic method to identify the best observational predictors of the ionization fraction, when other important parameters such as the gas density, temperature or H_{2} orthotopara ratio are unknown. We apply this method to propose new predictors of the ionization fraction as a function of the molecular cloud conditions. We use simple stationary chemical models with a complete uptodate chemical network (Roueff et al. 2015), and use molecular ratios (column density ratios or integrated line intensity ratios) as observable tracers from which we seek to predict the ionization fraction. We base our investigation on the observed range of physical conditions and detected tracers in the IRAM30m Large Program ORIONB (Outstanding RadioImaging of OrioN B, coPIs: J. Pety and M. Gerin)^{1} . In this program, we imaged 5 square degrees towards the southern part of the Orion B giant molecular cloud over most of the 3 mm atmospheric window (Pety et al. 2017; Gratier et al. 2017, 2020; Orkisz et al. 2017, 2019; Bron et al. 2018; Roueff et al. 2020).
In the context of dense cores, the ionization fraction is linked with the cosmic ray ionization rate (CRIR) and both are often studied from the same molecular ratios (although direct tracers of the cosmic ray ionization rate can also be used in more diffuse medium, in particular H, e.g., Indriolo & McCall 2012; Le Petit et al. 2016). In our context of the Orion B giant molecular cloud, where UV illumination controls the ionization fraction in large parts of the cloud, we focus here on the question of estimating the ionization fraction only, independently of the source of ionization. The task of tracing the cosmic ray ionization rate (which has also been attempted using astrochemical model grids, e.g., Barger & Garrod 2020) will be considered in a future application of the method presented here.
In this first article, we present a generic method to find the best tracers of an unobservable physical parameter and apply it to the search of new tracers of the ionization fraction among the species that are detectable in the ORIONB dataset. The observational application of the tracers found here to study the ionization fraction in the Orion B molecular cloud will be presented in a second paper (Guzman et al., in prep.).
In Sect. 2, we describe our general statistical method for determining the best predictors of a given unobservable parameter based on the results of a model grid. In Sect. 3, we present the models used in this study for the search of ionization fraction tracers. We then present the ranking of observable predictors in Sect. 4. For ease of application of our results, we provide in Sect. 5 analytical fit formulae to deduce the ionization fraction from each of the proposed best predictors. We finally discuss our results in Sect. 6 and present our conclusions in Sect. 7.
2 Method
2.1 Principle
Both observable line intensities (or column densities) on the one hand, and the ionization fraction x(e^{−}) on the other hand, depend on multiple, unobservable physical parameters (e.g., gas density, elemental abundances, cosmic ray flux, ...). Our goal is to find reliable relationships between observable quantities and ionization fraction, despite lacking estimations of these hidden physical parameters. To do this, we first run model grids covering the whole possible parameter space. We then use a flexible regression method to fit x(e^{−}) as a function of one of the potential observational tracers through the whole grid of chemical models. This means that we treat the effects of the variations of the hidden parameters as sources of noise on the prediction of the ionization fraction. Finally, we use a quantitative measurement of the fit quality as an estimate of the predictive power of each potential tracer. These estimates are used to rank the tracers and highlight the most powerful predictors of the ionization fraction. The fitted models for the best tracers will provide readytouse tools to be applied to observations. The pipeline tool AutoRank, implementing the procedure described below, is available online^{2} .
2.2 Ranking tracers with Random Forests
Any a prioriinformation could easily be included in the method by sampling the parameters of the model grid according to a specific prior distribution. However, we wish to minimize the amount of a priori information injected in the method and to avoid making assumptions on the shape of the distributions of physical parameters (e.g., gas density, elemental abundances, cosmic ray flux, ...). We thus build a model grid that samples uniformly the possible range of values (see Sect. 3).
Our model grids provide us with a dataset comprising ionization fraction values and corresponding values of observable quantities. We will consider line intensity ratios or column density ratios as our observable quantities in this paper. The hidden physical parameter values introduce a non deterministic aspect to the relationship between x(e^{−}) and the observables: models might have identical values of an observable but different x(e^{−}) if the underlying physical parameters are different. Learning to predict x(e^{−}) from a given observable is then a regression problem, with the uncertainty introduced by the hidden physical parameters playing the role of noise. Determining the best tracers of x(e^{−}) is thus equivalent to finding observables for which the relationship to x(e^{−}) is least affected by this noise (i.e., by the hidden physical conditions). This means finding the observables for which the most accurate regression model can be found.
For this regression problem, we choose to use Random Forests (Breiman 2001) because their flexibility makes it possible to fit general nonlinear shapes, while their simplicity provides reasonable computational costs. This makes the method presented here very general and applicable to finding tracers of other physical parameters without any assumption on the shape of the relationship between the tracers and the target parameter. We will use RF for Random Forest in the rest of the paper. RF regression models are based on the concept of regression trees (Breiman et al. 1984), where a succession of binary decisions are made based on the input variables (e.g., x_{3} < 2 or ≥ 2) and constant values are predicted in each of the subsets of the partition that the decision tree defines. While such decision trees are easily interpretable, they require large tree depths to be flexible but are prone to overfitting if this depth is too large. RF tackle this overfitting problem by using the simple idea that multiple overfitted regression models will, when averaged, give a better prediction as long as the errors they individually make are uncorrelated between models. In a RF, the individual trees are made as independent as possible by introducing randomness in two aspects: (1) the building of each tree only considers a random subset of the input variables, and (2) each tree is given a bootstrapped sample (i.e. drawn by random sampling with replacement from the original dataset, Breiman 1996) instead of the original full sample. This way, the datasets seen by the different trees are independent and each tree only sees a subset of the dataset (bootstrapped datasets typically contain only 63% of the points of the original dataset as repetition is allowed). This provides a very flexible regression model, which retains some of the interpretability of decision trees. RF have thus quickly become a standard method in Machine Learning (see e.g., Hastie et al. 2001). In addition, they allow to estimate the generalization error of the fit (i.e., the error made when predicting data not seen during training): as each tree has only seen a random bootstrapped sample from the data, it is possible to estimate for each datapoint a partial prediction using only the trees that have not seen this datapoint during training. As the sample seen by a given tree is called a bag, these partial predictions are called outofbag predictions (OOB). Gratier et al. (2020) also use RF in the context of the interstellar medium and introduce the method in detail.
We thustrain RF regression models for each observable (using only one observable at a time) and estimate the accuracy of the regression models. This accuracy is taken as an estimate of the predictive power of the observable quantity considered, for the purpose of predicting x(e^{−}). The different observables can then be ranked according to this predictive power estimate. The accuracy of the regression model is estimated with the OOB R^{2},
where the index i runs across data points (individual model results), is the true value of ionization fraction (computed by the chemical model), is the OOB prediction value from the RF, and is the average of the (true) ionization fraction over the model grid. This coefficient R^{2} gives the fraction of the total ionization fraction variance (across the full model grid) that the RF model can explain from the given observable predictor alone (i.e., it measures the fractional decrease from the initial variance of x(e^{−}) to the variance of the residuals).
It is thus <1, with 1 representing perfect prediction (zero residual variance). We note that it can take a negative value when the model performs worse than predicting a constant value set at the average x(e^{−}) of the dataset. A value of 0 indicates a performance equivalent to this constant prediction of the average. This R^{2} value is used for the ranking of tracers. For more information, we also provide below the root mean square error
where N is the number of chemical models in our grid. The RMSE is completely univocally related to the R^{2} value, but ismore interpretable in terms of the amplitude of typical errors. We also provide the maximum absolute error,
This quantity, estimating the maximum error made by our regression model, is not guaranteed to converge when increasing the size of the dataset. It should thus not be interpreted further than being the largest error we observed in our limitedsize sample.
The RF model depends on a few internal parameters (number of trees, maximum depth of trees, etc.). Their values can affect the quality of the model and its tendency to overfit. We used a number of trees in the forest N_{trees} = 400 and a maximum tree depth d_{max} = 4. The procedure used to select these values is described in Appendix A. Our tests show that this optimization scheme is not critical for our purpose: while the choice of parameter values does affect the quality of the best fit RF model, it does not change significantly the ranking of the predictors that we deduce from it.
3 Chemical models
Here, we use the chemical code presented in Roueff et al. (2015) to study isotopic fractionation of deuterium, carbon, and nitrogen compounds. Single zone models with fixed density, temperature, visual extinction, radiation field, cosmic ray ionization rate, orthotopara H_{2} ratio, and depletion factors are computed at steady state.
We consider, for the present study, a chemical network including deuterium, and isotopic carbon and oxygen species where the deuterium, carbon and oxygen fractionation reactions have been introduced following the recent determinations of exothermicities by Mladenović & Roueff (2017). We introduce in particular D^{13} CO^{+}. Apart from these specific fractionation reactions, the chemistry of isotopically substituted species is built automatically from the chemical network of the major components. The chemical reactions involving one single carboncontaining reactant and one single carboncontaining product are duplicated with the same reaction rate coefficient. Simple statistical assumptions are introduced when two carbon containing molecules are implied in the reaction. Consider for example the case of the reaction
taking place with a reaction rate coefficient k. The reactions introduced for the isotopically substituted species are the following:
Such a procedure leads to an ensemble of 310 species linked through 8711 chemical reactions.
These models allow us to compute observable column density ratios for the hundreds of species included. Although commonly derived by observers, column densities are not the primary observable quantities, and we thus also compute line intensity ratios. To do so, we postprocess the results of our chemical models using a nonLTE excitation and radiative transfer model (RADEX, van der Tak et al. 2007) assuming a typical linewidth of 1 km s^{−1} (observed linewidths in the Orion B are typically of a few km s^{−1} Pety et al. 2017). Results based on column density ratios and based on line intensity ratios will be presented separately in the following sections.
It is unlikely that a single tracer will provide a good estimate of the ionization fraction x(e^{−}) in all possible physical conditions. Either the tracer will lose its relationship with x(e^{−}) in some conditions, or it might be too weak to be observable in other conditions. We thus decided to divide the range of possible conditions into subregions corresponding to the different types of environments found in GMCs (Pety et al. 2017; Bron et al. 2018). We focus on two kinds of environments: the translucent medium and the cold and dense gas, and we derive separate rankings of tracers for these two environments. The range of physical conditions explored for each of these environments are chosen based on our previous studies of Orion B and listed in Table 1.
In both grids, the gas density and gas temperatures were varied covering the typical ranges for translucent medium (3 × 10^{2}−3 × 10^{3} cm^{−3}, 15–100 K) and cold and dense medium (10^{3}−10^{6} cm^{−3}, 7–20 K). In the translucent model grid, external FUV photons still play an important role in the chemistry and in controlling the ionization fraction. This FUV illumination is controlled through an external FUV field strength G_{0} (see e.g., Hollenbach & Tielens 1999, p. 177) and an extinction A_{V} representingthe amount of shielding between the FUV source and the gas under consideration (also used as the depth of the slab when computing line intensities). We take into account selfshielding of H_{2} by using the approximate expression of Draine & Bertoldi (1996) and introduce also the shielding of CO by H_{2} from Heays et al. (2017). We consider lower extinctions and higher G_{0} values in translucent medium (A_{V} in the range 2–6, G_{0} of the external field in the range 1–1000) than in cold dense medium (A_{V} in the range 5–20, external G_{0} set to 1). We explore average to moderately strong FUV illumination values in the low density translucent grid. Regions with both high density high FUV illumination correspond to dense photodissociation regions (PDR), in which strong chemical and physical stratification on small spatial scale is critical. These regions would thus require the use of complete PDR models, such as the Meudon PDR Code (Le Petit et al. 2006). We thus did not explore this type of conditions in the present study.
Given the uncertainties about the cosmic ray ionization rate in molecular clouds (Lepp 1992; McCall et al. 2003; Indriolo et al. 2007), we consider the range of value 10^{−17}−10^{−15} s^{−1} . In the cold gas conditions, in order to account for the reduced cosmic ray fluxes (Padovani et al. 2009), we limit this range to 10^{−17}−10^{−16} s^{−1} . As sulfur can be an important contributor of electrons in neutral gas but has a highly uncertain gasphase elemental abundance (Agúndez & Wakelam 2013; Goicoechea et al. 2006), we explore in both grids values of [S], the relative sulfur abundance with respect to H, in the range 1.86 × 10^{−8}−1.86 × 10^{−5}.
We also explore ranges of H_{2} orthopara ratio (OPR) which impact significantly two reactions: namely, H_{2} D^{+} + oH_{2} → H + HD (Pagani et al. 1992) where the energy endothermicity is reduced to 61.5 K (compared to 232K with p–H_{2}) and N^{+} + oH_{2} → NH^{+} + H which is slightly endothermic(~44 K), whereas N^{+} + pH_{2} → NH^{+} + H is more strongly endothermic(~170 K), as first emphasized by Le Bourlot (1991). For this reaction, we follow the prescription of Dislaire et al. (2012) which is derived from experimental results. We use higher values of the OPR in the warmer translucent medium (0.1−3) than in cold dense gas (10^{−4}−10^{−1}). Finally, cold dense cores offer conditions where molecules can freeze out on dust grains, depleting the gas phase abundances of elements such as C and O. In our cold dense medium grid, we thus in addition explore depletion factors going from 1 (no depletion) to 10 (C elemental abundance 10 times lower than the reference values) witha constant C/O elemental ratio value of 0.6 (the elemental abundance of carbon is taken to be [C] = 1.32 × 10^{−4} when there is no depletion). Other parameters that might have an impact (although second order compared to the parameters considered here), such as variations of the metal elemental abundances or PAH abundance, were not considered in this study. The gasphase elemental abundances for metals (relative to H) are taken to be [Fe] = 1.5 × 10^{−8}, [Cl] = 1.8 × 10^{−7}, [Si] = 8.2 × 10^{−7}, [F] = 1.8 × 10^{−8}, [Ar] = 3.29 × 10^{−6}.
For each medium type, a set of 5000 models was run, sampling randomly and uniformly within the chosen parameter space. The adequacy of this number of models for our purpose is ascertained later when estimating the uncertainties on our results.
Given the variation by orders of magnitude both in the parameter values and the computed observables, we choose to work with the logarithm of all quantities. We sampled uniformly on the logarithm of each parameter within the ranges indicated above. The method described in Sect. 2 is applied on the logarithm of all quantities (i.e., training RF models using the logarithm of column density ratios or line intensity ratios to predict the logarithm of x(e^{−})). Representative error values such as the RMSE on logarithms are equivalent to representative error factors on the actual quantity. These corresponding error factors are given in parenthesis in the result tables of the following sections.
To get rid of possible instrumental, calibration and other source geometry effects, we choose to work only on ratios of observable quantities (either column density ratios or line intensity ratios). In the following, we use the term tracers for ratios of observable quantities.
Among all the species computed in our chemical model, we made a selection of species that are detected in the radio observations of the ORIONB project and potentially linked to the ionization fraction. Our search for the best tracers is made among the ratios of these selected species. For the translucent medium condition, we selected ^{13}CO, C^{18}O, HCO^{+}, HCN, HNC, CN, C_{2}H, CS, SO, H_{2}CS, HCS^{+}, CF^{+}. The search for a best ratio was thus done among 66 possible column density ratios (and 66 line intensity ratios). For dense cold medium conditions, we considered the same selection with the addition of DCO^{+} and N_{2} H^{+}. We thus had 91 possible column density ratios (and 91 line intensity ratios) in this case. For line intensities, the exact transition and frequency considered for each species are listed in Table 2. In the RADEX computations of line intensities, we account for collisional excitation with electrons (using the ionization fraction computed by the chemical model) for species for which collisional data with electrons are available in RADEX (HCO^{+}, HCN and C_{2}H). We note that for CN, excitation by electrons was not included as in the current version of RADEX, the collisional data that includes the hyperfine structure of CN does not include collisions with electrons. We chose to privilege the fact of accounting for the hyperfine structure here.
Range of physical parameters explored for each of our two classes of medium: gas density n_{H}, gas temperature T_{gas}, incident FUVradiation field intensity G_{0}, lineofsight visual extinction A_{V}, cosmicray ionization rate ζ, H_{2} orthotopararatio OPR, depletion factor and sulfur gasphase elemental abundance [S].
Shorthand names, full quantum number designation, and frequency of the molecular lines we considered.
4 Tracers rankings
We applied the method described in Sect. 2 to the two chemical model grids (translucent medium conditions and cold dense medium conditions) presented in Sect. 3 in order to obtain a ranking of the selected potential tracers according totheir usefulness for predicting the ionization fraction.
4.1 Translucent medium
Figure 1 presents the predictive power (estimated as the R^{2} of a RF fit) of each tracer for the best 20 tracers. The left panel shows the result when taking column density ratios as observable quantities, and the right panel the results when considering line intensity ratios. We see that the ranking is similar in both cases, suggesting that excitation and radiative transfer only have a moderate effect on the relationship between these tracers and the ionization fraction. A more complete ranking (covering all tracers having R^{2} > 0.5) is given in Table B.1 (due to their sizes, the results tables of this section and of Sect. 5 are placed in Appendix B)^{3} .
In both cases, about ten different ratios are found to be each able to explain more than 80% of the ionization fraction variance (R^{2} > 0.8). We emphasize that this means that an accurate prediction of the ionization fraction is possible from each of these tracers despite not knowing thevalues of the 7 physical and chemical parameters that have been varied in our model grid (cf. Table 1). The R^{2} values are slightly lower when using line intensity ratios rather than column densities ratios, indicating that excitation and radiative transfer effects tend to increase the degeneracy between the ionization fraction and other unknown parameters, but this effect remains moderate. To illustrate the performance of the tracers found with this ranking, we show on the left panel of Fig. 2 the ionization fraction versus the best ranked column density ratio (C_{2} H/HCN) in our grid of models for translucent medium conditions (blue symbols, with contours in shades of blue indicating isoPDF contours encompassing 25, 50, and 75% of the distribution) and the prediction of the fitted RF model (solid red line), which is found in this case to explain 95.7% of the ionization fraction variance in our grid. The remaining scatter around the relationship represents the effect of ignoring all other parameters (gas density, temperature, UV field, H_{2} OPR,...). The best ranked line intensity ratio (C_{2}H (1–0) / HCN (1–0)) is similarly shown on the right panel of Fig. 2 with the corresponding fitted RF model (solid red line).
In translucent gas, the ionization is still dominated by the effect of external FUV photons ionizing carbon (and to a lesser extent sulfur and chlorine), and is slowly decreasing as the total extinction increases. In the conditions covered by our translucent grid, we find x(e^{−}) ranging from 2 × 10^{−4} to 2 × 10^{−7}. C_{2} H, which we find in several of the best ratios, is known to be enhanced in FUV illuminated environments (Pety et al. 2005, 2017; Cuadrado et al. 2015; Guzmán et al. 2015; Gratier et al. 2017): as explained in Beuther et al. (2008), C_{2} H traces the amount of carbon not locked into CO, and is thus sensitive to the FUV flux through CO photodissociation and the presence of C and C^{+} at a significant abundance level. In our translucent medium model grid, we indeed find C^{+} to be the main charge carrier and thus to very strongly correlate with x(e^{−}). H^{+} and H may also contribute to the electronic fraction in environments where the cosmic ionization reaches values above 10^{−16} s^{−1} (Le Petit et al. 2016). However, C^{+}, an open shell ion, is chemically reactive with various molecules, except H_{2} ^{4}, and is at the origin of a complex chemistry with insertion of carbon atoms. C^{+} itself is not straightforwardly detectable as its fine structure transition at 158 μm requires spaceborne or airborne observations. But we may expect that molecules involving C^{+} in the initial chemical steps allow to probe the electronic fraction. Our finding of ratios involving C_{2} H (relative to e.g., HCN, HNC or CN) as good proxies of the electronic fraction is a natural consequence of the relevance of C^{+} as one of the main positive charge carriers. The initial step of C_{2}H formation involves indeed the C^{+} + CH → C + H reaction, followed by subsequent reactions with H_{2} up to C_{2} H, which recombines to form C_{2}H. Molecules such as HCN, HNC, or CO and its isotopologues, on the other hand, are saturated stable molecules which scale with column density. As a result, ratios such as C_{2}H/HCN, whose transitions are easily detectable, offer a convenient diagnostic tool of the electronic fraction in translucent medium. The electronic fraction is then, as shown in Fig. 2 (left panel), an increasing function of the C_{2} H/HCN ratio.
CF^{+} is another proxy of C^{+}, as describedin Neufeld et al. (2006); Guzmán et al. (2012), and ratios involving this ion are also found here to be good tracers of the ionization fraction. However, this ion is relatively scarce since it involves fluorine, which has a low relative abundance to H_{2} and has only been detected in PDR environments so far (detectability issues are investigated in Sect. 6.2).
In order to estimate the reliability of our results and determine if our 5000model grid is sufficient to explore the chosen parameter space for our purpose, we compute errorbars on the predictive power estimate (the R^{2} of the RF fit). To do so, we use 10fold cross validation: the model grid is randomly split into 10 parts, and for each of these parts, a RF model is trained on the other 9 parts and tested on the remaining part which it has not seen during training. From these 10 estimates of the R^{2} (all made on samples unseen during training), a reliable estimate of the predictive power on unseen data is made, as well as an estimate of the standard error on this predictive power. The corresponding error bars are also shown in Fig. 1. For most of the points, the errorbars are smaller than the marker, and the inset on the left panel presents a zoom on the first five ratios, showing the magnitude of the errorbars. This shows that the uncertainty induced by the finite size of our model grid in negligibly small and that our 5000model grid is sufficient for our purpose. However, this conclusion should be taken with some caution as ithas been shown that there exists no unbiased estimator of the variance of a crossvalidation estimate (Bengio & Grandvalet 2004) and that a naive estimation of this variance tends to underestimate it by a factor of up to 4 (Varoquaux 2018).
Fig. 1 Ranking of column density ratios (left) or line intensity ratios (right) of observable tracers by order of the predictive power for predicting the ionization fraction (measured by the R^{2} coefficient), in the case of translucent medium conditions (showing only the first 20). Errorbars of the R^{2} estimates are computed by crossvalidation (see text for explanations). Inset in the left panel: zoom on the first five ratios in order to make the magnitude of the errorbars visible. 
Fig. 2 Ionization fraction versus the best column density ratio, C_{2}H/HCN (left panel), and the best line intensity ratio, C_{2}H (1–0) / HCN (1–0) (right panel) for tracing the ionization fraction in translucent medium conditions. The model grid is shown as a scatter plot, with the central crowded regions replaced by PDF isocontours containing 25, 50, and 75% of the points. Superimposed are the RF model (red line), the analytical fit (solid black line, presented in Sect. 5), the analytical fit of the 1σ uncertainty (dashed black lines, presented in Sect. 5), and the bounds of the validity range of the analytical fit (vertical lines, presented in Sect. 5). The quality estimates of the two models are indicated on the figure. 
Fig. 3 Ranking of column density ratios (left) or line intensity ratios (right) of observable tracers by order of the predictive power for predicting the ionization fraction (measured by the R^{2} coefficient), in the case of dense cold medium conditions (showing only the first 20). Errorbars of the R^{2} estimates are computed by crossvalidation (see text for explanations). 
4.2 Cold dense medium
Figure 3 presents the ranking of the best tracers in cold dense conditions (cf. Table 1), for both column density ratios (left panel) and line intensity ratios (right panel). Error bars computed by the cross validation procedure described above are also shown, confirming that the size of our model grid is sufficient for estimating the quality of the fits based on each tracer. We see that the R^{2} values of the best tracers are slightly lower than in the translucent medium case, indicating slightly stronger degeneracies with unknown parameters in this case (note that depletion was varied in this cold dense medium, in addition to the parameters varied in the translucent medium grid). However, we still find several tracers explaining more than 80% of the variance in x(e^{−}). The best column density ratio is here found to be CN/N_{2}H^{+}. The cold dense environments are essentially ionized through cosmic rays and secondary UV photons induced by cosmic rays. Electrons are primarily produced by cosmic ray ionization of H_{2} and destroyed in the efficient dissociative recombination reactions of the various molecular ions. He^{+} ions, produced by cosmic ray ionization of He, are also particularly efficient in ionizing the molecular reservoirs, CO, HCN, N_{2}, H_{2} O. This contributes to forming atomic ions, in addition to the H molecular ion resulting from H_{2} ionization and the other stable molecular ions resulting from proton transfer of H with stable molecules giving ions such as H_{2} D^{+}, HCO^{+}, H_{3} O^{+}, or N_{2} H^{+}. The ionization carriers are then shared amongst several different species, going from the simple atomic ions that do not react with H_{2} (i.e., C^{+}, S^{+}, H^{+}) and closed shell molecular ions such as H, HCO^{+}, H_{3} O^{+}, and N_{2} H^{+}. Molecular ions are principally destroyed by dissociative recombination reactions whereas atomic ions rather react with the present neutral molecules since radiative recombination is not efficient. One can thus expect that molecular ions are inversely proportional to the electron abundances, as seen with the CN/N_{2}H^{+} ratio which is found to increase monotonically with x(e^{−}) (cf. Fig. 4, left panel).
As in the translucent case, we find slightly lower R^{2} values for the best line intensity ratios than for the best column density ratio. However, we find a few ratios that have better scores as intensity ratios than as column density ratios. In particular, while the ^{13}CO/HCO^{+} column density ratio is found to be a poor predictor of the ionization fraction, the ^{13}CO (1–0) / HCO^{+} (1–0) line intensity ratio appears as one of the best tracers. Contrary to the previous cases, this is entirely an excitation effect. The abundance ratio of ^{13}CO/HCO^{+} is found to be mostly uncorrelated with x(e^{−}) and mostly constant in the cold dense medium model grid (ratio of about 10^{3} with a typical scatter of a factor of 2–3). However, HCO^{+} and ^{13}CO have strongly different critical densities: ~2 × 10^{5} cm^{−3} for HCO^{+} (in cold dense medium conditions, x(e^{−}) is too low for electron collisions to play a significant role) in comparison to ~ 2 × 10^{3} cm^{−3} for ^{13}CO. In the range of densities considered in the cold dense medium grid (10^{3}−10^{6} cm^{−3}), ^{13}CO excitation isthus mostly at local thermodynamic equilibrium (LTE) and its emissivity per molecule is thus constant with gas density, while HCO^{+} is transitioning from the subthermally excited regime to the LTE regime, and its emissivity per molecule thus increases with density. The ^{13}CO (1–0)/HCO^{+} (1–0) ratio thus decreases with gas density. On the other hand, we find the gas density to be very strongly anticorrelated with x(e^{−}) in these conditions (with a typical scatter of a factor of ~3), as cosmic ray ionization is the dominant source of ionization here and the recombination rates per ion scale with the gas density. These two effects combine to give a ^{13}CO (1–0) / HCO^{+} (1–0) ratio that increases with x(e^{−}) with a relatively tight correlation (cf. Fig. C.4, top right panel).
Figure 4 shows the relation between x(e^{−}) and the best column density ratio, CN/N_{2}H (left panel), and the best line intensity ratio, CF^{+} (1–0) / DCO^{+} (1–0) (right panel). We see a larger scatter than in Fig. 2, but a clear relationship is still found.
We note that the classical DCO^{+}/HCO^{+} ratio does not appear among the best tracers found here for cold dense medium conditions. This point is discussed in Sect. 6.1.
5 Analytical fit formulas
If possible, we recommend using the RF models described in the previous section when attempting to estimate x(e^{−}) from one of the best tracers listed above. However, the provided datafiles are dependent on a specific implementation of Random Forests (the scikitlearn module for Python, Pedregosa et al. 2011). For a simpler and more persistent solution (independent of any external software), we provide in this section simple analytical fit formulae for the best tracers found in Sect. 4. While the RF models are flexible enough to make the method described in Sect. 2 generally applicable to any model grid and any physical quantity we want to find tracers of, the analytical fits provided here use formulas that have been specifically chosen for the application presented here (finding predictors of the ionization fraction from our chemical model grid). There is no guarantee that these same formulae would perform adequately to find analytical fits with other model grids and/or another quantity to predict.
5.1 Prediction formulae
We use simple polynomial formulae (working as before on the logarithm of both the observable ratios and the ionization fraction) to fit the nonlinear relationships between the best tracers found in Sect. 4 and x(e^{−}). This is applied to all tracers which where found to have R^{2} > 0.5 in the previous RF analysis.
In the cold dense medium conditions, we thus use a simple polynomial of order 5: (2)
where x is the log_{10} of the column density ratio or line intensity ratio from which we want to predict log _{10}(x(e^{−})), f^{dense}(x) is our fitting function to in cold dense gas conditions^{5}, and the parameters a_{0} to a_{5} are our fit parameters. A fit is made for each of the tracers that had R^{2} > 0.5 (more than half of the variance explained) in the rankings of Sect. 4. The corresponding fit coefficient values for each columndensity ratio are given in Table B.7 and the coefficient values for line intensity ratios are listed in Table B.8.
In the translucent medium conditions, x(e^{−}) naturally reaches a plateau at the fractional abundance of carbon (1.32 × 10^{−4} in our undepleted models). We thus use a modified formula combining a polynomial of order 5 and a saturation: (3)
The fit parameters here are f_{max} and a_{0} to a_{5} . As in the cold dense medium case, f^{translucent}(x) is our fitting function to in translucent gas conditions. The corresponding fit coefficient values for each column density ratio are listed in Table B.5 and the coefficient values for line intensity ratios are listed in Table B.6.
The order of the polynomial in these functions is chosen so that further increasing it yields only marginal increase in R^{2} (estimated by crossvalidation to avoid overfitting). The R^{2} values found for the best tracers in each case are close to the values initially found with the RF models, indicating that our analytical fits are not significantly worse than the RF models, at least for the highR^{2} tracers. As an example, Figs. 2 and 4 also show the analytical fit (solid red line) in comparison to the RF model (solid black line). For a few of the lower R^{2} tracers, the analytical fits perform significantly worse as can be seen on the tables of Appendix B by comparing the R^{2} values of the RF models with the R^{2} values of the analytical fits.
Fig. 5 Illustration of the two sources of uncertainties for the best column density ratio for tracing the ionization fraction in dense cold medium, CN/N_{2}H^{+}. Left panel: analytical fit (solid black line), the standard deviation around this curve corresponding to the uncertainties on the fit coefficients (thin dashed red lines representing the 3σ level), and the bounds of the validity range defined in the text (vertical blue lines). Right panel: analytical fit (solid black line) and two estimates of the standard deviation corresponding to the residual scatter of the data points around the curve: the red dotted line presents a movingaverage estimate of the local standard deviation, and the dashed black lines shows our analytical fit of the residual standard deviation. On both panels, the chemical model grid is shown as a scatter plot, with the central crowded regions replaced by PDF isocontours containing 25, 50, and 75% of the points. 
5.2 Uncertainty formulae
Finally, a key point is to estimate uncertainties on our prediction of the ionization fraction. Here, we distinguish two sources of uncertainty.
Our best analytical fit is determined on a finite sample of models, so that the fit coefficients are only estimates of the theoretical best fit coefficients. Estimating again these fit coefficients from a different sample of models (drawn from the same distribution) would result in slightly different values, and these uncertainties on the fit coefficients in turn imply an uncertainty on the ionization fraction value predicted by the fit formula at any given value of the observable quantity (intensity ratio or column density ratio). In order to estimate this uncertainty on the predicted value, we proceed by bootstrapping: we repeat the fitting procedure on 100 bootstrapped samples (drawn from the original model grid) and report the standard deviation of the value of the fit function as its uncertainty.
The left panel of Fig. 5 shows the corresponding uncertainty (showing the 3σ level in dashed curves around the main prediction curve) in the case of the best column density ratio in cold dense gas conditions (CN/N_{2}H^{+}). We name this uncertainty the fit coefficients uncertainty to distinguish it from the second form of uncertainty below. We define the validity domain of our fit as the range of values of the observable ratio where our best fit is sufficiently constrained by our finite grid of models for this fit coefficient uncertainty to be negligible. In practice, we define it as the range of ratios where the abovedefined uncertainty remains lower than 2% of the predicted value. In the following, we will thus assume this uncertainty to be negligible inside of this validity domain and focus on the second form of uncertainties. The limits of the corresponding validity range are also shown on the left panel of Fig. 5 as blue vertical lines. Due to the tendency of high order polynomial fits to diverge quickly outside of the domain of the fitted dataset, the analytical fit formulae should not be used outside of the validity range defined here.
The second form of uncertainties comes from the unobservable parameters (density, temperature,...), which induce a scatter in the relationship between any of the line intensity ratios or column density ratios and the ionization fraction. Inside of the validity domain of the fit, this scatter in the residuals is much larger than the fit coefficients uncertainty, as can be seen in the left panel of Fig. 5, and is thus the dominant source of uncertainties. When applying the fit formula to real observations, in the ideal case where the chemical model used in this paper would be a perfect model of reality, we would expect the value predicted by the fit formula to commit a mean squared error equal to the variance of this scatter of the residuals, and this error represents our lack of information on the underlying physical conditions. As can be seen in the previous figures however, this residual scatter varies as a function of the observable predictor (the vertical scatter is smaller in some regions of the plot than in others). This residual variance as a function of the predictor can be estimated by a moving average method (shown in the right panel of Fig. 5, we used a window of 0.1 dex). In order to provide asimpler way of estimating this residual variance function, we fitted the simple analytical function (4)
to the squared residuals, thus providing a fit to the local variance. We note that this function provides a fit to the variance on the prediction of log_{10}(x(e^{−})), and, as previously, x is the log_{10} of the observable ratio. The absolute value in this function was chosen because the residual variance is by definition a positive quantity. The best fit coefficients to this residual variance function are given in Tables B.5–B.8. An example of the corresponding residual standard deviation function is shown in the right panel of Fig. 5, where we compare its movingaverage estimate (dotted red curves around the main prediction curve) to its squared polynomial fit (dashed black curve), showing in both cases the 3σ level, for the best column density ratio in the cold dense medium case (CN/N_{2}H).
The resulting fits are also presented for the best ratio in the different cases in Figs. 2 and 4, showing the RF fit, the analytical fit and the scatter fit. Similar figures for each of the 6 best tracers for both the translucent medium and the cold dense medium conditions, and for both column density ratios and line intensities ratios, are presented in Appendix C.
6 Discussions
6.1 Traditional ionization tracers
One of the most commonly used ionization fraction tracers is DCO^{+}/HCO^{+} (Guelin et al. 1977, 1982; Dalgarno & Lepp 1984; Caselli et al. 1998). It is used mainly in cold dense cores, where the temperatureis low enough to allow sufficient deuterium enrichment, and the column density is large enough to make DCO^{+} detectable. However, our results show that even in the cold dense gas regime, and despite including DCO^{+} in our list of potential tracers, the DCO^{+}/HCO^{+} column density ratio is not ranked among the best tracers of the ionization fraction. In fact, it is ranked as the 38^{th} best tracer in dense cold gas conditions, with a R^{2} of 0.57 only(cf. Table B.7). This ratio is often determined using observations of H^{13}CO^{+} as H^{12} CO^{+} can be optically thick in highcolumndensity lines of sight. We leave this aspect aside in this discussion by showing that the column density ratio DCO^{+}/HCO^{+} itself (however it might be determined observationally) suffers from several limitations as a tracer of the ionization fraction.
The relationship between the DCO^{+}/HCO^{+} abundance ratio and the ionization fraction found in our (cold and dense) model grid is shown in Fig. 6. The blue distribution shows the results of the model grid and the black line our best RF model, with a R^{2} of 0.57. Wesee that two main problems limit the usability of DCO^{+}/HCO^{+}.
First, a large scatter of the ionization values (by up to 3 orders of magnitude) is present at all values of the ratio, despite having a significant fraction of the distribution tightly located around a clear relationship (the outermost blue contour encloses 75% of the distribution). As a result, the best fit RF model tries to make a compromise between the tightly located part of the distribution, and the scattered points at lower ionization fraction values. We found most of this scatter to be related to variations in the orthotopara ratio of H_{2} (OPR), an unobservable parameter whose value remains difficult to estimate in observations of dense cold cores. For instance, selecting only the models having OPR, we see in Fig. 6 (red dashed contour) that we retain only the unscattered part of the distribution. Thus the difficulty of obtaining reliable estimates of the OPR of H_{2} in dense cores limits the use of DCO^{+}/HCO^{+} as a tracer of the ionization fraction.
Second, even when selecting the low OPR models, we see that the relationship presents a very steep slope at high ratio values (low ionization fraction values). As a result, for DCO^{+}/HCO^{+} ratios above 10^{−1.6}, a range of ionization fractions of more than two orders of magnitude is possible. The ratio would then only be usable for lower ratios, equivalent to ionization fractions larger than 10^{−6.5}. Even if the model grid presented no scatter at all, a steep slope implies that small observational uncertainties on the ratio will induce large uncertainties on the predicted ionization fraction. Thus, relationships with steep slopes are of limited use.
These different effects combine to make the DCO^{+}/HCO^{+} ratio a poor predictor of the ionization fraction, compared to the best ranked tracers found by our method.
Fig. 6 DCO^{+} over HCO^{+} column density ratio in our grid of dense cold gas models (blue points and contours), shown as a scatter plot, with the central crowded regions replaced by PDF isocontours containing 25, 50, and 75% of the points. Our best fit Random Forest model is shown as a black line. The red dashed contours shows the distributions of the models with OPR. 
6.2 Detectability constraints
So far, detectability constraints have been ignored. Predictive power has been tested from noiseless values of column density or line intensity ratios. However, the different lines considered here have widely different brightnesses and thus differ in terms of detectability with current instruments. We explore here the effect of various noise levels on the predictive power of the different line ratios. We will consider two noise setups corresponding to two observation scenarios: the case of one constant noise level for all lines, corresponding to the typical case of a line survey where faint lines are detected with a lower signaltonoise ratio (S/N) than bright ones, and the case of a fixed S/N for all lines, corresponding to observations being designed to reach a set S/N for a few desired lines. These two cases correspond to the two opposite extremes of possible observation scenarios and will give us a general overview of the possible impact of noise on the performance of the tracers. In both cases, synthetic noise is added to the line integrated intensity values and we consider the S/N on the integrated intensity and not the peak intensity. All noise and S/N values quoted are for individual lines, not for line ratios.
In bothcases, in order to measure how the predictive power (measured as the R^{2} value) is affected by noise, we perform a modified crossvalidation. The model grid is randomly split in ten parts. For each of these tenths and for each line ratio:
 1.
A RF model is trained from the other nine parts of the model grids, without any noise added.
 2.
The trained RF is tested on the tenth under consideration, with added noise (either with a constant noise variance for all models and all lines, or with a constant S/N for each model and line). Since the RF models take the log_{10} of the intensity ratio as input and since the addition of noise can produce negative values, we only apply the RF model when both line integrated intensity values are above 1 σ. Otherwise, we do not use the RF model and simply take the average ionization fraction value of the grid as our prediction.
Finally, we take the average R^{2} value obtained over the ten noisy tests as the predictive power of the tracer under noisy conditions. We thus avoid estimating R^{2} from datapoints that have been seen during training, and we estimate the predictive power of a model trained on noiseless data when applied to noisy data (as would be the case when applying the results of this article to real observations).
The results for the translucent medium grid, for a few possible line ratios, are presented in Figs. 7 and 8. Figure 7 presents the scenario of a constant noise level σ_{noise} for all lines and all models, and shows how the R^{2} of the prediction from different line ratios decreases when the noise level is increased. We show only the three best line ratios (according to the noiseless ranking of Table B.2): C_{2} H (1–0)/HCN (1–0), C_{2}H (1–0)/^{13}CO (1–0), and C_{2}H (1–0)/C^{18}O (1–0), and the four tracers found to be the least sensitive to noise. We define the tracers least sensitive to noise as those having the highest σ_{1∕2} among tracers with R_{noiseless} ≥ 0.7, thus giving a compromise between a good fit quality (high R_{noiseless}) and a slow decrease with increasing noise level (high σ_{1∕2}). The four best ratios found according to this definition and shown on the figure are ^{13}CO (1–0)/C^{18}O (1–0), C^{18}O (1–0)/CF^{+} (1–0), ^{13}CO (1–0)/CF^{+} (1–0), and HCN (1–0)/CF^{+} (1–0). We see that the best three tracers, all including C_{2}H, have their predictive power decreasing sharply at a relatively low noise level (their R^{2} decreases by 50% at σ_{noise} ~ 5 × 10^{−3} K km s^{−1}). This is due to the relatively low brightness of the C_{2}H line in translucent conditions (median integrated intensity of ~7 × 10^{−3} K km s^{−1} in our translucent medium grid). In comparison, some line ratios built from brighter lines, despite a lower R^{2} on noiseless data, are found to perform better in the presence of noise: the R^{2} for the ^{13}CO (1–0)/C^{18}O (1–0) ratio decreases by 50% at σ_{noise} ~ 4 × 10^{−2} K km s^{−1} (C^{18}O (1–0) has a median integrated intensity of ~3 × 10^{−1} K km s^{−1} in this grid), the C^{18}O (1–0)/CF^{+} (1–0) ratio has its R^{2} decreased by 50% at σ_{noise} ~ 10^{−2} K km s^{−1} (CF^{+} (1–0) has a median integrated intensity of ~1.2 × 10^{−2} K km s^{−1} in this grid). We note, however, that ratios built from CF^{+} (1–0) actually only perform marginaly better than the best ratios involving C_{2} H at high noise levels (see Fig. 7) due to the low brightness of CF^{+} (1–0). For a more exhaustive comparison of the line ratios, Table 3 gives for each line ratio the noise level σ_{1∕2} at which the R^{2} is decreased by half.
Figure 8 similarly shows the results for the scenario of a fixed S/N for all lines and all models, still in the case of translucent medium conditions. The variations of the R^{2} of the prediction with the S/N are shown for the best seven tracers (according to the noiseless ranking of Table B.2). In this scenario, we find as expected that the predictive power drops at an S/N of order unity. The only exception is the ^{13}CO (1–0)/C^{18}O (1–0) ratio, which decreases slightly earlier. This is due to this ratio spanning a relatively small range of values in our grid of models (approximately one order of magnitude) while the other line ratios span ranges of four to six orders of magnitude. As the ionization fraction spans a range of values of 2.5 orders of magnitude in the grid, this implies that the relationship between the ionization fraction and the ^{13}CO (1–0)/C^{18}O (1–0) ratio has a much steeper slope than for the other line ratios. A steep slope then implies that small errors in the line ratios result in large errors in the predicted ionization fraction. As a result, the ^{13}CO (1–0)/C^{18}O (1–0) ratio requires significantly higher S/N than the other line ratios. Similarly to σ_{1∕2}, we define S/N_{1∕2} as the S/N value at which R^{2} is half of its value on noiseless data. The S/N_{1∕2} values for all line ratios are also given in Table 3.
The results for dense and cold medium conditions are similarly shown in Figs. 9 and 10 and Table 4. In the case of a constant noise level, we find that the R^{2} drop generally occurs at higher noise levels than in the translucent medium case, as expected because the lines are brighter in the cold dense medium due to higher column densities. Figure 9 shows the decrease of R^{2} with σ_{noise} for the best three dense gas tracers (according to the noiseless ranking of Table B.2), and for the four ratios least sensitive two noise (according to the same definition as previously). We note that ^{13}CO (1–0)/HCO^{+} (1–0), the second best ratio on noiseless data (therefore already shown on the figure), has also the highest σ_{1∕2} value among ratios with R_{noiseless} ≥ 0.7, so that we show the next four ratios least sensitive to noise on the figure. These four ratios are found to be C^{18}O (1–0)/HCO^{+} (1–0), HCO^{+} (1–0)/CN (1–0), SO (32)/CN (1–0), and HCN (1–0)/CN (1–0). The σ_{1∕2} values for all line ratios are listed in Table 4. When considering a constant S/N scenario for the dense cold medium case, we again find that the R^{2} drop occursat an S/N value of order unity, as shown in Fig. 10. The S/N_{1∕2} values for all line ratios are listed in Table 4.
When interpreting the results of this study, one must keep in mind that the decrease in R^{2} in our two scenarios (constant σ_{noise} or constant S/N) does not come from the same effect. In the constant σ_{noise} scenario, at a given σ_{noise} level one part of the grid has undetected or very low S/N values for the ratio under consideration, while the other part has high S/N. The R^{2} decrease is indicative of the growing fraction of the parameter space with undetected/low S/N. As a result, even when finding a low overall R^{2}, there might remain a fraction of the parameter space were the predictor remains very good (usually, high column density, high volume density, high temperature,...), which we did not characterize here. In the constant S/N scenario on the other hand, the S/N is by design constant over all models, independent of the physical parameters. The decrease in overall R^{2} is then more representative of the decrease in predictive power at any point in the parameter space. As a result, a ratio with a low σ_{1∕2} value might still be usable in real observations with a higher noise level but would be restricted to high brightness regions of GMCs (in the corresponding lines), while a ratio cannot be used at all in observations with S/N significantly lower than its S/N_{1∕2} value.
Fig. 7 Evolution of the R^{2} performance of the Random Forest predictors (for translucent gas) when adding noise of constant variance to all line intensities, as a function of the noise level σ_{noise}, for the 3 best tracers for translucent medium (C_{2}H(1–0)/HCN(1–0), C_{2} H(1–0)/^{13}CO(1–0), and C_{2} H(1–0)/C^{18}O(1–0)) and for the four tracers least affected by the noise (^{13}CO(1–0)/C^{18}O(1–0), C^{18}O(1–0)/CF^{+}(1–0), ^{13}CO(1–0)/CF^{+}(1–0), and HCN/CF^{+}(1–0)). 
Fig. 8 Evolution of the R^{2} performance of the Random Forest predictors (for translucent gas) when adding noise of constant signaltonoise ratio to the line intensities, as a function of the S/N, for the 7 best tracers for translucent medium. 
Impact of adding noise to the line intensities on the predictive power R^{2} (for translucent medium), measured by the noise level σ_{1∕2} (in a constantnoise level situation) and the signaltonoise ratio S/N_{1∕2} (in a constantS/N situation) for which R^{2} reaches one half of its value in the absence of noise .
Fig. 9 Evolution of the R^{2} performance of the Random Forest predictors (for cold dense gas) when adding noise of constant variance to the line intensities, as a function of the noise level σ_{noise}, for the 3 best tracers for cold dense medium (first three curves) and for the four tracers least affected by the noise (next four curves). 
Fig. 10 Evolution of the R^{2} performanceof the Random Forest predictors (for cold dense gas) when adding noise of constant signaltonoise ratio to the line intensities, as a function of the S/N, for the 7 best tracers for cold dense medium. 
Impact of adding noise to the line intensities on the predictive power R^{2} (for cold dense medium), measured by the noise level σ_{1∕2} (in a constantnoise level situation) and the signaltonoise ratio S/N_{1∕2} for which the R^{2} reaches one half of its value in the absence of noise .
6.3 Chemical model reliability
Independently of the statistical method that we present in this article, the results obtained rely on the chosen chemical model and its limitations. Previous works on ionization fraction tracers have mostly used stationarystate results of singlezone chemical models (although some works have used timedependent models, e.g., Maret & Bergin 2007; Shingledecker et al. 2016). As these previous studies have been mostly limited to deuterationbased tracers, the focus of the present article has been on highlighting the nondeuterationbased tracers that can be found for the ionization fraction from similar chemical models. We discuss here the impacts of our model’s limitations on our results.
Our singlezone model cannot include a detailed treatment of UV radiative transfer through the cloud. While most photodissociation rates can be simply estimated based on an assumed optical depth of dust protecting each model from UV photons (parameter A_{V} in our models), species such as H_{2}, CO and its isotopologues can be protected from photodissociation by self or mutualshielding. While selfshielding of H_{2} and mutual shielding of CO by H_{2} are included in our model using approximations (Draine & Bertoldi 1996; Heays et al. 2017), mutual shielding of ^{13}CO and C^{18}O by H_{2} and ^{12} CO are not included. As a result, in our translucent medium models where photodissociation by external UV photons still plays an important role, we expect the abundances of the rarer CO isotopologues to be less reliable than the other species. We note that the observations of ^{13}CO and C^{18}O in our ORIONB dataset indeed present specificities (systematic excitation temperature differences with ^{12} CO, Bron et al. 2018; Roueff et al. 2020) that remain unexplained even by more complex 1D PDR models.
The only explicit surface reaction in our chemical model is H_{2} formation, however we account for the freezeout of CO through our depletion parameter. The list of species that we consider as possible tracers has been restricted to species that are not strongly affected by surface chemistry beyond the depletion effect. Extension to more complex molecules would require a chemical model including a more complete treatment of surface chemistry.
Anothersource of uncertainties comes from the experimental or theoretical estimates of the reaction rate coefficients used in our chemical network. While we did not directly perform a sensitivity analysis of the reaction rate coefficients, our model grids include temperature variations which subsequently impact the reactions rate coefficients through their temperature dependence. This is especially true for the important dissociative recombination reactions which display significative temperature dependences. Temperature is then considered as an unobserved parameter when searching for good tracers of x(e^{−}). The discovery of strong relationships between some of the ratios and x(e^{−}), despite temperature variations in the grid, thus indicates that these relationships are to some extent robust to the reaction rates. A careful analysis of the magnitude and correlations of model uncertainties resulting from reaction rate uncertainties in the chemical network would deserve a separate study.
Timedependent effects are expected to be more important in cold dense medium conditions than in translucent medium conditions as photochemistry has shorter timescales in the latter case. Timedependent effects that result from the time evolution of some physical parameter (e.g., density and temperature during the contraction of a core) while the chemistry follows in a quasistationary way are in part accounted for in our models by exploring a large range of the various physical parameters that can be subject to variations (see Table 1). In addition, the progressive freeze out of CO on dust grains is accounted for by considering a range of depletion factors for carbon and oxygen. Similarly, the slow evolution of OPR in cold gas can keep parts of the chemistry (deuterium chemistry, nitrogen chemistry) in a timedependent evolution that depends mainly on the evolution of OPR. This is also in part accounted for by exploring a large range of OPR values in our models. The tracerfinding method presented in this article will be applied to timedependent chemical models in a future study.
The final and most important limitation of our model is that it does not include a spatial dimension: gas at a single value of density, temperature, etc. is assumed to be exposed to a given radiation field and protected by a given column density. As a result the possibility that different emission lines originate in separate layers of gas on the line of sight is completely neglected. Variations of the physical conditions along the line of sight can indeed have an important impact on the observables (e.g., Levrier et al. 2012). This limitation could be important both in the translucent medium where physical and chemical gradients are present due to the progressive extinction of the external UV field, and in dense cores with density and temperature gradients. As a result some caution must be exercised when choosing the line ratios to consider, ratios involving two species expected to emit in completely different regions should be avoided. For instance, the C_{2} H (1–0)/N_{2}H^{+} (1–0) that is found as the fourth best line ratio in dense cold medium should be avoided: C_{2} H is known to be a tracer of UV illumination and thus more likely to be emitted at the external surface of a given clump, while N_{2} H^{+} is abundant in the inner regions of the core where CO is already significantly depleted. An application of our method to 1D PDR models to better account for this effect in translucent medium conditions will be carried out in a future work.
6.4 Parameter PDF in the model grids
In the model grids used in this study, we sampled uniformly (in logarithm) for the values of the unobservable physical conditions (gas density, temperature, UV field, etc.) in an hypercube defined by lower and upper bounds for each of the parameters. The results of our ranking method will depend on this assumed PDF (probability density function) for the physical conditions in the interstellar medium (ISM). We made here the choice of making the minimal assumption: knowing only reasonable lower and upper bounds on each of the parameters, the uniform distribution is the maximum entropy PDF (i.e. the PDF that best represents our assumed state of knowledge). As a perspective, if more a priori knowledge is available, then more accurate assumptions for the PDF (in particular for the correlations between the different physical parameters) could reveal additional tracers.
We notethat this assumption of a uniform PDF over a maximum support (in the sense that any more accurate PDF would have almost all of its weight enclosed in this support) makes it likely that any tracer found to have a very good relationship with the ionization fraction would keep a strong relationship for more accurate PDF choices (if the relationship is strong over the full hypercube, it should with high likelihood stay strong on subregions of this hypercube). In this sense, we expect the tracers found here to remain reliable, but a more accurate PDF choice might strongly increase the performance of some tracers found here to perform poorly and thus reveal additional tracers of the ionization fraction. This argument remains however qualitative as pathological cases of PDF might be constructed that would radically change the rankings of the tracers.
As a result, the precise rankings presented in this article could slightly change, but we expect the good tracers highlighted by these ranking to be reliable for more realistic PDFs of the physical conditions in GMCs.
6.5 Finalrecommandation
Based on the limitations discussed above (detectability and model reliability), we recommend the use of the following integrated line intensity ratios to trace the ionization fraction.

In translucent medium conditions, we recommend the use C_{2}H (1–0)/HCN (1–0), C_{2}H (1–0)/HNC (1–0), or C_{2}H (1–0)/CN (1–0). If sensitivity is an issue, HCN (1–0)/CF^{+} (1–0) can sometimes perform as well as the previously listed ratios.

In cold dense gas conditions, we recommend the use of CF^{+} (1–0)/DCO^{+} (1–0), ^{13}CO (1–0)/HCO^{+} (1–0) or CN (1–0)/N_{2}H^{+} (1–0) if detectability is not an issue for these species, and of ^{13}CO (1–0)/HCO^{+} (1–0) or C^{18}O (1–0)/ HCO^{+} (1–0) otherwise.
This list is of course not exhaustive, and other ratios can give satisfactory predictions (see Tables 3 and 4) if the species listed above are not available.
In translucent gas conditions, this recommendation is based on the following points. After eliminating rarer CO isotopologues based on our discussion of mutualshielding effects on selective photodissociation in low/moderate A_{V} regions (cf. Sect. 6.3), and eliminating ratios involving sulfur species that were found to require unreasonably low noise levels of 10^{−4}−10^{−5} K km s^{−1}, the three remaining best line intensity ratios are C_{2}H (1–0)/HCN (1–0) and C_{2}H (1–0)/HNC (1–0) and C_{2}H (1–0)/CN (1–0). If noise sensitivity is critical, the tracer found to have the best predictive power at high noise levels is found in Sect. 6.2to be HCN (1–0)/CF^{+} (1–0) but is negligibly better than the three previously mentioned at high noise levels.
In cold dense gas conditions, the three best ratios are CF^{+} (1–0)/DCO^{+} (1–0), ^{13}CO (1–0)/HCO^{+}, and CN (1–0)/N_{2}H^{+} (1–0). If noise sensitivity is critical, we found in Sect. 6.2 that the ratios with the best predictive power at high noise levels are ^{13}CO (1–0)/HCO^{+} and C^{18}O (1–0)/HCO^{+} (1–0).
7 Conclusions
In this paper, we present a general statistical method to find the best observable tracers of an unobservable parameter based on a grid of models spanning the range of possible values for all the unknown underlying physical parameters (e.g., gas density, temperature, depletion, etc.). Our method estimates the predictive power of each potential observable tracer by training a flexible, nonlinear regression model (a Random Forest model, making no assumption on the nonlinear shape of the relationship to be found) on the task of predicting the target quantity from each of the potential tracers. The fit quality on test data, measured as the R^{2} coefficient by crossvalidation and outofbag estimation, is used to rank the potential tracers by order of predictive power.
In the context of our recent studies of the Orion B GMC (Pety et al. 2017; Gratier et al. 2017; Orkisz et al. 2017, 2019; Bron et al. 2018), we have applied this method to the important astrophysical question of tracing the ionization fraction in the neutral ISM, with the goal of being able to probe its variations across a whole GMC, from its translucent enveloppe to its dense cores. We considered grids of singlezone, stationary state astrochemical models exploring wide ranges of values in gas density, temperature, external UV field, A_{V} on the line of sight, cosmic ray ionization rate, orthotopara ratio of H_{2}, depletion factor, and sulfur elemental abundance. For a finer exploration of the possible conditions, we considered two grids corresponding to translucent medium conditions and cold dense medium conditions respectively, based on the different types of environments found in the Orion B GMC (Pety et al. 2017; Bron et al. 2018).
We considered successively column density ratios and line intensity ratios as potential tracers, focusing on species observable in the band at 100 GHz of our observations of Orion B. We find that in both cases and in both types of physical conditions, multiple ratios allow accurate predictions of the ionization fraction, with R^{2} > 0.8 (and up to 0.96). We investigated the impact of the noise level on the predictive capability of the different ratios After accounting for detectability and model reliability, we recommend:

for translucent medium conditions, C_{2}H (1–0)/HCN (1–0), C_{2}H (1–0)/HNC (1–0) or C_{2}H (1–0)/CN (1–0);

for cold dense medium conditions, CF^{+} (1–0)/DCO^{+} (1–0), ^{13}CO (1–0)/HCO^{+} (1–0) or CN (1–0)/N_{2}H^{+} (1–0) at low enough noise level, or ^{13}CO (1–0)/HCO^{+} (1–0) or C^{18}O (1–0)/HCO^{+} (1–0) if sensitivity is an issue.
In order to simplify the use of these predictors, we constructed ad hoc analytical fits (using polynomials or saturated polynomials) of the relationship of each observable tracer to the ionization fraction. Contrary to the Random Forest models, the choice of the analytical form of these fits is specific to the types of relationships observed in this specific application (different analytical forms might be necessary for other applications). We also provide analytical formulae to estimate the uncertainty on any measurement of the ionization fraction from these tracers. These tracers will be used to study the ionization fraction in the Orion B molecular cloud in a second paper (Guzman et al., in prep.).
The method presented here is very general and could be easily applied to finding tracers of other related (cosmic ray ionization rate, absolute electron abundance) or unrelated (gas density, temperature, OPR,...) unobservable quantities. This method can also be extended simply to simultaneously use pairs (or more) of line ratios (by training RF models on the possible combinations of line ratios), which would likely further increase the quality of the prediction.
Acknowledgements
We thank the anonymous referee for his comments that helped improve this article. We thank the CIAS for their hospitality during the many workshops devoted to the ORIONB project. This work was supported in part by the Programme National “Physique et Chimie du Milieu Interstellaire” (PCMI) of CNRS/INSU with INC/INP, cofunded by CEA and CNES. This project has received financial support from the CNRS through the MITI interdisciplinary programs. The authors also acknowledge funding by Paris Observatory through the AF Astrochimie program. J.R.G. thanks Spanish MICI for funding support under grant AYA201785111P.
Appendix A Random Forest parameter optimization
The Random Forest contains a few tuning parameters for which a value needs to be chosen before training. In particular, we will consider only the two most important here: the number of trees in the forest N_{trees}, and the maximum depth allowed for each tree, d_{max}. Another usual parameter is not considered here: the number of predictors in the random subset considered when choosing along which axis to make a split. Indeed, we will only train RF models with a single predictor at a time, so that this number is necessarily 1.
Since optimizing the choice of these parameters on the full model grids (that will then be used for training) and for each potential tracer will lead both to an increased risk of overfitting and a heavy computational time cost, we opt for a very limited optimization, where a single set of parameter values is used for all tracers (i.e. for all RF models trained on a single ratio)and for all model grids. The selection of these parameter values is done with a simplified procedure:
 1.
For each model grid, RF models are first trained on each ratio using default parameter values (default sklearn values, N_{trees} = 100 and d_{max} = ∞), and the OOB R^{2} value is calculated for each ratio.
 2.
For the two ratios having the best and worst R^{2} in the previous step, RF models are trained for a grid of parameter values exploring N_{trees} = 50−800 and d_{max} = 1−12, and the OOB R^{2} is again computed for RF models with each possible combination of parameter values (the OOB value is used to limit the risksof overfitting).
 3.
As the R^{2} maps obtained show very flat minima, we found a common set of parameter values assuring a R^{2} within 0.01 of the best value in all cases. The parameter values found are N_{trees} = 400, and d_{max} = 4.
In order to illustrate this procedure, Fig. A.1 shows for instance the resulting R^{2} maps as a function of N_{trees} and d_{max} for the preliminary best (top) and worst (bottom) tracers in the translucent model grid when using line integrated intensity ratios. The red contours delimit the region where R^{2} is within 0.01 of its maximum value. We see that the variations with N_{trees} are limited to a decrease at low values. Any large enough value could thus be chosen but larger values will induce higher computation time cost, so that a minimal acceptable value had to be chosen. The variations with d_{max} show a clear maximum (although rather flat). The R^{2} value decreases for decreasing values of d_{max} below the optimum as the RF model then becomes not flexible enough to capture the relationship between the predictor and the target variable. Above the optimum, R^{2} decreases for increasing values of d_{max} as overfitting starts to arise. The RF model becomes too flexible and starts to learn noise artefacts (we recall that the “noise” here is induced by the random sampling of the unobservable physical parameters of the chemical model).
Fig. A.1 R^{2} (OOB value) as a function of the tuning parameters of the Random Forest (N_{trees} and d_{max}), for thepreliminary best (top panel) and worst (bottom panel) ratios found though a preliminary estimate using defaults values of the parameters, for line integrated intensity ratios and translucent medium conditions. The red contours show the region where the R^{2} value is within 0.01 of the maximum. 
Appendix B Tables for tracers ranking, performance, and fit coefficients
B.1 Tracers ranking and performance
We provide here the ranking and performance of single ratio RF models, obtained following the method described in Sect. 2.
B.1.1 Translucent medium
Tables B.1 and B.2 present the ranking we obtain for translucent medium conditions, respectively for column density ratios and integrated line intensity ratios. See Sect. 2 for a description of the method used, and Sect. 4 for a discussion of these results. For each ratio, we list the performance of the corresponding RF model measured through the (crossvalidated) R^{2}, the equivalent root mean square error on log_{10}(x(e^{−})) and the corresponding error factor on x(e^{−}), the maximum absolute error on log_{10}(x(e^{−})) and the corresponding error factor on x(e^{−}). We also list for comparison the R^{2} value obtained with the analytical fit described in Sect. 5.
B.1.2 Cold dense medium
Tables B.3 and B.4 present the ranking we obtain for cold dense medium conditions, respectively for column density ratios and integrated line intensity ratios.
B.2 Analytical fit coefficients
We provide here the fit coefficients for the analytical fits described in Sect. 5.
Ranking of column density ratios according to their usefulness to predict the ionization fraction in translucent medium conditions (measured through the R^{2} of a fitted Random Forest model).
B.2.1 Translucent medium
Tables B.5 and B.6 list the fit coefficients for translucent medium conditions, respectively for column density ratios and integrated line intensity ratios. See Sect. 5 for a description of the method used. For each ratio, ranked according the results of Sect. 4, we list the fit coefficients (corresponding to the fit formula given in Eq. (3)) for predicting log_{10}(x(e^{−})), the quality of this fit estimated as the (crossvalidated) R^{2}, the root mean square error on log_{10}(x(e^{−})) and corresponding error factor on x(e^{−}), the maximum absolute error on log_{10}(x(e^{−})) and corresponding error factor on x(e^{−}), the fit coefficients (corresponding to the fit formula given in Eq. (4)) for estimating the uncertainty on the prediction, and the limits of the validity range of the fit (given as log_{10} of the ratio values).
B.2.2 Cold dense medium
Tables B.7 and B.8 list the fit coefficients for cold dense medium conditions, respectively for column density ratios and integrated line intensity ratios. For each ratio, ranked according the results of Sect. 4, we list the fit coefficients (corresponding to the fit formula given in Eq. (2)) for predicting log_{10}(x(e^{−})), the quality of this fit estimated as the (crossvalidated) R^{2}, the root mean square error on log_{10}(x(e^{−})) and corresponding error factor on x(e^{−}), the maximum absolute error on log_{10}(x(e^{−})) and corresponding error factor on x(e^{−}), the fit coefficients (corresponding to the fit formula given in Eq. (4)) for estimating the uncertainty on the prediction, and the limits of the validity range of the fit (given as log_{10} of the ratio values.
Ranking of line intensity ratios according to their usefulness to predict the ionization fraction in translucent medium conditions (measured through the R^{2} of a fitted Random Forest model).
Ranking of column density ratios according to their usefulness to predict the ionization fraction in dense cold medium conditions (measured through the R^{2} of a fitted Random Forest model).
Ranking of line intensity ratios according to their usefulness to predict the ionization fraction in dense cold medium conditions (measured through the R^{2} of a fitted Random Forest model).
Fit coefficients (for our main fit and scatter fit) and fit quality for column density ratios in translucent medium conditions.
Fit coefficients (for our main fit and scatter fit) and fit quality for line intensity ratios in translucent medium conditions.
Fit coefficients (for our main fit and scatter fit) and fit quality for column density ratios in cold dense medium conditions.
Fit coefficients (for our main fit and scatter fit) and fit quality for line intensity ratios in cold dense medium conditions.
Appendix C Analytical model visualization
We present in this section figures of the RF models and analytical fits for the best six tracers in each case.
C.1 Translucent medium
Figures C.1 presents scatter plots of the ionization fraction versus each of the best six column density ratios in translucent medium conditions. Superimposed are the RF model (red line), the analytical fit (solid black line), the uncertainty fit (dashed lines), and the validity range (blue vertical lines). Figure C.2 similarly shows scatter plots of the ionization fraction versus each of the best six integrated line intensity ratios in translucent medium conditions.
C.2 Cold dense medium
Figures C.3 and C.4 similarly present the scatter plots (for column density ratios and integrated line intensity ratios respectively) for the best six tracers in cold dense medium conditions.
Fig. C.1 Ionization fraction versus column density ratio for the best six ratios found in Sect. 4 for translucent medium conditions. The chemical model grid is shown as a scatter plot, with the central crowded regions replaced by PDF isocontours containing 25, 50, and 75% of the points. Superimposed are the RF model (red line), the analytical fit (solid black line), and the analytical fit of the 1σ uncertainty (dashed black lines). The quality estimates of the two models are indicated on the figure. 
Fig. C.2 Ionization fraction versus integrated line intensity ratio for the best six ratios found in Sect. 4 for translucent medium conditions. The chemical model grid is shown as a scatter plot, with the central crowded regions replaced by PDF isocontours containing 25, 50, and 75% of the points. Superimposed are the RF model (red line), the analytical fit (solid black line), and the analytical fit of the 1σ uncertainty (dashed black lines). The quality estimates of the two models are indicated on the figure. 
Fig. C.3 Ionization fraction versus column density ratio for the best six ratios found in Sect. 4 for cold dense medium conditions. The chemical model grid is shown as a scatter plot, with the central crowded regions replaced by PDF isocontours containing 25, 50, and 75% of the points. Superimposed are the RF model (red line), the analytical fit (solid black line), and the analytical fit of the 1σ uncertainty (dashed black lines). The quality estimates of the two models are indicated on the figure. 
Fig. C.4 Ionization fraction versus integrated line intensity ratio for the best six ratios found in Sect. 4 for cold dense medium conditions. The chemical model grid is shown as a scatter plot, with the central crowded regions replaced by PDF isocontours containing 25, 50, and 75% of the points. Superimposed are the RF model (red line), the analytical fit (solid black line), and the analytical fit of the 1σ uncertainty (dashed black lines). The quality estimates of the two models are indicated in the figure. 
References
 Agúndez, M., & Wakelam, V. 2013, Chem. Rev., 113, 8710 [Google Scholar]
 Balbus, S. A., & Hawley, J. F. 1991, ApJ, 376, 214 [NASA ADS] [CrossRef] [Google Scholar]
 Barger, C. J., & Garrod, R. T. 2020, ApJ, 888, 38 [NASA ADS] [CrossRef] [Google Scholar]
 Basu, S., & Mouschovias, T. C. 1994, ApJ, 432, 720 [NASA ADS] [CrossRef] [Google Scholar]
 Bengio, Y., & Grandvalet, Y. 2004, J. Mach. Learn. Res., 5, 1089 [Google Scholar]
 Bergin, E. A., Plume, R., Williams, J. P., & Myers, P. C. 1999, ApJ, 512, 724 [NASA ADS] [CrossRef] [Google Scholar]
 Beuther, H., Semenov, D., Henning, T., & Linz, H. 2008, ApJ, 675, L33 [NASA ADS] [CrossRef] [Google Scholar]
 Black, J. H., & van Dishoeck, E. F. 1991, ApJ, 369, L9 [NASA ADS] [CrossRef] [Google Scholar]
 Breiman, L. 1996, Mach. Learn., 24, 123 [Google Scholar]
 Breiman, L. 2001, Mach. Learn., 45, 5 [CrossRef] [Google Scholar]
 Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. 1984, Classification and Regression Trees (New York: Chapman & Hall), 358 [Google Scholar]
 Bron, E., Daudon, C., Pety, J., et al. 2018, A&A, 610, A12 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Caselli, P. 2002, Planet. Space Sci., 50, 1133 [NASA ADS] [CrossRef] [Google Scholar]
 Caselli, P., Walmsley, C. M., Terzieva, R., & Herbst, E. 1998, ApJ, 499, 234 [NASA ADS] [CrossRef] [Google Scholar]
 Caselli, P., Walmsley, C. M., Zucconi, A., et al. 2002, ApJ, 565, 344 [Google Scholar]
 Cuadrado, S., Goicoechea, J. R., Pilleri, P., et al. 2015, A&A, 575, A82 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Cuadrado, S., Salas, P., Goicoechea, J. R., et al. 2019, A&A, 625, L3 [CrossRef] [EDP Sciences] [Google Scholar]
 Dalgarno, A., & Lepp, S. 1984, ApJ, 287, L47 [NASA ADS] [CrossRef] [Google Scholar]
 Dislaire, V., HilyBlant, P., Faure, A., et al. 2012, A&A, 537, A20 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Draine, B. T. 2011, Physics of the Interstellar and Intergalactic Medium (Princeton: Princeton University Press) [Google Scholar]
 Draine, B. T., & Bertoldi, F. 1996, ApJ, 468, 269 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Flower, D. R., Pineau Des Forêts, G., & Walmsley, C. M. 2007, A&A, 474, 923 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Fossé, D., Cernicharo, J., Gerin, M., & Cox, P. 2001, ApJ, 552, 168 [NASA ADS] [CrossRef] [Google Scholar]
 Fuente, A., Cernicharo, J., Roueff, E., et al. 2016, A&A, 593, A94 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Goicoechea, J. R., Pety, J., Gerin, M., et al. 2006, A&A, 456, 565 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Goicoechea, J. R., Pety, J., Gerin, M., HilyBlant, P., & Le Bourlot, J. 2009, A&A, 498, 771 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Goldsmith, P. F., & Kauffmann, J. 2017, ApJ, 841, 25 [NASA ADS] [CrossRef] [Google Scholar]
 Gratier, P., Bron, E., Gerin, M., et al. 2017, A&A, 599, A100 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Gratier, P., Pety, J., Bron, E., et al. 2020, A&A, 645, A27 [CrossRef] [EDP Sciences] [Google Scholar]
 Guelin, M., Langer, W. D., Snell, R. L., & Wootten, H. A. 1977, ApJ, 217, L165 [NASA ADS] [CrossRef] [Google Scholar]
 Guelin, M., Langer, W. D., & Wilson, R. W. 1982, A&A, 107, 107 [NASA ADS] [Google Scholar]
 Guzmán, V., Pety, J., Gratier, P., et al. 2012, A&A, 543, L1 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Guzmán, V. V., Pety, J., Goicoechea, J. R., et al. 2015, ApJ, 800, L33 [NASA ADS] [CrossRef] [Google Scholar]
 Hastie, T., Tibshirani, R., & Friedman, J. 2001, The Elements of Statistical Learning, Springer Series in Statistics (New York: Springer New York Inc.) [Google Scholar]
 Heays, A. N., Bosman, A. D., & van Dishoeck, E. F. 2017, A&A, 602, A105 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Herbst, E., & Klemperer, W. 1973, ApJ, 185, 505 [NASA ADS] [CrossRef] [Google Scholar]
 Hollenbach, D. J., & Tielens, A. G. G. M. 1999, Rev. Mod. Phys., 71, 173 [Google Scholar]
 Indriolo, N., & McCall, B. J. 2012, ApJ, 745, 91 [NASA ADS] [CrossRef] [Google Scholar]
 Indriolo, N., Geballe, T. R., Oka, T., & McCall, B. J. 2007, ApJ, 671, 1736 [NASA ADS] [CrossRef] [Google Scholar]
 Le Bourlot, J. 1991, A&A, 242, 235 [NASA ADS] [Google Scholar]
 Le Petit, F., Nehmé, C., Le Bourlot, J., & Roueff, E. 2006, ApJS, 164, 506 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Le Petit, F., Ruaud, M., Bron, E., et al. 2016, A&A, 585, A105 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Lepp, S. 1992, IAU Symp., 150, 471 [Google Scholar]
 Levrier, F., Le Petit, F., Hennebelle, P., et al. 2012, A&A, 544, A22 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Liszt, H. S. 2012, A&A, 538, A27 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Liszt, H. S., & Pety, J. 2016, ApJ, 823, 124 [NASA ADS] [CrossRef] [Google Scholar]
 Maret, S., & Bergin, E. A. 2007, ApJ, 664, 956 [NASA ADS] [CrossRef] [Google Scholar]
 McCall, B. J., Huneycutt, A. J., Saykally, R. J., et al. 2003, Nature, 422, 500 [NASA ADS] [CrossRef] [PubMed] [Google Scholar]
 Mestel, L., & Spitzer, L., 1956, MNRAS, 116, 503 [NASA ADS] [CrossRef] [Google Scholar]
 Miettinen, O., Hennemann, M., & Linz, H. 2011, A&A, 534, A134 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Mladenović, M., & Roueff, E. 2017, A&A, 605, A22 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Mouschovias, T. C. 1976, ApJ, 207, 141 [NASA ADS] [CrossRef] [Google Scholar]
 Neufeld, D. A., Schilke, P., Menten, K. M., et al. 2006, A&A, 454, L37 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Oppenheimer, M., & Dalgarno, A. 1974, ApJ, 192, 29 [NASA ADS] [CrossRef] [Google Scholar]
 Orkisz, J. H., Pety, J., Gerin, M., et al. 2017, A&A, 599, A99 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Orkisz, J. H., Peretto, N., Pety, J., et al. 2019, A&A, 624, A113 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Padovani, M., Galli, D., & Glassgold, A. E. 2009, A&A, 501, 619 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Pagani, L., Salez, M., & Wannier, P. G. 1992, A&A, 258, 479 [NASA ADS] [Google Scholar]
 Pagani, L., Roueff, E., & Lesaffre, P. 2011, ApJ, 739, L35 [NASA ADS] [CrossRef] [Google Scholar]
 Pedregosa, F., Varoquaux, G., Gramfort, A., et al. 2011, J. Mach. Learn. Res., 12, 2825 [Google Scholar]
 Pety, J., Teyssier, D., Fossé, D., et al. 2005, A&A, 435, 885 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Pety, J., Guzmán, V. V., Orkisz, J. H., et al. 2017, A&A, 599, A98 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Roueff, E., Loison, J. C., & Hickson, K. M. 2015, A&A, 576, A99 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Roueff, A., Gerin, M., Gratier, P., et al. 2020, A&A, 645, A26 [Google Scholar]
 Shingledecker, C. N., Bergner, J. B., Le Gal, R., et al. 2016, ApJ, 830, 151 [CrossRef] [Google Scholar]
 van der Tak, F. F. S., Black, J. H., Schöier, F. L., Jansen, D. J., & van Dishoeck, E. F. 2007, A&A, 468, 627 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Varoquaux, G. 2018, NeuroImage, 180, 68 [CrossRef] [Google Scholar]
 Williams, J. P., Bergin, E. A., Caselli, P., Myers, P. C., & Plume, R. 1998, ApJ, 503, 689 [NASA ADS] [CrossRef] [Google Scholar]
Informations and data related to the ORIONB program can be found at http://www.iram.fr/~pety/ORIONB/
Datafiles containing the training RF models and tables of the rankings presented in this section are available at http://www.iram.fr/~pety/ORIONB/data.html.
All Tables
Range of physical parameters explored for each of our two classes of medium: gas density n_{H}, gas temperature T_{gas}, incident FUVradiation field intensity G_{0}, lineofsight visual extinction A_{V}, cosmicray ionization rate ζ, H_{2} orthotopararatio OPR, depletion factor and sulfur gasphase elemental abundance [S].
Shorthand names, full quantum number designation, and frequency of the molecular lines we considered.
Impact of adding noise to the line intensities on the predictive power R^{2} (for translucent medium), measured by the noise level σ_{1∕2} (in a constantnoise level situation) and the signaltonoise ratio S/N_{1∕2} (in a constantS/N situation) for which R^{2} reaches one half of its value in the absence of noise .
Impact of adding noise to the line intensities on the predictive power R^{2} (for cold dense medium), measured by the noise level σ_{1∕2} (in a constantnoise level situation) and the signaltonoise ratio S/N_{1∕2} for which the R^{2} reaches one half of its value in the absence of noise .
Ranking of column density ratios according to their usefulness to predict the ionization fraction in translucent medium conditions (measured through the R^{2} of a fitted Random Forest model).
Ranking of line intensity ratios according to their usefulness to predict the ionization fraction in translucent medium conditions (measured through the R^{2} of a fitted Random Forest model).
Ranking of column density ratios according to their usefulness to predict the ionization fraction in dense cold medium conditions (measured through the R^{2} of a fitted Random Forest model).
Ranking of line intensity ratios according to their usefulness to predict the ionization fraction in dense cold medium conditions (measured through the R^{2} of a fitted Random Forest model).
Fit coefficients (for our main fit and scatter fit) and fit quality for column density ratios in translucent medium conditions.
Fit coefficients (for our main fit and scatter fit) and fit quality for line intensity ratios in translucent medium conditions.
Fit coefficients (for our main fit and scatter fit) and fit quality for column density ratios in cold dense medium conditions.
Fit coefficients (for our main fit and scatter fit) and fit quality for line intensity ratios in cold dense medium conditions.
All Figures
Fig. 1 Ranking of column density ratios (left) or line intensity ratios (right) of observable tracers by order of the predictive power for predicting the ionization fraction (measured by the R^{2} coefficient), in the case of translucent medium conditions (showing only the first 20). Errorbars of the R^{2} estimates are computed by crossvalidation (see text for explanations). Inset in the left panel: zoom on the first five ratios in order to make the magnitude of the errorbars visible. 

In the text 
Fig. 2 Ionization fraction versus the best column density ratio, C_{2}H/HCN (left panel), and the best line intensity ratio, C_{2}H (1–0) / HCN (1–0) (right panel) for tracing the ionization fraction in translucent medium conditions. The model grid is shown as a scatter plot, with the central crowded regions replaced by PDF isocontours containing 25, 50, and 75% of the points. Superimposed are the RF model (red line), the analytical fit (solid black line, presented in Sect. 5), the analytical fit of the 1σ uncertainty (dashed black lines, presented in Sect. 5), and the bounds of the validity range of the analytical fit (vertical lines, presented in Sect. 5). The quality estimates of the two models are indicated on the figure. 

In the text 
Fig. 3 Ranking of column density ratios (left) or line intensity ratios (right) of observable tracers by order of the predictive power for predicting the ionization fraction (measured by the R^{2} coefficient), in the case of dense cold medium conditions (showing only the first 20). Errorbars of the R^{2} estimates are computed by crossvalidation (see text for explanations). 

In the text 
Fig. 4 Same as Fig. 2 for cold dense medium conditions. 

In the text 
Fig. 5 Illustration of the two sources of uncertainties for the best column density ratio for tracing the ionization fraction in dense cold medium, CN/N_{2}H^{+}. Left panel: analytical fit (solid black line), the standard deviation around this curve corresponding to the uncertainties on the fit coefficients (thin dashed red lines representing the 3σ level), and the bounds of the validity range defined in the text (vertical blue lines). Right panel: analytical fit (solid black line) and two estimates of the standard deviation corresponding to the residual scatter of the data points around the curve: the red dotted line presents a movingaverage estimate of the local standard deviation, and the dashed black lines shows our analytical fit of the residual standard deviation. On both panels, the chemical model grid is shown as a scatter plot, with the central crowded regions replaced by PDF isocontours containing 25, 50, and 75% of the points. 

In the text 
Fig. 6 DCO^{+} over HCO^{+} column density ratio in our grid of dense cold gas models (blue points and contours), shown as a scatter plot, with the central crowded regions replaced by PDF isocontours containing 25, 50, and 75% of the points. Our best fit Random Forest model is shown as a black line. The red dashed contours shows the distributions of the models with OPR. 

In the text 
Fig. 7 Evolution of the R^{2} performance of the Random Forest predictors (for translucent gas) when adding noise of constant variance to all line intensities, as a function of the noise level σ_{noise}, for the 3 best tracers for translucent medium (C_{2}H(1–0)/HCN(1–0), C_{2} H(1–0)/^{13}CO(1–0), and C_{2} H(1–0)/C^{18}O(1–0)) and for the four tracers least affected by the noise (^{13}CO(1–0)/C^{18}O(1–0), C^{18}O(1–0)/CF^{+}(1–0), ^{13}CO(1–0)/CF^{+}(1–0), and HCN/CF^{+}(1–0)). 

In the text 
Fig. 8 Evolution of the R^{2} performance of the Random Forest predictors (for translucent gas) when adding noise of constant signaltonoise ratio to the line intensities, as a function of the S/N, for the 7 best tracers for translucent medium. 

In the text 
Fig. 9 Evolution of the R^{2} performance of the Random Forest predictors (for cold dense gas) when adding noise of constant variance to the line intensities, as a function of the noise level σ_{noise}, for the 3 best tracers for cold dense medium (first three curves) and for the four tracers least affected by the noise (next four curves). 

In the text 
Fig. 10 Evolution of the R^{2} performanceof the Random Forest predictors (for cold dense gas) when adding noise of constant signaltonoise ratio to the line intensities, as a function of the S/N, for the 7 best tracers for cold dense medium. 

In the text 
Fig. A.1 R^{2} (OOB value) as a function of the tuning parameters of the Random Forest (N_{trees} and d_{max}), for thepreliminary best (top panel) and worst (bottom panel) ratios found though a preliminary estimate using defaults values of the parameters, for line integrated intensity ratios and translucent medium conditions. The red contours show the region where the R^{2} value is within 0.01 of the maximum. 

In the text 
Fig. C.1 Ionization fraction versus column density ratio for the best six ratios found in Sect. 4 for translucent medium conditions. The chemical model grid is shown as a scatter plot, with the central crowded regions replaced by PDF isocontours containing 25, 50, and 75% of the points. Superimposed are the RF model (red line), the analytical fit (solid black line), and the analytical fit of the 1σ uncertainty (dashed black lines). The quality estimates of the two models are indicated on the figure. 

In the text 
Fig. C.2 Ionization fraction versus integrated line intensity ratio for the best six ratios found in Sect. 4 for translucent medium conditions. The chemical model grid is shown as a scatter plot, with the central crowded regions replaced by PDF isocontours containing 25, 50, and 75% of the points. Superimposed are the RF model (red line), the analytical fit (solid black line), and the analytical fit of the 1σ uncertainty (dashed black lines). The quality estimates of the two models are indicated on the figure. 

In the text 
Fig. C.3 Ionization fraction versus column density ratio for the best six ratios found in Sect. 4 for cold dense medium conditions. The chemical model grid is shown as a scatter plot, with the central crowded regions replaced by PDF isocontours containing 25, 50, and 75% of the points. Superimposed are the RF model (red line), the analytical fit (solid black line), and the analytical fit of the 1σ uncertainty (dashed black lines). The quality estimates of the two models are indicated on the figure. 

In the text 
Fig. C.4 Ionization fraction versus integrated line intensity ratio for the best six ratios found in Sect. 4 for cold dense medium conditions. The chemical model grid is shown as a scatter plot, with the central crowded regions replaced by PDF isocontours containing 25, 50, and 75% of the points. Superimposed are the RF model (red line), the analytical fit (solid black line), and the analytical fit of the 1σ uncertainty (dashed black lines). The quality estimates of the two models are indicated in the figure. 

In the text 
Current usage metrics show cumulative count of Article Views (fulltext article views including HTML views, PDF and ePub downloads, according to the available data) and Abstracts Views on Vision4Press platform.
Data correspond to usage on the plateform after 2015. The current usage metrics is available 4896 hours after online publication and is updated daily on week days.
Initial download of the metrics may take a while.