Tracers of the ionization fraction in dense and translucent gas: I. Automated exploitation of massive astrochemical model grids

The ionization fraction plays a key role in the physics and chemistry of the neutral interstellar medium, from controlling the coupling of the gas to the magnetic field to allowing fast ion-neutral reactions that drive interstellar chemistry. Most estimations of the ionization fraction have relied on deuterated species such as DCO+, whose detection is limited to dense cores representing an extremely small fraction of the volume of the giant molecular clouds they are part of. As large field-of-view hyperspectral maps become available, new tracers may be found. We search for the best observable tracers of the ionization fraction based on a grid of astrochemical models. We build grids of models that sample randomly a large space of physical conditions (unobservable quantities such as gas density, temperature, etc.) and compute the corresponding observables (line intensities, column densities) and the ionization fraction. We estimate the predictive power of each potential tracer by training a Random Forest model to predict the ionization fraction from that tracer, based on these model grids. In both translucent medium and cold dense medium conditions, several observable tracers with very good predictive power for the ionization fraction are found. Several tracers in cold dense medium conditions are found to be better and more widely applicable than the traditional DCO+/HCO+ ratio. We also provide simpler analytical fits for estimating the ionization fraction from the best tracers, and for estimating the associated uncertainties. We discuss the limitations of the present study and select a few recommended tracers in both types of conditions. The method presented here is very general and can be applied to the measurement of any other quantity of interest (cosmic ray flux, elemental abundances, etc.) from any type of model (PDR models, time-dependent chemical models, etc.). (abridged)


Introduction
The so-called neutral component of the interstellar medium, despite being shielded from EUV (13.6 to 124 eV) stellar photons able to ionize hydrogen, retains a small ionization fraction (x(e − ) = n(e − )/n H ). The ionization mechanism depends on the type of region: FUV (6 to 13.6 eV) photons ionizing C and S Send offprint requests to: E. Bron, e-mail: emeric.bron@obspm.fr in the low A V surface layer of clouds, or cosmic rays, X rays, shocks, etc., in the densest parts. As a result, ionization fractions range between ∼ 10 −4 in low A V cloud surfaces and down to ∼ 10 −9 in dense cores (e.g. Goicoechea et al. 2009;Draine 2011).
This ionization fraction controls several key aspects of neutral interstellar clouds. It determines the degree of coupling of the gas to the magnetic field: the neutrals, accounting for most of the mass of the fluid, are only indirectly sensitive to the presence of a magnetic field through their friction with the ions that remain coupled to the field, a process called ion-neutral friction. This coupling can provide a significant magnetic support against gravitational collapse of dense cores despite the low ionization fractions values found there, between 10 −9 and 10 −7 (Mestel & Spitzer 1956;Mouschovias 1976;Basu & Mouschovias 1994). The ionization fraction also controls the onset of the magnetorotational instability (Balbus & Hawley 1991), the main mechanism of angular momentum transport in accretion disks. Moreover, the gas phase chemistry in dense molecular clouds is to a large extent driven by fast ion-neutral reactions (Herbst & Klemperer 1973;Oppenheimer & Dalgarno 1974). The build-up of chemical complexity thus depends on the ionization fraction of the medium. Finally, some common molecular tracers with high dipole moments, such as HCN and HCO + , have high inelastic collision cross sections with electrons, and their excitation can be significantly affected by electron collisions for ionization fractions 10 −5 (Black & van Dishoeck 1991;Liszt 2012;Liszt & Pety 2016;Goldsmith & Kauffmann 2017). This makes the interpretation of their emission (e.g. to estimate gas density) sensitive to our knowledge on the local ionization fraction .
Direct observational estimation of the ionization fraction in neutral clouds is difficult, except in very specific regions (e.g. Goicoechea et al. 2009;Cuadrado et al. 2019, at the dissociation front in a photodissociation region). Direct estimation of the total charge accounted for by observable molecular ions in molecular clouds only yields a loose lower limit (e.g. Miettinen et al. 2011). Indirect methods based on tracers that are chemically sensitive to the ionization fraction have thus been commonly used. These methods have mostly involved measuring the deuterium fractionation through abundance ratios involving simple molecular ions like DCO + /HCO + (Guelin et al. 1977(Guelin et al. , 1982Dalgarno & Lepp 1984;Caselli et al. 1998) or N 2 D + /N 2 H + which is less affected by depletion (Caselli 2002). The idea is that the deuterium enrichment (defined as H 2 D + /H + 3 ), initiated by the exchange reaction H + 3 + HD H 2 D + + H 2 (1) at low temperature, is limited by electronic dissociative recombination of H 2 D + , and that the resulting ratio is transmitted (with a known prefactor) to the deuteration fraction of other molecules such as HCO + . Using such tracers, the ionization fraction is deduced either by using approximate analytical formulae representing simplified networks Miettinen et al. 2011;Caselli 2002), or by adjusting an astrochemical model including a full chemical network to the observations, using stationary chemical models (Williams et al. 1998;Bergin et al. 1999;Caselli et al. 2002;Fuente et al. 2016), time-dependent ones (Maret & Bergin 2007;Shingledecker et al. 2016), or PDR models (Goicoechea et al. 2009). Despite the variety of determination methods, using different deuterated molecules, only very few works have proposed using other tracers than deuterated species. For instance, Flower et al. (2007) proposed the C 6 H − /C 6 H ratio as a tracer of the ionization fraction, and Fossé et al. (2001) have investigated the relationship between the cyclic-to-linear ratio of C 3 H 2 and the ionization fraction. Deuteration-based approaches however suffer from several limitations due to the fact that they depend on other physical or chemical parameters that need to be determined independently. The initial deuteration reaction (Eq. 1) is sensitive not only to the gas temperature but also to the essentially unmeasurable orthoto-para ratio of H 2 (Pagani et al. 1992(Pagani et al. , 2011Shingledecker et al. 2016). Indeed, the endothermicity of the reaction in the backward direction (192K) is very close to the J=1 to J=0 energy difference of H2 (170.5K). Then, even a small fraction of o-H2 (J=1) contributes to H2D+ destruction (with a reduced endothermicity of 20K) and restricts the deuteration process. In addition, ratios such as DCO + /HCO + are linked to H 2 D + /H + 3 through reactions with neutral species like CO. The estimated deuterium fraction is therefore sensitive to the depletion factors of carbon, oxygen and nitrogen that are not easy to evaluate (Caselli 2002). Moreover, deuterated tracers such as DCO + are typically only detectable in cold dense cores, representing only a tiny fraction of the observable area of a giant molecular cloud (GMC). Deuteration-based approaches are thus inadequate for an unbiased characterization of the conditions in GMCs as a whole.
Despite the common use of advanced chemical models computing the abundances of hundreds of species, the observed tracers to which these models are compared to estimate x(e − ) (deuterated molecules such as DCO + ) are still those initially proposed based on analytical reasoning using simplified chemical networks. The wealth of data produced in large chemical model grids remains largely unexploited. Their exploration of wide parameter spaces might reveal less intuitive but more efficient tracers. Based on this approach, we propose here a general and largely automatic method to identify the best observational predictors of the ionization fraction, when other important parameters such as the gas density, temperature or H 2 ortho-topara ratio are unknown. We apply this method to propose new predictors of the ionization fraction as a function of the molecular cloud conditions. We use simple stationary chemical models with a complete up-to-date chemical network (Roueff et al. 2015), and use molecular ratios (column density ratios or integrated line intensity ratios) as observable tracers from which we seek to predict the ionization fraction. We base our investigation on the observed range of physical conditions and detected tracers in the IRAM-30m Large Program ORION-B (Outstanding Radio-Imaging of OrioN B, co-PIs: J. Pety and M. Gerin) 1 . In this program, we imaged 5 square degrees towards the southern part of the Orion B giant molecular cloud over most of the 3 mm atmospheric window Gratier et al. 2017;Orkisz et al. 2017;Bron et al. 2018;Orkisz et al. 2019;Roueff et al. 2020;Gratier et al. subm.).
In the context of dense cores, the ionization fraction is linked with the cosmic ray ionization rate (CRIR) and both are often studied from the same molecular ratios (although direct tracers of the cosmic ray ionization rate can also be used in more diffuse medium, in particular H + 3 , e.g. Indriolo & McCall 2012;Le Petit et al. 2016). In our context of the Orion B giant molecular cloud, where UV illumination controls the ionization fraction in large parts of the cloud, we focus here on the question of estimating the ionization fraction only, independently of the source of ionization. The task of tracing the cosmic ray ionization rate (which has also been attempted using astrochemical model grids, e.g. Barger & Garrod 2020) will be considered in a future application of the method presented here.
In this first article, we present a generic method to find the best tracers of an unobservable physical parameter and apply it to the search of new tracers of the ionization fraction among the species that are detectable in the ORION-B dataset. The observational application of the tracers found here to study the ionization fraction in the Orion B molecular cloud will be presented in a second paper (Guzman et al. in prep.).
In Sect. 2, we describe our general statistical method for determining the best predictors of a given unobservable parameter based on the results of a model grid. In Sect. 3, we present the models used in this study for the search of ionization fraction tracers. We then present the ranking of observable predictors in Sect. 4. For ease of application of our results, we provide in Sect. 5 analytical fit formulae to deduce the ionization fraction from each of the proposed best predictors. We finally discuss our results in Sect. 6 and present our conclusions in Sect. 7.

Method
Both observable line intensities (or column densities) on the one hand, and the ionization fraction x(e − ) on the other hand, depend on multiple, unobservable physical parameters (e.g. gas density, elemental abundances, cosmic ray flux, ...). Our goal is to find reliable relationships between observable quantities and ionization fraction, despite lacking estimations of these hidden physical parameters. To do this, we first run model grids covering the whole possible parameter space. We then use a flexible regression method to fit x(e − ) as a function of one of the potential observational tracers through the whole grid of chemical models. This means that we treat the effects of the variations of the hidden parameters as sources of noise on the prediction of the ionization fraction. Finally, we use a quantitative measurement of the fit quality as an estimate of the predictive power of each potential tracer. These estimates are used to rank the tracers and highlight the most powerful predictors of the ionization fraction. The fitted models for the best tracers will provide ready-to-use tools to be applied to observations.

Predicting the ionization fraction from one ratio of line intensities (or column densities) with Random Forests
Any a priori information could easily be included in the method by sampling the parameters of the model grid according to a specific prior distribution. However, we wish to minimize the amount of a priori information injected in the method and to avoid making assumptions on the shape of the distributions of physical parameters (e.g. gas density, elemental abundances, cosmic ray flux, ...). We thus build a model grid that samples uniformly the possible range of values (see Sect. 3). Our model grids provide us with a dataset comprising ionization fraction values and corresponding values of observable quantities. We will consider line intensity ratios or column density ratios as our observable quantities in this paper. The hidden physical parameter values introduce a non deterministic aspect to the relationship between x(e − ) and the observables: models might have identical values of an observable but different x(e − ) if the underlying physical parameters are different. Learning to predict x(e − ) from a given observable is then a regression problem, with the uncertainty introduced by the hidden physical parameters playing the role of noise. Determining the best tracers of x(e − ) is thus equivalent to finding observables for which the relationship to x(e − ) is least affected by this noise (i.e. by the hidden physical conditions). This means finding the observables for which the most accurate regression model can be found.
For this regression problem, we choose to use Random Forests (Breiman 2001) because their flexibility makes it possible to fit general non-linear shapes, while their simplicity provides reasonable computational costs. This makes the method presented here very general and applicable to finding tracers of other physical parameters without any assumption on the shape of the relationship between the tracers and the target parameter.
We will use RF for Random Forest in the rest of the paper. RF regression models are based on the concept of regression trees (Breiman et al. 1984), where a succession of binary decisions are made based on the input variables (e.g. x 3 < 2 or ≥ 2) and constant values are predicted in each of the subsets of the partition that the decision tree defines. While such decision trees are easily interpretable, they require large tree depths to be flexible but are prone to overfitting if this depth is too large. RF tackle this overfitting problem by using the simple idea that multiple overfitted regression models will, when averaged, give a better prediction as long as the errors they individually make are uncorrelated between models. In a RF, the individual trees are made as independent as possible by introducing randomness in two aspects: 1) the building of each tree only considers a random subset of the input variables, and 2) each tree is given a bootstrapped sample (i.e. drawn by random sampling with replacement from the original dataset, Breiman 1996) instead of the original full sample. This way, the datasets seen by the different trees are independent and each tree only sees a subset of the dataset (bootstrapped datasets typically contain only 63% of the points of the original dataset as repetition is allowed). This provides a very flexible regression model, which retains some of the interpretability of decision trees. RF have thus quickly become a standard method in Machine Learning (see e.g. Hastie et al. 2001). In addition, they allow to estimate the generalization error of the fit (i.e., the error made when predicting data not seen during training): as each tree has only seen a random bootstrapped sample from the data, it is possible to estimate for each datapoint a partial prediction using only the trees that have not seen this datapoint during training. As the sample seen by a given tree is called a bag, these partial predictions are called outof-bag predictions (OOB). Gratier et al. (subm.) also use RF in the context of the interstellar medium and introduce the method in detail.
We thus train RF regression models for each observable (using only one observable at a time) and estimate the accuracy of the regression models. This accuracy is taken as an estimate of the predictive power of the observable quantity considered, for the purpose of predicting x(e − ). The different observables can then be ranked according to this predictive power estimate. The accuracy of the regression model is estimated with the OOB R 2 R 2 = 1 − SS res SS tot with the sums of squares where the index i runs across data points (individual model results), y true i is the true value of ionization fraction (computed by the chemical model), y pred i is the OOB prediction value from the RF, and y true is the average of the (true) ionization fraction over the model grid. This coefficient R 2 gives the fraction of the total ionization fraction variance (across the full model grid) that the RF model can explain from the given observable predictor alone (i.e. it measures the fractional decrease from the initial variance of x(e − ) to the variance of the residuals).
It is thus <1, with 1 representing perfect prediction (zero residual variance). Note that it can take a negative value when the model performs worse than predicting a constant value set at the average x(e − ) of the dataset. A value of 0 indicates a performance equivalent to this constant prediction of the average. This R 2 value is used for the ranking of tracers. For information, we Table 1. Range of physical parameters explored for each of our two classes of medium: gas density n H , gas temperature T gas , incident FUV radiation field intensity G 0 , line-of-sight visual extinction A V , cosmicray ionization rate ζ, H 2 ortho-to-para ratio OPR H 2 , depletion factor and sulfur gas-phase elemental abundance [S]. translucent medium cold dense medium n H [cm −3 ] 3 × 10 2 − 3 × 10 3 10 3 − 10 6 T gas [K] 15 − 100 7 − 20 1.86 × 10 −8 − 1.86 × 10 −5 1.86 × 10 −8 − 1.86 × 10 −5 also provide below the root mean square error where N is the number of chemical models in our grid. The RMSE is completely univocally related to the R 2 value, but is more interpretable in terms of the amplitude of typical errors. We also provide the maximum absolute error This quantity, estimating the maximum error made by our regression model, is not guaranteed to converge when increasing the size of the dataset. It should thus not be interpreted further than being the largest error we observed in our limited-size sample. The RF model depends on a few internal parameters (number of trees, maximum depth of trees, etc...). Their values can affect the quality of the model and its tendency to overfit. We used a number of trees in the forest N trees = 400 and a maximum tree depth d max = 4. The procedure used to select these values is described in Appendix A. Our tests show that this optimization scheme is not critical for our purpose: while the choice of parameter values does affect the quality of the best fit RF model, it does not change significantly the ranking of the predictors that we deduce from it.
The pipeline tool implementing this procedure is available at [link to be added before publication].

Chemical models
We use here the chemical code presented in Roueff et al. (2015) to study isotopic fractionation of deuterium, carbon and nitrogen compounds. Single zone models with fixed density, temperature, visual extinction, radiation field, cosmic ray ionization rate, ortho-to-para H 2 ratio and depletion factors are computed at steady state.
We consider for the present study a chemical network including deuterium, and isotopic carbon and oxygen species where the deuterium, carbon and oxygen fractionation reactions have been introduced following the recent determinations of exothermicities by Mladenović & Roueff (2017). We introduce in particular D 13 CO + . Apart from these specific fractionation reactions, the chemistry of isotopically substituted species is built automatically from the chemical network of the major components. The chemical reactions involving one single carbon-containing reactant and one single carbon-containing product are duplicated with the same reaction rate coefficient. Simple statistical assumptions are introduced when two carbon containing molecules are implied in the reaction. Consider for example the case of the reaction CX + CY → CX + CY , taking place with a reaction rate coefficient k. The reactions introduced for the isotopically substituted species are the following: 13 CX + CY → 13 CX + CY with k/2 13 CX + CY → CX + 13 CY with k/2 CX + 13 CY → CX + 13 CY with k/2 CX + 13 CY → 13 CX + CY with k/2 13 CX + 13 CY → 13 CX + 13 CY with k Such a procedure leads to an ensemble of 310 species linked through 8711 chemical reactions.
These models allow us to compute observable column density ratios for the hundreds of species included. Although commonly derived by observers, column densities are not the primary observable quantities, and we thus also compute line intensity ratios. To do so, we post-process the results of our chemical models using a non-LTE excitation and radiative transfer model (RADEX, van der Tak et al. 2007) assuming a typical linewidth of 1 km/s (observed linewidths in the Orion B are typically of a few km/s ). Results based on column density ratios and based on line intensity ratios will be presented separately in the following sections.
It is unlikely that a single tracer will provide a good estimate of the ionization fraction x(e − ) in all possible physical conditions. Either the tracer will lose its relationship with x(e − ) in some conditions, or it might be too weak to be observable in other conditions. We thus decided to divide the range of possible conditions into subregions corresponding to the different types of environments found in GMCs Bron et al. 2018). We focus on two kinds of environments : the translucent medium and the cold and dense gas, and we derive separate rankings of tracers for these two environments. The range of physical conditions explored for each of these environments are chosen based on our previous studies of Orion B and listed in Table 1.
In both grids, the gas density and gas temperatures were varied covering the typical ranges for translucent medium (3×10 2 − 3 × 10 3 cm −3 , 15-100 K) and cold and dense medium (10 3 − 10 6 cm −3 , 7-20 K). In the translucent model grid, external FUV photons still play an important role in the chemistry and in controlling the ionization fraction. This FUV illumination is controlled through an external FUV field strength G 0 (see e.g. Hollenbach & Tielens 1999, p. 177) and an extinction A V representing the amount of shielding between the FUV source and the gas under consideration (also used as the depth of the slab when computing line intensities). We take into account self-shielding of H 2 by using the approximate expression of Draine & Bertoldi (1996) and introduce also the shielding of CO by H 2 from Heays et al. (2017). We consider lower extinctions and higher G 0 values in translucent medium (A V in the range 2-6, G 0 of the external field in the range 1 -1000) than in cold dense medium (A V in the range 5-20, external G 0 set to 1). We explore average to moderately strong FUV illumination values in the low density translucent grid. Regions with both high density high FUV illumination correspond to dense photodissociation regions (PDR), in which strong chemical and physical stratification on small spatial scale is critical. These regions would thus require the use of complete PDR models, such as the Meudon PDR Code (Le Petit et al. 2006). We thus did not explore this type of conditions in the present study.
Given the uncertainties about the cosmic ray ionization rate in molecular clouds (Lepp 1992;McCall et al. 2003;Indriolo et al. 2007), we consider the range of value 10 −17 − 10 −15 s −1 . In the cold gas conditions, in order to account for the reduced cosmic ray fluxes (Padovani et al. 2009), we limit this range to 10 −17 − 10 −16 s −1 . As sulfur can be an important contributor of electrons in neutral gas but has a highly uncertain gas-phase elemental abundance (Agúndez & Wakelam 2013;Goicoechea et al. 2006), we explore in both grids values of [S], the relative sulfur abundance with respect to H, in the range 1.86 × 10 −8 − 1.86 × 10 −5 .
We also explore ranges of H 2 ortho-para ratio (OPR H 2 ) which impact significantly two reactions: i.e. H 2 D + + o-H 2 → H + 3 + HD (Pagani et al. 1992) where the energy endothermicity is reduced to 61.5 K (compared to 232K with p-H 2 ) and N + + o-H 2 → NH + + H which is slightly endothermic (∼ 44 K) whereas N + + p-H 2 → NH + + H is more strongly endothermic (∼ 170 K), as first emphasized by Le Bourlot (1991). For this reaction, we follow the prescription of Dislaire et al. (2012) which is derived from experimental results. We use higher values of the OPR in the warmer translucent medium (0.1−3) than in cold dense gas (10 −4 −10 −1 ). Finally, cold dense cores offer conditions where molecules can freeze out on dust grains, depleting the gas phase abundances of elements such as C and O. In our cold dense medium grid, we thus in addition explore depletion factors going from 1 (no depletion) to 10 (C elemental abundance 10 times lower than the reference values) with a constant C/O elemental ratio value of 0.6 (the elemental abundance of carbon is taken to be [C]= 1.32 × 10 −4 when there is no depletion). Other parameters that might have an impact (although second order compared to the parameters considered here), such as variations of the metal elemental abundances or PAH abundance, were not considered in this study. The gas-phase elemental abundances for metals (relative to H) are taken to be [Fe]= 1.5 × 10 −8 , [Cl]= 1.8 × 10 −7 , [Si]= 8.2 × 10 −7 , [F]= 1.8 × 10 −8 , [Ar]= 3.29 × 10 −6 .
For each medium type, a set of 5000 models was run, sampling randomly and uniformly within the chosen parameter space. The adequacy of this number of models for our purpose is ascertained later when estimating the uncertainties on our results.
Given the variation by orders of magnitude both in the parameter values and the computed observables, we choose to work with the logarithm of all quantities. We sampled uniformly on the logarithm of each parameter within the ranges indicated above. The method described in Sect. 2 is applied on the logarithm of all quantities (i.e., training RF models using the logarithm of column density ratios or line intensity ratios to predict the logarithm of x(e − )). Note that representative error values such as the RMSE on logarithms are equivalent to representative error factors on the actual quantity. These corresponding error factors are given in parenthesis in the result tables of the following sections.
To get rid of possible instrumental, calibration and other source geometry effects, we choose to work only on ratios of observable quantities (either column density ratios or line intensity ratios). In the following, we use the term tracers for ratios of observable quantities.
Among all the species computed in our chemical model, we made a selection of species that are detected in the radio observa- Table 2. We list here the shorthand names, full quantum number designation, and frequency of the molecular lines we considered.

Short name Full quantum numbers
Frequency (GHz) 13 CO (1-0) J = 1 → J = 0 110.201354 tions of the ORION-B project and potentially linked to the ionization fraction. Our search for the best tracers is made among the ratios of these selected species. For the translucent medium condition, we selected 13 CO, C 18 O, HCO + , HCN, HNC, CN, C 2 H, CS, SO, H 2 CS, HCS + , CF + . The search for a best ratio was thus done among 66 possible column density ratios (and 66 line intensity ratios). For dense cold medium conditions, we considered the same selection with the addition of DCO + and N 2 H + . We thus had 91 possible column density ratios (and 91 line intensity ratios) in this case. For line intensities, the exact transition and frequency considered for each species are listed in Table 2. In the RADEX computations of line intensities, we account for collisional excitation with electrons (using the ionization fraction computed by the chemical model) for species for which collisional data with electrons are available in RADEX (HCO + , HCN and C 2 H). We note that for CN, excitation by electrons was not included as in the current version of RADEX, the collisional data that includes the hyperfine structure of CN does not include collisions with electrons. We chose to privilege the fact of accounting for the hyperfine structure here.

Tracers rankings
We applied the method described in Sect. 2 to the two chemical model grids (translucent medium conditions and cold dense medium conditions) presented in Sect. 3 in order to obtain a ranking of the selected potential tracers according to their usefulness for predicting the ionization fraction. Figure 1 presents the predictive power (estimated as the R 2 of a RF fit) of each tracer for the best 20 tracers. The left panel shows the result when taking column density ratios as observable quantities, and the right panel the results when considering line intensity ratios. We see that the ranking is similar in both cases, suggesting that excitation and radiative transfer only have a moderate effect on the relationship between these tracers and the ionization fraction. A more complete ranking (covering all tracers having R 2 > 0.5) is given in ( 1 0     the results tables of this Section and of Sect. 5 are placed in Appendix B) 2 .

Translucent medium
In both cases, about ten different ratios are found to be each able to explain more than 80% of the ionization fraction variance (R 2 > 0.8). We emphasize that this means that an accurate prediction of the ionization fraction is possible from each of these tracers despite not knowing the values of the 7 physi-contours in shades of blue indicating iso-PDF contours encompassing 25%, 50%, and 75% of the distribution) and the prediction of the fitted RF model (solid red line), which is found in this case to explain 95.7% of the ionization fraction variance in our grid. The remaining scatter around the relationship represents the effect of ignoring all other parameters (gas density, temperature, UV field, H 2 OPR,...). The best ranked line intensity ratio (C 2 H (1-0) / HCN (1-0)) is similarly shown on the right panel of Fig. 2 with the corresponding fitted RF model (solid red line).
In translucent gas, the ionization is still dominated by the effect of external FUV photons ionizing carbon (and to a lesser extent sulfur and chlorine), and is slowly decreasing as the total extinction increases. In the conditions covered by our translucent grid, we find x(e − ) ranging from 2×10 −4 to 2×10 −7 . C 2 H, which we find in several of the best ratios, is known to be enhanced in FUV illuminated environments (Pety et al. 2005;Cuadrado et al. 2015;Guzmán et al. 2015;Gratier et al. 2017;Pety et al. 2017): as explained in Beuther et al. (2008), C 2 H traces the amount of carbon not locked into CO, and is thus sensitive to the FUV flux through CO photodissociation and the presence of C and C + at a significant abundance level. In our translucent medium model grid, we indeed find C + to be the main charge carrier and thus to very strongly correlate with x(e − ). H + and H + 3 may also contribute to the electronic fraction in environments where the cosmic ionization reaches values above 10 −16 s −1 (Le Petit et al. 2016). However, C + , an open shell ion, is chemically reactive with various molecules, except H 2 3 , and is at the origin of a complex chemistry with insertion of carbon atoms. C + itself is not straightforwardly detectable as its fine structure transition at 158 µm requires spaceborne or airborne observations. But we may expect that molecules involving C + in the initial chemical steps allow to probe the electronic fraction. Our finding of ratios involving C 2 H (relative to e.g. HCN, HNC or CN) as good proxies of the electronic fraction is a natural consequence of the relevance of C + as one of the main positive charge carriers. The initial step of C 2 H formation involves indeed the C + + CH → C + 2 + H reaction, followed by subsequent reactions with H 2 up to C 2 H + 2 , which recombines to form C 2 H. Molecules such as HCN, HNC or CO and its isotopologues, on the other hand, are saturated stable molecules which scale with column density. As a result, ratios such as C 2 H/HCN, whose transitions are easily detectable, offer a convenient diagnostic tool of the electronic fraction in translucent medium. The electronic fraction is then, as shown in Fig. 2 (left panel), an increasing function of the C 2 H/HCN ratio. CF + is another proxy of C + , as described in Neufeld et al. (2006); Guzmán et al. (2012), and ratios involving this ion are also found here to be good tracers of the ionization fraction. However, this ion is relatively scarce since it involves fluorine, which has a low relative abundance to H 2 and has only been detected in PDR environments so far (detectability issues are investigated in Sect. 6.2).
In order to estimate the reliability of our results and determine if our 5000-model grid is sufficient to explore the chosen parameter space for our purpose, we compute errorbars on the predictive power estimate (the R 2 of the RF fit). To do so, we use 10-fold cross validation: the model grid is randomly split into 10 parts, and for each of these parts, a RF model is trained on the other 9 parts and tested on the remaining part which it has not seen during training. From these 10 estimates of the R 2 (all made on samples unseen during training), a reliable estimate of the predictive power on unseen data is made, as well as an estimate of the standard error on this predictive power. The corresponding error bars are also shown in Fig. 1. For most of the points, the errorbars are smaller than the marker, and the inset on the left panel presents a zoom on the first five ratios, showing the magnitude of the errorbars. This shows that the uncertainty induced by the finite size of our model grid in negligibly small and that our 5000-model grid is sufficient for our purpose. However, this conclusion should be taken with some caution as it has been shown that there exists no unbiased estimator of the variance of a cross-validation estimate (Bengio & Grandvalet 2004) and that a naive estimation of this variance tends to underestimate it by a factor of up to 4 (Varoquaux 2018). Figure 3 presents the ranking of the best tracers in cold dense conditions (cf. Table 1), for both column density ratios (left panel) and line intensity ratios (right panel). Error bars computed by the cross validation procedure described above are also shown, confirming that the size of our model grid is sufficient for estimating the quality of the fits based on each tracer. We see that the R 2 values of the best tracers are slightly lower than in the translucent medium case, indicating slightly stronger degeneracies with unknown parameters in this case (note that depletion was varied in this cold dense medium, in addition to the parameters varied in the translucent medium grid). However, we still find several tracers explaining more than 80% of the variance in x(e − ). The best column density ratio is here found to be CN/N 2 H + . The cold dense environments are essentially ionized through cosmic rays and secondary UV photons induced by cosmic rays. Electrons are primarily produced by cosmic ray ionization of H 2 and destroyed in the efficient dissociative recombination reactions of the various molecular ions. He + ions, produced by cosmic ray ionization of He, are also particularly efficient in ionizing the molecular reservoirs, CO, HCN, N 2 , H 2 O. This contributes to forming atomic ions, in addition to the H + 3 molecular ion resulting from H 2 ionization and the other stable molecular ions resulting from proton transfer of H + 3 with stable molecules giving ions such as H 2 D + , HCO + , H 3 O + , or N 2 H + . The ionization carriers are then shared amongst several different species, going from the simple atomic ions that do not react with H 2 (i.e. C + , S + , H + ) and closed shell molecular ions such as H + 3 , HCO + , H 3 O + , and N 2 H + . Molecular ions are principally destroyed by dissociative recombination reactions whereas atomic ions rather react with the present neutral molecules since radiative recombination is not efficient. One can thus expect that molecular ions are inversely proportional to the electron abundances, as seen with the CN/N 2 H + ratio which is found to increase monotonically with x(e − ) (cf. Fig. 4, left panel).

Cold dense medium
As in the translucent case, we find slightly lower R 2 values for the best line intensity ratios than for the best column density ratio. However, we find a few ratios that have better scores as intensity ratios than as column density ratios. In particular, while the 13 CO/HCO + column density ratio is found to be a poor predictor of the ionization fraction, the 13 CO (1-0) / HCO + (1-0) line intensity ratio appears as one of the best tracers. Contrary to the previous cases, this is entirely an excitation effect. The abundance ratio of 13 CO/HCO + is found to be mostly uncorrelated with x(e − ) and mostly constant in the cold dense medium model grid (ratio of about 10 3 with a typical scatter of a factor of 2-3). However, HCO + and 13 CO have strongly different critical densities: ∼ 2 × 10 5 cm −3 for HCO + (in cold dense medium condi-      tions, x(e − ) is too low for electron collisions to play a significant role) in comparison to ∼ 2 × 10 3 cm −3 for 13 CO. In the range of densities considered in the cold dense medium grid (10 3 − 10 6 cm −3 ), 13 CO excitation is thus mostly at local thermodynamic equilibrium (LTE) and its emissivity per molecule is thus constant with gas density, while HCO + is transitioning from the sub-thermally excited regime to the LTE regime, and its emissivity per molecule thus increases with density. The 13 CO (1-0) / HCO + (1-0) ratio thus decreases with gas density. On the other hand, we find the gas density to be very strongly anti-correlated with x(e − ) in these conditions (with a typical scatter of a factor of ∼ 3), as cosmic ray ionization is the dominant source of ionization here and the recombination rates per ion scale with the gas density. These two effects combine to give a 13 CO (1-0) / HCO + (1-0) ratio that increases with x(e − ) with a relatively tight correlation (cf. Fig. C.4, top right panel). Fig. 4 shows the relation between x(e − ) and the best column density ratio, CN/N 2 H (left panel), and the best line intensity ratio, CF + (1-0) / DCO + (1-0) (right panel). We see a larger scatter than in Fig. 2, but a clear relationship is still found.
We note that the classical DCO + /HCO + ratio does not appear among the best tracers found here for cold dense medium conditions. This point is discussed in Sect. 6.1.

Analytical fit formulas
If possible, we recommend using the RF models described in the previous section when attempting to estimate x(e − ) from one of the best tracers listed above. However, the provided datafiles are dependent on a specific implementation of Random Forests (the scikit-learn module for Python, Pedregosa et al. 2011). For a simpler and more persistent solution (independent of any external software), we provide in this section simple analytical fit formulae for the best tracers found in Sect. 4. While the RF models are flexible enough to make the method described in Sect. 2 generally applicable to any model grid and any physical quantity we want to find tracers of, the analytical fits provided here use formulas that have been specifically chosen for the application presented here (finding predictors of the ionization fraction from our chemical model grid). There is no guarantee that these same formulae would perform adequately to find analytical fits with other model grids and/or another quantity to predict.

Prediction formulae
We use simple polynomial formulae (working as before on the logarithm of both the observable ratios and the ionization fraction) to fit the non-linear relationships between the best tracers found in Sect. 4 and x(e − ). This is applied to all tracers which where found to have R 2 > 0.5 in the previous RF analysis.
In the cold dense medium conditions, we thus use a simple polynomial of order 5: where x is the log 10 of the column density ratio or line intensity ratio from which we want to predict log 10 (x(e − )), f dense (x) is our fitting function to log 10 (x(e − )) in cold dense gas conditions 4 , and the parameters a 0 to a 5 are our fit parameters. A fit is made for each of the tracers that had R 2 > 0.5 (more than half of the variance explained) in the rankings of Section 4. The corresponding fit coefficient values for each column density ratio are given in Table B.7 and the coefficient values for line intensity ratios are listed in Table B.8. In the translucent medium conditions, x(e − ) naturally reaches a plateau at the fractional abundance of carbon (1.32 × 10 −4 in our undepleted models). We thus use a modified formula combining a polynomial of order 5 and a saturation: The fit parameters here are f max and a 0 to a 5 . As in the cold dense medium case, f translucent (x) is our fitting function to log 10 (x(e − )) in translucent gas conditions. The corresponding fit coefficient values for each column density ratio are listed in Table B.5 and the coefficient values for line intensity ratios are listed in Table B.6.
The order of the polynomial in these functions is chosen so that further increasing it yields only marginal increase in R 2 (estimated by cross-validation to avoid overfitting). The R 2 values found for the best tracers in each case are close to the values initially found with the RF models, indicating that our analytical fits are not significantly worse than the RF models, at least for the high-R 2 tracers. As an example, Figures 2 and 4 also show the analytical fit (solid red line) in comparison to the RF model (solid black line). For a few of the lower R 2 tracers, the analytical fits perform significantly worse as can be seen on the tables of Appendix B by comparing the R 2 values of the RF models with the R 2 values of the analytical fits.

Uncertainty formulae
Finally, a key point is to estimate uncertainties on our prediction of the ionization fraction. We separate here two sources of uncertainty.
Our best analytical fit is determined on a finite sample of models, so that the fit coefficients are only estimates of the theoretical best fit coefficients. Estimating again these fit coefficients from a different sample of models (drawn from the same distribution) would result in slightly different values, and these uncertainties on the fit coefficients in turn imply an uncertainty on the ionization fraction value predicted by the fit formula at any given value of the observable quantity (intensity ratio or column density ratio). In order to estimate this uncertainty on the predicted value, we proceed by bootstrapping: we repeat the fitting procedure on 100 bootstrapped samples (drawn from the original model grid) and report the standard deviation of the value of the fit function as its uncertainty.
The left panel of Fig. 5 shows the corresponding uncertainty (showing the 3σ level in dashed curves around the main prediction curve) in the case of the best column density ratio in cold dense gas conditions (CN/N 2 H + ). We name this uncertainty the fit coefficients uncertainty to distinguish it from the second form of uncertainty below. We define the validity domain of our fit as the range of values of the observable ratio where our best fit is sufficiently constrained by our finite grid of models for this fit coefficient uncertainty to be negligible. In practice, we define it as the range of ratios where the above-defined uncertainty remains lower than 2% of the predicted value. In the following, we will thus assume this uncertainty to be negligible inside of this validity domain and focus on the second form of uncertainties. The limits of the corresponding validity range are also shown on the left panel of Fig. 5 as blue vertical lines. Due to the tendency of high order polynomial fits to diverge quickly outside of the domain of the fitted dataset, the analytical fit formulae should not be used outside of the validity range defined here.
The second form of uncertainties comes from the unobservable parameters (density, temperature,...), which induce a scatter in the relationship between any of the line intensity ratios or column density ratios and the ionization fraction. Inside of the validity domain of the fit, this scatter in the residuals is much larger than the fit coefficients uncertainty, as can be seen on the left panel of Fig. 5), and is thus the dominant source of uncertainties. When applying the fit formula to real observations, in the ideal case where the chemical model used in this paper would be a perfect model of reality, we would expect the value predicted by the fit formula to commit a mean squared error equal to the variance of this scatter of the residuals, and this error represents our lack of information on the underlying physical conditions. As can be seen on the previous figures however, this residual scatter varies as a function of the observable predictor (the vertical scatter is smaller in some regions of the plot than in others). This residual variance as a function of the predictor can be estimated by a moving average method (shown in the right panel of Fig. 5, we used a window of 0.1 dex). In order to provide a simpler way of estimating this residual variance function, we fitted the simple analytical function to the squared residuals, thus providing a fit to the local variance. Note that this function provides a fit to the variance on the prediction of log 10 (x(e − )), and, as previously, x is the log 10 of the   observable ratio. The absolute value in this function was chosen because the residual variance is by definition a positive quantity. The best fit coefficients to this residual variance function are given in Tables B.5, B.6, B.7 and B.8 in Appendix B. An example of the corresponding residual standard deviation function is shown in the right panel of Fig. 5, where we compare its movingaverage estimate (dotted red curves around the main prediction curve) to its squared polynomial fit (dashed black curve), showing in both cases the 3σ level, for the best column density ratio in the cold dense medium case (CN/N 2 H).
The resulting fits are also presented for the best ratio in the different cases in Fig. 2 and 4, showing the RF fit, the analytical fit and the scatter fit. Similar figures for each of the 6 best tracers for both the translucent medium and the cold dense medium conditions, and for both column density ratios and line intensities ratios, are presented in Appendix C, .

Traditional ionization tracers
One of the most commonly used ionization fraction tracers is DCO + /HCO + (Guelin et al. 1977(Guelin et al. , 1982Dalgarno & Lepp 1984;Caselli et al. 1998). It is used mainly in cold dense cores, where the temperature is low enough to allow sufficient deuterium enrichment, and the column density is large enough to make DCO + detectable. However, our results show that even in the cold dense gas regime, and despite including DCO + in our list of potential tracers, the DCO + /HCO + column density ratio is not ranked among the best tracers of the ionization fraction. In fact, it is ranked as the 38 th best tracer in dense cold gas conditions, with a R 2 of 0.57 only (cf. Table B.7). Note that this ratio is often determined using observations of H 13 CO + as H 12 CO + can be optically thick in high-column-density lines of sight. We leave this aspect aside in this discussion by showing that the column density ratio DCO + /HCO + itself (however it might be determined Chemical model results Models with OPR < 2.5 × 10 3 RF model Fig. 6. DCO + over HCO + column density ratio in our grid of dense cold gas models (blue points and contours), shown as a scatter plot, with the central crowded regions replaced by PDF isocontours containing 25%, 50%, and 75% of the points. Our best fit Random Forest model is shown as a black line. The red dashed contours shows the distributions of the models with OPR H 2 < 2.5 × 10 −3 . observationally) suffers from several limitations as a tracer of the ionization fraction.
The relationship between the DCO + /HCO + abundance ratio and the ionization fraction found in our (cold and dense) model grid is shown in Fig. 6. The blue distribution shows the results of the model grid and the black line our best RF model, with a R 2 of 0.57. We see that two main problems limit the usability of DCO + /HCO + .
First, a large scatter of the ionization values (by up to 3 orders of magnitude) is present at all values of the ratio, despite having a significant fraction of the distribution tightly located  around a clear relationship (the outermost blue contour encloses 75% of the distribution). As a result, the best fit RF model tries to make a compromise between the tightly located part of the distribution, and the scattered points at lower ionization fraction values. We found most of this scatter to be related to variations in the ortho-to-para ratio of H 2 (OPR H 2 ), an unobservable parameter whose value remains difficult to estimate in observations of dense cold cores. For instance, selecting only the models having OPR H 2 < 2.5 × 10 −3 , we see in Fig. 6 (red dashed contour) that we retain only the unscattered part of the distribution. Thus the difficulty of obtaining reliable estimates of the OPR of H 2 in dense cores limits the use of DCO + /HCO + as a tracer of the ionization fraction.
Second, even when selecting the low OPR H 2 models, we see that the relationship presents a very steep slope at high ratio values (low ionization fraction values). As a result, for DCO + /HCO + ratios above 10 −1.6 , a range of ionization fractions of more than two orders of magnitude is possible. The ratio would then only be usable for lower ratios, equivalent to ionization fractions larger than 10 −6.5 . Even if the model grid presented no scatter at all, a steep slope implies that small observational uncertainties on the ratio will induce large uncertainties on the predicted ionization fraction. Thus, relationships with steep slopes are of limited use.
These different effects combine to make the DCO + /HCO + ratio a poor predictor of the ionization fraction, compared to the best ranked tracers found by our method.

Detectability constraints
So far, detectability constraints have been ignored. Predictive power has been tested from noiseless values of column density or line intensity ratios. However, the different lines considered here have widely different brightnesses and thus differ in terms of detectability with current instruments. We explore here the effect of various noise levels on the predictive power of the different line ratios. We will consider two noise setups correspond- ing to two observation scenarios: the case of one constant noise level for all lines, corresponding to the typical case of a line survey where faint lines are detected with a lower signal-to-noise ratio than bright ones, and the case of a fixed signal-to-noise ratio (SNR) for all lines, corresponding to observations being designed to reach a set signal-to-noise ratio for a few desired lines. These two cases correspond to the two opposite extremes of possible observation scenarios and will give us a general overview of the possible impact of noise on the performance of the tracers. In both cases, synthetic noise is added to the line integrated intensity values and we consider the SNR on the integrated intensity and not the peak intensity. Note that all noise and SNR values quoted are for individual lines, not for line ratios.
In both cases, in order to measure how the predictive power (measured as the R 2 value) is affected by noise, we perform a modified cross-validation. The model grid is randomly split in ten parts. For each of these tenths and for each line ratio : 1. A RF model is trained from the other nine parts of the model grids, without any noise added. 2. The trained RF is tested on the tenth under consideration, with added noise (either with a constant noise variance σ 2 noise for all models and all lines, or with a constant SNR for each model and line). Since the RF models take the log 10 of the intensity ratio as input and since the addition of noise can produce negative values, we only apply the RF model when both line integrated intensity values are above 1 σ. Otherwise, we do not use the RF model and simply take the average ionization fraction value of the grid as our prediction.
We finally take the average R 2 value obtained over the ten noisy tests as the predictive power of the tracer under noisy conditions. We thus avoid estimating R 2 from datapoints that have been seen during training, and we estimate the predictive power of a model trained on noiseless data when applied to noisy data (as would be the case when applying the results of this article to real observations).
The results for the translucent medium grid, for a few possible line ratios, are presented in Fig. 7 and 8. Figure 7 presents the scenario of a constant noise level σ noise for all lines and all models, and shows how the R 2 of the prediction from different Table 3. Impact of adding noise to the line intensities on the predictive power R 2 (for translucent medium), measured by the noise level σ 1/2 (in a constant noise level situation) and the signal-to-noise ratio SNR 1/2 (in a constant SNR situation) for which R 2 reaches one half of its value in the absence of noise R 2 (noiseless) .
Line intensity ratio R 2 (noiseless)  (1-0), and the four tracers found to be the least sensitive to noise. We define the tracers least sensitive to noise as those having the highest σ 1/2 among tracers with R noiseless ≥ 0.7, thus giving a compromise between a good fit quality (high R noiseless ) and a slow decrease with increasing noise level (high σ 1/2 ). The four best ratios found according to this definition and shown on the figure are 13 CO (1-0) / C 18 O (1-0), C 18 O (1-0) / CF + (1-0), 13 CO (1-0) / CF + (1-0), and HCN (1-0) / CF + (1-0). We see that the best three tracers, all including C 2 H, have their predictive power decreasing sharply at a relatively low noise level (their R 2 decreases by 50% at σ noise ∼ 5 × 10 −3 K km s −1 ). This is due to the relatively low brightness of the C 2 H line in translucent conditions (median integrated intensity of ∼ 7 × 10 −3 K km s −1 in our translucent medium grid). In comparison, some line ratios built from brighter lines, despite a lower  9. Evolution of the R 2 performance of the Random Forest predictors (for cold dense gas) when adding noise of constant variance σ 2 noise to the line intensities, as a function of the noise level σ noise , for the 3 best tracers for cold dense medium (first three curves) and for the four tracers least affected by the noise (next four curves). R 2 on noiseless data, are found to perform better in the presence of noise : the R 2 for the 13 CO (1-0) / C 18 O (1-0) ratio decreases by 50% at σ noise ∼ 4 × 10 −2 K km s −1 (C 18 O (1-0) has a median integrated intensity of ∼ 3 × 10 −1 K km s −1 in this grid), the C 18 O (1-0) / CF + (1-0) ratio has its R 2 decreased by 50% at σ noise ∼ 10 −2 K km s −1 (CF + (1-0) has a median integrated intensity of ∼ 1.2 × 10 −2 K km s −1 in this grid). We note, however, that ratios built from CF + (1-0) actually only perform marginaly better than the best ratios involving C 2 H at high noise levels (see Fig. 7) due to the low brightness of CF + (1-0). For a more exhaustive comparison of the line ratios, Tab. 3 gives for each line ratio the noise level σ 1/2 at which the R 2 is decreased by half. Figure 8 similarly shows the results for the scenario of a fixed SNR for all lines and all models, still in the case of translucent medium conditions. The variations of the R 2 of the prediction with the SNR are shown for the best seven tracers (according to the noiseless ranking of Table B.2). In this scenario, we find as expected that the predictive power drops at a SNR of order unity. The only exception is the 13 CO (1-0) / C 18 O (1-0) ratio, which decreases slightly earlier. This is due to this ratio spanning a relatively small range of values in our grid of models (approximately one order of magnitude) while the other line ratios span ranges of four to six orders of magnitude. As the ionization fraction spans a range of values of 2.5 orders of magnitude in the grid, this implies that the relationship between the ionization fraction and the 13 CO (1-0) / C 18 O (1-0) ratio has a much steeper slope than for the other line ratios. A steep slope then implies that small errors in the line ratios result in large errors in the predicted ionization fraction. As a result, the 13 CO (1-0) / C 18 O (1-0) ratio requires significantly higher SNRs than the other line ratios. Similarly to σ 1/2 , we define SNR 1/2 as the SNR value at which R 2 is half of its value on noiseless data. The SNR 1/2 values for all line ratios are also given in Table 3.
The results for dense and cold medium conditions are similarly shown in Fig. 9 and 10 and Table 4. In the case of a constant noise level, we find that the R 2 drop generally occurs at higher noise levels than in the translucent medium case, as expected because the lines are brighter in the cold dense medium due to higher column densities. Figure 9 shows the decrease of R 2 with σ noise for the best three dense gas tracers (according to the noiseless ranking of Tab. B.2), and for the four ratios least sensitive two noise (according to the same definition as previously). We note that 13 CO (1-0) / HCO + (1-0), the second best ratio on noiseless data (therefore already shown on the figure), has also the highest σ 1/2 value among ratios with R noiseless ≥ 0.7, so that we show the next four ratios least sensitive to noise on the figure. These four ratios are found to be C  Table 4. When considering a constant SNR scenario for the dense cold medium case, we again find that the R 2 drop occurs at a SNR value of order unity, as shown in Fig. 10. The SNR 1/2 values for all line ratios are listed in Table 4.
When interpreting the results of this study, one must keep in mind that the decrease in R 2 in our two scenarios (constant σ noise or constant SNR) does not come from the same effect. In the constant σ noise scenario, at a given σ noise level one part of the grid has undetected or very low SNR values for the ratio under consideration, while the other part has high SNR. The R 2 decrease is indicative of the growing fraction of the parameter space with undetected/low SNR ratios. As a result, even when finding a low overall R 2 , there might remain a fraction of the parameter space were the predictor remains very good (usually, high column density, high volume density, high temperature,...), which we did not characterize here. In the constant SNR scenario on the other hand, the SNR is by design constant over all models, independent of the physical parameters. The decrease in overall R 2 is then more representative of the decrease in predictive power at any point in the parameter space. As a result, a ratio with a low σ 1/2 value might still be usable in real observations with a higher noise level but would be restricted to high brightness regions of GMCs (in the corresponding lines), while a ratio cannot be used at all in observations with SNR significantly lower than its SNR 1/2 value.

Chemical model reliability
Independently of the statistical method that we present in this article, the results obtained rely on the chosen chemical model Table 4. Impact of adding noise to the line intensities on the predictive power R 2 (for cold dense medium), measured by the noise level σ 1/2 (in a constant noise level situation) and the signal-to-noise ratio SNR 1/2 for which the R 2 reaches one half of its value in the absence of noise R 2 (noiseless) .
Line intensity ratio R 2 (noiseless) and its limitations. Previous works on ionization fraction tracers have mostly used stationary-state results of single-zone chemical models (although some works have used time-dependent models, e.g., Maret & Bergin 2007;Shingledecker et al. 2016). As these previous studies have been mostly limited to deuterationbased tracers, the focus of the present article has been on highlighting the non-deuteration-based tracers that can be found for the ionization fraction from similar chemical models. We discuss here the impacts of our model's limitations on our results. Our single-zone model cannot include a detailed treatment of UV radiative transfer through the cloud. While most photodissociation rates can be simply estimated based on an assumed optical depth of dust protecting each model from UV photons (parameter A V in our models), species such as H 2 , CO and its isotopologues can be protected from photodissociation by self-or mutual-shielding. While self-shielding of H 2 and mutual shielding of CO by H 2 are included in our model using approximations (Draine & Bertoldi 1996;Heays et al. 2017), mutual shielding of 13 CO and C 18 O by H 2 and 12 CO are not included. As a result, in our translucent medium models where photodissociation by external UV photons still plays an important role, we expect the abundances of the rarer CO isotopologues to be less reliable than the other species. Note that the observations of 13 CO and C 18 O in our ORION-B dataset indeed present specificities (systematic excitation temperature differences with 12 CO, Bron et al. 2018;Roueff et al. 2020) that remain unexplained even by more complex 1D PDR models.
The only explicit surface reaction in our chemical model is H 2 formation, however we account for the freeze-out of CO through our depletion parameter. The list of species that we consider as possible tracers has been restricted to species that are not strongly affected by surface chemistry beyond the depletion effect. Extension to more complex molecules would require a chemical model including a more complete treatment of surface chemistry.
Another source of uncertainties comes from the experimental or theoretical estimates of the reaction rate coefficients used in our chemical network. While we did not directly perform a sensitivity analysis of the reaction rate coefficients, our model grids include temperature variations which subsequently impact the reactions rate coefficients through their temperature dependence. This is especially true for the important dissociative recombination reactions which display significative temperature dependences. Temperature is then considered as an unobserved parameter when searching for good tracers of x(e − ). The discovery of strong relationships between some of the ratios and x(e − ), despite temperature variations in the grid, thus indicates that these relationships are to some extent robust to the reaction rates. A careful analysis of the magnitude and correlations of model uncertainties resulting from reaction rate uncertainties in the chemical network would deserve a separate study.
Time-dependent effects are expected to be more important in cold dense medium conditions than in translucent medium conditions as photochemistry has shorter timescales in the latter case. Time-dependent effects that result from the time evolution of some physical parameter (e.g. density and temperature during the contraction of a core) while the chemistry follows in a quasi-stationary way are in part accounted for in our models by exploring a large range of the various physical parameters that can be subject to variations (see Table 1). In addition, the progressive freeze out of CO on dust grains is accounted for by considering a range of depletion factors for carbon and oxygen. Similarly, the slow evolution of OPR H 2 in cold gas can keep parts of the chemistry (deuterium chemistry, nitrogen chemistry) in a time-dependent evolution that depends mainly on the evolution of OPR H 2 . This is also in part accounted for by exploring a large range of OPR H 2 values in our models. The tracer-finding method presented in this article will be applied to time-dependent chemical models in a future study.
The final and most important limitation of our model is that it does not include a spatial dimension : gas at a single value of density, temperature, etc. is assumed to be exposed to a given radiation field and protected by a given column density. As a result the possibility that different emission lines originate in separate layers of gas on the line of sight is completely neglected. Variations of the physical conditions along the line of sight can indeed have an important impact on the observables (e.g. Levrier et al. 2012). This limitation could be important both in the translucent medium where physical and chemical gradients are present due to the progressive extinction of the external UV field, and in dense cores with density and temperature gradients. As a result some caution must be exercised when choosing the line ratios to consider, ratios involving two species expected to emit in completely different regions should be avoided. For instance, the C 2 H (1-0) / N 2 H + (1-0) that is found as the fourth best line ratio in dense cold medium should be avoided : C 2 H is known to be a tracer of UV illumination and thus more likely to be emitted at the external surface of a given clump, while N 2 H + is abundant in the inner regions of the core where CO is already significantly depleted. An application of our method to 1D PDR models to better account for this effect in translucent medium conditions will be carried out in a future work.

Parameter PDF in the model grids
In the model grids used in this study, we sampled uniformly (in logarithm) for the values of the unobservable physical conditions (gas density, temperature, UV field, etc.) in an hypercube defined by lower and upper bounds for each of the parameters. The results of our ranking method will depend on this assumed PDF (probability density function) for the physical conditions in the ISM. We made here the choice of making the minimal assumption: knowing only reasonable lower and upper bounds on each of the parameters, the uniform distribution is the maximum entropy PDF (i.e. the PDF that best represents our assumed state of knowledge). As a perspective, if more a priori knowledge is available, then more accurate assumptions for the PDF (in particular for the correlations between the different physical parameters) could reveal additional tracers.
We note that this assumption of a uniform PDF over a maximum support (in the sense that any more accurate PDF would have almost all of its weight enclosed in this support) makes it likely that any tracer found to have a very good relationship with the ionization fraction would keep a strong relationship for more accurate PDF choices (if the relationship is strong over the full hypercube, it should with high likelihood stay strong on subregions of this hypercube). In this sense, we expect the tracers found here to remain reliable, but a more accurate PDF choice might strongly increase the performance of some tracers found here to perform poorly and thus reveal additional tracers of the ionization fraction. This argument remains however qualitative as pathological cases of PDF might be constructed that would radically change the rankings of the tracers.
As a result, the precise rankings presented in this article could slightly change, but we expect the good tracers highlighted by these ranking to be reliable for more realistic PDFs of the physical conditions in GMCs.

Final recommandation
Based on the limitations discussed above (detectability and model reliability), we recommend the use of the following integrated line intensity ratios to trace the ionization fraction. This list is of course not exhaustive, and other ratios can give satisfactory predictions (see Tables 3 and 4) if the species listed above are not available. In translucent gas conditions, this recommendation is based on the following points. After eliminating rarer CO isotopologues based on our discussion of mutual-shielding effects on selective photodissociation in low/moderate A V regions (cf Sect. 6.3), and eliminating ratios involving sulfur species that were found to require unreasonably low noise levels of 10 −4 − 10 −5 K km s −1 , the three remaining best line intensity ratios are C 2 H (1-0) / HCN (1-0) and C 2 H (1-0) / HNC (1-0) and C 2 H (1-0) / CN (1-0). If noise sensitivity is critical, the tracer found to have the best predictive power at high noise levels is found in Sect. 6.2 to be HCN (1-0) / CF + (1-0) but is negligibly better than the three previously mentioned at high noise levels.

Conclusions
We have presented a general statistical method to find the best observable tracers of an unobservable parameter based on a grid of models spanning the range of possible values for all the unknown underlying physical parameters (e.g., gas density, temperature, depletion, etc.). Our method estimates the predictive power of each potential observable tracer by training a flexible, non-linear regression model (a Random Forest model, making no assumption on the non-linear shape of the relationship to be found) on the task of predicting the target quantity from each of the potential tracers. The fit quality on test data, measured as the R 2 coefficient by cross-validation and out-of-bag estimation, is used to rank the potential tracers by order of predictive power.
In the context of our recent studies of the Orion B GMC Gratier et al. 2017;Orkisz et al. 2017;Bron et al. 2018;Orkisz et al. 2019), we have applied this method to the important astrophysical question of tracing the ionization fraction in the neutral ISM, with the goal of being able to probe its variations across a whole GMC, from its translucent enveloppe to its dense cores. We considered grids of single-zone, stationary state astrochemical models exploring wide ranges of values in gas density, temperature, external UV field, A V on the line of sight, cosmic ray ionization rate, ortho-to-para ratio of H 2 , depletion factor, and sulfur elemental abundance. For a finer exploration of the possible conditions, we considered two grids corresponding to translucent medium conditions and cold dense medium conditions respectively, based on the different types of environments found in the Orion B GMC Bron et al. 2018).
We considered successively column density ratios and line intensity ratios as potential tracers, focusing on species observable in the band at 100 GHz of our observations of Orion B. We find that in both cases and in both types of physical conditions, multiple ratios allow accurate predictions of the ionization fraction, with R 2 > 0.8 (and up to 0.96). We investigated the impact of the noise level on the predictive capability of the different ratios After accounting for detectability and model reliability, we recommend : In order to simplify the use of these predictors, we constructed ad hoc analytical fits (using polynomials or saturated polynomials) of the relationship of each observable tracer to the ionization fraction. Contrary to the Random Forest models, the choice of the analytical form of these fits is specific to the types of relationships observed in this specific application (different analytical forms might be necessary for other applications). We also provide analytical formulae to estimate the uncertainty on any measurement of the ionization fraction from these tracers. These tracers will be used to study the ionization fraction in the Orion B molecular cloud in a second paper (Guzman et al. in prep.). The method presented here is very general and could be easily applied to finding tracers of other related (cosmic ray ionization rate, absolute electron abundance) or unrelated (gas density, temperature, OPR H 2 ,...) unobservable quantities. This method can also be extended simply to simultaneously use pairs (or more) of line ratios (by training RF models on the possible combinations of line ratios), which would likely further increase the quality of the prediction.

Appendix A: Random Forest parameter optimization
The Random Forest contains a few tuning parameters for which a value needs to be chosen before training. In particular, we will consider only the two most important here : the number of trees in the forest N trees , and the maximum depth allowed for each tree, d max . Another usual parameter is not considered here: the number of predictors in the random subset considered when choosing along which axis to make a split. Indeed, we will only train RF models with a single predictor at a time, so that this number is necessarily 1.
Since optimizing the choice of these parameters on the full model grids (that will then be used for training) and for each potential tracer will lead both to an increased risk of overfitting and a heavy computational time cost, we opt for a very limited optimization, where a single set of parameter values is used for all tracers (i.e. for all RF models trained on a single ratio) and for all model grids. The selection of these parameter values is done with a simplified procedure: 1. For each model grid, RF models are first trained on each ratio using default parameter values (default sklearn values, N trees = 100 and d max = ∞), and the OOB R 2 value is calculated for each ratio. 2. For the two ratios having the best and worst R 2 in the previous step, RF models are trained for a grid of parameter values exploring N trees = 50 − 800 and d max = 1 − 12, and the OOB R 2 is again computed for RF models with each possible combination of parameter values (the OOB value is used to limit the risks of overfitting). 3. As the R 2 maps obtained show very flat minima, we found a common set of parameter values assuring a R 2 within 0.01 of the best value in all cases. The parameter values found are N trees = 400, and d max = 4.
In order to illustrate this procedure, Fig. A.1 shows for instance the resulting R 2 maps as a function of N trees and d max for the preliminary best (top) and worst (bottom) tracers in the translucent model grid when using line integrated intensity ratios. The red contours delimit the region where R 2 is within 0.01 of its maximum value. We see that the variations with N trees are limited to a decrease at low values. Any large enough value could thus be chosen but larger values will induce higher computation time cost, so that a minimal acceptable value had to be chosen. The variations with d max show a clear maximum (although rather flat). The R 2 value decreases for decreasing values of d max below the optimum as the RF model then becomes not flexible enough to capture the relationship between the predictor and the target variable. Above the optimum, R 2 decreases for increasing values of d max as overfitting starts to arise. The RF model becomes too flexible and starts to learn noise artefacts (we recall that the "noise" here is induced by the random sampling of the unobservable physical parameters of the chemical model).

Appendix B: Tables for tracers ranking, performance, and fit coefficients
Appendix B.1: Tracers ranking and performance We provide here the ranking and performance of single ratio RF models, obtained following the method described in Sect. 2.
Appendix B.1.1: Translucent medium Tables B.1 and B.2 present the ranking we obtain for translucent medium conditions, respectively for column density ratios and integrated line intensity ratios. See Sect. 2 for a description of the method used, and Sect. 4 for a discussion of these results. For each ratio, we list the performance of the corresponding RF model measured through the (cross-validated) R 2 , the equivalent root mean square error on log 10 (x(e − )) and the corresponding error factor on x(e − ), the maximum absolute error on log 10 (x(e − )) and the corresponding error factor on x(e − ). We also list for comparison the R 2 value obtained with the analytical fit described in Sect. 5.
Appendix B.1.2: Cold dense medium Tables B.3 and B.4 present the ranking we obtain for cold dense medium conditions, respectively for column density ratios and integrated line intensity ratios. Ranking of column density ratios according to their usefulness to predict the ionization fraction in translucent medium conditions (measured through the R 2 of a fitted Random Forest model). Additional error measures of the Random Forest model (root mean square error and maximum absolute errors) are also given. As these errors concern the logarithm of the ionization fraction, we also provide the equivalent error factors on the ionization fraction. For comparison, the R 2 obtained with the analytical fit described in Sect. 5 is also listed in the last column. 3) for predicting log 10 (x(e − )), the quality of this fit estimated as the (cross-validated) R 2 , the root mean square error on log 10 (x(e − )) and corresponding error factor on x(e − ), the maximum absolute error on log 10 (x(e − )) and corresponding error factor on x(e − ), the fit coefficients (corresponding to the fit formula given in Eq. 4) for estimating the uncertainty on the prediction, and the limits of the validity range of the fit (given as log 10 of the ratio values.

Appendix B.2.2: Cold dense medium
Tables B.7 and B.8 list the fit coefficients for cold dense medium conditions, respectively for column density ratios and integrated line intensity ratios. For each ratio, ranked according the results of Sect 4, we list the fit coefficients (corresponding to the fit formula given in Eq. 2) for predicting log 10 (x(e − )), the quality of this fit estimated as the (cross-validated) R 2 , the root mean square error on log 10 (x(e − )) and corresponding error factor on x(e − ), the maximum absolute error on log 10 (x(e − )) and corresponding error factor on x(e − ), the fit coefficients (corresponding to the fit formula given in Eq. 4) for estimating the uncertainty on the prediction, and the limits of the validity range of the fit (given as log 10 of the ratio values. Ranking of line intensity ratios according to their usefulness to predict the ionization fraction in translucent medium conditions (measured through the R 2 of a fitted Random Forest model). Additional error measures of the Random Forest model (root mean square error and maximum absolute errors) are also given. As these errors concern the logarithm of the ionization fraction, we also provide the equivalent error factors on the ionization fraction. For comparison, the R 2 obtained with the analytical fit described in Sect. 5 is also listed in the last column.

Appendix C: Analytical model visualization
We present in this section figures of the RF models and analytical fits for the best six tracers in each case.
Appendix C.1: Translucent medium  Ranking of column density ratios according to their usefulness to predict the ionization fraction in dense cold medium conditions (measured through the R 2 of a fitted Random Forest model). Additional error measures of the Random Forest model (root mean square error and maximum absolute errors) are also given. As these errors concern the logarithm of the ionization fraction, we also provide the equivalent error factors on the ionization fraction. For comparison, the R 2 obtained with the analytical fit described in Sect. 5 is also listed in the last column.  Table B.4. Ranking of line intensity ratios according to their usefulness to predict the ionization fraction in dense cold medium conditions (measured through the R 2 of a fitted Random Forest model). Additional error measures of the Random Forest model (root mean square error and maximum absolute errors) are also given. As these errors concern the logarithm of the ionization fraction, we also provide the equivalent error factors on the ionization fraction. For comparison, the R 2 obtained with the analytical fit described in Sect. 5 is also listed in the last column.
Line intensity ratio Random Forest Model Analytical fit R 2 Root mean square error Maximum absolute error R 2 dex (equ. factor) dex (equ. factor) CF A&A proofs: manuscript no. paper_RF_ioniz Table B.5. Fit coefficients (for our main fit and scatter fit) and fit quality for column density ratios in translucent medium conditions. We list the fit coefficients for predicting log 10 (x(e − ) (according to the fit formula given in Eq. 3), the quality of this fit estimated as the (cross-validated) R 2 , root mean square error (RMSE) on log 10 (x(e − ) and corresponding error factor on x(e − ), maximum absolute error factor on log 10 (x(e − )) and corresponding error factor on x(e − ), the fit coefficients for estimating the uncertainty on the prediction (according to the fit formula given in Eq. 4), and the validity range of the fit (given as log 10 of the ratio values).
Column density ratio Main fit coefficients  Table B.6. Fit coefficients (for our main fit and scatter fit) and fit quality for line intensity ratios in translucent medium conditions. We list the fit coefficients for predicting log 10 (x(e − ) (according to the fit formula given in Eq. 3), the quality of this fit estimated as the (cross-validated) R 2 , root mean square error (RMSE) on log 10 (x(e − ) and corresponding error factor on x(e − ), maximum absolute error factor on log 10 (x(e − )) and corresponding error factor on x(e − ), the fit coefficients for estimating the uncertainty on the prediction (according to the fit formula given in Eq. 4), and the validity range of the fit (given as log 10 of the ratio values).

Line intensity ratio
Main fit coefficients  Table B.7. Fit coefficients (for our main fit and scatter fit) and fit quality for column density ratios in cold dense medium conditions. We list the fit coefficients for predicting log 10 (x(e − ) (according to the fit formula given in Eq. 2), the quality of this fit estimated as the (cross-validated) R 2 , root mean square error (RMSE) on log 10 (x(e − ) and corresponding error factor on x(e − ), maximum absolute error factor on log 10 (x(e − )) and corresponding error factor on x(e − ), the fit coefficients for estimating the uncertainty on the prediction (according to the fit formula given in Eq. 4), and the validity range of the fit (given as log 10 of the ratio values).

Column density ratio
Main fit coefficients  Table B.8. Fit coefficients (for our main fit and scatter fit) and fit quality for line intensity ratios in cold dense medium conditions. We list the fit coefficients for predicting log 10 (x(e − ) (according to the fit formula given in Eq. 2), the quality of this fit estimated as the (cross-validated) R 2 , root mean square error (RMSE) on log 10 (x(e − ) and corresponding error factor on x(e − ), maximum absolute error factor on log 10 (x(e − )) and corresponding error factor on x(e − ), the fit coefficients for estimating the uncertainty on the prediction (according to the fit formula given in Eq. 4), and the validity range of the fit (given as log 10 of the ratio values).

Line intensity ratio
Main fit coefficients     Ionization fraction versus column density ratio for the best six ratios found in Sect.4 for cold dense medium conditions. The chemical model grid is shown as a scatter plot, with the central crowded regions replaced by PDF isocontours containing 25%, 50%, and 75% of the points. Superimposed are the RF model (red line), the analytical fit (solid black line), and the analytical fit of the 1σ uncertainty (dashed black lines). The quality estimates of the two models are indicated on the figure.