Issue 
A&A
Volume 643, November 2020



Article Number  A36  
Number of page(s)  33  
Section  Interstellar and circumstellar matter  
DOI  https://doi.org/10.1051/00046361/202038593  
Published online  28 October 2020 
3D chemical structure of diffuse turbulent ISM
I. Statistics of the HItoH_{2} transition
^{1}
Observatoire de Paris, PSL University, Sorbonne Université, LERMA,
75014
Paris, France
email: elena.bellomi@obspm.fr
^{2}
Laboratoire de Physique de l’Ecole normale supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris,
75005
Paris, France
^{3}
Laboratoire AIM, CEA/IRFU, CNRS/INSU, Université Paris Diderot, CEASaclay,
91191
GifSurYvette, France
^{4}
Université ParisSaclay, CNRS, Institut d’Astrophysique Spatiale,
91405
Orsay, France
Received:
5
June
2020
Accepted:
8
August
2020
Context. The amount of data collected by spectrometers from radio to ultraviolet (UV) wavelengths opens a new era where the statistical and chemical information contained in the observations can be used concomitantly to investigate the thermodynamical state and the evolution of the interstellar medium (ISM).
Aims. In this paper, we study the statistical properties of the HItoH_{2} transition observed in absorption in the local diffuse and multiphase ISM. Our goal is to identify the physical processes that control the probability of occurrence of any line of sight and the origins of the variations of the integrated molecular fraction from one line of sight to another.
Methods. The turbulent diffuse ISM is modeled using the RAMSES code, which includes detailed treatments of the magnetohydrodynamics, the thermal evolution of the gas, and the chemistry of H_{2}. The impacts of the UV radiation field, the mean density, the turbulent forcing, the integral scale, the magnetic field, and the gravity on the molecular content of the gas are explored through a parametric study that covers a wide range of physical conditions. The statistics of the HItoH_{2} transition are interpreted through analytical prescriptions and compared with the observations using a modified and robust version of the KolmogorovSmirnov test.
Results. The analysis of the observed background sources shows that the lengths of the lines of sight follow a flat distribution in logarithmic scale from ~100 pc to ~3 kpc. Without taking into account any variation of the parameters along a line of sight or from one line of sight to another, the results of one simulation, convolved with the distribution of distances of the observational sample, are able to simultaneously explain the position, the width, the dispersion, and most of the statistical properties of the HItoH_{2} transition observed in the local ISM. The tightest agreement is obtained for a neutral diffuse gas modeled over ~200 pc, with a mean density n̅_{H̅} = 1−2 cm^{−3}, illuminated by the standard interstellar UV radiation field, and stirred up by a largescale compressive turbulent forcing. Within this configuration, the 2D probability histogram of the column densities of H and H_{2}, poetically called the kingfisher diagram, is remarkably stable and is almost unaltered by gravity, the strength of the turbulent forcing, the resolution of the simulation, or the strength of the magnetic field B_{x}, as long as B_{x} < 4 μG. The weak effect of the resolution and our analytical prescription suggest that the column densities of HI are likely built up in largescale warm neutral medium and cold neutral medium (CNM) structures correlated in density over ~20 pc and ~10 pc, respectively, while those of H_{2} are built up in CNM structures between ~3 and ~10 pc.
Conclusions. Combining the chemical and statistical information contained in the observations of HI and H_{2} sheds new light on the study of the diffuse matter. Applying this new tool to several atomic and molecular species is a promising perspective to understanding the effects of turbulence, magnetic field, thermal instability, and gravity on the formation and evolution of molecular clouds.
Key words: ISM: structure / ISM: molecules / ISM: kinematics and dynamics / ISM: clouds / methods: numerical / methods: statistical
© E. Bellomi et al. 2020
Open Access article, published by EDP Sciences, under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1 Introduction
The multiphase nature of the interstellar medium (ISM) is at the root of the regulation of star formation in galaxies (e.g., Hill et al. 2018). As shown by the emission profiles of the HI 21 cm line (Heiles & Troland 2003a,b; Murray et al. 2015, 2018), the diffuse neutral ISM is composed of two stable thermal states at thermal pressure equilibrium (Jenkins & Tripp 2011), the warm neutral medium (WNM, T ~ 7000 K) and the cold neutral medium (CNM, T ~ 70 K), coexisting with a third unstable state, the lukewarm neutral medium (LNM), whose temperature is comprised between those of the CNM and the WNM (e.g., Marchal et al. 2019). Through condensation and evaporation processes, turbulent transport, and turbulent mixing, the diffuse matter flows from one stable state to the other eventually leading to the formation of dense and cold clouds massive enough to trigger gravitational collapse (e.g., Ostriker et al. 2010). While this picture is widely accepted, the intricated effects of turbulence, gravity, radiation field, and magnetic field on the exchange of mass andenergy between the different phases and on the formation of structures at all scales has yet to be unveiled.
Following the illustrious analytical descriptions of the thermal instability process (Field 1965; Wolfire et al. 1995, 2003; Bialy & Sternberg 2019), several analytical and numerical studies have been dedicated to understand the dynamical evolution of the gas, focusing on the formation of CNM structures, molecular clouds, and collapsing cores (e.g., Hennebelle & Pérault 1999; Koyama & Inutsuka 2002a,b; Audit & Hennebelle 2005; VázquezSemadeni et al. 2007; Hennebelle et al. 2008), as well as on the stability of clouds of various geometries under evaporation and condensation conditions (e.g., Inoue et al. 2006; Stone & Zweibel 2009; Kim & Kim 2013; Nagashima et al. 2005; Iwasaki & Inutsuka 2014). These show that largescale turbulence combined with thermal instability is sufficient to explain several features of the neutral ISM, including the fractions of mass observed in the different thermal states (Seifried et al. 2011; Hennebelle & Iffrig 2014; Hill et al. 2018), the distribution of thermal pressure (Saury et al. 2014), and the mass spectrum, the masssize relation, and the velocity dispersionsize relation of molecular clouds (e.g., Audit & Hennebelle 2010; Padoan et al. 2016; Iffrig & Hennebelle 2017).
To extend the predictions of simulations to a larger set of observational diagnostics, recent numerical studies haveundertaken the challenging task of solving the chemical evolution of turbulent and/or multiphase environments. Originally dedicated to the formation of CO in molecular clouds and to the analysis of the COtoH_{2} conversion factor in galaxies and the CO darkgas (e.g., Glover et al. 2010; Smith et al. 2014; Richings & Schaye 2016a; Seifried et al. 2017; Gong et al. 2018), numerical simulations are now used to study a variety of atomic and molecular tracers, including CII, CI, CH^{+}, OH^{+}, H_{2}O^{+}, and ArH^{+} (e.g., Richings & Schaye 2016b; Valdivia et al. 2017; Clark et al. 2019; Bialy et al. 2019). All these works demonstrate the predictive power of astrochemistry. The column density distribution of each atom and molecule has a unique signature that provides detailed information on the thermodynamical state of the diffuse matter (Clark et al. 2019). In turn, the confrontation with the predictions of numerical simulations can be used to estimate the scale and strength of the injection of mechanical energy by stellar feedback (Bialy et al. 2019), the largescale turbulent transport and the interfaces between CNM and WNM (Valdivia et al. 2017), and the nature of the turbulent dissipation processes (Lesaffre et al. 2020).
In this context, understanding the formation and survival of molecular hydrogen has long been recognized as a major topic of investigation. As the most abundant molecule in space, H_{2} is at the root of interstellar chemistry and the growth of molecular complexity. In addition, and because its formation preferentially occurs in dense environments, H_{2} naturally correlates with the star formation rate of galaxies (e.g., Lupi et al. 2017) and therefore offers a valuable proxy to understand the limit in the KennicuttSchmidt relation above which star formation occurs (e.g., Bigiel et al. 2008, 2011; Schruba et al. 2011; Leroy et al. 2013).
Over the last decades, great efforts have thus been devoted to propose analytical descriptions of the HItoH_{2} transition in homogeneous clouds with planeparallel or spherical geometries (e.g., Sternberg 1988; Krumholz et al. 2008; McKee & Krumholz 2010; Sternberg et al. 2014 and references therein), compute this transition in detailed 1D chemical models assuming chemical equilibrium (e.g., van Dishoeck & Black 1986; Abgrall et al. 1992; Le Bourlot et al. 2012; Bron et al. 2014) or not (e.g., Lee et al. 1996; Goldsmith et al. 2007; Lesaffre et al. 2007), treat the chemistry of H and H_{2} in subgrid models applied to simulations of galaxy formations (e.g., Gnedin et al. 2009; Christensen et al. 2012; Thompson et al. 2014; Diemer et al. 2018), or solve it in 3D isothermal or multiphase simulations of the diffuse ISM using various treatments of the radiative transfer (e.g., Glover et al. 2010; Valdivia et al. 2016; Hu et al. 2016; Bialy et al. 2017; Nickerson et al. 2018).
Thanks to all these works, a global picture of the formation of H_{2} in galaxies is now emerging. At the scale of a homogeneous cloud, the molecular content, the sharpness of the HItoH_{2} transition, and the asymptotic column density of HI are controlled by the ratio of the intensity of the ultraviolet (UV) field to the gas density and the dusttogas ratio, or equivalently, the metallicity (Sternberg et al. 2014). At larger scales, the integrated column densities of HI and H_{2} also depend on the distribution of clouds of various densities along the line of sight and on the porosity to the UV radiation field. Because of these effects, the statistical properties of the total column density are found to depend on the strength, the scale, and the compressibility of the turbulent forcing in simulations of CNM gas (Micic et al. 2012; Bialy et al. 2017). The amount of molecular gas depends on the “clumpiness factor” used for the subgrid models in simulations of galaxy formation (Gnedin et al. 2009; Christensen et al. 2012).
Despite these achievements, very few works have been dedicated so far to the analysis of the HItoH_{2} transition in a turbulent multiphase medium at a scale sufficient to resolve the formation of CNM structures. In addition and while the predictions of analytical models (e.g., Krumholz et al. 2008) and simulations (Gnedin et al. 2009; Valdivia et al. 2016) were able to reproduce the trend of the HItoH_{2} transition observed by Copernicus and FUSE in the local ISM (e.g., Savage et al. 1977; Gillmon et al. 2006; Rachford et al. 2009), the LMC and the SMC (e.g., Browning et al. 2003; Gillmon et al. 2006; Leroy et al. 2007), no detailed comparison with the statistical properties of these observations have been proposed. As a result, the occurrence of lines of sight with large molecular fractions predicted by numerical simulations often exceed what is deduced from the observations (Valdivia et al. 2016). Finally, and while statistical studies of 1D probability distribution functions (PDF) have become a common tool to understand the formation and the dynamics of molecular clouds (Körtgen et al. 2019), few statistical studies have been performed to date on 2D probability distribution functions using combined observations of different molecular tracers. In that perspective, the recent work of Bialy et al. (2019) opens new horizons for the analysis of chemistry in the diffuse matter.
In the first paper of this series, we extend these pioneer statistical studies to the measurements of the atomictomolecular transition observed in the diffuse and translucent ISM located in a radius of ~ 3 kpc around the sun. We perform a parametric exploration of numerical simulations of the multiphase ISM and compare the results with the observed 2D probability histogram (PH) of total and molecular hydrogen column densities in order to identify the physical processes that control the molecular content of CNM clouds and the probability of occurrence of lines of sight. The observational dataset and the distribution of sizes of the sampled medium are presented in Sect. 2. The different setups of the simulations and the method used to reconstruct the 2D PH are described in Sect. 3. The comparisons with the observations are shown in Sect. 4 which also highlights the influences of the different parameters. The paper finally ends with Sects. 5 and 6 where we discuss the validity of our approach and summarize our main conclusions.
2 Observations of the HItoH_{2} transition
2.1 Observational sample and distances
The observational sample studied in this work is built from the database of Gudennavar et al. (2012) who compiled existing data of atomic and molecular lines observed in absorption toward several thousand sources, including stars and AGNs. Limiting this catalog to observations or tentative detections of HI, H_{2}, and of the reddening E(BV), and removing the data associated to the Magellanic Cloud or high redshift extragalactic environments (e.g., Tumlinson et al. 2002; Cartledge et al. 2005; Welty & Crowther 2010; Noterdaeme et al. 2007), we obtain a sample of 360 sources which form, to date, the most complete set of observations of the HItoH_{2} transition in the local diffuse ISM. A more detailed description of this set, the list of the background sources, and the values of the column densities of HI and H_{2} toward each source, N(H) and N(H_{2}) are given in Appendix A and Table A.1.
The positions of the sources in the sky and their distance deduced from Gaia and HIPPARCOS measurements of parallaxes (Perryman et al. 1997; Gaia Collaboration 2018) are shown in Fig. 1, where unknown distances of extragalactic sources (see Table A.1) are arbitrarily set to 1 Mpc. With comparable numbers of observations in all Galactic quadrants, the sources appear to be well distributed in Galactic longitudes. Oppositely, and while the sources cover almost all Galactic latitudes, about twothirds of them are located toward the Galactic disk with latitudes smaller than 15°, and only one third is located above, crossing the Galactic halo. Since the sample contains extragalactic sources, and since the amount of molecular gas in the Milky Way decreases exponentially as a function of the distance from the midplane, the length of the line of sight l_{los} occupied by the observed diffuse gas cannot always be identified to the distance of the background source. For the sake of simplicity, we assume here a molecular height above the midplane of 100 pc and compute the length of the observed diffuse Galactic material as (1)
where b is the Galactic latitude of the background source and p is its parallax.
The resulting distribution of the lengths of the lines of sight is shown in Fig. 2. The shortest lines of sight are found to extend over ~100 pc and the largest over ~3.5 kpc. Remarkably, and because of the combined distributions of distances and Galactic latitudes of the sources, we find that log (l_{los}) follows a flat distribution up to l_{los} ~ 2 kpc with about 50 sources per bin and drops by about a factor of two for l_{los} ~ 3 kpc. Oppositely, and as expected, the distribution of lengths of nondetections of H_{2} is not flat but decreases rapidly up to 1 kpc. Long lines of sight are finally not limited to the first and fourth Galactic quadrants but are found to spread over all Galactic longitudes and mostly depend on the Galactic latitude of the background source.
Fig. 1 Aitoff projection, in Galactic longitude and latitude coordinates, of the background sources of the observational sample of HI and H_{2} deduced from absorption studies and used in this work (see Appendix A and Table A.1). The colorcode indicates the distance of the source. All unknown distances correspond to extragalactic sources (see Table A.1): these are arbitrarily set to 1 Mpc and indicated with black points. 
Fig. 2 Distribution of lengths l_{los} of the intercepted diffuse material computed with Eq. (1) along all lines of sight of the observational sample. The orange sample corresponds to lines of sight where H_{2} is detected and the green sample to those for which an upper limit on N(H_{2}) has been derived (see Table A.1). 
2.2 Physical and statistical properties
The compiled data are shown in Fig. 3 which displays the observed column densities of H_{2} as functions of the total proton column densities of the gas N_{H} = N(H) + 2N(H_{2}). As shown inFig. 3 and as already noted by Goldsmith et al. (2009), almost no line of sight is either purely WNM or purely molecular. This implies that the observed gas is necessarily composed of a combination of phases and clouds of different extinctions with various contributions to the volume spanned by the different lines of sight. As a result, the integrated molecular fraction computed as (2)
shows a large dispersion in the observational sample covering about seven orders of magnitude. While the molecular fraction averaged over all the lines of sight is found to be 0.20, the mass averaged molecular fraction is 0.27, a value similar to the results obtained by MivilleDeschênes et al. (2017) at 8.5 kpc based on the analysis of molecular clouds observed in CO in the entire Galactic disk. The position of the HItoH_{2} transition, on the other hand, is found to extend over about one order of magnitude of total column density from N_{H} ~ 10^{20} to 10^{21} cm^{−2} and occurs, on average, at N_{H} ~ 3 × 10^{20} cm^{−2} (Gillmon et al. 2006).
In order to highlight the statistical features of this transition, we divide the observational sample into 5 subsamples A, B, C, D, and E, shown in Fig. 3, which encompass almost all the observational points and whose statistical properties are summarized in Table 1. 48 lines of sight out of 360 are found to be not detected in H_{2}. Most of these upper limits are obtained for a total column density smaller than 10^{21} cm^{−2} and about half of them provide strong constraints on the molecular fraction with . 3% of the 312 detections belong to the subsample A, 13% to subsample B, 16% to subsample C, and 65% to subsample D. Interestingly, the subsample E is empty and no line of sight is observed with N_{H} > 10^{22} cm^{−2}. While the mean value of the logarithm of the molecular fraction strongly increases from subsamples A to D, the dispersion simultaneously decreases by about a factor of three, probably revealing an effect of average over long distances.
All these statistical properties, and more precisely the probability of occurrence of a given line of sight, are the subject of this paper. What physical processes control the HItoH_{2} transition? How does the distribution of lengths of the lines of sight influence its observed statistical properties? What are the origins of the variations of the molecular fraction from one line of sight to another?
Fig. 3 H_{2} column density as a function of the total column density of protons N_{H}. Open circles correspond to detections of H_{2} while arrows correspond to upper limits (see Table A.1). The blue dashed line indicates the maximum value of N(H_{2}) derived from a purely molecular medium with an integrated molecular fraction (Eq. (2)). The red dasheddotted line indicates the theoretical molecular fraction derived in an unshielded WNMtype environment with a density of 0.5 cm^{−3} and a temperature of 8000 K, illuminated by a UV photon flux of 10^{8} cm^{−2} s^{−1} (see Eqs. (13) and (15)). The regions A, B, C, D, and E defined in Table 1 correspond to an arbitrary separation of the observational sample used for quantitative comparisons with the results of simulations (see Sect. 4). 
3 Physics and numerical method
To study thephysical processes at play in the HItoH_{2} transition, we performed numerical simulations of the multiphase diffuse ISM, using the RAMSES code (Teyssier 2002; Fromang et al. 2006), a gridbased solver with adaptative mesh refinement (Berger & Oliger 1984). The methodology applied in this paper follows the works of Seifried et al. (2011), Saury et al. (2014), and Valdivia et al. (2015, 2016).
The diffuse matter in the Solar Neighborhood of our galaxy is simulated over a box of size L with periodic boundary conditions. The matter, defined by a mean proton density , is assumed to be illuminated on all sides by an isotropic spectrum of UV photons set to the standard interstellar radiation field (Habing 1968) and scaled with a factor G_{0}.
3.1 Fluid equations
Within this framework, RAMSES computes the evolution of the gas solving the classic equations of ideal magnetohydrodynamics (MHD):
where ρ, v, B, P and E are the mass density, the velocity field, the magnetic field, the total pressure, and the total energy density, respectively. The net cooling function per unit mass, , and the acceleration due to the turbulent driving, f, are described in Sects. 3.2 and 3.3. The axis x, y, and z are chosen so that z corresponds to the direction perpendicular to the Galactic disk, and x corresponds to the direction of the mean magnetic field initially parametrized by a constant value B_{x}.
To take into account all gravitational forces, including selfgravity and the action of stars and dark matter, the gravitational potential Φ is divided into two terms: (7)
The selfgravity potential, ϕ_{gas}, is deduced from the Poisson’s equation: (8)
Following Kuijken & Gilmore (1989) and Joung & Mac Low (2006), we assume that the Galactic potential along the direction z perpendicular to the Galactic disk can be written as (9)
where the first term is the contribution of the stellar disk parametrized by z_{0} = 0.18 kpc and a_{1} = 1.42 × 10^{−3} kpc Myr^{−2} and the second term is the contribution of the spherical dark halo parametrized by a_{2} = 5.49 × 10^{−4} Myr^{−2}.
3.2 Thermal processes and radiative transfer
As shown by Field (1965), the multiphase nature of the ISM results from the thermal balance of the gas and thus from its net cooling function defined by (10)
where and n_{H}Γ are the cooling and heating rates of the medium (in erg cm^{−3} s^{−1}) and n_{H} is the proton density. To correctly describe the thermal state of the diffuse ISM, we include in this work the heating induced by the photoelectric effect and the decay of cosmic ray particles and the cooling induced by Lyman α photons, therecombination of electrons onto grains, and the fine structure lines of OI and CII. All these processes, described in Appendix B, are modeled with the analytical formulae given by Wolfire et al. (2003).
The absorption of UV photons by dust, and its subsequent impact on the photoelectric effect, is treated with the treebased method proposed by Valdivia & Hennebelle (2014). At each point the effective radiation field G_{eff} (in Habing units) is computed as (11)
where A_{V} is the visual extinction along a given ray, deduced from the integrated proton column density^{1} computed from the border of the box to the current point (12)
and ⟨e^{−2.5AV}⟩ is an average performed over 12 directions, treated as solid angles evenly spread in polar coordinates.
3.3 Turbulence forcing
To mimic the injection of mechanical energy in the diffuse ISM, a large scale turbulent forcing is applied. Following Schmidt et al. (2009) and Federrath et al. (2010), this forcing, modeled by an acceleration f in the momentum conservation equation, is driven through an OrnsteinUhlenbeck process using a pseudospectral method. At regular time intervals Δτ, random fluctuations of the forcing term are generated and applied over an autocorrelation timescale τ. To excite only large scale modes, the forcing is modeled as a paraboloid in Fourier space covering a small interval of wavenumbers 1 ≤ k ≤ 3 and centered on k = 2. Using the notations of Seifried et al. (2011) and Saury et al. (2014), the total magnitude of these perturbations is set with either an acceleration parameter F or, equivalently, a velocity parameter V related by F = V^{2}∕L_{drive}, where L_{drive} is the main driving scale, L_{drive} = L∕2. A Helmholtz decomposition is finally applied, in order to control the powers injected in compressive and solenoidal modes. Using the classical notation, these powers are set with a parameter^{2} ζ ranging from a pure solenoidal field (ζ = 1) to a pure compressive field (ζ = 0).
Throughout this work, we adopt Δτ ~ 0.4 Myr which roughly corresponds to the time interval separating two supernova events occurring in a volume of (200 pc)^{3}. The characteristic damping time of the turbulence τ is approximately set to the turnover timescale of the diffuse ISM Myr (Larson 1981; Hennebelle & Falgarone 2012). F (or V) and ζ are left as free parameters.
3.4 H_{2} chemistry
The timescale required for the abundance of molecular hydrogen to reach its equilibrium value is known to range over several orders of magnitude, depending on the physical conditions of the ISM (e.g., Goldsmith et al. 2007; Tabone 2018). In the diffuse gas, this timescale varies typically between a few 10^{3} yr and a few 10^{7} yr (Valdivia et al. 2016), hence over a range of values that often exceeds the dynamical timescales. To take into account this important aspect of the diffuse interstellar chemistry, the outofequilibrium abundance of H_{2} is computed selfconsistently in the simulation, using the formalism introduced in RAMSES by Valdivia et al. (2015, 2016).
The formation of H_{2} onto grains in physisorption and chemisorption sites is modeled with the simplified rate of Le Bourlot et al. (2012) (13)
where n_{H} and n(H) are the localproton and atomic hydrogen densities, and (14)
is the sticking coefficient of H onto grain, parametrized by T_{2} = 464 K and β = 1.5.
The destruction of H_{2} by UV photons is computed using the formalism, described by Draine & Bertoldi (1996) and Sternberg et al. (2014), which is classically introduced in many astrochemical models (e.g., Lesaffre et al. 2013; Gong et al. 2017; Bialy et al. 2017). In each cell, the photodestruction rate of H_{2} is modeled as (15)
where x = N(H_{2})∕5 × 10^{14} cm^{−2}, k_{d,0}= 3.3 × 10^{−11} s^{−1} is the inverse freespace dissociation timescale of H_{2} in an isotropic Habing field, n(H_{2}) is the local density of the molecular hydrogen, e^{−σdNH} is the shielding induced by dust and f_{shield} the selfshielding function. We adopt here an effective dust attenuation cross section at λ = 1000 Å, σ_{d} = 2 × 10^{−21} cm^{2} (Sternberg et al. 2014). Following Draine & Bertoldi (1996), the selfshielding function is computed as (16)
where b_{D} is the Doppler broadening parameter expressed in km s^{−1}. As done for the photoelectric heating rate (see Sect. 3.2), both the shielding by dust and the selfshielding are calculated along 12 different directions and then averaged to obtain the photodissociation rate of Eq. (15).
3.5 Fiducial model and grids of parameters
The framework described above lays on several independent parameters, which are all related to key physical ingredients of the ISM. The influence of each ingredient on the HItoH_{2} transition is studied here through several grids of simulations − including a total of 305 runs − covering a broad range of physical conditions and centered around a fiducial setup^{3}. The reference value adopted for each parameter, and the range of values explored in this work, are summarized in Table 2. Among all parameters, L, and G_{0} are of particularimportance.
With our assumptions, L simultaneously corresponds to the scale of illumination of the gas by UV photons and twice the integral scale of turbulence, that is twice the scale of injection of mechanical energy L_{drive}. OB stars, which are the dominant sources of the interstellar UV field, are not uniformly distributed in the sky but are known to be clustered in associations (Ambartsumian 1947). As shown by the recent 3D studies of the distributions of stars based on the HIPPARCOS and Gaia Catalogs (e.g., Bouy & Alves 2015, Zari et al. 2018), the typical distances separating two associations in the Solar Neighborhood range between 50 pc and a few hundreds of pc, that is several times the mean distance deduced from the integrated surface densities of OB stars (~ 1.6 × 10^{−3} pc^{−2}, MaízApellániz 2001). Interestingly, such distances are not only comparable to the heights of the molecular gas (~ 75 pc) and the cold HI gas (~150 pc) above the Galactic plane deduced from CO and HI allsky surveys (e.g., Dame et al. 2001, Dickey & Lockman 1990, Kalberla & Kerp 2009), but they also correspond to the typical size of HI superclouds (Elmegreen & Elmegreen 1987). For all these reasons, we, therefore, adopt a fiducial simulation with L = 200 pc and explore values down to a few tens of pc.
The mean density of the gas, , represents the mass of the diffuse neutral ISM contained in a volume L^{3}, and also controls the porosity of the matter to the impinging radiation field. In this work, we adopt a fiducial value = 2 cm^{−3}, a value slightly larger than the standard Galactic midplane density of HI at a galactocentric distance of 8.5 kpc (Kalberla & Kerp 2009). With our fiducial value of L, the mass of gas contained in the box accounts for all the mass surface density of HI in the Solar Neighborhood (Σ_{HI} ~ 10 M_{⊙} pc^{−2}, Nakanishi & Sofue 2016, MivilleDeschênes et al. 2017).
G_{0} controls the intensity of the radiation field, hence the thermodynamical state of the gas. Since G_{0} is normalized to the Habing field, we choose a fiducial value of G_{0} = 1. The corresponding UV energy density of the standard setup is therefore slightly smaller than that contained in the standard UV radiation fields given by Draine (1978) and Mathis et al. (1983).
The standard values of the two parameters F and B_{x} are set to 9 × 10^{−4} kpc Myr^{−2} and 3.8 μG, respectively. As we will show later, those values are chosen so that the velocity dispersion of the gas and the strength of the magnetic field are close to the values observed in the diffuse ISM.
Fiducial model and range of parameters explored in this work.
3.6 Steadystate
The evolution of the multiphase environments simulated here is identical to the description already given in many papers (e.g., Seifried et al. 2011; Saury et al. 2014; Valdivia et al. 2015, 2016). Starting from an homogeneous density , the gas evolves under the joint actions of turbulence, gravity, and thermal instability, and splits up in three different phases at thermal pressure equilibrium, the WNM, the CNM, and the LNM. The formation of dense environments well shielded from the destructive UV radiation field triggers the formation of H_{2} and the medium jointly evolves from a purely atomic state to a partly molecular state. If the mass of the gas is conserved, as imposed by the periodic boundary conditions, the medium progressively tends toward a steadystate where the mean pressure, the volume filling factors of the different phases, their velocity dispersion, and their mean molecular fractions are roughly constant. This steadystate is typically reached after a few turnover timescales, providing that the corresponding time is longer than the damping time (see Sect. 3.3).
Because of turbulence and thermal instability, the steadystate has a statistical nature. The turbulent forcing and the subsequent turbulent cascade induce pressure variations and shear motions at all scales which trigger mass exchanges between the different phases. Any pressure or density structure is therefore a transient system which is usually described by its contribution to probability distribution functions. At steadystate, only PDFs remain constant. This steady yet ever changing environment is the reason for the sustained presence of a substantial amount of gas in the LNM at densities and temperatures out of thermal equilibrium (e.g., Marchal et al. 2019). All the results shown throughout this paper are taken at steadystate, at times ranging from a few tens of Myr up to 100 Myr depending on the strength of the turbulent forcing.
3.7 Properties of the multiphase medium
The steadystate values of the mean pressure ⟨P∕k⟩, the turbulent velocity dispersion σ_{tur}, and the fractions of mass of the WNM and the CNM, f_{WNM} and f_{CNM}, are shown in Fig. 4 for a set of 60 different simulations. For the sake of simplicity, we assume that the WNM is composed of all cells with a temperature T ≥3000 K, the CNM of all cells with T ≤ 300 K, and the LNM of all cells with 300 K < T < 3000 K, hence (17)
The mean pressure ⟨P∕k⟩ (with k the Boltzmann constant) is classically computed as an average over the entire volume. While the above conventions are well established, there are many ways to define the turbulent velocity dispersion σ_{tur}. It could be defined as the volumeweighted or massweighted velocity dispersion computed over the entire volume (Audit & Hennebelle 2005; Federrath et al. 2010), the average of the massweighted velocity dispersion computed along individual lines of sight (MivilleDeschênes & Martin 2007; Saury et al. 2014), or the dispersion of the massweighted velocity centroids computed along independent directions (Henshaw et al. 2019). All these definitions give velocity dispersions that differ from one another and are not equally relevant for the comparison with observed quantities. To relate the velocity dispersion to a kinetic energy and provide values that could be compared to the observations of broad HI emission profiles at high Galactic latitude (see below), we chose to compute σ_{tur} in the WNM only as (19)
and where the sums are performed over all cells with T ≥ 3000 K.
The results displayed in Fig. 4 are very similar to those obtained in the parametric studies of Seifried et al. (2011) and Saury et al. (2014) and are in line with the expectations of models of turbulent multiphase environments (Wolfire et al. 2003; Ostriker et al. 2010). The mean pressure of the gas is primarily regulated by the thermal equilibrium curve (see Appendix B) which depends on the local illumination of the gas by the UV radiation field G_{eff}. Since G_{eff} results from the absorption of the external UV field by the surrounding environments, ⟨P∕k⟩ is not only sensitive to G_{0} but also to the total mass of the simulation set by and L. Larger values of G_{0} or smaller values of L or leads to larger ⟨P∕k⟩. In turn, the fractions of mass contained in the WNM and the CNM are controlled by the mean pressure and the total mass of the gas. Larger pressure implies larger densities of the WNM (and the CNM). The fraction of mass of the CNM therefore decreases as ⟨P∕k⟩ increases; eventually, if the density of the WNM becomes comparable to the mean density , the CNM almost entirely disappears (see bottom left panels of Fig. 4). At last, and because the WNM occupies most of the volume, f_{CNM} necessarily increases as a function of the total mass of the gas, or equivalently , even at constant pressure.
The turbulent forcing induces pressure fluctuations and shearing motions at all scales. As shown by Seifried et al. (2011), this not only frequently perturbs the gas out of the thermal equilibrium states but also strongly reduces the times spent by any fluid elements in the WNM, LNM, and CNM. Because of these two aspects, increasing the turbulent forcing reduces the mass of the CNM to thebenefit of those of the LNM and WNM (right panels of Fig. 4 and Fig. 10 of Seifried et al. 2011). The mean pressure therefore increases, the 1D PDF of the density broadens and its bimodal nature progressively disappears (Piontek & Ostriker 2005; Walch et al. 2011). These effects can be magnified depending on the nature of the turbulent forcing and the power injected in the compressive and solenoidal modes. Because solenoidal motions are more efficient to prevent the gas to condensate back to the CNM phase, a pure solenoidal forcing naturally leads to larger pressure and smaller CNM fractions than those obtained with an equivalent kinetic energy injected in pure compressive modes.
As expected, the velocity dispersion of the WNM is mostly given by the strength of the turbulent forcing and the driving scale (L_{drive} ~ L∕2, see Sect. 3.3), with a slight dependence on and ζ. As proposed by Saury et al. (2014), a realistic value for the turbulent velocity dispersion of the WNM can be estimated by looking at the HI 21 cm emission spectra with the fewest components observed at high Galactic latitude (Kalberla et al. 2005). Toward these directions, Haud & Kalberla (2007) derive a total velocity dispersion km s^{−1}, where the σ_{thr} is the 1D thermal velocity dispersion (~8.2 km s^{−1} for the WNM). In the present paper, the turbulent forcing applied to the standard simulation (see Table 2 and Fig. 4) is chosen so that σ_{tur} ~ 4−5 km s^{−1}, in fair agreement with the observations at high Galactic latitude. While this value is chosen as a reference, the velocity dispersions obtained in all the simulations explored in this work range between 1 and 15 km s^{−1} (see Fig. 4).
Fig. 4 Colored tables of the mean pressure expressed in K cm^{−3} (first line), the turbulent velocity dispersion σ_{tur} (second line), and the fractions of mass f_{WNM} and f_{CNM} contained inthe WNM phase (third line) and the CNM phase (fourth line). First and second columns: these quantities as functions of and G_{0}, for L = 50 pc and F = 3.6 × 10^{−3} kpc Myr^{−2} (first column) and for L = 200 pc and F = 9 × 10^{−4} kpc Myr^{−2} (second column). Third and fourth columns: these quantities as functions of the acceleration parameter F and the compressive ratio ζ, for L = 50 pc (third column) and L = 200 pc (fourth column). All other parameters are set to their standard values (see Table 2). 
3.8 Reconstruction of lines of sight
As shown in Sect. 2, the medium observed in absorption at UV and visible wavelengths extends over a very broad range of distances, from ~100 pc to several kpc (see Fig. 2). The targeted lines of sight may therefore contain several isolated diffuse neutral phases but also hot and warm ionized material (McKee & Cowie 1977; de Avillez & Breitschwerdt 2004). Indeed, such a superimposition of independent components is particularly well seen in submillimeter and infrared observations of the Galacticdisk where the gas seen in absorption is found to cluster in several velocity components associated to known Galactic structures (e.g., Gerin et al. 2016). Since our setup only follows a piece of diffuse neutral material of size L, an additional treatment regarding the lengths of the lines of sight is therefore required in order to compare the results of the simulations to the distribution of observations. We apply here a methodology similar to that proposed by Bialy et al. (2019) and schematized in Fig. 5. We assume that a given simulation corresponds to a building block of neutral diffuse ISM. Depending on its length, any random line of sight necessarily intercepts parts or several of these building elements and an unknown mass of diffuse ionized gas parametrized by its volume filling factor φ. A total sampleof N simulated lines of sight is then generated as follows.
Based on the results of Sect. 2 (Fig. 2), we consider six lengths of lines of sight homogeneously distributed in log space:l_{los} = 100, 200,400, 800, 1600, and 3200 pc. For each length, we generate a sample of lines of sight, where w_{l} are normalized weights deduced from the distribution of distances in the observed sample: w_{1} = 0.14, w_{2} = 0.21, w_{3} = 0.16, w_{4} = 0.20, w_{5} = 0.18, w_{6} = 0.11 (see Fig. 2). The column densities of H and H_{2} along each lines of sight are finally reconstructed by comparing the length occupied by the neutral medium (1 −φ)l_{los} (see Fig. 5) and the size of the box L. If (1 − φ)l_{los} = L, we draw a random line of sight in the simulation and extract the corresponding column densities. If (1 − φ)l_{los} < L, we draw a random line of sight and integrate the column density over a reduced distance of (1 − φ)l_{los}. If (1 − φ)l_{los} > L, we draw random lines of sight and add the respective individual column densities.
For the sake of simplicity, we assume here that any line of sight intercepts a constant fraction of diffuse ionized gas with φ =0.5 (Hill et al. 2018). Similarly, we note that, while spatially uncorrelated, the pieces of diffuse neutral gas used in the reconstruction algorithm correspond to random realizations obtained with a single simulation. Potential variations of the mean density, of the external radiation field, or of the turbulent forcing that naturally follow the Galactic structure depending on the position of the source (see Appendix A) are not taken into account. All these limitations are discussed in Sect. 5.
The outcome of the reconstruction algorithm is shown in Fig. 6 which displays the 2D PH of the total proton column density N_{H} and the column density of molecularhydrogen N(H_{2}) obtained with the standard simulation. Because of the flat distribution of distances in log space (see Fig. 2), the peaksof the reconstructed PH are found to be shifted toward both the large and the low values of N_{H} compared to those of the initialdistribution (top panel of Fig. 6). This naturally enhances the initial bimodality and many lines of sight are found to be either at low (~10^{−5}) or large (~10^{−1}) integrated molecular fractions. In addition, it induces an inclination toward large column densities; more than half of the lines of sight are found to have N_{H} > 10^{21} cm^{−2}. By virtue of the central limit theorem, the molecular fraction obtained in those lines of sight tends toward the mean H_{2} molecular fraction of the initial simulation with a dispersion that decreases as a function of N_{H}.
Building simulated lines of sight from the observed distribution of sizes allows us to perform a statistical comparison of both samples and to limit the impact of observational biases to lines of sight at large column density (region E, see Sect. 5). The combined 2D distributions of N_{H} and N(H_{2}) deduced from the simulations are the main focus of this paper and will be shown several times in the following sections. For obvious aesthetical reasons, and to simplify the descriptions of the figures, we will often refer to this representation as a kingfisher diagram.
Fig. 5 Schematic view of the reconstruction of individual lines of sight over a distance l_{los}. The medium between the observer and the source is assumed to be composed of hotand warm ionized material (light blue cubes) with a volume filling factor φ and of uncorrelated pieces of diffuse neutral gas of individual size L (simulated boxes) with a volume filling factor (1φ). 
Fig. 6 2D probability histogram of the total proton column density N_{H} and the column density of molecular hydrogen N(H_{2}) obtained with the standard simulation (see Table 2). Upper panel: original data where all lines of sight have a size L= 200 pc. Bottom panel: outcome of the reconstruction algorithm described in Sect. 3.8 that produces a sample of lines of sight ranging from 100 to 3200 pc. The color code indicates the fraction of lines of sight (in logarithmic scale) contained in each bin. Dotted lines are isocontours of the molecular fraction for , 10^{−6}, 10^{−4}, 10^{−2}, and 1. 
3.9 Interpretative modeling
The computation of column densities over various lines of sight and the subsequent kingfisher diagrams are the outcome of three main factors: the local conditions of the gas (n_{H}, G_{eff}, T, and the selfshielding) which control the local abundance of H_{2}; the probabilistic ordering of these local conditions along any random line of sight of size l_{los} ; and finally, the distribution of sizes l_{los} used for the reconstruction algorithm (see Sect. 3.8). In order to separate these effects, in particular the impact of local conditions from the probabilistic aspects, and propose a physical interpretation of the behaviors shown in this paper, we developed a semianalytical approach. The resulting model and the confrontations of its predictions to the results of numerical simulations are described in details in Appendix C. To keep the paper concise, we only summarize here its basic ingredients and our main deductions.
Following the works of VázquezSemadeni & García (2001), Bialy et al. (2017), we assume that any line of sight can be understood as a succession of density fluctuations. These fluctuations are supposed to be fully correlated over a distance called the “decorrelation scale”, and completely uncorrelated over larger distances. Because of the biphasic nature of the neutral gas, we adopt two different decorrelation scales: if and if . The limit separating the diffuse and dense components is chosen as the inflection point between the two lognormal distributions classically found in the PH of the gas density (see Fig. C.1). Analyzing the production of H_{2} and the integration of column densities in this framework leads to the following conclusions.
 1.
The comparisons with numerical simulations performed at four different scales and for fifteen different simulations show that the analytical model reproduces to an outstanding level the 1D PHs of the total column density N_{H} (see Fig. C.2), assuming that the diffuse gas is correlated over a scale (i.e., 20 pc for the fiducial simulation), and that the dense gas is correlated over a scale . Interestingly, the value of is similar to that obtained by Bialy & Burkhart (2020) in a set of isothermal MHD simulations with different driving scales and Mach numbers. This confirms that the WNM, which fills most of the volume, behaves like an isothermal gas perturbed by a sustained turbulent forcing.
 2.
Similar comparisons performed on the 2D PHs of N_{H} and N(H_{2}) (Figs. C.3 and C.4 and Appendix C.4) indicate that the dense gas is necessarily composed of a distribution of structures of different sizes. Indeed, while most of the mass and volume of the cold HI can be accurately modeled with a single scale , H_{2} is required to be built up in smaller components. should therefore be considered as the maximum decorrelation scale of the dense gas.
 3.
The local production of H_{2} mostly depends on the density: low density components are atomic while high density components are molecular. The threshold triggering the transition between the two regimes is set by the local selfshielding (induced by the component itself) and the largescale selfshielding (induced by the surrounding environment). The local selfshielding alone in a component of size implies (21)
The largescale selfshielding can be seen as a stochastic process that lowers this limit: for L = 200 pc, is found to be reduced by a factor of two on average.
 4.
As schematized in Fig. 7, the distribution of normalized column densities N_{H} ∕l_{los} and N(H_{2})∕l_{los} obtained for a given l_{los} can be divided in three categories. Lines of sight with low integrated molecular fraction () exclusively contain components with (case a). Incontrast, lines of sight with high integrated molecular fraction () necessarily intercept at least one large or several small components at high density (case c). In spite of what intuition dictates, lines of sight with intermediate integrated molecular fraction () do not result from components at moderate densities () but from the combination of low density material and a small number of small components at high density (case b).
 5.
The proportions of lines of sight of types (a), (b), and (c) (Fig. 7) are given by the volume filling factors of the diffuse and dense gas and the length of the lines of sight l_{los}. If l_{los} is comparable to and , the 2D PH of N_{H}∕l_{los} and N(H_{2})∕l_{los} is spread and contains a large number of lines of sight of type (a). Larger l_{los} naturally favor lines of sight of types (b) and (c). If l_{los} becomes large compared to to and , the central limit theorem applies. The spread PH described above is squeezed along the x and y axis as both N_{H} ∕l_{los} and N(H_{2})∕l_{los} progressively tend toward Gaussian distributions centered on the means (red star in Fig. 7).
Fig. 7 Schematic view of lines of sight of fixed length l_{los} inferred from the analytical model described in Appendix C, and corresponding contributions to the histogram of the normalized column densities N_{H} ∕l_{los} and N(H_{2})∕l_{los}. Bottom panels: the white stars correspond to all lines of sight while the black stars correspond to the specific cases illustrated above. Any line of sight is intercepting several components of constant density n_{H}. Diffuse components () have a size ; dense components () have a distribution of sizes . Only components with densities larger than are molecular (see main text). The red star in the bottom panels indicates the mean value of N_{H} ∕l_{los} and N(H_{2})∕l_{los} computed over a large sample of lines of sight (white stars), hence the expected mean molecular fraction. The dotted line indicates a fully molecular medium. 
4 Comparison with observations
4.1 Fiducial simulation
In Fig. 8, we compare the observational dataset to the 2D PH of N_{H} and N(H_{2}) (i.e., the kingfisher diagram) obtained with the reconstruction algorithm applied to the fiducial simulation for two resolutions, R = 64^{3} and R = 512^{3}. In each panel, the color code indicates, in logarithmic scale, the fraction of lines of sight predicted for any couple (N_{H}, N(H_{2})). Quantitative comparisons of the observed and predicted fractions of lines of sight, mean molecular fractions, and dispersions in the regions A, B, C, D, and E defined in Sect. 2 (see Table 1) are given in Fig. 9. Unexpectedly, the sample of lines of sight built from the fiducial simulation reproduces, to an outstanding level, the global trend of the HItoH_{2} transition and its statistical properties. Without taking into account any possible variation of the parameters along the lines of sight or from one line of sight to the next, the structures induced by the joint actions of turbulence and thermal instability alone are found to produce a wealth of lines of sight whose probabilities of occurrence match those derived from the observations.
In particular, the integrated molecular fraction is predicted to have a bimodal distribution with a transition occurring at N_{H}~ 3 × 10^{20} cm^{−2} and extending over one order of magnitude of total column density. More quantitatively, the fractions of lines of sight simulated and observed in regions A, B, C, and D are found to differ by 50% at the most (see Fig. 9). Similarly, the observed and simulated mean molecular fractions and their corresponding dispersions are found to be comparable and to differ by less than a factor of three in region B, and less than a factor of two in regions C, and D. Because of the distribution of background sources, the reconstructed ensemble predicts a strong increase in probability from region C to D, which contains a large fraction of the entire sample, and a decrease in the dispersion of as a function of N_{H}. At last, the probability of occurrence of lines of sight with N_{H} < 3 × 10^{19} and or N_{H} > 3 × 10^{21} and are rare to nonexistent. All these features are also found in the observational dataset.
Notwithstanding, Figs. 8 and 9 also reveal a few discrepancies. Firstly, about 16 observed lines of sight (out of 360) lay at the border of the simulated distribution, in regions where the predicted probability is ≤ 10^{−4} (or even smaller than10^{−5} for the white regions of Fig. 8), a value far smaller than the inverse number of observations ~ 3 × 10^{−3}. This impliesthat the simulation used here somehow fails to explain on its own part of the diversity observed in the Solar Neighborhood. Secondly, the mean molecular fractions predicted in regions A and B are found to be noticeably smaller than that derived from the observations, by about a factor of ten and three, respectively. It is important to note, however, that these molecular fractions deduced from the observations are probably overestimated as a third of the observed lines of sight contained in these regions correspond to upper limits on N(H_{2}). Finally, the simulated sample clearly shows that a substantial fraction of the lines of sight are in the region E, with an integrated probability of 4%. If the observational sample of 360 targets is unbiased, between 10 and 18 lines of sight should have been observed in this region, which is not the case. These discrepancy will be discussed in more details in Sect. 5.
Fig. 8 Comparison of the observational dataset (black points) to the 2D probability histogram reconstruction algorithm (see Sect. 3.8) applied to the fiducial simulation (colored histogram). Results are shown for two resolutions, R = 64^{3} (top panel) and R = 512^{3} (bottom panel). Observations include detections of H_{2} (circles) and upper limits on N(H_{2}) (arrows). The color code indicates the fraction of lines of sight (in logarithmic scale) contained in each bin. As a reminder, contours of the regions A, B, C, D, and E defined in Sect. 2 (see Table 1) are also displayed. 
Fig. 9 Top frame: comparisons of the observational dataset (black points) with the 2D probability histograms of N_{H} and N(H_{2}) computed from the reconstruction algorithm (see Sect. 3.8) applied to thesimulations. Bottom frame: fraction of lines of sight (%), and mean value μ and dispersion σ of the logarithm of the molecular fraction computed from the simulated histograms in the regions A, B, C, D, and E defined in Table 1. Numbers correspond to the values of %, μ, and σ. The color code indicates a measure of distance (in arbitrary units) between the observed and simulated values in order to guide the eye. These comparisons are shown in each frame for 15 different simulations with G_{0} varying from 0.5 (left panels) to 4 (right panels) and varying from0.5 cm^{−3} (top panels) to 4 cm^{−3} (bottom panels). All other parameters are set to their fiducial values (see Table 2). 
4.2 Impact of the resolution
The comparison of the two panels of Fig. 8 shows that the kingfisher diagram is independent of the resolution over about one order of magnitude of scales, from 64^{3} to 512^{3}. Even if not systematically shown, this unusual result is not limited to the fiducial setup but is a general feature of all the simulations explored in this work (see Fig. D.2 for instance). Our interpretation is based on the analytical model presented inAppendix C and summarized in the previous section.
Evidently, high resolution simulations are important to accurately describe small scale structures. In particular, large resolution are required for modeling the formation of dense and gravitationally bound environments and follow their collapse. The fact that the kingfisher diagrams are independent of resolution for R ≥ 64^{3} therefore suggests that the cold and dense structures between 0.4 and 3 pc have no influence on the distributions of N_{H} and N(H_{2}) for the fiducial simulations and are insignificant in the mass and the volume budgets of H and H_{2}. This is in line with the conclusions deduced from the analytical model. Indeed, as explained in the previous section, the turbulent forcing at large scale induces density fluctuations in the diffuse gas that extend over ~ 20 pc and a distribution of dense structures with sizes smaller than ~10 pc. While the total quantity of matter is inferred to be contained in the diffuse and the largest dense components, H_{2} is exclusively built up in dense components smaller than ~10 pc. The results of Fig. 8 combined with conclusions deduced from the analytical model therefore imply that the structures contributing the most to the mass and volume of H_{2} are above 3 pc and below 10 pc for the fiducial simulation.
Beside the physical insights on the typical scales participating to the buildup of column densities, this result also provides a strong justification for using simulations with moderate numerical resolution for the study of the HItoH_{2} transition. In all this work we therefore adopt a standard resolution of R = 256^{3} unless indicated otherwise.
4.3 Impact of G_{0} and
Figures 9 and 10 show comparisons between the observational dataset and the 2D PHs extracted from the simulations for 0.5 cm^{−3} cm^{−3}, 0.5 ≤ G_{0} ≤ 4, and two sets of values of the box size and the turbulent forcing, L = 200 pc and F = 9 × 10^{−4} kpc Myr^{−2} (Fig. 9) and L = 50 pc and F = 3.6 × 10^{−3} kpc Myr^{−2} (Fig. 10). While the trend and the statistics of the HItoH_{2} transition weakly depend on the resolution, they strongly depend on the total mass of the gas parametrized by and on the UV illumination factor. As G_{0} increases or decreases, (1) the fraction of lines of sight with large f(H_{2}) drops to the benefit of lines of sight with low f(H_{2}), (2) the transition is shifted toward larger total column density and its width increases, and (3) the molecular fraction globally decreases over all lines of sight while its dispersion increases. Interestingly, and because and G_{0} have opposite effects, the simulations that reproduce the most accurately the observed statistics of the HItoH_{2} transition follow a trend with . While similar, the effect of these two parameters are, however, not symmetrical. In particular, has an obvious and strong impact on the fraction of lines of sight in region E, regardless of G_{0}. Likewise, the fraction of lines of sight at low column densities (regions A and B) and the mean molecular fraction at large column densities (regions C and D) are not constant for a given ratio but respectively decrease and increase with . All these properties effectively break the degeneracies between the two parameters. All things considered, the tightest concordance between observed and simulated data is obtained for cm^{−3}, in agreement with the Galactic midplane density deduced from HI surveys in the Solar Neighborhood (Kalberla & Kerp 2009).
At first sight, the results described above seem obvious as they somehow mimic the dependencies of the HItoH_{2} transition found with detailed models of photodissociation regions (e.g., Krumholz et al. 2009; Sternberg et al. 2014). Such an interpretation is, however, a dangerous misconception. Indeed, while G_{0} is tightly linked to the effective radiation field G_{eff} that locally illuminates the gas, the mean density should not be mistaken for the local density. Moreover, the results displayed in Figs. 9 and 10cannot be compared to PDR models because they are statistical in nature. For instance, simulations at high G_{0} do not preclude the existence of clouds with high molecular fractions. Indeed, increasing G_{0} may lead to denser local environments with larger molecular fractions whose probability of occurrence along a line of sight is simply reduced. It implies that the results shown here are very different from PDR model predictions. They rather reflect the complex link between the global properties of the simulation (mass, illumination, driving scale) on the one side, and the local conditions and their probability distribution functions on the other side.
Because the local abundance of H_{2} is inversely proportional to G_{0}, increasing G_{0} naturally reduces the local selfshielding. Similarly, increasing G_{0} or decreasing reduce the largescale selfshielding. As a result, the density threshold required to produce highly molecular environments (see item 3. of Sect. 3.9) rises by a factor of three when either G_{0} is multiplied or is divided by a factor of eight. While significant, such an effect on the local conditions of the gas is, however, too shallow to fully explain the variations observed in Figs. 9 and 10.
Indeed, regardless of local conditions, G_{0} and have a major impact on the probabilistic reconstruction of lines of sight. As shown in Sect. 3.7, increasing G_{0} or decreasing strongly reduce the fractions of mass and volume occupied by the dense and cold gas. This not only reduces the molecular fraction averaged over the entire simulation (red star in Fig. 7) but also favors the occurrence of lines of sight with low or intermediate (cases (a) and (b) in Fig. 7). When combined with the distribution of sizes l_{los}, the HItoH_{2} transition is naturally shifted toward larger N_{H}, and the asymptotic molecular fraction at high N_{H} drops. Moreover, because the central limit theorem requires larger lines of sight to apply, the HItoH_{2} transition isnaturally wider and the dispersion of lines of sight at large molecular fraction increases. This final property is particularly well seen in the kingfisher diagram obtained for the simulation at G_{0} = 0.5 and cm^{−3} where most of the lines of sight follow an homothetic transformation of the mean normalized column densities N_{H} ∕l_{los} and N(H_{2})∕l_{los} (red star in Fig. 7).
4.4 Impact of the box size L
The impacts of the box size revealed by comparing Figs. 9 and 10 partly follow the results of the previous section. Reducing L by a factor of four drastically reduces the total amount of matter in the simulation, hence the absorption of the impinging UV radiation field and the largescale selfshielding. As shown in Sect. 3.7, the mean pressure of the gas rises while the mass and volume occupied by the CNM decreases. All the local and statistical effects described in the previous section therefore apply and modify the kingfisher diagrams accordingly. Changing L has, however, two additional and specific consequences.
Because L_{drive} is four times smaller in the simulations displayed in Fig. 10 than in those displayed in Fig. 9, the decorrelation scales and are correspondingly smaller. According to the interpretation given in Sect. 3.9, all the reconstructed lines of sight are therefore considerably larger than individual density fluctuations. This not only favors the occurrence of lines of sight at intermediate and high molecular fractions (cases (b) and (c) of Fig. 7) but also magnify the impact of the central limit theorem (see item 5. of Sect. 3.9), as already illustrated by Bialy et al. (2019). The kingfisher diagrams shown in Fig. 10 are thus squeezed along the x and y axis compared to those of Fig. 9. Consequently, the simulations at L = 50 pc predict almost no line of sight at low total column density (N_{H} ≤ 3 × 10^{19} cm^{−2}) and systematically underestimate the proportion of line of sight in region A compared to the observations.
Because it controls the total amount of matter, changing L has finally a major effect on gravitational forces. For L = 50 pc, the size of the largest dense clouds are almost always smaller than the Jean length computed in the CNM. Consequently, selfgravity plays almost no role in the physical state or the evolution of the gas for all simulations at L = 50 pc. The specific impact of gravity on the probability histograms of column densities is described in more details in the following section.
Fig. 10 Same as in Fig. 9 for simulations with a box size of L = 50 pc instead of200 pc. The turbulent forcing is adjusted as in Fig. 4 (F = 3.6 × 10^{−3} kpc Myr^{−2}) to obtain similar velocity dispersions for L = 50 pc and L = 200 pc. 
Fig. 11 KS distance between the simulations and the observational sample as a function of the mean density and the UV scaling factor G_{0} = 0.5 (red), 1 (green), 2 (orange), and 4 (blue). Results obtained with and without gravity are shown with solid lines and dashed lines, respectively. All other parameters are set to their standard values (see Table 2). Points correspond to reliable measurements of the KS distances. Triangles indicate lower limits corresponding to simulations where the upper error bar on R_{KS} tends toward infinity (see Appendix D). 
4.5 Impact of gravity
To facilitate the comparison between simulations and observations and avoid a pedestrian repetition of the kingfisher diagrams, we developed a modified version of the KolmogorovSmirnov test. This test, fully explained inAppendix D and validated over 30 different simulations, defines a value R_{KS}, called the KS distance, that measures how two 2D PDFs (or 2D PHs) differ from one another. In a nutshell, any observational datapoint ina 2D diagram can be used to divide the diagram into four different regions which each contain different fractions of observed and simulated data. The modified KS test simply searches for the observational datapoint and the region that maximize the ratio between the fractions of simulated and observed lines of sight. The distance R_{KS} is the absolute value of the logarithm of this ratio: a value R_{KS} = 1 therefore implies that some region in the 2D diagram contains tens times more or ten times less simulated data than required to explain the observations, and that all the other regions have smaller ratios.
The results of the KS test applied to 15 simulations run with and without gravity are shown in Fig. 11. As expected from Sect. 4.3, the KS distance strongly depends on both and G_{0}. In comparison, gravity appears to have a limited impact on the kingfisher diagrams. While including gravity seems to slightly reduce the KS distance, the trends as functions of and G_{0} and the set of simulations found to minimize R_{KS} remain unaltered. These results can be understood by simple statistical considerations.
The impact of gravity in multiphase simulations is to produce selfgravitating environments which appear as a high density tail in the PDF of the gas density. Because these selfgravitating clumps are dense and fully molecular, they often dominate the integrated column densities of both HI and H_{2} along any line of sight thatintercept them, and therefore favor lines of sight at high molecular fraction (case (c) in Fig. 7). This not only induces a tail at high N_{H} and N(H_{2}) in the kingfisher diagram but may also contribute to shift the HItoH_{2} transition tolower N_{H} as faint but highly molecular lines of sight starts to appear. The importance of these two effects depends on the area filling factor of selfgravitating clumps and on whether they carry or not a substantial fraction of the mass of the dense gas. For cm^{−3}, selfgravitating clumps occupy less than 0.001% of the entire volume and carry less than a percent of the mass of the gas. These fractions increase, however, as functions of : for cm^{−3}, selfgravitating environments occupy 0.01% of the volume and contain as much as 30% of the total mass. Therefore, while the impacts of gravity on the kingfisher diagram is negligible for most simulations, they become important at high : this property is effectively captured by the KS distances displayed in Fig. 11.
Fig. 12 KS distances between the simulations and the observational sample computed for five values of the acceleration parameter F = 4.5 × 10^{−5}, 1.5 × 10^{−4}, 4.5 × 10^{−4}, 1.5 × 10^{−3}, and 4.5 × 10^{−3} kpc Myr^{−2}, and three values of the compressive ratio ζ = 0.1 (blue points), 0.5 (orange points), and 0.9 (green points), which set the balance between compressive and solenoidal forcing (see Sect. 3.3). All other parameters are set to their standard values (see Table 2). Points correspond to reliable measurements of the KS distances. The triangle indicates a lower limit corresponding to a simulation where the upper error bar on R_{KS} tends toward infinity (see Appendix D). 
4.6 Impact of turbulent forcing
As done in the previous section, the impact of the turbulent forcing is discussed through the results of the KolmogorovSmirnov test. The KS distance obtained for various configurations of the turbulent forcing (Fig. 12) shows that the strength of turbulence affects differently the HItoH_{2} transition depending on its nature: highly compressive turbulent forcing produces virtually identical column density distributions over almost two decades of the turbulent acceleration parameter F; oppositely the strength of the forcing significantly modifies these distributions if a substantial fraction of the kinetic energy is injected in pure solenoidal modes. The tightest agreement with the observational sample is obtained for F ~ 1.5 × 10^{−4} kpc Myr^{−2} if ζ ≥ 0.5 and for all F ≥ 1.5 × 10^{−4} kpc Myr^{−2} if ζ = 0.1. According to Sect. 3.7 (Fig. 4), these values corresponds to a WNM velocity dispersion σ_{tur} ~ 2 km s^{−1} if ζ ≥ 0.5, and σ_{tur} ≥ 2 km s^{−1} for ζ = 0.1. In any case, a small amount of turbulence is always required.
All these characteristics are consequences of the fractions of volume and mass contained in the CNM structures and their size distribution. Without turbulence or with a weak turbulent forcing, CNM clouds are found to evolve toward largescale entities which evaporate slowly. Such a configuration favor lines of sight at high molecular fraction (type (c) in Fig. 7) and leads to an overestimation of the global amount of H_{2} in the local diffuse ISM. If the turbulent forcing increases, the CNM becomes progressively organized into a distribution of structures of different sizes down to the numerical resolution. Simultaneously, the fraction of mass located in the CNM phase diminishes to the benefit of the LNM (see Sect. 3.7). Both effects reduce the mean molecular fraction of the gas and favor the occurrence of lines of sight of type (b) (see Fig. 7), in better agreement with the observational sample. Larger turbulent forcing ultimately lead to an underestimation of the global amount of H_{2} in the local ISM. However, and as shown in Sect. 3.7, this last effect is much more pronounced for a pure solenoidal forcing which efficiently prevents the unstable gas to condensate back to cold and dense environments compared to a pure compressive forcing.
Here again, the probabilistic information contained in the kingfisher diagram proves to provide a valuable tool to analyze the nature of turbulence in the diffuse local ISM. Indeed, the velocity dispersion deduced from the comparison of observations and simulations at high solenoidal forcing is substantially smaller than the velocity dispersion observed at high Galactic latitude (see Sect. 3.7) and the velocity dispersion deduced from the observations of CO at a scale of 200 pc (Hennebelle & Falgarone 2012). The fact that the observed statistics of the HItoH_{2} transition is reproduced over a broader range of velocity dispersion for a compressive forcing suggests that the largescale turbulence of the diffuse local ISM is dominated by compressive modes. This picture is coherent with the results of Saury et al. (2014) who found that a largescale compressive forcing induces a distribution of thermal pressure in excellent agreement with that derived from the observations of the fine structure lines of CI (Jenkins & Tripp 2011). It is also coherent with the fact that 200 pc corresponds to the width of the Galactic disk seen in CO and therefore to the limit above which the equipartition between compressive and solenoidal modes switches from the values expected in a 3D fluid to those expected in a 2D medium, hence values of ζ smaller than 0.5. At last, it concurs with the conclusions of Iffrig & Hennebelle (2017) who found that compressible modes dominates at low altitudes, close to the equatorial plane, in simulations where the ISM turbulence is selfconsistently driven by supernovae explosions.
4.7 Impact of the magnetic field
The impact of the initial magnetic field on the kingfisher diagram simply reveals the competition between thermal instability which induces the production of dense environments, and the magnetic pressure which acts against this evolution. As shown in Fig. 13, the KS distance obtained for different magnetic field intensities is found to be constant until a critical value of B_{x} ~ 4 μG above which the predicted and observed distribution of N_{H} and N(H_{2}) significantly differ from one another. The initial homogeneous magnetic field adopted in the standard simulation, B_{x} = 3.8 μG, is just under this critical value and corresponds to the case where the energy density of the large scale coherent magnetic field and the thermal energy density of the WNM are equivalent. It follows that reproducing the observed 2D PH of N_{H} and N(H_{2}) requires a magnetic field intensity below or at equipartition.
Interestingly, the value of B_{x} adopted for the fiducial simulation leads, at steadystate, to a constant magnetic field intensity B ~ 5 μG for n_{H} < 10 cm^{−3}, and a field intensity that scales as for n_{H} > 10 cm^{−3}. Those values are comparable with the magnetic field intensities obtained in diffuse and molecular environments from Zeeman measurements (Crutcher et al. 2010) and those obtained in the most diffuse phases from Faraday rotation measurements toward extragalactic radio sources (e.g., Frick et al. 2001). It appears that these information on the interstellar magnetic field are encoded, at least partly, in the statistical properties of the HItoH_{2} transition.
Fig. 13 KS distances between the simulations and the observational sample computed for six values of the initial magnetic field B_{x}. All other parameters are set to their standard values (see Table 2). Points correspond to reliable measurements of the KS distances. The triangle indicates a lower limit corresponding to a simulation where the upper error bar on R_{KS} tends toward infinity (see Appendix D). 
5 Discussion
5.1 Observational biases
All the results presented in Sect. 4 are discussed under the assumption that the observational dataset is unbiased, meaning that the underlying lines of sight correspond to a random sample with no selection effect. This is not true. As Krumholz et al. (2008) already noticed, “the FUSE and the Copernicus lines of sight have been specifically chosen to probe a certain range of column densities with a selection biased against high extinction which makes determining column densities very costly or altogether impossible.” Indeed, very few stars emit a UV radiation field strong enough to be detected through large visual extinction material. The problem is not due to the increase in H_{2} absorption, which can be overcome by focusing on fainter bands, but to dust absorption itself and the complication of the structure of NaI often used as a proxy to derive the column density of HI (Rachford et al. 2002). Moreover, because such bias depends on the sensitivity of the instrument, it necessarily applies at different column densities for FUSE and Copernicus.
This selection effect obviously complicates the comparison between simulations and observations. As shown in Sect. 4.3 and Figs. 9 and 10, several simulations, including the fiducial setup, predict a significant fraction of lines of sight at high column densities (N_{H} ≥ 10^{22} cm^{−2}, region E), in apparent contradiction with the observations. However, the fact that the fraction of stars dismissed by selection is unknown makes it difficult to assess whether this result reveals an actual and important statistical discrepancy or a simple observational limit. It implies that the likelihood of a simulation to be representative of the local diffuse gas cannot be estimated from this criterion alone. It must involve other observational signatures such as the average and the dispersion of the molecular fraction in regions A, B, C, and D, or even the chemical and statistical signatures of other atomic and molecular lines. This latter aspect is currently under development and will be the subject of the next paper of this series.
The fact that the limitations differ between FUSE and Copernicus surveys finally raises the question of the validity of studying the two samples simultaneously. We find, however, that performing comparisons with simulations on the two samples separately gives very similar results and does not impact any of our conclusions. It is so because the observational bias discussed above occurs at an extinction which increases as the natural logarithm of the square root of the instrument sensitivity. The largest totalcolumn density probed by FUSE is thus only three times larger than that observed by Copernicus (Gillmon et al. 2006; Rachford et al. 2009). Moreover, the number of lines of sight observed by FUSE that are above the maximum extinction seen by Copernicus corresponds to a small fraction of the entire sample (Gillmon et al. 2006).
5.2 Variations of physical conditions in the local ISM
The reconstruction of the simulated sample of lines of sight and the subsequent comparisons with the observations are done assuming that the local ISM can be built out of a single simulation (see Sect. 3.8). This approach was chosen in order to highlight the natural variations induced by turbulence and thermal instability alone in a diffuse neutral gas with a known averaged density and UV illumination factor. However, because the medium probed by the observations extends in all directions around the sun over a maximum distance of 3 kpc (see Fig. 2), it stands to reason that potential variations of all parameters should be taken into account, not only from one line of sight to another but also along a single and outstretched line of sight. Indeed, such considerations would offer a natural explanation for observations whose existence is in contradiction with the corresponding simulated probability of occurrence (see Sect. 4.1).
The total proton mass surface density deduced from HI and CO all sky surveys appears to be rather constant in the Galactic layer located between 5 and 12 kpc from the Galactic center (Fig. 9 of MivilleDeschênes et al. 2017). Taking into account variations of the ISM scale height above the Galaxy (Kalberla & Kerp 2009), the midplane proton density is expected to vary by less than a factor of two over the corresponding volume.
Using the Galactic star distribution of Wainscoat et al. (1992) and the grains composition and size distribution of Weingartner & Draine (2001), Porter & Strong (2005) and Moskalenko et al. (2006) estimated the energy density of the radiation field across the Galaxy. Similarly to the midplane density, the mean UV radiation field is estimated to vary by a factor two over the volume considered in this paper (Fig. 2 of Porter & Strong 2005). Interestingly, this estimation is far smaller than the variations derived by Jenkins & Tripp (2011) (Fig. 8 in their paper) from the observation of the fine structure line of CI in the local gas. Such discrepancies could be explained by the fact that Jenkins & Tripp (2011) perform local measurements: the observed gas could thus be located close to or far from an irradiating star. Alternatively, we note that the results obtained by Jenkins & Tripp (2011) are derived from models at equilibrium which do not take into account the uncorrelated perturbations of pressure and density in a turbulent multiphase medium: this naturally favors large fluctuations of the UV radiation field.
The cosmic ray ionization rate inferred from submillimeter observations of several molecular ions, including OH^{+}, H_{2}O^{+}, ArH^{+}, and H, shows a wide distribution across the Galactic disk (Indriolo et al. 2015). A recent estimation performed by Neufeld & Wolfire (2017) suggests that this rate could vary by a factor of five in the gas located between 5 and 12 kpc from the Galactic center.
Finally, potential variations of the composition and the size distribution of grains in the local diffuse gas should also be considered. While the extinction curve is found to be surprisingly uniform in the Milky Way (Schlafly et al. 2016), the local distribution of grains could change, not only along the line of sight but between the atomic and ionized phase and the molecular clouds. This would modify the efficiency of the photoelectric effect and the equilibrium between the two stable states of the neutral ISM, and would also have an impact the H_{2} formation rate.
Interestingly, if uncorrelated, the expected variations of and G_{0} alone would help to explain the slight discrepancies described in Sect. 4.3 but it would also lead to a dispersion of lines of sight far larger than those observed (see Figs. C.2 and C.3). The fact that the predicted statistics of the HItoH_{2} transition is close to the observed sample therefore suggests that the variations of all the parameters described above cannot be considered independently but must follow strong correlations which apply locally (as discussed, for instance, by Bialy et al. 2019) but also across the Galactic disk.
5.3 Fraction of ionized gas
The fractionof volume φ occupied bythe ionized phases of the ISM, the warm ionized medium (WIM) and the hot ionized medium (HIM), plays an important role in our reconstruction algorithm. As illustrated in Fig. 5, this parameter controls the filling factor of the neutral medium along a line of sight of length l_{los}. Unfortunately, its value in the Milky Way is highly uncertain and still debated.
The consensus is that the volume of the HIM far exceeds that of the WIM and results from an interplay between supernovae explosions, which regularly produce hot gas in the Galactic disk, and buoyancy, which lifts this gas into the halo, releasing the pressure in the midplane. Early analytical studies neglecting buoyancy (McKee & Ostriker 1977) predicted a large fraction of HIM in the midplane (φ ~ 95%). In contrast, early numerical simulations, including the cycle of matter and energy between the disk and the halo, led to considerably smaller predictions with φ ~ 25% (de Avillez & Breitschwerdt 2004). This value is now considered as a lower limit by the most recent numerical simulations which reveal the importance of the driving mode of supernovae explosions (Walch et al. 2015) and of the photoelectric heating (Li et al. 2015; Hill et al. 2018) on the volume of the HIM. These latest studies estimate that 20% ≤ φ ≤ 90% in the Galacticmidplane.
Highly uncertain, the volume filling factor of the HIM can also vary from one line of sight to another (Fig. 1 of Hill et al. 2018). In this paper, we adopt a constant and conservative value φ =0.5 for every line of sight. Changing φ would have the effect of either squeeze or stretch the predicted 2D PHs displayed in Figs. 9 and 10 along the x axis, and to modify the balance of probabilities of occurrence of lines of sight at high and low molecular fraction. Taking into account a realistic distribution of φ would require to simulate the Galactic disk and halo over a scale of several kiloparsecs and to properly model and follow the impact of supernovae explosions. This is far beyond the scope of the present paper.
5.4 Galactic vertical structure
The simulated 2D PH of the HItoH_{2} transition are found to slightly depend on the Galactic gravitational potential. This odd result is nothing but an artifact of the physical scales considered here. Since the sizes of all simulations are below 200 pc, the gas expands, at the most, over 100 pc above the Galactic plane, a distance far smaller than the characteristic scale of variation of the thermal pressure expected for a turbulent gas in hydrostatic equilibrium. It follows that the column densities show no significant variation as a function of the position of the line of sight or its angle with the Galactic plane.
This setup, initially chosen to favor the physical resolution, is a strong shortcoming which prevents us from using and studying the information carried by the Galactic latitude of each observation. Indeed, the comparison between the FUSE halo survey and the data collected by FUSE and Copernicus in the Galactic disk indicates that the HItoH_{2} transition at high latitude occurs at a total hydrogen column density ~ 2 times smaller than that in the disk (Gillmon et al. 2006). Similarly to the previous section, studying this effect would requires to compute the local vertical structure of our Galaxy over several hundreds of parsecs, taking into account the hot and ionized component of the ISM.
5.5 Doppler broadening parameter
The selfshielding of H_{2} depends on the dispersion of the Lyman and Werner lines which is usually modeled with a turbulent Doppler broadening parameter b_{D} (Draine & Bertoldi 1996 and Eq. (16)). In single cloud models, this parameter has the effect of shifting the HItoH_{2} transition toward larger total column density without modifying any of the asymptotic states (e.g., Bialy et al. 2017). In this paper, we identify this parameter with the turbulent velocity dispersion of the gas at large scales and therefore adopt b_{D} = 8 km s^{−1} for the fiducial simulation. This is done to prevent an overestimation of the H_{2} selfshielding at large scales, at the cost of underestimating the selfshielding at the scale of a CNM clouds.
To estimate the effect of this parameter, all the grids presented in this paper were also run assuming b_{D} = 2 km s^{−1}, which roughly corresponds to the velocity dispersion expected at a scale of 10 pc for the fiducial simulation. While locally important, b_{D} is found to have a relatively small impact on the 2D PHs of N_{H} and N(H_{2}): increasing b_{D} by a factor of four slightly increases the width of the HItoH_{2} transition and the fraction of lines of sight in region B. We interpret this limited effect as a consequenceof the fact that the asymptotic molecular states of any cloud are independent of b_{D}.
Even so, it should be stressed that b_{D} has a strong impact on the local molecular fraction in transition regions. Therefore, and while inconsequential for the global statistics of H and H_{2}, the value of the Doppler broadening parameter might be paramount for any chemical species preferentially formed at the border of molecular clouds. As proposed by Bialy et al. (2019), the H_{2} selfshielding could be computed selfconsistently using the velocity and density fields of the simulation and a cost effective radiative transfer method. This would prevent the dilemma of favoring largescale or smallscale selfshielding.
5.6 H_{2} selfshielding at high temperature
The prescription of H_{2} selfshielding used in this paper (Eq. (16)) is taken from Draine & Bertoldi (1996). As discussed by WolcottGreen et al. (2011), such a prescription is reliable for diffuse gas at low temperature but becomes less and less reliable for high temperature environments (T > 500 K) where efficient collisional excitation of H_{2} in its rovibrational levels reduces the selfshielding. To estimate the impact of this process, we ran the fiducial simulation with the alternative selfshielding function proposed by WolcottGreen et al. (2011) (Eq. (12) in their paper). This prescription leads to a similar probability histogram and therefore does not influence the global analysis of the kingfisher diagram presented in this paper. However, we note that adopting this alternative prescription slightly increases the width of the HItoH_{2} transition, and induces more lines of sight at intermediate molecular fraction (region B, see Fig. 3), in better agreement with the observations.
6 Summary and conclusions
This paper presents an exhaustive parametric study of the transition from atomic to molecular hydrogen in the local diffuse ISM. Using stateoftheart MHD simulations, and an ensemble of 305 runs, we quantify separately the impacts of the mean density, the UV radiation field, the integral scale, the resolution, the turbulent forcing, the magnetic field, and the gravity on the molecular content of multiphase environments. The original feature of this work is to not only focus on the production of individual column densities but also on their statistics, meaning the probabilities of occurrence of these column densities along random lines of sight. For the first time, both the chemical and statistical information are used concomitantly, through the socalled kingfisher diagrams, to interpret the distribution of H and H_{2} observed toward 360 lines of sight across the local interstellar matter.
The results of the simulations are interpreted with a semianalytical model which attempts to separate the effects of local conditions from those induced by the probabilistic reconstruction of individual lines of sight. To compare the kingfisher diagrams to the observational sample, we propose a new version of the KolmogorovSmirnov test which can be generalized and used for thecomparison of two probability histograms or distribution functions in any dimension larger than one.
Taking into account the distance of each background source and simulating random lines of sight over the same distribution of distances is paramount to explain the range of observed column densities and their corresponding statistics. Once this aspect is included, the joint actions of thermal instability and largescale turbulence in the standard simulation are found to produce a wealth of lines of sight which reproduce the observed position and width of the HItoH_{2} transition, and whose probabilities of occurrence match those derived from the observations. The agreement is so remarkable that it is almost unnecessary to invoke variations of physical conditions along lines of sight or from one line of sight to another.
The minimal KS distance obtained over the entire grid is ~ 0.5. Such a value implies that there exists a small group of lines of sight in the observational sample whose probability of occurrence is under or overpredicted by about a factor of three. However, it also implies that the probability of occurrence of any other group of observed lines of sight, small or large, is reproduced to a better level.
The distribution of column densities computed from the simulations strongly depends on the Galactic midplane density parametrized by the mean density , the density of OB stars parametrized by the UV scaling factor G_{0}, and the scale of neutral diffuse clouds parametrized by the box size L. It is so because these three parameters not only regulate the mean pressure of the gas, hence the fractions of mass and volume occupied by the CNM and WNM, but also control the typical scale of density fluctuations in the WNM and the distribution of sizes of the CNM structures where H_{2} forms. The tightest concordance between the observed and simulated samples is obtained for a mean density cm^{−3} and a UV radiation field scaling factor G_{0} = 1 (in Habing units), in good agreement with the values deduced from HI and CO all sky surveys and from direct observations of the UV radiation field in the Solar Neighborhood. The range of observed column densities of H and H_{2} requires a box size L = 200 pc which corresponds to the estimated scale of HI superclouds.
Within this setup, the column densities of HI are inferred to be built up in largescale WNM and CNM structures correlated in densities over ~20 and ~ 10 pc, respectively. In contrast H_{2} is inferred to be built up at smaller scales. However, the fact that the kingfisher diagram is independent from the resolution of the simulation suggests that most of the mass and volume of H_{2} is contained in CNM structures between ~3 and ~ 10 pc. All these values are given for the standard simulation (L = 200 pc and L_{drive} ~ 100 pc) but naturally depend on the size of the box and the mean density of the gas.
In spite of the strong influences of , G_{0}, and L, the statistical properties of the HItoH_{2} transition are otherwise remarkably stable. Admittedly, the kingfisher diagram depends on the strength of the turbulence if most of the forcing is injected in solenoidal modes; however such a configuration prevents to reproduce the observational sample unless the largescale velocity dispersion of the gas is unrealistically small. In contrast, if most of the kinetic energy is injected in compressive modes, the kingfisher diagram is found to weakly depend on the strength of the forcing. Similarly, the HItoH_{2} transition is almost not affected by gravity and is found to weakly depend on the Doppler broadening parameter and the strength of the magnetic field, as long as B_{x} ≤ 4 μG. The 2D PH of the column densities of H and H_{2} is therefore a valuable tool to constrain the nature of the turbulent forcing at large scale; however, it provides few or no information regarding the velocity dispersion of the gas, the amount of gravitationally bound environments and the strength of the magnetic field. Other observational tracers are required.
All these results open new perspectives for the study of the chemical state of the ISM in which any observation must be understood through the combination of local physical conditions and the probabilistic ordering of these conditions along the line of sight. In particular, similar studies should be applied to all atomic and molecular species with observational samples large enough to conduct statistical analysis. It also invites to expend the study of PHs to higher dimensions, taking into account simultaneously the joint information contained in the column densities of several species. In this context, the generalization of the KolmogorovSmirnov test proposed in this paper will be very valuable. All these aspects are currently under development and will be the subjects of the following papers of this series.
Acknowledgements
We thank the referee for his careful reading and his valuable comments and suggestions. The research leading to these results has received fundings from the European Research Council, under the European Community’s Seventh framework Programme, through the Advanced Grant MIST (FP7/20172022, No 742719). We would also like to acknowledge the support from the Programme National “Physique et Chimie du Milieu Interstellaire” (PCMI) of CNRS/INSU with INC/INP cofunded by CEA and CNES. We would finally like to thank E. Falgarone and M. Gerin for the stimulating discussions we had and their precious comments regarding this work.
Appendix A Observations of HI and H_{2} in the local diffuse ISM
The complete observational dataset used in this work is given in Table A.1, including the sources identifiers, the coordinates, and the column densities of HI and H_{2}. The column densities of H_{2} are obtained through direct observations of its FUV absorption lines. The sample presented in Table A.1 therefore results from a combination of the Copernicus survey of nearby stars (e.g., Savage et al. 1977; Bohlin et al. 1983) and the FUSE surveys of the Galactic disk and the Galactic halo (e.g., Rachford et al. 2002, 2009; Lehner et al. 2003; Cartledge et al. 2004; Pan et al. 2004; Gillmon et al. 2006; Jensen & Snow 2007a,b). In contrast, and as reviewed by Fruscione et al. (1994), the HI column densities are derived from both direct methods, which include Lyα absorption studies, EUV observations of stellar spectra, and observations of the 21cm line, and indirect methods, which include curve of growth of neutral and singly ionized atoms and optical interstellar absorption lines of NaI which are both found to correlate with N(H) (de Boer et al. 1986; Ferlet et al. 1985). The column densities of HI given in Table A.1 therefore result from a combination of FUV and optical studies for nearby stars (e.g., Bohlin et al. 1978, 1983; Diplas & Savage 1994; Fitzpatrick & Massa 1990; Jensen & Snow 2007b,a) and radio studies of the 21cm line for extragalactic sources at high Galactic latitude (Wakker et al. 2003; Gillmon et al. 2006).
As reported by all these authors, the indirect methods relying on the observations of metals can be subjects to uncertainties mostly due to the assumptions made regarding the elemental abundances. In addition, the column densities of HI derived from the emissionprofiles of the 21cm line can also be highly uncertain because the measurements are done over a beam far larger than the pencilbeam sampled by H_{2} data but also because it requires to identify in the HI profiles the components associated to the molecular gas. It should thus be kept in mind that while the errors on the column densities of H_{2} are somehow limited, those on HI can be sometimes larger than a factor of five, in particular for the lines of sight at high Galactic latitude (Gillmon et al. 2006). The color excess E(B−V) given in Table A.1 finally results from a compilation which includes direct measurements of the star reddening compared to its intrinsic color (e.g., Savage et al. 1977; Fitzpatrick & Massa 1990; Diplas & Savage 1994; Rachford et al. 2002) and dust emission maps from the IRAS telescope^{4} (Schlegel et al. 1998).
With all these data at hand, we adopt the following methodology to derive the total proton column densities N_{H} : if the column densities of HI and H_{2} are available, then N_{H} is computed as N(H) + 2N(H_{2}); if not, N_{H} is derived from the reddening E(B−V) as N_{H} = 5.8 × 10^{21}E(B−V) cm^{−2} assuming a standard Galactic extinction curve and the average interstellar ratio R_{V} = A_{V}∕E(B−V) = 3.1 (Fitzpatrick & Massa 1986; Fitzpatrick 1999). It should be noted that the N_{H} ∕E(B−V) ratio observed at low column density is larger than the standard value used in this work (Liszt 2014; Lenz et al. 2017). Because of this and the uncertainties on the HI column densities discussed above, the values of N_{H} derived here should be considered as estimates. Examples of the underlying uncertainties on N_{H} can be seen in Table A.1 where N(H) sometimes exceeds slightly the total column density derived from E(B−V).
Observational dataset used in this work.
Appendix B Heating and cooling equations
The analytical equations of the heating and cooling rates used in this work are taken from Wolfire et al. (1995, 2003).
The electronic density is calculated from the ionization equilibrium (B.1)
where ζ_{CR} is the total ionization rate (including primary and secondary ionizations) of H by soft Xrays and cosmic ray particles (expressed in unit of 10^{−16} s^{−1}) and G_{eff} is the local effective radiation field in Habing units (see Sect. 3.2). ϕ_{PAH} is a recombination parameter of electrons on PAHs, discussed in Wolfire et al. (2003), and set to 0.5. x_{C+} is the abundance of C^{+} relative to n_{H}, x_{C+} = n(C^{+})∕n_{H}. Throughout this work, we adopt x_{C+} = 1.4 × 10^{−4}, which corresponds to the value derived in the Solar Neighborhood assuming 40% depletion of carbon onto grains and that the remaining carbon is singly ionized. This equation for the density of electrons differs from that of Wolfire et al. (2003) by the addition of C^{+} which is the most abundant ion in the diffuse and transluscent CNM (Snow & McCall 2006).
Following Wolfire et al. (2003), we include the heating induced by the photoelectric effect on grains and PAHs and by cosmic ray ionizations. The former is modeled with a rate (B.2)
where the heating efficiency (B.3)
The latter is modeled with a rate (B.5)
Fig. B.1 Thermal equilibrium curves () computed with RAMSES for A_{V} = 0 (red solid curve) and A_{V} = 1 (green solid curve) compared to those predicted by Wolfire et al. (2003) for ϕ_{PAH} = 0.5 (blackdashed curve) and 1.0 (blue dashed curve). 
Regarding the cooling, we include the finestructure emission of CII and OI, the emission of Lyman α photons by HI, and the recombination of electrons onto charged grains and PAHs. The cooling rate due to collisional excitation of the finestructure levels of C^{+} by atomic hydrogen and electrons is given by (B.6)
The cooling rate by collisional excitation of the finestructure level of OI by atomic hydrogen is computed as (B.7)
where x_{O} = n(O)∕n_{H} is the relative abundance of atomic oxygen. Throughout this paper, we adopt x_{O} = 3.2 × 10^{−4}, which corresponds to the value derived in the Solar Neighborhood assuming 37% depletion of oxygen onto grains and that the remaining oxygen is in its atomic form. Those two lines are the dominant cooling terms of the CNM phase. The cooling induced by the excitation of the Lyman α line, which is the dominant cooling at T ≳ 8000 K (WNM), is taken from Spitzer (1978) (B.8)
Finally, the cooling rate due to electron recombination onto charged grains and PAHs is set to (B.9)
with β = 0.74∕T^{0.068}. To validate the calculations of the heating and cooling terms, we compare in Fig. B.1 the thermal equilibrium curve obtained with RAMSES to the predictions of Wolfire et al. (2003).
Appendix C Analytical description of 1D and 2D probability histograms
In order to interpret the results found in Sect. 4, we propose a semianalytical prescription to predict the 1D and 2D PHs of the total column density and the column density of H_{2} obtained with numerical simulations. This prescription is based on the work of VázquezSemadeni & García (2001) and Bialy et al. (2017, 2019) who showed that lines of sight across isothermal simulations of turbulence can be accurately modeled as a series of random density fluctuations.
C.1 Decorrelation scales
In Fig. C.1, we display an example of the volumeweighted distribution of the proton density computed for the fiducial simulation (see Table 2). The total column density integrated along the x direction over a random line of sight of size l ≤ L is (C.1)
where R is the resolution of the box of size L (see Table 2), i_{0} is a random starting index for the integration of the column densities, and i_{l} is the final index deduced from l. Because of spatial correlations of the density n_{H}, this computation is not equivalent to the sum of random realizations of n_{H} drawn out of the 1D probability distribution (Fig. C.1). It depends, instead, on how and over which distance the values of n_{H} are correlated. As proposed by Bialy et al. (2017, 2019), and since the correlations of density in a turbulent medium decrease over long distances, we assume that these correlations can be modeled with a parameter y_{dec}, called the decorrelation scale. The density is supposed to be constant over distances smaller than y_{dec}, and uncorrelated over larger distances.
In this framework, if y_{dec} is the same for all densities, the total column density integrated over a distance l becomes equivalent to the sum of 1 + l∕y_{dec} random realizations of n_{H}. In isothermal simulations, the 1D probability distribution of the gas density is found to follow a lognormal distribution with a dispersion proportional to the Mach number, dependent on the nature of the turbulent driving, andindependent on the mean density of the gas (e.g. Padoan et al. 1997; Federrath et al. 2008). Because of this property, Bialy et al. (2017) were able to establish a relation between the dispersion of the distribution of the averaged densities and the dispersion of the distribution of the proton density: (C.2)
where L_{drive} is the turbulence driving scale. Fitting the ratio as a function of l∕L_{drive}, Bialy et al. (2017) estimated y_{dec} = 0.2 × L_{drive}.
Fig. C.1 Probability histogram of the proton density n_{H} extracted from the fiducial simulation (see Table 2). The black histogram correspond to the extracted data. The red dashed curve shows an example of the sum of two lognormal components and a powerlaw tail at high density, for comparison. The blue line indicates the inflection point of the PH between the diffuse and the dense components. 
Unfortunately, this method cannot be applied here. Indeed, oppositely to isothermal simulations and as illustrated in Fig. C.1, the PH of n_{H} derived from simulations of the multiphase ISM is usually described by the sum two lognormal distributions plus a powerlaw tail at high density that could be a signature of the CNM which is known to behave like a polytropic gas with an exponent γ < 1 (Passot & VázquezSemadeni 1998) or a signature of gravity (Federrath 2013; Girichidis et al. 2014). To overcome this issue, and for the sake of simplicity, we assume that a two phase medium is described by two decorrelation scales. All densities below a limit are supposed to belong to the diffuse lognormal component and to be correlated over a scale . Similarly, all densities above are supposed to belong to the dense lognormal component and to be correlated over a smaller scale . We identified here the limit between the two lognormal distributions with the inflection point of the 1D probability histogram of the gas density (blue line in Fig. C.1). Since the diffuse component behave like an isothermal gas at the temperature of the WNM, we assume that is constant for all simulations and adopt the value given by Bialy et al. (2017), , where L_{drive} = L∕2 is the main driving scale used for the turbulent forcing (see Sect. 3.3). In contrast, and because the diffuse component occupies most of the volume, we state that depends on the total mass of the gas or equivalently its mean density . To simplify, we propose that (C.3)
which means that the impact of changing the total mass of the gas is to change the typical volume of the dense structures by the same factor. Within this framework, the semianalytical model proposed here therefore depends on a single parameter: the decorrelation scale of the dense component for the fiducial simulation. In the following, we assume pc for cm^{−3} which implies a decorrelation length of 6.3, 7.9, and 12.6 pc for , 1, and 4 cm^{−3}, respectively.
It is quite optimistic to believe that the dense and cold component of the ISM can be modeled by a single decorrelation scale . This component is indeed likely to follow a distribution of sizes which decrease with the local pressure, hence the local density n_{H}. However, and as we show below, such an assumption allows us to highlight important features of the simulations. The essential impact of the distribution of sizes is proven and discussed in Appendix C.4.
Fig. C.2 Comparisons of the 1D probability histograms of the total column density normalized to the integration scale, N_{H} ∕l, extracted from the simulations (black histograms) and constructed with the semianalytical model described in the main text (green histograms) for l = 200 (top left), 100 (top right), 50 (bottom left), and 25 (bottom right) pc. Each of these four main panels displays the comparisons performed for 15 simulations with different and G_{0} around the fiducial setup defined by cm^{−3} and G_{0} = 1. All probability histograms inferred from the semianalytical model are obtained assuming a fixed correlation length of the diffuse component pc and a correlation length of the dense component pc (see main text). 
C.2 Comparison with simulations
The goal of the model is to offer an explanation on how the PHs of local densities translate into PHs of column densities in a simulation of the multiphase ISM. To test its validity, we generate a series of fictitious lines of sight of size l and compare the PHs to those obtained with an equivalent sample of lines of sight extracted from numerical simulations. For each line of sight, we draw a sequence of random realizations of n_{H} out of its known 1D PH using the rejection method. For each draw, the density is supposed to be constant over a distance if , and over a distance otherwise^{5}. The contribution of this piece of gas to the total column density is therefore computed as or . In contrast, the contribution of this piece to the column density of H_{2} is inferred from the expected density profile of H_{2} over a 1D slab of size or . Throughout the slab, the density of H_{2} is calculated at equilibrium taking into account local and largescale extinction and selfshielding as (C.4)
where and N^{loc}(H_{2}) are the local column densities of protons and H_{2} computed from the border of the slab. The mean values in this expression are calculated from six random realizations of the largescale column densities and N^{ext}(H_{2}) drawn out of the sample a lines of sight under construction. The construction of each line of sight ends when its size reaches the integration scale l. The entire sample of lines of sight is finally reconstructed until convergence of the largescale shielding processes described above.
The comparisons between the 1D and 2D PHs reconstructed from the semianalytical model and extracted from the simulations are shown in Figs. C.2 and C.3 for samples of lines of sight, 15 different simulations, and different values of the integration scale l, from L down to . Unexpectedly, setting the decorrelation length of the dense component to ~ 10 pc for the fiducial simulation ( cm^{−3}) leads to a remarkable agreement between all the fictitious and actuals PHs. Once this parameter is set, the model not only reproduces surprisingly well the shapes of the 1D PHs and of the HItoH_{2} transition, but also their global trends depending on and G_{0} and their deformations depending on the chosen integration scale l. This result is not straightforward and highly depends on the decorrelation lengths and . The agreement observed in Figs. C.2 and C.3 therefore suggests that the model somehow captures an essential property of the simulations, namely some characteristic lengths of the diffuse and dense components of a multiphase gas with a turbulence driven at a scale L_{drive}.
Fig. C.3 Comparisons of the 2D probability histograms of the total column density normalized to the integration scale, N_{H} ∕l, and the column density of H_{2} normalized to integration scale, N(H_{2})∕l extracted from the simulations and constructed with the semianalytical model described in the main text for l = 200 (top left), 100 (top right), 50 (bottom left), and 25 (bottom right) pc. The results of the simulations are indicated with contour plots with isoprobabilities of 10^{−4}, 10^{−3}, and 10^{−2}. The coloredhistograms correspond to the results obtained with the semianalytical model. As in Fig. C.2, each of the four main panels displays the comparisons performed for 15 simulations with different and G_{0} around the fiducial setup defined by cm^{−3} and G_{0} = 1. All probability histograms inferred from the semianalytical model are obtained assuming a fixed correlation length of the diffuse component pc and a correlation length of the dense component pc (see main text). 
C.3 Interpretation of the model
The statistics of the HItoH_{2} transition derived from the model result from a combination of effects. Locally, the fraction of H_{2} of a given slab depends on the density and the size of the slab and , which set the local selfshielding, and on the surrounding environment, which sets the largescale selfshielding. How these local properties contribute to the integrated quantities N_{H} and N(H_{2}) depends, in turn, on the sizes of the slabs and on the PH of the density which both control the reconstruction of the line of sight.
 1.
Since is fixed, the HItoH_{2} transition induced by the local selfshielding alone occurs in any slab with a density larger than (C.5)
This equation can be obtained from the expression of the column densities of HI envelopes (Sternberg et al. 2014, Eq. (40)) in the weak field limit. The largescale selfshielding not only increases the fraction of H_{2} in atomic slabs (i.e., for ) but also shifts toward lower values, two effects which depend on the size of the box L. For L = 200 pc, including the largescale selfshielding is found to reduce by a factor of two.
 2.
Lines of sight with very low molecular fraction necessarily result from the combination of slabs with . The occurrence of such events depends on the volume filling factor of the diffuse gas, hence on the 1D PH of low density material, and on the integration length l: as l increases, their likelihood decreases. Such a scenario occurs for a maximum normalized column density of (C.6)
For , such a high normalized column density of atomic gas is a likely event. As l increases, it becomes, however, unlikely to throw a line of sight composed of components with identical densities . Therefore, while the above limit is still valid, the maximum normalized column density of weakly molecular gas appears to decrease.
 3.
Lines of sight with large molecular fraction necessarily contain at least one slab with . Oppositely to the previous case, the occurrence of such events depends on the volume filling factor of the dense gas, hence on the 1D PH of large density material, and on the integration length l: as l increases, their likelihood increases. Such a scenario occurs for a minimum normalized column density (C.7)
 4.
These two limits for lines of sight with low and large (items 2. and 3.), set the width of the HItoH_{2} transition seen in column densities. As shown in Fig. C.3, this width is somehow smaller than that obtained from the numerical simulations. We will discuss this point in the next section.
 5.
At last, lines of sight with intermediate H_{2} fraction () mostly result from a combination of slabs of low and moderate densities (). Because such events are unlikely, the model predicts a small fraction of lines of sight at intermediate , in contradiction with results extracted from the simulations. This point will also be discussed further in the next section.
Fig. C.4 Same as Fig. C.3, assuming that pc. The exponent used here corresponds to a simple test to show the effect of a dense decorrelation scale varying with density. 
All these properties fully explain the behaviors of the analytical model observed in Fig. C.3. The transition density for the fiducial simulation is cm^{−3}. As expected, the corresponding lower and upper limits of N_{H}∕l required to activate the HItoH_{2} transition are in agreement with the limits found for the lowest integration scale l = 25 pc (bottom right panels of Fig. C.3). Increasing the integration length has three effects: (a) to squeeze the 2D PH along the x and y axis as both N_{H} and N(H_{2}) progressively tend toward Gaussian distributions centered on the means, (b) to shift the HItoH_{2} transition to lower N_{H}∕l according to the limits derived above, and (c) to increase the occurrence of lines of sight with large to the detriment of lines of sight with low as the probability of intercepting a slab with rises. The dependence of the distributions on G_{0} and are also straightforward. As G_{0} increases or decreases, the HItoH_{2} transition is shifted according to the dependence of on these parameters. The occurrences of high or low molecular fraction lines of sight simply depend on the volume filling factor of the dense gas, and are therefore a direct consequence of the 1D PH of the gas density. As G_{0} increases or decreases, the mean thermal pressure rises and the mass fraction of the CNM diminishes. This favors the occurrence of lines of sight with low molecular fraction.
C.4 Discrepancies and conclusions
Despite the surprising agreement between the simulated and the modeled PHs, in particular regarding the 1D PH of the total column density N_{H}, Figs. C.2 and C.3 also reveal important discrepancies. Most notably, and as mentioned above, the modeled 2D PHs systematically underestimate the widths of the HItoH_{2} transition and underestimate the proportion of lines of sight at intermediate integrated molecular fraction. These strong discrepancies are entirely due to the hypothesis of a constant decorrelation scale of the dense component.
Obviously, the dense and cold ISM are not characterized by a unique scale but a distribution of sizes which likely decrease with the gaspressure and local density. Such a distribution would increase the probability of occurrence of small components of high densityalong any line of sight (see footnote 5) and therefore solve the discrepancy between the analytical model and the simulations. Indeed, as schematized in Fig. 7, small and dense components surrounded by diffuse material favor the occurrence of lines of sight at intermediate molecular fractions. This configuration not only reduces the mean molecular fraction predicted by the model but also necessarily widens the HItoH_{2} transition, in closer agreement with the simulations. To illustrate this point, we display in Fig. C.4 the 2D PHs reconstructed from the semianalytical model assuming that is a power law function of the gas density.
In other words, the excellent agreement observed in Fig. C.2 indicates that setting constant decorrelation scales and is sufficient to reproduce the distribution of the total quantity of matter N_{H}. It is so because the chosen probably describes the largest sizes of the dense components which capture most of its mass and volume. However, the model also proves that choosing a single value of is inappropriate to accurately describe the distribution of H_{2}. It is so because the mass and volume of H_{2} in the simulation is likely built in smaller components. should therefore be interpreted as a maximum length scale of the dense gas.
All these considerations show that the simplistic model developed here is very useful to interpret the results of the simulations. It successfully separates local properties and probabilistic effects in the integration of column densities. Moreover it provides estimations of the decorrelation scale of the diffuse gas and the maximum decorrelation scale of the dense gas. Finally, its flaws clearly highlight the importance of a distribution of sizes of the dense and cold ISM and the necessary existence of small dense clouds which produce lines of sight with intermediate integrated molecular fraction. The findings of this section are synthesized in Sect. 3.9 and Fig. 7 and used in the rest of the paper as a major tool for interpreting the results of the simulations.
Appendix D KolmogorovSmirnov test
The results of this paper rely on the comparison of 2D probability histograms of observed and simulated data. To facilitate this comparison and the underlying parametric study, we apply here a modified version of the KolmogorovSmirnov (KS) test. This test, originally developed for the study of 1D samples, searches for the maximum cumulative difference between two distributions. Fasano & Franceschini (1987) generalized the KS test to 2D samples following Peacock (1983) idea of replacing the cumulative probability distribution, which is not well defined in a dimension larger than one, with the integrated probability in each of the four quadrants surrounding a datapoint. Such a consideration allows us to define a KS distance which measures how two 2D distribution functions differ from one another.
Fig. D.1 Results of the modified KS test applied to the fiducial simulation. The black points are the observational data, the red dots indicate the dataset used for the estimations of the merit function M, and the 2D histogram the simulated data. The violet star and rectangle indicate the observational point and the quadrant that maximize M (see main text). The fiducial simulation has a KS distance of 0.98. The corresponding quadrant contains 0.47 and 4.49% of the entiresimulated and observed datasets. 
Fig. D.2 KSdistance between the simulations and the observational sample as a function of the mean density , the UV scaling factor G_{0} = 0.5 (red), 1 (green), 2 (orange), and 4 (blue), and for a resolution of 256^{3} (solid lines) and 128^{3} (dashed lines). All other parameters are set to their standard values (see Table 2). Points correspond to reliable measurements of the KS distances. Triangles indicate lower limits corresponding to simulations where the upper error bar on R_{KS} tends toward infinity (see main text). 
As illustrated in Fig. D.1, each observational datapoint is identified by a pair of variables (, f^{obs}(H_{2})) which divide the space into four quadrants: (1) & , (2) & , (3) & , and (4) & . Each quadrant thus contains twofractions f_{obs} and f_{sim} of the entire observed and simulated datasets. To compare these values, we define a merit function (D.1)
and the KS distance between the two distributions as the maximum value of M computed over all quadrants of an observational dataset , (D.2)
This procedure, not only provides a measurement of the difference between the two distributions, but also the datapoint and the quadrant that maximize the merit function (see Fig. D.1). The interpretation is also straightforward. For instance, a KS distance of 1 implies that one of all the quadrants scanned contains 10 times fewer or 10 times more observations than simulated lines of sight, and that all the other quadrants have smaller distances.
The stability of the procedure depends on the errors made on the merit function and therefore on the absolute numbers of observed and simulated lines of sight contained in each quadrant. The observational dataset used to compute the KS distance (red points in Fig. D.1) is chosen as the subsample such that all quadrants scanned contain at least 10 observed lines of sight. With this assumption, the error on the merit function is calculated by taking into accountonly the statistical errors on the number of simulated lines of sight. For each quadrant, we assume that the “true” merit function ranges between
where S is the total number of simulated lines of sight. Because the errors are asymmetric, M_{min} can tend toward infinity. If so, the KS distance is a lower limit, even if the infinite error bar is obtained for another quadrant than the one that maximizes M. In short, for each observational datapoint, we compute M_{min}, M, M_{max}, and R_{KS}. If one of the M_{min} goes to infinity, R_{KS} is considered as a lower limit.
Because the division in quadrants is performed on a Cartesian grid, the procedure also depends on the choice of the axes. Mathematically, the best option would be to identify the principal components of the observational sample using proper orthogonal decomposition or singular value decomposition algorithms. Such a method could even be applied to subsamples in order to adaptively rotate the system of axes and follow more precisely the distribution of observations. In any case, the resulting system would be a linear combination of N_{H} and N(H_{2}) which could be difficult to relate to the underlying physical properties. Because the molecular fraction is bimodal as a function of N_{H}, we choose here N_{H} and as primary variables. This choice facilitates the physical interpretation of the KS test while ensuring some homogeneity of the spread of the observational data in all quadrants (see Fig. D.1).
As a proof of concept, we display in Fig. D.2 the results of the KS test applied to a grid of simulations obtained for different values of and G_{0} and two different resolutions. Despite its simplicity, this test appears to capture the main behaviors described in Sects. 4.2 and 4.3. In particular, the KS distance between the simulations and the observations is found to strongly depend on both and G_{0}, and more loosely on the resolution. Moreover, the simulations that minimize R_{KS} are found to be identical to those identified in Sect. 4.3 to be in closest agreement with the observations. Interestingly, the value of R_{KS} obtained for cm^{−3} and G_{0}= 4 is relatively small, in apparent contradiction with the conclusions of Sect. 4.3. This is due to the limit imposed on the minimum number of observations contained in each quadrant. Because of this limit, several observations at large N_{H} are not includedin the analysis which reduces the merit function at large column density (region E, see Table 1). Keeping in mind these border effects, the KS test turns out to be a valuable tool for estimating the distance between two distributions without performing a detailed comparison of the samples. In this paper, we apply this method to display our results in a synthetic manner (see Sects. 4.5–4.7) and only give additional details when necessary.
References
 Abgrall, H., Le Bourlot, J., Pineau des Forêts, G., et al. 1992, A&A, 253, 525 [NASA ADS] [Google Scholar]
 Ambartsumian, V. A. 1947, The evolution of stars and astrophysics [Google Scholar]
 Andersson, B. G., Wannier, P. G., & Crawford, I. A. 2002, MNRAS, 334, 327 [NASA ADS] [CrossRef] [Google Scholar]
 André, M. K., Oliveira, C. M., Howk, J. C., et al. 2003, ApJ, 591, 1000 [NASA ADS] [CrossRef] [Google Scholar]
 Audit, E., & Hennebelle, P. 2005, A&A, 433, 1 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Audit, E., & Hennebelle, P. 2010, A&A, 511, A76 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Berger, M. J., & Oliger, J. 1984, J. Comput. Phys., 53, 484 [NASA ADS] [CrossRef] [MathSciNet] [Google Scholar]
 Bialy, S., & Burkhart, B. 2020, ApJ, 894, L2 [CrossRef] [Google Scholar]
 Bialy, S., & Sternberg, A. 2019, ApJ, 881, 160 [NASA ADS] [CrossRef] [Google Scholar]
 Bialy, S., Burkhart, B., & Sternberg, A. 2017, ApJ, 843, 92 [NASA ADS] [CrossRef] [Google Scholar]
 Bialy, S., Neufeld, D., Wolfire, M., Sternberg, A., & Burkhart, B. 2019, ApJ, 885, 109 [CrossRef] [Google Scholar]
 Bigiel, F., Leroy, A., Walter, F., et al. 2008, AJ, 136, 2846 [NASA ADS] [CrossRef] [Google Scholar]
 Bigiel, F., Leroy, A., & Walter, F. 2011, IAU Symp., 270, 327 [Google Scholar]
 Bohlin, R. C., Savage, B. D., & Drake, J. F. 1978, ApJ, 224, 132 [NASA ADS] [CrossRef] [Google Scholar]
 Bohlin, R. C., Hill, J. K., Jenkins, E. B., et al. 1983, ApJS, 51, 277 [NASA ADS] [CrossRef] [Google Scholar]
 Bouy, H., & Alves, J. 2015, A&A, 584, A26 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Bowen, D. V., Jenkins, E. B., Tripp, T. M., et al. 2008, ApJS, 176, 59 [NASA ADS] [CrossRef] [Google Scholar]
 Bron, E., Le Bourlot, J., & Le Petit, F. 2014, A&A, 569, A100 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Browning, M. K., Tumlinson, J., & Shull, J. M. 2003, ApJ, 582, 810 [NASA ADS] [CrossRef] [Google Scholar]
 Burgh, E. B., France, K., & McCandliss, S. R. 2007, ApJ, 658, 446 [NASA ADS] [CrossRef] [Google Scholar]
 Burgh, E. B., France, K., & Jenkins, E. B. 2010, ApJ, 708, 334 [NASA ADS] [CrossRef] [Google Scholar]
 Cartledge, S. I. B., Meyer, D. M., & Lauroesch, J. T. 2003, ApJ, 597, 408 [NASA ADS] [CrossRef] [Google Scholar]
 Cartledge, S. I. B., Lauroesch, J. T., Meyer, D. M., & Sofia, U. J. 2004, ApJ, 613, 1037 [NASA ADS] [CrossRef] [Google Scholar]
 Cartledge, S. I. B., Clayton, G. C., Gordon, K. D., et al. 2005, ApJ, 630, 355 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Cartledge, S. I. B., Lauroesch, J. T., Meyer, D. M., Sofia, U. J., & Clayton, G. C. 2008, ApJ, 687, 1043 [NASA ADS] [CrossRef] [Google Scholar]
 Christensen, C., Quinn, T., Governato, F., et al. 2012, MNRAS, 425, 3058 [NASA ADS] [CrossRef] [Google Scholar]
 Clark, P. C., Glover, S. C. O., Ragan, S. E., & DuarteCabral, A. 2019, MNRAS, 486, 4622 [NASA ADS] [CrossRef] [Google Scholar]
 Crutcher, R. M., Wandelt, B., Heiles, C., Falgarone, E., & Troland, T. H. 2010, ApJ, 725, 466 [NASA ADS] [CrossRef] [Google Scholar]
 Dame, T. M., Hartmann, D., & Thaddeus, P. 2001, ApJ, 547, 792 [NASA ADS] [CrossRef] [Google Scholar]
 de Avillez, M. A., & Breitschwerdt, D. 2004, A&A, 425, 899 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 de Boer, K. S., Lenhart, H., van der Hucht, K. A., et al. 1986, A&A, 157, 119 [NASA ADS] [Google Scholar]
 Dickey, J. M., & Lockman, F. J. 1990, ARA&A, 28, 215 [NASA ADS] [CrossRef] [MathSciNet] [Google Scholar]
 Diemer, B., Stevens, A. R. H., Forbes, J. C., et al. 2018, ApJS, 238, 33 [CrossRef] [Google Scholar]
 Diplas, A., & Savage, B. D. 1994, ApJS, 93, 211 [NASA ADS] [CrossRef] [Google Scholar]
 Draine, B. T. 1978, ApJS, 36, 595 [NASA ADS] [CrossRef] [Google Scholar]
 Draine, B. T., & Bertoldi, F. 1996, ApJ, 468, 269 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Elmegreen, B. G., & Elmegreen, D. M. 1987, ApJ, 320, 182 [NASA ADS] [CrossRef] [Google Scholar]
 Fasano, G., & Franceschini, A. 1987, MNRAS, 225, 155 [NASA ADS] [CrossRef] [Google Scholar]
 Federman, S. R. 1982, ApJ, 257, 125 [NASA ADS] [CrossRef] [Google Scholar]
 Federman, S. R., Strom, C. J., Lambert, D. L., et al. 1994, ApJ, 424, 772 [NASA ADS] [CrossRef] [Google Scholar]
 Federrath, C. 2013, MNRAS, 436, 1245 [NASA ADS] [CrossRef] [Google Scholar]
 Federrath, C., Klessen, R. S., & Schmidt, W. 2008, ApJ, 688, L79 [NASA ADS] [CrossRef] [Google Scholar]
 Federrath, C., RomanDuval, J., Klessen, R. S., Schmidt, W., & Mac Low, M.M. 2010, A&A, 512, A81 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Ferlet, R., VidalMadjar, A., & Gry, C. 1985, ApJ, 298, 838 [NASA ADS] [CrossRef] [Google Scholar]
 Field, G. B. 1965, ApJ, 142, 531 [NASA ADS] [CrossRef] [Google Scholar]
 Fitzpatrick, E. L. 1999, PASP, 111, 63 [NASA ADS] [CrossRef] [Google Scholar]
 Fitzpatrick, E. L., & Massa, D. 1986, ApJ, 307, 286 [NASA ADS] [CrossRef] [Google Scholar]
 Fitzpatrick, E. L., & Massa, D. 1990, ApJS, 72, 163 [NASA ADS] [CrossRef] [Google Scholar]
 Frick, P., Stepanov, R., Shukurov, A., & Sokoloff, D. 2001, MNRAS, 325, 649 [NASA ADS] [CrossRef] [Google Scholar]
 Fromang, S., Hennebelle, P., & Teyssier, R. 2006, A&A, 457, 371 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Fruscione, A., Hawkins, I., Jelinsky, P., & Wiercigroch, A. 1994, ApJS, 94, 127 [NASA ADS] [CrossRef] [Google Scholar]
 Gaia Collaboration (Brown, A. G. A., et al.) 2018, A&A, 616, A1 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Garmany, C. D., & Stencel, R. E. 1992, A&AS, 94, 211 [NASA ADS] [Google Scholar]
 Gerin, M., Neufeld, D. A., & Goicoechea, J. R. 2016, ARA&A, 54, 181 [NASA ADS] [CrossRef] [Google Scholar]
 Gillmon, K., Shull, J. M., Tumlinson, J., & Danforth, C. 2006, ApJ, 636, 891 [NASA ADS] [CrossRef] [Google Scholar]
 Girichidis, P., Konstandin, L., Whitworth, A. P., & Klessen, R. S. 2014, ApJ, 781, 91 [NASA ADS] [CrossRef] [Google Scholar]
 Glover, S. C. O., Federrath, C., Mac Low, M.M., & Klessen, R. S. 2010, MNRAS, 404, 2 [NASA ADS] [Google Scholar]
 Gnacinski, P., & Krogulec, M. 2006, Acta Astron., 56, 373 [NASA ADS] [Google Scholar]
 Gnedin, N. Y., Tassis, K., & Kravtsov, A. V. 2009, ApJ, 697, 55 [NASA ADS] [CrossRef] [Google Scholar]
 Goldsmith, P. F., Li, D., & Krčo, M. 2007, ApJ, 654, 273 [NASA ADS] [CrossRef] [Google Scholar]
 Goldsmith, P. F., Velusamy, T., Li, D., & Langer, W. 2009, ASP Conf. Ser., 417, 177 [Google Scholar]
 Gong, M., Ostriker, E. C., & Wolfire, M. G. 2017, ApJ, 843, 38 [CrossRef] [Google Scholar]
 Gong, M., Ostriker, E. C., & Kim, C.G. 2018, ApJ, 858, 16 [NASA ADS] [CrossRef] [Google Scholar]
 Gudennavar, S. B., Bubbly, S. G., Preethi, K., & Murthy, J. 2012, ApJS, 199, 8 [NASA ADS] [CrossRef] [Google Scholar]
 Habing, H. J. 1968, Bull. Astron. Inst. Netherlands, 19, 421 [Google Scholar]
 Haud, U., & Kalberla, P. M. W. 2007, A&A, 466, 555 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Heiles, C., & Troland, T. H. 2003a, ApJ, 586, 1067 [NASA ADS] [CrossRef] [Google Scholar]
 Heiles, C., & Troland, T. H. 2003b, VizieR Online Data Catalog: J/ApJS/145/329 [Google Scholar]
 Hennebelle, P., & Falgarone, E. 2012, A&ARv, 20, 55 [NASA ADS] [CrossRef] [Google Scholar]
 Hennebelle, P., & Iffrig, O. 2014, A&A, 570, A81 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Hennebelle, P., & Pérault, M. 1999, A&A, 351, 309 [NASA ADS] [Google Scholar]
 Hennebelle, P., Banerjee, R., VázquezSemadeni, E., Klessen, R. S., & Audit, E. 2008, A&A, 486, L43 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Henshaw, J. D., Ginsburg, A., Haworth, T. J., et al. 2019, MNRAS, 485, 2457 [NASA ADS] [CrossRef] [Google Scholar]
 Hill, A. S., Mac Low, M.M., Gatto, A., & IbáñezMejía, J. C. 2018, ApJ, 862, 55 [CrossRef] [Google Scholar]
 Hobbs, L. M. 1978, ApJS, 38, 129 [NASA ADS] [CrossRef] [Google Scholar]
 Hu, C.Y., Naab, T., Walch, S., Glover, S. C. O., & Clark, P. C. 2016, MNRAS, 458, 3528 [NASA ADS] [CrossRef] [Google Scholar]
 Iffrig, O., & Hennebelle, P. 2017, A&A, 604, A70 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Indriolo, N., Neufeld, D. A., Gerin, M., et al. 2015, ApJ, 800, 40 [NASA ADS] [CrossRef] [Google Scholar]
 Inoue, T., Inutsuka, S.i., & Koyama, H. 2006, ApJ, 652, 1331 [NASA ADS] [CrossRef] [Google Scholar]
 Iwasaki, K., & Inutsuka, S.i. 2014, ApJ, 784, 115 [CrossRef] [Google Scholar]
 Jenkins, E. B., & Tripp, T. M. 2011, ApJ, 734, 65 [NASA ADS] [CrossRef] [Google Scholar]
 Jenkins, E. B., Savage, B. D., & Spitzer, L., J. 1986, ApJ, 301, 355 [NASA ADS] [CrossRef] [Google Scholar]
 Jensen, A. G., & Snow, T. P. 2007a, ApJ, 669, 378 [NASA ADS] [CrossRef] [Google Scholar]
 Jensen, A. G., & Snow, T. P. 2007b, ApJ, 669, 401 [NASA ADS] [CrossRef] [Google Scholar]
 Joung, M. K. R., & Mac Low, M.M. 2006, ApJ, 653, 1266 [NASA ADS] [CrossRef] [Google Scholar]
 Kalberla, P. M. W., & Kerp, J. 2009, ARA&A, 47, 27 [NASA ADS] [CrossRef] [Google Scholar]
 Kalberla, P. M. W., Burton, W. B., Hartmann, D., et al. 2005, A&A, 440, 775 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Kim, J.G., & Kim, W.T. 2013, ApJ, 779, 48 [CrossRef] [Google Scholar]
 Koyama, H., & Inutsuka, S.i. 2002a, ApJ, 564, L97 [NASA ADS] [CrossRef] [Google Scholar]
 Koyama, H., & Inutsuka, S. 2002b, 8th AsianPacific Regional Meeting, Volume II, eds. S. Ikeuchi, J. Hearnshaw, & T. Hanawa, 159 [Google Scholar]
 Krumholz, M. R., McKee, C. F., & Tumlinson, J. 2008, ApJ, 689, 865 [NASA ADS] [CrossRef] [Google Scholar]
 Krumholz, M. R., McKee, C. F., & Tumlinson, J. 2009, ApJ, 693, 216 [NASA ADS] [CrossRef] [Google Scholar]
 Kuijken, K., & Gilmore, G. 1989, MNRAS, 239, 571 [NASA ADS] [CrossRef] [Google Scholar]
 Körtgen, B., Federrath, C., & Banerjee, R. 2019, MNRAS, 482, 5233 [NASA ADS] [CrossRef] [Google Scholar]
 Larson, R. B. 1981, MNRAS, 194, 809 [CrossRef] [Google Scholar]
 Le Bourlot, J., Le Petit, F., Pinto, C., Roueff, E., & Roy, F. 2012, A&A, 541, A76 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Le Coupanec, P., Rouan, D., Moutou, C., & Léger, A. 1999, A&A, 347, 669 [NASA ADS] [Google Scholar]
 Lee, H.H., Herbst, E., Pineau des Forêts, G., Roueff, E., & Le Bourlot, J. 1996, A&A, 311, 690 [NASA ADS] [Google Scholar]
 Lehner, N., Jenkins, E. B., Gry, C., et al. 2003, ApJ, 595, 858 [NASA ADS] [CrossRef] [Google Scholar]
 Lenz, D., Hensley, B. S., & Doré, O. 2017, ApJ, 846, 38 [NASA ADS] [CrossRef] [Google Scholar]
 Leroy, A., Bolatto, A., Stanimirovic, S., et al. 2007, ApJ, 658, 1027 [NASA ADS] [CrossRef] [Google Scholar]
 Leroy, A. K., Walter, F., Sandstrom, K., et al. 2013, AJ, 146, 19 [NASA ADS] [CrossRef] [Google Scholar]
 Lesaffre, P., Gerin, M., & Hennebelle, P. 2007, A&A, 469, 949 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Lesaffre, P., Pineau des Forêts, G., Godard, B., et al. 2013, A&A, 550, A106 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Lesaffre, P., Todorov, P., Levrier, F., et al. 2020, MNRAS, 495, 816 [CrossRef] [Google Scholar]
 Li, M., Ostriker, J. P., Cen, R., Bryan, G. L., & Naab, T. 2015, ApJ, 814, 4 [NASA ADS] [CrossRef] [Google Scholar]
 Linsky, J. L., Redfield, S., Wood, B. E., & Piskunov, N. 2000, ApJ, 528, 756 [NASA ADS] [CrossRef] [Google Scholar]
 Liszt, H. 2014, ApJ, 783, 17 [NASA ADS] [CrossRef] [Google Scholar]
 Lupi, A., Volonteri, M., & Silk, J. 2017, MNRAS, 470, 1673 [CrossRef] [Google Scholar]
 MaízApellániz, J. 2001, AJ, 121, 2737 [Google Scholar]
 Marchal, A., MivilleDeschênes, M.A., Orieux, F., et al. 2019, A&A, 626, A101 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Mathis, J. S., Mezger, P. G., & Panagia, N. 1983, A&A, 500, 259 [NASA ADS] [Google Scholar]
 McKee, C. F., & Cowie, L. L. 1977, ApJ, 215, 213 [NASA ADS] [CrossRef] [Google Scholar]
 McKee, C. F., & Krumholz, M. R. 2010, ApJ, 709, 308 [NASA ADS] [CrossRef] [Google Scholar]
 McKee, C. F., & Ostriker, J. P. 1977, ApJ, 218, 148 [NASA ADS] [CrossRef] [Google Scholar]
 Micic, M., Glover, S. C. O., Federrath, C., & Klessen, R. S. 2012, MNRAS, 421, 2531 [NASA ADS] [CrossRef] [Google Scholar]
 MivilleDeschênes, M. A., & Martin, P. G. 2007, A&A, 469, 189 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 MivilleDeschênes, M.A., Murray, N., & Lee, E. J. 2017, ApJ, 834, 57 [NASA ADS] [CrossRef] [Google Scholar]
 Moskalenko, I. V., Porter, T. A., & Strong, A. W. 2006, ApJ, 640, L155 [NASA ADS] [CrossRef] [Google Scholar]
 Murray, C. E., Stanimirović, S., Goss, W. M., et al. 2015, ApJ, 804, 89 [NASA ADS] [CrossRef] [Google Scholar]
 Murray, C. E., Stanimirović, S., Goss, W. M., et al. 2018, ApJS, 238, 14 [NASA ADS] [CrossRef] [Google Scholar]
 Nagashima, M., Koyama, H., & Inutsuka, S.i. 2005, MNRAS, 361, L25 [CrossRef] [Google Scholar]
 Nakanishi, H., & Sofue, Y. 2016, PASJ, 68, 5 [NASA ADS] [CrossRef] [Google Scholar]
 Neckel, T., & Klare, G. 1980, A&AS, 42, 251 [Google Scholar]
 Neufeld, D. A., & Wolfire, M. G. 2017, ApJ, 845, 163 [Google Scholar]
 Nickerson, S., Teyssier, R., & Rosdahl, J. 2018, MNRAS, 479, 3206 [CrossRef] [Google Scholar]
 Noterdaeme, P., Petitjean, P., Srianand, R., Ledoux, C., & Le Petit, F. 2007, A&A, 469, 425 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Oegerle, W. R., Jenkins, E. B., Shelton, R. L., Bowen, D. V., & Chayer, P. 2005, ApJ, 622, 377 [NASA ADS] [CrossRef] [Google Scholar]
 Ostriker, E. C., McKee, C. F., & Leroy, A. K. 2010, ApJ, 721, 975 [NASA ADS] [CrossRef] [Google Scholar]
 Padoan, P., Nordlund, A., & Jones, B. J. T. 1997, MNRAS, 288, 145 [NASA ADS] [CrossRef] [Google Scholar]
 Padoan, P., Pan, L., Haugbølle, T., & Nordlund, Å. 2016, ApJ, 822, 11 [NASA ADS] [CrossRef] [Google Scholar]
 Palazzi, E., Mandolesi, N., & Crane, P. 1992, ApJ, 398, 53 [NASA ADS] [CrossRef] [Google Scholar]
 Pan, K., Federman, S. R., Cunha, K., Smith, V. V., & Welty, D. E. 2004, ApJS, 151, 313 [NASA ADS] [CrossRef] [Google Scholar]
 Pan, K., Federman, S. R., Sheffer, Y., & Andersson, B. G. 2005, ApJ, 633, 986 [NASA ADS] [CrossRef] [Google Scholar]
 Passot, T., & VázquezSemadeni, E. 1998, Phys. Rev. E, 58, 4501 [NASA ADS] [CrossRef] [Google Scholar]
 Peacock, J. A. 1983, MNRAS, 202, 615 [NASA ADS] [Google Scholar]
 Perryman, M. A. C., Lindegren, L., Kovalevsky, J., et al. 1997, A&A, 500, 501 [NASA ADS] [Google Scholar]
 Piontek, R. A., & Ostriker, E. C. 2005, ApJ, 629, 849 [NASA ADS] [CrossRef] [Google Scholar]
 Porter, T. A., & Strong, A. W. 2005, ICRC, 4, 77 [Google Scholar]
 Rachford, B. L., Snow, T. P., Tumlinson, J., et al. 2002, ApJ, 577, 221 [NASA ADS] [CrossRef] [Google Scholar]
 Rachford, B. L., Snow, T. P., Destree, J. D., et al. 2009, ApJS, 180, 125 [NASA ADS] [CrossRef] [Google Scholar]
 Richings, A. J., & Schaye, J. 2016a, MNRAS, 460, 2297 [CrossRef] [Google Scholar]
 Richings, A. J., & Schaye, J. 2016b, MNRAS, 458, 270 [NASA ADS] [CrossRef] [Google Scholar]
 Ritchey, A. M., Martinez, M., Pan, K., Federman, S. R., & Lambert, D. L. 2006, ApJ, 649, 788 [NASA ADS] [CrossRef] [Google Scholar]
 Roth, K. C., & Blades, J. C. 1995, ApJ, 445, L95 [NASA ADS] [CrossRef] [Google Scholar]
 Ryu, K. S., Dixon, W. V. D., Hurwitz, M., et al. 2000, ApJ, 529, 251 [CrossRef] [Google Scholar]
 Saury, E., MivilleDeschênes, M.A., Hennebelle, P., Audit, E., & Schmidt, W. 2014, A&A, 567, A16 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Savage, B. D., Bohlin, R. C., Drake, J. F., & Budich, W. 1977, ApJ, 216, 291 [NASA ADS] [CrossRef] [Google Scholar]
 Savage, B. D., Massa, D., Meade, M., & Wesselius, P. R. 1985, ApJS, 59, 397 [NASA ADS] [CrossRef] [Google Scholar]
 Savage, B. D., Meade, M. R., & Sembach, K. R. 2001, ApJS, 136, 631 [NASA ADS] [CrossRef] [Google Scholar]
 Schlafly, E. F., Meisner, A. M., Stutz, A. M., et al. 2016, ApJ, 821, 78 [NASA ADS] [CrossRef] [Google Scholar]
 Schlegel, D. J., Finkbeiner, D. P., & Davis, M. 1998, ApJ, 500, 525 [NASA ADS] [CrossRef] [Google Scholar]
 Schmidt, W., Federrath, C., Hupp, M., Kern, S., & Niemeyer, J. C. 2009, A&A, 494, 127 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Schruba, A., Leroy, A. K., Walter, F., et al. 2011, AJ, 142, 37 [NASA ADS] [CrossRef] [Google Scholar]
 Seifried, D., Schmidt, W., & Niemeyer, J. C. 2011, A&A, 526, A14 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Seifried, D., Walch, S., Girichidis, P., et al. 2017, MNRAS, 472, 4797 [NASA ADS] [CrossRef] [Google Scholar]
 Sheffer, Y., Rogers, M., Federman, S. R., Lambert, D. L., & Gredel, R. 2007, ApJ, 667, 1002 [NASA ADS] [CrossRef] [Google Scholar]
 Sheffer, Y., Rogers, M., Federman, S. R., et al. 2008, ApJ, 687, 1075 [NASA ADS] [CrossRef] [Google Scholar]
 Shull, J. M., & van Steenberg, M. E. 1985, ApJ, 294, 599 [NASA ADS] [CrossRef] [Google Scholar]
 Smith, R. J., Glover, S. C. O., Clark, P. C., Klessen, R. S., & Springel, V. 2014, MNRAS, 441, 1628 [NASA ADS] [CrossRef] [Google Scholar]
 Snow, T. P., & McCall, B. J. 2006, ARA&A, 44, 367 [NASA ADS] [CrossRef] [Google Scholar]
 Snow, T. P., Destree, J. D., & Jensen, A. G. 2007, ApJ, 655, 285 [NASA ADS] [CrossRef] [Google Scholar]
 Snow, T. P., Ross, T. L., Destree, J. D., et al. 2008, ApJ, 688, 1124 [NASA ADS] [CrossRef] [Google Scholar]
 Spitzer, L. 1978, Physical Processes in the Interstellar Medium (Weinheim: WileyVCH) [Google Scholar]
 Sternberg, A. 1988, ApJ, 332, 400 [NASA ADS] [CrossRef] [Google Scholar]
 Sternberg, A., Le Petit, F., Roueff, E., & Le Bourlot, J. 2014, ApJ, 790, 10 [NASA ADS] [CrossRef] [MathSciNet] [Google Scholar]
 Stone, J. M., & Zweibel, E. G. 2009, ApJ, 696, 233 [CrossRef] [Google Scholar]
 Tabone, B. 2018, PhD thesis, Université de recherche Paris Sciences et Lettres, France [Google Scholar]
 Teyssier, R. 2002, A&A, 385, 337 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Thompson, R., Nagamine, K., Jaacks, J., & Choi, J.H. 2014, ApJ, 780, 145 [CrossRef] [Google Scholar]
 Tumlinson, J., Shull, J. M., Rachford, B. L., et al. 2002, ApJ, 566, 857 [NASA ADS] [CrossRef] [Google Scholar]
 Valdivia, V., & Hennebelle, P. 2014, A&A, 571, A46 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Valdivia, V., Hennebelle, P., Gerin, M., & Lesaffre, P. 2015, EAS Publ. Ser., 75, 393 [CrossRef] [Google Scholar]
 Valdivia, V., Hennebelle, P., Gérin, M., & Lesaffre, P. 2016, A&A, 587, A76 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Valdivia, V., Godard, B., Hennebelle, P., et al. 2017, A&A, 600, A114 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 van Dishoeck, E. F., & Black, J. H. 1986, ApJS, 62, 109 [NASA ADS] [CrossRef] [Google Scholar]
 van Dishoeck, E. F., & Black, J. H. 1989, ApJ, 340, 273 [NASA ADS] [CrossRef] [Google Scholar]
 van Steenberg, M. E., & Shull, J. M. 1988, ApJS, 67, 225 [NASA ADS] [CrossRef] [Google Scholar]
 VázquezSemadeni, E., & García, N. 2001, ApJ, 557, 727 [NASA ADS] [CrossRef] [Google Scholar]
 VázquezSemadeni, E., Gómez, G. C., Jappsen, A. K., et al. 2007, ApJ, 657, 870 [NASA ADS] [CrossRef] [Google Scholar]
 Wainscoat, R. J., Cohen, M., Volk, K., Walker, H. J., & Schwartz, D. E. 1992, ApJS, 83, 111 [NASA ADS] [CrossRef] [Google Scholar]
 Wakker, B. P., Savage, B. D., Sembach, K. R., et al. 2003, ApJS, 146, 1 [NASA ADS] [CrossRef] [Google Scholar]
 Walch, S., Wünsch, R., Burkert, A., Glover, S., & Whitworth, A. 2011, ApJ, 733, 47 [NASA ADS] [CrossRef] [Google Scholar]
 Walch, S., Girichidis, P., Naab, T., et al. 2015, MNRAS, 454, 238 [NASA ADS] [CrossRef] [Google Scholar]
 Weingartner, J. C., & Draine, B. T. 2001, ApJ, 548, 296 [NASA ADS] [CrossRef] [Google Scholar]
 Welsh, B. Y., Vedder, P. W., & Vallerga, J. V. 1990, ApJ, 358, 473 [NASA ADS] [CrossRef] [Google Scholar]
 Welsh, B. Y., Vedder, P. W., Vallerga, J. V., & Craig, N. 1991, ApJ, 381, 462 [NASA ADS] [CrossRef] [Google Scholar]
 Welty, D. E., & Crowther, P. A. 2010, MNRAS, 404, 1321 [NASA ADS] [Google Scholar]
 WolcottGreen, J., Haiman, Z., & Bryan, G. L. 2011, MNRAS, 418, 838 [NASA ADS] [CrossRef] [Google Scholar]
 Wolff, B., Koester, D., & Lallement, R. 1999, A&A, 346, 969 [NASA ADS] [Google Scholar]
 Wolfire, M. G., Hollenbach, D., McKee, C. F., Tielens, A. G. G. M., & Bakes, E. L. O. 1995, ApJ, 443, 152 [Google Scholar]
 Wolfire, M. G., McKee, C. F., Hollenbach, D., & Tielens, A. G. G. M. 2003, ApJ, 587, 278 [NASA ADS] [CrossRef] [Google Scholar]
 York, D. G. 1976, ApJ, 204, 750 [CrossRef] [Google Scholar]
 Zari, E., Hashemi, H., Brown, A. G. A., Jardine, K., & de Zeeuw, P. T. 2018, A&A, 620, A172 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
In this work we use the relation between N_{H} and A_{V} deduced from the observations of the mean Galactic extinction curve (Fitzpatrick & Massa 1986).
Available from the NED (http://nedwww.ipac.caltech.edu) and on the more recent plateform https://irsa.ipac.caltech.edu.
All Tables
All Figures
Fig. 1 Aitoff projection, in Galactic longitude and latitude coordinates, of the background sources of the observational sample of HI and H_{2} deduced from absorption studies and used in this work (see Appendix A and Table A.1). The colorcode indicates the distance of the source. All unknown distances correspond to extragalactic sources (see Table A.1): these are arbitrarily set to 1 Mpc and indicated with black points. 

In the text 
Fig. 2 Distribution of lengths l_{los} of the intercepted diffuse material computed with Eq. (1) along all lines of sight of the observational sample. The orange sample corresponds to lines of sight where H_{2} is detected and the green sample to those for which an upper limit on N(H_{2}) has been derived (see Table A.1). 

In the text 
Fig. 3 H_{2} column density as a function of the total column density of protons N_{H}. Open circles correspond to detections of H_{2} while arrows correspond to upper limits (see Table A.1). The blue dashed line indicates the maximum value of N(H_{2}) derived from a purely molecular medium with an integrated molecular fraction (Eq. (2)). The red dasheddotted line indicates the theoretical molecular fraction derived in an unshielded WNMtype environment with a density of 0.5 cm^{−3} and a temperature of 8000 K, illuminated by a UV photon flux of 10^{8} cm^{−2} s^{−1} (see Eqs. (13) and (15)). The regions A, B, C, D, and E defined in Table 1 correspond to an arbitrary separation of the observational sample used for quantitative comparisons with the results of simulations (see Sect. 4). 

In the text 
Fig. 4 Colored tables of the mean pressure expressed in K cm^{−3} (first line), the turbulent velocity dispersion σ_{tur} (second line), and the fractions of mass f_{WNM} and f_{CNM} contained inthe WNM phase (third line) and the CNM phase (fourth line). First and second columns: these quantities as functions of and G_{0}, for L = 50 pc and F = 3.6 × 10^{−3} kpc Myr^{−2} (first column) and for L = 200 pc and F = 9 × 10^{−4} kpc Myr^{−2} (second column). Third and fourth columns: these quantities as functions of the acceleration parameter F and the compressive ratio ζ, for L = 50 pc (third column) and L = 200 pc (fourth column). All other parameters are set to their standard values (see Table 2). 

In the text 
Fig. 5 Schematic view of the reconstruction of individual lines of sight over a distance l_{los}. The medium between the observer and the source is assumed to be composed of hotand warm ionized material (light blue cubes) with a volume filling factor φ and of uncorrelated pieces of diffuse neutral gas of individual size L (simulated boxes) with a volume filling factor (1φ). 

In the text 
Fig. 6 2D probability histogram of the total proton column density N_{H} and the column density of molecular hydrogen N(H_{2}) obtained with the standard simulation (see Table 2). Upper panel: original data where all lines of sight have a size L= 200 pc. Bottom panel: outcome of the reconstruction algorithm described in Sect. 3.8 that produces a sample of lines of sight ranging from 100 to 3200 pc. The color code indicates the fraction of lines of sight (in logarithmic scale) contained in each bin. Dotted lines are isocontours of the molecular fraction for , 10^{−6}, 10^{−4}, 10^{−2}, and 1. 

In the text 
Fig. 7 Schematic view of lines of sight of fixed length l_{los} inferred from the analytical model described in Appendix C, and corresponding contributions to the histogram of the normalized column densities N_{H} ∕l_{los} and N(H_{2})∕l_{los}. Bottom panels: the white stars correspond to all lines of sight while the black stars correspond to the specific cases illustrated above. Any line of sight is intercepting several components of constant density n_{H}. Diffuse components () have a size ; dense components () have a distribution of sizes . Only components with densities larger than are molecular (see main text). The red star in the bottom panels indicates the mean value of N_{H} ∕l_{los} and N(H_{2})∕l_{los} computed over a large sample of lines of sight (white stars), hence the expected mean molecular fraction. The dotted line indicates a fully molecular medium. 

In the text 
Fig. 8 Comparison of the observational dataset (black points) to the 2D probability histogram reconstruction algorithm (see Sect. 3.8) applied to the fiducial simulation (colored histogram). Results are shown for two resolutions, R = 64^{3} (top panel) and R = 512^{3} (bottom panel). Observations include detections of H_{2} (circles) and upper limits on N(H_{2}) (arrows). The color code indicates the fraction of lines of sight (in logarithmic scale) contained in each bin. As a reminder, contours of the regions A, B, C, D, and E defined in Sect. 2 (see Table 1) are also displayed. 

In the text 
Fig. 9 Top frame: comparisons of the observational dataset (black points) with the 2D probability histograms of N_{H} and N(H_{2}) computed from the reconstruction algorithm (see Sect. 3.8) applied to thesimulations. Bottom frame: fraction of lines of sight (%), and mean value μ and dispersion σ of the logarithm of the molecular fraction computed from the simulated histograms in the regions A, B, C, D, and E defined in Table 1. Numbers correspond to the values of %, μ, and σ. The color code indicates a measure of distance (in arbitrary units) between the observed and simulated values in order to guide the eye. These comparisons are shown in each frame for 15 different simulations with G_{0} varying from 0.5 (left panels) to 4 (right panels) and varying from0.5 cm^{−3} (top panels) to 4 cm^{−3} (bottom panels). All other parameters are set to their fiducial values (see Table 2). 

In the text 
Fig. 10 Same as in Fig. 9 for simulations with a box size of L = 50 pc instead of200 pc. The turbulent forcing is adjusted as in Fig. 4 (F = 3.6 × 10^{−3} kpc Myr^{−2}) to obtain similar velocity dispersions for L = 50 pc and L = 200 pc. 

In the text 
Fig. 11 KS distance between the simulations and the observational sample as a function of the mean density and the UV scaling factor G_{0} = 0.5 (red), 1 (green), 2 (orange), and 4 (blue). Results obtained with and without gravity are shown with solid lines and dashed lines, respectively. All other parameters are set to their standard values (see Table 2). Points correspond to reliable measurements of the KS distances. Triangles indicate lower limits corresponding to simulations where the upper error bar on R_{KS} tends toward infinity (see Appendix D). 

In the text 
Fig. 12 KS distances between the simulations and the observational sample computed for five values of the acceleration parameter F = 4.5 × 10^{−5}, 1.5 × 10^{−4}, 4.5 × 10^{−4}, 1.5 × 10^{−3}, and 4.5 × 10^{−3} kpc Myr^{−2}, and three values of the compressive ratio ζ = 0.1 (blue points), 0.5 (orange points), and 0.9 (green points), which set the balance between compressive and solenoidal forcing (see Sect. 3.3). All other parameters are set to their standard values (see Table 2). Points correspond to reliable measurements of the KS distances. The triangle indicates a lower limit corresponding to a simulation where the upper error bar on R_{KS} tends toward infinity (see Appendix D). 

In the text 
Fig. 13 KS distances between the simulations and the observational sample computed for six values of the initial magnetic field B_{x}. All other parameters are set to their standard values (see Table 2). Points correspond to reliable measurements of the KS distances. The triangle indicates a lower limit corresponding to a simulation where the upper error bar on R_{KS} tends toward infinity (see Appendix D). 

In the text 
Fig. B.1 Thermal equilibrium curves () computed with RAMSES for A_{V} = 0 (red solid curve) and A_{V} = 1 (green solid curve) compared to those predicted by Wolfire et al. (2003) for ϕ_{PAH} = 0.5 (blackdashed curve) and 1.0 (blue dashed curve). 

In the text 
Fig. C.1 Probability histogram of the proton density n_{H} extracted from the fiducial simulation (see Table 2). The black histogram correspond to the extracted data. The red dashed curve shows an example of the sum of two lognormal components and a powerlaw tail at high density, for comparison. The blue line indicates the inflection point of the PH between the diffuse and the dense components. 

In the text 
Fig. C.2 Comparisons of the 1D probability histograms of the total column density normalized to the integration scale, N_{H} ∕l, extracted from the simulations (black histograms) and constructed with the semianalytical model described in the main text (green histograms) for l = 200 (top left), 100 (top right), 50 (bottom left), and 25 (bottom right) pc. Each of these four main panels displays the comparisons performed for 15 simulations with different and G_{0} around the fiducial setup defined by cm^{−3} and G_{0} = 1. All probability histograms inferred from the semianalytical model are obtained assuming a fixed correlation length of the diffuse component pc and a correlation length of the dense component pc (see main text). 

In the text 
Fig. C.3 Comparisons of the 2D probability histograms of the total column density normalized to the integration scale, N_{H} ∕l, and the column density of H_{2} normalized to integration scale, N(H_{2})∕l extracted from the simulations and constructed with the semianalytical model described in the main text for l = 200 (top left), 100 (top right), 50 (bottom left), and 25 (bottom right) pc. The results of the simulations are indicated with contour plots with isoprobabilities of 10^{−4}, 10^{−3}, and 10^{−2}. The coloredhistograms correspond to the results obtained with the semianalytical model. As in Fig. C.2, each of the four main panels displays the comparisons performed for 15 simulations with different and G_{0} around the fiducial setup defined by cm^{−3} and G_{0} = 1. All probability histograms inferred from the semianalytical model are obtained assuming a fixed correlation length of the diffuse component pc and a correlation length of the dense component pc (see main text). 

In the text 
Fig. C.4 Same as Fig. C.3, assuming that pc. The exponent used here corresponds to a simple test to show the effect of a dense decorrelation scale varying with density. 

In the text 
Fig. D.1 Results of the modified KS test applied to the fiducial simulation. The black points are the observational data, the red dots indicate the dataset used for the estimations of the merit function M, and the 2D histogram the simulated data. The violet star and rectangle indicate the observational point and the quadrant that maximize M (see main text). The fiducial simulation has a KS distance of 0.98. The corresponding quadrant contains 0.47 and 4.49% of the entiresimulated and observed datasets. 

In the text 
Fig. D.2 KSdistance between the simulations and the observational sample as a function of the mean density , the UV scaling factor G_{0} = 0.5 (red), 1 (green), 2 (orange), and 4 (blue), and for a resolution of 256^{3} (solid lines) and 128^{3} (dashed lines). All other parameters are set to their standard values (see Table 2). Points correspond to reliable measurements of the KS distances. Triangles indicate lower limits corresponding to simulations where the upper error bar on R_{KS} tends toward infinity (see main text). 

In the text 
Current usage metrics show cumulative count of Article Views (fulltext article views including HTML views, PDF and ePub downloads, according to the available data) and Abstracts Views on Vision4Press platform.
Data correspond to usage on the plateform after 2015. The current usage metrics is available 4896 hours after online publication and is updated daily on week days.
Initial download of the metrics may take a while.