3D chemical structure of diffuse turbulent ISM. I. Statistics of the HI-to-H$_2$ transition

We studied the statistical properties of the HI-to-H$_2$ transition observed in absorption in the local diffuse and multiphase ISM to identify the physical processes controlling the probability of occurrence of any line of sight. The turbulent diffuse ISM is modeled using the RAMSES code, which includes detailed treatments of the magnetohydrodynamics, the thermal evolution of the gas, and the chemistry of H$_2$. The impacts of the UV radiation field, the mean density, the turbulent forcing, the integral scale, the magnetic field, and the gravity on the molecular content of the gas are explored through a parametric study covering a wide range of physical conditions. The statistics of the HI-to-H$_2$ transition are interpreted through analytical prescriptions and compared with the observations using a modified and robust version of the Kolmogorov-Smirnov test. The results of one simulation, convolved with the distribution of distances of the observational sample, are able to explain most of the statistical properties of the HI-to-H$_2$ transition observed in the local ISM. The tightest agreement is obtained for a neutral diffuse gas modeled over ~200 pc, with a mean density of $1-2$ cm$^{-3}$, illuminated by the standard interstellar UV radiation field, and stirred up by a large-scale compressive turbulent forcing. Within this configuration, the 2D probability histogram of the column densities of H and H$_2$ is remarkably stable and is almost unaltered by gravity, the strength of the turbulent forcing, the resolution of the simulation, or the strength of the magnetic field $B_x$. The weak effect of the resolution and our analytical prescription suggest that the column densities of HI are likely built up in large-scale WNM and CNM structures correlated in density over ~20 pc and ~10 pc, respectively, while those of H$_2$ are built up in CNM structures between ~3 pc and ~10 pc.


Introduction
The multiphase nature of the interstellar medium (ISM) is at the root of the regulation of star formation in galaxies (e.g., Hill et al. 2018). As shown by the emission profiles of the HI 21 cm line Murray et al. 2015Murray et al. , 2018, the diffuse neutral ISM is composed of two stable thermal states at thermal pressure equilibrium (Jenkins & Tripp 2011), the warm neutral medium (WNM, T ∼ 7000 K) and the cold neutral medium (CNM, T ∼ 70 K), coexisting with a third unstable state, the lukewarm neutral medium (LNM), whose temperature is comprised between those of the CNM and the WNM (e.g., Marchal et al. 2019). Through condensation and evaporation processes, turbulent transport, and turbulent mixing, the diffuse matter flows from one stable state to the other eventually leading to the formation of dense and cold clouds massive enough to trigger gravitational collapse (e.g., Ostriker et al. 2010). While this picture is widely accepted, the intricated effects of turbulence, gravity, radiation field, and magnetic field on the exchange of mass and energy between the different phases and on the formation of structures at all scales has yet to be unveiled.
Following the illustrious analytical descriptions of the thermal instability process (Field 1965;Wolfire et al. 1995;Wolfire et al. 2003;, several analytical and numerical studies have been dedicated to understand the dynamical evolution of the gas, focusing on the formation of CNM structures, molecular clouds, and collapsing cores (e.g., Hennebelle Article & Pérault 1999;Audit & Hennebelle 2005;Vázquez-Semadeni et al. 2007;Hennebelle et al. 2008), as well as on the stability of clouds of various geometries under evaporation and condensation conditions (e.g., Inoue et al. 2006;Stone & Zweibel 2009;Kim & Kim 2013;Nagashima et al. 2005;Iwasaki & Inutsuka 2014). These show that large-scale turbulence combined with thermal instability is sufficient to explain several features of the neutral ISM, including the fractions of mass observed in the different thermal states (Seifried et al. 2011;Hennebelle & Iffrig 2014;Hill et al. 2018), the distribution of thermal pressure (Saury et al. 2014), and mass spectrum, the mass-size relation, and the velocity dispersion-size relation of molecular clouds (e.g., Audit & Hennebelle 2010;Padoan et al. 2016;Iffrig & Hennebelle 2017).
To extend the predictions of simulations to a larger set of observational diagnostics, recent numerical studies have undertaken the challenging task of solving the chemical evolution of turbulent and/or multiphase environments. Originally dedicated to the formation of CO in molecular clouds and to the analysis of the CO-to-H 2 conversion factor in galaxies and the CO dark-gas (e.g., Glover et al. 2010;Smith et al. 2014;Richings & Schaye 2016a;Seifried et al. 2017;Gong et al. 2018), numerical simulations are now used to study a variety of atomic and molecular tracers, including CII, CI, CH + , OH + , H 2 O + , and ArH + (e.g., Richings & Schaye 2016b;Valdivia et al. 2017;Clark et al. 2019;. All these works demonstrate the predictive power of astrochemistry. The column density distribution of each atom and molecule has a unique signature that provides detailed information on the thermodynamical state of the diffuse matter (Clark et al. 2019). In turn, the confrontation with the predictions of numerical simulations can be used to estimate the scale and strength of the injection of mechanical energy by stellar feedback , the large-scale turbulent transport and the interfaces between CNM and WNM (Valdivia et al. 2017), and the nature of the turbulent dissipation processes (Lesaffre et al. 2020).
In this context, understanding the formation and survival of molecular hydrogen has long been recognized as a major topic of investigation. As the most abundant molecule in space, H 2 is at the root of interstellar chemistry and the growth of molecular complexity. In addition, and because its formation preferentially occurs in dense environments, H 2 naturally correlates with the star formation rate of galaxies (e.g., Lupi et al. 2017) and therefore offers a valuable proxy to understand the limit in the Kennicutt-Schmidt relation above which star formation occurs (e.g., Bigiel et al. 2008Bigiel et al. , 2011Schruba et al. 2011;Leroy et al. 2013).
Over the last decades, great efforts have thus been devoted to propose analytical descriptions of the HI-to-H 2 transition in homogeneous clouds with plane-parallel or spherical geometries (e.g., Sternberg 1988;Krumholz et al. 2008;McKee & Krumholz 2010;Sternberg et al. 2014 and references therein), compute this transition in detailed 1D chemical models assuming chemical equilibrium (e.g., van Dishoeck & Black 1986;Abgrall et al. 1992;Le Bourlot et al. 2012;Bron et al. 2014) or not (e.g., Lee et al. 1996;Goldsmith et al. 2007;Lesaffre et al. 2007), treat the chemistry of H and H 2 in subgrid models applied to simulations of galaxy formations (e.g., Gnedin et al. 2009;Christensen et al. 2012; Thompson et al. 2014;Diemer et al. 2018), or solve it in 3D isothermal or multiphase simulations of the diffuse ISM using various treatments of the radiative transfer (e.g., Glover et al. 2010;Valdivia et al. 2016;Hu et al. 2016;Bialy et al. 2017;Nickerson et al. 2018).
Thanks to all these works, a global picture of the formation of H 2 in galaxies is now emerging. At the scale of a homogeneous cloud, the molecular content, the sharpness of the HI-to-H 2 transition, and the asymptotic column density of HI are controlled by the ratio of the intensity of the ultraviolet (UV) field to the gas density and the dust-to-gas ratio, or equivalently, the metallicity (Sternberg et al. 2014). At larger scales, the integrated column densities of HI and H 2 also depend on the distribution of clouds of various densities along the line of sight and on the porosity to the UV radiation field. Because of these effects, the statistical properties of the total column density are found to depend on the strength, the scale, and the compressibility of the turbulent forcing in simulations of CNM gas (Micic et al. 2012;Bialy et al. 2017). The amount of molecular gas depends on the "clumpiness factor" used for the subgrid models in simulations of galaxy formation (Gnedin et al. 2009;Christensen et al. 2012).
Despite these achievements, very few works have been dedicated so far to the analysis of the HI-to-H 2 transition in a turbulent multiphase medium at a scale sufficient to resolve the formation of CNM structures. In addition and while the predictions of analytical models (e.g., Krumholz et al. 2008) and simulations (Gnedin et al. 2009;Valdivia et al. 2016) were able to reproduce the trend of the HI-to-H 2 transition observed by Copernicus and FUSE in the local ISM (e.g., Savage et al. 1977;Gillmon et al. 2006;Rachford et al. 2009), the LMC and the SMC (e.g., Browning et al. 2003;Gillmon et al. 2006;Leroy et al. 2007), no detailed comparison with the statistical properties of these observations have been proposed. As a result, the occurrence of lines of sight with large molecular fractions predicted by numerical simulations often exceed what is deduced from the observations (Valdivia et al. 2016). Finally, and while statistical studies of 1D probability distribution functions (PDF) have become a common tool to understand the formation and the dynamics of molecular clouds (Körtgen et al. 2019), few statistical studies have been performed to date on 2D probability distribution functions using combined observations of different molecular tracers. In that perspective, the recent work of  opens new horizons for the analysis of chemistry in the diffuse matter.
In the first paper of this series, we extend these pioneer statistical studies to the measurements of the atomic-to-molecular transition observed in the diffuse and translucent ISM located in a radius of ∼ 3 kpc around the sun. We perform a parametric exploration of numerical simulations of the multiphase ISM and compare the results with the observed 2D probability histogram (PH) of total and molecular hydrogen column densities in order to identify the physical processes that control the molecular content of CNM clouds and the probability of occurrence of lines of sight. The observational dataset and the distribution of sizes of the sampled medium are presented in Sect. 2. The different setups of the simulations and the method used to reconstruct the 2D PH are described in Sect. 3. The comparisons with the observations are shown in Sect. 4 which also highlights the influences of the different parameters. The paper finally ends with Sects. 5 and 6 where we discuss the validity of our approach and summarize our main conclusions.  Table A.1). The color code indicates the distance of the source. All unknown distances correspond to extragalactic sources (see Table A.1): these are arbitrarily set to 1 Mpc and indicated with black points. this catalog to observations or tentative detections of HI, H 2 , and of the reddening E(B-V), and removing the data associated to the Magellanic Cloud or high redshift extragalactic environments (e.g., Tumlinson et al. 2002;Cartledge et al. 2005;Welty & Crowther 2010;Noterdaeme et al. 2007), we obtain a sample of 360 sources which form, to date, the most complete set of observations of the HI-to-H 2 transition in the local diffuse ISM. A more detailed description of this set, the list of the background sources, and the values of the column densities of HI and H 2 toward each source, N(H) and N(H 2 ) are given in Appendix A and Table A.1.
The positions of the sources in the sky and their distance deduced from GAIA and Hipparcos measurements of parallaxes (Perryman et al. 1997;Gaia Collaboration et al. 2018) are shown in Fig. 1, where unknown distances of extragalactic sources (see Table A.1) are arbitrarily set to 1 Mpc. With comparable numbers of observations in all Galactic quadrants, the sources appear to be well distributed in Galactic longitudes. Oppositely, and while the sources cover almost all Galactic latitudes, about two-thirds of them are located toward the Galactic disk with latitudes smaller than 15 • , and only one third is located above, crossing the Galactic halo. Since the sample contains extragalactic sources, and since the amount of molecular gas in the Milky Way decreases exponentially as a function of the distance from the midplane, the length of the line of sight l los occupied by the observed diffuse gas cannot always be identified to the distance of the background source. For the sake of simplicity, we assume here a molecular height above the midplane of 100 pc and compute the length of the observed diffuse Galactic material as l los = min 1" p , 100 sin(|b|) pc, where b is the Galactic latitude of the background source and p is its parallax. The resulting distribution of the lengths of the lines of sight is shown in Fig. 2. The shortest lines of sight are found to extend over ∼ 100 pc and the largest over ∼ 3.5 kpc. Remarkably, and because of the combined distributions of distances and Galactic Fig. 2: Distribution of lengths l los of the intercepted diffuse material computed with Eq. 1 along all lines of sight of the observational sample. The orange sample corresponds to lines of sight where H 2 is detected and the green sample to those for which an upper limit on N(H 2 ) has been derived (see Table A.1).
latitudes of the sources, we find that log(l los ) follows a flat distribution up to l los ∼ 2 kpc with about 50 sources per bin and drops by about a factor of two for l los ∼ 3 kpc. Oppositely, and as expected, the distribution of lengths of non-detections of H 2 is not flat but decreases rapidly up to 1 kpc. Long lines of sight are finally not limited to the first and fourth Galactic quadrants but are found to spread over all Galactic longitudes and mostly depend on the Galactic latitude of the background source.

Physical and statistical properties
The compiled data are shown in Fig. 3 which displays the observed column densities of H 2 as functions of the total proton column densities of the gas N H = N(H) + 2N(H 2 ). As shown in Fig. 3 and as already noted by Goldsmith et al. (2009), almost no line of sight is either purely WNM or purely molecular. This implies that the observed gas is necessarily composed of a combination of phases and clouds of different extinctions with various contributions to the volume spanned by the different lines of sight. As a result, the integrated molecular fraction computed as shows a large dispersion in the observational sample covering about seven orders of magnitude. While the molecular fraction averaged over all the lines of sight is found to be 0.20, the mass averaged molecular fraction is 0.27, a value similar to the results obtained by Miville-Deschênes et al. (2017) at 8.5 kpc based on the analysis of molecular clouds observed in CO in the entire Galactic disk. The position of the HI-to-H 2 transition, on the other hand, is found to extend over about one order of magnitude of total column density from N H ∼ 10 20 to 10 21 cm −2 and occurs, on average, at N H ∼ 3 × 10 20 cm −2 (Gillmon et al. 2006). In order to highlight the statistical features of this transition, we divide the observational sample into 5 subsamples A, B, C, D, and E, shown in Fig. 3, which encompass almost all the observational points and whose statistical properties are summarized in Table 1. 48 lines of sight out of 360 are found to be not detected in H 2 . Most of these upper limits are obtained for a total column density smaller than 10 21 cm −2 and about half of them provide strong constraints on the molecular fraction with Article number, page 3 of 35  Table A.1). The blue dashed line indicates the maximum value of N(H 2 ) derived from a purely molecular medium with an integrated molecular fraction f H 2 = 1 (Eq. 2). The red dashed-dotted line indicates the theoretical molecular fraction derived in an unshielded WNMtype environment with a density of 0.5 cm −3 and a temperature of 8000 K, illuminated by a UV photon flux of 10 8 cm −2 s −1 (see Eqs. 13 & 15). The regions A, B, C, D, and E defined in Table 1 correspond to an arbitrary separation of the observational sample used for quantitative comparisons with the results of simulations (see Sect. 4).
f H 2 10 −5 . 3% of the 312 detections belong to the subsample A, 13% to subsample B, 16% to subsample C, and 65% to subsample D. Interestingly, the subsample E is empty and no line of sight is observed with N H > 10 22 cm −2 . While the mean value of the logarithm of the molecular fraction strongly increases from subsamples A to D, the dispersion simultaneously decreases by about a factor of three, probably revealing an effect of average over long distances.
All these statistical properties, and more precisely the probability of occurrence of a given line of sight, are the subject of this paper. What physical processes control the HI-to-H 2 transition? How does the distribution of lengths of the lines of sight influence its observed statistical properties? What are the origins of the variations of the molecular fraction from one line of sight to another?

Physics and numerical method
To study the physical processes at play in the HI-to-H 2 transition, we performed numerical simulations of the multiphase diffuse ISM, using the RAMSES code (Teyssier 2002;Fromang et al. 2006), a grid-based solver with adaptative mesh refinement (Berger & Oliger 1984). The methodology applied in this paper follows the works of Seifried et al. (2011), Saury et al. (2014), and Valdivia et al. (2015. The diffuse matter in the Solar Neighborhood of our galaxy is simulated over a box of size L with periodic boundary conditions. The matter, defined by a mean proton density n H , is assumed to be illuminated on all sides by an isotropic spectrum of UV photons set to the standard interstellar radiation field (Habing 1968) and scaled with a factor G 0 .

Fluid equations
Within this framework, RAMSES computes the evolution of the gas solving the classic equations of ideal magnetohydrodynamics (MHD): where ρ, v, B, P and E are the mass density, the velocity field, the magnetic field, the total pressure, and the total energy density, respectively. The net cooling function per unit mass, L, and the acceleration due to the turbulent driving, f , are described in Sections 3.2 and 3.3. The axis x, y, and z are chosen so that z corresponds to the direction perpendicular to the Galactic disk, and x corresponds to the direction of the mean magnetic field initially parametrized by a constant value B x .
To take into account all gravitational forces, including selfgravity and the action of stars and dark matter, the gravitational potential Φ is divided into two terms: The self-gravity potential, φ gas , is deduced from the Poisson's equation: Following Kuijken & Gilmore (1989) and Joung & Mac Low (2006), we assume that the Galactic potential along the direction z perpendicular to the Galactic disk can be written as  where the first term is the contribution of the stellar disk parametrized by z 0 = 0.18 kpc and a 1 = 1.42 × 10 −3 kpc Myr −2 and the second term is the contribution of the spherical dark halo parametrized by a 2 = 5.49 × 10 −4 Myr −2 .

Thermal processes and radiative transfer
As shown by Field (1965), the multiphase nature of the ISM results from the thermal balance of the gas and thus from its net cooling function L defined by where n 2 H Λ and n H Γ are the cooling and heating rates of the medium (in erg cm −3 s −1 ) and n H is the proton density. To correctly describe the thermal state of the diffuse ISM, we include in this work the heating induced by the photoelectric effect and the decay of cosmic ray particles and the cooling induced by Lyman α photons, the recombination of electrons onto grains, and the fine structure lines of OI and CII. All these processes, described in Appendix B, are modeled with the analytical formulae given by Wolfire et al. (2003).
The absorption of UV photons by dust, and its subsequent impact on the photoelectric effect, is treated with the tree-based method proposed by Valdivia & Hennebelle (2014). At each point the effective radiation field G eff (in Habing units) is computed as where A V is the visual extinction along a given ray, deduced from the integrated proton column density 1 computed from the border of the box to the current point and e −2.5A V is an average performed over 12 directions, treated as solid angles evenly spread in polar coordinates.

Turbulence forcing
To mimic the injection of mechanical energy in the diffuse ISM, a large scale turbulent forcing is applied. Following Schmidt et al. (2009) andFederrath et al. (2010), this forcing, modeled 1 In this work we use the relation between N H and A V deduced from the observations of the mean Galactic extinction curve (Fitzpatrick & Massa 1986).
by an acceleration f in the momentum conservation equation, is driven through an Ornstein-Uhlenbeck process using a pseudospectral method. At regular time intervals ∆τ, random fluctuations of the forcing term are generated and applied over an autocorrelation timescale τ. To excite only large scale modes, the forcing is modeled as a paraboloid in Fourier space covering a small interval of wavenumbers 1 k 3 and centered on k = 2. Using the notations of Seifried et al. (2011) andSaury et al. (2014), the total magnitude of these perturbations is set with either an acceleration parameter F or, equivalently, a velocity parameter V related by F = V 2 /L drive , where L drive is the main driving scale, L drive = L/2. A Helmholtz decomposition is finally applied, in order to control the powers injected in compressive and solenoidal modes. Using the classical notation, these powers are set with a parameter 2 ζ ranging from a pure solenoidal field (ζ = 1) to a pure compressive field (ζ = 0).
Throughout this work, we adopt ∆τ ∼ 0.4 Myr which roughly corresponds to the time interval separating two supernova events occurring in a volume of (200 pc) 3 . The characteristic damping time of the turbulence τ is approximately set to the turnover timescale of the diffuse ISM τ ∼ 33(L/200pc) 0.6 Myr (Larson 1981;Hennebelle & Falgarone 2012). F (or V) and ζ are left as free parameters.

H 2 chemistry
The timescale required for the abundance of molecular hydrogen to reach its equilibrium value is known to range over several orders of magnitude, depending on the physical conditions of the ISM (e.g., Goldsmith et al. 2007;Tabone 2018). In the diffuse gas, this timescale varies typically between a few 10 3 yr and a few 10 7 yr (Valdivia et al. 2016), hence over a range of values that often exceeds the dynamical timescales. To take into account this important aspect of the diffuse interstellar chemistry, the outof-equilibrium abundance of H 2 is computed self-consistently in the simulation, using the formalism introduced in RAMSES by Valdivia et al. (2015Valdivia et al. ( , 2016. The formation of H 2 onto grains in physisorption and chemisorption sites is modeled with the simplified rate of Le Bourlot et al. (2012) where n H and n(H) are the local proton and atomic hydrogen densities, and is the sticking coefficient of H onto grain, parametrized by T 2 = 464 K and β = 1.5. The destruction of H 2 by UV photons is computed using the formalism, described by Draine & Bertoldi (1996) and Sternberg et al. (2014), which is classically introduced in many astrochemical models (e.g., Lesaffre et al. 2013;Gong et al. 2017;Bialy et al. 2017). In each cell, the photodestruction rate of H 2 is modeled as where x = N(H 2 )/5 × 10 14 cm −2 , k d,0 = 3.3 × 10 −11 s −1 is the inverse freespace dissociation timescale of H 2 in an isotropic Habing field, n(H 2 ) is the local density of the molecular hydrogen, e −σ d N H is the shielding induced by dust and f shield the selfshielding function. We adopt here an effective dust attenuation cross section at λ = 1000 Å, σ d = 2 × 10 −21 cm 2 (Sternberg et al. 2014). Following Draine & Bertoldi (1996), the self-shielding function is computed as where b D is the Doppler broadening parameter expressed in km s −1 . As done for the photoelectric heating rate (see section 3.2), both the shielding by dust and the self-shielding are calculated along 12 different directions and then averaged to obtain the photodissociation rate of Eq. 15.

Fiducial model and grids of parameters
The framework described above lays on several independent parameters, which are all related to key physical ingredients of the ISM. The influence of each ingredient on the HI-to-H 2 transition is studied here through several grids of simulations − including a total of 305 runs − covering a broad range of physical conditions and centered around a fiducial setup 3 . The reference value adopted for each parameter, and the range of values explored in this work, are summarized in Table 2. Among all parameters, L, n H and G 0 are of particular importance. With our assumptions, L simultaneously corresponds to the scale of illumination of the gas by UV photons and twice the integral scale of turbulence, that is twice the scale of injection of mechanical energy L drive . OB stars, which are the dominant sources of the interstellar UV field, are not uniformly distributed in the sky but are known to be clustered in associations (Ambartsumian 1947). As shown by the recent 3D studies of the distributions of stars based on the Hipparcos and GAIA Catalogs (e.g., Bouy & Alves 2015Zari et al. 2018, the typical distances separating two associations in the Solar Neighborhood range between 50 pc and a few hundreds of pc, that is several times the mean distance deduced from the integrated surface densities of OB stars (∼ 1.6 × 10 −3 pc −2 , Maíz-Apellániz 2001). Interestingly, such distances are not only comparable to the heights of the molecular gas (∼ 75 pc) and the cold HI gas (∼ 150 pc) above the Galactic plane deduced from CO and HI all-sky surveys (e.g., Dame et al. 2001, Dickey & Lockman 1990, Kalberla & Kerp 2009), but they also correspond to the typical size of HI superclouds (Elmegreen & Elmegreen 1987). For all these reasons, we, therefore, adopt a fiducial simulation with L = 200 pc and explore values down to a few tens of pc.
The mean density of the gas, n H , represents the mass of the diffuse neutral ISM contained in a volume L 3 , and also controls the porosity of the matter to the impinging radiation field. In this work, we adopt a fiducial value n H = 2 cm −3 , a value slightly larger than the standard Galactic midplane density of HI at a galactocentric distance of 8.5 kpc (Kalberla & Kerp 2009). With our fiducial value of L, the mass of gas contained in the box accounts for all the mass surface density of HI in the Solar Neighborhood (Σ HI ∼ 10 M pc −2 , Nakanishi & Sofue 2016, Miville-Deschênes et al. 2017).
G 0 controls the intensity of the radiation field, hence the thermodynamical state of the gas. Since G 0 is normalized to the Habing field, we choose a fiducial value of G 0 = 1. The corresponding UV energy density of the standard setup is therefore slightly smaller than that contained in the standard UV radiation fields given by Draine (1978) and Mathis et al. (1983).
The standard values of the two parameters F and B x are set to 9 × 10 −4 kpc Myr −2 and 3.8 µG, respectively. As we will show later, those values are chosen so that the velocity dispersion of the gas and the strength of the magnetic field are close to the values observed in the diffuse ISM.

Steady-state
The evolution of the multiphase environments simulated here is identical to the description already given in many papers (e.g., Seifried et al. 2011;Saury et al. 2014;Valdivia et al. 2015Valdivia et al. , 2016. Starting from an homogeneous density n H = n H , the gas evolves under the joint actions of turbulence, gravity, and thermal instability, and splits up in three different phases at thermal pressure equilibrium, the WNM, the CNM, and the LNM. The formation of dense environments well shielded from the destructive UV radiation field triggers the formation of H 2 and the medium jointly evolves from a purely atomic state to a partly molecular state. If the mass of the gas is conserved, as imposed by the periodic boundary conditions, the medium progressively tends toward a steady-state where the mean pressure, the volume filling factors of the different phases, their velocity dispersion, and their mean molecular fractions are roughly constant. This steady-state is typically reached after a few turnover timescales, providing that the corresponding time is longer than the damping time (see Sect. 3.3).
Because of turbulence and thermal instability, the steadystate has a statistical nature. The turbulent forcing and the subsequent turbulent cascade induce pressure variations and shear motions at all scales which trigger mass exchanges between the different phases. Any pressure or density structure is therefore a transient system which is usually described by its contribution to probability distribution functions. At steady-state, only PDFs remain constant. This steady yet ever changing environment is the reason for the sustained presence of a substantial amount of gas in the LNM at densities and temperatures out of thermal equilibrium (e.g., Marchal et al. 2019). All the results shown throughout this paper are taken at steady-state, at times ranging from a few tens of Myr up to 100 Myr depending on the strength of the turbulent forcing.  . The first and second columns display these quantities as functions of n H and G 0 , for L = 50 pc and F = 3.6 × 10 −3 kpc Myr −2 (first column) and for L = 200 pc and F = 9 × 10 −4 kpc Myr −2 (second column). The third and fourth columns display these quantities as functions of the acceleration parameter F and the compressive ratio ζ, for L = 50 pc (third column) and L = 200 pc (fourth column). All other parameters are set to their standard values (see Table 2).

Properties of the multiphase medium
The steady-state values of the mean pressure P/k , the turbulent velocity dispersion σ tur , and the fractions of mass of the WNM and the CNM, f WNM and f CNM , are shown in Fig. 4 for a set of 60 different simulations. For the sake of simplicity, we assume that the WNM is composed of all cells with a temperature T 3000 K, the CNM of all cells with T 300 K, and the LNM of all cells with 300 K < T < 3000 K, hence and The mean pressure P/k (with k the Boltzmann constant) is classically computed as an average over the entire volume. While the above conventions are well established, there are many ways to define the turbulent velocity dispersion σ tur . It could be defined as the volume-weighted or mass-weighted velocity dispersion computed over the entire volume (Audit & Hennebelle 2005;Federrath et al. 2010), the average of the mass-weighted velocity dispersion computed along individual lines of sight (Miville-Deschênes & Martin 2007;Saury et al. 2014), or the dispersion of the mass-weighted velocity centroids computed along independent directions (Henshaw et al. 2019). All these definitions give velocity dispersions that differ from one another and are not equally relevant for the comparison with observed quantities.
To relate the velocity dispersion to a kinetic energy and provide values that could be compared to the observations of broad HI emission profiles at high Galactic latitude (see below), we chose to compute σ tur in the WNM only as and where the sums are performed over all cells with T 3000 K.
The results displayed in Fig. 4 are very similar to those obtained in the parametric studies of Seifried et al. (2011) and Saury et al. (2014) and are in line with the expectations of models of turbulent multiphase environments (Wolfire et al. 2003;Ostriker et al. 2010). The mean pressure of the gas is primarily regulated by the thermal equilibrium curve (see Appendix B) which depends on the local illumination of the gas by the UV radiation field G eff . Since G eff results from the absorption of the external UV field by the surrounding environments, P/k is not only sensitive to G 0 but also to the total mass of the simulation set by n H and L. Larger values of G 0 or smaller values of L or n H leads to larger P/k . In turn, the fractions of mass contained in the WNM and the CNM are controlled by the mean pressure and the total mass of the gas. Larger pressure implies larger densities of the WNM (and the CNM). The fraction of mass of the CNM therefore decreases as P/k increases; eventually, if the density of the WNM becomes comparable to the mean density n H , the CNM almost entirely disappears (see bottom left panels of Fig.  4). At last, and because the WNM occupies most of the volume, f CNM necessarily increases as a function of the total mass of the gas, or equivalently n H , even at constant pressure.
The turbulent forcing induces pressure fluctuations and shearing motions at all scales. As shown by Seifried et al. (2011), this not only frequently perturbs the gas out of the thermal equilibrium states but also strongly reduces the times spent by any fluid elements in the WNM, LNM, and CNM. Because of these two aspects, increasing the turbulent forcing reduces the mass of the CNM to the benefit of those of the LNM and WNM (right panels of Fig. 4 and Fig. 10 of Seifried et al. 2011). The mean pressure therefore increases, the 1D PDF of the density broadens and its bimodal nature progressively disappears (Piontek & Ostriker 2005;Walch et al. 2011). These effects can be magnified depending on the nature of the turbulent forcing and the power injected in the compressive and solenoidal modes. Because solenoidal motions are more efficient to prevent the gas to condensate back to the CNM phase, a pure solenoidal forcing naturally leads to larger pressure and smaller CNM fractions than those obtained with an equivalent kinetic energy injected in pure compressive modes.
As expected, the velocity dispersion of the WNM is mostly given by the strength of the turbulent forcing and the driving scale (L drive ∼ L/2, see Sect. 3.3), with a slight dependence on n H and ζ. As proposed by Saury et al. (2014), a realistic value for the turbulent velocity dispersion of the WNM can be estimated by looking at the HI 21 cm emission spectra with the fewest components observed at high Galactic latitude (Kalberla et al. 2005). toward these directions, Haud & Kalberla (2007) derive a total velocity dispersion σ tot = (σ 2 tur + σ 2 thr ) 1/2 ∼ 10 km s −1 , where the σ thr is the 1D thermal velocity dispersion (∼ 8.2 km s −1 for the WNM). In the present paper, the turbulent forcing applied to the standard simulation (see Table. 2 and Fig. 4) is chosen so that σ tur ∼ 4 − 5 km s −1 , in fair agreement with the observations at high Galactic latitude. While this value is chosen as a reference, the velocity dispersions obtained in all the simulations explored in this work range between 1 and 15 km s −1 (see Fig. 4).

Reconstruction of lines of sight
As shown in Sect. 2, the medium observed in absorption at UV and visible wavelengths extends over a very broad range of distances, from ∼ 100 pc to several kpc (see Fig. 2). The targeted lines of sight may therefore contain several isolated diffuse neutral phases but also hot and warm ionized material (McKee & Cowie 1977;de Avillez & Breitschwerdt 2004). Indeed, such a superimposition of independent components is particularly well Fig. 5: Schematic view of the reconstruction of individual lines of sight toward a background source located at a distance l los from the observer. The medium between the observer and the source is assumed to be composed of hot and warm ionized material (light blue cubes) with a volume filling factor ϕ and of uncorrelated pieces of diffuse neutral gas of individual size L (simulated boxes) with a volume filling factor (1-ϕ). seen in submillimeter and infrared observations of the Galactic disk where the gas seen in absorption is found to cluster in several velocity components associated to known Galactic structures (e.g., Gerin et al. 2016). Since our setup only follows a piece of diffuse neutral material of size L, an additional treatment regarding the lengths of the lines of sight is therefore required in order to compare the results of the simulations to the distribution of observations. We apply here a methodology similar to that proposed by  and schematized in Fig.  5. We assume that a given simulation corresponds to a building block of neutral diffuse ISM. Depending on its length, any random line of sight necessarily intercepts parts or several of these building elements and an unknown mass of diffuse ionized gas parametrized by its volume filling factor ϕ. A total sample of N simulated lines of sight is then generated as follows.
Based on the results of Sect. 2 ( Fig. 2), we consider six lengths of lines of sight homogeneously distributed in log space: l los = 100, 200, 400, 800, 1600, and 3200 pc. For each length, we generate a sample of N l = 1 6 w l N lines of sight, where w l are normalized weights deduced from the distribution of distances in the observed sample: w 1 = 0.14, w 2 = 0.21, w 3 = 0.16, w 4 = 0.20, w 5 = 0.18, w 6 = 0.11 (see Fig. 2). The column densities of H and H 2 along each lines of sight are finally reconstructed by comparing the length occupied by the neutral medium (1 − ϕ)l los (see Fig. 5) and the size of the box L. If (1 − ϕ)l los = L, we draw a random line of sight in the simulation and extract the corresponding column densities. If (1 − ϕ)l los < L, we draw a random line of sight and integrate the column density over a reduced distance of (1 − ϕ)l los . If (1 − ϕ)l los > L, we draw L/(1 − ϕ)l los random lines of sight and add the respective individual column densities.
For the sake of simplicity, we assume here that any line of sight intercepts a constant fraction of diffuse ionized gas with ϕ = 0.5 (Hill et al. 2018). Similarly, we note that, while spatially uncorrelated, the pieces of diffuse neutral gas used in the reconstruction algorithm correspond to random realizations obtained with a single simulation. Potential variations of the mean density, of the external radiation field, or of the turbulent forcing  Table 2). The upper panel shows the original data where all lines of sight have a size L = 200 pc. The bottom panel shows the outcome of the reconstruction algorithm described in Sect. 3.8 that produces a sample of lines of sight ranging from 100 pc to 3200 pc. The color code indicates the fraction of lines of sight (in logarithmic scale) contained in each bin. Dotted lines are isocontours of the molecular fraction for f H 2 = 10 −8 , 10 −6 , 10 −4 , 10 −2 , and 1. that naturally follow the Galactic structure depending on the position of the source (see Appendix A) are not taken into account. All these limitations are discussed in Sect. 5.
The outcome of the reconstruction algorithm is shown in Fig.  6 which displays the 2D PH of the total proton column density N H and the column density of molecular hydrogen N(H 2 ) obtained with the standard simulation. Because of the flat distribution of distances in log space (see Fig. 2), the peaks of the reconstructed PH are found to be shifted toward both the large and the low values of N H compared to those of the initial distribution (top panel of Fig. 6). This naturally enhances the initial bimodality and many lines of sight are found to be either at low (∼ 10 −5 ) or large (∼ 10 −1 ) integrated molecular fractions. In addition, it induces an inclination toward large column densities; more than half of the lines of sight are found to have N H > 10 21 cm −2 . By virtue of the central limit theorem, the molecular fraction obtained in those lines of sight tends toward the mean H 2 molecular fraction of the initial simulation with a dispersion that decreases as a function of N H .
Building simulated lines of sight from the observed distribution of sizes allows us to perform a statistical comparison of both samples and to limit the impact of observational biases to lines of sight at large column density (region E, see Sect. 5). The combined 2D distributions of N H and N(H 2 ) deduced from the simulations are the main focus of this paper and will be shown several times in the following sections. For obvious aesthetical reasons, and to simplify the descriptions of the figures, we will often refer to this representation as a kingfisher diagram. Fig. 7: Schematic view of lines of sight of fixed length l los inferred from the analytical model described in Appendix C, and corresponding contributions to the histogram of the normalized column densities N H /l los and N(H 2 )/l los . In the bottom panels, the white stars correspond to all lines of sight while the black stars correspond to the specific cases illustrated above. Any line of sight is intercepting several components of constant density n H . Diffuse components (n H < n lim H ) have a size y diff dec ; dense components (n H n lim H ) have a distribution of sizes y dec y dens dec . Only components with densities larger than n tr H are molecular (see main text). The red star in the bottom panels indicates the mean value of N H /l los and N(H 2 )/l los computed over a large sample of lines of sight (white stars), hence the expected mean molecular fraction.

Interpretative modeling
The computation of column densities over various lines of sight and the subsequent kingfisher diagrams are the outcome of three main factors: the local conditions of the gas (n H , G eff , T , and the self-shielding) which control the local abundance of H 2 ; the probabilistic ordering of these local conditions along any random line of sight of size l los ; and finally, the distribution of sizes l los used for the reconstruction algorithm (see Sect. 3.8). In order to separate these effects, in particular the impact of local conditions from the probabilistic aspects, and propose a physical interpretation of the behaviors shown in this paper, we developed a semi-analytical approach. The resulting model and the confrontations of its predictions to the results of numerical simulations are described in details in Appendix C. To keep the paper concise, we only summarize here its basic ingredients and our main deductions.
Following the works of Vázquez-Semadeni & García (2001), Bialy et al. (2017), we assume that any line of sight can be understood as a succession of density fluctuations. These fluctuations are supposed to be fully correlated over a distance called the "decorrelation scale", and completely uncorrelated over larger distances. Because of the biphasic nature of the neutral gas, we adopt two different decorrelation scales: y diff dec if n H < n lim H and y dens dec if n H n lim H . The limit n lim H separating the diffuse and dense components is chosen as the inflection point between the two log-normal distributions classically found in the PH of the gas density (see Fig. C.1 of Appendix C). Analyzing the production of H 2 and the integration of column densities in this framework leads to the following conclusions.

The comparisons with numerical simulations performed at four different scales and for fifteen different simulations
show that the analytical model reproduces to an outstanding level the 1D PHs of the total column density N H (see Fig. C.2), assuming that the diffuse gas is correlated over a scale y diff dec = 0.2 L drive (i.e., 20 pc for the fiducial simulation), and that the dense gas is correlated over a scale y dens dec = 10 pc (n H /2 cm −3 ) 1/3 . Interestingly, the value of y diff dec is similar to that obtained by Bialy & Burkhart (2020) in a set of isothermal MHD simulations with different driving scales and Mach numbers. This confirms that the WNM, which fills most of the volume, behaves like an isothermal gas perturbed by a sustained turbulent forcing. 2. Similar comparisons performed on the 2D PHs of N H and N(H 2 ) (Figs. C.3 and C.4 and Sect. C.4) indicate that the dense gas is necessarily composed of a distribution of structures of different sizes. Indeed, while most of the mass and volume of the cold HI can be accurately modeled with a single scale y dens dec = 10 pc (n H /2 cm −3 ) 1/3 , H 2 is required to be built up in smaller components. y dens dec should therefore be considered as the maximum decorrelation scale of the dense gas. 3. The local production of H 2 mostly depends on the density: low density components are atomic while high density components are molecular. The threshold n tr H triggering the transition between the two regimes is set by the local self-shielding (induced by the component itself) and the large-scale selfshielding (induced by the surrounding environment). The local self-shielding alone in a component of size y dec y dens dec implies n tr H ∼ 8 cm −3 G 1/2 0 (y dec /10 pc) −1/2 .
The large-scale self-shielding can be seen as a stochastic process that lowers this limit: for L = 200 pc, n tr H is found to be reduced by a factor of two on average. 4. As schematized in Fig. 7, the distribution of normalized column densities N H /l los and N(H 2 )/l los obtained for a given l los can be divided in three categories. Lines of sight with low integrated molecular fraction ( f H 2 < 10 −4 ) exclusively contain components with n H < n tr H (case a). In contrast, lines of sight with high integrated molecular fraction ( f H 2 > 10 −2 ) necessarily intercept at least one large or several small components at high density n H > n tr H (case c). In spite of what intuition dictates, lines of sight with intermediate integrated molecular fraction (10 −4 f H 2 10 −2 ) do not result from components at moderate densities (n H ∼ n tr H ) but from the combination of low density material and a small number of small components at high density n H > n tr H (case b). 5. The proportions of lines of sight of types (a), (b), and (c) (Fig. 7) are given by the volume filling factors of the diffuse and dense gas and the length of the lines of sight l los . If l los is comparable to y diff dec and y dens dec , the 2D PH of N H /l los and N(H 2 )/l los is spread and contains a large number of lines of sight of type (a). Larger l los naturally favor lines of sight of types (b) and (c). If l los becomes large compared to to y diff dec and y dens dec , the central limit theorem applies. The spread PH described above is squeezed along the x and y axis as both N H /l los and N(H 2 )/l los progressively tend toward Gaussian distributions centered on the means (red star in Fig. 7).  Table 1) are also displayed.

Fiducial simulation
In Fig 8, we compare the observational dataset to the 2D PH of N H and N(H 2 ) (i.e., the kingfisher diagram) obtained with the reconstruction algorithm applied to the fiducial simulation for two resolutions, R = 64 3 and R = 512 3 . In each panel, the color code indicates, in logarithmic scale, the fraction of lines of sight predicted for any couple (N H ,N(H 2 )). Quantitative comparisons of the observed and predicted fractions of lines of sight, mean molecular fractions, and dispersions in the regions A, B, C, D, and E defined in Sect. 2 (see Table 1) are given in Fig. 9. Unexpectedly, the sample of lines of sight built from the fiducial simulation reproduces, to an outstanding level, the global trend of the HI-to-H 2 transition and its statistical properties. Without taking into account any possible variation of the parameters along the lines of sight or from one line of sight to the next, the structures induced by the joint actions of turbulence and thermal instability alone are found to produce a wealth of lines of sight whose probabilities of occurrence match those derived from the observations.
In particular, the integrated molecular fraction is predicted to have a bimodal distribution with a transition occurring at N H ∼ 3 × 10 20 cm −2 and extending over one order of magnitude of total column density. More quantitatively, the fractions of lines of sight simulated and observed in regions A, B, C, and D are found to differ by 50% at the most (see Fig. 9). Similarly, the observed and simulated mean molecular fractions and their corresponding dispersions are found to be comparable and to differ by less than a factor of three in region B, and less than a factor of two in regions C, and D. Because of the distribution of background sources, the reconstructed ensemble predicts a strong increase in probability from region C to D, which contains a large fraction of the entire sample, and a decrease in the dispersion of f H 2 as a function of N H . At last, the probability of occurrence of lines of sight with N H < 3 × 10 19 and f H 2 > 10 −3 or N H > 3 × 10 21 and f H 2 < 10 −3 are rare to nonexistent. All these features are also found in the observational dataset.
Notwithstanding, Figs. 8 and 9 also reveal a few discrepancies. Firstly, about 16 observed lines of sight (out of 360) lay at the border of the simulated distribution, in regions where the predicted probability is 10 −4 (or even smaller than 10 −5 for the white regions of Fig. 8), a value far smaller than the inverse number of observations ∼ 3 × 10 −3 . This implies that the simulation used here somehow fails to explain on its own part of the diversity observed in the Solar Neighborhood. Secondly, the mean molecular fractions predicted in regions A and B are found to be noticeably smaller than that derived from the observations, by about a factor of ten and three, respectively. It is important to note, however, that these molecular fractions deduced from the observations are probably overestimated as a third of the observed lines of sight contained in these regions correspond to upper limits on N(H 2 ). Finally, the simulated sample clearly shows that a substantial fraction of the lines of sight are in the region E, with an integrated probability of 4%. If the observational sample of 360 targets is unbiased, between 10 and 18 lines of sight should have been observed in this region, which is not the case. These discrepancy will be discussed in more details in Sect. 5.

Impact of the resolution
The comparison of the two panels of Fig. 8 shows that the kingfisher diagram is independent of the resolution over about one order of magnitude of scales, from 64 3 to 512 3 . Even if not systematically shown, this unusual result is not limited to the fiducial setup but is a general feature of all the simulations explored in this work (see Fig. D.2 of Appendix D for instance). Our interpretation is based on the analytical model presented in Appendix C and summarized in the previous section.
Evidently, high resolution simulations are important to accurately describe small scale structures. In particular, large resolution are required for modeling the formation of dense and gravitationally bound environments and follow their collapse. The fact that the kingfisher diagrams are independent of resolution for R 64 3 therefore suggests that the cold and dense structures between 0.4 and 3 pc have no influence on the distributions of N H and N(H 2 ) for the fiducial simulations and are insignificant in the mass and the volume budgets of H and H 2 . This is in line with the conclusions deduced from the analytical model. Indeed, as explained in the previous section, the turbulent forcing at large scale induces density fluctuations in the diffuse gas that extend over ∼ 20 pc and a distribution of dense structures with sizes smaller than ∼ 10 pc. While the total quantity of matter is inferred to be contained in the diffuse and the largest dense components, H 2 is exclusively built up in dense compo-nents smaller than ∼ 10 pc. The results of Fig. 8 combined with conclusions deduced from the analytical model therefore imply that the structures contributing the most to the mass and volume of H 2 are above 3 pc and below 10 pc for the fiducial simulation.
Beside the physical insights on the typical scales participating to the build-up of column densities, this result also provides a strong justification for using simulations with moderate numerical resolution for the study of the HI-to-H 2 transition. In all this work we therefore adopt a standard resolution of R = 256 3 unless indicated otherwise.
4.3. Impact of G 0 and n H Figs. 9 and 10 show comparisons between the observational dataset and the 2D PHs extracted from the simulations for 0.5 cm −3 n H 4 cm −3 , 0.5 G 0 4, and two sets of values of the box size and the turbulent forcing, L = 200 pc and F = 9 × 10 −4 kpc Myr −2 (Fig. 9) and L = 50 pc and F = 3.6 × 10 −3 kpc Myr −2 (Fig. 10). While the trend and the statistics of the HI-to-H 2 transition weakly depend on the resolution, they strongly depend on the total mass of the gas parametrized by n H and on the UV illumination factor. As G 0 increases or n H decreases, (1) the fraction of lines of sight with large f (H 2 ) drops to the benefit of lines of sight with low f (H 2 ), (2) the transition is shifted toward larger total column density and its width increases, and (3) the molecular fraction globally decreases over all lines of sight while its dispersion increases. Interestingly, and because n H and G 0 have opposite effects, the simulations that reproduce the most accurately the observed statistics of the HI-to-H 2 transition follow a trend with G 0 /n H ∼ 0.5 − 1. While similar, the effect of these two parameters are, however, not symmetrical. In particular, n H has an obvious and strong impact on the fraction of lines of sight in region E, regardless of G 0 . Likewise, the fraction of lines of sight at low column densities (regions A and B) and the mean molecular fraction at large column densities (regions C and D) are not constant for a given G 0 /n H ratio but respectively decrease and increase with n H . All these properties effectively break the degeneracies between the two parameters. All things considered, the tightest concordance between observed and simulated data is obtained for n H ∼ 1 − 2 cm −3 , in agreement with the Galactic midplane density deduced from HI surveys in the Solar Neighborhood (Kalberla & Kerp 2009).
At first sight, the results described above seem obvious as they somehow mimic the dependencies of the HI-to-H 2 transition found with detailed models of photodissociation regions (e.g., Krumholz et al. 2009;Sternberg et al. 2014). Such an interpretation is, however, a dangerous misconception. Indeed, while G 0 is tightly linked to the effective radiation field G eff that locally illuminates the gas, the mean density n H should not be mistaken for the local density. Moreover, the results displayed in Fig. 9 and 10 cannot be compared to PDR models because they are statistical in nature. For instance, simulations at high G 0 do not preclude the existence of clouds with high molecular fractions. Indeed, increasing G 0 may lead to denser local environments with larger molecular fractions whose probability of occurrence along a line of sight is simply reduced. It implies that the results shown here are very different from PDR model predictions. They rather reflect the complex link between the global properties of the simulation (mass, illumination, driving scale) on the one side, and the local conditions and their probability distribution functions on the other side.
Because the local abundance of H 2 is inversely proportional to G 0 , increasing G 0 naturally reduces the local self-shielding. Similarly, increasing G 0 or decreasing n H reduce the large-scale  Table 1. Numbers correspond to the values of %, µ, and σ. The color code indicates a measure of distance (in arbitrary units) between the observed and simulated values in order to guide the eye. These comparisons are shown in each frame for 15 different simulations with G 0 varying from 0.5 (left panels) to 4 (right panels) and n H varying from 0.5 cm −3 (top panels) to 4 cm −3 (bottom panels). All other parameters are set to their fiducial values (see Table 2). self-shielding. As a result, the density threshold n tr H required to produce highly molecular environments (see item 3. of Sect. 3.9) rises by a factor of three when either G 0 is multiplied or n H is divided by a factor of eight. While significant, such an effect on the local conditions of the gas is, however, too shallow to fully explain the variations observed in Figs. 9 and 10.
Indeed, regardless of local conditions, G 0 and n H have a major impact on the probabilistic reconstruction of lines of sight. As shown in Sect. 3.7, increasing G 0 or decreasing n H strongly reduce the fractions of mass and volume occupied by the dense and cold gas. This not only reduces the molecular fraction averaged over the entire simulation (red star in Fig. 7) but also favors the occurrence of lines of sight with low or intermediate f H 2 (case (a) and (b) in Fig. 7). When combined with the distribution of sizes l los , the HI-to-H 2 transition is naturally shifted toward larger N H , and the asymptotic molecular fraction at high N H drops. Moreover, because the central limit theorem requires larger lines of sight to apply, the HI-to-H 2 transition is naturally wider and the dispersion of lines of sight at large molecular fraction increases. This final property is particularly well seen in the kingfisher diagram obtained for the simulation at G 0 = 0.5 and n H = 4 cm −3 where most of the lines of sight follow an homothetic transformation of the mean normalized column densities N H /l los and N(H 2 )/l los (red star in Fig. 7).

Impact of the box size L
The impacts of the box size revealed by comparing Figs. 9 and 10 partly follow the results of the previous section. Reducing L by a factor of four drastically reduces the total amount of matter in the simulation, hence the absorption of the impinging UV radiation field and the large-scale self-shielding. As shown in Sect. 3.7, the mean pressure of the gas rises while the mass and volume occupied by the CNM decreases. All the local and statistical effects described in the previous section therefore apply and modify the kingfisher diagrams accordingly. Changing L has, however, two additional and specific consequences.
Because L drive is four times smaller in the simulations displayed in Fig. 10 than in those displayed in Fig. 9, the decorrelation scales y diff dec and y dens dec are correspondingly smaller. According to the interpretation given in Sect. 3.9, all the reconstructed lines of sight are therefore considerably larger than individual density fluctuations. This not only favors the occurrence of lines of sight at intermediate and high molecular fractions (cases (b) and (c) of Fig. 7) but also magnify the impact of the central limit theorem (see item 5. of Sect. 3.9), as already illustrated by . The kingfisher diagrams shown in Fig. 10 are thus squeezed along the x and y axis compared to those of Fig. 9. Consequently, the simulations at L = 50 pc predict almost no line of sight at low total column density (N H 3 × 10 19 cm −2 ) and systematically underestimate the proportion of line of sight in region A compared to the observations.
Because it controls the total amount of matter, changing L has finally a major effect on gravitational forces. For L = 50 pc, the size of the largest dense clouds are almost always smaller than the Jean length computed in the CNM. Consequently, selfgravity plays almost no role in the physical state or the evolution of the gas for all simulations at L = 50 pc. The specific impact of gravity on the probability histograms of column densities is described in more details in the following section.

Impact of gravity
To facilitate the comparison between simulations and observations and avoid a pedestrian repetition of the kingfisher diagrams, we developed a modified version of the Kolmogorov-Smirnov test. This test, fully explained in Appendix D and val- Fig. 11: KS distance between the simulations and the observational sample as a function of the mean density n H and the UV scaling factor G 0 = 0.5 (red), 1 (green), 2 (orange), and 4 (blue). Results obtained with and without gravity are shown with solid lines and dashed lines, respectively. All other parameters are set to their standard values (see Table 2). Points correspond to reliable measurements of the KS distances. Triangles indicate lower limits corresponding to simulations where the upper error bar on R KS tends toward infinity (see Appendix D).
idated over 30 different simulations, defines a value R KS , called the KS distance, that measures how two 2D PDFs (or 2D PHs) differ from one another. In a nutshell, any observational datapoint in a 2D diagram can be used to divide the diagram into four different regions which each contain different fractions of observed and simulated data. The modified KS test simply searches for the observational datapoint and the region that maximize the ratio between the fractions of simulated and observed lines of sight. The distance R KS is the absolute value of the logarithm of this ratio: A value R KS = 1 therefore implies that some region in the 2D diagram contains tens times more or ten times less simulated data than required to explain the observations, and that all the other regions have smaller ratios.
The results of the KS test applied to 15 simulations run with and without gravity are shown in Fig. 11. As expected from Sect. 4.3, the KS distance strongly depends on both n H and G 0 . In comparison, gravity appears to have a limited impact on the kingfisher diagrams. While including gravity seems to slightly reduce the KS distance, the trends as functions of n H and G 0 and the set of simulations found to minimize R KS remain unaltered. These results can be understood by simple statistical considerations.
The impact of gravity in multiphase simulations is to produce self-gravitating environments which appear as a high density tail in the PDF of the gas density. Because these self-gravitating clumps are dense and fully molecular, they often dominate the integrated column densities of both HI and H 2 along any line of sight that intercept them, and therefore favor lines of sight at high molecular fraction (case (c) in Fig. 7). This not only induces a tail at high N H and N(H 2 ) in the kingfisher diagram but may also contribute to shift the HI-to-H 2 transition to lower N H as faint but highly molecular lines of sight starts to appear. The importance of these two effects depends on the area filling factor of self-gravitating clumps and on whether they carry or not a substantial fraction of the mass of the dense gas. For n H = 0.5 cm −3 , self-gravitating clumps occupy less than 0.001% of the entire volume and carry less than a percent of the mass of the gas. These fractions increase, however, as functions of n H : for n H = 4 cm −3 , self-gravitating environments occupy 0.01% of the volume and contain as much as 30% of the total mass. Therefore, while the impacts of gravity on the kingfisher diagram is negligible for most simulations, they become important at high n H : this property is effectively captured by the KS distances displayed in Fig. 11. 4.6. Impact of turbulent forcing Fig. 12: KS distances between the simulations and the observational sample computed for five values of the acceleration parameter F = 4.5 × 10 −5 , 1.5 × 10 −4 , 4.5 × 10 −4 , 1.5 × 10 −3 , and 4.5 × 10 −3 kpc Myr −2 , and three values of the compressive ratio ζ = 0.1 (blue points), 0.5 (orange points), and 0.9 (green points), which set the balance between compressive and solenoidal forcing (see Sect. 3.3). All other parameters are set to their standard values (see Table 2). Points correspond to reliable measurements of the KS distances. The triangle indicates a lower limit corresponding to a simulation where the upper error bar on R KS tends toward infinity (see Appendix D).
As done in the previous section, the impact of the turbulent forcing is discussed through the results of the Kolmogorov-Smirnov test. The KS distance obtained for various configurations of the turbulent forcing (Fig. 12) shows that the strength of turbulence affects differently the HI-to-H 2 transition depending on its nature: highly compressive turbulent forcing produces virtually identical column density distributions over almost two decades of the turbulent acceleration parameter F; oppositely the strength of the forcing significantly modifies these distributions if a substantial fraction of the kinetic energy is injected in pure solenoidal modes. The tightest agreement with the observational sample is obtained for F ∼ 1.5 × 10 −4 kpc Myr −2 if ζ 0.5 and for all F 1.5 × 10 −4 kpc Myr −2 if ζ = 0.1. According to Sect. 3.7 (Fig. 4), these values corresponds to a WNM velocity dispersion σ tur ∼ 2 km s −1 if ζ 0.5, and σ tur 2 km s −1 for ζ = 0.1. In any case, a small amount of turbulence is always required.
All these characteristics are consequences of the fractions of volume and mass contained in the CNM structures and their size distribution. Without turbulence or with a weak turbulent forcing, CNM clouds are found to evolve toward large-scale entities which evaporate slowly. Such a configuration favor lines of sight at high molecular fraction (type (c) in Fig. 7) and leads to an overestimation of the global amount of H 2 in the local diffuse ISM. If the turbulent forcing increases, the CNM becomes progressively organized into a distribution of structures of dif-ferent sizes down to the numerical resolution. Simultaneously, the fraction of mass located in the CNM phase diminishes to the benefit of the LNM (see Sect. 3.7). Both effects reduce the mean molecular fraction of the gas and favor the occurrence of lines of sight of type (b) (see Fig. 7), in better agreement with the observational sample. Larger turbulent forcing ultimately lead to an underestimation of the global amount of H 2 in the local ISM. However, and as shown in Sect. 3.7, this last effect is much more pronounced for a pure solenoidal forcing which efficiently prevents the unstable gas to condensate back to cold and dense environments compared to a pure compressive forcing.
Here again, the probabilistic information contained in the kingfisher diagram proves to provide a valuable tool to analyze the nature of turbulence in the diffuse local ISM. Indeed, the velocity dispersion deduced from the comparison of observations and simulations at high solenoidal forcing is substantially smaller than the velocity dispersion observed at high Galactic latitude (see Sect. 3.7) and the velocity dispersion deduced from the observations of CO at a scale of 200 pc (Hennebelle & Falgarone 2012). The fact that the observed statistics of the HI-to-H 2 transition is reproduced over a broader range of velocity dispersion for a compressive forcing suggests that the large-scale turbulence of the diffuse local ISM is dominated by compressive modes. This picture is coherent with the results of Saury et al. (2014) who found that a large-scale compressive forcing induces a distribution of thermal pressure in excellent agreement with that derived from the observations of the fine structure lines of CI (Jenkins & Tripp 2011). It is also coherent with the fact that 200 pc corresponds to the width of the Galactic disk seen in CO and therefore to the limit above which the equipartition between compressive and solenoidal modes switches from the values expected in a 3D fluid to those expected in a 2D medium, hence values of ζ smaller than 0.5. At last, it concurs with the conclusions of Iffrig & Hennebelle (2017) who found that compressible modes dominates at low altitudes, close to the equatorial plane, in simulations where the ISM turbulence is selfconsistently driven by supernovae explosions.  Table 2). Points correspond to reliable measurements of the KS distances. The triangle indicates a lower limit corresponding to a simulation where the upper error bar on R KS tends toward infinity (see Appendix D).

Impact of the magnetic field
The impact of the initial magnetic field on the kingfisher diagram simply reveals the competition between thermal instability which induces the production of dense environments, and the magnetic pressure which acts against this evolution. As shown in Fig. 13, the KS distance obtained for different magnetic field intensities is found to be constant until a critical value of B x ∼ 4 µG above which the predicted and observed distribution of N H and N(H 2 ) significantly differ from one another. The initial homogeneous magnetic field adopted in the standard simulation, B x = 3.8 µG, is just under this critical value and corresponds to the case where the energy density of the large scale coherent magnetic field and the thermal energy density of the WNM are equivalent. It follows that reproducing the observed 2D PH of N H and N(H 2 ) requires a magnetic field intensity below or at equipartition.
Interestingly, the value of B x adopted for the fiducial simulation leads, at steady-state, to a constant magnetic field intensity B ∼ 5 µG for n H < 10 cm −3 , and a field intensity that scales as B ∝ (n H ) 0.3 for n H > 10 cm −3 . Those values are comparable with the magnetic field intensities obtained in diffuse and molecular environments from Zeeman measurements (Crutcher et al. 2010) and those obtained in the most diffuse phases from Faraday rotation measurements toward extragalactic radio sources (e.g., Frick et al. 2001). It appears that these information on the interstellar magnetic field are encoded, at least partly, in the statistical properties of the HI-to-H 2 transition.

Observational biases
All the results presented in Sect. 4 are discussed under the assumption that the observational dataset is unbiased, meaning that the underlying lines of sight correspond to a random sample with no selection effect. This is not true. As Krumholz et al. (2008) already noticed, "the FUSE and the Copernicus lines of sight have been specifically chosen to probe a certain range of column densities with a selection biased against high extinction which makes determining column densities very costly or altogether impossible." Indeed, very few stars emit a UV radiation field strong enough to be detected through large visual extinction material. The problem is not due to the increase in H 2 absorption, which can be overcome by focusing on fainter bands, but to dust absorption itself and the complication of the structure of NaI often used as a proxy to derive the column density of HI . Moreover, because such bias depends on the sensitivity of the instrument, it necessarily applies at different column densities for FUSE and Copernicus.
This selection effect obviously complicates the comparison between simulations and observations. As shown in Sect. 4.3 and Figs. 9 and 10, several simulations, including the fiducial setup, predict a significant fraction of lines of sight at high column densities (N H 10 22 cm −2 , region E), in apparent contradiction with the observations. However, the fact that the fraction of stars dismissed by selection is unknown makes it difficult to assess whether this result reveals an actual and important statistical discrepancy or a simple observational limit. It implies that the likelihood of a simulation to be representative of the local diffuse gas cannot be estimated from this criterion alone. It must involve other observational signatures such as the average and the dispersion of the molecular fraction in regions A, B, C, and D, or even the chemical and statistical signatures of other atomic and molecular lines. This latter aspect is currently under development and will be the subject of the next paper of this series.
The fact that the limitations differ between FUSE and Copernicus surveys finally raises the question of the validity of studying the two samples simultaneously. We find, however, that performing comparisons with simulations on the two samples separately gives very similar results and does not impact any of our conclusions. It is so because the observational bias discussed above occurs at an extinction which increases as the natural logarithm of the square root of the instrument sensitivity. The largest total column density probed by FUSE is thus only three times larger than that observed by Copernicus (Gillmon et al. 2006;Rachford et al. 2009). Moreover, the number of lines of sight observed by FUSE that are above the maximum extinction seen by Copernicus corresponds to a small fraction of the entire sample (Gillmon et al. 2006).

Variations of physical conditions in the local ISM
The reconstruction of the simulated sample of lines of sight and the subsequent comparisons with the observations are done assuming that the local ISM can be built out of a single simulation (see Sect. 3.8). This approach was chosen in order to highlight the natural variations induced by turbulence and thermal instability alone in a diffuse neutral gas with a known averaged density and UV illumination factor. However, because the medium probed by the observations extends in all directions around the sun over a maximum distance of 3 kpc (see Fig. 2), it stands to reason that potential variations of all parameters should be taken into account, not only from one line of sight to another but also along a single and outstretched line of sight. Indeed, such considerations would offer a natural explanation for observations whose existence is in contradiction with the corresponding simulated probability of occurrence (see Sect. 4.1).
The total proton mass surface density deduced from HI and CO all sky surveys appears to be rather constant in the Galactic layer located between 5 and 12 kpc from the Galactic center ( Fig. 9 of Miville-Deschênes et al. 2017). Taking into account variations of the ISM scale height above the Galaxy (Kalberla & Kerp 2009), the midplane proton density is expected to vary by less than a factor of two over the corresponding volume.
Using the Galactic star distribution of Wainscoat et al. (1992) and the grains composition and size distribution of Weingartner & Draine (2001), Porter & Strong (2005) and Moskalenko et al. (2006) estimated the energy density of the radiation field across the Galaxy. Similarly to the midplane density, the mean UV radiation field is estimated to vary by a factor two over the volume considered in this paper (Fig. 2 of Porter & Strong 2005). Interestingly, this estimation is far smaller than the variations derived by Jenkins & Tripp (2011) (Fig. 8) from the observation of the fine structure line of CI in the local gas. Such discrepancies could be explained by the fact that Jenkins & Tripp (2011) perform local measurements: the observed gas could thus be located close to or far from an irradiating star. Alternatively, we note that the results obtained by Jenkins & Tripp (2011) are derived from models at equilibrium which do not take into account the uncorrelated perturbations of pressure and density in a turbulent multiphase medium: this naturally favors large fluctuations of the UV radiation field.
The cosmic ray ionization rate inferred from submillimeter observations of several molecular ions, including OH + , H 2 O + , ArH + , and H + 3 , shows a wide distribution across the Galactic disk (Indriolo et al. 2015). A recent estimation performed by Neufeld & Wolfire (2017) suggests that this rate could vary by a factor of five in the gas located between 5 and 12 kpc from the Galactic center.
Finally, potential variations of the composition and the size distribution of grains in the local diffuse gas should also be considered. While the extinction curve is found to be surprisingly uniform in the Milky Way (Schlafly et al. 2016), the local distribution of grains could change, not only along the line of sight but between the atomic and ionized phase and the molecular clouds. This would modify the efficiency of the photoelectric effect and the equilibrium between the two stable states of the neutral ISM, and would also have an impact the H 2 formation rate.
Interestingly, if uncorrelated, the expected variations of n H and G 0 alone would help to explain the slight discrepancies described in Sect. 4.3 but it would also lead to a dispersion of lines of sight far larger than those observed (see Figs. C.2 and C.3). The fact that the predicted statistics of the HI-to-H 2 transition is close to the observed sample therefore suggests that the variations of all the parameters described above cannot be considered independently but must follow strong correlations which apply locally (as discussed, for instance, by ) but also across the Galactic disk.

Fraction of ionized gas
The fraction of volume ϕ occupied by the ionized phases of the ISM, the warm ionized medium (WIM) and the hot ionized medium (HIM), plays an important role in our reconstruction algorithm. As illustrated in Fig. 5, this parameter controls the filling factor of the neutral medium along a line of sight of length l los . Unfortunately, its value in the Milky Way is highly uncertain and still debated.
The consensus is that the volume of the HIM far exceeds that of the WIM and results from an interplay between supernovae explosions, which regularly produce hot gas in the Galactic disk, and buoyancy, which lifts this gas into the halo, releasing the pressure in the midplane. Early analytical studies neglecting buoyancy (McKee & Ostriker 1977) predicted a large fraction of HIM in the midplane (ϕ ∼ 95%). In contrast, early numerical simulations, including the cycle of matter and energy between the disk and the halo, led to considerably smaller predictions with ϕ ∼ 25% (de Avillez & Breitschwerdt 2004). This value is now considered as a lower limit by the most recent numerical simulations which reveal the importance of the driving mode of supernovae explosions (Walch et al. 2015) and of the photoelectric heating (Li et al. 2015;Hill et al. 2018) on the volume of the HIM. These latest studies estimate that 20% ϕ 90% in the Galactic midplane.
Highly uncertain, the volume filling factor of the HIM can also vary from one line of sight to another (Fig. 1 of Hill et al. 2018). In this paper, we adopt a constant and conservative value ϕ = 0.5 for every line of sight. Changing ϕ would have the effect of either squeeze or stretch the predicted 2D PHs displayed in Fig. 9 and 10 along the x axis, and to modify the balance of probabilities of occurrence of lines of sight at high and low molecular fraction. Taking into account a realistic distribution of ϕ would require to simulate the Galactic disk and halo over a scale of several kiloparsecs and to properly model and follow the impact of supernovae explosions. This is far beyond the scope of the present paper.

Galactic vertical structure
The simulated 2D PH of the HI-to-H 2 transition are found to slightly depend on the Galactic gravitational potential. This odd result is nothing but an artifact of the physical scales considered here. Since the sizes of all simulations are below 200 pc, the gas expands, at the most, over 100 pc above the Galactic plane, a distance far smaller than the characteristic scale of variation of the thermal pressure expected for a turbulent gas in hydrostatic equilibrium. It follows that the column densities show no significant variation as a function of the position of the line of sight or its angle with the Galactic plane.
This setup, initially chosen to favor the physical resolution, is a strong shortcoming which prevents us from using and studying the information carried by the Galactic latitude of each observation. Indeed, the comparison between the FUSE halo survey and the data collected by FUSE and Copernicus in the Galactic disk indicates that the HI-to-H 2 transition at high latitude occurs at a total hydrogen column density ∼ 2 times smaller than that in the disk (Gillmon et al. 2006). Similarly to the previous section, studying this effect would requires to compute the local vertical structure of our Galaxy over several hundreds of parsecs, taking into account the hot and ionized component of the ISM.

Doppler broadening parameter
The self-shielding of H 2 depends on the dispersion of the Lyman and Werner lines which is usually modeled with a turbulent Doppler broadening parameter b D (Draine & Bertoldi 1996 and Eq. 16). In single cloud models, this parameter has the effect of shifting the HI-to-H 2 transition toward larger total column density without modifying any of the asymptotic states (e.g., Bialy et al. 2017). In this paper, we identify this parameter with the turbulent velocity dispersion of the gas at large scales and therefore adopt b D = 8 km s −1 for the fiducial simulation. This is done to prevent an overestimation of the H 2 self-shielding at large scales, at the cost of underestimating the self-shielding at the scale of a CNM clouds.
To estimate the effect of this parameter, all the grids presented in this paper were also run assuming b D = 2 km s −1 , which roughly corresponds to the velocity dispersion expected at a scale of 10 pc for the fiducial simulation. While locally important, b D is found to have a relatively small impact on the 2D PHs of N H and N(H 2 ): increasing b D by a factor of four slightly increases the width of the HI-to-H 2 transition and the fraction of lines of sight in region B. We interpret this limited effect as a consequence of the fact that the asymptotic molecular states of any cloud are independent of b D .
Even so, it should be stressed that b D has a strong impact on the local molecular fraction in transition regions. Therefore, and while inconsequential for the global statistics of H and H 2 , the value of the Doppler broadening parameter might be paramount for any chemical species preferentially formed at the border of molecular clouds. As proposed by , the H 2 self-shielding could be computed self-consistently using the velocity and density fields of the simulation and a cost effective radiative transfer method. This would prevent the dilemma of favoring large-scale or small-scale self-shielding.

H 2 self-shielding at high temperature
The prescription of H 2 self-shielding used in this paper (Eq. 16) is taken from Draine & Bertoldi (1996). As discussed by Wolcott-Green et al. (2011), such a prescription is reliable for diffuse gas at low temperature but becomes less and less reliable for high temperature environments (T > 500 K) where efficient collisional excitation of H 2 in its rovibrational levels reduces the self-shielding. To estimate the impact of this process, we ran the fiducial simulation with the alternative self-shielding function proposed by Wolcott-Green et al. (2011) (Eq. 12). This prescription leads to a similar probability histogram and therefore does not influence the global analysis of the kingfisher diagram presented in this paper. However, we note that adopting this alternative prescription slightly increases the width of the HI-to-H 2 transition, and induces more lines of sight at intermediate molecular fraction (region B, see Fig. 3), in better agreement with the observations.

Summary & conclusions
This paper presents an exhaustive parametric study of the transition from atomic to molecular hydrogen in the local diffuse ISM. Using state-of-the-art MHD simulations, and an ensemble of 305 runs, we quantify separately the impacts of the mean density, the UV radiation field, the integral scale, the resolution, the turbulent forcing, the magnetic field, and the gravity on the molecular content of multiphase environments. The original feature of this work is to not only focus on the production of individual column densities but also on their statistics, meaning the probabilities of occurrence of these column densities along random lines of sight. For the first time, both the chemical and statistical information are used concomitantly, through the so-called kingfisher diagrams, to interpret the distribution of H and H 2 observed toward 360 lines of sight across the local interstellar matter.
The results of the simulations are interpreted with a semianalytical model which attempts to separate the effects of local conditions from those induced by the probabilistic reconstruction of individual lines of sight. To compare the kingfisher diagrams to the observational sample, we propose a new version of the Kolmogorov-Smirnov test which can be generalized and used for the comparison of two probability histograms or distribution functions in any dimension larger than one.
Taking into account the distance of each background source and simulating random lines of sight over the same distribution of distances is paramount to explain the range of observed column densities and their corresponding statistics. Once this aspect is included, the joint actions of thermal instability and largescale turbulence in the standard simulation are found to produce a wealth of lines of sight which reproduce the observed position and width of the HI-to-H 2 transition, and whose probabilities of occurrence match those derived from the observations. The agreement is so remarkable that it is almost unnecessary to invoke variations of physical conditions along lines of sight or from one line of sight to another.
The minimal KS distance obtained over the entire grid is ∼ 0.5. Such a value implies that there exists a small group of lines of sight in the observational sample whose probability of occurrence is under-or over-predicted by about a factor of three. However, it also implies that the probability of occurrence of any other group of observed lines of sight, small or large, is reproduced to a better level.
The distribution of column densities computed from the simulations strongly depends on the Galactic midplane density parametrized by the mean density n H , the density of OB stars parametrized by the UV scaling factor G 0 , and the scale of neutral diffuse clouds parametrized by the box size L. It is so because these three parameters not only regulate the mean pressure of the gas, hence the fractions of mass and volume occupied by the CNM and WNM, but also control the typical scale of density fluctuations in the WNM and the distribution of sizes of the CNM structures where H 2 forms. The tightest concordance between the observed and simulated samples is obtained for a mean density n H = 1 − 2 cm −3 and a UV radiation field scaling factor G 0 = 1 (in Habing units), in good agreement with the values deduced from HI and CO all sky surveys and from direct observations of the UV radiation field in the Solar Neighborhood. The range of observed column densities of H and H 2 requires a box size L = 200 pc which corresponds to the estimated scale of HI superclouds.
Within this setup, the column densities of HI are inferred to be built up in large-scale WNM and CNM structures correlated in densities over ∼ 20 pc and ∼ 10 pc, respectively. In contrast H 2 is inferred to be built up at smaller scales. However, the fact that the kingfisher diagram is independent from the resolution of the simulation suggests that most of the mass and volume of H 2 is contained in CNM structures between ∼ 3 and ∼ 10 pc. All these values are given for the standard simulation (L = 200 pc and L drive ∼ 100 pc) but naturally depend on the size of the box and the mean density of the gas.
In spite of the strong influences of n H , G 0 , and L, the statistical properties of the HI-to-H 2 transition are otherwise remarkably stable. Admittedly, the kingfisher diagram depends on the strength of the turbulence if most of the forcing is injected in solenoidal modes; however such a configuration prevents to reproduce the observational sample unless the large-scale velocity dispersion of the gas is unrealistically small. In contrast, if most of the kinetic energy is injected in compressive modes, the kingfisher diagram is found to weakly depend on the strength of the forcing. Similarly, the HI-to-H 2 transition is almost not affected by gravity and is found to weakly depend on the Doppler broadening parameter and the strength of the magnetic field, as long as B x 4 µG. The 2D PH of the column densities of H and H 2 is therefore a valuable tool to constrain the nature of the turbulent forcing at large scale; however, it provides few or no information regarding the velocity dispersion of the gas, the amount of gravitationally bound environments and the strength of the magnetic field. Other observational tracers are required.
All these results open new perspectives for the study of the chemical state of the ISM in which any observation must be understood through the combination of local physical conditions and the probabilistic ordering of these conditions along the line of sight. In particular, similar studies should be applied to all atomic and molecular species with observational samples large enough to conduct statistical analysis. It also invites to expend the study of PHs to higher dimensions, taking into account simultaneously the joint information contained in the column densities of several species. In this context, the generalization of the Kolmogorov-Smirnov test proposed in this paper will be very valuable. All these aspects are currently under development and will be the subjects of the following papers of this series.

Appendix A: Observations of HI and H 2 in the local diffuse ISM
The complete observational dataset used in this work is given in Table A.1, including the sources identifiers, the coordinates, and the column densities of HI and H 2 . The column densities of H 2 are obtained through direct observations of its FUV absorption lines. The sample presented in Table A.1 therefore results from a combination of the Copernicus survey of nearby stars (e.g., Savage et al. 1977;Bohlin et al. 1983) and the FUSE surveys of the Galactic disk and the Galactic halo (e.g., Lehner et al. 2003;Cartledge et al. 2004;Pan et al. 2004;Gillmon et al. 2006;Jensen & Snow 2007b,a;Rachford et al. 2009). In contrast, and as reviewed by Fruscione et al. (1994), the HI column densities are derived from both direct methods, which include Lyα absorption studies, EUV observations of stellar spectra, and observations of the 21cm line, and indirect methods, which include curve of growth of neutral and singly ionized atoms and optical interstellar absorption lines of NaI which are both found to correlate with N(H) (de Boer et al. 1986;Ferlet et al. 1985). The column densities of HI given in Table A.1 therefore result from a combination of FUV and optical studies for nearby stars (e.g., Bohlin et al. 1978Bohlin et al. , 1983Diplas & Savage 1994;Fitzpatrick & Massa 1990;Jensen & Snow 2007b,a ) and radio studies of the 21cm line for extragalactic sources at high Galactic latitude (Wakker et al. 2003;Gillmon et al. 2006). As reported by all these authors, the indirect methods relying on the observations of metals can be subjects to uncertainties mostly due to the assumptions made regarding the elemental abundances. In addition, the column densities of HI derived from the emission profiles of the 21cm line can also be highly uncertain because the measurements are done over a beam far larger than the pencil-beam sampled by H 2 data but also because it requires to identify in the HI profiles the components associated to the molecular gas. It should thus be kept in mind that while the errors on the column densities of H 2 are somehow limited, those on HI can be sometimes larger than a factor of five, in particular for the lines of sight at high Galactic latitude (Gillmon et al. 2006). The color excess E(B − V) given in Table A.1 finally results from a compilation which includes direct measurements of the star reddening compared to its intrinsic (B − V) 0 color (e.g., Savage et al. 1977;Fitzpatrick & Massa 1990;Diplas & Savage 1994;Rachford et al. 2002) and dust emission maps from the IRAS telescope 4 (Schlegel et al. 1998).
With all these data at hand, we adopt the following methodology to derive the total proton column densities N H : if the column densities of HI and H 2 are available, then N H is computed as N(H) + 2N(H 2 ); if not, N H is derived from the reddening E(B − V) as N H = 5.8 × 10 21 E(B − V) cm −2 assuming a standard Galactic extinction curve and the average interstellar ratio (Fitzpatrick & Massa 1986;Fitzpatrick 1999). It should be noted that the N H /E(B − V) ratio observed at low column density is larger than the standard value used in this work (Liszt 2014;Lenz et al. 2017). Because of this and the uncertainties on the HI column densities discussed above, the values of N H derived here should be considered as estimates. Examples of the underlying uncertainties on N H can be seen in Table A.1 where N(H) sometimes exceeds slightly the total column density derived from E(B − V). Table A.1: Observational dataset used in this work. The distance of each source is computed from the parallax measured by Gaia if the data is given in the DR2 catalog (Gaia Collaboration et al. 2018); otherwise, the distance of the source is taken from Gudennavar et al. (2012). Column densities are expressed in cm −2 . The total proton column densities N H are computed as N(H) + N(H 2 ) if the column densities of HI and H 2 are available, or derived from the reddening E(B − V) as N H = 5.8 × 10 21 E(B − V) cm −2 assuming a standard Galactic extinction curve and the average interstellar ratio R V = A V /E(B − V) = 3.1 (Fitzpatrick & Massa 1986;Fitzpatrick 1999       (1) York (1976); (2) Savage et al. (1977); (3) Bohlin et al. (1978); (4) Hobbs (1978); (5) Neckel & Klare (1980); (6) Federman (1982); (7) Bohlin et al. (1983); (8) Shull & van Steenberg (1985); (9) Savage et al. (1985); (10)  The analytical equations of the heating and cooling rates used in this work are taken from Wolfire et al. (1995) and Wolfire et al. (2003).
The electronic density is calculated from the ionization equilibrium where ζ CR is the total ionization rate (including primary and secondary ionizations) of H by soft X-rays and cosmic ray particles (expressed in unit of 10 −16 s −1 ) and G eff is the local effective radiation field in Habing units (see Sect. 3.2). φ PAH is a recombination parameter of electrons on PAHs, discussed in Wolfire et al. (2003), and set to 0.5. x C + is the abundance of C + relative to n H , x C + = n(C + )/n H . Throughout this work, we adopt x C + = 1.4 × 10 −4 , which corresponds to the value derived in the Solar Neighborhood assuming 40% depletion of carbon onto grains and that the remaining carbon is singly ionized. This equation for the density of electrons differs from that of Wolfire et al. (2003) by the addition of C + which is the most abundant ion in the diffuse and transluscent CNM (Snow & McCall 2006). Following Wolfire et al. (2003), we include the heating induced by the photoelectric effect on grains and PAHs and by cosmic ray ionizations. The former is modeled with a rate where the heating efficiency The latter is modeled with a rate Γ CR ∼ 10 −27 ζ CR 10 −16 s −1 erg s −1 . (B.5) Regarding the cooling, we include the fine-structure emission of CII and OI, the emission of Lyman α photons by HI, and the recombination of electrons onto charged grains and PAHs. The cooling rate due to collisional excitation of the fine-structure levels of C + by atomic hydrogen and electrons is given by Λ CII = 2.25 × 10 −23 + 10 −20 T 100 K −0.5 n e n H e −92/T x C + erg cm 3 s −1 . (B.6) The cooling rate by collisional excitation of the fine-structure level of OI by atomic hydrogen is computed as where x O = n(O)/n H is the relative abundance of atomic oxygen. Throughout this paper, we adopt x O = 3.2 × 10 −4 , which corresponds to the value derived in the Solar Neighborhood assuming 37% depletion of oxygen onto grains and that the remaining oxygen is in its atomic form. Those two lines are the dominant cooling terms of the CNM phase. The cooling induced by the excitation of the Lyman α line, which is the dominant cooling at T 8000 K (WNM), is taken from (Spitzer 1978) Λ HI = 7.3 × 10 −19 x e e −118400/T erg cm 3 s −1 . (B.8) Finally, the cooling rate due to electron recombination onto charged grains and PAHs is set to  Table 2). The black histogram correspond to the extracted data. The red dashed curve shows an example of the sum of two log-normal components and a power-law tail at high density, for comparison. The blue line indicates the inflection point of the PH between the diffuse and the dense components.
In order to interpret the results found in Sect. 4, we propose a semi-analytical prescription to predict the 1D and 2D PHs of the total column density and the column density of H 2 obtained with numerical simulations. This prescription is based on the work of Vázquez-Semadeni & García (2001), Bialy et al. (2017), and  who showed that lines of sight across isothermal simulations of turbulence can be accurately modeled as a series of random density fluctuations.

Appendix C.1: Decorrelation scales
In Fig. C.1, we display an example of the volume-weighted distribution of the proton density computed for the fiducial simulation (see Table 2). The total column density integrated along the x direction over a random line of sight of size l L is where R is the resolution of the box of size L (see Table 2), i 0 is a random starting index for the integration of the column densities, and i l is the final index deduced from l. Because of spatial correlations of the density n H , this computation is not equivalent to the sum of random realizations of n H drawn out of the 1D probability distribution (Fig. C.1). It depends, instead, on how and over which distance the values of n H are correlated. As proposed by Bialy et al. (2017), and , and since the correlations of density in a turbulent medium decrease over long distances, we assume that these correlations can be modeled with a parameter y dec , called the decorrelation scale. The density is supposed to be constant over distances smaller than y dec , and uncorrelated over larger distances. In this framework, if y dec is the same for all densities, the total column density integrated over a distance l becomes equivalent to the sum of 1 + l/y dec random realizations of n H . In isothermal simulations, the 1D probability distribution of the gas density is found to follow a lognormal distribution with a dispersion proportional to the Mach number, dependent on the nature of the turbulent driving, and independent on the mean density of the gas (e.g. Padoan et al. 1997;Federrath et al. 2008). Because of this property, Bialy et al. (2017) were able to establish a relation between the dispersion σ N H /l of the distribution of the averaged densities and the dispersion σ n H of the distribution of the proton density: where L drive is the turbulence driving scale. Fitting the σ N H /l /σ n H ratio as a function of l/L drive , Bialy et al. (2017) Unfortunately, this method cannot be applied here. Indeed, oppositely to isothermal simulations and as illustrated in Fig.  C.1, the PH of n H derived from simulations of the multiphase ISM is usually described by the sum two log-normal distributions plus a power-law tail at high density that could be a signature of the CNM which is known to behave like a polytropic gas with an exponent γ < 1 (Passot & Vázquez-Semadeni 1998) or a signature of gravity (Federrath 2013;Girichidis et al. 2014). To overcome this issue, and for the sake of simplicity, we assume that a two phase medium is described by two decorrelation scales. All densities below a limit n lim H are supposed to belong to the diffuse log-normal component and to be correlated over a scale y diff dec . Similarly, all densities above n lim H are supposed to belong to the dense log-normal component and to be correlated over a smaller scale y dens dec . We identified here the limit n lim H between the two log-normal distributions with the inflection point of the 1D probability histogram of the gas density (blue line in Fig. C.1). Since the diffuse component behave like an isothermal gas at the temperature of the WNM, we assume that y diff dec /L drive is constant for all simulations and adopt the value given by Bialy et al. (2017), y diff dec = 0.2 × L drive , where L drive = L/2 is the main driving scale used for the turbulent forcing (see Sect. 3.3). In contrast, and because the diffuse component occupies most of the volume, we state that y dens dec depends on the total mass of the gas or equivalently its mean density n H . To simplify, we propose that which means that the impact of changing the total mass of the gas is to change the typical volume of the dense structures by the same factor. Within this framework, the semi-analytical model proposed here therefore depends on a single parameter: the decorrelation scale of the dense component for the fiducial simulation. In the following, we assume y dens dec = 10 pc for n H = 2 cm −3 which implies a decorrelation length of 6.3, 7.9, and 12.6 pc for n H = 0.5, 1, and 4 cm −3 , respectively.
It is quite optimistic to believe that the dense and cold component of the ISM can be modeled by a single decorrelation scale y dens dec . This component is indeed likely to follow a distribution of sizes which decrease with the local pressure, hence the local density n H . However, and as we show below, such an assumption allows us to highlight important features of the simulations. The essential impact of the distribution of sizes is proven and discussed in Sect. C.4. drawn out of the sample a lines of sight under construction. The construction of each line of sight ends when its size reaches the integration scale l. The entire sample of lines of sight is finally reconstructed until convergence of the large-scale shielding processes described above.
The comparisons between the 1D and 2D PHs reconstructed from the semi-analytical model and extracted from the simulations are shown in Figs. C.2 and C.3 for samples of N = 64000 lines of sight, 15 different simulations, and different values of the integration scale l, from L down to y diff dec . Unexpectedly, setting the decorrelation length of the dense component to ∼ 10 pc for the fiducial simulation (n H = 2 cm −3 ) leads to a remarkable agreement between all the fictitious and actuals PHs. Once this parameter is set, the model not only reproduces surprisingly well the shapes of the 1D PHs and of the HI-to-H 2 transition, but also their global trends depending on n H and G 0 and their deformations depending on the chosen integration scale l. This result is not straightforward and highly depends on the decorrelation lengths y diff dec and y dens dec . The agreement observed in Figs. C.2 and C.3 therefore suggests that the model somehow captures an essential property of the simulations, namely some characteristic lengths of the diffuse and dense components of a multiphase gas with a turbulence driven at a scale L drive .

Appendix C.3: Interpretation of the model
The statistics of the HI-to-H 2 transition derived from the model result from a combination of effects. Locally, the fraction of H 2 of a given slab depends on the density and the size of the slab y diff dec and y dens dec , which set the local self-shielding, and on the surrounding environment, which sets the large-scale self-shielding. How these local properties contribute to the integrated quantities N H and N(H 2 ) depends, in turn, on the sizes of the slabs and on the PH of the density which both control the reconstruction of the line of sight.
1. Since y dens dec is fixed, the HI-to-H 2 transition induced by the local self-shielding alone occurs in any slab with a density larger than n tr H ∝ G 1/2 0 (y dens dec ) −1/2 ∝ G 1/2 0 (n H ) −1/6 . ous case, the occurrence of such events depends on the volume filling factor of the dense gas, hence on the 1D PH of large density material, and on the integration length l: as l increases, their likelihood increases. Such a scenario occurs for a minimum normalized column density N H /l = n tr H y dens dec /l. (C.7) 4. These two limits for lines of sight with low and large f H 2 (items 2. and 3.), set the width of the HI-to-H 2 transition seen in column densities. As shown in Fig. C.3, this width is somehow smaller than that obtained from the numerical simulations. We will discuss this point in the next section. 5. At last, lines of sight with intermediate H 2 fraction (10 −4 f H 2 10 −2 ) mostly result from a combination of slabs of low and moderate densities (n H n tr H ). Because such events are unlikely, the model predicts a small fraction of lines of sight at intermediate f H 2 , in contradiction with results extracted from the simulations. This point will also be discussed further in the next section.
All these properties fully explain the behaviors of the analytical model observed in Fig. C.3. The transition density for the fiducial simulation is n tr H = 4 cm −3 . As expected, the corresponding lower and upper limits of N H /l required to activate the HI-to-H 2 transition are in agreement with the limits found for the lowest integration scale l = 25 pc (bottom right panels of Fig. C.3). Increasing the integration length has three effects: (a) to squeeze the 2D PH along the x and y axis as both N H and N(H 2 ) progressively tend toward Gaussian distributions centered on the means, (b) to shift the HI-to-H 2 transition to lower N H /l according to the limits derived above, and (c) to increase the occurrence of lines of sight with large f H 2 to the detriment of lines of sight with low f H 2 as the probability of intercepting a slab with n H > n tr H rises. The dependence of the distributions on G 0 and n H are also straightforward. As G 0 increases or n H decreases, the HI-to-H 2 transition is shifted according to the dependence of n tr H on these parameters. The occurrences of high or low molecular fraction lines of sight simply depend on the volume filling factor of the dense gas, and are therefore a direct consequence of the 1D PH of the gas density. As G 0 increases or n H decreases, the mean thermal pressure rises and the mass fraction of the CNM diminishes. This favors the occurrence of lines of sight with low molecular fraction. and the KS distance between the two distributions as the maximum value of M computed over all quadrants of an observational dataset O, This procedure, not only provides a measurement of the difference between the two distributions, but also the datapoint and the quadrant that maximize the merit function (see Fig. D.1). The interpretation is also straightforward. For instance, a KS distance of 1 implies that one of all the quadrants scanned contains 10 times fewer or 10 times more observations than simulated lines of sight, and that all the other quadrants have smaller distances. The stability of the procedure depends on the errors made on the merit function and therefore on the absolute numbers of observed and simulated lines of sight contained in each quadrant. The observational dataset O used to compute the KS distance (red points in Fig. D.1) is chosen as the subsample such that all quadrants scanned contain at least 10 observed lines of sight. With this assumption, the error on the merit function is calculated by taking into account only the statistical errors on the number of simulated lines of sight. For each quadrant, we assume that the "true" merit function ranges between M min = log 10 where S is the total number of simulated lines of sight. Because the errors are asymmetric, M min can tend toward infinity. If so, the KS distance is a lower limit, even if the infinite error bar is obtained for another quadrant than the one that maximizes M. In short, for each observational datapoint, we compute M min , M, M max , and R KS . If one of the M min goes to infinity, R KS is considered as a lower limit. Because the division in quadrants is performed on a cartesian grid, the procedure also depends on the choice of the axes. Mathematically, the best option would be to identify the principal components of the observational sample using proper orthogonal decomposition or singular value decomposition algorithms. Such a method could even be applied to subsamples in order to adaptively rotate the system of axes and follow more precisely the distribution of observations. In any case, the resulting system would be a linear combination of N H and N(H 2 ) which could be difficult to relate to the underlying physical properties. Because the molecular fraction is bimodal as a function of N H , we choose here N H and f H 2 as primary variables. This choice facilitates the physical interpretation of the KS test while ensuring some homogeneity of the spread of the observational data in all quadrants (see Fig. D.1).
As a proof of concept, we display in Fig. D.2 the results of the KS test applied to a grid of simulations obtained for different values of n H and G 0 and two different resolutions. Despite its simplicity, this test appears to capture the main behaviors described in Sects. 4.2 and 4.3. In particular, the KS distance between the simulations and the observations is found to strongly depend on both n H and G 0 , and more loosely on the resolution. Moreover, the simulations that minimize R KS are found to be identical to those identified in Sect. 4.3 to be in closest agreement with the observations. Interestingly, the value of R KS obtained for n H = 4 cm −3 and G 0 = 4 is relatively small, in apparent contradiction with the conclusions of Sect. 4.3. This is due to the limit imposed on the minimum number of observations contained in each quadrant. Because of this limit, several observations at large N H are not included in the analysis which reduces the merit function at large column density (region E, see Table 1). Keeping in mind these border effects, the KS test turns out to be a valuable tool for estimating the distance between two distributions without performing a detailed comparison of the samples. In this paper, we apply this method to display our results in a synthetic manner (see Sects. 4.5,4.6,and 4.7) and only give additional details when necessary.