Proximate Molecular Quasar Absorbers: Excess of damped H2 systems at zabs~zQSO in SDSS DR14

We present results from a search for strong H2 absorption systems proximate to quasars (zabs~zem) in the Sloan Digital Sky Survey (SDSS) Data Release 14. The search is based on the Lyman-Werner band signature of damped H2 absorption lines without any prior on the associated metal or neutral hydrogen content. This has resulted in the detection of 81 systems with log N(H2)~19-20 located within a few thousand km/s from the quasar. Compared to a control sample of intervening systems, this implies an excess of proximate H2 systems by about a factor of 4 to 5. The incidence of H2 systems increases steeply with decreasing relative velocity, reaching an order of magnitude higher than expected from intervening statistics at Delta_v<1000 km/s. The most striking feature of the proximate systems compared to the intervening ones is the presence of Ly-alpha emission in the core of the associated damped HI absorption line in about half of the sample. This puts constraints on the relative projected sizes of the absorbing clouds to those of the quasar line emitting regions. Using the SDSS spectra, we estimate the HI, metal and dust content of the systems, which are found to have typical metallicities of one tenth Solar, albeit with a large spread among individual systems. We observe trends between the fraction of leaking Ly-alpha emission and the relative absorber-quasar velocity as well as with the excitation of several metal species, similar to what has been seen in metal-selected proximate DLAs. With the help of theoretical HI-H2 transition relations, we show that the presence of H2 helps to break the degeneracy between density and strength of the UV field as main sources of excitation and hence provides unique constraints on the possible origin and location of the absorbing clouds. We suggest that most of these systems originate from galaxies in the quasar group. [truncated]


Introduction
Damped Ly-α absorption systems (DLAs, see Wolfe et al. 2005) observed in the spectra of distant light sources belong to two main categories, intervening and associated, depending on their origin with respect to the background sources. Intervening DLAs are produced by neutral H i gas located by chance along the line of sight to the background sources without being related to the sources themselves. Using intervening absorption systems identified in large spectroscopic surveys (such as the Sloan Digital Sky Survey, hereafter SDSS), it is possible to conduct a census of the neutral gas in the Universe and study its evolution over cosmic time (e.g. Péroux et al. 2003; Prochaska et al. 2005; Noterdaeme et al. 2012). Moreover, DLAs are very useful probes of cosmic chemical evolution (e.g. Rafelski et al. 2012; De Cia et al. 2018), and the physical conditions of the absorbing medium can be probed by studying the excitation of various species, in particular molecular hydrogen (e.g. Srianand et al. 2005; Noterdaeme et al. 2007; Jorgenson et al. 2010; Balashev et al. 2017). Overall, intervening DLAs exhibit characteristics and a complexity indicating an origin from interstellar or circumgalactic gas. Indeed, a direct connection between intervening DLAs and galaxies is now emerging thanks to the detection of galaxies in emission at the absorption redshift (e.g. Krogager et al. 2017; Neeleman et al. 2019).
Associated systems, in contrast, originate from gas belonging to the close environment of the background sources. As such, they provide unique information about the sources themselves or their environment. For example, in the case of long-duration γ-ray burst (GRB) afterglows, strong DLAs are almost systematically detected. While the so-called GRB-DLAs may not necessarily be associated with the GRB explosion site itself (which is thought to be associated with the death of a massive star), they still likely probe the gas in the GRB host galaxy, as evidenced by a N(H i)-distribution skewed to high column densities (Fynbo et al. 2009). The luminous and rapidly varying afterglow also leads to specific effects such as time-varying UV-pumping of excited levels of atomic species (Vreeswijk et al. 2007) or the presence of vibrationally excited H 2 (Sheffer et al. 2009).
In the case of quasars, associated DLAs may arise from infalling or outflowing gas, gas in the quasar host, or from nearby galaxies in the group environment, all of which are possibly affected by the quasar via radiation or mechanical feedback. For example, quasar activity can result in quenching of star formation in the quasar host due to gas consumption or gas ejection from the galaxy through powerful winds (so-called negative feedback). However, quasar activity may also lead to positive feedback on star formation through compression of the gas (e.g. Zubovas et al. 2013). The presence of a quasar may also affect the gas in nearby galaxies, and consequently their star formation. Moreover, the feeding of quasars with infalling gas is one of the most challenging problems in the field and lacks direct observational evidence. Finally, while outflows driven by the quasar are ubiquitously observed in various states, from highly ionised, atomic phases to molecular phases, detecting these in absorption will provide unique clues as to their physical and chemical states.
The various possible origins for the associated DLAs suggest that the frequency of these could be in excess compared to intervening systems, and that associated DLAs may exhibit different characteristics. However, it is not trivial to distinguish between intervening and associated systems through observations. The most direct piece of information regarding the respective location of the intervening and associated systems is the apparent velocity difference. Noting that 1000 km s −1 in the Hubble flow corresponds to about 3 Mpc proper distance at z ∼ 3, systems with apparent velocity differences larger than a few thousand kilometres per second are generally considered as intervening since peculiar motions are unlikely to reach such values. Nonetheless, it cannot be excluded that outflowing winds may produce DLAs with large velocities. For velocity differences less than a few thousand kilometres per second, the absorber can either be associated (including the various possible origins discussed above) or still unrelated to the source environment (i.e. intervening). Such systems are therefore dubbed "proximate" until further information is available.
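The conversion between a Hubble-flow velocity offset and a proper distance quoted above can be checked with a short calculation. This is only a sketch: the flat ΛCDM parameters (H0 = 70 km s −1 Mpc −1, Ωm = 0.3, ΩΛ = 0.7) and the function names are assumptions made for illustration and are not taken from the text.

```python
import math

def hubble_parameter(z, h0=70.0, omega_m=0.3, omega_lambda=0.7):
    """H(z) in km/s/Mpc for a flat LambdaCDM cosmology (assumed parameters)."""
    return h0 * math.sqrt(omega_m * (1.0 + z) ** 3 + omega_lambda)

def proper_distance_mpc(delta_v_kms, z):
    """Proper distance (Mpc) for a velocity offset that is pure Hubble flow."""
    return delta_v_kms / hubble_parameter(z)

# 1000 km/s at z = 3: a few Mpc, in line with the "about 3 Mpc" quoted above.
d = proper_distance_mpc(1000.0, 3.0)
print(f"{d:.1f} Mpc")
```

Peculiar velocities are ignored here, which is precisely why the velocity offset alone cannot separate associated from intervening systems at small separations.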
Based on the CORALS survey, Ellison et al. (2002) reported a factor of ∼ 4 excess of proximate DLAs (PDLAs) compared to intervening ones. From a systematic search of SDSS data release 5 (DR5), Prochaska et al. (2008b) later reported an excess of only a factor of ∼ 2 at redshift z ∼ 3, but no statistically significant excess at z < 2.5 and z > 3.5. Studies of metal lines in both composite SDSS spectra (Ellison et al. 2010) and individual high resolution spectra (Ellison et al. 2011) suggest that PDLAs have properties that are only marginally different from those of intervening DLAs; on average the former have higher metallicities (although spanning a wide range) and stronger high-ionisation lines.
A more striking difference between PDLAs and intervening DLAs is the existence of a population of PDLAs that do not fully cover the Ly-α emission region of the background quasar (Finley et al. 2013). This results in an additional flux in the core of the DLA, which complicates their identification as DLAs. The system will appear as a coronagraphic DLA when the broad line region (BLR) of the quasar is fully covered by the absorbing cloud but the narrow line region (NLR) is not. Depending on the relative strength and width of the emission compared to that of the DLA absorption, there exists a continuous range of situations, starting from DLAs where some emission is seen in the core to systems where the damping wings are barely visible due to strong Ly-α emission (Jiang et al. 2016). We note that part of the emission can also be due to Ly-α photons originating from the quasar host galaxy or from Ly-α photons scattered out to very large distances (tens of kpc, e.g. Courbin et al. 2008; Cantalupo et al. 2014; Borisova et al. 2016; North et al. 2017). However, the total flux of such kpc-scale Ly-α emission is generally significantly smaller than that from the NLR (Fathivavsari et al. 2016), although it sometimes becomes comparable to the latter (e.g. Fathivavsari et al. 2015). In some extreme cases (called ghostly DLAs by Fathivavsari et al. 2017), the BLR is not fully covered either and the absorption system is only witnessed by its Ly-β, Ly-γ and higher series H i lines as well as low-ionisation metal lines that indicate the presence of neutral gas along the line of sight. Based on an observed relation between the strength of leaking Ly-α emission and the fine-structure excitation of metal species, Fathivavsari et al. (2018) suggested that systems with strong Ly-α emission could be located closer to the quasar where mechanical compression of the gas would be at play. We note that the enhanced UV flux may then also play a role in the excitation of the metal species.
Investigating the presence of molecular gas (in particular H 2 ) in PDLAs could bring new clues to the overall picture since the production and destruction of molecules is very sensitive to the physical conditions of the gas. In cold neutral gas, the molecular hydrogen fraction is governed by the equilibrium between the formation of H 2 on the surface of dust grains and photo-dissociation by UV photons through line absorption in the Lyman and Werner bands (see e.g. Wakelam et al. 2017). The proximity of the central engine not only increases the photodissociation rate but may also lead to complex effects such as an increase of the dust temperature that decreases the formation efficiency of H 2 on the surface of grains. On the other hand, the fragmentation of dust due to strong UV radiation increases the grain surface-to-mass ratio, which could increase the H 2 formation rate, but at the same time, the grain fragments will also be heated. It is therefore not obvious what the net effect on the H 2 formation rate would be. Additionally, mechanical feedback from the quasar may result in an increase in the number density, n H , and thus a significant increase in the H 2 production rate, which scales as the square of n H . More generally, it is crucial to investigate how H 2 clouds can survive or form in harsh environments and thereby how star formation is affected close to the quasar.
The presence of molecular hydrogen proximate to the quasar was first shown by Levshakov & Varshalovich (1985) who detected H 2 with N(H 2 ) ∼ 10 18 cm −2 at z abs = 2.811 towards PKS 0528-250 (z em = 2.77). This was later confirmed by Foltz et al. (1988) who also discussed the possible reasons for the existence of H 2 gas when the extinction measured towards the quasar is low. The authors suggested that the formation rate could be more efficient than seen locally, that the incident UV flux could actually be low, or that H 2 could be formed in nonequilibrium in cooling zones behind shocks. Levshakov & Foltz (1988) discussed the transverse size of the associated atomic gas from the complete absorption of Ly-α+N v emission by the DLA and Klimenko et al. (2015) demonstrated that the emission regions were not fully covered by the molecular cloud. A detailed investigation of physical conditions in this system from the excitation of various species is still to be done (Balashev et al. in prep).
It is also remarkable that the proximate H 2 system from Levshakov & Varshalovich also represents the first detection of molecules in absorption at high redshift. Since then, several systematic searches have been performed for intervening H 2 towards quasars (Ledoux et al. 2003; Noterdaeme et al. 2008; Jorgenson et al. 2014; Balashev et al. 2014; Noterdaeme et al. 2018), but no systematic search has been performed for systems proximate to the quasar, for which the available pathlength is actually much smaller. Considering the very large number of quasar spectra now available in the SDSS, we initiated a campaign to study molecular gas absorbers proximate to quasars. In this paper, we present our results based on an automated search of H 2 in the SDSS quasar catalogue. The SDSS is indeed a gold mine for such studies since strong H 2 absorption systems can be efficiently identified in the SDSS spectra, as demonstrated by Balashev et al. (2014). We present the search of strong H 2 systems proximate to the quasar without any other prior in Sect. 2 and build a sample of about 80 such systems. In Sect. 3, we study the excess of such systems compared to what could be expected from intervening statistics. We then investigate the main properties of the systems, as can be derived from SDSS data in Sect. 4. In Sect. 5, we discuss our results within a theoretical frame for the transition from atomic to molecular gas, and lastly, we offer a summary of our main findings in Sect. 6.

Parent sample
We searched for H 2 lines at the redshift of the quasars in the SDSS DR14 catalogue (Pâris et al. 2018). A total of 103 320 quasars have emission redshifts z > 2.5 and are therefore suitable to search for H 2 bands in their SDSS spectra. In case several spectra are available for a given quasar, we used the combined spectrum that consists of the co-addition of all exposures of that object. We then rejected spectra with median signal-to-noise ratio per pixel lower than 2 in the 1400-1500 Å region in the rest-frame of the quasar, yielding a parent sample of 82 564 quasars (including also quasars with broad absorption line features) whose spectra were effectively searched for strong proximate H 2 absorption.

Searching procedure
We used a Spearman's rank correlation analysis to search for strong H 2 lines by correlating the observed data with a synthetic H 2 profile. We used a synthetic H 2 template built considering a total column density N(H 2 ) = 10 20 cm −2 that is distributed over the first three rotational levels, assuming an excitation temperature T 0,1,2 = 100 K (as typically seen for H 2 clouds in absorption). This theoretical profile was convolved with the SDSS instrumental line-spread function (corresponding to a resolving power of R = 1500 in the blue) and re-binned to the same grid as the data, that is, with a constant log(λ) pixel-spacing of 10 −4 dex, or equivalently 69 km s −1 . We note that our procedure is only weakly sensitive to the exact column density and excitation temperature since the lines we are looking for are intrinsically saturated and because the rank correlation is mostly sensitive to the global "comb-like" shape of the H 2 absorption profile and not to the actual strength of the lines. We indeed verified that changing the column density (by a factor of ten either upwards or downwards) and excitation temperature in the template has no effect on the detection of strong H 2 systems. Since we do not know a priori the exact velocity shift between any H 2 absorber and the quasar redshift and because the latter is not known to high accuracy, we first cross-correlated the template with the data over a velocity interval that encompasses the pipeline and visual redshift estimates and extends by 2000 km s −1 on each side. We then calculated the significance of the Spearman's correlation coefficient at the redshift of the maximum cross-correlation. The significance of the deviation from zero is expressed in terms of a probability which we call P. A small P-value indicates a significant correlation.
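The two numerical ingredients of this step, the velocity width of a log(λ) pixel and a rank-correlation scan of a template over the data, can be illustrated with a toy sketch. It is not the search code used for this work: the function names, the toy "comb" template, the scan in integer pixel shifts and the large-sample normal approximation for the P-value are all assumptions made for illustration.

```python
import math

C_KMS = 299792.458  # speed of light in km/s

def pixel_width_kms(dlog10_lambda=1e-4):
    """Velocity width of one pixel on a constant log10(lambda) grid."""
    return C_KMS * math.log(10.0) * dlog10_lambda

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = 0.5 * (i + j) + 1.0
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman coefficient, i.e. the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0  # a constant array carries no rank information
    return cov / math.sqrt(var_x * var_y)

def approx_p_value(rho, n):
    """Two-sided P from a large-n normal approximation (an assumption here)."""
    return math.erfc(abs(rho) * math.sqrt(n - 1) / math.sqrt(2.0))

def best_shift(flux, template, max_shift):
    """Integer pixel shift of the template that maximises rho with the data."""
    best_s, best_rho = 0, -2.0
    for s in range(-max_shift, max_shift + 1):
        pairs = [(flux[i + s], template[i]) for i in range(len(template))
                 if 0 <= i + s < len(flux)]
        rho = spearman_rho([p[0] for p in pairs], [p[1] for p in pairs])
        if rho > best_rho:
            best_s, best_rho = s, rho
    return best_s, best_rho

# Toy comb-like template and a noiseless copy displaced by two pixels.
template = [1.0] * 20
for i in (3, 7, 8, 14):
    template[i] = 0.1
flux = [1.0, 1.0] + template[:18]
shift, rho = best_shift(flux, template, max_shift=3)
print(pixel_width_kms(), shift, rho, approx_p_value(rho, 18))
```

Because only ranks enter the statistic, rescaling the depth of the template lines leaves the coefficient unchanged, which is the property the text relies on when arguing that the exact template column density matters little.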
The Spearman's correlation test is performed over the regions of H 2 bands (from ν ′ = 0 up to ν ′ = 9), avoiding L(6-0), which is blended with Ly-β, and restricting to λ obs > 3650 Å because of the significantly increased noise level and frequent data issues at shorter wavelengths. In order to ascertain the presence of strong H 2 lines, we also measure the median ratio of the flux at the expected position of the H 2 lines with respect to the flux in-between the lines. In other words, this parameter provides a measurement of the contrast. In what follows, this "flux ratio" parameter is denoted FR. An example of a quasar spectrum with H 2 detection is shown along with the comparison template in Fig. 1.
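A contrast parameter of this kind can be sketched as follows. The implementation below takes the ratio of two medians over a boolean mask that flags pixels at the expected H 2 line positions; the mask, the toy flux values and the function name are hypothetical, and the exact definition used for FR may differ in detail.

```python
from statistics import median

def flux_ratio(flux, in_line_mask):
    """Median flux at expected line positions over median flux between lines."""
    in_line = [f for f, m in zip(flux, in_line_mask) if m]
    between = [f for f, m in zip(flux, in_line_mask) if not m]
    return median(in_line) / median(between)

# Toy normalised spectrum with two saturated troughs (hypothetical values).
flux = [1.0, 0.9, 0.1, 0.2, 1.1, 1.0, 0.15, 0.05, 0.95, 1.05]
mask = [False, False, True, True, False, False, True, True, False, False]
fr = flux_ratio(flux, mask)
print(fr)  # small FR means deep absorption at the expected positions
```

Using medians rather than means keeps the estimate robust against a few pixels contaminated by Ly-α forest lines or bad data.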

Selection of the H 2 candidates and visual inspection
The distribution of the parameters P and FR for all the quasar spectra is shown in Fig. 2. The presence of a strong H 2 system in the search window is expected to result in small values for both P (i.e. high correlation significance) and FR (decrease in flux at expected position of H 2 ). The corresponding points also naturally appear as outliers compared to the main locus. Based on these considerations, we used two approaches to select the candidate H 2 absorbers. For the first approach (selection #1), we isolate all candidates (170) that have log P < −7 and FR < 0.75 (dashed lines on Fig. 2), noting that beyond these values, it is generally hard to confirm or reject any putative H 2 system. We call this sample: S P c1 . This selection has the advantage of simplicity, but the number of candidates also increases quickly when both P and FR values increase, while the fraction of them being confirmed visually decreases. The second approach (selection #2) is based on a detection of outliers from the main locus of points in the (P, FR) parameter space. The selected candidates (188) are those found beyond the contour containing 99.73% of the points (equivalent to 3 σ for normal statistics). We call this sample: S P c2 . One advantage of this selection is the possibility to explore candidates where one of the two parameters is peculiar for the given value of the other parameter. In particular, some systems may have strong H 2 lines (i.e. low FR), visually recognisable despite a low significance of correlation due to noisy data etc. There is a natural overlap between the two selections, with 78 candidates in common out of a total of 280 (coloured and black points in Fig. 2).
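The first selection and the bookkeeping of the candidate counts can be written down explicitly. The thresholds and counts are those quoted above; the function name is hypothetical, and the contour-based selection #2 is not reproduced here.

```python
def passes_selection_1(log_p, fr, log_p_max=-7.0, fr_max=0.75):
    """Selection #1: high correlation significance and low flux ratio."""
    return log_p < log_p_max and fr < fr_max

# Candidate counts quoted in the text for the proximate search window.
n_sel1, n_sel2, n_common = 170, 188, 78
n_union = n_sel1 + n_sel2 - n_common  # candidates to inspect visually
print(n_union)
```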
We visually inspected all these 280 candidates. During the visual inspection, not only did we pay attention to the region covering the position of the expected H 2 lines, but also to the overall SDSS spectrum, looking for the presence of other signatures of absorption systems, such as metal, H i lines and dust features. Our visual inspection led to the confirmation of 50 strong proximate H 2 systems, coloured green in Fig. 2. For another 8 candidates (filled yellow), H 2 lines are likely present but it remains difficult to rule out the possibility that the lines are chance coincidences of Ly-α forest lines. We assign a visual grade "A" for the former 50 and "B" for the latter 8 in Table 1. The remaining candidates are either clearly false positive systems or systems for which the data are inconclusive. The spectral regions covering the H 2 and H i lines are shown in the Appendix.
We finally note that the visual inspection remains somewhat subjective by nature and it is still possible that systems graded A or B are spurious or that we missed H 2 systems among the selected (hence inspected) candidates. While we believe these fractions to be very small, follow-up data with higher signal-to-noise ratio and resolution are required to firmly establish the quality of our visual inspection.

Additional proximate H 2 systems
In spite of effective selection criteria, during the code testing, we came across several candidates that were quite evident by visual inspection but remain inside the main locus of the parameter space (i.e. less significant than the 99.73% confidence level imposed above). Some proximate H 2 systems may also be located outside the redshift window used to build our statistical sample. This can happen when the quasar redshifts provided by the DR14Q catalogue are wrong or when the absorbers are very significantly redshifted (i.e. more than our limit of 2000 km s −1 ).
In order to explore further inside the main locus and to recover systems not considered in the previous search, we performed a second, independent search using a method similar to that presented by Balashev et al. (2014). This independent method proved to be an efficient way to identify strong intervening H 2 -bearing DLAs in the SDSS. We slightly modified the method, adjusting the numerical values that specify the criteria used to search for H 2 -bearing DLAs. We again searched all z > 2.5 quasars, but used a 3000 km s −1 search window around the best redshift value reported by Pâris et al. (2018). The identification of probable H 2 systems is based on a "χ 2 -like" selection function and the probabilities of false detection for the candidates were estimated using Monte Carlo simulation, as described by Balashev et al. (2014).
As before, we then visually inspected all 23 additional systems, i.e. new systems found by this second procedure, systems found by our main code but outside the selected statistical sample, as well as serendipitous systems. Unsurprisingly, it is also generally more difficult to judge the reality of these additional systems, so that we ended up having a high fraction of grade B (11 out of 23) compared to our main selection. We also include these in Table 1; however, they are not considered for the statistical analysis of the incidence rate. The systems for which we have measurements of the parameters FR and P at the same redshift, but where FR and P fall within the rejection contour, are over-plotted in Fig. 2 as red and orange dots corresponding to visual grade A and B, respectively.

Note on sample completeness
The detection of additional H 2 systems inside our rejection contour indicates that the detection of strong H 2 in the overall parent sample of quasars is not complete. Indeed, the actual completeness of our statistical sample is expected to be a complex function of the quasar redshift and the S/N ratio over the wavelength range where the H 2 bands are located. Furthermore, it depends on the column density of the H 2 system, the strength and exact location of Ly-α forest lines, and the presence of other absorption systems. In principle, this prevents us from deriving the absolute incidence of strong H 2 systems but should have little impact on the relative incidence between proximate and intervening systems discussed in the next section. We can still roughly estimate the overall H 2 detection rate in PDLAs using the statistical sample (i.e. log N(H i) > 21.1) of metal-selected PDLAs from Fathivavsari et al. (2018). This sample contains 201 systems with z > 2.5 searched by our code, among which we found 20 H 2 -bearing systems (18 grade A and 2 grade B) within our statistical selection, plus another 5 in our list of additional systems. This implies a H 2 covering fraction higher than 10% in strong (log N(H i) > 21.1) metal-selected PDLAs. This appears to be in qualitative agreement with the H 2 covering fraction for intervening systems. For example, Balashev & Noterdaeme (2018) found 4% (DLAs/sub-DLAs with log N(H i) > 20), 8% (DLAs with prominent metal lines) and 37% (extremely strong DLAs with log N(H i) > 21.7).
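The detection fractions implied by these counts, together with a rough binomial uncertainty, can be computed as below. The Wilson score interval is used only as an illustrative choice; it is not stated in the text which uncertainty estimate, if any, was adopted, and the function name is hypothetical.

```python
import math

def wilson_interval(k, n, z=1.0):
    """Wilson score interval for a binomial fraction k/n (z=1: about 68%)."""
    p = k / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return centre - half, centre + half

n_pdla = 201     # metal-selected PDLAs at z > 2.5 searched by the code
n_stat = 20      # H2-bearing systems within the statistical selection
n_extra = 5      # further H2-bearing systems from the additional list

frac_stat = n_stat / n_pdla
frac_all = (n_stat + n_extra) / n_pdla
lo, hi = wilson_interval(n_stat + n_extra, n_pdla)
print(f"{frac_stat:.3f} {frac_all:.3f} [{lo:.3f}, {hi:.3f}]")
```

Since the search is incomplete, both fractions are best read as lower limits on the true H 2 covering fraction.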

The excess of strong proximate H 2 absorbers
In this section, we investigate whether or not there is an excess of strong proximate H 2 systems compared to what is expected from intervening systems. In other words, we wish to quantify whether or not there is a higher probability for a H 2 cloud to be located close to the quasar in velocity space. To do this, we apply the exact same procedure, selection and visual inspection as for our statistical sample of proximate systems, with the only difference that we shifted the search window for each individual spectrum by 5 000 km s −1 to the blue. This velocity shift corresponds to what is typically considered a safe limit to treat the systems as intervening. At the same time, the velocity shift is large enough to avoid overlap of the search window with that used for proximate H 2 systems while being small enough so that the probed spectral regions and the redshifts remain very similar. In spite of this, a slight shift is observed for the main locus in the (P, FR) parameter space as compared to proximate systems. This results in a larger number of candidates following selection 1 (S I c1 with 396 candidates). However, these are mostly seen close to the chosen limits and the bottom-left corner of the plot (with a high probability of a given system to be real) is clearly much less populated than for proximate candidates. This alone already tells us that the incidence of strong intervening systems per velocity bin is much lower than for the proximate systems. Applying our outlier selection (#2), we obtain a total of 174 candidates (S I c2 ). From visual inspection of all 525 candidates (45 are in common between the two selections), only 13 are graded A (S I A ) and 6 are graded B (S I B ). In Figure 3 we present the distribution of velocity offsets, where R ≡ (1 + z abs )/(1 + z em ) for the strong H 2 systems detected in both search windows (i.e. centred on z em and shifted bluewards by 5 000 km s −1 ). 
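For reference, a velocity offset can be obtained from the ratio R defined above with the usual relativistic expression. The formula and sign convention in this sketch are assumptions: they are chosen so that negative velocities correspond to z abs > z em, as in Fig. 3, but the exact expression adopted for the figure is not recoverable from the text.

```python
C_KMS = 299792.458  # speed of light in km/s

def velocity_offset_kms(z_abs, z_em):
    """Relativistic velocity offset from R = (1 + z_abs) / (1 + z_em).

    Assumed convention: positive for absorbers blueshifted with respect to
    the quasar (z_abs < z_em), negative for redshifted ones.
    """
    r = (1.0 + z_abs) / (1.0 + z_em)
    return C_KMS * (1.0 - r * r) / (1.0 + r * r)

dv_blue = velocity_offset_kms(2.98, 3.00)  # absorber in front, blueshifted
dv_red = velocity_offset_kms(3.01, 3.00)   # absorber apparently redshifted
print(dv_blue, dv_red)
```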
For a fair comparison of the two distributions, we used only those systems satisfying selection 2, but note that the results do not change significantly when using selection 1 or the union or intersection of both selections. The shaded regions in Figure 3 show the minimal 4000 km s −1 -wide search windows. Both the intervening and the proximate distribution slightly extend beyond these boundaries as the search windows for each spectrum were defined to take into account the uncertainties on the quasar redshift. The statistical results discussed below are however strictly restricted to systems falling in the respective 4000 km s −1 windows. The intervening systems are uniformly distributed over the velocity interval, which is expected for systems randomly intercepted by a quasar line of sight. On the other hand, proximate systems are on average 5 times more numerous (4.2 if including grade B systems as well) at ∆v = 0 ± 2000 km s −1 than at ∆v = −5000 ± 2000 km s −1 (shaded areas in Fig. 3). These are conservative lower limits since the number of intervening systems at significantly negative velocities (i.e. z abs > z em ) should be close to zero, as we expect peculiar velocities of intervening gas to rarely shift systems into that region. The distribution of proximate systems is also clearly peaked around the quasar redshift. The excess of proximate systems is about a factor of 2.5 in the velocity range from 1000 to 2000 km s −1 compared to what is expected from the statistics of purely intervening systems (dashed blue horizontal line). In the central 1000 km s −1 , however, 28 strong H 2 systems are seen when ∼2 are expected from intervening statistics. We note that the uncertainty on the quasar emission redshifts as provided by the SDSS quasar catalogue is of the order of 500-1000 km s −1 . Hence, the observed distribution of proximate H 2 absorbers may well appear wider than it is intrinsically.
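To give a sense of how unlikely the central excess is under intervening statistics alone, one can evaluate a Poisson tail probability for the numbers quoted above (28 systems observed, about 2 expected). This is an illustrative check that assumes Poisson-distributed counts with a fixed mean of exactly 2; it is not part of the analysis described in the text, and the function name is hypothetical.

```python
import math

def poisson_tail(n_obs, mu, n_terms=200):
    """P(N >= n_obs) for a Poisson mean mu.

    The upper tail is summed directly, term by term, rather than computed as
    one minus the cumulative sum, which would lose all precision for very
    small probabilities.
    """
    # First term of the tail, evaluated in log space for stability.
    term = math.exp(-mu + n_obs * math.log(mu) - math.lgamma(n_obs + 1))
    total = 0.0
    k = n_obs
    for _ in range(n_terms):
        total += term
        k += 1
        term *= mu / k  # recurrence P(k) = P(k-1) * mu / k
    return total

p_chance = poisson_tail(28, 2.0)
print(f"{p_chance:.1e}")
```

The resulting probability is vanishingly small, so the conclusion does not depend on the precise value adopted for the expected intervening count.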
In summary, we observe more than an order of magnitude excess of H 2 absorbers close to the quasar compared to what is expected from chance alignment with the quasar. This means that most of the proximate H 2 systems presented in this work must be related to the quasar environment and not to intervening galaxies in the Hubble flow. The question now becomes whether these systems are directly associated to the quasar, its host galaxy, or arise from galaxies in the quasar group environment. In the absence of detailed understanding of the physical conditions in the clouds, this is a difficult question to answer. In the following sections, we will shed light on this from the observed properties of the proximate H 2 systems as seen in the SDSS data.

Properties of the proximate H 2 systems
In this section, we derive some of the main properties of the proximate H 2 systems from the SDSS data alone. These are the atomic and molecular hydrogen column densities, Ly-α emission, metal content, and dust properties.

H i and H 2 column densities
Fig. 2. Core-to-continuum median flux ratio versus significance of the Spearman correlation for all quasar spectra searched in a proximate velocity window (top, within a velocity window encompassing the pipeline and visual redshifts estimates, extended 2000 km s −1 on each side) and an intervening window with the exact same width for each spectrum, but shifted by 5 000 km s −1 bluewards (bottom). The vertical and horizontal dotted lines show our cuts defining the samples S P c1 (top) and S I c1 (bottom). Points located outside the solid contour (containing 99.73% of the points) define, respectively, S P c2 (top) and S I c2 (bottom). Candidates belonging to either one or both of these selections (black points) were visually checked and coloured green when strong H 2 is confirmed (grade A) or yellow when considered tentative only (grade B). Red and orange points correspond to additional systems described in Sect. 2.4 with, respectively, grade A and B.

Fig. 3. Distribution of relative velocities with respect to the quasar redshift for our sample of strong proximate H 2 systems (orange histograms) compared to those found in a region shifted by 5 000 km s −1 (blue). We here used the "zbest" provided by the DR14Q catalogue as the quasar redshift and the z abs measurement directly from our search algorithm. Negative velocities indicate z abs > z em . Note that the x-axis goes from positive velocities (blueshifted compared to the quasar) on the left to negative velocities (redshifted) to the right. Both distributions are restricted to visually-checked systems (unfilled histograms: grade A or B, filled histograms: grade A only) isolated using the outlier selection (# 2). The grey regions show the corresponding minimal search windows. Systems falling outside these regions are not considered when comparing incidence rates. The horizontal dashed line shows the mean number of intervening strong H 2 systems per velocity bin (∼ 1 per 500 km s −1 bin). A significant excess of H 2 systems at the quasar redshift is observed and cannot be explained by intervening statistics.

We fitted a Voigt profile to the damped Ly-α line keeping the redshift fixed to that obtained from H 2 and metal lines. We also simultaneously fitted the other lines from the Lyman series and estimated the quasar continuum using a spline function. Since the latter task is complicated by the quasar blended Ly-α and N v emission lines, we guided the placement of the spline knots using the quasar composite spectrum from Vanden Berk et al. (2001) matched to the spectrum redwards of the quasar Ly-α emission. When necessary we adjusted the strength of the Ly-α+N v emission line by considering those of other emission lines in the spectrum. However, the derivation of the exact unabsorbed continuum will inevitably partly rely on implicit assumptions about the shape and strength of the Ly-α+N v emission line, which are hard to quantify. We therefore paid particular attention to the width of the profile close to the bottom, which is little influenced by the exact continuum placement but note that in some cases, it can still be affected by the presence of strong leaking Ly-α emission. The measurement of the H i column density was also helped by the presence of Ly-β and other Lyman series lines for which the emission-line-to-continuum ratio is different. The obtained N(H i) then sets the strength of the DLA (or sub-DLA) wings, and the continuum is then re-adjusted if necessary until we obtain a satisfactory fit. During this process, we remarked that the obtained H i column densities typically varied by no more than 0.2 dex. Our final N(H i) measurements are given in Table 1 and the corresponding figures in the Appendix. We note that automatically determined N(H i)-measurements of intervening DLAs based on Ly-α absorption only have typical uncertainties of 0.2 dex in SDSS (Noterdaeme et al. 2009). In the case of proximate DLAs, follow-up observations by Ellison et al. (2010) are actually in very good agreement (∼ 0.05 dex) with those obtained by Prochaska et al. (2008b) from SDSS data using a manual fitting scheme very similar to the one used here. Nevertheless, since we here discuss the overall population, the N(H i) uncertainty for individual systems does not affect the main results and conclusion of the paper.
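To illustrate why the width of the saturated core constrains N(H i) with little sensitivity to the continuum placement, the sketch below uses the Lorentzian damping-wing approximation to the Voigt profile, which holds far from line centre where the optical depth of a DLA is set by natural broadening. This is not the fitting code used for the measurements; the Ly-α atomic constants are standard literature values rather than values from this paper, and the function names and the threshold τ = 3 are illustrative assumptions.

```python
import math

# Standard atomic data for H I Ly-alpha (literature values, assumed here).
LYA_WAVELENGTH_CM = 1215.67e-8   # rest wavelength [cm]
LYA_F_OSC = 0.4164               # oscillator strength
LYA_GAMMA = 6.265e8              # damping constant [s^-1]
C_CM_S = 2.99792458e10           # speed of light [cm/s]
SIGMA_0 = 0.02654                # pi e^2 / (m_e c) [cm^2 Hz]


def wing_coefficient():
    """Return K such that tau = K * N / dlambda**2 (dlambda in rest-frame Angstrom)."""
    # Frequency offset [Hz] corresponding to a 1 Angstrom wavelength offset.
    dnu_per_angstrom = C_CM_S * 1e-8 / LYA_WAVELENGTH_CM**2
    return SIGMA_0 * LYA_F_OSC * LYA_GAMMA / (4.0 * math.pi**2 * dnu_per_angstrom**2)


def wing_optical_depth(log_n_hi, dlambda):
    """Damping-wing optical depth at a rest-frame offset dlambda [Angstrom]."""
    return wing_coefficient() * 10.0**log_n_hi / dlambda**2


def core_full_width(log_n_hi, tau_min=3.0):
    """Full rest-frame width [Angstrom] over which tau >= tau_min."""
    return 2.0 * math.sqrt(wing_coefficient() * 10.0**log_n_hi / tau_min)


if __name__ == "__main__":
    for log_n in (21.1, 21.3, 21.5):
        print(log_n, round(core_full_width(log_n), 2))
```

In this approximation the core width scales as the square root of N(H i), so a change of 0.2 dex in column density changes the width of the saturated trough by a factor of 10^0.1, i.e. about 26 per cent, independently of the adopted continuum.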
We also obtain rough estimates of the H 2 column densities by manually adjusting the total column density of a H 2 template with a fixed excitation temperature and fixed Doppler parameter. We derive typical column densities of log N(H 2 ) ∼ 19.5 but caution that individual values are very uncertain in the absence of high-S/N, medium/high-resolution spectroscopy. The values of N(H 2 ) provided in Table 1 should then be considered as indicative only. We remark that we already have medium or high resolution data for several H 2 -bearing DLAs (including four from this sample: J1311+2225, Noterdaeme et al. 2018), in which we found that SDSS-based values typically underestimate the H 2 column density by up to 0.3 dex. In one outlier, however, the H 2 column density differs by about 0.8 dex compared to the SDSS-based estimate. Therefore, while the H 2 lines are intrinsically in the saturated regime, we do not use the column density estimates in the following. The distribution of H i column densities for intervening and proximate DLAs is shown in Fig. 4. The observed distribution for H 2 -bearing PDLAs is slightly shifted towards higher H i column densities (by about 0.3 dex) compared to intervening H 2 -bearing DLAs. This may be due to the fact that higher H i column densities are necessary for the H 2 /H i transition closer to a strong UV source, as expected from transition theories (e.g. Krumholz et al. 2008; Sternberg et al. 2014), if the other parameters are kept unchanged. It is also possible that part of the observed H i is unrelated to the H 2 gas, and the excess column density is only due to a more gas-rich environment close to the quasar.
Our H 2 -selection of PDLAs can also provide an independent estimate of PDLA clustering close to the quasar. Indeed, if the conditions for the formation of H 2 are not very different, then the observed factor of 5 excess of proximate H 2 over intervening H 2 systems corresponds to the excess of proximate DLAs over intervening DLAs. This is well above the factor of two excess found by Prochaska et al. (2008b) in the SDSS-II. If, in turn, H 2 is more difficult to form in the quasar environment (as we could naively expect from the strong UV field), then the discrepancy is even larger. We note however that the PDLA detection algorithm from Prochaska et al. was based on the zero-flux in the core of the DLA and hence likely missed most of the systems with leaking Ly-α emission. The clustering of neutral gas around the quasar could also depend on the column density, being stronger at high N(H i) (as observed here) than for the overall population of DLAs. Finally, it remains possible that H 2 is instead formed more efficiently in the quasar environment (i.e. a positive AGN feedback) owing to higher metallicities, larger total surface of dust grains or gas compression.

Leaking Ly-α emission
Significant (> 3 σ) residual flux in the core of the DLA absorption is the most evident peculiar feature of our systems and is observed in about half of our sample. We measured the total Ly-α flux (F Ly−α ) for each system by integrating the observed flux spectrum over the DLA trough. The associated uncertainty is obtained from the error spectrum. These values are robust since they do not depend on the assumed unabsorbed quasar emission and are provided in Table 1 for reference. However, since the Ly-α emission can be strong and is generally broad, it most likely corresponds to leaking Ly-α photons from the background quasar's emission line regions rather than arising solely from local star-formation activity in the quasar host. Therefore, the most interesting quantity to consider is the fraction of leaking photons at the DLA wavelength rather than the actual luminosity of this residual. Thus, we define f leak as the ratio of the observed flux integrated in the DLA core over the unabsorbed flux integrated over the same region. In spite of our efforts to reconstruct the unabsorbed quasar continuum (see previous section), the fraction f leak is highly uncertain. However, it remains a convenient way to distinguish between systems that allow a significant fraction of photons to leak at the DLA wavelength and those that do not. We assign a conservative estimate of the uncertainty of a factor of two to take into account the observed dispersion of the Ly-α emission-to-continuum ratio seen between different quasars (e.g. Selsing et al. 2016).
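The definition of f leak given above can be sketched as follows. This is a minimal illustration, not the measurement code used for Table 1: the function names, the trapezoidal integration and the wavelength limits of the DLA core are assumptions made for the example.

```python
def trapezoid(x, y):
    """Trapezoidal integral of y(x) for monotonically increasing x."""
    return sum(0.5 * (y[i] + y[i + 1]) * (x[i + 1] - x[i]) for i in range(len(x) - 1))


def leaking_fraction(wave, flux_obs, flux_unabs, core_min, core_max):
    """Ratio of observed to unabsorbed flux, both integrated over the DLA core.

    wave       : wavelengths of the spectrum pixels
    flux_obs   : observed flux (residual emission in the DLA trough)
    flux_unabs : reconstructed unabsorbed quasar emission (model dependent)
    core_min, core_max : hypothetical limits of the DLA core
    """
    sel = [i for i, w in enumerate(wave) if core_min <= w <= core_max]
    if len(sel) < 2:
        raise ValueError("need at least two pixels inside the DLA core")
    w = [wave[i] for i in sel]
    observed = trapezoid(w, [flux_obs[i] for i in sel])
    unabsorbed = trapezoid(w, [flux_unabs[i] for i in sel])
    return observed / unabsorbed
```

Only the numerator (the integrated residual flux) is independent of the continuum reconstruction; the denominator carries the factor-of-two uncertainty quoted above.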
Splitting the sample into two sub-samples with f leak above or below the median value (0.02), we found that the systems with high f leak are located on average twice as close to the quasar in velocity space as those with low f leak (|∆v| ∼ 500 vs 1000 km s −1 ). Figure 5 illustrates this further with f leak plotted as a function of the relative velocity with respect to the quasar redshift. Systems without significant emission span the full range of velocities, while systems with high f leak tend to concentrate closer to zero velocities. Separating the systems according to their velocity shift to the quasar, we can indeed see that the mean and median f leak values are higher at small velocity separation than at high velocity separation. Interestingly enough, we note that the leaking fraction seems to be higher for systems with ∆v < −1000 km s −1 than for those with ∆v > 1000 km s −1 . To summarise, it appears that DLAs with absorption redshift very close to that of the quasar emission cover less of the corresponding Ly-α photons than those with significant velocity shifts. Among the latter, those redshifted compared to the quasar (i.e. moving towards the quasar) tend to cover less than those moving away from the quasar.
The observed dependence of f leak on the relative absorber to quasar velocity can in principle be explained as a purely observational effect. DLAs redshifted onto either wing of the quasar Ly-α emission will absorb Ly-α photons with wavelengths shifted relatively far away from resonance (1215.67Å × (1 + z QSO )) and hence arising mostly from the BLR. Conversely, DLAs located exactly at the quasar redshift correspond to Ly-α photons arising both from the BLR and from narrower Ly-α emission arising from regions further away from the central engine, up to the very outskirts of the quasar host (see e.g. Fathivavsari et al. 2015). This "narrow" and likely more extended component can therefore more easily leak through the absorbers. If this is the case, then we can expect that intervening H 2 -bearing clouds also have projected sizes smaller than the emission region of the quasar at the peak Ly-α wavelength. This potentially could be detected as a partial coverage effect in the metal absorption lines. Indeed, Balashev et al. (2017) have recently observed an unambiguous partial coverage of the Ly-α emission by the S ii absorbing gas (see their Fig. 12) associated to an intervening DLA (z abs = 2.786, z QSO = 2.92) with damped H 2 lines. A systematic study of the partial coverage of the Ly-α emission could provide clues on the origin and extent of the different Ly-α emission components.

Fig. 5. [...] Table 1). Unfilled symbols correspond to the additional systems described in Sect. 2.4 (flag = 0). The colour indicates the visual classification (black: A, grey: B). Finally, red squares are overplotted on top of systems with clear Si ii * absorption. The solid (resp. dashed) segments correspond to the median (resp. mean) values in different velocity bins, using only statistical rank A systems. Values measured to be less than 0.01 are set to 0.01 for plotting convenience. The cross at the top-left corner shows typical (albeit conservative) uncertainties along both axes.
However, there may also be a physical reason for the clouds at small velocity separation covering statistically less of the Ly-α emission than those at large velocity separation. Indeed, neutral gas clouds close to the UV source may typically have higher density and hence be smaller (for a given column density) than those located farther away, as proposed by Fathivavsari et al. (2017). This would also explain the observations if systems close in velocity space are also statistically closer in distance. This is a valid possibility as clouds rotating with the quasar host galaxy should have little velocity along the line of sight while those located in other galaxies of the group could have larger |∆v|. Gas flows (either winds or infall) can however complicate the picture, being located relatively close to the source but still possibly having large relative velocities. Interestingly, there is a trend for systems redshifted with respect to the quasar (∆v < 0 in our convention, possibly due to infalling gas) to have a larger leaking fraction while also featuring excited levels of silicon (red squares on Fig. 5). Both collisions (denser cloud) and an enhanced UV field (closer to quasar) would help populate the fine-structure levels.
All this means that the presence of leaking Ly-α alone is probably not enough to differentiate between a wavelength dependence of the size of the emission region and a distance dependence of the size of the absorbing cloud. However, metal lines (in particular in excited states) as well as molecular lines may provide further information to distinguish between the different scenarios. Finally, we note that it is very likely that other clouds, similar to that giving rise to the DLA, are located in the same galaxy (e.g., the quasar host or a group member) yet spatially offset from the line of sight to the quasar central engine. While these clouds do not intercept the line of sight to the compact continuum source they may still contribute to the absorption of the spatially extended Ly-α emission. Absorption signatures of such clouds would however be very difficult to identify. Only detailed measurements of absorption lines falling on top of emission lines, arising from the spatially extended emission region, would reveal the presence of such complex absorption geometries. In order to carry out such detailed analyses of the absorption and emission geometry, higher resolution spectroscopy with better S/N is required.
We also caution that the uncertainties on ∆v are large and dominated by the uncertainty on the quasar redshift. Measuring accurately the quasar systemic redshift through follow-up observations of the narrow forbidden emission lines in the near infrared would be imperative to confirm or reject the above discussed trends.
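As an illustration of how the quasar redshift uncertainty propagates into ∆v, the sketch below assumes the common first-order definition ∆v = c (z em − z abs )/(1 + z em ), which reproduces the sign convention adopted here (negative when z abs > z em ). The exact definition used for the measurements is not restated in this section, so both the formula and the example redshift error are assumptions.

```python
C_KM_S = 299792.458  # speed of light [km/s]


def relative_velocity(z_em, z_abs):
    """Absorber velocity relative to the quasar [km/s]; negative if z_abs > z_em."""
    return C_KM_S * (z_em - z_abs) / (1.0 + z_em)


def velocity_error(z_em, z_abs, sigma_z_em):
    """First-order error on the relative velocity from the quasar redshift error only."""
    # d(delta_v)/d(z_em) = c * (1 + z_abs) / (1 + z_em)**2
    return C_KM_S * (1.0 + z_abs) / (1.0 + z_em) ** 2 * sigma_z_em


if __name__ == "__main__":
    # Hypothetical values chosen for illustration only.
    print(relative_velocity(3.000, 3.010))
    print(velocity_error(3.000, 3.010, 0.004))
```

With these hypothetical numbers, a redshift error of a few times 10 −3 at z ∼ 3 already corresponds to a few hundred km s −1 , comparable to the velocity separations discussed above.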

Metal lines
Metal absorption lines are systematically seen associated to the H 2 system. However, at the typical S/N ratio and given the low resolution of the SDSS spectra, the only information we can obtain is the equivalent width of strong lines, which are very likely intrinsically saturated. The equivalent width of such lines is therefore mostly determined by the velocity spread of the profile. Observationally, high resolution studies of DLAs indicate that the velocity extent of metal lines correlates well with the metallicity (Ledoux et al. 2006). This means that we can in principle use the observed equivalent width to get an idea of the metallicity. We measured the Si iiλ1526 equivalent widths using an automated procedure and obtain the distributions shown on Fig. 6. The median equivalent width in our statistical sample is about 1 Å, i.e. similar to that observed by Balashev et al. (2014) for the population of strong intervening H 2 systems. Using the empirical relation [X/H] = −0.92 + 1.41 log(W r λ1526) from Prochaska et al. (2008a), the median equivalent width corresponds to a metallicity of about one tenth Solar. However, we caution that this empirical equivalent-width metallicity relation has been obtained using intervening systems and thus may not actually apply here. Therefore, we further test this result using a stacked spectrum built by median averaging all systems visually classified A. The obtained composite spectrum, shown in Fig. 7 has a S/N ratio of about 50, allowing us to detect weak absorption lines that are otherwise undetectable in individual spectra and whose equivalent width will then depend rather on the column density than the velocity extent. The typical species seen in the overall population of DLAs are detected but we also detect significant C i lines, that are otherwise much less frequent in DLAs (Ledoux et al. 2015). This is consistent with our H 2 selection since C i is known to be a good tracer of molecular gas .
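The empirical relation quoted above is simple enough to evaluate directly. The following sketch only encodes the relation from Prochaska et al. (2008a) as written in the text; the function name is an illustrative choice.

```python
import math


def metallicity_from_si2_1526(w_rest_angstrom):
    """[X/H] from the rest-frame Si II 1526 equivalent width [Angstrom]."""
    return -0.92 + 1.41 * math.log10(w_rest_angstrom)


if __name__ == "__main__":
    x_h = metallicity_from_si2_1526(1.0)   # median equivalent width of about 1 Angstrom
    print(x_h, 10.0 ** x_h)                # metallicity in dex and as a fraction of Solar
```

For the median equivalent width of about 1 Å the logarithmic term vanishes, so the relation returns its zero point, [X/H] = −0.92, i.e. roughly one tenth Solar.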
Using the unblended and undepleted S ii λ1253 line, and assuming the optically thin regime, we obtain a metallicity of about [S/H] ∼ −0.9 using the median log N(H i) = 21. Similarly, we obtain [Zn/H] ∼ −0.8 from Zn ii λ2026 and [Si/H] ∼ −1 from Si ii λ1808. This exercise shows us that the average metallicity of our sample should be roughly 1/10th of the Solar value. This is higher than the typical value seen in DLAs, albeit lower than purely C i-selected systems, that have Solar metallicity (Zou et al. 2018, Ledoux et al., in prep.). Nonetheless, it is important to keep in mind that the metal equivalent widths in our proximate molecular systems spread over a wide range, so that the metallicities are also likely to differ significantly from one system to another. Still, we attempt to identify some global trends in the following.
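The optically thin conversion from equivalent width to metallicity can be sketched as below, using the standard linear curve-of-growth relation N = 1.13 × 10^20 W/(f λ^2) with W and λ in Å. The oscillator strength and Solar abundance of sulphur are approximate literature values assumed for illustration, and the equivalent width in the example is hypothetical; none of these numbers are measurements from the composite spectrum.

```python
import math


def column_density_thin(w_rest, wavelength, f_osc):
    """Optically thin column density [cm^-2] from a rest equivalent width [Angstrom]."""
    return 1.13e20 * w_rest / (f_osc * wavelength**2)


def abundance(log_n_x, log_n_hi, solar_12_log):
    """[X/H] given log N(X), log N(H I) and the Solar abundance 12 + log(X/H)."""
    return log_n_x - log_n_hi - (solar_12_log - 12.0)


if __name__ == "__main__":
    # Assumed atomic data for S II 1253 (approximate literature values).
    S2_1253_WAVELENGTH = 1253.81   # Angstrom
    S2_1253_F_OSC = 0.0109
    SOLAR_S = 7.12                 # 12 + log(S/H), assumed Solar value
    w_hypothetical = 0.25          # Angstrom, illustrative only
    log_n_s2 = math.log10(column_density_thin(w_hypothetical, S2_1253_WAVELENGTH, S2_1253_F_OSC))
    print(abundance(log_n_s2, 21.0, SOLAR_S))
```

Because the relation is linear in W, any saturation hidden at SDSS resolution would make the true column density, and hence the metallicity, higher than this estimate.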
We then compare the composite spectrum with that obtained for a subset with significant leaking Ly-α emission. Overall, there is no striking difference between the strength of the main metal lines. However, it appears that the equivalent width of the weak Si iiλ1808 line remains almost unchanged while other Si ii lines (λ1260, 1304, 1526) are weaker for systems with leaking Ly-α. This suggests that the column densities (and the metallicities, since the median log N(H i) is unchanged) in the Ly-α-leaking sub-sample are similar to the overall average, but that systems with leaking emission may have smaller velocity spreads than the average. This could also explain the narrower C iv seen in the 'leaking' sub-sample.
A more significant difference is seen for the C ii line. While the overall median composite spectrum already shows clear evidence of C ii * absorption in the wing of the C iiλ1334 line, the composite spectrum corresponding to the Ly-α-leaking sub-sample apparently has a much higher C ii*/C ii ratio ( Wr(C ii * )/Wr(C ii) ∼ 0.4 overall versus Wr(C ii * )/Wr(C ii) ∼ 0.8 for Ly-α-leaking systems). A zoomed version of Fig. 7 is shown in Fig. 8, along with the composite spectrum built for systems with even stronger Ly-α leaking fraction ( f leak > 0.2). In the last composite spectrum, albeit noisier given only four grade A systems contributing to the stack, the C ii * line appears even stronger than C ii. All this indicates an increasing excitation of C ii with increasing leakage of Ly-α consistent with the findings of Fathivavsari et al. (2018). Since the excited level of ionised carbon is mostly excited by collisions (Silva & Viegas 2002;Goldsmith et al. 2012), this would favour a dependence of Ly-α leaking fraction on the compactness of the cloud. However, detailed investigation through follow-up observations and numerical modelling is needed to confirm the higher C ii * excitation and to understand its origin.

Dust reddening
In order to obtain a measure of the reddening induced by dust, we fitted the individual spectra using the quasar template by Selsing et al. (2016), reddened by either the extinction curve of the Small Magellanic Cloud (SMC) or that of the giant shell in the Large Magellanic Cloud (LMC2) as parameterised by Gordon et al. (2003). However, due to the limited wavelength coverage of the spectra, we were not able to significantly distinguish the two extinction laws. In what follows, all measurements of dust reddening are therefore reported assuming the SMC extinction curve. Since the broad emission lines may vary significantly from one quasar to another, we masked out the corresponding parts of the spectra. This was done by defining 'bona fide' continuum regions in the quasar rest-frame which were used to constrain the fit. These regions were defined as: 1314 − 1351, 1430 − 1490, 1585 − 1600, 1700 − 1830, and 2000 − 2225 Å.
The best-fit values of A V are given in Table 1. Due to the intrinsic variations of the spectral power-law index of quasars, we report negative reddening for some targets. This does not necessarily mean that there is no dust reddening, but it is not possible to break the degeneracy without spectroscopic data covering the full rest-frame optical range of the quasar spectral energy distribution.
We can quantify the significance of the A V measurements by calculating the expected dispersion in A V introduced by variations in the power-law index. Based on the measured intrinsic dispersion of the quasar power-law index of σ β = 0.186 (Krawczyk et al. 2015), we calculate an expected 1-σ dispersion in A V of σ A V = 0.12 mag. We can therefore state that any target with A V > 2 σ A V is significant at 95 % confidence level, and any value below this threshold should be considered an upper limit, i.e., A V < 0.24 mag. In spite of a few exceptions, most of the quasars present no significant reddening (see Fig. 9), with a median A V of only 0.04 mag, which is consistent with the value measured for the sample of intervening H 2 -bearing DLAs selected in SDSS (Balashev et al. 2014). The typical dust-to-gas ratio in our sample is then roughly A V /N(H) ∼ (1 − 2) × 10 −23 mag cm 2 , which is similar to or lower than the typical value for intervening DLAs (∼ (2 − 4) × 10 −23 mag cm 2 , Vladilo et al. 2008) and much lower than values measured in the local ISM (where the dust-to-gas ratio is about 30 times higher, e.g. Watson 2011) and in C i-selected molecular-rich intervening systems (Ledoux et al. 2015; Zou et al. 2018) that also typically have Solar metallicities and low N(H i). Our current sample may be biased against systems with high reddening, not only because the colour selection may preclude their presence in the SDSS-III spectroscopic database, but also because of the decreased S/N ratio in the blue, impeding the detection (and visual confirmation) of the H 2 lines. Indeed, including the additional (non-statistical) systems, the median A V /N(H) increases by a factor of two, owing to the inclusion of several significantly reddened systems with lower N(H i) values. Given the low dust-to-gas ratios, the presence of H 2 might then rather be due to higher densities than those typically derived in intervening H 2 -bearing DLAs (50-100 cm −3 , see e.g. Srianand et al. 2005; Noterdaeme et al. 2017), with the notable exception of the extremely strong H 2 system towards SDSS J0843+0221, which has a low metallicity ([Zn/H] ∼ −1.5) and high density, n H ∼ 300 cm −3 .
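The significance threshold and the dust-to-gas ratio discussed above follow from simple arithmetic, sketched below. The helper names are illustrative, and the example assumes N(H) ≈ N(H i) with log N(H i) = 21.3, the median value adopted later in the Discussion; it is not a re-derivation of the values in Table 1.

```python
SIGMA_AV = 0.12   # mag, expected 1-sigma dispersion from the power-law index scatter


def av_threshold(n_sigma=2.0, sigma_av=SIGMA_AV):
    """A_V above which a measurement is considered significant."""
    return n_sigma * sigma_av


def report_av(av):
    """Return (value, is_upper_limit): non-significant values become upper limits."""
    limit = av_threshold()
    if av > limit:
        return av, False
    return limit, True


def dust_to_gas(av, log_n_h):
    """A_V / N(H) in mag cm^2."""
    return av / 10.0 ** log_n_h


if __name__ == "__main__":
    print(av_threshold())            # 2-sigma threshold [mag]
    print(report_av(0.04))           # median A_V is reported as an upper limit
    print(dust_to_gas(0.04, 21.3))   # typical dust-to-gas ratio [mag cm^2]
```

With the median A V of 0.04 mag and log N(H) = 21.3, the ratio comes out at about 2 × 10 −23 mag cm 2 , at the upper end of the range quoted in the text.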

Discussion
Fig. 10. Density required to produce H 2 as a function of the distance to the quasar. We assumed here a typical situation, with column density equating the median observed value (log N(H i) = 21.3), assuming σ̃ g = 0.1 (red) and σ̃ g = 0.5 (purple) and a quasar with the median luminosity observed at the Lyman-Werner wavelength range. The different curves are when including a local UV field, in units of Draine field (χ loc = 0, 1, 10). [Axes: distance to the quasar r (Mpc) versus density n.]

By construction, we select only saturated H 2 systems (with log N(H 2 ) ∼ 20). At such large H 2 column densities, we expect that the H i-H 2 transition has already occurred. We can then use the theoretical description of the H i-H 2 transition by Sternberg et al. (2014, see also Bialy & Sternberg 2016) to constrain the physical properties of the cloud. Following their formalism, the surface density of H i at which the transition occurs is given by

$$\Sigma_{\rm H\,i} = \frac{3.35}{\tilde{\sigma}_g} \ln\left(\frac{\alpha G}{3.2} + 1\right)\ M_\odot\,{\rm pc}^{-2}, \quad (2)$$

where

$$\alpha G = 2.85 \times 10^{-8}\, F_0 \left(\frac{100\ {\rm cm}^{-3}}{n_{\rm H}}\right) \left(\frac{9.9}{1 + 8.9\,\tilde{\sigma}_g}\right)^{0.37}. \quad (3)$$

In these equations, σ̃ g ≡ σ g /(1.9 × 10 −21 cm 2 ) is the dust grain Lyman-Werner (LW = 11.2 -13.6 eV, 911.6 Å-1107 Å) photon absorption cross section per hydrogen nucleon normalised to the fiducial Galactic value. n H is the hydrogen number density of the cloud and F 0 is the free-space LW photon flux (cm −2 s −1 ) irradiating the cloud (see Bialy et al. 2015, 2017). Note that the constant factor in Eq. 2 is a factor of two lower than that used in previous works (e.g., Ranjan et al. 2018) considering a slab of gas illuminated on both sides while we here consider one-sided illumination dominated by the quasar. Knowing the quasar luminosity at the LW band and H i column density in the cloud, we can then derive the number density of the H 2 cloud as a function of its distance to the quasar, for a given dust enrichment. In Fig. 10, we illustrate the relation between the cloud density and its distance to the quasar UV source for the typically observed quasar and cloud properties. More specifically, the relation is calculated for the median quasar luminosity at the LW band assuming a median H i column density of log N(H i) = 21.3. We considered a typical value of σ̃ g = 0.1, corresponding to the median A V and N(H i) values of our sample (σ̃ g = 4.8 × 10 20 A LW /N(H)), but we also included a calculation for σ̃ g = 0.5. Finally, we considered two calculations: one with and one without a local source of UV photons, χ loc , expressed in units of the interstellar radiation field as measured by Draine (1978).
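The calculation behind the density-distance relation can be sketched as follows. The sketch assumes the one-sided form of the transition relation, Σ(H i) = (3.35/σ̃ g ) ln(αG/3.2 + 1) M⊙ pc −2 , together with the expression for αG given in Eq. 3, and solves for n H at a given distance. The LW photon emission rate of the quasar (`q_lw`) is a hypothetical input, since the median quasar luminosity is not quoted numerically in this section, and the Draine-field LW photon flux of 2.07 × 10^7 cm −2 s −1 is the value adopted by Sternberg et al. (2014), assumed here for the local field.

```python
import math

M_SUN_G = 1.989e33        # solar mass [g]
PC_CM = 3.0857e18         # parsec [cm]
M_H_G = 1.6735e-24        # hydrogen atom mass [g]
DRAINE_LW_FLUX = 2.07e7   # LW photon flux of a unit Draine field [cm^-2 s^-1] (assumed)


def alpha_g_required(log_n_hi, sigma_g):
    """alpha*G for which the transition column equals N(H I), one-sided illumination."""
    sigma_hi = 10.0**log_n_hi * M_H_G * PC_CM**2 / M_SUN_G   # [Msun pc^-2]
    return 3.2 * (math.exp(sigma_g * sigma_hi / 3.35) - 1.0)


def lw_flux(q_lw, r_kpc, chi_loc=0.0):
    """LW photon flux [cm^-2 s^-1] from the quasar plus an optional local field."""
    r_cm = r_kpc * 1.0e3 * PC_CM
    return q_lw / (4.0 * math.pi * r_cm**2) + chi_loc * DRAINE_LW_FLUX


def required_density(q_lw, r_kpc, log_n_hi=21.3, sigma_g=0.1, chi_loc=0.0):
    """Hydrogen density [cm^-3] needed for the H I-H2 transition at distance r."""
    dust_term = (9.9 / (1.0 + 8.9 * sigma_g)) ** 0.37
    f0 = lw_flux(q_lw, r_kpc, chi_loc)
    return 2.85e-8 * f0 * 100.0 * dust_term / alpha_g_required(log_n_hi, sigma_g)


if __name__ == "__main__":
    Q_LW = 1.0e56   # hypothetical LW photon rate [s^-1], for illustration only
    for r in (10.0, 100.0, 300.0):
        print(r, required_density(Q_LW, r), required_density(Q_LW, r, chi_loc=1.0))
```

In this form the required density scales as r −2 while the quasar dominates the LW flux and flattens once the local field takes over, and it drops steeply when σ̃ g (and hence the dust opacity for a fixed N(H i)) increases.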
We find that, farther than about 0.3 Mpc, atomic hydrogen can transition to H 2 in relatively low-metallicity clouds with density n H ∼ 100 cm −3 , similar to what has been derived in intervening H 2 -bearing DLAs observed so far (e.g. Srianand et al. 2005; Noterdaeme et al. 2017; Ranjan et al. 2018). In other words, at such distances, the conditions for the formation of H 2 become similar to those of intervening clouds, as seen from the inflexion point where the influence of a realistic local UV field becomes comparable to that of the quasar. Since such clouds would typically be of parsec scales, it is not surprising that Ly-α photons from the narrow line region of the quasar (and a fortiori from extended emission regions) can leak around the absorbing cloud. Closer than 0.1 Mpc, the quasar UV flux likely dominates and the density must be higher (n H ∝ r −2 ) for H 2 to form efficiently. It is important to note, however, that this depends strongly on the σ g × N(H i) product and hence on the total dust extinction, with n H ∝ [exp(A V ) − 1] −1 (when ignoring the slow dependence on σ g of the second factor in Eq. 3). For example, while keeping the same N(H i), a value of σ̃ g ∼ 0.5 results in a decrease of the required density for H 2 formation by about an order of magnitude. This may be the case for the most reddened systems in our sample. As we get closer to the quasar, we expect that higher densities, together with a stronger UV field, will result in the excitation of fine-structure levels of species like Si ii and O i.

Fig. 11. Composite spectra obtained by median-averaging the spectra of five systems with clear Si ii * detection (black) and another five where this detection is tentative (green).
While we do not see any evidence for excited fine-structure levels in most of the systems, nor in the median stack, we do find clear evidence of Si ii * in five systems (J0015+1842, J0125-0129, J1131+0812, J1242+4448 and J1421+5245) as well as tentative evidence in another five systems (J0756+1123, J0911+4110, J1135+2957, J1358+1410 and J1512+3821). Composite spectra of these systems around the main Si ii * lines are shown in Fig. 11. Interestingly, these systems with Si ii * tend to have stronger and wider leaking Ly-α emission than seen on average, while not necessarily being located at the exact quasar redshift.
Similarly, Fathivavsari et al. (2018) show that excited levels of silicon and oxygen are systematically seen in proximate (metal-selected) DLAs with Ly-α emission in their trough. The authors find a sequence in which the equivalent width of the fine-structure lines increases with increasing leaking Ly-α emission. In the case of eclipsing DLAs, the fine-structure lines are weak whereas the lines are much stronger in the case of ghostly DLAs, which the authors interpret as an effect arising from clouds so compact that the BLR is not fully covered. However, in the absence of detailed investigation through follow-up studies, the number density remains degenerate with the strength of the UV flux since an increase of both these quantities increases the excitation of the fine-structure lines. The presence of H 2 should help break this degeneracy since an atomic-to-molecular transition requires the cloud to be denser when the UV field is stronger (or equivalently when the cloud is located closer to the quasar). Additionally, the excitation of high rotational levels of H 2 could also be efficiently used to discriminate between enhanced UV flux and increased number density, since these are predominantly populated via UV pumping.
The distance-density constraint can be converted into a constraint between cloud-size and distance, using l = N(H)/n H , where N(H) ∼ N(H i). For example, at 10 kpc, the required density for a H i-H 2 transition (n H ∼ 2 × 10 4 cm −3 for σ g = 0.1, log N(H i) = 21.3) would imply a cloud-size less than 0.1 pc. This is a strict upper limit since part of the observed column density may be unrelated to the H 2 cloud. Indeed, not only the numerator in the expression of l is decreased, but the denominator is also increased through Eqs. 2 and 3. On the other hand, we can estimate the size of the BLR using the relation between quasar luminosity and BLR size obtained from reverberation mapping. For the typical quasar luminosity λL λ (1350Å) ≈ 10 46 erg s −1 in our sample and using the relation from Kaspi et al. (2007), we obtain a C iv BLR size of about 0.1 pc. This is already comparable to the expected cloud size at 10 kpc derived above. Furthermore, the Ly-α BLR is likely to be more extended than the C iv BLR owing to scattering. In other words, the compression of neutral clouds required for an atomic-to-molecular transition to occur, if located closer than 10 kpc, could be such that the projected size of the cloud becomes comparable to that of the BLR. When the partial covering of the BLR gets significant, the system may be seen as a ghostly DLA. Since this is not the case for our systems, these are most likely located farther away, i.e. in other galaxies from the same group or in large-scale gas flows. Notwithstanding, H 2 may still form at distances of ∼10 kpc from the quasar in more diffuse clouds (hence possibly covering fully the BLR, i.e. non-ghostly DLAs) provided their metallicity is high enough (e.g. purple line on Fig. 10).
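The cloud-size estimate l = N(H)/n H used above is a one-line conversion, sketched here with the values quoted in the text (log N(H i) = 21.3 and n H ∼ 2 × 10^4 cm −3 at 10 kpc). The function name and the parsec conversion constant are the only additions.

```python
PC_CM = 3.0857e18   # parsec [cm]


def cloud_size_pc(log_n_h, n_h):
    """Line-of-sight cloud size l = N(H) / n_H, returned in parsec."""
    return 10.0**log_n_h / n_h / PC_CM


if __name__ == "__main__":
    # Values quoted in the text for a cloud at 10 kpc with sigma_g = 0.1.
    print(cloud_size_pc(21.3, 2.0e4))
```

For these values the size comes out at a few hundredths of a parsec, consistent with the upper limit of 0.1 pc stated above and comparable to the C iv BLR size derived from the Kaspi et al. (2007) relation.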
Because Ly-α transfer complicates the apparent velocity and spatial extent of the emission compared to that of the gas producing it, it will be interesting to look for signatures of partial coverage of other emission lines by different species, as done for intervening systems by e.g. Balashev et al. (2011) and Bergeron & Boissé (2017). C i is an interesting species since not only does it trace the same gas as that seen in H 2 , but it also has several transitions, one of which (at 1560 Å) falls on the wing of the C iv emission line, while the other C i lines are located on the quasar continuum, which arises from the extremely small accretion disc. The continuum, by selection, should be fully covered by the absorbing clouds.
Before summarising our results, we remark that the transition theories used in the discussion implicitly assume a steady-state regime. Accurate measurements of the density and dust content in the molecular phase would allow us to investigate whether the molecular formation has reached an equilibrium or not. This would provide additional insights into the understanding of H 2 in quasar environments.

Summary
We have developed a novel technique to directly detect strong H 2 absorbers in low-resolution spectra solely from their Lyman-Werner band absorption, without any prior on the associated H i or metal content. Applying our technique to the SDSS-DR14 database, we have assembled a significant sample of strong H 2 systems proximate to the quasar redshift, with |∆v| ≲ 2000 km s −1 . We have studied the absorber statistics and investigated the basic characteristics that can be derived from the SDSS data. Our main findings are the following.
(1) We found that the incidence of proximate H 2 systems is about four to five times higher than that expected from the statistics of intervening systems. We further found that the excess of H 2 systems peaks at the quasar redshift, with an excess of more than an order of magnitude compared to intervening statistics. This shows that most of the proximate systems are actually associated to the quasar environment, arising either from galaxies in the same group or from the quasar host itself. The observed velocities hence do not correspond to the Hubble flow, but to the individual cloud velocities.
(2) Unsurprisingly, the proximate H 2 systems are also damped Ly-α systems. The column density distribution is however skewed to much higher values than the overall population of intervening DLAs, but only about a factor of two higher than our strong intervening H 2 systems selected the same way. The higher N(H i) values could be expected in order to shield H 2 clouds closer to a strong UV source.
(3) We detected significant Ly-α emission in the core of the DLA profile for about half of our sample. We showed that the fraction of leaking Ly-α photons is higher when the DLA is located at small velocity separation from the quasar's systemic redshift. This indicates that the projected size of the absorbing cloud relative to that of the Ly-α emission region decreases with decreasing velocity separation. This effect can then be explained by Ly-α emission at the emission peak arising from both the broad line region and gas located farther out (narrow line region, or even kpc-scales), while photons in the wings of the Ly-α emission arise only from the compact broad line region, and hence are easily covered by the cloud. It is also possible that clouds with smaller velocity separation belong to the quasar host, while those at high velocities belong to other galaxies in the group. In this case, clouds located closer to the UV source could be more compact, as suggested by Fathivavsari et al. (2018), hence covering less of the quasar emission.
(4) The equivalent width distribution as well as the average metal strength seen in a composite spectrum indicates that the proximate H 2 systems have metallicities around one tenth Solar, albeit with a wide dispersion between individual systems. We also identify several cases with signatures of high excitation, namely the presence of fine-structure lines of Si ii and C ii. These tend to be related to the fraction of leaking Ly-α photons, suggesting that the corresponding clouds are indeed more compact than typical DLA clouds.
(5) The measured high H 2 abundance allows us to bring further clues to the understanding of the clouds' origin. Following the H i-H 2 transition theory developed by Sternberg et al. (2014), we show that the number density required for a transition to occur depends strongly on the distance to the quasar, for a given metallicity and column density. Clouds located in galaxies from the group farther than about 100 kpc from the quasar may have characteristics very similar to intervening clouds. In turn, clouds located within the quasar host or belonging to flows to or from the quasar would need n H ∼ 10 4 − 10 5 cm −3 to form H 2 and hence have very small dimensions. This could be the case for the systems with the highest excitation (dense gas, close to UV source) and large Ly-α leaking fraction (due to less coverage of the quasar emission line regions). On the other hand, it will be interesting to study the presence and excitation of H 2 in the overall population of proximate DLAs, in particular the ghostly DLAs, which are expected to be the sub-population located closest to the central engine (Fathivavsari et al. submitted).
In conclusion, given the spread in absorber characteristics (metallicities, dust extinction, excitation of fine-structure lines, and the presence, strength and width of leaking Ly-α emission), it is likely that there is no single origin for such clouds. While a large fraction, even with leaking Ly-α emission, is likely to belong to other galaxies in the group, several systems in our sample may well be directly associated to the quasar host or flows to or from the quasar. Follow-up at higher spectral resolution is required to investigate the partial coverage of the emission line regions by the absorbing clouds, to measure the exact relative velocity between the quasar and the cloud, to estimate the chemical enrichment in individual systems, and finally to investigate the physical conditions in order to estimate the cloud's density and distance to the UV source. The excitation of fine structure levels of ionised silicon and carbon as well as neutral oxygen and carbon will bring important constraints, together with the presence and excitation of molecules.