Issue |
A&A
Volume 660, April 2022
|
|
---|---|---|
Article Number | A22 | |
Number of page(s) | 15 | |
Section | Cosmology (including clusters of galaxies) | |
DOI | https://doi.org/10.1051/0004-6361/202142445 | |
Published online | 06 April 2022 |
Revealing new high-redshift quasar populations through Gaussian mixture model selection
1
Max-Planck Institut fur Radioastronomie, Auf dem Hügel 69, 53121 Bonn, Germany
e-mail: wagenveld@mpifr-bonn.mpg.de
2
Leiden Observatory, Leiden University, PO Box 9513 2300 RA Leiden, The Netherlands
3
Department of Physics and Astronomy, University College London, Gower Street, London WC1E 6BT, UK
4
SUPA, Institute for Astronomy, Royal Observatory, Blackford Hill, Edinburgh EH9 3HJ, UK
Received:
14
October
2021
Accepted:
24
January
2022
We present a novel method for identifying candidate high-redshift quasars (HzQs; z ≳ 5.5) –which are unique probes of supermassive black hole growth in the early Universe– from large-area optical and infrared photometric surveys. Using Gaussian mixture models to construct likelihoods and incorporating informed priors based on population statistics, our method uses a Bayesian framework to assign posterior probabilities that differentiate between HzQs and contaminating sources. We additionally include deep radio data to obtain informed priors. Using existing HzQ data in the literature, we set a posterior threshold that accepts ∼90% of known HzQs while rejecting > 99% of contaminants such as dwarf stars or lower redshift galaxies. Running the probability selection on test samples of simulated HzQs and contaminants, we find that the efficacy of the probability method is higher than traditional colour cuts, decreasing the fraction of accepted contaminants by 86% while retaining a similar fraction of HzQs. As a test, we apply our method to the Pan-STARRS Data Release 1 (PS1) source catalogue within the HETDEX Spring field area on the sky, covering 400 sq. deg. and coinciding with deep radio data from the LOFAR Two-metre Sky Survey Data Release 1. From an initial sample of ∼5 × 105 sources in PS1, our selection shortlists 251 candidate HzQs, which are further reduced to 63 after visual inspection. Shallow spectroscopic follow-up of 13 high-probability HzQs resulted in the confirmation of a previously undiscovered quasar at z = 5.66 with photometric colours i − z = 1.4, lying outside the typically probed regions when selecting HzQs based on colours. This discovery demonstrates the efficacy of our probabilistic HzQ selection method in selecting more complete HzQ samples, which holds promise when employed on large existing and upcoming photometric data sets.
Key words: quasars: supermassive black holes / galaxies: high-redshift / methods: statistical
© J. D. Wagenveld et al. 2022
Open Access article, published by EDP Sciences, under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Open Access funding provided by Max Planck Society.
1. Introduction
Studying large statistical samples of high-redshift quasars (HzQs) is essential for understanding the formation and evolution of super-massive black holes (SMBHs) in the early Universe. The presence of Gunn-Peterson (GP) troughs (Gunn & Peterson 1965) in the spectra of HzQs at z ∼ 6, which are caused by near-complete absorption of Lyα photons by the increasingly neutral intergalactic medium (IGM) along the line of sight, make them crucial probes of cosmic reionisation (EoR; Fan 2006; Becker et al. 2015). These GP troughs can in turn be used to photometrically identify large samples of HzQs, and the proliferation of wide-area multi-band photometric surveys at optical wavelengths, such as the Sloan Digital Sky Survey (SDSS; Abazajian et al. 2003) and the Panoramic Survey Telescope and Rapid Response System surveys (Pan-STARRS; Chambers et al. 2016), has enabled the discovery of statistically significant samples of bright quasars at high redshifts, with now over 500 confirmed HzQs at z > 5 (see Ross & Cross 2020, for a compilation).
For HzQs at z ∼ 6, towards the end of the EoR, the GP trough falls between the i- and z-band filters. Therefore, in the context of the SDSS and Pan-STARRS surveys (carried out using the u, g, r, i, z, and y filters), quasars at z ∼ 6 may be identified through pin-pointing ‘i-dropout’ sources, that show extreme i − z colours. Searches for HzQs using photometric dropout techniques over large areas of the sky often employ linear cuts in magnitude and colours (e.g. Bañados et al. 2015, 2016). For example, a colour cut of i − z > 1.5 to 2.5 is typically implemented in addition to magnitude cuts to ensure a balance between the selection of robustly detected HzQs and the exclusion of as many contaminating foreground sources as possible, which are often M-, L-, and T-type brown dwarf stars in the Milky Way (Fan et al. 2001; Willott et al. 2005; Bañados et al. 2016; Jiang et al. 2016).
A radio detection can considerably aid in removing foreground contaminants such as dwarf stars that do not emit persistent radio continuum emission at the sensitivity of current observations (e.g. Burningham et al. 2016), as around 10% of HzQs are seen to be ‘radio-loud’ (radio loudness being defined as the ratio between rest frame radio and optical flux density) even out to high redshifts (e.g. Bañados et al. 2015). However, overlapping deep radio data are often not available for the large sky areas covered by optical surveys from which candidate HzQs are selected. Deep radio continuum surveys of large sky areas, such as the Low-Frequency Array (LOFAR) Two Metre Sky Survey (LoTSS; Shimwell et al. 2019), can therefore potentially provide valuable additional information that could help improve HzQ selection and minimise the probability of contaminants.
While selection based on optical and infrared colours from large-area surveys has been highly successful in identifying some of the most distant HzQs currently known (e.g. Fan et al. 2001; Willott et al. 2010; Bañados et al. 2016; Matsuoka et al. 2016, 2018; Pipien et al. 2018; Reed et al. 2019), the use of linear cuts may lead to potential biases in the samples of HzQ candidates. A binary cut in colour and magnitude may inevitably lead to a loss of promising HzQ candidates. Additionally, Mortlock et al. (2012) argued that linear cuts result in uniform grouping of high-S/N candidates with more marginal ones that lie near the edges of the selection region, possibly making spectroscopic follow-up harder to prioritise. Finally, HzQs lying close to the limits of the redshift ranges probed by colour selections may be missed due to the GP trough not being fully sampled by the relevant broadband filters used for dropout selection. For example, at redshifts of z ∼ 5.5, the i-dropout selection may result in certain sources being missed, possibly presenting a gap in our understanding of SMBH evolution and/or the later stages of cosmic reionisation (Yang et al. 2017).
Additionally, binary selection criteria are often unable to properly account for the observational uncertainties in the observed properties for either individual sources or the population of sources being targeted. To overcome these limitations specifically in the case of identifying HzQs, a probabilistic selection as opposed to a binary selection may represent a better way to both obtain more complete samples of HzQs and assign higher probabilities for spectroscopic follow-up to more promising candidates. Bovy et al. (2011) introduced an implementation of Gaussian mixture modelling (GMM) that assumes and then deconvolves a model of the underlying population of sources from data, leading to a robust estimate of probability distribution of sources such as HzQs even from noisy measurements. Such an approach has been successfully employed to assign probabilities and better select low- and intermediate-redshift quasar candidates from SDSS data (e.g. Bovy et al. 2011, but see also Bailer-Jones et al. 2008; Richards et al. 2009).
Further complexities can be introduced in these models to improve the probability assignment, for example by also taking into account the respective prior probabilities of the different contaminants – particularly dwarf stars in the Galaxy – based on their spatial distribution and number densities on the sky. Such an approach was implemented by Mortlock et al. (2012) for HzQs where prior information about populations that exhibit HzQ-like colours was used to assign a contamination (and as a result HzQ) probability and reduce the number of contaminants, leading to the discovery of a quasar at z = 7.1 (Mortlock et al. 2011). However, the initial selection of HzQ candidates in the probabilistic approach of Mortlock et al. (2012) still relied on linear colour cuts, and could potentially suffer from the same incompleteness issues as faced by other colour-based HzQ selections.
Therefore, there remains room for improvement in probabilistic HzQ selection methods, namely by more accurately constraining the luminosity and sky distributions of possible contaminants to obtain more complete samples of HzQs. In this work, we build upon the probabilistic approach of selecting HzQs based on posterior probability estimation using informed priors and likelihood estimation utilising GMMs. We also make use of deep radio observations of the HETDEX spring field taken as part of the LoTSS first data release (LoTSS DR1; Shimwell et al. 2017), using radio detection as an additional prior to minimise foreground contamination. With the combination of multi-wavelength data and a probabilistic approach, we aim to develop a selection technique capable of uncovering more complete samples of HzQs from large-area surveys while minimising the number of contaminating sources present in them.
This paper is organised as follows. In Sect. 2 we describe the data sets that are used, and our HzQ selection method is described Sect. 3. In Sect. 4 we apply our selection method to the data sets, obtaining probabilistically selected HzQ candidates. In Sect. 5 we present spectroscopic follow-up for a handful of identified high-priority HzQ candidates, and report the discovery of a previously undiscovered quasar at z ∼ 5.7. In Sect. 6, we discuss the performance of our selection method, application to incoming large sky survey data sets, and the possible implications of the discovery of P144+50. Finally, in Sect. 7 we summarise the findings of this study.
Throughout the paper, we assume a Planck 2015 cosmology (Planck Collaboration XIII 2015), with H0 = 67.8 km s−1 Mpc−1, ΩM = 0.308, and ΩΛ = 0.692. All magnitudes are given in the AB system (Oke & Gunn 1983) unless otherwise stated.
2. Data
2.1. Pan-STARRS
The primary data set used to identify HzQ candidates in this study is Pan-STARRS Data Release 1 (PS1). The PS1 survey covers 3π steradian of the sky, including the entire northern hemisphere (Chambers et al. 2016), reaching 5σ depths of 23.3, 23.2, 23.1, 22.3, and 21.3 AB in the g, r, i, z, and y optical filters, respectively.
We first retrieve a sample of sources from the PS1 data archive1, and although no colour cuts are made for the initial selection, a number of other criteria are applied to reduce the full PS1 sample down to the appropriate parameter space and more manageable numbers. As we are primarily interested in HzQs, we require a nondetection in the g and r filters while requiring a robust detection in the i, z, and y filters. The nondetections are attributed to magnitudes fainter than the 5σ limiting magnitudes in the photometric filters published by the PS1 team, or values of −999 as this value is the magnitude assigned in case of a nondetection in a particular band.
As a proof of concept, we also restrict our analysis to the sky area corresponding to the HETDEX Spring field, which is advantageously covered by deep radio data at 150 MHz from LoTSS DR1 (see Sect. 2.2). The selection criteria for obtaining an initial sample from the PS1 catalogue can be summarised as follows:
Furthermore, objects with flags from the PS1 processing pipeline indicating poor or low-quality detections are excluded to remove sources that have poor photometric data, following Table 6 of Bañados et al. (2014). From these criteria, a sample of ∼5 × 105 sources with complete photometric data is retrieved.
2.2. Radio data
The radio data used in this study are taken from LoTSS DR1 (Shimwell et al. 2017), which is a low-frequency radio continuum survey covering over 424 sq. deg. of the northern hemisphere that reaches a median sensitivity of 71 μJy beam−1. Consequently, radio sources that are considered are based on the 5σ LoTSS DR1 catalogues and have a flux density of at least 350 μJy. Full details about the data reduction, processing, final images, and source catalogue creation are presented in Shimwell et al. (2019), with robust optical cross identifications presented in Williams et al. (2019) and accompanying photometric redshifts in Duncan et al. (2019). We additionally make use of early LOFAR ‘deep-fields’ data in the Boötes field (Williams et al. 2018), which coincides with the HETDEX Spring field sky area.
Visual inspection of the radio and optical images of initial HzQ candidate samples drawn from PS1 demonstrated that dusty red galaxies at intermediate redshifts (0 ≲ z ≲ 3) represented an additional potential contaminant population that could also emit significant radio continuum emission. The Boötes deep field data from Williams et al. (2018) therefore acts as a primary reference for those sources where the high-quality optical data enable robust photometric redshifts.
2.3. Ancillary data
Several other data sets are utilised for training, testing, or validation of the GMM algorithms implemented in this work, or in accurate construction of priors. As a validation sample for HzQs, we use the sample of confirmed HzQs compiled by Bañados et al. (2016) containing all z > 5.6 quasars known up to March 2016.
As a reference data set for dwarf stars in the Milky Way, we use a catalogue of brown dwarfs observed in PanSTARRS by Best et al. (2017). From the same work we use data on the mean absolute magnitudes of different dwarf types in PS1, which are also used for constructing their sky densities.
3. High-redshift quasar selection
Having outlined the data sets used to implement our new HzQ selection, in this section we describe the ingredients that go into the construction and implementation of our GMM-based HzQ selection method.
Our HzQ selection method builds upon probabilistic selection of HzQs using a Bayesian framework presented by Mortlock et al. (2012), which does not rely on binary colour–magnitude cuts and incorporates additional prior knowledge about quasars and other contaminants to predict the likelihood of a source selected from a large area survey being a HzQ. For HzQs at z ≳ 5.6, therefore, a flexible algorithm can be constructed that can compute the probability for any given source based on its iP1, zP1, and yP1 magnitudes.
We begin by first defining the posterior distributions for the classes of objects that are likely to occupy the photometric parameter spaces typically occupied by HzQs. We recall that the posterior probability of a source being part of any particular population can be calculated using Bayes’ theorem:
where P(Ck) is the prior probability of an object belonging to class k, and ℒ(X|Ck) is the likelihood of the given source being part of class k, normalised over all N possible classes and their associated probabilities.
In reality, when considering the measurements of astronomical objects, additional factors related to both the distribution of sources on the sky and the survey limitations must be accounted for when deriving probability estimates. More generally, considering these additional factors and the features f = {f1, f2, …, fn} of a source that differentiate it from other sources in the data, the prior probability of sources belonging to class k with parameters θk is calculated as:
where ρ(θk) is the sky density, and P(det|θk, Ck) is the probability that the source is detected in the survey.
For sources detected in a flux-limited survey, the parameters θk most relevant to the probability are the magnitudes of the source classes in different filters, described by mk. In this case the features, f, would describe the observed magnitudes of a given source in different filters, . Therefore, to calculate the prior we marginalise over the relevant magnitude space. The prior is then combined with the likelihood in the ‘weighted evidence’ term, describing the evidence that the source in question belongs to a given class:
where is the likelihood of the features of a source belonging to an object of class k. With this, Eq. (1) can be rewritten as
Having rewritten the equation to calculate the posterior probability of any given class of objects detected in a survey in terms of its apparent magnitude, below we describe the classes of sources that we consider in our search for complete samples of HzQs.
3.1. Source classes
Successful implementation of our HzQ selection method requires the proper identification of all classes of sources that are relevant and overlap with the HzQ parameter spaces. As a result, not every class of astrophysical source needs to be considered, which may also be considered as setting the prior of non-relevant source classes to zero. The relevant classes consist of the target HzQ population and a set of contaminating populations occupying the same feature space. Therefore, we identify three relevant populations: HzQs, dwarf stars within the Milky Way, and dusty intermediate-redshift galaxies with red observed-frame optical colours. To model these populations we require data with PS1 magnitudes for each.
As mentioned above, we use the Galactic brown dwarf stars catalogue from Best et al. (2017) containing photometry of dwarf stars, and the deep multi-wavelength galaxy catalogues in the Boötes field from Williams et al. (2018) containing photometric measurements for dusty intermediate-redshift galaxies. Both catalogues contain ∼104 sources, which is sufficient to model the colour space reliably without being biased by scatter in individual sources.
However, the same is not true for the catalogue containing approximately 200 confirmed HzQs. Therefore, to model the distribution of the quasar population in the colour spaces probed, we simulate the rest-frame UV spectral energy distributions (SEDs) for a population of quasars using a distribution of power laws, α ∼ 𝒩(1.30, 0.38), following the distribution presented by Cristiani et al. (2016). These power-law SEDs are then combined with emission lines using the SDSS quasar template from Vanden Berk et al. (2001).
To then simulate a population of high-redshift quasars, each simulated quasar spectrum (continuum + emission lines) is redshifted. The redshift is drawn from the redshift distribution following the co-moving luminosity functions as defined in Mortlock et al. (2012), Eq. (13), in the redshift range of 5.6 < z < 6.5. A redshift-dependent IGM absorption from Madau (1995) is then applied to simulated spectra, and the spectra are then convolved with the Pan-STARRS photometric filters (using prescriptions built into the SMPY PYTHON package2). As we are only interested in obtaining a reasonable distribution in colour space, the results are not dependent on the absolute flux of the quasars. This method of generating quasar spectra results in a reliable distribution of HzQ colours, and to maintain consistency with the number of contaminants available to model, we simulate a total of 104 quasars in this manner. This method of simulating quasars rests on the assumption that both the Vanden Berk et al. (2001) template spectrum and power-law distribution from Cristiani et al. (2016) are valid to higher redshifts as well. While beyond the scope of this paper, more reliable samples of quasars could be generated using parametric SED modelling, which can account for intrinsic changes in quasar spectra as a function of luminosity and redshift (e.g. Temple et al. 2021).
3.2. Likelihoods and Gaussian mixture modeling
In order to estimate the likelihood of a source belonging to a certain population that is considered in this study, we model each relevant population in the iP1 − zP1, zP1 − yP1 colour space. To do this, we use GMMs, which assume that the probability density of a population can be described by a finite number of weighted Gaussian functions (Reynolds 2009). Therefore, to obtain a probability density, N Gaussian functions each with mean μi and variance Σi are given a weight wi, with the condition that the N weights sum up to unity as follows:
The GMM is implemented in a machine learning algorithm, which optimises the parameters using expectation maximisation (Dempster et al. 1977). To estimate the number of components needed to model each population adequately, the Bayesian information criterion (BIC) is used (Wit et al. 2012). This use of machine learning techniques to model various populations in colour space is a deviation from the method presented in Mortlock et al. (2012), and this is where the novelty of our method compared to traditional techniques relying on binary cuts in the colour space is best highlighted. The resulting likelihood of any given source belonging to a population follows directly from the GMM:
Furthermore, we use an extension of the classical GMM algorithm which implements extreme deconvolution, XDGMM (Holoien et al. 2017). This implementation is particularly suited for noisy data, as it deconvolves the noisy distribution of the population in order to capture the underlying distribution more accurately. This method thus makes use of the uncertainties in the data, both for deconvolving the models and to assign likelihood to input data. As the error bars are folded into the covariance matrix of the GMMs, sources with larger uncertainties are assigned lower likelihood. The GMM algorithm is used to model the previously defined populations (quasars, dwarf stars, galaxies) in iP1 − zP1, zP1 − yP1 colour space. The log likelihoods (assuming a constant error in magnitude) of the resulting GMMs are shown in Fig. 1, along with the sources used to create the models for each population, in the left plot simulated quasars, and in the middle and right plots the sources from the reference catalogues (Best et al. 2017; Williams et al. 2018, respectively). The Gaussian components for each mixture model are also shown, with four components for HzQs, six for dwarf stars, and one for galaxies. The number of components that minimises the BIC is chosen for each population separately.
![]() |
Fig. 1. Log likelihood of quasars (left), dwarf stars (middle), and low-redshift galaxies (right) modelled by a Gaussian mixture model in colour space. The red ellipses indicate the 3-sigma extents of the individual deconvolved Gaussians (four components for quasars, six for dwarf stars, and one for galaxies). The quasar likelihoods are assigned using photometry from simulated quasar spectra as highlighted in Sect. 3.1, whereas the photometry for dwarf stars and low-redshift galaxies are taken from their respective reference catalogues, Best et al. (2017) and Williams et al. (2018) (black dots). The dashed line in the left panel marks the commonly defined colour cut at i − z = 2.0. |
3.3. Detection prior
Many sources in our sample have faint magnitudes, extending all the way down to the PS1 detection limit. This makes obtaining accurate detection priors necessary not only to differentiate between real and fake sources, but also to robustly characterise the various populations of sources considered, especially at the faintest magnitudes.
As the fraction of real sources detected as a function of source magnitude in PS1 (Metcalfe et al. 2013) is relatively well described by a sigmoid function, the detection priors we use for PS1 detected sources in this study are calculated as:
where mfilt is the magnitude of a source in one of the Pan-STARRS filters, mfilt, 1/2 is the 50% magnitude depth of this filter, and s is a binary value that depends on the source type:
which is a relevant statistic for differentiating between point sources and extended sources.
3.4. Radio detection prior
Deep radio continuum data from LoTSS DR1 are used to complement the available optical data from Pan-STARRS, providing radio detections for a subset of the selected sources. To properly account for radio-detected sources, we modify the source classification based on the likelihood of radio detection, which we implement through the inclusion of an additional radio detection parameter, fR, k, into the detection prior. Through this radio detection prior, if a radio counterpart in the LoTSS DR1 images of the input source is present, the radio detection is taken into consideration when computing the HzQ posterior probability.
For HzQs, roughly 10% of the quasar population (e.g. Hooper et al. 1995) is ‘radio-loud’. This relation seems to hold at higher redshifts, as Bañados et al. (2015) reported a radio-loud fraction of ∼10% for z > 5 quasars at 1.4 GHz. Recent results from deep LOFAR survey data at lower redshifts suggest that there is no dichotomy between radio-loud and radio-quiet quasars, and that 30% of quasars can be detected by LOFAR surveys (Gurkan et al. 2019). Similar fractions are found in LoTSS DR2 (36% at > 2σ significance; Gloudemans et al. 2021) at z ≳ 5.0. A reasonable assumption therefore would be to set fR, k = 0.3 for HzQs as the radio detection prior.
For stars, including brown dwarf stars, the radio-loud fraction is very low, and Kimball et al. (2009) find that about one in a million stars can be detected at radio wavelengths. However, low-frequency radio data combined with unparalleled sensitivity from LoTSS represents a new parameter space for the detection of radio signals from stars, as demonstrated by the recent discovery of polarised radio emission from a cold brown dwarf (Vedantham et al. 2020). Nevertheless, bright, non-variable radio continuum emission sufficient to be detected in LoTSS imaging will be significantly rarer for brown dwarf contaminants than for luminous quasars or galaxies. Therefore, the probability of a radio-detected source in our sample being a brown dwarf is virtually zero, with fR, k = 10−6 for dwarf stars.
For red, dusty galaxies at intermediate redshift, we find from the deep multi-wavelength catalogues based on deep LOFAR data in the Boötes field (Williams et al. 2018) that only a small fraction (∼1%) of these galaxies has a radio detection. Therefore, we set fR, k = 10−2 for the galaxy population. We note that these radio detection priors currently represent order of magnitude accuracy, and with deeper data collected over larger areas of the sky by current and future radio surveys, the radio detection priors can be improved upon to further enhance the probability assignment method for HzQs.
3.5. Sky densities
The sky densities of the source classes represent a significant prior, especially given the very rare nature of HzQs that makes any given source on the sky more likely to be a star or foreground galaxy. This prior can also differ depending on the apparent magnitude of the source and in this section we describe the calculation of the sky density priors for source populations considered in this study.
3.5.1. M-, L-, and T-type dwarfs
Since the dwarf star contaminants are all within the Milky Way, the number density of dwarf stars at distance d from the Earth and Galactic latitude (l) and longitude (b) can be estimated assuming a Galactic model (e.g. Chen et al. 2001) as
where Z⊙ is the height of the Sun or Earth above the Galactic plane, and hZ and hR are the characteristic height and distance scales for stars in the Milky Way, respectively (see also Caballero et al. 2008). The fiducial values of the various parameters used to calculate the sky densities of dwarf stars are given in Table 1.
Values of the various parameters used in Eq. (10) with errors as determined by Chen et al. (2001).
Given the magnitude range specified, every dwarf type will have a slightly different heliocentric distance at which it will appear in the sample. To calculate this, the absolute PS1 magnitudes of each dwarf type are used from the Best et al. (2017) catalogue. We calculate the sky density for each magnitude bin by integrating the spatial density over the cone covering the sky area. For a single stellar type, this results in
where D is the distance in parsecs. The total sky density of all contaminating dwarfs is calculated as the sum of the densities ρi of all stellar dwarf types.
3.5.2. Galaxies
A significant fraction of sources in the PS1 data described in Sect. 2 are identified as faint red galaxies. As mentioned previously, the information for this population is primarily taken from the Boötes multi-wavelength photometric catalogue from Williams et al. (2018), which also contains photometric redshifts, allowing us to select such galaxies in the redshift range 0 ≲ z ≲ 3. As this is a less well-defined astrophysical population compared to quasars and dwarf stars, we have no luminosity function to model their observed sky density. Instead, we use the apparent magnitudes of these galaxies in the PS1 i, z, and y filters to model their distribution as a function of apparent magnitude in a given filter. We model the distribution with the Kernel density estimation (KDE) technique (Silverman 2017), where we use a bandwidth h = 0.1 to get a smooth and continuous representation of the data.
This population of galaxies is made up of the population identified in the Boötes field, selected with the same criteria as the main PS1 sample (see Sect. 2.1). The Boötes field covers an area of S ≃ 11.6 deg2 on the sky, and we use this to convert the modelled distribution of galaxies to sky densities. Assuming that the galaxies are isotropically distributed, these sky densities are independent of the direction in which we observe, making the model valid for data in the HETDEX field as well. The resulting sky densities of galaxies detected in the PS1 i, z, and y band data as a function of AB magnitude are shown in Fig. 2.
![]() |
Fig. 2. Expected number density of the galaxy populations as a function of apparent magnitude in PS1 i, z, and y bands extrapolated from the deep multi-wavelength galaxy catalogue in the Boötes field (Williams et al. 2018). |
3.5.3. High-redshift quasars
Density functions of HzQs can be expressed in terms of the luminosity functions at high redshifts, which are poorly constrained compared to lower redshifts because of a lack of statistical samples (e.g. Manti et al. 2017). Using observations of quasars across redshifts, Mortlock et al. (2012) derived a redshift- and absolute-magnitude-dependent co-moving luminosity function for HzQs.
In order to calculate absolute magnitudes from the range of observed magnitudes in all relevant PS1 filters, K-corrections to the quasar spectra are calculated. We use the method with which we simulated quasar magnitudes in Sect. 3.1, applying redshift-dependent Lyman absorption from the intervening IGM based on redshift to the quasar SED templates from Vanden Berk et al. (2001). We note that we do not account for the presence of ionised proximity zones around the HzQs. The Lyman absorbed and redshifted spectra are divided by the unaltered SED templates, and convolved with the relevant PS1 filters to obtain the K-corrections (following Hogg et al. 2002).
Finally, integrating the redshift- and magnitude-dependent HzQ luminosity density, Φq(M, z), from Mortlock et al. (2012) over the observed redshift cone yields the sky density of HzQs
where Dco is the co-moving distance in megaparsecs integrated over the distances (Dco, 1, Dco, 2) corresponding to the redshift range probed.
3.6. Full posterior
For the full photometric sample outlined in Sect. 2 we calculate the evidences for each class using the priors and likelihoods outlined above. The final quasar posterior probability is then constructed as
where the weighted evidence is calculated based on the priors obtained using the i, z, and y magnitudes and likelihoods in the i − z and z − y colour spaces for each source
As a result, every source with a measured i, z, and y band magnitude in the PS1 catalogue can be robustly assigned a probability of being a HzQ. In the following section, we apply our HzQ selection method to publicly available photometric data from PS1 in a bid to identify previously undiscovered HzQs at z ≳ 5.5.
4. Implementing the quasar selection algorithm
Having defined all the necessary components for our HzQ probability assignment method, in this section we apply it to data taken from PS1 as described in Sect. 2. When provided the iP1, zP1 and yP1 magnitudes of any source, our method described above should yield a posterior probability, Pq, that the source is a HzQ.
4.1. Candidate HzQ samples
As the sky density priors give a much greater weight to non-quasar likelihoods, we must define a posterior threshold that is capable of capturing the quasar population while largely rejecting other foreground contaminants. Using PS1 photometry of known quasars at z > 5.5 from Bañados et al. (2016), we find that a posterior threshold of Pq > 5 × 10−4 accepts ∼90% of the quasar population. The same threshold also rejects more than 99% of dwarf stars and low-redshift galaxies, as is shown in Fig. 3.
![]() |
Fig. 3. Posterior quasar probabilities of the different populations assigned by our HzQ selection method. The chosen posterior threshold of Pq > 5 × 10−4 is shown using the dashed line, which retains ∼90% of the known quasar population while rejecting > 99% of foreground contaminant populations. |
We introduce an additional requirement ensuring a good-quality detection in PS1 to remove spurious detections, saturated counts, and other instances resulting in bad photometry in the catalogue by ensuring that the parameter iQfPerfect > 0.85. We exclude sources that do not adequately fit any of the modelled populations by removing sources that have low likelihood scores from all GMMs,
Having established the adequate posterior threshold that maximises the chances of identifying HzQs and minimises the incidence of foreground contaminants, we now proceed to run our novel HzQ selection method on the photometric data in the i, z, and y bands queried from PS1. Our initial data set contained ∼5 × 105 sources, out of which 508 sources were selected with probability above the set threshold. Finally, 263 sources satisfied the additional good-quality detection requirement, of which 12 sources had an accompanying LOFAR detection.
To investigate the selection function introduced by our algorithm to identify candidate HzQs, in Fig. 4 we show the cumulative distribution function (CDF) of zP1 magnitudes of sources lying above the probability threshold of being HzQs (red line), along with the CDF of the zP1 magnitudes of all the sources that were passed through the algorithm (black line). Very clearly, our algorithm preferentially classifies objects with brighter zP1 magnitudes as candidate HzQs, placing a stronger emphasis on capturing the Lyα break which manifests itself as higher i − z photometric colours. The brighter zP1 magnitudes also ensure that sources classified as candidate HzQs are securely detected at redder wavelengths. The zP1 mag distribution for sources with high HzQ probabilities peaks at zP1 ≈ 19, with objects fainter than zP1 > 21 very rarely selected, as it would be impossible to constrain the Lyα break in objects with the faintest zP1 magnitudes. The comparison shown in Fig. 4 therefore serves as a validation for our new HzQ selection algorithm.
![]() |
Fig. 4. Cumulative distribution function of zP1 magnitudes of the sources classified as candidate HzQs (red line) compared with the full PS1 photometric sample (black line). Our algorithm preferentially assigns higher HzQ priorities to sources that are brighter and securely detected in the zP1 band, ensuring that the Lyα break resulting in higher i − z colours is securely constrained. |
The 263 candidate HzQs lying above the posterior threshold of Pq > 5 × 10−4 selected from our method represent a very small fraction (∼0.05%) of the initial data set. Our strict threshold clearly results in a drastic reduction in the number of candidate HzQs, which can subsequently be visually inspected, whereby additional spurious, extended, or otherwise undesired sources can then be rejected. The main aim of visual inspection is to identify clearly spurious sources that may have been missed as such by the quality selection parameter (iQfPerfect). Examples include contamination by bright stellar spikes, cosmic ray residuals, and grouped bright pixels. We additionally rejected candidates that showed extended morphologies, as HzQs are highly likely to appear as point sources in PS1 images. The visual inspection was carried out by JDW, AS, and KJD, with mutual agreement being required in order to reject a candidate.
As a result, our conservative approach to visual inspection resulted in a large fraction of HzQ candidates being rejected, with 65 sources –11 of which have a radio detection– remaining as good HzQ candidates suitable for spectroscopic follow-up. The entire sample is summarised in Fig. 5, where the discriminating power of the posterior calculation can be appreciated. Further details of the sources can be found in Table A.1. Noteworthy is a cluster of sources around z − y = 1.5, which represents a subset of the sources that have a detection in LoTSS. In total, 4417 sources in the full catalogue have a LOFAR counterpart, none of which would be selected if no radio counterpart was present. The addition of radio data has given higher significance to these sources, and shows that the method can robustly take into account the additional information provided by a radio detection.
![]() |
Fig. 5. Distribution of sources in the zP1 − yP1 vs iP1 − zP1 diagram of the full PS1 sample (grey points), high-probability HzQs (black circles), and the candidate HzQs selected for spectroscopic follow-up (red circles) and those with a radio detection (red squares). The traditional colour cut at i − z = 2 is marked using the black dashed line, demonstrating that our method assigns a high HzQ probability to several sources that lie below this selection. We also mark P144+50 (red star) in the colour–colour plot, which was discovered using our method at a redshift of z = 5.66, and with i − z = 1.4 lies below the standard photometric selection employed in other studies. |
4.2. Comparison with colour selection
In this section we test the efficacy of our Bayesian HzQ selection method compared to the traditional colour- and magnitude-based selection. We note that ∼30% of sources that were assigned high HzQ probabilities lie below the traditional i − z > 2 colour cut (Fig. 5), and may potentially be missed by studies relying on binary colour and magnitude cuts for HzQ searches. To compare results, we apply a colour cut of i − z > 2 on the full PS1 sample, which selects 634 sources. This shows that even though sources below the colour cut can be above the probability threshold, many sources above the cut are also rejected. To investigate if these sources are rightfully rejected or accepted, we can run the algorithm on the data sets described in Sect. 3.1, and compare them with a i − z > 2 colour cut.
For the HzQ sample, we simulate 104 HzQs using the same method as described in Sect. 3.1. We also assign z-band magnitudes to the simulated quasars following a log-normal distribution based on the magnitude distribution of HzQs from Ross & Cross (2020). From the z-band magnitude, the i- and y-band magnitudes are automatically assigned based on the colours of the quasars. The algorithm is run on this sample, and as for the PS1 sample, only sources above the probability threshold of Pq = 5 × 10−4 are accepted. Through this we find 7614 (76%) HzQs above the probability threshold, while 7388 (74%) are above the i − z > 2 colour cut. Much like in the sample described above, the probability cut rejects sources above the colour cut and vice versa, such that an important difference between the methods is in which parts of colour space are probed.
As both Bayesian selection and colour selection methods return roughly the same number of HzQ candidates, we test the efficacy of selection by repeating the same experiment with the contaminant populations. For the brown dwarf population, we use the Best et al. (2017) catalogue containing ∼104 sources, and using the above mentioned probability threshold selection, our algorithm shortlists 14 (0.15%) known brown dwarfs as candidate HzQs. However, the colour selection selects 68 (0.72%) brown dwarfs as HzQ candidates.
Using the low- and intermediate-redshift galaxy catalogues from Williams et al. (2018) containing ∼104 sources, our algorithm classifies only one (0.01%) galaxy as a candidate HzQ, compared to 36 (0.38%) galaxies being classified as HzQ candidates based on colour selection. Overall, we find that our method rejects a larger percentage of contaminants, while retaining a similar fraction of HzQs compared to a simple i − z > 2 colour selection, implying a higher overall efficacy.
We note that in i − z colour-based selections, often an additional colour criterion of z − y < 0.5 is applied (e.g. Bañados et al. 2014) to further remove contaminants. Applying these i − z > 2 and z − y < 0.5 cuts on our simulated HzQ, brown dwarf, and galaxy catalogues, we find that all brown dwarfs are eliminated and 28 galaxies are classified as candidate HzQs. However, these cuts only retain 58% of HzQs from our simulated sample, clearly showing that although a large fraction of contaminants are eliminated using colour cuts based on i, z, and y band photometry, several HzQs may also be missed by such a selection.
The possible addition of radio data further improves the efficacy of HzQ selection, and we consider a sample where all sources have counterparts in the radio. Using the assumed radio detection rates in Sect. 3.4, we see almost all quasars with a radio detection are accepted (91%). This fraction is purely considering the number of sources that are above the probability threshold, which is in addition to the increase in probability for these quasars across the board. All dwarf stars are eliminated from the sample, while as before 0.01% of galaxies are retained. This shows that the addition of radio data can be extremely valuable for identifying HzQs, and significantly increases the purity of the resulting candidate sample.
In Fig. 6, the results of the colour and probability selection on the test samples of HzQs (left) brown dwarfs (middle), and galaxies (right) are summarised. Here it is clear that the probability selection method handily rejects contaminants that occupy the same colour space as HzQs and would normally be included in colour selection. We note that, although both methods recover a similar number of HzQs, different subsets are selected, as the probability selection recovers a significant fraction of HzQs below the colour cut, while also rejecting a portion of HzQs above the colour cut that lie close to the colour space of brown dwarfs. As HzQs with i − z < 2 have generally lower redshifts, this demonstrates that the probability selection can be especially effective in recovering HzQs at z ∼ 5.6.
![]() |
Fig. 6. Distribution of sources in the zP1 − yP1 vs iP1 − zP1 diagram of the test samples of HzQs (left), brown dwarfs (middle), and galaxies (right). Sources not selected with any method are marked in grey. Sources selected with i − z > 2 colour cut (dashed line) but rejected by the probability selection are marked as a diamond shape with a black edge. Notably, the z − y < 0.5 colour cut (dotted line) rejects all brown dwarfs from the test sample, but also a large portion of the simulated quasar sample. |
5. Spectroscopic follow-up
To demonstrate the efficacy of our quasar selection method and confirm the nature of the candidate sources, we obtained spectroscopy primarily targeting the Lyα lines for the most promising HzQs identified by our selection method. Nevertheless, our final sample of high-quality HzQ candidates still contains an impractical number of sources for additional spectroscopic observations, and therefore we assigned priorities to sources in our final sample based on available independent ancillary data primarily at infrared wavelengths, which were crucially not used in the probability assignment using our method.
We first cross-matched our candidate HzQ sample with the AllWISE Data Release catalogue3, which builds upon the data collected by the Wide-field Infrared Survey Explorer (WISE; Wright et al. 2010) mission by additionally including data from the NEOWISE surveys (Mainzer et al. 2011). The AllWISE data contain photometry in the WISE W1, W2, W3, and W4 bands, offering wavelength coverage in the range 3.4 − 22 μm. We additionally cross-match our sample with the UKIRT Hemisphere Survey (UHS, Dye et al. 2018) data release containing deep J-band imaging and source catalogues over ∼12 700 deg2 of the sky.
We used this ancillary data to assign priorities to sources in our candidate HzQ sample for spectroscopic follow-up. First, we assigned higher priority to sources with lower |zP1 − yP1| and |yP1 − J|, which brings us closer to the locus of colours from known HzQs at z > 5.5. Next, based on the infrared colours seen in the sample of known HzQs from Bañados et al. (2016), which was also used as a validation sample for this study, we assigned higher priorities to sources that satisfied the following conditions:
Lastly, the brightest sources in our sample were assigned higher priority, purely to make the process of spectroscopic follow-up more efficient. Having assigned a priority for spectroscopic follow-up to each candidate HzQ, we now describe our spectroscopic observations below.
5.1. Description of observations
The spectroscopic observations of our candidate HzQs presented in this work were obtained using the Intermediate Dispersion Spectrograph4 (IDS) on the 2.5 m Isaac Newton Telescope (INT, PI: Wagenveld, Program: N17). The observations were taken over a period of 6 nights in Spring 2019, during which 13 of the highest priority candidate HzQs were observed. Three nights were unfortunately lost due to bad weather, and the remaining three nights had favourable conditions with an average seeing of 0.5″ from 6 to 8 April 2019.
The observations were taken using the R400R grating in the Red arm of the spectrograph, with a slit width of 1.5 arcsec and slit length of 3 arcmin. Standard afternoon calibrations were performed with both lamp and sky flats taken before each observing night. A flux standard was observed at the beginning and the end of each night. We used CuAr+CuNe lamps for wavelength calibration, which were observed at the position of each target before the sky exposure.
The targets were observed in blocks of 1800 s, with total integration times per source ranging from 3600 s to 7200 s. Due to telescope limitations and higher priority assigned to brighter sources, only sources brighter than a z-magnitude of 20.5 were observed. Blind offsetting was used to acquire faint targets and standard data-reduction procedures, which includes bias subtraction, flat-fielding, sky subtraction, wavelength calibration, and flux calibration, were performed using a custom PYTHON-based data-reduction pipeline written by our team5, which is based on CCDPROC (Craig et al. 2021).
Of the 13 targets observed in this run, 11 could not be conclusively classified based on the spectra obtained. In most cases, only very faint continuum was spotted with potential narrow lines. Unfortunately, the S/N of the continuum or the emission lines for these sources was not sufficient to unambiguously determine redshifts or classify the sources as either dwarf stars in the Milky Way or low-redshift galaxies. Three of the observed sources had an accompanying radio detection, but did not contain strong emission line features in their spectra. However, the clear absence of a strong Lyα line or a break blueward of Lyα in their spectra indicated that these sources were unlikely to be quasars at z ≳ 5.5.
However, high-S/N spectra were obtained for two sources in our sample, one of which was conclusively classified as a brown dwarf star owing to clear, broad absorption features in the continuum arising from molecules such as TiO (e.g. Reiners et al. 2007), which often mimic the Lyα break found in the spectra of HzQs.
The other source, PSO J144128.715+502239.463 (P144+50 hereafter), was convincingly classified as a previously undiscovered, luminous quasar at a redshift of z = 5.66, and in the following section we describe the observed properties of this newly discovered HzQ.
5.2. P144+50: a luminous quasar at z≈ 5.7
The most luminous and promising source of the candidate sample, P144+50, has a very clear point-source-like structure across the available broad band photometry, as illustrated in Fig. 7. This luminous quasar was most likely missed by earlier searches owing to its relatively low iP1 − zP1 colour of 1.4, which may be excluded by traditional binary colour cuts. Although no radio counterpart for this quasar was identified within the LoTSS DR1 catalogues, P144+50 was still assigned a probability of Pq = 0.01 from our method, demonstrating that our novel HzQ selection method is capable of assigning realistically high probabilities even to non-radio-detected HzQs. In Table 2 we give the apparent magnitudes of P144+50 in the available optical and infrared filters.
![]() |
Fig. 7. P144+50 in optical and infrared bands. In the WISE images, it is blended with the neighbouring galaxy. Given the fact that the quasar is brighter in those bands while the galaxy drops off, and that its WISE magnitudes are consistent with those of other HzQs, it is likely that most of the emission in the WISE bands comes from the quasar. |
Observed optical and infrared magnitudes of P144+50 in all filters relevant to probability calculation and priority assignment.
The 1D spectrum of P144+50 shown in Fig. 8 displays a bright and broad, strong Lyα feature, with a clear break in the spectrum blueward of the line, showing the Gunn-Peterson trough. The peak of the Lyα line suggests a redshift of z = 5.66. No other rest-frame UV emission features are identified, but the Si II absorption feature may be present.
![]() |
Fig. 8. 1D spectrum of P144+50 taken using the IDS instrument on the 2.5 m INT. The characteristic break blueward of the Lyα line is clearly visible, unambiguously identifying it as a high-redshift quasar with a redshift of z = 5.66 based on its Lyα emission. Some flux is seen around the expected Lyβ and O IV features. Apart from Lyα, there are no clear emission lines visible in the spectrum due to the limited sensitivity of the INT. The orange line indicates the noise level of the spectrum. |
We additionally detect a Lyα forest, and faint signs of the presence of an ionised proximity zone around this QSO. Additional flux is detected around the rest-frame Lyβ and O IV wavelengths. Unfortunately, due to the limited S/N of our INT observations, any meaningful constraints on either the proximity zone or the neutrality of the intervening IGM along the line of sight cannot be derived. Therefore, deeper follow-up observations with larger telescope facilities are required to draw robust conclusions.
Demonstrating the merits of the new HzQ selection method introduced in this paper perfectly, P144+50 is a bright hitherto undiscovered quasar at z = 5.66. Its rest-frame UV magnitude, M1450 = −27.22, puts it at the brighter end of the quasar luminosity function at z ∼ 6 (Manti et al. 2017) and amongst the most luminous quasars currently known at this redshift. Given that the sky area covered by our search is ∼1% of the full sky, and the sensitivity of spectroscopic observations restricting the follow-up to only those sources with magnitudes brighter than 20.5 AB, it is not improbable that other such undiscovered quasars exist within the large sky surveys that may have been missed in searches relying on binary colour selection.
Additionally, P144+50 lies within the redshift range investigated by Yang et al. (2017), demonstrating that our GMM-based HzQ selection approach is able to successfully identify quasars within the so-called ‘redshift gap’ often encountered by HzQ searches employing optical broadband selection. Thanks to the increased discriminatory power provided by our algorithm, GMM-based HzQ searches might provide a powerful method for more complete samples of HzQs, including those that lie within the redshift gaps in ground-based optical broadband searches.
6. Future prospects
The Bayesian quasar-selection method presented in this work is built from priors informed by empirical relations and likelihood models from machine learning, as opposed to binary cuts in optical/infrared magnitudes and colours. As a result, this method relies heavily on the accuracy of priors derived for both HzQ populations and common contaminants in HzQ searches. Therefore, the priors can be improved in an iterative fashion by folding in the results from the ever increasing spectroscopic confirmation of candidate HzQs selected from photometric surveys.
The inclusion of additional photometric data can also help to improve the priors, resulting in a more accurate HzQ probability assignment. For example, J band and WISE photometry for known HzQs and contaminants can be used to improve the estimation of priors, which for this work have only been used to shortlist candidate HzQs for spectroscopic follow-up. Expanding the model to include these additional dimensions can further increase its precision and reliability thanks to the increased colour information available, as well as extend its application to searches for HzQs at even higher redshifts.
The range of redshifts selected for this analysis (5.6 < z < 6.5) was chosen in order to enable validation with known HzQs that were selected using i − z photometric colours. However, given that samples of HzQs can be simulated for training and existing HzQs can be used as validation for our selection method, other redshift ranges can be probed. For example, searches for HzQs at z > 6.5 can be readily carried out using our algorithm by using a handful of known HzQs at these redshifts, selected based on their z − y colours. The efficacy of the method may however not be as high for selecting z > 6.5 HzQ candidates owing to the lower number of HzQs known at these redshifts that could be used for training and validation. As mentioned earlier, the inclusion of more photometric data may help to improve the priors for HzQs at the highest redshifts.
As the method should be extendable to larger datasets, one potential bottleneck will be excising the remaining unwanted sources after assigning probability. For the sample described in this paper, this final step was done through visual inspection of 263 high-probability candidates. For a much larger initial sample size, this method is no longer feasible. From the visual inspection we performed, most of these unwanted sources were rejected on the grounds of either being spurious, having incorrect magnitude, or appearing extended. Spurious sources are essentially removed if we force all sources to have a counterpart in WISE and/or UHS. As these magnitudes are used anyway to assign priority to sources (Sect. 5), it will be doubly advantageous to implement such a selection. Incorrect magnitudes can be corrected by performing photometry directly on the Pan-STARRS images. Finally, there is clear need to differentiate between point sources and extended sources. There are several ways to do this with the Pan-STARRS catalogues, such as comparing the PSFs and Kron magnitudes of sources6 (Farrow et al. 2013). However, these methods of differentiating between point sources and extended sources become less reliable towards lower magnitudes, where we expect more quasars. We note here that efforts are currently underway to use machine learning techniques to morphologically classify radio sources (e.g. Mostert et al. 2021), which could suitably be extended to morphological classification of candidate HzQs from optical images.
Here, we demonstrate the discriminatory power of our HzQ selection method and show that it is possible to shortlist manageable numbers of high-quality HzQ candidates from large photometric data sets. With the aforementioned flexibility and room for improvement, our algorithm can potentially be applied to even larger and deeper surveys of the sky enabled by existing state-of-the-art and upcoming ground- and space-based optical and infrared observatories such as the Vera C. Rubin Observatory (Ivezić et al. 2019), Euclid (Laureijs et al. 2011), the Nancy Grace Roman Space Telescope (formerly known as WFIRST; Spergel et al. 2015), and existing large surveys such as the Kilo-Degree Survey (KiDS; De Jong et al. 2013) and Dark Energy Survey (DES; The Dark Energy Survey Collaboration 2005) to name a few.
Finally, while the z = 5.66 quasar discovered in this analysis is undetected in LoTSS radio continuum imaging, the high detection fraction of known z > 5 sources within the 5700 deg2 of the forthcoming LoTSS Data Release 2 (36% at > 2σ significance; Gloudemans et al. 2021) illustrates that the radio continuum observations can provide valuable additional information for HzQ selection and remain a powerful tool to crucially exclude contamination from Galactic dwarf stars. Relatively shallow but large-area existing radio surveys such as FIRST (Becker et al. 1995) and NVSS (Condon et al. 1998) carried out with the Very Large Array (VLA) have led to the discovery of several radio-loud quasars at z ≳ 5 (e.g. Bañados et al. 2015), and TGSS Alternative Data Release (Intema et al. 2017) covering ∼37 000 sq. deg. of the sky at 150 MHz has already led to the discovery of the most distant radio-selected galaxy currently known (Saxena et al. 2018). The full LoTSS data release will offer sensitive radio coverage over very large sky areas in the northern hemisphere, enabling the inclusion of radio priors for a large number of candidate HzQs. These sky areas and sensitivities will be improved by upcoming ultra-deep radio surveys such as those by the Square Kilometre Array (SKA; Dewdney et al. 2009) and its precursors like MeerKAT (Jonas 2016) and ASKAP (Hotan et al. 2021) enabling even fainter radio detections.
Therefore, the HzQ selection method presented in this work is flexible, and has room for improvement given the availability of deep photometric data over large parts of the sky via existing and future large area sky surveys across wavelengths. Our method also provides a robust framework within which the additional radio information can be incorporated to potentially identify even radio-faint quasars in the early Universe.
7. Summary and conclusions
In this paper we present a novel method for selecting candidate high-redshift quasars (HzQs; z ≳ 5) from large photometric data sets, making use of informed priors and Gaussian mixture models (GMMs) within a Bayesian framework. Our method attempts to capture the HzQ population more completely compared to traditionally used binary cuts in optical magnitudes and colours, while minimising the likelihood of contamination from foreground sources such as dwarf stars in the Milky Way and lower redshift dusty galaxies.
Our novel selection method builds upon previous works employing Bayesian selection of HzQ candidates using informed priors. The novelty of our methods lies in using GMMs to obtain likelihoods in optical colour–colour spaces using photometry of populations of known and simulated HzQs, as well as common contaminants such as M, L, and T brown dwarf stars and low-redshift dusty galaxies that often mimic the observed optical photometric properties of HzQs. Additional priors based on the security of optical detections, respective sky densities of the source populations, and a radio detection are used to calculate the probability of a particular source detected in a large photometric sky survey being a candidate HzQ.
We run our GMM-based HzQ search method on photometric data from the publicly available Pan-STARRS DR1 (PS1) over a limited area on the sky, coinciding with deep radio imaging from LoTSS in the HETDEX Spring field covering ∼400 square degrees. Using in particular photometry in the PS1 i, z, and y bands, we assign candidate HzQ probabilities to ∼5 × 105 sources from PS1. Adopting a HzQ posterior probability threshold that results in the selection of ∼90% of known HzQs at z ≳ 5.5 and the rejection of ≳99% of known foreground contaminants such as dwarf stars or low-redshift galaxies, we shortlist 263 candidate HzQs with high probabilities. By visually inspecting these candidates to spot any obvious artefacts, we select 63 sources in the final high-probability candidate HzQ sample, which can subsequently be followed up spectroscopically. To test the efficacy of the method, we run the probability selection on test samples of simulated HzQs and previously used samples of dwarfs an galaxies. We find that the efficacy of the probability method is higher than traditional colour cuts, decreasing the fraction of accepted contaminants by 86% while retaining a similar fraction of HzQs. While more stringent colour cuts decrease the contaminant fraction to levels similar to that of the probability selection, less HzQs are recovered. The efficacy of the probability selection is increased further once radio data are taken into account, reducing the fraction of contaminants by 99% compared to the traditional colour cut at the cost of selecting only quasars that have a radio detection.
Follow-up spectroscopic observations were then carried out for the highest priority HzQ candidates from our sample, with 13 candidates targeted with the 2.5 m Isaac Newton Telescope. Although the nature of 11 of these 13 candidates could not be confirmed owing to low S/Ns in the relatively shallow spectra, a lack of strong Lyα emission or Lyα absorption present in the spectrum was used to rule out a very high-redshift nature.
However, the exact nature of two candidates could be established, with one being a brown dwarf star and the other being a previously undiscovered, luminous quasar at z = 5.66 (P144+50). The spectrum of P144+50 shows a strong and broad Lyα line, with a strong break in the spectrum bluewards of Lyα indicative of a Gunn-Peterson trough. P144+50 has a rest-frame UV magnitude of M1450 = −27.22, putting it at the very bright end of the luminosity function at this redshift. This HzQ was likely missed by earlier searches owing to its i − z photometric colour of 1.4, which falls below the traditional limits requiring i − z > 1.5.
The discovery of this previously undiscovered, luminous quasar at z = 5.66 serves as a validation of our novel HzQ selection method, indicating that a probabilistic method of selecting HzQs from large photometric surveys may perform better at returning more complete samples of HzQs as opposed to binary selections based on cuts in optical and infrared colours or magnitudes. Our method could be improved by the inclusion of more photometric data when calculating posterior probabilities, and as such can be employed on larger incoming sky surveys to discover new quasars, into the epoch of reionisation.
Acknowledgments
This paper is dedicated to the memory of our dear friend and collaborator Maolin Zhang whose contributions to this project, and the wider field of astronomy, were tragically cut short. K.J.D. acknowledges funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 892117 (HIZRAD). We thank the anonymous referee for their useful comments and feedback. The Pan-STARRS1 Surveys (PS1) and the PS1 public science archive have been made possible through contributions by the Institute for Astronomy, the University of Hawaii, the Pan-STARRS Project Office, the Max-Planck Society and its participating institutes, the Max Planck Institute for Astronomy, Heidelberg and the Max Planck Institute for Extraterrestrial Physics, Garching, The Johns Hopkins University, Durham University, the University of Edinburgh, the Queen’s University Belfast, the Harvard-Smithsonian Center for Astrophysics, the Las Cumbres Observatory Global Telescope Network Incorporated, the National Central University of Taiwan, the Space Telescope Science Institute, the National Aeronautics and Space Administration under Grant No. NNX08AR22G issued through the Planetary Science Division of the NASA Science Mission Directorate, the National Science Foundation Grant No. AST–1238877, the University of Maryland, Eotvos Lorand University (ELTE), the Los Alamos National Laboratory, and the Gordon and Betty Moore Foundation. This publication makes use of data products from the Wide-field Infrared Survey Explorer, which is a joint project of the University of California, Los Angeles, and the Jet Propulsion Laboratory/California Institute of Technology, and NEOWISE, which is a project of the Jet Propulsion Laboratory/California Institute of Technology. WISE and NEOWISE are funded by the National Aeronautics and Space Administration. LOFAR data products were provided by the LOFAR Surveys Key Science project (LSKSP; https://lofar-surveys.org/) and were derived from observations with the International LOFAR Telescope (ILT). LOFAR (van Haarlem et al. 2013) is the Low Frequency Array designed and constructed by ASTRON. It has observing, data processing, and data storage facilities in several countries, which are owned by various parties (each with their own funding sources), and which are collectively operated by the ILT foundation under a joint scientific policy. The efforts of the LSKSP have benefited from funding from the European Research Council, NOVA, NWO, CNRS-INSU, the SURF Co-operative, the UK Science and Technology Funding Council and the Jülich Supercomputing Centre. The Isaac Newton Telescope is operated on the island of La Palma by the Isaac Newton Group of Telescopes in the Spanish Observatorio del Roque de los Muchachos of the In-stituto de Astrofísica de Canarias. This research has made extensive use of TOPCAT (Taylor 2005). The algorithm presented in this work is written in PYTHON and will be made publicly available. In the mean time the code will be shared upon reasonable written requests to the authors.
References
- Abazajian, K., Adelman-McCarthy, J. K., Ageros, M. A., et al. 2003, AJ, 126, 2081 [NASA ADS] [CrossRef] [Google Scholar]
- Bailer-Jones, C. A. L., Smith, K. W., Tiede, C., Sordo, R., & Vallenari, A. 2008, MNRAS, 391, 1838 [NASA ADS] [CrossRef] [Google Scholar]
- Bañados, E., Venemans, B. P., Morganson, E., et al. 2014, Proc. Int. Astron. Union, 9, 19 [Google Scholar]
- Bañados, E., Venemans, B. P., Morganson, E., et al. 2015, ApJ, 804, 118 [Google Scholar]
- Bañados, E., Venemans, B. P., Decarli, R., et al. 2016, ApJS, 227, 11 [Google Scholar]
- Becker, R. H., White, R. L., & Helfand, D. J. 1995, ApJ, 450, 559 [Google Scholar]
- Becker, G. D., Bolton, J. S., Madau, P., et al. 2015, MNRAS, 447, 3402 [Google Scholar]
- Best, W. M. J., Magnier, E. A., Liu, M. C., et al. 2017, ApJS, 234, 1 [NASA ADS] [CrossRef] [Google Scholar]
- Bovy, J., Hennawi, J. F., Hogg, D. W., et al. 2011, ApJ, 729, 141 [NASA ADS] [CrossRef] [Google Scholar]
- Burningham, B., Hardcastle, M., Nichols, J. D., et al. 2016, MNRAS, 463, 2202 [NASA ADS] [CrossRef] [Google Scholar]
- Caballero, J. A., Burgasser, A. J., & Klement, R. 2008, A&A, 488, 181 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
- Chambers, K. C., Magnier, E. A., Metcalfe, N., et al. 2016, ArXiv e-prints [arXiv:1612.05560] [Google Scholar]
- Chen, B., Stoughton, C., Smith, J. A., et al. 2001, ApJ, 553, 184 [Google Scholar]
- Condon, J. J., Cotton, W. D., Greisen, E. W., et al. 1998, AJ, 115, 1693 [Google Scholar]
- Craig, M., Crawford, S., Seifert, M., et al. 2021, https://doi.org/10.5281/zenodo.4883632 [Google Scholar]
- Cristiani, S., Serrano, L. M., Fontanot, F., Vanzella, E., & Monaco, P. 2016, MNRAS, 462, 2478 [Google Scholar]
- De Jong, J. T. A., Verdoes Kleijn, G. A., Kuijken, K. H., & Valentijn, E. A. 2013, Exp. Astron., 35, 25 [NASA ADS] [CrossRef] [Google Scholar]
- Dempster, A., Laird, N., & Rubin, D. 1977, R. Stat. Soc. B, 39, 1 [Google Scholar]
- Dewdney, P. E., Hall, P. J., Schilizzi, R. T., & Lazio, T. J. L. W. 2009, IEEE Proc., 97, 1482 [Google Scholar]
- Duncan, K. J., Sabater, J., Röttgering, H. J., et al. 2019, A&A, 622, A3 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
- Dye, S., Lawrence, A., Read, M. A., et al. 2018, MNRAS, 473, 5113 [Google Scholar]
- Fan, X. 2006, New Astron. Rev., 50, 665 [Google Scholar]
- Fan, X., Narayanan, V. K., Lupton, R. H., et al. 2001, AJ, 122, 2833 [Google Scholar]
- Farrow, D. J., Cole, S., Metcalfe, N., et al. 2013, MNRAS, 437, 748 [Google Scholar]
- Gloudemans, A. J., Duncan, K. J., Röttgering, H. J. A., et al. 2021, A&A, 656, A137 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
- Gunn, J. E., & Peterson, B. A. 1965, ApJ, 142, 1633 [Google Scholar]
- Gurkan, G., Hardcastle, M. J., Best, P. N., et al. 2019, A&A, 622, A11 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
- Hogg, D. W., Baldry, I. K., Blanton, M. R., & Eisenstein, D. J. 2002, ArXiv e-prints [arXiv:astro-ph/0210394] [Google Scholar]
- Holoien, T. W.-S., Marshall, P. J., & Wechsler, R. H. 2017, AJ, 153, 249 [NASA ADS] [CrossRef] [Google Scholar]
- Hooper, E. J., Impey, C. D., Foltz, C. B., & Hewett, P. C. 1995, ApJ, 445, 62 [NASA ADS] [CrossRef] [Google Scholar]
- Hotan, A. W., Bunton, J. D., Chippendale, A. P., et al. 2021, PASA, 38, e009 [NASA ADS] [CrossRef] [Google Scholar]
- Intema, H. T., Jagannathan, P., Mooley, K. P., & Frail, D. A. 2017, A&A, 598, A78 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
- Ivezić, Ž., Kahn, S. M., Tyson, J. A., et al. 2019, ApJ, 873, 111 [Google Scholar]
- Jiang, L., McGreer, I. D., Fan, X., et al. 2016, ApJ, 833, 222 [Google Scholar]
- Jonas, J. L. 2016, Proc. Sci., 277, 25 [Google Scholar]
- Kimball, A. E., Knapp, G. R., Ivezić, E., et al. 2009, ApJ, 701, 535 [NASA ADS] [CrossRef] [Google Scholar]
- Laureijs, R., Amiaux, J., Arduini, S., et al. 2011, ArXiv e-prints [arXiv:1110.3193] [Google Scholar]
- Madau, P. 1995, ApJ, 441, 18 [NASA ADS] [CrossRef] [Google Scholar]
- Mainzer, A., Bauer, J., Grav, T., et al. 2011, ApJ, 731, 53 [Google Scholar]
- Manti, S., Gallerani, S., Ferrara, A., Greig, B., & Feruglio, C. 2017, MNRAS, 466, 1160 [Google Scholar]
- Matsuoka, Y., Onoue, M., Kashikawa, N., et al. 2016, ApJ, 828, 26 [Google Scholar]
- Matsuoka, Y., Iwasawa, K., Onoue, M., et al. 2018, ApJS, 237, 5 [Google Scholar]
- Metcalfe, N., Farrow, D. J., Cole, S., et al. 2013, MNRAS, 435, 1825 [NASA ADS] [CrossRef] [Google Scholar]
- Mortlock, D. J., Warren, S. J., Venemans, B. P., et al. 2011, Nature, 474, 616 [Google Scholar]
- Mortlock, D. J., Patel, M., Warren, S. J., et al. 2012, MNRAS, 419, 390 [NASA ADS] [CrossRef] [Google Scholar]
- Mostert, R. I. J., Duncan, K. J., Röttgering, H. J. A., et al. 2021, A&A, 645, A89 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
- Oke, J. B., & Gunn, J. E. 1983, ApJ, 266, 713 [NASA ADS] [CrossRef] [Google Scholar]
- Pipien, S., Cuby, J. G., Basa, S., et al. 2018, A&A, 617, A127 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
- Planck Collaboration XIII. 2015, A&A, 594, A13 [Google Scholar]
- Reed, S. L., Banerji, M., Becker, G. D., et al. 2019, MNRAS, 487, 1874 [Google Scholar]
- Reiners, A., Homeier, D., Hauschildt, P. H., & Allard, F. 2007, A&A, 473, 245 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
- Reynolds, D. 2009, Encyclopedia of Biometrics (Boston, MA: Springer), 31, 659 [CrossRef] [Google Scholar]
- Richards, G. T., Deo, R. P., Lacy, M., et al. 2009, AJ, 137, 3884 [NASA ADS] [CrossRef] [Google Scholar]
- Ross, N. P., & Cross, N. J. 2020, MNRAS, 494, 789 [NASA ADS] [CrossRef] [Google Scholar]
- Saxena, A., Marinello, M., Overzier, R. A., et al. 2018, MNRAS, 480, 2733 [Google Scholar]
- Shimwell, T. W., Röttgering, H. J. A., Best, P. N., et al. 2017, A&A, 598, A104 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
- Shimwell, T. W., Tasse, C., Hardcastle, M. J., et al. 2019, A&A, 622, A1 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
- Silverman, B. W. 2017, Density Estimation for Statistics and Data Analysis (Boca Raton: Routledge) [Google Scholar]
- Spergel, D., Gehrels, N., Baltay, C., et al. 2015, ArXiv e-prints [arXiv:1503.03757] [Google Scholar]
- Taylor, M. B. 2005, in Astronomical Data Analysis Software and Systems XIV, eds. P. Shopbell, M. Britton, & R. Ebert, ASP Conf. Ser., 347, 29 [Google Scholar]
- Temple, M. J., Hewett, P. C., & Banerji, M. 2021, MNRAS, 508, 737 [NASA ADS] [CrossRef] [Google Scholar]
- The Dark Energy Survey Collaboration 2005, ArXiv e-prints [arXiv:astro-ph/0510346] [Google Scholar]
- Vanden Berk, D. E., Richards, G. T., Bauer, A., et al. 2001, AJ, 122, 549 [Google Scholar]
- van Haarlem, M. P., Wise, M. W., Gunst, A. W., et al. 2013, A&A, 556, A2 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
- Vedantham, H. K., Callingham, J. R., Shimwell, T. W., et al. 2020, ApJ, 903, L33 [Google Scholar]
- Williams, W. L., Calistro Rivera, G., Best, P. N., et al. 2018, MNRAS, 475, 3429 [NASA ADS] [CrossRef] [Google Scholar]
- Williams, W. L., Hardcastle, M. J., Best, P. N., et al. 2019, A&A, 622, A2 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
- Willott, C. J., Delfosse, X., Forveille, T., Delorme, P., & Gwyn, S. D. J. 2005, ApJ, 633, 630 [NASA ADS] [CrossRef] [Google Scholar]
- Willott, C. J., Albert, L., Arzoumanian, D., et al. 2010, AJ, 140, 546 [NASA ADS] [CrossRef] [Google Scholar]
- Wit, E., van den Heuvel, E., & Romeijn, J.-W. 2012, Statistica Neerlandica, 66, 217 [CrossRef] [Google Scholar]
- Wright, E. L., Eisenhardt, P. R., Mainzer, A. K., et al. 2010, AJ, 140, 1868 [Google Scholar]
- Yang, J., Fan, X., Wu, X.-B., et al. 2017, AJ, 153, 184 [NASA ADS] [CrossRef] [Google Scholar]
Appendix A: HzQ candidates
High probability HzQ candidates.
All Tables
Values of the various parameters used in Eq. (10) with errors as determined by Chen et al. (2001).
Observed optical and infrared magnitudes of P144+50 in all filters relevant to probability calculation and priority assignment.
All Figures
![]() |
Fig. 1. Log likelihood of quasars (left), dwarf stars (middle), and low-redshift galaxies (right) modelled by a Gaussian mixture model in colour space. The red ellipses indicate the 3-sigma extents of the individual deconvolved Gaussians (four components for quasars, six for dwarf stars, and one for galaxies). The quasar likelihoods are assigned using photometry from simulated quasar spectra as highlighted in Sect. 3.1, whereas the photometry for dwarf stars and low-redshift galaxies are taken from their respective reference catalogues, Best et al. (2017) and Williams et al. (2018) (black dots). The dashed line in the left panel marks the commonly defined colour cut at i − z = 2.0. |
In the text |
![]() |
Fig. 2. Expected number density of the galaxy populations as a function of apparent magnitude in PS1 i, z, and y bands extrapolated from the deep multi-wavelength galaxy catalogue in the Boötes field (Williams et al. 2018). |
In the text |
![]() |
Fig. 3. Posterior quasar probabilities of the different populations assigned by our HzQ selection method. The chosen posterior threshold of Pq > 5 × 10−4 is shown using the dashed line, which retains ∼90% of the known quasar population while rejecting > 99% of foreground contaminant populations. |
In the text |
![]() |
Fig. 4. Cumulative distribution function of zP1 magnitudes of the sources classified as candidate HzQs (red line) compared with the full PS1 photometric sample (black line). Our algorithm preferentially assigns higher HzQ priorities to sources that are brighter and securely detected in the zP1 band, ensuring that the Lyα break resulting in higher i − z colours is securely constrained. |
In the text |
![]() |
Fig. 5. Distribution of sources in the zP1 − yP1 vs iP1 − zP1 diagram of the full PS1 sample (grey points), high-probability HzQs (black circles), and the candidate HzQs selected for spectroscopic follow-up (red circles) and those with a radio detection (red squares). The traditional colour cut at i − z = 2 is marked using the black dashed line, demonstrating that our method assigns a high HzQ probability to several sources that lie below this selection. We also mark P144+50 (red star) in the colour–colour plot, which was discovered using our method at a redshift of z = 5.66, and with i − z = 1.4 lies below the standard photometric selection employed in other studies. |
In the text |
![]() |
Fig. 6. Distribution of sources in the zP1 − yP1 vs iP1 − zP1 diagram of the test samples of HzQs (left), brown dwarfs (middle), and galaxies (right). Sources not selected with any method are marked in grey. Sources selected with i − z > 2 colour cut (dashed line) but rejected by the probability selection are marked as a diamond shape with a black edge. Notably, the z − y < 0.5 colour cut (dotted line) rejects all brown dwarfs from the test sample, but also a large portion of the simulated quasar sample. |
In the text |
![]() |
Fig. 7. P144+50 in optical and infrared bands. In the WISE images, it is blended with the neighbouring galaxy. Given the fact that the quasar is brighter in those bands while the galaxy drops off, and that its WISE magnitudes are consistent with those of other HzQs, it is likely that most of the emission in the WISE bands comes from the quasar. |
In the text |
![]() |
Fig. 8. 1D spectrum of P144+50 taken using the IDS instrument on the 2.5 m INT. The characteristic break blueward of the Lyα line is clearly visible, unambiguously identifying it as a high-redshift quasar with a redshift of z = 5.66 based on its Lyα emission. Some flux is seen around the expected Lyβ and O IV features. Apart from Lyα, there are no clear emission lines visible in the spectrum due to the limited sensitivity of the INT. The orange line indicates the noise level of the spectrum. |
In the text |
Current usage metrics show cumulative count of Article Views (full-text article views including HTML views, PDF and ePub downloads, according to the available data) and Abstracts Views on Vision4Press platform.
Data correspond to usage on the plateform after 2015. The current usage metrics is available 48-96 hours after online publication and is updated daily on week days.
Initial download of the metrics may take a while.