A&A, Volume 694, February 2025
Article Number: A259
Number of page(s): 27
Section: Cosmology (including clusters of galaxies)
DOI: https://doi.org/10.1051/0004-6361/202452808
Published online: 19 February 2025
KiDS-Legacy: Angular galaxy clustering from deep surveys with complex selection effects
1. Ruhr University Bochum, Faculty of Physics and Astronomy, Astronomical Institute (AIRUB), German Centre for Cosmological Lensing, 44780 Bochum, Germany
2. Institute for Theoretical Physics, Utrecht University, Princetonplein 5, 3584 CC Utrecht, The Netherlands
3. Centro de Investigaciones Energéticas, Medioambientales y Tecnológicas (CIEMAT), Av. Complutense 40, E-28040 Madrid, Spain
4. Institute of Cosmology & Gravitation, Dennis Sciama Building, University of Portsmouth, Portsmouth PO1 3FX, United Kingdom
5. The Oskar Klein Centre, Department of Physics, Stockholm University, AlbaNova University Centre, SE-106 91 Stockholm, Sweden
6. Imperial Centre for Inference and Cosmology (ICIC), Blackett Laboratory, Imperial College London, Prince Consort Road, London SW7 2AZ, UK
7. Argelander-Institut für Astronomie, Universität Bonn, Auf dem Hügel 71, D-53121 Bonn, Germany
8. School of Mathematics, Statistics and Physics, Newcastle University, Herschel Building, NE1 7RU Newcastle-upon-Tyne, UK
9. Center for Theoretical Physics, Polish Academy of Sciences, Al. Lotników 32/46, 02-668 Warsaw, Poland
10. Institute for Astronomy, University of Edinburgh, Royal Observatory, Blackford Hill, Edinburgh EH9 3HJ, UK
11. Department of Physics and Astronomy, University College London, Gower Street, London WC1E 6BT, UK
12. Dipartimento di Fisica e Astronomia “Augusto Righi” – Alma Mater Studiorum Università di Bologna, Via Piero Gobetti 93/2, 40129 Bologna, Italy
13. INAF – Osservatorio di Astrofisica e Scienza dello Spazio di Bologna, Via Piero Gobetti 93/3, 40129 Bologna, Italy
14. Leiden Observatory, Leiden University, Einsteinweg 55, 2333 CC Leiden, The Netherlands
15. Aix-Marseille Université, CNRS, CNES, LAM, Marseille, France
16. Universität Innsbruck, Institut für Astro- und Teilchenphysik, Technikerstr. 25/8, 6020 Innsbruck, Austria
17. Department of Physics, University of Oxford, Denys Wilkinson Building, Keble Road, Oxford OX1 3RH, United Kingdom
18. Donostia International Physics Center, Manuel Lardizabal Ibilbidea 4, 20018 Donostia, Gipuzkoa, Spain
19. Istituto Nazionale di Fisica Nucleare (INFN) – Sezione di Bologna, Viale Berti Pichat 6/2, I-40127 Bologna, Italy
20. Department of Physics “E. Pancini”, University of Naples Federico II, C.U. di Monte Sant’Angelo, Via Cintia 21, ed. 6, 80126 Naples, Italy
21. Institute for Computational Cosmology, Ogden Centre for Fundamental Physics – West, Department of Physics, Durham University, South Road, Durham DH1 3LE, UK
22. Centre for Extragalactic Astronomy, Ogden Centre for Fundamental Physics – West, Department of Physics, Durham University, South Road, Durham DH1 3LE, UK
⋆ Corresponding author; yanza21@astro.rub.de
Received: 30 October 2024
Accepted: 15 January 2025
Photometric galaxy surveys, despite their limited resolution along the line of sight, encode rich information about the large-scale structure (LSS) of the Universe thanks to the high number density and extensive depth of the data. However, the complicated selection effects in wide and deep surveys can potentially cause significant bias in the angular two-point correlation function (2PCF) measured from those surveys. In this paper, we measure the 2PCF from the newly published KiDS-Legacy sample. Given an r-band 5σ magnitude limit of 24.8 and a survey footprint of 1347 deg2, it achieves an excellent combination of sky coverage and depth for such a measurement. We find that complex selection effects, primarily induced by varying seeing, introduce an over-estimation of the 2PCF by approximately an order of magnitude. To correct for such effects, we apply a machine learning-based method to recover an organised random (OR) that exhibits the same selection pattern as the galaxy sample. The basic idea is to find the selection-induced clustering of galaxies using a combination of self-organising maps (SOMs) and hierarchical clustering (HC). This unsupervised machine learning method is able to recover complicated selection effects without specifying their functional forms. We validate this SOM+HC method on mock deep galaxy samples with realistic systematics and selections derived from the KiDS-Legacy catalogue. Using mock data, we demonstrate that the OR delivers unbiased 2PCF cosmological parameter constraints, removing the 27σ offset in the galaxy bias parameter that is recovered when adopting uniform randoms. Blinded measurements on the real KiDS-Legacy data show that the corrected 2PCF is robust to the SOM+HC configuration near the optimal set-up suggested by the mock tests.
Key words: methods: data analysis / methods: statistical / cosmological parameters / cosmology: observations / large-scale structure of Universe
© The Authors 2025
Open Access article, published by EDP Sciences, under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
This article is published in open access under the Subscribe to Open model. Subscribe to A&A to support open access publication.
1. Introduction
The fluctuation patterns of the mass distribution in the Universe (i.e. the large-scale structure, LSS) are one of the central topics of modern cosmology (see, for example, Dodelson & Schmidt 2020, for a comprehensive introduction to the LSS). Because the matter in the Universe is mainly made up of invisible dark matter, one needs to observe visible LSS tracers to infer the matter distribution. Among observations of various LSS tracers, large-scale galaxy surveys have provided robust constraints on cosmological parameters. Spectroscopic surveys such as the Sloan Digital Sky Survey (SDSS-IV; Alam et al. 2021) and the Dark Energy Spectroscopic Instrument (DESI; DESI Collaboration 2024) map 3D galaxy distributions from which they measure baryon acoustic oscillations (BAOs) and redshift-space distortions (RSDs) to constrain the Hubble constant, matter density, dark energy equation of state, and neutrino mass (Cuceu et al. 2019; Loureiro et al. 2019; Neveux et al. 2020). Recent photometric surveys, including the Kilo-Degree Survey (KiDS; Asgari et al. 2021), the Dark Energy Survey (DES; Amon et al. 2022; Secco et al. 2022), and the Hyper Suprime-Cam Subaru Strategic Program (HSC-SSP; Li et al. 2023a; Dalal et al. 2023) have obtained consistent joint constraints on the matter density, Ωm, and fluctuation amplitude, σ8, of the density field with a precision of ∼5%. As the precision improves, tensions between the probes have begun to emerge, an important one being the tension in S8, the matter clustering amplitude parameter, between the CMB and the late-time Universe probes (for a review, see Abdalla et al. 2022). Future surveys, including the Legacy Survey of Space and Time (LSST) at the Vera Rubin Observatory (LSST Science Collaboration 2009), the Euclid Survey (Euclid Collaboration: Mellier et al. 2025), and the Xuntian survey (Gong et al. 2019; also known as the Chinese Space Station Telescope, CSST) promise to tackle these problems with better data quality and systematics estimation.
The galaxy distribution encodes rich information about the LSS of the Universe and galaxy formation, so it has long been used as a tracer to study the LSS (Peebles 1973; Shanks et al. 1983; Maddox et al. 1990; Efstathiou et al. 1990; Maddox et al. 1996; Nichol 2007). On large scales, the galaxy distribution can be approximated as linearly biased relative to the matter distribution (see Desjacques et al. 2018, for a review on galaxy bias). Subtler bias terms can be modelled via an effective field theory (Baumann et al. 2012; Carrasco et al. 2012). On small scales, the galaxy distribution can be described by a halo occupation distribution model (HOD; Zheng et al. 2005) in the context of the halo model (Peacock & Smith 2000; Seljak 2000; Cooray & Sheth 2002; Asgari et al. 2023). An important summary statistic of the galaxy distribution is the two-point correlation function (2PCF), which describes the galaxy clustering between two points as a function of their separation. Measurements of the 2PCF have been used to constrain the matter density and clustering amplitude of the LSS since the 1980s (Davis & Peebles 1983). Recently, a combination of galaxy clustering, galaxy-galaxy lensing, and cosmic shear, called the 3 × 2 pt analysis, has greatly improved the constraining power on Ωm and S8 (Heymans et al. 2021; Abbott et al. 2022; Sugiyama et al. 2023; Dvornik et al. 2023).
Measurements of three-dimensional galaxy clustering depend on reliable galaxy redshift measurements, typically carried out by spectroscopic observations that are generally time-consuming and limited in depth. While photometric surveys can reach deeper, they cannot obtain precise redshifts for each galaxy. In this case, given the angular coordinates of galaxy positions, one can measure the angular galaxy clustering with the 2PCF as a function of angular separation. With the calibrated redshift distribution of a photometric sample, the angular 2PCF can be modelled as the projected 3D 2PCF weighted by the redshift distribution. The angular 2PCF also encodes a significant amount of cosmological information (see, for example, Coupon et al. 2012) and is relatively easy to measure. However, several systematic uncertainties need to be considered to obtain accurate constraints. In terms of modelling, galaxy distributions are affected by redshift-space distortions (RSDs; Kaiser 1987) and cosmic magnification (Ménard 2002). On the observational side, galaxies are subject to selection effects from various observational systematics, including seeing, sky background, instrument response, survey strategy, galactic extinction, and so on. In practice, the selection effects are anisotropic given the various observational conditions and survey strategies, such that the galaxy sample will show a variable depth pattern that introduces additional non-cosmological correlations, biasing the angular 2PCF measurement and subsequent cosmological constraints.
The selection effects of bright samples tend to be mild (Johnston et al. 2021), but they are expected to be more significant for deeper samples since fainter galaxies are more sensitive to the selection. However, we can benefit from the higher number density and higher redshift of deep samples with their lower shot noise and richer information about the higher redshifts. In this work we measure the angular galaxy clustering with the Legacy catalogue selected from the fifth data release of the KiDS survey (Wright et al. 2024). It is a deep sample that reaches an r-band magnitude limit of 24.8, which is deeper than the KiDS-Bright catalogue (Bilicki et al. 2021) used in Johnston et al. (2021) by five magnitudes. Notably, the KiDS-Legacy catalogue has a 5σ magnitude limit in the i-band of 23.5, which is deeper than the MagLim sample selected from DES data (Porredon et al. 2021) with an i-band magnitude limit of 22.2. Although Morrison & Hildebrandt (2015) and Nicola et al. (2020) have measured the 2PCF using deeper galaxy samples from the Canada-France-Hawaii Telescope Lensing Survey (CFHTLenS; Heymans et al. 2012) and HSC, the KiDS-Legacy sample has the advantage of larger sky coverage and enhanced photo-z accuracy at z > 1 thanks to additional near-infrared bands. The KiDS-Legacy 2PCF will be included in the KiDS-Legacy 6 × 2 pt analysis (see Johnston et al. 2024, for forecasts of the 6 × 2 pt analysis), which is a combination of the six two-point statistics among KiDS-Legacy galaxy shapes, galaxy positions, and the spectroscopic samples of 2dFLenS (Blake et al. 2016) and BOSS DR12 (Alam et al. 2015). The advantages of such a measurement include low shot noise due to high number densities and self-calibration of the redshift distribution, given the same n(z) as the shear sample. As we show here, the strong and complicated selection effects in the KiDS-Legacy catalogue cause an order-of-magnitude bias in the 2PCF.
Therefore, it is crucial to correct such a bias to ensure the accuracy of subsequent cosmological analyses.
Various methods have been proposed and applied to correct for selection effects. Following Johnston et al. (2021), we classify them into three types:
- Mitigating selection bias in statistics. Correcting the selection effects with the following two methods (Weaverdyck & Huterer 2021): (1) the template subtraction method (Ho et al. 2012; Ross et al. 2011), which models contamination terms as templates multiplied by an amplitude factor. After determining the amplitude factor, these terms are subtracted from the statistics; (2) mode projection (Leistedt et al. 2013; Leistedt & Peiris 2014; Nicola et al. 2020; Berlfein et al. 2024), which mitigates systematics contamination by marginalising over the functional contribution of systematic templates.
- Forward modelling with synthetic objects. This approach calibrates the observational selections by injecting artificial galaxies into realistic images (Bergé et al. 2013; Suchyta et al. 2016). The injected objects are passed through a realistic measurement process, resulting in a random catalogue mimicking realistic selection functions. This method is computationally demanding, but it is under active development for ongoing and future surveys (Everett et al. 2022; Kong et al. 2024).
- Reconstructing the selection function by regression. This method aims to probe the relationship between observed galaxy number densities and systematics values to construct weight maps or random catalogues that reflect the reconstructed selection functions. For example, Elvin-Poole et al. (2018), Rezaie et al. (2020), and Rodríguez-Monroy et al. (2022) regress galaxy number and systematics with the help of systematics templates, while Morrison & Hildebrandt (2015) account for selection effects by finding selection-induced galaxy clustering in high-dimensional systematics space.
A subclass of regression methods uses machine learning to learn the relationship between galaxy number densities and systematics values. This class of methods is more capable of finding complex, non-linear, and potentially correlated selection effects. For example, Rezaie et al. (2020) use deep neural networks to derive weights for galaxies selected from the Dark Energy Camera Legacy Survey (DECaLS; Dey et al. 2019), while Morrison & Hildebrandt (2015) apply a k-means clustering method in the high-dimensional density-systematics space to create weight maps. The method used in this work belongs to this category.
We focus on the 2PCF in this work. In practice, the 2PCF can be estimated with the Landy-Szalay estimator (Landy & Szalay 1993), which employs a random catalogue to factor out non-cosmological correlations. If the catalogue has significant selection effects, a uniform random catalogue (UR) will fail to capture them and will not prevent a bias in the estimated 2PCF. An organised random catalogue (OR) is a tailored random catalogue reflecting the same selection effects, which can be used to correct such biases. Johnston et al. (2021) proposed a machine-learning-based method to recover the OR. It uses a combination of self-organising maps and hierarchical clustering (SOM+HC) to identify clusters of galaxies in the systematics space. Galaxies from the same cluster are assumed to be selected uniformly, so a UR with the same galaxy number is generated in the sky region occupied by those galaxies. The OR is constructed by combining the URs corresponding to all the clusters. This method was validated in Johnston et al. (2021) on the KiDS-1000 bright sample.
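The steps above can be sketched in a few lines of Python. This is a toy illustration rather than the actual SOM+HC implementation: the SOM stage is replaced by a simple k-means vector quantisation (`kmeans2`) standing in for the SOM prototypes, the hierarchical-clustering stage uses scipy's Ward linkage, and all inputs are synthetic.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)

# Toy inputs: per-pixel systematics (n_pix, n_sys) and galaxy counts per pixel,
# with the depth depending on the first systematic.
n_pix, n_sys = 5000, 5
systematics = rng.normal(size=(n_pix, n_sys))
counts = rng.poisson(10 + 5 * (systematics[:, 0] > 0), size=n_pix)

# Stage 1 (stand-in for the SOM): quantise pixels onto prototype vectors.
n_proto = 100
protos, proto_of_pix = kmeans2(systematics, n_proto, minit='++', seed=1)

# Stage 2: hierarchical clustering merges the prototypes into a few groups.
n_groups = 10
Z = linkage(protos, method='ward')
group_of_proto = fcluster(Z, t=n_groups, criterion='maxclust')
group_of_pix = group_of_proto[proto_of_pix]

# OR weight map: within each group, spread that group's galaxies uniformly
# over the pixels the group occupies (selection assumed uniform inside a group).
or_map = np.zeros(n_pix)
for g in np.unique(group_of_pix):
    sel = group_of_pix == g
    or_map[sel] = counts[sel].sum() / sel.sum()
```

By construction the OR preserves the total galaxy count while smoothing out only the within-group density variations, which is what allows selection-induced (but not cosmological) clustering to be absorbed into the random term.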
In this work, we apply the SOM+HC method to correct for complex selection effects in the full KiDS-Legacy sample. We first present the SOM+HC methodology, then test it on mock galaxy samples generated from the Generator for Large-Scale Structure (GLASS; Tessore et al. 2023). The mock systematics and selections are applied based on the real KiDS-Legacy sample. We evaluate the method by comparing the 2PCF measured with the recovered OR with an unbiased 2PCF with no selection effect. Then we perform a preliminary, blinded 2PCF measurement from the real KiDS-Legacy catalogue. This work focuses on comparing and modelling the 2PCFs on linear scales, as most photometric surveys use such scales due to limitations in the theoretical modelling (Crocce et al. 2015; Rodríguez-Monroy et al. 2022). In forthcoming work, the OR recovered from the KiDS-Legacy catalogue will be used to measure the 2PCF for the KiDS-Legacy 6 × 2-pt analysis, and will also be useful for future LSS surveys.
The paper is structured as follows: Sect. 2 describes the definition and measurement of the galaxy 2PCF; Sect. 3 presents the data and mocks that we use to run and test the method; Sect. 4 introduces the SOM+HC method to correct the selection effects; Sect. 5 deals with validations of the SOM+HC method; Sect. 6 presents a blinded measurement with the KiDS-Legacy catalogue; and Sect. 7 summarises the conclusions of this work and discusses the advantages of our method and future prospects. Throughout this study we assume a flat ΛCDM cosmology with fixed cosmological parameters from Planck Collaboration VI (2020) as our background: (h, Ωch², Ωbh², σ8, ns) = (0.676, 0.119, 0.022, 0.81, 0.967). The code used in this work is published as the TIAOGENG package for future use.
2. The galaxy clustering correlation function
2.1. Definition and connection with matter distribution
Galaxy 3D 2PCFs describe the correlation of galaxy number over-density between points separated by a certain distance. For two galaxy samples a and b, their 2PCF is defined as (Peebles 1973)

ξab(r) = ⟨δa(x) δb(x + r)⟩,

where ⟨⋅⟩ denotes the ensemble average. Due to the large-scale ergodicity of the Universe, the ensemble average can be approximated as a spatial average. In addition, due to the large-scale isotropy and homogeneity of the Universe, the 2PCF only depends on r, the length of the spatial separation between the two points. δ(x) denotes the galaxy number over-density at point x:

δg(x) = [n(x) − n̄] / n̄.

Here n(x) denotes the galaxy number density at x and n̄ is the mean galaxy number density. Since we mainly focus on the galaxy 2PCF, we omit the subscript g in the following formulas. We note that the 2PCF can also be defined for fluctuations of any LSS field.
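The over-density definition can be illustrated with a toy numpy sketch (synthetic Poisson counts standing in for a galaxy density field, not survey data):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy galaxy number counts in a set of equal-volume cells.
n = rng.poisson(lam=20.0, size=10_000).astype(float)

# Over-density: delta = (n - nbar) / nbar, with nbar the mean count.
nbar = n.mean()
delta = (n - nbar) / nbar
```

By construction the over-density field averages to zero, which is why the 2PCF probes only the fluctuations about the mean density.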
According to Parseval’s theorem, the 2PCF is the Fourier transform of the power spectrum, the correlation function in Fourier space. They are both widely used summary statistics of the spatial distribution of cosmological fields, and they both depend on the evolution and ingredients of the Universe. For detailed mathematical descriptions of the 2PCF and power spectra, we refer to Dodelson & Schmidt (2020).
For photometric surveys, we usually cannot measure the distance to an individual galaxy. Thus we can only project galaxies onto the celestial sphere and study their angular distribution. We can define the angular two-point correlation function by replacing x with a 2-dimensional angular vector θ. Since we only care about the angular galaxy distribution in this paper, we use 2PCF to denote the angular two-point correlation function unless otherwise stated. Under the ‘flat-sky approximation’, the angular 2PCF is related to the angular power spectrum via (Peebles 1973)

wab(θ) = ∫ dℓ ℓ/(2π) Cab(ℓ) J0(ℓθ),

where J0(x) is the zeroth-order Bessel function of the first kind and Cab(ℓ) is the angular galaxy cross-correlation power spectrum of samples a and b. We note that this formula works for the 2PCF and angular power spectra between any cosmological fields. At sufficiently large ℓ, the angular power spectra can be calculated via the Limber approximation (Limber 1953):

Cab(ℓ) = ∫0χH dχ [Wa(χ) Wb(χ) / χ²] Pgg(k = (ℓ + 1/2)/χ, z(χ)).

Here χ is the comoving distance and the subscript H denotes the horizon; Pgg(k, z) is the galaxy power spectrum at redshift z; Wa(χ) and Wb(χ) are the radial kernels of the two galaxy samples. We note that this formulation holds for a flat cosmology. For sample a, given its normalised galaxy redshift distribution na(z), its radial kernel is

Wa(χ) = na(z) H(z)/c,

where H(z) is the Hubble parameter at redshift z. The radial kernel for sample b is defined likewise.
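The structure of the Limber integral can be sketched numerically. Everything below is an illustrative stand-in (a toy P(k) that falls off at high k, a Gaussian n(z), and a constant-H(z) distance-redshift relation), not the paper's modelling pipeline:

```python
import numpy as np

# Toy Limber integral: C(l) ~ sum over chi of W(chi)^2 / chi^2 * P(k=(l+1/2)/chi).
c_over_H = 3000.0                      # Hubble distance c/H in Mpc/h (constant-H toy)
z = np.linspace(0.01, 2.0, 500)
chi = c_over_H * z                     # toy distance-redshift relation chi = (c/H) z
dchi = np.gradient(chi)

# Toy normalised redshift distribution n(z), Gaussian around z = 0.8.
nz = np.exp(-0.5 * ((z - 0.8) / 0.3) ** 2)
nz /= np.sum(nz) * (z[1] - z[0])       # normalise so that the integral of n(z) dz is 1

W = nz / c_over_H                      # radial kernel W(chi) = n(z) H(z) / c

def P_toy(k):
    """Toy power spectrum: rises at low k, falls roughly as k^-2 at high k."""
    return 1e4 * k / (1.0 + (k / 0.02) ** 3)

ells = np.array([50, 100, 200, 400, 800])
C_ell = np.array([np.sum(dchi * W**2 / chi**2 * P_toy((ell + 0.5) / chi))
                  for ell in ells])
```

For an auto-correlation the two kernels coincide (W² in the integrand); the resulting C(ℓ) inherits the decline of the toy P(k) towards small scales.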
The galaxy power spectra Pgg(k, z) encode the 3D fluctuations of galaxy distributions, which follow the fluctuations of the underlying mass distribution. On small scales, the non-linear clustering of galaxies can be modelled with a halo model-based halo occupation distribution (HOD; Seljak 2000; Cooray & Sheth 2002; Zheng et al. 2005). On large scales, galaxy fluctuations can be approximated as linearly biased matter fluctuations with δg(x, z) = bg(z)δm(x, z), where bg(z) is the redshift-dependent galaxy bias. Thus Pgg(k, z) = ba(z) bb(z) Pm(k, z), where Pm(k, z) is the matter power spectrum. We can take the galaxy biases outside of the Limber integral by approximating them at the mean redshifts of the galaxy samples (denoted b̄a and b̄b). Therefore, on linear scales we have

wab(θ) = b̄a b̄b wm(θ),

where wm(θ) is the 2PCF of the matter field, which can be calculated similarly to that of galaxies, but with the galaxy power spectrum replaced by the matter power spectrum.
2.2. The measurement of two-point correlation function
In practice, the galaxy 2PCF is measured in angular bins by counting the number of galaxy pairs (one galaxy from catalogue a and the other from b) with separation angles within each angular bin. Given a pair of galaxy catalogues (denoted D hereafter) and corresponding random catalogues (denoted R hereafter), the 2PCF can be measured using a standard Landy & Szalay (1993) estimator,

ŵab(θi) = [(DaDb)i − (DaRb)i − (RaDb)i + (RaRb)i] / (RaRb)i,

where θi is the mean angle in the i-th angular bin. Here (DaDb)i is the normalised number of galaxy pairs with separation angles within the i-th bin, which can be formally expressed as

(DaDb)i = [Σp Σq Θi(θpq)] / [NDa NDb − δab NDa],

where ND is the number of galaxies, θpq is the separation angle between the p-th and q-th galaxies, and Θi(θ) is the rect function that equals 1 when θ falls in the i-th bin and 0 otherwise. δab is the Kronecker δ symbol. The denominator is the total number of galaxy pairs, and the δab term removes galaxies paired with themselves for the auto-correlation (i.e. a = b). The RR term is defined likewise, while DR is defined without the δab term.
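A minimal sketch of the estimator on a tiny uniform mock (brute-force O(N²) pair counting, far too slow for a real survey; because the mock data are uniform in the same window as the randoms, the estimated w(θ) should scatter around zero):

```python
import numpy as np

rng = np.random.default_rng(3)

def radec_to_xyz(ra, dec):
    """Unit vectors on the sphere from RA/Dec in radians."""
    return np.stack([np.cos(dec) * np.cos(ra),
                     np.cos(dec) * np.sin(ra),
                     np.sin(dec)], axis=-1)

def pair_counts(xyz1, xyz2, bins, auto=False):
    """Normalised pair counts in angular bins (brute force)."""
    theta = np.arccos(np.clip(xyz1 @ xyz2.T, -1.0, 1.0))
    if auto:
        np.fill_diagonal(theta, np.inf)            # drop self-pairs
        norm = len(xyz1) * (len(xyz2) - 1)         # ordered pairs, no self-pairs
    else:
        norm = len(xyz1) * len(xyz2)
    counts, _ = np.histogram(theta, bins=bins)
    return counts / norm

# Uniform 'data' and denser 'randoms' in the same small window.
n_d, n_r = 300, 1200
D = radec_to_xyz(rng.uniform(0, 0.2, n_d), rng.uniform(0, 0.2, n_d))
R = radec_to_xyz(rng.uniform(0, 0.2, n_r), rng.uniform(0, 0.2, n_r))

bins = np.radians(np.array([0.5, 1.0, 2.0, 4.0]))  # angular bin edges in radians
dd = pair_counts(D, D, bins, auto=True)
dr = pair_counts(D, R, bins)
rr = pair_counts(R, R, bins, auto=True)

# Landy-Szalay estimator; for an auto-correlation DaRb and RaDb coincide.
w = (dd - 2 * dr + rr) / rr
```

The random catalogue is made several times denser than the data, which is the usual choice to suppress shot noise in the RR and DR terms.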
For this work, instead of counting galaxy pairs in angular bins, we pixelised the galaxy distribution into the HEALPix scheme (Górski et al. 2005) and counted pixel pairs weighted by the galaxy numbers in each pixel. This loses sub-pixel information, but we had three reasons for doing so: (1) pixelisation speeds up the calculation: the original KiDS catalogue contains ∼200 million galaxies, while a pixelised galaxy map with Nside = 2048, which we use throughout this work, only has about one million non-zero pixels, so counting pixel pairs instead of galaxy pairs significantly speeds up the measurement; (2) as we introduce later, our method to recover ORs is pixel-based, so any sub-pixel variable depth cannot be corrected; (3) the pixel size of an Nside = 2048 HEALPix map is 1.7 arcmin, corresponding to a physical size of 0.86 Mpc at z = 1.5, which is well below the linear scales that we probe in this work.
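The quoted 1.7 arcmin pixel scale follows directly from the HEALPix geometry: a map at resolution Nside has 12 × Nside² equal-area pixels over the 4π sky, so the characteristic pixel side is the square root of the pixel area. A quick pure-numpy check:

```python
import numpy as np

nside = 2048
npix = 12 * nside**2                    # HEALPix: 12 * Nside^2 equal-area pixels
pix_area_sr = 4 * np.pi / npix          # full sky = 4*pi steradians
pix_side_rad = np.sqrt(pix_area_sr)     # side of an equal-area square pixel
pix_side_arcmin = np.degrees(pix_side_rad) * 60
```

This recovers the ~1.7 arcmin figure quoted in the text.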
With pixelised galaxy fields, the terms in the Landy-Szalay estimator are modified as (taking DD as an example)

(DaDb)i = [Σp Σq NpDa NqDb Θi(θpq)] / [Σp NpDa Σq NqDb],

where p and q are now pixel indices running over the PDa, b occupied pixels of samples a and b; θpq is the angular separation between the two pixels; and NpDa is the number of galaxies from sample a in pixel p. The random terms are defined likewise. The pixelised UR map can be taken as the footprint map with pixel value fp representing the coverage fraction of pixel p.
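The weighted pixel-pair count can be sketched on a toy flat grid (pixel centres on a plane standing in for HEALPix pixel centres; the normalisation by the total pair count is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy pixelised field: pixel centres on a small flat grid, galaxy counts per pixel.
xs, ys = np.meshgrid(np.arange(20), np.arange(20))
centres = np.stack([xs.ravel(), ys.ravel()], axis=-1).astype(float)
N = rng.poisson(2.0, size=len(centres))

occ = N > 0                                   # keep only occupied pixels
c, wts = centres[occ], N[occ]

# Weighted pair counts: each pixel pair (p, q) contributes N_p * N_q.
sep = np.linalg.norm(c[:, None, :] - c[None, :, :], axis=-1)
weight = wts[:, None] * wts[None, :]
np.fill_diagonal(weight, 0)                   # same-pixel (sub-pixel) pairs are lost

bins = np.array([0.5, 2.0, 4.0, 8.0])
dd = np.histogram(sep, bins=bins, weights=weight)[0]
dd_norm = dd / (wts.sum() * (wts.sum() - 1))  # normalise by total ordered pair count
```

Dropping the diagonal makes explicit what the text notes: pairs of distinct galaxies inside the same pixel fall below the pixel scale and are discarded by the pixelised estimator.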
The random catalogue or map is used to factor out galaxy distributions induced by cosmology-unrelated systematics, including galactic and atmospheric foregrounds, instrument response, and survey strategy. If all those systematics are uniform across the survey footprint, the random catalogue is usually taken as a uniform Poisson random sample within the survey footprint with several times the mean galaxy number density. However, deep surveys, such as the KiDS survey, have non-uniform systematics that cause variable depth in the galaxy sample. A UR will fail to capture the spatial variation in galaxy number density caused by those selection effects and will therefore bias the 2PCF.
To de-bias the 2PCF, we can create a random catalogue with the same variable depth, namely an ‘organised random’ catalogue, or, for pixelised measurement, an OR weight map for the R terms in the estimator. The method to recover the OR will be presented in Sect. 4. We note that the OR can be used for summary statistics other than the angular 2PCF. We present the OR application in angular power spectra Cℓ in Appendix B.
3. The KiDS-Legacy data and GLASS mock data
3.1. The Kilo-Degree Survey and the Legacy data
In this work, we use the KiDS-Legacy galaxy catalogue selected from the fifth data release (DR5) of the Kilo-Degree Survey (KiDS; Wright et al. 2024). KiDS is a wide-field imaging survey that measures the positions and shapes of galaxies using the VLT Survey Telescope (VST) at the European Southern Observatory (ESO). Both the telescope and the survey were primarily designed for weak gravitational lensing applications. High-quality optical images are produced with VST-OmegaCAM, and these data are then combined with imaging from the VISTA Kilo-degree INfrared Galaxy survey (VIKING; Edge et al. 2013), allowing all sources in KiDS to be photometrically measured in nine optical and near-infrared bands: ugriZYJHKs (Wright et al. 2019). Although the sky coverage of KiDS is smaller than that of some galaxy lensing surveys (such as DES; Abbott et al. 2016), galaxy photometric redshift estimation and redshift distribution calibration (especially at high redshift) benefit from the complementary NIR information from VIKING (which was co-designed with KiDS to reach complementary depths in the NIR bands). Cosmological constraints have already been made available from DR3 and DR4 (Hildebrandt et al. 2017; Heymans et al. 2021). KiDS DR5 covers a survey area of 1347 deg2. It also includes an i-band re-observation of the full footprint, thereby increasing the effective i-band depth by 0.4 mag and enabling multi-epoch science. There is a 27 deg2 overlap with deep spectroscopic surveys, which enables the robust calibration of photometric redshifts across the full survey footprint (Hildebrandt et al. 2021).
The “KiDS-Legacy” sample used in this work is a subset of the full DR5 sample, primarily determined by the availability of reliable shape measurements. It is the lensing sample that will be used for the fiducial KiDS DR5 cosmological analyses. Each galaxy in the KiDS-Legacy sample has ellipticities measured with the lensfit algorithm (Miller et al. 2013; Fenech Conti et al. 2017; Li et al. 2023b) for weak lensing analyses. The footprint of KiDS DR5 is divided into a Northern and a Southern patch. Artefacts around bright, saturated stars, planets, the Moon, satellite flares, aeroplanes, and higher-order reflections from very bright stars have been masked out, leaving an unmasked ∼1000 deg2 footprint. In addition, blended sources, unresolved binaries, transients, point sources flagged by lensfit, and sources with failed photometry, badly estimated shapes, low resolution, or zero lensing weight are removed from the full sample, resulting in a Legacy sample of 43 205 156 galaxies, corresponding to 12 galaxies arcmin−2. A Bayesian photo-z (BPZ; Benítez 2000) estimation gives an approximate redshift range of 0.1 < zB < 2.0. The redshift distribution is calibrated with a combination of the SOM and clustering-redshift methods (van den Busch et al. 2020), which will be presented in a companion paper. For more details about the construction of the KiDS-Legacy catalogue, we refer to Sect. 7 of Wright et al. (2024). The galaxy distribution is shown in Fig. 1. We can tell by eye that the galaxy distribution has a clear tile-like pattern that is unlikely to originate from the LSS.
Fig. 1. Galaxy distribution of the KiDS-Legacy catalogue. The map is pixelised into HEALPix grids with Nside = 2048. The map is colour-coded by galaxy number per pixel, which has a size of 1.7 arcmin. A tile-based selection pattern can be seen by eye. The shaded region in the colour bar shows the normalised distribution of galaxy number per pixel.
The KiDS-Legacy data will be used to constrain cosmology with cosmic shear (Wright et al., in prep.), the combined analysis of three two-point functions (3 × 2 pt), and the combined analysis of six two-point functions (6 × 2 pt; see Johnston et al. 2024 for the forecast). The 3 × 2 pt measurements include cosmic shear, galaxy-galaxy lensing (GGL), and galaxy clustering. The galaxy clustering correlation function in KiDS-1000 3 × 2 pt presented in Heymans et al. (2021) is measured from the Baryon Oscillation Spectroscopic Survey Data Release 12 (BOSS DR12, Reid et al. 2016; Tröster et al. 2020), while for KiDS-Legacy the 3 × 2 pt statistics will be measured with the Bright sample selected from the Legacy data. The KiDS-Legacy 6 × 2 pt analysis measures the 2-point functions among KiDS galaxy shapes, galaxy positions, and the spectroscopic samples of 2dFLenS (Blake et al. 2016) and BOSS DR12 (Alam et al. 2015); therefore, six 2-point functions in total.
The KiDS survey strategy is tile-based. That is, the survey footprint is divided into adjacent 1° ×1° square tiles, each of which gets five exposures and is never re-observed afterwards except for the i-band, which was observed twice. Therefore, observational systematics can have three types of spatial variation based on their origin:
- Type A, inter-tile: uniform within each tile but differing across tiles, usually originating from varying observing conditions across tiles;
- Type B, intra-tile: varying within each tile, usually originating from focal-plane distortion;
- Type C, observation-independent: variations usually generated by the anisotropy of the Milky Way or the Solar System.
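The three types can be illustrated with toy maps (a synthetic 4° × 4° footprint of 1° × 1° tiles; the systematic names and amplitudes here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 4x4-degree footprint sampled at 0.05 deg, divided into 1x1 deg tiles.
res, tile = 0.05, 1.0
x, y = np.meshgrid(np.arange(0, 4, res), np.arange(0, 4, res))
tx, ty = (x // tile).astype(int), (y // tile).astype(int)

# Type A (inter-tile): one value per tile, e.g. seeing drawn per observation.
seeing_per_tile = rng.normal(0.7, 0.1, size=(4, 4))
type_A = seeing_per_tile[ty, tx]

# Type B (intra-tile): varies within each tile, e.g. PSF ellipticity growing
# with distance from the tile centre (focal-plane distortion).
dx, dy = x % tile - tile / 2, y % tile - tile / 2
type_B = 0.02 + 0.05 * np.sqrt(dx**2 + dy**2)

# Type C (observation-independent): a smooth large-scale gradient, e.g. extinction.
type_C = 0.03 + 0.01 * (x + y) / 8
```

Type A is piecewise-constant over tiles, Type B repeats the same pattern in every tile, and Type C varies smoothly regardless of the tiling; a real systematics map is a superposition of all three.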
A synthesis of systematics results in a complex selection pattern, namely variable depth, in the galaxy distribution. Figure 1 illustrates the galaxy map of the KiDS-Legacy catalogue, which exhibits a tile-based pattern of variable depth. For example, galaxies in the anomalous dark blue tile around (α = 196 deg, δ = 1.5 deg) in the centre of KiDS-North in Fig. 1 are significantly depleted. Heydenreich et al. (2020) studied the effect of variable depth on cosmic shear and concluded that the impact is insignificant for KiDS-like surveys. However, the variable depth will introduce significant bias in angular clustering measurement if not corrected. As galaxy clustering measurements form part of the KiDS-Legacy 3 × 2 pt and 6 × 2 pt analyses, it is essential to correct this bias.
To correct the selection effects, we create an OR which reflects the same variable depth as the galaxy sample from the systematics given by the KiDS-Legacy catalogue. Those systematics generally originate from the Galactic foreground, atmospheric seeing, survey strategy, and instrument set-up, and are ideally independent of cosmology, galaxy properties, and the LSS. In the catalogue, systematics are described by metadata that are calculated or derived from the instrument set-up, observing conditions, image properties, and so on. In the following text, we use ‘systematics’ to also refer to the metadata we use to obtain the OR. We select five systematics that are most likely to have selection effects: {Level, PSF_size, PSF_ell, EXTINCTION_r, GAIA_nstar}. Their types of distribution, definitions, origins and selection effects are summarised in Table 1. These systematics, with the exception of GAIA_nstar, are measured in the r-band because this is the detection band and we do not consider tomographic redshift binning in this analysis.
Table 1. Summary of the systematics of each galaxy that we used to recover the OR.
Figure 2 shows the maps of these systematics. Each systematics map is divided into south and north fields, colour-coded by the mean systematics value in each pixel. The shaded regions in the colour bars are the probability distributions of the systematics. From Fig. 2, we can see how Level, PSF_size, and PSF_ell are distributed based on the 1° × 1° tiles. Among them, Level and PSF_size are determined by the seeing of each exposure and are therefore relatively uniform within each tile but vary from tile to tile, while PSF_ell depends on the curvature of the focal plane, which changes from the centre to the edge of each tile. On the other hand, EXTINCTION_r and GAIA_nstar reflect the spatial distribution of the Milky Way and are therefore more diffuse and independent of the tiles. To illustrate the selection effect of each systematics, we plot the galaxy number contrast with respect to the systematics values as black curves in the colour bars of Fig. 2. We note that the black curves share the same dynamic range between −0.25 and 0.25, so one can see that PSF_size has the strongest selection effect. In addition, the selection effects of EXTINCTION_r and GAIA_nstar are slightly non-linear.
Fig. 2. Maps of the KiDS-Legacy systematics considered in this paper. Each map is divided into northern and southern fields, plotted together with their colour bars. The black curves over-plotted in the colour bars show the relationship between galaxy contrast and systematics value; the dynamic ranges are [−0.25, 0.25]. The shaded regions show the probability distributions of each systematics.
It should be noted that we could recover the OR with all the systematics, such as PSF shape components and background counts, but in practice it is advantageous to use only the subset of systematics most likely to select galaxies, to reduce computation time. The remaining systematics are either not correlated with galaxy numbers or are strongly correlated with the selected systematics. In Appendix C we run our method with all the systematics provided in the catalogue and show that the performance does not improve. In the following sections of this paper, we present and test the method for recovering the organised random from these five systematics.
3.2. The GLASS mock catalogue
In this paper, we use mock catalogues generated by the Generator for Large Scale Structure (GLASS; Tessore et al. 2023) to test and validate our method. GLASS is a public code for generating mock data for LSS surveys. It takes as input pre-calculated matter power spectra that are projected onto a sequence of shells through the light cone and generates lognormal matter density fields. Galaxy positions and shears can be generated accordingly with the additional input of redshift distributions, a galaxy bias, and an intrinsic alignment model. GLASS can also generate a Poisson random sample according to the galaxy number density in each spherical shell, which can be used as a UR sample. All the sampling is done on HEALPix spheres, which set the minimum spatial resolution of the mock.
In this work, we only need galaxy positions generated by GLASS. We generate realisations of galaxy samples and corresponding random samples to validate our method of recovering organised random samples. The input cosmology is assumed to be the fiducial cosmology. We also note that under the assumption that the selection is independent of cosmology, the cosmology used to generate mock catalogues should not affect the evaluation of our method.
4. The SOM+HC method to recover organised randoms
Figure 3 shows a flowchart of the 2PCF measurement of galaxies with variable depth. Through observation, we obtain a depleted catalogue of galaxies with systematics evaluated for each galaxy. If we follow the upper half of the flowchart by using UR to compute the 2PCF, we get a biased “UR” 2PCF. Instead, we can recover OR to correct for non-uniform selection effects and obtain the unbiased “OR” 2PCF. In this section we introduce the method that we used to generate OR, namely a combination of self-organising map (SOM) and hierarchical clustering (HC). Figure 4 summarises this “SOM+HC” method.
Fig. 3. Flowchart of the measurement of the 2PCF with reconstructed organised randoms. The colour block after the 2PCF indicates the colours shown in the following w(θ)−θ figures.
Fig. 4. Flowchart illustrating the SOM+HC method to recover OR and correct selection effects in the 2PCF. Starting from the top: 1. Input systematics; 2. SOM training: left panel: SOM grid colour-coded by galaxy number in each cell; right panel: projection of systematics vectors (small grey dots) and SOM cells (red dots) onto the plane of two systematics, with projected adjacent SOM cells connected by black lines; 3. HC output: left panel: SOM cells colour-coded according to hierarchical cluster indices; right panel: systematics vectors and weight vectors colour-coded by the corresponding cluster indices; 4. Effective areas corresponding to galaxies from each cluster; 5. Recovered OR weight map, which is used in the subsequent 2PCF measurement.
4.1. The selection effects induced by systematics
The variable depth of the galaxy catalogue can be treated as a non-uniform selection effect due to systematics. In other words, some systematics make certain galaxies more difficult to observe and thus deplete them from the catalogue. For example, poor seeing results in a large PSF size, which smooths the galaxy light profile and prevents us from observing faint galaxies. Spatially variable systematics cause spatially variable selections, hence the variable depth. Our task is to find the spatial distribution of the selection effects from these systematics and create an OR with the same selection distributions.
The selection effect of each systematic can be interpreted as the probability of removing a galaxy from the sample as a result of that systematic. If the unselected spatial distribution of galaxy numbers is N(θ), then the selected sample number will be distributed as

$$N_{\rm sel}(\theta) = P_{\rm keep}\left(\theta, \{s_a(\theta)\}\right)\, N(\theta), \tag{10}$$

where Pkeep(θ,{sa(θ)}) is the selection function describing the probability that a galaxy at θ is kept in the sample given a set of systematics {sa(θ)}. Thus Pkeep(θ) acts as a "weight" on the galaxy distribution. When using selected galaxy samples, all the "galaxy numbers" in the definitions of the correlation functions, including both data and random samples in Eqs. (7), (8), (9), should be the depleted galaxy numbers Nsel(θ). The OR is just a UR selected by Pkeep(θ).
If the systematics are uncorrelated, the overall selection function is a multiplication of the selection functions of individual systematics. One can reconstruct the total selection function by modelling the selection effect of each systematics (Rodríguez-Monroy et al. 2022). In practice, however, it is not always possible to construct reliable quantitative models to describe the effect of each systematics on the galaxy number density. In addition, systematics can be correlated (e.g. extinction in different bands, extinction and GAIA stars), making the overall selection function more complicated.
The key to recovering the ORs is to group galaxies into subsamples, each sharing a similar selection effect. Galaxies from each subsample occupy a subregion in the survey footprint that is presumably selected uniformly. Therefore, we can generate a UR in the same sky region for each group, and combine them to obtain the OR. In this work, we use a combination of hierarchical clustering (Murtagh & Contreras 2012, HC) and SOM, namely the SOM+HC method, to group the galaxies and recover the OR weight to account for the selection effect. This method does not assume formal models for selection functions and their correlations, so it is flexible enough to account for arbitrarily complex selection effects.
4.2. Self-organising map and hierarchical clustering
HC is a widely used clustering algorithm. It has the advantages of flexibility, robustness, and interpretability, but it is computationally expensive (the computational time scales with the data volume as O(N²), where N is the number of galaxies). The SOM, on the other hand, has a complexity of O(N). In this work, we combine the two algorithms: first, we group the systematics vectors into SOM cells, and then we further group the SOM cells into hierarchical clusters. After the training, the hierarchical clusters are projected back onto the sky, resulting in discrete regions occupied by galaxies with uniform systematics and selection effects. We then randomly redistribute galaxies across these regions and combine galaxies from all the clusters to form the OR. The details of the method are presented below.
SOM is an unsupervised machine-learning algorithm that maps high-dimensional vectors to cells on a two-dimensional map while preserving the topological properties of the high-dimensional vectors by faithfully maintaining the distance between these data vectors. This means that when data vectors are close together in high-dimensional space, they are mapped to the same cell or cells that are close together on the SOM grid. Therefore, SOM can be used to group data, find correlations and visualise data. Specifically, the redshift distribution of the KiDS sample we use is calibrated by the SOM algorithm: mapping data vectors in colour space to SOM, and synthesising the redshift distribution in all SOM cells containing galaxies from spectroscopic samples (Wright et al. 2020; Myles et al. 2021). In addition, Jalan et al. (2024) use the SOM technique to quantify the completeness of spectroscopic samples used for photo-z training of the KiDS-Legacy bright sample.
In this work, instead of training a SOM in colour space, we train it in the systematics space: the data vectors are Ngal (the number of galaxies) Nsys-dimensional vectors, where Nsys is the number of systematics to account for (Nsys = 5 in our fiducial case). A SOM consists of Ndim × Ndim = Ncell cells that represent the Ngal data vectors. The positions of the SOM cells in the Nsys-dimensional systematics space are called weight vectors.
For the n-th galaxy, we use Vn to denote its systematics vector and Vn, i, i = 1, 2, …, Nsys to denote its i-th component. We calculate the Euclidean distance between the weight vector of each SOM cell, Wa, a = 1, 2, …, Ncell, and the systematics vector as

$$d_{na} = \sqrt{\sum_{i=1}^{N_{\rm sys}} \left(V_{n,i} - W_{a,i}\right)^2}. \tag{11}$$

For each data vector, the representing weight vector is chosen as the one with minimum dna; the corresponding SOM cell is termed the "best matching unit" (BMU).
By definition, data vectors that share the same BMU are clustered in the systematics space. Training the SOM means iteratively updating Wa so that clustered data vectors are mapped to the same or nearby BMUs. In each step, the weight vectors are brought closer to the current data vector, with cells far from the BMU moving much less than those around it. To achieve this, a "neighbourhood function" Σ(a, σ) is defined as a function of the grid distance between a cell and the BMU (not the distance between weight vectors): Σ is close to 1 when the cell is close to the BMU and drops to zero outside a typical width σ. In each epoch, the optimisation iterates over the whole galaxy sample.
The training steps are summarised as follows:

1. Initialise the SOM by setting the initial positions of the Ncell weight vectors along the two leading principal component analysis (PCA) components. This is equivalent to initialising the weight vectors on the 2-dimensional subspace spanned by the first two eigenvectors of the correlation matrix of the data vectors.

2. In each epoch, perform the following steps iteratively until all the galaxies have been visited:

   (i) Choose a data vector, calculate its distance to all the weight vectors, and find the BMU;

   (ii) Update the weight vectors as

   $$W_a(t+1) = W_a(t) + L(t)\, \Sigma(a, \sigma)\, \left[V(t) - W_a(t)\right], \tag{12}$$

   where t denotes a time-step (i.e. the presentation of a new data vector to the SOM), L is the learning rate specifying how fast the weight vectors approach V(t) in each step, and Σ is the neighbourhood function;

   (iii) Choose another galaxy and repeat the update of Eq. (12) until all the galaxies have been iterated over.

3. Perform the iteration in Step 2 for several training epochs. During the first few epochs, σ is roughly the size of the SOM, meaning that almost all the cells are updated. It decreases through the epochs so that only cells close to the BMU are updated. The learning rate also decreases to prevent "jumping over the minimum". The training stops when the weight vectors converge.
Here, we use hexagonal SOM cells, which means that each cell has six neighbouring cells; the SOM is then smoother in the systematics space than with rectangular cells. We choose a toroidal topology for the SOM, meaning that the top and bottom boundaries, and the left and right boundaries, are adjacent; this prevents edge effects in the training. The neighbourhood function is a Gaussian with σ initially equal to half the dimension of the SOM, decreasing linearly to 1 by the last iteration of each epoch. The initial learning rate is 0.1 and decreases linearly to 0.01 in the last iteration of each epoch. We notice that the weight vectors barely update after five epochs, so we train the SOM for 10 epochs to ensure convergence.
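The training loop described above can be sketched in NumPy. This is a simplified illustration only (rectangular toroidal grid instead of hexagonal cells, random instead of PCA initialisation, and idealised decay schedules); it is not the implementation used in this work.

```python
import numpy as np

def train_som(data, n_dim=10, epochs=10, l0=0.1, l1=0.01, seed=0):
    """Toy SOM trainer implementing the update of Eq. (12).

    data  : (N_gal, N_sys) array of systematics vectors
    n_dim : the SOM has n_dim x n_dim cells on a toroidal rectangular grid
    Returns the (n_dim*n_dim, N_sys) weight vectors.
    """
    rng = np.random.default_rng(seed)
    n_cell = n_dim * n_dim
    # random initialisation from the data (the paper uses a PCA initialisation)
    W = data[rng.choice(len(data), n_cell)].astype(float)
    gy, gx = np.divmod(np.arange(n_cell), n_dim)  # grid coordinates of cells
    for epoch in range(epochs):
        sigma0 = n_dim / 2
        order = rng.permutation(len(data))
        for t, n in enumerate(order):
            frac = t / max(len(order) - 1, 1)
            L = l0 + (l1 - l0) * frac              # learning rate decays linearly
            sigma = max(sigma0 * (1 - frac), 1.0)  # neighbourhood width shrinks to 1
            V = data[n]
            d = np.linalg.norm(W - V, axis=1)      # Eq. (11): distance to all cells
            bmu = np.argmin(d)                     # best matching unit
            # toroidal grid distance between each cell and the BMU
            dx = np.abs(gx - gx[bmu]); dx = np.minimum(dx, n_dim - dx)
            dy = np.abs(gy - gy[bmu]); dy = np.minimum(dy, n_dim - dy)
            Sigma = np.exp(-(dx**2 + dy**2) / (2 * sigma**2))
            W += L * Sigma[:, None] * (V - W)      # Eq. (12) update
    return W
```

Because the neighbourhood function couples nearby cells, adjacent cells end up with similar weight vectors, which is the topology-preserving property exploited later by the hierarchical clustering step.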
After training, each systematic vector is represented by its BMU in the SOM. Since SOMs preserve the topological structure of the systematic vectors, they properly take into account the correlations between systematics. Galaxies that are clustered in the systematics space will be mapped into the same SOM cell, or the cells that are close both in the SOM grids and in the systematics space.
After training the SOM, we further group galaxies by running HC on the weight vectors according to the distance between them. Since the distance between the weight vectors reflects the distance between the systematics vectors by construction, each cluster represents a group of galaxies with similar systematics, just like each SOM cell, but with more galaxies, which reduces the sample variance. In this work, we use "agglomerative clustering", a bottom-up clustering method. The procedure can be briefly described as follows:
1. Treat each SOM cell as a cluster (so we have Ncell clusters at the beginning);

2. Iteratively merge the two clusters separated by the shortest distance, as specified by the complete-linkage criterion;

3. Continue merging hierarchically until only one cluster remains, and construct the dendrogram of the whole clustering process;

4. Cut the dendrogram where the cells are merged into the desired number of clusters (we use NC to denote the number of clusters hereafter). See Fig. 5 for an example of a dendrogram.
Fig. 5. Example of a dendrogram showing the clustering of 900 SOM cells into 20 HCs. The cells are clustered from the bottom to the top according to their Euclidean distance in the systematics space. The black dashed line shows the distance threshold at which the cells are grouped into 20 clusters. SOM cells belonging to the same cluster are colour-coded with the same colour.
By construction, a galaxy cluster in the systematics space represents a particular combination of systematics. If the galaxy distribution in systematics space does not depend on the LSS of the Universe (we validate this assumption in Sect. 5.2.2), we can assume that the number of galaxies in each cluster is determined by a uniform selection probability given by the synthetic systematics represented by that cluster. After being grouped into clusters through SOM+HC, the galaxies from one cluster are distributed into discrete regions in the survey footprint and are assumed to be uniformly depleted by the associated combination of systematics. The detailed process of generating OR from clustered galaxies is presented in the next subsection.
The number of hierarchical clusters NC is an important parameter. If NC is too large, there will be higher sampling noise in each cluster. In addition, high NC will cause over-fitting of the OR, resulting in over-correction in the 2PCF. If NC is too small, the resolution in the systematics space will be too low to detect systematics variability. Therefore, we evaluate the NC value that optimises the trade-off between resolution and sampling noise in Sect. 5.
In this work, we use the SOMOCLU package (Wittek et al. 2017) to train the SOM. Hierarchical clustering is then performed via the AgglomerativeClustering class from sklearn.cluster package (for a more detailed technical explanation of the algorithm, see Müllner 2011). In the following text, we use “SOM+HC” to denote the procedure to create an OR with a combination of SOM and hierarchical clustering.
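As a concrete sketch of the HC stage, the complete-linkage clustering of trained SOM weight vectors can be run with scikit-learn's AgglomerativeClustering as named above. The weight vectors and BMU indices below are random stand-ins, not KiDS quantities:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(42)
# stand-in for trained SOM weight vectors: 900 cells in a 5-D systematics space
weights = rng.random((900, 5))

# complete-linkage agglomerative clustering, cutting the dendrogram at N_C = 20
hc = AgglomerativeClustering(n_clusters=20, linkage="complete")
cell_cluster = hc.fit_predict(weights)   # hierarchical-cluster index per SOM cell

# each galaxy inherits the cluster index of its BMU cell
bmu = rng.integers(0, 900, size=10_000)  # stand-in BMU indices for 10k galaxies
gal_cluster = cell_cluster[bmu]
```

Running HC on the Ncell weight vectors rather than the Ngal galaxies is what keeps the O(N²) cost of the linkage step affordable, since Ncell ≪ Ngal.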
4.3. Reconstructing organised randoms
The basic idea of constructing an OR is to first find the effective sky region containing galaxies from each cluster. Galaxies in this region can be treated as uniformly selected according to the specific synthetic systematics of that cluster. We then generate a uniform random sample in the region according to the mean number density of that cluster, and add up the URs corresponding to all the clusters to get the overall OR. It would be desirable to obtain an OR catalogue (i.e. coordinates of points within the observation footprint whose spatial distribution strictly follows the depletion distribution). However, with a finite number of hierarchical clusters, we cannot achieve such a resolution. Instead, we generate pixelised OR maps of the sky: we aim for an OR weight in sky pixels that reflects the selection function for galaxies in each pixel (Pkeep). When measuring the 2PCF, we use it to calculate the RR term in Eq. (9). In this work, we use the HEALPix scheme to pixelise the sky.
The OR is constructed as follows: after identifying the galaxy clusters in the systematics space via SOM+HC, we first find Npi, the number of galaxies from the i-th cluster in the p-th pixel of the sky; we note that a pixel can contain galaxies from different clusters. For a given cluster index i, the Npi occupy only discrete sub-regions of the footprint that contain galaxies lying close together in the high-dimensional systematics space. We calculate the effective area of the p-th pixel corresponding to the i-th cluster as

$$A_p^i = \frac{N_p^i}{N_p}\, A_p, \tag{13}$$

where Np is the total number of galaxies in that pixel and Ap is the area of the p-th pixel in the footprint. If we assume that the SOM+HC grouping is not affected by the underlying LSS, then Npi and Np carry the same LSS contribution, which cancels in the ratio. Thus Api is the area of the p-th pixel occupied by galaxies subject to the uniform selection effect of the i-th cluster.
Now we add up Api across pixels to get the total effective area for each cluster,

$$A^i = \sum_p A_p^i, \tag{14}$$
and the effective surface number density of the i-th cluster,

$$n_i = \frac{N^i}{A^i}, \qquad N^i \equiv \sum_p N_p^i. \tag{15}$$
The average number of galaxies from the i-th cluster in the p-th pixel is niApi. We could uniformly sample this number of galaxies within the occupied region of the i-th cluster and combine them across clusters to get an OR catalogue. However, as mentioned before, for a pixelised sky it is more efficient to construct the OR weight by simply adding up niApi across clusters:

$$\mathcal{W}_p = \sum_i n_i A_p^i. \tag{16}$$
The OR weight 𝒲p is an estimate of Pkeep, so it is used as the R terms in Eq. (7) to measure the unbiased 2PCF. We note that we use the calligraphic font 𝒲 for the OR weight to distinguish from the 2PCF w. Figure 4 summarises the SOM+HC method in a flowchart.
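The per-pixel bookkeeping of this subsection (effective areas Api, total areas Ai, densities ni, and the OR weight 𝒲p) can be sketched with NumPy count arrays; the function name and flat area array are illustrative, not from the paper's code.

```python
import numpy as np

def or_weight(pix, cluster, area_p, n_pix, n_clusters):
    """OR weight per pixel from per-galaxy pixel and cluster indices.

    pix, cluster : per-galaxy HEALPix pixel index and HC cluster index
    area_p       : (n_pix,) area of each pixel within the footprint
    Implements A_p^i = (N_p^i / N_p) A_p,  A^i = sum_p A_p^i,
               n_i = N^i / A^i,  and  W_p = sum_i n_i A_p^i.
    """
    # N_p^i: galaxy counts per (pixel, cluster); N_p: galaxy counts per pixel
    Npi = np.zeros((n_pix, n_clusters))
    np.add.at(Npi, (pix, cluster), 1)
    Np = Npi.sum(axis=1)
    with np.errstate(invalid="ignore", divide="ignore"):
        Api = np.where(Np[:, None] > 0, Npi / Np[:, None], 0.0) * area_p[:, None]
    Ai = Api.sum(axis=0)                              # effective area per cluster
    Ni = Npi.sum(axis=0)                              # galaxy count per cluster
    ni = np.divide(Ni, Ai, out=np.zeros_like(Ai), where=Ai > 0)
    return (ni[None, :] * Api).sum(axis=1)            # W_p per pixel
```

For a single cluster this reduces to a constant weight equal to the cluster's mean density times the pixel area, as expected for a uniformly selected sample.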
The left panel of Fig. 6 shows the KiDS-Legacy galaxy number fluctuation colour-coded by galaxy number per pixel. The holes in this map are the masked regions. From the left panel, one can notice an obvious tile-based variable depth in the galaxy map by eye. The right panel shows the organised random weight map constructed with a 30 × 30 SOM, grouped into 400 hierarchical clusters. Both maps are pixelised with Nside = 2048. Visually, it is clear that the OR weight map also shows a tile-based pattern that roughly matches the pattern on the galaxy number map.
Fig. 6. Left panel: Galaxy number per pixel in a subregion of the KiDS-Legacy footprint. Right panel: OR weight in the same region.
Pixel size is another important parameter affecting the performance of the OR reconstruction. If the pixel size is too large, we lose spatial resolution of the variable depth, resulting in an overestimated 2PCF with residual variable-depth correlations. If the pixel size is too small, the organised random weight map becomes too sparse, with many unintentionally masked pixels, since the organised random weight is only non-zero in pixels containing galaxies. The organised random then correlates with the LSS, and if we correct with it, we remove LSS power from the 2PCF and underestimate it. We therefore need to choose the pixel size carefully to balance the trade-off between resolution and over-correction. For the HEALPix scheme used in this paper, the pixel size is determined by the Nside parameter of the map. Nside can only be a power of 2, and doubling Nside halves the pixel size. Our baseline analysis uses Nside = 2048, corresponding to an angular size of 1.7 arcmin. We validate this choice below.
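The quoted pixel scale can be checked from the HEALPix pixel count alone: a sphere has 12 Nside² equal-area pixels, and a conventional pixel size is the square root of the pixel area (to our understanding, the same convention as healpy's nside2resol). A minimal sketch:

```python
import numpy as np

def healpix_pixel_size_arcmin(nside):
    """Approximate HEALPix pixel size as the square root of the pixel area.

    A HEALPix sphere is tiled by 12 * nside**2 equal-area pixels.
    """
    area_sr = 4 * np.pi / (12 * nside**2)     # pixel area in steradians
    return np.degrees(np.sqrt(area_sr)) * 60  # side length in arcmin
```

For Nside = 2048 this gives about 1.7 arcmin, matching the value quoted above, and doubling Nside halves the result.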
5. Validation of the method
In this section we validate the SOM+HC method with two case studies. In Sect. 5.1, we recover the OR for mock galaxy samples selected by systematics with simple spatial distributions and selection functions, while in Sect. 5.2, we recover the OR for mock galaxy samples selected by realistic systematics distributions and data-driven selection functions. We evaluate the method by calculating the bias in the OR-corrected 2PCF.
5.1. Testing the organised random with toy systematics model
In this subsection, we present a validation test of the SOM+HC method on mock galaxy samples with variable depth induced by simple depletion functions of anisotropic systematics. Figure 8 shows a flowchart of this test. The mock galaxy sample is generated by the GLASS package within a rectangular sky footprint with 0° < RA < 100°, −5° < Dec < 5°, under the fiducial cosmology and with a Gaussian redshift distribution centred at zmean = 0.3 with a standard deviation of 0.1. Before assigning the systematics and applying the depletions, a "point-source" mask is applied to mimic the masking in a real observation. The point-source mask is a high-resolution (Nside = 8192, corresponding to a pixel size of 0.4 arcmin) binary map, in which circular holes (pixels set to zero) with angular sizes of 5−15 arcmin are placed at random positions. Galaxies that fall in the holes are removed from the mock catalogue.
We then divide the footprint into 1000 1° ×1° tiles. For each galaxy, we assign four systematics with different spatial distributions:
-
Systematics A1 varies discretely across tiles but is constant within them, mimicking per-exposure effects such as limiting depth variations (e.g. background level, PSF size) resulting from the use of a step-and-stare observing strategy.
-
Systematics A2 follows the same prescription as Systematics A1, but is a different random realisation.
-
Systematics B mimics telescope and camera effects such as PSF shape variations over the focal plane, so in each tile it depends roughly on the angular distance to the tile centre. We take 2-dimensional Gaussian functions in each tile as the Systematics B distribution. The centre of the Gaussian is close to each tile centre with a small random jitter; the covariances are close to diagonal (hence the Gaussian has small ellipticities).
-
Systematics C varies smoothly over large angles. We model it as a Gaussian function of Galactic latitude. It mimics large-scale variations such as the Galactic foreground.
These four systematics correspond to type A (Systematics A1 and A2), type B (Systematics B), and type C (Systematics C) systematics introduced in Sect. 3. Their values are normalised to lie between 0 and 1. Since SOM+HC does not change the topology of the galaxy distribution in systematics space, this normalisation does not affect the performance of the method.
We define simple selection functions for Systematics A1 (a linear function), Systematics B (a quadratic function), and Systematics C (a trigonometric function modulated by a linear function). Systematics A2 has no selection effect and thus acts as a “distractor” for the SOM. We show the systematics map and the associated selection function in Fig. 7. We also assume that the selections between the systematics are independent so that the overall selection function is a multiplication of the selection functions of the individual systematics. To make the selection, we generate a uniform random number between 0 and 1 for each galaxy, and if the random number is smaller than the selection function Pkeep(θ) at its position, the galaxy is kept in the catalogue. After selection, we have a galaxy catalogue with an angular number density of ∼1 arcmin−2.
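The Monte Carlo depletion described above can be sketched as follows. The functional forms below are illustrative stand-ins for the actual selection functions shown in Fig. 7, which are not reproduced here; only the structure (linear, quadratic, linearly modulated trigonometric, and the multiplicative combination) follows the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n_gal = 100_000

# toy systematics values per galaxy, each normalised to [0, 1]
sysA1 = rng.random(n_gal)
sysB = rng.random(n_gal)
sysC = rng.random(n_gal)

# illustrative stand-ins for the three selection functions of Sect. 5.1
p_A1 = 0.5 + 0.4 * sysA1                                  # linear
p_B = 1.0 - 0.5 * sysB**2                                 # quadratic
p_C = (0.7 + 0.3 * sysC) * (0.8 + 0.2 * np.cos(2 * np.pi * sysC))  # modulated trig

# independent selections multiply; keep a galaxy if a uniform draw is below P_keep
p_keep = p_A1 * p_B * p_C
kept = rng.random(n_gal) < p_keep
```

Systematics A2 would simply not appear in p_keep, which is exactly what makes it a "distractor" for the SOM.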
Fig. 7. Left panels: Spatial distribution of the four toy systematics. Systematics A1 and A2 are uniform in each tile but differ across tiles (type A); Systematics B varies within each tile (type B); Systematics C is tile-independent (type C). Right panels: Black curves in the colour bars show the selection function of each systematics, and the shaded regions are the normalised distributions of the systematics. The numbers on the right show the selection rate values.
Fig. 8. Flowchart of the toy model validation of the SOM+HC method. We note that the UR case (w(θ) calculated with the depleted mock catalogue and mock UR) is not shown in this figure.
We also generate a “true” OR catalogue by applying the same selection functions to a UR catalogue in the same footprint. If we use it as the R terms for the 2PCF measurement with the depleted mock catalogue, we should get an unbiased w(θ). We note that in reality, we do not have access to the true OR. In this validation test, we recover the OR with a 30 × 30 SOM with hexagonal cells and a toroidal topology (so that the horizontal and vertical edges are adjacent), grouped into 200 hierarchical clusters. To check whether the SOM+HC recovers the input selection functions, we plot the relationship between galaxy number contrast (relative difference between galaxy number in each cluster and the mean galaxy number across clusters) and median systematics values of each cluster in Fig. 9. The blue curves are the number contrast derived from the input selection rates (the black curves in the colour bars of Fig. 7). The median systematics and number contrast for each point are calculated by first sorting the median systematics for all clusters in a realisation, then averaging the median systematics and number contrasts across realisations. Standard errors are also calculated and presented as error bars. The errors of the median systematics are too small to be visible. The galaxy number contrasts of the HCs generally follow the input selection rate, indicating that SOM+HC is able to recover the input selection functions of each systematics. Notably, SOM+HC will not introduce additional correlation for Systematics A2, which does not select galaxies. We also note that the number contrast slightly mismatches the selection functions at the edges. This is probably because the SOM is less effective closer to the boundary in the systematics space, where there are fewer neighbouring galaxies.
Fig. 9. Relationship between the galaxy number contrast and the mean systematics value of each hierarchical cluster. The blue curves are the number contrast derived from the input selection rates (black curves in the colour bars of Fig. 7). The average median systematics and number contrast are calculated by averaging the sorted values in each realisation across all the realisations. The standard errors are also calculated and presented as error bars. The errors of the median systematics are too small to be visible.
The recovered OR weight map is pixelised into a HEALPix map with Nside = 1024. The true OR weight map, the recovered OR weight map, and their relative difference are shown in Fig. 10. Visually, the recovered OR shows the same spatial pattern as the true OR, but it appears more discrete. This is due to the finite number of clusters in the systematics space (in other words, if we had an infinite number of galaxies grouped into an infinite number of clusters, we would recover the smooth OR weights). The relative difference (the bottom panel of Fig. 10) is well within ∼±20%.
Fig. 10. Top panel: True OR weights (normalised by their mean) calculated from the total selection function of the toy systematics; middle panel: Recovered OR weights generated by the SOM+HC method. Both panels show only part of the footprint; we note that the holes in the maps are masked regions around point sources. Bottom panel: Relative difference between the recovered OR weights and the true OR weights.
In summary, we have an unselected mock galaxy catalogue and a selected catalogue, plus a uniform random, a true organised random, and a recovered organised random. To evaluate the SOM+HC method quantitatively, we measure four w(θ)'s from them; they are summarised in Table 2.

Information for the four w(θ) measured from the mock sample.

We validate the method by comparing the "Recovered OR" and the "No selection" 2PCFs. The 2PCF is measured in 20 angular bins between 2.5 and 250 arcmin (following DeRose et al. 2022) with the TREECORR package (Jarvis 2015). To evaluate the covariance matrix and reduce the sample variance, we perform the above validation on 40 GLASS realisations and evaluate the covariance as

$$C_{jk} = \frac{1}{N_{\rm r} - 1} \sum_{r=1}^{N_{\rm r}} \left[w_r(\theta_j) - \bar{w}(\theta_j)\right] \left[w_r(\theta_k) - \bar{w}(\theta_k)\right], \tag{17}$$

where the subscript r denotes the realisation number, Nr = 40, and w̄ is the average 2PCF over the realisations. We also compute a theoretical covariance matrix using the ONECOVARIANCE code (Reischke et al. 2024) and find that it is consistent with the "No selection" covariance matrix (see Appendix A for a comparison). We evaluate the bias of the 2PCFs by calculating the χ² between each w(θ) and the "No selection" w(θ),

$$\chi_d^2 = \Delta w^{\rm T}\, C^{-1}\, \Delta w, \tag{18}$$

where Δw is the difference between the 2PCF in question and the "No selection" 2PCF. Assuming χd² follows a χ² distribution with degrees of freedom equal to the number of angular bins taken into account, we can calculate the corresponding probability-to-exceed (PTE) value. The 2PCF is less biased if the PTE value is closer to 1.
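The covariance estimate from realisations and the χ²/PTE evaluation can be sketched as follows (assuming SciPy for the χ² survival function; the function name and inputs are illustrative):

```python
import numpy as np
from scipy.stats import chi2

def chi2_pte(w_runs, w_ref_runs, mask):
    """Sample covariance over realisations, chi^2 against a reference, and PTE.

    w_runs, w_ref_runs : (N_real, N_theta) 2PCF measurements per realisation
    mask               : boolean mask selecting the angular bins used
    """
    # unbiased sample covariance over the N_real reference realisations
    cov = np.cov(w_ref_runs[:, mask], rowvar=False)
    # difference of the mean 2PCFs on the selected scales
    dw = (w_runs.mean(axis=0) - w_ref_runs.mean(axis=0))[mask]
    chi2_d = dw @ np.linalg.solve(cov, dw)
    pte = chi2.sf(chi2_d, df=mask.sum())  # probability-to-exceed
    return chi2_d, pte
```

An identical measurement trivially gives χd² = 0 and PTE = 1; a strongly biased measurement gives a large χd² and a PTE near 0.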
The measured 2PCFs and their relative differences to the "No selection" case are shown in Fig. 11. In this work, we focus on linear scales, so we calculate χd² only on angular scales θ < θcut, where the cutting scale θcut is the angular scale corresponding to a physical scale of 8 h−1 Mpc at the mean redshift of the mock sample (z̄ = 0.3 in our case), which has a value of 42.74 arcmin. For the UR case, χd² = 1616, meaning a huge bias if we use a uniform random for the depleted catalogue. For the true OR case, χd² = 0.08 (corresponding to a PTE of 0.99999987), meaning that the true organised random gives an unbiased w(θ), as expected. For the "Recovered OR" case, χd² = 0.24 (corresponding to a PTE of 0.999992). This means that the OR recovered by the SOM+HC method also gives an unbiased w(θ). Johnston et al. (2021) have shown that SOM+HC can correct a slight variable-depth bias in the KiDS-Bright sample; notably, in this test, SOM+HC corrects a w(θ) that is biased by orders of magnitude. It should also be noted that the wiggle at small scales is due to pixelisation, which changes the angular distance between galaxies as they are effectively moved to the centres of the pixels. This effect particularly affects scales close to the pixel size (3.4 arcmin in our case).
Fig. 11. Top panel: Measured w(θ) in the toy-systematics test. The definitions of the four w(θ) are presented in Table 2. The data points are the mean w(θ) from 40 realisations, and the error bars are the diagonal elements of the covariance matrices evaluated from the realisations. The black curve is the theoretical w(θ) computed with PYCCL (Chisari et al. 2019) using the same cosmology and redshift distribution. The shaded region shows angular scales smaller than 8 h−1 Mpc evaluated at the mean redshift. The middle panel shows the relative bias of each w(θ) with respect to the No selection case, and the bottom panel shows the w(θ) bias relative to the error. Most points of the UR case are drastically biased and lie outside the range of the middle and bottom panels.
Fig. 12. Flowchart of the data-driven test. |
5.2. Validation with data-driven systematics
5.2.1. Introduction to the method
The previous test with simulated systematics and depletions shows that SOM+HC can correct for spatially variable depth, but it is oversimplified and not realistic. In reality, the spatial distribution of systematics can be more complicated, stochastic, and correlated, and the selection function can be arbitrary. It is therefore difficult to create mocks with realistic variable depth to test the SOM+HC method. In addition, the SOM+HC method might over-correct the 2PCF by producing organised randoms “contaminated” by the LSS.
To tackle this problem, we apply “data-driven systematics” to the mock catalogue following Johnston et al. (2021). First, we generate a mock galaxy catalogue following the fiducial cosmology given in Sect. 1 with a galaxy bias b = 1. The galaxy sample follows the same calibrated redshift distribution of the KiDS-Legacy sample15 and is generated within the KiDS-Legacy footprint; then we generate the “data-driven mock systematics” by assigning systematics values to each mock galaxy with a nearest neighbour interpolation from the real KiDS systematics. Thus, we get a mock galaxy sample with the same spatial distributions of systematics as the real data. We note that even if the spatial distributions of the systematics are correlated with the LSS, this step removes any such correlation because the mock galaxy catalogue has a different spatial distribution. Next, we train a SOM+HC on the real galaxy catalogue (we call it “SOM+HC+KiDS”) and get an OR weight map 𝒲KiDS as a proxy of the realistic variable depth (we call it the “data-driven OR weight”); then we select mock galaxies according to it. We define the “OR contrast” as
δOR, KiDS = 𝒲KiDS / ⟨𝒲KiDS⟩ − 1,(19)

where ⟨𝒲KiDS⟩ is the average OR weight across the footprint. In practice, we generate a large mock catalogue and select galaxies according to the selection function given by δOR, KiDS:

Pkeep = (Nout / Nin) (1 + δOR, KiDS).(20)
Here Nout is the desired galaxy number of the depleted mock sample. In this work we choose Nout = 49 875 861, the number of galaxies in the KiDS-Legacy catalogue. The galaxy number of the input, unselected mock sample, Nin, is chosen to ensure Pkeep ≤ 1. The selection procedure is the same as described in Sect. 5.1: for each galaxy we generate a uniform random number between 0 and 1 and compare it with the Pkeep of the pixel containing the galaxy. If the random number is less than Pkeep, we keep the galaxy in the sample; otherwise, we discard it.
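A minimal NumPy sketch of this selection step, with a toy weight map standing in for the SOM+HC output (all numbers below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy OR weight per pixel; in the paper this role is played by the
# data-driven weight W_KiDS produced by SOM+HC on a HEALPix map.
n_pix = 1_000
weight = rng.uniform(0.5, 1.5, size=n_pix)

# OR contrast: delta_OR = W / <W> - 1.
delta_or = weight / weight.mean() - 1.0

# Mock galaxies, each labelled by the pixel that contains it.
n_in, n_out = 200_000, 50_000          # input size and target depleted size
pix_of_gal = rng.integers(0, n_pix, size=n_in)

# Keep probability per galaxy; n_in is large enough that P_keep <= 1.
p_keep = (n_out / n_in) * (1.0 + delta_or[pix_of_gal])

# Rejection step: keep a galaxy if a uniform draw falls below P_keep.
keep = rng.uniform(size=n_in) < p_keep
n_selected = int(keep.sum())
```

On average the selected sample contains Nout galaxies, with proportionally more galaxies surviving in pixels of higher weight.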
The assignment of mock systematics with nearest-neighbour interpolation ensures that the spatial distribution of the mock systematics is realistic. If SOM+HC recovers an unbiased OR weight on the real data, the data-driven OR weight should represent the realistic variable depth. When applied as a selection function to the mock data, it should produce the variable depth caused by the realistic selection function of the mock systematics. We run SOM+HC on the post-selected mock catalogue with the mock systematics to test this consistency. Ideally, the resulting OR weight should match the imported selections (i.e. the data-driven OR weight), and we expect to measure an unbiased 2PCF from the selected mock catalogue corrected with the recovered mock OR. Figure 12 shows the flowchart of the data-driven systematics test.
With this test we can also check whether the recovered OR weight contains the LSS. If so, the data-driven OR weight would have an imprint of the LSS from our real Universe, while the mock OR weight should have the imprint of the mock LSS. These two OR weights would not match, and the mock 2PCF corrected by the mock OR weight should be over-corrected.
Several configuration parameters affect the accuracy of the recovered OR weight. In particular, the SOM+HC procedure depends on NC, the number of hierarchical clusters, and Nside, the OR weight resolution. If NC and/or Nside is too low, SOM+HC fails to resolve systematic clustering in systematics/real space, leaving the variable depth uncorrected. If NC is too high, SOM+HC starts to resolve LSS-induced clustering in systematics space. If Nside is so high that a significant fraction of pixels contain no galaxies, the OR weight will follow the LSS. In this section, we aim to perform data-driven systematics tests to find the optimal {NC, Nside}. We note that the data-driven systematics test depends on these parameters for both the data-driven and the recovered mock OR. In the following discussion, we use {NCKiDS, NsideKiDS} to denote the set-up for the data-driven OR, and {NCrec, Nsiderec} for the recovered mock OR. To avoid endless tests of different parameter combinations, we assume that the same SOM+HC set-up, with the same systematics choices, performs equally well (or badly) on the real and mock data. With this assumption in mind, we can evaluate the performance of a {NCKiDS, NsideKiDS} combination based on the recovered OR 2PCF with NCKiDS = NCrec and NsideKiDS = Nsiderec:
-
If the recovered OR 2PCF is generally higher than the “No selection” case, the {NCKiDS, NsideKiDS} choice is likely to under-estimate the variable depth;
-
If the recovered OR 2PCF is significantly lower than the “No selection” case, the {NCKiDS, NsideKiDS} option is likely to over-correct the variable depth by removing the LSS;
-
If the recovered OR 2PCF agrees with the “No selection” case, the choice may be optimal. It is also possible, however, that the data-driven OR under-estimates the variable depth and induces overly mild selections, which can then be corrected with the same {NCrec, Nsiderec} combination, while the realistic variable depth actually requires higher {NCKiDS, NsideKiDS}. For the latter case, consider the extreme choice NCKiDS = NCrec = 1: this is identical to “No selection” and gives an unbiased 2PCF, yet a single cluster is clearly too few for real data with complex selection effects.
To choose Nside, instead of calculating the 2PCF with varying Nside, we consider the fraction of unmasked empty pixels (i.e. the pixels in the footprint that do not contain any galaxy). If this fraction is large, the OR weight will trace the LSS pattern (one can imagine the extreme case of a pixel size so small that each pixel contains either one galaxy or none; the OR weight map is then exactly the galaxy map, and no 2PCF signal is measured with it). We find that for Nside ≤ 2048 the unmasked empty pixels make up less than 0.5% of the footprint, while for Nside = 4096 the fraction increases to 2%. Therefore, in this test we fix NsideKiDS = Nsiderec = 2048 and only vary NC.
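The empty-pixel diagnostic reduces to a simple count. Below is a hedged sketch with plain NumPy, in which integer pixel indices stand in for a HEALPix map:

```python
import numpy as np

def empty_pixel_fraction(pix_of_gal, footprint_pixels):
    """Fraction of footprint (unmasked) pixels containing zero galaxies."""
    counts = np.bincount(pix_of_gal, minlength=footprint_pixels.max() + 1)
    return float(np.mean(counts[footprint_pixels] == 0))

rng = np.random.default_rng(1)
footprint = np.arange(10_000)                    # toy footprint pixel list

# Dense sampling (10 galaxies per pixel on average): almost no empty pixels.
frac_dense = empty_pixel_fraction(rng.integers(0, 10_000, 100_000), footprint)

# Sparse sampling (0.5 galaxies per pixel): a large empty fraction, so the
# OR weight would start tracing the galaxy field itself.
frac_sparse = empty_pixel_fraction(rng.integers(0, 10_000, 5_000), footprint)
```

For a mean of μ galaxies per pixel, Poisson statistics predict an empty fraction of roughly exp(−μ), which is what drives the choice of Nside for a given number density.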
With these assumptions and simplifications in mind, we find the optimal NCKiDS by first choosing the same value for NCKiDS and NCrec, for the data-driven OR and the recovered mock OR respectively. We vary the cluster number from low to high until it is high enough to pick up the LSS and the recovered 2PCF starts to be systematically lower than the unbiased case. On the other hand, if a certain choice of NC gives an unbiased mock 2PCF, it is still possible that NCKiDS is too low to recover the selection function, which results in a soft data-driven OR map that can be recovered with the same NCrec = NCKiDS. To test if this is the case, we can manually amplify the variance of the data-driven selection function that applies to the mock by

δOR, KiDS(m) = (𝒲KiDS)^m / ⟨(𝒲KiDS)^m⟩ − 1.(21)

We note that (𝒲KiDS)^m is 𝒲KiDS to the power of m. When the adjustment parameter m > 1, the variance in δOR, KiDS(m) is enhanced. If the mock OR can still recover an unbiased 2PCF, then we can conclude that this set-up is powerful enough to recover the realistic variable depth. In this test, we choose m = 1.5, so that the bias in the 2PCF with URs is almost doubled.
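The effect of the m parameter can be checked directly: raising the weight map to a power m > 1 widens the distribution of the contrast. A small NumPy demonstration with a toy weight map:

```python
import numpy as np

def or_contrast(weight):
    """OR contrast of a weight map: delta = W / <W> - 1."""
    return weight / weight.mean() - 1.0

rng = np.random.default_rng(0)
w = rng.uniform(0.5, 1.5, size=100_000)   # toy OR weight map

delta = or_contrast(w)            # original selection contrast
delta_m = or_contrast(w ** 1.5)   # enhanced selection with m = 1.5

# var(delta_m) > var(delta): the enhanced selection is stronger.
```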
If we find the NCKiDS so that the “Recovered OR” 2PCF is also unbiased for the enhanced data-driven selection, we can fix it for the data-driven OR weight. Based on our previous discussion, we can treat the data-driven OR weight with this NCKiDS as a realistic selection function. With this data-driven selection, we then vary NCrec for the mock OR and evaluate the bias of the resulting 2PCF.
5.2.2. Evaluating the performance of SOM+HC with data-driven systematics
Figure 13 shows an example of the true data-driven OR (top panel), the associated recovered OR (middle panel), and the relative difference between them (bottom panel). Both OR weights are generated with 600 hierarchical clusters and pixelised on a HEALPix map with Nside = 2048. Visually they look very similar, and the relative difference lies well within ∼ ± 20%.
Fig. 13. Upper panel: Part of the data-driven OR weight generated from the KiDS-Legacy map. Middle panel: Mock OR weight recovered from the GLASS mock galaxy sample selected according to the data-driven OR. Bottom panel: Relative difference between the recovered and true OR weights. Both OR weights are generated with 600 hierarchical clusters and pixelised on a HEALPix map with Nside = 2048.
Following Sect. 5.1, we quantitatively evaluate the effectiveness of SOM+HC by comparing the 2PCF from the depleted mock catalogue corrected by the recovered mock OR (the “Rec. OR w(θ)”) and the “No selection” 2PCF (measured from a complete catalogue with the same number of galaxies, corrected by a uniform random). Since the “No selection” 2PCF is unbiased, it serves as the reference 2PCF. The consistency between these two 2PCFs indicates the consistency between the recovered mock OR and the data-driven OR (not necessarily between the recovered mock OR and the realistic selections). The angular separation is binned into 20 logarithmic bins from 2.5 arcmin to 250 arcmin. The covariance matrices are again calculated by the ONECOVARIANCE code with the same cosmology, redshift distribution, sky coverage and galaxy number density as the GLASS mock (see Fig. 14).
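The binning scheme above corresponds to geometrically spaced bin edges; a one-line construction (assuming edges at exactly 2.5 and 250 arcmin):

```python
import numpy as np

# 20 logarithmic bins between 2.5 and 250 arcmin -> 21 edges, equal in log.
edges = np.geomspace(2.5, 250.0, num=21)
centres = np.sqrt(edges[:-1] * edges[1:])   # geometric bin centres
```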
Fig. 14. 2PCFs measured for the data-driven systematics test with the same choices of NCKiDS (the number of hierarchical clusters for the data-driven OR) and NCrec (for the recovered mock OR). Each column of panels corresponds to one set-up. Nside is fixed at 2048 for both the data-driven OR and the recovered OR. The right-most column is an enhanced data-driven OR with m = 1.5 according to Eq. (21). The top panels show the 2PCF data points calculated as the mean values from 40 GLASS realisations, with error bars calculated as the square roots of the diagonal terms of the theoretical covariance matrix. The middle panels show the biases in the 2PCFs with respect to the No selection 2PCFs, and the bottom panels show the biases relative to the errors in the 2PCFs (the UR case is well beyond the range). The shaded regions are angular scales corresponding to physical scales r < 8 h−1 Mpc at the mean redshift of the galaxy sample.
To achieve satisfactory statistical power and to suppress sample variance, we perform the data-driven systematics test on 40 GLASS realisations and calculate the average w(θ). At the data-vector level, we compute the χd² value between the recovered OR w(θ) and the no-selection w(θ) following Eq. (18). We also use the PTE value to evaluate the goodness of fit: if the PTE is close to 1, a purely random fluctuation would almost certainly produce a larger χ², meaning that the difference between the “Recovered OR” and “No selection” 2PCFs is well below the sample variance.
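The PTE is the survival function of the χ² distribution evaluated at the measured value; a short sketch using SciPy:

```python
from scipy.stats import chi2

def pte(chi2_value, dof):
    """Probability to exceed: P(X > chi2_value) for X ~ chi^2(dof)."""
    return float(chi2.sf(chi2_value, dof))

# A chi^2 far below the number of degrees of freedom gives a PTE close
# to 1: random scatter alone would almost always exceed it.
```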
We can also evaluate the consistency at the level of parameter fits. For example, we set the matter density parameter Ωm and the galaxy bias b as free parameters. First, we define a Gaussian likelihood,
ℒ(Ωm, b) ∝ exp(−χdm² / 2), with χdm² = Σij [ŵ(θi) − w(θi; Ωm, b)] (Cw⁻¹)ij [ŵ(θj) − w(θj; Ωm, b)],

where χdm² denotes the χ² between data and model, ŵ(θi) is the measured 2PCF averaged across realisations, w(θi; Ωm, b) is the theoretical 2PCF, and Cw is the covariance matrix of the 2PCF. We run an MCMC using the EMCEE package (Foreman-Mackey et al. 2013) to sample the posterior distribution. We then compare the consistency between the MCMC chains from the “No selection” and “Recovered OR” cases. The inference bias of each parameter is described by

ΔΩm = |⟨Ωm⟩Rec. OR − ⟨Ωm⟩No sel.| / σΩm

and

Δb = |⟨b⟩Rec. OR − ⟨b⟩No sel.| / σb,

the differences of the mean parameter values across the converged chains, in units of their standard deviations. We also compare the entire posteriors via the distance between the mean values of {Ωm, b}, parametrised with a χ² value on the Ωm − b plane:

χ² = Δpᵀ C⁻¹ Δp, with Δp = (⟨Ωm⟩Rec. OR − ⟨Ωm⟩No sel., ⟨b⟩Rec. OR − ⟨b⟩No sel.).

Here C is the covariance matrix of the inferred parameters derived from the MCMC chains. The PTE value is then derived from the χ² value given 2 degrees of freedom.
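At the level of parameter fits, the comparison reduces to a χ² between chain means; a minimal sketch with NumPy and SciPy, using toy Gaussian chains in place of real MCMC output:

```python
import numpy as np
from scipy.stats import chi2

def posterior_shift(chain_ref, chain_test):
    """chi^2 distance between the mean {Omega_m, b} vectors of two chains
    (n_samples x 2 arrays), using the parameter covariance of the
    reference chain, plus the PTE for 2 degrees of freedom."""
    dp = chain_test.mean(axis=0) - chain_ref.mean(axis=0)
    cov = np.cov(chain_ref, rowvar=False)
    chi2_val = float(dp @ np.linalg.inv(cov) @ dp)
    return chi2_val, float(chi2.sf(chi2_val, 2))

rng = np.random.default_rng(3)
no_sel = rng.normal([0.3, 1.0], [0.02, 0.05], size=(50_000, 2))  # "No selection"
rec_or = rng.normal([0.3, 1.0], [0.02, 0.05], size=(50_000, 2))  # "Recovered OR"
chi2_p, pte_p = posterior_shift(no_sel, rec_or)
```

Since the two toy chains share the same underlying means, the resulting χ² is small and the PTE is close to 1, the signature of consistent posteriors.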
5.3. Results
In the upper panels of Fig. 14, we present the 2PCFs measured from the data-driven systematics test. Each column of panels shows a set-up specified by the number of clusters for data-driven OR and recovered mock OR. The right-most column corresponds to an enhanced data-driven OR by setting m = 1.5 in Eq. (21). The colour scheme and meaning of data points are the same as introduced in Table 2. The data points are the mean 2PCF computed from 40 GLASS realisations and the error bars are the square root of the diagonal terms of the theoretical covariance matrix given by the ONECOVARIANCE code. The middle panels show the fractional bias of the 2PCFs with respect to the “No selection” 2PCF and the bottom panels show the bias of the 2PCFs with respect to the error. The 1σ confidence contours of the parameter posterior shift with respect to the best-fit values constrained from the “No selection” 2PCF are shown in Fig. 15. We note that the “UR” 2PCF gives such biased constraints that the contour is well outside the plot area.
Fig. 15. 1σ confidence contours of the parameter posterior shifts with respect to the best-fit values constrained from the No selection 2PCF. The pink arrow indicates the direction of the UR contour, with best-fit values ΔΩm = −0.03 and Δb = 1.33, well outside the range of the plot.
Table 3 summarises the statistics defined in Sect. 5.2.2. The uniform random gives an extremely biased 2PCF, highlighting the necessity of mitigating the variable depth. The “True OR” case agrees well with the “No selection” case, as expected, meaning that using the true OR in Eq. (7) can indeed correct the variable depth in the 2PCF. From NC = 200 to NC = 800, the SOM+HC goes from under-correcting the 2PCF to slightly over-correcting it. The measured 2PCF and the constrained cosmological parameters are in best agreement with the “No selection” case at NC = 600. When we use an enhanced data-driven selection on the mock catalogue by setting m = 1.5, χd² for the UR case increases from 8475 (when m = 1) to 20 578, meaning that the bias in the UR 2PCF is more than doubled. The SOM+HC method with NC = 600 can still correct the bias in the 2PCF. Therefore, we conclude that SOM+HC with NC ∼ 600 is the optimal choice for the KiDS-Legacy galaxy clustering 2PCF measurement.
Now we fix NCKiDS = 600, so that the data-driven selection reflects the realistic variable depth, and vary NCrec from 200 to 800. The evaluation statistics are summarised in the sixth to eighth rows of Table 3, together with the third row for NCrec = 600. Again we see that the recovered 2PCF goes from biased high (when NCrec = 200) to well corrected. We also note that the recovery biases are generally low for NCrec between 400 and 800, at a level of ∼0.5σ, with χd² reaching a minimum value of 0.39 at NCrec = 600. The goodness of the recovery is thus robust to the choice of NCrec between 400 and 800. In conclusion, we choose NCrec = 600 for the galaxy clustering measurement with the KiDS-Legacy data.
Summary statistics of data-driven systematics test.
6. 2PCF measurement with KiDS-Legacy sample
6.1. Application of SOM+HC on the real data
In this section we apply the SOM+HC method to the real KiDS-Legacy galaxy catalogue. The systematics used to train the SOM are described in Sect. 3. The SOM+HC set-up is the same as in the data-driven systematics test: the SOM dimension is 30 × 30, with hexagonal cells (so each cell has six adjacent neighbours) and a toroidal topology (so the left and right edges are adjacent, as are the top and bottom edges). Training lasts for 10 epochs, with an initial width of the Gaussian neighbourhood function of σ = 30/2 = 15, decreasing linearly to σ = 1 in the last epoch. The initial learning rate is 0.1 and decreases linearly to 0.01. The trained SOM cells are grouped into 600 clusters, from which the OR weight is generated on a HEALPix map with Nside = 2048.
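For illustration, the training loop described above can be sketched in a few lines of NumPy. This is a deliberately simplified stand-in (rectangular grid, no toroidal wrapping); the actual analysis uses SOMOCLU with hexagonal cells and toroidal topology:

```python
import numpy as np

def train_som(data, grid=10, epochs=10, lr0=0.1, lr1=0.01, seed=0):
    """Minimal online SOM on a rectangular, non-toroidal grid.

    data: (n_samples, n_features) array of (standardised) systematics.
    Returns a (grid, grid, n_features) array of cell weight vectors.
    """
    rng = np.random.default_rng(seed)
    n, d = data.shape
    w = rng.normal(size=(grid, grid, d))
    gy, gx = np.mgrid[0:grid, 0:grid]              # cell coordinates
    for epoch in range(epochs):
        frac = epoch / max(epochs - 1, 1)
        lr = lr0 + (lr1 - lr0) * frac              # learning rate decays linearly
        sigma = (grid / 2) * (1 - frac) + frac     # neighbourhood width -> 1
        for x in data[rng.permutation(n)]:
            # Best-matching unit: the cell whose weights are closest to x.
            d2 = ((w - x) ** 2).sum(axis=2)
            by, bx = np.unravel_index(np.argmin(d2), d2.shape)
            # Gaussian neighbourhood update around the BMU.
            h = np.exp(-((gy - by) ** 2 + (gx - bx) ** 2) / (2 * sigma ** 2))
            w += lr * h[..., None] * (x - w)
    return w
```

Galaxies are then assigned to their best-matching cells, and the cells are grouped into NC hierarchical clusters from which the OR weight map is built.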
Figure 16 shows the results of the trained SOM+HC. The first five panels show the SOM colour-coded by the average systematics value in each cell. We note that the SOM inherits the grainy patterns of the mean Level and GAIA_nstar spatial distributions (see Fig. 2). The last panel shows the galaxy number contrast of the 600 hierarchical clusters (delineated by cells with black borders). Due to the toroidal topology, clusters near the edge may cross it and continue on the other side. The correlation between the galaxy number contrast and each systematic can be seen from the SOM maps. For example, the galaxy number contrast is strongly anticorrelated with PSF_size. This is expected, as a larger PSF size indicates poorer atmospheric seeing, making galaxies harder to detect under these conditions. A quantitative correlation can be evaluated using Spearman correlation coefficients. Figure 17 shows the correlation coefficient matrix. The bottom row shows the correlation between each systematic and the galaxy number contrast; it is clear that the PSF size has a strong anticorrelation with the galaxy number contrast, in agreement with the contrast-systematics relation shown as black curves in Fig. 2. The PSF shape, on the other hand, is only weakly correlated with the galaxy number contrast. We also find that Level and GAIA_nstar are correlated with extinction.
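The Spearman coefficients in Fig. 17 are computed per pair of per-cluster quantities; a toy example with SciPy, using synthetic values with a built-in anticorrelation (purely illustrative):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(7)
n_clusters = 600

# Per-cluster toy quantities: a "PSF size" systematic and the galaxy
# number contrast, generated with an explicit anticorrelation.
psf_size = rng.normal(0.7, 0.1, size=n_clusters)
contrast = -2.0 * psf_size + rng.normal(0.0, 0.1, size=n_clusters)

rho, pval = spearmanr(psf_size, contrast)
# rho is strongly negative: larger PSF size, fewer detected galaxies.
```

Being rank-based, the Spearman coefficient is insensitive to monotonic rescalings of either quantity, which makes it a convenient choice for heterogeneous systematics.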
Fig. 16. Self-organising maps trained on the KiDS-Legacy catalogue, with dimension 30 × 30. The first five panels show the SOM coloured by the average systematics value in each cell. The last panel (bottom right) shows the SOM coloured by the galaxy number contrast of each hierarchical cluster. The black lines are the boundaries of the hierarchical clusters. We note that we use a toroidal topology for the SOM, so the left and right edges, as well as the top and bottom edges, are adjacent.
Fig. 17. Spearman correlation coefficient matrix. The number in each grid cell is the correlation coefficient between the median systematics value and the galaxy contrast across the hierarchical clusters.
6.2. Blinded measurement of 2PCF
Since the KiDS-Legacy cosmological analysis is not yet unblinded, in this paper we only make blinded measurements of the 2PCF. We leave more sophisticated measurements, including tomographic galaxy clustering, to later work. This section serves as a showcase of the SOM+HC method on real data.
The 2PCF is measured in the same way as in Sect. 5. The “UR” 2PCF uses the coverage map as the random term in Eq. (9) (with analogous definitions for the RR and DR terms), and the “OR” 2PCF uses the OR weight from SOM+HC. Both 2PCFs are then blinded using the method given in Muir et al. (2020), which includes the following steps:
-
Select a reference cosmology and calculate the associated theoretical 2PCF. In our case, we use the fiducial cosmology defined in Sect. 1.
-
Shift the cosmological parameters to be blinded by a Gaussian random number. The standard deviation of the Gaussian determines how strongly the parameters are blinded. In this paper we shift {Ωm, b} with standard deviations {0.1, 0.1}, respectively.
-
Calculate the shifted 2PCF given the shifted parameters and obtain the data vector shift by subtracting it from the reference 2PCF.
-
Add the data vector shift to the measured 2PCF to obtain the blinded 2PCF.
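The four steps can be sketched as follows; the power-law `model_2pcf` is a hypothetical stand-in for the theoretical prediction, not the pipeline's model:

```python
import numpy as np

rng = np.random.default_rng(11)

def model_2pcf(theta, omega_m, b):
    """Hypothetical stand-in for the theoretical w(theta; Omega_m, b)."""
    return b ** 2 * (0.05 + omega_m) * theta ** -0.8

theta = np.geomspace(2.5, 250.0, 20)        # angular bins [arcmin]
w_measured = model_2pcf(theta, 0.31, 1.1)   # toy "measured" 2PCF

# Step 1: reference cosmology and its theoretical 2PCF.
ref = {"omega_m": 0.31, "b": 1.0}
w_ref = model_2pcf(theta, **ref)

# Step 2: shift the parameters to be blinded by Gaussian random numbers.
shifted = {"omega_m": ref["omega_m"] + rng.normal(0.0, 0.1),
           "b": ref["b"] + rng.normal(0.0, 0.1)}

# Step 3: data-vector shift from the difference between the shifted
# and reference theoretical 2PCFs.
dvec_shift = model_2pcf(theta, **shifted) - w_ref

# Step 4: add the shift to the measurement to obtain the blinded 2PCF.
w_blinded = w_measured + dvec_shift
```

The same `dvec_shift` would be applied to both the UR and OR measurements so that their comparison remains meaningful.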
Both the UR and OR 2PCFs are shifted by the same amount to make a meaningful comparison. We then run an MCMC on the blinded 2PCFs to sample the posterior of {Ωm, b}. To prevent accidental unblinding, we never save the blinded data vector, but directly pass it to the MCMC code. The covariance matrix is computed with the ONECOVARIANCE code. Since we do not know the true parameter values a priori, we use an iterative fitting procedure: we first compute the covariance matrix with the fiducial cosmology and run MCMCs with it, then compute the best-fit parameters from the posterior and update the covariance matrices for OR and UR, respectively. The posterior is then sampled with the updated covariance matrices. The MCMCs are sampled with the same modelling code used to blind the data vector. We note that we only constrain the linear model with data points at θ > 8 h−1 Mpc / DA(⟨z⟩), where DA(⟨z⟩) is the angular diameter distance at the mean KiDS-Legacy redshift ⟨z⟩. The redshift distribution is calibrated with a combination of SOM and the clustering-redshift method, which will also be presented in a companion paper.
Data-driven systematics tests (see Sect. 5) suggest that NC = 600 is the optimal choice. In this section, we use this as the fiducial setting, but also try NC = {200, 400, 800} to test the robustness of the NC choice. All 2PCFs are shifted with the same blinding shift. The posteriors are also fitted from these measurements.
The blinded 2PCFs are shown in Fig. 18, with error bars derived from the updated covariance matrices. The black dashed line shows the theoretical 2PCF calculated from the best-fit parameters of the blinded 2PCF measurement with NC = 600. The reduced χ² value between the OR (NC = 600) 2PCF and the best-fit 2PCF is 1.28, corresponding to a PTE of 0.25, indicating a good fit between the model and the blinded data. Figure 19 shows the posteriors sampled from the MCMC, shifted with respect to the best-fit values of the NC = 600 case. The left panel shows the contours of the shifted posteriors constrained from the 2PCFs corrected by ORs with different NC, while the right panel shows the contours corresponding to the 2PCFs corrected by the OR (NC = 600) and the UR, respectively. On linear scales, the OR and UR 2PCFs differ at a level of 70σ, and the galaxy bias parameter differs at a level of 40σ. Interestingly, the matter density Ωm does not change significantly with the correction. This is because the selection does not cause a significant bias in Ωm (Ωm is not biased even for the UR 2PCF), but this might not be the case for other surveys.
Fig. 18. Blinded 2PCFs measured from the KiDS-Legacy sample. The blue and pink dots are the measurements corrected by the recovered organised random and by the uniform random, respectively. The shaded regions are angular scales corresponding to physical scales smaller than 8 h−1 Mpc estimated at the mean redshift. The error bars are the standard deviations derived from the covariance matrix computed by the ONECOVARIANCE code. The black dashed curve shows the best-fit 2PCF from the MCMC.
Fig. 19. Contours of the 68.3% and 95.4% credible levels of the parameter posterior shifts with respect to the best-fit values constrained from the NC = 600 case. Left panel: Contours from the OR 2PCFs with different NC choices; the fiducial choice NC = 600 is shown as green filled contours. Right panel: Contours of the OR 2PCF (NC = 600, green) and the UR 2PCF (pink).
The 2PCF measurements with different NC show that the OR 2PCFs decrease with NC (i.e. the correction increases with NC). Treating NC = 600 as the fiducial choice, the corresponding χd² values are 5.75, 0.27, and 0.07 for NC = 200, 400, and 800: the change is significant from NC = 200 to NC = 400, but moderate from NC = 400 to NC = 800. This is consistent with the data-driven validation test. Therefore, we conclude that the OR correction is robust around NC = 600.
7. Discussions and conclusions
The aim of this work was to correct the bias in the galaxy 2PCF from deep surveys due to complicated selection effects. We showed that for deep surveys such as KiDS, the selection effect is very pronounced and can bias the 2PCF by about ten times the signal, especially on large scales. A critical step is therefore to correct for the variable depth caused by these complex selection effects. We applied the SOM+HC method proposed by Johnston et al. (2021) to the KiDS-Legacy sample and showed that it can correct the bias for faint galaxy samples.
SOM+HC is a machine-learning method for recovering organised randoms that have the same selection pattern as the galaxy sample. It combines self-organising maps with hierarchical clustering to group galaxies in systematics space and redistribute them across the survey footprint. Compared to other methods used to mitigate selection bias in the 2PCF, SOM+HC has the advantage of being model-independent: as an unsupervised machine-learning algorithm, it does not need to parametrise selection functions or biases in summary statistics, but directly captures arbitrary selection-induced patterns in the systematics space. It can therefore recover complex selection functions and their correlations without a priori assumptions about them. In addition, the SOM+HC method is purely data-driven, meaning that it can be trained on the real data itself without relying on mock data or any external information.
Johnston et al. (2021) showed that SOM+HC can correct the slight bias in the bright sample, while this work further validates the method on a faint galaxy sample with more complex selection effects and a more significant bias in the 2PCF. The validations are performed on mock galaxies from GLASS with both toy selections and data-driven selections. The toy test is a direct demonstration of how SOM+HC can recover the selection functions and the unbiased 2PCF. The data-driven test demonstrates the performance of SOM+HC on deep surveys such as KiDS and determines the optimal set-up for realistic 2PCF measurements. From Fig. 14, we note that the bias in the UR 2PCF is scale-dependent. This scale dependence may arise because different scales are dominated by the selection effects of different systematics, although it is hard to quantitatively isolate them at different scales. The OR is able to correct such complicated selection effects. From this test, we see that the selection effects in the real KiDS-Legacy galaxy sample bias the 2PCF by χ² ∼ 8500 and that SOM+HC with NC = 600 can correct this bias down to a level of χ² ∼ 0.3 given 10 degrees of freedom, reducing the bias on (Ωm, b) from a level of 27σ to 0.3σ. Notably, this set-up can recover an unbiased 2PCF even with an enhanced selection, correcting a 2PCF bias of χ² ∼ 20 000 to χ² ∼ 0.4.
Several set-ups of the SOM+HC algorithm can affect its performance, so the optimal set-up varies from survey to survey. For example, the number of galaxies (data volume) is important for the SOM+HC set-up. A catalogue with a larger number of galaxies is less sensitive to Poisson noise and has a more pronounced variable depth, making SOM+HC more efficient in recovering an accurate OR. Smaller galaxy catalogues, on the other hand, require lower NC and Nside to reduce cosmological contamination in the OR, which limits its spatial resolution. Furthermore, data-driven regression methods, including SOM+HC, aim to nullify modes of variance in the training data, which carries the risk of over-correction. A potential solution is to train the model on data from a disjoint sky region and run it on the region of interest. This approach requires ensuring that the training region is both representative of and uncorrelated with the testing region (the region of interest), making it practically challenging. For datasets whose selection effects are weaker than the LSS fluctuations (such as the KiDS bright sample), SOM+HC is more prone to over-correction. Therefore, SOM+HC should perform better on larger, fainter galaxy samples. For KiDS-Legacy, we show that NC = 600 gives a parameter constraint accuracy of ∼0.3σ, and the performance is found to be stable around this choice. In this work we also set the pixel size to ∼1.7 arcmin, equivalent to Nside = 2048, to avoid unintentionally masking pixels within the footprint while maintaining the best performance of the OR. We expect smaller pixels to give a more accurate OR for future surveys with higher galaxy number densities.
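The quoted pixel size follows from the HEALPix resolution parameter: each of the 12 Nside² equal-area pixels covers 4π/(12 Nside²) steradians. A quick check:

```python
import numpy as np

def healpix_pixel_size_arcmin(nside):
    """Approximate pixel side length: the square root of the equal
    pixel area 4*pi / (12 * nside^2) sr, converted to arcmin."""
    area_sr = 4.0 * np.pi / (12 * nside ** 2)
    return float(np.degrees(np.sqrt(area_sr)) * 60.0)

# Nside = 2048 gives ~1.7 arcmin; Nside = 1024 gives ~3.4 arcmin,
# the pixel size quoted for the toy-systematics test.
```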
Using the validated SOM+HC method, we performed a preliminary blinded 2PCF measurement from the KiDS-Legacy galaxy catalogue. We find that the UR 2PCF is significantly higher than the OR 2PCF by an order of magnitude. Furthermore, the corrected 2PCF is robust to the choice of NC around 600. We applied the SOM+HC method to the photometric galaxy clustering measurement for the KiDS-Legacy 6 × 2 pt and the bright galaxy clustering measurement for the KiDS-Legacy 3 × 2 pt measurements. More detailed discussions of the methodology (e.g. tomographic galaxy clustering) will be given in the forthcoming papers.
We moreover note that the tests carried out in this work focused on linear scales. The bottom panels of Fig. 14 show that the performance of the method is worse on small scales: the variance is very small there, so any residual bias tends to be significant relative to it. Extra care is therefore needed at these scales; for example, future work could combine different algorithms to correct for variable depth at different scales.
Combinations of multiple 2PCFs, namely the N × 2 pt measurements, will form the main analysis components of the next generation of LSS surveys, including LSST and Euclid. Galaxy 2PCF will be a critical part of these measurements. For these deep surveys, selection effects will also lead to variable depths. For example, the rolling cadence survey strategy of LSST (Bianco et al. 2021) will introduce stripe-like non-uniformity. Based on Hang et al. (2024), this will cause a significant bias up to an order of magnitude for the LSST Y3 galaxy clustering measurement. The Euclid survey, on the other hand, will combine ground-based multi-band photometry to estimate photo-z (Euclid Collaboration: Desprez et al. 2020), so any photo-z-based selection will introduce variable depth. With large sky coverage and depth, the measurement precision of these surveys will be greatly improved, requiring cleaner and more reliable bias correction. As discussed above, SOM+HC is more effective with larger galaxy samples. Therefore, we expect that as the data volumes of the next generation of surveys increase, SOM+HC will become more powerful in recovering selection-induced clustering. In addition, one can use smaller pixels (or higher Nside) for such galaxy catalogues, and obtain the OR weights at higher angular resolutions. To this end, we published the code used in this work as the TIAOGENG package16 for future implementation in pipelines for next-generation surveys (e.g. TXPIPE17). A combination with other methods, such as template-based correction methods, could be even more effective in mitigating the complex selection effects for future deep surveys.
Data availability
Our software is open-source for future usage.
Tiaogeng is the Chinese word for ‘spoon’, more commonly used in southern China. It contains two characters: tiao meaning to reconcile and geng referring to a Chinese-style thick soup. The code reconciles the unevenly observed sky, just as a tiaogeng stirs soup to make it taste more balanced and delicious.
In real measurement when the true angular power spectra are unknown, an iterative estimation is usually used to evaluate the covariance (Eifler et al. 2009). That is, one uses the theoretical angular power spectra calculated from reasonable cosmological parameters to calculate the covariance and constrain the parameters with the corresponding likelihood. The covariance matrix is updated with the best-fit parameter. This process is performed iteratively until the best-fit parameter converges.
Acknowledgments
We appreciate fruitful discussions with Harry Johnston, Andrina Nicola, and Anna Porredon. ZY acknowledges support from the Max Planck Society and the Alexander von Humboldt Foundation in the framework of the Max Planck-Humboldt Research Award endowed by the Federal Ministry of Education and Research (Germany). AHW is supported by the Deutsches Zentrum für Luft- und Raumfahrt (DLR), made possible by the Bundesministerium für Wirtschaft und Klimaschutz, and acknowledges funding from the German Science Foundation DFG, via the Collaborative Research Center SFB1491 “Cosmic Interacting Matters – From Source to Signal”. NEC and CG acknowledge support from the project “A rising tide: Galaxy intrinsic alignments as a new probe of cosmology and galaxy evolution” (with project number VI.Vidi.203.011) of the Talent programme Vidi which is (partly) financed by the Dutch Research Council (NWO). SJ acknowledges the Dennis Sciama Fellowship at the University of Portsmouth and the Ramón y Cajal Fellowship from the Spanish Ministry of Science. AL acknowledges support from the research project grant ‘Understanding the Dynamic Universe’ funded by the Knut and Alice Wallenberg Foundation under Dnr KAW 2018.0067. RR is supported by an ERC Consolidator Grant (No. 770935). MA acknowledges the UK Science and Technology Facilities Council (STFC) under grant number ST/Y002652/1 and the Royal Society under grant numbers RGSR2222268 and ICAR1231094. MB is supported by the Polish National Science Center through grants no. 2020/38/E/ST9/00395, 2018/30/E/ST9/00698, 2018/31/G/ST9/03388 and 2020/39/B/ST9/03494. AD acknowledges support from the ERC Consolidator Grant (No. 770935). CH acknowledges support from the Max Planck Society and the Alexander von Humboldt Foundation in the framework of the Max Planck-Humboldt Research Award endowed by the Federal Ministry of Education and Research, and the UK Science and Technology Facilities Council (STFC) under grant ST/V000594/1. H. Hildebrandt is supported by a DFG Heisenberg grant (Hi 1495/5-1), the DFG Collaborative Research Center SFB1491, an ERC Consolidator Grant (No. 770935), and the DLR project 50QE2305. PJ is supported by the Polish National Science Center through grant no. 2020/38/E/ST9/00395. BJ acknowledges support by the ERC-selected UKRI Frontier Research Grant EP/Y03015X/1 and by STFC Consolidated Grant ST/V000780/1. LL is supported by the Austrian Science Fund (FWF) [ESP 357-N]. CM acknowledges support from the Beecroft Trust, the Spanish Ministry of Science under the grant number PID2021-128338NB-I00, and from the European Research Council under grant number 770935. LM acknowledges the financial contribution from the grant PRIN-MUR 2022 20227RNLY3 “The concordance cosmological model: stress-tests with galaxy clusters” supported by Next Generation EU and from the grant ASI n. 2024-10-HH.0 “Attività scientifiche per la missione Euclid – fase E”. NRN acknowledges financial support from the National Science Foundation of China, Research Fund for Excellent International Scholars (grant n. 12150710511), and from the research grant from China Manned Space Project n. CMS-CSST-2021-A01. BS acknowledges support from the Max Planck Society and the Alexander von Humboldt Foundation in the framework of the Max Planck-Humboldt Research Award endowed by the Federal Ministry of Education and Research. MvWK acknowledges the support by the UK Space Agency. MY acknowledges funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (Grant agreement No. 101053992). The data in this paper is analysed with open-source python packages NUMPY (Harris et al. 2020), SCIPY (Virtanen et al. 2020), ASTROPY (Astropy Collaboration 2018), MATPLOTLIB (Hunter 2007), GLASS (Tessore et al. 2023), SOMOCLU (Wittek et al. 2017), HEALPY (Zonca et al. 2019), NAMASTER (Alonso et al. 2019), CCL (Chisari et al. 2019), EMCEE (Foreman-Mackey et al.
2013), and GETDIST (Lewis 2019). We also use WEBPLOTDIGITIZER (Rohatgi 2021) to digitise some external data from plots in the literature. Author contributions. All authors contributed to the development and writing of this paper. The authorship list is given in three groups: the lead authors (ZY & AHW) followed by two alphabetical groups. The first alphabetical group includes those who are key contributors to both the scientific analysis and the data products. The second group covers those who have either made a significant contribution to the data products or the scientific analysis.
References
- Abbott, T., Abdalla, F. B., Aleksić, J., et al. 2016, MNRAS, 460, 1270
- Abbott, T. M. C., Aguena, M., Alarcon, A., et al. 2022, Phys. Rev. D, 105, 023520
- Abdalla, E., Abellán, G. F., Aboubrahim, A., et al. 2022, J. High Energy Astrophys., 34, 49
- Alam, S., Albareti, F. D., Prieto, C. A., et al. 2015, ApJS, 219, 12
- Alam, S., Aubert, M., Avila, S., et al. 2021, Phys. Rev. D, 103, 083533
- Alonso, D., Hill, J. C., Hložek, R., & Spergel, D. N. 2018, Phys. Rev. D, 97, 063514
- Alonso, D., Sanchez, J., & Slosar, A. 2019, MNRAS, 484, 4127
- Amon, A., Gruen, D., Troxel, M. A., et al. 2022, Phys. Rev. D, 105, 023514
- Asgari, M., Lin, C.-A., Joachimi, B., et al. 2021, A&A, 645, A104
- Asgari, M., Mead, A. J., & Heymans, C. 2023, Open J. Astrophys., 6, 39
- Astropy Collaboration (Price-Whelan, A. M., et al.) 2018, AJ, 156, 123
- Baumann, D., Nicolis, A., Senatore, L., & Zaldarriaga, M. 2012, JCAP, 2012, 051
- Benítez, N. 2000, ApJ, 536, 571
- Bergé, J., Gamper, L., Réfrégier, A., & Amara, A. 2013, Astron. Comput., 1, 23
- Berlfein, F., Mandelbaum, R., Dodelson, S., & Schafer, C. 2024, MNRAS, 531, 4954
- Bianco, F. B., Ivezic, Z., Jones, R. L., et al. 2021, ApJS, 258, 1
- Bilicki, M., Dvornik, A., Hoekstra, H., et al. 2021, A&A, 653, A82
- Blake, C., Amon, A., Childress, M., et al. 2016, MNRAS, 462, 4240
- Carrasco, J. J. M., Hertzberg, M. P., & Senatore, L. 2012, J. High Energy Phys., 2012, 82
- Chisari, N. E., Alonso, D., Krause, E., et al. 2019, ApJS, 242, 2
- Cooray, A., & Sheth, R. 2002, Phys. Rep., 372, 1
- Coupon, J., Kilbinger, M., McCracken, H. J., et al. 2012, A&A, 542, A5
- Crocce, M., Carretero, J., Bauer, A. H., et al. 2015, MNRAS, 455, 4301
- Cuceu, A., Farr, J., Lemos, P., & Font-Ribera, A. 2019, JCAP, 2019, 044
- Dalal, R., Li, X., Nicola, A., et al. 2023, Phys. Rev. D, 108, 123519
- Davis, M., & Peebles, P. J. E. 1983, ApJ, 267, 465
- DeRose, J., Wechsler, R., Becker, M., et al. 2022, Phys. Rev. D, 105, 123520
- DESI Collaboration (Adame, A. G., et al.) 2024, ArXiv e-prints [arXiv:2411.12022]
- Desjacques, V., Jeong, D., & Schmidt, F. 2018, Phys. Rep., 733, 1
- Dey, A., Schlegel, D. J., Lang, D., et al. 2019, AJ, 157, 168
- Dodelson, S., & Schmidt, F. 2020, Modern Cosmology (London: Academic Press)
- Dvornik, A., Heymans, C., Asgari, M., et al. 2023, A&A, 675, A189
- Edge, A., Sutherland, W., Kuijken, K., et al. 2013, The Messenger, 154, 32
- Efstathiou, G., Sutherland, W. J., & Maddox, S. J. 1990, Nature, 348, 705
- Eifler, T., Schneider, P., & Hartlap, J. 2009, A&A, 502, 721
- Elvin-Poole, J., Crocce, M., Ross, A. J., et al. 2018, Phys. Rev. D, 98, 042006
- Euclid Collaboration (Desprez, G., et al.) 2020, A&A, 644, A31
- Euclid Collaboration (Mellier, Y., et al.) 2025, A&A, in press, https://doi.org/10.1051/0004-6361/202450810
- Everett, S., Yanny, B., Kuropatkin, N., et al. 2022, ApJS, 258, 15
- Fenech Conti, I., Herbonnet, R., Hoekstra, H., et al. 2017, MNRAS, 467, 1627
- Foreman-Mackey, D., Hogg, D. W., Lang, D., & Goodman, J. 2013, PASP, 125, 306
- Gaia Collaboration (Prusti, T., et al.) 2016, A&A, 595, A1
- Gaia Collaboration (Vallenari, A., et al.) 2023, A&A, 674, A1
- Gong, Y., Liu, X., Cao, Y., et al. 2019, ApJ, 883, 203
- Gorski, K. M., Hivon, E., Banday, A. J., et al. 2005, ApJ, 622, 759
- Hang, Q., Joachimi, B., Charles, E., et al. 2024, MNRAS, 535, 2970
- Harris, C. R., Millman, K. J., van der Walt, S. J., et al. 2020, Nature, 585, 357
- Heydenreich, S., Schneider, P., Hildebrandt, H., et al. 2020, A&A, 634, A104
- Heymans, C., Van Waerbeke, L., Miller, L., et al. 2012, MNRAS, 427, 146
- Heymans, C., Tröster, T., Asgari, M., et al. 2021, A&A, 646, A140
- Hildebrandt, H., Viola, M., Heymans, C., et al. 2017, MNRAS, 465, 1454
- Hildebrandt, H., van den Busch, J. L., Wright, A. H., et al. 2021, A&A, 647, A124
- Ho, S., Cuesta, A., Seo, H.-J., et al. 2012, ApJ, 761, 14
- Hunter, J. D. 2007, Comput. Sci. Eng., 9, 90
- Jalan, P., Bilicki, M., Hellwing, W. A., et al. 2024, A&A, 692, A177
- Jarvis, M. 2015, Astrophysics Source Code Library [record ascl:1508.007]
- Johnston, H., Wright, A. H., Joachimi, B., et al. 2021, A&A, 648, A98
- Johnston, H., Chisari, N. E., Joudaki, S., et al. 2024, A&A, submitted [arXiv:2409.17377]
- Kaiser, N. 1987, MNRAS, 227, 1
- Kong, H., Ross, A. J., Honscheid, K., et al. 2024, ArXiv e-prints [arXiv:2405.16299]
- Krause, E., & Eifler, T. 2017, MNRAS, 470, 2100
- Landy, S. D., & Szalay, A. S. 1993, ApJ, 412, 64
- Leistedt, B., & Peiris, H. V. 2014, MNRAS, 444, 2
- Leistedt, B., Peiris, H. V., Mortlock, D. J., Benoit-Lévy, A., & Pontzen, A. 2013, MNRAS, 435, 1857
- Lewis, A. 2019, ArXiv e-prints [arXiv:1910.13970]
- Li, X., Zhang, T., Sugiyama, S., et al. 2023a, Phys. Rev. D, 108, 123518
- Li, S.-S., Kuijken, K., Hoekstra, H., et al. 2023b, A&A, 670, A100
- Limber, D. N. 1953, ApJ, 117, 134
- Loureiro, A., Cuceu, A., Abdalla, F. B., et al. 2019, Phys. Rev. Lett., 123, 081301
- LSST Science Collaboration (Abell, P. A., et al.) 2009, ArXiv e-prints [arXiv:0912.0201]
- Maddox, S. J., Efstathiou, G., Sutherland, W. J., & Loveday, J. 1990, MNRAS, 242, 43
- Maddox, S. J., Efstathiou, G., & Sutherland, W. J. 1996, MNRAS, 283, 1227
- Menard, B. 2002, in SF2A-2002: Semaine de l’Astrophysique Francaise, eds. F. Combes, & D. Barret, 57
- Miller, L., Heymans, C., Kitching, T., et al. 2013, MNRAS, 429, 2858
- Morrison, C. B., & Hildebrandt, H. 2015, MNRAS, 454, 3121
- Muir, J., Bernstein, G. M., Huterer, D., et al. 2020, MNRAS, 494, 4454
- Müllner, D. 2011, ArXiv e-prints [arXiv:1109.2378]
- Murtagh, F., & Contreras, P. 2012, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2, 86
- Myles, J., Alarcon, A., Amon, A., et al. 2021, MNRAS, 505, 4249
- Neveux, R., Burtin, E., de Mattia, A., et al. 2020, MNRAS, 499, 210
- Nichol, R. C. 2007, Gen. Relat. Grav., 40, 249
- Nicola, A., Alonso, D., Sánchez, J., et al. 2020, JCAP, 2020, 044
- Peacock, J. A., & Smith, R. E. 2000, MNRAS, 318, 1144
- Peebles, P. J. E. 1973, ApJ, 185, 413
- Planck Collaboration VI. 2020, A&A, 641, A6
- Porredon, A., Crocce, M., Fosalba, P., et al. 2021, Phys. Rev. D, 103, 043503
- Reid, B., Ho, S., Padmanabhan, N., et al. 2016, MNRAS, 455, 1553
- Reischke, R., Unruh, S., Asgari, M., et al. 2024, A&A, submitted [arXiv:2410.06962]
- Rezaie, M., Seo, H.-J., Ross, A. J., & Bunescu, R. C. 2020, MNRAS, 495, 1613
- Rodríguez-Monroy, M., Weaverdyck, N., Elvin-Poole, J., et al. 2022, MNRAS, 511, 2665
- Rohatgi, A. 2021, Webplotdigitizer: Version 4.5
- Ross, A. J., Ho, S., Cuesta, A. J., et al. 2011, MNRAS, 417, 1350
- Schlafly, E. F., & Finkbeiner, D. P. 2011, ApJ, 737, 103
- Schlegel, D. J., Finkbeiner, D. P., & Davis, M. 1998, ApJ, 500, 525
- Schneider, P., van Waerbeke, L., Kilbinger, M., & Mellier, Y. 2002, A&A, 396, 1
- Secco, L. F., Samuroff, S., Krause, E., et al. 2022, Phys. Rev. D, 105, 023515
- Seljak, U. 2000, MNRAS, 318, 203
- Shanks, T., Bean, A. J., Ellis, R. S., et al. 1983, ApJ, 274, 529
- Suchyta, E., Huff, E. M., Aleksić, J., et al. 2016, MNRAS, 457, 786
- Sugiyama, S., Miyatake, H., More, S., et al. 2023, Phys. Rev. D, 108, 123521
- Takada, M., & Hu, W. 2013, Phys. Rev. D, 87, 123504
- Tessore, N., Loureiro, A., Joachimi, B., von Wietersheim-Kramsta, M., & Jeffrey, N. 2023, Open J. Astrophys., 6, 11
- Tröster, T., Sánchez, A. G., Asgari, M., et al. 2020, A&A, 633, L10
- van den Busch, J. L., Hildebrandt, H., Wright, A. H., et al. 2020, A&A, 642, A200
- van Uitert, E., Joachimi, B., Joudaki, S., et al. 2018, MNRAS, 476, 4662
- Virtanen, P., Gommers, R., Oliphant, T. E., et al. 2020, Nat. Meth., 17, 261
- Wandelt, B. D., Hivon, E., & Górski, K. M. 2001, Phys. Rev. D, 64, 083003
- Weaverdyck, N., & Huterer, D. 2021, MNRAS, 503, 5061
- Wittek, P., Gao, S. C., Lim, I. S., & Zhao, L. 2017, J. Stat. Softw., 78, 1
- Wright, A. H., Hildebrandt, H., Kuijken, K., et al. 2019, A&A, 632, A34
- Wright, A. H., Hildebrandt, H., van den Busch, J. L., & Heymans, C. 2020, A&A, 637, A100
- Wright, A. H., Kuijken, K., Hildebrandt, H., et al. 2024, A&A, 686, A170
- Zheng, Z., Berlind, A. A., Weinberg, D. H., et al. 2005, ApJ, 633, 791
- Zonca, A., Singer, L., Lenz, D., et al. 2019, J. Open Source Softw., 4, 1298
Appendix A: Covariance matrix comparison
In this appendix we compare the covariance matrix estimated from mock realisations with the theoretical covariance computed by the ONECOVARIANCE code. The mock covariance is computed with Eq. (17). To ensure consistency, we configure the input file of the ONECOVARIANCE code so that the cosmology, footprint, redshift distribution, and galaxy number density match those of the GLASS mock. In addition, since the GLASS package only generates a Gaussian field, we assume that the theoretical covariance matrix contains only the Gaussian covariance plus the super-sample covariance (Takada & Hu 2013), which accounts for the covariance of modes larger than the observed field.
In this section we compare only the covariance of the 2PCFs from the data-driven test with {NCKiDS = 600, NCrec = 600}. The three panels in Fig. A.1 show the square root of the main diagonal, and the 5th and 10th diagonals above the main diagonal. We note that the covariance of the uniform random case is significantly higher than that of the “No selection” case, implying that variable depth contamination also introduces additional covariance into the data. The covariance of the “No selection” and “True OR” cases matches the theoretical covariance given by ONECOVARIANCE, as expected. The covariance of the “Recovered OR” case agrees with that of the unbiased cases, so we conclude that the SOM+HC method also recovers an accurate correlation function covariance in the linear regime.
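The mock covariance used here is the standard sample covariance over realisations; a minimal sketch, with `w_mocks` standing in for the 40 measured 2PCF vectors (toy data below), including the extraction of the off-diagonals plotted in Fig. A.1, is:

```python
import numpy as np

# Sketch of the mock covariance estimate: the sample covariance over
# realisations, with `w_mocks` standing in for the measured 2PCFs.
rng = np.random.default_rng(0)
n_real, n_bins = 40, 15
w_mocks = rng.multivariate_normal(np.zeros(n_bins),
                                  0.1 * np.eye(n_bins), size=n_real)

# Unbiased sample covariance across realisations (rows = realisations).
cov = np.cov(w_mocks, rowvar=False, ddof=1)

def kth_diagonal(matrix, k):
    """The k-th diagonal above the main one, as plotted in Fig. A.1."""
    return np.diagonal(matrix, offset=k)

sigma_main = np.sqrt(kth_diagonal(cov, 0))  # square root of the variances
diag5 = kth_diagonal(cov, 5)                # 5th off-diagonal
```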
Fig. A.1. Comparison of the covariance matrices. Panels from top to bottom show the square root of the main covariance diagonal, and the 5th and 10th diagonals above the main diagonal, as indicated by the black grids in the theoretical covariance matrices shown in the top right of each panel. The coloured curves show the covariance terms of the four 2PCFs defined in Table 2 from the data-driven test with {NCKiDS = NCrec = 600}; the dotted lines are calculated with the ONECOVARIANCE code. The lower x-ticks are the row indices, while the upper x-ticks on the top panel give the corresponding angular scale for each diagonal term.
Appendix B: Correcting selection effects for angular power spectra with SOM+HC
Another widely used two-point statistic is the angular power spectrum, which is defined as the correlation of galaxy over-density in harmonic space. In general, angular power spectra have weak correlations between ℓ modes, allowing us to study different angular scales independently.
Fig. B.1. Upper panel: pseudo-Cℓ measured from the GLASS mock samples. The data points of each series are the average PCL from 40 realisations; the error bars are the square root of the covariance diagonal given by NAMASTER. The χd² values, which quantify the difference between each dataset and the “No selection” PCL, are calculated similarly to Eq. (18). Lower panel: relative difference of each case with respect to the “No selection” case. The shaded regions are the ℓ modes corresponding to physical scales smaller than 8 h−1 Mpc estimated at the mean redshift.
Fig. B.2. 1σ credible contours of the parameter posterior shifts with respect to the best-fit values constrained from the “No selection” PCL.
In practice, there are two estimators of the power spectrum. The first is the “band power”, for which one first measures the correlation function and then inverts Eq. (7) by performing the integration Cℓ = ∫₀^∞ w(θ) J₀(ℓθ) θ dθ at the central ℓ of each band. The accuracy of the band powers depends on the integration limits in θ and on the discrete θ sampling of the measured correlation function (Schneider et al. 2002).
The other estimator is the pseudo-Cℓ (PCL; Wandelt et al. 2001; Alonso et al. 2018), for which one calculates the coupled Cℓ from the weighted sky map, decouples it with the mode-coupling matrix of the weight, and bins it. The weight can be a binary mask specifying the footprint of the survey (so that a source gets a weight of 1 if it is in the footprint and 0 otherwise), or a per-source weight to suppress errors (like the lensing weight) or to correct for selection effects (like the organised random weight in this paper). In this appendix we briefly discuss the use and performance of SOM+HC in the measurement of the PCL from the same GLASS mock samples. A detailed discussion will be given in a companion paper presenting the methodology of the KiDS-Legacy 6×2pt cosmology analysis.
The galaxy PCL is based on pixelised galaxy over-density maps. On a weighted sky, the galaxy over-density in the p-th pixel is given by (Nicola et al. 2020):

δg(p) = N(p) / [w(p) N̄] − 1,   (B.1)

where N(p) is the number of galaxies in the pixel and w(p) is the weight value (the organised random weight in our case) in the pixel. Dividing the galaxy number by w(p) effectively corrects for the variable depth. The average galaxy number is given by

N̄ = Σp N(p) / Σp w(p),

where the sums are taken within the footprint.
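The mapping step can be sketched as follows, assuming `counts` and `weight` are HEALPix-ordered arrays of galaxy counts and organised-random weights per pixel (toy values invented here):

```python
import numpy as np

# Sketch of building the weighted over-density map of Eq. (B.1).
# `counts` and `weight` stand in for per-pixel galaxy counts and
# organised-random weights; toy values below.
rng = np.random.default_rng(1)
npix = 12 * 64**2                     # HEALPix Nside = 64
weight = np.zeros(npix)
in_foot = npix // 3                   # toy survey footprint size
weight[:in_foot] = rng.uniform(0.5, 1.0, in_foot)
counts = rng.poisson(10.0 * weight)   # depth-modulated galaxy counts

good = weight > 0                     # pixels inside the footprint
# Mean galaxy number per unit weight, summed over the footprint.
nbar = counts[good].sum() / weight[good].sum()

# Over-density, defined only where the weight is non-zero.
delta = np.zeros(npix)
delta[good] = counts[good] / (weight[good] * nbar) - 1.0
```

By construction, the weighted over-density sums to zero over the footprint, which is a useful consistency check on any implementation.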
We can measure the PCL directly between the maps of two galaxy samples a and b as

C̃ℓab = 1/(2ℓ+1) Σm ãℓma (ãℓmb)*,

where ãℓma is the harmonic coefficient of sample a on the weighted sky and (ãℓmb)* is the complex conjugate of that of sample b. For weighted galaxy maps, this estimator is linked to the underlying power spectrum Cℓ via

⟨C̃ℓab⟩ = Σℓ′ Mℓℓ′(wa, wb) Cℓ′,

where Mℓℓ′(wa, wb) is the mode-mixing matrix determined by the weight maps (wa, wb) of the two fields. The directly measured C̃ℓab is therefore called the “coupled PCL”. An unbiased angular power spectrum is estimated by decoupling it with the inverse of the mode-coupling matrix. Assuming Poisson-distributed objects, the OR weight is proportional to the inverse variance of the field, so we use the OR weight to calculate the mode-coupling matrix. In practice, one also bins the PCL into ℓ bands. For technical details, we refer to Alonso et al. (2019); the calculations are implemented in the NAMASTER package (Alonso et al. 2019), which we use in this section to measure the PCL.
The measured galaxy auto-power spectrum contains a shot-noise contribution which needs to be subtracted. We assume Poissonian shot noise, for which the coupled noise spectrum (the “noise bias” in the terminology of Alonso et al. 2019) is given by

Ñℓ = Ωpix ⟨w⟩ / N̄,

where Ωpix is the pixel area in units of steradians and ⟨w⟩ is the mean weight value per pixel across the whole sky.
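As a sanity check of the noise subtraction, the flat coupled noise level can be evaluated directly from the count and weight maps. The sketch below assumes a flat coupled noise spectrum of level Ωpix⟨w⟩/N̄, with toy inputs invented for the example:

```python
import numpy as np

# Sketch of the Poissonian "noise bias" level for the coupled PCL,
# assuming it is flat in ell with amplitude Omega_pix * <w> / Nbar.
rng = np.random.default_rng(2)
nside = 64
npix = 12 * nside**2
omega_pix = 4.0 * np.pi / npix            # pixel area in steradians

weight = np.zeros(npix)
weight[: npix // 3] = rng.uniform(0.5, 1.0, npix // 3)
counts = rng.poisson(10.0 * weight)       # toy depth-modulated counts

good = weight > 0
nbar = counts[good].sum() / weight[good].sum()  # mean N per unit weight
w_mean = weight.mean()                          # mean weight, whole sky

noise_bias = omega_pix * w_mean / nbar          # flat coupled noise level
ells = np.arange(3 * nside)
coupled_noise = np.full(ells.size, noise_bias)  # spectrum to subtract
```

In a NAMASTER-based pipeline this flat spectrum would be passed as the noise bias when decoupling the measured PCL.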
The Gaussian covariance matrix depends on unbiased estimates of the angular power spectra. In this section we only measure angular power spectra from the GLASS mock catalogues, so we take the input angular power spectra of GLASS to estimate the theoretical Gaussian covariance matrix. The mode-coupling induced by the OR weights is also taken care of by the NAMASTER package. The non-Gaussian contribution includes a connected covariance term, dominated by the galaxy trispectrum (Krause & Eifler 2017), which only affects small scales, so we neglect it in this section. Another non-Gaussian term is the super-sample covariance (SSC; Takada & Hu 2013), which accounts for correlated modes larger than the survey footprint. We calculate this term using the ONECOVARIANCE package, but note that it is negligible for KiDS-Legacy.
Therefore, for the PCL, variable depth affects the map-making, the mode-mixing matrix, the shot noise, and the covariance matrix. In this section we perform the data-driven test for the PCL with the same 40 GLASS mock samples with interpolated systematics. Here we show the case NCKiDS = NCrec = 600; that is, the mock sample is selected with the data-driven OR weight with NCKiDS = 600, and the recovered OR weight comes from 600 HCs trained on the selected mock sample. We first generate the galaxy fluctuation map according to Eq. (B.1) and measure the four PCLs defined analogously to Table 2: the UR case is the depleted galaxy sample with the mode-mixing matrix calculated from the footprint, and the “No selection” case is the unselected galaxy sample with the mode-mixing matrix calculated from the footprint.
The PCL measurements are shown in Fig. B.1. Data points in the top panel show the PCL averaged across realisations. Error bars are the standard deviations of each Cℓ calculated from the covariance matrix given by NAMASTER. The bottom panel shows the relative difference between each Cℓ measurement and the “No selection” case. The shaded region corresponds to physical scales smaller than 8 h−1 Mpc estimated at the mean redshift; we ignore these scales in this analysis. To assess the consistency between the OR-corrected PCL and the “No selection” PCL, we calculate χd (defined as in Eq. (18), but with Cℓ instead of w(θ)) between them. The values are quoted in the upper panel of Fig. B.1. The uniform random again gives a very large bias (χd = 3230) in the PCL, and the true OR completely corrects this bias, as expected. The OR recovered by SOM+HC leaves a residual of χd = 0.42.
In the same way as for the 2PCF, we run an MCMC on the measured PCL. The Gaussian likelihood is defined with the Gaussian covariance given by NAMASTER plus the SSC given by ONECOVARIANCE. We calculate the posterior shift with respect to the best-fit values constrained from the “No selection” case. The contours of the 68% credible level are shown in Fig. B.2. We calculate the parameter constraint bias following the 2PCF test and obtain ΔΩm[σ] = 0.21, Δb[σ] = −0.17, and χ²ΔΩm, Δb = 0.07 for the recovered OR weight case. For the true OR case, the constraint bias is at the 0.001σ level. We note that both posteriors have the same shape, indicating a tight degeneracy between the galaxy bias and Ωm.
From this exercise, we can conclude that the OR weight recovered by the SOM+HC method can also correct for variable depth in 2-point statistics in harmonic space. Furthermore, the optimal choice of NC that we found with data-driven systematics also provides accurate PCL measurements and parameter constraints. We leave further tests, including tomographic galaxy clustering PCL, to a future KiDS-Legacy 6 × 2pt methodology paper.
Appendix C: SOM+HC with all the systematics from KiDS-Legacy
Our fiducial choice of five systematics is based on the consideration that some systematics are strongly correlated with each other (such as the extinction in different bands), so including all of them would not add information to the SOM. In addition, some systematics induce no selection effect in the galaxy field, so including them in the training vector would not improve the performance but would increase the time and computational resources required (especially in the validation procedure). In this section we validate our fiducial choice of systematics described in Section 3 by training the SOM with all the systematics available in the KiDS-Legacy catalogue. This increases the number of systematics from 5 to 16 and also increases the number of training epochs required for the SOM to converge.
Fig. C.1. Spearman correlation coefficient matrix. The number in each cell is the correlation coefficient between the median systematics and the galaxy contrast in each hierarchical cluster.
Fig. C.2. Blinded 2PCFs measured from the KiDS-Legacy catalogue. The pink dots are UR measurements; the blue dots are our fiducial OR measurements (OR recovered from 5 systematics); and the orange dots are measured with the OR recovered from 16 systematics. The error bars are derived from the theoretical covariance matrices with the best-fit UR and OR parameters.
We group the galaxies into 600 clusters, as in the fiducial set-up, and calculate Spearman’s correlation coefficient between the median systematics and the galaxy contrast in each hierarchical cluster. Figure C.1 shows the correlation coefficient matrix. Its first row gives the correlation between the galaxy number contrast and each systematic, indicating the selection effect captured by SOM+HC. The extinction in different bands is fully correlated because it is calculated by scaling the reddening template of Schlafly & Finkbeiner (2011) according to the band, so we only include the r band in the fiducial case. The PSF size has the strongest negative correlation, in agreement with the fiducial run. Level and the GAIA star number density are both slightly correlated with the galaxy number contrast. We do not include Background in our fiducial run because it correlates only weakly with the galaxy number contrast while correlating strongly with the GAIA star number density, so including it adds little information.
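The hierarchical clustering step itself can be sketched with SCIPY's agglomerative clustering, with `som_weights` standing in for the trained SOM weight vectors (toy data below); the Ward linkage shown here is one common choice, not necessarily the exact linkage used in our pipeline:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Sketch of the HC step of SOM+HC: group SOM cells into a fixed number
# of hierarchical clusters by their distance in systematics space.
# `som_weights` stands in for the trained SOM weight vectors.
rng = np.random.default_rng(3)
n_cells, n_systematics = 900, 5          # e.g. a 30x30 SOM, 5 systematics
som_weights = rng.normal(size=(n_cells, n_systematics))

n_clusters = 20
tree = linkage(som_weights, method="ward")   # agglomerative merge tree
labels = fcluster(tree, t=n_clusters, criterion="maxclust")
```

Cutting the tree with `criterion="maxclust"` corresponds to the distance threshold drawn in the dendrogram of Fig. 5.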
To further justify our choice of systematics, we measure the 2PCF corrected by the OR weight generated from all 16 systematics. Figure C.2 shows the UR (pink), fiducial OR (blue), and full-systematics OR (orange) 2PCFs. The differences between the fiducial and full-systematics 2PCFs, quantified by their χd² value, are insignificant over the entire range of angular scales considered. Therefore, we conclude that the five systematics we chose are representative of the full set of systematics for recovering the organised random.
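The correlation matrix of Fig. C.1 can be reproduced schematically with `scipy.stats.spearmanr`, stacking the per-cluster galaxy contrast and median systematics column-wise (toy data invented below):

```python
import numpy as np
from scipy.stats import spearmanr

# Sketch of the Spearman correlation matrix of Fig. C.1, with one row
# per hierarchical cluster and one column per quantity; toy data below
# mimic a negative PSF-size selection effect.
rng = np.random.default_rng(4)
n_clusters = 600
psf_size = rng.normal(size=n_clusters)
contrast = -0.5 * psf_size + rng.normal(scale=0.5, size=n_clusters)
extinction_r = rng.normal(size=n_clusters)

table = np.column_stack([contrast, psf_size, extinction_r])
rho, _ = spearmanr(table)        # (3, 3) rank-correlation matrix

# The first row holds the correlation of the galaxy contrast with each
# systematic, i.e. the selection effect captured by SOM+HC.
selection_row = rho[0]
```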
Appendix D: Angular galaxy clustering for the KiDS-1000 bright sample
Johnston et al. (2021) measured the 2PCF of the KiDS-1000 Bright sample (Bilicki et al. 2021), selected with a magnitude cut r < 20 from the fourth KiDS data release. Its redshift distribution is reliably calibrated from the overlap with Galaxy And Mass Assembly (GAMA) spectroscopy, using the neural network algorithm implemented in the ANNz2 software (see Fig. D.1 for the redshift distribution). The sample covers a sky area of 789 deg2 and has a number density of 0.36 arcmin−2.
In this section we re-measure the 2PCF of the KiDS-Bright sample with the new SOM+HC implementation to check the consistency between our code and the pipeline of Johnston et al. (2021). Following their fiducial (“100A”) set-up, we recover the OR with a 100×100 SOM trained on the same systematics (r-band detection threshold, PSF size, and PSF shape), grouped into 100 hierarchical clusters. The OR weight map has Nside = 1024, corresponding to an angular resolution of 3.4 arcmin, slightly larger than in the fiducial set-up of Johnston et al. (2021). The correlation function is then measured in 30 angular bins between 3 and 300 arcmin.
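The quoted angular resolution follows directly from the HEALPix pixel area; a quick check, approximating the pixel scale by the square root of the pixel area (as HEALPY's `nside2resol` does):

```python
import numpy as np

# The mean HEALPix pixel area at a given Nside is 4*pi/(12*Nside^2)
# steradians; a characteristic pixel scale is its square root.
def nside_to_resol_arcmin(nside):
    """Approximate pixel scale in arcmin for a given HEALPix Nside."""
    pix_area_sr = 4.0 * np.pi / (12 * nside**2)
    return np.degrees(np.sqrt(pix_area_sr)) * 60.0

print(f"Nside=1024: {nside_to_resol_arcmin(1024):.2f} arcmin")
```

This evaluates to roughly 3.4 arcmin at Nside = 1024, consistent with the value quoted above.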
Fig. D.1. Photometric redshift distribution of the KiDS-1000 bright sample, with the zphot. = {0.02, 0.2, 0.5} redshift bins (dashed lines) employed in our w(θ) measurements.
Fig. D.2. 2PCFs measured from the KiDS-1000 bright sample. The three panels show the auto-correlation w(θ) of the whole sample, and of the first and second tomographic bins, respectively. The blue and pink dots are the measurements corrected by the recovered organised random and by the uniform random, respectively. The shaded regions are angular scales corresponding to physical scales smaller than 8 h−1 Mpc estimated at the mean redshift. The error bars are the standard deviations derived from the covariance matrix provided by the ONECOVARIANCE package. The orange curves are the OR-corrected 2PCF data points measured by Johnston et al. (2021). The black dashed curves in the second and third panels are the best-fit 2PCFs from the MCMC.
After recovering the OR weight, we measure the correlation functions of the two tomographic bins of the bright sample, defined by selecting galaxies with ANNz-calibrated photo-z using the bin edges {0.02, 0.2, 0.5}. The 2PCFs corrected by the uniform random and the recovered organised random are presented in Fig. D.2 as pink and blue points, respectively. Our measurements are fairly consistent with those from Johnston et al. (2021), shown as orange curves. The error bars are the standard deviations derived from the covariance matrix given by the ONECOVARIANCE code with the same redshift distribution and the best-fit Ωm and galaxy biases determined below. The shaded regions are angular scales corresponding to physical scales smaller than 8 h−1 Mpc estimated at the mean redshift. We then fit {Ωm, b1, b2} in the linear model to the 2PCFs on linear scales. Since we do not know the parameters a priori when computing the covariance matrix, we perform an iterative parameter fit. We choose the initial values Ωm = 0.33, b1 = 1.1, and b2 = 1.25 to compute the covariance matrix; the pilot galaxy bias values are taken from van Uitert et al. (2018) and apply to a GAMA-like subsample of KV-450 with a redshift distribution similar to that of the bright sample used here. We define a Gaussian likelihood and run an MCMC to obtain the best-fit parameters, update the covariance matrix with them, and then run the MCMC again with the updated covariance matrix to obtain the posterior.
The posteriors of the UR and OR cases are shown as pink and blue contours in Fig. D.3. The mean parameter values in the converged MCMC chains and their 1σ uncertainties are summarised in Table D.1. The theoretical w(θ) calculated from the best-fit parameters of the OR case is shown as dashed black curves in Fig. D.2. The reduced χ² between the OR 2PCF and the best-fit 2PCF is 1.06, corresponding to a PTE of 0.38, indicating a good fit between the model and the data. All three parameters are constrained: notably, the matter density agrees with previous cosmological probes, and the galaxy biases are close to 1 with a slightly increasing trend. However, we note that the bias parameters are highly degenerate with σ8, which is fixed in our case. Breaking this degeneracy requires introducing matter field tracers, such as cosmic shear, or using the halo model as in Dvornik et al. (2023); this is the subject of the ongoing KiDS-Legacy 3×2pt and 6×2pt projects.
From the data and the posterior, we find that the difference between UR and OR is at the ∼1σ level, suggesting that variable depth in the bright sample is much less pronounced than in the faint sample. This is expected, as the detectability of bright galaxies is less affected by Galactic or atmospheric foregrounds.
Table D.1. Parameter fit from the KiDS-1000 bright galaxy 2PCF.
Fig. D.3. Posteriors of Ωm, b1, and b2 fit by the 2PCF from the KiDS-1000 bright sample. The dark and light shaded contours show the 68% and 95% credible levels, respectively. The blue and pink contours correspond to the posteriors from the 2PCF corrected with OR and UR, respectively.
All Figures

Fig. 1. Galaxy distribution of the KiDS-Legacy catalogue. The map is pixelised into HEALPix grids with Nside = 2048. The map is colour-coded by the galaxy number per pixel, each pixel having a size of 1.7 arcmin. A tile-based selection pattern can be seen by eye. The shaded region in the colour bar shows the normalised distribution of the galaxy number per pixel.

Fig. 2. Maps of the KiDS-Legacy systematics considered in this paper. Each map is divided into northern and southern fields, plotted together with their colour bars. The black curves over-plotted in the colour bars show the relationship between the galaxy contrast and the systematics value; the dynamic ranges are [−0.25, 0.25]. The shaded regions show the probability distributions of each systematic.

Fig. 3. Flowchart of the measurement of the 2PCF with reconstructed organised randoms. The colour block after the 2PCF indicates the colours shown in the following w(θ)−θ figures.

Fig. 4. Flowchart illustrating the SOM+HC method to recover the OR and correct selection effects in the 2PCF. Starting from the top: 1. Input systematics; 2. SOM training: left panel: SOM grid colour-coded by the galaxy number in each cell; right panel: projection of the systematics vectors (small grey dots) and SOM cells (red dots) onto the plane of two systematics, with projected adjacent SOM cells connected by black lines; 3. HC output: left panel: SOM cells colour-coded according to hierarchical cluster indices; right panel: systematics vectors and weight vectors colour-coded by the corresponding cluster indices; 4. Effective areas corresponding to galaxies from each cluster; 5. Recovered OR weight map, which is used in the subsequent 2PCF measurement.

Fig. 5. Example of a dendrogram showing the clustering of 900 SOM cells into 20 HCs. The cells are clustered from the bottom to the top according to their Euclidean distance in the systematics space. The black dashed line shows the distance threshold at which the cells are grouped into 20 clusters. SOM cells ending up in the same cluster are colour-coded with the same colour.

Fig. 6. Left panel: galaxy number per pixel in a subregion of the KiDS-Legacy footprint; right panel: OR weight in the same region.

Fig. 7. Left panels: spatial distribution of the four toy systematics. Systematics A1 and A2 are uniform within each tile but differ across tiles (type A); systematic B varies within each tile (type B); systematic C is tile-independent (type C). Right panels: the black curves in the colour bars show the selection function of each systematic and the shaded regions show the normalised distributions of the systematics. The numbers on the right show the selection rate values.

Fig. 8. Flowchart of the toy-model validation of the SOM+HC method. We note that the UR case (w(θ) calculated with the depleted mock catalogue and the mock UR) is not shown in this figure.

Fig. 9. Relationship between the galaxy number contrast and the mean systematic value of each hierarchical cluster. The blue curves are the number contrasts derived from the input selection rates (black curves in the colour bars of Fig. 7). The average median systematics and number contrasts are calculated by averaging the sorted values in each realisation across all realisations. The standard errors are presented as error bars; those on the median systematics are too small to be visible.

Fig. 10. Top panel: true OR weights (normalised by their mean) calculated from the total selection function of the toy systematics; middle panel: recovered OR weights generated by the SOM+HC method. Both panels show only part of the footprint; the holes in the maps are masked regions around point sources. Bottom panel: relative difference between the recovered and the true OR weights.

Fig. 11. Top panel: measured w(θ) in the toy-systematics test. The definitions of the four w(θ) are presented in Table 2. The data points are the mean w(θ) from 40 realisations and the error bars are the diagonal elements of the covariance matrices evaluated from the realisations. The black curve is the theoretical w(θ) computed with PYCCL (Chisari et al. 2019) using the same cosmology and redshift distribution. The shaded region shows angular scales smaller than 8 h−1 Mpc evaluated at the mean redshift. The middle panel shows the relative bias of each w(θ) with respect to the "No selection" case, and the bottom panel shows the w(θ) bias relative to the error. Most points of the UR case are drastically biased and lie outside the range of the middle and bottom panels.

Fig. 12. Flowchart of the data-driven test.

Fig. 13. Upper panel: part of the data-driven OR weight generated from the KiDS-Legacy map; middle panel: mock OR weight recovered from the GLASS mock galaxy sample selected according to the data-driven OR; bottom panel: relative difference between the recovered and the true OR weights. Both OR weights are generated with 600 hierarchical clusters and pixelised on a HEALPix map with Nside = 2048.

Fig. 14. 2PCFs measured in the data-driven systematics test with the same choices of NCKiDS (the number of hierarchical clusters for the data-driven OR) and NCrec (for the recovered mock OR). Each column of panels corresponds to one set-up. Nside is fixed at 2048 for both the data-driven OR and the recovered OR. The right-most column is an enhanced data-driven OR with m = 1.5 according to Eq. (21). The top panels show the 2PCF data points calculated as the mean values from 40 GLASS realisations; the error bars are the square roots of the diagonal terms of the theoretical covariance matrix. The middle panels show the biases of the 2PCFs with respect to the "No selection" 2PCFs, and the bottom panels show the biases relative to the errors in the 2PCFs (the UR case is well beyond the range). The shaded regions are angular scales corresponding to physical scales r < 8 h−1 Mpc at the mean redshift of the galaxy sample.

Fig. 15. 1σ confidence contours of the parameter posterior shift with respect to the best-fit values constrained from the "No selection" 2PCF. The pink arrow indicates the direction of the UR contour, whose best-fit values of ΔΩm = −0.03 and Δb = 1.33 lie well outside the dynamic range of the plot.

Fig. 16. Self-organising maps trained on the KiDS-Legacy catalogue, with dimension 30 × 30. The first five panels show the SOM coloured by the average systematics values in each cell. The last panel (bottom right) shows the SOM coloured by the galaxy number contrast of each hierarchical cluster. The black lines are the boundaries of each HC. We note that we use a toroidal topology for the SOM, so the left and right edges, as well as the top and bottom edges, are adjacent.

Fig. 17. Spearman correlation coefficient matrix. The number in each grid cell is the correlation coefficient between the median systematics and the galaxy contrast in each hierarchical cluster.

Fig. 18. Blinded 2PCFs measured from the KiDS-Legacy sample. The blue and pink dots are the measurements corrected by the recovered organised random and by the uniform random, respectively. The shaded regions are angular scales corresponding to physical scales smaller than 8 h−1 Mpc estimated at the mean redshift. The error bars are the standard deviations derived from the covariance matrix computed by the ONECOVARIANCE code. The black dashed curve shows the best-fit 2PCF from the MCMC.

Fig. 19. Contours of the 68.3% and 95.4% credible levels of the parameter posterior shift with respect to the best-fit values constrained from the NC = 600 case. Left panel: contours from the OR 2PCFs with different NC choices; the fiducial choice NC = 600 is shown as green filled contours. Right panel: contours of the OR 2PCF (NC = 600, green) and the UR 2PCF (pink).

Fig. A.1. Comparison of the covariance matrices. The panels from top to bottom show the square roots of the main covariance diagonal and of the 5th and 10th diagonals above the main diagonal, as indicated by the black grids in the theoretical covariance matrices shown in the top right of each panel. The coloured curves show the covariance terms of the four 2PCFs defined in Table 2 from the data-driven test with {NCKiDS = NCrec = 600}; the dotted lines are calculated with the ONECOVARIANCE code. The lower x-ticks are the row indices, while the upper x-ticks on the top panel give the corresponding angular scales for the diagonal terms.

Fig. B.1. Upper panel: pseudo-Cℓ (PCL) measured from the GLASS mock samples. The data points of each series are the average PCL from 40 realisations; the error bars are the square roots of the covariance diagonal given by NAMASTER. The χd² values, which describe the difference between each dataset and the "No selection" PCL, are calculated similarly to Eq. (18). Lower panel: relative difference of each case with respect to the "No selection" case. The shaded regions are the ℓ modes corresponding to physical scales smaller than 8 h−1 Mpc estimated at the mean redshift.

Fig. B.2. 1σ credible contours of the parameter posterior shift with respect to the best-fit values constrained from the "No selection" PCL.

Fig. C.1. Spearman correlation coefficient matrix. The number in each grid cell is the correlation coefficient between the median systematics and the galaxy contrast in each hierarchical cluster.

Fig. C.2. Blinded 2PCFs measured from the KiDS-Legacy catalogue. The pink dots are the UR measurements; the blue dots are our fiducial OR measurements (OR recovered from 5 systematics) and the orange dots are measured with the OR recovered from 16 systematics. The error bars are derived from the theoretical covariance matrices with the best-fit UR and OR parameters.

Fig. D.1. Photometric redshift distribution of the KiDS-1000 bright sample, with the zphot = {0.02, 0.2, 0.5} redshift bins (dashed lines) employed in our w(θ) measurements.

Fig. D.2. 2PCFs measured from the KiDS-1000 bright sample. The three panels show the auto-correlation w(θ) of the whole sample and of the first and second tomographic bins, respectively. The blue and pink dots are the measurements corrected by the recovered organised random and by the uniform random, respectively. The shaded regions are angular scales corresponding to physical scales smaller than 8 h−1 Mpc estimated at the mean redshift. The error bars are the standard deviations derived from the covariance matrix provided by the ONECOVARIANCE package. The orange curves are the OR-corrected 2PCF data points measured by Johnston et al. (2021). The black dashed curves in the second and third panels are the best-fit 2PCFs from the MCMC.