Open Access
Issue
A&A
Volume 666, October 2022
Article Number A181
Number of page(s) 15
Section Cosmology (including clusters of galaxies)
DOI https://doi.org/10.1051/0004-6361/202243632
Published online 24 October 2022

© M. Kerscher 2022

Licence Creative CommonsOpen Access article, published by EDP Sciences, under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This article is published in open access under the Subscribe-to-Open model. Subscribe to A&A to support open access publication.

1. Introduction

Statistical approaches are often used to characterise the large-scale structure of the galaxy distribution, wherein it is assumed that the distribution of galaxies is a realisation of a point process (see e.g. Neyman & Scott 1958; Peebles 1980). Observations give us the positions of galaxies in space. From this set of points, we estimate the moments of the point process, specifically the two-point correlation function ξ(r) or the two-point density

ϱ 2 ( x , y ) = ϱ 2 ( 1 + ξ ( | x y | ) ) , $$ \begin{aligned} \varrho _2({\boldsymbol{x}},{\boldsymbol{y}}) = \varrho ^2\ \left(1 + \xi (|{\boldsymbol{x}}-{\boldsymbol{y}}|)\right), \end{aligned} $$(1)

the probability of finding two galaxies at x and y, where 𝜚 is the number density. In a homogeneous and isotropic point process, ξ(r) only depends on the separation r = |x − y|. To determine ξ(r) from a galaxy catalogue within a finite domain W ⊂ ℝ3, we use estimators. In cosmology, estimators based on random point sets are most commonly used. These rely on the data–data DD, data–random DR, and random–random RR pair-counts (see below for the definition).

The two-point correlation function of the galaxy distribution is often used to constrain models of structure and galaxy formation and to estimate parameters of cosmological models. In current and upcoming galaxy samples, the positions of millions and up to billions of galaxies will be observed (Dawson et al. 2013, BOSS; Ross et al. 2020, eBOSS; Abbott et al. 2022, DES; Aghamousa et al. 2016, DESI; Alarcon et al. 2021, PAUS; Ivezić et al. 2019, LSST; Amendola et al. 2018, Euclid). Fast and reliable methods for calculating the two-point correlation function are needed. These large samples allow us to reduce the statistical error from cosmic variance, but for the error budget of the two-point correlation function we also have to control the systematic errors. One systematic contribution is the error from random sets used in the pair-counts DR and RR. As an example, consider the baryon acoustic oscillations (BAOs) which lead to a peak in the two-point correlation function of galaxies at a scale of about 100 Mpc h−1 (Eisenstein et al. 2005; Bautista et al. 2021). This BAO peak has a height of approximately 0.01 above zero (compare Fig. 2 in Eisenstein et al. 2005). For a percent level accuracy, we need to calculate the two-point correlation function with an absolute accuracy of less than 10−4. Below, we show how to reduce the systematic error to this level of accuracy without sacrificing efficiency.

Several estimators for the two-point correlation function have been developed (Peebles & Hauser 1974; Hewett 1982; Davis & Peebles 1983; Rivolo 1986; Landy & Szalay 1993; Hamilton 1993). By comparing these estimators to a reference result from a cosmological simulation, Kerscher et al. (2000) found that the Landy & Szalay (1993) estimator is the preferred estimator with the smallest deviation from the reference and also that its bias is negligible compared to its variance.

We focus on methods for increasing the numerical accuracy of the pair-counts as used in these estimators. The random point set, shared in the pair-counts RR and DR is used to correct for boundary (finite-size) and inhomogeneous sampling effects. As we outline below, RR and DR are Monte Carlo volume integration schemes. As expected from standard Monte Carlo integration, the error of these pair-counts RR and DR scales at least as 1 / N r $ 1/\sqrt{{{{N_{\text{r}}}}}} $, where Nr is the number of random points used (see Sect. 3 for a more differentiated view). This slow convergence rate makes increasing the accuracy costly, and sometimes unfeasible. To improve the accuracy of the pair-counts without sacrificing efficiency, we follow two directions:

  • The pair-counts can be expressed as averages of specific volume fractions. We use this to propose special adapted volume integration schemes which, in turn, can be calculated more efficiently than the standard approach.

  • We replace the standard Monte Carlo scheme with a quasi-Monte Carlo integration, which leads to an improved scaling of the error that is almost proportional to 1/Nq, where Nq is the number of points from a low-discrepancy sequence.

We compare the standard and the new methods to exactly known reference values. This allows us to empirically validate the asymptotic scaling of the errors.

A variety of approaches have been suggested to improve the speed and accuracy of estimators for the two-point correlation function. Keihänen et al. (2019) show that at fixed computational cost, a split random catalogue improves the accuracy of estimators for the two-point correlation function. For galaxy catalogues, Demina et al. (2018) achieve a speedup by factorising the calculations in radial (redshift) and angular coordinates (see also Breton & de la Torre 2021). Perhaps closest to our work are the investigations by Dávila-Kurbán et al. (2021). These authors use glass-like point sets instead of the random catalogues, where we use low-discrepancy sequences. As reference values in our comparisons, we use the exact results from Baddeley et al. (1993) and Kerscher (1999) for a rectangular box as summarised in Appendix B (for periodic boxes see Appendix A.1). He (2021) also discusses some approximations for these exact results.

Other approaches focus on the computational problem of calculating the pair-counts. Tree-based methods can be significantly faster than a direct implementation of the pair-counts, specifically for small radii (Moore et al. 2001, see also Appendix C). The double loop in the pair-count calculations can be parallelised. Alonso (2012) showed how to obtain a speedup by a factor of 100 over the direct implementation by using multi-threading on multi-core CPUs or utilising many cores in GPUs. It is well known from matrix computations that the memory layout of the data can have dramatic consequences for the run time of algorithms (see e.g. Anderson et al. 1999). Also, for the pair-counts, a clever layout of the coordinates in the memory can lead to a significant speedup (Donoso 2019). A similar approach can be combined with multi-threading and vectorisation resulting in a blazingly fast code (Sinha & Garrison 2020). Our conceptual improvements can be combined with these computational speedups.

In Sect. 2 we give the definition of the pair-counts and discuss the geometry of the expected pair-counts. In Appendix A we give the details of the derivations and in Appendix B we summarize some results for simple sample geometries. Together with Sect. 2, this enables us to calculate exact reference values for the pair-counts. At the end of Sect. 2, we give a short introduction to the quasi-Monte Carlo method, as used in our improved estimators. In Sects. 3 and 4, we compare the standard and improved versions of the pair-counts with the exactly known reference values. We discuss the scaling of the error with the number of points. We put this together in Sect. 5 and show how an improved version of the Landy & Szalay (1993) estimator can be constructed. Again we discuss the scaling of the error with the number of points. In Sects. 5.1 and 5.2, we compare the run times of the new estimator with those of the standard Landy & Szalay (1993) estimator in some typical situations. In Sect. 6, we discuss how these improvements have to be adapted to estimate the two-point correlation function from an inhomogeneous sampled galaxy distribution. We summarise in Sect. 7 and give some recommendations. In Appendix C, we discuss details of the implementation, the run times, and give a link to the code.

2. Pair-counts, geometry, and quasi-Monte Carlo

The set of the N data points (e.g. galaxies) is { x i } i = 1 N $ \{{\boldsymbol{x}}_i\}_{i=1}^N $ with all points xi ∈ W inside the observation window W ⊂ ℝ3, that is, inside the unmasked area. The number density is estimated with ρ ^ = N | W | $ \widehat{\rho}=\tfrac{N}{|{W}|} $, where |W| is the volume of W. We then define

DD ( r ) = 1 N 2 i = 1 N j = 1 , j i N δ k r δ ( | x i x j | ) , $$ \begin{aligned} \mathsf{DD }(r) = \frac{1}{N^2} \sum _{i=1}^N\sum _{j=1, j\ne i}^N\ {\delta }k_r^{\delta }(|{\boldsymbol{x}}_i-{\boldsymbol{x}}_j|), \end{aligned} $$(2)

the normalised number of data–data pairs with a distance of r = |xi − xj| in the interval [r, r + δ]. We use a rectangular kernel

k r δ ( s ) = 1 δ 1 [ r , r + δ ] ( s ) , $$ \begin{aligned} k_r^{\delta }(s) = \tfrac{1}{{\delta }}{\mathbb{1} }_{[r,r+{\delta }]}(s), \end{aligned} $$(3)

with the indicator function of the set A defined as

1 A ( q ) = { 1 if q A , 0 else . $$ \begin{aligned} {\mathbb{1} }_A(q) = {\left\{ \begin{array}{ll} 1&\mathrm{if}\; q\in A,\\ 0&\mathrm{else}. \end{array}\right.} \end{aligned} $$(4)

Also, other kernels with k r δ ( s ) d s = 1 $ \int k_r^{{\delta}}(s){{\mathrm{d}}} s=1 $ are possible (e.g. triangular, truncated Gaussian, or Epanechnikov). We consider Nr randomly distributed points { y j } j = 1 N r $ \{{\boldsymbol{y}}_j\}_{j=1}^{{{{N_{\text{r}}}}}} $, all inside the sample geometry yj ∈ W. The normalised number of data-random pairs with a distance in [r, r + δ] is denoted by

DR ( r ) = 1 N N r i = 1 N j = 1 N r δ k r δ ( | x i y j | ) . $$ \begin{aligned} \mathsf{DR }(r) = \frac{1}{N {{N_{\rm r}}}} \sum _{i=1}^N\ \sum _{j=1}^{{{N_{\rm r}}}}\ {\delta }k_r^{\delta }(|{\boldsymbol{x}}_i-{\boldsymbol{y}}_j|). \end{aligned} $$(5)

Similarly,

RR ( r ) = 1 N r 2 i = 1 N r j = 1 , j i N r δ k r δ ( | y i y j | ) , $$ \begin{aligned} \mathsf{RR }(r) = \frac{1}{{{N_{\rm r}}}^2}\sum _{i=1}^{{{N_{\rm r}}}} \sum _{j=1, j\ne i}^{{{N_{\rm r}}}}\ {\delta }k_r^{\delta }(|{\boldsymbol{y}}_i-{\boldsymbol{y}}_j|), \end{aligned} $$(6)

is the normalised number of random–random pairs. The Landy & Szalay (1993) estimator is defined as

ξ LS ( r ) = DD ( r ) 2 DR ( r ) + RR ( r ) RR ( r ) · $$ \begin{aligned} \xi _{\rm LS}(r) = \frac{\mathsf{DD }(r) - 2\mathsf{DR }(r) + \mathsf{RR }(r)}{\mathsf{RR }(r)}\cdot \end{aligned} $$(7)

Also, the estimators provided by Peebles & Hauser (1974), Hewett (1982), Davis & Peebles (1983), and Hamilton (1993) can be defined in terms of the pair-counts and our results apply accordingly.

2.1. Geometry of pair-counts

The expectation of the pair-counts DR and RR can be expressed in terms of geometric quantities depending on the sample window W and on the point set (for DR, Kerscher 1999). We first consider the set-covariance

γ W ( x ) = | W W x | = R 3 1 W ( y ) 1 W ( y + x ) d y , $$ \begin{aligned} \gamma _{W}({\boldsymbol{x}})&= |{W} \cap {W}_{\boldsymbol{x}}| = \int _{{\mathbb{R} }^3}{\mathbb{1} }_{W}({\boldsymbol{y}}){\mathbb{1} }_{W}({\boldsymbol{y}}+{\boldsymbol{x}}) {\mathrm{d} }{\boldsymbol{y}}, \end{aligned} $$(8)

where Wx is the shifted window W, that is, the set of all points from W shifted by the vector x. |W ∩ Wx| is the volume of the set W ∩ Wx. The isotropised set-covariance γ W ¯ ( r ) $ \overline{\gamma_{W}}(r) $ can be calculated from γW(x):

γ W ¯ ( r ) = 1 4 π 0 π 0 2 π γ W ( x ( r , θ , ϕ ) ) sin ( θ ) d θ d ϕ . $$ \begin{aligned} \overline{\gamma _{W}}(r) = \frac{1}{4\pi }\int _0^\pi \int _0^{2\pi }\, \gamma _{W}({\boldsymbol{x}}(r,\theta ,\phi ))\ \sin (\theta ){\mathrm{d} }\theta {\mathrm{d} }\phi . \end{aligned} $$(9)

Here x(r, θ, ϕ) = (rcos(ϕ)sin(θ),rsin(ϕ)sin(θ),rcos(θ)). For a large number of random points Nr one obtains (see Appendix A):

RR ( r ) RR ( r ) = 4 π | W | 2 r r + δ γ W ¯ ( s ) s 2 d s 4 π r 2 δ | W | 2 γ W ¯ ( r ) for δ small , $$ \begin{aligned} \mathsf{RR }(r) \rightarrow \ {\mathcal{RR} }(r)&= \frac{4\pi }{|{W}|^2} \int _{r}^{r+{\delta }}\!\! \overline{\gamma _{W}}(s)\ s^2{\mathrm{d} }s\\&\approx \frac{4\pi r^2 {\delta }}{|{W}|^2}\ \overline{\gamma _{W}}(r) \quad \mathrm{for}\; {\delta }\; \mathrm{small}, \nonumber \end{aligned} $$(10)

where ℛℛ(r) is the expectation value of the pair-count RR(r) illustrating its geometric nature.

Ripley (1988) used a local area weight in an estimator for his K-function (the normalised integrated two-point density) and Rivolo (1986) considered a similar weight in his estimator for the two-point correlation function. This weight is inversely proportional to the fraction of the surface area of a sphere Br(y) with radius r centred on the point y inside W:

area ( B r ( y ) W ) = 0 π 0 2 π 1 W ( y + x ( r , θ , ϕ ) ) sin ( θ ) d θ d ϕ . $$ \begin{aligned} \mathrm{area}(\partial {B}_{r}({\boldsymbol{y}})\cap {W}) = \int _0^\pi \int _0^{2\pi }\, {\mathbb{1} }_{W}({\boldsymbol{y}}+{\boldsymbol{x}}(r,\theta ,\phi ))\ \sin (\theta ){\mathrm{d} }\theta {\mathrm{d} }\phi . \end{aligned} $$(11)

For a large number of random points Nr, one obtains (see Appendix A):

DR ( r ) DR ( r ) = 1 | W | N i = 1 N r r + δ area ( B s ( x i ) W ) d s 1 | W | N i = 1 N area ( B r ( x i ) W ) δ , for δ small . $$ \begin{aligned} \mathsf{DR }(r) \rightarrow \ {\mathcal{DR} }(r)&= \frac{1}{|{W}|N}\sum _{i=1}^N \int _r^{r+{\delta }}\!\!\!\!\! \mathrm{area}(\partial {B}_{s}({\boldsymbol{x}}_i)\cap {W})\,{\mathrm{d} }s\\&\approx \frac{1}{|{W}|N} \sum _{i=1}^N \mathrm{area}(\partial {B}_{r}({\boldsymbol{x}}_i)\cap {W})\,{\delta }, \ \mathrm{for}\; {\delta }\; \mathrm{small}. \nonumber \end{aligned} $$(12)

As before, 𝒟ℛ(r) is the expectation value of the pair-count DR(r) illustrating its geometric nature. However, now both DR(r) and 𝒟ℛ(r) are depending on the points { x i } i = 1 N $ \{{\boldsymbol{x}}_i\}_{i=1}^N $ under consideration.

In Appendix B, we give expressions for γ W ¯ ( r ) $ \overline{\gamma_{W}}(r) $ and area(∂Br(y)∩W) if W is a rectangular box or a sphere. From these expressions, we calculate the reference values 𝒟ℛ and ℛℛ. This allows us to compare different integration schemes for the pair-counts DR and RR and we can investigate the scaling of the accuracy with the number of points used in these methods.

2.2. Quasi-Monte Carlo

In our improved methods for estimating the pair-counts, we use quasi-Monte Carlo integration. In a standard Monte Carlo integration scheme, one uses Nr random points {y1, …, yNr} to estimate the integral ∫[0, 1]df(x)dx by 1 N r i = 1 N r f ( y i ) $ \frac{1}{{{{N_{\text{r}}}}}} \sum_{i=1}^{{{N_{\text{r}}}}}f({\boldsymbol{y}}_i) $. The accuracy can be estimated using the Chebyshev inequality, which tells us that the probability of an error exceeding a given threshold decreases with 1 / N r $ 1/\!\sqrt{{{{N_{\text{r}}}}}} $. In other words, the standard error of a Monte Carlo integration scales as 1 / N r $ 1/\!\sqrt{{{{N_{\text{r}}}}}} $.

With quasi-Monte Carlo methods, we use 1 N q i = 1 N q f ( q i ) $ \frac{1}{{{{N_{\text{q}}}}}} \sum_{i=1}^{{{N_{\text{q}}}}}f({\boldsymbol{q}}_i) $ to numerically integrate ∫[0, 1]df(x)dx (see e.g. Niederreiter 1992). This estimate of the integral almost looks identical to the Monte Carlo integration above. However, for a quasi-Monte Carlo integration, the points in Q = {q1, …, qNq} are not random. It is essential for the application of quasi-Monte Carlo integration that for a given point set Q the error bound

| [ 0 , 1 ] d f ( x ) d x 1 N q i = 1 N q f ( q i ) | V ( f ) D ( Q ) $$ \begin{aligned} \left| \int _{[0,1]^d} f(\boldsymbol{x})\mathrm{d} {\boldsymbol{x}} - \frac{1}{{{N_{\rm q}}}} \sum _{i=1}^{{N_{\rm q}}}f({\boldsymbol{q}}_i)\right| \le V(f)\, D(Q) \end{aligned} $$(13)

factorises into a measure of variation V(f) depending only on properties of f, and a measure of discrepancy D(Q) depending only on the properties of the point set Q. If we consider functions of bounded variation, Eq. (13) is referred to as the Koksama-Hlawka bound and the measure of discrepancy is the star discrepancy (see e.g. L’Ecuyer & Lemieux 2002). To control the error bound (Eq. (13)), we have to control D(Q) (we note that V(f) does not depend on the point set Q). Low-discrepancy sequences, such as the Halton sequence, have been constructed with that in mind. For such sequences, Halton (1960) showed that

D ( Q ) ( log N q ) d N q · $$ \begin{aligned} D(Q) \propto \frac{(\log {{N_{\rm q}}})^d}{{{N_{\rm q}}}}\cdot \end{aligned} $$(14)

For small dimensions, d, this compares favourably to a Monte Carlo integration where the standard error only scales as 1 / N r $ 1/\!\sqrt{{{{N_{\text{r}}}}}} $. Halton sequences can be used to estimate integrals over indicator functions, which in turn define volumes like the set covariance. As indicator functions have bounded total variation, Eq. (13) applies.

Upper bounds like Eq. (13) are worst-case bounds. Owen & Rudolf (2021) derive an analogue to a strong law of large numbers for randomised low-discrepancy sequences. This further justifies the procedure for scrambling the Halton sequence developed by Owen (2017), where the scaling from Eq. (14) in Eq. (13) still gives an upper bound, but on average smaller errors are expected. We use these randomised Halton sequences in our calculations (see Appendix C).

3. RR

First we investigate how the accuracy of the standard RR, as given in Eq. (6), scales with the number of random points used. The numerical implementation of the pair-counts is discussed in Appendix C. The expectation value ℛℛ of RR can be expressed in terms of the isotropised set-covariance; see Eq. (10). Using the Eqs. (B.1) and (B.3) for the isotropised set covariance, we calculate ℛℛ(r) as a reference value for rectangular boxes W.

Figure 1 shows the scaling of the relative error εRR = |ℛℛ(r)−RR(r)|/ℛℛ(r) with the number of random points used. The error εRR is the mean value calculated from 100 samples of the random point sets. Only for the largest Nr do we use 10 samples. This also applies to the errors calculated in the following sections. We show the results for a rectangular cuboid W = [0, 1]3 and for a thin rectangular box W = [0, 1]×[0, 0.1]2. With the thin box, we investigate how the error is affected in a sample W where the boundary effects are more dominant. Landy & Szalay (1993, their Eq. (21)) show that the relative error of RR is of the form A N r + B N r $ \frac{A}{\sqrt{{{{N_{\text{r}}}}}}}+\frac{B}{{{{N_{\text{r}}}}}} $, with A and B depending on the window W and the bin width. As expected, we see in Fig. 1 that the error scales as 1 / N r $ 1/\!\sqrt{{{{N_{\text{r}}}}}} $ for large Nr. For small Nr and small radii, the contribution proportional to 1/Nr appears. This additional contribution can be seen in the voluminous sample but not in the thin box. Simply using a three-dimensional low-discrepancy sequence instead of the random points in Eq. (6) is not feasible, as we see below.

thumbnail Fig. 1.

Relative error εRR against the number of random points Nr used in the standard procedure for calculating RR(r) in rectangular windows W. The black dotted lines are proportional to 1/​Nr and 1 / N r $ 1/\!\sqrt{{{{N_{\text{r}}}}}} $.

From Eqs. (A.4) and (A.5) we get

RR ( r ) = 1 | W | 2 R 3 R 3 1 W ( y ) 1 W ( z ) δ k r δ ( | y z | ) d y d z . $$ \begin{aligned} {\mathcal{RR} }(r) = \frac{1}{|{W}|^2} \int _{\mathbb{R} ^3} \int _{\mathbb{R} ^3} {\mathbb{1} }_{W}({\boldsymbol{y}}){\mathbb{1} }_{W}({\boldsymbol{z}})\, {\delta }k_r^{\delta }(|{\boldsymbol{y}}-{\boldsymbol{z}}|) \, {\mathrm{d} }{\boldsymbol{y}}\,{\mathrm{d} }{\boldsymbol{z}}. \end{aligned} $$(15)

This allows a more flexible approach. Consider two (random) point sets P 1 = { y i } i = 1 N r $ P_1=\{{\boldsymbol{y}}_i\}_{i=1}^{{{{N_{\text{r}}}}}} $ and P 2 = { z j } j = 1 N r $ P_2=\{{\boldsymbol{z}}_j\}_{j=1}^{{{{N_{\text{r}}}}}} $, with the points yi ∈ W, zj ∈ W all inside the sample geometry. We then define

RR 2 x 3 d ( r ) = 1 N r 2 i = 1 N r j = 1 N r δ k r δ ( | y i z j | ) , $$ \begin{aligned} \mathsf{RR }_{\rm 2x3d}(r) = \frac{1}{{{N_{\rm r}}}^2}\sum _{i=1}^{{{N_{\rm r}}}} \sum _{j=1}^{{{N_{\rm r}}}}\ {\delta }k_r^{\delta }(|{\boldsymbol{y}}_i-{\boldsymbol{z}}_j|), \end{aligned} $$(16)

which is also an estimate of ℛℛ(r). If the points in P1 and P2 are drawn from a Poisson process, the estimate from Eq. (16) is almost the same as the estimate from Eq. (6), because points in a Poisson process are independent. See also Dávila-Kurbán et al. (2021), who observe that one cannot use P1 ≡ P2 for RR2x3d when they construct an estimator for the two-point correlation function using ‘glass like’ point sets. The representation of ℛℛ in Eq. (15) as a six-dimensional integral suggests a six-dimensional (quasi-)Monte Carlo approach. Consequently, we use six-dimensional random points or a six-dimensional randomised Halton sequence { q i } i = 1 N q $ \{{\boldsymbol{q}}_i\}_{i=1}^{{{{N_{\text{q}}}}}} $, which we split as qi = (yi, zi) into two three-dimensional sequences. We scale the points in the sequences { y i } i = 1 N q $ \{{\boldsymbol{y}}_i\}_{i=1}^{{{{N_{\text{q}}}}}} $ and { z i } i = 1 N q $ \{{\boldsymbol{z}}_i\}_{i=1}^{{{{N_{\text{q}}}}}} $ such that each yi ∈ W and zj ∈ W are uniformly distributed inside the sample geometry.

Figure 2 compares the scaling of the relative error εRR, 2x3d = |ℛℛ(r)−RR2x3d(r)|/ℛℛ(r) with the number of (quasi-)random points. For random points, we see the expected scaling 1 / N $ 1/\!\sqrt{N} $ of the error. Using the quasi-Monte Carlo approach, we see that the error scales proportionally to 1/N, which is even faster than the theoretical expectation according to Eq. (14). Using a low-discrepancy sequence, we gain more than two orders of magnitude in accuracy compared to the random point sets.

thumbnail Fig. 2.

Relative error εRR, 2x3d calculated for rectangular windows W with a pure Monte Carlo integration (solid lines) and with a randomised Halton sequence (dashed lines). The black dotted lines are proportional to 1 / N r $ 1/\!\sqrt{{{{N_{\text{r}}}}}} $ (upper) and 1/Nq (lower).

4. DR

In Sect. 2.1 we show that the expectation of DR(r) is 𝒟ℛ(r) and that it can be calculated from the area fraction area(∂Br(xi)∩W). For a rectangular box, Baddeley et al. (1993) gave explicit expressions for area(∂Br(xi)∩W) (see Appendix B.1). We use these expressions to calculate 𝒟ℛ(r) according to Eqs. (A.9) and (A.10). In the numerical integration of Eq. (A.9), we make sure to achieve a relative error of at least 10−10 for our reference value 𝒟ℛ(r). Both DR(r) and 𝒟ℛ(r) depend on the point set under consideration. We need a realistic data set D = { x i } i = 1 N $ \{{\boldsymbol{x}}_i\}_{i=1}^N $ to calculate both the theoretical reference 𝒟ℛ(r) and the different estimates for DR(r). For this purpose, we use a sample of simulated galaxy clusters1 from the Magneticum simulation (Hirschmann et al. 2014; Ragagnin et al. 2017). The side length of the simulation box is 325 Mpc h−1 and we use the real space positions of the 10 429 simulated clusters. We rescale the coordinates by the side-length of the box such that all points are inside W = [0, 1]3. In the smaller window W = [0, 1]×[0, 0.1]2 only 86 clusters are left.

Now we are set to determine the scaling of the relative error εDR = |𝒟ℛ(r)−DR(r)|/𝒟ℛ(r) with the number of (quasi-)random points used. Figure 3 compares the standard approach with ordinary random numbers in DR(r) to a pair-count determined using a low-discrepancy sequence instead of the random points. Using a randomised Halton sequence, we only observe a minor gain for small N, but for large N the error is reduced by an order of magnitude. In the small window, W = [0, 1]×[0, 0.1]2, this is more pronounced. The scaling follows the expected behaviour from Eq. (14).

thumbnail Fig. 3.

Relative error εDR calculated for rectangular windows W with a standard Monte Carlo integration (solid lines) and with a quasi-Monte Carlo scheme using a randomised Halton sequence (dashed lines). The black dotted lines are proportional to 1 / N $ 1/\!\sqrt{N} $ (upper) and (log N)3/N (lower).

We can do better if we consider the geometry of DR(r). From Eqs. (A.9) and (A.10) we know that the expectation value of DR(r) is DR ( r ) = 1 | W | N i = 1 N V r δ ( x i ) $ {{\mathcal{DR}}}(r)=\frac{1}{|{W}|N}\sum_{i=1}^N{{\mathcal{V}}}_r^{{\delta}}({\boldsymbol{x}}_i) $, where

V r δ ( x i ) = r r + δ area ( B s ( x i ) W ) d s = | S r δ ( x i ) W | $$ \begin{aligned} {\mathcal{V} }_r^{\delta }({\boldsymbol{x}}_i)&= \int _r^{r+{\delta }}\!\!\!\!\! \mathrm{area}(\partial {B}_{s}({\boldsymbol{x}}_i)\cap {W})\,{\mathrm{d} }s = |S_r^{\delta }({\boldsymbol{x}}_i)\cap {W}| \end{aligned} $$(17)

is the volume of the spherical shell with a radial range in [r, r + δ] around xi inside the sample geometry W. As already suggested by Rivolo (1986), this directly leads to a Monte Carlo scheme. With Nsh points { y i } i = 1 N sh $ \{{\boldsymbol{y}}_i\}_{i=1}^{{{{N_{\text{sh}}}}}} $ (quasi-)randomly distributed in the shell S r δ ( x i ) = { y 3 | s < | y x i | s + δ } $ S_r^{{\delta}}({\boldsymbol{x}}_i)=\{{\boldsymbol{y}}\in{{\mathbb{R}}}^3 \,|\, s < |{\boldsymbol{y}}-{\boldsymbol{x}}_i|\le s+{{\delta}} \} $ around xi we define a (quasi-)Monte Carlo estimate of V r δ ( x i ) $ {{\mathcal{V}}}_r^{{\delta}}({\boldsymbol{x}}_i) $:

V s δ ( x i ) = | S r δ | N sh j = 1 N sh 1 W ( y j ) , $$ \begin{aligned} \mathsf{V }_s^{\delta }({\boldsymbol{x}}_i)&= \frac{|S_r^{\delta }|}{{{N_{\mathrm{sh}}}}} \sum _{j=1}^{{{N_{\mathrm{sh}}}}}{\mathbb{1} }_{W}({\boldsymbol{y}}_j), \end{aligned} $$(18)

with the volume | S r δ | = 4 π 3 ( ( r + δ ) 3 r 3 ) $ |S_r^{{\delta}}|=\frac{4\pi}{3}\left((r+{{\delta}})^3-r^3\right) $ of the shell. We then get V s δ ( x i ) V s δ ( x i ) $ {{\mathsf{V}}}_s^{{\delta}}({\boldsymbol{x}}_i)\longrightarrow{{\mathcal{V}}}_s^{{\delta}}({\boldsymbol{x}}_i) $ for a large number Nsh of (quasi-)random points. Consequently we compare

DR shell ( r ) = 1 | W | N i = 1 N V r δ ( x i ) $$ \begin{aligned} \mathsf{DR }_{\mathrm{shell}}(r) = \frac{1}{|{W}|N}\sum _{i=1}^N\mathsf{V }_r^{\delta }({\boldsymbol{x}}_i) \end{aligned} $$(19)

with 𝒟ℛ(r). This is not a pair-count, but we still have a double sum over N data points and now Nsh points in the shell.

Figure 4 shows εDR, shell = |𝒟ℛ(r)−DRshell(r)|/𝒟ℛ(r). A comparison with DR in Fig. 3 shows that using DRshell leads to a reduction of the error by almost two orders of magnitude even for ordinary random points. Again this can be improved by using low-discrepancy sequences; doing so allows us to additionally gain at least another order of magnitude.

thumbnail Fig. 4.

Relative error εDR, shell calculated for rectangular windows W with a pure Monte Carlo integration (solid lines) and with a randomised Halton sequence (dashed lines). The black dotted lines are proportional to 1 / N sh $ 1/\sqrt{{{{N_{\text{sh}}}}}} $ (upper) and (log Nsh)3/Nsh (lower).

5. Estimating ξ

Now we join the improved pair-count estimates together and compare results from the standard Landy & Szalay (1993) estimator to our new estimator for the pair correlation function. We use the simulated galaxy clusters from the Magneticum simulation to illustrate the behaviour of the different estimators (Hirschmann et al. 2014; Ragagnin et al. 2017, compare also Sect. 4). The 10 429 clusters are in a box with side-length 325 Mpc h−1 and we use their real space positions. The standard Landy & Szalay (1993) estimator was defined in Eq. (7):

ξ LS ( r ) = DD ( r ) 2 DR ( r ) + RR ( r ) RR ( r ) · $$ \begin{aligned} \xi _{\mathrm{LS}}(r) = \frac{\mathsf{DD }(r)-2\mathsf{DR }(r)+\mathsf{RR }(r)}{\mathsf{RR }(r)}\cdot \end{aligned} $$

First we use the same Nr random points to calculate RR and DR (see Eqs. (5) and (6)). As an exact reference we have

Ξ ( r ) = DD ( r ) 2 DR ( r ) + RR ( r ) RR ( r ) , $$ \begin{aligned} \Xi (r) = \frac{\mathsf{DD }(r)-2{\mathcal{DR} }(r)+{\mathcal{RR} }(r)}{{\mathcal{RR} }(r)}, \end{aligned} $$(20)

with the ℛℛ and 𝒟ℛ given in Eqs. (10), (12), using the results from Appendix B.1 for a rectangular box. In Sects. 3 and 4 we discuss alternative possibilities to calculate the pair-counts DR and RR. We focus on the following combination:

ξ LS ( r ) = DD ( r ) 2 DR shell ( r ) + RR 2 x 3 d ( r ) RR 2 x 3 d ( r ) , $$ \begin{aligned} \widetilde{\xi }_{\mathrm{LS}}(r) = \frac{\mathsf{DD }(r)-2\mathsf{DR }_{\mathrm{shell}}(r)+\mathsf{RR }_{\mathrm{2x3d}}(r)}{\mathsf{RR }_{\mathrm{2x3d}}(r)}, \end{aligned} $$(21)

which resembles the Landy & Szalay (1993) estimator, but now with improved pair-count estimates. We use a low-discrepancy sequence with Nshell 3D points to calculate DRshell (see Eq. (19)), and another 6D low-discrepancy sequence with N2x3d points to calculate RR2x3d (see Eq. (16)).

As an illustrative example we compare these estimates in Fig. 5 using an insufficient number of points. As expected, we observe that a standard Landy & Szalay (1993) estimator with only 104 random points shows deviations from the exact result, but with 105 random points the LS-estimator starts to follow the exact result. Visually, one can see that the estimator using the low-discrepancy sequences achieves a higher accuracy already with N2x3d = 104 = Nsh points.

thumbnail Fig. 5.

Two-point correlation function calculated for a simulated galaxy cluster sample. The exact Ξ is compared to standard ξLS with Nr = 104 and Nr = 105 random points, and to the new ξ LS $ \widetilde\xi_{\mathrm{LS}} $ using N2x3d = 104 = Nsh points from a randomised Halton sequence.

To make this quantitative and to disentangle the influence that these different methods for calculating DR and RR have, we consider the ‘half exact’ estimates for ξ(r):

Ξ RR ( r ) = DD ( r ) 2 DR ( r ) + RR ( r ) RR ( r ) , Ξ DR ( r ) = DD ( r ) 2 DR ( r ) + RR ( r ) RR ( r ) , $$ \begin{aligned} \Xi _{{\mathcal{RR} }}(r)&= \frac{\mathsf{DD }(r)-2\mathsf{DR }(r)+{\mathcal{RR} }(r)}{{\mathcal{RR} }(r)}, \\ \Xi _{{\mathcal{DR} }}(r)&= \frac{\mathsf{DD }(r)-2{\mathcal{DR} }(r)+\mathsf{RR }(r)}{\mathsf{RR }(r)}, \end{aligned} $$(22)

where we use the exact reference values ℛℛ and 𝒟ℛ in turn. We also calculate Ξℛℛ and Ξ𝒟ℛ using DRshell and RR2x3d instead of DR and RR. For the comparison, we consider the absolute error Δ1(r) = |Ξ(r)−Ξ𝒟ℛ(r)| and Δ2(r) = |Ξ(r)−Ξℛℛ(r)|. In Fig. 6 we see that the scaling investigated in Sect. 3 directly transfers to the scaling of the error in the estimated ξ. For a fixed number of (quasi-)random points, RR2x3d is more accurate. The gain in accuracy is considerable for intermediate and large radii. For small radii and for a smaller number of (quasi-)random points, the accuracy gain of RR2x3d is reduced to a factor of two. With the standard DR, the error Δ1 scales as 1 / N r $ 1/\sqrt{{{{N_{\text{r}}}}}} $ (see Fig. 7). Already using a low-discrepancy sequence instead of random points in DR leads to a reduction in the error and the scaling starts to follow 1/Nq. Compared to DR, the DRshell gives a significantly smaller error, again showing a scaling proportional to 1/Nsh. This reduced error comes at a price: the run time of DRshell is greater than the run time of DR for the same number of (quasi-)random points (see Appendix C).

thumbnail Fig. 6.

Absolute error Δ1 is shown for the standard RR (solid lines) and for RR2x3d (dashed lines) pair-counts. The black dotted lines are proportional to 1 / N r $ 1/\sqrt{{{{N_{\text{r}}}}}} $ (upper) and 1/N2x3d (lower).

5.1. Run time

Ultimately, we are not interested in the scaling of the error with the number of (quasi-)random points. We want to obtain reliable results fast. The speed of the calculation depends on the hardware platform, the algorithms, details of the implementation, and further effects (see Appendix C). Nevertheless, we prepared some examples and investigate what one can expect as a speedup when using the new estimator. In a given window, and for a given radius interval, we look at Figs. 6 and 7 to read off how many (quasi-)random points we approximately need to achieve a targeted absolute error. In most applications of the standard estimator ξLS, the same random point set is used with RR and DR. However, there is some room for optimisation. As we tune RR2x3d and DRshell in the new estimator separately, we also choose different sizes of the random samples in RR and DR to optimise the overall run time. With some provisional runs, we check that we achieve the targeted accuracy Δ within 10%; sometimes we had to adjust the numbers. Table 1 lists the final parameters we use. For case III, we use Nr = 107 and achieve a Δ ≈ 5 × 10−4 for the standard estimator, and therefore the run times for case III may serve only as a lower bound.

thumbnail Fig. 7.

Absolute error Δ2 is shown for the standard DR (solid lines), the DR using a low-discrepancy sequence (dotted lines), and for DRshell also using a low-discrepancy sequence (dashed lines). The black dotted lines are proportional to 1 / N r $ 1{/}\sqrt{{{{N_{\text{r}}}}}} $ (upper) and 1/Nsh (lower).

Table 1.

Parameters and results of the run-time comparison.

The comparison of the run times was conducted on a small workstation (for details see Appendix C). We estimate the run times from 100 runs, for the longer running calculations we use 20 runs (in case III 5 runs). As the targeted accuracy is matched at the 10% level and this comparison did not run on an isolated machine, we can only give one digit for the run times. See Appendix C for further comments on the run time calculations.

The run times of the pair-counts are shown in Table 1 for different windows, radius ranges, and for different targeted accuracies. From Appendix C, we know that for a fixed number of (quasi-)random points, the standard pair-counts RR and DR are always faster than the corresponding RR2x3d and DRshell. However, we choose the number of (quasi-)random points by requiring a given accuracy. In that case, we need fewer points for the new pair-counts, which in turn lowers their run times. Now we compare RR and RR2x3d as well as DR and DRshell at fixed accuracy, and find that the run times of the new pair-counts are smaller in all the considered cases. We also see from Table 1 that the relative significance of the RR, RR2x3d and DR, DRshell parts depends on the considered case. For a targeted high accuracy (small Δ), the run times are dominated by the RR and RR2x3d part, respectively. For small radii and for lower accuracy, the DR and DRshell parts dominate the run time. The geometry of the window further differentiates this picture. In Table 2, we sum up the parts and show the run times of ξLS and the new ξ LS $ \widetilde{\xi}_{\mathrm{LS}} $. In all the cases considered, we see a reduction in the run time using the new estimator. For high accuracy and for large and intermediate radii, the effect is considerable. Using the new estimator we expect a reduction in run time by a factor of between several hundred and almost 104. For small radii and for less demanding accuracy goals, our method performs at least as well as the standard estimator and we may expect a reduction by up to an order of magnitude. The timing results for DD, RR, RR2x3d and DR are independent of the number of radius intervals, because we can sort the counts into the radius bins. However, we need to repeat the DRshell calculation for each radius bin.

Table 2.

We sum the parts and compare the run times for the standard ξLS and the new ξ LS $ \widetilde{\xi}_{\mathrm{LS}} $ in all the cases from Table 1.

5.2. A dense sample

Case VI from Table 1 shows that for small radii and lower accuracy goals, the run time is dominated by the DR or the DRshell calculations. However, DR and DRshell show very different run times depending on the number of data points (see Appendix C). Hence per se it is not clear which of the methods is preferable in dense samples for small radii and relaxed accuracy goals. We cannot check this with the cluster sample with N = 10429 data points only. Therefore, we use a sample of 1.36 million simulated galaxies2 from the Magneticum simulation (Hirschmann et al. 2014; Ragagnin et al. 2017) in the box with side-length 325 Mpc h−1. In Fig. 8, we see that the errors of RR and RR2x3d both scale proportionally to 1/Nr (see the discussion in Sect. 3). For a fixed number of (quasi-)random points, the RR2x3d is about twice as accurate as the RR for r ∈ [2, 3] Mpc h−1 in this dense sample. However, the accuracies of DR and DRshell differ drastically. Using the same number of points, accuracy is increased when using DRshell instead of DR by at least a factor of 103. We can understand this by looking at Eq. (19). In calculating DRshell, we average the shell volume inside the window over all the points. With so many points in this dense sample, only a rough estimate of the shell volume for each individual point or galaxy is necessary to obtain a DRshell of sufficient accuracy.

thumbnail Fig. 8.

Errors for estimates of the correlation function at r ∈ [2, 3] Mpc h−1 from 1.36 million galaxies and a subsample of 100 k galaxies in a box with side-length 325 Mpc h−1. Upper plot: absolute error Δ1 shown for the standard RR (solid lines) and for RR2x3d (dashed lines) pair-counts. Lower plot: absolute error Δ2 is shown for the standard DR (solid lines), the DR using a low-discrepancy sequence (dotted lines), and for DRshell also using a low-discrepancy sequence (dashed lines). The black dotted lines are proportional to 1 / N r $ 1/\sqrt{{{{N_{\text{r}}}}}} $ (upper) and 1/N2x3d (lower).

We proceed as in Sect. 5.1, choose a targeted accuracy Δ = 10−2, and determine the required number of points Nr, N2x3d, and Nsh in the pair-counts as summarised in Table 3. Extrapolating from Fig. 8, we could have guessed that the number Nsh of points, as used in DRshell, is less than one. Instead of subsampling the galaxies, we simply use Nsh = 1, and therefore we are overestimating the required run time TDR, shell. It is also interesting to observe that in case XI we only need Nr ≪ N random points for the pair-counts in the standard estimators to achieve the targeted accuracy.

Table 3.

Parameters and results of a run time comparison similar to Table 1 but now for a sample with N = 1.36 million galaxies (XI) and a random subsample with N = 105 galaxies (X) in a cube with side length 325 Mpc h−1.

From Table 3 we see that for the subsampled galaxy distribution (case X, 100 k points), the pair-counts RR and DR have a comparable contribution to the run time for the standard estimator, whereas for the new estimator, the RR2x3d still provides the major contribution to the run time. For the full galaxy sample with 1.36 million galaxies, the dominant contribution is from DD followed by DR for the standard estimator. For the new estimator, both RR2x3d and DRshell contribute similarly to the run time. In both cases, the new estimator is still faster than the standard estimator, although only marginally in the full sample XI. These timing results should be considered with some reservation. As illustrated in Appendix C, a tree-based method can speedup the DD, RR, DR, and RR2x3d for small radii significantly. For DRshell, such a method for increasing the speed is currently unknown. This certainly affects the relative contribution of DR versus DRshell and depending on the situation, DR using a tree-based method can be faster than DRshell.

6. Incomplete sampling

The random points in RR and DR are not only used to correct for finite-size effects but also to correct for incomplete sampling. Here, we follow the ideas presented in Baddeley et al. (2000) and Shaw et al. (2021) and adapt them for the pair-counts. In this way, we can still use the new improved estimators and also correct for incomplete sampling.

Typically the galaxy distribution is observed incompletely. For example, unobserved regions around bright stars are masked. This leads to holes in the observational window W and can be dealt with using the methods already described. However, a partial sampling of the galaxy distribution in crowded fields leaves us with an inhomogeneous selection of the galaxy sample. Also, in magnitude-limited samples, we have a systematic selection of the galaxies, depending on the distance from us. We model this by p(x)∈(0, 1], the probability of including or observing a galaxy at position x. We assume that this p(x) is statistically independent of any other point in the galaxy distribution. For example, in a crowded field, only a fraction of the galaxies are randomly targeted for spectroscopic follow up. For a galaxy in this field, we have a p(x) equal to the fraction of targeted galaxies in the field. In a magnitude-limited sample, we observe all galaxies down to a limiting brightness llim. The luminosity of a galaxy L(d(x),l) at position x can be determined from the brightness l > llim and its luminosity distance d(x) from our galaxy (we neglect absorption for simplicity). Therefore, at a distance of d we only include galaxies with luminosity L > L(d, llim) in our catalogue. Now consider F(L), the distribution function of the absolute luminosities of all the galaxies (i.e. the normalised cumulative luminosity function). The fraction of galaxies included at a distance d(x) is then p(x) = 1 − F(L(d(x),llim)). We therefore need a good model for the luminosity function. The exclusion of a galaxy due to fibre collision cannot be modelled with the independent thinning (see below), and the following approach may only serve as an approximation. Further selection and sampling effects are discussed in Ross et al. (2012).

To investigate the effects of incomplete sampling, we assume an unobserved homogeneous and isotropic galaxy distribution G with number density 𝜚G and two-point correlation function ξG(r). Our goal is to estimate this ξG(r). The inhomogeneous sampling, described by p(x), leads to an observed inhomogeneous galaxy catalogue and is modelled in the following way:

D = { x i G | x i W and u i p ( x i ) } , $$ \begin{aligned} \mathsf D^\prime = \{{\boldsymbol{x}}_i\in G | {\boldsymbol{x}}_i\in {W}\; \mathrm{and}\; u_i\le p({\boldsymbol{x}}_i)\}, \end{aligned} $$(23)

where ui are independent random variables that are specifically independent of the points and uniformly distributed on [0, 1]. This closely follows the construction of an inhomogeneous Markov point process by independent thinning as discussed by Baddeley et al. (2000). The observed inhomogeneous point set { x i } i = 1 N $ \{{\boldsymbol{x}}_i^\prime\}_{i=1}^N $ is considered a realisation of D′. With the number density 𝜚G of the homogeneous galaxy distribution, the inhomogeneous number density of this point process is 𝜚′(x) = p(x) 𝜚G. The two-point density, the probability of observing a point at x and y, is

ϱ 2 ( x , y ) = ϱ ( x ) ϱ ( y ) ( 1 + ξ ( | x y | ) ) . $$ \begin{aligned} \varrho _2^\prime ({\boldsymbol{x}},{\boldsymbol{y}}) = \varrho^\prime (\boldsymbol{x})\varrho^\prime ({\boldsymbol{y}}) \left(1 + \xi^\prime (|{\boldsymbol{x}}-{\boldsymbol{y}}|)\right). \end{aligned} $$(24)

This is well defined for a certain class of point process as defined in Baddeley et al. (2000). As the thinning is assumed to be independent from the points, we have

ξ G ( | x y | ) = ξ ( | x y | ) = ϱ 2 ( x , y ) ϱ G 2 p ( x ) p ( y ) 1 , $$ \begin{aligned} \xi _G(|{\boldsymbol{x}}-{\boldsymbol{y}}|) = \xi^\prime (|{\boldsymbol{x}}-{\boldsymbol{y}}|) = \frac{\varrho _2^\prime ({\boldsymbol{x}},{\boldsymbol{y}})}{\varrho _G^2\ p({\boldsymbol{x}}) p({\boldsymbol{y}})} - 1, \end{aligned} $$

for the two-point correlation function.

In full analogy to Sect. 2, we define the ‘inhomogeneous’ pair-counts DD′. The points { x i } i = 1 N $ \{{\boldsymbol{x}}_i^\prime\}_{i=1}^N $ are the incompletely sampled galaxies with positions x i $ {\boldsymbol{x}}_i^\prime $ inside the observation window W. Then

DD ( r ) = 1 N 2 i = 1 N j = 1 , j i N δ k r δ ( | x i x j | ) . $$ \begin{aligned} \mathsf{DD }^\prime (r) = \frac{1}{N^2} \sum _{i=1}^N\sum _{j=1, j\ne i}^N\ {\delta }k_r^{\delta }(|{\boldsymbol{x}}_i^\prime -{\boldsymbol{x}}_j^\prime |). \end{aligned} $$(25)

To account for the effect of incomplete sampling, one often applies the thinning as described in Eq. (23) to the random points as well. In this way, our random point set { y i } i = 1 N r $ \{{\boldsymbol{y}}_i^\prime\}_{i=1}^{{{N_{\rm r}}}} $ is a realisation of an inhomogeneous Poisson process with number density N r N ϱ ( x ) $ \tfrac{{{{N_{\text{r}}}}}}{N} \varrho^{\prime}({\boldsymbol{x}}) $. The pair-counts involving these inhomogeneous random points are

RR ( r ) = 1 N r 2 i = 1 N r j = 1 , j i N r δ k r δ ( | y i y j | ) , $$ \begin{aligned} \mathsf{RR }^\prime (r) = \frac{1}{{{N_{\rm r}}}^2}\sum _{i=1}^{{{N_{\rm r}}}} \sum _{j=1, j\ne i}^{{{N_{\rm r}}}}\ {\delta }k_r^{\delta }(|{\boldsymbol{y}}_i^\prime -{\boldsymbol{y}}_j^\prime |), \end{aligned} $$(26)

and

DR ( r ) = 1 N N r i = 1 N j = 1 N r δ k r δ ( | x i y j | ) . $$ \begin{aligned} \mathsf{DR }^\prime (r) = \frac{1}{N {{N_{\rm r}}}} \sum _{i=1}^N\ \sum _{j=1}^{{{N_{\rm r}}}}\ {\delta }k_r^{\delta }(|{\boldsymbol{x}}_i^\prime -{\boldsymbol{y}}_j^\prime |). \end{aligned} $$(27)

Appendix A.2 shows that the Peebles & Hauser (1974) estimator using DD′ and RR′ is ‘ratio unbiased’. For the Landy & Szalay (1993) estimator, we combine the pair-counts DD′, DR′, and RR′ with

ξ LS ( r ) = DD ( r ) 2 DR ( r ) + RR ( r ) RR ( r ) , $$ \begin{aligned} \xi^\prime _{\rm LS}(r)&= \frac{\mathsf{DD }^\prime (r) - 2\mathsf{DR }^\prime (r) + \mathsf{RR }^\prime (r)}{\mathsf{RR }^\prime (r)}, \end{aligned} $$(28)

in full analogy to Eq. (7). Now we show how we can calculate these inhomogeneous pair-counts using our new approach based on quasi-Monte Carlo methods. For this we make a detour and consider the expectation value of RR′ and then DR′.

For a large number of random points Nr from an inhomogeneous Poisson process, we obtain (see Appendix A.2):

RR ( r ) RR ( r ) = 4 π N 2 r r + δ Γ W ¯ ( s ) s 2 d s , $$ \begin{aligned} \mathsf{RR }^\prime (r)&\rightarrow \ {\mathcal{RR} }^\prime (r) = \frac{4\pi }{N^2} \int _{r}^{r+{\delta }} \overline{\Gamma _{W}}(s) \ s^2{\mathrm{d} }s, \end{aligned} $$(29)

where Γ W ¯ ( s ) $ \overline{\Gamma_{W}}(s) $ is given in Eq. (A.16), and ℛℛ′(r) is the expectation value of the pair-count RR′(r). There is no longer a direct correspondence with geometric objects like the set-covariance, however we can write ℛℛ′ as the following integral (compare to Eq. (15)):

RR ( r ) = 1 N 2 R 3 R 3 1 W ( y 1 ) 1 W ( y 2 ) ϱ ( y 1 ) ϱ ( y 2 ) × δ k r δ ( | y 1 y 2 | ) d y 1 d y 2 . $$ \begin{aligned} {\mathcal{RR} }^\prime (r)&= \frac{1}{N^2} \int _{\mathbb{R} ^3}\! \int _{\mathbb{R} ^3} {\mathbb{1} }_{W}({\boldsymbol{y}}_1){\mathbb{1} }_{W}({\boldsymbol{y}}_2)\, \varrho \prime ({\boldsymbol{y}}_1)\varrho \prime ({\boldsymbol{y}}_2)\nonumber \\&\quad \times {\delta }k_r^{\delta }(|{\boldsymbol{y}}_1-{\boldsymbol{y}}_2|) \, {\mathrm{d} }{\boldsymbol{y}}_1\,{\mathrm{d} }{\boldsymbol{y}}_2. \end{aligned} $$(30)

This suggests a (quasi-)Monte Carlo approach similar to Eq. (16). We use homogeneously sampled point sets P1, P2 as described in Sect. 3. Then

RR 2 x 3 d ( r ) = 1 N 2 i = 1 N r j = 1 N r ϱ ( y i ) ϱ ( z j ) δ k r δ ( | y i z j | ) , $$ \begin{aligned} \mathsf{RR }_{\rm 2x3d}^\prime (r)&= \frac{1}{N^2}\sum _{i=1}^{{{N_{\rm r}}}} \sum _{j=1}^{{{N_{\rm r}}}}\ \varrho \prime ({\boldsymbol{y}}_i)\varrho \prime ({\boldsymbol{z}}_j)\ {\delta }k_r^{\delta }(|{\boldsymbol{y}}_i-{\boldsymbol{z}}_j|), \end{aligned} $$(31)

is an estimate of ℛℛ′(r) (we reiterate that the number density of the inhomogeneous Poisson process is N r N ϱ ( x ) $ \tfrac{{{{N_{\text{r}}}}}}{N} \varrho^{\prime}(\boldsymbol{x}) $). As described in Sect. 3, the point sets P1, P2 can be generated from a six-dimensional random sequence or six-dimensional low-discrepancy sequence.

In Eq. (27) we use a thinned random point set to calculate DR′. Similarly, we could use a thinned low-discrepancy sequence in Eq. (27), but in Sect. 4 we see that this first approach is only mildly successful. We therefore follow the second approach by considering Eq. (A.18) and find

DR ( r ) DR ( r ) = 1 N 2 i = 1 N U r δ ( x i ) , $$ \begin{aligned} \mathsf{DR }^\prime (r)&\rightarrow {\mathcal{DR} }^\prime (r) = \frac{1}{N^2} \sum _{i=1}^N {\mathcal{U} }_r^{\delta }({\boldsymbol{x}}_i), \end{aligned} $$(32)

where U r δ ( x i ) $ {{\mathcal{U}}}_r^{{\delta}}({\boldsymbol{x}}_i) $ is given in Eq. (A.19). Similar to Eq. (18), we use Nsh points { y i } i = 1 N sh $ \{{\boldsymbol{y}}_i\}_{i=1}^{{{{N_{\text{sh}}}}}} $ (quasi-)randomly distributed in the shell S r δ ( x i ) = { y 3 | s < | y x i | s + δ } $ S_r^{{\delta}}({\boldsymbol{x}}_i)=\{{\boldsymbol{y}}\in{{\mathbb{R}}}^3 \,|\, s < |{\boldsymbol{y}}-{\boldsymbol{x}}_i|\le s+{{\delta}} \} $ around xi to estimate U r δ ( x i ) $ {{\mathcal{U}}}_r^{{\delta}}({\boldsymbol{x}}_i) $ by

U s δ ( x i ) = | S r δ | N sh j = 1 N sh 1 W ( y j ) ϱ ( y j ) . $$ \begin{aligned} \mathsf{U }_s^{\delta }({\boldsymbol{x}}_i)&= \frac{|S_r^{\delta }|}{{{N_{\mathrm{sh}}}}} \sum _{j=1}^{{{N_{\mathrm{sh}}}}}{\mathbb{1} }_{W}({\boldsymbol{y}}_j)\ \varrho \prime ({\boldsymbol{y}}_j). \end{aligned} $$(33)

This results in a new estimate for 𝒟ℛ′(r):

DR shell ( r ) = 1 N 2 i = 1 N U r δ ( x i ) . $$ \begin{aligned} \mathsf{DR }_{\mathrm{shell}}^\prime (r) = \frac{1}{N^2}\sum _{i=1}^N\mathsf{U }_r^{\delta }({\boldsymbol{x}}_i) . \end{aligned} $$(34)

Similar to Eq. (21) we can use DD′, RR 2x3d $ {{\mathsf{RR}}}_{\rm 2x3d}^\prime $, and DR shell $ {{\mathsf{DR}}}_{{\rm shell}}^\prime $ to construct a Landy & Szalay (1993) – type estimator ξ LS $ \widetilde{\xi}_{\mathrm{LS}}{\prime} $ for inhomogeneously sampled galaxy catalogues.

Shaw et al. (2021) discuss how to estimate the ξ′(r) and a non-parametric model for 𝜚′(x) directly from the data. In a typical application to galaxy catalogues, we often have a good model for 𝜚′(x) (or equivalently for p(x)). Parameters of this model are fixed by the sampling strategy, and some are determined from the galaxy distribution (e.g. from an estimate of the luminosity function).

7. Summary and outlook

First, we focussed on the scaling of the error in estimates of the pair-counts RR and DR with the number of (quasi-)random points. For the standard approach, with ordinary random numbers, we confirm the expected slow shrinking of the error proportional to 1 / N $ 1/\sqrt{N} $. A reformulation of the pair-counts makes a quasi-Monte Carlo integration possible. There we find that the error shrinks almost proportionally to 1/Nq, where Nq is the number of points from a low-discrepancy sequence. This scaling of the error not only holds for bulky samples but is even more pronounced in a thin sample with prevalent boundary effects. We are therefore confident that our improved methods are also applicable in more complicated sample geometries. We combine these improved pair-counts into a new Landy & Szalay (1993) – type estimator and compare with the standard one. The new estimator inherits the favourable scaling proportional to 1/Nq.

We can turn this observation around. For a fixed maximum error, we read off how many points are necessary to stay below this error with both the new and the standard estimator. We then compare the run times of the estimators. Depending on the accuracy goal, the radius range considered, and the density of the data, we observe a speedup by 50%, and up to a factor of almost 104 for our improved estimator. More specifically, using RR2x3d instead of the standard RR will increases the efficiency (accuracy vs. run time) of the estimator in any considered case. As tree-based methods are equally applicable to RR and RR2x3d, we recommend using RR2x3d for any use case. Also, in all the cases we considered, the DRshell is faster than the standard DR at fixed accuracy. However, an unreserved recommendation is not possible. For the DR calculations, a tree based method could be employed to speed up the calculations (we did not use one in our calculations). Currently, no comparable method for a speedup is known for DRshell. Therefore, in some situations for fixed accuracy, a tree-based method could lead to smaller run times in the DR calculations than in the DRshell calculations. We expect this to be relevant for voluminous samples, small radii, and lower accuracy goals. A combination of DD, DR, RR2x3d similar to Eq. (21) might therefore be useful. For large radii, we recommend DRshell.

The random point sets are not only used for boundary corrections but to correct for incomplete sampling as well. Typically, the selection and sampling effects, as present in the galaxy distribution, are modelled onto the random point set used in the pair-counts. We discuss how to adapt this for the new improved estimator; we are using the probability of observing the galaxies as a weight in the calculation of the pair-counts. However, this is only a first step. Using weighting schemes, one could envision the construction of a minimum variance estimate for inhomogeneous sampled galaxies, similarly to Saunders et al. (1992), Feldman et al. (1994), and Colombi et al. (1998), but now for the improved estimators using low-discrepancy sequences.

Our use of the randomised Halton sequence is similar to the use of glass-like point sets by Dávila-Kurbán et al. (2021). Generating glass-like point sets can be computationally challenging. In contrast, a randomised Halton sequence is generated easily (see Appendix C). Halton sequences are probably the simplest choice, but other low-discrepancy sequences could be used as well (L’Ecuyer & Lemieux 2002). Interesting alternatives could be the so-called blue noise random sets, which are used in computer graphics for efficiently sampling from surfaces (see e.g. Heck et al. 2013). The computational improvements like tree-based methods or optimised memory layout are complimentary to our approach and can be used similarly for the pair-counts with low-discrepancy sequences. This should lead to a further speedup.


1

Specifically we use the simulated galaxy clusters from the snapshot Box2/hr, snap_136, z = 0.066340191 downloaded from http://www.magneticum.org/data.html#FULL_CATALOUGES

2

We use the simulated galaxies from the snapshot Box2/hr, snap_136, z = 0.066340191 downloaded from http://www.magneticum.org/data.html#FULL_CATALOUGES

Acknowledgments

Many thanks to Adrian Baddeley for sharing code and the comments on the expressions for the area fraction. It is a pleasure to thank Klaus Dolag and Antonio Ragagnin for providing public access to the simulated galaxies and galaxy clusters from the Magneticum simulation. Also many thanks to the referee for his/her helpful comments and especially for the suggestion to include a more detailed discussion of the run times.

References

  1. Abbott, T. M. C., Aguena, M., Alarcon, A., et al. 2022, Phys. Rev. D, 105, 023520 [CrossRef] [Google Scholar]
  2. Aghamousa, A., Aguilar, J., Ahlen, S., et al. 2016, ArXiv e-prints [arXiv:1611.00036] [Google Scholar]
  3. Alarcon, A., Gaztanaga, E., Eriksen, M., et al. 2021, MNRAS, 501, 6103 [NASA ADS] [CrossRef] [Google Scholar]
  4. Alonso, D. 2012, ArXiv e-prints [arXiv:1210.1833] [Google Scholar]
  5. Amendola, L., Appleby, S., Avgoustidis, A., et al. 2018, Liv. Rev. Relativ., 21, 2 [NASA ADS] [CrossRef] [Google Scholar]
  6. Anderson, E., Bai, Z., Bschof, C., et al. 1999, LAPACK Users Guide, 3rd edn. (Philadelphia: SIAM) [CrossRef] [Google Scholar]
  7. Baddeley, A., & Turner, R. 2005, J. Stat. Softw., 12, 1 [CrossRef] [Google Scholar]
  8. Baddeley, A. J., Moyeed, R. A., Howard, C. V., & Boyde, A. 1993, Appl. Stat., 42, 641 [CrossRef] [Google Scholar]
  9. Baddeley, A. J., Møller, J., & Waagepetersen, R. 2000, Stat. Neerl., 54, 329 [CrossRef] [Google Scholar]
  10. Bautista, J. E., Paviot, R., Vargas Magaña, M., et al. 2021, MNRAS, 500, 736 [Google Scholar]
  11. Breton, M.-A., & de la Torre, S. 2021, A&A, 646, A40 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
  12. Colombi, S., Szapudi, I., & Szalay, A. S. 1998, MNRAS, 296, 253 [NASA ADS] [CrossRef] [Google Scholar]
  13. Dagum, L., & Menon, R. 1998, Comput. Sci. Eng. IEEE, 5, 46 [CrossRef] [Google Scholar]
  14. Daley, D. J., & Vere-Jones, D. 2003, An Introduction to the Theory of Point Processes (Berlin: Springer-Verlag) [Google Scholar]
  15. Dávila-Kurbán, F., Sánchez, A. G., Lares, M., & Ruiz, A. N. 2021, MNRAS, 506, 4667 [CrossRef] [Google Scholar]
  16. Davis, M., & Peebles, P. J. E. 1983, ApJ, 267, 465 [Google Scholar]
  17. Dawson, K. S., Schlegel, D. J., Ahn, C. P., et al. 2013, AJ, 145, 10 [Google Scholar]
  18. Demina, R., Cheong, S., BenZvi, S., & Hindrichs, O. 2018, MNRAS, 480, 49 [Google Scholar]
  19. Donoso, E. 2019, MNRAS, 487, 2824 [NASA ADS] [CrossRef] [Google Scholar]
  20. Eisenstein, D. J., Zehavi, I., Hogg, D. W., et al. 2005, ApJ, 633, 560 [Google Scholar]
  21. Feldman, H. A., Kaiser, N., & Peacock, J. A. 1994, ApJ, 426, 23 [Google Scholar]
  22. Fiksel, T. 1988, Statistics, 19, 67 [CrossRef] [Google Scholar]
  23. Halton, J. 1960, J. Numer. Math., 2, 84 [CrossRef] [Google Scholar]
  24. Hamilton, A. 1993, ApJ, 417, 19 [NASA ADS] [CrossRef] [Google Scholar]
  25. Harris, C. R., Millman, K. J., van der Walt, S. J., et al. 2020, Nature, 585, 357 [NASA ADS] [CrossRef] [Google Scholar]
  26. He, C.-C. 2021, ApJ, 921, 59 [NASA ADS] [CrossRef] [Google Scholar]
  27. Heck, D., Schlömer, T., & Deussen, O. 2013, ACM Trans. Graph., 32, 1 [Google Scholar]
  28. Hewett, P. C. 1982, MNRAS, 201, 867 [Google Scholar]
  29. Hirschmann, M., Dolag, K., Saro, A., et al. 2014, MNRAS, 442, 2304 [Google Scholar]
  30. Ivezić, Ž., Kahn, S. M., Tyson, J. A., et al. 2019, ApJ, 873, 111 [Google Scholar]
  31. Jakob, W., Rhinelander, J., & Moldovan, D. 2017, pybind11 – Seamless Operability between C++11 and Python, https://github.com/pybind/pybind11 [Google Scholar]
  32. Keihänen, E., Kurki-Suonio, H., Lindholm, V., et al. 2019, A&A, 631, A73 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
  33. Kerscher, M. 1999, A&A, 343, 333 [NASA ADS] [Google Scholar]
  34. Kerscher, M., Szapudi, I., & Szalay, A. 2000, ApJ, 535, L13 [NASA ADS] [CrossRef] [Google Scholar]
  35. Landy, S. D., & Szalay, A. S. 1993, ApJ, 412, 64 [Google Scholar]
  36. L’Ecuyer, P., & Lemieux, C. 2002, Recent Advances in Randomized Quasi-Monte Carlo Methods (New York: Springer), 419 [Google Scholar]
  37. Moore, A. W., Connolly, A. J., & Genovese, C. 2001, in Mining the Sky, eds. A. J. Banday, S. Zaroubi, & M. Bartelmann, 71 [CrossRef] [Google Scholar]
  38. Neyman, J., & Scott, E. L. 1958, J. R. Stat. Soc., 20, 1 [Google Scholar]
  39. Niederreiter, H. 1992, Random Number Generation and Quasi-Monte Carlo Methods (Philadelphia: SIAM) [CrossRef] [Google Scholar]
  40. Ohser, J. 1983, Math. Operationsforsch. Stat. Ser. Stat., 14, 63 [Google Scholar]
  41. Owen, A. B. 2017, ArXiv e-prints [arXiv:1706.02808] [Google Scholar]
  42. Owen, A. B., & Rudolf, D. 2021, SIAM Rev., 63, 360 [CrossRef] [Google Scholar]
  43. Peebles, P. J. E. 1980, The Large Scale Structure of the Universe (Princeton: Princeton University Press) [Google Scholar]
  44. Peebles, P. J. E., & Hauser, M. G. 1974, ApJS, 28, 19 [NASA ADS] [CrossRef] [Google Scholar]
  45. Ragagnin, A., Dolag, K., Biffi, V., et al. 2017, Astron. Comput., 20, 52 [Google Scholar]
  46. Ripley, B. D. 1976, J. Appl. Prob., 13, 255 [Google Scholar]
  47. Ripley, B. D. 1988, Statistical Inference for Spatial Processes (Cambridge: Cambridge University Press) [CrossRef] [Google Scholar]
  48. Rivolo, A. R. 1986, ApJ, 301, 70 [NASA ADS] [CrossRef] [Google Scholar]
  49. Ross, A. J., Percival, W. J., Sánchez, A. G., et al. 2012, MNRAS, 424, 564 [Google Scholar]
  50. Ross, A. J., Bautista, J., Tojeiro, R., et al. 2020, MNRAS, 498, 2354 [NASA ADS] [CrossRef] [Google Scholar]
  51. Saunders, W., Rowan-Robinson, M., & Lawrence, A. 1992, MNRAS, 258, 134 [NASA ADS] [Google Scholar]
  52. Shaw, T., Møller, J., & Waagepetersen, R. 2021, Aust. N. Z. J. Stat., 63, 93 [CrossRef] [Google Scholar]
  53. Sinha, M., & Garrison, L. H. 2020, MNRAS, 491, 3022 [Google Scholar]
  54. Stoyan, D., & Stoyan, H. 1994, Fractals, Random Shapes and Point Fields: Methods of Geometrical Statistics (Chichester: John Wiley& Sons) [Google Scholar]
  55. Stoyan, D., & Stoyan, H. 2000, Scand. J. Stat., 27, 641 [CrossRef] [Google Scholar]
  56. Stoyan, D., Kendall, W. S., & Mecke, J. 1995, Stochastic Geometry and its Applications, 2nd edn. (Chichester: John Wiley& Sons) [Google Scholar]
  57. Virtanen, P., Gommers, R., Oliphant, T. E., et al. 2020, Nat. Meth., 17, 261 [Google Scholar]

Appendix A: Expectation of pair-counts

Estimators for the two-point density and the correlation function using geometrical weights have been developed in spatial statistics (e.g. Stoyan & Stoyan 1994). These geometrical weights can be derived with an application of the Campbell-Mecke formula to the pair-counts. These ideas can be found in several places, such as for example Ripley (1976), Ohser (1983), Fiksel (1988), Stoyan et al. (1995) and quite recently for inhomogeneous point sets in Shaw et al. (2021). We provide analogous derivations to show the connection between the pair-counts DR and RR and geometric quantities such as the set-covariance and the area fraction (Kerscher 1999; Stoyan & Stoyan 2000).

The Campbell-Mecke formula connects the expectation value over realisations of a point process Φ to integrals over n-point densities (e.g. Stoyan et al. 1995; Daley & Vere-Jones 2003). For suitable functions f(x) and g(x, y), we have

E [ x Φ f ( x ) ] = R 3 f ( x ) ϱ d x , $$ \begin{aligned} {\mathbb{E} }\left[\sum _{{\boldsymbol{x}}\in \Phi } f(\boldsymbol{x})\,\right]&= \int _{{\mathbb{R} }^3}f(\boldsymbol{x})\, \varrho \, {\mathrm{d} }{\boldsymbol{x}},\end{aligned} $$(A.1)

E [ x , y Φ x y g ( x , y ) ] = R 3 R 3 g ( x , y ) ϱ 2 ( x , y ) d x d y . $$ \begin{aligned} {\mathbb{E} }\left[ \sum _{\begin{matrix} {\boldsymbol{x}},{\boldsymbol{y}}\in \Phi \\ {\boldsymbol{x}}\ne {\boldsymbol{y}} \end{matrix}} g({\boldsymbol{x}},{\boldsymbol{y}})\,\right]&= \int _{{\mathbb{R} }^3}\int _{{\mathbb{R} }^3} g({\boldsymbol{x}},{\boldsymbol{y}})\, \varrho _2({\boldsymbol{x}},{\boldsymbol{y}})\, {\mathrm{d} }{\boldsymbol{x}}{\mathrm{d} }{\boldsymbol{y}}. \end{aligned} $$(A.2)

The Campbell-Mecke formula allows us to interchange the expectation of sums with an integration over the number density 𝜚 or the two-point density 𝜚2(x, y). For a simple point process Φ, we define

S = x , y Φ x y 1 W ( x ) 1 W ( y ) δ k r δ ( | x y | ) . $$ \begin{aligned} S&= \sum _{\begin{matrix} {\boldsymbol{x}},{\boldsymbol{y}}\in \Phi \\ {\boldsymbol{x}}\ne {\boldsymbol{y}} \end{matrix}} {\mathbb{1} }_{W}({\boldsymbol{x}}) {\mathbb{1} }_{W}({\boldsymbol{y}}) {\delta }k_r^{\delta }(|{\boldsymbol{x}}-{\boldsymbol{y}}|). \end{aligned} $$(A.3)

Considering the galaxy distribution as a realisation of a point process Φ, we have DD = S N 2 $ {{\mathsf{DD}}}=\frac{S}{N^2} $. We apply the Campbell-Mecke formula to 𝔼[S], where we assume homogeneity and isotropy 𝜚2(x, y) = 𝜚2(|x − y|).

E [ S ] = R 3 R 3 1 W ( x ) 1 W ( y ) δ k r δ ( | x y | ) ϱ 2 ( | x y | ) d x d y = R 3 R 3 1 W ( z + y ) 1 W ( y ) d y = γ W ( z ) δ k r δ ( | z | ) ϱ 2 ( | z | ) d z = 0 0 π 0 2 π γ W ( z ( s , θ , ϕ ) ) sin ( θ ) d θ d ϕ = 4 π γ W ¯ ( s ) δ k r δ ( s ) ϱ 2 ( s ) s 2 d s = 4 π r r + δ γ W ¯ ( s ) ϱ 2 ( 1 + ξ ( s ) ) s 2 d s , $$ \begin{aligned} {\mathbb{E} }[S]&= \int _{{\mathbb{R} }^3}\int _{{\mathbb{R} }^3} {\mathbb{1} }_{W}(\boldsymbol{x}) {\mathbb{1} }_{W}(\boldsymbol{y}) {\delta }k_r^{\delta }(|{\boldsymbol{x}}-{\boldsymbol{y}}|) \varrho _2(|{\boldsymbol{x}}-{\boldsymbol{y}}|)\, {\mathrm{d} }{\boldsymbol{x}}{\mathrm{d} }{\boldsymbol{y}}\nonumber \\&= \int _{{\mathbb{R} }^3} \underbrace{\int _{{\mathbb{R} }^3} {\mathbb{1} }_{W}({\boldsymbol{z}}+{\boldsymbol{y}}) {\mathbb{1} }_{W}(\boldsymbol{y}) {\mathrm{d} }{\boldsymbol{y}}}_{=\gamma _{W}(\boldsymbol{z})} \ {\delta }k_r^{\delta }(|\boldsymbol{z}|) \varrho _2(|\boldsymbol{z}|)\, {\mathrm{d} }{\boldsymbol{z}}\nonumber \\&= \int _{0}^\infty \underbrace{\int _0^\pi \!\!\! \int _0^{2\pi }\!\!\! \gamma _{W}({\boldsymbol{z}}(s,\theta ,\phi )) \sin (\theta ){\mathrm{d} }\theta {\mathrm{d} }\phi }_{=4\pi \overline{\gamma _{W}}(s)} \ {\delta }k_r^{\delta }(s) \varrho _2(s)\, s^2{\mathrm{d} }s \nonumber \\&= 4\pi \int _{r}^{r+{\delta }}\!\! \overline{\gamma _{W}}(s)\ \varrho ^2 (1+\xi (s))\ s^2{\mathrm{d} }s, \end{aligned} $$(A.4)

with 𝜚2(s) = 𝜚2(1 + ξ(s)) and the number density 𝜚 of the point process.

The random point set used in RR is a realisation of a Poisson process with number density ϱ r = N r | W | $ \varrho_{\mathrm{r}}=\frac{{{{N_{\text{r}}}}}}{|{W}|} $ and by definition a vanishing two-point correlation function ξ(r) = 0. Applying the Campbell-Mecke formula to RR = S r N r 2 $ {{\mathsf{RR}}}=\frac{S_{\mathrm{r}}}{{{{N_{\text{r}}}}}^2} $, we obtain the connection between the pair-count RR and the isotropised set-covariance (Kerscher 1999; Stoyan & Stoyan 2000):

E [ RR ( r ) ] = RR ( r ) = 4 π N r 2 r r + δ γ W ¯ ( s ) ϱ r 2 s 2 d s 4 π r 2 δ | W | 2 γ W ¯ ( r ) for δ small . $$ \begin{aligned} {\mathbb{E} }[\mathsf{RR }(r)] = {\mathcal{RR} }(r)&= \frac{4\pi }{{{N_{\rm r}}}^2} \int _{r}^{r+{\delta }}\!\! \overline{\gamma _{W}}(s)\ \varrho _{\rm r}^2 \ s^2{\mathrm{d} }s \\&\approx \ \frac{4\pi r^2 {\delta }}{|{W}|^2}\ \overline{\gamma _{W}}(r) \quad \mathrm{for}\; {\delta }\; \mathrm{small}.\nonumber \end{aligned} $$(A.5)

Using the isotropised set covariance (see Appendix B.1), we can calculate ℛℛ(r) as a geometrical reference value for rectangular boxes.

As mentioned above, we consider the galaxy distribution as a realisation of a point process with number density ϱ = N | W | $ \varrho=\frac{N}{|{W}|} $ and we seek to estimate its two-point correlation function ξ(r). From

E [ DD ( r ) ] E [ RR ( r ) ] = N r 2 4 π r r + δ γ W ¯ ( s ) ϱ 2 ( 1 + ξ ( s ) ) s 2 d s N 2 4 π r r + δ γ W ¯ ( s ) ϱ r 2 s 2 d s = 1 + r r + δ γ W ¯ ( s ) ξ ( s ) s 2 d s r r + δ γ W ¯ ( s ) s 2 d s 1 + ξ ( r ) for δ small , $$ \begin{aligned} \frac{{\mathbb{E} }[\mathsf{DD }(r)]}{{\mathbb{E} }[\mathsf{RR }(r)]}&= \frac{{{N_{\rm r}}}^2\ 4\pi \int _{r}^{r+{\delta }} \overline{\gamma _{W}}(s)\ \varrho ^2(1+\xi (s)) s^2{\mathrm{d} }s}{N^2\ 4\pi \int _{r}^{r+{\delta }} \overline{\gamma _{W}}(s)\ \varrho _{\rm r}^2 s^2{\mathrm{d} }s}\nonumber \\&= 1 + \frac{\int _{r}^{r+{\delta }} \overline{\gamma _{W}}(s)\ \xi (s)\ s^2{\mathrm{d} }s}{\int _{r}^{r+{\delta }} \overline{\gamma _{W}}(s)\ s^2{\mathrm{d} }s}\\&\approx 1 + \xi (r) \quad \mathrm{for}\; {\delta }\; \mathrm{small},\nonumber \end{aligned} $$(A.6)

we see that the estimator ξ PH ( r ) = DD ( r ) RR ( r ) 1 $ \xi_{\mathrm{PH}}(r)= \frac{{{\mathsf{DD}}}(r)}{{{\mathsf{RR}}}(r)}-1 $ of Peebles & Hauser (1974) is a ‘ratio unbiased’ estimator for the two-point correlation function ξ(r).

To derive an analogous relation between DR and the average surface area we consider

T = i = 1 N y Φ 1 W ( y ) δ k r δ ( | x i y | ) , $$ \begin{aligned} T&= \sum _{i=1}^N \sum _{{\boldsymbol{y}}\in \Phi } {\mathbb{1} }_{W}({\boldsymbol{y}})\, {\delta }k_r^{\delta }(|{\boldsymbol{x}}_i-{\boldsymbol{y}}|) , \end{aligned} $$(A.7)

with a point process Φ and { x i } i = 1 N $ \{{\boldsymbol{x}}_i\}_{i=1}^N $ being a given set of points inside the sample geometry xi ∈ W. With Φ, a Poisson process, and using the Campbell-Mecke formula for DR ( r ) = T N N r $ {{\mathsf{DR}}}(r)=\frac{T}{N {{{N_{\text{r}}}}}} $, we obtain

E [ T ] = i = 1 N R 3 1 W ( y ) δ k r δ ( | x i y | ) ϱ r d y = N r | W | i = 1 N V r δ ( x i ) , $$ \begin{aligned} {\mathbb{E} }[T]&= \sum _{i=1}^N \int _{{\mathbb{R} }^3} {\mathbb{1} }_{W}({\boldsymbol{y}})\, {\delta }k_r^{\delta }(|{\boldsymbol{x}}_i-{\boldsymbol{y}}|)\, \varrho _{\rm r}\, {\mathrm{d} }{\boldsymbol{y}} = \frac{{{N_{\rm r}}}}{|{W}|} \sum _{i=1}^N {\mathcal{V} }_r^{\delta }({\boldsymbol{x}}_i), \end{aligned} $$(A.8)

where ϱ r = N r | W | $ \varrho_{\mathrm{r}}=\frac{{{{N_{\text{r}}}}}}{|{W}|} $ is the number density of the Poisson process and

V r δ ( x i ) = | S r δ ( x i ) W | = r r + δ area ( B s ( x i ) W ) d s $$ \begin{aligned} {\mathcal{V} }_r^{\delta }({\boldsymbol{x}}_i)&= |S_r^{\delta }({\boldsymbol{x}}_i)\cap {W}| = \int _r^{r+{\delta }}\!\!\!\!\! \mathrm{area}(\partial {B}_{s}({\boldsymbol{x}}_i)\cap {W})\,{\mathrm{d} }s \end{aligned} $$(A.9)

is the volume of the spherical shell S r δ ( x i ) = { y 3 | s < | y x i | s + δ } $ S_r^{{\delta}}({\boldsymbol{x}}_i)=\{{\boldsymbol{y}}\in{{\mathbb{R}}}^3 \,|\, s < |{\boldsymbol{y}}-{\boldsymbol{x}}_i|\le s+{{\delta}} \} $ with a radial range in [r, r + δ] around xi inside the sample geometry W. For simple sample geometries, the area(∂Bs(xi)∩W) can be calculated explicitly (see Appendix B.1). The integral in Eq. (A.9) can be evaluated using standard numerical methods to obtain the expectation value of value DR(r):

DR ( r ) = 1 | W | N i = 1 N V r δ ( x i ) 1 | W | 1 N i = 1 N area ( B r ( x i ) W ) δ for δ small . $$ \begin{aligned} {\mathcal{DR} }(r)&= \frac{1}{|{W}|N}\sum _{i=1}^N {\mathcal{V} }_r^{\delta }({\boldsymbol{x}}_i) \\&\approx \frac{1}{|{W}|}\, \frac{1}{N} \sum _{i=1}^N \mathrm{area}(\partial {B}_{r}({\boldsymbol{x}}_i)\cap {W})\,{\delta }\quad \mathrm{for}\; {\delta }\; \mathrm{small}.\nonumber \end{aligned} $$(A.10)

Clearly, the expectation value of the estimators from Davis & Peebles (1983), Hewett (1982), and Hamilton (1993) can be expressed in terms of the istotropised set-covariance and the average area fraction (Kerscher 1999).

A.1. Periodic boundaries

Let us assume that our window is a rectangular box W = [0, L1]×[0, L2]×[0, L3] with periodic boundaries, i.e. W has the topology of a three-torus. The majority of cosmological simulations enforce these boundary conditions. In such a situation, no boundary corrections are needed in the calculation of the two-point correlation function, but we have to respect the periodicity in each coordinate direction (Stoyan et al. 1995). The distance between two points x, y ∈ W is d ( x , y ) = d 1 2 + d 2 2 + d 3 2 $ d(\boldsymbol{x},\boldsymbol{y})=\sqrt{d_1^2+d_2^2+d_3^2} $, with di = min{|xi − yi|,Li − |xi − yi|}. This is applicable if r is smaller than any Li/2. For the shifted periodic box, we have Wx = {y | y − x ∈ W}=W and consequently the (isotropised) set-covariance is constant γ W ( x ) = γ W ¯ ( | x | ) = | W | $ \gamma_{W}(\boldsymbol{x})=\overline{\gamma_{W}}(|\boldsymbol{x}|)=|{W}| $. Similar to the derivation in Eq. (A.4), the expectation value of DD(r) can be calculated using the Campbell-Mecke formula:

E [ DD ( r ) ] = 4 π N 2 r r + δ | W | ϱ 2 ( 1 + ξ ( s ) ) s 2 d s = | S r δ | | W | + 4 π | W | r r + δ ξ ( s ) s 2 d s 4 π r 2 δ | W | ( 1 + ξ ( s ) ) for δ small , $$ \begin{aligned} {\mathbb{E} }[\mathsf{DD }(r)]&= \frac{4\pi }{N^2} \int _{r}^{r+{\delta }}\!\! |{W}|\, \varrho ^2 (1+\xi (s))\, s^2{\mathrm{d} }s \nonumber \\&= \frac{|S_r^{\delta }|}{|{W}|} + \frac{4\pi }{|{W}|} \int _{r}^{r+{\delta }}\!\! \xi (s)\, s^2{\mathrm{d} }s \nonumber \\&\approx \ \frac{4\pi r^2 {\delta }}{|{W}|} (1 + \xi (s)) \quad \mathrm{for}\; {\delta }\; \mathrm{small}, \end{aligned} $$(A.11)

where the volume of the shell | S r δ | = 4 π 3 ( ( r + δ ) 3 r 3 ) $ |S_r^{{\delta}}|=\frac{4\pi}{3} \left((r+{{\delta}})^3 - r^3\right) $. Therefore, for small δ, | W | | S r δ | DD ( r ) 1 $ \frac{|{W}|}{|S_r^{{\delta}}|}{{\mathsf{DD}}}(r)-1 $ is an unbiased estimate of the two-point correlation function ξ(r) in a periodic box; neither a random points set, nor a boundary correction with geometric factors is needed.

A.2. Expectation of pair-counts for inhomogeneous point sets

As described in Sect. 6 we are naturally confronted with an inhomogeneous sampled galaxy distribution. Here we calculate the expectation values of DD′,DR′ and RR′ for such an inhomogeneous situation.

One also has Campbell-Mecke formulas for an inhomogeneous point process Φ′ (Baddeley et al. 2000; Shaw et al. 2021):

E [ x Φ f ( x ) ] = R 3 f ( x ) ϱ ( x ) d x , $$ \begin{aligned} {\mathbb{E} }\left[\sum _{\boldsymbol{x}\in \Phi \prime } f(\boldsymbol{x})\,\right]&= \int _{{\mathbb{R} }^3}f(\boldsymbol{x})\, \varrho \prime (\boldsymbol{x})\, {\mathrm{d} }{\boldsymbol{x}},\end{aligned} $$(A.12)

E [ x , y Φ x y g ( x , y ) ] = R 3 R 3 g ( x , y ) ϱ 2 ( x , y ) d x d y = R 3 R 3 g ( x , y ) ϱ ( x ) ϱ ( y ) ( 1 + ξ ( | x y | ) ) d x d y . $$ \begin{aligned} {\mathbb{E} }\left[\sum _{\begin{matrix} {\boldsymbol{x}},{\boldsymbol{y}}\in \Phi \prime \\ {\boldsymbol{x}}\ne {\boldsymbol{y}} \end{matrix}} g(\boldsymbol{x},\boldsymbol{y})\,\right]&= \int _{{\mathbb{R} }^3}\int _{{\mathbb{R} }^3} g(\boldsymbol{x},\boldsymbol{y})\, \varrho _2\prime (\boldsymbol{x},\boldsymbol{y})\, {\mathrm{d} }{\boldsymbol{x}}{\mathrm{d} }{\boldsymbol{y}} \\&= \int _{{\mathbb{R} }^3}\int _{{\mathbb{R} }^3} g(\boldsymbol{x},\boldsymbol{y})\,\varrho \prime (\boldsymbol{x})\varrho \prime (\boldsymbol{y})\,(1+\xi^\prime (|\boldsymbol{x}-\boldsymbol{y}|))\, {\mathrm{d} }{\boldsymbol{x}}{\mathrm{d} }{\boldsymbol{y}}.\nonumber \end{aligned} $$(A.13)

Again we consider

S = x , y Φ x y 1 W ( x ) 1 W ( y ) δ k r δ ( | x y | ) , $$ \begin{aligned} S^\prime&= \sum _{\begin{matrix} {\boldsymbol{x}},{\boldsymbol{y}}\in \Phi \prime \\ {\boldsymbol{x}}\ne {\boldsymbol{y}} \end{matrix}} {\mathbb{1} }_{W}(\boldsymbol{x}) {\mathbb{1} }_{W}(\boldsymbol{y}) {\delta }k_r^{\delta }(|{\boldsymbol{x}}-{\boldsymbol{y}}|) , \end{aligned} $$(A.14)

and calculate the expectation (see Baddeley et al. 2000; Shaw et al. 2021 for similar derivations)

E [ S ] = R 3 R 3 1 W ( x ) 1 W ( y ) δ k r δ ( | x y | ) ϱ 2 ( | x y | ) d x d y = R 3 R 3 1 W ( z + y ) 1 W ( y ) ϱ ( z + y ) ϱ ( y ) d y δ k r δ ( | z | ) ( 1 + ξ ( | z | ) d z = 4 π r r + δ Γ ¯ W ( s ) ( 1 + ξ ( s ) ) s 2 d s , $$ \begin{aligned} &{{\mathbb{E} }[S^\prime ] = \int _{{\mathbb{R} }^3}\!\int _{{\mathbb{R} }^3}\!\! {\mathbb{1} }_{W}(\boldsymbol{x}) {\mathbb{1} }_{W}(\boldsymbol{y}) {\delta }k_r^{\delta }(|\boldsymbol{x}-\boldsymbol{y}|) \varrho _2\prime (|\boldsymbol{x}-\boldsymbol{y}|)\, {\mathrm{d} }{\boldsymbol{x}}{\mathrm{d} }{\boldsymbol{y}}} \\ &\quad\ \ = \int _{{\mathbb{R} }^3}\! \int _{{\mathbb{R} }^3}\!\! {\mathbb{1} }_{W}(\boldsymbol{z}+\boldsymbol{y}) {\mathbb{1} }_{W}(\boldsymbol{y}) \varrho \prime (\boldsymbol{z}+\boldsymbol{y})\varrho \prime (\boldsymbol{y}) {\mathrm{d} }{\boldsymbol{y}}\ {\delta }k_r^{\delta }(|\boldsymbol{z}|) (1+\xi^\prime (|\boldsymbol{z}|) {\mathrm{d} }{\boldsymbol{z}} \\&\quad\ \ = 4\pi \int _{r}^{r+{\delta }}\!\! \overline{\Gamma }_{W}(s)\ (1+\xi^\prime (s))\ s^2{\mathrm{d} }s, \end{aligned} $$(A.15)

where Γ ¯ W ( s ) $ \overline{\Gamma}_{W}(s) $ is the density weighted isotropised set-covariance

Γ ¯ W ( s ) = 1 4 π 0 π 0 2 π R 3 1 W ( z ( s , θ , ϕ ) + y ) 1 W ( y ) ϱ ( z ( s , θ , ϕ ) + y ) ϱ ( y ) d y sin ( θ ) d θ d ϕ . $$ \begin{aligned} \overline{\Gamma }_{W}(s)&= \frac{1}{4\pi }\int _0^\pi \!\!\! \int _0^{2\pi }\!\!\! \int _{{\mathbb{R} }^3} {\mathbb{1} }_{W}(\boldsymbol{z}(s,\theta ,\phi )+\boldsymbol{y}) {\mathbb{1} }_{W}(\boldsymbol{y})\nonumber \\&\qquad \qquad \varrho \prime (\boldsymbol{z}(s,\theta ,\phi )+\boldsymbol{y})\varrho \prime (\boldsymbol{y})\ {\mathrm{d} }{\boldsymbol{y}} \sin (\theta ){\mathrm{d} }\theta {\mathrm{d} }\phi . \end{aligned} $$(A.16)

For a homogeneous point distribution with 𝜚′(x) = 𝜚, we consistently get Γ ¯ W ( s ) = ϱ 2 γ W ¯ ( s ) $ \overline{\Gamma}_{W}(s)=\varrho^2\,\overline{\gamma_{W}}(s) $.

The random point set used in RR′ is a realisation of an inhomogeneous Poisson process with number density N r N ϱ ( x ) $ \tfrac{{{{N_{\text{r}}}}}}{N}\varrho^{\prime}(\boldsymbol{x}) $, and a vanishing two-point correlation function ξ′(r) = 0. N is the number of galaxies and Nr the number of random points in W. From Eq. (A.15) we get

E [ RR ( r ) ] = RR ( r ) = 4 π N 2 r r + δ Γ W ¯ ( s ) s 2 d s 4 π r 2 δ N 2 Γ W ¯ ( r ) for δ small . $$ \begin{aligned} {\mathbb{E} }[\mathsf{RR }^\prime (r)] = {\mathcal{RR} }^\prime (r)&= \frac{4\pi }{N^2} \int _{r}^{r+{\delta }} \overline{\Gamma _{W}}(s) \ s^2{\mathrm{d} }s \\&\approx \frac{4\pi r^2 {\delta }}{N^2}\ \overline{\Gamma _{W}}(r) \quad \mathrm{for}\; {\delta }\; \mathrm{small}.\nonumber \end{aligned} $$(A.17)

Using Eq. (A.15) and Eq. (A.17) we can determine

E [ DD ( r ) ] E [ RR ( r ) ] = r r + δ Γ W ¯ ( s ) ( 1 + ξ ( s ) ) s 2 d s r r + δ Γ W ¯ ( s ) s 2 d s 1 + ξ ( r ) for δ small , $$ \begin{aligned} \frac{{\mathbb{E} }[\mathsf{DD }^\prime (r)]}{{\mathbb{E} }[\mathsf{RR }^\prime (r)]}&= \frac{\int _{r}^{r+{\delta }} \overline{\Gamma _{W}}(s)\ (1+\xi^\prime (s)) s^2{\mathrm{d} }s}{\int _{r}^{r+{\delta }} \overline{\Gamma _{W}}(s)\ s^2{\mathrm{d} }s}\nonumber \\&\approx 1 + \xi^\prime (r) \quad \mathrm{for}\; {\delta }\;\mathrm{small},\nonumber \end{aligned} $$

Hence we arrive at the well-known result that the estimator ξ PH ( r ) = DD ( r ) RR ( r ) 1 $ \xi_{\mathrm{PH}}^{\prime}(r)= \frac{{{\mathsf{DD}}}^{\prime}(r)}{{{\mathsf{RR}}}^{\prime}(r)}-1 $ of Peebles & Hauser (1974) is ‘ratio unbiased’. ξ PH ( r ) $ \xi_{\rm PH}^\prime(r) $ is an estimate of the two-point correlation function ξ′(r) = ξG(r) if we apply the same selection effects to the random point set as we find in the incompletely sampled galaxy distribution.

For a given point set { x i } i = 1 N $ \{{\boldsymbol{x}}_i\}_{i=1}^N $ inside the sample geometry xi ∈ W, we calculate the expectation of DR′. Using the Campbell-Mecke formula and with Φ′ an inhomogeneous Poisson process with number density N r N ϱ ( x ) $ \tfrac{{{{N_{\text{r}}}}}}{N}\varrho^{\prime}(\boldsymbol{x}) $, we obtain

E [ DR ] = DR ( r ) = 1 N N r i = 1 N R 3 1 W ( y ) δ k r δ ( | x i y | ) N r N ϱ ( y ) d y = 1 N 2 i = 1 N U r δ ( x i ) , $$ \begin{aligned} {\mathbb{E} }[\mathsf{DR }^\prime ]&= {\mathcal{DR} }^\prime (r) = \frac{1}{N{{N_{\rm r}}}} \sum _{i=1}^N \int _{{\mathbb{R} }^3} {\mathbb{1} }_{W}(\boldsymbol{y})\, {\delta }k_r^{\delta }(|{\boldsymbol{x}}_i-\boldsymbol{y}|)\, \tfrac{{{N_{\rm r}}}}{N}\varrho \prime (\boldsymbol{y})\, {\mathrm{d} }{\boldsymbol{y}}\nonumber \\&= \frac{1}{N^2} \sum _{i=1}^N {\mathcal{U} }_r^{\delta }({\boldsymbol{x}}_i), \end{aligned} $$(A.18)

where

U r δ ( x i ) = R 3 1 W ( y ) δ k r δ ( | x i y | ) ϱ ( y ) d y $$ \begin{aligned} {\mathcal{U} }_r^{\delta }({\boldsymbol{x}}_i) = \int _{{\mathbb{R} }^3} {\mathbb{1} }_{W}(\boldsymbol{y})\, {\delta }k_r^{\delta }(|{\boldsymbol{x}}_i-\boldsymbol{y}|)\, \varrho \prime (\boldsymbol{y})\, {\mathrm{d} }{\boldsymbol{y}} \end{aligned} $$(A.19)

is the integral of the density 𝜚′(y) in the volume of the spherical shell S r δ ( x i ) = { y 3 | s < | y x i | s + δ } $ S_r^{{\delta}}({\boldsymbol{x}}_i)=\{\boldsymbol{y}\in{{\mathbb{R}}}^3 \,|\, s < |\boldsymbol{y}-{\boldsymbol{x}}_i|\le s+{{\delta}}\} $ inside W.

Appendix B: Simple windows

The area fraction area(∂Br(x)∩W) and the (isotropised) set covariance γ W ¯ ( r ) $ \overline{\gamma_{W}}(r) $ can be calculated for a rectangular box W. We use these expressions to determine the accuracy of the (quasi-) Monte Carlo integration schemes. We also reiterate the result for a sphere and for two-dimensions.

B.1. Rectangular box

For a point x = (x1, x2, x3)T and a rectangular box W = [0, L1]×[0, L2]×[0, L3] with side lengths L1 > |x1|,L2 > |x2|,L3 > |x3|, the set–covariance is

γ W ( x ) = ( L 1 | x 1 | ) ( L 2 | x 2 | ) ( L 3 | x 3 | ) , $$ \begin{aligned} \gamma _{W}(\boldsymbol{x}) = (L_1-|x_1|)(L_2-|x_2|)(L_3-|x_3|), \end{aligned} $$(B.1)

and the isotropised set-covariance is for r < min{L1, L2, L3}

γ W ¯ ( r ) = L 1 L 2 L 3 r 2 ( L 1 L 2 + L 1 L 3 + L 2 L 3 ) + 2 r 2 3 π ( L 1 + L 2 + L 3 ) r 3 4 π ; $$ \begin{aligned}&\overline{\gamma _{W}}(r) = L_1 L_2 L_3 - \nonumber \\&\quad -\frac{r}{2}(L_1 L_2+L_1 L_3+L_2 L_3) + \frac{2r^2}{3\pi }(L_1+L_2+L_3) - \frac{r^3}{4\pi }; \end{aligned} $$(B.2)

(see e.g. Stoyan & Stoyan 1994; Kerscher 1999). A simple integration gives us the integrated isotropised set-covariance:

r R γ W ¯ ( s ) s 2 d s = 1 3 L 1 L 2 L 3 ( R 3 r 3 ) 1 8 ( L 1 L 2 + L 1 L 3 + L 2 L 3 ) ( R 4 r 4 ) + + 2 15 π ( L 1 + L 2 + L 3 ) ( R 5 r 5 ) 1 24 π ( R 6 r 6 ) , $$ \begin{aligned}&\int _r^R\overline{\gamma _{W}}(s) s^2{\mathrm{d} }s = \tfrac{1}{3} L_1 L_2 L_3 (R^3 - r^3) - \nonumber \\&\quad - \tfrac{1}{8}(L_1 L_2+L_1 L_3+L_2 L_3)(R^4 - r^4) + \nonumber \\&\quad + \tfrac{2}{15\pi }(L_1+L_2+L_3) (R^5 - r^5) - \tfrac{1}{24\pi } (R^6 - r^6), \end{aligned} $$(B.3)

which we need to calculate ℛℛ(r) according to Eq. (A.5).

It is more involved to calculate the surface area of a sphere inside a rectangular box W. Using the inclusion-exclusion formula, Baddeley et al. (1993) derive an explicit expression for the area fraction area(∂Br(x)∩W):

area ( B r ( x ) W ) = 4 π r 2 i = 1 3 { A 1 ( x i , r ) + A 1 ( L i x i , r ) } + i = 1 3 j = i + 1 3 { A 2 ( x i , x j , r ) + A 2 ( x i , L j x j , r ) + A 2 ( L i x i , x j , r ) + A 2 ( L i x i , L j x j , r ) } A 3 ( x 1 , x 2 , x 3 , r ) A 3 ( L 1 x 1 , x 2 , x 3 , r ) A 3 ( x 1 , L 2 x 2 , x 3 , r ) A 3 ( x 1 , x 2 , L 3 x 3 , r ) A 3 ( L 1 x 1 , L 2 x 2 , x 3 , r ) A 3 ( L 1 x 1 , x 2 , L 3 x 3 , r ) A 3 ( x 1 , L 2 x 2 , L 3 x 3 , r ) A 3 ( L 1 x 1 , L 2 x 2 , L 3 x 3 , r ) , $$ \begin{aligned}&{\mathrm{area}(\partial {B}_r(\boldsymbol{x})\cap {W}) = 4\pi r^2 - \sum _{i=1}^3 \left\{ A_1(x_i,r) + A_1(L_i-x_i,r)\right\} }\nonumber \\&+\sum _{i=1}^3 \sum _{j=i+1}^3 \Big \{\, A_2(x_i,x_j,r) + A_2(x_i,L_j-x_j,r)\nonumber \\&\qquad \qquad \qquad + A_2(L_i-x_i,x_j,r) + A_2(L_i-x_i,L_j-x_j,r)\,\Big \}\nonumber \\&- A_3(x_1,x_2,x_3,r) - A_3(L_1-x_1,x_2,x_3,r) - A_3(x_1,L_2-x_2,x_3,r)\nonumber \\&- A_3(x_1,x_2,L_3-x_3,r) - A_3(L_1-x_1,L_2-x_2,x_3,r)\nonumber \\&- A_3(L_1-x_1,x_2,L_3-x_3,r) - A_3(x_1,L_2-x_2,L_3-x_3,r)\nonumber \\&- A_3(L_1-x_1,L_2-x_2,L_3-x_3,r), \end{aligned} $$(B.4)

with

A 1 ( t 1 , r ) = 4 r 2 C ( t 1 / r , 0 , 0 ) A 2 ( t 1 , t 2 , r ) = 2 r 2 C ( t 1 / r , t 2 / r , 0 ) A 3 ( t 1 , t 2 , t 3 , r ) = r 2 C ( t 1 / r , t 2 / r , t 3 / r ) . $$ \begin{aligned} A_1(t_1,r)&= 4 r^2 C(t_1/r,0,0)\nonumber \\ A_2(t_1,t_2,r)&= 2 r^2 C(t_1/r,t_2/r,0)\\ A_3(t_1,t_2,t_3,r)&= r^2 C(t_1/r,t_2/r,t_3/r).\nonumber \end{aligned} $$(B.5)

The following expression for C(a, b, c) is almost the same as Eq. (37) from Baddeley et al. (1993), but with two typos corrected. These correct results have already been used in the spatstat package for R (Baddeley & Turner 2005; see sphefrac.c for the implementation). For a, b, c > 0 and a2 + b2 + c2 < 1, we get

C ( a , b , c ) = tan 1 ( ( 1 a 2 c 2 ) 1 / 2 ac ) + tan 1 ( ( 1 b 2 c 2 ) 1 / 2 bc ) + tan 1 ( ( 1 a 2 b 2 ) 1 / 2 ab ) a tan 1 ( ( 1 a 2 c 2 ) 1 / 2 c ) + a tan 1 ( b ( 1 a 2 b 2 ) 1 / 2 ) b tan 1 ( ( 1 b 2 c 2 ) 1 / 2 c ) + b tan 1 ( a ( 1 a 2 b 2 ) 1 / 2 ) c tan 1 ( ( 1 a 2 c 2 ) 1 / 2 a ) + c tan 1 ( b ( 1 b 2 c 2 ) 1 / 2 ) π . $$ \begin{aligned}&{C(a,b,c) = {\tan ^{-1}}\left(\frac{(1-a^2-c^2)^{1/2}}{ac}\right) + {\tan ^{-1}}\left(\frac{(1-b^2-c^2)^{1/2}}{bc}\right)} \nonumber \\&\quad + {\tan ^{-1}}\left(\frac{(1-a^2-b^2)^{1/2}}{ab}\right) - a {\tan ^{-1}}\left(\frac{(1-a^2-c^2)^{1/2}}{c}\right)\nonumber \\&\quad + a {\tan ^{-1}}\left(\frac{b}{(1-a^2-b^2)^{1/2}}\right) - b {\tan ^{-1}}\left(\frac{(1-b^2-c^2)^{1/2}}{c}\right)\nonumber \\&\quad + b {\tan ^{-1}}\left(\frac{a}{(1-a^2-b^2)^{1/2}}\right) - c {\tan ^{-1}}\left(\frac{(1-a^2-c^2)^{1/2}}{a}\right)\nonumber \\&\quad + c {\tan ^{-1}}\left(\frac{b}{(1-b^2-c^2)^{1/2}}\right) - \pi . \end{aligned} $$(B.6)

For a2 + b2 + c2 ≥ 1, we have C(a, b, c) = 0. As special cases,

C ( a , b , 0 ) = tan 1 ( ( 1 a 2 b 2 ) 1 / 2 ab ) a tan 1 ( ( 1 a 2 b 2 ) 1 / 2 b ) b tan 1 ( ( 1 a 2 b 2 ) 1 / 2 a ) , $$ \begin{aligned}&C(a,b,0) = {\tan ^{-1}}\left(\frac{(1-a^2-b^2)^{1/2}}{ab}\right) -a {\tan ^{-1}}\left(\frac{(1-a^2-b^2)^{1/2}}{b}\right)\nonumber \\&\qquad \qquad \qquad \qquad -b {\tan ^{-1}}\left(\frac{(1-a^2-b^2)^{1/2}}{a}\right), \end{aligned} $$(B.7)

and C ( a , 0 , 0 ) = π 2 ( 1 a ) $ C(a,0,0)=\tfrac{\pi}{2}(1-a) $.

B.2. Sphere

For a spherical sample W = BR centred at the origin with radius R and with r = |x|< R, the set-covariance reads

γ W ( x ) = 4 π 3 ( R 3 3 4 r R 2 + 1 16 r 3 ) . $$ \begin{aligned} \gamma _{W}(\boldsymbol{x}) = \frac{4\pi }{3} \left(R^3 -\frac{3}{4}rR^2 + \frac{1}{16} r^3\right). \end{aligned} $$(B.8)

Due to symmetry, the isotropised set-covariance is simply γ W ¯ ( | x | ) = γ W ( x ) $ \overline{\gamma_{W}}(|\boldsymbol{x}|)=\gamma_{W}(\boldsymbol{x}) $. The area fraction for a point q ∈ W at a distance |q|=s > R − r from the origin, and with r < R is area(∂Br(q)∩W) = 4 αr2, with cos ( α ) = r 2 + s 2 R 2 2 r s $ \cos(\alpha) = \frac{r^2+s^2-R^2}{2rs} $.

B.3. Two dimensions

Here we give the expressions for the set-covariance in two dimensions. For a point x = (x1, x2)T ∈ ℝ2 and a rectangle W = [0, L1]×[0, L2] with side lengths L1 > |x1|,L2 > |x2|, the set–covariance is

γ W ( x ) = ( L 1 | x 1 | ) ( L 2 | x 2 | ) , $$ \begin{aligned} \gamma _{W}(\boldsymbol{x})= (L_1-|x_1|)(L_2-|x_2|), \end{aligned} $$(B.9)

and for r < min{L1, L2}, the isotropised set-covariance is (Ripley 1988)

γ W ¯ ( r ) = 1 2 π 0 2 π γ W ( x ( r , ϕ ) ) d ϕ = L 1 L 2 2 r π ( L 1 + L 2 ) + r 2 π . $$ \begin{aligned} \overline{\gamma _{W}}(r)&= \frac{1}{2\pi }\int _0^{2\pi }\gamma _{W}(\boldsymbol{x}(r,\phi ))\,{\mathrm{d} }\phi \nonumber \\&= L_1 L_2 -\frac{2r}{\pi }(L_1+L_2) + \frac{r^2}{\pi }. \end{aligned} $$(B.10)

The analogue to area(∂Br(x)∩W) is the arc-length arc(Cr(x)∩W) of the circumference of a circle Cr(x) with radius r centred on a point x ∈ W inside the sample window W. It can be calculated from the intersection points of the circle with the sides of W and the angles between them and the centre (Stoyan et al. 1995).

Appendix C: Implementation and run time

We use Python with NumPy (Harris et al. 2020) and SciPy (Virtanen et al. 2020) to implement the data processing, calculate geometric quantities, generate random points, and to generate randomised Halton sequences (provided in SciPy starting with version ≥1.7.0). The performance critical parts are the pair-counting and a function for checking whether or not points are inside the observational window. We implemented these routines as C++ functions parallelised using OpenMP (Dagum & Menon 1998). They are made callable from Python via pybind11 (Jakob et al. 2017). In the computations of DD and RR, we obviously avoid the double counting by using ∑ij ≠ i… = 2∑ij > i…⋅ We implement the double sums without using a tree, resulting in a code with a run time quadratic in the number of points (see below). Using OpenMP, we could achieve a reduction of the computing time almost linear with the number of computing cores used (at least up to the 20 cores we were able to access; see also Alonso 2012). Only these optimisations are included in our code but perhaps it is versatile enough to help with future implementations3.

C.1 Run time

As initial guidance, the run time of the pair-count calculations should scale with the number of distance calculations actually performed. This again depends on the number of (quasi-)random or data points, and on the data structure and algorithm used. With our direct implementation of the double sums from eqs. (2), (5), (6), (16) we expect the run time to scale as TDD ∝ N2, TDR ∝ NNr, TRR N r 2 $ {{{N^2_{\rm r}}}} $ and TRR, 2x3d N q 2 $ {{{N^2_{\rm q}}}} $, respectively. This is independent of the number of radius intervals, because we can sort the counts into the corresponding radius bin. The calculation of DRshell should scale as TDR, shell ∝ NNsh, but we need to repeat the pair-count calculation for each radius bin. One could envision optimising this for convex windows.

To estimate the run time, we ran our mildly optimised code on a current small workstation equipped with an Intel Xeon W1350 processor with six cores and a clock frequency of at least 3.3 GHz. Memory was not an issue in these calculations. As expected, we see from Fig. C.1 that for large Nr, the run time of RR and RR2x3d grows quadratically with Nr, and that RR2x3d is slower than RR by a factor of two. The run time of DR and DRshell shows the expected linear scaling with Nr. At fixed Nr and N, the run time of DRshell is more than one order of magnitude longer than DR. But we note that for fixed accuracy we need significantly less quasi-random points in DRshell. We also consider subsamples with N = 1000 and N = 100 data points from the N = 10429 clusters and confirm the linear scaling of TDR and TDR, shell with N. We find the same results for r ∈ [30, 35] Mpc h−1 and r ∈ [100, 105] Mpc h−1.

thumbnail Fig. C.1.

Run time of the pair-counts for r ∈ [5, 10] Mpc h−1 and W = [0, 1]3. The black dotted lines are proportional to Nr and N r 2 $ {{{N^2_{\rm r}}}} $.

In all our calculations, we used a parallelised direct double sum implementation to calculate the pair-counts. As already mentioned, a significant speedup can be expected if one uses tree-based methods (Moore et al. 2001). To illustrate this, we compare the run time of our parallelised pair-count implementation for RR to an implementation using the kd–tree provided with SciPy (KDTree). As the direct double sum implementation ran parallelised and the KDTree on a single core, the absolute run times are not comparable, but the scaling with Nr is the essential result. The direct implementation shows the expected scaling ∝ N r 2 $ {{{N^2_{\rm r}}}} $ of the run time in Fig. C.2. For the tree-based method, we observe that for small radii the run time scales with ∝Nrlog(Nr) and a significant speedup can be achieved. However, for large radii, the run time scales ∝ N r 2 $ {{{N^2_{\rm r}}}} $, comparable to the direct implementation. In a similar comparison for the window [0, 1]×[0, 0.1]2 the run time of the tree based method scales ∝ N r 2 $ {{{N^2_{\rm r}}}} $ for all radii. We therefore confirm a significant speedup for tree-based methods, especially in voluminous samples and for small radii. Tree-based methods, or the other computational optimisations mentioned in the introduction, can be applied directly to DD, RR, DR, and RR2x3d, and should speedup these calculations further. Whether or not similar optimisation schemes can be found for DRshell is an open question.

thumbnail Fig. C.2.

Run time of RR in [0, 1]3 for the direct double sum implementation (three solid lines, almost on top of each other), and the implementation using KDTree (dashed lines). The black dotted lines are proportional to Nrlog(Nr) and N r 2 $ {{{N^2_{\rm r}}}} $.

For a given point set and window there is plenty of scope for optimising the run times. However, such an optimisation problem easily turns into a multi-parameter endeavour. The run time not only depends the accuracy goal, the radius, the bin width, and the density of the point set but also on the hardware platform, the operating system, the algorithms, the data structures, the details of the implementation, and so on.

All Tables

Table 1.

Parameters and results of the run-time comparison.

Table 2.

We sum the parts and compare the run times for the standard ξLS and the new ξ LS $ \widetilde{\xi}_{\mathrm{LS}} $ in all the cases from Table 1.

Table 3.

Parameters and results of a run time comparison similar to Table 1 but now for a sample with N = 1.36 million galaxies (XI) and a random subsample with N = 105 galaxies (X) in a cube with side length 325 Mpc h−1.

All Figures

thumbnail Fig. 1.

Relative error εRR against the number of random points Nr used in the standard procedure for calculating RR(r) in rectangular windows W. The black dotted lines are proportional to 1/​Nr and 1 / N r $ 1/\!\sqrt{{{{N_{\text{r}}}}}} $.

In the text
thumbnail Fig. 2.

Relative error εRR, 2x3d calculated for rectangular windows W with a pure Monte Carlo integration (solid lines) and with a randomised Halton sequence (dashed lines). The black dotted lines are proportional to 1 / N r $ 1/\!\sqrt{{{{N_{\text{r}}}}}} $ (upper) and 1/Nq (lower).

In the text
thumbnail Fig. 3.

Relative error εDR calculated for rectangular windows W with a standard Monte Carlo integration (solid lines) and with a quasi-Monte Carlo scheme using a randomised Halton sequence (dashed lines). The black dotted lines are proportional to 1 / N $ 1/\!\sqrt{N} $ (upper) and (log N)3/N (lower).

In the text
thumbnail Fig. 4.

Relative error εDR, shell calculated for rectangular windows W with a pure Monte Carlo integration (solid lines) and with a randomised Halton sequence (dashed lines). The black dotted lines are proportional to 1 / N sh $ 1/\sqrt{{{{N_{\text{sh}}}}}} $ (upper) and (log Nsh)3/Nsh (lower).

In the text
thumbnail Fig. 5.

Two-point correlation function calculated for a simulated galaxy cluster sample. The exact Ξ is compared to standard ξLS with Nr = 104 and Nr = 105 random points, and to the new ξ LS $ \widetilde\xi_{\mathrm{LS}} $ using N2x3d = 104 = Nsh points from a randomised Halton sequence.

In the text
thumbnail Fig. 6.

Absolute error Δ1 is shown for the standard RR (solid lines) and for RR2x3d (dashed lines) pair-counts. The black dotted lines are proportional to 1 / N r $ 1/\sqrt{{{{N_{\text{r}}}}}} $ (upper) and 1/N2x3d (lower).

In the text
thumbnail Fig. 7.

Absolute error Δ2 is shown for the standard DR (solid lines), the DR using a low-discrepancy sequence (dotted lines), and for DRshell also using a low-discrepancy sequence (dashed lines). The black dotted lines are proportional to 1 / N r $ 1{/}\sqrt{{{{N_{\text{r}}}}}} $ (upper) and 1/Nsh (lower).

In the text
thumbnail Fig. 8.

Errors for estimates of the correlation function at r ∈ [2, 3] Mpc h−1 from 1.36 million galaxies and a subsample of 100 k galaxies in a box with side-length 325 Mpc h−1. Upper plot: absolute error Δ1 shown for the standard RR (solid lines) and for RR2x3d (dashed lines) pair-counts. Lower plot: absolute error Δ2 is shown for the standard DR (solid lines), the DR using a low-discrepancy sequence (dotted lines), and for DRshell also using a low-discrepancy sequence (dashed lines). The black dotted lines are proportional to 1 / N r $ 1/\sqrt{{{{N_{\text{r}}}}}} $ (upper) and 1/N2x3d (lower).

In the text
thumbnail Fig. C.1.

Run time of the pair-counts for r ∈ [5, 10] Mpc h−1 and W = [0, 1]3. The black dotted lines are proportional to Nr and N r 2 $ {{{N^2_{\rm r}}}} $.

In the text
thumbnail Fig. C.2.

Run time of RR in [0, 1]3 for the direct double sum implementation (three solid lines, almost on top of each other), and the implementation using KDTree (dashed lines). The black dotted lines are proportional to Nrlog(Nr) and N r 2 $ {{{N^2_{\rm r}}}} $.

In the text

Current usage metrics show cumulative count of Article Views (full-text article views including HTML views, PDF and ePub downloads, according to the available data) and Abstracts Views on Vision4Press platform.

Data correspond to usage on the plateform after 2015. The current usage metrics is available 48-96 hours after online publication and is updated daily on week days.

Initial download of the metrics may take a while.