A&A, Volume 631, November 2019
Article Number: A73
Number of pages: 11
Section: Cosmology (including clusters of galaxies)
DOI: https://doi.org/10.1051/0004-6361/201935828
Published online: 22 October 2019
Estimating the galaxy two-point correlation function using a split random catalog
^{1} Department of Physics and Helsinki Institute of Physics, University of Helsinki, Gustaf Hällströmin katu 2, 00014 Helsinki, Finland
e-mail: elina.keihanen@helsinki.fi
^{2} Scuola Normale Superiore, Piazza dei Cavalieri 7, 56126 Pisa, Italy
^{3} Department of Mathematics and Physics, Roma Tre University, Via della Vasca Navale 84, 00146 Rome, Italy
^{4} Dipartimento di Fisica e Astronomia – Alma Mater Studiorum Università di Bologna, Via Piero Gobetti 93/2, 40129 Bologna, Italy
^{5} INAF – Osservatorio di Astrofisica e Scienza dello Spazio di Bologna, Via Piero Gobetti 93/3, 40129 Bologna, Italy
^{6} INFN – Sezione di Bologna, Viale Berti Pichat 6/2, 40127 Bologna, Italy
^{7} ICC & CEA, Department of Physics, Durham University, South Road, Durham DH1 3LE, UK
^{8} INAF, Osservatorio Astronomico di Trieste, Via Tiepolo 11, 34131 Trieste, Italy
^{9} Aix Marseille Univ., CNRS, CNES, LAM, Marseille, France
^{10} SISSA, International School for Advanced Studies, Via Bonomea 265, 34136 Trieste, TS, Italy
^{11} INFN, Sezione di Trieste, Via Valerio 2, 34127 Trieste, TS, Italy
^{12} Aix Marseille Univ., Université de Toulon, CNRS, CPT, Marseille, France
^{13} Max-Planck-Institut für extraterrestrische Physik, Postfach 1312, Giessenbachstr., 85741 Garching, Germany
^{14} IFPU – Institute for Fundamental Physics of the Universe, Via Beirut 2, 34014 Trieste, Italy
Received: 3 May 2019 / Accepted: 24 July 2019
The two-point correlation function of the galaxy distribution is a key cosmological observable that allows us to constrain the dynamical and geometrical state of our Universe. To measure the correlation function we need to know both the galaxy positions and the expected galaxy density field. The expected field is commonly specified using a Monte Carlo sampling of the volume covered by the survey and, to minimize additional sampling errors, this random catalog has to be much larger than the data catalog. Correlation function estimators compare data–data pair counts to data–random and random–random pair counts, where random–random pairs usually dominate the computational cost. Future redshift surveys will deliver spectroscopic catalogs of tens of millions of galaxies. Given the large number of random objects required to guarantee sub-percent accuracy, it is of paramount importance to improve the efficiency of the algorithm without degrading its precision. We show both analytically and numerically that splitting the random catalog into a number of subcatalogs of the same size as the data catalog when calculating random–random pairs, and excluding pairs across different subcatalogs, provides the optimal error at fixed computational cost. For a random catalog fifty times larger than the data catalog, this reduces the computation time by a factor of more than ten without affecting estimator variance or bias.
Key words: large-scale structure of Universe / cosmology: observations / methods: statistical / methods: data analysis
© ESO 2019
1. Introduction
The spatial distribution of luminous matter in the Universe is a key diagnostic for studying cosmological models and the physical processes involved in the assembly of structure. In particular, light from galaxies is a robust tracer of the overall matter distribution, whose statistical properties can be predicted by cosmological models. Two-point correlation statistics are very effective tools for compressing the cosmological information encoded in the spatial distribution of the mass in the Universe. In particular, the two-point correlation function in configuration space has emerged as one of the most popular cosmological probes. Its success stems from the presence of characteristic features that can be identified, measured, and effectively compared to theoretical models to extract clean cosmological information.
One such feature is baryon acoustic oscillations (BAOs), which imprint a characteristic scale in the two-point correlation function that can be used as a standard ruler. After the first detections in the two-point correlation functions of the SDSS DR3 and 2dFGRS galaxy catalogs (Eisenstein et al. 2005; Cole et al. 2005), the BAO signal has been identified, with different degrees of statistical significance, in many spectroscopic galaxy samples and used to constrain the expansion history of the Universe (see e.g., Percival et al. 2010; Blake et al. 2011; Beutler et al. 2011; Anderson et al. 2012, 2014; Ross et al. 2015, 2017; Alam et al. 2017; Vargas-Magaña et al. 2018; Bautista et al. 2018; Ata et al. 2018). Several of these studies did not focus on the BAO feature only, but also analyzed the anisotropies in the two-point correlation function induced by the peculiar velocities (Kaiser 1987), the so-called redshift-space distortions (RSD), and by assigning cosmology-dependent distances to the observed redshifts (the Alcock & Paczyński 1979 test). For RSD analyses, see also, for example, Peacock et al. (2001), Guzzo et al. (2008), Beutler et al. (2012), Reid et al. (2012), de la Torre et al. (2017), Pezzotta et al. (2017), Zarrouk et al. (2018), Hou et al. (2018), and Ruggeri et al. (2019).
Methods to estimate the galaxy two-point correlation function (2PCF) ξ(r) from survey data are based on its definition as the excess probability of finding a galaxy pair. One counts from the data (D) catalog the number DD(r) of pairs of galaxies with separation x_{2} − x_{1} ∈ r, where r is a bin of separation vectors, and compares it to the number of pairs RR(r) in a corresponding randomly generated (R) catalog and to the number of data–random pairs DR(r). The bin may be a 1D, 2D, or 3D bin. In the 1D case, r is the length of the separation vector and Δr is the width of the bin. From here on, "separation r" indicates that the separation falls in this bin.
Several estimators of the 2PCF have been proposed by Hewett (1982), Davis & Peebles (1983), Hamilton (1993), and Landy & Szalay (1993), building on the original Peebles & Hauser (1974) proposal. These correspond to different combinations of the DD, DR, and RR counts to obtain a 2PCF estimate ξ̂(r); see Kerscher (1999) and Kerscher et al. (2000) for more estimators. The Landy–Szalay (Landy & Szalay 1993) estimator

ξ̂_{LS}(r) = [DD(r) − 2DR(r) + RR(r)] / RR(r),  (1)

where the pair counts are normalized by the total numbers of pairs, N_{d}(N_{d} − 1)/2, N_{d}N_{r}, and N_{r}(N_{r} − 1)/2, respectively (we call this method "standard LS" in the following), is the most commonly used, since it provides the minimum variance when ξ ≪ 1 and is unbiased in the limit N_{r} → ∞. Here N_{d} is the size (number of objects) of the data catalog and N_{r} is the size of the random catalog. We define M_{r} := N_{r}/N_{d}. To minimize random error from the random catalog, M_{r} ≫ 1 should be used (for a different approach, see Demina et al. 2018).
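For illustration, the normalized pair counts and the Landy–Szalay combination (DD − 2DR + RR)/RR can be sketched in a few lines of Python. This is a brute-force toy implementation for small catalogs, not the optimized pair-counting codes discussed below, and the function names are ours:

```python
import numpy as np

def pair_counts(a, b, r_edges, auto=False):
    """Brute-force pair counts in separation bins (toy illustration only)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    if auto:
        d = d[np.triu_indices(len(a), k=1)]  # each unordered pair once
    counts, _ = np.histogram(d.ravel(), bins=r_edges)
    return counts

def landy_szalay(data, rand, r_edges):
    """Standard LS estimate (DD - 2 DR + RR) / RR with normalized counts."""
    nd, nr = len(data), len(rand)
    dd = pair_counts(data, data, r_edges, auto=True) / (nd * (nd - 1) / 2)
    dr = pair_counts(data, rand, r_edges) / (nd * nr)
    rr = pair_counts(rand, rand, r_edges, auto=True) / (nr * (nr - 1) / 2)
    return (dd - 2 * dr + rr) / rr

rng = np.random.default_rng(42)
data = rng.random((400, 3))    # toy "galaxies" in a unit box
rand = rng.random((1600, 3))   # random catalog with M_r = 4
xi = landy_szalay(data, rand, np.linspace(0.05, 0.5, 10))
# For uncorrelated data, xi scatters around zero.
```

The quadratic scaling of the brute-force distance computation is exactly the cost problem addressed in this paper.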
One is usually interested in ξ(r) only up to some r_{max} ≪ L_{max} (the maximum separation in the survey), and therefore pairs with larger separations can be skipped. Efficient implementations of the LS estimator involve a preordering of the catalogs through kd-tree, chain-mesh, or other algorithms (e.g., Moore et al. 2000; Alonso 2012; Jarvis 2015; Marulli et al. 2016) to facilitate this. The computational cost is then roughly proportional to the actual number of pairs with separation r ≤ r_{max}.
The correlation function is small for large separations, and in cosmological surveys r_{max} is large enough so that for most pairs ξ(r) ≪ 1. The fraction f of DD pairs with r ≤ r_{max} is therefore not very different from the fraction of DR or RR pairs with r ≤ r_{max}. The computational cost is dominated by the part proportional to the total number of pairs needed, which in turn is dominated by the RR pairs, as M_{r} ≫ 1. The smaller number of DR pairs contributes much more to the error of the estimate than the large number of RR pairs, whereas the cost is dominated by RR. Thus, a significant saving of computation time with an insignificant loss of accuracy may be achieved by counting only a subset of RR pairs, while still counting the full set (up to r_{max}) of DR pairs.
A good way to achieve this is to use many small (i.e., low-density) R catalogs instead of one large (high-density) catalog (Landy & Szalay 1993; Wall & Jenkins 2012; Slepian & Eisenstein 2015), or, equivalently, to split an already generated large R catalog into M_{s} small ones for the calculation of RR pairs, while using the full R catalog for the DR counts. This method has been used by some practitioners (e.g., Zehavi et al. 2011^{1}), but this is usually not documented in the literature. One might also consider obtaining a similar cost saving by diluting (subsampling) the R catalog for the RR counts, but, as we show below, this is not a good idea. We refer to these two cost-saving methods as "split" and "dilution".
In this work we theoretically derive the additional covariance and bias due to the size and treatment of the R catalog; test these predictions numerically with mock catalogs representative of next-generation datasets, such as the spectroscopic galaxy samples that will be obtained by the future Euclid satellite mission (Laureijs et al. 2011); and show that the "split" method, while reducing the computational cost by a large factor, retains the advantages of the LS estimator.
We follow the approach of Landy & Szalay (1993; hereafter, LS93), but generalize it in a number of ways: In particular, since we focus on the effect of the random catalog, we do not work in the limit M_{r} → ∞. Also, we calculate covariances, not just variances, and make fewer approximations (see Sect. 2.2).
The layout of the paper is as follows. In Sect. 2 we derive theoretical results for bias and covariance. In Sect. 3 we focus on the split LS estimator and its optimization. In Sect. 4 we test the different estimators with mock catalogs. Finally, we discuss the results and present our conclusions in Sect. 5.
2. Theoretical results: bias and covariance
2.1. General derivation
We follow the derivation and notation of LS93 but extend it to the case that includes the covariance of the random counts. We consider the survey volume as divided into K microcells (very small subvolumes) and work in the limit K → ∞, which means that no two objects will ever be located within the same microcell.
Here, α, β, and γ represent the relative deviations of the DD(r), DR(r), and RR(r) counts from their expectation values (mean values over an infinite number of independent realizations):

α(r) := DD(r)/⟨DD(r)⟩ − 1,  β(r) := DR(r)/⟨DR(r)⟩ − 1,  γ(r) := RR(r)/⟨RR(r)⟩ − 1.  (2)

By definition, ⟨α⟩ = ⟨β⟩ = ⟨γ⟩ = 0. The factors α, β, and γ represent fluctuations in the pair counts, which arise as a result of a Poisson process. As long as the mean pair counts per bin are large (≫1), the relative fluctuations will be small. We calculate up to second order in α, β, and γ, and ignore the higher-order terms (in the limit M_{r} → ∞, γ → 0, so LS93 set γ = 0 at this point).
The expectation values for the pair counts are:

⟨DD(r)⟩ = ½ N_{d}(N_{d} − 1) [1 + ξ(r)] G^{p}(r),
⟨DR(r)⟩ = N_{d}N_{r} G^{p}(r),
⟨RR(r)⟩ = ½ N_{r}(N_{r} − 1) G^{p}(r),  (3)

where ξ(r) is the correlation function normalized to the actual number density of galaxies in the survey and

G^{p}(r) := (2/[K(K − 1)]) ∑_{i<j} Θ^{ij}(r)  (4)

is the fraction of microcell pairs with separation r. Here Θ^{ij}(r) := 1 if x_{i} − x_{j} falls in the r-bin, otherwise it is equal to zero.
The expectation value of the LS estimator (1) is

⟨ξ̂_{LS}⟩ ≂ ξ + (ξ − 1)⟨γγ⟩ + 2⟨βγ⟩.  (5)

A finite R catalog thus introduces a (small) bias. (In LS93, γ = 0, so the estimator is unbiased in this limit.) This expression is calculated to second order in α, β, and γ (we denote equality to second order by "≂"). Calculation to higher order is beyond the scope of this work. Since the data and random catalogs are independent, ⟨αγ⟩ = 0.
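The origin of this bias can be illustrated with a toy Monte Carlo: because the estimator divides by the fluctuating RR count, ⟨1/RR⟩ ≠ 1/⟨RR⟩, and to second order the offset is ⟨γ²⟩. The following one-bin example with Poisson counts is a hypothetical illustration of ours, not part of the paper's analysis:

```python
import numpy as np

rng = np.random.default_rng(1)
mean_rr = 200.0            # expected RR count in one bin
n_real = 200_000           # number of independent realizations
rr = rng.poisson(mean_rr, size=n_real)
gamma = rr / mean_rr - 1.0          # relative deviation, <gamma> = 0
# Dividing by the fluctuating RR biases the estimate:
bias = np.mean(1.0 / (1.0 + gamma)) - 1.0
# Second-order prediction: <1/(1+gamma)> - 1 ~ <gamma^2> = 1/mean_rr
print(bias, 1.0 / mean_rr)          # both are approximately 0.005
```

Increasing mean_rr (i.e., a denser random catalog) shrinks ⟨γ²⟩ and hence the bias, which is why the effect vanishes in the M_{r} → ∞ limit.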
We introduce shorthand notations ⟨α_{1}α_{2}⟩ for ⟨α(r_{1})α(r_{2})⟩, ⟨DD_{1}DD_{2}⟩ for ⟨DD(r_{1})DD(r_{2})⟩, and similarly for other terms.
For the covariance we get

Cov[ξ̂_{1}, ξ̂_{2}] ≂ (1 + ξ_{1})(1 + ξ_{2})⟨α_{1}α_{2}⟩ − 2(1 + ξ_{1})⟨α_{1}β_{2}⟩ − 2(1 + ξ_{2})⟨α_{2}β_{1}⟩ + 4⟨β_{1}β_{2}⟩ − 2(1 − ξ_{2})⟨β_{1}γ_{2}⟩ − 2(1 − ξ_{1})⟨β_{2}γ_{1}⟩ + (1 − ξ_{1})(1 − ξ_{2})⟨γ_{1}γ_{2}⟩,  (6)

where ξ_{1} := ξ(r_{1}) and ξ_{2} := ξ(r_{2}). Terms with γ represent additional variance due to finite N_{r}, and are new compared to those of LS93. Also, ⟨β_{1}β_{2}⟩ collects an additional contribution, which we denote by Δ⟨β_{1}β_{2}⟩, from variations in the random field (see Sect. 2.3). The cross terms ⟨α_{1}β_{2}⟩ and ⟨α_{2}β_{1}⟩ instead depend only linearly on the random field, and average to their M_{r} → ∞ values. The additional contribution due to finite N_{r} is thus

ΔCov[ξ̂_{1}, ξ̂_{2}] ≂ 4Δ⟨β_{1}β_{2}⟩ − 2(1 − ξ_{2})⟨β_{1}γ_{2}⟩ − 2(1 − ξ_{1})⟨β_{2}γ_{1}⟩ + (1 − ξ_{1})(1 − ξ_{2})⟨γ_{1}γ_{2}⟩.  (7)
From (2), the covariances of the deviations are obtained from

⟨α_{1}α_{2}⟩ = ⟨DD_{1}DD_{2}⟩/(⟨DD_{1}⟩⟨DD_{2}⟩) − 1,  ⟨β_{1}β_{2}⟩ = ⟨DR_{1}DR_{2}⟩/(⟨DR_{1}⟩⟨DR_{2}⟩) − 1,
⟨γ_{1}γ_{2}⟩ = ⟨RR_{1}RR_{2}⟩/(⟨RR_{1}⟩⟨RR_{2}⟩) − 1,  ⟨α_{1}β_{2}⟩ = ⟨DD_{1}DR_{2}⟩/(⟨DD_{1}⟩⟨DR_{2}⟩) − 1,
⟨β_{1}γ_{2}⟩ = ⟨DR_{1}RR_{2}⟩/(⟨DR_{1}⟩⟨RR_{2}⟩) − 1,  (9)

and similarly for the remaining combinations.
2.2. Quadruplets, triplets, and approximations
We use

G^{t}(r_{1}, r_{2}) := [K(K − 1)(K − 2)]^{−1} ∑^{*}_{ijk} Θ^{ik}(r_{1})Θ^{jk}(r_{2})  (10)

to denote the fraction of ordered microcell triplets (i, j, k), where x_{i} − x_{k} ∈ r_{1} and x_{j} − x_{k} ∈ r_{2}. The notation ∑^{*} means that only terms where all indices (microcells) are different are included. Here G^{t}(r_{1}, r_{2}) is of the same magnitude as G^{p}(r_{1})G^{p}(r_{2}), but is larger.
Appendix A gives examples of how the ⟨DD_{1}DD_{2}⟩ and so on in (9) are calculated. These covariances involve expectation values ⟨n_{i}n_{j}n_{l}n_{k}⟩, where n_{i} is the number of objects (0 or 1) in microcell i and so on, and only cases where the four microcells are separated pairwise by r_{1} and r_{2} are included. If all four microcells i, j, k, and l are different, we call this case a quadruplet; it consists of two pairs with separations r_{1} and r_{2}. If two of the indices, that is, microcells, are equal, we have a triplet with a center cell (the equal indices) and two end cells separated from the center by r_{1} and r_{2}.
We make the following three approximations:

1. For microcell quadruplets, the correlations between unconnected cells are approximated by zero on average.

2. Three-point correlations vanish.

3. The part of four-point correlations that does not arise from the two-point correlations vanishes.
With approximations (2) and (3), we have for the expectation value of a galaxy triplet

⟨n_{i}n_{j}n_{k}⟩ ≃ (N_{d}/K)^{3} (1 + ξ_{ij} + ξ_{ik} + ξ_{jk}),  (11)

where ξ_{ij} := ξ(x_{j} − x_{i}), and for a quadruplet

⟨n_{i}n_{j}n_{k}n_{l}⟩ ≃ (N_{d}/K)^{4} (1 + ξ_{ij} + ξ_{ik} + ξ_{il} + ξ_{jk} + ξ_{jl} + ξ_{kl} + ξ_{ij}ξ_{kl} + ξ_{ik}ξ_{jl} + ξ_{il}ξ_{jk}).  (12)

We use "≃" to denote results based on these three approximations. Approximation (1) is good as long as the survey size is large compared to r_{max}. It allows us to drop terms other than 1 + ξ_{ij} + ξ_{kl} + ξ_{ij}ξ_{kl} in (12). Approximations (2) and (3) hold for Gaussian density fluctuations, but in the realistic cosmological situation they are not good: the presence of the higher-order correlations makes the estimation of the covariance of ξ(r) estimators a difficult problem. However, this difficulty applies only to the contribution of the data to the covariance, that is, to the part that does not depend on the size and treatment of the random catalog. The key point in this work is that while our theoretical result for the total covariance does not hold in a realistic situation (it is an underestimate), our results for the difference in estimator covariance due to different treatments of the random catalog hold well.
In addition to working in the limit M_{r} → ∞ (γ = 0), LS93 considered only 1D bins and the case where r_{1} = r_{2} ≡ r (i.e., variances, not covariances), and made also a fourth approximation: for triplets (which in this case have legs of equal length) they approximated the correlation between the end cells (whose separation in this case varies between 0 and 2r) by ξ(r). We use ξ_{12} to denote the mean value of the correlation between triplet end cells (separated from the triplet center by r_{1} and r_{2}). (For our plots in Sect. 4 we make a similar approximation for ξ_{12} as Landy & Szalay 1993; see Sect. 4.2.) Also, LS93 only calculated to first order in ξ, whereas we do not make this approximation. Bernstein (1994) also considered covariances, and included the effect of three-point and four-point correlations, but worked in the limit M_{r} → ∞ (γ = 0).
2.3. Poisson, edge, and q terms
After calculating all of the ⟨DD_{1}DD_{2}⟩ and so on (see Appendix A), (9) becomes Eq. (13) for the standard LS estimator.
Following the definitions of t and p in LS93, we define the matrices t(r_{1}, r_{2}), t^{r}(r_{1}, r_{2}), p(r_{1}, r_{2}), p^{c}(r_{1}, r_{2}), p^{r}(r_{1}, r_{2}), q(r_{1}, r_{2}), and q^{r}(r_{1}, r_{2}).
For their diagonals (r_{1} = r_{2}), we write t, t_{r}, p, p_{c}, p_{r}, q, and q_{r}. Thus, t ≡ t_{11} ≡ t_{22}, and so on. (We use superscripts for the matrices, e.g., t^{r}(r_{1}, r_{2}), and subscripts for their diagonals, e.g., t_{r}(r).)
Using these definitions, (13) becomes
The new part in ⟨β_{1}β_{2}⟩ due to finite size of the random catalog is
Thus only ⟨α_{1}α_{2}⟩ and ⟨β_{1}β_{2}⟩ are affected by ξ(r) (in our approximation its effect cancels in ⟨α_{1}β_{2}⟩). The results for ⟨γ_{1}γ_{2}⟩, ⟨β_{1}γ_{2}⟩, and ⟨α_{1}γ_{2}⟩ are exact. The result for ⟨α_{1}α_{2}⟩ involves all three approximations mentioned above, ⟨α_{1}β_{2}⟩ involves approximations (1) and (2), and ⟨β_{1}β_{2}⟩ involves approximation (1).
We refer to p, p^{c}, and p^{r} as "Poisson" terms and t and t^{r} as "edge" terms (the difference between G^{t}(r_{1}, r_{2}) and G^{p}(r_{1})G^{p}(r_{2}) is due to edge effects). While the Poisson terms are strongly diagonal-dominated, the edge terms are not. Since G^{t} ≈ G^{p}G^{p}, the q terms are much larger than the edge terms, but they get multiplied by ξ_{12} − ξ_{1}ξ_{2} or ξ_{12}. In the limit M_{r} → ∞: ⟨β_{1}γ_{2}⟩ → 0, ⟨γ_{1}γ_{2}⟩ → 0, and ⟨β_{1}β_{2}⟩ → t_{12}; ⟨α_{1}α_{2}⟩ and ⟨α_{1}β_{2}⟩ are unaffected.
We see that the DD–DR and DR–RR correlations arise from edge effects. If we increase the density of data or random objects, the Poisson terms decrease as N^{−2} but the edge terms decrease only as N^{−1}, so the edge effects are more important for a higher density of objects.

Doubling the bin size (combining neighboring bins) doubles G^{p}(r) but makes G^{t}(r_{1}, r_{2}) four times as large, since triplets where one leg was in one of the original smaller bins and the other leg was in the other bin are now also included. Thus, the ratio G^{t}/(G^{p}G^{p}) and t are not affected, but the dominant term in p, which is proportional to 1/[(1 + ξ)G^{p}], is halved. Edge effects are thus more important for larger bins.
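The different scalings of G^{p} and G^{t} with bin size can be checked directly by counting microcell pairs and triplets in a toy 1D geometry. This is an illustrative sketch of ours; the cell count and bin edges are arbitrary:

```python
K = 10_000   # microcells on a 1D line (toy survey geometry)

def g_pairs(lo, hi):
    """Ordered microcell pairs with separation s, lo <= s < hi (in cells)."""
    return sum(2 * (K - s) for s in range(lo, hi))

def g_triplets(lo, hi):
    """Ordered triplets (i, j, k), i != j, with |i - k| and |j - k| in the bin."""
    total = 0
    for k in range(K):
        ends = sum(1 for s in range(lo, hi)
                     for x in (k - s, k + s) if 0 <= x < K)
        total += ends * (ends - 1)   # ordered pairs of distinct end cells
    return total

# Doubling the bin width roughly doubles G^p but quadruples G^t:
ratio_p = g_pairs(10, 30) / g_pairs(10, 20)        # close to 2
ratio_t = g_triplets(10, 30) / g_triplets(10, 20)  # close to 4
```

The small deviations from exactly 2 and 4 come from the boundary of the toy "survey", which is precisely the edge effect discussed above.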
2.4. Results for the standard Landy–Szalay estimator
Inserting the results for ⟨α_{1}α_{2}⟩ and so on into Eqs. (5) and (6), we find that the expectation value of the standard LS estimator (1) is

⟨ξ̂_{LS}⟩ ≂ ξ + (ξ − 1)(4t_{r} + p_{r}) + 4t_{r}.  (17)

This holds also for large ξ and in the presence of three-point and four-point correlations. A finite R catalog thus introduces a bias (ξ − 1)(4t_{r} + p_{r}) + 4t_{r} = −p_{r} + (4t_{r} + p_{r})ξ; the edge (t_{r}) part of the bias cancels in the ξ → 0 limit.
For the covariance we get Eq. (19). Because of the approximations made, this result for the covariance does not apply to the realistic cosmological case, not even for large separations r, where ξ is small, since large correlations at small r increase the covariance also at large r. However, this concerns only ⟨α_{1}α_{2}⟩ and ⟨α_{1}β_{2}⟩. Our focus here is on the additional covariance due to the size and handling of the random catalog, which for standard LS is given in terms of the p^{c}, p^{r}, t^{r}, and q^{r} quantities.

To zeroth order in ξ the covariance is given by the Poisson terms, and the edge terms cancel to first order in ξ. This is the property for which the standard LS estimator was designed. To first order in ξ, the q terms contribute. This q contribution involves the triplet correlation ξ_{12}, which, depending on the form of ξ(r), may be larger than ξ_{1} or ξ_{2}.
If we try to save cost by using a diluted random catalog of size dN_{r} (with dilution fraction d < 1) for the RR pairs, ⟨γ_{1}γ_{2}⟩ is replaced by the corresponding quantity with dN_{r} in place of N_{r}, but ⟨β_{1}γ_{2}⟩ and ⟨β_{1}β_{2}⟩ are unaffected, so that the edge terms involving randoms no longer cancel. In Sect. 4 we see that this is a large effect. Therefore, one should not use dilution.
3. Split random catalog
3.1. Bias and covariance for the split method
In the split method one has M_{s} independent smaller R^{μ} catalogs, μ = 1, …, M_{s}, each of size N_{r}′ = N_{r}/M_{s}, instead of one large random catalog R. Their union, R, has a size of N_{r} = M_{s}N_{r}′. The pair counts DR(r) and RR′(r) are calculated as

DR(r) = ∑_{μ} DR^{μ}(r),  RR′(r) = ∑_{μ} R^{μ}R^{μ}(r),

that is, pairs across different R^{μ} catalogs are not included in RR′. The total number of pairs in RR′ is M_{s}N_{r}′(N_{r}′ − 1)/2. Here, DR is equal to its value in standard LS.
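As an illustration, RR′ can be obtained from an existing R catalog by splitting it and averaging the normalized auto-counts of the subcatalogs. This is a brute-force Python sketch with our own function names, not the paper's production code:

```python
import numpy as np

def norm_auto_pairs(cat, r_edges):
    """Normalized auto-pair counts of one catalog (brute force, toy only)."""
    d = np.linalg.norm(cat[:, None, :] - cat[None, :, :], axis=-1)
    d = d[np.triu_indices(len(cat), k=1)]
    counts, _ = np.histogram(d, bins=r_edges)
    return counts / (len(cat) * (len(cat) - 1) / 2)

def split_rr(rand, m_s, r_edges):
    """RR' of the split method: mean normalized auto-counts of M_s subcatalogs.

    Pairs across different subcatalogs are never formed; this is the
    source of the ~M_s saving in the RR pair count.
    """
    subs = np.array_split(rand, m_s)
    return np.mean([norm_auto_pairs(s, r_edges) for s in subs], axis=0)

rng = np.random.default_rng(7)
rand = rng.random((1500, 3))               # R catalog in a unit box
r_edges = np.linspace(0.1, 0.4, 7)
rr_full = norm_auto_pairs(rand, r_edges)   # all N_r (N_r - 1)/2 pairs
rr_split = split_rr(rand, 10, r_edges)     # ~1/10 of the RR pairs
# Both are estimates of the same quantity, the pair fraction per bin.
```

The split counts are noisier per bin (the larger Poisson term p^{r′} derived below), but they remain unbiased estimates of the same expectation value.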
The split Landy–Szalay estimator is

ξ̂_{split}(r) = [DD(r) − 2DR(r) + RR′(r)] / RR′(r),

where the counts are normalized as in (1), with RR′ normalized by its total number of pairs, M_{s}N_{r}′(N_{r}′ − 1)/2.
Compared to standard LS, ⟨α_{1}α_{2}⟩, ⟨β_{1}β_{2}⟩, and ⟨α_{1}β_{2}⟩ are unaffected. We construct ⟨RR′⟩, ⟨RR′⋅RR′⟩, and ⟨RR′⋅DR⟩ from the standard LS results, bearing in mind that the random catalog is a union of independent catalogs, arriving at
where
The first is the same as in standard LS and dilution, but the second differs both from standard LS and from dilution, since it involves both N_{r} and N_{r}′.
For the expectation value we get

⟨ξ̂_{split}⟩ ≂ ξ + (ξ − 1)(4t_{r} + p_{r′}) + 4t_{r},  (24)

so that the bias is (ξ − 1)(4t_{r} + p_{r′}) + 4t_{r} = −p_{r′} + (4t_{r} + p_{r′})ξ. In the limit ξ → 0 the edge part cancels, leaving only the Poisson term.
The covariance is
The change in the covariance compared to the standard LS method is

(1 − ξ_{1})(1 − ξ_{2})[p^{r′}(r_{1}, r_{2}) − p^{r}(r_{1}, r_{2})],
which again applies in the realistic cosmological situation. Our main result is that in the split method the edge effects cancel and the bias and covariance are the same as for standard LS, except that the Poisson term p^{r} from RR is replaced with the larger p^{r′}.
3.2. Optimizing computational cost and variance of the split method
The bias is small compared to the variance in our application (see Fig. 1 for the theoretical result and Fig. 5 for an attempted bias measurement), and therefore we focus on the variance as the figure of merit. The computational cost should be roughly proportional to the total number of pairs,

f N_{d}^{2} [1/2 + M_{r} + M_{r}^{2}/(2M_{s})] =: f N_{d}^{2} c,  (27)
Fig. 1. Mean ξ(r) estimate and the scatter and theoretical bias of the estimates for different estimators. The dash-dotted line, our theoretical result for the scatter of the LS method, underestimates the scatter, since higher-order correlations in the D catalog are ignored. The dotted line is without the contribution of the q terms, and is dominated by the Poisson (p) terms. The bias is multiplied by 100 so that the curves can be displayed in a more compact plot. For the measured mean and scatter, and the theoretical bias, we plot standard LS in black, dilution with d = 0.14 in red, and split with M_{s} = 50 in blue. For the mean and scatter the difference between the methods is not visible in this plot. The differences in the mean estimate are shown in Fig. 5. The differences in scatter (or its square, the variance) are shown in Fig. 3. For the theoretical bias the difference between split and dilution is not visible at small r (ξ(r) > 1), where the bias is positive.

and the additional variance due to the finite R catalog in the ξ → 0 limit becomes

p [2/M_{r} + M_{s}/M_{r}^{2}] =: p v.  (28)
Here, N_{d} and p are fixed by the survey and the requested r binning, but we can vary M_{r} and M_{s} in the search for the optimal computational method. In the above we defined the “cost” and “variance” factors c and v.
We may ask two questions:

1. For a fixed level of variance v, which combination of M_{r} and M_{s} minimizes the computational cost c?

2. For a fixed computational cost c, which combination of M_{r} and M_{s} minimizes the variance v?
The answer to both questions is (Slepian & Eisenstein 2015)

M_{s} = M_{r}.  (29)

Thus, the optimal version of the split method is the natural one where N_{r}′ = N_{d}. In this case the additional variance in the ξ → 0 limit becomes

3p/M_{r},

and the computational cost factor becomes

c = 1/2 + (3/2)M_{r},

meaning that DR pairs contribute twice as much as RR pairs to the variance, and also twice as much computational cost is invested in them. The memory requirement for the random catalog is then the same as for the data catalog. The cost-saving estimate above is optimistic, since the computation involves some overhead not proportional to the number of pairs.
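The optimum can be checked numerically. Here we assume the forms c = 1/2 + M_{r} + M_{r}^{2}/(2M_{s}) and v = 2/M_{r} + M_{s}/M_{r}^{2} for the cost and variance factors, consistent with the statements above; the scan itself is our own illustration:

```python
import numpy as np

def cost(m_r, m_s):
    """Pair-count cost factor c: DD + DR + RR' pairs in units of f N_d^2."""
    return 0.5 + m_r + m_r**2 / (2 * m_s)

def var_factor(m_r, m_s):
    """Additional-variance factor v in the xi -> 0 limit."""
    return 2 / m_r + m_s / m_r**2

def m_s_at_cost(m_r, c_target):
    """Split factor that keeps the total cost at c_target for a given M_r."""
    return m_r**2 / (2 * (c_target - 0.5 - m_r))

# Scan M_r at fixed cost; the variance is minimized at M_s = M_r:
target = cost(50.0, 50.0)
grid = np.linspace(30.0, 74.0, 441)
v = [var_factor(mr, m_s_at_cost(mr, target)) for mr in grid]
best_mr = grid[int(np.argmin(v))]    # minimum at M_r = 50, i.e., M_s = M_r

# Sect. 4 example: split with M_r = M_s = 866.67 has the same total
# pair count as standard LS (all RR pairs, i.e., M_s = 1) with M_r = 50:
same = abs(cost(2600 / 3, 2600 / 3) - cost(50.0, 1.0)) < 1e-6
```

The flat minimum of v around M_{s} = M_{r} also indicates that moderate deviations from the exact optimum carry little penalty.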
For small scales, where ξ ≫ 1, the situation is different. The greater density of DD pairs due to the correlation requires a greater density of the R catalog so that the additional variance from it is not greater. From Eq. (19) we see that the balance of the DR and the RR contributions is different for large ξ (the p_{c} term vs. the other terms). We may consider recomputing for the small scales using a smaller r_{max} and a larger R catalog. Considering just the Poisson terms (p_{c} and p_{r} or p_{r′}) with a "representative" ξ value modifies (27) and (28), and with them the above result (Eq. (29)) for the optimal choice of M_{s} and M_{r}.
This result is only indicative, since it assumes a constant ξ for r < r_{max}. In particular, it does not apply for ξ ≈ 1, because then the approximation of ignoring the q^{r} and t^{r} terms in (19) is not good.
4. Tests on mock catalogs
4.1. Minerva simulations and methodology
The Minerva mocks are a set of 300 cosmological mocks produced with N-body simulations (Grieb et al. 2016; Lippich et al. 2019), stored at five output redshifts z ∈ {2.0, 1.0, 0.57, 0.3, 0}. The cosmology is flat ΛCDM with Ω_{m} = 0.285, and we use the z = 1 outputs. The mocks have N_{d} ≈ 4 × 10^{6} objects ("halos" found by a friends-of-friends algorithm) in a cubic box of side 1500 h^{−1} Mpc.
To model the survey geometry of a redshift bin with Δz ≈ 0.1 at z ∼ 1, we placed the observer at comoving distance 2284.63 h^{−1} Mpc from the center of the cube and selected from the cube a shell 2201.34–2367.92 h^{−1} Mpc from the observer. The comoving thickness of the shell is 166.58 h^{−1} Mpc. The resulting mock subcatalogs have N_{d} ≈ 4.5 × 10^{5} and are representative of the galaxy number density of the future Euclid spectroscopic galaxy catalog.
We ignore peculiar velocities, that is, we perform our analysis in real space. Therefore, we consider results for the 1D 2PCF ξ(r). We estimated ξ(r) up to r_{max} = 200 h^{−1} Mpc using Δr = 1 h^{−1} Mpc bins.
We chose standard LS with M_{r} = 50 as the reference method. In the following, LS without further qualification refers to this. The random catalog was generated separately for each shell mock to measure their contribution to the variance. For one of the random catalogs we also calculated triplets to obtain the edge-effect quantity G^{t}.
While dilution can already be discarded on theoretical grounds, we show results obtained using dilution, since these results provide the scale of the edge effects, demonstrating the importance of eliminating them with a careful choice of method. For the dilution and split methods we also used M_{r} = 50, and tried out dilution fractions d = 0.5, 0.25, 0.14 and split factors M_{s} = 4, 16, 50 (chosen to have similar pairwise computational costs). In addition, we considered standard LS with M_{r} = 25, which has the same number of RR pairs as d = 0.5 and M_{s} = 4, but only half the number of DR pairs; and standard LS with M_{r} = 1 to demonstrate the effect of a small M_{r}.
The code used to estimate the 2PCF implements a highly optimized pair-counting method, specifically designed for the search of object pairs in a given range of separations. In particular, the code provides two alternative counting methods, the chain-mesh and the kd-tree. Both methods measure the exact number of object pairs in separation bins, without any approximation. However, since they implement different algorithms to search for pairs, they perform differently at different scales, both in terms of CPU time and memory usage. Overall, the efficiency of the two methods depends on the ratio between the scale range of the searching region and the maximum separation between the objects in the catalog.
The kd-tree method first constructs a space-partitioning data structure that is filled with the catalog objects. The minimum and maximum separations probed by the objects are kept in the data structure and are used to prune object pairs with separations outside the range of interest. The tree pair search is performed through the dual-tree method, in which cross-pairs between two dual trees are explored. This is an improvement in terms of exploration time over the single-tree method.
In the chain-mesh method, on the other hand, the catalog is divided into cubic cells of equal size, and the indices of the objects in each cell are stored in vectors. To avoid counting object pairs with separations outside the range of interest, the counting is performed only on the cells in a selected range of distances from each object. The chain-mesh algorithm has been imported from the CosmoBolognaLib, a large set of free software C++/Python libraries for cosmological calculations (Marulli et al. 2016).
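The idea of the chain-mesh search can be sketched as follows. This is a minimal Python illustration of the cell-grid idea, not the CosmoBolognaLib implementation; the function name is ours:

```python
import numpy as np
from collections import defaultdict
from itertools import product

def chain_mesh_pair_count(pos, r_max):
    """Count pairs with separation <= r_max using a chain-mesh style grid.

    Objects are assigned to cubic cells of side r_max, so all partners of
    an object within r_max lie in its own cell or the 26 neighboring ones.
    """
    grid = defaultdict(list)
    for i, p in enumerate(pos):
        grid[tuple((p // r_max).astype(int))].append(i)
    n_pairs = 0
    for (cx, cy, cz), members in grid.items():
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            for j in grid.get((cx + dx, cy + dy, cz + dz), ()):
                for i in members:
                    # the condition i < j counts each pair exactly once
                    if i < j and np.linalg.norm(pos[i] - pos[j]) <= r_max:
                        n_pairs += 1
    return n_pairs

rng = np.random.default_rng(3)
pos = rng.random((300, 3))
n = chain_mesh_pair_count(pos, 0.2)
```

Only objects in neighboring cells are ever tested, which is why the cost scales with the number of pairs at r ≤ r_{max} rather than with the total number of pairs.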
For our test runs we used the chain-mesh method.
4.2. Variance and bias
In Fig. 1 we show the mean (over the 300 mock shells) estimated correlation function and the scatter (square root of the variance) of the estimates using the LS, split, and dilution methods; our theoretical approximate result for the scatter for LS; and our theoretical result for bias for the different methods.
The theoretical result for the scatter is shown with and without the q terms, which include the triplet correlation ξ_{12}, for which we used here the approximation ξ_{12} ≈ ξ(max(r_{1}, r_{2})). This behaves as expected, that is, it underestimates the variance, since we neglected the higher-order correlations in the D catalog. Nevertheless, it (see the dash-dotted line in Fig. 1) has features similar to the measured variance (dashed lines).
In Fig. 2 we plot the diagonals of the p, t, and q quantities. This shows how their relative importance changes with separation scale. It also confirms that our initial assumption on small relative fluctuations is valid in this simulation case.
Fig. 2. Quantities p, p_{c}, p_{r}, q, q_{r}, t, and t_{r} for the Minerva shell. The values for the first bin are noisy. The vertical red line marks r = L. 

Consider now the variance differences (from standard LS with M_{r} = 50), for which our theoretical results should be accurate. Figure 3 compares the measured variance difference to the theoretical result. For the diluted estimators and LS with M_{r} = 1 the measured result agrees with theory, although clearly the measurement with just 300 mocks is rather noisy. For the split estimators and LS with M_{r} = 25 the difference is too small to be appreciated with 300 mocks, but at least the measurement does not disagree with the theoretical result.
Fig. 3. Measured difference from LS of the variance of different estimators, multiplied by r^{2}. Dashed lines are our theoretical results. 

In Fig. 4 we show the relative theoretical increase in scatter compared to the best possible case, which is LS in the limit M_{r} → ∞. Since we do not have a valid theoretical result for the total scatter, we estimate it by subtracting the theoretical difference from LS with M_{r} = 50 from the measured variance of the latter.
Fig. 4. Theoretical estimate of the scatter of the ξ estimates divided by the scatter in the M_{r} → ∞ limit. The dotted lines correspond to 0.3%, 0.5%, and 1% increases in scatter. For r < 10 h^{−1} Mpc there is hardly any difference between split and dilution, and the curves lie on top of each other, whereas for larger r split is much better.

At scales r ≲ 10 h^{−1} Mpc the theoretical prediction is about the same for dilution and split, and neither method looks promising for r ≪ 10 h^{−1} Mpc, where ξ ≫ 1. This suggests that for optimizing cost and accuracy, a different method should be used for smaller scales than for large scales. The number of RR pairs with small separations is much smaller. Therefore, for the small-scale computation there is no need to restrict the computation to a subset of RR pairs, or alternatively, one can afford to increase M_{r}. For the small scales, we may consider the split method with increased M_{r} as an alternative to standard LS. We have the same number of pairs to compute as in the reference LS case if we use M_{r} = 866 and M_{s} = 866. We added this case to Fig. 4. It seems to perform better than LS at intermediate scales, but for the smallest scales LS has the smaller variance. This is in line with our conclusion in Sect. 3.2, that when ξ ≫ 1 it is not optimal to split the R catalog into small subsets.
We also compared the differences in the mean estimate from the different estimators to our theoretical results on the bias differences (see Fig. 5), but the theoretical bias differences are much smaller than the expected error of the mean from 300 mocks; we simply confirm that the differences we see are consistent with the error of the mean, and thus consistent with the true bias being much smaller. We also performed tests with completely random (ξ = 0) mocks, and with a large number (10 000) of mocks we confirmed the theoretical bias result for the different estimators in this case. Since the bias is too small to be interesting, we do not report these results in more detail here.
Fig. 5. Differences between the mean ξ(r) estimate of each method and that from LS, multiplied by r to better display all scales. This measured difference is not the true bias, which is too small to measure with 300 mocks, and is mainly due to the random error of the mean. The results for dilution appear to reveal a systematic bias, but this is just due to strong error correlations between nearby bins; for different subsets of the 300 mocks the mean difference is completely different.

However, we note that for the estimation of the 2D 2PCF and its multipoles, the 2D bins contain fewer objects than the 1D bins of these test runs, and therefore the bias is larger. Using the theoretical results (17) or (24), the bias can be removed afterwards, with an accuracy that depends on how well we know the true ξ.
4.3. Computation time and variance
The test runs were made using a single full 24-core node for each run. Table 1 shows the mean computation time and mean estimator variance over different r ranges for the different cases we tested. Of these r ranges, the range r = 80 − 120 h^{−1} Mpc is perhaps the most interesting, since it contains the important BAO scale. Therefore, we plot the mean variance over this range versus mean computation time in Fig. 6, together with our theoretical predictions. The theoretical estimate of the computation time for other dilution fractions and split factors is proportional to the total number of pairs counted, that is (in units of the DD pair count), 1 + 2M_{r} + d^{2}M_{r}^{2} for dilution and 1 + 2M_{r} + M_{r}^{2}/M_{s} for split,
Table 1. Mean computation time over the 300 runs and the mean variance over four different ranges of r bins (given in units of h^{−1} Mpc) for each method.
Fig. 6. Measured variance (mean variance over the range r = 80 − 120 h^{−1} Mpc) vs. computational cost (mean computation time) for the different methods (markers with error bars) and our theoretical prediction (solid lines). The solid lines (blue for the split method, red for dilution, and black for standard LS with M_{r} ≤ 50) are our theoretical predictions for the increase in variance and the computation-time ratio compared to the standard LS, M_{r} = 50, case; the dots on the curves correspond to the measured cases (except for LS, where they are, from right to left, M_{r} = 25, 12.5, and 50/7, of which only the first was measured). The curve for split ends at M_{s} = 2500; the optimal case, M_{s} = M_{r}, is the circled dot. The error bars for the variance measurement are naive estimates that do not account for error correlations between bins. The theoretical predictions overestimate the cost savings (the data points lie to the right of the dots on the curves), except for the smaller split factors, where the additional speedup compared to theory is related to other performance differences between our split and standard LS implementations. This plot would look different for other r ranges.

assuming M_{r} = 50. For standard LS with other random catalog sizes, the computation time estimate is proportional to 1 + 2M_{r} + M_{r}^{2}.
5. Conclusions
The computational time of the standard Landy–Szalay estimator is dominated by the RR pairs. However, except at small scales where correlations are large, these make a negligible contribution to the expected error compared to the contribution from the DD and DR pairs. Therefore, a substantial saving of computation time with an insignificant loss of accuracy can be achieved by counting a smaller subset of RR pairs.
We considered two ways to reduce the number of RR pairs: dilution and split. In dilution, only a subset of the R catalog is used for RR pairs. In split, the R catalog is split into a number of smaller subcatalogs, and only pairs within each subcatalog are counted. We derived theoretical results for the additional estimator covariance and bias due to the finite size of the random catalog for these different variants of the LS estimator, extending in many ways the original results by Landy & Szalay (1993), who worked in the limit of an infinite random catalog. We tested our results using 300 mock data catalogs, representative of the z = 0.95–1.05 redshift range of the future Euclid survey. The split method maintains the property the Landy–Szalay estimator was designed for, namely the cancellation of edge effects in bias and variance (for ξ = 0), whereas dilution loses this cancellation and therefore should not be used.
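To make the split counting scheme concrete, here is a deliberately naive, self-contained sketch of a Landy–Szalay estimate with split RR counts. This is not the implementation used in the paper; the catalog sizes, the 2D unit-square geometry, and the single separation bin are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

def count_pairs(a, b, r_max):
    """Naive O(N^2) count of ordered pairs closer than r_max."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return int(np.sum(d < r_max))

def xi_ls_split(data, randoms, M_s, r_max):
    """Landy-Szalay estimate in one separation bin, with split RR counts.

    RR uses only pairs within each of the M_s subcatalogs; every count is
    normalized by its number of possible (ordered) pairs.
    """
    N_d, N_r = len(data), len(randoms)
    # subtract the N self pairs (zero distance) picked up on the diagonal
    dd = (count_pairs(data, data, r_max) - N_d) / (N_d * (N_d - 1))
    dr = count_pairs(data, randoms, r_max) / (N_d * N_r)
    subs = np.array_split(randoms, M_s)
    rr_num = sum(count_pairs(s, s, r_max) - len(s) for s in subs)
    rr_den = sum(len(s) * (len(s) - 1) for s in subs)
    rr = rr_num / rr_den
    return (dd - 2.0 * dr + rr) / rr

# uncorrelated mock (true xi = 0) in the unit square
data = rng.random((400, 2))
randoms = rng.random((4000, 2))
print(xi_ls_split(data, randoms, M_s=10, r_max=0.1))  # close to 0
```

On an uncorrelated mock the estimate comes out close to zero, as expected; a real analysis would use tree- or grid-based pair counting and many separation bins.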
For small scales, where correlations are large, one should not reduce the RR counts as much. The natural dividing line is the scale r where ξ(r) = 1. Interestingly, the difference in bias and covariance between the different estimators (split, dilution, and LS) vanishes when ξ = 1. We recommend the natural version of the split method, M_{s} = M_{r}, for large scales, where ξ < 1. This leads to a saving in computation time of more than a factor of ten (assuming M_{r} = 50) with a negligible effect on variance and bias. For small scales, where ξ > 1, one should consider using a larger random catalog, with either the standard LS method or the split method with a more modest split factor. Because the number of pairs with these small separations is much smaller, computation time is not as critical an issue as it is for large separations.
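The quoted factor of ten follows directly from pair bookkeeping; counting pairs in units of N_d^2/2, with DR contributing 2M_{r}, is our assumed convention rather than the paper's stated one.

```python
M_r = 50
t_ls = 1 + 2 * M_r + M_r ** 2           # standard LS: DD + DR + full RR
t_split = 1 + 2 * M_r + M_r ** 2 / M_r  # split, M_s = M_r: RR -> M_r^2/M_s
print(t_ls / t_split)                   # ~17, i.e., more than a factor of ten
```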
The results of our analysis will have an impact also on the computationally more demanding task of covariance matrix estimation. However, since in that case the exact computational cost is determined by the balance of data catalogs and random catalogs, which does not need to be the same as for the individual two-point correlation estimate, we postpone a quantitative analysis to a future, dedicated study. The same kind of methods can be applied to higher-order statistics (three-point and four-point correlation functions) to speed up their estimation (Slepian & Eisenstein 2015).
Acknowledgments
We thank Will Percival and Cristiano Porciani for useful discussions. The 2PCF computations were done at the Euclid Science Data Center Finland (SDC-FI, urn:nbn:fi:research-infras-2016072529), for whose computational resources we thank CSC – IT Center for Science, the Finnish Grid and Cloud Computing Infrastructure (FGCI, urn:nbn:fi:research-infras-2016072533), and the Academy of Finland grant 292882. This work was supported by the Academy of Finland grant 295113. VL was supported by the Jenny and Antti Wihuri Foundation, AV by the Väisälä Foundation, AS by the Magnus Ehrnrooth Foundation, and JV by the Finnish Cultural Foundation. We also acknowledge travel support from the Jenny and Antti Wihuri Foundation. VA acknowledges funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 749348. FM acknowledges the grants ASI n.I/023/12/0 "Attività relative alla fase B2/C per la missione Euclid" and PRIN MIUR 2015 "Cosmology and Fundamental Physics: illuminating the Dark Universe with Euclid".
References
 Alam, S., Ata, M., Bailey, S., et al. 2017, MNRAS, 470, 2617
 Alcock, C., & Paczyński, B. 1979, Nature, 281, 358
 Alonso, D. 2012, ArXiv e-prints [arXiv:1210.1833]
 Anderson, L., Aubourg, E., Bailey, S., et al. 2012, MNRAS, 427, 3435
 Anderson, L., Aubourg, E., Bailey, S., et al. 2014, MNRAS, 441, 24
 Ata, M., Baumgarten, F., Bautista, J., et al. 2018, MNRAS, 473, 4773
 Bautista, J. E., Vargas-Magaña, M., Dawson, K. S., et al. 2018, ApJ, 863, 110
 Bernstein, G. M. 1994, ApJ, 424, 569
 Beutler, F., Blake, C., Colless, M., et al. 2011, MNRAS, 416, 3017
 Beutler, F., Blake, C., Colless, M., et al. 2012, MNRAS, 423, 3430
 Blake, C., Kazin, E. A., Beutler, F., et al. 2011, MNRAS, 418, 1707
 Cole, S., Percival, W. J., Peacock, J. A., et al. 2005, MNRAS, 362, 505
 Davis, M., & Peebles, P. J. E. 1983, ApJ, 267, 465
 de la Torre, S., Jullo, E., Giocoli, C., et al. 2017, A&A, 608, A44
 Demina, R., Cheong, S., Ben-Zvi, S., & Hindrichs, O. 2018, MNRAS, 480, 49
 Eisenstein, D. J., Zehavi, I., Hogg, D. W., et al. 2005, ApJ, 633, 560
 Grieb, J. N., Sánchez, A. G., Salvador-Albornoz, S., & Dalla Vecchia, C. 2016, MNRAS, 457, 1577
 Guzzo, L., Pierleoni, M., Meneux, B., et al. 2008, Nature, 451, 541
 Hamilton, A. J. S. 1993, ApJ, 417, 19
 Hewett, P. C. 1982, MNRAS, 201, 867
 Hou, J., Sánchez, A. G., Scoccimarro, R., et al. 2018, MNRAS, 480, 2521
 Jarvis, M. 2015, Astrophysics Source Code Library [record ascl:1508.007]
 Kaiser, N. 1987, MNRAS, 227, 1
 Kerscher, M. 1999, A&A, 343, 333
 Kerscher, M., Szapudi, I., & Szalay, A. S. 2000, ApJ, 535, L13
 Landy, S. D., & Szalay, A. S. 1993, ApJ, 412, 64
 Laureijs, R., Amiaux, J., Arduini, S., et al. 2011, ArXiv e-prints [arXiv:1110.3193]
 Lippich, M., Sánchez, A. G., Colavincenzo, M., et al. 2019, MNRAS, 482, 1786
 Marulli, F., Veropalumbo, A., & Moresco, M. 2016, Astron. Comput., 14, 35
 Moore, A., Connolly, A., Genovese, C., et al. 2000, ArXiv e-prints [arXiv:astro-ph/0012333]
 Peacock, J. A., Cole, S., Norberg, P., et al. 2001, Nature, 410, 169
 Peebles, P. J. E., & Hauser, M. G. 1974, ApJS, 28, 19
 Percival, W. J., Reid, B. A., Eisenstein, D. J., et al. 2010, MNRAS, 401, 2148
 Pezzotta, A., de la Torre, S., Bel, J., et al. 2017, A&A, 604, A33
 Reid, B. A., Samushia, L., White, M., et al. 2012, MNRAS, 426, 2719
 Ross, A. J., Samushia, L., Howlett, C., et al. 2015, MNRAS, 449, 835
 Ross, A. J., Beutler, F., Chuang, C. H., et al. 2017, MNRAS, 464, 1168
 Ruggeri, R., Percival, W. J., Gil-Marín, H., et al. 2019, MNRAS, 483, 3878
 Slepian, Z., & Eisenstein, D. J. 2015, MNRAS, 454, 4142
 Vargas-Magaña, M., Ho, S., Cuesta, A. J., et al. 2018, MNRAS, 477, 1153
 Wall, J. V., & Jenkins, C. R. 2012, Practical Statistics for Astronomers (Cambridge, UK: Cambridge University Press)
 Zarrouk, P., Burtin, E., Gil-Marín, H., et al. 2018, MNRAS, 477, 1639
 Zehavi, I., Zheng, Z., Weinberg, D. H., et al. 2011, ApJ, 736, 59
Appendix A: Derivation examples
As examples of how the variances of the different deviations, ⟨αβ⟩ and so on, are derived, following the method presented in LS93, we give three of the cases here: ⟨α_{1}α_{2}⟩ (common to all the variants of LS), and the split-method analogs of ⟨γ_{1}β_{2}⟩ and ⟨γ_{1}γ_{2}⟩. The rest are calculated in a similar manner.
To derive the ⟨DD_{1}DD_{2}⟩ appearing in the ⟨α_{1}α_{2}⟩ of (15) we start from
where both (i, j) and (k, l) run over all microcell pairs, and n_{i} = 1 or 0 is the number of galaxies in microcell i. There are three different cases for the terms ⟨n_{i}n_{j}n_{k}n_{l}⟩, depending on how many of the indices are equal (with i ≠ j and k ≠ l in all cases).
The first case (quadruplets, i.e., two pairs of microcells) is when i, j, k, l are all different. We use to denote this part of the sum. There are (we work in the limit K → ∞) such terms and they have
where δ_{i} := δ(x_{i}) is the density perturbation (relative to the actual mean density of galaxies in the survey), ζ is the threepoint correlation, and η is the connected (i.e., the part that does not arise from the twopoint correlation) fourpoint correlation. The fraction of microcell quadruplets (pairs of pairs) that satisfy r_{ij} ≡ x_{j} − x_{i} ∈ r_{1} and r_{kl} ∈ r_{2} is . In the limit of large K the number of other index quadruplets is negligible compared to those where all indices have different values, so we have
For the connected pairs (i, j) and (k, l) we have ξ(r_{ij})=ξ(r_{1})≡ξ_{1} and ξ(r_{kl})=ξ(r_{2})≡ξ_{2}.
The second case (triplets of microcells) is when k or l is equal to i or j. We denote this part of the sum by
It turns out that it goes over all ordered combinations of {i, j, k}, where i, j, k are all different, exactly once, so there are K(K − 1)(K − 2) ≈ K^{3} such terms (triplets). Here
and
The third case (pairs of microcells) is when i = k and j = l. This part of the sum becomes
where
and
that is, the sum vanishes unless the two bins are the same, r_{1} = r_{2}.
We now apply the three approximations listed in Sect. 2.2: (1) ξ(r_{ik})=ξ(r_{il})=ξ(r_{jk})=ξ(r_{jl}) = 0 in (A.2); (2) ζ = 0 in (A.2) and (A.5); (3) η = 0 in (A.2). We obtain
and using ⟨DD⟩ from (3) we arrive at the ⟨α_{1}α_{2}⟩ result given in Eq. (15).
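The index combinatorics used above (quadruplets dominating triplets and pairs as K → ∞) can be sanity-checked numerically for a small number of microcells; the helper below is purely illustrative.

```python
from itertools import permutations

def ordered_tuples(K, m):
    """Number of ordered m-tuples of distinct indices from K microcells."""
    return sum(1 for _ in permutations(range(K), m))

K = 8
assert ordered_tuples(K, 3) == K * (K - 1) * (K - 2)            # triplets
assert ordered_tuples(K, 4) == K * (K - 1) * (K - 2) * (K - 3)  # quadruplets
# quadruplets outnumber triplets by a factor K - 3, so triplet and pair
# terms become negligible in the K -> infinity limit
print(ordered_tuples(K, 4) // ordered_tuples(K, 3))  # 5
```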
For the split method, we give the calculation for the split analogs of ⟨γ_{1}β_{2}⟩ and ⟨γ_{1}γ_{2}⟩ below.
First, the split analog of ⟨γ_{1}β_{2}⟩:
where
s_{i} is the number (0 or 1) of R^{μ} objects in microcell i, and r_{l} is the number of R objects in microcell l.
There are quadruplet terms, for which (see (A.3)) , and
Triplets where i or j is equal to k have ⟨s_{i}s_{k}n_{k}r_{l}⟩ = 0, since s_{k}n_{k} = 0 always (we cannot have two objects, from R^{μ} and D, in the same cell). There are K^{3} triplet terms where i or j is equal to l: for them , and
where if s_{l} = 1 then also r_{l} = 1 since R^{μ} ⊂ R.
Pairs where (i, j)=(k, l) or (i, j)=(l, k) have ⟨s_{k}s_{l}n_{k}r_{l}⟩ = 0, since again we cannot have two different objects in the same cell. Thus
and we obtain that
which is equal to the ⟨γ_{1}β_{2}⟩ of standard LS.
Then, the split analog of ⟨γ_{1}γ_{2}⟩:
where there are M_{s}(M_{s} − 1) terms with μ ≠ ν giving
and M_{s} terms with μ = ν giving
Adding these up gives
and
This differs from both standard LS and dilution.
All Figures
Fig. 1. Mean ξ(r) estimate and the scatter and theoretical bias of the estimates for different estimators. The dash-dotted line, our theoretical result for the scatter of the LS method, underestimates the scatter, since higher-order correlations in the D catalog are ignored. The dotted line is without the contribution of the q terms, and is dominated by the Poisson (p) terms. The bias is multiplied by 100 so the curves can be displayed in a more compact plot. For the measured mean and scatter, and the theoretical bias, we plot standard LS in black, dilution with d = 0.14 in red, and split with M_{s} = 50 in blue. For the mean and scatter the difference between the methods is not visible in this plot. The differences in the mean estimate are shown in Fig. 5. The differences in scatter (or its square, the variance) are shown in Fig. 3. For the theoretical bias, the difference between split and dilution is not visible at small r (ξ(r) > 1), where the bias is positive.

Fig. 2. Quantities p, p_{c}, p_{r}, q, q_{r}, t, and t_{r} for the Minerva shell. The values for the first bin are noisy. The vertical red line marks r = L. 

Fig. 3. Measured difference from LS of the variance of different estimators, multiplied by r^{2}. Dashed lines are our theoretical results. 

Fig. 4. Theoretical estimate of the scatter of the ξ estimates, divided by the scatter in the M_{r} → ∞ limit. The dotted lines correspond to 0.3%, 0.5%, and 1% increases in scatter. For r < 10 h^{−1} Mpc there is hardly any difference between split and dilution (the curves lie on top of each other), whereas for larger r split is much better.
