Fast Correlation Function Calculator: a high-performance pair-counting toolkit

Introduction
Correlation functions are a useful statistical tool in cosmology that characterise the excess probability of finding tracers with given separations compared to a random distribution. Thus, they are a measure of the clustering pattern of a tracer distribution, which can then be used to infer the statistical quantities of the underlying density field. In the current standard cosmological paradigm, the distribution of matter results from tiny fluctuations in the primordial Universe, which evolve following gravitational instability and cosmic expansion. For this reason, correlation functions are crucial for our understanding of inflation and cosmic structure formation models (e.g. Bernardeau et al. 2002). The pair correlation function, also known as the radial distribution function, which is essentially the isotropic two-point correlation function (2PCF), is also a fundamental quantity in statistical mechanics, where it links microscopic details to macroscopic properties (Chandler 1987).
The measurement of 2PCFs of galaxies and quasars has been a key goal of massive spectroscopic surveys, such as the Baryon Oscillation Spectroscopic Survey (BOSS; Dawson et al. 2013), the Extended Baryon Oscillation Spectroscopic Survey (eBOSS; Dawson et al. 2016), and the ongoing Dark Energy Spectroscopic Instrument (DESI; DESI Collaboration 2016). With a data catalogue and the corresponding random sample, the 2PCF is generally measured using the Landy-Szalay (LS) estimator (Landy & Szalay 1993),

xi = (DD - 2 DR + RR) / RR ,

where DD, DR, and RR denote the normalised data-data, data-random, and random-random pair counts, respectively. Observational and simulated galaxy samples today typically contain several million galaxies or more. Robust clustering measurements further require random samples with typically ten times as many objects. As a result, the brute-force pair-counting approach that evaluates N^2 pair separations, where N is the number of data points, is impractical. The calculation of 2PCFs from pair counts has become a practical challenge for modern cosmological analysis, not to mention higher-order statistics like three-point correlation functions. The evaluation of correlation functions is effectively a range-searching problem, which reports objects within a query range. Range searching is a fundamental topic in computational geometry, and a variety of data structures and algorithms are available to solve range-searching problems with different objects and query ranges (see e.g. de Berg et al. 2008). In the context of cosmology, efficient correlation function calculators have also been studied extensively in the literature, from the pioneering work by Moore et al. (2001) to the recent remarkable development of Sinha & Garrison (2020). Meanwhile, significant efforts have been made to make full use of high-performance computing (HPC) resources, such as large numbers of multi-core CPUs and GPUs (e.g. Dolence & Brunner 2008; Alonso 2012; Chhugani et al. 2012; Ponce et al. 2012). Different approximate methods have been explored widely as well (e.g. Zhang & Pen 2005; Slepian & Eisenstein 2015; Philcox et al. 2022).
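As a concrete illustration, the LS estimator can be evaluated in a few lines once the raw pair counts are available. The following is a minimal sketch, not FCFC's interface; the function name and the normalisation convention (counting each unordered pair once) are our own assumptions:

```python
import numpy as np

def landy_szalay(dd, dr, rr, n_data, n_rand):
    """Landy-Szalay 2PCF estimator from raw (unnormalised) pair counts.

    dd, dr, rr : arrays of data-data, data-random, and random-random
                 pair counts per separation bin.
    n_data, n_rand : numbers of data and random points, used to
                     normalise the counts.
    """
    dd = np.asarray(dd, dtype=float) / (n_data * (n_data - 1) / 2)
    dr = np.asarray(dr, dtype=float) / (n_data * n_rand)
    rr = np.asarray(rr, dtype=float) / (n_rand * (n_rand - 1) / 2)
    return (dd - 2 * dr + rr) / rr
```

For a perfectly unclustered sample the normalised DD, DR, and RR counts agree, and the estimator returns zero.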

[Fig. 1. Illustration of the data structure architecture: nodes referencing the data in memory.]
Although many pair-counting codes are publicly available (e.g. Alonso 2012; Jarvis 2015; Rohin 2018; Donoso 2019; Sinha & Garrison 2020, and references therein), we introduce the Fast Correlation Function Calculator (FCFC; https://github.com/cheng-zhao/FCFC), a novel high-efficiency, scalable, portable, flexible, and user-friendly toolkit for exact pair counting. We focus on FCFC version 1.0.1 in this article, which supports 2PCFs for 3D data with various commonly used binning schemes. It may be the fastest publicly available 2PCF calculator for modern cosmological datasets to date.
This paper is organised as follows. We begin with a comparison of different data structures for pair-counting problems in Sect. 2. Then, we introduce the pair-counting and histogram update algorithms used by FCFC in Sect. 3. In Sect. 4 we describe the performance of FCFC with different levels of parallelism. A direct comparison between FCFC and Corrfunc, another efficient cosmological pair-counting code, is presented in Sect. 5. Finally, we conclude in Sect. 6.

Data structures
A data structure is a way of organising and storing an input dataset in memory that allows efficient data access. Typically, it is not only unnecessary but also inefficient to process the data all at once in a program. A well-designed data structure may prevent the retrieval of irrelevant data during queries as much as possible; this is known as data pruning. Thus, data structures are usually crucial for efficient algorithms. To this end, several types of data structures for pair-counting applications have been proposed in the literature, including regular grids (Alonso 2012; Sinha & Garrison 2020), linked lists (Donoso 2019), the k-d tree (Moore et al. 2001), and the ball tree (Rohin 2018).
In general, a data structure sorts and partitions the dataset and stores the data segments on different nodes, either by copying the data directly or by saving only their addresses in memory. Each node is typically defined as an abstract data type that contains summaries of the associated data, sometimes only implicitly, which allows a quick judgement of whether the data should be retrieved. Connections between different nodes may also be built to optimise the node traversal process. This architecture is illustrated in Fig. 1. Because the datasets for cosmological applications are large, we save only data pointers on the nodes, to make the latter more compact and to reduce the chance of cache misses during node traversal. The raw data, which are accessed less often, are stored separately and contiguously in memory. In particular, during the construction of the data structures, the data are sorted in such a way that points belonging to adjacent nodes are aligned contiguously.
For pair-counting applications, it is crucial to be able to compute the separation ranges between nodes efficiently, so as to omit nodes that are too far away from or too close to each other without visiting individual data points. For this purpose, we describe a few data structures in this section, including regular grids, the k-d tree, and a new variant of the ball tree, and compare their pair-counting performance. Throughout this section, the costs of structure construction are not included in our benchmarks, as they generally take <1% of the time used by the pair-counting processes. Moreover, we do not consider multiple histogram bins in this section: with the same histogram update algorithm, almost identical costs are added to the pair-counting process for different data structures, which would make the comparisons of the data-pruning efficiency less informative. The computational costs are all measured with a single Haswell CPU core (see Appendix A).

Regular grids
A simple way to partition a dataset is to divide the domain into regular axis-aligned grids, with unique identifiers for spatial indexing. Normally, the positions and extents of the grid cells can be expressed by polynomials of the identifiers, or indices. Therefore, distance ranges between different grid cells can be inferred from the differences of cell indices, which can be computed prior to the cell traversal process. This makes regular grids a potentially very efficient data structure for pair counting.
With the architecture shown in Fig. 1, only three passes through the dataset are required to construct regular grids for a catalogue: (1) find the minimum axis-aligned bounding box (AABB) of the catalogue to define the grids; (2) count the number of data points in each grid cell; and (3) group the data points based on the associated cell indices. Therefore, the construction of regular grids can be very efficient, with a time complexity of O(N), where N denotes the total number of data points. In contrast, the storage consumed by regular grids is very sensitive to the number of grid cells, and scales as O(∏_i N_g,i), where N_g,i indicates the number of cells along the i-th dimension.
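The three-pass construction can be sketched as follows, assuming cubic grids with the same number of cells per side; a stable `np.argsort` stands in for the counting-sort grouping pass, and the function name is ours, not FCFC's:

```python
import numpy as np

def build_grid(points, n_cells):
    """Group points by regular-grid cell in O(N), following the three
    passes: (1) bounding box, (2) per-cell counts, (3) grouping.
    Returns the grouped point order plus per-cell offsets into it."""
    points = np.asarray(points, dtype=float)
    lo = points.min(axis=0)                      # pass 1: minimum AABB
    hi = points.max(axis=0)
    width = np.where(hi > lo, hi - lo, 1.0)      # avoid division by zero
    idx3d = np.minimum((points - lo) / width * n_cells,
                       n_cells - 1).astype(int)  # per-axis cell indices
    cell = np.ravel_multi_index(idx3d.T, (n_cells,) * points.shape[1])
    counts = np.bincount(cell, minlength=n_cells ** points.shape[1])  # pass 2
    order = np.argsort(cell, kind='stable')      # pass 3: group by cell
    offsets = np.concatenate(([0], np.cumsum(counts)))
    return order, offsets
```

Points of cell `c` are then `points[order[offsets[c]:offsets[c + 1]]]`, aligned contiguously as described above.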
The efficiency of data pruning for regular grids largely depends on the cell sizes as well. An example is shown in Fig. 2, where data partitions with two different cell sizes are illustrated, even though these cell-size choices are likely unwise for real-world applications. Given the same reference point and maximum distance for an isotropic range search, the numbers of visited cells and data points both differ significantly when the number of cells per box side is varied. Here, data points belonging to different grid cells are arranged in column-major order, and gaps between adjacent memory visits are observed. Sorting the cells with a space-filling curve, such as the Hilbert curve, may improve the memory locality and reduce the chance of cache misses (see e.g. Springel 2005, for an application). Nevertheless, the improvement is expected to be marginal, as the memory jumps can never be entirely eliminated, and it becomes more difficult, albeit still possible, to pre-compute the map from grid cell indices to the distance ranges between cells.
The algorithm for pair counting with regular grids is as simple as traversing all grid cells that contain data points, and successively visiting cells that are separated within the distance range of interest, given the pre-computed offsets of indices. This procedure is detailed in Appendix B. Consequently, the complexity of the algorithm depends not only on the number of grid cells that intersect with the query range, but also on the average number of data points in each cell. Apparently, when the side lengths of regular grid cells are increased, the number of cells to be visited is reduced, but there may be more unnecessary distance evaluations for pairs, as illustrated by Fig. 2. Thus, the choice of cell sizes is crucial for the efficiency of a grid-based pair-counting algorithm (see also Sinha & Garrison 2020, for relevant discussions).
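The pre-computed index offsets mentioned above can be obtained by testing the minimum cell-to-cell distance for every candidate offset. The helper below is a hypothetical sketch of this idea for cubic cells, not FCFC's actual routine, and it ignores periodic wrapping for simplicity:

```python
import itertools
import numpy as np

def neighbour_offsets(l_cell, r_max):
    """Pre-compute the grid-index offsets whose cells can contain pairs
    with separations below r_max, by testing the minimum distance
    between two cells separated by each candidate offset."""
    n = int(np.ceil(r_max / l_cell))
    offsets = []
    for d in itertools.product(range(-n, n + 1), repeat=3):
        # minimum separation between the two cells along each axis:
        # adjacent or overlapping index offsets contribute zero
        gap = [max(abs(di) - 1, 0) * l_cell for di in d]
        if sum(g * g for g in gap) < r_max ** 2:
            offsets.append(d)
    return offsets
```

During pair counting, each occupied cell then only needs to be compared against the cells reached by these offsets, instead of all cells in the box.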
We then performed a series of benchmarks with the pair-counting routine based on regular grids, which reports the number of pairs with separations smaller than R_max, and omits histogram bins of distances to separate the impacts of the data structure and histogram update algorithm (see Sect. 3.2 for details). For simplicity, the pair-counting procedure is based on cubic grid cells with a side length of L_cell, and run on N uniformly distributed random points in a periodic cubic volume with a box size of L_box. The execution times of the pair-counting processes with different settings are shown in Fig. 3, together with the theoretical model detailed in Appendix C. L_box and L_cell are expressed as multiples of R_max, as the benchmark results are independent of the length units. The results confirm the sensitivity of the computational costs to cell sizes. For the configurations we explored, the optimal L_cell is typically 0.1 to 0.5 times R_max.

k-d tree
A k-d tree (Bentley 1975) is a binary space-partition data structure that is commonly used for range-searching and nearest-neighbour algorithms. It partitions the k-dimensional space recursively with axis-aligned planes. Depending on the choice of the splitting planes, there are several variants of the k-d tree structure. In this work, we chose the 'optimised' k-d tree introduced by Friedman et al. (1977), for which the space-partition planes are perpendicular to the dimension with the largest data variance and split the dataset into two parts at the median point. Therefore, this variant always produces a balanced tree structure, and is particularly useful for observational catalogues with an arbitrary survey geometry.
Algorithm 1 shows the procedure we followed to construct the k-d tree for pair-counting purposes. The root node of the tree is associated with all the data points. For each non-leaf node, the two subsets of data after the space partition are assigned to its two children, respectively. In addition, we stored the minimum AABB of the points on each node for efficient data pruning, because it is easy to evaluate the minimum and maximum distances between AABBs, which are good estimates of the separation ranges of points on different nodes. Finally, the tree construction process terminates when all leaf nodes contain at most n_leaf points.
Since the k-d tree is always balanced, there are in total O(N) tree nodes for a fixed n_leaf. The storage cost of the tree is then O(N). Computations of the minimum AABB and coordinate variances require only two passes through the dataset. Additionally, we split the data for the child nodes using the linear-time adaptive QUICKSELECT algorithm (MedianOfNinthers; Alexandrescu 2017). Therefore, the k-d tree construction can be accomplished in O(N log N) time, given the tree depth of O(log N).
Algorithm 1 KDTREE_BUILD (P, n_leaf)
Input: a point set P and the capacity of leaf nodes.
Output: the root ν of a k-d tree for P.
1: Create a tree node ν and store on it the minimum AABB of P
2: if cardinality(P) ≤ n_leaf then
3:     Associate P with ν
4:     return ν as a leaf node
5: else
6:     Find the axis direction with the largest coordinate variance for all points in P, and divide P into P_1 and P_2 with a splitting plane perpendicular to this direction, such that P = P_1 ∪ P_2, P_1 ∩ P_2 = ∅, and cardinality(P_1) = ⌊cardinality(P)/2⌋
7:     Set KDTREE_BUILD (P_1, n_leaf) as the left child of ν
8:     Set KDTREE_BUILD (P_2, n_leaf) as the right child of ν
9:     return ν
10: end if

Figure 4 shows the k-d tree constructed with n_leaf = 3 for the same sample points as in Fig. 2. The space partition is adaptive, and in this particular example the tree is complete, with the same number of points on all leaf nodes. In addition, the total AABB volume of nodes with the same depth can be significantly smaller than the volume of the full dataset, especially for leaf nodes, due to the gaps between the bounding boxes of different nodes. This implies a relatively high data-pruning efficiency, as it is easier to detect data groups that are too far away from or too close to each other than in the grid-based method. For the example shown in Fig. 4, only four leaf nodes were visited after checking the distances between AABBs. Furthermore, since the leaf nodes visited are in the same branch of the tree, the associated data points are aligned contiguously in memory, which indicates a high memory-access efficiency.
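The construction of the optimised k-d tree can be sketched as follows. This is a simplified stand-in for FCFC's C implementation: `np.argpartition` replaces the linear-time QUICKSELECT, recursion replaces explicit node management, and the index array plays the role of the contiguously sorted data:

```python
import numpy as np

class Node:
    __slots__ = ('lo', 'hi', 'left', 'right', 'start', 'end')

def kdtree_build(points, index, start, end, n_leaf=8):
    """Optimised k-d tree: split along the axis of largest variance at
    the median point, and store the minimum AABB on each node.
    `index` is reordered in place so leaf points are contiguous."""
    node = Node()
    sub = points[index[start:end]]
    node.lo, node.hi = sub.min(axis=0), sub.max(axis=0)   # minimum AABB
    node.start, node.end = start, end
    if end - start <= n_leaf:
        node.left = node.right = None                     # leaf node
        return node
    axis = int(np.argmax(sub.var(axis=0)))                # largest variance
    mid = start + (end - start) // 2
    # partial sort standing in for the linear-time QUICKSELECT
    order = np.argpartition(points[index[start:end], axis], mid - start)
    index[start:end] = index[start:end][order]
    node.left = kdtree_build(points, index, start, mid, n_leaf)
    node.right = kdtree_build(points, index, mid, end, n_leaf)
    return node
```

Because each split assigns ⌊N/2⌋ points to the left child, the resulting tree is balanced, and every leaf holds at most `n_leaf` contiguous entries of `index`.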
We used the dual-tree algorithm (see Sect. 3.1) to count pairs with the k-d tree, which traverses the tree nodes in a top-down manner. In brief, we skipped all the descendants of two nodes when the separation range between the minimum AABBs of these nodes was entirely inside or outside the query range for pair counting. In other words, a leaf node was only visited when the corresponding AABB intersected with the query range boundary of its counterpart during the tree traversal process, which is usually a leaf node as well. Consequently, the sizes of the nodes from which data points are retrieved are adaptive, and the number of visited nodes is greatly reduced compared to the grid-based approach, especially when the query range is large.
We then ran the pair-counting routine based on the k-d tree on the same catalogues as were used for the benchmarks of the grid-based method. Again, we considered a single histogram bin for separations smaller than R_max. The results with different choices of the leaf node capacity are shown in Fig. 5. The execution time of the pair-counting algorithm based on the k-d tree does not vary significantly with n_leaf, especially when 4 ≤ n_leaf ≤ 64, compared to the strong cell-size dependence of the grid-based method (see Fig. 3; we provide more detail in Appendix C). Moreover, the optimal n_leaf is found to be 8 for almost all configurations presented in Fig. 5. This makes the k-d tree structure particularly useful in practice, as it is not necessary to explore different choices of n_leaf to maximise the pair-counting efficiency for different input samples.

Ball tree
Similar to the k-d tree, the ball tree (Omohundro 1989) is also a binary space-partition tree that is useful for range queries, especially in high dimensions. In general, every node of a ball tree defines a hypersphere that contains all the points on the node. This makes it slightly easier to compute the minimum and maximum distances between two nodes than in the case of the axis-aligned boxes of the k-d tree. However, for traditional ball-tree implementations (e.g. Moore 2000), the tree is not necessarily balanced, and the hyperspheres, or balls, can be significantly larger than the minimum bounding spheres of the points. As a result, both the dual-tree algorithm (see Sect. 3.1) and the data-pruning process are sub-optimal for pair counting (see however the application in Rohin 2018). We therefore introduce a new variant of the ball-tree structure to circumvent these problems.
One way to construct a balanced ball tree is to use the space partition scheme of the k-d tree. In this case, all subsets of data points are bounded by axis-aligned boxes, and data pruning with minimum bounding spheres is expected to be less efficient than with the k-d tree, because their volumes are generally larger than those of the corresponding minimum AABBs. Moreover, axis-aligned partition schemes may be sub-optimal for observational data with complicated shapes. To circumvent these problems, we followed the space partition approach introduced by Dolatshah et al. (2015), which defines the splitting plane based on a principal component analysis (PCA). In particular, the plane is chosen to be perpendicular to the most significant principal component of the data distribution, which is the direction with the largest variance of the data points. Thus, the resulting data subsets are statistically the least extended. In this way, the minimum bounding spheres of the ball-tree nodes are generally small enough in practice for efficient data pruning.
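The PCA-based split can be illustrated as follows; NumPy's symmetric eigensolver stands in for the symmetric QR routine used in practice, and the function name is ours:

```python
import numpy as np

def pca_split(points):
    """Split a point set at the median of the projections onto the first
    principal component (the direction of largest variance), with the
    splitting plane perpendicular to that direction."""
    pts = np.asarray(points, dtype=float)
    centred = pts - pts.mean(axis=0)
    cov = centred.T @ centred / len(pts)        # covariance matrix
    eigval, eigvec = np.linalg.eigh(cov)        # symmetric eigensolver
    axis = eigvec[:, np.argmax(eigval)]         # first principal component
    proj = centred @ axis                       # signed distance to plane
    order = np.argsort(proj)
    mid = len(pts) // 2
    return pts[order[:mid]], pts[order[mid:]]   # balanced halves
```

Since the split is at the median of the projections, the two halves differ in size by at most one point, so the resulting tree is balanced regardless of the data distribution.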
The next step was to compute the minimum bounding spheres of the subdivided datasets. In principle, the exact solution can be obtained in linear time using a randomised algorithm (Welzl 1991; Gärtner 1999). However, it is relatively slow for large datasets. We therefore focused on the approximate algorithm introduced by Ritter (1990), which ensures that all the input data points are enclosed by the reported sphere, but typically overestimates the radius by ≲20% (e.g. Larsson 2008). This algorithm sets up an initial sphere with three points that are far away from each other, and then goes through the rest of the data points. Whenever a point is found outside the sphere, a new sphere that encloses both the point and the previous sphere is constructed. Following Larsson (2008), we improved this algorithm by constructing a better initial sphere, which is defined by the extreme points along the directions of the first two principal components. In practice, the minimum bounding sphere of the four extreme points is computed exactly, and this sphere is then updated in the same way as in Ritter (1990).
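The core of Ritter's algorithm can be sketched as follows; for brevity, this version uses a simple two-point initial sphere rather than the improved PCA-based initialisation described above:

```python
import numpy as np

def ritter_sphere(points):
    """Approximate minimum bounding sphere (Ritter 1990): build an
    initial sphere from two mutually distant points, then grow the
    sphere for every point found outside it."""
    pts = np.asarray(points, dtype=float)
    p = pts[0]
    q = pts[np.argmax(((pts - p) ** 2).sum(axis=1))]   # far from p
    r = pts[np.argmax(((pts - q) ** 2).sum(axis=1))]   # far from q
    centre = (q + r) / 2
    radius = np.linalg.norm(q - r) / 2
    for x in pts:                                      # single growth pass
        d = np.linalg.norm(x - centre)
        if d > radius:                                 # x lies outside:
            radius = (radius + d) / 2                  # enclose old sphere
            centre += (1 - radius / d) * (x - centre)  # and the point x
    return centre, radius
```

Each growth step produces the smallest sphere enclosing both the previous sphere and the outlying point, which guarantees that all points processed so far remain enclosed.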
The full procedure for the construction of our ball-tree variant is shown in Algorithm 2, which is very similar to that of the k-d tree (see Algorithm 1), and consumes O(N) space as well, since the tree is balanced. In practice, we relied on the symmetric QR algorithm (e.g. Golub & Van Loan 2013) for the 3D PCA. When the number of data points is large, the computing time for PCA is dominated by the covariance matrix evaluation, which requires two passes through the dataset. The update of the minimum bounding sphere needs another pass. Again, we used the adaptive QUICKSELECT algorithm for the data partition, but with a comparison rule that involves the first principal component of the dataset.

Algorithm 2 BALLTREE_BUILD (P, n_leaf)
Input: a point set P and the capacity of leaf nodes.
Output: the root ν of a ball tree for P.
Create a tree node ν and store on it the approximate minimum bounding sphere of P. If cardinality(P) ≤ n_leaf, associate P with ν and return ν as a leaf node. Otherwise, perform the PCA for the points in P, and divide P into subsets P_1 and P_2 with a splitting plane perpendicular to the first principal component, such that P = P_1 ∪ P_2, P_1 ∩ P_2 = ∅, and cardinality(P_1) = ⌊cardinality(P)/2⌋. Then, set BALLTREE_BUILD (P_1, n_leaf) and BALLTREE_BUILD (P_2, n_leaf) as the two children of ν, and return ν.

Therefore, the time complexity for constructing a single ball-tree node is O(N), and is O(N log N) for the whole tree. In practice, the ball-tree construction process is typically only marginally slower than that of the k-d tree.
An example of the ball tree constructed with the same points as in Fig. 2 is shown in Fig. 6. The space partition lines are not axis-aligned in general, resulting in different data point groups than those of the axis-aligned partition scheme (see Fig. 4). Different nodes with the same depth do not share data points, but their bounding spheres may overlap, and the bounding sphere of a node may not be fully inside that of its parent. This does not necessarily mean that the data-pruning efficiency is low, as the distances between different nodes are always examined in top-down order. For the example in Fig. 6, the range-searching process involves one more leaf node than that of the k-d tree, but the visited data points are still stored contiguously, indicating a good memory locality.
Similar to the case of the k-d tree, the tree-independent dual-tree algorithm (see Sect. 3.1) was used to count pairs with the ball tree. The benchmark results with the ball tree are shown in Fig. 7, with a single histogram bin for separations below R_max. The dependence of the execution time measurements on n_leaf is similar to that of the k-d tree. The theoretical model was derived for the k-d tree (see Appendix C), but it also works well for the ball tree. This can be explained by the fact that the spatial partition schemes of these two data structures are similar for a periodic box. Again, the results are not sensitive to the choice of n_leaf, and a leaf node capacity of 8 is found to be optimal in almost all cases.

Comparison of the data structures
In order to identify the optimal data structure among those discussed so far for real-world pair-counting problems, we performed two additional sets of benchmarks with both periodic and non-periodic datasets. For the tests with periodic boundary conditions, which is the case for cosmological simulations, we generated uniformly distributed random points in a cubic volume with a box size of L_box. Then, to mimic the geometry of observational data in redshift bins, we cut the cubic catalogues at R_out = L_box and R_in = L_box/2 with respect to a corner of the boxes, and took the sections in between as our non-periodic samples, which are essentially octants of spherical shells. The reason for using random samples is that the most computationally challenging pair-counting tasks in practice usually involve random catalogues, because high-density randoms are required for clustering measurements. We counted pairs in [0, R_max) as a whole to exclude costs from the histogram update process, as the goal was only to examine the data-pruning efficiencies of the different data structures. For all tests, we set L_box = 10 R_max, which is typical for modern cosmological applications, for instance pair counting with separations up to 200 h^-1 Mpc for simulations with a side length of 2 h^-1 Gpc. We considered only cubic grid cells for regular grids, but with two choices of cell sizes, 0.1 R_max and 0.2 R_max, which are near optimal for most cases shown in Fig. 3. We set n_leaf = 8 for the k-d and ball trees, because it is the most favourable in almost all cases in Figs. 5 and 7.
For the benchmarks, we used the pair-counting algorithm described in Appendix B for regular grids, and the dual-tree algorithm presented in Sect. 3.1 for the trees. Both algorithms were highly optimised to permit a fair comparison of the data-pruning efficiency of the data structures. We also note that identical pair counts are produced with all the data structures used. Our benchmark results with different input sample sizes are shown in Fig. 8. The performance of the two tree structures is very similar, with differences <5% for all our tests. Again, we find that the efficiency of the grid-based method is sensitive to the choice of cell size. In particular, the optimal cell size decreases as the sample size increases. This is because too many data points per cell may result in a large number of unnecessary pair-separation evaluations, while the cost of traversing cells can be significant when many cells are almost empty. In contrast, the sizes of tree nodes are adaptive. In terms of the execution time, when the number of data points is ≲10^6, regular grids with the optimal cell size can be slightly better than the trees, but the improvement is only ≲5% compared to the ball tree. However, when a sub-optimal cell size is used, the computing time with regular grids can be as long as twice that of the trees. When the number of data points is ≳10^7, the tree structures are always more efficient regardless of the choice of cell size for regular grids, especially for the non-periodic and non-cubic catalogues, but the speedups may be marginal in practice when the cost of the histogram update process is accounted for. Meanwhile, for regular grids, the scaling of the computational cost with respect to the sample size reaches N^2 earlier than for the tree structures; the scaling of the tree structures is better than N^2 even at N = 5 × 10^7.
To conclude, both the k-d tree and the ball tree perform better than regular grids for modern and next-generation cosmological pair-counting problems with ≳10^7 objects in the data or random catalogues, because their computational costs are lower in general, and because there are no fine-tuning parameters that depend on the input samples and strongly affect the performance. In particular, given the second reason, we did not test the grid-based method hereafter. There is no essential difference between the efficiencies of the two tree structures, and we therefore implemented both the k-d tree and the ball tree in the FCFC toolkit.

Discussion of the data structures for pair counting
A variety of data structures are available for different range-searching problems in the field of computational geometry (e.g. de Berg et al. 2008). The basic idea of time-efficient data structures is to allow groups of points to be reported directly, without visiting them individually. Therefore, the complexity of a pair-counting algorithm can in principle be better than O(N^2), which is the complexity of a brute-force approach that examines all data pairs. However, for cosmological applications it is generally necessary to count pairs in thin separation bins. The 2PCFs are typically measured in (s, µ) or (σ, π) bins for anisotropic information, which are given by

s = |s_1 - s_2| ,  µ = |(s_1 - s_2) · l| / (s |l|) ,  π = s µ ,  σ = sqrt(s^2 - π^2) ,

where s_1 and s_2 denote the coordinates of the two points that form a pair, and l is the line-of-sight vector. For observational data, l is typically defined as

l = (s_1 + s_2) / 2 ,

while for simulations, the plane-parallel line of sight is usually assumed, for example

l = (0, 0, 1) .

The complicated binning schemes make it difficult to find pairs of large data groups with separations all in the same bin. For instance, the typical number density of modern galaxy samples is ρ ∼ 10^-3 h^3 Mpc^-3. Then there is on average only one point in a cubic volume with a side length of 10 h^-1 Mpc, which is already larger than the commonly used separation bin width of 5 h^-1 Mpc for 2PCFs. This problem may be less severe for galaxy catalogues with strong clustering patterns. However, the most challenging tasks are normally pair counts with random samples for the normalisation of 2PCFs. In most cases, individual pairs therefore have to be visited to update the pair-counting histograms.
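For concreteness, the binning quantities for a single pair can be computed as follows; this is a sketch with our own function name, not FCFC's interface:

```python
import numpy as np

def pair_bins(s1, s2, los):
    """(s, mu) and (sigma, pi) for one pair, given the line of sight
    `los` (plane-parallel, e.g. (0, 0, 1), for simulations, or
    (s1 + s2) / 2 for observational data)."""
    d = np.asarray(s1, dtype=float) - np.asarray(s2, dtype=float)
    los = np.asarray(los, dtype=float)
    s = np.linalg.norm(d)                       # pair separation
    pi = abs(d @ los) / np.linalg.norm(los)     # separation along the LOS
    sigma = np.sqrt(max(s * s - pi * pi, 0.0))  # transverse separation
    mu = pi / s if s > 0 else 0.0               # |cosine| to the LOS
    return s, mu, sigma, pi
```

For instance, the pair (0, 0, 0) and (3, 0, 4) with a plane-parallel line of sight along z has s = 5, µ = 0.8, σ = 3, and π = 4.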
For a 3D periodic box, when ρ R_max^3 ≫ 1, the total number of pairs with separations in [0, R_max) can be estimated by

N_pair ≈ (2π/3) N^2 R_max^3 / L_box^3 .

Since N_pair ∝ N^2, the complexity of a real-world pair-counting algorithm is generally ineluctably O(N^2). For this reason, the aim of the data structures described in this work is to reduce the constant factor hidden in the complexity by efficient data pruning. After all, N_pair/N^2 can be as small as ∼10^-3 when L_box = 10 R_max.
Hence, the pair-counting algorithm with a well-designed data structure can still be faster than the brute-force approach by a few orders of magnitude. In principle, however, it is possible to reduce the complexity of the pair-counting process for certain cosmological problems. As an example, because σ and π are independent for the plane-parallel line of sight, the evaluation of (σ, π) pair counts for periodic simulations can benefit from developments in orthogonal range queries (de Berg et al. 2008). For instance, following the spirit of the range tree (e.g. Lueker 1978), a binary tree can be constructed with the z coordinates, and at each node there can be an associated k-d tree or range tree for the x and y coordinates. Then, groups of pairs in π bins can be reported in logarithmic time, and individual pair visits are only required for the associated 2D subtrees. This improves the overall pair-counting complexity at the expense of additional storage space. We leave detailed studies of this case to a future work.
For future samples with unprecedented number densities, isotropic pair counting with s bins alone can potentially be improved as well. For a given reference point (x_0, y_0, z_0), the pair-counting process is equivalent to a spherical range search, that is, finding all points (x, y, z) within a certain radius R:

(x - x_0)^2 + (y - y_0)^2 + (z - z_0)^2 ≤ R^2 .

When we define w ≡ x^2 + y^2 + z^2, the condition can be rewritten as

2 x_0 x + 2 y_0 y + 2 z_0 z - w ≥ x_0^2 + y_0^2 + z_0^2 - R^2 .

Therefore, the 3D spherical range-searching problem is converted into a 4D half-space range search, that is, finding all the points (x, y, z, w) above a given hyperplane. This is a well-known problem in computational geometry, and there are data structures that are able to accomplish the query in logarithmic time.
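The equivalence of the two formulations is easy to verify numerically; the predicates below simply restate the spherical condition and its lifted half-space form:

```python
import numpy as np

def in_sphere(p, p0, r):
    """3D spherical range test: |p - p0| <= r."""
    return ((p - p0) ** 2).sum(axis=-1) <= r ** 2

def above_hyperplane(p, p0, r):
    """Equivalent 4D half-space test after lifting w = x^2 + y^2 + z^2:
    2 x0 x + 2 y0 y + 2 z0 z - w >= |p0|^2 - r^2."""
    w = (p ** 2).sum(axis=-1)
    return 2 * (p * p0).sum(axis=-1) - w >= (p0 ** 2).sum() - r ** 2

rng = np.random.default_rng(1)
pts = rng.random((1000, 3))
p0 = np.array([0.5, 0.5, 0.5])
# the two predicates select exactly the same points
same = (in_sphere(pts, p0, 0.3) == above_hyperplane(pts, p0, 0.3)).all()
```

The lifted coordinate w depends only on each point itself, so it can be pre-computed once and stored alongside the 3D coordinates.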
Trade-offs between the query time and storage costs are also possible (see de Berg et al. 2008; Agarwal 2017, for reviews). However, these data structures and algorithms are generally very difficult to implement in practice. We leave them for future developments.
It is possible to further boost the efficiency of pair counting considerably by allowing inexact solutions. For instance, there are data structures for approximate range queries with controlled errors, which can be adjusted to trade query time against storage costs (e.g. da Fonseca & Mount 2010). There are also 2PCF estimators that pixelate the volume (Alonso 2012), neglect the extent of tree nodes (Zhang & Pen 2005), or make use of fast Fourier transforms (e.g. Pen et al. 2003). It is important to validate these approximate methods in terms of their accuracies on different scales with modern cosmological data. We will perform the relevant tests and combine the exact and inexact methods in FCFC, to achieve a higher efficiency with tuneable precision, in a follow-up paper.

Algorithms
Algorithms are, in addition to data structures, another fundamental building block of a program. A good algorithm may accomplish computational tasks efficiently by taking advantage of the layout of the input dataset in memory given the data structure, or by making use of memoisation to avoid redundant computations. The most important algorithms used by FCFC are those for identifying pairs within the desired separation ranges, and for updating histogram bins given a large number of (possibly multi-dimensional) pair separations; these are usually the most time-consuming tasks for a correlation function calculator.

Tree-independent dual-tree algorithm
With the tree structures described in the previous section, a considerable fraction of the unnecessary distance evaluations can be avoided, provided an algorithm that detects node pairs outside the separation range of interest as early as possible. To this end, it is preferable to traverse trees in a top-down manner, because when the separation range between a pair of parent nodes is entirely outside or inside the query range, all their descendant nodes can be omitted. In particular, for the latter case, we visit the data associated with the parent nodes directly to avoid unnecessary tree node visits. We then end up with Algorithm 3, which is an improved version of the dual-tree algorithm introduced by Moore et al. (2001). Our dual-tree algorithm is tree independent (see also Curtin et al. 2013b), and is therefore applicable to all binary space-partition tree structures in principle. The complexity of the dual-tree algorithm should depend on the tree structure, but in practice the pair-counting efficiencies with the k-d and ball trees are quite similar (see Fig. 8).
Our algorithm traverses the tree in the so-called depth-first order, as it uses less memory than the breadth-first order for a balanced tree: the depth of a balanced binary tree is generally smaller than its width at a given level. We then maintain a stack for pairs of tree nodes to avoid the recursive function calls used in typical depth-first dual-tree algorithms (Moore et al. 2001; March et al. 2012). This increases the scalability of the algorithm with parallelisation, as different threads are able to work independently given their private stacks for dual nodes (see Sect. 4.2). The overhead due to the stack memory cost of recursive function calls is also mitigated. We do not directly report the total number of pairs from two nodes, as is done in Moore et al. (2001), because the examination of individual pairs is usually necessary for histogram updates with separation bins (see Sect. 2.5 for details).
Although this is not shown explicitly, the implementation of Algorithm 3 in FCFC is further optimised for some specific but common cases. For auto pair counts, we discard node pairs {ν2, ν1} when {ν1, ν2} is (going to be) visited, to avoid duplicated pair examinations, as ν1 and ν2 belong to the same tree. In addition, following Sinha & Garrison (2020), we do not inspect individual pairs of data points for wrapping large separations when periodic boundary conditions are enabled. Instead, we compute the offsets of coordinates for the periodic wrapping of node pairs given their bounding volumes, and apply the offsets directly to all the associated data points. In this way, a large number of periodic boundary detections are avoided.
Moreover, the dual-tree algorithm can be directly applied to angular pair counts, and can easily be extended for higher-order statistics, such as three- or four-point correlation functions. We leave the relevant developments to future work.

Update of pair-counting histograms
The cost of histogram updates in Algorithm 3 can be considerable, as there are usually numerous pairs within the query range, the number of which scales as O(N^2) in most cases (see Eq. (8)). Therefore, the complexity of the histogram update process is usually O(N^2), independent of the data structures and algorithms. Nevertheless, it is possible to reduce the hidden constant factor with a smart algorithm. In general, this factor relies on the number of bins and the distribution of separations, which are then crucial for comparing the performance of different histogram update algorithms. In practice, pair separations are usually computed from the squared distances. We therefore sampled squared distances randomly following their expected distributions in a periodic box (see Appendix D for details) for the histogram update algorithm benchmarks. We assumed monotonically increasing histogram bins for the tests; in reality, this can be fulfilled by pre-sorting the bins. We also required the bins to be continuous, which is a common scenario in practice. Furthermore, we used zero-based bin indices throughout this work.

Comparison-based methods
A direct way of locating the histogram bin of a given separation value is to compare it with the bin edges. In this case, the squared distances can be compared against pre-computed squared bin edges, without evaluating square roots for the actual separations. This improves both the efficiency and the numerical stability of the algorithms. A commonly used method for this purpose is the binary search algorithm. The average complexity of this algorithm is O(log N_bin), where N_bin denotes the number of histogram bins. This complexity is optimal for comparison-based methods when the separations are distributed uniformly across the bins and come in random order.
However, in reality there are usually more pairs with larger separations (see Appendix D). Thus, it is worthwhile to consider a simple algorithm that traverses the histogram bins continuously in reverse order, that is, starting from the bin for the largest separations (see Sinha & Garrison 2020). The worst-case complexity of this algorithm is O(N_bin), but the average computational cost can be lower than that of the binary search algorithm, especially when the distribution of separations across the bins is highly asymmetric.
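The two comparison-based methods can be sketched as follows, operating on pre-computed squared bin edges. This is illustrative code rather than the FCFC implementation; the function names are assumptions. The edges array is monotonically increasing with nbin + 1 entries, and -1 is returned for values outside all bins.

```c
#include <assert.h>

/* Binary search: O(log nbin) comparisons per squared distance. */
int bin_binary_search(const double *edge2, int nbin, double d2) {
    if (d2 < edge2[0] || d2 >= edge2[nbin]) return -1;
    int lo = 0, hi = nbin;      /* invariant: edge2[lo] <= d2 < edge2[hi] */
    while (hi - lo > 1) {
        const int mid = (lo + hi) / 2;
        if (d2 < edge2[mid]) hi = mid;
        else lo = mid;
    }
    return lo;
}

/* Reverse traversal: start from the bin for the largest separations,
   which is cheap on average when most pairs have large separations. */
int bin_reverse_traversal(const double *edge2, int nbin, double d2) {
    if (d2 >= edge2[nbin]) return -1;
    for (int i = nbin - 1; i >= 0; i--)
        if (d2 >= edge2[i]) return i;
    return -1;                  /* below the lower bound of the bins */
}
```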
In principle, comparison-based methods can be further improved by taking advantage of the locality of separation values during the pair-counting process. This is particularly true for the tree structures discussed in Sect. 2, which group nearby data points together. In this case, the splay tree is a potentially useful data structure for histogram updates, with which frequently accessed bins can be visited more quickly (Sleator & Tarjan 1985). Nevertheless, the performance of comparison-based methods is limited by the N_bin dependences and unavoidable conditional branches, which are harmful to instruction-level parallelism on modern pipelined processors. Given also the high efficiency of the alternative algorithms introduced later, we did not implement a splay tree in this work.

Index-mapping functions
The pair separation histogram can be updated in constant time and without branches when it is possible to map squared distances directly onto the indices of the corresponding histogram bins. For evenly spaced bins on both linear and logarithmic scales, which are the most common configurations in practice, the index-mapping functions take simple forms. Thus, it is of practical interest to examine index-mapping algorithms for these specific cases.
For uniform linear separation bins in the range [s_min, s_max), the index of the bin for a given squared distance s^2 is

i = ⌊ (√(s^2) − s_min) / ∆_lin s ⌋ ,    (11)

where ∆_lin s indicates the width of the bins. Because the pair-counting process is independent of coordinate units, we can rescale the data-point coordinates and histogram bins by (1/∆_lin s) in advance to eliminate the division in Eq. (11), thus improving the overall efficiency of the pair-counting algorithm. Eventually, we need only three operations for the evaluation of the bin index of each valid pair: a square root, a subtraction, and a floor.
Similarly, the index for a squared distance s^2 with logarithmic bins in the range [s_min, s_max) can be obtained by

i = ⌊ ( (1/2) log s^2 − log s_min ) / ∆_log s ⌋ ,    (12)

where ∆_log s is the width of the bins on a logarithmic scale. Again, we can rescale all coordinates and histogram bins to further improve the efficiency. For instance, with a rescaling factor of (1/s_min), the (log s_min) term in Eq. (12) can be omitted. Then, if the factor (2∆_log s)^−1 is pre-computed, we end up with one logarithm, one multiplication, and one floor for the index mapping.
Although the complexity of index-mapping algorithms is only O(1), which outperforms that of comparison-based methods, the actual computing time largely depends on the efficiency of the index calculations. A considerable number of comparisons can be accomplished in the time taken by the evaluation of the logarithm in Eq. (12). Therefore, histogram update algorithms based on index-mapping functions are not necessarily faster than the methods described in Sect. 3.2.1, especially when N_bin is small. To make the constant-time complexity effective, we need index-mapping methods that are more efficient than direct function evaluations, not to mention the limited numerical precision of these functions.

Index lookup tables
A common way of accelerating the evaluation of a numerical function is to look up pre-computed values from a table. This technique can be very efficient if the domain of the function is discrete and reasonably small. In general, index-mapping functions for histogram updates do not fulfil this condition, as the squared distances can take any value inside [s^2_min, s^2_max). Nevertheless, when the edges of the histogram bins are integers, only the integer part of a squared distance determines the index of the histogram bin. In this case, we can create an index lookup table, the keys of which are the integer parts of all possible squared distance values. The index-mapping process can then be completed by truncating squared distances and looking up the indices in the table. Moreover, the efficiency of this method benefits from the data locality of the tree structures discussed previously, which reduces the cache-miss rate of the table lookup.
This method is also applicable when all the histogram bin edges can be converted into integers by a common rescaling factor, as it is permissible to rescale the histogram bins together with the coordinates of the data points. This is a common scenario in practice. For instance, given equally spaced separation bins with s_min = 0, the rescaling factor that converts all bin edges into integers is simply the inverse of the bin width. However, because the length of the lookup table is

N_table = ⌈s^2_max⌉ − ⌊s^2_min⌋ ,    (13)

the table may be too large to fit in the CPU caches when the (rescaled) distance range is wide. As a result, the lookup efficiency can be significantly downgraded due to the expensive memory accesses. One solution to this problem is to rescale the histogram bins by a factor that is smaller than 1. The bin edges are then not guaranteed to be integers, however. When we also consider cases in which the bin edges cannot all be converted into machine-representable integers, an index lookup algorithm that does not rely on integer bin edges is necessary.
For non-integer bin edges, we have to take care of noninjective lookup table entries.This is because squared distances belonging to different separation bins may share the same integer part.In this case, we can record the index ranges for noninjective entries and use a comparison-based method to further identify the exact index for a given squared distance.In practice, we used the reverse traversal algorithm (see Sect. 3.2.1)because it is simple.It is worth noting that this hybrid indexlookup method is able to deal with separation bins with arbitrary bin edges and widths as long as the bins are continuous.
The efficiency of this method depends on the rescaling factor of the histogram bins. When the factor is small, there is a higher chance of encountering non-injective table entries, which require further comparisons that are relatively slow. In contrast, large rescaling factors yield large tables that may increase the cache-miss rate. In principle, the optimal rescaling factor depends on the CPU cache sizes, and should be estimated through benchmarks.

Comparison of the histogram update algorithms
In order to compare the performance of the different histogram update algorithms discussed so far, and to choose the optimal table size for the hybrid index-lookup method, we performed a series of benchmark tests on Haswell CPUs with squared distance values sampled randomly following Appendix D. In particular, the squared distances were sampled in the range [0, 200^2) h^-2 Mpc^2. We examined both linear and logarithmic separation bins, which are the most commonly used binning schemes in practice, with s in the ranges [0, 200) and [0.1, 200) h^-1 Mpc, respectively. To inspect the N_bin dependences of the algorithms, we further tested two different numbers of histogram bins, 20 and 200, for both binning schemes. Some of the algorithms require rescaling of the squared distances, which can be achieved by pre-processing the coordinates of all data points in reality. The computational cost of this pre-processing step is O(N), which is generally much smaller than that of the histogram update process with O(N^2) pairs. Thus, the costs of the histogram update routines we report do not include those for rescaling separations. Given 6.4 × 10^9 random squared distances, the execution times of the hybrid index-lookup algorithm with different lookup table sizes and histogram bins are shown in Fig. 9. A table with ∼10^4 entries is almost always optimal, regardless of the separation bin configurations. The indices of histogram bins can be represented by 8- or 16-bit integers in most cases, and therefore, the memory cost of a table with ∼10^4 entries is about 10 to 20 KB, which fits in the level-1 (L1) cache of most modern CPUs for supercomputers. This explains the optimality of the table size. The optimal histogram update cost per squared distance value is about 2 ns for all the separation bin configurations in Fig. 9, which corresponds to only ∼5 Haswell CPU cycles, that is, slightly more than the four-cycle latency of L1 cache accesses (Fog 2022). This means that we achieve almost the maximum theoretical efficiency for histogram updates. Thus, we always chose separation rescaling factors that yielded N_table ∼ 10^4 for the hybrid index-lookup algorithm.

We furthermore compared the performance of the different histogram update algorithms with the same input squared separation sequences and histogram bins. The results are presented in Fig. 10. We did not use the hybrid method for the linear separation bins, because the bin edges are integers and lookup tables are directly applicable. In all cases, the execution time scales linearly with N_dist, the number of squared distances sampled. It is not surprising that the comparison-based methods, binary search and reverse traversal, are sensitive to both the binning scheme and the number of separation bins. The performance of the index-mapping algorithms also largely depends on the binning scheme, but not on N_bin. This can be explained by the different costs of the index-mapping functions. The linear index-mapping method expressed by Eq. (11) is faster than the comparison-based methods for the examined N_bin values, while the logarithmic mapping shown in Eq. (12) is generally less efficient, especially when compared to the reverse traversal algorithm. In contrast, the index-lookup algorithms are insensitive to the configurations of the separation bins, and they outperform all the other methods in all the cases presented here. The lookup cost for each squared distance value is always ∼2 ns on average.
We then implemented the index-lookup methods in FCFC for pair counting, because they are highly efficient and can deal with arbitrary separation bins. In particular, for linear separation bins, we computed the smallest positive factor that converts both edges of the first bin into integers. Given the separation ranges rescaled by this factor, if the N_table expressed by Eq. (13) is ≲3 × 10^4, we used the index-lookup table for integer bin edges directly. In all the other cases, where either the N_table computed in this way is too large or the separation bins are not evenly spaced, we relied on the hybrid index-lookup method with N_table ∼ 10^4. In order to eliminate potential numerical errors due to the rescaling, we always chose a rescaling factor that was a power of the radix used by the floating-point representation and that yielded a lookup table size closest to 10^4. In this case, the rescaling only changes the exponents of almost all floating-point numbers, so the mantissas were untouched and no additional numerical errors were introduced. Once the factor was chosen, we rescaled all histogram bins and the coordinates of the data points accordingly.

Parallelisation
Modern multi-core vector processors are able to run multiple independent instructions simultaneously on different pieces of data, and HPC clusters are usually equipped with hundreds or thousands of such CPUs. To make full use of these computing facilities, the computational task needs to be broken down into similar sub-tasks, and different levels of parallelism need to be exploited.

SIMD
Single instruction, multiple data (SIMD) refers to a type of data-level parallelism that permits operations on multiple data (i.e. a vector) with a single instruction. For instance, most modern x86 CPUs support advanced vector extensions (AVX), which provide 256-bit registers that allow eight single-precision or four double-precision floating-point numbers to be processed simultaneously. A number of CPUs also support advanced vector extensions 2 (AVX2), which is an extension of AVX with the same register width but more instructions, or even AVX-512, which permits 512-bit SIMD operations. AVX-512 consists of multiple extension sets. We focus on the AVX-512 foundation (AVX-512F) in this work, because it is available in all AVX-512 implementations and is sufficient for our application.
SIMD can potentially boost the performance of a pair-counting code because the distances between different pairs of data points can be evaluated at once, whereas they have to be processed for each individual pair with the conventional sequential (also known as scalar) approach. Thus, the traversal of points on pairs of tree nodes can be highly accelerated. In contrast, SIMD does not help the data-pruning process much, because the maintenance of the dual-node stack (see Sect. 3.1) cannot be parallelised with vector operations. In this case, larger tree nodes and fewer node comparisons are preferred for a better overall pair-counting efficiency, so the optimal leaf node capacity may change with different register widths.
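The vectorised distance evaluation can be sketched with AVX intrinsics as follows. This is a minimal illustration, not the FCFC implementation: the structure-of-arrays coordinate layout and the function name are assumptions, and an AVX-capable x86 CPU is required at run time.

```c
#include <assert.h>
#include <immintrin.h>

/* Squared distances between one point a and n points b (n a multiple of 4),
   four double-precision results per loop iteration. The target attribute
   lets the function compile without a global -mavx flag on GCC/Clang. */
__attribute__((target("avx")))
void dist2_avx(const double a[3], const double *bx, const double *by,
               const double *bz, double *d2, int n) {
    const __m256d ax = _mm256_set1_pd(a[0]);   /* broadcast coordinates */
    const __m256d ay = _mm256_set1_pd(a[1]);
    const __m256d az = _mm256_set1_pd(a[2]);
    for (int i = 0; i < n; i += 4) {
        const __m256d dx = _mm256_sub_pd(ax, _mm256_loadu_pd(bx + i));
        const __m256d dy = _mm256_sub_pd(ay, _mm256_loadu_pd(by + i));
        const __m256d dz = _mm256_sub_pd(az, _mm256_loadu_pd(bz + i));
        __m256d s = _mm256_mul_pd(dx, dx);
        s = _mm256_add_pd(s, _mm256_mul_pd(dy, dy));
        s = _mm256_add_pd(s, _mm256_mul_pd(dz, dz));
        _mm256_storeu_pd(d2 + i, s);           /* store four results at once */
    }
}
```

With FMA-capable SIMD implementations, the multiply-add chains above could be fused further, which is one possible source of the super-linear speedups discussed below.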
To explore the optimal n_leaf for the k-d tree and ball tree with AVX and AVX-512, we performed a new set of benchmarks with both the scalar and vectorised dual-tree algorithms on the Haswell and Knights Landing CPUs (see Appendix A). Because only the distance evaluations were vectorised here, which involve only elementary arithmetic, the speedups are expected to be very similar on newer CPUs. In particular, we checked both single- and double-precision arithmetic by using the float and double data types in the C programming language, as illustrated in Fig. 11. Similar to the tests in Sect. 2, we measured the execution time of the pair-counting algorithm, which reports the number of pairs with separations smaller than R_max for 4 × 10^6 uniformly distributed random points in a cubic box with a side length of L_box = 10 R_max. It has been shown previously that the optimal n_leaf is not sensitive to the specifications of the input samples; therefore, we did not vary the box size or the number of data points here. Figure 11 shows that with SIMD, n_leaf = 32 is near optimal for almost all cases. Moreover, when n_leaf ≳ 32, the theoretical maximum speedups with SIMD are generally achieved. For instance, AVX is able to process four double-precision numbers at once, and the actual speedups of the AVX-vectorised algorithms are indeed ∼4 with respect to the scalar counterparts. This implies that the efficiency of the dual-tree algorithm is not likely to be surpassed by a SIMD-vectorised pair-counting algorithm based on regular grids. The speedup can even be larger than the number of floating-point numbers processed simultaneously, which might be due to additional efficiency boosts from the fused multiply-add (FMA) instructions that are available with most modern SIMD implementations. It is less clear, however, whether our histogram update process benefits from SIMD. On one hand, the index-lookup methods are sufficiently fast for the access of CPU caches to have become the bottleneck (see the discussions in Sect. 3.2.4). On
the other hand, AVX does not provide instructions for reading lookup tables and maintaining histograms. In this case, only the floor operation can be vectorised, while the rest of the histogram update process has to be implemented in scalar code. The more recent AVX2 instruction set provides the gather operation, which loads multiple elements from non-contiguous memory locations and might be useful for loading lookup tables and histogram counts. However, there is still no instruction for the update of histograms with AVX2. Only with AVX-512 are both gather and scatter operations available, where scatter stores multiple data at different memory locations at once. Therefore, AVX-512 permits a full vectorisation of our histogram update algorithm.
We then vectorised the histogram update process with different SIMD instruction sets and performed benchmarks on a number of different CPUs, using 10^10 randomly generated squared separation sequences with the same binning schemes as in Sect. 3.2.4. For AVX-512, we maintained private histograms for individual vector elements to avoid conflicts, rather than relying on the conflict detection instructions (AVX-512CD). In this way, we eliminated the costs due to conflict detection and branching by trading off memory usage. The average processing time of each squared separation value, as well as the speedups of the vectorised versions with respect to the scalar counterparts, are illustrated in Fig. 12. The improvements with SIMD are almost always marginal, except for the Knights Landing CPU. This can be explained by the limits of the cache throughputs. After all, for most of the CPUs tested, the cost of processing one squared distance value is only a few nanoseconds with the scalar code. Moreover, with AVX, the main components of the histogram update algorithm are not vectorised. The inclusion of the gather instruction alone with AVX2 is harmful to the efficiency of index lookups, possibly because the algorithm is not fully vectorised, and there are additional micro-operations beyond the memory loads (see Chapter 15, Intel Corporation 2022). The index-lookup algorithm can be significantly accelerated by AVX-512 on the Knights Landing CPU, while for Cascade Lake, the performance of the vectorised and scalar codes is very similar. This shows that AVX-512 is only useful when the histogram update procedure is significantly slower than the latency of cache accesses. Given these benchmark results, FCFC makes use of gather only when scatter, that is AVX-512, is available. This does not mean that AVX2 is useless, as we benefit from the versatile vectorised integer arithmetic introduced by AVX2.
To further examine whether, or to which degree, SIMD is beneficial to the full pair-counting procedure, including both distance evaluations and histogram updates, we compared the entire runtime of the scalar and vectorised FCFC on different CPUs for auto pair counts upon periodic cubic random catalogues with a box size of 3 h^-1 Gpc, with 200 linear s bins in [0, 200) h^-1 Mpc and 120 µ bins in [0, 1), which is a common setting in practice. The results are presented in Fig. 13. We conclude that SIMD is generally useful, although the overall improvement can be marginal on certain CPUs. Thus, we always enabled SIMD parallelisation throughout this work.

OpenMP
Open multi-processing (OpenMP) is a high-level application programming interface (API) that provides a set of compiler directives, library routines, and environment variables for multi-thread parallelism with shared memory. It is usually possible to parallelise a program with high scalability using OpenMP, with little modification of the serial code. Therefore, multi-threading with OpenMP is generally easy to implement to take advantage of multi-core processors, and it is used extensively in cosmological applications, including pair-counting programs (e.g. Alonso 2012; Donoso 2019; Sinha & Garrison 2020). However, it is not trivial to parallelise our dual-tree algorithm (see Algorithm 3) with high scalability. The update of the dual-node stack has to be executed by one thread at a time to prevent race conditions, which may result in additional overhead. It is possible to reformulate the algorithm as a recursive function, but then additional costs accrue due to the recurrent function calls and the creation of threads for subtasks. One way to eliminate these expenses is to maintain a private stack on each thread. To this end, the dual-node stack has to be initialised with multiple elements that can be assigned to different threads and run independently. In this way, the initialisation and allocation of node pairs are crucial for the load balancing of the parallelised dual-tree algorithm. In principle, Algorithm 3 can be run with a single thread until the dual-node stack is sufficiently large, after which the node pairs can be distributed to different threads. However, with the depth-first tree traversal order, node pairs on the stack differ significantly in size, as the number of points on each node depends mainly on the level (or depth) of the node. In this case, the work loads of different threads are normally highly unbalanced, which is harmful to the efficiency of the parallelised program. To circumvent this problem, we relied on the breadth-first tree traversal order for the
initialisation of node pairs, which were then stored in a queue rather than a stack. Thus, after each iteration, the node pairs in the queue were all at the same level and consisted of similar numbers of data points. When the queue was large enough, we distributed the node pairs to different OpenMP threads. In general, the work loads were still not perfectly balanced, however, because the number of pairs within the query range can vary among different node pairs. It should be possible to further increase the performance of the parallelised dual-tree algorithm by using better scheduling strategies, such as the work-stealing technique (e.g. Blumofe & Leiserson 1999). For instance, the scaling efficiency of the 2PCF algorithm developed by Chhugani et al. (2012) is remarkable even with over 25 000 threads, although their algorithm is mainly useful for isotropic 2PCFs with a small number of separation bins, and it is therefore not general enough for actual cosmological applications. We leave the relevant investigations to future work.
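The thread-private accumulation strategy can be sketched as follows. This is an illustrative skeleton, not the FCFC code: for brevity, the work items here are squared distances binned by reverse traversal, whereas in FCFC they would be dual nodes from the breadth-first queue, each processed with a private stack. The pragmas are simply ignored, and the code stays correct, when compiled without OpenMP support.

```c
#include <assert.h>
#include <string.h>

#define MAX_NBIN 256

/* Each thread bins its share of the work into a private histogram, so no
   synchronisation is needed in the hot loop; the partial histograms are
   merged once at the end under a critical section. */
void hist_parallel(const double *d2, long n, const double *edge2, int nbin,
                   long *hist) {
    memset(hist, 0, nbin * sizeof(long));
    #pragma omp parallel
    {
        long priv[MAX_NBIN] = {0};          /* thread-private histogram */
        #pragma omp for schedule(dynamic, 1024)
        for (long i = 0; i < n; i++) {
            for (int k = nbin - 1; k >= 0; k--) {   /* reverse traversal */
                if (d2[i] >= edge2[k]) {
                    if (d2[i] < edge2[k + 1]) priv[k]++;
                    break;
                }
            }
        }
        #pragma omp critical                /* merge partial histograms */
        for (int k = 0; k < nbin; k++) hist[k] += priv[k];
    }
}
```

The dynamic schedule here mimics the dynamic assignment of queued node pairs to threads, which mitigates, but does not eliminate, the load imbalance discussed above.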
The performance of the OpenMP-parallelised FCFC on different CPUs is shown in Fig. 14. The benchmarks were performed with a periodic cubic random sample with 5 × 10^7 points and a box size of 3 h^-1 Gpc. Similar to the case in Sect. 4.1, we measured auto pair counts with 200 linear s bins in [0, 200) h^-1 Mpc and 120 µ bins in [0, 1). The speedup scales quite well with the number of threads when there are ≲32 OpenMP threads. With more threads, the speedups deviate significantly from the theoretical maximum values on both CPUs, possibly because of the non-negligible overheads of maintaining a large number of threads, as well as the imperfect work balancing. Nevertheless, the scalability of the OpenMP-parallelised FCFC is reasonably good, and it is therefore always recommended to enable OpenMP for pair-counting tasks with FCFC.

MPI
The message-passing interface (MPI) is a standard that defines a communication protocol for high-performance parallel computing on distributed memory systems. It permits multi-process programs that are able to make use of almost all computing resources of a cluster in principle. In practice, MPI is commonly used along with OpenMP. In this hybrid paradigm, MPI is typically used across the computing nodes or sockets of a cluster, while OpenMP is used within nodes or sockets to reduce the communication overhead and memory usage. Thus, a better scalability may be achieved than with pure MPI or OpenMP approaches.
Our parallelised dual-tree algorithm discussed in Sect. 4.2 is also applicable to MPI parallelism. After the queue with node pairs at the same tree level is created, bulks of tasks can be assigned to different MPI processes, and the breadth-first tree traversal procedure is then repeated on each process to generate subtasks for threads if OpenMP is enabled at the same time. In this way, the pair-counting routine is executed independently by the different processes, and no communication is needed. In practice, the trees and dual-node queues are constructed on a single process and broadcast, so that all processes maintain a local copy. The node pairs in the queue are then dynamically assigned to processes using the remote memory access (RMA) routines to achieve a better load balancing. Thus, the only additional steps for MPI compared to the OpenMP-parallelised version are the synchronisations of the trees and lookup tables among all processes, as well as the gathering of the pair-counting results at the end.
The performance of the MPI-parallelised FCFC with and without OpenMP is presented in Fig. 15 for auto pair counts upon periodic cubic random samples with 5 × 10^8 and 5 × 10^7 points, respectively, and the same box size and binning scheme as in Sect. 4.2. For FCFC with MPI but without OpenMP, the speedups scale well with the number of MPI processes on the Haswell and Knights Landing nodes. The trends are similar to those shown in Fig. 14, for which only OpenMP was enabled. This is expected, as we distributed the work loads in the same way. When OpenMP is enabled along with MPI and FCFC is run with the maximum available number of OpenMP threads (64 on Haswell and 272 on Knights Landing), the speedups are basically unchanged on Haswell, but there is a significant degradation of the efficiency when the number of processes is ≳8. This may be due to the fact that small imbalances of the work loads become critical with thousands of independent threads running simultaneously. As discussed in Sect. 4.2, we leave the exploration of a better work-load scheduler to a forthcoming paper.

Comparison with related work
To determine whether FCFC is useful in practice, it is important to run it with real-world applications and to compare its efficiency against related pair-counting tools, such as Corrfunc (Sinha & Garrison 2020). The most expensive tasks in reality are usually random-random pair counts. We therefore focused only on auto pair counts with random catalogues. Due to the differences in boundary periodicity and line of sight for pair counting with simulation and observational data (see Sect. 2.5), we examined two sets of randoms: (1) 5 × 10^8 uniformly distributed random points in a cubic box with a side length of 3 h^-1 Gpc, to mimic the random catalogue for a periodic simulation, and (2) the actual random samples for the BOSS DR12 data, with weights enabled for pair counting. These two random catalogues were further randomly down-sampled for benchmarks with smaller datasets. For all catalogues, we performed pair counts with a set of common binning schemes. The performance of FCFC and Corrfunc on the Haswell and Cascade Lake nodes with double-precision arithmetic is shown in Fig. 16. Here, we enabled OpenMP and SIMD for both codes, and ran them on an entire computing node with all available resources, that is, 64 threads with AVX2 on Haswell, and 36 threads with AVX-512 on Cascade Lake. The pair-counting results from the two codes are identical, and we therefore only compare their efficiencies here. FCFC is faster than Corrfunc in all cases. For logarithmic bins, the speedups of FCFC are relatively small, while with linear bins, especially when the bin counts are large, the speedups can be prominent. For the binning scheme (2), which is commonly used in practice for the ease of rebinning with different bin widths, FCFC can be five and ten times faster than Corrfunc with ≳10^8 objects on the Haswell and Cascade Lake CPUs, respectively. The speedups are generally consistent with those of the index-lookup algorithms for histogram update (see Sect.
3.2.4). Thus, we conclude that the high efficiency of FCFC is mainly due to the novel histogram update algorithm.

Conclusions
We have presented FCFC, a high-performance software package for exact pair counting. It is highly optimised for cosmological applications, but should be useful for calculations of all kinds of two-point correlation functions or radial distribution functions with 3D data. We mainly focused on the efficiency and scalability of the tool in this paper, but FCFC is also portable, flexible, user-friendly, and applicable to a number of different practical problems, such as the calculation of radial distribution functions in statistical mechanics. A brief guide to the toolkit can be found in Appendix E.
We have compared three different data structures for pair-counting applications, that is, regular grids, the k-d tree, and a novel variant of the ball tree. For the tree structures, we make use of an improved dual-tree algorithm for pair counting. We showed that the performance of regular grids is sensitive to the choice of grid size: with a sub-optimal grid size, the efficiency of the pair-counting procedure can be substantially degraded. In contrast, the tree-based methods are almost always optimal for a fixed capacity of leaf nodes, and thus there is no free parameter for the tree constructions. In particular, we set n_leaf = 8 for the scalar version of FCFC, while n_leaf = 32 was used for the SIMD-vectorised implementation. When the number of data points is sufficiently large, both trees outperform regular grids regardless of the grid size, but the improvements may be marginal for cosmological catalogues with 10^7-10^8 objects. As the efficiencies of the two tree structures are similar, we implemented both of them in FCFC.
We also introduced a new histogram update algorithm based on index-lookup tables to speed up the increment of separation bins for correlation functions. For non-integer bin edges, the lookup table is used together with a comparison-based reverse traversal algorithm to locate histogram bins. Thus, our index-lookup method is applicable to arbitrary binning schemes, including multi-dimensional bins for anisotropic measurements. According to comprehensive benchmarks with different practical binning schemes, the index-lookup method is considerably faster than the other commonly used histogram update algorithms in all cases.
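The combination of an index-lookup table with a reverse traversal can be sketched as follows. This is a simplified Python illustration, not FCFC's actual C implementation; the function names and table size are our own. A table over equally spaced cells of squared separation stores, for each cell, the bin containing the cell's upper end, so a lookup only needs a short downward walk over the squared bin edges to settle values near non-integer edges.

```python
import numpy as np

def build_table(edges2, size=1024):
    """For each of `size` equal cells covering [0, edges2[-1]), store the
    bin containing the cell's upper end (clipped to a valid bin index)."""
    hi = (np.arange(size) + 1) * edges2[-1] / size
    return np.minimum(np.searchsorted(edges2, hi, side='right') - 1,
                      len(edges2) - 2)

def find_bin(d2, edges2, table):
    """Locate the bin of a squared separation d2: one table lookup plus a
    short reverse traversal to handle arbitrary (non-integer) edges."""
    if d2 < edges2[0] or d2 >= edges2[-1]:
        return -1                                 # outside the histogram range
    t = min(int(d2 * len(table) / edges2[-1]), len(table) - 1)
    i = int(table[t])                             # candidate bin (upper bound)
    while d2 < edges2[i]:                         # walk down to the true bin
        i -= 1
    return i

# separation bins with non-integer edges; comparisons use squared edges,
# so no square root is needed per pair
edges2 = np.array([0.1, 1.7, 5.3, 12.0, 25.0]) ** 2
table = build_table(edges2)
```

Working with squared edges keeps the per-pair cost to a truncation, one table read, and typically zero or one comparison in the reverse traversal.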
We then parallelised FCFC at three common levels of parallelism, that is, SIMD on vector processors, shared-memory OpenMP, and distributed-memory MPI, with which it is in principle possible to make full use of all computing resources of a cluster. The key ingredient of FCFC, the index-lookup algorithm for separation bin updates, benefits only little from SIMD because the main bottleneck is likely to be the latency of CPU cache accesses. Nevertheless, the efficiency of FCFC scales well with the numbers of MPI processes and OpenMP threads, as long as the total number of threads does not exceed a few thousand. When the number of threads is too large, the performance gain from parallelisation may be degraded.
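A common shared-memory strategy for parallel histogramming is to give each thread a private histogram and merge them at the end, which avoids any locking on the bins. The sketch below illustrates this pattern in Python (with a thread pool standing in for OpenMP threads; the function names are our own and this is not FCFC's implementation):

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def count_chunk(d2_chunk, edges2):
    """Each worker fills its own private histogram, so no locks are needed."""
    hist, _ = np.histogram(d2_chunk, bins=edges2)
    return hist

def parallel_histogram(d2, edges2, n_workers=4):
    """Split the squared separations into chunks, histogram each chunk in a
    worker, and merge the partial histograms (like an OpenMP reduction)."""
    chunks = np.array_split(d2, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = pool.map(count_chunk, chunks, [edges2] * n_workers)
    return sum(partials)
```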
Finally, we compared the OpenMP- and SIMD-parallelised FCFC and Corrfunc with the same computing resources, input catalogues, and binning schemes for pair counting. We find that FCFC is faster than Corrfunc in all the cases tested. The speedup is most prominent for a large number of linear separation bins: FCFC can be more than ten times faster than Corrfunc on modern AVX-512 CPUs for catalogues containing ∼10^8 objects, for pair counting with 200 linear s bins and 120 µ bins, which is a common setting for 2PCF calculations in practice. Thus, FCFC is a very promising tool for modern and future cosmological clustering measurements.
In the future, we will extend our methods to more cosmological applications, such as angular and higher-order clustering statistics, in particular three- and four-point correlation functions. Approximate methods will also be explored to further speed up the measurements with tolerable errors. Moreover, we will implement more advanced load-balancing schemes to further improve the scalability of FCFC, and hopefully make use of GPU acceleration.

Fig. 1. Architecture of the data structures implemented in this work.

Fig. 2. Illustration of an isotropic range search using regular grids with different cell sizes. The points in panels a and b indicate a randomly generated dataset in 2D, with the current reference point marked in red. Yellow areas denote cells that are visited, given the query range indicated by red circles. Panels c and d show the arrangements of data points in memory for the column-major grid configurations in panels a and b, respectively. Pink regions indicate data points that are visited during the range search process. The cell sizes are chosen for illustration purposes and are not meant to be optimal.

Fig. 3. Execution time of the grid-based pair-counting routine with different cell sizes and query ranges for periodic uniform random samples with different cubic box sizes and numbers of points. The solid lines show the best-fitting theoretical results detailed in Appendix C.

Fig. 4. Illustration of an isotropic range search using the k-d tree. Panels a, b, and c show the partitions of the data points (black dots) during the k-d tree construction, the resulting minimum axis-aligned bounding boxes of leaf nodes and their parents, and the diagram of the tree structure, respectively. In particular, non-leaf nodes in panel c are indicated by the corresponding dividing lines in panel a. The red point and circle in panel b indicate the reference point and radius for a range search. The grey regions in panel c and the yellow areas in panels b and c highlight the visited non-leaf and leaf nodes during this query. The retrieved data points are shown in pink in panel d.

Fig. 5. Execution time of the pair-counting routine based on the k-d tree with different capacities of leaf nodes for periodic uniform random samples with different cubic box sizes and numbers of points. The solid lines show the best-fitting theoretical results detailed in Appendix C.

Fig. 6. Same as Fig. 4, but for a variant of the ball tree. Panel a shows the partition lines based on the PCA, and panel b shows the resulting minimum bounding spheres of leaf nodes and their parents.

Fig. 7. Execution time of the pair-counting routine based on the ball tree with different capacities of leaf nodes for periodic uniform random samples with different cubic box sizes and numbers of points. The solid lines show the best-fitting theoretical results detailed in Appendix C.

Fig. 8. Comparisons of the computational costs of pair-counting routines based on different data structures for periodic uniform random samples with different numbers of points in a cubic volume (left), and for sections between R_in and R_out of the same catalogues with respect to a corner of the boxes, where R_out is equal to the box size (right). The middle and bottom panels show results normalised by the costs with the k-d tree and by kN^2, respectively. Here, the value of k is chosen such that the normalised costs at N = 5 × 10^7 are unity.

Fig. 10. Execution times of various histogram update algorithms for different histogram bins and numbers of squared distances sampled in the range [0, 200^2) h^-2 Mpc^2. The separation ranges of the linear and logarithmic bins are [0, 200) and [0.1, 200) h^-1 Mpc, respectively. Two different numbers of bins are also tested. The solid lines show the best-fitting straight lines with a constant execution time per squared distance.

Fig. 11. Execution time of the scalar, AVX-vectorised, and AVX-512-vectorised pair-counting routines based on the k-d and ball trees, with different capacities of leaf nodes, for a periodic uniform random sample. Results for both Haswell and Knights Landing CPUs are shown. The speedup is measured as the ratio of the computing time of the scalar code to that of the vectorised counterpart.

Fig. 12. Execution time of the histogram update algorithms measured upon 10^10 random squared distances, and the speedups of the vectorised algorithms with respect to the scalar counterparts on different CPUs with different histogram bin settings and precisions of floating-point numbers.

Fig. 13. Execution time of the scalar and vectorised versions of FCFC on different CPUs (bars) for the full pair-counting procedure with 200 linear s bins in [0, 200) h^-1 Mpc and 120 µ bins in [0, 1), run on periodic random samples in a cubic box with a side length of 3 h^-1 Gpc. The purple lines indicate the speedups of the SIMD-parallelised versions with respect to the scalar counterparts. AVX2* denotes AVX2 without the gather instructions.

Fig. 14. Speedups of the OpenMP-parallelised FCFC with respect to the serial version on Haswell and Knights Landing CPUs with different numbers of OpenMP threads, measured using a periodic cubic random catalogue with N = 5 × 10^7, L_box = 3 h^-1 Gpc, and with 200 linear s bins in [0, 200) h^-1 Mpc and 120 µ bins in [0, 1). The dashed line denotes the theoretical maximum speedup. SIMD is enabled in all cases.

Fig. 16. Performance of FCFC and Corrfunc for auto pair counts with periodic and survey-like random samples with different numbers of points. Both OpenMP and SIMD parallelisms are enabled. The codes are run on entire nodes with all cores of the Haswell and Cascade Lake CPUs.
Fig. C.1. Grid cells to be visited (coloured areas) for a reference cell (black square) and an isotropic query range with a radius of R_max. Yellow regions indicate cells that are entirely inside the query range, and pink zones denote cells that intersect the boundary of the query range, which is shown in red. The side length of every cell is denoted by L_cell.
Speedups of the MPI-parallelised FCFC with respect to the version without MPI on Haswell and Knights Landing CPUs with different numbers of MPI processes, measured using periodic cubic random catalogues with N = 5 × 10^7 (left) and 5 × 10^8 (right), L_box = 3 h^-1 Gpc, and with 200 linear s bins in [0, 200) h^-1 Mpc and 120 µ bins in [0, 1). The dashed lines indicate the theoretical maximum speedup. The left panel shows results without OpenMP, and the right panel presents results with the maximum available numbers of OpenMP threads. SIMD is enabled in all cases.
Table A.1. Specifications of CPUs on the computing nodes used for the benchmarks.

PAIRCOUNT_GRID(C, S, H)
Input: a list C with all grid cells, the separation range S of interest, and the histogram H for storing pair counts.
1: for all non-empty cells c ∈ C do