Issue
A&A
Volume 639, July 2020
Article Number A91
Number of page(s) 18
Section Cosmology (including clusters of galaxies)
DOI https://doi.org/10.1051/0004-6361/202037995
Published online 14 July 2020

© ESO 2020

1. Introduction

We live in the age of large astronomical surveys. These surveys detect and record tracers of cosmic structure across vast volumes of the Universe, using electromagnetic and gravitational waves. A non-exhaustive list includes optical and infrared imaging and spectroscopic surveys such as LSST (LSST Science Collaboration 2012), Euclid (Laureijs et al. 2011), DESI (DESI Collaboration 2016), and SPHEREx (SPHEREx Science Team 2018); catalogues and intensity maps from large radio surveys such as the Square Kilometre Array (Square Kilometre Array Cosmology Science Working Group 2018) and its precursors; cluster catalogues from high-resolution observations of the microwave sky (Advanced ACTPol, Simon et al. 2018; SPTPol, Austermann et al. 2012; Simons Observatory, Simons Observatory Collaboration 2019, and CMB-S4); X-ray surveys such as the eROSITA mission (Merloni et al. 2012); as well as gravitational wave sirens across cosmological volumes with successive updates of (Advanced) LIGO (LIGO Scientific Collaboration 2015), Virgo (The Virgo Collaboration 2020) and LISA (Barausse et al. 2020). Whilst these data sets will be prodigious sources of scientific discovery across astrophysics, their enormous volume and dense sampling of cosmic structure will make them uniquely powerful when studying some of the deepest scientific mysteries of our time: the statistical properties of the primordial perturbations, the nature of dark matter, and the physical properties of dark energy. Indeed, many of these surveys were conceived to address these questions.

Fulfilling this promise requires the ability to model these surveys in sufficient detail and with sufficient accuracy. All but the most simplistic models require the production of cosmological light-cone simulations. In particular, cosmological inferences often rely on large numbers of mock catalogues, which are used to construct unbiased estimators and study their statistical properties, such as covariance matrices. As surveys are getting deeper, these mock catalogues now need to represent a sizeable portion of the observable Universe, up to a redshift of ∼2–3 (e.g. z = 2.3 for the Euclid Flagship simulation¹). Unfortunately, cosmological simulations put a heavy load on supercomputers. Even if only dark matter is included and resolution is minimised, they can require millions of CPU hours and hundreds of terabytes of disk space to solve the gravitational evolution of billions of particles and store the corresponding data. For instance, the DEUS-FUR simulation (Alimi et al. 2012), containing 8192³ particles in a box of 21 Gpc h⁻¹ side length, required 10 million hours of CPU time and 300 TB of storage.

While computational needs are soaring, the performance of individual compute cores reached a plateau around 2015. Traditional hardware architectures are reaching their physical limit. Therefore, cosmological simulations cannot merely rely on processors becoming faster to reduce the computational time. Current hardware development focuses on increasing power efficiency² and solving problems of heat dissipation to allow packing a larger number of cores into each CPU. As a consequence, the performance gains of the world’s top supercomputers are the result of a massive increase in the number of parallel cores, currently³ to 𝒪(10⁵), and soon to 𝒪(10⁶–10⁷) in systems that are currently being built⁴. Hybrid architectures, where CPUs work alongside GPUs and/or reconfigurable chips such as FPGAs, add to the massive parallelism. In the exa-scale world, raw compute cycles are no longer the scarce resource. The challenge is to access the available computational power when Amdahl’s law demonstrates that communication latencies kill the potential gains due to parallelisation (Amdahl 1967).

A way to embed high-resolution simulation of objects such as galaxy clusters, or even galaxies, in a cosmological context is through the use of varying particle mass resolution and the adaptive mesh refinement technique (AMR, Berger & Colella 1989). AMR is widely employed in grid-based simulation codes such as RAMSES (Teyssier 2002), ENZO (Bryan et al. 2014), FLASH (Fryxell et al. 2000), and AMIGA (Knebe & Doumler 2010). It is also used in MUSIC (Hahn & Abel 2011) to generate zoom-in initial conditions for simulations. The AMR technique, which uses multi-grid relaxation methods (e.g. Guillet & Teyssier 2011), allows focusing the effort on a specific region of the computational domain, but requires a two-way flow of information between small and large scales. More recently, leading computational cosmology groups have been developing sophisticated schemes to leverage parallel and hybrid computing architectures (Gonnet et al. 2013; Theuns et al. 2015; Aubert et al. 2015; Ocvirk et al. 2016; Potter et al. 2017; Yu et al. 2018; Garrison et al. 2019; Cheng et al. 2020).

Full simulations of large cosmological volumes, even limited to cold dark matter and at coarse resolution, involve multiple challenges. One of the main issues preventing their easy parallelisation is the long-range nature of gravitational interactions, which forestalls high-resolution, large-volume cosmological simulations. As a response, much of the classical work in numerical cosmology focused on computational algorithms (tree codes, fast multipole methods, particle-mesh methods, and hybrids such as particle-particle–particle-mesh and tree–particle-mesh) that reduced the need for 𝒪(N²) all-to-all communications between N particles across the full computational volume.

While these algorithms are and remain the backbone of computational cosmology, they fail to fully exploit the physical scale hierarchy of cosmological perturbations. This hierarchy was first used to push the results of N-body simulations to Universe scale for cosmic velocity fields (Strauss et al. 1995). At the largest scales, the dynamics of the Universe is not complicated and, in particular, is well captured by Lagrangian Perturbation Theory (LPT; see Bouchet et al. 1995). Building upon this view, Tassev et al. (2015) introduced spatial COmoving Lagrangian Acceleration (sCOLA). This algorithm, using a hybrid analytical and numerical treatment of particles’ trajectories, allows one to perform simulations without the need to substantially extend the simulated volume beyond the region of interest in order to capture far-field effects, such as density fluctuations due to super-box modes. The sCOLA proof-of-concept focused on one sub-box embedded into a larger simulation box.

In this paper, we extend the sCOLA algorithm and use it within a novel method for perfectly parallel cosmological simulations. To do so, we rely on a tiling of the full cosmological volume to be simulated, where each tile is evolved independently using sCOLA. The principal challenge for the accuracy of such simulations is the choice of boundary conditions used throughout the evolution of tiles, which can introduce artefacts. In this respect, we introduce three crucial improvements with respect to Tassev et al. (2015): the use of a buffer region around each tile, the use of exact boundary conditions in the calculation of LPT displacements (which has the side benefit of reducing memory requirements), and the use of a Poisson solver with Dirichlet boundary conditions meant to approximate the exact gravitational potential around sCOLA boxes. The method proposed in this work shares similar goals with zoom-in simulation techniques, the main difference residing in the change of frame of reference introduced in sCOLA, which accounts for the dynamics of large scales without requiring flows of information during the evolution. On the other hand, our method is independent of the N-body integrator used to calculate the numerical part of particles’ trajectories within each sCOLA box, and therefore, it cannot be related to specific approaches to do so, such as force-splitting. It is slightly approximate and more CPU-expensive than the corresponding “monolithic” simulation technique (chosen in this paper as tCOLA, Tassev et al. 2013), but has the essential advantage of perfect scalability. This scalability comes from the removal of any kind of communication among tiles after the initialisation of the simulation. As a consequence, for its major part, the degree of parallelism of the algorithm equals the number of tiles, which means that the workload is perfectly parallel (also called embarrassingly parallel). This property can be exploited to produce cosmological simulations in very short wall-clock times on a variety of hardware architectures, as we discuss in this paper.

After reviewing Lagrangian Perturbation Theory and its use within numerical simulations in Sect. 2, we describe our algorithm for perfectly parallel cosmological simulations in Sect. 3. In Sect. 4, we test the accuracy and speed of the algorithm with respect to reference simulations that do not use the tiling. We discuss the implications of our results for computational strategies to model cosmic structure formation, and conclude, in Sect. 5. Details regarding the implementation are provided in the appendices.

2. Cosmological simulations using Lagrangian perturbation theory

Throughout this section we denote by a the scale factor of the Universe. For simplicity, some of the equations are abridged. We reintroduce the omitted constants, temporal prefactors, and Hubble expansion in Appendix A.

Particle simulators are algorithms that compute the final position x and momentum p ≡ dx/da of a set of particles, given some initial conditions. They can also be seen as algorithms that compute a displacement field Ψ, which maps the initial (Lagrangian) position q of each particle to its final (Eulerian) position x, according to the classic equation (see e.g. Bernardeau et al. 2002, for a review)

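\begin{equation}
\mathbf{x}(\mathbf{q}, a) = \mathbf{q} + \boldsymbol{\Psi}(\mathbf{q}, a). \tag{1}
\end{equation}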

With this point of view, the outputs are x and p  =  ∂Ψ/∂a.

2.1. Lagrangian perturbation theory (LPT)

In Lagrangian perturbation theory (LPT), the displacement field is given by an analytic equation which is used to move particles, without the need for a numerical solver. At second order in LPT, the displacement field is written

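\begin{equation}
\boldsymbol{\Psi}_{\mathrm{LPT}}(\mathbf{q}, a) = \boldsymbol{\Psi}^{(1)}(\mathbf{q}, a) + \boldsymbol{\Psi}^{(2)}(\mathbf{q}, a), \tag{2}
\end{equation}

where each of the terms is separable into a temporal and a spatial contribution deriving from a Lagrangian potential:

\begin{equation}
\boldsymbol{\Psi}^{(1)}(\mathbf{q}, a) = -D_1(a)\, \nabla_{\mathbf{q}} \phi^{(1)}(\mathbf{q}) \equiv -D_1(a)\, \boldsymbol{\Psi}_1(\mathbf{q}), \tag{3}
\end{equation}

\begin{equation}
\boldsymbol{\Psi}^{(2)}(\mathbf{q}, a) = D_2(a)\, \nabla_{\mathbf{q}} \phi^{(2)}(\mathbf{q}) \equiv D_2(a)\, \boldsymbol{\Psi}_2(\mathbf{q}). \tag{4}
\end{equation}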

In Eqs. (3) and (4), D1 and D2 are the growth factor and second-order growth factor, respectively. The Lagrangian potentials obey Poisson-like equations (Buchert et al. 1994):

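\begin{equation}
\Delta_{\mathbf{q}} \phi^{(1)}(\mathbf{q}) = \delta_{\mathrm{i}}(\mathbf{q}), \tag{5}
\end{equation}

\begin{equation}
\Delta_{\mathbf{q}} \phi^{(2)}(\mathbf{q}) = \sum_{i > j} \left[ \phi^{(1)}_{,ii}\, \phi^{(1)}_{,jj} - \left( \phi^{(1)}_{,ij} \right)^2 \right], \tag{6}
\end{equation}

where δi(q) is the density contrast in the initial conditions, in Lagrangian coordinates, and the $\phi^{(1)}_{,ij}$ are spatial second derivatives of ϕ(1), i.e. $\phi^{(1)}_{,ij} \equiv \partial^2 \phi^{(1)} / \partial q_i \partial q_j$.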

If only the first-order term is included in Eq. (2), the solution is known as the Zel’dovich approximation (Zel’dovich 1970).
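As an illustration of how LPT moves particles without a numerical solver, the following minimal sketch applies the Zel’dovich approximation on a one-dimensional periodic lattice. It assumes a single-mode initial density contrast δi(q) = A cos(kq), the sign convention in which the first-order displacement is −D1 times the gradient of ϕ(1), and hypothetical function and parameter names; it is not taken from any published code.

```python
import math

def zeldovich_1d(n_particles, box_size, amplitude, d1):
    """Displace particles on a 1D periodic lattice (Zel'dovich approximation).

    Assumes delta_i(q) = amplitude * cos(k q) with k = 2 pi / box_size, so that
    phi1(q) = -amplitude * cos(k q) / k**2 solves d2(phi1)/dq2 = delta_i.
    """
    k = 2.0 * math.pi / box_size
    positions = []
    for i in range(n_particles):
        q = i * box_size / n_particles          # Lagrangian lattice node
        psi1 = amplitude * math.sin(k * q) / k  # Psi_1 = d(phi1)/dq
        x = q - d1 * psi1                       # x = q + Psi, with Psi = -D1 * Psi_1
        positions.append(x % box_size)          # periodic wrapping
    return positions
```

With `d1 = 0` the particles stay on the lattice; as `d1` grows they drift towards the density peak at q = 0 (equivalently q = L), as expected for growing-mode infall.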

2.2. Temporal comoving Lagrangian acceleration (tCOLA)

In contrast to the analytical equations of LPT, particle-mesh (PM) codes (see e.g. Klypin & Holtzman 1997) provide a fully numerical solution to the problem of large-scale structure formation. The equation of motion to be solved in a PM code reads schematically

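\begin{equation}
\frac{\mathrm{d}^2 \mathbf{x}}{\mathrm{d}a^2} = -\nabla_{\mathbf{x}} \Phi, \tag{7}
\end{equation}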

where the gravitational potential Φ satisfies the Poisson equation,

\begin{equation}
\Delta_{\mathbf{x}} \Phi = \delta(\mathbf{x}, a). \tag{8}
\end{equation}

Here, δ(x, a) is the density contrast at a scale factor a, which is obtained from the set of particles’ positions {x(a)} through a density assignment operator that we denote B (typically a cloud-in-cell (CiC) scheme, see Hockney & Eastwood 1981):

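\begin{equation}
\delta(\mathbf{x}, a) = \mathrm{B}\left( \{\mathbf{x}(a)\} \right). \tag{9}
\end{equation}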

We denote by the corresponding interpolation operator, which is needed to obtain the accelerations of particles given the acceleration field on the grid:

(10)

The temporal COmoving Lagrangian Acceleration (tCOLA) algorithm seeks to decouple large and small scales by evolving large scales using analytic LPT results, and small scales using a numerical solver. This is achieved by splitting the Lagrangian displacement field into two contributions (Tassev & Zaldarriaga 2012):

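\begin{equation}
\boldsymbol{\Psi}(\mathbf{q}, a) = \boldsymbol{\Psi}_{\mathrm{LPT}}(\mathbf{q}, a) + \boldsymbol{\Psi}_{\mathrm{res}}(\mathbf{q}, a), \tag{11}
\end{equation}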

where ΨLPT(q, a) is the LPT displacement field discussed in Sect. 2.1 and Ψres(q, a) is the residual displacement of each particle, as measured in a frame comoving with an “LPT observer”, whose trajectory is given by ΨLPT(q, a). Using Eq. (11), it is possible to rewrite Eq. (7) as

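\begin{equation}
\frac{\partial^2 \boldsymbol{\Psi}_{\mathrm{res}}}{\partial a^2}(\mathbf{q}, a) = -\nabla_{\mathbf{x}} \Phi - \frac{\partial^2 \boldsymbol{\Psi}_{\mathrm{LPT}}}{\partial a^2}(\mathbf{q}, a). \tag{12}
\end{equation}

The term $-\partial^2 \boldsymbol{\Psi}_{\mathrm{LPT}} / \partial a^2$ can be thought of as a fictitious force acting on particles, caused by our use of a non-inertial frame of reference. Importantly, it can be computed analytically given the equations of Lagrangian perturbation theory.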

The equations of motion (7) and (12) are usually integrated by the use of time-stepping techniques (see Appendix B). In the limit of zero time-steps used to discretise the left-hand side of Eq. (12), Ψres = 0 and tCOLA recovers the results of LPT; therefore, tCOLA always solves the large scales with an accuracy of at least that of LPT. In contrast, PM codes require many time-steps in Eq. (7) just to recover the value of the linear growth factor D1. In the limit where the number of time-steps becomes large, tCOLA reduces to a standard PM code. In the intermediate regime (for 𝒪(10) time-steps), tCOLA provides a good approximation to large-scale structure formation, at the expense of not solving the details of particle trajectories in deeply non-linear halos (see Tassev et al. 2013; Howlett et al. 2015; Leclercq et al. 2015; Koda et al. 2016; Izard et al. 2016, for further discussion). Since, by construction, tCOLA always gets the large scales correct, contrary to a PM code, the trade-off between speed and accuracy only affects small scales.

2.3. Spatial comoving Lagrangian acceleration (sCOLA)

During large-scale structure formation, non-linearities appear at late times and/or at small scales. tCOLA (Eq. (12)) decouples LPT displacements and residual non-linear contributions “in time”, so that, for a given accuracy, fewer time-steps are required to solve large-scale structure evolution than with a PM code. Following a similar spirit, the spatial COmoving Lagrangian Acceleration (sCOLA) framework decouples LPT displacements and residual non-linear contributions “in space”, so that numerically evolved small scales can feel far-field effects captured analytically via LPT.

More specifically, for each particle in a volume of interest (the “sCOLA box”) embedded in a larger cosmological volume (the “full box”), the equation of motion of particles, which reads for a traditional N-body problem

(13)

is replaced by

(14)

Here, $\boldsymbol{\Psi}_{\mathrm{res}}(\mathbf{q}, a)$ is defined by Eq. (11) as the residual displacement with respect to the LPT observer of the full box, whose trajectory is given by ΨLPT(q, a). In Eq. (14), $\boldsymbol{\Psi}_{\mathrm{LPT}}^{\mathrm{sCOLA}}(\mathbf{q}, a)$ is the trajectory prescribed by solving LPT equations (see Sect. 2.1) in the sCOLA box. Note that $\boldsymbol{\Psi}_{\mathrm{LPT}}^{\mathrm{sCOLA}}(\mathbf{q}, a)$ may differ from ΨLPT(q, a), depending on the assumptions made for the boundary conditions of the sCOLA box, discussed in Sect. 3.3. Denoting by 𝒮 ⊆ ⟦1, N⟧ the set of particles in the sCOLA box, the gravitational force, which in Eq. (13) reads

(15)

is replaced by

(16)

It is possible to evaluate FsCOLA(x, a), and thus to solve Eq. (14), like Eq. (13), using any numerical gravity solver, such as particle-particle–particle-mesh, tree codes, or AMR. In this paper, we choose to focus on evaluating forces via a PM scheme. In this case, the equation of motion of particles in sCOLA reads schematically (Tassev et al. 2015)

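\begin{equation}
\frac{\partial^2 \boldsymbol{\Psi}_{\mathrm{res}}}{\partial a^2}(\mathbf{q}, a) = -\nabla_{\mathbf{x}}^{\mathrm{sCOLA}} \Phi^{\mathrm{sCOLA}}(\mathbf{x}, a) - \frac{\partial^2 \boldsymbol{\Psi}_{\mathrm{LPT}}^{\mathrm{sCOLA}}}{\partial a^2}(\mathbf{q}, a). \tag{17}
\end{equation}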

The gravitational potential in the sCOLA box, ΦsCOLA(x, a), obeys the near-field version of the Poisson equation,

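\begin{equation}
\Delta_{\mathbf{x}}^{\mathrm{sCOLA}} \Phi^{\mathrm{sCOLA}}(\mathbf{x}, a) = \delta^{\mathrm{sCOLA}}(\mathbf{x}, a). \tag{18}
\end{equation}

The superscript “sCOLA” over the gradient and Laplacian operators, $\nabla_{\mathbf{x}}^{\mathrm{sCOLA}}$ and $\Delta_{\mathbf{x}}^{\mathrm{sCOLA}}$, means that they are restricted to the sCOLA box (contrary to those of Eqs. (8) and (12)). Over the density contrast δsCOLA(x, a), the superscript means that only particles in the sCOLA box {x(a)}sCOLA ≡ {xi(a)}i ∈ 𝒮 (instead of the full box) are used within the density assignment BsCOLA, i.e.

\begin{equation}
\delta^{\mathrm{sCOLA}}(\mathbf{x}, a) = \mathrm{B}^{\mathrm{sCOLA}}\left( \{\mathbf{x}(a)\}^{\mathrm{sCOLA}} \right). \tag{19}
\end{equation}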

Contrary to tCOLA, which is an exact rewriting of the equations of motion of a PM code, sCOLA potentially involves approximations for the calculation of each quantity and operator with a superscript “sCOLA” instead of its full box equivalent. As a proof of concept, Tassev et al. (2015) showed that under certain circumstances, sCOLA provides a good approximation for the evolution of one sCOLA box embedded into a larger full box. As discussed in the introduction, we aim at generalising this result by using sCOLA within multiple sub-volumes of a full simulation box.

3. Algorithm for perfectly parallel simulations using sCOLA

In this section, we describe an algorithm for cosmological simulations using sCOLA, for which the time evolution of independent Lagrangian sub-volumes is perfectly parallel, without any communication. A functional block diagram representing the main steps and their dependencies is given in Fig. 1. An illustration of the different grids appearing in the algorithm is presented in Fig. 2, and Table 1 provides the nomenclature of some of the different variables appearing in this section.

Fig. 1.

Functional diagram of sCOLA (left) versus tCOLA (right). The grey boxes are common steps. sCOLA specific steps are represented in blue, and tCOLA specific steps in red. The yellow rectangle constitutes the perfectly parallel section, within which no communication is required with the master process or between processes. Arrows represent dependencies, and references to the main text are given between parentheses.


Table 1.

Nomenclature of symbols used in the present article.

We work in a cubic full box of side length L with periodic boundary conditions, populated by particles initially at the nodes {q} of a regular Lagrangian lattice. We seek to compute the set of final positions {x(af)} and momenta {p(af)} at final scale factor af. The model equations are reviewed in Appendix A. The time-stepping of these equations consists of a series of “kick” and “drift” operations and is discussed in Appendix B.

We approximate the Laplacians Δx, Δq and gradient operators ∇x, ∇q by finite difference approximation (FDA) at order 2, 4, or 6. The coefficients of the finite difference stencils in configuration and in Fourier space are given, for example, in Table 1 of Hahn & Abel (2011). We set Nghost = 1, 2, or 3 if the FDA is taken at order 2, 4, or 6, respectively.
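To make the link between FDA order and Nghost concrete, here is a minimal one-dimensional sketch of central-difference gradients. The coefficients are the standard textbook central-difference values (not copied from Hahn & Abel 2011), and the names `GRADIENT_STENCILS`, `n_ghost`, and `gradient` are hypothetical.

```python
import math

# Central-difference coefficients c_1..c_n for the first derivative:
# f'(x) ~ sum_j c_j * (f(x + j h) - f(x - j h)) / h.
# Standard textbook values; an order-2n stencil reaches n cells on each side.
GRADIENT_STENCILS = {
    2: (1.0 / 2.0,),
    4: (2.0 / 3.0, -1.0 / 12.0),
    6: (3.0 / 4.0, -3.0 / 20.0, 1.0 / 60.0),
}

def n_ghost(order):
    """Number of ghost-cell layers needed on each side (1, 2, or 3)."""
    return len(GRADIENT_STENCILS[order])

def gradient(f, x, h, order):
    """Finite-difference approximation of f'(x) at the given order."""
    coeffs = GRADIENT_STENCILS[order]
    return sum(c * (f(x + j * h) - f(x - j * h))
               for j, c in enumerate(coeffs, start=1)) / h
```

For example, `gradient(math.sin, 1.0, 0.1, order)` approaches `math.cos(1.0)` more closely as the order increases, at the price of a wider stencil and hence more ghost cells.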

3.1. Initial conditions and Lagrangian potentials

Before the perfectly parallel section, two initialisation steps are performed by the master process in the full box.

A.1. The first step is to generate the initial density contrast δi in the full box, on a cubic grid of N³ cells (the “LPT grid”, represented in red in the left panel of Fig. 2). This step can be done via the standard convolution approach (e.g. Hockney & Eastwood 1981), given the specified initial power spectrum.

Fig. 2.

Illustration of the different grids used within sCOLA. The Lagrangian lattice is represented by dashed lines. For each tile, central particles (in black) are surrounded by buffer particles (in cyan), which are ignored at the end of the evolution. The corresponding buffer region in other grids is represented in cyan. The left panel represents the “LPT grid” on which the Lagrangian potentials ϕ(1) and ϕ(2) are defined. The central region has NsCOLA³ grid points (in red) and is padded by 2Nghost cells in each direction (pink region). The right panel shows the “PM grid” on which the density contrast δsCOLA, the gravitational potential ΦsCOLA, and the accelerations are defined. The density contrast is defined only in the central region (which has Ng³ grid points, in dark green). The gravitational potential is padded by 2Nghost cells in each direction (light green and yellow regions), and the gridded accelerations only by Nghost cells in each direction (yellow region). Solving the Poisson equation requires Dirichlet boundary conditions in six layers of Nghost cells, denoted as hatched regions. For simplicity of representation, we have used here Nghost = 1.


A.2. The second step is to compute the Lagrangian potentials ϕ(1)(q) and ϕ(2)(q) on the LPT grid in the full box, which is achieved by solving Eqs. (5) and (6).

If initial phases are generated in Fourier space, the Zel’dovich approximation (i.e. the calculation of ϕ(1)) requires only one inverse fast Fourier transform (FFT) on the LPT grid. For the second-order potential, the source term on the right-hand side of Eq. (6) has to be computed from ϕ(1); this can either be done in Fourier space (for a cost of six inverse FFTs) or in configuration space via finite differencing (for a cost of nine one-dimensional gradient operations). In both cases, the calculation of ϕ(2) from its source then requires one forward and one inverse FFT.
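The Fourier-space solution of a Poisson-like equation such as Eq. (5) can be sketched in one dimension as follows. This is only an illustration under stated assumptions: a naive 𝒪(N²) discrete Fourier transform stands in for an FFT library, the continuous kernel −1/k² is used rather than the Fourier-space finite-difference kernel, and the function names are hypothetical.

```python
import cmath
import math

def dft(values, inverse=False):
    """Naive O(N^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(values)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        s = sum(v * cmath.exp(sign * 2j * math.pi * k * j / n)
                for j, v in enumerate(values))
        out.append(s / n if inverse else s)
    return out

def solve_periodic_poisson_1d(delta, box_size):
    """Solve d2(phi)/dq2 = delta on a periodic 1D grid in Fourier space."""
    n = len(delta)
    delta_k = dft(delta)
    phi_k = []
    for m in range(n):
        # Signed integer mode number, then physical wavenumber k = 2 pi m / L.
        m_signed = m if m <= n // 2 else m - n
        k = 2.0 * math.pi * m_signed / box_size
        # The k = 0 mode (mean of phi) is unconstrained; set it to zero.
        phi_k.append(0.0 if m_signed == 0 else -delta_k[m] / k**2)
    return [v.real for v in dft(phi_k, inverse=True)]
```

For a single cosine mode δ = cos(2πq/L), the routine returns −δ/(2π/L)², which is the analytic solution.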

These few FFTs in the full box are the most hardware-demanding requirement of the algorithm (particularly in terms of memory), and the only step which is not distributed and suitable for grid computing. These FFTs may however be performed on a cluster of computers with fast interconnection suitable for Message Passing Interface (Frigo & Johnson 2005; Johnson et al. 2008).

3.2. Tiling and buffer region

B.1. After having computed the Lagrangian potentials, the master process splits the Lagrangian lattice (of size Np³) into Ntiles³ cubic tiles (we require that Np is a multiple of Ntiles). Tiles are constructed to be evolved independently; therefore the main, perfectly parallel region of the algorithm starts here.

To minimise artefacts due to boundary effects (see Sect. 3.4), each tile is surrounded by a “buffer region” in Lagrangian space. This buffer region consists of Np,buffer particles in each direction, so that each sCOLA box contains a total of Np,sCOLA³ particles, where Np,sCOLA ≡ Np,tile + 2Np,buffer and Np,tile ≡ Np/Ntiles. Corresponding physical sizes are Ltile ≡ L Np,tile/Np, Lbuffer ≡ L Np,buffer/Np, and LsCOLA ≡ L Np,sCOLA/Np. The fraction of the full Lagrangian lattice assigned to one child sCOLA process is represented by dotted lines in Fig. 2. Particles of the tile are represented in black, and particles of the buffer region are represented in cyan.

The sCOLA box is chosen to encompass the tile and its buffer region. We define the over-simulation factor r as the ratio between the total volume simulated in all sCOLA boxes and the target simulation volume, i.e.

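\begin{equation}
r \equiv \frac{N_{\mathrm{tiles}}^3\, L_{\mathrm{sCOLA}}^3}{L^3} = \left( \frac{N_{\mathrm{tiles}}\, N_{\mathrm{p,sCOLA}}}{N_{\mathrm{p}}} \right)^3 = \left( 1 + \frac{2 N_{\mathrm{p,buffer}}}{N_{\mathrm{p,tile}}} \right)^3. \tag{20}
\end{equation}

Since all sCOLA boxes can be evolved independently, the degree of parallelism of the algorithm is equal to the number of sCOLA boxes, Ntiles³. We call the “parallelisation potential factor” the quantity p ≡ Ntiles³/r, which balances the degree of parallelism with the amount of over-simulation. It is also

\begin{equation}
p = \left( \frac{N_{\mathrm{p}}}{N_{\mathrm{p,sCOLA}}} \right)^3 = \left( \frac{L}{L_{\mathrm{sCOLA}}} \right)^3. \tag{21}
\end{equation}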
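The bookkeeping of this section can be sketched as a small helper. It assumes that the over-simulation factor is the volume ratio r = Ntiles³ LsCOLA³ / L³ described in the text and that p = Ntiles³ / r; the function name, dictionary keys, and the numerical example are illustrative and are not taken from Table 2.

```python
def tiling_parameters(box_size, n_p, n_tiles, n_p_buffer):
    """Derived tiling quantities (per dimension unless stated otherwise)."""
    if n_p % n_tiles != 0:
        raise ValueError("N_p must be a multiple of N_tiles")
    n_p_tile = n_p // n_tiles                  # particles per tile side
    n_p_scola = n_p_tile + 2 * n_p_buffer      # tile plus buffer on both sides
    r = (n_tiles * n_p_scola / n_p) ** 3       # over-simulation factor
    p = n_tiles ** 3 / r                       # parallelisation potential factor
    return {
        "n_p_tile": n_p_tile,
        "n_p_scola": n_p_scola,
        "l_tile": box_size * n_p_tile / n_p,
        "l_buffer": box_size * n_p_buffer / n_p,
        "l_scola": box_size * n_p_scola / n_p,
        "n_boxes": n_tiles ** 3,               # degree of parallelism
        "r": r,
        "p": p,
    }
```

For instance, `tiling_parameters(200.0, 512, 8, 32)` gives tiles of 64 particles per side inside sCOLA boxes of 128 particles per side, so that 512 boxes are evolved with r = 8 and p = 64.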

For each sCOLA box, the corresponding child process computes the set of final positions {x}sCOLA and momenta {p}sCOLA.

B.2. At the end of the evolution, each child process sends the set of final positions {x}tile and momenta {p}tile of particles of the tile back to the master process. Particles of the buffer region are ignored. The master process then “untiles” the simulation by gathering the results from all the tiles.
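Steps B.1 and B.2 can be sketched in one dimension as follows. This is a simplified illustration, assuming periodic wrapping of buffer particles at the edges of the full box; the helper names `tile_indices` and `untile` are hypothetical, and the actual evolution of each box is left out.

```python
def tile_indices(n_p, n_tiles, n_p_buffer):
    """Lagrangian indices (one dimension) of each tile and its sCOLA box.

    Returns a list of (tile, box) pairs: `tile` lists the central particle
    indices, `box` lists buffer plus central indices with periodic wrapping.
    """
    n_p_tile = n_p // n_tiles
    boxes = []
    for t in range(n_tiles):
        start = t * n_p_tile
        tile = list(range(start, start + n_p_tile))
        box = [i % n_p for i in range(start - n_p_buffer,
                                      start + n_p_tile + n_p_buffer)]
        boxes.append((tile, box))
    return boxes

def untile(n_p, n_p_buffer, results):
    """Gather per-box results, keeping tile particles and dropping buffers.

    `results` is a list of (tile, box_values) pairs, where box_values has one
    entry per particle of the sCOLA box, in the same order as `box` above.
    """
    full = [None] * n_p
    for tile, box_values in results:
        central = box_values[n_p_buffer:len(box_values) - n_p_buffer]
        for index, value in zip(tile, central):
            full[index] = value
    return full
```

Each (tile, box) pair can be handed to an independent child process; only the central values are written back, so overlapping buffers never produce conflicting results.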

3.3. Initial operations in the sCOLA boxes

A few steps are required in each sCOLA box before starting the evolution per se.

C.1. The sCOLA box receives the relevant portion of ϕ(1)(q) and ϕ(2)(q) from the master process. This is the only communication required with the master process before sending back the results at the end of the evolution.

The portion of the LPT grid received by each process from the master process corresponds to the full spatial region covered by the sCOLA box, plus an additional padding of 2Nghost cells in each direction. We denote by and the parts of ϕ(1)(q) and ϕ(2)(q) received from the master process (we avoid the superscript “sCOLA” since no approximation is involved at this stage). They are defined on a grid of size (NsCOLA + 4Nghost)3, where

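\begin{equation}
N_{\mathrm{sCOLA}} \equiv \left\lceil \frac{N_{\mathrm{p,sCOLA}}\, N}{N_{\mathrm{p}}} \right\rceil \tag{22}
\end{equation}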

(⌈⋅⌉ denotes the ceiling function). An illustration is provided in Fig. 2, left panel. There, the portion of the LPT grid corresponding to the sCOLA box, of size NsCOLA in each direction, is represented in red and the padding region, of size 2Nghost in each direction, is represented in pink.

C.2. The sCOLA process locally computes the required time-independent first- and second-order LPT displacement vectors via finite differencing in configuration space and interpolation to particles’ positions.

The ghost cells included around the received portions of ϕ(1) and ϕ(2) in the sCOLA box ensure that the proper boundary conditions are used when applying the gradient operator in configuration space to get the LPT displacements on the grid. This step “consumes” Nghost layers of ghost cells in each direction, so that the grid of LPT displacements has a size of (NsCOLA + 2Nghost)³. To use again the proper boundary conditions when going from the LPT grid to particles’ positions, another Nghost layers of ghost cells are consumed by the interpolation operator. The use of the exact boundary conditions at each of these two steps ensures that the gradient and interpolation operators applied in the sCOLA box give the same results as their full-box counterparts. Therefore, by construction, the LPT displacement vectors computed in the sCOLA box are always the same as Ψ1 ≡ ∇qϕ(1)(q) and Ψ2 ≡ ∇qϕ(2)(q) in the full box (as would be computed by the master process). Consequently, we do not keep track of both the sCOLA-box vectors and Ψ1, 2, contrary to Tassev et al. (2015). In addition to being simpler, this scheme has the practical advantage of saving six floating-point numbers per particle in memory (three in the case of the Zel’dovich approximation).
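The way ghost cells are “consumed” can be illustrated in one dimension. The sketch below assumes FDA at order 2 (Nghost = 1) for the gradient, and simply tracks grid sizes for the interpolation step; the function names are hypothetical.

```python
def grid_sizes(n_scola, n_ghost):
    """Per-dimension grid sizes at each stage of step C.2."""
    received = n_scola + 4 * n_ghost         # potentials sent by the master
    displacements = received - 2 * n_ghost   # after the gradient operator
    usable = displacements - 2 * n_ghost     # after interpolation to particles
    return received, displacements, usable

def central_gradient(phi, h):
    """Order-2 central difference; drops one ghost cell on each side."""
    return [(phi[i + 1] - phi[i - 1]) / (2.0 * h)
            for i in range(1, len(phi) - 1)]
```

Because every output point of `central_gradient` has a complete stencil, no assumption (periodic or otherwise) is needed at the edges of the received array, which is what makes the boundary treatment exact.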

C.3. The sCOLA process precomputes the Dirichlet boundary conditions ΦBCs that will be used at each calculation of the gravitational potential during the sCOLA evolution.

For each sCOLA box, we define a particle-mesh grid of size Ng³ (the “PM grid”, represented in dark green in the right panel of Fig. 2). The PM grid defines the force resolution; it should be equal to or finer than the LPT grid (Ng ≥ NsCOLA). Before starting the evolution with sCOLA, each process precomputes the Dirichlet boundary conditions that will be required by the Poisson solver at each value of the scale factor aK. This calculation takes as input the initial gravitational potential and outputs ΦBCs(x, aK) for each aK, defined on the PM grid with a padding of 2Nghost cells around the sCOLA box in each direction (light green and yellow regions in Fig. 2, right panel). The approximation involved in this step is further discussed in Sect. 3.4.2.

3.4. Evolution of sCOLA boxes

Each sCOLA box is then evolved according to the scheme reviewed in Sect. 2.3 and Appendices A and B. Two specific approximations are needed to compute the operators and quantities with a superscript “sCOLA”; we now discuss the choices that we made.

3.4.1. Density assignment (BsCOLA)

As mentioned in Sect. 2.3, only particles of the sCOLA box should contribute to δsCOLA(x, a). For particles that are fully in the sCOLA box, density assignment can be chosen as the same operation as would be used in a PM or tCOLA code (typically, a CiC scheme). A question is what to do with particles that have (partially) left the sCOLA box during the evolution, while keeping the requirement of no communication between boxes: this constitutes the only difference between the operators B and BsCOLA. Possible choices include artificially periodising the sCOLA box (which is clearly erroneous) or stopping particles at its boundaries (which does not conserve momentum). Both of these choices assign the entire mass carried by the set of sCOLA particles 𝒮 to the PM grid, but result in artefacts in the final conditions, if the buffer region is not large enough.

An alternative choice is simply to limit the (Eulerian) PM grid volume where we compute δsCOLA(x, a) to the (Lagrangian) sCOLA box, including central and buffer regions. In practice, this means ignoring the fractional particle masses that the CiC assignment would have deposited to grid points outside the sCOLA box. We have found in our tests that this choice gives the smallest artefacts of the three choices considered⁵. We note that (partially) erasing some particles’ mass is an approximation that is only used in the BsCOLA operator to evaluate the source term in the Poisson equation, and therefore only affects the force calculation. The number of particles, both within each sCOLA process (Np,sCOLA³) and in the full simulation (Np³), is left unchanged during the evolution. Therefore, mass is always conserved both within each sCOLA process and within the full volume.
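The truncated density assignment described above can be sketched in one dimension. This is a simplified illustration with unit-mass particles and grid points at x = i × cell_size; the function name is hypothetical, and the returned values are deposited masses rather than a normalised density contrast.

```python
import math

def cic_deposit_truncated(positions, n_cells, cell_size):
    """1D cloud-in-cell assignment that drops mass falling outside the grid.

    Each particle shares its unit mass linearly between its two nearest grid
    points; shares assigned to points outside 0..n_cells-1 are ignored
    (neither wrapped periodically nor moved back inside the box).
    """
    mass = [0.0] * n_cells
    for x in positions:
        s = x / cell_size
        i = math.floor(s)
        w_right = s - i                        # fraction given to point i + 1
        for index, weight in ((i, 1.0 - w_right), (i + 1, w_right)):
            if 0 <= index < n_cells:
                mass[index] += weight
    return mass
```

A particle well inside the grid deposits its full mass, whereas a particle that has partially left the box contributes only the fraction that still overlaps the grid. The particle itself is kept, so only the force calculation is affected.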

3.4.2. Gravitational potential (sCOLA Laplacian, gradient, and interpolation operators)

Poisson solver ($\Delta_{\mathbf{x}}^{\mathrm{sCOLA}}$). To make sure that differences between ΦsCOLA(x, a) and Φ(x, a) are as small as possible, we make use of a Poisson solver with Dirichlet boundary conditions, instead of assuming periodic boundary conditions. Such a Poisson solver uses discrete sine transforms (DSTs) instead of FFTs, and requires the boundary values of Φ in six planes (west, east, south, north, bottom, top) surrounding the PM grid (see Appendix C). These planes have a thickness of Nghost cells (depending on the order of the FDA used to approximate the Laplacian); they are represented by hatched regions in Fig. 2, right panel. At each scale factor aK when the computation of accelerations is needed, the Dirichlet boundary conditions are extracted from the precomputed ΦBCs(x, aK) (step C.3., see Sect. 3.3).

Ideally, ΦBCs(x, aK) should be the exact, non-linear gravitational potential in the full volume at aK, Φ(x, aK). However, knowing this quantity would require having previously run the monolithic simulation in the full volume, which we seek to avoid. In this paper, we rely instead on the linearly-evolving potential (LEP) approximation (Brainerd et al. 1993; Bagla & Padmanabhan 1994), namely

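\begin{equation}
\Phi(\mathbf{x}, a_K) \approx \Phi_{\mathrm{BCs}}(\mathbf{x}, a_K) \equiv D_1(a_K)\, \Phi(\mathbf{x}, a_{\mathrm{i}}). \tag{23}
\end{equation}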

The idea behind this approximation is that the gravitational potential is dominated by long-wavelength modes, and therefore it ought to obey linear perturbation theory to a better approximation than the density field.

In Eq. (23), we have assumed that the linear growth factor D1 is normalised to unity at the scale factor corresponding to the initial conditions. The precomputation of ΦBCs in step C.3. is therefore an interpolation from the LPT grid to the PM grid and a simple scaling with D1(aK).

The output of the Poisson solver is the gravitational potential ΦsCOLA(x, aK) on the PM grid, in the interior of the sCOLA box (dark green grid points in Fig. 2, right panel). Consistently with the treatment above, ΦsCOLA(x, aK) is padded using the values of ΦBCs(x, aK) in 2Nghost cells around the PM grid, in each direction (light green and yellow regions in Fig. 2, right panel).

Therefore, the only difference between $\Delta_{\mathbf{x}}^{\mathrm{sCOLA}}$ and $\Delta_{\mathbf{x}}$ resides in using the LEP instead of the true, non-linear gravitational potential at the boundaries of the sCOLA box.
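A one-dimensional analogue of a Dirichlet Poisson solver based on sine transforms is sketched below. It is only an illustration under stated assumptions: FDA at order 2 (so a single boundary value on each side), a naive 𝒪(N²) type-I DST in place of a fast transform, and hypothetical function names. In the setting of this paper, the boundary values would be taken from ΦBCs.

```python
import math

def dst1(values):
    """Naive O(N^2) type-I discrete sine transform (unnormalised)."""
    n = len(values)
    return [sum(v * math.sin(math.pi * (j + 1) * (k + 1) / (n + 1))
                for j, v in enumerate(values))
            for k in range(n)]

def solve_dirichlet_poisson_1d(delta, h, phi_left, phi_right):
    """Solve (phi[i-1] - 2 phi[i] + phi[i+1]) / h^2 = delta[i] on the
    interior points, given the boundary values phi_left and phi_right."""
    n = len(delta)
    rhs = list(delta)
    # Move the known boundary values to the right-hand side.
    rhs[0] -= phi_left / h**2
    rhs[-1] -= phi_right / h**2
    rhs_k = dst1(rhs)
    # Eigenvalues of the order-2 Laplacian with homogeneous Dirichlet BCs.
    phi_k = [r / (-4.0 / h**2 * math.sin(math.pi * (k + 1) / (2 * (n + 1)))**2)
             for k, r in enumerate(rhs_k)]
    # The type-I DST is its own inverse up to a factor 2 / (n + 1).
    return [2.0 / (n + 1) * v for v in dst1(phi_k)]
```

As a check, a constant source δ = 2 with boundary values 0 and 1 on the unit interval returns Φ = x² at the interior grid points, since the order-2 stencil is exact for quadratics.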

Accelerations ($\nabla_{\mathbf{x}}^{\mathrm{sCOLA}}$ and the sCOLA interpolation operator). Given the gravitational potential ΦsCOLA(x, aK), accelerations are computed by finite differencing in configuration space and interpolation to particles’ positions, similarly to step C.2. (see Sect. 3.3). The application of $\nabla_{\mathbf{x}}^{\mathrm{sCOLA}}$ consumes Nghost cells, so that accelerations are obtained on the PM grid with a padding of Nghost cells (yellow region in Fig. 2, right panel). Interpolation from the grid to particles’ positions (the interpolation operator) further consumes Nghost cells.

As for the Laplacian, the only difference between $\nabla_{\mathbf{x}}^{\mathrm{sCOLA}}$ and $\nabla_{\mathbf{x}}$, and between the sCOLA interpolation operator and its full-box counterpart, resides in using the LEP in ΦsCOLA(x, aK) instead of the true, non-linear gravitational potential at the boundaries of the sCOLA box.

4. Accuracy and speed

We implemented the perfectly parallel sCOLA algorithm described in Sect. 3 in the SIMBELMYNë code (Leclercq et al. 2015), publicly available⁶ (see also Leclercq 2015, appendix B, for technical details on the implementation of the PM and tCOLA models in SIMBELMYNë). This section describes some tests of the accuracy and speed of the new sCOLA algorithm. Since our implementation, relying on evaluating forces with a PM scheme, introduces some additional approximations with respect to tCOLA, we compare our results to those of corresponding monolithic tCOLA simulations. The accuracy of tCOLA with respect to more accurate gravity solvers has been characterised in the earlier literature (Tassev et al. 2013; Howlett et al. 2015; Leclercq et al. 2015; Koda et al. 2016; Izard et al. 2016). The question of comparing the accuracy of our sCOLA algorithm to full N-body simulations would require building in a full N-body integrator for the sCOLA boxes (see Eqs. (14) and (16)); this subject is left for future research.

Throughout the paper, we adopt the ΛCDM model with Planck 2015 cosmological parameters: h = 0.6774, ΩΛ = 0.6911, Ωb = 0.0486, Ωm = 0.3089, nS = 0.9667, σ8 = 0.8159 (Planck Collaboration XIII 2016, page 31, Table 4, last column). The initial power spectrum is computed using the Eisenstein & Hu (1998, 1999) fitting function.

We base our first tests on a periodic box of comoving side length L = 200 Mpc h−1 populated with Np = 512³ dark matter particles. For all operators, we use FDA at order 2. The LPT grid has N³ = 256³ voxels. Particles are evolved to redshift z = 19 using 2LPT. For all runs, we use 10 time-steps linearly spaced in the scale factor to evolve particles from z = 19 (ai = 0.05) to z = 0 (af = 1) (see Appendix B)7. For tCOLA, the PM grid, covering the full box, has 512³ voxels. For sCOLA, we use eight different setups, with various parameters {Ntiles, Np,tile, Ltile, Np,buffer, Lbuffer, Ng, r, p} given in the first part of Table 2.
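For concreteness, the time-stepping quoted above can be written down in a few lines of Python. The sketch assumes that ten time-steps correspond to eleven equally spaced values of the scale factor; this is an assumption made for illustration, and the integrator itself is the one specified in Appendix B.

```python
import numpy as np

a_initial, a_final, n_steps = 0.05, 1.0, 10

# Boundaries of the time-steps, linearly spaced in the scale factor
# (assumption: n_steps steps are delimited by n_steps + 1 values)
a_edges = np.linspace(a_initial, a_final, n_steps + 1)

# Corresponding redshifts, running from z = 19 down to z = 0
z_edges = 1.0 / a_edges - 1.0

# Constant step in scale factor, 0.095 for these values
step = a_edges[1] - a_edges[0]
```

With this spacing, most steps fall at low redshift, where the scale factor changes most rapidly with respect to redshift.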

Table 2.

Different setups used to test the accuracy and speed of our sCOLA algorithm.

To assess more extensively the impact of using sCOLA on large scales, we used a second ensemble of simulations with the following differences: a box with comoving side length of L = 1 Gpc h−1, Np = 1024³ particles, an LPT grid with N³ = 512³ voxels, and a PM grid of 1024³ voxels for tCOLA. For sCOLA, we use eight different setups given in the second part of Table 2.

4.1. Qualitative assessments

The redshift-zero density field is estimated by assigning all particles to the LPT grid using the CiC scheme. Results for the 200 Mpc h−1 box are shown in Fig. 3. There, the bottom right panel shows the reference tCOLA density field and other panels show the differences between sCOLA and tCOLA results, for the eight different setups. Some qualitative observations can be made: when artefacts are visible in the sCOLA results, they mainly affect over-dense regions of the cosmic web (filaments and halos), whereas under-dense regions are generally better recovered. Artefacts are of two types: the position of a structure (usually a filament) can be imprecise due to a misestimation of bulk motions (this is visible as a “dipole” in Fig. 3); or the density (usually of halos) can be over- or under-estimated (this is visible as a “monopole” in Fig. 3). In all setups, artefacts are predominantly located close to the boundaries of tiles (represented as dashed lines) and are less visible in the centre of tiles. This can be easily understood given that the approximations made all concern the behaviour at the boundaries of sCOLA boxes. At fixed size for the buffer region, the correspondence between sCOLA and tCOLA density fields improves with increasing tile size. A minimum tile size of about 50 Mpc h−1 seems necessary to limit the misestimation of halo densities (“monopoles” in Fig. 3). At low redshift, this scale is in the mildly non-linear regime, where LPT starts to break down; therefore, the LPT frame is inaccurate for particles, and the requirement of no communication between tiles leads to mispredicted clustering. As expected, at fixed tile size, the results are improved by increasing the buffer region around tiles: in each sCOLA box, boundary approximations are pushed farther away from the central region of interest. 
A good compromise between reducing artefacts and increasing the size of buffer regions seems to be found for a buffer region of 25 Mpc h−1, which corresponds roughly to the maximum distance travelled by a particle from its initial to its final position. In particular, the setup Ltile = 50 Mpc h−1, Lbuffer = 25 Mpc h−1 leads to a satisfactory approximation of the tCOLA density with a parallelisation potential factor p = 8.
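The density estimate used for these comparisons can be illustrated with a minimal cloud-in-cell (CiC) assignment in Python. This is a sketch under stated assumptions, not the SIMBELMYNë implementation: the function name is hypothetical, the mesh is periodic, grid points sit at integer multiples of the cell size, and all particles carry equal mass.

```python
import numpy as np

def cic_density_contrast(positions, box_size, n_grid):
    """Cloud-in-cell density contrast on a periodic n_grid^3 mesh.

    positions: (N, 3) array of comoving coordinates in [0, box_size).
    """
    # Particle coordinates in units of the cell size
    scaled = np.asarray(positions, dtype=float) / box_size * n_grid
    base = np.floor(scaled).astype(int)   # lower neighbouring grid point
    frac = scaled - base                  # offset from that grid point, in [0, 1)
    mesh = np.zeros((n_grid, n_grid, n_grid))
    # Each particle shares its mass among the 8 surrounding grid points
    for shift in np.ndindex(2, 2, 2):
        upper = np.array(shift) == 1
        # Linear weight per axis: frac for the upper neighbour, 1 - frac for the lower
        weight = np.prod(np.where(upper, frac, 1.0 - frac), axis=1)
        idx = (base + np.array(shift)) % n_grid   # periodic wrapping
        np.add.at(mesh, (idx[:, 0], idx[:, 1], idx[:, 2]), weight)
    return mesh / mesh.mean() - 1.0
```

The quantity displayed in Fig. 3 would then be obtained as `np.log(2.0 + delta)`, which is finite everywhere because the density contrast cannot fall below −1.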

Fig. 3.

Qualitative assessment of the redshift-zero density field from sCOLA for different tilings and buffer sizes, with respect to tCOLA. The bottom right panel shows the reference tCOLA density field in a 200 Mpc h−1 box with periodic boundary conditions (the quantity represented is ln(2 + δtCOLA) where δtCOLA is the density contrast). Other panels show the difference between sCOLA and tCOLA density fields, ln(2 + δsCOLA)−ln(2 + δtCOLA), for different sizes of tile and buffer region, as indicated above the panels. The tiling is represented by dashed lines, and the central tile’s buffer region is represented by solid lines. In the third dimension, the slices represented intersect the central tile at its centre. As can be observed in this figure, artefacts are predominantly located close to the boundaries of tiles; they are reduced with increasing tile size and buffer region size.

In a similar fashion, the velocity field is estimated on the LPT grid from particle information, using the simplex-in-cell estimator (Hahn et al. 2015; Leclercq et al. 2017). Using phase-space information, this estimator accurately captures the velocity field, even in regions sparsely sampled by simulation particles. Results for the 200 Mpc h−1 box are shown in Fig. 4, where one component of the tCOLA velocity field vtCOLA (in km s−1) is shown in the bottom right panel. Other panels show the velocity error in sCOLA, vsCOLA − vtCOLA in km s−1. Differences between tCOLA and sCOLA velocity fields are of two kinds: misestimation of bulk flows (visible as light, spatially extended regions in Fig. 4), or misestimation of particle velocities inside halos (visible as dark spots in Fig. 4). We do not interpret the second kind of differences as errors made by our sCOLA algorithm: indeed, motions within virialised regions are not captured accurately by any simulation using only ten time-steps, even by tCOLA in the full box. Therefore, only the first kind of differences, that is, the misestimation of coherent bulk motions is physically interpretable. In this respect, the same behaviour as for density fields can be observed: artefacts are mostly located at the boundaries of tiles, and they are reduced with increasing tile size and buffer region size, with safe minima of Ltile ≳ 50 Mpc h−1 and Lbuffer ≳ 25 Mpc h−1, respectively.

Fig. 4.

Same as Fig. 3, but for one component of the velocity field, in km s−1. Bulk flows are correctly captured if tiles and their buffer regions are large enough. Residual differences inside halos can be observed, but they are expected due to the limited number of time-steps, rendering both tCOLA and sCOLA velocities inaccurate in the deeply non-linear regime.

4.2. Summary statistics

In this section, we turn to a more quantitative assessment of our results, by checking the power spectrum of final density fields and their cross-correlation to the tCOLA density field. Even if final density fields are non-Gaussian, two-point statistics (auto- and cross-spectra) are expected to be sensitive to the approximations made in our sCOLA algorithm, which involves both local and non-local operations in configuration space.

According to Huterer & Takada (2005) or Audren et al. (2013), in the best cases, observational errors for a Euclid-like survey are typically of order 3% for k <  10−2 (Mpc h−1)−1. These results do not account for any of the systematic uncertainties linked to selection effects or contamination of the clustering signal by foregrounds. At smaller scales, theoretical uncertainties take over, reaching 1% and above for k >  10−1 (Mpc h−1)−1. In addition, the impact of baryonic physics is still largely uncertain, some models predicting an impact of at least 10% at k = 1 (Mpc h−1)−1 (e.g. van Daalen et al. 2011; Chisari et al. 2018; Schneider et al. 2019). Any data model involving our sCOLA algorithm will be subject to these uncertainties. For this reason, we aim for no better than 3% to 1% accuracy at all scales up to k = 1 (Mpc h−1)−1, for any two-point measurement of clustering.

More precisely, we work with P(k) and R(k), defined for two density contrast fields δ and δ′=δtCOLA, with our Fourier transform convention, by

(24)

(25)

where δD is a Dirac delta distribution. For the estimation of P(k) and R(k), we use 100 logarithmically-spaced k-bins from the fundamental mode of the box kmin ≡ 2π/L to k = 1 (Mpc h−1)−1.
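The two diagnostics can be estimated from gridded density contrasts along the following lines. This is a hedged sketch rather than the code used for the figures: the function name is hypothetical, and the normalisation assumes the convention ⟨δ*(k) δ(k′)⟩ = (2π)³ P(k) δD(k − k′), which may differ from the convention of Eqs. (24) and (25) by constant factors. The ratio of two power spectra and the coefficient R(k), taken here as the cross-spectrum divided by the geometric mean of the two auto-spectra, do not depend on that choice.

```python
import numpy as np

def auto_and_cross_spectra(delta, delta_ref, box_size, n_bins=100, k_max=1.0):
    """Binned P(k) of both fields and their cross-correlation coefficient R(k).

    delta, delta_ref: density contrasts on the same periodic n^3 grid.
    box_size in Mpc/h; wavenumbers in h/Mpc.
    """
    n = delta.shape[0]
    cell_volume = (box_size / n) ** 3
    # Discrete approximation of the continuous Fourier transform
    fk = np.fft.fftn(delta) * cell_volume
    fk_ref = np.fft.fftn(delta_ref) * cell_volume

    k1d = 2.0 * np.pi * np.fft.fftfreq(n, d=box_size / n)
    kx, ky, kz = np.meshgrid(k1d, k1d, k1d, indexing="ij")
    k_norm = np.sqrt(kx**2 + ky**2 + kz**2).ravel()

    # Logarithmic bins from the fundamental mode to k_max; the lower edge is
    # nudged down so that round-off cannot exclude the fundamental mode
    k_fund = 2.0 * np.pi / box_size
    edges = np.geomspace(k_fund * (1.0 - 1e-9), k_max, n_bins + 1)
    index = np.digitize(k_norm, edges) - 1
    keep = (index >= 0) & (index < n_bins)
    counts = np.bincount(index[keep], minlength=n_bins)

    def bin_average(values):
        sums = np.bincount(index[keep], weights=values.ravel()[keep],
                           minlength=n_bins)
        with np.errstate(invalid="ignore", divide="ignore"):
            return sums / counts   # NaN in bins that contain no mode

    volume = box_size ** 3
    p_auto = bin_average(np.abs(fk) ** 2) / volume
    p_ref = bin_average(np.abs(fk_ref) ** 2) / volume
    p_cross = bin_average((np.conj(fk) * fk_ref).real) / volume
    r = p_cross / np.sqrt(p_auto * p_ref)
    return bin_average(k_norm), p_auto, p_ref, r
```

Because the modes of a finite box are discrete, many of the 100 logarithmic bins near the fundamental mode contain no mode at all; the sketch returns NaN there instead of interpolating.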

In Figs. 5 and 6, we plot the power spectrum of sCOLA density fields divided by the power spectrum of the reference tCOLA density field, PsCOLA(k)/PtCOLA(k) (upper panels) and the cross-correlation between sCOLA and tCOLA density fields, R(k) (bottom panels), for our 200 Mpc h−1 (Fig. 5) and 1 Gpc h−1 box (Fig. 6). The grey horizontal bands represent the target accuracies of 3% and 1%, and the vertical lines mark the fundamental modes of the tiles, ktile ≡ 2π/Ltile, for the different values of Ltile used.

Fig. 5.

Power spectrum relative to tCOLA (top panel) and cross-correlation with respect to tCOLA (bottom panel) of redshift-zero sCOLA density fields, in a 200 Mpc h−1 box containing 512³ dark matter particles. Different sizes for the tiles (represented by different line styles) and buffer regions (represented by different colours) are used, as indicated in the legend. The vertical lines show the respective fundamental mode of different tiles, the light grey bands correspond to 3% accuracy, and the dark grey bands to 1% accuracy.

Fig. 6.

Same as Fig. 5, but in a 1 Gpc h−1 box containing 1024³ particles.

Figure 5 quantitatively confirms the considerations of Sect. 4.1. Both the amplitudes (as probed by P(k)/PtCOLA(k)) and the phase accuracy (as probed by R(k)) of sCOLA simulations are improved with increasing tile size, for a fixed buffer region (different line styles, same colours). For a fixed tile size, results are also improved by increasing the size of the buffer region (same line styles, different colours). Remarkably, all setups yield perfect phase accuracy at large scales (R(k) = 1 for k ≤ 0.2 (Mpc h−1)−1), even when the amplitude of corresponding modes deviates from the tCOLA result. Defects at small scales (lack of power and inaccurate phases) are only observed for the smallest tile sizes and are fixed by increasing the size of buffer region. This effect can be interpreted in Lagrangian coordinates: when the Lagrangian volume forming a halo is divided among different tiles that do not exchange particles, and if the buffer region is too small to contain the rest of the halo, the resulting structure is then split and under-clustered in Eulerian coordinates. In this respect, preferring a sCOLA box size (LsCOLA ≡ Ltile + 2Lbuffer) of at least 100 Mpc h−1 (and therefore Ltile ≳ 50 Mpc h−1, Lbuffer ≳ 25 Mpc h−1, in most situations) seems to be sensible. A more difficult issue is the amplitude of large-scale modes, for k <  ktile. These are sensitive to the tiling if buffer regions around tiles are too small. A safe requirement also seems to be Lbuffer ≳ 25 Mpc h−1. Putting everything together, in our 200 Mpc h−1 box, three setups reach 3% accuracy in amplitude and phases at all scales: {Ltile = 50 Mpc h−1, Lbuffer = 25 Mpc h−1} (discussed already in Sect. 4.1); {Ltile = 100 Mpc h−1, Lbuffer = 25 Mpc h−1}; and {Ltile = 50 Mpc h−1, Lbuffer = 50 Mpc h−1}. The last-mentioned performs even better, reaching 1% accuracy at all scales, but at the price of over-simulating the volume by a larger factor.

Figure 6 shows the same diagnostics for a 1 Gpc h−1 box, where the qualitative behaviour is the same as before. It confirms the requirement Lbuffer ≳ 25 Mpc h−1 to get sufficient accuracy at high k. The question of the accuracy reached at the largest scales is then jointly sensitive to Ltile and L. In our tests, the setups {Ltile = 62.5 Mpc h−1, Lbuffer = 39.1 Mpc h−1} and {Ltile = 125 Mpc h−1, Lbuffer = 29.3 Mpc h−1} yield 3% accurate results at all scales, and the setups {Ltile = 62.5 Mpc h−1, Lbuffer = 62.5 Mpc h−1} and {Ltile = 125 Mpc h−1, Lbuffer = 48.8 Mpc h−1} almost reach 1%-level precision at all scales. We note that the two different boxes have different mass resolutions, which confirms that requirements for tile and buffer region sizes should be expressed in physical size.

4.3. Tests of the approximations

As discussed in Sect. 3.4, two approximations are introduced in our sCOLA algorithm with respect to a monolithic tCOLA approach. These concern density assignment in the interior of sCOLA boxes (approximation D.1.) and the gravitational potential at the boundaries of sCOLA boxes (approximation D.2.). In this section, we test the impact of these approximations on final results, using two-point statistics as diagnostic tools. For this test we use our sCOLA run with L = 200 Mpc h−1, Np = 512³, 64 tiles (Ntiles = 4, Np,tile = 128) and Np,buffer = 32 (i.e. Ltile = 50 Mpc h−1, Lbuffer = 12.5 Mpc h−1). We choose a small buffer size on purpose, to be sensitive to the approximations made.

Let us denote by δint the density contrast in the interior of sCOLA boxes and by ΦBCs the gravitational potential at the boundaries of sCOLA boxes. As discussed in Sect. 3.4, our algorithm involves an approximation regarding particles leaving the sCOLA box during the evolution, yielding δsCOLA, and relies on the LEP approximation at the boundaries. It therefore uses

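$$\delta_\mathrm{int} = \delta_\mathrm{sCOLA} \quad \text{and} \quad \Phi_\mathrm{BCs} = \Phi_\mathrm{LEP}. \qquad (26)$$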

Everything else being fixed, we ran three investigative sCOLA simulations using respectively,

(27)

(28)

$$\delta_\mathrm{int} = \delta \quad \text{and} \quad \Phi_\mathrm{BCs} = \Phi, \qquad (29)$$

where δ is the “true” density contrast and Φ is the “true” gravitational potential, extracted at each time-step from the corresponding tCOLA simulation.

Figure 7 shows the auto- and cross-spectra of resulting sCOLA density fields, with respect to the reference tCOLA result. The use of δint = δ yields by construction R(k) = 1 at all scales, as can be checked from the bottom panel. The setup given by Eq. (29) is free of both approximations; it therefore serves as a consistency check: one should retrieve the tCOLA result if no bias is introduced by the tiling and the different Poisson solver. As expected, Fig. 7 shows that our implementation recovers the tCOLA result at all scales, with only a small excess of power at k > 0.4 (Mpc h−1)−1 explained by the slightly higher force resolution of the sCOLA run with respect to tCOLA (the PM grid cell sizes are 0.3886 and 0.3906 Mpc h−1, respectively).

Fig. 7.

Tests of the approximations made in sCOLA for the density field and the gravitational potential. As in Fig. 5, the diagnostic tools are the power spectrum relative to tCOLA (top panel) and the cross-correlation with tCOLA (bottom panel). Our sCOLA algorithm uses the approximate interior density field δsCOLA and the LEP approximation for the boundary gravitational potential (dash-dotted blue line). In other simulations, as indicated in the legend, we use the true density field δ and/or the true gravitational potential Φ at the boundaries. The approximation made for the density field dominates, especially at large scales.

The setups given by Eqs. (27) and (28) allow disentangling the impact of approximations D.1. and D.2. In the standard run (Eq. (26)), averaging over tiles and time-steps, ∼0.43% of the 512³ particles, all of which belong to the buffer region, deposit only part of their mass (∼76.5% on average) in the calculation of δsCOLA. This fraction of particles increases only slightly with time (from ∼0.35% at a = 0.05 to ∼0.47% at a = 1); in other simulations, we have found that it has a stronger dependence on the mass resolution and on the surface area of sCOLA boxes. Regarding the accuracy of the LEP approximation, the ratio of the power spectra of Φ − ΦLEP and of Φ goes to zero at early times and large scales, and stays below 12% for all scales with wavenumber k ≤ 2π/LsCOLA at a = 1. As can be observed in Fig. 7, although using the non-linear gravitational potential instead of the LEP improves both P(k) and R(k) for the final density field at all scales with wavenumber k > 7 × 10−2 (Mpc h−1)−1, it does not remove the ≳5% bias in amplitude at the largest scales. By contrast, using the true density contrast solves this problem and yields a 3% accurate result at all scales, which is remarkable given the small buffer size used in this case (the over-simulation factor is only r = 3.38).

We conclude from these tests that the approximation made regarding the density field (D.1.) has more impact than the one regarding the gravitational potential (D.2.), especially on the largest modes. This result is consistent with the standard paradigm for structure formation, where the density contrast undergoes severe non-linearity at small scales and late times, while the gravitational potential evolves very little. It also suggests that future improvements of our algorithm should focus on finding a better approximation for δsCOLA, rather than ΦBCs.

4.4. Computational cost

One of the main motivations for our perfectly parallel algorithm based on sCOLA is to be able to run very large volume simulations at reasonably high resolution. A detailed analysis of the speed and computational cost of our algorithm, as implemented in SIMBELMYNë, is therefore beyond the scope of this paper. However, in this section we discuss some performance considerations based on a sCOLA run with L = 1 Gpc h−1, Np = 1024³, 512 tiles (Ntiles = 8, Np,tile = 128), Np,buffer = 30 (i.e. Ltile = 125 Mpc h−1, Lbuffer = 29.3 Mpc h−1), Ng = 199, and on the corresponding monolithic tCOLA simulation. In this case, the over-simulation factor is r ≈ 3.17 and the parallelisation potential factor is p ≈ 161.59. To compare the theoretical parallelisation potential factor and the realised parallelisation efficiency, we use one process for tCOLA and 512 processes for sCOLA. Each process is run on a node with 32 cores using OpenMP parallelisation.
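The derived quantities of a setup follow from simple arithmetic, sketched here in Python. The function name is hypothetical, and the expressions r = (LsCOLA/Ltile)³ and p = Ntiles³/r are written as assumptions: they reproduce the values quoted in this section, but the definitions themselves are given earlier in the paper and are not restated here.

```python
def scola_setup(box_size, n_part_side, n_tiles_side, n_part_buffer):
    """Derived quantities of a cubic tiling (lengths in the units of box_size)."""
    n_part_tile = n_part_side // n_tiles_side   # particles per tile, per dimension
    spacing = box_size / n_part_side            # mean inter-particle spacing
    l_tile = n_part_tile * spacing
    l_buffer = n_part_buffer * spacing
    l_scola = l_tile + 2.0 * l_buffer           # side length of one sCOLA box
    # Assumed definitions, chosen because they match the quoted r and p
    r = (l_scola / l_tile) ** 3                 # over-simulation factor
    p = n_tiles_side ** 3 / r                   # parallelisation potential factor
    return {"l_tile": l_tile, "l_buffer": l_buffer,
            "l_scola": l_scola, "r": r, "p": p}

# Setup of this section: L = 1 Gpc/h, 1024^3 particles, 8^3 tiles, 30 buffer particles
big = scola_setup(1000.0, 1024, 8, 30)

# Setup of Sect. 4.3: L = 200 Mpc/h, 512^3 particles, 4^3 tiles, 32 buffer particles
small = scola_setup(200.0, 512, 4, 32)
```

For the first setup this gives Ltile = 125 Mpc h−1, Lbuffer ≈ 29.3 Mpc h−1, r ≈ 3.17 and p ≈ 161.59; for the second, r = 3.375, in agreement with the value r = 3.38 quoted in Sect. 4.3.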

One of the main advantages of our sCOLA algorithm lies in its reduced memory consumption. In Fig. 8 (first row), we show the memory requirements for the calculation of LPT potentials in the full box (common for tCOLA and sCOLA), for the evolution of the full box with tCOLA, and for the evolution of each sCOLA box, all in single-precision floating-point format. LPT requires eight grids of size N³ (one for the initial conditions, one for the Zel’dovich potential, and six for the second-order term), occupying ∼4.3 GB. Evolution with tCOLA requires one integer and 12 floating-point numbers per particle (their identifier, their position x, their momentum p, and the vectors Ψ1 and Ψ2), plus a PM grid of 1024³ voxels, for a total of ∼60.1 GB. Within each box, sCOLA requires the same memory per particle (but only for the particles of the sCOLA box), a PM grid covering that box, and some overhead for Dirichlet boundary conditions. The total is around 400 MB per sCOLA box with the setup considered here.
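The memory figures quoted above can be checked with a short calculation. The sketch below assumes 4 bytes per stored number, including particle identifiers, and takes the PM grid of a sCOLA box to have Ng³ = 199³ voxels; both are assumptions made for illustration, and the function names are hypothetical.

```python
BYTES_PER_NUMBER = 4  # single precision; identifiers assumed to be 4-byte integers

def lpt_memory(n_lpt_side):
    """Eight scalar grids of n_lpt_side^3 voxels, in bytes."""
    return 8 * n_lpt_side**3 * BYTES_PER_NUMBER

def evolution_memory(n_particles, n_pm_side):
    """One identifier and 12 floats per particle, plus one PM grid, in bytes."""
    return (13 * n_particles + n_pm_side**3) * BYTES_PER_NUMBER

lpt = lpt_memory(512)                    # LPT potentials in the full box
tcola = evolution_memory(1024**3, 1024)  # monolithic tCOLA evolution
# One sCOLA box: (128 + 2 * 30)^3 particles, PM grid assumed to be 199^3 voxels,
# before the overhead of the Dirichlet boundary conditions
scola_box = evolution_memory((128 + 2 * 30)**3, 199)
```

These evaluate to about 4.3 GB, 60.1 GB and 0.38 GB respectively, the last value being a lower bound since it excludes the boundary-condition overhead.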

Fig. 8.

Memory requirements (first row) and timings for two corresponding tCOLA and sCOLA simulations. Although the CPU time required is higher for sCOLA, the memory consumption and wall-clock time are significantly reduced with respect to tCOLA, due to the perfectly parallel nature of most computations (second row). In the middle left panel, the height of the white bar shows the hypothetical cost of running tCOLA for the same volume as simulated with sCOLA, when taking buffer regions into account. The relative contributions of different operations, as detailed in the legend, are shown in the third row. The main difference in computational cost in sCOLA with respect to tCOLA comes from the use of DSTs instead of FFTs, which makes the evaluation of the potential significantly more expensive.

In the second row of Fig. 8, we show the overall cost of tCOLA versus sCOLA, both in terms of CPU time (middle left panel) and wall-clock time (middle right panel). The key feature of our algorithm is that, although the overall CPU time needed is unavoidably higher than with tCOLA, the wall-clock time spent can be drastically reduced. This owes to the degree of parallelism of our algorithm, which is equal to the number of sCOLA boxes. In particular, if as many processes as sCOLA boxes can be allocated (512 in this case), the overall wall-clock time is determined by the initial full box operations (common with tCOLA, see Sect. 3.1), plus the cost of evolving only one sCOLA box (an average of 30.9 wall-clock seconds on 32 cores in this test). This is what is shown in the middle right panel of Fig. 8. The wall-clock time reduction factor is ≈93 for the evolution only (≈11 when accounting for initialisation and writing outputs). Compared to the parallelisation potential factor p ≈ 162, this number means that sCOLA-specific operations and the larger fractional parallelisation overhead in sCOLA boxes do not significantly hamper the perfectly parallel nature of the code.

The increased CPU time needed with sCOLA (see Fig. 8, middle left panel) is partly due to the necessity of over-simulating the volume of interest by a factor r > 1 for accuracy. For comparison with the sCOLA CPU time, the height of the white bar shows the tCOLA CPU time multiplied by r. The rest of the difference in CPU time principally comes from the fact that simulations with our variant of sCOLA are intrinsically more expensive than with tCOLA for a periodic volume of the same size. This point is further discussed below.

In the third row of Fig. 8, we show the various relative contributions to CPU time and wall-clock time, both for full tCOLA/sCOLA runs and per tCOLA/sCOLA box. The generation of the initial conditions (brown, step A.1.) and the writing of outputs to disk (grey) are common to tCOLA and sCOLA and have an overall fixed cost. LPT calculations in the full box (pink) consist of computing the Lagrangian potentials and the particle-based LPT displacements in tCOLA, but are limited to computing the Lagrangian potentials in the full box in the case of sCOLA (step A.2.). These full-box operations are only shown in the bars labelled “tCOLA” and “sCOLA”. Within each box, the different operations are evaluating the density field (yellow), solving the Poisson equation to get the gravitational potential (green), differentiating the gravitational potential to get the accelerations (blue), “kicking” particles (red), and “drifting” particles (purple). sCOLA further requires some specific operations within each box: communicating with the master process (steps B.1., B.2., and C.1.) and calculating the particle-based LPT displacements (step C.2.), which are grouped in Fig. 8 and shown in orange; and pre-computing the Dirichlet boundary conditions with the LEP approximation (step C.3., cyan). sCOLA-specific operations do not contribute more than 10% of the CPU and wall-clock times per box.

A notable difference between evolving a given box with sCOLA or with tCOLA resides in the higher cost of evaluating the potential (green): in this case, 9% of CPU time and 13% of wall-clock time with sCOLA versus 6% of CPU time and 3% of wall-clock time with tCOLA. This effect is due to the use of DSTs, required by the Poisson solver with Dirichlet boundary conditions (see Sect. 3.4 and Appendix C), instead of FFTs. Indeed, depending on the size of the PM grid, the evaluation of DSTs can be the computational bottleneck of our algorithm (up to 60% of overall CPU time in some of our runs), as opposed to the evaluation of the density field (e.g. via CiC) in traditional tCOLA or PM codes (37% of overall CPU time). For this reason, within each setup, we recommend performing experiments to find a PM grid size giving a good compromise between force accuracy and computational efficiency. In particular, it is strongly preferable that Ng + 1 not contain large prime factors (this number appears in the basis functions of sine transforms, see Appendix C.2). Throughout this paper, we ensured that Ng + 1 is always even, while keeping roughly the same force resolution as the corresponding tCOLA simulation. We note that our choice of Ng + 1 = 200 in the present test, combined with the use of a power of two for the PM grid in the monolithic tCOLA run, favours tCOLA in the comparison of CPU times. The sCOLA CPU time shown in the middle left panel of Fig. 8 could be further optimised by making Ng + 1 a power of two in sCOLA boxes.
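A simple way to shortlist PM grid sizes before running timing experiments is to inspect the prime factorisation of Ng + 1. The Python sketch below uses hypothetical helper names, and the threshold of 7 on the largest prime factor is an arbitrary illustrative choice, not a recommendation made in this paper.

```python
def largest_prime_factor(n):
    """Largest prime factor of n >= 2 by trial division (returns 1 for n = 1)."""
    largest = 1
    factor = 2
    while factor * factor <= n:
        while n % factor == 0:
            largest = factor
            n //= factor
        factor += 1
    if n > 1:          # whatever remains is itself prime
        largest = n
    return largest

def dst_friendly_grid_sizes(ng_target, search=10, max_prime=7):
    """Values of Ng near ng_target for which Ng + 1 has no prime factor above max_prime."""
    candidates = [ng for ng in range(ng_target - search, ng_target + search + 1)
                  if ng > 0 and largest_prime_factor(ng + 1) <= max_prime]
    # Closest to the target first, to stay near the desired force resolution
    return sorted(candidates, key=lambda ng: abs(ng - ng_target))

shortlist = dst_friendly_grid_sizes(199)
```

For a target of Ng = 199, the shortlist starts with 199 itself, since Ng + 1 = 200 = 2³ × 5². The remaining candidates would still need to be timed, as the actual cost depends on the transform library and hardware.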

5. Discussion and conclusion

5.1. Discussion

The principal computational challenge of the gravitational N-body problem is the long-range nature of the gravitational force. Our sCOLA approach enables perfectly parallel computations and therefore opens up profoundly new possibilities for how to compute large-scale cosmological simulations. We discuss these, some consequences and possible future directions in the following.

Gravity and physics models. It is important to note that the sCOLA algorithm introduced in this work is general, and not limited to the gravity model used here: while we focused on a tCOLA particle-mesh implementation to evolve the sCOLA tiles, this choice was designed to facilitate the assessment of tiling artefacts against monolithic tCOLA runs. Nonetheless, any N-body method, such as particle-particle–particle-mesh, tree methods or AMR, could be used to evolve each tile. In particular, since the sCOLA approach separates quasi-linear and non-linear scales, there is no need to cut off the computation on small scales. In concert with the approaches discussed below, this fact can be exploited to perform very high-resolution, fully non-linear simulations in cosmological volumes. In this case, the spatial decoupling due to sCOLA would render computations possible that would otherwise be prohibitive.

Similar comments apply to including non-gravitational physics: since hydrodynamical or other non-gravitational forces are typically much more local than gravitational interactions, there are no algorithmic barriers to including them in each sCOLA tile8.

Construction of light-cones and mock catalogues. The decoupling of computational volumes achieved by our approach means that each sCOLA box can be run completely independently. Therefore, it is not necessary to define a common final redshift for all tiles. This means that to compute a cosmological light-cone, only a single tile (the one containing the observer) needs to be run to redshift zero. Since the volume on the light-cone increases rapidly with redshift, the vast majority of tiles would only have to be run until they intersect the light-cone at high redshift. In monolithic N-body simulations, most of the computational time is spent at low redshift, since the local time-step of simulations decreases with the local dynamical time. Our approach would therefore greatly accelerate the time needed to complete light-cone simulations, by scheduling tiles in order of the redshift to which they should run (and therefore in reverse order of expected computational time), aiding load-balancing.
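The scheduling idea can be sketched as follows in Python. For each tile, the final redshift is taken to be the redshift at which the light-cone reaches the point of the tile closest to the observer. The sketch assumes a flat ΛCDM background with the parameters of Sect. 4, an observer located inside a non-replicated box, and distances in Mpc h−1; the function names are hypothetical and this is not part of SIMBELMYNë.

```python
import numpy as np

C_OVER_H0 = 2997.92458  # Hubble distance c/H0 in Mpc/h

def redshift_of_distance(distance, omega_m=0.3089, omega_l=0.6911,
                         z_max=10.0, n=4096):
    """Invert the comoving distance-redshift relation of flat LCDM by tabulation."""
    z = np.linspace(0.0, z_max, n)
    inv_e = 1.0 / np.sqrt(omega_m * (1.0 + z) ** 3 + omega_l)
    # Cumulative trapezoidal integral of dz / E(z)
    steps = 0.5 * (inv_e[1:] + inv_e[:-1]) * np.diff(z)
    chi = C_OVER_H0 * np.concatenate(([0.0], np.cumsum(steps)))
    return np.interp(distance, chi, z)

def tile_final_redshifts(box_size, n_tiles_side, observer):
    """Redshift down to which each tile of a cubic tiling has to be evolved."""
    l_tile = box_size / n_tiles_side
    # Lower corner of every tile, shape (n_tiles_side^3, 3)
    lower = np.indices((n_tiles_side,) * 3).reshape(3, -1).T * l_tile
    observer = np.asarray(observer, dtype=float)
    # Closest point of each tile: the observer position clamped into the tile
    nearest = np.clip(observer, lower, lower + l_tile)
    d_min = np.linalg.norm(nearest - observer, axis=1)
    return redshift_of_distance(d_min)

z_final = tile_final_redshifts(1000.0, 8, observer=(10.0, 10.0, 10.0))
# Lowest final redshift first, i.e. the longest-running tiles are launched first
run_order = np.argsort(z_final)
```

In this example only the tile containing the observer has a final redshift of zero; all other tiles stop earlier, and those farthest from the observer stop first.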

The construction of light-cones for surveys with large aspect ratios, such as pencil-beam surveys, can further benefit from sCOLA. Indeed, tiles that do not intersect the three-dimensional survey window do not need to be run at all for the construction of mock catalogues. In such a case, the algorithm will still capture the effects of large-scale transverse modes, even if the simulated volume is not substantially increased with respect to the survey volume.

Low memory requirements. sCOLA divides the computational volume into much smaller tiles and vastly reduces the memory footprint of each independent sCOLA tile computation, as shown in Sect. 4.4. As an example, simulating a (16 Gpc h−1)³ volume containing 8192³ particles to achieve a mass resolution of 10^12.5 M⊙ requires ∼19.8 TB of RAM with a PM code and ∼33.0 TB of RAM with tCOLA. The setup {Ltile = 62.5 Mpc h−1, Lbuffer = 62.5 Mpc h−1} would break down the problem into 256³ tiles, each with (3 × 32)³ particles and a memory footprint of ∼53 MB. This has important consequences, which we explore in the following.

The very modest memory requirement of our algorithm opens up multiple possibilities to accelerate the computation: even on traditional systems, the entire computation of each sCOLA tile would fit entirely into the L3 cache of a multi-core processor. This would cut out the slowest parts of the memory hierarchy, leading to a large potential performance boost and reducing code complexity. Even more promising, many such tiles could be evolved entirely independently on GPU accelerators, or even dedicated FPGAs, taking advantage of hybrid architectures of modern computational platforms while reducing the need to develop sophisticated code to manage task parallelism. At this scale, each tile computation would even fit comfortably on ubiquitous small computational platforms such as mobile phones.

Grid computing. The perfect scalability achieved by our approach means that large N-body simulations can even be run on very inexpensive, strongly asynchronous networks designed for large throughput computing. An extreme example would be participatory computing platforms such as Cosmology@Home9, where tens of thousands of users donate computational resources. The use of such platforms would be particularly suited to light-cone computations, as described above. Even if running the low-redshift part necessitates dedicated hardware, other workers could efficiently work independently to compute most of the volume, which lives at high-redshift. Only two communication steps are required for each tile: the LPT potentials are received at the beginning, and at the end of the computation each tile returns its final state at the redshift where it intersects the light-cone.

Node Failures. Robustness to node failure is an important consideration on all very large computational platforms. Even with extremely low failure probability for each node, since the number of nodes is high, the probability that some node fails during the course of a computation becomes high. After its initialisation steps (see Sect. 3.1), our approach is entirely robust to such failure, since any individual tile can be recomputed after the fact on a modest system, for very little cost.
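The argument can be made quantitative with a two-line estimate. The sketch below assumes that nodes fail independently with the same probability q over the duration of a run and that each tile runs on its own node; the function names and the numerical values used to illustrate it are hypothetical.

```python
import math

def prob_any_failure(q_node, n_nodes):
    """Probability that at least one of n_nodes independent nodes fails."""
    # 1 - (1 - q)^N, evaluated stably for very small q
    return -math.expm1(n_nodes * math.log1p(-q_node))

def expected_tile_reruns(q_node, n_tiles):
    """Mean number of tiles to recompute if each tile runs on its own node."""
    return q_node * n_tiles

# Illustrative values only: a per-node failure probability of 1e-4
p_fail = prob_any_failure(1e-4, 10_000)
reruns = expected_tile_reruns(1e-4, 10_000)
```

With these illustrative numbers, some failure is more likely than not during the run, yet on average only a single tile has to be recomputed, whereas a monolithic run would be affected as a whole.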

5.2. Conclusion

In this paper, we introduced a perfectly parallel and easily applicable algorithm for cosmological simulations using sCOLA. Our approach is based on a tiling of the full simulation box, where each tile is run independently. By the use of buffer regions and appropriate Dirichlet boundary conditions, we improved the accuracy of the algorithm with respect to Tassev et al. (2015). In particular, we showed that suitable setups can reach 3% to 1% accuracy at all the scales simulated, as required for data analysis of the next generation of large-scale structure surveys. In case studies, we tested the relative impact of the two approximations involved in our approach, for density assignment and the boundary gravitational potential. We considered the computational cost of our algorithm and demonstrated that even if the CPU time needed is unavoidably higher, the wall-clock time and memory footprint can be drastically reduced.

This study opens up a wide range of possible extensions, discussed in Sect. 5.1. Benefiting from its perfect scalability, the approach could also allow for novel analyses of cosmological data from fully non-linear models previously too expensive to be tractable. It could straightforwardly be used for the construction of mock catalogues, but also within recently introduced likelihood-free inference techniques such as DELFI (Alsing et al. 2018), BOLFI (Leclercq 2018) and SELFI (Leclercq et al. 2019), which have a need for cheap simulator-based data models. We therefore anticipate that sCOLA will become an important tool in computational cosmology for the coming era.

Our perfectly parallel sCOLA algorithm has been implemented in the publicly available SIMBELMYNë code10, where it is included in version 0.4.0 and later.


2

For example, Oak-Ridge National Laboratories’ (ORNL) Summit machine has a typical power consumption of about 13 MW.

4

See for example ORNL’s next supercomputer, Frontier: https://www.olcf.ornl.gov/wp-content/uploads/2019/05/frontier_specsheet.pdf

5

There is a certain symmetry to this choice, since particles that would have moved into the buffer region from the outside are also neglected in the force calculation, due to the lack of communication between different sCOLA boxes.

7

This means that in the case of our new sCOLA algorithm, we use COLA both “in space and time” (see Tassev et al. 2015).

8

A potential exception is long-range radiative transport of energetic (X-ray or gamma ray) photons, requiring a non-trivial extension of the approach.

Acknowledgments

We are grateful to Matías Zaldarriaga for stimulating discussions and useful comments throughout the realisation of this project. We thank Jens Jasche and Svetlin Tassev for discussions that triggered this project, and Oliver Hahn for constructive observations. FL and BDW acknowledge the hospitality of the Institute for Advanced Study, Princeton, where this project was initiated. FL, BF and WJP thank the Institute of Cosmology and Gravitation of the University of Portsmouth, where part of this work was prepared. This work made use of NumPy (van der Walt et al. 2011), IPython (Perez & Granger 2007), Matplotlib (Hunter 2007), Jupyter notebooks (Kluyver et al. 2016), and the colourmaps provided by the cmocean (Thyng et al. 2016), and CMasher (https://github.com/1313e/CMasher) packages. FL acknowledges funding from the Imperial College London Research Fellowship Scheme. GL and BDW acknowledge financial support from the ANR BIG4, under reference ANR-16-CE23-0002. The Center for Computational Astrophysics is supported by the Simons Foundation. Research at Perimeter Institute is supported in part by the Government of Canada through the Department of Innovation, Science and Economic Development Canada and by the Province of Ontario through the Ministry of Colleges and Universities. Numerical computations were done on the Sciama High Performance Compute (HPC) cluster which is supported by the ICG, SEPNet and the University of Portsmouth; and on the cx1 cluster hosted by the Research Computing Service facilities at Imperial College London (doi:10.14469/hpc/2232). This work was done within the Aquila Consortium (https://aquila-consortium.org). FL and BDW conceived the project. FL wrote the SIMBELMYNë code, implemented the new sCOLA algorithm, ran the simulations, performed the study, supervised BF’s internship project, and wrote the bulk of the paper. BF contributed to the first implementation of the tiling algorithm in SIMBELMYNë and to early tests of the method. 
GL suggested the investigation of an alternative Poisson solver, proposed tests of the impact of boundary effects, and contributed to writing the paper. BDW made conceptual contributions, helped design the accuracy and speed tests, and contributed to writing the paper. AHJ prompted the use of the linearly-evolving potential approximation. AHJ and AFH contributed to the interpretation of results. WJP supported the design of the first version of the algorithm and contributed to student supervision. CN contributed to the collegial construction of the standards of science, by developing the methodological framework, the state-of-the-art, as well as post-publication procedures. All natural authors read and approved the final manuscript.

References

  1. Alimi, J. M., Bouillot, V., & Rasera, Y. 2012, DEUS Full Observable ΛCDM Universe Simulation: the numerical challenge
  2. Alsing, J., Wandelt, B., & Feeney, S. 2018, MNRAS, 477, 2874
  3. Amdahl, G. M. 1967, in Proceedings of the April 18–20, 1967, Spring Joint Computer Conference, AFIPS ’67 (Spring) (New York, NY, USA: Association for Computing Machinery), 483
  4. Aubert, D., Deparis, N., & Ocvirk, P. 2015, MNRAS, 454, 1012
  5. Audren, B., Lesgourgues, J., Bird, S., Haehnelt, M. G., & Viel, M. 2013, J. Cosmology Astropart. Phys., 2013, 026
  6. Austermann, J. E., Aird, K. A., & Beall, J. A. 2012, SPTpol: an instrument for CMB polarization measurements with the South Pole Telescope, SPIE Conf. Ser., 8452, 84521E
  7. Bagla, J. S., & Padmanabhan, T. 1994, MNRAS, 266, 227
  8. Barausse, E., Berti, E., & Hertog, T. 2020, Prospects for Fundamental Physics with LISA
  9. Berger, M. J., & Colella, P. 1989, J. Comp. Phys., 82, 64
  10. Bernardeau, F., Colombi, S., Gaztañaga, E., & Scoccimarro, R. 2002, Phys. Rep., 367, 1
  11. Birdsall, C. K., & Langdon, A. B. 1985, Plasma Physics via Computer Simulation (CRC Press)
  12. Bouchet, F. R., Colombi, S., Hivon, E., & Juszkiewicz, R. 1995, A&A, 296, 575
  13. Brainerd, T. G., Scherrer, R. J., & Villumsen, J. V. 1993, ApJ, 418, 570
  14. Bryan, G. L., Norman, M. L., O’Shea, B. W., et al. 2014, ApJS, 211, 19
  15. Buchert, T., Melott, A. L., & Weiß, A. G. 1994, A&A, 288, 349
  16. Cheng, S., Yu, H.-R., Inman, D., et al. 2020, CUBE - Towards an Optimal Scaling of Cosmological N-body Simulations
  17. Chisari, N. E., Richardson, M. L. A., Devriendt, J., et al. 2018, MNRAS, 480, 3962
  18. DESI Collaboration 2016, The DESI Experiment Part I: Science, Targeting, and Survey Design
  19. Eisenstein, D. J., & Hu, W. 1998, ApJ, 496, 605
  20. Eisenstein, D. J., & Hu, W. 1999, ApJ, 511, 5
  21. Frigo, M., & Johnson, S. G. 2005, Program Generation, Optimization, and Platform Adaptation, Proc. IEEE, 93, 216
  22. Fryxell, B., Olson, K., Ricker, P., et al. 2000, ApJS, 131, 273
  23. Garrison, L. H., Eisenstein, D. J., & Pinto, P. A. 2019, MNRAS, 485, 3370
  24. Gonnet, P., Schaller, M., Theuns, T., & Chalk, A. B. G. 2013, SWIFT: Fast Algorithms for Multi-resolution SPH on Multi-Core Architectures
  25. Guillet, T., & Teyssier, R. 2011, J. Comp. Phys., 230, 4756
  26. Hahn, O., & Abel, T. 2011, MNRAS, 415, 2101
  27. Hahn, O., Angulo, R. E., & Abel, T. 2015, MNRAS, 454, 3920
  28. Hockney, R. W., & Eastwood, J. W. 1981, Computer Simulation Using Particles (McGraw-Hill)
  29. Howlett, C., Manera, M., & Percival, W. J. 2015, Astron. Comput., 12, 109
  30. Hunter, J. D. 2007, Comput. Sci. Eng., 9, 90
  31. Huterer, D., & Takada, M. 2005, Astroparticle Phys., 23, 369
  32. Izard, A., Crocce, M., & Fosalba, P. 2016, MNRAS, 459, 2327
  33. James, R. A. 1977, J. Comp. Phys., 25, 71
  34. Johnson, S. G., & Frigo, M. 2008, in Fast Fourier Transforms, ed. C. S. Burrus (Houston TX: Rice University Connexions)
  35. Kluyver, T., Ragan-Kelley, B., & Pérez, F. 2016, ELPUB
  36. Klypin, A., & Holtzman, J. 1997, Particle-Mesh code for cosmological simulations
  37. Knebe, A., & Doumler, T. 2010, AMIGA: Adaptive Mesh Investigations of Galaxy Assembly
  38. Koda, J., Blake, C., Beutler, F., Kazin, E., & Marin, F. 2016, MNRAS, 459, 2118
  39. Laureijs, R., Amiaux, J., & Arduini, S. 2011, Euclid Definition Study Report
  40. Leclercq, F. 2015, Ph.D. Thesis, Institut d’Astrophysique de Paris
  41. Leclercq, F. 2018, Phys. Rev. D, 98, 063511
  42. Leclercq, F., Jasche, J., & Wandelt, B. 2015, J. Cosmol. Astropart. Phys., 6, 15
  43. Leclercq, F., Jasche, J., Lavaux, G., Wandelt, B., & Percival, W. 2017, J. Cosmol. Astropart. Phys., 6, 049
  44. Leclercq, F., Enzi, W., Jasche, J., & Heavens, A. 2019, MNRAS, 490, 4237
  45. LIGO Scientific Collaboration 2015, Class. Quant. Grav., 32, 074001
  46. LSST Science Collaboration 2012, Large Synoptic Survey Telescope: Dark Energy Science Collaboration
  47. Merloni, A., Predehl, P., Becker, W., et al. 2012, eROSITA Science Book: Mapping the Structure of the Energetic Universe
  48. Ocvirk, P., Gillet, N., Shapiro, P. R., et al. 2016, MNRAS, 463, 1462
  49. Perez, F., & Granger, B. E. 2007, Comput. Sci. Eng., 9, 21
  50. Planck Collaboration XIII. 2016, A&A, 594, A13
  51. Potter, D., Stadel, J., & Teyssier, R. 2017, Comput Astrophys. Cosmology., 4, 2
  52. Quinn, T., Katz, N., Stadel, J., & Lake, G. 1997, Time stepping N-body simulations
  53. Schneider, A., Teyssier, R., Stadel, J., et al. 2019, J. Cosmol. Astropart. Phys., 3, 020
  54. Simon, S. M., Beall, J. A., Cothard, N. F., et al. 2018, J. Low. Temp. Phys., 193, 1041
  55. Simons Observatory Collaboration 2019, J. Cosmol. Astropart. Phys., 2, 056
  56. SPHEREx Science Team 2018, Am. Astron. Soc. Meet. Abstr., 231, 354.21
  57. Square Kilometre Array Cosmology Science Working Group 2018, Cosmology with Phase 1 of the Square Kilometre Array; Red Book 2018: Technical specifications and performance forecasts
  58. Strauss, M. A., Cen, R., Ostriker, J. P., Lauer, T. R., & Postman, M. 1995, ApJ, 444, 507
  59. Tassev, S., & Zaldarriaga, M. 2012, J. Cosmol. Astropart. Phys., 12, 11
  60. Tassev, S., Eisenstein, D. J., Wandelt, B. D., & Zaldarriaga, M. 2015, sCOLA: The N-body COLA Method Extended to the Spatial Domain
  61. Tassev, S., Zaldarriaga, M., & Eisenstein, D. J. 2013, J. Cosmol. Astropart. Phys., 6, 36
  62. Teyssier, R. 2002, A&A, 385, 337
  63. The Virgo Collaboration 2020, J. Phys. Conf. Ser., 1342, 012010
  64. Theuns, T., Chalk, A., Schaller, M., & Gonnet, P. 2015, SWIFT: task-based Hydrodynamics and Gravity for Cosmological Simulations
  65. Thyng, K. M., Greene, C. A., Hetland, R. D., Zimmerle, H. M. 2016, Oceanography, 29
  66. van Daalen, M. P., Schaye, J., Booth, C. M., & Dalla Vecchia, C. 2011, MNRAS, 415, 3649
  67. van der Walt, S., Colbert, S. C., & Varoquaux, G. 2011, Comput. Sci. Eng., 13, 22
  68. Yu, H.-R., Pen, U.-L., & Wang, X. 2018, ApJS, 237, 24
  69. Zel’dovich, Y. B. 1970, A&A, 5, 84

Appendix A: Model equations

A.1. Model equations in the standard PM code

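Denoting by a the scale factor of the Universe and τ the conformal time, a PM code solves the equations of motion for the position x and momentum p of dark matter particles in comoving coordinates (the mass of particles m is absorbed in the definition of the momentum p):

$$\frac{\mathrm{d}\mathbf{x}}{\mathrm{d}\tau} = \frac{\mathbf{p}}{a}, \tag{A.1}$$

$$\frac{\mathrm{d}\mathbf{p}}{\mathrm{d}\tau} = -a \nabla_\mathbf{x} \Phi, \tag{A.2}$$

coupled to the Poisson equation for the gravitational potential, sourced by density fluctuations (Eq. (8)),

$$\Delta_\mathbf{x} \Phi = 4\pi G a^2 \bar{\rho}(\tau)\, \delta(\mathbf{x},\tau), \tag{A.3}$$

where G is the gravitational constant and $\bar{\rho}(\tau)$ is the mean matter density at conformal time τ. The density contrast is defined from the local matter density ρ(x, τ) by

$$\delta(\mathbf{x},\tau) \equiv \frac{\rho(\mathbf{x},\tau)}{\bar{\rho}(\tau)} - 1. \tag{A.4}$$

For simplicity, from now on we write $\nabla_\mathbf{x} = \nabla$, $\Delta_\mathbf{x} = \Delta$ and δ(x, τ) = δ.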

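It is convenient to choose the scale factor as time variable. Using ∂τ = a′ ∂a and the background evolution $\bar{\rho}(\tau) = \bar{\rho}^{(0)} a^{-3}$ (a prime denotes a differentiation with respect to τ and the superscript (0) denotes quantities at the present time), the equations to solve are rewritten:

$$\frac{\mathrm{d}\mathbf{x}}{\mathrm{d}a} = \frac{\mathbf{p}}{a\,a'}, \tag{A.5}$$

$$\frac{\mathrm{d}\mathbf{p}}{\mathrm{d}a} = -\frac{a}{a'} \nabla \Phi, \tag{A.6}$$

$$\Delta \Phi = 4\pi G \bar{\rho}^{(0)} a^{-1}\, \delta. \tag{A.7}$$

We will use the equivalent formulation

$$\frac{\mathrm{d}\mathbf{x}}{\mathrm{d}a} = \mathcal{D}(a)\, \mathbf{p}, \qquad \mathcal{D}(a) \equiv \frac{1}{a^2 \mathcal{H}(a)}, \tag{A.8}$$

$$\frac{\mathrm{d}\mathbf{p}}{\mathrm{d}a} = \mathcal{K}(a)\, \nabla\left(\Delta^{-1} \delta\right), \qquad \mathcal{K}(a) \equiv -\frac{4\pi G \bar{\rho}^{(0)}}{a\, \mathcal{H}(a)}, \tag{A.9}$$

where we have combined Eqs. (A.6) and (A.7), introduced the conformal Hubble factor ℋ(a)≡a′/a, and defined the “drift prefactor” 𝒟(a) and the “kick prefactor” 𝒦(a).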
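As a concrete illustration of the two prefactors, the sketch below evaluates them for a flat ΛCDM background. It assumes the forms 𝒟(a) = 1/(a²ℋ(a)) and 𝒦(a) = −(3/2) Ωm H0² / (aℋ(a)), which follow from the standard comoving equations of motion with the particle mass absorbed in p and with 4πG times the present-day mean density written as (3/2) Ωm H0². The function names, the default parameter values and the neglect of radiation are choices made for this example and are not taken from the SIMBELMYNë code.

```python
import numpy as np

def conformal_hubble(a, H0=100.0, Omega_m=0.3):
    """Conformal Hubble factor a'/a = a*H(a) for flat LCDM (radiation neglected: an assumption)."""
    a = np.asarray(a, dtype=float)
    return a * H0 * np.sqrt(Omega_m / a**3 + (1.0 - Omega_m))

def drift_prefactor(a, H0=100.0, Omega_m=0.3):
    """Assumed form D(a) = 1 / (a^2 * conformal Hubble factor)."""
    a = np.asarray(a, dtype=float)
    return 1.0 / (a**2 * conformal_hubble(a, H0, Omega_m))

def kick_prefactor(a, H0=100.0, Omega_m=0.3):
    """Assumed form K(a) = -(3/2) Omega_m H0^2 / (a * conformal Hubble factor)."""
    a = np.asarray(a, dtype=float)
    return -1.5 * Omega_m * H0**2 / (a * conformal_hubble(a, H0, Omega_m))
```

In the Einstein-de Sitter limit (Ωm = 1) these reduce to the power laws 𝒟(a) = a^(−3/2)/H0 and 𝒦(a) = −(3/2) H0 a^(−1/2), which gives a quick sanity check of any implementation.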

A.2. Model equations with COLA

We now introduce the COLA scheme, following Tassev et al. (2013, 2015). For each particle, we work in the frame comoving with its LPT observer, whose position is given by (see Sect. 2.1)

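$$\mathbf{x}_\mathrm{LPT}(a) \equiv \mathbf{q} - D_1(a)\, \boldsymbol{\Psi}_1 + D_2(a)\, \boldsymbol{\Psi}_2, \tag{A.10}$$

where we have introduced the time-independent vectors $\boldsymbol{\Psi}_1 \equiv \nabla_\mathbf{q} \phi^{(1)}$ and $\boldsymbol{\Psi}_2 \equiv \nabla_\mathbf{q} \phi^{(2)}$. Writing x(a) = xLPT(a) + xres(a) for the final position of the same particle, we have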

with

(A.11)

We also define pres such that dxres/da ≡ 𝒟(a)pres. Then p = pLPT + pres (see Eq. (A.8)). Furthermore,

(A.12)

where the differential operator 𝒱[⋅](a) is defined by

(A.13)

With these notations, Eq. (A.9) reads

(A.14)

In COLA, the natural variables are therefore x and pres.

As mentioned in Sect. 2.1, the key point in COLA is that the fictitious LPT force acting on particles, 𝒱[xLPT](a), can be computed analytically. From Eq. (A.10), it is straightforward to check that 𝒱[xLPT](a) = − 𝒱[D1](a)Ψ1 + 𝒱[D2](a)Ψ2. The computation of 𝒱[D1](a) and 𝒱[D2](a) uses the differential equations followed by the linear and second-order growth factors, as well as the second Friedmann equation. The result is (see e.g. Leclercq 2015, Eqs. (1.7), (1.96), (1.118) and Appendix B)

(A.15)

(A.16)

The equations of motion to solve are therefore, in tCOLA,

(A.17)

(A.18)

These are mathematically equivalent to the equations of motion of a PM code (Eqs. (A.8) and (A.9)). In sCOLA, the “kick equation” (Eq. (A.18)) is replaced for each particle of the sCOLA box by the approximation (Tassev et al. 2015)

(A.19)

with the notations introduced in Sect. 2.3, as well as and . Importantly, the “drift equation” (Eq. (A.17)) is not modified, since we are always, by definition, computing a residual displacement with respect to the LPT observer of the full box, whose position is given by Eq. (A.10).

Appendix B: Standard and modified time-stepping

B.1. Time-stepping in the standard PM algorithm

In this paper, we adopt the second-order symplectic “kick-drift-kick” algorithm, also known as the leapfrog scheme (e.g. Birdsall & Langdon 1985), to integrate the equations of motion. This algorithm relies on integrating the equations over a small time-step and approximating the momenta (p in the “drift equation” (A.8)) and accelerations (∇(Δ⁻¹δ) in the “kick equation” (A.9)) that appear in the integrands by their value at some time within the interval. This defines the Drift (D) and Kick (K) operators, which read using the standard discretisation (Quinn et al. 1997):

(B.1)

(B.2)

where

(B.3)

and t is a function of the scale factor a (typically t(a) = a or t(a) = ln(a) for time-steps linearly spaced or logarithmically spaced in the scale factor, respectively).

The time evolution between t0 = t(ai) and tn + 1 = t(af) is then achieved by applying the following operator, E(tn + 1, t0), to the initial state (x(t0),p(t0)):

(B.4)
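The kick-drift-kick composition can be sketched generically as follows. This is a toy illustration rather than the cosmological operators themselves: the drift and kick prefactors are set to one, so that the time factors reduce to the step length, and `kdk_evolve`, `accel` and the harmonic test force are hypothetical names introduced for this example.

```python
import numpy as np

def kdk_evolve(x, p, accel, t0, t1, n_steps):
    """Evolve (x, p) from t0 to t1 with n_steps kick-drift-kick steps.

    Toy system dx/dt = p, dp/dt = accel(x), i.e. unit drift and kick prefactors.
    """
    x = np.array(x, dtype=float)
    p = np.array(p, dtype=float)
    dt = (t1 - t0) / n_steps
    for _ in range(n_steps):
        p = p + 0.5 * dt * accel(x)   # half kick, force evaluated at the start of the step
        x = x + dt * p                # full drift with the half-step momenta
        p = p + 0.5 * dt * accel(x)   # half kick, force evaluated at the updated positions
    return x, p

def harmonic(x):
    """Toy restoring force, standing in for the gravitational acceleration."""
    return -x
```

For example, `kdk_evolve([1.0], [0.0], harmonic, 0.0, 1.0, 100)` follows the exact solution x = cos t closely, and halving the step reduces the error by about a factor of four, as expected for a second-order scheme. In a production code the two consecutive half kicks of neighbouring steps are usually merged into a single full kick.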

B.2. Time-stepping with COLA, standard discretisation

Using the standard discretisation (Quinn et al. 1997) of Eqs. (A.17) and (A.18), the Kick and Drift operators for tCOLA are defined by

(B.5)

(B.6)

where the time factors and are the same as in the PM case (see Eq. (B.3)). For sCOLA, is given by the same expression (Eq. (B.6)) but operates on quantities and differential operators superscripted “sCOLA” consistently with Eq. (A.19).

In the initial conditions, generated with LPT, we have p = pLPT; therefore the momentum residual in the rest frame of LPT observers, pres, should be initialised to zero. At the end, the LPT momentum pLPT has to be added to pres to recover the full momentum of particles, p. This corresponds respectively to the L− and L+ operators (Tassev et al. 2013, Appendix A), given by

(B.7)

In COLA, the time evolution between t0 = t(ai) and tn + 1 = t(af) is therefore achieved by applying the following operator to the initial state (x(t0),p(t0)):

(B.8)

where is the operator given by Eq. (B.4), replacing D by and K by .
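The initialise, evolve, restore sequence described above can be illustrated on a toy problem. The sketch below is an analogue under strong simplifying assumptions, not the cosmological scheme: it uses unit drift and kick prefactors, a one-dimensional harmonic force, and a hypothetical analytic reference trajectory x_ref(t) = cos t that plays the role of the LPT observer. The momentum residual is obtained by subtracting the reference momentum, evolved with kick-drift-kick steps in which the reference displacement is added exactly and the reference acceleration is subtracted from the force, and the reference momentum is added back at the end.

```python
import numpy as np

# Hypothetical analytic reference trajectory (stand-in for the LPT observer).
def x_ref(t):
    return np.cos(t)

def p_ref(t):
    return -np.sin(t)          # d x_ref / dt

def a_ref(t):
    return -np.cos(t)          # d p_ref / dt, the "fictitious force" of the reference frame

def force(x):
    return -x                  # toy harmonic force

def cola_like_evolve(x0, p0, t0, t1, n_steps):
    """Kick-drift-kick on the momentum residual about the reference trajectory."""
    dt = (t1 - t0) / n_steps
    x, t = float(x0), float(t0)
    p_res = p0 - p_ref(t0)                              # subtract the reference momentum
    for _ in range(n_steps):
        p_res += 0.5 * dt * (force(x) - a_ref(t))       # half kick on the residual
        x += dt * p_res + (x_ref(t + dt) - x_ref(t))    # residual drift + exact reference displacement
        t += dt
        p_res += 0.5 * dt * (force(x) - a_ref(t))       # half kick on the residual
    return x, p_res + p_ref(t)                          # add the reference momentum back
```

When the particle starts on the reference trajectory, the residual stays at zero and a single step reproduces the exact solution. When it starts slightly off, only the small residual is integrated numerically, so a handful of steps already gives a small error. This is the property that COLA exploits with LPT as the reference.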

B.3. Time-stepping with COLA, modified discretisation

Another approach for the discretisation of Eqs. (A.17) and (A.18) is proposed by Tassev et al. (2013). For any arbitrary positive function u of t, we can rewrite

(B.9)

(B.10)

This form is particularly relevant if pres has a time dependence which is entirely captured by a particular u(t), which is universal for all particles. For each equation, considering that the part between curly brackets is constant during the time-step (instead of the momentum and accelerations, respectively), the modified and operators are given by Eqs. (B.5) and (B.6) with the following modified time factors instead of and :

(B.11)

where in , we have used the trivial integration .

Using the Ansatz suggested by Tassev et al. (2013), $u(a) = a^{n_\mathrm{LPT}}$ when t(a) = a, we get the explicit expressions

(B.12)

We adopt this form and nLPT = −2.5 for both tCOLA and sCOLA operators, throughout this paper.

Appendix C: Poisson solver with Dirichlet boundary conditions

In this appendix, we describe how to compute the interior gravitational potential Φ with Dirichlet boundary conditions. The method is standard in computational physics and has been used at least since James (1977). Formally, we seek to solve the discrete Poisson equation,

$$\Delta \Phi = \delta, \tag{C.1}$$

subject to a known boundary potential ΦBCs, where Δ is the FDA to the exact Laplacian operator, i.e. where , , and are discrete one-dimensional second-order derivatives (see Table 1 in Hahn & Abel 2011, for their expressions in FDA at order 2, 4 and 6).

C.1. Modified density distribution

We define ΦBCs as having non-zero values only in a layer of Nghost cells immediately outside the active domain of the PM grid. We can then write the desired potential as $\Phi = \tilde{\Phi} + \Phi_\mathrm{BCs}$, where the required boundary condition for $\tilde{\Phi}$ is $\tilde{\Phi} = 0$. From the definition of ΦBCs, ΔΦBCs will be non-zero only in a layer of Nghost active cells just inside the domain boundaries. We can thus define a modified density distribution,

$$\tilde{\delta} \equiv \delta - \Delta \Phi_\mathrm{BCs}, \tag{C.2}$$

which is the same as δ everywhere except in the layer of Nghost cells adjoining the domain boundaries. We can then employ a zero-boundary condition Poisson solver to obtain a solution of $\Delta \tilde{\Phi} = \tilde{\delta}$ (see Sect. C.2). Within the interior, where ΦBCs = 0, this solution is the desired final solution of ΔΦ = δ with the Dirichlet boundary condition Φ = ΦBCs.
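The modified-source construction can be checked on a one-dimensional example. The following sketch assumes the second-order finite-difference Laplacian, a single ghost cell on each side (Nghost = 1) and unit grid spacing. The function names are hypothetical, and the zero-boundary solve is done with a dense matrix for transparency rather than with sine transforms.

```python
import numpy as np

def laplacian_matrix(n):
    """Order-2 FDA Laplacian on n interior cells, with zero values outside (unit spacing)."""
    return (np.diag(-2.0 * np.ones(n))
            + np.diag(np.ones(n - 1), 1)
            + np.diag(np.ones(n - 1), -1))

def solve_dirichlet(delta, phi_left, phi_right):
    """Solve Lap(Phi) = delta on the interior, given the ghost-cell values of Phi on both sides."""
    n = delta.size
    # Laplacian of the boundary potential, restricted to the interior:
    # non-zero only in the cells adjoining the boundary.
    lap_phi_bcs = np.zeros(n)
    lap_phi_bcs[0] = phi_left
    lap_phi_bcs[-1] = phi_right
    delta_mod = delta - lap_phi_bcs                          # modified density
    return np.linalg.solve(laplacian_matrix(n), delta_mod)   # zero-boundary solve
```

A convenient test case is a quadratic potential sampled on the grid, whose second-order finite-difference Laplacian is exactly constant: the solver should then return the quadratic at the interior cells.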

C.2. Zero-boundary condition Poisson solver

In cosmological simulations, it is conventional to use FFTs to solve the Poisson equation, since the discrete Laplacian operator is diagonal in Fourier space. FFTs assume that the input source has periodic boundary conditions. Similarly, for zero boundary conditions, we can work with three-dimensional type-I discrete sine transforms (DST-I), defined by

(C.3)

where δi, j, k is the value of the source field in the voxel indexed by 1 ≤ i ≤ Nx, 1 ≤ j ≤ Ny, 1 ≤ k ≤ Nz (Nx = Ny = Nz = Ng in this paper). The basis functions are defined by

(C.4)

They ensure that the signal has zero boundary values (for i ∈ {0, Nx + 1} or j ∈ {0, Ny + 1} or k ∈ {0, Nz + 1}). They satisfy discrete orthogonality relations, for example,

(C.5)

where δK is the Kronecker symbol. The inverse transformation is simply DST-I multiplied by 8/[(Nx+1)(Ny+1)(Nz+1)], i.e.

(C.6)

and similarly for the gravitational potential,

(C.7)

It is straightforward to show that , , are eigenfunctions of the discrete one-dimensional second-order derivatives , , and , respectively. The corresponding eigenvalues , and are given by

(C.8)

(C.9)

(C.10)

for FDA at order 2, 4, and 6 respectively, where k ≡ 2πℓ/Lx, dx ≡ Lx/Nx and Lx is the size of the box along the x-direction (Lx = LsCOLA ≡ L/Ntiles in this paper). Similar expressions exist for and .

Plugging Eqs. (C.6) and (C.7) into (C.1) and using the orthogonality relations, we obtain a simple form for the discretised Poisson equation in sine space,

(C.11)

Therefore, the Poisson equation ΔΦ = δ with zero boundary conditions can be solved by the following three steps:

  1. performing a forward DST of the source (δi, j, k → δℓ, m, n), according to Eq. (C.3) (costing 𝒪(NxNyNzlog[NxNyNz]) operations),

  2. solving the Poisson equation in sine space (δℓ, m, n → Φℓ, m, n), according to Eq. (C.11) (costing 𝒪(NxNyNz) operations),

  3. performing an inverse DST of the gravitational potential (Φℓ, m, n → Φi, j, k), according to Eq. (C.7) (costing 𝒪(NxNyNzlog[NxNyNz]) operations).
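The three steps above can be sketched with SciPy's sine transforms. This is an illustrative sketch, not the SIMBELMYNë implementation. It assumes the second-order FDA, for which the eigenvalue of the one-dimensional second derivative on sine mode ℓ is −(4/dx²) sin²(πℓ/(2(N+1))), and unit grid spacing by default. Note that `scipy.fft.dstn` with `type=1` includes a factor of 2 per dimension relative to a bare sum over sine functions; `idstn` applies the matching normalisation, so the convention drops out of the final result.

```python
import numpy as np
from scipy.fft import dstn, idstn

def fd2_eigenvalues(n, dx=1.0):
    """Eigenvalues of the order-2 FDA second derivative on the n DST-I sine modes."""
    ell = np.arange(1, n + 1)
    return -(4.0 / dx**2) * np.sin(np.pi * ell / (2.0 * (n + 1)))**2

def solve_poisson_zero_bc(delta, dx=1.0):
    """Solve Lap(Phi) = delta with Phi = 0 just outside the grid."""
    nx, ny, nz = delta.shape
    lam = (fd2_eigenvalues(nx, dx)[:, None, None]
           + fd2_eigenvalues(ny, dx)[None, :, None]
           + fd2_eigenvalues(nz, dx)[None, None, :])
    delta_hat = dstn(delta, type=1)    # step 1: forward DST-I of the source
    phi_hat = delta_hat / lam          # step 2: divide by the Laplacian eigenvalues
    return idstn(phi_hat, type=1)      # step 3: inverse DST-I of the potential

def fd2_laplacian_zero_bc(phi, dx=1.0):
    """Apply the order-2 FDA Laplacian with zeros outside the grid (used only for checking)."""
    padded = np.pad(phi, 1)
    lap = -6.0 * padded[1:-1, 1:-1, 1:-1]
    lap += padded[2:, 1:-1, 1:-1] + padded[:-2, 1:-1, 1:-1]
    lap += padded[1:-1, 2:, 1:-1] + padded[1:-1, :-2, 1:-1]
    lap += padded[1:-1, 1:-1, 2:] + padded[1:-1, 1:-1, :-2]
    return lap / dx**2
```

Applying `fd2_laplacian_zero_bc` to the output of `solve_poisson_zero_bc` should return the input source to round-off. Since all eigenvalues are strictly negative, there is no zero mode to treat separately, unlike in the periodic FFT case.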

In practice, forward and inverse DSTs are performed using the FFTW library (Frigo & Johnson 2005), publicly available11, where the DST-I is known as FFTW_RODFT00.

All Tables

Table 1.

Nomenclature of symbols used in the present article.

Table 2.

Different setups used to test the accuracy and speed of our sCOLA algorithm.

All Figures

Fig. 1.

Functional diagram of sCOLA (left) versus tCOLA (right). The grey boxes are common steps. sCOLA specific steps are represented in blue, and tCOLA specific steps in red. The yellow rectangle constitutes the perfectly parallel section, within which no communication is required with the master process or between processes. Arrows represent dependencies, and references to the main text are given between parentheses.

Fig. 2.

Illustration of the different grids used within sCOLA. The Lagrangian lattice is represented by dashed lines. For each tile, central particles (in black) are surrounded by buffer particles (in cyan), which are ignored at the end of the evolution. The corresponding buffer region in other grids is represented in cyan. The left panel represents the “LPT grid” on which the Lagrangian potentials $\phi^{(1)}$ and $\phi^{(2)}$ are defined. The central region has grid points (in red) and is padded by 2Nghost cells in each direction (pink region). The right panel shows the “PM grid” on which the density contrast δsCOLA, the gravitational potential ΦsCOLA, and the accelerations are defined. The density contrast is defined only in the central region (which has grid points, in dark green). The gravitational potential is padded by 2Nghost cells in each direction (light green and yellow regions), and the gridded accelerations only by Nghost cells in each direction (yellow region). Solving the Poisson equation requires Dirichlet boundary conditions in six layers of Nghost cells, denoted as hatched regions. For simplicity of representation, we have used here Nghost = 1.

Fig. 3.

Qualitative assessment of the redshift-zero density field from sCOLA for different tilings and buffer sizes, with respect to tCOLA. The bottom right panel shows the reference tCOLA density field in a 200 Mpc h−1 box with periodic boundary conditions (the quantity represented is ln(2 + δtCOLA) where δtCOLA is the density contrast). Other panels show the difference between sCOLA and tCOLA density fields, ln(2 + δsCOLA)−ln(2 + δtCOLA), for different sizes of tile and buffer region, as indicated above the panels. The tiling is represented by dashed lines, and the central tile’s buffer region is represented by solid lines. In the third dimension, the slices represented intersect the central tile at its centre. As can be observed in this figure, artefacts are predominantly located close to the boundaries of tiles; they are reduced with increasing tile size and buffer region size.

Fig. 4.

Same as Fig. 3, but for one component of the velocity field, in km/s. Bulk flows are correctly captured if tiles and their buffer regions are large enough. Residual differences inside halos can be observed, but they are expected due to the limited number of time-steps, rendering both tCOLA and sCOLA velocities inaccurate in the deeply non-linear regime.

Fig. 5.

Power spectrum relative to tCOLA (top panel) and cross-correlation with respect to tCOLA (bottom panel) of redshift-zero sCOLA density fields, in a 200 Mpc h−1 box containing 512³ dark matter particles. Different sizes for the tiles (represented by different line styles) and buffer regions (represented by different colours) are used, as indicated in the legend. The vertical lines show the respective fundamental mode of different tiles, the light grey bands correspond to 3% accuracy, and the dark grey bands to 1% accuracy.

Fig. 6.

Same as Fig. 5, but in a 1 Gpc h−1 box containing 1024³ particles.

Fig. 7.

Tests of the approximations made in sCOLA for the density field and the gravitational potential. As in Fig. 5, the diagnostic tools are the power spectrum relative to tCOLA (top panel) and the cross-correlation with tCOLA (bottom panel). Our sCOLA algorithm uses the approximate interior density field δsCOLA and the LEP approximation for the boundary gravitational potential (dash-dotted blue line). In other simulations, as indicated in the legend, we use the true density field δ and/or the true gravitational potential Φ at the boundaries. The approximation made for the density field dominates, especially at large scales.

Fig. 8.

Memory requirements (first row) and timings for two corresponding tCOLA and sCOLA simulations. Although the CPU time required is higher for sCOLA, the memory consumption and wall-clock time are significantly reduced with respect to tCOLA, due to the perfectly parallel nature of most computations (second row). In the middle left panel, the height of the white bar shows the hypothetical cost of running tCOLA for the same volume as simulated with sCOLA, when taking buffer regions into account. The relative contributions of different operations, as detailed in the legend, are shown in the third row. The main difference in computational cost in sCOLA with respect to tCOLA comes from the use of DSTs instead of FFTs, which makes the evaluation of the potential significantly more expensive.
