A&A, Volume 639, July 2020
Article Number: A91
Number of pages: 18
Section: Cosmology (including clusters of galaxies)
DOI: https://doi.org/10.1051/0004-6361/202037995
Published online: 14 July 2020
Perfectly parallel cosmological simulations using spatial comoving Lagrangian acceleration
^{1} Imperial Centre for Inference and Cosmology (ICIC) & Astrophysics Group, Imperial College London, Blackett Laboratory, Prince Consort Road, London SW7 2AZ, UK
e-mail: florent.leclercq@polytechnique.org
^{2} AIM, CEA, CNRS, Université Paris-Saclay, Université Paris Diderot, Sorbonne Paris Cité, 91191 Gif-sur-Yvette, France
^{3} CNRS, Sorbonne Université, UMR 7095, Institut d’Astrophysique de Paris, 98 bis bd Arago, 75014 Paris, France
^{4} Sorbonne Université, Institut Lagrange de Paris (ILP), 98 bis bd Arago, 75014 Paris, France
^{5} Center for Computational Astrophysics, Flatiron Institute, 162 5th Avenue, 10010 New York, NY, USA
^{6} Waterloo Centre for Astrophysics, University of Waterloo, 200 University Ave W, Waterloo, ON N2L 3G1, Canada
^{7} Department of Physics and Astronomy, University of Waterloo, 200 University Ave W, Waterloo, ON N2L 3G1, Canada
^{8} Perimeter Institute for Theoretical Physics, 31 Caroline St. North, Waterloo, ON N2L 2Y5, Canada
Received: 20 March 2020 / Accepted: 15 May 2020
Context. Existing cosmological simulation methods lack a high degree of parallelism due to the long-range nature of the gravitational force, which limits the size of simulations that can be run at high resolution.
Aims. To solve this problem, we propose a new, perfectly parallel approach to simulate cosmic structure formation, which is based on the spatial COmoving Lagrangian Acceleration (sCOLA) framework.
Methods. Building upon a hybrid analytical and numerical description of particles’ trajectories, our algorithm allows for an efficient tiling of a cosmological volume, where the dynamics within each tile is computed independently. As a consequence, the degree of parallelism is equal to the number of tiles. We optimised the accuracy of sCOLA through the use of a buffer region around tiles and of appropriate Dirichlet boundary conditions around sCOLA boxes.
Results. As a result, we show that cosmological simulations at the degree of accuracy required for the analysis of the next generation of surveys can be run in drastically reduced wall-clock times and with very low memory requirements.
Conclusions. The perfect scalability of our algorithm unlocks profoundly new possibilities for computing larger cosmological simulations at high resolution, taking advantage of a variety of hardware architectures.
Key words: large-scale structure of Universe / methods: numerical
© ESO 2020
1. Introduction
We live in the age of large astronomical surveys. These surveys detect and record tracers of cosmic structure across vast volumes of the Universe, using electromagnetic and gravitational waves. A non-exhaustive list includes optical and infrared imaging and spectroscopic surveys such as LSST (LSST Science Collaboration 2012), Euclid (Laureijs et al. 2011), DESI (DESI Collaboration 2016), and SPHEREx (SPHEREx Science Team 2018); catalogues and intensity maps from large radio surveys such as the Square Kilometre Array (Square Kilometre Array Cosmology Science Working Group 2018) and its precursors; cluster catalogues from high-resolution observations of the microwave sky (Advanced ACTPol, Simon et al. 2018; SPTpol, Austermann et al. 2012; Simons Observatory, Simons Observatory Collaboration 2019; and CMB-S4); X-ray surveys such as the eROSITA mission (Merloni et al. 2012); as well as gravitational wave sirens across cosmological volumes with successive updates of (Advanced) LIGO (LIGO Scientific Collaboration 2015), Virgo (The Virgo Collaboration 2020) and LISA (Barausse et al. 2020). Whilst these data sets will be prodigious sources of scientific discovery across astrophysics, their enormous volume and dense sampling of cosmic structure will make them uniquely powerful when studying some of the deepest scientific mysteries of our time: the statistical properties of the primordial perturbations, the nature of dark matter, and the physical properties of dark energy. Indeed, many of these surveys were conceived to address these questions.
Accomplishing this promise requires the ability to model these surveys in sufficient detail and with sufficient accuracy. All but the most simplistic models require the production of cosmological light-cone simulations. In particular, cosmological inferences often rely on large numbers of mock catalogues, which are used to construct unbiased estimators and study their statistical properties, such as covariance matrices. As surveys are getting deeper, these mock catalogues now need to represent a sizeable portion of the observable Universe, up to a redshift of ∼2–3 (e.g. z = 2.3 for the Euclid Flagship simulation^{1}). Unfortunately, cosmological simulations put a heavy load on supercomputers. Even if only dark matter is included and resolution is minimised, they can require millions of CPU hours and hundreds of terabytes of disk space to solve the gravitational evolution of billions of particles and store the corresponding data. For instance, the DEUS FUR simulation (Alimi et al. 2012), containing 8192^{3} particles in a box of 21 Gpc h^{−1} side length, required 10 million hours of CPU time and 300 TB of storage.
While computational needs are soaring, the performance of individual compute cores attained a plateau around 2015. Traditional hardware architectures are reaching their physical limit. Therefore, cosmological simulations cannot merely rely on processors becoming faster to reduce the computational time. Current hardware development focuses on increasing power efficiency^{2} and solving problems of heat dissipation to allow packing a larger number of cores into each CPU. As a consequence, the performance gains of the world’s top supercomputers are the result of a massive increase in the number of parallel cores, currently^{3} to 𝒪(10^{5}), and soon to 𝒪(10^{6 − 7}) in systems that are currently being built^{4}. Hybrid architectures, where CPUs work alongside GPUs and/or reconfigurable chips such as FPGAs, add to the massive parallelism. In the exascale world, raw compute cycles are no longer the scarce resource. The challenge is to access the available computational power when Amdahl’s law demonstrates that communication latencies kill the potential gains due to parallelisation (Amdahl 1967).
A way to embed high-resolution simulation of objects such as galaxy clusters, or even galaxies, in a cosmological context is through the use of varying particle mass resolution and the adaptive mesh refinement technique (AMR, Berger & Colella 1989). AMR is widely employed in grid-based simulation codes such as RAMSES (Teyssier 2002), ENZO (Bryan et al. 2014), FLASH (Fryxell et al. 2000), and AMIGA (Knebe & Doumler 2010). It is also used in MUSIC (Hahn & Abel 2011) to generate zoom-in initial conditions for simulations. The AMR technique, which uses multigrid relaxation methods (e.g. Guillet & Teyssier 2011), allows focusing the effort on a specific region of the computational domain, but requires a two-way flow of information between small and large scales. More recently, leading computational cosmology groups have been developing sophisticated schemes to leverage parallel and hybrid computing architectures (Gonnet et al. 2013; Theuns et al. 2015; Aubert et al. 2015; Ocvirk et al. 2016; Potter et al. 2017; Yu et al. 2018; Garrison et al. 2019; Cheng et al. 2020).
Full simulations of large cosmological volumes, even limited to cold dark matter and at coarse resolution, involve multiple challenges. One of the main issues preventing their easy parallelisation is the long-range nature of gravitational interactions, which forestalls high-resolution, large-volume cosmological simulations. In response, much of the classical work in numerical cosmology focused on computational algorithms (tree codes, fast multipole methods, particle-mesh methods, and hybrids such as particle-particle–particle-mesh and tree–particle-mesh) that reduced the need for 𝒪(N^{2}) all-to-all communications between N particles across the full computational volume.
While these algorithms are and remain the backbone of computational cosmology, they fail to fully exploit the physical scale hierarchy of cosmological perturbations. This hierarchy was first used to push the results of N-body simulations to the scale of the entire Universe for cosmic velocity fields (Strauss et al. 1995). At the largest scales, the dynamics of the Universe is not complicated, and in particular, is well captured by Lagrangian Perturbation Theory (LPT; see Bouchet et al. 1995). Building upon this view, Tassev et al. (2015) introduced spatial COmoving Lagrangian Acceleration (sCOLA). This algorithm, using a hybrid analytical and numerical treatment of particles’ trajectories, allows one to perform simulations without the need to substantially extend the simulated volume beyond the region of interest in order to capture far-field effects, such as density fluctuations due to super-box modes. The sCOLA proof-of-concept focused on one sub-box embedded into a larger simulation box.
In this paper, we extend the sCOLA algorithm and use it within a novel method for perfectly parallel cosmological simulations. To do so, we rely on a tiling of the full cosmological volume to be simulated, where each tile is evolved independently using sCOLA. The principal challenge for the accuracy of such simulations is the boundary conditions used throughout the evolution of tiles, which can introduce artefacts. In this respect, we introduce three crucial improvements with respect to Tassev et al. (2015): the use of a buffer region around each tile, the use of exact boundary conditions in the calculation of LPT displacements (which has the side benefit of reducing memory requirements), and the use of a Poisson solver with Dirichlet boundary conditions meant to approximate the exact gravitational potential around sCOLA boxes. The method proposed in this work shares similar goals with zoom-in simulation techniques, the main difference residing in the change of frame of reference introduced in sCOLA, which accounts for the dynamics of large scales without requiring flows of information during the evolution. On the other hand, our method is independent of the N-body integrator used to calculate the numerical part of particles’ trajectories within each sCOLA box, and is therefore not tied to any specific force-computation approach, such as force splitting. It is slightly approximate and more CPU-expensive than the corresponding “monolithic” simulation technique (chosen in this paper as tCOLA, Tassev et al. 2013), but has the essential advantage of perfect scalability. This scalability comes from the removal of any kind of communication among tiles after the initialisation of the simulation. As a consequence, for its major part, the degree of parallelism of the algorithm equals the number of tiles, which means that the workload is perfectly parallel (also called embarrassingly parallel).
This property can be exploited to produce cosmological simulations in very short wall-clock times on a variety of hardware architectures, as we discuss in this paper.
After reviewing Lagrangian Perturbation Theory and its use within numerical simulations in Sect. 2, we describe our algorithm for perfectly parallel cosmological simulations in Sect. 3. In Sect. 4, we test the accuracy and speed of the algorithm with respect to reference simulations that do not use the tiling. We discuss the implications of our results for computational strategies to model cosmic structure formation, and conclude, in Sect. 5. Details regarding the implementation are provided in the appendices.
2. Cosmological simulations using Lagrangian perturbation theory
Throughout this section we denote by a the scale factor of the Universe. For simplicity, some of the equations are abridged. We reintroduce the omitted constants, temporal prefactors, and Hubble expansion in Appendix A.
Particle simulators are algorithms that compute the final position x and momentum p ≡ dx/da of a set of particles, given some initial conditions. They can also be seen as algorithms that compute a displacement field Ψ, which maps the initial (Lagrangian) position q of each particle to its final (Eulerian) position x, according to the classic equation (see e.g. Bernardeau et al. 2002, for a review)

x(a) = q + Ψ(q, a). (1)

With this point of view, the outputs are x and p = ∂Ψ/∂a.
2.1. Lagrangian perturbation theory (LPT)
In Lagrangian perturbation theory (LPT), the displacement field is given by an analytic equation which is used to move particles, without the need for a numerical solver. At second order in LPT, the displacement field is written

Ψ_{LPT}(q, a) = Ψ^{(1)}(q, a) + Ψ^{(2)}(q, a), (2)

where each of the terms is separable into a temporal and a spatial contribution deriving from a Lagrangian potential:

Ψ^{(1)}(q, a) = −D_{1}(a) ∇_{q}ϕ^{(1)}(q), (3)

Ψ^{(2)}(q, a) = D_{2}(a) ∇_{q}ϕ^{(2)}(q). (4)

In Eqs. (3) and (4), D_{1} and D_{2} are the growth factor and second-order growth factor, respectively. The Lagrangian potentials obey Poisson-like equations (Buchert et al. 1994):

Δ_{q}ϕ^{(1)}(q) = δ_{i}(q), (5)

Δ_{q}ϕ^{(2)}(q) = ∑_{i>j} [ϕ^{(1)}_{,ii}(q) ϕ^{(1)}_{,jj}(q) − (ϕ^{(1)}_{,ij}(q))^{2}], (6)

where δ_{i}(q) is the density contrast in the initial conditions, in Lagrangian coordinates, and the ϕ^{(1)}_{,ij} are spatial second derivatives of ϕ^{(1)}, i.e. ϕ^{(1)}_{,ij} ≡ ∂²ϕ^{(1)}/∂q_{i}∂q_{j}.
If only the first-order term is included in Eq. (2), the solution is known as the Zel’dovich approximation (Zel’dovich 1970).
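As a minimal illustration of the Zel’dovich approximation, the following sketch (our own naming, not the implementation of any code cited in this paper) solves the Poisson-like equation for ϕ^{(1)} in Fourier space on a periodic grid and returns the displacement vector ∇_qϕ^{(1)}:

```python
import numpy as np

def zeldovich_displacement(delta_i, L):
    """First-order LPT (Zel'dovich) displacement vector grad_q phi^(1),
    where Delta phi^(1) = delta_i is solved in Fourier space
    on a periodic N^3 grid of physical side length L."""
    N = delta_i.shape[0]
    k = 2 * np.pi * np.fft.fftfreq(N, d=L / N)
    kx, ky, kz = np.meshgrid(k, k, k, indexing="ij")
    k2 = kx**2 + ky**2 + kz**2
    k2[0, 0, 0] = 1.0                    # avoid division by zero for the mean mode
    phi_k = -np.fft.fftn(delta_i) / k2   # Eq. (5) in Fourier space: phi_k = -delta_k / k^2
    phi_k[0, 0, 0] = 0.0                 # mean of the potential set to zero
    # gradient in Fourier space: (i k) phi_k, one inverse FFT per component
    return np.stack([np.real(np.fft.ifftn(1j * ki * phi_k)) for ki in (kx, ky, kz)])
```

For a single plane-wave density contrast δ_i = cos(q_x), the routine returns the analytic result ∇_qϕ^{(1)} = (sin(q_x), 0, 0).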
2.2. Temporal comoving Lagrangian acceleration (tCOLA)
In contrast to the analytical equations of LPT, particle-mesh (PM) codes (see e.g. Klypin & Holtzman 1997) provide a fully numerical solution to the problem of large-scale structure formation. The equation of motion to be solved in a PM code reads schematically

∂²x/∂a² = −∇_{x}Φ(x, a), (7)

where the gravitational potential Φ satisfies the Poisson equation,

Δ_{x}Φ(x, a) = δ(x, a). (8)

Here, δ(x, a) is the density contrast at a scale factor a, which is obtained from the set of particles’ positions {x(a)} through a density assignment operator that we denote B (typically a cloud-in-cell (CiC) scheme, see Hockney & Eastwood 1981):

δ(x, a) = B({x(a)}). (9)

The corresponding interpolation operator, which is needed to obtain the accelerations of particles given the acceleration field on the grid, performs the reverse mapping from grid values to particles’ positions.
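The density assignment operator B can be sketched as a straightforward CiC implementation for a periodic box (unit-mass particles; function and variable names are illustrative):

```python
import numpy as np

def cic_density(pos, N, L):
    """Cloud-in-cell assignment of unit-mass particles at positions `pos`
    (shape (Np, 3), periodic box of side L) onto an N^3 grid.
    Returns the density contrast delta = rho / rho_mean - 1."""
    delta = np.zeros((N, N, N))
    s = pos / (L / N)              # positions in units of the cell size
    i0 = np.floor(s).astype(int)   # cell containing each particle
    f = s - i0                     # fractional offset within the cell
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                # trilinear CiC weight for this corner of the cell
                w = (np.where(dx, f[:, 0], 1 - f[:, 0])
                     * np.where(dy, f[:, 1], 1 - f[:, 1])
                     * np.where(dz, f[:, 2], 1 - f[:, 2]))
                np.add.at(delta,
                          ((i0[:, 0] + dx) % N,
                           (i0[:, 1] + dy) % N,
                           (i0[:, 2] + dz) % N),
                          w)
    return delta / (len(pos) / N**3) - 1.0
```

One particle placed at every grid node yields δ = 0 everywhere, as expected for a uniform distribution.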
The temporal COmoving Lagrangian Acceleration (tCOLA) algorithm seeks to decouple large and small scales by evolving large scales using analytic LPT results, and small scales using a numerical solver. This is achieved by splitting the Lagrangian displacement field into two contributions (Tassev & Zaldarriaga 2012):

Ψ(q, a) = Ψ_{LPT}(q, a) + Ψ_{res}(q, a), (11)

where Ψ_{LPT}(q, a) is the LPT displacement field discussed in Sect. 2.1 and Ψ_{res}(q, a) is the residual displacement of each particle, as measured in a frame comoving with an “LPT observer”, whose trajectory is given by Ψ_{LPT}(q, a). Using Eq. (11), it is possible to rewrite Eq. (7) as

∂²Ψ_{res}/∂a² = −∇_{x}Φ(x, a) − ∂²Ψ_{LPT}/∂a². (12)

The term −∂²Ψ_{LPT}/∂a² can be thought of as a fictitious force acting on particles, caused by our use of a non-inertial frame of reference. Importantly, it can be computed analytically given the equations of Lagrangian perturbation theory.
The equations of motion (7) and (12) are usually integrated by the use of timestepping techniques (see Appendix B). In the limit of zero timesteps used to discretise the left-hand side of Eq. (12), Ψ_{res} = 0 and tCOLA recovers the results of LPT; therefore, tCOLA always solves the large scales with an accuracy of at least that of LPT. In contrast, PM codes require many timesteps in Eq. (7) just to recover the value of the linear growth factor D_{1}. In the limit where the number of timesteps becomes large, tCOLA reduces to a standard PM code. In the intermediate regime (for 𝒪(10) timesteps), tCOLA provides a good approximation to large-scale structure formation, at the expense of not solving the details of particle trajectories in deeply nonlinear halos (see Tassev et al. 2013; Howlett et al. 2015; Leclercq et al. 2015; Koda et al. 2016; Izard et al. 2016, for further discussion). Since, by construction, tCOLA always gets the large scales correct, contrary to a PM code, the trade-off between speed and accuracy only affects small scales.
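The tCOLA evolution can be sketched as a deliberately simplified kick-drift loop, omitting the constants, temporal prefactors, and Hubble expansion abridged in this section (the exact operators are in Appendices A and B); `lpt_displacement`, `lpt_acceleration`, and `pm_force` are placeholder callables, not functions of any cited code:

```python
import numpy as np

def tcola_evolve(q, a_steps, lpt_displacement, lpt_acceleration, pm_force):
    """Schematic tCOLA time-stepping of Eq. (12): evolve the residual
    displacement Psi_res in the frame comoving with the LPT observer.
    Particles start at rest in that frame (Psi_res = 0, p_res = 0)."""
    psi_res = np.zeros_like(q)
    p_res = np.zeros_like(q)
    for a, a_next in zip(a_steps[:-1], a_steps[1:]):
        da = a_next - a
        x = q + lpt_displacement(a) + psi_res
        # kick: the residual momentum feels the PM force minus the
        # fictitious frame acceleration d^2 Psi_LPT / da^2
        p_res = p_res + da * (pm_force(x) - lpt_acceleration(a))
        # drift: update the residual displacement
        psi_res = psi_res + da * p_res
    a_f = a_steps[-1]
    return q + lpt_displacement(a_f) + psi_res
```

With zero timesteps (or whenever the PM force exactly matches the fictitious frame acceleration), Ψ_res stays zero and the final positions are the pure LPT prediction, illustrating the limit discussed above.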
2.3. Spatial comoving Lagrangian acceleration (sCOLA)
During large-scale structure formation, nonlinearities appear at late times and/or at small scales. tCOLA (Eq. (12)) decouples LPT displacements and residual nonlinear contributions “in time”, so that, for a given accuracy, fewer timesteps are required to solve large-scale structure evolution than with a PM code. Following a similar spirit, the spatial COmoving Lagrangian Acceleration (sCOLA) framework decouples LPT displacements and residual nonlinear contributions “in space”, so that numerically evolved small scales can feel far-field effects captured analytically via LPT.
More specifically, for each particle in a volume of interest (the “sCOLA box”) embedded in a larger cosmological volume (the “full box”), the equation of motion of particles, which reads for a traditional N-body problem

∂²x/∂a² = F(x, a), (13)

is replaced by

∂²Ψ_{res}/∂a² = F^{sCOLA}(x, a) − ∂²Ψ^{sCOLA}_{LPT}/∂a², (14)

where Ψ_{res}(q, a) is defined by Eq. (11) as the residual displacement with respect to the LPT observer of the full box, whose trajectory is given by Ψ_{LPT}(q, a). In Eq. (14), Ψ^{sCOLA}_{LPT}(q, a) is the trajectory prescribed by solving LPT equations (see Sect. 2.1) in the sCOLA box. Note that Ψ^{sCOLA}_{LPT}(q, a) may differ from Ψ_{LPT}(q, a), depending on the assumptions made for the boundary conditions of the sCOLA box, discussed in Sect. 3.3. Denoting by 𝒮 ⊆ ⟦1, N⟧ the set of particles in the sCOLA box, the gravitational force, which in Eq. (13) reads schematically for particle i

F(x_{i}, a) = ∑_{j ∈ ⟦1, N⟧, j ≠ i} f_{ij}(a), (15)

is replaced by

F^{sCOLA}(x_{i}, a) = ∑_{j ∈ 𝒮, j ≠ i} f_{ij}(a), (16)

where f_{ij}(a) denotes the gravitational attraction exerted by particle j on particle i.
It is possible to evaluate F^{sCOLA}(x, a), and thus to solve Eq. (14), like Eq. (13), using any numerical gravity solver, such as particle-particle–particle-mesh, tree codes, or AMR. In this paper, we choose to focus on evaluating forces via a PM scheme. In this case, the equation of motion of particles in sCOLA reads schematically (Tassev et al. 2015)

∂²Ψ_{res}/∂a² = −∇^{sCOLA}_{x}Φ^{sCOLA}(x, a) − ∂²Ψ^{sCOLA}_{LPT}/∂a². (17)

The gravitational potential in the sCOLA box, Φ^{sCOLA}(x, a), obeys the near-field version of the Poisson equation,

Δ^{sCOLA}_{x}Φ^{sCOLA}(x, a) = δ^{sCOLA}(x, a). (18)

The superscript “sCOLA” over the gradient and Laplacian operators, ∇^{sCOLA}_{x} and Δ^{sCOLA}_{x}, means that they are restricted to the sCOLA box (contrary to those of Eqs. (8) and (12)). Over the density contrast δ^{sCOLA}(x, a), the superscript means that only particles in the sCOLA box {x(a)}_{sCOLA} ≡ {x_{i}(a)}_{i ∈ 𝒮} (instead of the full box) are used within the density assignment B^{sCOLA}, i.e.

δ^{sCOLA}(x, a) = B^{sCOLA}({x(a)}_{sCOLA}). (19)
Contrary to tCOLA, which is an exact rewriting of the equations of motion of a PM code, sCOLA potentially involves approximations for the calculation of each quantity and operator with a superscript “sCOLA” instead of its full-box equivalent. As a proof of concept, Tassev et al. (2015) showed that under certain circumstances, sCOLA provides a good approximation for the evolution of one sCOLA box embedded into a larger full box. As discussed in the introduction, we aim at generalising this result by using sCOLA within multiple sub-volumes of a full simulation box.
3. Algorithm for perfectly parallel simulations using sCOLA
In this section, we describe an algorithm for cosmological simulations using sCOLA, for which the time evolution of independent Lagrangian subvolumes is perfectly parallel, without any communication. A functional block diagram representing the main steps and their dependencies is given in Fig. 1. An illustration of the different grids appearing in the algorithm is presented in Fig. 2, and Table 1 provides the nomenclature of some of the different variables appearing in this section.
Fig. 1. Functional diagram of sCOLA (left) versus tCOLA (right). The grey boxes are common steps. sCOLA specific steps are represented in blue, and tCOLA specific steps in red. The yellow rectangle constitutes the perfectly parallel section, within which no communication is required with the master process or between processes. Arrows represent dependencies, and references to the main text are given between parentheses. 
Table 1. Nomenclature of symbols used in the present article.
We work in a cubic full box of side length L with periodic boundary conditions, populated by particles initially at the nodes {q} of a regular Lagrangian lattice. We seek to compute the set of final positions {x(a_{f})} and momenta {p(a_{f})} at final scale factor a_{f}. The model equations are reviewed in Appendix A. The timestepping of these equations consists of a series of “kick” and “drift” operations and is discussed in Appendix B.
We approximate the Laplacians Δ_{x}, Δ_{q} and gradient operators ∇_{x}, ∇_{q} by finite difference approximations (FDA) at order 2, 4, or 6. The coefficients of the finite difference stencils in configuration and in Fourier space are given for example in table 1 of Hahn & Abel (2011). We set N_{ghost} = 1, 2, or 3 if the FDA is taken at order 2, 4, or 6, respectively.
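The correspondence between the FDA order and N_ghost can be made concrete with the standard central-difference gradient stencils (the same coefficients tabulated by Hahn & Abel 2011); the function name is ours, and the routine sketches one application of a one-dimensional gradient without any boundary assumption:

```python
import numpy as np

# Standard central finite-difference gradient stencils at order 2, 4, 6.
# The half-width of each stencil is N_ghost = 1, 2, 3: the number of
# ghost layers consumed by one application of the gradient.
GRAD_STENCILS = {
    2: np.array([-1/2, 0, 1/2]),
    4: np.array([1/12, -2/3, 0, 2/3, -1/12]),
    6: np.array([-1/60, 3/20, -3/4, 0, 3/4, -3/20, 1/60]),
}

def gradient_1d(f, h, order=2):
    """Gradient of samples f (uniform spacing h) in the interior only:
    the output is shorter than the input by 2 * N_ghost points."""
    c = GRAD_STENCILS[order]
    n_ghost = len(c) // 2
    return sum(ci * f[i: len(f) - 2 * n_ghost + i] for i, ci in enumerate(c)) / h
```

The order-2 stencil differentiates quadratics exactly, and each increase in order widens the stencil by one ghost layer on each side.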
3.1. Initial conditions and Lagrangian potentials
Before the perfectly parallel section, two initialisation steps are performed by the master process in the full box.
A.1. The first step is to generate the initial density contrast δ_{i} in the full box, on a cubic grid of N^{3} cells (the “LPT grid”, represented in red in the left panel of Fig. 2). This step can be done via the standard convolution approach (e.g. Hockney & Eastwood 1981), given the specified initial power spectrum.
Fig. 2. Illustration of the different grids used within sCOLA. The Lagrangian lattice is represented by dashed lines. For each tile, central particles (in black) are surrounded by buffer particles (in cyan), which are ignored at the end of the evolution. The corresponding buffer region in other grids is represented in cyan. The left panel represents the “LPT grid” on which the Lagrangian potentials ϕ^{(1)} and ϕ^{(2)} are defined. The central region has N_{sCOLA}^{3} grid points (in red) and is padded by 2N_{ghost} cells in each direction (pink region). The right panel shows the “PM grid” on which the density contrast δ^{sCOLA}, the gravitational potential Φ^{sCOLA}, and the accelerations are defined. The density contrast is defined only in the central region (which has N_{g}^{3} grid points, in dark green). The gravitational potential is padded by 2N_{ghost} cells in each direction (light green and yellow regions), and the gridded accelerations only by N_{ghost} cells in each direction (yellow region). Solving the Poisson equation requires Dirichlet boundary conditions in six layers of N_{ghost} cells, denoted as hatched regions. For simplicity of representation, we have used here N_{ghost} = 1.
A.2. The second step is to compute the Lagrangian potentials ϕ^{(1)}(q) and ϕ^{(2)}(q) on the LPT grid in the full box, which is achieved by solving Eqs. (5) and (6).
If initial phases are generated in Fourier space, the Zel’dovich approximation (i.e. the calculation of ϕ^{(1)}) requires only one inverse fast Fourier transform (FFT) on the LPT grid. For the second-order potential, the source term on the right-hand side of Eq. (6) has to be computed from ϕ^{(1)}; this can either be done in Fourier space (for a cost of six inverse FFTs) or in configuration space via finite differencing (for a cost of nine one-dimensional gradient operations). In both cases, the calculation of ϕ^{(2)} from its source then requires one forward and one inverse FFT.
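The Fourier-space route for the second-order source (six inverse FFTs, one per distinct second derivative of ϕ^{(1)}) can be sketched as follows; the function name and argument conventions are illustrative:

```python
import numpy as np

def second_order_source(phi1_k, k):
    """Source of the second-order Lagrangian potential,
    sum_{i>j} [phi1_,ii phi1_,jj - (phi1_,ij)^2], computed from the
    Fourier transform of phi^(1) with six inverse FFTs, one per
    distinct second derivative."""
    kx, ky, kz = np.meshgrid(k, k, k, indexing="ij")
    kv = (kx, ky, kz)
    # the six distinct second derivatives: phi_,ij = ifft(-k_i k_j phi_k)
    d = {}
    for i in range(3):
        for j in range(i, 3):
            d[i, j] = np.real(np.fft.ifftn(-kv[i] * kv[j] * phi1_k))
    return (d[0, 0] * d[1, 1] - d[0, 1]**2
            + d[0, 0] * d[2, 2] - d[0, 2]**2
            + d[1, 1] * d[2, 2] - d[1, 2]**2)
```

For a potential varying along a single axis, all cross-terms cancel and the source vanishes, consistent with the absence of a second-order correction for one-dimensional (plane-wave) collapse.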
These few FFTs in the full box are the most hardware-demanding requirement of the algorithm (particularly in terms of memory), and the only step which is not distributed and suitable for grid computing. These FFTs may however be performed on a cluster of computers with fast interconnects, using the Message Passing Interface (Frigo & Johnson 2005; Johnson et al. 2008).
3.2. Tiling and buffer region
B.1. After having computed the Lagrangian potentials, the master process splits the Lagrangian lattice (of size N_{p}^{3}) into N_{tiles}^{3} cubic tiles (we require that N_{p} is a multiple of N_{tiles}). Tiles are constructed to be evolved independently; therefore the main, perfectly parallel region of the algorithm starts here.
To minimise artefacts due to boundary effects (see Sect. 3.4), each tile is surrounded by a “buffer region” in Lagrangian space. This buffer region consists of N_{p, buffer} particles in each direction, so that each sCOLA box contains a total of N_{p, sCOLA}^{3} particles, where N_{p, sCOLA} ≡ N_{p, tile} + 2N_{p, buffer} and N_{p, tile} ≡ N_{p}/N_{tiles}. Corresponding physical sizes are L_{tile} ≡ L N_{p, tile}/N_{p}, L_{buffer} ≡ L N_{p, buffer}/N_{p}, and L_{sCOLA} ≡ L N_{p, sCOLA}/N_{p}. The fraction of the full Lagrangian lattice assigned to one child sCOLA process is represented by dotted lines in Fig. 2. Particles of the tile are represented in black, and particles of the buffer region are represented in cyan.
The sCOLA box is chosen to encompass the tile and its buffer region. We define the over-simulation factor r as the ratio between the total volume simulated in all sCOLA boxes and the target simulation volume, i.e.

r ≡ (N_{tiles} N_{p, sCOLA}/N_{p})^{3} = (N_{p, sCOLA}/N_{p, tile})^{3}. (20)
Since all sCOLA boxes can be evolved independently, the degree of parallelism of the algorithm is equal to the number of sCOLA boxes, N_{tiles}^{3}. We call the “parallelisation potential factor” the quantity p ≡ N_{tiles}^{3}/r, which balances the degree of parallelism with the amount of over-simulation. It is also

p = (N_{p}/N_{p, sCOLA})^{3}. (21)
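The bookkeeping of the tiling can be sketched as follows (illustrative function and variable names):

```python
def tiling_factors(N_p, N_tiles, N_p_buffer):
    """Over-simulation factor r and parallelisation potential factor
    p = N_tiles^3 / r for a tiling of an N_p^3 Lagrangian lattice into
    N_tiles^3 tiles, each padded with N_p_buffer particles per side."""
    assert N_p % N_tiles == 0, "N_p must be a multiple of N_tiles"
    N_p_tile = N_p // N_tiles
    N_p_sCOLA = N_p_tile + 2 * N_p_buffer
    r = (N_tiles * N_p_sCOLA / N_p) ** 3   # total simulated volume / target volume
    p = N_tiles ** 3 / r                   # equals (N_p / N_p_sCOLA)^3
    return r, p
```

For instance, N_p = 512, N_tiles = 8, and N_p,buffer = 16 give N_p,tile = 64, N_p,sCOLA = 96, r ≈ 3.4 and p ≈ 152; with no buffer, r = 1 and p reaches its maximum, N_tiles^3.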
For each sCOLA box, the corresponding child process computes the set of final positions {x}_{sCOLA} and momenta {p}_{sCOLA}.
B.2. At the end of the evolution, each child process sends the set of final positions {x}_{tile} and momenta {p}_{tile} of particles of the tile back to the master process. Particles of the buffer region are ignored. The master process then “untiles” the simulation by gathering the results from all the tiles.
3.3. Initial operations in the sCOLA boxes
A few steps are required in each sCOLA box before starting the evolution per se.
C.1. The sCOLA box receives the relevant portion of ϕ^{(1)}(q) and ϕ^{(2)}(q) from the master process. This is the only communication required with the master process before sending back the results at the end of the evolution.
The portion of the LPT grid received by each process from the master process corresponds to the full spatial region covered by the sCOLA box, plus an additional padding of 2N_{ghost} cells in each direction. The parts of ϕ^{(1)}(q) and ϕ^{(2)}(q) received from the master process (for which we avoid the superscript “sCOLA”, since no approximation is involved at this stage) are defined on a grid of size (N_{sCOLA} + 4N_{ghost})^{3}, where

N_{sCOLA} ≡ ⌈N N_{p, sCOLA}/N_{p}⌉ (22)

(⌈⋅⌉ denotes the ceiling function). An illustration is provided in Fig. 2, left panel. There, the portion of the LPT grid corresponding to the sCOLA box, of size N_{sCOLA} in each direction, is represented in red and the padding region, of size 2N_{ghost} in each direction, is represented in pink.
C.2. The sCOLA process locally computes the required time-independent LPT vectors Ψ_{1} and Ψ_{2} via finite differencing in configuration space and interpolation to particles’ positions.

The ghost cells included around ϕ^{(1)} and ϕ^{(2)} in the sCOLA box ensure that the proper boundary conditions are used when applying the gradient operator ∇_{q} in configuration space to get the LPT displacements on the grid. This step “consumes” N_{ghost} layers of ghost cells in each direction, so that the grid of LPT displacements has a size of (N_{sCOLA} + 2N_{ghost})^{3}. To use again the proper boundary conditions when going from the LPT grid to particles’ positions, another N_{ghost} layers of ghost cells are consumed by the interpolation operator. The use of the exact boundary conditions at each of these two steps ensures that the LPT vectors computed in the sCOLA box are, by construction, always the same as Ψ_{1} ≡ ∇_{q}ϕ^{(1)}(q) and Ψ_{2} ≡ ∇_{q}ϕ^{(2)}(q) in the full box (as would be computed by the master process). Consequently, we do not keep track of both the sCOLA-box and full-box versions of these vectors, contrary to Tassev et al. (2015). In addition to being simpler, this scheme has the practical advantage of saving six floating-point numbers per particle in memory (three in the case of the Zel’dovich approximation).
C.3. The sCOLA process precomputes the Dirichlet boundary conditions Φ_{BCs} that will be used at each calculation of the gravitational potential during the sCOLA evolution.
For each sCOLA box, we define a particle-mesh grid of size N_{g}^{3} (the “PM grid”, represented in dark green in the right panel of Fig. 2). The PM grid defines the force resolution; it should be equal to or finer than the LPT grid (N_{g} ≥ N_{sCOLA}). Before starting the evolution with sCOLA, each process precomputes the Dirichlet boundary conditions that will be required by the Poisson solver at each value of the scale factor a_{K}. This calculation takes as input the initial gravitational potential and outputs Φ_{BCs}(x, a_{K}) for each a_{K}, defined on the PM grid with a padding of 2N_{ghost} cells around the sCOLA box in each direction (light green and yellow regions in Fig. 2, right panel). The approximation involved in this step is further discussed in Sect. 3.4.2.
3.4. Evolution of sCOLA boxes
Each sCOLA box is then evolved according to the scheme reviewed in Sect. 2.3 and Appendices A and B. Two specific approximations are needed to compute the operators and quantities with a superscript “sCOLA”; we now discuss the choices that we made.
3.4.1. Density assignment (B^{sCOLA})
As mentioned in Sect. 2.3, only particles of the sCOLA box should contribute to δ^{sCOLA}(x, a). For particles that are fully in the sCOLA box, density assignment can be chosen as the same operation as would be used in a PM or tCOLA code (typically, a CiC scheme). A question is what to do with particles that have (partially) left the sCOLA box during the evolution, while keeping the requirement of no communication between boxes: this constitutes the only difference between the operators B and B^{sCOLA}. Possible choices include artificially periodising the sCOLA box (which is clearly erroneous) or stopping particles at its boundaries (which does not conserve momentum). Both of these choices assign the entire mass carried by the set of sCOLA particles 𝒮 to the PM grid, but result in artefacts in the final conditions, if the buffer region is not large enough.
An alternative choice is simply to limit the (Eulerian) PM grid volume where we compute δ^{sCOLA}(x, a) to the (Lagrangian) sCOLA box, including central and buffer regions. In practice, this means ignoring the fractional particle masses that the CiC assignment would have deposited to grid points outside the sCOLA box. We have found in our tests that this choice gives the smallest artefacts of the three choices considered^{5}. We note that (partially) erasing some particles’ mass is an approximation that is only used in the B^{sCOLA} operator to evaluate the source term in the Poisson equation, and therefore only affects the force calculation. The number of particles, both within each sCOLA process (N_{p, sCOLA}^{3}) and in the full simulation (N_{p}^{3}), is left unchanged during the evolution. Therefore, mass is always conserved both within each sCOLA process and within the full volume.
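This clipped density assignment can be sketched as a non-periodic CiC that drops out-of-box weights (a sketch under our own naming, not the actual implementation of any code cited in this paper):

```python
import numpy as np

def cic_density_clipped(pos, N, cell):
    """CiC mass assignment restricted to an N^3 non-periodic grid:
    fractional weights that the CiC kernel would deposit outside the
    box are simply ignored, rather than wrapped around (periodisation)
    or accumulated at the boundary (stopping particles)."""
    rho = np.zeros((N, N, N))
    s = pos / cell
    i0 = np.floor(s).astype(int)
    f = s - i0
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                ix, iy, iz = i0[:, 0] + dx, i0[:, 1] + dy, i0[:, 2] + dz
                w = (np.where(dx, f[:, 0], 1 - f[:, 0])
                     * np.where(dy, f[:, 1], 1 - f[:, 1])
                     * np.where(dz, f[:, 2], 1 - f[:, 2]))
                # keep only the deposits that land inside the grid
                inside = ((0 <= ix) & (ix < N) & (0 <= iy) & (iy < N)
                          & (0 <= iz) & (iz < N))
                np.add.at(rho, (ix[inside], iy[inside], iz[inside]), w[inside])
    return rho
```

A particle well inside the box deposits its full mass; one straddling a face deposits only the inside fraction; one entirely outside contributes nothing, which is the partial mass erasure described above.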
3.4.2. Gravitational potential (Δ_{x}^{sCOLA}, ∇_{x}^{sCOLA}, and interpolation)
Poisson solver (Δ_{x}^{sCOLA}). To make sure that differences between Φ^{sCOLA}(x, a) and Φ(x, a) are as small as possible, we make use of a Poisson solver with Dirichlet boundary conditions, instead of assuming periodic boundary conditions. Such a Poisson solver uses discrete sine transforms (DSTs) instead of FFTs, and requires the boundary values of Φ in six planes (west, east, south, north, bottom, top) surrounding the PM grid (see Appendix C). These planes have a thickness of N_{ghost} cells (depending on the order of the FDA used to approximate the Laplacian); they are represented by hatched regions in Fig. 2, right panel. At each scale factor a_{K} when the computation of accelerations is needed, the Dirichlet boundary conditions are extracted from the precomputed Φ_{BCs}(x, a_{K}) (step C.3., see Sect. 3.3).
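The principle of a DST-based Poisson solver with Dirichlet boundary conditions can be sketched for the order-2 FDA (N_ghost = 1) as follows; this is an illustration using SciPy's type-I discrete sine transforms, not the solver detailed in Appendix C, and the names are ours:

```python
import numpy as np
from scipy.fft import dstn, idstn

def poisson_dirichlet(rho, bc, h):
    """Solve the order-2 FDA Poisson equation Delta phi = rho on an n^3
    interior grid (spacing h) with Dirichlet boundary conditions.
    `bc` is an (n+2)^3 array whose outermost layer holds the boundary
    values of phi; its interior entries are ignored."""
    n = rho.shape[0]
    rhs = rho * h**2
    # fold the known boundary values of the 7-point Laplacian into the RHS
    rhs[0, :, :] -= bc[0, 1:-1, 1:-1];  rhs[-1, :, :] -= bc[-1, 1:-1, 1:-1]
    rhs[:, 0, :] -= bc[1:-1, 0, 1:-1];  rhs[:, -1, :] -= bc[1:-1, -1, 1:-1]
    rhs[:, :, 0] -= bc[1:-1, 1:-1, 0];  rhs[:, :, -1] -= bc[1:-1, 1:-1, -1]
    # the type-I DST diagonalises the homogeneous-Dirichlet Laplacian;
    # its 1D eigenvalues are -4 sin^2(pi k / (2 (n+1))), k = 1..n
    k = np.arange(1, n + 1)
    lam = -4 * np.sin(np.pi * k / (2 * (n + 1))) ** 2
    denom = lam[:, None, None] + lam[None, :, None] + lam[None, None, :]
    phi_hat = dstn(rhs, type=1) / denom
    return idstn(phi_hat, type=1)
```

Because the system is linear, feeding the solver the discrete Laplacian of any field, together with that field's boundary values, recovers the field in the interior to machine precision.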
Ideally, Φ_{BCs}(x, a_{K}) should be the exact, nonlinear gravitational potential in the full volume at a_{K}, Φ(x, a_{K}). However, knowing this quantity would require having previously run the monolithic simulation in the full volume, which we seek to avoid. In this paper, we rely instead on the linearly-evolving potential (LEP) approximation (Brainerd et al. 1993; Bagla & Padmanabhan 1994), namely

Φ_{BCs}(x, a_{K}) ≡ D_{1}(a_{K}) Φ_{i}(x), (23)

where Φ_{i}(x) is the gravitational potential in the initial conditions.
The idea behind this approximation is that the gravitational potential is dominated by long-wavelength modes, and therefore it ought to obey linear perturbation theory to a better approximation than the density field.
In Eq. (23), we have assumed that the linear growth factor D_{1} is normalised to unity at the scale factor corresponding to the initial conditions. The precomputation of Φ_{BCs} in step C.3. is therefore an interpolation from the LPT grid to the PM grid and a simple scaling with D_{1}(a_{K}).
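The scaling in question only requires the linear growth factor D_{1}(a_{K}) at each timestep. A minimal sketch of its computation, for flat ΛCDM with radiation neglected, normalised to unity at a_{i} = 0.05 as in the setups of Sect. 4 (the default parameter values below follow the cosmology quoted there):

```python
import numpy as np
from scipy.integrate import quad

def growth_factor(a, Om=0.3089, Ol=0.6911, a_i=0.05):
    """Linear growth factor D_1(a) for flat LCDM (radiation neglected),
    normalised so that D_1(a_i) = 1, as assumed for Eq. (23).

    Uses D_1(a) proportional to H(a) * int_0^a da' / (a' H(a'))^3.
    """
    def E(x):  # H(x) / H_0 for flat LCDM
        return np.sqrt(Om / x**3 + Ol)

    def D_unnorm(x):
        integral, _ = quad(lambda t: 1.0 / (t * E(t))**3, 1e-8, x)
        return E(x) * integral

    return D_unnorm(a) / D_unnorm(a_i)
```

Deep in matter domination D_{1} grows proportionally to a, so for instance D_{1}(0.1) is close to 2 with this normalisation; the boundary potential at a_{K} is then simply growth_factor(a_K) times the initial potential interpolated onto the PM grid.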
The output of the Poisson solver is the gravitational potential Φ^{sCOLA}(x, a_{K}) on the PM grid, in the interior of the sCOLA box (dark green grid points in Fig. 2, right panel). Consistently with the treatment above, Φ^{sCOLA}(x, a_{K}) is padded using the values of Φ_{BCs}(x, a_{K}) in 2N_{ghost} cells around the PM grid, in each direction (light green and yellow regions in Fig. 2, right panel).
Therefore, the only difference between the sCOLA Laplacian Δ_{x}^{sCOLA} and Δ_{x} resides in using the LEP instead of the true, nonlinear gravitational potential at the boundaries of the sCOLA box.
Accelerations. Given the gravitational potential Φ^{sCOLA}(x, a_{K}), accelerations are computed by finite differencing in configuration space and interpolation to particles' positions, similarly to step C.2. (see Sect. 3.3). The application of the finite-difference gradient ∇_{x}^{sCOLA} consumes N_{ghost} cells, so that accelerations are obtained on the PM grid with a padding of N_{ghost} cells (yellow region in Fig. 2, right panel). Interpolation from the grid to particles' positions further consumes N_{ghost} cells.
As for the Laplacian, the only difference between ∇_{x}^{sCOLA} and ∇_{x}, and between the corresponding sCOLA and full-box interpolation operators, resides in using the LEP in Φ^{sCOLA}(x, a_{K}) instead of the true, nonlinear gravitational potential at the boundaries of the sCOLA box.
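For FDA order 2, the gradient stencil that consumes one ghost layer per face can be sketched as follows (an illustrative NumPy version; the grid is assumed to come padded with N_ghost = 1 cells of boundary values):

```python
import numpy as np

def gradient_fda2(phi_padded, h):
    """Second-order central-difference gradient of a padded potential.

    phi_padded: (N+2, N+2, N+2) grid including one ghost cell per face
    (N_ghost = 1 for FDA order 2). Returns the gradient on the N^3
    interior grid, i.e. applying the stencil consumes the ghost layer.
    """
    c = slice(1, -1)  # interior indices along one axis
    gx = (phi_padded[2:, c, c] - phi_padded[:-2, c, c]) / (2.0 * h)
    gy = (phi_padded[c, 2:, c] - phi_padded[c, :-2, c]) / (2.0 * h)
    gz = (phi_padded[c, c, 2:] - phi_padded[c, c, :-2]) / (2.0 * h)
    return np.stack([gx, gy, gz])
```

Accelerations then follow as the negative gradient, interpolated back to particle positions (the step that consumes a further ghost layer).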
4. Accuracy and speed
We implemented the perfectly parallel sCOLA algorithm described in Sect. 3 in the SIMBELMYNë code (Leclercq et al. 2015), publicly available^{6} (see also Leclercq 2015, appendix B, for technical details on the implementation of the PM and tCOLA models in SIMBELMYNë). This section describes some tests of the accuracy and speed of the new sCOLA algorithm. Since our implementation, relying on evaluating forces with a PM scheme, introduces some additional approximations with respect to tCOLA, we compare our results to those of corresponding monolithic tCOLA simulations. The accuracy of tCOLA with respect to more accurate gravity solvers has been characterised in the earlier literature (Tassev et al. 2013; Howlett et al. 2015; Leclercq et al. 2015; Koda et al. 2016; Izard et al. 2016). The question of comparing the accuracy of our sCOLA algorithm to full N-body simulations would require building a full N-body integrator into the sCOLA boxes (see Eqs. (14) and (16)); this subject is left for future research.
Throughout the paper, we adopt the ΛCDM model with Planck 2015 cosmological parameters: h = 0.6774, Ω_{Λ} = 0.6911, Ω_{b} = 0.0486, Ω_{m} = 0.3089, n_{S} = 0.9667, σ_{8} = 0.8159 (Planck Collaboration XIII 2016, page 31, Table 4, last column). The initial power spectrum is computed using the Eisenstein & Hu (1998, 1999) fitting function.
We base our first tests on a periodic box of comoving side length L = 200 Mpc h^{−1} populated with N_{p} = 512^{3} dark matter particles. For all operators, we use FDA at order 2. The LPT grid has N^{3} = 256^{3} voxels. Particles are evolved to redshift z = 19 using 2LPT. For all runs, we use 10 timesteps linearly-spaced in the scale factor to evolve particles from z = 19 (a_{i} = 0.05) to z = 0 (a_{f} = 1) (see Appendix B)^{7}. For tCOLA, the PM grid, covering the full box, has 512^{3} voxels. For sCOLA, we use eight different setups, with various parameters {N_{tiles}, N_{p, tile}, L_{tile}, N_{p, buffer}, L_{buffer}, N_{g}, r, p} given in the first part of Table 2.
Table 2. Different setups used to test the accuracy and speed of our sCOLA algorithm.
To assess more extensively the impact of using sCOLA on large scales, we used a second ensemble of simulations with the following differences: a box with comoving side length of L = 1 Gpc h^{−1}, N_{p} = 1024^{3} particles, an LPT grid with N^{3} = 512^{3} voxels, and a PM grid of 1024^{3} voxels for tCOLA. For sCOLA, we use eight different setups given in the second part of Table 2.
4.1. Qualitative assessments
The redshift-zero density field is estimated by assigning all particles to the LPT grid using the CiC scheme. Results for the 200 Mpc h^{−1} box are shown in Fig. 3. There, the bottom right panel shows the reference tCOLA density field and other panels show the differences between sCOLA and tCOLA results, for the eight different setups. Some qualitative observations can be made: when artefacts are visible in the sCOLA results, they mainly affect overdense regions of the cosmic web (filaments and halos), whereas underdense regions are generally better recovered. Artefacts are of two types: the position of a structure (usually a filament) can be imprecise due to a misestimation of bulk motions (this is visible as a “dipole” in Fig. 3); or the density (usually of halos) can be over- or underestimated (this is visible as a “monopole” in Fig. 3). In all setups, artefacts are predominantly located close to the boundaries of tiles (represented as dashed lines) and are less visible in the centre of tiles. This can be easily understood given that the approximations made all concern the behaviour at the boundaries of sCOLA boxes. At fixed size for the buffer region, the correspondence between sCOLA and tCOLA density fields improves with increasing tile size. A minimum tile size of about 50 Mpc h^{−1} seems necessary to limit the misestimation of halo densities (“monopoles” in Fig. 3). At low redshift, this scale is in the mildly nonlinear regime, where LPT starts to break down; therefore, the LPT frame is inaccurate for particles, and the requirement of no communication between tiles leads to mispredicted clustering. As expected, at fixed tile size, the results are improved by increasing the buffer region around tiles: in each sCOLA box, boundary approximations are pushed farther away from the central region of interest.
A good compromise between reducing artefacts and increasing the size of buffer regions seems to be found for a buffer region of 25 Mpc h^{−1}, which corresponds roughly to the maximum distance travelled by a particle from its initial to its final position. In particular, the setup L_{tile} = 50 Mpc h^{−1}, L_{buffer} = 25 Mpc h^{−1} leads to a satisfactory approximation of the tCOLA density with a parallelisation potential factor p = 8.
Fig. 3. Qualitative assessment of the redshift-zero density field from sCOLA for different tilings and buffer sizes, with respect to tCOLA. The bottom right panel shows the reference tCOLA density field in a 200 Mpc h^{−1} box with periodic boundary conditions (the quantity represented is ln(2 + δ_{tCOLA}) where δ_{tCOLA} is the density contrast). Other panels show the difference between sCOLA and tCOLA density fields, ln(2 + δ_{sCOLA})−ln(2 + δ_{tCOLA}), for different sizes of tile and buffer region, as indicated above the panels. The tiling is represented by dashed lines, and the central tile’s buffer region is represented by solid lines. In the third dimension, the slices represented intersect the central tile at its centre. As can be observed in this figure, artefacts are predominantly located close to the boundaries of tiles; they are reduced with increasing tile size and buffer region size.
In a similar fashion, the velocity field is estimated on the LPT grid from particle information, using the simplex-in-cell estimator (Hahn et al. 2015; Leclercq et al. 2017). Using phase-space information, this estimator accurately captures the velocity field, even in regions sparsely sampled by simulation particles. Results for the 200 Mpc h^{−1} box are shown in Fig. 4, where one component of the tCOLA velocity field v_{tCOLA} (in km s^{−1}) is shown in the bottom right panel. Other panels show the velocity error in sCOLA, v_{sCOLA} − v_{tCOLA} in km s^{−1}. Differences between tCOLA and sCOLA velocity fields are of two kinds: misestimation of bulk flows (visible as light, spatially extended regions in Fig. 4), or misestimation of particle velocities inside halos (visible as dark spots in Fig. 4). We do not interpret the second kind of differences as errors made by our sCOLA algorithm: indeed, motions within virialised regions are not captured accurately by any simulation using only ten timesteps, even by tCOLA in the full box. Therefore, only the first kind of differences, that is, the misestimation of coherent bulk motions, is physically interpretable. In this respect, the same behaviour as for density fields can be observed: artefacts are mostly located at the boundaries of tiles, and they are reduced with increasing tile size and buffer region size, with safe minima of L_{tile} ≳ 50 Mpc h^{−1} and L_{buffer} ≳ 25 Mpc h^{−1}, respectively.
Fig. 4. Same as Fig. 3, but for one component of the velocity field, in km/s. Bulk flows are correctly captured if tiles and their buffer regions are large enough. Residual differences inside halos can be observed, but they are expected due to the limited number of timesteps, rendering both tCOLA and sCOLA velocities inaccurate in the deeply nonlinear regime. 
4.2. Summary statistics
In this section, we turn to a more quantitative assessment of our results, by checking the power spectrum of final density fields and their cross-correlation to the tCOLA density field. Even if final density fields are non-Gaussian, two-point statistics (auto- and cross-spectra) are expected to be sensitive to the approximations made in our sCOLA algorithm, which involves both local and non-local operations in configuration space.
According to Huterer & Takada (2005) or Audren et al. (2013), in the best cases, observational errors for a Euclid-like survey are typically of order 3% for k < 10^{−2} (Mpc h^{−1})^{−1}. These results do not account for any of the systematic uncertainties linked to selection effects or contamination of the clustering signal by foregrounds. At smaller scales, theoretical uncertainties take over, reaching 1% and above for k > 10^{−1} (Mpc h^{−1})^{−1}. In addition, the impact of baryonic physics is still largely uncertain, some models predicting an impact of at least 10% at k = 1 (Mpc h^{−1})^{−1} (e.g. van Daalen et al. 2011; Chisari et al. 2018; Schneider et al. 2019). Any data model involving our sCOLA algorithm will be subject to these uncertainties. For this reason, we aim for no better than 3% to 1% accuracy at all scales up to k = 1 (Mpc h^{−1})^{−1}, for any two-point measurement of clustering.
More precisely, we work with P(k) and R(k), defined for two density contrast fields δ and δ′=δ_{tCOLA}, with our Fourier transform convention, by

⟨δ(k) δ*(k′)⟩ ≡ (2π)^{3} P_{δ}(k) δ_{D}(k − k′), (24)

R(k) ≡ P_{δδ′}(k) / [P_{δ}(k) P_{δ′}(k)]^{1/2}, (25)

with the cross-spectrum P_{δδ′}(k) defined analogously to Eq. (24),
where δ_{D} is a Dirac delta distribution. For the estimation of P(k) and R(k), we use 100 logarithmically-spaced k-bins from the fundamental mode of the box k_{min} ≡ 2π/L to k = 1 (Mpc h^{−1})^{−1}.
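These two diagnostics can be estimated from gridded density fields along the following lines. This is a deliberately simplified estimator (it ignores the Hermitian-symmetry weighting of the real FFT, which largely cancels in the ratios P_{sCOLA}/P_{tCOLA} and R(k) used below), with names and bin counts chosen for illustration:

```python
import numpy as np

def spectra_diagnostics(delta, delta_ref, L, nbins=20, kmax=1.0):
    """Binned power-spectrum ratio and cross-correlation R(k) of two
    density contrast fields on an N^3 grid of comoving side L, with
    logarithmic k-bins from the fundamental mode 2*pi/L to kmax."""
    N = delta.shape[0]
    f1 = np.fft.rfftn(delta)
    f2 = np.fft.rfftn(delta_ref)
    kf = 2.0 * np.pi / L  # fundamental mode of the box
    kx = np.fft.fftfreq(N, d=1.0 / N) * kf
    kz = np.fft.rfftfreq(N, d=1.0 / N) * kf
    kk = np.sqrt(kx[:, None, None]**2 + kx[None, :, None]**2
                 + kz[None, None, :]**2)
    edges = np.geomspace(kf, kmax, nbins + 1)
    idx = np.digitize(kk.ravel(), edges) - 1
    ok = (idx >= 0) & (idx < nbins)  # drop the k = 0 mode and k > kmax

    def binned(x):
        return np.bincount(idx[ok], weights=x.ravel()[ok], minlength=nbins)

    counts = np.bincount(idx[ok], minlength=nbins)
    p11 = binned(np.abs(f1)**2)
    p22 = binned(np.abs(f2)**2)
    p12 = binned((f1 * np.conj(f2)).real)
    with np.errstate(invalid="ignore", divide="ignore"):
        ratio = p11 / p22                 # P_delta / P_ref per bin
        R = p12 / np.sqrt(p11 * p22)      # cross-correlation per bin
    return ratio, R, counts
```

Normalisation constants cancel in both quantities, so the raw FFT outputs can be binned directly; empty bins (above the grid's Nyquist frequency) are flagged by counts of zero.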
In Figs. 5 and 6, we plot the power spectrum of sCOLA density fields divided by the power spectrum of the reference tCOLA density field, P_{sCOLA}(k)/P_{tCOLA}(k) (upper panels) and the cross-correlation between sCOLA and tCOLA density fields, R(k) (bottom panels), for our 200 Mpc h^{−1} (Fig. 5) and 1 Gpc h^{−1} box (Fig. 6). The grey horizontal bands represent the target accuracies of 3% and 1%, and the vertical lines mark the fundamental modes of the tiles, k_{tile} ≡ 2π/L_{tile}, for the different values of L_{tile} used.
Fig. 5. Power spectrum relative to tCOLA (top panel) and cross-correlation with respect to tCOLA (bottom panel) of redshift-zero sCOLA density fields, in a 200 Mpc h^{−1} box containing 512^{3} dark matter particles. Different sizes for the tiles (represented by different line styles) and buffer regions (represented by different colours) are used, as indicated in the legend. The vertical lines show the respective fundamental mode of different tiles, the light grey bands correspond to 3% accuracy, and the dark grey bands to 1% accuracy.
Figure 5 quantitatively confirms the considerations of Sect. 4.1. Both the amplitudes (as probed by P(k)/P_{tCOLA}(k)) and the phase accuracy (as probed by R(k)) of sCOLA simulations are improved with increasing tile size, for a fixed buffer region (different line styles, same colours). For a fixed tile size, results are also improved by increasing the size of the buffer region (same line styles, different colours). Remarkably, all setups yield perfect phase accuracy at large scales (R(k) = 1 for k ≤ 0.2 (Mpc h^{−1})^{−1}), even when the amplitude of corresponding modes deviates from the tCOLA result. Defects at small scales (lack of power and inaccurate phases) are only observed for the smallest tile sizes and are fixed by increasing the size of the buffer region. This effect can be interpreted in Lagrangian coordinates: when the Lagrangian volume forming a halo is divided among different tiles that do not exchange particles, and if the buffer region is too small to contain the rest of the halo, the resulting structure is then split and underclustered in Eulerian coordinates. In this respect, preferring a sCOLA box size (L_{sCOLA} ≡ L_{tile} + 2L_{buffer}) of at least 100 Mpc h^{−1} (and therefore L_{tile} ≳ 50 Mpc h^{−1}, L_{buffer} ≳ 25 Mpc h^{−1}, in most situations) seems to be sensible. A more difficult issue is the amplitude of large-scale modes, for k < k_{tile}. These are sensitive to the tiling if buffer regions around tiles are too small. A safe requirement also seems to be L_{buffer} ≳ 25 Mpc h^{−1}. Putting everything together, in our 200 Mpc h^{−1} box, three setups reach 3% accuracy in amplitude and phases at all scales: {L_{tile} = 50 Mpc h^{−1}, L_{buffer} = 25 Mpc h^{−1}} (discussed already in Sect. 4.1); {L_{tile} = 100 Mpc h^{−1}, L_{buffer} = 25 Mpc h^{−1}}; and {L_{tile} = 50 Mpc h^{−1}, L_{buffer} = 50 Mpc h^{−1}}.
The lastmentioned performs even better, reaching 1% accuracy at all scales, but at the price of oversimulating the volume by a larger factor.
Figure 6 shows the same diagnostics for a 1 Gpc h^{−1} box, where the qualitative behaviour is the same as before. It confirms the requirement L_{buffer} ≳ 25 Mpc h^{−1} to get sufficient accuracy at high k. The question of the accuracy reached at the largest scales is then jointly sensitive to L_{tile} and L. In our tests, the setups {L_{tile} = 62.5 Mpc h^{−1}, L_{buffer} = 39.1 Mpc h^{−1}} and {L_{tile} = 125 Mpc h^{−1}, L_{buffer} = 29.3 Mpc h^{−1}} yield 3% accurate results at all scales, and the setups {L_{tile} = 62.5 Mpc h^{−1}, L_{buffer} = 62.5 Mpc h^{−1}} and {L_{tile} = 125 Mpc h^{−1}, L_{buffer} = 48.8 Mpc h^{−1}} almost reach 1%level precision at all scales. We note that the two different boxes have different mass resolutions, which confirms that requirements for tile and buffer region sizes should be expressed in physical size.
4.3. Tests of the approximations
As discussed in Sect. 3.4, two approximations are introduced in our sCOLA algorithm with respect to a monolithic tCOLA approach. These concern density assignment in the interior of sCOLA boxes (approximation D.1.) and the gravitational potential at the boundaries of sCOLA boxes (approximation D.2.). In this section, we test the impact of these approximations on final results, using two-point statistics as diagnostic tools. For this test we use our sCOLA run with L = 200 Mpc h^{−1}, N_{p} = 512^{3}, 64 tiles (N_{tiles} = 4, N_{p, tile} = 128) and N_{p, buffer} = 32 (i.e. L_{tile} = 50 Mpc h^{−1}, L_{buffer} = 12.5 Mpc h^{−1}). We choose a small buffer size on purpose, to be sensitive to the approximations made.
Let us denote by δ_{int} the density contrast in the interior of sCOLA boxes and by Φ_{BCs} the gravitational potential at the boundaries of sCOLA boxes. As discussed in Sect. 3.4, our algorithm involves an approximation regarding particles leaving the sCOLA box during the evolution, yielding δ^{sCOLA}, and relies on the LEP approximation at the boundaries. It therefore uses

{δ_{int} = δ^{sCOLA}, Φ_{BCs} = Φ_{LEP}}. (26)
Everything else being fixed, we ran three investigative sCOLA simulations using, respectively,

{δ_{int} = δ, Φ_{BCs} = Φ_{LEP}}, (27)

{δ_{int} = δ^{sCOLA}, Φ_{BCs} = Φ}, (28)

{δ_{int} = δ, Φ_{BCs} = Φ}, (29)
where δ is the “true” density contrast and Φ is the “true” gravitational potential, extracted at each timestep from the corresponding tCOLA simulation.
Figure 7 shows the auto- and cross-spectra of resulting sCOLA density fields, with respect to the reference tCOLA result. The use of δ_{int} = δ yields by construction R(k) = 1 at all scales, as can be checked from the bottom panel. The setup given by Eq. (29) is rid of the two approximations; it is therefore a consistency check: one should retrieve the tCOLA result if no bias is introduced by the tiling and different Poisson solver. As expected, Fig. 7 shows that our implementation recovers the tCOLA result at all scales, with only a small excess of power at k > 0.4 (Mpc h^{−1})^{−1} explained by the slightly higher force resolution of the sCOLA run with respect to tCOLA (the PM grid cell sizes are 0.3886 and 0.3906 Mpc h^{−1}, respectively).
Fig. 7. Tests of the approximations made in sCOLA for the density field and the gravitational potential. As in Fig. 5, the diagnostic tools are the power spectrum relative to tCOLA (top panel) and the cross-correlation with tCOLA (bottom panel). Our sCOLA algorithm uses the approximate interior density field δ^{sCOLA} and the LEP approximation for the boundary gravitational potential (dash-dotted blue line). In other simulations, as indicated in the legend, we use the true density field δ and/or the true gravitational potential Φ at the boundaries. The approximation made for the density field dominates, especially at large scales.
The setups given by Eqs. (27) and (28) allow disentangling the impact of approximations D.1. and D.2. In the standard run (Eq. (26)), averaging over tiles and timesteps, ∼0.43% of the 512^{3} particles, all of which belong to the buffer region, do not deposit all of their mass in the calculation of δ^{sCOLA}, but only ∼76.5% of it on average. This number increases only slightly with time (from ∼0.35% at a = 0.05 to ∼0.47% at a = 1); in other simulations, we have found that it depends more strongly on the mass resolution and on the surface of sCOLA boxes. Regarding the accuracy of the LEP approximation, the ratio of the power spectra of Φ − Φ_{LEP} and of Φ goes to zero at early times and large scales, and stays below 12% for all scales with wavenumber k ≤ 2π/L_{sCOLA} at a = 1. As can be observed in Fig. 7, although using the nonlinear gravitational potential instead of the LEP improves both P(k) and R(k) for the final density field at all scales with wavenumber k > 7 × 10^{−2} (Mpc h^{−1})^{−1}, it does not remove the ≳5% bias in amplitude at the largest scales. By contrast, using the true density contrast solves this problem and yields a 3% accurate result at all scales, which is remarkable given the small buffer size used in this case (the oversimulation factor is only r = 3.38).
We conclude from these tests that the approximation made regarding the density field (D.1.) has more impact than the one regarding the gravitational potential (D.2.), especially on the largest modes. This result is consistent with the standard paradigm for structure formation, where the density contrast undergoes severe nonlinearity at small scales and late times, while the gravitational potential evolves very little. It also suggests that future improvements of our algorithm should focus on finding a better approximation for δ^{sCOLA}, rather than Φ_{BCs}.
4.4. Computational cost
One of the main motivations for our perfectly parallel algorithm based on sCOLA is to be able to run very large volume simulations at reasonably high resolution. A detailed analysis of the speed and computational cost of our algorithm, as implemented in SIMBELMYNë, is therefore beyond the scope of this paper. However, in this section we discuss some performance considerations based on a sCOLA run with L = 1 Gpc h^{−1}, N_{p} = 1024^{3}, 512 tiles (N_{tiles} = 8, N_{p, tile} = 128), N_{p, buffer} = 30 (i.e. L_{tile} = 125 Mpc h^{−1}, L_{buffer} = 29.3 Mpc h^{−1}), N_{g} = 199; and the corresponding monolithic tCOLA simulation. In this case, the oversimulation factor is r ≈ 3.17 and the parallelisation potential factor is p ≈ 161.59. To compare the theoretical parallelisation potential factor and the realised parallelisation efficiency, we use one process for tCOLA and 512 processes for sCOLA. Each process is run on a node with 32 cores using OpenMP parallelisation.
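The quoted r and p values follow from the tiling geometry alone. A sketch of the bookkeeping (the formulas below are inferred from the quoted values and reproduce both this setup and the p = 8 setup of Sect. 4.1):

```python
def tiling_factors(L, L_tile, L_buffer):
    """Oversimulation factor r (total simulated volume over box volume)
    and parallelisation potential p = N_tiles / r, for a cubic tiling.

    All lengths are in the same units (e.g. Mpc/h)."""
    n_tiles = round(L / L_tile) ** 3                     # number of tiles
    r = ((L_tile + 2.0 * L_buffer) / L_tile) ** 3        # volume overhead
    p = n_tiles / r                                      # parallelism gain
    return n_tiles, r, p
```

For the 1 Gpc h^{−1} box above (L_buffer = 30 cells of 1000/1024 Mpc h^{−1}), this gives 512 tiles, r ≈ 3.17, and p ≈ 161.6; for the {L_tile = 50, L_buffer = 25} Mpc h^{−1} setup in the 200 Mpc h^{−1} box, it gives 64 tiles, r = 8 and p = 8, matching the text.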
One of the main advantages of our sCOLA algorithm lies in its reduced memory consumption. In Fig. 8 (first row), we show the memory requirements for the calculation of LPT potentials in the full box (common for tCOLA and sCOLA), for the evolution of the full box with tCOLA, and for the evolution of each sCOLA box, all in single-precision floating-point format. LPT requires eight grids of size N^{3} (one for the initial conditions, one for the Zel’dovich potential, and six for the second-order term), occupying ∼4.3 GB. Evolution with tCOLA requires one integer and 12 floating-point numbers per particle (their identifier, their position x, their momentum p, and the vectors Ψ_{1} and Ψ_{2}), plus a PM grid of 1024^{3} voxels, for a total of ∼60.1 GB. Within each box, sCOLA requires the same memory per particle (but only for the much smaller number of particles in the sCOLA box), a PM grid of size N_{g}^{3}, and some overhead for Dirichlet boundary conditions. The total is around 400 MB per sCOLA box with the setup considered here.
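The quoted figures can be reproduced with simple bookkeeping (decimal GB; the 13 values per particle are the identifier, x, p, Ψ_1 and Ψ_2 mentioned above):

```python
def lpt_memory_gb(n_lpt_voxels):
    """Eight single-precision grids of N^3 voxels for the LPT step."""
    return 8 * n_lpt_voxels * 4 / 1e9

def tcola_memory_gb(n_particles, n_pm_voxels):
    """One 4-byte integer and 12 single-precision floats per particle,
    plus one single-precision PM grid."""
    return (n_particles * 13 * 4 + n_pm_voxels * 4) / 1e9
```

With the numbers of the text, lpt_memory_gb(512**3) gives ∼4.3 GB and tcola_memory_gb(1024**3, 1024**3) gives ∼60.1 GB.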
Fig. 8. Memory requirements (first row) and timings for two corresponding tCOLA and sCOLA simulations. Although the CPU time required is higher for sCOLA, the memory consumption and wall-clock time are significantly reduced with respect to tCOLA, due to the perfectly parallel nature of most computations (second row). In the middle left panel, the height of the white bar shows the hypothetical cost of running tCOLA for the same volume as simulated with sCOLA, when taking buffer regions into account. The relative contributions of different operations, as detailed in the legend, are shown in the third row. The main difference in computational cost in sCOLA with respect to tCOLA comes from the use of DSTs instead of FFTs, which makes the evaluation of the potential significantly more expensive.
In the second row of Fig. 8, we show the overall cost of tCOLA versus sCOLA, both in terms of CPU time (middle left panel) and wall-clock time (middle right panel). The key feature of our algorithm is that, although the overall CPU time needed is unavoidably higher than with tCOLA, the wall-clock time spent can be drastically reduced. This is due to the degree of parallelism of our algorithm, which is equal to the number of sCOLA boxes. In particular, if as many processes as sCOLA boxes can be allocated (512 in this case), the overall wall-clock time is determined by the initial full box operations (common with tCOLA, see Sect. 3.1), plus the cost of evolving only one sCOLA box (an average of 30.9 wall-clock seconds on 32 cores in this test). This is what is shown in the middle right panel of Fig. 8. The wall-clock time reduction factor is ≈93 for the evolution only (≈11 when accounting for initialisation and writing outputs). Compared to the parallelisation potential factor p ≈ 162, this number means that sCOLA-specific operations and the larger fractional parallelisation overhead in sCOLA boxes do not significantly hamper the perfectly parallel nature of the code.
The increased CPU time needed with sCOLA (see Fig. 8, middle left panel) is partly due to the necessity of oversimulating the volume of interest by a factor r > 1 for accuracy. For comparison with the sCOLA CPU time, the height of the white bar shows the tCOLA CPU time multiplied by r. The rest of the difference in CPU time principally comes from the fact that simulations with our variant of sCOLA are intrinsically more expensive than with tCOLA for a periodic volume of the same size. This point is further discussed below.
In the third row of Fig. 8, we show the various relative contributions to CPU time and wall-clock time, both for full tCOLA/sCOLA runs and per tCOLA/sCOLA box. The generation of the initial conditions (brown, step A.1.) and writing of outputs to disk (grey) are common to tCOLA and sCOLA and have an overall fixed cost. LPT calculations in the full box (pink) consist of computing the Lagrangian potentials and the particle-based LPT displacements in tCOLA, but are limited to computing the Lagrangian potentials in the full box in the case of sCOLA (step A.2.). These full-box operations are only shown in the bars labelled “tCOLA” and “sCOLA”. Within each box, the different operations are evaluating the density field (yellow), solving the Poisson equation to get the gravitational potential (green), differentiating the gravitational potential to get the accelerations (blue), “kicking” particles (red), and “drifting” particles (purple). sCOLA further requires some specific operations within each box: communicating with the master process (steps B.1., B.2., and C.1.) and calculating the particle-based LPT displacements (step C.2.), grouped in Fig. 8 and shown in orange; and precomputing the Dirichlet boundary conditions with the LEP approximation (step C.3., cyan). sCOLA-specific operations do not contribute more than 10% of the CPU and wall-clock times per box.
A notable difference between evolving a given box with sCOLA or with tCOLA resides in the higher cost of evaluating the potential (green): in this case, 9% of CPU time and 13% of wall-clock time with sCOLA versus 6% of CPU time and 3% of wall-clock time with tCOLA. This effect is due to the use of DSTs, required by the Poisson solver with Dirichlet boundary conditions (see Sect. 3.4 and Appendix C), instead of FFTs. Indeed, depending on the size of the PM grid, the evaluation of DSTs can be the computational bottleneck of our algorithm (up to 60% of overall CPU time in some of our runs), as opposed to the evaluation of the density field (e.g. via CiC) in traditional tCOLA or PM codes (37% of overall CPU time). For this reason, within each setup, we recommend performing experiments to find a PM grid size giving a good compromise between force accuracy and computational efficiency. In particular, it is strongly preferable that N_{g} + 1 not contain large prime factors (this number appears in the basis functions of sine transforms, see Appendix C.2). Throughout this paper, we ensured that N_{g} + 1 is always even, while keeping roughly the same force resolution as the corresponding tCOLA simulation. We note that our choice of N_{g} + 1 = 200 in the present test, combined with the use of a power of two for the PM grid in the monolithic tCOLA run, favours tCOLA in the comparison of CPU times. The sCOLA CPU time shown in the middle left panel of Fig. 8 could be further optimised by making N_{g} + 1 a power of two in sCOLA boxes.
5. Discussion and conclusion
5.1. Discussion
The principal computational challenge of the gravitational N-body problem is the long-range nature of the gravitational force. Our sCOLA approach enables perfectly parallel computations and therefore opens up profoundly new possibilities for how to compute large-scale cosmological simulations. We discuss these possibilities, some of their consequences, and possible future directions in the following.
Gravity and physics models. It is important to note that the sCOLA algorithm introduced in this work is general, and not limited to the gravity model used here: while we focused on a tCOLA particle-mesh implementation to evolve the sCOLA tiles, this choice was designed to facilitate the assessment of tiling artefacts against monolithic tCOLA runs. Nonetheless, any N-body method, such as particle-particle–particle-mesh, tree methods or AMR, could be used to evolve each tile. In particular, since the sCOLA approach separates quasi-linear and nonlinear scales, there is no need to cut off the computation on small scales. In concert with the approaches discussed below, this fact can be exploited to perform very high-resolution, fully nonlinear simulations in cosmological volumes. In this case, the spatial decoupling due to sCOLA would render computations possible that would otherwise be prohibitive.
Similar comments apply to including non-gravitational physics: since hydrodynamical or other non-gravitational forces are typically much more local than gravitational interactions, there are no algorithmic barriers to including them in each sCOLA tile^{8}.
Construction of lightcones and mock catalogues. The decoupling of computational volumes achieved by our approach means that each sCOLA box can be run completely independently. Therefore, it is not necessary to define a common final redshift for all tiles. This means that to compute a cosmological lightcone, only a single tile (the one containing the observer) needs to be run to redshift zero. Since the volume on the lightcone increases rapidly with redshift, the vast majority of tiles would only have to be run until they intersect the lightcone at high redshift. In monolithic N-body simulations, most of the computational time is spent at low redshift, since the local timestep of simulations decreases with the local dynamical time. Our approach would therefore greatly reduce the time needed to complete lightcone simulations, by scheduling tiles in order of the redshift to which they should run (and therefore in reverse order of expected computational time), aiding load balancing.
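As an illustration of the scheduling idea (a hypothetical helper, not part of SIMBELMYNë), one can compute for each tile the scale factor at which it first intersects the observer's backward lightcone, and then process tiles in increasing order of that value:

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

def a_at_lightcone(chi, Om=0.3089, Ol=0.6911):
    """Scale factor at which the backward lightcone of an observer at
    a = 1 crosses comoving distance chi (in Mpc/h), for flat LCDM with
    radiation neglected. A tile whose nearest corner sits at distance
    chi only needs to be evolved down to this scale factor."""
    c_over_H0 = 2997.92458  # Hubble distance c / H_0 in Mpc/h

    def E(x):  # H(x) / H_0
        return np.sqrt(Om / x**3 + Ol)

    def chi_of_a(a):  # comoving distance from a to the observer at a = 1
        integral, _ = quad(lambda t: 1.0 / (t * t * E(t)), a, 1.0)
        return c_over_H0 * integral

    if chi <= 0.0:
        return 1.0  # the observer's own tile runs to redshift zero
    return brentq(lambda a: chi_of_a(a) - chi, 1e-3, 1.0 - 1e-12)
```

Sorting tiles by a_at_lightcone of their distance to the observer then schedules the distant (cheap, high-redshift) tiles first and the observer's tile (the only one run to a = 1) last, as described above.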
The construction of lightcones for surveys with large aspect ratios, such as pencil-beam surveys, can further benefit from sCOLA. Indeed, tiles that do not intersect the three-dimensional survey window do not need to be run at all for the construction of mock catalogues. In such a case, the algorithm will still capture the effects of large-scale transverse modes, even if the simulated volume is not substantially increased with respect to the survey volume.
Low memory requirements. sCOLA divides the computational volume into much smaller tiles and vastly reduces the memory footprint of each independent sCOLA tile computation, as shown in Sect. 4.4. As an example, simulating a (16 Gpc h^{−1})^{3} volume containing 8192^{3} particles to achieve a mass resolution of 10^{12.5} M_{⊙} requires ∼19.8 TB of RAM with a PM code and ∼33.0 TB of RAM with tCOLA. The setup {L_{tile} = 62.5 Mpc h^{−1}, L_{buffer} = 62.5 Mpc h^{−1}} would break down the problem into 256^{3} tiles, each with (3 × 32)^{3} particles and a memory footprint of ∼53 MB. This has important consequences, which we explore in the following.
The very modest memory requirement of our algorithm opens up multiple possibilities to accelerate the computation: even on traditional systems, the entire computation of each sCOLA tile would fit entirely into the L3 cache of a multicore processor. This would cut out the slowest parts of the memory hierarchy, leading to a large potential performance boost and reducing code complexity. Even more promising, many such tiles could be evolved entirely independently on GPU accelerators, or even dedicated FPGAs, taking advantage of hybrid architectures of modern computational platforms while reducing the need to develop sophisticated code to manage task parallelism. At this scale, each tile computation would even fit comfortably on ubiquitous small computational platforms such as mobile phones.
Grid computing. The perfect scalability achieved by our approach means that large N-body simulations can even be run on very inexpensive, strongly asynchronous networks designed for high-throughput computing. An extreme example would be participatory computing platforms such as Cosmology@Home^{9}, where tens of thousands of users donate computational resources. The use of such platforms would be particularly suited to lightcone computations, as described above. Even if running the low-redshift part necessitates dedicated hardware, other workers could efficiently work independently to compute most of the volume, which lives at high redshift. Only two communication steps are required for each tile: the LPT potentials are received at the beginning, and at the end of the computation each tile returns its final state at the redshift where it intersects the lightcone.
Node failures. Robustness to node failure is an important consideration on all very large computational platforms. Even with an extremely low failure probability per node, when the number of nodes is large the probability that some node fails during a computation becomes high. After its initialisation steps (see Sect. 3.1), our approach is entirely robust to such failures, since any individual tile can be recomputed after the fact on a modest system, for very little cost.
5.2. Conclusion
In this paper, we introduced a perfectly parallel and easily applicable algorithm for cosmological simulations using sCOLA. Our approach is based on a tiling of the full simulation box, where each tile is run independently. By the use of buffer regions and appropriate Dirichlet boundary conditions, we improved the accuracy of the algorithm with respect to Tassev et al. (2015). In particular, we showed that suitable setups can reach 3% to 1% accuracy at all the scales simulated, as required for data analysis of the next generation of large-scale structure surveys. In case studies, we tested the relative impact of the two approximations involved in our approach, for density assignment and the boundary gravitational potential. We considered the computational cost of our algorithm and demonstrated that even if the CPU time needed is unavoidably higher, the wall-clock time and memory footprint can be drastically reduced.
This study opens up a wide range of possible extensions, discussed in Sect. 5.1. Benefiting from its perfect scalability, the approach could also allow for novel analyses of cosmological data from fully non-linear models previously too expensive to be tractable. It could straightforwardly be used for the construction of mock catalogues, but also within recently introduced likelihood-free inference techniques such as DELFI (Alsing et al. 2018), BOLFI (Leclercq 2018) and SELFI (Leclercq et al. 2019), which have a need for cheap simulator-based data models. We therefore anticipate that sCOLA will become an important tool in computational cosmology for the coming era.
Our perfectly parallel sCOLA algorithm has been implemented in the publicly available SIMBELMYNë code^{10}, where it is included in version 0.4.0 and later.
See for example ORNL’s next supercomputer, Frontier: https://www.olcf.ornl.gov/wp-content/uploads/2019/05/frontier_specsheet.pdf
This means that in the case of our new sCOLA algorithm, we use COLA both “in space and time” (see Tassev et al. 2015).
Acknowledgments
We are grateful to Matías Zaldarriaga for stimulating discussions and useful comments throughout the realisation of this project. We thank Jens Jasche and Svetlin Tassev for discussions that triggered this project, and Oliver Hahn for constructive observations. FL and BDW acknowledge the hospitality of the Institute for Advanced Study, Princeton, where this project was initiated. FL, BF and WJP thank the Institute of Cosmology and Gravitation of the University of Portsmouth, where part of this work was prepared. This work made use of NumPy (van der Walt et al. 2011), IPython (Perez & Granger 2007), Matplotlib (Hunter 2007), Jupyter notebooks (Kluyver et al. 2016), and the colourmaps provided by the cmocean (Thyng et al. 2016), and CMasher (https://github.com/1313e/CMasher) packages. FL acknowledges funding from the Imperial College London Research Fellowship Scheme. GL and BDW acknowledge financial support from the ANR BIG4, under reference ANR16CE230002. The Center for Computational Astrophysics is supported by the Simons Foundation. Research at Perimeter Institute is supported in part by the Government of Canada through the Department of Innovation, Science and Economic Development Canada and by the Province of Ontario through the Ministry of Colleges and Universities. Numerical computations were done on the Sciama High Performance Compute (HPC) cluster which is supported by the ICG, SEPNet and the University of Portsmouth; and on the cx1 cluster hosted by the Research Computing Service facilities at Imperial College London (doi:10.14469/hpc/2232). This work was done within the Aquila Consortium (https://aquilaconsortium.org). FL and BDW conceived the project. FL wrote the SIMBELMYNë code, implemented the new sCOLA algorithm, ran the simulations, performed the study, supervised BF’s internship project, and wrote the bulk of the paper. BF contributed to the first implementation of the tiling algorithm in SIMBELMYNë and to early tests of the method. 
GL suggested the investigation of an alternative Poisson solver, proposed tests of the impact of boundary effects, and contributed to writing the paper. BDW made conceptual contributions, helped design the accuracy and speed tests, and contributed to writing the paper. AHJ prompted the use of the linearly-evolving potential approximation. AHJ and AFH contributed to the interpretation of results. WJP supported the design of the first version of the algorithm and contributed to student supervision. CN contributed to the collegial construction of the standards of science, by developing the methodological framework, the state of the art, as well as post-publication procedures. All natural authors read and approved the final manuscript.
References
 Alimi, J. M., Bouillot, V., & Rasera, Y. 2012, DEUS Full Observable ΛCDM Universe Simulation: the numerical challenge
 Alsing, J., Wandelt, B., & Feeney, S. 2018, MNRAS, 477, 2874
 Amdahl, G. M. 1967, in Proceedings of the April 18–20, 1967, Spring Joint Computer Conference, AFIPS ’67 (Spring) (New York, NY, USA: Association for Computing Machinery), 483
 Aubert, D., Deparis, N., & Ocvirk, P. 2015, MNRAS, 454, 1012
 Audren, B., Lesgourgues, J., Bird, S., Haehnelt, M. G., & Viel, M. 2013, J. Cosmol. Astropart. Phys., 2013, 026
 Austermann, J. E., Aird, K. A., & Beall, J. A. 2012, SPTpol: an instrument for CMB polarization measurements with the South Pole Telescope, SPIE Conf. Ser., 8452, 84521E
 Bagla, J. S., & Padmanabhan, T. 1994, MNRAS, 266, 227
 Barausse, E., Berti, E., & Hertog, T. 2020, Prospects for Fundamental Physics with LISA
 Berger, M. J., & Colella, P. 1989, J. Comp. Phys., 82, 64
 Bernardeau, F., Colombi, S., Gaztañaga, E., & Scoccimarro, R. 2002, Phys. Rep., 367, 1
 Birdsall, C. K., & Langdon, A. B. 1985, Plasma Physics via Computer Simulation (CRC Press)
 Bouchet, F. R., Colombi, S., Hivon, E., & Juszkiewicz, R. 1995, A&A, 296, 575
 Brainerd, T. G., Scherrer, R. J., & Villumsen, J. V. 1993, ApJ, 418, 570
 Bryan, G. L., Norman, M. L., O’Shea, B. W., et al. 2014, ApJS, 211, 19
 Buchert, T., Melott, A. L., & Weiß, A. G. 1994, A&A, 288, 349
 Cheng, S., Yu, H.-R., Inman, D., et al. 2020, CUBE – Towards an Optimal Scaling of Cosmological N-body Simulations
 Chisari, N. E., Richardson, M. L. A., Devriendt, J., et al. 2018, MNRAS, 480, 3962
 DESI Collaboration 2016, The DESI Experiment Part I: Science, Targeting, and Survey Design
 Eisenstein, D. J., & Hu, W. 1998, ApJ, 496, 605
 Eisenstein, D. J., & Hu, W. 1999, ApJ, 511, 5
 Frigo, M., & Johnson, S. G. 2005, Program Generation, Optimization, and Platform Adaptation, Proc. IEEE, 93, 216
 Fryxell, B., Olson, K., Ricker, P., et al. 2000, ApJS, 131, 273
 Garrison, L. H., Eisenstein, D. J., & Pinto, P. A. 2019, MNRAS, 485, 3370
 Gonnet, P., Schaller, M., Theuns, T., & Chalk, A. B. G. 2013, SWIFT: Fast Algorithms for Multi-resolution SPH on Multi-core Architectures
 Guillet, T., & Teyssier, R. 2011, J. Comp. Phys., 230, 4756
 Hahn, O., & Abel, T. 2011, MNRAS, 415, 2101
 Hahn, O., Angulo, R. E., & Abel, T. 2015, MNRAS, 454, 3920
 Hockney, R. W., & Eastwood, J. W. 1981, Computer Simulation Using Particles (McGraw-Hill)
 Howlett, C., Manera, M., & Percival, W. J. 2015, Astron. Comput., 12, 109
 Hunter, J. D. 2007, Comput. Sci. Eng., 9, 90
 Huterer, D., & Takada, M. 2005, Astropart. Phys., 23, 369
 Izard, A., Crocce, M., & Fosalba, P. 2016, MNRAS, 459, 2327
 James, R. A. 1977, J. Comp. Phys., 25, 71
 Johnson, S. G., & Frigo, M. 2008, in Fast Fourier Transforms, ed. C. S. Burrus (Houston, TX: Rice University Connexions)
 Kluyver, T., Ragan-Kelley, B., & Pérez, F. 2016, ELPUB
 Klypin, A., & Holtzman, J. 1997, Particle-Mesh code for cosmological simulations
 Knebe, A., & Doumler, T. 2010, AMIGA: Adaptive Mesh Investigations of Galaxy Assembly
 Koda, J., Blake, C., Beutler, F., Kazin, E., & Marin, F. 2016, MNRAS, 459, 2118
 Laureijs, R., Amiaux, J., & Arduini, S. 2011, Euclid Definition Study Report
 Leclercq, F. 2015, Ph.D. Thesis, Institut d’Astrophysique de Paris
 Leclercq, F. 2018, Phys. Rev. D, 98, 063511
 Leclercq, F., Jasche, J., & Wandelt, B. 2015, J. Cosmol. Astropart. Phys., 6, 15
 Leclercq, F., Jasche, J., Lavaux, G., Wandelt, B., & Percival, W. 2017, J. Cosmol. Astropart. Phys., 6, 049
 Leclercq, F., Enzi, W., Jasche, J., & Heavens, A. 2019, MNRAS, 490, 4237
 LIGO Scientific Collaboration 2015, Class. Quant. Grav., 32, 074001
 LSST Science Collaboration 2012, Large Synoptic Survey Telescope: Dark Energy Science Collaboration
 Merloni, A., Predehl, P., Becker, W., et al. 2012, eROSITA Science Book: Mapping the Structure of the Energetic Universe
 Ocvirk, P., Gillet, N., Shapiro, P. R., et al. 2016, MNRAS, 463, 1462
 Perez, F., & Granger, B. E. 2007, Comput. Sci. Eng., 9, 21
 Planck Collaboration XIII. 2016, A&A, 594, A13
 Potter, D., Stadel, J., & Teyssier, R. 2017, Comput. Astrophys. Cosmol., 4, 2
 Quinn, T., Katz, N., Stadel, J., & Lake, G. 1997, Time stepping N-body simulations
 Schneider, A., Teyssier, R., Stadel, J., et al. 2019, J. Cosmol. Astropart. Phys., 3, 020
 Simon, S. M., Beall, J. A., Cothard, N. F., et al. 2018, J. Low Temp. Phys., 193, 1041
 Simons Observatory Collaboration 2019, J. Cosmol. Astropart. Phys., 2, 056
 SPHEREx Science Team 2018, Am. Astron. Soc. Meet. Abstr., 231, 354.21
 Square Kilometre Array Cosmology Science Working Group 2018, Cosmology with Phase 1 of the Square Kilometre Array; Red Book 2018: Technical specifications and performance forecasts
 Strauss, M. A., Cen, R., Ostriker, J. P., Lauer, T. R., & Postman, M. 1995, ApJ, 444, 507
 Tassev, S., & Zaldarriaga, M. 2012, J. Cosmol. Astropart. Phys., 12, 11
 Tassev, S., Eisenstein, D. J., Wandelt, B. D., & Zaldarriaga, M. 2015, sCOLA: The N-body COLA Method Extended to the Spatial Domain
 Tassev, S., Zaldarriaga, M., & Eisenstein, D. J. 2013, J. Cosmol. Astropart. Phys., 6, 36
 Teyssier, R. 2002, A&A, 385, 337
 The Virgo Collaboration 2020, J. Phys. Conf. Ser., 1342, 012010
 Theuns, T., Chalk, A., Schaller, M., & Gonnet, P. 2015, SWIFT: task-based Hydrodynamics and Gravity for Cosmological Simulations
 Thyng, K. M., Greene, C. A., Hetland, R. D., & Zimmerle, H. M. 2016, Oceanography, 29
 van Daalen, M. P., Schaye, J., Booth, C. M., & Dalla Vecchia, C. 2011, MNRAS, 415, 3649
 van der Walt, S., Colbert, S. C., & Varoquaux, G. 2011, Comput. Sci. Eng., 13, 22
 Yu, H.-R., Pen, U.-L., & Wang, X. 2018, ApJS, 237, 24
 Zel’dovich, Y. B. 1970, A&A, 5, 84
Appendix A: Model equations
A.1. Model equations in the standard PM code
Denoting by a the scale factor of the Universe and τ the conformal time, a PM code solves the equations of motion for the position x and momentum p of dark matter particles in comoving coordinates (the mass of particles m is absorbed in the definition of the momentum p):

dx/dτ = p/a, (A.1)
dp/dτ = −a ∇_{x}Φ(x, τ), (A.2)

coupled to the Poisson equation for the gravitational potential Φ, sourced by density fluctuations (Eq. (8)),

Δ_{x}Φ(x, τ) = 4πG a²(τ) ρ̄(τ) δ(x, τ), (A.3)

where G is the gravitational constant and ρ̄(τ) is the mean matter density at conformal time τ. The density contrast δ is defined from the local matter density ρ(x, τ) by

δ(x, τ) ≡ ρ(x, τ)/ρ̄(τ) − 1. (A.4)
For simplicity, from now on we write ∇_{x} = ∇, Δ_{x} = Δ and δ(x, τ) = δ.
It is convenient to choose the scale factor as the time variable. Using ∂_{τ} = a′ ∂_{a} and the background evolution (a prime denotes differentiation with respect to τ, and the superscript (0) denotes quantities at the present time), the equations to solve are rewritten as:

dx/da = p/(a a′), (A.5)
dp/da = −(a/a′) ∇Φ, (A.6)
ΔΦ = (3/2) Ω_{m}^{(0)} (ℋ^{(0)})² δ/a. (A.7)

We will use the equivalent formulation

dx/da = 𝒟(a) p, (A.8)
dp/da = 𝒦(a) ∇(Δ^{−1}δ), (A.9)

where we have combined Eqs. (A.6) and (A.7), introduced the conformal Hubble factor ℋ(a) ≡ a′/a, and defined the “drift prefactor” 𝒟(a) ≡ 1/(a²ℋ(a)) and the “kick prefactor” 𝒦(a) ≡ −(3/2) Ω_{m}^{(0)} (ℋ^{(0)})²/(a ℋ(a)).
A.2. Model equations with COLA
We now introduce the COLA scheme, following Tassev et al. (2013, 2015). For each particle, we work in the frame comoving with its LPT observer, whose position is given by (see Sect. 2.1)

x_{LPT}(a) = q − D_{1}(a) Ψ_{1} + D_{2}(a) Ψ_{2}, (A.10)

where q is the Lagrangian position and we have introduced the time-independent vectors Ψ_{1} ≡ ∇_{q}ϕ^{(1)} and Ψ_{2} ≡ ∇_{q}ϕ^{(2)}. Writing x(a) = x_{LPT}(a) + x_{res}(a) for the final position of the same particle, we have

dx_{LPT}/da ≡ 𝒟(a) p_{LPT}(a), (A.11)

with

p_{LPT}(a) = (1/𝒟(a)) [−(dD_{1}/da) Ψ_{1} + (dD_{2}/da) Ψ_{2}]. (A.12)
We also define p_{res} such that dx_{res}/da ≡ 𝒟(a) p_{res}. Then p = p_{LPT} + p_{res} (see Eq. (A.8)). Furthermore,

dp_{LPT}/da = 𝒦(a) 𝒱[x_{LPT}](a), (A.13)

where the differential operator 𝒱[⋅](a) is defined by

𝒱[X](a) ≡ (1/𝒦(a)) d/da [(1/𝒟(a)) dX/da]. (A.14)

With these notations, Eq. (A.9) reads

dp_{res}/da = 𝒦(a) [∇(Δ^{−1}δ) − 𝒱[x_{LPT}](a)]. (A.15)
In COLA, the natural variables are therefore x and p_{res}.
As mentioned in Sect. 2.1, the key point in COLA is that the fictitious LPT force acting on particles, 𝒱[x_{LPT}](a), can be computed analytically. From Eq. (A.10), it is straightforward to check that 𝒱[x_{LPT}](a) = −𝒱[D_{1}](a)Ψ_{1} + 𝒱[D_{2}](a)Ψ_{2}. The computation of 𝒱[D_{1}](a) and 𝒱[D_{2}](a) uses the differential equations satisfied by the linear and second-order growth factors, as well as the second Friedmann equation. The result is (see e.g. Leclercq 2015, Eqs. (1.7), (1.96), (1.118) and Appendix B)

𝒱[D_{1}](a) = −D_{1}(a) and 𝒱[D_{2}](a) = D_{1}²(a) − D_{2}(a). (A.16)
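The growth factors entering 𝒱[D_{1}] and 𝒱[D_{2}] are obtained from their differential equations. As an illustration, here is a minimal numerical solver for the linear growth factor D_{1}(a), using the standard Newtonian growth equation; the normalisation (growing mode with D_{1} ∼ a at early times) and conventions are ours, not necessarily those of the paper:

```python
import numpy as np
from scipy.integrate import solve_ivp

def linear_growth(a_eval, omega_m=0.3, omega_l=0.7):
    """Solve D1'' + (3/a + dlnE/da) D1' = 1.5 * omega_m * D1 / (a^5 E^2),
    starting from the matter-dominated growing mode D1(a) ~ a."""
    E2 = lambda a: omega_m / a**3 + omega_l          # E^2(a) = H^2(a)/H0^2
    dlnE_da = lambda a: -1.5 * omega_m / a**4 / E2(a)
    def rhs(a, y):
        d, dp = y
        return [dp, -(3.0 / a + dlnE_da(a)) * dp
                    + 1.5 * omega_m * d / (a**5 * E2(a))]
    a0 = 1e-3
    sol = solve_ivp(rhs, (a0, max(a_eval)), [a0, 1.0],
                    t_eval=a_eval, rtol=1e-9, atol=1e-12)
    return sol.y[0]

# Einstein-de Sitter check: D1(a) = a exactly
d = linear_growth(np.array([0.5, 1.0]), omega_m=1.0, omega_l=0.0)
```

In the Einstein-de Sitter limit the solver reproduces D_{1}(a) = a to the integrator's tolerance, which is a convenient sanity check of the equation's coefficients.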
The equations of motion to solve are therefore, in tCOLA,

dx/da = 𝒟(a) p_{res} + dx_{LPT}/da, (A.17)
dp_{res}/da = 𝒦(a) [∇(Δ^{−1}δ) − 𝒱[x_{LPT}](a)]. (A.18)

These are mathematically equivalent to the equations of motion of a PM code (Eqs. (A.8) and (A.9)). In sCOLA, the “kick equation” (Eq. (A.18)) is replaced for each particle of the sCOLA box by the approximation (Tassev et al. 2015)

dp_{res}/da = 𝒦(a) [∇^{sCOLA}((Δ^{sCOLA})^{−1} δ^{sCOLA}) − 𝒱[x_{LPT}](a)], (A.19)

with the notations introduced in Sect. 2.3, as well as ∇^{sCOLA} and Δ^{sCOLA} the gradient and Laplacian operators defined on the sCOLA box. Importantly, the “drift equation” (Eq. (A.17)) is not modified, since we are always, by definition, computing a residual displacement with respect to the LPT observer of the full box, whose position is given by Eq. (A.10).
Appendix B: Standard and modified timestepping
B.1. Timestepping in the standard PM algorithm
In this paper, we adopt the second-order symplectic “kick-drift-kick” algorithm, also known as the leapfrog scheme (e.g. Birdsall & Langdon 1985), to integrate the equations of motion. This algorithm relies on integrating the equations over a small timestep and approximating the momenta (p in the “drift equation” (A.8)) and accelerations (∇(Δ^{−1}δ) in the “kick equation” (A.9)) that appear in the integrands by their value at some time within the interval. This defines the Drift (D) and Kick (K) operators, which read, using the standard discretisation (Quinn et al. 1997):

D: x(t_{n+1}) = x(t_{n}) + Δ_{𝒟} p(t_{n+1/2}), (B.1)
K: p(t_{n+1}) = p(t_{n}) + Δ_{𝒦} ∇(Δ^{−1}δ)(t_{n+1/2}), (B.2)

where

Δ_{𝒟} ≡ ∫_{a(t_{n})}^{a(t_{n+1})} 𝒟(a) da and Δ_{𝒦} ≡ ∫_{a(t_{n})}^{a(t_{n+1})} 𝒦(a) da, (B.3)

and t is a function of the scale factor a (typically t(a) = a or t(a) = ln(a) for timesteps linearly spaced or logarithmically spaced in the scale factor, respectively).

The time evolution between t_{0} = t(a_{i}) and t_{n+1} = t(a_{f}) is then achieved by applying the following operator, E(t_{n+1}, t_{0}), to the initial state (x(t_{0}), p(t_{0})):

E(t_{n+1}, t_{0}) ≡ ∏_{i=n}^{0} K(t_{i+1/2}, t_{i+1}) D(t_{i}, t_{i+1}) K(t_{i}, t_{i+1/2}). (B.4)
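In schematic form, each elementary step of this operator chains a half kick, a full drift, and a half kick. A toy sketch with a constant timestep (the cosmological version replaces dt by the integrated drift and kick time factors of Eq. (B.3)):

```python
import math

def leapfrog_kdk(x, p, accel, dt, n_steps):
    """Second-order symplectic kick-drift-kick integrator."""
    for _ in range(n_steps):
        p += 0.5 * dt * accel(x)   # half kick
        x += dt * p                # full drift
        p += 0.5 * dt * accel(x)   # half kick
    return x, p

# toy check: harmonic oscillator, exact solution x(t) = cos(t)
x, p = leapfrog_kdk(1.0, 0.0, lambda x: -x, 1.0e-3, 1000)
```

Being symplectic, the scheme conserves a shadow Hamiltonian, which is why it is favoured for long N-body integrations over generic second-order schemes.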
B.2. Timestepping with COLA, standard discretisation
Using the standard discretisation (Quinn et al. 1997) of Eqs. (A.17) and (A.18), the Drift and Kick operators for tCOLA are defined by

D^{tCOLA}: x(t_{n+1}) = x(t_{n}) + Δ_{𝒟} p_{res}(t_{n+1/2}) + x_{LPT}(a(t_{n+1})) − x_{LPT}(a(t_{n})), (B.5)
K^{tCOLA}: p_{res}(t_{n+1}) = p_{res}(t_{n}) + Δ_{𝒦} [∇(Δ^{−1}δ) − 𝒱[x_{LPT}]](t_{n+1/2}), (B.6)

where the time factors Δ_{𝒟} and Δ_{𝒦} are the same as in the PM case (see Eq. (B.3)). For sCOLA, the Kick operator K^{sCOLA} is given by the same expression (Eq. (B.6)), but operates on quantities and differential operators superscripted “sCOLA”, consistently with Eq. (A.19).
In the initial conditions, generated with LPT, we have p = p_{LPT}; therefore the momentum residual in the rest frame of LPT observers, p_{res}, should be initialised to zero. At the end, the LPT momentum p_{LPT} has to be added to p_{res} to recover the full momentum of particles, p. This corresponds respectively to the L_{−} and L_{+} operators (Tassev et al. 2013, Appendix A), given by

L_{−}: (x, p) ↦ (x, p_{res} = p − p_{LPT}(a_{i})), (B.7)
L_{+}: (x, p_{res}) ↦ (x, p = p_{res} + p_{LPT}(a_{f})). (B.8)
In COLA, the time evolution between t_{0} = t(a_{i}) and t_{n+1} = t(a_{f}) is therefore achieved by applying the following operator to the initial state (x(t_{0}), p(t_{0})):

E^{COLA}(t_{n+1}, t_{0}) ≡ L_{+} E(t_{n+1}, t_{0}) L_{−}, (B.9)

where E(t_{n+1}, t_{0}) is the operator given by Eq. (B.4), replacing D by D^{tCOLA} and K by K^{tCOLA} (or their sCOLA counterparts).
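The effect of the L_{∓} operators can be illustrated on a toy problem where the analytic “observer” trajectory is taken to be the exact solution, so that the residual momentum stays identically zero. This is a sketch of the structure only; in the real algorithm the kick updates p_{res} with the residual force of Eq. (A.18):

```python
def evolve_in_observer_frame(x0, p0, a_i, a_f, n_steps):
    """Toy COLA-style evolution for free motion dx/da = p, dp/da = 0.
    The analytic observer trajectory is exact here, so the residual
    momentum p_res remains zero and the integration is exact too."""
    x_obs = lambda a: x0 + p0 * (a - a_i)   # analytic observer position
    p_obs = lambda a: p0                    # analytic observer momentum
    p_res = p0 - p_obs(a_i)                 # L- : subtract observer momentum
    x, a = x0, a_i
    da = (a_f - a_i) / n_steps
    for _ in range(n_steps):
        # kick on the residual: true force minus observer force (zero here)
        p_res += 0.0
        # drift: residual displacement plus analytic observer displacement
        x += da * p_res + x_obs(a + da) - x_obs(a)
        a += da
    return x, p_res + p_obs(a_f)            # L+ : restore the full momentum

x, p = evolve_in_observer_frame(x0=1.0, p0=2.0, a_i=0.1, a_f=1.0, n_steps=10)
```

This illustrates why COLA tolerates very few timesteps: the integrator only has to capture the (small) departure from the analytic ansatz, not the full trajectory.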
B.3. Timestepping with COLA, modified discretisation
Another approach for the discretisation of Eqs. (A.17) and (A.18) is proposed by Tassev et al. (2013). For any arbitrary positive function u of t, we can rewrite

dx_{res}/dt = {p_{res}(t)/u(t)} u(t) 𝒟(a(t)) da/dt, (B.10)
dp_{res}/dt = {(1/u′(t)) dp_{res}/dt} u′(t). (B.11)

This form is particularly relevant if p_{res} has a time dependence which is entirely captured by a particular u(t), which is universal for all particles. For each equation, considering that the part between curly brackets is constant during the timestep (instead of the momentum and accelerations, respectively), the modified Drift and Kick operators are given by Eqs. (B.5) and (B.6) with the following modified time factors instead of Δ_{𝒟} and Δ_{𝒦}:

Δ̃_{𝒟} ≡ (1/u(t_{n+1/2})) ∫_{t_{n}}^{t_{n+1}} u(t) 𝒟(a(t)) (da/dt) dt, (B.12)
Δ̃_{𝒦} ≡ [u(t_{n+1}) − u(t_{n})]/u′(t_{n+1/2}), (B.13)

where, in Δ̃_{𝒦}, we have used the trivial integration ∫_{t_{n}}^{t_{n+1}} u′(t) dt = u(t_{n+1}) − u(t_{n}).
Using the Ansatz suggested by Tassev et al. (2013), u(a) = a^{n_{LPT}} when t(a) = a, we get the explicit expressions

Δ̃_{𝒟} = (1/a_{n+1/2}^{n_{LPT}}) ∫_{a_{n}}^{a_{n+1}} a^{n_{LPT}} 𝒟(a) da, (B.14)
Δ̃_{𝒦} = (a_{n+1}^{n_{LPT}} − a_{n}^{n_{LPT}})/(n_{LPT} a_{n+1/2}^{n_{LPT}−1}). (B.15)

We adopt this form and n_{LPT} = −2.5 for both tCOLA and sCOLA operators throughout this paper.
Appendix C: Poisson solver with Dirichlet boundary conditions
In this appendix, we describe how to compute the interior gravitational potential Φ with Dirichlet boundary conditions. The method is standard in computational physics and has been used at least since James (1977). Formally, we seek to solve the discrete Poisson equation,

ΔΦ = δ, (C.1)

subject to a known boundary potential Φ_{BCs}, where Δ is the FDA of the exact Laplacian operator, i.e. Δ = ∂_{x}² + ∂_{y}² + ∂_{z}², where ∂_{x}², ∂_{y}², and ∂_{z}² are discrete one-dimensional second-order derivatives (see Table 1 in Hahn & Abel 2011, for their expressions in FDA at order 2, 4 and 6).
C.1. Modified density distribution
We define Φ_{BCs} as having nonzero values only in a layer of N_{ghost} cells immediately outside the active domain of the PM grid. We can then write the desired potential as Φ = Φ̃ + Φ_{BCs}, where the required boundary condition for Φ̃ is Φ̃ = 0. From the definition of Φ_{BCs}, ΔΦ_{BCs} will be nonzero only in a layer of N_{ghost} active cells just inside the domain boundaries. We can thus define a modified density distribution,

δ̃ ≡ δ − ΔΦ_{BCs}, (C.2)
which is the same as δ everywhere except in the layer of N_{ghost} cells adjoining the domain boundaries. We can then employ a zero-boundary condition Poisson solver to obtain a solution Φ̃ of ΔΦ̃ = δ̃ (see Sect. C.2). Within the interior, where Φ_{BCs} = 0, this solution is the desired final solution of ΔΦ = δ with the Dirichlet boundary condition Φ = Φ_{BCs}.
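For N_{ghost} = 1 and the order-2 FDA, the construction of the modified source can be sketched as follows (a minimal illustration in our own notation, not the SIMBELMYNë implementation):

```python
import numpy as np

def laplacian_of_padded(phi_padded, d):
    """7-point Laplacian of a padded 3D field, evaluated on the interior."""
    p = phi_padded
    return (p[2:, 1:-1, 1:-1] + p[:-2, 1:-1, 1:-1]
          + p[1:-1, 2:, 1:-1] + p[1:-1, :-2, 1:-1]
          + p[1:-1, 1:-1, 2:] + p[1:-1, 1:-1, :-2]
          - 6.0 * p[1:-1, 1:-1, 1:-1]) / d**2

def modified_density(delta, phi_bcs_padded, d):
    """delta_tilde = delta - Laplacian(Phi_BCs). With Phi_BCs nonzero only
    in the single ghost layer, the correction touches only the cells
    adjoining the domain boundaries."""
    return delta - laplacian_of_padded(phi_bcs_padded, d)
```

Here phi_bcs_padded holds the Dirichlet values in the ghost layer and zeros inside; a zero-boundary solve of the modified source then yields the interior potential.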
C.2. Zero-boundary condition Poisson solver
In cosmological simulations, it is conventional to use FFTs to solve the Poisson equation, since the discrete Laplacian operator is diagonal in Fourier space. FFTs assume that the input source has periodic boundary conditions. Similarly, for zero boundary conditions, we can work with three-dimensional type-I discrete sine transforms (DST-I), defined by

δ^{ℓ, m, n} ≡ Σ_{i=1}^{N_{x}} Σ_{j=1}^{N_{y}} Σ_{k=1}^{N_{z}} δ_{i, j, k} S_{ℓ}^{x}(i) S_{m}^{y}(j) S_{n}^{z}(k), (C.3)

where δ_{i, j, k} is the value of the source field in the voxel indexed by 1 ≤ i ≤ N_{x}, 1 ≤ j ≤ N_{y}, 1 ≤ k ≤ N_{z} (N_{x} = N_{y} = N_{z} = N_{g} in this paper). The basis functions are defined by

S_{ℓ}^{x}(i) ≡ sin(iℓπ/(N_{x} + 1)), and similarly for S_{m}^{y}(j) and S_{n}^{z}(k). (C.4)

They ensure that the signal has zero boundary values (for i ∈ {0, N_{x} + 1} or j ∈ {0, N_{y} + 1} or k ∈ {0, N_{z} + 1}). They satisfy discrete orthogonality relations, for example,

Σ_{i=1}^{N_{x}} S_{ℓ}^{x}(i) S_{ℓ′}^{x}(i) = [(N_{x} + 1)/2] δ_{K}^{ℓℓ′}, (C.5)

where δ_{K} is the Kronecker symbol. The inverse transformation is simply a DST-I multiplied by 8/[(N_{x}+1)(N_{y}+1)(N_{z}+1)], i.e.

δ_{i, j, k} = 8/[(N_{x}+1)(N_{y}+1)(N_{z}+1)] Σ_{ℓ=1}^{N_{x}} Σ_{m=1}^{N_{y}} Σ_{n=1}^{N_{z}} δ^{ℓ, m, n} S_{ℓ}^{x}(i) S_{m}^{y}(j) S_{n}^{z}(k), (C.6)

and similarly for the gravitational potential,

Φ_{i, j, k} = 8/[(N_{x}+1)(N_{y}+1)(N_{z}+1)] Σ_{ℓ=1}^{N_{x}} Σ_{m=1}^{N_{y}} Σ_{n=1}^{N_{z}} Φ^{ℓ, m, n} S_{ℓ}^{x}(i) S_{m}^{y}(j) S_{n}^{z}(k). (C.7)
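The normalisation can be checked numerically with scipy's DST-I, keeping in mind that scipy's convention carries an extra factor of 2 per axis relative to the plain sine sum used here:

```python
import numpy as np
from scipy.fft import dstn, idstn

x = np.random.default_rng(0).standard_normal((4, 5, 6))
y = dstn(x, type=1)          # unnormalised DST-I along every axis
# With scipy's factor 2 per axis, transforming twice multiplies by
# 8 * (Nx+1) * (Ny+1) * (Nz+1) = 8 * 5 * 6 * 7:
assert np.allclose(dstn(y, type=1), 8 * 5 * 6 * 7 * x)
# idstn applies the inverse normalisation for us:
assert np.allclose(idstn(y, type=1), x)
```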
It is straightforward to show that S_{ℓ}^{x}, S_{m}^{y}, and S_{n}^{z} are eigenfunctions of the discrete one-dimensional second-order derivatives ∂_{x}², ∂_{y}², and ∂_{z}², respectively. The corresponding eigenvalues λ_{ℓ}^{x}, λ_{m}^{y} and λ_{n}^{z} are given by

λ_{ℓ}^{x} = (1/d_{x}²) [2 cos(θ_{ℓ}) − 2], (C.8)
λ_{ℓ}^{x} = (1/(12 d_{x}²)) [−2 cos(2θ_{ℓ}) + 32 cos(θ_{ℓ}) − 30], (C.9)
λ_{ℓ}^{x} = (1/(180 d_{x}²)) [4 cos(3θ_{ℓ}) − 54 cos(2θ_{ℓ}) + 540 cos(θ_{ℓ}) − 490], (C.10)

for FDA at order 2, 4, and 6 respectively, where θ_{ℓ} ≡ πℓ/(N_{x} + 1), d_{x} ≡ L_{x}/N_{x} and L_{x} is the size of the box along the x-direction (L_{x} = L_{sCOLA} ≡ L/N_{tiles} in this paper). Similar expressions exist for λ_{m}^{y} and λ_{n}^{z}.
Plugging Eqs. (C.6) and (C.7) into (C.1) and using the orthogonality relations, we obtain a simple form for the discretised Poisson equation in sine space,

(λ_{ℓ}^{x} + λ_{m}^{y} + λ_{n}^{z}) Φ^{ℓ, m, n} = δ^{ℓ, m, n}, i.e. Φ^{ℓ, m, n} = δ^{ℓ, m, n}/(λ_{ℓ}^{x} + λ_{m}^{y} + λ_{n}^{z}). (C.11)
Therefore, the Poisson equation ΔΦ = δ with zero boundary conditions can be solved by the following three steps:
1. performing a forward DST of the source (δ_{i, j, k} → δ^{ℓ, m, n}), according to Eq. (C.3) (costing 𝒪(N_{x}N_{y}N_{z} log[N_{x}N_{y}N_{z}]) operations);
2. solving the Poisson equation in sine space (δ^{ℓ, m, n} → Φ^{ℓ, m, n}), according to Eq. (C.11) (costing 𝒪(N_{x}N_{y}N_{z}) operations);
3. performing an inverse DST of the gravitational potential (Φ^{ℓ, m, n} → Φ_{i, j, k}), according to Eq. (C.7) (costing 𝒪(N_{x}N_{y}N_{z} log[N_{x}N_{y}N_{z}]) operations).
In practice, forward and inverse DSTs are performed using the FFTW library (Frigo & Johnson 2005), publicly available^{11}, where the DST-I is known as FFTW_RODFT00.
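The three steps can be sketched with scipy's DST-I for the order-2 FDA (our own minimal implementation, not the SIMBELMYNë/FFTW one; the function name is ours):

```python
import numpy as np
from scipy.fft import dstn, idstn

def solve_poisson_zero_bc(delta, d):
    """Solve the discrete Poisson equation (7-point Laplacian, order-2 FDA)
    with zero Dirichlet boundary conditions via type-I DSTs."""
    nx, ny, nz = delta.shape
    # eigenvalues of the 1D second difference on the sine basis
    lam = lambda n: (2.0 * np.cos(np.pi * np.arange(1, n + 1) / (n + 1)) - 2.0) / d**2
    lam3 = (lam(nx)[:, None, None]
          + lam(ny)[None, :, None]
          + lam(nz)[None, None, :])
    # forward DST, divide by the eigenvalues, inverse DST
    return idstn(dstn(delta, type=1) / lam3, type=1)
```

Applying the order-2 discrete Laplacian (with zeros assumed outside the domain) to the returned potential recovers the input source to machine precision, since the sine modes exactly diagonalise that operator.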
All Figures
Fig. 1. Functional diagram of sCOLA (left) versus tCOLA (right). The grey boxes are common steps. sCOLA specific steps are represented in blue, and tCOLA specific steps in red. The yellow rectangle constitutes the perfectly parallel section, within which no communication is required with the master process or between processes. Arrows represent dependencies, and references to the main text are given between parentheses. 
Fig. 2. Illustration of the different grids used within sCOLA. The Lagrangian lattice is represented by dashed lines. For each tile, central particles (in black) are surrounded by buffer particles (in cyan), which are ignored at the end of the evolution. The corresponding buffer region in other grids is represented in cyan. The left panel represents the “LPT grid” on which the Lagrangian potentials ϕ^{(1)} and ϕ^{(2)} are defined. The central region has grid points (in red) and is padded by 2N_{ghost} cells in each direction (pink region). The right panel shows the “PM grid” on which the density contrast δ^{sCOLA}, the gravitational potential Φ^{sCOLA}, and the accelerations are defined. The density contrast is defined only in the central region (which has grid points, in dark green). The gravitational potential is padded by 2N_{ghost} cells in each direction (light green and yellow regions), and the gridded accelerations only by N_{ghost} cells in each direction (yellow region). Solving the Poisson equation requires Dirichlet boundary conditions in six layers of N_{ghost} cells, denoted as hatched regions. For simplicity of representation, we have used here N_{ghost} = 1.
Fig. 3. Qualitative assessment of the redshiftzero density field from sCOLA for different tilings and buffer sizes, with respect to tCOLA. The bottom right panel shows the reference tCOLA density field in a 200 Mpc h^{−1} box with periodic boundary conditions (the quantity represented is ln(2 + δ_{tCOLA}) where δ_{tCOLA} is the density contrast). Other panels show the difference between sCOLA and tCOLA density fields, ln(2 + δ_{sCOLA})−ln(2 + δ_{tCOLA}), for different sizes of tile and buffer region, as indicated above the panels. The tiling is represented by dashed lines, and the central tile’s buffer region is represented by solid lines. In the third dimension, the slices represented intersect the central tile at its centre. As can be observed in this figure, artefacts are predominantly located close to the boundaries of tiles; they are reduced with increasing tile size and buffer region size. 
Fig. 4. Same as Fig. 3, but for one component of the velocity field, in km/s. Bulk flows are correctly captured if tiles and their buffer regions are large enough. Residual differences inside halos can be observed, but they are expected due to the limited number of timesteps, rendering both tCOLA and sCOLA velocities inaccurate in the deeply nonlinear regime. 
Fig. 5. Power spectrum relative to tCOLA (top panel) and crosscorrelation with respect to tCOLA (bottom panel) of redshiftzero sCOLA density fields, in a 200 Mpc h^{−1} box containing 512^{3} dark matter particles. Different sizes for the tiles (represented by different line styles) and buffer regions (represented by different colours) are used, as indicated in the legend. The vertical lines show the respective fundamental mode of different tiles, the light grey bands correspond to 3% accuracy, and the dark grey bands to 1% accuracy. 
Fig. 6. Same as Fig. 5, but in a 1 Gpc h^{−1} box containing 1024^{3} particles. 
Fig. 7. Tests of the approximations made in sCOLA for the density field and the gravitational potential. As in Fig. 5, the diagnostic tools are the power spectrum relative to tCOLA (top panel) and the cross-correlation with tCOLA (bottom panel). Our sCOLA algorithm uses the approximate interior density field δ^{sCOLA} and the LEP approximation for the boundary gravitational potential (dash-dotted blue line). In other simulations, as indicated in the legend, we use the true density field δ and/or the true gravitational potential Φ at the boundaries. The approximation made for the density field dominates, especially at large scales.
Fig. 8. Memory requirements (first row) and timings for two corresponding tCOLA and sCOLA simulations. Although the CPU time required is higher for sCOLA, the memory consumption and wall-clock time are significantly reduced with respect to tCOLA, due to the perfectly parallel nature of most computations (second row). In the middle left panel, the height of the white bar shows the hypothetical cost of running tCOLA for the same volume as simulated with sCOLA, when taking buffer regions into account. The relative contributions of different operations, as detailed in the legend, are shown in the third row. The main difference in computational cost in sCOLA with respect to tCOLA comes from the use of DSTs instead of FFTs, which makes the evaluation of the potential significantly more expensive.