A connected component-based method for efficiently integrating multi-scale N-body systems
^{1} Faculty of Science, University of Amsterdam, PO Box 94216, 1090 GE, Amsterdam, The Netherlands
^{2} Leiden Observatory, Leiden University, PO Box 9513, 2300 RA, Leiden, The Netherlands
e-mail: pelupes@strw.leidenuniv.nl
Received: 18 March 2014
Accepted: 20 July 2014
We present a novel method for efficient direct integration of gravitational N-body systems with a large variation in characteristic time scales. The method is based on a recursive and adaptive partitioning of the system into the connected components of the graph generated by the particle distribution, combined with an interaction-specific time step criterion. It uses an explicit and approximately time-symmetric time step criterion, and conserves linear and angular momentum to machine precision. In numerical tests on astrophysically relevant setups, the method compares favourably to alternative Hamiltonian-splitting integrators as well as to recently developed block time step-based, GPU-accelerated Hermite codes. Our reference implementation is incorporated in the HUAYNO code, which is freely available as part of the AMUSE framework.
Key words: methods: numerical / stars: kinematics and dynamics / gravitation
© ESO, 2014
1. Introduction
Direct integration of the classical N-body problem is an important tool for studying astrophysical systems. Examples include planetary systems, the dynamics of open and globular clusters, the large-scale dynamics of galaxies, and structure formation in the Universe. In many cases the calculations involve systems where the intensity of gravitational interactions spans multiple orders of magnitude, with corresponding variations in time scale. For example, the initial stages of cluster formation are now thought to resemble multi-scale fractal structures (Goodwin & Whitworth 2004), and stellar systems are invariably formed with a high fraction of binaries and hierarchical multiples that affect the dynamical evolution in crucial ways (Portegies Zwart et al. 2010).
In practice, integrating multi-scale systems requires specialised methods that vary the resolution at which we treat different parts of the simulation. The aim is to obtain a solution with an acceptable accuracy without unnecessarily spending computational resources on the slowly evolving parts of the simulation. In generic N-body integrators, this idea is most commonly implemented via particle-based block time steps – every particle in the system maintains an individual time step limited to discrete values in a power-of-two hierarchy. These block time steps are then typically used to determine the frequency of calculating the total force acting on a particle (e.g. McMillan 1986; Makino 1991; Konstantinidis & Kokkotas 2010).
While considerably speeding up calculations, particle-based block time steps are nevertheless limited in their ability to treat the extreme scale differences often present in N-body systems. Hence, complementary strategies, such as binary regularisation and neighbourhood schemes, have been devised. These approaches complicate the implementation of N-body integrators, and often introduce new method-specific free parameters. It is also unclear whether these combinations of multiple strategies represent the best possible approach for integrating multi-scale N-body systems. These issues provide a clear incentive to explore alternative methods.
In Pelupessy et al. (2012), we derived generic N-body integrators that recursively and adaptively split the Hamiltonian of the system. These methods show improved conservation of the integrals of motion by always evaluating partial forces between particles in different time step bins symmetrically, and by using an approximately time-symmetric time step criterion.
In the present work, we introduce a new Hamiltonian-splitting integration method that is particularly adept at integrating initial conditions with significant hierarchical substructure. Our approach is based on assigning time steps to individual interactions, followed by partitioning the system Hamiltonian based on the graph formed by the set of interactions that are faster than a fixed threshold time step. The successive partitioning produces closed Hamiltonians, so that we can easily substitute specialised solvers in situations where more efficient solvers are available. Numerical experiments show that our integrator compares favourably to existing splitting methods even for an ordinary Plummer sphere, where the prevalence of isolated subsystems is not immediately obvious. For astrophysically realistic systems explicitly chosen for their multi-scale substructure, the performance gains can reach orders of magnitude. An implementation of the method is incorporated in the HUAYNO code, which is freely available as part of the AMUSE framework (Portegies Zwart et al. 2013; Pelupessy et al. 2013) and which was used for the tests presented in this paper.
Our method is similar in spirit to the well-known neighbour schemes, and accelerates the calculation of the N-body problem for much the same reasons. The main idea is to divide the total force acting on a particle into a fast and a slow component based on the distance to the given particle. Different approaches have been used for treating the fast and slow components. The Ahmad-Cohen neighbourhood scheme (Ahmad & Cohen 1973) treats fast components with a stricter time step criterion. Alternatively, the PPPT scheme (Oshino et al. 2011) integrates fast components with a fourth-order Hermite method while using a leapfrog-based tree code for the long-range interactions. The criteria for determining neighbourhood membership are heuristics known to work in numerical experiments, e.g. a sphere with a fixed radius centred on the acting particle. These methods need to continuously update neighbourhood memberships as the system state changes throughout the simulation. In addition, neighbourhood schemes only make a single distinction, between small subsystems such as hard binaries or many-body close encounters on the one hand, and the large-scale dynamics on the other. It is difficult to generalise a neighbourhood scheme beyond this binary differentiation of the particle distribution.
In Sect. 2 we describe a bottleneck in existing general N-body splitting methods, and derive our novel splitting scheme that overcomes it. Section 3 presents the results of numerical tests comparing our method to existing approaches. Finally, in Sect. 4 we discuss possible improvements and extensions to our work, including the feasibility of integrating general N-body systems using purely interaction-specific time steps.
2. Method
2.1. Deriving time stepping schemes via Hamiltonian splitting
The Hamiltonian for a system of N particles i = 1...N under gravitational interaction can be represented as a sum of momentum terms T_{i} and potential terms V_{ij}:

H = ∑_{i} T_{i} + ∑_{i<j} V_{ij}, (1)

with

T_{i} = ∥p_{i}∥^{2}/(2m_{i}), (2)

V_{ij} = −G m_{i} m_{j}/q_{ij}, (3)

where m_{i} is the mass, q_{i} is the position and p_{i} is the momentum of the ith particle of the system, and q_{ij} = ∥q_{i} − q_{j}∥. The evolution of the state of the system for a time step h is given formally by the flow operator E_{H}(h) = exp(hĤ), where Ĥ is the Hamiltonian vector field corresponding to H.
If the Hamiltonian H of the system is representable as a sum of two sub-Hamiltonians, H = A + B, we can approximate the time evolution under H with a sequence of time evolution steps under the sub-Hamiltonians A and B. A straightforward successive application of the time evolution under A followed by the time evolution under B gives a first-order approximation of the full time evolution under A + B, while a second-order accurate approximation can be obtained with one additional operator evaluation (Sanz-Serna & Calvo 1994, Sect. 12.4; also Hairer et al. 2006):

E_{A + B}(h) ≈ E_{A}(h/2) E_{B}(h) E_{A}(h/2). (4)

The sub-Hamiltonian A is evolved in two steps of h/2 and the sub-Hamiltonian B is evolved in a single step h. We can take advantage of this property of the splitting formula by dividing terms associated with fast interactions into A and terms associated with slow interactions into B. We can proceed by applying this splitting procedure to different sub-Hamiltonians multiple times, thereby constructing an integrator that evaluates parts of the Hamiltonian at h, h/2, h/4 etc., similarly to the power-of-two hierarchy used in block time step schemes. This approach was followed in Pelupessy et al. (2012); below we introduce some notation and give a rough derivation of the integrators presented there.
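The splitting formula (4) can be illustrated on a toy Hamiltonian. The sketch below (our own illustration, not the HUAYNO implementation) applies the second-order split E_{A}(h/2) E_{B}(h) E_{A}(h/2) to a one-dimensional harmonic oscillator, with A the momentum (drift) term and B the potential (kick) term; all function and variable names are illustrative.

```python
# Second-order (Strang) splitting for H = A + B:
#   E_{A+B}(h) ~= E_A(h/2) E_B(h) E_A(h/2) + O(h^3),
# demonstrated on a 1D harmonic oscillator, H = p^2/2m + k q^2/2.

def drift(q, p, h, m=1.0):
    """Evolve the momentum term A = p^2/2m: q advances at constant velocity."""
    return q + h * p / m, p

def kick(q, p, h, k=1.0):
    """Evolve the potential term B = k q^2/2: p advances at constant force."""
    return q, p - h * k * q

def strang_step(q, p, h):
    q, p = drift(q, p, h / 2)   # E_A(h/2)
    q, p = kick(q, p, h)        # E_B(h)
    q, p = drift(q, p, h / 2)   # E_A(h/2)
    return q, p

def energy(q, p, m=1.0, k=1.0):
    return p * p / (2 * m) + k * q * q / 2

q, p = 1.0, 0.0
e0 = energy(q, p)
for _ in range(1000):
    q, p = strang_step(q, p, 0.01)
# the energy error stays bounded and is second order in h
err = abs(energy(q, p) - e0)
```

Composing two first-order steps with reversed operator order gives the same scheme, which is why the second-order split comes at the cost of only one extra operator evaluation.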
Hamiltonians consisting of a single momentum term and Hamiltonians consisting of a single potential term have analytic solutions. For the momentum term of the ith particle,

T_{i} = ∥p_{i}∥^{2}/(2m_{i}), (5)

the solution consists of updating the position of the ith particle under the assumption of constant velocity for a time period of h (all positions except the position of the ith particle, and the momenta of all particles, remain unchanged):

q_{i}(t + h) = q_{i}(t) + h p_{i}/m_{i}. (6)

We call the time evolution operator for the momentum term of the ith particle the drift operator, and write D_{h,Ti}.
For a single potential term between particles i and j,

V_{ij} = −G m_{i} m_{j}/q_{ij}, (7)

the solution consists of updating the momenta of the ith and jth particles under the assumption of constant force for a time period of h (all momenta except the momenta of the ith and jth particles, and the positions of all particles, remain unchanged):

p_{i}(t + h) = p_{i}(t) − h ∂V_{ij}/∂q_{i}, (8)

p_{j}(t + h) = p_{j}(t) − h ∂V_{ij}/∂q_{j}. (9)

We call the time evolution operator for the potential term between the ith and jth particles the kick operator, and write K_{h,Vij}.
In addition to the kick and drift operators, the two-body Hamiltonian

H_{ij} = T_{i} + T_{j} + V_{ij} (10)

is solved (semi-)analytically by the Kepler solution^{1}.
In Pelupessy et al. (2012), we derive multiple integrators that recursively and adaptively split the system Hamiltonian through the second-order splitting formula (4). At every step in the recursion, all particles under consideration are divided into a slow set S and a fast set F by comparing the particle-specific time step function τ(i) to a pivot time step h. Using the two sets S and F, we can rewrite the system Hamiltonian as follows:

H = H_{S} + V_{SF} + H_{F}. (13)

The sub-Hamiltonian H_{S} can be thought of as a "closed Hamiltonian" of the particles in S. Specifically, it consists of all drifts of particles in S and all kicks where both participating particles are in S. The same property holds for the sub-Hamiltonian H_{F} and the particles in F. The mixed term V_{SF} contains all kicks where one particle is in S and the other is in F.
We proceed by applying the second-order splitting rule (4) to this decomposition (this is not the only conceivable approximation). The sub-Hamiltonian H_{F} is closed, and consists of particles with τ(i) < h. We integrate H_{F} by recursively applying the entire "slow/fast" partitioning, but using a smaller pivot h/2. In contrast, both H_{S} = T_{S} + V_{S} and V_{SF} are explicitly decomposed into individual kicks and drifts, which are applied using the current pivot time step h. We refer to this particular choice as the HOLD method (since it "holds" V_{SF} for evaluation at the slow time step). The pivot time step h is halved with each consecutive partitioning, and the recursion terminates when all remaining particles are placed into the S set.
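The recursion just described can be sketched schematically. The toy code below (our own, greatly simplified) only records the pivot time step at which each particle's drift would be evaluated, exposing the power-of-two hierarchy; forces, the V_{SF} kicks and time symmetry are omitted, and all names are illustrative.

```python
# Schematic HOLD recursion: particles with tau(i) < h go into the fast
# set F and are recursed on twice with pivot h/2 (cf. the two half-steps
# of the splitting rule (4)); slow particles are "drifted" at the
# current pivot. We only log (particle, step size) pairs.

def hold_schedule(particles, h, tau, log):
    S = [i for i in particles if tau(i) >= h]
    F = [i for i in particles if tau(i) < h]
    for i in S:
        log.append((i, h))                     # drifts of H_S at pivot h
    if F:                                      # closed H_F: recurse twice
        hold_schedule(F, h / 2, tau, log)      # at the smaller pivot h/2,
        hold_schedule(F, h / 2, tau, log)      # covering the full step h

tau = {0: 1.0, 1: 0.6, 2: 0.3}.get             # toy per-particle time steps
log = []
hold_schedule([0, 1, 2], 1.0, tau, log)
# particle 0 is drifted once at h = 1, particle 1 twice at h = 1/2,
# and particle 2 four times at h = 1/4
```

Note the invariant that for every particle the logged step sizes sum to the full step, which is what the recursive halving guarantees.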
As noted previously, recursively and adaptively splitting the system Hamiltonian using the second-order splitting rule (Eq. (4)) is similar to conventional block time steps. Both approaches evolve different parts of the system using time steps that belong to a power-of-two hierarchy. However, the Hamiltonian splitting method derived above evaluates pairwise particle forces symmetrically, in the sense that a "kick" from particle i to particle j (Eq. (8)) is always paired with an opposite kick from particle j to particle i (Eq. (9)). Furthermore, the kicks acting upon a particle at any given time step typically correspond to partial forces only. This is in contrast to conventional block time steps, where we always calculate the total force acting on a particle at the frequency determined by the particle-specific time step criterion τ(i):

F_{i} = ∑_{j ≠ i} F_{ij}, (18)

where F_{ij} is the force acting on particle i due to particle j, derived from extrapolated positions if necessary. Specifically, in situations where the position of particle j has not been calculated for time t, we calculate the force by extrapolating the position at t from the last known position. This can happen when particle i is assigned a smaller time step than particle j. We refer to this method as BLOCK, and include it as a reference in our numerical tests to determine whether more "aggressive" splitting methods (such as HOLD) reduce the number of kicks and drifts while maintaining the accuracy of the solution.
The HOLD method evolves all kicks between fast particles at the fast time step. This is inefficient in the presence of isolated fast subsystems, as interactions between particles that belong to different subsystems could be evolved at a slower time step. As an extreme example, consider a Plummer sphere with each star replaced by a stable hard binary. Here, every star has a close binary interaction that needs to be evaluated at a fast time step. However, the HOLD integrator will in this case integrate all interactions, including long-range interactions between stars in different binaries, at a time step determined by the binary interactions. The behaviour of the method becomes equivalent to evolving the entire system with a shared global time step!
In addition to the dramatic example just discussed, the same inefficiency – evaluating long-range interactions between isolated fast subsystems at time steps determined by fast interactions inside the subsystems – can manifest itself in other situations, such as the following.

- In a system with multiple globular clusters, each individual globular cluster is a subsystem.
- In a globular cluster with planets around some of the stars, each star with planets is a subsystem.
- In a single globular cluster, each close encounter between two or more stars is a subsystem.
2.2. Hamiltonian splitting with connected components
The partitioning used in the HOLD method is based on a particle-specific time step criterion τ(i), which by definition cannot separate slow and fast interactions in situations where all particle-specific time steps have the same (fast) value. We therefore introduce the interaction-specific time step criterion τ(i,j):

τ(i,j) = min(τ_{freefall}(i,j), τ_{flyby}(i,j)), (19)

where τ_{freefall}(i,j) and τ_{flyby}(i,j) are proportional to the inter-particle free-fall and fly-by times as defined by Eqs. (13) and (16) in Pelupessy et al. (2012), and η is an accuracy parameter.
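As a rough illustration, the criterion can be sketched as follows. The prefactors and regularising terms of the actual definitions (Eqs. (13) and (16) of Pelupessy et al. 2012) are omitted here; `tau_interaction` and its arguments are our own names, with the free-fall time taken as √(r³/G(m_i + m_j)) and the fly-by time as r/|v_ij|.

```python
import math

# Simplified interaction time step: the minimum of an inter-particle
# free-fall time ~ sqrt(r^3 / G(m_i + m_j)) and a fly-by time ~ r/|v_ij|,
# scaled by the accuracy parameter eta. Prefactors are omitted.

G = 1.0  # gravitational constant in N-body units

def tau_interaction(qi, qj, vi, vj, mi, mj, eta=0.01):
    dq = [a - b for a, b in zip(qi, qj)]
    dv = [a - b for a, b in zip(vi, vj)]
    r = math.sqrt(sum(x * x for x in dq))
    v = math.sqrt(sum(x * x for x in dv))
    t_freefall = math.sqrt(r ** 3 / (G * (mi + mj)))
    t_flyby = r / v if v > 0 else float("inf")
    return eta * min(t_freefall, t_flyby)

# A slow wide pair is dominated by the free-fall time; a fast encounter
# at the same separation is dominated by the fly-by time:
t = tau_interaction((0, 0, 0), (1, 0, 0), (0, 0, 0), (5.0, 0, 0),
                    1.0, 1.0, eta=1.0)  # fly-by: 1/5 = 0.2
```

The criterion is symmetric in i and j by construction, which is what makes it usable as an edge weight of an undirected graph in the next section.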
Fig. 1 Time step graphs generated by τ(i,j) at different levels of the time step hierarchy (left to right) for three values (top to bottom rows) of the fractal dimension. Particles that have been passed on as part of a connected component from the previous time step level are plotted in black; grey points indicate particles that are inactive on a given level (because they formed a single or binary component at a lower level). Thin grey lines indicate interactions with τ(i,j) < h. Indicated in each frame are the fractal dimension (top left) and, on the bottom right, the level in the hierarchy, the number of connected components (cc) at this level, the number of components that are single (s) and binary (b), and the total number of particles.

We split the system Hamiltonian using the connected components (Cormen et al. 2001, Sect. B.4) of the undirected graph generated by the time step criterion τ(i,j). Specifically, the particles of the system correspond to the vertices of the graph, and there is an edge between particles i and j if their interaction cannot be evaluated at the threshold time step h:

(i,j) is an edge ⟺ τ(i,j) < h. (20)

Figure 1 visualises the time step graphs at varying values of the pivot time step h for three fractal initial conditions with different fractal dimensions (described further in Sect. 3). As the pivot time step h decreases, the set of interactions (and associated particles) that cannot be evaluated at the current pivot time step gradually shrinks as well. Although for visualisation purposes we plot the time step graph of the entire system for varying h, the connected components (CC) splitting method we are about to introduce typically calculates the connected components of the entire system only once, at the largest pivot time step. At smaller pivot time steps, the connected components search is only performed for parts of the system. The intuition behind this partitioning comes from clustering by maximising the margin between individual clusters, as described in Duan et al. (2009).
Fig. 2 Pseudocode for a secondorder CC splitting routine for integrating a set of particles H for time step h. 

Given a fixed pivot time step h, let the sets C_{i}, i = 1...K contain the vertices of the K non-trivial connected components, and let the set R ("remainder set") contain all particles in trivial connected components^{2}. Based on the particle sets C_{i} and R, we rewrite the Hamiltonian of the system in the following form:

H = H_{C} + H_{R} + V_{CC} + V_{CR}, (21)

where the individual terms are defined as follows. The term H_{C} is the sum of all closed Hamiltonians H_{Ci}, each corresponding to one of the K connected components. In every H_{Ci}, all drifts and some kicks cannot be evolved at the time step h without violating the time step criterion. The term H_{R} consists of the closed Hamiltonian formed by all of the particles in the rest system. All drifts and kicks in H_{R} can be evolved at the current time step h.
The term V_{CC} contains all kicks between particles that are in different connected components. These kicks can be evaluated at the time step h. We point out that V_{CC} explicitly contains the terms that are evolved inefficiently in the HOLD method. Similarly, V_{CR} contains all kicks where one of the particles is in a connected component C_{i}, and the other is in the rest set R.
Table 1. Overview of the integrators used in the numerical tests.
We split the system Hamiltonian H by applying the second-order splitting rule (4), such that the individual connected components C_{i} are independently evolved at a smaller pivot time step h/2 via recursion, while all remaining terms (including V_{CC}) are decomposed into individual drifts and kicks using the second-order splitting rule (4). As with the HOLD method, the pivot time step h is halved at each successive partitioning, such that at some point all remaining particles are in the remainder set R.
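The resulting recursion can be sketched as follows. This is a schematic of our own, not the HUAYNO implementation: at each pivot, the non-trivial components of the time step graph are found and recursed on at h/2, while every pair interaction that is not internal to a component (i.e. V_{CC}, V_{CR} and the kicks inside H_{R}) is logged for evaluation at the current pivot; drifts and actual forces are omitted, and all names are illustrative.

```python
# Schematic CC recursion on the time step graph: edges are pairs with
# tau(i, j) < h. Components of size > 1 correspond to C_1 ... C_K;
# trivial components form the rest set R.

def components(parts, tau, h):
    label = {i: None for i in parts}
    comps = []
    for s in parts:
        if label[s] is not None:
            continue
        stack, comp = [s], [s]
        label[s] = s
        while stack:                       # graph search over the
            i = stack.pop()                # implicit edge relation
            for j in parts:
                if label[j] is None and tau(i, j) < h:
                    label[j] = s
                    stack.append(j)
                    comp.append(j)
        comps.append(comp)
    return comps

def same_component(a, b, big):
    return any(a in c and b in c for c in big)

def cc_schedule(parts, h, tau, log):
    big = [c for c in components(parts, tau, h) if len(c) > 1]
    for a in parts:                        # V_CC, V_CR and H_R kicks
        for b in parts:                    # at the current pivot h
            if a < b and not same_component(a, b, big):
                log.append((a, b, h))
    for c in big:                          # each C_i: recurse twice at
        cc_schedule(c, h / 2, tau, log)    # pivot h/2 (cf. rule (4))
        cc_schedule(c, h / 2, tau, log)

pos = [0.0, 0.1, 10.0, 10.1]               # two tight pairs, far apart
tau = lambda i, j: abs(pos[i] - pos[j])    # toy interaction time step
log = []
cc_schedule([0, 1, 2, 3], 1.0, tau, log)
# inter-pair kicks such as (0, 2) occur once at h = 1; the intra-pair
# kicks (0, 1) and (2, 3) recurse down to a much smaller pivot
```

The key point of the sketch is visible in the log: interactions between the two tight pairs are evaluated once per global step, even though the pairs themselves require a much smaller pivot.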
Finally, the partitioning can lead to situations where H_{Ci} is a two-body Hamiltonian (Eq. (10)). We can take advantage of this property by integrating these cases with a dedicated Kepler solver (we discuss this further in Sect. 3.2).
2.3. Implementation
We implemented the CC split in the HUAYNO code, which is freely available as a part of the AMUSE framework. Figure 2 sketches the main routine of the CC integrator in pseudocode, including explicit references to the corresponding equations and variables used in the derivation of the method (Sect. 2.2).
Subroutines and data structures that store the system state, calculate time steps, apply kicks and drifts to groups of particles, and gather statistics, are shared with other integrators such as the HOLD method. All particle states are kept in a contiguous block of memory. The connected component algorithm is implemented as a breadthfirst search. It reshuffles particle states such that particles in the same connected component or rest set are kept adjacent to each other. Connected components are represented by a start and an end pointer to the contiguous array of particle states.
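A minimal sketch of such an in-place partitioning follows, assuming a simplified reading of the implementation described above (names and details are ours): the search grows each component as a contiguous slice of the particle array by swapping newly discovered members to the end of the slice, so that a component is fully described by a (start, end) index pair.

```python
# In-place breadth-first partitioning of a particle array into
# connected components of the time step graph (edges: tau(a, b) < h).
# Each component ends up as a contiguous slice [start, end) of `state`.

def partition_components(state, tau, h):
    n = len(state)
    slices = []
    head = 0
    while head < n:
        start = head          # first unassigned particle seeds a component
        tail = head + 1       # [start, tail) is the component so far
        while head < tail:    # BFS: expand from each member in turn
            i = head
            for j in range(tail, n):
                if tau(state[i], state[j]) < h:
                    # swap the neighbour into the component's slice
                    state[tail], state[j] = state[j], state[tail]
                    tail += 1
            head += 1
        slices.append((start, tail))
        head = tail
    return slices

state = [0.0, 10.0, 0.1, 10.1]             # two tight pairs, interleaved
slices = partition_components(state, lambda a, b: abs(a - b), 1.0)
# the pairs end up adjacent in memory: slices == [(0, 2), (2, 4)]
```

Trivial slices of length one make up the rest set R, so no separate data structure is needed for it either.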
The time complexity of the connected component decomposition for N particles has an upper bound of O(N^{2}). This matches the time complexity of the splitting step of the HOLD method – while the actual shuffling of the particles into S and F sets is O(N), this division is based on the preceding step of calculating particlebased time steps τ(i) for all particles, which is O(N^{2}).
For the special case where all interactions between the N particles are below the threshold h, the complexity of the connected components decomposition is O(N). This can happen multiple times (at consecutive recursion levels) when the initial value of the pivot time step h is sufficiently large. Intuitively, if particle X has a known connected component while particle Y is unassigned, we can assign particle Y to the connected component of particle X based on a single time step evaluation τ(X,Y). A key step of the connected components search is choosing a particle X with a known connected component, followed by assigning the membership of X to all unassigned particles U_{k} with τ(X,U_{k}) < h. For the special case under consideration, a single iteration of this step is sufficient to assign membership to all particles (irrespective of the choice of the initial particle X), leading to a time complexity of O(N). Further, while the splitting step is bounded from above by O(N^{2}) for both HOLD and CC, the HOLD split always calculates time steps for all interactions. This is not the case with the CC method, and the numerical tests in Sect. 3 indicate that the reduction in time step evaluations does translate into improved performance.
3. Tests
Fig. 3 A comparison of the BLOCK, HOLD and CC methods on integrating a softened 1024-body Plummer sphere: conservation of the integrals of motion (top two rows), evolution of the mass distribution of the solution (middle row) and performance metrics (bottom two rows). Lagrangian radii are plotted for 90%, 50%, 10% and 1% of the system mass. The performance metrics are normalised to the CC method at the start of the simulation. The CC method performs 1.1 × 10^{9} kick, 6.6 × 10^{8} time step, and 5.0 × 10^{6} drift evaluations, taking 62 s for the first global time step (1/512th of the simulation) on a laptop with a 1.3 GHz Intel Core i5 processor.

We present results of numerical experiments with the CC splitting method described in the previous section. We confirm that the CC method works as intended by comparing it to alternative Hamiltonian splitting methods (see Table 1 for an overview). Specifically, we demonstrate that the connected components search does not use excessive computational resources, reduces the number of elementary operations (kick, drift and time step evaluations) while maintaining the accuracy of the solution, and performs particularly well on multi-scale problems. Finally, we compare the CC method to established N-body codes. We use N-body units as described in Heggie & Mathieu (1986).
3.1. Smoothed Plummer sphere test
We begin by integrating an equal-mass 1024-body Plummer sphere with softening (ε = 1/256) for 700 N-body time units using a time step accuracy parameter of η = 0.01. We choose initial velocities such that the Plummer sphere is in dynamical equilibrium. This setup is chosen to match the long-term integration tests in Nitadori & Makino (2008, their Sect. 3.2).
Figure 3 visualises the conservation of the integrals of motion, the time evolution of the mass distribution, and performance metrics. While all three methods show similar energy conservation properties, only HOLD and CC maintain the centre of mass, linear momentum and angular momentum near machine precision. As noted previously in Pelupessy et al. (2012), this is caused by unsynchronised kicks, which are only present in the BLOCK scheme. The solutions obtained by all three methods reproduce known results in terms of Lagrangian radii, the core radius and the core density. The CC scheme is about twice as fast as the HOLD scheme at the beginning of the simulation, and remains the fastest scheme throughout the run. The overall runtime measurements correlate with the number of time step formula evaluations and, to a lesser extent, the number of kick and drift formula evaluations. This indicates that the improved runtime is attributable to a reduction in time step, kick and drift formula evaluations.
The left panel of Fig. 4 visualises the energy error of evolving the softened Plummer sphere as described previously, but for 1 N-body time unit and under varying time step accuracy η. As predicted, all three methods show second-order behaviour. On the corresponding wall-clock time vs. energy error plot on the right, CC consistently outperforms HOLD, followed by BLOCK. We emphasise that the Plummer sphere is a spherically symmetric configuration with a smoothly changing mass distribution, and a non-zero softening length ε sets an upper limit on the hardness of the binaries that can form during the simulation. Hence, we would not expect the CC scheme to have a significant advantage over the HOLD method.
Fig. 4 Left: time step accuracy parameter η vs. energy error for integrating a 1024-body Plummer sphere for 1 N-body time unit. Right: corresponding wall-clock time vs. energy error from the same set of tests.

Fig. 5 A comparison of the HOLD, CC and CC_KEPLER methods on integrating an unsoftened 1024-body Plummer sphere through core collapse: conservation of the integrals of motion (top two rows), evolution of the mass distribution of the solution (third row) and wall-clock time (bottom left). Lagrangian radii are plotted for 90%, 50%, 10% and 1% of the system mass. Wall-clock times are normalised by CC_KEPLER at the first global time step (1/64th of the simulation). On a laptop with a 1.3 GHz Intel Core i5 processor, CC_KEPLER integrates the first global time step in roughly 5 min, while the entire simulation takes about 7 h.

3.2. Unsoftened Plummer sphere test
We proceed by evolving an equal-mass 1024-body Plummer sphere without softening through core collapse. We choose initial velocities consistent with dynamical equilibrium, as in the softened case considered previously. This setup is chosen to match a test used on a modern implementation of a fourth-order Hermite scheme with block time steps in Konstantinidis & Kokkotas (2010, their Sect. 3.4.1).
In addition to the HOLD and CC schemes introduced previously, we also test a modification of the CC scheme with a dedicated Kepler solver (CC_KEPLER). In this scheme, a Kepler solver is used for evolving connected components consisting of two particles. This is a form of algorithmic regularisation of binaries, but note that the regularisation follows naturally from the structure of the integrator, and no separate binary detection or additional free parameters are necessary. The implementation of the Kepler solver is based on a universal variable formulation (Bate et al. 1971).
Results of the core collapse simulation are visualised in Fig. 5. All three methods produce solutions that are realistic in terms of the evolution of the mass distribution. Energy conservation is comparable to what is observed in Konstantinidis & Kokkotas (2010). The other integrals of motion are conserved near machine precision, with the exception of a jump in the HOLD method around core collapse (this is caused by a high-speed particle escaping from the system, causing a loss of precision in the force evaluations).
Before core collapse, execution times are roughly equivalent to the softened case considered previously (Sect. 3.1) – CC shows a modest improvement over HOLD, and CC_KEPLER is very close to CC. Around core collapse, execution times of the HOLD and CC methods gradually increase by an order of magnitude (the CC method still consistently outperforms the HOLD method). In contrast, execution time used by the CC_KEPLER method remains relatively uniform throughout the simulation, including core collapse.
The Sakura integrator achieves a similarly efficient treatment of close binaries by decomposing the evolution of an N-body Hamiltonian into a sequence of Kepler problems (Gonçalves Ferrari et al. 2014). The main source of errors in Sakura comes from many-body close encounters, as these are difficult to decompose into two-body interactions. In contrast, CC_KEPLER only uses the binary solver for an isolated binary system, and switches to the regular many-body integrator when necessary (this is further discussed in Sect. 4.1).
3.3. Fractal distributions
Since our new methods are based on partitioning the particle distribution into connected subsystems, we expect the method to be especially well suited to situations where substructure with extreme density contrasts exists. We therefore proceed by integrating a set of initial conditions developed with the aim of describing a star cluster with fractal substructure (Goodwin & Whitworth 2004). These initial conditions mimic the observed distribution of young stellar associations. They are parametrised by a fractal dimension: a low fractal dimension leads to an inhomogeneous ("structured") distribution of stars, whereas a high fractal dimension leads to a more homogeneous ("spherical") distribution (Fig. 1). For the highest possible fractal dimension value of 3, the initial conditions approximate a constant-density sphere. We use η = 0.03, and integrate a 1024-particle system under an unsoftened potential for 0.25 N-body time units for varying fractal dimensions.
In Fig. 6 we plot the energy error and runtime of the simulation, averaged over 10 runs, as a function of the fractal dimension f. While all integrators show similar energy conservation, CC and CC_KEPLER consistently outperform BLOCK and HOLD in terms of runtime, irrespective of the fractal dimension. Further, runtime increases with decreasing fractal dimension for the BLOCK and HOLD integrators, while it remains essentially flat (and even decreases slightly) with decreasing fractal dimension for the CC and CC_KEPLER integrators.
3.4. Plummer sphere with binaries
We proceed by looking at how our methods perform on systems containing a large number of binaries. Specifically, we take a Plummer sphere and replace every particle with a binary system. The positions and velocities of the particles are chosen such that, in the absence of external perturbations, they would form a stable binary with a randomly oriented orbital plane and a semi-major axis drawn uniformly in log space between log(a) and −0.5. We integrate a system of 512 binaries (= 1024 individual particles) for 0.25 N-body time units with η = 0.03.
Figure 7 visualises energy conservation and runtime for these initial conditions as a function of the minimum semi-major axis a. For large a, the introduced binaries are generally unbound, and the results are equivalent to evolving an ordinary Plummer sphere. As the minimum a decreases, the introduced binaries become bound and their interactions start to dominate the integration time, leading to a significant advantage for the CC and CC_KEPLER methods.
3.5. Cold collapse test
As a final test we evaluate the performance of our integrators in a cold collapse scenario. Specifically, we use the fractal initial conditions described in Sect. 3.3 with the initial velocities set to zero. We consider a "structured" case with fractal dimension f_{d} = 1.6, and a "spherical" case with f_{d} = 3.0. We evolve the initial conditions for 2 N-body time units. For the spherical case, this is past the moment of collapse, which occurs around 1.5 N-body time units. For the structured case the moment of collapse is less well defined, as different substructures collapse at different times.
We compare CC and CC_KEPLER to two recent N-body codes, Ph4 (McMillan, in prep.) and HiGPUs (Capuzzo-Dolcetta et al. 2013). Both codes use a Hermite scheme with conventional block time steps. Ph4 implements a fourth-order scheme with the option of using the GPU-accelerated SAPPORO library (Gaburov et al. 2009; and Bédorf et al., in prep.). HiGPUs implements a sixth-order scheme and requires a GPU to run. We conduct our tests on a workstation, running on a single core of an Intel i7-2720QM CPU and a GTX 460M GPU. The FLOPS performance of the GPU is roughly 40 times larger than that of a single core of the CPU. The hardware setup is thus indicative only of the intrinsic algorithmic scaling, rather than representative of the performance in production simulations (which would use multiple and/or more powerful CPUs/GPUs). We use an unsoftened potential for CC, CC_KEPLER, and Ph4 without GPU acceleration. We use a very small softening parameter (ε = 10^{−4}) for Ph4 with GPU acceleration and for HiGPUs, as these run into severe slowdowns and/or crashes with unsoftened gravity – probably because of the limited precision of their GPU kernels. We set code-specific time step accuracy parameters to η = 0.01 for CC/CC_KEPLER, η_{4} = 0.1 for Ph4, and η_{4} = 0.05 and η_{6} = 0.6 for HiGPUs.
Fig. 6 Energy conservation (top) and runtime (bottom) for integrating a 1024-particle system of fractal initial conditions for 0.25 N-body units as a function of the fractal dimension.

Fig. 7 Energy conservation (top) and runtime (bottom) for integrating a 512-binary (= 1024-particle) Plummer sphere for 0.25 N-body units as a function of the initial semi-major axis a.

Figure 8 visualises energy conservation, momentum conservation and wall-clock time as a function of the system size for structured and homogeneous initial conditions. In the spherical case, setups that take advantage of the GPU (Ph4_GPU and HiGPUs) outperform the alternatives, although both CC and CC_KEPLER show very similar scaling to Ph4 without GPU acceleration. In contrast, for the structured case, CC_KEPLER and CC show a marked speed-up in comparison with the conventional block time step schemes, being faster for this particular calculation than Ph4_GPU and HiGPUs, despite the latter two having the advantage of GPU acceleration and integrating with softened gravity. The differences between the structured and the spherical cases highlight the relative advantage that the connected component approach has over conventional block time steps when applied to multiscale initial conditions.
4. Discussion
4.1. Using and extending the CC method
We introduced a novel method for direct integration of N-body systems based on splitting the system Hamiltonian using connected components of the time step graph (the CC split). We were motivated by the need for a more efficient divide-and-conquer strategy for reducing the intractable Hamiltonian (Eq. (18)) to the smallest possible number of analytically solvable Hamiltonians (Eqs. (7), (6)). In comparison to existing splitting methods, notably the HOLD split introduced in Pelupessy et al. (2012), the CC split is particularly effective at splitting multiscale systems. We have not encountered a situation where the HOLD split would be preferable to the CC split. The practical advantages of Hamiltonian splitting are similar to what is usually achieved with block time steps. However, as our splitting methods, including the CC method, do not extrapolate particle states when evaluating the total force acting on a particle, we conserve linear and angular momentum to machine precision.
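As an illustration, the component-finding step at the heart of the CC split can be sketched as a union-find pass over the time step graph: pairs with τ(i,j) below the pivot step h are merged into one component. This is a minimal sketch rather than the HUAYNO implementation; the `tau` callable and the particle labels are hypothetical stand-ins for the interaction time step criterion.

```python
import itertools

def connected_components(particles, tau, h):
    # Partition particles into connected components of the graph whose
    # edges are the pairs (i, j) with interaction time step tau(i, j) < h.
    parent = {p: p for p in particles}

    def find(x):
        # Find the root of x's set, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in itertools.combinations(particles, 2):
        if tau(i, j) < h:  # "fast" pair: i and j belong in one component
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[ri] = rj

    groups = {}
    for p in particles:
        groups.setdefault(find(p), []).append(p)
    return list(groups.values())
```

In the recursive scheme, each component found this way is then re-partitioned with the halved step h/2, while the remaining (slow) interactions between components are evaluated at the current level.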
We went on to show, using the example of the CC_KEPLER method, that the connected components partitioning has additional uses beyond improved splitting efficiency. Specifically, we were able to incorporate regularisation of two-body close encounters by simply checking for the condition where the successive partitioning leads to a connected component with two particles, and evolving the corresponding two-body Hamiltonian (Eq. (10)) using a dedicated Kepler solver. This approach can be extended to many-body close encounters by using a suitable specialised solver (or, e.g., chain regularisation methods, Mikkola 2008) to evolve isolated Hamiltonians corresponding to connected components with certain properties. Possible selection criteria include having a specific number of particles and/or a maximum time step below a threshold value.
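The selection rule described above amounts to a small dispatch step in the recursion. The sketch below is schematic only; `evolve_cc` and `evolve_kepler` are hypothetical stand-ins for the recursive CC integrator and the dedicated Kepler solver, and a production version would add further criteria (e.g. a time step threshold) before choosing a specialised solver.

```python
def evolve_component(component, dt, evolve_cc, evolve_kepler):
    # CC_KEPLER selection rule: an isolated two-particle component is
    # advanced analytically with a Kepler solver; anything larger is
    # handed back to the recursive CC splitting.
    if len(component) == 2:
        evolve_kepler(component, dt)
    else:
        evolve_cc(component, dt)
```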
The numerical experiments of Sect. 3 were chosen mainly to study the splitting aspect of N-body integration. We focused on normalised performance metrics, and on the scaling of the wall-clock time as a function of the “multiscaleness” of the initial conditions. Our current implementations would benefit from additional optimisations typically used in production-level N-body codes. Specifically, there is inherent parallelism in the CC method, as recursive calls for evolving successively smaller closed Hamiltonians only affect the state of the particles in the “current” closed component. It may be possible to parallelise the method based on this property. However, tests show that a naive approach does not scale well owing to load-balancing issues, as subsystems can vary substantially in size. Alternatively, it could be feasible to implement the CC method on a GPU, as the major components – N-body force evaluation (Portegies Zwart et al. 2007; Belleman et al. 2008; Capuzzo-Dolcetta et al. 2013) and graph processing algorithms (Harish & Narayanan 2007) – have individually been implemented successfully on GPUs.
Fig. 8 Energy conservation (left column), momentum conservation (middle column) and wall-clock time (right column) as a function of system size N for the cold collapse test. The top row shows results for a homogeneous sphere (fractal dimension f_{d} = 3), while the bottom row shows a highly structured fractal (f_{d} = 1.6). We evolve the initial conditions for 2 N-body time units, and plot the mean values across 5 runs. CC, CC_KEPLER and Ph4 use unsoftened gravity (ε = 0), while Ph4_GPU and HiGPUs use very small softening (ε = 10^{-4}).

It may be possible to speed up the evaluation of long-range interactions between different connected components (V_{CC} in the CC decomposition formula) through the centre-of-mass (or multipole) approximations that form the basis of tree codes (Barnes & Hut 1986). As long-range interactions between two connected components are evaluated symmetrically, this approach could make it possible to obtain most of the speed-up of a tree code while maintaining good linear and angular momentum conservation. A potential pitfall of this approach is that the time step criterion τ(i,j) used in finding the connected components is only partially determined by the coordinates of the particles. As such, particles in the same connected component may occupy a “non-compact” region in physical space, making the multipole approximation difficult.
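A symmetric centre-of-mass kick between two components might look roughly as follows. This is a monopole-only sketch under an assumed particle representation (dicts with mass `m`, position `x` and velocity `v` as NumPy arrays), not part of the HUAYNO code.

```python
import numpy as np

def com_kick(comp_a, comp_b, dt, G=1.0):
    # Approximate the long-range interaction between two connected
    # components by a single monopole (centre-of-mass) kick, applied
    # with equal and opposite forces so that total linear momentum
    # is conserved exactly.
    ma = sum(p['m'] for p in comp_a)
    mb = sum(p['m'] for p in comp_b)
    xa = sum(p['m'] * p['x'] for p in comp_a) / ma
    xb = sum(p['m'] * p['x'] for p in comp_b) / mb
    dr = xb - xa
    r3 = np.dot(dr, dr) ** 1.5
    acc_a = G * mb * dr / r3    # acceleration of component a towards b
    acc_b = -G * ma * dr / r3   # equal and opposite on component b
    for p in comp_a:
        p['v'] = p['v'] + acc_a * dt
    for p in comp_b:
        p['v'] = p['v'] + acc_b * dt
```

Because the paired impulses are equal, opposite, and directed along the line joining the two centres of mass, the kick conserves both linear and angular momentum; the error introduced is purely the monopole truncation of the inter-component potential.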
4.2. Formally optimal Hamiltonian splitting
The HOLD integrator determines the accuracy of a kick between particles i and j from the particle-based time steps τ(i) and τ(j). In the CC integrator the accuracy of a kick is determined by the time step graph generated directly from the interaction-based time steps τ(i,j). Could we further improve the splitting by applying kicks directly based on the interaction-specific time step criterion τ(i,j)?
We implemented this idea in an experimental integrator which we named the OK split (OK stands for Optimal Kick). The method partitions a list of all interactions in the system (based on a pivot time step h), just as the HOLD split partitions a list of all particles in the system. The partitioning is formally optimal in the sense that every kick is evaluated at the time step closest to τ(i,j) in the power-of-two hierarchy based on the pivot time step h. While the possibility of direct N-body integration with interaction-based time steps has previously been considered by Nitadori & Makino (2008), the OK split is the first workable implementation of this idea that we are aware of.
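The interaction partitioning can be sketched as a bucketing of pairs into the power-of-two hierarchy. The rounding rule below (the level k for which h/2^k is nearest to τ(i,j) in log space) is our illustrative reading of the description, not necessarily the exact rule used in the experimental implementation, and `tau` is again a hypothetical stand-in for the interaction time step criterion.

```python
import math

def partition_interactions(pairs, tau, h, max_level=32):
    # OK-split sketch: place each interaction (i, j) at the power-of-two
    # level k for which the step h / 2**k is closest (in log space) to
    # the interaction time step tau(i, j).
    levels = {}
    for i, j in pairs:
        k = round(math.log2(h / tau(i, j)))
        k = max(0, min(max_level, k))  # slow pairs kick at the pivot step
        levels.setdefault(k, []).append((i, j))
    return levels
```

Each kick at level k is then applied with step h/2**k, interleaved with the drifts of the participating particles.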
In numerical tests, however, the OK split is not competitive with other methods such as the CC split. For example, in the 1024-body smoothed Plummer sphere test from Sect. 3.1, the relative energy error at the end of the simulation is around 10^{-2} (several orders of magnitude worse than HOLD and CC, but possibly still sufficient for drawing statistically correct conclusions, Portegies Zwart & Boekholt 2014). The remaining integrals of motion are conserved to machine precision, as the OK split applies kicks in pairs. Finally, the evolution of the mass distribution is comparable to HOLD and CC, with the OK split using fewer kick and time step evaluations.
Could we improve the OK split by changing the time step criterion? For example, consider τ_{⋆}(i,j) = min(τ(i),τ(j)), where τ is the particle-based time step criterion as defined in Pelupessy et al. (2012). Formally, combining the OK split with τ_{⋆}(i,j) would result in a splitting with exactly the same kicks and drifts as the HOLD integrator. This somewhat contrived example only serves to illustrate that the time step criterion can qualitatively change the behaviour of the OK split. While it is unknown whether practical interaction-based time step criteria even exist, we believe that a closer look at the various simplifications made during the derivation of the explicit and approximately time-symmetric time step criterion used throughout this work (Eq. (19)) would serve as a good starting point.
Acknowledgments
We thank Guilherme Gonçalves Ferrari and the anonymous referee for a critical reading of the manuscript. This work was supported by the Netherlands Research Council NWO (Grants #643.200.503, #639.073.803 and #614.061.608) and by the Netherlands Research School for Astronomy (NOVA). Jürgen Jänes was supported by the Archimedes Foundation, Estonian Students’ Fund USA, Estonian Information Technology Foundation and Skype.
References
 Ahmad, A., & Cohen, L. 1973, J. Comput. Phys., 12, 389
 Barnes, J., & Hut, P. 1986, Nature, 324, 446
 Bate, R. R., Mueller, D. D., & White, J. E. 1971, Fundamentals of Astrodynamics, Dover Books on Aeronautical Engineering Series (Dover Publications)
 Belleman, R. G., Bédorf, J., & Portegies Zwart, S. F. 2008, New Astron., 13, 103
 Capuzzo-Dolcetta, R., Spera, M., & Punzo, D. 2013, J. Comput. Phys., 236, 580
 Cormen, T. H., Leiserson, C. E., Rivest, R. L., & Stein, C. 2001, Introduction to Algorithms, 2nd edn. (MIT Press)
 Duan, W., Song, M., & Yates, A. 2009, BMC Bioinformatics, 10, S4
 Gaburov, E., Harfst, S., & Portegies Zwart, S. 2009, New Astron., 14, 630
 Gonçalves Ferrari, G., Boekholt, T., & Portegies Zwart, S. F. 2014, MNRAS, 440, 719
 Goodwin, S. P., & Whitworth, A. P. 2004, A&A, 413, 929
 Hairer, E., Lubich, C., & Wanner, G. 2006, Geometric Numerical Integration: Structure-Preserving Algorithms for Ordinary Differential Equations (Springer Verlag)
 Harish, P., & Narayanan, P. 2007, in High Performance Computing – HiPC 2007, eds. S. Aluru, M. Parashar, R. Badrinath, & V. Prasanna (Berlin, Heidelberg: Springer), Lect. Notes Comput. Sci., 4873, 197
 Heggie, D. C., & Mathieu, R. D. 1986, in The Use of Supercomputers in Stellar Dynamics, eds. P. Hut, & S. L. W. McMillan (Berlin: Springer Verlag), Lect. Notes Phys., 267, 233
 Konstantinidis, S., & Kokkotas, K. D. 2010, A&A, 522, A70
 Makino, J. 1991, PASJ, 43, 859
 McMillan, S. L. W. 1986, in The Use of Supercomputers in Stellar Dynamics, eds. P. Hut, & S. L. W. McMillan (Berlin: Springer Verlag), Lect. Notes Phys., 267, 156
 Mikkola, S. 2008, in IAU Symp. 246, eds. E. Vesperini, M. Giersz, & A. Sills, 218
 Nitadori, K., & Makino, J. 2008, New Astron., 13, 498
 Oshino, S., Funato, Y., & Makino, J. 2011, PASJ, 63, 881
 Pelupessy, F. I., Jänes, J., & Portegies Zwart, S. 2012, New Astron., 17, 711
 Pelupessy, F. I., van Elteren, A., de Vries, N., et al. 2013, A&A, 557, A84
 Portegies Zwart, S., & Boekholt, T. 2014, ApJ, 785, L3
 Portegies Zwart, S., McMillan, S. L. W., van Elteren, E., Pelupessy, I., & de Vries, N. 2013, Comput. Phys. Commun., 183, 456
 Portegies Zwart, S. F., Belleman, R. G., & Geldof, P. M. 2007, New Astron., 12, 641
 Portegies Zwart, S. F., McMillan, S. L. W., & Gieles, M. 2010, ARA&A, 48, 431
 Sanz-Serna, J. M., & Calvo, M. P. 1994, Numerical Hamiltonian Problems, Applied Mathematics and Mathematical Computation 7 (London, New York: Chapman & Hall)
All Figures
Fig. 1 Time step graphs generated by τ(i,j) at different levels of the time step hierarchy (left to right) for three values (top to bottom rows) of the fractal dimension. Particles that have been passed on as part of a connected component from the previous time step level are plotted in black; grey points indicate points that are inactive at a given level (because they formed a single or binary component at a lower level). Thin grey lines indicate interactions with τ(i,j) < h. Indicated in each frame are the fractal dimension (top left), and (on the bottom right) the level in the hierarchy, the number of connected components (cc) at this level, the number of components that are single (s) and binary (b), and the total number of particles.

Fig. 2 Pseudocode for a second-order CC splitting routine for integrating a set of particles H for a time step h.

Fig. 3 A comparison of the BLOCK, HOLD and CC methods on integrating a softened 1024-body Plummer sphere: conservation of the integrals of motion (top two rows), evolution of the mass distribution of the solution (middle row) and performance metrics (bottom two rows). Lagrangian radii are plotted for 90%, 50%, 10% and 1% of the system mass. The performance metrics are normalised to the CC method at the start of the simulation. The CC method performs 1.1 × 10^{9} kick, 6.6 × 10^{8} time step, and 5.0 × 10^{6} drift evaluations, taking 62 s for the first global time step (1/512th of the simulation) on a laptop with a 1.3 GHz Intel Core i5 processor.

Fig. 4 Left: time step accuracy parameter η vs. energy error for integrating a 1024-body Plummer sphere for 1 N-body unit. Right: corresponding wall-clock time vs. energy error from the same set of tests.

Fig. 5 A comparison of the HOLD, CC and CC_KEPLER methods on integrating an unsoftened 1024-body Plummer sphere through core collapse: conservation of the integrals of motion (top two rows), evolution of the mass distribution of the solution (third row) and wall-clock time (bottom left). Lagrangian radii are plotted for 90%, 50%, 10% and 1% of the system mass. Wall-clock times are normalised by CC_KEPLER at the first global time step (1/64th of the simulation). On a laptop with a 1.3 GHz Intel Core i5 processor, CC_KEPLER integrates the first global time step in roughly 5 min, while the entire simulation takes about 7 h.
