Issue 
A&A
Volume 590, June 2016



Article Number  A19  
Number of page(s)  17  
Section  Numerical methods and codes  
DOI  https://doi.org/10.1051/0004-6361/201528060  
Published online  29 April 2016 
A smooth particle hydrodynamics code to model collisions between solid, self-gravitating objects^{⋆}
^{1} Institut für Astronomie und Astrophysik, Eberhard Karls Universität Tübingen, Auf der Morgenstelle 10, 72076 Tübingen, Germany
email: ch.schaefer@uni-tuebingen.de
^{2} Department of Astrophysics, University of Vienna, Türkenschanzstraße 17, 1180 Vienna, Austria
^{3} Physikalisches Institut, Eberhard Karls Universität Tübingen, Auf der Morgenstelle 14, 72076 Tübingen, Germany
Received: 29 December 2015
Accepted: 2 March 2016
Context. Modern graphics processing units (GPUs) lead to a major increase in the performance of astrophysical simulations. Owing to the different nature of the GPU architecture compared to traditional central processing units (CPUs), such as the x86 architecture, existing numerical codes cannot be easily migrated to run on GPUs. Here, we present a new implementation of the numerical method smooth particle hydrodynamics (SPH) using CUDA^{TM} and the first astrophysical application of the new code: the collision between Ceres-sized objects.
Aims. The new code allows for a tremendous increase in the speed of astrophysical simulations with SPH and self-gravity at low cost for new hardware.
Methods. We have implemented the SPH equations to model gas, liquids, and elastic and plastic solid bodies, and added a fragmentation model for brittle materials. Self-gravity may be optionally included in the simulations and is treated by the use of a Barnes-Hut tree.
Results. We find an impressive performance gain using NVIDIA consumer devices compared to our existing OpenMP code. The new code is freely available to the community upon request.
Key words: methods: numerical / planets and satellites: formation
© ESO, 2016
1. Introduction
Since the introduction of the first CUDA (formerly known as Compute Unified Device Architecture) development kit by the NVIDIA Corporation, and even more since the support of double precision by modern GPU hardware (i.e., the NVIDIA GT200 chip with compute capability 1.3 and higher), many astrophysical applications that make use of the acceleration by general-purpose computing on graphics processing units have been developed (e.g., Bédorf et al. 2012; Kulikov 2014). Modern GPUs allow for a significant performance gain compared to serial codes for central processing units (CPUs). The lower cost of GPUs compared to high performance computing clusters of CPUs leads to a higher performance per cost in favor of the GPU. Moreover, GPUs are inherently more energy efficient than CPUs because they are optimized for throughput and performance per watt rather than for absolute performance (Tingxing et al. 2014). Hence, astrophysical problems that in general demand cluster hardware may even be addressed on workstations with NVIDIA hardware.
We decided to implement a CUDA port of our parallelized OpenMP SPH code to benefit from these modern GPUs. In this publication, we present the physical models, their CUDA implementation, and the validation simulations of our new code, and we show a first glimpse at possible applications. The SPH code may be used for hydrodynamical simulations and to model solid bodies, including ductile and brittle materials. The code uses a tree code algorithm for the additional treatment of self-gravity. We show several validation simulations: the simulation of elastic rubber cylinders, the gravitational collapse of a molecular cloud, the impact of an aluminium bullet into an aluminium cube, and the impact of a nylon bullet into a basaltic rock.
Extended hydrocodes have been applied to model large deformation and strong shock wave physics using different numerical schemes. The Eulerian grid code CTH has several models that are useful for simulating strong shock, large deformation events, and this code can be used to model elastic-plastic behavior and fracture (McGlaun et al. 1990). The iSALE grid code is a comprehensive tool to study impacts (Wünnemann et al. 2006; Elbeshausen et al. 2009). Apart from grid codes, there are also particle-based numerical schemes. Several SPH codes for modern cluster hardware are available to the community. The well-known Gadget-2 code by Springel (2005) may be the most sophisticated and most widely applied SPH code for the simulation of astrophysical flows. The SPH code by Benz & Asphaug (1994), with fundamental improvements of the physical models later by Jutzi et al. (2008), has been successfully applied to simulate impacts and collisions of astrophysical objects. However, there is no freely available SPH code for the simulation of solid and brittle material that includes the treatment of self-gravity. Our new code bridges this gap, and high-resolution SPH simulations become feasible and may be performed even on a standard workstation with a GPU from the consumer price range.
The outline is as follows. In the next section we present the physical models and the governing equations that are solved by the code. In Sect. 3 we describe the smooth particle hydrodynamics numerical model for the simulation of solid bodies, the Grady-Kipp damage model for the formation of cracks and stress weakening, and the Barnes-Hut tree algorithm for gravitational forces. We give a detailed description of our implementation of these algorithms in the new CUDA code. We present the results of three standard tests for the code in Sect. 4: the collision of two perfectly elastic rubber cylinders, the high-velocity impact of a bullet into an aluminium cube, and the collapse of an isothermal molecular cloud. Our first application of the new code is described in Sect. 5, and we present its results in Sect. 6, where we additionally discuss the limitations of the current code version with regard to numerical issues as well as the physical models. Finally, we conclude in Sect. 7.
2. Physical model
In this section, we present the set of partial differential equations from continuum mechanics in Cartesian coordinates that we apply to model gas and solid bodies. Throughout this paper, the Einstein summation rule is used, Greek indices denote spatial coordinates and run from 1 to 3.
2.1. Mass conservation
The equation of continuity is the equation of the conservation of mass. It is given in Lagrangian form by

dϱ/dt + ϱ ∂v^{α}/∂x^{α} = 0,    (1)

where ϱ denotes the density and v the velocity of the solid body or fluid.
2.2. Conservation of momentum
The equation of the conservation of momentum in continuum mechanics for a solid body reads in Lagrangian formulation

dv^{α}/dt = (1/ϱ) ∂σ^{αβ}/∂x^{β},    (2)

where σ^{αβ} is the stress tensor, given by the pressure p and the deviatoric stress tensor S^{αβ} as

σ^{αβ} = −p δ^{αβ} + S^{αβ},    (3)

with the Kronecker delta δ^{αβ}. For a liquid or gas, the deviatoric stress tensor vanishes and the conservation of momentum is given by the Euler equation

dv^{α}/dt = −(1/ϱ) ∂p/∂x^{α}.    (4)
2.3. Conservation of internal energy
The equation of the conservation of the specific internal energy u for an elastic body is given in Lagrangian formulation by

du/dt = −(p/ϱ) ∂v^{α}/∂x^{α} + (1/ϱ) S^{αβ} ε̇^{αβ},    (5)

with the strain rate tensor ε̇^{αβ} given by Eq. (8).
2.4. Constitutive equations
In contrast to fluid dynamics, the set of partial differential Eqs. (1), (2), (5) is not sufficient to describe the dynamics of an elastic solid body, since the time evolution of the deviatoric stress tensor S^{αβ} is not yet specified. For the elastic behavior of a solid body, the constitutive equation is based on Hooke's law. Extended to three-dimensional deformations, it reads

S^{αβ} ∼ 2μ (ε^{αβ} − (1/3) δ^{αβ} ε^{γγ}).    (6)

Here, μ denotes the material dependent shear modulus and ε^{αβ} is the strain tensor. For the time evolution, it follows (for small deformations)

dS^{αβ}/dt = 2μ (ε̇^{αβ} − (1/3) δ^{αβ} ε̇^{γγ}) + rotation terms,    (7)

where ε̇^{αβ} denotes the strain rate tensor, given by

ε̇^{αβ} = (1/2) (∂v^{α}/∂x^{β} + ∂v^{β}/∂x^{α}).    (8)

The rotation terms in Eq. (7) are needed since the constitutive equations have to be independent of the material frame of reference. There are various possibilities to achieve this. The usual approach for particle codes is the Jaumann rate form (see, e.g., Gray et al. 2001 and references therein). The rotation terms of the Jaumann rate form are

S^{αγ} R^{γβ} + S^{βγ} R^{γα},    (9)

with the rotation rate tensor

R^{αβ} = (1/2) (∂v^{α}/∂x^{β} − ∂v^{β}/∂x^{α}).    (10)

Together, Eqs. (7) and (9) determine the change of deviatoric stress due to the deformation of the solid body.
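As a concrete illustration of Eqs. (8) and (10), the strain rate and rotation rate tensors are simply the symmetric and antisymmetric parts of the velocity gradient. The following minimal C sketch (illustrative only, not taken from the code presented here; all names are ours) computes both from a given velocity gradient:

```c
#define DIM 3

/* Build the strain rate tensor (symmetric part, Eq. (8)) and the
 * rotation rate tensor (antisymmetric part, Eq. (10)) from a velocity
 * gradient dvdx[alpha][beta] = dv^alpha / dx^beta. */
void strain_and_rotation_rate(const double dvdx[DIM][DIM],
                              double eps_dot[DIM][DIM],
                              double rot[DIM][DIM])
{
    for (int a = 0; a < DIM; a++) {
        for (int b = 0; b < DIM; b++) {
            eps_dot[a][b] = 0.5 * (dvdx[a][b] + dvdx[b][a]); /* symmetric     */
            rot[a][b]     = 0.5 * (dvdx[a][b] - dvdx[b][a]); /* antisymmetric */
        }
    }
}
```

For a rigidly rotating velocity field the strain rate vanishes and only the rotation rate tensor is nonzero, which is precisely why the Jaumann rotation terms leave a rigid rotation stress-free.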
2.5. Equation of state
The equation of state relates the thermodynamic variables density ϱ, pressure p, and specific internal energy u. We give a short description of the equations of state that are implemented in the code.
2.5.1. Liquid equation of state
For small compressions and expansions of a liquid, the pressure depends linearly on the change of density. Let ϱ_{0} be the initial density of the uncompressed liquid. Then, the liquid equation of state reads

p = c_{s}^{2} (ϱ − ϱ_{0}).    (11)

Here, c_{s} denotes the bulk sound speed of the liquid, which is related to the bulk modulus K_{0} and the initial density by

c_{s} = √(K_{0}/ϱ_{0}).    (12)

If the density ϱ drops significantly below the initial density ϱ_{0}, the liquid equation of state fails because the liquid forms droplets and attracting forces vanish. To account for this behavior, the pressure is set to zero as soon as the ratio ϱ/ϱ_{0} drops below a threshold of 0.8–0.95 (Melosh 1996).
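The liquid equation of state with the droplet cutoff condenses into a few lines. The following sketch is illustrative only (the function name is ours, and the cutoff ratio is passed in as a parameter, to be chosen from the quoted range 0.8–0.95):

```c
/* Liquid equation of state, Eq. (11): p = c_s^2 (rho - rho0), with the
 * pressure set to zero once rho/rho0 falls below the droplet cutoff. */
double liquid_eos_pressure(double rho, double rho0, double cs, double cutoff)
{
    if (rho / rho0 < cutoff)
        return 0.0;   /* droplets form, attracting forces vanish */
    return cs * cs * (rho - rho0);
}
```

A quick sanity check: at the reference density the pressure vanishes, and below the cutoff ratio no tension is transmitted.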
2.5.2. Murnaghan equation of state
The Murnaghan equation of state is an extension of the liquid equation of state (see, e.g., Melosh 1996). In contrast to the liquid equation of state, the pressure now depends nonlinearly on the density,

p = (K_{0}/n_{M}) [(ϱ/ϱ_{0})^{n_{M}} − 1],    (13)

with the zero pressure bulk modulus K_{0} and the constant n_{M}. The Murnaghan equation of state is limited to isothermal compression. Parameters for various materials can be found in the literature, e.g., Melosh (1996).
2.5.3. Tillotson equation of state
The Tillotson equation of state was originally derived for high-velocity impact simulations (Tillotson 1962). There are two domains, depending on the material specific energy density u. In the case of compressed regions (ϱ ≥ ϱ_{0}) and u lower than the energy of incipient vaporization E_{iv}, the equation of state reads

p = [a_{T} + b_{T}/(u/(E_{0}η^{2}) + 1)] ϱu + A_{T}χ + B_{T}χ^{2},    (14)

with η = ϱ/ϱ_{0} and χ = η − 1. In the case of expanded material (u greater than the energy of complete vaporization E_{cv}), the equation of state takes the form

p = a_{T}ϱu + [b_{T}ϱu/(u/(E_{0}η^{2}) + 1) + A_{T}χ exp(−β_{T}(ϱ_{0}/ϱ − 1))] exp(−α_{T}(ϱ_{0}/ϱ − 1)^{2}).    (15)

The symbols ϱ_{0}, A_{T}, B_{T}, E_{0}, a_{T}, b_{T}, α_{T}, and β_{T} are material dependent parameters.
In the regime of partial vaporization, E_{iv} < u < E_{cv}, the pressure p is linearly interpolated between the pressures obtained via Eqs. (14) and (15), respectively. For a more detailed description see Melosh (1996).
The Tillotson equation of state has the advantage of being computationally very simple, while being sophisticated enough for application over a wide range of physical conditions, such as high-velocity collisions and impact cratering.
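To illustrate this computational simplicity, the compressed branch, Eq. (14), is a single expression. The sketch below covers only that branch (the expanded branch and the interpolation region are analogous); the function name is ours, and parameter names follow the text:

```c
/* Compressed branch of the Tillotson EOS, Eq. (14), valid for
 * rho >= rho0 and u < E_iv, with eta = rho/rho0 and chi = eta - 1. */
double tillotson_compressed(double rho, double u, double rho0, double E0,
                            double aT, double bT, double AT, double BT)
{
    double eta = rho / rho0;
    double chi = eta - 1.0;
    double denom = u / (E0 * eta * eta) + 1.0;
    return (aT + bT / denom) * rho * u + AT * chi + BT * chi * chi;
}
```

At the reference density and zero internal energy the pressure vanishes, a quick consistency check of any implementation.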
2.6. Plastic material equations
Together with the equation of state, the set of Eqs. (1), (2), (5) can be used to describe the dynamics of a solid body in the elastic regime. We apply the von Mises yield criterion (von Mises 1913) to model plasticity. The deviatoric stress is limited by

S^{αβ} → f_{Y} S^{αβ},    (16)

where the factor f_{Y} is computed from

f_{Y} = min [Y_{0}^{2}/(3J_{2}), 1],    (17)

with the second invariant of the deviatoric stress tensor

J_{2} = (1/2) S^{αβ} S^{αβ},    (18)

and the material dependent yield stress Y_{0}.
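A compact sketch of the yield limiting of Eqs. (16)–(18) (illustrative; the function name is ours):

```c
/* von Mises yield limiting, Eqs. (16)-(18): compute
 * J2 = (1/2) S^ab S^ab and return the scaling factor
 * f_Y = min(Y0^2 / (3 J2), 1) applied to the deviatoric stress. */
double von_mises_factor(const double S[3][3], double Y0)
{
    double J2 = 0.0;
    for (int a = 0; a < 3; a++)
        for (int b = 0; b < 3; b++)
            J2 += 0.5 * S[a][b] * S[a][b];
    if (3.0 * J2 <= Y0 * Y0)
        return 1.0;          /* elastic regime: stress unchanged */
    return Y0 * Y0 / (3.0 * J2);
}
```

Scaling S^{αβ} by this factor maps any stress state outside the yield surface back onto it, while states inside the surface are left untouched.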
2.7. Damage model for brittle material
Since one major application of the newly developed code is the simulation of collisions between brittle planetesimals, we included a fragmentation model. Physically, fracture is related to the failure of atomic or molecular bonds. If an applied force on a solid brittle body is large enough to break the atomic bonds, the material begins to fracture and develop cracks. A successful continuum model for fragmentation was derived by Grady & Kipp (1980) and first implemented in an SPH code by Benz & Asphaug (1995). The model is based on two assumptions: brittle materials contain so-called flaws, sources of weakness that are activated and grow under tensile loading, and dynamic fracture depends on the rate of tensile loading. In other words, material that is slowly pulled apart develops fewer cracks than material that experiences higher strain rates, since one single crack cannot relieve the higher tensile stress; hence more flaws get activated, leading to more cracks and consequently more damage.
The scalar damage parameter d parametrizes the influence of cracks on a given volume,

0 ≤ d ≤ 1,    (19)

and is defined in such a way that d = 0 represents an undamaged, intact material and d = 1 a fully damaged material that cannot undergo any tension or deviatoric stress. In this manner, a material with d = 0.5 experiences half the stress and tension of undamaged material with d = 0.
Damage reduces the strength under tensile loading, and the elastic stress σ^{αβ} is decreased by the factor (1 − d), or in more detail,

σ_{d}^{αβ} = −p_{d} δ^{αβ} + (1 − d) S^{αβ},    (20)

with

p_{d} = p for p ≥ 0,  p_{d} = (1 − d) p for p < 0.    (21)

The number of flaws per unit volume in a brittle body with failure strains lower than ε is given by the Weibull distribution (Weibull 1939)

n(ε) = k ε^{m},    (22)

with the two Weibull parameters k and m. Typical values for basalt are m = 16, k = 1 × 10^{61} m^{−3} (Nakamura et al. 2007).
As soon as a spherical volume undergoes a strain greater than its lowest activation threshold strain, damage starts to grow according to

d(d^{1/3})/dt = c_{g}/R_{s},    (23)

with the constant crack growth velocity c_{g} and the radius R_{s} of the spherical volume. The crack growth velocity is chosen to be 40% of the velocity of a longitudinal elastic wave in the damaged material,

c_{g} = 0.4 √((K_{0} + (4/3)μ)/ϱ).    (24)

The fracture model requires the distribution of activation threshold strains over the brittle body as a preprocessing step to the simulations. We show how we achieve this in Sect. 3.2.
3. Numerical model and implementation
In this section, we describe the numerical models that are implemented in the CUDA code.
3.1. Smooth particle hydrodynamics
Smooth particle hydrodynamics is a meshless Lagrangian particle method that was first introduced by Lucy (1977) and Gingold & Monaghan (1977) for the solution of the hydrodynamic equations for compressible flows in astrophysical applications. The SPH method was extended to model solid bodies at the Los Alamos National Laboratory. This work was pioneered by Libersky & Petschek (1991), with various later improvements by the same group (Randles & Libersky 1996; Libersky et al. 1997). The first astrophysical application of SPH with strength of material was by Benz & Asphaug (1994). Our implementation of the SPH equations in the code mainly follows their work.
For a comprehensive introduction to the basic SPH idea and algorithm, we refer to the excellent publications by Benz (1990) and Monaghan (1992). In this section, we present the SPH equations that we have implemented in the CUDA code, since more than one equivalent SPH representation exists for the partial differential equations considered in this work.
In the following, Roman indices a, b denote particle indices, and the sum in an SPH equation runs over all interaction partners.
3.1.1. Kernel function
The standard kernel, which is widely used in the astrophysical SPH community and in our CUDA code, is the cubic B-spline function introduced by Monaghan & Lattanzio (1985). With q = r/h, it is written as

W(r; h) = (s/h^{D}) × { 6q^{3} − 6q^{2} + 1 for 0 ≤ q < 1/2;  2(1 − q)^{3} for 1/2 ≤ q ≤ 1;  0 for q > 1 },    (25)

where r is the distance between two interacting particles, h denotes the smoothing length, and s is a normalization constant depending on the dimension D,

s = 4/3 for D = 1,  s = 40/(7π) for D = 2,  s = 8/π for D = 3.    (26)

The first derivative of the kernel is given by

∂W/∂r = (s/h^{D+1}) × { 18q^{2} − 12q for 0 ≤ q < 1/2;  −6(1 − q)^{2} for 1/2 ≤ q ≤ 1;  0 for q > 1 }.    (27)
3.1.2. SPH equations
In most astrophysical applications of SPH to hydrodynamical problems, the continuity Eq. (1) is not solved. Instead, the density of a particle a is given directly by the kernel sum

ϱ_{a} = Σ_{b} m_{b} W_{ab},    (28)

with W_{ab} = W(|x_{a} − x_{b}|; h). We use Eq. (28) for purely hydrodynamical problems as well, and to calculate the densities during the gravitational collapse of a molecular cloud in Sect. 4.2.
To model solid bodies, however, it is often more convenient to integrate the continuity equation using the following SPH representation: the change of the density of particle a is given by (Gray et al. 2001)

dϱ_{a}/dt = Σ_{b} m_{b} (v_{a} − v_{b})^{α} ∂W_{ab}/∂x_{a}^{α}.    (29)

In order to calculate the acceleration of particle a, we use the SPH representation of the equation of motion (2),

dv_{a}^{α}/dt = Σ_{b} m_{b} (σ_{a}^{αβ}/ϱ_{a}^{2} + σ_{b}^{αβ}/ϱ_{b}^{2}) ∂W_{ab}/∂x_{a}^{β}.    (30)

We add some additional artificial terms to Eq. (30) in the following sections to avoid some numerical issues of the algorithm.
Finally, the equation for the time evolution of the specific internal energy reads in SPH representation

du_{a}/dt = (1/2) Σ_{b} m_{b} (p_{a}/ϱ_{a}^{2} + p_{b}/ϱ_{b}^{2}) (v_{a} − v_{b})^{α} ∂W_{ab}/∂x_{a}^{α} + (1/ϱ_{a}) S_{a}^{αβ} ε̇_{a}^{αβ}.    (31)
3.1.3. Artificial viscosity
An important issue is the artificial viscosity. In order to treat shock waves in an ideal fluid, dissipation is needed to dissipate kinetic energy in the shock front. In SPH, it is additionally necessary to prevent unphysical mutual penetration of the particles. The artificial viscosity terms were introduced by Monaghan & Gingold (1983). They can be regarded as an additional artificial pressure term in the equations for the conservation of momentum and energy. The additional pressure is calculated for each interaction pair a and b as follows:

Π_{ab} = (−α c̄_{ab} ν_{ab} + β ν_{ab}^{2}) / ϱ̄_{ab},    (32)

where α and β are free parameters that determine the strength of the viscosity, and c̄_{ab} = (c_{a} + c_{b})/2 and ϱ̄_{ab} = (ϱ_{a} + ϱ_{b})/2 are the averaged sound speed and density of the two interacting particles.
The term ν_{ab} is an approximation for the velocity divergence and is calculated accordingly,

ν_{ab} = h̄_{ab} (v_{a} − v_{b})·(x_{a} − x_{b}) / [(x_{a} − x_{b})^{2} + ε h̄_{ab}^{2}].    (33)

Here, h̄_{ab} = (h_{a} + h_{b})/2 is the average of the smoothing lengths of particles a and b. The term ε h̄_{ab}^{2} in the denominator prevents the divergence of ν_{ab} for particles with small separations.
The artificial pressure is added to the acceleration of the particle,

dv_{a}^{α}/dt → dv_{a}^{α}/dt − Σ_{b} m_{b} Π_{ab} ∂W_{ab}/∂x_{a}^{α},    (34)

with Π_{ab} ≠ 0 only for (v_{a} − v_{b})·(x_{a} − x_{b}) < 0 and Π_{ab} = 0 elsewhere. This means that the artificial viscosity leads to a repellent force only for approaching particles.
3.1.4. Artificial stress: tensile instability
The tensile instability can occur in SPH simulations that include materials with equations of state that allow negative pressure or tension. In the worst case, the instability leads to artificial numerical clumping of particles and in the end to unnatural fragmentation. There are several different approaches to overcome the tensile instability. In our code, we have implemented an additional artificial stress. To avoid artificial clumping of particles due to the tensile instability of the SPH method, Monaghan (2000) added a small repulsive artificial stress to the SPH Eq. (30),

dv_{a}^{α}/dt → dv_{a}^{α}/dt + Σ_{b} m_{b} f^{n} R_{ab}^{αβ} ∂W_{ab}/∂x_{a}^{β}.    (35)

The artificial stress is given by the sum of the artificial stresses for particles a and b, respectively,

R_{ab}^{αβ} = R_{a}^{αβ} + R_{b}^{αβ},    (36)

with, for tension (σ_{a}^{αβ} > 0),

R_{a}^{αβ} = −ε_{s} σ_{a}^{αβ}/ϱ_{a}^{2},    (37)

whereas R_{a}^{αβ} = 0 for compression. The factor ε_{s} is typically around 0.2. The repulsive force that results from the artificial stress is scaled with f^{n}, n > 0, where f is given by the ratio between the kernel for the interaction of particles a and b and the kernel evaluated at Δ_{mpd}, the mean particle distance in the vicinity of particle a,

f = W(r_{ab}) / W(Δ_{mpd}).    (38)

The usual approach is to take the mean particle distance of the initial particle distribution for Δ_{mpd}. The artificial stress leads to an additional repulsive force whenever tension would trigger the tensile instability. The effect of artificial stress is shown in an illustrative example in Sect. 4.1.
3.1.5. XSPH – the velocity
The time derivative of the location of particle a is given by the velocity of particle a,

dx_{a}/dt = v_{a}.    (39)

However, the velocity of the continuum cannot necessarily be identified with the particle velocity. The XSPH algorithm (Monaghan 1989) moves the particles at an averaged speed smoothed by the velocity of the interaction partners,

dx_{a}/dt = v_{a} + x_{sph} Σ_{b} (2m_{b}/(ϱ_{a} + ϱ_{b})) (v_{b} − v_{a}) W_{ab}.    (40)

Here, the factor x_{sph}, 0 ≤ x_{sph} ≤ 1, is called the XSPH factor. Throughout the simulations performed for this paper, we set x_{sph} = 0.5 when we use XSPH.
3.1.6. A note on linear consistency
In standard SPH, the velocity derivatives in Eq. (8) for the determination of the strain rate tensor of particle a are given by

∂v_{a}^{α}/∂x^{β} = Σ_{b} (m_{b}/ϱ_{b}) (v_{b} − v_{a})^{α} ∂W_{ab}/∂x_{a}^{β}.    (41)

This approach, however, leads to erroneous results and does not conserve angular momentum, owing to the discretization error caused by particle disorder in simulations of solid bodies. This error can be avoided by constructing and applying a correction tensor (Schäfer et al. 2007),

∂v_{a}^{α}/∂x^{β} = Σ_{b} (m_{b}/ϱ_{b}) (v_{b} − v_{a})^{α} ∂W_{ab}/∂x_{a}^{γ} C^{γβ},    (42)

where the correction tensor C^{γβ} is given by the inverse of

Σ_{b} (m_{b}/ϱ_{b}) (x_{b} − x_{a})^{α} ∂W_{ab}/∂x_{a}^{β},    (43)

such that it holds

Σ_{b} (m_{b}/ϱ_{b}) (x_{b} − x_{a})^{α} ∂W_{ab}/∂x_{a}^{γ} C^{γβ} = δ^{αβ}.    (44)

We apply the correction tensor according to Eq. (42) in all our simulations that demand derivatives of the velocity, to enforce first order consistency.
3.2. Damage model
As a preprocessing step to a simulation containing brittle materials, we have to distribute activation threshold strains or flaws among the SPH particles. We apply the algorithm introduced by Benz & Asphaug (1995) with the following approach: flaws are distributed randomly among the brittle particles until each particle has been assigned at least one activation threshold. Let N_{flaw} be the total number of distributed activation flaws. For each flaw j, 1 ≤ j ≤ N_{flaw}, a particle a is chosen randomly and assigned the activation threshold strain derived from the Weibull distribution,

ε_{a,j} = [j/(k V)]^{1/m},    (45)

where V is the volume of the brittle material and k and m are the material dependent Weibull parameters. Using this procedure, each particle gets at least one flaw, and on average a total of N_{flaw} = N_{p} ln N_{p} flaws is distributed. These flaws have to be stored in memory during the simulation and are the main memory consumer in simulations with brittle materials.
During the simulation, once a particle undergoes a strain greater than its lowest activation threshold strain, damage grows according to a modified version of Eq. (23),

d(d^{1/3})/dt = n_{act} c_{g}/R_{s},    (46)

where n_{act} is the number of activated flaws of the particle. This modification accounts for the accumulation of cracks.
Additionally, a damage limiter is included in the model. Since a particle in general has more than one activation threshold, its behavior under tensile loading would otherwise be determined by the flaw with the lowest activation threshold. To prevent this, the damage of the particle is limited by the ratio between the number of activated flaws n_{act} and the total number of flaws n_{tot} of this particle,

d ≤ n_{act}/n_{tot}.    (47)

In other words, a particle can only be totally damaged if all of its flaws are activated.
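The limiter of Eq. (47) is a one-line clamp (an illustrative sketch; the function name is ours):

```c
/* Damage limiter, Eq. (47): a particle's damage cannot exceed the
 * fraction of its flaws that have been activated. */
double limit_damage(double d, int n_act, int n_tot)
{
    double dmax = (double)n_act / (double)n_tot;
    return d < dmax ? d : dmax;
}
```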
In order to determine whether a particle exceeds one of its threshold strains during the simulation, we need to derive a local scalar strain from the three-dimensional stress. Benz & Asphaug (1995) suggest computing the local scalar strain ε from the maximum tensile stress σ_{max} = max [σ_{1}, σ_{2}, σ_{3}], where σ_{γ} are the principal stresses determined by a principal axis transformation of the stress tensor σ^{αβ}, which is already reduced by damage and yield strength,

ε = σ_{max} / [(1 − d) E].    (48)

The Young's modulus E of the intact material is given in terms of the bulk and shear moduli by

E = 9K_{0}μ / (3K_{0} + μ).    (49)
3.3. BarnesHut tree
We have implemented the efficient routines for BarnesHut tree builds and calculation of the gravitational forces for GPUs introduced by Burtscher (2011). The algorithm for the tree generation is explained in Sect. 3.5.1.
Barnes & Hut (1986) were the first to use a hierarchical tree to calculate the gravitational forces between N bodies. Their approximation algorithm uses a tree-structured hierarchical subdivision of space into cubic cells, each of which is divided into 2^{D} subcells. The algorithm is of order 𝒪(N log N), compared to the 𝒪(N^{2}) of the direct sum. The basic idea is that only nearby particles have to be treated individually during the calculation of the gravitational force on a specific particle. Distant nodes of the tree can be treated as one single particle sitting at the center of mass of the respective tree node and carrying the mass of all the particles that reside in this node. Although different algorithms for the computation of self-gravitational forces exist, the hierarchical tree method is especially suited for use in combination with SPH. As a by-product of the computation of the gravitational forces, the hierarchical tree is constructed for each individual time step, and hence a geometrical hierarchy of the SPH particles (the leaves of the tree) is available at all times of the computation. It can be used to determine the interaction partners for the SPH algorithm by performing a tree walk for each individual particle (Hernquist & Katz 1989). Although the tree walks may take longer than the search through a helping grid, the interaction partner search may still be faster overall, because the tree build needs to be carried out in any case in simulations with self-gravity.
We calculate the center of mass and total mass of every node in the tree to calculate the gravitational force exerted on the SPH particles. We call these objects pseudo-particles in the following. Now, for each SPH particle, we perform a tree walk over all nodes. With the use of the ϑ criterion, we identify whether a pseudo-particle sitting in the node may be used for the calculation of the gravitational force acting on our particle or whether we have to descend further down in the tree. The ϑ criterion is given by

s_{dt}/r_{pp} < ϑ,    (50)

where s_{dt} denotes the edge length of the tree node containing the pseudo-particle and r_{pp} is the distance between the SPH particle and the pseudo-particle. The gravitational force from the pseudo-particle acting on particle a is then given by

F_{a} = −G m_{a} m_{p} (x_{a} − x_{p}) / |x_{a} − x_{p}|^{3},    (51)

where x_{p} and m_{p} denote the location and the mass of the pseudo-particle, respectively. To avoid numerical instabilities, the distance between the pseudo-particle and the particle is smoothed.
Barnes & Hut (1989) present a detailed error analysis for their tree algorithm. They find comparable accuracies for ϑ = 0.3 with consideration of only the monopoles of the tree and for ϑ = 0.7 with the additional calculation of the quadrupoles for each tree node. The error of the monopole approximation scales with ϑ^{2}. For the sake of memory saving, we calculate only the monopoles of the tree and use ϑ ≤ 0.5 in our simulations including self-gravity, instead of the usual value of 0.8.
3.4. Time integration
An important component of every SPH code is the time integration. We have implemented two different time integrators in the code: a predictor-corrector algorithm following the publication by Monaghan (1989), and a third-order Runge-Kutta integrator with adaptive time step control. In the following, we present the basic algorithm of the integrators and the time step control mechanisms.
3.4.1. Predictor-corrector
A simple predictor-corrector integration scheme is implemented in the code as follows. Let q_{a}(t) be a quantity that is integrated in time and dq_{a}/dt its derivative. At first we perform a prediction integration step and calculate the value after half the time step Δt,

q_{a}(t + Δt/2) = q_{a}(t) + (Δt/2) dq_{a}/dt(t).    (52)

Now, we use the predicted values to calculate the time derivatives for the correction step and, based on them, the new values q_{a}(t + Δt),

q_{a}(t + Δt) = q_{a}(t) + Δt dq_{a}/dt(t + Δt/2).    (53)

Obviously, the time step Δt has to be limited. To set an upper limit to the time step, we use the following condition from the acceleration,

Δt_{1} = min_{a} √(h/|dv_{a}/dt|),    (54)

and the Courant condition

Δt_{2} = min_{a} h/c_{a},    (55)

with the sound speed c_{a} of particle a. The maximum time step for this integration step is then given by

Δt = C min [Δt_{1}, Δt_{2}],    (56)

with a Courant factor C ≤ 1. Additionally, we allow only a maximum relative change per time step for all other quantities that are integrated. This may be necessary for quantities such as the density, stress, internal energy, and damage, and reads

Δt_{q} = min_{a} |q_{a}| / |dq_{a}/dt|.    (57)
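For a scalar ODE dq/dt = f(q), one step of the scheme of Eqs. (52)–(53) reads as follows (an illustrative sketch with a hypothetical right-hand side, not the code's actual integrator, which advances all particle quantities simultaneously):

```c
/* One predictor-corrector step in the spirit of Eqs. (52)-(53):
 * predict at t + dt/2, evaluate the derivative there, and correct
 * over the full step. */
double pc_step(double q, double dt, double (*f)(double))
{
    double q_half = q + 0.5 * dt * f(q);   /* predictor, Eq. (52) */
    return q + dt * f(q_half);             /* corrector, Eq. (53) */
}

/* Example right-hand side: exponential decay dq/dt = -q. */
static double decay(double q) { return -q; }
```

One step of size 0.1 applied to q = 1 with the decay example yields 0.905, close to the exact value exp(−0.1) ≈ 0.9048, reflecting the second-order accuracy of the midpoint-style scheme.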
3.4.2. Embedded RungeKutta integrator
In addition to the predictor-corrector scheme, we also implemented a more sophisticated Runge-Kutta integration scheme that features an adaptive step size control. The basic algorithm is a third-order Runge-Kutta scheme in which an embedded lower-order solution is used to give an error estimate for the quality of the integration step; see Abramowitz & Stegun (1964) and especially Butcher (1987) for a more detailed description. The great advantage of this scheme lies in the possibility of setting a maximum relative error for the integrated quantities, since the Courant condition, in principle, is only applicable in hydrodynamical simulations.
The standard third-order Runge-Kutta integration scheme for a quantity q, which is integrated in time from t_{n} to t_{n} + Δt, is written as

q_{n+1} = q_{n} + (Δt/6)(k_{1} + 4k_{2} + k_{3}),    (58)

with q_{n} = q(t_{n}) and q_{n+1} = q(t_{n} + Δt). The three substeps k_{1}, k_{2}, k_{3} are given by

k_{1} = f(t_{n}, q_{n}),    (59)
k_{2} = f(t_{n} + Δt/2, q_{n} + (Δt/2) k_{1}),    (60)
k_{3} = f(t_{n} + Δt, q_{n} − Δt k_{1} + 2Δt k_{2}),    (61)

where f denotes the time derivative of q. The quantity can also be integrated using the Euler midpoint method. There, the value at time t_{n} + Δt is given by

q̃_{n+1} = q_{n} + Δt k_{2}.    (62)

Now, we use the difference between the two solutions to give an estimate for the local error of the scheme by comparison to the value resulting from the Euler step,

ε_{err} = |q_{n+1} − q̃_{n+1}|.    (63)

If ε_{err} is smaller than the desired relative error ε_{des} defined in the configuration file, the integration step is accepted and the time step size is increased to

Δt_{new} = 0.9 Δt (ε_{des}/ε_{err})^{1/3}.    (64)

If the error is too large, the time step size is decreased accordingly to

Δt_{new} = 0.9 Δt (ε_{des}/ε_{err})^{1/3},    (65)

and the rejected step is repeated with the smaller time step.
3.5. Implementation in CUDA
In this section, we present how we have implemented the numerical algorithms from the last section for NVIDIA GPUs with CUDA. Depending on the problem, the information about the particles is stored in one-dimensional arrays, e.g., posx[N_{p}], posy[N_{p}], posz[N_{p}], velx[N_{p}], vely[N_{p}], velz[N_{p}], and stress[D × D × N_{p}]. The bottleneck for GPU computation is the slow memory bandwidth between host and device. Therefore, we initially copy all particle data to the GPU and perform all calculations with the SPH particles there. After certain time frames, the data are written to the hard disk in an additional POSIX thread, so that the computation time does not depend on disk I/O in simulations with frequent output.
3.5.1. BarnesHut tree
We have implemented the Barnes-Hut tree strictly following the efficient implementation for N-body calculations with CUDA by Burtscher (2011). The hierarchical decomposition is recorded in an octree. Since the convenient pointer-based handling of trees, which, e.g., allows for recursive programming of tree walks, is not well suited to CUDA, we use one large index array for all SPH particles and all tree nodes, which we call childList. This memory is allocated dynamically at the start of the simulation run. The maximum number of tree nodes is given by ceil [2.5 × N_{p}], where N_{p} is the number of SPH particles in the simulation. For each tree node there are N_{children} entries in childList, where N_{children} is simply given by the dimension,

N_{children} = 2^{D}.    (66)

The tree is constructed in such a way that the N_{p} particles reside in N_{p} leaves in the end. At first, the root of the tree is constructed from the geometry of the particle distribution. The computational domain (in three dimensions) is given by the cube that encloses all SPH particles. This cube has the edge length

l_{root} = max_{α} [max_{a} x_{a}^{α} − min_{a} x_{a}^{α}],    (67)

which defines the edge length of the root node. The edge length of a tree node of depth d_{t} is given by

l_{dt} = l_{root}/2^{d_{t}}.    (68)

Starting from the root node, the particles are inserted into the tree in the following way. In the beginning, all entries in childList but one (the root node) are −1, corresponding to the status "no child". All particles are distributed among the threads on the GPU. Now N_{threads} threads on the GPU start to compare the locations of their SPH particles with the position of the root node to determine the child index. Since there are many more threads than children (especially at the beginning of the algorithm), we need to actively avoid race conditions. To accomplish this, we add the status −2, or "locked", to the childList and use the CUDA function "atomic compare and swap", atomicCAS, to lock an entry in the childList. This function first checks whether the specific index is already locked.
If the index in childList is already locked, the thread waits for the other threads to finish and restarts its attempt. If the index is not locked, the thread reads the index value, which may be −1 in the beginning or a different node index, and saves it. Now the second round starts, and the children of the root node are already filled with the SPH particles that were sorted into them. Since leaves, that is, particles, cannot have children by definition, new inner tree nodes have to be generated. A new inner tree node has half the edge length of its parent. New inner tree nodes are added until the two particles, the new particle and the particle already sitting in the parent node, can be sorted into two different leaves. As soon as both particles reside in different leaves, the thread writes the indices and releases the lock.
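The serial core of this insertion, the selection of the child octant for a particle, can be sketched as follows; the CUDA version guards the resulting childList entry with atomicCAS as described above, and all names here are ours:

```c
/* Child octant selection during tree insertion: one bit per coordinate,
 * comparing the particle position against the node center
 * (2^3 = 8 children in three dimensions, cf. Eq. (66)). */
int child_index(const double x[3], const double center[3])
{
    int idx = 0;
    for (int k = 0; k < 3; k++)
        if (x[k] > center[k])
            idx |= 1 << k;   /* set bit k if above the center plane */
    return idx;              /* 0 .. 7 */
}
```

Descending one level halves the node edge length, so repeated application of this selection sorts any two distinct particles into different leaves after finitely many subdivisions.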
When we want to calculate the gravitational forces between the SPH particles, we have to calculate the locations and masses of the pseudo-particles in the tree. We use the algorithm presented by Burtscher (2011) in the following manner to determine the pseudo-particle in a specific node: we start a CUDA kernel with N_{threads} threads in parallel that picks the last N_{threads} added inner nodes, for which we calculate the center of mass and the total mass first. If the children of an inner node are not leaves, the thread processing this node has to wait for the other threads to finish beforehand, because it needs the locations and masses of the pseudo-particles in these child nodes. This is again done by using −1 as the initial mass of a pseudo-particle. When a thread finishes the calculation for its node, it moves up one tree depth and continues with the next node. The implementation of Burtscher is specially designed for CUDA with regard to memory access and parallelization.
3.5.2. Neighbor search
One major task in a SPH code is to find an efficient way to determine the interaction partners of each SPH particle for the kernel sums. Normally, the fastest way to obtain the interaction partner list in a SPH simulation with one fixed smoothing length is via an auxiliary grid with a cell size equal to the smoothing length: all particles are sorted into the grid, and only the neighboring grid cells have to be searched for interaction partners. There are numerous publications about the best way to implement such a search grid in SPH, and we refer to the overview given in Speith (1998).
Since we have already computed the Barnes-Hut tree for the calculation of self-gravity, we can reuse the tree to determine the interaction partners without any additional grid. The tree search is implemented in the following way: corresponding to the childList for the tree, we also have an index array that stores the information about the interaction partners, interactionsList. In contrast to the tree generation algorithm, there are no race conditions in the interaction partner search, since the search for the interaction partners of particle a is independent of the search for particle b. Each thread searches for its particles, starting at the root node for each particle. If the particle lies within the cube of edge length given in Eq. (69), we descend into the children and continue searching. As soon as a leaf, that is, a different particle, is encountered, we check whether the distance between the leaf and the particle is smaller than the smoothing length, and if so, the index of the leaf is added to the interactionsList.
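In outline, the walk looks as follows; this is a minimal serial sketch with an illustrative data layout, and the overlap test below (node cube enlarged by the smoothing length h) stands in for the exact descent criterion of Eq. (69):

```python
def find_neighbors(query, h, root, centers, halves, children, positions):
    """Return ids of particles within distance h of position `query`."""
    result, stack = [], [root]
    while stack:
        n = stack.pop()
        if children[n] is None:                  # leaf: a real particle
            d2 = sum((query[k] - positions[n][k]) ** 2 for k in range(3))
            if 0.0 < d2 <= h * h:                # skip the query particle itself
                result.append(n)
            continue
        # descend only if the query lies inside the node cube enlarged by h
        if all(abs(query[k] - centers[n][k]) <= halves[n] + h
               for k in range(3)):
            stack.extend(c for c in children[n] if c is not None)
    return result
```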
Since all threads are independent in this algorithm, it is very efficient and fast.
3.5.3. Smooth particle hydrodynamics equations
The CUDA implementation of the SPH sums for the equation of motion and the conservation of energy is straightforward: as soon as the interaction partners of each particle are known, these quantities can be computed independently for all particles.
4. Numerical tests and code validation
In this section, we present some of the numerical tests we performed for code validation. We chose four standard cases to test different aspects of the code.
4.1. Colliding rubber cylinders
A suitable test for the correct implementation of the tensile instability fix is the simulation of two colliding rubber cylinders. As first presented by Swegle (1992), the collision results in fragmentation, especially when the particles are initially placed on a square grid. We use the setup of Monaghan (2000) and Gray et al. (2001) with the following material parameters and units: the unit of density is ϱ_{0}, the unit of length is 1 cm, the unit of velocity is the sound speed c_{s} (852 m/s), and the unit of stress is ϱ_{0}c_{s}^{2}. The shear modulus is set to μ = 0.22 ϱ_{0}c_{s}^{2}. We use two rubber cylinders with inner radius 3 cm and outer radius 4 cm, initially moving in one direction with ±0.059 c_{s}, so that the collision velocity is 0.118 c_{s}. In order to test the implementation of the artificial stress, we perform two different simulations: we set the parameters of the artificial stress to ε_{s} = 0.2 and n = 4 in one simulation, while we do not add any artificial stress in the other run. Additionally, XSPH is applied with x_{sph} = 0.5, and the artificial viscosity coefficient is set to α = 1 in both setups.
The results of the simulations with and without artificial stress are shown in Figs. 1 and 2, respectively. In the case without any artificial stress, the cylinders fracture shortly after maximum compression. In the simulation with artificial stress, the cylinders bounce off each other without fracturing and begin to oscillate. The timescale is chosen according to Monaghan (2000) for best comparison; the cylinders touch at t = 0.
Fig. 1 Colliding rubber cylinders with artificial stress; the absolute value of the velocity is color-coded. The use of an additional artificial stress hinders the fracture of the colliding rubber cylinders. The moment of first contact between the two cylinders is at t = 0.
Fig. 2 Colliding rubber cylinders without artificial stress; the absolute value of the velocity is color-coded. Without the use of an additional artificial stress, the colliding rubber cylinders break up. The time of first contact between the two cylinders is t = 0.
Our results are in perfect agreement with Monaghan (2000) and Gray et al. (2001).
4.2. Gravitational collapse of a molecular cloud
In order to test the gravitational tree algorithm in the code, we run the standard test case for self-gravitating objects presented by Boss & Bodenheimer (1979). These authors investigated the collapse of a self-gravitating, initially spherical, and uniformly rotating gas cloud. Internal pressure is included, but viscous and magnetic forces are neglected. The collapse of the cloud is considered to be isothermal. The equation of state is that of an ideal gas of pure molecular hydrogen with constant temperature T = 10 K. The initial conditions are as follows: the total mass of the molecular cloud is M = 1.0 M_{⊙}, the mean density is ϱ_{0} = 1.44 × 10^{−14} kg/m^{3}, the molecular cloud rotates uniformly with the angular velocity Ω = 1.6 × 10^{−12} rad/s, and the initial radius of the spherical cloud is R_{0} = 3.2 × 10^{14} m.
A nonaxisymmetric perturbation of mode m = 2 and amplitude 0.5 depending on the azimuthal angle ϕ is superposed on the initially uniform density, ϱ(ϕ) = ϱ_{0} [1 + 0.5 cos(2ϕ)]. (70) The free-fall time is given by t_{f} = [3π/(32 G ϱ_{0})]^{1/2}. (71)
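The setup can be checked with a few lines of Python; the constants are SI values, and the expressions below implement the m = 2 perturbation and the free-fall time of a uniform sphere as stated above:

```python
import math

G = 6.674e-11        # gravitational constant [m^3 kg^-1 s^-2]
rho_0 = 1.44e-14     # initial mean cloud density [kg m^-3]

def rho(phi):
    """Perturbed density: m = 2 mode with amplitude 0.5 (Eq. (70))."""
    return rho_0 * (1.0 + 0.5 * math.cos(2.0 * phi))

# free-fall time of a uniform sphere, t_f = sqrt(3 pi / (32 G rho_0))
t_f = math.sqrt(3.0 * math.pi / (32.0 * G * rho_0))
print(t_f / 3.156e7)   # in years: about 1.75e4 yr
```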
Fig. 3 Gravitational collapse; the density in the midplane is color-coded. The density in the midplane z = 0 is shown at t = 0.998 t_{f}. To generate the plot, the 2 × 10^{6} SPH particles were mapped onto a 128 × 128 × 128 cubic grid.
We performed several simulations with different numbers of particles, ranging from 8000 to 2 000 000, and varied the type of smoothing of the gravitational forces. Since the mean particle distance rapidly shrinks during the simulation, it is convenient to use a variable smoothing length and a constant number of interaction partners. The kernel value for the interaction of particles a and b is then calculated with the mean smoothing length h̄_{ab} = (h_{a} + h_{b})/2. (72) To enforce a constant number of interaction partners per particle, we start with the smoothing length of the last time step and iterate using the new smoothing length (Hernquist & Katz 1989) h_{new} = (h/2) [1 + (n_{des}/n_{a})^{1/3}], (73) where n_{des} is the desired number of interaction partners and n_{a} is the actual number of interaction partners obtained with the smoothing length h.
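Both prescriptions are one-liners; the sketch below assumes Eq. (72) is the arithmetic mean of the two smoothing lengths and Eq. (73) is the Hernquist & Katz iteration rule as written above:

```python
def mean_h(h_a, h_b):
    """Symmetrized pair smoothing length (Eq. (72))."""
    return 0.5 * (h_a + h_b)

def update_h(h, n_actual, n_desired):
    """One Hernquist & Katz (1989) iteration of Eq. (73):
    grows h when too few neighbors were found, shrinks it when too many."""
    return 0.5 * h * (1.0 + (n_desired / n_actual) ** (1.0 / 3.0))
```

Note that h is a fixed point of the iteration exactly when n_{a} = n_{des}, so repeated application drives the neighbor count toward the desired value.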
The results of our simulations are in very good agreement with previous calculations by Boss & Bodenheimer (1979). Moreover, in accord with Bate & Burkert (1997), we find differing collapse speeds depending on the smoothing of the gravitational potential.
Figure 3 shows a color-coded plot of the density in the midplane z = 0 after a simulation time of approximately one free-fall time t_{f} for the simulation with two million SPH particles. There is already an increase in density of almost two orders of magnitude at the location of the initial perturbation, where the two protostars are about to form.
Figure 4 shows the evolution of the maximum density during the collapse for different setups and smoothings of the gravitational force. In order to estimate the maximum density, we mapped the SPH particles onto a cubic grid to compare our results to the grid code results by Boss & Bodenheimer (1979). Since our spatial resolution is much better than that of the coarse simulations 30 years ago, we obtain a higher maximum density. Accordingly, we find a slower evolution of the maximum density in our (very) low resolution simulations with 8000 particles. Another major influence on the density evolution is the smoothing of the gravitational forces. In order to avoid numerical instabilities due to close encounters of two particles, Eq. (51) has to be modified and the distance between two bodies is smoothed. We applied three different smoothing schemes for best comparison with the former publications. The first scheme sets the smoothing length as the lowest possible distance between two particles; this scheme was used in the simulation with two million particles. Another ansatz is to add a small fixed distance to r_{pp} in Eq. (51), regardless of the separation between the two gravitating bodies. The values for this distance employed by Bate & Burkert (1997) are 5% and 10% of the initial radius of the molecular cloud. Using these different smoothing methods, we obtain the density evolutions presented in Fig. 4.
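The effect of the fixed-distance smoothing can be sketched as follows; the functional form (adding ε to r_{pp} before squaring) follows the description above, while the particle mass and separation are arbitrary illustrative numbers and Eq. (51) itself is not reproduced:

```python
G = 6.674e-11    # gravitational constant [m^3 kg^-1 s^-2]
R_0 = 3.2e14     # initial cloud radius [m]

def pair_accel(m, r_pp, eps):
    """Pair acceleration magnitude with the separation smoothed by eps."""
    return G * m / (r_pp + eps) ** 2

# for close encounters the acceleration saturates instead of diverging
m = 1.0e27                    # illustrative particle mass [kg]
close = 1.0e10                # separation much smaller than the cloud [m]
a_unsmoothed = pair_accel(m, close, 0.0)
a_5pct = pair_accel(m, close, 0.05 * R_0)   # Bate & Burkert value
a_10pct = pair_accel(m, close, 0.10 * R_0)  # Bate & Burkert value
```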
Fig. 4 Gravitational collapse of a molecular cloud. Evolution of the maximum density for different setups and smoothing schemes of the gravitational force. The two dots are values from the grid code simulations of Boss and Bodenheimer.
4.3. Impact cratering: comparison to experiment
Prater (1970) performed several experiments of high-velocity impacts into different aluminium alloys. To validate our code for the use of impact modeling, we simulated the impact of an aluminium alloy 6061 bullet with radius 6.35 mm into an aluminium alloy 6061 cube with an impact speed of 7000 m s^{−1}. We then compared our results to Pierazzo et al. (2008). Since aluminium is not a geological material, this simulation was performed without the use of the Grady-Kipp damage model, solely with the von Mises plasticity model. The three-dimensional simulation has the following initial setup: we use 500 000 SPH particles for the cube and 136 particles for the impacting bullet; the numerical parameters are α = 0.5 and β = 0, and the material parameters are μ = 26 × 10^{9} Pa, K_{0} = 75.27 × 10^{9} Pa, Y_{0} = 1 × 10^{8} Pa, and ϱ_{0} = 2.7 × 10^{3} kg/m^{3}. The parameters for the Tillotson equation of state, A_{T}, B_{T}, E_{0}, E_{iv}, E_{cv}, a_{T}, b_{T}, α_{T}, β_{T}, are 75.2 × 10^{9} Pa, 65 × 10^{9} Pa, 5 × 10^{6} J, 3 × 10^{6} J, 13.9 × 10^{6} J, 0.5, 1.63, 5, and 5, respectively. The surface of the cube with the crater after the impact is shown in Fig. 5.
Fig. 5 Impact into Al6061 simulation. The surface of the cube with the impact crater after 100 μs is shown. The size of the cube is 10 × 10 × 10 cm. The impactor had a radius of 6.35 mm and the impact speed was 7000 m s^{−1}.
The dimensions of the final crater can be seen in Fig. 6. They are in good agreement with the numerical results by Pierazzo et al. (2008) and the experimental results by Prater (1970).
Fig. 6 Crater dimensions of the Al6061 impact simulation. The plot shows the color-coded density in a slice through the cube at y = 0, with the cube edge from −5 cm to 5 cm, after 100 μs.
Fig. 7 Damage pattern in the basalt target 50 μs after the impact of the nylon bullet. The impact point is at the upper left surface of the spherical target. The diameter of the target is 6 cm. The color code denotes the damage of the particles and the target was split in half to reveal the damage pattern inside. Red particles are fully damaged. The intact core, indicated by the blue particles, incorporates about 30.4% of the original target mass. 
Fig. 8 Cumulative fragment mass spectrum at the end of the impact simulation, 50 μs after the impact, calculated with a friend-of-friend algorithm.
4.4. Damage model: comparison to experiment
We simulated a laboratory impact experiment into basaltic rock performed by Nakamura & Fujiwara (1991) to test the implementation of the damage model. In their experiment, a small 0.2 g nylon bullet impacts a basaltic sphere with a diameter of 6 cm at an impact speed of 3200 m s^{−1}. We chose to simulate an off-axis impact with an impact angle of 30° (see Fig. 9 for the definition of the impact angle), which corresponds to an impact parameter of 25% of the diameter. We modeled the basalt target with 523 741 particles and applied the Tillotson equation of state. Owing to the lack of material parameters for nylon, we used lucite for the projectile material (values were taken from Benz & Asphaug 1994). The basalt target was modeled with the additional damage model, and activation threshold strains were distributed statistically following the Weibull distribution as a preprocessing step (see Sect. 3.2 for a detailed description of the algorithm). Following Benz & Asphaug (1994), we performed several simulations with varying Weibull parameter pairs k and m. Figure 7 shows the damage pattern in the target after 50 μs for the simulation that matched the experimental outcome best. The damage is largest in the region around the impact point, and all particles close to the impact point were damaged immediately upon impact. The damage pattern is radially symmetric around the impact point. Another source of damage lies near the surface of the target, where damage grows owing to spallation stresses that occur when compressive waves are reflected at the free surface. As a result, the inner core of the target remains intact while the surface is coated with cracks. After 50 μs simulation time, the damage pattern in the target remains unchanged. We identified single fragments using a friend-of-friend algorithm in the following way: first, we removed all fully damaged particles.
Then, we clustered all remaining particles that are closer than 1.01 times the original smoothing length to their neighboring particles. Fragments consisting of fewer than three particles were neglected. Figure 8 shows the fragment mass spectrum obtained with this friend-of-friend clustering algorithm.
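A minimal friend-of-friend sketch of the procedure above (O(N²) and purely illustrative; fully damaged particles are assumed to have been removed already):

```python
def friends_of_friends(positions, linking_length):
    """Return fragments as lists of particle indices: particles closer
    than the linking length to any member join that member's fragment."""
    unvisited = set(range(len(positions)))
    ll2 = linking_length ** 2
    fragments = []
    while unvisited:
        seed = unvisited.pop()
        fragment, stack = [seed], [seed]
        while stack:
            i = stack.pop()
            near = [j for j in unvisited
                    if sum((positions[i][k] - positions[j][k]) ** 2
                           for k in range(3)) <= ll2]
            for j in near:
                unvisited.remove(j)
            fragment.extend(near)
            stack.extend(near)
        fragments.append(fragment)
    return fragments
```

Clustering is transitive: two particles farther apart than the linking length still end up in the same fragment if a chain of close particles connects them.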
We find the best agreement of the remaining target core size with the experimental data for k = 5 × 10^{34} m^{−3} and m = 8, which is close to the values k = 5 × 10^{34} m^{−3} and m = 8.5 found by Benz & Asphaug (1994). The intact core incorporates 30.4% of the initial target mass, which has to be compared to the experimental value of 31%.
Fig. 9 Collision geometry. The collision angle α is defined such that α = 0 for a head-on impact. Projectile and target radii are R_{P} and R_{T}, respectively; b is the impact parameter.
Fig. 10 Collision snapshots. In the top right corner of each frame we indicate the elapsed simulation time. The collision itself occurs 234 min into the simulation. For better visibility of the inner structure, we cut open the two Ceres-mass bodies along the z = 0 symmetry plane. Basalt is plotted in red, water ice in blue.
5. Collision between Ceres-sized objects and water transport
In a forthcoming paper, we study the transfer of water during collisions of Ceres-sized bodies based on preliminary results obtained with our previously presented OpenMP SPH code (Maindl et al. 2013, 2014) and on new, higher resolution simulation results obtained with the code described here. While Marcus et al. (2010) show that water contents cannot increase in giant impacts involving bodies with masses between 0.5 M_{⊕} and 5 M_{⊕}, we are interested in possible water delivery by impacts in early planetary systems and hence in collision events of smaller bodies, which typically involve lower energies and allow water (ice) to be transferred between the impactors. Many existing planet formation studies (e.g., Dvorak et al. 2012; Raymond et al. 2004; Izidoro et al. 2013) assume perfect merging and complete delivery of the asteroids' and planetesimals' water content to the impact target. As this assumption does not hold, we need to investigate the impact process itself closely to define the conditions under which water is actually transferred rather than lost in a smaller scale collision. Including a realistic water transfer and loss model in planet formation studies may reduce the actual amount of water on the final terrestrial planets by a factor of 5–10 (Haghighipour & Raymond 2007). Most simulations of giant impacts use a strengthless material model (e.g., Canup et al. 2013), based on self-gravity dominating the material's tensile strength beyond a certain size of the colliding bodies (400 m in radius; Melosh & Ryan 1997). We study collisions of Ceres-sized objects, whose size is just about on the verge of gravity domination, and we expect more realistic results from including solid-body dynamics as put forth in Sect. 2.
In this section, we present first results from these higher resolution simulations performed with the CUDA code presented here.
5.1. Scenario setup
The collision scenarios include two spherical objects composed of basaltic rock and water ice. Target and projectile have one Ceres mass (9.43 × 10^{20} kg) each. The target has a basaltic core with a 30 wt% water ice mantle, while the projectile consists of solid basalt. Owing to the different densities, this configuration results in target and projectile radii of R_{T} = 509 km and R_{P} = 437 km, respectively. Upon contact, the mutual escape velocity of the two bodies is v_{esc} = 516 m s^{−1}. Neither body rotates at the start of the simulation.
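The quoted mutual escape velocity can be verified directly from the masses and radii above, assuming the standard two-body form v_{esc} = [2G(M_{T} + M_{P})/(R_{T} + R_{P})]^{1/2} at contact:

```python
import math

G = 6.674e-11            # gravitational constant [m^3 kg^-1 s^-2]
M_ceres = 9.43e20        # mass of target and of projectile [kg]
R_T, R_P = 509e3, 437e3  # target and projectile radii [m]

# mutual escape velocity when the two surfaces touch
v_esc = math.sqrt(2.0 * G * 2.0 * M_ceres / (R_T + R_P))
print(round(v_esc))      # -> 516
```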
Table 1. Tillotson equation of state parameters, shear modulus μ, and yield stress Y_{0} in SI units (Benz & Asphaug 1999).
For modeling the materials, we use the Tillotson equation of state (see Sect. 2.5.3) with the material parameters listed in Benz & Asphaug (1999) and our Table 1. The damage model (cf. Sect. 3.2) uses Weibull parameters measured for basalt by Nakamura et al. (2007); for water ice we adopt the values given in Lange et al. (1984); see Table 2.
Here we present results for a specific hit-and-run collision simulated with a resolution of 200 000 SPH particles. The projectile and target start 11 369 km apart with an initial impact parameter of b_{0} = 819 km and a relative velocity of 744 m/s. While the individual bodies' SPH particle distributions were relaxed and in equilibrium before the simulation, we chose this relatively large initial separation of the bodies to allow for a realistic modeling of tidal effects during their mutual approach. Under the influence of the initial momenta and self-gravity, the actual collision happens 234 min into the simulation. The impact velocity is v_{i} = 1.7 v_{esc} and the impact angle is α = 48°; the latter is measured such that a head-on collision corresponds to α = 0, and Fig. 9 illustrates the collision geometry. These impact characteristics lie well within the hit-and-run domain of the collision outcome map (see, e.g., Leinhardt & Stewart 2012); hence we expect two major survivors that escape each other. In order to clearly identify the surviving bodies and to cover the complete collision process, we simulate 2000 min of system time.
5.2. Results
Table 2. Weibull distribution parameters for basalt and water ice.
The simulation results in two surviving bodies of about equal mass that escape each other, as expected for a hit-and-run scenario. Figure 10 shows snapshots from the collision until about 86 min after it, a time when the two bodies are clearly separated again but still connected by a debris bridge. At the collision site, the icy mantle of the target is stripped away, exposing the rocky core. It is also notable that the initially dry projectile accretes some of the water originally in the target's mantle. The spread-out debris is either re-accreted by one of the two major survivors or forms fragments.
After the simulated timespan of 2000 min, the two big objects reach a mutual distance of more than 60 000 km. At this point, we identify the surviving fragments F_{x} using a friends-of-friends algorithm with the smoothing length as the linking length and find a sharp decrease in mass beyond the two main survivors F_{1} and F_{2}. In total there are 2503 fragments, most of them consisting of one to a few SPH particles. For F_{k} to qualify as a “significant fragment”, we require that it consist of at least 20 SPH particles. Table 3 lists the six surviving significant fragments and illustrates the sharp decline in size beyond the two major survivors. Since icy and rocky SPH particles differ in mass by a factor of about three, the number of particles n_{k} in a fragment is not necessarily proportional to its mass. Beyond the two main bodies, the clumps consist of just barely over 20 SPH particles, so their properties are not well defined.
Given the F_{1}–F_{2} mutual escape velocity of 64 m s^{−1} at their final separation, the collision outcome is clearly hit-and-run. For the smaller fragments, we determined the escape velocities v_{esc,k} at their distances r_{k} from the system's barycenter via v_{esc,k} = [2 G M_{r}/r_{k}]^{1/2}, (74) where M_{r} is the aggregated mass of all fragments that are closer to the barycenter than r_{k}. For all F_{3...6}, the values of v_{esc,k} (between 5.2 and 6.7 m s^{−1}) are much smaller than v_{k}, so that all of these fragments escape. Further analysis (details not shown) reveals, however, that the two main survivors accrete some of the smaller fragments: F_{1} accretes F_{3}, F_{2} accretes F_{5}, and F_{4} and F_{6} escape to space.
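Equation (74) translates into a short helper, assuming the standard form v_{esc,k} = [2GM_{r}/r_{k}]^{1/2}; the fragment-list layout is illustrative:

```python
import math

G = 6.674e-11  # gravitational constant [m^3 kg^-1 s^-2]

def v_esc_k(r_k, fragments):
    """Escape velocity at barycentric distance r_k (Eq. (74)).
    `fragments` is a list of (r, m) pairs; M_r aggregates the masses of
    all fragments closer to the barycenter than r_k."""
    M_r = sum(m for r, m in fragments if r < r_k)
    return math.sqrt(2.0 * G * M_r / r_k)
```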
Table 3. Significant fragments F_{k} after the hit-and-run collision, with distances from the center of mass r_{k}, numbers of SPH particles n_{k}, barycentric velocities v_{k}, and water ice fractions w_{k} in wt%.
Fig. 11 Damage pattern after the impact (cf. the snapshots in Fig. 10). The wet body's icy mantle and outer basalt layers and the surface of the dry body are completely damaged immediately. For better visibility of the inner structure, we again cut the two Ceres-mass bodies along the z = 0 symmetry plane.
The small fragments F_{3...6} originate from ejected collision debris that accumulated into gravitationally bound clumps; hence, they consist of fully fractured material. The two main survivors' material strength suffers from the impact, but a significant part stays intact: F_{1} is 61% damaged, and the originally wet body F_{2} suffers 70% fracture. Compared to the situation right after the collision, illustrated in Fig. 11 (average damage of the two survivors of 64.5%), tidal forces and re-accretion do not significantly add to the degree of overall damage.
Concerning the water content, we observe a tendency of the smaller fragments to contain more water than the two main survivors; notably, the initially dry projectile acquires a 2 wt% water fraction after colliding with the wet target. In total, the surviving significant fragments have an aggregated mass of 1.87 × 10^{21} kg and hold 2.73 × 10^{20} kg of ice, which corresponds to a water fraction of 14.6 wt%.
In summary, the big picture of our results is consistent with the predictions of Leinhardt & Stewart (2012) (hit-and-run outcome), in particular the expected sharp decrease in fragment size beyond F_{2}. Also, the aggregated water fraction of the significant fragments is in accordance with the values obtained with our OpenMP CPU code, with which we simulated similar impacts at a resolution one order of magnitude lower (20 000 SPH particles; Maindl et al. 2014).
5.3. Performance
We presented a v_{i} = 1.7 v_{esc}, α = 48° collision of two Ceres-mass bodies. Using a resolution of 200 000 SPH particles, it takes 299 h of wall time to simulate 2000 min of the process, consisting of the mutual approach, the collision, and the subsequent escape of the survivors. The test was run on an NVIDIA GeForce GTX 770 GPU.
Previously (Maindl et al. 2014), we simulated similar collisions with our OpenMP CPU code at a resolution of only 20 000 SPH particles and experienced run times of the same order of magnitude for comparable collisions: 450 h of wall time for a v_{i} = 1.34 v_{esc}, α = 48° collision on a dual Intel Xeon E5410 node, and 249 h of wall time for a v_{i} = 2.13 v_{esc}, α = 50° collision on a dual Intel Xeon E5440 node.
Given the rather similar computing power of the two mentioned CPUs, and given that both the v_{i} = 1.34 v_{esc} and the v_{i} = 2.13 v_{esc} scenarios result in a hit-and-run outcome, the obvious difference in run time might be surprising. It originates from the slower scenario, which is closer to the merging domain, showing more complex behavior before the two major survivors actually separate.
Even considering these variations from one particular scenario to another, we see a clear case for GPU-based code development, as we achieve a tenfold increase in resolution at about equal processing time when comparing the CPU and GPU codes.
6. Results
We have successfully developed a new SPH code that makes use of modern NVIDIA GPUs and includes various numerical models. The code uses libconfig (Lindner 2015) and requires CUDA™ 5.0 or higher (the simulations for this work were performed with CUDA 7.5). We have run the code successfully on various NVIDIA consumer GPUs, such as the GeForce GTX 570, GeForce GTX 680, GeForce GTX 770, GeForce GTX Titan, and GeForce GTX TITAN X, and on NVIDIA high-performance computing devices such as the Tesla C2050/C2070, Tesla K40, and Tesla K80.
6.1. Experience with CUDA simulations
One disadvantage of CUDA is that it is not obvious how many threads should be used to launch a certain kernel. Often, it is necessary to adapt this number to the physical problem or the number of particles; moreover, the optimal values differ between GPUs. Hence, it is sometimes worthwhile to try different thread numbers for several kernels before starting a long simulation. In our code, the number of threads per kernel can be specified for the most important kernels at compile time.
6.2. Runtime comparison: speed increase
Figure 12 shows the main motivation for writing a CUDA port of our SPH code: the performance gain of the implemented algorithms on the GPU compared to a single CPU core.
Fig. 12 Runtime: NVIDIA GTX Titan versus a single core of an Intel i7-4779. Comparison of the runtimes of typical substeps of the SPH and N-body algorithms for the new CUDA code and for the OpenMP code running on one CPU core. See Sect. 6.2 for the description of the substeps.
We compared the averaged time per substep for different test cases and particle numbers. Of course, these numbers change with different particle numbers and applications. However, we achieve a tremendous increase in speed for the two most time-consuming processes in a SPH simulation including self-gravity: the neighbor search, which determines the interaction partners of all particles, and the calculation of the gravitational forces with the Barnes-Hut algorithm; together, these two processes account for more than 80% of the total computation time. The subprocesses shown in the plot comprise the following steps: first, the neighbor search determines the interaction partners of all particles; second, the internal forces step calculates all SPH sums and the accelerations from the equation of momentum conservation; third, self-gravity includes all computations that are necessary to determine the gravitational forces between the particles; fourth, the total right-hand side comprises all computations that are necessary to determine the accelerations of all particles.
Fig. 13 Scaling of the different SPH substeps on the NVIDIA Tesla K40 – Boss-Bodenheimer test case.
Figure 14 shows the scaling of the code on different GPUs for different numbers of SPH particles. The time is normalized to the time used on the NVIDIA Tesla K40 with the device error-correcting code (ECC) enabled. Interestingly, the new consumer devices (Maxwell generation) are as fast as the Kepler-generation HPC hardware. However, the consumer GPUs do not provide an ECC mode. The more sophisticated HPC hardware enables the use of ECC at the cost of 20–30% of performance. We have not found any differences in the simulation results that would demand the use of automatic error correction. Since we perform all calculations on the GPU and transfer data to the host only for input/output procedures, other features of the HPC hardware, such as faster bidirectional transfer between host and device, are not relevant for our application.
We additionally analyzed the individual scaling of each subprocess of the numerical scheme. Figure 13 shows the scaling on the NVIDIA Tesla K40 for different numbers of SPH particles for the simulation of the collapse of the molecular cloud. The scheme for the calculation of the self-gravity of the particle distribution and the scheme for building the Barnes-Hut tree show the best scaling.
Fig. 14 Scaling of the code on different GPUs – Boss-Bodenheimer test case.
The increase in speed of the new CUDA version compared to our OpenMP implementation is shown in Fig. 15. We ran the Prater impact experiment (see Sect. 4.3) with both codes for different numbers of SPH particles. The two codes show similar scaling behavior with the number of particles, with a constant speedup of the GPU version. For better comparison, we intentionally used only one core for the OpenMP runs.
Fig. 15 Scaling of the code compared to the serial CPU version; impact experiment test case. 
In our first astrophysical application of the code, we were able to increase the resolution by a factor of ten at about equal processing time (see Sect. 5.3).
Glossary of symbols.
6.3. Numerical limitations
Modern GPUs currently offer fast memory of up to 12 gigabytes (e.g., Tesla K40 and K80, GeForce Titan X) and up to 2880 streaming processors. The size of the memory is the limiting factor for our code. The memory usage depends strongly on the physical model that is applied: depending on the specific model and dimension, the code holds different information for each particle in memory. For example, the simplest hydro model only demands the locations of the particles, their velocities, and their masses, that is, eight variables in double precision per particle in 3D and only six in 2D. For the simulation of solid bodies, the number of variables per particle increases, since the density, the internal energy, and the deviatoric stress tensor have to be kept available. We additionally store the tree and the information about the interactions in memory. The most expensive model in terms of memory consumption is the damage model, since we need to save the activation thresholds of each individual particle, so that the number of required values per particle grows considerably; the mean number of activation flaws per particle is given by log N_{p}, where N_{p} denotes the total number of SPH particles. Additionally, the memory demand is related to the choice of the integrator: the Runge-Kutta integrator requires more substeps than the predictor-corrector scheme and has to store the additional derivatives at the times of the substeps in memory. We allocated about four gigabytes of memory on the GeForce Titan X for the simulation most demanding in terms of memory consumption, the Nakamura experiment in Sect. 4.4 with 0.52 million particles. As can be seen in Fig. 12, the most expensive interaction in terms of computation time is self-gravity; the wall clock time of all simulations including self-gravity is dominated by the tree algorithm.
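The memory budget can be estimated with a back-of-the-envelope helper; the per-particle variable counts below are illustrative, not the code's exact layout:

```python
import math

def particle_memory_bytes(n_particles, n_vars, with_flaws=False):
    """Rough per-particle memory: n_vars doubles each, plus on average
    about log(N_p) activation-flaw thresholds when the damage model is on."""
    per_particle = 8.0 * n_vars
    if with_flaws:
        per_particle += 8.0 * math.log(n_particles)
    return n_particles * per_particle

# simplest 3D hydro model: eight doubles per particle
hydro = particle_memory_bytes(2_000_000, 8)
# solid body plus damage model: more variables (count assumed) plus flaws
solid = particle_memory_bytes(520_000, 30, with_flaws=True)
```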
For some applications it might be helpful to decouple the hydro timescale from the gravitational timescale, since the code resolves shock waves in the solid bodies while the locations of the particles hardly change. This allows reuse of already computed pseudo-particles for the calculation of the accelerations due to self-gravity, as long as the tree properties remain unchanged. We added a command line switch for self-gravity simulations that enables this feature; however, we have not used the decoupling for the test simulations presented here.
6.4. Limitations in terms of physical models
As pointed out by Holsapple (2009), the von Mises yield criterion does not adequately describe the behavior of geologic materials. Fully damaged material after an impact or a collision should not be treated as a pure fluid, since it behaves more like a granular medium. Hence, the post-impact material flow cannot be modeled realistically with the von Mises criterion; this has already been investigated by Jutzi (2015). To overcome this limitation, we will implement a proper material model for granular media. Another limitation is the lack of a porosity model. Numerous astrophysical objects, such as comets and asteroids, show high porosities. We would like to apply our newly developed code to collisions between porous objects and plan to include two porosity models in the code: a strain-based model (Wünnemann et al. 2006; Collins et al. 2011) and a pressure-based model (Jutzi et al. 2008). In the current version of the code, the yield criterion does not depend on temperature, and the change of temperature due to plastic deformation is not calculated. However, these features have already been implemented in the CPU version of the code and will be migrated to the CUDA version soon. This will add more material models to the code, such as the Johnson-Cook plasticity model, where the yield strength depends on temperature, plastic strain, and strain rate.
7. Conclusions and future work
We have presented a new SPH code for NVIDIA GPUs using the CUDA framework. The code can be applied to various problems in astrophysics and geophysics. It may be used to model hydrodynamics and solid bodies, including ductile and brittle materials such as aluminium or basalt. Additionally, the code implements a Barnes-Hut tree that allows for the treatment of self-gravity. The code is freely available to the astrophysical and geophysical community upon request by email to the authors.
One main advantage of the CUDA implementation is the enormous increase in speed compared to our existing OpenMP SPH code. The performance of the CUDA code allows for high spatial resolution without the need for cluster hardware. This may enable researchers without access to HPC hardware to perform sophisticated SPH simulations on workstations. Although the code runs faster and more efficiently on NVIDIA's HPC GPUs such as the Kepler series, a large speedup over the CPU version is already achieved with cheaper consumer graphics cards. We plan to include more physics in the CUDA version of our SPH code: we are currently adding a porosity model and a more refined plasticity model. One major future step will be to add support for more than one GPU.
Acknowledgments
We thank the anonymous referee for the thorough review, positive comments, constructive remarks, and suggestions on the first manuscript. Christoph Schäfer wants to thank Daniel Thun for various helpful comments about the mysteries of CUDA. Most plots in this publication were made with matplotlib (Hunter 2007). Thomas Maindl appreciates support by the FWF Austrian Science Fund project S 11603-N16. Xeon is a trademark of Intel Corporation. GeForce, Tesla, and CUDA are trademarks of NVIDIA Corporation.
References
Abramowitz, M., & Stegun, I. 1964, Handbook of Mathematical Functions: With Formulas, Graphs, and Mathematical Tables, Applied Mathematics Series (Dover Publications)
Barnes, J., & Hut, P. 1986, Nature, 324, 446
Barnes, J. E., & Hut, P. 1989, ApJS, 70, 389
Bate, M. R., & Burkert, A. 1997, MNRAS, 288, 1060
Bédorf, J., Gaburov, E., & Portegies Zwart, S. 2012, J. Comput. Phys., 231, 2825
Benz, W. 1990, in Numerical Modelling of Nonlinear Stellar Pulsations: Problems and Prospects, ed. J. R. Buchler, 269
Benz, W., & Asphaug, E. 1994, Icarus, 107, 98
Benz, W., & Asphaug, E. 1995, Comput. Phys. Comm., 87, 253
Benz, W., & Asphaug, E. 1999, Icarus, 142, 5
Boss, A. P., & Bodenheimer, P. 1979, ApJ, 234, 289
Burtscher, M. 2011, GPU Computing Gems Emerald Edition, ed. W. Hwu
Butcher, J. C. 1987, The Numerical Analysis of Ordinary Differential Equations: Runge-Kutta and General Linear Methods (New York: Wiley-Interscience)
Canup, R. M., Barr, A. C., & Crawford, D. A. 2013, Icarus, 222, 200
Collins, G., Melosh, H., & Wünnemann, K. 2011, Int. J. Impact Eng., 38, 434
Dvorak, R., Eggl, S., Süli, Á., et al. 2012, in AIP Conf. Ser. 1468, eds. M. Robnik, & V. G. Romanovski, 137
Elbeshausen, D., Wünnemann, K., & Collins, G. S. 2009, Icarus, 204, 716
Gingold, R. A., & Monaghan, J. J. 1977, MNRAS, 181, 375
Grady, D. E., & Kipp, M. E. 1980, Int. J. Rock Mech. Min. Sci. Geomech. Abstr., 17, 147
Gray, J. P., Monaghan, J. J., & Swift, R. P. 2001, Comp. Methods Appl. Mech. Engrg., 190, 6641
Haghighipour, N., & Raymond, S. N. 2007, ApJ, 666, 436
Hernquist, L., & Katz, N. 1989, ApJS, 70, 419
Holsapple, K. A. 2009, Planet. Space Sci., 57, 127
Hunter, J. D. 2007, Comput. Sci. Eng., 9, 90
Izidoro, A., de Souza Torres, K., Winter, O. C., & Haghighipour, N. 2013, ApJ, 767, 54
Jutzi, M. 2015, Planet. Space Sci., 107, 3
Jutzi, M., Benz, W., & Michel, P. 2008, Icarus, 198, 242
Kulikov, I. 2014, ApJS, 214, 12
Lange, M. A., Ahrens, T. J., & Boslough, M. B. 1984, Icarus, 58, 383
Leinhardt, Z. M., & Stewart, S. T. 2012, ApJ, 745, 79
Libersky, L. D., & Petschek, A. 1991, in Advances in the Free-Lagrange Method Including Contributions on Adaptive Gridding and the Smooth Particle Hydrodynamics Method, eds. H. Trease, M. Fritts, & W. Crowley, Lect. Notes Phys., 395, 248
Libersky, L. D., Randles, P. W., Carney, T. C., & Dickinson, D. L. 1997, Int. J. Impact Eng., 20, 525
Lindner, M. 2015, http://www.hyperrealm.com/libconfig
Lucy, L. B. 1977, AJ, 82, 1013
Maindl, T. I., Schäfer, C., Speith, R., et al. 2013, Astron. Nachr., 334, 996
Maindl, T. I., Dvorak, R., Schäfer, C., & Speith, R. 2014, in IAU Symp., 310, 138
Marcus, R. A., Sasselov, D., Stewart, S. T., & Hernquist, L. 2010, ApJ, 719, L45
McGlaun, J., Thompson, S., & Elrick, M. 1990, Int. J. Impact Eng., 10, 351
Melosh, H. 1996, Impact Cratering: A Geologic Process, Oxford Monographs on Geology and Geophysics (Oxford University Press)
Melosh, H. J., & Ryan, E. V. 1997, Icarus, 129, 562
Monaghan, J. J. 1989, J. Comput. Phys., 82, 1
Monaghan, J. J. 1992, ARA&A, 30, 543
Monaghan, J. J. 2000, J. Comput. Phys., 159, 290
Monaghan, J. J., & Gingold, R. A. 1983, J. Comput. Phys., 52, 374
Monaghan, J., & Lattanzio, J. 1985, A&A, 149, 135
Nakamura, A., & Fujiwara, A. 1991, Icarus, 92, 132
Nakamura, A. M., Michel, P., & Setoh, M. 2007, J. Geophys. Res., 112, 2001
Pierazzo, E., Artemieva, N., Asphaug, E., et al. 2008, Met. Planet. Sci., 43, 1917
Prater, R. 1970, Hypervelocity Impact: Material Strength Effects on Crater Formation and Shock Propagation in Three Aluminum Alloys (Air Force Institute of Technology)
Randles, P. W., & Libersky, L. D. 1996, Comp. Methods Appl. Mech. Engrg., 139, 375
Raymond, S. N., Quinn, T., & Lunine, J. I. 2004, Icarus, 168, 1
Schäfer, C., Speith, R., & Kley, W. 2007, A&A, 470, 733
Speith, R. 1998, Ph.D. Thesis, Eberhard Karls Universität Tübingen
Springel, V. 2005, MNRAS, 364, 1105
Swegle, J. W. 1992, Memo, Sandia National Laboratory
Tillotson, J. H. 1962, Metallic Equations of State for Hypervelocity Impact, Tech. Rep. GA-3216, General Dynamics, San Diego, CA
Tingxing, D., Dobrev, V., Kolev, T., et al. 2014, in Parallel and Distributed Processing Symposium, IEEE 28th International
von Mises, R. 1913, Göttin. Nachr. Math. Phys., 1, 582
Weibull, W. A. 1939, Ingvetensk. Akad. Handl., 151, 5
Wünnemann, K., Collins, G. S., & Melosh, H. J. 2006, Icarus, 180, 514
All Tables
Tillotson equation of state parameters, shear modulus μ, and yield stress Y_{0} in SI units (Benz & Asphaug 1999).
Significant fragments F_{k} after the hit-and-run collision with distances from the center of mass r_{k}, number of SPH particles n_{k}, barycentric velocities v_{k}, and wt% of water ice w_{k}, respectively.
All Figures
Fig. 1 Colliding rubber cylinders with artificial stress; the absolute value of the velocity is color-coded. The use of an additional artificial stress prevents the fracture of the colliding rubber cylinders. The moment of first contact between the two cylinders is at t = 0.
Fig. 2 Colliding rubber cylinders without artificial stress; the absolute value of the velocity is color-coded. Without an additional artificial stress, the colliding rubber cylinders break up. The time of first contact between the two cylinders is t = 0.
Fig. 3 Gravitational collapse; the value of the density in the midplane is color-coded. The density in the midplane for z = 0 is shown at t = 0.998 t_{f}. To generate the plot, the 2 × 10^{6} SPH particles were mapped onto a 128 × 128 × 128 cubic grid.
Fig. 4 Gravitational collapse of a molecular cloud. Evolution of the maximum density for different setups and smoothing schemes of the gravitational force. The two dots are values by Boss and Bodenheimer from simulations with grid codes.
Fig. 5 Impact into Al6061 simulation. The surface of the cube with the impact crater after 100 μs is shown. The size of the cube is 10 × 10 × 10 cm. The impactor had a radius of 6.35 mm and the impact speed was 7000 m s^{-1}.
Fig. 6 Crater dimensions of the Al6061 impact simulation. The plot shows the color-coded density in a slice through the cube at y = 0, with the cube edge running from -5 cm to 5 cm, after 100 μs.
Fig. 7 Damage pattern in the basalt target 50 μs after the impact of the nylon bullet. The impact point is at the upper left surface of the spherical target. The diameter of the target is 6 cm. The color code denotes the damage of the particles, and the target was split in half to reveal the damage pattern inside. Red particles are fully damaged. The intact core, indicated by the blue particles, incorporates about 30.4% of the original target mass.
Fig. 8 Cumulative fragment mass spectrum at the end of the impact simulation, calculated with a friend-of-friend algorithm 50 μs after the impact.
Fig. 9 Collision geometry. The collision angle α is defined such that α = 0 for a head-on impact. Projectile and target radii are R_{P} and R_{T}, respectively; b is the impact parameter.
Fig. 10 Collision snapshots. In the top right corner we indicate the elapsed simulation time for each frame. The collision itself occurs at a time stamp of 234 min. For better visibility of the inside structure we cut open the two Ceres-mass bodies along the z = 0 symmetry plane. Basalt is plotted in red, water ice in blue.
Fig. 11 Damage pattern after the impact (cf. the snapshots in Fig. 10). The wet body's icy mantle and outer basalt layers and the surface of the dry body are completely damaged immediately. For better visibility of the inside structure, we again cut the two Ceres-mass bodies along the z = 0 symmetry plane.
Fig. 12 Runtime: NVIDIA GTX Titan versus a single core of an Intel i7-4779. Comparison between the runtime of typical substeps of the SPH and N-body algorithms for the new CUDA code and the OpenMP code running on one CPU core. See Sect. 6.2 for the description of the substeps.
Fig. 13 Scaling of the different SPH substeps on the NVIDIA Tesla K40 – Boss-Bodenheimer test case.
Fig. 14 Scaling of the code on different GPUs – Boss-Bodenheimer test case.
Fig. 15 Scaling of the code compared to the serial CPU version; impact experiment test case.