Open Access
Volume 664, August 2022
Article Number A68
Number of page(s) 18
Section Numerical methods and codes
Published online 09 August 2022

© J. Nättilä 2022

Licence: Creative Commons. Open Access article, published by EDP Sciences, under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This article is published in open access under the Subscribe-to-Open model. Subscribe to A&A to support open access publication.

1 Introduction

Ever since the introduction of computers, numerical simulations have been used to study the nonlinear behavior of plasma (see, e.g., Buneman 1959; Dawson 1962, 1964). In the early days, the research was mainly motivated by studies of basic plasma instabilities and confinement in fusion experiments, but the method of computationally solving the dynamics of charged particles also quickly gained popularity for understanding plasmas in space (see, e.g., Tanaka 1983; Langdon et al. 1988; Buneman et al. 1992).

An important factor in accelerating the use of plasma simulations is the ever-increasing computational speed and number of processors. We have recently arrived in the petaflop (10¹⁵ floating-point operations per second, FLOPS) supercomputing era, and have started to pave our way toward exascale computing systems (10¹⁸ FLOPS). This technological surge (and the ingenuity of the researchers themselves) has enabled us to shed light on many longstanding issues in high-energy astrophysics, including shocks (e.g., Frederiksen et al. 2004; Spitkovsky 2005), reconnection (e.g., Zenitani & Hoshino 2001; Cerutti et al. 2012; Kagan et al. 2013; Sironi & Spitkovsky 2014; Werner et al. 2015), pulsar magnetospheres (e.g., Philippov & Spitkovsky 2014; Chen & Beloborodov 2014; Cerutti et al. 2015), kinetic turbulence (e.g., Howes et al. 2008; Servidio et al. 2014; Zhdankin et al. 2017b; Comisso & Sironi 2018; Nättilä & Beloborodov 2021), and nonlinear wave interactions (e.g., Nättilä & Beloborodov 2022).

To keep riding this technological wave, successful numerical codes need to be highly scalable and modern enough to take advantage of the newest supercomputing architectures. Such exascale-ready features include the avoidance of global communication patterns and support for heterogeneous computing nodes with varying resources, such as a varying number of cores/accelerators or a varying amount of memory per core (see also Nordlund et al. 2018, for a discussion). The RUNKO framework presented here was developed from scratch to support these features.

An additional feature of plasma simulations is that there are various theoretical frameworks for modeling the dynamics of the system, whereas the underlying numerical algorithms all rely on techniques based on computational particles and multidimensional grids. The fully kinetic description is based on sampling the distribution function with superparticles and representing the electromagnetic fields on a grid, known as the particle-in-cell (PIC) approach (Hockney & Eastwood 1981; Birdsall & Langdon 1985)1. In the case of a strong (background) magnetic field, it is possible to average over the gyrorotation of the particle's gyro-orbit and use the gyrokinetic description where, instead of particles, computational “rings” are considered (see, e.g., Howes et al. 2006). Alternatively, the kinetic systems can be significantly simplified by considering only the moments of the distribution function; this macro-scale description is related to the magnetohydrodynamic (MHD) models (see, e.g., Brandenburg 2003). The MHD description itself can be divided into multiple categories with different closure approximations (e.g., ideal, visco-resistive) or with the evolution of multiple fluid species (known as the multifluid description). There is also a class of hybrid methods that combine kinetic and magnetohydrodynamic models, for example by treating electrons with a fluid description and protons with a kinetic description (see, e.g., Winske et al. 2003; Valentini et al. 2007; Haggerty & Caprioli 2019). From the numerical perspective, all of these methods, in the end, only require the efficient manipulation of particles and/or physical quantities stored on grids.

It is possible to take advantage of this structural similarity of the various formulations with well-designed software abstractions. This is the second core idea of the RUNKO simulation framework, which in essence is just a collection of data structures for storing, manipulating, and communicating particles and grids efficiently. The framework relies on modern computer science methods and paradigms that provide a sufficient degree of problem generalization to enable the easy introduction of physics knowledge into the simulations. Technically, this is achieved with polymorphic multiple inheritance, as supported by modern C++. The concept of object inheritance also nicely reflects the mathematical way of constructing hierarchical models in which each model builds on the previous one. Another level of abstraction is obtained by using template metaprogramming, which allows, for example, the parameterization of the dimensionality of the system, again closely reflecting the mathematical form of the problem. Finally, in order to balance the computational efficiency and the user-friendliness of the code, the fast low-level C++ classes are exposed to the high-level PYTHON3 scripting language.

The complete framework is provided as open-source software for the community. It is available from a GitHub repository2, and the code design and structure are presented in detail in this article. Here, we also present the PIC code of the framework and describe the implementation of the solvers needed to operate it. First, the theoretical background of the kinetic modeling of plasma is briefly reviewed in Sect. 2. The numerical implementation of the framework is then discussed in Sect. 3. As an example, the use of the new PIC code is demonstrated on astrophysical plasma problems in Sect. 4. A discussion of new features and the future directions of the framework is given in Sect. 5. Finally, the content is summarized in Sect. 6.

2 Theory

2.1 Kinetic Plasma Theory

The kinetic plasma theory we adopt here is based on the special relativistic formulation of the Vlasov–Boltzmann equation. The spatial coordinate location vector is given by x ≡ (x, y, z) and coordinate time is measured by t. Coordinate velocity (three-velocity) is given by υ ≡ dtx and the individual Cartesian components of the velocity are denoted as υ ≡ (υx, υy, υz). Proper time τ (measured with a co-moving clock) is connected to the coordinate time via the Lorentz factor γ ≡ dt/dτ. The proper velocity (spatial components of the four-velocity) is u ≡ dτx = γυ. The Lorentz factor and the velocities are connected by the expression

γ = 1/√(1 − υ²/c²) = √(1 + u²/c²), (1)

where c is the speed of light, u = |u|, and υ = |υ|. Acceleration is denoted with a ≡ dtu.
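For concreteness, the relations in Eq. (1) can be checked numerically. The following is an illustrative stand-alone sketch (not part of RUNKO), using units where c = 1 by default:

```python
import math

def lorentz_factor_from_v(v, c=1.0):
    """Lorentz factor from the coordinate velocity |v| (Eq. 1, first form)."""
    return 1.0 / math.sqrt(1.0 - (v / c) ** 2)

def lorentz_factor_from_u(u, c=1.0):
    """Lorentz factor from the proper velocity |u| (Eq. 1, second form)."""
    return math.sqrt(1.0 + (u / c) ** 2)

# The two forms agree when u = gamma * v:
v = 0.8                        # coordinate velocity in units of c
g = lorentz_factor_from_v(v)   # 1/sqrt(1 - 0.64) = 5/3
u = g * v                      # proper velocity u = gamma * v
assert abs(lorentz_factor_from_u(u) - g) < 1e-12
```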

A six-dimensional phase-space density distribution for particle species s is given by fs ≡ fs(x, u; t). Thus, fs d³x d³u is the number of particles in the six-dimensional differential phase-space volume between x, u and x + dx, u + du.

The evolution of fs is governed by the relativistic Boltzmann–Vlasov equation

∂t fs + υ · ∇x fs + a · ∇u fs = C, (2)

where ∇x and ∇u are the spatial and momentum parts of the differential operator ∇, respectively. The term on the right-hand side, defined as C ≡ ∂t fs |coll, is the collision operator. For the Vlasov equation C = 0 (i.e., the plasma is collisionless). The acceleration as of a charged particle is governed by the Lorentz force

as = (qs/ms)(E + υ/c × B), (3)

where E and B are the electric and magnetic fields, qs is the charge, and ms is the mass of the particle of species s.

The moments of the distribution function define the macroscopic (bulk) quantities of the plasma. The zeroth moment of the distribution function fs defines the charge density of species s as

ρs = qs ∫ fs d³u. (4)

The total charge density is ρ = ∑s ρs. The first moment defines the current (charge flux) vector as

Js = qs ∫ υ fs d³u. (5)

The total current is J = ∑s Js.
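The moment integrals of Eqs. (4) and (5) can be illustrated with a small numerical sketch. The distribution below is an arbitrary drifting one-dimensional example chosen for illustration, not a setup from this paper:

```python
import numpy as np

# Hypothetical 1D1V example: charge and current density as moments of f(u).
q = -1.0                               # electron-like charge, arbitrary units
u = np.linspace(-10.0, 10.0, 2001)     # proper-velocity grid (units of c)
gamma = np.sqrt(1.0 + u**2)            # Lorentz factor per grid point (Eq. 1)
f = np.exp(-(u - 0.5)**2)              # drifting, Maxwellian-like f(u)

du = u[1] - u[0]
rho = q * f.sum() * du                 # zeroth moment: charge density (Eq. 4)
J = q * ((u / gamma) * f).sum() * du   # first moment, with v = u/gamma (Eq. 5)

assert rho < 0.0                       # negative charge carriers
assert J / rho > 0.0                   # mean velocity points along the drift
```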

2.2 Maxwell’s Equations

Evolution of the electric field E and magnetic field B is governed by Maxwell’s equations. These are Gauss’s law

∇ · E = 4πρ, (6)

Gauss’s law for magnetism

∇ · B = 0, (7)

Ampère’s law

∂t E = c ∇ × B − 4πJ, (8)

and Faraday’s law

∂t B = −c ∇ × E. (9)

Charge conservation follows from these equations by taking the divergence of Eq. (8) and substituting Eq. (6) to get

∂t ρ + ∇ · J = 0. (10)

2.3 Dimensionless Equations

We now describe the actual dimensionless equations that are solved numerically. A unit system similar to that used in TRISTAN and TRISTAN-MP (Buneman et al. 1993; Spitkovsky 2005; Sironi et al. 2013) is used. The derivation and a more thorough discussion of this unit system are given in Appendix A.

Many quantities, such as the electromagnetic fields, are defined on a lattice (mesh). The discrete location of a point in this case is given as x(i, j, k) = (i∆x, j∆y, k∆z), where i, j, k are the grid indices and ∆x, ∆y, and ∆z are the grid point distances in each Cartesian dimension. Similarly, discretized time is given as t(n) = n∆t. Cells of the lattice are taken to have a fixed cube geometry: ∆x = ∆y = ∆z. The Courant (or Courant–Friedrichs–Lewy; Courant et al. 1928) number is defined as

ĉ ≡ c∆t/∆x, (11)

which represents the maximum numerical velocity of a signal traveling on the grid. For explicit time integration schemes ĉ ≤ 1.

Electromagnetic fields are normalized with B0 as Ê = E/B0 and B̂ = B/B0. Similarly, the current density is normalized with J0 as ĵ = J/J0. The values of these normalization factors are selected such that the numerically solved equations appear as if ∆x = ∆t = 1. This also means that the grid indices i, j, and k have the same numerical value as the location x.

2.3.1 Electromagnetic Field Module

The time evolution of the electromagnetic fields is handled with a finite-difference time-domain (FDTD) method. A Yee lattice (Yee 1966) is used for the E and B fields such that they are staggered both in space and in time,

Êx(i+½, j, k; n), Êy(i, j+½, k; n), Êz(i, j, k+½; n), (12)
B̂x(i, j+½, k+½; n+½), B̂y(i+½, j, k+½; n+½), B̂z(i+½, j+½, k; n+½), (13)

where Ê is located in the middle of the cell edges and B̂ in the center of the cell faces. This makes it easy to calculate the curl of the fields in the subsequent equations.

In the time domain we update the Ê and B̂ fields with discrete forms of Eqs. (8) and (9), given as

∆[Ê]t = ĉ ∆[B̂]x − ĵ, (14)

and

∆[B̂]t = −ĉ ∆[Ê]x, (15)

where ∆[Q]t,x is the time difference or curl of a variable Q without the ∆t or ∆x denominator. The only normalization factor entering these equations is the Courant velocity ĉ since everything else is absorbed into B0 and J0. There is no need to solve Eqs. (6) and (7) if a charge-conserving scheme is used together with the Yee staggering of the fields (see Appendix B for details).
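The structure of the update loop of Eqs. (14) and (15) can be sketched in one dimension. The snippet below is a minimal stand-alone vacuum illustration (ĵ = 0, periodic boundaries), not the actual RUNKO field solver:

```python
import numpy as np

# Minimal 1D vacuum Yee/FDTD sketch in code units (Delta x = Delta t = 1),
# mirroring Eqs. (14)-(15): only the Courant number chat enters the update.
nx, chat = 200, 0.5
x = np.arange(nx)
ey = np.exp(-((x - 50.0) / 10.0) ** 2)  # Gaussian pulse in E_y at cell centers
bz = ey.copy()                          # E_y = B_z gives a right-moving wave

for _ in range(100):
    # Faraday (Eq. 15): B_z lives at i+1/2; curl without the Delta x denominator
    bz -= chat * (np.roll(ey, -1) - ey)
    # Ampere (Eq. 14) in vacuum (j = 0): E_y lives at integer i
    ey -= chat * (bz - np.roll(bz, 1))

# After 100 steps the pulse peak has moved ~ chat * 100 = 50 cells to the right.
assert abs(int(np.argmax(ey)) - 100) <= 3
```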

2.3.2 Particle-in-cell Module

The main purpose of the PIC module is to update the particle velocities according to the Lorentz force. We express the spatial component of the four-velocity u in units of c. The particle species specifier s is omitted in this section for brevity.

The Lorentz force acting on a charged particle is

du/dt = (q/m)(E + υ/c × B). (16)

We simplify this expression again to make it appear unitless. We express the charge and mass as q̂ = q/qe and m̂ = m/me, where qe and me are the elementary charge and the electron mass. The right-hand side of the Lorentz force equation then takes the dimensionless form

(q̂/m̂)(Ê + (u/γ) × B̂). (17)

The actual velocity update is done in parts since Ê and B̂ are staggered in time. As an example of the particle momentum update, the relativistic Boris scheme (Boris 1970) is presented in detail in Appendix C. In addition to the particle velocity update routines, we need a scheme to interpolate the (staggered) Ê and B̂ fields to the arbitrary particle locations. This interpolation is typically implemented using a linear volume-averaging interpolation scheme, although higher-order schemes can also be utilized.

After the velocity update, we advance the particle’s location. The particle’s position x is advanced in time with

x^(n+1) = x^n + ∆t u^(n+½)/γ^(n+½). (18)

The described particle propagation scheme is second order in time. The velocity is stored in the middle of the time steps and updated as u^(n−½) → u^(n+½); the location is stored at integer time steps and updated as x^n → x^(n+1).
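The staggered velocity update can be sketched as follows. This is a schematic stand-alone version of a relativistic Boris step (with c = ∆t = 1 and a combined q/m factor), not the RUNKO implementation detailed in Appendix C:

```python
import numpy as np

# Hedged sketch of a relativistic Boris velocity update (cf. Boris 1970);
# the units (c = Delta t = 1) and the single qm factor are simplifications.
def boris_push(u, E, B, qm=1.0, dt=1.0):
    """Advance proper velocity u^{n-1/2} -> u^{n+1/2} in fields E, B."""
    h = 0.5 * qm * dt
    um = u + h * E                               # first half electric kick
    g = np.sqrt(1.0 + um @ um)                   # Lorentz factor from u_minus
    t = h * B / g                                # rotation vector
    s = 2.0 * t / (1.0 + t @ t)
    up = um + np.cross(um + np.cross(um, t), s)  # magnetic rotation
    return up + h * E                            # second half electric kick

u0 = np.array([1.0, 0.0, 0.0])
u1 = boris_push(u0, E=np.zeros(3), B=np.array([0.0, 0.0, 2.0]))
# A pure magnetic field rotates u but conserves |u| (and hence gamma).
assert abs(np.linalg.norm(u1) - np.linalg.norm(u0)) < 1e-12
```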

Finally, when a charged particle moves on the grid it creates a current ĵ that induces an electric field via Eq. (8). We use the charge-conserving ZigZag current deposition scheme to deposit the current from the computational particles onto the mesh (Umeda et al. 2003). This scheme is summarized in Appendix D.
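The key property of the ZigZag scheme, discrete charge conservation, can be demonstrated with a one-dimensional toy version (transverse weights omitted; this is a sketch, not the full scheme of Appendix D):

```python
import numpy as np

# 1D sketch of the charge-conserving ZigZag deposit (cf. Umeda et al. 2003),
# in code units Delta x = Delta t = 1.
def zigzag_1d(x1, x2, q, nx):
    """Deposit the current of one particle moving x1 -> x2 in one step."""
    jx = np.zeros(nx)                  # jx[i] sits at face i+1/2
    i1, i2 = int(np.floor(x1)), int(np.floor(x2))
    # relay point where the trajectory is split into two segments
    xr = min(min(i1, i2) + 1.0, max(max(i1, i2), 0.5 * (x1 + x2)))
    jx[i1] += q * (xr - x1)            # segment 1 flux through face i1+1/2
    jx[i2] += q * (x2 - xr)            # segment 2 flux through face i2+1/2
    return jx

def rho_cic(x, q, nx):
    """Cloud-in-cell (linear) charge deposit."""
    rho = np.zeros(nx)
    i, frac = int(np.floor(x)), x - np.floor(x)
    rho[i] += q * (1.0 - frac)
    rho[i + 1] += q * frac
    return rho

# Discrete continuity, d(rho)/dt + div(j) = 0, must hold cell by cell.
x1, x2, q, nx = 3.8, 4.3, 1.0, 8
jx = zigzag_1d(x1, x2, q, nx)
drho = rho_cic(x2, q, nx) - rho_cic(x1, q, nx)
div_j = jx - np.roll(jx, 1)            # (j_{i+1/2} - j_{i-1/2})
assert np.allclose(drho + div_j, 0.0)
```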

3 Numerical Implementation

In this section, we discuss the general design principles and structure of the framework itself. RUNKO is a hybrid C++/PYTHON code: low-level computationally intensive kernels are written in C++, whereas the high-level functionality is operated by PYTHON scripts. The design relies heavily on an object-oriented programming paradigm (supported naturally by both C++ and PYTHON), where data containers (attributes) and functions (methods) are grouped together into objects that interact with each other. In RUNKO we call a group of these objects modules (Sect. 3.2), and the objects themselves are called solvers (Sect. 3.3). The solvers are operated by different kinds of drivers (Sect. 3.4). This kind of hybrid use of C++ and PYTHON ensures a fast and well-optimized low-level code, whereas the PYTHON scripts allow for ease of use and straightforward extensibility of the code.

The low-level C++ kernels rely heavily on the template metaprogramming features of modern C++ (conforming currently to the C++14 standard3). In template metaprogramming, a code template is provided to the compiler, which then generates the actual source code from it, given some extra information on how to fill the template parameters4. Many of the implemented C++ kernels take advantage of this by having only a D-dimensional general template of the algorithm. This template of an algorithm is then specialized to the required spatial dimension (typically D = 1, 2, or 3) by the compiler at compile time. This translates to fewer bugs, as typically only one implementation of an algorithm is required.

The low-level C++ kernels are operated by user-friendly PYTHON scripts. Technically, the C++ kernels are exposed to PYTHON using the PYBIND11 library (Jakob et al. 2017). All the exposed C++ objects behave like native PYTHON objects. No unnecessary memory copies are performed because PYBIND11 manages memory pointers. After installing RUNKO these objects are available to PYTHON by importing them from the pyrunko package (and pycorgi, as described in Sect. 3.1). Not only can these classes be easily used from the PYTHON scripts, but they also support inheritance in case the user wants to extend and modify the objects. This allows for fast prototyping, as new classes can first be developed in PYTHON before implementing them in C++ (if better performance is needed). These PYTHON scripts also leverage NUMPY (van der Walt et al. 2011) and SCIPY for performance and convenience, together with MATPLOTLIB (Hunter 2007) for visualization.

3.1 Grid Infrastructure

RUNKO is built on top of the massively parallel CORGI5 C++ template library (Nättilä, in prep.). CORGI is a modern parallelized grid infrastructure library that provides the (spatial) simulation grid. It is designed to be run on distributed memory systems (i.e., computing clusters and supercomputers) that require explicit inter-node communication. All inter-node communications are handled by the CORGI library using the Message Passing Interface (MPI) library6. Hereafter, one MPI process is called a rank to distinguish it from a processor, because a rank does not necessarily coincide with a single processor.

CORGI uses a patch-based super-decomposition strategy to partition the simulation grid between different ranks. This means that the full simulation grid is split into small (preferably contiguous) subregions called tiles. One tile can, for example, host a 10 × 10 subgrid in a 2D simulation. If the complete simulation domain is composed of a 10⁴ × 10⁴ lattice, it can be tessellated with 10³ × 10³ of these tiles.

The implementation of a physical simulation code works by inheriting all the grid infrastructure methods from a corgi::Tile<D> template class. This derived template class then automatically holds all the relevant physical attributes and methods needed to perform the simulation communications. The tile object provided by CORGI is a template class that is parameterized with the spatial dimension D. Specializing this dimension parameter to any given value sets the simulation grid dimensionality.

All the tiles are stored in a container provided by CORGI called a grid. Each program rank has its own grid object corgi::Grid<D> (specialized for D) that holds a varying number of tiles. This allows us to decompose the grid so that each rank only holds and processes a small part of the simulation domain (i.e., a set of tiles). The grid class is responsible for providing all the spatial neighborhood information of the tiles and for executing the inter-tile communications. These include both local shared-memory intra-rank communications and global inter-rank MPI communications. In practice, these methods are typically used to keep the halo regions of the tiles up to date.
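The tile and grid bookkeeping can be illustrated with a hypothetical pure-Python mock. The class and method names below are illustrative stand-ins, not the actual CORGI API:

```python
# Toy sketch of CORGI-style patch decomposition: the global lattice is
# tessellated into tiles, and each rank stores only the tiles it owns
# in a grid container that provides neighbor lookups.
class Tile:
    def __init__(self, i, j, n=10):
        self.index = (i, j)           # tile coordinates on the tile lattice
        self.subgrid = [[0.0] * n for _ in range(n)]  # e.g., 10x10 payload

class Grid:
    def __init__(self, nt):
        self.nt = nt                  # tiles per side of the global grid
        self.tiles = {}               # only locally owned tiles are stored

    def add_tile(self, tile):
        self.tiles[tile.index] = tile

    def neighbors(self, i, j):
        """Periodic von Neumann neighborhood of tile (i, j)."""
        return [((i + di) % self.nt, (j + dj) % self.nt)
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1))]

grid = Grid(nt=1000)                  # 1000 x 1000 tiles of 10x10 cells
grid.add_tile(Tile(0, 0))             # this rank owns one corner tile
assert (999, 0) in grid.neighbors(0, 0)   # periodic wrap in x
```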

There are no restrictions on the shape of the tile boundaries between different ranks. This allows more complex tile ownership to be used between ranks, which aims to maximize the volume and minimize the surface area, like a honeycomb tessellation in 2D. This configuration then translates directly to less communication needed between ranks.

Both the corgi::Tile and corgi::Grid C++ classes are exposed to PYTHON. Dimensionality (i.e., the D template parameter) is automatically set by loading the correct subpackage from the RUNKO (and CORGI) PYTHON libraries: the options are oneD for one-dimensional simulations (Nx-sized tiles; 𝒩x-sized global grid configuration), twoD for two-dimensional simulations (Nx × Ny sized tiles; 𝒩x × 𝒩y sized global grid configuration), and threeD for three-dimensional simulations (Nx × Ny × Nz sized tiles; 𝒩x × 𝒩y × 𝒩z sized global grid configuration). Here Nx,y,z is the subgrid resolution of the tiles and 𝒩x,y,z is the simulation grid resolution in units of tiles.

3.2 Modules

The actual physics of the simulation is implemented in derived tiles that are based on the CORGI base tiles. We call these tiles and all the related objects and functions modules7. An important part of the framework design is that different modules can be combined with polymorphic multiple inheritance (i.e., deriving methods and attributes from multiple other modules). This allows an efficient re-use of different modules when designing a new module with similar or overlapping physics concepts.

As an example, the PIC code (Sect. 2.3.2) uses corgi::Tile as its first base class to inherit the grid infrastructure, and fields::Tile as its second base class to inherit the electromagnetic FDTD routines (Sect. 2.3.1). The PIC module itself only defines the particle containers and the methods to manipulate particle movement. The two modules are connected by the current ĵ created by the moving charged particles. This current then induces a change in the Ê and B̂ fields, which affects the Lorentz force experienced by the particles (see Fig. 1).

3.3 Solvers

Each module can contain several solvers that operate on and/or change the internal attributes of the module’s tiles. Solvers are created by first defining an interface with an abstract class. The actual solver implementations that can be instantiated (i.e., the concrete classes) can then be implemented by creating a derived class from this abstract interface. These abstract interface classes are bound to PYTHON objects via the “trampoline” classes. This automates the binding of any user-defined concrete solver implementations. A typical requirement of any solver is that it should provide a solve method that takes (at least) the module’s tile as an input.

As an example, the PIC module defines an abstract INTERPOLATOR solver interface. The purpose of this solver object is to interpolate the Ê and B̂ fields, defined on a lattice, to arbitrary particle locations. One of the currently provided INTERPOLATOR instances is the LINEAR solver, which implements a first-order linear interpolation scheme to approximate the field values between the grid points. Thanks to this design pattern, it is very easy to extend the module to include other interpolation methods, such as a second-order polynomial or a spectral interpolation. The framework is agnostic about how the job of the solver is actually fulfilled; it is enough that it conforms to the abstract interface class standard, after which it can be automatically incorporated into the module routines.
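This interface-plus-concrete-solver pattern can be mocked in pure Python. The names below are illustrative; the real RUNKO solvers are C++ classes exposed through trampoline bindings:

```python
from abc import ABC, abstractmethod

# Hypothetical pure-Python mock of the solver design: an abstract
# interpolator interface plus a concrete first-order (linear) instance.
class Interpolator(ABC):
    @abstractmethod
    def solve(self, field, x):
        """Interpolate a 1D lattice `field` to particle position `x`."""

class LinearInterpolator(Interpolator):
    def solve(self, field, x):
        i = int(x)                    # left grid point (Delta x = 1)
        w = x - i                     # linear weight
        return (1.0 - w) * field[i] + w * field[i + 1]

interp = LinearInterpolator()
# Linear interpolation is exact for a linear field:
field = [2.0 * i for i in range(8)]
assert abs(interp.solve(field, 3.25) - 6.5) < 1e-12
```

Any new concrete solver that implements `solve` with the same signature can be swapped in without touching the calling code, which is the point of the abstract interface.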

Fig. 1

Schematic presentation of the PIC code’s class relationships and connectivity. The FIELDS module is built on top of the CORGI module, which provides the spatial simulation grid. It contains the electromagnetic fields Ê and B̂, and the methods to advance them in time. The PIC module handles the computational particles. It is coupled to the FIELDS module via the current ĵ induced by the moving charged particles. Similarly, the FIELDS module is coupled to the PIC module by the Lorentz force.

3.4 Drivers

The actual simulation codes in RUNKO are executed by drivers. Drivers are short PYTHON scripts that use the C++ bindings of the tiles and solvers. They typically consist of an initialization section at the beginning, where tiles of some specific module are loaded into the CORGI grid container. After this, the data attributes of the tiles are initialized. The drivers also include the main time propagation loop, where solvers are applied to the tiles in the correct order. As an example, Table 1 presents the main time integration loop of the PIC code.

The main advantage of using the low-level C++ classes via the PYTHON driver scripts is the ease of use. For example, various complex initialization routines are much easier to implement in a high-level language that also supports data visualization. It also opens up the possibility of performing the analysis (or at least part of it) on the fly during the simulation.
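The ordering of a generic PIC cycle inside such a driver can be sketched with stub solvers. The solver names below are generic placeholders illustrating the call order, not the exact Table 1 calls on pyrunko tiles:

```python
# Hedged pseudo-driver: a generic PIC cycle applied tile by tile
# (interpolate -> push -> deposit -> filter -> field update).
log = []
solvers = {
    "interpolate": lambda tile: log.append("interpolate"),
    "push":        lambda tile: log.append("push"),
    "deposit":     lambda tile: log.append("deposit"),
    "filter":      lambda tile: log.append("filter"),
    "push_fields": lambda tile: log.append("push_fields"),
}

tiles = ["tile_0_0"]                    # stand-in for grid-local tiles
for lap in range(2):                    # main time-integration loop
    for tile in tiles:
        solvers["interpolate"](tile)    # E, B -> particle positions
        solvers["push"](tile)           # Lorentz-force velocity/position update
        solvers["deposit"](tile)        # charge-conserving current deposit
        solvers["filter"](tile)         # digital filtering of the current
        solvers["push_fields"](tile)    # FDTD update of E and B

assert log[:5] == ["interpolate", "push", "deposit", "filter", "push_fields"]
```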

3.5 Currently Implemented Modules

The currently implemented modules include the FIELDS and PIC modules. These and the corresponding solvers are listed in Table 2.

The FIELDS module (Sect. 2.3.1) is responsible for advancing the electromagnetic fields via Maxwell’s equations. This module inherits its spatial dimensionality D from the CORGI tiles. Internally it has a mesh container that stores the field components as D-dimensional lattices. It also holds methods for advancing these fields in time using Eqs. (14) and (15). The module is agnostic about the form of the incoming current; it can be coupled to any other module that provides this closure.

The PIC module (Sect. 2.3.2) handles the particle propagation and provides the current closure for the FIELDS module (see also Fig. 1). Attributes of the electromagnetic field lattices are inherited from the fields::Tile. In addition to the spatial dimensionality template parameter D, the PIC module has another template parameter for the velocity space dimensionality V (i.e., pic::Tile<D, V>). By specializing the V template parameter, the solvers acting on the pic::Tile take into account the corresponding velocity space dimensionality. For example, setting V = 2 corresponds to manipulating the velocity vector u = (ux, uy), whereas V = 3 modifies a full three-dimensional velocity vector u = (ux, uy, uz). The PIC module also defines various solvers that operate on the pic::Tiles (see Table 2 for all the solvers implemented in the framework).

The final important module in the current framework is the io module, which is responsible for the data input and output functionality. It is currently more detached from the physical implementations, providing only solvers that act on the tiles. These solvers can be categorized into two distinct types: input readers and output writers. The input readers load data into the tiles from storage files, whereas the output writers write data from the tiles to disk. The currently implemented readers and writers are capable of operating on full simulation snapshot restart files using the hierarchical data format (HDF5). In addition, both field and particle classes have more storage-efficient writers that reprocess the data on the fly during the simulation and write only a smaller subsample of the full data to disk.

4 Problem Showcase

In this section, we showcase the new PIC code implemented in the RUNKO framework. First, we demonstrate the validity of the code by comparing simulation results for the relativistic Weibel instability to the known analytical growth rates. Next, we demonstrate the same instability on a larger scale by presenting results from an unmagnetized collisionless shock simulation. Finally, we measure the numerical performance of the code using results from relativistic turbulence simulations.

4.1 Relativistic Weibel Instability

We start showcasing the code capabilities by first simulating a standard astrophysics-motivated plasma instability, commonly known as the Weibel instability (Weibel 1959; Fried 1959; Bret et al. 2010). The relativistic Weibel instability in a collisionless plasma is triggered when two counter-streaming plasma populations start to experience small charge-dependent deflections, seeded by noise. The resulting modes are electromagnetic, so perturbations in both E and B are important. Therefore, the presented simulations also help us probe the validity of the solutions of the full Vlasov–Maxwell system.

4.1.1 Problem Setup

We simulate the linear growth phase of the instability by following two counter-streaming pair plasma populations in a 2D simulation domain. The plasma is modeled with 32 particles per cell per species, and is initialized to have a small thermal spread of θ = kT/mec² = 10⁻⁵ and to flow relativistically along ±x̂. Electromagnetic fields are propagated in time with the standard FDTD2 solver, the current is deposited with the ZigZag scheme, particle interpolation is done with the LinearInterpolator method, and particle propagation is performed with the BorisPusher solver. The simulation domain is captured with a grid of 320 × 80 cells and the skin depth is resolved with 10 cells. We use a standard time step of ∆t = 0.45∆x. We use symmetric beams with Lorentz factors of γb = 3, 10, 30, and 100.

Fig. 2

Simulations of Weibel instability growth rates in relativistic counter-streaming plasmas. The different curves show the magnetic field energy density B²/8π (normalized by the relativistic plasma enthalpy density, γbn0mec²) as a function of time tωp for beam Lorentz factors of γb = 3 (blue), 10 (orange), 30 (green), and 100 (red). The dashed lines show the expected analytic growth rates.

Table 1

Time-advance loop of particle-in-cell module.

4.1.2 Comparison to Analytical Estimates

The analytic linear growth rates of the Weibel instability are known for symmetric beams. Bret et al. (2010) present the filamentation instability growth rate δ as

δ = √(2/γb) βb ωp, (19)

where γb is the bulk Lorentz factor of the beams, βb is the beam velocity in units of c, ωp = √(4πne qe²/me) is the plasma frequency, and ne is the plasma number density (of electrons). We compare the simulation results to these theoretical estimates in Fig. 2 and find the linear instability growth rates to be in excellent agreement with the expectations.
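Assuming the cold symmetric-beam filamentation rate δ = √(2/γb) βb ωp quoted above, the expected rates for the four simulated beams can be evaluated with a few lines:

```python
import math

# Growth-rate sketch for the cold symmetric-beam filamentation instability,
# delta = sqrt(2 / gamma_b) * beta_b * omega_p (cf. Bret et al. 2010).
def weibel_growth_rate(gamma_b, omega_p=1.0):
    beta_b = math.sqrt(1.0 - 1.0 / gamma_b**2)   # beam velocity in units of c
    return math.sqrt(2.0 / gamma_b) * beta_b * omega_p

rates = {g: weibel_growth_rate(g) for g in (3, 10, 30, 100)}
# Faster beams grow more slowly: delta ~ gamma_b^(-1/2) for gamma_b >> 1.
assert rates[3] > rates[10] > rates[30] > rates[100]
```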

These simulations are not particularly demanding and can therefore serve as a helpful introduction to using the code. The discussed setup is also general enough that it could be easily reformulated to study other analytically intractable plasma beam instabilities.

Table 2

List of RUNKO modules and solvers.

4.2 Formation of Collisionless Shocks

The relativistic Weibel filamentation setup can be expanded to a realistic 3D simulation and used to model the plasma physics of collisionless shocks. The shock is formed by the growth and nonlinear saturation of the same instability.

We find this setup to be one of the most demanding numerical tests for PIC codes because the numerical algorithms need to be able to both sustain the relativistic upstream bulk flow and correctly model the kinetic shock physics. The upstream is unstable to the numerical Cherenkov instability (Godfrey 1974) and is, in addition, sensitive to any numerical error in the current deposition; even tiny perturbations in the relativistically advected upstream plasma can trigger an isotropization of the bulk flow. The shock front, on the other hand, is sustained by a strong electric current that is sensitive to charge conservation. We use this setup to stress test the full 3D PIC routines and their correctness.

4.2.1 Problem Setup

The numerical setup consists of an elongated shock “tube” along the x-axis and a reflecting wall on the left side of the domain. Initially, all the plasma in the domain is set to move from right to left. The plasma is reflected from the left wall of the simulation domain and forms a counter-propagating beam. The beam is unstable to the Weibel instability, which quickly stops the right-moving component. This causes the formation of a collisionless shock.

We simulate the shock in an elongated box of size 512 × 64 × 64 (c/ωp)³, using 10 cells per skin depth so that the full domain is covered with a grid of 5120 × 640² cells. The plasma flows with a relativistic bulk Lorentz factor Γ = 10 and is modeled with 64 particles per cell per species. Electromagnetic fields are propagated in time with the fourth-order FDTD4 scheme. The resulting current is calculated using the ZigZag scheme, particle interpolation is done via the LinearInterpolator, and the current is smoothed with eight passes of the binomial filter (Binomial2). We use a time step of ∆t = 0.45∆x. The pair plasma has an initial thermal spread of θ = kT/mec² = 2 × 10⁻⁵ and is updated using the HigueraCaryPusher. We start the simulation with no background magnetic field, which corresponds to an unmagnetized shock.

The plasma is injected into the domain from a receding right wall that moves at the speed of light, from left to right; this enables longer simulations, both because the particles are not “alive” as long as they otherwise would be (suppressing the numerical Cherenkov instability) and because it saves computational time (since we do not initially need to propagate all the particles in the domain). The shock front is adaptively tracked on the fly, and only the high-resolution grid data around it are saved. The grid is decomposed into cyclic MPI domains of length 1280 cells in the x direction (i.e., five cycles per the full box); this results in a more even memory load for the simulation.

Fig. 3

Visualization of a relativistic collisionless shock in unmagnetized pair plasma at tωp ≈ 500. The box periphery shows the plasma number density n/n0, and the exploded panels show the electric current density along the flow, Jx/en0c. The shock is located at the middle of the box at x = 0; the same region shows strong current filaments that are responsible for mediating the shock.

4.2.2 Physical Interpretation and Numerical Tests

A collisionless shock forms quickly once the simulation is started. The shock is mediated by the formation of Weibel filaments that grow, saturate, and merge; this stops the counter-streaming plasma (see Fig. 3).

The simulated shock structure and dynamics are well described by the theoretical expectations and previous numerical experiments (e.g., Sironi & Spitkovsky 2009; Plotnikov et al. 2018). The filaments form and induce the shock in about 100 ωp⁻¹. The upstream flow is compressed into a denser downstream with a compression ratio ndown/nup ≈ 4.1 ± 0.15, and the shock propagates into the upstream with a mildly relativistic velocity, βsh ≈ 0.32 ± 0.03 (measured for 200 ≤ tωp ≤ 1000). Unmagnetized MHD jump conditions, in comparison, predict ndown/nup = 1 + (Γ + 1)/[Γ(γAD − 1)] = 4.3 for the compression ratio and βsh = (γAD − 1)(Γ − 1)/Γ = 0.3, given the 3D adiabatic index γAD = 4/3 (see, e.g., Plotnikov et al. 2018). The shock ramp itself has a length of roughly ≈50 c/ωp and is accompanied by an increase in the electric field energy density just ahead of the moving front.
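The quoted jump-condition numbers follow directly from the formulas above:

```python
# Unmagnetized relativistic MHD jump conditions quoted in the text,
# for upstream Lorentz factor Gamma = 10 and 3D adiabatic index 4/3.
Gamma, gamma_ad = 10.0, 4.0 / 3.0

compression = 1.0 + (Gamma + 1.0) / (Gamma * (gamma_ad - 1.0))
beta_sh = (gamma_ad - 1.0) * (Gamma - 1.0) / Gamma

assert abs(compression - 4.3) < 0.01    # n_down / n_up
assert abs(beta_sh - 0.3) < 0.01        # shock front velocity
```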

We also specifically tested that the upstream remains smooth and stable for an extended period of time, ≳1000 ωp⁻¹, with electric and magnetic field fluctuations of δE, δB ≲ 10⁻⁶ (in absolute code units; the value corresponds to floating-point accuracy). This is a stringent test that verifies that the particle-to-grid and grid-to-particle routines are all correctly implemented and that no charge or electric current is lost; otherwise, any numerical error quickly grows and isotropizes the flow. Similarly, the formation of the shock front supports the notion that all the 3D routines are correct. We note that the shock simulation is a particularly demanding code test because of these details.

4.3 Decaying Relativistic Kinetic Turbulence

In this section we use a relativistic decaying turbulence setup to showcase the code performance and scalability metrics. The setup is presented and discussed in detail in Comisso & Sironi (2018, 2019) and Nättilä & Beloborodov (2021). Similar driven turbulence simulations are performed in Zhdankin et al. (2017a,b), among others.

4.3.1 Problem Setup

We consider a uniform pair plasma that is suddenly stirred. Due to this large-scale driving motion, a turbulent cascade quickly develops. We study 2D3V setups, where the cascade forms in the planar x–y direction and is captured by the spatial 2D simulation domain. All three velocity components are retained, however, including the out-of-plane z direction. To physically mimic confinement of the cascade to a 2D plane, a guide field B0 = B0 ẑ is imposed in the z direction (ẑ is the unit vector along the z-axis).

The presence of the strong guide field renders the plasma magnetically dominated. The plasma magnetization parameter we use is
σ = B0²/(4π n0 ⟨γ⟩ me c²), (20)
where n0 is the initial number density and ⟨γ⟩ is the mean thermal Lorentz factor of the plasma. Here we focus only on a strongly magnetized case of σ = 10. Initially, the plasma is defined to be warm.

The spatial resolution of our fiducial simulation is 5120² cells (512 × 512 tiles with an internal tile resolution of 10 × 10). The initial plasma skin depth c/ωp is resolved with five cells. This sets our box size to L ≈ 1000 c/ωp. We use 64 particles per cell per species, totaling 128 particles per cell for the electron and positron species combined. In total, we then propagate ~10⁹ particles in the simulation.
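The bookkeeping behind these numbers can be verified with a few lines of arithmetic (plain Python; the quantities simply mirror the values quoted above):

```python
tiles = 512           # tiles per dimension
tile_res = 10         # cells per tile per dimension
skin_depth = 5.0      # cells per plasma skin depth c/omega_p
ppc_per_species = 64  # particles per cell per species
n_species = 2         # electrons and positrons

cells = tiles * tile_res                         # cells per dimension
L = cells / skin_depth                           # box size in units of c/omega_p
n_prtcls = cells**2 * ppc_per_species * n_species  # total particle count

print(cells, L, n_prtcls)  # 5120 1024.0 3355443200
```

The box size comes out as 1024 c/ωp ≈ 1000 c/ωp, and the total particle count is a few times 10⁹.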

The sudden large-scale stirring is mimicked by perturbing the planar magnetic field B with strong uncorrelated fluctuations in the x and y directions following Comisso & Sironi (2018). This drives strong turbulence with semi-relativistic bulk velocities. The forcing modes, m, n ∈ {1, . . . , 8}, are initialized with equal power as
δBx(x, y) = Σ_{m,n} δb cos(km x + φmn) sin(kn y + ψmn), (21)
δBy(x, y) = −Σ_{m,n} δb sin(km x + φmn) cos(kn y + ψmn), (22)

where km = 2πm/L and kn = 2πn/L are the wave numbers along the x and y dimensions, respectively. The forcing is uncorrelated, so the phases φmn, ψmn ∈ [0, 2π) are random. The forcing amplitude of each mode is
δb = B0/N, (23)

where N = 8 in our case. The largest forcing mode (kN = 2πN/L) defines the energy-carrying scale and the characteristic size of the largest magnetic eddies, l0 = 2π/kN.
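Such a random-phase, equal-power initialization can be sketched as follows (an illustrative functional form and normalization; the production setup follows Comisso & Sironi 2018 and may differ in detail):

```python
import numpy as np

rng = np.random.default_rng(42)

N = 8          # number of forcing modes per dimension
nx = ny = 256  # grid resolution (illustrative, much smaller than 5120^2)
L = 1.0        # box length in arbitrary units
B0 = 1.0       # guide-field strength
db = B0 / N    # equal amplitude per mode (illustrative normalization)

x = np.linspace(0.0, L, nx, endpoint=False)
y = np.linspace(0.0, L, ny, endpoint=False)
X, Y = np.meshgrid(x, y, indexing="ij")

dBx = np.zeros((nx, ny))
dBy = np.zeros((nx, ny))
for m in range(1, N + 1):
    for n in range(1, N + 1):
        km, kn = 2 * np.pi * m / L, 2 * np.pi * n / L
        phi, psi = rng.uniform(0.0, 2 * np.pi, size=2)  # random phases
        dBx += db * np.cos(km * X + phi) * np.sin(kn * Y + psi)
        dBy -= db * np.sin(km * X + phi) * np.cos(kn * Y + psi)
```

Because every mode spans an integer number of wavelengths, the mean of each fluctuation component vanishes and only the large-scale eddies up to mode N carry power.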

Fig. 4

Visualizations of the kinetic turbulence simulations. The top row shows the honeycomb-like domain decomposition that partitions the grid between different processors with the CORGI library (the color map is cyclic). The bottom row shows the plasma density (in units of the initial plasma mass density ρ0). The left panels show the full simulation domain, whereas the right panels focus on a smaller subregion that better illustrates the small-scale structures.

Fig. 5

Code performance analysis for 5120² resolution turbulence runs with 1024 cores. The black line shows the total push time; the other curves show the individual results for each component in the loop (see Table 1). In a perfectly load-balanced state (n ≲ 200) the total particle push time is measured to be around 0.7 µs, whereas in a strongly load-imbalanced state an average time of about 1.1 µs is obtained.

4.3.2 Numerical Efficiency and Parallel Scaling

The turbulence cascade is formed as the plasma relaxes from the non-equilibrium state (Fig. 4, bottom row; see, e.g., Nättilä & Beloborodov 2021 for a more in-depth physical interpretation). We use the turbulence simulation setup to probe the numerical performance and parallel scaling of the code.

This setup is advantageous not only because it is a real physical problem, but also because we can probe both the load-balanced and the load-imbalanced performance of the PIC code. Initially, the simulations are in the load-balanced state (computing time is dominated by the particle push), but after the cascade forms we observe a strong numerical imbalance between different ranks because of the strong fluctuations in particle number density (computing time is dominated by communication routines). No dynamic load-balancing algorithm is used. Here we mainly focus on the transition scale of around ~10³ processors, where both of these regimes can be simulated, leaving extremely large simulations with ≳10⁵ processors and dynamic load balancing for future work. The scaling measurements reported here were performed on the Kebnekaise computing cluster8. We use the second-order FDTD solver (FDTD2), the Boris pusher (BorisPusher), linear interpolation (LinearInterpolator), and the first-order ZigZag scheme (ZigZag) to perform the simulations. Calculations are done using double floating-point precision.

Good numerical performance is obtained for the PIC code. In an ideal, completely load-balanced state the average total particle update time per processor is around 0.7 µs (see Fig. 5 around n ≲ 200). In the strongly imbalanced simulation state (dominated mainly by inter-rank particle communications) the particle update time per processor is around 1.1 µs once the turbulent cascade has developed (Fig. 5 for n ≳ 4000). The decrease in speed is caused by the increased evaluation time of the routines communicating the fields and particles to the neighboring ranks.

We also monitor all the separate routines (see Table 1) independently to probe the numerical performance of the code in more detail. We note, however, that this monitoring is only done for one particular rank, so the component-wise evaluation times reported here may fluctuate between ranks experiencing differing loads. Nevertheless, the numbers are indicative of where most of the evaluation time is spent. As seen in Fig. 5, in the ideal case most of the evaluation time (around 0.3 µs per particle per processor) is spent in the interpolation routine (interp_em; pink solid line), where Ê and B̂ are computed at the particle locations between the grid points. The large cost of the interpolation step is caused by frequent cache misses. The interpolation routine consists of many random-access operations on the field arrays because particles are not sorted by their location; this prevents the processor cache prefetching from correctly predicting the incoming requested values. The second most expensive routine is the current calculation (comp_curr; green dashed line), with a typical evaluation time of 0.06 µs per particle per processor. It has a similar problem with the cache memory because writing the current to the array is performed with an unpredictable access pattern. When communication costs increase, the MPI messaging routines become as costly as the current deposition, with an average time of about 0.1 µs per particle per processor.
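The gather step itself is algorithmically simple; the sketch below shows a minimal 1D linear interpolation (a stand-in, not Runko's implementation), where each particle triggers reads of two field values at a particle-dependent index, which is exactly the random-access pattern that defeats cache prefetching:

```python
import numpy as np

def gather_linear(field, xs):
    """Linearly interpolate a 1D grid `field` to particle positions `xs`
    (given in units of the grid spacing). Each particle reads two grid
    cells at an unpredictable, particle-dependent index."""
    i = np.floor(xs).astype(int)  # nearest left grid point
    dx = xs - i                   # fractional offset within the cell
    return (1.0 - dx) * field[i] + dx * field[i + 1]

grid = np.arange(16, dtype=float)        # a linear test field, E(x) = x
pos = np.array([0.25, 3.5, 10.75])       # unsorted particle positions
print(gather_linear(grid, pos))          # recovers x exactly for a linear field
```

Sorting particles by cell would restore a predictable access pattern at the cost of an extra reordering pass.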

The code shows good weak (Fig. 6) and strong (Fig. 7) scaling when the simulations transform from CPU-bound to communication-bound. Weak scaling proceeds as 1.0:1.10:1.40:1.93:2.09 for every quadrupling of the number of processors, 16:64:256:1024:4096. Strong scaling proceeds as 1.0:1.18:1.48:2.36:1.93 against the ideal scaling for every doubling of the number of processors, 256:512:1024:2048:4096. The increase in the evaluation time when the number of processors is increased can indeed be attributed to the increase in the evaluation time of the communication routines (field updates: mpi_b1, mpi_b2, mpi_e1, mpi_cur, clear_vir_cur; and particle updates: mpi_prtcls, unpack_vir_prtcls, del_vir_prtcls). A similar test performed in the load-balanced state yields almost ideal scaling up to ~10³ processors (roughly 1.10 of the ideal) with a total particle update time of about 0.7 µs.
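The quoted slowdown ratios translate directly into parallel efficiencies (efficiency = ideal time / measured time); a small sketch using the numbers above:

```python
# Measured-to-ideal evaluation time ratios quoted in the text.
weak = {16: 1.0, 64: 1.10, 256: 1.40, 1024: 1.93, 4096: 2.09}
strong = {256: 1.0, 512: 1.18, 1024: 1.48, 2048: 2.36, 4096: 1.93}

# Parallel efficiency is the inverse of the slowdown ratio.
weak_eff = {p: 1.0 / r for p, r in weak.items()}
strong_eff = {p: 1.0 / r for p, r in strong.items()}

print(round(weak_eff[4096], 2))    # ≈ 0.48
print(round(strong_eff[4096], 2))  # ≈ 0.52
```

That is, even in the strongly load-imbalanced state, roughly half of the ideal throughput is retained at 4096 processors.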

Finally, we note that the extra costs originating from the intra-rank updates of the tile boundaries are negligible. This extra cost is introduced by the patch-based grid infrastructure, where the smallest unit of calculation is one tile (a 10 × 10 lattice with about 10⁴ particles per species per tile in this particular case). Therefore, some communication is always needed to keep these individual units of calculation in sync (field update routines: upd_bc0, upd_bc1, upd_bc2, cur_exchange; and particle update routines: check_outg_prtcls, pack_outg_prtcls, get_inc_prtcls, del_trnsfrd_prtcls), even in systems relying completely on shared-memory parallelism. These routines are, however, typically 10–100 times cheaper than the field interpolation or MPI communication tasks.

Fig. 6

Weak scaling results of the PIC code measured in a strongly load-imbalanced simulation state. Scaling is presented in terms of the mean particle push time per processor for various simulations that have a fixed ratio of total number of particles to processors. The black line shows the total mean push time; the other curves show the individual results for each component in the loop (see Table 1). Ideal scaling corresponds to a horizontal line, although there is a slight increase in the evaluation time due to more time spent on communication routines as the number of processors increases.

Fig. 7

Strong scaling results of the PIC code measured in a strongly load-imbalanced simulation state. Results are for a 5120² box with 128 particles per cell. The black line shows the total evaluation time per lap; the other curves show the individual results for each component in the loop (see Table 1). The ideal scaling corresponds to the thick black dotted line.

5 Discussion

5.1 Computational Advantages

RUNKO is a modern numerical toolbox tailored for astrophysical plasma simulations. The framework is designed using recent computer science methods that rely on multiple levels of code abstraction. These, in turn, help to create a general and easily extensible code framework. Furthermore, in order to encourage all kinds of use, RUNKO is provided as open-source software for the community.

The code is written as a hybrid C++/PYTHON program. The low-level C++ kernels currently conform to the C++14 standard, which brings a plethora of modern computational methods. For example, we rely heavily on recent C++ template metaprogramming features to simplify the design of the code. This ensures a truly modern numerical code that will not be outdated immediately upon publication.

In addition to the underlying, numerically efficient C++ implementation, all the classes have a PYTHON interface. This allows the user to take full advantage of a high-level language without sacrificing the actual run-time speed of the code. In this way, many simulation setups can be quickly prototyped on a laptop and then scaled up to supercomputer platforms for the actual production runs. Furthermore, it allows very complex initial conditions to be designed and implemented, because the initialization routines can be written in PYTHON.
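As an illustration of this workflow, initial data can be prepared with a few lines of NumPy before being handed to the compiled solvers; the function and array layout below are hypothetical and only demonstrate the pattern, not Runko's actual API:

```python
import numpy as np

# Hypothetical initializer: build per-tile field arrays in Python, e.g.,
# the uniform guide field of the turbulence setup (names are illustrative).
def init_guide_field(shape, b0):
    """Return (Bx, By, Bz) arrays with a uniform guide field along z."""
    Bx = np.zeros(shape)
    By = np.zeros(shape)
    Bz = np.full(shape, b0)  # B = b0 * z-hat everywhere
    return Bx, By, Bz

# one 10 x 10 tile (2D grid with a single cell in z)
Bx, By, Bz = init_guide_field((10, 10, 1), b0=1.0)
```

Because the heavy lifting stays in the compiled kernels, this kind of scripted initialization costs essentially nothing at run time.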

The modular design allows RUNKO to function as a fully agnostic framework for different formulations of modeling plasma computationally. The physics is implemented via modules, which in practice often solve a set of partial differential equations or propagate Lagrangian particle species. This means that we are not locked into one predefined theoretical formalism for describing the evolution of the plasma. The high degree of modularity also allows the user to easily change the numerical solvers and algorithms depending on the problem at hand.

The implementation relies on modern C++ features such as polymorphism and template metaprogramming to simplify the code and make it easily extensible. Different solvers inherit their standard interface from an abstract base class, so that introducing and implementing new physical algorithms is straightforward. User-definable dimensionality of the different solvers is handled by template specializations. Additionally, most of the solvers are agnostic to the actual floating-point accuracy (i.e., single- or double-precision) because the working precision is typically also specified as a template variable.

Another technical advantage of the framework is the patch-based domain decomposition strategy that is built into the grid infrastructure routines. As shown by the performance analysis, the cost of the additional updates needed to keep the individual tiles in sync is negligible in comparison to the evaluation costs of the physics modules. Having a larger number of moderately sized tiles is found to be faster than partitioning the domain with fewer but larger tiles9. The benefits of this system are that it automatically introduces some level of load balancing into the simulations (each rank has ~10² tiles, so load imbalances at the level of one tile are effectively smoothed out), it helps with data cache locality (array sizes and particle containers remain small and easily movable in memory), and it simplifies the global communication routines (since these communication routines are needed anyway). Another benefit is the possibility to optimize the MPI domain boundaries with more complex edge shapes. An interesting example of this is the honeycomb-like tile memory configuration (see Fig. 4, top row). We also note that a similar decomposition technique currently seems to be the state-of-the-art technical choice of many other recent high-performance PIC codes, such as VPIC (Bowers et al. 2008) and SMILEI (Derouillat et al. 2018).
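The load-smoothing effect of aggregating many tiles per rank can be illustrated with a toy model (a sketch, not Runko code: per-tile workloads are drawn at random, and the rank-to-rank spread of the summed loads shrinks as more tiles are averaged together on each rank):

```python
import random

random.seed(1)

def rank_imbalance(tiles_per_rank, n_ranks=64):
    """Relative spread, (max - min)/mean, of per-rank work when each
    tile's load is an independent random draw in [0, 1)."""
    loads = [sum(random.random() for _ in range(tiles_per_rank))
             for _ in range(n_ranks)]
    mean = sum(loads) / n_ranks
    return (max(loads) - min(loads)) / mean

coarse = rank_imbalance(1)    # one big tile per rank
fine = rank_imbalance(100)    # ~10^2 tiles per rank, as in the text
```

With ~10² tiles per rank the relative spread is a small fraction of the single-tile case, which is the statistical smoothing referred to above.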

5.2 Future Directions

The presented framework offers a unique possibility to start exploring and experimenting with multiphysics simulations. These are new kinds of numerical simulations that evolve multiple physical formalisms simultaneously, or in an adaptive fashion selecting the most accurate (or relevant) solver on the fly.

These heterogeneous simulations will enable us to simulate larger, and therefore more realistic, systems. In the future this can enable, for example, novel plasma simulations where the majority of the simulation domain is modeled with a force-free electrodynamic description of the plasma (Komissarov 2002), while in some small regions with extreme conditions (e.g., current sheets, shocks) the simulation is adaptively refined (on the fly) to the fully kinetic description that enables realistic in situ modeling of particle acceleration. Another possibility could be to use a multiple-domain–multiple-physics approach where, for example, some fixed part of the simulation is described with a kinetic description, whereas some other part is modeled with MHD. Here the domains could be divided based on some strong density gradient in the problem setup, like those found in systems with dilute nonthermal particles on top of a denser fluid-like plasma; physically, these can model astrophysical systems like accretion disk coronas.

Another important focus is the code performance and scalability. In practice, this mostly means minimizing the global communication costs. One possibility for decreasing the communication costs is the use of more complex hybrid parallelization schemes. As an example, the patch-based domain super-decomposition encourages the use of a task-based hybrid parallelization scheme, similar, for example, to the DISPATCH framework (Nordlund et al. 2018). Since most of the tile updates are numerically independent, these operations can easily be performed in parallel with shared-memory parallelization strategies. This, in turn, allows the number of tiles per MPI rank to be increased, which acts as an extra buffer against sudden load-imbalance fluctuations. This strategy also allows interleaved global (non-blocking) communications and normal solver updates to be performed simultaneously. Almost all the (MPI) communications can then be hidden by prioritizing the evaluation of tiles such that the boundary tiles are always computed first and sent to the neighbors, while other threads continue with the inner tiles while waiting for the halo tiles. We are also currently experimenting with dynamic load-balancing methods, where the ownership of the tiles changes during the simulation depending on the computational load.

Finally, since PYTHON is becoming the new lingua franca of astronomy and astrophysics, it is important to note that problem initialization in RUNKO can be done purely via PYTHON scripts. This means that new users are immediately familiar with the basic approach of the code, and that very complex simulation setups can be initialized in a high-level language. We hope that this will accelerate the use of plasma simulations to study many as-yet unresolved astrophysical problems.

6 Summary

We started by reviewing the kinetic plasma theory and the numerically solved Vlasov–Maxwell system of equations (Sect. 2). The numerical implementation of the framework is discussed in Sect. 3. We focused in particular on the design of the code and introduced the different numerical components of the framework. First, we discussed the CORGI grid infrastructure and the related patch-based domain super-decomposition strategy, where the computational grid is partitioned into separate tiles (Sect. 3.1). Second, we presented the physical building blocks of the code, the modules (Sect. 3.2). Different physical formulations of modeling the evolution of the plasma and electromagnetic fields are encapsulated in different modules. Third, each module can contain several numerical algorithms called solvers, which are short C++ classes that operate on and modify the content of different tiles (Sect. 3.3). Finally, these C++ solvers are applied to the tiles and orchestrated by PYTHON scripts called drivers, ensuring ease of use (Sect. 3.4).

As our first task, we implemented a new particle-in-cell module into the framework. The implementation is discussed in detail in Sects. 2.3.1 and 2.3.2. The PIC code demonstrates good numerical performance with a mean particle push time per processor of about 0.7 µs in an ideal load-balanced state and 1.1 µs in a strongly load-imbalanced state. Furthermore, the code is shown to scale well up to ~ 10 000 processors.

We showcased the use of this PIC code with three complex, astrophysically motivated plasma simulations. First, we simulated the linear growth and nonlinear saturation of the Weibel instability in a relativistically streaming pair plasma. Second, we performed a mock simulation of a current-filamentation-mediated relativistic collisionless shock in full 3D. Third, we tested the code performance by modeling decaying relativistic kinetic turbulence in magnetized pair plasma. These simulations demonstrate the ease of use, extensibility, and flexibility of the code. All the setups are commonly used to study astrophysical phenomena and can therefore serve as a useful starting point for new users.


Acknowledgements. J.N. would like to thank Andrei Beloborodov, Axel Brandenburg, Luca Comisso, Troels Haugbølle, Dhruba Mitra, Åke Nordlund, Lorenzo Sironi, Anatoly Spitkovsky, and Alexandra Veledina for stimulating discussions on numerics and plasma simulations. We would also like to thank the referee for many helpful comments and suggestions. The simulations were performed on resources provided by the Swedish National Infrastructure for Computing (SNIC) at PDC and HPC2N.

Appendix A Non-dimensionalization of the Vlasov–Maxwell Equations

Following Jackson (1975), the Maxwell equations in an arbitrary system of units can be written as
∇ · E = 4π k1 ρ, (A.1)
∇ · B = 0, (A.2)
∇ × E = −(1/k2) ∂B/∂t, (A.3)
∇ × B = (4π k1 k2/c²) J + (k2/c²) ∂E/∂t, (A.4)

where k1 and k2 are constants that define the unit system. For a Gaussian system we have k1 = 1 and k2 = c, whereas for rationalized MKSA k1 = 1/4πϵ0 and k2 = 1. Here ϵ0 is the vacuum permittivity (and µ0 is the vacuum permeability found from the relation c2 = 1/(ϵ0µ0)).

In this appendix we present the unit system in use (originally by Buneman et al. 1993, and described in detail also in the TRISTAN-MP user manual by A. Spitkovsky). We select the Gaussian system here by setting k1 = 1 and k2 = c. Our Maxwell equations then simplify to
∇ · E = 4πρ, (A.5)
∇ · B = 0, (A.6)
∇ × E = −(1/c) ∂B/∂t, (A.7)
∇ × B = (4π/c) J + (1/c) ∂E/∂t, (A.8)

and the fields appear symmetrical, E ↔ B. The Lorentz force in this unit system is
dp/dt = q [E + (v/c) × B]. (A.9)

We next normalize all the variables with some fiducial values. The most peculiar of these is our selection of distance and time scalings: we express the distance in units of the grid spacing, x = Δx x̂, and time in units of the time step, t = Δt t̂. The coordinate-velocity is then v = (Δx/Δt) v̂ and the four-velocity is u = c û. The fields are scaled with a fiducial field B0 such that E = Ê B0 and B = B̂ B0. Similarly, the charge, mass, and current density are presented as q = q0 q̂, m = m0 m̂, and J = J0 Ĵ, respectively. This way, all of our numerical quantities are denoted with a hat. In numerical (code) units the equations that are being solved are then
∂B̂/∂t̂ = −ĉ ∇̂ × Ê, (A.10)
∂Ê/∂t̂ = ĉ ∇̂ × B̂ − Ĵ, (A.11)
dû/dt̂ = ĉ (q̂/m̂) [Ê + (v̂/ĉ) × B̂], (A.12)

where in the last steps we define
B0 = m0 c ĉ/(q0 Δt), (A.13)
J0 = B0/(4π Δt), (A.14)

so that Eqs. (A.10)–(A.12) appear unitless.

From Eqs. (A.13) and (A.14) we obtain a connection between the grid spacing, particle mass, and charge. We note first that the natural scale of the current density is
J0 = q0 (Δx/Δt)/Δx³, (A.15)

i.e., J0 = q0/(Δx² Δt). This implies that
q0/(Δx² Δt) = m0 c ĉ/(4π q0 Δt²), (A.16)

i.e.,
q0²/(m0 c²) = Δx/(4π). (A.17)

Assuming that one computational macroparticle contains N electrons (or positrons) we can write
N = Δx/(4π re), (A.18)

where we have set m0 = me (me is the electron rest mass) and q0 = e (e is the electron charge), and where re = e²/mec² ≈ 2.82 × 10⁻¹³ cm is the classical electron radius. This enables us to express the field scaling, Eq. (A.13), in physical (Gaussian) units,
B0 = me c²/(e Δx) = e/(re Δx). (A.19)

The conversions from code units to Gaussian units can then be performed by selecting a length scale, Δx, as
x = Δx x̂, (A.20)
t = Δt t̂ = (ĉ Δx/c) t̂, (A.21)
E = B0 Ê, B = B0 B̂ = [e/(re Δx)] B̂, (A.22)
q = q0 q̂ = [Δx/(4π re)] e q̂. (A.23)

We next discuss some derived plasma quantities in the simulation. The total relativistic plasma frequency is given as
ωp² = ωp,−² + ωp,+² = (4π n e²/⟨γ⟩ me)(1 + m−/m+), (A.24)

where ωp,− and ωp,+ are the plasma frequencies of electrons and ions (or electrons and positrons), respectively, and where we have assumed n = n− = n+. For electron–ion plasma m− = 1 and m+ = 1836 (or some smaller reduced proton mass), whereas for electron–positron pair plasma m− = 1, m+ = 1, and 1 + m−/m+ = 2. Time dilation from a relativistic bulk motion (if any) is corrected for by the mean Lorentz factor of that flow, ⟨γ⟩. The counterpart in code units is
ω̂p² = (q̂ Nppc/⟨γ⟩)(1 + m̂−/m̂+), (A.25)

where Nppc is the number of particles per cell per species (again assuming charge equilibrium, Nppc = n̂− = n̂+) and in the last step we take into account that m̂− = q̂. The initial numerical plasma oscillation frequency is obtained by specifying the skin-depth resolution, R̂ = (c/ωp)/Δx, and remembering that ω̂p = ωp Δt and ĉ = c Δt/Δx, so that we obtain
ω̂p = ĉ/R̂. (A.26)

We can then fix the value of the charge q̂ (and mass m̂) such that the initial plasma frequency for the electrons in the beginning of the simulation is correct (i.e., requiring ω̂p,− = ĉ/R̂):
q̂ = ⟨γ⟩ ĉ²/(Nppc R̂²). (A.27)

Similarly, the relativistic species-specific cyclotron frequency is
ωB = q B/(⟨γ⟩ m c), (A.28)

which in code units is simply
ω̂B = ĉ q̂ B̂/(⟨γ⟩ m̂), (A.29)

where m̂ is the numerical mass of the particle (i.e., m̂ = q̂ for pair plasma or m̂ = 1836 q̂ for protons). We note that the electron cyclotron frequency for B̂ = 1 and ⟨γ⟩ = 1 corresponds to ω̂B = ĉ. Both Eqs. (A.25) and (A.29) can be translated to physical units by simply inserting the numerical value of the time step Δt. By comparing Eqs. (A.24) and (A.25), and Eqs. (A.28) and (A.29), we also see that the numerical quantities are conveniently obtained by just replacing the physical quantities with the code variables.

Finally, the total plasma magnetization is given as
σ = B²/(4π n0 ⟨γ⟩ me c²) = (ωB,−/ωp)² = (ω̂B,−/ω̂p)². (A.30)

This demonstrates the usefulness of the unit system. Numerically, we perform fewer floating-point operations because the formulas appear unitless, our typical values are of the same order of magnitude, which reduces floating-point round-off errors, and physical quantities are easily transformed to the code unit system by replacing variables with code quantities (e.g., n = Nppc) and dropping the 4π factors from the Maxwell equations.

Appendix B Time and Space Staggering with Finite-difference Method

Numerical charge conservation and divergence-free B field preservation follow from the use of a charge-conserving current deposition strategy and the staggered Yee lattice (Yee 1966). See Esirkepov (2001) for a more in-depth discussion of the topic.

Let us define discrete versions of the ∇ operator as
∇⁺f = ( [f(i+1, j, k) − f(i, j, k)]/Δx, [f(i, j+1, k) − f(i, j, k)]/Δy, [f(i, j, k+1) − f(i, j, k)]/Δz ), (B.1)

and

∇⁻f = ( [f(i, j, k) − f(i−1, j, k)]/Δx, [f(i, j, k) − f(i, j−1, k)]/Δy, [f(i, j, k) − f(i, j, k−1)]/Δz ), (B.2)

where Δx, Δy, and Δz are the grid spacings. These operators have the following properties:
∇⁺ · (∇⁺ × F) = ∇⁻ · (∇⁻ × F) = 0, ∇⁺ · ∇⁻ = ∇⁻ · ∇⁺ = ∇²d, (B.3)

where ∇²d is the discrete Poisson operator in central differences such that
∇²d f = [f(i+1, j, k) − 2f(i, j, k) + f(i−1, j, k)]/Δx² + [f(i, j+1, k) − 2f(i, j, k) + f(i, j−1, k)]/Δy² + [f(i, j, k+1) − 2f(i, j, k) + f(i, j, k−1)]/Δz². (B.4)

Let us stagger our variables in space and time such that
Ex = Ex^n(i+1/2, j, k), (B.5)
Ey = Ey^n(i, j+1/2, k), (B.6)
Ez = Ez^n(i, j, k+1/2), (B.7)
Bx = Bx^(n+1/2)(i, j+1/2, k+1/2), (B.8)
By = By^(n+1/2)(i+1/2, j, k+1/2), (B.9)
Bz = Bz^(n+1/2)(i+1/2, j+1/2, k), (B.10)
with the charge density ρ defined at integer grid points and time steps, and the current density J staggered like E but at half-integer time steps.

Maxwell's equations for E and B are given as
E^(n+1) = E^n + Δt c ∇⁺ × B^(n+1/2) − Δt J^(n+1/2), (B.11)
B^(n+1/2) = B^(n−1/2) − Δt c ∇⁻ × E^n, (B.12)
∇⁺ · E^n = ρ^n, (B.13)
∇⁻ · B^(n+1/2) = 0, (B.14)
where the 4π factors are absorbed into the scalings of J and ρ.

The discrete form of the charge conservation then follows by operating with ∇⁺· on Eq. (B.11) and substituting (B.13) into the equation that follows. We then obtain
(ρ^(n+1) − ρ^n)/Δt = −∇⁺ · J^(n+1/2). (B.15)

Similarly, operating on Eq. (B.12) with ∇⁻· we obtain the divergence-free nature of the B field,
∇⁻ · B^(n+1/2) = ∇⁻ · B^(n−1/2). (B.16)

Thus, if
∇⁺ · E = ρ (B.17)

and
∇⁻ · B = 0 (B.18)

at t = 0, then the divergence of E is always equal to the charge density and B will retain its divergence-free nature up to machine precision during the simulation.
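The key discrete identity, that a one-sided divergence of the matching one-sided curl vanishes to round-off, can be checked numerically; the sketch below uses backward differences on a periodic, unit-spacing grid and omits the spatial staggering details of the production solver:

```python
import numpy as np

rng = np.random.default_rng(0)

def dminus(f, axis):
    """Backward difference on a periodic grid with unit spacing."""
    return f - np.roll(f, 1, axis=axis)

def curl_minus(Fx, Fy, Fz):
    """Curl built from backward differences (axes 0, 1, 2 = x, y, z)."""
    return (dminus(Fz, 1) - dminus(Fy, 2),
            dminus(Fx, 2) - dminus(Fz, 0),
            dminus(Fy, 0) - dminus(Fx, 1))

def div_minus(Fx, Fy, Fz):
    """Divergence built from the same backward differences."""
    return dminus(Fx, 0) + dminus(Fy, 1) + dminus(Fz, 2)

# arbitrary E field; B starts divergence-free (zero) and is evolved with
# the curl of E, so its divergence stays at machine precision
E = [rng.standard_normal((8, 8, 8)) for _ in range(3)]
B = [np.zeros((8, 8, 8)) for _ in range(3)]
for _ in range(10):                    # ten Faraday-like updates (dt = 1)
    C = curl_minus(*E)
    B = [b - c for b, c in zip(B, C)]
    assert np.abs(div_minus(*B)).max() < 1e-10
```

Note that the one-sided differences must match: mixing a forward curl with a backward divergence does not cancel exactly on the lattice.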

Appendix C Relativistic Boris Scheme

We present here the full relativistic Boris scheme that updates the particle four-velocity one time step forward. It can be found as the PUSHER solver named BorisPusher in the PIC module. In the beginning we have a particle with a four-velocity uⁿ in units of c. The Lorentz update is done via velocities in units of ĉ, since this is the effective speed of light experienced by the electromagnetic fields. To differentiate between these two units we use primed quantities to mark velocities in units of ĉ, u′ⁿ = ĉ uⁿ.

The scheme begins with the first half of the electric field acceleration,
u′⁻ = u′ⁿ + b0 Ê, (C.1)

where
b0 = q̂/(2 m̂), (C.2)

and m± = m/me is the mass of the particle in units of the electron rest mass me (in code units, m̂ = m± q̂). This four-velocity in units of ĉ corresponds to a Lorentz factor of
γ = √(1 + u′⁻ · u′⁻/ĉ²). (C.3)

We then proceed with the first half of the magnetic rotation,
u′∗ = u′⁻ + u′⁻ × t, (C.4)

where
t = b0 B̂/γ. (C.5)

As a final step we combine the second half of the magnetic rotation and the second half of the electric field acceleration as
u′ⁿ⁺¹ = u′⁻ + u′∗ × s + b0 Ê, with s = 2t/(1 + t · t). (C.6)

The final particle four-velocity in units of c is then
uⁿ⁺¹ = u′ⁿ⁺¹/ĉ, (C.7)

summarizing a recipe for updating the particle velocity from an initial uⁿ to uⁿ⁺¹ (i.e., one time step ∆t forward).
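For reference, the scheme is compact enough to write out in code; the sketch below is a textbook relativistic Boris step in c = 1 units (i.e., without the ĉ bookkeeping and code-unit scalings used internally):

```python
import numpy as np

def boris_push(u, E, B, qm, dt):
    """One relativistic Boris step: return the updated four-velocity
    u^{n+1} (in units of c) given fields at the particle position,
    charge-to-mass ratio qm, and time step dt (units with c = 1)."""
    eps = 0.5 * qm * dt * E
    u_minus = u + eps                          # first half of E acceleration
    gamma = np.sqrt(1.0 + u_minus @ u_minus)   # Lorentz factor
    t = 0.5 * qm * dt * B / gamma              # rotation vector
    s = 2.0 * t / (1.0 + t @ t)
    u_prime = u_minus + np.cross(u_minus, t)   # first half of B rotation
    u_plus = u_minus + np.cross(u_prime, s)    # second half of B rotation
    return u_plus + eps                        # second half of E acceleration

# pure magnetic field: the rotation conserves |u| (i.e., particle energy)
u0 = np.array([1.0, 0.0, 0.0])
u1 = boris_push(u0, E=np.zeros(3), B=np.array([0.0, 0.0, 1.0]), qm=1.0, dt=0.1)
```

The magnetic half-steps form an exact rotation, which is why the Boris pusher conserves particle energy in a pure magnetic field to round-off.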

Appendix D Current Deposition

We summarize here the charge-conserving ZigZag current deposition scheme (Umeda et al. 2003) implemented as the ZigZag solver in RUNKO. First, we compute the relay point xr between the previous particle position x1 (and its closest grid index i1 = (i1, j1, k1)) and the current position x2 (and closest grid index i2) as
xr = min{ min(i1, i2)Δx + Δx, max[ max(i1, i2)Δx, (x1 + x2)/2 ] }. (D.1)

Here we assume element-wise min and max operators as min{a, b} = (min{ax, bx}, min{ay, by}, min{az, bz}).

The current fluxes are then obtained as
F1 = q (xr − x1)/Δt, (D.2)
F2 = q (x2 − xr)/Δt. (D.3)

We assume first-order shape functions for the particles. These are related to particle weights defined at the midpoints (xr + x1)/2 and (x2 + xr)/2 as
W1 = (x1 + xr)/(2Δx) − i1, (D.4)
W2 = (x2 + xr)/(2Δx) − i2. (D.5)

Depending on the dimensionality, the weights are described as W1,2 = (W1,2;x) for D = 1, W1,2 = (W1,2;x, W1,2;y) for D = 2, or W1,2 = (W1,2;x, W1,2;y, W1,2;z) for D = 3. Finally, the current density J = FW needs to be injected into adjacent grid points following a charge-conserving deposition strategy. This can be implemented following Umeda et al. (2003).
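The relay-point construction is compact enough to sketch directly (element-wise, with the cell size Δx as the unit of length; a sketch, not the production implementation):

```python
import numpy as np

def relay_point(x1, x2, dx=1.0):
    """Relay point of the ZigZag scheme (Umeda et al. 2003), evaluated
    element-wise for positions given in units of the cell size."""
    i1 = np.floor(x1 / dx)  # closest grid index of the old position
    i2 = np.floor(x2 / dx)  # closest grid index of the new position
    return np.minimum(np.minimum(i1, i2) * dx + dx,
                      np.maximum(np.maximum(i1, i2) * dx, 0.5 * (x1 + x2)))

# same cell: the relay point is the midpoint, so the move is one segment
x1 = np.array([0.2, 0.7, 0.1])
x2 = np.array([0.8, 0.3, 0.9])
xr = relay_point(x1, x2)      # (0.5, 0.5, 0.5)

# cell crossing along x: the relay point sits on the shared cell face
xr2 = relay_point(np.array([0.8]), np.array([1.3]))  # 1.0
```

Since F1 + F2 = q(x2 − x1)/Δt by construction, splitting the trajectory at xr deposits the full current of the move, which is the essence of the charge-conserving property.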


  1. Bacchini, F., Ripperda, B., Philippov, A. A., & Parfrey, K. 2020, ApJS, 251, 10 [NASA ADS] [CrossRef] [Google Scholar]
  2. Birdsall, C., & Langdon, A. 1985, Plasma Physics via Computer Simulation, The Adam Hilger Series on Plasma Physics (New York: McGraw-Hill) [Google Scholar]
  3. Blinne, A., Schinkel, D., Kuschel, S., et al. 2018, Comput. Phys. Commun., 224, 273 [NASA ADS] [CrossRef] [Google Scholar]
  4. Boris, J. P. 1970, in Proc. Fourth Conf. Num. Sim. Plasmas, Naval Res. Lab, Washington, DC, 3 [Google Scholar]
  5. Bowers, K. J., Albright, B. J., Yin, L., Bergen, B., & Kwan, T. J. T. 2008, Phys. Plasmas, 15, 055703 [NASA ADS] [CrossRef] [Google Scholar]
  6. Brandenburg, A. 2003, in Advances in Nonlinear Dynamos (CRC Press), 269 [CrossRef] [Google Scholar]
  7. Bret, A., Gremillet, L., & Dieckmann, M. E. 2010, Phys. Plasmas, 17, 120501 [Google Scholar]
  8. Buneman, O. 1959, Phys. Rev., 115, 503 [NASA ADS] [CrossRef] [Google Scholar]
  9. Buneman, O., Neubert, T., & Nishikawa, K. I. 1992, IEEE Trans. Plasma Sci., 20, 810 [CrossRef] [Google Scholar]
  10. Buneman, O., Nishikawa, K.-I., & Neubert, T. 1993, in Plasma Physics and Controlled Nuclear Fusion (ITC-4), eds. H. T. D. Guyenne, & J. J. Hunt [Google Scholar]
  11. Cerutti, B., Werner, G. R., Uzdensky, D. A., & Begelman, M. C. 2012, ApJ, 754, L33 [NASA ADS] [CrossRef] [Google Scholar]
  12. Cerutti, B., Philippov, A., Parfrey, K., & Spitkovsky, A. 2015, MNRAS, 448, 606 [Google Scholar]
  13. Chapman, S., & Cowling, T. G. 1970, The Mathematical Theory of Non-Uniform Gases (Cambridge: Cambridge University Press) [Google Scholar]
  14. Chen, A. Y., & Beloborodov, A. M. 2014, ApJ, 795, L22 [NASA ADS] [CrossRef] [Google Scholar]
  15. Cheng, C. Z., & Knorr, G. 1976, J. Comput. Phys., 22, 330 [NASA ADS] [CrossRef] [Google Scholar]
  16. Comisso, L., & Sironi, L. 2018, Phys. Rev. Lett., 121 [Google Scholar]
  17. Comisso, L., & Sironi, L. 2019, ApJ, 886, 122 [Google Scholar]
  18. Courant, R., Friedrichs, K., & Lewy, H. 1928, Math. Ann., 100, 32 [Google Scholar]
  19. Dawson, J. 1962, Phys. Fluids, 5, 445 [NASA ADS] [CrossRef] [Google Scholar]
  20. Dawson, J. M. 1964, Phys. Fluids, 7, 419 [NASA ADS] [CrossRef] [Google Scholar]
  21. Derouillat, J., Beck, A., Pérez, F., et al. 2018, Comput. Phys. Commun., 222, 351 [NASA ADS] [CrossRef] [Google Scholar]
  22. Esirkepov, T. Z. 2001, Comput. Phys. Commun., 135, 144 [NASA ADS] [CrossRef] [Google Scholar]
  23. Frederiksen, J. T., Hededal, C. B., Haugbolle, T., & Nordlund, Â. 2004, ApJ, 608, L13 [NASA ADS] [CrossRef] [Google Scholar]
  24. Fried, B. D. 1959, Phys. Fluids, 2, 337 [NASA ADS] [CrossRef] [Google Scholar]
  25. Godfrey, B. B. 1974, J. Comput. Phys., 15, 504 [NASA ADS] [CrossRef] [Google Scholar]
  26. Greenwood, A. D., Cartwright, K. L., Luginsland, J. W., & Baca, E. A. 2004, J. Comput. Phys., 201, 665 [NASA ADS] [CrossRef] [Google Scholar]
  27. Haggerty, C. C., & Caprioli, D. 2019, ApJ, 887, 165 [NASA ADS] [CrossRef] [Google Scholar]
  28. Higuera, A. V., & Cary, J. R. 2017, Phys. Plasmas, 24, 052104 [NASA ADS] [CrossRef] [Google Scholar]
  29. Hockney, R. W., & Eastwood, J. W. 1981, Computer Simulation Using Particles (New York: McGraw-Hill) [Google Scholar]
  30. Howes, G. G., Cowley, S. C., Dorland, W., et al. 2006, Astrophys. J., 651, 590 [NASA ADS] [CrossRef] [Google Scholar]
  31. Howes, G. G., Dorland, W., Cowley, S. C., et al. 2008, Phys. Rev. Lett., 100 [Google Scholar]
  32. Hunter, J. D. 2007, Comput. Sci. Eng., 9, 90 [Google Scholar]
  33. Jackson, J. D. 1975, Classical Electrodynamics (New York: Wiley) [Google Scholar]
  34. Jakob, W., Rhinelander, J., & Moldovan, D. 2017, Pybind11 - Seamless operabil-ity between C++11 and Python, [Google Scholar]
  35. Kagan, D., Milosavljevic, M., & Spitkovsky, A. 2013, ApJ, 774, 41 [NASA ADS] [CrossRef] [Google Scholar]
  36. Komissarov, S. S., 2002, MNRAS, 336, 759 [Google Scholar]
  37. Kumar, P., & Narayan, R. 2009, MNRAS, 395, 472 [NASA ADS] [CrossRef] [Google Scholar]
  38. Langdon, A. B., Arons, J., & Max, C. E. 1988, Phys. Rev. Lett., 61, 779 [CrossRef] [Google Scholar]
  39. MacDonald, N. R., & Marscher, A. P. 2018, ApJ, 862, 58 [CrossRef] [Google Scholar]
  40. Nordlund, Å., Ramsey, J. P., Popovas, A., & Küffmeier, M. 2018, MNRAS, 477, 624 [NASA ADS] [CrossRef] [Google Scholar]
  41. Nättilä, J., & Beloborodov, A. M. 2021, ApJ, 921, 1 [CrossRef] [Google Scholar]
  42. Nättilä, J., & Beloborodov, A. M. 2022, Phys. Rev. Lett., 128, 7 [CrossRef] [Google Scholar]
  43. Philippov, A. A., & Spitkovsky, A. 2014, ApJ, 785, L33 [NASA ADS] [CrossRef] [Google Scholar]
  44. Plotnikov, I., Grassi, A., & Grech, M. 2018, MNRAS, 477, 5238 [Google Scholar]
  45. Servidio, S., Valentini, F., Perrone, D., et al. 2014, J. Plasma Phys., 81 [Google Scholar]
  46. Sironi, L., & Spitkovsky, A. 2009, ApJ, 698, 1523 [NASA ADS] [CrossRef] [Google Scholar]
  47. Sironi, L., & Spitkovsky, A. 2014, ApJ, 783, L21 [NASA ADS] [CrossRef] [Google Scholar]
  48. Sironi, L., Spitkovsky, A., & Arons, J. 2013, ApJ, 771, 54 [Google Scholar]
  49. Spitkovsky, A. 2005, in American Institute of Physics Conference Series, Astrophysical Sources of High Energy Particles and Radiation, eds. T. Bulik, B. Rudak, & G. Madejski, 801, 345 [NASA ADS] [CrossRef] [Google Scholar]
  50. Tanaka, M. 1983, Phys. Fluids, 26, 1697 [NASA ADS] [CrossRef] [Google Scholar]
  51. Umeda, T., & Omura, Y. 2005, ISSS-7, 26 [Google Scholar]
  52. Umeda, T., Omura, Y., Tominaga, T., & Matsumoto, H. 2003, Comput. Phys. Commun., 156, 73 [NASA ADS] [CrossRef] [Google Scholar]
  53. Valentini, F., Trávníček, P., Califano, F., Hellinger, P., & Mangeney, A. 2007, J. Comput. Phys., 225, 753 [NASA ADS] [CrossRef] [Google Scholar]
  54. van der Walt, S., Colbert, S. C., & Varoquaux, G. 2011, Comput. Sci. Eng., 13, 22 [Google Scholar]
  55. Vay, J.-L. 2008, Phys. Plasmas, 15, 056701 [NASA ADS] [CrossRef] [Google Scholar]
  56. Verlet, L. 1967, Phys. Rev., 159, 98 [CrossRef] [Google Scholar]
  57. Weibel, E. S. 1959, Phys. Rev. Lett., 2, 83 [NASA ADS] [CrossRef] [Google Scholar]
  58. Werner, G. R., Uzdensky, D. A., Cerutti, B., Nalewajko, K., & Begelman, M. C. 2015, ApJ, 816, L8 [NASA ADS] [CrossRef] [Google Scholar]
  59. Winske, D., Yin, L., Omidi, N., et al. 2003, Hybrid Simulation Codes: Past, Present and Future – A Tutorial, eds. J. Büchner, C. Dum, & M. Scholer, 615, 136 [NASA ADS] [Google Scholar]
  60. Yee, K. 1966, IEEE Trans. Antennas Propagation, 14, 302 [CrossRef] [Google Scholar]
  61. Yuan, F., & Narayan, R. 2014, ARA&A, 52, 529 [NASA ADS] [CrossRef] [Google Scholar]
  62. Zenitani, S., & Hoshino, M. 2001, ApJ, 562, L63 [NASA ADS] [CrossRef] [Google Scholar]
  63. Zhdankin, V., Uzdensky, D. A., Werner, G. R., & Begelman, M. C. 2017a, MNRAS, 474, 2514 [Google Scholar]
  64. Zhdankin, V., Werner, G. R., Uzdensky, D. A., & Begelman, M. C. 2017b, Phys. Rev. Lett., 118 [Google Scholar]


Alternatively, such a system can be modeled by solving the partial differential equations directly on a grid (known as the Vlasov method; Cheng & Knorr 1976).


The C++14 and C++17 standards are generally considered "transition standards" that bridge the gap between the older C++11 and the newer C++20 standard. In the future, RUNKO will transition to fully conform to the C++20 standard as soon as sufficient HPC compiler support can be guaranteed. At the moment, many of the newer C++20 template metaprogramming features are shipped together with RUNKO itself, implemented in C++11.


We use the standard C++ syntax to present template classes: an object A with a template parameter X is given as A<X>.


Technically, a module corresponds to a namespace in C++ and a subpackage in PYTHON.


Kebnekaise is a Lenovo cluster with 19288 cores (mainly Intel Xeon E5-2690v4 Broadwell CPUs with a 2.6 GHz base frequency) and 64 GB of memory per NUMA island in a node, and has FDR InfiniBand switches for communication.


For one particular test we partitioned a 3D PIC shock simulation grid of 7680 × 120² cells with tiles of size 8³, 15³, or 30³ (keeping the number of processors fixed at 1280 cores) and found that the absolute wall-clock time per lap varies as 1.34, 0.66, and 1.22 s, respectively. Smaller but more numerous tiles incur more expensive communication routines, whereas larger tiles have worse solver performance. The optimal size in this particular case is for tiles of roughly 15³ cells.

All Tables

Table 1

Time-advance loop of particle-in-cell module.

Table 2

List of RUNKO modules and solvers.

All Figures

Fig. 1

Schematic presentation of the PIC code’s class relationship and connectivity. The FIELDS module is built on top of the CORGI module that provides the spatial simulation grid. It contains the electromagnetic fields Ê and B̂, and the methods to advance them in time. The PIC module handles the computational particles. It is coupled to the FIELDS module via the current ĵ induced by the moving charged particles. Similarly, the FIELDS module is coupled to the PIC module by the Lorentz force.

Fig. 2

Simulations of Weibel instability growth rates in relativistic counter-streaming plasmas. The different curves show the magnetic field energy density B²/8π (normalized by the relativistic plasma enthalpy density, γbn0mec²) as a function of time tωp for beam Lorentz factors of γb = 3 (blue), 10 (orange), 30 (green), and 100 (red). The dashed lines show the expected analytic growth rates.

Fig. 3

Visualization of a relativistic collisionless shock in unmagnetized pair plasma at tωp ≈ 500. The box periphery shows the plasma number density n/n0, and the exploded panels the electric current density along the flow Jx/en0c. The shock is located in the middle of the box at x = 0; the same region shows strong current filaments that are responsible for mediating the shock.

Fig. 4

Visualizations of the kinetic turbulence simulations. The top row shows the honeycomb-like domain decomposition that partitions the grid between different processors with the CORGI library (the color map is cyclic). The bottom row shows the plasma density (in units of the initial plasma mass density ρ0). The left panels show the full simulation domain, whereas the right panels focus on a smaller subregion that better illustrates the small-scale structures.

Fig. 5

Code performance analysis for 5120² resolution turbulence runs with 1024 cores. The black line shows the total push time; the other curves show the individual results for each component in the loop (see Table 1). In a perfectly load-balanced state (n ≲ 200) the total particle push time is measured to be around 0.7 µs, whereas in a strongly load-imbalanced state an average time of about 1.1 µs is obtained.

Fig. 6

Weak scaling results of the PIC code measured in a strongly load-imbalanced simulation state. Scaling is presented in terms of the mean particle push time per processor for various simulations that have a fixed ratio of total number of particles to processors. The black line shows the total mean push time; the other curves show the individual results for each component in the loop (see Table 1). Ideal scaling corresponds to a horizontal line, although there is a slight increase in the evaluation time due to more time spent on communication routines as the number of processors increases.

Fig. 7

Strong scaling results of the PIC code measured in a strongly load-imbalanced simulation state. Results are for a 5120² box size with 128 particles per cell. The black line shows the total evaluation time per lap; the other curves show the individual results for each component in the loop (see Table 1). The ideal scaling corresponds to the thick black dotted line.

