A&A, Volume 664, August 2022
Article Number: A68
Number of pages: 18
Section: Numerical methods and codes
DOI: https://doi.org/10.1051/0004-6361/201937402
Published online: 09 August 2022
Runko: Modern multiphysics toolbox for plasma simulations
^{1} Nordita, KTH Royal Institute of Technology and Stockholm University, Hannes Alfvéns väg 12, 10691 Stockholm, Sweden
^{2} Physics Department and Columbia Astrophysics Laboratory, Columbia University, 538 West 120th Street, New York, NY 10027, USA
^{3} Center for Computational Astrophysics, Flatiron Institute, 162 Fifth Avenue, New York, NY 10010, USA
e-mail: jnattila@flatironinstitute.org
Received: 23 December 2019
Accepted: 10 April 2022
RUNKO is a new open-source plasma simulation framework implemented in C++ and Python. It is designed to function as an easy-to-extend general toolbox for simulating astrophysical plasmas with different theoretical and numerical models. Computationally intensive low-level kernels are written in modern C++, taking advantage of polymorphic classes, multiple inheritance, and template metaprogramming. High-level functionality is operated with Python scripts. The hybrid program design ensures good code performance together with ease of use. The framework has a modular object-oriented design that allows the user to easily add new numerical algorithms to the system. The code can be run on various computing platforms, ranging from laptops (shared-memory systems) to massively parallel supercomputer architectures (distributed-memory systems). The framework supports heterogeneous multiphysics simulations in which different physical solvers can be combined and run simultaneously. Here, we showcase the framework's relativistic particle-in-cell (PIC) module by presenting (i) 1D simulations of the relativistic Weibel instability, (ii) 2D simulations of relativistic kinetic turbulence in a suddenly stirred magnetically dominated pair plasma, and (iii) 3D simulations of collisionless shocks in an unmagnetized medium.
Key words: plasmas / turbulence / methods: numerical
© J. Nättilä 2022
Open Access article, published by EDP Sciences, under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
This article is published in open access under the Subscribe-to-Open model. Subscribe to A&A to support open access publication.
1 Introduction
Ever since the introduction of computers, numerical simulations have been used to study the nonlinear behavior of plasma (see, e.g., Buneman 1959; Dawson 1962, 1964). In the early days, the research was mainly motivated by studies of basic plasma instabilities and confinement in fusion experiments, but the method of computationally solving the dynamics of charged particles quickly gained popularity also in the study of space plasmas (see, e.g., Tanaka 1983; Langdon et al. 1988; Buneman et al. 1992).
An important factor in accelerating the use of plasma simulations is the ever-increasing computational speed and number of processors. We have recently arrived in the petaflop (10^{15} floating-point operations per second, FLOPS) supercomputing era and have started to pave our way toward exascale computing systems (10^{18} FLOPS). This technological surge (and the ingenuity of the researchers themselves) has enabled us to shed light on many long-standing issues in high-energy astrophysics, including shocks (e.g., Frederiksen et al. 2004; Spitkovsky 2005), reconnection (e.g., Zenitani & Hoshino 2001; Cerutti et al. 2012; Kagan et al. 2013; Sironi & Spitkovsky 2014; Werner et al. 2015), pulsar magnetospheres (e.g., Philippov & Spitkovsky 2014; Chen & Beloborodov 2014; Cerutti et al. 2015), kinetic turbulence (e.g., Howes et al. 2008; Servidio et al. 2014; Zhdankin et al. 2017b; Comisso & Sironi 2018; Nättilä & Beloborodov 2021), and nonlinear wave interactions (e.g., Nättilä & Beloborodov 2022).
In order to keep riding this technological wave, successful numerical codes need to be highly scalable and modern so that they can take advantage of the newest supercomputing architectures. These kinds of exascale-ready features include the avoidance of global communication patterns and support for heterogeneous computing nodes with varying resources, such as different numbers of cores/accelerators or varying amounts of memory per core (see also Nordlund et al. 2018, for a discussion). The RUNKO framework presented here was developed from scratch to support these features.
An additional feature of plasma simulations is that there are various theoretical frameworks for modeling the dynamics of the system, whereas the underlying numerical algorithms all rely on techniques based on computational particles and multidimensional grids. The fully kinetic description is based on sampling the distribution function with superparticles and representing the electromagnetic fields on a grid, known as the particle-in-cell (PIC) approach (Hockney & Eastwood 1981; Birdsall & Langdon 1985)^{1}. In the case of a strong (background) magnetic field, it is possible to average over the gyrorotation of the particle's gyro-orbit and use the gyrokinetic description where, instead of particles, computational "rings" are considered (see, e.g., Howes et al. 2006). Alternatively, the kinetic systems can be significantly simplified by considering only the moments of the distribution function; this macroscale description is related to the magnetohydrodynamic (MHD) models (see, e.g., Brandenburg 2003). The MHD description itself can be divided into multiple categories with different closure approximations (e.g., ideal, visco-resistive) or with the evolution of multiple fluid species (known as the multi-fluid description). There is also a class of hybrid methods that combine both kinetic and magnetohydrodynamic models, for example by treating electrons with the fluid description and protons with the kinetic description (see, e.g., Winske et al. 2003; Valentini et al. 2007; Haggerty & Caprioli 2019). From the numerical perspective, all of these methods, in the end, only require the efficient manipulation of particles and/or physical quantities that are stored on grids.
It is possible to take advantage of this structural similarity of the various formulations with well-designed software abstractions. This is the second core idea of the RUNKO simulation framework, which in essence is just a collection of data structures for storing, manipulating, and communicating particles and grids efficiently. The framework relies on modern computer science methods and paradigms that provide a sufficient degree of problem generalization to enable the easy introduction of physics knowledge into the simulations. Technically, this is achieved with polymorphic multiple inheritance, as supported by modern C++. The concept of object inheritance also nicely reflects the mathematical way of constructing hierarchical models that build on one another. Another level of abstraction is obtained by using template metaprogramming, which allows, for example, the parameterization of the dimensionality of the system, again closely reflecting the mathematical form of the problem. Finally, in order to balance the computational efficiency and the user-friendliness of the code, the fast low-level C++ classes are exposed to the high-level PYTHON 3 scripting language.
The complete framework is provided as open-source software for the community. It is available from a GitHub repository^{2}, and the code design and structure are presented in detail in this article. Here, we also present the PIC code of the framework and describe the implementation of the solvers needed to operate it. First, the theoretical background of the kinetic modeling of plasma is briefly reviewed in Sect. 2. The numerical implementation of the framework is then discussed in Sect. 3. As an example, the use of the new PIC code is demonstrated on astrophysical plasma problems in Sect. 4. A discussion of new features and the future directions of the framework is given in Sect. 5. Finally, the content is summarized in Sect. 6.
2 Theory
2.1 Kinetic Plasma Theory
The kinetic plasma theory we adopt here is based on the special relativistic formulation of the Vlasov–Boltzmann equation. The spatial coordinate location vector is given by x ≡ (x, y, z) and coordinate time is measured by t. Coordinate velocity (three-velocity) is given by υ ≡ d_{t}x and the individual Cartesian components of the velocity are denoted as υ ≡ (υ_{x}, υ_{y}, υ_{z}). Proper time τ (measured with a comoving clock) is connected to the coordinate time via the Lorentz factor γ ≡ d_{τ}t. The proper velocity (spatial components of the four-velocity) is u ≡ d_{τ}x = γυ. The Lorentz factor and the velocities are connected by the expression

γ = (1 − υ²/c²)^{−1/2} = (1 + u²/c²)^{1/2}, (1)

where c is the speed of light, u = |u|, and υ = |υ|. Acceleration is denoted with a ≡ d_{t}u.
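Equation (1) is easy to verify numerically; the following minimal Python sketch checks the two equivalent forms of the Lorentz factor against each other:

```python
import math

def lorentz_factor_from_u(u, c=1.0):
    """Lorentz factor from the proper velocity magnitude u = |u| (Eq. 1)."""
    return math.sqrt(1.0 + (u / c) ** 2)

def lorentz_factor_from_v(v, c=1.0):
    """Lorentz factor from the coordinate velocity magnitude v = |v| (Eq. 1)."""
    return 1.0 / math.sqrt(1.0 - (v / c) ** 2)

# consistency check: u = gamma * v, so both forms must agree
u = 2.0                       # proper velocity in units of c
g = lorentz_factor_from_u(u)  # gamma = sqrt(1 + u^2)
v = u / g                     # corresponding coordinate velocity
assert abs(lorentz_factor_from_v(v) - g) < 1e-12
```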
A sixdimensional phasespace density distribution for particle species s is given by f_{s} ≡ f_{s}(x, u; t). Thus, ƒ_{s}d^{3}xd^{3}u is the number of particles in the sixdimensional differential phase space volume between x, u and x + dx, u + du.
The evolution of f_{s} is governed by the relativistic Boltzmann–Vlasov equation

∂_{t} f_{s} + υ · ∇_{x} f_{s} + a · ∇_{u} f_{s} = C, (2)

where ∇_{x} and ∇_{u} are the spatial and momentum parts of the differential operator ∇, respectively. The term on the right-hand side, defined as C ≡ ∂_{t} f_{s}|_{coll}, is the collision operator. For the Vlasov equation C = 0 (i.e., the plasma is collisionless). Acceleration of a charged particle a_{s} is governed by the Lorentz force

a_{s} = (q_{s}/m_{s}) (E + υ/c × B), (3)

where E and B are the electric and magnetic fields, q_{s} is the charge, and m_{s} is the mass of the particle of species s.
The moments of the distribution function define macroscopic (bulk) quantities of the plasma. The zeroth moment of the distribution function f_{s} defines the charge density of species s as

ρ_{s} = q_{s} ∫ f_{s} d³u. (4)

The total charge density is ρ = ∑_{s} ρ_{s}. The first moment defines the current (charge flux) vector as

J_{s} = q_{s} ∫ υ f_{s} d³u. (5)
The total current is J = ∑_{s} J_{s}.
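In a PIC code these moments are estimated from the superparticle samples. The following Python sketch is an illustration of that estimate, not RUNKO's actual kernel; the function name and argument layout are hypothetical:

```python
import numpy as np

def charge_moments(q, u, weights, volume, c=1.0):
    """Estimate the zeroth and first moments (Eqs. 4-5) from superparticles.

    q: species charge, u: (N, 3) proper velocities, weights: (N,) particle
    weights, volume: cell volume. Returns (rho, J) for the sampled cell."""
    gamma = np.sqrt(1.0 + np.sum((u / c) ** 2, axis=1))
    v = u / gamma[:, None]                                # coordinate velocities
    rho = q * weights.sum() / volume                      # charge density (Eq. 4)
    J = q * (weights[:, None] * v).sum(axis=0) / volume   # current density (Eq. 5)
    return rho, J
```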
2.2 Maxwell’s Equations
Evolution of electric field E and magnetic field B is governed by Maxwell's equations. These are Gauss's law

∇ · E = 4πρ, (6)

the solenoidal constraint

∇ · B = 0, (7)

Faraday's law

∂_{t} B = −c ∇ × E, (8)

and Ampère's law

∂_{t} E = c ∇ × B − 4πJ. (9)

Charge conservation follows from these equations by taking a divergence of Eq. (9) and substituting Eq. (6) to get

∂_{t} ρ + ∇ · J = 0. (10)
2.3 Dimensionless Equations
We now describe the actual dimensionless equations that are solved numerically. A unit system similar to that used in TRISTAN and TRISTAN-MP (Buneman et al. 1993; Spitkovsky 2005; Sironi et al. 2013) is used. The derivation and a more thorough discussion of this unit system are given in Appendix A.
Many quantities, such as the electromagnetic fields, are defined on a lattice (mesh). The discrete location of a point in this case is given as x^{(i,j,k)} ≡ x(i, j, k) = (i∆x, j∆y, k∆z), where i, j, k are the grid indices and ∆x, ∆y, and ∆z are the grid point distances in each Cartesian dimension. Similarly, discretized time is given as t^{n} ≡ t(n) = n∆t. Cells of the lattice are taken to have a fixed cube geometry: ∆x = ∆y = ∆z. The Courant (or Courant–Friedrichs–Lewy; Courant et al. 1928) number is defined as

ĉ ≡ c ∆t/∆x, (11)

which represents the maximum numerical velocity of a signal traveling on the grid. For explicit time integration schemes ĉ ≤ 1.
Electromagnetic fields are normalized with B_{0} as Ê = E/B_{0} and B̂ = B/B_{0}. Similarly, current density is normalized with J_{0} as ĵ = J/J_{0}. The values of these normalization factors are selected such that the numerically solved equations appear as if ∆x = ∆t = 1. This also means that the grid indices i, j, and k have the same numerical value as the location x.
2.3.1 Electromagnetic Field Module
The time evolution of electromagnetic fields is handled with a finite-difference time-domain (FDTD) method. A Yee lattice (Yee 1966) is used for the E and B fields such that they are staggered both in space and in time: the components of

Ê^{n}: (i+1/2, j, k), (i, j+1/2, k), (i, j, k+1/2), (12)
B̂^{n+1/2}: (i, j+1/2, k+1/2), (i+1/2, j, k+1/2), (i+1/2, j+1/2, k), (13)

where Ê is located in the middle of the cell sides and B̂ in the center of the cell faces (the three staggered locations correspond to the x, y, and z components, respectively). This makes it easy to calculate the curl of the fields in the subsequent equations.

In the time domain we update the Ê and B̂ fields with discrete forms of Eqs. (8) and (9) given as

B̂^{n+1/2} = B̂^{n−1/2} − ĉ ∆[Ê^{n}]_{x}, (14)
Ê^{n+1} = Ê^{n} + ĉ ∆[B̂^{n+1/2}]_{x} − ĵ, (15)

where ∆[Q]_{t,x} is the time differentiation or curl of a variable Q without the ∆x or ∆t denominator. The only normalization factor entering these equations is the Courant velocity ĉ since everything else is absorbed in B_{0} and J_{0}. There is no need to solve Eqs. (6) and (7) if the charge-conserving scheme is used together with the Yee staggering of the fields (see Appendix B for details).
2.3.2 Particle-in-cell Module
The main purpose of the PIC module is to update particle velocities according to the Lorentz force. We express the spatial component of the four-velocity u in units of c. The particle species specifier s is omitted in this section for brevity.
The Lorentz force acting on a charged particle is

d_{t} u = (q/m) (E + υ/c × B). (16)

We simplify this expression again to make it appear unitless. We express the charge and mass in code units as q̂ = q/q_{0} and m̂ = m/m_{0}, with the normalization constants q_{0} and m_{0} defined in Appendix A. The right-hand side of the Lorentz force equation is then

ĉ (q̂/m̂) (Ê + û/γ × B̂). (17)
The actual velocity update is done in parts since Ê and B̂ are staggered in time. As an example of the particle momentum update, the relativistic Boris scheme (Boris 1970) is presented in detail in Appendix C. In addition to the particle velocity update routines, we need a scheme to interpolate the (staggered) Ê and B̂ fields to the arbitrary particle locations. This interpolation is typically implemented using a linear volume-averaging interpolation scheme, although higher-order schemes can also be utilized.
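For reference, a minimal Python sketch of the relativistic Boris velocity update is given below. The exact normalized form used in RUNKO is the one in Appendix C, so treat the code-unit conventions here as illustrative:

```python
import numpy as np

def boris_push(u, E, B, qm, dt):
    """One relativistic Boris velocity update (Boris 1970).

    u: proper velocity in units of c; E, B: fields already interpolated
    to the particle position; qm: charge-to-mass ratio in code units."""
    # first half of the electric kick
    u_minus = u + 0.5 * qm * dt * E
    gamma = np.sqrt(1.0 + np.dot(u_minus, u_minus))
    # magnetic rotation (norm-preserving by construction)
    t = 0.5 * qm * dt * B / gamma
    s = 2.0 * t / (1.0 + np.dot(t, t))
    u_prime = u_minus + np.cross(u_minus, t)
    u_plus = u_minus + np.cross(u_prime, s)
    # second half of the electric kick
    return u_plus + 0.5 * qm * dt * E

# sanity check: a pure magnetic field only rotates u, so |u| is conserved
u0 = np.array([1.0, 0.0, 0.0])
u1 = boris_push(u0, E=np.zeros(3), B=np.array([0.0, 0.0, 1.0]), qm=1.0, dt=0.1)
assert abs(np.linalg.norm(u1) - np.linalg.norm(u0)) < 1e-12
```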
After the velocity update, we advance the particle's location. The particle's position x is advanced in time with

x̂^{n+1} = x̂^{n} + ĉ û^{n+1/2}/γ^{n+1/2}. (18)

The described particle propagation scheme is second order in time. The velocity is stored in the middle of the time steps and updated as û^{n−1/2} → û^{n+1/2}; the location is stored at integer time steps and updated as x̂^{n} → x̂^{n+1}.
Finally, when a charged particle moves on the grid it creates a current ĵ that induces an electric field via Eq. (9). We use the charge-conserving ZigZag current deposition scheme to deposit the current from the computational particles onto the mesh (Umeda et al. 2003). This scheme is summarized in Appendix D.
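As an illustration of the idea, a 1D reduction of the ZigZag splitting can be sketched in a few lines of Python. The helper below is hypothetical and is not RUNKO's deposit kernel; see Appendix D and Umeda et al. (2003) for the full multidimensional scheme:

```python
import math
import numpy as np

def zigzag_deposit_1d(jx, x1, x2, q, dt=1.0):
    """Simplified 1D ZigZag-style charge-conserving current deposit.

    The move x1 -> x2 is split at a relay point so that each segment stays
    within one cell; each segment then deposits its charge flux into that
    cell's entry of the current array jx (grid spacing of 1)."""
    i1, i2 = math.floor(x1), math.floor(x2)
    # relay point: the shared cell boundary if the particle crossed one,
    # otherwise the midpoint of the move (Umeda et al. 2003, 1D reduction)
    xr = min(min(i1, i2) + 1.0, max(max(i1, i2), 0.5 * (x1 + x2)))
    jx[i1] += q * (xr - x1) / dt  # first segment, deposited in cell i1
    jx[i2] += q * (x2 - xr) / dt  # second segment, deposited in cell i2

# a boundary-crossing move: the total deposited current equals q * v
jx = np.zeros(8)
zigzag_deposit_1d(jx, x1=2.8, x2=3.2, q=1.0)
assert abs(jx.sum() - 0.4) < 1e-12
```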
3 Numerical Implementation
In this section, we discuss the general design principles and structure of the framework itself. RUNKO is a hybrid C++/PYTHON code: low-level computationally intensive kernels are written in C++, whereas the high-level functionality is operated by PYTHON scripts. The design relies heavily on the object-oriented programming paradigm (supported naturally by both C++ and PYTHON), where data containers (attributes) and functions (methods) are grouped together into objects that interact with each other. In RUNKO we call a group of these objects modules (Sect. 3.2) and the objects themselves are called solvers (Sect. 3.3). The solvers are operated by different kinds of drivers (Sect. 3.4). This kind of hybrid use of C++ and PYTHON ensures a fast and well-optimized low-level code, whereas the PYTHON scripts allow for ease of use and straightforward extensibility of the code.
The low-level C++ kernels rely heavily on the template metaprogramming features of modern C++ (currently conforming to the C++14 standard^{3}). In template metaprogramming, a code template is provided to the compiler, which then generates the actual source code based on it, given some extra information on how to fill the template parameters^{4}. Many of the implemented C++ kernels take advantage of this by only having a D-dimensional general template of the algorithm. This template is then specialized to the required spatial dimension (typically D = 1, 2, or 3) by the compiler at compile time. This translates to less bug-prone code, as typically only one implementation of the algorithm is required.
The low-level C++ kernels are operated by user-friendly PYTHON scripts. Technically, the C++ kernels are exposed to PYTHON using the PYBIND11 library (Jakob et al. 2017). All the exposed C++ objects behave like native PYTHON objects. No unnecessary memory copies are performed because PYBIND11 manages the memory pointers. After installing RUNKO, these objects are available to PYTHON by importing them from the pyrunko package (and pycorgi, as described in Sect. 3.1). Not only can these classes be easily used from the PYTHON scripts, but they also support inheritance in case the user wants to extend and modify the objects. This allows for fast prototyping, as new classes can first be developed in PYTHON before implementing them in C++ (if better performance is needed). These PYTHON scripts also leverage NUMPY (van der Walt et al. 2011) and SCIPY for performance and convenience, together with MATPLOTLIB (Hunter 2007) for visualization.
3.1 Grid Infrastructure
RUNKO is built on top of the massively parallel CORGI^{5} C++ template library (Nättilä, in prep.). CORGI is a modern parallelized grid infrastructure library that provides the (spatial) simulation grid. It is designed to be run on distributed-memory systems (i.e., computing clusters and supercomputers) that require explicit inter-node communication. All inter-node communications are handled by the CORGI library using the Message Passing Interface (MPI) library^{6}. Hereafter, one MPI process is called a rank, to distinguish it from a processor, because a rank does not necessarily have to coincide with a physical processor.
CORGI uses a patch-based superdecomposition strategy to partition the simulation grid between different ranks. This means that the full simulation grid is split into small (preferably continuous) subregions called tiles. One tile can, for example, host a 10 × 10 subgrid in a 2D simulation. If the complete simulation domain is composed of a 10^{4} × 10^{4} lattice, it can be tessellated with 10^{3} × 10^{3} of these tiles.
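The bookkeeping of this example decomposition is straightforward and can be sketched directly, using the numbers quoted above:

```python
# Patch-based decomposition arithmetic for the example in the text:
# a 10^4 x 10^4 cell domain tessellated with 10 x 10 cell tiles.
domain = (10_000, 10_000)  # full lattice size in cells
tile = (10, 10)            # subgrid hosted by one tile
tiles = tuple(d // t for d, t in zip(domain, tile))
assert tiles == (1_000, 1_000)  # 10^3 x 10^3 tiles in total
```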
The implementation of a physical simulation code works by inheriting all the grid infrastructure methods from a corgi::Tile<D> template class. This derived template class then automatically holds all the relevant physical attributes and methods needed to perform the simulation communications. The tile object provided by CORGI is a template class that is parameterized with the spatial dimension D. Specializing this dimension parameter to any given value sets the simulation grid dimensionality.
All the tiles are stored in a container provided by CORGI called a grid. Each program rank has its own grid object corgi::Grid<D> (specialized for D) that holds a varying number of tiles. This allows us to decompose the grid so that each rank only holds and processes a small part of the simulation domain (i.e., a set of tiles). The grid class is responsible for providing all the spatial neighborhood information of the tiles and executing the inter-tile communications. These include both local shared-memory intra-rank communications and global inter-rank MPI communications. In practice, these methods are typically used to keep the halo regions of the tiles up to date.
There are no restrictions on the shape of the tile boundaries between different ranks. This allows more complex tile ownership patterns between ranks that aim to maximize the volume and minimize the surface area of each rank's domain, like a honeycomb tessellation in 2D. Such a configuration translates directly to less communication between ranks.
Both the corgi::Tile and corgi::Grid C++ classes are exposed to PYTHON. Dimensionality (i.e., the D template parameter) is automatically set by loading the correct subpackage from the RUNKO (and CORGI) PYTHON libraries: the options are oneD for one-dimensional simulations (N_{x} sized tiles; 𝒩_{x} sized global grid configuration), twoD for two-dimensional simulations (N_{x} × N_{y} sized tiles; 𝒩_{x} × 𝒩_{y} sized global grid configuration), and threeD for three-dimensional simulations (N_{x} × N_{y} × N_{z} sized tiles; 𝒩_{x} × 𝒩_{y} × 𝒩_{z} sized global grid configuration). Here N_{x,y,z} is the subgrid resolution of the tiles and 𝒩_{x,y,z} is the simulation grid resolution in units of the tiles.
3.2 Modules
The actual physics of the simulation is implemented in derived tiles that are based on the CORGI base tiles. We call these tiles and all the related objects and functions modules^{7}. An important part of the framework design is that different modules can be combined with polymorphic multiple inheritance (i.e., deriving methods and attributes from multiple other modules). This allows an efficient reuse of different modules when designing a new module with similar or overlapping physics concepts.
As an example, the PIC code (Sect. 2.3.2) uses corgi::Tile as its first base class to inherit the grid infrastructure and fields::Tile as its second base class to inherit the electromagnetic FDTD routines (Sect. 2.3.1). The PIC module itself only defines the particle containers and methods to manipulate particle movement. The two modules are connected by the current ĵ that is created by the moving charged particles. This current then induces a change in the Ê and B̂ fields that affects the Lorentz force experienced by the particles (see Fig. 1).
3.3 Solvers
Each module can contain several solvers that operate on and/or change the internal attributes of the module's tiles. Solvers are created by first defining an interface with an abstract class. The actual solver implementations that can be instantiated (i.e., the concrete classes) can then be implemented by creating a derived class from this abstract interface. These abstract interface classes are bound to PYTHON objects via the "trampoline" classes. This automates the binding of any user-defined concrete solver implementations. A typical requirement of any solver is that it should provide a solve method that takes (at least) the module's tile as an input.
As an example, the PIC module defines an abstract INTERPOLATOR solver interface. The purpose of this solver object is to interpolate the Ê and B̂ fields, defined on a lattice, to arbitrary particle locations. One of the currently provided INTERPOLATOR instances is the LINEAR solver that implements a linear first-order interpolation scheme to approximate the field values between the grid points. Thanks to this design pattern, it is very easy to extend the module to include other interpolation methods, such as a second-order polynomial or a spectral interpolation. The framework is agnostic about how the job of the solver is actually fulfilled; it is enough that it conforms to the abstract interface class standard, after which it can be automatically incorporated into the module routines.
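The same design pattern can be sketched in plain Python with an abstract base class. The class names below mirror the text, but the real interfaces are C++ classes exposed through the trampoline bindings, so treat this as an illustration only:

```python
from abc import ABC, abstractmethod

class Interpolator(ABC):
    """Abstract interface: interpolate lattice fields to particle positions."""

    @abstractmethod
    def solve(self, tile):
        """Fill the tile's per-particle E and B values from its lattice fields."""

class LinearInterpolator(Interpolator):
    """Concrete first-order (linear) interpolation scheme."""

    def solve(self, tile):
        # ... gather the staggered E and B fields to particle locations
        # with volume-averaged linear weights (details omitted) ...
        pass

# any object conforming to the interface can be dropped into the module loop
solver = LinearInterpolator()
assert isinstance(solver, Interpolator)
```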
Fig. 1 Schematic presentation of the PIC code's class relationships and connectivity. The FIELDS module is built on top of the CORGI module that provides the spatial simulation grid. It contains the electromagnetic fields Ê and B̂, and the methods to advance them in time. The PIC module handles the computational particles. It is coupled to the FIELDS module via the current ĵ induced by the moving charged particles. Similarly, the FIELDS module is coupled to the PIC module by the Lorentz force.
3.4 Drivers
The actual simulation codes in RUNKO are executed by the drivers. Drivers are short PYTHON scripts that use the C++ bindings of the tiles and solvers. They typically consist of an initialization section at the beginning, where tiles of some specific module are loaded into the CORGI grid container. After this, the data attributes of the tiles are initialized. The drivers also include the main time propagation loop where the solvers are applied to the tiles in the correct order. As an example, Table 1 presents the main time integration loop of the PIC code.
The main advantage of using the low-level C++ classes via the PYTHON driver scripts is the ease of use. For example, various complex initialization routines are much easier to implement in a high-level language that also supports data visualization. It also opens up the possibility of performing the analysis (or at least part of it) on the fly during the simulation.
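Schematically, a driver's main loop has the following shape. All names in this Python sketch are illustrative stand-ins rather than RUNKO's actual API; Table 1 lists the real time-integration loop:

```python
# Schematic PIC driver time step, in the spirit of the loop in Table 1.
# Every object and method name here is a hypothetical placeholder.
def step(grid, pusher, interpolator, depositer, field_solver):
    for tile in grid.local_tiles():
        field_solver.push_half_b(tile)   # advance B by half a time step
    grid.update_boundaries()             # refresh the tile halo regions
    for tile in grid.local_tiles():
        interpolator.solve(tile)         # gather E and B to particle positions
        pusher.solve(tile)               # Lorentz-force velocity update + move
        depositer.solve(tile)            # charge-conserving current deposit
    grid.exchange_currents()             # sum currents over tile boundaries
    for tile in grid.local_tiles():
        field_solver.push_half_b(tile)   # second half of the B push
        field_solver.push_e(tile)        # E update with the deposited current
```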
3.5 Currently Implemented Modules
The currently implemented modules include the FIELDS and PIC modules. These and the corresponding solvers are listed in Table 2.
The FIELDS module (Sect. 2.3.1) is responsible for advancing the electromagnetic fields via Maxwell's equations. This module inherits its spatial dimensionality D from the CORGI tiles. Internally, it has a mesh container that stores the field components as D-dimensional lattices. It also holds the methods for advancing these fields in time using Eqs. (14) and (15). The module is agnostic about the form of the incoming current; it can be coupled to any other module to provide this closure.
The PIC module (Sect. 2.3.2) handles the particle propagation and provides the current closure for the FIELDS module (see also Fig. 1). Attributes of the electromagnetic field lattices are inherited from the fields::Tile. In addition to the spatial dimensionality template parameter D, the PIC module has another template parameter for the velocity space dimensionality V (i.e., pic::Tile<D, V>). By specializing the V template parameter, the solvers acting on the pic::Tile take into account the corresponding velocity space dimensionality. For example, setting V = 2 equals manipulating the velocity vector as u = (u_{x}, u_{y}), whereas V = 3 modifies a full three-dimensional velocity vector u = (u_{x}, u_{y}, u_{z}). The PIC module also defines various solvers that operate on the pic::Tiles (see Table 2 for all the solvers implemented in the framework).
The final important module in the current framework is the io module, which is responsible for the data input and output functionality. It is currently more detached from the physical implementations, providing only solvers that act on the tiles. These solvers can be categorized into two distinct types: input readers and output writers. The input readers load data into the tiles from storage files, whereas the output writers write data from the tiles to the disk. The currently implemented readers and writers are capable of operating on full simulation snapshot restart files using the hierarchical data format (HDF5). In addition, both the field and particle classes have more storage-efficient writers that reprocess the data on the fly during the simulation and only write a smaller subsample of the full data to disk.
4 Problem Showcase
In this section, we showcase the new PIC code implemented into the RUNKO framework. First, we demonstrate the code validity by comparing the simulation results of the relativistic Weibel instability to the known analytical growth rates. Next, we demonstrate the same instability on a larger scale by presenting results from an unmagnetized collisionless shock simulation. Finally, we measure the numerical performance of the code by using results from relativistic turbulence simulations.
4.1 Relativistic Weibel Instability
We start showcasing the code capabilities by first simulating a standard astrophysics-motivated plasma instability, commonly known as the Weibel instability (Weibel 1959; Fried 1959; Bret et al. 2010). The relativistic Weibel instability in a collisionless plasma is triggered when two counter-streaming plasma populations start to experience small charge-dependent deflections, seeded by noise. The resulting modes are electromagnetic, so perturbations in both E and B are important. Therefore, the presented simulations also help us probe the validity of the solutions to the full Vlasov–Maxwell equations.
4.1.1 Problem Setup
We simulate the linear growth phase of the instability by following two counter-streaming pair plasma populations in a 2D simulation domain. The plasma is modeled with 32 particles per cell per species, and is initialized to have a small thermal spread of θ = kT/m_{e}c^{2} = 10^{−5} and to counter-stream relativistically along the x-axis with bulk Lorentz factors γ_{b}. Electromagnetic fields are propagated in time with the standard FDTD2 solver, the current is deposited with the ZigZag scheme, particle interpolation is done with the LinearInterpolator method, and particle propagation is performed with the BorisPusher solver. The simulation domain is captured with a grid of 320 × 80 cells and the skin depth is resolved with 10 cells. We use a standard time step of ∆t = 0.45∆x. We use symmetric beams with Lorentz factors of γ_{b} = 3, 10, 30, and 100.
Fig. 2 Simulations of Weibel instability growth rates in relativistic counterstreaming plasmas. The different curves show the magnetic field energy density B^{2}/8π (normalized by the relativistic plasma enthalpy density, γ_{b}n_{0}m_{e}c^{2}) as a function of time tω_{p} for beam velocities of γ_{b} = 3 (blue), 10 (orange), 30 (green), and 100 (red). The dashed lines show the expected analytic growth rates. 
Table 1. Time-advance loop of the particle-in-cell module.
4.1.2 Comparison to Analytical Estimates
The analytic linear growth rates of the Weibel instability are known for symmetric beams. Bret et al. (2010) present the filamentation instability growth rate δ as

δ = β_{b} (2/γ_{b})^{1/2} ω_{p}, (19)

where γ_{b} is the bulk Lorentz factor of the beams (with β_{b} the corresponding velocity in units of c), ω_{p} = (4πn_{e}q_{e}^{2}/m_{e})^{1/2} is the plasma frequency, and n_{e} is the plasma number density (of electrons). We compare the simulation results to these theoretical estimates in Fig. 2 and find the linear instability growth rate to match the expectations very well.
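Equation (19) can be evaluated directly for the beam Lorentz factors used in the runs, for example with the following Python snippet (rates in units of ω_p; the scaling δ ∝ γ_b^{−1/2} means the fastest beams grow the slowest):

```python
import math

def weibel_growth_rate(gamma_b, omega_p=1.0):
    """Cold symmetric-beam filamentation growth rate,
    delta = beta_b * sqrt(2 / gamma_b) * omega_p (Bret et al. 2010)."""
    beta_b = math.sqrt(1.0 - 1.0 / gamma_b**2)
    return beta_b * math.sqrt(2.0 / gamma_b) * omega_p

# evaluate for the beam Lorentz factors used in the simulations
for gamma_b in (3, 10, 30, 100):
    print(f"gamma_b = {gamma_b:4d}: delta/omega_p = "
          f"{weibel_growth_rate(gamma_b):.3f}")
```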
These simulations are not particularly demanding and can therefore serve as a helpful introduction to using the code. The discussed setup is also general enough that it could be easily reformulated to study other analytically intractable plasma beam instabilities.
Table 2. List of RUNKO modules and solvers.
4.2 Formation of Collisionless Shocks
The relativistic Weibel filamentation setup can be expanded to a realistic 3D simulation and used to model the plasma physics of collisionless shocks. The shock is formed by the growth and nonlinear saturation of the same instability.
We find this setup to be one of the most demanding numerical tests for PIC codes because the numerical algorithms need to be able to both sustain the relativistic upstream bulk flow and correctly model the kinetic shock physics. The upstream is unstable to the numerical Cherenkov instability (Godfrey 1974) and is, in addition, sensitive to any numerical error in the current deposition; even tiny perturbations in the relativistically advected upstream plasma can trigger an isotropization of the bulk flow. The shock front, on the other hand, is sustained by a strong electric current that is sensitive to charge conservation. We use this setup to stress test the full 3D PIC routines and their correctness.
4.2.1 Problem Setup
The numerical setup consists of an elongated shock "tube" along the x-axis and a reflecting wall on the left side of the domain. Initially, we set all the plasma in the domain to move from right to left along the x-axis with a relativistic velocity. The plasma is reflected from the wall on the left side of the simulation domain and forms a counter-propagating beam. The beam is unstable to the Weibel instability that quickly stops the right-moving component. This causes the formation of a collisionless shock.
We simulate the shock in an elongated box with a size of 512 × 64² (c/ω_{p})³, using 10 cells per skin depth so that the full domain is covered with a grid of 5120 × 640² cells. The plasma flows with a relativistic bulk Lorentz factor Γ = 10 and is modeled with 64 particles per cell per species. Electromagnetic fields are propagated in time with the fourth-order FDTD4 scheme. The resulting current is calculated using the ZigZag scheme, particle interpolation is done via the LinearInterpolator, and the current is smoothed with eight passes of the binomial filter (Binomial2). We use a time step of ∆t = 0.45∆x. The pair plasma has an initial thermal spread of θ = kT/m_{e}c^{2} = 2 × 10^{−5} and is updated using the HigueraCaryPusher. We start the simulation with no background magnetic field, which corresponds to an unmagnetized shock.
The plasma is injected into the domain from a receding right wall that moves at the speed of light from left to right; this enables longer simulations both because the particles are not "alive" as long as they normally would be (suppressing the numerical Cherenkov instability) and because it saves computational time (since we do not need to propagate all the particles in the domain from the start). The shock front is adaptively tracked on the fly, and only the high-resolution grid data around it are saved. The grid is decomposed into cyclic MPI domains of length 1280 cells in the x direction (i.e., four cycles over the full box of 5120 cells); this results in a more even memory load for the simulation.
Fig. 3 Visualization of a relativistic collisionless shock in unmagnetized pair plasma at tω_p ≈ 500. The box periphery shows the plasma number density n/n_0, and the exploded panels the electric current density along the flow, J_x/en_0c. The shock is located at the middle of the box at x = 0; the same region shows strong current filaments that are responsible for mediating the shock.
4.2.2 Physical Interpretation and Numerical Tests
A collisionless shock forms quickly once the simulation is started. The shock is mediated by the formation of Weibel filaments that grow, saturate, and merge; this stops the counter-streaming plasma (see Fig. 3).
The simulated shock structure and dynamics are in good agreement with the theoretical expectations and previous numerical experiments (e.g., Sironi & Spitkovsky 2009; Plotnikov et al. 2018). The filaments form and induce the shock in about 100 ω_p^{−1}. The upstream flow is compressed into a denser downstream with a compression ratio n_down/n_up ≈ 4.1 ± 0.15, and the shock propagates into the upstream with a mildly relativistic velocity, β_sh ≈ 0.32 ± 0.03 (measured for 200 ≤ tω_p ≤ 1000). Unmagnetized MHD jump conditions, in comparison, predict n_down/n_up = 1 + (Γ + 1)/[Γ(γ_AD − 1)] = 4.3 for the compression ratio, and β_sh = (γ_AD − 1)(Γ − 1)/Γ = 0.3 for the shock velocity, given the 3D adiabatic index γ_AD = 4/3 (see, e.g., Plotnikov et al. 2018). The shock ramp itself has a length of roughly ≈50 c/ω_p and is accompanied by an increase in the electric field energy density just ahead of the moving front.
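As a quick cross-check of the quoted jump-condition values, the two formulas can be evaluated directly; a worked sketch:

```python
# Unmagnetized relativistic MHD jump conditions (e.g., Plotnikov et al.
# 2018) for upstream bulk Lorentz factor Gamma and 3D adiabatic index 4/3.
Gamma = 10.0
gamma_ad = 4.0 / 3.0

# compression ratio and shock velocity in the downstream (wall) frame
compression = 1.0 + (Gamma + 1.0) / (Gamma * (gamma_ad - 1.0))  # -> 4.3
beta_sh = (gamma_ad - 1.0) * (Gamma - 1.0) / Gamma              # -> 0.3
```

Both predictions agree with the measured values (4.1 ± 0.15 and 0.32 ± 0.03) to within the quoted uncertainties.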
We also specifically tested that the upstream remains smooth and stable for an extended period of time, ≳1000 ω_p^{−1}, with electric and magnetic field fluctuations of δE, δB ≲ 10^{−6} (in absolute code units; the value corresponds to floating-point accuracy). This is a stringent test that can be used to verify that the particle-to-grid and grid-to-particle routines are all correctly implemented and that no charge or electric current is lost; otherwise, any numerical error will quickly grow and isotropize the flow. Similarly, the formation of the shock front supports the notion that all the 3D routines are correct. We note that the shock simulation is a particularly demanding code test because of the above details.
4.3 Decaying Relativistic Kinetic Turbulence
In this section we use a relativistic decaying turbulence setup to showcase the code performance and scalability metrics. The setup is presented and discussed in detail in Comisso & Sironi (2018, 2019) and Nättilä & Beloborodov (2021). Similar driven turbulence simulations are performed in Zhdankin et al. (2017a,b), among others.
4.3.1 Problem Setup
We consider a uniform pair plasma that is suddenly stirred. Due to this large-scale driving motion, a turbulent cascade quickly develops. We study 2D3V setups, where the cascade forms in the planar x–y direction and is captured by the spatial 2D simulation domain. All three velocity components are retained, however, including the out-of-plane z direction. To physically mimic confinement of the cascade to a 2D plane, a guide field B_0 = B_0 ẑ is imposed in the z direction (ẑ is the unit vector along the z-axis).
The presence of the strong guide field renders the plasma magnetically dominated. The plasma magnetization parameter we use is σ = B_0^2/(4π n_0 ⟨γ⟩ m_e c^2), (20)
where n_0 is the initial number density and ⟨γ⟩ is the mean thermal Lorentz factor of the plasma. Here we focus only on a strongly magnetized case of σ = 10. Initially, the plasma is defined to be warm with a temperature of .
The spatial resolution of our fiducial simulation is 5120^2 cells (512 × 512 tiles with an internal tile resolution of 10 × 10). The initial plasma skin depth c/ω_p is resolved with five cells. This sets our box size at L ≈ 1000 c/ω_p. We use 64 particles per cell per species, totaling 128 particles per cell for the electron and positron species combined. In total, we then propagate ~10^9 particles in the simulation.
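The quoted box size and particle count follow directly from these parameters; a minimal arithmetic check:

```python
nx = 5120        # cells per side of the 2D domain
c_omp = 5        # cells per plasma skin depth
ppc_total = 128  # particles per cell, both species combined

L_box = nx / c_omp              # box size in skin depths: 1024 ~ 1000 c/w_p
n_prtcls = nx * nx * ppc_total  # total macroparticle count, ~3.4e9
n_tiles = (nx // 10) ** 2       # 512 x 512 tiles of 10 x 10 cells each
```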
The sudden large-scale stirring is mimicked by perturbing the planar magnetic field B_⊥ with strong uncorrelated fluctuations in the x and y directions, following Comisso & Sironi (2018). This drives strong turbulence with semi-relativistic bulk velocities. The forcing modes, m, n ∈ {1, . . . , 8}, are initialized with equal power as (21) (22)
where k_m = 2πm/L and k_n = 2πn/L are the wave numbers along the x and y dimensions, respectively. The forcing is uncorrelated, so ϕ_mn, ψ_mn ∈ [0, 2π) are random phases. The forcing amplitude is (23)
where N = 8 in our case. The largest forcing mode (k_{N} = 2πN/L) defines the energycarrying scale and the characteristic size of the largest magnetic eddies l_{0} = 2π/k_{N}.
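The forcing construction above can be illustrated numerically. The precise functional form of Eqs. (21)–(22) follows Comisso & Sironi (2018); the sin/cos products below are a generic stand-in with the same wave numbers, mode range, and random phases:

```python
import numpy as np

# Illustrative random-phase forcing: an equal-power superposition of the
# N^2 lowest modes. The exact functional form of Eqs. (21)-(22) is that
# of Comisso & Sironi (2018); this sketch is only a generic stand-in.
rng = np.random.default_rng(42)
N = 8                  # largest forcing mode
L = 1024.0             # box size in skin depths
x = np.linspace(0.0, L, 128, endpoint=False)
X, Y = np.meshgrid(x, x, indexing="ij")

dBx = np.zeros_like(X)
dBy = np.zeros_like(X)
for m in range(1, N + 1):
    for n in range(1, N + 1):
        km = 2.0 * np.pi * m / L      # wave number along x
        kn = 2.0 * np.pi * n / L      # wave number along y
        phi, psi = rng.uniform(0.0, 2.0 * np.pi, size=2)
        dBx += np.sin(km * X + phi) * np.cos(kn * Y + psi)
        dBy -= np.cos(km * X + phi) * np.sin(kn * Y + psi)

l0 = 2.0 * np.pi / (2.0 * np.pi * N / L)  # energy-carrying scale, L/N
```

With N = 8 and L ≈ 1000 c/ω_p, the energy-carrying eddies have l_0 = L/N ≈ 128 skin depths, consistent with the definition in the text.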
Fig. 4 Visualizations of the kinetic turbulence simulations. The top row shows the honeycomb-like domain decomposition that partitions the grid between different processors with the CORGI library (the color map is cyclic). The bottom row shows the plasma density (in units of the initial plasma mass density ρ_0). The left panels show the full simulation domain, whereas the right panels focus on a smaller subregion that better illustrates the small-scale structures.
Fig. 5 Code performance analysis for 5120^2 resolution turbulence runs with 1024 cores. The black line shows the total push time; the other curves show the individual results for each component in the loop (see Table 1). In a perfectly load-balanced state (n ≲ 200) the total particle push time is measured to be around 0.7 µs, whereas in a strongly load-imbalanced state an average time of about 1.1 µs is obtained.
4.3.2 Numerical Efficiency and Parallel Scaling
The turbulence cascade is formed as the plasma relaxes from the nonequilibrium state (Fig. 4, bottom row; see, e.g., Nättilä & Beloborodov 2021 for a more indepth physical interpretation). We use the turbulence simulation setup to probe the numerical performance and parallel scaling of the code.
This setup is advantageous not only because it is a real physical problem, but also because we can probe both the load-balanced and load-imbalanced performance of the PIC code. Initially the simulations are in the load-balanced state (computing time is dominated by the particle push), but after the cascade forms we observe a strong numerical imbalance between different ranks because of the strong fluctuations in particle number density (computing time is dominated by communication routines). No dynamic load-balancing algorithm is used. Here we mainly focus on the transition scale of around ~10^3 processors, where both of these regimes can be simulated, leaving extremely large simulations with ≳10^5 processors and dynamic load balancing for future work. The scaling measurements reported here were performed on the Kebnekaise computing cluster^8. We use the second-order FDTD solver (FDTD2), the Boris pusher (BorisPusher), linear interpolation (LinearInterpolator), and the first-order ZigZag scheme (ZigZag) to perform the simulations. Calculations are done in double floating-point precision.
Good numerical performance is obtained for the PIC code. In an ideal, completely load-balanced state the average total particle update time per processor is around 0.7 µs (see Fig. 5 around n ≲ 200). In the strongly imbalanced simulation state (dominated mainly by inter-rank particle communications) the particle update time per processor is around 1.1 µs when the turbulent cascade has developed (Fig. 5 for n ≳ 4000). The slowdown is caused by an increase in the evaluation time of the routines communicating the fields and particles to the neighboring ranks.
We also monitor all the separate routines (see Table 1) independently to probe the numerical performance of the code in more detail. We note, however, that this monitoring is only done for one particular rank, so the component-wise evaluation times reported here might fluctuate between ranks experiencing differing loads. Nevertheless, the numbers are indicative of where most of the evaluation time is spent. As seen in Fig. 5, in the ideal case most of the evaluation time (around 0.3 µs per particle per processor) is spent in the interpolation routine (interp_em; pink solid line), where B̂ and Ê are computed at the particle locations between the grid points. The large cost of the interpolation step is caused by frequent cache misses. The interpolation routine consists of many random-access operations on the field arrays because particles are not sorted by their location; this prevents the processor cache prefetching from correctly predicting the incoming requested values. The second most expensive routine is the current calculation (comp_curr; green dashed line), with a typical evaluation time of 0.06 µs per particle per processor. It has a similar problem with the cache memory because writing the current to the array is performed with an unpredictable access pattern. When communication costs increase, the MPI messaging routines become as costly as the current deposition, with an average time of about 0.1 µs per particle per processor.
The code shows good weak (Fig. 6) and strong (Fig. 7) scaling as the simulations transition from CPU-bound to communication-bound. Weak scaling proceeds as 1.0:1.10:1.40:1.93:2.09 for every quadrupling of the number of processors, 16:64:256:1024:4096. Strong scaling proceeds as 1.0:1.18:1.48:2.36:1.93 against an ideal scaling for every doubling of the number of processors, 256:512:1024:2048:4096. The increase in the evaluation time as the number of processors grows can indeed be attributed to the increase in the evaluation time of the communication routines (field updates: mpi_b1, mpi_b2, mpi_e1, mpi_cur, clear_vir_cur; and particle updates: mpi_prtcls, unpack_vir_prtcls, del_vir_prtcls). A similar test performed in the load-balanced state yields almost ideal scaling up to ~10^3 processors (roughly 1.10 of the ideal) with a total particle update time of about 0.7 µs.
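The quoted weak-scaling ratios translate directly into a parallel efficiency; a minimal sketch:

```python
# Parallel efficiency implied by the weak-scaling slowdowns quoted above
# (push time relative to the 16-core baseline).
cores = [16, 64, 256, 1024, 4096]
slowdown = [1.00, 1.10, 1.40, 1.93, 2.09]
efficiency = [1.0 / s for s in slowdown]
# the last entry shows that roughly half of the per-core throughput
# remains at 4096 cores in the strongly load-imbalanced state
```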
Finally, we note that the extra costs originating from the intra-rank updates of the tile boundaries are negligible. This extra cost is introduced by the patch-based grid infrastructure, where the smallest unit of calculation is one tile (a 10 × 10 lattice with about 10^4 particles per species per tile in this particular case). Therefore, some communication is always needed to keep these individual units of calculation in sync (field update routines: upd_bc0, upd_bc1, upd_bc2, cur_exchange; and particle update routines: check_outg_prtcls, pack_outg_prtcls, get_inc_prtcls, del_trnsfrd_prtcls), even in systems relying entirely on shared-memory parallelism. These routines are, however, typically 10–100 times cheaper than the field interpolation or MPI communication tasks.
Fig. 6 Weak scaling results of the PIC code measured in a strongly load-imbalanced simulation state. Scaling is presented in terms of the mean particle push time per processor for various simulations that have a fixed ratio of total number of particles to processors. The black line shows the total mean push time; the other curves show the individual results for each component in the loop (see Table 1). Ideal scaling corresponds to a horizontal line, although there is a slight increase in the evaluation time due to more time spent on communication routines as the number of processors increases.
Fig. 7 Strong scaling results of the PIC code measured in a strongly load-imbalanced simulation state. Results are for a 5120^2 box with 128 particles per cell. The black line shows the total evaluation time per lap; the other curves show the individual results for each component in the loop (see Table 1). The ideal scaling corresponds to the thick black dotted line.
5 Discussion
5.1 Computational Advantages
RUNKO is a modern numerical toolbox tailored for astrophysical plasma simulations. The framework is designed using recent software-engineering methods that rely on multiple levels of code abstraction. These, in turn, help to create a general and easily extensible code framework. Furthermore, in order to encourage all kinds of use, RUNKO is provided as open-source software for the community.
The code is written as a hybrid C++/PYTHON program. The low-level C++ kernels currently conform to the C++14 standard, which brings a plethora of modern computational methods. For example, we rely heavily on recent C++ template metaprogramming features to simplify the design of the code. This ensures a truly modern numerical code that will not be outdated immediately upon publication.
In addition to the underlying numerically efficient C++ implementation, all the classes have a PYTHON interface. This allows the user to take full advantage of the benefits of a high-level language without sacrificing the actual runtime speed of the code. In this way, many simulation setups can be quickly prototyped on a laptop and then scaled up to supercomputer platforms for the actual production runs. Furthermore, very complex initial conditions can be designed and implemented because the initialization routines are written in PYTHON.
The modular design allows RUNKO to function as a framework that is fully agnostic to the underlying formulation used to model the plasma. The physics is implemented via modules, which in practice typically solve a set of partial differential equations or propagate Lagrangian particle species. This means that we are not locked into one predefined theoretical formalism for describing the evolution of the plasma. The high degree of modularity also allows the user to easily change the numerical solvers and algorithms depending on the problem at hand.
The implementation relies on modern C++ features such as polymorphism and template metaprogramming to simplify the code and make it easily extensible. Different solvers inherit their standard interface from an abstract base class, so that introducing and implementing new physical algorithms is straightforward. User-definable dimensionality of the different solvers is handled by template specializations. Additionally, most of the solvers are agnostic to the actual floating-point accuracy (i.e., single- or double-precision) because the working precision is typically also specified as a template variable.
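The solver pattern described here can be illustrated in Python. The class and method names below are illustrative only, not runko's actual C++ hierarchy (which uses templates and C++ abstract base classes rather than Python's ABCs):

```python
from abc import ABC, abstractmethod

# Sketch of the solver pattern: a common abstract interface plus
# concrete implementations. Names are illustrative, not runko's actual
# C++ class hierarchy.
class FieldSolver(ABC):
    @abstractmethod
    def push_half_b(self, tile):
        """Advance B by half a time step on one tile."""

    @abstractmethod
    def push_e(self, tile):
        """Advance E by a full time step on one tile."""

class FDTD2(FieldSolver):
    """Second-order finite-difference time-domain solver."""
    def push_half_b(self, tile):
        pass  # would apply the 2nd-order curl(E) stencil to the tile data

    def push_e(self, tile):
        pass  # would apply the 2nd-order curl(B) stencil and subtract J
```

A new algorithm (say, a fourth-order field solver) is added by subclassing the same interface, mirroring how new C++ solvers inherit from the abstract base class.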
Another technical advantage of the framework is the patch-based domain decomposition strategy that is built into the grid infrastructure routines. As shown by the performance analysis, the cost of the additional updates needed to keep the individual tiles in sync is negligible in comparison to the evaluation costs of the physics modules. Having a larger number of moderately sized tiles is found to be faster than partitioning the domain into fewer but larger tiles^9. The benefits of this system are that it automatically introduces some level of load balancing into the simulations (each rank has ~10^2 tiles, so load imbalances at the level of one tile are effectively smoothed out), it helps data cache locality (array sizes and particle containers remain small and easily movable in memory), and it simplifies the global communication routines (since these communication routines are needed anyway). Another benefit is the possibility to optimize the MPI domain boundaries with more complex edge shapes; an interesting example of this is the honeycomb-like tile memory configuration (see Fig. 4, top row). We also note that a similar decomposition technique currently appears to be the state-of-the-art choice of many other recent high-performance PIC codes, such as VPIC (Bowers et al. 2008) and SMILEI (Derouillat et al. 2018).
5.2 Future Directions
The presented framework offers a unique possibility to start exploring and experimenting with multiphysics simulations. These are new kinds of numerical simulations that evolve multiple physical formalisms simultaneously, or that adaptively select the most accurate (or relevant) solver on the fly.
These heterogeneous simulations will enable us to simulate larger, and therefore more realistic, systems. In the future this can enable, for example, novel plasma simulations where the majority of the simulation domain is modeled with a force-free electrodynamic description of the plasma (Komissarov 2002), while some small regions with extreme conditions (e.g., current sheets, shocks) are adaptively refined, on the fly, to the fully kinetic description that enables realistic in situ modeling of particle acceleration. Another possibility could be a multiple-domain, multiple-physics approach where, for example, some fixed part of the simulation is described kinetically, whereas some other part is modeled with MHD. Here the domains could be divided based on a strong density gradient in the problem setup, like those found in systems with dilute nonthermal particles on top of a denser fluid-like plasma; physically, such setups can model astrophysical systems like accretion disk coronae.
Another important focus is the code performance and scalability. In practice, this mostly means minimizing the global communication costs. One possibility for decreasing the communication costs is the use of more complex hybrid parallelization schemes. As an example, the patch-based domain superdecomposition encourages the use of a task-based hybrid parallelization scheme, similar to, for example, the DISPATCH framework (Nordlund et al. 2018). Since most of the tile updates are numerically independent, these operations can easily be performed in parallel with shared-memory parallelization strategies. This, in turn, allows the number of tiles per MPI rank to be increased, which acts as an extra buffer against sudden load-imbalance fluctuations. This strategy also allows interleaved global (nonblocking) communications and normal solver updates to be performed simultaneously. Almost all the (MPI) communication can then be hidden, since the boundary tiles can be prioritized so that they are always computed first and sent to the neighbors, while other threads continue evaluating the inner tiles while waiting for the halo tiles. We are also currently experimenting with dynamic load-balancing methods, where the ownership of the tiles changes during the simulation depending on the computational load.
Finally, since PYTHON is becoming the new lingua franca of astronomy and astrophysics, it is important to note that problem initialization in RUNKO can be done purely via PYTHON scripts. This means that new users are immediately familiar with the basic approach of the code and that very complex simulation setups can be initialized in a high-level language. We hope that this will accelerate the use of plasma simulations in studying many still unresolved astrophysical problems.
6 Summary
We started by reviewing the kinetic plasma theory and the numerically solved Vlasov–Maxwell system of equations (Sect. 2). The numerical implementation of the framework is discussed in Sect. 3. We focused in particular on the design of the code and introduced the different numerical components of the framework. First, we discussed the CORGI grid infrastructure and the related patch-based domain superdecomposition strategy, where the computational grid is partitioned into separate tiles (Sect. 3.1). Second, we presented the physical building blocks of the code, the modules (Sect. 3.2). Different physical formulations for modeling the evolution of the plasma and electromagnetic fields are encapsulated in different modules. Third, each module can contain several numerical algorithms called solvers, which are short C++ classes that operate on and modify the content of different tiles (Sect. 3.3). Finally, these C++ solvers are applied to the tiles and operated by PYTHON scripts called drivers, ensuring ease of use (Sect. 3.4).
As our first task, we implemented a new particle-in-cell module into the framework. The implementation is discussed in detail in Sects. 2.3.1 and 2.3.2. The PIC code demonstrates good numerical performance, with a mean particle push time per processor of about 0.7 µs in an ideal load-balanced state and 1.1 µs in a strongly load-imbalanced state. Furthermore, the code is shown to scale well up to ~10 000 processors.
We showcased the use of this PIC code with three complex, astrophysically motivated plasma simulations. First, we simulated the linear growth and nonlinear saturation of the Weibel instability in a relativistically streaming pair plasma. Second, we performed a mock simulation of a current-filamentation-mediated relativistic collisionless shock in full 3D. Last, we tested the code performance by modeling decaying relativistic kinetic turbulence in a magnetized pair plasma. These simulations demonstrate the ease of use, extensibility, and flexibility of the code. All the setups are commonly used to study astrophysical phenomena and can therefore serve as a useful starting point for new users.
Acknowledgments
J.N. would like to thank Andrei Beloborodov, Axel Brandenburg, Luca Comisso, Troels Haugbølle, Dhruba Mitra, Åke Nordlund, Lorenzo Sironi, Anatoly Spitkovsky, and Alexandra Veledina for stimulating discussions on numerics and plasma simulations. We would also like to thank the referee for many helpful comments and suggestions. The simulations were performed on resources provided by the Swedish National Infrastructure for Computing (SNIC) at PDC and HPC2N.
Appendix A Nondimensionalization of the Vlasov–Maxwell Equations
Following Jackson (1975), the Maxwell equations in an arbitrary system of units can be written as (A.1) (A.2) (A.3) (A.4)
where k_{1} and k_{2} are constants that define the unit system. For a Gaussian system we have k_{1} = 1 and k_{2} = c, whereas for rationalized MKSA k_{1} = 1/4πϵ_{0} and k_{2} = 1. Here ϵ_{0} is the vacuum permittivity (and µ_{0} is the vacuum permeability found from the relation c^{2} = 1/(ϵ_{0}µ_{0})).
In this appendix we present the unit system in use (originally by Buneman et al. 1993, and described in detail also in the TRISTAN-MP user manual by A. Spitkovsky). We select the Gaussian system here by setting k_1 = 1 and k_2 = c. Our Maxwell equations then simplify to
∇ · E = 4πρ, (A.5)
∇ · B = 0, (A.6)
∇ × E = −(1/c) ∂B/∂t, (A.7)
∇ × B = (1/c) ∂E/∂t + (4π/c) J, (A.8)
and the fields appear symmetrically as E ↔ B. The Lorentz force in this unit system is F = q[E + (v/c) × B]. (A.9)
We next normalize all the variables with fiducial values. The most peculiar of these is our selection of distance and time scalings: we express distance in units of the grid spacing, x = x̂∆x, and time in units of the time step, t = t̂∆t. The coordinate-velocity and four-velocity are then expressed in these same units. The fields are scaled with a fiducial field B_0 such that E = ÊB_0 and B = B̂B_0. Similarly, the charge, mass, and current density are scaled as q = q_0q̂, m = m_0m̂, and J = J_0Ĵ, respectively. This way, all of our numerical quantities are denoted with a hat. In numerical (code) units the equations that are being solved are then (A.10) (A.11) (A.12)
where in the last steps we define (A.13) (A.14)
so that Eqs. (A.10)–(A.12) appear unitless.
From Eqs. (A.13) and (A.14) we obtain a connection between the grid spacing, particle mass, and charge. We note first that the current density is (A.15)
i.e., J_{0} = q_{0}/∆x^{2}∆t. This implies that (A.16)
Assuming that one computational macroparticle contains N electrons (or positrons) we can write (A.18)
where we have set m_{0} = m_{e} (m_{e} is the electron rest mass) and q_{0} = e (e is the electron charge), and where r = e^{2}/m_{e}c^{2} ≈ 2.82 × 10^{−13} cm is the classical electron radius. This enables us to express the field scaling, Eq. (A.13), in physical (Gaussian) units (A.19)
The conversions from code units to Gaussian units can then be performed by selecting a length scale, ∆x, as (A.20) (A.21) (A.22) (A.23)
We next discuss some derived plasma quantities in the simulation. The total relativistic plasma frequency is given as (A.24)
where ω_p,− and ω_p,+ are the plasma frequencies of electrons and ions (or electrons and positrons), respectively, and where we have assumed n = n_− = n_+. For electron–ion plasma m_− = 1 and m_+ = 1836 (or some smaller reduced proton mass), whereas for electron–positron pair plasma m_− = 1, m_+ = 1, and 1 + m_−/m_+ = 2. Time dilation from a relativistic bulk motion (if any) is corrected for by the mean Lorentz factor 〈γ〉 of that flow. The counterpart in the code units is (A.25)
where N_ppc is the number of particles per cell per species (again assuming charge equilibrium, N_ppc = n_− = n_+) and in the last step we take into account that . The initial numerical plasma oscillation frequency is obtained by specifying the skin-depth resolution, , and remembering that and , so that we obtain (A.26)
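The relation in Eq. (A.26) admits a short worked example. Writing the numerical speed of light as ĉ = c∆t/∆x (the CFL number) and the skin-depth resolution as R = (c/ω_p)/∆x cells, the plasma frequency per time step is ω_p∆t = ĉ/R; the values below are those of the shock runs in Sect. 4.2:

```python
# Worked example of the skin-depth relation: with chat = c*dt/dx the
# numerical speed of light and R = (c/w_p)/dx cells per skin depth,
# the plasma frequency per time step is w_p*dt = chat / R.
chat = 0.45   # c dt/dx used in the shock runs
R = 10.0      # cells per skin depth in the shock runs
omega_p_dt = chat / R   # plasma frequency in inverse time steps
```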
We can then fix the value of charge (and ) such that the initial plasma frequency for the electrons in the beginning of the simulation is (i.e., requiring ) (A.27)
Similarly, the relativistic speciesspecific cyclotron frequency is (A.28)
which in code units is simply (A.29)
where m̂ is the numerical mass of the particle (i.e., m̂ = 1 for pair plasma or m̂ = 1836 for protons). We note that the critical electron cyclotron frequency for corresponds to . Both Eqs. (A.25) and (A.29) can be translated to physical units by simply inserting numerical values for the time step ∆t. By comparing Eqs. (A.24) vs. (A.25) and Eqs. (A.28) vs. (A.29) we also see that the numerical quantities are conveniently obtained by simply replacing the physical quantities with the code variables.
Finally, the total plasma magnetization is given as (A.30)
This demonstrates the usefulness of the unit system. Numerically, we perform fewer floating-point operations because the formulas appear unitless; our typical values are of the same order of magnitude, reducing floating-point round-off errors; and physical quantities are easily transformed into the code unit system by replacing variables with code quantities (e.g., and n = N_ppc) and dropping the 4π factors from the Maxwell equations.
Appendix B Time and Space Staggering with Finitedifference Method
Numerical charge conservation and the preservation of a divergence-free B field follow from the use of a charge-conserving current deposition strategy and the staggered Yee lattice (Yee 1966). See Esirkepov (2001) for a more in-depth discussion of the topic.
Let us define discrete versions of the ∇ operator as (B.1)
where ∆x, ∆y, and ∆z are the grid spacings. These operators have the following properties (B.3)
where is the discrete Poisson operator in central differences such that (B.4)
Let us stagger our variables in space and time such that (B.5) (B.6) (B.7) (B.8) (B.9) (B.10)
Maxwell’s equations for E and B are given as (B.11) (B.12) (B.13) (B.14)
The discrete form of charge conservation then follows by operating with ∇^+ · on Eq. (B.11) and substituting Eq. (B.13) into the resulting equation. We then obtain (B.15)
Similarly, operating on Eq. (B.12) with ∇^− · we obtain the divergence-free nature of the B field (B.16)
If these constraints are satisfied at t = 0, then the divergence of E always equals the charge density, and B retains its divergence-free nature up to machine precision during the simulation.
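The cancellation underlying these results can be checked numerically. The sketch below (unit grid spacing, periodic boundaries) verifies that the one-sided-difference divergence of a matching one-sided curl vanishes to machine precision, because the backward shift operators commute:

```python
import numpy as np

# Verify the discrete identity behind the divergence-free B field: the
# backward-difference divergence of a backward-difference curl vanishes
# identically. Unit grid spacing, periodic boundaries.
rng = np.random.default_rng(1)
G = rng.standard_normal((3, 16, 16, 16))  # arbitrary periodic vector field

def dm(f, ax):  # backward difference: f[i] - f[i-1]
    return f - np.roll(f, 1, axis=ax)

curl_m = np.stack([
    dm(G[2], 1) - dm(G[1], 2),   # (curl G)_x = d-_y G_z - d-_z G_y
    dm(G[0], 2) - dm(G[2], 0),   # (curl G)_y = d-_z G_x - d-_x G_z
    dm(G[1], 0) - dm(G[0], 1),   # (curl G)_z = d-_x G_y - d-_y G_x
])
div_m = dm(curl_m[0], 0) + dm(curl_m[1], 1) + dm(curl_m[2], 2)
max_div = np.max(np.abs(div_m))  # zero to machine precision
```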
Appendix C Relativistic Boris Scheme
We present here the full relativistic Boris scheme that updates the particle four-velocity one time step forward. It can be found as the PUSHER solver named BorisPusher in the PIC module. In the beginning we have a particle with a four-velocity u^n in units of c. The Lorentz update is done via velocities in units of ĉ, since this is the effective speed of light experienced by the electromagnetic fields. To differentiate between these two units, we use primed quantities to mark velocities in units of ĉ.
The scheme begins with the first half of the electric field acceleration, (C.1)
and m_± = m/m_e is the mass of the particle in units of the electron rest mass m_e. This four-velocity in units of ĉ corresponds to a Lorentz factor of (C.3)
We then proceed with the first half of the magnetic rotation, (C.4)
As a final step, we combine the second half of the magnetic rotation with the second half of the electric field acceleration, (C.6)
The final particle four-velocity in units of c is then (C.7)
which summarizes the recipe for updating the particle velocity from u^n to u^{n+1} (i.e., one time step ∆t forward).
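The scheme is the standard relativistic Boris rotation; a simplified Python sketch (assuming the fields are premultiplied by q∆t/2m and a single consistent velocity unit, i.e., omitting the ĉ/c bookkeeping above; an illustration, not runko's BorisPusher implementation):

```python
import numpy as np

# Simplified relativistic Boris scheme (Boris 1970). E_half and B_half
# are assumed premultiplied by q*dt/(2m) so the kicks are plain sums.
def boris_push(u, E_half, B_half):
    u_minus = u + E_half                       # first half electric kick
    gamma = np.sqrt(1.0 + u_minus @ u_minus)   # Lorentz factor
    t = B_half / gamma                         # rotation vector
    s = 2.0 * t / (1.0 + t @ t)
    u_prime = u_minus + np.cross(u_minus, t)   # half magnetic rotation
    u_plus = u_minus + np.cross(u_prime, s)    # full magnetic rotation
    return u_plus + E_half                     # second half electric kick
```

The defining property of the scheme is that a pure magnetic field only rotates the four-velocity, so its magnitude is conserved to rounding error.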
Appendix D Current Deposition
We summarize here the charge-conserving ZigZag current deposition scheme (Umeda et al. 2003), implemented as the ZigZag solver in RUNKO. First, we compute the relay point x_r between the previous particle position x_1 (with closest grid index i_1 = (i_1, j_1, k_1)) and the current position x_2 (with closest grid index i_2) as (D.1)
Here we assume elementwise min and max operators as min{a, b} = (min{a_{x}, b_{x}}, min{a_{y}, b_{y}}, min{a_{z}, b_{z}}).
The current fluxes are then obtained as (D.2) (D.3)
We assume first-order shape functions for the particles. These correspond to particle weights defined at the midpoints (x_r + x_1)/2 and (x_2 + x_r)/2 as (D.4) (D.5)
Depending on the dimensionality, the weights are W_{1,2} = (W_{1,2;x}) for D = 1, W_{1,2} = (W_{1,2;x}, W_{1,2;y}) for D = 2, or W_{1,2} = (W_{1,2;x}, W_{1,2;y}, W_{1,2;z}) for D = 3. Finally, the current density J = FW needs to be deposited onto the adjacent grid points following a charge-conserving strategy; this can be implemented following Umeda et al. (2003).
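In 1D the relay-point rule of Eq. (D.1) and the fluxes of Eqs. (D.2)–(D.3) reduce to a few lines. The sketch below uses unit grid spacing and time step and floor cell indices, as an illustration of the Umeda et al. (2003) rule rather than runko's ZigZag solver itself:

```python
# 1D sketch of the ZigZag relay point and segment fluxes (Umeda et al.
# 2003), with unit grid spacing and time step for clarity.
def zigzag_1d(x1, x2, q=1.0, dx=1.0, dt=1.0):
    i1, i2 = int(x1 // dx), int(x2 // dx)   # (floor) cell indices
    # relay point: the trajectory is split at most once, at a cell edge;
    # if both endpoints share a cell, the relay point is the midpoint
    xr = min(min(i1, i2) * dx + dx,
             max(max(i1, i2) * dx, 0.5 * (x1 + x2)))
    F1 = q * (xr - x1) / dt    # flux carried by the first segment
    F2 = q * (x2 - xr) / dt    # flux carried by the second segment
    return xr, F1, F2
```

For a cell-crossing step x_1 = 0.8 → x_2 = 1.3 the relay point lands exactly on the cell edge at 1.0, and the two fluxes always sum to q(x_2 − x_1)/∆t, which is the charge-conservation property of the scheme.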
References
 Bacchini, F., Ripperda, B., Philippov, A. A., & Parfrey, K. 2020, ApJS, 251, 10 [NASA ADS] [CrossRef] [Google Scholar]
 Birdsall, C., & Langdon, A. 1985, Plasma Physics via Computer Simulation, The Adam Hilger Series on Plasma Physics (New York: McGrawHill) [Google Scholar]
 Blinne, A., Schinkel, D., Kuschel, S., et al. 2018, Comput. Phys. Commun., 224, 273 [NASA ADS] [CrossRef] [Google Scholar]
 Boris, J. P. 1970, in Proc. Fourth Conf. Num. Sim. Plasmas, Naval Res. Lab, Washington, DC, 3 [Google Scholar]
 Bowers, K. J., Albright, B. J., Yin, L., Bergen, B., & Kwan, T. J. T. 2008, Phys. Plasmas, 15, 055703 [NASA ADS] [CrossRef] [Google Scholar]
 Brandenburg, A. 2003, in Advances in Nonlinear Dynamos (CRC Press), 269 [CrossRef] [Google Scholar]
 Bret, A., Gremillet, L., & Dieckmann, M. E. 2010, Phys. Plasmas, 17, 120501 [Google Scholar]
 Buneman, O. 1959, Phys. Rev., 115, 503 [NASA ADS] [CrossRef] [Google Scholar]
 Buneman, O., Neubert, T., & Nishikawa, K. I. 1992, IEEE Trans. Plasma Sci., 20, 810 [CrossRef] [Google Scholar]
 Buneman, O., Nishikawa, K.I., & Neubert, T. 1993, in Plasma Physics and Controlled Nuclear Fusion (ITC4), eds. H. T. D. Guyenne, & J. J. Hunt [Google Scholar]
 Cerutti, B., Werner, G. R., Uzdensky, D. A., & Begelman, M. C. 2012, ApJ, 754, L33 [NASA ADS] [CrossRef] [Google Scholar]
 Cerutti, B., Philippov, A., Parfrey, K., & Spitkovsky, A. 2015, MNRAS, 448, 606 [Google Scholar]
 Chapman, S., & Cowling, T. G. 1970, The Mathematical Theory of NonUniform Gases (Cambridge: Cambridge University Press) [Google Scholar]
 Chen, A. Y., & Beloborodov, A. M. 2014, ApJ, 795, L22 [NASA ADS] [CrossRef] [Google Scholar]
 Cheng, C. Z., & Knorr, G. 1976, J. Comput. Phys., 22, 330 [NASA ADS] [CrossRef] [Google Scholar]
 Comisso, L., & Sironi, L. 2018, Phys. Rev. Lett., 121 [Google Scholar]
 Comisso, L., & Sironi, L. 2019, ApJ, 886, 122 [Google Scholar]
 Courant, R., Friedrichs, K., & Lewy, H. 1928, Math. Ann., 100, 32 [Google Scholar]
 Dawson, J. 1962, Phys. Fluids, 5, 445 [NASA ADS] [CrossRef] [Google Scholar]
 Dawson, J. M. 1964, Phys. Fluids, 7, 419 [NASA ADS] [CrossRef] [Google Scholar]
 Derouillat, J., Beck, A., Pérez, F., et al. 2018, Comput. Phys. Commun., 222, 351 [NASA ADS] [CrossRef] [Google Scholar]
 Esirkepov, T. Z. 2001, Comput. Phys. Commun., 135, 144 [NASA ADS] [CrossRef] [Google Scholar]
 Frederiksen, J. T., Hededal, C. B., Haugbølle, T., & Nordlund, Å. 2004, ApJ, 608, L13 [NASA ADS] [CrossRef] [Google Scholar]
 Fried, B. D. 1959, Phys. Fluids, 2, 337 [NASA ADS] [CrossRef] [Google Scholar]
 Godfrey, B. B. 1974, J. Comput. Phys., 15, 504 [NASA ADS] [CrossRef] [Google Scholar]
 Greenwood, A. D., Cartwright, K. L., Luginsland, J. W., & Baca, E. A. 2004, J. Comput. Phys., 201, 665 [NASA ADS] [CrossRef] [Google Scholar]
 Haggerty, C. C., & Caprioli, D. 2019, ApJ, 887, 165 [NASA ADS] [CrossRef] [Google Scholar]
 Higuera, A. V., & Cary, J. R. 2017, Phys. Plasmas, 24, 052104 [NASA ADS] [CrossRef] [Google Scholar]
 Hockney, R. W., & Eastwood, J. W. 1981, Computer Simulation Using Particles (New York: McGrawHill) [Google Scholar]
 Howes, G. G., Cowley, S. C., Dorland, W., et al. 2006, Astrophys. J., 651, 590 [NASA ADS] [CrossRef] [Google Scholar]
 Howes, G. G., Dorland, W., Cowley, S. C., et al. 2008, Phys. Rev. Lett., 100 [Google Scholar]
 Hunter, J. D. 2007, Comput. Sci. Eng., 9, 90 [Google Scholar]
 Jackson, J. D. 1975, Classical Electrodynamics (New York: Wiley) [Google Scholar]
 Jakob, W., Rhinelander, J., & Moldovan, D. 2017, Pybind11  Seamless operability between C++11 and Python, https://github.com/pybind/pybind11 [Google Scholar]
 Kagan, D., Milosavljevic, M., & Spitkovsky, A. 2013, ApJ, 774, 41 [NASA ADS] [CrossRef] [Google Scholar]
 Komissarov, S. S., 2002, MNRAS, 336, 759 [Google Scholar]
 Kumar, P., & Narayan, R. 2009, MNRAS, 395, 472 [NASA ADS] [CrossRef] [Google Scholar]
 Langdon, A. B., Arons, J., & Max, C. E. 1988, Phys. Rev. Lett., 61, 779 [CrossRef] [Google Scholar]
 MacDonald, N. R., & Marscher, A. P. 2018, ApJ, 862, 58 [CrossRef] [Google Scholar]
 Nordlund, Â., Ramsey, J. P., Popovas, A., & Küffmeier, M. 2018, MNRAS, 477, 624 [NASA ADS] [CrossRef] [Google Scholar]
 Nättilä, J., & Beloborodov, A. M. 2021, ApJ, 921, 1 [CrossRef] [Google Scholar]
 Nättilä, J., & Beloborodov, A. M. 2022, Phys. Rev. Lett., 128, 7 [CrossRef] [Google Scholar]
 Philippov, A. A., & Spitkovsky, A. 2014, ApJ, 785, L33 [NASA ADS] [CrossRef] [Google Scholar]
 Plotnikov, I., Grassi, A., & Grech, M. 2018, MNRAS, 477, 5238 [Google Scholar]
 Servidio, S., Valentini, F., Perrone, D., et al. 2014, J. Plasma Phys., 81 [Google Scholar]
 Sironi, L., & Spitkovsky, A. 2009, ApJ, 698, 1523 [NASA ADS] [CrossRef] [Google Scholar]
 Sironi, L., & Spitkovsky, A. 2014, ApJ, 783, L21 [NASA ADS] [CrossRef] [Google Scholar]
 Sironi, L., Spitkovsky, A., & Arons, J. 2013, ApJ, 771, 54 [Google Scholar]
 Spitkovsky, A. 2005, in American Institute of Physics Conference Series, Astrophysical Sources of High Energy Particles and Radiation, eds. T. Bulik, B. Rudak, & G. Madejski, 801, 345 [NASA ADS] [CrossRef] [Google Scholar]
 Tanaka, M. 1983, Phys. Fluids, 26, 1697 [NASA ADS] [CrossRef] [Google Scholar]
 Umeda, T., & Omura, Y. 2005, ISSS-7, 26
 Umeda, T., Omura, Y., Tominaga, T., & Matsumoto, H. 2003, Comput. Phys. Commun., 156, 73
 Valentini, F., Trávníček, P., Califano, F., Hellinger, P., & Mangeney, A. 2007, J. Comput. Phys., 225, 753
 van der Walt, S., Colbert, S. C., & Varoquaux, G. 2011, Comput. Sci. Eng., 13, 22
 Vay, J.-L. 2008, Phys. Plasmas, 15, 056701
 Verlet, L. 1967, Phys. Rev., 159, 98
 Weibel, E. S. 1959, Phys. Rev. Lett., 2, 83
 Werner, G. R., Uzdensky, D. A., Cerutti, B., Nalewajko, K., & Begelman, M. C. 2015, ApJ, 816, L8
 Winske, D., Yin, L., Omidi, N., et al. 2003, in Hybrid Simulation Codes: Past, Present and Future: A Tutorial, eds. J. Büchner, C. Dum, & M. Scholer, 615, 136
 Yee, K. 1966, IEEE Trans. Antennas Propag., 14, 302
 Yuan, F., & Narayan, R. 2014, ARA&A, 52, 529
 Zenitani, S., & Hoshino, M. 2001, ApJ, 562, L63
 Zhdankin, V., Uzdensky, D. A., Werner, G. R., & Begelman, M. C. 2017a, MNRAS, 474, 2514
 Zhdankin, V., Werner, G. R., Uzdensky, D. A., & Begelman, M. C. 2017b, Phys. Rev. Lett., 118
Alternatively, such a system can be modeled by solving the partial differential equations directly on a grid (known as the Vlasov method; Cheng & Knorr 1976).
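As an illustrative sketch of this idea (not RUNKO source code; the function name and grid layout below are our own), one spatial substep of the Cheng & Knorr (1976) operator splitting advects each velocity slice of the distribution f(x, v) along x by v dt, which on a uniform periodic grid can be written as a semi-Lagrangian interpolation shift:

```python
import numpy as np

# Minimal illustration of one spatial advection substep of the
# Cheng & Knorr (1976) splitting for the 1D Vlasov equation:
#   f(x, v, t + dt) = f(x - v*dt, v, t)
# evaluated here on a uniform, periodic x grid.
def advect_x(f, x, v, dt):
    """Shift each velocity column of f(x, v) backwards along x by v*dt."""
    # assumes a uniform grid with at least two points; periodic domain length
    L = x[-1] - x[0] + (x[1] - x[0])
    f_new = np.empty_like(f)
    for j, vj in enumerate(v):
        # departure points of the characteristics, wrapped periodically
        xd = (x - vj * dt - x[0]) % L + x[0]
        f_new[:, j] = np.interp(xd, x, f[:, j], period=L)
    return f_new
```

A full Vlasov step would alternate this x-advection with an analogous v-advection driven by the electric field; the sketch above only shows the free-streaming half.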
The C++14 and C++17 standards are generally considered “transition standards” that bridge the gap between the older C++11 and the newer C++20 standard. RUNKO will transition to fully conform to the C++20 standard as soon as sufficient HPC compiler support can be guaranteed. At the moment, many of the newer C++20 template metaprogramming features are shipped together with RUNKO itself, implemented using C++11.
For one particular test we partitioned a 3D PIC shock simulation grid of 7680 × 120^{2} cells with tiles of size 8^{3}, 15^{3}, or 30^{3} (keeping the number of processors fixed at 1280 cores) and found absolute wall-clock times per lap of 1.34, 0.66, and 1.22 s, respectively. Smaller but more numerous tiles incur more expensive communication routines, whereas larger tiles suffer from worse solver performance. The optimal tile size in this particular case is therefore around 15^{3}.
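The tile bookkeeping behind that comparison is easy to reproduce. The helper below is a back-of-the-envelope sketch (not part of RUNKO; the function name is ours) that counts the tiles and average tiles per core implied by each decomposition of the quoted grid:

```python
# Count tiles and tiles-per-core for the 3D PIC shock test quoted above
# (grid 7680 x 120 x 120 cells, 1280 cores, cubic tiles).
def tile_stats(grid, tile, cores):
    """Return (total number of tiles, average tiles per core)."""
    nx, ny, nz = grid
    n_tiles = (nx // tile) * (ny // tile) * (nz // tile)
    return n_tiles, n_tiles / cores

grid = (7680, 120, 120)
for tile in (8, 15, 30):
    n, per_core = tile_stats(grid, tile, cores=1280)
    print(f"tile {tile}^3: {n} tiles, {per_core:.1f} tiles/core")
```

The 8^{3} tiling yields 216 000 tiles (about 169 per core, hence heavy communication), while 30^{3} yields only 4096 tiles (about 3 per core, leaving little room for load balancing); 15^{3} sits in between with 32 768 tiles.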
All Figures
Fig. 1 Schematic presentation of the PIC code’s class relationships and connectivity. The FIELDS module is built on top of the CORGI module, which provides the spatial simulation grid. It contains the electromagnetic fields Ê and B̂ and the methods to advance them in time. The PIC module handles the computational particles. It is coupled to the FIELDS module via the current ĵ induced by the moving charged particles. Similarly, the FIELDS module is coupled to the PIC module by the Lorentz force.
Fig. 2 Simulations of Weibel instability growth rates in relativistic counterstreaming plasmas. The different curves show the magnetic field energy density B^{2}/8π (normalized by the relativistic plasma enthalpy density, γ_{b}n_{0}m_{e}c^{2}) as a function of time tω_{p} for beam velocities of γ_{b} = 3 (blue), 10 (orange), 30 (green), and 100 (red). The dashed lines show the expected analytic growth rates. 
Fig. 3 Visualization of a relativistic collisionless shock in unmagnetized pair plasma at tω_{p} ≈ 500. The box periphery shows the plasma number density n/n_{0}, and the exploded panels show the electric current density along the flow, J_{x}/en_{0}c. The shock is located in the middle of the box at x = 0; the same region shows strong current filaments that are responsible for mediating the shock.
Fig. 4 Visualizations of the kinetic turbulence simulations. The top row shows the honeycomb-like domain decomposition that partitions the grid between different processors with the CORGI library (the color map is cyclic). The bottom row shows the plasma density (in units of the initial plasma mass density ρ_{0}). The left panels show the full simulation domain, whereas the right panels focus on a smaller subregion that better illustrates the small-scale structures.
Fig. 5 Code performance analysis for 5120^{2} resolution turbulence runs with 1024 cores. The black line shows the total push time; the other curves show the individual results for each component in the loop (see Table 1). In a perfectly load-balanced state (n ≲ 200) the total particle push time is measured to be around 0.7 µs, whereas in a strongly load-imbalanced state an average time of about 1.1 µs is obtained.
Fig. 6 Weak scaling results of the PIC code measured in a strongly load-imbalanced simulation state. Scaling is presented in terms of the mean particle push time per processor for various simulations that have a fixed ratio of total number of particles to processors. The black line shows the total mean push time; the other curves show the individual results for each component in the loop (see Table 1). Ideal scaling corresponds to a horizontal line, although there is a slight increase in the evaluation time due to more time spent on communication routines as the number of processors increases.
Fig. 7 Strong scaling results of the PIC code measured in a strongly load-imbalanced simulation state. Results are for a 5120^{2} box size with 128 particles per cell. The black line shows the total evaluation time per lap; the other curves show the individual results for each component in the loop (see Table 1). The ideal scaling corresponds to the thick black dotted line.