A high-performance and portable asymptotic preserving radiation hydrodynamics code with the M1 model

Aims. We present a new radiation hydrodynamics code, called"ARK-RT"which uses a two-moment model with the M1 closure relation for radiative transfer. This code aims at being ready for high-performance computing, on exascale architectures. Methods. The two-moment model is solved using a finite volume scheme. The scheme is asymptotic preserving to capture accurately both optically thick and thin regimes. We also propose a well-balanced discretization of the radiative flux source term able to capture constant flux steady states with discontinuities in opacity. We use the library Trilinos for linear algebra and the package Kokkos allows us to reach high-performance computing and portability across different architectures, such as multi-core, many-core, and GP-GPU. Results. ARK-RT is able to reproduce standard tests in both free-streaming and diffusive limits, including purely radiative tests and radiation hydrodynamics ones. Using a time-implicit solver is profitable as soon as the time step given by the hydrodynamics is 50-100 times larger than the explicit time step for radiative transfer, depending on the preconditioner and the architecture. Albeit more work is needed to ensure stability in all circumstances. Using ARK-RT, we study the propagation of an ionization front in convective dense cores. We show that the ionization front is strongly stable against perturbations even with destabilizing convective motions. As a result, the presence of instabilities should be interpreted with caution. Overall, ARK-RT is well-suited to study many astrophysical problems involving convection and radiative transfer such as the dynamics of H ii regions in massive pre-stellar dense cores and future applications could include planetary atmospheres.


Introduction
In many astrophysical situations, radiation is an important process which interacts with the surrounding gas, e.g. in (exo) planet's atmospheres (e.g. Thomas & Stamnes 2002), massive stars (e.g. Kuiper et al. 2010;Mignon-Risse et al. 2020), H ii regions (e.g. Spitzer 1978), up to the cosmic reionization (e.g. Stiavelli 2009). In all these situations, the radiation can be absorbed, thus heating the surrounding gas. Photons could also be emitted by the gas or by another source. Photons could be scattered, which will change their direction of propagation and perhaps their frequency (Chandrasekhar 1960).
Two main regimes can arise, depending on the mean free path of photons compared to the characteristic length of the system (Mihalas & Mihalas 1984). On one hand, in the diffusive limit, the medium is optically thick (mean free path of photons much smaller than the characteristic length), the radiation and the matter strongly interact with each other. On the other hand, in the free-streaming regime, the radiation does not affect the gas, and the medium is optically thin (mean free path of photons greater than the characteristic length). Numerically, it is difficult to accurately capture both limits.
Due to the high number of degrees of freedom (i.e. the time, the position, the direction of propagation, and the frequency of photons), only a few problems can be solved analytically (Chandrasekhar 1960) and direct simulations are out of reach for modern computers. Different models have been developed to reduce e-mail: helene.bloch@cea.fr the computational cost. We will focus on the moment models (Levermore 1984): the specific intensity is averaged over the direction of propagation of photons. It presents several advantages: the computational cost is lower than other methods such as Monte-Carlo method and, mostly, it is easy to couple it with a grid-based hydrodynamics code.
One can consider only the moment of order 0 (the radiative energy), leading to the flux-limited diffusion (FLD) approximation (Levermore & Pomraning 1981). Because this model considers only the moment of order 0, its computational cost is quite low, but it is very diffusive in the free-streaming regime. To tackle this issue, one can use a two-moment model (radiative energy and radiative flux), with the M 1 closure relation (Dubroca & Feugeas 1999). However, this method can suffer from artifacts when multiple beams cross in the free-streaming regime (González 2006). One can solve this issue by using a three-moment model (radiative energy, radiative flux, and radiative pressure) with the M 2 closure relation (Pichard et al. 2016). However, because of the increase of unknowns, the computational cost also increases. In this work, we have chosen to use the two-moment model with the M 1 closure relation because the computational cost remains affordable, and we do not encounter in our applications the problem of beams crossing in the freestreaming regime.
Even though the M 1 model is accurate in both free-streaming and diffusive regimes at the continuous level, numerical schemes also need to properly capture both limits. Several approaches have been developed. For example, Berthon & Turpault (2011) Article number, page 1 of 18 arXiv:2011.13926v1 [astro-ph.IM] 26 Nov 2020 A&A proofs: manuscript no. main presented a scheme based on an HLL solver with source terms modified with a free parameter. Following this idea, we propose a new so-called asymptotic preserving scheme, also based on an HLL solver. Nevertheless, we have chosen another parameter to recover the asymptotic behavior, in the diffusive limit. Furthermore, our integration of source terms is different. In many physical applications (e.g. clouds), optically thick regions are found next to optically thin zones. We propose a well-balanced modification of the source term, which allows us to accurately reach steady states in the presence of sharp transitions.
As a first application, we are interested in the development of H ii regions in massive pre-stellar dense cores (Churchwell 2002). We focus on solving the radiation hydrodynamics equations. An explicit solver for the radiative transfer would be restricted by a Courant-Friedrichs-Lewy (CFL) condition, limited by the speed of light. This will result in a very low time step compared to the hydrodynamics one, which is limited by the speed of sound of the fluid. Several methods have been developed to get around this problem. The one we have chosen is a time-implicit solver (e.g. González et al. 2007). The temporality of the radiative transfer will be preserved, which is not the case with the reduced speed of light approximation (RSLA, e.g. Gnedin & Abel 2001). However, this method is costly because it requires solving large sparse linear systems.
Fortunately, linear algebra for high-performance computing (HPC) has been investigated over the years. Because the linear system we will have to solve is large and sparse, direct methods are out of reach. Iterative solvers with preconditioners have been developed to tackle this issue (e.g. Saad 2003). In this work, we use the library Trilinos (Heroux et al. 2005) because it allows us to target different architectures, such as multi-core, many-core, and GP-GPU. It also provides, among others, algebraic multigrid (AMG) preconditioners.
The paper is organized as follows. In the next section, we present more precisely the moment model and the M 1 closure relation. We go through our new numerical scheme, well-balanced and asymptotic preserving in the diffusive limit in Sect. 3. In Sect. 4, we give details of some implementation features of our code ARK-RT, especially the Kokkos and Trilinos libraries used for shared memory parallelism and linear algebra for highperformance computing. We also show some performance results. In Sect. 5, we present some numerical test cases to show the importance of the asymptotic preserving and well-balanced properties. Finally, in Sect. 5.6 we present a physical application about the stability of the ionization front in H ii region dense cores.

Physical model
In this work, we only consider gray radiative transfer, i.e. we have computed the average over the frequency. We also assume local thermodynamic equilibrium (LTE) and we do not consider scattering. The mathematical description of radiative transfer was formalized by Chandrasekhar (1960): where I, the quantity of interest, is the specific intensity, c is the speed of light, σ is the opacity of the medium and B is the black body specific intensity. x, t, and Ω are the spatial, temporal, and angular variables. This model can be generalized to multigroup radiative transfer (e.g. Turpault 2005).
Because the specific intensity I depends on six variables in the three-dimensional case, the numerical treatment of Eq. 1 becomes rapidly costly. To reduce the computational cost, we use a moment method, detailed in the next section.

M 1 model
Let us consider the three first moments of the specific intensity: the radiative energy E r , the radiative flux F r , and the radiative pressure P r defined as: The mean over solid angles of Eq. 1 and its product by Ω give the following system: where T g is the gas temperature and a r is the radiation constant. The fluid and the radiation exchange energy and momentum through emission and absorption. To ensure the conservation of the total energy when the hydrodynamics is frozen, the energy exchange term is given by ρc v T g is the gas internal energy, with ρ the density of the fluid and c v the heat capacity, defined by c v = k b µm H (γ−1) for a perfect gas, where k b is the Boltzmann constant, µ is the mean molecular weight, m H is the mass of hydrogen and γ is the adiabatic index of the gas.
We still have to specify a closure relation, i.e. a way to express P r as a function of E r and F r . The one chosen here is the M 1 model (Levermore 1984). From Dubroca & Feugeas (1999), we write P r = DE r where D is the Eddington tensor, defined by D = 1−χ 2 I + 3χ−1 2 n ⊗ n with χ the Eddington factor, I the identity matrix and f = F r cE r = f n is the reduced flux. χ can be specified either by applying a Lorentz transform to an isotropic distribution of photons (Levermore 1984), either by minimizing the radiative entropy (Dubroca & Feugeas 1999). In both cases, we Let us notice that f ≤ 1, which ensures that the radiative energy cannot be transported faster than the speed of light. The M 1 model preserves both free-streaming and diffusive limits. On one hand, if f = 1, then P r = E r n ⊗ n, only the transport regime remains. On the other hand, if f = 0, the model in the diffusive regime simplifies into the P 1 model, with P r = 1 3 E r I. The radiative pressure tensor becomes isotropic. We will look into the diffusion limit more precisely in Sect. 2.4.

Radiation hydrodynamics
We now consider the radiation hydrodynamics equations. The fluid evolution is described by the Euler equations expressing conservation of mass, balance of momentum, and balance of energy. Because photons are relativistic particles, we have to evaluate the quantities in the laboratory frame or the comoving frame, which is moving with the fluid. On one hand, using the comoving frame introduces non-conservative terms in the left-hand side of the equations. On the other hand, the hyperbolic part of the system remains simple in the laboratory frame, but some source terms have to be incorporated to describe the interactions between the matter and the radiation. We have chosen the second approach. These source terms characterize the momentum and energy exchanges between the fluid and the radiation: where u is the material velocity, p is the pressure of the fluid, g is the external gravitational field, ρE = ρe + 1 2 ρu 2 is the density of total matter energy with e the specific internal energy. The terms S E r (u) and S F r (u) depend on the velocity u. Using Eq. 29 and Eq. 31 from Lowrie et al. (1999), we have at first order in u c . To close the system, we also add the equation of state of an ideal gas: p = ρe(γ − 1).

Diffusive limit in a static fluid
Let us now focus on the diffusive regime with the hydrodynamics frozen, i.e. the limit of large opacity, long timescale, and u = 0. We consider the P 1 closure relation Following Berthon & Turpault (2011), we introduce a rescaling parameter ε to write the time (resp. the opacity) ast = εt (resp. σ = εσ). The radiative energy, the radiative flux, and the gas temperature are expanded with ε, e.g. E r = E r,0 + εE r,1 + O(ε 2 ). System 7 becomes By expanding Eq. 8a and Eq. 8b at order 0, we have Expanding Eq. 8b at order 1 leads to Finally, expanding the sum of Eq. 8a and Eq. 8c at order 2 gives In the diffusive limit, the total energy E r + ρc v T g at order 0 obeys the diffusion equation given by Eq. 11. A similar development for the radiation hydrodynamics case is done in Appendix A. The aim of the Sect. 3 is to design an asymptotic preserving scheme, i.e. a numerical scheme that will degenerate to the discretization of Eq. 11 in the diffusive regime.

Radiation transport in a static fluid
Let us first introduce some notations: we note ∆x the step along the x-direction. ∆t is the time interval between the current time t n and t n+1 . We write x i the center of the cell i and x i+ 1 2 the interface between the cell i and the cell i + 1. We use the notation u n i to represent the averaged quantity associated with the field u at time t n in the cell i (finite volume). Finally, we note u n i+ 1 2 to represent the quantity associated with the field u at time t n and at the interface between cells i and i + 1.
The development of the numerical scheme is presented only in the one-dimensional case, but its extension to higher dimensions is straightforward. To ease notations, we drop the indices r for all radiative variables.
We present the development of the numerical scheme using a time-implicit integration, but a similar development can be done with a semi-implicit solver: source terms remain implicit, but the hyperbolic part is time-explicit.

Hyperbolic system
Following González et al. (2007), we discretize the hyperbolic part of Eq. 3 using a first-order Godunov type solver (Toro 2009). From Berthon & Turpault (2011), we also introduce an extra parameter α which will be specified in Sect. 3.1.3: where F * i+ 1 2 and P * i+ 1 2 are the numerical fluxes given by with ξ = 4 − 3 f 2 . See Fig. 1 of González et al. (2007) for more details about the structure of the eigenvalues. {σF} n+1 i is a wellchosen discretization of the term σF r in the cell i and at time t n+1 . It will be specified in the next section.

Well-balanced modification of the source term
From Berthon et al. (2015), a well-balanced scheme catches the correct steady regime. The steady state, if it exists, is given by However, using this formulation, Eq. 15c is discretized as with (∇ · P) n+1 The radiative flux remains cellcentered and is equal to the divergence of radiative pressure, defined at the interfaces of the cells. This can create some spurious flux at the interface when looking for a steady state with a constant flux in the box (see Sect. 5.2). Inspired by well-balanced schemes for hydrodynamics (e.g. Padioleau et al. 2019) which preserve the hydrostatic balance between the pressure forces and the gravitational force (and the similarity of this balance with the balance between radiative pressure and radiative flux source term in Eq. 15c), we choose to use an average of a face discretization of the radiative flux source term: with One way to interpret this equation is to remember that Equation 16 is obtained with the rectangle rule for numerical integration of Eq. 20: whereas Eq. 18 is given by the trapezoidal rule: To have in the whole domain, we also impose it as boundary condition: where P n+1 0 is the radiative pressure given by the boundary condition. In that way, the radiative flux is centered at the interfaces of the cells, as well as the divergence of radiative pressure. A von Neumann stability analysis of the modified scheme is done in Appendix B.

Asymptotic preserving scheme
We still have to specify our choice for α i+ 1 2 . α i+ 1 2 = 1 corresponds to a classic HLL scheme. However, the solution given by an asymptotic preserving scheme has to approximate the solution of Eq. 11 as soon as the asymptotic regime is reached, i.e. large opacity and long timescale. Unfortunately, a standard HLL scheme does not have this property (see Sect. 5.1). To tackle this issue and get an asymptotic preserving scheme, we choose with f i+ 1 2 = 1 2 f n i + f n i+1 . The derivation of Eq. 25 is done in Appendix C. Other choices can be done (see for example Berthon & Turpault 2011). If σ i+ 1 2 ∆x goes to 0, α i+ 1 2 goes to 1, and we recover a standard HLL scheme. Considering the diffusive limit, we prove that the scheme is asymptotic preserving in Appendix C. We show that which is a standard discretization of Eq. 9, Eq. 10 and Eq. 11. Unfortunately, we cannot prove that this scheme will preserve the admissible states f < 1 and, indeed, numerical experiments with this scheme have shown that we can get f > 1 when we are close to the free-streaming regime. In these situations, we can either enforce f < 1 (Sect. 5.6) or come back to a centered discretization of the source term (Sect. 5.4). Furthermore, the development of the asymptotic preserving scheme with the well-balanced modification of the source term is only done in the case of a static fluid. Using the asymptotic correction Eq. 25 is only the first step to have an asymptotic preserving scheme in the case of a moving fluid (see Sect. 6.2).

Coupling to hydrodynamics
Following González et al. (2007), the resolution of the whole system 5 describing radiation hydrodynamics is split into three steps: 1. update of the hydrodynamics quantities (Eq. 5a, Eq. 5b and Eq. 5c without the terms of energy and momentum exchange) using the well-balanced and all-regime solver developed in Padioleau et al. (2019); 2. update of the radiative quantities and gas temperature (Eq. 3 and Eq. 4) using the solver developed in Sect. 3.1. During this step, the hydrodynamics quantities are frozen; 3. addition of source terms S E r (u) and S F r (u). For simplicity, all source terms which depend on the velocity are treated explicitly. The term σ c F r in Eq. 5b and Eq. 5e is discretized using the well-balanced scheme proposed in Sect. 3.1.2. All the other terms remain cell-centered.
This splitting allows reducing the number of equations solved implicitly, making the method more efficient.

Algorithm for non-linear implicit solver
The time step given by the CFL condition is much smaller for radiation than for hydrodynamics. Indeed, for the radiation, it is limited by the speed of light, whereas it is limited by the speed of sound of the fluid for the hydrodynamics. Because we are interested in radiation hydrodynamics, we will consider a long timescale for the radiative transfer. Therefore, we use a timeimplicit integration for the radiative transfer.

Newton-Raphson method and linear solver
Because of the Eddington tensor, the eigenvalues in the numerical fluxes, and the a r T 4 g factor, the system is non-linear. It is solved using a Newton-Raphson method. At each iteration, we have to solve a linear system. Because the system is large ((2 + d)N unknowns, where d is the number of dimensions and N the total number of cells) and sparse, it cannot be solved using a direct method. Because of the numerical fluxes, the matrix is not symmetric, and we use the biconjugate gradient stabilized method (Van der Vorst 1992).

Preconditioner
Using large time steps for the radiative transfer, the matrix is ill-conditioned and iterative methods might not converge. One way to deal with this issue is to use a preconditioner. Instead of solving the original linear system Ax = b, we solve the right preconditioned system AK −1 Kx = b via solving AK −1 y = b to compute y and then Kx = y. As long as the matrix K is invertible, this gives the same solution as the original system. If K is wellchosen, the condition number of the matrix AK −1 is lower than A's one. For example, the Jacobi or diagonal preconditioner is given by See Saad (2003) for more details. Numerous preconditioners have been developed over the years, some of them perform well, some do not, depending on the problem considered. Among different preconditioners, we will explore the algebraic multigrid (AMG) technique. We present this method in the next section, and we compare it with other preconditioners in Sect. 4.3: a standard ILU(k) factorization (Saad 2003), a slightly modified variant of the standard ILU factorization (Saad 1994), an additive Schwarz domain decomposition (Cai & Sarkis 1999), and a classical relaxation method, Jacobi with damping (Saad 2003).

Algebraic multigrid (AMG) preconditioner
The algebraic multigrid (AMG) methods were first developed as linear solvers for symmetric positive definite matrices arising from the discretization of scalar elliptic PDEs. For such a matrix, classical iterative methods are efficient to compute the high frequencies of the solution, but lack efficiency to compute its low frequencies. However, the computation is easier on a coarser grid with fewer unknowns. The idea of the multigrid solver is to build a coarser grid, then solve the problem on this coarse grid and finally interpolate the solution on the fine grid. We can then define a restriction operator R which transfers vectors from the fine grid to the coarse grid and an interpolation operator P used to return to the finer grid. P and R are non squared matrices. From Saad (2003), here are the main steps of the method: 1. pre-smoothing: a few iterations of a simple method such as Jacobi or an incomplete factorization are performed, to get the valuex; 2. the residualr = b − Ax is projected over the coarse grid with the restriction operator R, to get the residual equation RAPy = Rr; 3. this equation is solved, possibly with a direct solver; 4. the solution y is interpolated over the fine grid with the interpolation operator P and thenx =x + Py; 5. post-smoothing: a few iterations of a simple method are again performed to get the solutionx.
The solutionx is used as a preconditioner result. If the coarse grid has too many unknowns to be solved directly, this process is applied recursively: the coarse grid becomes the fine grid and a coarser grid is built. Therefore, we have a hierarchy of grids. With a geometric multigrid solver, the restriction and interpolation operators are determined by the mesh, whereas, with an algebraic multigrid solver, they are automatically generated, using data from the matrix.
Before showing numerical tests, we present some details about implementation and parallelization in the next section.

Implementation and parallelization
Our implementation has been done in the code ARK-RT 1 , a fork of the code ARK developed in Padioleau et al. (2019). The hydrodynamics and gravity part of our solver is similar to ARK and is solved with a well-balanced and all-regime solver. Because the solver for the radiative transfer equations is time-implicit, we have to solve large sparse linear systems. It is done using the library Trilinos, described in the next section.

Linear algebra
We use the second generation of packages of the framework Trilinos (Heroux et al. 2005), i.e. Kokkos (Edwards et al. 2014) for shared memory computation, Tpetra (Baker & Heroux 2012) for distributed vectors and matrices, Belos (Bavier et al. 2012) for linear solvers, Ifpack2 (Prokopenko et al. 2016) for classical preconditioners and MueLu (Berger-Vergiat et al. 2019a,b) for the AMG preconditioner. Let us focus on the package Kokkos.

Shared memory computation
As new architectures have more and more cores, the distributed memory model is not enough to take advantage of all the computational power available. Therefore, we need to use a shared memory model inside the nodes. Furthermore, computational nodes are more and more heterogeneous, for example, multicores, many-cores, or accelerators such as GP-GPUs. Each architecture requires its own interface, such as OpenMP or C++11 threads for multi-cores and many-cores processors and CUDA or OpenACC for NVIDIA GPUs. This raises the problem of portability and performance portability: many HPC codes are optimized for some specific architectures, so running the code on a different architecture will result in bad performance.
The library Kokkos (Edwards et al. 2014) tackles this issue. The user has a unique code which can be compiled with different shared memory models such as OpenMP or CUDA. Kokkos is based on different abstractions, like execution spaces (where a function is executed), memory spaces (where the data are), and execution policies (how the function is executed). It provides execution patterns such as parallel loops and multidimensional arrays, the storage is optimized according to the architecture.

Implementation
For the hydrodynamics step, Kokkos is used as an independent library for shared memory computation. Communications between the nodes are handled by the Message Passing Interface (MPI) programming model through a regular domain decomposition. Following Kestener (2017), inside each node, the domains are endowed with ghost cells used to implement physical boundary conditions, but also to contain values from neighbor domains. The code is organized with computational kernels, each kernel is a C++ functor. See Padioleau et al. (2019) for more details.
The second step is the time-implicit solver for radiative transfer. The values of the matrix and the right-hand side of the linear system have to be updated at each iteration of the Newton-Raphson method, but the graph of the matrix does not change. Using Trilinos, the rows of the matrix are distributed across the MPI processes, each of them is associated with a unique global index. Each global index has a matching local index on the owning process.
Trilinos provides several methods to update the coefficients of the matrix. One of them uses only global indices, which is the way recommended by Trilinos. However, this function can only be called by the host. This has two main consequences. First, because all the rows of the matrix have to be updated, we use a sequential loop over the rows of the matrix. Second, in the case where we are using GP-GPUs, we have to update the matrix with data coming from the device. Therefore, we have to transfer some data from the device to the host, update the matrix with these data, and then transfer the matrix from the host to the device, this last step is done implicitly by Trilinos. This will increase the computational cost.
Another way is to use local indices. The package KokkosKernels, part of the Kokkos ecosystem, provides several ways to update the coefficients of the matrix through a kernel. This allows the use of a parallel loop (via Kokkos::parallel_for) and we avoid data transfers between the host and the device. For performance reasons, we use local indices.

Performances
Thanks to Trilinos, we can use many preconditioners. Unfortunately, they do not behave the same way when the size of the system increases. All tests are performed on Poincare, our local cluster at Maison de la Simulation. Each node consists of two Sandy Bridge E5-2670 @ 2.60 GHz (2 × 8 cores, 32 Go RAM) processors. We use a hybrid configuration MPI/OpenMP, with one MPI process per socket to avoid NUMA effects.
We first performed a weak scaling test, where we consider a two-dimensional case with periodic boundary conditions and a hot source located at the center of each domain. Each MPI process is getting a piece of the whole domain of 1500 2 cells, therefore the size of the system increases with the number of MPI tasks. The resolution is close to the one we are aiming for three-dimensional simulations. The left panel of Fig. 1 shows the mean number of iterations for the linear solver to converge as a function of the number of cells. For all preconditioners, the number of iterations remains constant, around 10 iterations for the AMG preconditioner, around 20 iterations for both incomplete factorizations and the additive Schwarz domain decomposition and around 250 iterations for the relaxation. The right panel of Fig. 1 shows the speed-up as a function of the number of MPI processes. The speed-up reaches a plateau of 80% to 90% of maximum performance, depending on the preconditioner. Figure 2 shows the number of iterations and the speed-up as a function of the number of MPI processes for a strong scaling test. We now consider a Marshak wave propagation in the diffusive limit. The global resolution remains constant as the number of processes increases. It is set to 2048 2 cells. Because the global resolution is constant, one can expect the number of iterations to also remain constant when the number of MPI processes increases. However, using the algebraic multigrid (orange curve) and the incomplete factorizations (green curve), when four MPI processes or more are used, the number of iterations is twice the number of iterations reached with one or two MPI processes. Therefore, the computational time is the same using two or four MPI processes. Furthermore, all tested preconditioners and the linear solver requires several communications per iteration, which likely become the main cost when the local resolution decreases.
Thanks to Kokkos, we can use exactly the same code on different architectures like Sandy Bridge processors and NVIDIA GP-GPUs (e.g. K80). Unfortunately, the memory required by the AMG preconditioner with a 1500 2 simulation is larger than the memory available on a K80 GPU. For the next tests, we use a lower resolution of 1000 2 cells. Table 1   implicit solver only needs a few time steps to reach the same final time. Updating the matrix in parallel allows for a 25% reduction in computational time required. On CPU, the implicit solver is around 160 times faster than the explicit solver, whereas, on GPU, it is only 11 times faster. Figure 3 compares the computational time with different preconditioners, on both CPU and GPU. Except for the implicit solver using the AMG preconditioner, all solvers are faster on GPU than CPU, up to three times faster for the relaxation preconditioner. Part of the AMG algorithm probably remains sequential. On CPU, the AMG preconditioner is faster than the relaxation preconditioner. The other preconditioners are slower, up to a factor 8 between the relaxation and the additive Schwarz domain decomposition on GPU. on the device, but Trilinos still allocates some memory on the host, between 0.125 GB and 0.208 GB, unlike the explicit solver.
Using the relaxation as a preconditioner, the amount of memory allocated is lower than with the other preconditioners (7.3 GB for the relaxation against 11.5 GB for the AMG). Choosing a well-suited preconditioner can be challenging and problem dependent. Once the preconditioner is chosen, it depends on many parameters. For example, Trilinos allows the user to choose the damping factor ω for the relaxation method or the smoother and the coarse solver for the AMG. Performances and stability can largely depend on these choices. For example, the relaxation method seems to be well suited for this problem with low computational time, and memory consumption, but in many other test cases, the linear solver will not converge. The AMG preconditioner performs well on CPU but is less efficient on GPU. Both incomplete factorizations and the additive Schwarz domain decomposition are slightly less efficient than the AMG preconditioner. Overall we have found the AMG preconditioner or relaxation method are a good compromise between stability and performances.
The performances we obtained thanks to Kokkos and Trilinos are encouraging for the study of astrophysical problems. The time step given by the hydrodynamics can be written as CFL ∆x c . Using a relaxation as a preconditioner, we need CFL ≥ 50 on CPU and CFL ≥ 100 on GPU to save computational time, whereas, using an incomplete factorization, we need CFL ≥ 250 on CPU and CFL ≥ 1000 on GPU. We need a larger CFL number on GPU because the explicit solver is more efficient on GPU than CPU.
In the next section, we use several numerical tests to show that the scheme developed in Sect. 3 is well suited for the study of radiation hydrodynamics problems.

Numerical results
We performed a series of verification tests to validate different properties of the scheme: the asymptotic correction with a Marshak wave, the well-balanced property to reach a steady state with a jump of opacity, the properties of the M 1 model with a beam test and a shadow test, and the coupling to the hydrodynamics with radiative shocks. We also present a physical application about the stability of the ionization front in H ii regions dense cores. To ease notations, we define the radiative temperature as T r = E r a r 1 4 .

Marshak wave
From Mihalas & Mihalas (1984), a Marshak wave is the propagation of hot radiation into a cold medium. We consider a onedimensional case in the diffusive limit, to test the asymptotic preserving scheme developed in Sect. 3.1.3.
The length of the computational domain is 1 cm, it is discretized with 400 points. Initially, the medium is at equilibrium with the radiation: T 0 = T r = 300 K, the initial radiative flux is F r = 0. We consider a perfect gas with γ = 5 3 . The hydrodynamics is frozen. The density is constant, such that ρc v = 1 JK −1 cm −3 , the opacity is also constant, with σ = 10 000 cm −1 , therefore σ∆x = 25. At time t = 0, a source is lighted at the left boundary with T r = 1 000 K.
The results are shown in Fig. 5 at time t f = 2 × 10 −4 s. We compare different solutions: a reference solution, the solution given by our asymptotic preserving scheme, and the solution given by a standard scheme. The reference solution is given by a standard discretization of Eq. 11. From Audit et al. (2002), when a scheme that is not asymptotic preserving is used with σ∆x 1, the grid does not sample the mean free path of the photons and the solution is dominated by numerical diffusion. The relative L 2 error between the reference solution and the solution with α i+ 1 2 given by Eq. 25 is 1.1% whereas, with the standard HLL scheme, the relative L 2 error is 84%. Using the asymptotic correction, we recover the correct behavior in the asymptotic limit.

Steady state with a jump of opacity
In the previous case, the opacity is constant, we now consider a test with a jump of opacity, still in the one-dimensional case. We use this test to highlight the need for the well-balanced modification of the source term.
The length of the computational domain is 1 cm, it is discretized with 100 points. Initially, the medium is at equilibrium with the radiation: T 0 = T r = 300 K, the initial radiative flux is F r = 0. The opacity σ is now a function of space: At time t = 0, a source is lighted at the left boundary with T r = 1 000 K. Figure 6 shows the radiative flux at time t f = 10 −3 s. From Eq. 15b, when the steady state is reached, we expect the radiative flux to be constant in the box. Using a standard discretization of the source term, such as Eq. 16, a spurious peak located at the discontinuity of opacity is observed (orange curve). The value taken by the radiative flux is more than 20 times the expected value. This seems to be caused by a numerical instability. This can result in f > 1 during the iterations of our Newton-Raphson implicit scheme, which is not physically admissible. However, using the well-balanced modification of the source term proposed by Eq. 18 (blue curve), the spurious peak does not appear anymore and the constant steady state is reached.
Using the standard discretization of the source term Eq. 16, one can show that the numerical scheme is unconditionally stable, in that the error between the numerical solution and the exact solution goes to 0 as ∆x and ∆t go to 0. The spurious peak seems to be due to a lack of precision in the integration of the source term. Using Eq. 18, the source term is defined at the interfaces of the cells and balances the divergence of radiative pressure, also defined at the interfaces.

Beam
We now perform the same two-dimensional test as in Richling et al. (2001); González et al. (2007). The domain [−1, 1]×[−1, 1] is discretized with 128 × 128 cells. The initial temperature is T 0 = T r = 300 K, the initial radiative flux is F r = 0, the opacity is σ = 0. At time t = 0, a beam with T = T r = 1 000 K is introduced with an angle of 45 • . The beam is located at x = −1 and y ∈ [−0.875, −0.75]. Because we are in the free-streaming regime, we cannot use large time steps. For performance reasons, we use the semi-implicit scheme.
Because there is no opacity, the beam propagates in the vacuum, and we expect it to cross the box without dispersion. The direction of the beam is not along the mesh axis, we use this test to quantify the numerical diffusion. Figure 7 shows the radiative energy at steady state, with the eigenvalues fixed to ±c and with computed eigenvalues. Because there is no opacity, the asymptotic correction nor the well-balanced source term affect the result, and we recover the same result as in González et al. (2007). Figure 8 shows the horizontal cut at the middle height. The beam introduced at the boundary is sampled over 8 cells. Ideally, without any numerical diffusion, we would expect the width of the beam to stay exactly 8 cells. With the computed eigenvalues, we can keep the numerical diffusion under control. Using fixed eigenvalues, the width of the beam at middle height is approximately 30 cells, whereas it is only 24 cells with calculated eigenvalues.
The main difference in this test between ARK-RT and HER-ACLES (González et al. 2007) is the computation of the eigenvalues. We use the exact eigenvalues given by Eq. 14, from Berthon et al. (2007), whereas in González et al. (2007), to save computational time, the eigenvalues are computed once at the beginning of the simulation and then interpolated. However, this approximation does not impact the result.

Shadow
Let us now consider a two-dimensional test with source terms. Following Hayes & Norman (2003); González et al. (2007), we consider a shadow test. The computational domain is a cylinder of length L = 1 cm and radius R = 0.12 cm. It is discretized with 280 × 80 cells. A spheroid clump is located at the center of the box, on the symmetric axis: (z c , r c ) = (0.5, 0). The extension of the clump is (z 0 , r 0 ) = (0.1, 0.06). Initially, the medium is at equilibrium with the radiation, with T 0 = T r = 290 K. We consider a with σ 0 = 0.1 cm −1 . At time t = 0, a source is lighted at the left boundary with T r = 1740 K and the reduced flux is set to f = 1. Because f is close to 1 in the free-streaming regime, we encounter f > 1 in the simulation. To tackle this issue, we use the non-well-balanced scheme: the radiative flux source term is discretized using Eq. 16. Because we are in the free-streaming regime, we cannot use large time steps. For performance reasons, we use the semi-implicit scheme. To recover the same result as in González et al. (2007), we use λ + i+ 1 2 = max(0.1 × c, λ max ) and λ − i+ 1 2 = min(−0.1 × c, λ min ), where λ max and λ min are given by Eq. 14. Figure 9 shows the radiative temperature at the final time t f = 10 −10 s with different closure relations: the P 1 model, the M 1 model with fixed eigenvalues and the M 1 model with computed eigenvalues. Because of the high opacity in the clump, the light does not cross it and we expect the shadow behind it to remain stable.    Fig. 11: Subcritical shock simulation. The figure shows snapshots of gas temperature, radiative temperature, and reduced flux at different times: 1.7 × 10 4 s, 2.8 × 10 4 s, and 3.8 × 10 4 s. (2003); González et al. (2007), we plot the radial profile of the radiative temperature at the right boundary (Fig. 10). Using the P 1 model, the radiative pressure is isotropic, therefore the photons go around the obstacle immediately, heating the whole domain. Using the M 1 closure relation, the shadow is better preserved, the temperature behind the obstacle remains at its initial value, 290 K. As the opacity remains rather low outside of the clump and the light has not crossed the obstacle, the asymptotic correction has no impact on the result. Because the boundary of the clump is smoothed, the transition between the optically thick and thin medium is less sharp than in Sect. 5.2 and the well-balanced modification of the source term is not necessary.

Radiative shocks
We finally consider radiative shocks: the gas and the radiation exchange energy and momentum. Following Ensman (1994); Hayes & Norman (2003); González et al. (2007), we consider a one-dimensional homogeneous medium, with ρ = 7.78 × 10 −10 g cm −3 and σ = 3.1 × 10 −10 cm −1 . We consider a perfect gas with an adiabatic coefficient γ = 7 5 and a mean molecular weight µ = 1. The length of the domain is 7 × 10 10 cm. It is discretized with 400 cells. The initial temperature at the left boundary is set to 10 K and it is increased by 0.25 K per cell. Initially, the radiation is at equilibrium with the gas. The left boundary condition is reflective, the initial velocity of the fluid is set to u 0 . According to the value of u 0 , the shock will be subcritical or supercritical. See González et al. (2007) for more details. To compare our results with Ensman (1994); Hayes & Norman (2003); González et al. (2007), we plot the temperature as a function of x i = x − u 0 t.

Subcritical shock
We first consider a subcritical shock, the initial velocity is set to u 0 = −6 km s −1 . Figure 11 shows the gas temperature, the radiative temperature, and the reduced flux at three different times: 1.7 × 10 4 s, 2.8 × 10 4 s, and 3.8 × 10 4 s. As expected, the gas and the radiation are not at equilibrium, before nor after the shock. The gas temperature reaches 1135 K, as in González et al. (2007), whereas it is only 850 K in Ensman (1994).

Supercritical shock
We consider now a supercritical shock, where the initial velocity is set to u 0 = −20 km s −1 . Figure 12 shows the gas temperature, the radiative temperature, and the reduced flux at three different times: 4 × 10 3 s, 7.5 × 10 3 s, and 1.3 × 10 4 s. As in González et al. (2007), the radiative temperature is the same as the matter temperature on both sides of the shock. The gas and the radiation are therefore at equilibrium. The radiative precursor is larger than the subcritical shock's radiative precursor, as intended, and the temperature reaches 5 000 K, as in Ensman (1994). We also recover the Zel'dovich spike.

Expansion of H ii region
Now that we confirmed the good behavior of the numerical scheme with both the asymptotic preserving and the wellbalanced properties, we can apply it to a physical situation: the propagation of the ionization front in a massive pre-stellar dense core.

Model
We consider the early stage of the development of an H ii region in a massive pre-stellar dense core (Churchwell 2002). We focus on a region of the dense core at about 100 AU from the massive young stellar object (YSO). This region has been heated by the YSO during the pre-main sequence phase. The temperature reached at this location by infrared heating is of the order of 1 000 K and the transport of energy in this region can be dominated by convection. We have inferred the convective state of this region by computing thermal and adiabatic gradients based on observations of Herpin et al. 2009 (Fig. 7). High-energy photons emitted by the YSO when entering the main sequence will start to ionize the surrounding gas. This will trigger the propagation of an ionization front in a convective medium, and we are interested in the stability of such a front perturbed by the pre-existing convective motions.
to trigger convection. T f orcing is the equilibrium temperature profile, depending on space, and τ f orcing is the relaxation timescale. The gas temperature will relax toward the equilibrium temperature profile T f orcing .
By writing cE r = F γ e γ , ρ = n H m H and σ = σ γ n H , we finally have to solve the following system: In this test, we will use the M 1 solver with the asymptotic correction presented in Sect. 3.1.3, but we do not use the wellbalanced discretization of the source term because of stability issues that will be discussed in Sect. 6.1.

Setup
We consider a square domain with a side 5 AU long. We use a setup similar to Padioleau et al. (2019) for compressible convection simulations. The temperature is set at the top and the bottom of the box at 500 K and 1 000 K respectively. The initial temperature is a linear interpolation between the top and the bottom of the box. It is also the forcing temperature profile T f orcing . We take τ f orcing = 10 7 s. These parameters are chosen to trigger the initial convective motions. We also set the pressure at the bottom: 10 −3 dyne · cm −2 (Herpin et al. 2009). The density and the pressure are linked by the ideal gas law: p = ρk b T g m H µ , where µ is the mean molecular weight. The non-ionized medium is made of hydrogen, with µ 1 = 1. When the medium is fully ionized, it is made of atomic nucleus and electrons, so twice as many particles for the same mass. Because the distribution of nucleus and electrons is homogeneous, the mean molecular weight is µ 2 = 0.5. When the medium is partially ionized, we take µ as the mean of the previous values balanced by the fraction of ionization, i.e. µ = (1 − X)µ 1 + Xµ 2 . The density is initialized with the recursive formula p i+1 − p i = 1 2 (ρ i + ρ i+1 ) g∆z, which is the discrete version of the hydrostatic balance ∇p = −ρg.
We impose Neumann boundary conditions for the temperature. The pressure and density are imposed by an extrapolation of the hydrostatic balance. Because the hydrodynamics solver is well-balanced for the gravity, this configuration will remain static, even if the initial condition is unstable. The hydrostatic equilibrium is destabilized with a velocity mode perturbation of the form u(x, y) = 2 · 10 −4 c s sin 2π x − x mid L x sin π y − y mid L y v(x, y) = 2 · 10 −4 c s cos 2π x − x mid L x cos π y − y mid L y , with c s the speed of sound, x mid = y mid = 2.5 AU and L x = L y = 5 AU. Without any interaction with the ionizing photons, the convective motions are stationary. The opacity is set to σ = σ γ ρ m H with σ γ = 6 × 10 −18 cm 2 (Lesaffre 2002). The radiative energy and flux are set to 0 and the medium is not ionized (X = 0). We initialize the hydrodynamics variables with the steady state described previously. At time t = 0, the bottom boundary of the region is ionized: the reduced flux is set to 1 and the radiative energy is set to F * e γ c with F * = 3 × 10 17 cm −2 s −1 in the boundary. The boundary conditions for the hydrodynamics variables remain unchanged.

Results
As the initial condition is such that E r is close to 0, this can easily create some spurious values such that f > 1. This is clearly a numerical artifact induced by the very low value of the radiative energy in regions where no ionizing photons are present. Even with a centered discretization of the radiative flux source term and without the asymptotic correction, we still encounter f > 1 during the simulation. Because of this problem, we impose f = 1 in the computation of the Eddington tensor in the cells where f > 1. Figure 13 shows the evolution of the position of the ionization front as a function of time. With and without the initial convective rolls, the position of the ionization front oscillates around an equilibrium position, between 0.3 AU and 0.4 AU. The oscillations around the equilibrium are expected and have been observed with simpler models (Tremblin et al. 2012). Figure 14 and Fig. 15 show the ionization front at time t = 6× 10 8 s and at the final time t f = 10 10 s. With and without the initial convective rolls, a numerical noise appears, as a consequence of the long timescales. Because of the numerical noise, some lack of symmetry can appear, such as on the left panel of Fig. 14. The fraction of ionization, which is always between 0 and 1, reaches values between 10 −12 and 10 −6 . The effect of the preconditioner The stability of the ionization front is an issue that has been discussed for a long time in the literature (Mizuta et al. 2008). For example, 3D simulations of the expansion of a spherical ionization front in 3D Cartesian grids have shown instabilities either on the axis of the grid or in the diagonal depending on the numerical scheme (see Fig. A3 Bisbas et al. 2015). The dependence of the instability on the grid cast doubts about a possible physical regime. Our test case shows that even with convective motions of large amplitude, the ionization front remains stable.

Discussion and conclusion
We have presented a new radiation hydrodynamics code. The radiative transfer is described with a two-moment model and uses the M 1 closure relation. We first discuss some limitations of our work before reaching our conclusion.

Well-balanced discretization of the source term
In Sect. 3.1.2 we have proposed a well-balanced discretization of the source term on the radiative flux equation. This discretization allows us to properly capture the steady state with constant flux and with a discontinuity of opacity. However, this discretization can lead to spurious oscillations in the radiative flux, a problem that we have encountered in the test case for the expansion of H ii regions. Although we have changed the discretization of the source term to achieve a well-balanced property, our integration of the hyperbolic part on the source term is still split into two steps. Such a splitting strategy might be unstable if the source term is not taken into account in the hyperbolic part. A possible solution to this problem would be to incorporate the source term in a Lagrange-projection-like scheme such as Buet & Despres (2008). Such a strategy might be necessary to treat the radiation hydrodynamics problem of cloud interfaces. 6.2. Asymptotic limit for radiation hydrodynamics In Sect. 3.1.3, we have presented an asymptotic correction which allows us to capture the asymptotic behavior, whereas the solution given by a standard scheme is dominated by numerical diffusion. The asymptotic correction uses the numerical diffusion to recover the physical one, in a static fluid. Nevertheless, this scheme does not capture the asymptotic regime in a moving fluid, as presented in Appendix A. Most of the schemes proposed in the literature do not preserve this asymptotic regime (e.g. González et al. 2007;Berthon & Turpault 2011). The diffusive regime depends on the material velocity, our scheme cannot reach it. A possible solution would be to limit the numerical diffusion with a correction similar to a low Mach correction, e.g. Chalons et al. (2016), in conjunction with a cell-centered discretization of the source term, as proposed by the Lagrangeprojection scheme of Buet & Despres (2008).

Conclusion
The model for radiation hydrodynamics proposed in this paper has the correct behavior in both free-streaming and diffusive limits. It is discretized with an asymptotic preserving scheme. This asymptotic correction is important to capture the correct behavior in the diffusive limit and to preserve the free-streaming regime in a static fluid.
We take advantage of the libraries Kokkos and Trilinos to reach high-performance computing and to solve linear systems. This approach allows us to take advantage of different architectures and to use large time steps for radiative transfer. Using the implicit solver is profitable as soon as the time step given by the hydrodynamics is 50 − 100 times larger than the explicit time step for radiative transfer, depending on the preconditioner and the architecture.
The solver is then coupled with the hydrodynamics code that implements an all-regime and well-balanced solver for hydrodynamics with gravity. The tests performed show that ARK-RT is well suited to study many astrophysical problems. Using this code, we have been able to study the development of H ii regions in massive pre-stellar dense cores, especially the propagation of the ionization front in the presence of convection. We have shown that even with the destabilizing effect of convection, the ionization front is strongly stable against perturbations. A linear stability analysis similar to the Rayleigh-Taylor instability but including source term could give more insight into this behavior.
Further work will consist of the development of a numerical scheme that preserves the admissible states while keeping the asymptotic preserving and well-balanced properties. We will then be able to take full advantage of the largest next generation massively parallel architectures to study, among others, atmospheric physics with opacity interfaces.

Appendix A: Diffusive limit for radiation hydrodynamics
As in Sect. 2.4, we consider the diffusive limit with the P 1 closure relation. We introduce a rescaling parameter ε to write the time (resp. the opacity) ast = εt (resp.σ = εσ). Because the velocity of the fluid is smaller than the speed of light ( u c << 1), we also rescale it asũ = u ε . Let us focus on the equations describing the evolution of the radiative variables: By expanding Eq. A.1a and Eq. A.1b at order 0, we have Expanding Eq. A.1b at order 1 leads to Finally, looking at the radiative energy and the gas internal energy at order 2, source terms cancel each other, only the divergence of the radiative flux at order 1 remains, and we have One can also look at Eq. A.1a at order 2. Expanding Eq. A.1a at order 2 gives ∂tE r,0 − ∇ c 3σ ∇E r,0 = cσ 6a r T 2 g,0 T 2 g,1 + 4a r T 3 g,0 T g,2 − E r,2 We recover Eq. 43 of Krumholz et al. (2007). The term cσ 6a r T 2 g,0 T 2 g,1 + 4a r T 3 g,0 T g,2 − E r,2 is the development at second order of the term κ 0 (4πB − cE). Because we do not neglect any terms O ũ c , some coefficients are slightly different. See the discussion in Krumholz et al. (2007) for the importance of the term 4 3σ c E r,0ũ 2 0 .

Appendix D: Expansion of H ii region
In the test case described in Sect. 5.6, some numerical noise appears, as a consequence of the long timescales. Let us recall that a time-implicit scheme is used, with large time steps for the radiative transfer. At each time step, the Newton-Raphson method is used and, at each iteration of this algorithm, an ill-conditioned linear system is solved, using an iterative process. This results in the appearance of some numerical noise.
Article number, page 17 of 18 A&A proofs: manuscript no. main  We have performed the same simulations as in Sect. 5.6 with different numbers of MPI processes and different preconditioners. The physical domain is either distributed over 4 × 4 MPI processes (Fig. D.1a and Fig. D.1c) or 2 × 2 MPI processes ( Fig. D.1b and Fig. D.1d). We have also tried two preconditioners which allowed us to reach the final time with reasonable computational time: a standard ILU(k) factorization (Fig. D.1a and Fig. D.1b) and an additive Schwarz domain decomposition (Fig. D.1c and Fig. D.1d). Figure D.1 shows snapshots of the fraction of ionization and the velocity field at the final time t f = 10 10 s. The shape of the small structures produced by the numerical noise varies with the number of MPI processes and the preconditioner. Furthermore, the propagation of the ionization front creates some velocity that also depends on the number of MPI processes and preconditioners. However, the position of the ionization front is not affected by these parameters.