Testing strengths, limitations and biases of current Pulsar Timing Arrays detection analyses on realistic data

State-of-the-art searches for gravitational waves (GWs) in pulsar timing array (PTA) datasets model the signal as an isotropic, Gaussian and stationary process described by a power-law. In practice, none of these properties are expected to hold for an incoherent superposition of GWs generated by a cosmic ensemble of supermassive black hole binaries (SMBHBs), which is expected to be the primary signal in the PTA band. We perform a systematic investigation of the performance of current search algorithms, using a simple power law model to characterize GW signals in realistic datasets. We use, as the baseline dataset, synthetic realisations of timing residuals mimicking the European PTA (EPTA) second data release (DR2). Thus, we include in the dataset uneven time stamps, achromatic and chromatic red noise and multi-frequency observations. We then inject timing residuals from an ideal isotropic, Gaussian, single power-law stochastic process and from a realistic population of SMBHBs, performing a methodical investigation of the recovered signal. We find that current search models are efficient at recovering the GW signal, but several biases can be identified due to the signal-template mismatch, which we identify via probability-probability (P-P) plots and quantify using Kolmogorov-Smirnov (KS) statistics. We discuss our findings in light of the signal observed in the EPTA DR2 and corroborate its consistency with an SMBHB origin.


Introduction
Recently, the pulsar timing array (PTA) community has announced the detection of a signal compatible with a gravitational wave (GW) origin.The latest datasets from the European PTA collaboration (EPTA, Antoniadis et al. 2023a), the Chinese PTA collaboration (CPTA, Xu et al. 2023), the North American Nanohertz Observatory for Gravitational Waves collaboration (NANOGrav, Agazie et al. 2023), and the Parkes PTA collaboration (PPTA, Reardon et al. 2023) all show evidence of a red noise process that is common among pulsars and shows correlation properties expected for the long sought after stochastic nano-Hertz (nHz) GW background (GWB, Hellings & Downs 1983).
Pulsar timing arrays are sensitive across the 10 −9 −10 −7 Hz frequency range, where the dominant signal is expected to be an incoherent superposition of sinusoidal GWs emitted by a cosmic population of supermassive black hole binaries (SMBHBs, Rajagopal & Romani 1995;Jaffe & Backer 2003;Wyithe & Loeb 2003;Sesana et al. 2008;Rosado et al. 2015).Nonetheless, a correlated low-frequency signal in PTA data can also arise from GWs generated by early Universe phenomena, such as the non-standard inflationary scenario breaking the slow-roll consistency relations (e.g.Bartolo et al. 2007;Boyle & Buonanno 2008;Sorbo 2011), cosmic string networks (e.g.Damour & Vilenkin 2000), primordial curvature perturbations (e.g.Tomita 1967;Matarrese et al. 1993), turbulence arising in the aftermath of quantum chromodynamics (QCD) phase transitions (e.g.Kosowsky et al. 1992;Hindmarsh et al. 2014), and cosmic domain walls (e.g.Hiramatsu et al. 2014), or they might even originate from oscillations of the Galactic potential in the presence of ultra-light dark-matter (ULDM, Khmelnitsky & Rubakov 2014).Those scenarios inspired numerous studies aiming to test new physics in the early Universe (e.g.Vagnozzi 2023;Madge et al. 2023;Guo et al. 2023;Kitajima et al. 2023;Ellis et al. 2023;Cai et al. 2023;Figueroa et al. 2023;Franciolini et al. 2023) and have also been scrutinised by the EPTA+InPTA and NANOGrav collaborations in two comprehensive interpretation articles (Antoniadis et al. 2023b;Afzal et al. 2023).early Universe interpretations sometimes stress that the detected signal has a higher amplitude and flatter slope than what is expected from an astrophysical population of SMBHBs.This is an observation that has been challenged by Antoniadis et al. (2023b), who showed that a signal with the detected properties can naturally arise from a realistic ensemble of SMBHBs.
Under the hypothesis that the signal has an astrophysical, SMBHB origin, it is important to assess the performance of the state-of-the-art PTA GWB search and parameter estimation algorithms as implemented in the PTA analysis suite enterprise (Ellis et al. 2019) on realistic data.In fact, current pipelines are searching for a stochastic GWB described as a Gaussian, isotropic, and stationary process characterised by a power-law Fourier spectrum, imprinting a correlated red signal in PTA residuals.The peculiar feature that allows us to disentangle those GW-induced delays from other noises is the interpulsar spatial correlation, first derived by Hellings & Downs (1983).Although pipelines searching for individual continuous GWs (CGWs), anisotropy, and broken power-law spectra exist (Babak & Sesana 2012;Ellis et al. 2012;Sampson et al. 2015;Taylor et al. 2020), the evidence for a GW-signal claimed by the different PTA collaborations is based on the aforementioned default assumptions.Conversely, none of these statistical properties are expected to hold for a signal produced by a cosmic ensemble of SMBHBs.Environmental effects and small number statistics are expected to produce spectra that significantly deviate from a smooth single power law (Sesana et al. 2008;Ravi et al. 2014).Furthermore, strong individual CGWs might produce extra power at specific frequencies, also resulting in highly anisotropic and non-Gaussian signals (Sesana et al. 2009;Ravi et al. 2012;Kelley et al. 2018).Finally, the eccentricity of SMBHBs might break the assumption of stationarity (Sesana 2013a).
In a nutshell, there is a clear signal-template mismatch situation, where single power-law, Gaussian, isotropic, stationary templates are used to search for a signal that is unlikely to satisfy any of these hypotheses.Assessing the influence of this basic fact on the reliability of the analysis outcome is of paramount importance, especially in light of the recent PTA claims and of the numerous subsequent interpretation papers that rely on these results.
These issues were first investigated in Cornish & Sampson (2016), who demonstrated, using realistic GW models from (Rosado et al. 2015), that the isotropy hypothesis has only a mild effect on the detectability of the GW signal.Here we take a decisive step forward by conducting a systematic investigation of signal recovery on mock data designed to capture all the complexity of real PTA datasets.We generate mock versions of the EPTA DR2new dataset, described in Antoniadis et al. (2023c), including uneven time-stamps, white noise, chromatic and achromatic red noise, and multi-frequency observations (Antoniadis et al. 2023d).We then inject in our mock dataset two types of signals: (i) a Gaussian, isotropic, stationary GWB created with the libstempo (Vallisneri 2020)1 function createGWB, and (ii) an incoherent superposition of the residuals induced by a population of circular GW-driven SMBHBs.In this latter case, residuals from both the pulsar and Earth terms of each binary are added one by one to the data.This results in a total GWB Fourier spectrum which on average is well fitted by a power-law function of frequency, but is generally much more structured.In particular, bright sources close to Earth can produce very pronounced peaks in power, breaking signal Gaussianity and isotropy at higher frequencies (Sesana et al. 2008).The rationale behind this choice is to test for differences in the performance of the GWB detection pipeline implemented in enterprise when (i) the injected signal matches the power-law template, and (ii) when there is a clear mismatch between the injected signal and power-law template.
The paper is structured as follows.In Sect. 2 we describe the simulated dataset, from the intrinsic pulsar noise to the realis-tic GWB modelling pipeline, and we present the signal recovery model implemented in the enterprise package.Section 3 presents the results of our simulations, which include two sets of createGWB injections (a strong and a weak signal), and a set of injections from a realistic SMBHB population.In Sect.4, we discuss in detail selected realisations of the realistic injections, which help in understanding biases and issues that can arise from the signal-template mismatch.As we were completing this work, Bécsy et al. (2023) presented an independent, parallel investigation that touches on several points examined here.We subsequently discuss similarities and differences between the two works, together with further developments and future directions in Sect. 5.

Simulated data and analysis methods
The main goal of this work is to test our ability to recover a realistic GWB signal in a simulated data set with our existing model.To emulate the complexities of real data, we base all our simulations on the second data release from the EPTA collaboration (Antoniadis et al. 2023a,c,d).In particular, we use the 25 best EPTA pulsars, selected following (Speri et al. 2022).For these, we use the latest estimates of pulsar intrinsic red noise (RN) and dispersion measure (DM) variations to generate different simulated copies of the recently released DR2new dataset.DR2new is a reduced version of the entire second EPTA data release (DR2full), which includes only the last 10.3 years of observations, collected with the new generation wide-band backends.
In the following, we describe how we simulate PTA data using libstempo tools, a python wrapper around the TEMPO2 pulsar timing software (Hobbs et al. 2006;Edwards et al. 2006), and how we obtain realistic GW-induced residuals from state of the art population of SMBHBs.

Timing model and pulsar noise properties
The timing models (TM) of the 25 pulsars are defined by the corresponding parameter files from the EPTA DR2new.We use libstempo to simulate times of arrival (ToAs) for a total observation time of 10.3 years, with a cadence defined by the observations, taken by the five European radio telescopes: the 100m Effelsberg radio telescope in Germany, the Lovell telescope at Jodrell Bank Observatory in the United Kingdom, the Nançay radio telescope operated by the Nançay Radio Observatory in France, the Sardinia Radio Telescope in Italy, and the Westerbork Synthesis Radio Telescope in the Netherlands.ToAs are simulated at two different frequency bands: 1400 and 2200 MHz.Having ToAs at different frequencies allows us to disentangle pulsar intrinsic RN from DM variations.We assume that an initial fit of the TM, obtained with libstempo, reduces it to a linear model where the coefficients are given by a design matrix.Following Van Haasteren et al. (2009), we analytically marginalise the likelihood over the TM parameter errors described by that linear model.
To make our simulations as close as possible to the real data, we include stochastic noise in our datasets.Stochastic noise in pulsar observations is customarily divided into three components: white noise, achromatic, and chromatic red noise.We include all three components in our simulations, following the analysis performed in Antoniadis et al. (2023d), based on the optimisation procedure outlined by Chalumeau et al. (2021).
To model white noise, we distribute ToAs around the values predicted by the TM with a root-mean-square (rms) uncertainty A201, page 2 of 14 given by: where σ ToA is the uncertainty value due to template-fitting errors obtained in EPTA DR2, the EFAC factor takes into account the ToA measurement errors.The EQUAD, added in quadrature, accounts for any other white noise, such as stochastic profile variations, and possible systematic errors.These parameters are specific for each observing backend.In our simulations, we defined EFAC and EQUAD parameters to be constant, setting EFAC = 1.0 and EQUAD = 10 −6 , and identical for all pulsars.
The ToA uncertainties, σ ToA , used in our simulated data sets (see Table 1) are the maximum likelihood estimates obtained in EPTA DR2 (Antoniadis et al. 2023a).
The single-pulsar stochastic achromatic and chromatic red noises are time-correlated signals that can be modelled as a stationary Gaussian process with a power-law spectrum: where RN and DM refer to achromatic and chromatic red noise respectively.The achromatic red noise does not depend on the observing radio frequency and is commonly used in singlepulsar noise models to characterise the long-term variability of the pulsar spin.Conversely, chromatic red noise depends on the observing radio frequency and is due to dispersion measure (DM) variations.In fact, during its propagation, the pulsar radio emission interacts with the ionised interstellar medium (IISM), the Solar System interplanetary medium and the Earth's ionosphere.These interactions lead to frequency-dependent delays in the observed signal: where ν is the radio observing frequency and DM is the path integral of the free-electron density along the line of sight to the pulsar.We take this effect into account in our timing model, which considers the DM value at a reference epoch together with its first and second derivatives.However, the inhomogeneous and turbulent nature of the IISM also induces stochastic variations in the DM value, which are modelled as chromatic red noise.DM variations become more and more important on the decade-long timescales of PTA data (see e.g.Keith et al. 2012).We used the libstempo package to inject in each pulsar RN and DM with spectra given by Eq. (2).In particular, for each pulsar, we defined the values of A RN/DM and γ RN/DM to be equal to the recent maximum likelihood estimates from the EPTA DR2 (Antoniadis et al. 2023a).Those values were obtained from joint analysis runs on the DR2new data set (Antoniadis et al. 2023a) when an additional common red process was included alongside individual pulsar noise terms.We report a complete summary of the individual pulsar's noise parameter values in Table 1.

GWB induced residuals: createGWB vs. realistic SMBHB populations
Having addressed the system and pulsar-related properties of real data, the only thing necessary to complete our data set is the GW signal.Following Rosado et al. (2015), we generate the stochastic GWB-induced signal from the incoherent superposition of individual sinusoidal GW signals emitted by inspiralling SMBHBs.

Ideal signal with createGWB
As described in Phinney (2001), the characteristic strain of the GW signal produced by a population of circular, GW-driven SMBHBs is the integral of the energy emitted by each system over the differential number density of sources per unit redshift z, and chirp masses M2 : where dE gw (M)/dln f r is the energy spectrum emitted by each source (binary).In the circular GW-driven approximation, we can rewrite the emitted energy spectrum as a function of the binary chirp mass and GW rest-frame frequency f r : Here, f r is defined as f r = (1 + z) f and is twice the binary Keplerian rest-frame frequency.By inserting Eq. ( 5) into Eq.( 4), it is straightforward to show that where A GWB is a model-dependent amplitude of the signal at the reference frequency f = 1yr −1 and α GWB = −2/3.The corresponding spectral density S GWB takes the form: with γ GWB = 3 − 2 α GWB = 13/3.We note that the form of Eq. ( 7) is very similar to that of the intrinsic RN spectra (Eq.( 2)).
In fact, in a PTA dataset, a stochastic GWB appears as a red noise that is common to all pulsars and induces a specific angular correlation among pulsar pairs.This correlation is expected to follow, on average, the Hellings & Downs (1983) overlap reduction function.Other examples of red signals correlated over all pulsars are clock and ephemeris errors, which produce, respectively, a monopole and a dipole correlation in the pulsar residuals (Tiburzi et al. 2016).Using libstempo, it is possible to inject a GWB signal into a PTA dataset through the function createGWB.By default, this function simulates the GW-induced delays in a PTA dataset as a common RN, HD-correlated over pulsars and with a smooth power-law shaped spectrum with index α = 13/3 (Eq.( 7)).The only input parameter supplied is the amplitude of the GWB signal.See discussion in Chamberlin et al. (2015) for more details.Notes.The final four columns list the maximum likelihood values obtained from the data set DR2new, using customised noise models when a common red noise is also included in the recovery model (Antoniadis et al. 2023a).Thus, the missing RN and DM parameters refer to the fact that, according to the customised noise model, there is no relevant evidence for the given process in the EPTA DR2new data set.

Realistic signal from an SMBHB population
Equation (4) models the characteristic strain of the GW signal as a smooth continuous function of the frequency, as given by Eq. ( 5).In reality, the expected astrophysical signal is the incoherent superposition of independent sinusoidal waves produced by an ensemble of SMBHB systems.For this case, Eq. ( 4) takes the form (Sesana et al. 2008) where now d 3 N/(dzdMdln f r ) is the number of emitting systems per unit redshift, mass and logarithmic frequency interval, and the strain h( f r ) is given by: Here, r is the co-moving distance to the source and the functions a = 1 + cos 2 ι and b = −2cosι define the relative strength of the two strain polarisations as a function of the binary inclination angle ι (see Rosado et al. 2015, for details).Equation ( 8) can be further manipulated by considering that the signal comes from a finite collection of discrete sources, and the spectrum is practically constructed in discrete frequency bins ∆ f = 1/T , where T = 10.3 yr is the duration of the PTA experiment.The characteristic strain can be thus written as where f i is the central frequency of the bin ∆ f i , and the sum runs over all the systems for which f r /(1 + z) ∈ ∆ f i .
To practically inject the signal from a cosmic population of MBHBs in our PTA timing residual, we proceed as follows.The list of emitting binaries is randomly sampled from the numerical distribution d 3 N/(dzdMdln f r ), which is obtained from the empirical, observation-based models described in Sesana (2013b).The starting point is the galaxy merger rate, expressed as: where the subscript 'g' stands for 'galaxy'.Here φ(M g , z) and F (z, M g , q g ) are the galaxy mass function and the galaxy differential pair fraction function at redshift z.Those quantities can be directly measured from observations, while the typical merger time scale τ(z, M g , q g ) can be inferred by detailed simulations of galaxy mergers.The galaxy mass is then related to the SMBH mass via scaling relations of the form: where X can be, depending on the model, the galaxy bulge mass, or its mid-infrared luminosity or velocity dispersion.We refer to Sesana (2013b) for a list of those relations.Finally, SMBHs grow their mass through accretion in galaxy mergers.It is, however, unclear whether accretion mostly occurs before or after the SMBHB coalesces and, in the former case, whether accretion occurs preferentially on either of the two SMBHs.
A201, page 4 of 14 For this work, we model φ(M g , z) from Muzzin et al. (2013), F (z, M g , q g ) from de Ravel et al. ( 2009) and τ(z, M g , q g ) from Kitzbichler & White (2008).The SMBH mass is related to the galaxy mass via the M − σ relation given by Kormendy & Ho (2013) and we assume that accretion occurs on the two black holes prior to the final merger, with preferential accretion on the secondary hole (Farris et al. 2014).These choices result in a GWB with nominal amplitude A GWB = 2.4 × 10 −15 when computed with Eq. ( 4) and expressed in the power-law form given by Eq. ( 6); this value is consistent with the one inferred from the EPTA DR2new analysis (Antoniadis et al. 2023a).
We use this observation-driven astrophysical model to numerically construct the function d 3 N/(dzdMdln f r ).We then draw 100 Monte Carlo samples of this function in the appropriate mass, redshift and frequency region of the parameter space.Each draw results in ≈100 K binaries, which we refer to as a universe realisation.For each binary, we specify its chirp mass, redshift, and GW-signal amplitude (Eq.( 9)) in the observer frame, sky location coordinates, inclination and polarisation angles.All binaries are assumed to be circular.
Once the population of SMBHBs is defined, we construct the overall GW signal by directly injecting in the time domain the deterministic residulas imprinted by each individual system in the PTA dataset.To this end, we developed a custom injection pipeline, written partly in python (using libstempo functions) and partly in fortran.The script allows the user to take idealised pulsar timing models and produce ToAs for a given observing time span, add to each pulsar specific source noise as described in Sect.2.1, and then add excess delays due to each of the SMBHB in the specified population.We inject both pulsar and Earth terms in the residuals, following the prescription of Babak et al. (2016).Similar injection pipelines have been applied to NANOGrav-like datasets in Bécsy et al. (2022) and Bécsy et al. (2023).
We can visually verify how this GWB signal definition method affects the spectra.The obtained characteristic strain amplitude for each of the 100 mock realisations of universe described above can be computed in the frequency domain from Eq. ( 10), and is shown in Fig. 1.As expected, the signal is much more structured than a plain f −2/3 nominal power law computed through Eq. ( 4), with prominent spikes associated with rare, massive and (or) nearby sources.We note that the squareaveraged signal sits on the theoretical curve, but there is considerable variance among different universe realisations.

Recovery model and analysis methods
To test the performance of current PTA GWB search and parameter estimation pipelines, we perform a set of inference analyses by using Bayes' theorem on the realistic datasets defined in Sect. 2. We search for individual and common noise parameters estimating model parameters characterised by their posterior probability distribution functions (PDFs), including those of RN, DM, and common noise.We then carry out Bayes-factor (BF) evaluations between HD-correlated and common, uncorrelated RN (CURN) models; and reconstruct the angular correlation from the data.
We conducted an in-depth analysis of each dataset using enterprise (Enhanced Numerical Toolbox Enabling a Robust PulsaR Inference SuitE, Ellis et al. 2019), a pulsar timing analysis package including functionalities for timing model evaluation, pulsar noise analysis, and GW searches.For each simulated array of pulsars we defined a noise model including: (i) EFAC and EQUAD parameters fixed respectively at 1.0 and 1e−6 for all pulsars, (ii) RN and DM variations according to the customised noise model used in EPTA DR2new (see Table 1), (iii) a common red noise process.We set the number of coefficients (i.e. the number of modes in the Fourier domain) for modelling the RN and DM processes in each pulsar to be, respectively, 30 and 100.We modelled DM variations as a Gaussian process.
Using this recovery model, we carried out Bayesian inference of the model parameter space using the Markov Chain Monte Carlo (MCMC) sampler included in enterprise: PTMCMCSampler (Ellis & van Haasteren 2017).Since we fixed the white noise parameters for each pulsar, the total number of parameters for the sampling is 62: 60 from pulsar intrinsic noise parameters and two parameters, log amplitude and slope, for the common red noise.
To decrease computational costs, we employed the reweighting method introduced in Hourihane et al. (2023).We first computed approximate posteriors defining the common process as an uncorrelated red noise.Thus, by temporarily ignoring the cross-correlation terms, the covariance matrix becomes block-diagonal, resulting in a much faster sampling.We then reweighted the obtained chains of samples to get the exact posteriors (corresponding to an HD-correlated common red process) via importance sampling.Besides being much faster than a direct search for an HD-correlated common signal, this method also provides an accurate estimate of the Bayes factor between the HD and the CURN models.In fact, the average of the weights computed for all samples is equal to the ratio of the marginal likelihood (or evidence) of the two considered models.While this method is mathematically exact, there are some limitations when the likelihood changes significantly between the two models.For example, as the GWB amplitude increases, the weights distribution becomes broader and the sampling efficiency of the method decreases.Hourihane et al. (2023) present a very detailed study of the limits of this method and conclude that the Bayes factor estimate stays robust up to BF > 10 6 .See Hourihane et al. (2023) for a more detailed description.
To test the limits imposed by the sensitivity of the dataset and the consequences in the inference analysis, for some realisations we repeated the parameter estimation runs considering different A201, page 5 of 14 We use uniform priors for the slope parameters and loguniform priors for the amplitudes of the noise components.The prior ranges for the intrinsic noise and common process parameters are listed in Table 2.
For each simulated PTA dataset, we also computed the induced angular correlation in the timing residuals between pulsar pairs.We followed the method used for the frequentist analysis in EPTA DR2new (Antoniadis et al. 2023a).We used the Optimal Statistic (OS) framework developed by Anholm et al. (2009), Demorest et al. (2012), andChamberlin et al. (2015), with the noise marginalisation described in Vigeland et al. (2018).We also computed the mean correlation and variance in the correlation recovery between different realisations of the same GWB signal, and compared the results with the theoretical predictions from Allen (2023).We followed the prescriptions in Allen & Romano (2023) when computing the average over pulsar pairs.

Results
Using the framework described in the previous section, we generate three sets of 100 mock EPTA DR2new datasets, for a total of 300 simulations.In each of these 300 simulations, the individual pulsar DM and RN are generated as a random realisation of a stochastic process described by the power-law spectra of Eq. ( 2), with amplitude and slope fixed to the ML value of the customised noise analysis performed in Antoniadis et al. (2023d) as reported in Table 1.The three sets of simulations differ for the injected GW signal: -LoudGWB_set.We use createGWB to inject a loud, stochastic GWB with A GWB = 5 × 10 −15 .This signal is easily detectable in the DR2new dataset and serves as a benchmark to test our simulations and analysis pipeline.-CreateGWB_set.We use again createGWB to inject a stochastic GWB with A GWB = 2.4 × 10 −15 , consistent with the signal observed in Antoniadis et al. (2023a).These simulations are meant to test the pipeline in the regime of a relatively weak signal that matches the template used in the likelihood evaluation.-SMBHB_set.We inject individual residuals from an astrophysically motivated SMBHB population producing a GWB with a nominal average amplitude of A GWB = 2.4 × 10 −15 .In this case, however, the signal is very different from the template used in the analysis pipeline, allowing us to investigate limitations and biases due to a mismatch between the signal present in the data and the model used in the analysis.In Sect.3.1 we discuss the performance of the pipeline applied to the LoudGWB_set; we then move to the comparison between the createGWB_set and the SMBHB_set in Sect.3.2.noise fixed complete analysis theoretical expect. 1 3 Fig. 2. P-P plot for the GWB-amplitude recovery for the LoudGWB_set of simulations.The dashed line refers to the inference runs where all the intrinsic noise parameters are fixed in a noise dictionary and the amplitude of the GWB is the only free parameter.The solid line is the result of MCMC runs over all 62 noise parameters.The theoretical expectation for an unbiased recovery and the predicted variance of one-and three-σ are represented in different shades of blue.

The LoudGWB_set of simulations: A benchmark for recovery convergence
As already mentioned, this set of simulations aims to define a benchmark for recovery convergence.In fact, the injected high GWB amplitude (A GWB = 5 × 10 −15 ) is expected to be relatively easy to recover and disentangle from pulsar noise.Some useful tools to validate the performance of our model are the cumulative distributions of the number of times the injected value lies within a credible interval, the socalled probability-probability (P-P) plots (see Cook et al. 2006;Talts et al. 2020;Wilk & Gnanadesikan 1968, and others).These plots show on the y axis the fraction of times in which the nominal value (in our case, the injected A GWB , normalised at the frequency of 1/1yr) lies within the credible interval indicated on the x axis.In the case of unbiased inference, the data points follow the diagonal of the plot parameter space.
A P-P plot for the inferred amplitude of the recovered GWB is shown in Fig. 2. The theoretical expectation and the predicted variance (one-σ and three-σ levels) are shown in different shades of blue.The dashed line is obtained from inference runs where all the pulsar's intrinsic noise parameters and the slope of the GWB signal are fixed to the nominal injected values (Table 1, γ GWB = 13/3).Thus, the only free parameter is the amplitude of the GWB signal.In this case, the obtained distribution follows the diagonal within the one-σ variance and there is no evident bias in the recovery of A GWB .This is expected since the GWB signal is injected via the createGWB function, which simulates a background with a spectrum very close to the nominal power law.The solid line, instead, is obtained from the full analysis sampling of all the 62 model parameters (the pulsars intrinsic RN and DM variation parameters, see Table 1, and the two GWB parameters).Although the recovered distribution is always within the three-σ expected interval, it systematically lies below the diagonal.This may indicate a slightly biased recovery towards lower amplitudes (and, consequently, higher γ GWB ) for the GWB spectrum.Although not particularly worrying, the origin of this potential bias is unknown and it might be due to leakage of power across different noise components.Thanks to the reweighting method, we can also show the distribution of the obtained log 10 BF for the HD-correlated model versus CURN.The results are shown in Fig. 3.The median of the distribution (vertical solid line) is at log 10 BF = 1.91, which corresponds to a BF ∼ 81.The spread of the distribution highlights the impact of the stochastic nature of the pulsar noise in the signal recovery.The central 68% of the BF distribution spans more than three orders of magnitude, and depending on the specific realisation of the noise, the data can either provide decisive evidence of a GWB or an inconclusive result.
Finally, we compute the angular correlation induced in the residuals of each pulsar pair for all realisations.For an array of 25 pulsars, there are 300 independent pairs.For each realisation, we define ten angular separation bins of 30 pulsar pairs each, and compute the mean and variance of the correlation.The calculation follows the prescriptions in Allen & Romano (2023), taking into account the covariance between different pulsar pairs in the same angular separation bin.We then compute the mean over the whole LoudGWB_set as the average, in each angular separation bin, of such optimised correlation estimators.We present the result in Fig. 4, where each data point is computed over 3000 pulsar pairs: 30 for each realisation.For comparison, we also show the cosmic variance limit derived in Allen (2023; see their Eq.(4.8) and ( G11)).The pulsar variance contribution to the expected variance of the HD recovery is minimised by the weighted-average method described in Allen & Romano (2023); thus, in this case the cosmic contribution is the only significant one to compare our results with.We note that the mean correlation estimated in each bin typically lies very close to the expected HD correlation.This is expected for such a loud GWB signal (5 × 10 −15 ).The only exception is the very last bin, which is difficult to constrain due to the limited number of pulsar pairs available at wide angular separation, forcing an averaging procedure over a wide bin.

Ideal vs. real: Comparing the createGWB_set and the SMBHB_set
Having assessed the performance of the analysis pipeline on a loud, ideal signal, we now turn to the comparison of the signal recovery for the createGWB_set and the SMBHB_set.As in the previous section, we start by constructing the P-P plot for the A GWB parameter, shown in Fig. 5.For both data sets, the nominal value of A GWB for each realisation is 2.4 × 10 −15 , at the reference frequency of 1/1 yr.In the SMBHB_set this value corresponds to computing the GWB through Eq. ( 4) and expressing it in the power-law form given by Eq. ( 6).The green lines are for the createGWB_set, while the orange ones are for the SMBHB_set.As in Fig. 2, the solid lines are computed using the posterior distributions of the MCMC runs sampling over all the 62 noise parameters of the array, while the dashed lines are obtained by searching over A GWB keeping all other noise parameters fixed to the nominal value.When fixing all noise parameters, the createGWB_set closely follows the diagonal, indicating an unbiased recovery of the signal amplitude.Conversely, the SMBHB_set tends to be consistently below the diagonal, also crossing the three-σ confidence interval in the A201, page 7 of 14 Notes.With diagonal we refer to the theoretical prediction for unbiased recoveries, which corresponds to the diagonal of the plot.The other distributions analysed are the dashed and solid lines of that P-P plot.
The p-value in the second raw is almost perfect (0.9999); this confirms the good fit between the A GWB recovery for the createGWB_set when the other noise parameters are fixed.
upper-right corner.When sampling over all the noise parameters, this bias is enhanced.As for the LoudGWB_set, the distribution for the createGWB_set (solid green) is still within the three-σ confidence interval, although systematically below the diagonal.The situation gets more extreme for the SMBHB_set (solid orange), which is dramatically biased towards the lower amplitude of the GWB spectra.
To quantify the distances between the different distributions shown in Fig. 5, we used the scipy package scipy.stats.kstest(Virtanen et al. 2020) to perform a series of non-parametric Kolmogorov-Smirnov tests (KS test).The test returns the maximum difference between two distributions and an estimate of the p-value under the null hypothesis that the two distributions are identical.Results are summarised in Table 3 and highlight the inconsistency of the signal recovery in the SMBHB_set.
We note that in these analyses there is a difference between the prior defined in the recovery model, log 10 A GWB uniform in [−15.5, −13.5], and the one from which the injected values are selected (basically a delta function centred in A GWB = 2.4 × 10 −15 ).Since the latter prior is much narrower and completely included in the former, the statistical significance of the P-P plots are unaffected.To test this statement, we generated a new set of simulations where the pulsar's intrinsic noises are the same as for the other datasets and fixed, but now the GWB signal amplitudes have values which are randomly chosen from the uniform prior log 10 A GWB : [−15.5, −13.5].The results are shown in Fig. 5 as dotted lines (green for the createGWB_set, orange for the SMBHB_set).As expected, there is no significant discrepancy between those lines and the dashed ones (same analysis on data where the GWB signal is the same in all realisations).
Since we have 100 simulations, each with a different realisation of population parameters but the same injected parameters of the GWB signal, we can utilise Bayes' theorem to obtain the combined and better constrained posterior PDFs for log 10 A GWB and γ GWB .In Fig. 6 we show the combined posteriors for log 10 A GWB and γ GWB for both the createGWB_set (green) and the SMBHB_set (orange).The posterior for the slope parameter is better constrained than that for the amplitude.While both parameters are compatible with the nominal injected value within ∼ one-σ for the createGWB_set, this is not the case for the SMBHB_set.The orange posteriors are clearly biased towards steeper spectra with lower amplitudes, consistent with what is shown by the P-P plots.
Systematic biases in the signal recovery for the SMBHB_set can be traced back to the h c distribution of the individual realisations shown by the orange lines in Fig. 1.These realisations lie preferentially below the expected f −2/3 line, featuring a steeper spectrum.Only sporadically, loud individual sources result in excess power at specific frequencies.When this happens, a power-law fit to the data can result in a flatter spectrum, above the f −2/3 line.This is a general feature of realistic SMBHB populations characterised by a sparse, high-mass tail of loud GW sources.Although most signal realisations show a deficiency of power at high frequency, the few which feature loud sources ensure that the average signal amplitude sits on the expected f −2/3 power law.It follows that the typical realisation of a nominal f −2/3 power-law signal has in reality, a steeper spectrum.There is, therefore, a mismatch between the theoretically smooth GWB signal used in the recovery model and the real signal produced by a discrete ensemble of SMBHBs, which can lead to systematic biases in the signal recovery and erroneous interpretation of the results.
We note that, during our whole analysis, we fixed the reference frequency for the recovery of the GWB amplitude to 1/1 yr.Changing this frequency to a lower one would result in a weaker dependence of A GWB upon γ GWB , but the one-dimensional posterior of the slope parameter remains unchanged.Reanalysing our data sets with the reference frequency set at 1/10 yr shows that, as expected, the average recovery is still biased to higher γ GWB values.In contrast, the A GWB recovery is now less affected by the lack of power at higher frequencies, resulting in a combined posterior that agrees very well with the nominal amplitude value of the simulated GWB.The difference between the SMBHB_set and the createGWB_set also becomes not statistically significant.We refer to Sect. 2 of Antoniadis et al. (2023b) for further details on this point.
In this paper, we focus on testing the recovery of the GWB signal from PTA datasets.However, pulsar's intrinsic noise parameters are also free parameters in our inference runs and, thus, subject to possible biases in the recovery.We refer to Appendix A for a brief discussion on the recovery of pulsar's intrinsic RN parameters.
From the parameter estimation analysis carried out on the createGWB_set and on the SMBHB_set, we can build the distribution of the BFs of HD vs. CURN obtained with the reweighting method in each realisation (see Sect. 2.3 for more details).Results are shown in Fig. 7.The median of the log 10 BF distribution for the createGWB_set is ∼0.5, while for the SMBHB_set is ∼0.8.The estimated BF for EPTA DR2new is ∼60 and is represented in the plot by the black vertical line.We note that this value is included in the 16th-84th percentile interval of the distribution for the SMBHB_set.
Finally, we also plot the recovered angular correlation in the timing residuals, following the procedure described in the previous section.In Fig. 8 we compare the mean and variance of the reconstructed correlation function to the HD curve and its cosmic variance (Allen 2023).In both cases, not only the different points are always compatible with the predicted HD correlation, but also the computed mean in each angular bin is typically included within the predicted cosmic variance.Fig. 7. Histograms of the log 10 BF HD CURN obtained from the reweighting analysis on the createGWB (green) and SMBHB_set (orange).The solid vertical lines show the medians of the distributions, while the dashed ones refer to the 16th and 84th percentiles.The black vertical line corresponds to the latest EPTA estimate for DR2new: BF HD CURN ∼ 60.

A close look at astrophysical variance: Notable realisations of the SMBHB_set
So far, we focussed on the collective properties of the recovered signal, highlighting possible systematic biases due to the idealised model used in the analysis pipeline.It is also interesting, however, to have a close look at the 'zoology' of signals arising from a realistic SMBHB population, to get an idea of how specific features are reflected in the outcome of the analysis.
We focus here on four selected realisations of the SMBHB_set, stressing that all of them are independent statistical realisations of the same underlying astrophysical model.

Case1 and Case2: Different spectra resulting in similar posteriors
The first two selected cases are characterised by the spectra shown in the left panel of Fig. 9.While Case1 (brown line) has a flat, smooth bump in the lowest frequency bins, Case2 (orange line) is characterised by a jagged behaviour due to few loud sources.In the right panel of Fig. 9 we show the GWB parameter posteriors obtained for those two realisations.The recovery model is the one described in Sect.2.3.Although the two spectra are very different, the inference runs produce very similar posteriors for the GWB amplitude and slope.This can be qualitatively understood by comparing the spectra to the sensitivity curve associated to the considered PTA (blue curve in Fig. 9, left panel).In Case1, the array is mostly sensitive to the last couple of frequency bins, where the signal is flatter than the nominal f −2/3 power law.This leads to a posterior with a slight preference for small γ and high A GWB .Conversely, in Case2, besides detecting the signal at the lowest frequency bin, the inference pipeline picks some correlated power due to the marginally detectable loud source at f ≈ 1.5 × 10 −8 Hz.Since the built-in model is a single power law, this extra high-frequency power also results in a posterior with a slight preference for small γ and high A GWB .
It is also interesting to notice that the recovered GWB parameters for these two selected realisations are quite consistent to the DR2new posteriors presented in Fig. 1 of Antoniadis et al. (2023a).This just exemplifies that there is no conflict between the signal observed in the latest PTA data and astrophysical expectations.
To further demonstrate the impact of sparse, particularly massive and (or) nearby SMBHBs on the GWB signal recovery, we selected two realisations featuring some prominent loud sources.Those are Case2, already introduced in the previous subsection, and Case3.The characteristic strain frequency spectra of those two realisations are presented in the top panel of Fig. 10.The loud source at f ≈ 2 × 10 −8 present in Case3 produces a peak in the power that is marginally above the nominal sensitivity of the simulated PTA, we can therefore expect a strong impact on the recovered signal.This is shown in the central and bottom panels of  we consider only the first four frequency bins, the bright source falls outside the frequency domain of the model and, as expected, the recovered posteriors are much closer to the nominal values for the injected SMBHB population (red posterior in the bottom panel).A similar effect, although to a lesser extent, is seen for Case2.In this case, the loud source is less prominent, contributing only marginally to the overall detected power.Still, by comparing the orange posteriors in the central and bottom panels of Fig. 10, we can see a significant shift of the posterior to the lower right when restricting the model from nine to four frequency bins.The possible bias towards a flatter GWB spectrum due to a particularly bright GW source has been recently discussed also in Bécsy et al. (2023).Here we provided concrete examples of the effect on the recovery of the GWB spectra when bright sources are involved.

Case4: A familiar HD correlation recovery
Finally, we present a fourth interesting realisation.In Case4 the characteristic strain as a function of frequency does not present any particularly pronounced peak (see the left panel in Fig. 11).The interesting feature observed while analysing this PTA dataset is the induced angular correlation in the residuals.In Fig. 11, right panel, we show the results obtained with the OS packages (as described in Sect.2.3) for this specific realisation (light blue points), and compare them with the results obtained for the EPTA DR2new dataset (red points).Each data point of the plot represents the average correlation in a bin containing 30 pulsar pairs.The averages over pulsar pairs are computed following the prescriptions in Allen & Romano (2023).
The reconstructed angular correlation closely follows the HD prediction.What caught our attention is the close resemblance between the right panel of Fig. 11 and the results from the EPTA DR2new dataset (see Fig. 6 of Antoniadis et al. 2023a).In both plots, the considered pulsars are the same (EPTA best 25 pulsars) and the correlation is computed following the same procedure.
The similarity in the reconstructed HD is also reflected in the estimated BF for HD correlated common signal vs. CURN.Using the reweighting technique in the inference run, we obtain BF ≈ 62 for Case4, which is very consistent with the BF ≈ 60 found in EPTA DR2new (Antoniadis et al. 2023a).
A201, page 10 of 14 The BF (HD vs. CURN) of those four notable cases are summarised in Table 4 for completeness, and they give a flavour of the role played by stochasticity in the estimate of signal signif-icance and parameters.For example, the spectra of Case1 and Case4 look very similar, with perhaps Case1 showing a bit more power in the lowest frequency bins.Still, in this case, the result of the analysis is inconclusive (BF = 1.179 for HD vs. CURN), whereas in Case4 there is strong evidence of an HD correlated process (BF = 62.5).This is because, in these early stages of detection, the output of the analysis is very sensitive to the specific realisation of the noise processes and to the specific sky locations of the loudest systems contributing to the GWB with respect to the best pulsars in the array.The key role played by the stochastic realisation of the noise processes involved is also demonstrated by the BF distributions shown in Fig. 7; even when we inject a nearly ideal signal with createGWB, the distribution of Bayes factors returned by the analysis spans several orders of magnitudes.

Discussion and conclusions
In this paper, we carried out an extensive investigation of the performance of current PTA GW analyses on simulated PTA datasets injected with different types of GW signals.Our simulations included realistic levels of white noise, red noise, and DM variations, that were gauged to create mock data equivalent to the recently published EPTA DR2new.We injected in those data either a stochastic, stationary, Gaussian GWB with a standard f −2/3 power-law spectrum (the createGWB_set of simulations) or the incoherent superposition of sinusoidal signals from a cosmic population of SMBHBs (the SMBHB_set of simulations), paying particular attention to possible limitations and biases arising from the mismatch between the signal present in the data and the model used for the inference.The injected signals were calibrated on the results of Antoniadis et al. (2023a), where the amplitude of the signal was estimated to be AGWB ≈ 2.4×10 −15 for a power-law spectrum with γ GWB = −13/3.
We quantified the performance of the analysis model by constructing P-P plots for the amplitude of the recovered signal for each set of simulations.The createGWB_set demonstrates that, when the model matches with the injected signal, the outcome of the analysis is reliable, although some mild bias can arise due to the complex multi-dimensional nature of the parameter space that needs to be searched over.In fact, when fixing all the parameters but the GWB amplitude, the estimate of this parameter is unbiased; conversely, when performing a joint search on the GWB and noise parameters (including RN and DM) the signal amplitude is slightly biased towards lower amplitudes and steeper spectra.Such bias has also been seen in simulations of individual pulsar noise analysis (Antoniadis et al. 2023d), and although it is still between the one-and two-σ level (Figs. 5, 6), it requires further investigation.
In the case of the more realistic SMBHB_set, where there is a mismatch between the injected signal and the simplified recovery model, the bias is much more prominent and the recovered spectra for the common noise tend to be steeper (higher γ GWB ) and with lower amplitude than the injected signal (Figs. 5, 6).This is due to the nature of the SMBHB population, which features a large number of weak sources with a tail of sparse, loud systems, as discussed in Sect.3.2.When one of these loud systems is present in the SMBHB population, the recovered GWB spectra can appear flatter (lower γ GWB ) and with a higher amplitude.
Using the reweighting method, we were also able to build a distribution of BF from the different realisations of both the createGWB_set and the SMBHB_set (see Fig. 7).We found no significant difference between the two distributions, although A201, page 11 of 14 the SMBHB_set results on average in slightly higher BFs.We note that in both sets, the logBF distribution has a large scatter, with one-σ confidence interval spanning index covering the [0, 2] interval.We also computed the induced angular correlation in the timing residuals of the different datasets and showed that they agree, within the predicted variance, with the HD correlation.
These results allow us to make several interesting considerations.First, as already shown in Cornish & Sampson (2016) the mismatch between the model and the data does not seem to affect our ability to recover the GW signal.This is probably because the detection significance is based on the intra-pulsar correlation properties of the signal (i.e. the HD overlap reduction function) which is a feature that emerges for any collection of GW signal, regardless on its specific properties (e.g.stationarity, Gaussianity, isotropy, spectral shape).Conversely, the reconstruction and interpretation of the observed signal can be severely biased by the use of a simplified GWB model.For example, while loud individual sources can cause a flattening of the spectrum which might erroneously misinterpreted as environmental effects or high eccentricity, the lack of them might result in a steep inferred spectrum which can be (again erroneously) claimed to be inconsistent with an astrophysical origin.Finally, the stochastic nature of the noise has a major impact on the outcome of the analysis.Even for the createGWB_set, where we effectively always inject the same signal, the analysis can either return a detection supported by strong evidence or an inconclusive result, just depending on different realisation of the stochastic process describing the noise.
Within this diverse and complex phenomenology, we highlighted also some notable realisations that exemplifies some of the possible analysis outcomes.In particular, we highlighted that: (i) little astrophysical information can be drawn from an inference run using a simple power-law GWB model, by showing that very different GW signals can result in similar inferred GWB parameters, (ii) the presence of loud sources can introduce a bias in the recovery of the common process, (iii) some realisations of our simulations result in a recovered HD correlation and BF completely in line with what observed in EPTA DR2.
Finally, Bécsy et al. (2023) presented a similar set of simulations based on the NANOGrav 15 yr dataset.They also use astrophysically motivated SMBHB populations to generate the Notes.The BF is computed for an HD-correlated signal over a common uncorrelated red noise.
GW signal and carry out a thorough analysis of a realistic dataset including unevenly sampled data and pulsar red noise.They perform Bayesian inference from the data, computing HD vs. CURN Bayes factors and conclude that the simple GWB model implemented in the current analysis is able to recover a realistic GW signal, although they stress that loud sources might affect the inference.Compared to their work, our investigation adds several layers of sophistication.Our simulated data also include observations at two frequencies per epoch, allowing the inclusion of DM as a further source of noise, which allows as to test the analysis performance on a more complicated situation, closer to the real data.We cast our results in terms of P-P plots, and we compute combined posteriors of several realisations of the same dataset, which allowed us to identify some interesting systematic biases in the recovered signal.Finally, we carried out a systematic comparison on the analysis performed on realistic signal injections vs. an ideal GWB generated by the createGWB function, which allowed us to identify potential difficulties due to the signal vs. template mismatch.Now that evidence of a GW signal is emerging independently from several PTA data, it is important to assess the reliability of our analysis methods, in order to maximise the astrophysical potential of the PTA experiments.The present work, along with Bécsy et al. (2023), represents an important first step in this direction, which needs to be extended to include increasingly realistic situations.For example, in this article, we did not investigate complex signal spectra due to environmental effects or binary eccentricity.Likewise, we used the exact same model A201, page 12 of 14 for noise injection and recovery.As shown in Antoniadis et al. (2023b), the excess or mis-modelled noise in the data can be absorbed within the common correlated signal in the analysis, leading to further biases and interpretation issues.Finally, including single sources along with a GWB in the recovery model might significantly improve the quality of the inference, especially when prominent peaks are present in the GW spectrum.
The main scripts used to analyse the data sets here described can be found online 3 .The SMBHB_set is also available on zenodo under the following Digital Object Identifier 4 : In particular, for each universe realisation we uploaded: the SMBHBs specifications, the chain file obtained from the inference runs (while sampling for each pulsar intrinsic noise parameters and a common uncorrelated red noise), each pulsar .parand .timfiles.

FrequencyFig. 1 .
Fig. 1.Characteristic strain as a function of frequency for 100 different realisations of a GWB with nominal amplitude 2.4 × 10 −15 .Each orange line corresponds to a different realisation (SMBHB population).The solid black line is the mean of those realisations, while the dashed black line highlights the nominal f −2/3 spectrum.The sensitivity curve is derived in The International Pulsar Timing Array Collaboration et al. (2023).

Fig. 3 .Fig. 4 .
Fig.3.log 10 BF distribution of HD vs. CURN for the 100 realisations of the LoudGWB_set.The vertical lines show the median (solid) and the 16th and 84th percentile (dashed) of the distribution.

Fig. 5 .
Fig.5.P-P plot for the recovery of A GWB from the SMHBH_set (orange) and createGWB_set (green).The solid lines are obtained from the log 10 A GWB posteriors of the MCMC runs over all 62 noise parameters of the pulsars array.The dashed lines are obtained by fixing all noise parameters in the recovery model; thus, A GWB is the only free parameter.The dotted lines are also from posteriors obtained sampling only over A GWB , but in a dataset where the realisations do not all have the same spectral amplitude.Here, the injected background has an amplitude value extracted from the prior distribution used in the recovery.

Fig. 6 .
Fig. 6.Combined posterior for log 10 A GWB (left panel) and γ GWB (right panel) parameter for the createGWB_set (green) and the SMBHB_set (orange).The injected values, marked by the black vertical lines, are A GWB = 2.4 × 10 −15 and γ GWB = 13/3.The grey posteriors in the background are the ones obtained from the individual realisations of the SMBHB_set.

Fig. 8 .
Fig. 8. Mean and variance of the angular correlation in the timing residuals of the createGWB_set (left panel) and for the SMBHB_set (right panel).Each data point corresponds to the average of the optimal correlation estimators in that bin over all realisations.We also show the expected cosmic variance derived inAllen (2023).

Fig. 9 .
Fig. 9. Comparison between the two realisations denoted as Case1 and Case2.In the left panel we compare the characteristic spectra with the f −2/3 trend, represented by the dashed black line.The sensitivity curve is the one presented in The International Pulsar Timing Array Collaboration et al. (2023).Right panel: corner plot of the GWB parameter posteriors for Case1 (brown posterior) and 2 (orange posterior).The simulated GWB amplitude is 2.4 × 10 −15 .

FrequencyFig. 10 .
Fig. 10.Notable realisations Case2 (orange) and Case3 (red).Top panel: GWB spectra.The dashed black line represents the f −2/3 trend.The sensitivity curve is the one presented in The International Pulsar Timing Array Collaboration et al. (2023).The central and bottom panels show corner plots of the GWB parameter posteriors inferred from the analysis using nine (central panel) and four (bottom panel) frequency bins for the GWB recovery.

FrequencyFig. 11 .
Fig. 11.Case4 analysis.The left panel shows the characteristic strain spectrum compared to the f −2/3 trend (dashed black line).The sensitivity curve is the one presented in The International Pulsar Timing ArrayCollaboration et al. (2023).Right panel: comparison between the recovered angular correlation in the residuals for this specific realisation (light blue points) and the results for the EPTA DR2new dataset (red).The expected variance of the optimal correlation estimator, for the EPTA 25 best pulsars array, is derived from prescriptions inAllen & Romano (2023).

Table 1 .
Values for the distance, timing, and noise parameters of the 25 best EPTA pulsars.

Table 2 .
Prior distributions for the Bayesian inference analyses.

Table 3 .
KS statistic and correspondent p-values obtained from comparisons between different distributions shown in Fig.5.