Adaptation of the phase distance correlation periodogram to account for measurement uncertainties

A. Binnenfeld; S. Shahaf; S. Zucker

doi:10.1051/0004-6361/202347764

A&A, 686 (2024) A192

Issue		A&A Volume 686, June 2024


Article Number		A192
Number of page(s)		6
Section		Numerical methods and codes
DOI		https://doi.org/10.1051/0004-6361/202347764
Published online		13 June 2024


A. Binnenfeld¹, S. Shahaf² and S. Zucker¹

¹ Porter School of the Environment and Earth Sciences, Raymond and Beverly Sackler Faculty of Exact Sciences, Tel Aviv University, Tel Aviv, 6997801, Israel
e-mail: avrahambinn@gmail.com
² Department of Particle Physics and Astrophysics, Weizmann Institute of Science, Rehovot 7610001, Israel

Received: 19 August 2023
Accepted: 2 April 2024

Abstract

We present an improvement of the phase distance correlation (PDC) periodogram to account for uncertainties in the time-series data. The PDC periodogram introduced in our previous papers is based on the statistical concept of distance correlation. By viewing each measurement and its accompanying error estimate as a probability distribution, we are able to use the concept of energy distance to design a distance function (metric) between measurement-uncertainty pairs. We used this metric as the basis for the PDC periodogram, instead of the simple absolute difference. We demonstrate the periodogram’s performance using both simulated and real-life data. This adaptation makes the PDC periodogram much more useful, demonstrating it can be helpful in the exploration of large time-resolved astronomical databases, ranging from Gaia radial velocity and photometry data releases to those of smaller surveys, such as APOGEE and LAMOST. We have made a Python implementation of the new tools available to the community in a public GitHub repository.

Key words: methods: data analysis / methods: statistical / planets and satellites: detection / binaries: general

© The Authors 2024

Open Access article, published by EDP Sciences, under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This article is published in open access under the Subscribe to Open model. Subscribe to A&A to support open access publication.

1 Introduction

Time-domain astronomy poses distinctive challenges, including sparse and uneven sampling, heteroscedasticity, high-dimensionality, and large data volumes. In practice, the astrophysical information obtained from an astronomical observation is often reduced to some scalar quantity, such as the radial velocity or photometric magnitude. The temporal behavior of these observables helps shed light on the system’s physical properties.

One key aspect of astronomical time series analysis is the search for periodic modulation in the observed data. A common approach to detecting periodicity is to generate a periodogram, produced by scanning a grid of trial periods (or frequencies) and ascribing each one with a score. The score quantifies the plausibility that the data are indeed modulated with a given period. A more thorough characterization of the observed object is usually initiated only if a periodic modulation is identified. The appeal of this approach stems from its conceptual simplicity and reduced computational costs.

The Lomb-Scargle (LS) periodogram is a widely used, weighted, least squares-based method for periodicity detection in unevenly sampled data (Ferraz-Mello 1981). It assumes that a sine wave modulation is able to explain most of the variation (‘energy’) in the data and that the measurement error is additive, uncorrelated (‘white’), and Gaussian. If the underlying signal is sinusoidal and the noise is white and Gaussian, LS is an optimal detection scheme, based on the Neyman-Pearson hypothesis-testing theory. The logic behind its application largely relies on the ability to approximate smooth periodic functions by Fourier series, assuming that most energy will concentrate on one fundamental mode (see VanderPlas 2018).

However, the assumption that the periodic signal is well represented by a single harmonic function can be invalid in certain cases. This holds particularly true for light curves exhibiting the majority of photometric variability patterns. As a consequence, several extensions to LS have been developed, aiming to broaden the flexibility of the search. For example, Fawcett (2006) used several harmonics, instead of the pure sinewave used in the classical LS periodogram; another extension is attributed to Zechmeister & Kürster (2009), where a constant offset term was incorporated to account for non-zero baseline values. The latter approach is often referred to as the generalized LS periodogram (GLS).

Nevertheless, some astrophysical phenomena may give rise to signals that cannot be approximated with only a few harmonics. Some studies have shown that GLS and other parametric periodograms tend to underperform in such cases (Pinamonti et al. 2017; Zucker 2018). Examples of such signals include the radial-velocity curves of highly eccentric Keplerian orbits or sawtooth-like photometric modulations of pulsating stars. Exoplanetary transits are another well-known example of a type of signal that is poorly approximated by sinusoids. Instead of increasing the number of harmonics in the template model used for the search, the common approach is changing its functional shape altogether. The classical transit detection method BLS (box-fitting least squares) looks for box-shaped signals in the data (Kovács et al. 2002; Panahi & Zucker 2021; Shahaf et al. 2022). As the photometric accuracy improved, the assumed underlying model was further refined in various ways to employ more realistic transit-shaped models (e.g. Mandel & Agol 2002; Kipping 2023), further improving the search sensitivity.

The approaches discussed above are parametric, as they rely on the ability to model or approximate the functional shape of the signal. There are cases, however, where the functional shape is unknown or requires significant computational resources for the calculation. Several non-parametric periodograms were developed to address such situations. These techniques are designed to be more agnostic to the particular modulation shape, expanding the reach of the search, sometimes at the cost of reduced statistical efficiency:

String-length techniques are a set of commonly used model-independent methods (Clarke 2002). They include, among others, the Lafler-Kinman method (Lafler & Kinman 1965), as well as the methods of Renson (1978) and Dworetsky (1983). In all of those methods, we quantify the dependence between consecutive phase-folded measurements by estimating the length of an imaginary string connecting them.

The analysis of variance method (AoV; Schwarzenberg-Czerny 1989) is another non-parametric approach, which basically fits a periodic piecewise constant function to the measurements. After dividing the measurements into phase subsets (‘bins’) based on the trial period, the variance within each subset is compared to the variance of the entire dataset to determine the significance of the periodic signal at each period. A method closely related to the AoV method is the phase dispersion minimization method (PDM; Stellingwerf 1978), minimizing the dispersion of the phase-folded data sets to detect the most probable period.

The phase distance correlation periodogram (PDC; Zucker 2018) quantifies the statistical dependence between the observable and the phase at which the data were taken, given the trial period. The statistical dependence is quantified by the distance correlation (Székely et al. 2007). The PDC periodogram does not assume a specific shape for periodic variability and outperforms GLS in detecting some highly non-sinusoidal periodic signals (Zucker 2018).
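As an illustration of how a model-independent score is obtained from phase-folded data, a string-length statistic in the spirit of Lafler & Kinman (1965) can be sketched as follows. This is only a rough sketch: the function name, the wrap-around term that closes the string, and the normalization by the total scatter are assumptions made here, not details taken from the cited papers.

```python
import numpy as np


def string_length_statistic(t, y, period):
    """Sum of squared jumps between phase-consecutive points, normalized.

    Small values indicate that the folded data trace a smooth curve,
    i.e. that `period` is a plausible period. (Illustrative variant.)
    """
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    # Fold the time stamps at the trial period and sort by phase.
    order = np.argsort(np.mod(t, period))
    folded = y[order]
    # Differences between neighbours in phase, closing the loop at the end.
    jumps = np.diff(np.append(folded, folded[0]))
    return float(np.sum(jumps**2) / np.sum((y - y.mean()) ** 2))
```

Scanning this function over a grid of trial periods and looking for minima gives a basic string-length periodogram. Note that the measurement uncertainties do not enter the statistic at any point.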

One weakness of the non-parametric methods discussed above is their inability to consider measurement uncertainties. Error estimates are essential components of all physical measurements (e.g. Barlow 1995). Uncertainties can reflect various noise sources, such as inherently unavoidable photon noise or instrumental glitches, and their estimates are particularly fundamental for understanding and analyzing astronomical measurements (see Andrae 2010).

Uncertainties constitute a unique challenge for model-independent periodograms such as the PDC. Without an assumed functional shape, it is not straightforward to define, let alone quantify, the deviations of the data from their expected value. As a result, uncertainties are often omitted from the calculation in these cases. One notable exception is related to the AoV and PDM methods mentioned earlier, where an assumed functional form does indeed exist, specifically one that minimizes dispersion. In this case, the minimization procedure can involve sampling the assumed distribution of the noise to incorporate them into the procedure.

In this work, we present an improvement of the PDC periodogram that considers measurement uncertainties. We do so by viewing each measurement paired with its corresponding uncertainty as a probability distribution, and then introducing a new metric that enables the calculation of distance correlation and therefore the PDC periodogram.

In the next section, we present the details of adapting the PDC periodogram to account for measurement uncertainties. In Sect. 3, we demonstrate the performance of the improved periodogram using simulated data and its application to the HARPS data of a planet-hosting star. We present our conclusions in Sect. 4 and discuss the method and its potential for future studies.

2 Incorporating measurement errors in PDC

2.1 A new metric

As mentioned above, the distance correlation is a way to quantify the statistical dependence between two random variables (Székely et al. 2007). One of its merits is that the two random variables in question do not need to be of the same dimensionality (Lyons 2013).

Distance correlation is reliant on providing each random variable with an adequate metric to determine the pairwise distance of the sample (Lyons 2013). Therefore, to allow the PDC periodogram to account for measurement uncertainties, we must define a suitable metric to consider both the measurements and their accompanying error estimates. To do so, we assume that each measurement is sampled from a Gaussian distribution, with the nominal measurement value as its mean (µ) and the accompanying error estimate as its standard deviation (σ). We define the distance between two measurements as the ‘Energy Distance’ between their two corresponding distributions (Székely & Rizzo 2013). This metric satisfies the required conditions, outlined by Rizzo & Székely (2016), for the distance correlation to be applied as a measurement of statistical independence.

The distance between two measurements is taken as the energy distance between the two probability distributions from which these values were sampled, F_i and F_j. The explicit expression for the energy distance is (Székely & Rizzo 2013):

$\mathcal{D}^2(F_i, F_j) = 2E\|X_i - X_j\| - E\|X_i - X_i'\| - E\|X_j - X_j'\|,$ (1)

where X_i and X_j are independent random variables distributed according to F_i and F_j; $X_i'$ and $X_j'$ are independent and identically distributed copies of X_i and X_j; ‖·‖ denotes the Euclidean norm; and E denotes the expected value.

The difference between two normally distributed random variables is also normally distributed. Therefore, X_i − X_j is a Gaussian random variable with mean µ_i − µ_j and variance $\sigma_i^2 + \sigma_j^2$. Similarly, $X_i - X_i'$ and $X_j - X_j'$ are zero-mean Gaussian random variables with variances $2\sigma_i^2$ and $2\sigma_j^2$, respectively. Finally, to obtain an expression for the energy distance, we need to know the distribution of the random variable ‖X_i − X_j‖. Since we are considering random scalars, this is the absolute value of the difference between two normally distributed variables, which follows a ‘folded normal distribution’, that is, the distribution of the absolute value of a normally distributed random variable.

Leone et al. (1961) showed that the expected value of a folded normal distribution is

$\sigma \sqrt{\frac{2}{\pi}} \exp\left(-\frac{\mu^2}{2\sigma^2}\right) + \mu\,\mathrm{erf}\left(\frac{\mu}{\sqrt{2\sigma^2}}\right),$ (2)

where µ and σ² represent the mean and variance of the underlying Gaussian, whose absolute value was taken, and erf is the Gauss error function, as follows:

$\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x \exp\left(-t^2\right)\,\mathrm{d}t.$

With these results in hand, we can explicitly write the energy distance (Eq. (1)) for two normally distributed random variables. For convenience, we define $\sigma_{ij}^2 \equiv \sigma_i^2 + \sigma_j^2$. In these terms, Eq. (1) becomes

$\varepsilon_{ij}^2 = \sqrt{\frac{8}{\pi}}\,\sigma_{ij} \left[\exp\left(-x^2\right) + x\,\mathrm{erf}(x) - y\right],$ (3)

where

$x \equiv \frac{\mu_i - \mu_j}{\sqrt{2}\,\sigma_{ij}} \quad \mathrm{and} \quad y \equiv \frac{\sigma_i + \sigma_j}{\sqrt{2}\,\sigma_{ij}}.$

It is illuminating to closely examine the behavior of the expression in Eq. (3). First, it can be shown to be always non-negative, as is required for a metric. The part that depends on x is an increasing function of the absolute difference of the two means (as Fig. 1 demonstrates): it is greater than or equal to 1 and asymptotically converges to |x|. The second term, y, reflects the heteroscedasticity of the measurement pair. It ranges from $2^{-1/2}$, for extreme variance ratios, to 1, in the case of equal variances. In essence, the heteroscedasticity term makes the energy-distance-based metric treat equal-variance distributions as closer to one another. Finally, the term in square brackets is scaled by σ_ij, which increases the distance between distributions with larger variances. The constant multiplicative factor $\sqrt{8/\pi}$ can be omitted in practical contexts.

Following Lyons (2013, corollary 3.18 therein), we take the square root of the metric to guarantee it is of ‘strong negative type’:

$\mathcal{D}(F_i, F_j) = \sqrt{\varepsilon_{ij}^2}.$ (4)

Exploring the definition and intricacies of strong negative-type metric spaces is outside the scope of this paper, but it is important to note this attribute is a sufficient condition for a metric to be applicable for distance correlation computation (for further details, see Székely et al. 2007; Lyons 2013).

Thus we obtained a strong-negative-type metric, which we then used to formulate an adaptation of the PDC periodogram, to use measurement uncertainties (as described in the following section).
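The construction of Eqs. (1), (2), and (4) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code in the public repository; the function names `folded_normal_mean` and `energy_distance_metric` are hypothetical. The sketch evaluates the three expectations of Eq. (1) one at a time with the folded-normal mean of Eq. (2), rather than through the compact variables x and y, so that each term can be checked independently (for instance against a Monte Carlo estimate).

```python
import math


def folded_normal_mean(mu, sigma):
    """Expected value of |Z| for Z ~ N(mu, sigma^2), following Eq. (2)."""
    if sigma == 0.0:
        # Degenerate case: no scatter, so |Z| = |mu| with certainty.
        return abs(mu)
    gauss_part = sigma * math.sqrt(2.0 / math.pi) * math.exp(-mu**2 / (2.0 * sigma**2))
    erf_part = mu * math.erf(mu / (math.sqrt(2.0) * sigma))
    return gauss_part + erf_part


def energy_distance_metric(mu_i, sigma_i, mu_j, sigma_j):
    """Square root of the energy distance of Eq. (1) between two Gaussians."""
    sigma_ij = math.hypot(sigma_i, sigma_j)  # sqrt(sigma_i^2 + sigma_j^2)
    # E|X_i - X_j|: the difference is N(mu_i - mu_j, sigma_ij^2).
    cross = folded_normal_mean(mu_i - mu_j, sigma_ij)
    # E|X_i - X_i'| and E|X_j - X_j'|: zero-mean, variance 2 sigma^2.
    self_i = folded_normal_mean(0.0, math.sqrt(2.0) * sigma_i)
    self_j = folded_normal_mean(0.0, math.sqrt(2.0) * sigma_j)
    squared = 2.0 * cross - self_i - self_j
    # Guard against tiny negative values caused by round-off.
    return math.sqrt(max(squared, 0.0))
```

In the limit of vanishing uncertainties the value reduces to $\sqrt{2|\mu_i - \mu_j|}$. Evaluated this way, the mean-difference contribution to Eq. (1) is $2(\mu_i - \mu_j)\,\mathrm{erf}(x)$, which approaches $2|\mu_i - \mu_j|$ for well-separated measurements.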

Fig. 1

Illustration of the functional form attained by the energy distance between two Gaussians as a function of the normalized difference between their means, x. For simplicity, the diagram shows the x-dependent term in Eq. (3), illustrating that it is bounded from below by 1 and asymptotically approaches |x|. The additional term of the energy-distance-based metric imposes a constant offset to the curve presented in this figure.

2.2 Phase-distance periodogram

Following Zucker (2018, 2019), we defined a distance matrix based on the pairwise energy distance introduced above. Each entry of the distance matrix represents the distance between the corresponding measurements, namely

$a_{ij} = \mathcal{D}(i, j).$ (5)

Once a distance matrix is obtained, we apply 𝒰-centering correction (Székely & Rizzo 2014),

$A_{ij} = \begin{cases} a_{ij} - \frac{1}{N-2}\sum\limits_{k=1}^{N} a_{ik} - \frac{1}{N-2}\sum\limits_{k=1}^{N} a_{kj} + \frac{1}{(N-1)(N-2)}\sum\limits_{k,l=1}^{N} a_{kl} & \mathrm{if}\ i \neq j, \\ 0 & \mathrm{if}\ i = j. \end{cases}$ (6)

Here, we use N to represent the number of measurements. The 𝒰-centering procedure is required to produce an unbiased estimator of the distance correlation.
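A direct NumPy transcription of Eq. (6) may help make the centering step concrete. The function name `u_center` is hypothetical, and the sketch assumes a square distance matrix with at least three measurements, since the denominators in Eq. (6) vanish otherwise.

```python
import numpy as np


def u_center(a):
    """Apply the U-centering of Eq. (6) to a square distance matrix."""
    a = np.asarray(a, dtype=float)
    n = a.shape[0]
    if a.ndim != 2 or a.shape[1] != n:
        raise ValueError("distance matrix must be square")
    if n < 3:
        raise ValueError("U-centering needs at least 3 measurements")
    row_term = a.sum(axis=1, keepdims=True) / (n - 2)   # sum over k of a_ik
    col_term = a.sum(axis=0, keepdims=True) / (n - 2)   # sum over k of a_kj
    total_term = a.sum() / ((n - 1) * (n - 2))          # sum over k, l of a_kl
    centered = a - row_term - col_term + total_term
    np.fill_diagonal(centered, 0.0)                     # A_ii = 0 by definition
    return centered
```

For a symmetric input with a zero diagonal, every row and column of the result sums to zero, which is a convenient sanity check.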

Thus far, we have discussed the distances between pairs of observables. To construct a periodogram, we must construct a distance matrix representing the pairwise phase distance for each trial period, P. We used the metric defined by Zucker (2018) to do so. The phase distance between two measurements is

$b_{ij} = \phi_{ij}\left(P - \phi_{ij}\right),$ (7)

where

$\phi_{ij} = \left(t_i - t_j\right) \bmod P$

is the phase difference between two measurements. The resulting distance matrix is then 𝒰-centered, in a process identical to the one presented in Eq. (6), producing the 𝒰-centered phase distance matrix, denoted as B_ij.
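The phase distance matrix of Eq. (7) can be built for all pairs at once with array broadcasting. The following is a minimal sketch; the function name `phase_distance_matrix` is an assumption made for illustration.

```python
import numpy as np


def phase_distance_matrix(t, period):
    """Pairwise phase distance b_ij = phi_ij * (P - phi_ij) of Eq. (7)."""
    t = np.asarray(t, dtype=float)
    # phi_ij = (t_i - t_j) mod P, which lies in [0, P) for a positive period.
    phi = np.mod(t[:, None] - t[None, :], period)
    return phi * (period - phi)
```

Because swapping i and j maps φ to P − φ, the product in Eq. (7) is unchanged and the matrix is symmetric. Two epochs separated by an integer number of periods have zero phase distance.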

The unbiased estimator of the distance correlation is calculated using

$D = \frac{\sum\limits_{ij} A_{ij} B_{ij}}{\sqrt{\sum\limits_{ij} A_{ij}^2 \sum\limits_{ij} B_{ij}^2}}.$ (8)
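Given the two 𝒰-centered matrices, Eq. (8) is a normalized inner product. The sketch below assumes that both inputs have already been centered according to Eq. (6); the function name `pdc_statistic` and the convention of returning zero when either matrix vanishes are choices made here for illustration.

```python
import numpy as np


def pdc_statistic(A, B):
    """Evaluate Eq. (8) from two U-centered distance matrices A and B."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    if A.shape != B.shape:
        raise ValueError("A and B must have the same shape")
    denominator = np.sqrt(np.sum(A**2) * np.sum(B**2))
    if denominator == 0.0:
        # Degenerate input (all entries zero): no dependence can be measured.
        return 0.0
    return float(np.sum(A * B) / denominator)
```

A periodogram is obtained by recomputing B for each trial period while A, which depends only on the measurements and their uncertainties, is computed once and reused.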

Similar to the original version of the PDC periodogram, the proposed recipe is computationally intensive and entails an 𝒪(N²) calculation for each trial period. This quadratic dependence may be reduced in future applications to 𝒪(N log N) by using newly developed fast techniques to compute the distance correlation (e.g. Huo & Szekely 2014; Chaudhuri & Hu 2019).

Following Binnenfeld et al. (2022), we used a χ² test to assess the significance of periods detected by the PDC (Shen et al. 2022), producing its false alarm probability (FAP). The χ²-based FAP is simple to compute and spares the need for computationally heavy permutation tests, which we previously used for this purpose (Binnenfeld et al. 2020, 2022). The FAP is calculated as follows:

$\mathrm{FAP} = 1 - F_{\chi_1^2 - 1}(N \cdot D)$ (9)

where F is the cumulative distribution function (CDF) of the χ² distribution, N is the number of measurements, and D is defined in Eq. (8).
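Equation (9) can be evaluated with the standard library alone. The sketch below rests on one reading of the notation, stated here as an assumption: $F_{\chi_1^2 - 1}$ is taken to be the CDF of a χ² variable with one degree of freedom shifted by −1, so that $F_{\chi_1^2 - 1}(u) = F_{\chi_1^2}(u + 1)$. For one degree of freedom the survival function is $\mathrm{erfc}\left(\sqrt{z/2}\right)$, because a χ²₁ variable is the square of a standard normal one. The function name `pdc_fap` is hypothetical.

```python
import math


def pdc_fap(d, n):
    """False alarm probability of Eq. (9) for statistic d and n measurements.

    Assumes F_{chi2_1 - 1}(u) = F_{chi2_1}(u + 1), i.e. a chi-square
    variable with one degree of freedom shifted by -1.
    """
    z = n * d + 1.0
    if z <= 0.0:
        # Below the support of the shifted distribution: nothing is excluded.
        return 1.0
    # Survival function of chi-square with 1 d.o.f.: P(Z^2 > z) = erfc(sqrt(z/2)).
    return math.erfc(math.sqrt(z / 2.0))
```

Under this reading, a statistic of D = 0 with any N gives a FAP of about 0.32, and the FAP decreases monotonically as N·D grows.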

Fig. 2

Simulated demonstration of an Earth-analog exoplanet detection. Top: simulated RV curve of a planet-hosting star, as described in Sect. 3.1. The red line represents the orbital model, and the black circles represent the simulated noisy measurements, together with the simulated error bars. Bottom: comparison of the two versions of the PDC periodogram: with (solid blue) and without (dashed red) considering the error bars. The dashed vertical line marks the frequency associated with the injected orbital period of 365 days. The lower dotted horizontal line corresponds to a FAP level of 10⁻², and the upper one to a level of 10⁻³.

3 Demonstration and verification

After introducing the theoretical foundations of the method, we demonstrate its performance using simulated and real datasets. To quantify its merit, we produced receiver operating characteristic (ROC) curves (e.g. Fawcett 2006). The ROC is a diagnostic tool for assessing the performance of detection schemes by comparing the rates of true- and false-positive detection as a function of the threshold applied to the detection statistic.

3.1 Earth-analog exoplanet detection

As a first example, intended to demonstrate the capabilities and advantages of the new method for instances involving non-homogeneous errors, we generated a sinusoidal RV curve corresponding to some hypothetical exoplanet with a circular orbit of 365 days, and radial velocity semi-amplitude of 10 cm s⁻¹. The radial velocity curve was sampled at 35 epochs drawn randomly from a uniform distribution over an interval of 1000 days. We added noise to each measurement by randomly drawing from a normal distribution centered around 0, with standard deviations that were drawn from an exponential distribution with a scale parameter of 1 m s⁻¹. The standard deviations were paired with the corresponding simulated measurements as their ‘estimated’ error bars. We used this procedure to produce a wide range of noise standard deviations in the same dataset. The simulated RV curve, with the simulated error bars, can be seen in Fig. 2 (top panel).
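The simulation recipe described above can be reproduced along the following lines. This is a sketch, not the script used for Fig. 2: the function name `simulate_rv`, the zero orbital phase, and the choice of metres per second as the working unit (so that 10 cm s⁻¹ becomes 0.1) are assumptions.

```python
import numpy as np


def simulate_rv(n_epochs=35, period=365.0, semi_amplitude=0.1,
                baseline=1000.0, noise_scale=1.0, seed=None):
    """Simulate a heteroscedastic sinusoidal RV curve (velocities in m/s).

    Returns the epochs (days), the noisy RVs, and the per-point standard
    deviations that serve as the 'estimated' error bars.
    """
    rng = np.random.default_rng(seed)
    # Epochs drawn uniformly over the baseline, sorted for convenience.
    t = np.sort(rng.uniform(0.0, baseline, n_epochs))
    # One noise level per measurement, drawn from an exponential distribution.
    sigma = rng.exponential(noise_scale, n_epochs)
    signal = semi_amplitude * np.sin(2.0 * np.pi * t / period)  # phase 0 assumed
    rv = signal + rng.normal(0.0, sigma)
    return t, rv, sigma
```

Passing a fixed `seed` makes a realization reproducible, which is useful when comparing the two versions of the periodogram on identical data.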

As can be seen in the bottom panel of Fig. 2, the peak corresponding to the actual periodicity in the original PDC periodogram does not stand out against the spurious peaks, while the improved version of PDC does show a dominant peak at the expected annual frequency, with a FAP of 10⁻³.

3.2 Detecting eccentric orbits

Periodic RV signals related to eccentric orbits were shown to be more challenging to detect (Pinamonti et al. 2017). Their diverse variability patterns pose difficulties for model-dependent methods, and they were therefore identified as preferable targets for the PDC periodogram (Zucker 2018).

To demonstrate how the new adaptation of PDC improves the PDC detection capabilities even further, we chose the most eccentric exoplanet currently known: HD 20782 b, with e = 0.97 ± 0.01 and an orbital period of 597 days (O’Toole et al. 2008; Kane et al. 2016). We used RVs from the recently published HARPS radial velocity database (Trifonov et al. 2020), which consists of fifteen years of HARPS RVs.

HARPS optical fibers were upgraded in May 2015, changing the instrumental profile and introducing an RV offset between the pre- and post-upgrade RVs. Trifonov et al. recommended treating the pre- and post-upgrade time series from the database as taken from two different instruments, a recommendation we indeed followed in our analysis. Out of 87 measurements in total, we only used the 72 taken prior to the fiber change. The RVs and their error estimates can be seen in the bottom panels of Fig. 3.

As seen in the top panel of Fig. 3, the GLS periodogram does not exhibit a prominent peak at the previously published frequency, while the regular PDC version does. The significance of the peak is enhanced further in the improved version of the periodogram. The change translates to a major decrease in the detection FAP, from 10⁻⁹ to 10⁻¹³.

RV curves of very eccentric orbits can benefit significantly from the new adaptation of the periodogram. As demonstrated by the case of HD 20782, samples that occur during periastron passage can easily be mistaken for outliers, even if their uncertainty estimates are small. In other cases, the error bars might be more ambiguous. The new and improved version offers an objective and quantitative way to include the error bars in the analysis.

Fig. 3

Demonstration of the advantages of the new version of the PDC periodogram, which takes into account the uncertainties, using RV data of the planet-hosting star HD 20782. Top: GLS and PDC periodograms. The dashed vertical line marks the frequency corresponding to the known orbital period of 597 days. The lower dotted horizontal line corresponds to a FAP level of 10⁻⁹, and the upper one to a level of 10⁻¹³. Middle panels: HARPS RV measurements for HD 20782, and their estimated measurement uncertainties (plotted separately for clarity). Bottom: RVs phase-folded according to the previously published orbital period.

Fig. 4

ROC curves comparing the regular PDC periodogram with its newly developed version. The dashed red line represents the regular PDC, and the thick blue one represents the new version that considers uncertainty estimates. The thin black line stands for a 1:1 ratio between the true and false positive rates, i.e., random guess.

3.3 ROC curves

We ran the tests for the ROC curve using the same procedure described in Sect. 3.1, only on 1000 sinusoidal RV curve realizations with periods drawn from a uniform distribution between 7 and 180 days. Similarly, we drew a random number of samples between 20 and 70. We added noise to each measurement by randomly drawing from a normal distribution centered around 0, with standard deviations that were drawn from an exponential distribution with a scale parameter of 1 m s⁻¹. Half of the simulated RV curves were then intentionally shuffled to eliminate the periodic signal, so that we could later estimate the false-positive rate. We calculated the two periodograms for each realization on the same frequency grid with 200 frequencies between 10⁻⁴ and 0.2 day⁻¹.

We then used the resulting periodograms to generate the ROC curve by sorting the FAP thresholds of each periodogram in ascending order. By serially computing the pairs of false-positive (FP) and true-positive (TP) rates, we created the ROC curves presented in Fig. 4.
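The threshold sweep can be sketched as follows. The sketch assumes that each realization is summarized by a single detection FAP (for instance that of its highest periodogram peak) and that a boolean label records whether the periodic signal was left intact; both the function name `roc_from_fap` and this per-realization summary are assumptions made for illustration.

```python
import numpy as np


def roc_from_fap(fap, has_signal):
    """False- and true-positive rates for ascending FAP thresholds.

    A realization counts as a detection when its FAP is at or below
    the threshold. The point (0, 0) is prepended to anchor the curve.
    """
    fap = np.asarray(fap, dtype=float)
    has_signal = np.asarray(has_signal, dtype=bool)
    thresholds = np.unique(fap)  # sorted in ascending order, duplicates removed
    tpr = [np.mean(fap[has_signal] <= th) for th in thresholds]
    fpr = [np.mean(fap[~has_signal] <= th) for th in thresholds]
    return np.concatenate(([0.0], fpr)), np.concatenate(([0.0], tpr))
```

For example, four realizations with FAPs of 10⁻⁵ and 10⁻³ (signal present) and 0.2 and 0.5 (shuffled) yield a curve that reaches a true-positive rate of 1 before any false positive appears.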

As can be seen in the figure, the new version of the PDC (solid blue line), as it takes into account the uncertainties, outperforms the old version (dashed red line).

As noted in the previous subsection, the PDC periodogram outperformed other methods in detecting signals of eccentric orbits. To demonstrate this, we repeated the ROC curve test, this time simulating 1000 realizations of eccentric Keplerian orbits (uniformly distributed e > 0.35). As shown in Fig. 5, the PDC periodogram in its new version (using the information contained in the error bars) does indeed outperform both the GLS and the regular version of the PDC in this detection test.

The area under the ROC curve (AUC) metric is often used for analyzing ROC curves, offering an aggregated performance measure, invariant to both scale and detection threshold. In the context of the PDC periodogram, it can be effectively employed to measure the method contribution, particularly when dealing with data exhibiting diverse characteristics such as variance and signal-to-noise ratio (S/N). Future studies, focusing on defining and exploring the physical parameter space for different test cases, could leverage the AUC metric to gain a more nuanced understanding of the periodogram efficacy.
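For reference, the AUC of a sampled ROC curve can be estimated with the trapezoidal rule. The sketch below is generic and not tied to the figures in this paper; the function name `roc_auc` is hypothetical, and the input is assumed to be a monotonic curve that runs from (0, 0) to (1, 1).

```python
import numpy as np


def roc_auc(fpr, tpr):
    """Trapezoidal estimate of the area under a sampled ROC curve."""
    fpr = np.asarray(fpr, dtype=float)
    tpr = np.asarray(tpr, dtype=float)
    # A stable sort keeps the input order of points that share the same FPR.
    order = np.argsort(fpr, kind="stable")
    fpr, tpr = fpr[order], tpr[order]
    widths = np.diff(fpr)
    mean_heights = (tpr[1:] + tpr[:-1]) / 2.0
    return float(np.sum(widths * mean_heights))
```

A perfect detector gives an AUC of 1, and the 1:1 random-guess line gives 0.5.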

Fig. 5

ROC curves for the case of eccentric Keplerian RV curves (e > 0.35). The dashed red line represents the regular PDC, and the solid blue line represents the new version presented in this paper. They both outperform the GLS periodogram, represented by the dotted grey line. The thin black line stands for a 1:1 random guess.

Fig. 6

ROC curves comparing our newly developed method in its standard configuration against a scenario where the uncertainty estimates were randomly shuffled. The dashed red line represents the regular PDC, and the solid blue line represents the new version. The green dotted line represents the performance with the randomly shuffled uncertainty estimates. The thin black line represents a random guess.

3.3.1 Contribution of the uncertainties information

To further illustrate the capabilities of the updated version of the PDC and establish the importance of the contribution of measurement uncertainties to this detection scheme, we created an additional ROC curve. We repeated the 1000 sinusoidal realization test described in the previous subsections. This time, we evaluated the performance of the periodogram under two distinct settings: the standard configuration and a scenario in which we randomly shuffled the uncertainty estimates, highlighting their critical role in the enhanced performance (presented in Fig. 6).

As is evident from the figure, the shuffled version is outperformed by the regular PDC, while the newly developed method in its standard configuration is better than both. Nonetheless, the fact that false error estimates do not significantly impair the performance of the PDC periodogram is encouraging and attests to the robustness of the PDC.

4 Conclusion

In this paper, we present a modification of the PDC periodogram that is aimed at accounting for measurement uncertainties. We have demonstrated its performance using both simulations and data from the recently published HARPS-RVBank archive.

The original version of the PDC periodogram ignored error bars, which was a real disadvantage of the PDC and is likely the reason the method was not widely adopted. Now that we have found a way to incorporate into PDC the information contained in the error bars, we believe we have made it much more useful.

Although this study emphasizes RV curve illustrations, it is important to note that the PDC periodogram and its current modification presented in this paper are well suited for photometric light curves as well. As we have shown (Zucker 2018), the PDC periodogram outperforms other methods in various cases of non-sinusoidal variability patterns, such as the sawtooth-like variability shapes that are typical of pulsating stars.

The adaptation presented in this paper relies on a Gaussian model for the uncertainty, which we used in order to construct the energy distance metric. The Gaussian uncertainty model is used very often in many contexts (almost by default) due to its simplifying properties. However, somewhat less often, other uncertainty models are used (e.g., Poisson distribution). The energy distance expression shown in Eqs. (3)–(4) should be replaced in those cases with the relevant expressions suitable for the assumed noise distribution. This approach may be useful in a range of applications, including the potential to extend the definition of distance to encompass and mitigate correlated noise.

In recent papers (Binnenfeld et al. 2020, 2022), we introduced USuRPER, which is essentially a PDC periodogram for spectral observations and also a version of the PDC periodogram suitable for scanning astrometry. Despite it not being suited to the exact modification presented in this paper (i.e. scalar observations), we believe similar approaches of viewing measurements and their errors as probability distributions can be applied for those periodograms as well, to make them account for measurement error bars.

An additional relevant PDC extension is the ‘partial phase distance correlation periodogram’, which allows for ‘nuisance’ parameters to be accounted for with the aim of eliminating spurious peaks related to those nuisance parameters (Binnenfeld et al. 2022). In the case of two simultaneously measured scalar quantities, such as RVs and some activity indicator, the latter may be considered a nuisance; thus, it can be used to eliminate periodogram peaks originating from stellar activity. The formulation presented in this work can also be used to improve the partial periodogram by using the relevant uncertainty estimates.

Finally, we provide our Python implementation of the periodogram in the form of a public GitHub repository¹.

Acknowledgements

We thank the editor and the referee, Andrej Prša, for reviewing this paper and for their enlightening and useful comments. This research was supported by the Ministry of Innovation, Science & Technology, Israel (grant 3-18143) and by the Israel Science Foundation (grant No. 1404/22). The research of SS is supported by a Benoziyo prize postdoctoral fellowship. The analyses done for this paper made use of the code packages NumPy (Harris et al. 2020), SciPy (Virtanen et al. 2020), and SPARTA (Shahaf et al. 2020).

References

Andrae, R. 2010, arXiv e-prints [arXiv:1009.2755]
Barlow, R. 1995, Statistics: A Guide to the Use of Statistical Methods in the Physical Sciences, Manchester Physics Series (Hoboken: Wiley)
Binnenfeld, A., Shahaf, S., & Zucker, S. 2020, A&A, 642, A146
Binnenfeld, A., Shahaf, S., Anderson, R. I., & Zucker, S. 2022, A&A, 659, A189
Chaudhuri, A., & Hu, W. 2019, Comput. Stat., 135, 15
Clarke, D. 2002, A&A, 386, 763
Dworetsky, M. M. 1983, MNRAS, 203, 917
Fawcett, T. 2006, Pattern Recog. Lett., 27, 861
Ferraz-Mello, S. 1981, AJ, 86, 619
Harris, C. R., Millman, K. J., van der Walt, S. J., et al. 2020, Nature, 585, 357
Huo, X., & Szekely, G. J. 2014, Technometrics, Fast Computing for Distance Covariance (UK: Taylor and Francis)
Kane, S. R., Wittenmyer, R. A., Hinkel, N. R., et al. 2016, ApJ, 821, 65
Kipping, D. 2023, MNRAS, 523, 1182
Kovács, G., Zucker, S., & Mazeh, T. 2002, A&A, 391, 369
Lafler, J., & Kinman, T. D. 1965, ApJS, 11, 216
Leone, F. C., Nelson, L. S., & Nottingham, R. B. 1961, Technometrics, 3, 543
Lyons, R. 2013, Ann. Probab., 41, 3284
Mandel, K., & Agol, E. 2002, ApJ, 580, L171
O’Toole, S. J., Tinney, C. G., Jones, H. R. A., et al. 2008, MNRAS, 392, 641
Panahi, A., & Zucker, S. 2021, PASP, 133, 024502
Pinamonti, M., Sozzetti, A., Bonomo, A. S., & Damasso, M. 2017, MNRAS, 468, 3775
Renson, P. 1978, A&A, 63, 125
Rizzo, M. L., & Székely, G. J. 2016, Wiley Interdiscip. Rev. Comput. Stat., 8, 27
Schwarzenberg-Czerny, A. 1989, MNRAS, 241, 153
Shahaf, S., Binnenfeld, A., Mazeh, T., & Zucker, S. 2020, Astrophysics Source Code Library [record ascl:2007.022]
Shahaf, S., Zackay, B., Mazeh, T., Faigler, S., & Ivashtenko, O. 2022, MNRAS, 513, 2732
Shen, C., Panda, S., & Vogelstein, J. T. 2022, J. Comput. Graph. Stat., 31, 254
Stellingwerf, R. F. 1978, ApJ, 224, 953
Székely, G. J., & Rizzo, M. L. 2013, J. Stat. Plan. Inference, 143, 1249
Székely, G. J., & Rizzo, M. L. 2014, Ann. Stat., 42, 2382
Székely, G. J., Rizzo, M. L., & Bakirov, N. K. 2007, Ann. Stat., 35, 2769
Trifonov, T., Tal-Or, L., Zechmeister, M., et al. 2020, A&A, 636, A74
VanderPlas, J. T. 2018, ApJS, 236, 16
Virtanen, P., Gommers, R., Oliphant, T. E., et al. 2020, Nat. Methods, 17, 261
Zechmeister, M., & Kürster, M. 2009, A&A, 496, 577
Zucker, S. 2018, MNRAS, 474, L86
Zucker, S. 2019, MNRAS, 484, L14

¹ PDC and its extensions, including USuRPER, partial distance correlation periodograms, PDC periodogram for scanning astrometry, and the periodogram presented in this work, are all available as part of the SPARTA package (Shahaf et al. 2020), at https://github.com/SPARTA-dev.

All Figures

Fig. 1

Illustration of the functional form attained by the energy distance between two Gaussians as a function of the normalized difference between their means, x. For simplicity, the diagram shows the x-dependent term in Eq. (3), illustrating that it is bounded from below by 1 and asymptotically approaches |x|. The additional term of the energy-distance-based metric imposes a constant offset to the curve presented in this figure.

Fig. 2

Simulated demonstration of an Earth-analog exoplanet detection. Top: simulated RV curve of a planet-hosting star, as described in Sect. 3.1. The red line represents the orbital model, and the black circles represent the simulated noisy measurements, together with the simulated error bars. Bottom: comparison of the two versions of the PDC periodogram: with (solid blue) and without (dashed red) considering the error bars. The dashed vertical line marks the frequency associated with the injected orbital period of 365 days. The lower dotted horizontal line corresponds to a FAP level of 10⁻², and the upper one to a level of 10⁻³.

Fig. 3

Demonstration of the advantages of the new version of the PDC periodogram, which takes into account the uncertainties, using RV data of the planet-host star HD 20782. Top: GLS and PDC periodograms. The dashed vertical line marks the frequency corresponding to the known orbital period of 597 days. The lower dotted horizontal line corresponds to a FAP level of 10⁻⁹, and the upper one to a level of 10⁻¹³. Middle panels: HARPS RV measurements for HD 20782, and their estimated measurement uncertainties (plotted separately for clarity). Bottom: RVs phase-folded according to the previously published orbital period.

Fig. 4

ROC curves comparing the regular PDC periodogram with its newly developed version. The dashed red line represents the regular PDC, and the thick blue one represents the new version that considers uncertainty estimates. The thin black line stands for a 1:1 ratio between the true and false positive rates, i.e., a random guess.

Fig. 5

ROC curves for the case of eccentric Keplerian RV curves (e > 0.35). The dashed red line represents the regular PDC, and the solid blue line represents the new version presented in this paper. They both outperform the GLS periodogram, represented by the dotted grey line. The thin black line stands for a 1:1 random guess.

Fig. 6

ROC curves comparing our newly developed method in its standard configuration against a scenario where the uncertainty estimates were randomly shuffled. The dashed red line represents the regular PDC, and the solid blue line represents the new version. The green dotted line represents the performance with the randomly shuffled uncertainty estimates. The thin black line represents a random guess.
