The effects of granulation and supergranulation on Earth-mass planet detectability in the habitable zone around F6-K4 stars

The detectability of exoplanets and the determination of their projected mass in radial velocity are affected by stellar magnetic activity and photospheric dynamics. The effect of granulation, and even more so of supergranulation, has been shown to be significant in the solar case. Our study is aimed at quantifying the impact of these flows for other stars and estimating how such contributions affect their performance. We analysed a broad array of extended synthetic time series that model these processes for main sequence stars with spectral types from F6 to K4, focusing on Earth-mass planets orbiting within the habitable zone around those stars. We estimated the expected detection rates and detection limits, and performed blind tests. We find that both granulation and supergranulation on these stars significantly affect planet mass characterisation in radial velocity when performing a follow-up of a transit detection, with uncertainties sometimes below 20% for a 1 MEarth, but much larger for supergranulation. For granulation and low levels of supergranulation, the detection rates are good for K and late G stars (if the number of points is large), but poor for more massive stars. The highest level of supergranulation leads to a very poor performance, even for K stars; this is both due to low detection rates and to high levels of false positives, even for a very dense temporal sampling over ten years. False positive levels estimated from standard false alarm probabilities sometimes significantly overestimate or underestimate the true level, depending on the number of points. We conclude that granulation and supergranulation significantly affect the performance of exoplanet detectability. Future works will focus on improving the following aspects: decreasing the number of false positives, increasing detection rates, and improving the false alarm probability estimations from observations.


Introduction
A large number of exoplanets have been detected using indirect techniques for over 20 years. However, because these techniques are indirect, they are very sensitive to stellar variability. The radial velocity (RV) technique is particularly sensitive to activity that is due to both magnetic and dynamical processes at different temporal scales. Many studies have focussed on stellar magnetic activity (recognised early on by Saar & Donahue 1997) based on simulations of simple spot configurations (e.g. Desort et al. 2007; Boisse et al. 2012; Dumusque et al. 2012) as well as more complex patterns (e.g. Lagrange et al. 2010; Meunier et al. 2010b,a; Borgniet et al. 2015; Santos et al. 2015; Dumusque 2016; Herrero et al. 2016; Dumusque et al. 2017; Meunier & Lagrange 2019a). Flows on different spatial and temporal scales also play an important role: in addition to large-scale flows such as meridional circulation (Makarov et al. 2010; Meunier & Lagrange 2020), oscillations, granulation, and supergranulation also affect RV time series.
Send offprint requests to: N. Meunier
The properties of these small-scale flows and the mitigating techniques used to remove them (mostly averaging techniques) have been studied in several works (e.g. Dumusque et al. 2011; Cegla et al. 2013; Meunier et al. 2015; Cegla et al. 2015; Sulis et al. 2016; Sulis et al. 2017a; Cegla et al. 2018; Meunier & Lagrange 2019b; Cegla et al. 2019; Chaplin et al. 2019) for the Sun and other stars. More details can be found in the review by Cegla (2019). The impact of granulation on the use of standard statistical tools has been pointed out by Sulis et al. (2017b), who proposed a new method (based on periodogram standardisation) to improve these tools, so far for a solar-type star. The RV jitter associated with granulation has also been studied for chromospherically quiet stars covering a large range in spectral types and evolutionary stages by Bastien et al. (2014).
A&A proofs: manuscript no. 38376_final
Granulation and supergranulation are challenging because of the shape of their power spectrum, which is flat (instead of decreasing, as in the case of oscillations) at low frequencies (Harvey 1984), and because they are not related to usual activity indicators. Furthermore, in Meunier & Lagrange (2019b), hereafter referred to as Paper I, we showed that for the Sun, the effect of supergranulation was unexpectedly strong and more problematic than the granulation signal. Here, we perform a similar analysis (with the addition of more complete blind tests) for main sequence stars extending over a large range of spectral types, that is, from F6 to K4 as in our magnetic activity simulations (hereafter referred to as Paper II), where this contribution was added to the activity signal to build more realistic long-term time series of realistic activity patterns. In the present paper, we aim to study granulation and supergranulation contributions to RVs for stars with various spectral types and to perform a detailed analysis of the false positive levels from different points of view (theoretical and observational) and their effect on exoplanet detection rates. We adopted a systematic approach to study and quantify these effects for different conditions, including different spectral types, numbers of observations, and samplings. We consider exoplanet detectability using RV techniques, but also the mass characterisation which can be made using RV in transit follow-ups: when the planet has been detected and validated using transits, its radius is known (relative to the stellar radius) along with other parameters (orbital period, phase), but only the RV techniques can currently provide a mass estimate, which, in turn, allows us to estimate its density, thus giving us a hint of its composition. We focus on Earth-like planets in the habitable zone of their host star.
Such a systematic approach is also very important because, for stars other than the Sun (Collier Cameron et al. 2019), very few have a very large number of observations (in the 500-1000 regime or above) currently available. Tests on observations are therefore limited, all the more so because these stars could host undetected planets.
The outline of the paper is as follows. In Sect. 2, we present the synthetic time series and the approaches we implemented to analyse them, as well as, in particular, how we define theoretical levels of false positives. In Sect. 3, we analyse these time series using true false positive levels (i.e. assuming a perfect knowledge of the properties of the signal) to derive detection rates and mass detection limits. In Sect. 4, we focus on the observational point of view by comparing usual false alarm probability levels with the true false positive levels and characterising the detection limits proposed in Meunier et al. (2012) for this type of signal. Then we estimate the uncertainty on the mass estimation in transit follow-ups. We implement blind tests to fully characterise the performance in terms of detectability and false positive levels when a classical tool is used to evaluate detections. Finally, we test complementary samplings in Sect. 5 and present our conclusions in Sect. 6.

Model and analysis
In this section, we describe the time series and how we extrapolate data from solar parameters (Meunier & Lagrange 2019b) to build stellar time series. Then we present the different approaches to analyse these synthetic time series and, in particular, we discuss how we determine false positive levels.

Time series of oscillations, granulation, and supergranulation
Our reference time series are solar ones: we first provide the amplitudes we consider for the Sun and apply those to G2 stars. Then we describe our assumptions for other stars.
Fig. 1. Rms RV vs. spectral type for GRAhigh (orange), SGmed (red), SGlow (brown), ALL GRAhigh,SGmed (green), and ALL GRAhigh,SGlow (blue), for the best sampling (3650 points, no gaps). The dashed lines correspond to the configurations including GRAlow (same colour code). Individual values are shown as stars.

Solar amplitudes
We first define the solar values we consider in this study. The time series are derived from power spectra following Harvey (1984) for granulation and supergranulation and following the shape of the envelope of the oscillations from Kallinger et al. (2014), as in Papers I and II. This method has the advantage of allowing us to produce a large number of very long time series. We showed in Meunier et al. (2015) that the shape proposed by Harvey (1984) was well adapted, even at low frequencies: therefore, we use the parameters found in Meunier et al. (2015). The choice of a one-hour binning is similar to what we chose in Paper I and corresponds to the timescale at which the RV jitter due to granulation reaches an inflexion point: binning over a longer duration is not efficient enough to reduce this jitter further, so this binning time is the one that best filters out granulation.
For granulation, in the majority of our study, we use an rms (root-mean-square) of 0.83 m/s before averaging (i.e. 0.39 m/s after averaging over one hour), hereafter GRAhigh, which stands as our reference value, provided by our simulations of about 1 million granules on the disk at any given time in Meunier et al. (2015). As discussed in Paper I, such simulations were based on realistic properties of granules (derived from hydrodynamical simulations of Rieutord et al. 2002), which are known to reproduce realistic line profiles (Asplund et al. 2000). However, lower values were derived from the observation of two specific spectral lines: about 0.32 m/s by Elsworth et al. (1994) from the potassium line at 770 nm and 0.46 m/s from the sodium doublet at 589 nm by Pallé et al. (1999). More recently, the residuals on timescales shorter than one day obtained by Collier Cameron et al. (2019) on solar integrated RV time series and covering the whole spectrum obtained by HARPS-N are also of the order of 0.40 m/s (when averaging over typically five minutes). Similar amplitudes have been obtained by Sulis et al. (2020) using MHD simulations. The difference between these estimates and the results of Meunier et al. (2015) may be due to some subtle effects in the centre-to-limb dependence which are not taken into account in Meunier et al. (2015), but also to the fact that observations were made in single lines, which may not be representative of the whole spectrum. For that reason, a level twice as low as our reference level (hereafter GRAlow) will also be considered in mass characterisations and blind tests in Sect. 4.3 and 4.4. We note that Cegla et al. (2019) obtained a very low rms RV for granulation, around 0.1 m/s, using a reconstruction based on MHD simulations of the solar surface. The reason for this discrepancy is not clear at this stage, although it may be due to the fact that strong vertical magnetic fields were used.
Concerning supergranulation, Meunier et al. (2015) provide a large range of possible values based on our current knowledge of these flows. Here, we consider two values, their median level (0.7 m/s, hereafter SGmed), and their lower level (0.27 m/s, hereafter SGlow), as in Meunier & Lagrange (2019b): these are in agreement with typical amplitudes obtained for a few stars by Dumusque et al. (2011). The median level is also close to the rms found by Pallé et al. (1999) for the Sun, with 0.78 m/s for the sodium doublet lines. Because of the longer timescales of supergranulation, the rms RV is almost the same after the one-hour averaging. The amplitude of the oscillations is derived from Davies et al. (2014), as in Paper I. The time scale is the same one obtained in Meunier et al. (2015) as in Paper I, that is, 1.1 × 10^6 s.
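To illustrate how a time series can be drawn from a power spectrum of this kind, the following minimal sketch assumes a Harvey-like profile proportional to 1/(1 + (2πντ)^2), random phases, and a final rescaling to the target rms. The function name, the exponent of 2, the deterministic amplitudes, and the direct one-hour step are assumptions made for the example; it is not the procedure of Meunier et al. (2015).

```python
import numpy as np

def harvey_series(n, dt, tau, rms, rng):
    """Draw a random series whose power spectrum follows a Harvey-like law.

    Assumptions (not from the paper): exponent 2, random phases only,
    and a final rescaling so that the series has exactly the target rms.
    """
    freq = np.fft.rfftfreq(n, d=dt)                       # frequencies in Hz
    psd = 1.0 / (1.0 + (2.0 * np.pi * freq * tau) ** 2)   # flat at low frequency
    phases = rng.uniform(0.0, 2.0 * np.pi, freq.size)
    spectrum = np.sqrt(psd) * np.exp(1j * phases)
    spectrum[0] = 0.0                                     # enforce zero mean
    series = np.fft.irfft(spectrum, n=n)
    return series * rms / series.std()                    # rescale to target rms

rng = np.random.default_rng(0)
# Supergranulation-like example: tau = 1.1e6 s, rms = 0.7 m/s,
# one point per hour over one year (illustrative duration)
sg = harvey_series(n=24 * 365, dt=3600.0, tau=1.1e6, rms=0.7, rng=rng)
```

A more faithful generator would also draw random amplitudes in each frequency bin and would start from the 30-second time step before binning; the sketch only shows the principle.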
We mainly use five types of time series throughout the paper: high level of granulation alone (GRAhigh), supergranulation alone (SGmed, median level, and SGlow, low level), all contributions for oscillations, a high level of granulation, and median supergranulation (ALL GRAhigh,SGmed) or low supergranulation (ALL GRAhigh,SGlow). In the following, ALL always represents the superposition of oscillations, granulation, and supergranulation. The other three configurations (GRAlow alone, ALL GRAlow,SGmed, ALL GRAlow,SGlow) are mostly considered for the mass characterisation and blind tests to provide a complete view. The configuration ALL GRAhigh,SGmed was used in combination with magnetic activity in Paper II. The contribution attributed to any of these combinations is referred to as the OGS (for oscillations, granulation, supergranulation) signal in the following. The oscillations are not studied alone here because we consider one-hour averages and they are well averaged out (Chaplin et al. 2019) at such timescales: they did not prevent us from obtaining excellent detection rates when considered independently (Paper I).

Stellar time series
We considered seven spectral types covering the F6-K4 range, that is, F6, F9, G2, G5, G8, K1, and K4. The amplitudes of the different components were scaled relative to G2 stars (i.e. solar values from the previous section) as in Paper II. We recall them here in brief. Granulation parameters are scaled from G2 stars to other spectral types using results from Beeck et al. (2013). Oscillation parameters are scaled using laws from Kjeldsen & Bedding (1995), Samadi et al. (2007), Bedding & Kjeldsen (2003), Kippenhahn & Weigert (1990), and Belkacem et al. (2013). Supergranulation is scaled following the granulation scaling, assuming supergranulation is strongly related to granulation properties (Rieutord et al. 2000; Roudier et al. 2016), including the time scale, which can differ by up to about 20%, so that the impact should be small. All time series were produced for a ten-year duration with a 30-second time step and are then binned over one hour. We then selected one such point per night. Examples of time series (subsets over short periods) are shown in Appendix A for F6, G2, and K4 stars, as well as examples of the power functions versus frequency. In addition to this full sample of 3650 nights, we consider several other configurations with a gap of four months per year to simulate the fact that a star can usually not be observed all year long. Then N obs nights were randomly selected out of the remaining nights over the ten-year duration. Each realisation of this selection for a given value of N obs corresponds to a different sampling. We use N obs = 180, 542, 904, 1266, 1628, 1990, and 2352 nights (with the four-month gap each year), and 3650 nights (no gap), leading to a total of eight configurations. In Paper I, we found that using a random selection or considering packs of adjacent nights did not lead to significant differences. In addition, Burt et al. (2018) tested different ways of building the sampling for magnetic activity time series and found that the random sampling was optimal (the uniform sampling was not extremely different, however). Testing of additional sampling configurations is presented in Sect. 5.
Figure 1 summarises the rms RV versus spectral type for the eight configurations of OGS time series used in this paper. We note a general decrease towards lower mass stars. When considering all components and spectral types, the RV jitter typically varies between 0.28 and 0.9 m/s when considering GRAhigh. The dashed lines show the levels when the granulation level is divided by 2 (GRAlow). In this case the granulation rms varies between 0.22 and 0.1 m/s, and when combined with the low level of supergranulation it varies between 0.2 and 0.37 m/s. We note that even for such a large number of points, there is little dispersion in RV jitter from one realisation to the next.
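The sampling described above (random selection of N obs nights over ten years, excluding a four-month gap each year) can be sketched as follows. The function name, the 120-day gap length, and the position of the gap at the start of each 365-day year are assumptions made for illustration only.

```python
import numpy as np

def draw_sampling(n_obs, n_years=10, gap_days=120, rng=None):
    """Pick n_obs distinct nights at random, excluding a yearly gap.

    Assumptions: 365-day years, and a gap of gap_days placed at the
    start of each year (about four months).
    """
    rng = np.random.default_rng() if rng is None else rng
    nights = np.arange(n_years * 365)
    observable = nights[(nights % 365) >= gap_days]   # nights outside the gap
    return np.sort(rng.choice(observable, size=n_obs, replace=False))

rng = np.random.default_rng(1)
nights = draw_sampling(542, rng=rng)   # one realisation for N obs = 542
```

Each call with a different random state yields a different sampling for the same N obs, which is how a set of realisations can be built.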

Principle of the analysis
Here, we describe the planet properties considered in this paper and then discuss issues related to detectability as well as mass characterisation in transit follow-ups.

Planets
We focus our analysis on low-mass planets orbiting in the habitable zone of their host stars. We define the limits of the habitable zone as a function of spectral type as in Meunier & Lagrange (2019a), following Kasting et al. (1993), Jones et al. (2006), and Zaninetti (2008). We consider three typical orbital periods, corresponding to the inner side (PHZ in), the middle (PHZ med), and the outer side (PHZ out) of the habitable zone: the resulting orbital periods vary from 409-1174 days for F6 stars to 179-501 days for K4 stars. Furthermore, we consider only circular orbits, for simplicity.
Most of the computations are carried out with projected masses of 1 and 2 M Earth. For inclinations higher than 40-50°, the performance obtained with these masses is representative of this whole range of inclinations, while for lower inclinations the performance should be significantly worse than the one presented in this paper. Therefore, additional blind tests, presented in Sect. 4.4, are also performed, considering a distribution of inclinations between 0° and 90° when building the data set and with the assumption that the orbital plane is the same as the stellar equatorial plane. In the case of a transit follow-up using RV to characterise the mass, however, the projected mass can be considered to be the true mass.
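To give an order of magnitude of the signal being sought, the sketch below evaluates the standard RV semi-amplitude of a planet on a circular orbit, K = (2πG/P)^(1/3) m sin i / (M* + m)^(2/3). The function name and the rounded physical constants are choices made for this example.

```python
import numpy as np

G = 6.674e-11        # gravitational constant, m^3 kg^-1 s^-2
M_SUN = 1.989e30     # kg
M_EARTH = 5.972e24   # kg
DAY = 86400.0        # s

def rv_semi_amplitude(m_sini_earth, period_days, mstar_sun=1.0):
    """RV semi-amplitude (m/s) for a circular orbit."""
    m = m_sini_earth * M_EARTH
    mstar = mstar_sun * M_SUN
    p = period_days * DAY
    return (2.0 * np.pi * G / p) ** (1.0 / 3.0) * m / (mstar + m) ** (2.0 / 3.0)

# A 1 M_Earth planet at one year around a solar-mass star: about 0.09 m/s
k_earth = rv_semi_amplitude(1.0, 365.25)
```

This amplitude is several times smaller than the rms RV values quoted above for granulation and supergranulation, which is why large numbers of observations are needed.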

Detectability
In subsequent sections, the analysis of the time series is made using two complementary approaches (i.e. two test statistics), which are then compared. The steps are as follows: (i) we analyse the periodograms (footnote 2) of the time series, and compute the maximum amplitude around the considered PHZ (frequential analysis, computed in the 0.9-1.1 PHZ range); (ii) we fit the planetary signal, considering a period guess corresponding to the period of this peak with maximum amplitude (temporal analysis) or of interest (PHZ) depending on the case. This fit is made using a χ2 minimisation.
Footnote 2. We use the Lomb-Scargle periodogram with no normalisation to be able to compare powers between different types of contributions. They are computed between 2 and 2000 days.
Fig. 3. False positive level in power fp P vs. period for the highest number of points, G2 stars, and five OGS configurations (GRAhigh in orange, SGmed in red, SGlow in brown, ALL GRAhigh,SGmed in green, and ALL GRAhigh,SGlow in blue). The solid lines represent fp P computed in 100 d ranges, while the dashed horizontal lines correspond to the single value of fp P computed over 100-1000 d.
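The frequential step can be sketched with a hand-written, unnormalised Lomb-Scargle periodogram in its classical form, followed by a search for the highest peak in the 0.9-1.1 PHZ window. The function names, the period grid, and the white-noise test signal are assumptions for illustration; the actual implementation used for the paper is not specified here.

```python
import numpy as np

def lomb_scargle_power(t, y, periods):
    """Unnormalised Lomb-Scargle power (classical form) at the given periods."""
    y = y - y.mean()
    power = np.empty(len(periods))
    for k, period in enumerate(periods):
        w = 2.0 * np.pi / period
        # Time offset that makes the sine and cosine terms orthogonal
        tau = np.arctan2(np.sin(2 * w * t).sum(), np.cos(2 * w * t).sum()) / (2 * w)
        c, s = np.cos(w * (t - tau)), np.sin(w * (t - tau))
        power[k] = 0.5 * ((y @ c) ** 2 / (c @ c) + (y @ s) ** 2 / (s @ s))
    return power

def peak_near(t, y, p_hz, n_grid=201):
    """Period and power of the highest peak in the 0.9-1.1 p_hz window."""
    periods = np.linspace(0.9 * p_hz, 1.1 * p_hz, n_grid)
    power = lomb_scargle_power(t, y, periods)
    return periods[power.argmax()], power.max()

# Illustrative data (assumed values): nightly sampling, white noise, 300 d signal
rng = np.random.default_rng(5)
t = np.arange(3650, dtype=float)
y = 0.1 * np.sin(2 * np.pi * t / 300.0) + rng.normal(0.0, 0.05, t.size)
peak_period, peak_power = peak_near(t, y, 300.0)
```

The period returned by `peak_near` is the natural initial guess for the fit of step (ii).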
We first consider the detectability of such exoplanets in the presence of the stellar contribution defined in the previous section. Because we consider synthetic time series, we can study them with the certainty that there is no planet present in the signal. As a consequence, we can estimate a true level of false positive (FP) for a given test statistic (frequential and temporal analysis) and for a given probability (e.g. 1%), and it is then possible to compute detection rates for a given planet (on the time series where the planet has been added), considering this level of false positives. The method we apply to determine the FP is described in Sect. 2.3. Once we have determined a true FP level corresponding to a certain percentage of false positives, and a detection rate for a given mass, we can also determine which mass corresponds to a good detection rate (e.g. 95%), which provides a detection limit. This approach is explored in Sect. 3.
From the point of view of the observer, however, the determination of the true level of false positive due to a given stellar contribution is not possible because it is not possible to know if it includes other, additional signals (of a planet for instance) and because we have only one realisation of the signal. This is why the analysis of observed time series always relies on other methods, such as the use of false alarm probability levels using bootstrap analysis, although this approach makes assumptions on the signal which may not be correct, as pointed out by Sulis et al. (2017a). In a second step, we therefore test this type of approach and compare it with the one based on the true false positive level. We also compare the detection limits based on the periodogram analysis proposed by Meunier et al. (2012), the local power analysis (LPA) method, with the true detection limits. A blind test is implemented to estimate the detection rates and false positive levels and compare them with the true ones. This approach is explored in Sect. 4.

Mass characterisation
The latter issue, also studied in Sect. 4, concerns the performance with regard to mass characterisations of planets detected by transit in photometric light curves. In this case, we consider that the planet presence is confirmed, meaning that the transits do not require any validation using RV observations. There is, therefore, no issue with false positives in this case and we also know its orbital period and phase with very good precision from the transit. We can then fit the RV amplitude due to the planet (temporal analysis) at this orbital period to determine the precision for the mass characterisation.
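As a minimal sketch of this transit follow-up case, the example below fits only the amplitude of a sinusoid whose period and phase are known, and measures the scatter of the fitted amplitude over many noise realisations. White Gaussian noise is used as a stand-in for the OGS signal, and all numerical values and names are assumptions; with the red noise of granulation and supergranulation the uncertainties are expected to differ.

```python
import numpy as np

def fit_amplitude(t, rv, period, phase):
    """Least-squares amplitude of a sinusoid with known period and phase."""
    model = np.sin(2.0 * np.pi * t / period + phase)
    return np.dot(rv, model) / np.dot(model, model)

rng = np.random.default_rng(2)
t = np.sort(rng.choice(3650, size=500, replace=False)).astype(float)
k_true, period, phase = 0.09, 365.0, 0.3   # m/s, days, rad (illustrative)

fitted = np.array([
    fit_amplitude(
        t,
        k_true * np.sin(2.0 * np.pi * t / period + phase)
        + rng.normal(0.0, 0.4, t.size),      # white-noise stand-in, 0.4 m/s rms
        period,
        phase,
    )
    for _ in range(500)
])

# At fixed period and stellar mass (planet mass much smaller than the star's),
# the relative uncertainty on the amplitude equals that on the mass.
relative_uncertainty = fitted.std() / k_true
```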

False positives from synthetic time series
Here, we describe how we estimate the false positive (FP) level at the 1% level, both in mass (temporal analysis) and power (frequential analysis). This level corresponds to the frequency behaviour of the OGS signal alone and to a given test statistic (here, the power at the period we are interested in or the fitted mass; see previous section), since it is computed based on a large number of time series of the OGS signal alone. This is done with no correction of the signal (apart from the one-hour binning).
To estimate the FP from our time series, we produce 1000 realisations of the OGS signal and sampling (for a given spectral type and number of points N obs ) as described in Sect. 2.1. For each of the three orbital periods corresponding to the habitable zone (Sect. 2.2), we fit a planetary signal at this period, which provides 1000 values of the mass. The period used as a guess before minimisation is the period of the peak with maximum power in the periodogram around the period we are interested in (namely in the 0.9-1.1 PHZ range as above). The 1% false positive level fp M is defined as the mass such that 1% of the 1000 values are higher. This level is therefore estimated for each spectral type, N obs and PHZ.
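The definition of the 1% false positive level amounts to taking the 99th percentile of the test statistic over the planet-free realisations. A minimal sketch follows, in which the 1000 statistic values are replaced by random numbers (a hypothetical stand-in for fitted masses or peak powers).

```python
import numpy as np

def false_positive_level(stat_values, rate=0.01):
    """Threshold exceeded by a fraction `rate` of the planet-free statistics."""
    return np.quantile(stat_values, 1.0 - rate)

rng = np.random.default_rng(3)
# Hypothetical stand-in for 1000 fitted masses (or peak powers)
# obtained from planet-free OGS time series
planet_free_stats = np.abs(rng.normal(0.0, 0.5, 1000))

fp = false_positive_level(planet_free_stats)
fraction_above = np.mean(planet_free_stats > fp)   # close to 1% by construction
```

The same function applies to both fp M and fp P; only the statistic fed to it changes.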
To ascertain that a planet has been detected, we compare the fitted mass (temporal analysis) to fp M: if it is higher than fp M, we consider the planet as detected. Figure 2 shows fp M versus spectral type and N obs, for PHZ med. The values of fp M decrease towards lower mass stars and with higher values of N obs. There are many configurations where fp M is higher than 1 M Earth, with values as high as several M Earth for F6 stars, but below 1 M Earth for K4 stars. For a given spectral type and OGS configuration, fp M decreases as N obs increases, but the variation is not linear in √N obs, and after a sharp decrease at low N obs, the level does not change much, as shown in Fig. 2. More details about the dependence on N obs are shown in Sect. 5.
For each of these 1000 realisations, we also compute the periodogram and the maximum peak amplitude in two ways: between 100 and 1000 days (which includes most of our PHZ values) and in 10 ranges of 100 days between 0 and 1000 days, to check whether the FP depends on the period. As before, the 1% level, fp P, is computed out of each series of 1000 values. The results are shown in Fig. 3. There is a clear trend with period, and the value computed over the whole range corresponds roughly to that obtained at the lowest periods. In the following, we consider fp P computed for the different period ranges to take this trend into account.

Simulated detection rates of Earth-mass planets in the habitable zone
In this section, we consider the synthetic time series produced in the previous section and add planets with different masses at different orbital periods to estimate the effect of the OGS signal on exoplanet detectability. We use the level of false positives (corresponding to 1% in the following) defined in Sect. 2.3, both in mass for the temporal analysis (fp M) and in power for the frequential analysis (fp P). We then compute detection rates for various masses and detection limits corresponding to these well identified detection rates. We use only GRAhigh in this section (with five OGS configurations).

Detection rates for Earth-mass planets
We consider planets with projected masses of 1 M Earth and 2 M Earth (see Sect. 2.2.1 for a discussion) on circular orbits and at three positions in the habitable zone of each spectral type as described in Sect. 2. The signal due to such planets (with a random phase) is added to each of the 1000 realisations of the OGS signals and sampling for each spectral type and N obs. For the frequential analysis, we use the amplitude of the peak at the orbital period we are interested in. For the temporal analysis, the fit is made with an initial guess for the period corresponding to this peak [...] that the frequential analysis is more robust for obtaining good detection rates.
Figure 5 shows the resulting detection rates obtained with the frequential analysis depending on N obs. For each spectral type, the curves indicate the necessary number of points N obs to reach a 50% detection rate (solid lines) or a 95% detection rate (dashed lines). Curves at a low level mean that it is very easy to detect planets (small values of N obs are sufficient), while curves at the top correspond to configurations for which a detection is difficult to obtain (high values of N obs). Higher values of N obs are necessary for longer orbital periods, as expected (since the amplitude of the planetary signal decreases). For 2 M Earth, the detection rates are very good for granulation and low supergranulation levels (or ALL GRAhigh,SGlow), as excellent rates can be reached with a low number of observations. Very good detection rates require a very large number of observations (a few hundred to a few thousand depending on spectral type) when considering SGmed. Adding granulation to SGmed does not change the performance much. For 1 M Earth, the performance is not as good, and higher numbers of points are required to get good detection rates. The low level of supergranulation leads to good detection rates, but only with a high number of points, except for F stars, for which even our maximum N obs of 3650 nights does not allow us to reach detection rates of 95%. The situation is significantly worse for the median level of supergranulation, with conclusions similar to what was found in Paper I for G2 stars. We also observe a bump for K1 stars and PHZ med: this is due to the fact that for this particular configuration, the orbital period is equal to 366 days, and given the gap introduced every year in the sampling, planets at such periods would naturally be more difficult to detect. As expected, the frequential analysis is therefore quite sensitive to the temporal window. We conclude that the performance is good for a 2 M Earth planet, while for a 1 M Earth planet good results can be achieved only with a very high frequency of observations, mostly due to supergranulation.
Figure 6 shows similar curves for the temporal analysis, that is, with the fitted mass used as a criterion for estimating the detection rates. The global trends are similar to the frequential analysis, with two main differences. All curves correspond to higher numbers of points, that is, more observations are required to obtain the same detection rate. This is due to the difference in false positive levels already noted in Sect. 2.3: the frequential analysis criterion allows us to get better detection rates. On the other hand, there is no longer a bump for K1 stars and PHZ med with this approach, as the temporal analysis is less sensitive to the temporal window than the frequential analysis.
Fig. 7. Example of detection rate (%) vs. planet mass, for G2 stars, 1266 points, and GRAhigh, in two cases: based on frequential analysis (black curve) and on temporal analysis (red curve). The vertical solid lines indicate the corresponding 95% level, and the dashed lines the 50% level.
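The way a detection rate follows from a false positive level can be sketched as below: the amplitude fitted at a fixed period on planet-free series gives the 1% threshold, and the fraction of planet-injected series exceeding it gives the detection rate. White Gaussian noise replaces the OGS signal, and the function name and all numerical values are assumptions; the rates obtained this way are therefore not those of the paper.

```python
import numpy as np

def fitted_amplitudes(t, period, k_planet, noise_rms, n_real, rng):
    """Fit a sinusoid (free phase, free offset) at a fixed period to n_real series."""
    arg = 2.0 * np.pi * t / period
    design = np.column_stack([np.sin(arg), np.cos(arg), np.ones_like(t)])
    solver = np.linalg.pinv(design)                    # shape (3, n_points)
    phases = rng.uniform(0.0, 2.0 * np.pi, n_real)     # random planet phases
    signal = k_planet * np.sin(arg[None, :] + phases[:, None])
    rv = signal + rng.normal(0.0, noise_rms, (n_real, t.size))  # white-noise stand-in
    coeffs = rv @ solver.T                             # shape (n_real, 3)
    return np.hypot(coeffs[:, 0], coeffs[:, 1])        # fitted semi-amplitudes

rng = np.random.default_rng(4)
t = np.sort(rng.choice(3650, size=1266, replace=False)).astype(float)

noise_only = fitted_amplitudes(t, 300.0, 0.0, 0.4, 1000, rng)     # no planet
fp_level = np.quantile(noise_only, 0.99)                          # 1% false positives
with_planet = fitted_amplitudes(t, 300.0, 0.10, 0.4, 1000, rng)   # planet injected
detection_rate = 100.0 * np.mean(with_planet > fp_level)
```

Repeating the last three lines for each N obs and reading off where the rate crosses 50% or 95% gives curves of the kind described above.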

Detection limits
Detection rates are computed as in the previous section but for a large range of planet masses and a 0.1 M Earth step. This allows us to determine at which mass, for a given spectral type, N obs, and OGS configuration, the detection rate is equal to 95% for example (given a false positive level of 1%). Only 100 realisations of the signal OGS+planet are performed because such computations are time consuming. For the same reason, computations are made only for the middle of the habitable zone PHZ med. An example of detection rate versus planet mass is shown in Fig. 7 to illustrate the procedure. As already noted, there is a shift between the frequential analysis and the temporal analysis, of the order of 0.1 M Earth in this example. Figure 8 shows the detection limits versus spectral types for different OGS contributions and different values of N obs (between 180 and 3650 nights) covering 10 years. At the 50% level, they are often below 1 M Earth (especially for low mass stars) if N obs is sufficiently high: this is the case for GRAhigh, SGlow and ALL GRAhigh,SGlow. They are mostly above 1 M Earth for SGmed and ALL GRAhigh,SGmed, however, with values up to 2.5 M Earth for F6 stars. At the 95% level, only the highest values of N obs allow us to reach 1 M Earth, and this is true for K4 stars only when considering the median level of supergranulation.
Fig. 9. Average ratio fap/fp P for PHZ med vs. N obs. The average is computed over all realisations and spectral types. The colour code represents the period: inner side (black), middle (red), and outer side (green) of the habitable zone.
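Reading a detection limit from a curve of detection rate versus mass can be sketched as follows. The linear interpolation between grid points and the logistic-shaped rate curve are assumptions made for the example; the rates themselves would come from the simulations.

```python
import numpy as np

def detection_limit(masses, rates, target=95.0):
    """Lowest mass at which the detection rate reaches `target` (linear interpolation)."""
    masses = np.asarray(masses, dtype=float)
    rates = np.asarray(rates, dtype=float)
    above = np.nonzero(rates >= target)[0]
    if above.size == 0:
        return np.nan                      # target never reached on this grid
    i = above[0]
    if i == 0:
        return masses[0]                   # already above target at lowest mass
    frac = (target - rates[i - 1]) / (rates[i] - rates[i - 1])
    return masses[i - 1] + frac * (masses[i] - masses[i - 1])

masses = np.arange(0.1, 3.01, 0.1)                        # 0.1 M_Earth step
# Hypothetical detection-rate curve (percent), for illustration only
rates = 100.0 / (1.0 + np.exp(-(masses - 1.0) / 0.15))

limit_50 = detection_limit(masses, rates, 50.0)
limit_95 = detection_limit(masses, rates, 95.0)
```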
We conclude that in most configurations, the detection limits are higher than 1 M Earth . This is the case especially for the most massive stars and when a limited number of nights is available (typically a few hundreds for granulation, but a few thousands for supergranulation).

Observational approach
The results presented in the previous section are based on a perfect knowledge of the OGS signal. This allowed us to compute true false positive levels and to deduce detection rates corresponding to a given level (1%) of false positives: given the true false positive levels, this approach provided the best detection rates possible, with a controlled false positive level. We now consider the point of view of an observer, who is interested in a time series which may contain other contributions and for which only one realisation is available: different tools must then be used, and actual detection rates may be lower, or the resulting detection rates may correspond to a higher false positive level. It is, therefore, important to compare these tools. We first compare the false alarm probability (FAP) obtained using a bootstrap analysis with the true false positive level. Then we compute the detection limits using the LPA method (Meunier et al. 2012) and determine which true detection rates and exclusion rates these detection limits correspond to. Finally, we characterise the mass uncertainty in transit follow-ups and we implement several blind tests to estimate the detection rates and false positives obtained when a usual FAP analysis of the data is performed.

Classical bootstrap false alarm probability
In this section, we focus on the comparison between the FAP level and the true false positive level, fp P , with no injected planet, both at the 1% level. The effect on detection rates will be studied in the blind tests in Sect. 4.4. Only GRAhigh is used in this section. For each time series (with no planet), we compute the 1% FAP level using a bootstrap analysis. Because it is time consuming, only ten realisations of the OGS signal are considered for each spectral type and value of N obs . The maximum of the periodogram to compute the FAP is computed over the whole periodogram, that is, between 2 and 2000 days. For each configuration (spectral type, N obs ) and a given orbital period (one of the three PHZ values), we compute the following values: the percentage of simulations with a FAP higher than the true false positive level fp P at 1% obtained in the previous sections (this is necessarily noisy since there are only ten realisations); the ratio of the FAP and FP, namely, fap/fp P (averaged over the ten realisations); the number of peaks above the FAP (averaged over the ten realisations).
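A bootstrap FAP level of this kind can be sketched as follows: the RV values are shuffled among the observation times, the maximum of the periodogram over the whole period range is recorded, and the 99th percentile of these maxima gives the 1% FAP level. The unnormalised Lomb-Scargle implementation, the frequency grid, the number of shuffles, and the white-noise test data are all assumptions for illustration.

```python
import numpy as np

def periodogram_basis(t, periods):
    """Cosine and sine bases of the classical Lomb-Scargle periodogram."""
    w = 2.0 * np.pi / np.asarray(periods, dtype=float)[:, None]
    tau = np.arctan2(np.sin(2 * w * t).sum(axis=1),
                     np.cos(2 * w * t).sum(axis=1))[:, None] / (2 * w)
    return np.cos(w * (t - tau)), np.sin(w * (t - tau))

def max_power(y, c, s):
    """Maximum unnormalised periodogram power over all trial periods."""
    y = y - y.mean()
    power = 0.5 * ((c @ y) ** 2 / (c ** 2).sum(axis=1)
                   + (s @ y) ** 2 / (s ** 2).sum(axis=1))
    return power.max()

def bootstrap_fap_level(t, rv, periods, n_boot=200, rate=0.01, rng=None):
    """Power exceeded by a fraction `rate` of the shuffled series."""
    rng = np.random.default_rng() if rng is None else rng
    c, s = periodogram_basis(t, periods)
    maxima = [max_power(rng.permutation(rv), c, s) for _ in range(n_boot)]
    return np.quantile(maxima, 1.0 - rate)

rng = np.random.default_rng(6)
nights = np.sort(rng.choice(3650, size=300, replace=False))
t = nights + rng.uniform(-0.1, 0.1, nights.size)   # small intra-night jitter (assumed)
rv = rng.normal(0.0, 0.4, t.size)                  # planet-free white-noise stand-in
periods = 1.0 / np.linspace(1.0 / 2000.0, 1.0 / 2.0, 400)   # 2 to 2000 days
fap_level = bootstrap_fap_level(t, rv, periods, rng=rng)
```

Because shuffling destroys any temporal correlation, the resulting level reflects white noise with the same distribution of RV values, which is the assumption discussed in this section.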
The results are summarised in Fig. 9. The fap/fp P and the percentage of simulations with FAP larger than the FP are strongly correlated, therefore only the ratio is shown. Although the results show some dispersion because of the low number of realisations (a larger number of realisations performed on a few typical configurations gives similar results, however), some trends can be observed. The ratio covers a wide range, with values between 0.6 and 3 (after averaging on the ten realisations). For GRAhigh, the percentage is always 100%, and it is almost always the case for ALL GRAhigh,SGlow : the FAP is then always overestimating the false positive level, on average by a factor of two (corresponding to a factor of four on the mass). In the other configurations, there is a high proportion of simulations where the FAP is larger than the true false positive level when N obs is small, and it tends to be the opposite for a large number of points, with a transition for N obs in the 1000-2000 range. The limit between the two regimes occurs at higher N obs for a given orbital period (alternatively, for a given N obs , the ratio is larger at longer periods). Finally, the average number of peaks over all configurations is low (0.24) but there are several peaks above the FAP in some configurations, mostly for supergranulation alone and ALL GRAhigh,SGmed , especially when N obs is large, in agreement with the ratio.
The true false positive level corresponds to the true frequency behaviour of the OGS signal, while the FAP assumes a white noise with a similar rms RV and a similar distribution of RV values. The shape of the power spectrum of the OGS signal is such that the usual FAP computation is not always adapted: it appears to overestimate the false positive level when the number of points is low (or for GRAhigh and ALL GRAhigh,SGlow in all configurations), which should lead to underestimated detection rates, corresponding to a conservative approach to detection. When the number of points is high, however, for SGlow, SGmed, and ALL GRAhigh,SGmed , the FAP underestimates the false positive level, which should lead to potentially good detection rates but corresponding to much higher false positive levels in reality. These results are compatible with those of Sulis et al. (2017a), who proposed a new method (periodogram standardisation) to be able to use standard tools such as the FAP.
Finally, we note that the FAP is computed over the whole range over which we compute the periodogram (2-2000 days), while in the previous section, we considered an FP that depends on the period (see Sect. 2.3 and Fig. 3): given its shape, the FP level we are interested in when searching for planets in the habitable zone is lower than at short periods. We expect the FAP to agree better with the FP at low periods.

LPA detection limits: exclusion rates and detection rates
The LPA method proposed in Meunier et al. (2012) computes detection limits as a function of orbital period from a given RV time series, taking the power around the considered orbital period due to stellar contribution into account since stellar activity produces signal at some specific periods. This fast computing method has been used in several works (for example Lagrange et al. 2013;Borgniet et al. 2017;Lannier et al. 2017;Lagrange et al. 2018;Borgniet et al. 2019;Grandjean et al. 2019). Here, we recall the method in brief, which is also illustrated in the upper panel of Fig. 10. For a given orbital period P orb , we compute the maximum power P max in the periodogram in a window around P orb . The detection limit is defined as the mass which would give a peak amplitude equal to 1.3×P max (Lannier et al. 2017): we exclude the presence of planets with masses above the LPA detection limit because otherwise they would have produced a larger amplitude than observed (around that period), meaning that it is an exclusion limit. There is, however, a simplification in this computation. This is because when the planetary signal is superposed on a stellar signal, depending on its phase, the amplitude of the resulting peak can vary a great deal, as shown, for example, in Paper I. This effect is not taken into account in the LPA computation, although the 1.3 factor gives a good margin. It is useful to estimate, for different OGS configurations (only GRAhigh is used here, i.e. five configurations of OGS), which exclusion rates such a definition corresponds to: the objective is that this rate is close to 100% for good exclusion performance derived from this limit and as robust as possible for all configurations.
Table 1 notes. The amplitude factor is the ratio applied to the maximum power in the periodogram in the period range we are interested in to be compared to the planet amplitude (see text).
For that purpose, we implement the following procedure, illustrated in the lower panel of Fig. 10. For each spectral type and each N obs value chosen among a subsample (180, 1266, 2353 points), we consider 100 (N1) realisations of the OGS signal and sampling. One of these realisations is shown in Fig. 10. For each of these N1 realisations, the LPA detection limit M lpa is computed for orbital periods equal to PHZ med and we perform 100 (N2) realisations of the planetary signal of this mass M lpa and period (i.e. N2 random phases), which is added to the corresponding OGS signal. The maximum peak in the periodogram, P', computed in the same window as above, is compared to P max (maximum power in the periodogram around the considered period) for each of these N2 realisations (the N2 values of P' are shown in the lower panel of Fig. 10): the percentage of realisations (out of the N2 values) where this maximum is higher than P max (i.e. unobserved) is the exclusion rate. In addition, the maximum peak in the periodogram can also be compared to the true false positive fp P (from Sect. 2.3), leading to a detection rate computed from the N2 realisations. For each configuration, we therefore derive 100 exclusion rates and 100 detection rates.
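The LPA limit and the exclusion-rate test described above can be sketched as follows. This is only an illustration under stated assumptions: the window half-width (10% of the period), the period grid, the conversion between power and amplitude (a sinusoid of amplitude K sampled N times is taken to give a power of about N K²/4 in this normalisation), and all function names are choices made for the example, not the implementation of Meunier et al. (2012). The semi-amplitude of a 1 M Earth planet uses the standard circular, edge-on expression (about 0.09 m/s for a one-year orbit around a solar-mass star).

```python
import math
import random

def power(t, y, period):
    """Periodogram power at one trial period (mean-subtracted series)."""
    mean = sum(y) / len(y)
    w = 2.0 * math.pi / period
    c = sum((yi - mean) * math.cos(w * ti) for ti, yi in zip(t, y))
    s = sum((yi - mean) * math.sin(w * ti) for ti, yi in zip(t, y))
    return (c * c + s * s) / len(y)

def window_periods(p_orb, half_width=0.1, n_grid=11):
    """Trial periods around p_orb; the window width is an assumed value."""
    lo, hi = p_orb * (1.0 - half_width), p_orb * (1.0 + half_width)
    return [lo + (hi - lo) * k / (n_grid - 1) for k in range(n_grid)]

def k_one_earth(period_days, mstar_sun):
    """RV semi-amplitude (m/s) of a 1 M_Earth planet, circular edge-on orbit."""
    return 0.0895 * mstar_sun ** (-2.0 / 3.0) * (period_days / 365.25) ** (-1.0 / 3.0)

def lpa_limit(t, y, p_orb, mstar_sun, factor=1.3):
    """Return (mass limit in M_Earth, amplitude limit in m/s, P_max)."""
    p_max = max(power(t, y, p) for p in window_periods(p_orb))
    # Assumed conversion: power ~ N K^2 / 4 for a well-sampled sinusoid.
    k_lim = math.sqrt(4.0 * factor * p_max / len(y))
    return k_lim / k_one_earth(p_orb, mstar_sun), k_lim, p_max

def exclusion_rate(t, y, p_orb, k_lim, p_max, n_phase=100, seed=0):
    """Fraction of random phases for which the injected planet raises the
    maximum power in the window above the observed P_max."""
    rng = random.Random(seed)
    periods = window_periods(p_orb)
    w = 2.0 * math.pi / p_orb
    above = 0
    for _ in range(n_phase):
        phase = rng.uniform(0.0, 2.0 * math.pi)
        y_inj = [yi + k_lim * math.sin(w * ti + phase) for ti, yi in zip(t, y)]
        if max(power(t, y_inj, p) for p in periods) > p_max:
            above += 1
    return above / n_phase
```

Replacing the comparison with P max by a comparison with the true false positive level would give the corresponding detection rate from the same set of injections.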
We find that the exclusion rate is quite constant for most spectral types (and slightly lower for F6 stars), with a median of 87%. When computing the LPA limits with the above threshold, there would therefore still be a 13% chance to miss a planet at the detection limit. The detection rates, on the other hand, are rather low, typically in the 20-40% range. The average detection limits are below 1 M Earth , except when the median supergranulation level is considered, in which case they are above 1 M Earth for the most massive stars. The LPA detection limit naturally depends on the spectral type, but also depends strongly on the number of points N obs . As a summary, Fig. 11 shows the distribution of the different rates for all realisations. The exclusion rate shows a high peak at 100% (about a quarter of all simulations) and all values are above 50%. The detection rates are much lower, with a high peak at 0.
Finally, for G2 stars and 1266 points, we investigated the effect of the chosen factor (1.3) to compute the LPA limit on the exclusion rates. The results are shown in Table 1. As expected, the exclusion rates are improved by a larger factor. A median exclusion rate of 99% is reached for a factor of 1.9, for which half of the cases correspond to a 100% exclusion rate: this would correspond to a LPA mass that is higher by 21% (compared to the mass obtained with the 1.3 factor). We note, however, that the minimum exclusion rate increases very slowly.
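The 21% increase quoted above is consistent with a simple scaling argument. Assuming that the periodogram power scales as the square of the RV amplitude, and hence as the square of the planet mass (an assumption on the periodogram normalisation that is not spelled out in this section), the ratio of the LPA masses obtained with the two factors is:

```latex
\frac{M_{\mathrm{lpa}}(1.9)}{M_{\mathrm{lpa}}(1.3)}
  = \sqrt{\frac{1.9}{1.3}} \approx 1.21
```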
We conclude that the LPA corresponds to a good exclusion rate, although it is not 100%. The LPA masses are also lower than the detection limits computed in the previous section.

Mass characterisation for Earth-mass planets in the habitable zone
Before considering the detectability issue from the point of view of an observer, we consider the performance in terms of mass characterisation during a transit follow-up in RV. The transit provides an excellent estimate of the orbital period and of the phase of the planetary signal (the length of the transit is extremely small compared to the orbital periods considered here). The mass of the injected planet is extremely close to the true mass (orbit seen edge-on). We consider 1000 realisations of the OGS signal (8 configurations) as defined in Sect. 2.1 and samplings for each spectral type, values of N obs and PHZ, and add a 1 M Earth or a 2 M Earth planet with an arbitrary phase to each of them. Results for additional masses are shown in Appendix B. The planetary signal is then fitted (amplitude only as the period and phase are known) and from this we deduce the planet's mass. For each configuration, the 1000 values of the mass can then be compared to the input value. For K4 stars, the mass distributions are quite narrow and are well separated between the two input masses we consider. The distributions are very good for GRAhigh and GRAlow, but when added to supergranulation (in particular, SGmed) the distributions are dominated by supergranulation. Distributions are close to Gaussian. For G5 stars, the distributions widen and for the input of 1 M Earth and SGmed (or ALL GRAhigh,SGmed ), the distributions are wide enough to include no planet, hence, there are large uncertainties on the mass. Finally, for F6 stars, the distributions are much wider and the median level of supergranulation leads to very large dispersion, (much larger than the mass). Thus, they correspond to very poor mass characterisations.
The average fitted mass is always in excellent agreement with the input mass, with no significant bias. The dispersion decreases with increasing N obs and decreasing stellar mass. For example, for G2 stars and ALL GRAhigh,SGlow , at the 3σ level, masses are between 0 and 3 M Earth (for an input of 2 M Earth ) and between 0 and 3 M Earth (for an input of 1 M Earth ) for 180 points. The ranges are reduced to 1.1-3 and 0.1-1.7 M Earth , respectively, for 1266 points, and to 1.2-2.6 and 0.2-1.6 M Earth for 2352 points. For K4 stars in the same conditions, the 3σ uncertainties are already very good for 180 points (0.2-1.8 and 1.2-2.8 M Earth ) and fall to 0.7-1.2 and 1.8-2.2 M Earth for the higher number of points.
The uncertainties on the mass are summarised in Fig. 12. For 1 M Earth and GRAhigh, the uncertainties at the 1σ level are below 35% and, except for the most massive stars, they are around 20% or below, which are good mass estimates. With SGlow, the uncertainties are larger, but remain below a few tens of percent (40% for F6 stars with a very good sampling). They are, however, significantly higher when considering SGmed (alone or added to granulation and oscillations): they are always above 20% and can be as high as 100% for F6 stars. The low level of granulation alone provides very good uncertainties: for F6 they are below 20% for N obs above 1266 for 1 M Earth , and for K4 they are much below 20% even for a small N obs . Performance is still good when the low level of supergranulation is added, provided N obs is large (except for stars with spectral types earlier than G2, even for very high N obs ), but the uncertainties are again mostly above 20% when the median level of supergranulation is added and can reach values up to 50% for F6 stars. In absolute values, the uncertainties are not very different between 1 M Earth and 2 M Earth , so that the relative uncertainty for 2 M Earth is about half that for 1 M Earth . Overall, there is a significant gain in performance between 180 points (very poor in general) and 1266 points, but no significant further improvement between 1266 and 2352 points.
The dependence on N obs is discussed in detail in Sect. 5.1. For practical purposes, the values of N obs needed to reach a precision of 20% on the mass are shown in Fig. 13. Values are lower or upper limits in a few cases: upper limits mean that even with 180 points, uncertainties are below 20%, so that a lower number of points is sufficient. Lower limits, shown by the diamond symbols, mean that even with 3650 points over ten years it is impossible to reach a 20% uncertainty. Apart from K4 stars, the only OGS contributions that allow a 20% uncertainty to be reached with N obs within the range that we considered are granulation alone (high or low), SGlow, and the combination of both.
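The amplitude-only fit used for the transit follow-up can be illustrated with a short sketch. It assumes a circular, edge-on orbit with known period and phase, and replaces the OGS signal with white Gaussian noise, which is a deliberate simplification: the dispersions it returns are therefore not those of the paper. Function names and default values are hypothetical.

```python
import math
import random

def k_one_earth(period_days, mstar_sun):
    """RV semi-amplitude (m/s) of a 1 M_Earth planet, circular edge-on orbit."""
    return 0.0895 * mstar_sun ** (-2.0 / 3.0) * (period_days / 365.25) ** (-1.0 / 3.0)

def fit_mass(t, rv, period_days, phase, mstar_sun):
    """Least-squares fit of the amplitude only (period and phase known),
    converted to a mass in M_Earth."""
    w = 2.0 * math.pi / period_days
    s = [math.sin(w * ti + phase) for ti in t]
    k_fit = sum(r * si for r, si in zip(rv, s)) / sum(si * si for si in s)
    return k_fit / k_one_earth(period_days, mstar_sun)

def mass_dispersion(mass_earth, period_days, mstar_sun, n_obs, noise_rms,
                    n_real=200, span_days=3650.0, seed=0):
    """Mean and standard deviation of the fitted mass over n_real realisations.

    White noise stands in for the OGS signal here (illustration only).
    """
    rng = random.Random(seed)
    k_true = mass_earth * k_one_earth(period_days, mstar_sun)
    w = 2.0 * math.pi / period_days
    masses = []
    for _ in range(n_real):
        t = [rng.uniform(0.0, span_days) for _ in range(n_obs)]
        phase = rng.uniform(0.0, 2.0 * math.pi)
        rv = [k_true * math.sin(w * ti + phase) + rng.gauss(0.0, noise_rms) for ti in t]
        masses.append(fit_mass(t, rv, period_days, phase, mstar_sun))
    mean = sum(masses) / n_real
    std = math.sqrt(sum((m - mean) ** 2 for m in masses) / (n_real - 1))
    return mean, std
```

With white noise the fit is unbiased by construction, in line with the absence of bias reported above; the width of the distribution is what changes once a correlated OGS signal is used instead.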
The uncertainty on the mass estimation is strongly correlated with the true false positive level (in mass) computed in Sect. 2.3, as illustrated in Fig. 14. When considering all spectral types, N obs values, orbital periods, and different OGS configurations together, the correlation between the two variables is 0.96. The correlation slightly depends on the OGS configurations, with values between 0.93 and 0.99, but remains very high. There is a tendency for high values of N obs to lead to higher uncertainties at a given false positive level (however, they naturally correspond on average to lower false positive levels). For example, the false positive level at 2 M Earth corresponds to a 1σ uncertainty between 40% and 60%. For 1 M Earth , it is between 20% and 35 %. To guarantee uncertainties below 20%, the theoretical false positive level should be below ∼0.5 M Earth .

Blind tests
In this last section, we implement blind tests to estimate the level of false positives and the detection rates when applying the FAP criterion to the OGS time series in two cases: when a planet is injected or when there is no planet. We describe the principle of the blind tests, how the data sets are built and analysed, and, finally, our results.

Principle
For each OGS signal and spectral type, statistically half of the realisations of the time series remain unchanged while a planet is added to the other half. The analysis of each time series allows us to determine whether a planet is detected or not. In a second independent step, we determine the level of false positives and the detection rate for each set of simulations, by comparing the outputs with what was actually injected or not. We focus our analysis on one of the N obs values (1266 points), which corresponds to good conditions, but still with a reasonable rate of observations for future dense monitoring.
The fitting challenge implemented in Dumusque et al. (2017), which focusses on stellar magnetic activity, defined several detectability criteria. We use similar criteria and terminology with a few modifications: 1) We decide whether there is a detection or not using a binary choice: since there is no further comparison with activity indicators, for example, there is no intermediate case; 2) False positives are counted separately for realisations with an injected planet and with no planet; 3) The identification of the planet in Dumusque et al. (2017) was based on whether the amplitude K and the period P corresponded to the injected planet, while here we use only the period as a criterion because, given the dispersion in mass, a criterion on the amplitude would be quite subjective; it can be used in a second step if necessary. The different categories are summarised in Table 2. False positives and missed or wrong planets can bias statistics on exoplanets: their effect is also indicated in Table 2. We also note that the detection criterion in Dumusque et al. (2017) was not the same in all methods as it depended on the fitting method, and may be different from ours.

Building of the data sets
The first step of the procedure consists of building the data sets. For each configuration (one spectral type, and one of the eight OGS configurations), we consider 200 realisations of the OGS signal and sampling. Computations are made for 1266 points and 1 M Earth unless otherwise noted (figures for other values and approaches, such as including a distribution in inclinations, are shown in Appendix C). Based on a random variable, on average, half of the realisations remain unchanged, while a planet is added to the other half. The planet has the following properties: the orbital period P orb is chosen randomly in the PHZ in to PHZ out range (i.e. we consider the habitable zone globally, using a uniform distribution), and the phase is chosen randomly in the [0, 2π] interval. The projected mass is equal to 1 M Earth (see discussion in Sect. 2.2.1) unless noted otherwise: these blind tests serve as our reference.
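The construction of one blind-test set can be sketched as below. The habitable-zone boundaries used as defaults (200 and 500 days) are placeholders and not values from the paper, and the dictionary layout and function name are hypothetical.

```python
import math
import random

def build_blind_test_set(n_real=200, phz_in=200.0, phz_out=500.0,
                         mass_earth=1.0, seed=0):
    """Decide, for each realisation, whether a planet is injected and with
    which parameters. phz_in and phz_out (days) are placeholder values."""
    rng = random.Random(seed)
    cases = []
    for _ in range(n_real):
        if rng.random() < 0.5:
            # Planet injected: uniform period over the habitable zone,
            # uniform phase in [0, 2*pi], fixed projected mass.
            cases.append({
                "planet": True,
                "period": rng.uniform(phz_in, phz_out),
                "phase": rng.uniform(0.0, 2.0 * math.pi),
                "mass": mass_earth,
            })
        else:
            cases.append({"planet": False})
    return cases
```

Each entry would then be combined with one realisation of the OGS signal and sampling; because the choice is random, the number of realisations with a planet is only close to half of the total.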

Analysis of the time series and detectability criteria
For each configuration (one spectral type, and one of the 8 OGS configurations), each of the 200 realisations of the time series are analysed as follows. The FAP at the 1% level is computed (using 200 realisations of the bootstrap, which we checked does not give very different results from a larger number of realisations). The periodogram of the time series is computed and the highest peak is identified (in the range 2-2000 d). If the amplitude of the peak is lower than the FAP, then we establish that there is no detection, whereas if the amplitude of the peak is higher, we consider this to be a detection. In this latter case, a sinusoid (we recall that we consider only circular orbits here) is fitted, with the period fixed to the peak period, to obtain the mass.
We note that the conditions are different from the theoretical results presented in Sect. 3. In Sect. 3, each computation focused on the behaviour at a given period (for example the middle of the habitable zone) and, therefore, on the power at this particular period or the mass corresponding to a fit at this period. Here, we address a different question, since we do not focus on a given period: we place ourselves in the position of an observer who does not know where the planet is injected, that is, we consider the whole 2-2000 day range and not only the habitable zone. The analysis can even lead to a (wrong) detection outside of the period range where the planet is injected. This can then induce a higher rate of false positives (unless the criterion used to make the detection is much higher than the true false positive level).
In a second independent step, we compare the results with the input parameters: this allows us to associate one of the categories of Table 2 to each realisation. The decision algorithm is shown in Fig. 15. To define whether a peak is attributable to the true planet or not, we use the criterion |P peak − P true | < 0.1P true to determine if the planet is the correct one (see next section).
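The assignment of a realisation to a category can be sketched as a small decision function. The category labels below are paraphrases of the situations described in the text (good detection, rejected true planet, missed planet, false positive), not the exact wording of Table 2, and the function name is hypothetical; only the FAP comparison and the 10% period criterion come from the text.

```python
def classify_realisation(peak_power, fap_level, p_peak, p_true=None, tol=0.1):
    """Assign one blind-test outcome.

    p_true is None when no planet was injected. The period is considered
    correct when |p_peak - p_true| < tol * p_true.
    """
    detected = peak_power > fap_level
    if p_true is None:
        return "false positive (no planet injected)" if detected else "good: no detection"
    good_period = abs(p_peak - p_true) < tol * p_true
    if detected and good_period:
        return "good: planet detected"
    if detected:
        return "false positive (wrong period)"
    if good_period:
        return "rejected planet (correct period, below FAP)"
    return "missed planet (wrong period, below FAP)"
```

For example, a peak above the FAP at 25 days for a planet injected at 300 days would be counted as a false positive even though a planet is present.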
A&A proofs: manuscript no. 38376_final
Table 2 notes. The peak significance is based on the FAP (bootstrap of the signal, 1% level). "Pfit=Ptrue" means that the difference between the two is lower than a certain threshold (see text). The colour code corresponds to Fig. 17. The bias on planet statistics can either be on the number of detected planets or on their properties (in particular the orbital period).

Results of the blind tests: Planet properties and detection rates
The outputs of each blind test are mainly the fitted planet parameters (when a planet is detected) and the percentages corresponding to the different categories defined in Table 2. We first focus on the period since a criterion on the period obtained during the analysis must be defined to assign each realisation to one of the categories. Figure 16 shows the distribution of the difference between the periods provided by the analysis and the true periods over all realisations (i.e. all OGS configurations and all realisations with an injected planet), independently of the significance of the peak. The realisations outside the central peak of this distribution correspond mostly to peaks found at low periods, with a maximum of the distribution in the 20-30 day range, as shown in the lower panel of Fig. 16: 95% of those peaks are at periods below the true orbital period and many of them are, in effect, below the FAP. In practice, the width of the peak varies with the period, and a threshold of 10% of the period allows us to separate the peak from outliers.
Table 3 shows a few examples of percentages, for G2 and K4 stars and a subset of OGS (GRA, ALL GRAhigh,SGmed , ALL GRAhigh,SGlow ). Ideally, we would like to obtain 100% on the first two lines and 0% on the other lines. The categories correspond to Table 2, some of them being regrouped. For example, the percentage of bad planet detections (i.e. the global false positive rate) corresponds to planets detected when none was injected, added to the planets detected with a wrong period. For G2 stars and granulation, the recovery rate is very good when no planet is injected but lower when there is an injected planet: most of the lost planets correspond to peaks below the FAP. The performance is much poorer for ALL GRAhigh,SGmed , with very low detection rates when a planet is injected and a high false positive level. Even for ALL GRAhigh,SGlow , the recovery rate when a planet is injected is only 35%.
For K4 stars, performance is perfect for granulation and very good for ALL GRAhigh,SGlow . For ALL GRAhigh,SGmed , the detection rate is only 67%, however. Figure 17 summarises the percentages for all configurations (1266 points, 1 M Earth ). The good recovery rates are shown on the left-hand side panels. When no planet is injected (black curves), they are very good for GRAhigh and GRAlow, and above 80-90% when added to SGlow. They are strongly degraded in other configurations, for all spectral types (and more so for high mass stars). The detection rates when a planet is injected (green curves) are good for all stars for GRAlow only, and for K stars and sometimes G stars for GRAhigh, SGlow, ALL GRAhigh,SGlow , and ALL GRAlow,SGlow (the threshold depends on the configuration), but are strongly degraded for all other cases. It could seem surprising that the performance is better when considering ALL GRAhigh,SGlow compared to SGlow alone (no injected planet): this is likely due to the fact that when adding the GRAhigh signal, even though the rms is increased, the power spectrum is then more similar to the GRAhigh power spectrum, which corresponds to good performance in the habitable zone.
The green dotted lines correspond to the detection rate obtained for the theoretical false positive level of 1% (Sect. 3), which is to be compared to the green solid line observed in the blind test. The two estimations are sometimes similar, corresponding to an FP that is very close to the FAP (Sect. 3), while in other cases, the blind test detection rates are lower than expected from the theoretical false positive level due to the difference between the FAP and the true FP. There is, therefore, a complex relationship between the theoretical results and the detection rates derived from the blind tests. We conclude that the FAP provides a detection rate which corresponds to a false positive level different from the one expected (i.e. different from 1% in our case).

Results of the blind tests: False positives
The right-hand panels in Fig. 17 show the bad recovery rates. When a planet is injected, the bad recoveries (dashed black line) are naturally the complement of the green curve from the left panels. They cover a wide variety of situations: they are sometimes dominated by missed planets (bad period and below the FAP, in blue) or by the rejection of the true planet (in orange), that is, globally, because peaks are below the FAP, and sometimes by cases where the highest peak is above the FAP but does not correspond to the planet (in brown, i.e. a false positive). We note that the false positive rate when a planet is injected (brown) is different from the false positive rate when no planet is injected (in red, complementary to the black curves on the left-hand side panels) for supergranulation (especially SGlow) alone, but it is similar when granulation and supergranulation are superposed.
For GRA, ALL GRAhigh,SGmed , and ALL GRAhigh,SGlow , the red and brown curves are similar, that is, the percentage of false positives is the same whether a planet (of 1 M Earth ) is injected or not. However, the situation is different for SGmed and SGlow because when no planet is injected, the percentage of false positives is the same, even though they have very different rms RV. This is due to the fact that here the comparison of the power is made with the FAP. Because the shape of the power spectrum is the same for SGmed and SGlow, and because the FAP values are scaled with the rms, both the power and the FAP increase from SGlow to SGmed in a similar manner and the percentage of false positives is similar. In the case of ALL GRAhigh,SGlow , the signal is dominated by GRAhigh, hence a level that is similar between GRAhigh and ALL GRAhigh,SGlow , while the situation is intermediate for ALL GRAhigh,SGmed . For GRAlow, the rate of false positives is very small in all cases. However, when added to supergranulation (either SGmed or SGlow), the latter dominates, and rates are very similar to those obtained when combining with GRAhigh, only slightly lower.
The level of false positives here may be large because our analysis is too simplistic. When a peak is detected above the FAP, we should test the robustness of the detection to determine, for example, whether the peak is stable or not. More sophisticated methods will have to be applied in this area in the future (see Sect. 6).
Another representation of these results is shown in Fig. 18, showing the percentage of false positives (sum of the two contributions described above) versus the detection rate (computed on the cases with an injected planet), which is similar to a ROC curve (but where each point corresponds to a spectral type). Each curve corresponds to one of the OGS configurations. Ideally, we would like points to be in the lower right corner. Points at the top have a high false positive level and points on the left correspond to poor detection rates. If we compare the global level of false positives here and the rms for each type of OGS configuration, we see that there is no direct correspondence, because a granulation-like signal provides better performance due to its more suitable power spectrum (for a given rms). High-mass stars are to the left of each curve and lead to high rates of false positives and low detection rates, except for granulation alone (for GRAlow all points are in the lower right corner) and, to a lesser extent, ALL GRAhigh,SGlow . We also note that the highest level of false positives is obtained for SG alone. However, when granulation is added to supergranulation, the rms increases, but the level of false positives decreases because the shape of the power spectrum is closer to the granulation shape, leading to better performance. This explains why the level of false positives is higher when SGmed and GRAlow are superimposed (dashed green curve) compared to SGmed and GRAhigh (solid green curve), that is, closer to the SG behaviour (large false positive rates) even though the rms is lower.

Additional configurations
Additional configurations are tested in Appendix C.1 (180 points only) and C.2 (2 M Earth ). The performance for 180 points is very poor. The level of false positives is quite low, which can be explained by the results shown in Sect. 4.1: here, the FAP overestimates the true false positive level and, therefore, there are few peaks above the FAP. The detection rates are very low, however. On the other hand, the performance is much better for a 2 M Earth planet compared to a 1 M Earth , although it is not perfect in all cases: for F and early G stars, the detection rates reach values below 50% when supergranulation is high.
We also implemented a similar blind test, but in which 1 M Earth or 2 M Earth is the true planet mass. We assume that the orbital plane is similar to the equatorial plane and take the distribution of stellar inclination into account. We expect slightly lower detection rates than before (for cases with injected planets), which is indeed observed, as shown in Appendix C.3 and C.4. Figure 19 shows the average of the rates over all spectral types for each OGS configuration, without taking inclination into account (previous results) and, conversely, taking it into account. The detection rates are slightly lower when considering inclination (i.e. the true mass), typically by a difference of about 12-13 percentage points. The difference is mostly due to the larger number of missed planets when the mass is the projected mass only.

Corresponding LPA limits
Finally, we compute the LPA detection limits (see Sect. 4.2 for the definition): with an injected planet with a mass of 1 M Earth , we want the LPA detection limit (M lpa ) to be higher than 1 M Earth . We compute ten values of M lpa over the habitable zone, which are then averaged together for each spectral type. We derive the average M lpa and the percentage of realisations where M lpa is higher than 1 M Earth . In all cases, M lpa is indeed above 1 M Earth , and the percentage is above 70%, which is in agreement with expectation. When no planet is injected, on the other hand, we want M lpa to be as low as possible. For SGmed and ALL GRAhigh,SGmed , the values are above 1 M Earth for F6-G8 stars, so that in those cases, the exclusion of the presence of low mass planets (below 1 M Earth ) is not possible. This is strongly related to the performance in terms of detection rates described above. For all other configurations (OGS, spectral types), they are always below 1 M Earth . We conclude that the LPA provides results that are consistent with the presence of the injected planet.

Comparison of the detection rates with the K/N criterion
In this section, we compute the K/N criterion proposed in Dumusque et al. (2017) and defined as K pl √(N obs) / RV rms , where K pl is the amplitude of a planetary signal in RV (for a given mass, period, host star), N obs is the number of observations, and RV rms is the RV jitter. K/N is used by Dumusque et al. (2017) as a criterion to estimate the quality of recovery rates. Therefore, we compute this practical criterion for a 1 M Earth planet and compare it to the detection rates obtained previously for the same planet mass (cases with injected planet). The results are shown in Fig. 20. We find a very clear relationship between the two: all OGS configurations and spectral types lie along the same curve with very little dispersion, so the criterion is adequate to describe the detection rate in these conditions. Detection rates better than 50% correspond to K/N above ∼7, and K/N must be above ∼9 to reach detection rates better than 95%. This is very similar to the rough threshold between bad recoveries and good recoveries of ∼7.5 in Dumusque et al. (2017), who focused on magnetic activity. On the other hand, there is not a one-to-one relationship between this criterion and the false positives, as the different OGS configurations correspond to different levels, as shown in the lower panel of Fig. 20.
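The criterion is straightforward to evaluate, as in the sketch below. The semi-amplitude uses the standard expression for a circular, edge-on orbit with a planet mass much smaller than the stellar mass; the function names and the example rms value are illustrative and not taken from the paper.

```python
import math

def rv_semi_amplitude(mass_earth, period_days, mstar_sun):
    """RV semi-amplitude in m/s (circular, edge-on orbit, planet mass much
    smaller than the stellar mass): about 0.09 m/s for the Earth-Sun case."""
    return (0.0895 * mass_earth
            * mstar_sun ** (-2.0 / 3.0)
            * (period_days / 365.25) ** (-1.0 / 3.0))

def kn_criterion(mass_earth, period_days, mstar_sun, n_obs, rv_rms):
    """K/N criterion: K_pl * sqrt(N_obs) / RV_rms (dimensionless)."""
    k_pl = rv_semi_amplitude(mass_earth, period_days, mstar_sun)
    return k_pl * math.sqrt(n_obs) / rv_rms

# Illustrative call: 1 M_Earth at one year around a solar-mass star,
# 1266 points, and an assumed (not measured) jitter of 0.4 m/s.
example = kn_criterion(1.0, 365.25, 1.0, 1266, 0.4)
```

Since the criterion grows as the square root of N obs, multiplying the number of points by four doubles K/N for a fixed jitter.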
Although the curve for a given number of points, N obs , and the mass are well-defined, it is, in fact, very dependent on the configurations. For example, for a lower number of points (see Fig. C.3 in Appendix C.1 for 180 points) the curve is very different: the curve is also well-defined, but for a similar K/N, the detection rates are lower than for 1266 points. The same is observed for 2 M Earth , with the 50% level reached at lower K/N values compared to a 1 M Earth planet. Thus, the criterion is not universal. We then consider the performance as a function of the number of points in Sect. 5.

Effect of the sampling
In this last section, we focus on the effects of the sampling. We first summarise the dependence of the performance obtained in Sects. 3 and 4 on N obs . Then we test the effect of the sampling in a limited number of cases: regular sampling instead of random, a duration limited to three years instead of ten years, and data binning. Figure 21 summarises the performance obtained in the previous sections for G2 stars, PHZ med , and ALL GRAhigh,SGlow versus the number of points. Below 500 points, curves obtained with the theoretical false positive threshold in mass, detection limits, and mass characterisation are not very different from a 1/√N obs dependence. However, above 500 points (and for all values for the fap/fp ratio), they decrease more slowly than the 1/√N obs law. This behaviour is, therefore, important to take into account when optimising the observing time.

Summary of the effect of the number of points
The uncertainty on the mass, for example, appears to saturate at high N obs . On the other hand, detection limits (upper right panel) vary strongly with N obs and do not follow a 1/√N obs law. The same is true for the detection rates in the blind tests. However, increasing the number of points may also increase the level of false positives (when no planet is injected).

Regular vs. random sampling
In the previous sections, we considered a random sampling during the period of observations. We now consider the effect of this choice by testing the performance of a regular sampling in a few cases (G2 and K4 stars) for the blind tests and over all spectral types for the mass characterisation. This test is done as in Sect. 4.4, that is, with 1266 points over ten years, and GRAhigh. We find that the mass uncertainties are extremely similar to those obtained with the random sampling. The blind tests show that the detection rates when a planet is injected are also very similar, with the random sampling providing slightly better detection rates. However, when no planet is injected, the regular sampling provides better false positive rates for certain OGS signals (SG alone and ALL GRAhigh,SGmed ), while they are very similar for GRAhigh alone and ALL GRAhigh,SGlow . We conclude that in the future, depending on the observational constraints and the type of signal, both types of sampling must be tested to decide which one provides the best performance.
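The two sampling schemes compared here can be sketched as follows. This is a minimal illustration under explicit assumptions: the yearly gap is taken as 120 days placed at the end of each year, random epochs are drawn uniformly within the observable seasons, and regular epochs are evenly spaced in observable time. The function name and these choices are hypothetical and do not reproduce the exact procedure used for the simulations.

```python
import random

YEAR = 365.25  # days


def sample_times(n_obs, n_years=10, gap_days=120.0, kind="random", seed=0):
    """Return n_obs sorted epochs (days) avoiding a yearly gap (assumed layout)."""
    season = YEAR - gap_days   # observable days per year
    total = n_years * season   # total observable time
    if kind == "regular":
        u = [(k + 0.5) * total / n_obs for k in range(n_obs)]
    elif kind == "random":
        rng = random.Random(seed)
        u = sorted(rng.uniform(0.0, total) for _ in range(n_obs))
    else:
        raise ValueError("kind must be 'regular' or 'random'")
    times = []
    for x in u:
        # Map observable time onto calendar time by re-inserting the gaps.
        year = min(int(x // season), n_years - 1)
        times.append(year * YEAR + (x - year * season))
    return times


t_random = sample_times(1266, kind="random")
t_regular = sample_times(1266, kind="regular")
```

Both calls return 1266 epochs over ten years; the same stellar time series can then be interpolated or evaluated at each set of epochs to compare the two samplings on equal terms.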

Temporal coverage
In this work, we observed that high values of N obs were necessary to obtain good performance, and we tested only a long duration (ten years). In this section, we estimate the performance in a few cases if only three years of data are available, both for the blind tests (detectability) and the mass characterisation. We keep the four-month gap every year (except for the highest value of N obs , 1095) and consider the following numbers of points with this gap: 180 (to be compared with the same number of points spread over ten years), 284, 384 (to be compared with an N obs of 1266 in the previous simulations because it corresponds to the same density of points), 486, 588, and 690. We consider all spectral types. The figures are shown in Appendices B.1 and C.5. Figure 22 shows a comparison of the mass uncertainty between a few ten-year and three-year coverage configurations for GRAhigh and ALL GRAhigh,SGlow . For 180 points in both coverages, the performance is similar for GRAhigh, but when supergranulation is added it is worse for the three-year coverage than for the ten-year coverage. When N obs increases, the differences remain when supergranulation is added. The same behaviour is observed for 542 points over ten years and 588 over three years (with a similar number of points). It is, for example, more efficient to obtain 904 observations over ten years than 1095 over three years in this case. We conclude that for granulation alone, the temporal coverage is not a critical choice, but longer time series provide better performance when considering supergranulation. Figure B.2 also shows the number of points necessary to reach a 20% uncertainty on the mass: in most cases, when supergranulation is included, 1095 is a lower limit, that is, it is not possible to reach such a level for a 1 M Earth planet; saturation is present only for SGmed for 2 M Earth . With granulation alone, it is possible to reach 20% for a 1 M Earth planet in most cases.
The blind tests were carried out for 384 observations over the three years. Compared to the 1266 points over 10 years, the detection rates are significantly lower, although the false positive rates are not much affected. The relationship between K/N and the detection rate is also shifted compared to Sect. 4.4.

Temporal binning
We compare the performance obtained after binning the time series in 30-day bins with the preceding results. The objective is mostly to test whether binning the signal over several days to average out supergranulation is efficient. Since we are interested in long orbital periods, such a binning should not, a priori, affect the planetary signal very much. The protocol is otherwise similar to the one described in Sect. 4.3 for the mass characterisation in transit follow-up and in Sect. 4.4 for the blind tests (1 M Earth , 1266 points). The figures are shown in Appendices B.2 and C.6. The mass characterisation is not improved by the binning: depending on the configuration, it is similar to the no-binning results or worse. The number of observations necessary to reach a precision of 20% on the mass is higher than without binning. The blind tests show that when no planet is injected, the performance in terms of good recovery is slightly better than with no binning. However, when a planet is injected, the performance is worse. The level of false positives is very low. We conclude that such a binning does not significantly help to improve the detectability performance.
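A 30-day binning of an irregularly sampled RV series can be sketched as follows. This is a minimal illustration, assuming bins that start at the first observation, an unweighted mean of the epochs and velocities in each bin, and empty bins (for instance during the yearly gap) being dropped; the function name and these conventions are assumptions, not a description of the exact implementation used here.

```python
from collections import defaultdict


def bin_time_series(times, rvs, bin_days=30.0):
    """Average epochs and RVs in consecutive bins of width bin_days."""
    t0 = min(times)
    bins = defaultdict(list)
    for t, rv in zip(times, rvs):
        bins[int((t - t0) // bin_days)].append((t, rv))
    t_binned, rv_binned = [], []
    for index in sorted(bins):  # only non-empty bins exist in the dict
        points = bins[index]
        t_binned.append(sum(p[0] for p in points) / len(points))
        rv_binned.append(sum(p[1] for p in points) / len(points))
    return t_binned, rv_binned


# Five epochs (days) falling into three non-empty 30-day bins.
t_b, rv_b = bin_time_series([0.0, 10.0, 20.0, 40.0, 100.0],
                            [1.0, 2.0, 3.0, 4.0, 5.0])
```

In this small example the first three points are merged into one, so the binned series has three points. Binning 1266 points in the same way strongly reduces the number of points entering the analysis.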

Conclusion
In this paper, we study in detail the effect of granulation and supergranulation on Earth-mass planet mass characterisation and detectability for stars between F6 and K4, for different numbers of points. The two strong advantages of our approach are the use of a large set of time series due to these flows and a systematic analysis of their impact on the performance in terms of false positives, detection rates, detection limits, and mass characterisation. This work is based on several assumptions, which we recall here: 1) The shape of the power spectrum is similar to what we found in Meunier et al. (2015), although we test different granulation and supergranulation levels (the power at long orbital period depending on the rms of the signal and the timescale, which is fixed here), and the supergranulation amplitude versus spectral type follows the granulation dependence on spectral type; 2) We do not add any other signal (magnetic activity, instrumental, photon noise...) except for planets; 3) We focus on long orbital periods in the habitable zone around these stars; 4) No correction technique is applied except for the one-hour binning and the test involving a 30-day binning.
Our main conclusions, noted here and detailed below, are: 1) Both granulation and supergranulation affect the detection rates and the false positive levels, but supergranulation plays the main role; 2) Different tools give different results because they are based on different assumptions (mainly on the false positive definition) and should be used with caution (e.g. FAP computed from a bootstrap analysis).
Our results can be summarised as follows. The presence of granulation and supergranulation affects mass characterisation in RV when performing a follow-up of a transit detection. The uncertainties on these masses are sometimes below 20% for a 1 M Earth planet (mostly for granulation alone or for low-mass stars), but they are much larger in certain configurations (supergranulation, high-mass stars). This contribution is, therefore, important to consider when performing mass characterisations.
We estimated detection rates and detection limits corresponding to a good detection rate using theoretical levels of false positives (i.e. assuming a perfect knowledge of the signal). Except when the temporal window is not very good (for example, for periods close to one year), the frequential analysis (periodogram analysis) leads to better detection rates than the temporal analysis (fit of the planetary signal). The performance is poor for a large fraction of our configurations and always requires a large number of points. Granulation alone or added to low levels of supergranulation leads to good detection rates (although a very high number of points is required for F stars), but the performance is very poor for the median level of supergranulation.
When adopting the point of view of an observer (i.e. without knowing whether any contribution other than the stellar signal is present), we found that the FAP (obtained with a standard bootstrap analysis of the observed time series) does not provide the true false positive level: apart from GRA and SGlow (for which it always overestimates the true level), it overestimates the true level for a low number of points (meaning a conservative detection) and underestimates it when the number of points is large (with the risk of false positives). Current surveys are in the regime of a low number of points (the FAP estimate is, therefore, conservative), but future observations using a large N obs to improve the detection rates are likely to be more sensitive to an underestimation of the FAP. We also characterised the exclusion rates associated with the LPA detection limits (Meunier et al. 2012) when applied to this type of signal, showing that the threshold used in previous works corresponds to a median exclusion rate of 83% (masses should be increased by about 20% to correspond to 99%). This should be kept in mind when using them to compute occurrence rates.
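A bootstrap estimate of the periodogram power associated with a given FAP can be sketched as follows. This is a minimal illustration, assuming a classical Lomb-Scargle periodogram, a bootstrap implemented as random permutations of the RV values at fixed epochs, and a 1% FAP level; the function names, the number of realisations, and the synthetic test signal are assumptions, not the exact procedure applied in this paper.

```python
import math
import random


def lomb_scargle(times, values, freqs):
    """Classical Lomb-Scargle power, normalised by twice the sample variance."""
    n = len(values)
    mean = sum(values) / n
    y = [v - mean for v in values]
    var = sum(v * v for v in y) / (n - 1)
    powers = []
    for f in freqs:
        w = 2.0 * math.pi * f
        # Time offset tau makes the sine and cosine terms orthogonal.
        tau = math.atan2(sum(math.sin(2 * w * t) for t in times),
                         sum(math.cos(2 * w * t) for t in times)) / (2 * w)
        c = [math.cos(w * (t - tau)) for t in times]
        s = [math.sin(w * (t - tau)) for t in times]
        yc = sum(a * b for a, b in zip(y, c))
        ys = sum(a * b for a, b in zip(y, s))
        powers.append((yc ** 2 / sum(a * a for a in c)
                       + ys ** 2 / sum(a * a for a in s)) / (2.0 * var))
    return powers


def bootstrap_power_threshold(times, values, freqs, fap=0.01, n_boot=200, seed=0):
    """Power exceeded by the highest peak in a fraction ~fap of shuffled series."""
    rng = random.Random(seed)
    shuffled = list(values)
    maxima = []
    for _ in range(n_boot):
        rng.shuffle(shuffled)  # permutation keeps the values, breaks the timing
        maxima.append(max(lomb_scargle(times, shuffled, freqs)))
    maxima.sort()
    index = min(n_boot - 1, int(math.ceil((1.0 - fap) * n_boot)) - 1)
    return maxima[index]


# Synthetic example: a 100-day sinusoid plus white noise (assumed values).
rng = random.Random(1)
times = sorted(rng.uniform(0.0, 1000.0) for _ in range(60))
values = [math.sin(2 * math.pi * t / 100.0) + rng.gauss(0.0, 0.1) for t in times]
freqs = [k / 1000.0 for k in range(1, 51)]  # cycles per day
powers = lomb_scargle(times, values, freqs)
threshold = bootstrap_power_threshold(times, values, freqs)
peak_freq = freqs[powers.index(max(powers))]
```

Shuffling destroys any temporal correlation in the series, so a threshold obtained in this way implicitly treats the stellar signal as white noise. This is one possible reason why such a FAP may differ from the true false positive level for correlated signals such as those studied here.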
Finally, we performed several blind tests corresponding to different conditions in terms of planet mass, number of points, and different sampling issues (binning, duration...). As for the theoretical approach, the performance both in terms of detection rates and false positives is poor for F and G stars, whereas it is good for K stars. These rates also strongly depend on the number of points, and we find that the detection rate as a function of the K/N criterion (Dumusque et al. 2017) follows a single curve for all OGS configurations for a given number of points, but not when considering different numbers of points: the performance fortunately increases faster than √N obs . An important result from the blind tests comes from the comparison between the detection rates and false positives in our various configurations: we find that for most stars, the detection rates are well below 100% and always associated with a high level of false positives. The blind tests we implemented used a simple analysis method, that is, one based only on the FAP, given that we lack 'activity' indicators for this type of signal, in contrast to the case of magnetic activity (see below). As a consequence, to improve this performance, future works will need to concentrate on both aspects. The present paper is focussed on estimating the performance across a wide variety of configurations, but without using mitigating techniques, which have yet to be developed.
Some approaches in the literature may help to decrease the number of false positives. Periodogram standardisation may help to better define the false positive level, as discussed, for example, by Sulis et al. (2016); Sulis et al. (2017a). Stacked periodograms, as proposed by Mortier & Collier Cameron (2017), may also aid in this purpose. However, it remains to be seen whether these methods allow us to increase the detection rate, that is, to recover missed planets (although the second one may help to a certain extent with regard to planet peaks that are not too far below the FAP). Improving the detection rates will, however, require the development of new methods. Gaussian processes, which may be well suited to describing this type of signal due to their flexibility, may also absorb planets at long orbital periods: this will have to be checked with similar simulations. One difficulty arises from the fact that the usual activity indicators cannot be used (e.g. the log R ′ HK ). We do not expect a correlation with photometry (which is not often simultaneous with the RV, anyway) from the simulations of Meunier et al. (2015) due to the high stochasticity of the granulation signal, and it is not present for supergranulation (Meunier et al. 2007). There may be a small correlation with the bisector shape variation for granulation (but its use when superposed on the bisector variations due to other processes may be limited); however, we do not expect any for supergranulation because it involves relatively large-scale flows (little dependence on line depth expected), which are relatively symmetric across the disk (no strong effect as there would be, e.g., for a spot crossing the disk). However, this aspect has not yet been measured nor simulated, so it remains to be checked in future studies.

The dashed lines correspond to the configurations including GRAlow (same colour code as in Fig. 1). Stars indicate that even with our largest number of points the uncertainties are in fact higher than 20% (lower limit for N obs ). Diamonds indicate that even with 180 points the uncertainties are in fact lower than 20% (upper limit for N obs ).

Fig. 19. Comparison of average rates for 1 M Earth (black) and 2 M Earth (green), and without taking projection into account (solid lines, the mass is the apparent mass) and taking inclination into account (dashed lines, the mass is the true mass). The number associated with each OGS configuration corresponds to the order of the plots in Fig. 17 (from top to bottom, i.e. GRA high is number 1, SG med is number 2 and so on). The detection rate plot corresponds to the green curves in the left panels in Fig. 17, the wrong planet rate plot to the brown curves, the rejected planet rate plot to the orange curves, and the missed planet rate plot to the blue curves in the right panels in Fig. 17.

Fig. 21. Effect of N obs on performance studied in Sects. 2, 3, and 4 for G2 stars, PHZ med and ALL GRAhigh,SGlow . The different panels represent: fp M from Sect. 2.3; detection rates using the true false positive level in power (black line) and mass (red line) from Sect. 3.1; true detection limits in power (black line for 50% detection rate, red line for 95% detection rate) and in mass (green line for 50% detection rate, blue line for 95% detection rate) from Sect. 3.2; fap/fp P from Sect. 4.1; average LPA detection limit from Sect. 4.2; 1σ uncertainty on the mass characterisation from Sect. 4.3 (black for 1 M Earth and red for 2 M Earth ); detection rate from the blind test in Sect. 4.4 with planet injected (green) and good recovery when no planet is injected (black); false positives when a planet is injected (dashed black line) and no planet is injected (red) from the same blind tests. The dotted lines correspond to what would be obtained if the variability was following a N obs ^−0.5 law (N obs ^0.5 in the case of the detection rate), scaled to the values at 180 days.