

Appendix: Non-parametric treatment of the data

To decrease the noise and allow a tractable use of the information present in small data samples, heavy smoothing techniques are often required. A common practice consists of converting a set of discrete positions into binned "counts''. Binning is a crude form of smoothing, and many studies in statistical analysis have shown that, unless the smoothing is done in an optimum way, some, or even most, of the information content of the data may be lost. This is especially true when a large amount of smoothing is necessary, which then changes the "shape'' of the resulting function. In statistical terms, the smoothing process not only decreases the noise (i.e., the variance of the estimated function), but at the same time introduces a bias in the estimate of the function.

The variance-bias trade-off is unavoidable but, for a given level of variance, the increase in bias can be minimized. The correct way of achieving this is provided by the so-called non-parametric density estimation methods for the determination of the "frequency'' function of a given parameter, or by the non-parametric regression methods for the determination of a smooth function g inferred from direct measurements of g itself. Moreover, adaptive non-parametric methods are designed to filter the data in some local, objective way that minimizes the bias, in order to obtain the desired smooth function without constraining its form in any way. The theory and algorithms related to these methods, originally built to handle ill-conditioned problems (either under-determined or extremely sensitive to errors or incompleteness in the data), are widely discussed in the specialized literature and summarized in easy-to-read textbooks (e.g., Silverman 1986; Härdle 1990; Scott 1992).

The simplest of the available algorithms is the kernel estimator, leading to the following form of the normalized "frequency'' function

\begin{displaymath}\hat f_K(x) = \frac{1}{N h}\sum_{n=1}^N K\left(\frac{x-X_n}{h}\right)
\end{displaymath} (19)

with the normalization

\begin{displaymath}\int_{-\infty}^{\infty} K(u)\;{\rm d}u = 1,
\end{displaymath} (20)

where $X_n$ $(n=1,\ldots,N)$ are the N available data points. Each data point is thus simply replaced by a "bump''. The kernel function K determines the shape of the bumps, while the bandwidth h (also called the smoothing parameter) determines their width.
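
To make Eqs. (19) and (20) concrete, a minimal Python sketch of the fixed-bandwidth estimator could look as follows; the Gaussian kernel and the synthetic data are placeholders chosen for illustration, not the actual code or data used in this work.

\begin{verbatim}
import numpy as np

def kernel_density(x, data, h, kernel):
    # Eq. (19): f_K(x) = (1/(N h)) * sum_n K((x - X_n)/h)
    u = (np.asarray(x)[:, None] - np.asarray(data)[None, :]) / h
    return kernel(u).sum(axis=1) / (len(data) * h)

def gaussian(u):
    # Gaussian kernel, normalized so that Eq. (20) holds
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

# Hypothetical data: 60 synthetic "masses", for illustration only
rng = np.random.default_rng(0)
masses = rng.lognormal(mean=0.5, sigma=0.6, size=60)
grid = np.linspace(0.0, 12.0, 200)
f_hat = kernel_density(grid, masses, h=1.0, kernel=gaussian)
\end{verbatim}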

In the adaptive kernel version, a local bandwidth $h_n=h(X_n,f)$ is defined and used in Eq. (19). In order to follow the "true'' underlying function as closely as possible, the amount of smoothing should be small where f is large, whereas more smoothing is needed where f takes lower values. A convenient way to do so consists in first deriving a pilot estimate $\tilde f$ of f, e.g. by using a histogram or a kernel with fixed bandwidth $h_{\rm opt}$, and then defining the local bandwidths

\begin{displaymath}h_n=h(X_n)=h_{\rm opt}[\tilde f(X_n)/s]^{-\alpha},
\end{displaymath} (21)

where

\begin{displaymath}\log{s}=\frac{1}{N}\sum_{n=1}^N{\log{\tilde f(X_n)}}.
\end{displaymath} (22)

It may be shown (Abramson 1982) that $\alpha=1/2$ leads to a bias of smaller order than for the fixed-bandwidth estimate.
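
The adaptive scheme of Eqs. (21) and (22) then amounts to evaluating the pilot estimate at the data points and rescaling the bandwidth locally. Continuing the sketch above (and reusing its hypothetical kernel_density helper), this might read:

\begin{verbatim}
def adaptive_bandwidths(data, h_opt, kernel, alpha=0.5):
    # Pilot estimate f~ evaluated at the data points themselves
    f_pilot = kernel_density(data, data, h_opt, kernel)
    s = np.exp(np.mean(np.log(f_pilot)))        # Eq. (22)
    return h_opt * (f_pilot / s) ** (-alpha)    # Eq. (21), alpha = 1/2

def adaptive_density(x, data, h_n, kernel):
    # Eq. (19) with h replaced by the local bandwidths h_n
    u = (np.asarray(x)[:, None] - np.asarray(data)[None, :]) / h_n[None, :]
    return (kernel(u) / h_n[None, :]).sum(axis=1) / len(data)

h_n = adaptive_bandwidths(masses, h_opt=1.0, kernel=gaussian)
f_adapt = adaptive_density(grid, masses, h_n, gaussian)
\end{verbatim}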

The optimum kernel K may be taken as the one minimizing the mean integrated square error (MISE) between f and $\hat f$, where

\begin{displaymath}{\rm MISE}(\hat{f}) = E\int\left[ \hat{f}(x) - f(x) \right]^2 {\rm d}x
\end{displaymath} (23)

is usually taken as an indicator of efficiency. In the above expression, E denotes the statistical expectation. This optimum kernel is the so-called Epanechnikov ($K_{\rm e}$) or quadratic kernel:

\begin{displaymath}K_{\rm e}(u) = \frac{3}{4}(1-u^2),\ \ \ \vert u\vert<1.
\end{displaymath} (24)

Other choices of K differ only slightly in their asymptotic efficiencies and may be better suited to particular purposes, e.g., for two-dimensional data.
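
In the sketch given earlier, the Epanechnikov kernel of Eq. (24) would simply replace the Gaussian placeholder, e.g.:

\begin{verbatim}
def epanechnikov(u):
    # Eq. (24): K_e(u) = (3/4)(1 - u^2) for |u| < 1, zero elsewhere
    return np.where(np.abs(u) < 1.0, 0.75 * (1.0 - u**2), 0.0)

f_hat_e = kernel_density(grid, masses, h=1.0, kernel=epanechnikov)
\end{verbatim}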

The pilot smoothing length $h_{\rm opt}$ is the only subjective parameter required by the method. It relates to the quality of the sampling of the variable under consideration. There are several ways of automatically estimating an optimum value of $h_{\rm opt}$ (see e.g. Silverman 1986 for an extensive review). A simple approach based on the data variance gives in our case $h_{\rm opt}\simeq 1.0\,M_{\rm J}$. Since the derivative of the frequency function, rather than the function itself, is actually used in Eq. (6), a larger pilot smoothing length ($2\,h_{\rm opt}$) was also considered in order to remove spurious small-statistics fluctuations of the density estimate.
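
As an illustration of such an automatic choice, Silverman's (1986) variance-based rule of thumb is sketched below; whether the value $h_{\rm opt}\simeq 1.0\,M_{\rm J}$ quoted above derives from exactly this formula is an assumption.

\begin{verbatim}
def rule_of_thumb_bandwidth(data):
    # Silverman (1986) variance-based rule: h = 1.06 * sigma * N^(-1/5)
    sigma = np.std(data, ddof=1)
    return 1.06 * sigma * len(data) ** (-0.2)

h_opt = rule_of_thumb_bandwidth(masses)
h_pilot = 2.0 * h_opt  # larger pilot length when the derivative is used, cf. Eq. (6)
\end{verbatim}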

