Free Access
Volume 601, May 2017
Article Number A40
Number of page(s) 13
Section Cosmology (including clusters of galaxies)
Published online 26 April 2017

© ESO, 2017

1. Introduction

One of the first statistics applied to study the galaxy clustering in the very early galaxy catalogs (that were built on the projected celestial sphere) was the counts-in-cells (CiC) method. Hubble (1934) was the first to notice that the distribution of galaxy counts in two-dimensional cells could be well approximated by a lognormal distribution. This technique permits us to describe the spatial distribution of galaxies in a way that is complementary to other descriptors of the galaxy clustering such as the correlation function or the power spectrum (White 1979; Peebles 1980). The count probability distribution function fV(N) gives the probability that a randomly placed volume in the universe will contain exactly N galaxies. For N = 0, this function is known as the void-probability function (Maurogordato & Lachieze-Rey 1987) and it is of particular interest, since is related with higher order correlation functions (White 1979) and provides a simple approach to establishing hierarchical scaling relations (Balian & Schaeffer 1989; Croton et al. 2004). Fry & Gaztanaga (1994) have shown that correlation functions can be measured from the moments of fV(N). Several distribution functions have been used to model the counts-in-cells galaxy distribution (Yang & Saslaw 2011; Bel et al. 2016) and the counts-in-cells matter distribution (Clerkin et al. 2017), or used this distribution to estimate galaxy-dark matter bias (Sigad et al. 2000; Marinoni et al. 2005; Kovač et al. 2011; Di Porto et al. 2016).

However, the count probability distribution function is still missing a model that fits the observed distribution of galaxies well in the three dimensional space. This is the goal of the present paper. We plan to measure the galaxy counts-in-cells distribution on the New York University – Value Added Galaxy Catalog Blanton et al. (2005) and to assess what is the statistical distribution that fits best the observational frequency distribution. This blind fit of the observational galaxy CiC distribution will be accompanied of a model selection analysis, which will help us to determine the best fitting distribution.

The paper is organized as follows. In Sect. 2, we describe the data from SDSS Blanton et al. (2005) used for the analysis. In Sect. 3, we explain the procedure used to measure the observed counts-in-cells (CiC) distribution fN(V), and describe how we have estimated the errors. In Sect. 4, we introduce different statistical distribution functions to model the observed galaxy CiC distribution. In Sect. 5, we present the observed results and its comparison with the statistical models introduced in Sect. 4. We summarize our conclusions in Sect. 6. In the Appendix A we use mock catalogs extracted from the LasDamas simulations McBride et al. (2011) to assess the significance of the internal error estimates. We use the same values for the cosmological parameters as were used by LasDamas simulation (McBride et al. 2011) in preparation for the error analysis of Appendix A: Ωm = 0.25, Ωk = 0.0, ΩΛ = 0.75 and H0 = 100 h km s-1 Mpc-1.

2. Data catalogs

The data analyzed in this work is a galaxy sample from the main catalog of the Sloan Digital Sky Survey (SDSS) Data Release 7 (DR7, Abazajian et al. 2009). This catalog is provided by The New York University – Value Added Galaxy Catalog (Blanton et al. 2005), presented in the next section. In addition to this, we also make use of LasDamas simulation catalog (McBride et al. 2011) to estimate the uncertainties in fV(N).

2.1. The SDSS – New York University – Value Added Galaxy Catalog

The NYU-VAGC (Blanton et al. 2005) is composed of data from the Sloan Digital Sky Survey DR7 (Abazajian et al. 2009) and the 2-Micron All-Sky Survey (2MASS) (Skrutskie et al. 1997), though here we make use of only the former. The SDSS-DR7 mapped one quarter of the entire sky and performed a redshift survey of galaxies, quasars and stars. It consists of a series of three interlocking imaging and spectroscopic surveys, carried out over an eight-year period with a dedicated 2.5 m telescope located at Apache Point Observatory in southern New Mexico.

The NYU-VAGC survey provides us with the position and redshift of more than 550 000 galaxies with corrected extinction and K-corrected absolute magnitudes for eight bands, of which the u, g, r, i, and z bands come from the SDSS. In addition, NYU-VAGC catalog also contains a survey geometry catalog, which define the window of the galaxy population and can be operated using the software MANGLE. This survey includes carefully constructed large-scale structure samples useful for calculating power spectra, correlation functions, for example.

In this work we have used two different samples selected in the r band, corresponding to the DR72 catalogs of the SDSS (Abazajian et al. 2009) included in NYU-VAGC with an apparent r-band magnitude limit of 17.6. These two samples are located in the same angular region. In this work we reproduce and compare our analysis with data from the LasDamas simulations (McBride et al. 2011), and for this reason we selected comparable datasets. Our first population is at low redshift, between 0.05 and 0.106. The low redshift limit ensures that the sample is within the Hubble flow, and excludes the Coma and Virgo clusters. We also took a second sample with redshifts between 0.075 and 0.165. To determine a suitable absolute magnitude cut, we define a faint limit Mr where the limiting magnitude has been reached to ensure luminosity completeness. in this way, we can assure a similar comoving number density of galaxies within the redshift limits. These limits are Mr< −20 for the first population and Mr< −21 for the second population. The samples are summarized in Table 1.

Table 1

SDSS selected samples.

2.2. LasDamas simulations

To obtain reliable estimations of the CiC distribution error bars, we will make use of the multiple realizations of the Large Suite of Dark Matter Simulations (LasDamas) McBride et al. (2011), Swanson (Accessed: 2016-03-25), a project that ran a large suite of cosmological N-body simulations that follow the evolution of dark matter in the universe. Results provide us an adequate resolution in many large boxes, rather than a single realization at high resolution. The enormous volume of generated data is appropriate for statistical studies of the distribution of galaxies and halos. The LasDamas simulations are designed to model the clustering of Sloan Digital Sky Survey (SDSS) galaxies in a wide luminosity range, with the goal of assisting in the modeling of galaxy clustering measurements. Specifically, the simulations are used to construct detailed mock galaxy catalogs by placing artificial galaxies inside dark matter halos using the Halo Occupation Distribution (HOD) with parameters fitted to reproduce the galaxy number density and projected correlation function of the respective SDSS galaxy samples. The HOD describes the distribution of galaxies within the dark matter halos. It uses three related properties of the halo model: the probability distribution relating the mass of a dark matter halo to the number of galaxies that form within that halo, the distribution in space of galactic matter within a dark matter halo and the distribution of velocities of galaxies relative to the dark matter within the halo. LasDamas is also designed to reproduce the SDSS DR7 geometry. Altogether, we can see that LasDamas simulations are specially adequate for NYU-VAGC comparison.

Data used consist of two different sets of mocks with different properties, matched to our SDSS samples. They are summarized in Table 2. Simulation Esmeralda contains 120 realizations and is started at an initial redshift of z = 99; while Carmen contains 164 realizations and is started at z = 49. Mocks are generated with the same initial power spectrum, but a different random seed.

Table 2

LasDamas selected mock catalogs.

As mentioned above, we have used LasDamas simulations to estimate the error bars in the CiC distribution. A brief analysis on the reliability of this method and other alternatives can be found in Appendix A.

3. Estimation of the CiC distribution

For our counts-in-cells process we use spherical cells in three-dimensional redshift space with constant radius. These cells are described by their center locations (three coordinates) and a radius r, and randomly distributed in space allowing them to intersect. The higher the number of cells used in the calculation, the more precise our calculation of the counts-in-cells process will be.

thumbnail Fig. 1

Illustration of the calculation of the counts-in-cells. On the left panel, we show the footprint of about 10% of the galaxies in the SDSS survey used here (Population 1) in right ascension and declination. We see the irregular sky coverage in the borders outlined by the points. In addition, two masked regions have been added to illustrate the real mask. These masked regions are also illustrated on the right panel where we represent an edge-diagram in Cartesian comoving coordinates. In both panels, the green circle represents a cell which falls completely within the surveyed region, the blue circle represent a cell, still accepted for the counts-in-cells algorithm, that has part of its volume outside of the window, but at least 95% of it within the surveyed regions. In this case, counts are compensated to account for the galaxies unseen by the masked regions. Finally, the red circle represents a rejected cell because more than 5% of its volume falls out of the surveyed region.

A randomly distributed population of cells is generated in the studied survey volume. However, a galaxy survey window can be highly irregular, masked by stars and other objects. This affects not only the distribution of the cells but also their effective volume, since even if the center of the cell is inside the window, a significant part of the sphere might be outside. Once the centers of the galaxies are determined, we estimate the effective volume of the cell contained in the window. For this purpose was created the software MANGLE (Blanton et al. 2005; Hamilton & Tegmark 2004; Swanson et al. 2008). MANGLE performs several operations with complex angular windows, mapping them with sky-projected polygons and allowing one to find the polygons containing a given point on the sphere.

The effective volume of our cells is estimated with a Monte Carlo integration, populating a large number of random distributed points in our window and counting the number of points that lie inside our cells. This is exactly a CiC calculation over a random process. Given a window of volume ν(W) = V and a cell radius r, we determine the total number of necessary random points to estimate the effective volume with typical rms uncertainty a as (1)where t = 4πr3/ 3V is the ratio between the cell volume and the total volume. The number of points in a non masked cell has expectancy N·t, and therefore the relative effective volume of each cell is estimated as (2)We estimate this effective volume for all cells using a number of random points corresponding to a = 1%. Only cells with veff ≥ 95% will be accepted.

With our accepted cells we can proceed to perform the counts-in-cells. The counting is done using the Euclidian metric and we are interested in knowing how many galaxies are found inside every cell of radius r. Then we multiply the number of counts by the inverse of the effective volume to obtain our counts-in-cells distribution. With this correction we estimate the number of galaxies unseen by the masking effects or redshift limits assuming uniform distribution of galaxies for cells dimensions.

The counts-in-cells distribution is binned in a histogram of frequencies, each bin containing the number of cells with n galaxies. If we normalize it by the total number of used cells we obtain the probability density function fV(N) of finding N galaxies in a cell of volume V. In our case, we will compute the CiC distribution fV(N) for our galaxy samples using five different cell radii: 8, 12, 16, 20 and 24 h-1 Mpc.

Estimation of the errors: LasDamas populations

The estimation of the uncertainty in our CiC measurements are done using the presented LasDamas populations. These samples have been selected so they can be compared with the SDSS populations. Using our accepted cells and their volume correction, we repeated the CiC calculations over LasDamas realizations, obtaining a reliable sampling of the distribution of cells containing a given number of galaxies. We used the standard deviation of these quantities as an error estimation for our CiC distribution over the SDSS populations. We computed the mean and the variance with (3)where Ns is the number of LasDamas samples and is the ith realization distribution evaluated at the number of galaxies per cell N = i. We opted for using these quantities instead of errors estimated from intrinsic methods (i.e., the delete-one jackknife method, Norberg et al. 2009, or the bootstrap method, Efron & Tibshirani 1994) due to an overestimation of the found errors. An analysis of this issue can be found in Appendix A.

The standard deviation values σj found with LasDamas realizations provide error estimation for a large quantity of number of galaxies per cell N. However, for radii 20 and 24 h-1 Mpc we will see how, for every realization, no cells where found with a certain small number of galaxies inside. These values have fV(N) = 0 with no estimation of their uncertainty, for which we assume the closest positive σj value. This is specially important for the case of the void-probability funcion, the probability of finding an empty cell (fV(0)). These values might affect the fitting results (Sect. 5.1) and must be included in the best fit estimation. For large values of N this problem can be ignored.

As we will see in Sect. 5.1, we have made use of these standard deviations as weights for our χ2 fitting procedure (see Eq. (17)). This implies assuming a Gaussian distribution for the observed abundances of cells containing N galaxies. However, the Shapiro-Wilk test for Gaussianity (Royston 1982) reveals how these distributions are not Gaussian when fV(N) is close to zero. We have not been able to find any known probability density satisfactorily modeling these distributions at all scales and further analysis is needed. As explained in Sect. 5.1 we perform the χ2 fit twice, first using all the obtained σi values and then only with those satisfying Gaussianity. Since most of the non-Gaussian distributed errors correspond to high values of N, where the CiC distribution describe an asymptotic tail, negligible differences were found between both best fit parameters. However, the number of bins used can be crucial for the χ2 value and the parameter error bars. The chosen solution consists of increasing the size of those bins where Gaussianity cannot be assured including the cell counts from the first Gaussian bin on the right. This creates a wide bin for the smaller N values in large cell radius samples, and another wide bin for the right tail. Altogether, Gaussianity of all bins can be assured to be above 0.9 in the Shapiro-Wilk test.

4. Frequency distribution models

In this section we present the probability distribution functions used to model the observed counts-in-cells distributions from the SDSS. Some of these functions have been commonly used in the cosmological literature to fit the CiC distribution, and some are an original contribution of this work. It is therefore of high interest to discriminate the best fitting model and their best fitting parameter values. Despite the information contained in a CiC distribution does not fully characterize the galaxy distribution, it is a usefull contraint for N-body simulations or models of the galaxy distribution. Next we present the used probability density functions.

4.1. Gravitational quasi-equilibrium distribution

The GQED (Saslaw & Hamilton 1984) is derived from a thermodynamical description of the galaxy fluid, though it can be also derived from the statistical mechanics (Ahmad et al. 2002). One of its principals assumptions is to accept that the galaxy accretion is succeeded through series of quasi-equilibrium states, a basic condition to start the thermodynamical approximation. This is possible if we assume an infinite number of galaxies in the universe. With this assumption we say clusters and large structures remain stable along large cosmological times before changing into a new quasi-equilibrium state.

We will also assume that galaxies are formed without dynamical interactions with the exterior and then they interact gravitationally as punctual masses with the rest of the universe. As the clusters dimensions are smaller than the curvature radius of the universe and the velocities involved are much lower than the speed of light, we can assume Newtonian gravity with potential φ = r-1.

After these assumptions, the GQED can be derived as the pdf(4)where the expected value of N is the product of the cell volume by the mean density of the galaxies, . Details of the deduction and properties of this distribution are described in full in Saslaw & Hamilton (1984), Sheth & Saslaw (1996) and Sheth (1995).

Parameter d (named b in Yang & Saslaw 2011) allows us to study both physical and statistical properties of this distribution. For the limit d = 0 we have a random distribution, where galaxies are uniformly distributed. At large scales, where fluctuations are small, the density function becomes Gaussian. Since this d value can be determined from physical magnitudes of the galaxies, we have a free parameter model of the galaxy distribution. As we find in Ahmad et al. (2002), d represents a measure of the state of aggregation and we can express it as (5)This quotient relates d to galaxy mass m, mean density , galaxies kinetic temperature T and gravitational constant G.

In the fitting process we will show the resulting parameters obtained directly from the CiC distribution (Eq. (6)) and parameters and d when both are fitted. The calculation of this distribution and the function Gamma for large values of N includes considerable difficulties, since the evaluation of the factorial function requires dealing with large numbers and needs a high precision to avoid the propagation error along the loops. As we will see in the following sections, this is the case for large cell radius distributions. This difficulty has been overcome using continuous fractions as representation of real numbers with infinite (i.e., arbitrarily large) precision and exploiting the model of lazy evaluation in the Haskell programming language (Peyton Jones et al. 2003), which allows us to perform heavy calculations and it is easy to integrate in C++.

In addition, we will compare these results with the estimated values of and d obtained directly from the CiC distribution. As explained in Yang & Saslaw (2011), the dependence of d on V can be obtained empirically from the variance of the number of counts in a cell of volume V () and the mean . (6)where ξ2 is the two-point correlation function.

4.2. Negative binomial bistribution

The negative binomial distribution (NBD) was first proposed in a cosmological context by Carruthers & Duong-van (1983) and derived later by Elizalde & Gaztanaga (1992). With this model we study the probability of distributing N galaxies in m disconnected boxes. The probability of finding a galaxy in one of these boxes is proportional to the number of galaxies already located in the box. Expressing the probability function in terms of galaxies per cell instead of its conventional form, we have: (7)where Γ is the gamma function. As we had with the value d for the GQED, g is also an aggregation parameter. We can obtain it theoretically for the NBD with (8)In this model, g depends both on the volume and on the shape of the cells.

This function is widely used in statistics and was firstly introduced in the cosmological context as a phenomenological model without physical explanation. Some authors (Betancort-Rijo 2000) claim that this function appear as a limit case in a variety of processes, in particular Carruthers & Duong-van (1983) found that approximates rather well the distribution of Zwicky clusters. Others, however, found that violates the second law of thermodynamics (Yang & Saslaw 2011; Saslaw & Fang 1996). However, it provides a fair agreement with the observational distribution and it is thought to be related with the hierarchical universe properties. A deeper study of the implications of using such distribution suggests that the NBD assumes galaxies to be formed where there is already a galaxy cluster. Hence, this model does not take infall into account, but can describe the case where galaxies form from the merging of less massive objects. As before, we will show the resulting parameters obtained from Eq. (8) and parameters and g when both are fitted. Similarly to the GQED distribution, the NBD involves the calculation of factorials for large N values and infinite precision, which requires versions of these functions parameterized with the precision required, as we have already seen. We have also compared the results of Haskell programming mentioned above with the use of the Stirling approximation for the factorials of big numbers, by means of the MAXIMA program1. The differences of the results are negligible.

4.3. Log normal distribution

The log normal distribution (LND) was first used by Hubble (1934) but it was not formally proposed until Coles & Jones (1991). It was one of the first fully described stochastic models applied to the distribution of matter density. This model allows us to calculate many complex properties of the CiC distribution.

While the Gaussian distribution provides us a valid description of linear and weak density perturbation fields, the log normal distribution represents the same case for the non-linear regime and is one of the few non-Gaussian random fields for which interesting properties are calculable analytically. In Coles & Jones (1991) one may find further motivation for this model, such as agreements with observational data. This has turned the log normal distribution into one of the most well known and widely applied models (Wild et al. 2005; Kitaura et al. 2010; Kayo et al. 2001).

Given a Gaussian random field X(r), the probability of finding a certain number density from our field in a given position is given by X ~ N(μ,σ2). The log normal field is obtained by applying a non-linear transformation of the Gaussian field: Y(r) = exp(X(r)), where the new one-point probability function in the field Y is (9)This is the log normal variable Y ~ Λ(μ,σ2) pdf, with the same parameters as the original Gaussian variable. In this work we make use of an expression adapted to the cosmological scenario (Arnalte-Mur et al. 2016) (10)where (11)and C is the variance in the matter distribution, which we expect to be roughly for cell radius r = 8 h-1 Mpc at z = 0. This distribution has proved efficient in the description of the dark matter density field (Arnalte-Mur et al. 2016; Clerkin et al. 2017), understood as a local non-linear transformation of the initial energy density distribution. However, the galaxy distribution studied in this work presents a bias with respect to the dark matter distribution, and the log normal distribution must be modified conveniently. This bias parameter b can be introduced in the probability density function as shown in Dekel & Lahav (1999). (12)where the above expressions are conveniently modified into (13)and Δ remains the same. With this new distribution we expect improved fittings of the galaxy CiC distribution. In this case, we assume the simplest case of a scale-independent and linear bias, so δg = , where δg and δ are the contrasts of the galaxy number density and the matter density, respectively.

Shot noise correction

As explained above, the log normal distribution is valid to describe a continuous field. In the cosmological context this is either the dark matter or the galaxy density fields. In our case, therefore, we need to introduce a discrete sampling of that underlying continuous field in order to compare to the observed counts-in-cells. This discrete sampling introduces a modification of the distribution known as shot noise. It is more important at smaller scales, where the typical number of galaxies per cell is low.

We correct for this shot noise term assuming the local Poisson process model (Layzer 1956; Peebles 1980). This has been done in the literature both at the level of the moments of the distribution (Gaztañaga et al. 2002; Angulo et al. 2008; Bel & Marinoni 2014; Bel et al. 2014) or, as we are dealing in this paper, at the level of the probability distribution function (Di Porto et al. 2016; Bel et al. 2016; Hoffmann et al. 2017). Following these authors the modified probability of galaxy counts, f(N), can be expressed as (14)here P(N | Δ) is the conditional probability of finding N galaxies in a Poisson process with density Δ. Its expression is (15)Although in this work we limit ourself to the use of a Poisson distribution, different alternatives can be used to correct from shot noise effects. In the following sections, we will use Eq. (14) as our log normal distribution fV(N) (with or without bias correction).

5. Results

Counts-in-cells calculations have been performed making use of at least as many cells as galaxies in the sample, which we consider a sufficient condition (Yang & Saslaw 2011). In addition, nearly 50% of cells are rejected by mask effects, so we have to double the number of tested cells. In Table 3 one can find the numbers of used cells for each SDSS sample and LasDamas realizations under column Cells. The counts-in-cells observed probability density functions fV(N) are shown in Figs. 2 and 3 together with the best fits of the different distributions. The curves show the expected probabilities for our chosen samples and radii (Yang & Saslaw 2011), with higher values of for bigger cells. Used bins variate their width depending on the CiC population. We have used the Freedman-Diaconis rule (Freedman & Diaconis 1981), which fixes the width m of the bin depending on the number of observations and the spread of the population: (16)where IQR(X) is the interquartile of population X and n is the number of objects in X. The resulting quantity has been rounded, resulting on m = 1 for cell radii 8 and 12 h-1 Mpc, m = 3 for radius 16 h-1 Mpc and m = 6 for radius 24 h-1 Mpc in Population 1. In Population 2, all bins have width m = 1, except for radius 24 h-1 Mpc, where m = 2.

5.1. Fitting the results to a distribution function

In this section we proceed to fit the probability distribution functions defined in Sect. 4 to our counts-in-cells observed distributions. We calculate the χ2 goodness of fit (17)between the observed distribution and the theoretical distribution as a qualitative measure, where Nmax is the largest number of galaxies in a cell. The fitting makes use of the diagonal values σ2, ignoring correlations between bins. This estimation might underestimate the obtained best fit χ2 values and the parameters 1σ error bars, however, we have checked that this decision does not affect the best-fit values. After generating samples of Poisson-distributed points in the used masks with the densities of Table 1, we performed the described effective volume estimation method. The analytical Poisson distribution for the used densities are within the uncertainties of the obtained results and thus, we are confident of our methods. Vector θ contains the parameters of the theoretical distribution.

As introduced in Sect. 3, we have performed our fitting analysis over an observational distribution function with adaptive bin width, with wider bins for those values where fV(N) was close to zero and the Gaussianity of its error distribution could not be assured. The solution consisted of merging these bins with the closest Gaussian bin on the right. The best fit solution for the full curve analysis for each population can be found respectively in Figs. 2 and 3.

We show in Table 3 the values of θ and χ2 obtained by performing a blind fit where all parameters are allowed to variate. Fittings have been performed using our own developed routines and also by means of the non-linear least-squares fitting MPFIT routines in IDL Markwardt (2009). Differences between both procedures are negligible. The numbers reported here are the ones by our own routines.

Table 3

Counts-in-cells best fit fV(N) – all parameters free.

We use the χ2 values to discriminate the best fitting distribution. These values can be compared with results in Figs. 2 and 3. The top box shows the observational and best fit distributions. In the bottom boxes we can see the residuals, the difference between the best-fit distributions and the observed one (X axis). As expected, curves outside the error bars have higher values of χ2.

Regarding the best fit distribution, the negative binomial distribution generally performs the best fit results with lower values of χ2. This is specially clear for radius 8 h-1 Mpc. However, we note that for higher radii, the log normal distribution with bias obtains remarkably good values as well, specially the higher radii of Population 2. At small scales, the log normal distribution with bias show a worse fit than the NBD. This may be an indication that the correction introduced in Sect. 4.3 does not completely correct for the shot noise term, or that this underlying distribution fails to describe the density field at these scales.

The gravitational quasi-equilibrium distribution generally performs poorer fittings than the NBD or the LN with bias distribution. But its χ2 values are often close to those obtained by the LN. Finally, the LN distribution, without the bias modification, performs, as expected, worse fittings than any other distribution, clearly showing its unsuitability to describe the galaxy distribution, although it is rather appropriate to describe the dark matter distribution (see, e.g., Clerkin et al. 2017).

Regarding the found parameters, we have obtained similar results to Yang & Saslaw (2011), with generally good fittings (inside the 1σ values). A monotonic trend is found between parameter d and radii r, indicating stronger correlations inside bigger cells, which contain more structure. For the NBD we have , and therefore the opposite trend is found.

For the log normal distribution with bias, mean is, as expected, close to the value obtained in Table 3 or the best fit of GQED and NBD. Parameter Cb, the variance of the matter distribution, strongly varies with r. This quantity is roughly related to the cosmological parameter σ8, which at z = 0 is measured σ8 = 0.828 ± 0.012 (Planck Collaboration XVI 2014). Therefore, for the case of cells of radius 8 h-1 Mpc, we would expect the value of Cb to be close to , which we actually found for Population 1, with lower redshift, where C = 0.71 ± 0.03. The monotonic evolution of Cb with the cell radius for Population 1 shows a good description of the variation of the strength of the clustering with the scale. However, we have not been able to find the same results with Population 2. Results on the bias parameter b for different radius of the cells are compatible with expected scale dependent behavior of b (Baugh 2013).

An interesting comparison can be made between the galaxy variance C and the value b2Cb, the variance of a linearly biased tracer. If we campare these values as found in Table 3, we can see a general similarity of these values except for radii 8 and 12 h-1 Mpc of Population 2.

thumbnail Fig. 2

Count-in-cells results for Population 1. fV(N) CiC distribution function with best fit models and error bars. Top to bottom and left to right: radii 8, 12, 16, 20 and 24 h-1 Mpc. Best fit plots for bins with Gaussian errors. Boxes: NBD residuals, GQED residuals, log normal distribution residuals, log normal with bias distribution residuals.

thumbnail Fig. 3

Count-in-cells results for Population 2. fV(N) CiC distribution function with best fit models and error bars. Top to bottom and left to right: radii 8, 12, 16, 20 and 24 h-1 Mpc. Best fit plots for bins with Gaussian errors. Boxes: NBD residuals, GQED residuals, log normal distribution residuals, log normal with bias distribution residuals.

5.2. Model selection

The NBD, the GQED and the LND make use of two free parameters, therefore the χ2 test should be enough to discriminate between them. However, the LN distribution with bias includes a third parameter, and this should be taken into account when comparing its performance with the other distributions. Therefore we use the hypothesis test to discriminate the best fitting distribution for the galaxy counts-in-cells distribution. We use the Bayesian and the Akaike information criteria (BIC and AIC, Schwarz 1978; Akaike 1998). The preference on the AIC or the BIC is widely debated (Lahiri 2001; Konishi & Kitagawa 2008). As usual, we make us of both. These criteria are calculated using the maximum likelihood value L, which can be defined with the χ2 value from the best fit: (18)where K is a constant that only depends on the errors of the data values so it cancels out when comparing different models. And the information criteria

where L is the maximum likelihood for a given sample and model, | θ | is the number of used parameters in the model and n is the number of observations in the data. Now, we can compare the goodness of fit on any pair of models. We are specially interested in comparing the performance of the three different two-parameters models with the LN distribution with bias (LNb). Then, we will say that a model α is performing a better (or worse) fit than the LN distribution with bias if BICα< BICLNb (or >), and similarly with AIC. It is in these comparisons where constant K cancels. Results are shown in Table 4.

The values obtained for the GQED and the LN distributions show that the LN distribution with bias is clearly superior, even with a additional parameter, and it is completely justified to use a modified log normal. However, a comparison between the NBD and LNb distribution shows that the first one is clearly preferable for small radii cells, when the shot noise effect is stronger. For radius 16 h-1 Mpc and higher, the performance is comparable for both distributions (Koopman 1943; Kass & Raftery 1995). Due to its success among the smaller radii we chose the NBD as the best fit distribution function rather than the LNb distribution, although the satisfactory results at large scales of the later make this distribution still a justifiable election.

Table 4

Model selection.

6. Discussion and conclusions

In this work we proceed with a blind fit of four different probability functions. A blind fit allows us to discriminate the best model of the observed counts-in-cells distribution. This work provides a very necessary tool for many different applications in modern cosmology, such as the generation of galaxy mocks or the testing of N-body simulations.

The GQED and the Log Normal distribution (with no bias parameter) generally show clearly poorer performances and both the χ2 and the hypothesis test advice against using them as descriptors of the galaxy distribution. We have also fitted the calculated counts-in-cells distribution to the standard log normal distribution (Eq. (10)), and the fits are systematically worse than the fits obtained including the bias parameter. We conclude that this is an important result: the NBD provides the best fit for the galaxy counts-in-cells distribution for all scales, while the log normal distribution modified by the inclusion of the bias term might be a justifiable alternative choice for large-scale cells. Moreover, the estimation of parameters C and b from the log normal distribution incorporates an interesting asset and allows its direct comparison with other methods of bias estimation (López-Sanjuan et al. 2015; Hurtado-Gil et al. 2016). Nevertheless, we might have obtained results for the bias parameter in tension with other works (Zehavi et al. 2011), although our results could be an evidence of the expected scale dependent behavior of the bias parameter Baugh (2013).

Regarding the χ2 results, the NBD is, generally the best model. For the higher radii, 16 to 24 h-1 Mpc, the log normal distribution with bias performs a slightly better fit. This probability distribution has three free parameters, which could explain its better performance. Actually, our hypothesis test analysis showed that the NBD (two free parameters) is preferable, since the lower χ2 values of the log normal distribution do not justify the extra parameter. Therefore, we find that the NBD is the best model describing the counts-in-cells galaxy distribution for all scales, and that a choice for the modified log normal distribution is only justified at large scales.

Regarding the physical motivation of the NBD, here we must say that various authors find it fully justified (Elizalde & Gaztanaga 1992; Betancort-Rijo 2000; Carruthers & Duong-van 1983), while others have found that this model might contradict known physics (Saslaw & Fang 1996), although these authors also recognize that their calculation of the function fV(N) for 3D cells performed on the SDSS follows better the NBD than the GQED distribution. The same agreement with the NBD is found for the void-probability function, fV(0) of the SDSS by Conroy et al. (2005). Moreover, Bel et al. (2016) have recently shown that the negative binomial fits rather well the observed Counts-in-Cells distribution function for the VIPERS survey, concluding that the underlying continuous galaxy field should be well described by a Gamma distribution, since the NBD is considered to be the discrete analogue of the continuous Gamma distribution (see Adell & De la Cal 1994, where the approximation between these two distributions is studied).

In sum, we conclude that the negative binomial distribution has proved to be a very efficient descriptor of the counts-in-cells galaxy distribution.


We would like to thank an anonymous referee for comments and suggestions that have improved the quality and readability of this paper. This work has been supported by the Spanish Ministerio de Economía y Competitividad projects AYA2013-48623-C2-2 and AYA2016-81065-C2-2-P, including FEDER contributions, by the Generalitat Valenciana project of excellence PROMETEOII/2014/060. We acknowledge the use of data from the LasDamas simulations, publicly available at We also acknowledge the use of public data from SDSS. Funding for the SDSS and SDSS-II has been provided by the Alfred P. Sloan Foundation, the Participating Institutions, the National Science Foundation, the U.S. Department of Energy, the National Aeronautics and Space Administration, the Japanese Monbukagakusho, the Max Planck Society, and the Higher Education Funding Council for England. The SDSS Web Site is The SDSS is managed by the Astrophysical Research Consortium for the Participating Institutions. The Participating Institutions are the American Museum of Natural History, Astrophysical Institute Potsdam, University of Basel, University of Cambridge, Case Western Reserve University, University of Chicago, Drexel University, Fermilab, the Institute for Advanced Study, the Japan Participation Group, Johns Hopkins University, the Joint Institute for Nuclear Astrophysics, the Kavli Institute for Particle Astrophysics and Cosmology, the Korean Scientist Group, the Chinese Academy of Sciences (LAMOST), Los Alamos National Laboratory, the Max-Planck-Institute for Astronomy (MPIA), the Max-Planck-Institute for Astrophysics (MPA), New Mexico State University, Ohio State University, University of Pittsburgh, University of Portsmouth, Princeton University, the United States Naval Observatory, and the University of Washington.


  1. Abazajian, K. N., Adelman-McCarthy, J. K., Agüeros, M. A., et al. 2009, ApJS, 182, 543 [NASA ADS] [CrossRef] [Google Scholar]
  2. Adell, J. A., & De la Cal, J. 1994, J. Appl. Prob., 31, 391 [CrossRef] [Google Scholar]
  3. Ahmad, F., Saslaw, W. C., & Bhat, N. I. 2002, ApJ, 571, 576 [NASA ADS] [CrossRef] [Google Scholar]
  4. Akaike, H. 1998, in Selected Papers of Hirotugu Akaike (Springer), 199 [Google Scholar]
  5. Angulo, R. E., Baugh, C. M., & Lacey, C. G. 2008, MNRAS, 387, 921 [NASA ADS] [CrossRef] [Google Scholar]
  6. Arnalte-Mur, P., Vielva, P., Martínez, V. J., et al. 2016, J. Cosmol. Astropart. Phys., 3, 005 [NASA ADS] [CrossRef] [Google Scholar]
  7. Balian, R., & Schaeffer, R. 1989, A&A, 220, 1 [NASA ADS] [Google Scholar]
  8. Baugh, C. M. 2013, PASA, 30, e030 [NASA ADS] [CrossRef] [Google Scholar]
  9. Bel, J., & Marinoni, C. 2014, A&A, 563, A36 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
  10. Bel, J., Marinoni, C., Granett, B. R., et al. 2014, A&A, 563, A37 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
  11. Bel, J., Branchini, E., Di Porto, C., et al. 2016, A&A, 588, A51 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
  12. Betancort-Rijo, J. 2000, J. Stat. Phys., 98, 917 [CrossRef] [Google Scholar]
  13. Blanton, M. R., Schlegel, D. J., Strauss, M. A., et al. 2005, AJ, 129, 2562 [NASA ADS] [CrossRef] [Google Scholar]
  14. Carruthers, P., & Duong-van, M. 1983, Phys. Lett. B, 131, 116 [NASA ADS] [CrossRef] [Google Scholar]
  15. Clerkin, L., Kirk, D., Manera, M., et al. 2017, MNRAS, 466, 1444 [NASA ADS] [CrossRef] [Google Scholar]
  16. Coles, P., & Jones, B. 1991, MNRAS, 248, 1 [NASA ADS] [CrossRef] [Google Scholar]
  17. Conroy, C., Coil, A. L., White, M., et al. 2005, ApJ, 635, 990 [NASA ADS] [CrossRef] [Google Scholar]
  18. Croton, D. J., Colless, M., Gaztañaga, E., et al. 2004, MNRAS, 352, 828 [NASA ADS] [CrossRef] [Google Scholar]
  19. Dekel, A., & Lahav, O. 1999, ApJ, 520, 24 [NASA ADS] [CrossRef] [MathSciNet] [Google Scholar]
  20. Di Porto, C., Branchini, E., Bel, J., et al. 2016, A&A, 594, A62 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
  21. Efron, B., & Tibshirani, R. J. 1994, An introduction to the bootstrap (CRC press) [Google Scholar]
  22. Elizalde, E., & Gaztanaga, E. 1992, MNRAS, 254, 247 [NASA ADS] [CrossRef] [Google Scholar]
  23. Freedman, D., & Diaconis, P. 1981, Theory Relat Fields, 57, 453 [Google Scholar]
  24. Fry, J. N., & Gaztanaga, E. 1994, ApJ, 425, 1 [NASA ADS] [CrossRef] [Google Scholar]
  25. Gaztañaga, E., Fosalba, P., & Croft, R. A. C. 2002, MNRAS, 331, 13 [NASA ADS] [CrossRef] [Google Scholar]
  26. Hamilton, A. J. S., & Tegmark, M. 2004, MNRAS, 349, 115 [NASA ADS] [CrossRef] [Google Scholar]
  27. Hoffmann, K., Bel, J., & Gaztañaga, E. 2017, MNRAS, 465, 2225 [NASA ADS] [CrossRef] [Google Scholar]
  28. Hubble, E. 1934, ApJ, 79, 8 [NASA ADS] [CrossRef] [Google Scholar]
  29. Hurtado-Gil, L., Arnalte-Mur, P., Martínez, V. J., et al. 2016, ApJ, 818, 174 [NASA ADS] [CrossRef] [Google Scholar]
  30. Kass, R. E., & Raftery, A. E. 1995, J. Amer. Statist. Ass., 90, 773 [CrossRef] [MathSciNet] [Google Scholar]
  31. Kayo, I., Taruya, A., & Suto, Y. 2001, ApJ, 561, 22 [NASA ADS] [CrossRef] [Google Scholar]
  32. Kitaura, F.-S., Jasche, J., & Metcalf, R. B. 2010, MNRAS, 403, 589 [NASA ADS] [CrossRef] [Google Scholar]
  33. Konishi, S., & Kitagawa, G. 2008, Information criteria and statistical modeling (Springer Science & Business Media) [Google Scholar]
  34. Koopman, B. 1943, J. Symb. Log., 8, 34 [CrossRef] [Google Scholar]
  35. Kovač, K., Porciani, C., Lilly, S. J., et al. 2011, ApJ, 731, 102 [NASA ADS] [CrossRef] [Google Scholar]
  36. Lahiri, P. 2001, in Model selection (Beachwood: Institute of Mathematical Statistics), 1 [Google Scholar]
  37. Layzer, D. 1956, AJ, 61, 383 [NASA ADS] [CrossRef] [Google Scholar]
  38. López-Sanjuan, C., Cenarro, A. J., Hernández-Monteagudo, C., et al. 2015, A&A, 582, A16 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
  39. Marinoni, C., Le Fèvre, O., Meneux, B., et al. 2005, A&A, 442, 801 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
  40. Markwardt, C. B. 2009, in Astronomical Data Analysis Software and Systems XVIII, eds. D. A. Bohlender, D. Durand, & P. Dowler, ASP Conf. Ser., 411, 251 [Google Scholar]
  41. Maurogordato, S., & Lachieze-Rey, M. 1987, ApJ, 320, 13 [NASA ADS] [CrossRef] [Google Scholar]
  42. McBride, C., Berlind, A. A., Scoccimarro, R., et al. 2011, in Am. Astron. Soc. Meet. Abstr., BAAS, 43, 249.07 [Google Scholar]
  43. Norberg, P., Baugh, C. M., Gaztañaga, E., & Croton, D. J. 2009, MNRAS, 396, 19 [NASA ADS] [CrossRef] [Google Scholar]
  44. Peebles, P. J. E. 1980, The large-scale structure of the universe (Princeton, N.J.: Princeton University Press), 435 [Google Scholar]
  45. Peyton Jones, S., et al. 2003, J. Functional Programming, 13, [Google Scholar]
  46. Planck Collaboration XVI. 2014, A&A, 571, A16 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
  47. Royston, J. 1982, Appl. Stat., 115 [Google Scholar]
  48. Saslaw, W. C., & Fang, F. 1996, ApJ, 460, 16 [NASA ADS] [CrossRef] [Google Scholar]
  49. Saslaw, W. C., & Hamilton, A. J. S. 1984, ApJ, 276, 13 [NASA ADS] [CrossRef] [Google Scholar]
  50. Schwarz, G. 1978, Ann. Statist., 6, 461 [NASA ADS] [CrossRef] [MathSciNet] [Google Scholar]
  51. Sheth, R. K. 1995, MNRAS, 274, 213 [NASA ADS] [CrossRef] [Google Scholar]
  52. Sheth, R. K., & Saslaw, W. C. 1996, ApJ, 470, 78 [NASA ADS] [CrossRef] [Google Scholar]
  53. Sigad, Y., Branchini, E., & Dekel, A. 2000, ApJ, 540, 62 [NASA ADS] [CrossRef] [Google Scholar]
  54. Skrutskie, M. F., Schneider, S. E., Stiening, R., et al. 1997, in The Impact of Large Scale Near-IR Sky Surveys, eds. F. Garzon, N. Epchtein, A. Omont, B. Burton, & P. Persi, Astrophys. Space Sci. Libr., 210, 25 [Google Scholar]
  55. Swanson, M. Accessed: 2016-03-25, LasDamas webpage, [Google Scholar]
  56. Swanson, M. E. C., Tegmark, M., Hamilton, A. J. S., & Hill, J. C. 2008, MNRAS, 387, 1391 [NASA ADS] [CrossRef] [Google Scholar]
  57. White, S. D. M. 1979, MNRAS, 189, 831 [NASA ADS] [CrossRef] [Google Scholar]
  58. Wild, V., Peacock, J. A., Lahav, O., et al. 2005, MNRAS, 356, 247 [NASA ADS] [CrossRef] [Google Scholar]
  59. Yang, A., & Saslaw, W. C. 2011, ApJ, 729, 123 [NASA ADS] [CrossRef] [Google Scholar]
  60. Zehavi, I., Zheng, Z., Weinberg, D. H., et al. 2011, ApJ, 736, 59 [NASA ADS] [CrossRef] [Google Scholar]

Appendix A: Reliability of the error estimation

The estimation of error bars for our counts-in-cells distribution is a task that demands an additional analysis. Two kinds of methods can be followed: intrinsic and extrinsic. In the first group we have the jackknife resampling method, which together with the Bootstrap method (Efron & Tibshirani 1994), allows us to estimate the errors in multiple statistics. It can be used when we make use of summary statistics, like the counts-in-cells or the correlation function. With these statistics, we perform an analysis over the entire population, shrinking its information into a single quantity, such as the first and second characteristics. These methods are meant to be used when the limited volume of data in use does not allow us to compare our calculations among different populations, and therefore, we do not have a direct way to obtain its variation. As we only have one single galaxy population for each sample, we have to apply an internal error estimate method over the entire population.

On the other hand, the extrinsic methods make use of alternative samples to estimate the variance of our sample of study. In galaxy surveys analysis, these alternative samples can be obtained from mock catalogs, like those presented in Sect. 2.2 from LasDamas survey (McBride et al. 2011).

In this section we compare the performance of the two methods. To do so, we will compare the error estimations σi obtained in Sect. 3 with those obtained with the jackknife method over one of the LasDamas simulations.

The used jackknife method is the “delete one jackknife” method (Norberg et al. 2009). Our data from the chosen LasDamas simulation consists on coordinates in a certain space contained by a window. Dividing this window into different regions we generate Nsub subvolumes of different disjoint areas of the sky with the same redshift limits. Each subvolume contains a subsample of cells and deleting one of these subsamples from the parent population we produce a new sample. Cells are included in the corresponding subvolume using their center coordinates, and no further calculation is made regarding their volume overlapping with neighbor subsamples. These resampled data shares its cells and galaxies with the original one but lacks one of the subsamples. We can repeat this procedure Nsub times by systematically omitting, in turn, each of the subsamples in which the data has been split. The resampling of the data set consists of Nsub−1 remaining subsamples, with volume ν(W)−ν(Vi), where Vi is the volume occupied by the ith subvolume.

Our parent population of cells C occupies an irregular area of the sky due to the rejection process explained in Sect. 3. In order to ensure that all jackknife samples contain the same amount of cells, subsamples are selected in the following way: we sort our cells by declination and create 7 (or other desired number) subgroups of equal amount, in such a way that the groups are also sorted by declination. Then, we perform the same division for each subgroup with the right ascension, sorting each of them and dividing into another seven subgroups, 49 in total. Now we have subgroups of cells located in differently shaped rectangles but containing the same amount of cells.

Let f be the counts-in-cells distribution of our parent population. Once we have generated the new subsamples from the original population, we proceed to calculate the covariance matrix as (A.1)where Nsub is the number of used subsamples, fk is the Counts-in-Cells distribution when we omit subsample k, and is the mean of distributions fk for every subsample Nsubk. We compute these functions in the values i and j corresponding to all the considered values of N, this is, the values of the CiC frequency histogram, from 0 to the maximum number of galaxies contained into a cell. We finally calculate our standard deviations as .

Now, we can compare the jackknife error estimates with the standard deviations of the LasDamas simulations (see Sect. 3) and check if jackknife errors are over- or underestimated (Fig. A.1).

thumbnail Fig. A.1

Comparison between “delete one jackknife” method error bars for fV(N) (pink) and its population variance (green) in Esmeralda simulation. Top: 8 h-1 Mpc cell radius, bottom: 24 h-1 Mpc cell radius.

After these results, we see that the jackknife errors for LasDamas samples are typically larger than the errors obtained from the sample-to-sample variances in the catalogs. This indicates that the same systematic error should be found for the SDSS samples, with jackknife estimated error bars overestimating the real uncertainty. This proves that the jackknife estimations are unreliable in strong mask conditions, even after our cell rejection process. Hence, we will make use of the variances estimated from the LasDamas simulations (Eq. (3)) when fitting our probability functions. In Fig. A.2 we can see the LasDamas CiC distribution compared with the SDSS distribution. The good agreement between data and simulations allows us to use LasDamas variances as error bars for the SDSS counts-in-cells distributions.

thumbnail Fig. A.2

Comparison between LasDamas CiC distribution (pink) with the results obtained for the SDSS selected samples. Top: 8 h-1 Mpc cell radius, bottom: 24 h-1 Mpc cell radius.

All Tables

Table 1

SDSS selected samples.

Table 2

LasDamas selected mock catalogs.

Table 3

Counts-in-cells best fit fV(N) – all parameters free.

Table 4

Model selection.

All Figures

thumbnail Fig. 1

Illustration of the calculation of the counts-in-cells. On the left panel, we show the footprint of about 10% of the galaxies in the SDSS survey used here (Population 1) in right ascension and declination. We see the irregular sky coverage in the borders outlined by the points. In addition, two masked regions have been added to illustrate the real mask. These masked regions are also illustrated on the right panel where we represent an edge-diagram in Cartesian comoving coordinates. In both panels, the green circle represents a cell which falls completely within the surveyed region, the blue circle represent a cell, still accepted for the counts-in-cells algorithm, that has part of its volume outside of the window, but at least 95% of it within the surveyed regions. In this case, counts are compensated to account for the galaxies unseen by the masked regions. Finally, the red circle represents a rejected cell because more than 5% of its volume falls out of the surveyed region.

In the text
thumbnail Fig. 2

Count-in-cells results for Population 1. fV(N) CiC distribution function with best fit models and error bars. Top to bottom and left to right: radii 8, 12, 16, 20 and 24 h-1 Mpc. Best fit plots for bins with Gaussian errors. Boxes: NBD residuals, GQED residuals, log normal distribution residuals, log normal with bias distribution residuals.

In the text
thumbnail Fig. 3

Count-in-cells results for Population 2. fV(N) CiC distribution function with best fit models and error bars. Top to bottom and left to right: radii 8, 12, 16, 20 and 24 h-1 Mpc. Best fit plots for bins with Gaussian errors. Boxes: NBD residuals, GQED residuals, log normal distribution residuals, log normal with bias distribution residuals.

In the text
thumbnail Fig. A.1

Comparison between “delete one jackknife” method error bars for fV(N) (pink) and its population variance (green) in Esmeralda simulation. Top: 8 h-1 Mpc cell radius, bottom: 24 h-1 Mpc cell radius.

In the text
thumbnail Fig. A.2

Comparison between LasDamas CiC distribution (pink) with the results obtained for the SDSS selected samples. Top: 8 h-1 Mpc cell radius, bottom: 24 h-1 Mpc cell radius.

In the text

Current usage metrics show cumulative count of Article Views (full-text article views including HTML views, PDF and ePub downloads, according to the available data) and Abstracts Views on Vision4Press platform.

Data correspond to usage on the plateform after 2015. The current usage metrics is available 48-96 hours after online publication and is updated daily on week days.

Initial download of the metrics may take a while.