A&A, Volume 597, January 2017
Article Number: A89
Number of page(s): 28
Section: Numerical methods and codes
DOI: https://doi.org/10.1051/0004-6361/201629219
Published online: 10 January 2017
Probabilistic multi-catalogue positional cross-match
^{1} Observatoire astronomique de Strasbourg, Université de Strasbourg, CNRS, UMR 7550, 11 rue de l’Université, 67000 Strasbourg, France
e-mail: francois-xavier.pineau@astro.unistra.fr
^{2} IFCA (CSIC-UC), Avenida de los Castros, 39005 Santander, Spain
^{3} Department of Physics & Astronomy, University of Leicester, Leicester, LE1 7RH, UK
^{4} Leibniz-Institut für Astrophysik Potsdam (AIP), An der Sternwarte 16, 14482 Potsdam, Germany
^{5} Max-Planck Institute for Solar System Research, Justus-von-Liebig-Weg 3, 37077 Göttingen, Germany
Received: 30 June 2016
Accepted: 23 August 2016
Context. Catalogue cross-correlation is essential to building large sets of multiwavelength data, whether it be to study the properties of populations of astrophysical objects or to build reference catalogues (or time-series) from survey observations. Nevertheless, resorting to automated processes with limited sets of information available on large numbers of sources detected at different epochs with various filters and instruments inevitably leads to spurious associations. We need both statistical criteria to select detections to be merged as unique sources, and statistical indicators helping to achieve a compromise between the completeness and the reliability of the selected associations.
Aims. We lay the foundations of a statistical framework for multi-catalogue cross-correlation and cross-identification based on explicit simplified catalogue models. A proper identification process should rely on both astrometric and photometric data. Under some conditions, the astrometric part and the photometric part can be processed separately and merged a posteriori to provide a single global probability of identification. The present paper addresses almost exclusively the astrometric part and specifies the proper probabilities to be merged with photometric likelihoods.
Methods. To select matching candidates in n catalogues, we used the Chi (or, indifferently, the Chi-square) test with 2(n−1) degrees of freedom. We thus call this cross-match a χ-match. In order to use Bayes’ formula, we considered exhaustive sets of hypotheses based on combinatorial analysis. The volume of the χ-test domain of acceptance – a 2(n−1)-dimensional acceptance ellipsoid – is used to estimate the expected numbers of spurious associations. We derived priors for those numbers using a frequentist approach relying on simple geometrical considerations. Likelihoods are based on standard Rayleigh, χ and Poisson distributions that we normalized over the χ-test acceptance domain. We validated our theoretical results by generating and cross-matching synthetic catalogues.
Results. The results we obtain do not depend on the order used to cross-correlate the catalogues. We applied the formalism described in the present paper to build the multiwavelength catalogues used for the science cases of the Astronomical Resource Cross-matching for High Energy Studies (ARCHES) project. Our cross-matching engine is publicly available through a multi-purpose web interface. In the longer term, we plan to integrate this tool into the CDS XMatch Service.
Key words: methods: data analysis / methods: statistical / catalogs / astrometry
© ESO, 2017
1. Introduction
The development of new detectors with high throughput over large areas has revolutionized observational astronomy during recent decades. These technological advances, aided by a considerable increase of computing power, have opened the way to outstanding ground-based and space-borne all-sky or very large area imaging projects (e.g. the 2MASS, Skrutskie et al. 2006; Cutri et al. 2003; SDSS, Ahn et al. 2012, 2013; and WISE, Wright et al. 2010; Cutri et al. 2014, surveys). These surveys have provided an essential astrometric and photometric reference frame and the first true digital maps of the entire sky.
As an illustration of this flood of data, the number of catalogue entries in the VizieR service at the Centre de Données astronomiques de Strasbourg (CDS), which was about 500 million in 1999, reached almost 18 billion as of February 2016. On the 2020 horizon, European space missions such as Gaia and Euclid, together with the Large Synoptic Survey Telescope (LSST), will provide a severalfold increase in the number of catalogued optical objects while providing measurements of exquisite astrometric and photometric quality.
This exponentially increasing flow of high-quality multiwavelength data has radically altered the way astronomers now design observing strategies and tackle scientific issues. The former paradigm, mostly focusing on a single wavelength range, has in many cases evolved towards a systematic, fully multiwavelength study. In fact, modelling spectral energy distributions over the widest range of frequencies, spanning from radio to the highest-energy gamma-rays, has been instrumental in understanding the physics of stars and galaxies.
Many well-designed and useful tools have been developed worldwide concurrently with the emergence of the virtual observatory. Most, if not all, of these tools can handle and process multi-band images and catalogues. When assembling spectral energy distributions using surveys obtained at very different wavelengths and with discrepant spatial resolutions, one of the most acute problems is finding the correct counterpart across the various bands. Several tools such as TOPCAT (Taylor 2005) or the CDS XMatch Service (Pineau et al. 2011a; Boch et al. 2012) offer basic cross-matching facilities. However, none of the publicly available tools handles the statistics inherent to the cross-matching process in a fully coherent manner. A standard method for a dependable and robust association of a physical source with its instances in different catalogues (cross-identification) and in diverse spectral ranges is still absent.
The pressing need for a multi-catalogue probabilistic cross-matching tool was one of the strong motivations of the FP7-Space European programme ARCHES (Motch et al. 2016)^{1}. Designing a cross-matching tool able to process, in a single pass, a theoretically unlimited number of catalogues, while computing probabilities of association for all catalogue configurations, using the background density of sources and the positional errors, and possibly introducing priors on the expected shape of the spectral energy distribution, is one of the most important outcomes of the project. A preliminary description of this algorithm was presented in Pineau et al. (2015). Although ARCHES was originally focused on the cross-matching of XMM-Newton sources, the algorithms developed in this context are clearly applicable to any combination of catalogues and energy bands (see for example Mingo et al. 2016).
2. Going beyond the two-catalogue case
Computing probabilities of identification when cross-correlating two catalogues in a given area can be quite straightforward (provided the area is small enough for the density of sources to be considered more or less constant, but large enough to provide sufficient statistics). For each possible pair of sources (one from each catalogue), we compute the distance normalized by the positional errors (called normalized distance, σ-distance, χ-distance or, more generally in this paper, Mahalanobis distance D_{M}). We then build the histogram of the number of associations per bin of D_{M}. This histogram is the sum of two components (see Fig. 2): the “real” or “true” associations (T), for which the distribution p(D_{M} | T) follows a Rayleigh distribution; and the spurious or “false” associations (F), for which the distribution p(D_{M} | F) follows a linear (Poisson) distribution. Knowing these two distributions and the total number of associations (n_{T+F}), we may fit the histogram with the function

f(D_{M}) = (n_{T+F} − n_{F}) p(D_{M} | T) + n_{F} p(D_{M} | F)    (1)

to estimate the number of spurious associations (n_{F}) and thus the number of good matches (n_{T} = n_{T+F} − n_{F}). Hence, we are able to attribute to an association with a given normalized distance the probability of being a good match:

p(T | D_{M}) = (n_{T+F} − n_{F}) p(D_{M} | T) / f(D_{M}).    (2)

Dividing both the numerator and the denominator by n_{T+F}, we recognize Bayes’ formula, considering (n_{T+F} − n_{F})/n_{T+F} as the prior p(T), n_{F}/n_{T+F} as the prior p(F), and noting that f(D_{M}) = n_{T+F} p(D_{M}).
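As an illustration (not part of the original pipeline), the two-component histogram fit above can be sketched in Python; the synthetic sample sizes, the selection radius and all variable names are purely illustrative, with a Rayleigh scale of 1 for the true matches and a linear density for the spurious ones:

```python
# Sketch of the two-component histogram decomposition: true matches follow a
# Rayleigh p.d.f. in normalized distance, spurious matches a linear (Poisson)
# p.d.f., and the two amplitudes are fitted to the observed histogram.
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)
r_max = 5.0  # illustrative selection radius in units of sigma

# Synthetic normalized distances: 800 true matches and 200 spurious ones
# (spurious pairs are uniform over the disk, hence linear in radius).
d_true = rng.rayleigh(1.0, 800)
d_false = r_max * np.sqrt(rng.random(200))
d_all = np.concatenate([d_true[d_true < r_max], d_false])

counts, edges = np.histogram(d_all, bins=40, range=(0.0, r_max))
centers = 0.5 * (edges[:-1] + edges[1:])
bin_w = edges[1] - edges[0]

def model(x, n_t, n_f):
    p_t = x * np.exp(-x**2 / 2.0)   # Rayleigh p.d.f., sigma = 1
    p_f = 2.0 * x / r_max**2        # linear p.d.f., normalized on [0, r_max]
    return (n_t * p_t + n_f * p_f) * bin_w

(n_t_fit, n_f_fit), _ = curve_fit(model, centers, counts, p0=[500.0, 500.0])
```

The fitted amplitudes recover the injected numbers of true and spurious associations to within the statistical noise of the sample.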
The present paper basically extends this simple approach to more than two catalogues. Instead of fitting histograms to find the number of spurious associations, we compute it directly from the input catalogue data and from geometrical considerations.
Previously, Budavári & Szalay (2008) developed a multi-catalogue cross-match. For a given set of n sources from n distinct catalogues, they compute a “Bayes factor” based on both astrometric and photometric data. The “Bayes factor” is then used as a score: a predefined threshold on its value is applied to select or reject the given set of n sources. We discuss the astrometric part of the Budavári & Szalay (2008) “Bayes factor” and compare it to our selection criterion in Sects. 5.6 and 6.1.
Throughout the present paper we consider a set of n catalogues. We use a Chi-square criterion based on individual elliptical positional errors to select, in these catalogues, sets of associations containing at most one source per catalogue. We call this selection a χ-match. We then compute probabilities for each set of associations. To compute the probabilities, we consider only the result set in which each set of associations contains exactly n sources (one per catalogue; see below for partial matches). For readers familiar with databases, it can be seen as the result of inner joins, successively joining each catalogue using a Chi-square criterion. The probabilities we then compute are based only on positional coincidences. Although we show how it is possible to add likelihoods based on photometric considerations, the computation of such photometric likelihoods is beyond the scope of this paper.
As the result of a χ-match, two distinct sets of associations may have sources in common: a source having a large positional error in one catalogue may, for example, be associated with several sources with smaller errors in another catalogue. We do not take into account in our probabilities the “one-to-several” and “one-to-one” association paradigms defined in Fioc (2014): they become far too complex when dealing with a generic number of catalogues, and the situation is not that simple when a source may be blended, etc. We use a several-to-several(-to-several...) paradigm. In other words, we compute probabilities for a set of associations regardless of the fact that a source in the set can also belong to other sets of associations. So the same detection in one catalogue may have very high probabilities of association with several (sets of) candidates in the other catalogues. We think it is the responsibility of the photometric part to disentangle such cases.
Requiring one candidate per catalogue for each set of associations (i.e. each tuple) is somewhat restrictive. But, if one or several catalogues do not contain any candidates for a tuple, then we compute the probabilities from the crossmatch of the subset of catalogues providing one candidate to that tuple. Those probabilities are computed independently of the “full” n catalogues probabilities. For example, if we crossmatch three catalogues and if a set of associations (a tuple) contains one source per catalogues (A, B and C), then we will compute five probabilities: one for each possible configuration (ABC, AB_C, A_BC, AC_B and A_B_C in which the underscore “_” separates the catalogue entries associated to different actual sources, see Sect. 6.2.2). Now, if one source from A has a candidate in B and no candidate in C, we will compute only two probabilities (AB and A_B, see Sect. 6.2.1) considering only the result of the crossmatch of A with B. Likewise for A and C only and for B and C only. These four crossmatches will yield eleven distinct probabilities. It is possible to deal with “missing” detections when computing photometrically based likelihoods (taking into account limit fluxes, ...) but it is not the case in the astrometric part of this work.
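The configurations quoted above (ABC, AB_C, A_BC, AC_B, A_B_C) are the set partitions of the detections into groups coming from the same real source. A minimal sketch of their enumeration, with purely illustrative names:

```python
# Enumerate all partitions of a list of detections into groups of detections
# assumed to come from the same "real" source (one hypothesis per partition).
def partitions(items):
    if not items:
        yield []
        return
    head, rest = items[0], items[1:]
    for part in partitions(rest):
        # put the head detection into one of the existing groups...
        for i in range(len(part)):
            yield part[:i] + [[head] + part[i]] + part[i + 1:]
        # ...or into a group of its own (a distinct real source)
        yield part + [[head]]

hyps = list(partitions(["A", "B", "C"]))  # 5 hypotheses for 3 catalogues
```

For three catalogues this yields the five configurations listed in the text, the first one being the full match ABC.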
When χ-matching n catalogues, the number of hypotheses to be tested, and thus the number of probabilities to be computed for a given set of associations, increases dramatically with n. This number is 203 for six catalogues and reaches 877 for seven catalogues (see Table 2 in Sect. 6.2.4). To be able to compute probabilities when χ-matching more than seven catalogues, we may start by merging catalogues for which the probability of making spurious associations is very low (e.g. catalogues of similar wavelength and similar astrometric accuracy), and handle the merged catalogue as a single input catalogue.
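The counts quoted above (203 for six catalogues, 877 for seven) are Bell numbers, i.e. the numbers of set partitions of n elements. A minimal sketch using the Bell triangle recurrence (function name illustrative):

```python
# Bell number B(n): number of partitions of n detections into groups of
# detections from the same real source, computed with the Bell triangle.
def bell(n):
    row = [1]
    for _ in range(n - 1):
        nxt = [row[-1]]          # new row starts with the last Bell number
        for v in row:
            nxt.append(nxt[-1] + v)
        row = nxt
    return row[-1]
```

`bell(6)` and `bell(7)` reproduce the 203 and 877 hypotheses quoted in the text.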
In Sect. 3 we lay out the assumptions we use to work on a simplified problem. We then (Sect. 4) define the notations and standards used throughout the paper and link them to the conventions adopted in a few catalogues. We then describe in detail the candidate selection criterion (Sect. 5) before providing (Sect. 6) an exhaustive list of all the hypotheses we have to account for to apply Bayes’ formula. In Sect. 5 we also show how the “Bayesian cross-match” of Budavári & Szalay (2008) may be interpreted as an inhomogeneous χ-match. Then (Sect. 7) we show how it is possible to estimate the rates of spurious associations and hence the “priors”. In Sect. 8 we compute an integral related to the probability that the selection criterion selects a set of n sources under a given hypothesis. This integral is crucial to compute the likelihoods defined in Sect. 9 and to normalize them in Sect. 10. Finally, after showing how to introduce photometric data into the probabilities (Sect. 11), and before concluding (Sect. 14), we describe the tests we carried out on synthetic catalogues in Sect. 12. Since this paper is long and technical, we provide a summary of the steps to follow to perform a probabilistic χ-match in Sect. 13.
3. Simplifying assumptions
Cross-correlating catalogues while taking into account an accurate model of the sky on the one hand, and the effects and biases due to the catalogue-building process on the other hand, is a daunting task. To make progress towards this objective, we have to start by making simplifying assumptions.
First of all, we assume that there are no systematic offsets between the positions of each possible pair of catalogues. This means that the positions are accurate (no bias). We also assume that the positional errors provided in the catalogues are trustworthy. This means that they are neither overestimated nor underestimated: for instance, no systematics have to be quadratically added or removed. The first point supposes an accurate astrometric calibration of all catalogues. This is somewhat a “dog chasing its tail” problem, since a proper astrometric calibration should be based on secure identifications, themselves based on... cross-identification! Ideally, the astrometric calibration and the cross-identification should be performed simultaneously in an iterative process. This will not be developed here, but we point out that the present work can be used to astrometrically calibrate n catalogues at the same time from one reference catalogue, taking into account all possible associations in all possible catalogue subsets. However, carrying out a careful identification of primary or secondary astrometric standards is only important when the density of bright astrometric references is very low, typically in deep small-field exposures. Reliable cross-identification is also crucial when the wavelength band of the image to calibrate differs widely from that of the astrometric reference image. In most large-scale surveys such as 2MASS (Skrutskie et al. 2006) or SDSS (Pier et al. 2003), the density of bright Tycho-2 (Høg et al. 2000) or UCAC (Zacharias et al. 2004) astrometric reference stars is high enough to ensure an excellent overall calibration without any ambiguity in the associations.
Although the idealized vision of an immutable and static sky is long gone, we ignore proper motions in this analysis. There are at least two ways of taking them into account: either we may force associations to include at least one source from a catalogue containing measured proper motions; or we may try to fit proper motions during the cross-match process. In the latter case, if a set of n sources detected at different epochs in n distinct catalogues does not satisfy the candidate selection defined in Sect. 5.2, we may make the hypothesis that they nonetheless come from the same source, but one having a proper motion. We can then estimate the proper motion and the associated error based on the positions, the (Gaussian) positional errors, and the epochs (see Appendix B). From the n observed positions and associated errors, and from the n theoretical estimated positions and associated errors, we can compute a Mahalanobis distance which follows a χ distribution with 2(n−2) degrees of freedom. Similarly to the candidate selection criterion in Sect. 5.2, we can then reject the hypothesis “same source with proper motion” if the Mahalanobis distance is larger than a given threshold.
We neglect clustering effects. We suppose that in a given area Ω, source properties are homogeneous. This implies that the local density of sources, the positional error distributions, and the association priors (probabilities of true associations, which in principle depend on the astrophysical nature of the sources and on the limiting flux) are uniform over the sky area considered. As usual, we face the following dilemma: on the one hand, the larger the area Ω, the better the statistics; on the other hand, the larger the area Ω, the less probable the hypothesis of uniform density, error distributions, and priors. In the ARCHES project, for instance, we grouped the individual XMM-Newton EPIC fields of view of ≈0.126 deg^{2} each into installments of homogeneous exposure times and galactic latitudes so as to ensure as much uniformity as possible. Each installment contained on the order of several hundred sources.
Finally, we neglect blending. If two sources are separated in one catalogue and blended in the other, the position of the blended source will be close to the photocentre of the two sources. Either the blended source will not match any of the two distinct sources, or only one of the two distinct sources will match, the match likely being in the tail of the Rayleigh distribution and thus possibly leading to a low probability of identification. It will then not be problematic to consider the match as spurious, since the observed flux is contaminated by the flux of the nearby source. If instead the positional accuracy of the blended source is well below that of the distinct sources, both distinct sources will match the blended source, leading to a non-unique association requiring further investigation to be disentangled.
4. Notations and links with catalogues
4.1. Notations
This article uses almost exclusively the notations defined in the ISO 80000-2:2009(E) international standard. Exceptionally, we waive the notation det A for the determinant and replace it by the equivalent but more compact notation |A|.
We consider n catalogues defined on a common surface of area Ω. We assume that each catalogue source has individual elliptical positional errors defined by a bivariate normal (or binormal) distribution. For this, we locally assimilate the surface of the sphere to its zenithal (or azimuthal) equidistant projection (see the ARC projection in Calabretta & Greisen 2002), that is, to its local Euclidean tangent plane. In this frame, the position of a point at a distance of d arcsec from the origin O (the tangent point) and having a position angle ϕ (east of north) is simply

x = d sin ϕ,    (3)
y = d cos ϕ.    (4)

This approximation is acceptable since the typical positional errors, distances, and surfaces locally considered are small.
We note f_{S}(p) the binormal probability density function (p.d.f.) representing the position of a source S and its associated uncertainty:

f_{S}(p) = (2π)^{−1} |V|^{−1/2} exp(−Q(p)/2),    (5)

with

– μ = (μ_{x},μ_{y})^{⊺} the position of the source S provided in a catalogue, that is the mean of the binormal distribution;

– V the provided variance-covariance – also simply called covariance – matrix which defines the error on the source position,
V = ( σ_{x}²  ρσ_{x}σ_{y} ; ρσ_{x}σ_{y}  σ_{y}² );

– p = (x,y)^{⊺} any given two-dimensional position;

– Q(p) the quadratic form Q(p) = (p−μ)^{⊺}V^{−1}(p−μ), that is the square of the weighted distance between a given position p and the position of the source S;

where

– σ_{x} is the standard deviation along the x-axis (i.e. the east axis);

– σ_{y} is the standard deviation along the y-axis (i.e. the north axis);

– ρ is the correlation factor between the x and y components;

– |V| = (1−ρ²)σ_{x}²σ_{y}², the determinant of V;

– dp = dx dy.

A covariance matrix V represents a 1σ ellipse. The “real” position of the source S has a ≈39% chance of being located inside this 1σ-ellipse. It must not be confused with the one-dimensional 1σ-segment, which contains the real “value” with a probability of ≈68%.
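As a quick numerical check of the two confidence levels quoted above (assumed to be the standard Gaussian results), the 2-d 1σ-ellipse probability is 1 − exp(−1/2) and the 1-d 1σ-interval probability is erf(1/√2):

```python
# Probability content of the 2-d 1-sigma error ellipse vs the 1-d
# 1-sigma interval of a Gaussian distribution.
import math

p_2d = 1.0 - math.exp(-0.5)            # ~0.39: inside the 1-sigma ellipse
p_1d = math.erf(1.0 / math.sqrt(2.0))  # ~0.68: inside the 1-d 1-sigma segment
```

The two values (≈39% and ≈68%) match the probabilities given in the text.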
4.2. Classical positional errors in catalogues
In astronomical catalogues like the 2MASS All-Sky Catalog of Point Sources (2MASS-PSC, Cutri et al. 2003), positional errors are described by three parameters defining the 1σ positional uncertainty ellipse^{2}: err_maj or a the semi-major axis, err_min or b the semi-minor axis, and err_ang or ψ the position angle (east of north) of the semi-major axis. We give the formulae to transform the ellipse into a covariance matrix (see Appendix A.2 of Pineau et al. 2011b and footnote 11 in Fioc 2014):

σ_{x}² = a² sin²ψ + b² cos²ψ,
σ_{y}² = a² cos²ψ + b² sin²ψ,
ρσ_{x}σ_{y} = (a² − b²) sinψ cosψ.

In the AllWISE catalogue (Cutri et al. 2014), the coefficients of the covariance matrix are (almost) directly available. Instead of providing the unitless correlation factor ρ or the covariance ρσ_{x}σ_{y} (in arcsec²), the authors chose to provide the co-sigma (σ_{αδ}) because, as they state^{3}, the latter is in the same units as the other uncertainties. We thus have

ρσ_{x}σ_{y} = sign(σ_{αδ}) σ_{αδ}².

In catalogues like the Sloan Digital Sky Survey since its eighth data release (SDSS-DR8, Aihara et al. 2011), positional errors contain two terms: the error on RA (raErr) and the error on Dec (decErr). In this case the parameters of the covariance matrix are simply

σ_{x} = raErr, σ_{y} = decErr, ρ = 0.

In catalogues like the XMM catalogues (e.g. the 3XMM-DR5, Rosen et al. 2016), a single error is provided. Ideally, one would like to have access to the two one-dimensional errors, even if their respective values are often very close. The column named radecErr is the total error, that is the quadratic sum of the two computed (but not provided) one-dimensional errors, one computed on RA and one computed on Dec. If one uses σ_{x} = radecErr and σ_{y} = radecErr, the total error will be √2 × radecErr instead of radecErr. In output of the astrometric calibration process, the XMM pipeline provides a systematic error sysErrCC which is quadratically added to radecErr to compute the “total radial position uncertainty”^{4} posErr. As for radecErr, we must divide posErr by √2 to obtain the one-dimensional error.
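The 2MASS-style ellipse-to-covariance conversion above can be sketched as follows (function and variable names are illustrative, not from any catalogue pipeline):

```python
# Convert a 1-sigma error ellipse (semi-major a, semi-minor b, position
# angle psi east of north) into covariance matrix coefficients, with the
# x-axis pointing east and the y-axis pointing north.
import math

def ellipse_to_covariance(a, b, psi_deg):
    """Return (sigma_x^2, sigma_y^2, rho*sigma_x*sigma_y)."""
    psi = math.radians(psi_deg)
    s, c = math.sin(psi), math.cos(psi)
    var_x = (a * s)**2 + (b * c)**2      # variance along the east axis
    var_y = (a * c)**2 + (b * s)**2      # variance along the north axis
    cov_xy = (a**2 - b**2) * s * c       # covariance term
    return var_x, var_y, cov_xy
```

For ψ = 90° (major axis pointing east) this reduces, as expected, to σ_x = a, σ_y = b and a null covariance.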
The appropriate errors to be used (including a systematic) are then

σ_{x} = σ_{y} = posErr/√2 = √((radecErr² + sysErrCC²)/2).

The factor 1/√2 was not taken into account in Pineau et al. (2011b). It partly explains why the fit of the curve in the right panel of Fig. 3 of that paper, mentioned in its Sect. 5, does not lead to a Rayleigh scale parameter equal to 1.
Similarly to the XMM case, the error posErr provided in the GALEX All-Sky Survey Source Catalog (GASC) catalogue^{5} (which also includes the systematic) is a “total radial error”. It is thus the Rayleigh parameter σ, which is the quadratic sum of two one-dimensional errors. As for XMM, the appropriate errors to be used are

σ_{x} = σ_{y} = posErr/√2.

In catalogues like the ROSAT All-Sky Bright Source Catalogue (1RXS, Voges et al. 1999), the error provided is the radius of the cone containing the real position of a source with a probability of ≈68.269% (the one-dimensional 1σ). Authors like Rutledge et al. (2003; given the details provided in Voges et al. 1999, Sect. 3.3.3) call this radius the 1σ-radius. We note it r_{68%}. But, in the Rayleigh distribution, the scale parameter σ is defined such that the cone of radius r = σ contains the real position with a probability 100 × (1−exp(−1/2)) ≈ 39.347%. Adjusting σ such that 1 − exp(−r_{68%}²/(2σ²)) = 0.68269 leads to

σ_{x} = σ_{y} = r_{68%}/1.515.    (21)

Similarly, if the provided error is the radius of the cone containing the real position with a probability of 90% (e.g. in the WGACAT, White et al. 2000),

σ_{x} = σ_{y} = r_{90%}/2.146.    (22)

The description (White et al. 1997) and the online documentation^{6} of the FIRST catalogue (Helfand et al. 2015a,b) provide an “empirical expression” to compute the semi-major and semi-minor axes of the 90% positional accuracy ellipse associated with each source, in which fMaj (fMin) is the major (minor) axis of the fitted FWHM, RMS “is a local noise estimate at the source position” and Fpeak is the peak flux density. The position angle ψ of the accuracy ellipse equals the fitted FWHM angle fPA. We first obtain the 1σ accuracy ellipse by resizing the 90% ellipse axes, dividing them by the same factor as for the WGACAT (i.e. 2.146). After possibly adding systematics, the variance-covariance matrix is obtained by applying the equations used for the 2MASS catalogue.
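The confidence-radius-to-Rayleigh-scale conversions above follow from inverting the Rayleigh cumulative distribution; a minimal sketch (function name illustrative):

```python
# Convert a confidence radius r_p (cone containing the true position with
# probability p) into the Rayleigh scale parameter sigma, by inverting the
# Rayleigh CDF 1 - exp(-r^2 / (2 sigma^2)) = p.
import math

def rayleigh_sigma(r, p):
    return r / math.sqrt(-2.0 * math.log(1.0 - p))

# Conversion factors quoted in the text:
f68 = math.sqrt(-2.0 * math.log(1.0 - 0.68269))  # ~1.515 (1RXS case)
f90 = math.sqrt(-2.0 * math.log(1.0 - 0.90))     # ~2.146 (WGACAT, FIRST)
```

These reproduce the divisors 1.515 and 2.146 of Eqs. (21) and (22).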
Errors in catalogues like the Guide Star Catalog Version 2.3.2^{7} (Lasker et al. 2007, GSC2.3) should not be used in the framework of this paper. As stated in Table 3 of Lasker et al. (2008), these astrometric and photometric errors are not formal statistical uncertainties but a raw and conservative estimate to be used for telescope operations.
Table 1 summarizes the transformation of catalogue positional errors into the coefficients of the covariance matrices V.
Table 1. Summary of the transformations of positional errors provided in various astronomical catalogues into the coefficients of error covariance matrices (before quadratically adding possible systematics).
5. Candidate selection: the χ-match
We make the hypothesis that n sources from n distinct catalogues are n independent detections of the same real source. With p the unknown position of the real source and μ_{i} the observed position of detection i, the probability for the n detections to be located at the observed positions is expressed by the joint density function:

f_{p}(μ_{1},μ_{2},...,μ_{n} | p) = ∏_{i=1}^{n} (2π)^{−1} |V_{i}|^{−1/2} exp(−Q_{i}(p)/2),    (27)

in which Q_{i}(p) = (μ_{i}−p)^{⊺}V_{i}^{−1}(μ_{i}−p).
5.1. Estimation of the real position given n observations
We introduce the notations μ_{Σ} and V_{Σ} for the weighted mean position of the n sources and its associated error, respectively. The inverse of the covariance matrix V_{Σ} is

V_{Σ}^{−1} = Σ_{i=1}^{n} V_{i}^{−1},    (28)

leading to (see the demonstration in Sect. A.1)

V_{Σ} = (Σ_{i=1}^{n} V_{i}^{−1})^{−1},    (29)

which is used in the weighted mean position expression

μ_{Σ} = V_{Σ} Σ_{i=1}^{n} V_{i}^{−1} μ_{i}.    (30)

Using both the weighted mean position and its error, the sum of quadratics in Eq. (27) can be divided into two parts and written as (see the demonstration in Sect. A.2)

Σ_{i=1}^{n} Q_{i}(p) = Q_{p}(p) + Q_{χ2},    (31)

with

Q_{p}(p) = (p−μ_{Σ})^{⊺}V_{Σ}^{−1}(p−μ_{Σ})    (32)

and

Q_{χ2} = Σ_{i=1}^{n} Q_{i}(μ_{Σ})    (33)
       = Σ_{i=1}^{n} (μ_{i}−μ_{Σ})^{⊺}V_{i}^{−1}(μ_{i}−μ_{Σ}).    (34)

In the case of two catalogues, the latter term can be written as in Eq. (51). Moreover, if both covariances are null, it takes the simple and common form

Q_{χ2} = (μ_{x,1}−μ_{x,2})²/(σ_{x,1}²+σ_{x,2}²) + (μ_{y,1}−μ_{y,2})²/(σ_{y,1}²+σ_{y,2}²).    (35)

Back to the general case, the term Q_{χ2} can also be put in the more computationally efficient form (only one loop over i)

Q_{χ2} = Σ_{i=1}^{n} μ_{i}^{⊺}V_{i}^{−1}μ_{i} − μ_{Σ}^{⊺}V_{Σ}^{−1}μ_{Σ}.    (36)

From these formulae, it appears that the weighted mean position (Eq. (30)) is the maximum likelihood estimator of the “true” position of the source: the second term (Q_{χ2}) is constant with respect to p, so the maximum of the likelihood function ℒ(p;μ_{1},μ_{2},...,μ_{n}) = f_{p}(μ_{1},μ_{2},...,μ_{n} | p) is obtained when the first term (Q_{p}) is null, that is when p = μ_{Σ}. The error on this estimate is simply V_{Σ}, the inverse of the Hessian of the likelihood function.
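As a numerical sanity check of the decomposition above (with randomly generated positions and covariances, all names illustrative), the residual form of Q_{χ2} and its computationally efficient one-loop form agree:

```python
# Check that sum_i (mu_i - mu_S)^T V_i^-1 (mu_i - mu_S) equals
# sum_i mu_i^T V_i^-1 mu_i - mu_S^T V_S^-1 mu_S, where mu_S is the
# weighted mean position and V_S^-1 = sum_i V_i^-1.
import numpy as np

rng = np.random.default_rng(1)
mus = rng.normal(size=(3, 2))                 # three observed 2-d positions
Vs = []
for _ in range(3):
    A = rng.normal(size=(2, 2))
    Vs.append(A @ A.T + 0.1 * np.eye(2))      # random SPD covariance matrix

V_sigma_inv = sum(np.linalg.inv(V) for V in Vs)
V_sigma = np.linalg.inv(V_sigma_inv)
mu_sigma = V_sigma @ sum(np.linalg.inv(V) @ mu for V, mu in zip(Vs, mus))

q_direct = sum((mu - mu_sigma) @ np.linalg.inv(V) @ (mu - mu_sigma)
               for V, mu in zip(Vs, mus))
q_fast = (sum(mu @ np.linalg.inv(V) @ mu for V, mu in zip(Vs, mus))
          - mu_sigma @ V_sigma_inv @ mu_sigma)
```

Both expressions give the same Q_{χ2} up to floating-point rounding.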
5.2. Candidate selection criterion
For the candidate selection, we are interested in the probability that the n sources are located at the same position. Let us first rewrite Eq. (27) to exhibit the product of a binormal distribution by another multidimensional normal law:

f_{p}(μ_{1},...,μ_{n} | p) = [(2π)^{−1}|V_{Σ}|^{−1/2} exp(−Q_{p}(p)/2)] × [(2π)^{−(n−1)} (|V_{Σ}|/∏_{i=1}^{n}|V_{i}|)^{1/2} exp(−Q_{χ2}/2)].    (37)

When integrating Eq. (37) over all possible positions (i.e. over p), the first term integrates to 1, since it is the p.d.f. of a normal law in p, so we obtain

∫ f_{p}(μ_{1},...,μ_{n} | p) dp = (2π)^{−(n−1)} (|V_{Σ}|/∏_{i=1}^{n}|V_{i}|)^{1/2} exp(−Q_{χ2}/2).    (38)

We are supposed to integrate over the surface of the unit sphere. But the errors being small, we consider infinity to be at a relatively close distance, before the effects of the sphere’s curvature become non-negligible.
In the previous equation, only the Q_{χ2} term remains. It can also be written (see the demonstration in Sect. A.3)

Q_{χ2} = Σ_{i=1}^{n} Σ_{j>i} (μ_{i}−μ_{j})^{⊺} V_{i}^{−1} V_{Σ} V_{j}^{−1} (μ_{i}−μ_{j}).    (39)

Equation (38) is equivalent to P(D | H) in Budavári & Szalay (2008), and Eq. (39) – multiplied by the factor in the exponential (Eq. (38)) – is the generalization to elliptical errors of Eq. (B12) in Budavári & Szalay (2008). In practice, we never use Eq. (39), since the number of terms to be computed increases as O(n(n−1)/2), while it increases as O(n) in Eq. (36) or in its iterative form (see Eq. (48) in Sect. 5.3). We use here the big O notation, to be read as “of the order of”.
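The pairwise form of Q_{χ2} can be checked numerically against the direct residual sum (random positions and diagonal covariances, all names illustrative):

```python
# Check that the pairwise sum over i<j of
# (mu_i - mu_j)^T V_i^-1 V_S V_j^-1 (mu_i - mu_j)
# equals the direct sum of residuals about the weighted mean position.
import numpy as np

rng = np.random.default_rng(3)
mus = rng.normal(size=(3, 2))
Vs = [np.diag(rng.uniform(0.5, 2.0, 2)) for _ in range(3)]
V_inv = [np.linalg.inv(V) for V in Vs]

V_sigma = np.linalg.inv(sum(V_inv))
mu_sigma = V_sigma @ sum(Vi @ mu for Vi, mu in zip(V_inv, mus))

q_direct = sum((mu - mu_sigma) @ Vi @ (mu - mu_sigma)
               for Vi, mu in zip(V_inv, mus))
q_pairwise = sum((mus[i] - mus[j]) @ V_inv[i] @ V_sigma @ V_inv[j]
                 @ (mus[i] - mus[j])
                 for i in range(3) for j in range(i + 1, 3))
```

The O(n²) pairwise sum and the O(n) residual sum agree, illustrating why the latter is preferred in practice.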
We can see Eq. (34) as the result of a 2n-dimensional weighted least squares in which the model is the “real” position of the source and the solution is μ_{Σ} (by similarity with Eq. (31)). Putting all the positional error matrices in a 2n × 2n block-diagonal matrix M, Q_{χ2} is the square of the Mahalanobis distance defined by

D_{M}(μ) = √((μ − m(μ_{Σ}))^{⊺} M^{−1} (μ − m(μ_{Σ}))),    (40)

D_{M}²(μ) = Q_{χ2},    (41)

in which μ = (μ_{1},...,μ_{n})^{⊺} is the 2n-dimensional vector of observed positions and m(μ_{Σ}) = (μ_{Σ},...,μ_{Σ})^{⊺} repeats n times the weighted mean position. In our particular case, D_{M}²(μ) follows a χ² distribution with 2(n−1) degrees of freedom or, equivalently, is the sum of (n−1) χ² variables with two degrees of freedom each. Equation (40) is probably the Mahalanobis distance mentioned, without giving its expression, in Adorf et al. (2006).
If D_{M}²(μ) follows a χ²_{d.o.f. = 2(n−1)} distribution, then its square root, the distance D_{M}(μ), follows a χ_{d.o.f. = 2(n−1)} distribution.
We perform a statistical hypothesis test on a set of n sources, defining the null hypothesis H_{0} as follows: all sources in the set are detections of the same “real” source. The alternative hypothesis H_{1} would thus be: not all sources in the set are detections of the same “real” source; in other words, the set of n sources contains at least one spurious source; or, expressed differently, the n sources are n observations of at least two distinct real sources. We adopt Fisher’s approach, that is, we will reject the null hypothesis if, the null hypothesis being true, the observed data is significantly unlikely.
From now on, we indifferently write x or D_{M} for the Mahalanobis distance. Assuming the null hypothesis is true, the “theoretical” probability of getting the actually computed (square of the) Mahalanobis distance is given by a Chi (Chi-square) distribution with 2(n−1) degrees of freedom:

p(x | H_{0}) = χ_{2(n−1)}(x) = x^{2n−3} e^{−x²/2} / (2^{n−2} Γ(n−1)),    (42)
p(x² | H_{0}) = χ²_{2(n−1)}(x²) = (x²)^{n−2} e^{−x²/2} / (2^{n−1} Γ(n−1)).    (43)

The probability of getting an actually computed (square of the) Mahalanobis distance less than or equal to a given threshold (or critical value) k_{γ} is given by the value of the cumulative distribution function of the Chi (Chi-square) distribution at that threshold:

P(x ≤ k_{γ} | H_{0}) = F_{χ_{2(n−1)}}(k_{γ}).    (44)

We can indifferently work on x with the χ distribution or on x² with the χ² distribution. The threshold k_{γ} we obtain on x is simply the square root of the threshold we obtain on x². Although we find the Chi test more natural in the present case, most astronomers are familiar with the Chi-square test.
In the framework of statistical hypothesis tests, it is the complementary cumulative distribution function (or tail distribution) which is usually used, by defining the p-value

p-value(x) = 1 − F_{χ_{2(n−1)}}(x),    (45)

and fixing a significance level α defined by

α = 1 − γ.    (46)

The null hypothesis is then rejected if p-value < α. In the Neyman-Pearson framework, α is the type I error, or false positive rate, that is, the probability of rejecting the null hypothesis (positive rejection test) while it is true (wrong/false decision). In our case we fix γ (hereafter called completeness), the fraction of real associations we “theoretically” select over all real associations. The candidate selection criterion, or failure-of-rejection criterion, we use is then

D_{M}(μ) ≤ k_{γ},    (47)

in which F_{χ_{2(n−1)}}(k_{γ}) = γ or, equivalently, F_{χ²_{2(n−1)}}(k_{γ}²) = γ. This inequality is equivalent to p-value < α. It is important to write “failure of rejection” since nothing proves that the null hypothesis is true if Eq. (47) is satisfied: at this point the selected set of sources is nothing more than a set of candidates. Nevertheless, we do call the set of D_{M}(μ) values satisfying Eq. (47) the region of acceptance. This region of acceptance will be useful to define the domain of integration used to normalize the likelihoods when computing probabilities for each hypothesis from Sect. 7 onwards. Its volume (see e.g. Eq. (64)) is the volume of the 2n-ellipsoid defined by M (see Eq. (40)) divided by that of the error ellipse associated with the weighted mean position μ_{Σ} and defined by V_{Σ} (it thus is a volume in a 2(n−1)-dimensional space).
In practice, the value of k_{γ} is computed numerically using Newton's method to solve F_{χ_{2(n−1)}}(k_{γ}) − γ = 0. The initial guess we use is the approximate value returned by Eq. (A.3) of Inglot (2010).
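As an illustration, a minimal pure-Python sketch of this numerical solution (all names are ours; we use the closed-form χ^{2} CDF valid for even degrees of freedom instead of a special-function library, and a crude initial guess rather than the Inglot (2010) approximation):

```python
import math

def chi2_cdf_even(q, df):
    """CDF of the chi-square distribution for an even number of degrees of
    freedom df = 2m: 1 - exp(-q/2) * sum_{j<m} (q/2)^j / j!."""
    m = df // 2
    s = sum((q / 2.0) ** j / math.factorial(j) for j in range(m))
    return 1.0 - math.exp(-q / 2.0) * s

def chi2_pdf(q, df):
    """p.d.f. of the chi-square distribution (even df), used as the
    derivative of the CDF in the Newton iteration."""
    m = df // 2
    return q ** (m - 1) * math.exp(-q / 2.0) / (2.0 ** m * math.factorial(m - 1))

def k_gamma(gamma, n_catalogues, tol=1e-12, max_iter=200):
    """Threshold k_gamma on the Mahalanobis distance for a chi-match of
    n_catalogues catalogues at completeness gamma, via Newton's method
    on q = k^2 (the chi-square quantile)."""
    df = 2 * (n_catalogues - 1)   # always even, so the closed-form CDF applies
    q = float(df)                 # crude initial guess: mean of the chi^2_df law
    for _ in range(max_iter):
        step = (chi2_cdf_even(q, df) - gamma) / chi2_pdf(q, df)
        q -= step
        if abs(step) < tol:
            break
    return math.sqrt(q)           # threshold on x, not on x^2
```

Starting at q = df, the iterates increase monotonically towards the root (the CDF is concave there), so the iteration is safe without damping.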
The value of γ we fix is independent of n, the number of catalogues. In practice we often set this input parameter to γ = 0.9973. In one dimension this value leads to k_{γ} = 3, that is the famous 3σ rule. It means that for 10 000 real associations in a dataset, we theoretically miss 27 of them by applying the candidate selection criterion. From now on we call this cross-correlation a χ_{γ}-match, or simply a χ-match.
In the particular case of two catalogues, D_{M}(μ) follows a χ distribution with two degrees of freedom – that is, a Rayleigh distribution – and k_{γ = 0.9973} = 3.443935. This latter value is used in the two-catalogue χ-match of Pineau et al. (2011b).
5.3. Iterative form: catalogue by catalogue
Somewhat similarly to the Bayes factor in Budavári & Szalay (2008, Sect. 6), it is noteworthy that Q_{χ2} can be computed iteratively, by summing (n−1) successive χ^{2} terms with two degrees of freedom computed from (n−1) successive two-catalogue crossmatches.
After each iteration, the new position to be used for the next crossmatch is the weighted mean of all already matched positions, and the new associated error is the error on this weighted mean. The strict equality between Eq. (48) and the non-iterative form, for example Eq. (34), proves that the result is independent of the order of the successive crossmatches.
The maximum number of crossmatches to be performed must be known in advance in order to put an upper limit on k_{γ}, since it depends on the number of degrees of freedom of the total χ^{2}. The iteration formula is simply (48). We find it from the two-catalogue case, for which (see Sect. A.4) (51). We can show by direct calculation that (52), and so, iteratively, we find the general expression (53), which is consistent with Eq. (38). The volume of the acceptance region of the statistical hypothesis test is the volume of a 2(n−1)-dimensional ellipsoid. More precisely, it is the product of the previous equation Eq. (53) by the volume of a 2(n−1)-sphere of radius k_{γ}. This will be crucial when computing the rate of spurious associations.
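The catalogue-by-catalogue iteration described above can be sketched as follows (pure Python, 2-D positions and 2 × 2 covariance matrices; all names are ours, not the authors' code):

```python
def _inv(V):
    """Inverse of a 2x2 covariance matrix ((a, b), (b, d))."""
    (a, b), (c, d) = V
    det = a * d - b * c
    return ((d / det, -b / det), (-c / det, a / det))

def _mv(M, v):
    """2x2 matrix times 2-vector."""
    return (M[0][0] * v[0] + M[0][1] * v[1], M[1][0] * v[0] + M[1][1] * v[1])

def q_chi2(positions, covariances):
    """Total squared Mahalanobis distance, computed catalogue by catalogue:
    each step adds a two-d.o.f. chi^2 term between the new source and the
    running weighted mean, then updates the mean and its covariance."""
    mu, V = positions[0], covariances[0]
    q = 0.0
    for mu_i, V_i in zip(positions[1:], covariances[1:]):
        d = (mu_i[0] - mu[0], mu_i[1] - mu[1])
        S = ((V[0][0] + V_i[0][0], V[0][1] + V_i[0][1]),
             (V[1][0] + V_i[1][0], V[1][1] + V_i[1][1]))
        w = _mv(_inv(S), d)
        q += d[0] * w[0] + d[1] * w[1]          # chi^2 increment, 2 d.o.f.
        W1, W2 = _inv(V), _inv(V_i)
        V = _inv(((W1[0][0] + W2[0][0], W1[0][1] + W2[0][1]),
                  (W1[1][0] + W2[1][0], W1[1][1] + W2[1][1])))
        b = tuple(p + r for p, r in zip(_mv(W1, mu), _mv(W2, mu_i)))
        mu = _mv(V, b)                          # new weighted mean position
    return q
```

The order independence stated above can be checked numerically by permuting the input catalogues: the returned value is unchanged.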
5.4. Iterative form: by groups of catalogues
Instead of iterating over catalogues one by one, we can also perform G sub-crossmatches, each associating n_{g} distinct sources, the group sizes n_{g} summing to n. We note Q_{χ2, { g }} the square of the Mahalanobis distance associated with the group g: (54). We show that we can compute Q_{χ2} iteratively from the G weighted mean positions μ_{Σ{g}} and their associated errors. The square of the Mahalanobis distance can be written (55). In other words, the square of the Mahalanobis distance is the sum of the squares of the intra-group Mahalanobis distances plus the inter-group iterative one, with k being an index defined inside each of the G groups { g }. In fact, it is a straightforward generalization of the G = 2 groups case. Here again, (63). Again, k_{γ} depends on the number of degrees of freedom of the total χ^{2}, thus on the total number of cross-correlated tables. It means that to be complete, all sub-cross-correlations must use the candidate selection threshold k_{γ}(2(n−1)) computed from the total number of tables instead of k_{γ}(2(n_{g}−1)) computed from the number of tables in a group.
5.5. Summary and Interpretation
Equations (34), (36), (39), (40), (48) and (55) are all equivalent: they lead to the same value, that is to the same squared Mahalanobis distance. All sources are retained as possible candidates if Eq. (47) is verified, that is if the Mahalanobis distance is smaller than or equal to k_{γ}. This threshold is the value of the inverse cumulative χ distribution function at the chosen completeness γ, for 2(n−1) degrees of freedom.
As this criterion is none other than a χ-test criterion (or a χ^{2}-test criterion if we work on squared Mahalanobis distances), we call the result of such a criterion a χ-match.
The χ-match criterion defines a region of acceptance which is a 2(n−1)-ellipsoid of radius k_{γ}. Its volume is computed from Eq. (53): (64), with the volume of a 2(n−1)-sphere of radius k_{γ}. It will later be used to compute the expected number of spurious associations.
5.6. Comment on the “Bayesian crossmatch” of Budavári & Szalay (2008)
We mention in Sect. 6.1 what appears to be a conceptual problem in calling B (Eq. (65)) a Bayes factor for more than two catalogues in the astrometrical part of Budavári & Szalay (2008).
Performing a crossmatch by fixing a lower limit L on the "Bayes factor" B defined in Eq. (18) of Budavári & Szalay (2008) is none other than performing a χ-match with a significance level which depends both on the number of sources n and on the volume of the 2(n−1)-ellipsoid of radius 1. In fact, using the factor B of Budavári & Szalay (2008), in which w_{i} is the inverse of the circular error on the position of source i and φ_{ij} is the angular distance between sources i and j, we have the equivalence (65). We showed that the quantity on the left side of the inequality is equal to Eq. (39) in the present paper and thus follows a χ^{2} distribution for "real" associations. It means that the "Bayesian" candidate selection criterion B ≥ L is equivalent to a χ^{2} test having a significance level equal to (66). The larger the volume of the 2(n−1)-ellipsoid of radius 1 (∝), the more "real" associations are missed and the fewer spurious associations are retrieved. We could replace the criterion B ≥ L by . This is somewhere between the fixed-radius cone search and the fixed significance level χ-match. The rate of missed "real" associations is not homogeneous but depends on the positional errors. Only if positional errors are constant in all catalogues does the B ≥ L constraint become equivalent to the χ-match, which is then equivalent to a fixed-radius crossmatch.
6. Hypotheses from combinatorial considerations
A χ-match output is made of sets of associations, each set containing one source per catalogue. For each set of associations we want to compute the probability that all sources of the set come from the same actual source. In this section, especially in Sect. 6.2, we make explicit the sets { h_{i} } of hypotheses we have to formulate to compute probabilities of identification when cross-correlating n catalogues.
6.1. Generalities
Given a set { h_{k} } of pairwise disjoint hypotheses whose union is the entire set of possibilities, the law of total probability for an observable x is (67), leading to Bayes' theorem (68). We stress that the Bayes factor (also called likelihood ratio) is defined only in cases involving two, and only two, hypotheses (69), and is used when no trustworthy priors p(h_{1}) and p(h_{2}) are available. We can transform any set of pairwise disjoint hypotheses into two disjoint hypotheses. In this case, using the negation notation ¬, (70), with (71) and (72). Such a likelihood ratio (Eq. (70)) is not interesting since it is not computed from likelihoods only, but also from priors.
The term B(H | K) in Budavári & Szalay (2008, Eq. (8)) is improperly called a Bayes factor when dealing with more than two catalogues. As a matter of fact, the union of the two hypotheses – all sources are from the same real source, and each source is from a distinct real source – is only a subset of all possibilities, so the law of total probability, and hence Bayes' formula, are not valid.
In Pineau et al. (2011b), the term LR(r) in Eq. (9) is also improperly called a likelihood ratio, since a likelihood is a probability density function and so integrates to one over its domain of definition. This is obviously not the case for dp(r | spur) in Eq. (8). The quantity built is related to the ratio of the probability that the association is "real" to the probability that it is spurious, but formally it is not a likelihood ratio. The very same abuse of terminology is made in Wolstencroft et al. (1986; who, moreover, add a prior in the likelihood ratio), in Rutledge et al. (2000), in Brusa et al. (2007), and probably in other publications.
6.2. Possible combinations and the Bell number
Let us suppose we have selected one set of n distinct sources from n different catalogues, one source per catalogue. These n sources are possibly n detections of k distinct real sources, with k ∈ [1,n]. The case k = 1 corresponds to the situation where all sources are n observations of the same real source, and the case k = n corresponds to the situation where there are n distinct real sources detected independently, one in each catalogue.
We call A the source from catalogue number one, B the source from catalogue number two and so on.
6.2.1. Two-catalogues case: two hypotheses
The classical twocatalogues case is trivial. We formulate only two hypotheses:

AB, the match is a real match, the two sources are two observations of the same real source, that is k = 1;

A_B, the match is spurious, the two sources are observations of two different real sources, that is k = 2.
6.2.2. Three-catalogues case: five hypotheses
For three sources A, B and C from three different catalogues, we formulate five hypotheses:

ABC, all three sources come from the same real source, that is k = 1;

AB_C, A and B are from the same real source and C is from a different real source, that is k = 2;

AC_B, A and C are from the same real source and B is from a different real source, that is k = 2;

A_BC, B and C are from the same real source and A is from a different real source, that is k = 2;

A_B_C, all three sources are from three different real sources, that is k = 3.
6.2.3. Four-catalogues case: 15 hypotheses
For four sources A, B, C and D we have to formulate 15 hypotheses:

ABCD, when k = 1;

ABC_D, ABD_C, ACD_B and BCD_A, but also

AB_CD, AC_BD and AD_BC for k = 2;

AB_C_D, AC_B_D, AD_B_C, BC_A_D, BD_A_C and CD_A_B when k = 3;

A_B_C_D when k = 4.
6.2.4. n-catalogues case: Bell number of hypotheses
We now generalize to n catalogues. For each possible value of k, the number of ways the set of n sources can be partitioned into k non-empty subsets – each subset corresponding to a real source – is given by the Stirling number of the second kind S(n,k). The total number of hypotheses to be formulated is equal to the Bell number, which counts the number of partitions of a set and is given by B_{n} = Σ_{k = 1}^{n} S(n,k). (73) Its first seven values are provided in Table 2 and a graphic illustration representing all possible partitions for five catalogues is provided in Fig. 1^{8}.
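The Stirling and Bell numbers above can be computed with the standard recurrences (a small sketch, not from the paper; names are ours):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Stirling number of the second kind: number of ways to partition
    n sources into k non-empty groups (i.e. k distinct real sources)."""
    if k == 0:
        return 1 if n == 0 else 0
    if k > n:
        return 0
    # either source n joins one of the k existing groups, or it forms its own
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

def bell(n):
    """Bell number: total number of hypotheses for n catalogues (Eq. (73))."""
    return sum(stirling2(n, k) for k in range(1, n + 1))
```

As a check, bell(5) returns 52, matching the 52 partitions of Fig. 1, and stirling2(4, 2) returns 7, matching the seven k = 2 hypotheses of Sect. 6.2.3.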
Values of the first seven Bell numbers.
Fig. 1 The 52 partitions of a set with n = 5 elements. Each partition corresponds to one hypothesis for five distinct sources from five distinct catalogues. Left: k = 5, the five sources are from five distinct real sources. Right: k = 1, the five sources are from the same real source. (Tilman Piesk – CC BY 3.0 – modified – link in footnote). 

We face a combinatorial explosion of the number of hypotheses to be tested when increasing the number of catalogues. Although the theoretical developments presented here deal with any number of catalogues, the exhaustive analysis may in practice be limited to a few catalogues (n < 10).
Hereafter we note h_{i} the hypothesis number i, writing it explicitly with letters, for example h_{AB}, and we note h_{k = i} a hypothesis in which the n observed sources are associated with i real sources.
7. Frequentist estimation of spurious associations rates and priors
We have defined a candidate selection criterion to perform χ-matches. We recall that we note x the Mahalanobis distance, and s the "event" x ≤ k_{γ}, that is, the event that a given set of sources satisfies the selection criterion.
As a first step we want to estimate the number of "fully spurious" associations we would expect to find in a χ-match output, and derive the prior p(h_{k = n} | s) from this estimate. By "fully spurious" we mean that each candidate from each catalogue is actually associated with a different "real" source. A good estimate is simply the mean sky area of the test acceptance region (see Eq. (64)) over all possible sources of all catalogues, multiplied by the number of sources in one of the catalogues and by the density of sources in the other ones. Written differently for n catalogues of n_{i} sources each, on a common surface area Ω, it leads to an estimated number of spurious associations equal to: (74). Or, having histograms or, more generally, discretized positional error distributions: (75), in which N_{k} are the numbers of bins in the histograms – or numbers of points in a discrete distribution – and c_{bk} are the numbers of counts in given bins of a histogram. The number of counts may be replaced by the value of the discrete distribution (or weight w_{bk}) times the number of elements: c_{bk} = n_{k}w_{bk}.
To perform quick estimations using only a one-dimensional error histogram per catalogue, we approximate elliptical errors by circular errors of the same surface area.
The remainder of this section explains how we can compute priors from the rate of “fully spurious” associations and the number of associations found in all possible subcrossmatches.
7.1. Case of two catalogues
Let us suppose that we have two catalogues A and B and that each catalogue contains only one source in the common surface area Ω. We note μ_{a1}, V_{a1} and μ_{b1}, V_{b1} the position of the source and the associated covariance matrix in A and B respectively. If we fix the position μ_{a1} of the first source, the second source will be associated with the first one by a χ_{γ}-match if Eq. (47) is satisfied, so if the second source is located in an ellipse of surface area centred on the position of the first source. We temporarily waive the ISO 80000-2 notation det M and replace it by the equivalent and more compact notation |M|. We also replace V_{a1} + V_{b1} by V_{1,1} to rewrite the last term in the pithier form . We now suppose that both sources are unrelated and that μ_{a1} and μ_{b1} are uniformly distributed in Ω. Then, neglecting border effects, the probability that the two sources are associated by chance when performing a χ_{γ}-match is given by the ratio of the surface area of the acceptance ellipse to the total surface area Ω: (76). We now suppose that the second catalogue contains n_{B} sources uniformly distributed in Ω. If all of them are unrelated to the source of the first catalogue, then the estimated number of spurious associations is simply the sum of the previous probability over the n_{B} sources of the second catalogue (77). We now suppose that the first catalogue contains n_{A} sources, also uniformly distributed in Ω, all unrelated to catalogue B sources. Still neglecting border effects, the estimated number of spurious associations is simply the sum of the previous estimation over all catalogue A sources (78). In practice, evaluating this quantity can be time-consuming since we have to compute and sum n_{A} × n_{B} terms. Fortunately, we can evaluate it exactly for circular errors, and approximately for elliptical errors, computing only n_{A} + n_{B} terms.
In fact, (83), and thus (84). For ordinary ellipses, that is ellipses having a position angle different from 0 and π/2, the approximation is valid if . In the particular case of circular errors, Eq. (78) becomes (85), in which and are the mean surface areas of all positional error ellipses in catalogues A and B respectively: (86). For simple circular errors σ_{ai} and σ_{bi}, this reduces to (87). If errors are constant for all sources in each catalogue, this reduces to (88). These estimates, based on geometrical considerations, have the advantage of being very fast to compute.
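For circular errors, the collapse of the double sum described above reduces to a few lines; a sketch under the stated assumptions (uniform source distributions, border effects neglected; the function name is ours):

```python
import math

def n_spurious(sigmas_a, sigmas_b, k_gamma, omega):
    """Estimated number of chance associations in a two-catalogue chi-match
    with circular positional errors (in the spirit of Eq. (87)): the double
    sum over the n_A * n_B pairs collapses to the two mean squared errors,
    so only n_A + n_B terms are evaluated. All angular quantities must use
    consistent units (e.g. sigma in degrees, omega in square degrees)."""
    n_a, n_b = len(sigmas_a), len(sigmas_b)
    mean_sq_a = sum(s * s for s in sigmas_a) / n_a   # <sigma_a^2>
    mean_sq_b = sum(s * s for s in sigmas_b) / n_b   # <sigma_b^2>
    # pi * k^2 * (sigma_a^2 + sigma_b^2) is the acceptance-ellipse area
    return n_a * n_b * math.pi * k_gamma ** 2 * (mean_sq_a + mean_sq_b) / omega
```

With constant errors in each catalogue this reduces to the constant-error form of Eq. (88).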
Theoretically, we should remove from the double summation in Eq. (78) the pairs (i,j) which are real associations. We have no means to do this since we do not know in advance the result of the cross-identification. Fortunately this effect is negligible in common cases. Indeed, if the result of the crossmatch of the two catalogues contains n_{AB} real associations – that is, sources of both catalogues originating from the same real source – and supposing that the positional error distribution of sources having a counterpart is similar to the global error distribution, we should modify Eq. (85) by (89). In practice this estimate will tend to be overestimated since the distribution of sources in a catalogue cannot be uniform, because the limited angular resolution prevents the detection of very close sources in the same image. This effect is usually deemed to be of negligible importance. However, one can detect its presence in particular circumstances. For instance, if the actual counterpart is located in the wings of a much brighter nearby source, it may not be detected. This effect probably accounts for the presence of a fraction of the stellar identifications in high Galactic latitude X-ray surveys, in particular those with much higher F_{x}/F_{opt} flux ratios and harder X-ray spectra than normal for active coronae, in which cases a faint AGN may be the correct identification (Watson 2012; Menzel et al. 2016). One way to account for this effect and to limit the overestimation is to remove from the surface area Ω small areas around each source, whose size depends, for example, on the source brightness. In addition, again because of the angular resolution, for real associations in catalogues having similar positional errors, the chance that a source is also associated with a spurious source is low. More precisely, the start of the Poisson distribution will be truncated. 
In extreme cases in which the Poisson distribution is truncated for x < k_{γ}, meaning that sources in a real association cannot be part of a spurious association, we should remove those sources from the estimate . We thus have to rewrite the previous equation Eq. (89) as (90). Knowing the total number of associations, n_{T}, resulting from the χ-match, we can estimate from Eq. (89) the number of spurious associations, and thus the number of real associations is estimated by (91). If the mean error ellipses in both catalogues are very small compared to the total surface area – that is – we can use the approximation (92), which is equivalent to using directly Eq. (85), that is, without taking care of removing real associations. This is but an estimate, and nothing prevents it from being negative due to count statistics in crossmatches with very few real associations and many spurious associations. In practice, we have to define a lower limit such as .
Hence we can estimate the priors in the sample of associations satisfying the selection criterion (s). After a first two-catalogue crossmatch, we may compare the expected histogram of det V_{i,j} for spurious associations with the same histogram obtained from all associations. We may then derive the estimated distribution of this quantity (det V_{i,j}) for "real" associations and compute the two likelihoods p(det V_{i,j} | h_{AB},s) and p(det V_{i,j} | h_{A_B},s).
Similarly, we may build the histograms of the quantity det V_{Σ} for both the spurious and the "real" associations. This quantity is the determinant of the covariance matrix – that is, the positional error – associated with the weighted mean positions. We proceeded likewise in Pineau et al. (2011b), using the "likelihood ratio" quantity (see our comment above on the abuse of the term likelihood ratio) instead of positional uncertainties.
7.2. Case of three catalogues
We recall that for three catalogues, the output contains five components (see Sect. 6.2.2): ABC, AB_C, A_BC, AC_B, A_B_C. We would like to estimate the number of spurious associations, that is, the number of associations in the four components other than ABC. To do so, we need to perform the three two-catalogue crossmatches: A with B, A with C, and B with C. We are thus able to estimate n_{AB} and n_{A_B}, n_{AC} and n_{A_C}, and finally n_{BC} and n_{B_C}, respectively. To compute n_{AB_C}, we proceed as in the previous section, considering the two catalogues AB and C. AB is the result of the χ-match of A with B: the positions in catalogue AB are the weighted mean positions (μ_{Σ}, Eq. (30)) of the associated A and B sources, and the associated errors (or covariance matrices) are given by V_{Σ} (Eq. (29)). The only difference with the two-catalogue case is that for the first catalogue (AB) we replace the simple mean elliptical error surface over the n_{TAB} entries by the weighted mean accounting for the probabilities that the AB associations are "real" (i.e. not spurious) (95), in which p(h_{AB} | ...) is the probability that the association is a real association knowing some parameters ("..."), and V_{ΣABi} is the covariance matrix of the error on the weighted mean position i (see Sect. 5.3, particularly Eq. (49)). Such a probability will be computed in the next sections. We then compute and derive as in the two-catalogue case, replacing A by AB and B by C. Similarly to , we can estimate and .
We now want to estimate n_{A_B_C}, with a result which is independent of the cross-correlation order. Although we may use Eq. (74), it is possibly time-consuming. Another solution is to use its discretized form Eq. (75). To do so quickly, at the cost of an approximation, we may circularize the errors by replacing the coefficients of the covariance matrix V by a single error equal to and setting the correlation (or covariance) parameter equal to zero. It means that the new covariance matrix is diagonal and that both diagonal elements are equal to . We choose this value to preserve the surface area of the two-dimensional error, since the determinant (∝ area) of the circular error equals the determinant (∝ area) of the ellipse. This approximation is the same as the one made in the previous section. For each catalogue we then make the histogram of values, using steps of, for example, one mas, and we apply Eq. (75). In this case – circular errors – we simplify the equation using (96) and thus (97). We do not use this last form but give it for comparison with the denominator of Eq. (17) in Budavári & Szalay (2008).
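Our reading of the circularization above, as a small sketch (names ours): each covariance matrix is replaced by a diagonal, isotropic one whose two diagonal elements equal the square root of its determinant, which preserves the determinant and hence the error-ellipse area:

```python
import math

def circularize(v11, v12, v22):
    """Replace the 2x2 positional-error covariance matrix
    [[v11, v12], [v12, v22]] by a circular (diagonal, isotropic) one with
    the same determinant, i.e. the same error-ellipse surface area.
    Returns the new (v11, v12, v22), with the covariance term set to zero."""
    det = v11 * v22 - v12 * v12
    s2 = math.sqrt(det)     # common diagonal element: det(diag(s2, s2)) = det
    return s2, 0.0, s2
```

The returned matrix is what would then be histogrammed (one value per source) before applying Eq. (75).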
Another option is to compute the number of "fully" spurious associations three times, following what was done in the previous section (and at the beginning of this section), but computing instead of . Similarly to Eq. (95): (98). Computing , we derive . Similarly, we can compute and , and estimate the number of fully spurious associations by taking the mean of , and .
Having estimated the number of associations belonging to the components AB_C, AC_B, A_BC and A_B_C, and knowing the total number of associations n_{T}, we are able to estimate and to compute the priors, for example (99).
7.3. Case of n catalogues
We can easily generalize the previous section using recursion. For n = 4 catalogues, we estimate the number of associations in component A_B_C_D knowing the number of associations in the result of the four-catalogue χ-match and estimating recursively (from the three-catalogue χ-matches) the number of associations in the 14 other components (AB_C_D, ...). So for n catalogues, the total number of distinct (sub-)crossmatches to be performed to compute all priors recursively is (100), in which the terms are the binomial coefficients (n−1)!/(k!(n−1−k)!). For five catalogues, N_{χ−match} = 26 and for six catalogues, N_{χ−match} = 57.
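The count above can be cross-checked with a one-liner: the recursion needs one χ-match per subset of at least two catalogues, so that N_{χ−match} = 2^n − n − 1 (a sketch; the closed form is our rewriting, consistent with the values quoted above):

```python
from math import comb

def n_chi_match(n):
    """Total number of distinct (sub-)chi-matches needed to compute all
    priors recursively: one chi-match per subset of at least two of the
    n catalogues, i.e. sum of binomial coefficients C(n, k) for k >= 2,
    which equals 2**n - n - 1."""
    return sum(comb(n, k) for k in range(2, n + 1))
```

For example, n_chi_match(5) returns 26 and n_chi_match(6) returns 57, matching the values quoted in the text.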
8. Probability of being χmatched under hypothesis h_{i}
In this section we compute p(s | h_{i}), the probability that n sources from n distinct catalogues satisfy the candidate selection criterion under hypothesis h_{i}. We will show in Sect. 9 that the p.d.f. of the Mahalanobis distance for χ-match associations under hypothesis h_{i} is the p.d.f. of the Mahalanobis distance without applying the candidate selection criterion, normalized by the probability p(s | h_{i}) we compute in this section: (101). We show here that p(s | h_{i}) is proportional to the integral we note I_{hi,n}(k_{γ}) (see Eq. (107)), which is independent of the positional uncertainties and which also plays a role in Sect. 10. We will also see that p(s | h_{i}) and I_{hi,n}(k_{γ}) can be simplified to p(s | h_{k}) and I_{k,n}(k_{γ}) respectively, that is, the probability that n sources from n distinct catalogues satisfy the candidate selection criterion knowing that they are actually associated with k distinct real sources.
If k = 1, that is, if all sources are from the same real source, we – logically – find p(s | h_{k = 1}) = γ, the cumulative χ distribution function evaluated at the threshold k_{γ}.
If k = n, all sources are spurious, and we – also logically (see Eq. (74)) – find for I_{k = n,n}(k_{γ}) the volume of a 2(n−1)-dimensional sphere of radius k_{γ}, while p(s | h_{k = n}) equals the volume of the 2(n−1)-dimensional ellipsoid defined by the test acceptance region divided by the common χ-match surface area raised to the power of the number of χ-matches (i.e. n−1).
We note x the total Mahalanobis distance, that is, the square root of Eq. (36). The vectorial form x = (x_{1},x_{2},...,x_{n−1}) denotes the n−1 terms, also Mahalanobis distances, which are summed in the catalogue-by-catalogue iterative form Eq. (48). We rewrite this equation with the new notations (102). So x is the radius of a hypersphere in the space of the n−1 successive Mahalanobis distances. The relation between x and the vector x of dimension n−1 is the polar transformation F:R^{n−1} → R^{n−1}, (x_{1},x_{2},...,x_{n−1}) = F(x,θ_{1},...,θ_{n−2}), (103) with (104). The associated differential transform is (105), with J_{F} the determinant of the Jacobian of F, which is for example computed in Stuart & Ord (1994, Chap. II "Exact sampling distributions", p. 375): (106). We now define the I_{k,n}(k_{γ}) integral, which will be crucial in the next sections: (107), in which k denotes the hypothetical number of real sources and so ranges from 1 to n. Written this way, the integral is simpler than the equivalent form: (108), in which we have k groups, each containing n_{g} sources associated with the same real source, the n_{g} summing to n (see the iterative candidate selection by groups of catalogues in Sect. 5.4). Here Eq. (55) takes the form (109), where the x_{g−1} are inter-group Mahalanobis distances and the x_{i} are intra-group Mahalanobis distances. In this version of the formula, we suppose that we iteratively cross-correlate the catalogues by groups, and that each group corresponds to one real source. So inside each group we multiply Rayleigh distributions, and when associating the groups we multiply by two-dimensional Poisson distributions.
We compute I_{k,n} using recursive integration by parts, leading to (see Sects. A.5.3 and A.6) (110). We provide the exhaustive list of values of I_{k,n}(k_{γ}) for n = 2, 3, 4 and 5 in Table 3. Remark: k_{γ} depends on the selected completeness γ and on the number of catalogues n. When we call I_{k,n−k}(k_{γ}), the k_{γ} to be used in the integral is always the k_{γ} computed for n catalogues. So I_{1,n−k} will no longer be equal to γ but to F_{χ2(n−k−1)}(k_{γ}). For example, we fix γ to 0.9973. Then for an n = 2 catalogue χ-match, k_{γ} ≈ 3.4 and I_{1,2} = γ. But for an n = 3 catalogue χ-match, k_{γ} ≈ 4.0, I_{1,3} = γ and I_{1,3−1 = 2} ≠ γ.
Values of the normalization integrals I_{k,n−k}(k_{γ}) for a number of catalogues ranging from two to five.
Values of the derivatives of the normalization integrals for a number of catalogues ranging from two to five.
We call p(s | h_{k}) the marginalized probability of observing a Mahalanobis distance less than or equal to k_{γ} in a group of n sources knowing that they are actually associated with k real sources: (111). We obtain this equality by replacing 2πx_{g−1} by 2πx_{g−1}det(V_{Σg−1} + V_{Σg})/Ω in Eq. (108) and then applying Eq. (63). The factor 1/Ω comes from the normalization of the Poisson distribution, so that it integrates to one over the common surface area of the crossmatched catalogues. For the particular case in which all sources are spurious, we logically find the summed terms of Eq. (74).
The distribution (p.d.f.) associated with the probability p(x | h_{k}) of observing a given Mahalanobis distance x knowing h_{k} is simply given by the derivative of p(s | h_{k}), so it is proportional to dI_{k,n}(x)/dx.
9. Simple Bayesian probabilities
In this section, we compute Bayesian probabilities which depend on the Mahalanobis distance only.
9.1. General formula
Given a set of n candidates from n distinct catalogues satisfying the candidate selection criterion, we know

x, the Mahalanobis distance (Eq. (40)), or χ value, which is a real value;

s, the result of the selection criterion x ≤ k_{γ}, that is, a boolean always equal to true for the sets of associations we keep, so p(s) = 1;

{ h_{i} }, i ∈ [1,B_{n}], the set of B_{n} (Eq. (73)) hypotheses to be formulated for each set of associations.
We then note h_{k} hypotheses in which the n sources are associated with k “real” sources.
For a given set of n candidates from n distinct catalogues, the probabilities associated with the various hypotheses are given by Bayes' formula. In this formula, the p(h_{i} | s) are priors (considering only χ-matches, hence only s = true) and correspond to the number of associations satisfying the candidate selection (χ-matches) and hypothesis h_{i} over the total number of associations satisfying the candidate selection. We can transform the likelihood p(x | h_{i},s) because, keeping in our sample only associations satisfying the candidate selection criterion, we have p(s | x,h_{i}) = 1. In other words, the likelihood we use is a classical likelihood normalized so that it integrates to one over the χ-test acceptance region (defined by x ≤ k_{γ}).
We easily compute the priors from the numbers estimated in Sect. 7, and the likelihoods are simply computed from Sect. 8: (117). We make this result explicit in the next section for the cases of two, three and four catalogues.
9.2. Likelihoods p(x | h_{i},s)
In this section, we compute the likelihoods p(x | h_{i},s), that is, the p.d.f. of the Mahalanobis distance of χ-matches under hypothesis h_{i}.
9.2.1. Case of two catalogues
For a set of two sources from two distinct catalogues, we have only two hypotheses, hence the two likelihoods. Knowing that the selection criterion is satisfied, we have to normalize so that the integral of each likelihood over the domain defined by the selection criterion equals one (likelihoods are p.d.f.s). All constant terms, that is, terms independent of x, vanish with the normalization. The likelihoods are plotted in Fig. 2.
Remark: p(x | h_{k = 2}) mixes the derivative of the surface area of an ellipse in the Euclidean plane and the surface area of the sphere. It is an approximation valid as long as is small enough that the effects of the curvature of the sphere are negligible.
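The two likelihoods of Fig. 2 can be written down explicitly; in the sketch below the normalization constants are our reconstruction from the requirement that each p.d.f. integrates to one on [0, k_γ] (the threshold K is recomputed from γ rather than hard-coded; names are ours):

```python
import math

GAMMA = 0.9973
K = math.sqrt(-2.0 * math.log(1.0 - GAMMA))   # chi-match threshold, 2 d.o.f.

def like_real(x):
    """p(x | h_AB, s): Rayleigh p.d.f. renormalized on [0, K]; its integral
    over [0, K] is GAMMA, hence the division."""
    return x * math.exp(-x * x / 2.0) / GAMMA

def like_spurious(x):
    """p(x | h_A_B, s): two-dimensional 'Poisson' (uniform-position) term,
    linear in x and normalized on [0, K]."""
    return 2.0 * x / (K * K)
```

A simple trapezoidal quadrature confirms that both functions integrate to one over the acceptance region.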
Fig. 2 Two possible likelihoods for n = 2 catalogues and γ = 0.9973: normalized Rayleigh (red, filled curve) and Poisson (green, dashed curve) components. We note that this γ value implies the [0,k_{γ} = 3.443935] range for x. 

9.2.2. Case of three catalogues
For a set of three sources A, B, and C from three distinct catalogues, we have five hypotheses (see Sect. 6.2.2). In these five hypotheses, the number of "real" sources can be either one, two or three. We have as many likelihoods as possible distinct numbers of "real" sources. There are three ways of performing an iterative crossmatch, all leading to the same Mahalanobis distance x:

crossmatch A with B and then with C;

crossmatch B with C and then with A;

crossmatch A with C and then with B.
We denote x_{AB} the Mahalanobis distance between A and B and we denote x_{ABC} the Mahalanobis distance between C and the weighted mean position of A and B.
x_{1} and x_{2} are used to designate, without distinction, x_{AB}, x_{BC} or x_{AC}, and x_{ABC}, x_{BCA} or x_{ACB}, respectively.
Although it may be tempting to write , we cannot directly compute the probabilities p(h_{i} | x) from those likelihoods, since the infinitesimals (dx_{AB}, dx_{AC}, ...) are not the same and so do not vanish when applying Bayes' formula.
It seems that the only measurement one can use to obtain coherent (and symmetrical) probabilities is the total Mahalanobis distance x. So we have to integrate the above probabilities over the domain defined by , and then evaluate their derivatives at x. We obtain the following likelihoods, represented in Fig. 3, in which h_{k = 1} is the hypothesis h_{ABC}, h_{k = 3} is the hypothesis h_{A_B_C}, and h_{k = 2} is either the hypothesis h_{AB_C}, h_{AC_B} or h_{A_BC}.
Fig. 3 Three possible likelihoods p(x | h_{k},s) for n = 3 catalogues and γ = 0.9973: χ distribution with four degrees of freedom (red, filled curve); integral of a χ distribution with two degrees of freedom times a two-dimensional Poisson distribution (green, dashed curve); four-dimensional Poisson distribution (blue, dotted curve). 

9.2.3. Case of four catalogues
For a set of four sources A, B, C, and D from four distinct catalogues, we have fifteen hypotheses (see Sect. 6.2.3). Across the fifteen hypotheses, the number of “real” sources can be one, two, three, or four, and we have as many likelihoods as possible distinct numbers of “real” sources. They are represented in Fig. 4, in which h_{k = 1} is the hypothesis h_{ABCD}; h_{k = 4} is the hypothesis h_{A_B_C_D}; h_{k = 2} is either the hypothesis h_{A_BCD}, h_{B_ACD}, h_{C_ABD}, h_{D_ABC}, h_{AB_CD}, h_{AC_BD} or h_{AD_BC}; and h_{k = 3} is either the hypothesis h_{AB_C_D}, h_{AC_B_D}, h_{AD_B_C}, h_{BC_A_D}, h_{BD_A_C} or h_{CD_A_B}.
Fig. 4 Four possible likelihoods p(x | h_{k},s) for n = 4 catalogues and γ = 0.9973: χ distribution with six degrees of freedom (red, filled curve); integral of a χ distribution with two degrees of freedom times a four-dimensional Poisson distribution (green, dashed curve); integral of a χ distribution with four degrees of freedom times a two-dimensional Poisson distribution (blue, dotted curve); six-dimensional Poisson distribution (cyan, filled curve). 

9.3. Advantage and limits
The main advantage of using p(h_{i} | x,s) is that the likelihoods it is based on do not depend on the positional errors: the only input parameter is the Mahalanobis distance x. Although x is computed from the positional errors, once the χ-match has been performed we no longer need the errors: the distributions we use rely only on x. Changing the positional errors modifies the priors, not the likelihoods. We can therefore easily add independent likelihoods based on magnitudes or other parameters.
There are two main problems. The first is precisely that the likelihoods depend only on x. It means that a set of very close sources with very accurate positions may have the same probability as a set of distant sources with large positional errors, even though, intuitively, the risk that the first set contains a spurious association should be far lower than for the second. The second limitation is due to the fact that the likelihoods are the same for all hypotheses considering the same number of “real” sources. In the three-catalogue case, p(x | h_{AB_C},s) = p(x | h_{A_BC},s) = p(x | h_{AC_B},s). The priors being constants, if p(h_{AB_C} | s) > p(h_{A_BC} | s) > p(h_{AC_B} | s), we always obtain the posterior probabilities p(h_{AB_C} | x,s) > p(h_{A_BC} | x,s) > p(h_{AC_B} | x,s).
10. Bayesian probabilities with positional errors
In this section, we compute Bayesian probabilities which explicitly include positional errors.
10.1. Warning about the non-independence of positional uncertainties
In surveys providing individual uncertainties, the positions of unsaturated bright sources are often more precise than the positions of faint sources. The reason has to do with the higher photometric signal-to-noise ratio of bright sources compared to faint ones, while the FWHM is similar. An example of the computation of positional uncertainties based on photon statistics can be found in the documentation of the SExtractor software (Bertin & Arnouts 1996). As mentioned in that documentation, the photon-statistics-based error is a lower-value estimate.
This means that we cannot blindly assume that the positional uncertainties and photometric quantities, like apparent magnitudes, are independent. Moreover, if the positional errors of sources are related to their magnitudes, and if the magnitudes of the sources in different catalogues are also related, then the positional uncertainties in the different catalogues are related too. This means that we cannot blindly assume that the positional uncertainties of matching objects in different catalogues are independent of each other, at least not for h_{k<n}, that is, the hypotheses in which at least two sources come from the same actual source.
One has to keep this in mind when using the naive independence hypothesis to simplify Bayes’ probabilities.
10.2. Probability using the Mahalanobis distance
To (at least partly) solve the first issue mentioned in Sect. 9.3, one possibility is to introduce likelihoods based, for example, on the volume V of the chi-test acceptance region, writing (134). From Sect. 7, it is easy – even though it may be time consuming – to build the estimated histogram n_{A_B}p(V + ΔV | h_{A_B}) (in which ΔV is the width of the histogram bars). Given this histogram and the result of a two-catalogue cross-match, we can also build an estimated histogram n_{AB}p(V + ΔV | h_{AB}). And so on for multiple catalogues, performing all possible sub-cross-matches.
If V and x are independent for all hypotheses, and knowing (having estimates of) n_{AB} and n_{A_B}, we have all the ingredients to compute (135), even if it is not elegant to introduce a somewhat arbitrary slicing in the V histograms.
10.3. Putting aside the Mahalanobis distance
We also consider the alternative form, which puts aside the Mahalanobis distance and relies on the full sets of positions μ and associated errors V (136), in which the probabilities explicitly depend on the “configuration” of the positions and on the associated errors. It also depends on the distribution of the positional errors for a given hypothesis. Although p(V | h_{i},s) can be estimated by performing all possible sub-cross-matches, this is not trivial, since it is a joint distribution in a space of dimension equal to the number of “actual” sources considered in h_{i} (using the circular error approximation).
10.3.1. Likelihoods p(μ | h_{i}, V, s)
We make the hypothesis that n_{g} sources are n_{g} detections of the same true source at a given position p. The probability of observing the set of positions μ_{{g}} = { μ_{1},μ_{2},...,μ_{ng} }, knowing p and the set of errors V_{{g}} = { V_{1},V_{2},...,V_{ng} }, is given by (137). In practice we do not know the position p of the real source, so the probability of observing the set of positions μ_{{g}} knowing the set of errors V_{{g}} is obtained by integrating over all possible positions. This result is the same as Eq. (38) in Sect. 5.2, but applied here to the subset of positions { μ_{1},μ_{2},...,μ_{ng} }. The difference is that in Sect. 5.2 we wanted to estimate the probability that the sources are at the same location, whereas here, knowing (making the hypothesis) that they are at the same location, we compute the probability of observing this particular outcome. The particular case n_{g} = 1 leads to p(μ_{{g}} | V_{{g}}) = dμ_{1}.
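This group likelihood can be sketched numerically. The sketch below assumes Gaussian positional errors and uses the standard result that, under the common-source hypothesis, the difference of two positions is Gaussian with covariance equal to the sum of the two covariances; the factorization against the running weighted mean mirrors the iterative form, and all names are ours, not the paper's:

```python
import numpy as np

def group_likelihood(mus, Vs):
    """p(mu_{g} | V_{g}): density of observing the positions mus, with
    Gaussian errors Vs, under the hypothesis of a common true position
    (the unknown position being marginalized out). Built iteratively
    against the running weighted mean; n_g = 1 gives 1."""
    mu, V = mus[0], Vs[0]
    like = 1.0
    for m, W in zip(mus[1:], Vs[1:]):
        S = V + W                                   # covariance of the difference
        d = m - mu
        like *= np.exp(-0.5 * d @ np.linalg.inv(S) @ d) \
                / (2.0 * np.pi * np.sqrt(np.linalg.det(S)))
        iV, iW = np.linalg.inv(V), np.linalg.inv(W)
        V = np.linalg.inv(iV + iW)                  # update the weighted mean
        mu = V @ (iV @ mu + iW @ m)
    return like
```

Because the underlying joint density is symmetric in its inputs, the iteration order does not change the result.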
We now consider G groups and the selection criterion s (x ≤ k_{γ}). Each of the n input sources is part of one, and only one, group. Given the G groups, the errors on the positions, and the candidate selection criterion, the probability of observing the positions μ is given by (140). The denominator ensures that the likelihood integrates to one over its domain of definition, a domain delimited by the candidate selection criterion, that is, the region of acceptance of the χ^{2} test.
Let us compute the integral in the denominator. The substitution x^{2} = y^{T}V^{−1}y transforms the differentials accordingly: y can be the difference between two positions (e.g. μ_{i}−μ_{Σi−1}) and (x,θ) the polar coordinates of y in the basis defined by the eigenvectors of V and reduced by its eigenvalues; V can for example be V_{Σi−1} + V_{i}. Using the iterative form of Sect. 5.3 and Eq. (53), we can rewrite p(μ_{{g}} | V_{{g}}) as (141). Integrating over all θ_{i}, we obtain the joint p.d.f. of the successive Mahalanobis distances. It is different from the p.d.f. of their quadratic sum, which gives the total Mahalanobis distance (145).
Putting it all together, the integral in the denominator is none other than the integral I_{k = G,n}(k_{γ}) defined in Sect. 8. Written explicitly: (146)
10.3.2. Classical two-catalogue case
In the case of two catalogues, the probabilities are simply:
If we compare the likelihood ratio LR_{μ,V} computed from those formulae with the likelihood ratio LR_{x} computed from the previous result (Sect. 9.2.1), we obtain the following: contrary to LR_{x}, LR_{μ,V} accounts for the size of the positional errors. As we will see in the next section, the drawback is that we can hardly combine the likelihoods p(μ_{1},μ_{2} | h_{k},V_{1},V_{2},s) with photometry-based likelihoods.
11. Bayesian probabilities with photometric data
All probabilities of association discussed so far are based on the likelihood that the positions recorded in various catalogues are consistent with that of a unique astrophysical object. However, one may wish to make additional assumptions on the nature of the source (e.g. star, active galactic nucleus, etc.) that could help decrease or, conversely, increase the plausibility of a given association of catalogue entries. This is particularly important when one seeks to gather homogeneous samples of objects. Spectral energy distributions assembled from photometric catalogues can be usefully compared with templates and assigned a probability of being representative of the targeted class of objects. This procedure has been presented in Budavári & Szalay (2008) and recently used in Naylor et al. (2013) and Hsu et al. (2014). However, following Budavári & Szalay (2008) we underline that entering criteria of resemblance to a given class of objects in the computation of association probabilities is done at the expense of the capability to find scientifically interesting outliers.
Another possibility may consist in building colour-colour diagrams for random and χ-matched associations, to derive colour-colour diagrams for real associations (more precisely, we have to derive one diagram for each possible hypothesis h_{i}). Those normalized diagrams are p.d.f.s that can be interpreted as likelihoods (p(m | h_{i},s) in the following equations). Smoothing those diagrams, one can see them as the likelihoods of a kernel density classification (Richards et al. 2004), replacing the object types by the hypotheses h_{i}, and using magnitudes from different catalogues instead of just one.
We detail below how photometric data can be folded into the output of the purely astrometric method discussed in this paper, without making additional assumptions on the nature of the source.
Let a denote a vector containing all the astrometric information we have about a set of n candidates (a may contain the positions, the associated covariance matrices, ...). Let m denote the set of photometric information we have about the same set of n candidates (m may contain magnitudes and/or colours, associated errors, ...). Then we can write Bayes’ formula (151). If a and m are independent (a naive hypothesis which is not granted, see Sect. 10.1), then (152) holds and Eq. (151) becomes (153). Suppose we perform a cross-match taking into account astrometric data only, and compute the probabilities p(h_{i} | a,s) for all possible hypotheses. If a and m are independent, and if we are able to compute likelihoods based on photometric data only, p(m | h_{i},s), then we can compute the probabilities p(h_{i} | a,m,s) in a second step from the probabilities computed in the astrometric part (154), which is equivalent to Eq. (153).
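Equation (154) amounts to reweighting the astrometric posteriors by the photometric likelihoods and renormalizing. A minimal sketch with hypothetical numbers (the two arrays are illustrative, not from the paper):

```python
import numpy as np

def combine_astro_photo(p_h_given_a, p_m_given_h):
    """Eq. (154): reweight the astrometric posteriors p(h_i | a, s)
    by the photometric likelihoods p(m | h_i, s) and renormalize,
    assuming a and m independent given h_i and s."""
    w = np.asarray(p_h_given_a) * np.asarray(p_m_given_h)
    return w / w.sum()

# Hypothetical numbers for the five hypotheses of a 3-catalogue match
p_astro = np.array([0.55, 0.15, 0.10, 0.10, 0.10])  # p(h_i | x, s)
p_photo = np.array([0.80, 0.05, 0.05, 0.05, 0.05])  # p(m | h_i, s), up to a constant
p_post = combine_astro_photo(p_astro, p_photo)
```

Because only the product matters before normalization, p(m | h_i, s) needs to be known only up to a constant factor.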
Unfortunately, positional errors and magnitudes are not necessarily independent, so one should not use p(h_{i} | μ,V,s) without due caution in Eq. (154). However, one can use the Mahalanobis distance x, which is independent of the photometry, that is, the probabilities p(h_{i} | x,s) (Eq. (113)).
12. Tests on synthetic catalogues
In the context of the ARCHES project, we developed a tool implementing the statistical multi-catalogue cross-match described in this paper. We added to the tool the possibility of generating synthetic catalogues that can be cross-matched like real tables. This has allowed us to perform tests and to check both the software and the theory.
We present here such a test and provide the associated script (see Appendix C) so anybody can try it independently, possibly changing the input values. Currently the tool is accessible both via a web interface and an HTTP API^{9}. Future plans are discussed in the conclusion Sect. 14.
We generate three synthetic catalogues, setting the numbers of sources they contain and have in common: we call n_{ABC} the number of common sources in the three catalogues A, B and C; n_{AB} the number of common sources in A and B only; n_{A} the number of sources in catalogue A only; and so on. Knowing a priori common and distinct sources in the catalogues, we can track the associations which are real and the spurious ones in the crossmatch output. We can also check for missing associations.
The error associated with each individual position is a random value that follows a user-defined distribution; a different distribution is used for each catalogue. For catalogue A, we choose a constant value equal to 0.4′′; for catalogue B, the positional error distribution follows a linear function between 0.8′′ and 1.2′′; for catalogue C, the positional errors follow a Gaussian distribution of mean 0.75′′ and standard deviation 0.1′′, truncated to the 0.5–1′′ range.
We set the input sky area to be a cone of radius 0.42 degrees. Each position is placed randomly (uniform distribution) in this cone. For each catalogue in which the source is included, we randomly pick a positional error that we associate with the source, and we blur the position according to this error.
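A minimal sketch of such a generator, under the flat-sky approximation, is given below. The "linear" error distribution is read here as a density proportional to the error value (one possible reading of the text), the truncated Gaussian is drawn by simple rejection, and all function names are ours:

```python
import numpy as np

rng = np.random.default_rng(42)

def uniform_in_cone(n, radius_deg):
    """Uniform positions in a small cone (flat-sky approximation)."""
    r = radius_deg * np.sqrt(rng.uniform(size=n))  # sqrt => uniform areal density
    t = rng.uniform(0.0, 2.0 * np.pi, size=n)
    return np.column_stack([r * np.cos(t), r * np.sin(t)])

def errors_B(n):
    """Linear density f(e) proportional to e on [0.8'', 1.2''],
    by inverse-c.d.f. sampling."""
    u = rng.uniform(size=n)
    return np.sqrt(0.8**2 + u * (1.2**2 - 0.8**2))

def errors_C(n):
    """Gaussian (mean 0.75'', s.d. 0.1'') truncated to [0.5'', 1''],
    by rejection (oversampling, then keeping the first n draws)."""
    e = rng.normal(0.75, 0.1, size=4 * n)
    return e[(e >= 0.5) & (e <= 1.0)][:n]

n = 1000
true = uniform_in_cone(n, 0.42)        # positions in degrees
errA = np.full(n, 0.4)                 # constant error, catalogue A (arcsec)
obsA = true + rng.normal(size=true.shape) * (errA / 3600.0)[:, None]
```

Each catalogue then gets its own blurred copy of the subset of true sources it is supposed to contain.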
We first perform the computation for the pairs AB, AC, and BC. Given the chosen error distributions, the mean errors are equal to the median errors, and the mean of the inverse of the errors is quite close to the inverse of the mean errors. So, for this particular case, we use the inverses of the mean errors 0.4′′, 1′′ and 0.75′′ instead of the means of the inverses. Given this approximation and using Eq. (96), we obtain the corresponding estimates. To estimate the number of “fully” spurious associations, we have to compute the mean of the square root of Eq. (97) over all possible source trios, which can be approximated in this specific case. We note that, in this particular case, the error distribution of the A sources involved in AB, AC, and ABC associations is the same; the same holds for the error distributions of the B and C sources.
We are now able to compute all components, depending on the size of the histogram bins (step). Each component equals step × n × p(s | h_{i})p(x | h_{i},s), see Eqs. (111) and (117). We simplify the expression since p(s | h_{i}) ∝ I_{k,n}(k_{γ}) and p(x | h_{i},s) ∝ 1 /I_{k,n}(k_{γ}). The normalized histograms associated with each component are distributed according to each likelihood p(x | h_{i},s). Both the histograms made from the data and the theoretical curves are plotted in Fig. 5. The theoretical results fit the result of the chi-square cross-match based on simulated data very well.
We also verify that the number of “good” ABC matches we obtain as output of the cross-match is consistent with n_{ABC} times the input completeness γ.
Fig. 5 Result of the cross-match of three synthetic catalogues with input values n_{A} = 40 000, n_{B} = 20 000, n_{C} = 35 000, n_{AB} = 6000, n_{AC} = 12 000, n_{BC} = 18 000 and n_{ABC} = 10 000. The error on catalogue A is a constant equal to 0.4′′. The circular error on catalogue B follows a linear distribution between 0.8 and 1.2′′. The circular error on catalogue C follows a Gaussian distribution of mean 0.75′′ and standard deviation of 0.1′′, between 0.5 and 1′′. The common surface area is a cone of radius 0.42°. Top left: histogram of all associations and theoretical curve from the input parameters. Top centre: histogram of real associations and theoretical curve from the input parameters. Top right: histogram of “fully” spurious associations and theoretical curve from the input parameters. Bottom: histograms and theoretical curves of associations mixing a real association between two sources plus a spurious source. 

When cross-matching real catalogues, the numbers of sources n_{ABC}, etc., are not known. But the previous “theoretical” curves can be built after a χ-match, from the numbers of sources estimated to compute the priors in Sect. 7.
13. Summarized recipe
In this section, we give the main steps and equations needed to perform a χ-match and to compute, for each association, the probability that it is a good match (or that any other possible hypothesis holds).
For a small and compact sky area, project all the sources of all catalogues on a Euclidean plane, using for example the ARC projection (Calabretta & Greisen 2002).
To select matching candidates, for each possible set of n sources from n distinct catalogues:

compute their weighted mean position (Eq. (30)) and the associated error (Eq. (29));

derive x, the Mahalanobis distance defined by the square root of Eq. (36);

fix a constant threshold on all Mahalanobis distances, that is:

set the fraction α of real associations it is acceptable to miss – the type I error – and

derive numerically the threshold k_{γ} by inverting Eq. (46), based on the chi-square distribution with 2(n−1) degrees of freedom (the result is the same when computing the threshold from the chi distribution with 2(n−1) degrees of freedom);
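The numerical derivation of the threshold k_γ can be sketched with scipy, whose `chi2.ppf` plays the role of the inverse chi-square c.d.f.; this is one possible implementation, not the paper's own:

```python
import numpy as np
from scipy.stats import chi2

def k_gamma(gamma, n):
    """Threshold on the total Mahalanobis distance for an n-catalogue
    chi-match: square root of the gamma-quantile of a chi-square
    distribution with 2(n-1) degrees of freedom (cf. Eq. (46))."""
    return np.sqrt(chi2.ppf(gamma, df=2 * (n - 1)))

print(k_gamma(0.9973, 2))
```

The threshold grows with the number of catalogues n, since more degrees of freedom are involved in the total Mahalanobis distance.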

To compute Bayes’ probabilities, as many hypotheses as the number of possible partitions of the set of n sources (see Eq. (73), Table 2 and Fig. 1) have to be formulated. Depending on whether one wants to be able to account for photometry in a second step or not, a set of likelihoods may be chosen among several such sets.
In the first case, the likelihood associated with each hypothesis (knowing the selection criterion is fulfilled) depends only on the Mahalanobis distance and on the number of real sources k in the hypothesis h_{i} (see e.g. Fig. 4). The likelihoods are given by Eq. (117): (175), with I_{k,n}(k_{γ}) given in Eq. (110). The formulae of I_{k,n}(k_{γ}) and dI_{k,n}(x) are provided for n ≤ 5 in Tables 3 and 4, respectively.
In the second case (no photometry to be taken into account), one can use likelihoods defined by Eq. (146).
Finally, to apply Bayes’ formula, the priors p(h_{i} | s) are needed. This is trickier, and the steps detailed in Sect. 7 have to be considered. A prerequisite is to work on an area (Ω) uniformly covered by all catalogues. For two catalogues, the number of spurious associations can be estimated by computing, for each catalogue, the mean area covered by the error ellipses for a radius equal to the threshold k_{γ} (that is, the mean area of the 1σ error ellipses times k_{γ}²). The two means are summed and the result is divided by Ω to obtain the mean probability of spuriously associating two unrelated sources. It is then multiplied by the product of the numbers of sources in both catalogues to finally obtain the mean expected number of spurious associations (Eq. (85)). Knowing the number of associations in the cross-correlation output, both the probability that one such association is spurious and the complementary probability of having a real association (the two priors of the two-catalogue cross-match) can be estimated (see Eqs. (93) and (94)). Similarly, performing all possible sub-χ-matches, all priors needed for an n-catalogue cross-match can be derived.
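The two-catalogue prior estimation described above can be sketched as follows. This is a sketch under our own naming, with hypothetical input numbers; the k_γ² scaling of the mean 1σ ellipse areas follows the recipe in the previous paragraph:

```python
import numpy as np

def two_catalogue_priors(nA, nB, n_assoc, mean_area_A, mean_area_B, k_gamma, omega):
    """Priors of a two-catalogue chi-match: mean_area_* are the mean
    1-sigma error-ellipse areas; scaling by k_gamma**2 gives the mean
    acceptance areas (cf. Eq. (85)). Units must be consistent with the
    uniformly covered survey area omega."""
    p_pair = (mean_area_A + mean_area_B) * k_gamma**2 / omega
    n_spur = nA * nB * p_pair               # expected spurious associations
    p_spur = min(1.0, n_spur / n_assoc)     # prior of the "spurious" hypothesis
    return p_spur, 1.0 - p_spur

# Hypothetical example: circular errors ~0.4'' and ~1'' over a 0.42 deg cone
omega = np.pi * (0.42 * 3600.0)**2          # survey area in arcsec^2
aA, aB = np.pi * 0.4**2, np.pi * 1.0**2     # mean 1-sigma ellipse areas (arcsec^2)
p_spur, p_real = two_catalogue_priors(40_000, 20_000, 9_000, aA, aB, 3.44, omega)
```

For circular errors the sum of the two mean areas equals π(σ_A² + σ_B²), the area of the convolved error circle, which is why the two means can simply be summed.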
All the ingredients needed to compute the probabilities associated with each hypothesis are thus available. Those probabilities can be computed by applying Eq. (113).
14. Conclusions
In this paper we developed a comprehensive framework for performing the cross-correlation of multiple astronomical catalogues in one pass. The approach employs a classical χ² test to select candidates. We computed two sets of likelihoods based on positions, individual elliptical positional errors, and the χ²-test region of acceptance: one that can be combined without any precaution with other parameters such as photometric values; and one for which the naive hypothesis of independence between positional uncertainties and magnitudes has to be tested. We also presented a way to estimate “priors” from the region of acceptance of the χ² test. Probabilities for each possible hypothesis can thus be computed from those likelihoods and “priors”.
In practice the number of hypotheses, and thus the number of “priors”, increases dramatically with the number of catalogues. To be able to cross-match more than six or seven catalogues, it is necessary to simplify the problem. One possibility consists in merging two catalogues of similar astrometric accuracy and similar wavelength range, considering all matches as non-spurious. Doing so effectively reduces the number of input catalogues by one.
A large part of the statistical work carried out here depends on the simplifying assumptions made in Sect. 3: perfect astrometrical calibration (no systematic offsets), no proper motions, no clustering, and no blending. In real life, the “normalized” distance between two detections of the same source present in two distinct catalogues hardly follows a Rayleigh distribution. The actual distribution (in practice it is not easy to build such a distribution, since it requires secure identifications) often has a broader tail (see for example Rosen et al. 2016, Fig. 5), and a log-normal distribution may fit it better than the Rayleigh distribution. This is probably due to a combination of causes such as small proper motions, imperfect reduction, systematics or bias from the calibration process, under- or overestimated errors, etc.
In practice this means that the number of associations missed by the candidate selection criterion (based on the Rayleigh distribution) is larger than expected from the chosen theoretical value (γ). We could, for example, add larger systematics to the positional errors; the risk is then to distort (even more) the Rayleigh distribution. We could also try to recalibrate locally the set of catalogues we want to cross-match, but we need secure identifications to do it properly; for each catalogue, all sources in the local area must have been calibrated at once (to possibly correct for a locally uniform systematic using four simple parameters: Δα, Δδ, scale, θ). Those two constraints (having secure identifications and at-once calibration) are in practice quite hard to satisfy.
If we recalibrate using a “secure” population (i.e. a population of objects having no proper motion, like QSOs), we introduce a bias, since QSOs are fainter than most stars in the optical and thus have larger errors than the global population of objects. And if we add stars, we introduce noise due to proper motions.
For these reasons, we believe that in the case of “old” optical surveys based on photographic plates, a classical fixed-radius cross-match may be more efficient than the χ-match to select candidates. We are nonetheless convinced that the equations we derived in this paper can help in building new catalogues, based for example on both multi-band and multi-epoch observations, and can be used to assess and improve the quality of upcoming surveys.
We generated and processed synthetic catalogues, which meet the simplifying assumptions, in the tool we developed for the ARCHES project. The consistency between the theoretical results derived in this paper – completeness of the candidate selection criterion, likelihoods, and priors – and the outputs of the tool has allowed us to cross-validate both the method and its implementation. The tool has also been used to generate ARCHES products which were used in the scientific work packages of the project. Currently the CDS XMatch Service (Pineau et al. 2011a; Boch et al. 2012; Pineau et al. 2015) provides a basic but very efficient facility to cross-correlate two possibly large (>1 billion sources) catalogues. It is planned to include the ARCHES tool in the CDS XMatch Service. This paper will be the basic reference for the extension of the latter to multi-catalogue statistical χ-matching.
From the 2MASS online user’s guide http://www.ipac.caltech.edu/2mass/releases/allsky/doc/sec2_2a.html
Original figure: https://commons.wikimedia.org/wiki/File:Set_partitions_5;_cirlces.svg
Acknowledgments
A large part of this work was supported by the ARCHES project. ARCHES (No. 313146) was funded by the 7th Framework Programme of the European Union and coordinated by the University of Strasbourg. All figures (except Fig. 1) were made using the ctioga2 plotting program developed by Vincent Fourmond. F. J. Carrera also acknowledges financial support through grant AYA2015-64346-C2-1-P (MINECO/FEDER).
References
 Adorf, H.M., Lemson, G., & Voges, W. 2006, in Astronomical Data Analysis Software and Systems XV, eds. C. Gabriel, C. Arviset, D. Ponz, & S. Enrique, ASP Conf. Ser., 351, 695 [Google Scholar]
 Ahn, C. P., Alexandroff, R., Allende Prieto, C., et al. 2012, ApJS, 203, 21 [NASA ADS] [CrossRef] [Google Scholar]
 Ahn, C. P., Alexandroff, R., Allende Prieto, C., et al. 2013, VizieR Online Data Catalog: V/139 [Google Scholar]
 Aihara, H., Allende Prieto, C., An, D., et al. 2011, ApJS, 193, 29 [NASA ADS] [CrossRef] [Google Scholar]
 Bertin, E., & Arnouts, S. 1996, A&AS, 117, 393 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Boch, T., Pineau, F., & Derriere, S. 2012, in Astronomical Data Analysis Software and Systems XXI, eds. P. Ballester, D. Egret, & N. P. F. Lorente, ASP Conf. Ser., 461, 291 [Google Scholar]
 Brusa, M., Zamorani, G., Comastri, A., et al. 2007, ApJS, 172, 353 [NASA ADS] [CrossRef] [Google Scholar]
 Budavári, T., & Szalay, A. S. 2008, ApJ, 679, 301 [NASA ADS] [CrossRef] [PubMed] [Google Scholar]
 Calabretta, M. R., & Greisen, E. W. 2002, A&A, 395, 1077 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Cutri, R. M., Skrutskie, M. F., van Dyk, S., et al. 2003, VizieR Online Data Catalog: II/246 [Google Scholar]
 Cutri, R. M., et al. 2014, VizieR Online Data Catalog: II/328 [Google Scholar]
 Fioc, M. 2014, A&A, 566, A8 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Helfand, D. J., White, R. L., & Becker, R. H. 2015a, ApJ, 801, 26 [NASA ADS] [CrossRef] [Google Scholar]
 Helfand, D. J., White, R. L., & Becker, R. H. 2015b, VizieR Online Data Catalog: VIII/092 [Google Scholar]
 Høg, E., Fabricius, C., Makarov, V. V., et al. 2000, A&A, 355, L27 [NASA ADS] [Google Scholar]
 Hsu, L.T., Salvato, M., Nandra, K., et al. 2014, ApJ, 796, 60 [NASA ADS] [CrossRef] [Google Scholar]
 Inglot, T. 2010, Prob. Math. Stat., 30, 339 [Google Scholar]
 Lasker, B., Lattanzi, M. G., McLean, B. J., et al. 2007, VizieR Online Data Catalog: I/305 [Google Scholar]
 Lasker, B. M., Lattanzi, M. G., McLean, B. J., et al. 2008, AJ, 136, 735 [NASA ADS] [CrossRef] [Google Scholar]
 Menzel, M.L., Merloni, A., Georgakakis, A., et al. 2016, MNRAS, 457, 110 [NASA ADS] [CrossRef] [Google Scholar]
 Mingo, B., Watson, M. G., Rosen, S. R., et al. 2016, MNRAS, 462, 2631 [NASA ADS] [CrossRef] [Google Scholar]
 Motch, C., Carrera, F. J., Genova, F., et al. 2016, in Astronomical Data Analysis Software and Systems XXV (ADASS XXV), eds. F. Lorente & E. Shortridge, ASP Conf. Ser. [arXiv:1609.00809] [Google Scholar]
 Naylor, T., Broos, P. S., & Feigelson, E. D. 2013, ApJS, 209, 30 [NASA ADS] [CrossRef] [Google Scholar]
 Pier, J. R., Munn, J. A., Hindsley, R. B., et al. 2003, AJ, 125, 1559 [NASA ADS] [CrossRef] [Google Scholar]
 Pineau, F.X., Boch, T., & Derriere, S. 2011a, in Astronomical Data Analysis Software and Systems XX, eds. I. N. Evans, A. Accomazzi, D. J. Mink, & A. H. Rots, ASP Conf. Ser., 442, 85 [Google Scholar]
 Pineau, F.X., Motch, C., Carrera, F., et al. 2011b, A&A, 527, A126 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Pineau, F., Boch, T., Derriere, S., & Arches Consortium 2015, in Astronomical Data Analysis Software and Systems XXIV (ADASS XXIV), eds. A. R. Taylor & E. Rosolowsky, ASP Conf. Ser., 495, 61 [Google Scholar]
 Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. 2007, Numerical Recipes, The Art of Scientific Computing, 3rd edn. (New York: Cambridge University Press) [Google Scholar]
 Richards, G. T., Nichol, R. C., Gray, A. G., et al. 2004, ApJS, 155, 257 [NASA ADS] [CrossRef] [Google Scholar]
 Rosen, S. R., Webb, N. A., Watson, M. G., et al. 2016, A&A, 590, A1 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Rutledge, R. E., Brunner, R. J., Prince, T. A., & Lonsdale, C. 2000, ApJS, 131, 335 [NASA ADS] [CrossRef] [Google Scholar]
 Rutledge, R. E., Fox, D. W., Bogosavljevic, M., & Mahabal, A. 2003, ApJ, 598, 458 [NASA ADS] [CrossRef] [Google Scholar]
 Skrutskie, M. F., Cutri, R. M., Stiening, R., et al. 2006, AJ, 131, 1163 [NASA ADS] [CrossRef] [Google Scholar]
 Stuart, A., & Ord, K. 1994, Kendall’s Advanced Theory of Statistics, Vol. 1: Distribution Theory (Wiley) [Google Scholar]
 Taylor, M. B. 2005, in Astronomical Data Analysis Software and Systems XIV, eds. P. Shopbell, M. Britton, & R. Ebert, ASP Conf. Ser., 347, 29 [Google Scholar]
 Voges, W., Aschenbach, B., Boller, T., et al. 1999, A&A, 349, 389 [NASA ADS] [Google Scholar]
 Watson, M. 2012, in Half a Century of Xray Astronomy, Proc. Conference held 17–21 September, 2012 in Mykonos Island, Greece, 138 [Google Scholar]
 White, R. L., Becker, R. H., Helfand, D. J., & Gregg, M. D. 1997, ApJ, 475, 479 [NASA ADS] [CrossRef] [Google Scholar]
 White, N. E., Giommi, P., & Angelini, L. 2000, VizieR Online Data Catalog: VIII/031 [Google Scholar]
 Wolstencroft, R. D., Savage, A., Clowes, R. G., et al. 1986, MNRAS, 223, 279 [NASA ADS] [Google Scholar]
 Wright, E. L., Eisenhardt, P. R. M., Mainzer, A. K., et al. 2010, AJ, 140, 1868 [NASA ADS] [CrossRef] [Google Scholar]
 Zacharias, N., Urban, S. E., Zacharias, M. I., et al. 2004, AJ, 127, 3043 [NASA ADS] [CrossRef] [Google Scholar]
Appendix A: Demonstrations
Appendix A.1: From V to V_{Σ}
From Eq. (28), we compute, for 2 × 2 symmetric square matrices, the relation between V and V_{Σ}, in which adj A is the adjugate matrix of A, that is, the transpose of the cofactor matrix of A.
Appendix A.2: Sum of quadratics canonical form
First expanding and then factoring, we obtain an intermediate form. We then use an identity that we introduce into the previous equations to finally find the canonical form, in either of two equivalent writings.
Appendix A.3: Expanding the Q_{χ2} term
Noting that (A.17), we can write the expansion and thus, with square symmetric matrices:
Appendix A.4: Sum of 2 quadratics: Q_{χ} term
We first develop the expression and compute the required intermediate terms, which give two further relations written similarly. We use the above three relations to develop (A.34), which leads to (A.35). Using Eq. (52), we find that (A.36) is equal to (A.37).
Appendix A.5: χ and χ^{2} distributions
Appendix A.5.1: Definition
The χ and χ^{2} distributions with k = 2(n−1) degrees of freedom are defined using the gamma function, which satisfies Γ(l) = (l−1)! for all l ∈ ℕ. This leads to explicit expressions for the first values of k, and so on.
Appendix A.5.2: Sum of two χ functions
We show here that the composition of χ_{k = 2p}(x_{1}) and χ_{k = 2q}(x_{2}) gives χ_{k = 2(p + q)}(x). The distribution we are looking for is the density function of χ_{k = 2p}(x_{1})χ_{k = 2q}(x_{2})dx_{1}dx_{2} given x² = x_{1}² + x_{2}². We use polar coordinates, so dx_{1}dx_{2} = xdxdθ, x_{1} = xcosθ and x_{2} = xsinθ; the angular integral involves B(p,q), the beta function.
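This composition rule can be checked numerically. The sketch below (names ours) evaluates the polar-coordinate integral with scipy and compares it with the χ density of the combined number of degrees of freedom:

```python
import numpy as np
from math import gamma
from scipy.integrate import quad

def chi_pdf(x, k):
    """Density of the chi distribution with k degrees of freedom."""
    return x**(k - 1) * np.exp(-0.5 * x * x) / (2.0**(k / 2.0 - 1.0) * gamma(k / 2.0))

def composed(x, p, q):
    """Density at x of the quadratic composition x^2 = x1^2 + x2^2,
    with x1 ~ chi_{2p} and x2 ~ chi_{2q}, via the polar-coordinate
    integral over theta in [0, pi/2]."""
    f = lambda t: chi_pdf(x * np.cos(t), 2 * p) * chi_pdf(x * np.sin(t), 2 * q) * x
    return quad(f, 0.0, 0.5 * np.pi)[0]
```

At any x, `composed(x, p, q)` should agree with `chi_pdf(x, 2 * (p + q))` to numerical integration accuracy.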
Appendix A.5.3: χ_{k = 2(n−1)} cumulative distribution function
Directly integrating for k = 2 gives (A.53). For k = 4, we integrate by parts with (A.54). For k = 6, also integrating by parts, we note (A.58) and proceed similarly. We deduce the general form for k = 2(n−1): (A.62)
Appendix A.6: Computing the I_{k,n}(x) integral
Let us first expand a few notations for better readability. We should use |det J_{F}|, but x is positive and all angles θ are in [0, π/2], so det J_{F} is always positive. Let us also expand (A.65). Multiplying both expressions leads to (A.66)
Appendix A.6.1: Case I_{k=1,n}(x)
The exact solution is given by Eq. (A.62).
We note the particular case I_{k,n}(k_{γ}) = γ if k_{γ} is computed for this particular value of n.
Appendix A.6.2: Case I_{k=n,n}(x)
which is the volume of a hypersphere of dimension 2(n−1), also called a 2(n−1)-sphere.
Appendix A.6.3: Intermediate case I_{k,n}(x)
For k > 1 and k < n, we integrate by parts in a first step. We thus find that (A.88) holds, and we finally obtain the recurrence formula (A.89). Knowing that (A.96), we integrate the different parts:
Appendix B: Proper motion estimation and testing
Appendix B.1: Estimating proper motions
In this section, we show how it is possible to estimate the proper motion v of a source if the simplifying assumption of null proper motion made in Sect. 3 is not met. We neglect the parallax and the long-term effect of the radial motion of the source, but these extra parameters could also be fitted provided we have enough catalogue measurements. In our simple case, the position p of a source at any time t can be computed from its position p_{0} at a reference epoch t_{0} (e.g. 2000): p(t) = p_{0} + v(t−t_{0}) (B.1). We assume we have n observations of the source at various epochs t_{i}. We want to estimate the proper motion, that is, to estimate the four unknowns (v_{x},v_{y},x_{0},y_{0}). To do so we use the maximum-likelihood estimate, which consists in maximizing the likelihood (B.2), and therefore minimizing χ^{2} = ∑_{i}(p_{i}−p(t_{i}))^{T}V_{i}^{-1}(p_{i}−p(t_{i})) (B.3), by solving the system of equations ∂χ^{2}/∂a_{k} = 0 (B.4). To do so, we compute the derivative of χ^{2}(p(t)) with respect to a parameter a_{k} (B.5) and the Jacobian matrix of p(t | v,p_{0}). We thus have to solve the following system of equations, noting Δt_{i} = t_{i}−t_{0} (B.8). We can for example use Cramer’s rule to solve the general problem (B.9); with, in our case, the notations (B.12) and (B.13), this leads to the solution (B.14).
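A minimal pure-Python sketch of this maximum-likelihood fit (the function name and the (t, x, y, σ_x, σ_y, ρ) input convention are ours, not the paper's), building the 4×4 normal equations and solving them by Gauss-Jordan elimination rather than Cramer's rule:

```python
def fit_proper_motion(obs, t0=0.0):
    """Maximum-likelihood linear fit of (x0, y0, vx, vy).

    obs: list of (t, x, y, sx, sy, rho) with sx, sy the positional errors
    and rho their correlation (rho = 0 for circular errors).
    Minimizes chi^2 = sum_i r_i^T V_i^{-1} r_i, r_i = p_i - p0 - v*(t_i - t0).
    """
    # accumulate the 4x4 normal matrix N and right-hand side b
    N = [[0.0] * 4 for _ in range(4)]
    b = [0.0] * 4
    for t, x, y, sx, sy, rho in obs:
        dt = t - t0
        # inverse of the 2x2 covariance matrix V_i
        det = (sx * sx) * (sy * sy) * (1 - rho * rho)
        W = [[sy * sy / det, -rho * sx * sy / det],
             [-rho * sx * sy / det, sx * sx / det]]
        # design rows: x = x0 + vx*dt, y = y0 + vy*dt, params (x0, y0, vx, vy)
        rows = [(1.0, 0.0, dt, 0.0), (0.0, 1.0, 0.0, dt)]
        p = [x, y]
        for a in range(4):
            for c in range(4):
                N[a][c] += sum(rows[u][a] * W[u][v] * rows[v][c]
                               for u in range(2) for v in range(2))
            b[a] += sum(rows[u][a] * W[u][v] * p[v]
                        for u in range(2) for v in range(2))
    # solve N theta = b by Gauss-Jordan elimination with partial pivoting
    M = [N[i][:] + [b[i]] for i in range(4)]
    for col in range(4):
        piv = max(range(col, 4), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(4):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [mr - f * mc for mr, mc in zip(M[r], M[col])]
    return [M[i][4] / M[i][i] for i in range(4)]  # (x0, y0, vx, vy)
```

On noiseless data drawn from the model, the fit recovers the input parameters exactly, which is a convenient sanity check of the normal-equation assembly.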
Appendix B.2: Estimating the error on the proper motion estimate
The covariance matrix V_{f} on the estimated proper motion parameters is provided by the inverse of the Hessian matrix H_{f} of f = χ^{2}/2, evaluated at the estimated parameters, that is by the matrix given in Eqs. (B.15) to (B.17).
Appendix B.3: Simple case: no covariance
If all positional errors are circles (i.e. ∀i ∈ [1,n], ρ_{i} = 0), the simplifications lead to the classical formulae in which x and y are computed independently (see Press et al. 2007, p. 781, Sect. 15.2 “Fitting Data to a Straight Line”) (B.18), where (B.19), and the associated errors are (B.20).
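In this no-covariance case the fit reduces to the classical weighted straight-line fit of Press et al.; a short sketch (helper name ours), applied independently to the x and y coordinates:

```python
import math

def straight_line_fit(ts, xs, sigmas):
    """Weighted fit of x(t) = x0 + vx*t (cf. Press et al. 2007, Sect. 15.2).

    Returns (x0, vx, err_x0, err_vx). With circular errors, the same
    routine is applied independently to the x and y coordinates.
    """
    w = [1.0 / (s * s) for s in sigmas]
    S = sum(w)
    Sx = sum(wi * t for wi, t in zip(w, ts))
    Sy = sum(wi * x for wi, x in zip(w, xs))
    Sxx = sum(wi * t * t for wi, t in zip(w, ts))
    Sxy = sum(wi * t * x for wi, t, x in zip(w, ts, xs))
    delta = S * Sxx - Sx * Sx
    x0 = (Sxx * Sy - Sx * Sxy) / delta
    vx = (S * Sxy - Sx * Sy) / delta
    # standard errors of the intercept and slope
    return x0, vx, math.sqrt(Sxx / delta), math.sqrt(S / delta)
```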
Appendix B.4: Verifying the results
We have implemented and tested the result given by Eq. (B.14). We compare the results with a modified version of the Levenberg-Marquardt (LM) method (see Press et al. 2007, p. 801, Sect. 15.5.2 “Levenberg-Marquardt method”) that we designed to handle binormal distributions. The algorithm is the same except that we replace the term β_{k} by Eq. (B.5) and α_{kl} by Eq. (B.21). We initialize the LM parameters with the approximate solutions given in Eq. (B.18). The results obtained using both methods (LM and Eq. (B.14)) are identical.
Appendix B.5: Testing the unique source hypothesis
When estimating the proper motion, we formulated the hypothesis H that our n observations come from a single underlying source. The χ^{2} of Eq. (B.3) follows a chi-square distribution with 2n−4 = 2(n−2) degrees of freedom. Therefore the criterion not to reject H is given by Eq. (B.22), in which the threshold is the γ quantile of the χ^{2}_{2(n−2)} distribution.
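Since the number of degrees of freedom 2(n−2) is even, the chi-square CDF has a closed form, and the acceptance test is easy to implement; a sketch (function names ours, γ quantile found by bisection):

```python
import math

def chi2_cdf_even(x, dof):
    # closed-form chi^2 CDF for an even number of degrees of freedom:
    # F(x) = 1 - exp(-x/2) * sum_{m=0}^{dof/2 - 1} (x/2)^m / m!
    return 1.0 - math.exp(-x / 2) * sum((x / 2)**m / math.factorial(m)
                                        for m in range(dof // 2))

def single_source_ok(chi2, n, gamma=0.9973):
    """Keep hypothesis H (the n positions come from one source) if the
    chi^2 of Eq. (B.3), with 2(n-2) degrees of freedom, is below the
    gamma quantile (computed here by bisection)."""
    dof = 2 * (n - 2)
    lo, hi = 0.0, 1e3
    for _ in range(200):
        mid = (lo + hi) / 2
        if chi2_cdf_even(mid, dof) < gamma:
            lo = mid
        else:
            hi = mid
    return chi2 <= (lo + hi) / 2
```

For n = 3 (two degrees of freedom) and γ = 0.9973, the threshold is −2 ln(1−γ) ≈ 11.83, so a χ^{2} of 11 is accepted while 12.5 is rejected.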
Appendix C: Synthetic catalogues generation script
Here is the script used to generate three synthetic tables and crossmatch them with the online ARCHES XMatch Tool. The language of the script is specific to the tool. Both the tool and its documentation are available online^{10}.
synthetic seed=1 nTab=3 prefix=true \
    geometry=cone ra=22.5 dec=33.5 r=0.42 \
    nA=40000 nB=20000 nC=35000 \
    nAB=6000 nAC=12000 nBC=18000 \
    nABC=10000 \
    poserrAtype=CIRCLE poserrAmode=formula paramA1=0.4 \
    poserrBtype=CIRCLE poserrBmode=function paramB1func=x \
    paramB1xmin=0.8 paramB1xmax=1.2 \
    paramB1nstep=100 \
    poserrCtype=CIRCLE poserrCmode=function \
    paramC1func=exp(-0.5*(x-0.75)*(x-0.75)/0.01)/(0.1*sqrt(2*PI)) \
    paramC1xmin=0.5 paramC1xmax=1 \
    paramC1nstep=100
save prefix=simu3 suffix=.fits common=simu3.fits format=fits
cleartables
get FileLoader file=simu3A.fits
set pos ra=posRA dec=posDec
set poserr type=CIRCLE param1=ePosA param2=ePosB param3=ePosPA
set cols *
get FileLoader file=simu3B.fits
set pos ra=posRA dec=posDec
set poserr type=CIRCLE param1=ePosA param2=ePosB param3=ePosPA
set cols *
xmatch chi2 completeness=0.9973 nStep=1 nMax=2 join=inner
merge pos chi2
merge dist mec
get FileLoader file=simu3C.fits
set pos ra=posRA dec=posDec
set poserr type=CIRCLE param1=ePosA param2=ePosB param3=ePosPA
set cols *
xmatch chi2 completeness=0.9973 nStep=2 nMax=2 join=inner
merge pos chi2
merge dist mec
save simu3.ABC.fits fits
All Tables
Summary of the transformations of positional errors provided in various astronomical catalogues into the coefficients of error covariance matrices (before adding quadratically possible systematics).
Values of the normalization integrals I_{k,n−k}(k_{γ}) for a number of catalogues ranging from two to five.
Values of the derivatives of the normalization integrals for a number of catalogues ranging from two to five.
All Figures
Fig. 1 The 52 partitions of a set with n = 5 elements. Each partition corresponds to one hypothesis for five distinct sources from five distinct catalogues. Left: k = 5, the five sources come from five distinct real sources. Right: k = 1, the five sources all come from the same real source. (Tilman Piesk – CC BY 3.0 – modified – link in footnote).

Fig. 2 Two possible likelihoods for n = 2 catalogues and γ = 0.9973: normalized Rayleigh (red, filled curve) and Poisson (green, dashed curve) components. We note that this γ value implies the [0,k_{γ} = 3.443935] range for x. 

Fig. 3 Three possible likelihoods p(x  h_{k},s) for n = 3 catalogues and γ = 0.9973: χ distribution with four degrees of freedom (red, filled curve); Integral of a χ distribution with two degrees of freedom times a twodimensional Poisson distribution (green, dashed curve); Fourdimensional Poisson distribution (blue, dotted curve). 

Fig. 4 Four possible likelihoods p(x  h_{k},s) for n = 4 catalogues and γ = 0.9973: χ distribution with six degrees of freedom (red, filled curve); integral of a χ distribution with two degrees of freedom times a fourdimensional Poisson distribution (green, dashed curve); integral of a χ distribution with four degrees of freedom times a twodimensional Poisson distribution (blue, dotted curve); sixdimensional Poisson distribution (cyan, filled curve). 

Fig. 5 Result of the crossmatch of three synthetic catalogues with input values n_{A} = 40 000, n_{B} = 20 000, n_{C} = 35 000, n_{AB} = 6000, n_{AC} = 12 000, n_{BC} = 18 000 and n_{ABC} = 10 000. The error on catalogue A is a constant equal to 0.4′′. The circular error on catalogue B follows a linear distribution between 0.8 and 1.2′′. The circular error on catalogue C follows a Gaussian distribution of mean 0.75′′ and standard deviation 0.1′′ between 0.5 and 1′′. The common surface area is a cone of radius 0.42°. Top left: histogram of all associations and theoretical curve from the input parameters. Top centre: histogram of real associations and theoretical curve from the input parameters. Top right: histogram of “fully” spurious associations and theoretical curve from the input parameters. Bottom: histograms and theoretical curves of associations mixing a real association between two sources plus a spurious source.
