Gaia Early Data Release 3: The celestial reference frame (Gaia-CRF3)

Gaia-CRF3 is the celestial reference frame for positions and proper motions in the third release of data from the Gaia mission, Gaia DR3 (and for the early third release, Gaia EDR3, which contains identical astrometric results). The reference frame is defined by the positions and proper motions at epoch 2016.0 for a specific set of extragalactic sources in the (E)DR3 catalogue. We describe the construction of Gaia-CRF3, and its properties in terms of the distributions in magnitude, colour, and astrometric quality. Compact extragalactic sources in Gaia DR3 were identified by positional cross-matching with 17 external catalogues of quasars (QSO) and active galactic nuclei (AGN), followed by astrometric filtering designed to remove stellar contaminants. Selecting a clean sample was favoured over including a higher number of extragalactic sources. For the final sample, the random and systematic errors in the proper motions are analysed, as well as the radio-optical offsets in position for sources in the third realisation of the International Celestial Reference Frame (ICRF3). The Gaia-CRF3 comprises about 1.6 million QSO-like sources, of which 1.2 million have five-parameter astrometric solutions in Gaia DR3 and 0.4 million have six-parameter solutions. The sources span the magnitude range G = 13 to 21 with a peak density at 20.6 mag, at which the typical positional uncertainty is about 1 mas. The proper motions show systematic errors on the level of 12 ${\mu}$as yr${}^{-1}$ on angular scales greater than 15 deg. For the 3142 optical counterparts of ICRF3 sources in the S/X frequency bands, the median offset from the radio positions is about 0.5 mas, but exceeds 4 mas in either coordinate for 127 sources. We outline the future of the Gaia-CRF in the next Gaia data releases.


Introduction
Gaia is the second highly successful astrometric space mission of the European Space Agency (ESA) 1 (Gaia Collaboration et al. 2016), vastly extending and improving the achievements of its predecessor Hipparcos (ESA 1997;Perryman et al. 1997;van Leeuwen 2007). While the latter delivered astrometry at the level 1 Some further acronyms used in this paper: EDR3 -Gaia Early third Data Release; DR3 -Gaia third Data Release; QSO -quasi stellar object (here understood as all compact extragalactic sources); AGNactive galactic nuclei; ICRS -International Celestial Reference System; ICRF3 -third realisation of the International Celestial Reference Frame; Gaia-CRFnnth realisation of the Gaia Celestial Reference Frame; VLBI -very long baseline interferometry; S/X, K, X/Ka (after ICRF3) denote the radio frequency bands used in VLBI. of a milliarcsecond (mas) for nearly 120 000 preselected stars at the mean epoch 1991.25, the latest data release Gaia EDR3 contains astrometry for about 1.8 billion sources with accuracies from a few tens of microarcseconds (µas) to ∼1 mas, mainly depending on the brightness (Gaia Collaboration et al. 2021a;Lindegren et al. 2021b).
It is well known that the observational principles of Gaia (and Hipparcos) do not allow one to fully fix the reference frame, that is the coordinate system of positions and proper motions on the sky. Gaia effectively measures angular separations of sources on the sky, and these separations remain unchanged if one rotates the whole solution by an arbitrary angle at each moment of time. Although the degeneracy with rotation is not mathematically exact, owing to the effects of aberration and gravitational light bending which are coupled to the fixed solar system and Gaia ephemerides, those effects are too small to be of any practical help in fixing the reference frame. In the standard astrometric solution, the apparent motion of a source on the sky is modelled by five freely adjusted parameters: two angular components of its position at some reference epoch, the parallax, and two components of the mean proper motion over the observed time interval. This model gives a six-fold near-degeneracy in the astrometric solution: three constant angles of rotation for the positions at the reference epoch (the 'orientation' of the solution), and three components of the constant angular velocity of rotation that affect the proper motions (the 'spin' of the solution). In short, although there are many more observations than unknowns to fit, the resulting system always has a rank deficiency of six. To fix the reference frame of a Gaia solution, external information should be used to constrain precisely these six degrees of freedom, and nothing more.
Gaia routinely delivers astrometry for sources as faint as G = 21 mag and, unlike Hipparcos, it can observe a large number of extragalactic sources. This capability plays an important role for the whole project. By directly linking the astrometric solution to a set of remote extragalactic sources, Gaia implements the principles set forth in the International Celestial Reference System (ICRS; Arias et al. 1995), a concept originally devised for the use in radio astrometry observing radio-loud quasars at cosmological distances. In practice, the term 'extragalactic' here refers to objects more distant than about 50 Mpc, that is well beyond the Local Group. The reasoning for choosing remote extragalactic sources is simple: because they are not affected by the complex motions in our Galaxy and have negligible apparent motions on the sky (thanks to their large distances), they provide reliable and stable beacons for astrometry. The suggestion to rely on very distant and nearly motionless sources can be traced back to W. Herschel and P.S. Laplace two centuries ago, well before quasars were spotted (see Sect. 6 of Fricke 1985 for a brief historical account).
The ICRS embraces two ideas of a very different nature. The first one concerns the orientation of a celestial reference frame at a given epoch while the second one deals with the rotational state (spin) of the reference frame. The orientation of a celestial reference frame at a given epoch is a pure convention, with no physical significance to the particular choice of axes. In this respect the choice of origin for the right ascension in ICRS is as arbitrary as, for example, the choice of Greenwich as the Prime Meridian for the geographic longitude in 1884. Gaia uses its own astrometric observations of the optical counterparts to sources from the well-established radio astrometric catalogue ICRF3 (Charlot et al. 2020) to fix the orientation of the Gaia catalogue. This choice is of a practical nature, namely to ensure consistency between the optical and radio frames and continuity of the orientation used in astronomy over past decades.
The second aspect, much more significant, is the use of remote extragalactic sources to fix the rotational state of the reference frame. Individual sources may show some (even non-linear) positional variations in time, owing to their peculiar motions or intrinsic variations of their structure (e.g. Taris et al. 2018), but it is assumed that objects at cosmological distances, on the average, show 'no rotation'.
In the framework of Newtonian physics, the assumption of no rotation is interpreted as no rotation with respect to Newtonian absolute space. In the modern framework of the General Theory of Relativity, the procedure results in a reference frame in which the Hubble flow has no rotational component. One can argue that the no rotation assumption is related to Mach's prin-ciple (e.g. Schiff 1964). However, we stress that Gaia's observations of distant extragalactic objects can neither prove nor disprove that principle. To test Mach's principle is a separate task beyond the scope of the current discussion.
Besides the task of fixing the Gaia reference frame, having a set of astrometrically well-behaved extragalactic sources is also important for assessing systematic errors in the Gaia astrometric solution. Again, this is possible because we can assume that the true values of the proper motions and parallaxes of these sources are all very close to zero. Such applications of the Gaia-CRF3 sample are found, for example, in Lindegren et al. (2021a) and Lindegren et al. (2021b).
Based on the way Gaia automatically detects and observes all point-like objects of sufficient brightness (de Bruijne et al. 2015), Gaia EDR3 is expected to contain a few million extragalactic objects suitable for defining a reference frame. However, it is not easy to reliably identify those sources in the Gaia catalogue. As was the case for Gaia-CRF1  and Gaia-CRF2 (Gaia Collaboration et al. 2018), the identification of well-behaved extragalactic sources for Gaia EDR3 relies on a dedicated cross-matching with external catalogues of quasistellar objects (QSOs; also often dubbed 'quasars' irrespective of their radio loudness) and active galactic nuclei (AGNs). For the purpose of this work the actual nature of the compact extragalactic sources plays no role, and we refer to them as 'QSOlike' sources, whether they are genuine QSOs of various kinds, galaxies with astrometrically compact nuclei, or other types of extragalactic objects. The resulting set of QSO-like objects selected on this basis constitutes Gaia-CRF3 -the third version of the Gaia-CRF. This paper describes the procedure used to select the sources of Gaia-CRF3 and presents its most important properties and its relation to the radio ICRF.
According to the principles of the ICRS, Gaia-CRF3 is entirely defined by the set of ∼ 1.6 million QSO-like sources described in this paper. It is implicitly assumed that the positions and proper motions of all other (mostly stellar) sources in Gaia DR3 are expressed in the same reference frame, and thus constitute a secondary, much denser reference frame that extends to the brightest magnitudes. For stars of similar magnitude and colour as the QSO-like sources, the consistency of the reference frames should be guaranteed by the methods of observation and data analysis in Gaia, which are exactly the same for both kinds of objects. For much brighter sources, and perhaps also for objects of extreme colours, this may not be the case owing to various instrumental effects and deficiencies in their modelling. Indeed, Cantat-Gaudin & Brandt (2021) have shown that the proper motions of bright (G 13 mag) sources in Gaia EDR3 have a residual spin with respect to the fainter stars by up to 80 µas yr −1 , and provide a recipe for their correction. Similarly, the alignment errors for the bright sources in Gaia EDR3 were investigated by Lunz et al. (2021), who found differences of about 0.5 mas. The properties of the secondary (stellar) reference frame in Gaia EDR3 are not further discussed in this paper.
The structure of the paper is as follows. Section 2 describes the multi-step procedure used to select the Gaia-CRF3 sources. The properties of those sources are discussed in Sect. 3. The Gaia-CRF3 sources matched to the radio sources of ICRF3 give an important link to radio astrometry and are discussed in Sect. 4. Section 5 reviews the role of the Gaia-CRF3 sources in the 'frame rotator' of the primary astrometric solution for Gaia EDR3, which is where the reference frame was actually fixed. Some conclusions and prospects for future versions of the Gaia-CRF are given in Sect. 6. Article number, page 3 of 31 A&A proofs: manuscript no. GaiaCRF3 The appendices provide more detailed information on a few more technical issues. Appendix B provides some additional details of the external catalogues used for Gaia-CRF3. Appendix C demonstrates how to select the Gaia-CRF3 sources in the Gaia Archive and trace which external catalogues have matches of those sources. In Appendix D we discuss the 'confusion sources' in Gaia EDR3, that is sources, mainly in the Galaxy or Local Group, whose astrometric parameters are roughly consistent with them being at cosmological distances, and that could therefore potentially be confused with QSO-like sources. Finally, Appendix E describes the algorithm used by the frame rotator to estimate the spin and orientation corrections for astrometric solutions.

Selection of Gaia sources
The goal of the source selection process for the Gaia-CRF is to find a set of QSO-like sources that combines three conflicting requirements: it should be as large as possible, as clean as possible (that is, free of stellar contaminants), and homogeneously distributed on the sky. We expect that this selection can eventually be made exclusively from Gaia's own observations, using the astrophysical characterisation of sources based on a combination of astrometric and photometric information (e.g. Bailer-Jones et al. 2013;Delchambre 2018;Heintz et al. 2018;Bailer-Jones et al. 2019). This was not possible for Gaia EDR3, and similarly to what was done for the first two Gaia data releases, Gaia EDR3 uses external catalogues of QSOs and AGNs to identify QSO-like sources among the nearly two billion Gaia sources. As detailed below, the Gaia-CRF3 content is the result of cross-matches with 17 external catalogues, followed by a two-stage astrometric filtering. These external catalogues were all publicly available at the time of the final construction of Gaia-CRF3 in mid-June 2020.

Cross-matching
For the construction of Gaia-CRF3 we cross-matched the Gaia EDR3 astrometric catalogue with the external catalogues listed in Table 1 (see also Appendix B). Only astrometric information was used for the cross-matches. The first four entries in the table -three ICRF3 catalogues and OCARS -are based on VLBI astrometry in the radio domain and have an astrometric quality comparable to that of Gaia EDR3. For those catalogues the maximal cross-match radius was set to ∆ max = 100 mas. The other catalogues have various levels of astrometric quality and for them the cross-match radius was set to ∆ max = 2 arcsec. For virtually all catalogues the cross-matching results in a (relatively small) number of ambiguous matches, where more than one Gaia source was matched to a source in the external catalogue, or more than one source in the external catalogue was matched to the same Gaia source. In all such cases only the closest positional match was retained. For each external catalogue this gives a list of 'unique' matches in the form of pairs 'Gaia source identifier' -'external source identifier', in which all identifiers appear only once. Table 1 gives the total number of sources in the catalogue and, in column 3, the number of unique matches in Gaia EDR3, to which the two-stage filtering described below was applied.
This cross-matching procedure adopted for Gaia-CRF3 is not perfect and could be refined in future releases. For example, the maximum cross-matching radius could be better tuned to the precisions of the individual external catalogues and the expected densities of stars and quasars in the different situations.
A particular case concerns multiple lensed quasar images, where ideally all the images should be retained instead of just the closest match, as is presently done. In some ambiguous cases the cross-matching could benefit from using also the photometric information available in the external catalogues. This is however far from straightforward owing to the expected variability of the QSOs, differences in wavelength bands and resolution between the instruments, etc., which may even require a manual check of individual cases. For the current cross-matching we have chosen not to use such information.

Astrometric filtering: Individual sources
The first stage of the astrometric filtering ensures that the retained Gaia sources have five-and six-parameter astrometric solutions in Gaia EDR3 compatible with the assumed extragalactic nature of the source. For all the catalogues in Table 1 we applied the three criteria in Eqs. (1)-(3) below. This was found to be sufficient for the VLBI catalogues (the first four entries in the table), but for the remaining catalogues the two criteria in Eqs. (4) and (5) were also applied. Some of the criteria are alternatively formulated as parts of a query in the Gaia Archive 2 using the Astronomical Data Query Language (ADQL).
The first criterion is that the source has a valid parallax and proper motion. This is ensured by selecting sources with five-or six-parameter astrometric solutions: astrometric_params_solved = 31 OR astrometric_params_solved = 95 (1) We note that astrometric_params_solved is a binary flag indicating which astrometric parameters are provided for the source, that is 11 2 , 11111 2 , and 1011111 2 for two-, five-, and sixparameter solutions, with the seventh bit representing pseudocolour.
Secondly, the measured parallax of the source, corrected for the median parallax bias, should be zero within five times the formal uncertainty σ : which translates to the ADQL query abs((parallax + 0.017)/parallax_error) < 5 This takes into account the median parallax of −0.017 mas obtained from a provisional selection of QSO-like sources in Gaia EDR3, but ignores the complicated dependence of the parallax bias on the magnitude and other parameters discussed by Lindegren et al. (2021a). As shown by their Fig. 5, the variation of the parallax bias for the QSO-like sources is typically less than ±0.010 mas. Because this is many times smaller than the individual uncertainties (for example, 99% of the Gaia-CRF3 sources have σ > 0.1 mas), these variations were deemed uncritical for the astrometric filtering. Assuming Gaussian errors, ( + 0.017 mas) /σ is then expected to follow the standard normal distribution, and the probability that Eq.
(2) is violated by random errors is, nominally, erfc 5/ √ 2 5.7 × 10 −7 . The third criterion is that the proper motion components µ α * and µ δ should be small compared to their uncertainties σ µα * and σ µδ . Here we take into account the correlation ρ(µ α * , µ δ ) between the two parameters. We require where Cov(µ) is the covariance matrix given by Eq. (E.4). This corresponds to the query ( power(pmra/pmra_error,2) +power(pmdec/pmdec_error,2) -2*pmra_pmdec_corr *pmra/pmra_error *pmdec/pmdec_error ) /(1-power(pmra_pmdec_corr,2)) < 25 For Gaussian errors of the proper motion components, X 2 µ is expected to follow the chi-squared distribution with two degrees of freedom (see Appendix B of Mignard et al. 2016). In this case the nominal probability that a source violates Eq. (3) is exp (−25/2) 3.7 × 10 −6 . We neglect here the systematic proper motions of extragalactic objects caused by the acceleration of the solar system barycentre (e.g. Gaia Collaboration et al. 2021b). The amplitude of this effect, about 0.005 mas yr −1 , is much smaller than the typical proper-motion uncertainty of QSO-like sources in Gaia EDR3 (see below) and is therefore ignored in the filtering. We note that X µ is independent of the reference frame, so that, in particular, it plays no role if the equatorial or, say, galactic components of the proper motions are used in Eq. (3).
For the VLBI catalogues based on high-accuracy radio astrometry (the first four entries in Table 1), the application of Eqs.
(1)-(3) results in filtered samples that show no traces of stellar contamination. This is to be expected thanks to the high positional accuracy of the VLBI data and the small cross-match radius used (100 mas). Consequently, no additional filtering was considered for these catalogues.
The situation is very different for the remaining (non-VLBI) catalogues in Table 1. An analysis of their matches, filtered by Eqs. (1)-(3), revealed that a significant number of chance matches to (stellar) sources in Gaia EDR3 happen through a combination of two factors: (i) the high density of potentially matching stellar sources in certain areas of the sky; and (ii) the lower positional accuracy of these catalogues which motivated the use of a much larger cross-matching radius (2 arcsec) than for the VLBI catalogues.
To investigate the likely effects of chance matches we make use of the 'confusion sources' discussed in Appendix D. These are all the sources in Gaia EDR3 that satisfy all three conditions in Eqs. (1)-(3), irrespective of their true nature. The total number of confusion sources in Gaia EDR3 is approximately 213.5 million, of which 30.7 million have five-parameter solutions and 182.8 million have six-parameter solutions.
The galactic plane is especially problematic for the crossmatching: 23% of the confusion sources with five-parameter astrometric solutions and 50% of the confusion sources with sixparameter solutions are located in the 10% of the celestial sphere with the galactic latitude b such that | sin b | < 0.1. The analysis of the matches from non-VLBI catalogues showed that the resulting samples of QSO candidates in that area are too much contaminated by stars and cannot reasonably be further cleaned by means of astrometric criteria alone. We therefore decided to completely avoid this area of the sky for Gaia-CRF3 for the matches to the non-VLBI catalogues.
Considering the distribution of the Gaia-CRF3 sources and the confusion sources in galactic latitude shown on Figs. 5 and D.2, respectively, our results for Gaia EDR3 generally agree with the discussion of the stellar contamination at various galactic latitudes provided by Heintz et al. (2020) (see their Fig. 9).
The second factor important for the chance matches -the lower quality of the astrometric data in the non-VLBI catalogues -is also more problematic in the areas of higher density of confusion sources. Fig. D.2 shows that the density of confusion sources is a strong function of the Galactic latitude b. To avoid too many chance matches at low | b | we require a smaller matching distance ∆ towards the Galactic plane. For the matches to the non-VLBI catalogues we therefore add the two criteria We note that, after the initial cross-matching, this last criterion is the only one that uses astrometric data (positions) from the external catalogue. The number of unique matches that survived the filtering by Eqs.
(1)-(3) for the VLBI catalogues, is given in the fourth column of Table 1.

Astrometric filtering: Sample statistics
The second stage of the filtering is based on an assessment of the level of stellar contamination among the filtered unique matches in the different external catalogues. To this end we analysed histograms of the normalised parallaxes /σ and normalised proper motion components µ α * /σ µ α * and µ δ /σ µ δ . Ignoring the small systematic and random offsets discussed in Sect. 2.2, we expect that, for a clean sample of QSOs, the normalised quantities should have Gaussian distributions with mean values close to zero and standard deviations only slightly greater than one (reflecting the slight underestimation of the formal uncertainties; see Sect. 3). Conversely, if their distributions deviate significantly from the expected Gaussian, it indicates that the sample is contaminated by stellar objects with non-zero parallaxes and/or proper motions. Although the criteria in Eqs.
(1)-(3) are based on the Gaussian hypothesis, they absolutely do not guarantee that the normalised parallaxes and proper motions are Gaussian after the filtering. Figures D.3 and D.4 show examples of the distributions of the normalised parameters for general Gaia EDR3 sources satisfying the criteria. While a fraction of the sources in these samples are QSOs (including Gaia-CRF3 sources), most of them are stars in our Galaxy or in nearby satellite galaxies such as the Large Magellanic Cloud (LMC) and Small Magellanic Cloud (SMC), and the histograms of the normalised quantities are very far from Gaussian.
A high level of stellar contamination in a sample can be directly seen in the histograms of the normalised parameters as a distortion of the expected Gaussian distribution. Any sub-sample large enough to create representative histograms can be checked separately. This gives a flexible tool for screening samples of QSO candidates against stellar contamination.
For the proper motions the same effect is illustrated in Fig. 1. This shows the relative frequencies of X 2 µ from Eq. (3) for the filtered matches to the catalogues in Table 1. If X 2 µ comes purely from the Gaussian errors of the Gaia proper motions, it should follow the χ 2 distribution with two degrees of freedom, with normalised frequency exp(− 1 2 x/s 2 ). Here s is a scaling factor ( 1) that takes into account that the uncertainties of the astrometric parameters in Gaia EDR3 are slightly underestimated. In Fig. 1 this expected frequency is a straight line through unity at X 2 µ = 0. Applying this tool to the full sets of filtered matches to each external catalogue in Table 1 reveals that only the sources matched to three of the catalogues (C75, Milliquas v6.5, and Gaia-unWISE) show clear evidence of stellar contamination. These catalogues are marked with an asterisk in the table. The filtered matches of the remaining 14 catalogues show no obvious traces of stellar contamination and therefore qualify as being part of Gaia-CRF3.
The same Gaia source usually appears in the filtered crossmatches to several external catalogues. One reason for this, as discussed in Appendix B, is that many of the catalogues are at least partially based on the same observational material (the IR photometry of WISE, certain data releases of SDSS, etc.), although their criteria for the selection of QSOs may be different. But even when the observational material is different, it is no wonder that different approaches and algorithms result in strongly overlapping samples, namely if they tend to find the same objects. It is therefore reasonable to expect that the circumstance that a Gaia source has matches in several external catalogues increases the reliability of its classification as a QSO. If we split the set of Gaia sources matching a given external catalogue (after the astrometric filtering) into subsets that are present also in other catalogues and a subset that appears only in the cross-match with this particular catalogue, we can therefore use an analysis of these subsets to improve the purity of the final QSO selection.
Using this idea, we selected three sets of Gaia sources that appear only among the filtered matches of C75, Milliquas v6.5, and Gaia-unWISE, but do not appear in any of the other catalogues: we call these the c75-only, m65-only, and guw-only sets. All three sets show significant stellar contamination. We found no obvious way to clean the c75-only sources and this whole set of 95 988 sources was therefore dropped from further consideration. The Milliquas v6.5 and Gaia-unWISE catalogues include estimates of the probability that a source is a QSO. Selecting from the m65-only set only the sources with the highest probability of being a QSO did not remove the contamination and that set of 26 326 sources was therefore also dropped. For the guwonly set we found, on the contrary, that the subset of sources with the highest QSO probability (PROB_RF = 1) appears to be free of stellar contaminants. This gave an additional set of 229 914 sources for Gaia-CRF3. The other 491 894 guw-only sources were dropped (any relaxation of PROB_RF = 1 immediately gave samples with clear signs of stellar contamination). The number of sources from the cross-matches of each catalogue that passed also this second stage of astrometric filtering is given in the fifth column of Table 1, summing up to a total of 1 614 173 sources. The relative frequencies of X 2 µ in Gaia-CRF3 and in the three sets of sources dropped from C75, Milliquas v6.5, and Gaia-unWISE are shown in Fig. 2. Deviations from the exponential distribution expected for Gaussian errors are very obvious for the dropped sets; for the final selection the distribution is briefly discussed below (Sect. 2.4).
A separate investigation was done for the QSO candidates in the areas of the LMC and SMC, which have a very high density of confusion sources ( Fig. D.1) and therefore a higher risk of chance matches. The statistics of confusion sources in these areas are discussed in Appendix D. The Gaia-CRF3 contains 5411 and 2779 sources, respectively, in the LMC and SMC areas (as defined in the appendix). The distributions of the normalised quantities indicate a low level of stellar contamination in these sets.
Further checks of the selection were done by splitting the Gaia-CRF3 sources into subsets according to magnitude, colour, and Galactic or ecliptic latitude, and examining the distributions of normalised parallaxes and proper motions. No clear contamination was detected in any of the subsets.
Gaia-CRF3 contains only 195 sources with | sin b | < 0.1. They all come from the VLBI-based catalogues, for which Eqs. (4) and (5) were not applied. We made an attempt to find a clean set of QSO-like sources also in the Galactic plane area, that is without applying the criterion in Eq. (4). Using the cross-matches and the remaining four criteria, we found nearly  Fig. 1. Relative frequencies of X 2 µ for the filtered matches to the catalogues in Table 1. The catalogues are identified by the codes given in column 6 of the table, with the eight larger catalogues in the upper panel and the smaller catalogues in the lower panel. The four radio catalogues (ICRF3 S/X, ICRF3 K, ICRF X/Ka, and OCARS) are combined under 'radio', and the three blasar catalogues (Roma-BZCAT, 2WHSP, and ALMA calibrators) are combined under 'blasars'. 100 000 additional QSO candidates with | sin b | < 0.1. However, the distributions of their normalised parameters indicate a high level of stellar contamination. This demonstrates that the crossmatches and/or external astrophysical classifications are unreliable in this crowded part of the sky, and none of these sources were further considered for Gaia-CRF3. However, we note that this set should contain a significant percentage of genuine QSOs, part of which may be recovered in future versions of the Gaia-CRF (see Sect. 2.5).

Final selection
The union of all Gaia sources that passed both stages of the filtering contains 1 614 173 sources and constitutes the final source list of Gaia-CRF3. Of these, 1 215 942 (75%) have fiveparameter solutions in Gaia EDR3 and 398 231 (25%) have sixparameter solution. Appendix C explains how to find the astrometric parameters of these sources and how to trace the matches of these sources in the external catalogues. Figure 3 shows the distributions of the normalised parallaxes and proper motion components for both kinds of solutions. The standard deviations of the three normalised quantities are 1.052, 1.055, and 1.063 for the five-parameter solutions and 1.073, 1.099, and 1.116 for the six-parameter solutions. All distributions are very symmetric and close to the Gaussian shape (i.e. parabolic in the logarithmic plots), suggesting a low level of stellar contamination. The small excess of sources with normalised parameters beyond ±4 is not necessarily a sign of contamination, but could be produced by source structure or instrumental effects as well as by statistical fluctuations in the sample.
The number of (non-QSO) chance matches in Gaia-CRF3 can be estimated by counting the confusion sources that satisfy Eqs. (4) and (5) around positions that are slightly offset from the Gaia-CRF3 positions. Using offsets of ±0.1 deg in ∆α cos δ or ∆δ we conclude that Gaia-CRF3 is expected to contain up to 420 chance matches, of which 150 have five-parameter solutions and 270 have six-parameter solutions. This corresponds to a probable contamination of 0.012% and 0.068%, respectively, of the fiveand six-parameter sources in Gaia-CRF3.
The chance matches are obviously not the only cause of stellar contamination in Gaia-CRF3. Further contaminants may come from erroneous classification of some sources in the external catalogues, even among the sources that passed the astrometric filtering in Eqs. (1)-(5). We have found no means to quantify this effect, but expect it to be quite small. A certain hint on the possible (total) level of contamination could be given by the relative frequency of X 2 µ for Gaia-CRF3, as shown by the red curve in Fig. 2. This distribution can be rather well described as the mixture of two Gaussian distributions: one with standard deviation of 1.051 containing 98.0% of the sources, the other with standard deviation of 2.038 containing 2.0% of the sources. We stress that this cannot directly be interpreted in terms of the purity of the Gaia-CRF3: on one hand, genuine QSOs may show deviations from Gaussianity for other reasons (as mentioned above); on the other hand, some contaminants will have measured proper motion that are perfectly consistent with the extragalactic hypothesis. Nevertheless, we consider it unlikely that the level of contamination in Gaia-CRF3 is higher than 2%. Further investigations are needed to show if this claim is justified.

Discussion of the selection
Assuming that QSO-like sources have zero true values for the parallaxes and proper motions, and that the astrometric errors are Gaussian with uncertainties as given in Gaia EDR3, we expect that very few actual QSOs in a sample of a few million candidates will violate Eqs. (2) or (3). The numerical boundaries in Eqs.
(2) and (3) were selected empirically, based on the data themselves, and represent a compromise taking into account possible non-Gaussian systematic errors and effects of complex and sometimes time-dependent source structure. The latter are to be expected for QSOs and could produce apparent proper motions over several years of observations, while parallaxes are Article number, page 7 of 31 A&A proofs: manuscript no. GaiaCRF3 much less affected by such effects. This justifies setting separate criteria for parallaxes and proper motions as described above.
In future Gaia data releases, when these effects may be better known and statistically quantified, it will be possible to adopt a mathematically more consistent filtering approach based on the 3 × 3 covariance matrix of the parallax and proper motion components.
As already mentioned, we expect that future versions of Gaia-CRF will to a higher degree depend on Gaia's own (astrometric, photometric, and spectroscopic) observations, and that external catalogues of AGNs will consequently become less important for the selection of QSO-like sources. These catalogues will continue to be important for validation purposes, but their uses in the selection process may become more complex if the astrometry in external catalogues is already Gaia-based. For example, a criterion such as Eq. (5) may not be meaningful if the position in the external catalogue comes from a previous Gaia data release.
The search for the QSOs in the Galactic plane is an important scientific problem that requires further investigation. We note that more advanced methods to combine different datasets in the search for QSOs in the Galactic plane are under development (e.g., Fu et al. 2021). Combined with the expected use of Gaia's own object classification, one can hope that these methods will result in a much improved completeness of the Gaia-CRF in the Galactic plane in the future.
The selection of QSO-like objects described here is mainly driven by the reliability of the resulting set of sources. The resulting set of QSO-like sources can be considered 'reliable', but by no means 'complete'. It is clear that the selection algorithm rejects some genuine quasars, for example if they show observable mean proper motions for the time span of the Gaia EDR3 data. A prominent example here is 3C273 -the brightest quasar on the sky, which has significant proper motion in declination in Gaia EDR3 and X 2 µ = 29.7 (Eq. 3). For the purposes of aligning the reference frame of Gaia EDR3 with the ICRF, and especially for ensuring that it has negligible spin with respect to distant galaxies, one can use a much smaller set of QSO-like sources than the full Gaia-CRF3 without significantly increasing the statistical uncertainties of the global system. Indeed, this system was defined in the final stages of the primary astrometric solution, as described in Sect. 5, using only about 0.4 million 'frame rotation' sources. However, the purpose of Gaia-CRF3 is not just to fix the system for the Gaia astrometric solution, but to provide the community with a direct access to this system with sources as faint as permitted by the sensitivity of Gaia, while still having a very low level of stellar contamination.

Statistical properties of the Gaia-CRF3 sources
In this section we present the overall properties of Gaia-CRF3 primarily in terms of its distributions in position, magnitude, colour, and astrometric quality.

Distribution of the sources on the sky
The spatial distribution of the ∼1.6 million Gaia-CRF3 sources is shown in Fig. 4. Their distributions in galactic longitude and latitude are shown in Fig. 5. In the avoidance zone | sin b | < 0.1 Gaia-CRF3 has only 195 sources matched to the radio catalogues as described in Sect. 2. Outside the avoidance zone the average density is about 42 deg −2 , but with significant variations with galactic latitude; the density is typically below average for | b | 30 • and above at higher latitudes. This comes from the source catalogues used in our compilation combined with the selection function in Eq. (4), designed to avoid stellar contaminants near the Galactic plane. From Fig. B.1 one can see that also most of the external catalogues avoid crowded regions on the sky. We note that the density of Gaia-CRF3 reaches about 110 deg −2 in some areas on the sky. Given the expected isotropic distribution of QSOs and the effects of Galactic extinction, a complete version of a Gaia-CRF could therefore potentially comprise about 4 million sources. This estimate is also suggested by recent studies (e.g. Shu et al. 2019;Bailer-Jones et al. 2019;Fu et al. 2021) of the QSOs as well as by the extragalactic results of Gaia DR3. Notes. Columns 3-9 give median values of the G magnitude, the G BP − G RP colour index, the effective wavenumber ν eff (or the pseudo-colour for six-parameter solutions), the astrometric quality indicator RUWE (see text), the positional uncertainty at epoch J2016.0 (semi-major axis of the error ellipse), and the uncertainties of the proper motion components in α and δ. The last line ('all') is for the whole set of Gaia-CRF3 sources. There are only 195 sources in the avoidance zone | sin b | < 0.1. The distribution is shown for the whole Gaia-CRF3 and separately for the sources with five-and six-parameter astrometric solutions. Table 2 and Fig. 6 show the median values and distributions of several important properties of the Gaia-CRF3 sources, split according to the nature of the solution (five-or six-parameter solution): magnitude (G), colour index (G BP − G RP ), and the renormalised unit weight error (RUWE; Lindegren et al. 2021b). The RUWE is a measure of the goodness-of-fit of the astrometric model to the observations of the source. The expected value for a good fit is 1.0. A higher value could indicate that the source is not point-like at the optical resolution of Gaia ( 0.1 ), or that it has a time-variable structure. As seen in the left panel of Fig. 6, the proportion of six-parameter solutions increases for G 19, a feature that also implies that these sources are on average of lower astrometric accuracy, not only because they are fainter but also because the model fit is often worse, as shown in the right panel. The sources with six-parameter solutions are also on average redder (middle panel).

Main characteristics of the sources
The sky distributions of the median magnitude G and the colour index G BP − G RP (Fig. 7) clearly show the effects of the Galactic extinction and reddening. We also see that the selection favours brighter sources in the crowded areas such as the Galactic plane and the LMC and SMC areas.

Astrometric parameters of the sources
The left panel of Fig. 8 shows the distribution of the formal positional uncertainty in Gaia-CRF3 separately for the five-and six-parameter solution. The positional uncertainty, σ pos,max , is defined as the semi-major axis of the uncertainty ellipse (see Eq. (1) in Gaia Collaboration et al. 2018) at the reference epoch J2016.0. The middle panel is a scatter plot of σ pos,max versus G, with the smoothed median indicated by the dashed curve. In the right panel, the black curve (labeled 'all sources') gives the fraction of the full sky that has a density of Gaia-CRF3 sources exceeding the value on the horizontal axis. The other curves give the fractions when only sources with σ pos,max below a certain level are counted. In the 10% of the sky nearest to the Galactic plane (| sin b | < 0.1) the source density is negligible (cf. Fig. 4), which explains why none of the curves reaches above 90% sky coverage. To compute the fractions, the celestial sphere was divided into 49 152 pixels of equal area ( 0.8393 deg 2 , i.e. HEALPix level 6) and the density computed in each pixel. It should be noted that the curves therefore refer to the source densities at a spatial resolution of about 1 deg and would not be the same at a different resolution (pixel size). The plot shows, for example, that for a minimum density of 35 Gaia-CRF3 sources per square degree, about 62% of the sky could be covered; this fraction is reduced to 55% or 20% if we require, in addition, that σ pos,max is below 1.0 or 0.5 mas. Conversely, at 62% coverage there could be at least 30, 18, and 3 sources per square degree with σ pos,max < 1, 0.5, and 0.2 mas, if in each case the most populated pixels are chosen. The corresponding magnitude distributions can be gleaned from the scatter plot in the middle panel, where the different levels of positional uncertainty have been marked. Figure 9 shows the distributions of the median parallax and proper motion components on the sky. Both large-scale systematics and small-scale inhomogeneities are visible in these maps. The large-scale systematics are discussed in Sect. 3.4. Small-scale inhomogeneities, seen as fluctuating median values between neighbouring HEALPix cells in the maps, are mainly caused by the statistical scatter of the astrometric parameters in each cell. They are generally stronger in areas of low source density (Fig. 4) or fainter median magnitude (Fig. 7, top), and also in the ecliptic belt (ecliptic latitude | β | 45 • ), where the scanning law usually results in less accurate astrometric solutions than at high ecliptic latitudes (see, for example, Sect. 5.4 and Figs. 8 and  9 in Lindegren et al. 2021b). Numerous streaks, edges, and other small-scale features can also be traced to the scanning law. The overall scatter of the median values in the three maps is about 0.11 mas ( ), 0.12 mas yr −1 (µ α * ), and 0.11 mas yr −1 (µ δ ), if only the cells with at least five QSOs are counted (covering 85% of the sky). The positional errors are expected to be of a roughly similar size as the parallax scatter. Thus, we conclude that Gaia-CRF3, on an angular scale of about 1 • and over most of the sky, defines a local reference frame accurate to better than 0.2 mas in position and 0.2 mas yr −1 in proper motion.

Systematic errors
Systematic errors in Gaia-CRF3 were discussed already in Lindegren et al. (2021b, Section 5.6). In particular, the left column of their Fig. 13 contains smoothed maps of the parallaxes and proper motions of the Gaia-CRF3 sources that have fiveparameter astrometric solutions. These maps were obtained by a special smoothing procedure applied to the parameters of the individual sources. Including also the less numerous and fainter sources with six-parameter solutions does not noticeably change these maps.
Here we present the results of a different way to analyse systematics in the Gaia-CRF3 proper motions, namely by fitting an expansion of vector spherical harmonics (VSH; Mignard & Klioner 2012) up to a certain degree . Although the VSH provide a smooth model of the systematic errors also in the regions where only a few sources are present, such as in the Galactic plane, the higher-degree harmonics (  15) are only weakly constrained in these areas, resulting in unreliable estimates of the VSH coefficients also in other parts of the sky because of the global scope of the fitted VSH functions.
Analogous to the discussion of parallax systematics in Sect. 5.7 of Lindegren et al. (2021b), using (scalar) spherical harmonics, the VSH allow us to estimate the angular power spectrum of the systematics in the Gaia-CRF3 proper motions. To this end we used the set of 1 215 942 Gaia-CRF3 sources with five-parameter solutions. Given the vector field of proper motions of those sources we computed the VSH expansion as given by Eq. (30) of Mignard & Klioner (2012), including all the terms of degree 1 ≤ ≤ L. It was found that the VSH fit is reasonably stable for L ≤ 12. When higher degrees are included, the fits become questionable, mainly because of the inhomogeneous distribution of the sources on the sky (see Sect. 3.1 above and Figs. 3 and 4 of Gaia Collaboration et al. 2021b). All subsequent computations used L = 12. The fitted VSH coefficients confirmed the maps of systematic errors in proper motion shown on Fig. 13 of Lindegren et al. (2021b). From the same coefficients we obtained the powers P t , P s of the toroidal and spheroidal signals of degree , as defined by Eq. (76) of Mignard & Klioner (2012). The total power of degree is P = P t + P s . The mean powers of the toroidal and spheroidal harmonics of degree are then C t = P t /(2 + 1) and C s = P s /(2 + 1), where 2 + 1 is the number of harmonics at the degree . This is fully analogous to the definitions for the scalar spherical harmonics given by Eq. (31) of Lindegren et al. (2021b).
We note that the VSH components of degree = 1 in the proper motions should be analysed separately from other systematic effects. The toroidal harmonic of degree = 1 represents a residual spin of the reference frame, and should in principle be negligible thanks to the way the reference frame is constructed (see Sect. 5 below). The spheroidal harmonic of degree = 1 represents a 'glide' component of the proper motions that is expected as a consequence of the acceleration of the Solar system in the rest frame of remote sources (Gaia Collaboration et al. 2021b). Although these terms need to be included in the VSH fit, the subsequent discussion of systematics in the Gaia-CRF3 proper motions only considers the VSH components of degree = 2, . . . , 12. Formally, the powers and the mean powers are weighted sums of the squared VSH coefficients. Appendix A of Gaia Collaboration et al. (2021b) demonstrates that such quantities are biased (overestimated) when the VSH coefficient are estimated in a least squares fit. Equation (A.1) of that Appendix gives the exact formula to correct the bias. In particular, the corrected (un-biased) values of the mean powers arê where the noise corrections N t and N s are obtained by the same formulas as C t and C s but using the uncertainties of the VSH coefficients instead of the coefficients themselves. Explicit computations for the present data show that N t N s 3.44−0.07 − 0.01 2 µas 2 yr −2 (2 ≤ ≤ 12). Figure 10 shows the corrected estimatesĈ t andĈ s for = 2, . . . , 12. The ragged shape of (especially) the toroidal spectrum is very real, as shown by the error bars. One sees that the toroidal powers are substantially larger than the spheroidal ones for the odd degrees, while they are more similar for even degrees. We have no explanation for this but presume it is related to the specific parameters of the scanning law and basic angle.
In analogy with Eq. (32) of Lindegren et al. (2021b), it is convenient to consider the cumulative characteristics quantifying the RMS variations of the toroidal, spheroidal and total proper motion systematics on angular scales 180 • / max (Fig. 11). We note that the toroidal components completely dominate the systematics for ≥ 5. Summarising all this information we can claim that the frame-independent RMS of the vector field representing systematic errors in proper motions is 12 µas yr −1 down to angular scales of ∼ 15 • . This level of systematic errors is low compared to the random errors of individual sources given in Table 2 and made it possible to reliably measure the subtle physical effect of the acceleration of the Solar system in Gaia EDR3 (Gaia Collaboration et al. 2021b).

Gaia-CRF3 counterparts of ICRF3 sources
Radio-loud sources in Gaia-CRF3 deserve special attention as they provide a direct link between the ICRF implementations in the radio and optical regimes. In this section we focus on the Article number, page 11 of 31 A&A proofs: manuscript no. GaiaCRF3 properties of the Gaia-CRF3 sources matched to ICRF3 (Charlot et al. 2020) and a statistical comparison of the optical and radio positions.

Statistics of ICRF3 sources in Gaia-CRF3
Gaia-CRF3 contains a total of 4723 radio-loud sources, of which 3181 are optical counterparts of radio sources in ICRF3 (Charlot et al. 2020), while an additional 1542 sources are found only in OCARS. Because the ICRF3 sources have the best quality of the astrometry in the radio domain and represent the official realisation of the ICRF in the radio, we restrict the discussion in this section to the 3181 Gaia-CRF3 sources in common with ICRF3. This set is the union of 3142, 660, and 576 sources matching the ICRF3 sources in the S/X, K, and X/Ka bands, respectively (69.3%, 80.1% and 85.0% of the corresponding catalogues). Most of the analysis below is further limited to the 3142 sources in ICRF3 S/X. Very few additional Gaia EDR3 sources are found only in the K or X/Ka bands of the ICRF3 (24 and 31 sources, respectively), and their overall properties are not too different from those of the S/X set. From the sky distribution of the 3181 ICRF3 sources in Gaia-CRF3 (Fig. 12) we see that a band along the Galactic plane and the region with δ < −30 • (part of the fourth quadrant of the map in the Galactic coordinates) are underpopulated.  Fig. 12. Distribution of the ICRF3 sources in Gaia-CRF3 in Galactic coordinates. Sources from ICRF3 S/X are shown by red dots, from ICRF3 K by blue crosses, and from ICRF3 X/Ka by turquoise squares. This map uses a Hammer-Aitoff projection in galactic coordinates, with l = b = 0 at the centre, north up, and l increasing from right to left. This reflects the strong optical extinction in the Galactic plane and the fact that ICRF3 itself has a lower density of sources for δ < −30 • .
The left panel in Fig. 13 shows the magnitude distribution of the 3142 Gaia-CRF3 sources in common with ICRF3 S/X and for the subset of 259 sources that are defining sources in ICRF3. As expected, there is no marked difference between the two sets, since the optical brightness plays no role in the ICRF3 selection of defining sources. The median G = 18.9 is about one magnitude brighter than for the whole Gaia-CRF3 (Table 2).
The central panel in Fig. 13 shows the distribution of the positional uncertainty σ pos,max in Gaia-CRF3 for the same selections. The median uncertainty is 194 µas for the whole set of ICRF3 S/X sources, and 162 µas for the defining sources. These values are less than one half of the median positional uncertainty in the full Gaia-CRF3 sample, which can be attributed to their different distributions in G. Indeed, as shown by the middle panel of Fig. 8, the ICRF3 sources follow the same general variation of σ pos,max with G as other Gaia-CRF3 sources. In other words, nothing in the Gaia astrometry suggests that the ICRF3 are better or worse than other Gaia-CRF3 sources of comparable magnitude.
The right panel in Fig. 13 shows the distribution of the positional uncertainty in the radio domain for the same sources. By coincidence, the median positional uncertainty of the 3142 sources in ICRF3, 197 µas, is practically the same as in Gaia-CRF3. For the subset of defining sources, however, the positional uncertainties in ICRF3 are much smaller (the median is 39 µas), as these sources were selected precisely for their high radioastrometric quality. In Gaia-CRF3 the defining sources are only marginally more accurate than the non-defining sources, consistent with the former being, on average, about 0.2 mag brighter (cf. Fig. 13, left panel).

Positional differences
We now compare the individual positions of ICRF3 S/X sources in ICRF3 and Gaia-CRF3. Thanks to the high positional accuracy in both catalogues and the small cross-matching radius ∆ max = 100 mas used, cross-matching errors between the two catalogues are highly unlikely and we can assume that the optical and radio emissions being compared originate in broadly the same extragalactic object. Positional differences ∆α * = (α EDR3 − α ICRF3 ) cos δ and ∆δ = δ EDR3 − δ ICRF3 are computed after prop-  Fig. 14 is a scatter diagram of the positional differences. Of the 3142 sources, 127 (4%) have a positional difference > 4 mas in either coordinate and fall outside the boundaries of the plot. The median separation ∆ = (∆α * 2 + ∆δ 2 ) 1/2 is 0.516 mas for all 3142 sources, and 0.292 mas for the 259 defining sources (Fig. 15).
The distributions of ∆α * and ∆δ are roughly symmetric around zero with a median offset of +4 µas in right ascension and −6 µas in declination. Similarly, the scatter diagrams of the positional differences versus right ascension and declination (Fig. 16) show no significant systematics or trends in either coordinate. The distributions are very regular with a pronounced central concentration of width below 1 mas, plus a scatter of several hundred outliers at several mas. Furthermore, an analysis of the positional differences using VSH does not reveal any significant systematic effects. The statistical significance of the fitted coefficients is low, while the largest effects represented by the VSH expansion are located in the areas with low source density (around the Galactic plane and for δ < −30 • ).
The lower panel of Fig. 14 shows the same position differences as in the upper panel, but normalised by the quadratic combination of the uncertainties in the ICRF3 and Gaia-CRF3 catalogues. If the optical and radio centres of emission coincided, we would expect the normalised position differences to follow a normal distribution with standard deviation only slightly above 1.0. In fact, the distributions are significantly wider (the RSE 3 is 1.92 in right ascension and 1.80 in declination), with fatter tails than expected for a Gaussian: 225 sources fall outside the plotted frame, and more than 100 are outside ±5 × RSE in each coordinate.
Both in ICRF3 and in Gaia-CRF3 there are significant correlations between the errors in right ascension and declination, and a useful statistic that takes this into account is where Cov(∆) is the covariance matrix given by Eq. (E.7). This is completely analogous to the statistic X 2 µ for the proper motions, defined in Eq. (3), and like ∆ it is independent of the 3 The robust scatter estimate (RSE) is defined as 2 √ 2 erf −1 (4/5) −1 ≈ 0.390152 times the difference between the 90th and 10th percentiles of the distribution of the variable. For a Gaussian distribution it equals the standard deviation. The RSE is widely used in Gaia as a standardised, robust measure of dispersion. reference frame used. 4 The distributions of X 2 ∆ are shown in Fig. 17. For Gaussian errors we expect exponential distributions. The dashed lines are the functions 1450 exp[− 1 2 (X ∆ /1.3) 2 ] and 120 exp[− 1 2 (X ∆ /1.3) 2 ] fitted by eye to the inner parts of the histograms (X 2 ∆ 10). The integrals of these exponential are 2450.5 and 202.8, that is 78% of respective sample. One possible interpretation of this is that the (combined) uncertainties are generally underestimated by a factor 1.3, and that about 22% of the objects in the samples have radio-optical positional offsets that are statistically significant at the current level of accuracy in the two catalogues; this fraction is, moreover, the same for the defining sources as for the full sample. Using a criterion such as X 2 ∆ > 25 to pick out the individual sources that have a significant offset yields 433 sources, 36 of which are defining ones, or 14% of the respective samples. The angular offsets of these sources range from 0.17 mas to tens of mas, with a median value of 2.3 mas for the 433 sources from the full sample, and 0.84 mas for the 36 defining sources.
We note that the median X 2 µ is slightly higher (2.36) for the 433 sources with X 2 ∆ > 25 than for the full sample of 3142 sources (1.82). This could indicate that the positional offsets are not constant over the time scale of the Gaia observations.
These results confirm and amplify the conclusion reached already with Gaia DR1 and Gaia DR2, namely that radio-optical offsets are significant for a large fraction of the ICRF3 sources. The correlation with jets has been well established, for example by Petrov et al. (2019) and Plavin et al. (2019), and the physical signature of offsets is also supported in the studies by Lambert et al. (2021) and Liu et al. (2021).
The question now is therefore not so much the existence of such offsets but rather what consequences they have for the relation between the radio and optical frames. A plain match between the two catalogues leaves sources with true differences in position, much larger than the expected random scatter. These differences will not decrease with future versions of the radio and optical frames. Although this does not impact the quality of either realisation, it could limit the attainable accuracy of their mutual alignment. To circumvent this problem, specific procedures may be required, such as the selection of a common set of reference sources for the alignment. More generally, the importance of physical deviations between the reference frames at different 4 The quantity X 2 ∆ is important also for another reason: as explained in Appendix E we consider the optical and radio frames to be aligned if the sum of X 2 ∆ over the 'good' sources is minimal; cf. Eq. (E.18). The wavelength bands needs to be better understood. This problem will be even more important for the future high-accuracy astrometric observations in the near infrared (Hobbs et al. 2021).
A closely related question concerns the interpretation and use of the reference frame at different epochs. ICRF3 is the first realisation of the VLBI frame that has an epoch (2015.0) and an official procedure to transform from this epoch to another (Charlot et al. 2020). The reason for the epoch transformation is the effect of the acceleration of the Solar system relative to the remote celestial sources, often referred to as 'Galactic acceleration', although it may contain other components as well. If ignored, the Galactic acceleration will slowly distort the frame at a rate of about 6 µas yr −1 . The effect should therefore be taken into account when comparing ICRF3 with the Gaia-CRF, as was indeed done for this paper. The transformation depends on three parameters that can be determined by observation; for ICRF3, however, a set of fixed values are specified. In case of Gaia, full astrometric solutions including parallaxes and proper motions are computed for all Gaia-CRF sources. Each version of Gaia-CRF has a particular epoch common to that of the astrometric catalogue of the corresponding Gaia DR; for Gaia-CRF3 this is 2016.0. No specific correction for the Galactic acceleration is applied or recommended for Gaia-CRF. However, the effect is implicitly contained in the measured proper motions and a normal epoch transformation would take also this effect into account, albeit with a positional accuracy that degrades with the epoch difference. In Gaia Collaboration et al. (2021b) the components of the acceleration of the Solar system were determined from the proper motions of Gaia-CRF3. Although that determination differs from the ICRF3 prescription by less than 1 µas yr −1 , there thus is a conceptual difference between how the effect is treated in the two reference frames.

Use of Gaia-CRF3 sources in the frame rotator of AGIS 3.2
In the astrometric core solution for Gaia (the Astrometric Global Iterative Solution, AGIS; Lindegren et al. 2012), the frame rotator ensures that the reference frame of the astrometric solution complies, in a global sense, with the ICRS requirements. The frame rotator uses a pre-selected set of QSO-like sources, including optical counterparts of ICRF3 sources, to estimate the six rotation parameters (spin and orientation) for the solutions obtained in successive iterations, and corrects them in such a way that the QSO-like sources have no net rotation and are globally aligned with the ICRF3 sources. The frame rotator sources are a subset of the much larger set of primary sources (about 14.3 million for EDR3) that contribute to the estimation of attitude, calibration, and global parameters in AGIS. Special selection rules apply to the primary sources; in particular, they must have colour information from Gaia's BP and RP photometers of adequate quality for the calibration of colour-dependent instrumental effects. The selection of primary sources, including the frame rotator sources, is revised with each new release of astrometric data, but for logistic reasons it is mainly based on the astrometric and photometric data of the previous release. Thus, for Gaia EDR3 the selection of frame rotator sources was primarily based on the 556 869 QSO-like sources, including 2820 ICRF3 counterparts, that constitute the Gaia-CRF2 (Gaia Collaboration et al. 2018). However, not all of them had adequate colour information or passed the more stringent astrometric criteria for primary sources in Gaia EDR3, and the number of frame rotator sources used for EDR3 is therefore substantially smaller (429 249 sources). All of them have fiveparameter solutions in Gaia EDR3 and were filtered by a procedure similar to the one used for Gaia-CRF3 but using a preliminary version of the Gaia EDR3 catalogue known as AGIS 3.1 (Lindegren et al. 2021b).
Details of the frame rotator as used for Gaia EDR3 are given in Appendix E. As described in the appendix, the orientation (ε) and spin (ω) are estimated by separate least-squares solutions  Fig. 15. Histograms of the separations ∆ between Gaia-CRF3 and ICRF3 S/X for the full set of 3142 common sources (blue; the median is 0.516 mas) and the 259 defining sources (red; the median is 0.292 mas).
to the position differences of the ICRF3 sources (for the orientation) and to the proper motions of the many more QSO-like sources (for the spin). The algorithm includes an iterative procedure to identify and eliminate outliers, that is sources that deviate very significantly from the overall fit in one or both coordinates. Thus, we need to distinguish between the sources that were 'considered' by the algorithm, and the subset of non-outliers that was actually 'used' to determine the orientation or spin. The auxiliary table gaiaedr3.frame_rotator_source in the Gaia Archive lists the 429 249 frame rotator sources for Gaia EDR3 and contains flags to indicate which of them were considered and used for the orientation and spin solutions. As specified in the table, 428 034 of the sources considered for the spin were actually used to determine the spin in the final iteration of the primary solution, while 1215 (0.3%) were rejected as outliers. Of the 2269 sources with counterparts in ICRF3 S/X considered for the orientation, 2007 sources were actually used to determine the orientation, while 262 (11.5%) were rejected as outliers. Eight of the sources were rejected in both the spin and orientation solutions.
Because the preliminary AGIS 3.1 astrometry had to be used to select the frame rotator sources, it turns out that 46 of the latter do not pass the stricter criteria for the Gaia-CRF3 selection described in Sect. 2. These sources are thus listed in gaiaedr3.frame_rotator_source although they are not part of the Gaia-CRF3. This slight inconsistency has minimal impact on the resulting reference frame, as only two of the 46 sources were accepted by the frame rotator algorithm for the determination of the spin, and none of them for the determination of the orientation.
If the frame rotator algorithm is applied to this subset of sources, using the astrometric data as given in Gaia EDR3, the resulting estimates of the orientation and spin vectors are not strictly zero, as might be expected. For the spin, we obtain | ω | 0.8 µas yr −1 with an uncertainty of about 0.4 µas yr −1 in each axis; for the orientation, the result is | ε(2016.0) | 6.2 µas with an uncertainty of about 7 µas per axis. These deviations from zero can be traced to small differences in the positions and proper motions between the last iteration of the primary solution and the secondary source updates, from which the final data are taken. The reason for this (undesirable) effect is currently under investigation.
If the same algorithm is instead applied to the full set of Gaia-CRF3 sources, the result for the spin is in which 1 607 289 (99.6%) of the sources are accepted; for the orientation correction the result is +3.41 ± 6.71 +8.99 ± 6.50 −1.47 ± 6.04 in which 2738 (87.1%) of the 3142 ICRF3 S/X sources are accepted. Generally speaking, the result is sensitive to the precise selection of sources at the level of a few µas yr −1 in spin and several µas in the orientation.

Conclusions and prospects
In this paper we have reported on the construction and properties of the Gaia-CRF3, a celestial reference frame materialised in the optical domain resulting from the Gaia all-sky survey. Although built on only 33 months of data (from August 2014 to May 2017), the astrometric accuracy is already very similar to what was expected for the 5-year nominal mission, and with a larger set of QSOs than originally anticipated.
The Gaia-CRF3 consists of an astrometric catalogue of more than 1.6 million QSOs selected using a number of existing catalogues, with further filtering based on the Gaia data to ensure that the sample is as much as possible free of stellar contaminants. The G magnitude extends from ∼ 13 to ∼ 21 mag, with a peak density at G 20.5 mag. The formal uncertainty is primarily ruled by the G magnitude, with a typical precision of 1 mas at G = 20.6, 0.4 mas at G = 20 and 0.1 mas at G = 17.9. There are 32 000 sources with formal position uncertainty < 0.1 mas and 210 000 with < 0.2 mas. Gaia-CRF3 is aligned to the radio ICRF3 by minimising the position differences for about 2000 common sources with high-quality solutions.
The comparison with ICRF3 shows that the floor formal accuracies are comparable in the optical and radio domains and better than 0.05 mas for the best measured sources. However the normalised positional differences are not compatible with a standard normal distribution and an estimated ∼ 20% of the common sources have statistically significant positional radio-optical offsets. For many sources, if not all, these offsets are probably real, although this is hard to prove without a detailed investigation at the source level Plavin et al. 2019). The radio and optical CRFs are both internally very consistent and materialise the ICRS in their respective wavelength range, but in view of the detected offsets it remains unclear how much their mutual consistency can be improved without a better knowledge of the physical processes responsible for the light and radio emissions. It is hoped that high-resolution observations and improved radio and optical positions will further characterise these offsets and lead to a better understanding of the underlying mechanisms (Johnson et al. 2019).
It should be understood that Gaia-CRF3 is entirely defined by the Gaia EDR3 data for the ∼ 1.6 million QSO-like sources discussed in this paper. These data include not only the barycentric positions of the sources at the reference epoch 2016.0, but also their proper motions (and parallaxes). Although individually insignificant in the vast majority of cases, these data provide invaluable information on the quality of the data and on   Fig. 17. Histograms of the normalised separations X 2 ∆ between Gaia-CRF3 and ICRF3 S/X for the full set of 3142 common sources (blue) and the 259 defining sources (red). The dashed curves are the exponential functions described in the text. A total of 195 sources have X ∆ > 10 and fall outside this plot (10 of them are defining sources in ICRF3). possible systematic effects in the reference frame, both of instrumental and astrophysical origin. The reference frame defined by the proper motions of the Gaia-CRF3 sources is globally nonrotating at a level of a few µas yr −1 . It is not possible to quantify more precisely than this how well the Gaia-CRF3 complies with the 'no rotation' condition, because different subsets of the Gaia-CRF3 lead to determinations of the spin that differ on this level, which is also consistent with the estimated RMS level of systematics on large angular scales (low values of max ) depicted in Fig. 11. As mentioned in Sect. 1, the secondary frame defined by the positions and proper motions of stars in Gaia EDR3, in particular at bright magnitudes, may have significantly higher systematics.
Gaia-CRF3 is by far the best realisation to date of the celestial reference frame in the visible. It meets the ICRS requirements for a realisation based on extragalactic sources. The IAU, during its XXXIst General Assembly in August 2021, resolved (Resolution B3) that ICRF3 at radio wavelengths and Gaia-CRF3 in the optical are realisations (ICRF) of the International Celestial Reference System (ICRS) to be used as a fundamental standard in their respective domain.
A new version of the Gaia-CRF comes out with every release of a new astrometric solution based on a larger volume of Gaia data. The current version, Gaia-CRF3, is common to Gaia EDR3 and Gaia DR3, for which the basic astrometric data are identical. Two more releases are expected, Gaia DR4 based on 66 months of observations, and Gaia DR5 using up to 120 months of data. The positional accuracy improves roughly as T −1/2 , where T is the time baseline, leading to an improvement by almost a factor two for the final version compared with Gaia-CRF3.
It is expected that the systematics will also be improved in the future releases, although this cannot easily be quantified. A dramatic illustration of the improvement from Gaia-CRF2 to Gaia-CRF3 is the successful use of the latter to determine the acceleration of the Solar System (Gaia Collaboration et al. 2021b). Of paramount importance for the quality of the reference frame is the accuracy of proper motions, which impacts both the astrometric filtering of the sources (thus reducing stellar contamination) and the accuracy of the positions at earlier and later epochs. Proper motions generally improve as T −3/2 , which could give a factor seven more precise proper motions in the final release. Corresponding improvements can be expected for the determi-nation of the reference frame (especially, its spin) as well as for the various physical effects that may be seen in the proper motions (e.g. the solar system acceleration; see Sect. 8 in Gaia Collaboration et al. 2021b for a brief discussion of other effects).
As mentioned in Sect. 3, up to 4 million QSOs could be recognised by Gaia. In future releases the identification of the QSOs will be refined using a combination of all kinds of Gaia data: an improved astrometry, spectrophotometry, and variability analysis. Special efforts will be made to better identify QSOs in the crowded areas and to improve the homogeneity of the sky coverage (a reduced source density at the Galactic plane will however persist due to the effects of Galactic extinction). The prospect for the future versions of the Gaia-CRF therefore lies not only in a further improvement of positional accuracy, but even more in the number of sources, a better homogeneity of the source distribution of the sky, the purity of the selection and in the overall consistency of the solution.   Table 1. The maps show all the sources in the respective catalogue, irrespective of whether they are also in Gaia-CRF3. For the VLBI and blazar catalogues (ICRF3 S/X, ICRF3 K, ICRF3 X/Ka, Roma-BZCAT, 2WHSP, and the ALMA calibrators) that have a small source density on the sky, individual sources are plotted. All three parts of ICRF3 are shown on one plot using different colours. For the other catalogues, the maps show the density of sources per square degree, computed from the source counts per pixel using HEALPix of level 6 (pixel size 0.84 deg 2 ). One sees the strong inhomogeneity of the input catalogues caused by the special observational programmes not covering the whole sky (e.g. SDSS and LAMOST), by the observational schedule of the WISE satellite (AllWISE, R90, and C75), and by the artificial de-selection of certain crowded areas applied by the authors of some catalogues (e.g. the Galactic plane and the LMC and SMC areas in R90,C75,. It is also obvious from the maps that not all catalogues are independent. For example, one directly sees that LQRF and LQAC-5 considered the results of 2QZ and that SDSS DR14Q was used in LQAC-5. Detailed information on all these aspects can be found in the corresponding original publications cited in Table 1. All these inhomogeneities contribute to the resulting inhomogeneity of Gaia-CRF3, the density of which is shown in Fig. 4 allows one to retrieve the astrometric data for only the 1 215 942 Gaia-CRF3 sources that have five-parameter astrometric solutions. This dataset was used for example in Gaia Collaboration et al. (2021b) to measure the acceleration of the Solar System. Changing '31' to '95' in the last line, one gets instead the 398 231 Gaia-CRF3 sources with six-parameter astrometric solutions.
For the sources used to define the orientation and spin of the Gaia EDR3 catalogue the following two queries are useful. gives data for the 428 034 sources used to define the rotation (spin) of Gaia EDR3. The field names used_for_... can be replaced by considered_for_... to get the data of all sources considered by the reference frame algorithm. As discussed in Sect. 5, all sources that were considered by the frame rotator algorithm for fixing the orientation of the Gaia EDR3 catalogue also belong to Gaia-CRF3. However, among the sources that were considered for fixing the spin, there are 46 sources that are not part of Gaia-CRF3. Two of them were used to fix the spin. For those 46 or 2 sources, the respective queries return empty names of the source and the external catalogue.
As described in Sect. 2, the Gaia-CRF3 sources resulted from a filtered cross-match between the Gaia EDR3 catalogue and a number of external AGN/QSO catalogues. The information on which sources were found in the different external catalogues is given in table gaiadr3.gaia_crf3_xm. We note that gaiaedr3.agn_cross_id cannot be used to retrieve all Gaia-CRF3 sources from a given external catalogue, because a Gaia-CRF3 source may be found in several external catalogues but gaiaedr3.agn_cross_id only gives its identifier in one of them. To obtain a list of the Gaia-CRF3 sources that were selected from a specific external catalogue one can use the query: 6 SELECT xm.$$$_name, edr3.* FROM gaiaedr3.gaia_source AS edr3 INNER JOIN gaiadr3.gaia_crf3_xm AS xm USING (source_id) WHERE xm.$$$ = 'true' where '$$$' should be replaced by the catalogue code from column 6 of Table 1. For the sources in the three ICRF3 catalogues (the catalogue codes icrf3sx, icrf3k, icrf3xka) gaiadr3.gaia_crf3_xm provides two source designations as also given in the original ICRF3 catalogues: the ICRF and IERS designations as icrf_name and iers_name, respectively. Additionally, the above ADQL query with the catalogue code 'b19' 6  can be used to see the 920 599 Gaia-CRF3 sources that were also found in the catalogue of Bailer-Jones et al. (2019).
We note that when considering an external catalogue '$$$' some sources were rejected only because a lower quality of astrometry used in that catalogue resulted in the cross-match distance ∆ being too large to satisfy Eq. (5) in Sect. 2. For those sources the field gaiadr3.gaia_crf3_xm.$$$ is set to false. In some cases, however, the same Gaia source was found also in other external catalogues and eventually selected for Gaia-CRF3. In those cases, the name gaiadr3.gaia_crf3_xm.$$$_name of that Gaia source in catalogue '$$$' is also given. To obtain all Gaia-CRF3 sources that appear in an external catalogue one should replace the last line in the previous ADQL query by 'WHERE xm.$$$_name IS NOT NULL'. This helps to mitigate the effect of a lower quality astrometry in some of the external catalogues and improve the overall cross-match quality for those catalogues.
BP and RP spectra on faint sources in these very crowded areas (Riello et al. 2021). As a result, the astrometric solution did not use a spectrophotometrically determined colour (effective wavenumber) for many of these sources, but instead determined a pseudo-colour as the sixth astrometric parameter.
Because the confusion sources in the LMC and SMC tend to have distinctly different kinematics from the Galactic stars, it is advantageous to separate out the sources in the LMC and SMC areas in the further analysis. Guided by the density maps, we take a circle of radius 9 • centred on (α, δ) = (81.3 • , −68.7 • ) to represent the LMC, and a circle of radius 6 • centred on (α, δ) = (16.0 • , −72.8 • ) to represent the SMC. These areas contain, respectively, 2 657 059 and 496 520 confusion sources with five-parameter solutions (6 817 915 and 1 307 591 with sixparameter solutions). This simplistic selection does not pretend to any deeper physical significance, and only partially coincides with the carefully selected samples of Gaia EDR3 sources in the LMC and SMC discussed in Gaia Collaboration et al. (2021c). Figure D.2 shows the distribution of the confusion sources in galactic coordinates and magnitude, split by the kind of solution (five or six parameters) and further divided into four disjoint groups: sources in the LMC and SMC areas defined above, and for the remaining sources according to | sin b | ≷ 0.1. The confusion sources outside the LMC and SMC areas are called 'Galactic' below, although it is clear that this set contains also many extragalactic objects. Figures D.3 and D.4 show the distributions of the normalised parallaxes and proper motions of the confusion sources separately for the Galactic sources (split by galactic latitude) and the LMC and SMC areas. The distributions are strongly non-Gaussian (with the possible exception of parallaxes in the LMC/SMC areas) and very different from those of the Gaia-CRF3 sources in Fig. 3. In Sect. 2, this difference was used to check the purity of the filtered sources in the external QSO catalogues and in the final Gaia-CRF3. The proper motions give the strongest indication of impurity. Interestingly, for the Galactic confusion sources there is no clear dependence according to | sin b | ≷ 0.1. On the other hand, the confusion sources in the LMC and SMC areas give rather different and distinct distributions. The distributions for the five-and six-parameter solutions are not drastically different and mainly reflect the generally higher uncertainties of the six-parameter sources, which makes their distributions slightly less non-Gaussian.
Selection of QSO-like objects in the Gaia catalogue, based exclusively on parallaxes and proper motions, is an interesting prospect already discussed in the literature (Heintz et al. 2015;. Among the Gaia EDR3 sources with five-parameter solutions, there are 23.6 million satisfying Eqs. (1)-(4), of which 1.2 million (5.1%) are also in Gaia-CRF3. Allowing that the actual number of QSOs among the five-parameter sources in Gaia EDR3 could be a factor two higher (cf. Sect. 3), we conclude that the purity of such a selection from Gaia EDR3 would be at most about 10% -too low to be of practical use, but still interesting and promising. In certain areas on the sky the percentage of true QSO-like sources in the sample is much higher. For example, excluding | b | < 20 • and the LMC/SMC areas, there are 5 767 163 confusion sources, of which 1 410 314 (24.5%) are Gaia-CRF3 sources. The purity of such selections in future Gaia data releases will improve drastically thanks to the higher accuracy of parallaxes and (especially) proper motions (cf. Sect. 2.5). 1 £ 10 4 2 £ 10 4 3 £ 10 4 4 £ 10 4 5 £ 10 4 6 £ 10 4 7 £ 10 4 8 £ 10 4 9 £ 10 4 1:0 £ 10 5 1:1 £ 10 5 1:2 £ 10 5 Galactic; j sin bj > 0:1 Galactic; j sin bj  Appendix E: Estimating the spin and orientation corrections As described in Sect. 6.1 of Lindegren et al. (2012), the frame rotator is an integral part of AGIS with three main functions: (i) to estimate the spin and orientation corrections needed to bring the current astrometric solution onto the ICRS; (ii) to apply the estimated corrections to the astrometric parameters of the primary sources; and (iii) to apply the corresponding corrections to the attitude parameters. This appendix describes the algorithm used for the first function in the construction of Gaia EDR3. The 2012 paper gave a stringent mathematical framework for all three functions and outlined how they could be implemented in AGIS. The frame rotator used for the first two releases of Gaia data followed these prescriptions quite rigorously, which however resulted in code that was unnecessarily complex and not very transparent. For example, it permitted the use of different reference epochs for the AGIS solution, comparison catalogue, and orientation parameters, while in practice this was never needed. It also foresaw the use of radio stars observed by VLBI to help align the reference frame, which is no longer considered necessary or even desirable. On the other hand, the details of the numerical algorithm used for the robust least-squares estimation (LSE) of the rotation parameters were not at all described in Lindegren et al. (2012), making it very difficult for outside users to reproduce how the frame rotator worked in those releases.
In 2018 it was decided to implement a new version of the frame rotator to be used for the third and subsequent data releases. The main changes with respect to the earlier version are as follows: only two kinds of sources are used for estimating the rotation parameters, namely QSO-like sources with highaccuracy external positions (the VLBI positions of ICRF3 S/X in Gaia EDR3), used for both orientation and spin, and QSO-like sources without accurate external positions, used only for the spin. The same reference epoch is used for rotation parameters as for the astrometric data (2016.0 for (E)DR3), which permits to separate completely the solution for the orientation from that of the spin. Furthermore, the observation equations are written directly in terms of the proper motions and position differences (Sect. E.1), rather than using the rigorous but less transparent formalism of Lindegren et al. (2012). The solution takes into account the correlation between the components in right ascension and declination of both Gaia and external data (Sect. E.2), and has a transparent scheme for the detection and treatment of outliers (Sect. E.3). Finally, the solution provides reasonable estimates of the statistical uncertainties of the rotation parameters also when the uncertainties of the input data are underestimated (Sect. E.4), and may optionally include additional terms, such as a glide in the proper motions or more general expansions of the proper motions and position differences (Sect. E.5).
All these changes concern the estimation algorithm, that is the first function of the frame rotator. The remaining two functions, that is the application of the corrections to the source and attitude data, are still done as described in Sects. 6.1.2 and 6.1.3 of the 2012 paper and are not further discussed here.
We denote by Q 0 the set of QSO-like sources considered for the determination of the spin correction, of which the subset Q was actually used after elimination of outliers; similarly, R 0 and R are the set of sources considered and used for the determination of the orientation. Although not required by the algorithm, we normally have R 0 ⊆ Q 0 . The number of sources in each set is denoted n(Q 0 ), etc. We use σ(x), ρ(x, y), and Cov(x, y) for the uncertainty, correlation coefficient, and covariance of the arbitrary parameters x and y. The asterisk in α * indicates that the cos δ factor is implicit, as in µ α * =α cos δ. To distinguish between data in the current astrometric solution and the corresponding data in the external catalogue, we put an overline (¯) on the latter. The prime denotes the matrix transpose.

Appendix E.1: Observation equations
Let ω = [ω X , ω Y , ω Z ] be the spin correction of the current solution, expressed by its components along the principal axes of the ICRS. (As indicated by the prime, ω is a column matrix.) In matrix form, the observation equations in proper motion for source i are and right-hand side Here, α i , δ i , µ α * i and µ δi are the position and proper motion components of the source in the current solution, with reference epoch t ref (= 2016.0 for Gaia EDR3). For the weight normalisation and decorrelation (Sect. E.2), we also need the covariance of the right-hand side in Eq. (E.3), Similarly, let ε = [ε X , ε Y , ε Z ] be the orientation correction at the reference epoch t ref , expressed by its components in the ICRS. The linearised observation equations in position for source i are with the same coefficient matrix A i as before. The right-hand side, is the difference between the position (α i , δ i ) in the current solution and the position (ᾱ i ,δ i ) according to the external catalogue. We note that the external positions at t ref (= 2016.0 for EDR3) should be used here. In particular, for Gaia EDR3 the galactic acceleration should be taken into account with the ICRF3 S/X positions as prescribed in Sect. 5.3 of Charlot et al. (2020). The covariance of ∆ i in Eq. (E.6) is computed on the assumption that the positional errors of the current solution and of the external catalogue are independent: Article number, page 29 of 31 A&A proofs: manuscript no. GaiaCRF3 and correspondingly for the external data. We note that σ(∆α * i ) = σ(α * i ) 2 + σ(ᾱ * i ) 2 , (E.9) σ(∆δ i ) = σ(δ i ) 2 + σ(δ i ) 2 , (E.10) . (E.11) The total number of observation equations considered for ω is 2n(Q 0 ), and for ε it is 2n(R 0 ). A feature of the algorithm described here is that the observation equations always come in pairs, as in Eqs. (E.1) and (E.5), and that the two components in a pair are always treated together. This simplifies the decorrelation of data (see below), and affects the way outliers are detected and removed. The resulting solution and statistics are independent of the coordinate system in which they are calculated.

Appendix E.2: Weight normalisation and decorrelation
Because the observation equations have different uncertainties (heteroscedasticity) and are pairwise correlated, they should be solved by the generalised least-squares method for optimality. Equivalently, the ordinary (unweighted) least-squares method can be applied to the normalised equations, where K i and L i are 2 × 2 matrices chosen in such a way that the transformed right-hand sides are uncorrelated and of unit variance. This condition does not define K i and L i uniquely, since any orthogonal transformation of them will give the same result; the form used here is the inverse lower Cholesky factor of the covariance matrix, or It is readily verified that, as required, K i Cov(µ i )K i = I and L i Cov(∆ i )L i = I, where I is the identity matrix. Given the subsets Q, R of the accepted sources, the generalised least-squares solutions are obtained by minimising the sum of squared normalised residuals, This can be done by a range of standard numerical techniques, including orthogonal transformations and the use of normal equations (e.g. Lawson & Hanson 1995, Björck 1996. For the frame rotator, the normalised observation equations in Eqs. (E.12) and (E.13) are solved by QR decomposition, using Householder transformations.

Appendix E.3: Residual statistics and treatment of outliers
Standard least-squares fitting is sensitive to outliers, that is data points that do not reasonably fit the observation equations, considering the uncertainties and the typical scatter of residuals. Outliers must be expected among the external counterparts, as well as among the other QSO-like sources, for a variety of reasons. It is therefore essential that the fit is robust against outliers. The method adopted for the frame rotator is to identify the outliers that must be removed from the originally considered samples (Q 0 , R 0 ) to create the subsets R and Q used in Eqs. (E.17) and (E.16). Two things should be noted concerning this process. Firstly, outliers are identified at source level, that is, a given source is either rejected in both coordinates (α and δ, or µ α * and µ δ ), or accepted in both coordinates. This makes sense from a physical viewpoint, because a 'bad' source is probably bad (at some level) in both coordinates. Secondly, because ω and ε are treated separately, it happens that a source rejected in R (for the orientation) is accepted in Q (for the spin), or vice versa. Again, this could be motivated on physical grounds. Details of the subsequent algorithm are exactly the same for Eqs. (E.16) and (E.17), and are therefore described in terms of the generic function Here, x could be either ω or ε, with S, d, and D correspondingly defined according to Eq. (E.16) or (E.17). We note that the functions X i (x), which can be computed for all the sources in S 0 , are a natural measure of the discrepancy of source i from the model for given parameter vector x. A useful overall statistic is the median discrepancy and a robust estimator could be to simply minimise this quantity. However, if the errors are approximately Gaussian, with a moderate fraction of outliers, robust least-squares estimation is more accurate and therefore preferred for the frame rotator. If x is the true parameter vector, and all the errors are Gaussian with nominal covariances, X 0.5 (x) is expected to be around √ ln 4 1.1774 (which is the median of the chi-distribution with two degrees of freedom). The adopted solutionx and accepted subset S simultaneously satisfy the three conditionŝ ∀i ∈ S : X i (x) ≤ κ X 0.5 (x) , ∀i ∈ S 0 \S : X i (x) > κ X 0.5 (x) for the chosen clip limit κ ≥ 1 (see below). It is not obvious that these conditions lead to a unique solution; indeed, it is easy to construct examples where this is not the case, for example if S 0 consists of disjoint subsets with distinctly different rotation parameters. However, these are somewhat contrived examples and in practice the non-uniqueness is hardly an issue. More relevant is the question whether a solution always exists -to which we do not know the answer. We take the pragmatic view that a unique solution 'close enough' to satisfying Eq. (E.21) can always be found by the following algorithm. Start by provisionally accepting all the sources in S 0 , compute the least-squares estimatex 0 by minimising X 2 (x, S 0 ), and hence obtain the statistic X 0.5 (x 0 ). In the next iteration, let S 1 be the subset of sources satisfying X i (x 0 ) ≤ κ X 0.5 (x 0 ), leading to the estimatex 1 minimising X 2 (x, S 1 ) and the new statistic X 0.5 (x 1 ), from which the subset S 2 is constructed, and so on. This gives a sequence of subsets S k , k = 0, 1, 2, . . . that may converge to a stable subset (i.e. S k+1 = S k for some k). If so, the stable subset and the corresponding estimatex k is the desired solution. But even if the sequence does not converge, it will sooner or later be found that the same subsets reappear cyclically in the sequence, that is S k+p = S k , where p > 1 is the cycle period. In such a case any of the solutions in the cycle might be accepted. However, by adopting the rule that the sequence is stopped at the lowest k where S k = S k−p for some p > 0, we arrive at a unique result both for a converging sequence (p = 1) and in the cyclic case (p > 1).
To avoid having to save and compare the (potentially many and large) subsets S k , we assign a fixed random number r i to each source in S 0 , and save from previous iterations only the outlier 'checksum' o k = i ∈ S 0 \S k r i . The iteration stops as soon as o k = o k−p for some p > 0, at which point we adopt S = S k andx =x k as the solution.
The clip limit κ is a dimensionless number, typically in the range 2-5, to be chosen by the researcher. A value κ ≥ 1 ensures that at most 50% of the sources in S 0 are rejected as outliers, but in practice a higher value should always be used. In principle, an optimum clip level that minimises the expected uncertainty ofx can be estimated by careful modelling of the actual error distribution in S 0 , but for the frame rotator we always use a fixed κ = 3. In the nominal case of purely Gaussian errors and no actually bad data, this will (erroneously) reject about 1% of the sources. This appears to be an acceptable price to pay for an estimation algorithm that is both efficient and very robust.