The total amount of (compressed) science data generated in the course of the five-year mission is about 2 × 10¹³ bytes (20 TB). Most of this consists of CCD raw or binned pixel values with associated identification tags. The data analysis aims to "explain" these values in terms of astronomical objects and their characteristics. In principle the analysis is done by adjusting the object, attitude and instrument models until a satisfactory agreement is found between predicted and observed data (dashed lines in Fig. 8). Successful implementation of the data analysis task will require expert knowledge from several different fields of astronomy, mathematics and computer science to be merged in a single, highly efficient system (O'Mullane & Lindegren 1999).

Figure 8: Model of CCD data interrelations for an astronomical object. In principle, the data analysis aims to provide the "best" representation of the observed data in terms of the object model, satellite attitude and instrument calibration. Certain data and models can, from the viewpoint of the data analysis, be regarded as "given"; in the figure these are represented by the satellite orbit (in the barycentric reference system) and the relativistic model used to compute celestial directions. Other model data are adjusted to fit the observations (dashed lines).

The global astrometric reductions must be formulated in a fully general relativistic framework, including post-post-Newtonian effects of the spherical Sun at the 1 $\mu$as level, as well as corrections due to the oblateness and angular momentum of Solar System bodies.

Processing these vast amounts of data will require highly automated and efficient numerical methods. This is particularly critical for the image centroiding of the elementary astrometric and photometric observations in the astrometric instruments, and the corresponding analysis of spectral data in the spectrometric instrument.

Accurate and efficient estimation of the centroid coordinate based on the noisy CCD samples is crucial for the astrometric performance. Simulations indicate that six samples approximately centred on the peak can be read out from the CCD. The centroiding, as well as the magnitude estimation, must be based on these six values. The results of a large number of Monte Carlo experiments indicate that a rather simple maximum-likelihood algorithm performs extremely well under these idealized conditions, and that six samples are sufficient to determine the centroid accurately. Much work remains to extend the analysis to more complex cases, including in particular overlapping stellar images.
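The idea of maximum-likelihood centroiding on six samples can be illustrated with a minimal sketch. It is not the estimator used in the Monte Carlo experiments: the Gaussian line-spread function, its known width `sigma`, the known uniform `background`, the Poisson noise model, the unit-width samples with integer boundaries, and the grid plus golden-section search are all assumptions made here for illustration, and every function name is hypothetical.

```python
import math

def pixel_fractions(centre, sigma, n_samples=6):
    """Fraction of an assumed Gaussian profile falling in each unit-width sample.
    Sample k covers the interval [k, k + 1]."""
    def cdf(x):
        return 0.5 * (1.0 + math.erf((x - centre) / (sigma * math.sqrt(2.0))))
    return [cdf(k + 1) - cdf(k) for k in range(n_samples)]

def log_likelihood(counts, centre, flux, sigma, background):
    """Poisson log-likelihood of the samples, dropping terms independent of the model."""
    total = 0.0
    for n, frac in zip(counts, pixel_fractions(centre, sigma, len(counts))):
        expected = flux * frac + background
        total += n * math.log(expected) - expected
    return total

def golden_max(f, lo, hi, tol):
    """Golden-section search for the maximum of a unimodal function on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc > fd:
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return 0.5 * (a + b)

def ml_centroid(counts, sigma, background):
    """Return (centre, flux) maximising the Poisson likelihood of the samples."""
    n = len(counts)
    flux_hi = 2.0 * max(sum(counts), 1.0)   # generous upper bound on the total flux

    def best_flux(centre):
        # The log-likelihood is concave in flux, so a 1-D search is sufficient.
        return golden_max(
            lambda f: log_likelihood(counts, centre, f, sigma, background),
            0.0, flux_hi, tol=1e-9 * flux_hi)

    def profile(centre):
        return log_likelihood(counts, centre, best_flux(centre), sigma, background)

    # Coarse grid to bracket the peak, then golden-section refinement around it.
    step = 0.1
    grid = [i * step for i in range(int(n / step) + 1)]
    start = max(grid, key=profile)
    centre = golden_max(profile, start - step, start + step, tol=1e-6)
    return centre, best_flux(centre)

# Noise-free samples from a hypothetical star, used only to exercise the estimator.
samples = [10000.0 * f + 5.0 for f in pixel_fractions(2.83, 0.9)]
centre, flux = ml_centroid(samples, sigma=0.9, background=5.0)
print(f"centre = {centre:.4f} samples, flux = {flux:.1f} counts")
```

The same fit returns the flux, which corresponds to the magnitude estimate mentioned above. Overlapping images would need a model with more than one source and are outside the scope of this sketch.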

A preliminary photometric analysis, for discovery of variables, supernovae, etc., can be carried out using standard photometric techniques immediately after data delivery to the ground. In addition, more detailed modelling of the local background and structure in the vicinity of each target using all the mission data in all the passbands will be required. A final end-of-mission re-analysis may benefit from the astrometric determination of the image centroids, locating a well-calibrated point spread function for photometric analysis. Studies of these photometric reductions have begun.

The high-resolution (radial velocity) spectrometer will produce spectra for about a hundred million stars, and multi-epoch, multi-band photometry will be obtained for about one billion stars. The analysis of such large numbers of spectra and photometric measurements needs to be performed in a fully automated fashion, with no manual intervention. Automatic determination of (at least) the surface temperature $T_{\rm eff}$, the metallicity [M/H], and the relative $\alpha$-element abundance [$\alpha$/Fe] is necessary; determination of $\log g$ is, given the availability of parallaxes for most stars, of lesser importance. A fully automated system for the derivation of astrophysical parameters from the large number of spectra and magnitudes collected by GAIA, using all the available information for each star, has been studied, showing the feasibility of an approach based on the use of neural networks. In the classification system foreseen, spectra and photometric measurements will be sent to an "initial classifier", to sort objects into stellar and non-stellar. Specialist networks then treat each class. For example, stellar data sets are passed to an "automated stellar parameterization" sub-package.

It is the physical parameters of stars which are really of interest; therefore the proposed system aims to derive physical parameters directly from a stellar spectrum and photometry. Detailed simulations of the automated stellar parameterization system have been completed using a feed-forward neural network operating on the entire set of spectral and photometric measurements. In such a system, the derived values for the stellar parameters are naturally linked to the models used to train the network. Given the extreme rapidity of neural networks, when stellar atmosphere models are improved, re-classification of the entire data set can be done extremely quickly: an archive of 10⁸ spectra or photometric measurements could be reclassified in about a day with the present-day computing power of a scientific workstation.
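To make the cost argument concrete, the sketch below runs one forward pass of a small feed-forward network and works out the time available per object if 10⁸ objects are to be processed in a day. The layer sizes, the single tanh hidden layer and the random weights are placeholders chosen for illustration; they do not describe the network used in the simulations, and the outputs of an untrained network carry no physical meaning.

```python
import math
import random

def forward(inputs, w_hidden, b_hidden, w_out, b_out):
    """Single forward pass: one tanh hidden layer, then a linear output layer."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w_out, b_out)]

def multiply_adds(n_in, n_hidden, n_out):
    """Multiply-add operations needed for one forward pass."""
    return n_in * n_hidden + n_hidden * n_out

# Hypothetical layer sizes; three outputs could stand for T_eff, [M/H], [alpha/Fe].
N_IN, N_HIDDEN, N_OUT = 200, 10, 3
rng = random.Random(0)
# Random placeholder weights, NOT a trained network.
w_hidden = [[rng.uniform(-0.1, 0.1) for _ in range(N_IN)] for _ in range(N_HIDDEN)]
b_hidden = [0.0] * N_HIDDEN
w_out = [[rng.uniform(-0.1, 0.1) for _ in range(N_HIDDEN)] for _ in range(N_OUT)]
b_out = [0.0] * N_OUT

measurements = [rng.random() for _ in range(N_IN)]   # stand-in for one star's data
parameters = forward(measurements, w_hidden, b_hidden, w_out, b_out)

# Time available per object if 1e8 objects are processed in one day.
budget_ms = 1e3 * 86_400 / 1e8
print(len(parameters), multiply_adds(N_IN, N_HIDDEN, N_OUT), budget_ms)
```

The cost of a pass grows only with the product of adjacent layer sizes, and no iteration is involved once the network is trained, which is why re-running the whole archive after retraining on improved atmosphere models is cheap compared with the training itself.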

The overall data analysis task would be impossible without certain regularizing assumptions: one must assume that a substantial fraction of stars follow a very simple model, viz. (apparently) single stars with little or no photometric variability, whose motions can be described by the standard five astrometric parameters ($\alpha$, $\delta$, $\pi$, $\mu_{\alpha*}$, $\mu_\delta$). For the satellite attitude and instrument characteristics it must be assumed that sudden changes are rare, so that time-averaging and smoothing are effective in reducing observational noise. Without these assumptions the problem would simply have too many degrees of freedom. While such regularity conditions must be valid in a broad sense, it is clear that they cannot be guaranteed to hold in any particular situation or for a specific object. The data analysis must be able to filter out cases where the conditions do not apply, and divert them to a separate analysis branch. The efficiency of the filtering process depends critically on the quality of the instrument calibrations and attitude determination, which initially is quite low. Thus an iterative process is needed in which the object selection and observations are successively improved, along with the calibrations and attitude determination.
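For reference, the simple five-parameter model referred to above is commonly written in a linearised, small-angle form. This is a sketch in assumed notation, not necessarily the formulation adopted for the actual reductions: the reference epoch $t_0$, the position offsets $\Delta\alpha_{*0}$ and $\Delta\delta_0$ at that epoch (with $\Delta\alpha_{*} \equiv \Delta\alpha\cos\delta$), and the parallax factors $P_\alpha(t)$ and $P_\delta(t)$, which follow from the barycentric position of the observer, are not defined in the text, and effects such as perspective acceleration are neglected.

$$
\begin{aligned}
\Delta\alpha_{*}(t) &= \Delta\alpha_{*0} + \mu_{\alpha*}\,(t - t_0) + \pi\,P_\alpha(t),\\
\Delta\delta(t) &= \Delta\delta_0 + \mu_\delta\,(t - t_0) + \pi\,P_\delta(t).
\end{aligned}
$$

The model is linear in the five unknowns, so each object contributes a small least-squares problem. An object whose residuals with respect to this model exceed the expected noise is a candidate for the separate analysis branch.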

The computational complexity of the data analysis arises not just from the amount of data to be processed, but even more from the intricate relationships between the different pieces of information gathered by the various instruments throughout the mission. It is difficult to assess the magnitude of the data analysis problem in terms of processing requirements. Certain basic algorithms that have to be applied to large data sets can be translated into a minimum required number of floating-point operations. Various estimates suggest a total of order 10¹⁹ floating-point operations, indicating that very serious attention must be given to the implementation of the data analysis, and that this effort must start very early.
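A rough feeling for what 10¹⁹ floating-point operations means can be obtained by converting the total into wall-clock time at a given sustained rate. The two rates below are hypothetical round numbers chosen only to show the scaling; they are not estimates of the hardware foreseen for the mission.

```python
TOTAL_OPERATIONS = 1e19      # order-of-magnitude estimate quoted in the text
SECONDS_PER_DAY = 86_400.0

def days_required(sustained_flops):
    """Wall-clock days needed at a given sustained rate (operations per second)."""
    return TOTAL_OPERATIONS / sustained_flops / SECONDS_PER_DAY

# Hypothetical sustained rates, for illustration only.
for label, rate in [("1 Gflop/s", 1e9), ("1 Tflop/s", 1e12)]:
    print(f"{label}: {days_required(rate):,.0f} days")
```

At a sustained 1 Gflop/s the total corresponds to more than three centuries of computing, while at 1 Tflop/s it shrinks to about four months, which is why efficient implementation and parallel processing matter so much here.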

Observations of each object are distributed throughout the mission, so that calibrations and analysis must be feasible both in the time domain and in the object domain. Flexibility and interaction are needed to cope with special objects, while calibrations must be protected from unintentional modification. Object-oriented (OO) methodologies for data modelling, storage and processing are well suited to meeting the challenges faced by GAIA.

The feasibility of the OO approach has been demonstrated by a short prototyping exercise carried out during the present study phase. Algorithms for three processes were provided and incorporated into the OO model, underlining one important feature of OO design: the ability to have complex data structures and operations described in a single model. Java code was generated from the model and the algorithms implemented. The prototype was highly successful and reinforced confidence in the OO approach for treating the data. The reduction process is inherently distributed, and naturally matched to distributed parallel processors.

7 Data analysis