5 Preparing the ANN input data

The ensemble of DISPIs was divided into several smaller samples with different temperature ranges. We chose seven ranges, broadly distinguishing between low- (the L-samples), mid- (the M-samples) and high-temperature stars (the H-samples). The abbreviations given in Table 2 are used throughout this work. The numbers in the last column of the table give the total number of DISPIs in the training set for each temperature interval, without and with extinction included (approximately the same number of DISPIs in each temperature range was used in the application set).

We found that the separation into such small temperature regions yielded improved parametrization results. This is understandable, as the classification results for the stellar parameters, especially $\log g$ and [M/H], depend on the presence of spectral features in a DISPI, which is in turn closely related to the temperature of a star. This effect was also found by Weaver & Torres-Dodgen (1997), who used spectra in the near-infrared and classified them in terms of MK stellar types and luminosity classes. For our ANN work we chose simple temperature ranges, also aiming at database subsamples of similar size. Although some of the intervals roughly correspond to the temperatures found in certain MK classes, which are characterized by certain line ratios, i.e. common physical characteristics, the chosen division was motivated by the need for a reasonable training time for the networks. Another reason was to see what can be learned, in principle, from DISPIs in different temperature regimes. The mid-temperature samples (M-samples) were defined for the range in which the Balmer jump and the H lines (e.g. H$_{\beta }$) change their role as indicators of temperature and surface gravity (see e.g. Napiwotzki et al. 1993). Under real conditions one would have to employ a broad classifier to first separate DISPIs into smaller (possibly overlapping) temperature ranges. This classifier could also be based on neural networks, or on other methods such as minimum distance methods. Each temperature sample was finally divided into two disjoint parts, the training data and the application data. This means that our classification results (see Sect. 6) are obtained for DISPIs lying in the gaps of our training grid.
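The sample preparation just described can be summarized in a short Python sketch. The container and attribute names (dispis, teff) and the even split fraction are illustrative assumptions, not taken from this work; only the temperature boundaries come from Table 2 below.

import numpy as np

# Temperature ranges of Table 2, in K (lower bound inclusive).
RANGES = {
    "L1": (2000, 4000),   "L2": (4000, 6000),   "L3": (6000, 8000),
    "M1": (8000, 10000),  "M2": (10000, 12000),
    "H1": (12000, 20000), "H2": (20000, 50000),
}

def split_by_temperature(dispis, teff):
    """Assign each DISPI to the Table 2 sample whose range contains its T_eff."""
    samples = {name: [] for name in RANGES}
    for d, t in zip(dispis, teff):
        for name, (lo, hi) in RANGES.items():
            if lo <= t < hi:
                samples[name].append(d)
                break
    return samples

def train_application_split(sample, fraction=0.5, seed=0):
    """Divide one temperature sample into two disjoint parts,
    training and application, so that the application DISPIs
    lie in the gaps of the training grid."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(sample))
    cut = int(fraction * len(sample))  # assumed ~even split, see text
    return ([sample[i] for i in idx[:cut]],
            [sample[i] for i in idx[cut:]])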

The generalization performance of a network, i.e. its ability to classify previously unseen data, is influenced by three factors: the size of the training set (and how representative it is), the architecture of the network, and the physical complexity of the specific problem, which includes the presence of noise. Although there are distribution-free, worst-case formulae for estimating the minimum size of the training set (based on the so-called VC dimension; see also Haykin 1999), these are often of little value in practical problems. As a rule of thumb, it is sometimes stated (Duda et al. 2000) that the training set should contain about $W \cdot 10$ different training samples, with $W$ denoting the total number of free parameters (i.e. weights) in the network. In our network without extinction there were 452 weights, so in some cases there were fewer training samples than free parameters. Nevertheless, we found good generalization performance (see Sect. 6 and the results therein). This may be due both to (1) the "similarity" of the DISPIs in a specific $T_{\rm eff}$ range, giving rise to a rather smooth (well-behaved) input-output function to be approximated, and (2) redundancy in the input space. Both effectively reduce the number of free parameters in the network.
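The rule of thumb can be made concrete with a small sketch that counts the free parameters of a fully connected feed-forward network. The layer sizes below are hypothetical, chosen only for illustration; the text states just the total of 452 weights for the network without extinction.

def count_weights(layer_sizes, biases=True):
    """Total free parameters of a fully connected feed-forward network."""
    w = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    if biases:
        w += sum(layer_sizes[1:])  # one bias per hidden and output node
    return w

# Hypothetical topology: 30 inputs, two hidden layers, 3 outputs.
W = count_weights([30, 12, 8, 3])   # 360 + 96 + 24 weights + 23 biases = 503
min_training_set = 10 * W           # rule of thumb from Duda et al. (2000)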

We also tested whether there are significant differences between determining each parameter separately in different networks and determining all parameters simultaneously. In the first case each network has only one output node, while in the latter case the network has multiple outputs. If the parameters ($T_{\rm eff}$, $\log g$ etc.) were independent of each other, one could train a network for each parameter separately. However, we know that the stellar parameters influence a stellar energy distribution jointly, at least in certain parameter ranges (e.g. hot stars show metal lines less clearly than cool stars). Also, for specific spectral features, changes in the chemical composition [M/H] can sometimes mimic gravity effects (see for example Gray 1992). Varying extinction can cause changes in the slope of a stellar energy distribution which are similar to those resulting from a different temperature.

Table 2: Abbreviations, temperature ranges, and number of DISPIs in the training sets, without and with extinction (the numbers in the application sets are similar).

sample   temperature range                           without/with ext.
L1       2000 K $\leq T_{\rm eff} <$ 4000 K          330/2300
L2       4000 K $\leq T_{\rm eff} <$ 6000 K          570/3980
L3       6000 K $\leq T_{\rm eff} <$ 8000 K          500/3500
M1       8000 K $\leq T_{\rm eff} <$ 10 000 K        400/2800
M2       10 000 K $\leq T_{\rm eff} <$ 12 000 K      180/1200
H1       12 000 K $\leq T_{\rm eff} <$ 20 000 K      390/2700
H2       20 000 K $\leq T_{\rm eff} <$ 50 000 K      450/3100


Recently, Snider et al. (2001) determined stellar parameters for low-metallicity stars from stellar spectra (wavelength range from 3800 to 4500 Å). They reported better classification results when training networks on each parameter separately. We tested several network topologies with the number of output nodes ranging from 1 to 3 (in the case of extinction, from 1 to 4) in different combinations of the parameters. We found that single-output networks did not improve the results, and we therefore classified all parameters simultaneously.
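The two topologies compared here can be sketched as follows. Scikit-learn's MLPRegressor serves as a generic stand-in for the networks actually used in this work; the hidden-layer size and iteration limit are arbitrary placeholders.

from sklearn.neural_network import MLPRegressor

def fit_joint(X, Y, hidden=(15,)):
    """One network with an output node per parameter
    (T_eff, log g, [M/H], and optionally extinction)."""
    return MLPRegressor(hidden_layer_sizes=hidden, max_iter=2000).fit(X, Y)

def fit_separate(X, Y, hidden=(15,)):
    """One single-output network per stellar parameter, trained independently."""
    return [MLPRegressor(hidden_layer_sizes=hidden, max_iter=2000).fit(X, Y[:, j])
            for j in range(Y.shape[1])]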

