A&A 472, 293-298 (2007)
DOI: 10.1051/0004-6361:20077574
A. Asensio Ramos
Instituto de Astrofísica de Canarias, 38205 La Laguna, Tenerife, Spain
Received 30 March 2007 / Accepted 14 June 2007
Abstract
Aims. We investigate the statistical properties of the extreme events of the solar cycle as measured by the sunspot number.
Methods. The recent advances in the methodology of the theory of extreme values are applied to the maximal extremes of the time series of sunspots. We focus on the extreme events that exceed a carefully chosen threshold and a generalized Pareto distribution is fitted to the tail of the empirical cumulative distribution. A maximum likelihood method is used to estimate the parameters of the generalized Pareto distribution and confidence levels are also given to the parameters. Due to the lack of an automatic procedure for selecting the threshold, we analyze the sensitivity of the fitted generalized Pareto distribution to the exact value of the threshold.
Results. According to the available data, which only span the previous 250 years, the cumulative distribution of the time series is bounded, yielding an upper limit of 324 for the sunspot number. We also estimate that the return value for each solar cycle is
188, while the return value for a century increases to
228. Finally, the results also indicate that the most probable return time for a large event such as the maximum at solar cycle 19, happens once every
700 years and that the probability of finding such a large event with a frequency smaller than
50 years is very small. In spite of the essentially extrapolative character of these results, their statistical significance is very large.
Key words: methods: data analysis - methods: statistical - Sun: activity - Sun: magnetic fields
One of the most striking examples of this is the case of extreme events. In spite of their inherent rarity, extreme events sometimes play important roles and turn out to be of fundamental importance. In certain fields (analysis of precipitation and floods, maximum temperatures, global climate, etc.), extreme events are those that produce radical and serious changes. For this reason, the extreme value theory is well-developed in these fields and has been applied in recent decades with great success.
We find few applications of the theory of extreme values in the field of Astrophysics
(e.g., Bhavsar & Barrow 1985; Bhavsar 1990; Bernstein & Bhavsar 2001). This situation is somewhat surprising because
usually the most interesting events that astrophysicists study
are the most extreme ones. In this paper, we apply the theory of extreme values and its most recent
advances to the investigation of the solar activity cycle. One of the best-known indicators of solar
activity is the sunspot relative number. This indicator closely follows the 11-year solar activity cycle and
has been continuously tabulated since 1750. During the recent decades, there has been increasing
interest in the prediction of the upcoming solar cycles and different techniques have been applied
(e.g., Dikpati & Gilman 2006; Li et al. 2001; Orfila et al. 2002; Du 2006).
The main cause for this interest is to be found, not only on the pure scientific curiosity of knowing in
advance the amplitude of the following solar maximum, but in the influence of a strong (or weak) solar
maximum on the interplanetary medium. A strong solar maximum can induce solar storms which can
damage the enormous number of satellites around the Earth. Another reason is the
feasibility of a tripulated mission to Mars. The required long journey has to be carried out
away from strong solar maxima to minimize the exposure to dangerous doses of radiation.
Our work analyzes the statistical properties of these extreme activity events to estimate their
frequency and amplitude. This could serve to gain some insight into the
efficient amount of shielding needed to protect satellites and/or tripulated missions.
For this reason, the ability to predict when such extreme solar cycle events happen is of interest. A few studies have been oriented towards the investigation of the statistics of such extreme events. Siscoe (1976b,a) used the theory of extreme value to analyze the largest sunspot number per solar cycle. Later, Willis & Tulunay (1979) extended these studies by analyzing the data from sunspots umbrae, complete sunspots (umbrae+penumbrae) and faculae during nine solar cycles. However, this type of analysis has not been repeated during the last 25 years, where more than two solar cycles have occurred. Additionally, the previous studies were based on older approaches to the statistics of extreme values.
In this paper we extend the previous analysis of the solar cycle based on the monthly variation of the sunspot number. The interest of this study resides in the application of the more recent techniques to infer the statistical properties of the extreme values of the activity cycle. Our purpose is to identify clues that help forecast the extremes of the solar cycle and their occurrence frequency for extended periods of time (not only in the future, but also in the past). In more detail, once the probability distribution function of extreme events (largest number of sunspots) is characterized, we investigate whether this distribution is limited and which are the typical events that we can expect for a given amount of time.
Two different approaches are used mainly for the analysis of extreme events. The essential difference
is in the way extreme events are defined. Let
be a sequence of independent random variables
which have a common distribution function F. Each Xi can be considered as a measure of a random process
taken with a certain timestep
.
The first approach, termed the block maxima approach (Fisher & Tippett 1928; Coles 2001), takes as extreme events the
maximum (or minimum) value of the random variable in fixed intervals of time. For instance, if the timestep
is considered to be one hour, the maximum among 24 consecutive measurements is the daily maximum.
The second approach, termed the
peaks over threshold (POT) approach (e.g., Coles 2001), takes as extreme events all the values
of the time series that
exceed a given threshold. The first approach was that taken by Siscoe (1976b,a), and Willis & Tulunay (1979)
for the analysis of extreme values in the temporal evolution of sunspot numbers. Traditionally it has been selected
for the analysis of time series in which a clear seasonal (periodic) behavior is detected.
However, in spite of the theoretical simplicity of the block maxima approach, it suffers from important drawbacks.
One of the most important limitations is that it tends to make inefficient use of the data. The importance
of the necessity to overcome this lack of efficiency lies in the fact that the extreme value theory deals with
extreme events, which are, by definition, scarce. For this reason, POT is becoming the fundamental
method of choice in recent applications of the theory of extreme events thanks to the efficient
use of the reduced amount of data available.
We now briefly present the theoretical results which we use in this paper. Assume that we measure a random variable
at constant time intervals and that we obtain a sequence
.
The measurements have to
be independent and they have a common distribution function. The sequence can then be
described with the aid of the cumulative distribution function F. Since we are focusing on
extreme events, we are interested in the tail of such distribution. The POT approach is based on
analyzing what is known as the conditional excess distribution function, Fu(y), defined as:
![]() |
(1) |
![]() |
(3) |
With the aid of the theorem developed by Pickands (1975), the functional form of the cumulative distribution
function for events above u can be written. Making x=u+y and solving for F(x) in Eq. (2),
the cumulative distribution function for events above the threshold u can be written as:
The parameters of the GPD are usually obtained from the empirical data by means of a maximum likelihood
estimation. Assuming that
y1,y2,..., yN are the N values of the original time series that
exceed the threshold u, the log-likelihood is (Coles 2001):
![]() |
Figure 1: Solar cycle variation of the International Sunspot Number. |
Open with DEXTER |
The time series representing the solar activity cycle during the last 257 years is shown in Fig. 1. The
data represent the sunspot number, an estimation of the number of individual sunspots (counting individual
sunspots and group of sunspots). In order to decrease the dispersion, we focus here on the monthly averaged value.
The data have been tabulated since 1750 and it is now known as the
International Sunspot Number. The time series presents a clear regularity
with time as a consequence of the influence of the solar cycle on the surface magnetism. Under these
circumstances, the premises on which the POT formalism lies are not fulfilled. In particular the
random variables are not independent because there is a certain degree of correlation between consecutive
events: a large sunspot number is typically followed by another large sunspot number.
Several techniques
have been devised to overcome this difficulty. One of the most applied methods is the "de-clustering'' of the
time series (e.g., Coles 2001). It consists of locating clusters in the excedance over the
threshold and representing them by the maximum value inside each cluster. This has two undesired
consequences: (i) the number of events available for the GPD analysis is reduced and (ii) a somewhat
arbitrary criterion for the cluster definition has to be included. Recently, Fawcett & Walshaw (2007) have
shown that this de-clustering technique introduces biases in the maximum likelihood estimations of
and
.
They also show that the direct application of the POT method using the whole time
series neglecting any temporal periodicity leads to negligible biases. The price to pay is that the
confidence intervals for the GPD parameters are larger than those obtained using standard techniques
(Coles 2001). Following Fawcett & Walshaw (2007), we apply the POT method to the sunspot number time series
without any de-clustering technique. The POT formalism also requires the underlying distribution
of the random variables to be stationary (see, e.g., Coles 2001). The large number
of studies that are successful in reproducing the time evolution of the International Sunspot Number
using deterministic methods (e.g., Dikpati & Gilman 2006; Choudhuri et al. 2007; Cameron & Schüssler 2007; Verdes et al. 2000)
suggest that this is the case. Therefore, we can safely consider that
the physics (probability distribution function of the random variables) driving the solar cycle does not
vary appreciably with time.
![]() |
Figure 2:
Summary of the quality of the fit of the empirical cumulative distribution to a generalized Pareto
distribution. The left panel shows the value of the cumulative distribution for different values of the
sunspot number above the selected threshold of 149.4. The right panel shows the variation of the quality of
the fit when the uncertainties in the ![]() ![]() |
Open with DEXTER |
In this paper we only focus on the statistics of the upper tail of the distribution, i.e., maximal values
of the sunspot number. Although the application of the POT formalism to the analysis of the distribution of minimal
values is also possible, larger time series are needed. We briefly discuss this issue in Sect. 4.
In order to apply the POT formalism, a threshold u has to be fixed. The threshold has to be sufficiently large so that
the generalized Pareto distribution is a suitable functional form for describing the tail of the cumulative distribution
and it has to be sufficiently small so that enough values are available to give an accurate estimation
of the parameters of the GPD. There is not any known automatic procedure for the selection of the threshold.
In this paper we choose a value of the threshold based on reasonable ideas and we verify the behavior of
the parameters of the GPD for different values of the threshold. In our case, u has been chosen as the value that
leaves 96% of the points of the time series below and only 4% of the points are considered as extreme values. For
the dataset shown in Fig. 1, we find u=149.4. From the original set of 3096 data points, we leave
122 points above the threshold which are used to fit the GPD neglecting any time dependence. The empirical
cumulative distribution function for points
above the threshold is built and the values of
and
that give the best fit are obtained.
The GPD parameters have been estimated maximizing the log-likelihood given by Eq. (5). As
noted above, such an approach permits the most probable values and their confidence intervals to be obtained.
In our case, we obtain
and
,
as shown in Table 1. With these values, the ensuing cumulative distribution
function is shown in the left panel of Fig. 2, where we show the value over the threshold
on the horizontal axis
and the value of the GPD on the vertical axis. The right panel of Fig. 2 shows the
empirical cumulative distribution versus the fit Fu(y) using the GPD. This is the
so-called probability plot, and it clearly indicates that the GPD produces a good approximation to the empirical cumulative
distribution.
Several interesting points deserve comment. Firstly, the confidence intervals presented in Table 1 are probably underestimating the true confidence intervals because we have neglected the
temporal dependence. Secondly, a negative value for the shape
parameter
is found. The evidence for this is strong because the 95% confidence interval is
almost exclusively in the negative domain. According to the results
presented above, this suggests that the cumulative distribution
is bounded and that zero probability is assigned to excesses above a certain limit, given by the ratio
.
In our case, we find that the fitted GPD assigns zero probability to extreme values larger than 175.6 above
the threshold. Taking into
account the chosen threshold, this gives a limit of
.
This value is consistent with the
data presented in Fig. 1. Note also that the 95% confidence interval is
[270.0, 990.0], which
has been obtained from the
likelihood function. It gives a very stringent value of the lower limit (as is obvious because of the presence of
a large amount of data below the threshold) but a very large upper limit. A 68% confidence interval
is estimated to be [288,420]. This is a clear indicative of the asymmetry of the likelihood function with a
very long tail towards larger
,
giving the idea that the information for a strong upper limit is
hardly present in the data. It is however important to take into account that these results can
fluctuate depending on the exact value of the chosen threshold. Furthermore, it can also fluctuate in the
future if the same calculation is carried out with more data from subsequent solar cycles.
Table 1: Parameters and return level estimates. The values are obtained for a threshold of u=149.4.
In order to
analyze the strength of this conclusion, we carried out the fit of the cumulative distribution for
different values of the threshold. The results are shown in Fig. 3 for different values of the
threshold u, the upper panel showing the values obtained for
and the lower panel the values for
.
If the GPD is a reasonable model for the excedances above a certain threshold u0, it should remain
reasonable for larger values of the threshold (e.g., Coles 2001). As a consequence, the estimates
of
and
should remain constant above u0 provided u0 is a valid threshold. We see that this
happens in our case for u > 150, the value we have chosen above.
It is interesting to note that for
,
approximately 9% of the points lie above the threshold, while for
,
less than 1% of the points lie above. There is a clear increase in the uncertainty of the
retrieved parameters when the threshold increases due to the lack of points. What seems clear is that,
for u < 150, the
parameter has only negative values inside the error bars. For larger values of the
threshold, both negative and positive values for
are obtained, although negative values tend to be
more probable.
![]() |
Figure 3:
Values of the ![]() ![]() |
Open with DEXTER |
![]() |
Figure 4: Return value for a given amount of time (or return time for a given level) for the fit obtained with a threshold of u=149.4. The dashed lines indicate the confidence levels of the return value taking into account the uncertainty in the GPD parametrization. The vertical dotted line indicates 11 years, approximately one solar cycle. The horizontal line indicates the value of the maximum found on solar cycle 19. |
Open with DEXTER |
Once the cumulative distribution given by Eq. (4) is obtained, several statistical
properties of the extreme values can be inferred. One of the most interesting in terms of prediction of future
events are the so-called return times and return level. The return time is the typical time one has to wait
until an event reaches and surpasses a threshold. Similarly, the return level is the typical extreme event
one would find after waiting for a given amount of time. These quantities are obtained easily from
Eq. (4) and they are shown in Fig. 4. For consistency,
we only show the results for values above the chosen threshold. The dashed lines present the confidence interval
which is induced by the uncertainty in the inferred parameters of the GPD. The vertical dotted line
approximately indicates a solar cycle, of the order of 11 years. For such an amount of time, the return level
equals 187. If we take into account the confidence interval, we find that the return level lies
in the interval [180,203] with 95% confidence. These values appear to be consistent with the empirical
results from the past, according
to Fig. 1. This is produced by the similarity between the empirical tail of the extreme distribution
and the GPD, which makes the statistical properties inferred from the GPD a good estimation of the statistical
properties of the empirical distribution. If we assume that the previous 23 solar cycles are
representative of the behavior of the
Sun during a longer time period and unless a long-period modification of the solar cycle exists, the previous estimations
of the return levels are statistically significant. For the case of the return level for 100 years, we find
a value of
228, with a confidence interval of [208,254]. Again, this value is consistent with the
data.
Concerning the return times, it is interesting to estimate them for the most extreme cases in the observed
dataset. It is important to have in mind that relying on the GPD for very long-term extrapolation can
be dangerous. The dataset on which we have applied the extreme value theory spans only 250 years,
so one should not blindly rely on the return values for large events if they only happen once in
the original data. For instance, according to the GPD, an event such as the extremely strong peak of cycle 19
(the maximum of 253.8 took place during 1957) occurs
once every
700 years. Taking into account the uncertainty in the GPD fit, we find that it is possible to
give only a lower limit to this return time because the upper limit is unbounded. The horizontal line
indicates the maximum value of 253.8 obtained during cycle 19. The intersection of this horizontal line
with the solid line given by the GPD extrapolation occurs at
700 years, showing the
most probable value of the return time. The
intersection with the dashed lines indicate the intervals of 95% confidence. In this case, we find that
the GPD extrapolation implies that such an extreme event happens, with 95% probability, with a
frequency above once every
50 years, with an apparently unbounded upper limit.
We want to leave a word of caution on the values obtained above because of their extrapolative character. However, we also want to stress the fact that this extrapolative character is based on strong theoretical roots. A small variation on the results obtained in this work might be expected as soon as more data showing more extreme events is available. This variation with respect to the values presented above should be very small if the underlying probability distribution function (essentially, the physics that drives the solar cycle) that we have calculated does not change appreciably with time.
We have presented an extreme value theory analysis of the solar cycle as measured by the sunspot number. The
peak-over-threshold method has been applied. The analyzed time series presents a certain degree of correlation
because a large
number of sunspots is usually followed by another large number of sunspots and these variables cannot be considered
to be uncorrelated. Following Fawcett & Walshaw (2007), we have applied the POT method to the time series without
any de-clustering technique. As a consequence, the confidence intervals presented in this study could be an
underestimation of the
real confidence intervals because the likelihood function is affected by the presence of time correlation in the
time series. Our results indicate that the distribution of extreme solar cycle events is bounded so that the
value of 324 cannot be exceeded. The analysis of the confidence intervals give that, with 95% confidence,
this maximum value is larger than 288. Of more relevance are the 11-year and 100-year return values. The results
indicate that there is a very high probability of finding values in the range [180,203] every solar
cycle, and values in the range [208,254] every 10 solar cycles. Additionally, we have shown that an extreme
event such as that on solar cycle 19 (during 1957) occurs with a frequency above once every
50 years.
The results obtained in this paper are based on the statistical analysis of the tail of the sunspot number distribution. This analysis is driven by the extreme value theory that is based on strong theoretical roots developed during the last 50 years. Such application relies on the assumption that the limiting behavior of the stochastic process behind the observed time series can be obtained from the application of certain mathematical limits. A functional form for the tails of distributions is available and we only have to fit this tail distribution to our dataset. However, this approach has limitations. One of the strongest is that it is not yet clear whether the limiting mathematical models that we have used can be directly applied to finite time series. It is expected that, in the limit of an infinitely large time series, the models correctly reproduce the tails of the underlying distribution. However, when the time series is of limited size, fluctuations can be of importance and lead to inaccuracies. In our case, the results that we present appear to support the fact that the statistics of extreme events are correctly reproduced under the framework of the extreme value theory. The fact that the theory explains the already observed extreme events is also favorable. The validity of the theory is also supported by the large number of practical applications of the theory which we found in the literature.
The extreme value theory can also be applied to minima. One of the most interesting future applications of this approach is to estimate the return time for long low-activity periods, with the aim of estimating the frequency with which extreme events such as the Maunder minimum may take place. A much longer time series of solar activity would be needed for such an estimation. Furthermore, an appropriate re-definition of the time series has to be carried out in order to transform these low-activity periods into extremes cases. A possibility that could be of interest is to use the solar sunspot number reconstructed from 14C activity during the last 11 000 years (Solanki et al. 2004). This reconstruction shows periods of low activity similar to the Maunder minimum whose probability (and ensuing return time) could be calculated in the framework of the extreme value theory.
Acknowledgements
I thank A. López Ariste, R. Manso Sainz and V. Martínez Pillet for helpful discussions and the anonymous referee for constructive remarks. This research has been funded by the Spanish Ministerio de Educación y Ciencia through project AYA2004-05792.