Summarising changes in air temperature over Central Europe by quantile regression and clustering

The analysis of trends in air temperature observations is one of the most common activities in climate change studies. This work examines the changes in daily mean air temperature over Central Europe using quantile regression, which allows the estimation of trends not only in the mean but in all parts of the data distribution. A bootstrap procedure is applied to assess the uncertainty of the derived slopes, and the resulting distributions are summarised via clustering. The results show considerable spatial diversity over the central European region. A distinct behaviour is found for lower (5 %) and upper (95 %) quantiles, with higher trends around 0.15 °C decade⁻¹ at the 5 % quantile and around 0.20 °C decade⁻¹ at the 95 % quantile, the largest trends (>0.2 °C decade⁻¹) occurring in the Alps.


Introduction
Temperature changes can have profound impacts on socioeconomic activities and on human health. Spells of excess heat, for example, are associated with increasing mortality rates (e.g. Patz et al., 2005), changes in water resources (e.g. Zappa and Kan, 2007) and higher occurrence of forest fires (Trigo et al., 2006). Over Europe, significant trends in extreme temperatures are expected due to changes in large-scale circulation and snow cover extent (e.g. Della-Marta et al., 2007; van den Besselaar et al., 2010). Due to its considerable societal relevance, the temporal evolution of temperature extremes is the focus of intense scientific research. Temperature extremes are often analysed by fitting trends to pre-defined extreme indices (e.g. Frich et al., 2002) or data percentiles (e.g. Hertig et al., 2010), or by applying extreme value theory (e.g. Nogaj et al., 2006; Huguet et al., 2008; Parey, 2008; Scotto et al., 2011). However, in a climate change context the temporal evolution of mean temperature is of relevance as well. Variability in the mean influences variability in extremes (although it does not completely determine it), and is easier to quantify since the available sample size is naturally much larger than in the case of extreme temperature events. Furthermore, correct simulation of the temporal mean of climate variables is one of the first requirements for numerical climate models.
Correspondence to: S. Barbosa (sabarbosa@fc.ul.pt)
The long-term temporal evolution of a time series of averaged temperature observations (e.g. global or daily mean temperature) is often characterised by a single number, the slope of a linear trend model fitted to the data. In that case, the variability is assumed to remain constant in time: when fitting a linear trend by ordinary least squares, the mean is assumed to evolve in time but the data distribution (and specifically the variance) is assumed to remain unchanged over the considered period. However, this is not necessarily a realistic assumption. Of particular interest is the assessment of whether such a hypothesis is reasonable, or whether the distribution itself changes in time, leading to different slope values for different parts of the data distribution.
Quantile regression (Koenker and Bassett, 1978; Koenker and Hallock, 2001) provides a well-defined statistical framework for estimating the rate of change not only in the mean, as in ordinary regression, but in all parts of the data distribution. This work addresses the temporal evolution of daily mean temperature over Central Europe. Quantile regression is applied in order to thoroughly quantify the variability structure of daily mean temperature, and to assess whether just the mean temperature evolves in time or whether there are changes in the data distribution itself, thereby obtaining a more complete picture of the temporal changes in daily mean temperature.
For regional studies, not only the temporal evolution but also the spatial distribution of temperature changes is of scientific interest. The regional variability of air temperature observations is often analysed by taking each individual temperature time series and summarising the information for the region of interest in terms of maps of individual features. An alternative approach to spatially characterise regional variability is cluster analysis (e.g. Scotto et al., 2009;Mahlstein and Knutti, 2010). The aim of the present study is to describe regional temperature variability over Europe by combining quantile regression and a time series clustering procedure in the analysis of European daily mean temperature records. The data and methodological approach are described in the next section, results are presented in Sect. 3 and concluding remarks are given in Sect. 4.

Data
Time series of daily mean temperature are obtained from the blended European Climate Assessment (ECA) dataset (Klok and Tank, 2009). Only stations in western Europe with data from at least January 1901 to December 2007 and with less than 2 % of missing observations have been selected (Fig. 1 and Table 1). As a pre-processing step, seasonality is removed from each temperature record via sinusoidal regression at annual and semi-annual frequencies (e.g. Kedem and Fokianos, 2002). The ECA dataset is subject to quality control procedures, but inhomogeneities that could influence the analysis of extreme temperatures can remain (Wijngaard et al., 2003). The deseasoned data are therefore inspected for potential outliers. The record from Wien was removed from the analysis due to the presence of some suspicious values (exceeding 7σ) in the last part of the record.
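As an illustration of the deseasoning step, the following minimal sketch fits a mean plus annual and semi-annual harmonics by least squares and subtracts the fitted seasonal cycle. It is not the authors' code: the function names, the period of 365.25 days, the use of day counts as the time index and the choice to keep the mean level in the output are all assumptions made for this example.

```python
import math


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def harmonic_row(day, period):
    """Regressors: constant, annual cos/sin, semi-annual cos/sin."""
    w = 2.0 * math.pi * day / period
    return [1.0, math.cos(w), math.sin(w), math.cos(2 * w), math.sin(2 * w)]


def remove_seasonal_cycle(days, temps, period=365.25):
    """Return (deseasoned series, fitted coefficients).

    Assumption: only the harmonic terms are subtracted, so the
    mean level of the record is kept in the output.
    """
    rows = [harmonic_row(d, period) for d in days]
    p = len(rows[0])
    # Normal equations (X'X) beta = X'y for the least squares fit.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, temps)) for i in range(p)]
    beta = solve_linear_system(xtx, xty)
    seasonal = [sum(b * v for b, v in zip(beta[1:], r[1:])) for r in rows]
    return [y - s for y, s in zip(temps, seasonal)], beta
```

Missing observations can simply be left out of `days` and `temps`, since the fit does not require regular sampling.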

Quantile regression
Quantile regression is a well-defined statistical framework for regression on quantiles rather than regression on the mean. Although it was first introduced in econometrics by Koenker and Bassett (1978), quantile regression is increasingly applied in various geoscience contexts (e.g. Koenker and Schorfheide, 1994; Cade and Noon, 2003; Baur et al., 2004; Elsner et al., 2008; Barbosa, 2008).
Given a random variable $Y$ with continuous cumulative distribution function $F_Y(y)$, the quantile $\tau$ is defined as
$$Q_Y(\tau) = F_Y^{-1}(\tau) = \inf\{y : F_Y(y) \geq \tau\}.$$
Whereas ordinary regression is based on the conditional mean function $E[Y \mid X = x]$ and minimisation of the respective residuals, quantile regression is based on the conditional quantile function $Q_Y(\tau \mid X = x)$ and minimisation of the sum of asymmetrically weighted absolute residuals
$$\sum_i \rho_\tau\big(y_i - Q_Y(\tau \mid X = x_i)\big),$$
where $\rho_\tau$ is the tilted absolute value function. Further details can be found in Koenker and Hallock (2001) and Koenker (2005).
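The minimisation described above can be made concrete with a small sketch for a straight-line trend in time. It uses the tilted absolute value function in its usual form, $\rho_\tau(u) = u\,(\tau - I(u < 0))$ (Koenker and Hallock, 2001), and relies on the fact that, for a line with intercept and slope, an optimal fit exists that passes through at least two observations. The exhaustive search over pairs below is an assumption made for clarity, not the algorithm used in this study, and is only practical for small samples; the function names are hypothetical.

```python
from itertools import combinations


def check_loss(u, tau):
    """Tilted absolute value: tau * u if u >= 0, (tau - 1) * u otherwise."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def quantile_trend(t, y, tau):
    """Return (intercept, slope) minimising the summed check loss.

    Brute force: every line through two observations is a candidate,
    so the cost grows as O(n^3). Illustration only.
    """
    best = None
    for i, j in combinations(range(len(t)), 2):
        if t[i] == t[j]:
            continue  # vertical line, not a valid candidate
        slope = (y[j] - y[i]) / (t[j] - t[i])
        intercept = y[i] - slope * t[i]
        loss = sum(
            check_loss(yk - intercept - slope * tk, tau)
            for tk, yk in zip(t, y)
        )
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    return best[1], best[2]
```

For example, `quantile_trend(t, y, 0.5)` gives the median trend, while values of `tau` such as 0.05 or 0.95 give trends in the lower and upper parts of the distribution.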

Clustering procedure
This section introduces the time series clustering procedure used to classify the time series of daily mean temperature according to the distributions of their quantile slopes at lower (0.05), middle (0.5) and upper (0.95) quantiles. The starting point is the panel of time series $(X^{(1)}, \ldots, X^{(T)})$. The clustering is carried out in three stages. First, for a fixed (but arbitrary) quantile, the distribution of the quantile slope estimates is estimated for each series. Second, the corresponding dissimilarity matrix is computed. Finally, classical cluster techniques are applied to the dissimilarity matrix to build a dendrogram, which provides the clusters formed by the distributions of the quantile slopes. The agglomerative hierarchical method with unweighted average distance (average linkage) is used as the grouping criterion.
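The final stage can be sketched as follows. This is a minimal, unoptimised illustration of agglomerative clustering with average linkage applied to a precomputed dissimilarity matrix; the function name and the stopping rule (a requested number of clusters rather than a full dendrogram) are assumptions for the example.

```python
def average_linkage(dist, n_clusters):
    """Agglomerative clustering with unweighted average linkage.

    dist is a symmetric matrix (list of lists) of pairwise
    dissimilarities. Returns the final clusters (lists of indices)
    and the merge history as (cluster_a, cluster_b, distance) tuples.
    """
    clusters = [[i] for i in range(len(dist))]
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average of all pairwise distances between the two groups.
                total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                d = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges
```

The merge history holds the information a dendrogram displays: which groups are joined and at what average distance.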
The clustering procedure is based on the computation of distances between pairs of distribution functions. Therefore an adequate metric between univariate distribution functions is required. The simplest one is the absolute difference between the means of the distributions, but two distributions can coincide in mean and be completely different in other aspects. Alonso et al. (2006) and Vilar et al. (2010) demonstrated that the choice of metric plays a key role and should reflect the final goal of the clustering procedure. Scotto et al. (2010, 2011) proposed the weighted $L_2$-Wasserstein distance as a metric, since it can be approximated by a fast computational procedure and has two convenient interpretations: (1) as a weighted sum of squared quantile differences, so that it takes into account more than the mean or median behaviour, and (2) in the case of a Gaussian distribution, as a weighted sum of mean differences and standard deviation differences. Since in the present study the interest is not only in the mean behaviour of the distribution but also in the uncertainty related to estimation, this weighted $L_2$-Wasserstein distance between two quantile slope distributions has been adopted.
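A minimal sketch of such a distance between two samples of slopes is given below. For two samples of equal size, the $L_2$-Wasserstein distance between their empirical distributions is the root mean squared difference between their order statistics, which is the quantile-based reading in interpretation (1). The weighting scheme of Scotto et al. is not specified here, so the `weights` argument is a hypothetical placeholder that defaults to uniform weights (the unweighted distance); the function names are also assumptions.

```python
import math


def wasserstein_l2(sample_a, sample_b, weights=None):
    """L2-Wasserstein distance between two equal-size samples.

    weights is a hypothetical hook for a weighted version; with the
    default uniform weights this is the unweighted distance.
    """
    a, b = sorted(sample_a), sorted(sample_b)
    if len(a) != len(b):
        raise ValueError("samples must have the same size")
    n = len(a)
    if weights is None:
        weights = [1.0 / n] * n
    # Weighted sum of squared differences between matching quantiles.
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))


def dissimilarity_matrix(samples):
    """Pairwise distances between a list of slope samples."""
    return [[wasserstein_l2(a, b) for b in samples] for a in samples]
```

With uniform weights and two Gaussian distributions, the squared distance reduces to the squared difference in means plus the squared difference in standard deviations, in line with interpretation (2).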

Quantile slopes
Quantile regression has been applied to each temperature record in order to describe the temporal variability of different quantiles of the data distribution. The quantile slopes and corresponding standard errors have been derived using the algorithm of Koenker and D'Orey (1987). As an illustration, detailed results are presented for Paris (PAR) and Graz (GRA). The results for all the stations are presented in Sect. 3.2.
The Paris and Graz records are shown in Fig. 2, along with the quantile slopes at quantiles 0.05, 0.5 and 0.95, corresponding respectively to the lowest 5 %, 50 % (median) and 95 % of the ordered observations. A more complete description of the quantile regression results is given in Fig. 3, which displays for the Paris and Graz records the quantile slopes and the corresponding standard errors computed for quantiles 0.1 to 0.9 in steps of 0.02. Figure 3 clearly shows a distinct pattern for the two records. In the case of Paris, the derived slopes are very similar for all quantiles, and also similar to the usual ordinary least squares slope, indicating that the distribution of temperature values is approximately symmetric and that the rate of change is the same for all parts of the data distribution. However, in the case of Graz the lower, middle and upper quantiles behave differently, with the upper quantiles of the temperature data distribution increasing at a much faster rate than the middle and lower values.
In order to further assess uncertainty on the quantile slope estimates, a bootstrap procedure is applied. Bootstrap allows us to obtain a distribution of slope values instead of a single point estimate, but it is a delicate strategy in the case of non-independent data, since the temporal structure of the time series needs to be preserved in the bootstrap samples. Furthermore, bootstrap procedures often assume stationarity, an assumption which is not verified by most hydro-climatic series. In this work, maximum entropy bootstrap (Vinod and de Lacalle, 2009) is used since it preserves the temporal structure of the original series in the bootstrap replicates without assuming stationary behaviour. For each time series of daily mean temperature, 200 replicates have been obtained by maximum entropy bootstrap; experiments (not shown) indicate that in this case there is no need for a larger number of replicates. Quantile regression is then applied to each replicate, resulting in a sample of 200 quantile slopes instead of a single point estimate. The results are shown as histograms in Fig. 4 and allow the assessment of the dispersion around the quantile slope point estimates. Graz exhibits a clearly different distribution for the upper quantile slope relative to the median and lower slopes. The asymmetry in the bootstrap distributions is a consequence of the asymmetry of the quantile slope statistic.
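The idea of preserving temporal structure in the replicates can be illustrated with a deliberately simplified sketch. It is not the maximum entropy bootstrap of Vinod and de Lacalle (2009): it keeps only the rank-preserving idea (new values are drawn, sorted, and placed back in the rank order of the original series) and omits the mean-preserving constraint and the tail treatment of the published algorithm. The function name, the piecewise-uniform density and the tail width are assumptions made for this example.

```python
import random


def rank_preserving_replicate(x, rng):
    """Draw one replicate that keeps the rank order of x.

    Simplified illustration, not the published maximum entropy
    bootstrap. Requires at least two observations.
    """
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])  # time index of each rank
    xs = [x[i] for i in order]
    # Assumed tail width: mean absolute change between consecutive times.
    spread = sum(abs(x[i] - x[i - 1]) for i in range(1, n)) / (n - 1)
    # Interval limits: midpoints between neighbouring order statistics.
    limits = (
        [xs[0] - spread]
        + [(xs[k] + xs[k + 1]) / 2.0 for k in range(n - 1)]
        + [xs[-1] + spread]
    )
    # Each interval carries probability 1/n; values are uniform within it.
    draws = []
    for _ in range(n):
        k = rng.randrange(n)
        draws.append(rng.uniform(limits[k], limits[k + 1]))
    draws.sort()
    # The smallest draw goes where the smallest observation was, and so on.
    replicate = [0.0] * n
    for rank, i in enumerate(order):
        replicate[i] = draws[rank]
    return replicate
```

Applying a quantile slope estimator to each of a set of such replicates yields a sample of slopes whose spread describes the uncertainty of the point estimate.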

Clustering
The analysis outlined in Sect. 3.1 is repeated for each one of the 28 stations and the resulting point estimates are displayed in Table 2. Maximum entropy bootstrap is applied to assess uncertainty, yielding a set of distributions of quantile slope estimates as shown in Fig. 4 for the two selected records. In order to summarise those distributions, the average linkage procedure is applied to obtain dendrograms of slopes for quantiles 0.05 (Fig. 5), 0.5 (Fig. 6) and 0.95 (Fig. 7). The dendrogram for the lower quantile clearly discriminates two groups: stations with larger slopes, typically >0.1 °C decade⁻¹ (LUG, GRA, ZAG, LJU, DEB, SON, KOE, PAR, STO), and the remaining stations with smaller slopes. The cluster of large slopes further distinguishes the stations with the highest slopes, >0.15 °C decade⁻¹ (PAR and STO), from the other stations. The cluster of small slopes further subdivides into stations with moderate slopes and stations with very small or non-significant trends (LEI, HAM, OSI, ZUE, VES, BRE, BOL, HAL). The dendrogram for the median quantile (Fig. 6) first distinguishes the stations with highest slopes, >0.15 °C decade⁻¹ (LUG, GRA, SAE, SON, PAR). Within the remaining stations, the major subdivision discriminates the stations with the lowest slopes, <0.04 °C decade⁻¹ (LEI, OSI, HAL, BRE, HAM). A similar pattern is found in the dendrogram for the upper quantile (Fig. 7). The first major subdivision clusters the stations with highest slopes (GRA, SAE, SON, LJU, HOH, PAR). The remaining stations are subdivided into clusters of moderate slopes and a cluster of stations with the lowest slopes, typically <0.1 °C decade⁻¹ (GEN, SAL, BER, OSI, BRE, HAL, LEI, STO) (Table 2). Although the clustering results mainly reflect the quantile slope values for each station, the advantage of the adopted clustering procedure is that it classifies the distribution of slopes resulting from the bootstrap analysis, instead of a single point value.

Discussion and conclusions
In the present study, quantile regression and clustering are applied to the study of changes in daily mean air temperature over Central Europe. Quantile regression allows us to compute trends at different quantiles of the data distribution within a well-defined statistical framework. Maximum entropy bootstrap allows the assessment of the uncertainty on the computed slopes while taking into account data serial dependence and non-stationary behavior. Finally, a classical clustering procedure allows us to summarise the resulting distributions of sample quantile slopes.
As in ordinary regression, the slopes for a fixed quantile are not the same across Europe, reflecting the spatial dependence of air temperature trends. For example, in the central part of the data distribution (median quantile), trends vary from 0.02 °C decade⁻¹ to 0.18 °C decade⁻¹. Furthermore, quantile regression allows us to assess trends at different quantiles of the data distribution. While for some stations (e.g. PAR, GEN) the trend is the same for all quantiles, the results show that most stations exhibit different slopes for the 5 %, 50 % and 95 % quantiles. This is also reflected in the difference between clusters for different quantiles. Thus, the rate of temporal change is not the same for all parts of the data distribution, as implicitly assumed in ordinary regression. At the 5 % quantile, the largest trends are around 0.15 °C decade⁻¹, while at the 95 % quantile the largest trends are around 0.20 °C decade⁻¹. This indicates a tendency towards larger increases in the upper part of the data distribution (large positive temperature anomalies).
The largest gradient of spatial variability occurs at the 95 % quantile, with slopes ranging from 0.03 °C decade⁻¹ to 0.24 °C decade⁻¹. The largest trends (>0.2 °C decade⁻¹) are found in the Alps (GRA, SAE, SON). Most stations show a much larger trend in the upper part (95 % quantile) than in the central and lower part of the data distribution, consistent with the projected warming for Central Europe (Kjellstrom et al., 2007). The northernmost station, STO, exhibits a contrasting behaviour, with its highest trend (0.19 °C decade⁻¹) at the lowest quantile, decreasing to 0.03 °C decade⁻¹ at the 95 % quantile.
The quantile regression results are presented here in terms of slopes at fixed quantiles for all the stations. An alternative approach would be to perform clustering on the distribution of slopes for all quantiles of each station (as displayed in Fig. 3) in order to better compare changes in slopes across different quantiles. This is computationally very intensive, and further methodological work is required on how to incorporate the uncertainty on the slope estimate at each quantile in such a clustering procedure. However, clustering of slopes across quantiles remains a promising perspective for further work, along with the application of the methodology to outputs from regional climate models.