the Creative Commons Attribution 3.0 License. Advances in

Abstract. Backtrajectory differences and clustering sensitivity to the meteorological input data are studied. Trajectories arriving in Southeast Spain (Elche) at 3000, 1500 and 500 m for the 7-year period 2000–2006 have been computed employing two widely used meteorological data sets: the NCEP/NCAR Reanalysis and the FNL data sets. Differences between trajectories grow linearly at least up to 48 h, showing faster growth after 72 h. A k-means cluster analysis performed on each set of trajectories shows differences in the identified clusters (main flows), partially because the number of clusters of each clustering solution differs for the trajectories arriving at 3000 and 1500 m. Trajectory membership to the identified flows is in general more sensitive to the input meteorological data than to the initial selection of cluster centroids.


Introduction
Backtrajectory analysis is a commonly used method to identify atmospheric transport patterns and/or determine the origin and pathway of air trace substances (e.g., Dorling et al., 1992; Brankov et al., 1998; Stohl et al., 2002; Jorba et al., 2004; Salvador et al., 2004).
Trajectory models are sensitive to a variety of parameters, including the source of wind field data, wind field spatial resolution, trajectory type (kinematic, isentropic, isosigma, isobaric) and the numerical integration scheme (for a review, see Stohl, 1998, and references therein). Differences between trajectories have been computed with Euclidean (EU) distances to study error sources (Rolph and Draxler, 1990; Stohl et al., 1995) and to study the sensitivity to the meteorological input data set (Harris et al., 2005). In this paper we report trajectory differences by computing great-circle (GC) and EU distances, and study the influence of the meteorological data on the results of backtrajectory cluster analysis.
While errors in trajectory calculation on the order of 20% of the distance travelled are considered typical (Stohl, 1998), the statistical analysis of a large number of trajectories arriving at a study site over a relatively long period of time increases the accuracy of the trajectory analysis. Therefore, backtrajectory cluster analysis is a suitable technique to classify the air masses arriving at a study site.
Correspondence to: J. A. G. Orza (ja.garcia@umh.es)
Cluster analysis is a multivariate statistical technique designed to classify a large data set into non-predefined dominant groups called clusters. However, clustering involves some subjective non-trivial decisions: the number of clusters to use, the selection of centroids in the initialization stage, etc. To determine the appropriate number of clusters and handle the sensitivity of the method to the initial centroids selection we have followed the procedures described by Dorling et al. (1992) and Mattis (2001) and considered some modifications to them in order to obtain smaller (better) values of the total Root Mean Square Deviation (RMSD), the clustering figure of merit.
[...] of NCEP operational model runs (FNL data), converted from a 1° latitude-longitude grid and 13 pressure levels (Draxler and Rolph, 2003). Three-dimensional trajectories that use the vertical wind component of the data set were considered.

The statistical measures of trajectory sensitivity employed are closely related to those used in earlier studies (Rolph and Draxler, 1990; Stohl et al., 1995; Harris et al., 2005). The Horizontal Transport Deviation (HTD) t hours out is investigated by analyzing the frequency distribution of dist_n(t), the (GC or EU) distance between the two points corresponding to t hours of the nth pair of trajectories (FNL, RP) to compare. Then, for example, the HTD t hours out is the mean
$$\mathrm{HTD}(t) = \frac{1}{N} \sum_{n=1}^{N} \mathrm{dist}_n(t),$$
where N is the number of trajectory pairs to compare. This is identical, when computing Euclidean (latitude, longitude) distances, to the Absolute Horizontal Trajectory Deviation (AHTD) used in the literature. The great-circle distance between two points is the shortest distance in spherical geometry; it was calculated using the haversine formula. An average Earth radius of 6371 km is used to convert GC distances from degrees to km. We have also considered the Horizontal Deviation Between Trajectories (HDBT) after t hours as the mean of the accumulated distance, Dist_n(t), between points of the trajectories being compared up to t hours,
$$\mathrm{HDBT}(t) = \frac{1}{N} \sum_{n=1}^{N} \mathrm{Dist}_n(t), \qquad \mathrm{Dist}_n(t) = \int_0^{t} \mathrm{dist}_n(t')\,dt', \qquad 0 \le t \le H,$$
with H the time interval between the starting and ending points of the trajectories. Calculation of Dist_n(t) is performed in practice as a summation that will depend on the number of points used along the trajectories; therefore some care should be taken when comparing this accumulated measure to other studies, as both hourly and 6-h trajectories are commonly found in the literature. As long as all trajectories have the same number of points, HDBT can be computed by summing the HTD values up to hour t.
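The measures defined above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the function names (`haversine_km`, `htd`, `hdbt`) and the data layout (each trajectory as a list of hourly `(lat, lon)` tuples in degrees, with index 0 at the arrival point) are assumptions made here, and the sketch uses the commonly quoted mean Earth radius of 6371 km.

```python
import math

EARTH_RADIUS_KM = 6371.0  # commonly quoted mean Earth radius (assumption of this sketch)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # min() guards against rounding slightly above 1 for antipodal points
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

def htd(pairs, t):
    """Mean GC distance at step t over all (trajectory_a, trajectory_b) pairs."""
    dists = [haversine_km(*a[t], *b[t]) for a, b in pairs]
    return sum(dists) / len(dists)

def hdbt(pairs, t):
    """Mean accumulated distance up to step t, as a sum of HTD values.

    Valid only when all trajectories share the same time steps.
    """
    return sum(htd(pairs, s) for s in range(t + 1))
```

Because the accumulated measure is a plain sum over steps, its value changes with the time resolution of the trajectories, which is the caveat raised in the text about comparing hourly and 6-h data.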
The classification of the FNL and the RP trajectory sets was performed by k-means cluster analysis. Hourly latitude and longitude were used as input variables in the clustering procedures. We have followed the method described by Dorling et al. (1992) to reduce the subjectivity in the selection of the appropriate number of clusters: the algorithm was run for a range of cluster numbers between 30 and 2, and the percentage change in the total RMSD (i.e. the sum of the RMSD of each cluster) when the number of clusters is reduced from k to k−1 was used to determine the appropriate number of clusters. When this percentage change is large (Dorling et al., 1992) or exceeds some predefined value, e.g. 5% (Brankov et al., 1998; Jorba et al., 2004), k is selected as the appropriate number of clusters. We retain the smallest number of clusters k for which the smallest total RMSD change is found when decreasing from k+1 to k; that is, the number of clusters can be reduced by one with only a small worsening in the total RMSD.
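As an illustration of the threshold criterion cited above (the 5% rule of Brankov et al., 1998, and Jorba et al., 2004), the following sketch scans the total RMSD from large to small k and stops at the first k for which removing one more cluster is too costly. The function names, the choice of the RMSD at k as the reference for the percentage, and the numbers in the usage note are hypothetical; the retention rule actually used in this work is not reproduced here.

```python
def percent_changes(total_rmsd):
    """Map k -> percentage increase in total RMSD when going from k to k-1 clusters.

    `total_rmsd` is a dict {k: total RMSD of the k-cluster solution}.
    The percentage is taken relative to the RMSD at k (an assumption).
    """
    return {
        k: 100.0 * (total_rmsd[k - 1] - total_rmsd[k]) / total_rmsd[k]
        for k in total_rmsd
        if k - 1 in total_rmsd
    }

def select_k(total_rmsd, threshold=5.0):
    """Return the first k, scanning downwards, whose reduction to k-1
    raises the total RMSD by more than `threshold` percent."""
    changes = percent_changes(total_rmsd)
    for k in sorted(changes, reverse=True):
        if changes[k] > threshold:
            return k
    # No jump exceeded the threshold: fall back to the smallest k examined
    return min(total_rmsd)
```

With made-up values `{8: 100.0, 7: 101.0, 6: 103.0, 5: 112.0, 4: 130.0}`, the changes at k = 8 and 7 stay below 5%, while going from 6 to 5 clusters costs about 8.7%, so `select_k` returns 6.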
With respect to the way of reducing clusters from k to k−1, and the way of dealing with the dependence of the final cluster solution on the initial centroids, different approaches have been considered. The details of the clustering methodology we have followed and its comparison with the procedures of Dorling et al. (1992) and Mattis (2001) will be published elsewhere. Here we note that the computation of 100 000 clustering analyses for each k made independently from the previous k+1 clustering, with initial cluster centroids taken from randomly chosen real trajectories, provides smaller total RMSDs and hence better clustering solutions than the approaches usually found in the literature. We have considered as best solution for each k the one with the smallest RMSD.
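The restart strategy described above can be sketched as follows. This is a simplified stand-in, not the authors' implementation (whose details are to be published elsewhere): it uses plain Lloyd-type k-means, treats each trajectory as a flat vector of coordinates, and assumes that the RMSD of a cluster is the square root of the mean squared distance of its members to the centroid. All function names are hypothetical, and the default number of restarts is far below the 100 000 runs reported.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(data, centroids, max_iter=100):
    """Lloyd's algorithm from the given initial centroids; returns (labels, centroids)."""
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in data
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(len(centroids)):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centroids

def total_rmsd(data, labels, centroids):
    """Sum over clusters of the RMS deviation of members from their centroid."""
    total = 0.0
    for j, c in enumerate(centroids):
        members = [p for p, lab in zip(data, labels) if lab == j]
        if members:
            total += (sum(sq_dist(p, c) for p in members) / len(members)) ** 0.5
    return total

def best_of_restarts(data, k, n_restarts=100, seed=0):
    """Run k-means from centroids drawn among real data vectors; keep the lowest total RMSD."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_restarts):
        init = [list(p) for p in rng.sample(data, k)]
        labels, cents = kmeans(data, init)
        score = total_rmsd(data, labels, cents)
        if best is None or score < best[0]:
            best = (score, labels, cents)
    return best
```

Each value of k is treated independently here, mirroring the statement that the runs for k clusters do not start from the previous k+1 solution.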

Results and discussion
Figure 1 shows how the differences grow over time along the trajectories. Differences grow linearly at least up to 48 h, showing faster growth around 72 h in all cases. The distribution of the differences is strongly skewed. The horizontal transport deviation (HTD) at 96 h is 20% smaller than that found by Harris et al. (2005) in their comparison between trajectories computed with the ERA-40 and the NCEP/NCAR reanalysis data. Trajectory differences exhibit similar growth behavior using EU and GC distances (Fig. 1b), as most of the trajectories arriving at the study site remain in mid latitudes, though larger differences are found when computing EU distances. The use of the GC distance is more appropriate when trajectories pass over high latitudes, so this distance metric should be preferred. The highest differences are found in both cases for trajectories arriving at 1500 m: on one hand, the higher the altitude, the longer the trajectories can be and the larger the differences can grow; on the other hand, the lower the altitude, the higher the probability of a low pressure gradient situation that could lead to large differences between the computed trajectories.
The mean and median values of the horizontal deviations between trajectories (HDBTs) grow as $t^2$ up to nearly 72 h, showing a higher growth rate at longer times (Fig. 1c). HDBTs are log-normally distributed for the trajectories arriving at 3000, 1500 and 500 m, irrespective of the computed (EU or GC) distance (the case for 3000 m, using the GC distance, is shown in Fig. 1d). This would imply that HDBTs are the result of many small, multiplicative random effects, although dramatic differences between trajectories are found in some cases when the air parcels go through low pressure gradient regions.
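The quadratic growth of the HDBT is what one would expect from a linearly growing HTD, since the former accumulates the latter. As a rough check, assume hourly points and $\mathrm{HTD}(s) \approx a\,s$ with a constant slope $a$ (an idealization: linear growth is reported only up to about 48 h):

```latex
\mathrm{HDBT}(t) \;=\; \sum_{s=1}^{t} \mathrm{HTD}(s)
\;\approx\; a \sum_{s=1}^{t} s
\;=\; \frac{a\,t\,(t+1)}{2}
\;\approx\; \frac{a}{2}\,t^{2} \quad (t \gg 1).
```

Any growth of the HTD faster than linear at long times then appears as growth of the HDBT faster than $t^2$.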
Trajectories arriving at 3000, 1500 and 500 m computed with the FNL data are found to be clustered in 6, 5 and 6 groups, respectively, while trajectories computed with RP data are clustered in 7, 6 and 6 groups, respectively (Figs. 2 and 3).
Most of the 3000 m trajectories correspond to westerly flows, identified as northwesterlies (NW) of different speeds, and southwesterly (SW) and zonal (W) flows. At lower altitudes there is an elevated occurrence of slow flows due to low pressure gradient situations that last several days: 54% (59%) of the days for trajectories arriving at 1500 m, and 72% (65%) at 500 m, for the FNL (RP) data. The slow flows correspond to short trajectories which show a pathway variability within the cluster that is greater than the centroid length, induced by weak synoptic forcing. Such flows include regional Mediterranean recirculations (MedR), slow westerlies (WR), and SW (arriving at 1500 m computed with RP data) and N-Eu (arriving at 500 m) flows. Stagnant situations, as well as situations where the sea-breeze regime and the Iberian thermal low can develop, thus inducing mesoscale recirculations in the Spanish Mediterranean basin (Millan et al., 1997), are associated with these slow flows. A short description of the air masses identified with FNL data trajectories can be found in Cabello et al. (2008). It is noteworthy that the origin of the trajectories at the different heights is strongly decoupled. Air flows arriving in SE Spain show a clear seasonal pattern (Fig. 3): northwesterlies are more frequent during the winter, while SW (3000 m) and slow flows and recirculations are common in summertime.
Clustering results are sensitive both to the meteorological input data set and to the initialization stage of the cluster algorithm. One more cluster is identified with RP trajectories than with the FNL ones for 3000 and 1500 m trajectories. For 3000 m, this additional cluster, slow southwesterlies (SWslow), is classified within SW and MedR types when employing the FNL data; for 1500 m, SW flows found with RP data are mainly within WR and W flows when FNL is used. Looking at the number of days classified into the same type of trajectory when using distinct methods (see Table 1), clustering results are more sensitive to the input meteorological data than to the initial selection of centroids. On the other hand, sensitivity to the initial centroids is greater the lower the trajectory arrival height, while sensitivity to the input data does not depend significantly on it.
If we retained the same number of clusters for the 3000 and 1500 m trajectories computed with the two input data sets, even though that would add some subjectivity to the analysis, trajectories would overall be classified into the same type of air flow in a greater (but still moderate) proportion (Table 1). However, identifying a different number of clusters due to differences in the meteorological data could be of some relevance for later studies of the dependence of pollutant concentrations on the identified air flows.
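The agreement figures discussed above amount to counting the days on which two classifications assign the same flow type. A minimal sketch follows; the function name, the use of `None` for days without a trajectory, and the choice to exclude such days from the denominator are assumptions of this example, not necessarily the convention behind Table 1.

```python
def agreement_percent(flows_a, flows_b):
    """Percentage of days given the same flow type by two classifications.

    `flows_a` and `flows_b` are day-aligned lists of flow labels (e.g. "NW",
    "MedR"); a day with None in either list is skipped (assumption).
    """
    common = [(a, b) for a, b in zip(flows_a, flows_b)
              if a is not None and b is not None]
    if not common:
        return 0.0
    same = sum(1 for a, b in common if a == b)
    return 100.0 * same / len(common)
```

For instance, with the hypothetical sequences `["NW", "W", "MedR", None, "SW"]` and `["NW", "SW", "MedR", "W", "SW"]`, four days are comparable and three agree, giving 75%.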

Conclusions
We have computed 96-h backtrajectories arriving in SE Spain at 3000, 1500 and 500 m with the HYSPLIT single-particle Lagrangian model for a 7-year period using two widely employed meteorological input data sets. Differences in trajectories caused by using different meteorological data are significant. Such differences grow linearly at least up to 48 h, showing faster growth after 72 h in all cases.
Agreement among trajectories obtained from different input data or from different numerical models would give more confidence to the trajectory pathway. Similarly, agreement among trajectory sets with common characteristics would lend confidence to the trajectory analysis and its applications. Therefore, in addition to computing trajectory differences, their influence on subsequent analysis should be assessed.
The main flows identified by means of backtrajectory cluster analysis do not differ substantially with respect to the meteorological data, even though the number of trajectory groups is different. However, differences caused by the input meteorological data are higher than those obtained when comparing different trajectory cluster procedures. Trajectory membership to the identified flows is in general more sensitive to the input meteorological data than to the initial selection of centroids.

Figure 1. (a) Evolution of the mean, median and 25th and 75th percentiles of the set of dist_n values computed with GC distances for trajectories arriving at 3000 m. (b) Evolution of HTD at the 3 altitudes with the 2 distance measures. (c) Evolution of the mean, median and 25th and 75th percentiles of the Dist_n values computed with GC distances for trajectories arriving at 3000 m. (d) Density distribution of Dist_n at 96 h for the trajectories arriving at 3000 m using the GC distance.

Figure 2. Final centroids (cluster means) for the trajectories arriving at 3000 m (left), 1500 m (center) and 500 m (right) for the 7-year study period computed with the FNL and the NCEP/NCAR reanalysis (RP) data.

Table 1. Percentage of daily trajectories classified into each air flow type for each trajectory set (columns FNL and RP), where Miss stands for days when trajectories were not available for the cluster analysis. Percentage of trajectories that fall in the same type of air flow both in FNL and RP trajectories (column FNL=RP). Percentage of days classified in the same type of air flow for the FNL trajectories when considering different clustering procedures (columns FNL=FNL_Dorling, FNL=FNL_Mattis). Percentage of trajectories classified into the same type of air flow considering the results for the same number of clusters both for FNL and RP trajectories (column FNL=RP*).