Assessment and Application of Clustering Techniques to 1 Atmospheric Particle Number Size Distribution for the 2 Purpose of Source Apportionment 3

Long-term measurements of particle number size distribution (PNSD) produce a very large 15 number of observations and their analysis requires an efficient approach in order to produce 16 results in the least possible time and with maximum accuracy. Clustering techniques are a 17 family of sophisticated methods which have been recently employed to analyse PNSD data, 18 however, very little information is available comparing the performance of different 19 clustering techniques on PNSD data. This study aims to apply several clustering techniques 20 (i.e. K-means, PAM, CLARA and SOM) to PNSD data, in order to identify and apply the 21 optimum technique to PNSD data measured at 25 sites across Brisbane, Australia. A new 22 method, based on the Generalised Additive Model (GAM) with a basis of penalised B-splines, 23 was proposed to parameterise the PNSD data and the temporal weight of each cluster was also 24 estimated using the GAM. In addition, each cluster was associated with its possible source 25 based on the results of this parameterisation, together with the characteristics of each cluster. 26 The performances of four clustering techniques were compared using the Dunn index and 27 Silhouette width validation values and the K-means technique was found to have the highest with five being the optimum. Therefore, five clusters were found within the data using the K-means technique. The diurnal occurrence of each was used together other air quality parameters, temporal trends and the physical properties of each 3 cluster, each cluster to its source and origin. The five clusters were attributed to three major sources and origins, including regional background particles, photochemically induced nucleated particles and vehicle generated particles. Overall, clustering was found to be an effective technique for attributing each particle size spectra to its source and the GAM was suitable to parameterise the PNSD data. These two techniques can help researchers immensely in analysing PNSD data for characterisation and source apportionment purposes.

method, based on the Generalised Additive Model (GAM) with a basis of penalised B-splines, 23 was proposed to parameterise the PNSD data and the temporal weight of each cluster was also 24 estimated using the GAM. In addition, each cluster was associated with its possible source 25 based on the results of this parameterisation, together with the characteristics of each cluster. 26 The performances of four clustering techniques were compared using the Dunn index and 27 Silhouette width validation values and the K-means technique was found to have the highest 28

Introduction 12
Atmospheric aerosols affect climate, air quality and subsequently human health (Stevens and 13 Feingold, 2009;Pope II and Dockery, 2006;Lohmann and Feichter, 2005). Despite their small 14 contribution to particle volume and mass, ultrafine particles (particles with diameter <100nm) 15 make a significant contribution to particle number concentration (PNC) (Morawska et al.,16 1998; Harrison and Yin, 2000) and toxicological studies show evidence of their adverse 17 effects on human health (WHO, 2005). Therefore, measurements of the chemical and physical 18 properties of aerosol particles are crucial in order to understand their effects on climate and 19 human health. One of the most important properties of particles is their size distribution, 20 which helps in understanding aerosol dynamics, as well as determining their sources (Charron 21 et al., 2008;Harrison et al., 2011). Long-term particle number size distribution (PNSD) 22 measurements have been conducted in a number of different environments and the measured 23 size range can extend from less than 10nm up to more than 10μm. In addition, these long-term 24 measurements generally result in a large number of observations and analysing such a 25 massive data set often requires sophisticated techniques. The clustering technique has recently 26 been used to divide particle size data into groups with similar characteristics and then relate 27 each group to its sources and/or to investigate aerosol particle formation and evolution 28 (Beddows et  Several clustering algorithms currently exist, which makes the selection of an appropriate 31 clustering technique a daunting task. Determining the most appropriate number of clusters can 32 be an additional challenge for researchers. Clusters should ideally be compact, well-separated 1 and scientifically relevant. Beddows et al (Beddows et al., 2009) assessed the performance of 2 four clustering techniques (Fuzzy, K-means, K-median and model based clustering) on PNSD 3 data using different validation indices, particularly the Dunn index, and found the K-means 4 technique capable of finding clusters with smallest size, furthest separation and highest degree 5 of inner cluster similarity compared to others. Throughout their work, four techniques were 6 evaluated while several other methods (e.g. Partitioning around Medoids (PAM), Clustering 7 of Large Applications (CLARA), and Self Organizing Map (SOM), and Affinity Propagation 8 (AP)) are available which their performance on PNSD data has not been assessed so far. 9 Therefore, these techniques were selected to be compared with K-means using two validation 10 measures. 11 K-means is an iterative algorithm minimising the within cluster sum of squares to find a given 12 number of clusters (Hartigan and Wong, 1979). PAM is an iterative algorithm similar to K-13 means which constructs clusters around a set of representative objects by assigning each data 14 to the nearest representative object using sum of pair wise dissimilarities (Kaufman and 15 Rousseeuw, 2009). CLARA performs PAM on a number of subgroups of data, allowing faster 16 performance for a large number of observations (Kaufman and Rousseeuw, 2009). SOM is a 17 neural networks based method, with the ability to map high dimensional data to two 18 dimensions and has been widely used in data mining researches (Kohonen, 2001). The AP 19 algorithm is a relatively new clustering technique which has been employed in different fields 20 since its introduction in 2007. AP considers all data as potential exemplars and finds the best 21 set of exemplars and corresponding clusters by exchanging messages between the data points 22 (Frey and Dueck, 2007). 23 In a recent study, three years of PNSD data were clustered using the K-means technique to 24 produce seven clusters. Those clusters were found to form three main groups, anthropogenic 25 (69%), maritime (29%) and nucleation (2%), which characterised the whole data set (Wegner 26 et al., 2012). In another study, Dall'Osto et al. found nine clusters within the PNSD data 27 collected over a one year period in an urban area and found four typical PNSD groups using 28 diurnal variation, directional and pollution association (Dall'Osto et al., 2012). The authors 29 called those groups traffic, dilution, summer background and regional pollution, which 30 included 69%, 15%, 4% and 12% of the total data respectively. In the above and all other 31 previous studies, PNSD data were averaged to decrease the number of data and consequently, 32 reduce computational cost and complexity. However, averaging can encumber the transient 1 characterisation of PNSD data and the larger the averaging interval, the more transient 2 characteristics will be lost. 3 Parameterisation of PNSD data in terms of a mixture of few log-Normal components is 4 common and beneficial, particularly for data averaged over a longer interval. Multi-log-5 Normal function with predefined number of peaks where the means of each log-Normal 6 distribution are constrained to vary around some initial estimates in the nucleation, Aitken, UPTECH project can be found in (Salimi et al., 2013) and the study design is available online 28 (UPTECH). 29 2.2 Instrumentation, quality assurance, and data processing 1 PNSD within the size range 9 -414 nm was measured every 5 minutes using a TSI Scanning 2 Mobility Particle Sizer (SMPS). The SMPS system included a TSI 3071 Differential Mobility 3 Analyser (DMA) connected to a TSI 3782 water-based Condensation Particle Counter (CPC). 4 A combination of a diaphragm pump and a critical orifice was used to supply a sheathe flow 5 of 6.4 lpm. A zero particle filter and silica gel dryer were used to supply a dry, particle free air 6 stream. PNC was measured using a TSI 3781 water-based CPC, particle mass concentration 7 (PM2.5, and PM10) measurements were conducted using a TSI DustTrak, solar radiation and 8 other meteorological parameters were measured by a Monitor Sensors weather station, and 9 EcoTech gas analysers measured gaseous emission (i.e. CO, NOx and SO2) concentrations. 10 These data were averaged according to five minute intervals prior to data analysis. 11 The sheathe and aerosol flow rate of the SMPS system was checked three times a week using 12 a bubble flow meter. A zero check of the system was done at the start of the measurements at 13 each school using a high efficiency particle (HEPA) filter connected to the inlet of the system. 14 Size accuracy of the SMPS was calibrated using monodisperse polystyrene latex (PSL) 15 particles with a nominal diameter of 100 nm. Size accuracy calibrations were conducted five 16 times throughout the whole measurement campaign and all instruments passed the test with a 17 maximum error of 3.5% from the nominal diameter, as recommended by Wiedensohler et al 18 (Wiedensohler et al., 2012). Particle losses inside the tube were corrected using the formula 19 derived for the laminar flow regime (Hinds, 1999). Equivalent tube length was used to correct 20 Information regarding the quality assurance and data processing procedures for the CPC can 26 be found in (Salimi et al., 2013). 27

Clustering particle number size distribution data 28
Principal Component Analysis (PCA) was performed on the whole data set, in order to reduce 29 its high dimensionality and remove the correlation between features, so as to increase the 30 performance of the clustering. Ideally, clusters should minimise intra-cluster variation 31 (compactness) and maximise the distance between clusters (separation) (Brock et al., 1 2011;Handl et al., 2005), resulting in small, homogenous clusters which are clearly separated 2 from each other. Validation measures reflecting the compactness and separation of the 3 clusters were used to find the optimal method and number of clusters using the "clValid" 4 package in R (Brock et al., 2011;R Development Core Team, 2010). As the number of 5 clusters increases, improving compactness, the separation decreases due to multiple clusters 6 being created which could be described by a single cluster (consider the extreme case of every 7 observation belonging to its own cluster). Combining compactness and separation into a 8 single measure is an effective way to address this issue. The Dunn index (Dunn, 1974) and 9 Silhouette width (Rousseeuw, 1987) are scores resulting from nonlinear combination of 10 compactness and separation. Therefore, those scores were chosen in order to compare the 11 performance of different clustering techniques and to find the optimum number of clusters. 12 The maximum vector length allowed by R is "231-1", therefore, the Dunn index can only be 13 The AP clustering technique was initially selected as a candidate, in addition to the 19 aforementioned techniques. The AP technique was implemented using "APcluster" R package 20 (Bodenhofer et al., 2011) and was computationally expensive but ultimately unsuccessful at 21 clustering our huge dataset effectively. Based on our experience with this technique and the 22 recommendations of its developers, AP is not recommended for finding a small number of 23 clusters in a large dataset, but it is more appropriate for finding a large number of relatively 24 small clusters (FAQ). 25

Non-parametric estimation of particle number size distribution temporal 26 trends 27
In this paper, a new approach was developed to parameterise the PNSD data by finding the 28 local peaks and the normalised concentration at each peak. In order to find the real local 29 peaks, the noisy trend of PNSD data should be firstly smoothed. To achieve this, we used the 30 Generalised Additive Model (GAM), with a basis of penalised B-splines (Wood, 2003; Eilers 31 and Marx, 1996), which allows for the flexible estimation of non-linear effects without 1 assuming, a priori, the functional form of the non-linearity (Wood, 2003). In contrast with the 2 multi lognormal fitting method, this approach keeps all the local peaks while smoothing the 3 noisy data. The fitted function was then used to find the local peaks, which were defined to be 4 the local maxima in a neighbourhood of five PNSD bins on each side (Figure 1). The bi-5 variate kernel density estimate, using the Gaussian kernel, was computed to visualise the 6 distribution of the peaks of the PNSD data for each cluster (Wand, 1994). 7 Understanding the temporal trend of clusters provides further information about the nature 8 and source of each cluster. GAMs allows for a flexible approach to investigating these 9 temporal trends. Therefore, the GAM, with a basis of B-spline, was employed to calculate the 10 temporal trend of each cluster. The resulting fitted smooth functions, and their 95% 11 confidence intervals, indicate the temporal variation of each cluster. Silhouette width values for all of the number of clusters, for both sets of randomly selected 25 data, while the rest of techniques showed the same level of performance ( Figure 2). This 26 indicates that the K-means technique was able to produce more compact clusters that were 27 well separated from each other. The other techniques resulted in almost the same Dunn index 28 value, however SOM resulted in better Silhouette width values, followed by PAM and then 29 CLARA. The performance of the K-means technique was significantly higher than the rest of 1 the techniques and was therefore selected as the preferred technique. 2 Cluster schemes with 2 -8 and 10 clusters had the highest Silhouette width values in the first 3 and second set of randomly selected data, respectively. Five clusters had the highest Dunn 4 index value for the first set of data and it resulted in a high value in the second set as well. 5 However, 9 -10 clusters resulted in a higher Dunn index value in the second set of data. As Therefore, five was selected as the optimum number of clusters, as it had a high validation 10 value, as well as scientifically relevant number of clusters to relate each cluster to a source 11 and to find the processes related to each cluster. 12

Clustering PNSD data 13
The K-means clustering technique was applied to all PNSD data in order to find five clusters. 14 Figure 3 shows the normalised number spectra and 95% confidence interval associated with 15 each cluster, together with the diurnal variation of their occurrence and their association with 16 solar radiation intensity, particle mass concentration, PNC and gaseous pollutants 17

concentrations. 18
Local peaks of each single PNSD spectra were found using the technique described in section 19

Plotting the normalized concentration of the peaks versus their diameter is useful in order 20
to determine the frequency of peaks and their diameter for each cluster. However, with too 21 many data, points are over-plotted which makes it impossible to distinguish the underlying 22 trends and relationships. Therefore, Bi-kernel density estimation, which is a very effective 23 technique to address this issue, was used to visualise the distribution of peaks in PNSD for 24 each cluster (Figure 4). Characteristics of each cluster and the associated sources are 25 explained in the following sections, based on Figures 1 and 2

. 26
Cluster1: This cluster included 4.5% of the total measured particle size spectra with a mode at 27 the SMPS lowest size detection limit (9 nm). The diurnal pattern of occurrence showed 28 nocturnal minima with a peak at midday. Cluster1 was associated with the highest solar 29 intensity and lowest PM2.5 among all clusters, and it was also associated with high PNC and 30 low CO and NOx. Local peaks predominantly occurred at diameters less than 30nm, with the 31 highest density and normalised concentration at 9nm. This clearly shows the dominance of 1 newly formed particles which have grown in size and reached the instrument's size detection 2 limit. The strength of the local peaks was the highest among all clusters, indicating the 3 dominance of the smallest particles (Table 1). 4 The abovementioned observations indicate that this cluster was attributed to photochemically-5 induced nucleated particles in the ambient air. New particles mainly formed during the middle 6 of the day and grew to reach the instrument's size detection limit. The same type of particles 7 were observed in Barcelona for 4% of the observations (Dall'Osto et al., 2012). These types of 8 particles were not observed in London, which could be due to long time averaging of the data 9 and/or different climatological conditions of the monitored environment in that study 10 (Beddows et al., 2009). 11 Cluster 2: This cluster included 14.1% of the total PNSD data and showed a mode for 12 particles with a diameter less than 20 nm. The diurnal pattern of occurrence for Cluster 2 13 peaked strongly at two hours after midday (14:00), with a nocturnal minimum, which was in 14 agreement with its association with high solar radiation intensity. Its diurnal pattern of 15 occurrence and association with a low concentration of traffic generated primary pollutants 16 (NOx and CO) indicated that it had non-traffic related sources. This cluster was also 17 associated with high PNC. Local peaks were mainly present at diameters less than 20 nm, 18 with the highest density occurring at a normalized concentration around 0.03. Local peaks 19 present at diameters larger than 30 nm were lower and corresponded to normalized 20 concentrations of less than 0.01. This cluster was attributed mainly to aged, photochemically-21 induced nucleated particles in the ambient air. Minor peaks at diameters larger than 30 nm 22 revealed the contribution of vehicle generated particles to this cluster (Morawska et al., 2008). 23 However, their contribution to the total PNC was minimal (Table 1). 24 Cluster 1 and 2 were both attributed to the same source of particles, but they were 25 distinguished as relatively fresh and aged, respectively. Nucleated particles initially grew and 26 reached the instrument detection size limit in Cluster 1, before growing further and gradually 27 shifting to Cluster 2 after about two hours. Nucleated particles grew in size by condensation 28 and coagulation, and as they aged their number concentration decreased and their size 29 increased. This illustrates why Cluster 1 had higher PNC but lower PM2.5 compared to 30 Cluster 2. 31 The fraction of PNSD data related to nucleated particles in this study was higher than in 1 another study in the same environment, which used the classic approach to find the banana 2 shaped new particle formation events (Cheung et al., 2010). This shows that the actual 3 occurrence of new particle formation events is much higher than what is generally found by 4 classic approaches, because in most particle formation events, the newly formed particles 5 would be scavenged by the pre-existing particles as a result of coagulation, before they grew 6 by means of condensation. However, this does not mean that the nucleation events would not 7 occur and that the particles resulting from them would not be present in the air. 8 Cluster 3: This cluster included 31.6% of the total measured data, with a mode at 20 nm and a 9 minor peak at the smallest instrument size limit (≈ 9nm), showing a strong association with 10 morning and afternoon rush hour traffic (Salimi et al., 2013). This cluster was associated with 11 low solar radiation and high vehicle generated primary pollutants, including CO and NOx, 12 and the particle number size modes were in the range of vehicle generated primary particles 13 and nucleated particles during the exhaust emissions dilution (Casati et al., 2007;Ntziachristos 14 et al., 2007;Janhäll et al., 2004). Local peaks were mainly present at diameters less than 30 15 nm, particularly at the lowest detection size limit of the instrument. The normalized 16 concentration corresponding to peaks were highest at the smallest particle size. These 17 observations suggest that this cluster was attributed to vehicle generated particles, including 18 primary particles (having a diameter of around 40 nm) and particularly secondary particles 19 formed in the vehicle exhaust (having a diameter of around 9nm) ( Table 1) pattern of occurrence showed nocturnal maxima, with a minimum during the midday and 23 early afternoon hours, and it was associated with the highest PM2.5 among all clusters as well 24 as with high NOx and CO. Local peaks were present at diameters smaller than 20 nm and 25 between 50-70 nm. The normalised concentration corresponding to either of these diameter 26 ranges had a similar magnitude. These findings suggest that cluster 4 was attributed mainly to 27 regional background aerosols. However, a source of small particles and gaseous emissions is 28 also present in this cluster (Table 1). 29 Cluster 5: This cluster included 27.2% of the total PNSD data, with a mode at 40 nm. The 30 diurnal pattern of occurrence followed almost the same trend as Cluster 4, with a minimum 31 during the early afternoon and a peak during the night-time. The local peaks' density was 32 almost evenly distributed through the whole range of diameters, with smaller particles having 1 more peaks particularly for diameters less than 20 nm. However, normalized concentrations 2 corresponding to each local peak were highest at diameters between 30-60 nm, showing the 3 dominance of particles at this size range. The observations mentioned above suggest that, like 4 Cluster 4, this cluster was attributed to regional background aerosols (Table 1). 5

Temporal trend of clusters 6
A non-parametric regression model was fitted to the daily occurrence fraction of each cluster, 7 in order to determine the temporal trend of each cluster. The regression model quantified the 8 presence of each cluster as the sum of a smooth function for each month and a trend which 9 included the day, month and year. It should be noted that no data were collected during the 10 first two months of the year. 11 The smooth monthly function for Cluster 1 did not show any significant variation before 12 November, at which point it increased moderately and peaked during December. Solar 13 radiation intensity is also known to increase during the last two months of the year in the 14 southern hemisphere, which indicates that this may be the driving force of more atmospheric 15 nucleation events. Cluster 2 showed similar trends of increase in December in conjunction 16 with the expected increase in solar radiation intensity. The prevalence of Cluster 4 increased 17 and peaked around July-August and then decreased again. Cluster 4 showed the strongest 18 monthly variation among all clusters. As mentioned earlier, this cluster was associated with 19 regional background aerosols and its monthly variation showed a positive correlation with 20 biomass burning within the studied region, which mostly took place during July to September. 21 Biomass burning associated PNSDs peak at 100-200 nm (Friend et al., 2011), however the 22 PNSD of Cluster 4 peaked at around 60 nm, which was 20 nm higher than the Cluster 5,23 which was also associated with background aerosols. This implies that Cluster 4 included the 24 mixture of background and biomass burning aerosols, which made the average PNSD peak 25 shift by 20 nm. The prevalence of Cluster 3 decreased, with a trough during August, before 26 increasing again, which is likely to be the result of the biomass burning which occurred 27 during August and consequently decreased the occurrence of particles belonging to Cluster 3. 28 For Cluster 5, a peak was observed during June, which decreased to show a trough during 29 August, before peaking again during November. This trend is the opposite to what was 30 observed for Cluster 4 and given that the peaks for Cluster 5 were observed when there were 31 less biomass burning events, it was most likely associated with regional background aerosols, 32 without the influence of biomass burning events. The PNSDs which were attributed to 1 regional background aerosols in this study had different modes compared to the one observed In summary, the K-means clustering technique was found to be the preferred technique when 5 compared to SOM, PAM and CLARA. The K-means clustering technique categorised the 6 PNSD data into five clusters and each cluster was attributed to its source and origin. Five 7 clusters were attributed to three major sources and origins, as follows: 1) Regional 8 background particles: Clusters 3 and 5, which included 49.8% of the total data, were 9 attributed to regional background aerosols with clear modes at 60nm and 40nm, respectively; 10 2) Photochemically induced nucleated particles: Clusters 1 and 2, which included 18.6% of 11 the total data, were attributed to photochemically-induced nucleated particles; and 3) Vehicle 12 generated particles: Cluster 3, which included 31.6% of the data, was attributed to vehicle 13 generated particles. A new method was proposed for the parameterisation of particle size 14 spectra, based on the GAM, which was found to be an effective tool and is recommended to 15 be used for particle size data. K-means clustering successfully attributed each particle size 16 spectra to its source and/or origin. However, while this technique could attribute each particle 17 size spectra to its major contributing source, several sources with different levels of 18 contribution are often responsible for the pattern of each particle size spectra. The structure of 19 the contribution of sources can be further investigated using other techniques, such as 20  Figure 3: Clustering results using SMPS data. Graphs on the left show the particle size 1 distributions with 95% confidence intervals, the associated graphs on the middle show the 2 diurnal cycle of the hourly percentage of occurrence for each cluster, and the graphs on right 3 show the associated solar radiation, PM2.5,, PM10, NOx, CO, SO2, total PNC. 4 5 Figure 4: Density of peaks in particle number size data at each cluster. 6 1 Figure 5: Temporal trend of occurrence of each cluster, with their 95% confidence intervals.