A New Methodology to Assess the Performance and Uncertainty of Source Apportionment Models II: the Results of Two European Intercomparison Exercises

The performance and the uncertainty of receptor models (RMs) were assessed in intercomparison exercises employing real-world and synthetic input datasets. To that end, the results obtained by different practitioners using ten different RMs were compared with a reference. In order to explain the differences in the performances and uncertainties of the different approaches, the apportioned mass, the number of sources, the chemical profiles, the contribution-to-species and the time trends of the sources were all evaluated using the methodology described in Belis et al. (2015). In this study, 87% of the 344 source contribution estimates (SCEs) reported by participants in 47 different source apportionment model results met the 50% standard uncertainty quality objective established for the performance test. In addition, 68% of the SCE uncertainties reported in the results were coherent with the analytical uncertainties in the input data. The most used models, EPA-PMF v.3, PMF2 and EPA-CMB 8.2, presented quite satisfactory performances in the estimation of SCEs while unconstrained models, that do not account for the uncertainty in the input data (e.g. APCS and FA-MLRA), showed below average performance. Sources with well-defined chemical profiles and seasonal time trends, that make appreciable contributions (>10%), were those better quantified by the models while those with contributions to the PM mass close to 1% represented a challenge. The results of the assessment indicate that RMs are capable of estimating the contribution of the major pollution source categories over a given time window with a level of accuracy that is in line with the needs of air quality management.


ACCEPTED MANUSCRIPT
Keywords: source apportionment, receptor models, intercomparison exercise, model performance indicators, model uncertainty, particulate matter

Highlights: Intercomparisons were used to test the performance and uncertainty of receptor models. More than 85% of the reported sources met the model quality objectives. Two thirds of the output uncertainties were coherent with those in the input data. PMF v2, 3 and CMB 8.2 estimated the source contributions satisfactorily. The accuracy of receptor models is in line with the needs of air quality management.

Source Apportionment (SA) is the practice of deriving information about the pollution sources and the amount they contribute to measured concentrations. Receptor models (RMs) apportion the measured mass of pollutants to its emission sources by using multivariate analysis to solve a mass balance equation (Friedlander, 1973; Schauer et al., 1996; Thurston and Spengler, 1985). RMs derive information from measurements including estimations of their uncertainty and have been extensively used in Europe to estimate the contribution of emission sources to atmospheric pollution at a given site or

The methodology adopted in this research to assess the model results evaluates all the aspects of a source apportionment study, including the variability due to the influence of different practitioners using the same model on the same data (Belis et al., 2015). The procedure includes: complementary, preliminary and performance tests.

The "complementary tests" aim at providing ancillary information about the performance of the solutions in terms of apportioned mass and number of source categories. The "preliminary tests" are targeted at establishing whether the entities identified in the results, either a factor or a source (hereon, factor/source), are attributable to a given source category.
In addition to the correlation coefficient (hereafter, Pearson), the standardized identity distance (SID), which prevents the distortions caused by source profiles with dominant species, is used (more details in Belis et al., 2015). The "ff tests" are the comparison among factor/sources attributed by participants to the same source category in all the solutions while "fr tests" refer to the comparison between reported factor/sources and a reference value. The objective of the "performance tests" is to evaluate whether the source contribution estimates (SCEs) are coherent with a 50% standard uncertainty target value using the z-score performance indicator complemented by the z'-score and zeta-score indicators (Thomson et al., 2006; ISO 13528, 2005). In this study, SCE denotes the mass attributed to a source or factor in the results obtained with either CMB or MFA approaches. The methodology is fully described in the companion paper by Belis et al. (2015) and was implemented using the open source software R (and R-studio). Source categories with fewer than five factors/sources were not evaluated and profiles attributed by participants to more than one category were tested in each of the proposed categories.

Considering that source apportionment studies are mostly targeted at identifying and quantifying the typical sources in the studied area, the performance tests were conducted on the average SCE over the whole time window represented in every dataset. Moreover, the SCE time series were evaluated using the root mean square error normalised by the standard deviation/uncertainty of the reference value (RMSE_u), as discussed in Belis et al. (2015).

The intercomparison exercises were structured in two rounds involving 16 and 21 organizations respectively. In the first round, 22 results were reported and 25 were provided in the second one.
A real-world PM2.5 dataset collected in Saint Louis (USA) was used in Round 1 (Table 1). The dataset used for the intercomparison was developed by merging two datasets: one of inorganic species collected every day (Lee et al., 2006) and one of organic species collected every sixth day over the same time window (Jaeckels et al., 2007). In the final dataset, the structure of the uncertainties of the different species was heterogeneous with differences between species deriving from the data treatment in the original datasets and variability within single species due to the different analytical batches that were necessary to cover the whole monitoring campaign. In addition, the uncertainty of organic tracers was complex to quantify due to the possible influence of atmospheric chemistry and radiation on the degradation of these compounds (Galarneau et al., 2008;

The site and time window in which the real-world dataset was collected were not revealed to the intercomparison participants. The dataset containing the concentrations of 44 species in 180 samples with their analytical uncertainties was distributed to participants together with the analytical parameters (uncertainty of the method and minimum detection limits) and the emission inventory of the study area.

In Round 1, the following preliminary tests were performed: Pearson and SID between factor/source profiles, Pearson between log-transformed factor/source profiles, and Pearson between factor/source time trends. Only ff tests were performed in this round because of the absence of independent unbiased reference values.

In the performance tests of Round 1, the SCE reference value for each source category was the average of the results reported by the participants. The reference values were obtained by calculating the robust average (Analytical Methods Committee, 1989) using only the SCEs of source/factors that passed the preliminary tests (Table 2).
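
The robust average used for the Round 1 reference values can be illustrated with a minimal sketch. The text only cites Analytical Methods Committee (1989); the version below assumes a Huber-type iteration in the style of ISO 13528 "Algorithm A" (clipping at 1.5 robust standard deviations), and the function name and the SCE values are hypothetical.

```python
import statistics


def robust_average(values, tol=1e-6, max_iter=100):
    """Huber-type robust mean and standard deviation (assumed Algorithm A style)."""
    x = list(values)
    center = statistics.median(x)
    # Initial robust scale: scaled median absolute deviation from the median.
    scale = 1.483 * statistics.median(abs(v - center) for v in x)
    for _ in range(max_iter):
        if scale == 0:
            break  # all central values identical, nothing left to refine
        delta = 1.5 * scale
        # Extreme values are pulled in to the limits rather than discarded.
        clipped = [min(max(v, center - delta), center + delta) for v in x]
        new_center = statistics.fmean(clipped)
        new_scale = 1.134 * statistics.stdev(clipped)
        converged = abs(new_center - center) < tol and abs(new_scale - scale) < tol
        center, scale = new_center, new_scale
        if converged:
            break
    return center, scale


# Hypothetical SCEs (ug/m3) reported for one source category, with one outlier.
sces = [2.1, 2.4, 1.9, 2.2, 2.0, 9.5]
ref_value, ref_sd = robust_average(sces)
print(f"robust average = {ref_value:.2f}, arithmetic mean = {statistics.fmean(sces):.2f}")
```

With these illustrative values, the single outlying result pulls the arithmetic mean well above the bulk of the data, while the robust average stays close to it, which is the property that makes it suitable as a consensus reference.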
In the second round, a synthetic dataset with known reference values that were unbiased and independent from the results reported by participants was used (Supplementary Material S1). The chemical species included in the synthetic dataset (Round 2) are reported in Table 1 and the procedure followed to generate it is given in Belis et al. (2015). Since the site was not disclosed to participants, the emission inventory of the study area and a set of 23 local source profiles (more than one for every source category) were distributed to them in order to: a) provide the necessary information to create the input files for CMB models, and b) support the interpretation of the models' output.

In addition to the preliminary tests performed in the previous round, the Pearson between the factor/source contribution-to-species of the Round 2 results was also computed. All of the preliminary tests were performed by comparing factor/sources reported by participants with the reference source for the considered source category (fr tests).

The model abbreviations used in this document are: CMB8.2, Chemical Mass Balance v.

The code "PMF2" denotes the program PMF2 described by Paatero (1997). The codes "EPAPMF3, EPAPMF4, and EPAPMF5" denote the respective releases of the U.S. EPA program "EPA PMF".
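
The performance test described in the methodology can be sketched as follows. This is an illustration rather than the authors' R implementation: it assumes the standard ISO 13528 forms of the z-score and z'-score, takes the standard deviation for proficiency assessment to be 50% of the reference value (one reading of the "50% standard uncertainty target"), and uses the conventional limits of 2 and 3 for the warning and action areas. All function names and numbers are hypothetical.

```python
import math


def z_score(sce, reference, target_rel_unc=0.5):
    """z = (x - X) / sigma_p, with sigma_p assumed to be 50% of the reference."""
    sigma_p = target_rel_unc * reference
    return (sce - reference) / sigma_p


def z_prime_score(sce, reference, u_reference, target_rel_unc=0.5):
    """z' additionally accounts for the standard uncertainty of the reference."""
    sigma_p = target_rel_unc * reference
    return (sce - reference) / math.sqrt(sigma_p ** 2 + u_reference ** 2)


def classify(score):
    """Conventional limits: |score| <= 2 acceptance, 2 to 3 warning, >= 3 action."""
    magnitude = abs(score)
    if magnitude <= 2:
        return "acceptance"
    if magnitude < 3:
        return "warning"
    return "action"


# Hypothetical average SCEs (ug/m3) for one source category, reference = 2.0.
reference, u_reference = 2.0, 0.4
for sce in (3.0, 4.5, 6.5):
    z = z_score(sce, reference)
    zp = z_prime_score(sce, reference, u_reference)
    print(f"SCE={sce}: z={z:.2f} ({classify(z)}), z'={zp:.2f} ({classify(zp)})")
```

Because the reference uncertainty enters the denominator, |z'| is never larger than |z| for the same result, so the two indicators diverge only when the reference uncertainty is appreciable relative to the target.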

Mass apportionment
The sample-wise comparison between the sum of the SCEs in every solution and the gravimetric mass is summarised using normalised target diagrams (Fig. 1). More than 70% of the solutions in Round 1 rank in the area of acceptance (outer circle). Most scores rank in the lower quadrants indicating a tendency to underestimate the observed mass (the distance to the horizontal axis is proportional to the PM2.5 mass that was not apportioned). On the contrary, the evident overestimation of the mass observed in two solutions is likely due to problems in the conversion of normalised data to concentration values rather than to errors in the apportionment of the mass. In Round 2, the majority of

In Round 1, nine factor/sources per solution are reported on the average (Table 3). One half of the solutions identifies between six and ten factor/sources while six solutions report more than 10. An approximation of the expected number of factor/sources for this round is derived from the original solution of the inorganic dataset obtained using PMF (Lee et al., 2006), which identified 10 different source categories. In this round, the estimations of PMF and CMB are relatively close. In Round 2, more than half of the solutions report the exact number of factor/sources used to design the dataset (8) and all the solutions, except one, report between six and nine factor/sources.

The tests suggest that the reliability of the performance diagnostics influences the ability of the tools to establish the most suitable number of factor/sources. Often, unconstrained

On the other hand, factor/sources in the SALT category, which show poor correlation with the concentrations in the reference profile, are well correlated with the reference in terms of contribution-to-species.
In the SALT chemical profiles, Cl and Na represent on average 81% and 49% of the source mass, respectively, and their relationship is close to the stoichiometric ratio in sodium chloride. As for the contribution-to-species, the ratio between the two elements (39% and 58% of the SALT mass, respectively) indicates that the share of Cl in SALT is lower than the one it would have been if the only source consisted of NaCl. This mismatch indicates the contribution of additional sources to this element other than sea and road salt (e.g. INDU).
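
The stoichiometric comparison invoked for SALT can be checked with a short calculation. The sketch below uses rounded standard atomic masses for Na and Cl and the percentages quoted in the text; the variable names are illustrative, and treating the quoted pairs as simple Cl/Na ratios is a simplifying assumption.

```python
# Rounded standard atomic masses (g/mol).
M_NA = 22.99
M_CL = 35.45

# Mass fractions and Cl/Na mass ratio in pure sodium chloride.
cl_fraction = M_CL / (M_NA + M_CL)
na_fraction = M_NA / (M_NA + M_CL)
stoich_cl_to_na = M_CL / M_NA

# Cl/Na ratios formed from the percentages quoted in the text.
profile_cl_to_na = 81 / 49  # average SALT chemical profiles
cts_cl_to_na = 39 / 58      # contribution-to-species values

print(f"NaCl mass fractions: Cl={cl_fraction:.3f}, Na={na_fraction:.3f}")
print(f"Cl/Na in NaCl:                   {stoich_cl_to_na:.2f}")
print(f"Cl/Na from chemical profiles:    {profile_cl_to_na:.2f}")
print(f"Cl/Na from contribution-to-spec: {cts_cl_to_na:.2f}")
```

The profile-based ratio lies within about 10% of the NaCl value, whereas the ratio formed from the contribution-to-species figures falls well below it, in line with the Cl deficit discussed above.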

Chemical Profile Uncertainty
In order to assess the uncertainty of the factor/source profiles, the weighted differences (WD, Karagulian and Belis, 2012) between the source profiles reported by participants and the corresponding reference profiles were computed.

The interpretation of WD scores depends on the relevance of the reference value for the factor/sources being tested. If a factor/source has been attributed to the wrong source category, the reference is not appropriate to evaluate that factor/source. For that reason,

In Round 1, the fr tests were carried out using external reference profiles available in the literature and are, therefore, used only for informative purposes (not reported).

The WD test shows that, in Round 2, SALT is the category with the highest proportion of scores outside the area of acceptance (above the broken line) followed by NO3, INDU, SO4 and ROAD (Fig. 6). The analysis of the chemical profiles' uncertainty using the WD indicator shows that, in this round, 65% of factor/sources present acceptable WD scores. In addition, the joint evaluation with the chemical profile test suggests that only 18% of the factor/source profiles whose allocation to source categories was confirmed underestimated their uncertainty.

In Round 2, the SCEs are higher, because of the higher PM levels, and their relative variability within source categories is lower than in Round 1. The highest CV is the one of SALT (0.70) followed by DUST and INDU (0.60 and 0.55, respectively). As in Round 1, the lowest CVs are those in SO4 and NO3 (0.28 and 0.31, respectively).

days, that may also lead to inaccuracies in the time trends. A propensity to underestimate source categories with high SCEs such as NO3 and to a lesser extent SO4 (29% and 17% of the PM mass, respectively) is present in many solutions. Nevertheless, the bias is too small to give rise to poor scores.
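
The coefficients of variation (CV) quoted for the SCEs within each source category can be reproduced in principle with a few lines. This is a minimal sketch assuming the CV is the sample standard deviation divided by the mean; the text does not state which estimator was used, and the function name and SCE values are hypothetical.

```python
import statistics


def coefficient_of_variation(sces):
    """Relative variability of the SCEs reported for one source category."""
    # Assumption: sample standard deviation (n - 1 in the denominator).
    return statistics.stdev(sces) / statistics.fmean(sces)


# Hypothetical SCEs (ug/m3) reported by different participants.
reported = {
    "high-contribution category": [7.8, 8.5, 6.9, 9.1, 8.0],
    "low-contribution category": [0.2, 0.6, 0.3, 0.9, 0.4],
}
for category, sces in reported.items():
    print(f"{category}: CV = {coefficient_of_variation(sces):.2f}")
```

In this made-up example the category with the small absolute contributions shows the larger CV, mirroring the pattern described for SALT relative to SO4 and NO3.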

Z-scores
In Round 2, about 75% of the reported SCEs derive from solutions obtained with EPAPMF3, PMF2 and CMB8.2 and their performances are comparable to those observed in Round 1. Although a limited number of solutions are available for the other models, it is worth mentioning the good performances of COPREM, EPAPMF5 and ME-2. FA-MLRA is the only model with 50% of the scores either in the warning or action areas.

The z'-score indicator was used in Round 2 to assess the difference between solutions and the reference value taking into account the reference's uncertainty. No substantial differences were observed between z-scores and z'-scores indicating that the uncertainty of the reference had no impact on the evaluation of participants' performance.

to the reference was equivalent to the noise introduced in the synthetic dataset (20% standard deviation) that was derived from the analytical uncertainty in the input dataset (Belis et al., 2015). The zeta-score test indicates that 68% of the declared factor/source SCE uncertainties are coherent with the one of the reference while 19%, ranking in the action area, are likely underestimated (Fig. 9).

SALT is the only source category with the majority of the zeta-scores in the action area (75%). Likely, models do not allow for the higher relative uncertainty due to the very low SCEs in this source category compared to the others. Uncertainty underestimation is observed also in ROAD, which shows 60% of the scores either in the warning or in the action areas.

A considerable proportion of factor/sources obtained with EPAPMF4 and EPAPMF3 show underestimated uncertainties (29% and 24% of scores in the action area, respectively). COPREM showed uncertainties higher than the reference in 31% of the factor/sources.
The satisfactory performance of CMB8.2 (more than 90% successful scores) suggests that propagating the uncertainty of the source profiles can provide a satisfactory estimation of the SCEs uncertainty.
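
The zeta-score test on the reported uncertainties can be illustrated with a short sketch. It assumes the standard ISO 13528 definition, in which the difference from the reference is divided by the combined standard uncertainty of the result and the reference, and the conventional limits of 2 and 3 for the warning and action areas. The function names and the numbers are hypothetical; the 20% relative uncertainty of the reference is taken from the noise level mentioned above purely as an illustrative choice.

```python
import math


def zeta_score(sce, u_sce, reference, u_reference):
    """zeta = (x - X) / sqrt(u_x^2 + u_X^2), using standard uncertainties."""
    return (sce - reference) / math.sqrt(u_sce ** 2 + u_reference ** 2)


def classify(score):
    """Conventional limits: |score| <= 2 acceptance, 2 to 3 warning, >= 3 action."""
    magnitude = abs(score)
    if magnitude <= 2:
        return "acceptance"
    if magnitude < 3:
        return "warning"
    return "action"


# Hypothetical reference SCE with a 20% relative standard uncertainty.
reference = 2.0
u_reference = 0.2 * reference

# Same reported SCE, two different declared uncertainties.
for u_sce in (0.6, 0.1):
    zeta = zeta_score(3.0, u_sce, reference, u_reference)
    print(f"u_sce={u_sce}: zeta={zeta:.2f} ({classify(zeta)})")
```

For the same deviation from the reference, the result with the smaller declared uncertainty receives the larger |zeta|, which is why scores in the warning and action areas are read as a sign of underestimated SCE uncertainty.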

KEY FINDINGS OF THE INTERCOMPARISON
The tests on chemical profiles confirmed, in the majority of cases (83%), the attribution of factors/profiles to source categories in the reported results and the majority of the SCEs (87%) reported by participants met the 50% standard uncertainty quality objective established for the performance test. A high share of the tested solutions (70%-80%) apportioned a considerable amount of the PM2.5 mass to its pollution sources and many solutions estimated a number of sources close to the expected value.

In this study, the estimation of source contribution was most critical for SALT, DUST, SHIP and categories associated with mobile sources. The majority of the solutions overestimated the SCE of SALT, a source category with a contribution of about 1% of the PM mass. Such relative contribution may be considered a first approximation of the lower limit that the tested methodologies are able to quantify. Poor scores attributed to some DUST and ROAD SCEs were ascribed to the similarities in the chemical composition between road dust and crustal material that may have interfered with the allocation of mass between these sources. The uncorrelated time trends and, in some cases, the heterogeneous chemical profiles observed in INDU and SHIP were attributed to the lack of a common definition of these categories. Sources with appreciable contributions and chemical profiles dominated by few species, such as NO3 and SO4, were more efficiently recognised by the models even though there was a tendency to slightly underestimate their SCEs.

The most commonly used models, EPAPMF3, PMF2, and CMB8.2 showed quite satisfactory performance with successful z-scores ranging between 80% and 100%.
The good agreement between CMB and PMF may be partially due to the main RM assumptions being substantially respected in the used datasets: limited alteration of the species between source and receptor and relatively stable source profiles. In addition, both types of tools account for the uncertainties in the input data, have built-in performance indicators and have been available long enough to allow a wide number of practitioners to become familiar with them. For those models used in a limited number of solutions, only preliminary conclusions can be drawn at this stage. In general, fully unconstrained models which do not account for the input data uncertainty (e.g. FA-MLRA and APCFA) showed performances below the average. This result is likely because in these tools, the noise deriving from the uncertainty structure of the datasets is incorporated into the factor/sources (Paatero and Hopke, 2003).

The tests used to assess the SCE uncertainty reported in the solutions confirmed that the RMs output uncertainty estimation is coherent with the analytical/random uncertainty of the input data. Other components of the uncertainty could be evaluated in specially designed intercomparisons where RMs are either compared with other types of models or synthetic datasets with known perturbing factors are used. Processes altering the factor/source chemical profiles could be detected in the preliminary tests by comparison with the reference source profiles. In addition, diagnostic ratios could be used to detect long-range transport or photochemical age of aerosols (Hien et al., 2004; DeCarlo et al., 2010).

The slightly better performance observed in Round 2 compared to Round 1 is likely connected to the differences between simulated and measured data.
Round 1 was more challenging for the participants due to the inconsistencies in the uncertainties they had to deal with in a blind test with limited information about a non-European study area.