MATCH-A SOFTWARE PACKAGE FOR ROBUST PROFILE MATCHING USING S-PLUS

This manual details the implementation of the profile matching techniques introduced in 'Robust Estimation of Air-Borne Particulate Matter' (Wiens, Florence and Hiltz, Environmetrics, 2001; included as an appendix). The program consists of a collection of functions written in S. It runs in S-Plus, including the student version. A graphical user interface is supplied for easy use by a user with only a passing familiarity with S-Plus. A description of the software is given, together with an extensive example of an analysis of a data set using the software.


Installation and start-up
The user should have a directory, containing sub-directories .Data and .Prefs, from which S-Plus can be invoked. Put the file match.ssc into this directory, then open this script file and 'run' it from within S-Plus. The software is now installed and this step need not be repeated. To use the software, enter the command match.start(). This will open the 'Scan' menu.

The Scan Menu
In this menu - see Figure 1 for an example - the user is prompted to input the names of the Excel files containing the profiles and ambient data. If they are in the same directory as that from which S-Plus was invoked, only the file names need be entered. Otherwise the path is required, e.g. C:\directory\filename.xls.
The user specifies the column numbers in which the various parts of the input are to be found. The current default of 'Ambient.C in columns 6+2*(1:31)' means that the measured ambient values are to be found in columns 8, 10, ..., 68. This is equivalent to entering the input 8 10 12 14 ... 68, with a space between each number. For Excel files which are exactly in the format of the sample files included with this documentation, only the '31' (= the number of species in the ambient records) would be changed in each individual application.
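The column convention above can be sketched in a few lines. The helper name is mine, not the package's; it simply expands S's 6+2*(1:31) to the even columns 8, 10, ..., 68.

```python
# Hypothetical helper mirroring the manual's 'Ambient.C in columns 6+2*(1:31)'
# convention: S's 6 + 2*(1:31) expands to columns 8, 10, ..., 68.
def ambient_columns(n_species, offset=6, step=2):
    """Return the spreadsheet columns holding the measured ambient values."""
    return [offset + step * k for k in range(1, n_species + 1)]

cols = ambient_columns(31)
print(cols[:3], cols[-1])  # [8, 10, 12] 68
```

Changing the '31' for a data set with a different number of species changes only `n_species`.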
The Excel files must contain no blank rows or columns. Now click on Apply. The data files will be read into matrices profiles1, sv.ests1, ambient1, rv.ests1 and total.mass1. The matrices profiles1 etc. are assigned to 'frame 0' (the session frame) and so can be viewed and used from the Commands window throughout the current S-Plus session.
The sources are ordered and numbered, if desired, according to one of two user-specified options: (i) the correlation of the profile with the average (across receptors) ambient vector, or (ii) the Mahalanobis distance of the profile from the average ambient vector. In both these cases (and only for the purposes of this ranking), each species vector is first normalized so as to have an average of zero and a sample variance of one.
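The correlation ranking, option (i), can be sketched as follows; function names are mine, and the Mahalanobis-distance option (ii) is omitted. Standardization to mean 0 and variance 1 is shown explicitly to mirror the manual, although correlation is unchanged by it.

```python
import numpy as np

def zscore(v):
    # standardize to average 0 and sample variance 1, as in the manual
    return (v - v.mean()) / v.std(ddof=1)

def rank_by_correlation(profiles, ambient_mean):
    """Order sources by decreasing correlation of each (standardized)
    profile column with the (standardized) mean ambient vector."""
    za = zscore(ambient_mean)
    score = [np.corrcoef(zscore(profiles[:, j]), za)[0, 1]
             for j in range(profiles.shape[1])]
    return np.argsort(score)[::-1]

rng = np.random.default_rng(0)
amb = rng.random(10)
profiles = np.column_stack([amb + 0.01 * rng.random(10),  # near-copy of ambient
                            rng.random(10),               # unrelated profile
                            -amb])                        # anti-correlated
print(rank_by_correlation(profiles, amb))
```

The near-copy of the ambient vector ranks first; the anti-correlated profile ranks last.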
At this point the 'Fit' menu will have opened, along with a report as in Figure 2.
(Footnote, p. 3: receptor.fit() runs as a stand-alone function, without the necessity of assignments. The required and optional arguments are as described in the receptor.fit help in §5.) In the 'Use species:' window, enter the line c(1:31)[-c(5:7, 23, 12, 21, 2)] as in Figure 3. The various options can also be changed from their default values at this point.
The user can also input 'Source groupings', which work as follows. Suppose that an initial fit of sources 20:35 (with other inputs as in Figure 3) showed that sources 20, 21, 22 and 26 were highly correlated with each other, and that sources 29 and 30 were highly correlated with each other. These correlations might well result in collinearity problems, essentially because highly correlated source profiles are statistically almost indistinguishable from each other. A possible remedy is to fit the sum of these profiles, rather than the individual profiles. Thus, the user might re-fit the data, but now in the Group sources: window he would enter c(20, 21, 22, 26), c(29, 30). This will result in the profiles for sources 20, 21, 22 and 26, and the corresponding variance estimates in sv.ests, being summed. Similarly the profiles and variance estimates for sources 29 and 30 will be summed. In the output, the summed profiles will be labelled '20 + 21 + 22 + 26' and '29 + 30'; a legend giving the names of the profiles being summed will also be displayed.
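The grouping mechanics just described can be sketched as below: grouped profile columns and their variance estimates are summed into a single combined column, labelled with the '+' convention. Function and label handling are illustrative only.

```python
import numpy as np

# Sketch of the 'Group sources' mechanics: profiles and their variance
# estimates for grouped sources are summed into one combined column.
def group_sources(profiles, sv_ests, groups):
    keep = [j for j in range(profiles.shape[1])
            if not any(j in g for g in groups)]
    new_p = [profiles[:, keep]]
    new_v = [sv_ests[:, keep]]
    labels = [str(j) for j in keep]
    for g in groups:
        new_p.append(profiles[:, g].sum(axis=1, keepdims=True))
        new_v.append(sv_ests[:, g].sum(axis=1, keepdims=True))
        labels.append(" + ".join(str(j) for j in g))
    return np.hstack(new_p), np.hstack(new_v), labels

p = np.arange(12.0).reshape(3, 4)   # 3 species, 4 sources
v = 0.5 * p
gp, gv, lab = group_sources(p, v, [[1, 2]])
print(lab)  # ['0', '3', '1 + 2']
```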
If multiple receptor records are chosen then the ambient vectors are pooled, using the α-trimmed mean. Throughout this manual α = 0 unless robust = TRUE, in which case α is set by the user, with a default value of .1. The receptor variance estimates are pooled in the same way. If option = "ev", the resulting receptor variance estimates are then divided by the number of receptors being pooled, thus yielding the (squared) standard errors of the trimmed means. The total.mass values are pooled in the same manner.
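The α-trimmed mean used for pooling can be sketched as follows (α = 0 reduces to the plain mean; the division of pooled variances by the receptor count under option = "ev" is not shown). The function name is mine.

```python
import numpy as np

# alpha-trimmed mean: drop the floor(alpha*n) smallest and largest values,
# then average the rest; alpha = 0 gives the ordinary mean.
def trimmed_mean(x, alpha=0.1):
    x = np.sort(np.asarray(x, dtype=float))
    k = int(np.floor(alpha * len(x)))
    return x[k:len(x) - k].mean() if k > 0 else x.mean()

vals = [1.0, 2.0, 3.0, 4.0, 100.0]   # one gross outlier
print(trimmed_mean(vals, 0.0))       # 22.0: the outlier dominates
print(trimmed_mean(vals, 0.2))       # 3.0: the outlier is trimmed away
```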
Click on Apply to start the fitting. When the data are exceptionally ill-conditioned, or variance estimates are exceptionally poor, the program may crash when it attempts to invert an almost singular matrix. I have addressed as much as possible of this numerical instability by requiring that all matrix inversions employ a preliminary Cholesky decomposition. If however the problem still arises, the user should re-run the program with highly correlated profiles summed as described above, or with a different choice of options. Setting est.corr = FALSE is fairly safe in this respect. On the other hand, choosing positive.variances = FALSE (a choice which is forced by option = "ev") can very often cause this singularity problem. (Note that "positive.variances" appears as "+'ve variances?" in the Fit menu.) After the fitting is done, the complete output is assigned to frame 0 as output. It may be viewed here in its entirety, and manipulated as desired.
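The Cholesky-based inversion strategy mentioned above can be sketched as below: factor the matrix as M = L Lᵀ and solve two triangular systems instead of forming an explicit inverse. A non-positive-definite M raises an error immediately, surfacing the ill-conditioning instead of returning garbage.

```python
import numpy as np

# Solve M x = b via a preliminary Cholesky factorization M = L L^T.
# np.linalg.cholesky raises LinAlgError when M is not positive definite.
def chol_solve(M, b):
    L = np.linalg.cholesky(M)
    y = np.linalg.solve(L, b)       # forward solve:  L y = b
    return np.linalg.solve(L.T, y)  # back solve:     L^T x = y

M = np.array([[4.0, 2.0], [2.0, 3.0]])
b = np.array([2.0, 5.0])
x = chol_solve(M, b)
print(np.allclose(M @ x, b))  # True
```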

Example 1
In this section I outline an analysis of REVEAL data similar to that analyzed in Lowenthal et al. (1997) and kindly supplied by Dr. D. H. Lowenthal. Assuming that the file match.ssc has been run, and that the data files ambient.xls and profiles.xls reside in the same directory as that from which S-Plus was invoked, one has these files scanned by entering > match.start() from the Commands window, and then choosing the options of the Scan menu as in Figure 1. Click on Apply; after the scanning is complete the Report window will open as in Figure 2. Since the correlations ordering was chosen in the Scan menu, the Report window will also contain a listing (not shown here) of the correlations of the source profiles with the mean ambient vector. As in 'Case II' in Table 2 of Lowenthal et al., we fit the ambient data from the Chilliwack receptors (numbers 1-28) using sources PHPVRD, MUCH, MAR100, AMBSUL and AMNIT. From the Report window, these are numbers 69, 6, 35, 29 and 89. We fit all species except those labelled 5: PHX, 6: SUX, 7: CLX, 23: ZRX, 12: CRX, 21: RBX, 2: MGX and MOX, which was previously removed due to a lack of ambient data. We will first give the Effective Variances fit. See Figure 3; note that choosing the 'ev' option forces est.corr = F, robust = F, positive.variances = F. If the user inputs other values of these options then they are overridden. The resulting output is as in Figures 4 and 5.
Recall that the 'ev' option forces positive.variances = F. Were some of the variance estimates equal to zero? If so, the corresponding species could have had a very large influence on the fit, since the weight assigned to a species is inversely proportional to its effective variance. Such large weights are almost certainly unjustified, since zero variances probably reflect a lack of useful information or intuition rather than a conviction that the profile values are in fact constant, i.e. without variation. Entering the command > (output$sv.ests == 0) from the Commands window yields a matrix of Ts and Fs, a T representing a 0 in the sv.ests matrix. This output reveals that nearly all species had zero source variance estimates in the AMBSUL and AMNIT columns, and 11 of these had zeros in the MAR100 column as well. Thus these 11 species would have zeros in three of the five columns of the sv.ests matrix, possibly giving them unrealistic influences on the fit. Indeed, all but two of these 11 have very small studentized residuals. This might be due to their conforming well to the model, but might instead be forced by their large influence. To see the effect of this influence, I re-run the analysis using the choices option = 1, est.corr = F, robust = F, positive.variances = T. Thus only the variance estimates change, relative to the previous fit. The results are in Figures 6 and 7, and show a significant change in the parameter estimates and a significant improvement in the residuals as measured by the chi-square value. However, none of the source contribution estimates θ1, ..., θ5 is now significantly larger than zero.
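Why zero variance estimates are dangerous can be sketched numerically. The effective-variance form below (receptor variance plus source variances weighted by squared contributions) is the usual EV expression and is shown for illustration; a species whose effective variance is zero receives infinite weight.

```python
import numpy as np

# Illustrative effective variance: rv_i + sum_j sv_ij * theta_j^2.
# The fit weights each species by 1/(effective variance), so a zero
# effective variance means (near) infinite influence on the fit.
def effective_variance(rv, sv, theta):
    return rv + sv @ (theta ** 2)

rv = np.array([0.5, 0.5, 0.0])   # last species: zero receptor variance
sv = np.zeros((3, 2))            # and zero source variances throughout
theta = np.array([1.0, 2.0])
ev = effective_variance(rv, sv, theta)
weights = np.where(ev > 0, 1.0 / np.maximum(ev, 1e-12), np.inf)
print(ev, weights)  # last species gets infinite weight
```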
In Figures 8 and 9 I give the output from a robust fit, using robust = T, option = 2, est.corr = T and Huber's ψ with k = .5. All of the source contribution estimates are again significant. This example thus illustrates a point noted in Wiens, Florence and Hiltz (2001) - the sensitivity of the CMB results to the quality of the variance estimates which form part of the input data. The robust methods presented there and implemented here afford some protection against this instability.

Example 2
This example, with simulated data, illustrates the point that the robust and non-robust fits are typically in broad agreement when applied to "clean" data. It also illustrates the use of receptor.fit() as a stand-alone function (recall the footnote on p. 3).

Output:
Correlations between sources are:
Percentage of total (weighted) sum of squares accounted for by the regression is 100*R.sqd = 99.19
Mass accounted for (std. dev.) is 54.64% (18.66%)
Chi-square (= ss of studentized residuals/df) is 15.7 with p-value (= prob. of a larger value) of 0.

Output:
Correlations between sources are:

REQUIRED ARGUMENTS

sv.ests
a 'source variances' matrix, of the same size as profiles, whose (i, j)th element contains an estimate of the variance of the (i, j)th element of profiles.

rv.ests
a 'receptor variances' vector or matrix, of the same size as ambient, whose elements are estimates of the variances of the corresponding elements of ambient.

total.mass
a vector or matrix with elements massc = total mass and massu = standard error of this total.
OPTIONAL ARGUMENTS

option
If = "1" then the profiles matrix is used as the matrix of independent variables in each regression; if = "2" then at each iteration a maximum likelihood estimate Â of the matrix A of mean source contributions is computed and used as the matrix of independent variables in the next iteration. If = "ev" ("Effective Variances") then the other arguments are set to est.corr=F, robust=F, positive.variances=F and the input values of these arguments are ignored.

intercept
If TRUE, an intercept model is fitted. The default is to not fit an intercept.

est.corr
If TRUE then the common correlation matrix Ω of the within-profile measurements is estimated. If FALSE then Ω = I is assumed.

transform
If = "log" then the ambient vector and profiles matrix are replaced by their logarithms and the variance estimates are adjusted accordingly. If = "sqrt" then this is done using the square roots.

robust
If TRUE then the least squares regression estimates are replaced by M-estimates.

psi.type
Possible choices of types of ψ function used for the M-estimates are "Huber" and "Hampel". Ignored if robust = FALSE.

k.Huber
Tuning constant for the "Huber" psi.type; ignored if robust = FALSE.

k.Hampel
Tuning constants for the "Hampel" psi.type; ignored if robust = FALSE.

alpha.robust
Trimming proportion for trimmed means when robust = TRUE.

weight.type
Weights used in the M-estimation if robust = TRUE. Options are "hat" (see §3.7, Wiens et al. 2001) or "mahal" - the Mahalanobis-distance based weight w^(1)(x_i; γ = √2), as at (7) of Du and Wiens (2000).

Each iteration of the fitting procedure thus alternates between (1) estimation of θ, and (2) estimation of these other parameters, using the current estimate of θ.

EXAMPLE
set.seed(13)
# Simulate some data:
profiles <- matrix(data=abs(rnorm(24)), nrow=6)
sv.ests <- .5*profiles
ambient <- as.vector(abs(profiles %*% c(1,.1,.01,.001) + rnorm(6, sd=1)))
rv.ests <- as.vector(.5*sv.ests[,1])
total.mass <- c(10,1)
# Run receptor.fit on these data:
qwe1 <- receptor.fit(profiles, ambient, sv.ests, rv.ests, total.mass)
# Some source concentration estimates are zero;
# remove these sources and try again:
qwe2 <- receptor.fit(profiles[,-c(1,2)], ambient, sv.ests[,-c(1,2)], rv.ests, total.mass)

(1) compositions of source emissions are constant over the period of ambient and source sampling; (2) chemical species do not react with each other, i.e. they add linearly; (3) all sources with potential for significantly contributing to the receptor have been identified and have had their emissions characterized; (4) the source compositions are linearly independent of each other; (5) the number of sources or source categories is less than or equal to the number of chemical species; (6) measurement uncertainties are random, uncorrelated, and normally distributed.

Of course practitioners often apply methods developed under possibly untenable assumptions, in the hope that assumptions which are 'close' to being satisfied will result in applications which are 'close' to being appropriate. There is now a wealth of robustness studies indicating that such an attitude can be seriously misguided, and that seemingly minor violations of assumptions such as normality or independence can result in a very significant deterioration in the performance of an otherwise appropriate or even optimal statistical procedure. Mathematical descriptions of the difficulty can be phrased in terms of discontinuities in the quality of the procedures, at those points at which the assumptions are violated. While we do not argue the utility and contributions made by the CMB method developed at DRI, we suggest that adding robustness to the estimation methods can reduce the risk of spurious conclusions regarding apportionment of emission sources based upon results where assumptions are not or cannot be met. Most often, these sorts of violations would arise because: (1) the user of the CMB software would be using source profiles obtained from data libraries containing chemical profiles compiled in many different locations, not from actual data locally obtained (see e.g. Lowenthal et al., 1997), and not always having a knowledge of the data's quality during gathering and handling; and (2) it would often be impossible, or economically infeasible, to test whether or not all assumptions were met. In Section 2 of this article we present a modification to the CMB model. We discuss the similarities with, and differences from, other approaches in the literature. In Section 3 we develop estimation methods for this model based on least squares, and then a set of robust alternatives. In a simulation study carried out in Section 4 we compare our methods with analyses carried out using the DRI effective variance CMB model and previously published data. We argue that the new methods afford additional and necessary security against erroneous allocations of PM chemistry among emission sources.

2. THE MODIFIED CMB MODEL
In this section we use the following notation:

y    n×1 vector of ambient measurements; thus the ith element y_i is the ambient amount of species i. Typically all measurement units are µg/m³.
A    n×p matrix whose jth column a_j consists of the n 'true' profile values at source j; thus a_ij refers to species i, source j and is the amount of species i in the emissions from source j as perceived at the receptor.
X    n×p matrix whose jth column x_j = (X_1j, ..., X_nj)^T consists of the measured profile values at source j.
θ    p×1 vector of total mass contributions of the sources to the receptor; θ_j refers to source j.

Assume that, apart from random errors ε_i with variances VAR[ε_i], one has y = Aθ. The a_j are not known and are observed with error; i.e. one observes a random vector x_j rather than a_j. The errors δ_j = x_j − a_j may be correlated. It is assumed that the variances may vary both with the species and with the source, but that the correlation structure within each profile is the same across sources. Thus the n×n covariance matrix Σ_j of δ_j has the structure Σ_j = Λ_j^(1/2) Ω Λ_j^(1/2), where Λ_j is a diagonal matrix of variances for the species within source j, and Ω is a correlation matrix. We assume that the errors δ_j are independent of ε. Since the n×n matrix Ω must be estimated from only p observation vectors, it appears that further structure must be imposed. Assume that all off-diagonal entries of Ω are equal to a common value ρ, necessarily at least −1/(n−1) in order that Ω be positive semi-definite. In the development (WC&H) of the Effective Variance (EV) CMB model, it is assumed that the measurements X_ij are independently and normally distributed with means a_ij and variances σ²_ij. WC&H thus assume, in the notation above, that Ω = I_n, the n×n identity matrix. The σ²_i and σ²_ij are assumed known in the theoretical developments of the EV model and of the models proposed here. In the implementations these variances are estimated (often before the data are submitted to an analyst) and the estimates are substituted for the true values.

Data given to CMB analysts tend to be 'noisy' and 'dirty'. It is typically difficult to have much faith in the accuracy of much of the data, and in particular in the variance estimates. Thus, as well as using classical least squares based methods, we shall propose robust procedures which are not overly sensitive to gross errors in the variance estimates and in other features of the data. Practitioners might also question the assumption, in the formulation of the EV model, that the X_ij are independently distributed. Our assumption that the correlation structure is constant across sources, and of a constant value, is also somewhat questionable. It is however less so than the assumption of WC&H that it is constant with correlation matrix Ω = I_n. The normality assumption is not used explicitly by WC&H, although it is used implicitly to justify the use of Least Squares as an estimation procedure. Least Squares is well known not to be robust against long-tailed (i.e. longer than normal) error distributions. Our assumption that the errors ε_i are uncorrelated is necessary when the data include only one ambient value y_i per species, or a mean over time or locations, so that estimation of correlations between the ambient measurements is not always possible. Ohtaki et al. (1997) adopt a model somewhat similar to the one described here. However, they assume the availability of data from multiple receptors; this allows for estimation of COV[y] by the sample covariance matrix, summing across receptors. Ohtaki et al. (1997) consider θ as a realization of a random vector. The mean contributions are assumed to satisfy 0 ≤ θ_j ≤ 1, Σ_{j=1}^p θ_j = 1, but the non-negativity is then addressed in an ad hoc (but sensible) manner which does not guarantee that the solution will satisfy this constraint. In fact Hopke (1985, p. 134) comments '... in a mass balance, source contributions should only be positive. It is possible to use a constrained least-squares fit, but this approach has not yet been seriously explored.' In particular, these constraints are not assumed in the EV model or its CMB implementation. Since a primary purpose of the present article is to extend the EV and CMB techniques by adding considerations of robustness, the constraints are also not imposed here. We do however make a post hoc modification to the parameter estimates to ensure non-negativity.
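The paper does not spell out its post hoc non-negativity modification here; the sketch below shows one simple illustrative scheme (mirroring the manual's Example 2, where sources with zero estimated contributions are removed and the fit repeated): fit by unconstrained least squares, then drop sources with negative estimates and refit. The function name and scheme are mine.

```python
import numpy as np

# Illustrative post hoc non-negativity scheme (not necessarily the paper's):
# fit by least squares, drop sources whose estimates are negative, refit.
def nonneg_fit(X, y):
    active = list(range(X.shape[1]))
    theta = np.zeros(X.shape[1])
    while active:
        b, *_ = np.linalg.lstsq(X[:, active], y, rcond=None)
        if (b >= 0).all():
            theta[active] = b
            return theta
        active = [j for j, bj in zip(active, b) if bj > 0]
    return theta

X = np.array([[1.0, 0.9], [0.0, 1.0], [1.0, 0.0]])
y = np.array([1.0, -0.2, 1.1])
theta = nonneg_fit(X, y)
print(theta)  # second source's (negative) estimate has been zeroed out
```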

3. ESTIMATION METHODS
We first outline the estimation methods used, assuming that the regressions will be carried out by least squares. A robust alternative will then be described and evaluated. By rearranging Equations (2)-(5), the observed vector y may be represented as

y = Xθ + f,    (6)

where f has mean 0 and covariance matrix COV[f] = V. Two estimation approaches, Option 1 and Option 2, are investigated and discussed in this study. Option 1 relies on the observation that if V were known then one could apply Generalized Least Squares (GLS) to (6) to estimate θ; if θ were known then one could estimate V, in a manner described below. Option 2 relies on the observation that if A were known one could apply GLS to (2) to estimate θ.

Again, if θ were known then one could estimate A.
Each option suggests an iterative procedure: estimate θ; use this estimate to estimate V or A; re-estimate θ as above, then re-estimate V or A; iterate to convergence. We shall first describe each estimation in detail. Section 3.5, in which the various steps are put together to outline the entire procedure, also serves as a summary and comparison of the two options.
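The GLS step used by both options can be sketched as below: with covariance V known, θ̂ = (XᵀV⁻¹X)⁻¹XᵀV⁻¹y, computed here via linear solves rather than explicit inversion. Names and the test data are mine.

```python
import numpy as np

# One GLS step: theta_hat = (X^T V^-1 X)^-1 X^T V^-1 y,
# evaluated via linear solves instead of forming V^-1 explicitly.
def gls(X, y, V):
    Vi_X = np.linalg.solve(V, X)
    Vi_y = np.linalg.solve(V, y)
    return np.linalg.solve(X.T @ Vi_X, X.T @ Vi_y)

rng = np.random.default_rng(1)
X = rng.random((8, 2))
theta_true = np.array([2.0, -1.0])
V = np.diag(rng.random(8) + 0.5)
y = X @ theta_true                 # noise-free, so GLS recovers theta exactly
print(np.allclose(gls(X, y, V), theta_true))  # True
```

In the iterative procedure this step alternates with re-estimation of V (Option 1) or A (Option 2).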

3.1. Estimation of A
If all parameters except A are known, and if the errors are normally distributed, then from Equations (2)-(5) the log-likelihood for A, apart from some inessential constants, is given by (7). To obtain the maximum likelihood estimate (MLE) one maximizes log l. The matrix of partial derivatives, with (i, j)th element ∂ log l/∂a_ij, has jth column g_j. Solving the equations g_1 = g_2 = ... = g_p = 0 gives the MLE Â, with jth column â_j. Although â_j is the MLE only under an assumption of normality, it is in any event a reasonable estimator. It adjusts x_j, which is the (only) estimate of a_j if no other data are available, by taking into account the regression on y. This adjustment vanishes if y − Xθ = 0, as it should, since then y is perfectly predicted by X and gives us no information which is not already contained in X. Since each difference (x_j − a_j) has correlation matrix Ω, if all other parameters are known then an estimate of Ω (ignoring the structural assumption discussed in Section 2) is given by the correlation matrix obtained from the p columns in (9). If Option 1 is chosen and the robustness option described in Section 3.7 is not, then we use the last expression in (9), with Ω evaluated at its value in the previous iteration of the numerical procedure and with Λ_j and Ω estimated as shown below. Otherwise we use the first expression in (9). We compute an α-trimmed correlation matrix R from these columns, and then ρ is estimated by the α-trimmed mean of the off-diagonal elements of R. We take α = 0 for the least squares option being described here; the robust option employs α = 0.1. In either case the required structure is imposed on the estimate of Ω by setting each off-diagonal element equal to the estimate of ρ. Estimates s²_ij of the σ²_ij form a part of the data; from these one can estimate Λ_j, and then Σ_j. Step 0, the initialization step, sets the parameters to simple starting values. The algorithm employed by WC&H is that of Option 1, with the difference that Ω is never updated; it remains I_n.
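The structured estimate of Ω just described can be sketched as follows: compute the sample correlation matrix of the residual columns, estimate ρ by the mean of its off-diagonal elements (α = 0, the least squares case; trimming omitted), and impose the equi-correlation structure. Function name and test data are mine.

```python
import numpy as np

# Equi-correlation estimate of Omega: rho = mean of the off-diagonal
# correlations, clipped at -1/(n-1) so Omega stays positive semi-definite.
def equicorrelation_estimate(D):
    R = np.corrcoef(D)            # rows are species, columns the p residuals
    n = R.shape[0]
    rho = R[~np.eye(n, dtype=bool)].mean()
    rho = max(rho, -1.0 / (n - 1))
    return rho * np.ones((n, n)) + (1 - rho) * np.eye(n)

rng = np.random.default_rng(4)
D = rng.random((5, 12))           # n = 5 species, p = 12 residual columns
Omega = equicorrelation_estimate(D)
print(Omega.shape)  # (5, 5), unit diagonal, constant off-diagonal
```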

3.6. Inferences
For Option 1, approximately valid inference procedures are obtained by applying standard regression theory to the model y = Xθ + f, COV[f] = V, with X fixed and V known; V is then replaced by V̂, its value at the termination of the iterative estimation procedure. Option 2 can be handled in an analogous manner. The p-values are computed using a t_{n−p} approximation to the distribution of the standardized ratio. The use of the t_{n−p}, rather than the normal, reference distribution is the usual penalty paid for estimation of the standard error of the regression estimate. An ANOVA (Analysis of Variance) breakdown starts by transforming to weighted data, as at (10). Then the Total Sum of Squares SST = ||ỹ||² is broken down into the Sum of Squares due to the Regression, SSR (the sum of squares ||X̃(X̃ᵀX̃)⁻¹X̃ᵀỹ||² or ||Ã(ÃᵀÃ)⁻¹Ãᵀỹ||², as appropriate, of the fitted values in terms of the '~' data), and the Sum of Squares due to Error, SSE = SST − SSR. For Option 1 the fitted values ŷ = Xθ̂ = Ky, where K = X(XᵀV⁻¹X)⁻¹XᵀV⁻¹, and the residuals e = (I − K)y have approximate covariance matrices KVK and (I − K)V(I − K) respectively. These are estimated by replacing V by V̂. For Option 2, X is replaced by Â and V by Σ_ε. This sets the stage for the usual range of diagnostic procedures based on residual analyses.
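The ANOVA breakdown on the weighted ("~") data can be sketched as below; names are mine, and the whitening transform itself is assumed already applied to Xt and yt.

```python
import numpy as np

# ANOVA on whitened data: SST = ||y~||^2, SSR = ||fitted||^2,
# SSE = SST - SSR, with fitted values from the projection onto col(X~).
def anova_breakdown(Xt, yt):
    H = Xt @ np.linalg.solve(Xt.T @ Xt, Xt.T)   # projection ("hat") matrix
    fitted = H @ yt
    sst = float(yt @ yt)
    ssr = float(fitted @ fitted)
    return sst, ssr, sst - ssr

rng = np.random.default_rng(2)
Xt = rng.random((8, 3))
yt = rng.random(8)
sst, ssr, sse = anova_breakdown(Xt, yt)
print(sse >= 0)  # True: SSE equals the residual sum of squares
```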

3.7. Adding robustness to the CMB analysis
Robustness is achieved for the two options partly by substituting (11) into the least squares regressions described in Section 3.5. Here ỹ, X̃ and Ã are as at (10), S is a robust measure of scale and the w_i are weights designed to bound the influence of outlying regressors. Thus θ̂ is a Mallows-type Generalized M-estimate. If ξ(t) = t² and w_i ≡ 1 then (11) gives the least squares estimates of Section 3.5. Alternative forms of ξ, for robustness against outliers, are obtained by replacing t²/2 = ∫_0^t x dx by ξ(t) = ∫_0^t ψ(x) dx, where ψ(x) is a bounded score function. Common choices are Huber's ψ function, with a default or a user-chosen value of k, and Hampel's 3-part redescending ψ function. The weights are w_i = (1 − h_ii)/√h_ii (a suggestion of Welsch, 1980), where the leverages h_ii are the diagonal elements of the 'hat' matrix formed from the regressors. Regressors far from their centroid yield large leverages and thus small weights. These weights are known not to be completely robust, since clusters of outliers can draw the centroid towards themselves, thus diminishing their apparent leverages. However, more robust weighting schemes are typically much more computationally demanding and require a relatively large number of observations, and hence are not feasible for the large numbers of simulations, with n = 8 only, carried out in this study. See Du and Wiens (2000) for a discussion. The solutions to (11) are initiated by first computing the least absolute deviations estimator, which minimizes the sum of the absolute values of the residuals rather than of their squares. This is followed by three iterations of a Newton-Raphson algorithm for (11). Finally, one step of the iteratively reweighted least squares regression algorithm is performed. See Simpson and Chang (1997) for details. Robust updating of the matrix A is performed as follows. Rather than minimizing (7), we minimize its analogue. Define a function u(t) = ψ(√t)/√t for t > 0, with u(0) = 1. By differentiation we obtain the equation A = F(A)
"01# where If both sequences converge\ with a m having a non!zero limit\ then the limit of A m satis_es "01#[ Note that a m ³ \>F"A m #−A m =:>A m = × tolerance\ so that an approximate solution is obtained by iterating until a m 0\ which also implies that A m¦0 F"A m #[ Our program iterates until m 19 or a m 0^if a 19 ³ a 9 we take A "k¦0# A "k# \ i[e[ no updating is performed at this kth stage[ The equalities in "8# no longer hold\ since they are derived from "7#[ To update V under the robustness option we compute R from the columns given by the _rst term in "8#[ The inferential procedures of Section 2[5 remain asymptotically valid\ once the estimate of the covariance matrix of u ¼ is appropriately modi_ed[ Following Hinkley "0866# and Wu "0875# we use the one!step weighted jackknife estimate as proposed in Du and Wiens "1999#\ together with a _nite!sample correction factor of Huber "0870#[ The estimate is described as follows[ Let the matrix Z be either X "for Option 0# or A "for Option 1#\ with rows z i [ De_ne where aver"c?# and var"c?# are the sample mean and variance of the c?""y ½ i −z T i u ¼ #:S#[ Then the estimated covariance matrix is   21[221[8\ 21[924[9\ 03[620[9[ D[ WIENS\ L[ Z[ FLORENCE AND M[ HILTZ Copyright Þ 1990 III reports the results in the case V V 9 \ an equi!correlation matrix with all o}!diagonal elements equal to 9[1[ In these tables the {self!estimated standard deviations| are the sample averages of the 0999 standard errors computed along with the estimates themselves\ using the covariance matrix estimates from Sections 2[5 and 2[6[ These should be compared with the simulated standard deviations "accompanying the averages of the simulated estimates of the parameters#\ presented in the tables in the form {average 2 one simulated standard deviation|[ The latter standard deviations are obtained by taking all 0999 of the simulated estimates\ and then calculating their sample standard deviations[ They should be viewed as the {true| standard 
deviations of the regression parameter estimates[ From Table II we conclude that our Options 0 and 1 do not su}er from unnecessarily estimating V\ compared to the WC+H|s EV method which\ correctly\ in this case\ assumes that V I[ The EV method Ð which is identical to Option 0 using least squares and not estimating V\ apart from the protocol detailed above for handling variance estimates which equal 9 Ð performed somewhat better for us than for WC+H[ See the footnote to Table I for some summary values from WC+H[ On the {clean| data used for Tables II and III most methods performed well\ with only moderate biases[ Note\ however\ that the least squares Option 1 method resulted in large biases and huge standard deviations in the estimates for UD when the correlation parameter was not estimatedt his was ameliorated when r was estimated[ The least squares based methods with Option 1 tended to underestimate the standard errors\ whereas the other methods tended to overestimate them[ The latter is generally preferable\ since it leads to con_dence intervals which\ although wider\ have coverages at least as great as the nominal levels[ As seen from Table III\ the estimation of r when in fact r was non!zero did not result in any signi_cant improvement[ This can be expected to change in larger datasets[ The values shown in Tables II and III appear to be too `ood[ This may be due to the fact\ pointed out above\ that the {estimated| variances were in fact exactly correct[ To investigate the robustness of the methods against incorrectly estimated variances\ we multiplied each variance estimate by an independent realization of =C=\ where C followed a Cauchy distribution[ To investigate robustness against outliers\ 09 per cent of the receptor measurements were randomly chosen and multiplied by 0:2 and 09 per cent were randomly chosen and multiplied by 2[ The simulations were then re!run on these severely corrupted data\ with V I[ We suggest "see Table IV# that our robust estimation 
methods fared somewhat better than the least squares based estimates\ primarily with respect to the accuracy of the standard errors[ In these simulations the robustness was implemented using a {Hampel| score function with the default tuning constants[ Results obtained with the {Huber| were quite similar to those reported here[
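The M-estimation ingredients discussed above can be sketched as follows. The Huber constant k = 0.5 matches the value used in Example 1 of the manual; the Hampel constants a < b < c are illustrative placeholders, not the package defaults, and the function names are mine.

```python
import numpy as np

def psi_huber(t, k=0.5):
    # Huber's psi: linear near 0, clipped at +/- k
    return np.clip(t, -k, k)

def psi_hampel(t, a=1.5, b=3.0, c=6.0):
    # Hampel's 3-part redescending psi (constants are placeholders)
    u = np.abs(t)
    mag = np.where(u <= a, u,
          np.where(u <= b, a,
          np.where(u <= c, a * (c - u) / (c - b), 0.0)))
    return np.sign(t) * mag

def welsch_weights(Z):
    # Mallows weights w_i = (1 - h_ii)/sqrt(h_ii): rows with leverage
    # h_ii near 1 (outlying regressors) are strongly down-weighted.
    H = Z @ np.linalg.solve(Z.T @ Z, Z.T)
    h = np.diag(H)
    return (1.0 - h) / np.sqrt(h)

Z = np.array([[1.0, 0.1], [1.0, 0.2], [1.0, 0.3], [1.0, 5.0]])
print(welsch_weights(Z).argmin())  # the outlying last row gets least weight
```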

5. SUMMARY AND CONCLUSIONS
We have presented a modified CMB model, together with a package of estimation procedures including a robustness option. A special case of our methods is the EV method of WC&H. A simulation study in a particularly arduous situation (n = 8, p = 4) has shown that here most of the various methods have similar performances with respect to the accuracy of the estimates, but can vary widely in the estimation of their own standard errors. The protection against outliers afforded by the robust methods can be expected to become more relevant with larger values of n. Of course, the analyst is not restricted to the use of just one method. We recommend that a thorough analysis include a comparison of methods based on least squares with those of our robust approach. Significant differences in the results should be interpreted as warnings of particularly anomalous features in the data.

Figure 2: Output following the scanning of the data files. The list of profile/ambient correlations has been omitted from this display.

Figure 3: Fit window with options chosen for the Effective Variance fit.

Figure 4: Output from the Effective Variances fit, with input as in Figure 3.

Figure 5: Graphics from the Effective Variances fit.

Figure 6: Graphics for the fit of Figure 7.

Figure 7: Output with positive.variances = T; other choices as in the EV fit.

Figure 9: Graphics for the fit of Figure 8.
Particulate matter (PM) in the atmosphere is a major environmental and public health concern in North America and elsewhere (Burnett et al., 1995; Dockery et al., 1993). PM collected at a receptor site is generally grouped according to size fractions: fine (<2.5 μm) and coarse (2.5–10 μm). The chemicals associated with these PM fractions offer unique challenges beyond their physico-chemical attributes because they represent complex mixtures of often multiple point sources of pollutants (see Hopke, 1991 and references therein). The objectives for monitoring air quality at a receptor site then become those of sampling (defining frequency and duration), estimation (determining ambient chemical concentrations) and apportionment (allocating the total ambient particulate mass among all regional sources detected at the receptor, both natural and anthropogenic, given the sources' chemical profiles). These activities are often further extended to include both temporal and spatial variation (Brook et al., 1997; CEPA/FPAC, 1998; Chow et al., 1992).

Correspondence to: Zack Florence, FVA Inc., 1316 Slater Street, Victoria, BC V8X 2P9, Canada. D. Wiens, L. Z. Florence and M. Hiltz. Copyright © 2001 John Wiley & Sons, Ltd. Environmetrics 2001; 12:25–40.

Statistical and chemometric methods that have been used for partitioning ambient pollutants measured at a receptor site include principal component analysis (including factor analysis), multiple linear regression and chemical mass balance (CMB) models. These techniques have been used singly and in combination. The CMB model has been used in several studies in Canada and the United States because the theory has been well developed over the past 30 years and MS-Windows-based software has been made publicly available by the U.S. Environmental Protection Agency (EPA) (see Lowenthal et al., 1997 and references therein). Watson et al. (1984; henceforth referred to as WC&H) developed an Effective Variance (EV) CMB model. This and subsequent CMB applications, developed for the EPA largely by J. G. Watson and colleagues at the Desert Research Institute (DRI), Reno, Nevada (U.S.A.), make a number of assumptions which are to be met before fitting ambient and source chemical profiles. (The current version of DRI's CMB software can be downloaded by anonymous FTP from eafs.sage.dri.edu/model/cmb8MMDD.exe, where MMDD stand for month and day.) These assumptions include (from Watson et al., 1991): …

The source contributions θj are to be estimated. The ambient amounts yi are measured with errors εi, the variance of an error depending on the species. Assume that these n errors in the measurement of y are independent of each other. Thus, with ε = (ε1, ..., εn)ᵀ,

y = Aθ + ε;    (2)

E[ε] = 0,   COV[ε] = Σε.    (3)

Here Σε = diag(σ1², ..., σn²) is a diagonal matrix with diagonal elements σi².
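Under model (2)–(3), with the error covariance Σε treated as known and diagonal, the natural generalized least squares estimate is θ̂ = (AᵀΣε⁻¹A)⁻¹AᵀΣε⁻¹y. The following is a minimal sketch in Python/numpy, for illustration only: the package itself is written in S, and the function name here is ours.

```python
import numpy as np

def gls_theta(A, y, sigma2):
    """Generalized least squares estimate of theta in y = A theta + eps,
    with Cov(eps) = diag(sigma2).
    A: n x p source profile matrix; y: ambient amounts; sigma2: error variances."""
    w = 1.0 / np.asarray(sigma2)          # inverse error variances
    AtW = A.T * w                         # A^T Sigma_eps^{-1}
    return np.linalg.solve(AtW @ A, AtW @ y)
```

With exact (error-free) data the estimate recovers θ for any choice of positive weights; the weights matter only when y contains measurement error.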

5. Iterative procedure for Options 1 and 2: least squares

We describe here the numerical procedure by which the estimates are obtained. The parameters θ, Ω, the Σj and possibly A are first set equal to simple initial values, then successively updated until the values of θ̂ stabilize.

Figure 1. Huber (a) and Hampel (b) ψ-functions, using default values of the tuning constants. The horizontal axis represents the regression residual, the vertical axis the relative influence of this residual on the regression fit.

… calculations of their model, and include some typical 'true' profile values aij. As in the smaller of their simulation studies, we chose their n = 8 species: Na, Al, Si, Cl, V, Ni, Br and Pb. Therefore, the values of the A matrix are as represented in their Table 1.

(Table footnotes: ± one standard deviation of θ̂j, as estimated from the simulations; obtained by averaging the estimated standard deviations over the simulations.)

We are grateful to C. Lewis, D. Lowenthal, H. Nguyen, G. Morris, E. Peake, B. Pulsipher and J. Watson for reviewing an earlier version (with associated software) of this work, and to three anonymous referees.

The first column, contributions, contains values 100ŷi/yi, representing the % of the concentration of species i at the receptor which is accounted for by the regression. The last row of this column is 100 Σŷi/Σyi, representing the % of the total receptor concentration accounted for by the regression. The second column gives standard errors of the entries in the first. The other entries of the matrix give the percentages accounted for by each source, i.e. the (i, j)th entry is 100 [profiles]i,j θ̂j/yi (with the numerator and denominator summed over species in the last row).

anova — a list consisting of:
1.
a matrix containing a breakdown of the total weighted sum of squares SST into sums of squares SSE (due to error) and SSR (due to the regression effect). Also given are the corresponding degrees of freedom and mean squares, and the F statistic and p-value for the hypothesis θ1 = ··· = θp = 0.
2. R² = SSR/SST = the proportion of SST accounted for by the regression.
3. χ².

positive.variances — if TRUE (the default) then any receptor or source variance estimate = 0 is replaced by the mean positive estimate for that species. WARNING: changing this option can result in program termination due to singular covariance matrix estimates.

printlevel — if = 0 then no printout is produced. If = 1 then a printout is produced containing (see VALUE below for descriptions) thetahat and an anova table. If = 2 then the printout also contains history, option, Ahat, relative.contributions and resids.

Omegahat — the estimate Ω̂ of the profile correlation matrix Ω; if est.corr = F then Ω̂ = I.

SIGMAhat — an n × n × p array whose kth face (1 ≤ k ≤ p) is an n × n matrix Σ̂k, the estimated covariance matrix for the kth profile.

Ahat — an n × p matrix Â which is the estimate of the matrix A of mean source profiles; if option = 1 then Â = profiles.

cov.thetahat — a P × P matrix (P = p + (intercept == T)) which is the estimated covariance matrix of the regression parameter estimates.

thetahat — a list consisting of a matrix with P rows, one row for each regression parameter estimate, and some individual columns of this matrix. Columns are the estimates, their standard errors, their t-ratios and the corresponding one-sided p-values, i.e. the p-values associated with the hypotheses H0: θ = 0 vs. H1: θ > 0.
resids — a list consisting of a matrix with n rows, one row per species, and some individual columns of this matrix. Columns are the measured ambient values yi, the fitted values ŷi and their standard deviations, the residuals ei = yi − ŷi and their standard deviations s(ei), and the studentized residuals ei/s(ei).

relative.contributions — a matrix with n + 1 rows and p + 2 columns.

NOTES
If p = 1, care must be taken to ensure that profiles and sv.ests are matrices, not vectors.

DETAILS
With y = ambient and X = profiles, the model being fitted is y = Aθ + ε; X = A + [δ1 ··· δp] for independent zero-mean error vectors ε, δ1, ..., δp. The δj have a common correlation matrix Ω but possibly different variances, as estimated by sv.ests[,j]. The εi have possibly different variances, as estimated by rv.ests. The algorithm iterates back and forth between two steps: 1) Generalized Least Squares or M-estimation of θ, using the other parameters of the model to determine a suitable transformation of the independent and dependent variables;
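Step 1) of this iteration, in its effective-variance form, can be sketched as follows. This is an illustrative Python/numpy translation, not the package's S code: the weight for species i is the inverse of the effective variance rv[i] + Σj θ̂j² sv[i, j] (the weighting of WC&H), while the M-estimation option and the correlation structure Ω are omitted. All names here are ours.

```python
import numpy as np

def effective_variance_fit(X, y, rv, sv, tol=1e-8, max_iter=100):
    """Effective-variance weighted least squares for y = X theta + eps.
    rv: length-n receptor (ambient) error variances;
    sv: n x p profile error variances.
    Iterates: compute effective variances from the current theta,
    refit theta by weighted least squares, repeat until theta stabilizes."""
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(max_iter):
        w = 1.0 / (rv + sv @ theta**2)     # inverse effective variances
        XtW = X.T * w
        new = np.linalg.solve(XtW @ X, XtW @ y)
        if np.max(np.abs(new - theta)) < tol:
            return new
        theta = new
    return theta
```

Because the weights depend on θ̂, the scheme alternates between the two until the estimates stabilize, mirroring the back-and-forth structure described above.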

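The relative.contributions table described under VALUE can be computed as in this sketch (Python/numpy, for illustration; the standard-error column of the actual (p + 2)-column output is omitted here, so this version has p + 1 columns).

```python
import numpy as np

def relative_contributions(profiles, theta, y):
    """(n+1) x (p+1) table: column 0 holds 100*yhat_i/y_i, and column j > 0
    holds 100*profiles[i, j-1]*theta[j-1]/y[i]; the last row repeats the
    same quantities with numerator and denominator summed over species."""
    yhat = profiles @ theta
    body = 100.0 * profiles * theta / y[:, None]       # % by source, per species
    first = 100.0 * yhat / y                           # % accounted for overall
    top = np.column_stack([first, body])
    total = np.concatenate([[100.0 * yhat.sum() / y.sum()],
                            100.0 * (profiles * theta).sum(axis=0) / y.sum()])
    return np.vstack([top, total])
```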
Table I. Partial replication of 'values of variables for generating simulated data and solving mass balance equations' from WC&H. The p = 4 possible emission sources are Marine (M), Urban dust (UD), Auto exhaust (AE) and Residual oil (RO). The relevant values from Table 1 of WC&H are reproduced in our Table I.

The true values of the source contributions are θ = (20, 35, 30, 15)ᵀ. In our computations any σi² or σij² equal to 0 was replaced by the α-trimmed mean (with α as in Section 3.2) of all positive variance estimates for that species. This admittedly ad hoc measure ensured the invertibility of Σε and Ω. All the results from our simulations, and the discussion which follows, are based upon code developed in the S-Plus software package (MathSoft Inc., Seattle, Washington, U.S.A.) and available from us (contact doug.wiens@ualberta.ca). We first simulated independent vectors δ1, ..., δp, where δj was normally distributed with mean vector 0 and covariance matrix Σj, so that the 'estimated' variances are in fact exactly correct. Then X = A + [δ1 ··· δp] was computed and truncated at 0 so that all elements would be non-negative. A response vector was then simulated: y = Aθ + ε, where ε was normally distributed with mean vector 0 and covariance matrix Σε = diag(σ1², ..., σn²). The first set of simulations used Ω = In (independent measurements within species across sources) for a comparison with the WC&H simulations; this series of analyses served as our internal control, to verify both our understanding of and concordance with the results of WC&H and to test our algorithms. The procedure outlined above yielded one simulated data set (y, X), from which estimates θ̂ were computed. This procedure was repeated N times. The results are summarized in our Table II for N = 1000, Ω = In.
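One replicate of the simulation just described can be sketched as follows (Python/numpy, for illustration; the argument names are ours, with profile_sds standing for the Σj standard deviations and ambient_sds for (σ1, ..., σn)).

```python
import numpy as np

def simulate_dataset(A, theta, profile_sds, ambient_sds, rng):
    """One simulation replicate: perturb the true profiles A with
    independent zero-mean normal errors (truncating at 0 so the simulated
    profiles stay non-negative), then draw y = A theta + eps.  This is
    the Omega = I_n case: errors independent within species across sources."""
    n, p = A.shape
    delta = rng.normal(0.0, profile_sds, size=(n, p))  # profile errors
    X = np.maximum(A + delta, 0.0)                     # truncate at 0
    eps = rng.normal(0.0, ambient_sds, size=n)         # ambient errors
    return A @ theta + eps, X
```

Repeating this N = 1000 times and fitting each simulated pair (y, X) reproduces the structure of the study summarized in Table II.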