Pattern Recognition Classification and Identification of Trace Organic Pollutants in Ambient Air from Mass Spectra



Introduction
The most frequently used method of analysis of trace organic pollutants in ambient air is gas chromatography-mass spectrometry (GC-MS). Presently, target compounds in air monitoring samples are identified from gas chromatographic retention times and a combined forward-reverse search of mass spectral reference spectra. The primary objective of this study is to develop computational pattern recognition procedures for the identification of chemical classes and, if possible, individual compounds from routine GC-MS data files. The additional information on nontarget compounds would supplement compound identifications obtained from the target list. The target set of compounds investigated consisted of 78 substituted benzenes, haloalkanes and haloalkenes.

SIMCA Pattern Recognition
SIMCA pattern recognition was developed by Wold and coworkers [1] for application to classification problems in chemistry. The technique is based on the modeling of chemical classes by disjoint principal component models, one fitted separately to each class. Once the class models have been determined, objects are classified by fitting their data to each class model. A standard deviation for each model, calculated from the residuals, represents a class tolerance level around the model in measurement space.
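The class modeling idea can be illustrated with a minimal Python sketch. It is not the modified SIMCA program used in this study. The function names, the power-iteration fit, the degrees-of-freedom correction (a commonly cited SIMCA convention), and the default tolerance of two standard deviations are all assumptions made for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def fit_class_model(rows, n_components, n_iter=500):
    """Fit a principal component model to one class (rows: list of feature vectors).

    Returns the class mean, the loadings, and the class residual SD (s0).
    Assumes n > n_components + 1 objects and p > n_components variables.
    """
    n, p = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(p)]
    resid = [[r[j] - mean[j] for j in range(p)] for r in rows]
    loadings = []
    for _ in range(n_components):
        # Heuristic start for power iteration: the largest remaining residual row.
        v = max(resid, key=lambda r: dot(r, r))
        norm = math.sqrt(dot(v, v))
        if norm == 0.0:
            break  # no variance left to model
        v = [x / norm for x in v]
        for _ in range(n_iter):
            # One power-iteration step on X^T X, computed as X^T (X v).
            scores = [dot(r, v) for r in resid]
            w = [sum(s * r[j] for s, r in zip(scores, resid)) for j in range(p)]
            norm = math.sqrt(dot(w, w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        loadings.append(v)
        # Deflate: remove the fitted component from every object.
        deflated = []
        for r in resid:
            t = dot(r, v)
            deflated.append([r[j] - t * v[j] for j in range(p)])
        resid = deflated
    a = len(loadings)
    ss = sum(dot(r, r) for r in resid)
    s0 = math.sqrt(ss / ((n - a - 1) * (p - a)))  # assumed SIMCA-style correction
    return {"mean": mean, "loadings": loadings, "s0": s0}

def residual_sd(model, x):
    """Residual SD of one object after fitting it to a class model."""
    p = len(x)
    e = [x[j] - model["mean"][j] for j in range(p)]
    for v in model["loadings"]:
        t = dot(e, v)
        e = [e[j] - t * v[j] for j in range(p)]
    return math.sqrt(dot(e, e) / (p - len(model["loadings"])))

def classify(models, x, tolerance=2.0):
    """Return (best class, within_tolerance). Assumes every model has s0 > 0."""
    ratios = {name: residual_sd(m, x) / m["s0"] for name, m in models.items()}
    best = min(ratios, key=ratios.get)
    return best, ratios[best] <= tolerance
```

An object is assigned to the class whose model leaves the smallest residual relative to that class's own standard deviation, and the second return value flags whether it falls inside the tolerance band.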

Data Analysis
An IBM PC-XT microcomputer with 640K memory was used in this study, and a modified version of the SIMCA program was used for the majority of the pattern recognition calculations. The low resolution mass spectra of 78 compounds were obtained from the EPA-NIH Mass Spectral Library on an INCOS data system (see table 1). The data were preprocessed by taking the autocorrelation transform of the mass spectra for mass lags less than 100. In the initial stages of class modeling, the autocorrelation-transformed spectra for the 78 training compounds were examined for class separation with two- and three-dimensional principal components projections of the training data. Three different groups were found: nonhalogenated benzenes; chloroaromatics, chloroalkanes, and chloroalkenes; and bromocarbons.
Principal components models were then derived for each class [2]. None of the class models required more than three principal components. The largest number of autocorrelation coefficients relevant to a specific class was 40 in class 3. Eighty-six percent of the training set compounds were assigned to their correct class as a first choice and were within two standard deviations of the models.
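The autocorrelation preprocessing used in this section can be sketched as follows. This is a minimal illustration, assuming unit-mass spectra stored as a mapping from integer m/z to intensity and assuming normalization by the lag-zero term. The function name and the example intensities are hypothetical and are not taken from the EPA-NIH library.

```python
def autocorrelation_transform(spectrum, max_lag=100):
    """Autocorrelation of a unit-mass spectrum for lags 0 .. max_lag - 1.

    spectrum: dict mapping integer m/z to intensity.
    Each coefficient is sum over m of I(m) * I(m + lag), divided by the
    lag-zero value (the normalization is an assumption of this sketch).
    """
    raw = [
        sum(i * spectrum.get(mz + lag, 0.0) for mz, i in spectrum.items())
        for lag in range(max_lag)
    ]
    if raw[0] == 0.0:
        return [0.0] * max_lag  # empty spectrum
    return [a / raw[0] for a in raw]

# Hypothetical three-peak spectrum with illustrative intensities.
example = {77: 50.0, 112: 100.0, 114: 33.0}
acf = autocorrelation_transform(example)
nonzero_lags = [lag for lag, a in enumerate(acf) if a > 0.0]
print(nonzero_lags)
```

The transform is nonzero only at lags equal to mass differences between peaks in the spectrum, so it describes peak spacings rather than absolute peak positions.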

Hierarchical Classification Scheme [2]
Once a class assignment is made, further identification of each spectrum can be made. This is done by calculating the distance, using the entire spectrum in autocorrelation space, of each compound to its three nearest neighbors in its assigned class. To obtain a specific structural assignment, the normalized correlation coefficient of the unknown mass spectrum with that of each of its three nearest neighbors is obtained. If the unknown is not identified as one of the three nearest neighbors in the first class assignment, the correlation coefficients of the three nearest neighbors in the second SIMCA class assignment are compared.

Volume 93, Number 3, May-June 1988 Journal of Research of the National Bureau of Standards
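The identification step of the hierarchical scheme can be sketched in Python as shown here. Several details are assumptions, since the text does not specify them: Euclidean distance in autocorrelation space, an uncentered form of the normalized correlation coefficient, and a hypothetical acceptance threshold of 0.9. All function names and the library layout are illustrative.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def normalized_correlation(spec_a, spec_b):
    """Uncentered correlation of two spectra given as {m/z: intensity} dicts."""
    num = sum(i * spec_b.get(mz, 0.0) for mz, i in spec_a.items())
    den = math.sqrt(
        sum(i * i for i in spec_a.values()) * sum(i * i for i in spec_b.values())
    )
    return num / den if den else 0.0

def identify(unknown_spec, unknown_acf, library, k=3, threshold=0.9):
    """Match an unknown against the k nearest neighbors in one class.

    library: list of (name, spectrum, acf) entries for that class.
    Returns (name, r) if the best correlation reaches the threshold,
    otherwise (None, r).
    """
    neighbors = sorted(library, key=lambda e: euclidean(unknown_acf, e[2]))[:k]
    if not neighbors:
        return None, 0.0
    scored = [(normalized_correlation(unknown_spec, s), name) for name, s, _ in neighbors]
    best_r, best_name = max(scored)
    return (best_name, best_r) if best_r >= threshold else (None, best_r)

def identify_hierarchical(unknown_spec, unknown_acf, ranked_libraries, **kwargs):
    """Try the first-choice class, then fall back to the second-choice class."""
    result = (None, 0.0)
    for library in ranked_libraries[:2]:
        result = identify(unknown_spec, unknown_acf, library, **kwargs)
        if result[0] is not None:
            break
    return result
```

Here `ranked_libraries` would hold the reference entries of the first and second SIMCA class assignments, in that order, so the second class is consulted only when no neighbor in the first class passes the threshold.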

GC-MS Calibration Samples
Three direct injection calibration samples, which contained all but one of the training set compounds, were analyzed [2]. Of the 77 compounds observed in the calibration runs, 84% were correctly identified. Of the 13 compounds not identified, three were correctly assigned by chemical class.

GC-MS Field Samples
The GC-MS data files for three field samples were also analyzed [3]. The identities of the target compounds were determined by using both GC retention times and a combination of forward and reverse spectral matching techniques with stringent matching parameters. The identification of other compounds not on the target list was based on a Finnigan search technique. The application of the pattern recognition scheme to the transformed data for the target compounds resulted in 88% correct classification. The compound identification results were 85% accurate.
There were 75 different nontarget compounds identified in 120 occurrences in the three samples. The classification results agreed very well for the class 1 and class 2 spectra. However, a very large number of alkanes and alkenes were incorrectly classified as chlorocompounds. Further details of this study are given in references [2], [3], and [4].
Although the research described in this article has been funded by the U.S. Environmental Protection Agency under Cooperative Agreement CR-811617 with the University of Illinois at Chicago, it has not been subjected to Agency review. The mention of commercial products does not constitute endorsement or recommendation for use.