Using A Low-Cost Sensor Array and Machine Learning Techniques to Detect Complex Pollutant Mixtures and Identify Likely Sources

An array of low-cost sensors was assembled and tested in a chamber environment wherein several pollutant mixtures were generated. The four classes of sources that were simulated were mobile emissions, biomass burning, natural gas emissions, and gasoline vapors. A two-step regression and classification method was developed and applied to the sensor data from this array. We first applied regression models to estimate the concentrations of several compounds and then classification models trained to use those estimates to identify the presence of each of those sources. The regression models that were used included forms of multiple linear regression, random forests, Gaussian process regression, and neural networks. The regression models with human-interpretable outputs were investigated to understand the utility of each sensor signal. The classification models that were trained included logistic regression, random forests, support vector machines, and neural networks. The best combination of models was determined by maximizing the F1 score on ten-fold cross-validation data. The highest F1 score, as calculated on testing data, was 0.72 and was produced by the combination of a multiple linear regression model utilizing the full array of sensors and a random forest classification model.

compound to avoid "learning" the artificial correlations that were present in this study but would not fully represent the diversity of mixtures that could be expected in a field deployment.
Some models include terms referred to here as "hyperparameters". Hyperparameters govern how the model operates or is trained and are set before beginning to train the model. For example, the number of layers and nodes in a neural network would be considered a hyperparameter. Another example is the regularization strength (k) in ridge regression, which modifies the loss function used while training the model. Tuning hyperparameters is one way to improve the performance of models and can be used to determine appropriate values for model parameters that might normally be selected arbitrarily or based on previous experience. Wherever possible, these hyperparameters were optimized to maximize the estimated model performance on future data. This was done by holding out a subset of the training data and optimizing the hyperparameters based on the model performance when validated on that subset of the training data. This data selection is illustrated in Figure S1. Once the hyperparameters were set, the regression models were also assessed based on how well the classification models in the next step were able to use the estimated concentrations to make a prediction of the "source" that was being simulated.
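The hold-out procedure described above can be sketched as follows. This is a minimal illustration on synthetic data, not the code used in the study: the function names (`fit_ridge`, `select_k`), the 25% hold-out fraction, and the candidate values of k are all assumptions chosen for the example.

```python
import numpy as np

def fit_ridge(X, y, k):
    """Closed-form ridge fit on centered data; the intercept is not penalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    coef = np.linalg.solve(Xc.T @ Xc + k * np.eye(X.shape[1]), Xc.T @ yc)
    return coef, y_mean - x_mean @ coef

def select_k(X, y, candidates, holdout_fraction=0.25, seed=0):
    """Pick the candidate k with the lowest MSE on a held-out part of the training data."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    n_hold = int(round(holdout_fraction * len(y)))
    hold, train = order[:n_hold], order[n_hold:]
    scores = {}
    for k in candidates:
        coef, intercept = fit_ridge(X[train], y[train], k)
        residual = y[hold] - (X[hold] @ coef + intercept)
        scores[k] = float(np.mean(residual ** 2))
    return min(scores, key=scores.get), scores

# Synthetic stand-in for sensor signals and a reference concentration (hypothetical).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([2.0, -1.0, 0.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=200)
best_k, holdout_mse = select_k(X_demo, y_demo, candidates=[0.01, 0.1, 1.0, 10.0, 1000.0])
```

The held-out rows are never used to fit the coefficients, so the hold-out error acts as an estimate of performance on future data.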
Multiple linear regression is one of the most popular forms of regression used to convert sensor signal values into calibrated concentration estimates and has been used in a range of applications with many low-cost sensor technologies [3,30,36,52-54]. Because of this popularity and the relatively low computational cost, several forms of multiple linear regression were investigated. The first form, referred to here as "FullLM", was a multiple linear regression model that simply used every sensor input from the array. This was considered a baseline model, as it included almost no previous knowledge except for the design of the array itself. The next model, referred to as "SelectLM", included only data from sensors that were known to react to the current target gas, as determined by a combination of field experience and manufacturer recommendations. For each sensor included in a model, an interaction term between the sensor response and the measured temperature was included. Humidity was also included in each model, but no interaction term was added.
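The term structure described above (each sensor, a sensor-by-temperature interaction, and humidity without an interaction) could be assembled as in the following sketch. The data are synthetic, the function names are hypothetical, and including temperature as its own main-effect column is an assumption that the text does not state.

```python
import numpy as np

def build_design_matrix(sensors, temperature, humidity):
    """Columns: intercept, sensors, sensor x temperature, temperature, humidity."""
    n_samples = sensors.shape[0]
    interactions = sensors * temperature[:, None]  # one interaction per sensor
    # Assumption: temperature also enters as a main effect.
    return np.column_stack(
        [np.ones(n_samples), sensors, interactions, temperature, humidity]
    )

def fit_linear_model(design, reference):
    """Ordinary least squares fit of reference concentrations on the design matrix."""
    coef, *_ = np.linalg.lstsq(design, reference, rcond=None)
    return coef

# Synthetic, noise-free example with three hypothetical sensors.
rng = np.random.default_rng(0)
sensors = rng.uniform(0.0, 1.0, size=(200, 3))
temperature = rng.uniform(15.0, 35.0, size=200)
humidity = rng.uniform(20.0, 80.0, size=200)
reference = 1.5 * sensors[:, 0] + 0.2 * sensors[:, 0] * temperature + 0.3 * humidity

design = build_design_matrix(sensors, temperature, humidity)
coef = fit_linear_model(design, reference)
estimate = design @ coef  # calibrated concentration estimates
```

A SelectLM-style model would pass only the columns of `sensors` known to respond to the target gas.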
Stepwise linear regression ("StepLM") is an interesting but slow-to-train method of determining important predictors for use in a linear regression model. In this methodology, a base model (typically a simple intercept term or a full interaction model) is fit to the data. In this study, the initial model was a simple intercept because a full interaction model was prohibitively large and slow. After fitting the base model, candidate models are trained, each adding or removing one of the possible terms. If the addition of a term improves upon the previous model by more than some threshold, the term is added; if the removal of a term reduces performance by less than a similar threshold, the term is removed. The resulting model is used as the new "base" model, and the process is iterated until no terms can be added to or removed from the model within those constraints. In this study, the metric for improvement was the R² value, and the threshold was a value of 0.075, which was chosen by experimenting with different performance metrics and exploring the performance of the generated models.
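The add-or-remove loop described above can be sketched as follows, starting from an intercept-only model and using a change in R² of 0.075 as the threshold. This is a simplified illustration under stated assumptions: only main-effect columns are considered as candidate terms, the names `r_squared` and `stepwise_select` are hypothetical, and the data are synthetic.

```python
import numpy as np

def r_squared(X, y, columns):
    """R^2 of a least-squares fit using an intercept plus the given columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in columns])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residual = y - A @ coef
    total = y - y.mean()
    return 1.0 - (residual @ residual) / (total @ total)

def stepwise_select(X, y, threshold=0.075, max_iter=100):
    selected, current = [], 0.0  # an intercept-only model has R^2 = 0
    for _ in range(max_iter):
        # Addition step: add the term giving the largest gain above the threshold.
        candidates = [j for j in range(X.shape[1]) if j not in selected]
        gains = {j: r_squared(X, y, selected + [j]) - current for j in candidates}
        if gains:
            best = max(gains, key=gains.get)
            if gains[best] > threshold:
                selected.append(best)
                current += gains[best]
                continue
        # Removal step: drop the term whose removal costs less than the threshold.
        losses = {
            j: current - r_squared(X, y, [c for c in selected if c != j])
            for j in selected
        }
        if losses:
            weakest = min(losses, key=losses.get)
            if losses[weakest] < threshold:
                selected.remove(weakest)
                current -= losses[weakest]
                continue
        break  # no term can be added or removed within the constraints
    return selected, current

# Synthetic example: only columns 0 and 3 carry information about y.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 6))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 3] + rng.normal(scale=0.5, size=300)
selected, r2 = stepwise_select(X_demo, y_demo)
```

Each iteration refits one model per candidate term, which is why the method becomes slow as the number of possible terms grows.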
Ridge regression ("RidgeLM") is a form of multiple linear regression, but in this study, the two are differentiated by how the terms of the model were determined. When generating ridge regression models, all sensor values were included, along with the interactions between each sensor and both temperature and humidity. This created a high-dimensional dataset, to which ridge regression was applied. Ridge regression includes a term, "k", that penalizes overfitting. Increasing the value of k changes the loss function so that the final model assigns low weights to sensor signals that are not generally useful. The value of k was determined during initial investigations and was then kept constant across different compounds and cross-validation sets.
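For reference, the standard ridge regression objective is given below. The notation is assumed here and is not taken from the original text: y_i is the reference concentration for sample i, x_ij is the value of term j (a sensor signal or interaction) for that sample, β_0 is the intercept, and β_j are the term weights. By the usual convention, the intercept is not penalized.

```latex
\hat{\beta} = \arg\min_{\beta_0,\,\beta}
  \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\,\beta_j \Big)^{2}
  + k \sum_{j=1}^{p} \beta_j^{2}
```

With k = 0 this reduces to ordinary least squares, and larger values of k shrink the weights toward zero.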
Outside of these linear regression models, several nonlinear models were trained. These were random forest regression ("RandFor"), Gaussian process regression ("GuassProc"), and neural networks ("NeurNet"). For random forest regression, a large ensemble of decision trees is trained to output discrete values. These trees are each trained on different bootstrap aggregated ("bagged") subsets of the original data that are selected randomly with replacement. When making predictions, the outputs of these trees are averaged to produce an output that can approximate a smooth function. Because the individual trees only learn to produce values that they have seen before, the extension of random forests outside of their training space may be limited, although Zimmerman and colleagues showed that they were able to produce reasonable results within some parameter space [65]. The hyperparameters that were optimized for random forests were the minimum number of points at each leaf node, the maximum number of splits for each tree, and the number of variables to select from at each split. These were optimized using the out-of-bag error as the loss function, which is the error on data that were not selected during the "bagging" of data during initial training.
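The bagging, out-of-bag error, and limited extrapolation described above can be illustrated with a small hand-written forest. This is a toy sketch on synthetic data, not the implementation used in the study: all function names are hypothetical, tree size is limited by a maximum depth rather than a maximum number of splits, and only 20 trees are grown to keep the example fast.

```python
import numpy as np

def grow_tree(X, y, rng, min_leaf, n_vars, depth=0, max_depth=6):
    """Return a leaf value (float) or a split node (feature, threshold, left, right)."""
    if depth >= max_depth or len(y) < 2 * min_leaf or y.max() == y.min():
        return float(y.mean())
    best = None
    # Only a random subset of variables is considered at each split.
    for j in rng.choice(X.shape[1], size=n_vars, replace=False):
        for t in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= t
            n_left = int(left.sum())
            if n_left < min_leaf or len(y) - n_left < min_leaf:
                continue
            sse = ((y[left] - y[left].mean()) ** 2).sum() \
                + ((y[~left] - y[~left].mean()) ** 2).sum()
            if best is None or sse < best[0]:
                best = (sse, int(j), float(t), left)
    if best is None:
        return float(y.mean())
    _, j, t, left = best
    return (j, t,
            grow_tree(X[left], y[left], rng, min_leaf, n_vars, depth + 1, max_depth),
            grow_tree(X[~left], y[~left], rng, min_leaf, n_vars, depth + 1, max_depth))

def predict_tree(node, x):
    while isinstance(node, tuple):
        j, t, lower, upper = node
        node = lower if x[j] <= t else upper
    return node

def fit_forest(X, y, n_trees=20, min_leaf=5, n_vars=1, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    trees, oob_sum, oob_count = [], np.zeros(n), np.zeros(n)
    for _ in range(n_trees):
        bag = rng.integers(0, n, size=n)  # sample rows with replacement
        tree = grow_tree(X[bag], y[bag], rng, min_leaf, n_vars)
        trees.append(tree)
        for i in np.setdiff1d(np.arange(n), bag):  # rows this tree never saw
            oob_sum[i] += predict_tree(tree, X[i])
            oob_count[i] += 1
    seen = oob_count > 0
    oob_mse = float(np.mean((oob_sum[seen] / oob_count[seen] - y[seen]) ** 2))
    return trees, oob_mse

def predict_forest(trees, x):
    return float(np.mean([predict_tree(tree, x) for tree in trees]))

rng = np.random.default_rng(1)
X_demo = rng.uniform(0.0, 1.0, size=(120, 2))
y_demo = 2.0 * X_demo[:, 0] + X_demo[:, 1] + rng.normal(scale=0.05, size=120)
trees, oob_mse = fit_forest(X_demo, y_demo)
far_outside = predict_forest(trees, np.array([10.0, 10.0]))
```

Every leaf stores a mean of training targets, so the prediction for the point far outside the training range cannot exceed the largest target value seen during training.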
Gaussian process regression, which is sometimes referred to as kriging, is a probabilistic method that uses training data and some assumptions about the distribution of the variable to make predictions on new data. Because this method is nonparametric, its ability to extrapolate to new data is somewhat limited; however, it is popular in the environmental modeling community, and De Vito recently showed good success applying it to real atmospheric data [53]. A squared exponential kernel was used after some initial investigation showed little dependence of the results on this selection. The hyperparameters that were programmatically optimized for this model were the kernel parameters, which were optimized by minimizing the objective function log(1 + cross-validation loss).
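A minimal sketch of Gaussian process prediction with a squared exponential kernel is shown below. It assumes a zero prior mean and uses fixed, hand-picked kernel parameters (`length_scale`, `signal_var`) and noise variance; the optimization of those parameters described above is not reproduced, and the function names are hypothetical.

```python
import numpy as np

def sq_exp_kernel(A, B, length_scale=1.0, signal_var=1.0):
    """Squared exponential covariance between the rows of A and the rows of B."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return signal_var * np.exp(-0.5 * sq_dist / length_scale ** 2)

def gp_predict(X_train, y_train, X_new, noise_var=1e-4, **kernel_args):
    """Posterior mean and variance of a zero-mean Gaussian process at X_new."""
    K = sq_exp_kernel(X_train, X_train, **kernel_args)
    K = K + noise_var * np.eye(len(X_train))
    K_s = sq_exp_kernel(X_train, X_new, **kernel_args)
    K_ss = sq_exp_kernel(X_new, X_new, **kernel_args)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = np.diag(K_ss) - (v ** 2).sum(axis=0)
    return mean, var

# Synthetic one-dimensional example.
X_train = np.linspace(0.0, 5.0, 20)[:, None]
y_train = np.sin(X_train).ravel()
X_new = np.array([[2.5], [50.0]])  # one point inside the training range, one far outside
mean, var = gp_predict(X_train, y_train, X_new)
```

Inside the training range the posterior mean follows the data closely. Far outside it, the mean returns to the prior mean of zero and the variance returns to the prior variance, which illustrates the limited extrapolation noted above.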
Finally, the last regression model explored here was a neural network, a technique that has been used with low-cost sensors for some time but is seeing a resurgence as better training methods and greater computational power have improved its applicability [52,53,66]. These models produce results by combining a set of "neurons" into a larger network. Each neuron applies a set of weights to its inputs and uses a transfer function to translate the weighted sum of those inputs into an output for the neuron. The first layer of neurons uses the raw sensor values as inputs, and each subsequent layer uses the outputs of the previous layer as inputs. These layers are often referred to as hidden layers, the last of which provides the input to the output layer that translates these outputs into a single predicted value. The number of hidden layers and the number of nodes in each hidden layer are tunable hyperparameters and were optimized for the best mean squared error (MSE) on a subset of the training data that was held out for testing. The number of nodes was varied from 1 to 40 for the first hidden layer and from 0 to 40 for the second. When the number of nodes in the second layer was specified as "0", the second hidden layer was simply omitted.
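The layer structure described above, including the convention that a second-layer size of 0 omits that layer, can be sketched as a forward pass. The weights here are random and untrained, the tanh transfer function is an assumption, and the function names are hypothetical; the sketch shows only how the network is assembled, not how it was trained.

```python
import numpy as np

def init_network(n_inputs, hidden_sizes, seed=0):
    """Create (weights, bias) pairs; a hidden layer of size 0 is omitted."""
    rng = np.random.default_rng(seed)
    sizes = [n_inputs] + [h for h in hidden_sizes if h > 0] + [1]
    return [
        (rng.normal(scale=0.5, size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]

def forward(layers, X):
    """Propagate inputs through the hidden layers and a linear output node."""
    out = X
    for weights, bias in layers[:-1]:
        out = np.tanh(out @ weights + bias)  # assumed transfer function
    weights, bias = layers[-1]
    return (out @ weights + bias).ravel()  # one predicted value per sample

# Eight synthetic samples from four hypothetical sensor signals.
X_demo = np.random.default_rng(1).normal(size=(8, 4))
two_hidden = init_network(4, (10, 5))  # two hidden layers
one_hidden = init_network(4, (10, 0))  # second hidden layer omitted
predictions = forward(one_hidden, X_demo)
```

A hyperparameter search over layer sizes would train one such network per candidate pair and keep the pair with the lowest MSE on the held-out subset.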

S.2.3. Predicting the Presence of Sources with Classification Models
After generating estimates of "key" compounds using each of the above regression approaches, classification algorithms were trained to identify the class of "source" that was being simulated, using the estimated concentrations at each timestep as features. The data were divided into the same calibration and validation sets to ensure that the final validation results were left out of model training for both regression and classification. The classification techniques applied here have been seen in the literature, although typically with the goal of identifying individual compounds within simple mixtures [10,55-57]. The algorithms selected here are logistic regression, support vector machines (SVMs) with both a linear and a Gaussian kernel, random forest classifiers, and neural networks. The models vary significantly in their ability to separate different classes and were selected because of that diversity. For all of the methods presented here, the classification model outputs were values ranging from 0 to 1, where 0 indicates high confidence in the absence of a source and 1 indicates high confidence in the presence of a source. When comparing the results to the actual simulated source, a value greater than or equal to 0.5 indicated a prediction that the source was present, and a value less than 0.5 indicated that it was absent.
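The 0.5 decision rule and the F1 score used to compare model combinations can be written out as a short sketch. The scores and labels below are invented for illustration, and the function names are hypothetical.

```python
def classify(scores, threshold=0.5):
    """A score at or above the threshold is a prediction that the source is present."""
    return [score >= threshold for score in scores]

def f1_score(predicted, actual):
    """F1 = 2TP / (2TP + FP + FN); returns 0.0 when there are no positives at all."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

# Invented classifier outputs for one source at five timesteps.
scores = [0.90, 0.50, 0.49, 0.20, 0.70]
source_present = [True, True, True, False, False]

predicted = classify(scores)  # [True, True, False, False, True]
f1 = f1_score(predicted, source_present)  # 2 TP, 1 FP, 1 FN -> 4 / 6
```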
The first type of classifier, logistic regression, is the simplest and most linear of the classifiers. In the results below, this model is referred to as "Logistic_class" and is a generalized linear model with a binomial distribution. An independent classifier was created for each of the simulated sources, and each model was trained to indicate the presence or absence of that source. Much like the StepLM function described earlier, the terms for this model were selected by stepwise regression, with the difference being that the terms here were gas concentrations rather than sensor signal values. Because the logistic function (the inverse of the logit link) maps the output of a linear function to a value from 0 to 1, the output of a logistic regression is often interpreted as the probability or confidence that the value is in the positive class. In this case, that would be the likelihood that a certain source is affecting the measured air quality. One benefit of logistic regression is that it is interpretable and computationally inexpensive relative to other, more complex and nonlinear methods.
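A per-source logistic classifier of the kind described above might look like the following sketch. It fits the weights by plain gradient descent on synthetic concentration estimates; the stepwise term selection used in the study is not reproduced, and the function names, learning rate, and iteration count are assumptions.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real value to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, learning_rate=0.5, n_iter=2000):
    """Fit intercept and weights by gradient descent on the mean log loss."""
    A = np.column_stack([np.ones(len(y)), X])
    weights = np.zeros(A.shape[1])
    for _ in range(n_iter):
        weights -= learning_rate * A.T @ (sigmoid(A @ weights) - y) / len(y)
    return weights

def predict_score(weights, X):
    """Score between 0 and 1 for the presence of the source."""
    return sigmoid(np.column_stack([np.ones(len(X)), X]) @ weights)

# Synthetic estimates of two compounds; the source is labeled present when
# the first estimate exceeds the second (an invented rule for illustration).
rng = np.random.default_rng(0)
concentrations = rng.normal(size=(300, 2))
source_present = (concentrations[:, 0] - concentrations[:, 1] > 0).astype(float)

weights = fit_logistic(concentrations, source_present)
scores = predict_score(weights, concentrations)
accuracy = float(np.mean((scores >= 0.5) == (source_present == 1)))
```

The sign and size of each fitted weight show how a compound's estimated concentration pushes the score toward presence or absence, which is the interpretability benefit noted above.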
The second classification model investigated was the support vector machine (SVM), which was trained to indicate the presence of a source. The general goal of an SVM is to create a line or plane that has the largest margin between separated classes, with some allowance for outliers and noise. In two-dimensional space, this can be visualized as creating a line that separates two classes and has the widest empty space on either side. The points closest to the line are referred to as support vectors and give the classifier its name. Because the goal of an SVM is to create a line that separates two classes, it may be considered a linear classifier, although kernel functions are often used to map this linear function to a nonlinear space. The new features created by kernel methods are generally understood as measures of similarity between instances. Two kernels were studied here, the first being a linear kernel that does not transform the input variables, and the second being a Gaussian kernel. These are referred to in the results as "SVMlin_class" and "SVMgaus_class", respectively, simply because of the names of the functions written to implement them. Both kernels were implemented to predict the presence or absence of each source independently, just as described for logistic regression above. For the SVM with a Gaussian kernel, the hyperparameters controlling the kernel scale and box constraint were optimized automatically to reduce cross-validated misclassification errors. These two factors affect how "smoothed" the kernel is and how heavily the loss function is penalized for errors, respectively.
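The margin idea and the role of the box constraint can be illustrated with a linear soft-margin SVM trained by subgradient descent. This is a simplified sketch on synthetic data, not the solver used in the study: the objective below scales the box constraint by the mean (rather than the sum) of the hinge losses, labels are coded as -1 and +1, and the function name is hypothetical.

```python
import numpy as np

def fit_linear_svm(X, y, box_constraint=1.0, learning_rate=0.1, n_iter=1000):
    """Minimize 0.5 * ||w||^2 + C * mean(hinge loss) by subgradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    n = len(y)
    for _ in range(n_iter):
        margins = y * (X @ w + b)
        active = margins < 1  # points inside the margin or misclassified
        grad_w = w - box_constraint * (y[active, None] * X[active]).sum(axis=0) / n
        grad_b = -box_constraint * y[active].sum() / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Two synthetic, well-separated clusters: source present (+1) and absent (-1).
rng = np.random.default_rng(0)
present = rng.normal(loc=2.0, scale=0.7, size=(100, 2))
absent = rng.normal(loc=-2.0, scale=0.7, size=(100, 2))
X_demo = np.vstack([present, absent])
y_demo = np.concatenate([np.ones(100), -np.ones(100)])

w, b = fit_linear_svm(X_demo, y_demo)
predicted = np.sign(X_demo @ w + b)
accuracy = float(np.mean(predicted == y_demo))
```

A larger box constraint penalizes margin violations more heavily, which narrows the margin, while a smaller value tolerates more outliers in exchange for a wider margin.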
Next, a random forest classification model was trained to identify the presence of each source, referred to in the results as "RandFor_class". Although random forests can quite easily be applied to multiclass classification problems, models were trained to predict the binary presence of each source separately, so that together, the models could predict the presence of multiple sources at the same time without creating many classes representing all possible combinations of sources. The hyperparameters that were optimized for this model were the same as for the random forest regression models: minimum leaf size and maximum number of splits. These random forests represent a collection of 500 decision trees that are each trained on different subsets of the training data. In this case, those subsets were selected via bootstrap aggregation (bagging), wherein data are randomly sampled with replacement from the original training dataset. Each of these trees produces a prediction of the presence or absence of the source that it was trained on. The "score" that was output to indicate the confidence that a source was present represents the fraction of trees within each forest that predicted that the source was present.
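The per-source score described above, the fraction of trees in a forest that vote "present", can be sketched as follows. The votes are invented and use 10 trees per forest instead of 500 to keep the example readable; the function names are hypothetical.

```python
SOURCES = ["mobile emissions", "biomass burning", "natural gas emissions", "gasoline vapors"]

def forest_score(tree_votes):
    """Fraction of trees in one forest that predicted the source was present."""
    return sum(tree_votes) / len(tree_votes)

def predict_sources(votes_by_source, threshold=0.5):
    """Score each source with its own forest and list the sources judged present."""
    scores = {name: forest_score(votes) for name, votes in votes_by_source.items()}
    detected = [name for name in votes_by_source if scores[name] >= threshold]
    return scores, detected

# Invented votes (1 = present, 0 = absent) from one forest per source at one timestep.
votes_by_source = {
    "mobile emissions":      [1, 1, 1, 1, 0, 1, 1, 0, 1, 1],
    "biomass burning":       [0, 0, 1, 0, 0, 0, 1, 0, 0, 0],
    "natural gas emissions": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    "gasoline vapors":       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}
scores, detected = predict_sources(votes_by_source)

# A single multiclass model would instead need one class per combination of sources.
n_combinations = 2 ** len(SOURCES)
```

With four sources, the separate binary forests replace a single model that would otherwise need 16 classes to cover every combination.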
The last classification approach that was explored was a pattern recognition neural network, referred to in the results as "NeurNet_class". This neural network had one output node for each source and was, therefore, able to predict the presence of multiple sources independently and simultaneously. The hyperparameters that were optimized for this model were the number of layers (1 or 2), the number of nodes in each layer, and the transfer function that translates the inputs to outputs. In this case, a neural net with two layers of five nodes using the "softmax" transfer function was selected.

Figure S2. Estimated versus reference concentrations are plotted on the next page for each combination of gas and regression technique. Each column of plots contains estimates for a given gas, and each row contains estimates for a given regression technique. The color of each point indicates the cross-validation fold that was used for training and testing the model, and the shade (light versus dark) indicates whether the values were estimated on training data or testing (validation) data.