Detecting plumes in mobile air quality monitoring time series with Density-based Spatial Clustering of Applications with Noise

Mobile monitoring is becoming an increasingly popular technique to assess air pollution on fine spatial scales, but methods to determine specific source contributions to measured pollutants are sorely needed. One approach is to isolate plumes from mobile monitoring time series and analyze them separately, but methods that are suitable for large mobile monitoring time series are lacking. Here we discuss a novel method used to detect and isolate plumes from an extensive mobile monitoring data set. The new method relies on Density-based Spatial Clustering of Applications with Noise (DBSCAN), an unsupervised machine learning technique. The new method systematically runs DBSCAN on mobile monitoring time series by day and identifies a subset of points as anomalies for further analysis. When applied to a mobile monitoring data set collected in Houston, Texas, the analyzed anomalies reveal patterns associated with different types of vehicle emission profiles. We observe spatial differences in these patterns and reveal striking disparities by census tract. These results can be used to inform stakeholders of spatial variations in emission profiles that are not obvious from stationary monitor data alone.


Supplemental Information
Section S1. Temporal rescaling procedure for census tract comparisons. Section S2. Anomaly type detection probability error estimation procedure.

S1 Temporal Rescaling Procedure for Census Tract Comparisons
To remove temporal effects from census tract comparisons of anomaly type detection probability, we perform a rescaling procedure. We transform each census tract's sampling distribution into a uniform distribution, then multiply each hour of the newly transformed uniform distribution by the fraction of detected anomalies in that hour.
Out of 35 census tracts sampled in the Houston area, we restrict our analysis to 19 to ensure that each hour between 8 AM and 4 PM CST had at least 1,000 samples for each individual census tract. The lowest number of samples in any given hour for a census tract was 1,061, which equates to ≈ 17 minutes of sampling. For each census tract, we calculate the average number of samples per hour by dividing the total number of samples by 8, the number of analyzed hours. We also calculate, for each hour in each census tract, the fraction of that hour's measurements that are of a given anomaly type ("CO2-Rich", "Transition", "BC/UFP-Rich"). In the final step, we multiply the hourly fraction of each anomaly type by the average number of measurements per hour for the census tract and then sum the results. To determine the % probability of detection for a given anomaly type, we divide these weighted totals by the number of measurements made within the census tract.
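The rescaling steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in this study: the function name `rescaled_detection_probability`, its arguments, and the toy data in the usage note are all hypothetical. Each measurement is assumed to carry an integer hour of day and a flag that marks whether it belongs to the anomaly type of interest.

```python
from collections import defaultdict


def rescaled_detection_probability(hours, is_anomaly, analyzed_hours=range(8, 16)):
    """Percent probability of detecting one anomaly type after temporal rescaling.

    hours          -- hour of day (int) of each measurement (hypothetical input)
    is_anomaly     -- True where the measurement is of the anomaly type
    analyzed_hours -- hours included in the analysis (8 hours by default)
    """
    analyzed = list(analyzed_hours)
    n_samples = defaultdict(int)
    n_anomalies = defaultdict(int)
    for hour, flag in zip(hours, is_anomaly):
        if hour in analyzed:
            n_samples[hour] += 1
            n_anomalies[hour] += int(flag)

    total = sum(n_samples.values())
    if total == 0:
        raise ValueError("no measurements fall within the analyzed hours")

    # Average number of samples per hour: the "uniform" sampling distribution.
    mean_per_hour = total / len(analyzed)

    weighted_total = 0.0
    for hour in analyzed:
        if n_samples[hour] == 0:
            raise ValueError(f"hour {hour} has no measurements")
        # Fraction of this hour's measurements that are of the anomaly type.
        hourly_fraction = n_anomalies[hour] / n_samples[hour]
        weighted_total += hourly_fraction * mean_per_hour

    # Divide by the number of measurements in the tract and express as a percent.
    return 100.0 * weighted_total / total
```

With toy data in which one hour is oversampled, for example six measurements at hour 8 (one anomaly) and two at hour 9 (both anomalies), the unscaled probability is 3/8 = 37.5%, while the rescaled value is the mean of the two hourly fractions, about 58.3%. Algebraically, the procedure reduces to averaging the hourly anomaly fractions, so each analyzed hour carries equal weight regardless of how heavily it was sampled.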

Figures S2 and S3 display the effects of implementing the rescaling procedure on the calculated probabilities of anomaly detection for the 19 census tracts. In general, implementing the rescaling procedure produces modest increases in these probabilities for most census tracts. A notable exception is the North Rice polygon for CO2 anomaly detections. Figure S4 displays the (a) total sampling distribution and (b) anomaly sampling distribution for the North Rice polygon. We note that the 8 AM hour was oversampled relative to the other hours and argue that implementing the rescaling procedure decreases the influence of this hour relative to other sampling times in the census tract.

S2 Anomaly Type Detection Probability Error Estimation Procedure
We provide error estimates of our calculated anomaly type detection probabilities and present them in Tabs. S3, S4, and S5. To do this, we apply the bootstrap to each anomaly type detection probability for each census tract to generate sampling distributions (Efron and Tibshirani, 1994).
We create 1000 synthetic distributions for each census tract by sampling, with replacement, the measurements within each census tract. For each synthetic distribution, we calculate the probability of each anomaly detection type, repeating the same temporal rescaling procedure described in Sect. S1 1000 times for each census tract to generate 1000 probabilities of each type. From the resultant sampling distributions, we report the lower and upper bounds of the 90% confidence interval (5th to 95th percentiles), the mean, and the bias. We define bias as the difference between the originally calculated probability and the mean probability estimate from its corresponding sampling distribution (in effect, taking the difference between columns in Tab. 2 and mean columns in Tabs. S3, S4, and S5).
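The bootstrap procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used in this study: the function names, the `(hour, is_anomaly)` tuple layout, the linear-interpolation percentile rule, and the sign convention for the bias (original minus bootstrap mean) are all assumptions introduced here for illustration.

```python
import random
from statistics import mean


def rescaled_probability(samples, hours):
    """Temporally rescaled detection probability (%) for one anomaly type.

    samples -- list of (hour, is_anomaly) tuples (hypothetical layout)
    hours   -- the analyzed hours; each must contain at least one sample
    """
    fractions = []
    for h in hours:
        flags = [flag for hour, flag in samples if hour == h]
        if not flags:
            raise ValueError(f"hour {h} has no measurements")
        fractions.append(sum(flags) / len(flags))
    # Weighting every hour by the mean samples per hour and dividing by the
    # total number of samples equals the mean of the hourly fractions.
    return 100.0 * mean(fractions)


def percentile(sorted_values, q):
    """q-th percentile of an ascending list, by linear interpolation (assumed rule)."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


def bootstrap_summary(samples, hours, n_boot=1000, seed=0):
    """Bootstrap the rescaled probability: 90% interval, mean, and bias."""
    rng = random.Random(seed)
    original = rescaled_probability(samples, hours)
    estimates = []
    for _ in range(n_boot):
        # Resample the tract's measurements with replacement, keeping the size.
        resample = rng.choices(samples, k=len(samples))
        estimates.append(rescaled_probability(resample, hours))
    estimates.sort()
    boot_mean = mean(estimates)
    return {
        "original": original,
        "lower": percentile(estimates, 5),
        "upper": percentile(estimates, 95),
        "mean": boot_mean,
        # Assumed sign convention: original estimate minus bootstrap mean.
        "bias": original - boot_mean,
    }
```

A call such as `bootstrap_summary(samples, hours=range(8, 16))` would return the four quantities reported per census tract and anomaly type. The sketch raises an error if a resample leaves an analyzed hour empty; with at least 1,000 samples in every hour, as required in Sect. S1, that outcome is vanishingly unlikely.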

Figure S1. Illustration of manually flagged plumes for CO2. Points in red are labeled as plume (anomaly), while points in black are labeled as normal (non-anomaly). Ovals represent manually flagged plumes for this portion of the CO2 time series. Note that not all red points correspond to CO2 plumes; some may represent plume detections in other pollutants not shown here.

Figure S2. Effects of rescaling on the probability of CO2 anomaly type detection for each census tract (green/left bar for each census tract is scaled).

Figure S3. Effects of rescaling on the probability of BC/UFP anomaly type detection for each census tract (green/left bar for each census tract is scaled).

Figure S4. Sampling distributions for (a, top) all measurements and (b, bottom) anomalies in the North Rice census tract.

Figure S5. Visualizing cluster assignment on the first two principal component axes for DBSCAN-derived anomalies. Cluster 1 extends down and to the right from the origin, cluster 2 is around the origin, and cluster 3 extends up and to the right from the origin.

Figure S6. Total anomaly type counts per census tract normalized by the total number of measurements within each census tract: (a, top) CO2 and (b, bottom) BC/UFP.

Figure S7. Probability of detecting the CO2 anomaly type with highways in the analysis (green, right bar for each census tract) and without highways in the analysis (blue, left bar for each census tract).