Assessing residential PM2.5 concentrations and infiltration factors with high spatiotemporal resolution using crowdsourced sensors

Significance Until recently, residential studies of indoor fine particulate matter (PM2.5) have been limited to small numbers of homes observed for short periods of time. Advances in low-cost sensors and the advent of crowdsourced data collection have enabled thousands of homes to be studied over extended periods. To understand PM2.5 spatiotemporal variability across the residential building stock, this study uses more than 10,000 monitor-years of PM2.5 data acquired in or near 3,977 US residences. We quantify how indoor and outdoor sources contribute to residential PM2.5 by time of day, time of year, and climate. We find substantial variability related to occupant behavior, outdoor climate, and building conditions. By understanding such drivers, occupants and building designers can take steps to reduce residential exposure to PM2.5.


EXTENDED MATERIALS AND METHODS Notation.
o = outdoor PM2.5 concentration  i = indoor PM2.5 concentration  io = indoor PM2.5 concentration of outdoor origin  ii = indoor PM2.5 concentration of indoor origin  iie = indoor PM2.5 concentration of indoor episodic origin  iip = indoor PM2.5 concentration of indoor persistent origin  tot = first-order loss-rate coefficient for episodic particle emissions  ̅ tot = mean of loss rate coefficients for episodic particle emissions at one residence  inf = infiltration factor RCS = random component superposition NSI = National Structure Inventory PM2.5 = mass concentration of particles smaller than 2.5 micrometers in diameter Potential exposure = the time-integrated concentration of a hypothetical occupant who is always inside the studied residence.Data acquisition.Our process began by selecting all indoor PurpleAir monitors with at least one outdoor monitor within 5 km, yielding 5617 indoor monitors.We gathered building metadata for the closest structure from the National Structure Inventory (US Army Corps of Engineers, www.hec.usace.army.mil/confluence/nsi)using the GPS coordinates associated with each indoor monitor.We removed monitors from the analysis set if the closest structure in the National Structure Inventory was more than 50 m away, yielding 4884 indoor monitors.We further restricted the analysis to buildings specifically identified as residences, reducing the dataset to 4075 indoor monitors.Finally, we removed 98 indoor monitors that contained less than one month of paired indoor-outdoor data or appeared to be outdoor monitors mislabeled as indoors.Mislabeled monitors were identified by regressing the PM2.5 concentrations reported by an identified indoor monitor against those of the nearby outdoor monitors.Monitors coded as "indoor" whose regressions returned slopes greater than 0.9 and coefficients of determination (R 2 ) greater than 0.85 relative to the nearby outdoor monitors were redefined as being outdoors and excluded from the analysis.The final dataset included 3977 indoor monitors with at least 1-month of paired indoor and outdoor concentrations at 10-min time resolution, with the indoor monitor specifically monitoring a residence in the contiguous United States.Outdoor PM2.5 concentrations were defined as the arithmetic mean of all outdoor monitors within 5 km of each indoor monitor for each time point.We followed the QA/QC procedures reported by O'Dell et al. (1) for the Plantower sensors used by PurpleAir monitors.First, we removed indoor datapoints with PM2.5 concentration outside the range 0-500 μg m -3 using the default CF1 calibration provided by Plantower, which corresponded to 0.15% of the original indoor dataset.Second, we removed data points where reported temperatures and humidities were outside of standard operating ranges (14-140 °F and 0-99% relative humidity), corresponding to 0.02% of the original indoor dataset.Third, we removed datapoints for monitors with two sensors in the same device when the two sensors differed by > 10 μg m -3 (absolute concentrations < 100 μg m -3 ) or > 10% (absolute concentrations ≥ 100 μg m -3 ), which corresponded to 0.14% of the original indoor dataset.This step was omitted for monitors with only one sensor in the device.For monitors with more than one sensor, values are reported as the average of both sensors.Finally, we omitted indoor datapoints for which there was no concurrent outdoor measurement.In total, 1.2% of the original indoor PM2.5 dataset was excluded from analysis.The Plantower particle sensor reports number concentrations for six size ranges based on minimum particle size (i.e., >0.3 µm, >0.5 µm, >1 µm, >2.5 µm, >5 µm and >10 µm) and includes proprietary algorithms to calculate and report mass concentrations of PM1, PM2.5 and PM10.Correction factors to these proprietary algorithms have been developed by comparing outdoor PurpleAir monitors with outdoor EPA monitoring stations, some of which account for changes in humidity.Open algorithms have also been developed that calculate PM2.5 from particle number concentrations and have similarly been validated against outdoor EPA monitoring stations.In this work, we quantified data using the "ALT" algorithm (2,3) because it shows the best performance for low concentrations.We note that the ALT calibration used in this work was validated against outdoor PM2.5 monitoring stations; it has not been validated in indoor settings.A related concern is that the variable particle densities of outdoor aerosol (higher density), cooking aerosol (lower), and resuspended aerosol (higher) are not considered in this work or in calibration, potentially biasing results.We also highlight that the exact locations of PurpleAir monitors within residences are not known and may not be fully representative of conditions within the home, such as a monitor placed in a closed bedroom during kitchen cooking.Apparent loss from dilution may cause loss rates to be overestimated if monitors are placed near a source.Considering potential exposures, the concentrations measured in residences will tend to underestimate the concentration directly at the breathing zone due to proximity and personal cloud effects (4).Finally, the performance of Plantower sensors is known to degrade over time (5).This feature is difficult to evaluate for models of PurpleAir monitors with only one Plantower sensor.While temperatures and humidities reported by the PurpleAir monitors correlate well with reference monitors, such parameters are biased slighter warmer and drier than reference monitors (6), impacting conditions reported in Figures S4-S6.We collected other building metadata from the National Structure Inventory, including the structure type (commercial, residential single-family, residential multi-family), structure floor area (square footage), structure replacement value (in dollars), and the median year built of structures within the same census block, among other parameters.We also identified climate zones as defined by US DOE's Building America Program (US Department of Energy, https://www.energy.gov/eere/buildings/climate-zones)using the geographic coordinates of each monitor.The Building America Project defines seven climate zones based on temperature and precipitation: "Very Cold," "Cold," "Marine," "Mixed-Dry," "Mixed-Humid," "Hot-Dry," and "Hot-Humid."We collected population density data for 40,954 US zip codes (https://www.fourfront.us/data/datasets/us-population-density/).To connect these population density estimates with residential monitors, we identified the zip code of each residential monitor using the monitor's GPS coordinates and the Nominatim API for OpenStreetMap (nominatim.openstreetmap.org).When Nominatim was unable to return a US zipcode or returned an invalid zip code, we manually identified the monitor's zipcode using Google Maps.PurpleAir monitors report data times in UTC.To identify local time, we linked GPS coordinates of each monitor to the closet US city with known time zone.The time zones of US cities were acquired from the GeoNames database (http://download.geonames.org/export/dump/cities15000.zip).Time series analysis.A semiquantitative algorithm was developed to identify the start bound of an event and the stop bound when the event's impact has concluded.The algorithm was visually assessed to ensure satisfactory performance (Figure S1).The event start bound was defined as the transition point from negative or fluctuating first derivative to predominantly positive first derivative.The algorithm begins by scanning point-by-point leftward (back in time) from the event concentration peak towards the event start.Each step corresponds to 10 min.During iteration, two counters mark the cumulative number of datapoints with negative first derivative, denoted a, and the number of consecutive datapoints with negative first derivative, denoted b.The second counter, b, increases by one with an instance of negative derivative and resets to zero with an instance of positive first derivative.For well-behaved events, the first counter, a, remains near zero until the event start has been reached.Iteration continues until the first counter, a, reaches a value of seven.To avoid going beyond the event start bound, the event start bound is defined as b datapoints to the right of the iteration stop datapoint.The event stop bound is similarly defined as the transition point from predominantly negative first derivative to positive or fluctuating first derivative and implements the same algorithm.The algorithm begins by iterating rightward (forward in time) from the event concentration peak.Two counters mark the cumulative number of datapoints with positive first derivative, denoted c, and the number of consecutive datapoints with positive first derivative, denoted d.Iteration continues until the first counter, c, reaches a value of seven.The event stop bound is then defined as d datapoints to the left of the iteration stop datapoint.Indoor baseline concentrations with events removed were estimated by interpolating between the event start and stop bounds.Additionally, for time periods where indoor baseline concentrations exceeded outdoor concentrations, the outdoor concentration was defined as the indoor baseline.We manually examined hundreds of weeks of data to ensure satisfactory performance.Event start and stop bounds were ambiguous for a notable minority of peaks, especially for small peaks during times with large outdoor concentration.A prior study (7) defined concentration thresholds of greater than 30 μg m -3 and width thresholds of less than 4 h to be of unambiguous indoor origin.Indoor emission peaks with widths larger than 4 h are rarely present and are often indicative of an extended outdoor emission source such as a wildfire.However, many indoor source events, such as those related to resuspension or light cooking are be captured when using the 30 μg m -3 concentration threshold.We conducted sensitivity testing to determine whether the 30 μg m -3 threshold may underestimate the total contributions of indoor episodic emissions to indoor PM2.5.Averaging over the total dataset, the 30 μg m -3 threshold captures 90.3% of the total integrated peak area when compared to a threshold of 5 μg m -3 .One concern may be potential overcounting where episodic outdoor emission events lead to smaller episodic indoor concentration enhancements that are misidentified as being of indoor origin.This concern is ameliorated by the time-lag for outdoor concentrations to penetrate indoors, thereby dampening peak height and making misidentification less likely.Furthermore, visual analysis suggests that while outdoor emission events leading to misattributed indoor concentration enhancements do occur, they are uncommon and the potential bias smaller in magnitude than would occur with a 30 μg m -3 peak threshold for most time periods.The potential bias may be important during wildfire events with high outdoor PM2.5 concentrations.
To assess particle loss rates, we calculated the first-order loss-rate coefficient (tot) of emission events with well-behaved decay curves.Specifically, we selected all peaks with a prominence of at least 5 μg m -3 and a peak duration between 20 min and 6 h.The start of the decay curve fit region was defined as the event concentration maximum.The range of the decay curve fit region was defined as three times the peak width, where the peak width is the full horizontal peak width at half-prominence peak height.For each selected peak, we calculated the exponential decay using measurement points from the peak maximum to the measurement point that was 3× the peak width after the peak maximum (Figure S12).Peak fits with R 2 > 0.8 were defined as well-behaved while peaks with R 2 < 0.8, signifying poor fit quality, were rejected from further analysis as poorly behaved.The loss-rate coefficient (tot) has contributions from indoor-to-outdoor transport, deposition, and, potentially, filtration.A mass balance equation is shown in Equation S1 for the decay of an emission event without source terms, where  ind is the indoor concentration,  is the airchange rate,  dep is the particle deposition rate,  filt is the particle filtration rate, and () is a modifying function that is 0 when filtration is inactive and 1 when filtration is active.This simplified mass balance equation assumes that source terms (persistent indoor emissions and outdoor-to-indoor transport) are small after major indoor emission events and that the episodic emission source has completely ended.
(S2) Peaks were integrated yielding peak area in units of μg m -3 h.We multiplied the integral area by either (a) their specific loss-rate coefficient for well-behaved peaks or (b) the average loss-rate coefficient over the monitor's total dataset for poorly behaved peaks, yielding the instantaneous concentration enhancement in units of μg m -3 .This value corresponds to the expected concentration increase if event emissions occurred instantaneously in a well-mixed interior volume.We multiplied this value by building volume to obtain episodic emission strengths in units of mg.Residence volumes were estimated from structure floor area (square footage) as reported by the National Structure Inventory and an assumed floor height of 2.4 m.Because the National Structure Inventory reports total building area rather than residential area, emission mass analysis was restricted to single-family residences (3458 of 3977 residences) where it is more justifiable to apply the well-mixed assumption over the total residential volume.Random component superposition analysis.Random component superposition (RCS) analysis ( 8) was used to estimate infiltration factors and PM2.5 attributable to indoor sources.Briefly, the method assumes that indoor PM2.5 is linearly composed of indoor sources and outdoor PM2.5, with the latter scaled by an attenuation factor related to penetration and deposition rates.It is further assumed that the indoor and outdoor sources are independent.
Then, fitted parameters from the regression of indoor PM2.5 concentrations against outdoor PM2.5 concentrations are interpreted as the infiltration factor (slope =  inf ) and the concentration attributable to indoor sources (intercept =  ii ).RCS analysis was performed for each indoor monitor separately, using 1-day time averaged concentrations on datasets with at least 27 days of data.We also conducted a modified RCS analysis by applying the same procedure on an excised indoor concentration time series where episodic peaks were removed.Fitted parameters from the regression of excised indoor PM2.5 against outdoor PM2.5 can be interpreted as the infiltration factor (slope =  inf ) and the indoor source concentration attributable to persistent indoor emissions (intercept =  iip ).A visual demonstration of both methods is provided in Figure S13.We conducted two tests to ensure internal validity of the modified RCS model.These tests are not an external validation of the method.First, the difference between  ii as returned by RCS and  iip as returned by modified RCS should be  iie .A scatterplot of  ii −  iip and  iie demonstrates nearly 1-to-1 comparability (Figure S14).Second, we compare infiltration factors returned by RCS and modified RCS.A scatterplot of  inf (RCS) and  inf (modified RCS) similarly demonstrates nearly 1-to-1 comparability (Figure S15) when only considering physical values between 0 and 1 (coefficient of determination = 0.86, linear fit slope = 0.83).We note that the performance of modified RCS is seemingly improved, with all values of  inf bounded at appropriate physical limits (between 0.003 and 1.002), whereas the original RCS method yields outliers with non-physical values, spanning a range from -2.2 to 7.0.Additionally, the median coefficient of determination (R 2 ) during analysis improved from 0.27 (RCS) to 0.58 (modified RCS) and the median root mean square error improved from 2.3 μg m -3 (RCS) to 1.1 μg m -3 (modified RCS).A final consideration is the potential for systemic errors in relation to the regression method.When uncertainty is present in both the independent and dependent variables, as is the case in RCS analysis, ordinary least squares regression can underestimate the slope and overestimate the intercept (9).We therefore apply orthogonal distance regression (Deming regression with δ = 1) in modified RCS analysis to avoid this potential bias.A comparison of orthogonal distance regression against ordinary least squares regression finds that the population mean of  inf increased by 6% (0.26 to 0.28, R 2 = 0.99) and population mean of  iip decreased by 14% (0.63 to 0.55 μg m -3 , R 2 = 0.97) across all indoor-outdoor pairs.Values of  inf remain bounded by the appropriate physical limits.Because analysis by modified RCS with orthogonal distance regression appears more robust than traditional RCS, we (a) report  inf values from modified RCS with orthogonal distance regression and (b) report  ii as the sum of  iie and  iip throughout the work unless otherwise specified.As a time-integrated mechanistic model, RCS analysis makes a simplifying assumption that indoor source contributions and the infiltration factor are independent.The model does not consider how concentration-dependent variations in particle composition and phase-change phenomena may systematically bias results.For example, ammonium nitrate and other aerosols can partially evaporate upon transport to the indoors (10,11,12).If these processes occur more readily during periods of high outdoor concentrations than low outdoor concentrations, the net effect would be to flatten the slope (decrease  inf ) and increase the intercept (increase  ii or  iip ) in either RCS or modified RCS analysis.Ammonium nitrate aerosols are a substantial contributor to outdoor PM2.5 concentrations, especially in urban areas during wintertime when outdoor PM2.5 concentrations are high (13,14).We observe the highest  iip concentrations in wintertime.While it is plausible that elevated  iip concentrations are related to wintertime tightening of the building envelope (e.g., from less window use), alternative hypotheses such as concentration-dependent nitrate evaporation may also contribute.Phase-change phenomena influencing the indoor-outdoor fine particle relationship are potentially important for nitrates, semivolatile organics, and water.Similar effects related to concentration-dependent calibration biases may also influence RCS model results.) are reported in units of μg m -3 .The infiltration factor,  inf , and the fractions of indoor PM2.5 of episodic indoor origin, persistent indoor origin, and total indoor origin, that is  iie / i ,  iip / i , and ( iie +  iip )/ i are unitless.The mean of well-behaved loss-rate coefficients ( ̅ tot ) is reported in units of h -1 .Monitors that did not report data during this season were excluded from analysis.) are reported in units of μg m -3 .The infiltration factor,  inf , and the fractions of indoor PM2.5 of episodic indoor origin, persistent indoor origin, and total indoor origin, that is  iie / i ,  iip / i , and ( iie +  iip )/ i are unitless.The mean of well-behaved loss-rate coefficients ( ̅ tot ) is reported in units of h -1 .Monitors that did not report data during this season were excluded from analysis.) are reported in units of μg m -3 .The infiltration factor,  inf , and the fractions of indoor PM2.5 of episodic indoor origin, persistent indoor origin, and total indoor origin, that is  iie / i ,  iip / i , and ( iie +  iip )/ i are unitless.The mean of well-behaved loss-rate coefficients ( ̅ tot ) is reported in units of h -1 .Monitors that did not report data during this season were excluded from analysis.) are reported in units of μg m -3 .The infiltration factor,  inf , and the fractions of indoor PM2.5 of episodic indoor origin, persistent indoor origin, and total indoor origin, that is  iie / i ,  iip / i , and ( iie +  iip )/ i are unitless.The mean of well-behaved loss-rate coefficients ( ̅ tot ) is reported in units of h -1 .Monitors that did not report data during this season were excluded from analysis.) are reported in units of μg m -3 .The infiltration factor,  inf , and the fractions of indoor PM2.5 of episodic indoor origin, persistent indoor origin, and total indoor origin, that is  iie / i ,  iip / i , and ( iie +  iip )/ i are unitless.The mean of well-behaved loss-rate coefficients ( ̅ tot ) is reported in units of h -1 .Regions correspond Building America climate zone designations subdivided by western states ("West": MT to NM and westward) and states to the east ("East": remainder of the contiguous US).
FigureS1.A representative concentration time series is displayed.Indoor concentrations captured by the semiquantitative peak-finding algorithm using a 30 μg m -3 peak prominence threshold (or when indoor concentrations are higher than outdoors) are shown in red.Additional peaks captured by 10 and 5 μg m -3 thresholds are shown in orange and blue, respectively.The indoor baseline is displayed in black and outdoor concentrations are displayed in green.Integrated areas are reported in arbitrary units (a.u.).

Figure S7 .
Figure S7.Summary statistics for mean particle loss-rate coefficients and calculated infiltration factors for residences.Individual residences were linked to the median year of construction of all structures within a residence's census block (not limited to those with PurpleAir data) and then grouped by decade.Panels A (particle loss rates) and C (infiltration factors) present the median values of each decadal grouping with the y-axis spanning the minimum and maximum values.Panels B (particle loss rates) and D (infiltration factors) present the same data with the addition of the interquartile range and with the y-axis beginning at y=0.

Table S1 .
Location of monitors in climate zones and states.

Table S2 .
Key summary statistics and major measurement results during the spring season (March, April, May). a Concentration values ( i ,  iie ,  o ,  ii ,  iip

Table S3 .
Key summary statistics and major measurement results during the summer season (June, July, August).a Concentration values ( i ,  iie ,  o ,  ii ,  iip

Table S4 .
Key summary statistics and major measurement results during the fall season (September, October, November).a Concentration values ( i ,  iie ,  o ,  ii ,  iip

Table S5 .
Key summary statistics and major measurement results during the winter season (December, January, February).a Concentration values ( i ,  iie ,  o ,  ii ,  iip

Table S6 .
Key summary statistics for values across climate zones in the western and eastern regions of the United States.a Concentration values ( i ,  iie ,  o ,  ii ,  iip

Table S7 .
Key summary statistics and major measurement results for residences binned by 425 population density.a 426 Concentration values ( i ,  iie ,  o ,  io ,  iip ,  ii ) are reported in units of μg m -3 .The 427 infiltration factor,  inf , and the fractions of indoor PM2.5 of indoor episodic origin, indoor 428 persistent origin, and total indoor origin, that is  iie / i ,  iip / i , and ( iie +  iip )/ i , are 429 unitless.The mean of well-behaved particle loss-rate coefficients at a residence ( ̅ tot ) is 430 reported in units of h -1 . a