Calibration of low-cost particulate matter sensors: model development for a multi-city epidemiological study

Low-cost air monitoring sensors are an appealing tool for assessing pollutants in environmental studies. Portable low-cost sensors hold promise to expand temporal and spatial coverage of air quality information. However, researchers have reported challenges in these sensors' operational quality. We evaluated the performance characteristics of two widely used sensors, the Plantower PMS A003 and Shinyei PPD 42 NS, for measuring fine particulate matter compared to reference methods, and developed regional calibration models for the Los Angeles, Chicago, New York, Baltimore, Minneapolis-St. Paul, Winston-Salem and Seattle metropolitan areas. Duplicate Plantower PMS A003 sensors demonstrated a high level of precision (averaged Pearson's r=0.99), and compared with regulatory instruments, showed good accuracy (cross-validated R²=0.96, RMSE=1.15 µg/m³ for daily averaged PM2.5 estimates in the Seattle region). Shinyei PPD 42 NS sensor results had lower precision (Pearson's r=0.84) and accuracy (cross-validated R²=0.40, RMSE=4.49 µg/m³). Region-specific Plantower PMS A003 models, calibrated with regulatory instruments and adjusted for temperature and relative humidity, demonstrated acceptable performance metrics for daily average measurements in the other six regions (R²=0.74–0.95, RMSE=2.46–0.84 µg/m³). Applying the Seattle model to the other regions gave weaker predictive performance, likely reflecting regional differences in meteorological conditions and particle sources. We describe an approach to metropolitan region-specific calibration models for low-cost sensors that can be used with caution for exposure measurement in epidemiological studies.


Introduction
Exposure to air pollution, including fine particulate matter (PM2.5), is a well-established risk factor for a variety of adverse health effects including cardiovascular and respiratory impacts (EPA, 2012; Brunekreef & Holgate, 2002; EPA, 2018; Pope et al., 2002). Low-cost sensors are a promising tool for environmental studies assessing air pollution exposure (Mead et al., 2013; Morawska et al., 2018; Zheng et al., 2018). These less expensive, portable sensors have potentially major advantages for research: a) investigators can deploy more sensors to increase spatial coverage; b) sensors are potentially easier to use and maintain, and require less energy and space to operate; and c) they can be easily deployed in a variety of locations (and moved from one location to another). Low-cost sensors have been proposed to stand alone or to be adjuncts to the existing federal regulatory air quality monitoring network that measures air pollution concentrations (Borrego et al., 2015; US EPA, 2013). However, optical particle sensors have demonstrated challenges that can include accuracy, reliability, repeatability and calibration (Castell et al., 2017; Clements et al., 2017). Several studies that examine specific individual sensors in laboratory environments report that accuracy in measuring pollutant concentrations varies among light scattering sensors (Austin et al., 2015; Manikonda et al., 2016). Other investigators have reported that low-cost sensors are sensitive to changes in temperature and humidity, particle composition, or particle size (Gao et al., 2015; Holstius et al., 2014; Kelly et al., 2017; Zheng et al., 2018). Recent evaluations of 39 low-cost particle monitors conducted by the South Coast Air Quality Management District's AQ-SPEC program, using both exposure chamber and field site experiments comparing low-cost monitors and regulatory reference method instruments, found that performance varies considerably among manufacturers and models (SCAQMD AQ-SPEC, 2019).
In this study, we combined data from two monitoring campaigns used to measure PM2.5 to: 1) evaluate the performance characteristics of two types of low-cost particle sensors; 2) develop regional calibration models that incorporate temperature and relative humidity; and 3) evaluate whether a region-specific model can be applied to other regions. We aimed to develop calibration models that we could use to generate exposure estimates for epidemiological analyses.

Contributing Studies and Monitoring Strategies
Air quality monitoring data were collected for the "Air Pollution, the Aging Brain and Alzheimer's Disease" (ACT-AP) study, an ancillary study to the Adult Changes in Thought (ACT) study, and "The Multi-Ethnic Study of Atherosclerosis Air Pollution Study" (MESA Air), an ancillary study to the MESA study (ACT-AP, 2019; Kaufman et al., 2012; MESA Air, 2019).
The objective of ACT-AP is to determine whether there are adverse effects of chronic air pollution exposure on the aging brain and the risk of Alzheimer's disease (AD). Low-cost monitoring at approximately 100 participant and volunteer homes in two seasons will be used in the development of spatio-temporal air pollution models. Predictions from these models will be averaged over chronic exposure windows. All low-cost monitors were co-located periodically with regulatory sites throughout the monitoring period.
MESA Air assessed the relation between subclinical cardiovascular outcomes over a 10-year period and long-term individual-level residential exposure to ambient air pollution in six metropolitan areas: Baltimore, MD; Chicago, IL; Winston-Salem, NC; Los Angeles, CA; New York City, NY; and Minneapolis-St. Paul, MN (Kaufman et al., 2012). A supplemental monitoring study to MESA Air was initiated in 2017 to support spatio-temporal models on a daily scale in order to assess relationships with acute outcomes. Between spring 2017 and winter 2019, MESA Air deployed low-cost monitors at four to seven locations per city, with half co-located with local regulatory monitoring sites. The duration of each co-located low-cost sensor's monitoring period per study is presented in the supplemental materials (SM; see Figure S1).

Monitor Characteristics
Low-cost monitors (LCMs) for both the ACT-AP and MESA studies were designed and assembled at the University of Washington. Each LCM contained identical pairs of two types of PM2.5 sensors (described below) and sensors for relative humidity, temperature, and four gases (not presented here). Additional components included thermostatically controlled heating, a fan, a memory card, a modem, and a microcontroller running custom firmware for sampling, saving, and transmitting sensor data. Data were transmitted to a secure server every five minutes.
PM sensors were selected for this study primarily based on cost (≤$25/sensor), ease of use, preliminary quality testing, availability of multiple bin sizes, suitability for outdoor urban settings, and real-time response. Both selected PM sensors use a light scattering method. The Shinyei PPD 42 NS (Shinyei Corp, 2010) uses a mass scattering technique (MST). The Plantower PMS A003 is a laser-based optical particle counter (OPC) (Plantower, 2016). The Shinyei PPD 42 NS sensor routes air through a sensing chamber that consists of a light emitting diode and a photo-diode detector that measures the near-forward scattering properties of particles in the air stream. A resistive heater located at the bottom inlet of the light chamber helps move air convectively through the sensing zone. The resulting electric signal, filtered through amplification circuitry, produces a raw signal (low-pulse occupancy) that is proportional to particle count concentration (Holstius et al., 2014). Shinyei PPD 42 NS sensors have shown relatively high precision and correlation with reference instruments but also high inter-sensor variability (Austin et al., 2015; Gao et al., 2015). The laser-based Plantower PMS A003 sensor derives the size and number of particles from the scatter pattern of the laser using Mie theory; these are then converted to an estimate of mass concentration by the manufacturer using pre-determined shape and density assumptions. The Plantower PMS A003 model provides counts of particles ranging in optical diameter from 0.3 to 10 micrometers. Different monitors that employ Plantower sensors have demonstrated high precision and correlation with reference instruments in field and laboratory experiments (Levy et al., 2018; SCAQMD AQ-SPEC, 2019); however, some (e.g., the PurpleAir) use a different Plantower model than the PMS A003.

Co-location with Air Quality System (AQS) Monitors
EPA's Air Quality System (AQS) network reports air quality data for PM2.5 and other pollutants collected by EPA, state, local, and tribal air pollution control agencies. PM2.5 is measured and monitored using three categories of methods: the Federal Reference Method (FRM); the Federal Equivalent Methods (FEMs), including the tapered element oscillating microbalance (TEOM) and beta attenuation monitor (Met-One BAM); and other non-FRM/FEM methods (US EPA, 2016b; US EPA, 2017). The FRM is a formal EPA reference method that collects a 24-hour integrated sample (12:00 AM to 11:59 PM) of particles on a filter, weighs the mass in a low humidity environment, and divides by the air volume drawn across the filter. An FEM can be any measurement method that demonstrates equivalent results to the FRM in accordance with EPA regulations. Unlike filter-based FRM measurements, most FEMs semi-continuously produce data in real time. Some FEMs (e.g., TEOM) collect particles using a pendulum system consisting of a filter attached to an oscillating glass element; as the mass on the filter increases, the fundamental frequency of the pendulum decreases. Several TEOM models have a filter dynamic measurement system that is used to account for both nonvolatile and volatile PM components. Other FEMs (e.g., BAM) measure the absorption of beta radiation by particles collected on a filter tape. Data from FEM monitors are used for regulatory enforcement purposes.
Although most new TEOM and BAM instruments have been EPA-designated as PM2.5 FEMs, some older semi-continuous TEOM or BAM monitors are still operating at NYC, Chicago, and Winston-Salem sites but are not approved by EPA as equivalent to the FRM. Data from these older semi-continuous monitors are not used for regulatory enforcement purposes. Because EPA-designated FEM monitoring data were unavailable in these locations, we included measurements from the six TEOM or BAM non-regulatory monitors (NRMs) in our study. For simplicity, we refer collectively to data from TEOM and BAM NRMs as "FEM" regardless of the EPA classification (relevant sites are marked in Table 1).
During 2017-2018, all 80 LCMs across the two studies were co-located at regulatory stations in the AQS network in seven U.S. metropolitan regions: the Seattle metropolitan area, Baltimore, Los Angeles, Chicago, New York City, the Minneapolis-Saint Paul area, and Winston-Salem (see Figure 1). Criteria for selecting reference stations for sensor deployment included: proximity to ACT-AP and MESA Air participants' homes; availability of instruments measuring PM2.5; physical space for the LCMs at the regulatory station; and cooperation of site personnel. Table 1 lists the regulatory stations included in our study, along with their location, setting, and method(s) of PM2.5 measurement (see Figure 1 for maps of AQS sites). Data from regulatory sites were obtained from the "Air Quality System" EPA web server and the Puget Sound Clean Air Agency (PSCAA) website (PSCAA, 2018; US EPA, 2018).
During 2017-2018, several LCMs were relocated from one regulatory station to another. Because of the specific study goals, MESA sensors were often monitored continuously at one site for a long period, while most ACT sensors were frequently moved from site to site. For the ACT study and the Baltimore subset of the MESA study, participant and community volunteer locations were monitored as well. These non-regulatory site data are outside the scope of this paper and will not be presented further. The monitoring period for each LCM is shown in the SM (Figure S1).

Sensor Quality Assurance and Data Monitoring
An automated weekly report was generated to flag specific data quality issues. We examined data completeness, concentration variability, correlation of duplicate sensors within an LCM, and network correlation with nearby monitors.
Five broken Plantower PMS A003 sensors were detected and replaced between March 2017 and December 2018. Several broken sensors consistently produced unreasonably high PM2.5 values (sometimes a hundred times higher than the PM2.5 average expected from the second sensor in the same box or from the reported season and location). Other broken Plantower PMS A003 sensors were identified when one sensor of an identical pair stopped sending particle monitoring data while the second sensor was still reporting.
We also examined Plantower PMS A003 sensors for drift. To determine whether the sensor output drifted during the study period, the calculated PM2.5 mass concentration data from Plantower PMS A003 sensors that were co-located for a long time period (>1 year) were binned by reference concentration (2.5 µg/m³ intervals) and then examined against reference data over time. No significant drift was found.
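One way such a binned drift check could be sketched is shown below. The text does not say how the binned data were examined over time, so the least-squares slope of the sensor-minus-reference difference within each bin is an assumption of this sketch, as are the function name `drift_slopes` and the input layout.

```python
import math
from collections import defaultdict

BIN_WIDTH = 2.5  # µg/m³ reference-concentration interval, as stated in the text


def drift_slopes(rows):
    """rows: iterable of (day_index, reference_pm25, sensor_pm25).

    Returns {bin_lower_edge: slope}, where slope is the least-squares trend
    (µg/m³ per day) of sensor minus reference within that reference bin.
    Using a linear slope as the drift summary is an assumption of this sketch.
    """
    bins = defaultdict(list)
    for day, ref, sensor in rows:
        lower = math.floor(ref / BIN_WIDTH) * BIN_WIDTH
        bins[lower].append((day, sensor - ref))

    slopes = {}
    for lower, pts in bins.items():
        if len(pts) < 2:
            continue  # a trend needs at least two days
        mean_x = sum(x for x, _ in pts) / len(pts)
        mean_y = sum(y for _, y in pts) / len(pts)
        sxx = sum((x - mean_x) ** 2 for x, _ in pts)
        if sxx == 0:
            continue  # all observations fall on the same day
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
        slopes[lower] = sxy / sxx
    return slopes
```

A slope near zero in every bin would be consistent with the absence of drift; judging significance would require a formal test that this sketch does not include.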

Analytical Methods and Modeling Decisions
Calibration models were developed using data from March 2017 through December 2018. LCMs recorded and reported measurements on a 5-minute time scale; these were averaged to the daily (12:00 AM to 11:59 PM) time scale to compare with reference data (either daily or hourly averaged to daily). Descriptive analyses performed at the early stage of examination included data completeness and exploration of operating ranges and variation that might affect sensor readings. These analyses addressed the influence of factors such as meteorological conditions, regional differences, and comparisons with different reference instruments. Multivariate linear regression calibration models were developed for each study region. The Seattle model was applied to measurements from the other regions to evaluate the generalizability of predictions from a single calibration model across regions.
2.5.1. Sensor Exclusion Criteria-Data from malfunctioning sensors were excluded from data analysis. We also excluded the first 8 hours of data after each deployment of the LCMs because occasional spikes of PM2.5 concentrations were observed as the sensors warmed up.
The completeness of the data collected from all monitors deployed between March 2017 and December 2018 was 85.6%. Completeness was estimated as the observed number of "sensor-days" of Plantower PMS A003 data (i.e., duplicate sensors within the same monitor box were counted individually) divided by the expected number of "sensor-days". The remaining 14.4% of Plantower PMS A003 data were missing for the following reasons: broken sensor or no data recorded; clock-related errors (no valid time variable); removal of the first 8 hours of deployment (see Section 2.5.1); failure in box operation (e.g., unplugged); and failure to transmit data. Data completeness per sensor was on average 84.2%, with a median of 94.7%. During the quality control screening, an additional 2.1% of the remaining sensor data were excluded from the analysis due to broken or malfunctioning sensors as described above. After all the exclusion stages, approximately 84% of the expected data were usable (although only the subset of data co-located at reference sites is used to fit calibration models).
Outlier concentrations were observed on July 4th and 5th (recreational fires and fireworks) and in the Seattle metropolitan area during August 2018 (wildfire season). We excluded these data to prevent our models from being highly influenced by unusual occurrences. In the SM, we present results from a model in the Seattle region that includes these time periods (see Table S1). After exclusions were made, we also required 75% completeness of the measurements when averaging sensor data to the hourly or daily scales.
Among the 80 LCMs that were deployed, data from 8 MESA LCMs are not included in the analyses because they were not co-located at PM2.5 regulatory stations during the time period selected for the data analysis, were co-located for less than a day, or were only co-located when the sensors were broken or malfunctioning.
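The warm-up exclusion can be illustrated with a minimal sketch. The 8-hour window comes from the text; the function name `drop_warmup`, the record layout, and the handling of readings that precede any known deployment are assumptions made here for illustration.

```python
from datetime import datetime, timedelta

WARMUP = timedelta(hours=8)  # warm-up window stated in the text


def drop_warmup(records, deployment_starts):
    """Keep readings taken at least WARMUP after the most recent deployment.

    records: list of (timestamp, value) tuples.
    deployment_starts: list of datetimes at which the monitor was (re)deployed.
    Readings that precede every known deployment are dropped in this sketch.
    """
    starts = sorted(deployment_starts)
    kept = []
    for ts, value in records:
        earlier = [s for s in starts if s <= ts]
        if not earlier:
            continue
        if ts - earlier[-1] >= WARMUP:
            kept.append((ts, value))
    return kept
```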
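The daily averaging with a 75% completeness requirement might look like the sketch below. It averages 5-minute readings directly to calendar days; whether the original processing passed through an hourly step first is not stated, so that choice, the function name `daily_means`, and the input layout are assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta

EXPECTED_PER_DAY = 24 * 60 // 5  # 288 five-minute readings per calendar day
MIN_FRACTION = 0.75              # completeness requirement stated in the text


def daily_means(readings):
    """readings: iterable of (datetime, value), with value None when missing.

    Returns {date: mean} for days that have at least 75% of the expected
    5-minute readings; days below the threshold are omitted.
    """
    by_day = defaultdict(list)
    for ts, value in readings:
        if value is not None:
            by_day[ts.date()].append(value)

    means = {}
    for day, values in by_day.items():
        if len(values) >= MIN_FRACTION * EXPECTED_PER_DAY:
            means[day] = sum(values) / len(values)
    return means
```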

Model Input Considerations-
The calibration models incorporated the following decisions.

a) Duplicate Sensors:
Each LCM included two of each type of PM2.5 sensor (see Section 2.2). Paired measurements for each sensor type were averaged when possible. Single measurements were included otherwise.
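The duplicate-sensor rule can be written as a small helper. The function name `combine_pair` and the use of `None` for a missing reading are illustrative choices, not part of the study's code.

```python
def combine_pair(first, second):
    """Average two paired readings; fall back to whichever one is present.

    Returns None when neither sensor reported a value.
    """
    present = [v for v in (first, second) if v is not None]
    if not present:
        return None
    return sum(present) / len(present)
```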

b) Regulatory Station Instruments and Available Time Scales:
For calibration purposes, we used data at co-location sites from days where both LCM and regulatory station data were available. FRM monitors provide 24-hour integrated measurements daily or once every three or six days. TEOM and BAM instruments provide hourly measurements that were averaged up to daily values. Both TEOM and BAM measurements were available at the PSCA1 and PSCA4 sites in Seattle. At these sites, only the TEOM measurement was used. The differences in reference instruments, including substitution of FEM data for FRM data in model fitting and evaluation, and a comparison of models fit on the hourly scale versus the daily scale are presented in the supplement (see SM, Table S2 and Table S3, respectively).

c) Temperature and Relative Humidity:
Our models adjusted for temperature and relative humidity (RH), as measured by the sensors in the LCMs, to account for the known sensitivity of these sensors to changes in meteorological conditions (Castell et al., 2017). Prior studies suggest a non-linear relationship between particle concentrations monitored by low-cost sensors and relative humidity (Chakrabarti et al., 2004; Di Antonio et al., 2018; Jayaratne et al., 2018). Jayaratne et al. (2018) demonstrated that an exponential increase in PM2.5 concentrations was observed at 50% RH using the Shinyei PPD 42 NS and at 75% RH using the Plantower PMS1003. We tested both linear and non-linear RH adjustments (in models already adjusting for PM instrument measurements and temperature splines). Our analysis showed similar to slightly worse model performance when correcting for RH with non-linear terms compared to a linear adjustment. Based on these results, we used a linear correction of RH for the calibration models.
The calibration models can be developed using the standard size-resolved counts or mass concentrations. For the transformation from size-resolved counts measured by particle sensors to mass concentrations, most researchers use the manufacturer's device-generated algorithm for the sensor (Castell et al., 2017; Clements et al., 2017; Crilley et al., 2018; Northcross et al., 2013). The manufacturer's algorithm for the correction factor (C) incorporates assumptions about potentially varying properties (e.g., density and shape) of the particles observed, but information on these assumptions and properties is not available. In our preliminary analyses, we evaluated both types of values (counts and mass) and transformations thereof (e.g., higher order terms and natural log). We found that linear adjustment for bin counts provided the best results. We excluded the >10 µm bin from final models since it should primarily contain coarse particles.

e) Sensor Evaluation: Plantower PMS A003 vs. Shinyei PPD 42 NS:
We developed separate models for Plantower PMS A003 and Shinyei PPD 42 NS PM2.5 mass measurements in the Seattle metropolitan area. Sensor precision was evaluated by comparing the daily averaged PM2.5 mass measurements from duplicate sensors within a box for each sensor type. The form of the models and the results of the model comparisons used to evaluate the relative accuracy of the two sensor types are provided in Figure 2b and in the SM, Equation S1.

Modeling Approach and Statistical Implementation-For our analysis, we used data from a total of 72 co-located LCMs in different regions and focused on region-specific calibration models for each sensor type (see Section 2.5.1). This allows calibration models to be tailored to regions with specific particle sources and to adjust better for factors such as meteorological conditions. We considered several calibration models for each study region to understand how regulatory instrument differences, regional factors, and other modeling choices influence calibration models. We evaluated staged regression models based on cross-validated metrics as well as residual plots, in order to find a set of predictor variables that provided appropriate adjustment for sensor and environmental factors. We developed region-specific models with just FRM data and with FRM data combined with FEM data. In addition to these two region-specific models, we considered "out-of-region" predictions made with the Seattle model to evaluate the generalizability of calibration models across regions.

a) Region-Specific Daily FRM Models:
We fit region-specific models using Plantower PMS A003 data on the daily time scale for each of the seven regions with FRM PM2.5 measurements (µg/m³) as the outcome variable. These models included as predictors the count of particles per 0.1 L of air in each size bin (µm), a linear adjustment for RH, and a temperature adjustment with B-splines (knots at 40, 55, 70 and 85 degrees Fahrenheit). Each region had the same model form:

$$
Y_i = \beta_0 + \beta_1 \mathrm{Pl}_{0.3,i} + \beta_2 \mathrm{Pl}_{0.5,i} + \beta_3 \mathrm{Pl}_{1,i} + \beta_4 \mathrm{Pl}_{2.5,i} + \beta_5 \mathrm{Pl}_{5,i} + \beta_6 \mathrm{RH}_i + \beta_7 S_1(\mathrm{Temp}_i) + \beta_8 S_2(\mathrm{Temp}_i) + \beta_9 S_3(\mathrm{Temp}_i) + \beta_{10} S_4(\mathrm{Temp}_i) + \beta_{11} S_5(\mathrm{Temp}_i) + \epsilon_i \qquad (1)
$$

for $i = 1, 2, \ldots, n$. Here, $Y_i$ is the $i$-th observation of the FRM PM2.5 measurement (µg/m³); $\beta_0, \ldots, \beta_{11}$ are regression coefficients; $\mathrm{Pl}_{k,i}$ is the Plantower bin count for bin size $k$ = 0.3, 0.5, 1, 2.5, 5 µm; $\mathrm{RH}_i$ is relative humidity; $S_k(\mathrm{Temp}_i)$ for $k = 1, \ldots, 5$ are the basis functions of the temperature splines; and $\epsilon_i$ is random error. The temperature and relative humidity values entered into the models are the raw measurements from the low-cost monitor.
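To make the structure of Equation (1) concrete, the sketch below builds the twelve regressors for one day and evaluates the linear predictor. Several details are assumptions rather than facts from the text: the spline degree is not stated, and degree 1 is used here only because four interior knots then yield five basis functions once the first column is dropped in favor of the intercept; the boundary knots (0 and 110 °F) and the function names are hypothetical, and any coefficients passed in are placeholders rather than fitted estimates.

```python
INTERIOR_KNOTS_F = [40.0, 55.0, 70.0, 85.0]  # knots stated in the text
BOUNDARY_KNOTS_F = (0.0, 110.0)              # hypothetical boundary knots
DEGREE = 1                                   # assumed; not stated in the text


def bspline_basis(x, interior, boundary, degree):
    """Evaluate all B-spline basis functions at x via the Cox-de Boor recursion.

    Returns len(interior) + degree + 1 values that sum to 1 inside the
    boundary knots; outside them every value is 0 in this sketch.
    """
    lo, hi = boundary
    t = [lo] * (degree + 1) + list(interior) + [hi] * (degree + 1)
    n = len(t) - degree - 1
    basis = [1.0 if t[j] <= x < t[j + 1] else 0.0 for j in range(len(t) - 1)]
    if x == hi:
        basis[n - 1] = 1.0  # close the last non-empty interval on the right
    for d in range(1, degree + 1):
        nxt = []
        for j in range(len(t) - 1 - d):
            left = right = 0.0
            if t[j + d] > t[j]:
                left = (x - t[j]) / (t[j + d] - t[j]) * basis[j]
            if t[j + d + 1] > t[j + 1]:
                right = (t[j + d + 1] - x) / (t[j + d + 1] - t[j + 1]) * basis[j + 1]
            nxt.append(left + right)
        basis = nxt
    return basis


def temperature_terms(temp_f):
    """S_1..S_5: the basis with its first column dropped (absorbed by beta_0)."""
    return bspline_basis(temp_f, INTERIOR_KNOTS_F, BOUNDARY_KNOTS_F, DEGREE)[1:]


def predict_pm25(beta, bin_counts, rh, temp_f):
    """Linear predictor of Equation (1).

    beta: 12 coefficients (beta_0..beta_11).
    bin_counts: counts for the 0.3, 0.5, 1, 2.5 and 5 µm bins, in that order.
    """
    x = [1.0] + list(bin_counts) + [rh] + temperature_terms(temp_f)
    return sum(b * v for b, v in zip(beta, x))
```

In R, the language used for the study's analyses, a comparable basis would typically come from a spline helper rather than a hand-written recursion; the version above only serves to show how the temperature terms enter the design row.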

b) Region-Specific Combined FRM & FEM Models:
We also developed models that combined FRM with daily averaged FEM data for each region in order to leverage a greater number of available days and locations. These models were of the same form as the region-specific daily FRM models (see Section 2.5.3(a)). Sensitivity analyses explored the comparability of models that included different reference methods, since FEM measurements are noisier than FRM measurements.

c) Application of the Seattle Metropolitan Model to Other Regions:
Out-of-region predictions using the combined FRM & FEM Seattle metropolitan model (see Section 2.5.3(b)) were generated for each region in order to explore the portability of a calibration model across regions.

Model Validation-
We evaluated the daily FRM and combined FRM & FEM models with two different cross-validation structures: 10-fold (based on time) and "leave-one-site-out" (LOSO). Model performance was evaluated with cross-validated summary measures (root mean squared error (RMSE) and R²), as well as with residual plots. The 10-fold cross-validation approach randomly partitions weeks of monitoring with co-located LCM and FRM data into 10 folds. Typically, 10-fold methods partition data based on individual observations, but using data from adjacent days to both fit and evaluate models could result in overly optimistic performance statistics. We intended to minimize the effects of temporal correlation on our evaluation metrics by not allowing data from the same calendar week to be used to both train and test the models. In LOSO validation, predictions are made for one site with a model using data from all other sites in the region. This approach most closely mimics our intended use of the calibration model: predicting concentrations at locations without reference instruments. Since Winston-Salem has only one FRM site, the LOSO analysis could not be performed there.
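The week-blocked fold assignment and the two summary measures could be sketched as follows. Treating a "calendar week" as an ISO week, the round-robin assignment after shuffling, and computing R² as 1 - SSE/SST are assumptions of this sketch; the function names are illustrative.

```python
import math
import random
from datetime import date


def week_key(day):
    """Identify the calendar week of a date (ISO year and ISO week number)."""
    iso = day.isocalendar()
    return (iso[0], iso[1])


def assign_week_folds(days, k=10, seed=0):
    """Randomly assign whole weeks to k folds so that no week is split.

    Returns {day: fold_index}; days in the same week share a fold.
    """
    weeks = sorted({week_key(d) for d in days})
    random.Random(seed).shuffle(weeks)
    fold_of_week = {w: i % k for i, w in enumerate(weeks)}
    return {d: fold_of_week[week_key(d)] for d in days}


def rmse(observed, predicted):
    """Root mean squared error of the held-out predictions."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """1 - SSE/SST; undefined when the observed values are all identical."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / sst
```

For each fold, a model would be fit on the days assigned to the other nine folds and evaluated on the held-out days, with the held-out predictions pooled before computing the two measures.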
A "leave-all-but-one-site-out" (LABOSO) cross-validation design was also performed to assess the generalizability of calibration models based on data from a single site. Models developed using data from a single site were used to predict observations from the remaining sites.
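The two site-based designs differ only in which side of the split is used for fitting, as the following sketch shows. The function names and the representation of the data as a list of site labels (one per observation) are illustrative assumptions.

```python
def loso_splits(site_of_row):
    """Leave-one-site-out: fit on all other sites, predict the held-out site.

    site_of_row: list giving the site label of each observation.
    Yields (site, train_indices, test_indices).
    """
    for site in sorted(set(site_of_row)):
        train = [i for i, s in enumerate(site_of_row) if s != site]
        test = [i for i, s in enumerate(site_of_row) if s == site]
        yield site, train, test


def laboso_splits(site_of_row):
    """Leave-all-but-one-site-out: fit on a single site, predict the rest."""
    for site, others, own in loso_splits(site_of_row):
        yield site, own, others
```

With only one site in a region, `loso_splits` yields an empty training set, which matches the observation that LOSO could not be performed for Winston-Salem.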
All statistical analyses were conducted in R version 3.6.0.
Table 2 summarizes the co-location of the LCMs with regulatory sites. Certain regulatory sites had the same monitor co-located for a long period of time, and other sites had monitors rotated more frequently. The PSCA3 (Beacon Hill) site in Seattle often had many LCMs co-located at the same time. The highest average concentrations of PM2.5 during 2017-2018 were observed at the LA sites (12.4-13.7 µg/m³), along with the highest average temperatures (66-67°F). The largest difference in average PM2.5 concentrations across sites was observed in the Seattle metropolitan region (5.6-9.3 µg/m³).
Table 3 presents the Pearson correlation between FEM and FRM reference instruments at all sites where both types of instruments are available. The correlations of FRM with BAM are high at most sites, with only Chicago Lawndale (C004) having considerably lower correlation (r=0.78, RMSE=4.23 µg/m³). Reference stations with TEOM showed a somewhat lower correlation with FRM at the NYC sites (r=0.84-0.88, RMSE=2.14-2.62 µg/m³) compared to the Winston-Salem and Seattle metropolitan regions (r=0.95-0.99, RMSE=0.71-1.82 µg/m³). The results justify the use of FRM data whenever possible, but the generally good correlations tend to support the incorporation of FEM data into the models in cases where FEM monitoring instruments have greater temporal and/or spatial coverage than FRM instruments.

Sensor Precision and Accuracy
Plots comparing the calculated daily average PM2.5 measurements from duplicate Plantower PMS A003 sensors within a box and daily averaged raw sensor readings from duplicate Shinyei PPD 42 NS sensors are provided in Figure 2a. After removing invalid data, the manufacturer-calculated PM2.5 mass measurements from all duplicate Plantower PMS A003 sensors within a box had a mean Pearson correlation of r=0.998 (min r=0.987, max r=1), compared with a mean correlation of r=0.853 (min r=0.053, max r=0.997) for the raw readings from all Shinyei PPD 42 NS sensors. Some Shinyei PPD 42 NS sensors demonstrated a strikingly poor correlation between duplicate sensors within a monitor. In addition to the weaker precision, quality control was more difficult with Shinyei PPD 42 NS sensors, as it was harder to distinguish concentration-related spikes from the large amounts of sensor noise. For this reason, during our quality control procedures we removed all data above a particular threshold that was defined during preliminary analysis (raw measurement hourly averages above 10, which is almost certainly an incorrect raw measurement). We also removed data if either of the two paired sensors had a value 10 times greater than the other.
Comparison of calibration models developed using calculated PM2.5 mass measurements with each sensor type, fit with Seattle combined FRM & FEM data, demonstrated that the Plantower PMS A003 sensors performed better than the Shinyei PPD 42 NS sensors. The results of this analysis are given in Figure 2b. Based on the precision and accuracy evaluations of both sensors, the primary analysis that follows is limited to Plantower PMS A003 data.
In Table 4, we present calibration model performance summaries in each of the seven study regions adjusted for temperature and RH. The coefficient estimates for the region-specific models are also presented in the supplement (see Figure S2 and Tables S4a and S4b). The region-specific models fit with FRM data have strong performance metrics with 10-fold cross-validation (R²=0.80-0.97; RMSE=0.84-2.26 µg/m³). The LOSO measures are less strong (R²=0.72-0.94; RMSE=1.09-2.54 µg/m³), most notably for Seattle. However, this difference is not surprising with low numbers of sites (such as two sites in the Seattle metropolitan region).
By incorporating FEM reference data, we increased the number of observations and, in some metropolitan regions, the number of sites available to fit the model (Table 4). The most data were added to the Seattle metropolitan model, which lowered the LOSO RMSE from 2.13 to 1.27 µg/m³. However, the 10-fold cross-validation RMSE increased slightly from 1.02 to 1.15 µg/m³. The decrease seen with the LOSO measures in Seattle is reasonable since, with only two FRM sites, each model is based on a single site, whereas with six combined FRM and FEM sites, each model is fit with five sites. We see a similar result in Minneapolis-St. Paul, which also has two FRM sites. Otherwise, we observe slightly weaker summary measures when FEM data are included in the models. This may be partially attributable to the weaker correlations of FRM with FEM reference methods shown in Section 3.2. It also could be due to differences in the sites, as additional FEM sites and times were used for fitting the models but only FRM data were used for evaluation.
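The two Shinyei screening rules can be expressed as a short filter applied to each pair of hourly raw averages. The thresholds come from the text; discarding both readings when the pair disagrees, interpreting "10 times greater" as a ratio above 10, and the function name `screen_pair` are assumptions of this sketch.

```python
RAW_MAX = 10.0    # hourly raw averages above this value are discarded
RATIO_MAX = 10.0  # maximum tolerated ratio between paired sensors


def screen_pair(first, second):
    """Apply the two screening rules to one pair of hourly raw averages.

    Returns the pair with rejected readings replaced by None. When the two
    sensors disagree by more than RATIO_MAX, both readings are dropped.
    """
    if first is not None and first > RAW_MAX:
        first = None
    if second is not None and second > RAW_MAX:
        second = None
    if first is not None and second is not None:
        high, low = max(first, second), min(first, second)
        if high > RATIO_MAX * low:
            return None, None
    return first, second
```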

Application of Seattle Metropolitan Model to Other Regions-In most metropolitan regions, we find the Seattle metropolitan combined FRM & FEM model predictions to be weaker than those made with the region-specific models (see Table 4). The Seattle metropolitan model is less successful in LA compared to the LA region-specific model (R²=0.83, RMSE=3.41 µg/m³ and R²=0.90-0.92, RMSE=2.26-2.55 µg/m³, respectively), where the Seattle model predictions are consistently lower than the reference concentrations (see Figure 3). We see similarly systematic differences in the Seattle metropolitan model predictions in Winston-Salem, where concentrations are consistently overestimated, and in Chicago, where low concentrations are overestimated and high concentrations are underestimated (Figure 3). The best results of the Seattle metropolitan model are observed in NYC (R²=0.83, RMSE=1.69 µg/m³, Table 4), which is comparable to the NYC region-specific models (R²=0.79-0.84, RMSE=1.67-1.88 µg/m³, Table 4).

The Impact of Temperature and Relative Humidity-The distributions of temperature and RH across different study regions are presented in Figure 4. We observe that Chicago, NYC, Minneapolis-St. Paul, Winston-Salem and Baltimore have similar temperature/RH distributions, with the lowest RH values at the high and low ends of the temperature range, and the highest RH measurements within the temperature range of about 60-70°F. In LA, the trend is similar but restricted to temperatures above 50°F. The Seattle metropolitan region shows a different pattern, with a fairly linear inverse relationship between temperature and RH.
Differences in environmental conditions across regions contribute to differences between regional models and to poor predictive performance of the Seattle model in other regions. For example, residual plots of predicted concentrations show poor performance at high temperatures in LA, while the LA regional models are noticeably better at high temperatures (Figure 5). Similarly, poor performance of the Seattle model is observed in Chicago at high and low temperatures, compared to the Chicago-specific models.

Discussion
This analysis demonstrates the performance of two co-located low-cost sensors deployed in seven metropolitan regions of the United States, and describes approaches to calibrating results for potential use in environmental epidemiological studies. We fit region-specific calibration models using FRM data as the outcome, or FRM data supplemented with daily averaged FEM data, along with adjustment for temperature and humidity measures. We found that good calibration models were feasible with the Plantower PMS A003 sensor only. We also tested the generalizability of the Seattle calibration model to other metropolitan regions, and found that climatological differences between regions limit its transferability.
Our models were built on time series data from different regions to cover the temporal, seasonal, and spatial variations that are considered in every environmental study. When building the calibration models, we compared different reference instruments available at co-located sites and used two cross-validation designs.
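As a rough illustration of what fitting a region-specific calibration model of this kind involves, the sketch below regresses reference PM2.5 on a sensor count, temperature and RH by ordinary least squares. It is a minimal sketch under stated assumptions, not the specification reported in Table 4: the single count predictor, the function names `fit_ols` and `predict`, and all of the numbers are hypothetical, and the outcome is generated from made-up coefficients so that the fit can be checked.

```python
def fit_ols(features, outcome):
    """Ordinary least squares via the normal equations; returns [intercept, b1, b2, ...]."""
    X = [[1.0] + [float(v) for v in row] for row in features]
    p = len(X[0])
    # Augmented matrix [X'X | X'y].
    aug = []
    for i in range(p):
        row = [sum(x[i] * x[j] for x in X) for j in range(p)]
        row.append(sum(x[i] * y for x, y in zip(X, outcome)))
        aug.append(row)
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("design matrix is singular or collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(p):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][p] / aug[i][i] for i in range(p)]


def predict(coefs, row):
    """Calibrated PM2.5 estimate for one day of sensor data."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


# Hypothetical daily averages: (particle count per 0.1 L, temperature in °F, RH in %).
days = [
    (300, 40, 50), (500, 80, 60), (800, 60, 90), (200, 50, 85),
    (1000, 70, 40), (650, 55, 70), (400, 65, 45), (900, 45, 75),
]
# Hypothetical "true" coefficients used only to generate a noise-free outcome.
TRUE_COEFS = [2.0, 0.02, 0.1, -0.05]
reference_pm25 = [predict(TRUE_COEFS, d) for d in days]

coefs = fit_ols(days, reference_pm25)
```

With real co-location data the outcome would be the daily FRM (or FEM) concentration rather than a generated series, and the recovered coefficients would differ between regions.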
In order to approach our study goal we chose to derive the calibration models fully empirically, because documentation was not available to determine whether the size bin techniques for this optical measurement approach are fully validated, or whether the cut-points provided by the sensor were applicable or accurate under the conditions in which we deployed our sensors. According to the product data manual for Plantower PMS A003 strengths (2016), each size cut-off "bin" is: "…the number of particles with diameter beyond [size] µm in 0.1 L of air", with cut-offs at 0.3 µm, 0.5 µm, 1 µm, 2.5 µm, 5 µm and 10 µm. No other specifications are provided. As with other optical measurement methods, we anticipate that there is error associated with the cut-point of each bin, and that there are biases because the particle mix we measured differed from the particle mix used by the manufacturer in their calibration method. We also suspect that the Plantower PMS A003 instrument measures some particle sizes more efficiently than others; in particular, some optical sensors are less sensitive to particles below 1 µm in diameter. Therefore, using exploratory analyses, we developed best-fitting calibration models for PM2.5, which are described in Table 4. However, we recognize that such models could be developed in different ways. We therefore tested other model specifications with different transformations of the Plantower PMS A003 sensor counts and different sets of included bin sizes. We developed daily averaged regional calibration models using differences of the original Plantower PMS A003 count output in various size bins (which represent particles [size] µm and above), in order to define variables for the Plantower PMS A003 count size bins (0.3 to 0.5 µm), (0.5 to 1 µm) and (1 to 2.5 µm).
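The differencing of cumulative counts described above can be sketched as follows. This is an illustrative sketch only: the cut-offs are those quoted from the product manual, but the function name `differenced_bins` and the example counts are hypothetical.

```python
# Cut-offs (µm) quoted in the text; each raw count is the number of
# particles with diameter at or above the cut-off in 0.1 L of air.
CUTOFFS_UM = [0.3, 0.5, 1.0, 2.5, 5.0, 10.0]


def differenced_bins(cumulative_counts):
    """Turn cumulative counts (>= cut-off) into counts per size interval."""
    if len(cumulative_counts) != len(CUTOFFS_UM):
        raise ValueError("expected one count per cut-off")
    bins = {}
    for i in range(len(CUTOFFS_UM) - 1):
        lower, upper = CUTOFFS_UM[i], CUTOFFS_UM[i + 1]
        # Particles >= lower minus particles >= upper leaves those in [lower, upper).
        bins[(lower, upper)] = cumulative_counts[i] - cumulative_counts[i + 1]
    return bins


# Hypothetical daily-averaged cumulative counts for one sensor.
raw = [1200.0, 350.0, 60.0, 8.0, 2.0, 1.0]
bins = differenced_bins(raw)
# Intervals up to 2.5 µm; wider models would also use the 2.5-5 and 5-10 µm intervals.
fine_bins = {k: v for k, v in bins.items() if k[1] <= 2.5}
```

Because each raw count already contains every larger particle, neighbouring raw counts are strongly correlated; the interval counts separate that shared part out.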
Differencing the count size bins allowed us to minimize the potential correlation between count size bins within the model and to exclude counts of particles larger than 2.5 µm, for a theoretically better comparison with federal instruments (EPA 2001). In order to account for particles that are equal to or larger than 2.5 µm but smaller than 10 µm, we included two additional variables, the (2.5 to 5 µm) and (5 to 10 µm) count size bin differences, in separate models. The results in Table S5 (see SM Section) suggest that the best performing models are those which adjust for raw Plantower PMS A003 count size bins, or those which adjust for differences of the Plantower PMS A003 count size bins but include particles larger than 2.5 µm (especially the model including particles with optical diameter up to 10 µm). The models are not very different, but we observed slightly better predictive performance in the original model, which uses all of the available count size bin data from the device (except the 10 µm bin). The contribution of size bins larger than 2.5 µm to the PM2.5 mass prediction might be explained by size misclassification and by the particular source of PM2.5 emissions (the sensor may respond differently to particles with varying optical properties apart from size).
The results of our analysis are consistent with other studies involving Plantower PMS A003 and Shinyei PPD 42 NS PM2.5 low-cost sensors at co-located stations. Zheng et al. (2018) demonstrated excellent precision between Plantower PMS A003 sensors and high accuracy after applying an appropriate calibration model that adjusted for meteorological parameters. Holstius et al. (2014) found the Shinyei PPD 42 NS data were more challenging to screen for data quality, and showed poorer accuracy and weaker performance of calibration models, though findings by Gao et al. (2015) differed. The calibrated Shinyei sensors evaluated by Holstius et al. in California, US, explained about 72% of variation in daily averaged PM2.5. Gao et al. deployed Shinyei PPD 42 NS sensors in a Chinese urban environment with higher PM2.5 concentrations and were able to explain more of the variability (R² > 0.80). The differences in these reported results for calibration models are potentially due to the Shinyei PPD 42 NS sensor's higher limit of detection and noisier response, which would limit its utility in lower pollution settings.
Our results are also concordant with prior findings that low-cost sensor performance results vary across regions and are influenced by conditions such as temperature, RH, size and composition of particulate matter, and co-pollutants (Castell et al., 2017; Gao et al., 2015; Holstius et al., 2014; Kelly et al., 2017; Levy et al., 2018; Mukherjee et al., 2017; Shi et al., 2017; Zheng et al., 2018). For instance, Figure S2 in the SM section shows the estimated differences in the fitted model coefficient values across metropolitan regions. Furthermore, model residuals varied by site across the Seattle metropolitan region (see SM Section, Figure S3). Residuals at industry-impacted or near-road sites (Duwamish (PSCA1) and 10th & Weller (PSCA6), respectively) were noisier than at other sites. These stations' differences might be explained by different particle compositions (Zheng et al., 2018) or by size distribution. For example, diesel exhaust particles less than 0.3 micrometers in size would be included in the reference mass measurement but not in the light scattering sensed by these low-cost sensors. Particle density has also been observed to vary within and between cities, likely explaining variations in the relationship between OPC measured particle count and FRM mass measurements (Hasheminassab et al., 2014; Zhu et al., 2004).
Calibration was necessary to produce accurate measures of PM2.5 from both types of low-cost sensors. Uncalibrated Plantower PMS A003 PM2.5 mass calculations were fairly well correlated with co-located FRM measures, but did not fall on the 1-to-1 line and were sometimes nonlinear (see SM Section, Figure S4). This aligns with our finding that relatively simple calibration models provided good calibration of the Plantower PMS A003 sensors. The observed non-linearity in the plots also highlights limitations in the manufacturer's algorithm to convert particle counts to mass. Even though the calculated mass is based on a scientifically-motivated algorithm that should account for shape and density of the observed particles (usually by using a fixed value for these parameters), in practice we found that the flexibility of an empirical area-specific adjustment of the separate particle count bins resulted in the best model performance. In smaller calibration datasets, there will be a higher risk of overfitting, and the calculated mass may be a more appropriate choice.
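To make the role of the fixed shape and density parameters concrete, the sketch below shows a generic textbook count-to-mass conversion that treats every particle in a size interval as a sphere of one representative diameter and one density. It is not the manufacturer's algorithm, which is not documented here; the function name `mass_ug_m3`, the representative diameters and the density values are all assumptions chosen for illustration.

```python
import math


def mass_ug_m3(counts_per_0p1_l, diameters_um, density_g_cm3=1.0):
    """Mass concentration (µg/m³) from interval counts, assuming spherical particles."""
    total = 0.0
    for count, diameter in zip(counts_per_0p1_l, diameters_um):
        number_per_m3 = count * 1.0e4                   # 1 m³ = 1000 L = 10^4 x 0.1 L
        volume_um3 = math.pi / 6.0 * diameter ** 3      # volume of a sphere, µm³
        density_ug_um3 = density_g_cm3 * 1.0e-6         # 1 g/cm³ = 1e-6 µg/µm³
        total += number_per_m3 * volume_um3 * density_ug_um3
    return total


# Hypothetical interval counts (0.3-0.5, 0.5-1, 1-2.5 µm) and assumed mid-range diameters.
counts = [850.0, 290.0, 52.0]
diameters = [0.4, 0.75, 1.75]

mass_at_unit_density = mass_ug_m3(counts, diameters, density_g_cm3=1.0)
mass_at_higher_density = mass_ug_m3(counts, diameters, density_g_cm3=1.6)
```

In this simplified picture the estimate scales linearly with the assumed density and with the cube of the assumed diameter, so an aerosol whose true density or size distribution differs from the fixed values produces a proportional bias. An empirical regional adjustment can absorb such a bias.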
Study design choices play an important role in calibration model development and success. We had access to diverse reference monitoring locations where we were able to co-locate monitors over an approximately two-year period. This allowed us to observe monitor performance in various environmental conditions. Other studies have evaluated similar low-cost monitors under laboratory or natural conditions in the field (Austin et al., 2015; Holstius et al., 2014; SCAQMD AQ-SPEC, 2019), but were limited to a single co-location site, single regions, or short monitoring periods. As a result, calibration models based on a single co-located reference sensor have often been applied to data from different sites within a region, to which the models may not generalize. Our results suggest that low-cost sensors should be calibrated to reference monitors that are sited in environmental conditions that are as similar as possible to the novel monitoring locations. Our LABOSO results, which are presented in supplemental Table S6, indicate that this is especially important when calibration models rely on co-location at a single site. Models from single sites do not necessarily generalize well, and overfitting becomes more of a concern. Simpler model forms may perform better in cases with little data.
As a sensitivity analysis, we excluded certain regulatory sites from the Seattle model. Duwamish (PSCA1) data were excluded due to observed differences in the site's residuals (see SM Section, Figure S3) and because the industrial site is not representative of participant residential locations. We excluded Tacoma (PSCA4) site BAM data because of observed differences from the co-located FRM data. We conducted an additional analysis using data from only the days where both FRM and FEM data were available, and substituted FEM data for FRM data in both model fitting and model evaluation (see SM Section, Table S7). We found that in Seattle, using FEM data for model evaluation led to slightly weaker RMSE and R² than evaluation with FRM data, but had little effect on model coefficients. These results suggest that the differences in Seattle model comparisons for the different evaluation methods used in Table S3 are dominated by the site types used (e.g. whether the industrial Duwamish site is used for evaluation) and are less affected by the differences in reference instruments.
In the Seattle region, incorporating certain FEM data from additional sites into the evaluation increases the spatial variation represented and is more representative of the locations where predictions will be made for the ACT-AP study. However, evaluation with FEM data may not always be suitable due to the potential differences in data quality. In our study, we used measures from the six non-regulatory TEOM and BAM monitors due to the lack of EPA-designated FEM monitors at some NYC, Chicago and Winston-Salem sites. For simplicity, we grouped the non-regulatory TEOM and BAM data into the "FEM" group. However, the differences in sampling methods and data quality between FEM and non-regulatory instruments can be seen in the low FRM-BAM and FRM-TEOM correlations at NYC and Chicago sites compared to the FRM-FEM correlations in other regions (see Table 3). Such differences in instrument quality might be one of the reasons for the lower performance of the FRM and FEM regional models in NYC, Chicago and Winston-Salem compared to the FRM-only regional models in the same regions. Our results suggest that FEM data do not generally improve calibration models when FRM data are available, though often there is little difference between using only FRM and using both FRM and FEM (see e.g. Table 4). FRM data provided the most consistent evaluation method across regions, but this depends on data availability (see SM Section, Table S2). When there are already enough FRM data, including additional FEM data will likely not improve prediction accuracy. However, when FRM data are limited (i.e. few sites and/or observations), additional FEM data may improve calibration models.
The choice of model evaluation criteria depends on the purpose of calibration. R² is highly influenced by the range of PM2.5 concentrations and may be a good measure if the purpose of monitoring is to characterize variations in concentration over time. 
RMSE is an inherently interpretable quantity on the scale of the data and may be more relevant if one is more concerned with detecting small-scale spatial differences in concentrations (e.g. for long-term exposure predictions for subjects in cohort studies, when accurately capturing spatial heterogeneity is more important).
An appropriate cross-validation structure is also necessary. Roberts et al. (2017) suggest choosing a strategic split of the data rather than random data splitting, to account for temporal, spatial, hierarchical or phylogenetic dependencies in datasets. Barcelo-Ordinas et al. (2019) apply a cross-validation method that analyzes LCM ozone concentrations over different monitoring periods, short-term and long-term. They warn that short-term calibration using multiple linear regression might produce biases in the calculated long-term concentrations. Other researchers from Spain, prior to dividing the data into training and test sets for cross-validation, shuffled the data in order to randomly include high and low ozone concentrations in both training and test datasets. For the validation analysis of our PM2.5 calibration models we used three different cross-validation structures: 10-fold (based on time), "leave-one-site-out" (LOSO) and "leave-all-but-one-site-out" (LABOSO) cross-validation designs (see Section 2.5.4). The 10-fold cross-validation approach randomly partitions weeks of monitoring with co-located LCM and FRM data into 10 folds. With this approach, we tried to minimize the effects of temporal correlation on our evaluation metrics by not allowing data from the same calendar week to be used to both train and test models. A 10-fold cross-validation design with folds randomly selected on the week scale consistently produced higher R² and lower RMSE measures than a "leave-one-site-out" (LOSO) cross-validation design. 
For our purposes, we would prefer to use LOSO in settings with sufficient data from each of at least 3 reference sites. This most closely replicates prediction at non-monitored locations, which is the end goal of our studies. However, the 10-fold cross-validation design is more robust to the number of reference sites, and is preferred (or necessary) in settings with sufficient data from only 1-2 reference sites.
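The splitting designs and the two evaluation metrics discussed above can be sketched as follows. This is a minimal sketch under stated assumptions rather than the study's code: ISO calendar weeks stand in for "calendar week", R² is taken as 1 - SSE/SST (the exact definition is not given in this section), and the function names, site labels and concentrations are hypothetical.

```python
import math
import random
from datetime import date, timedelta


def week_blocked_folds(days, n_folds=10, seed=0):
    """Fold index per day; all days in one (ISO year, week) share a fold."""
    weeks = sorted({tuple(d.isocalendar()[:2]) for d in days})
    random.Random(seed).shuffle(weeks)
    fold_of_week = {week: i % n_folds for i, week in enumerate(weeks)}
    return [fold_of_week[tuple(d.isocalendar()[:2])] for d in days]


def loso_splits(sites):
    """Leave-one-site-out: train on every other site, test on the held-out site."""
    for held_out in sorted(set(sites)):
        train = [i for i, s in enumerate(sites) if s != held_out]
        test = [i for i, s in enumerate(sites) if s == held_out]
        yield held_out, train, test


def laboso_splits(sites):
    """Leave-all-but-one-site-out: train on one site, test on all other sites."""
    for held_out, others, single in loso_splits(sites):
        yield held_out, single, others


def rmse(observed, predicted):
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))


def r_squared(observed, predicted):
    mean = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean) ** 2 for o in observed)
    return 1.0 - sse / sst


# 20 full weeks of hypothetical monitor-days, cycling over three hypothetical sites.
days = [date(2018, 1, 1) + timedelta(days=i) for i in range(140)]
sites = ["A", "B", "C"] * 46 + ["A", "B"]
folds = week_blocked_folds(days)
loso = list(loso_splits(sites))
laboso = list(laboso_splits(sites))

# Identical errors of +/-1 µg/m³ on a narrow and on a wide concentration range.
errors = [1.0, -1.0, 1.0, -1.0]
narrow = [4.0, 5.0, 6.0, 7.0]
wide = [4.0, 10.0, 16.0, 22.0]
narrow_pred = [o + e for o, e in zip(narrow, errors)]
wide_pred = [o + e for o, e in zip(wide, errors)]
```

In the toy comparison, both prediction sets have the same RMSE of 1 µg/m³, while R² is about 0.2 on the narrow range and about 0.98 on the wide range. This illustrates why R² alone can flatter a model evaluated where concentrations vary widely.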

Conclusions
Calibration models based on Plantower PMS A003 particle counts, incorporating temperature and humidity data and using daily FRM data or FRM data supplemented with FEM data, performed well, especially with region-specific calibration. Investigators should develop calibration models using measurements from within the region where the model will be applied. The Seattle metropolitan calibration model could be applied to other regions under similar meteorological and environmental conditions, but region-specific calibration models based on the thoughtful selection of at least 3 reference sites are preferred. This study highlights the importance of deriving calibrations of low-cost sensors based on regional scale comparison to data from reference or equivalent measurement methods, and provides sufficient evidence to caution against using manufacturer-provided general calibration factors. Calibrated Plantower PMS A003 PM2.5 sensor data provide measurements of ambient PM2.5 for exposure assessment that can potentially be used in environmental epidemiology studies.

HIGHLIGHTS
Low-cost sensors for air pollutants such as particulate matter are of interest and are appealing to deploy in epidemiological studies, but have not been adequately studied for this use.

Different sensors have different performance characteristics.
With adequate calibration of results, and several important caveats, one commonly deployed sensor was found to produce measurements that are precise and reliable.
Calibration of the devices needs to be performed with caution, as an approach which is effective in one region may not be effective in a different region.
Epidemiologists may find these devices useful for exposure assessment in epidemiological studies, but need to use caution in using the data, paying attention to data quality and calibration of measurements.

Zusman et al. Environ Int. Author manuscript; available in PMC 2021 January 01.

Figure 4. Correlation between daily averaged temperature (°F) and RH measures within low-cost monitors across different regions. Note: low-cost sensor readings of temperature and RH were calibrated with reference temperature/RH data from the Seattle Beacon Hill site in order to present standard units.

Figure 5. Residuals by temperature (°F) from different calibration models in LA and Chicago. Note: low-cost sensor readings of temperature were calibrated with reference temperature data from the Seattle Beacon Hill site in order to present standard units.

Notes: a - These columns report unique days (monitor-days) when both LCM data and regulatory station reference data are available. The days presented in the FRM and TEOM/BAM columns overlap considerably, since a day with FRM & FEM reference data will add to both columns.
b -The average concentration/temperature/RH values were averaged across daily observations at the site when both LCM and agency reference data were available, and thus depend on co-location schedule which may differ across sites.
(*) Indicates sites with NRMs (TEOM/BAM) whose method code is not classified by EPA as FEM. In the present study, NRMs were grouped into the "FEM" category.