Improving long-term air pollution estimates with incomplete data: A method-fusion approach

Introduction
Air pollution monitoring requires high-precision instruments, which are expensive and often limit the number of locations that can be observed [1]. In addition to financial constraints, the time constraints of monitoring multiple sites with a limited number of monitors often prevent the long-term continuous monitoring that is required to accurately estimate long-term air pollution at a given location [2]. To address these challenges, researchers often deploy portable monitors for short periods to observe air pollution at many different locations. The goal of using portable monitors is to obtain spatially and temporally diverse data. The data collected from these short-term monitoring campaigns have been used to estimate long-term concentrations and to develop predictive models of air pollution. Long-term estimates are necessary because they are the values most often used in epidemiological studies of the health effects of air pollution [3,4]. Besides calculating the raw average of the short-term samples and treating it as the long-term concentration estimate, researchers often employ multiplicative temporal adjustments to estimate long-term exposure [5-7].

Multiplicative temporal adjustments
Multiplicative temporal adjustments are used to correct for temporal trends present in air pollution data collected through short-term monitoring campaigns [8]. When data are collected over short intervals at different times of the year or at different times of the day, seasonal and diurnal variation in air pollution becomes exaggerated. Multiplicative temporal adjustments account for these trends by adjusting the short-term observations against continuous data observed at a fixed-location monitoring station. For example, if data were collected during a period of above-average air pollution, the adjustment would correct these observations downwards when estimating the long-term value. Applying temporal adjustments can improve the accuracy of long-term pollution concentration estimates. Eq. (1) describes the basic form of a multiplicative temporal adjustment.
Long-term estimates of air pollution are calculated by dividing each short-term observation ($O_t$) by the ratio of its corresponding fixed monitor observation ($O_{FMt}$) to the long-term central tendency of a fixed-location reference monitor ($O_{\text{Central Tendency},FM}$) [9]:

$$\text{Long-term estimate}_t = \frac{O_t}{O_{FMt} / O_{\text{Central Tendency},FM}} \qquad (1)$$
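The verbal definition above can be sketched in a few lines of R. This is a minimal illustration, not the authors' listing: the function name `multiplicative_adjustment` and the input values are hypothetical and chosen only to make the arithmetic easy to follow.

```r
# Generic multiplicative temporal adjustment, following the verbal
# definition of Eq. (1): observation / (reference / central tendency).
# The function name and all numbers below are illustrative assumptions.
multiplicative_adjustment <- function(mobile, reference, central_tendency) {
  stopifnot(length(mobile) == length(reference))
  mobile / (reference / central_tendency)
}

# Worked example: the reference monitor reads 50 at sampling time, twice
# its long-term central tendency of 25, so the mobile reading is halved.
adjusted <- multiplicative_adjustment(mobile = 40, reference = 50,
                                      central_tendency = 25)
print(adjusted)  # 40 / (50 / 25) = 20
```

Because the reference monitor was observing twice its usual level, the mobile observation is corrected downwards by the same factor.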

A method-fusion approach
Air pollutants are often log-normally distributed, with many low-concentration observations and fewer high-concentration observations (see Fig. 1).
The log-normal distribution of air pollution data presents challenges when estimating long-term concentrations because of the influence of extreme values. Two main approaches have been employed to account for it.
Approach one, see Eq. (2), uses the median as the measure of central tendency for the fixed-location monitor [7]. Because the median is used rather than the mean, the estimates produced with this approach are less inflated by extreme values and better represent the observed long-term concentration.
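To illustrate why the choice of central tendency matters, the sketch below compares the mean and the median of a small right-skewed reference series and applies the basic multiplicative adjustment with each. All values are made up for illustration and are not taken from the study data.

```r
# Made-up, right-skewed reference series with one extreme value.
reference_year <- c(10, 12, 15, 20, 200)

mean_ct   <- mean(reference_year)    # 51.4, pulled upwards by the 200
median_ct <- median(reference_year)  # 15, unaffected by the extreme value

# The same mobile/reference pair adjusted with each central tendency.
mobile_obs    <- 30
reference_obs <- 20

estimate_mean_scaled   <- mobile_obs / (reference_obs / mean_ct)    # 77.1
estimate_median_scaled <- mobile_obs / (reference_obs / median_ct)  # 22.5

print(c(mean_scaled = estimate_mean_scaled,
        median_scaled = estimate_median_scaled))
```

A single extreme reference value more than triples the mean-scaled estimate, while the median-scaled estimate stays close to the bulk of the data.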
Approach two, see Eq. (3), applies a log transformation to the data, which makes their distribution approximately normal. This transformation reduces the inflation of estimates caused by extreme values. This approach uses the mean as its measure of central tendency.
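A rough sketch of a log-transformed adjustment is given below. The exact form of the published equation is not reproduced in this text, so the sketch rests on an assumption: the basic multiplicative adjustment is applied to log-transformed values, with the mean of the logged reference series as the central tendency, and the result is back-transformed. The function name `log_mean_adjustment_sketch` and the data are hypothetical.

```r
# ASSUMED form (not confirmed by the source text):
#   exp( log(O_t) / ( log(O_FMt) / mean(log(O_FM)) ) )
# Inputs must be positive, and reference values must not equal 1
# (log(1) = 0 would cause a division by zero).
log_mean_adjustment_sketch <- function(mobile, reference, reference_year) {
  stopifnot(length(mobile) == length(reference),
            all(mobile > 0), all(reference > 0), all(reference != 1),
            all(reference_year > 0))
  log_central_tendency <- mean(log(reference_year))
  exp(log(mobile) / (log(reference) / log_central_tendency))
}

# Made-up, right-skewed reference series and one mobile/reference pair.
reference_year <- c(10, 12, 15, 20, 200)
log_adjusted <- log_mean_adjustment_sketch(mobile = 30, reference = 20,
                                           reference_year = reference_year)
print(log_adjusted)
```

Under this assumed form, a mobile observation is left unchanged whenever the concurrent reference value equals the geometric mean of the reference series.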
By incorporating elements from both approaches into one temporal adjustment, more accurate results may be obtained. Eq. (4) displays the combined temporal adjustment. Unlike other temporal adjustments, Eq. (4) uses both the log transformation and the median value of the fixed monitor to control for the log-normal distribution of the data being adjusted. We refer to this method-fusion approach as the log median-scaled adjustment.
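One plausible way to write the combined adjustment is shown below. This is a reconstruction under assumptions rather than the published equation: it applies the basic multiplicative form to log-transformed values, uses the log of the fixed monitor's median as the central tendency, and back-transforms the result. The symbols $\hat{O}_t$ (adjusted observation) and $\tilde{O}_{FM}$ (median of the fixed monitor series) are introduced here for illustration only.

```latex
% Assumed form of the log median-scaled adjustment (illustrative only)
\hat{O}_t = \exp\!\left(
  \frac{\ln O_t}{\ln O_{FMt} \,/\, \ln \tilde{O}_{FM}}
\right),
\qquad
\text{long-term estimate} = \frac{1}{n} \sum_{t=1}^{n} \hat{O}_t
```

The averaging step reflects the description of the sample workflow, in which the adjusted values of the mobile sample are averaged into a single long-term estimate.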

Model validation
This method has been demonstrated to improve estimates compared to other multiplicative adjustments [10]. Chastko and Adams [10] simulated mobile monitoring campaigns with air pollution observations from three different cities and eight pollutants. Mobile pollution samples were adjusted using multiple temporal adjustment approaches to predict long-term concentrations for each pollutant. This analysis revealed that the log median-scaled adjustment was more accurate than all other temporal adjustments included in the study. For full details on model validation, see Chastko and Adams [10].

Conclusion
The method-fusion approach can be applied to any mobile air pollution monitoring dataset and can produce more accurate long-term estimates than existing temporal adjustments. These estimates are more accurate because the method-fusion approach controls for the inflation of the central tendency produced by log-normal distributions, which are often present in air pollution data.

Sample workflow
The following example demonstrates how to apply the log median-scaled adjustment to a sample of air pollution data. The workflow is presented in the programming language R. For this example, mobile data are represented by a subset of data from a stationary air pollution monitor in Paris, France.

R libraries
To access various functions used in this demonstration, the following R libraries are loaded.

Loading the data
To access the sample data for this demonstration, a CSV file containing hourly nitrogen dioxide observations from 2016 in Paris, France is loaded into R. These data are hosted on GitHub and were originally obtained from AirParif [11]. The following code block loads the data into R, assigning it to a variable air.pollution.data.
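The original listing and the file location are not reproduced in this text, so the sketch below only illustrates the loading step with base R. The helper name `load_air_pollution_data`, the column names (`datetime`, `station_1`, `station_2`) and the timestamp format are assumptions, and the demonstration reads a tiny made-up file rather than the AirParif data.

```r
# Hypothetical loader: column names and timestamp format are assumptions.
load_air_pollution_data <- function(path) {
  data <- read.csv(path, stringsAsFactors = FALSE)
  # Parse the timestamp column into date-time values (UTC assumed).
  data$datetime <- as.POSIXct(data$datetime,
                              format = "%Y-%m-%d %H:%M", tz = "UTC")
  data
}

# Demonstration with a small made-up file standing in for the real CSV.
demo_path <- tempfile(fileext = ".csv")
writeLines(c(
  "datetime,station_1,station_2",
  "2016-01-01 00:00,21.5,30.2",
  "2016-01-01 01:00,19.8,28.7"
), demo_path)

air.pollution.data <- load_air_pollution_data(demo_path)
str(air.pollution.data)
```

To use the real data, the path or URL of the hosted CSV file would be passed in place of `demo_path`, with the column names adjusted to match the file.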
Table 1 provides a sample of the air pollution data: a collection of observations from two air pollution monitors, with a timestamp for each observation.

Sampling and adjustment function
With the data loaded into R, a sample of air pollution data can be taken from one of the monitors to represent the mobile data. In this example, 24 h of air pollution observations will be used. Table 2 displays the sample data and the temporally corresponding reference data that will be used to adjust each sample from the mobile data. Station 1 is used as the reference station and Station 2 is used as the sample station. Additionally, the annual median NO2 concentration is calculated for the reference monitor. Fig. 2 displays the time slice of mobile data in relation to the entire time series of NO2 values observed at Station 2.
Now that all the required data have been selected, we can proceed to apply the temporal adjustment. The R implementation of the adjustment is shown below as the LogMedianScaled function. This function requires a vector containing the reference data, a vector containing the mobile data and the annual median pollutant concentration calculated from the entire reference dataset.
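Since the original listing is not reproduced in this text, the following sketch shows how the sampling and adjustment steps might look in base R. It is not the authors' function: `log_median_scaled_sketch` is a hypothetical name, its formula is an assumed reading of the log median-scaled adjustment, and the two station series are synthetic values generated to resemble hourly, log-normally distributed data with a shared diurnal pattern.

```r
set.seed(42)

# Synthetic hourly data for a leap year (2016 has 366 days); NOT real data.
n_hours       <- 366 * 24
hour_of_day   <- (seq_len(n_hours) - 1) %% 24
shared_signal <- 0.4 * sin(2 * pi * hour_of_day / 24)  # common diurnal cycle
station_1 <- exp(3.0 + shared_signal + rnorm(n_hours, sd = 0.1))  # reference
station_2 <- exp(3.2 + shared_signal + rnorm(n_hours, sd = 0.1))  # sampled

# Take a 24 h slice of Station 2 as the "mobile" sample, together with the
# temporally corresponding reference observations from Station 1.
sample_hours     <- 1001:1024
mobile_sample    <- station_2[sample_hours]
reference_sample <- station_1[sample_hours]

# Annual median of the entire reference series.
reference_median <- median(station_1)

# ASSUMED form of the adjustment: Eq. (1) applied to log-transformed values
# with the log of the reference median as central tendency, back-transformed
# and averaged into a single long-term estimate.
log_median_scaled_sketch <- function(reference, mobile, reference_median) {
  stopifnot(length(reference) == length(mobile),
            all(reference > 0), all(mobile > 0), all(reference != 1),
            reference_median > 0)
  adjusted <- exp(log(mobile) / (log(reference) / log(reference_median)))
  mean(adjusted)
}

estimate    <- log_median_scaled_sketch(reference_sample, mobile_sample,
                                        reference_median)
raw_average <- mean(mobile_sample)
annual_mean <- mean(station_2)
print(c(raw = raw_average, adjusted = estimate, annual = annual_mean))
```

The argument order mirrors the description above: reference data, mobile data, then the annual reference median. Values close to 1 in the reference series make the log ratio unstable under this assumed form, which is why the sketch rejects a reference value of exactly 1.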
The LogMedianScaledAdjustment function returns a single value representing the annual average air pollution concentration estimated from the mobile sample by averaging each adjusted value in the mobile sample. Table 3 shows the adjusted values calculated by the temporal adjustment and the raw input values used to adjust the data. In this example, the raw data produce a long-term estimate of 42.62 ppb, the LogMedianScaledAdjustment produces an annual NO2 estimate of 37.15 ppb, and the actual long-term value was 30.6 ppb. By applying the log median-scaled adjustment, estimation error was reduced from 12.02 ppb to 6.55 ppb using only a 24 h sample to estimate the annual average.
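The error figures quoted above follow directly from the three reported concentrations, as the short check below shows. The relative reduction in the last step is derived here from those figures and is not a value reported in the text.

```r
# Values reported in the text (ppb).
actual_annual     <- 30.6
raw_estimate      <- 42.62
adjusted_estimate <- 37.15

# Absolute estimation errors.
raw_error      <- abs(raw_estimate - actual_annual)       # 12.02 ppb
adjusted_error <- abs(adjusted_estimate - actual_annual)  # 6.55 ppb

# Relative reduction in error (derived, about 45.5 %).
error_reduction_pct <- 100 * (raw_error - adjusted_error) / raw_error

print(round(c(raw_error = raw_error,
              adjusted_error = adjusted_error,
              reduction_pct = error_reduction_pct), 2))
```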
To visualize the accuracy of the temporally adjusted estimate, we can plot the observed average annual NO2 value, calculated from the stationary data, alongside the raw sample average, the temporally adjusted average and the sample data. Fig. 3 shows that the temporally adjusted average is more accurate than the raw sample average.