Application of Machine Learning methods to correct the readings of low-cost air pollution sensors

The study is devoted to the analysis of the application of machine learning methods for correcting the readings of inexpensive sensors that record the concentration of PM2.5 suspended particles in the surface layer of the atmosphere, relative to the readings of reference stations. The analysis was carried out on a pair of collocated sensors (an inexpensive CityAir sensor and a reference E-BAM) located in Krasnoyarsk (Russia), based on observational data from January 1, 2019 to December 31, 2022. A statistical analysis of the data was performed, and parametric (Linear, Ridge, Lasso, Support Vector, and Elastic Net regressions) and non-parametric (Nearest Neighbors, Decision Tree, and Random Forest regressions) methods for establishing the relationship between the two samples were compared.


Introduction
Air pollution is not only a serious threat to human health (since pollutant particles are able to adsorb toxic and carcinogenic substances on their surface and, due to their microscopic size, penetrate into the human circulatory system, increasing the risk of bronchopulmonary, cardiovascular and oncological diseases), but also affects the attractiveness of a region for living and the general well-being of its population. The most well-known air pollutant is suspended particulate matter PM2.5, whose sources in the Earth's atmosphere are both natural processes (sandstorms, volcanic eruptions) and anthropogenic ones, such as the extraction, processing and combustion of natural raw materials, as well as agricultural and forest fires. Thus, the collection and analysis of data on PM2.5 concentrations is a global research area that has received much attention in countries such as England, China, South Korea and the United States [1][2][3][4][5][6]. It should be noted that in 2021 and 2023 Russia (Krasnoyarsk) twice topped the ranking of major cities of the world with the highest level of air pollution according to the IQAir service, which monitors air quality in real time (https://www.iqair.com/ru/world-airquality-ranking).
At present, inexpensive (less than $2,500) optical sensors can be used to measure the concentration of suspended particles. Their low cost and compact size facilitate wider deployment and collection of real-time pollution data, and they can therefore be useful in large-scale pollution monitoring and spatial mapping. However, the optical principle of particle detection used in inexpensive sensors is based on a number of assumptions that lead to inaccuracies [7,8]. These assumptions cause errors in counting the number of particles and estimating their size, on which the detected concentration of suspended particles depends. For their part, manufacturers of reference-type optical particle counters spend a lot of time and effort developing mass-conversion factors to make their devices as accurate as possible [9]. Thus, data collected by low-cost sensor networks are needed to develop local environmental improvement measures, but the data obtained in this way can be doubtful due to serious limitations in the accuracy and reliability of these low-cost sensors, which leads to the necessity of correcting them.
Krasnoyarsk is one of the few cities in Russia where atmospheric air quality is monitored by a network of stationary observation posts. First, the Ministry of Ecology and Rational Nature Management of the Krasnoyarsk Territory maintains the Regional Departmental Information and Analytical Data System on the State of the Environment of the Krasnoyarsk Territory (RDIAS). Nine RDIAS posts are located in Krasnoyarsk. Automatic measurement of meteorological parameters and concentrations of pollutants (nitrogen oxide, nitrogen dioxide, sulfur dioxide, carbon monoxide, and PM2.5) in the surface layer of the atmosphere is performed every 20 minutes (http://www.krasecology.ru/Air). In RDIAS, dust analyzers of the E-BAM type (Met One Instruments Inc., USA) are used to monitor PM2.5 concentration. These analyzers are recognized worldwide as reference equipment for measuring PM10 and PM2.5 fractions in the atmosphere, are recommended for use, and are certified in many countries.
Secondly, the Air Quality Monitoring System of the Krasnoyarsk Scientific Center of the Siberian Branch of the Russian Academy of Sciences (KSC SB RAS) operates in Krasnoyarsk (http://gis.krasn.ru/blog/). Each post is equipped with a low-cost certified CityAir air monitoring station (https://cityair.io/ru/equipment). The station measures the main meteorological parameters (temperature, humidity and pressure) and the concentrations of PM2.5 and PM10 aerosol particles every 20 minutes. The monitoring system of the KSC SB RAS has about 25 posts located in different districts of Krasnoyarsk, which provides spatially detailed information on pollution.
The purpose of this work is to analyze the possibility of using machine learning methods to correct the readings of inexpensive sensors based on measurement data collected over a long period of time. The methods will be tested on the measurement results of a pair of collocated CityAir (corrected) and E-BAM (reference) sensors located in Krasnoyarsk (Sverdlovskaya St.) for the period from January 1, 2019 to December 31, 2022.

Methods
To formalize the difference between the readings of the two sensors, statistical analysis methods were used: calculation of the sample mean, variance, mode, quantiles, skewness, kurtosis, and the Jarque-Bera test. To identify the dependence between the readings of the two sensors, the apparatus of paired regression was used. Note that at present the problem of regression, i.e. establishing a relationship between two data sets, is successfully solved by supervised machine learning methods if the sample is large and fully describes the dependence under study. Supervised learning methods involve dividing the initial sample $S = \{(x_i, y_i)\}_{i=1}^{N}$, consisting of pairs of independent values $x_i$ and predicted values $y_i$, into a training set $S_{tr} = \{(x_i, y_i)\}_{i=1}^{N_{tr}}$ and a test set $S_{test} = S \setminus S_{tr}$, where $N_{tr}$ is the training sample size. Statistical and regression analysis was performed in Python using the numpy, pandas, sklearn, and statsmodels modules. Let us briefly describe the methods used for regression analysis.
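As a minimal sketch of this train/test partitioning, the split $S_{tr}$, $S_{test} = S \setminus S_{tr}$ can be done with scikit-learn. The synthetic readings below are purely illustrative stand-ins (the coefficients and noise level are assumptions, not the paper's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical synthetic stand-ins for the paired readings:
# s ~ low-cost sensor (CityAir-like), m ~ reference sensor (E-BAM-like)
rng = np.random.default_rng(0)
s = rng.lognormal(mean=3.0, sigma=0.8, size=1000)
m = 0.5 * s + rng.normal(0.0, 2.0, size=1000)

# Split the sample S into a training part S_tr and S_test = S \ S_tr
s_train, s_test, m_train, m_test = train_test_split(
    s.reshape(-1, 1), m, train_size=0.8, random_state=42
)
print(s_train.shape, s_test.shape)  # (800, 1) (200, 1)
```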

Parametric regression methods
Parametric regression methods represent the relationship between the predicted value and the independent variable as a function of a parameter set $\theta = \{\theta_j\}_{j=1}^{k}$:
$$y_i = f(x_i, \theta) + \varepsilon_i, \qquad (1)$$
where $\varepsilon_i$ is the prediction error and $f$ is chosen common for all $i$. The most well-known methods here are linear and polynomial regression, where $f$ is represented as a polynomial of the first or higher degree. In this case, the coefficients $\theta$ are found by minimizing the functional
$$\Phi(\theta) = \sum_{i=1}^{N_{tr}} \left( y_i - f(x_i, \theta) \right)^2. \qquad (2)$$
The Lasso and Ridge regression methods look for the coefficients by minimizing the functional (2) with the addition of a penalty in the $L_1$ and $L_2$ norms, respectively:
$$\Phi(\theta) + \lambda \|\theta\|_1, \qquad \Phi(\theta) + \lambda \|\theta\|_2^2, \qquad (3)$$
where the penalty weight $\lambda$ can be found by cross-validation. Elastic Net regression combines Lasso and Ridge regression by weighting the penalties in the $L_1$ and $L_2$ norms:
$$\Phi(\theta) + \lambda \left( \alpha \|\theta\|_1 + (1 - \alpha) \|\theta\|_2^2 \right). \qquad (4)$$
Here the parameter $\alpha$, as well as $\lambda$, can be found by cross-validation. Note that regressions using penalties perform well in the case of strong multicollinearity of the input features, nulling the coefficients of insignificant inputs. In the study of a pairwise dependence, regressions with penalties are unlikely to improve the forecast, but they may be useful in choosing the degree of the polynomial dependence. Finally, Support Vector regression looks for the coefficients $\theta$ by minimizing a functional whose main idea is to find a support curve such that the forecast errors lie within its $\varepsilon$-neighborhood.
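All the parametric regressors named above are available in scikit-learn. The sketch below fits each of them to synthetic paired readings; the data, penalty weights, and kernel choice are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.svm import SVR

# Synthetic paired readings (illustrative only)
rng = np.random.default_rng(1)
X = rng.lognormal(3.0, 0.8, size=500).reshape(-1, 1)  # low-cost sensor s
y = 0.5 * X.ravel() + rng.normal(0.0, 2.0, size=500)  # reference sensor m

models = {
    "Linear": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),                          # L2 penalty, alpha plays the role of lambda
    "Lasso": Lasso(alpha=0.1),                          # L1 penalty
    "ElasticNet": ElasticNet(alpha=0.1, l1_ratio=0.5),  # weighted L1/L2 mix
    "SVR": SVR(kernel="linear", epsilon=0.5),           # epsilon-insensitive loss
}
for name, model in models.items():
    model.fit(X, y)
    print(f"{name}: R^2 = {model.score(X, y):.3f}")
```

In scikit-learn the penalty weight (`alpha` here) is typically tuned by the CV variants (`RidgeCV`, `LassoCV`, `ElasticNetCV`), matching the cross-validation selection of $\lambda$ described above.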

Non-parametric regression methods
When using non-parametric machine learning methods, the dependence between two data sets is represented as a black box, that is, it cannot be expressed analytically. The simplest method here is the k nearest neighbors method. For each new independent value $x_i$ whose response $y_i$ needs to be predicted, the goal of the regression is to find a neighborhood (a multidimensional ellipsoid) containing k points of the training set with known response values, from which the prediction is formed.
The decision tree regression model predicts the response value $y_i$ by applying simple decision rules based on the values of the independent variable $x_i$. The tree is built during learning on the training set. The default decision-tree building algorithm in the scikit-learn Python module is CART [10], which builds a binary decision tree using as a splitting criterion the minimization of the squared deviation of the predicted response in each terminal node from the mean response value learned for that node. A decision tree for numerical sequences can be viewed as a piecewise constant approximation.
Note that decision trees usually depend strongly on the training set and tend to overfit. The Random Forest algorithm, which uses an ensemble of decision trees, helps to improve the model. To implement the algorithm, each tree is trained on a random subset S drawn with replacement from the training set $S_{tr}$ of the same size $N_{tr}$. In practice, this means that some pairs $(x_i, y_i)$ fall into the subset several times, while others remain out-of-bag. The purpose of this randomness is to reduce the variance of the forest estimate: the noise introduced in this way produces decision trees with weakly correlated prediction errors, and by averaging their predictions some of the errors cancel out. Random forests reduce the overall variance by combining different trees; in practice, the reduction in variance is often significant, resulting in an overall better model.

Results
As mentioned above, the readings from a pair of collocated CityAir (corrected) and E-BAM (reference) sensors located in Krasnoyarsk (Sverdlovskaya St.) for the period from January 1, 2019 to December 31, 2022 were analyzed. Before the analysis, the sensor data were preprocessed: pairs of readings containing gaps, as well as pairs in which the measurements differed by more than a factor of 6, were excluded from the subsequent regression analysis (the factor of 6 was chosen from a statistical estimate that 95% of the paired measurements differ from each other by less than a factor of 6, see fig. 1). The sample remaining after processing contains 58,425 records.
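A pandas sketch of this preprocessing step, assuming hypothetical column names and toy values (the paper does not publish its actual pipeline):

```python
import numpy as np
import pandas as pd

# Hypothetical paired 20-minute readings; one row has a gap (NaN)
df = pd.DataFrame({
    "cityair": [12.0, 40.0, np.nan, 8.0, 90.0],
    "ebam":    [10.0,  5.0, 14.0,  7.0, 80.0],
})

# 1) exclude pairs containing gaps
df = df.dropna()

# 2) exclude pairs whose readings differ by more than a factor of 6
ratio = df[["cityair", "ebam"]].max(axis=1) / df[["cityair", "ebam"]].min(axis=1)
df = df[ratio <= 6]
print(df)  # the (40.0, 5.0) pair is dropped: its ratio is 8
```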

Descriptive statistic of data
Table 1 provides descriptive statistics of the sensor data. The table shows that the sample mean of the CityAir sensor is almost twice that of the reference E-BAM sensor. The data distribution is not normal but asymmetric, with a sharper peak than the normal distribution. This complicates the application of linear regression methods, which assume normally distributed data. Note that the data distribution is close to lognormal (see fig. 2). A data mode biased toward zero and the bimodality of the lognormal distribution indicate that the sample may be split into background data and exceedance data, for which the statistical descriptions may differ.
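The statistics in question (skewness, kurtosis, Jarque-Bera) are available in scipy. A sketch on a synthetic lognormal sample (illustrative only, not the sensor data):

```python
import numpy as np
from scipy import stats

# Synthetic lognormal "pollution" sample (illustrative only)
rng = np.random.default_rng(3)
x = rng.lognormal(mean=3.0, sigma=0.8, size=5000)

print("mean:    ", np.mean(x))
print("skewness:", stats.skew(x))      # > 0: right-skewed
print("kurtosis:", stats.kurtosis(x))  # > 0: sharper peak than the normal
jb = stats.jarque_bera(x)
print("Jarque-Bera:", jb.statistic, jb.pvalue)  # tiny p-value: normality rejected
# After a log transform the sample is close to normal, so the JB statistic drops
print("JB of log(x):", stats.jarque_bera(np.log(x)).statistic)
```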

Methods for extracting the background concentration of PM2.5 particles
Consider two approaches for extracting background data on the concentration of PM2.5 particles in the surface layer of the atmosphere. Assume that the background for each sensor of the pair is a subset $\{x_i\}_{i=1}^{M}$ of the set of measurements $\{x_i\}_{i=1}^{N}$, $N > M$, satisfying one of the following criteria:
1. $x_i$ is background if $Q_1 - 1.5\,IQR < x_i < Q_3 + 1.5\,IQR$, where $Q_1$, $Q_3$ and $IQR$ are determined on $\{x_i\}_{i=1}^{N}$;
2. $x_i$ is background if the daily average value on the reference sensor does not exceed the maximum permissible daily concentration according to the World Health Organization (< 35 µg/m³).
The first approach is a widely used method for outlier detection in machine learning [11], and the second is built on the nature of the data and the problem being solved. Figure 3 presents the data extracted by the first (left) and second (right) approaches. Note that in the second case, short-term excesses fall into the background data (for example, the accumulation of exhaust gases during peak hours). Table 2 presents descriptive statistics of background readings and PM2.5 concentration excess readings for the reference E-BAM sensor obtained by the described methods. The descriptive statistics show that separating excess readings from background readings normalizes the sample of excess readings (the Jarque-Bera statistic decreases), making it justifiable to use linear regression methods on such a sample. At the same time, the physically better-grounded method of extracting background readings, based on comparing daily average PM2.5 concentrations with the maximum permissible level, cannot be used for operational prediction, since the sliding daily average is computed over half a day forward and backward from the current measurement.
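The first (interquartile) criterion can be sketched in a few lines of numpy; the synthetic mixture of background readings and exceedances is a hypothetical illustration:

```python
import numpy as np

def iqr_background_mask(x, k=1.5):
    """Boolean mask of 'background' readings by the Tukey IQR rule."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x > q1 - k * iqr) & (x < q3 + k * iqr)

# Synthetic sample: mostly background plus a few large exceedances
rng = np.random.default_rng(4)
x = np.concatenate([rng.lognormal(2.5, 0.4, 950), rng.lognormal(5.0, 0.3, 50)])

mask = iqr_background_mask(x)
background, excess = x[mask], x[~mask]
print(len(background), len(excess))
```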

Result of regression
First, let us compare the methods presented in Section 2 as applied to establishing a relationship between the full (not separated into background and excess) readings of the E-BAM and CityAir sensors. We consider the readings of the reference E-BAM sensor to be the predicted variable and the readings of the inexpensive CityAir sensor to be the ones being corrected. As the comparison measure we choose the coefficient of determination R², the proportion of the variance of the dependent variable explained by the regression model. Denoting the readings of the E-BAM sensor as m and the readings of the CityAir sensor as s, we consider the following mappings as dependency models: (1) $\{s\} \to \{m\}$; (2) $\{s^2, s\} \to \{m\}$; (3) $\{s^3, s^2, s\} \to \{m\}$; (4) $\{\ln(s)\} \to \{m\}$; (5) $\{\ln^2(s), \ln(s)\} \to \{m\}$.
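The five feature mappings can be compared with a short sketch. The synthetic readings below are illustrative assumptions (in particular, a linear ground-truth relationship is assumed, so the relative R² values here do not reproduce the paper's Table 3):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-ins: s = CityAir readings, m = E-BAM readings (illustrative)
rng = np.random.default_rng(5)
s = rng.lognormal(3.0, 0.8, size=2000)
m = 0.5 * s + rng.normal(0.0, 2.0, size=2000)

# The five feature mappings {features(s)} -> {m}
feature_maps = {
    "(1) s":            lambda s: np.column_stack([s]),
    "(2) s^2, s":       lambda s: np.column_stack([s**2, s]),
    "(3) s^3, s^2, s":  lambda s: np.column_stack([s**3, s**2, s]),
    "(4) ln s":         lambda s: np.column_stack([np.log(s)]),
    "(5) ln^2 s, ln s": lambda s: np.column_stack([np.log(s)**2, np.log(s)]),
}
for name, fmap in feature_maps.items():
    X = fmap(s)
    r2 = LinearRegression().fit(X, m).score(X, m)
    print(f"{name}: R^2 = {r2:.3f}")
```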
Table 3 shows a comparison of the R² determination coefficients for dependences of the form (1)-(5) obtained by the methods described in Section 2 on the full data (not divided into background and excess).

Thus, the linear dependence (1) of the measurements of one sensor on the other proved to be the best. It is noteworthy that the parametric machine learning methods showed a slightly better result than the non-parametric ones. We now show how machine learning methods trained on the full data (with the linear dependence (1)) approximate separately the background readings of PM2.5 concentration and the readings exceeding it, extracted by the methods described in the previous subsection.

Discussions
This paper investigates whether low-cost ground-level PM2.5 sensors can be used as an alternative to expensive reference stations, and whether such sensors can be calibrated using modern machine learning techniques based on real observations. The analysis was carried out on a pair of collocated sensors (an inexpensive CityAir sensor and a reference E-BAM) located in Krasnoyarsk (Russia), based on observational data from January 1, 2019 to December 31, 2022.
First, analysis of the descriptive statistics of the data from both sensors showed that the pollution data have a bimodal lognormal distribution; that the sample means of the two sensors differ by almost a factor of 2; and that the bulk of the readings are background readings. At the same time, the dependence between the full data sets for the entire period under consideration has a pronounced linear trend, which is determined by the measurements exceeding the maximum permissible concentrations of PM2.5 particles (see tables 3 and 4).
Secondly, the presence of an explicit trend makes it possible to effectively use parametric machine learning methods to build a data calibration model. The main advantage of parametric methods is that the resulting model is expressed explicitly as a functional dependence. However, clustering the data according to some criterion (for example, extracting background measurements or seasonal data) means that the data in each class follows its own dependence, different from the general one. In this case, non-parametric methods trained on a large amount of varied data allow more accurate predictions.

Fig. 2 .
Fig. 2. Statistical density distribution of the data for the sensors: left for raw data and right for log-transformed data.

Fig. 3 .
Fig. 3. Background data obtained using interquartile distance (IQR method, left) and based on information about maximum permissible concentrations (MPC method, right).

Table 1 .
Descriptive statistic of data.

Table 2 .
Descriptive statistic of grouped data.

Table 4 .
Comparison of Machine Learning methods trained on full data in application to grouped data. Table 4 shows that parametric machine learning methods capture the general dependence trend, while non-parametric methods better describe individual data groups.