1 Introduction

Rising population, urbanisation, economic growth and industrial expansion have increased air pollution worldwide [1]. The main sources of air pollution are vehicle exhaust, industrial emissions, agricultural activities and natural disasters such as volcanic eruptions and wildfires. These sources can produce particulate matter (PM), nitrogen dioxide (\(\hbox {NO}_{2}\)), carbon monoxide (CO), ozone (\(\hbox {O}_{3}\)) and sulfur dioxide (\(\hbox {SO}_{2}\)), among other pollutants [2]. The effect of air contaminants on the human body differs depending on the type of contaminant and the level and duration of exposure. Air pollution negatively impacts human health and influences socio-economic activities [3, 4]. Concerning human health, air pollution is associated with lung cancer [5, 6], cardiovascular diseases [7,8,9], and impaired cognitive function and human emotion [10, 11]. Premature mortality, negative social and educational outcomes, adverse market liquidity and catastrophic climate are socio-economic aspects triggered by air pollution [12]. Moreover, around 4.9 million deaths were attributed to air pollution in 2017 [13].

Measuring air pollution, along with its potential exposures and health impacts, becomes more challenging when data are missing. The existence of missing data can influence study interpretations and conclusions [14] and affect the functioning of air quality-related public services [15]. Missing data are a common problem in air pollutant measurement and in other fields such as clinical research, energy and traffic [16,17,18]. The causes of missing data vary, including sensor malfunction, sensor sensitivity, power outages, computer system failure, routine maintenance and human error [19, 20]. Depending on the cause, air pollution data can be missing either over long consecutive periods or in short intervals [21]. While routine maintenance and temporary power outages can cause short intervals of missing data, sensor malfunction and other critical failures can cause longer gaps in data collection.

According to Rubin, incomplete data are classified by their generating mechanisms: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR) [22]. MCAR occurs when data are genuinely missing as a result of random events [23]. MCAR assumes that the missing values are a random sample of the observed values, which is a restrictive assumption [24]. In MAR, the probability of missingness may depend on observed data values but not on those that are missing. Under MAR conditions, it is possible to retrieve the missing values from other observed predictor variables [23, 25]. When the probability of an observation being missing depends on unobserved values, such as the values of the observation themselves, the condition is called MNAR [22, 23, 26]. MNAR is nonignorable missingness and is considered a condition that yields biased parameter estimates [27]. Missing data are most often neither MCAR nor MNAR [26]. For air quality data, the missingness is at least MAR: even though some air contaminant values are missing for unknown reasons (i.e. MCAR), most missing values are caused by explainable circumstances such as routine maintenance, sensor malfunction and power outages [23, 28]. Thus, we assume MAR conditions for the air quality data used in this study.

There are two common ways to handle missing data: delete the missing parts or impute (substitute) the missing values [29]. The deletion method can be further divided into pairwise deletion and listwise deletion. Pairwise deletion discards only the specific missing values, whereas listwise deletion removes the entire record even if only one value is missing. The MCAR assumption allows incomplete observations to be excluded while still yielding unbiased results; however, a higher level of missing values may reduce the precision of the analysis [24]. Moreover, because pollutant measurement generates time-series data, the deletion method can break the data structure, and valuable information may be lost. In contrast, the imputation method reconstructs the missing data based on available information [30].

Reconstruction techniques inspired by machine learning have been used to recover corrupted data, one of which is the denoising autoencoder (DAE) [31]. The standard DAE and its variants have been implemented in many fields, such as image denoising [32,33,34,35], medical signal processing [36, 37] and fault diagnosis [38, 39]. Some works have also utilised DAEs for missing data imputation. Gondara et al. [40] addressed the challenge of multiple imputation by employing an overcomplete representation of DAEs; their method does not need complete observations for initial training, making it suitable for real-life scenarios. Abiri et al. [41] demonstrated the robustness of the DAE in recovering a wide range of missing data across different datasets, showing that their stacked DAE outperformed established methods such as K-nearest neighbour (KNN), multiple imputation by chained equations (MICE), random forest and mean imputation. Jiang et al. [42] utilised a DAE for imputing missing traffic flow data and compared three architectures composing the DAE: standard (“vanilla”), convolutional neural network (CNN) and bidirectional long short-term memory (Bi-LSTM). They evaluated the proposed models on test sets with a general missing rate of 30% and found that splitting the traffic data into weekdays and weekends significantly improved model performance.

The following points summarise recent challenges and breakthroughs in methods for air quality missing data imputation. First, the problem of missing data recurs in environmental research, and more studies are required to find effective imputation solutions. Although various methods have been proposed to handle missing data in many fields, more studies addressing air quality missing data prediction are needed [19]; the works mentioned earlier in this section mainly focus on clinical, energy and traffic data. Second, most related studies focused on a small amount of missing data. Ma et al. stated that previous works are applicable to short-interval missing imputation or consecutive missing values with a missingness level below 30%, an issue also raised by Alamoodi et al. [43]. Few works investigated missing data at large percentages (i.e. more than 80%), whether by deletion or imputation. Third, the multiple imputation method can improve imputation performance [14]; we consider implementing multiple imputation for air quality data a worthwhile endeavour. Fourth, many studies demonstrated the robustness of the denoising autoencoder in recovering noisy data, yet few implemented it for missing air quality data imputation. Finally, even though air pollutants strongly relate to spatiotemporal characteristics, these factors are rarely included in predicting the missing values of air pollution data. Air quality data collected from monitoring stations can exhibit intensely stochastic spatiotemporal correlations [44].

Inspired by the capability of the denoising autoencoder to reconstruct corrupted data, we propose an imputation method based on this architecture. We generate multiple plausible estimates for specific missing values. The proposed method is simple, suitable for both short-interval and long-interval consecutive missing imputation, and simultaneously offers multiple imputations to obtain less biased results. We use a convolutional denoising autoencoder with spatiotemporal considerations to extract air pollutant features. The method takes advantage of data from nearby stations to predict the missing data at the targeted station and does not need external features such as weather and climate data (air temperature, humidity, wind speed, wind direction, etc.). Thus, our proposed method involves only the intended pollutant data from neighbouring stations, providing a simple yet promising way to estimate missing values in real-world applications.

2 Method

2.1 Research framework in general

Our proposed method exploits data from nearby sites to enhance predictions at the target station with missing data. When a target station fails to gather pollutant data from the environment, neighbouring station data can help estimate the current loss at the target site. As illustrated in Fig. 1, \(S^3\) fails to collect data and acts as the target station. Neighbouring stations \(S^2\), \(S^5\) and \(S^6\) send their data to \(S^3\). The neighbouring stations eligible to send data are chosen based on their correlation coefficients with the target station. We implement a deep autoencoder model at \(S^3\) and use a one-dimensional convolutional architecture to cover the spatiotemporal behaviour of pollutant data. Based on the collected spatiotemporal data at the target and neighbouring stations, we predict the missing data at the target station.

Fig. 1 Target station exploits data from neighbouring stations to impute the missing data

Figure 2 shows the general research framework used in this study. There are seven main blocks, and each block consists of several tasks. The first block concerns the data sources used in this work. All data sources are available online and can be freely downloaded and used under the terms of their licences. Each dataset contains hourly concentrations of several air pollutants; although more contaminants are available, we selected two attributes as the targeted pollutants. Ten monitoring stations are involved in the calculations to acquire the spatial characteristics of air pollutant data. Moreover, we verified our proposed method on three different air quality datasets to achieve less biased results, covering air quality monitoring in three major cities: London, Delhi and Beijing.

Fig. 2 General description of the research framework for data analysis

The data pre-processing in the second block is dedicated to examining the correlation coefficients of the targeted air pollutant among monitoring stations. Calculating these correlation coefficients is one of the main steps of this study. For every target pollutant, we joined the same pollutant data from all locations into a single data frame and sorted them by the same hourly timestamp. We then calculated the correlation coefficients and selected the three highest correlations between the targeted and neighbouring monitoring stations. Based on these correlations, the data encompassing spatiotemporal characteristics are determined. The spatial behaviour is obtained using data from the targeted and three neighbouring stations (i.e. four monitoring stations in total). The temporal dependency is acquired by collecting the current value and the seven previous hourly values (i.e. 8 hours of data in total).

The pre-processing procedure in the third block makes the spatiotemporal features suitable for the proposed deep learning model. All training and test features are normalised to values between 0 and 1, reducing the data variability [45]. Additional pre-processing in this stage includes initialising missing data because the obtained datasets may contain some missing features. If missing data exist in the original dataset, only unbroken series of data spanning at least one week (168 hours) are considered for the training set. We did not remove the remaining data but did not use them during training. Therefore, several chunks of unbroken data are involved as inputs in the training phase, and the training steps are carried out in multiple rounds according to the number of data fragments. This step maintains the temporal behaviour of the time-series data. Once we have a clean dataset, we artificially create random and consecutive missing data, filling the artificial missing values with zeros. The final training and test sets are 3-dimensional matrices of size (\(n\times 8\times 4\)), where the integer n indicates the number of training or test sets, 8 denotes the 8-hour observation period, and 4 denotes the number of features taken from four monitoring stations.
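As an illustration of the normalisation step, the sketch below shows a simple min-max scaler; the function names, and the choice to reuse training-set statistics when scaling the test set, are our assumptions rather than details stated in this paper.

```python
import numpy as np

def min_max_scale(x, x_min=None, x_max=None):
    """Scale features to [0, 1] column-wise; pass the training-set
    statistics when transforming the test set (our assumption)."""
    x_min = np.nanmin(x, axis=0) if x_min is None else x_min
    x_max = np.nanmax(x, axis=0) if x_max is None else x_max
    return (x - x_min) / (x_max - x_min), x_min, x_max

def inverse_scale(x_scaled, x_min, x_max):
    """Undo the scaling to recover pollutant concentrations."""
    return x_scaled * (x_max - x_min) + x_min
```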

As indicated in the fourth block, we propose a deep learning model to handle missing data. The proposed architecture is a convolutional autoencoder, meaning that the autoencoder uses convolution layers in both the encoding and decoding parts. The proposed convolutional autoencoder acts as a denoising model: by purposely replacing some input features with zeros, the input sets can be treated as corrupted data, and the model learns to reconstruct these corrupted inputs by minimising the loss function. The training process is shown in the fifth block of the research framework.

The sixth and seventh blocks of the research framework are the post-training interpretation and evaluation steps. The model accepts and yields two-dimensional data, so post-training output interpretation is needed to obtain the intended prediction results; this process involves an aggregation procedure. Finally, several evaluation procedures are carried out on the trained model, such as calculating error metrics, testing the model on different missing rates and locations, and implementing the proposed algorithm on other air quality datasets.

2.2 Description of the datasets

This study uses air quality datasets from three different cities. A total of 10 stations are selected for each city, and two pollutants per station are studied. We consider ten monitoring stations adequate for implementing our algorithm and evaluating its performance. We also vary the pollutants across cities to demonstrate that our proposed method can be applied to different pollutants. Several considerations guided the station selection, chief among them the availability of pollution data and a common measurement period across stations: we included stations with at least three years of data from the same period. Furthermore, since our method is based on the correlation coefficients between stations, we include stations with varying degrees of correlation.

The first dataset is air pollutant data of London city. The data were collected using the Openair tool  [46]. Openair is an R package developed by Carslaw and Ropkins to analyse air quality data. For the London city dataset, we focus on two pollutants: nitrogen dioxide (\(\hbox {NO}_{2}\)) and particulate matter with a diameter of less than \(10\; \upmu \hbox {m}\) (\(\hbox {PM}_{10}\)). We selected ten monitoring stations across London and used data from January 2018 to January 2021.

The second dataset covers air quality in India. The dataset was compiled by Rohan Rao from the Central Pollution Control Board (CPCB) website and can be downloaded from Kaggle [47]. Among the many air quality monitoring stations available, we selected ten stations across Delhi covering February 2018 to July 2020. The chosen pollutants for the Delhi dataset are hourly measurements of \(\hbox {NO}_{2}\) and PM with a diameter of less than \(2.5\;\upmu \hbox {m}\) (\(\hbox {PM}_{2.5}\)).

The third dataset is the Beijing multi-station air quality data provided by Zhang et al. [48], which can be downloaded from the UCI Machine Learning Repository [49]. The dataset contains hourly pollutant data from January 2013 to February 2017. We focused on carbon monoxide (CO) and ozone (\(\hbox {O}_{3}\)) for the Beijing dataset. We selected ten monitoring stations, namely Aotizhongxin, Changping, Dingling, Dongsi, Guanyuan, Gucheng, Huairou, Nongzhanguan, Shunyi and Tiantan. Table 1 summarises the air quality monitoring stations used in this study.

Table 1 Dataset used in this study

Tables 2, 3 and 4 show brief descriptive statistics of the London, Delhi and Beijing air quality data. Four statistical characteristics are shown, namely the mean, standard deviation and two quartiles. The mean and standard deviation columns are calculated by excluding the missing values. The standard deviation measures how the observed values spread around the mean: a low standard deviation implies that the observed values tend to be close to the mean, whereas a high standard deviation indicates that they are spread over a broader range. The quartiles divide the ordered observed values (i.e. from smallest to largest) into four parts. The first quartile (\(25\%\)) is the middle value between the minimum and the median, whereas the third quartile (\(75\%\)) is the middle value between the median and the maximum.

Table 2 Descriptive statistics of London monitoring stations
Table 3 Descriptive statistics of Delhi monitoring stations
Table 4 Descriptive statistics of Beijing monitoring stations

2.3 Correlation of pollutant data

The same pollutant data from all monitoring stations are combined, and the correlation coefficient for each pollutant is calculated. For example, if \(\hbox {PM}_{10}\) is chosen as the target pollutant, we collect all \(\hbox {PM}_{10}\) values from all monitoring stations. Pearson's correlation is used to find the relationship of pollutant data among monitoring stations. It measures the linear correlation between two sets of data and can capture the correspondence between the trends of two time series [19].

Assume that we have a temporal sequence of specific pollutant data in the targeted station as \(\varvec{S}^t=[s_1^t,s_2^t,s_3^t,\ldots ,s_{n-1}^t,s_n^t]\) and a temporal sequence of the same pollutant data at a neighbouring station as \(\varvec{S}^s = [s_1^s,s_2^s,s_3^s,\ldots ,s_{n-1}^s,s_n^s]\). Note that both \(\varvec{S}^t\) and \(\varvec{S}^s\) have the same time frame ranging from sample 1 to n. Then, the Pearson’s correlation coefficient between these two series is described as follows:

$$\begin{aligned} r(\varvec{S}^t, \varvec{S}^s)= \frac{\sum _{i=1}^{n}(s_i^t - \mu _t)(s_i^s - \mu _s)}{\sqrt{\sum _{i=1}^{n}(s_i^t - \mu _t)^2 \sum _{i=1}^{n}(s_i^s - \mu _s)^2}} \end{aligned}$$
(1)

where \(r(\varvec{S}^t, \varvec{S}^s)\) denotes the Pearson's correlation coefficient between the time series \(\varvec{S}^t\) and \(\varvec{S}^s\), and \(s_i^t\) and \(s_i^s\) represent the i-th samples of \(\varvec{S}^t\) and \(\varvec{S}^s\), respectively. Finally, \(\mu _{t} = \frac{1}{n}\sum \nolimits _{i=1}^{n}s_i^t\) and \(\mu _{s} = \frac{1}{n}\sum \nolimits _{i=1}^{n}s_i^s\) denote the mean values of the time series \(\varvec{S}^t\) and \(\varvec{S}^s\), respectively.

In Eq. 1, the numerator is the covariance, a measure of how the series \(\varvec{S}^t\) and \(\varvec{S}^s\) vary together around their mean values. The denominator expresses the variances of \(\varvec{S}^t\) and \(\varvec{S}^s\). Correlation is a normalised version of covariance, scaled between \(-1\) and 1 [50]. When \(r = 1\), \(\varvec{S}^t\) and \(\varvec{S}^s\) are completely positively correlated; when \(r = -1\), they are completely negatively correlated; and when \(r = 0\), there is no obvious linear correlation between them [51].
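As a concrete illustration of Eq. 1, the sketch below computes the coefficient for two equally long series; the function name is ours, and in practice the same value can be obtained directly from `numpy.corrcoef` or pandas.

```python
import numpy as np

def pearson_r(s_t, s_s):
    """Pearson's correlation between two equally long series (Eq. 1)."""
    s_t, s_s = np.asarray(s_t, float), np.asarray(s_s, float)
    dev_t, dev_s = s_t - s_t.mean(), s_s - s_s.mean()
    # Covariance in the numerator, product of variances under the root
    return (dev_t * dev_s).sum() / np.sqrt((dev_t**2).sum() * (dev_s**2).sum())
```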

2.4 Data pre-processing

2.4.1 Spatial characteristics

As depicted in Fig. 2, there are two kinds of pre-processing phases conducted in this study, i.e. block number 2 and number 3. The primary purpose of the first pre-processing phase is to find the pollutant correlations. The pollutant correlation among monitoring stations is utilised to capture the spatial characteristic of air contaminants. For each pollutant, we identified which neighbouring stations have the closest spatial relationship with the station under investigation. In other words, we tried to take advantage of the existing monitoring stations to fill the missing values in the targeted monitoring station. Choosing different kinds of air pollutants will vary the correlation coefficients. Thus, the selected monitoring stations might also differ.

Let \(\varvec{S}^t = \begin{bmatrix} s_{1,1}^t &{} s_{1,2}^t &{} \ldots &{} s_{1,n}^t\\ s_{2,1}^t &{} s_{2,2}^t &{} \ldots &{} s_{2,n}^t\\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ s_{m,1}^t &{} s_{m,2}^t &{} \ldots &{} s_{m,n}^t \end{bmatrix} = (s_{i,j}^t) \in \mathbb {R}^{m \times n}\) be a matrix containing m rows of measurement data and n different pollutants in monitoring station t, where t ranges from 1 to 10. We therefore have a pollutant data collection from all stations, \(\varvec{S}^1\), \(\varvec{S}^2\), \(\varvec{S}^3\), ..., \(\varvec{S}^{10}\), where each row in matrix \(\varvec{S}^t\) is hourly measurement data. Then, we create a matrix \(\varvec{J} = \begin{bmatrix} s_{1,p}^1 &{} s_{1,p}^2 &{} \ldots &{} s_{1,p}^{10}\\ s_{2,p}^1 &{} s_{2,p}^2 &{} \ldots &{} s_{2,p}^{10}\\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ s_{m,p}^1 &{} s_{m,p}^2 &{} \ldots &{} s_{m,p}^{10} \end{bmatrix}\) as a collection of the same pollutant p taken from all stations, where p is a single integer chosen from 1 to n and represents the selected column in \(\varvec{S}^t\). In this scenario, we assume that all monitoring station data in the same city have the same column headers. We then computed the pairwise correlation of the columns in \(\varvec{J}\) using Eq. 1, excluding null/missing values. A graphical representation of this process is presented in Fig. 3.

As shown in Fig. 3, we collect the same pollutant from each monitoring station into a single data frame (or matrix). For example, when calculating the correlation of \(\hbox {PM}_{10}\) among stations in the London dataset, we collected \(\hbox {PM}_{10}\) data from the CT3, GN5, GR8, IS2, IS6, LB5, LW4, SK6, TH001 and TH002 monitoring stations into a single data frame. Only the targeted pollutant (i.e. \(\hbox {PM}_{10}\)) is selected, and the other pollutants are ignored. Before joining the data, we must ensure that the targeted contaminant from all monitoring stations shares the same time frame. We implemented these procedures in Python with the help of the pandas library [52].
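A minimal pandas sketch of this joining-and-correlation step is given below; the variable names and the dictionary of per-station data frames are illustrative assumptions about how the data are held.

```python
import pandas as pd

# station_frames: dict mapping station codes (e.g. 'CT3') to hourly data
# frames sharing a DatetimeIndex and a 'PM10' column (assumed structure).
def pollutant_correlation(station_frames, pollutant="PM10"):
    # Align the target pollutant from every station on the same timestamps
    joined = pd.concat(
        {code: df[pollutant] for code, df in station_frames.items()}, axis=1
    )
    # Pairwise Pearson correlation; pandas excludes missing values pairwise
    return joined.corr(method="pearson")
```

The three neighbours with the largest coefficients against a target column can then be picked with, for example, `corr['CT3'].drop('CT3').nlargest(3)`.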

Fig. 3 The process of determining the correlation coefficient for the target pollutant

Once the target pollutants have been collected, the Pearson's correlation calculation can be carried out using Eq. 1. For each station, we then sorted the correlation coefficients from strongest to weakest. The obtained coefficients indicate how strongly the same pollutant correlates between two monitoring stations. Based on this result, the number of monitoring stations involved as input of the proposed model is evaluated. Based on the conducted experiments, we decided to take three neighbouring stations along with the target station; thus, the input sets of our proposed model have four columns. We explain the process of deciding the number of monitoring sites in Sect. 3.3.

The second phase of the pre-processing blocks (i.e. the block number 3 in Fig. 2) is dedicated to capturing the temporal characteristic of the pollutants, conducting the perturbation procedure and creating input sets suitable for the proposed deep learning model.

2.4.2 Temporal characteristics

Besides the spatial characteristics, this study also captures the temporal behaviour of the pollutant data. The temporal behaviour describes the dependency among pollutant values at different times [53]. In this study, we calculate the autocorrelation coefficient of the contaminant under investigation using Pearson's correlation, computed between the series of the targeted pollutant and its shifted self. Thus, instead of calculating the correlation between two different time series, the autocorrelation computes the relation between the same time series at current and lagged times. Given the time series of pollutant data at the target station \(\varvec{S}^t=[s_1^t,s_2^t,s_3^t,\ldots ,s_{n-1}^t,s_n^t]\), we can rewrite Eq. 1 to find the lag-k autocorrelation function as:

$$\begin{aligned} r_{k}= \frac{\sum _{i=k+1}^{n}((s_i^t - \mu _t)(s_{i-k}^t - \mu _t))}{\sum _{i=1}^{n}(s_i^t - \mu _t)^2} \end{aligned}$$
(2)

where \(r_{k}\) denotes the autocorrelation function, k is the lag, \(s_i^t\) and \(s_{i-k}^t\) represent the i-th and lag-k samples of \(\varvec{S}^t\), and \(\mu _{t} = \frac{1}{n}\sum \nolimits _{i=1}^{n}s_i^t\) denotes the mean value of the time series \(\varvec{S}^t\).

In this study, we use 8 hours as the length of the pollutant data. This length is obtained by computing the lag-k autocorrelation, where the value of k determines the size of the input data. As discussed in Sect. 3.2.2, we determine \(k = 7\). Note that the time lag starts at 0 (i.e. \(k=0\)), so \(k=7\) means that we use 8 data points in total; in other words, to find a single prediction, we use the current and the seven previous observations as the input for our proposed model. To conclude, by involving the pollutant data from the target and three neighbouring stations (spatial consideration) and including the current and seven previous data points (temporal consideration), the final input for the proposed deep learning model has a size of \(8\times 4\).
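For illustration, the lag-k coefficients underlying this choice can be computed as sketched below. Note that `pandas.Series.autocorr` applies Pearson's formula between the series and its lagged self, a slight variant of Eq. 2 that uses the means of the overlapping segments rather than the full-series mean.

```python
import pandas as pd

def autocorrelation_profile(series: pd.Series, max_lag: int = 11):
    """Lag-k autocorrelation for k = 0..max_lag (k = 0 is always 1)."""
    return {k: series.autocorr(lag=k) for k in range(max_lag + 1)}

# A window of k = 7 (8 hourly values) is kept when the coefficients
# remain reasonably strong up to that lag.
```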

2.4.3 Missing data and perturbation procedure

Another pre-processing step carried out in this study handles the initial missing data in the original datasets. Missing values occur both as discontinuous and as consecutive missing patterns. As our proposed model is trained in a supervised manner, we have to provide input-target pairs: the model fits on training data consisting of input and target sets. While deleting missing data is a straightforward procedure, we avoid it because it can break the data structure and lose valuable information. To minimise damage to the original data structure, we carefully picked series of data with a minimum period of one week (168 hours). As our input sets comprise pollutant data from multiple monitoring stations, the minimum one-week selection is applied only to the target station; the other stations' periods are made to comply with the target station period.

Figure 4 illustrates this idea. The shadowed areas indicate periods of observed pollutant data without missing values, whereas the white strips indicate missing values. Based on the target station data, intervals of at least 168 hours without missing data were selected. The same selection periods were also applied to the neighbouring stations to maintain a consistent time frame between monitoring stations. After these steps, the target station has no missing data. Unlike the target station, however, missing values may still exist in the neighbouring station parts; we filled these with zeros.

Fig. 4 Implemented method to handle the initial missing data in the original datasets. A minimum of 168 hours of observed data without missing values is carefully selected

To train the proposed model, we need pairs of input and target sets. Since the target station data contain no unknown values, the actual targets for all input sets can be provided. A perturbation procedure was carried out to mimic the missing value phenomena and train the proposed model: some values in the input sets were intentionally removed, and all deleted values were filled with zeros. In this scenario, errors were generated in the clean dataset to evaluate the performance of the proposed imputation method [54]. Short-interval and long-interval consecutive missing patterns were applied to the input sets, and the model adjusted its parameters to minimise the loss function.

For the short-interval perturbation procedure, different levels of missingness were applied to the input sets. Following the work of Hadeed et al., four missing rates (\(20\%\), \(40\%\), \(60\%\) and \(80\%\)) were set for the target station [25]. While the missing rate was varied for the target station, a fixed missing rate was applied to the neighbouring stations during the training and testing phases; a missing rate of \(20\%\) was considered as the error probability for the neighbouring stations [54]. Due to the initial zero imputation illustrated in Fig. 4, the neighbouring stations have more than a \(20\%\) missing rate after the perturbation procedure.

For the long-interval perturbation procedure, a maximum of 500 hours of consecutive values was removed from some parts of the clean dataset, with the consecutive missing periods varied between 100 and 500 hours. This procedure was applied only to the target station, while the neighbouring stations followed the short-interval process described previously. Figure 5 illustrates the perturbation patterns applied to the input sets: both short- and long-interval missing patterns were generated only for the target station data, and a minimum \(20\%\) missing rate was applied to all neighbouring stations.
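A sketch of both perturbation modes is given below, assuming the data are held in NumPy arrays with the target station in the first column; the helper names and the seeded generator are our additions.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed makes the masks reproducible

def short_interval_mask(values, missing_rate):
    """Randomly zero out a fraction of entries (short-interval pattern)."""
    corrupted = values.copy()
    mask = rng.random(values.shape) < missing_rate
    corrupted[mask] = 0.0
    return corrupted

def long_interval_mask(values, start, length):
    """Zero out `length` consecutive hours starting at `start`
    (long-interval pattern, applied to the target column only)."""
    corrupted = values.copy()
    corrupted[start:start + length] = 0.0
    return corrupted
```

For a training matrix `X` of shape (hours, 4), the target column would receive, for example, `short_interval_mask(X[:, 0], 0.4)`, while each neighbour column gets a 20% mask.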

Fig. 5 Illustration of perturbation patterns applied to the dataset

2.4.4 Model input construction

Input sets resulting from the perturbation process are then normalised. Once the normalisation step is completed, the model input construction can be performed. The current missing value is predicted using the current initial imputation (i.e. the current missing value filled with zero) along with the data from the last 7 hours. As illustrated in Fig. 6, the dataset contains air pollutant data over sample rows \(t = 1,\ldots , T\), and the rolling window size is m. The input sets for the model are obtained by shifting over the pre-processed dataset: we take 8 hours of data and shift the window by one hour to get the next input set. This process follows a rolling-window scheme in which the increment between successive windows is one period.
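Under the assumption that the pre-processed data sit in an array of shape (T, 4), the windowing can be written as follows; `sliding_window_view` is one convenient way to realise it.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def build_input_sets(data, window=8):
    """Turn a (T, 4) array of hourly values into (T-window+1, window, 4)
    overlapping input sets, advancing one hour per window."""
    return sliding_window_view(data, window_shape=window, axis=0).transpose(0, 2, 1)
```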

The proposed model acts as a denoising tool as the missing values are intentionally generated from the complete dataset, and the given target is the complete dataset itself. Thus, our proposed model can be called a denoising autoencoder [31]. Given the noisy inputs, the autoencoder model will reconstruct these inputs. Based on this concept, the imputation of missing values that exist in the given data is performed.

Fig. 6 Extracting input sets from the pre-processed dataset

2.5 Proposed model

2.5.1 Convolutional autoencoder architecture

In this study, a convolutional autoencoder model is proposed to learn the missing patterns from the given corrupted input sets and the provided actual sets. The proposed model architecture is shown in Fig. 7. The autoencoder model accepts input sets in the form of \(8\times 4\) matrices: four columns of pollutant data, i.e. hourly targeted pollutant concentrations from four monitoring stations, and eight rows of hourly observed data. To train the model, we purposely corrupted the input sets by deleting actual values and filling them with zeros. The input columns represent the spatial behaviour, and the rows capture the temporal characteristics of the air pollution features.

Fig. 7 The proposed convolutional autoencoder model architecture

The autoencoder contains encoder and decoder parts, and both are based on one-dimensional convolution layers. The encoder is made up of convolution layers, while the decoder consists of transposed convolution layers. The proposed model receives only eight values as the feature length, so a small kernel is needed to obtain more detailed input features: a kernel size of two is applied to all layers. The kernel size specifies the 1D convolution window operated in each layer and is used to extract the essential input features. No padding is implemented in any layer.

As illustrated in Fig. 7, the size of the layers in the proposed model changes in both height and width. The width of each layer is controlled by the number of filters used in the previous layer. After various experiments, we determined the number of filters used in the proposed model, as presented in Table 5. The encoder filter counts decrease from 80 in the first layer to 10 in the fifth layer. From the latent space, the number of filters is expanded from 20 in the sixth layer to 80 in the ninth layer. Finally, we set the final layer to 8 output filters to obtain reconstructed inputs of equal size (i.e. \(8\times 4\) matrices).

Table 5 Layer properties of the proposed autoencoder model
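To make the architecture concrete, a minimal Keras sketch in the spirit of Fig. 7 and Table 5 is given below. The intermediate filter counts, the activations and the data layout (we reconstruct the four station channels in the last layer) are assumptions on our part, so this should be read as an approximation of the described model rather than its exact definition.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_autoencoder(window=8, n_stations=4):
    inputs = keras.Input(shape=(window, n_stations))
    # Encoder: 1D convolutions, kernel size 2, no padding ('valid')
    x = inputs
    for filters in (80, 60, 40, 20, 10):      # assumed taper from 80 to 10
        x = layers.Conv1D(filters, kernel_size=2, padding="valid",
                          activation="relu")(x)
    # Decoder: transposed convolutions expand the sequence length back
    for filters in (20, 40, 60, 80):          # assumed expansion, 20 to 80
        x = layers.Conv1DTranspose(filters, kernel_size=2, padding="valid",
                                   activation="relu")(x)
    # Final layer restores the (window, n_stations) shape; sigmoid keeps
    # outputs in [0, 1], matching the scaled inputs (our assumption)
    outputs = layers.Conv1DTranspose(n_stations, kernel_size=2,
                                     padding="valid", activation="sigmoid")(x)
    return keras.Model(inputs, outputs)
```

With kernel size 2 and no padding, each encoder layer shortens the sequence by one step (8 down to 3 at the latent space) and each transposed layer lengthens it by one, recovering the \(8\times 4\) output.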

2.5.2 Model configuration and training

The proposed autoencoder model was built using the CPU version of TensorFlow [55] and written in Python. Several Python libraries were also utilised, such as Keras [56], pandas [52], NumPy [57], scikit-learn [58], Matplotlib [59] and seaborn [60]. In this work, we used a local machine powered by an \(\hbox {Intel}^{\textregistered }\) \(\hbox {Core}^{\mathrm{TM}}\) i7-8565U CPU (4 cores, @1.80 GHz) with 8 GB of RAM, running Windows 10.

After creating the model architecture, we configured the model for training. We selected Adam [61] as the optimiser with a learning rate of 0.001. During training, the program computed the mean squared error (MSE) between the given target and the prediction; MSE served both as the loss function to optimise the model and as the metric to judge its performance. About 75% of the data were used as training sets and the remainder as test sets. We used a batch size of 32 (i.e. the number of samples per gradient update) and implemented early stopping: training terminates when the loss score shows no improvement over three consecutive epochs.
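A sketch of this configuration, assuming the Keras model sketched in Sect. 2.5.1 and pre-built arrays `x_noisy` (perturbed inputs) and `x_clean` (targets), might look as follows; the epoch cap and `restore_best_weights` are our choices, not stated in this paper.

```python
from tensorflow import keras

model = build_autoencoder()
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
              loss="mse", metrics=["mse"])

# Stop when the monitored loss fails to improve for 3 consecutive epochs
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                           restore_best_weights=True)

history = model.fit(x_noisy, x_clean,      # corrupted inputs -> clean targets
                    batch_size=32,
                    epochs=100,            # upper bound; early stopping ends sooner
                    validation_split=0.25, # assumed mapping of the 75/25 split
                    callbacks=[early_stop])
```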

Fig. 8 A denoising convolutional autoencoder workflow

Figure 8 illustrates the training process of the denoising autoencoder. Define the encoder function \(f_{\theta }(\cdot )\) parameterised by \(\theta =\{\mathbf {W},\mathbf {b}\}\), and decoder function \(g_{\phi }(\cdot )\) parameterised by \(\phi =\{\mathbf {W'},\mathbf {b'}\}\), where \(\mathbf {W}\), \(\mathbf {W'}\), \(\mathbf {b}\) and \(\mathbf {b'}\) represent the weight and bias of the encoder and decoder, respectively. Thus, we define the encoder function as \(\mathbf {h}=f_{\theta }(\mathbf {x})\) and the decoder function as \(\mathbf {r}=g_{\phi }(\mathbf {h})\), where \(\mathbf {x}\) is the input, \(\mathbf {h}\) is the code representation learning, and \(\mathbf {r}\) is the reconstructed input. The perfect condition for model learning is to set \(g_{\phi }(f_{\theta }(\mathbf {x}))=\mathbf {x}\). However, the model cannot learn perfectly but instead tries to minimise the error between the actual input and the reconstructed input [62]. Then, for each training set \(\mathbf {x}^{(i)}\), the parameters \(\theta\) and \(\phi\) are optimised to minimise the average reconstruction error [31]:

$$\begin{aligned} \theta ^{*},\phi ^{*}=\underset{\theta , \phi }{\arg \min }\frac{1}{n}\sum _{i=1}^{n} L(\mathbf {x}^{(i)},g_{\phi }(f_{\theta }(\mathbf {x}^{(i)}))) \end{aligned}$$
(3)

where L is the model loss function. The typical loss function is squared error \(L(\mathbf {x},\mathbf {r})=\Vert \mathbf {x}-\mathbf {r}\Vert ^2\). For the denoising autoencoder, instead of \(\mathbf {x}\), we define \(\widetilde{\mathbf {x}}\) as the noisy input of \(\mathbf {x}\) [62]. Thus, the loss function of the denoising autoencoder is rewritten as:

$$\begin{aligned} L(\theta , \phi )=\frac{1}{n}\sum _{i=1}^{n}(\mathbf {x}^{(i)} - g_{\phi }(f_{\theta }(\widetilde{\mathbf {x}}^{(i)})))^{2} \end{aligned}$$
(4)

2.5.3 Post-training outputs interpretation

The test sets are fed to the model after the training process has been completed. The model accepts \(8\times 4\) input sets and yields outputs of the same size. As the trained values are scaled into [0,1], the output values must be transformed back to their original values.

After undoing the scaling of the model output, we determine a single prediction for each hour. As illustrated in Fig. 9, the autoencoder produces overlapping outputs for a given prediction period. We aggregate the values by calculating the mean of all overlapping output sets to give a single point estimate. As the targeted results are located in the first columns of the model outputs (the target station), the means are calculated only over the first columns of the output sets. These processes are systematically presented in Algorithm 1 in Appendix A.
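A compact sketch of this aggregation, assuming predictions of shape (n_windows, 8, 4) whose windows advance one hour at a time, is given below; it averages the target-station column across all windows covering each hour.

```python
import numpy as np

def aggregate_overlaps(preds, window=8):
    """Average the target-station column (index 0) of overlapping
    windows into one estimate per hour."""
    n_windows = preds.shape[0]
    n_hours = n_windows + window - 1
    sums = np.zeros(n_hours)
    counts = np.zeros(n_hours)
    for w in range(n_windows):
        sums[w:w + window] += preds[w, :, 0]
        counts[w:w + window] += 1
    return sums / counts  # hours near the edges overlap fewer windows
```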

Fig. 9 Interpretation of model outputs

2.6 Model evaluation metrics

Several metrics are commonly used to evaluate model performance. In this study, root mean square error (RMSE) and mean absolute error (MAE) are used, following the work of [63]. Another broadly used metric in machine learning studies is the coefficient of determination (\(R^2\) or R-squared). Chicco et al. [64] suggested using \(R^2\) for regression task evaluation as it is more informative in qualifying regression results. However, a limitation of \(R^2\) arises when the calculated score is negative: the model performance can be arbitrarily bad, but it is impossible to recognise how badly the model performed [65].

Following work conducted by Ma et al. [19], we implemented a rate of improvement on RMSE (RIR) to measure the performance of our methods in comparison with the existing imputation techniques. The RIR is calculated using the following equation:

$$\begin{aligned} RIR^{A,B} = \frac{RMSE^A - RMSE^B}{RMSE^A} \times 100\% \end{aligned}$$
(5)

where \(RMSE^A\) denotes the RMSE value of the benchmarked method and \(RMSE^B\) is the RMSE value of our proposed method.
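For actual and imputed series `y_true` and `y_pred` (assumed names), these metrics can be computed as in the sketch below; the RIR helper follows Eq. 5, and the other three are available directly in scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def rir(rmse_benchmark, rmse_proposed):
    """Rate of improvement on RMSE (Eq. 5), in percent."""
    return (rmse_benchmark - rmse_proposed) / rmse_benchmark * 100.0

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
```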

In addition to RMSE, MAE, \(R^2\) and RIR, visual comparisons of actual and imputed data in the form of line, bar or box plots are also presented to describe the model performance more intuitively.

3 Results and discussion

3.1 Distribution of missing periods

As we develop the proposed model for both short-interval and long-interval consecutive missing patterns, it is essential to know the nature of the missing patterns. We counted the duration of all missing patterns in the original datasets, with results shown in Fig. 10. The figure visualises the distribution of missing data durations using continuous probability density curves, giving an understanding of how these durations are distributed. As shown in Fig. 10, most missing durations in the London dataset are less than 400 hours, while the Delhi and Beijing datasets exhibit shorter missing durations of less than 200 hours. The peaks of the probability density curves in all datasets commonly occur within 100 hours. We conclude that the missing data in all datasets are dominated by short-interval missing patterns whose periods are less than approximately one week.

Fig. 10 Probability density function of missing data in all stations

3.2 Evaluation of temporal characteristics

3.2.1 Autocorrelation coefficients of pollutant data

The temporal relationship is evaluated to determine the length of the input series to be fed to the model. The temporal behaviour of each monitoring station is assessed based on the Pearson's autocorrelation coefficients of its series: we computed the correlation coefficient between the actual time-series data and their lag-k shifted versions, where k is an integer varying between 1 and 11. For instance, \(k =1\) means that the actual time-series data are shifted 1 hour backwards.

The autocorrelation coefficients for each city and pollutant are reported in Fig. 11. For \(k \le 11\) hours, the autocorrelation coefficients can vary between 1 (i.e. at \(k = 0\)) and 0. For the London dataset, all monitoring stations measuring \(\hbox {NO}_{2}\) have similar autocorrelation coefficient slopes ranging from 1 to about 0.2. Monitoring station \(S^1\) has the flattest slope, indicating that \(S^1\) has the strongest relationship among the lagged hours of \(\hbox {NO}_{2}\) compared to other stations. For \(\hbox {PM}_{10}\) in the same dataset, the autocorrelation coefficients of stations \(S^2\), \(S^3\) and \(S^6\) plunge to about 0.2 within the first six lagged hours, whereas the other stations' coefficients remain above 0.5. Among these, \(S^2\) appears to have the weakest temporal dependency.

Fig. 11 Temporal characteristics of air quality datasets based on their autocorrelation coefficients

For the Delhi air quality dataset, the \(\hbox {PM}_{2.5}\) autocorrelation coefficient slopes are relatively flatter across stations, ending between 0.55 and 0.65 at \(k=11\). The autocorrelation coefficients for \(\hbox {NO}_{2}\) between monitoring stations in Delhi degrade more diversely, especially from \(k = 3\) to \(k=11\). Stations \(S^1\) and \(S^8\) have exceptional slopes, whose coefficients tend to increase after \(k = 7\). Less varied autocorrelation coefficient slopes are shown in the Beijing dataset for both CO and \(\hbox {O}_{3}\) data; however, the \(\hbox {O}_{3}\) coefficients decrease more rapidly than the CO coefficients.

3.2.2 Temporal window size determination

We determine the temporal length of the input set for the autoencoder model based on the coefficients shown in Fig. 11. A simple model is introduced as a base model and used to evaluate the temporal and spatial dependencies: the temporal evaluation determines the number of input set rows, whereas the spatial evaluation defines the number of input columns. Other model architectures are then derived from the base model until we finally settled on the model whose properties are presented in Fig. 7 and Table 5.

Fig. 12 The proposed base model for temporal and spatial characteristic evaluations

Figure 12 shows the proposed base model. The model is based on the autoencoder architecture, similar to the final model but with shallower hidden layers. For simplicity, the base architecture in this study is written as \(L^{1}40 - L^{2}30 - L^{3}20 - L^{4}30 - L^{5}40 - L^{6}x\), where the integer x depends on the intended number of output columns. For the temporal evaluation, we set \(x = 4\), meaning that we use the target station together with three neighbouring stations. The term \(L^{1}40\) means that the first layer has 40 output filters; this layer yields 40 columns placed in the second layer, as indicated in Fig. 12. The sixth layer (i.e. \(L^{6}x\) with \(x = 4\)) has four output filters and forms \(n\times 4\) output sets, where n depends on the input length, kernel and filter size. We set the kernel size to 2 and the stride to 1 for all layers, and no padding is applied.

In this experiment, 60% of the total observations are used as training sets for each station and pollutant. The test data are selected from an unbroken time-series segment with at least 400 hours of consecutive observed values. The target station is corrupted with a missing rate of 40%, whereas the neighbouring data are lightly corrupted with a missing rate of 20%. To obtain less biased results, we applied fivefold cross-validation to the dataset. An example of the temporal data evaluation is demonstrated in Table 6, which shows the results obtained from the London dataset with \(\hbox {NO}_{2}\) as the target pollutant.

Table 6 The average of RMSE and standard deviation values after fivefold cross-validation for \(\hbox {NO}_{2}\) of the London dataset 

As shown in Table 6, the lowest average RMSE values are mostly obtained from the input sets with a lag of 7 hours, indicated in bold. Lag-7 hours means that the model accepts eight temporal data points as the number of rows (input length). Setting \(k=7\), however, does not improve the RMSE values dramatically: for example, the RMSE at \(S^5\) equals 6.28 \(\upmu \hbox{g}/\hbox{m}^3\) at \(k=7\), an improvement of only about 4% over the worst result (at \(k=10\)). In general, increasing the amount of temporal data does not improve the model performance, because the temporal correlations of the measurement values weaken at longer lags, and weak temporal correlations contribute less essential features to the autoencoder model. To sum up, we settled on a window size of 8 time steps for our model.

3.3 Evaluation of spatial characteristics

3.3.1 Correlation coefficients of pollutant data

While autocorrelation defines the temporal relations, Pearson's correlation among stations determines the spatial characteristics. Unlike the autocorrelation coefficient, which correlates a series with its shifted self, Pearson's correlation coefficients for spatial evaluation are computed between two stations. Pairs of monitoring stations are created per city and pollutant, and the correlation coefficients are assessed between all pairs. No time lag is applied to the monitoring station data, in contrast to the temporal evaluation.

As an example, we report the obtained correlation coefficients for \(\hbox {NO}_{2}\) and \(\hbox {PM}_{10}\) in the London air quality data in Tables 7 and 8, respectively; the same procedure was also applied to the Delhi and Beijing datasets. The correlation coefficients reflect the linear relationship between station pairs and are calculated using Eq. 1. As reported in Table 7, the correlation coefficients among monitoring stations measuring \(\hbox {NO}_{2}\) fall between 0.49 and 1.00. Paired stations such as \(S^1-S^9\), \(S^4-S^6\), \(S^8-S^9\) and \(S^5-S^{10}\) have strong correlations for \(\hbox {NO}_{2}\), whereas the pairs \(S^3-S^5\), \(S^3-S^7\) and \(S^3-S^{10}\) have weaker correlations. The correlation coefficients between the paired stations measuring \(\hbox {PM}_{10}\), presented in Table 8, are more diverse, ranging from 0.27 to 1.00. In all datasets, no negative coefficient was found.

Table 7 Coefficient of correlation for \(\hbox {NO}_{2}\) in the London air quality data
Table 8 Coefficient of correlation for \(\hbox {PM}_{10}\) in the London air quality data
Table 9 The strongest correlation coefficient for neighbouring stations selection in London air quality data

We carefully selected the three neighbouring stations with the strongest correlation coefficients to the target station. Sorting from largest to smallest coefficients, we report the selected neighbouring stations for \(\hbox {NO}_{2}\) and \(\hbox {PM}_{10}\) in Table 9.

3.3.2 Selecting the number of neighbouring stations

This section discusses the procedure for selecting the number of neighbouring stations used to form the input sets for the autoencoder model. In this study, the spatial evaluation determines the number of involved neighbouring stations: varying this number changes the model input width (i.e. the number of columns), since the columns consist of the target station data plus the neighbouring station data. As shown in Fig. 12, the input set width is evaluated from 3 to 6 monitoring stations. We demonstrate the results of neighbour selection in Delhi with \(\hbox {PM}_{2.5}\) as the target pollutant. Fivefold cross-validation is also implemented in this step. We maintain the 8-step time window determined in Sect. 3.2.2 and adjust only the number of input columns. Table 10 shows the effect of involving different numbers of neighbouring stations on the final predictions.

Table 10 The average of RMSE (std. deviation) after fivefold cross-validation for selecting the number of involved neighbouring stations 

Using three neighbouring stations along with the target station can improve performance markedly. For example, the obtained average RMSE at \(S^1\) equals 56.67 \(\upmu \hbox{g}/\hbox{m}^3\), an improvement of about 3.8 \(\upmu \hbox{g}/\hbox{m}^3\) over the worst result (i.e. five neighbouring stations). At \(S^3\), five neighbouring stations produce a slightly better average RMSE than the other configurations. In general, however, adding further neighbours does not improve performance and increases the computation load, which suggests that increasing the number of columns does not help the model learn the essential features of the input sets. Similarly, decreasing the number of neighbours to two degrades performance. Thus, we use three neighbouring stations along with the target station in our model.

3.4 Model architecture evaluation

This section verifies the final model architecture proposed in this study. From the base model, several alternative autoencoder architectures are derived. These models are created by expanding the base model layers and adjusting the number of output filters. The kernel and stride keep the same properties as the base model, and no padding is applied to any of the proposed models.

Table 11 Proposed autoencoder architectures
Table 12 The average RMSE for deep autoencoder architecture selection

This section demonstrates the results obtained from the Beijing air quality data with CO as the target pollutant. As presented in Table 11, we provide six different autoencoder architectures, labelled M1, M2, ..., M6, all with a kernel size of 2, a stride of 1 and no padding. The base model used to determine the spatiotemporal characteristics is identified as M1. This experiment applies 40% and 20% missing rates to the target and neighbouring stations, respectively, and fivefold cross-validation is again performed. Based on the spatiotemporal evaluation, the input sets are fixed at a size of \(8\times 4\) in the model selection step.

Based on the final prediction results obtained from each model, M6 yields the most accurate imputation results, as shown in Table 12. Out of 10 monitoring stations, M6 gives the best prediction at six stations. For example, M6 predicts the missing data for \(S^8\) with an RMSE of 240.88 \(\upmu \hbox{g}/\hbox{m}^3\), about 30% better than the base model, which yields an RMSE of 346.45 \(\upmu \hbox{g}/\hbox{m}^3\). In this study, deeper model architectures give better predictions: in most cases, the ten-layered models outperformed the six- and eight-layered models. We avoid even deeper architectures because the length of the latent space (code) would become very small.

3.5 Imputation performance

This study divides the imputation performance evaluation into two categories: short-interval missing data with varying missing rates and long-interval consecutive missing data. We cannot discuss the imputations for all stations in this section; instead, we highlight the essential issues related to our proposed imputation method.

3.5.1 Short-interval imputation

This study defines the term “short-interval” as a missing period generated by removing some values from the actual data at a specific missing rate. The initial random state, which can be set in the program, determines which values are removed from the actual data. We intentionally deleted the actual data at four different missing rates (i.e. 20%, 40%, 60% and 80%). Figure 13 shows an example of the missing pattern variation in a test set at station \(S^3\) of the London dataset. As seen in the figure, there are 648 hourly samples of \(\hbox {NO}_{2}\), collected from 20-Feb-2020 13:00:00 to 20-Mar-2020 00:00:00. The white stripes indicate the missing values, which we fill with zeros. While the missing rate at the target station is varied, the missing rate at the neighbouring stations is fixed at 20%.

Table 13 shows the monitoring stations selected as representatives for short-interval imputation. As presented in the table, we selected two monitoring stations per city and covered all pollutants in each dataset, giving 12 experiments in total. Table 14 shows the imputation results corresponding to the experiment numbers in Table 13. The imputation performance is evaluated using three error metrics: RMSE, MAE and \(R^2\).

Fig. 13 Short-interval missing patterns in the test set obtained from station \(S^3\) of the London dataset

Table 13 Properties of short-interval imputation experiment
Table 14 Performance metrics of short-interval imputation for all experiments described in Table 13

Among the monitoring stations, our method is less effective in imputing the missing \(\hbox {NO}_{2}\) values in Delhi. We exclude the discussion of model performance for \(\hbox {NO}_{2}\) in Delhi here, as it is discussed separately in Sect. 3.6. In general, lower missingness levels yield lower RMSE/MAE values and higher \(R^2\) scores. Due to the physical nature of each pollutant, the RMSE/MAE values may vary significantly; for example, the values for some pollutants, such as CO, are considerably higher than those for \(\hbox {PM}_{10}\). The \(R^2\) score is therefore introduced to view the performance more intuitively. As shown in Table 14, a 20% missing rate results in \(R^2\) scores higher than 0.8 at all target stations, ranging from 0.80 to 0.95; our proposed method yields satisfying results at this level of missingness. At 40% and 60% missing levels, the proposed model maintains its performance, giving \(R^2\) scores between 0.72 and 0.94. At the 80% level, increased imputation errors yield \(R^2\) scores ranging from 0.64 in experiment no. 1 to 0.93 in experiment no. 2. For the selected test period, predicting the missing \(\hbox {PM}_{2.5}\) values at \(S^2\) in Delhi (i.e. experiment no. 6) gives the best imputation, with \(R^2\) scores higher than 0.9 at all levels of missingness.

With the help of the existing neighbouring stations, and although prediction errors are inevitable, the proposed autoencoder can learn the input features and fill the missing values effectively. Even when the target station data are severely corrupted, the proposed model and method achieve the desired results, as shown for most monitoring stations in Table 14. To sum up, our method produces satisfactory accuracy for short-interval missing imputation.

3.5.2 Long-interval consecutive imputation

Unlike the short-interval method, which generates missing values based on a specific random state, the long-interval consecutive process removes all data at the target station for a specific period. To this end, we set 400 hours as the minimum missing period. Figure 14 shows a test set pattern of long-interval missing values applied to \(S^8\) (Nongzhanguan) of the Beijing dataset. The set consists of 514 hourly samples, taken from 23-Sep-2016 05:00:00 to 14-Oct-2016 14:00:00. As seen in the figure, the values at the target station are entirely missing and filled with zeros, and a 20% missingness level is applied to all neighbouring stations. As no values can be obtained from the target station, the autoencoder predicts the missing values entirely from the existing adjacent data.

Fig. 14 Long-interval missing patterns in the test set obtained from station \(S^8\) of the Beijing dataset

Six experiments were conducted to represent long-interval consecutive imputation scenarios, as shown in Table 15. We select one station in each city, and each station covers all pollutant types: stations \(S^5\), \(S^6\) and \(S^8\) represent the London, Delhi and Beijing datasets, respectively. Table 15 also shows the error metrics resulting from the long-interval imputation for specific missing periods. For a minimum of 400 hours (about 17 days) of missing data, our model can impute the missing values with very satisfying results, some of them yielding \(R^2\) scores of 0.90 and higher. However, among the experiments, the \(S^6\) station in Delhi measuring \(\hbox {NO}_{2}\) produces the lowest \(R^2\) score. The same pattern appears in the short-interval imputation, where predicting \(\hbox {NO}_{2}\) in Delhi consistently yielded the lowest performance. We observed that stations with low correlation coefficients may affect the imputation performance; we discuss this issue separately in Sect. 3.6.

Fig. 15
figure 15

Plot of long-interval missing imputation between actual and imputed values along with 95% confidence intervals. The properties of each figure are shown in Table 15

Table 15 Results of long-interval consecutive imputation

Figure 15 shows the plots of actual against imputed values for the experiments presented in Table 15. Following [66, 67], the plots also show a 95% confidence interval. In this study, the interval is obtained by adding and subtracting twice the RMSE from the imputed values. Using the RMSE to form the confidence interval gives a better summary than the standard deviation and is directly helpful in assessing the uncertainty of the imputed values [65]. From Fig. 15, we observe that the imputed values track the dynamics of the actual values: the current neighbour values help the autoencoder model recover the missing values at the target station effectively. By construction, roughly 5% of the imputed values are expected to fall outside the shaded interval areas.
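The band in Fig. 15 can be reproduced in a few lines; a minimal sketch assuming matplotlib, with the interval formed by the imputed series plus and minus twice the RMSE as described above:

```python
import matplotlib.pyplot as plt

def plot_with_interval(t, actual, imputed, rmse):
    """Plot actual vs. imputed values with an approximate 95% interval
    of imputed +/- 2 * RMSE, as used for Fig. 15."""
    lower, upper = imputed - 2.0 * rmse, imputed + 2.0 * rmse
    plt.plot(t, actual, label="actual")
    plt.plot(t, imputed, label="imputed")
    plt.fill_between(t, lower, upper, alpha=0.3, label="95% interval")
    plt.legend()
    plt.show()
```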

3.6 Effect of correlation level

We observed that the correlation coefficients among paired stations can affect the performance of our proposed method. We now focus on the stations measuring \(\hbox {NO}_{2}\) in Delhi, which contribute to poor estimations in both short- and long-interval imputations. The correlation coefficients of \(\hbox {NO}_{2}\) and \(\hbox {PM}_{2.5}\) for the Delhi dataset are shown in Tables 16 and 17. The minimum coefficient for \(\hbox {NO}_{2}\) is 0.01, obtained from the pair \(S^3-S^6\); the correlation between \(S^3\) and \(S^{10}\) is even negative. Excluding each station's correlation with itself, the maximum correlation coefficient for \(\hbox {NO}_{2}\) is only 0.65, calculated from \(S^2-S^6\). In the same city, the monitoring stations measuring \(\hbox {PM}_{2.5}\) yield much stronger correlation coefficients: computed from the pairs \(S^3-S^8\) and \(S^3-S^{10}\), the minimum coefficient is 0.67, whereas the pair \(S^1-S^2\) contributes the maximum coefficient of 0.90.

Table 16 Coefficient of correlation among stations measuring \(\hbox {NO}_{2}\) in Delhi air quality data
Table 17 Coefficient of correlation among stations measuring \(\hbox {PM}_{2.5}\) in Delhi air quality data
Fig. 16
figure 16

Scatter plot of short-interval imputation at station \(S^5\), with 20% and 40% missingness levels

A very low correlation coefficient among stations results in highly biased imputation values. We studied this phenomenon through various experiments, some of which are shown in Fig. 16. The figure depicts scatter plots of the actual against imputed \(\hbox {NO}_{2}\) and \(\hbox {PM}_{2.5}\) values at \(S^5\) of the Delhi dataset. The experiments are set for a short-interval missing scenario with 20% and 40% missing rates. For \(\hbox {NO}_{2}\), the test period started on 06-Apr-2020 at 04:00:00 and ended on 29-Apr-2020 at 23:00:00. The period from 22-Feb-2020 at 19:00:00 to 11-Mar-2020 at 14:00:00 is selected for \(\hbox {PM}_{2.5}\). As shown in the figure, the imputation results for the two pollutants differ considerably, even though the experiments are conducted at the same monitoring station. While the \(\hbox {PM}_{2.5}\) imputation results lie relatively close to the diagonal line, the missing estimations for \(\hbox {NO}_{2}\) are more scattered.

Station \(S^5\) and three neighbouring stations (\(S^7\), \(S^8\) and \(S^2\)) form the input set for the \(\hbox {NO}_{2}\) pollutant. The \(S^5-S^7\), \(S^5-S^8\) and \(S^5-S^2\) pairs have correlation coefficients of 0.38, 0.37 and 0.35, respectively, which are weak correlations. For the \(\hbox {PM}_{2.5}\) pollutant, the input sets are formed by the joint stations \(S^5\), \(S^2\), \(S^{10}\) and \(S^1\). The computed correlation coefficients for \(S^5-S^2\), \(S^5-S^{10}\) and \(S^5-S^1\) are 0.90, 0.88 and 0.86, respectively, which are much stronger. Low correlations make the input sets look more randomised, so the neighbouring station data do not contribute enough knowledge to the model. Figure 17 illustrates this issue more intuitively. The figure shows the first input set fed to the model with a 40% missing rate for both \(\hbox {NO}_{2}\) and \(\hbox {PM}_{2.5}\). As the figure shows, the reconstructed input of \(\hbox {NO}_{2}\) is less accurate than the reconstructed input of \(\hbox {PM}_{2.5}\). The weak correlation causes significant differences among the column values in the input set, making it difficult for the model to estimate the missing parts.
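Under the assumption that the neighbouring stations are ranked by their Pearson correlation with the target station, the selection step can be sketched with pandas; the column names are hypothetical:

```python
import pandas as pd

def select_neighbours(df: pd.DataFrame, target: str, k: int = 3) -> list:
    """Rank candidate stations by their Pearson correlation with the
    target station and return the k most correlated neighbours.

    df     : DataFrame with one column of pollutant values per station
    target : column name of the target station, e.g. "S5"
    """
    corr = df.corr()[target].drop(target)  # Pearson correlation by default
    return corr.sort_values(ascending=False).head(k).index.tolist()

# Example: neighbours = select_neighbours(df_no2, target="S5", k=3)
```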

Fig. 17
figure 17

Example of input and output set retrieval before and after the denoising process at Delhi station \(S^5\): (a) retrieval of \(\hbox {NO}_{2}\) and (b) retrieval of \(\hbox {PM}_{2.5}\)

3.7 Comparison with other methods

This section verifies the effectiveness of our proposed model against existing methods. Both univariate and multivariate imputation methods are considered. A univariate method imputes missing values based only on the existing values in the same feature dimension, whereas a multivariate method utilises the non-missing data across all feature dimensions to estimate the missing values. The selected univariate imputations are most frequent, median and mean. Four estimators are used for multivariate imputation: Bayesian ridge, decision tree, extra-trees and k-nearest neighbours.
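These baselines correspond to standard scikit-learn components; a minimal sketch of how they could be configured is given below, assuming scikit-learn is the implementation (the hyperparameters shown are illustrative defaults, not necessarily those used in the experiments):

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.neighbors import KNeighborsRegressor

# Univariate baselines: each column is imputed from its own observed values.
univariate = {
    "Most": SimpleImputer(strategy="most_frequent"),
    "Med": SimpleImputer(strategy="median"),
    "Mean": SimpleImputer(strategy="mean"),
}

# Multivariate baselines: each column is modelled from the remaining columns.
multivariate = {
    "BaR": IterativeImputer(estimator=BayesianRidge(), random_state=0),
    "DecT": IterativeImputer(estimator=DecisionTreeRegressor(max_depth=10),
                             random_state=0),
    "ExT": IterativeImputer(estimator=ExtraTreesRegressor(n_estimators=10),
                            random_state=0),
    "KNN": IterativeImputer(estimator=KNeighborsRegressor(n_neighbors=5),
                            random_state=0),
}

# X is an (n_samples, n_stations) array with NaNs marking missing entries.
# X_imputed = multivariate["BaR"].fit_transform(X)
```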

We demonstrated the effectiveness of our proposed model against the other methods for all monitoring stations, conducting 60 experiments in total to cover the different cities, stations and pollutants in our datasets. For the London dataset, the training data for \(\hbox {NO}_{2}\) and \(\hbox {PM}_{10}\) span January 2018 to around October 2019, whereas the test sets are taken from several unbroken segments between around November 2019 and January 2021. We combined the short- and long-interval perturbation procedures for the training and test sets; the perturbation step removes about 45% of the target training set and 50% of the test set. To obtain less biased results, we applied fivefold cross-validation to the dataset.
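The cross-validation loop can be sketched as follows, assuming scikit-learn's KFold with shuffling disabled so that each fold remains a set of unbroken temporal segments; the variable names are illustrative:

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholder data: (n_hours, n_stations) pollutant values for one city.
X = np.random.rand(1000, 4)

kf = KFold(n_splits=5, shuffle=False)  # contiguous folds keep segments unbroken
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    X_train, X_test = X[train_idx], X[test_idx]
    # 1. Perturb the target column (~45% train / ~50% test missingness).
    # 2. Fit the autoencoder on the perturbed training set.
    # 3. Impute the test set and record RMSE, MAE and R^2.
    print(f"fold {fold}: train={len(train_idx)}, test={len(test_idx)}")
```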

For all pollutant data in the Delhi dataset (i.e. \(\hbox {NO}_{2}\) and \(\hbox {PM}_{2.5}\)), the training period runs from February 2018 to mid-July 2019, whereas the test period starts in July 2019 and ends in July 2020. The same perturbation procedures as for the London dataset are applied to the Delhi data, resulting in missing rates of about 45% and 60% for the training and test sets at the target station. The CO and \(\hbox {O}_{3}\) data from the Beijing monitoring stations are treated in the same way: the training data are selected from March 2013 to around September 2015, whereas the test data are chosen from September 2015 to February 2017. The missing values at the target station for the training and test steps are maintained at rates of 45% and 50%, respectively. Bar charts visualising the RMSE values obtained from each method are shown in Fig. 18.

Fig. 18
figure 18

Performance comparison of the proposed model against commonly used methods

Figure 18 presents the performance of our proposed model and seven commonly used imputation methods, using the following abbreviations: Most (most frequent imputation), Med (median imputation), Mean (mean imputation), DecT (decision tree regressor), ExT (extra-trees regressor), KNN (k-nearest neighbours regressor), BaR (Bayesian ridge regressor) and \(Aut*\) (proposed autoencoder). The bars for the proposed autoencoder are indicated with black-filled areas.

As the figure shows, univariate imputation using statistical properties (most frequent, median and mean) yields the most inaccurate results. Compared to the univariate techniques, the multivariate imputation techniques return significantly lower imputation errors. Our proposed method outperforms the other methods for all stations and pollutants except the Delhi monitoring stations measuring \(\hbox {NO}_{2}\), where, out of ten stations, other methods yield slightly better performance at three (i.e. \(S^3\), \(S^5\) and \(S^9\)). As discussed in the previous section, weak correlations among stations lower the performance of our proposed method.

Figure 19 shows the rate of improvement on RMSE (RIR); a working definition is given after this paragraph. Positive RIR values indicate that our proposed model outperforms the other method, whereas negative RIR values imply that the other model performs better than ours. Compared to the most frequent, median and mean imputations, our proposed autoencoder model significantly improves the RMSE values, by 50 to 80 per cent in most cases. Our proposed method also contributes positive RIR values against the Bayesian ridge, decision tree, extra-trees and k-nearest neighbours imputation methods, mostly improving between 10% and 50%. For the Delhi stations measuring \(\hbox {NO}_{2}\), our proposed method yields six negative RIR values, half of which occur at station \(S^5\), where the mean, median and kNN imputations perform marginally better than our proposed model, by 6.46%, 0.87% and 1.15%, respectively. Half of the six negative RIR values are caused by median imputation, which also contributes the lowest RMSE for monitoring station \(S^9\), about 17% better than our proposed model.
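The exact formula for RIR is not restated here; assuming the conventional relative-improvement definition, it takes the form

\[ \mathrm{RIR} = \frac{\mathrm{RMSE}_{\mathrm{baseline}} - \mathrm{RMSE}_{\mathrm{proposed}}}{\mathrm{RMSE}_{\mathrm{baseline}}} \times 100\%, \]

so that a positive RIR means the proposed autoencoder attains a lower RMSE than the baseline method.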

Fig. 19
figure 19

Rate of improvement on RMSE (RIR) of the proposed model over commonly used methods

To obtain an overall picture, we calculated the average RIR values of each imputation method over all stations and pollutants. The results are summarised in Table 18. Our model outperforms the univariate imputations, improving the average RIR by around 50 to 65 per cent; for the multivariate imputations, the average RIR improvement ranges from about 20 to 40 per cent.

Table 18 Average of RIR values calculated from all stations

4 Conclusions

Missing values are a common real-world issue in collected data. Every measurement system can face this problem for many reasons, and some systems may lose critical data as a result. The existence of missing data can influence study interpretations and affect the functioning of air quality-related public services, so a strategy for handling missing data, such as an imputation method, is needed. Moreover, understanding the spatiotemporal characteristics of air pollutant data can improve the robustness of air quality missing data imputation.

This study has addressed the challenges of implementing a suitable method for air quality missing data imputation. Inspired by the ability of the denoising autoencoder to reconstruct corrupted data, we proposed an imputation method that exploits both temporal and spatial data to improve imputation accuracy. We determined an ideal temporal window size of 8 time steps and a spatial combination of 3 neighbouring stations, providing an \(8\times 4\) input set to the model; the input sets are aggregated to obtain a single prediction at a specific time. Two imputation scenarios were conducted, namely short-interval imputation and long-interval consecutive imputation. For short-interval imputation, several levels of missingness were introduced (i.e. 20%, 40%, 60% and 80%), whereas for long-interval imputation, all data in a specific period were removed.
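As an illustration of the input construction summarised above, one \(8\times 4\) input set could be assembled as follows; this is a sketch under the assumption that the target station column is stacked with its three selected neighbours, and the names are hypothetical:

```python
import numpy as np

def build_input_set(values, t, window=8):
    """Assemble one (window x 4) input set ending at time step t.

    values : (n_hours, 4) array -- the target station column followed by
             its 3 selected neighbouring stations
    t      : index of the time step to be predicted (assumes t >= window - 1)
    """
    return values[t - window + 1 : t + 1, :]  # shape (8, 4)

# Overlapping sets covering the same time step can then be aggregated
# (e.g. averaged) to yield a single prediction for that step.
```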

Results show that our proposed method and model give satisfactory imputation results with \(R^2\ge 0.6\), even when the data at the target station are entirely missing. Imputation performance degrades when stations are weakly correlated: low correlation coefficients compose more irregular input values, making it difficult for our proposed autoencoder model to recover the noisy inputs. Compared to the univariate imputation techniques, our model improves the average RIR by up to 65%, and by 20-40% against the multivariate imputation techniques.