Abstract
Air pollution is an increasingly serious problem. Accurate and efficient prediction of air quality can help prevent air pollution and improve the quality of human life. The air quality index (AQI) is a dimensionless measure that describes air quality quantitatively. In this study, machine learning (ML) methods were used to estimate the AQI of Shijiazhuang, China, with pollutant concentrations and meteorological factors as model inputs. Specifically, eXtreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), and Random Forest (RF) models were used. The experimental results show that the XGBoost model captures the AQI variation trend well: its R² is 0.929, which is 0.3% and 2.3% higher than the R² of the RF and LightGBM models, respectively. In addition, a SHAP-based model interpretation method reveals the key factors behind AQI variation: PM2.5 and PM10 contribute positively to AQI, whereas AQI is less sensitive to meteorological factors. Finally, Beijing, Shanghai, Xi’an, and Guangzhou were selected to test the model’s validity, and model performance remained good. Our study shows that applying an ML approach to air quality prediction is beneficial for efficiently assessing cities’ future air quality.
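The models in the abstract are compared by their coefficient of determination (R²). As a minimal sketch of how that metric is computed, the Python function below evaluates R² for a set of predictions. The function name `r2_score` and the AQI values are hypothetical and chosen only for illustration; they are not data or code from the study.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error left after the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


# Hypothetical observed and predicted AQI values (illustration only).
observed = [50, 100, 150, 200]
predicted = [55, 95, 160, 190]
score = r2_score(observed, predicted)
print(f"R2 = {score:.3f}")
```

An R² of 1 means the predictions match the observations exactly, and an R² of 0 means the model does no better than always predicting the observed mean. The same function can be applied to each model's test-set predictions to rank the models.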
Data availability
The datasets used during the study are available at https://air.cnemc.cn:18014/ and https://power.larc.nasa.gov/data-access-viewer/.
Funding
This work was supported by the Yan’an City Science and Technology Development Program (No. 203010096), the Yan’an University Doctoral Program (No. 20504306), the Shaanxi Provincial Talent Program (No. YAU202305399), the Yan’an University 14th Five-Year Major Research Program (YAU202313738), and the Graduate Education Innovation Program of Yan’an University (No. YCX2023008).
Author information
Contributions
Siyuan Wang: conceptualization, methodology, data analysis, data collection, and writing—original draft. Ying Ren: formal analysis, investigation, and validation. Bisheng Xia: supervision, validation, writing—review and editing, and funding acquisition.
Ethics declarations
Ethics approval
Not applicable.
Consent to participate
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Responsible Editor: Marcus Schulz
About this article
Cite this article
Wang, S., Ren, Y. & Xia, B. Estimation of urban AQI based on interpretable machine learning. Environ Sci Pollut Res 30, 96562–96574 (2023). https://doi.org/10.1007/s11356-023-29336-5
DOI: https://doi.org/10.1007/s11356-023-29336-5