Estimation of urban AQI based on interpretable machine learning

  • Research Article
Environmental Science and Pollution Research

Abstract

Air pollution is an increasingly serious problem. Accurate and efficient prediction of air quality can help prevent air pollution and improve the quality of human life. The air quality index (AQI) is a dimensionless measure that describes air quality quantitatively. In this study, machine learning (ML) methods were used to estimate the AQI, taking Shijiazhuang, China, as the study area and pollutant and meteorological factors as the model inputs. Specifically, eXtreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), and Random Forest (RF) models were used. The experimental results show that the XGBoost model captures the AQI variation trend well: its R² is 0.929, which is 0.3% and 2.3% higher than the R² of the RF and LightGBM models, respectively. In addition, a SHAP-based model interpretation method reveals the key factors behind AQI variation: PM2.5 and PM10 contribute positively to the AQI, while the AQI is less sensitive to meteorological factors. Finally, Beijing, Shanghai, Xi’an, and Guangzhou were selected to test the model’s validity, and the model performance remained good. Our study shows that applying an ML approach to air quality prediction is beneficial for efficiently assessing cities’ future air quality.
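The SHAP-based interpretation mentioned in the abstract rests on Shapley values, which split the difference between a model's prediction for one sample and its prediction for a baseline input among the features. The sketch below is only an illustration of that idea, not the authors' pipeline: the linear toy model, its coefficients, the feature names (pm25, pm10, wind), and the sample and baseline values are all hypothetical. It enumerates every feature coalition, which is feasible only for a handful of features; practical SHAP tooling for tree ensembles avoids this brute-force enumeration.

```python
from itertools import combinations
from math import factorial


def toy_aqi_model(x):
    """Hypothetical stand-in for a fitted AQI regressor.

    The coefficients are illustrative assumptions, not values from the study.
    """
    return 1.2 * x["pm25"] + 0.5 * x["pm10"] - 0.8 * x["wind"]


def shapley_values(model, x, baseline):
    """Exact Shapley values for one sample.

    A feature outside the coalition is replaced by its baseline value,
    so the attributions sum to model(x) - model(baseline).
    """
    features = list(x)
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        # Enumerate every coalition of the remaining features, by size k.
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                with_f = {
                    g: x[g] if (g in subset or g == f) else baseline[g]
                    for g in features
                }
                without_f = {
                    g: x[g] if g in subset else baseline[g]
                    for g in features
                }
                # Weighted marginal contribution of f to this coalition.
                total += weight * (model(with_f) - model(without_f))
        phi[f] = total
    return phi


# Hypothetical polluted day compared with a hypothetical typical day.
sample = {"pm25": 90.0, "pm10": 140.0, "wind": 2.0}
baseline = {"pm25": 50.0, "pm10": 100.0, "wind": 3.0}

phi = shapley_values(toy_aqi_model, sample, baseline)
for name, value in phi.items():
    print(f"{name}: {value:+.2f}")
```

In this toy setting the particulate features receive large positive attributions and the wind feature a small one, which mirrors the kind of statement the abstract makes; the numbers themselves carry no empirical meaning.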

Data availability

The datasets used in this study are available at https://air.cnemc.cn:18014/ and https://power.larc.nasa.gov/data-access-viewer/.

Funding

This work was supported by the Yan’an City Science and Technology Development Program (No. 203010096), the Yan’an University Doctoral Program (No. 20504306), the Shaanxi Provincial Talent Program (No. YAU202305399), the Yan’an University 14th Five-Year Major Research Program (YAU202313738), and the Graduate Education Innovation Program of Yan’an University (No. YCX2023008).

Author information

Contributions

Siyuan Wang: conceptualization, methodology, data analysis, data collection, and writing—original draft. Ying Ren: formal analysis, investigation, and validation. Bisheng Xia: supervision, validation, writing—review and editing, and funding acquisition.

Corresponding author

Correspondence to Bisheng Xia.

Ethics declarations

Ethics approval

Not applicable.

Consent to participate

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Responsible Editor: Marcus Schulz

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wang, S., Ren, Y. & Xia, B. Estimation of urban AQI based on interpretable machine learning. Environ Sci Pollut Res 30, 96562–96574 (2023). https://doi.org/10.1007/s11356-023-29336-5

  • DOI: https://doi.org/10.1007/s11356-023-29336-5
