A real-time dataset of air quality index monitoring using IoT and machine learning in the perspective of Bangladesh

This paper presents a real-time air quality index dataset for three places, Kuril Bishow Road, Uttara, and Tongi, in Dhaka and Gazipur City, Bangladesh. The IoT framework consists of MQ9, MQ135, MQ131, and dust (PM) sensors connected to an Arduino microcontroller, and it collects real data on sulfur dioxide, carbon monoxide, nitrogen dioxide, ozone, and particulate matter (PM2.5 and PM10). The data are stored through Excel as a comma-separated file, after which the authors applied regression and classification machine learning algorithms to analyze them. The dataset consists of 11 columns and 155,406 rows. Sulfur dioxide, carbon monoxide, nitrogen dioxide, ozone, PM2.5, and PM10 are recorded as independent variables, and AQI is the target variable. In the dataset, AQI is categorized into five classes: Good, Satisfactory, Moderate, Poor, and Very Poor. The experimental results show that two of the three places, Uttara and Kuril, have comparatively better air quality, and that the Random Forest algorithm outperforms the other models. The study also describes the hardware of the embedded system in detail. This dataset will help environmental researchers analyze air quality.


Value of the Data
• This dataset is useful for monitoring air quality, identifying pollution sources, and assessing the efficacy of pollution-control measures.
• These data can be used to establish air quality standards, create policies and regulations, and assess environmental compliance.
• The dataset is helpful for research students who want to apply machine learning (ML) or data science techniques to gain insight into the data.
• This dataset can be used by researchers, scientists, and engineers to investigate the sources and consequences of air pollution, create novel pollution control technology, and evaluate the efficacy of treatments.
• These data can be used by researchers to predict the air quality index (AQI) condition as well as environmental conditions such as good, very good, satisfactory, bad, and poor.

Background
The dataset's goal is to completely measure and monitor air quality in Bangladesh's Dhaka and Gazipur districts, with an emphasis on determining its acceptability for human habitation and overall environmental health. By measuring key air pollutants such as sulfur dioxide, carbon monoxide, nitrogen dioxide, ozone, and particulate matter (PM2.5 and PM10), the dataset aims to provide a thorough understanding of the Air Quality Index (AQI) and its variations across different districts. The dataset's analysis aims to evaluate the degree of compliance with existing regulatory norms and recommendations, as well as identify areas of concern and possible sources of pollution for targeted intervention and pollution management. The dataset seeks to provide important insights into the overall environmental quality of the examined locations by allowing real-time air quality data to be compared to standard values. It also seeks to encourage scientific research by assisting in the development of prediction models and analytical tools for projecting AQI values and tracking long-term trends in air quality. Finally, this dataset adds to larger initiatives to reduce air pollution and promote sustainable environmental practices in Bangladesh's cities and peri-urban areas.

Data Description
The dataset consists of 11 columns and 155,406 rows, in which sulfur dioxide, carbon monoxide, nitrogen dioxide, ozone, and particulate matter (PM2.5 and PM10) are recorded as independent variables. AQI is the dependent variable. The data were collected every winter and summer season from January 2017 to December 2022. The dataset was recorded at three places in the Dhaka and Gazipur districts of Bangladesh. In the district of Dhaka, 103,611 records were collected at Kuril Bishow Road and Uttara, and in the district of Gazipur, 51,794 records were collected at Tongi.
A research paper based on this dataset has been published [1]. The volume of the dataset could be increased by using a paid cloud service and by running the system for a longer period. Table 1 shows a portion of the dataset. The geographical area of data collection is shown in Fig. 1. The collection sites are marked with red rectangles, and the rectangle in the top right corner indicates the orientation of the map.

Experimental Design, Materials and Methods
The proposed methodology is shown in Fig. 2. After obtaining the real-time data through the IoT framework, we applied machine learning techniques to classify the AQI as Good, Satisfactory, Moderate, Poor, or Very Poor.
The physical part of the IoT framework is displayed in Fig. 3. The MQ9 sensor is used for CO, the MQ131 for ozone and NO2, the dust sensor for PM 2.5 and PM 10, and the MQ135 for SO2. Several studies on IoT-based frameworks for real-time measurement have been carried out in this field [2][3][4].
Fig. 4 shows some experimental images of the proposed system. The system was designed at the Robotics Lab of the Department of Software Engineering at Daffodil International University, shown in Fig. 4(a). The final embedded system collecting real-time values is shown in Fig. 4(b).

Air quality index calculation
The AQI is calculated by Eq. (1):

$$\mathrm{AQI} = \frac{I_{max} - I_{min}}{B_{max} - B_{min}} \left( C_p - B_{min} \right) + I_{min} \qquad (1)$$

where $B_{max}$ and $B_{min}$ are the breakpoint concentrations: $B_{max}$ is the breakpoint greater than or equal to the given concentration and $B_{min}$ is the breakpoint smaller than or equal to the given concentration; $I_{max}$ is the AQI value corresponding to $B_{max}$; $I_{min}$ is the AQI value corresponding to $B_{min}$; and $C_p$ is the pollutant concentration. Table 2 shows the AQI breakpoints (ranges) for the air quality categories, and Table 3 gives the corresponding ranges of the pollutant concentrations. Particle Matter 2.5 is measured in micrometers (μm); NO2, O3, and SO2 are measured in grams per cubic metre (g/m3); and CO is measured in milligrams per cubic metre (mg/m3). In addition, we calibrated the devices before fieldwork in our Robotics Laboratory at Daffodil International University, shown in Fig. 4(a). The device readings were close to the standard values: 10 μm for PM 2.5, 40 g/m3 for NO2, 43 g/m3 for O3, 38 g/m3 for SO2, and 0 mg/m3 for CO.
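
The calculation described above can be sketched in Python. This is a minimal illustration that assumes Eq. (1) is the usual linear interpolation between breakpoints, AQI = (I_max − I_min) / (B_max − B_min) × (C_p − B_min) + I_min. The function name `sub_index` and the table `EXAMPLE_BREAKPOINTS` are hypothetical, and the breakpoint numbers are placeholders, not the values of Tables 2 and 3.

```python
from typing import List, Tuple

# One row per AQI band: (B_min, B_max, I_min, I_max).
# ASSUMPTION: these numbers are illustrative placeholders only; they are
# NOT the breakpoints listed in Table 2 or Table 3 of the article.
Breakpoint = Tuple[float, float, float, float]

EXAMPLE_BREAKPOINTS: List[Breakpoint] = [
    (0.0, 30.0, 0.0, 50.0),
    (30.0, 60.0, 50.0, 100.0),
    (60.0, 90.0, 100.0, 200.0),
]


def sub_index(c_p: float, breakpoints: List[Breakpoint]) -> float:
    """Interpolate the AQI for one pollutant concentration c_p."""
    for b_min, b_max, i_min, i_max in breakpoints:
        # Pick the band whose breakpoints enclose the concentration.
        if b_min <= c_p <= b_max:
            slope = (i_max - i_min) / (b_max - b_min)
            return slope * (c_p - b_min) + i_min
    raise ValueError(f"concentration {c_p} is outside the breakpoint table")


if __name__ == "__main__":
    # A concentration halfway through the second band maps to the
    # midpoint of that band's AQI range.
    print(sub_index(45.0, EXAMPLE_BREAKPOINTS))
```

A concentration that sits exactly on a shared boundary (for example 30.0 here) is assigned to the lower band, because the bands are scanned in ascending order; a real breakpoint table would need to state which band owns each boundary.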

Comparison performance metrics based on machine learning
We separate the dataset into dependent and independent variables before applying the ML models. For regression, AQI is the target column; for classification, AQI is an independent column and AQI Range is the target. Before training, the dataset is divided into training and test data. Fig. 5 displays the AQI values by AQI range. We evaluated the regression models using several metrics: root mean square error (RMSE), mean absolute error (MAE), and the R-squared (R2) score. A model fits well when its R-squared is high and its RMSE is low. Accuracy, precision, recall, and F1 score are the evaluation metrics for the classification models.
In addition, we compared the classifiers by calculating the accuracy score of every classification model.
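
The regression metrics named above follow standard definitions and can be sketched in plain Python. The function names and the sample AQI values below are illustrative assumptions; they are not taken from the dataset or from the authors' code.

```python
import math
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error: average size of the prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error: penalizes large errors more than MAE."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """R-squared: share of the variance in y_true explained by y_pred."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical AQI targets and model predictions.
    actual = [50.0, 100.0, 150.0, 200.0]
    predicted = [60.0, 90.0, 160.0, 190.0]
    print(mae(actual, predicted), rmse(actual, predicted), r2_score(actual, predicted))
```

MAE and RMSE are expressed in AQI units, whereas R-squared is unitless, so the error metrics and R-squared are best read side by side rather than compared numerically with each other.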
Table 4 shows the four models assessed as ML regressors: linear regression (LR), decision tree regression (DTR), random forest regression (RFR), and gradient boosting. Linear regression yields respectable results, with an MAE of 9.43 and an RMSE of 20.04. Its R2 value is 0.505, so the model accounts for approximately 50.5 % of the variance. The Decision Tree Regressor shows larger errors, with an MAE of 11.76 and an RMSE of 25.34, although it has a very high R2 value of 0.995, suggesting a close fit to the data. The RFR produces better results, with an MAE of 9.00 and an RMSE of 18.73. According to its R2 value of 0.930, the model can explain about 93 % of the variance. Gradient boosting is the best-performing model, and these findings highlight its effectiveness on the regression task.
Table 5 shows the assessment metric scores for several models, including precision, recall, F1 score, and accuracy. The Random Forest Classifier (RFC) scored the highest across all metrics, suggesting good classification performance, with a precision of 0.972, a recall of 0.972, and an F1 score of 0.972. The accuracy scores are 93 % for Logistic Regression (LR), 95 % for DT, 97.0 % for K-NN, 94 % for Naive Bayes, and 97.2 % for RFC. RFC is therefore the most accurate algorithm and outperforms the others.
These findings give valuable information about how well the various algorithms perform, indicating that RFC may be especially well suited to this classification problem. Researchers and practitioners may take these findings into account when selecting algorithms for analogous tasks in the future.
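
For reference, the classification metrics discussed above can be computed as in the following sketch. It assumes macro averaging over the AQI classes, which the article does not specify, and the labels in the example are made up for illustration.

```python
from typing import Dict, Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of samples whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def class_scores(y_true: Sequence[str], y_pred: Sequence[str], label: str) -> Dict[str, float]:
    """Precision, recall and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def macro_scores(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """ASSUMPTION: unweighted (macro) average of the per-class scores."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = [class_scores(y_true, y_pred, label) for label in labels]
    return {
        name: sum(scores[name] for scores in per_class) / len(labels)
        for name in ("precision", "recall", "f1")
    }


if __name__ == "__main__":
    # Hypothetical AQI Range labels, not taken from the dataset.
    actual = ["Good", "Good", "Moderate", "Poor", "Poor", "Moderate"]
    predicted = ["Good", "Moderate", "Moderate", "Poor", "Good", "Moderate"]
    print(accuracy(actual, predicted))
    print(macro_scores(actual, predicted))
```

With imbalanced AQI classes, a weighted average (each class weighted by its number of samples) can give noticeably different values from the macro average used here, so the averaging scheme should be stated alongside the scores.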

Limitations
The dataset was collected with IoT devices in a highly populated area, and the collection took almost five years. Because of this long data collection period, the system often experienced interruptions. In addition, premium cloud storage was not used for the data collection process, and the whole process was conducted at only three places.
© 2024 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

IoT devices are employed in the data collection process to obtain real-time data on air pollution in the Dhaka and Gazipur districts of Bangladesh. To monitor levels of carbon monoxide, nitrogen dioxide, ozone, particulate matter 2.5, and particulate matter 10, the hardware is designed so that the appropriate sensors are installed. The Arduino IDE code is uploaded to allow communication with the IoT devices. The Excel Data Streamer opens the required Excel file. In the Arduino IDE, the appropriate COM port is selected to connect the IoT devices to the PC. The Data Streamer toolbar's "Start Data" option enables real-time data streaming into the Excel sheet. The "Record Data" button launches the data-collecting process, allowing the IoT devices to continuously monitor and record air pollutants. To end data gathering, the "Stop Recording" button is pressed. The "Stop Data" function is used to halt data transmission between the IoT devices and the computer. The collected data are saved in a specified file location for easy analysis and interpretation.

Data source location: Institution: Department of Software Engineering, Daffodil International University, Daffodil Smart City (DSC), Birulia, Savar, Dhaka 1216, Bangladesh.

Data accessibility: Repository name: Mendeley Data. Data identification number: DOI: 10.17632/4r25x9sc7k.1. Direct URL to data: https://data.mendeley.com/datasets/4r25x9sc7k/1 [5]

Related research article: Shakil, S.U.P., Kashem, M.A., Islam, M.M., Nayan, N.M., Uddin, J., Investigation of

Table 1
Some data as sample.
Fig. 1. Location of data collection.

Table 2
AQI category range.

Table 3
Air pollution factors category range.

Table 4
Regression score for ML model.

Gradient boosting delivers the highest performance and the fewest errors (MAE of 8.34, RMSE of 17.86). Its R2 value is 0.60, suggesting that the model can explain about 60 % of the variance. Gradient Boosting achieves the lowest errors and the highest R2 value, outperforming all other models.