Outdoor air quality data for spatiotemporal analysis and air quality modelling in Ho Chi Minh City, Vietnam: A part of HealthyAir Project

This article presents outdoor air pollution data acquired from the real-time Air Quality Monitoring Network (AQMN) established by the HealthyAir project team in Ho Chi Minh City (HCMC), Vietnam. The AQMN consists of six air pollution monitoring stations spread over traffic, residential, and industrial areas of the city. Each station measures the same air pollutants (PM2.5, TSP, NO2, SO2, O3, and CO) and two meteorological factors, temperature and humidity. These data are crucial for air quality modelling, spatiotemporal analysis, correlation analysis, and assessing local air pollution around the city. The data were first obtained at one-minute frequency, then transformed to hourly frequency for analysis and modelling. The PM2.5 data from this dataset were used to construct an hourly PM2.5 forecasting model in the publication titled “AI-based Air Quality PM2.5 Forecasting Models for Developing Countries: A Case Study of Ho Chi Minh City, Vietnam” by Rakholia et al. (2022).

Subject: Environment Science (air pollution)
Specific subject area: Monitoring urban air pollution using IoT based wireless sensor network.
Type of data: Table data (organized in CSV format)
How the data were acquired: During the first phase of the HealthyAir initiative, six air pollution monitoring stations were established in HCMC. The data were collected from each station and then merged and pre-processed using the Python software program [2].
Data format: Raw; Analyzed (PM2.5, NO2, CO, SO2, O3, TSP in μg/m3, temperature in °C, relative humidity in %)
Description of data collection: Data was collected from the middle of February 2021 until the middle of June 2022. Six air pollution monitoring stations were installed by the HealthyAir project team in different regions including Traffic, Residential, and Industrial across the city and each of them measures the same number of air pollutants PM2.5, NO2, CO, SO2, O3, TSP, and two meteorological parameters Temperature and Humidity. Every minute, each air quality monitoring station communicates the value measured by sensors to a cloud server (data repository). The PM2.5 and TSP levels in the air were measured in μg/m3, whilst CO, SO2, and NO2 were recorded in "ppm" and O3 was measured in "ppb." The data were transformed to hourly frequency during the data pre-processing step for further analysis and modelling. Data on air contaminants were also converted to the uniform unit (μg/m3).
Data source location: Ho Chi Minh City, Vietnam, is the primary source of data.

Value of the Data
• This is a unique dataset recorded from a high-quality sensor network deployed by the HealthyAir project team, which is valuable for understanding and assessing local air quality across multiple regions (traffic, residential, and industrial) in Ho Chi Minh City.
• Data were prepared on an hourly basis, providing sufficient context for future research on air quality assessments, time series modelling, and predictive modelling.
• Since the dataset contains data on numerous air pollutants such as PM2.5, NO2, CO, SO2, O3, and TSP, it can be utilized for correlation analysis, feature selection for air quality modelling, and implementing WHO air quality recommendations [5].
• This dataset can be used to conduct research on determining how air pollution affects human health.
• These data can be useful to researchers interested in spatiotemporal analysis, air quality modelling, and tests on various validation methodologies.
• Researchers can use these data to test various machine learning approaches, and the data can be combined with other datasets such as meteorological data or satellite data to estimate air quality.

Objective
The primary goal of collecting outdoor air quality data was to create a unique dataset that can be used for monitoring regional air quality in the city, developing a policy, assessing the impact of air pollution on human health, and developing solutions to reduce the harmful effects of air pollution on the public in HCMC. This one-of-a-kind dataset was gathered from a real-time air quality monitoring network, allowing for the exploration of numerous issues when constructing machine learning models, devising training procedures, and developing time-series forecasting algorithms. This can benefit researchers working on sustainability, time series analysis, predicting urban air quality, and environmental modelling.

Data Description
The raw dataset comprises 52,549 records gathered between mid-February 2021 and mid-June 2022. The air quality dataset presented in this article includes the date (dd-mm-yyyy HH:00:00); the air pollutants particulate matter (PM2.5), total suspended particles (TSP), sulfur dioxide (SO2), ozone (O3), nitrogen dioxide (NO2), and carbon monoxide (CO), all in μg/m3; two meteorological parameters, temperature (°C) and humidity (%); and Station_No, a number between 1 and 6 that uniquely identifies a station and its location (Table 1).
Before using these data for analysis and modelling, it is important to understand their quality. The data were recorded using high-quality sensors, so the records are quite accurate, apart from outliers at some points caused by unforeseen events at random places in the city. There are no duplicate or overlapping values across the dataset, so all records (tuples) are unique. The time component is critical in air quality analysis and modelling; therefore, the entire dataset is prepared consistently at one-hour intervals, and no timestamp is missing at any station. Missing values were recorded for some pollutants at some stations, primarily during COVID-19 lockdown periods, due to power failures and other uncontrollable factors.
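The two structural properties described above (unique records and an unbroken one-hour grid per station) can be re-checked by a reader before modelling. The following is a minimal sketch using only the Python standard library; the function name check_hourly_completeness and the (station, timestamp) input shape are illustrative assumptions, not part of the published dataset or the authors' code.

```python
from datetime import datetime, timedelta


def check_hourly_completeness(rows):
    """Check (station_no, timestamp) pairs for duplicates and hourly gaps.

    Assumes every timestamp falls exactly on the hour.
    Returns (duplicates, gaps), where gaps maps each station to the list
    of hourly timestamps missing between its first and last record.
    """
    seen = set()
    duplicates = []
    stamps_by_station = {}
    for station, timestamp in rows:
        key = (station, timestamp)
        if key in seen:
            duplicates.append(key)
        seen.add(key)
        stamps_by_station.setdefault(station, set()).add(timestamp)

    step = timedelta(hours=1)
    gaps = {}
    for station, stamps in stamps_by_station.items():
        current, last = min(stamps), max(stamps)
        missing = []
        # Walk the expected hourly grid and record absent timestamps.
        while current <= last:
            if current not in stamps:
                missing.append(current)
            current += step
        gaps[station] = missing
    return duplicates, gaps


# Synthetic rows for illustration only (not taken from the dataset).
rows = [
    (1, datetime(2021, 2, 23, 0)),
    (1, datetime(2021, 2, 23, 1)),
    (1, datetime(2021, 2, 23, 3)),  # 02:00 is absent for station 1
    (1, datetime(2021, 2, 23, 3)),  # repeated record
    (2, datetime(2021, 2, 23, 0)),
]
duplicates, gaps = check_hourly_completeness(rows)
```

For a dataset matching the description above, both the duplicates list and every per-station gap list would be expected to come back empty.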
The air quality data from all stations were aggregated and stored in a single file (AirQuality_hcmc.csv); sample data are shown in Table 2.
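As an illustration of how a file in this layout might be read, the sketch below parses the dd-mm-yyyy HH:00:00 date format and treats empty or 'nan' cells as missing values. The column name "date", the helper names, and the sample rows are assumptions made for the example; only Station_No and the date format come from the description above, and the actual header of the published CSV should be checked before use.

```python
import csv
import io
import math
from datetime import datetime

# Date layout described in the article: dd-mm-yyyy HH:00:00
DATE_FORMAT = "%d-%m-%Y %H:%M:%S"


def parse_value(text):
    """Convert a CSV cell to float, mapping empty or 'nan' cells to NaN."""
    text = text.strip()
    if text == "" or text.lower() == "nan":
        return math.nan
    return float(text)


def load_air_quality(handle, date_column="date", station_column="Station_No"):
    """Read rows from an open CSV handle into a list of dicts.

    The default column names are assumptions about the file header.
    """
    records = []
    for row in csv.DictReader(handle):
        record = {}
        for name, text in row.items():
            if name == date_column:
                record[name] = datetime.strptime(text, DATE_FORMAT)
            elif name == station_column:
                record[name] = int(text)
            else:
                record[name] = parse_value(text)
        records.append(record)
    return records


# Synthetic sample for illustration only (not the contents of Table 2).
sample = io.StringIO(
    "date,PM2.5,Temperature,Station_No\n"
    "23-02-2021 10:00:00,21.5,30.1,1\n"
    "23-02-2021 11:00:00,,30.4,1\n"
)
records = load_air_quality(sample)
```

To read the real file, the same function could be called on an opened file handle, for example open("AirQuality_hcmc.csv", newline=""), provided the header matches the assumed column names.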

Experimental Design, Materials and Methods
The data presented in this article were gathered from a real-time AQMN comprising six air pollution monitoring stations. Table 4 describes the technical specifications of the instruments used in the construction of an air pollution monitoring station.
The locations of the air quality stations in HCMC were chosen with the goal of monitoring air quality in a variety of settings, including traffic, urban background, residential areas, industrial districts, and areas of high population density. Every 60 seconds, all stations measured the same set of air pollutant concentrations, which were then relayed to a cloud server (Fig. 7). Each station's data for each day were saved on the server in a separate .csv file. All .csv files were then imported into a Python workspace, merged, and resampled on an hourly basis.
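The hourly resampling step can be sketched as follows. The article does not state which aggregation was applied when moving from minute to hourly frequency, so the arithmetic mean used here is an assumption, as are the function name resample_hourly and the (timestamp, value) input shape.

```python
import math
from collections import defaultdict
from datetime import datetime
from statistics import fmean


def resample_hourly(readings):
    """Aggregate (timestamp, value) minute readings into hourly means.

    The mean is an assumed aggregation; NaN readings are ignored, and an
    hour containing only NaN readings yields NaN.
    """
    buckets = defaultdict(list)
    for timestamp, value in readings:
        hour_start = timestamp.replace(minute=0, second=0, microsecond=0)
        bucket = buckets[hour_start]  # registers the hour even if value is NaN
        if not math.isnan(value):
            bucket.append(value)
    return {
        hour: fmean(values) if values else math.nan
        for hour, values in sorted(buckets.items())
    }


# Synthetic readings for illustration only.
minute_readings = [
    (datetime(2021, 2, 23, 10, 0), 10.0),
    (datetime(2021, 2, 23, 10, 30), 20.0),
    (datetime(2021, 2, 23, 11, 5), 30.0),
]
hourly = resample_hourly(minute_readings)
```

Hours with no readings at all do not appear in the result of this sketch; producing a complete hourly grid would require filling those hours with missing values in a separate step.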
Next, negative values were handled, since the sensors occasionally recorded implausible values. All negative values were replaced with 'nan' and treated as missing values in the dataset [2].
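The replacement of negative readings with 'nan' can be illustrated with a short sketch. The function name mask_negative and the sample values are hypothetical; the rule itself (negative becomes missing) is the one described above.

```python
import math


def mask_negative(values):
    """Replace negative readings with NaN so they count as missing values.

    Existing NaN entries pass through unchanged, because NaN < 0 is False.
    """
    return [math.nan if value < 0 else value for value in values]


# Synthetic readings for illustration only.
cleaned = mask_negative([12.0, -3.5, 0.0, math.nan])
```

Zero is kept as a valid reading in this sketch, since only strictly negative values are described as being replaced.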
Fig. 7. Air quality data acquiring flow.
Originally, the air pollutants PM2.5 and TSP were measured in μg/m3 at the HealthyAir stations, whilst CO, SO2, and NO2 were measured in "ppm" and O3 in "ppb". Table 5 shows the calibration rate for converting air quality concentrations from "ppm" and "ppb" to the uniform unit μg/m3. The data were then saved on a MySQL server, which allows users to retrieve, sort, search, and filter the data using SQL queries for air quality study, modelling, or further analysis. Finally, we exported the data from the MySQL database in csv format.
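The calibration rates in Table 5 are not reproduced in this section, so the sketch below should not be read as the authors' conversion. It uses the standard ideal-gas relation, concentration in μg/m3 = concentration in ppb × molar mass / molar volume, under the assumption of 25 °C and 1 atm (molar volume 24.45 L/mol). The function name to_ug_per_m3 is hypothetical, and the resulting factors may differ from those in Table 5 if other reference conditions were used.

```python
# Molar masses in g/mol.
MOLAR_MASS = {"CO": 28.01, "SO2": 64.07, "NO2": 46.01, "O3": 48.00}

# Molar volume of an ideal gas at 25 °C and 1 atm, in L/mol.
# Assumption: the article does not state its reference conditions here.
MOLAR_VOLUME_L = 24.45


def to_ug_per_m3(value, pollutant, unit):
    """Convert a gas concentration from ppm or ppb to μg/m3."""
    if unit == "ppm":
        ppb = value * 1000.0  # 1 ppm = 1000 ppb
    elif unit == "ppb":
        ppb = value
    else:
        raise ValueError(f"unsupported unit: {unit!r}")
    return ppb * MOLAR_MASS[pollutant] / MOLAR_VOLUME_L


# Illustrative inputs only.
co_ug = to_ug_per_m3(1.0, "CO", "ppm")
o3_ug = to_ug_per_m3(50.0, "O3", "ppb")
```

Under these assumed conditions, 1 ppm of CO corresponds to roughly 1146 μg/m3 and 50 ppb of O3 to roughly 98 μg/m3.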

Ethics Statements
There were no ethical requirements for data collection and processing, and this study did not involve animal or human investigations.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.