Particulate matter 10 µm (PM10), 2.5 µm (PM2.5) datasets gathered by direct measurement, low-cost sensor and by public air quality stations in Fontibón, Bogotá D.C., Colombia

The concentration of particulate matter directly affects air quality and human health. Three sources of information were used in this work to generate datasets on particulate matter in the Fontibón county of Bogotá D.C., Colombia. The first source was a Davis AirLink® low-cost sensor installed in the area, which recorded PM2.5 and PM10 readings together with the meteorological variables temperature, relative humidity, dew point, wet bulb, and heat index from May to August 2022. The second source was direct measurement of PM10 with Tisch® Hi-Vol equipment, which was used to determine the PM10 concentration at the same site for 27 days. The third source was raw data provided by Bogotá’s Environmental District Bureau (SDA) and validated in this work: readings for 2021 and 2022 from the two meteorological stations located in the same county, named “Fontibón” and “Móvil Fontibón”, covering the air quality variables PM2.5, PM10, carbon monoxide (CO), ozone, nitrogen dioxide (NO2), and sulfur dioxide (SO2), and the meteorological variables wind speed, wind direction, temperature, precipitation, relative humidity (RH), and barometric pressure. A machine learning model was built in Python to mine the data and fill in the missing values using iterative imputation and a regression model, and the Pearson, Spearman, and Kendall correlation coefficients were calculated.


Specifications Table
Subject: Data Science, Environmental Science, and Air Pollution
Specific subject area: Air quality, meteorological variables, data analytics.
Type of data: Figure, Table
How the data were acquired:
• Low-cost sensor data was acquired using a Davis AirLink® sensor for measurements of air quality and meteorological variables.
• Direct measurement data was collected using Tisch® TE-6070V Hi Vol equipment for the PM10 air pollutant.
• Public raw data was provided by Bogotá's Environmental District Bureau (SDA) for air quality and meteorological variable measurements of the stations "Fontibón" and "Móvil Fontibón".
The data was collected for the Fontibón County of Bogotá D.C., Colombia.
Data format: Raw, Analyzed
Description of data collection: The Tisch® equipment and the Davis AirLink® sensor were placed in accordance with the EPA specifications in a clear outdoor location, coordinates: N 4°39'13.7", W 74°6'50.7", elevation: 2552 MASL. A dedicated wireless network was installed at the site using mobile internet service to connect the Wi-Fi capable sensor.

Value of the Data
• The data collection consists of measurements of air pollution, expressed as the concentration of particulate matter PM10 and PM2.5 and of other pollutants, as well as readings of different meteorological variables, gathered from the Tisch® manual equipment, the Davis AirLink® sensor, and the public air quality stations in the Fontibón County in Bogotá, Colombia, over the years 2021-2022 [1, 2]. These datasets are a valuable source of information for the surrounding population, since air quality is a topic of crucial interest given the health impact of breathing polluted air. Likewise, the collected data is useful for establishing correlations between the concentration of particulate matter and meteorological variables [3].
• Inhabitants of sectors with a harmful air quality index may use the data and methodology presented here, in particular the low-cost option for measuring the concentration of particulate matter, for decision making and for planning improvement actions to mitigate possible health risks in these locations. The datasets also provide information on the performance of low-cost sensors in recording data under different environmental conditions; such sensors are therefore a viable option for permanent air quality measurement in places that require constant intervention because of their particulate matter concentration records [4].
• The collected data may also serve as a source for analyzing the impact of human activity on air quality, by contrasting data for holidays, working days, and even the days of the Covid-19 pandemic shutdown.

Objective
Relevant air quality information is provided for a densely populated area that lacks such data, collected from three different sources: direct measurement, low-cost sensors, and public air quality stations in the Fontibón County in Bogotá D.C., Colombia. Data analytics is applied to verify data integrity and to provide a basis for determining possible correlations between meteorological variables and air quality, expressed as PM2.5 and PM10 concentrations.

Data Description
The datasets described in this section correspond to the treatment given to the readings gathered from: a) the Davis AirLink® sensor, collected from May to August 2022 (Raw_Sensor_Consolidated.csv); b) the Tisch® TE-6070V Hi Vol equipment, collected directly from July 23 to August 18, 2022 (Raw_Consolidated_Manual_PM10.csv); and c) the air quality stations, provided by the SDA for January 01, 2021 to December 31, 2022 (Raw_Consolidated_ECA_Complete_2022.csv). All the equipment used for data collection was located in the Fontibón County in Bogotá D.C.

Data Source 1: Davis AirLink® sensor
The Davis AirLink® sensor was also calibrated following the procedure indicated in Lewis et al. [5], by comparing its measurements with those of an official instrument provided by the SDA and located no more than 10 meters away. Table 1 shows an excerpt of the raw data from the Davis AirLink® sensor dataset for May 26, 2022, covering all variables measured by the sensor. Readings were recorded every 15 minutes over the period May-August 2022, and the data is available in the file Raw_Sensor_Consolidated.csv. The air quality variables included are the PM2.5 (µg/m³) and PM10 (µg/m³) particulate matter concentrations, together with the meteorological variables selected as useful for correlation with them. Table 2 shows the statistical analysis of these variables after processing with data mining, since the raw data do not allow direct analysis or the calculation of correlation coefficients such as Pearson, Kendall, or Spearman. The statistical analysis includes the count of each variable, to identify the need for data completion; the arithmetic mean, to confirm that the data lie in an acceptable range for each type of variable; the standard deviation, to evaluate the dispersion of each variable about its mean; and the interquartile range together with the minimum and maximum values, to analyze the dispersion of the data relative to extreme and central values. Fig. 1 shows the relationship between PM2.5 (µg/m³) and the selected meteorological variables processed with data mining; a matrix plot is used to visualize the dispersion of the relationships between these variables in a single graph.
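The summary statistics described above (count, mean, standard deviation, extremes, and interquartile range) can be sketched with pandas as follows. This is only an illustration: the function name `summarize`, the column names `pm25` and `temp_c`, and the sample values are assumptions and are not taken from the published files or notebook.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Return count, mean, std, min, quartiles, max and IQR per numeric column."""
    stats = df.describe().T  # one row per variable
    stats["iqr"] = stats["75%"] - stats["25%"]  # interquartile range
    return stats


# Illustrative readings only (hypothetical column names and values).
readings = pd.DataFrame({
    "pm25": [10.0, 12.0, None, 14.0, 16.0],   # one missing reading
    "temp_c": [12.0, 13.0, 14.0, 15.0, 16.0],
})

summary = summarize(readings)
print(summary[["count", "mean", "std", "min", "max", "iqr"]])
```

A count lower than the number of rows, as for `pm25` here, is what signals that a variable needs data completion before correlations are computed.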

Data Source 2: Tisch® TE-6070V Hi Vol equipment
The Tisch® TE-6070V Hi Vol equipment was calibrated according to the manufacturer's protocol, which relies on the certified orifice TE-5025 to verify the flow rate (1.232 ± 0.028 m³/min) passing through the quartz filter. Table 3 shows an excerpt of the raw data collected by direct measurement of PM10 from July 23 to 28, 2022. Data were collected daily at the same time of day (11:00 h), at the same location where the readings were taken with the Davis AirLink® sensor; the data is available in the file Raw_Consolidated_Manual_PM10.csv. Table 4 shows the statistical analysis of the PM10 concentration and the selected meteorological variables, using the same criteria as above. A preliminary analysis based on data mining showed that the data did not require additional processing, since their quality was good [6]. Fig. 3 shows the relationship between the air quality variables and the selected meteorological variables for the Tisch® TE-6070V Hi Vol equipment. These data are highly relevant because they were obtained with high confidence by direct measurement of the particulate matter collected from the air, in contrast to the indirect measurement method used by electronic sensors. This is a valuable contribution, not only for the information itself but also because it provides data obtained with a reference technique for air quality pollutants.

Data Source 3: Bogotá's Environmental District Bureau (SDA)
Raw data provided by the SDA in Bogotá, Colombia, contains air quality information measured hourly for the full years 2021 and 2022, stored in two different files. Before the data can be used for analysis, its quality must be evaluated because of the occurrence of missing records.
Of the two raw data files provided, one contained the meteorological variables and the other the pollutants, so they were merged and consolidated for January 01, 2021 to December 31, 2022 into a single file called Raw_Consolidated_ECA_Complete_2022.csv. Table 5 shows an example of the raw data from the air quality stations. These stations are located in the same county (Fontibón) as the other sources, allowing future joint analysis of air quality and meteorological variables. Table 6 shows the statistical analysis of the data provided for the public air quality stations: the number of records in the second row (which makes the need for data processing evident), the mean and standard deviation to show the dispersion of the data, and the interquartile range to measure dispersion with respect to the minimum and maximum values [6]. Figs. 4 and 5 show the relationship between the PM2.5 and PM10 particulate matter concentrations, respectively, and the selected meteorological variables, based on the air quality station data processed with data mining and standardized to show clearly the dispersion of the data and the relationship between each pair of variables.
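As a rough illustration of the consolidation step, two hourly tables can be joined on their timestamp and the missing records counted per variable. The sketch below assumes a shared `datetime` column and uses hypothetical column names and values; the actual layout of the SDA files may differ.

```python
import pandas as pd


def consolidate(meteo: pd.DataFrame, pollutants: pd.DataFrame,
                key: str = "datetime") -> pd.DataFrame:
    """Outer-join the two tables on the timestamp so that no hour is dropped."""
    merged = pd.merge(meteo, pollutants, on=key, how="outer")
    return merged.sort_values(key).reset_index(drop=True)


# Illustrative hourly records (hypothetical names and values).
meteo = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2021-01-01 00:00", "2021-01-01 01:00", "2021-01-01 02:00"]),
    "temp_c": [11.5, 11.0, 10.8],
})
pollutants = pd.DataFrame({
    "datetime": pd.to_datetime(["2021-01-01 00:00", "2021-01-01 02:00"]),
    "pm25": [18.0, 21.0],
})

merged = consolidate(meteo, pollutants)
missing = merged.isna().sum()  # missing records per column
print(merged)
print(missing)
```

An outer join is used here so that hours present in only one file remain visible as gaps, which is what makes the need for later imputation measurable.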

Data Source: Tisch® TE-6070V Hi Vol equipment
The equipment used is classified as a Hi Vol active manual sampler, in this case the Tisch® TE-6070V volumetric flow controlled (VFC) PM10 sampler, which is US EPA certified. The calibration protocol was performed using the Tisch® certified orifice TE-5025 to verify the flow rate passing through the quartz filter. The analytical balance used for weighing the filters before and after sampling was also calibrated.
The Tisch® TE-6070V Hi Vol high-volume air sampler collects large quantities of particles: because of the high flow rate, the mass of solid particles collected over the 24 h sampling period ranges from 0.1 to 1 gram, which allows gravimetric analysis with confidence. The direct measurements followed the Reference Method for the Determination of Particulate Matter as PM10 in the Atmosphere [7].
The equipment location was chosen considering the distance to the main road, the distance to borders, the height of the sampler above the floor, and a safe distance from the nearest trees and buildings to avoid interference [8]. In this case, the equipment was installed at an urban site meeting all these requirements: coordinates: N 4°39'13.7", W 74°6'50.7"; elevation: 2552 MASL; height of the sampling chamber above the floor: 2.5 m; average temperature: 12 °C; average barometric pressure: 564 mbar.
In the active high-volume sampler, a measured amount of ambient air was drawn into a sample box through a quartz filter for 24 hours. The filter was weighed before and after sampling to determine the net weight change. The total volume of air sampled was determined from the average flow rate and the sampling time. The total concentration of particles in ambient air was calculated as the mass collected divided by the volume of air sampled, adjusted to reference conditions. This process was repeated on 27 consecutive days, from 07/23/2022 to 08/18/2022.
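The gravimetric calculation described above can be sketched as follows. This is a minimal illustration, not the worksheet used by the authors: the function name is hypothetical, the flow is assumed to be the average volumetric flow at ambient conditions, and the reference conditions (298.15 K and 1013.25 mbar) are assumed defaults; the reference method [7] should be consulted for the exact procedure.

```python
def pm10_concentration(w_initial_g: float, w_final_g: float,
                       flow_m3_min: float, minutes: float,
                       temp_k: float, pressure_mbar: float,
                       ref_temp_k: float = 298.15,
                       ref_pressure_mbar: float = 1013.25) -> float:
    """Return the PM10 concentration in µg/m³ at the given reference conditions.

    The reference temperature and pressure are assumptions of this sketch.
    """
    if w_final_g < w_initial_g:
        raise ValueError("final filter weight is lower than initial weight")
    net_mass_ug = (w_final_g - w_initial_g) * 1e6      # grams to micrograms
    actual_volume_m3 = flow_m3_min * minutes           # volume at ambient conditions
    # Ideal-gas correction of the sampled volume to reference T and P.
    ref_volume_m3 = (actual_volume_m3
                     * (pressure_mbar / ref_pressure_mbar)
                     * (ref_temp_k / temp_k))
    return net_mass_ug / ref_volume_m3


# Illustrative filter weights; flow, duration, temperature and pressure
# follow the figures quoted in the text (1.232 m³/min, 24 h, 12 °C, 564 mbar).
example = pm10_concentration(4.000, 4.100, 1.232, 24 * 60, 285.15, 564.0)
print(f"PM10 = {example:.1f} µg/m³")
```

Because the corrected volume shrinks when ambient pressure is below the reference pressure, the same collected mass yields a higher reported concentration at altitude than at sea level.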
The quartz filters were dried before sampling by placing them in a chamber with abundant dry silica gel for 2 weeks to absorb moisture; their weight was monitored on an analytical balance until the readings were constant, indicating that the filters were dry enough.
The daily recorded data from the air quality station, such as the initial and final weight of the filters, the hour-meter readings of the equipment, the initial and final stagnation pressure, wind speed, temperature, relative humidity, dew point, atmospheric pressure, and wind direction, were consolidated in an electronic worksheet to calculate the daily PM10 concentrations according to the calculation procedure established in the reference method.

Data Sources: Davis AirLink® sensor and Public stations SDA
The Davis AirLink® sensor was installed in the same place as the Tisch® equipment. A calibration standard for low-cost sensors is currently under development by the European Committee for Standardization (CEN); in this work, the methodology proposed by Lewis et al. [5], also known as 'collocation', was followed. It consists of comparing the measurements of the Davis AirLink® sensor with those of an official instrument provided by the SDA and located no more than 10 meters away.
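One common way to use collocated readings is to fit a linear correction that maps sensor values onto the reference instrument's values. The sketch below illustrates that idea only; it is an assumption of this example and is not claimed to be the exact procedure of Lewis et al. [5] or of the authors. Function names and sample values are hypothetical.

```python
def fit_linear_correction(sensor, reference):
    """Least-squares slope and intercept mapping sensor readings to reference readings."""
    n = len(sensor)
    if n != len(reference) or n < 2:
        raise ValueError("need two equally long series with at least 2 points")
    mean_s = sum(sensor) / n
    mean_r = sum(reference) / n
    sxx = sum((s - mean_s) ** 2 for s in sensor)
    if sxx == 0:
        raise ValueError("sensor readings are constant; slope is undefined")
    sxy = sum((s - mean_s) * (r - mean_r) for s, r in zip(sensor, reference))
    slope = sxy / sxx
    intercept = mean_r - slope * mean_s
    return slope, intercept


def apply_correction(value, slope, intercept):
    """Return the sensor reading corrected onto the reference scale."""
    return slope * value + intercept


# Illustrative collocated PM2.5 readings in µg/m³ (not from the dataset).
sensor_pm25 = [10.0, 20.0, 30.0]
reference_pm25 = [7.0, 12.0, 17.0]

slope, intercept = fit_linear_correction(sensor_pm25, reference_pm25)
corrected = apply_correction(25.0, slope, intercept)
print(slope, intercept, corrected)
```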
Data treatment was necessary both for the readings gathered from the Davis AirLink® sensor and for those provided by the SDA from the air quality stations [9]. This treatment is discussed below.
The file Consolidated_Air_Quality_Analysis_2021_2022_Manual.ipynb contains the Python code for the analysis. It includes a description of the data from each source, as well as the completion of missing data by imputation with the machine learning technique KNNImputer, available in the Python module sklearn.impute, which uses the nearest neighboring records to impute each missing value. Once the data was completed, the relationship between air quality, measured as the PM2.5 concentration, and the meteorological variables chosen for the study was plotted, both for the air quality station data and for the data generated by the Davis AirLink® sensor [10]. Likewise, the Pearson, Spearman, and Kendall correlation coefficients were calculated for the PM2.5 concentration in relation to each of the meteorological variables in each dataset.
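The imputation and correlation steps can be sketched as follows. `KNNImputer` and `DataFrame.corr` are real scikit-learn and pandas calls, but the helper names, the column names, the number of neighbors, and the sample values are assumptions of this example and do not reproduce the published notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer


def impute_knn(df: pd.DataFrame, n_neighbors: int = 2) -> pd.DataFrame:
    """Fill missing values with the mean of the nearest neighboring rows."""
    imputer = KNNImputer(n_neighbors=n_neighbors)  # n_neighbors is an assumed setting
    filled = imputer.fit_transform(df)
    return pd.DataFrame(filled, columns=df.columns, index=df.index)


def correlations_with(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """Pearson, Spearman and Kendall coefficients of every column against target."""
    columns = {}
    for method in ("pearson", "spearman", "kendall"):
        columns[method] = df.corr(method=method)[target].drop(target)
    return pd.DataFrame(columns)


# Illustrative records with gaps (hypothetical names and values).
data = pd.DataFrame({
    "pm25": [10.0, 12.0, np.nan, 16.0, 18.0, 21.0],
    "temp_c": [12.0, 13.0, 14.0, 15.0, 16.0, 17.0],
    "rh": [80.0, 78.0, 75.0, np.nan, 70.0, 66.0],
})

completed = impute_knn(data)
table = correlations_with(completed, "pm25")
print(table)
```

Imputing before computing the coefficients keeps every record in the calculation; the alternative of dropping incomplete rows would shrink the sample, which matters most for the station data with many gaps.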
Finally, linear regression, nearest-neighbor regression, and support vector regression models were trained and their results compared, using the air quality station dataset as reference because its larger number of records allowed the models to be trained and their predictions evaluated.
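A comparison of the three model families might look like the sketch below. The scikit-learn estimators are real, but the hyperparameters, the train/test split, the use of R² as the comparison metric, and the synthetic data are all assumptions of this example; the source does not state which settings or metric were used.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR


def compare_regressors(X, y, seed: int = 0) -> dict:
    """Train three regressors on the same split and return their test R² scores."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed)
    models = {
        "linear": LinearRegression(),
        "knn": KNeighborsRegressor(n_neighbors=5),  # assumed hyperparameter
        "svr": SVR(kernel="rbf"),                   # assumed kernel
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = r2_score(y_test, model.predict(X_test))
    return scores


# Synthetic stand-in for meteorological predictors and a PM2.5 target.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0.0, 0.1, size=200)

scores = compare_regressors(X, y)
for name, score in scores.items():
    print(f"{name}: R2 = {score:.3f}")
```

On real data, scaling the predictors before fitting the nearest-neighbor and support vector models is usually advisable, since both depend on distances between records.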

Ethics Statements
Raw data for this work were obtained partly from Bogotá's Environmental District Bureau and partly from the low-cost Davis AirLink® sensor owned by the authors.
This study does not involve research with animals or humans.

Declaration of Competing Interests
The authors declare that they have no known competing financial interests or personal relationships that might have appeared to influence the work reported in this paper.