Seeing the air in detail: Hyperlocal air quality dataset collected from spatially distributed AirQo network

Air pollution is a major global challenge associated with increasing morbidity and mortality from lung cancer, cardiovascular and respiratory diseases, among others. However, there is a scarcity of ground-level air quality monitoring data from Sub-Saharan Africa that can be used to quantify the level of pollution. This has limited targeted air pollution research and interventions (e.g. on health impacts, key drivers and sources, and economic impacts), ultimately hindering the establishment of effective management strategies. This paper presents a dataset of air quality observations collected from 68 spatially distributed monitoring stations across Uganda. The dataset includes hourly PM2.5 and PM10 data collected from low-cost air quality monitoring devices and one reference-grade monitoring device between 2019 and 2020. This dataset helps fill some of the long-standing gaps in ground-level ambient air quality monitoring in Sub-Saharan Africa and can be useful to policy makers and researchers.



Value of the Data
• The dataset helps fill some of the long-standing gaps in ground-level ambient air quality monitoring in Sub-Saharan Africa. In turn, it can guide policy makers in developing evidence-based air quality control strategies and in prioritising air quality issues [2].
• Researchers and the academic community can use this dataset for studies on the socio-economic impact of air pollution and for studies aimed at understanding air pollution exposure risks [3].
• Researchers can use this dataset to develop novel modelling algorithms in the air quality space, e.g. forecasting and spatio-temporal modelling.
• This dataset can be used as a baseline (ground truth) to highlight the potential of low-cost monitors in other countries or regions where air quality data is non-existent, and possibly to model air quality in areas with characteristics similar to the region where the data was collected.
• This dataset can be used to track progress on the implementation of the World Health Organisation air quality guidelines [4].
• This dataset can be fused with other datasets, such as satellite data, for environmental and air quality modelling.

Data Description
The air quality dataset presented in this article comprises records containing a timestamp in UTC, PM2.5 concentration in μg/m³, PM10 concentration in μg/m³, a site id which uniquely identifies a monitoring site, and the site coordinates (latitude, longitude). It contains 506,164 records from low-cost monitors and 3,364 records from the reference-grade monitor. The monitoring devices have varying start dates because they were deployed on different dates as the network was continuously expanded. The mean PM2.5 and PM10 concentrations in the low-cost monitor dataset are 37.39 μg/m³ and 49.61 μg/m³ respectively. Table 1 and Fig. 2 show the statistical summary of the data from the low-cost monitors. Table 2 shows the statistical summary of the data from the reference-grade monitor.
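
A summary of the kind reported in Table 1 can be reproduced from records with the fields listed above. The following is a minimal sketch using only the Python standard library; the column names (`timestamp`, `pm2_5`, `pm10`, `site_id`, `latitude`, `longitude`) and the sample values are assumptions made for illustration and may not match the labels or contents of the published files.

```python
import csv
import io
import statistics

# Hypothetical column names and made-up values, for illustration only.
SAMPLE_CSV = """timestamp,pm2_5,pm10,site_id,latitude,longitude
2020-01-01T00:00:00Z,30.0,42.0,site_a,0.31,32.58
2020-01-01T01:00:00Z,40.0,50.0,site_a,0.31,32.58
2020-01-01T00:00:00Z,20.0,28.0,site_b,0.35,32.61
"""


def summarise(rows, field):
    """Return count, mean, min and max of one pollutant column."""
    values = [float(r[field]) for r in rows if r[field] not in ("", None)]
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "min": min(values),
        "max": max(values),
    }


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
pm25_summary = summarise(rows, "pm2_5")
pm10_summary = summarise(rows, "pm10")
```

To run the same summary on a downloaded file, replace `io.StringIO(SAMPLE_CSV)` with an open file handle and adjust the field names to the file's header.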

Experimental Design, Materials and Methods
The data presented in this article was collected from a network of AirQo [1] low-cost monitors and one reference-grade monitor. The monitoring sites were selected to capture pollution variations across diverse physical environments in the selected urban centres: population distribution (high vs low population density), commercial centres vs residential areas, urban background vs non-urban background, and proximity to emission sources such as the road network and industries.
The monitoring site with the reference-grade monitor is an institutional setting with a resident population of over 5000, with paved roads and vegetation canopies. It is located about 0.6 km from a major road and is 1237.39 meters above sea level. The reference-grade monitor is a Met One Beta Attenuation Monitor Model 1022 [5,6], which uses the principle of beta ray attenuation to continuously monitor particulate matter. It is configured to measure and record hourly PM2.5 concentration.
The low-cost monitors, on the other hand, use a laser scattering technique and carry dual Plantower sensors (PMS 5003) [2]. These devices measure PM2.5 and PM10 with an effective range of 0-500 μg/m³, and also record the device location coordinates. The measured data is transmitted to a cloud platform every 90 seconds over a local cellular network.
The raw data from the low-cost monitors is then extracted from the cloud platform and re-sampled to an hourly frequency. The PM2.5 and PM10 values are computed by averaging the observations from the dual sensors. Records with missing or invalid measurements, such as negative values and values greater than 500 μg/m³, are eliminated from the dataset. The raw measurements from the low-cost monitors are calibrated by applying machine learning models trained on data from collocated low-cost and reference-grade monitors [8]. These models were validated through cross-unit and cross-site validation.
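
The averaging, filtering and hourly re-sampling steps described above can be sketched as follows. This is an illustrative sketch, not the authors' pipeline: the function names, the order of the steps, and the choice to drop a record when either sensor reading is invalid are assumptions.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import fmean

MAX_VALID = 500.0  # upper bound of the sensors' effective range, in ug/m3


def is_valid(value):
    """Keep a reading only if it is present and within 0-500 ug/m3."""
    return value is not None and 0.0 <= value <= MAX_VALID


def hourly_means(readings):
    """Average dual-sensor readings, then average them per hour.

    `readings` is an iterable of (timestamp, sensor_1, sensor_2) tuples.
    Assumption: a record is dropped if either sensor value is invalid.
    """
    buckets = defaultdict(list)
    for ts, s1, s2 in readings:
        if not (is_valid(s1) and is_valid(s2)):
            continue
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append((s1 + s2) / 2.0)
    return {hour: fmean(vals) for hour, vals in sorted(buckets.items())}


def utc(hour, minute, second=0):
    """Helper building a UTC timestamp on an arbitrary example day."""
    return datetime(2020, 1, 1, hour, minute, second, tzinfo=timezone.utc)


raw = [
    (utc(0, 0), 10.0, 12.0),      # sensor mean 11.0
    (utc(0, 1, 30), 14.0, 16.0),  # sensor mean 15.0
    (utc(0, 3), -1.0, 13.0),      # dropped: negative value
    (utc(1, 0), 600.0, 20.0),     # dropped: above 500
    (utc(1, 1, 30), 20.0, 22.0),  # sensor mean 21.0
    (utc(1, 3), None, 25.0),      # dropped: missing value
]
hourly = hourly_means(raw)
```
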
PM2.5 measurements were calibrated with a random forest model, which reduced the root mean square error (RMSE) from 18.58 μg/m³ to 7.22 μg/m³ and the mean absolute error (MAE) from 14.60 μg/m³ to 4.60 μg/m³ when compared to the collocated reference monitor readings. PM10 measurements were calibrated with a lasso regression model, which reduced the RMSE from 13.40 μg/m³ to 7.91 μg/m³ and the MAE from 11.32 μg/m³ to 6.01 μg/m³. The statistical summaries for the processed dataset were then computed.
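
RMSE and MAE compare monitor readings against collocated reference readings, before and after calibration. The sketch below shows how the two metrics are computed; the numbers are made-up illustrative values, not data from this study, and the calibration models themselves are not reproduced here.

```python
import math


def rmse(pred, ref):
    """Root mean square error between paired readings."""
    if len(pred) != len(ref) or not ref:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref))


def mae(pred, ref):
    """Mean absolute error between paired readings."""
    if len(pred) != len(ref) or not ref:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(ref)


# Made-up hourly values in ug/m3, for illustration only.
reference = [10.0, 20.0, 30.0, 40.0]
raw_low_cost = [13.0, 24.0, 27.0, 44.0]
calibrated = [11.0, 19.0, 31.0, 40.0]

raw_rmse, raw_mae = rmse(raw_low_cost, reference), mae(raw_low_cost, reference)
cal_rmse, cal_mae = rmse(calibrated, reference), mae(calibrated, reference)
```

In this toy example both metrics fall after calibration, the same direction of change as reported above for the random forest and lasso models.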

Ethics Statements
To preserve the privacy of individuals and institutions hosting the monitoring devices, random, distance-preserving coordinate transformations were applied to the actual coordinates of the monitoring sites. The distance between the transformed and actual coordinates varies between 50 and 110 metres, with an average of 78.35 metres.
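
The exact transformation is not specified in the text. As one possible illustration, the sketch below shifts a site by a random distance between 50 and 110 metres at a random bearing and checks the displacement with the haversine formula. The function names, the independent per-site displacement, the uniform sampling, and the example coordinates are all assumptions; independent shifts of this size only approximately preserve inter-site distances.

```python
import math
import random

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def displace(lat, lon, rng, min_m=50.0, max_m=110.0):
    """Shift a point by a random distance in [min_m, max_m] at a random bearing.

    Uses a local flat-earth approximation, adequate for offsets of ~100 m.
    """
    distance = rng.uniform(min_m, max_m)
    bearing = rng.uniform(0.0, 2.0 * math.pi)
    dlat = distance * math.cos(bearing) / EARTH_RADIUS_M
    dlon = distance * math.sin(bearing) / (
        EARTH_RADIUS_M * math.cos(math.radians(lat)))
    return lat + math.degrees(dlat), lon + math.degrees(dlon)


rng = random.Random(0)       # seeded for reproducibility
site = (0.3476, 32.5825)     # hypothetical site coordinates
shifted = [displace(*site, rng) for _ in range(200)]
offsets = [haversine_m(*site, *p) for p in shifted]
```
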

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships which have or could be perceived to have influenced the work reported in this article.
All the authors declare that their affiliation to AirQo and Makerere University has not influenced the work reported in this paper.

Data Availability
Seeing the air in detail: hyperlocal air quality dataset collected from spatially distributed AirQo network (Original data) (Mendeley Data).