Weight-of-evidence approach to identify regionally representative sites for air-quality monitoring network: Satellite data-based analysis

The methodology discussed in Lekinwala et al., 2020, hereinafter referred to as the 'parent article', is used to set up a nation-wide network for background PM2.5 measurement at strategic locations, placing sites so as to obtain maximally regionally representative PM2.5 concentrations with a minimum number of sites. Traditionally, in-situ PM2.5 measurements are obtained for several potential sites and compared to identify the most regionally representative site at a location [4] (Wongphatarakul et al., 1998). The parent article proposes the use of a satellite-derived proxy for aerosol (Aerosol Optical Depth, AOD) in the absence of in-situ PM2.5 measurements. This article focuses on the details of the satellite-data processing that forms part of the methodology discussed in the parent article. The following aspects are relevant:
• High-resolution AOD is retrieved from the Moderate Resolution Imaging Spectroradiometer (MODIS) instruments aboard NASA's Aqua and Terra satellites using the Multi-Angle Implementation of Atmospheric Correction (MAIAC) algorithm. The data are stored as grids of size 1200 × 1200, and a total of seven such grids cover the Indian land mass. These grids were merged, regridded and multiplied by conversion factors from the GEOS-Chem chemical transport model to obtain PM2.5 values. A standard set of tools such as CDO and NCL is used to manipulate the satellite data (*.nc files).
• The PM2.5 values are subjected to various statistical analyses using metrics such as the coefficient of divergence (CoD), the Pearson correlation coefficient (PCC) and mutual information (MI).
• Computations for CoD and MI are performed using Python code developed in-house, while a function in Python's NumPy module is used for the PCC calculations.


Subject Area
Environmental Science
More specific subject area: Use of satellite data in Air Quality Monitoring (AQM)
Method name: Satellite-derived PM2.5 to establish regional representativeness using statistical metrics (CoD, PCC, MI)
Name and reference of original method: The current work focuses on the application of mutual information [3] as a metric to capture non-linear relationships in the data. Additionally, the metrics CoD and PCC discussed in [7] are used in the study.

Method Details
The method discussed in this article covers the procurement and pre-processing of the satellite data, their analysis using the metrics Coefficient of Divergence (CoD), Pearson Correlation Coefficient (PCC) and Mutual Information (MI), and the visualisation of the spatial maps of these metrics. The verification and validation of the method are presented in the parent article [5]. Further, all codes developed for this work are provided as supplementary material.

Data Procurement
The high-resolution AOD data are freely available over the Eastern and South-Eastern Asia region and are divided into grids as shown in Fig. 1 below. The Indian landmass is covered by eight grids of size 1200 km × 1200 km each, as highlighted in Fig. 1.
The files can be downloaded from NCCS's Dataportal [1]. Grid-wise daily files of AOD values are available in year-wise folders, each containing multiple AOT files per day (multiple swaths for Aqua and Terra, at most 4 files per day) with unique file names. For example, in grid h00v01 [2], the files are sorted annually; for the year 2004, the files are named according to the following convention:

MAIAC: Algorithm used on the MODIS data to obtain data products [6]
A/T: NASA's Aqua/Terra satellite, fitted with the MODIS sensor
AOT: Aerosol optical thickness
h00v01: Grid name, as shown in Fig. 2
2004: Year
001: Day of the year; ranges from 001 to 365/366 (depending on leap year), 001 corresponding to January 01
0735: Time at which the satellite passed over the region
hdf: File extension
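The components above can be extracted programmatically when iterating over a year-wise folder. The short sketch below assumes the components are concatenated as MAIACAAOT.h00v01.20040010735.hdf (the exact separator layout of the downloaded files may differ slightly):

```python
import re

# Hypothetical parser for MAIAC file names of the assumed form
# MAIAC<A/T>AOT.<grid>.<year><doy><hhmm>.hdf
PATTERN = re.compile(
    r"MAIAC(?P<sat>[AT])AOT\."      # algorithm + satellite (A = Aqua, T = Terra)
    r"(?P<grid>h\d{2}v\d{2})\."     # grid name, e.g. h00v01
    r"(?P<year>\d{4})(?P<doy>\d{3})(?P<time>\d{4})\.hdf$"
)

def parse_maiac_name(name):
    """Split a MAIAC file name into its components (returns a dict)."""
    m = PATTERN.match(name)
    if m is None:
        raise ValueError("not a MAIAC file name: %s" % name)
    return m.groupdict()

info = parse_maiac_name("MAIACAAOT.h00v01.20040010735.hdf")
print(info)  # {'sat': 'A', 'grid': 'h00v01', 'year': '2004', 'doy': '001', 'time': '0735'}
```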

Data pre-processing
The files for the years 2004-2011 were procured for the current methodology, as the AOD-to-PM2.5 conversion factors (CF) were available for these years. Fig. 2 shows the grids over India and the scope of availability of the CF. Fig. 3 schematically shows the data pre-processing steps involved in the methodology.
1. Conversion of hdf files into netCDF files:
a. The details of the newly created netCDF file can be checked using the ncdump utility (part of the netcdf-bin package on a Linux-based PC). The ncdump of one file, shown below, indicates that 9474 time steps are available (multiple daily time instants due to multiple passes of both Aqua and Terra every day).
b. The process of converting hdf files into netCDF files is repeated for all the grids. The bottleneck in this process is the disk writing speed (the file for each grid is about 32 GB in size), and the conversion may take up to an hour per grid. For all the computations, a Linux-based computer with a 6-core, 12-thread i7-8700 processor, 24 GB RAM and a standard spinning hard-disk drive is used.
2. Daily mean of the netCDF file: To make the processing easier and bring uniformity across different grids, the daily mean of the data is used, making the total number of time instants (2922 days) the same across the different grids.
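The daily-mean step can be sketched as follows. This is an illustration of the idea in plain NumPy, not the authors' workflow (which uses CDO on the netCDF files); missing cells are assumed to be stored as NaN:

```python
import numpy as np

# Collapse multiple satellite passes per day into a single daily-mean field
# by averaging all time instants that share the same day index.
def daily_mean(days, fields):
    """days: 1-D int array, day index of each time instant;
    fields: array of shape (n_instants, ny, nx), NaN marking missing cells.
    Returns (unique_days, daily-mean fields)."""
    uniq = np.unique(days)
    out = np.stack([np.nanmean(fields[days == d], axis=0) for d in uniq])
    return uniq, out

# Two passes on day 0 and one pass on day 1, on a tiny 1 x 2 grid
days = np.array([0, 0, 1])
fields = np.array([[[0.2, 0.4]], [[0.4, np.nan]], [[0.1, 0.3]]])
uniq, means = daily_mean(days, fields)
print(means[0])  # day 0: [[0.3, 0.4]] -- the NaN is ignored in the mean
```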

Rectilinear transformation:
The data grids obtained from the satellite are curvilinear in nature. The daily-averaged data for the different grids need to be merged to create a consistent spatiotemporal dataset over all the time instants. In order to merge the grids, they need to be transformed from curvilinear to rectilinear form using the code rectilinear.ncl, after which the data are remapped using the grid characteristics given in regrid.txt.
a. The NCL code creates a blank rectilinear *.nc file.
b. The blank rectilinear *.nc file is populated with the variables and saved separately using CDO's setgrid function.
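For reference, a CDO grid-description file for a rectilinear lon-lat target grid (such as regrid.txt) generally takes the form below. The sizes, origins and increments shown are illustrative placeholders only, not the values used in the study:

```
gridtype = lonlat
xsize    = 2400
ysize    = 2400
xfirst   = 60.0
xinc     = 0.0135
yfirst   = 5.0
yinc     = 0.0135
```

Here xsize/ysize give the number of cells, xfirst/yfirst the first cell centre in degrees, and xinc/yinc the cell spacing (about 0.0135 degrees corresponds to roughly 1.5 km).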

Remapping and Merging:
The merging of the grids is important to ensure spatial continuity in the data and makes processing easy. The following points are worth noting:
a. To merge the different grids, a large grid covering the extent of the smaller grids needs to be created.
b. The remapping process is computationally expensive and time-consuming; to strike a balance between spatial resolution and computation time, the 1 km × 1 km data were, for the current application, remapped (converted) to 1.5 km × 1.5 km using CDO's remapcon function.
c. This resolution was found optimal: without losing fineness in the data, the computation time was reduced by a factor of about 2-2.5 (roughly half the original time).
d. The remapping took about 400 hours to complete on a 6-core, 12-thread 4 GHz Intel i7-8700 machine with 24 GB of RAM.
e. Remapping the data to a common spatial resolution is essential before merging, while remapping to a coarser grid is optional.
f. During the merging process (using CDO's mergegrid function), a minimum of 500 GB of storage is required for the auxiliary (intermediate) files, which can be deleted after successful merging.
g. In case of an issue or an error in a file, the computation can be resumed from the step before the error occurred, provided the auxiliary files are retained until successful completion of the remapping and merging.
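First-order conservative remapping (as performed by remapcon) averages source-cell values weighted by their overlap area with each target cell. In the degenerate case where each target cell exactly covers an n × n block of equal-area source cells, this reduces to a block mean, which the sketch below illustrates (an illustration of the principle, not a replacement for CDO, which handles fractional overlaps and curvilinear geometry):

```python
import numpy as np

# Block-mean coarsening: the special case of conservative remapping in which
# every target cell exactly covers an n x n block of equal-area source cells.
def block_mean(field, n):
    ny, nx = field.shape
    assert ny % n == 0 and nx % n == 0, "grid must divide evenly into blocks"
    return field.reshape(ny // n, n, nx // n, n).mean(axis=(1, 3))

src = np.arange(16, dtype=float).reshape(4, 4)  # 4 x 4 source grid
print(block_mean(src, 2))
# each 2 x 2 block is averaged, e.g. mean(0, 1, 4, 5) = 2.5
```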

Remapping CF and Multiplying AOD with CF:
For the current study, the CF are obtained from the GEOS-Chem chemical transport model, following the study of van Donkelaar et al. [8].
a. The daily CF (2004-2011) are already in rectilinear form, so it is easy to remap them to the same spatial resolution as the AOD.
b. The daily CF values are multiplied with the daily AOD values to obtain daily PM2.5 values.
c. The multiplication is computationally expensive and is also bottlenecked by the storage speed; it may take about 4-6 hours to process about 15.4 × 10^12 floating-point operations.
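Step (b) is a simple element-wise product once the AOD and CF fields share one grid; a minimal sketch (the CF units shown in the comments are assumptions for illustration):

```python
import numpy as np

# Daily PM2.5 as the element-wise product of the daily AOD field and the
# daily conversion-factor (CF) field on the same grid; missing cells (NaN)
# in either input propagate into the result.
def aod_to_pm25(aod, cf):
    assert aod.shape == cf.shape, "AOD and CF must share one grid"
    return aod * cf

aod = np.array([[0.5, 0.8], [np.nan, 0.3]])  # unitless AOD
cf = np.array([[60.0, 55.0], [70.0, 80.0]])  # assumed (ug/m^3) per unit AOD
print(aod_to_pm25(aod, cf))  # [[30. 44.] [nan 24.]]
```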

Data Analysis
The satellite-derived PM2.5 values obtained in the data pre-processing step are used for further analysis. A schematic of the PM2.5 data obtained is shown in Fig. 4.
The following are the important points to note in the data analysis:
1. The coefficient of divergence (CoD) quantifies how much two time series diverge: if the values are close, the CoD takes a value close to zero, while for completely different values the CoD is close to unity. The CoD for two columns of values can be calculated as

CoD = sqrt( (1/n) Σ_{t=1}^{n} ((x_t − y_t) / (x_t + y_t))^2 )

where x_t and y_t are the values of PM2.5 at the reference cell and a neighbouring cell at the t-th time instant, respectively, and n is the total number of time instants. A function for the CoD was created for all the calculations and is defined as cod_calculations in the file computation.py.
2. The Pearson correlation coefficient (PCC) is an additional statistical metric; it can easily be accessed through the corrcoef function of Python's NumPy library.
3. While the CoD compares the values and the PCC quantifies the linear relationship in the data, mutual information (MI) is used to additionally capture the non-linear relationship in the data.
4. Lekinwala et al. [5] discuss the algorithm used to calculate the mutual information in detail. The function mutual_information is part of the computation.py code.
a. The mutual information between two discretised variables X and Y can be computed as

MI(X, Y) = Σ_x Σ_y p(x, y) log( p(x, y) / (p(x) p(y)) )

where p(x, y) is the joint probability distribution and p(x) and p(y) are the marginal distributions of X and Y.

Visualisation:
1. Matplotlib's contour function is used to create the spatial map of each metric.
2. Several font-related and contour-plot-related options are used to make the plots visually better; the other functions and options used can be found in the computation.py code.
3. The plots for MI (a), CoD (b) and PCC (c), created using the code in the computation.py file, are presented in Fig. 5. The results and interpretation of Fig. 5 for the Bhopal site and other sites are discussed in [5].
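As an illustration of the three metrics, the sketch below re-implements them in plain NumPy. This is a hypothetical stand-in for the supplementary computation.py (whose exact function names and signatures may differ); the MI estimate discretises the data with a simple 2-D histogram:

```python
import numpy as np

def cod(x, y):
    """Coefficient of divergence: 0 for identical series, approaching 1
    for completely different positive-valued series."""
    return np.sqrt(np.mean(((x - y) / (x + y)) ** 2))

def pcc(x, y):
    """Pearson correlation coefficient via NumPy's corrcoef."""
    return np.corrcoef(x, y)[0, 1]

def mutual_information(x, y, bins=16):
    """Histogram-based MI estimate (in nats) between two continuous series."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()                      # joint distribution p(x, y)
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y)
    nz = pxy > 0                          # skip empty bins to avoid log(0)
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(0)
x = rng.uniform(10.0, 100.0, 2000)        # synthetic PM2.5-like series
print(cod(x, x))                          # 0.0 for identical series
print(round(pcc(x, 2.0 * x + 5.0), 6))    # 1.0 for a linear relationship
# MI is high for a deterministic relationship, low after shuffling:
print(mutual_information(x, x) > mutual_information(x, rng.permutation(x)))  # True
```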

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.