1 Introduction

In recent years, there has been a remarkable increase in the availability and size of georeferenced datasets and data retrieval tools, particularly for environmental studies. In certain research areas, such as air quality monitoring (Vitolo et al. 2016), agriculture and livestock farming (Fassó et al. 2023), climate (Cruz-Alonso et al. 2023), and geology (Ryberg and Vecchia 2012), access to high-quality data is a crucial issue.

For air quality and pollution in particular, access to data is typically linked to finding information from institutional sources, such as public environmental monitoring and protection agencies, government departments, and satellite data repositories. Successful examples of data collection are the rdefra (Vitolo et al. 2016) software, which is used to retrieve air pollution data from the UK Department for Environment, Food and Rural Affairs; SHERPA-city (Degraeuwe et al. 2021), a web application that assesses the impact of traffic measures on NO\(_2\) in European cities; saqgetr (Grange 2019), a package to import AQ monitoring data from the AirBase - The European air quality database; and openair, an open source software for air quality data retrieval and analysis from UK national and regional networks (Carslaw and Ropkins 2012; Szulecka et al. 2017). Additional tools have been developed to supplement the aforementioned databases by retrieving satellite data on terrestrial weather conditions and anthropogenic emissions of various pollutant species, e.g., ecmwfr (Hufkens et al. 2019), which downloads data from the European Centre for Medium-Range Weather Forecasts dataset web services (ECMWF) and Copernicus’s Climate Data Store (CDS) and on land cover (see sen2r by Ranghetti et al. 2020, for automatic discovery and management of satellite data provided by the Sentinel-2 program of the European Space Agency).

In this paper, we present the ARPALData software, an R package (R Core Team 2020) aimed at retrieving, handling, and analyzing air quality and weather data for the Lombardy region in Northern Italy. Lombardy has approximately 10 million inhabitants and the largest population density in Italy, with the highest per capita GDP across the country, sustained by an extensive network of enterprises (mainly services) and industries (Regional Statistical Yearbook 2017). Owing to its high industrial and housing density, coupled with growing land consumption, the region has one of the worst air quality in Europe and worldwide (Carugno et al. 2016; Raffaelli et al. 2020). Therefore, the case study of Lombardy and the Po Valley has received considerable attention from the media and international researchers (Collivignarelli et al. 2020).

Air quality and weather monitoring in Lombardy is entrusted to the Regional Agency for Environmental Protection, namely ARPA Lombardia, which maintains ground monitoring networks and validates and provides data to the public (Maranzano 2022). ARPALData retrieves data directly from the Regione Lombardia Open Database, which provides information on several topics, including the environment, transportation, economy, and social issues in Lombardy. Specifically, for air quality research, data provided by ARPA Lombardia on pollutants and local meteorology are the main institutional sources for researchers. These data have been used in multiple investigations, including monitoring long-term trends in air concentrations (Bigi and Ghermandi 2014; Arvani et al. 2016), source apportionment (Manigrasso et al. 2012), green areas management in urban settings (Terzaghi et al. 2020), the role of agriculture and livestock in fine particulate emissions (Fassó et al. 2023; Marongiu et al. 2023), the impact on air quality of local traffic policies (Fassó 2013; Gibson and Carnovale 2015; Maranzano et al. 2020), the impact of restrictive policies related to COVID-19 (Cameletti 2020; Bontempi 2020; Lovarelli et al. 2021; Bontempi et al. 2022; Fioravanti et al. 2022; Maranzano and Pelagatti 2023; Maranzano et al. 2023), and epidemiological studies linking pollution, mortality, and diseases (Angelici et al. 2016; Carugno et al. 2017; Zoran et al. 2020a, b).

Despite the wide dissemination of ARPA data for research and local policies, data collection in Lombardy suffers from several critical issues, some of which are directly addressed by ARPALData. In addition, there is currently no national platform that systematically collects, harmonizes, and disseminates data from individual regional agencies. The joint collection of data pertaining to multiple Italian regions requires access to specific databases that are not uniform and have unique peculiarities. Therefore, ARPALData is an attempt to standardize the environmental data discovery process for a specific region and broaden the debate on the need for a unified national system by involving researchers and experts from other agencies and institutions in its development.

The remainder of this paper is organized as follows. Section 2 describes the main issues addressed by the software by explaining how the data collection step is currently possible. Section 3 describes the available data collected using the software and the data quality standards guaranteed by ARPA Lombardia. Section 4 presents the functions/commands provided to the users of ARPALData in detail, specifying their objectives and the main technical details. Section 5 provides three guided examples of how ARPALData can be used to cover the major software capabilities. Throughout the text, the syntax (included in the appropriate code chunks) and results obtained are discussed. Finally, Sect. 6 provides information on software availability, as well as plans for future developments and enhancements.

2 Issues addressed by ARPALData

ARPALData addresses several issues concerning data retrieval and management in the Lombardy region. The first issue regards the potential challenges posed by data access and downloading. Official air quality and weather data for Lombardy can be retrieved using two institutional open data sources. The first source is the online dashboard provided by ARPA Lombardia, accessible at https://www.arpalombardia.it/temi-ambientali/meteo-e-clima/guida-richiesta-dati/ for weather and https://www.arpalombardia.it/dati-e-indicatori/aria/ for air quality (both platforms were accessed on December 7th, 2023). The dashboard allows users to inquire about measurements collected from one or more specific sensors installed on the control units that are currently active on the network. For each query, the user can select a maximum of six sensors simultaneously and data referring to no more than one year (i.e., 365 consecutive days). The output is a spreadsheet sent via email to the requester. Thus, data collection requires considerable time and many repeated queries. The second option is to directly enquire into the official regional open database available at https://www.dati.lombardia.it/ (accessed on December 7th, 2023). However, accessing data through a regional open database requires nontrivial knowledge of data management and file management tools, thereby complicating data consultation for non-expert users. To this extent, ARPALData provides a user-friendly tool that does not require access to institutional platforms, is often accessible only through personal credentials, and ensures the real-time updating of information using standardized syntax.

The second issue concerns the original data format and the requirement for remarkable data manipulation and harmonization. ARPA Lombardia provides air quality and weather measurements at a pre-specified sampling frequency, that is, hourly air quality data from the ground monitoring network, daily air quality data at the municipal level, and infra-hourly (10 min) data for meteorological parameters. The concentrations of airborne pollutants or meteorological variables measured through the monitoring network, and daily municipal estimates of airborne pollutant concentrations, can be retrieved by accessing a regional open database. In all cases, users are only allowed to download year-specific spreadsheets containing measurements for the entire network, structured according to a long format. In the case of a multi-year query, the original data should be manually repeated by downloading and reshaping multiple spreadsheets. In addition, each sensor is uniquely identified using a sensor code that is specifically tabulated in an external metadata file. Users are not allowed to prefilter the measurements by date, station, or sensor. The collection procedure requires nontrivial data manipulation (data cleaning and table joining) of the raw measurements and metadata. ARPALData directly collects data from the regional open database source and provides information reshaped according to a two-way panel structure by station (i.e., each entry is univocally identified by the station identification code or name and the time stamp) or a three-way panel format by sensor (i.e., each entry is univocally identified by the sensor identification code and the time stamp). Through user-defined arguments, the functions for downloading data used in the package allow the retrieval of user-specified measurements taken on the entire network or on a subset of the available monitoring stations for any user-specified time period. Furthermore, the user can request data at the original sampling frequency or aggregated according to customized aggregation functions (e.g., monthly averages or weekly maxima). The required observations are directly combined and harmonized into a single data frame object of the class ARPALdf. Eventually, in addition to supplying raw data, ARPALData aims to provide key descriptive statistics and graphic tools to complement official reports and personal analyses, allowing users to learn more about issues related to air quality and meteorology in the region.

The third issue addressed by ARPALData concerns the language. All the original explanatory metadata (i.e., registry tables of monitoring stations or municipalities) and the raw data from ARPA and Regione Lombardia are reported in Italian. This could discourage access by non-Italian-speaking users who are unable to understand the information. Therefore, the dissemination and use of air quality and weather data for Lombardy are greatly reduced and made almost exclusively by domestic affairs. To address this challenge, ARPALData provides all measurements, metadata, and subsequent analytical tools in English, thus facilitating accessibility for both international and domestic users.

Finally, although the data are geographically limited to the Lombardy region, ARPALData allows accessing to environmental measures that are public but not otherwise available through other software or international platforms. For example, as highlighted by Fassó et al. (2023), the European Environmental Agency (EEA) platform provides measures for only a subset of pollutants in Lombardy, from which ammonia or metal concentrations are excluded.

3 Available air quality and weather data

The ARPALData software handles the following datasets from the ARPA Lombardia open data:

  • hourly or daily air quality measurements from the ground monitoring network (174 stations activated since 1968, thereof 93 currently active) since 1968;

  • 10-min weather measurements from the ground monitoring network (331 stations activated since 1989, thereof 267 currently active) since 1989;

  • daily air quality measurement for the 1506 municipalities in Lombardy since 2011.

The available data are continuously updated to the current date. Figure 6 presents the maps of either the currently active or historical air quality and ground weather monitoring networks. The two networks are heterotopic, and in only a few cases do they overlap (i.e., few sites monitor both pollution and weather). The air quality ground monitoring network of ARPA Lombardia comprises fixed stations that provide continuous data at regular time intervals by means of automatic analyzers (ARPA Lombardia 2023b), that is, the sensors. Following the EU Air Quality Directive 2008/50/EC (European Parliament 2008), the pollutant species monitored are NO\(_X\), NO\(_2\), SO\(_2\), CO, O\(_3\), PM\(_{10}\), PM\(_{2.5}\), NH\(_3\), black carbon, arsenic, benzene, benzo(a)pirene, and metals (such as, nickel, cadmium, and lead). The list of pollutants surveyed by the monitoring stations may differ depending on the environmental context in which the unit is being monitored. Therefore, not all stations are equipped with the same analytical toolkit (for instance, ammonia sensors are typically installed at rural sites only, while NO\(_2\) sensors are spread all over the region). The stations are distributed throughout the region according to population density and type of territory according to the criteria defined by Legislative Decree 155/2010 (Presidenza della Repubblica Italiana 2010). Pollutant data are provided at hourly frequency, except for PM\(_{10}\), PM\(_{2.5}\), and NH\(_3\), for which daily averages are provided.

The meteorological monitoring network (ARPA Lombardia 2023c) comprises stations measuring the following parameters: hydrometric level (cm), snow depth (cm), precipitation of snow and rainfall (mm), temperature (\(^\circ\)C), relative Humidity (%), global solar Radiation (W/m\(^2\)), wind speed and direction (m/s and circular degrees), and speed and direction (m/s and degrees N) of wind gusts. All the weather measurements are provided by the agency with a 10-min frequency.

As for the air quality network, also the surveyed meteorological parameters may be different across stations depending on the reference environmental context (e.g., snow depth is measured only at mountain sites). In both networks, the distribution across the regional territory considerably change among pollutants and among weather parameters. As an illustration, in Figs. 7 and 8 we show the spatial distributions of the installed sensors for some relevant pollutants (i.e., NO\(_2\), PM\(_{10}\), PM\(_{2.5}\), and ammonia) and weather parameters (i.e., temperature, rainfall, snow depth, and wind), respectively. Also, note that there exists a one-to-many relationship between stations and sensors, that is, each sensor is programmed to collect information on a single weather or pollution parameter, while multiple sensors can be installed in each station simultaneously.

Municipal values were obtained from regional-scale simulations performed using a chemical-physical air quality model. Therefore, these are not direct measurements, but estimates that also use data from the ARPA air quality detection network. Available pollutants at the municipal level are the weighted daily averages of PM\(_{10}\), PM\(_{2.5}\), and NO\(_2\); the daily maximum of NO\(_2\) and tropospheric O\(_3\); and the daily maximum of the 8-h moving average of tropospheric O\(_3\). Daily estimates of each of the pollutants mentioned are given for all municipalities.

Data provided by the regional agency for the current and immediately preceding year may contain measurements with a low degree of fidelity that must be checked through an appropriate data validation process. Statistical validation of the data involves an assessment phase for each series collected at each site that is completed by March 30 of the year following the measurement year. Therefore, the data should be considered non-definitive before that date. For municipal data in particular, estimates must be considered provisional until the validation process of measurements is completed, followed by the recalculation of values and the release of consolidated data; consolidated estimates for the previous year are tipically available as of June 30 of the current year. Potential causes of the unreliability of certain measurements include malfunctioning or failure of monitoring stations that induce anomalies in the measured values and interruptions in data collection. To overcome these issues, ARPA Lombardia has adopted management procedures and field checks that guarantee that the data collected are as accurate as possible in compliance with European and Global standards. Specifically, the agency implements quality assurance procedures involving a specialized team in charge of performing specific checks in the context of quality control and quality assurance activities, including preventive maintenance and corrective maintenance interventions. Readers may refer to Maranzano (2022) for an extensive discussion of the monitoring networks maintained by ARPA Lombardia on the type and quality of data collected and provided by the agency and on the agency’s role in national and European contexts.

4 Structure and main features of ARPALData

The ARPALData package provides users with 17 functions, which can be grouped into 4 categories: 1) air quality and weather measurements retrieval; 2) data exploration, quality check and graphical representation; 3) class check; and 4) auxiliary functions. A detailed description and summary of ARPALData functions is reported in Table 3. Furthermore, while we discuss in detail the functions to download air quality and weather data, their metadata and exploratory statistics commands, we refer the reader to Supplementary Material 2 for an extended description of other auxiliary functions. In particular, we provide details on class-checking functions, temporal aggregation functions, and commands providing geometries (shapefiles) and maps of Lombardy at different spatial scales. In Supplementary Material 2, we also describe the input arguments of the functions (e.g., type and class, default values and values they can take), the structure of the outputs (e.g., the format of downloaded datasets), and the statistical and computational techniques implemented (e.g., temporal aggregation functions and spatial mapping).

Notably, ARPALData complies with the Comprehensive R Archive Network (CRAN) policy, which requires packages using Internet resources to provide an informative message if the resource is unavailable or has changed. In addition, several functions admit an optional verbose argument that prints auxiliary information on the screen that informs the user about which operations the command is performing and some object-specific information.

4.1 Retrieve air quality and weather data using ARPALData

The ARPALData package contains three functions that allow data to be downloaded from the ARPA Lombardia database. Each function is identified by the prefix get_ARPA_Lombardia, while the suffix determines the type of data to be downloaded: AQ_data returns the observed air quality measurements collected by the ARPA Lombardy ground-based survey system, W_data returns the observed meteorological measurements collected by the ground-based monitoring network, and AQ_municipal_data returns the air quality estimates for each municipality in Lombardy.

The get_ARPA_Lombardia commands share the same set of input arguments, which enable the user to customize the query regarding the variables and period of interest, the sampling frequency, and the grouping statistics. For example, it is possible to concurrently download the average daily concentration of one pollutant and the maximum daily concentration of another pollutant. The period of interest is specified by the user through the start and end dates (optionally stating the exact time); therefore, it is possible to download measurements pertaining to infra-annual sub-periods, but also to periods crossing one or more years. Users can also choose whether to consider the entire monitoring network (or all municipalities) or only a subset by entering the codes of the stations of interest. These codes are provided through the metadata functions presented in the following sub-sections, which enable the filtering and selection of stations based on user-specific criteria. Air quality measurements are originally provided at an hourly frequency (e.g., NO\(_2\) or O3) or daily frequency (e.g., PM\(_{10}\) and PM\(_{2.5}\)), whereas weather measurements are all provided at a 10 min frequency. The user can customize a query such that the output dataset is aggregated to a specific temporal frequency (i.e., hourly, daily, weekly, monthly, or yearly). The temporal sampling rate is reduced by aggregating the measurements at each monitoring site according to a set of aggregation functions available to users, including centrality, variability, and shape measures. The temporal aggregation operation can also be applied to a generic object of the class ARPALdf through the Time_aggregate command. Examples of aggregation are provided in Sects. 5.1 and 5.2, as well as in Supplementary Material 1.

Download functions return as output an object of the classes data.frame and ARPALdf fully compatible with the Tidyverse (Wickham et al. 2019) and the graphical environment ggplot2 (Wickham 2016). Recall that each sensor measures one and only one meteorological or pollutant parameter and that a one-to-many relationship exists between stations and sensors. In addition, multiple pollutants or weather parameters can be measured at each station. The default output object is a long-shaped (or two-way panel) data.frame with dimensions \((N \times T)\times (K+3)\), where T is the number of time stamps, N is the number of stations (or municipalities), and K the number of available measurements (airborne pollutants or weather variables). The first three columns are the time stamp (Date), the station unique identification number (IDStation) and the station name (NameStation), respectively. Notice that the two-way panel takes on an exactly rectangular shape as for each station the same number of time stamps is reported. Thus, if a parameter is not measured at a specific station, a sequence of missing values is reported for all the time stamps. These missing values should not be subject to any imputation analysis because they belong to locations not considered in the sampling plan of ARPA Lombardia. The user can also request the dataset in the three-way panel format with respect to the sensors, where each row corresponds to the measurement collected from a given sensor installed at a given station at a given time stamp. Let us denote as \(S_i\) the number of sensors installed at a generic station \(i=1,\ldots ,N\). Then, the number of rows in the three-way panel is \(\left( \sum _{i=1}^{N}{S_{i}}\right) \times T\), and the number of columns is six, that is Date, IDStation, NameStation, IDSensor, Variable (i.e., meteorological parameters or pollutants) and Value.

The user can easily retrieve the metadata (or registry) on the air quality and weather ground monitoring networks owned by ARPA Lombardia and Lombardy municipalities by using the get_ARPA_Lombardia_*_registry functions. Similar to the data commands, the AQ_registry,  W_registry and AQ_municipal_registry, return explanatory metadata for air quality monitoring stations, the weather monitoring stations and for the municipalities, respectively. Available metadata include the unique identification numbers of sensors and stations, station names, descriptions of the pollutant or weather parameters associated with each sensor, geographical coordinates (latitude, longitude and altitude), and administrative information, such as province, city, and municipality. Metadata may evolve over time as a result of updates performed by the regional agency on the stations. However, the software preserves the entire history of the monitoring networks as it contains information on currently active (as of the date of the request) and disabled sensors. Specifically, for all sensors, the date of initial operation is available, whereas for disabled sensors, the date of deactivation is also given. For the specific case of air quality stations, metadata additionally provide information on the ARPA zone and station type classification. The stations type classification arranges the sites according to the EEA stations type taxonomy (European Environmental Agency 2023), that is the stations are classified according to the urbanisation degree (physical distribution/density of buildings) of the surrounding area (Urban, Suburban, and Rural) and by predominant emission sources (Background, Traffic, and Industrial). On the other hand, the ARPA zone classification (ARPA Lombardia 2023d) has been defined by the Lombardy Region authorities through the Regional Decree 2605/2011 (Giunta regionale della Lombardia 2011), and distinguishes the regional territory into homogenous areas or agglomerations. Zoning reflects the main morphological characteristics of regional territory and is used to evaluate local compliance with national and international environmental objectives and limiting values. The Lombardy region is classified into three large urbanized areas (metropolitan areas of Milano, Bergamo and Brescia), urbanized areas in rural contexts (namely, the urbanized plains), rural areas (namely, the rural plains), mountainous areas, and valley bottoms. The user can retrieve the exact boundaries of the ARPA zones through the get_ARPA_Lombardia_zoning function, which returns the geometries (polygonal shape file, sf object) and a map of the ARPA zoning for Lombardy. The right panel of Fig. 2 shows a map of the zones that subdivide the Lombardy region.

4.1.1 Computational issues and parallel download

Regional air quality and weather data are retrieved from the API Socrata Open Data (SODA), which provides systematic access to the Lombardy Region database, including the capability to filter, query, and aggregate data. ARPALData accesses the Socrata language using the package RSocrata (Devlin et al. 2023). Using appropriate queries, the API allows filtering of the stations of interest and the subperiod of dates to be downloaded.

By default, the software retrieves the measurements serially as follows: given a specific date range as the input, to speed up data access, the interval is split into 12Footnote 1 non-overlapping contiguous sub-intervals. For each sub-interval, an appropriate query is constructed to sequentially examine the database. Finally, the resulting data blocks are stacked into a single data.frame object. However, all the get_ARPA_Lombardia_*_data functions can adopt a parallel computation strategy to improve download speed depending on the computational capacity. Parallel computations are performed through the Futureverse (Bengtsson 2021). When parallel mode is active, the user can declare the number of parallel workers to be activated. By default the number of active workers is equal to half that of the available local cores. In addition, the user can declare the parallel strategy to be used according to the Futureverse syntax, which admits the multisession option (i.e., background sessions on the local machine) or a multicore setting (i.e., forked R processes on the local machine; not supported by Windows and RStudio).

We provide the results of a MonteCarlo simulation experiment designed to assess the computational benefits derived from parallel computation compared to the serial strategy. In particular, we compared the computing time required to download the air quality data for several combinations of the number of stations and months to be retrieved. For each combination, we replicated the download \(N = 50\) times using parallel and serial settings. Randomness is included in the experiment by randomly selecting the stations and the period to be downloaded. At each iteration, the computing times using serial and parallel strategies were calculated from the same set of data to maintain full comparability of the procedures. The code used to reproduce the experiments is presented in Supplementary Material 1. Table 4 reports the average (and standard deviation) computation times under the serial strategy, whereas Table 5 shows the timing under the parallel strategy. By comparing the two tables we can conclude: 1) parallel computation drastically reduces the download time when the number of stations considered is large (average times are reduced by 50% to 65% for 50 stations or for the whole network); 2) regardless the number of stations, it is convenient to use the parallel strategy for periods of at least one year (particularly for a large number of years); 3) when the number of stations of interest and the period of interest are small, the serial strategy is preferable; 4) the computation time appears more affected by the length of the period than by the number of stations (for example, the average times for a single serial month tripled when going from 10 to all stations, while the average times for the number of stations grew substantially); and 5) the variability of the estimated times increases as the number of stations and number of months increases (this could be attributed to the need for a longer connection to the databases via the Socrata API, which is dependent on the accesses and the number of users connected at the time of download).

4.2 Data exploration and graphical analysis using ARPALData

The ARPALData package offers several functions to explore, analyze and represent the air quality and weather measurements previously accessed and processed. The rationale is that the user should be able to verify the statistical properties and the quality of the data of interest immediately after downloading. Indeed, environmental data are often subject to statistical issues, such as outliers or missing data, which compromise their quality and require ad-hoc corrections (Junninen et al. 2004; Gómez-Carracedo et al. 2014; Junger and Ponce de Leon 2015). There are four categories of exploratory tools offered by ARPALData: (1) descriptive statistics and graphs to summarize and depict the measured values globally (over the whole sample) and locally (site-specific or year-specific statistics and graphs); (2) analytical tools and graphs to detect outliers or out-of-scale values (outlier detection analysis); (3) tools describing patterns of missing data to distinguish between cases of missing at random or structural missing caused, for example, by failures in the monitoring network; and (4) correlation analysis between variables at global level (over the whole sample) and local level (correlation for individual monitoring sites).

The above-mentioned tools are available to users through the command ARPALdf_Summary, which can be applied to any object of class ARPALdf. This command provides the user with descriptive statistics summarizing the dataset and plots. This function pursues the goal to give an overview of the statistical quality of the data at hand. The function can be customized by the user by properly setting the arguments, each of which provides information on the specific statistical tools to be employed. Moreover, the functions admit a verbose setting, that screen prints auxiliary information useful for understanding the structure of the data.frame, such as the total number of observations, number of monitoring sites included, number of time stamps and time frequency. By default, ARPALdf_Summary returns for each variable a set of central tendency measures (e.g., mean, median, and quantiles), variability indicators (e.g., standard deviation and range) and descriptive statistics about missing values. However, the user can automatically compute the essential descriptive statistics (indicators of central tendency, variability, shape of distribution, and missing values) for all variables in the dataset by station and year. Furthermore, when more than one numerical variable is present in the dataset, the user can compute Pearson’s linear correlation index among all possible pairs of variables by station and for the overall sample.

When the outlier detection analysis is activated, the function returns an overview of the (potential) outlier values detected using the Hampel Filter (Pearson et al. 2016) and the box-plot rule as follows. Let us denote a sequence of T measurements by \(X_t\) with \(t=1,\ldots ,T\). The Hampel Filter (or identifier) classifies all observations that fall outside the range \(\pm 3 \times MADM\) as outliers, where MADM is the median absolute deviation from the median (i.e., \(1.4826 \times Median(|X_t - Median(X_t)|)\)). Notice that the correction factor 1.4826 adjusts the distance such that, on average, the MADM estimate is equal to the standard deviation for Gaussian data. Conversely, the box-plot rule classifies all the observations above the typical upper value \(X_{0.75} + 1.5 \times IQR\) or below the typical lower value \(X_{0.25} - 1.5 \times IQR\) as outliers, where \(IQR = X_{0.75} - X_{0.25}\) is the interquartile range. Both identifiers use the median as typical value because it is robust to the presence of outliers and the positive skewness that environmental data are most likely to show (Mudelsee and Alkio 2007). In practice, the box-plot rule is often more moderate than the Hampel Filter in identifying outliers. Therefore, their comparison and potential disagreements can serve as a robustness check for users in uncertain situations. Practically, the user is provided with threshold values and the percentage of observations exceeding these values for each variable.

The user can also activate exploratory missing data analysis based on gap length statistics. In particular,  the function ARPALdf_Summary returns information regarding the temporal distribution of missing values (i.e., sequences of missing values) for each variable by station. Because we are dealing with time series data, we are interested in analyzing the gap length, that is, the temporal distance between one observation and the following. In particular, for each variable and station, the function summarizes the pattern of missing values by reporting summary statistics (e.g., average, variability, and quantiles of gap lengths) in the same temporal unit data (measured in hours, days, weeks or months depending on the frequency of the input data). For example, if the input dataset has a daily frequency, the gap statistics will be expressed in terms of number of days. Missing patterns can be detected by analyzing gap statistics. For instance, if the mean value of the gap length is greater than the minimum value at a certain monitoring site, it means that at least one observation in the time series is missing. When other statistics, such as the median, maximum, and percentiles also differ, then the number of missing observations increases. When all statistics are equal to 1, the variable of interest does not present missing values.

The data exploration phase can be enhanced and supplemented using graphical analysis by station and for the overall sample. The ARPALdf_Summary function allows the production of distributional box-plots, histograms, and kernel density plots computed for the full sample. Also, the user can employ the command ARPALdf_Summary_map to represent, on a geo-referenced map, any of the descriptive statistics computed by the ARPALdf_Summary for the available monitoring sites (or for municipalities if municipal data are used). Extensive examples of the usage of ARPALdf_Summary and ARPALdf_Summary_map in several contexts are provided in Sect. 5.

5 Case studies and practical examples

We now present three guided examples in which ARPALData is used to retrieve and analyze air quality and weather data across Lombardy.Footnote 2 For brevity, we comment only on the relevant chunks of code that utilize package-specific commands or routines. The full vignette of the three examples is either available in Supplementary Material 1 or at the GitHub webpage https://github.com/PaoloMaranzano/ARPALData.

5.1 Case study 1: characterizing air quality according to the local context

In the first example, we download and analyze the weekly PM\(_{10}\) concentrations observed across the entire air quality ground monitoring network. We are interested in describing the average and maximum weekly concentrations from 2019 to 2022 according to the local context in which the stations are installed.  Therefore, we characterized PM\(_{10}\) concentrations according to station type and regional zoning, as described in Sect. 4.1.

We begin by downloading the registry (or metadata) concerning the air quality ground monitoring network. We are interested in retrieving information regarding the currently active sites that monitor PM\(_{10}\), and were installed before January 1st, 2019. The full list of stations is filtered based on these criteria.

figure a

By computing the frequency table of the available monitoring sites, we notice that the number of stations is not evenly distributed among station types. In fact, Suburban-Traffic (ST), Suburban-Industrial (SI), Urban-Industrial (UI), and Rural-Industrial (RI) counts are less than three stations each. Therefore, to maintain consistency when aggregating observations, we filter the station list by removing the aforementioned classes of sites. The final number of stations is 59. For stations that comply with the original filters and those concerning station typologies, we collect weekly observations of PM\(_{10}\) from 2019 to 2022. The measurements are downloaded using a parallel setup.

figure b

To exploit information about the local contexts in which the stations are located, we aggregate the weekly average and maximum values by station type and zoning, respectively. We start by aggregating the weekly average PM\(_{10}\) concentrations and computing the lower and upper bounds (minimum and maximum), as well the central value (mean), by station type. The resulting time series are then plotted (Fig. 1) into four sub-panels, each corresponding to a different station type. In each panel, the maximum weekly, average, and minimum concentrations are depicted in red, green, and blue, respectively.

Fig. 1
figure 1

Weekly PM\(_{10}\) minimum (blue), average (green), and maximum (red) concentrations observed from January 2018 to December 2021 by station type, that is Rural-Backgroun (RB), Suburban-Background (SB), Urban-Background (UB), and Urban-Traffic (UT)

The charts show that the average weekly concentrations exhibit different characteristics depending on the type of station. For instance, the weekly average PM\(_{10}\) concentrations observed at Urban-Traffic (UT) stations frequently exceed 75\(\mu g/m^3\) and are compact during summer but more volatile in winter. The vertical distance between the maximum (red series) and the minimum values (blue series) is approximately 20\(\mu g/m^3\) in summer and 40\(\mu g/m^3\) in winter. Similar considerations can be made for Suburban-Background (SB) stations. In contrast, in Rural-Background (RB) stations, average concentrations rarely exceed 75\(\mu g/m^3\), but vary greatly in the winter months (e.g., the vertical distance between the minimum and maximum of the average in January 2020, 2021, and 2022, is approximately 50\(\mu g/m^3\)).

Similar to the process for station types, we aggregate the weekly PM\(_{10}\) maxima by zones (obtained through the command get_ARPA_Lombardia_zoning). The maximal concentrations are aggregated by computing their ranges (from maximum and minimum values) and the average quantities. The aggregated time series are shown in Fig. 2  where the  plot presents seven subpanels in which the weekly minimum (blue), weekly average (green), and weekly maximum concentrations (red) are depicted for each zone.

Fig. 2
figure 2

Weekly PM\(_{10}\) minimum (blue), average (green), and maximum (red) concentrations observed from January 2018 to December 2021 by ARPA zone

The temporal plots suggest that the maximum PM\(_{10}\) values are particularly high in the Milan metropolitan area (the socio-economic center of the region with a high degree of urbanization and vehicles) and in the plains (both urbanized and rural). In the latter case, particulate matter could be attributed to the numerous agricultural activities and livestock farms, which are responsible for the emission of large amounts of secondary particulate matter and ammonia through the spreading of slurry and animal manure (Fassó et al. 2023; Marongiu et al. 2023). The Alps to the north and the valleys had significantly lower maximum values. The other two urban agglomerations (Bergamo and Brescia) exhibit an intermediate situation, where the weekly PM\(_{10}\) maxima are steady and less volatile. Stations located in northern Alpine or pre-Alpine areas register significantly lower maximum values.

5.2 Case study 2: air quality conditions during COVID-19 pandemic at municipal level in Lombardy

In the second example, we use ARPALData to investigate how air quality improved across the region due to the shutdown imposed during the COVID-19 pandemic in spring 2020. Our goal is to compare the average values observed across the 1516 municipalities of Lombardy during the lockdown period (March 8\(^{th}\) to May 18\(^{th}\)) with the concentrations observed in the same subperiods in 2018, 2019, and 2021.

As a first step, we download the daily estimates of NO\(_2\) concentrations at the municipal level from January 2018 to December 2021.

figure c

Then, the full dataset is filtered to retain the values recorded at each municipality during the lockdown subperiod of 2018, 2019, 2020, and 2021 (the output dataset is labelled as Data_spring). The average concentrations during this period are then computed by aggregating the daily values using the function Time_aggregate with argument Frequency = "yearly".

figure d

The average NO\(_2\) concentrations for the available municipalities and years are graphically represented on a map by employing the ARPALdf_Summary_map command. Each map adoptes a color scale ranging from green (low concentration), to yellow (medium concentration), to red (high concentration). To make the color scale uniform throughout the maps, we set a common central value by imposing a mid point equal to the regional average concentration over the period of interest in 2018 (i.e., val_midpoint = mid_conc_2018).

figure e

Figure 3 shows the combined maps of NO\(_2\) concentrations. During spring 2020 the entire region shifted from a yellow-orange color (concentrations above the 2018 average) to greenish colors (concentrations below the 2018 average). In particular, the alpine zone in the north and the Apennines in the south-west experience remarkable improvements. The situation in the highly industrialized and urbanized central belt remains critical. Notably, the four main cities in the region (Milan, Monza, Bergamo, and Brescia) are connected by an orange stripe, perfectly overlapping with the route of the main highway in Northern Italy, that is the A4 Turin-Trieste highway, which is approximately 220 km long and is travelled by about 153 thousand vehicles daily in the Lombardy segment (AISCAT 2017). Among others, supportive findings were found by Fioravanti et al. (2022), who revealed a statistically significant decrease in NO\(_2\) concentrations across the whole Po Valley during March and April 2020 (corresponding to the hard lockdown phase) compared to 2019.

Fig. 3
figure 3

Average NO\(_2\) concentrations observed in 2018, 2019, 2020 and 2021 during the subperiod 8\(^{th}\) March to 18\(^{th}\) May. Data regarding eleven small municipalities are not available (grey unnoticeable areas)

5.3 Example 3: characterization of meteorological phenomena across Lombardy

In the third example, we use ARPALData to retrieve weather data for Lombardy to depict the main characteristics of local meteorology across Lombardy and study the heterogeneity due to the morphology of the territory. In addition, we show the use of data quality check functions provided in ARPALData to assess the actual statistical properties of the available data. In particular, as described in Sect. 4.2, ARPALData contains functions that allow the statistical exploration of a given input dataset, including an analytical tool for outlier detection and identification of recurrent patterns as regards missing values.

We download weather data for the entire ground monitoring network for 2021. Our interest lies in characterizing the monthly measurements of cumulative rainfall, average temperature, and average wind speed and direction.

figure f

After downloading the dataset of interest, we employ the ARPALdf_Summary function to explore the quality of the downloaded data.

figure g
Table 1 Outlier analysis for each variable (Hampel Filter)

The Hampel Filter output (Table 1) shows that at the aggregate level, all the considered meteorological parameters are mainly affected by slight positive skewness and that potential outliers are abundant. For example, considering wind speed, almost 3% of the observations (84 values) are above the upper bound (column Obs_above_upp_count), whereas there are no values below the lower bound. Similar considerations can be applied to rainfall. However, we must consider that the upper bound of the wind speed is just above 1.42 m/s, which is moderate. The table also suggests that three potential negative outlier measurements (column Obs_below_low_count) could be detected for the average temperature. In other words, there were three months in 2021 at specific locations where the recorded temperatures were particularly lower than those of the rest of the sample.

Table 2 Gap length (sequences of missing data) analysis for wind speed measurements at five stations

Moving to the output of the gap length (missing values) analysis by station (summ_data$Gap_length), no station present missing values, as all gap statistics are always equal to 1 month. In Table 2, we report the gap statistics for five stations taken as an example.

As described in Sect. 4.2, descriptive statistics at the station or municipal levels, as well as time-specific measurements, can be graphically represented through maps using the ARPALdf_Summary_map function. In Example 2, we used such command, that depicts the average observed NO\(_2\) values at the municipal level for several years. Similarly, the average temperatures observed at each monitoring station  are  mapped. This can be easily done by using as argument of ARPALdf_Summary_map, the site-specific descriptive statistics computed through ARPALdf_Summary. To show how the temperature is strongly affected by seasonality, geography, and local conditions, we represent the average temperature across Lombardy in two non-overlapping sub-periods, that is October to March 2021 (winter) and April to September 2021 (summer). Starting from the full dataset, we split the sample into the winter and summer subperiods  and compute summary descriptive statistics by station for both subsets (namely, the object summ_data_spring_summer for the summer subperiod). Eventually, we plot the average seasonal temperature for each station using the ARPALdf_Summary_map function by setting as Data argument the Descr_by_IDStat$Mean_by_stat table contained in the output of ARPALdf_Summary. To allow for the maps comparability, we set up a common reference scale for the colors by imposing an average color equal to the yearly grand mean across the region. The two maps are shown in Fig. 4.

figure h
Fig. 4
figure 4

Average temperature observed in the subperiod April -  September 2021 for each station (left panel) and average temperature observed in the subperiod October -  March 2021 for each station (right panel). The central value of the colors scales is the regional grand mean temperature recorded in 2021

From the left map, the average summer temperature recorded in the Alps is significantly lower than that in the Po Valley in the south, whereas the thermal excursion may depend on the yearly subperiod. In fact, the summer range is approximately 30\(^\circ\), while the winter range is from -10\(^\circ\) to 5\(^\circ\).

Gap length and outlier statistics by station can be summarized well either in tabular form or on a map. However, linear correlations are suitable for graphical representation and mapping, particularly when the number of stations is large. In particular, we propose to represent  this on a map of average site-specific linear correlations between the observed average temperature and rainfall. This plot can be obtained by employing the ARPALdf_Summary_map filled with the argument Data = summ_data_spring_summer$Cor_matrix, where Cor_matrix is the linear correlation matrix computed using the ARPALdf_Summary function on the subsample. By repeating this operation on the winter and summer subsamples, we compare the dynamics of Pearson’s linear correlation index across the climatic seasons. The correlation maps for the two sub-samples are shown in Fig. 5.

figure i
Fig. 5
figure 5

Linear correlation among maximum temperature and rainfall observed in the subperiod April  -  September 2021 for each station (left panel) and linear correlation among average temperature and rainfall observed in the subperiod October - March 2021 for each station (right panel). Green points are associated with negative values of linear correlations, while orange points are associated with positive values. Yellow points mean uncorrelated data

The maps show how the correlation between rainfall and temperature strongly depends on the morphology and season. In particular, the mountainous areas (Alps) in the north exhibit strong positive correlations in both summer and winter (red or orange dots), while the Po Valley in the south shows strong negative correlations (green points). In the former case, the positive correlation might indicate that, at higher altitudes, lower temperatures are accompanied by low rainfall during summer. In contrast, plains are characterized by a scarcity of rainfall and high temperatures, particularly during summer. The correlations shown in these two maps agree with the negative monthly rainfall anomalies that characterized the year 2022 in the entire Lombardy region, as well as the trends in the maximum cumulative rainfall values across the 24-hours day. An extensive discussion of rainfall and temperature anomalies registered across Lombardy during the considered period can be found in the report "Climate, natural risks and availability water supply in Lombardy in 2022" (ARPA Lombardia 2023a). In particular, the report states that the climate in 2022 was classified as hot-dry with negative rainfall anomalies distributed for most of the year in the plain areas, whereas the Alpine and pre-Alpine areas experienced positive anomalies only in the first autumn months.

6 Software availability and future developments

The ARPALData software is freely available on the CRAN of R at https://cran.r-project.org/web/packages/ARPALData/index.html since July 2021. The currently available release is 1.5.0 (September 2023). ARPALData runs on Microsoft Windows, Linux and Apple Mac machines. No special hardware other than a standard desktop computer is required to run it. Future software releases will potentially include station or sensor selection using coordinates (i.e., quadrants, single coordinates, and NUTS levels); interactive and dynamic maps; and spatial interpolation (e.g., IDW) for an entire region or a user-defined area. To extend the discussion on the need to develop a unified platform at the national level, since 2023, part of the ARPALData team has been involved in the development of complementary software to retrieve air quality data at the European level, namely EEAaq (Tassan Mazzocco et al. 2023), from the open database of the European Environment Agency.