Generalized linear modelling based monitoring methods for air quality surveillance

Rising industrial pollution, exacerbated by climate change, underscores the need for effective environmental monitoring. Leveraging sensor advancements and Birnbaum-Saunders distribution, this study introduces a novel surveillance method for environmental data, crucial for shaping impactful industrial policies. Simulation studies demonstrate the method ’ s performance, and a case study on nitrogen oxide levels in Italy validates its efficacy in the early detection of severe air pollution events.


Introduction
Nowadays, pollutants like sulphur dioxide, nitrogen oxides, carbon monoxide, and tropospheric ozone from fossil fuel combustion worsen air quality, affecting human health.Establishing a surveillance mechanism is essential to detect sudden changes in nitrogen oxides (NOx) content, emitted mainly by coal plants and vehicles, causing respiratory issues and lungs damage.Real-time control charts aid in swiftly detecting abrupt changes in air quality, emphasizing the growing importance of environmental monitoring.Several control charts have been proposed in the literature to observe environmental characteristics.Anderson and Thompson (2004) studied distance-based multivariate control charts for environmental surveillance.Barratt et al. (2007) adapted the cumulative sum (CUSUM) control chart to monitor the changes in carbon monoxide concentrations, where data was collected from Central London.Morrison (2008) explained how control charts could be employed to inspect environmental data.Gove et al. (2013) examined water supply in Southwestern Australia using X-bar and CUSUM charts.Sancho et al. (2014) used a functional data-based Shewhart control chart with run rules to monitor the air quality of urban areas.Based on an attribute control chart for the number of harmful contamination levels, Leiva et al. (2015) developed a threshold for environmental monitoring.Paroissin et al. (2016) designed novel control charts to monitor a French river's dissolved oxygen concentration (DOC).The use of univariate and multivariate control charts to assess the water quality of the South Morava River was described by Đorđević et al. (2016).Marchant et al. (2018) presented a robust strategy employing multivariate control charts to examine the air quality of Chile.Rodríguez-Álvarez et al. (2021) proposed control charts for paper moisture content variability.Marchant et al. (2019) proposed bivariate control charts to monitor PM pollution in Santiago, and Qiao et al. (2021) designed a data quality control method for air quality monitoring in Chinese cities.
In literature, many positive asymmetry or skewness probability models are often fitted to describe air contaminant concentrations.For example, log-normal, gamma, beta, inverse Gaussian, exponential, extreme values, log-logistic, Johnson bounded system, Weibull, and Pearson distributions.However, many environmentalists preferred the log-normal model instead of others for fitting the air contaminant concentrations due to its close relation to Gaussian distribution and its physical explanation in terms of the law of proportionate effect (LPE) (Ott, 1990).Nevertheless, to rationalize the log-normal as a life distribution, Desmond (1985) proved the inaccuracy of using Cramér's biological model linked to the LPE.Recently, the Birnbaum-Saunders (BS) distribution has been getting more attention due to its close relation to the normal distribution.Moreover, Leiva et al. (2015) proved that rather than log-normal distribution, the BS distribution is an appropriate model for modelling concentrations of pollutants due to its accuracy in using Cramér's biological model.For the mathematical derivation of Cramér's biological model with respect to BS distribution, we may refer to Leiva et al. (2015).Hence, BS distribution is considered for developing the surveillance methods in this study.
the BS distribution is used to fit material fatigue data.Leiva et al. (2015) provided the environmental application of BS distribution with mathematical reasoning.It provides a theoretical rationale for the case study reported in this work and is an excellent fit for the data.Santos-Neto et al. (2012) suggested a Reparametrized Birnbaum-Saunders (RBS) distribution, which is closely related to the normal and asymmetrical distributions such as Gamma and Log-normal distributions.Many monitoring studies are designed based on the BS distribution.For example, a bootstrap control chart for BS distribution was proposed by Lio and Park (2008).Leiva et al. (2011) proposed control charts based on BS distribution to examine lifetimes.Saulo et al. (2015) developed an Xbar chart based on BS distribution.Aslam et al. (2016) presented the attribute control charts under repetitive sampling when a product's lifetime follows a BS distribution.Marchant et al. (2018) proposed robust multivariate control charts where subgroups follow a generalized Birnbaum-Saunders distribution.Khan et al. (2018) proposed a chart based on BS distribution under accelerated hybrid censoring.Marchant et al. (2019) presented bivariate control charts designed for BSdistributed data.Bourguignon et al. (2020) presented control charts to monitor the median parameter of BS distribution.
All the above studies are designed to monitor the BS-distributed response variable.However, linearly related covariates are recorded in many practical situations with the BS-distributed response variable.For example, the temperature is also measured with the average NOx concentration, and they mostly possess a linear relationship (Jayamurugan et al., 2013).Hence, it is more practical to design control charts based on the BS regression model, which can provide the platform to monitor the BS distributed response variable by keeping the linear relation of the covariate(s).
For BS distribution, regression models are based on the requirement that the original dependent variable be changed to a logarithmic scale, which may reduce the power of the study and make interpretations more challenging.To overcome these issues, Santos-Neto et al. (2012) suggested a Reparametrized Birnbaum-Saunders (RBS) distribution, which is closely related to the normal and asymmetrical distributions such as Gamma and Log-normal distributions.The regression models frequently focus on mean response and its initial scale; therefore, employing RBS distribution, the mean can be modelled with no changes like generalized linear models, but the distribution does not belong to the exponential family.Thus, a link function connects the mean response to the linear predictor, which includes the regressors and unknown parameters.Furthermore, the RBS model may also describe data with non-constant variance.With this motivation, this study proposes Shewhart control charts based on standardized and deviance residuals of the Reparametrized Birnbaum-Saunders (RBS) regression model to monitor the positive asymmetric data.Further, a simulation study is designed to assess the proposed charts' performance in terms of run length characteristics.Finally, the proposed charts are implemented on the air quality data to highlight the importance of proposed methods in environmental monitoring.
The rest of the article is presented as follows: the background of the RBS regression model is described in Section 2.Then, the Shewhart control charts are derived in Section 3. In Section 4, the results and design of the simulation study are presented.Further, Section 5 consists of implementing the proposed charts on air quality data.Finally, Section 6 is designed to conclude the findings of this research.

The Reparametrized Birnbaum-Saunders (RBS) regression model
Let Y be a positive random variable that follows the RBS distribution proposed by Santos-Neto et al. (2012) with parameters scale/mean (μ; μ > 0) and shape/precision (δ; δ > 0).Using this parameterization, the probability density function of Y is given as: . (1) The expected value and variance of Y are formulated as E(Y) = μ and Var(Y) = μ 2 (2δ +5)/(δ + 1) 2 , respectively.It is interesting to note that the variance of Y is a function of the mean.Thus, the RBS distribution, like the Gamma distribution, allows us to model heteroscedasticity.Leiva et al. (2014) developed a new approach for BS modelling by using this reparameterization.They treated Y 1 , Y 2 , ⋯, Y n as independently RBS distributed random variables during the estimation process and designed a GLM-type statistical model as , where x i T β is a linear predictor; g − 1 () indicates the inverse of the link function, which must be positive, completely monotone, and at least twice differentiable; β = β 0 , β 1 , ⋯, β k denotes the vector of unknown parameters; and T β belongs to the i-th row vector of the design matrix X.The parameters of the established model are estimated by the maximum likelihood (ML) method.
For a vector of parameters θ = ( β T , δ ) T , the log-likelihood function is given as: Assuming δ as known, the deviance residuals (DRs) for the RBS regression model are defined as: where, μi represents the ML estimate of μ i and sign gives the signal of the difference y i − μi .Further, the standardized residuals (SRs) are derived from the difference y i − μi and expressed as: where δ is fixed and μi denotes the ML estimate of μ i .Further, the DR and SR of the RBS regression model are used to develop proposed control charts.

The Shewhart control charts for asymmetric data
In this section, we first introduce the RBS data-based control chart and then provide the structure of control charts based on deviance and standardized residuals of the RBS regression model (given in Equations 3-4).Moreover, the implementation of the proposed (DR-RBS and SR-RBS) charts, along with the existing (Y-RBS) chart is presented in Fig. 1.

RBS data-based control chart (Y-RBS)
The Y-RBS control chart monitors RBS distributed response variable samples, incorporating covariates, unlike the RBS data-based control chart that ignores them.The chart plots RBS samples against control limits determined by the following expression: where, L Y1 and L Y2 are constants that specify the width of the control limits and are set according to a fixed in-control average run length (ARL 0 ).The Y-RBS chart indicates an out-of-control (OOC) state when any sample falls outside the limits; alternatively, the process reveals incontrol (IC).

RBS deviance residuals based control chart (DR-RBS)
The deviance residuals of the RBS regression model are expressed in Equation ( 3) and plotted against the control limits in the DR-RBS control chart.The control limits of the DR-RBS chart are calculated as follows:  where, L DR1 and L DR2 represents control limit constants and determined in relation to specified ARL 0 .The E(DR) and Var(DR) are the mean and variance of the deviance residuals, respectively.The DR-RBS control chart signals an OOC condition when any value of DR lies beyond the limits; otherwise, the system is considered in an IC condition.

RBS standardized residuals-based control chart (SR-RBS)
In the SR-RBS control chart, the standardized residuals stated in Equation ( 4) are plotted against the control limits, which are derived using the following formulas: where, L SR1 and L SR2 are control limit constants, and their values are selected in accordance with the defined ARL 0 .When any value of SR exceeds the limits, the system is declared OOC; else, the system is considered IC.Fig. 1 shows the implementation of DR-RBS, SR-RBS, and Y-RBS charts.Y-RBS is used when the dataset only contains the RBS distributed main variable.For DR-RBS and SR-RBS, fit the RBS regression model, estimate DRs and SRs using Equations ( 3) and ( 4), and calculate mean, standard deviation, and standardized residuals.Compute control limits using charting constants from Table 1 at a fixed ARL 0 .Plot residuals against control limits for DR-RBS and SR-RBS, and response (Y) values against Y-RBS control limits.If plotting statistics exceed boundaries, investigate for signals; otherwise, repeat for each time point data.

Simulation structure and results
For the performance evaluation of the proposed control charts, a Monte Carlo simulation study is designed.We generate the RBS response variable as Y i ∼ RBS(μ i , δ), where, μ i = exp(β 0 +β 1 x i ); i = 1, 2, ⋯, n rep- resents the mean function, and δ is the shape parameter.By following Leiva et al. (2014), the parameters' values are considered as β 0 = 0.2, β 1 = 0.5 and δ = 2. Also, the sample size is set at 1000.We assume that the values of the covariate (X) come from a uniform distribution with an interval (0, 1).The simulations are carried out on a large scale with 10 4 repetitions.In the RBS regression model, μ i and δ are the basic parameters, and the primary objective is to spot an increasing shift in μ i at a fixed δ.Thus, we assess the performance of control charts by applying different direct and indirect shifts in μ i .The shifts are as follows: a. Indirect shift in μ i with respect to β 0 as β 0 + δ. b.Indirect shift in μ i with respect to β 1 as β 1 X + δ. c.Direct shift in μ i as μ i + δ.
Furthermore, the capacity to detect shifts in control charts is assessed using average run length (ARL), as recommended by several prior works, e.g.(Iqbal et al., 2022a;Iqbal et al., 2022b;Mahmood, 2020;Mahmood & Erem, 2023;Mahmood et al., 2022).ARL represents the average number of points before an alarm.ARL 0 represents an in-control ARL, while ARL 1 indicates an out-of-control ARL.A chart is deemed superior if for a fixed ARL 0 , the ARL 1 values are minimum.The standard deviation of run length (SDRL) specifies the dispersion of a run length, whereas the MDRL reveals the median run length.

Algorithm for determining the charting constants
The following strategy is used to determine the control limit coefficients L Y1 , L Y2 , L DR1 , L DR2 , L SR1 , and L SR2 at fixed ARL 0 .a. Begin by creating a sample of size n using the simulated RBS model, as mentioned above.b.Fit the RBS regression model to the generated data and estimate the DRs and SRs shown in Equations ( 3) and (4).c.Calculate the mean, standard deviation, and standardized residuals for the DR-RBS and SR-RBS control charts.The mean and standard error of the response variable (Y) in the Y-RBS control chart.d.Calculate the control limit(s) of the control chart using the estimates from step (c) and a random value as the charting constant.e. Plot the residuals of DR-RBS and SR-RBS control charts against their respective control limits.Plot the Y-RBS control chart's response (Y) values against the control limit.f.To reach the specified ARL 0 , repeat steps (a-e) many times.g.If the needed ARL 0 is not reached, return to the starting location, adjust the previous random value, and repeat steps (a-f) until it is obtained.
Using this approach, charting constants are determined for each chart against multiple ARL 0 options (e.g., 200, 370, and 500).Table 1 shows the obtained control charting constants.units, respectively.To summarize, the DR-RBS control chart outperforms the SR-RBS and Y-RBS control charts for all three types of shifts.

Illustrative Example: Air quality data
The simulation revealed higher detection ability in the proposed model-based (DR-RBS and SR-RBS) charts compared to the existing data-   2009), who utilized a multisensor system weighing less than 2.5 kg.This system included a relative humidity sensor, a solid-state temperature sensor, and five unique metal oxide chemo-resistive sensors, collecting data with an 8-second sampling time and a memory capacity of up to 72 h.Genuine NOx concentrations were measured on-site using a conventional analyzer as used in this study.
As previously discussed in Section 1, in order to monitor air pollution, a surveillance system must be designed to identify a sudden shift in NOx concentration while also taking temperature into account.The website (https://archive.ics.uci.edu/ml/datasets/air+quality)provides actual hourly averaged nitrogen oxides (NOx) concentration (Y, in ppb) and temperature (X, in • C).The collection contains 9358 instances of hourly averaged answers from a metal oxide chemical sensor embedded in an air quality chemical device.The device was located on a field at street level in a very polluted area of an Italian city.
For the implementation of the proposed charts, two datasets were first retrieved, each of 950 values.The dataset with an RBS distribution is considered in-control (IC), whilst the other is considered out-ofcontrol (OOC).The descriptive statistics for nitrogen oxides (NOx) concentration in the IC dataset show that: (i) the minimum and maximum values are 2.0 and 396.0, respectively; (ii) the mean and standard deviation are 119.2095 and 79.7717, respectively; and (iii) the coefficient of variation with skewness and kurtosis are 0.6692, 1.2529, and 4.2629.These descriptive statistics show that the NOx data have a positive skew empirical distribution with a somewhat higher kurtosis than a normal (or Gaussian) distribution.The histogram in Fig. 2 illustrates these data aspects by approximating the probability density function of nitrogen oxide concentration.The QQ Plot is a popular approach for determining sample data's goodness-of-fit to a theoretical distribution.It enables the user to compare an empirical quantile function (represented by all sample points) to a theoretical model (represented by a 45 • slope line).All of the data points are then compared to a straight line.If the line closely fits the point, the distribution is said to be best suited.Fig. 3 shows a QQ plot with an envelope to evaluate the model's distributional assumption.Because this plot does not show odd features, the assumption that the response variable follows an RBS distribution is validated.Further, to analyze the model fitting, we have used Anderson-Darling and Cramervon Mises goodness-of-fit tests (Barros et al., 2014).The RBS distribution is the best-fitted distribution, according to statistic A = 0.4948 with p − value = 0.2145 and statistic W = 0.0571 with p − value = 0.4132.

Conclusion and future recommendations
There are many real-world datasets that demonstrate positive skew behavior.For such data, symmetric distributions with support throughout the complete set of real numbers are unsuitable.To monitor    positive skew data, we offer new Shewhart control charts based on the RBS regression model's residuals (SR and DR).Furthermore, a simulation study was carried out to evaluate the performance of RBS modelbased control charts to the RBS data-based scheme.The findings demonstrated that RBS model-based schemes outperformed RBS databased schemes.Furthermore, in RBS model-based schemes, the control chart, which is based on the RBS regression model's deviance residuals, is more sensitive to growing mean shifts.In conclusion, our results and application to nitrogen oxides (NOx) data offer an effective real-time monitoring tool for analyzing environmental systems.This example demonstrates the significance of the new technique in recognizing instances of severe urban environmental pollution, allowing us to avoid harmful implications for the population's health in Italy.The proposed approach is recommended for environmentalists and other administrators who wish to monitor the alarming incidence of air pollution in realtime, which is critical for human safety.
The current study lacks consideration of time series components, a key aspect that should be explored in future research.It focuses solely on nitrogen oxide (NOx) for air quality assessment, overlooking other vital contaminants like particulate matter, ozone, carbon monoxide, and sulfur oxides.Potential areas for further investigation include the impact of parameter estimation in the RBS regression model, assumptions regarding covariates in mean calculation, and the use of RBS regression modeling for skewed data detection.

Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Fig. 1 .
Fig. 1.Flow chart on implementation of existing Y-RBS chart and proposed DR-RBS and SR-RBS charts.
, resulting in the following values: L Y1 = 1.29,L Y2 = 12.7, L SR1 = 1.329,L SR2 = 3.69, L DR1 = 3.0 and L DR2 = 2.20.Figs.3-4 illustrate the implementation and presentation of the Y-RBS, SR-RBS, and DR-RBS charts.The points beneath the pink window correspond to the IC state, whilst the points under the white window indicate the OOC state.Plotting data are indicated in blue for accuracy, whereas OOC points are emphasized in red.It is noted that the Y-RBS chart (Fig. 4) discovered 104 OOC signals, whereas the SR-RBS (Fig. 5) and DR-RBS (Fig. 6) charts detected 107 and 110 OOC signals.As a consequence, this provides unambiguous proof that the simulated and illustrated example findings are identical: the DR-RBS chart has more detection power than the Y-RBS and SR-RBS charts.

Table 1
Control charting constants of each chart with respect to ARL's.

Table 2
ARL profile of the charts under an indirect shift in μ with respect to β 0 as β 0 + δ

Table 2
shows the results of the first indirect shift in μ relative to β 0 as β 0 +δ across all charts.Model-based charts outperform data-based charts in detecting indirect mean shifts caused by β 0 changes.For example, when δ = 0.7, the ARL 1 for Y-RBS chart is reported as 30.35.This is greater than the ARL 1 ′ s of 25.66 and 31.90 for DR-RBS and SR-RBS charts, respectively, at fixed ARL 0 = 200.The shift leads ARL 1 ARL 1 is 136.99,whereas the SR-RBS chart's ARL 1 is 148.97 at ARL 0 = 200.For DR-RBS and SR-RBS control charts, the depreciated ARL 1 ′ s are reported at roughly 234.73 and 255.98, respectively, at the given ARL 0 = 370.Similarly, with ARL 0 = 500, the shift may result in a decrease of 201.53 and 150.27 units in ARL 1 ′ s of DR-RBS and SR-RBS control charts.

Table 3
summarizes the indirect change in μ with respect to β 1 as β 1 X +δ across all charts.The results indicate that residuals-based procedures outperform data-based methods in detecting indirect changes in the mean induced by varying β 1 .At δ = 0.6, the DR-RBS, SR-RBS, and Y-RBS charts show ARL 1 ′ s around 34.15, 43.45, and 41.56, respectively, with stated ARL 0 = 200.While the ARL 1 ′ s of DR-RBS, SR-RBS, and Y-RBS charts are seen as53.33,53.64,and 57.89, respectively,for ARL 0 = 370.The shift may reduce the values of ARL 1 ′ s for DR-RBS, SR-RBS, and Y-RBS charts by 438.55, 432.46, and 423.23 units, respectively, at ARL 0 = 500.Furthermore, the results show that the deviance residuals-based process performs more effectively than the standardized residualsbased method.For example, when δ = 0.1, the ARL 1 ′ s are reported around 179.99 and 180.37 for DR-RBS and SR-RBS control charts, respectively, at ARL 0 = 200.For DR-RBS and SR-RBS control charts, ARL 1 ′ s are reported around 314.32 and 339.65, respectively, with ARL 0 = 370.At ARL 0 = 500, DR-RBS and SR-RBS control charts show a drop of 94.41 and 37.24, respectively, for the same shift.

Table 4
ARL 0 = 200.For ARL 0 = 370, the ARL 1 ′ s of the DR-RBS, SR-RBS, and Y-RBS control charts are 178.13,185.80, and 220.74, respectively.The shift may have caused a drop of approximately 276.25, 241.45, and 193.55 units in the ARL 1 ′ s of DR-RBS, SR-RBS, and Y-RBS control charts, respectively, at ARL 0 = 500.Furthermore, it can be seen that the DR-RBS control chart performs better than the SR-RBS chart.For shift δ shows the findings for direct shifts in μ as μ + δ.Again, techniques relying on RBS model residuals outperform the RBS databased approach in detecting direct alterations in the mean.For example, when δ = 0.5, the DR-RBS, SR-RBS, and Y-RBS control charts constitute ARL 1 ′ s around103.46,116.70, and 134.07, respectively, at

Table 3
ARL profile of the charts under an indirect shift in μ with respect to β 1 as β 1 X + δ

Table 4
ARL profile of the charts under a direct shift in μ as μ + δ RBS) charts, with the DR-RBS outperforming SR-RBS and Y-RBS across various shift types.To assess real-world performance, we conducted a case study using air quality data from De Vito et al. (