Methodology to Assess Air Pollution Impact on Human Health Using the Generalized Linear Model with Poisson Regression


Introduction
The growth of urban areas has increased access to many facilities such as transportation, energy, education and water supply. As a consequence, vehicular and industrial growth, combined with unfavorable meteorological conditions, caused several episodes of excessive air pollution around the world, with loss of life and damage to health. Well-known examples are the Donora disaster in October 1948 and the fog episodes that occurred in London in December 1952 (Lipfert, 1993).
Since then, researchers have been concerned about the impact of air pollutants. Many epidemiological studies have shown that air pollution affects human health, especially respiratory and cardiovascular diseases, even where pollutant concentrations are below the air quality standards (Braga et al., 1999; Braga et al., 2001; Burnett et al., 1998; Ibald-Mulli et al., 2004; Peng et al., 2006; Peters et al., 2001; Pope III et al., 2002; Samet et al., 2000a, 2000b). Evaluating the impact of air pollution on human health is complex because several personal characteristics (age, genetics, social conditions, etc.) influence the response to a given air pollutant concentration. For instance, several studies have shown that higher air pollution concentrations increase the number of respiratory diseases in the elderly and in children (Braga et al., 1999; Braga et al., 2001). These studies show that children are more susceptible because they need twice the amount of air inhaled by adults, and that the elderly are more affected because of their weak immune and respiratory systems and because they have been exposed to a great amount of air pollution throughout their lives. Another characteristic is genetics: studies have shown that people with chronic diseases or allergies, such as bronchitis and asthma, are more sensitive to air pollution.
This chapter presents a summary of the four kinds of studies usually used to assess the impact of air pollution on human health, emphasizing the most widely used one: time series studies. A very useful model in time series studies is the Generalized Linear Model (GLM). The steps to apply the GLM to studies of air pollution impact on human health are then presented in detail, including a case study as an example. The results show that the GLM with Poisson regression fitted the case study database well.
It is relevant to emphasize that the concepts included in this chapter are available in the literature, but the methodology presented to assess air pollution impact on human health employing the GLM with Poisson regression has no precedents.

Statistical methods
To assess air pollution impact on human health, epidemiological studies often use statistical methods that are extremely useful tools to summarize and interpret data.
The health effects (acute or chronic), the type of exposure (short or long term), the nature of the response (binary or continuous) and the data structure determine the model selection and the effects to be estimated. Regression models are generally the method of choice.
The exposure to ambient air pollution varies according to the temporal and/or spatial distribution of pollutants. Most air pollution studies have used measures of ambient air pollution instead of personal exposure because estimating relevant exposures for each person can be daunting. Because of this approximation, "misclassification of exposure is a well-recognized limitation of these studies" (Dominici et al., 2003).
According to Dominici et al. (2003), epidemiological studies of air pollution fall into four types: time series, case-crossover, panel and cohort. Time series, case-crossover and panel studies are more appropriate for estimating acute effects, while cohort studies are used for acute and chronic effects combined.

Case-crossover studies
Case-crossover studies are conducted to estimate the risk of a rare event associated with a short-term exposure. The design was first proposed by Maclure (1991), cited in Dominici et al. (2003), to "study acute transient effects of intermittent exposures". In practice, this design is a modification of the matched case-control design. The difference is that in a case-crossover design each case acts as his/her own control, and the exposure distribution is then compared between case and control times: "exposures are sampled from an individual's time-varying distribution of exposure". In particular, "the exposure at the time just prior to the event (the case or index time) is compared to a set of control or referent times that represent the expected distribution of exposure for non-event follow-up times". In this way, the unique characteristics of each individual, such as gender, age and smoking status, are matched, reducing possible confounding factors (Dominici et al., 2003).
According to Maclure & Mittleman (2000), cited in Dominici et al. (2003), "in the last decade of application, it has been shown that the case-crossover design is best suited to study intermittent exposures inducing immediate and transient risk, and abrupt rare outcomes".

Panel studies
Panel studies collect individual exposures, outcome counts and confounding factors that vary in time and space. Consequently, they include all other epidemiological designs that are based on temporally and/or spatially aggregated data. In practice, panel studies also rely on group-level data.
In panel designs, the goal is to follow a cohort or panel of individuals to investigate possible changes in repeated outcome measures. This design has proved more effective in studies of short-term health effects of air pollution, mainly for a susceptible subgroup of the population. Usually, panel studies involve the collection of repeated health outcome measures for all considered subjects of a susceptible subpopulation over the entire study period. The measure of pollution exposure can come from a fixed-site ambient monitor or from personal monitors (Dominici et al., 2003). Some care should be taken when designing a panel study, because the main goal of estimating the health effect of air pollution exposure can sometimes become less clear. This happens whenever the panel members do not share the same observation period, so the parameterization and estimation of exposure effects need to be considered carefully (Dominici et al., 2003).

Cohort studies
Cohort studies are frequently used to associate long-term exposure to air pollution with health outcomes. Prospective or retrospective designs are possible. The prospective design consists of interviewing the participants at the beginning of the research to collect particular information such as age, sex, education, smoking history, weight, and so on. After that, the participants are followed up over time for mortality or morbidity events. The retrospective design consists of using information from an already available database. Cohort designs are frequently used in multicity studies, as they ensure "sufficient variation in cumulative exposure, particularly when ambient air pollution measurements are used" (Dominici et al., 2003).

Time series studies
Time series impact studies are often used because they require only simple data, such as the number of hospital admissions or deaths on a given day, which are easy to obtain from government health departments. It is therefore unnecessary to follow up the group of people involved in the study, which would demand much time (Schwartz et al., 1996).
Another key advantage of the time series approach is the use of daily data. While the underlying risk in epidemiological studies of air pollution varies with factors such as age distribution and smoking history, these factors do not influence the expected number of deaths or morbidity events on any given day, since they do not vary from day to day (Schwartz et al., 1996).
Regression models are usually chosen in time series studies, as they are useful tools to assess the relationship between one or more explanatory variables (independent variables, predictor variables or covariates) ($x_1, x_2, \dots, x_n$) and a single response variable (dependent or predicted variable) ($y$) (Dominici et al., 2003). The simplest regression analysis with more than one explanatory variable is multiple linear regression, given by:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon$$

where $y$ is the response variable and $x_i$ ($i = 1, 2, \dots, n$) are the explanatory variables. $\beta_0$ represents the value of $y$ when all the explanatory variables are null, the $\beta_i$ terms are called regression coefficients and the residual ($\varepsilon$) is the prediction error (the difference between the measured and adjusted values of the response variable).
The goal of a regression model is to find an expression that best predicts the response variable as a combination of the explanatory variables, that is, to find the $\beta$'s that best fit the database.
Due to the non-linearity of the response variable in time series studies of air pollution impacts on human health, the Generalized Linear Models (GLM) with parametric splines (e.g. natural cubic splines) (McCullagh & Nelder, 1989) and the Generalized Additive Models (GAM) with non-parametric splines (such as smoothing splines or lowess smoothers) (Hastie & Tibshirani, 1990) are usually applied.
In the studies conducted in the last decade, GAM was the most widely applied method, as it allows non-parametric adjustment for non-linear confounding factors such as seasonality, short-term trends and weather variables. It is also a more flexible approach than fully parametric models such as the GLM with parametric splines. Nevertheless, the GAM implementation in statistical software such as S-Plus has recently been called into question (Dominici et al., 2003).
To evaluate the impact of the default implementation of the GAM software on published analyses, Dominici et al. (2002) reanalyzed the National Morbidity, Mortality, and Air Pollution Study (NMMAPS) data (Samet et al., 2000a, 2000b) using three different methods: the GLM (Poisson regression) with natural cubic splines to achieve non-linear adjustments for confounding factors; the GAM with smoothing splines and default convergence parameters; and the GAM with smoothing splines and more stringent convergence parameters than the default settings. The authors found that "estimates obtained under GLMs with natural cubic splines better detect true relative rates than GAMs with smoothing splines and default convergence parameter". The authors also added that "although GAM with nonparametric smoothers provides a more flexible approach for adjusting for nonlinear confounders compared with fully parametric alternatives in time series studies of air pollution and health, the use and implementation of GAMs requires extreme caution".
For this reason, this chapter presents all the steps that should be followed to conduct a time series study using the GLM with Poisson regression, from data collection to the assessment of goodness of fit. More details about design comparisons between time series, case-crossover, panel and cohort studies can be found in Dominici et al. (2003).

Generalized Linear Models (GLM)
The GLMs are a union of linear and non-linear models whose response follows a distribution of the exponential family, which comprises the normal, Poisson, binomial, gamma and inverse normal distributions. They therefore include the traditional linear models as well as logistic models (Nelder & Wedderburn, 1972).

Since 1972, much research on GLMs has been conducted and, as a consequence, several computational tools have been created, such as GLIM (Generalized Linear Interactive Models), S-Plus, R, SAS, STATA and SUDAAN (Dobson & Barnett, 2008; Paula, 2004).
A GLM relates the mean $\mu_i$ of the response variable to the explanatory variables through a link function $g$ and a linear predictor $\eta_i$, with $g(\mu_i) = \eta_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_n x_{in}$, or in matrix form:

$$\eta = X\beta$$

where the regression coefficients $\beta$ represent the vector of parameters to be estimated (McCullagh & Nelder, 1989).
Each distribution has a special link function, called the canonical link function, which occurs when $\theta_i = \eta_i$, where $\theta$ is called the local or canonical parameter. Table 1 shows the canonical link function for some distributions of the exponential family (McCullagh & Nelder, 1989).
According to [...] (2002), using the canonical link function implies some interesting properties, although this does not mean it should always be used. This choice is convenient because, besides simplifying the estimation of the model parameters, it also makes it easier to obtain the confidence interval of the mean of the response variable. However, convenience does not necessarily imply goodness of fit.
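For reference, the canonical links usually quoted for the distributions mentioned above can be written as follows. These are the standard textbook results (see McCullagh & Nelder, 1989), listed here as a reminder and not as a reproduction of the chapter's Table 1; $\mu$ denotes the mean of the response variable.

```latex
\begin{array}{ll}
\text{Distribution} & \text{Canonical link} \\
\hline
\text{Normal}         & \eta = \mu \\
\text{Poisson}        & \eta = \ln \mu \\
\text{Binomial}       & \eta = \ln\!\left(\dfrac{\mu}{1-\mu}\right) \\
\text{Gamma}          & \eta = \mu^{-1} \\
\text{Inverse normal} & \eta = \mu^{-2}
\end{array}
```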

Canonical link function
In studies of air pollution impact on human health with non-negative count data as the response variable, the GLM with Poisson regression is broadly applied (Dockery & Pope III, 1994; Dominici et al., 2002; Lipfert, 1993; Metzger et al., 2004).
The GLM with Poisson regression relates the response variable ($y$) (mortality or morbidity), which can take on only non-negative integer values, to the explanatory variables ($x_1, x_2, \dots, x_n$) (pollutant concentrations, weather variables, etc.) according to:

$$\ln[E(y)] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$

Usually, the regression coefficients ($\beta$'s) are estimated by maximizing the likelihood function (maximum likelihood method) with the Fisher scoring method, which is the same as the Newton-Raphson method when the canonical link function is used. For the Poisson regression, the likelihood density function is given by (Dobson & Barnett, 2008; McCullagh & Nelder, 1989):

$$f(y; \theta, \phi) = \exp\left\{\frac{y\theta - e^{\theta}}{\phi} + c(y, \phi)\right\}$$

where $y$ is the response variable, $\theta$ is the canonical parameter and $\phi$ is the dispersion parameter; for the Poisson distribution itself, $\phi = 1$ and $c(y, \phi) = -\ln y!$. With the canonical link, $\ln \mu = \theta$, where $\mu = E(y)$.
One feature of the GLM with Poisson regression is that, even if all the explanatory variables were known and measured without error, there would still be considerable unexplained variability in the response variable. This is because, even when the expected count is precisely determined, the Poisson process ensures stochastic variability around that expected count. In a classic stationary Poisson regression, the variance is equal to the mean ($\phi = 1$), but many actual count processes show overdispersion, when the variance is greater than the mean, or underdispersion, when it is smaller. In these cases, it is still possible to apply the GLM with Poisson regression (Everitt & Hothorn, 2010; Schwartz et al., 1996). One way to adjust for over- or underdispersion is to assume that the variance is a multiple of the mean and to estimate the dispersion parameter using the quasi-likelihood method. Details of this method are in McCullagh & Nelder (1989).
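The estimation procedure described above can be sketched in a few lines of Python. This is a minimal illustration, assuming a single explanatory variable and the log link; the function names `fit_poisson` and `dispersion` and the data values are hypothetical, and in practice the `glm` routine of R or S-Plus would be used instead.

```python
import math

def fit_poisson(x, y, n_iter=25, tol=1e-10):
    """Fit ln E(y) = b0 + b1*x by Fisher scoring.

    With the canonical (log) link this is the same as Newton-Raphson.
    """
    # Start from the intercept-only solution: b0 = ln(mean(y)), b1 = 0.
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(n_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector: first derivatives of the Poisson log-likelihood.
        u0 = sum(yi - mi for yi, mi in zip(y, mu))
        u1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Fisher information matrix (2 x 2, symmetric).
        i00 = sum(mu)
        i01 = sum(xi * mi for xi, mi in zip(x, mu))
        i11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = i00 * i11 - i01 * i01
        # Solve (information) * (step) = (score) for the update step.
        d0 = (i11 * u0 - i01 * u1) / det
        d1 = (i00 * u1 - i01 * u0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

def dispersion(x, y, b0, b1, n_params=2):
    """Quasi-likelihood style estimate: Pearson chi-squared / residual df."""
    mu = [math.exp(b0 + b1 * xi) for xi in x]
    chi2 = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
    return chi2 / (len(y) - n_params)

# Hypothetical data: pollutant concentration and daily admission counts.
pm10 = [20, 35, 50, 65, 80, 95]
admissions = [12, 15, 14, 19, 22, 24]
b0, b1 = fit_poisson(pm10, admissions)
phi = dispersion(pm10, admissions, b0, b1)
```

A value of `phi` well above 1 would suggest overdispersion, and a value well below 1 underdispersion, relative to the Poisson assumption.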

Steps to fit GLM with Poisson regression
To apply the GLM with Poisson regression, four main steps should be followed: development of the database; adjustment of the temporal trends; goodness of fit analysis; and results analysis. Each step is detailed in the following sections.

Database
Usually, in time series studies of air pollution impact on human health using the GLM with Poisson regression, the data used are air pollutant concentrations, weather measures, outcome counts and some confounding terms. The data must be collected daily and for at least two years, to capture the seasonal trends.
Pollutant concentrations are usually obtained from fixed-site ambient monitors. The weather measures most frequently used are temperature (or dewpoint temperature) and air relative humidity.

The outcome depends on the purpose of the study. For example, some studies use mortality and others use morbidity. The outcome can be stratified by type of disease (such as respiratory or cardiovascular diseases), age (children, young and elderly people) and any other factor of interest. The confounders can be long-term (such as seasonality) or short-term (day of the week, holiday indicator, etc.) and will be shown in Section 4.2.

Temporal trends adjustment
A common feature of epidemiological studies is bias due to confounding factors and correlations among covariates, which can never be completely ruled out in observational data. Confounding is present when a covariate is associated with both the outcome and the exposure of interest but is not a result of the exposure. So, in all epidemiological studies, a basic modeling issue is to control properly for all the potential confounders. Time series studies have some unique features in this regard (Dominici et al., 2003; Peng et al., 2006).
The intercorrelation of different pollutants in the atmosphere is one source of biases. One way to address this intercorrelation "has been to conduct studies in locations where one or more pollutants are absent or nearly so" (Dominici et al., 2003).
The sources of potential confounding in time series studies of air pollution impact on human health can be broadly classified as measured or unmeasured. Measured confounding factors such as weather variables (temperature, dewpoint temperature, humidity and others) are of particular importance in this kind of study. Some studies have demonstrated a relationship between temperature and mortality that is positive for warm summer days and negative for cold winter days, as in Curriero et al. (2002), cited in Peng et al. (2006). One approach to adjust for confounding by temperature or humidity is to include in the model non-linear functions of temperature (or dewpoint temperature), or a mean of its current and previous days' values (Peng et al., 2006). Unmeasured confounding factors are those that influence the outcome counts and vary in time in a similar way to air pollutant concentrations. These confounding factors produce seasonal and long-term trends in outcome counts that can confound their relationship with air pollution. Some important examples are influenza and respiratory infections (Peng et al., 2006).

Seasonality
In time series studies, the primary concern is potential confounding by factors that vary on timescales similar to those of pollution and health outcomes. This attribute is usually called seasonality. "A common approach to adjust this trend is to use semiparametric models which incorporate a smooth function of time". The smooth function serves as a linear filter for the mortality (morbidity) and pollution series and "removes any seasonal or long-term trends in the data" (Peng et al., 2006). Several methods are used to deal with this trend, such as smoothing splines, penalized splines, parametric (natural cubic) splines and, less commonly, LOESS smoothers or harmonic functions (Dominici et al., 2002; Peng et al., 2006; Samet et al., 2000a, 2000b; Schwartz et al., 1996).
The spline function provides an approximation for the behavior of functions that have local and abrupt changes. The spline most used to smooth curves in GLMs is the natural cubic spline (Chapra & Canale, 1987; Samoli et al., 2011; Schwartz et al., 1996); the other ones are usually applied in GAM.
With splines, a polynomial function is fitted to each defined interval instead of a single polynomial for the whole database. The natural cubic spline is based on third-order polynomials derived for each interval between two knots at fixed locations throughout the range of the data (Chapra & Canale, 1987; Peng et al., 2006). The choice of knot locations can have a substantial effect on the resulting smooth. Peng et al. (2006) "provided a comprehensive characterization of model choice and model uncertainty in time series studies of air pollution and mortality, focusing on confounding factors adjustment for seasonal and long-term trends". According to their results, for natural splines the bias drops suddenly between one and four degrees of freedom (df) per year and is stable afterwards, suggesting that at least four degrees of freedom per year of data should be used. Accordingly, time series studies of air pollution and mortality (or morbidity) usually use four to six knots per year, as the seasonal trend is due to the different behavior of the variables during the seasons of the year (Tadano, 2007). The results of Peng et al. (2006) also show that "both fully parametric and nonparametric methods perform well, with neither preferred. A sensitivity analysis from the simulation study indicates that neither the natural spline nor the penalized spline approach produces any systematic bias in the estimates of the log-relative-rate $\beta$".
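To make the idea of a natural cubic spline of time more concrete, the sketch below builds the basis functions in their truncated power form, which are cubic between knots and linear beyond the boundary knots. It is only an illustration under stated assumptions: the knots are placed at evenly spaced days (a hypothetical choice), the helper names are invented for this example, and statistical packages typically use a numerically more stable B-spline form that spans the same space.

```python
def evenly_spaced_knots(n_days, knots_per_year, n_years):
    """Hypothetical knot placement: equally spaced over the study period."""
    n_knots = knots_per_year * n_years
    return [n_days * i / (n_knots - 1) for i in range(n_knots)]

def natural_spline_basis(t, knots):
    """Natural cubic spline basis evaluated at time t.

    Returns one value per knot: a constant, a linear term and K - 2 cubic
    terms that are constrained to be linear beyond the boundary knots.
    """
    last = knots[-1]
    n_knots = len(knots)

    def cube_plus(v):
        # Truncated cubic: v^3 for v > 0, otherwise 0.
        return v ** 3 if v > 0 else 0.0

    def d(k):
        return (cube_plus(t - knots[k]) - cube_plus(t - last)) / (last - knots[k])

    basis = [1.0, float(t)]
    for k in range(n_knots - 2):
        basis.append(d(k) - d(n_knots - 2))
    return basis

# Two years of daily data with five knots per year (hypothetical setting).
knots = evenly_spaced_knots(n_days=731, knots_per_year=5, n_years=2)
row_day_100 = natural_spline_basis(100, knots)
```

Each day of the study period yields one such row, and the rows enter the regression as additional explanatory variables representing the seasonal trend.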
The smooth functions of time account only for potential confounding factors that vary smoothly with time, such as seasonality. Some potential confounders that vary on shorter timescales, such as day of the week and holiday indicator, are also important, as they confound the relationship between air pollution and health outcomes (Peng et al., 2006).

Day of the week and holiday indicator
Important potential confounding factors that may bias time series studies of air pollution and mortality (or morbidity) are those that vary on shorter timescales, such as specific calendar days identified by the day of the week and the holiday indicator (Lipfert, 1993). These trends are not necessarily present, but they occur often enough that they should be checked (Samoli et al., 2011; Schwartz et al., 1996). For example, the number of hospital admissions can be lower on weekends than on weekdays, and can also be lower during holidays.
One way to adjust for the day-of-the-week trend is to add a qualitative explanatory variable for the day of the week (varying from one to seven, starting on Sunday). To adjust for holidays, an additional binary explanatory variable can be included, in which one means holidays and zero means workdays (Tadano et al., 2009).
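The calendar covariates described above could be built as in the following sketch. The function name `calendar_covariates` and the holiday set are hypothetical; coding the qualitative day-of-the-week variable as six 0/1 indicators with Sunday as the reference level is an assumption of this example (it mirrors how statistical software usually expands a qualitative variable), not something stated in the text.

```python
from datetime import date

def calendar_covariates(day, holidays):
    """Return (dow, dow_dummies, holiday) for one calendar day.

    dow runs from 1 (Sunday) to 7 (Saturday).
    dow_dummies holds six 0/1 indicators for Monday..Saturday; Sunday is
    the reference level and is absorbed by the intercept.
    """
    # Python's weekday() gives Monday=0 ... Sunday=6; shift to Sunday=1.
    dow = (day.weekday() + 1) % 7 + 1
    dummies = [1 if dow == level else 0 for level in range(2, 8)]
    holiday = 1 if day in holidays else 0
    return dow, dummies, holiday

# Hypothetical holiday list containing only New Year's Day 2007.
holidays = {date(2007, 1, 1)}
new_year = calendar_covariates(date(2007, 1, 1), holidays)
first_sunday = calendar_covariates(date(2007, 1, 7), holidays)
```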
Adding all the time trends and explanatory variables mentioned above to the GLM with Poisson regression, the expression used in some studies of air pollution impact on population health is as follows (Tadano, 2007):

$$\ln[E(y)] = \beta_0 + \beta_1 T + \beta_2 RH + \beta_3 PC + \beta_4 H + \beta_5\, dow + ns$$

where $y$ = health outcome of interest; $T$ = air temperature or dewpoint temperature (ºC); $RH$ = air relative humidity (%); $PC$ = pollutant concentration (µg/m³); $H$ = time trend variable for holidays; $dow$ = time trend variable for days of the week; $ns$ = natural cubic spline to adjust for seasonality.
Some of these short-term trends can lead to autocorrelation between data from one day to previous days, even after its adjustment. In this regard, partial autocorrelation functions are used.

Partial autocorrelation functions
Short-term trends such as days of the week and holiday indicator can lead to autocorrelation between data from one day and previous days, even after the adjustment. One way to analyze this time trend is to plot the partial autocorrelation function (Partial ACF) against lag days.
The autocorrelation function of the model's residuals is as follows:

$$r_k = \frac{\sum_{t=1}^{n-k} (e_t - \bar{e})(e_{t+k} - \bar{e})}{\sum_{t=1}^{n} (e_t - \bar{e})^2}$$

where $e_t$ denotes the model residual on day $t$ and $\bar{e}$ the mean of the residuals, with $n$ = number of observations and $k$ = lag days (Box et al., 1994). In the partial autocorrelation function plot, the partial autocorrelations of the residuals should be as small as possible, ranging from $-2n^{-1/2}$ to $2n^{-1/2}$ (dashed lines), as shown in Fig. 1. In epidemiological studies of air pollution, the important autocorrelations are those occurring in the first five days, which are usually caused by the decrease of health outcomes on weekends and holidays (Tadano, 2007). If the database has autocorrelations, then the model should account for them by including the residuals in the model.
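As an illustration of this check, the sketch below computes the sample autocorrelation of a residual series, derives the partial autocorrelations with the Durbin-Levinson recursion (one common way to obtain them, chosen here as an assumption) and compares them with the $\pm 2n^{-1/2}$ limits. The residual values are hypothetical.

```python
import math

def autocorr(res, k):
    """Sample autocorrelation of the residuals at lag k."""
    n = len(res)
    mean = sum(res) / n
    num = sum((res[t] - mean) * (res[t + k] - mean) for t in range(n - k))
    den = sum((e - mean) ** 2 for e in res)
    return num / den

def partial_autocorr(res, max_lag):
    """Partial autocorrelations for lags 1..max_lag (Durbin-Levinson)."""
    r = [autocorr(res, k) for k in range(max_lag + 1)]  # r[0] equals 1
    pacf = []
    prev = []  # coefficients phi_{k-1,1}, ..., phi_{k-1,k-1}
    for k in range(1, max_lag + 1):
        if k == 1:
            phi_kk = r[1]
            cur = [phi_kk]
        else:
            num = r[k] - sum(prev[j] * r[k - 1 - j] for j in range(k - 1))
            den = 1.0 - sum(prev[j] * r[j + 1] for j in range(k - 1))
            phi_kk = num / den
            cur = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)]
            cur.append(phi_kk)
        pacf.append(phi_kk)
        prev = cur
    return pacf

# Hypothetical model residuals for 16 consecutive days.
res = [0.8, -0.3, 0.5, -1.1, 0.2, 0.9, -0.6, 0.4,
       -0.2, 1.0, -0.7, 0.1, 0.3, -0.9, 0.6, -0.4]
pacf = partial_autocorr(res, max_lag=5)
bound = 2 / math.sqrt(len(res))
# Lags whose partial autocorrelation falls outside the dashed limits.
flagged = [lag + 1 for lag, v in enumerate(pacf) if abs(v) > bound]
```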
In the R or S-Plus language, the residuals to be included are the working residuals. These residuals are returned when the residuals component is extracted directly from the object produced by the glm command. They are defined as:

$$r_i^{W} = (y_i - \hat{\mu}_i)\,\frac{\partial \eta_i}{\partial \mu_i}$$

which, for the log link of the Poisson regression, reduces to $(y_i - \hat{\mu}_i)/\hat{\mu}_i$.
After fitting the GLM with Poisson regression including all time trends and explanatory variables, the fitted model needs to be tested to ensure that it is the best model to be applied to the database.

Goodness of fit
The GLM with Poisson regression has been widely applied in epidemiological studies of air pollution (Dockery & Pope III, 1994; Dominici et al., 2002; Lipfert, 1993; Metzger et al., 2004; Tadano et al., 2009), but it must be used with caution, as this model may sometimes not fit the database well. Two statistical methods that can be used to evaluate goodness of fit in GLMs are described below.

Pseudo $R^2$
One interesting and easy-to-apply goodness of fit test for the GLM with Poisson regression is the statistic called pseudo $R^2$, which is similar to the coefficient of determination of classic linear models. It is defined as:

$$\text{pseudo } R^2 = \frac{l(b_{\min}) - l(b)}{l(b_{\min})}$$

where $l$ = log-likelihood function; $l(b_{\min})$ = maximum value of the log-likelihood function for a minimal model with the same rate parameter for all $y$'s and no explanatory variables (null model); and $l(b)$ = maximum value of the log-likelihood function for the model with $p$ parameters (complete model) (Dobson & Barnett, 2008).
This statistic measures the deviance reduction due to the inclusion of explanatory variables. It can be obtained in R (R Development Core Team, 2010) through the Anova table with the chi-squared test, in which the residual deviance values reflect the maximum value of the log-likelihood function for the complete model and for the null one.
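The pseudo $R^2$ of Dobson & Barnett (2008), $[l(b_{\min}) - l(b)]/l(b_{\min})$, can also be computed directly from the two log-likelihoods, as in the following sketch. The observed counts and fitted means are hypothetical, and the function names are chosen only for this example.

```python
import math

def poisson_loglik(y, mu):
    """Poisson log-likelihood of counts y given expected values mu."""
    return sum(yi * math.log(mi) - mi - math.lgamma(yi + 1)
               for yi, mi in zip(y, mu))

def pseudo_r2(y, mu_fitted):
    """Pseudo R^2 = [l(b_min) - l(b)] / l(b_min)."""
    mean_y = sum(y) / len(y)
    # Null model: the same rate (the sample mean) for every observation.
    l_min = poisson_loglik(y, [mean_y] * len(y))
    l_full = poisson_loglik(y, mu_fitted)
    return (l_min - l_full) / l_min

# Hypothetical observed counts and fitted means from a complete model.
observed = [12, 15, 14, 19, 22, 24]
fitted = [12.5, 14.2, 16.1, 18.3, 20.8, 23.6]
r2 = pseudo_r2(observed, fitted)
```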
According to Faraway (1999), a good value of $R^2$ depends on the area of application. The author suggests that lower values of $R^2$ are to be expected in the biological and social sciences. Values of 0.6 might be considered good, because in these studies the variables tend to be more weakly correlated and there is a lot of noise. The author also advises that this is a generalization and that "some experience with the particular area is necessary for you to judge your R 2 's well".

Chi-squared statistic
Another statistical test used to assess goodness of fit in the GLM with Poisson regression is the chi-squared ($\chi^2$) or Pearson statistic, which evaluates the model fit by comparing the measured distribution with that obtained by modeling. The chi-squared statistic is given by:

$$\chi^2 = \sum_{i=1}^{n} \frac{(y_i - \hat{\mu}_i)^2}{\hat{\mu}_i}$$

where $y_i$ = measured values of the response variable; $\hat{\mu}_i$ = values adjusted by modeling; and $i = 1, 2, \dots, n$, with $n$ = number of observations.
The chi-squared statistic is the sum of the squared Pearson residuals of the observations. According to this statistic, a model that fits the data well has a chi-squared statistic close to the degrees of freedom ($\chi^2/df \approx 1$), where $df = n - p$ ($n$ = number of observations and $p$ = number of parameters) (Wang et al., 1996). There is no evidence as to which goodness of fit measure (pseudo $R^2$ or $\chi^2$) is preferred.
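A small worked example of the Pearson statistic and of the ratio $\chi^2/df$ is sketched below. The counts, fitted values and number of parameters are hypothetical and serve only to show the arithmetic.

```python
def pearson_chi2(y, mu):
    """Pearson chi-squared: sum of squared Pearson residuals."""
    return sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))

def chi2_per_df(y, mu, n_params):
    """Ratio chi^2 / df with df = n - p; values near 1 indicate a good fit."""
    return pearson_chi2(y, mu) / (len(y) - n_params)

# Hypothetical observed counts and fitted means.
observed = [10, 20, 18, 25]
fitted = [8.0, 25.0, 18.0, 25.0]
chi2 = pearson_chi2(observed, fitted)        # 4/8 + 25/25 = 1.5
ratio = chi2_per_df(observed, fitted, n_params=2)
```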
Once the fit of the GLM with Poisson regression has been confirmed, the results must be analyzed to find the nature of the relationship between air pollutant concentrations and health outcomes.

Results analysis
In epidemiological studies of air pollution, it is common to find a relation between the air pollutant concentration on one day and the health outcomes of the next day, of two days later or even of one week later. Researchers therefore usually fit the model to different arrangements of the same database with lags. In time series studies, lags of one to seven days are frequently applied, and the one that fits best is chosen. One criterion to select the best option is the Akaike Information Criterion.
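The lagged arrangements of the database mentioned above amount to pairing each day's outcome with the pollutant concentration measured some days earlier. A minimal sketch, with a hypothetical helper name and made-up series:

```python
def lagged_pairs(exposure, outcome, lag):
    """Pair the outcome on day t with the exposure on day t - lag."""
    if lag == 0:
        return list(zip(exposure, outcome))
    # Drop the last `lag` exposures and the first `lag` outcomes.
    return list(zip(exposure[:-lag], outcome[lag:]))

# Hypothetical daily series.
pm10 = [40, 55, 38, 61, 47]
admissions = [10, 12, 15, 11, 16]
lag1 = lagged_pairs(pm10, admissions, 1)
```

Each lag shortens the usable series by that many days, so models fitted with different lags are based on slightly different numbers of observations.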

Akaike information criterion
The Akaike Information Criterion (AIC) is very useful when choosing between models fitted to the same database. The smaller the AIC, the better the model. The AIC is automatically calculated by the R software when the GLM algorithm is applied, and is given by:

$$AIC = -2\,l(b) + 2\,(df)\,\hat{\phi}$$

where $l(b)$ = maximum log-likelihood value for the complete model; $df$ = degrees of freedom of the model; and $\hat{\phi}$ = estimated dispersion parameter (Peng et al., 2006).
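The selection among lagged models can be sketched as follows. The form of the criterion used here, $-2\,l(b) + 2\,(df)\,\hat{\phi}$, is an assumption based on the quantities named in the text (with $\hat{\phi} = 1$ it is the usual AIC), and the log-likelihood values are hypothetical.

```python
def aic(loglik, df, phi=1.0):
    """AIC assumed as -2*l(b) + 2*df*phi; phi = 1 gives the usual AIC."""
    return -2.0 * loglik + 2.0 * df * phi

# Hypothetical results: lag -> (maximum log-likelihood, degrees of freedom).
candidates = {0: (-2050.3, 12), 1: (-2047.9, 12), 2: (-2049.1, 12)}
scores = {lag: aic(loglik, df) for lag, (loglik, df) in candidates.items()}
# The model with the smallest AIC is preferred.
best_lag = min(scores, key=scores.get)
```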
After choosing the model that best fits the database and shows the strongest relationship between air pollution and health outcomes, a method to verify the strength of this relationship is applied. One method frequently applied (used in the S-Plus and R software) is the Student t test.

Student t test for statistical significance
In time series studies, any relation between air pollution and health outcomes is confirmed through a hypothesis test that shows whether or not the regression coefficients are statistically significant.
The statistical hypothesis to be tested is the null hypothesis ($H_0$), expressed by an equality. The alternative hypothesis is given by an inequality (Mood et al., 1974).
The hypothesis test has several goals; one of them is to verify whether the estimated regression coefficient can be discredited. In this case, the following hypotheses are considered: $H_0: \beta = 0$ and $H_1: \beta \neq 0$. The test statistic used to verify these hypotheses is given by:

$$t_0 = \frac{\hat{\beta}}{se(\hat{\beta})}$$

where $se(\hat{\beta})$ is the standard error of the estimated regression coefficient ($\hat{\beta}$). The null hypothesis is rejected when $|t_0| > t_{\alpha/2,\,n-k-1}$ ($n$ = number of observations, $k$ = number of explanatory variables, $\alpha$ = significance level considered), indicating that the estimated regression coefficient is statistically significant. In other words, the explanatory variable influences the response variable (Bhattacharyya & Johnson, 1977; Mood et al., 1974).
The values $t_{\alpha/2,\,n-k-1}$ are given in the Student t distribution table, where $n-k-1$ is the number of degrees of freedom ($df$) and $\alpha$ is the significance level considered (Bickel & Doksum, 2000).
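The test can be sketched as below. Since the Python standard library has no Student t quantile function, this example replaces the tabulated value $t_{\alpha/2,\,n-k-1}$ with the standard normal quantile, an approximation that is only reasonable when the number of degrees of freedom is large (for instance, two years of daily data). The coefficient and standard error values are hypothetical.

```python
from statistics import NormalDist

def t_statistic(beta_hat, std_err):
    """Test statistic t0 = estimated coefficient / its standard error."""
    return beta_hat / std_err

def is_significant(beta_hat, std_err, alpha=0.05):
    """Two-sided test of H0: beta = 0 using a large-sample critical value."""
    # Assumption: with hundreds of observations the Student t critical
    # value is very close to the standard normal quantile used here.
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    return abs(t_statistic(beta_hat, std_err)) > critical

# Hypothetical estimate for a pollutant coefficient and its standard error.
significant = is_significant(beta_hat=0.004, std_err=0.001)
```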
If the study results in a statistically significant relation between air pollutant concentration and the health outcome of interest, then some analysis and projections are made using the relative risk (RR).

Relative risk
The relative risk (called rate ratio by statisticians) (Dobson & Barnett, 2008) is used to estimate the impact of air pollution on human health, allowing projections to be made according to pollutant concentrations.
The relative risk is a measure of the association between an explanatory variable (e.g. air pollutant concentration) and the risk of a given result (e.g. the number of people with respiratory injury) (Everitt, 2003).
Specifically, the relative risk function at level $x$ of a pollutant concentration, denoted as $RR(x)$, is defined as (Baxter et al., 1997):

$$RR(x) = \frac{E(y \mid x)}{E(y \mid x = 0)}$$

It is the ratio of the expected number of end points at level $x$ of the explanatory variable to the expected number of end points if the explanatory variable were 0 (Baxter et al., 1997). For the Poisson regression, the relative risk is given by:

$$RR(x) = e^{\beta x}$$

indicating, for example, that the risk of a person exposed to some pollutant concentration ($x$) having a specific injury is $RR(x)$ times that of someone who has not been exposed to this concentration. An $RR(x) = 2$ for a pollutant concentration of 100 µg/m³ indicates that a person exposed to this concentration is twice as likely to have a health problem as someone who has not been exposed to any concentration.
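The example in the text can be checked numerically with the Poisson regression form of the relative risk, $RR(x) = e^{\beta x}$. In the sketch below, the coefficient is chosen (hypothetically) so that the relative risk at 100 µg/m³ equals 2.

```python
import math

def relative_risk(beta, x):
    """Relative risk for a Poisson regression: RR(x) = exp(beta * x)."""
    return math.exp(beta * x)

# Hypothetical coefficient for which RR(100) = 2, i.e. beta = ln(2) / 100.
beta = math.log(2) / 100
rr_100 = relative_risk(beta, 100)   # risk doubles at 100 ug/m3
rr_50 = relative_risk(beta, 50)     # square root of 2 at half that level
```

Because the model is multiplicative, halving the concentration does not halve the excess risk: the relative risk at 50 µg/m³ is the square root of the value at 100 µg/m³.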

Case-study
To exemplify the application of the GLM with Poisson regression, a case study is presented. It evaluates the impact of air pollution on the health of the population of Sao Paulo city, Brazil, from 2007 to 2008. Sao Paulo is the largest and most populated city in Brazil, and one of the most populated in the world.
This study evaluated the impact of PM10 (particles with an aerodynamic diameter less than or equal to 10 µm) on the number of hospital admissions for respiratory diseases, according to the International Classification of Diseases (ICD-10).
PM10 was chosen because, according to WHO (2005) as cited in Schwarze et al. (2010), particulate air pollution is regarded as a serious health problem, and some studies have reported that reductions in particulate matter levels decrease the health impact of air pollution. According to Braga et al. (2001), health outcomes were highly correlated with PM10 concentration in the population of Sao Paulo (Brazil).

Case-study database
The data collected in this study consisted of daily values from January 1st, 2007 to December 31st, 2008 for Sao Paulo city, Brazil.
The number of hospital admissions for respiratory diseases, according to the ICD-10, was the response variable. These data were obtained from the Health System (SUS) website (2011). The explanatory variables were PM10 concentration and weather variables (air temperature and relative air humidity), together with a parametric spline for the long-term trend (seasonality), a qualitative variable for the day of the week and a binary variable as holiday indicator.
The PM10 concentration, air temperature and humidity were obtained from the QUALAR system on the Cetesb (Environmental Company of Sao Paulo State) website (2011).
The fixed-site monitoring network of Sao Paulo city, operated by Cetesb, has twelve automatic stations; PM10 concentration is measured at all but one of them, and temperature and humidity data are acquired at eight. The network also contains nine manual stations, but none of them monitors PM10 concentration, only TSP (Total Suspended Particles, particles with an aerodynamic diameter less than or equal to 50 µm) and PM2.5 (particles with an aerodynamic diameter less than or equal to 2.5 µm).
During the study period (2007 to 2008), PM10 concentration was monitored at eight fixed-site automatic stations, while air temperature and relative air humidity were monitored at only two. The daily values of these variables used in this study are the means of the available station data.
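The averaging over stations can be sketched as follows. This is only an illustration under an assumption: the chapter does not state how missing station readings were handled, so the sketch simply skips them, which is one reading of "the mean of the available data". The function name `daily_mean` and the readings are hypothetical.

```python
from statistics import mean

def daily_mean(station_values):
    """Average one day's readings over the stations that reported a value.

    Missing readings are given as None and skipped; if no station reported,
    the day itself is missing and None is returned.
    """
    available = [v for v in station_values if v is not None]
    return mean(available) if available else None

# Hypothetical PM10 readings (ug/m3) for two days at three stations.
readings = [
    [40.0, None, 50.0],   # one station missing
    [None, None, None],   # no data that day
]
daily_pm10 = [daily_mean(day) for day in readings]
```

Days for which no station reported remain missing and would have to be dropped when the model is fitted.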
The descriptive analysis of the variables considered in this study is presented in Table 2. The values in Table 2 show that the maximum daily PM10 concentration (103 µg/m³) did not exceed the national air quality standard (150 µg/m³). To obtain an initial idea of the relation between the response variable and the explanatory variables, the Pearson correlation matrix was constructed (Table 3). The Pearson correlation between hospital admissions for respiratory diseases (RD) and PM10 was positive and statistically significant, meaning that the number of RD tends to increase as PM10 concentration increases. The table also shows that the number of RD increases as temperature and humidity decrease, but this correlation was not statistically significant. Table 3 further shows that PM10 concentration is higher on days with low temperature and humidity.
Table 3. Pearson correlation matrix between hospital admissions for respiratory diseases (RD), concentration of PM10 and weather variables.
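For reference, each entry of a Pearson correlation matrix can be computed as in the sketch below. The series are hypothetical and serve only to show the sign convention (positive for variables that rise together, negative for variables that move in opposite directions); they are not the case-study data, and the sketch does not include the significance test.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical daily values, for illustration only.
pm10 = [20.0, 35.0, 50.0, 65.0, 80.0]   # ug/m3
rd = [100, 110, 118, 131, 140]          # hospital admissions
temp = [24.0, 22.0, 21.0, 18.0, 17.0]   # degrees Celsius

r_pm10_rd = pearson(pm10, rd)      # positive: both rise together
r_pm10_temp = pearson(pm10, temp)  # negative: PM10 rises as temperature falls
```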

Long and short-term trend adjustment
The long-term trend usually included in time series studies of the impact of air pollution on human health is seasonality. In this case study it was modelled with a natural cubic spline, the most widely used parametric smoother in GLMs.
To apply the natural cubic spline in the GLM with Poisson regression, an explanatory variable for the day of study is added to the model, with values from 1 to 731 covering the two years of data.
The short-term trends usually considered in epidemiological studies of air pollution are the day of the week and the holiday indicator. The day of the week was treated as a qualitative variable ranging from one to seven, starting on Sunday. The holiday indicator was included as a binary variable in which one means holiday and zero means workday.
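The three time variables described above (day of study, day of the week and holiday indicator) can be built as in the following sketch. The function name `calendar_variables` is hypothetical, and the holiday set is an illustrative subset only; the chapter does not list the holidays it used.

```python
from datetime import date, timedelta

# Illustrative subset only; a real analysis needs the full holiday calendar.
HOLIDAYS = {date(2007, 1, 1), date(2008, 1, 1)}

def calendar_variables(start, end, holidays):
    """Return (day, dow, H) tuples for every date from start to end inclusive."""
    rows = []
    current = start
    day = 1
    while current <= end:
        # isoweekday(): Monday=1 ... Sunday=7; remap so Sunday=1 ... Saturday=7.
        dow = current.isoweekday() % 7 + 1
        h = 1 if current in holidays else 0
        rows.append((day, dow, h))
        current += timedelta(days=1)
        day += 1
    return rows

rows = calendar_variables(date(2007, 1, 1), date(2008, 12, 31), HOLIDAYS)
```

Since 2008 is a leap year, the two years give 365 + 366 = 731 rows, which agrees with the range of the day variable stated in the text.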
Based on the considerations above, Table 4 shows the first lines of the database used.

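    m.name <- glm(RD ~ ns(day, df) + as.factor(dow) + as.factor(H) + T + RH + PM, data = database.name, family = poisson, na.action = na.omit)    (15)
where m.name = name given to the analysis; RD = hospital admissions for respiratory diseases; ns = natural cubic spline; day = day of study; df = degrees of freedom; dow = day of the week; H = holiday indicator; T = air temperature; RH = relative air humidity; PM = PM10 concentration; database.name = name given to the database file.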
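Equation (15) delegates the estimation to a GLM routine. To make explicit what such a routine computes, the sketch below fits a deliberately reduced Poisson model with a single covariate, log(μ) = a + b·x, by Newton-Raphson in plain Python. It is not the chapter's model: the spline, day-of-week, holiday and weather terms are omitted, and the function name `fit_poisson` and the data are hypothetical.

```python
import math

def fit_poisson(x, y, iterations=25):
    """Fit log(mu) = a + b*x by Newton-Raphson on the Poisson log-likelihood."""
    a = math.log(sum(y) / len(y))  # start from the intercept-only fit
    b = 0.0
    for _ in range(iterations):
        mu = [math.exp(a + b * xi) for xi in x]
        # Score vector (gradient of the log-likelihood).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Fisher information matrix (2 x 2, symmetric).
        h00 = sum(mu)
        h01 = sum(xi * mi for xi, mi in zip(x, mu))
        h11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: solve (information matrix) * delta = score.
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

# Hypothetical noise-free "counts" generated with a = 3 and b = 0.005.
x = [float(v) for v in range(0, 101, 10)]
y = [math.exp(3.0 + 0.005 * xi) for xi in x]
a_hat, b_hat = fit_poisson(x, y)
```

With noise-free data the fit recovers the generating coefficients; with real counts the same iteration returns the maximum likelihood estimates, which is what a Poisson GLM reports as regression coefficients.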
To apply this model, an important decision is the number of degrees of freedom (df) of the natural cubic spline of the days of study. In epidemiological studies of air pollution, common values are four, five or six degrees of freedom per year of data. To decide which to use, three analyses were made with four, five and six degrees of freedom (df) in Equation (15), and the results were compared using the AIC, as shown in Table 5. According to Table 5, the model with 6 degrees of freedom per year of data is the one that best fits the data. Therefore, df = 6 was used in Equation (15) in the following analyses.

[Table 5. Columns: number of df per year; AIC.]
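The comparison by AIC can be sketched as below. The log-likelihoods are hypothetical and do not reproduce Table 5; the parameter counts assume, for illustration, one intercept, a spline with twice the per-year df (two years of data), six day-of-week terms, one holiday term and three continuous covariates. The names `aic` and `best_by_aic` are hypothetical helpers.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L); lower is better."""
    return 2 * n_params - 2 * log_likelihood

def best_by_aic(candidates):
    """Return the key of the candidate with the lowest AIC."""
    return min(candidates, key=lambda name: aic(*candidates[name]))

# Hypothetical (log-likelihood, number of parameters) for each df per year.
# Parameter count assumed as 11 + 2 * df_per_year (see the note above).
candidates = {
    4: (-3400.0, 19),
    5: (-3380.0, 21),
    6: (-3350.0, 23),
}
chosen_df = best_by_aic(candidates)
```

The AIC penalizes each extra parameter by 2, so a larger df is preferred only when the gain in log-likelihood outweighs the added spline terms.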
The short-term trends considered in this study (day of the week and holiday indicator) can lead to autocorrelation between data from one day and the previous days, so the Partial ACF plot against lag days must be analyzed. The lines for each lag day, up to five lags, must lie between $-2n^{-1/2}$ and $2n^{-1/2}$. In this case study the number of observations (n) is 731, so lines in the Partial ACF plot outside the range (-0.07; 0.07) indicate a strong autocorrelation between data from one day and previous days.
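The range quoted above follows from the usual approximate limits of ±2/√n for partial autocorrelations. The sketch below computes the limits for n = 731 and flags the lags that fall outside them. The partial autocorrelation values are hypothetical, chosen only to illustrate the check, and the function names are not from the chapter.

```python
import math

def pacf_bounds(n_obs):
    """Approximate limits (+/- 2 / sqrt(n)) for partial autocorrelations."""
    limit = 2.0 / math.sqrt(n_obs)
    return -limit, limit

def significant_lags(pacf_values, n_obs):
    """Return the lags (1-based) whose partial autocorrelation is outside the limits."""
    low, high = pacf_bounds(n_obs)
    return [lag for lag, value in enumerate(pacf_values, start=1)
            if value < low or value > high]

# Hypothetical partial autocorrelations for lags 1 to 5.
example_pacf = [0.30, 0.18, 0.12, 0.09, 0.03]
low, high = pacf_bounds(731)
flagged = significant_lags(example_pacf, 731)
```

For n = 731 the limit is 2/√731 ≈ 0.074, which rounds to the ±0.07 used in the text.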
Fig. 2 shows the Partial ACF plot against lag days for the model with six degrees of freedom, considering the effect of PM10 concentration on the same day and with no residuals included; Fig. 3 shows the plot for the same model after including residuals.
For the model with 6 degrees of freedom, the Partial ACF plot (Fig. 2) shows autocorrelation between one day and the previous 1, 2, 3 and 4 days, as the lines for these lag days are outside the range. To adjust for this time trend, it is necessary to include the residuals for these lag days in the model. After doing so, the Partial ACF plot (Fig. 3) shows no remaining autocorrelation for the first five lag days, indicating that this is the best fitted model.
After adjusting the GLM with Poisson regression, including all the time trends and explanatory variables, and choosing the number of degrees of freedom that best fits the data, the fitted model was tested using the pseudo R² and the chi-squared statistic to ensure that it is appropriate for the case study.

Model adjustment results
In epidemiological studies of air pollution it is common to find a relation between the air pollutant concentration on one day and the health outcomes some days later (lag days). In this case study, the relation between the PM10 concentration on one day and the number of hospital admissions for respiratory diseases from the same day up to one week later was analyzed.
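A lagged exposure series pairs the admissions on day t with the concentration measured on day t − lag. A minimal sketch of this shift is given below; the function name `lagged` and the series are hypothetical.

```python
def lagged(series, lag):
    """Shift a daily series so position t holds the value from day t - lag.

    The first `lag` positions have no earlier value and are set to None,
    mirroring the missing values that a model fit would have to drop.
    """
    if lag < 0:
        raise ValueError("lag must be non-negative")
    values = list(series)
    return ([None] * lag + values)[:len(values)]

# Hypothetical PM10 series (ug/m3) and the exposures for lags 0 to 7.
pm10 = [30, 42, 55, 38, 61, 47, 52, 44, 39, 58]
exposure_by_lag = {lag: lagged(pm10, lag) for lag in range(8)}
```

Each lagged series would replace the same-day PM10 term in a separate model fit. The same shift could also be applied to model residuals for lags one to four, if the inclusion of residuals is read as adding lagged residuals as covariates; that reading is an interpretation, not something the text specifies.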
All models were fitted with no residual inclusion and also with inclusion of residuals due to autocorrelation.
The goodness of fit for the analyses from no lag to seven lag days is shown in Table 6, (A) without residuals and (B) with residuals. The Partial ACF plots for all adjusted models are not shown, but they are similar to that of Fig. 2. All of them (from no lag to seven lag days) indicated the need to include residuals for 1, 2, 3 and 4 lag days. The models with residuals included no longer showed autocorrelation.
Table 6. Goodness of fit results for the analyses from no lag to seven lag days for models with no residual inclusion (A) and including residuals in the model (B).
According to the goodness of fit analysis shown in Table 6, all the models presented a pseudo R² greater than 0.6, showing that the models fitted the data well, but the chi-squared statistic showed values much greater than the degrees of freedom. As there is no evidence of which statistic is suitable for each situation, we conclude that the model fitted the data well according to the pseudo R² results.
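The two goodness of fit statistics can be sketched as follows. The sketch assumes a deviance-based pseudo R² (one minus the ratio of the model deviance to the deviance of an intercept-only model) and the Pearson form of the chi-squared statistic; if the definitions given earlier in the chapter differ, those take precedence. The data and function names are hypothetical.

```python
import math

def poisson_deviance(observed, fitted):
    """Poisson deviance: 2 * sum(y * ln(y / mu) - (y - mu)), with 0 * ln(0) = 0."""
    total = 0.0
    for y, mu in zip(observed, fitted):
        term = y * math.log(y / mu) if y > 0 else 0.0
        total += term - (y - mu)
    return 2.0 * total

def pseudo_r2(observed, fitted):
    """Deviance-based pseudo R2: 1 - D(model) / D(intercept-only model)."""
    mean_y = sum(observed) / len(observed)
    null_deviance = poisson_deviance(observed, [mean_y] * len(observed))
    return 1.0 - poisson_deviance(observed, fitted) / null_deviance

def pearson_chi2(observed, fitted):
    """Pearson chi-squared statistic: sum((y - mu)^2 / mu)."""
    return sum((y - mu) ** 2 / mu for y, mu in zip(observed, fitted))

# Hypothetical daily counts and fitted means.
observed = [90, 120, 150, 110, 130]
fitted = [95.0, 118.0, 145.0, 112.0, 130.0]
r2 = pseudo_r2(observed, fitted)
chi2 = pearson_chi2(observed, fitted)
```

The chi-squared value is then compared with the residual degrees of freedom (number of observations minus number of fitted parameters), as done in the text.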
After verifying that the models fitted the data well, the regression coefficients were analyzed. The results are shown in Table 7 (without residuals) and Table 8 (with residuals). According to the AIC, the models that include the residuals fit better than those that do not, and the model considering the effect of seven lag days shows the best result, but its regression coefficient was not statistically significant.
Furthermore, the model with three lag days has a lower AIC than those with two, one or no lag days, showed no autocorrelation and has a statistically significant regression coefficient.
In conclusion, the chosen model was the one with the effect of three lag days and with residuals included. The relative risk results are therefore presented only for this model (marked with the # symbol in Table 8).
Table 8. Results analysis for no lag to seven lag days for models including residuals.

Relative risk analysis
To analyze and estimate the impact of PM10 on the health of Sao Paulo's population, the relative risk was calculated for the model considering the effect of three lag days and including residuals. The expression that represents it is given by:

$$RR(x) = e^{\hat{\beta} x} \qquad (16)$$

where x is the PM10 concentration and $\hat{\beta}$ is the regression coefficient estimated for PM10 in this model (Table 8). The relative risks were calculated according to Equation (16) and are plotted against PM10 concentration in Fig. 4.
Fig. 4 shows that the RR has an approximately linear relation with PM10 concentration over the range considered: the greater the PM10 concentration, the higher the RR. When the PM10 concentration increases from 10 to 100 µg/m³, the RR increases by 5%. This may mean that someone exposed to a concentration ten times greater has a 5% higher chance of getting a respiratory disease.
Fig. 4. Estimates of relative risk for the model considering the effect of three lag days and including residuals, according to the increase in PM10 concentration (the dashed lines are the confidence interval). Horizontal axis: increase in PM10 concentration (µg/m³); vertical axis: relative risk.
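The 5% figure can be related back to the regression coefficient with a short calculation. The derivation below is only a sketch: it assumes the exponential form of the Poisson relative risk, $RR(x) = e^{\beta x}$, and it reads the 5% as the ratio $RR(100)/RR(10) = 1.05$. The resulting $\beta$ is an implied value under these assumptions, not the coefficient reported in Table 8.

$$
\begin{aligned}
\frac{RR(100)}{RR(10)} &= e^{\beta (100 - 10)} = 1.05 \\
\beta &= \frac{\ln 1.05}{90} \approx 5.4 \times 10^{-4} \ \text{per } \mu\text{g/m}^3 \\
e^{10 \beta} &= 1.05^{1/9} \approx 1.0054
\end{aligned}
$$

Under these assumptions, each additional 10 µg/m³ of PM10 corresponds to an increase of roughly 0.54% in the relative risk.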

Conclusion
In conclusion, previous studies have not yet found a single model that can explain the impact of all kinds of air pollution on human health. Lipfert (1993) compared approximately 100 studies involving air pollution and demands for hospital services and concluded that the comparison is hampered by the diversity encountered. The studies vary in design, diagnoses studied, air pollutants investigated, lag periods considered and the ways in which potentially confounding variables are controlled. Lipfert (1993) also concluded that study designs have evolved considerably over the 40 years of published findings on this topic. The early studies tended to emphasize the need to limit the populations studied to those living near air pollution monitors, whereas more recent studies employed the concept of regional pooling, in which both hospitalization and air monitoring data are pooled over a large geographic area.
This chapter emphasized the application of time series studies using Generalized Linear Models (GLM) with Poisson regression. This model is often used when the response variable is a count, and time series studies demand less time than follow-up studies.
The four steps to fit the GLM with Poisson regression (database construction, temporal trend adjustment, goodness of fit analysis and results analysis) were applied to a case study of the impact of PM10 concentration on the number of hospital admissions for respiratory diseases in Sao Paulo city from 2007 to 2008. The results showed that the GLM with Poisson regression is a useful tool for epidemiological studies of air pollution.
In the case study, the model fitted the data well, as the pseudo R² statistic showed good results (around 0.8, above 0.6) for all adjustments. The models without residual inclusion, for effects of PM10 concentration from the same day (no lag) to five days later (five lag days), showed statistically significant regression coefficients, but autocorrelations for one, two, three and four lag days were identified, suggesting a correlation between data from one day and up to four days later even after adding variables for the day of the week and holiday confounders. Thus, the models that fitted the data well were those with residuals included for one, two, three and four lag days, for effects of PM10 concentration from the same day (no lag) to three days later (three lag days); the best fit was for the effect after three days of exposure, which had the lowest Akaike Information Criterion (AIC) value (6584.8).
The analysis of the relative risk (RR) for the model with residual inclusion and considering the effect of exposure after three days (three lag days) showed that the risk of someone getting sick with a respiratory disease increases by 5% as the concentration goes from 10 to 100 µg/m³.
The results of the study showed that the risk of getting sick due to PM10 concentration can occur up to three days after the exposure, and that the higher the concentration, the higher the risk.
Finally, the steps to apply the GLM with Poisson regression to studies of the impact of air pollution on human health presented in this chapter had not been found in the literature and can be extended to all air pollutants and health outcomes.

Acknowledgment
This chapter was developed with financial support of CNPQ (Conselho Nacional de Desenvolvimento Científico e Tecnológico) and ANP (Agência Nacional do Petróleo).