Spatial linear regression models in the infant mortality analysis

year-old has been assessed through the global and local Moran indexes. Furthermore, three models were fitted: the classical regression model, the spatial autoregressive (SAR) model and the conditional autoregressive (CAR) model. Applying a square root transformation to the response variable made it possible to develop prediction equations for the spatial models and to obtain residuals with normal distribution and constant variance. The Akaike information criterion (AIC) indicated the SAR model as the best. The “number of women in fertile age” and the “monthly income of the women” were identified as the main factors significantly affecting infant mortality in Alfenas city, Minas Gerais.


Introduction
Infant mortality (IM) is one of the main measures used to assess quality of life in several countries. It refers to deaths of young children, typically those under one year of age. IM is measured by the Infant Mortality Rate (IMR), which is the number of deaths of infants under one year old per 1000 live births during a given period (Genowska et al., 2015).
Several factors have been linked to high infant mortality rates. Genowska et al. (2015) point out that premature birth is the biggest contributor to the IMR, while birth asphyxia, pneumonia, neonatal infection and malnutrition are other causes of IM. The authors also claim that the mother's education level, environmental conditions, and political and medical infrastructure are important contributing factors. Hall et al. (2016), in turn, describe smoking during pregnancy as the most common preventable cause of IM and note that other public health measures, such as improving sanitation, access to clean drinking water and immunization against infectious diseases, can also help to reduce the IMR.
Some authors (Leal & Szwarcwald, 1997; Menezes et al., 1998) argue that infant mortality can be controlled by investments that improve assistance for women and newborns in the delivery room and nursery, the socio-economic conditions of low-income people, access to education, sanitation and basic health services. Furthermore, epidemiological studies that describe the spatial distribution of infant mortality are also useful tools for controlling it. These studies allow decision makers to identify critical regions with special needs, where interventions must be carried out in order to reduce infant mortality rates and improve quality of life. Silva et al. (2011) claim that neighbouring regions are more likely to have similar infant mortality rates than regions that are far apart. Therefore, infant mortality analysis must take into account the level of similarity or dissimilarity between neighbouring regions by using spatial statistics methods. According to Cressie (2015), spatial statistics is divided into three groups, namely random fields (geostatistics), point pattern processes and lattice data.
Random fields describe phenomena distributed continuously in space, and point pattern processes are expressed by occurrences identified as points located in space. Lattice data describe aggregated phenomena such as Gross Domestic Product (GDP) or counts of occurrences of a human disease by state, county, census sector, block, etc.
Most studies of infant mortality in Brazil analyse spatially aggregated area data, because this type of data is often available as counts by region (Leal & Szwarcwald, 1997; Silva et al., 2011). There are two main sources of data for estimating infant mortality: the Birth Information System (SINASC) and the Death Information System (SIM). Information for the former is provided by hospitals, where birth certificates are issued, while information for the latter comes from death certificates (D'Orsi & Carvalho, 1998). These systems do not provide the exact location (coordinates) of the deceased, which is probably one of the reasons why infant mortality is often analysed using aggregated data. Moreover, representation by regions provides a feasible way to produce easily interpretable thematic maps of the spatial distribution of infant mortality for decision makers (Andrade & Szwarcwald, 2001).
When covariates associated with infant mortality are available, regression models can be built to predict infant mortality in other regions or to forecast it. For spatial data, where spatial autocorrelation exists, the model must take the spatial structure into account in order to improve its predictions. According to Druck (2004), there are two types of regression models with spatial structure for lattice data: the spatial autoregressive (SAR) model and the conditional autoregressive (CAR) model. The former incorporates the spatial dependence as a lagged variable of the response, while the latter captures spatial autocorrelation as a nuisance in the error term.
This paper aimed to generate prediction maps of the distribution of infant mortality for a Brazilian city (Alfenas, MG) using a spatial regression model that incorporates a transformation of the response variable, and to identify high-risk regions in order to support decision makers in planning interventions.

Methods and Material

Data and study area
The study was conducted in the urban area of Alfenas city, Minas Gerais. The city is located at latitude 21°25′45″ S and longitude 45°56′50″ W (Köppen & Geiger, 1928). Alfenas is divided into 68 census sectors of different shapes and sizes, geographically coded by their centroids (Figure 1).
The number of deaths of infants under one year old was used as the response (dependent) variable and, in the present paper, represents infant mortality. These data were collected from the Death Information System (SIM) for the period from 2010 to 2014. Only deaths that could be geocoded (those with address information such as avenue, street or neighbourhood) were used. Because the total number of births per census sector in the study area was not available, it was not possible to determine infant mortality rates. For this reason, the study uses the number of deaths of infants under one year old, not rates, as the measure of infant mortality. Furthermore, seven covariates were used as independent variables: the number of women in fertile age; the number of women in gestational risk age; the number of illiterate women; the monthly income of women; the monthly income of men; the number of residences with more than six inhabitants; and the demographic density of the census sector. All independent variables were obtained from IBGE (2010), while the number of deaths of infants under one year old was collected by one of the authors of the study (D.A.N.). All variables were collected for each of the 68 census sectors. The study did not involve biological assays on human subjects, so approval from a Research Ethics Committee was not required.

Spatial Dependence Analysis
Spatial dependence between observations of the number of deaths of infants under one year old was evaluated using the global and local Moran indexes. Applying these indexes requires a neighbourhood matrix ($W$), which represents the spatial dependence structure of a random variable in lattice data and expresses the degree of proximity between areas or regions (Werneck, 2008). There are many ways to define $W$. Given a set of $n$ areas ($A_1, \ldots, A_n$), the $n \times n$ matrix $W$ has elements $w_{ij}$ representing a measure of proximity between areas $A_i$ and $A_j$. One criterion used to define $W$ is $w_{ij} = 1$ if areas $A_i$ and $A_j$ share a border, for $i \neq j$, and $w_{ij} = 0$ otherwise.
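As an illustration of the shared-border criterion, the following minimal sketch builds a binary neighbourhood matrix from a list of bordering pairs. The function name `contiguity_matrix` and the four-area layout are hypothetical and are not taken from the Alfenas census sectors.

```python
def contiguity_matrix(n_areas, borders):
    """Return an n x n list-of-lists W with w_ij = 1 when areas i and j share a border."""
    W = [[0] * n_areas for _ in range(n_areas)]
    for i, j in borders:
        if i != j:          # w_ii stays 0: an area is not its own neighbour
            W[i][j] = 1
            W[j][i] = 1     # sharing a border is a symmetric relation
    return W

# Hypothetical toy map: four areas in a row (0-1-2-3), not the real census sectors
W = contiguity_matrix(4, [(0, 1), (1, 2), (2, 3)])
for row in W:
    print(row)
```

In practice the bordering pairs would be derived from the sector polygons by GIS software; the hand-written list here only stands in for that step.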
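According to Waller & Gotway (2004), the Global Moran Index is the main measure used to describe spatial autocorrelation in lattice data. The Global Moran Index is defined by:
$$\hat{I} = \frac{n}{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}} \cdot \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\,(Y_i - \bar{Y})(Y_j - \bar{Y})}{\sum_{i=1}^{n}(Y_i - \bar{Y})^2} \qquad (1)$$
where $n$ is the number of census sectors in Alfenas city; $Y_i$ and $Y_j$ are the numbers of deaths of infants under one year old in census sectors $i$ and $j$; $\bar{Y}$ is the mean number of deaths over all census sectors; and $w_{ij}$ are the elements of the neighbourhood matrix ($w_{ij} = 1$ if two areas share a border and $w_{ij} = 0$ otherwise).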
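The global index can be computed directly from its usual form: $n$ divided by the sum of all weights, times the weighted cross-products of deviations from the mean over the sum of squared deviations. The sketch below assumes that form; the function name `global_moran` and the toy values are hypothetical.

```python
def global_moran(y, W):
    """Global Moran index for values y and neighbourhood matrix W (lists)."""
    n = len(y)
    mean = sum(y) / n
    z = [v - mean for v in y]                      # deviations from the mean
    s0 = sum(sum(row) for row in W)                # sum of all weights
    cross = sum(W[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(v * v for v in z)

# Hypothetical toy map: four areas in a row, each bordering the next
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]

moran_trend = global_moran([1, 2, 3, 4], W)        # similar neighbours: positive index
moran_alternating = global_moran([1, 0, 1, 0], W)  # dissimilar neighbours: negative index
print(moran_trend, moran_alternating)
```

The two toy series illustrate the sign convention: values that change smoothly across neighbours give a positive index, while alternating values give a negative one.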
The global Moran index is interpreted in the same way as Pearson's correlation coefficient, although it is not necessarily bounded in [-1, 1]. A positive value indicates similarity between values of infant mortality in neighbouring regions, which form clusters in the study area, while a negative value indicates dissimilarity between neighbouring regions, which form a regular pattern in the study area. In the absence of spatial autocorrelation, the Moran index is close to zero, which indicates complete spatial randomness (Santos & Souza, 2007). Inference for the global Moran index was carried out under the assumption of asymptotic normality (Cliff & Ord, 1981), using a Wald test defined by:
$$Z = \frac{\hat{I} - E[\hat{I}]}{\sqrt{Var[\hat{I}]}} \qquad (2)$$
where $\hat{I}$ is defined by equation (1), $E[\hat{I}]$ is the expected value of the Moran index and $Var[\hat{I}]$ is its variance (see Cliff & Ord, 1981).
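A minimal sketch of this test is given below. It assumes the moments of the index under the normality assumption described by Cliff & Ord, with expected value $-1/(n-1)$; whether the original analysis used these moments or the randomization version is not stated, so this choice is an assumption, as are the function name `moran_z_test` and the toy data.

```python
import math

def moran_z_test(y, W):
    """Return (Moran index, z statistic, two-sided p-value) under the normality assumption."""
    n = len(y)
    mean = sum(y) / n
    z = [v - mean for v in y]
    s0 = sum(sum(row) for row in W)
    cross = sum(W[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    moran = (n / s0) * cross / sum(v * v for v in z)

    # Moments of the index under normality (assumed variant of the Cliff & Ord formulas)
    s1 = 0.5 * sum((W[i][j] + W[j][i]) ** 2 for i in range(n) for j in range(n))
    s2 = sum((sum(W[i]) + sum(W[k][i] for k in range(n))) ** 2 for i in range(n))
    expected = -1.0 / (n - 1)
    variance = (n * n * s1 - n * s2 + 3 * s0 * s0) / ((n * n - 1) * s0 * s0) - expected ** 2

    z_stat = (moran - expected) / math.sqrt(variance)
    p_value = math.erfc(abs(z_stat) / math.sqrt(2))   # two-sided normal p-value
    return moran, z_stat, p_value

# Hypothetical toy map: four areas in a row
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
moran, z_stat, p_value = moran_z_test([1, 2, 3, 4], W)
print(moran, z_stat, p_value)
```

With only four areas the normal approximation is not reliable; the toy map is used only to keep the arithmetic easy to follow.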
The global Moran index only indicates whether spatial autocorrelation is present; it does not identify which areas exhibit similarity or dissimilarity, for instance which areas form clusters in the region. For this reason, a Local Spatial Autocorrelation Index (LISA) must be included in the analysis. According to Druck (2004), the Moran LISA is defined as
$$\hat{I}_i = \frac{(Y_i - \bar{Y})\sum_{j=1}^{n} w_{ij}\,(Y_j - \bar{Y})}{\frac{1}{n}\sum_{j=1}^{n}(Y_j - \bar{Y})^2} \qquad (3)$$
where $\hat{I}_i$ is the local spatial autocorrelation index in area $i$; $n$ is the number of census sectors in Alfenas city; $Y_i$ is the number of deaths of infants under one year old in census sector $i$; $Y_j$ is the number of deaths of infants under one year old in census sector $j$; $\bar{Y}$ is the mean number of deaths over all census sectors; and $w_{ij}$ are the elements of the neighbourhood matrix ($w_{ij} = 1$ if two areas share a border and $w_{ij} = 0$ otherwise).
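The local index can be sketched in the same way. The version below assumes the common normalization in which each area's deviation is multiplied by the sum of its neighbours' deviations and divided by the mean squared deviation; other normalizations exist, so this is one possible form rather than necessarily the one used in the study. The function name `local_moran` and the data are hypothetical.

```python
def local_moran(y, W):
    """Local Moran index for each area, using the mean squared deviation as normalizer."""
    n = len(y)
    mean = sum(y) / n
    z = [v - mean for v in y]
    m2 = sum(v * v for v in z) / n          # second moment of the deviations
    return [z[i] * sum(W[i][j] * z[j] for j in range(n)) / m2 for i in range(n)]

# Hypothetical toy map: four areas in a row
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]

local_trend = local_moran([1, 2, 3, 4], W)   # every area resembles its neighbours
local_spike = local_moran([5, 1, 1, 1], W)   # area 0 is high among low neighbours
print(local_trend)
print(local_spike)
```

Under this normalization the local values add up to the global index multiplied by the sum of all weights, which is a convenient consistency check.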

Modelling
According to Draper & Smith (1998), a multiple linear regression model is described as:
$$Y = X\beta + \varepsilon \qquad (4)$$
where $Y$ is an $n \times 1$ vector of observations of the dependent variable in the $n$ census sectors; $X$ is an $n \times (p+1)$ matrix with $p$ independent variables (covariates) measured in the $n$ census sectors; $\beta$ is a $(p+1) \times 1$ vector of parameters to be estimated; and $\varepsilon$ is an $n \times 1$ vector of uncorrelated random errors, normally distributed with mean zero and constant variance, $\varepsilon \sim N(0, \sigma^2 I)$. A stepwise procedure was applied to select the covariates with a significant effect in the model (Charnet et al., 1999). The assumptions of normality, independence and constant variance of the errors in the classical regression model were assessed by the Shapiro-Wilk, Durbin-Watson and Breusch-Pagan tests, respectively (Draper & Smith, 1998). In addition, a Lagrange Multiplier test was applied to the residuals of the regression model in order to assess the presence of spatial autocorrelation. Spatial autocorrelation in the residuals indicates a lack of independence. In that case, a spatial component must be included in the classical regression model in order to improve the goodness of fit (Anselin, 2005).
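The classical fit and one of the residual checks can be sketched as follows, assuming NumPy is available. The helper names `fit_ols` and `durbin_watson` and the six data points are hypothetical; the Durbin-Watson statistic is computed from its textbook definition, and the Shapiro-Wilk and Breusch-Pagan tests are not reproduced here.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit of y = X1 beta + error, where X1 is X plus an intercept column."""
    y = np.asarray(y, dtype=float)
    X1 = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    dof = len(y) - X1.shape[1]                 # residual degrees of freedom
    sigma2 = float(resid @ resid) / dof
    return beta, resid, sigma2

def durbin_watson(resid):
    """Sum of squared successive differences over the sum of squared residuals."""
    resid = np.asarray(resid, dtype=float)
    return float(np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2))

# Hypothetical data: square root of counts regressed on one covariate
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])
counts = np.array([0, 1, 1, 4, 4, 9])
beta, resid, sigma2 = fit_ols(x.reshape(-1, 1), np.sqrt(counts))
print(beta, sigma2, durbin_watson(resid))
```

A Durbin-Watson value near 2 is consistent with uncorrelated residuals; note that the statistic depends on the ordering of the observations, which for areal data is arbitrary.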
Spatial dependence can be incorporated into linear regression models through a single parameter in two ways (Druck, 2004). First, the spatial autocorrelation can be attached to the dependent variable, generating a spatial autoregressive (SAR) model. Second, the spatial dependence parameter can be incorporated into the error term, generating a conditional autoregressive (CAR) model, also called the error model.
According to Anselin (2005), a spatial autoregressive (SAR) model is described as:
$$Y = \rho W Y + X\beta + \varepsilon \qquad (5)$$
where $Y$, $X$, $\beta$ and $\varepsilon$ are defined as in the classical regression model given by equation (4), $\rho$ is the spatial autoregressive coefficient and $W$ is the neighbourhood matrix.
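A conditional autoregressive (CAR) model is defined by:
$$Y = X\beta + \varepsilon, \qquad \varepsilon = \rho W \varepsilon + u \qquad (6)$$
where $Y$, $X$, $\beta$, $\rho$ and $W$ are defined as in equation (5), $\varepsilon$ is now a spatially autocorrelated error vector, and $u$ is an $n \times 1$ vector of uncorrelated random errors, normally distributed with zero mean vector and constant diagonal covariance matrix, $u \sim N(0, \sigma^2 I)$ (Anselin, 1999).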
The null hypothesis of no spatial autocorrelation in the SAR and CAR models, given by equations (5) and (6), was $\rho = 0$. In other words, if $\rho$ is not statistically different from zero, the residuals of the classical regression model are not spatially autocorrelated. In the CAR and SAR models, the estimates of $\beta$ and $\rho$ were obtained by the maximum likelihood method (Bivand, 2010). The adequacy of the SAR and CAR models (normality, independence and constant variance of the errors) was evaluated using the same approach applied to the classical regression model. The fitted models were compared by adjusted R², log-likelihood and AIC (Akaike information criterion).
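To make the maximum likelihood step concrete, the sketch below fits a SAR model by a grid search over the concentrated log-likelihood, assuming NumPy and a symmetric binary neighbourhood matrix. This is a simplified illustration and not the routine used in the study; the function name `fit_sar_ml`, the grid size and the simulated 8 x 8 lattice are all assumptions.

```python
import numpy as np

def fit_sar_ml(y, X, W, n_grid=400):
    """Grid-search maximum likelihood for y = rho*W*y + X1*beta + eps (W symmetric)."""
    y = np.asarray(y, dtype=float)
    W = np.asarray(W, dtype=float)
    n = y.size
    X1 = np.column_stack([np.ones(n), np.asarray(X, dtype=float)])

    def ols_residuals(v):
        coef, *_ = np.linalg.lstsq(X1, v, rcond=None)
        return v - X1 @ coef

    e0 = ols_residuals(y)          # y purged of the covariates
    e_lag = ols_residuals(W @ y)   # spatial lag W y purged of the covariates
    eig = np.linalg.eigvalsh(W)    # real eigenvalues because W is symmetric

    # rho must keep I - rho*W non-singular: 1/min(eig) < rho < 1/max(eig)
    grid = np.linspace(1.0 / eig.min(), 1.0 / eig.max(), n_grid + 2)[1:-1]
    best_rho, best_loglik = None, -np.inf
    for rho in grid:
        e = e0 - rho * e_lag
        sigma2 = e @ e / n
        # log|I - rho W| minus the Gaussian terms with sigma2 profiled out
        loglik = np.sum(np.log(1.0 - rho * eig)) - 0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)
        if loglik > best_loglik:
            best_rho, best_loglik = rho, loglik

    beta, *_ = np.linalg.lstsq(X1, y - best_rho * (W @ y), rcond=None)
    resid = y - best_rho * (W @ y) - X1 @ beta
    return best_rho, beta, float(resid @ resid / n), float(best_loglik)

# Simulated check on a hypothetical 8 x 8 lattice of areas with shared-border neighbours
side = 8
n = side * side
W = np.zeros((n, n))
for r in range(side):
    for c in range(side):
        k = r * side + c
        if c + 1 < side:
            W[k, k + 1] = W[k + 1, k] = 1
        if r + 1 < side:
            W[k, k + side] = W[k + side, k] = 1

rng = np.random.default_rng(0)
x = rng.normal(size=n)
true_rho, true_beta = -0.1, np.array([1.0, 2.0])
eps = rng.normal(scale=0.1, size=n)
y = np.linalg.solve(np.eye(n) - true_rho * W, true_beta[0] + true_beta[1] * x + eps)

rho_hat, beta_hat, sigma2_hat, loglik = fit_sar_ml(y, x.reshape(-1, 1), W)
print(round(rho_hat, 3), np.round(beta_hat, 3))
```

The grid search trades precision for transparency; a production analysis would normally rely on an established spatial econometrics package, which also reports standard errors for $\rho$.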
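A transformation is necessary when the residuals exhibit non-constant variance or non-normality. This may be the case when the response variable is a count, such as infant mortality. In such situations, the square root transformation is often recommended, since it may stabilize the variance and somewhat improve symmetry (Draper & Smith, 1998). The square root transformation is straightforward for the classical regression model but is more complicated for spatial regression models. We advocate using a square root transformation for spatial regression models following the approach described by Miller (1984) for the classical regression model. For the SAR model, with a square root transformation of the dependent variable (taken element by element), we have
$$\sqrt{Y} = \rho W \sqrt{Y} + X\beta + \varepsilon \;\Rightarrow\; \sqrt{Y} = (I - \rho W)^{-1}(X\beta + \varepsilon) = H.$$
Squaring both sides of the equality element by element, we have
$$Y = H \circ H,$$
where $\circ$ denotes the element-wise product. Applying the expectation to both sides, we have
$$E[Y] = E[H \circ H].$$
Because the right-hand side is composed of the squared elements of $H$, this is equivalent to
$$E[Y] = E\left[\mathrm{diag}[H H^T]\right],$$
where $\mathrm{diag}[\cdot]$ denotes the diagonal elements of a matrix. Thus, we get
$$E[Y] = \mathrm{diag}\left[E\left[(I - \rho W)^{-1}(X\beta + \varepsilon)\left((I - \rho W)^{-1}(X\beta + \varepsilon)\right)^T\right]\right].$$
Applying the transpose, we have
$$E[Y] = \mathrm{diag}\left[(I - \rho W)^{-1}\,E\left[(X\beta + \varepsilon)(\beta^T X^T + \varepsilon^T)\right]\left((I - \rho W)^{-1}\right)^T\right].$$
It is known that the expectation of the error term is zero, so the cross terms vanish and we get
$$E[Y] = \mathrm{diag}\left[(I - \rho W)^{-1}\left(X\beta\beta^T X^T + E[\varepsilon\varepsilon^T]\right)\left((I - \rho W)^{-1}\right)^T\right],$$
which is equivalent to
$$E[Y] = \mathrm{diag}\left[(I - \rho W)^{-1}\left(X\beta\beta^T X^T + \sigma^2 I\right)\left((I - \rho W)^{-1}\right)^T\right].$$
Therefore, the predicted values on the original scale of the dependent variable are determined by
$$\hat{Y} = \mathrm{diag}\left[(I - \hat{\rho} W)^{-1}\left(X\hat{\beta}\hat{\beta}^T X^T + \hat{\sigma}^2 I\right)\left((I - \hat{\rho} W)^{-1}\right)^T\right] \qquad (7)$$
where $\hat{\beta}$, $\hat{\rho}$ and $\hat{\sigma}^2$ are the estimates of the unknown parameters obtained by the maximum likelihood method using the transformed data.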
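The back-transformation for the SAR model can be sketched numerically, assuming NumPy and the form in which the prediction is the diagonal of $A(X\beta\beta^T X^T + \sigma^2 I)A^T$ with $A = (I - \rho W)^{-1}$. The function name `predict_sar_original_scale` and all inputs are hypothetical, not estimates from the study.

```python
import numpy as np

def predict_sar_original_scale(X1, beta, rho, sigma2, W):
    """Diagonal of A (X b b' X' + sigma2 I) A', with A = (I - rho W)^-1."""
    X1 = np.asarray(X1, dtype=float)
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    A = np.linalg.inv(np.eye(n) - rho * W)
    mu = X1 @ np.asarray(beta, dtype=float)      # linear predictor on the square-root scale
    M = A @ (np.outer(mu, mu) + sigma2 * np.eye(n)) @ A.T
    return np.diag(M).copy()

# Hypothetical inputs: three areas in a row, intercept plus one covariate
W = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
X1 = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
beta = np.array([1.0, 1.0])

no_dependence = predict_sar_original_scale(X1, beta, 0.0, 0.25, W)
with_dependence = predict_sar_original_scale(X1, beta, -0.2, 0.25, W)
print(no_dependence)
print(with_dependence)
```

With $\rho = 0$ the prediction reduces to the squared linear predictor plus the error variance, which is the classical square-root correction; a non-zero $\rho$ changes both the mean and the variance contribution of each area.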
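For the CAR model, the same approach was used as for the SAR model. Writing the transformed CAR model as $\sqrt{Y} = X\beta + (I - \rho W)^{-1}u$, with $u \sim N(0, \sigma^2 I)$, the predicted values on the original scale of the dependent variable are determined by
$$\hat{Y} = \mathrm{diag}\left[X\hat{\beta}\hat{\beta}^T X^T + \hat{\sigma}^2\,(I - \hat{\rho} W)^{-1}\left((I - \hat{\rho} W)^{-1}\right)^T\right] \qquad (8)$$
The predicted values of infant mortality in each census sector were determined by equations (7) and (8). The predictions were used to build thematic maps of the distribution of infant mortality in Alfenas city in order to identify high-risk areas.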
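For the error-type (CAR) specification, the same idea can be sketched under the assumption that the spatial filter acts only on the error term, so the prediction is the squared linear predictor plus $\sigma^2$ times the diagonal of $AA^T$, with $A = (I - \rho W)^{-1}$. This form follows from applying the SAR reasoning to the error model and is an assumption of this sketch; the function name `predict_car_original_scale` and the inputs are hypothetical.

```python
import numpy as np

def predict_car_original_scale(X1, beta, rho, sigma2, W):
    """Diagonal of X b b' X' + sigma2 * A A', with A = (I - rho W)^-1."""
    X1 = np.asarray(X1, dtype=float)
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    A = np.linalg.inv(np.eye(n) - rho * W)
    mu = X1 @ np.asarray(beta, dtype=float)   # linear predictor on the square-root scale
    return np.diag(np.outer(mu, mu) + sigma2 * (A @ A.T)).copy()

# Hypothetical inputs: three areas in a row, intercept plus one covariate
W = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
X1 = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
beta = np.array([1.0, 1.0])

car_no_dependence = predict_car_original_scale(X1, beta, 0.0, 0.25, W)
car_with_dependence = predict_car_original_scale(X1, beta, -0.2, 0.25, W)
print(car_no_dependence)
print(car_with_dependence)
```

Here the spatial parameter only inflates the variance term, so the mean part of the prediction is the same as in the classical model.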

Exploratory analysis
The number of deaths of infants under one year old in the city of Alfenas, MG, varies from zero to seven, and its spatial distribution in quartiles is presented in Figure 2. The figure shows that the data have no outliers and that infant mortality exhibits no trend. Thus, the number of deaths of infants under one year old is first-order stationary. The global Moran index (GMI) presented in Table 1 indicates an absence of spatial autocorrelation for the response variable (p-value > 0.05). However, the GMI is a global index and can mask the presence of local spatial autocorrelation. Thus, a local spatial autocorrelation analysis was also carried out using LISA, which estimates a measure of spatial autocorrelation for each area. Figure 3 shows the LISA map, where some areas exhibit statistically significant local spatial autocorrelation (p-value < 0.05). This suggests that the number of deaths of infants under one year old in the city of Alfenas is locally spatially autocorrelated.
Table 2. Parameter estimates (standard errors in parentheses) for the classical regression model and the spatial models.
Covariate | Estimate (SE) | Estimate (SE) | Estimate (SE)
Monthly women income | -8.7×10^-6** (2.9×10^-6) | 9.2×10^-6** (2.6×10^-6) | 7.3×10^-6** (2.3×10^-6)
Number of women in fertile age | 5.4×10^-3** (1.06×10^-3) | 4.9×10^-3** (9.6×10^-4) | 4.8×10^-3** (9.7×10^-4)
The spatial dependence analysis using the local spatial autocorrelation index (LISA) showed that the numbers of deaths of infants under one year old in neighbouring areas are spatially autocorrelated. However, these areas show no tendency to aggregate into clusters, as can be seen in the LISA cluster map in Figure 3. Areas designated High-High and Low-Low have positive spatial autocorrelation, while areas designated Low-High and High-Low have negative spatial autocorrelation. In Figure 3, four areas are marked as High-Low. This means that these areas are negatively spatially autocorrelated with their neighbours; that is, infant mortality in those four areas is dissimilar to that of their neighbours.
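The High-High, Low-Low, High-Low and Low-High designations can be reproduced from the signs of each area's deviation and of its neighbours' summed deviations. The sketch below assumes that rule and does not assess statistical significance, which a LISA cluster map normally does; the function name `lisa_quadrants` and the data are hypothetical.

```python
def lisa_quadrants(y, W):
    """Label each area by its own deviation and its neighbours' summed deviation."""
    n = len(y)
    mean = sum(y) / n
    z = [v - mean for v in y]
    labels = []
    for i in range(n):
        lag = sum(W[i][j] * z[j] for j in range(n))   # neighbours' deviations
        own = "High" if z[i] > 0 else "Low"
        neighbours = "High" if lag > 0 else "Low"
        labels.append(own + "-" + neighbours)
    return labels

# Hypothetical toy map: four areas in a row, with one high value at the end
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
labels = lisa_quadrants([5, 1, 1, 1], W)
print(labels)
```

In this toy example the first area is labelled High-Low because its value is above the mean while its only neighbour is below it, which mirrors the situation described for the four High-Low sectors.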

Modelling
The classical regression model fitted to the transformed numbers of deaths of infants under one year old and the covariates showed that the number of women in fertile age and the monthly income of women are the main variables with statistically significant effects. Table 2 reports the parameter estimates for the classical regression model and for the spatial models using these two variables.
The Shapiro-Wilk normality test and the Breusch-Pagan test showed that the assumptions of normality and constant variance of the errors were not violated. However, the Durbin-Watson test indicated that the assumption of uncorrelated errors was violated. Thus, inferences based on the classical regression model are not valid. The lack of independence in the residuals may be evidence of spatial dependence between observations of infant mortality. Therefore, spatial autocorrelation must be included in the model, using SAR or CAR models, in order to achieve a better fit.
As in the classical regression model, the spatial models (SAR and CAR) also showed significant effects of the number of women in fertile age and the monthly income of women (Table 2).
The local spatial autocorrelation between observations (Figure 3) has a strong influence on the classical regression model. If spatial autocorrelation between observations is ignored, all inferences made with the classical regression model may be questionable because the independence assumption is violated. For this reason, the classical regression model presented in Table 2 may not support accurate inferences, and a spatial regression model (SAR or CAR) is preferred. Results for the SAR and CAR models showed that the transformed number of deaths of infants under one year old had an inverse relationship with the monthly income of women and a direct relationship with the number of women in fertile age. Therefore, areas with more women in fertile age and areas with low monthly income of women are expected to exhibit higher numbers of deaths of infants under one year old. In public health programmes, such areas may deserve special attention from decision makers.
Some authors, such as Andrade et al. (2004) and Abreu & Vasconcellos (2010), state that socio-economic factors are commonly associated with infant mortality. The authors point out that women's income is one of the main factors in infant mortality analysis because of the role income plays in purchasing the goods and services needed to achieve a high quality of life and health in any country. Menezes et al. (1998), studying risk factors for infant mortality in Pelotas, RS, Brazil, reported that infant mortality in families with income below the minimum wage was three times greater than in families with high income.
The estimates of the covariate parameters are similar in both spatial models, and the spatial autoregressive parameter is negative in both. Negative spatial dependence that is statistically different from zero means that the numbers of deaths of infants under one year old in neighbouring areas are dissimilar, exhibiting a regular pattern in the study area (Waller & Gotway, 2004).
The Shapiro-Wilk normality test and the Breusch-Pagan heteroscedasticity test for the residuals of the SAR and CAR models indicated no violation of the normality and constant variance assumptions, respectively. According to the three criteria used to choose the best model (adjusted R², AIC and log-likelihood), the SAR model was the best of the three fitted models (Table 2). The fitted SAR model has the lowest AIC (126.7), the highest adjusted R² (44.27%) and the highest log-likelihood (-58.35). Therefore, the SAR model was used to predict infant mortality in the 68 census sectors of Alfenas city. Based on the predicted values, a thematic map of the distribution of the number of deaths of infants under one year old was built (Figure 4). Miller (1984) reports procedures for predicting values of the response variable in the classical regression model with a transformed dependent variable. However, no such procedures were found in the literature for spatial regression models. Thus, the Methods and Material section of this paper proposes a procedure for that purpose, given by equations (7) and (8). Figure 4 shows the predicted values of the number of deaths of infants under one year old in Alfenas city using equation (7), based on the SAR model.
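The reported AIC and log-likelihood of the SAR model can be checked against the usual definition, AIC = 2k - 2 log L. The sketch below assumes k = 5 estimated parameters (intercept, two slopes, the spatial parameter and the error variance); this count is an assumption, chosen because it reproduces the reported value.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 log L."""
    return 2 * n_params - 2 * log_likelihood

# Reported SAR log-likelihood; k = 5 is an assumption (intercept, two slopes, rho, sigma^2)
sar_aic = aic(-58.35, 5)
print(round(sar_aic, 1))   # 126.7
```

The agreement with the reported AIC of 126.7 suggests that the error variance was counted among the model parameters.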
The predicted numbers of deaths of infants under one year old in Alfenas city by census sector vary from zero to four. According to Figure 4, only a few census sectors had no infant mortality, while the majority had, on average, one or two deaths of infants under one year old. The census sectors without any infant deaths are the wealthy areas of the city, while the census sectors with the highest values of infant mortality (three and four) are those with the worst socio-economic conditions in terms of infrastructure and are the poorest areas of Alfenas city. Thus, in a public health programme aiming to reduce infant mortality, these areas deserve special attention.

Conclusions
The results of this study showed that infant mortality in Alfenas city, MG, has negative local spatial autocorrelation: areas with high values of infant mortality are more likely to be surrounded by areas with low values. The square root transformation of the response variable made it possible to develop prediction equations for spatial regression models. The number of women in fertile age and the monthly income of women are the main factors affecting infant mortality in Alfenas city, MG.

Acknowledgements
The authors thank CNPq for the scholarship granted to L.M.