Bayesian modeling and clustering for spatio-temporal areal data: An application to Italian unemployment

Spatio-temporal areal data can be seen as a collection of time series which are spatially correlated according to a specific neighboring structure. Incorporating the temporal and spatial dimension into a statistical model poses challenges regarding the underlying theoretical framework as well as the implementation of efficient computational methods. We propose to include spatio-temporal random effects using a conditional autoregressive prior, where the temporal correlation is modeled through an autoregressive mean decomposition and the spatial correlation by the precision matrix inheriting the neighboring structure. Their joint distribution constitutes a Gaussian Markov random field, whose sparse precision matrix enables the usage of efficient sampling algorithms. We cluster the areal units using a nonparametric prior, thereby learning latent partitions of the areal units. The performance of the model is assessed via an application to study regional unemployment patterns in Italy. When compared to other spatial and spatio-temporal competitors, the proposed model shows more precise estimates and the additional information obtained from the clustering allows for an extended economic interpretation of the unemployment rates of the Italian provinces.


Introduction
Data representing spatial features collected at different time points have witnessed an increase in availability in recent years. The underlying areal units can be defined by geographical boundaries (e.g., regions, countries, municipalities) or by a tessellation of the territory of interest. This type of spatio-temporal data is commonly encountered in fields such as Social Sciences, Telecommunications, Economics, Epidemiology, and Image Analysis. A current example is the count of COVID-19 registered cases, often communicated on a daily basis per country and region. We can therefore think of spatio-temporal areal data as a collection of time series, whose spatial correlation depends on the geographical location of the underlying areal units.
In this work, we focus on the analysis of data regarding the evolution of unemployment rates in Italy. Italy has one of the highest unemployment rates in the European Union and exhibits well-known economic disparities between the northern and the southern regions (see Brunello et al., 2001). The global economic and the following eurozone crisis had a destructive impact on the Italian economy, leading to a sharp rise of unemployment across the entire country (see Di Quirico, 2010;Bull, 2018). The geographical dichotomy as well as the upward trend of the unemployment rate motivates the need for a flexible modeling approach, explicitly incorporating both spatial and temporal dimensions.
In addition to modeling the data in space and time, we are interested in uncovering economic patterns across provinces, allowing for the identification of groups of areal units sharing similar features. To this end, we adopt a model-based clustering approach exploiting tools from the Bayesian nonparametric (BNP) literature. More specifically, we make use of the popular Dirichlet Process (DP) prior (Ferguson, 1973), which avoids the need of fixing the number of clusters in the model. See Quintana and Iglesias (2003) for the relationship between model-based clustering and the DP prior.
The body of literature on Bayesian spatio-temporal models for areal data in regional science and quantitative spatial analysis is large. Pars pro toto, we mention the textbooks by Banerjee et al. (2014), Haining and Li (2020), and Sahu (2022). A widely used class of models for the study of areal data is that of conditional autoregressive (CAR) models, introduced in Besag (1974Besag ( , 1975. In a CAR model, the spatial dependence among the observations is captured by a correlation matrix containing information on the neighboring structure of the data. This approach can be used to define a prior distribution for random effects in a Bayesian hierarchical model. Bayesian hierarchical models are particularly useful in modeling related observations, such as in the case of areal time-varying data, allowing to capture the underlying dependence structures and providing full inference and uncertainty quantification. Different CAR prior specifications can be found in the literature, such as the intrinsic-CAR (ICAR) and Besag-York-Mollié priors (both in Besag et al., 1991). Here, we follow Leroux et al. (2000), who specify a set of spatially correlated random effects within a generalized regression setting. CAR models can be thought of as the conditional specification of a (Gaussian) Markov random field (GMRF). The different CAR models proposed in the literature correspond to a specific choice of the precision matrix of the corresponding GMRF (see Rue and Held, 2005, for more details). Thanks to the Markov property, the precision matrix of a GMRF is usually sparse, enabling efficient computations through algorithms for sparse matrices based on results from linear algebra. An algorithm for block updating for Markov random fields models can be found in Knorr-Held and Rue (2002).
In a Bayesian framework, approaches for dealing with both spatial and temporal dependence in areal data can be found, among others, in Ugarte et al. (2012), Rushworth et al. (2014), Lawson (2018), . Beraha et al. (2021) consider the problem of spatially dependent areal data and propose modeling the data collected for each areal unit through a finite mixture of Gaussian distributions. The spatial dependence is introduced via a joint distribution for a collection of vectors on the simplex, which is a logistic transformation of Gaussian multivariate CAR models. Nicoletta et al. (2022) propose a Bayesian model for spatio-temporal data based on a generalized linear mixed effects model. In the spirit of the current paper, Fischer and LeSage (2015) calculate the posterior predictive probabilities for the assignment of areal units to clusters using a spatio-temporal stochastic panel relationship of economic variables such as income and human capital.
Of particular interest is the model by Lee and Lawson (2016), introducing temporal dependence among areal units through an autoregressive structure on the means of the GMRF over time. The model in Lee and Lawson (2016) allows for clustering by modeling the spatio-temporal parameters through a parametric mixture. The model proposed in this work can be seen as an extension of Lee and Lawson (2016). We introduce location-specific autoregressive parameters and use a nonparametric prior for model-based clustering of the areal-specific time series. In particular, we employ the popular Dirichlet process prior (Ferguson, 1973) to jointly model the effects of the time-and arealspecific covariates on the evolution of the unemployment rates and the autoregressive coefficients used to specify the distribution of the spatially correlated random effects. From a computational point of view, we exploit the fact that the proposed spatio-temporal model can be seen as a GMRF and combine the algorithms in Knorr-Held and Rue (2002) and McCausland et al. (2011) to propose an MCMC algorithm for efficient posterior simulation.
A preliminary version of the model proposed in this work can be found in Cadonna et al. (2019), where the authors study the use of mobile phones in the municipality of Milan with the goal of investigating the local population density dynamics. However, while they use harmonic components which are constant across the areal units, we include timeand areal-specific predictors and cluster the areal units. Moreover, the spatial domain in Cadonna et al. (2019) is a regular bivariate grid of the metropolitan area of Milan, while here the proximity matrix is given by neighboring Italian administrative provinces. We point out that we introduce a Bayesian model for clustering the areal-specific time series of the unemployment rates with cluster estimates that do not vary with time. Time-varying clustering of the provinces would require a different modeling approach, see, for instance, Page et al. (2022), andDe Iorio et al. (2019). Nieto-Barajas and Contreras-Cristán (2014), instead, use subject-specific parameters from a BNP prior to cluster time series of share prices in the Mexican stock exchange.
The main contributions of this paper to the analysis of spatio-temporal areal data are the following: (a) perform model-based clustering of areal time series through the effects of time-and areal-specific predictors as well as CAR-modeled random effects; (b) account for spatio-temporal random effects using the CAR specification; (c) implement an efficient MCMC algorithm based on the GMRF interpretation of the CAR prior by exploiting the sparse structure of the spatial correlation matrix; (d) provide interpretable estimates of the evolution of unemployment rates in Italy and its connection to economic factors; (e) offer a thorough comparison of the proposed model with competitors, illustrating the advantages of the proposed approach. This work shows the merits of Bayesian modeling in terms of interpretability, outperforming existing complex nonlinear models such as the one in Mínguez et al. (2020) and can help economists and policy-makers to better understand the evolution of important macroeconomic quantities.
The remainder of this paper is organized as follows. Section 2 describes the motivating application and introduces a preliminary exploratory analysis of the data under study. In Section 3, we present the spatio-temporal clustering model including random effects and a BNP prior, which induces clustering through a mixture of the spatio-temporal parameters. Section 4 displays a sketch of the MCMC algorithm used for posterior inference and Section 5 presents the results of the empirical application. Finally, Section 6 concludes the paper with a discussion. Appendices A and B provide details on the derivation of the full conditional distributions and the MCMC algorithm. Appendix C shows the results of an extensive sensitivity analysis assessing the effect of the method of loss calculation used to estimate the partition of the areal units. Appendix D provides a comparison of the proposed model with alternative model specifications. Finally, Appendix E presents the results of a simulation study with synthetic data.

Data and exploratory analysis
In this section, we illustrate the data motivating the development of the proposed approach. We base the choice of dataset on Mínguez et al. (2020), who approach the problem of modeling spatio-temporal data using a semiparametric model including P-splines. The authors provide a thorough economic background for the choice of data, which is briefly reviewed here. First, we describe the response variable used in the analysis, the unemployment rate in Italy, and how it varies over time and across provinces. Then, we introduce the explanatory variables and elaborate on their correlations and distinctions between the North and the South.

Unemployment Data
This study focuses on = 110 Italian provinces and on = 13 years from 2005 to 2017. Specifically, we use the third level of the "nomenclature des unitès territoriales statistiques" (NUTS) hierarchy. For each Italian province and each year, provincial unemployment rates are provided by the Italian National Institute of Statistics (ISTAT). More specifically, for province at year , the unemployment rate is defined as , = 100 × , ∕ , , where , is the number of unemployed people and , is the labor force, including both the employed and unemployed. Figure 1 shows the average magnitude of the unemployment rate across Italian provinces. The economic differences between northern and southern provinces are clearly visible. This dichotomy has further increased over the years from 2012 to 2017, as can be observed in Figure 2, displaying the evolution of unemployment rates from 2005 to 2017. Apart from distinctively higher unemployment rates, we additionally see a stronger upwards trend in recent years for the southern provinces. We draw the border between the North and the South according to the first level of the NUTS hierarchy above the regions Abruzzo, Molise, and Campania.
As exploratory measures of spatial autocorrelation we compute Moran's I (MI, Moran, 1950) and Geary's C (GC, Geary, 1954). These can be interpreted as the spatial couterpart of the lagged autocorrelation coefficient. MI ranges from −1 to 1, indicating negative and positive spatial autocorrelation, respectively. Values of GC between 0 and 1 indicate positive, and values above 1 negative spatial autocorrelation. For the raw data, we obtain an average value over all years of MI = 0.78 and GC = 0.21, both values signifying strong positive spatial autocorrelation. Assuming differences between northern and southern provinces, we regress the unemployment rates on the latitude of the provinces and compute both measures on the residuals. The new results are MI = 0.21 and GC = 0.76. We conclude that when controlling for the known dichotomy, the statistics are significantly lower, but still indicate spatial autocorrelation in the data.

Explanatory variables
As mentioned before, the choice of explanatory variables follows Mínguez et al. (2020). Seven explanatory variables are selected for the econometric analysis, representing either so-called equilibrium or disequilibrium factors (Partridge and Rickman, 1997). The variable representing the disequilibrium view is the employment growth rate ( ℎ), defined as the annual change of employment for each province. The remaining variables account for the equilibrium view. In order to capture the economic structure of the individual provinces, we include the share of people working in the economic sectors agriculture ( ), industry ( ), construction ( ), and services ( ). We distinguish between urban and rural areas by including the logarithm of the population density ( ). The participation rate ( ) is defined as the ratio between the total labor force and the working population and serves as a proxy for labor supply. Table 1 shows summary statistics of the variables used in this work.   Figure 3 shows scatter plots, histograms, and Pearson correlations of the unemployment rate and the seven covariates, grouped into northern and southern provinces. Additionally, the Pearson correlation coefficient is displayed for all pairs as well as for their northern and southern parts separately. The figure shows that, while variables like , , or exhibit the described dichotomy, there is no clear distinction between the North and the South for the remaining variables. We observe a negative correlation between the unemployment rate and , and a positive one with . The sign of the correlation can change when looking at the North and South separately, as in the case of or . Furthermore, we see a strong negative correlation between the employment in the industry and service sector. The response and explanatory variables are standardized separately by subtracting the overall sample mean and dividing by the overall sample standard deviation. Due to political reformation in Italy and the abolishment of seven provinces, missing data was imputed using the predictive mean matching method via the R package mice (van Buuren and Groothuis-Oudshoorn, 2011). 1

Model specification
In this section, we introduce a novel Bayesian hierarchical model that allows for the study of spatio-temporal dependencies and uncovering the underlying clustering structure of areal data. The former is achieved by including 1 As a robustness check and to avoid problems with compositional data, we additonally estimated the model on log-ratios of , , and against as baseline economic sector. The results are comparable and show only minor qualitative differences. For this reason and in order to stay consistent with Mínguez et al. (2020), we continue our analysis with the original dataset. random effects via a CAR prior and is described in Section 3.1. The latter is achieved by using a Bayesian nonparametric prior for the areal-specific parameters. Section 3.2 describes this feature and gives a brief overview of the DP.

Likelihood and spatio-temporal random effects
Let = , = 1, … , , = 1, … , , be a matrix of areal observations, such that ∈ ℝ × . Each entry represents the observed value of interest at the -th areal unit at time , which in the application under study corresponds to the unemployment rate in province at year . For each areal unit and at each time point, we consider a set of predictors and encode them together with an intercept term in the ( + 1)-dimensional column vector . For = 1, … , and = 1, … , , we model as follows: where ′ represents the transpose of the vector , the vector of regression parameters for area , and a spatiotemporal random effect, discussed later in detail. Furthermore, we assume areal-specific Gaussian error terms with variance 2 . Let = 1 , 2 , … , ′ be the matrix of predictors at time and let = 1 , … , ′ and = 1 , … , ′ be the vectors of random effects and error terms, respectively. Moreover, let = 1 , 2 , … , be a matrix containing the regression coefficients. Eq. (1) can then be re-written in vector form, for each = 1, … , , as follows: where indicates the main diagonal of the matrix and the observations for all areal units at time . The notation adopted in Eq. (2) is convenient, as it allows modeling of the spatio-temporal random effects at time directly. In this work, we opt for an autoregressive decomposition similar to the one described by Rushworth et al. (2014) and Lee and Lawson (2016). The temporal correlation is induced through the conditional expected value, while the precision matrix induces the spatial correlation. Specifically: is a vector of autoregressive coefficients and ( ) is a diagonal matrix with as its entries. The random effects in Eq. (3) are modeled as a vector autoregressive process of order one, in short VAR(1). The covariance matrix is composed of the scale parameter 2 and the inverse of the matrix ( , ).
From Eq.
(3) it is clear that the choice of ( , ) plays a crucial role in modeling the spatial correlation. The matrix ( , ) is of dimension × and depends on two quantities, namely the scalar parameter and the neighboring matrix . The neighboring matrix ∈ {0, 1} × is application-specific and reflects the contiguity structure of the areal units. Specifically, , = 1 if areal units and are adjacent (i.e., they are neighbors), and , = 0 otherwise. Following Leroux et al. (2000), we define ( , ) = ( ( ) − ) + (1 − ) , where is the -dimensional identity matrix and is a -dimensional vector of ones. Let be the number of neighbors of site . The matrix ( ) − has elements equal to if = , equal to −1 if and are neighbors, and equal to 0 otherwise. The parameter allows modeling the spatial correlation among the random effects: = 0 corresponds to independent spatial random effects, while = 1 corresponds to the ICAR prior. In the latter case, the expectation for the random effect in areal unit at time , namely , conditionally on the previous times and the other areal units, is given by the sum of −1 and the average of the random effects in geographically adjacent areal units at time . It is important to point out that the parameter itself is not a correlation parameter, and its precise interpretation can be difficult. While Lee and Lawson (2016) choose to fix = 1 to enforce spatial smoothing, we assume that follows a beta distribution a priori. The choice of the hyper-parameters of this distribution is specified in Section 5.1.
The joint distribution of̃ = 1 , … , is a GMRF. Specifically, it has mean zero and its precision matrix Ω is tri-block diagonal with blocks of dimension × . See Section A.2 for details on the derivation of Ω. This representation enables the use of known algorithms for posterior sampling of the time varying effects . In particular, we implement the algorithm proposed in McCausland et al. (2011). The efficiency of the algorithm for banded matrices increases with lower bandwidths. As the precision matrix inherits the bandwidth from , we capitalize on this efficiency gain by first reorganizing into a banded matrix and then minimizing its bandwidth using the algorithm of Cuthill and McKee (1969). The choice of the hyperparameters is specified in Section 5.1.

Bayesian nonparametric areal clustering
As mentioned in Section 1, we are interested in detecting which areal units exhibit similar patterns. To this end, we allow for clustering through the inclusion of the DP prior for some of the areal-specific parameters. A process distributed as a DP can be seen as an infinite mixture of point-masses at i.i.d. locations: where ( ) is the Dirac's delta, equal to 1 when = , and zero otherwise. The other elements specifying the mixture are the infinite sequence of locations 1 , 2 , … iid ∼ 0 and the infinite sequence of weights, which follows the stick-breaking construction (Sethuraman, 1994): with Beta( , ) representing the beta distribution with mean ∕( + ) and variance ∕(( + ) 2 ( + + 1)). The distribution 0 is the base measure and represents the mean distribution around which the DP is centered, while the mass parameter > 0 indicates its variability around 0 . This definition of the DP highlights the discreetness of its trajectories, which in turn implicitly induces clustering. Letting = , ∈ ℝ +2 , = 1, … , , and placing a DP prior on the vectors 1 , … , allows for clustering the regression coefficients as well as the temporal persistencies . We write: This can be equivalently represented through the introduction of a vector of areal-specific allocation variables = ( 1 , … , ) ′ and a set of unique values ⋆ = ⋆ 1 , … , ⋆ , with ≤ and such that = ⟺ = ⋆ . This representation defines a partition { 1 , 2 , … } composed of clusters = { ∈ {1, … , }| = }, for = 1, … , . To each cluster corresponds a distinct value of the parameter vector ⋆ . The vector of allocations follows, a priori, a Pólya urn scheme with allocation probabilities: where is the size of the -th cluster before we assign the -th observation, that is the size of { ′ < | ′ = }. This means that the joint prior of can be computed as the product of all conditional distributions of , given 1 , … , −1 , for all = 1, … , . The value assumed by in this conditional distribution is either an "old" value among already observed 1 , … , −1 or a "new" value, corresponding to an additional cluster. The first item is always allocated to cluster 1, i.e. ( 1 = 1) = 1, and it is associated to the first unique value ⋆ 1 . Let us define Conditioning on the allocation variables and specifying priors for the remaining parameters, we can write the full model as follows, for all = 1, … , and = 1, … , : which completes the model description by specifying the prior distribution for the location parameters, 0 ( ⋆ ), and for the hyperparameters 2 , 2 , and . In Eq. (4), Beta (−1,1) ( , ) stands for the transformed beta distribution over the interval (−1, 1) with parameters , > 0, obtained by applying a linear transformation 2 − 1 to a standard Beta (0,1) ( , )-distributed random variable . We denote by Inv-Gamma ( , ) the inverse-gamma distribution with mean ∕( − 1) and mode ∕( + 1), where and are the shape and scale parameter respectively. Moreover, we denote with Gamma ( , ) the gamma distribution with shape parameter and rate parameter . Note that, since we are clustering the areal-specific time series , = 1, … , , the proposed model implies a clustering of time series. Cluster estimates of the areal units are based on the posterior distribution of the vector of allocation variables , obtained minimizing the posterior expectation of a suitable loss function. Some of the most popular choices of loss functions for partitions include the Binder loss function (Binder, 1978), the variation of information (VI, Meilȃ, 2007;Wade and Ghahramani, 2018), or generalizations thereof (Dahl et al., forthcoming). Since the estimated number of clusters largely depends on the chosen method of loss calculation, we conduct an extensive sensitivity analysis that can be found in Appendix C. The flexibility of Bayesian hierarchical models allows for a multitude of similar alternative model specifications, especially regarding the choice of the prior. To this end, we compare possible alternatives and present the results in Appendix D.

Algorithm
Let be the vector of unknown model parameters, that is = ( ⋆ , 2 , ⋆ , , 1 , … , , 2 , , ). To obtain posterior samples from the joint distribution of the parameters, we implement a Metropolis-Hastings within Gibbs sampler. After setting initial values for the parameters ⋆ , 2 , ⋆ , , 1 , … , , 2 , , and , at each iteration of the MCMC algorithm we perform the following steps: 1. Sample the allocation variables , = 1, ..., , through the algorithm described in Favaro and Teh (2013), which in our case amounts to an extension of the popular Algorithm 8 by Neal (2000), that includes a re-use step. Details can be found in Appendix B. 2. Sample the unique values ( ⋆ , ⋆ ) conditioned on all the other parameters and the allocation variables . The full conditional distribution for ⋆ is conjugate and Gaussian, while sampling ⋆ requires a Metropolis-Hastings step. Further details can be found in Appendix B. 3. Sample from a mixture of gamma distributions using the auxiliary variable sampler by West (1992). Details can be found in Appendix B. 4. Sample the spatio-temporal random effects ( 1 , … , ) ′ from a multivariate normal distribution using the algorithm proposed in McCausland et al. (2011). Details can be found in Appendix A. 5. Sample 2 from an inverse-gamma distribution. Details can be found in Appendix B. 6. Sample 2 from an inverse-gamma distribution. Details can be found in Appendix B.

Empirical application
In the following, we analyze the Italian unemployment data using the proposed model and algorithm introduced in Sections 3 and 4, respectively. We carefully elicit the prior distributions used in the model, discuss strategies to choose a representative partition and interpret the implied economic findings. To conclude, we benchmark the proposed approach in terms of out-of-sample forecasting capabilities against a variety of competitor models.
Due to its influential nature on the posterior, in particular on the number of clusters, we do not fix the mass parameter but instead assume that a priori it follows gamma distribution with parameters = 3 and = 2, yielding [ ] ≈ 6.75 and [ ] ≈ 7.32 . These numbers are in agreement with prior information on the number of groups of provinces where the economic development of Italian regions is different, e.g. north-west, north-east, center, south, and the two islands.
The use of a Bayesian nonparametric prior in the proposed model allows for clustering of the areal units (i.e., the Italian provinces). To provide an point estimate of the partition of the provinces, we minimize the posterior expectation of specific loss functions, namely the Binder loss (Binder, 1978) and the Variation of Information (VI, Wade and Ghahramani, 2018). These methods of expected loss calculation are invariant to the label-switching problem and can therefore be applied directly to the posterior samples of .An extensive comparison of these methods is presented in Appendix C. In what follows, we present details based on the chosen partition estimated by minimizing the posterior expectation of the Binder loss function. The comparison metrics in Appendix C confirm that the model retains predictive accuracy under the chosen partition, while the lower number of clusters of mostly contiguous provinces allows for a clearer economic analysis. The partition is displayed in Figure 4.

Posterior Inference
We continue by analyzing the clustering in Figure 4 and its implications on the underlying unemployment differentials. To this aim, we re-run the MCMC algorithm conditionally on the Binder loss partition estimate and summarize the estimated coefficients within each cluster. We discard 10,000 draws as burn-in and keep every 3rd draw of the remaining 15,000 draws, thus using the 5,000 remaining draws for posterior inference. The posterior means of the covariate effects are reported in Table 2. Figure 4 shows that Cluster 1 is predominantly made up of northern and central provinces, constituting the largest cluster. Interestingly, the province Taranto as well as the southern regions Abruzzo and Basilicata are also assigned to this cluster. A possible reason could be that, amongst all southern provinces, these have strikingly low unemployment rates. The second cluster contains provinces which are scattered geographically and three out of eight provinces of Sardinia. The third cluster is made up of a small group of northern provinces like for example Alessandria, Vercelli, and Novara, while the fourth contains a few southern provinces: Carbonia-Iglesias in Sardinia, Napoli, Caserta, Lecce, in Calabria the provinces Cosenza, Catanzaro, and Reggio Calabria, and Messina, and Trapani in Sicily. Cluster 5 contains almost all of Sicily, Foggia, Barletta-Andria-Trani, and Crotone. Cluster 6 is a singleton cluster made up of the province Medio Campidano. Its classification as a single cluster is consistent among most partitions examined in Appendix C and is attributable to the highly negative coefficients of and . Cluster 7 contains the provinces Imperia, Massa Carrara, Frosinone, Oristano, and Ogliastra. We can observe in Figure 1 that they exhibit similar average unemployment rates. Note that it is the only cluster where the coefficients corresponding to the economic sectors are all significantly negative.
One of the most pronounced results are the positive and negative intercepts for the southern and northern provinces, respectively. This result clearly captures the well-known North-South dichotomy, also displayed in Figure 1. In order to facilitate the interpretation of the cluster-specific coefficients, Figures 5 and 6 depict the average unstandardized observed covariates, the corresponding posterior mean̂ and the kernel density estimates of the posterior distribution of . As every cluster has only one set of coefficients, each of the maps depicting the estimated values shows seven different hues of the chosen color.
Observing both a negative and a positive effect of the participation rate might seem ambiguous, but is in line with the literature according to Elhorst (2003). A negative effect can stem from the fact that low participation rates are often coupled with low investments in human capital and low commitment to work, both attributes spurring unemployment.  Logarithm of population density (lpopdens) Figure 6: Visualization of , , ℎ, (top to bottom). Unstandardized covariates, averaged over time (left). Posterior means (middle) and posterior kernel density estimates with 5%, 50%, and 95% quantiles (right) of the corresponding regression coefficients. The positive effect on the other hand could indicate an insufficient job offer. Cluster 2 includes provinces with a somewhat puzzling variety of economic and social characteristics. It encompasses the wealthier Bergamo as well as Torino, where the social fabric of the metropolitan area is speckled, and some southern provinces. The latter, however, exhibit comparably low unemployment rates among the South. As a result, the estimates of the intercept and the coefficients associated to , , and ℎ are close to zero, as can be seen in Table 2 and Figures 5 and 6. Attributes that distinguish Cluster 3 from the rest of the North are most notably the comparably high coefficients of as well as the low coefficient of ℎ. The maps in Figure 6, corresponding to the latter, show that the provinces included in Cluster 3 have moderate values of employment growth and a very low estimate, clearly standing out among the surrounding provinces. These results, together with the higĥ , suggest an insufficient offer in jobs in this cluster. Member provinces of the southern Clusters 4 and 5 are characterized by similar social issues, in addition to high values of the unemployment rates. However, Table 2 shows that the two clusters differ by the effect of the population density on the unemployment rate. The main driver of this distinction is possibly the province Napoli which, as we can observe in Figure 6, has the highest population density among all provinces. In addition, Cluster 5 shows an overall higher level of unemployment.
A one to one comparison of the estimated effects with the results in Mínguez et al. (2020) is not possible, as our model estimates an individual set of coefficients for every cluster and additionally includes an intercept term. However, Section 5.3 presents a comparison in terms of out-of-sample prediction.
The estimated spatio-temporal random effects are depicted in Figure 7. Both the map and the time series show a clear distinction between northern and southern provinces. The map strongly resembles Figure 1, implying that the random effects capture the differing levels of unemployment rates among the provinces. The pattern of the time series resembles Figure 2, showing an upwards trend among all provinces.

Competitor models
In order to assess the performance and accuracy of the proposed model, we compare it to competitor models found in the literature on spatio-temporal data analysis. To ensure convergence and improve mixing for the proposed Bayesian spatio-temporal clustering model (BSTC), we run 25 independent MCMC chains, each initialized to different starting values. After discarding a burn-in of 5,000 iterations and keeping the remaining 4,000 draws per chain, we merge the results and obtain a final sample of 100,000 posterior draws.
In the following, we introduce the details for six increasingly elaborate frequentist models as well as the Bayesian spatio-temporal CAR model including a temporal autoregressive process (ST.CARar) proposed by Lee et al. (2018). The simplest model we use for comparison is a standard linear pooling model (Pooled), where the ( + 1) × 1 vector of coefficients is restricted to be the same across time and space : In order to model province-specific heterogeneity, we also consider splitting the error term , into a province-specific fixed part alongside the i.i.d. random part, obtaining the individual fixed effects (IFE) model (Baltagi, 2021): The estimation of the two models is performed via the R-package plm by Croissant and Millo (2008). Both models can be augmented by a spatial-autoregressive (SAR) component ∑ ≠ , where is the spatial autocorrelation coefficient and is the entry of corresponding to the areal units and . Model (6) then becomes and is termed the Pooled-SAR model. Likewise, Model (7) and it is referred to as the IFE-SAR model. The augmented models (8) and (9) are estimated via the R-package splm by Millo and Piras (2012). In order to incorporate smoothing across the spatial and temporal dimension simultaneously, Lee and Durbàn (2011) propose to explicitly model the interaction between space and time through some function depending on the spatial coordinates 1 and 2 and the temporal dimension . They develop an ANOVA method based on P-splines (PS-ANOVA) with the following model specification: The vector is modeled as a multivariate Normal distribution with zero mean and the identity matrix as covariance matrix, the same prior specification used in the base distribution of the Dirichlet process in the proposed model. Using the data from Section 2, we compute and compare commonly used model comparison metrics, namely the out-of-sample root mean squared error (RMSE) and the out-of-sample mean absolute error (MAE). RMSE and MAE are computed using the 1-year ahead forecast starting from the 5-th year (2009) Table 3: Out-of-sample RMSE of the competing models and the Bayesian spatio-temporal clustering model.
training set. The Bayesian models are additionally compared via the widely applicable information criterion (WAIC, Watanabe, 2013) and the log marginal likelihood (LML). The marginal likelihood is obtained as the product of the one-step-ahead predictive likelihoods (cf. Geweke and Amisano, 2010): where is the vector of all unknown parameters and Θ the corresponding parameter space. The logarithm of the marginal likelihood can therefore be decomposed into the sum of the logarithms of the one-step-ahead predictive likelihoods. To be consistent with the RMSE and MAE evaluation and to avoid excessive prior dependence, we report ( 5∶ | 1∶4 ), i.e., we begin the evaluation in the 5th year, treating data up to the 4th year as part of the prior information.
Following Gelman et al. (2014), the WAIC is obtained by calculating the log pointwise posterior predictive density (10) and then adjusting for overfitting with the correction term (11): where denotes the -th posterior draw from a total of MCMC draws. Note that the metrics for the Bayesian models are computed using samples from the posterior. Therefore, they entail the posterior uncertainty for all parameters including, for the BSTC model, the estimated number of clusters. The results summarized in Tables 3 and 4 show similar patterns for both RMSE and MAE. For the first half of the testing period, no model stands out as the clear winner. The IFE and BSTC models alternate in proving the least erroneous except for the year 2012, where the Pooled model, which does not capture any heterogeneity nor spatial correlation, performs best. Adding the SAR component mostly improves the estimates for the IFE and the Pooled model, and has almost no effect on the PS-ANOVA model. On average, the best frequentist models are the PS-ANOVA and the PS-ANOVA-SAR model, being outperformed only by the BSTC model. Table 5 focuses on the Bayesian models, showing the logarithms of the one-step-ahead predictive likelihoods and the WAIC. Note that higher values of predictive likelihoods indicate better predictions, while higher values of WAIC indicate worse. The results resemble the RMSE and MAE, showing close metrics between the two models in the first years. As before, the BSTC model performs poorly in the prediction for the year 2012 and surpasses its competitor in the following years.   Table 5: Logarithms of the one-step ahead predictive likelihoods, their sum and the WAIC of the Bayesian models.

Discussion and conclusions
We develop a novel Bayesian semiparametric model for spatio-temporal areal data and apply it to the evolution of Italian unemployment rates. The proposed approach aims at tackling the difficult challenge of identifying well-pooled spatial and temporal areal units to identify spatio-temporal patterns in the data. We do so by interweaving elements from spatial statistics and Bayesian nonparametrics.
The proposed model uses a Bayesian nonparametric prior, the DP, to cluster location-specific ( ) autoregressive parameters driving the spatio-temporal random effects and ( ) regression parameters of the time varying predictors. This implies a Bayesian nonparametric model for clustering the areal-specific time series of the unemployment rates. Posterior inference is achieved via a tailored Markov chain Monte Carlo algorithm. The fitted model is consequently used to study the evolution of unemployment rates of Italian provinces from 2005 to 2017. We compute estimates of their clustering structure by minimizing posterior expectations of Binder's loss and VI. In doing so, we obtain interpretable results and shed light on the unemployment structure in Italy. The proposed approach fares well in an extensive out-of-sample comparison against popular alternatives, further validating its applicability to the problem at hand.
The model could be further generalized in different directions. First, it might be of interest to investigate prior distributions that do not imply stationarity. For example, the non-stationarity in time could be modeled by means of a time-varying parameter (TVP) model. To avoid potential overfitting, recently developed shrinkage priors could be used (e.g. Bitto and Frühwirth-Schnatter, 2019;Cadonna et al., 2020). Second, one could employ covariate-informed partition models such as PPMx (Müller et al., 2011), where prior distributions additionally encourage the grouping of areas with similar covariate values. Third, the structure of the neighboring matrix could be treated as random and thus learned from the data. To this end, it might be interesting to consider a prior for the precision matrix of the random effects based on a directed acyclic graph representation of the spatial dependence (Datta et al., 2019;Codazzi et al., 2022).  The generalized variation of information (GVI, Dahl et al., forthcoming) takes the form where | | and |̂ | are the cluster sizes of elements belonging to the true and estimated partitions, respectively. The first term in Eq. (15) represents the negative individual entropy of the true partitition, scaled by the misclassification cost . The second term represents the negative individual entropy of the estimated partitition, scaled by the misclassification cost . The last term represents their joint entropy scaled by the mean of and , ( + )∕2. The range of the individual entropy goes from 0 to 2 ( ) ≈ 6.78, while the joint entropy ranges from 0 to 2 2 ( ) ≈ 13.56. We use the entropy of the sampled partitions to quantify the variability in the posterior distribution of the partition. We observe low values of the individual entropy, with a posterior mean of 2.1 and a 95% credible interval of (1.6, 2.6), indicating low variability in the partitions explored by the MCMC chain. We quantify the average deviation of the chosen partition and the sampled partitions using their joint entropy, resulting in a posterior mean of 5.1 and a 95% credible interval of (3.4, 7.0), suggesting that the partitions explored by the MCMC algorithm are similar to the one presented in the application.
To assess how well the model retains predictive accuracy under the different partitions, we compute the WAIC, in-sample RMSE and in-sample MAE. The RMSE and MAE are calculated by running the model on the whole dataset and comparing the predicted response variablê , = , ̂ ⋆ +̂ , , = 1, … , , = 1, … , , with the corresponding true value. These goodness-of-fit metrics are calculated using posterior averages of̂ ⋆ and̂ ⋆ , conditionally on the partitions estimated using different loss function specifications (Binder and GVI loss, with varying and = 1). All estimated partitions are computed using posterior draws of the BSTC method described in Section 5.3. The results are summarized in Table 6, where lower values of WAIC, RMSE, and MAE imply higher predictive accuracy. The partitions obtained with _ are the most consistent with regard to the estimated number of clusterŝ , yielding 7 or 6 clusters in the majority of cases. The other two methods show more sensitivity to the choice of , with rapidly decreasing the number of clusters for higher values of . Note that in most cases, the comparison metrics favor partitions with a higher number of clusters. However, too many different clusters exacerbate meaningful economic interpretation. For this reason, the choice of a representative partition requires a trade-off between a number of clusters that is high enough to capture individual heterogeneity and sparse enough to be economically interpretable.

Appendix D Alternative model specifications
This section shows the results of a performance comparison of the model proposed in Section 3 with alternative specifications. In particular, (a) considers a finite-dimensional version of the nonparametric prior, while (b) focuses on different options to include the DP as a prior for the parameters and . All other prior specifications not concerning these two modifications are kept as in Section 5.1.
In (a), we use a finite mixture model, instead of an infinite one. Miller and Harrison (2018) point out that both Dirichlet process mixtures (DPMs) and mixtures of finite mixtures (MFMs) share several properties essential in the implementation of state-of-the art sampling algorithms and use this to introduce an alternative version of Neal's popular Algorithm 8, suited for MFMs. By applying their method, we estimate a finite analogue of the proposed model (BSTC-MFM). We follow Miller and Harrison (2018) and choose a geometric prior with success probability equal to 0.1 for the number of components.
Turning to (b), we examine a model where we employ one DP prior to cluster the coefficients and a separate one for the temporal autoregressive parameter . In other words, we examine a model where and are clustered using independent DP priors. This approach is termed BSTC-2DPs. In addition, we examine a model where is removed from the clustering process altogether, and instead modelled via a parametric prior. This approach is termed BSTC-XI. In both cases, the Beta (−1,1) ( , ) prior is used, in the first case as the base measure of the now separate DP prior and in the second one as the parametric prior.
While we acknowledge that a rigorous comparison of different prior specifications is beyond the scope of the current work, these experiments can nevertheless shed light on the model's sensitivity to prior modifications. To this end, we compare the different model specifications by analyzing different partitions obtained from samples of each respective model. The MCMC sampler is run for the same number of iterations and burn-in as specified in Section 5.2. Table 7 shows some essential characteristics of the partitions obtained by applying the loss-based methods introduced in Section C with = 1 to MCMC samples from the alternative models. We compare the estimated number of clusterŝ , the size of the biggest cluster, located in the North and Center, the size of the cluster predominantly covering Sicily, and whether the province Medio-Campidano was classified as a singleton cluster. Since a decisive part of the parameter estimation depends on the visited partitions, these characteristics together with the comparison metrics reported in Table 8 serve as a proxy for measuring the similarities between the alternative models. Additionally, we compute the WAIC, in-sample RMSE and in-sample MAE on the complete dataset, again using the sampled posterior values and no summarisation method.   The estimated number of cluster̂ , the size of the Sicilian cluster as well as the inspected singleton cluster hardly differ among the different models. Regarding the size of the biggest cluster, we see some variation between model specifications, but varying strongly between the chosen method of loss calculation. The results show strong consistency in estimating the partition of the provinces among the alternative model specifications. Table 8 underlines this consideration, showing almost no distinction in goodness-of-fit metrics between the models. Each of the examined alternative model specifications provides a different and valuable view on the underlying process. Nevertheless, the analysis shows that their impact on the main aspect of this work, i.e. the model based clustering of the Italian provinces, is small.  uniform prior for . We run the MCMC sampler for 10,000 iterations and discard the first 5,000 as burn-in. Figures 9 and 10 show posterior inference for one of these datasets. We observe in Figure 9 that the sampler visits up to 12 clusters but coincides with the true value almost 80% of the times. The density plots show that the true coefficients are accurately recovered, while the trace plots imply good mixing and convergence.
To guarantee robustness of the results, and in addition to compare the DP process prior with the MFM alternative, we simulate 50 replicated datasets using different random seeds and estimate both the BSTC as well as the BSTC-MFM specification on these data. Figure 11 displays the estimated number of clusters over the 50 replicate datasets, obtained using the Binder loss function with misclassification cost = 1, for both models. We can observe that the MFM model recovers the true number of clusters slightly more often than the DPM model, but also deviates further from it, sometimes estimating nine or ten clusters. Although both models find 11 clusters in one case, they do so for the same dataset which seems to prove a particularly difficult instance of the data generating process. It is important to emphasize that out of 50 different datasets, the two models agree on the number of clusters in 38 of them and only deviate by one cluster in the majority of the remaining cases. Overall, both models show very similar cluster estimates, even though the prior distributions for the number of clusters are not identical (cf. Frühwirth-Schnatter and Malsiner-Walli, 2019).