Computational Statistics and Data Analysis

A hidden semi-Markov-switching quantile regression model is introduced as an extension of the hidden Markov-switching one. The proposed model allows for arbitrary sojourn-time distributions in the states of the Markov-switching chain. Parameters estimation is carried out via maximum likelihood estimation method using the Asymmetric Laplace distribution. As a by product of the model specification, the formulae and methods for forecasting, the state prediction, decoding and model checking that exist for ordinary hidden Markov-switching models can be applied to the proposed model. A simulation study to investigate the behaviour of the proposed model is performed covering several experimental settings. The empirical analysis studies the relationship between the stock index from the emerging market of China and those from the advanced markets, and investigates the determinants of high levels of pollution in an Italian small city.


Introduction
Financial, economic, environmental and medical data are often analysed through linear regression models aiming at identifying determinants (covariates) which explain differences between observations. Nevertheless, the traditional regression models are based on assumptions that are not close to the real behaviour of the data at hand, often characterized by specific features like skewness, heavy tails and heterogeneity. Addressing these data features has led to a large number of proposal to properly model different distributional shapes (Rigby and Stasinopoulos, 2005;Stasinopoulos et al., 2018;Punzo et al., 2018). When the interest lies not only in the centre of the response distribution and/or when the observed data may include some outliers, quantile regression represents an interesting alternative to standard mean regression. Indeed calculating the impact of covariates on different parts of the response variable distribution may give insights eventually missed if only the mean is evaluated; see e.g. Koenker and Bassett Jr (1978), Marino and Farcomeni (2015), Bernardi et al. (2018) and Waldmann (2018) for reviews and introductions to quantile regression.
Standard quantile regression methods assume homogeneity across observations, all observations are considered generated by the same probability distribution model. In some scenarios, however, it is of substantive interest to learn if the probability distribution model generating the data, i.e. the data generating mechanism, is different across observations (Bernardi et al., 2012;Miljkovic and Grün, 2016;Punzo et al., 2018). To capture heterogeneity across observations it is common to assume that latent classes/states are present. In this case, each state can be characterized by a different data generating mechanism (McLachlan and Peel, 2000; Bartolucci et al., 2013;Zucchini et al., 2016). All observations within the same state are characterized by the same data generating mechanism.
During the last few years, the basic homogeneous quantile regression model (Koenker and Bassett Jr, 1978) has been extended to deal with heterogeneity by introducing mixed-effects models Bottai, 2007, 2014) and finite mixtures (Merlo et al., 2021;Del Sarto et al., 2019;Alfò et al., 2017;Wu and Yao, 2016). When time series data are modelled, time-dependent heterogeneity is a crucial data feature to be modelled. Farcomeni (2012) introduces hidden Markov-switching quantile regression models, also known as hidden Markov quantile regression models, where heterogeneity is modelled through a discrete hidden distribution, such that this distribution evolves over time according to a first-order finite-state hidden Markov chain. A similar approach is considered also in Ye et al. (2016) and extended in Marino and Alfó (2015), Marino et al. (2018), Liu and Luger (2018) and Adam et al. (2019).
The general hidden Markov-switching (HMS) regression model is a bivariate discrete-time stochastic process {S t , Y t } t≥0 such that (1) {S t } is a first-order finite-state Markov chain, i.e. Pr(S t | S t−1 , . . . , S 1 ) = Pr(S t | S t−1 ); (2) {Y t } is a sequence of conditional independent random variables given {S t } and a set of exogenous covariates, such that the parameters of the distribution of Y t are linked to the exogenous covariates via regression.
However, assumption (1) may be too strong to hold in practice, as it indicates that the current latent state depends only on the most recent latent state in the past; beyond that, it is memoryless. Moreover, the number of consecutive time points that the Markov chain spends in a given state, the so-called sojourn distribution, follows a geometric distribution. Bulla and Bulla (2006) and Maruotti et al. (2019) illustrated some drawbacks of the HMS approach and overcome the lack of flexibility of the HMS model to model the sojourn distribution by using a hidden semi-Markov-switching (HSMS) model (Barbu and Limnios, 2009;Yu, 2015;Bulla et al., 2010), where the sojourn distribution is explicitly modelled rather than implicitly assumed to be geometric.
Motivated by these considerations, in this work we introduce the hidden semi-Markov-switching quantile regression, which has the work of Farcomeni (2012) as special case. Our proposal is characterized by the fact that the sojourn distribution of the process could be arbitrary compared to its Markov-switching counterpart. The greater flexibility obtained employing hidden semi-Markov-switching models carries a computational cost; indeed these models are more demanding to apply than are Markov-switching ones. To reduce the computational burden that often limits the use of HSMS models, we estimate the sojourn probabilities non-parametrically, following Langrock and Zucchini (2011). By introducing a time-series data model with heterogeneous regression coefficients, we are able to capture many attractive features of the data which models currently applied in economics and environmetrics are not often able to grasp. Moreover the time-dependent heterogeneity is captured by using heterogeneous regression coefficients, allowed to vary across quantiles. Up to our knowledge, this represents the first attempt to introduce quantile regression into a semi-Markovswitching setting for the analysis of time series data. Of course, the proposed approach is developed for data showing time-varying heterogeneity and can be viewed also as a further contribution on clustering time-dependent data, where the interest is in the evolution over time of the clustering structure underlying the data at hand (Maharaj et al., 2019;Bartolucci and Farcomeni, 2010;Berchtold and Raftery, 2002).
For inferential purposes, we propose to estimate model parameters through maximum likelihood assuming the Asymmetric Laplace (AL) distribution (Yu and Moyeed, 2001) as inferential tool. We will show how to calculate the observed and the complete data likelihood function considering the observed covariates and hidden states. This can be achieved by slightly modifying the Baum-Welch algorithm, widely used in the HMS and HSMS literature.
We illustrate the proposal by a large-scale simulation study to investigate the empirical behaviour of the proposed approach with respect to several factors, such as the number of observed times, the sojourn distribution and different quantiles as well as different distributions for the error terms. We further apply the proposed methodology to two datasets from different fields. For the first application, we consider pollution data in an Italian urban area; while the second one deals with the emerging financial market of China.
We model the hourly concentrations of PM 2.5 (i.e. the particulate matter with aerodynamic diameter ≤ 2.5) measured in Battipaglia, in Italy, as a function of some atmospheric variables. In urban areas, the study of pollutant concentrations is of high interest since negative effects of air pollution have been observed on human health (Lee and Sahu, 2016); this has, in particular, been reported for major pollutants such as particulate matter. Local and atmospheric sources of pollution may influence pollution concentrations either directly, indirectly, or both. Empirical and theoretical studies applied sophisticated strategies to link pollution with (observable and unobservable) atmospheric and other sources (Lagona et al., 2011;Park et al., 2001) and modelled the fluctuations of concentrations over long periods. Often pollutant concentrations show extreme values (Martinez-Zarzoso and Maruotti, 2013;Merlo et al., 2020) and it is rather important to properly understand if any atmospheric conditions favour the increase of pollution. Moreover, pollution data often comprises periods of flat stretches (also known as background concentrations) and sudden bursts. Therefore, not only the identification of different levels of exposure, but also the investigation of their transition patterns is of interest. These aspects are addressed by the employed HSMS model. Results show that wind speed plays the main role in reducing pollution, and its effect is time-varying and non-linear. The relative humidity also plays a role, while the temperature is not significant.
For financial data, it is well known that the behaviour of the tails of the distribution are important for investment decisions. In this contest the use of quantile regression is quite appropriate and indeed several standard measure of market risk are based on quantiles (see for example Jorion, 1996). In particular, the use of quantile regression avoids strict assumptions such as normality or i.i.d. of the returns. Moreover, it models non-linearity over different quantiles without setting up a complex non-linear parametric framework and, compared to least squares methods, the quantile regression method is more robust to non-Gaussian, leptokurtic, and heavy-tailed distributions, which are common for financial variables (Maruotti et al., 2019). In the financial data investigation we try to detect possible contagion effects of advanced markets on the Chinese stock market exchange at different quantile levels. Daily returns are modelled via hidden semi-Markov-switching quantile regression, to allow for structural changes (Xiao, 2009), also considering stock indices from several mature markets on the Asian Pacific rim as possible covariates. Results show the financial market integration is found for some quantiles between China and the other countries.
The remainder of the paper is as follows. In Section 2, we specify the proposed model. Section 3 provides the computational aspects of the adopted maximum likelihood algorithm. In Section 4, we show evidence of the performance of the proposal under different data generation schemes by means of a simulation study. We present two empirical applications on financial and pollution data in Section 5.2 and 5.1 respectively. In Section 6, we conclude pointing out some remarks, along with drawbacks that may arise by adopting the proposed methodology.

Methodology
Let Y t , t = 1, . . . , T denote a real-value observation and S t be the state of a finite-state semi-Markov chain, defined on the state space {1, . . . , k, . . . , K }, at time t. A hidden semi-Markov model (HSMM) is a particular kind of time-dependent mixture. The underlying hidden process {S t ; t = 1, . . . , T } is constructed as follows. A homogeneous Markov chain with K states models the transitions between different states, with initial probabilities π k = Pr(S 1 = k) and transition with ∑ K k=1 π k|j = 1 and π j|j = 0, i.e. the diagonal elements of the transition probability matrix are zeros. In (1), k refers to the current state, whereas j refers to the one previously visited; this convention will be used throughout the paper. We collect the initial probabilities in the K -dimensional vector π, whereas the time-homogeneous transition probabilities are collected in the K × K transition matrix Π.
We explicitly model the sojourn-time distribution, associated with each hidden state i.e. the time the hidden process S t spends in the kth state defined as (2) Hence we assume that the state occupancy distribution is concentrated on finite sets of time points.
The combination of a Markov chain, modelling state changes, and the explication of the sojourn-time distributions defining the hidden process, illustrate the main difference between the HMS model (in which the sojourn-time distribution is implicitly geometric) and the HSMS model. It is worth noting that the semi-Markovian hidden process {S t } of a HSMS model does not have the Markov property at each time t, but is Markovian at the times of state changes only. In most of the empirical settings, we have also a sequence of P exogenous covariates X t = {X t1 , . . . , X tp , . . . , X tP } explaining the sequences of response random vector Y t . In such a case, a hidden semi-Markov regression model is possible. Moreover, since we are facing financial and environmental time series data, it may be interesting to understand the behaviour of the process around the tails of the distribution, so for this reason it may be useful to embed the problem in a quantile regression framework. In particular for each hidden state k and at a given quantile level τ , we assume the following quantile regression model with β k (τ ) being a vector of state-specific regression coefficients, x * t = (1, x t ) the set of covariates accounting for the intercept and ϵ tk (τ ) is the error term whose τ th conditional quantile equals zero.
It is well known in the literature that the univariate quantile regression approach has a direct link with the Asymmetric Laplace (AL) distribution. That is, the loss function minimization used in quantile regression to estimate regression parameters (Koenker, 1996) is equivalent (in terms of parameter estimates) to the maximization of the likelihood associated with the AL density; see, e.g., Yu and Moyeed (2001). Therefore, the AL distribution could offer a convenient device to implement a likelihood-based inferential approach when dealing with quantile regression analysis. In our contest the AL distribution becomes where σ is a scale parameter, β ′ k (τ )x * t is the quantile at state k and ρ τ (v) is the quantile loss function given by The selection of the most appropriate sojourn distribution is instead a complicated problem. Possible solutions may be assessed but the comparison of the empirical sojourn times with a theoretical distribution can lead to unsatisfactory detection. We set all the latent state duration densities to be discrete non-parametric distributions with arbitrary point mass assigned to the feasible duration values so as to allow for the most flexibility (Langrock and Zucchini, 2011). This approach is particularly useful in cases where we do not know the sojourn distribution. We can estimate a nonparametric sojourn distribution, perhaps as an initial step before deciding on a parametric distribution, see e.g. Sansom and Thomson (2001). However, the use of a non-parametric sojourn allows to represent any given sojourn distribution and to approximate any parametric sojourn distributions accurately by choosing the maximum times spent in state k, say U k , sufficiently large. This comes at the price of using a large number of parameters, but not of the computational times (see for example Langrock and Zucchini (2011)). The model selection procedure becomes a bit more complex as not only the number of hidden states should be selected, but also the U k values, which may differ across states. However, in practice, U k = U = 30 is large enough in most empirical situations. The total number of parameters is thus given by , which corresponds to the number of independent transition probabilities collected in Π; • the sojourn distribution probabilities, K × (U − 1).
Of course it is not reasonable to try to estimate the initial distribution from just one observation at time 1, especially as the state of the semi-Markov chain itself is not observed. As discussed in Levinson et al. (1983), the maximum likelihood estimator of initial distribution is a unit vector: one entry is 1 and the others 0. Following Leroux and Puterman (1992), no matter which starting values we use for the initial distribution probabilities, it approaches the unit vector.

Likelihood inference
Once the number of latent states K has been assigned/fixed, the algorithm basically works on the complete-data likelihood where s * r is the rth visited state, u r is the time spent in that state (i.e. the duration of the rth visit), R − 1 is the number of state changes up to time T , and ϑ designates the vector of all parameters. In his paper Guédon (2003) proposed to use of the survivor function for the sojourn time in state k. To estimate ϑ, we use the Baum et al. (1970) version of the Expectation-Maximization (EM) algorithm, though alternatives are available in the literature (see e.g. Rydén, 2008;Bulla and Berzel, 2008;MacDonald, 2014). For example, direct numerical maximization of the likelihood using Newton-type algorithms generally converges faster than the EM algorithm, especially in the neighbourhood of a maximum; however, it requires more accurate starting values than the EM to converge. Similarly, Gibbs sampling and other MCMC algorithms, working in a Bayesian context, are likely to require less computation time, but this approach is not the one adopted throughout the paper. The EM algorithm instead provides an iterative solution to maximum likelihood extensively used and preferred when latent variables are presented and when the likelihood evaluation is prohibitively expensive. The EM algorithm alternates the following steps until convergence: E-step : compute the conditional expected value of the complete-data log-likelihood given the observed data andθ, the current estimate of ϑ; M-step : maximize the preceding expected value with respect to ϑ.
In our case with the E-step we calculate the expected complete data log-likelihood given the observed data. In (7), the expectation operator E has the subscript ϑ (h) to explicitly convey that this expectation is being effected using ϑ (h) for ϑ. To calculate (7) it is required the calculation of the following three quantities: 1. the probability of being in state k at time t given the observed sequence; 2. the probability that the process left state j at time t − 1 and entered state k at t given the observed sequence; 3. the expected number of times a process spends u time steps in state k, The first two quantities, γ (h) t (j) and ξ (h) t (j, k), can be calculated via a dynamic programming method known as the forward-backward algorithm, see Appendix.

Operational aspects
The likelihood equation will likely have multiple roots corresponding to local maxima. The EM algorithm should be applied from a wide choice of starting values in any search for the maximum, aiming at excluding the so-called spurious maxima. The problem of choosing reliable initial values for the parameters has been often overlooked in the HMSM and HSMSM literature, though few papers recognize potential issues in likelihood inference due to poor initialization of the parameters. Dunmur and Titterington (1998) investigate the failure of reaching the maximum for the EM algorithm, and to this end a simple binary Markov chain structure is studied, which allows detailed exploration of the parameter space. As a result, the choice of initial conditions is found to be highly influential. Bartolucci and Farcomeni (2009) suggest a deterministic rule to initialize the parameters and, then, perform a sensitivity analysis by randomly perturbing those obtained by the deterministic rule to investigate the EM behaviour. Leroux and Puterman (1992) propose a simple procedure for automatically generating good starting value based on data partitions. Lagona et al. (2015) notice that the convergence to spurious maxima is fast and can be detected by monitoring mixing proportions. As remarked by Maruotti and Rocci (2012), a key factor in the fitting of H(S)MSMs is the accuracy of the estimate of the Markov chain parameters; the remaining parameters can be estimated from the inferred partition in accordance with the estimates of the Markov chain parameters. A comparison of several initialization strategies is given in Maruotti and Punzo (2021).
Here, to fit the HSMM with K states, we initialize the EM algorithm by providing the initial parameters, say ϑ (0) , to the first E-step of the algorithm. The initial parameters are computed in the following way. An initial state partition, say . Finally, as concerns the initial parameters β (0) k and σ (0) k , the quantile regression is fitted on the observations in state k. Several strategies can be followed to obtain the initial state partition Here, we consider a random partition mixed with a short-EM strategy. We classify observations into K clusters according to a Multinomial distribution with parameters 1/K . From the generated partition we obtain ϑ (0) as described above. Each run of the EM algorithm is ''short'' because it is executed for a small number of iterations, without waiting for convergence. Then, the EM algorithm is run from the parameter vector ϑ (0) providing the largest likelihood from these short runs of EM. Starting the algorithm with good initial values certainly helps reduce the computational cost. The E-and M-steps are alternated repeatedly until the following difference changes by an arbitrary small amount, i.e. a stopping criterion based on the sequence of likelihood values ℓ (r) , r = 1, . . . is considered. Since ℓ (r+1) > ℓ (r) , convergence is obtained with a sequence of likelihood value, bounded above.
The algorithm is defined for K fixed. Various authors have discussed algorithms for joint estimation of K and model parameters in a finite mixture context (Wang, 2010); a possible solution is to update estimates for a fixed K and improving step by step. Once the algorithm has reached its end for a given K , we can proceed to increase the number of components to K + 1 and estimate corresponding model parameters. After that, a formal comparison could be done using penalizedlikelihood criteria (such as AIC or BIC) or by employing a formal likelihood-ratio test. In the latter case, the number of states is selected by bootstrapping the likelihood ratio test (LRT) statistic for different number of components, since usual regularity conditions for the asymptotic distribution of the LRT statistic do no hold.
The estimation procedure outlined above does not produce standard errors of the estimates, because approximations based on the observed information matrix often require a very large sample size. The parametric bootstrap scheme can also be used to approximate the covariance matrix of parameter estimates to obtain reliable estimation of confidence bands (Visser et al., 2000). We further investigate the behaviour of the parametric bootstrap under the considered setting in the simulation study.

Simulation study
To evaluate the behaviour of the proposed model, a simulation study has been conducted under several different scenarios. In all scenarios, we simulated 500 samples from a two-state HSMS model with varying parameters. In the different scenarios we have analysed three different experimental factors: the sample size (T = 500; 1000), the sojourn distribution (non-parametric; Geometric; Shifted Poisson, Shifted Negative Binomial) and the distribution of the error terms (standard Gaussian, skew Student's t with five degrees of freedom). We further provide results with respect to the HMS approach.
Different sojourn distributions are considered. Firstly, we generate sojourns from then we simulate from a shifted-Poisson distribution, i.e.
is also considered, with p 1 = p 2 = 0.2. The latter corresponds to the Markov-switching quantile regression introduced by Farcomeni (2012). For each chosen combination of sample size, sojourn distribution and error terms distribution we generate 500 datasets and fit our HSMS model and its HMS counterpart. To evaluate model performance in recovering the true model structure, we compared the models on the basis of their ability to accurately estimate the slopes of statedependent regression coefficient and the latent process parameters, summarized looking at the goodness of classification measures by the Adjusted Rand Index (ARI; Hubert and Arabie, 1985). We measure the accuracy by reporting point estimates and standard errors of the estimators. We started each run from a randomly chosen set of parameters as described in Section 3.1 and stopped the algorithms when the increase in the log-likelihood at two consecutive iterations was less than 10 −4 . We report the goodness of classification in Figs. 1 and 2. The inferred classification is obtained via a simple maximum a posteriori rule, also known as local decoding. Firstly, we notice that a slightly better classification is estimated for the HMS under the Geometric sojourn (see the bottom part of Fig. 1). This is not surprising, as an implicit assumption of the HMS is indeed the Geometric sojourn distribution. The HSMS approach so far proposed provides an approximation to any sojourn distributions and its superiority in terms of goodness of classification is rather clear if we compare the two approaches under the non-parametric sojourn. We further consider Shifted Poisson and Shifted Negative Binomial sojourns (see Fig. 2). The HSMS approach provides slightly better ARI values, but the HSM approach is still comparable in terms of classification performance. We further notice that the errors distribution also plays a crucial role in recovering the underlying classification. As expected, assuming Skew-t errors, i.e. heavy tails, makes the classification a bit fuzzier. At last, the goodness of classification is also partially quantile-specific, i.e. the ARI is better at the median and a bit worse if quantiles in the tails are modelled.
In Tables 1 and 2 we report the estimated of the slopes (as in Farcomeni, 2012) at different quantiles (τ = 0.1; 0.25; 0.5; 0.75; 0.9) for all combinations of sojourn and error distributions. Looking at these tables, we may notice that the precision of the estimates is higher at the centre of the distribution when compared to other quantiles. Such differences are due to the reduced amount of information at the tails of the distribution. This effect tends to decrease slightly with increasing sample size (i.e. for T = 1000, not reported here for sake of brevity). Overall, the recovering of the data generation parameters by the HSMSM is satisfactory, no matter what sojourn distribution is considered. As a further remark, better goodness of classification leads to more accurate parameters estimates, in line with the findings of Maruotti and Rocci (2012).
We check how well the parametric sojourn distributions parameters are estimated by the non-parametric approach so far proposed (see Table 3). We adopt a two-step procedure. Firstly, we determine the state membership via local decoding; then we fit the inferred sojourn distributions and obtain the provided estimates by maximum likelihood. The results are mixed for T = 500. The non-parametric approach is able to reasonably approximate shifted Poisson and Geometric sojourns, for all quantiles and error distributions, whilst its performance is a bit worse if the sojourns follow a shifted Negative Binomial distribution. These estimates improve for T = 1000, i.e. increasing the sample size (results are not reported here and available upon request). This part of the simulation study aims at showing that there might also be some drawbacks and/or weaknesses in using a non-parametric approach with small sample sizes; while it is well-known that its performance increases with the sample size.
At last, we investigate a specific computational aspect related to the estimate of the standard errors. As mentioned before, the EM algorithm does not provide standard errors of the estimates. In the empirical applications, however, we need them to make inference on the regression parameters. An estimation of the regression parameters can be accurate and precise, but if the associated estimation of variance is poor, then coverage by the 95% confidence interval may falsely indicate poor estimation by the point estimator, that is, the point estimator may result in a poor coverage rate. For each of the simulated datasets, using the estimated parameters, 100 more datasets of the same size were simulated. Then, the corresponding standard errors are computed and then used to compute the average standard deviation. We report the results about the HSMS model, with Gaussian errors, in Table 4, along with the 95% coverage probabilities. We compare the standard errors of the estimates and the average standard errors obtained from the parametric bootstrap approach. The parametric bootstrap performs very well and represents a more than reasonable choice for the estimation of the uncertainty surrounding the regression parameters estimates. Its implementation is straightforward, though time consuming, and its behaviour does not qualitatively vary for the skew-t error case.  Table 2 Estimated coefficients (Est.) and standard errors (Std.Err.) over 500 simulations with T = 500.

The analysis of pollution data
The data considered in this example originate from a regional monitoring network system developed in Italy. The data are collected in their raw form, but this leads to difficulties in communicating the overall exposure conditions. Therefore, quantitative methods (models/summary statistics) with the ability to condense the large volume data into a small number of categorical summaries are required. Exposure conditions are, in our case, identified by the hidden states and the corresponding conditional distributions, which provide a condensed overview of their characteristics via the respective concentration profiles. Moreover, we link pollution levels to observed atmospheric variables, which improves the characterization of the different levels of exposure. Atmospheric variables may play a major role in determining the level of exposure to particular pollutants, but also for capturing time dependence as proxies for daily variations. Data are displayed in Fig. 3. From the visual perspective, it is apparent that patterns of non-linear relationship can be detected and that the correlation structure is rather heterogeneous. Existing analyses relate the mean of (log-transformed)  concentrations to atmospheric variables, but linear regression is known to perform badly with heavy-tailed and skewed data, also implicitly imposing restrictions on the entire distribution. By using quantile regression, instead, we are able  to model the entire distribution of PM 2.5 concentrations, focusing on the highest quantiles, where exposure to high concentrations may represent a risk for human health. The dependent variable is the hourly average of PM 2.5 concentrations, which is known to be highly skewed, i.e. we observed several PM 2.5 low concentrations with a very long right tail. The recorded concentrations may strongly be influenced by temperature, relative humidity and wind intensity. The association between PM 2.5 and atmospheric variables constitutes a crucial aspect of the analysis as well, and should not be neglected or treated as a nuisance. Here, air temperature (Temp), relative humidity (Humidity) and wind intensity (Wind. Int) are considered as potential explanatory variables. Thus, we consider the following empirical model The AIC and BIC do not select the same number of hidden states. The BIC is in favour of the 3 States model, while the AIC is increasing up to 5 states (and maybe more), see Table 5.
We comment on the BIC-chosen model. The three states correspond to different levels of (increasing) exposures. State 1 indicates a safe condition, with low pollution. State 2 is the alarming state, because though pollution is not at a warning level, there is a probability of around 0.3 to move to State 3, according to the estimated transition probability matrices. State 3 identifies at-risk exposures, with high and very high pollution. Over the considered period, i.e. 1/1/2019-31/1/2019, State 3 is visited only the 13% of hours, while State 1 and State 2 are visited 49% and 38% of hours, respectively.
We summarize the estimated quantile-specific regression coefficients in Fig. 4. At first hand, it is a general belief that temperature, influencing air stability, would play an important role in particle accumulation, causing an increase in the concentration of PM2.5 and probably further aggravates haze pollution. This is not the case in our example. The temperature plays no role in estimating the quantiles of the PM 2.5 concentrations. On the other hand, the relative humidity and the wind intensity both have significant effects for most of the quantiles in all the 3 hidden states. There are large gaps in understanding the relationship between relative humidity and PM 2.5 concentrations, mainly due to the uncertainty about PM 2.5 sensitivity to relative humidity (Wang et al., 2015;Du et al., 2016;Zhu et al., 2016). The estimated effect of relative humidity is quantile-specific, being non-linear, and state-specific, as the magnitude of the effects varies across states. While relative humidity strongly contributes to the accumulation, development and sustained growth of PM 2.5 concentrations, its effect is partially alleviated for extreme pollution. This is because an increase in the water in the air could enhance the condensation of water and the formation of rainfall, all of which in turn would mitigate PM 2.5 concentrations, though increasing the concentrations to a certain extent. As expected, wind intensity plays the major role in reducing pollution, in particular in at-risk situations as those identified by State 3. Wind-blowing is often the only way to reduce pollution from high-risk situations to more affordable ones.
At last, we checked for the usefulness of the non-parametric estimation of the sojourn probabilities using the fast and accurate method proposed by Dimitrova et al. (2020) to compute the cumulative density function of the Kolmogorov-Smirnov statistic when discrete functions are considered, obtaining the exact p values of the Kolmogorov-Smirnov test. Table 6 shows the p-values of the Kolmogorov-Smirnov test statistic. This is rather useful to get more insights on the sojourn distributions. Each state has different sojourn characteristics. It is rather clear that the Shifted Poisson is not a good candidate for the data at hand, i.e. the sojourn times show overdispersion. The geometric and Shifted Negative Binomial may be reasonable at some quantiles to model the sojourns in State 1 and, partially, in State 2, but not in State 3. So, considering a unique and common (to all quantiles and states) sojourn distribution may be too restrictive. It is not surprising that different sojourns apply to different pollution conditions (Lagona et al., 2011). The non-parametric approach allows for a lot more flexibility. Used jointly with the Kolmogorov-Smirnov test can also be used to select a proper parametric sojourn distribution, if the case.

The analysis of the Shangai stock exchange composite index
In this section, we illustrate the practical relevance of our methodology by modelling the Shanghai Stock Exchange (SSE) index and those from the advanced markets such as the United States (SP500) and Japan (NIKKEI). This study also considers stock indices from several mature markets on the Asian Pacific rim, including Hong Kong (HANGSENG), Korea (KOSPI), Singapore (STRAITS), Australia (ASX), and New Zealand (NZAO). We use T = 3849 daily returns calculated, spanning the period from January 3, 2005, to October 2, 2019. For each stock market index, daily returns are computed as [log(P t ) − log(P t−1 )] * 100, where P t is the closing price on day t. In line with the ideas discussed in Ye et al. (2016), we consider the following model: Returns SSE,t = β 0k (τ ) + β 1k (τ )Returns SP500,t + β 2k (τ )Returns NIKKEI,t + β 3k (τ )Returns HANGSENG,t + β 4k (τ )Returns KOSPI,t + β 5k (τ )Returns STRAITS,t + β 6k (τ )Returns ASX ,t + β 7k (τ )Returns NZAO,t + ϵ tk (τ ) with Returns SSE,t corresponding to returns at time t for the SSE index, and similarly for the other indexes. As it is well known, financial data are characterized by stylized facts like heavy tails, skewness, presence of outliers (Bulla, 2011); this makes the selection of an appropriate parametric distribution challenging. For these reasons a quantile regression framework may be an appropriate modelling approach to be considered. Moreover the presence of changes in volatility has a fundamental impact on the behaviour of returns (Rydén et al., 1998;Ang and Timmermann, 2012;Maruotti et al., 2019).
With the rapid growth China, many investors would certainly consider investing in this market rather than in the advanced or developed markets. However, the question of whether the Chinese market is integrated with other stock markets so that investing in China will provide the benefit of diversification is a major concern for investors. This lead to investigate whether SSE is interdependent with other stock markets.
Comparison of models with differing number of states, and thus parameters, is accomplished via the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), see Table 7. The two chosen model selection criteria lead to different preferred models, and at different quantile we have different levels of heterogeneity, as measured by the number of hidden states. This is not surprising, as e.g. Maruotti et al. (2019) show that different criteria may lead to different chosen models when analysing financial time-series. A K = 2-state model is discussed in the following for ease of discussion for all τ -levels. Of course, model selection is a crucial aspect of any real data analyses and some contributions are present in the Markov-switching literature (see e.g. Pohle et al., 2017) . It deserves a specific further investigation under the quantile regression framework .
Results are summarized in Fig. 5. As could be expected, the state-specific intercepts are increasing somewhat with τ , with State 1 having lower values than State 2 for all τ s. Hidden states are often used as a mathematical tool to model time-dependent heterogeneity; however, here, they have a physical meaning being State 1 the at-risk state and State 2 the non-at-risk state. According to the initial probabilities, at t = 1 we start from the non-at-risk state, which is rather reasonable as the series starts from January 2005.
The integration between the Hong Kong and Chinese market is rather evident. The main role in estimating the returns quantiles of the SSE index is played by the Hang Seng index. This would imply that there is a driving force that brings these markets together. Hence, the benefit of diversification between the two markets would be limited. This is not a surprising result. The proximity between the two markets allows the investors to switch investments between the two markets. Moreover, the Hang Seng index in Hong Kong is often seen as back door to access to the Chinese market; allowing greater access to investors hoping to get in on growth in mainland China. This also partially shared with the KOSPI in Korea, though the relationship is weaker. Evidence is found that a significant effect exists between the daily returns generated by United States and the SSE. This is not surprising as U.S is the world's foremost stock market and has large influence on other stock markets. Beside that U.S. is also one of the main trading partners for both China. The quantiles of the SSE returns are less affected by the NIKKEI of Japan. Its effect is rather small and significant in the not-at risk state only. There is also an effect of other emerging markets, in particular in State 1, like they share a common systematic risk.
The main achieved goal of our empirical analysis is the evidence of the quantile effects of a large number of financial time series on the SSE. Indeed we found that the number of markets influencing the Chinese one varies across quantiles. The way the other markets influence the SSE is not symmetric and the number of significant markets in the lower tail is larger than that in the upper tail.
The number of significant markets also varies over time. This is one of the reasons that understanding current financial market structures is important because the market structure changes over time. As a by-product of using quantile regression is the investigation of the relationships between the entire SSE returns distributions and the other markets. The relationship among markets is rather complex, and time-varying. It is worth mentioning that some markets, e.g. the NIKKEI in State 1, the KOSPI and the NZAO in State 2, affect the SSE index only at certain quantiles. In particular the NZAO shows a significant effects in the at-risk State only at extreme quantiles, i.e. the SSE is more connected to Asia Pacific regional markets under at-risk periods, while the NIKKEI affects the SSE only at the centre of the distribution.
The usefulness of relaxing the geometric assumption for sojourn distribution is widely acknowledged in the literature (see e.g. Bulla and Bulla, 2006;Maruotti et al., 2019). This is because the HMS model is not able to capture some stylized facts of the return series (Rydén et al., 1998).
As for the environmental data, we test (Dimitrova et al., 2020) if the estimated sojourn probabilities follow a known parametric distribution, see Table 8. It is rather clear that none of the considered parametric distributions can reproduce the sojourns in State 1; while the sojourn in State 2 can be approximated by the geometric or the Negative Binomial, even better by assuming a non-parametric head and a geometric/negative binomial tail (as in Bulla et al., 2010).

Conclusions
We introduced a hidden semi-Markov quantile regression as a model-based clustering approach which overcame the limitations inherent in the Markov-switching one. The model has attractive features, including the feasibility to infer cluster-specific covariate effects on various quantiles, and the flexible way in which the entire response conditional distribution may be modelled and detected. Parameter estimation can be obtained in a maximum likelihood framework, slightly modifying the well-know Baum-Welch algorithm. We further provide the usefulness of the proposal in different empirical fields like finance and environmental frameworks. The model could be extended towards several directions. Firstly, in a multivariate setting, multiple quantiles can be jointly modelled by using the Multivariate Asymmetric Laplace distribution under proper constraints of the parameters (Petrella and Raponi, 2019). Time dependence can be further included in the quantile regression model allowing the evolution of the quantile over time to vary according to an autoregressive process. The effect of external, independent, covariates may also be included in the latent part of the model, allowing the semi-Markov chain parameters to depend on external information.
The proposal is introduced under the time-series setting. Of course, and as already mentioned in the Introduction, similar approaches are available under the longitudinal setting for HMMs (Marino and Alfó, 2015;Marino et al., 2018).
Our hidden semi-Markov-switching quantile regression model could also be extended to the longitudinal setting as further research work, extending the mixed HSMM approach proposed in Langrock et al. (2012) to quantile regression.
In the analysis of non-normal data in a regression setting, it is relevant to note that Markov-switching generalized additive models for location, shape and scale considered in Hambuckers et al. (2018) and Langrock et al. (2018) could be used as a promising alternative to fit the data, ensuring a lot of flexibility in a parametric setting and a parsimonious approach, as it has fewer model parameters than the proposed Markov-switching quantile model. Of course, a key advantage of Markov-switching quantile regression is that no distributional assumption are required.
A possible limitation of the suggested approach may be the well-known quantile crossing problem that may arise when quantile regression models are fitted separately for each considered quantile probability level. This common practice of treating the quantile levels independently of one another can yield fitted quantile curves that cross one another, thereby leading to a nonsensical response distribution. Indeed, quantile monotonicity requires the quantiles to be increasing as a function of τ , meaning that any well-defined distribution must necessarily have non-crossing quantiles. This quantile crossing problem is potentially worse for non-linear quantile models, like the (semi-)Markov-switching specification developed here. Fortunately, the approach developed by Frumento and Bottai (2016) makes it simple to control for quantile crossing. Further research investigations could be made in this direction.
The quantities used in the E-step can be thus rewritten as α t (k, u ′ )π k|j f (y t+1 , . . . , y t+u ′ | S t+u ′ = k, . . . , S t = k; ·)ζ t+u (k, u) η t (k, u) = α t (k, u)ζ t (k, u) γ t (k) = min(U k ,t,T −t) ∑ u=1 η t (k, u) Assuming U k being the maximum times in state k, Eq. (7) becomes In the M-step, we can harness the separability of parameters in Q ( ϑ|ϑ (h) ) to maximize each component individually as follows The existence of (A.8) is readily available in many statistical packages such as the R package quantreg (Koenker et al., 2020).
It is known (Levinson et al., 1983) that the maximum likelihood estimator of π is a unit vector: one entry is 1 and the others 0. Thus, in practice, the likelihood is maximized conditional on the Markov chain starting in state k, for each of k = 1, 2, . . . , K , and then chooses the largest result.