1 Introduction

Ordinal time series arise from data that are measured qualitatively, so the possible outcomes are finitely many categories. The distinctive feature of ordinal data is a natural order among the categories, i.e., they can be strictly arranged from lowest to highest. Real-world examples include environmental time series such as the air quality data discussed in Sect. 6, but also data from economics, sports, health studies and many further areas (see Fokianos and Kedem 2003; Weiß 2020a; López-Oriona and Vilar 2023). The ordered categories pose certain challenges for adequate time series models. While the literature on ordinal regression models is quite large, see Tutz (2022) for a recent survey, there are only a few publications concerning ordinal time series models. Conditional regression models have been proposed by Pruscha (1993) and Fokianos and Kedem (2003), and rank-count approaches have been used by Weiß (2020a), Liu et al. (2022a), and Liu et al. (2022b). In this paper, we develop several different stochastic approaches for ordinal time series modelling, which are based on GARCH-type dependence structures (generalized autoregressive conditional heteroscedasticity). Specifically, we propose stochastic models derived from artificial neural networks (ANN) as an extension of conventional regression approaches.

To simplify notations, we assume that the random variables \(Y_{t}\), \(t\in \mathbb {Z}=\{\ldots ,-1,0,1,\ldots \}\), take values from a finite ordered set \(\mathcal {S}=\{0,\dots , K\}\), \(K\in \mathbb {N}=\{1,2,\ldots \}\), where the integer elements represent the corresponding ordered categories (in ascending order). In the present time series context, the conditional probability that \(Y_{t}\) takes a certain value depends on the past of the process. As usual, let \(\mathcal {F}_{t-1}\) denote the information available up to time \(t-1\). Regarding the actual conditional probability distribution of \(Y_{t}|\mathcal {F}_{t-1}\), different specifications are possible. Two approaches are relevant for the framework proposed in this paper. First, we can link the qualitative ordinal data to count data, which is known as the rank-count approach (see Weiß 2020a). Here, the count \(k\in \mathcal {S}\) expresses how many categories we are above the lowest category. This enables us to employ conditional count distributions for \(Y_{t}|\mathcal {F}_{t-1}\) with known properties, and to adapt count time series models to the ordinal case. Accordingly, the general notation in this paper is that \(Y_{t}\) takes a natural number from \(\mathcal {S}\). While the implementation is easy compared to the alternatives, count models rely on a metric interpretation. Thus, we might implicitly assume properties that the ordinal data do not possess. The literature on count time series models is relatively abundant (see Weiß 2018). The relevant framework for this paper is the INGARCH model class (integer-valued GARCH), which is part of the broader class of generalized linear models (GLM). The basic linear INGARCH model for unbounded counts was proposed by Ferland et al. (2006). For bounded counts, Ristić et al. (2016) suggest a binomial INGARCH model. This model is particularly relevant for us because the ordinal data, when treated as counts, are always bounded.
A generalization of the linear model based on a logistic response function in combination with a bounded count distribution is considered in Chen et al. (2020); details are provided in Sect. 2.

The second approach is designed to explicitly respect the ordinal nature of the data. Here, we interpret \(Y_t\) as a multinomially distributed random variable with population parameter 1 and probability masses \(P_t(k)=P (Y_{t}=k|\mathcal {F}_{t-1})\). We use the obvious convention that \(P_t(k)=0\) for \(k\notin \mathcal {S}\). In order to account for the natural ordering, we consider the conditional cumulative distribution function (CDF) \(F_t(k)=P (Y_{t}\le k|\mathcal {F}_{t-1})\). It is related to the probability mass function (PMF) via the equation \(P_t(k)=F_t(k)-F_t(k-1)\). In Sect. 3, the (logistic) ordinal GARCH model, ORDGARCH, shall be motivated from the logistic binomial INGARCH model for count data. As an alternative, we consider a multinomial-logit or softmax GARCH model in Sect. 4, which uses the softmax response function. The latter model also serves as a basis for constructing a GARCH model for ordinal time series with an ANN response function, see Sect. 5. Such response functions have already been proposed for the INGARCH framework (see Jahn 2023), and in this paper, we demonstrate the methodology and empirical benefit of using ANN response functions in ordinal time series models. The main data under investigation, see Sect. 6, are the daily air quality levels in Chinese cities as provided by Liu et al. (2022), which have been discussed previously by several authors. The newly proposed models as well as existing models are estimated and compared. Two further data examples concerning fear states of the U.S. stock market and hourly levels of cloud coverage at a weather station in Germany are analyzed in Sect. 7. Section 8 concludes the article and discusses possible directions for future research.

2 GLM for count representation

Before we turn to the highly nonlinear ANN models in Sect. 5, let us start with possible implementations in a GLM framework. As we are dealing with conditional probabilities, relevant GLM must feature a response function that maps the set of real numbers, \(\mathbb {R}\), onto the unit interval (0, 1). The most popular choice (and the one used here) is the logistic regression model, but alternatives such as the nearly linear soft-clipping approach (see Weiß and Jahn 2022) might also be used in this context. The logistic model uses the logistic response function \(\phi (x) = \big (1+\exp (-x)\big )^{-1}\) which has the desired property that all arguments are mapped onto (0, 1) in a strictly monotonic way. Following the count interpretation of the ordinal process \((Y_t)\), we combine the logistic response function \(\phi \) with a conditional binomial distribution in the following way:

$$\begin{aligned} Y_{t}\sim Bin (K,P_t) \,\, \text {with } P_{t}=\phi \! \left( \alpha _0+\sum _{i=1}^{p}\alpha _i Y_{t-i}/K+ \sum _{j=1}^{q}\beta _j P_{t-j}\right) . \end{aligned}$$
(1)

Model (1) is known as the logistic binomial INGARCH(p, q) model, where \(p\in \mathbb {N}\) and \(q\in \mathbb {N}_0=\{0,1,\ldots \}\). Furthermore, \(\alpha _0,\alpha _1,\ldots ,\alpha _p,\beta _1,\ldots ,\beta _q\in \mathbb {R}\) with \(\alpha _p\not =0\) and \(\beta _q\not =0\) (the latter if \(q>0\)); details and properties can be found in Chen et al. (2020). While the autoregressive part is self-explanatory, the GARCH-like feedback terms \(P_{t-j}\) represent the past conditional means (and variances) and allow the model to capture an intensified memory (i.e., slowly decaying serial dependence) of a time series (see Fokianos 2011). Note that we use the letter “p” to denote the order of the autoregressive part (and thus “q” for the feedback part), as is common practice for ordinary autoregressive models and also often done for INGARCH models (e.g., in Chen et al. 2020). Also note that \(p\ge 1\) is necessary to avoid a degenerate model without stochastic dependence.
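For illustration, the recursion in model (1) can be simulated directly. The following is a minimal sketch (NumPy assumed; the function name and the choice to initialize the feedback terms at \(\phi (\alpha _0)\) are ours):

```python
import numpy as np

def phi(x):
    """Logistic response function, mapping R onto (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def simulate_logit_bin_ingarch(T, K, alpha0, alpha, beta, seed=1):
    """Simulate the logistic binomial INGARCH(p, q) model (1)."""
    rng = np.random.default_rng(seed)
    p, q = len(alpha), len(beta)
    Y = np.zeros(T, dtype=int)
    P = np.full(T, phi(alpha0))  # startup values for the feedback terms
    for t in range(max(p, q), T):
        eta = alpha0 \
            + sum(alpha[i] * Y[t - 1 - i] / K for i in range(p)) \
            + sum(beta[j] * P[t - 1 - j] for j in range(q))
        P[t] = phi(eta)                 # conditional success probability
        Y[t] = rng.binomial(K, P[t])    # conditional binomial draw
    return Y, P
```

By construction, the simulated counts stay in \(\{0,\dots ,K\}\) and the success probabilities in (0, 1).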

The estimation of the \(p+q+1\) parameters in \(\varvec{\theta }=(\alpha _0,\alpha _1,\dots ,\alpha _p,\beta _1,\dots ,\beta _q)^\top \) is done by maximizing the conditional log-likelihood function (CML estimation). The binomial conditional log-likelihood has the following form:

$$\begin{aligned} \ell (\varvec{\theta })= \text {const} + \sum _{t}Y_{t}\ln (P_t)+(K-Y_t)\ln (1-P_t). \end{aligned}$$
(2)

The likelihood framework provides the usual tools for model inference and selection, such as a numerical Hessian to approximate the covariance matrix and information criteria to compare models with different numbers of parameters. For ease of computations, we rewrite model (1) as \(P_{t}=\phi (\varvec{X}_t^\top \varvec{\theta })\) with \(\varvec{X}_{t}=(1,\ Y_{t-1}/K,\dots ,Y_{t-p}/K,\ P_{t-1},\dots ,P_{t-q})^\top \), allowing us to compute \(P_t\) as a dot product.
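The dot-product formulation makes the likelihood evaluation a one-line update per time step. A minimal sketch (NumPy assumed; initializing the feedback terms at \(\phi (\theta _0)\) is one common but arbitrary startup choice):

```python
import numpy as np

def binomial_cml_loglik(theta, Y, K, p, q):
    """Conditional log-likelihood (2), up to the constant, with
    P_t = phi(X_t' theta), X_t = (1, Y_{t-1}/K, ..., P_{t-1}, ...)."""
    phi = lambda x: 1.0 / (1.0 + np.exp(-x))
    Y = np.asarray(Y, dtype=float)
    T = len(Y)
    P = np.full(T, phi(theta[0]))  # startup values for the feedback terms
    ll = 0.0
    for t in range(max(p, q), T):
        # regressor vector (1, Y_{t-1}/K, ..., Y_{t-p}/K, P_{t-1}, ..., P_{t-q})
        X_t = np.concatenate(([1.0], Y[t - p:t][::-1] / K, P[t - q:t][::-1]))
        P[t] = phi(X_t @ theta)
        ll += Y[t] * np.log(P[t]) + (K - Y[t]) * np.log(1.0 - P[t])
    return ll
```

CML estimation then amounts to maximizing this function numerically, e.g., by passing its negative to a general-purpose optimizer such as `scipy.optimize.minimize`.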

3 Ordinal GARCH model

Inspired by the logistic binomial INGARCH model (1), we propose a logistic ordinal model based on a multinomial conditional distribution for \(Y_t\), together with the logistic response function. For the definition of the model, we consider the following binarized processes (where \(\mathbbm {1}\) denotes the indicator function):

$$\begin{aligned}&\widetilde{Y}_{t}^{(k)}=\mathbbm {1}_{[Y_t\le k]}\quad \text {for }\, k\in \{0,\dots ,K-1\}, \quad \text {and} \end{aligned}$$
(3)
$$\begin{aligned}&\overline{Y}_{t}^{(k)}=\mathbbm {1}_{[Y_t= k]} \quad \text {for }\, k\in \{0,\dots ,K\}. \end{aligned}$$
(4)

We call \(\widetilde{\varvec{Y}}_{t} = (\widetilde{Y}_{t}^{(0)}, \ldots , \widetilde{Y}_{t}^{(K-1)})^\top \) the ordinal binarization of \(Y_t\), and \(\overline{\varvec{Y}}_{t} = (\overline{Y}_{t}^{(0)}, \ldots , \overline{Y}_{t}^{(K)})^\top \) the nominal one. As an example, assume that \(Y_t\) takes values from \(\{0,\dots , 5\}\) (i.e., \(K=5\)). Let \(Y_{s}=3\) be the realized value for some time period s. Then, \(\widetilde{\varvec{Y}}_{s}=(0,0,0,1,1)\) and \(\overline{\varvec{Y}}_{s}=(0,0,0,1,0,0)\) are the corresponding binarized vectors. In analogy to the definition of \(\widetilde{\varvec{Y}}\) and \(\overline{\varvec{Y}}\), denote the CDF vectors as \(\varvec{F}_t\) with \(F_t^{(k)}=F_{t}(k)\), and the PMF vectors as \(\varvec{P}_t\) with \(P_t^{(k)}=P_{t}(k)\). Note that \(E \big [\widetilde{\varvec{Y}}_{t}|\mathcal {F}_{t-1}\big ] = \varvec{F}_t\) and \(E \big [\overline{\varvec{Y}}_{t}|\mathcal {F}_{t-1}\big ] = \varvec{P}_t\). The PMF vectors \(\varvec{P}_t\) are elements of the \((K+1)\)-part unit simplex, \(\mathbb {S}_{K+1}\,=\ \big \{\varvec{u}\in (0,1)^{K+1}\ \big |\ u_0+\ldots +u_{K}=1\big \}\).
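The two binarizations (3) and (4) are one-liners in vectorized code; a minimal sketch (NumPy assumed; the function names are ours):

```python
import numpy as np

def ordinal_binarization(y, K):
    """tilde Y with components 1[y <= k] for k = 0, ..., K-1, cf. (3)."""
    return (y <= np.arange(K)).astype(int)

def nominal_binarization(y, K):
    """bar Y with components 1[y == k] for k = 0, ..., K, cf. (4)."""
    return (y == np.arange(K + 1)).astype(int)
```

For the example in the text (\(K=5\), \(Y_s=3\)), the functions return (0, 0, 0, 1, 1) and (0, 0, 0, 1, 0, 0), respectively.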

Since \(Y_t\) is interpreted as a random variable with conditional multinomial distribution, the nominal binarization \(\overline{\varvec{Y}}_{t}\) represents the dependent variable in the subsequent ordinal regression model. The ordinal information, however, is incorporated by defining the ordinal GARCH model with respect to the conditional CDF \(\varvec{F}_{t}\) and by an appropriate construction of the explanatory variables, in analogy to the existing ordinal time series models by Pruscha (1993), Fokianos and Kedem (2003). We propose the following model recursion, again including a feedback term of order q:

$$\begin{aligned}&\overline{\varvec{Y}}_{t}\sim Mult (1,\varvec{P}_t)\, \text {with } P_t^{(k)}=F_t(k)-F_t(k-1) \,\, \text {for } k=0,\dots ,K, \,\text {and} \end{aligned}$$
(5)
$$\begin{aligned}&F_{t}(k)=P (Y_t\le k)=\phi \!\left( \alpha _0^{(k)}+\sum _{i=1}^{p}\alpha _i \widetilde{Y}_{t-i}^{(k)}+\sum _{j=1}^{q}\beta _j F_{t-j}(k)\right) \,\, \text {for } k=0,\dots ,K-1. \end{aligned}$$
(6)

Model (5) is only well-defined if \(F_{t}(k)\ge F_{t}(k-1)\) (and thus \(\varvec{P}_t\in \mathbb {S}_{K+1}\)). By construction, the property \(1\ge \widetilde{Y}_{t}^{(k)}\ge \widetilde{Y}_{t}^{(k-1)}\ge 0\) holds for the binarized series. Therefore, and because \(\phi \) is strictly monotonically increasing, it is easily verified that \(\alpha _0^{(k)}\ge \alpha _0^{(k-1)}\), \(\alpha _i\ge 0\), and \(\beta _j\ge 0\) for all \(k,i,j\) constitute a sufficient condition for a well-defined model. For identifiability of the model order \((p,q)\in \mathbb {N}\times \mathbb {N}_0\), we also require that \(\alpha _p\not =0\) and, if \(q>0\), that \(\beta _q\not =0\).

Regarding CML estimation of the \(K+p+q\) parameters in \(\varvec{\theta }=(\alpha _0^{(0)},\dots ,\alpha _0^{(K-1)},\alpha _1,\dots ,\beta _1,\dots )^\top \), the most rigorous way is to consider the multinomial log-likelihood, which has a relatively simple form since the multinomial population parameter equals 1. Therefore, only one component is non-zero and the log-likelihood function is expressed as:

$$\begin{aligned} \ell (\varvec{\theta })=\sum _{t}\sum _{k}\overline{Y}_{t}^{(k)}\, \ln \big (P_{t}^{(k)}\big ). \end{aligned}$$
(7)

For an efficient numerical implementation, the regressors can be collected in an appropriate matrix. To simplify the notation for the multivariate time series, let \({\textbf {X}}_t\) denote the regressor matrix at time t, being defined as

$$\begin{aligned} {\textbf {X}}_{t}=\begin{pmatrix} 1 & & 0 & \widetilde{Y}_{t-1}^{(0)} & \cdots & \widetilde{Y}_{t-p}^{(0)} & F_{t-1}(0) & \cdots & F_{t-q}(0)\\ & \ddots & & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 0 & & 1 & \widetilde{Y}_{t-1}^{(K-1)} & \cdots & \widetilde{Y}_{t-p}^{(K-1)} & F_{t-1}(K-1) & \cdots & F_{t-q}(K-1) \end{pmatrix}. \end{aligned}$$
(8)

Applying the response function \(\phi \) in a component-wise manner, model (5) can be rewritten compactly as \(\varvec{F}_{t}=\phi \left( {\textbf {X}}_t\varvec{\theta }\right) \).

Fig. 1 ORDGARCH(1, 0) model from Example 3.1: (a) conditional PMFs and (b) marginal PMF

Example 3.1

To illustrate the flexibility of the ORDGARCH approach, let us consider an ORDGARCH(1, 0) model for \(K=5\) with parameter vector \(\varvec{\theta }=(-0.5,0,0.7,1.5,3,\ 1.5)^\top \). The last parameter is the autoregressive parameter \(\alpha _1\), whereas the threshold parameters \(\alpha _0^{(0)},\dots ,\alpha _0^{(4)}\) are chosen such that the probability for category 0 becomes very large (zero inflation). This becomes visible for both the conditional distributions in Fig. 1(a) and the stationary marginal distribution in Fig. 1(b). The ability to capture zero inflation differs from the conditionally binomial model (1). To evaluate the extent of serial dependence, we compute ordinal Cohen’s \(\kappa _{ord }(h)\) for different lags \(h\in \mathbb {N}\). \(\kappa _{ord }(h)\) was proposed by Weiß (2020a) as an ordinal counterpart to the autocorrelation function (ACF), and for the considered model, it equals \(0.2955, 0.0930, 0.0302, \ldots \) for \(h=1,2,3,\ldots \), i.e., the values decrease exponentially in accordance with the first-order AR model structure.

Note that the aforementioned properties were computed numerically exactly by utilizing that the ORDGARCH(1, 0) model constitutes a stationary Markov chain, being fully determined by its transition matrix [the columns of which are plotted in Fig. 1(a)]. Solving the invariance equation, the marginal distribution is obtained, and the matrix of h-step-ahead transition probabilities is simply the hth power of the 1-step transition matrix; see Appendix B.2.2 in Weiß (2018) for details.
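These computations are easy to reproduce. The following sketch (NumPy assumed, using the Example 3.1 parameters; rows of the transition matrix are coded as conditional PMFs) builds the transition matrix, solves the invariance equation by least squares, and obtains h-step-ahead transition probabilities as matrix powers:

```python
import numpy as np

phi = lambda x: 1.0 / (1.0 + np.exp(-x))
K = 5
a0 = np.array([-0.5, 0.0, 0.7, 1.5, 3.0])  # thresholds alpha_0^(0),...,alpha_0^(4)
a1 = 1.5                                   # autoregressive parameter

# Row y: conditional PMF of Y_t given Y_{t-1} = y
Pi = np.zeros((K + 1, K + 1))
for y in range(K + 1):
    ytil = (y <= np.arange(K)).astype(float)   # ordinal binarization of the lag
    F = phi(a0 + a1 * ytil)                    # conditional CDF, cf. (6)
    Pi[y] = np.diff(np.concatenate(([0.0], F, [1.0])))

# Stationary marginal distribution from the invariance equation pi = pi Pi
n = K + 1
A = np.vstack([Pi.T - np.eye(n), np.ones(n)])
b = np.concatenate([np.zeros(n), [1.0]])
pi = np.linalg.lstsq(A, b, rcond=None)[0]

# h-step-ahead transition probabilities: h-th power of the transition matrix
Pi3 = np.linalg.matrix_power(Pi, 3)
```

The appended row of ones enforces that the solution of the (otherwise rank-deficient) invariance equation sums to one.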

4 Softmax GARCH model

A second GLM approach is based on the multinomial logistic regression, also known as softmax regression, which has been mainly used for nominal time series, see Fokianos and Kedem (2003), Moysiadis and Fokianos (2014). It relies on the softmax response function \(\varvec{\sigma }: \mathbb {R}^{K+1}\rightarrow \mathbb {S}_{K+1}\) with \(\sigma _k(\varvec{x})=\frac{\exp (x_k)}{\sum _m\exp (x_m)}\), which converts an arbitrary real-valued vector into a probability vector. Furthermore, we make use of the function \(e:\mathbb {S}_{K+1}\rightarrow (0,K)\) defined by \(e(\varvec{p})=\sum _{r=0}^{K}r\, p_r\), which can be interpreted as the “expected category” with respect to the PMF vector \(\varvec{p}= (p_0,\ldots , p_K)^\top \). Then, the softmax GARCH model still assumes the conditional distribution of \(\overline{\varvec{Y}}_{t}|\mathcal {F}_{t-1}\) to be multinomial (and that analogous constraints on model order and parameters like in Sect. 3 hold):
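Both building blocks are straightforward to implement; a minimal sketch (NumPy assumed; the function names are ours):

```python
import numpy as np

def softmax(x):
    """sigma: R^{K+1} -> S_{K+1}; shifting by max(x) avoids overflow."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

def expected_category(p):
    """e(p) = sum_{r=0}^{K} r * p_r for a PMF vector p = (p_0, ..., p_K)."""
    return float(np.arange(len(p)) @ p)
```

For a uniform PMF vector over \(\{0,\dots ,5\}\), for instance, the expected category equals 2.5.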

$$\begin{aligned}&\overline{\varvec{Y}}_{t}\sim Mult (1,\varvec{P}_t)\quad \text {with } \nonumber \\&\quad \varvec{P}_t=\varvec{\sigma }\!\left( \ldots ,\ \alpha _0^{(k)}+\sum _{i=1}^{p}\alpha _i^{(k)} Y_{t-i}/K+\sum _{j=1}^{q}\beta _j^{(k)} e(\varvec{P}_{t-j})/K,\ \ldots \right) . \end{aligned}$$
(9)

The relevant features of (9) are the category-specific parameters and the coding of the explanatory variables. Regarding the latter aspect, the model is more similar to the binomial model (1). The autoregressive components \(Y_{t-i}\) are natural numbers expressing the categories at times \(t-i\), so there is no binarization of the lagged dependent variables. Similarly, the feedback terms \(e(\varvec{P}_{t-j})\) are calculated as single numbers (expected categories) representing the past states of the system. A similar construction is used in Liu et al. (2022).

For computational purposes, the (normalized) regressors are collected in a vector

$$\begin{aligned} \varvec{X}_t = \Big (1,\ Y_{t-1}/K,\dots , Y_{t-p}/K,\ e(\varvec{P}_{t-1})/K,\dots ,e(\varvec{P}_{t-q})/K\Big )^\top . \end{aligned}$$

The \(K(p+q+1)\) parameters, on the other hand, are arranged in a matrix:

$$\begin{aligned} \varvec{\Theta }^\top =\begin{pmatrix} 0 & 0 & \cdots & 0 & 0 & \cdots & 0 \\ \alpha _0^{(1)} & \alpha _1^{(1)} & \cdots & \alpha _p^{(1)} & \beta _1^{(1)} & \cdots & \beta _q^{(1)}\\ \vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ \alpha _0^{(K)} & \alpha _1^{(K)} & \cdots & \alpha _p^{(K)} & \beta _1^{(K)} & \cdots & \beta _q^{(K)} \end{pmatrix}. \end{aligned}$$
(10)

The first column of \(\varvec{\Theta }\) (referring to the parameters for \(k=0\)) consists of zeros and implicitly defines the first category as the reference category to avoid an overparametrization. The matrix form allows computing \(\varvec{P}_t\) via matrix multiplication \(\varvec{P}_t=\varvec{\sigma }(\varvec{\Theta }^\top \varvec{X}_t)\). The estimation is done by maximizing the multinomial log-likelihood in analogy to (7), namely
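In code, one recursion step of model (9) is a single matrix multiplication; a minimal sketch (NumPy assumed; the helper name is ours):

```python
import numpy as np

def softmax_garch_step(Theta, Y_lags, e_lags, K):
    """One step of model (9): P_t = sigma(Theta' X_t).

    Theta: (p+q+1) x (K+1) parameter matrix, first column (k = 0) zero;
    Y_lags = (Y_{t-1}, ..., Y_{t-p}); e_lags = (e(P_{t-1}), ..., e(P_{t-q})).
    """
    X_t = np.concatenate(([1.0], np.asarray(Y_lags, float) / K,
                          np.asarray(e_lags, float) / K))
    eta = Theta.T @ X_t             # one linear predictor per category
    z = np.exp(eta - eta.max())     # numerically stabilized softmax
    return z / z.sum()              # PMF vector P_t
```

The returned vector lies in the unit simplex \(\mathbb {S}_{K+1}\) by construction.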

$$\begin{aligned} \ell (\varvec{\Theta })=\sum _{t}\sum _{k}\overline{Y}_{t}^{(k)}\, \ln \big (P_{t}^{(k)}\big ). \end{aligned}$$
(11)

In comparison to the previous ordinal GARCH model (5), the number of regressors is lower because there is no binarization; but the number of parameters is larger, as we have category-specific parameters.

Fig. 2 Softmax GARCH(1, 0) model from Example 4.1: (a) conditional PMFs and (b) marginal PMF

Example 4.1

As a theoretical example, let us consider a softmax GARCH(1, 0) model for \(K=5\) with parameter matrix \(\varvec{\Theta }\) with first row \(\varvec{\Theta }_{1\cdot }=(0,0,0,0,0,0)\) and second row \(\varvec{\Theta }_{2\cdot }=(0,-0.5,-1,-1,-0.5,0)\). The second row implies a U-shaped conditional PMF for \(Y_t\), which becomes more pronounced the larger \(Y_{t-1}\) is. This is illustrated in Fig. 2(a), where the columns of the transition matrix of the process are plotted. The softmax GARCH(1, 0) model constitutes a stationary Markov chain, the marginal distribution of which (again U-shaped) is shown in Fig. 2(b). Utilizing the Markov property (recall Example 3.1), one computes the ACF for the numerically coded process \((Y_t)\), which is identically zero. The absence of any autocorrelation is due to the absence of a linear relationship between \(Y_t\) and \(Y_{t-1}\); nevertheless, the process exhibits (nonlinear) serial dependence. For example, the joint bivariate probabilities \(P (Y_t=k, Y_{t-1}=0)\) are constant in k, see Fig. 2(a), whereas the products \(P (Y_t=k)\, P (Y_{t-1}=0)\) are U-shaped, see Fig. 2(b). This purely nonlinear dependence also implies that linear count models such as the linear binomial INGARCH(1, 0), whose ACF equals power expressions of the autoregressive parameter, could not describe the process adequately. Furthermore, it is worth pointing out that conditionally binomial models such as (1) cannot have U-shaped conditional distributions, which demonstrates the flexibility of the softmax approach (9).
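The zero-ACF claim can be checked numerically; a small sketch (NumPy assumed, with the Example 4.1 parameters) confirms that the conditional mean is the same for every previous state:

```python
import numpy as np

K = 5
a1 = np.array([0.0, -0.5, -1.0, -1.0, -0.5, 0.0])  # second row of Theta
cats = np.arange(K + 1)

def cond_pmf(y):
    """Conditional PMF of Y_t given Y_{t-1} = y under Example 4.1."""
    eta = a1 * y / K                 # intercepts are all zero here
    z = np.exp(eta - eta.max())
    return z / z.sum()

# The logits are symmetric under k -> K - k, so every conditional PMF is
# symmetric with mean 2.5; a constant conditional mean implies
# Cov(Y_t, Y_{t-1}) = 0, i.e., the ACF vanishes.
cond_means = np.array([cats @ cond_pmf(y) for y in range(K + 1)])
```

The U-shape is also visible directly: for \(Y_{t-1}=5\), the outer categories receive strictly larger probabilities than the middle ones.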

5 Neural network framework for ordinal time series

ANNs are known to be powerful tools for regression and classification problems, see Abiodun et al. (2018) for a recent survey. A main contribution of this paper is to consider conditional ANN regression models which are explicitly designed to model ordinal time series. Here, we can build on the concepts discussed in the previous sections. Concerning the count interpretation of the ordinal time series, corresponding neural INGARCH models have already been considered in Jahn (2023). The relevant binomial neural INGARCH(p, q) model is given by:

$$\begin{aligned} Y_{t}\sim Bin (K,P_t) \quad \text {with } P_{t}=f^{ANN }\Big (Y_{t-1}/K,\dots ,Y_{t-p}/K,\ P_{t-1},\dots ,P_{t-q}\Big ). \end{aligned}$$
(12)

The architecture of the ANN regression function \(f^{ANN }\) has to be adapted to the problem. For this paper, we focus on single hidden layer networks (with H neurons in the one and only hidden layer) without loss of generality. For the count interpretation (12), the network output \(f^{ANN }=f^{ANN }_{\phi }\) is scalar-valued and denotes a single success probability per time period. We have a J-dimensional vector of regressors, \(\varvec{X}_{t}=(1,\ Y_{t-1}/K,\dots ,Y_{t-p}/K,\ P_{t-1},\dots ,P_{t-q})^\top \), where \(J=p+q+1\). The model uses logistic activation functions together with the \(H(p+q+2)\) weight parameters \(w^{1}_{h}\) and \(w^{0}_{jh}\), where \(h\in \{1,\ldots ,H\}\) and \(j\in \{1,\ldots ,J\}\):

$$\begin{aligned} f^{ANN }_{\phi }({{\textbf {w}}}^0,\varvec{w}^1,\varvec{X}_t)=\phi \! \left( \sum _{h=1}^{H}w^1_{h}\cdot \phi \biggl (\sum _{j=1}^{J}w^0_{jh} X_{tj}\biggr )\right) . \end{aligned}$$
(13)

Here, the \(w^{0}_{jh}\) (\(w^{1}_{h}\)) refer to the weights for the activation of the hidden (output) layer. Also note that we consider fully-connected networks only, meaning that individual weight parameters may not be eliminated. This ensures the identifiability of the model.
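Equation (13) translates into two nested matrix-vector products; a minimal sketch (NumPy assumed; the function name is ours):

```python
import numpy as np

def f_ann_phi(w0, w1, X_t):
    """Single-hidden-layer network (13) with logistic activations.

    w0: (J, H) hidden-layer weights, w1: (H,) output-layer weights,
    X_t: (J,) regressor vector (first entry 1 for the intercept).
    """
    phi = lambda x: 1.0 / (1.0 + np.exp(-x))
    hidden = phi(X_t @ w0)   # activations of the H hidden neurons
    return phi(hidden @ w1)  # scalar success probability P_t in (0, 1)
```

Since both layers apply the logistic function, the output is guaranteed to lie in (0, 1), as required for the binomial success probability.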

The most important feature of the ANN regression function is its universal approximation property. It means that \(f^{ANN }\) can approximate any continuous regression function on a compact set to an arbitrary degree, given that the number of hidden neurons H is selected large enough. In the context of time series analysis, it allows, in particular, for a nonlinear autoregressive behavior and, if a time variable is included, an arbitrary time trend (Jahn 2023). The relevance of some of these features will be considered as part of the data example in Sect. 6. In what follows, we want to establish an ANN model framework that respects the ordinal nature of the time series better than the conditionally binomial model (12). The ORDGARCH approach (5) is not suitable for the following reasons. First, the model relies on parameter restrictions ensuring the monotonicity of \(\varvec{F}_{t}\) w.r.t. the categories. Such constraints render the numerical optimization of the ANN-based likelihood even more difficult, because an algorithm may often run into situations where the parameter combinations yield infeasible probabilities. Second, the binarized lagged dependent variables do not allow us to use the universal approximation property of the ANN to the full extent, as they may only take two different values. Therefore, it is more appropriate to build upon the softmax GARCH model (9), which uses the lagged dependent variables in their natural number representation. It is always preferable in ANN regression to represent the information in as few variables as possible, and the natural number \(Y_t\) contains the same information as the vectors \(\overline{\varvec{Y}}_t\) or \(\widetilde{\varvec{Y}}_t\). Consequently, the following model equation for \(\overline{\varvec{Y}}_{t}|\mathcal {F}_{t-1}\) is proposed, referred to as the neural softmax GARCH(p, q) model:

$$\begin{aligned}&\overline{\varvec{Y}}_{t}\sim Mult (1,\varvec{P}_t) \quad \text {with } \nonumber \\&\quad \varvec{P}_{t}=\varvec{f}^{ANN }_{\varvec{\sigma }}\Big (Y_{t-1}/K,\dots ,Y_{t-p}/K,\ e(\varvec{P}_{t-1})/K,\dots , e(\varvec{P}_{t-q})/K\Big ). \end{aligned}$$
(14)

The regression function \(\varvec{f}^{ANN }_{\varvec{\sigma }}\) uses the softmax activation \(\varvec{\sigma }\) as the output activation function (but still the logistic activation for the hidden layer), and the corresponding weight parameter \({{\textbf {w}}}^1\) becomes a matrix (where \(w^1_{h0}=0\) for identifiability):

$$\begin{aligned} \varvec{f}^{ANN }_{\varvec{\sigma }}({{\textbf {w}}}^0,{{\textbf {w}}}^1,\varvec{X}_t)=\varvec{\sigma }\! \left( \ldots ,\ \sum _{h=1}^{H}w^1_{hk}\cdot \phi \biggl (\sum _{j=1}^{J}w^0_{jh} X_{tj}\biggr ),\ \ldots \right) . \end{aligned}$$
(15)

Note that model (14) with its conditional multinomial distribution is much more flexible than the conditionally binomial model (12). While a conditional binomial distribution necessarily has a unimodal shape, the multinomial distribution can also represent a multimodal structure, recall Example 4.1.

The neural softmax GARCH(p, q) model (14) has \(H(K+p+q+1)\) parameters (fully-connected networks). In the purely autoregressive neural softmax GARCH(1, 0) model, the number of hidden neurons is bounded by \(H<\frac{K(K+1)}{2+K}\). This follows from the fact that the total number of parameters \(H(2+K)\) has to be smaller than \(K(K+1)\), the degrees of freedom in the transition matrix of the Markov chain representation. CML estimation is done as in (11), with \(\varvec{P}_t\) being computed according to (14).
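The response function (15) and the parameter count can be sketched as follows (NumPy assumed; the function names are ours):

```python
import numpy as np

def f_ann_softmax(w0, w1, X_t):
    """Neural softmax response (15): logistic hidden layer, softmax output.

    w0: (J, H); w1: (H, K+1) with its first column fixed to zero
    (reference category k = 0, for identifiability).
    """
    phi = lambda x: 1.0 / (1.0 + np.exp(-x))
    eta = phi(X_t @ w0) @ w1       # one linear predictor per category
    z = np.exp(eta - eta.max())    # numerically stabilized softmax
    return z / z.sum()             # PMF vector P_t

def n_free_params(H, K, p, q):
    """H(K + p + q + 1) free parameters of the neural softmax GARCH(p, q)."""
    return H * (K + p + q + 1)
```

For \(K=5\) and model order (1, 0), the bound \(H(2+K)<K(K+1)\) admits at most \(H=4\) hidden neurons, since \(4\cdot 7=28<30\) but \(5\cdot 7=35\ge 30\).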

6 Application: air quality in Chinese cities

To demonstrate the usefulness of the proposed nonlinear GARCH-type models for ordinal time series, we first discuss a well-established environmental data example from the literature in great detail, namely ordinal time series on the daily air quality level in different Chinese cities as provided by Liu et al. (2022) (two further data examples are briefly presented later in Sect. 7). According to (Liu et al. 2022, p. 2838), the Chinese government measures the air quality in six ordinal categories from 0 (“excellent”) to 5 (“severely polluted”), so \(K=5\) according to our notation. Time series models with some kind of autoregressive structure are appropriate for this type of data, because emitted particulate matter and other substances affecting the air quality might remain in the air for a few days until they are washed out or blown away. The dataset has been considered previously in relation to time series models. Liu et al. (2022a) employ a mixture of a bounded Poisson and a categorical distribution to fit the data, and the zero–one-inflated bounded Poisson distribution used by Liu et al. (2022b) can be understood as a special case thereof; further details are provided below. Weiß and Jahn (2022) use the soft-clipping response function in combination with a binomial distribution, yielding a model similar to equation (1).

Remark 1

We note that the present section is not intended to be an exhaustive study on the air quality in China, but mainly serves to demonstrate the usefulness of our newly proposed models for such kind of ordinal time series data. For a comprehensive study of air quality, also further quality metrics should be taken into account, such as the US-AQI (Air Quality Index) or PM\(_{2.5}\) (particulate matter with diameter \(\le 2.5 \mu \)m), which are also monitored for Chinese cities such as Beijing, see https://www.iqair.com/china/beijing.

Fig. 3 Plot of the Beijing air quality time series from 01 Dec 2013 to 31 Jul 2019. Median = 1, IOV = 0.535, \(\kappa _{ord }(1)=0.385\), \(\kappa _{ord }(2)=0.142\), \(\kappa _{ord }(3)=0.045\), and \(\kappa _{ord }(4)=0.033\)

Let us start our discussion with the daily air quality time series of Beijing between 01 Dec 2013 and 31 Jul 2019 (\(T=2068\) days), which was already discussed by Liu et al. (2022b) and Weiß and Jahn (2022), while Liu et al. (2022a) consider the data for a slightly different time frame. A plot of the time series together with some descriptive statistics is shown in Fig. 3. Note that due to the ordinal nature of the data, the usual statistics such as sample mean, variance, and ACF are not suitable, but ordinal counterparts have to be used. As in Weiß (2020a), we use the median for measuring location, the index of ordinal variation (IOV) for dispersion, and the ordinal Cohen’s \(\kappa \) for serial dependence at lag \(h\in \mathbb {N}\) (also recall Example 3.1). Table 1 shows the information loss in terms of the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), resulting from CML estimation for the different models discussed in Sects. 2, 3, 4 and 5. Regarding the GLM-type models in the upper part of Table 1, the results suggest that the ORDGARCH models, although being theoretically appealing due to their definition without any reference to a count interpretation, do not perform well on the air quality data. The AIC displays a clear advantage of the softmax GARCH models compared to the binomial INGARCH models, and also in terms of the BIC (which is more restrictive regarding an increased number of model parameters), the softmax GARCH models still perform slightly better. This indicates that the shape of the conditional binomial distribution does not fit well to the actual conditional distribution of the data.

Table 1 Air quality in Beijing, see Sect. 6: information loss for different models (#par = number of model parameters)

This is in agreement with earlier findings in Liu et al. (2022a, b), who proposed INGARCH-type models for these data, which are based on mixtures of the bounded Poisson (BP) distribution and multinomial models. More precisely, the BP-distribution results from truncating the Poisson distribution to the range \(\mathcal {S}=\{0,\dots , K\}\) and has a shape similar to the binomial distribution. Liu et al. (2022b) extended the BP-distribution by allowing for additional zero- and/or one-inflation. Liu et al. (2022a) more generally propose a mixture with a multinomial distribution on a subset of \(\mathcal {S}\) and refer to this model as Novel Category bounded Poisson (NC-BP). Concerning the Beijing series, Liu et al. (2022b) recommend a purely one-inflated (OI) model (zero-inflation is not significant), while Liu et al. (2022a) use a multinomial distribution on \(\{0,\ldots ,4\}\) for their NC-BP model. We consider these models as further competitors for the air quality data, where we choose the softplus function (see Weiß et al. 2022) as the response function for the Poisson parameter (instead of the originally proposed linear response) to ensure a positive conditional mean. This gives more flexibility and results in a lower information loss, see the middle part of Table 1.
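For concreteness, the BP-distribution can be computed by simple renormalization of the Poisson PMF; a minimal sketch (NumPy assumed; the function name is ours):

```python
import numpy as np
from math import factorial

def bounded_poisson_pmf(lam, K):
    """Poisson(lam) PMF truncated to {0, ..., K} and renormalized,
    i.e., the bounded Poisson (BP) distribution on S = {0, ..., K}."""
    w = np.array([lam ** k / factorial(k) for k in range(K + 1)])
    return w / w.sum()
```

The zero- and one-inflated variants then arise as mixtures of this PMF with point masses at the categories 0 and 1, respectively.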

The displayed neural models in the bottom part of Table 1 are estimated with \(H=2\) and \(H=3\). Because we consider fully-connected, single hidden layer networks, the choice of H is sufficient to identify the model. Also recall the remark from Sect. 5 regarding the upper bound for H: in the present data example, we have \(K=5\) and thus \(H<\frac{5(5+1)}{1+1+5}\approx 4.29\) for model order (1, 0). In particular, for \(H=4\), the model is already very close to overparametrization, producing unstable Hessians with negative eigenvalues in our estimations. The best overall model, in terms of both AIC and BIC, is the neural softmax GARCH with model order (1, 1) and \(H=3\).

Figure 4 illustrates the principal advantage of the softmax approach over the approaches with a fixed conditional distribution based on the estimated models for the air quality in Beijing. While the conditional probabilities for the displayed binomial model (1) are limited to the shape of a binomial PMF, the softmax model allows for PMFs of arbitrary shape. In the case \(Y_{t-1}=0\) in (a), the softmax model is able to capture one-inflation. For \(Y_{t-1}=5\) in (b), in turn, we get a high binomial probability for (1), \(P_t=0.7505\), such that the binomial model has very low probabilities of transitioning to the low categories. In a more practical interpretation, the neural softmax model indicates that the air quality has a significantly higher chance to be at least “good” today (\(Y_t\le 1\)) if the air was “severely polluted” yesterday (\(Y_{t-1}=5\)) than suggested by the binomial model.

Fig. 4
figure 4

Air quality in Beijing, see Sect. 6: predicted probabilities \(P (Y_{t}=k|Y_{t-1}=0)\) in (a) and \(P (Y_{t}=k|Y_{t-1}=5)\) in (b) according to different models

In addition to the information criteria, which help to assess the relative performance of different models, we consider the probability integral transform (PIT) histogram to assess the model adequacy (see Weiß 2018, Sect. 2.4). The PIT histogram is well-defined for ordinal data because it is solely based on the conditional CDF \(F_t(k)=P (Y_{t}\le k|\mathcal {F}_{t-1})\). It is constructed from the sample mean \(\overline{PIT }(u)\), \(u\in [0;1]\), of the terms

$$\begin{aligned} PIT _t(u) = \begin{cases} 0 &{} \text {if } u\le F_t(y_t-1), \\ \dfrac{u-F_t(y_t-1)}{F_t(y_t)-F_t(y_t-1)} &{} \text {if } F_t(y_t-1)<u<F_t(y_t), \\ 1 &{} \text {if } u\ge F_t(y_t), \end{cases} \quad \text {for all } t. \end{aligned}$$

If the PIT histogram is constructed with \(J\in \mathbb {N}\) subintervals of the form \([\frac{j-1}{J},\frac{j}{J}]\), \(j=1,\ldots ,J\), then the differences \(\overline{PIT }(\frac{j}{J})-\overline{PIT }(\frac{j-1}{J})\) define the heights of the corresponding rectangles. The PIT histogram allows us to identify a possible misrepresentation of the conditional dispersion structure by looking for deviations from uniformity. At this point, we also revisit an issue discussed in Weiß and Jahn (2022), where the soft-clipping binomial INGARCH(1, 1) model could not adequately describe the dispersion behavior of the air quality time series from the city of Zhengzhou. Thus, Fig. 5 displays PIT histograms for the two models from Fig. 4, estimated from both the Beijing and the Zhengzhou time series. Both models describe the Beijing series reasonably well, see the uniform shape in Fig. 5(a, b). For the Zhengzhou series, by contrast, the PIT histogram of the logistic binomial INGARCH(1, 1) model in (c) exhibits the same undesired pattern as the soft-clipping binomial INGARCH(1, 1) model from Weiß and Jahn (2022, Fig. 7(c)). The newly proposed neural softmax GARCH(1, 1) model, on the other hand, fits the shape of the conditional distribution very well, as the PIT histogram in Fig. 5(d) is close to uniformity.
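For concreteness, the bar heights defined above can be computed directly from the conditional CDF values \(F_t(y_t-1)\) and \(F_t(y_t)\). The following is a minimal sketch; the function names and array layout are ours, not taken from the original implementation.

```python
import numpy as np

def pit_t(u, F_lo, F_hi):
    """Nonrandomized PIT term for one observation, where
    F_lo = F_t(y_t - 1) and F_hi = F_t(y_t)."""
    if u <= F_lo:
        return 0.0
    if u >= F_hi:
        return 1.0
    return (u - F_lo) / (F_hi - F_lo)

def pit_histogram(F_lo, F_hi, J=10):
    """Heights of the J PIT histogram bars, given arrays of the
    conditional CDF values at y_t - 1 and y_t for all t."""
    grid = np.linspace(0.0, 1.0, J + 1)
    mean_pit = np.array([
        np.mean([pit_t(u, a, b) for a, b in zip(F_lo, F_hi)])
        for u in grid
    ])
    # Differences of the mean PIT define the bar heights; they sum to 1,
    # and a well-calibrated model yields heights close to 1/J.
    return np.diff(mean_pit)
```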

Fig. 5
figure 5

PIT histogram of different models for the air quality in Beijing and Zhengzhou

As another piece of evidence, we calculate the aggregated conditional CDF and compare it to the empirical marginal CDF (“marginal calibration”, see Sect. 2.4 in Weiß (2018), again well-defined for ordinal data). Table 2 shows corresponding values for the two models and the two cities considered previously. The results confirm the conclusions derived from the PIT histogram (Fig. 5) as we get a much closer agreement for the softmax models. Especially for the Zhengzhou series, the binomial model does not describe the marginal distribution very well.
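A minimal sketch of this marginal-calibration check, assuming the fitted conditional CDFs are available as a \(T\times (K+1)\) array (function and variable names are our own):

```python
import numpy as np

def marginal_calibration(cond_cdf, y, K):
    """Compare the aggregated conditional CDF with the empirical marginal CDF.

    cond_cdf : (T, K+1) array with entries F_t(k) = P(Y_t <= k | F_{t-1})
    y        : observed ordinal series coded 0..K
    Returns the two CDF vectors over k = 0, ..., K; close agreement
    indicates good marginal calibration."""
    agg = np.asarray(cond_cdf, dtype=float).mean(axis=0)
    emp = np.array([np.mean(np.asarray(y) <= k) for k in range(K + 1)])
    return agg, emp
```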

Table 2 Air quality: Empirical CDFs and aggregated conditional CDFs

Remark 2

Although the fitted neural softmax GARCH(1, 1) models provide an excellent fit to the Beijing and Zhengzhou series, let us point out that the neural model can be easily modified to account for further possible features of real-world ordinal time series. The general idea is to add further (simple) regressor variables to \(\varvec{f}^{ANN }_{\varvec{\sigma }}(\cdot )\) in (14) and to utilize the universal approximation property of the ANN. For example, adding a single time variable t would allow the model to account for an arbitrary nonlinear time trend (given that the number of hidden neurons H is large enough). To account for seasonality, it would make sense to define, e.g., a (single) day-of-year variable for daily data with annual seasonality or a (single) day-of-week variable for daily data with weekly seasonality. The expected nonlinear (cyclic) relationship of such variables does not constitute a problem in the neural model. Obviously, other covariates might be considered in a similar way, also allowing for arbitrary functional relationships without the need to consider polynomial or interaction terms.
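To illustrate the idea of the remark, a seasonal covariate could be appended to the regressor vector as a single raw day-of-year value, leaving any cyclic shape to the network itself. This is a hypothetical sketch of the feature construction, not the authors' implementation:

```python
import numpy as np

def build_regressors(y_lagged, t_index, period=365):
    """Regressor vector for one time step: lagged rank counts plus a single
    raw day-of-year covariate. No sine/cosine encoding is imposed, because
    the ANN is assumed to learn any nonlinear (cyclic) relationship itself."""
    day_of_year = t_index % period
    return np.concatenate([np.atleast_1d(y_lagged), [day_of_year]])
```

A day-of-week variable for weekly seasonality, or a plain time index for a nonlinear trend, would be appended in exactly the same way.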

Let us conclude our main data example by investigating the out-of-sample performance of the different models. Table 3 shows the conditional one-step-ahead forecast performance for the last 365 observations. On the one hand, we compute the log-likelihood (logl) value, which considers the whole predictive distribution. On the other hand, the percentage of correctly predicted categories (PCP) as well as the mean absolute error (MAE) are computed, the latter with respect to the numerical coding \(\{0,\ldots ,K\}\) of the ordinal range. For both PCP and MAE, the conditional median is taken as the point forecast value. Recall that large logl and PCP values as well as a small MAE value indicate a good forecast performance. To produce the respective forecasts, the models are estimated based on the first \(T-365\) observations. Generally, the neural softmax GARCH models display the best in-sample fit in terms of AIC and BIC (recall the discussion of Table 1) and also the best out-of-sample fit in terms of the log-likelihood, confirming the appropriateness of AIC and BIC for model selection. Regarding the other criteria, PCP and MAE, at most small differences are observed between the different models, which is due to the discreteness of the data (different models might produce the same conditional median from \(\{0,\ldots ,5\}\)). The best performance in terms of PCP and MAE is observed for the neural binomial INGARCH model with \(H=2\).
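The three forecast criteria can be computed from the one-step-ahead predictive PMFs as sketched below; the function name and array layout are assumptions on our part.

```python
import numpy as np

def forecast_metrics(pmf, y):
    """Out-of-sample criteria from predictive distributions.

    pmf : (n, K+1) array of one-step-ahead predictive PMFs
    y   : observed categories coded 0..K
    Returns (logl, pcp, mae), where the conditional median serves as the
    point forecast for PCP and MAE."""
    pmf = np.asarray(pmf, dtype=float)
    y = np.asarray(y)
    # Predictive log-likelihood: sum of log-probabilities of the outcomes.
    logl = np.sum(np.log(pmf[np.arange(len(y)), y]))
    # Conditional median: smallest k with F_t(k) >= 0.5.
    cdf = np.cumsum(pmf, axis=1)
    median = np.argmax(cdf >= 0.5, axis=1)
    pcp = np.mean(median == y)          # percentage of correct predictions
    mae = np.mean(np.abs(median - y))   # MAE w.r.t. the numerical coding
    return logl, pcp, mae
```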

Table 3 Air quality in Beijing, see Sect. 6: out-of-sample performance for different models (last 365 observations)

7 Further data applications

To demonstrate the wide applicability of the newly proposed models for ordinal time series, let us present two further data examples from different application areas. In order to keep the article at a reasonable length, we discuss these examples more briefly.

Our first additional data application considers “fear states” of the U.S. stock market derived from the Volatility Index (VIX). The data investigated here correspond to those discussed by Weiß (2019). On the basis of VIX, Hancock (2012) originally considered 13 fear states (so \(K=12\) according to our notation). However, the categories \(\{9,10,11,12\}\) are not observed during the respective time period from 1990 to 2006 (\(T=4287\)). This poses no problem for the binomial INGARCH models with their parametric conditional distribution, but it creates problems for the other models with category-specific parameters. Therefore, and because we have fewer than 10 observations of \(\{0\}\) and \(\{8\}\), we consider the re-defined categories \(\mathcal {S}=\{\{0,1\},2,3,4,5,6,\{8,9,10,11,12\}\}\) in the following analysis (implying \(K=6\)). According to Hancock (2012), these aggregated states can be interpreted as “extreme complacency or very low anxiety”, “low anxiety”, ..., “very high anxiety”, and “extremely high anxiety up to extreme panic”. A plot of the time series together with descriptive statistics is shown in Fig. 6. In analogy to Remark 1, we note that this time series serves for illustrative purposes here, whereas a comprehensive analysis of market risk should also comprise further metrics such as the raw VIX values.

Fig. 6
figure 6

Plot of the daily U.S. stock market fear states from 1990 to 2006. Median=1, IOV=0.459, \(\kappa _{ord }(1)=0.889\), \(\kappa _{ord }(2)=0.845\), \(\kappa _{ord }(3)=0.813\), and \(\kappa _{ord }(4)=0.795\)

Table 4 shows AIC and BIC values for different models analogous to Table 1. The most important result is that, by contrast to our main data example from Sect. 6, our newly proposed ORDGARCH models are now clearly preferred over the binomial models. The overall best model in terms of both AIC and BIC is the softmax GARCH(1, 1) model. The benefit of the feedback terms is plausible in light of the slowly decaying Cohen’s \(\kappa \), see Fig. 6. The neural softmax GARCH(1, 1) model is not able to improve the fit. This can be explained by the absence of highly non-linear dynamics in the series: the fear states tend to remain relatively constant over time (see Fig. 6).

Table 4 Fear states of U.S. stock market, see Sect. 7: information loss for different models (#par = number of model parameters)

Our final data example considers hourly levels of cloud coverage at a weather station in Schleswig, Germany. Following Weiß (2020b), five categories of cloudiness from “sky clear” to “overcast” are defined (\(K=4\)). As in Weiß (2020b), the time period under investigation is May 2011, consisting of \(T=744\) hours as plotted in Fig. 7.

Fig. 7
figure 7

Plot of the hourly cloud coverage at Schleswig, Germany in May 2011. Median = 3, IOV = 0.626, \(\kappa _{ord }(1)=0.741\), \(\kappa _{ord }(2)=0.617\), \(\kappa _{ord }(3)=0.516\), and \(\kappa _{ord }(4)=0.420\)

The results in Table 5 again confirm the benefit of the newly proposed softmax GARCH approach. Here, we can also directly compare to the regime-switching discrete ARMA models from (Weiß 2020b, Table 4), where the preferred two-regime model has \(BIC=1345.5\) and is, thus, clearly outperformed by any of the softmax models. The proposed neural extension of the softmax GARCH(1, 1) is beneficial in terms of AIC, whereas the BIC-optimal choice is given by the simple softmax GARCH(1, 0) model.

Table 5 Cloud coverage in Schleswig, see Sect. 7: information loss for different models (#par = number of model parameters)

8 Conclusions

In this paper, different possibilities to model GARCH-like dependence structures of ordinal time series were investigated. Two novel approaches were proposed which build on a conditional multinomial distribution for the (binarized) dependent variable and the associated likelihood. The first one is the ORDGARCH approach which strictly respects the ordinal nature of the data. The feedback term that introduces a GARCH-type intensified memory is based on the entire conditional CDF from previous time periods. The second one is the softmax GARCH approach which uses the rank-count formulation for the regressors. Here, the feedback term is constructed to represent the “expected category” from previous time periods. The softmax GARCH model is also particularly suitable when combined with a neural network regression function. These universal approximators allow for essentially arbitrary functional relationships between the regressors and the dependent variable. This flexibility turned out to be advantageous when analyzing the (ordinal) air quality time series from Chinese cities, where the neural softmax GARCH model displays a superior performance in terms of the information loss. As an absolute measure of the model adequacy, PIT histograms and marginal calibration tables confirm the capabilities of the newly proposed neural model. Also concerning the other data applications (fear states and cloudiness), the softmax GARCH models turned out to be an appealing choice.

Regarding future research, it would be interesting to discuss possible nonstationary extensions of our newly proposed ordinal time series models in further detail (recall Remark 2), and also spatio-temporal extensions appear to be particularly relevant. Such models for spatio-temporal ordinal data could be constructed in analogy to Jahn et al. (2023), and they would be of great use not only for environmental applications. Furthermore, the present research focused on the development of novel stochastic models for ordinal time series. With the employed neural network regression function, we integrate machine learning techniques while the stochastic framework of the non-neural softmax GARCH model is fully preserved. For future research, it would be interesting to investigate whether or how other machine learning approaches (see Géron 2019) could be utilized to model GARCH-type dependence structures in ordinal time series. As we are essentially dealing with classification problems, corresponding algorithms such as extreme gradient boosting classifiers might be considered.