Generalized threshold latent variable model

Abstract: This article proposes a generalized threshold latent variable model for flexible threshold modeling of time series. The proposed model encompasses several existing models and allows a discrete-valued threshold variable. Sufficient conditions for stationarity and ergodicity are investigated. The minimum description length principle is applied to formulate a criterion function for parameter estimation and model selection. A computationally efficient procedure for optimizing the criterion function is developed based on a genetic algorithm. Consistency and weak convergence of the parameter estimates are established. Moreover, simulation studies and an application to initial public offering data are presented to illustrate the proposed methodology.


Introduction
The threshold autoregressive (TAR) model, proposed by [58], enjoys enormous popularity in a wide range of applications. It allows the modeling of diverse behaviors under different regimes, which provides flexible descriptions of many real-world scenarios. The background, theory, applications and extensions of TAR models can be found in the excellent surveys of [59,61].
Recently, generalizations of the TAR model have been considered by introducing threshold structures into non-linear models instead of autoregressive models. One important direction is generalized linear models with thresholds, such as the generalized threshold mixed model (GTMM) ([53]) and the generalized threshold stochastic regression model (GTSRM) ([54]). In addition, threshold modeling has been extended to the heteroscedasticity of time series, for example, the double-threshold autoregressive moving average conditional heteroskedastic (DTARMACH) model ([38]), the threshold stochastic volatility (THSV) model ([56]), the multiple-threshold double autoregressive (MTDAR) model ([33]), and the threshold double autoregressive (TDAR) model ([34]). Furthermore, threshold models with more flexible regime-switching mechanisms are considered in the endogenous delay threshold (EDTAR) model of [27] and the hysteretic autoregressive (HAR) model of [36].
For the asymptotic theory of TAR models, the strong consistency and asymptotic distributions of the parameter estimates are studied by [10] and [13]. On the other hand, tests for a linear series against its threshold extension are considered in [12], [63] and [35]. For many sophisticated threshold models, asymptotic theories are developed by assuming stationarity and ergodicity of the process (for example, [54]). However, conditions for strict stationarity and ergodicity are investigated only for the self-excited threshold autoregressive (SETAR) model ( [11]), the TAR model with order p ( [3]), the DTARMACH model in [38], the HAR model in [36], the MTDAR model in [33], and the TDAR model in [34]. Corresponding results for the generalized linear-type threshold models remain unexplored.
More importantly, despite the well-developed theoretical background of estimation theory, the estimation procedure of threshold models demands a high computational cost. Due to the irregular nature of the threshold parameters, locating the global minimum of the likelihood requires a multi-parameter grid search over all possible values of the threshold parameters, which is typically computationally infeasible; see [32] and [65]. Consequently, many threshold models are implemented assuming one threshold a priori; see [56], [53], and [54]. When the number of thresholds and the parametric model of each regime are unknown, no practical estimation method appears available except for the simplest TAR model; see [66], [15] and [16].
To tackle the above problems and further extend the flexibility, in this article we first propose the generalized threshold latent variable model (GTLVM), which covers most of the aforementioned models as special cases. In particular, the threshold variable may be continuous or discrete valued. As far as we know, asymptotic theory for threshold models with discrete-valued thresholds has not been studied in the literature. The GTLVM specifies the conditional density of y_t as

\[
f(y_t \mid Y_{t-1}, X_t, z_{t-d}) = \sum_{i=1}^{r+1} \Big[\int f_i(y_t \mid Y_{t-1}, \lambda_t)\, h_i(\lambda_t \mid Y_{t-1}, X_t)\, d\lambda_t\Big] \times I(z_{t-d} \in (\theta_{i-1}, \theta_i]), \tag{2.1}
\]

where −∞ = θ_0 < θ_1 < ⋯ < θ_r < θ_{r+1} = ∞ are the thresholds that classify y_t into r + 1 regimes based on the threshold variable z_{t−d} and the threshold delay parameter d, Y_{t−1} = {y_{t−1}, …, y_{t−p_Y}} are the past observations, and X_t = {x_{t,1}, …, x_{t,p_X}} are the covariates. Moreover, f_i and h_i are the conditional densities of y_t and λ_t in regime i, respectively. We assume that z_{t−d} is measurable with respect to the sigma-field generated by the observations and covariates up to time t − d. Typical examples include z_{t−d} = y_{t−d} for self-excited threshold models, and z_{t−d} = x_{t−d,k} for some exogenous covariate x_{t−d,k}.
To explicitly describe the dependence of y_t on λ_t, Y_{t−1}, X_t and z_{t−d}, the GTLVM can be specified, for z_{t−d} ∈ (θ_{i−1}, θ_i], as

\[
\begin{aligned}
y_t &= \omega_i + \sum_{j=1}^{p_{Y,1}} \psi_{y,1,i}^{(j)}\, g_{y,1}(y_{t-j}) + \sum_{j=1}^{p_e} \psi_{e,i}^{(j)}\, e_{t-j} + g_\lambda(\lambda_t, u_t) + e_t, \\
\phi(\lambda_t) &= \alpha_i + \sum_{j=1}^{p_{Y,2}} \psi_{y,2,i}^{(j)}\, g_{y,2}(y_{t-j}) + \sum_{k=1}^{p_X} \psi_{x,i}^{(k)}\, x_{t,k} + \sum_{j=1}^{p_\epsilon} \psi_{\epsilon,i}^{(j)}\, \epsilon_{t-j} + \epsilon_t,
\end{aligned} \tag{2.2}
\]

where φ(·) is the link function, which is smooth and monotonically increasing; {ε_t} and {e_t} are independent and identically distributed (i.i.d.) mean-zero errors; g_{y,1}(·) and g_{y,2}(·) are known continuous functions; g_λ(λ_t, u_t) is an inverse cumulative distribution function with parameter λ_t, and {u_t} are i.i.d. uniform random variables on [0, 1]. Without loss of generality, we may assume g_{y,1}(0) = g_{y,2}(0) = 0 by adjusting ω_i and α_i accordingly. Note that (2.2) is a special case of (2.1): in (2.2), for any regime i, the conditional density of y_t given λ_t, denoted f_i(y_t | Y_{t−1}, λ_t), is determined by the distributions of u_t and e_t. Also, the conditional density of λ_t given Y_{t−1}, X_t, denoted h_i(λ_t | Y_{t−1}, X_t), follows from the second equation of (2.2). Integrating out the effect of λ_t in f_i(y_t | Y_{t−1}, λ_t) with respect to h_i, (2.1) follows. Although (2.1) is slightly more general than (2.2), we focus our attention on (2.2) for convenience in parametric modeling. By properly choosing λ_t, g_λ(λ_t, u_t), φ(λ_t) and z_{t−d}, model (2.2) covers a number of the aforementioned threshold models in the literature. For instance, if g_λ(λ_t, u_t) = 0 and g_{y,1} is the identity function, then (2.2) reduces to the threshold autoregressive moving average (TARMA) model (2.3) as in [31]. If, in addition, p_e = 0, then (2.3) reduces to the threshold autoregressive (TAR) model as in [58]. If g_λ(λ_t, u_t) = g_λ(u_t) is a quantile function with p_e = 0, then (2.2) reduces to the quantile self-excited threshold autoregressive (QSE-TAR) model in [8].
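To make the special cases concrete, the following minimal sketch (with hypothetical parameter values, not taken from the paper) simulates the simplest instance above, a two-regime SETAR(1), that is, (2.2) with g_λ ≡ 0, g_{y,1} the identity and p_e = 0:

```python
import numpy as np

def simulate_setar(n, d=1, theta=0.0, params=((0.5, 0.6), (-0.3, -0.4)),
                   burn=200, seed=1):
    """Simulate a two-regime SETAR(1) series, a special case of the GTLVM
    with g_lambda = 0, identity g_{y,1} and p_e = 0.
    params[i] = (omega_i, psi_i) are hypothetical regime coefficients."""
    rng = np.random.default_rng(seed)
    y = np.zeros(n + burn)
    for t in range(max(d, 1), n + burn):
        # regime is selected by the threshold variable z_{t-d} = y_{t-d}
        omega, psi = params[0] if y[t - d] <= theta else params[1]
        y[t] = omega + psi * y[t - 1] + rng.standard_normal()
    return y[burn:]

y = simulate_setar(500)
```

Both regimes here satisfy |ψ_i| < 1, so the simulated path stays stable; the burn-in discards the transient from the zero initial state.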
Next, denote λ_t = E(y_t) and let g_λ(λ_t, u_t) = F^{−1}(λ_t, u_t) be the inverse of the cumulative distribution function of some exponential family distribution with probability density

\[
f(y_t; \lambda_t, a_t, \nu) = \exp\Big\{\frac{y_t\,\eta(\lambda_t) - b(\eta(\lambda_t))}{\nu}\, a_t + c(y_t, \nu)\Big\},
\]

where η(λ_t) is the canonical parameter, ν is an overdispersion parameter and a_t is a user-specified weight. Then, the special case of (2.2) given by (2.4) reduces to the GTMM in [53] when φ(λ_t) = 0 in the first or last regime, and reduces to the GTSRM in [54] when ψ^{(j)}_{ε,i} = 0 for all j. In addition, (2.2) covers models with a double threshold structure and conditional heteroskedasticity. For example, the special case (2.5) of (2.2) obtained with λ_t = σ_t and an appropriate link φ is the MTDAR model ([33]). Similar arguments apply to the THSV model in [56]. Therefore, the proposed model (2.2) allows a unified treatment of the open problem of establishing the stationarity and ergodicity of many existing threshold-type models. Moreover, the algorithm developed in Section 4 provides an efficient solution to the computationally challenging problem of estimation and model selection for these models.
Denote the threshold parameters by Θ = (θ_1, …, θ_r) and the model parameters of the ith regime by Ψ_i = (ω_i, α_i, Ψ_{Y,1,i}, Ψ_{e,i}, Ψ_{Y,2,i}, Ψ_{X,i}, Ψ_{ε,i}). Combining the parameters in all regimes, we define Ψ = (Ψ_1, …, Ψ_{r+1}). Note that we allow some of the ψ^{(k)}_{x,i} and ψ^{(q)}_{ε,i} to equal zero, so that different autoregressive orders and covariates can be included in different regimes. Denote the model order parameter by p = (p_{Y,1}, p_{Y,2}, p_e, p_X, p_ε), whose components are defined analogously.

Stationarity and ergodicity
We state a set of sufficient conditions for the strict stationarity and ergodicity of the GTLVM (2.2) as follows:

Condition 1. a) The covariate X_t is independent of {y_s}_{s<t} and E|X_t| < ∞. In addition, there exists a positive integer p̄ such that {(X_t, …, X_{t−p̄+1})}_{t=1,…} is Markovian, strictly stationary and ergodic. Moreover, there exists a non-negative integer q such that z_t is measurable with respect to the sigma-field generated by {X_t, y_t, …, X_{t−q}, y_{t−q}}.
To establish stationarity and ergodicity of a threshold model, a condition preventing the series from exploding is required; for example, max_i Σ_j |ψ^{(j)}_{y,1,i}| < 1 for the TAR model in [3]. Under Condition 1, we have the following result for the GTLVM, whose proof is provided in the Appendix.
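A non-explosion condition of the kind mentioned above is easy to check numerically; the following helper sketches the check max_i Σ_j |ψ^{(j)}_{y,1,i}| < 1 (the coefficient values in the usage lines are hypothetical):

```python
def satisfies_contraction(psi_y1):
    """Check the non-explosion condition max_i sum_j |psi_{y,1,i}^{(j)}| < 1,
    where psi_y1[i] holds the AR coefficients of regime i."""
    return max(sum(abs(c) for c in regime) for regime in psi_y1) < 1.0

# Hypothetical regime-wise AR coefficients
ok = satisfies_contraction([(0.5, 0.3), (-0.4, 0.2)])   # sums 0.8 and 0.6
bad = satisfies_contraction([(0.9, 0.3), (0.1,)])        # sum 1.2 in regime 1
```

Here `ok` is True and `bad` is False: only the first coefficient set keeps every regime's absolute coefficient sum below one.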
Theorem 3.1. If either one of conditions 1) and 2) holds, then {y_t} is strictly stationary and ergodic.

Remark 2. The HAR model in [36] has a two-regime structure, where a > 0 and (θ_1, θ_1 + a] is called the hysteresis region; it is covered by (2.2) by defining g_λ(λ_t, u_t) = 0, p_e = 0, and an auxiliary regime indicator z*_t that is measurable with respect to the sigma-field generated by {y_t, z*_{t−1}, y_{t−1}, …, y_{t−q}, z*_{t−q}}. Also, z*_t is irreducible and aperiodic. Clearly, with q = 1 and r = 1, Condition 1a′) covers the HAR model (3.1). In other words, Theorem 3.1 guarantees the stationarity and ergodicity of the multiple-regime extension of the HAR model.
When specific knowledge is available on the threshold variable z_{t−d}, conditions 1) and 2) in Theorem 3.1 can be relaxed as follows.
We have the following corollary for the classical TAR and TARMA models defined in (2.3). For the TAR model, the condition in Corollary 3 is the same as in [11]. For the TARMA model, [38] also established stationarity and ergodicity under the irreducibility condition; however, it requires a coefficient condition that is slightly stronger than max_i Σ_j |ψ^{(j)}_{y,1,i}| < 1 in Corollary 3. We remark that [14] recently proved irreducibility for the TARMA(1,1) model under some parametric conditions (see Condition (C3) in [14]), which justifies the potential validity of Condition 1f).

Estimation and model selection criterion
Given the delay parameter d, the number of thresholds r, and the model order parameter p, the GTLVM is completely specified. Estimation of the model parameters can be performed by maximum likelihood. Specifically, the log-likelihood of the time series is

\[
L(\Psi, \Theta, d) = \sum_{t} \sum_{i=1}^{r+1} l_{t,i}\, I(z_{t-d} \in (\theta_{i-1}, \theta_i]), \tag{4.1}
\]

where l_{t,i} is the conditional log-likelihood of y_t given {Y_{t−1}, X_t} in regime i. Let r_0, d_0, p_0, Θ_0 and Ψ_0 be the true parameter values, and Θ̂ = (θ̂_1, …, θ̂_r), Ψ̂ = (Ψ̂_1, …, Ψ̂_{r+1}) be the corresponding maximum likelihood estimators that maximize (4.1) given d, r and p. Based on the likelihood function, traditional methods such as the sequential chi-square likelihood ratio test of [12] can be derived to determine the number of thresholds. However, allowing different autoregressive orders and combinations of covariates in different regimes complicates the implementation of these traditional methods. To overcome the computational burden, we adopt a model selection approach by developing a criterion function based on the minimum description length (MDL) principle ([47,48]). Given a model M, the criterion is

\[
\mathrm{MDL}(M) = \mathrm{CL}(M) + \mathrm{CL}(E_t \mid M), \tag{4.2}
\]

where CL(M) is the code length of the fitted model and CL(E_t | M) is the code length of the data given the model. From [47,48] and [30], it requires approximately log_2(n) bits to encode an integer and (log_2 n)/2 bits to encode a maximum likelihood estimator with n data points. From [66], the thresholds can be associated with the order statistics of the threshold variables {z_1, …, z_n} and require Σ_{i=1}^{r} log_2(n_i)/2 bits, where n_i is the number of observations in the ith regime. Recall that ψ^{(k)}_{x,i} = 0 when the kth covariate is not included in the model. As the integer 0 requires 1 bit and log_2 1 = 0, by denoting p_{X,i} = Σ_{w=1}^{p_X} p^{(w)}_{X,i} as the number of nonzero entries in p_{X,i}, the maximum likelihood estimator Ψ̂_{X,i} can be encoded with (p_{X,i}/2) log_2(n_i) bits. Similar arguments suggest that encoding Ψ̂_{Y,1,i}, Ψ̂_{e,i}, Ψ̂_{Y,2,i} and Ψ̂_{ε,i} requires (p_{Y,1,i}/2) log_2(n_i), (p_{e,i}/2) log_2(n_i), (p_{Y,2,i}/2) log_2(n_i) and (p_{ε,i}/2) log_2(n_i) bits, respectively.
Thus, we have

\[
\mathrm{CL}(M) = \text{(code length of the integers } r, d, p\text{)} + \sum_{i=1}^{r} \tfrac{1}{2}\log_2(n_i) + \sum_{i=1}^{r+1} \tfrac{p_i}{2}\log_2(n_i),
\]

where p_i = p_{Y,1,i} + p_{e,i} + p_{Y,2,i} + p_{X,i} + p_{ε,i} is the total number of nonzero coefficients in regime i. From [47], CL(E_t | M) can be approximated by the negative base-2 logarithm of the likelihood. Hence,

\[
\mathrm{MDL}(M) = \mathrm{CL}(M) - \log_2 L(\hat{\Psi}, \hat{\Theta}, d). \tag{4.3}
\]

The optimal model can then be selected as the values r̂, d̂ and p̂ that minimize MDL(M).
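The bookkeeping above can be assembled into a small sketch of the criterion; the treatment of the integer code lengths varies across formulations, so the constant terms here are illustrative assumptions rather than the paper's exact expression:

```python
import numpy as np

def mdl(neg_loglik_nats, regime_sizes, regime_params):
    """Illustrative MDL: code length of the model plus the negative base-2
    log-likelihood.  regime_sizes[i] = n_i, regime_params[i] = p_i (the
    number of nonzero coefficients in regime i)."""
    n = sum(regime_sizes)
    r = len(regime_sizes) - 1
    cl_integers = (r + 2) * np.log2(n)          # rough cost of r, d and p
    cl_thresholds = 0.5 * sum(np.log2(ni) for ni in regime_sizes[:-1])
    cl_params = sum(0.5 * p * np.log2(ni)
                    for p, ni in zip(regime_params, regime_sizes))
    # convert the negative log-likelihood from nats to bits
    return cl_integers + cl_thresholds + cl_params + neg_loglik_nats / np.log(2)

m_small = mdl(500.0, [100, 100], [2, 2])
m_big = mdl(500.0, [100, 100], [5, 5])
```

With identical likelihoods, the model with more nonzero coefficients pays a larger code length, so `m_big > m_small`; a sufficiently better fit (smaller negative log-likelihood) can overcome that penalty.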
Since the likelihood function remains constant when θ_i ∈ (R_j, R_{j+1}], j = 1, …, n − 1, where {R_1, …, R_n} are the ordered observations of the threshold variable, the estimator θ̂_i may take any value in (R_j, R_{j+1}] for some j; without loss of generality we take θ̂_i = R_{j+1}. Then, to obtain an approximate solution to the optimization problem, we develop a genetic algorithm, which has been found to achieve promising performance in related optimization problems in change-point analysis ([18,20], [41]) and in estimation of the TAR model ([64] and [66]). Inspired by [66], we develop the methodology with modifications for simultaneous detection of both the autoregressive and covariate structures.
The genetic algorithm is an imitation of biological evolution for optimization. It involves inheritance, crossover, mutation and filtering. Specifically, the algorithm begins with a population of chromosomes, where each chromosome stores the information of a candidate solution to the optimization problem. For our application, we first fix a d and define each candidate solution as a model M specified by some r and p. Then, the performance of each chromosome is measured by its information criterion MDL(M). Chromosomes with better performance have higher probabilities of conducting crossover and producing offspring, so their good model features are more likely to be inherited. Meanwhile, mutation occurs with a small probability, which brings in new models to seek the global optimum. After several generations of crossover and mutation, the best performing model is selected as the optimal one. Finally, we repeat the procedure with different d and select the model that attains the smallest MDL(M).
Specifically, each step of the genetic algorithm is described as follows. Chromosome Formation: First, generate the initial population, which is a set of chromosomes in vector form. Each chromosome is expressed as c = [r, p_1, (θ_1, p_2), …, (θ_r, p_{r+1})]; the parameter estimate Ψ̂ is obtained once c is specified. Similar to [66], the initial population is created as follows: 1) The number of thresholds r is generated from a Poisson distribution with mean 2. 2) Sample the θ_i's uniformly from {z_t}; reject the sample and draw again if any regime has fewer than τ_{n,0} observations. This minimum span condition ensures the estimation accuracy of Ψ.

3) Select the model order parameter p_i of each regime at random from the admissible orders.
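The initialization steps above can be sketched as follows; the maximum order and the chromosome layout are illustrative assumptions:

```python
import numpy as np

def initial_chromosome(z, tau0, max_order=3, rng=None):
    """Draw one chromosome [r, thresholds, orders]: r ~ Poisson(2),
    thresholds sampled uniformly from the observed threshold variable z,
    resampling until every regime holds at least tau0 observations
    (the minimum span condition)."""
    rng = rng or np.random.default_rng()
    while True:
        r = rng.poisson(2)
        thresholds = (np.sort(rng.choice(z, size=r, replace=False))
                      if r else np.array([]))
        edges = np.concatenate(([-np.inf], thresholds, [np.inf]))
        counts = np.histogram(z, bins=edges)[0]   # observations per regime
        if counts.min() >= tau0:
            break
    orders = rng.integers(1, max_order + 1, size=r + 1)  # one order per regime
    return r, thresholds, orders

z = np.random.default_rng(1).normal(size=300)
r, th, orders = initial_chromosome(z, tau0=20, rng=np.random.default_rng(0))
```

Rejection sampling keeps every regime estimable; with 300 observations and τ_{n,0} = 20, a draw with too many or badly placed thresholds is simply redrawn.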
The MDL(M) of each chromosome is then computed by (4.3). Crossover and Mutation: Crossover and mutation are two methods for generating offspring. In crossover, two chromosomes are selected from the population as "parents" with probabilities proportional to the inverse of their ranks of MDL(M). Next, p^o_1 is drawn from one of the parents' p_1 with equal probability. Then, for both parents, each of their pairs {θ_j, p_{j+1}} is selected with probability 0.5, and all selected pairs are sorted in ascending order of θ_j to produce an offspring c^o = [r^o, p^o_1, (θ^o_1, p^o_2), …, (θ^o_{r^o}, p^o_{r^o+1})]. If some thresholds θ^o_j violate the minimum span condition, pairs {θ^o_j, p^o_{j+1}} are deleted at random until the condition is satisfied. In mutation, one parent chromosome is selected from the population with probability proportional to the inverse of its rank of MDL(M). Then, a new chromosome is generated to crossover with the parent chromosome to produce an offspring. To achieve a higher degree of exploration, the features from the generated parent are selected with probability 0.7.
To balance between retaining good features with crossover and bringing new solutions with mutation, the probabilities of performing crossover and mutation are 0.9 and 0.1, respectively. To explore more possibilities in the model order, every offspring will have its order parameter in one randomly selected regime replaced by a newly generated order, with probability 0.3.
After conducting crossover and mutation, the group of offspring becomes a new generation of chromosomes. To ensure monotonicity of the optimization, an elitist step is conducted to replace the worst performing 20 chromosomes in the new generation by the best performing chromosome in the previous generation.
Migration: With the advance of parallel computing, the island model is introduced to accelerate the computation and alleviate trapping in suboptimal solutions. In particular, we perform the genetic algorithm on N_I subpopulations of size N_p each. These subpopulations conduct their own reproduction steps and are thus treated as distinct islands. To share good features between islands, every M_i generations the M_N worst performing chromosomes in the jth island are replaced by the best M_N chromosomes from the (j − 1)th island, for j = 1, …, N_I, where the 0th island is conventionally defined as the N_I-th island. In this article we used N_I = 50, N_p = 200, M_i = 4, and M_N = 2. The full mechanism and the performance improvement of the parallelized genetic algorithm can be found in [1,2].
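The ring migration step described above can be sketched as follows (with islands stored as plain lists and a generic fitness function, smaller being better, standing in for MDL):

```python
def migrate(islands, fitness, m_n=2):
    """Ring migration: replace the m_n worst chromosomes of island j with
    the m_n best of island j-1 (index -1 wraps around, so island 0 receives
    from the last island).  fitness maps a chromosome to its MDL value."""
    # snapshot donors first so migration uses pre-migration populations
    donors = [sorted(pop, key=fitness)[:m_n] for pop in islands]
    for j in range(len(islands)):
        islands[j].sort(key=fitness)        # best first
        islands[j][-m_n:] = donors[j - 1]   # overwrite the worst
    return islands

islands = migrate([[5, 1, 9], [7, 2, 8]], fitness=lambda x: x, m_n=2)
```

With the toy fitness above, island 0 ends as [1, 2, 7] (its worst two replaced by island 1's best two) and island 1 as [2, 1, 5].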
Claim of Convergence: When the best chromosome remains unchanged over 20 generations, we claim that convergence is achieved and the optimal model is obtained from the parameters of the best chromosome. Alternatively, in consideration of computational efficiency, the algorithm may be stopped after a fixed number of generations.

Assumptions for asymptotic inferences
Apart from Condition 1, we state the following assumptions for asymptotic inference. First, Assumptions 1-3 are proposed for consistency. Assumption 2 requires that the likelihood of each regime is regular in the sense that the maximum likelihood estimator Ψ̂_i is asymptotically normal ([17] and [44]), and that the third-order derivative of L(Ψ, Θ, d) is uniformly bounded. In addition, Assumption 3 requires that for any vector Φ of the same dimension as (1, X^T_t) with |Φ| = 1, there exists an ε > 0 bounding the corresponding quadratic form away from zero. Assumptions 1-2 are standard regularity conditions for statistical inference in parametric models. The assumption on the third-order derivatives of L(Ψ, Θ, d) is essential in deriving the asymptotic distribution of Θ̂; see Theorem 5.41 of [62]. Assumption 3, which is in a similar spirit to Assumption 3 in [54], assumes the linear independence of X_t to eliminate redundancy in the covariates. It holds if the joint conditional density of the exogenous covariates X_t is non-degenerate given z_{t−i} and z_{t−j}.
Additionally, the following two assumptions are required for the convergence rates of the estimators. Assumption 4 imposes a bound on the difference of log-likelihoods for any Ψ^{(1)}, Ψ^{(2)} ∈ Ω_Ψ. Furthermore, we assume that either one of the following conditions holds:

a) The marginal density of z_t, π_z(·), is continuous at {θ^0_1, …, θ^0_r}, and the relevant joint probability densities of the threshold variables exist and are bounded.

b) The threshold variable z_t is discrete, and y_t is in the exponential family, with conditional density of the form exp{γ(λ_t) T(y_t) − b(λ_t) + c(y_t, ν_i)}, where T(·) is a measurable function such that var(T(y_t) | λ_t) ∈ (0, ∞) almost surely, γ(·) and b(·) are continuous transformation functions, and ν_i is the overdispersion parameter for regime i.

Assumption 4 states the conditions on the likelihood function for continuous and discrete threshold variables, respectively. Asymptotic theory for threshold models with discrete threshold variables does not seem to have been studied in the literature. Similar to Assumption 6 in [54], we impose a square-integrable bound function for the difference of log-likelihoods in both Assumptions 4a) and 4b). While an example verifying Assumption 4a) is shown in the supplementary materials of [54], an example verifying Assumption 4b) is given in the Appendix. Furthermore, Assumption 4b) requires that the conditional density of y_t given {Y_{t−1}, X_t, z_{t−d}} is in the exponential family; thus, an explicit form of the difference in log-likelihoods is available for applying results from large deviation theory; see [46]. Note that although the assumptions and asymptotic properties of Θ̂ for continuous and discrete z_t are different, the same estimation procedure proposed in Section 4 applies to both. Assumption 5 is analogous to Assumption 7 in [54] for proving the convergence of n(Θ̂ − Θ_0). For any integer j, let A and A* be the σ-algebras generated by {w_t}_{t≤j} and {w_t}_{t≥j+k}, respectively. If {w_t} is ρ-mixing, then there exists a sequence {ρ(k)}_{k=1,2,…}
with ρ(k) → 0 as k → ∞ such that, for all square-integrable random variables g and h that are respectively A- and A*-measurable, |corr(g, h)| ≤ ρ(k) holds; see [7] and [21]. See also Examples 1 and 2 in the supplementary materials of [54] for verification of ρ-mixing and selections of the function Γ. We illustrate the verification of the above assumptions in explicit examples in Section 6.

Asymptotic theorems
Under Condition 1, we develop the consistency and asymptotic properties of the parameter estimates. The proofs are provided in the Appendix.
Next, we derive the convergence rate of Θ̂ for continuous and discrete z_t, respectively. The limiting behavior of the threshold estimates is characterized through the difference between the log-likelihoods of y_s under parameters Ψ^{(1)} and Ψ^{(2)}, which gives rise to a pair of independent compound Poisson processes indexed by κ_i. The following theorems derive the asymptotic distributions of the threshold parameter estimates Θ̂ and the model parameter estimates Ψ̂.

Simulation
In this section, two simulation experiments are performed to illustrate model fitting with respect to discrete and continuous thresholds. In each simulation experiment, 300 replications are conducted for every scenario. For simplicity, d is assumed to be known.

Example 1
Consider a three-regime self-excited GTLVM (6.1) with Poisson distribution and log-link; a sample time series plot is depicted in Figure 1.
As the threshold variable is discrete and no covariate is used, it suffices to verify Assumptions 1, 2 and 4b). As is common in the literature, Assumption 1 can be achieved by focusing attention on a sufficiently large compact subset of the parameter space. For Assumption 2, note that the conditional density f (y t | Y t−1 , y t−4 ) is the density of Poisson distribution, and is thus regular.
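The exact specification of (6.1) is not reproduced here; the following sketch simulates a self-excited threshold Poisson series of the same flavor, with a log-link, the discrete threshold variable y_{t−4} as in this example, and hypothetical thresholds and coefficients:

```python
import numpy as np

def simulate_poisson_gtlvm(n, thresholds=(2, 6), omegas=(0.5, 1.0, 1.5),
                           psis=(0.3, 0.2, 0.1), d=4, burn=200, seed=7):
    """Simulate a three-regime self-excited threshold Poisson model with
    log-link: lambda_t = exp(omega_i + psi_i * log(1 + y_{t-1})), where the
    regime i is chosen by the discrete threshold variable y_{t-d}.
    Thresholds and coefficients are hypothetical."""
    rng = np.random.default_rng(seed)
    y = np.zeros(n + burn, dtype=int)
    for t in range(d, n + burn):
        # side="left" maps y_{t-d} into the interval (theta_{i-1}, theta_i]
        i = int(np.searchsorted(thresholds, y[t - d], side="left"))
        lam = np.exp(omegas[i] + psis[i] * np.log1p(y[t - 1]))
        y[t] = rng.poisson(lam)
    return y[burn:]

y = simulate_poisson_gtlvm(300)
```

Because the log-intensity grows only logarithmically in the lagged count, the intensity is sublinear in the past and the counts remain stable.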
The conditional densities in adjacent regimes are not equal almost everywhere for all i, since the latent variable λ_t takes different values in the two regimes. Verification of Assumption 4b) is more technical and is given in the Appendix. Table 1 reports the model selection performance. Even for a small sample size of 200, the percentage of correctly estimating the number of thresholds is over 80%, and the percentage of correctly identifying the model order in all regimes is over 60%. For comparison, we repeated the experiment with the Bayesian information criterion, defined as −2L(Ψ̂, Θ̂, d) + Σ_{i=1}^{r+1} p_i log(n_i). It is found that minimum description length gives better performance. Other information criteria such as NAIC in [59] might also be adopted; however, NAIC and other AIC-type criteria are not consistent in estimating the true order of the model. See [26] for details on the consistency of information criteria.
Furthermore, Table 2 summarizes the performance of the threshold and model parameter estimates within the replications that correctly specify the regime and model structure using minimum description length. A fast convergence speed for discrete thresholds is observed. Furthermore, the O_p(n^{−1/2}) convergence rate of Ψ̂ is realized based on the rapid convergence of Θ̂.

Example 2
We consider a double-threshold autoregressive model (6.2) with conditional heteroskedasticity through a log-link on σ²_t, where ε_t ∼ N(0, 1). In the third regime, we select p_{Y,2,3} = 0, so that σ²_t = e^{0.15} is a constant. A sample time series plot is depicted in Figure 1.
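The exact coefficient values of (6.2) are not reproduced here; a sketch of a series of this type, with hypothetical coefficients and a constant variance in the third regime as in the example, is:

```python
import numpy as np

def simulate_dt_arch(n, thresholds=(-0.5, 0.5), burn=200, seed=11):
    """Simulate a double-threshold AR(1) with a log-link on sigma_t^2:
        y_t = omega_i + phi_i * y_{t-1} + sigma_t * eps_t,
        log(sigma_t^2) = alpha_i + beta_i * 2*log(max(|y_{t-1}|, 0.01)).
    All coefficients are hypothetical; beta_3 = 0 makes the third regime's
    variance the constant exp(0.15), as in the example."""
    omega = (0.2, 0.0, -0.2)
    phi = (0.4, 0.1, -0.3)
    alpha = (0.1, -0.1, 0.15)
    beta = (0.2, 0.1, 0.0)
    rng = np.random.default_rng(seed)
    y = np.zeros(n + burn)
    for t in range(1, n + burn):
        i = int(np.searchsorted(thresholds, y[t - 1], side="left"))
        log_var = alpha[i] + beta[i] * 2.0 * np.log(max(abs(y[t - 1]), 1e-2))
        y[t] = omega[i] + phi[i] * y[t - 1] + np.exp(0.5 * log_var) * rng.standard_normal()
    return y[burn:]

y = simulate_dt_arch(400)
```

The floor at 0.01 inside the logarithm mirrors the truncation used for y* later in the article and keeps the log-variance finite near zero.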
Note that (6.2) can be expressed as (2.2) with φ(x) = log(x), g_λ(λ_t, ·) ≡ 0, g_{y,1}(x) = x, g_{y,2}(x) = 2 log(x) and p_X = p_e = p_ε = 0. Condition 1 can be verified readily using arguments similar to those for model (6.1). As the threshold variable is continuous and no covariate is used, it suffices to verify Assumptions 1, 2 and 4a). While Assumptions 1 and 2 can be verified as in Example 1, the verification of Assumption 4a) is similar to [54]. The empirical classification rates of regime and model structure for (6.2), and comparisons between criteria, are reported in Table 3. Again, the results are promising for moderately large sample sizes, and minimum description length still achieves superior performance. Tables 4 and 5 summarize the estimation results for the thresholds and model parameters, where the asymptotic rates, O_p(n^{−1}) for Θ̂ and O_p(n^{−1/2}) for Ψ̂, are recognized. In conclusion, the proposed methodology has satisfactory performance.

Application
Initial public offerings (IPOs) are one of the most important funding sources in finance. IPO activity is found to be time-varying; see, for example, [50], which discusses the cyclical effect with respect to stock market bull/bear trends and the cross-year variation of monthly IPO volumes. This is evidenced by large variation, with clustering of small and large observations, across yearly periods. However, few studies of IPO activity have considered quantitative modeling. [39] and [40] proposed autoregressive models which incorporate past monthly initial returns and market participation proxies. Nevertheless, linear autoregressive modeling is theoretically questionable for integer IPO volumes. Moreover, as indicated in [49] and [68], different market scenarios exist in the IPO market. These facts suggest the necessity of regime classification for proper modeling.
Thus, for theoretical soundness and modeling flexibility, the GTLVM is applied to modeling IPO volumes. We model the U.S. monthly net IPO volumes from January 1976 to March 2014, where the dataset is available from [51] (https://www.quandl.com/data/RITTER/US IPO STATS-Historical-US-IPO-Statistics). By the definition in [51], net IPO volumes exclude issuance of penny stocks, units and closed-end funds.
To flexibly model count data, we use a negative binomial distribution with a canonical log-link function in order to capture different dispersions across regimes, where λ_t satisfies λ_t = E(y_t) and k_i is the dispersion parameter of regime i. With y*_{t−j} = log[max(y_{t−j}, 0.01)] = g_{y,2}(y_{t−j}), the link function is expressed in terms of the y*_{t−j}. To capture the dependence on the previous 12 months, we set the maximum delay to D = 12 and the maximum autoregressive order to Q_{Y,2} = 12. As indicated by [39], IPO activities are affected by new information arriving during the book-building period, which lasts approximately two to four months. Hence, we consider the averaged historical two-, three- or four-month returns of the S&P 500 Index, denoted x^{(2)}, x^{(3)} and x^{(4)}, respectively, as possible covariates representing recent market performance. In addition, the past observations y*_{t−2}, y*_{t−3} and y*_{t−4} are included in the model. In each regime, the parameter Ψ_i is estimated by quasi-maximum likelihood with the iteratively reweighted least squares method; see [42].
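As a sketch of the iteratively reweighted least squares machinery used for the per-regime fits, the following numpy-only example implements IRLS for a log-link count regression with Poisson working weights (the quasi-likelihood special case in which the dispersion enters only as a scale factor); the synthetic coefficients are hypothetical:

```python
import numpy as np

def irls_loglink(X, y, n_iter=25):
    """Minimal IRLS (Fisher scoring) for a log-link count regression.
    X: (n, p) design matrix including an intercept column; y: (n,) counts."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        z = X @ beta + (y - mu) / mu   # working response
        W = mu                          # working weights for the log-link
        XtW = X.T * W                   # equivalent to X.T @ diag(W)
        beta = np.linalg.solve(XtW @ X, XtW @ z)
    return beta

# quick check on synthetic data with hypothetical coefficients (0.5, 0.8)
rng = np.random.default_rng(3)
X = np.column_stack([np.ones(2000), rng.normal(size=2000)])
y = rng.poisson(np.exp(0.5 + 0.8 * X[:, 1]))
beta = irls_loglink(X, y)
```

With 2000 observations, the recovered coefficients land close to the generating values; a negative binomial fit would modify the working weights through its variance function but follow the same loop.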
For the net IPO series, the two thresholds are estimated as θ̂_1 = 2 and θ̂_2 = 24. The estimated dispersions k̂_i in the three regimes are 1.5962, 4.1817 and 9.9228, with theoretical standard errors of 0.4593, 0.5342 and 1.6211, respectively. The estimated link function is given in (7.1), and the fitted series is shown in Figure 2. For model diagnostics, the standardized deviance residuals are plotted in Figure 3; their fluctuation around zero indicates that the fit is adequate.
Some discussion of the estimation results is as follows. First, [49] asserts that IPO activity can be classified into two regimes, "cold" and "hot" markets, depending on the volume of activity. Later, [68] proposed a more sophisticated three-regime classification into "cold", "normal" and "hot" markets. The results in (7.1) suggest that the classification in [68] is more appropriate.
Second, (7.1) indicates that stock market return is an effective predictor of IPO volumes. In particular, the positive coefficients of x^{(2)} in regimes 1 and 2 of (7.1) indicate that IPO activity is positively associated with stock market performance. This phenomenon agrees with the theory in [45] that high market returns increase the incentive for IPO issuance, thus contributing to IPO market activity as volumes soar. Moreover, the coefficient of x^{(2)} decreases from regime 2 to regime 1 and becomes insignificant in regime 3, indicating that the positive effect of stock market returns diminishes as recent IPO activity increases. One possible explanation is as follows: as mentioned in [49], when the market is overactive, severe underpricing exists and discourages entrepreneurs. Hence, entrepreneurs choose to issue stocks in other periods, which offsets the influence of market performance.

Proof of Theorem 3.1 (strict stationarity and ergodicity)
First, we state the following definitions from Markov chain theory that provide the background for studying strict stationarity and ergodicity.

Definition 1. Irreducibility
Definition 2. Geometric ergodicity: a Markov process {Y_t} on {Ω, B} is geometrically ergodic if there exists a probability measure π on B such that for all A ∈ B and y ∈ Ω, there exist ρ ∈ (0, 1) and M_y > 0 with |P^n(y, A) − π(A)| ≤ M_y ρ^n in total-variation norm. This implies that {y_t} is ergodic, β-mixing and has a unique stationary distribution π; see [4] and [52].

Definition 3. Small set and petite set: a set C ∈ B is said to be small if there exist an integer m > 0 and a non-trivial measure v_m(·) on B such that for all y ∈ C and A ∈ B, P^m(y, A) ≥ v_m(A).
Similarly, a set C is said to be petite for {y t } if there exists a probability measure γ * (·) on N + and a non-trivial measure v γ * (·) on B such that for all y ∈ C and A ∈ B, ∞ n=0 P n (y, A)γ * (n) ≥ v γ * (A). Clearly, a small set is a petite set; see [43].
The proof of Theorem 3.1 relies on the following theorem about the ergodicity of Markov chains.

Theorem 7.1. ([43]) Let n(y): Ω → N_+ be an integer-valued function. An irreducible chain {Y_t} on Ω is geometrically ergodic if it is aperiodic and there exists a non-negative function V ≥ 1 on Ω, bounded on a petite set C, such that for all y ∈ Ω there exist ρ ∈ (0, 1) and b ∈ (0, ∞) with

\[
E\big[V(Y_{n(y)}) \mid Y_0 = y\big] \le \rho^{n(y)}\, V(y) + b\, I_C(y).
\]

By Condition 1a), denote by Ξ_t the state vector collecting the current covariates and observations, and by F_{t−1} the sigma-field generated by their past values. Also, as z_{t−d} is measurable with respect to the sigma-field generated by the observations and covariates up to time t − d, integrating both sides with respect to the density of z_{t−d} gives pr(Ξ_t | F_{t−1}) = pr(Ξ_t | Ξ_{t−1}), and hence {Ξ_t} is a Markov process. Under Condition 1a′), we can analogously verify the Markovian property of {Ξ_t, z*_t}; thus, {Ξ_t, z*_t} is shown to be Markovian.

Next, we illustrate our proof with p_X = p_e = p_ε = 0 for simplicity. In this case, it suffices to prove stationarity and ergodicity of {Y_t} = (y_t, …, y_{t−p*+1}). We show the geometric ergodicity of {y_t} using Theorem 7.1; hence, we need to verify that {Y_t} or {Y_t, Z*_t} is an irreducible and aperiodic Markov process, and construct the corresponding function V. For the general case (except for the irreducibility when p_e > 0), the same verification methodology applies to {Ξ_t} or {Ξ_t, z*_t} with mild modifications, and hence the proof is omitted. Therefore, it suffices to show the stationarity and ergodicity of (7.4). Denote μ as the Lebesgue measure on R and μ_{p*} as the Lebesgue measure on R^{p*}. Next, we show that {Y_t} or {Y_t, Z*_t} is irreducible and aperiodic, and that a small set exists, through the following proposition.
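As a concrete illustration of the drift criterion in Theorem 7.1 (a standard textbook computation, not taken from this paper), consider the linear AR(1) special case y_t = φ y_{t−1} + e_t with |φ| < 1 and E|e_t| < ∞; with n(y) ≡ 1 and test function V(y) = 1 + |y|,

```latex
E\big[V(y_t)\mid y_{t-1}=y\big]
  = 1 + E|\phi y + e_t|
  \le 1 + |\phi||y| + E|e_t|
  = |\phi|\, V(y) + \big(1 - |\phi| + E|e_t|\big).
```

For any ρ ∈ (|φ|, 1), the right-hand side is at most ρV(y) whenever V(y) ≥ (1 − |φ| + E|e_t|)/(ρ − |φ|), so the drift inequality holds with C = {y : |y| ≤ c} for c sufficiently large and b = 1 − |φ| + E|e_t|; Theorem 7.1 then yields geometric ergodicity.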
Proof of Proposition 1. We illustrate the proof for {Y_t} under Condition 1a); the proof for {Y_t, Z*_t} under Condition 1a′) follows similar arguments. First, assume that e_t and ε_t have almost everywhere continuous and positive densities on R. We divide the proof into three parts: irreducibility, existence of a small set, and aperiodicity. 1) Irreducibility. As e_t has a positive density on R, y_t can reach any point of R. With respect to (7.4), we can construct a set E_{t,i} that maps surjectively onto A, so that μ(E_{t,i}) > 0 holds. Denote the density of e_t as f_e(·). As e_t and ε_t have almost everywhere positive densities, the conditional densities f_i(y_t | Y_{t−1}, λ_t, u_t, z_{t−d}) and h_i(λ_t | Y_{t−1}, z_{t−d}) are almost everywhere positive. Therefore, from (2.1), the marginal density f is positive almost everywhere. Denoting the marginal densities of u_t and ε_t as f_u(·) and f_ε(·), respectively, and the density of z_t as f_z(·), it follows that, almost surely, the transition density of {Y_t} is positive. Therefore, for any Ã ⊂ R^{p*} and y ∈ R^{p*} with Lebesgue measure μ_{p*}(Ã) > 0, since there exist some A_1 ∈ R and A_2 ∈ R^{p*−1} such that A_1 × A_2 ⊂ Ã and μ(A_1) > 0, the chain reaches Ã from y with positive probability. This completes the proof of irreducibility.
2) Existence of a small set $C$. By Definition 3, we need to construct a small set $C$ such that, for all $Y_0 = y \in C \subset \mathbb{R}^{p^*}$ with $\mu_{p^*}(C) > 0$, there exists a non-trivial measure $v_m(\cdot)$ on the Borel sigma-field $\mathcal{B}$ on $\mathbb{R}^{p^*}$ such that $P^m(y, \tilde{A}) \ge v_m(\tilde{A})$ for any $\tilde{A} \in \mathcal{B}$. We construct the set as $C = \{y = (y_1, \ldots, y_{p^*}) \in \mathbb{R}^{p^*} : |y|_\infty \le c\}$ for some constant $c$, where $|y|_\infty = \max\{|y_1|, \ldots, |y_{p^*}|\}$. First, as the density of $e_t$ is almost everywhere positive, $f(y_t \mid Y_{t-1}, \lambda_t, u_t, z_{t-d}) > 0$. Analogous to (7.5)–(7.6), we have that $f(y_t \mid Y_{t-1}) > 0$, and hence, with (7.6), $P^1(y, A) > 0$ for any $A \subset \mathbb{R}$ with $\mu(A) > 0$. By induction on (7.7), we have $v_m(\tilde{A}) > 0$ by the compactness of $C$. Thus, the set $C$ is verified to be a small set.
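The minorization property required of a small set in Definition 3 can be stated generically as follows ($P^m$ denotes the $m$-step transition kernel; $v_m$ and $\mathcal{B}$ are as above):

```latex
% Small set (minorization): C is small if, for some m \ge 1 and a
% non-trivial measure v_m on (\mathbb{R}^{p^*}, \mathcal{B}),
P^{m}(y, \tilde{A}) \;=\; \mathrm{pr}(Y_m \in \tilde{A} \mid Y_0 = y)
\;\ge\; v_m(\tilde{A}),
\qquad \text{for all } y \in C,\ \tilde{A} \in \mathcal{B}.
```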
3) Aperiodicity. From the proof of the existence of the small set $C$, it has been shown that $P^1(y, C) > 0$ and $P^2(y, C) > 0$ for all $y \in C$. By Proposition A1.1 of [9], it follows that $\{y_t\}$ is aperiodic.
To relax the assumption that $\epsilon_t$ and $e_t$ have almost everywhere continuous and positive densities on $\mathbb{R}$, we extend the argument to the case where $e_t$ and $\epsilon_t$ do not have almost everywhere positive densities (for instance, the degenerate case $e_t = \epsilon_t = 0$ for all $t$). Define a perturbation as in (7.8), where $\sigma_{1,m}, \sigma_{2,m} > 0$, $\sigma_{1,m} \to 0$ and $\sigma_{2,m} \to 0$ as $m \to \infty$, and $e_{1,t}, e_{2,t}$ are i.i.d. zero-mean noises with finite first moments and almost everywhere positive densities. Using the perturbation techniques of [22], $\{y_t^m\}$ is irreducible and aperiodic, and a small set exists. Thus, we can derive the strict stationarity and ergodicity of $\{y_t^m\}$ by Theorem 7.1. Since $\{y_t^m\}$ converges almost surely to $\{y_t\}$ as $m \to \infty$, the proof is complete.
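The perturbation device can be sketched generically as below; this is an illustrative form in the spirit of [22], and the exact definition is the paper's (7.8), which is not reproduced here:

```latex
% Perturbed processes (schematic): add vanishing continuous noise
y_t^{m} = y_t + \sigma_{1,m}\, e_{1,t}, \qquad
\lambda_t^{m} = \lambda_t + \sigma_{2,m}\, e_{2,t},
% with \sigma_{1,m}, \sigma_{2,m} \to 0 as m \to \infty, and
% e_{1,t}, e_{2,t} i.i.d. zero-mean with a.e. positive densities,
% so that y_t^m inherits a.e. positive conditional densities and
% y_t^m \to y_t almost surely as m \to \infty.
```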
Proof of Lemma 3.1. Lemma 3.1 is established in the irreducibility part of the proof of Proposition 1.
Let $\mathcal{F}_{t-1}$ be the $\sigma$-field generated by $\{y_s, \epsilon_s\}_{s \le t-1}$. From the independence of $z_{t-d}$ and $\{y_s\}_{s<t}$, for concave $\phi$ in case 1), we have, for some $\tilde{\rho} \in (0, 1)$ and some constant $H_2^*$, the required drift bound. Using similar arguments as in the proof of Theorem 3.1, $\{y_t\}$ is geometrically ergodic and hence strictly stationary and ergodic. If $\phi(\cdot)$ is a polynomial of order $\gamma \ge 1$ as in case 2), then by considering $\rho_z(\gamma, z_{t-d})$ instead of $\rho_z(1, z_{t-d})$, similar derivations yield a result analogous to (7.15), with $|\psi_{y,2,i}^{(j)}|$ replaced by $|\psi_{y,2,i}^{(j)}|^{1/\gamma}$. Thus, the strict stationarity and ergodicity are verified.

Verification of Assumption 4b)
In this section, we illustrate the verification of the conditions in Assumption 4b) using the model in Example 1 of the simulation studies. First, we have $T(y_t) = y_t$. As Theorem 3.1 indicates that $\{y_t\}$ is strictly stationary and ergodic, $y_t < \infty$ almost surely. Denote $Y_t = (y_t, \ldots, y_{t-3})$ and let $|\cdot|_\infty$ be the infinity norm, so that $|Y_t|_\infty = \max\{|y_t|, \ldots, |y_{t-3}|\}$. Note that $\{Y_t\}$ is a Markov process. By (6.1), $\lambda_t$ is bounded accordingly for some $0 < \rho < 1$ and positive constants $K_\lambda$ and $c_\lambda$. Thus, $\lambda_t < \infty$ almost surely, and $\mathrm{var}(T(y_t) \mid \lambda_t) < \infty$ almost surely. Next, we show the existence of the moments of $y_t$ and $\lambda_t$. By strict stationarity and ergodicity, we have $E(y_t) = E(\lambda_t) < \infty$, and hence $E(y_t^k) < \infty$ and $E(\lambda_t^k) < \infty$ for all $k \in (0, 1]$. For $k > 1$, we denote $\tilde{y}_t = y_t^k$ and, analogous to [22], we consider a sequence of perturbations $\{\tilde{y}_t^m\}$. As $\rho \in (0, 1)$, for any constants $c_1$ and $c_2$, we have $c_1 y^\rho + c_2 < \rho y$ for all sufficiently large $y$. Therefore, the drift bound holds for all $|\tilde{Y}_{t-1}^m|_\infty^{1/k} > y^*$ with some sufficiently large $y^*$. Meanwhile, there exists a sufficiently large constant $H_k$ such that the bound also holds on the complement. By induction arguments as in (7.12) and (7.13), we obtain (7.18). Analogous to (7.13), (7.18) implies (7.2) with $V(\tilde{Y}_{t-1}^m) = |\tilde{Y}_{t-1}^m|_\infty + 1$, and hence the geometric ergodicity of $\{\tilde{y}_t^m\}$ is verified by Theorem 7.1, which further implies $E(\tilde{y}_t^m) < \infty$. Then, as $c_m \to 0$, $\tilde{y}_t^m \to \tilde{y}_t$ almost surely, and hence $E(\tilde{y}_t) = E(y_t^k) < \infty$ for all $k > 1$. In conclusion, $E(y_t^k) < \infty$ and $E(\lambda_t^k) < \infty$ for all $k > 0$.
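The elementary inequality invoked above ($c_1 y^\rho + c_2 < \rho y$ for all sufficiently large $y$) can be checked directly:

```latex
% Why c_1 y^{\rho} + c_2 < \rho y eventually, for 0 < \rho < 1:
\frac{c_1 y^{\rho} + c_2}{y}
\;=\; c_1 y^{\rho - 1} + c_2 y^{-1}
\;\longrightarrow\; 0 \;<\; \rho, \qquad y \to \infty,
% so the ratio falls below \rho beyond some threshold y^*,
% which is exactly what the drift bound requires.
```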
With the existence of all moments of $y_t$ and $\lambda_t$, we are ready to verify the remaining conditions. As $\log[E(e^{uT(y_t)} \mid \lambda_t)] = \lambda_t(e^u - 1)$, by choosing $c_t = \lambda_t(e^u - 1)$, the corresponding $\limsup_{n \to \infty}$ is finite for all $K > 0$. Therefore, all of the conditions in Assumption 4b) are satisfied.
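The cumulant identity used here is that of the Poisson distribution; assuming, as the form $\lambda_t(e^u - 1)$ indicates, that $y_t$ given $\lambda_t$ is conditionally Poisson with mean $\lambda_t$:

```latex
% For y_t \mid \lambda_t \sim \mathrm{Poisson}(\lambda_t), T(y_t) = y_t:
E\bigl(e^{u y_t} \mid \lambda_t\bigr)
= \sum_{k=0}^{\infty} e^{uk}\, \frac{e^{-\lambda_t} \lambda_t^{k}}{k!}
= e^{-\lambda_t} \sum_{k=0}^{\infty} \frac{(\lambda_t e^{u})^{k}}{k!}
= e^{\lambda_t (e^{u} - 1)},
% hence \log E(e^{u T(y_t)} \mid \lambda_t) = \lambda_t (e^{u} - 1).
```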

Proofs of Theorems 5.1, 5.2, 5.3 and 5.4 (asymptotic theory of inference)
From (4.3), the minimum description length is the sum of the negative log-likelihood and the penalties on model complexity; we refer to the latter part as the penalty on model complexity. Similar to [66], we show the following propositions, where $\{k_n^1\}$ and $\{k_n^2\}$ are two positive sequences converging to 0. In addition, let $L_i = \lim_{n\to\infty} E(L_{n,i})$, and let $L_i^{(k)}$ and $L_{n,i}^{(k)}$ be the $k$th derivatives of $L_i$ and $L_{n,i}$ with respect to $\Psi_i$, respectively, for $k = 0, 1, 2$ and $i = 1, \ldots, r+1$.
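Schematically, the decomposition from (4.3) can be written as below; the symbol $P(\mathcal{M})$ for the penalty is illustrative only, with the exact code lengths as defined in the main text:

```latex
% MDL criterion (schematic): negative log-likelihood plus penalty
\mathrm{MDL}(\mathcal{M})
\;=\; -\log L_n(\hat{\Psi};\, \mathcal{M}) \;+\; P(\mathcal{M}),
% where P(\mathcal{M}) collects the code lengths for the model
% structure (number of regimes, thresholds, delay) and for the
% estimated parameters within each regime.
```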
Proof of Proposition 2. We show the case $k = 0$ as an illustrative example; the proofs for $k = 1$ and $k = 2$ are similar. First, by the compactness of the parameter space $S(\Psi_i^*)$ of $\Psi_i$ and the ergodic theorem, for any fixed pair $(\theta_l, \theta_u)$, (7.20) holds almost surely.

For discrete $z_t$, (7.20) holds for any combination of $(\theta_l, \theta_u)$ taken from the observed values of $z_t$, and thus Proposition 2 holds. For continuous $z_t$, (7.20) holds in particular for all subintervals with rational endpoints. Hence, for $\theta \in (\theta_{i-1}^0, \theta_i^0)$ and any $\epsilon > 0$, there exists a rational number $w < \theta$ such that $Q(w, \theta) < \epsilon$, where $Q(w, \theta) = E\{I(w < z_{t-d} \le \theta)\}$; selecting $w$ close to $\theta$ ensures a sufficiently small $Q(w, \theta)$. As any pair $(\theta_l, \theta_u)$ is closely approximated by some subinterval with rational endpoints, (7.20) holds uniformly over all $(\theta_l, \theta_u)$, and the almost sure convergence still holds if they lie within a shrinking neighborhood.
where the first and last terms in (7.21) converge to 0 almost surely by Proposition 2. Thus, by the uniqueness of $\Psi_i^*$ under Assumption 2, $\hat{\Psi}_i^* \to \Psi_i^*$ almost surely.
For any $(\theta_l, \theta_u] \subseteq (\theta_{i-1}^0, \theta_i^0]$, the theory of the Kullback–Leibler distance applies. Suppose that $r^* < r_0$; then there must be some $\theta_{j-1}^*$ and $\theta_j^*$ such that, for some positive integer $k$, $k + 2$ true regimes are pooled into a "working regime" $j$. We relabel these true regimes as sub-regimes $1, \ldots, k+2$, with the thresholds of sub-regime $l$ denoted as $(\theta^{(l-1)}, \theta^{(l)}]$. Hence, the number of observations in the working regime $j$ is $n_j$, and the number in sub-regime $l$ is $n_{j,l}$; the log-likelihood of the working regime $j$ decomposes over its sub-regimes accordingly. As $n \to \infty$, we have $n_j \to \infty$ and $n_{j,l} \to \infty$. Denote the true parameter in sub-regime $l$ as $\Psi_{(l)}^0$. Consider a sub-regime $m$: from Propositions 2 and 3 and the theory of the Kullback–Leibler distance, (7.22) holds almost surely, since at least one part in the summation of (7.22) is not maximized. Furthermore, for one such sub-regime $m$, by the ergodic theorem, there exists some $c_m > 0$ bounding the resulting gap from below. By the ergodic theorem again, this limit exists and is positive almost surely. On the other hand, from (7.19), the penalty term changes at a slower rate; hence, the decrease in the log-likelihood is more rapid. Therefore, $r^* < r_0$ fails to optimize $\mathrm{MDL}(\mathcal{M})$. In general, if one of the working regimes is not nested in a true regime, for example, $\theta_{i-1}^0 < \theta_{j-1}^* < \theta_i^0 < \theta_j^*$ for some $i, j$, then $\mathrm{MDL}(\mathcal{M})$ cannot be smaller than that with sub-regimes $(\theta_{i-1}^0, \theta_{j-1}^*]$, $(\theta_{j-1}^*, \theta_i^0]$ and $(\theta_i^0, \theta_j^*]$. As a result, all working regimes have to be nested in some true regime, i.e., $\theta_{i-1}^0 < \theta_{j-1}^* < \theta_j^* < \theta_i^0$ for all $j$ with some $i$, which implies $r^* \ge r_0$.
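The Kullback–Leibler step invoked in this argument is the standard information inequality, stated here in generic notation:

```latex
% Information inequality: for the true density f_0 and any f_\theta,
E_0\!\left\{ \log \frac{f_\theta(y)}{f_0(y)} \right\}
\;\le\; \log E_0\!\left\{ \frac{f_\theta(y)}{f_0(y)} \right\}
\;=\; \log \!\int\! f_\theta(y)\, dy \;=\; 0,
% by Jensen's inequality, with equality iff f_\theta = f_0 a.e.
% Hence a working regime whose single parameter misfits some
% sub-regime strictly lowers the limiting average log-likelihood.
```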