Asymptotic properties of quasi-maximum likelihood estimators in observation-driven time series models

Abstract: We study a general class of quasi-maximum likelihood estimators for observation-driven time series models. Our main focus is on models related to the exponential family of distributions, such as Poisson-based models for count time series or duration models. However, the proposed approach is more general and covers a variety of time series models, including the ordinary GARCH model, which has been studied extensively in the literature. We provide general conditions under which quasi-maximum likelihood estimators can be analyzed for this class of time series models, and we prove that these estimators are consistent and asymptotically normally distributed regardless of the true data-generating process. We illustrate our results using classical examples of quasi-maximum likelihood estimation, including standard GARCH models, duration models, Poisson-type autoregressions and ARMA models with GARCH errors. Our contribution unifies the existing theory and gives conditions for proving consistency and asymptotic normality in a variety of situations.


Introduction
The aim of this work is to offer a systematic and unified way of studying quasi-maximum likelihood inference for a large class of time series models called observation-driven models. The terminology was introduced by [9] to signify their main ingredient: the evolution of the observations relies on a hidden process which is in turn driven by model-based dynamics. Observation-driven models can be employed for modeling various types of data, including high-frequency financial tick data, epidemiological data, and environmental and climate data, to mention only a few of their applications. Their wide applicability rests on the fact that they can accommodate various dependence structures met in practice. Some well-known examples are the GARCH models [6], ARMA-GARCH models (for more, see [21] and [42] and the references therein) and duration models. Furthermore, count time series and binary time series models have close connections with the aforementioned models and are actually covered by the framework we study; see [17].
The success of observation-driven models stems from the fact that they are based on generalized linear model methodology; see [35] and [30]. The combination of likelihood-based inference and generalized linear models provides a systematic framework for the analysis of quantitative as well as qualitative time series data. Indeed, estimation, goodness-of-fit tests, diagnostics and prediction can be implemented in a straightforward manner because the computations can be carried out with a number of existing software packages. Furthermore, both positive and negative association can be taken into account by a suitable choice of model parametrization.
Observation-driven models are defined as follows. Suppose that (X, d X ) and (Y, d Y ) are two Polish spaces equipped with their Borel sigma-fields X and Y.
Assume further that $\{(x, y, y') \mapsto f^\theta_{y,y'}(x) : \theta \in \Theta\}$ is a family of measurable functions from $(\mathsf{X} \times \mathsf{Y}^2, \mathcal{X} \otimes \mathcal{Y}^{\otimes 2})$ to $(\mathsf{X}, \mathcal{X})$. We denote by $(Y_0, \dots, Y_n)$ the observed data and we define observation-driven models as follows.

Definition 1.1 (Generalized Observation-Driven model). We say that the process $\{(X_t, Y_t),\ t \in \mathbb{N}\}$ is a generalized observation-driven model if, for all $t \in \mathbb{N}$,
$$Y_{t+1} \mid \mathcal{F}_t \sim Q^{\varphi(\theta)}(X_t, Y_t; \cdot), \qquad X_{t+1} = f^\theta_{Y_t, Y_{t+1}}(X_t), \tag{1.1}$$
where $\mathcal{F}_t = \sigma(Y_{0:t}, X_{0:t})$, $y_{s:t} = (y_s, \dots, y_t)$ for $s \leq t$, and $\varphi : \Theta \to \Phi$ is a measurable function from $\Theta$ to $\Phi$.
The dependence graph between the various random variables appearing in equation (1.1) is illustrated in Figure 1. Note that the response $Y_t$ depends on $Y_{t-1}$ and $X_{t-1}$ through the kernel $Q^{\varphi(\theta)}$. The theoretical properties of these models have received a great deal of attention in the literature, and a systematic review is beyond our scope. Our primary aim is to study the properties of the Quasi-Maximum Likelihood Estimator (QMLE) for estimating the unknown parameter $\theta$. The QMLE is a standard methodology for inference in the class of models introduced by (1.1). Indeed, as described below, Example 2.1 refers to the standard GARCH(1,1) model, which is routinely fitted by employing a Gaussian likelihood regardless of the assumed error distribution. Several other examples will be discussed throughout this work, including ARMA-GARCH examples; see Example 2.5.
As a remark, we note that the idea of quasi-likelihood inference originated with [43] in the context of generalized linear models for independent data, and it was further developed in [30]. It should be noted that the quasi-likelihood is a special case of the methodology of estimating functions; see, for example, the texts by [23] and [25]. This contribution offers verifiable conditions for obtaining consistent and asymptotically normally distributed quasi-maximum likelihood estimators of the parameter vector of model (1.1), even when the kernel $Q$ is misspecified. More precisely, our main results establish the consistency and asymptotic normality of these estimators under general conditions. The paper is organized as follows: Section 2 discusses examples of observation-driven models and shows their wide applicability. Section 3 sets up the general notation used throughout this work and discusses convergence of the QMLE under model misspecification. Section 4 shows that the asymptotic distribution of the QMLE is normal and discusses conditions under which this fact holds true. All results are applied to the examples of observation-driven models presented in Section 2. Section 5 gives an empirical illustration, while the Appendices contain the proofs of our results.

Examples of observation-driven models
In classical state-space models, also referred to as parameter-driven models, the observations {Y_t, t ∈ N} are modelled hierarchically given a hidden process {X_t, t ∈ N} which has its own (most often Markovian) dynamic structure; see [28] or [12], for instance. In the Bayesian setting, the process {X_t, t ∈ N} may be thought of as a dynamical parameter, and the distribution of the observations is specified conditionally on this parameter. Well-known examples include linear state-space models [28], [44], and hidden Markov models.
Suppose that {Y_t, t ∈ N} denotes the observed time series and let {X_t, t ∈ N} be an unobserved process. The dichotomy between observation-driven models and parameter-driven models was suggested by [9], who classified these processes according to whether their dynamics are driven by the observed data themselves or by an unobserved process (see also [10]); parameter-driven models are discussed in [44], [28] and [12], for instance. The generalized observation-driven model, introduced in Definition 1.1, is now linked to several standard examples by suitably identifying the observations and the corresponding latent process.

Example 2.1.
Recall the standard GARCH(1,1) model ([6]):
$$Y_t = \sigma_t \epsilon_t, \qquad \sigma_t^2 = d + a\sigma_{t-1}^2 + bY_{t-1}^2, \tag{2.1}$$
where $d > 0$, $a, b \geq 0$ and $\{\epsilon_t,\ t \in \mathbb{N}\}$ is a sequence of i.i.d. standard normal random variables. In this example, the latent process $X_t$ is the volatility process $\sigma_t^2$, and the conditional distribution of $Y_t$ given $Y_0, \dots, Y_{t-1}, X_0$ is Gaussian with zero mean and time-varying variance $X_t$. The latent process $\{X_t,\ t \in \mathbb{N}\}$ can be recovered by back substitution in the second equation of (2.1):
$$\sigma_t^2 = a^t \sigma_0^2 + \sum_{k=0}^{t-1} a^k \left(d + bY_{t-1-k}^2\right),$$
for some starting value $X_0 = \sigma_0^2$. The last display shows that the hidden volatility process is determined by the initial value $X_0$ and the past observations; this is precisely why model (2.1) belongs to the class of observation-driven models.
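As a quick numerical check of the back-substitution idea, the following sketch simulates a GARCH(1,1) path and recovers the last volatility value from the initial value and the past observations alone. The recursion $\sigma_t^2 = d + a\sigma_{t-1}^2 + bY_{t-1}^2$, with $a$ multiplying the lagged volatility, and the parameter values are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameter values; a multiplies the lagged volatility and
# b the lagged squared observation, with a + b < 1.
d, a, b = 0.1, 0.7, 0.2

n = 1000
eps = rng.standard_normal(n)
sigma2 = np.empty(n)
y = np.empty(n)
sigma2[0] = d / (1.0 - a - b)          # stationary variance as starting value
y[0] = np.sqrt(sigma2[0]) * eps[0]
for t in range(1, n):
    sigma2[t] = d + a * sigma2[t - 1] + b * y[t - 1] ** 2
    y[t] = np.sqrt(sigma2[t]) * eps[t]

# Back substitution: sigma2[t] is a deterministic function of the
# initial value sigma2[0] and the observations y[0], ..., y[t-1].
t = n - 1
recovered = a ** t * sigma2[0] + sum(
    a ** k * (d + b * y[t - 1 - k] ** 2) for k in range(t)
)
print(abs(recovered - sigma2[t]))  # agrees up to rounding error
```

The printed gap is of the order of machine precision, confirming that the volatility path carries no information beyond the initial value and the observed data.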
There are several challenging problems associated with the model specification (2.1). In this paper, we give conditions for obtaining asymptotically normally distributed maximum likelihood based estimators of the parameter vector $(d, a, b)$ when the distribution of $\{\epsilon_t,\ t \in \mathbb{N}\}$ is misspecified. For GARCH models, such questions have been addressed by numerous authors, including [29], [4], [20], [26], [33] and [2], among others. The general framework developed here takes a different point of view, which unifies these works.
The model of Example 2.1 can be extended further by replacing the second equation of (2.1) by a non-linear model, such as $\sigma_t^2 = f^\theta_{Y_t}(\sigma_{t-1}^2)$. Again, by repeated substitution, we can express $X_t$ as a function of $X_0$ and $Y_0, \dots, Y_t$. Non-linear GARCH-type models (see [41] and [42]) are examples of non-linear specifications of the volatility process.
Considering (2.1), the conditional density of $Y_t$ given $X_{t-1} = x$ is given by $q(x, y) = (2\pi x)^{-1/2}\exp\big(-y^2/(2x)\big)$. More generally, let us define the simplified observation-driven model as follows.

Definition 2.1 (Simplified Observation-Driven model). We say that the process $\{(X_t, Y_t),\ t \in \mathbb{N}\}$ is a simplified observation-driven model if, for all $t \in \mathbb{N}$,
$$Y_{t+1} \mid \mathcal{F}_t \sim Q(X_t; \cdot), \qquad X_{t+1} = f^\theta_{Y_{t+1}}(X_t), \tag{2.2}$$
where $\mathcal{F}_t = \sigma(Y_{0:t}, X_{0:t})$ and $y_{s:t} = (y_s, \dots, y_t)$ for $s \leq t$. Recall that $Q$ is a Markov kernel defined on $\mathsf{X} \times \mathcal{Y}$, dominated by some measure $\mu$ on $(\mathsf{Y}, \mathcal{Y})$, with associated transition density $q(x, y) = \mathrm{d}Q(x, \cdot)/\mathrm{d}\mu(y)$, and $(x, y) \mapsto f^\theta_y(x)$ is a measurable function from $\mathsf{X} \times \mathsf{Y}$ to $\mathsf{X}$, parameterized by $\theta \in \Theta$.

Example 2.2. A popular class of models that describe time intervals between
consecutive observations is that of duration models; see [14]. These models have proved quite useful in financial applications; in particular, they have been applied to the analysis and modeling of duration dynamics between trades, as they fit intraday market activity adequately; see [24], for instance. To be specific, suppose that $Y_t$ denotes the duration between two consecutive observations. Then a duration model is specified by
$$Y_t = \psi_t \epsilon_t, \qquad \psi_t = d + a\psi_{t-1} + bY_{t-1}, \tag{2.3}$$
where $d > 0$, $a, b \geq 0$ and $\{\epsilon_t,\ t \in \mathbb{N}\}$ is an i.i.d. sequence of positive random variables with mean one. The unobserved process $X_t$ is given by $\psi_t$, which equals the conditional mean of $Y_t$ given its past. This is quite analogous to the GARCH(1,1) model, where the volatility $\sigma_t^2$ is the expected value of $Y_t^2$ given its past. The recursion (2.3) can be rewritten as an observation-driven transition (2.2) by setting $X_t = \psi_t$ and $q(x, y) = x^{-1} g(y/x)$, where $g$ is the density of $\epsilon_t$ and $\theta = (d, a, b)$. For estimating the parameter vector $\theta$, [13] suggested the use of the QMLE obtained by assuming that $\{\epsilon_t,\ t \in \mathbb{N}\}$ is a sequence of i.i.d. exponential random variables with mean one. Our framework includes this specification and gives conditions for obtaining asymptotically normally distributed estimators for model (2.3).
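To illustrate the exponential quasi-likelihood idea on simulated data, the following sketch generates durations from the recursion $\psi_t = d + a\psi_{t-1} + bY_{t-1}$ with hypothetical parameter values and deliberately non-exponential (Gamma, mean-one) innovations, and checks that the exponential quasi-log-likelihood still prefers the true parameter to a clearly wrong one.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical true parameters; a multiplies the lagged conditional
# mean psi and b the lagged duration.
d0, a0, b0 = 0.3, 0.5, 0.2

n = 5000
# Misspecification: innovations are Gamma with mean one, not exponential.
eps = rng.gamma(shape=4.0, scale=0.25, size=n)
psi = np.empty(n)
y = np.empty(n)
psi[0] = d0 / (1 - a0 - b0)            # stationary mean as starting value
y[0] = psi[0] * eps[0]
for t in range(1, n):
    psi[t] = d0 + a0 * psi[t - 1] + b0 * y[t - 1]
    y[t] = psi[t] * eps[t]

def exp_quasi_loglik(theta, y, psi0):
    """Exponential quasi-log-likelihood of the duration model."""
    d, a, b = theta
    psi, ll = psi0, 0.0
    for t in range(1, len(y)):
        psi = d + a * psi + b * y[t - 1]
        ll += -np.log(psi) - y[t] / psi
    return ll

psi0 = y.mean()  # arbitrary fixed initialization; its effect vanishes
ll_true = exp_quasi_loglik((d0, a0, b0), y, psi0)
ll_wrong = exp_quasi_loglik((1.5, 0.1, 0.05), y, psi0)
print(ll_true > ll_wrong)  # the true parameter wins on this sample
```

A full QMLE would maximize `exp_quasi_loglik` over a compact parameter set; the comparison above only illustrates that the criterion discriminates in favor of the correctly specified conditional mean.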

Example 2.3. Assume that we observe a binary time series
and let us consider the following observation-driven model:
$$Y_{t+1} = \mathbb{1}\big\{U_{t+1} \leq (1 + e^{-\lambda_t})^{-1}\big\}, \qquad \lambda_t = d + a\lambda_{t-1} + bY_t, \tag{2.4}$$
where $b > 0$, $\mathbb{1}\{\cdot\}$ is the indicator function and $\{U_t,\ t \in \mathbb{N}^*\}$ is a sequence of i.i.d. standard uniform random variables. Display (2.4) introduces an observation-driven model for binary time series in which the hidden process $X_t$ equals $\lambda_t$; see [38], [34]. Recall that for a Bernoulli random variable with success probability $p$, the canonical link is given by the inverse logistic cdf, that is, $\ln\big(p/(1-p)\big)$. The logistic model has been widely used in numerous applications. An alternative model is given by the probit link, defined by means of $\pi_t = \Phi^{-1}(p_t)$, where $\Phi(\cdot)$ is the cdf of the standard normal distribution. For the complete specification of the probit model, we replace $\lambda_t$ by $\pi_t$ in the second equation of (2.4); see [46], [39] and [27].

Example 2.4. Count time series models have been studied by [40], [10], [16], [18] and [19], among others. The linear model for the analysis of count time series is based on the specification
$$Y_t \mid \mathcal{F}_{t-1} \sim \mathrm{Poisson}(\lambda_t), \qquad \lambda_t = d + a\lambda_{t-1} + bY_{t-1}. \tag{2.5}$$
A log-linear specification is given by
$$Y_t \mid \mathcal{F}_{t-1} \sim \mathrm{Poisson}\big(\exp(\nu_t)\big), \qquad \nu_t = d + a\nu_{t-1} + b\ln(1 + Y_{t-1}). \tag{2.6}$$
The transformation of the observed process $Y_t$ to $\ln(1 + Y_t)$ avoids the issue of zeroes in the data. Note that for this example the hidden process $X_t$ is $\nu_t$, which equals $\ln \lambda_t$ by virtue of (2.6). This is an example of a canonical-link model, because the canonical parameter of the Poisson distribution with mean $\lambda$ is $\nu = \ln \lambda$. Regardless of which model is applied for data analysis, the remarks made in Example 2.1 remain true. In this case, we need to examine the behavior of the Maximum Likelihood Estimator (MLE) when the Poisson assumption is not necessarily true for either of the above models. More generally, [8] suggest the use of mixed Poisson models for count time series, in which (2.5) is replaced by
$$Y_t \mid \mathcal{F}_{t-1}, Z_t \sim \mathrm{Poisson}(Z_t \lambda_t), \qquad \lambda_t = d + a\lambda_{t-1} + bY_{t-1}, \tag{2.7}$$
where $\{Z_t,\ t \in \mathbb{N}\}$ is an i.i.d. sequence of positive random variables with mean one and the remaining notation is as before. These models connect with the generalized linear models of [35] and [30].
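As an illustration of the count models just discussed, the following minimal sketch simulates a log-linear Poisson autoregression; the recursion $\nu_t = d + a\nu_{t-1} + b\ln(1+Y_{t-1})$ and the coefficient values are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical coefficients; |a| + |b| < 1 keeps the recursion stable.
d, a, b = 0.3, 0.4, 0.3

n = 500
nu = np.empty(n)
y = np.empty(n, dtype=np.int64)
nu[0] = d / (1 - a)
y[0] = rng.poisson(np.exp(nu[0]))
for t in range(1, n):
    # canonical-link recursion: nu_t = ln(lambda_t); log1p handles zero counts
    nu[t] = d + a * nu[t - 1] + b * np.log1p(y[t - 1])
    y[t] = rng.poisson(np.exp(nu[t]))
```

Note that the `log1p` transformation is exactly what allows the recursion to absorb zero counts, which a plain logarithm of the observations could not.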
The next model we discuss is not a simplified observation-driven model, but it is still covered by our work; in fact, it is a generalized observation-driven time series; see Section 3 for more.

Example 2.5. An AR(1) model with GARCH(1,1) errors (see [20], for instance) is specified by the following equations:
$$Y_t = \alpha Y_{t-1} + \sigma_t \eta_t, \qquad \sigma_t^2 = d + a\sigma_{t-1}^2 + b\big(Y_{t-1} - \alpha Y_{t-2}\big)^2, \tag{2.8}$$
where $\{\eta_t,\ t \in \mathbb{Z}\}$ is an i.i.d. sequence of standard normal random variables. As in the preceding example, the hidden process $X_t = \sigma_t^2$ is a function of some initial value $\sigma_0^2$ and the past observations; thus (2.8) belongs to the general class of observation-driven models. The notable difference between (2.1) and (2.8) is that, for the former, the distribution of $Y_t$ given $\sigma_t^2$ does not depend on any additional parameters other than those appearing in the specification of the GARCH model. In contrast, for model (2.8), the distribution of $Y_t$ given $\sigma_t^2$ and $Y_{t-1}$ depends on the parameter $\alpha$ through the mean of the assumed Gaussian error distribution. More generally (see [32], among others), consider the class of models of the form
$$Y_t = m(Y_{t-1}; \alpha) + \sigma_t \eta_t, \tag{2.9}$$
where $m(Y_{t-1}; \alpha)$ represents the conditional mean (which depends on an unknown parameter $\alpha$) and the volatility process is modeled by a non-linear model as discussed above. In this example, the complete model depends on the unknown parameter vector $\theta = (\alpha, \lambda)$, whereas the distribution of $Y_t$ given $\sigma_t^2$ and $Y_{t-1}$ depends only on the parameter $\varphi(\theta) = \alpha$. The remarks made for (2.8) still hold for (2.9). This contribution also covers this class of observation-driven models (recall Definition 1.1) and examines the consequences of misspecifying the likelihood function.
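A short simulation sketch of such an AR(1) model with GARCH(1,1) errors follows. The parameter values, and the convention that $a$ multiplies the lagged variance while $b$ multiplies the squared lagged GARCH error, are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical parameters: AR coefficient alpha, then GARCH coefficients.
alpha, d, a, b = 0.5, 0.1, 0.7, 0.15

n = 1000
eta = rng.standard_normal(n)
y = np.empty(n)
sigma2 = np.empty(n)
sigma2[0] = d / (1 - a - b)          # stationary variance of the error term
e = np.sqrt(sigma2[0]) * eta[0]      # GARCH error e_t = sigma_t * eta_t
y[0] = e
for t in range(1, n):
    sigma2[t] = d + a * sigma2[t - 1] + b * e ** 2
    e = np.sqrt(sigma2[t]) * eta[t]
    y[t] = alpha * y[t - 1] + e      # AR(1) conditional mean plus GARCH error
```

Here the conditional distribution of `y[t]` given the volatility and `y[t-1]` depends on `alpha` through its mean, which is exactly the feature that makes the model generalized rather than simplified.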
The above presentation shows the wide applicability of observation-driven models in various scientific fields. Notably, these models can take into account both qualitative and quantitative data in a unified manner. We proceed to study the asymptotic behavior of the QMLE in the next section.

General misspecified models
Two models are under consideration in this work: the generalized and the simplified observation-driven time series (see Definitions 1.1 and 2.1, respectively). The dependence graph between the various random variables appearing in these definitions is shown in Figure 2. The simplified model is obviously a particular case of the general model. However, we specify the assumptions required for studying the QMLE in the simplified-model framework separately, to avoid confusion and to compare our results with those obtained in the literature.

Example 2.5 (Continued). An example of a generalized observation-driven time series model is given by the AR(1)-GARCH model (and its non-linear counterpart (2.9)) discussed in Example 2.5. The conditional distribution of the response $Y_t$ (which for this example can be assumed to be Gaussian) depends in both cases on the autoregressive parameter $\alpha$ and on the hidden process $\sigma_t^2$. For the model described in (2.8), it can easily be checked that
$$q^{\alpha}(x, y; y') = x^{-1/2}\, g\big((y' - \alpha y)/\sqrt{x}\big),$$
where $g(\cdot)$ is the density of $\eta_t$ and $\theta = (\alpha, d, a, b)$.
For a generalized observation-driven model, the distribution of $(Y_1, \dots, Y_n)$ given $X_0 = x$ and $Y_0 = y_0$ has a density with respect to the product measure $\mu^{\otimes n}$, given by
$$y_{1:n} \mapsto \prod_{t=1}^{n} q^{\varphi(\theta)}\big(f^\theta_{y_{0:t-1}}(x), y_{t-1}; y_t\big), \tag{3.2}$$
where we have set, for all $t \geq 1$ and all $y_{0:t} \in \mathsf{Y}^{t+1}$,
$$f^\theta_{y_{0:t}}(x) = f^\theta_{y_{t-1}, y_t} \circ \dots \circ f^\theta_{y_0, y_1}(x),$$
with the convention $f^\theta_{y_0}(x) = x$. Note that for all $t \geq 0$, $X_t$ is a deterministic function of $Y_{0:t}$ and $X_0$, i.e., $X_t = f^\theta_{Y_{0:t}}(X_0)$. Now, fix a point $x$ of $\mathsf{X}$. In this section, we focus on the asymptotic properties of $\hat\theta_{n,x}$, the conditional Maximum Likelihood Estimator (MLE) of the parameter $\theta$ based on the observations $(Y_0, \dots, Y_n)$ and associated to the parametric family of likelihood functions given in (3.2). In other words, we consider
$$\hat\theta_{n,x} \in \arg\max_{\theta \in \Theta} L^{\theta}_{n,x}(Y_{0:n}), \tag{3.4}$$
where $L^{\theta}_{n,x}(Y_{0:n})$ denotes the corresponding conditional log-likelihood. We are especially interested in the case of misspecified models. To be precise, we do not assume that the distribution of the observations belongs to the set of distributions over which the maximization occurs. In particular, the sequence $\{Y_t,\ t \in \mathbb{Z}\}$ does not necessarily correspond to the observation process associated to the recursion (3.3); see [3], [22] and [15]. However, regardless of the true data-generating process, Theorem 3.1 below shows that the MLE converges to the set of values that minimize the Kullback-Leibler distance between the imposed model and the true model. Before stating the results, some assumptions are needed.
(A1) {Y t , t ∈ Z} is a strict-sense stationary and ergodic stochastic process.
We denote by P the distribution of the stationary process $\{Y_t,\ t \in \mathbb{Z}\}$ and by E the corresponding expectation.
(A2) The functions $\theta \mapsto f^\theta_{y,y'}(x)$ and $\theta \mapsto q^{\varphi(\theta)}(x, y; y')$, where $(x, y, y') \in \mathsf{X} \times \mathsf{Y}^2$ are fixed, are continuous.

(A3) There exists a family of P-a.s. finite random variables $\{f^\theta_{Y_{-\infty:t}},\ (\theta, t) \in \Theta \times \mathbb{N}\}$ such that, for all $x \in \mathsf{X}$: (i) P-a.s., $\lim_{m \to \infty} f^\theta_{Y_{-m:t}}(x) = f^\theta_{Y_{-\infty:t}}$; (ii) P-a.s., the corresponding limit of the normalized conditional log-likelihoods holds; (iii) the associated limiting maximization problem is well defined.

For all $(\theta, t) \in \Theta \times \mathbb{N}$, we use the notation introduced in (3.6). The above assumptions are standard and are introduced here to facilitate the proof of consistency. Note that under (A2), the mapping $\theta \mapsto L^\theta_{n,x}(Y_{0:n})$ is a continuous function on the compact set $\Theta$, and thus the MLE $\hat\theta_{n,x}$ obtained from (3.4) is well defined. Furthermore, under (A3)-(i) we obtain, regardless of the initial value $X_{-m} = x$, that $X_0$ (and thus $X_t$) can be approximated by a quantity involving the infinite past of the observations. Assumption (A3)-(ii) allows the conditional log-likelihood function to be approximated by a stationary sequence, while (A3)-(iii) guarantees a well-defined maximization problem. Assumption (A3) is usually verified by introducing the limit, as $m$ tends to infinity, of $f^\theta_{Y_{-m:0}}(x)$ for all fixed $(\theta, x) \in \Theta \times \mathsf{X}$, and by showing that this limit does not depend on $x$; we can therefore denote it by $f^\theta_{Y_{-\infty:0}}$. The following theorem establishes the consistency of the sequence of estimators $\{\hat\theta_{n,x},\ n \in \mathbb{N}\}$ defined by (3.4) in misspecified models. The proof follows the lines of [11], but the arguments must be adapted to account for the fact that the kernel density $q^{\varphi}$ here depends on the parameter.
Proof. The proof follows directly from Theorem A.1, provided that conditions (a)-(c) there hold. Indeed, (a) follows from (A3)-(iii); (b) follows by combining (A2) with (A3)-(i), since a uniform limit of continuous functions is continuous; and (c) is deduced from (A3)-(ii) and the definitions of $L^\theta_{n,x}(Y_{1:n})$ and $\bar{L}^\theta_n(Y_{-\infty:n})$. The proof is complete.
Example 2.1 (Continued). If $E[\ln(a + b\epsilon_t^2)] < 0$, then there exists a strict-sense stationary and ergodic process $\{Y_t,\ t \in \mathbb{Z}\}$ obtained from (2.1) with the associated parameter vector $\theta = (d, a, b)$; see [7] and [21, Ch. 2]. In this example, the innovations are assumed to follow the generalized error distribution (GED) with shape parameter $\nu > 0$ and scale constant $c$ chosen such that the distribution has zero mean and variance one (see [36]). The parameter $\nu$ characterizes the thickness of the tails. When $\nu = 2$, we obtain the standard normal distribution, while for $\nu > 2$ (respectively $\nu < 2$) the distribution has thinner (respectively thicker) tails than the normal distribution. The GED is usually employed for GARCH modeling of heavy-tailed returns; see the recent work [15], among others. We assume that $\nu$ is known and that the parameter vector $\theta = (d, a, b)$ lies in a compact set; these constraints imply in particular that $a < 1$. Following Remark 3.1, we can also note that, under these constraints, there exist stationary and ergodic versions of the process $\{Y_t,\ t \in \mathbb{Z}\}$ in this parametric family. We now show that (A2) and (A3) hold. Set $X_t = \sigma_t^2$ and recall that the recursions in (2.1) define a simplified, and therefore a generalized, observation-driven model (1.1) where, with a slight abuse of notation, $f^\theta_y(x) = d + ax + by^2$. These equations clearly imply that (A2) holds. We now turn to (A3). Given an initial value of $\sigma_0^2$, which will be specified below, the conditional log-likelihood defined in (3.5) may be expressed as in (3.9), where the $\sigma_t^2$ are computed recursively using (2.1). Note that, since $f_{y,y'}(x) = f_y(x)$ in this particular model, the conditional log-likelihood in (3.9) does not depend on the first observation $Y_0$ (contrary to the general expression in (3.5)), and we therefore write $L^\theta_{n,\sigma_0^2}(Y_{1:n})$ instead of $L^\theta_{n,\sigma_0^2}(Y_{0:n})$. For a given value of $\theta$, the unconditional variance (corresponding to the stationary value of the variance) is usually a sensible choice for the unknown initial value. Nevertheless, in what follows, we consider the initialization fixed at an arbitrary value $\sigma_0^2 = x$.
For any integer $m$, there exists a P-a.s. finite random variable $M$ such that $|Y_t| \leq M\beta^{|t|}$ for all $t \in \mathbb{Z}$. This implies that, P-a.s., the limit of the volatility recursion as the starting index tends to $-\infty$ exists, and we can define $f^\theta_{Y_{-\infty:t}}$ accordingly; this gives (A3)-(i). Similarly, we obtain a corresponding approximation for the log-likelihood terms,
and the mean value theorem implies that there exists a constant $\gamma > 0$ bounding the resulting difference, which vanishes as $t$ goes to infinity. This shows (A3)-(ii). The proof of (A3)-(iii) is along the same lines.
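The forgetting property behind (A3)-(i) is easy to observe numerically: running the volatility recursion on the same observations from two very different initial values, the two paths contract toward each other geometrically. The recursion $\sigma_t^2 = d + a\sigma_{t-1}^2 + bY_{t-1}^2$ and the parameter values below are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical GARCH parameters; the coefficient a < 1 on the lagged
# volatility drives the geometric forgetting of the initial value.
d, a, b = 0.1, 0.7, 0.2
y = rng.standard_normal(200)       # any observation path will do here

def vol_from_start(x0):
    """Run the volatility recursion from the initial value x0."""
    s = x0
    for t in range(1, len(y)):
        s = d + a * s + b * y[t - 1] ** 2
    return s

gap = abs(vol_from_start(0.5) - vol_from_start(50.0))
print(gap)  # of order a**199 times the initial gap, i.e. negligible
```

This is the numerical counterpart of the statement that the limit of $f^\theta_{Y_{-m:0}}(x)$ does not depend on $x$.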
Example 2.2 (Continued). Other positive distributions for the sequence $\{\epsilon_t,\ t \in \mathbb{Z}\}$ can be employed, for example a Gamma density suitably normalized to have mean one. However, we discuss the simple case of the exponential distribution to illustrate the verification of the required assumptions. As before, let $\theta = (d, a, b) \in \Theta$, where $\Theta$ is assumed to be compact. Following Remark 3.1, we note that, under these constraints, there exists a strictly stationary and ergodic process $\{Y_t,\ t \in \mathbb{Z}\}$ in this parametric family and, under some additional assumptions, we can obtain moments of any order for the stationary process (see [31] for more details). By setting $X_t = \psi_t$, (2.3) defines an observation-driven model with $f^\theta_y(x) = d + ax + by$. Since $g(\cdot)$ equals the exponential density, the conditional log-likelihood may be expressed as
$$L^\theta_{n,\psi_0}(Y_{1:n}) = -\sum_{t=1}^{n} \left(\ln \psi_t + \frac{Y_t}{\psi_t}\right),$$
where the $\psi_t$ are computed recursively using (2.3), given $Y_0$ and $\psi_0$. As before, a typical choice of the initialization is the stationary mean of the process, but in what follows we consider the initialization fixed at an arbitrary value $\psi_0 = x$. Working as before and with the same notation, we can show that (A3)-(ii) holds true. The proof of (A3)-(iii) is along the same lines, and therefore Theorem 3.1 shows that $\{\hat\theta_{n,x}\}$ is consistent.
We note that Examples 2.3 and 2.4 can be analyzed in a similar way, and we therefore omit the details.
We now turn to Example 2.5. As previously noted, (2.8) can be put into the framework of observation-driven time series models using (3.1), and hence the assumptions of Theorem 3.1 can be checked easily. Nevertheless, we next focus on the general formulation (2.9), which has been studied by [32], and examine how their results are interpreted in our context.

Example 2.5 (Continued).
Consider now (2.9) and suppose again that the observations form a strict-sense stationary and ergodic process $\{Y_t,\ t \in \mathbb{Z}\}$. However, we fit an observation-driven model using a Gaussian assumption for the error term $\{\eta_t,\ t \in \mathbb{Z}\}$. Suppose that $\theta \in \Theta$, where $\Theta$ is assumed to be compact. Given initial values $Y_0$ and $\sigma_0^2$, we obtain the Gaussian quasi-log-likelihood

Under some additional assumptions (see C1-C3 in [32, Proposition 1]), the supremum of the difference between the conditional variance started at $\sigma_0^2$ and its stationary approximation vanishes as $m \to \infty$. Assuming further that $\sigma_t^2$ is bounded away from 0, we obtain the consistency of the estimators $\{\hat\theta_{n,x}\}$.

Quasi-maximum likelihood estimation
When a model has been correctly specified, that is, when there exists a parameter $\theta^\star \in \Theta$ such that the data are generated according to this specific process, Theorem 3.1 implies consistency of the MLE for $\theta^\star$, provided that the set $\Theta^\star$ of maximizers is reduced to the singleton $\{\theta^\star\}$.
An important subclass of misspecified models corresponds to the case where the observation process is assumed to follow the recursions
$$Y_{t+1} \mid \mathcal{F}_t \sim Q^\star(X_t, Y_t; \cdot), \qquad X_{t+1} = f^{\theta^\star}_{Y_t, Y_{t+1}}(X_t), \tag{4.1}$$
where $\theta^\star$ is supposed to be in $\Theta^{\mathrm{o}}$, the interior of $\Theta$, but the true transition density $q^\star$ associated to $Q^\star$ satisfies $q^\star \neq q^{\varphi(\theta)}$ for any $\theta$; this special case of data-generating process therefore falls within the misspecified-models framework. Equivalently, we assume that there exists a true parameter $\theta^\star$ such that the second equation of the above display is correctly specified, but the chosen density $q$ is not equal to the true density associated to the data-generating process. This is a standard situation in practice; for instance, parametric inference for GARCH models is most often based on the Gaussian log-likelihood. In this misspecification setting, the MLE $\{\hat\theta_{n,x}\}$ defined by (3.4) is called the QMLE. Note that $\theta^\star$ is no longer the true value of the parameter, in the sense that the distribution of the observation process is not characterized by $\theta^\star$ alone. Nevertheless, and perhaps surprisingly, it can be shown that, under additional assumptions, the QMLE $\{\hat\theta_{n,x}\}$ is consistent and asymptotically normal with respect to the parameter $\theta^\star$. From now on, we assume for simplicity that $\mathsf{X} \subset \mathbb{R}$, and we first study the consistency of the QMLE.
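As a small numerical illustration of this setting, the sketch below generates a GARCH(1,1) path with standardized Student-t innovations and evaluates the Gaussian quasi-log-likelihood, which ignores the heavy tails, at the true parameter and at a clearly wrong one. The recursion $\sigma_t^2 = d + a\sigma_{t-1}^2 + bY_{t-1}^2$ and all numerical values are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical true GARCH parameters (a on the lagged variance).
d0, a0, b0 = 0.1, 0.7, 0.2

# Student-t innovations standardized to variance one: the Gaussian
# quasi-likelihood below is therefore misspecified.
df = 7.0
eps = rng.standard_t(df, size=4000) / np.sqrt(df / (df - 2.0))

sigma2 = np.empty(eps.size)
y = np.empty(eps.size)
sigma2[0] = d0 / (1 - a0 - b0)
y[0] = np.sqrt(sigma2[0]) * eps[0]
for t in range(1, eps.size):
    sigma2[t] = d0 + a0 * sigma2[t - 1] + b0 * y[t - 1] ** 2
    y[t] = np.sqrt(sigma2[t]) * eps[t]

def gaussian_ql(theta, y, x):
    """Gaussian quasi-log-likelihood, initialized at an arbitrary x."""
    d, a, b = theta
    s, ll = x, 0.0
    for t in range(1, len(y)):
        s = d + a * s + b * y[t - 1] ** 2
        ll += -0.5 * (np.log(s) + y[t] ** 2 / s)
    return ll

x = y.var()   # fixed initialization; asymptotically irrelevant
ll_true = gaussian_ql((d0, a0, b0), y, x)
ll_wrong = gaussian_ql((0.5, 0.2, 0.05), y, x)
print(ll_true > ll_wrong)
```

Despite the wrong innovation distribution, the Gaussian criterion still singles out the parameter that specifies the conditional variance correctly, which is the phenomenon the theory formalizes.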

Consistency of the QMLE
Recall that the parameter $\theta^\star$ appearing in the recursion (4.1) satisfies the following assumption: (A4) The parameter $\theta^\star$ is assumed to be in $\Theta^{\mathrm{o}}$, the interior of $\Theta$.

The main assumption linking the densities $q^\star$ and $q^{\phi}$ is the following: (A5) For all $(x^\star, y) \in \mathsf{X} \times \mathsf{Y}$, the function $\psi_{x^\star,y}$ defined on the set $\Phi \times \mathsf{X}$ by
$$\psi_{x^\star,y}(\phi, x) = \int Q^\star(x^\star, y; \mathrm{d}y')\, \ln q^{\phi}(x, y; y')$$
has a unique maximum at $(\phi, x) = (\varphi(\theta^\star), x^\star)$.
The previous assumption is an identification condition and is quite analogous to assumption A3(b) of [45]. The following theorem shows the consistency of the QMLE $\{\hat\theta_{n,x}\}$; its proof is given in the Appendix. We now illustrate this result by considering several standard examples of time series models. We first consider the class of simplified observation-driven models described in Definition 2.1. This special class of models is characterized by the fact that $q^{\phi}(x, y; y')$ depends neither on $y$ nor on $\phi$, and that $f^\theta_{y,y'}(x)$ does not depend on $y$. Equivalently, and with a slight abuse of notation, we assume that
$$q^{\phi}(x, y; y') = q(x, y'), \qquad f^\theta_{y,y'}(x) = f^\theta_{y'}(x). \tag{4.3}$$

If (4.3) holds, then (4.2) reduces to
$$\psi_{x^\star,y}(x) = \int Q^\star(x^\star, y; \mathrm{d}y')\, \ln q(x, y'),$$
and assumption (A5) is then replaced by the following condition:

(A6) For all $x^\star \in \mathsf{X}$, the function $x \mapsto \int Q^\star(x^\star, \mathrm{d}y)\, \ln q(x, y)$ has a unique maximum at $x = x^\star$.

It is worth noting that in the particular case where $Q^{\varphi(\theta)} = Q$ does not depend on $\theta$ and is equal to $Q^\star$, we deal with a well-specified model. Since the Kullback-Leibler divergence is nonnegative, we obtain
$$\int Q(x^\star, \mathrm{d}y)\, \ln q(x, y) \leq \int Q(x^\star, \mathrm{d}y)\, \ln q(x^\star, y),$$
and, provided that $x \mapsto Q(x, \cdot)$ is a one-to-one mapping, the equality holds if and only if $x = x^\star$. Thus, in well-specified models, (A6) is most often satisfied.

Example 2.1 (Continued). Assume as before that the observations $\{Y_t,\ t \in \mathbb{Z}\}$ form a strict-sense stationary ergodic process associated to
$$Y_t = \sigma_t \epsilon^\star_t, \qquad \sigma_t^2 = f^\star\big(\sigma_{t-1}^2, Y_{t-1}\big), \tag{4.4}$$
where $\sigma_t^2$ is bounded from below. The last display generalizes (2.1) by allowing the volatility process to be a non-linear function of its past values and past values of $Y_t$. To obtain strict stationarity and ergodicity for model (4.4), it suffices to assume conditions like those reported by [2], for instance. We now assume that (4.4) corresponds to the true data-generating process. However, we fit to the observations the following observation-driven model with normal innovations:
$$Y_t = \sigma_t \epsilon_t, \qquad \sigma_t^2 = f^\theta_{Y_{t-1}}\big(\sigma_{t-1}^2\big), \tag{4.5}$$
where $\{\epsilon_t,\ t \in \mathbb{N}\}$ is an i.i.d. sequence of standard normal random variables. This is a misspecified model; in fact, this approach amounts to using the Gaussian log-likelihood as a quasi-log-likelihood function to estimate the parameter $\theta$. By setting $X_t = \sigma_t^2$, we observe that (4.5) corresponds to a simplified observation-driven model of the form (4.3), where $q(x, \cdot)$ is the density of a centered normal distribution with variance $x$. We examine under which conditions assumption (A6) holds true, so that a consistent QMLE $\hat\theta_{n,x}$ for $\theta^\star$ can be obtained.
For the Gaussian density $q$, straightforward algebra shows that the function in (A6) is maximized at the point $x = \int Q^\star(x^\star, \mathrm{d}y)\, y^2$. We conclude that assumption (A6) is satisfied provided that the condition $\int Q^\star(x^\star, \mathrm{d}y)\, y^2 = x^\star$ holds true. Plugging this equality into (4.4), we obtain that the observations $\{Y_t,\ t \in \mathbb{Z}\}$ form a strict-sense stationary ergodic process associated to (4.4) in which $\{\epsilon^\star_t,\ t \in \mathbb{Z}\}$ is an i.i.d. sequence of random variables with potentially any unknown distribution, provided that $E[(\epsilon^\star_t)^2] = 1$. This is a standard identifiability assumption for GARCH models, which implies that $\mathrm{Var}(Y_t \mid \mathcal{F}_{t-1}) = \sigma_t^2$. In particular, we note that $E[\epsilon^\star_t] = 0$ is not required for proving consistency.
For another example within the GARCH models framework, consider again that the true data-generating process is given by (4.4), but fit model (4.5) with errors $\{\epsilon_t,\ t \in \mathbb{N}^*\}$ following a Laplace distribution normalized to have variance one.

We may also fit to the observations an observation-driven model in which $Q(x, \cdot)$ belongs to the natural exponential family of distributions. To be specific, we assume that for all $(x, y) \in \mathsf{X} \times \mathsf{Y}$,
$$q(x, y) = \exp\big(yx - \alpha(x)\big)\, h(y),$$
for some twice-differentiable function $\alpha : \mathsf{X} \to \mathbb{R}$ (the cumulant of $Q$) and some measurable function $h : \mathsf{Y} \to \mathbb{R}_+$. We investigate conditions under which assumption (A6) holds true. Since $\alpha''(x)$ equals the variance of $Q(x, \cdot)$, it can readily be checked that $\alpha'' \geq 0$, so that $\alpha$ is convex. Therefore, the function
$$x \mapsto x \int Q^\star(x^\star, \mathrm{d}y)\, y - \alpha(x)$$
is concave. The set of points at which this function achieves its maximum reduces to a singleton $\{x\}$, obtained by setting the derivative with respect to $x$ to zero: $\int Q^\star(x^\star, \mathrm{d}y)\, y - \alpha'(x) = 0$. Finally, (A6) is satisfied provided that the condition
$$\int Q^\star(x^\star, \mathrm{d}y)\, y = \alpha'(x^\star) \tag{4.6}$$
holds. Because, for the natural exponential family, the function $\alpha(\cdot)$ corresponds to the cumulant-generating function, its first derivative equals the mean of $Y_t$ given the past $\mathcal{F}_{t-1}$. Therefore, the above condition states that the mean function has to be correctly specified, regardless of the true data-generating process. This fact has been noticed by several authors in the context of longitudinal data analysis (see [47], for example) and in time series modeling; see [48]. Here we show that correct specification of the mean is a necessary condition for obtaining a consistent QMLE. An immediate application of the above fact yields consistency results for duration and count time series models. For instance, recall models (2.5) and (2.6).
Then we obtain $q(x, y) = \exp\big(yx - \exp(x)\big)/y!$, so that $\alpha(x) = \exp(x)$, which is the mean of $Q(x, \cdot)$ under this parametrization. Thus, (4.6) yields $\int Q^\star(x^\star, \mathrm{d}y)\, y = \exp(x^\star)$, which implies the following: suppose that $\{Y_t,\ t \in \mathbb{Z}\}$ is any count time series with mean $\lambda_t$ (respectively $\exp(\nu_t)$); then the QMLE is consistent for $\theta^\star$, provided that the second equation of (2.5) (respectively (2.6)) has been correctly specified. In particular, recall (2.7) for mixed Poisson count time series models. Then, to obtain a consistent QMLE for $\theta^\star$, it suffices to assume that $E[Z_t] = 1$ and that the second equation has been correctly specified. Related work on the QMLE for count time series models has recently been reported by [1]. These authors established strong consistency of the QMLE for count time series using conditions that imply (A1)-(A3) and (A6), provided that the mean process satisfies $\lambda_t > d$. The latter condition is trivially satisfied for the linear model (2.5); for the log-linear model (2.6), it can be verified using the results of [19] and [11]. Furthermore, [1] establish asymptotic normality of the QMLE by imposing regularity conditions on the score function and the information matrix; those conditions imply assumption (A8), which leads to the conclusions of Theorems 4.2 and 4.3.
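The mean-specification condition can be seen at work in a small simulation: counts generated from a mixed Poisson model, over-dispersed relative to the Poisson quasi-likelihood being fitted, still favor the true parameter as long as the mean recursion is the right one. The linear recursion $\lambda_t = d + a\lambda_{t-1} + bY_{t-1}$ and the numerical values are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical true parameters of the linear mean recursion.
d0, a0, b0 = 1.0, 0.4, 0.3

n = 4000
lam = np.empty(n)
y = np.empty(n, dtype=np.int64)
lam[0] = d0 / (1 - a0 - b0)
# Mixed Poisson: Z_t is Gamma with mean one, so Y_t is over-dispersed
# relative to the Poisson model used in the quasi-likelihood below.
z = rng.gamma(shape=5.0, scale=0.2, size=n)
y[0] = rng.poisson(z[0] * lam[0])
for t in range(1, n):
    lam[t] = d0 + a0 * lam[t - 1] + b0 * y[t - 1]
    y[t] = rng.poisson(z[t] * lam[t])

def poisson_ql(theta, y, lam0):
    """Poisson quasi-log-likelihood, dropping the ln(y!) constant."""
    d, a, b = theta
    lam, ll = lam0, 0.0
    for t in range(1, len(y)):
        lam = d + a * lam + b * y[t - 1]
        ll += y[t] * np.log(lam) - lam
    return ll

lam0 = y.mean()
ll_true = poisson_ql((d0, a0, b0), y, lam0)
ll_wrong = poisson_ql((5.0, 0.1, 0.1), y, lam0)
print(ll_true > ll_wrong)
```

The fitted Poisson model is wrong about the conditional distribution of the counts, yet the quasi-likelihood still discriminates in favor of the correctly specified conditional mean, in line with condition (4.6).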
In addition, we mention that similar findings hold for the simple duration model (2.3). In this case, a consistent QMLE for $\theta^\star$ is obtained by assuming that $E[\epsilon_t] = 1$.

Example 2.5 (Continued).
Recall the autoregressive model with GARCH noise; for properties of the QMLE in general ARMA-GARCH(p, q) models, see the work of [20], and for the more general model (2.9), see [32]. For ease of presentation, we focus on (2.8).
Assume that the observations $\{Y_t,\ t \in \mathbb{Z}\}$ form a strict-sense stationary ergodic process associated to (2.8) with true parameter $\theta^\star = (\alpha^\star, d^\star, a^\star, b^\star)$, where $\{\eta^\star_t,\ t \in \mathbb{N}\}$ is a sequence of i.i.d. random variables with unknown distribution satisfying $E[\eta^\star_t] = 0$ and $E[(\eta^\star_t)^2] = 1$. We fit to the data model (2.8), where $b > 0$ and $\{\eta_t,\ t \in \mathbb{N}\}$ is a sequence of i.i.d. standard normal random variables. As noted in (3.1), this model falls into the class of general observation-driven models because it can be rewritten in the form of Definition 1.1, where $\theta = (\alpha, d, a, b)$, $\varphi(\theta) = \alpha$ and $|\alpha| < 1$. Then the kernel $Q^\alpha$ has a density
$$q^\alpha(x, y; y') = (2\pi x)^{-1/2} \exp\big(-(y' - \alpha y)^2/(2x)\big).$$
Now, fix some $y \in \mathsf{Y}$ and let $\psi_y$ be the function defined in (A5). First note that the inner integral decomposes into a term which does not depend on $x \in \mathsf{X}$ and a term in $x$. Then, replacing $\alpha$ by $\alpha^\star$ and maximizing with respect to $x$, we obtain $x = x^\star$. Because the global maximum of $\psi_y(\alpha, x)$ is attained at only one point, namely $(\varphi(\theta^\star), x^\star)$, assumption (A5) is satisfied.

Asymptotic normality of the QMLE for simplified observation-driven models
In this section, we present the asymptotic normality of the QMLE θ̂_{n,x} for simplified observation-driven models. We choose to start with this class of models, as defined by (4.3), in order to avoid technicalities and burdensome notation. In the next section, however, we develop rigorously all the steps for proving asymptotic normality of the QMLE for general observation-driven models. We assume that the parameter set Θ is a subset of R^{n_Θ}. Suppose that, for all y ∈ Y, the function x ↦ q(x, y) is twice differentiable. For all twice differentiable functions f : Θ → R and all y ∈ Y, define the quantities displayed in (4.7) and (4.8). These functions appear naturally when differentiating the log-likelihood function θ ↦ ln q(f(θ), y) with respect to θ; by straightforward algebra we obtain the score function and the Hessian matrix, respectively. To prove asymptotic normality, we need the following additional assumptions, which are quite standard for maximum-likelihood-type asymptotics in the framework of time series. More precisely, it is assumed that the score function and the information matrix of the data can be approximated by the infinite past of the process. In addition, all of these quantities are assumed to exist and, in particular, the Fisher information matrix is nonsingular. In what follows, the notation f ∘ Y_{1:t−1}(x) stands for the successive composition of the maps along the observations Y_{1:t−1}, applied to x.
(A7) For all y ∈ Y, the function x ↦ q(x, y) is twice continuously differentiable.
Moreover, there exist ρ > 0 and a family of P-a.s. finite random variables such that f_θ^{Y_{−∞:0}} lies in the interior of X, the function θ ↦ f_θ^{Y_{−∞:0}} is, P-a.s., twice continuously differentiable on some ball B(θ★, ρ), and, for all x ∈ X,
(i) P-a.s., where ‖·‖ is any norm on R^{n_Θ};
(ii) P-a.s., where, by abuse of notation, we again use ‖·‖ to denote any norm on the set of n_Θ × n_Θ matrices with real entries.
Moreover, the matrix J(θ★) defined by (4.9) is nonsingular.
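To make the differentiation step behind θ ↦ ln q(f(θ), y) explicit, the following is a generic chain-rule computation written in plain derivative notation; the specific symbols of (4.7)–(4.8) are not reproduced here, so this is an illustrative restatement rather than the paper's display.

```latex
\nabla_\theta \ln q(f(\theta), y)
  = \nabla_\theta f(\theta)\, \partial_x \ln q(x, y)\big|_{x = f(\theta)},
\qquad
\nabla^2_\theta \ln q(f(\theta), y)
  = \nabla^2_\theta f(\theta)\, \partial_x \ln q(x, y)\big|_{x = f(\theta)}
  + \nabla_\theta f(\theta)\, \nabla_\theta f(\theta)^{\top}\,
    \partial_x^2 \ln q(x, y)\big|_{x = f(\theta)} .
```

Summing these terms over t yields the score function and the Hessian matrix of the log-likelihood.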
Theorem 4.2 states that √n (θ̂_{n,x} − θ★) converges in distribution to N(0, J(θ★)^{−1} I(θ★) J(θ★)^{−1}), where J(θ★) is defined in (4.9) and I(θ★) is the associated asymptotic covariance matrix of the score. The proof of Theorem 4.2 follows directly from Theorem 4.3, stated in the next section. We now turn to the asymptotic normality of the QMLE in general observation-driven models.

Asymptotic normality of the QMLE for general observation-driven models
Obtaining the asymptotic normality of the QMLE for the general observation-driven model proceeds along the previous steps; we state the main result in this section. For simplicity, assume that Φ ⊂ R, so that the function θ ↦ ϕ(θ) takes values in R. If, for all y, y′ ∈ Y, (x, ϕ) ↦ q^ϕ(x, y; y′) is twice continuously differentiable, we can define χ and κ similarly to (4.7) and (4.8).
To be specific, for all twice continuously differentiable functions f, ϕ : Θ → R and all (y, y′) ∈ Y², define the corresponding quantities. As before, these functions correspond to the derivatives of the log-likelihood function, where the function θ ↦ (ϕ(θ), f(θ)) is twice continuously differentiable, and the analogous identities can be readily checked. For studying the asymptotic normality of the QMLE, Assumption (A7) is replaced by the following:
(A8) For all y, y′ ∈ Y, the functions (x, ϕ) ↦ q^ϕ(x, y; y′) and θ ↦ ϕ(θ) are twice continuously differentiable, and ϕ(θ★) lies in the interior of Φ. Moreover, there exist ρ > 0 and a family of P-a.s. finite random variables such that f_θ^{Y_{−∞:0}} lies in the interior of X, the function θ ↦ f_θ^{Y_{−∞:0}} is, P-a.s., twice continuously differentiable on some ball B(θ★, ρ), and, for all x ∈ X,
(i) P-a.s., where ‖·‖ is any norm on R^{n_Θ};
(ii) tends to 0 as t → ∞, where we again use ‖·‖ to denote any norm on the set of n_Θ × n_Θ matrices with real entries.
Moreover, the matrix J(θ★) defined by (4.14) is nonsingular.
Theorem 4.3 states that √n (θ̂_{n,x} − θ★) converges in distribution to N(0, J(θ★)^{−1} I(θ★) J(θ★)^{−1}), where J(θ★) is defined in (4.14) and I(θ★) is the associated asymptotic covariance matrix of the score. The proof is postponed to Appendix C.

Application
We verify empirically the asymptotic normality of the QMLE. Consider a count time series {Y_t} whose true conditional distribution given the past is the geometric distribution with mean process λ_t. Recall the notation of Equation (2.4) and suppose that the mean process {λ_t} is defined either by a linear model of the form (2.5) or by a log-linear model of the form (2.6). In this case, the true log-likelihood function is given by

L^θ_{n,x}(y_{0:n}) = (1/n) Σ_{t=1}^{n} [ y_t ln λ_t(θ) − (y_t + 1) ln(1 + λ_t(θ)) ].
However, in practice the true distribution is generally unknown, and we therefore use the Poisson distribution as the response distribution. It is easy to check that in this case the "working" likelihood takes the corresponding Poisson form, so the score equations for model (2.5) (equivalently, model (2.6)) follow, where the vector of derivatives ∂λ_t(θ)/∂θ (respectively ∂ν_t(θ)/∂θ) can be calculated by recursion. Table 5 summarizes the results of a limited simulation study, where data are generated according to the geometric distribution with a linear or log-linear model specification, but with the Poisson distribution being fitted instead. All results are based on 1000 simulations. Consider the upper panel of the table, which corresponds to results obtained after fitting the linear model. The table reports the estimates of the parameters obtained by averaging the results over all simulations. The first two rows correspond to the mean and standard error of the simulated QMLE. In all cases, these estimators approach the true values quite satisfactorily. The next three rows show summary statistics of the sampling distribution of the standardized QMLE. In particular, the row labeled p-values corresponds to the p-values of a Kolmogorov–Smirnov test for normality applied to the standardized QMLE obtained from the simulations. In all cases, the asserted asymptotic normality appears quite adequate. The second panel of Table 5 reports results for the log-linear model. We note again a quite satisfactory approximation to the true parameter values, and the normality of the estimators is supported.

Table 5: Results of 1000 simulations obtained after fitting the linear model (2.5) and the log-linear model (2.6) to a count time series of 1000 observations. Data are generated according to the geometric distribution with mean λ_t, but with the Poisson model being fitted instead.
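A single replication of this experiment can be sketched as follows, under the assumption that the linear model (2.5) has the standard form λ_t = d + a λ_{t−1} + b Y_{t−1}; the parameter values and the helper neg_poisson_quasi_loglik are our own illustrative choices, and the Poisson quasi-likelihood is maximized numerically rather than by solving the score equations.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Hypothetical true parameters of the linear mean recursion:
#   lambda_t = d + a * lambda_{t-1} + b * Y_{t-1}
d0, a0, b0 = 0.5, 0.3, 0.4
n = 2000

lam = np.empty(n)
y = np.empty(n, dtype=int)
lam[0] = d0 / (1.0 - a0 - b0)
# Geometric responses with mean lambda_t, supported on {0, 1, 2, ...}.
y[0] = rng.geometric(1.0 / (1.0 + lam[0])) - 1
for t in range(1, n):
    lam[t] = d0 + a0 * lam[t - 1] + b0 * y[t - 1]
    y[t] = rng.geometric(1.0 / (1.0 + lam[t])) - 1

def neg_poisson_quasi_loglik(theta, y):
    """Negative Poisson quasi-log-likelihood (up to a constant) under the
    linear model, even though the data are geometric."""
    d, a, b = theta
    if d <= 0 or a < 0 or b < 0 or a + b >= 1:
        return np.inf
    lam = np.empty(len(y))
    lam[0] = d / (1.0 - a - b)
    ll = y[0] * np.log(lam[0]) - lam[0]
    for t in range(1, len(y)):
        lam[t] = d + a * lam[t - 1] + b * y[t - 1]
        ll += y[t] * np.log(lam[t]) - lam[t]
    return -ll

res = minimize(neg_poisson_quasi_loglik, x0=(0.8, 0.25, 0.3),
               args=(y,), method="Nelder-Mead")
print(res.x)  # Poisson QMLE of (d, a, b); close to (0.5, 0.3, 0.4) for large n
```

Repeating this over many replications, standardizing the estimators, and applying a Kolmogorov–Smirnov test reproduces the experiment summarized in Table 5.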

Outlook
We have studied a rich class of time series models that have proved quite useful in diverse applications. As the list of references shows, several studies have addressed estimation and inference for observation-driven models by QMLE methodology; this work unifies the existing literature in a coherent and simple way. Furthermore, the methodology can be extended to multivariate data. For instance, consider the so-called vector GARCH model [21, Sec. 11.2.2], in which {ε_t} is an i.i.d. sequence of m-dimensional standard normal random vectors, Σ_t is an m × m positive definite matrix, and vech denotes the half-vectorization of an m × m square matrix C; in other words, if C = (c_ij), then vech(C) = (c_11, c_21, ..., c_m1, c_22, ..., c_m2, ..., c_mm)′. Additionally, the vector d is m(m + 1)/2-dimensional and the matrices A and B are of dimension m(m + 1)/2 × m(m + 1)/2. Comparing this model with (2.1), we note that the hidden process X_t equals Σ_t and that the conditional density of Y_t given X_{t−1} = x is y ↦ q(x, y) = (2π)^{−m/2} |x|^{−1/2} exp(−½ y′x^{−1}y). Similar models can be developed for other classes of processes. The proposed framework advances the theoretical background for both univariate and multivariate observation-driven models and lists easily verifiable conditions for studying the QMLE.
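To fix ideas, the half-vectorization and the Gaussian kernel density above can be sketched as follows; the function names vech and gaussian_kernel_density are our own, and this is a minimal illustration rather than the paper's implementation.

```python
import numpy as np

def vech(C):
    """Half-vectorization: stack the lower-triangular part of C column by column."""
    m = C.shape[0]
    return np.concatenate([C[j:, j] for j in range(m)])

def gaussian_kernel_density(x, y):
    """Conditional density q(x, y) = (2*pi)^(-m/2) |x|^(-1/2) exp(-y' x^{-1} y / 2),
    with x an m x m positive definite matrix."""
    m = len(y)
    quad = y @ np.linalg.solve(x, y)
    return float(np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** m * np.linalg.det(x)))

C = np.array([[1.0, 2.0], [2.0, 3.0]])
print(vech(C))  # -> [1. 2. 3.]
```

For m = 2, vech maps a symmetric matrix to the 3-vector (c_11, c_21, c_22), matching the m(m + 1)/2-dimensional parametrization of the vector GARCH recursion.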
This implies the desired identity, where the last equality follows from (4.2). Finally, {M_t, t ≥ 1} is an ergodic (see (A1)) and square-integrable F-martingale with stationary increments, and the claim follows by applying the results of [5]. Moreover, by the Birkhoff ergodic theorem, the stated average converges. Hence, we only need to show that the remaining term is negligible. Let ε > 0 and choose 0 < η < ρ such that the expectation E[sup_{θ ∈ B(θ★, η)} · ] is sufficiently small. The existence of such an η follows from the P-a.s.