Analyzing categorical time series in the presence of missing observations

In real applications, time series often exhibit missing observations such that standard analytical tools cannot be applied. While there are approaches of how to handle missing data in quantitative time series, the case of categorical time series seems not to have been treated so far. Both for the case of ordinal and nominal time series, solutions are developed that allow to analyze their marginal and serial properties in the presence of missing observations. This is achieved by adapting the concept of amplitude modulation, which allows to obtain closed‐form asymptotic expressions for the derived statistics' distribution (assuming that missingness happens independently of the actual process). The proposed methods are investigated with simulations, and they are applied in a project on migraine patients, where the monitored qualitative time series are often incomplete.

F I G U R E 1 Time series of daily levels of peak severity ("none," "mild," "moderate," "severe") for patient A. The dotted lines indicate missing data The aforementioned approaches to deal with missing data refer to real-valued time series. But missing observations may also happen to qualitative time series, which consist of categorical values ordered in time. For such time series, completely different analytical tools have to be used, 2 and, consequently, the aforementioned solutions for missing data are not applicable. The main practical motivation for the present article is a collaborative project involving real-world qualitative time series from migraine patients with missing data. Example 1. N1-Headache is a mobile app for migraine patients that was developed by Curelator Inc. During initial registration, participants give agreement for their data to be used for research and publication purposes by accepting the Curelator Inc terms and conditions and privacy policy. The protocol for analysis and publication of participant data was reviewed and approved by the Biomedical Research Alliance of New York (BRANY) Institutional Review Board (IRB) on October 10, 2019. BRANY granted a full waiver from informed consent. The app requires the patients "to log information about [their] headaches, migraine symptoms and medication use on days [they] have an attack, and track a range of factors (moods, weather, diet, etc.) on a daily basis that may influence [their] risk of attack" (https://n1-headache.com/patients/faq/; retrieved April 16, 2021). These data are then used to identify the lifestyle factors and medications that affect the migraine attacks, also see Vives-Mestres et al, 11 Vives-Mestres and Casanova 12 for further background. Many of the considered features are categorical, such as binary symptom features ("true" or "false"), dietary features regarding the consumption of certain foods ("no," "some," "a lot"), or emotional features being expressed on a 0 to 10 Likert scale (ranging from "not at all" to "a lot"). Also the variable of main interest, pain peak severity, is measured on an ordinal scale with levels "none," "mild," "moderate," and "severe." Thus, for each patient, a lot of categorical time series (measured on a daily basis) are available. These time series, however, often suffer from missing data, because patients did not enter any value into the app at some days, or they stopped before having replied to all questions. Two illustrative examples are shown in Figures 1 and 2. Figure 1 plots the ordinal peak-severity time series ("How bad was your headache pain at its worst today?") obtained for, say, patient A, which is of length n = 225 but including 19 missing observations (highlighted by dotted lines). Figure 2 refers to another patient, B, and shows one of the emotional features, the variable stress ("How stressed have you felt today?") being measured on a 0-10 Likert scale. This ordinal time series is of length n = 136 and contains three missing observations. Both time series are further discussed later in Section 4.2.
There are further fields of application where one is commonly confronted with categorical time series exhibiting missing observations. As an example, the measurement of cloud coverage at weather stations leads to ordinal time series, 13 where the measurement at a certain time may fail for several reasons: besides a failure of the measuring device, also weather or environmental conditions (such as heavy precipitation, fog, or light pollution) may prevent the precise determination of the current coverage state * . But in what follows, we focus on the migraine data of Example 1.
In this article, we develop solutions to handle missing data in stationary categorical time series, X 1 , … , X n with n ∈ N = {1, 2, … }. That is, the outcomes x t of X t are assumed to stem from a qualitative range  consisting of a finite number of categories. We distinguish between the case, where  exhibits a natural order among the categories (ordinal time series), and the case where the categories are unordered (nominal time series), because this distinction is also associated with different analytic tools. Section 2 briefly reviews the basic approaches for analyzing both types of categorical time series (if no data are missing). Then, we derive solutions of how to handle missing data in categorical time series. We present a unique approach of incorporating missingness and of deriving the asymptotic distributions of the proposed statistics. But as the specific analytic tools differ between ordinal and nominal time series, we present the respective results separately. First, we focus on the case of ordinal time series, as this applies to our migraine data from Example 1. In Section 3, we derive a central limit theorem (CLT) for estimators of the (bivariate) cumulative distribution function (CDF) of (X t ), which is applied to obtain approximate distributions for relevant ordinal statistics (eg, measures of dispersion or serial dependence). Here, only those derivations are presented in the main text which are crucial to understand the concepts for dealing with the missing data; otherwise, they are postponed to the supplement. Section 4 investigates the finite-sample performance of the obtained asymptotic approximations with some simulations. In particular, we pick up again Example 1 and apply the novel missing-data approaches to these ordinal time series. Due to the unified framework, the ordinal solutions of Section 3 are easily adapted to the case of nominal time series. A summary of the results for the nominal case is presented in Section 5, whereas the necessary derivations are again postponed to the supplement. The article is concluded in Section 6. The companion GitHub repository at https://github.com/cnweiss/CaTS_Missing contains all relevant implementation codes for this article.

BASICS OF CATEGORICAL TIME SERIES ANALYSIS
Let (X t ) be a stationary categorical process, with either nominal or ordinal range . To simplify notations, let the possible outcomes be arranged in a certain order (either lexicographical or natural order), that is, we denote the range as  = {s 0 , s 1 , … , s m } with some m ∈ N. Although the range is denoted in a unique way, ordinal time series have to be analyzed in a different way than nominal ones. 2,14 For nominal time series, analytic tools are based on the probability mass function (PMF). We denote the marginal PMF by p = (p 0 , … , p m ) ⊤ ∈ [0; 1] m+1 with p i = P(X = s i ), and bivariate probabilities with time lag h ∈ N by p ij (h) = P(X t = s i , X t−h = s j ). For ordinal time series, we make use of the natural ordering by considering the CDF instead, with the notations The stochastic properties of a nominal process are measured by different approaches than those of an ordinal process. While the location of a nominal process is expressed by computing the mode from p, one commonly computes the median from f to express the location of an ordinal process. Also the concepts of dispersion are different. While dispersion is uniquely minimal in the case of a one-point distribution, expressed by the PMF vectors in the nominal case, or by the CDF vectors in the ordinal case, respectively, the concepts of maximal dispersion differ. 15 While a nominal random variable exhibits maximal dispersion if p equals the uniform distribution ( 1 m+1 , … , 1 m+1 ) ⊤ (all outcomes are equally probable), an ordinal random variable has maximal dispersion for an extreme two-point distribution (the outermost categories have probability 1∕2), that is, if the CDF vector f equals ( 1 2 , … , 1 2 ) ⊤ . Several dispersion measures have been proposed in the literature. A popular choice is the index of qualitative (ordinal) variation, abbreviated as IQV (IOV), which is given by respectively. 15 Both are normalized to values in [0; 1], where 0 (1) is attained exactly in the case of minimal (maximal) dispersion. For ordinal data, also measures of asymmetry or skewness are derived from f, see Klein and Doll, 16 Weiß 14 as well as Section 3.3. Measures of serial dependence, in turn, make use of the bivariate PMF or CDF, respectively. As qualitative counterparts to the ACF, one may use the nominal or ordinal Cohen's , 14,17 given by respectively. A -value of zero implies serial independence at lag h, whereas positive (negative) -values indicate positive (negative) serial dependence. The sample counterparts of the measures in (3) and (4) are obtained by replacing the (cumulative) probabilities by (cumulative) relative frequencies. The latter are most easily expressed if switching from (X t ) to its binarization in the ordinal case. Here, 1 A denotes the indicator function, which takes the value 1 (0) if A is true (false). Note that the nominal binarization (Y t ) has the range {e 0 , … , e m } according to (1), whereas the ordinal binarization (Z t ) has the range {c 0 , … , c m } according to (2). Now, the marginal (cumulative) relative frequencies computed from X 1 , … , X n are equal to means about their binarizations, ie,p For the bivariate (cumulative) relative frequencies, we computê For deriving the asymptotics of the sample measuresÎQV,ÎOV in (3) and̂n om (h),̂o rd (h) in (4), or those of alternative measures, one first requires a CLT for the relative frequencies in (5) to (6). Afterward, one derives the asymptotics of the measures based on Taylor approximations ("Delta method"), see Weiß,17,14 for details. It should be noted, however, that all of the aforementioned approaches assume the time series to be fully observed.
In the sequel, however, we focus on categorical time series with missing observations, and we present a unified framework for handling missingness in both ordinal and nominal time series. We start with adapting the definition of the above sample measures (3) to (6) to deal with missing data. Then, we follow a similar path as the one sketched before for deriving their asymptotics: First, CLTs are proven for the missingness-corrected types of (cumulative) relative frequencies (5) to (6). Second, these CLTs are applied together with Taylor expansions to achieve the asymptotic distributions of the considered sample measures of dispersion (3), serial dependence (4), and others. Although the general approach for handling missingness is the same for both ordinal and nominal time series, the relevant methods differ in detail (as discussed before, we have different concepts of dispersion, for example, and skewness is well defined only for ordinal data). Therefore, we present the results for both cases separately and start, motivated by Example 1, with ordinal time series in Section 3. But thanks to the unified framework, the presentation and derivation of the corresponding nominal results, later in Section 5, is much simplified, also see the discussion in Remark S.4.1 of Supplement S.4.

Amplitude modulation
For handling missing data in categorical time series, we pick up the concept of amplitude modulation proposed by Parzen. 8 The amplitude modulation is not applied to the original qualitative process (X t ), but to the binarization derived thereof. Our approach is first explained in detail for the case of ordinal time series and the corresponding binarization (Z t ). Later in Section 5, we show how to adapt it to nominal time series and (Y t ). Let the amplitude-modulating process 6 or, more generally, stationary with some additional dependence structure. 7 We define the amplitude modulation of (Z t ) as (O t ⋅ Z t ), and we set̂∶= 1 Then, the mean of̂is given by .
is assumed to be independent of (X t ) (including the case of (O t ) being deterministic), which is commonly assumed for quantitative time series as well, 4,6-10 then E simplifies to which implies to estimate f byf * ∶= If, however, (O t ) would not be independent of (X t ), then the estimation of f would be more challenging. This is illustrated by the following example.

Example 2.
Assume that missing data only happen for "large" observations. This might be the case if, for example, people are asked for their level of alcohol consumption, and if they are ashamed to admit heavy consumption. More precisely, assume that for some a ∈ {0, … , m}, we have .
which is unbiased for i ≤ a, but biased for i > a.
In view of the difficulties illustrated by Example 2 and in accordance with common practice, 4,6-10 for the rest of the article, we assume that (O t ) is independent of (X t ).

Central limit theorems
Under the assumption of (O t ) being independent of (X t ), let us study the stochastic properties of the corrected CDF estimatorf * =̂∕O from (8). If missingness does not happen randomly but according to some deterministic principle, that is, = o f according to (7). Hence,f * =̂∕o is an exactly unbiased estima- Z t is just a weighted mean of the Z t , where n obs = ∑ n s=1 o s is the number of available observations. Therefore, second-order moments are computed exactly as see Lemma A.5.1 in Weiß. 14 As an example, if . Thus, the exact position of the missing data is not relevant in this case, just the "effective sample size" n obs .
If (8) Define the extended binary vectors which form a stationary process (Z * t ) by the stationarity of (X t ) and (O t ). We number the components of Z * t as Finally, let us assume that an appropriate mixing assumption on (X t ) and (O t ) is satisfied, 18,19 such as -mixing with exponentially decreasing weights, which then passes to (Z * t ). Then, we have the CLT Note that the components of the sample mean ⊤ are required for calculatingf * according to (8). The see Supplement S.1 for the proof. Let us define the function g and f = g( Z * ). Let D denote the Jacobian of g evaluated in Z * = ( 1, f ⊤ ) ⊤ . Then, the lin- together with the CLT (11) implies the following result (Delta method).

Theorem 1. Let the amplitude-modulating binary process
Let the stationary ordinal process (X t ) be independent of (O t ), where X t is observed if O t = 1. Let (X t ) and (O t ) satisfy the mixing assumption required for (11). Then, the corrected CDF-estimatorf * according to (8) satisfies . The proof of Theorem 1 is given in Supplement S.1. For = 1 = (h) (no missing data), (13) agrees with the CLT (20) in Weiß. 14 Also note the analogous structure of (9) and (13). Generally, (13) can be used to derive the asymptotics of statistics based onf * , which is illustrated in Section 3.3. Note that (12) and (13) are related to each other by i,  (13) As a result, the effect of (X t )'s dependence structure on * ij does not change in the presence of missing data.
Finally, let us return to the bias result provided by Theorem 1. The linear Taylor approximationf * , where D (j) denotes the jth row of D, implies that the expansion for the bias E[f * j ] − f j is of order o(n −1∕2 ). To compute the n −1 -term, we use the quadratic Taylor approximationf * Hence, the bias E[f * j ] − f j is of order o(n −1 ), as stated in Theorem 1, and thus negligible for practice.

Marginal properties of ordinal time series
The CLTs (11) and (13) can now be used to derive the asymptotics of statistics computed fromf * . This is illustrated by considering the asymptotics of the dispersion measure IOV defined in Section 2, but an analogous derivation would be possible for other ordinal dispersion measures as well. For the case of complete data, the IOV asymptotics are provided by Theorem 7.1.1 in Weiß. 14 We now adapt his derivations to the case of incomplete data. If applying the Delta method directly to (13), we use the linear approximation ofÎOV in Theorem 7.1.1 of Weiß 14 together withf * , namelyÎ This linear approximation exists if f is not equal to the extreme two-point distribution ( 1 2 , … , 1 2 ) ⊤ (recall Section 2; for f = ( 1 2 , … , 1 2 ) ⊤ , the linear term in the expansion ofÎOV would vanish). Then, we conclude that Similarly, the mean ofÎOV is computed approximately as .
Thus, we have a negative bias in estimating IOV, also see Weiß. 14 Note that exactly the same results are obtained if using (11) as the starting point, which is demonstrated in Supplement S.2. If being concerned with i.i.d. ordinal data, that is, if ord (h) = 0, then we simply obtain − 1 n IOV for the bias. In the same way, the asymptotics of further statistics regarding the marginal distribution of ordinal time series could be obtained, for example, that of alternative measures of dispersion, or of measures about its shape. As an example, a simple measure of ordinal skewness is given by skew = 2 m ∑ m−1 i=0 f i − 1, see Klein and Doll, 16 Weiß. 14 It is estimated by replacing the f i byf * i , thus leading to the mean and to the asymptotic variance Let us summarize these results in the following theorem.

Theorem 2.
Let the assumptions of Theorem 1 hold. Then, the sample IOV and skew, computed based onf * , are asymptotically normally distributed with

Extension: Measuring serial dependence
The CLTs (11) and (13) forf * are used for statistics referring to the marginal distribution of (X t ), recall Section 3.3. If, however, we want to compute a measure of serial dependence in the presence of missing data, such as ord (h) defined in Section 2, then the concept of amplitude modulation and the corresponding CLTs have to be extended. Like in (8), we usê f * =̂∕O as an estimator for the marginal distribution f. Then, we redefine the ordinal Cohen's aŝ Here,f * ij (h) denotes an estimator of f ij (h) that is defined as follows. In analogy to Parzen, 8 Dunsmuir and Robinson, 9 we first consider the bivariate cumulative relative frequencŷi j (h) of the amplitude-modulated binarization By the independence of (O t ) to (X t ), its mean equals which implies to define the estimator in analogy to (8). (16) is generally biased. To derive the asymptotics of thef * ij (h) and, in particular, of̂o rd (h) from (15), we have to extend our derivations from Section 3.2 in analogy to the approach in Supplements A.5 and A.7 in Weiß. 14 Therefore, we define the 2 (m + 1)dimensional binary vectors

If (O t ) is deterministic, this leads again to an exactly unbiased estimator of f ij (h) (again a type of weighted mean). By contrast, if (O t ) is stochastic, namely stationary with autocovariance function O (h) and (h) =
where the components numbered by −1, 0, … , m − 1 agree with those of Z * t in (10), and where the additional components at positions −2 and m, … , 2 m − 1 refer to the relevant interactions at times t and t − h. Note that we only included the "j-j" pairs Z t,j Z t−h,j in Z (h) * t , becausêo rd (h) in (15) only makes use of the "agreement frequencies"f * jj (h). If also the asymptotics off * ij (h) with i ≠ j would be required, then Z (h) * t would have to be extended by further components To keep it simple, we continue with the vectors Z (h) * t defined by (17). These allow us to conclude on the joint asymptotics of O t O t−h , O,̂i,̂j j (h) in analogy to Section 3.2. Then, again like in Section 3.2, we derive the joint asymptotic distribution off * i ,f * jj (h), and finally that of̂o rd (h). The details are postponed to Supplement S.3. As the main result from a practical perspective, we obtain the following theorem, which shall later be applied to test for significant serial dependence in (X t ).

Theorem 3. Let the assumptions of Theorem 1 hold, and denote
Under the null hypothesis of (X t ) being i.i.d., the distribution of̂o rd (h) at lag h ∈ N, defined by (15), is approximately normal with mean and variance given by , respectively.
Theorem 3 shows that the (negative) bias of̂o rd (h) is not influenced by the dependence structure of (O t ), it is just inversely proportional to the expected number of available observations, n . The approximate variance, by contrast, depends on joint second-and third-order moments of (O t ), namely on (h) and (h, 2h). If the data are fully observed, then = (h) = (h, 2h) = 1, and Theorem 3 simplifies to Theorem 7.2.1 in Weiß. 14 If, by contrast, data are missing with (O t ) being i.i.d., so (h) = 2 and (h, 2h) = 3 , then the variance in Theorem 3 simplifies to But still, the variance depends in a more complex way on the missing data than the mean: the "no-missings variance" (first summand) is inflated by the factor 1∕ 2 , and we have an additional summand being proportional to (1 − )∕ 2 . This kind of dependence on is more complex than the weaker 1∕ -dependence in the case of the marginal statistics discussed in Example 3.
Theorem 3 and Equation (18) can be applied to test for significant serial dependence (on some level ) in the presence of missing data. For this purpose, we have to plug-in estimated probabilities into the formula for the approximate variance, leading to the valuê2 . Then, the critical values are given by −1∕(n̂) ∓ z 1− ∕2̂, where z 1− ∕2 denotes the (1 − ∕2)-quantile of the standard normal distribution.

Example 4.
An important special case for practice is that of a binary process (X t ), ie, where m = 1. For simplicity, one usually assumes the range to be coded as  = {0, 1}, with marginal distribution given by p 1 = p and p 0 = f 0 = 1 − p. Then, ord (h) and nom (h) are identical, and they also agree with the ACF (the same holds for the respective sample counterparts). The variance in Theorem 3 simplifies to It is equal to 1∕n if no data are missing, which are the well-known ACF asymptotics. For (O t ) being i.i.d., we get These results can be applied to test for significant dependence (autocorrelation) in binary time series with missing data.

Results from a simulation study
To check the finite-sample performance of the asymptotic results derived in Section 3, a simulation study with 10,000 replications per scenario was done. The ordinal time series were generated according to the rank count approach described in Weiß, 14 that is, a process (I t ) of bounded counts with range {0, … , m} was generated, and the ordinal data (X t ) with range  = {s 0 , … , s m } was obtained by setting X t = s I t . For the rank counts, a first-order binomial autoregressive (BinAR(1)) model is assumed, see Section 3.3 in Weiß 2 for details. The BinAR(1) model with mean parameter p ∈ (0; 1) and dependence parameter r ∈ [0; 1) has a binomial marginal distribution, namely Bin(m, p), and the lag-h ACF r h , where r = 0 leads to i.i.d. data. Inspired by the migraine data (recall Example 1), we consider the parametrizations (m, p, r) = (3, 0.20, 0.35) and (10, 0.45, 0.50). In addition, we also use r = 0 within the aforementioned parametrizations as examples of i.i.d.-simulations. The sample sizes are n ∈ {50, 100, 250, 500, 1000}, and the probability of the i.i.d.amplitude-modulating process (O t ) is chosen as ∈ {1, 0.95, 0.9, 0.85, 0.8, 0.75, 0.5}. For each of these scenarios, the CDF, IOV, skew, and̂o rd (h) are computed, and we compare the sample properties obtained from the simulated data with the asymptotic properties computed according to Section 3. The full simulation and asymptotic results are offered as supplementary material. In what follows, we discuss a few illustrative graphs created thereof. Figure 3 shows diverse error plots, where the central dot corresponds to the mean, and where the error bars are drawn as ∓ the standard deviation (SD). The first row of Figure 3 shows the (asymptotic vs simulated) mean ∓ SD of the dispersion measureÎOV, recall Section 3.3. To save some space, only two levels of missingness are considered, namely = 1 (complete data) and = 0.75 (25% missing data), and these are represented by different dots as explained in the legend. If we compare the filled dots ( = 1) to the incomplete dots ( = 0.75), we recognize a slight effect of the missing data. The negative bias ofÎOV becomes stronger with decreasing (the true IOV values are 0.468 and 0.351, respectively), and the SDs increase in any scenario. But the crucial point is that the asymptotics well capture these changes. Only for n = 50, we see a modest deviation between asymptotic and simulated SD forÎOV. Otherwise, we have a rather close agreement. The conclusions for the skewness measureŝkew in the second row of Figure 3 are essentially the same, except that there is virtually no bias. The true skew values (=asymptotic means) are 0.6 and 0.1, respectively, and also for the simulated means ofŝkew, there is hardly any effect of n and . Altogether, the derived asymptotics are well suited to approximate the true finite-sample distribution ofÎOV andŝkew.
Next, let us turn to the dependence measurêo rd (h), recall Section 3.4 for the asymptotics. The first row of Figure 4 shows the (asymptotic vs simulated) mean ∓ SD of̂o rd (1) for i.i.d. ordinal data (tabulated results are provided by the supplementary material). We see a rather good agreement between the asymptotic approximation and the actual true sampling distribution except for the small-sample scenarios n = 50, 100. We also recognize that decreasing again intensifies the negative bias and, in particular, the SDs severely increase (more than twice the SD if 25% missing data). Next, we apply the dependence test (5%-level) described before Example 4. For this purpose, we estimate bŷ= O and f by    f * , and these are plugged-in into the asymptotics (18). Then, the dependence test is executed by comparinĝo rd (1) to the critical values −1∕(n̂) ∓ z 0.975̂. Counting across all replications, we get the simulated rejection rates plotted in the second row of Figure 4. These rejection rates express the size in the i.i.d.-case r = 0, and the power in the dependence case r > 0. It becomes clear that the simulated sizes are reasonably close to the nominal level 0.05 for n ≥ 100, irrespective of the actual extent of missingness. However, we note a strong effect of missing data on the power of the test. Especially for the small sample sizes n = 50, 100, we get a severe deterioration in power if 25% of the data are missing. For example, if n = 100 and 25% are randomly missing (so the "effective sample size" is 75), we have a worse power than if n = 50 without missing data. This is caused by the strong increase of the SD for decreasing , and is also confirmed by asymptotic computations using (18) From an intuitive point of view, it is better to estimate a dependence measure such aŝo rd (h) from a short but connected categorical sequence than from a long but disconnected one. Thus, missing observations have a much stronger effect on serial than on marginal properties.

Data application: Migraine patients
Let us return to the migraine data introduced in Example 1. Using the amplitude modulation approach developed in Section 3, we estimated the marginal CDF via (8) as well as corresponding measures of location, dispersion, and skewness. The obtained results are summarized in Figure 5 for the peak-severity series, and in Figure 6 for the stress series (we plot the PMF instead of the CDF as this is easier to interpret). The stress series in Figure 6 has a nearly symmetric marginal distribution with median 5 (and the outermost categories are never observed), so patient B experiences medium stress levels throughout the observation period. The peak-severity distribution in Figure 5, by contrast, is clearly skewed to the right with median category "none." Nevertheless, patient A had an at least mild peak on more than 40% of all days. Before further analyzing the serial dependence structure of both time series, and before applying the asymptotics derived before, the pattern of missingness needs to be analyzed in some more detail. For the migraine data, there is no reason to assume the amplitude-modulating process to be deterministic. The binary amplitude-modulating time series o 1 , … , o n for Figures 1 and 2 take the value 0 at the dotted lines and 1 otherwise. Thus, the probability of getting an observation at time t is estimated aŝ= 206∕225 ≈ 0.916 ( Figure 1) and̂= 133∕136 ≈ 0.978 (Figure 2), respectively. We analyze for possible serial dependence within (o t ) by using the ordinary sample ACF (also recall Example 4). In both cases, the obtained ACF values are not significant at the 5%-level: 0.080, −0.036, 0.026, … with approximate standard error 1∕ √ n ≈ 0.067 for Figure 1, and −0.023, … with 1∕ √ n ≈ 0.086 for Figure 2. Thus, there is no contradiction to assume (O t ) being i.i.d. in both examples. Hence, we can test for the serial dependence of the ordinal peak-severity and stress series by using the simplified expression (18) for the approximate variance of̂o rd (h).
The results obtained for thêo rd (h)-test (approximate critical values at 5% level) are also summarized in Figures 5 and 6. In both cases, we have significantly positive but decreasing values at lags 1 and 2 (0.341, 0.213, … for peak severity, 0.392, 0.203, … for stress). This information could be used for model fitting, and it implies that an AR(1)-like model for the respective rank counts (in analogy to the simulations of Section 4.1) might give a feasible solution. Furthermore, the clearly positive lag-1 values indicate some tendency for inertia, in the sense that the peak or stress level, respectively, of the previous day continues in a similar way also at the present day. While these information would be relevant for the considered patients, we conclude by discussing another question: What would have happened if we would have ignored the missing data? As an experiment (not recommended for imitation in practice), we just skip the missing observations, leading to complete time series of reduced lengths ñ = 206 (peak severity) and ñ = 133 (stress), respectively. If using these reduced time series for computing marginal statistics, we obtain the same estimates as before, which is clear from Equation (8). But both the estimates and the critical values for̂o rd (h) are affected. For the peak-severity series, where the number of missing (and now ignored) values was rather large (19 out of 225), thêo rd (h)-values change from 0.341, 0.213, … to 0.258, 0.138, … (and the approximate standard error from 0.072 to 0.055). Thus, ignoring the missing data, we would arrive at much lower dependence values. And even for the stress series, where only three out of 136 observations are missing, there is a visible effect:̂o rd (h) changes from 0.392, 0.203, … to 0.370, 0.195, … , and the approximate standard error from 0.060 to 0.057. Hence, it becomes clear that missing observations in ordinal time series should not just be ignored but carefully considered for time series analysis.
Here,p * ij (h) ∶=̂i In Supplement S.4, it is first shown that the vector √ n ( … ,p * i − p i , … ,p * jj (h) − p jj (h), … ), the entries of which are defined by (19) and (24), is approximately normally distributed with mean 0 and some covariance matrix (h) * , see the derivations up to (S.7). Then, these asymptotics are used to derive the following counterpart to Theorem 3.

Theorem 6.
Let the assumptions of Theorem 4 hold, and use the notation (h 1 , … , h r ) of Theorem 3. Under the null hypothesis of (X t ) being i.i.d., the distribution of̂n om (h) at lag h ∈ N, defined by (23), is approximately normal with mean and variance given by , respectively.
To sum up, for nominal time series with missing data, the same battery of results is available as for the ordinal case. The obtained formulae differ only slightly from their ordinal counterparts such that software implementations are very similar for both cases.

CONCLUSIONS
Motivated by a project on migraine patients, where the collected time series on qualitative features such as stress or pain peak severity are often incomplete, we considered the task of analyzing categorical time series in the presence of missing data. We proposed a unified framework for handling missingness in categorical time series and for deriving the asymptotics of the relevant statistics. We proved expressions for the asymptotic distribution of the univariate and bivariate sample CDF (PMF) computed from an ordinal (nominal) time series that explicitly account for missing observations. These expressions, in turn, were used to derive the asymptotics of, for example, measures of dispersion or serial dependence in such time series. The finite-sample performance of the resulting asymptotic approximations was investigated by simulations, and the developed approaches were applied to example time series from the migraine project. It became clear that especially the measurement of serial dependence is strongly affected by incomplete data, and simply ignoring the missing observations may cause quite misleading results. Thus, missing observations in categorical time series should be carefully considered when analyzing these time series. We conclude by briefly outlining a direction for future research that seems relevant for the migraine project. Besides analyzing the time series individually, it would also be relevant to analyze for cross-dependencies between the different features of a patient in time. Thus, one has to deal with missingness again, but now with respect to statistics that are applied to features of different type, for example, ordinal features with different ranges (such as peak severity vs stress) or even features being measured on different scales.