Latent Theme Dictionary Model for Finding Co-occurrent Patterns in Process Data


Abstract

Process data, which are temporally ordered sequences of categorical observations, are of recent interest due to their increasing abundance and the desire to extract useful information. A process is a collection of time-stamped events of different types, recording how an individual behaves in a given time period. Process data are too complex in terms of size and irregularity for the classical psychometric models to be directly applicable and, consequently, new approaches to modeling and analysis are needed. We introduce herein a latent theme dictionary model for processes that identifies co-occurrent event patterns and individuals with similar behavioral patterns. Theoretical properties are established under certain regularity conditions for the likelihood-based estimation and inference. A nonparametric Bayes algorithm using the Markov chain Monte Carlo method is proposed for computation. Simulation studies show that the proposed approach performs well in a range of situations. The proposed method is applied to an item in the 2012 Programme for International Student Assessment with interpretable findings.


References

  • Aalen, O., Borgan, O., & Gjessing, H. (2008). Survival and event history analysis: A process point of view. Berlin: Springer.

  • Allison, P. D. (1984). Event history analysis: Regression for longitudinal event data (Vol. 46). California: Sage.

  • Allman, E., Matias, C., & Rhodes, J. (2009). Identifiability of parameters in latent structure models with many observed variables. The Annals of Statistics, 37, 3099–3132.

  • Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.

  • Borboudakis, G., & Tsamardinos, I. (2019). Forward-backward selection with early dropping. The Journal of Machine Learning Research, 20, 276–314.

  • Chen, Y. (2019). A continuous-time dynamic choice measurement model for problem-solving process data. arXiv preprint arXiv:1912.11335.

  • Chen, Y.-L., Tang, K., Shen, R.-J., & Hu, Y.-H. (2005). Market basket analysis in a multiple store environment. Decision Support Systems, 40, 339–354.

  • Deng, K., Geng, Z., & Liu, J. S. (2014). Association pattern discovery via theme dictionary models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76, 319–347.

  • Duchateau, L., & Janssen, P. (2007). The frailty model. Berlin: Springer.

  • Dunson, D. B., & Xing, C. (2009). Nonparametric Bayes modeling of multivariate categorical data. Journal of the American Statistical Association, 104, 1042–1051.

  • Fang, G., Liu, J., & Ying, Z. (2019). On the identifiability of diagnostic classification models. Psychometrika, 84, 19–40.

  • Gibson, W. A. (1959). Three multivariate models: Factor analysis, latent structure analysis, and latent profile analysis. Psychometrika, 24, 229–252.

  • Goodman, L. A. (1974). Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika, 61, 215–231.

  • Goodman, M., Finnegan, R., Mohadjer, L., Krenzke, T., & Hogan, J. (2013). Literacy, numeracy, and problem solving in technology-rich environments among US adults: Results from the Program for the International Assessment of Adult Competencies 2012. First look (NCES 2014-008). ERIC.

  • Griffin, P., McGaw, B., & Care, E. (2012). Assessment and teaching of 21st century skills. Berlin: Springer.

  • Han, Z., He, Q., & von Davier, M. (2019). Predictive feature generation and selection using process data from PISA interactive problem-solving items: An application of random forests. Frontiers in Psychology, 10, 2461.

  • Hastie, T., Tibshirani, R., Friedman, J., & Franklin, J. (2005). The elements of statistical learning: Data mining, inference and prediction. The Mathematical Intelligencer, 27, 83–85.

  • He, Q., & von Davier, M. (2016). Analyzing process data from problem-solving items with n-grams: Insights from a computer-based large-scale assessment. In Handbook of research on technology tools for real-world skill development (pp. 750–777). IGI Global.

  • Ishwaran, H., & Rao, J. S. (2003). Detecting differentially expressed genes in microarrays using Bayesian model selection. Journal of the American Statistical Association, 98, 438–455.

  • Ishwaran, H., & Rao, J. S. (2005). Spike and slab variable selection: Frequentist and Bayesian strategies. The Annals of Statistics, 33, 730–773.

  • Kruskal, J. B. (1977). Three-way arrays: Rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algebra and its Applications, 18, 95–138.

  • Liu, J., Xu, G., & Ying, Z. (2012). Data-driven learning of Q-matrix. Applied Psychological Measurement, 36, 548–564.

  • Liu, J., Xu, G., & Ying, Z. (2013). Theory of the self-learning Q-matrix. Bernoulli, 19, 1790.

  • Lord, F. M. (1980). Applications of item response theory to practical testing problems. UK: Routledge.

  • OECD. (2014a). Assessing problem-solving skills in PISA 2012.

  • OECD. (2014b). PISA 2012 technical report. Available at http://www.oecd.org/pisa/pisaproducts/pisa2012technicalreport.htm.

  • OECD. (2016). PISA 2015 results in focus. Available at https://www.oecd.org/pisa/pisa-2015-results-in-focus.pdf.

  • Piatetsky-Shapiro, G. (1991). Discovery, analysis, and presentation of strong rules. Knowledge Discovery in Databases, 229–238.

  • Qiao, X., & Jiao, H. (2018). Data mining techniques in analyzing process data: A didactic. Frontiers in Psychology, 9, 2231.

  • Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, 4, 639–650.

  • Templin, J., Henson, R. A., et al. (2010). Diagnostic measurement: Theory, methods, and applications. New York: Guilford Press.

  • Tibshirani, R. (1997). The lasso method for variable selection in the Cox model. Statistics in Medicine, 16, 385–395.

  • van der Linden, W. J. (2006). A lognormal model for response times on test items. Journal of Educational and Behavioral Statistics, 31, 181–204.

  • Vermunt, J. K., & Magidson, J. (2002). Latent class cluster analysis. Applied Latent Class Analysis, 11, 89–106.

  • Walker, S. G. (2007). Sampling the Dirichlet mixture model with slices. Communications in Statistics–Simulation and Computation, 36, 45–54.

  • Xu, G., et al. (2017). Identifiability of restricted latent class models with binary responses. The Annals of Statistics, 45, 675–707.

  • Xu, H., Fang, G., Chen, Y., Liu, J., & Ying, Z. (2018). Latent class analysis of recurrent events in problem-solving items. Applied Psychological Measurement, 42, 478.

  • Xu, H., Fang, G., & Ying, Z. (2019). A latent topic model with Markovian transition for process data. arXiv preprint arXiv:1911.01583.


Author information

Corresponding author

Correspondence to Guanhua Fang.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (rdata 775 KB)

Appendices

Appendix A: Conditions C1 and C2

We provide the exact statements of Conditions C1 and C2 in this appendix.

  1. C1.a

    For each equivalence class [j] with size larger than 1, there exists a partition \(\{{\mathcal {I}}_{[j],1}, \mathcal I_{[j],2}, {\mathcal {I}}_{[j],3} \}\) of 1-grams such that for any \(e_1 \in {\mathcal {I}}_{[j],1}\), \(e_2 \in {\mathcal {I}}_{[j],2}\) and \(e_3 \in {\mathcal {I}}_{[j],3}\), the sentence \(E=(e_{l}, e_k), l \ne k \in \{1,2,3\}\) and the sentence \(E = (e_1, e_2, e_3)\) admit only one separation. The cardinalities of the three sets satisfy \(|\mathcal I_{[j],1}|\), \(|{\mathcal {I}}_{[j],2}|\) and \(|{\mathcal {I}}_{[j],3}| \ge |[j]|\). Here, |[j]| is the cardinality of the equivalence class [j].

  2. C1.b

    Define T-matrices \(T_{[j],1}\), \(T_{[j],2}\) and \(T_{[j],3}\) such that \(T_{[j],k} [l,j_1] = \frac{\theta _{j_1 l}}{1 - \theta _{j_1 l}}\) for \(e_l \in {\mathcal {I}}_{[j],k}\), \(j_1 \in [j]\), and \(k = 1, 2 ~\text {or}~ 3\). Matrices \(T_{[j],1}\), \(T_{[j],2}\) and \(T_{[j],3}\) have full column rank.

  3. C2.a

    For each equivalence class [j] with size larger than 1 and for any l-gram \(w = [e_1~e_2 \ldots ~ e_l]\) with \(l \ge 2\), there exists \({\mathcal {D}}_{[j],w}\) (a subset of 1-grams) such that (1) for any \(e \in {\mathcal {D}}_{[j],w}\), the sentence \(E = (e_1, \ldots , e_l, e)\) does not admit other separations containing an \((l+1)\)-gram or an l-gram other than w; (2) the cardinality of \({\mathcal {D}}_{[j],w}\) is greater than or equal to |[j]|.

  4. C2.b

    Define matrix \(T_{[j],w}\) such that \(T_{[j],w} [e,j_1] = \frac{\theta _{j_1 e}}{1 - \theta _{j_1 e}}\) for \(e \in \mathcal D_{[j],w}\) and \(j_1 \in [j]\). Matrix \(T_{[j],w}\) has full column rank.

Conditions C1 and C2 pertain to the dictionary and parameter structures. Specifically, Condition C1.a restricts the 1-grams so that not all combinations of 1-grams are considered as patterns, which ensures that the pattern frequencies can be identified. It is very similar to the sufficient conditions for the identifiability of diagnostic classification models (DCMs; Xu et al. 2017; Fang et al. 2019), which require that all items can be divided into three non-overlapping item sets. Here, a 1-gram can be viewed as the counterpart of an item in DCMs. Condition C2.a requires that each l-gram does not overlap with other patterns to some extent and thus can be identified. Conditions C1.b and C2.b require that examinees from different groups have different pattern frequencies.

The T-matrices here share similar ideas with those in Liu et al. (2012, 2013). We use the following example to illustrate this idea.

Example 2

Consider a 2-class model with \(\lambda _1 = \lambda _2\) and \(\mathcal D = \{ [a], [b], [c], [d], [e], [f], [a~b], [c~d], [e~f]\}\). Pattern probability \(\{\theta _{jw}\}\) is

We claim this setting is identifiable.

Notice that Classes 1 and 2 are in the same equivalence class [1]. We can construct \({\mathcal {I}}_{[1],1} = \{[a], [b]\}\), \(\mathcal I_{[1],2} = \{[c],[d]\}\), and \({\mathcal {I}}_{[1],3} = \{[e],[f]\}\). It is easy to check that their T-matrices satisfy

$$\begin{aligned} T_{[1],1} = T_{[1],2} = T_{[1],3} = \begin{pmatrix} 1 &{}\quad 3 \\ 1 &{}\quad 1/3\\ \end{pmatrix}. \end{aligned}$$

Hence, Condition C1 is satisfied, since they all have full column rank. For \(w = [a~b]\), we can set \({\mathcal {D}}_{[1],w} = \{[c],[d]\}\) by checking that both sentences (a, b, c) and (a, b, d) admit only one separation. Its T-matrix is

$$\begin{aligned} T_{[1],w}= \begin{pmatrix} 1 &{}\quad 3 \\ 1 &{}\quad 1/3\\ \end{pmatrix}, \end{aligned}$$

which also has full column rank. Similarly, we can check the patterns [c d] and [e f]. Thus, Condition C2 is also satisfied. Furthermore, Assumption A1 holds since both classes contain all sentences in \({\mathcal {O}}\). Lastly, Assumption A2 obviously holds.
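
As a quick numerical check of the rank conditions in this example, one can verify the full column rank of the T-matrix directly. The short sketch below (in Python with NumPy, chosen here only as a convenience) reproduces the \(2 \times 2\) matrix displayed above.

```python
import numpy as np

# T-matrix shared by the three partition sets in Example 2.
T = np.array([[1.0, 3.0],
              [1.0, 1.0 / 3.0]])

# Full column rank (rank 2) is what Conditions C1.b and C2.b require.
print(np.linalg.matrix_rank(T))  # prints 2
```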

Appendix B: Proofs

To prove the main theoretical results, we start with two lemmas that play key roles in dictionary and parameter identifiability. The proof of Lemma 2 is presented at the end of this section.

Lemma 1

(Kruskal 1977) Suppose \(A,B,C,{\bar{A}},{\bar{B}},{\bar{C}}\) are six matrices with R columns. Suppose there exist integers \(I_0\), \(J_0\), and \(K_0\) such that \(I_0+J_0+K_0 \ge 2R+2\), every \(I_0\) columns of A are linearly independent, every \(J_0\) columns of B are linearly independent, and every \(K_0\) columns of C are linearly independent. Define a triple product to be a three-way array \([A,B,C] = (d_{ijk})\), where \(d_{ijk}=\sum _{r=1}^{R} a_{ir} b_{jr} c_{kr}\). Suppose that the two triple products are equal, \([A,B,C]=[{\bar{A}},{\bar{B}},{\bar{C}}]\). Then, there exists a column permutation matrix P such that \({\bar{A}}=AP\Lambda , {\bar{B}}=BPM, {\bar{C}}=CPN\), where \(\Lambda , M, N\) are diagonal matrices and \(\Lambda MN =\) identity. (A column permutation matrix is right-multiplied to a given matrix to permute the columns of that matrix.)
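
To make the triple product in Lemma 1 concrete, the following small numerical sketch (the matrices are arbitrary illustrative choices, not from the paper) builds \([A,B,C]\) via \(d_{ijk}=\sum _r a_{ir} b_{jr} c_{kr}\) and verifies that the permutation-and-rescaling ambiguity allowed by the lemma leaves the triple product unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
R = 2                                   # number of columns (latent classes)
A = rng.random((4, R))                  # hypothetical I x R matrix
B = rng.random((3, R))                  # hypothetical J x R matrix
C = rng.random((5, R))                  # hypothetical K x R matrix

def triple_product(A, B, C):
    # d_{ijk} = sum_r a_{ir} b_{jr} c_{kr}
    return np.einsum('ir,jr,kr->ijk', A, B, C)

# Permute columns and rescale so that the diagonal scalings multiply to identity.
P = np.array([[0., 1.], [1., 0.]])      # column permutation matrix
Lam, M = np.diag([2.0, 0.5]), np.diag([1.0, 4.0])
N = np.linalg.inv(Lam @ M)              # ensures Lam * M * N = identity
A_bar, B_bar, C_bar = A @ P @ Lam, B @ P @ M, C @ P @ N

# The two triple products coincide, matching the ambiguity described in Lemma 1.
print(np.allclose(triple_product(A, B, C),
                  triple_product(A_bar, B_bar, C_bar)))   # True
```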

Lemma 2

Under Assumptions A1 and A2, it holds that \({\mathcal {O}}_{[j]} = {\mathcal {O}}_{[j]}^{'}\) if and only if \({\mathcal {D}}_{[j]} = {\mathcal {D}}_{[j]}^{'}\).

Here, we recall that \({\mathcal {O}}_{[j]}\) is the observed sentence set generated from equivalence class [j] and \({\mathcal {D}}_{[j]}\) is the dictionary consisting of patterns from equivalence class [j].

Proof of Theorem 1

For every model \({\mathcal {P}} = ({\mathcal {D}}, \{\theta _{jw}\}, \{\lambda _j\}, \pi , \kappa ) \in {\mathfrak {P}}^0\), we need to show that if there exists another model \({\mathcal {P}}^{'}\) such that

$$\begin{aligned} P(K|\kappa ) \cdot \bigg \{ \sum _z \pi _z \prod _{k=1}^K \bigg \{ \sum _{S_{k} \in {\mathcal {S}}_{k}} P(S_{k}, {\tilde{T}}_{k}| z) \bigg \} \bigg \} = P(K|\kappa ^{'}) \cdot \bigg \{ \sum _z \pi _z^{'} \prod _{k=1}^K \bigg \{ \sum _{S_{k} \in {\mathcal {S}}_{k}} P(S_{k}, {\tilde{T}}_{k}| z) \bigg \} \bigg \}, \nonumber \\ \end{aligned}$$
(A1)

it must hold \({\mathcal {P}} = {\mathcal {P}}^{'}\).

We prove it through the following steps. (1) \(\kappa \)-identifiability: we show that the parameter \(\kappa \) is identifiable. (2) \(\lambda \)-identifiability: we prove that \(\lambda _{[j]} = \lambda _{[j]}^{'}\) for any equivalence class [j]. (3) Dictionary identifiability: we show that \({\mathcal {O}} = \mathcal O^{'}\) implies \({\mathcal {D}} = {\mathcal {D}}^{'}\). (4) \(\{\theta \}, \pi \)-identifiability: we show that \(\{\theta _{jw}\} \overset{p}{=} \{\theta _{jw}^{'}\}\) and \(\pi \overset{p}{=} \pi ^{'} \).

For \(\kappa \)-identifiability, we can see that the marginal distribution of \(e_{1:N}\) and \(t_{1:N}\) is

$$\begin{aligned} P(e_{1:N}, t_{1:N}) = P(K|\kappa ) \cdot \bigg \{ \sum _z \pi _z \prod _{k=1}^K \bigg \{ \sum _{S_{k} \in {\mathcal {S}}_{k}} P(S_{k}, {\tilde{T}}_{k}| z) \bigg \} \bigg \}. \end{aligned}$$
(A2)

By taking \(K = 0\), we have that \(P(e_{1:N}, t_{1:N}) = P(K = 0)\). Then, it must hold that

$$\begin{aligned} e^{- \kappa } = e^{- \kappa ^{'}}. \end{aligned}$$

This implies that \(\kappa = \kappa ^{'}\).

For \(\lambda \)-identifiability, we take \(K = 1\), an event sentence \(E = (e)\), and \({\tilde{T}} = (t)\). Then, (A1) becomes

$$\begin{aligned} P(K = 1|\kappa ) \bigg \{ \sum _{j=1}^J \pi _j \theta _{j e} \lambda _j \exp \{ - \lambda _j t\} \bigg \} = P(K = 1|\kappa ^{'}) \bigg \{ \sum _{j=1}^J \pi _j^{'} \theta _{j e}^{'} \lambda _j^{'} \exp \{ - \lambda _j^{'} t\} \bigg \}. \end{aligned}$$
(A3)

By \(\kappa \)-identifiability, we further have

$$\begin{aligned} \sum _{j=1}^J \pi _j \theta _{j e} \lambda _j \exp \{ - \lambda _j t\} = \sum _{j=1}^J \pi _j^{'} \theta _{j e}^{'} \lambda _j^{'} \exp \{ - \lambda _j^{'} t\} \end{aligned}$$
(A4)

after simplification. Letting \(t \rightarrow \infty \), we must have \(\lambda _{[j_0]} = \lambda _{[j_0]}^{'}\), where \([j_0]\) is the equivalence class with the smallest \(\lambda \) value. Hence, we also have \(\sum _{j \in [j_0]} \pi _j \theta _{je} = \sum _{j \in [j_0]} \pi _j^{'} \theta _{je}^{'}\). Then, (A4) becomes

$$\begin{aligned} \sum _{j \notin [j_0]} \pi _j \theta _{j e} \lambda _j \exp \{ - \lambda _j t\} = \sum _{j \notin [j_0]} \pi _j^{'} \theta _{j e}^{'} \lambda _j^{'} \exp \{ - \lambda _j^{'} t\}. \end{aligned}$$

By a similar strategy, we can show that \(\lambda _{[j]} = \lambda _{[j]}^{'}\) for every equivalence class [j]. This gives \(\lambda \)-identifiability.
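
The step of letting \(t \rightarrow \infty \) can be illustrated numerically: in a mixture of exponential densities, the smallest rate dominates the tail, so \(-\log f(t)/t\) converges to \(\min _j \lambda _j\). The sketch below uses made-up weights and rates purely for illustration of this reasoning step.

```python
import numpy as np

# Hypothetical mixture of exponential densities: sum_j pi_j * lam_j * exp(-lam_j * t).
pi = np.array([0.3, 0.5, 0.2])
lam = np.array([0.5, 1.2, 2.0])

def mixture_density(t):
    return np.sum(pi * lam * np.exp(-lam * t))

# For large t, the term with the smallest rate dominates, so -log f(t) / t
# approaches min_j lam_j, which is how lambda_{[j_0]} is pinned down.
for t in [10.0, 50.0, 100.0]:
    print(t, -np.log(mixture_density(t)) / t)   # tends to 0.5
```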

For the dictionary identifiability, we would like to point out that its proof is not covered in Deng et al. (2014). Therefore, we seek an alternative approach to prove it.

Take \(K = 1\), an arbitrary sentence \(E \in {\mathcal {O}}\), and \({\tilde{T}} = (t, \ldots , t)\) with \(n_E\) entries, where \(n_E\) is the sentence length. Then, (A1) becomes

$$\begin{aligned} \sum _{[j]} [\sum _{j_1 \in [j]} \pi _{j_1} P(E|j)] (\lambda _{[j]})^{n_E} \exp \{ - \lambda _{[j]} n_E t\} = \sum _{[j]} [\sum _{j_1 \in [j]} \pi _{j_1}^{'} P^{'}(E|j)] (\lambda _{[j]}^{'})^{n_E} \exp \{ - \lambda _{[j]}^{'} n_E t\} . \nonumber \\ \end{aligned}$$
(A5)

Comparing the coefficients on both sides of (A5), we then have

$$\begin{aligned} \sum _{j_1 \in [j]} \pi _{j_1} P(E|j) = \sum _{j_1 \in [j]} \pi _{j_1}^{'} P^{'}(E|j). \end{aligned}$$
(A6)

This implies that \({\mathcal {O}}_{[j]} = {\mathcal {O}}_{[j]}^{'}\). By Lemma 2, we then have \({\mathcal {D}}_{[j]} = \mathcal D_{[j]}^{'}\). Notice that \({\mathcal {D}} = \cup _{[j]} \mathcal D_{[j]}\). This concludes the dictionary identifiability.

For \(\{\theta \}, \pi \)-identifiability, we prove it by making use of (A6). In (A6), we take \(E = (e)\) for \(e \in {\mathcal {E}}\), \(E = (e_1, e_2)\) with \(e_1, e_2\) from different partition sets, and \(E = (e_1, e_2, e_3)\) with \(e_k \in \mathcal I_{[j],k}, (k = 1,2,3)\), sequentially.

Without loss of generality, we suppose there is only one equivalence class. According to Condition C1.a, each such E admits only one separation, so (A6) simplifies to

$$\begin{aligned} \sum _j \eta _j \varphi _{je} = \sum _j \eta _j^{'} \varphi _{je}^{'},&~\text {if }E = (e) \end{aligned}$$
(A7)
$$\begin{aligned} \sum _j \eta _j \varphi _{je_1} \varphi _{je_2} = \sum _j \eta _j^{'} \varphi _{je_1}^{'} \varphi _{je_2}^{'},&~\text {if }E = (e_1, e_2) \end{aligned}$$
(A8)
$$\begin{aligned} \sum _j \eta _j \varphi _{je_1} \varphi _{je_2} \varphi _{je_3} = \sum _j \eta _j^{'} \varphi _{je_1}^{'} \varphi _{je_2}^{'} \varphi _{je_3}^{'},&~\text {if }E = (e_1, e_2, e_3). \end{aligned}$$
(A9)

where we define \(\eta _j = \pi _j \prod _e (1 - \theta _{je})\), \(\varphi _{je} = \theta _{je}/(1 - \theta _{je})\).

In addition, if we take E to be an empty sentence, then it holds

$$\begin{aligned} \sum _j \eta _j = \sum _j \eta _j^{'}. \end{aligned}$$
(A10)

It is not hard to write Eqs. (A7)–(A10) in terms of tensor products of matrices, that is,

$$\begin{aligned} {[}{{\bar{T}}}_1, {{\bar{T}}}_2, {{\bar{T}}}_3] = [{{\bar{T}}}_1^{'}, {{\bar{T}}}_2^{'}, \bar{T}_3^{'}], \end{aligned}$$

where

$$\begin{aligned}{{\bar{T}}}_1= & {} \left( \begin{array}{ccc} 1 &{} \ldots &{} 1 \\ \varphi _{1 v_1} &{} \ldots &{} \varphi _{J v_1} \\ \vdots &{} \vdots &{} \vdots \\ \varphi _{1 v_{I_1}} &{} \ldots &{} \varphi _{J v_{I_1}} \\ \end{array} \right) , \\ {{\bar{T}}}_2= & {} \left( \begin{array}{ccc} 1 &{} \ldots &{} 1 \\ \varphi _{1 v_1} &{} \ldots &{} \varphi _{J v_1} \\ \vdots &{} \vdots &{} \vdots \\ \varphi _{1 v_{I_2}} &{} \ldots &{} \varphi _{J v_{I_2}} \\ \end{array} \right) , \end{aligned}$$

and

$$\begin{aligned}{{\bar{T}}}_3 = \left( \begin{array}{ccc} 1 &{} \ldots &{} 1 \\ \varphi _{1 v_1} &{} \ldots &{} \varphi _{J v_1} \\ \vdots &{} \vdots &{} \vdots \\ \varphi _{1 v_{I_3}} &{} \ldots &{} \varphi _{J v_{I_3}} \\ \end{array} \right) \cdot \Lambda . \end{aligned}$$

Here, \(\Lambda \) is a J by J diagonal matrix with its j-th element equal to \(\eta _j\). By Condition C1.b, the matrices \({{\bar{T}}}_1\), \({{\bar{T}}}_2\) and \({{\bar{T}}}_3\) have full column rank. Therefore, by Lemma 1, we have that

$$\begin{aligned} {{\bar{T}}}_1^{'} = {{\bar{T}}}_1 P A, ~ {{\bar{T}}}_2^{'} = {{\bar{T}}}_2 P B ~ \text { and } ~ {{\bar{T}}}_3^{'} = {{\bar{T}}}_3 P C, \end{aligned}$$

where P is a column permutation matrix, and A, B, and C are diagonal matrices satisfying \(ABC = I\). Since the elements in the first rows of \({{\bar{T}}}_1, {{\bar{T}}}_2, {{\bar{T}}}_1^{'}, {{\bar{T}}}_2^{'}\) are all ones, it follows that \(A = B = I\). Therefore, \(C = I\) as well. Thus, we have \({{\bar{T}}}_1^{'} = {{\bar{T}}}_1 P\), \({{\bar{T}}}_2^{'} = {{\bar{T}}}_2 P\) and \(\bar{T}_3^{'} = {{\bar{T}}}_3 P\). Comparing element-wise, we can see that \(\eta = \eta ^{'}\) and \(\{\varphi _{je}\} = \{\varphi _{je}^{'}\}\) up to a label switch. Further, \(\{\theta _{je}\} \overset{p}{=} \{\theta _{je}^{'}\}\) due to the monotonic relation between \(\varphi _{je}\) and \(\theta _{je}\).

In the following, we prove that \(\theta _{jw}\) is identifiable up to the same label switch for any pattern \(w \in {\mathcal {D}}\) by induction. Suppose we have that \(\theta _{jw}\) is identifiable when w belongs to {1-grams, ..., (k-1)-grams}. We need to show that \(\theta _{jw}\) is identifiable if w is a k-gram.

Let \({\mathcal {E}}_k\) be the sentence set including all k-grams in \({\mathcal {D}}\) and all possible combinations of a k-gram and a 1-gram that are not in \({\mathcal {D}}\). It is not hard to see that, for each \(E \in {\mathcal {E}}_k\), its separations can only be combinations of m-grams (\(m < k\)) or combinations of a k-gram and a 1-gram. Applying (A6) to these sentences gives

$$\begin{aligned} \sum _j \eta _j \varphi _{jw} = \sum _j \eta _j^{'} \varphi _{jw}^{'},&~ \text {if }E = (w)\text { and }w\text { is }k\text {-gram;} \\ \sum _j \eta _j \varphi _{jv_1} \varphi _{jv_2} = \sum _j \eta _j^{'} \varphi _{jv_1}^{'} \varphi _{jv_2}^{'},&~ \text {if }E = (v_1, v_2), v_1 \text { is a }k\text {-gram and } v_2 \in {\mathcal {D}}_{v_1}. \end{aligned}$$

By the previous results that \(\eta \overset{p}{=} \eta ^{'}\) and \(\varphi _{v} \overset{p}{=} \varphi _{v}^{'}\) for all 1-grams v, we can write the above equations in the following matrix form:

$$\begin{aligned} {{\bar{T}}}_w {\tilde{\varphi }}_{w} = {\mathbf {0}}. \end{aligned}$$
(A11)

where \({\tilde{\varphi }}_{w} = (\varphi _{1w} - \varphi _{1w}^{'}, \ldots , \varphi _{Jw} - \varphi _{Jw}^{'})^T\) and

$$\begin{aligned} {{\bar{T}}}_w = \left( \begin{array}{ccc} \eta _1 &{} \ldots &{} \eta _J \\ \varphi _{1v_1}\eta _1 &{} \ldots &{} \varphi _{Jv_1}\eta _J \\ \vdots &{} \vdots &{} \vdots \\ \varphi _{1v_J}\eta _1 &{} \ldots &{} \varphi _{Jv_J}\eta _J \\ \end{array} \right) . \end{aligned}$$

Here, \(v_1, \ldots , v_J\) are J distinct 1-grams in \({\mathcal {D}}_{[j],w}\). According to Conditions C2.a and C2.b, (A11) admits only one solution. Therefore, \({\tilde{\varphi }}_{w} = {\mathbf {0}}\), which implies \( \theta _{w} \overset{p}{=} \theta _{w}^{'}\). Hence, we conclude that \(\theta _{jw} = \theta _{jw}^{'}\) up to a label switch for all \(w \in {\mathcal {D}}\). This concludes the \(\{\theta \}, \pi \)-identifiability. Having completed all steps, we establish the identifiability results. \(\square \)

Proof of Theorem 2

We prove this result in two steps. In Step 1, we prove that the dictionary \({\mathcal {D}}\) can be estimated consistently. In Step 2, we show that the estimator, \((\{{\hat{\theta }}_{jw}\} , \{\hat{\lambda }_j\}, {\hat{\pi }}, {\hat{\kappa }})\), is consistent. Without loss of generality, we take the compact set \({\varvec{\Theta }}_c\) = {\(\theta _{jw} \in [\eta , 1 - \eta ], \pi _j \in [\eta , 1 - \eta ], \sum _j \pi _j = 1, \lambda _j \in [c, C]\), \(\kappa \in [c, C]\)}, where \(\eta \), c, and C are positive constants such that the true model parameter lies in \({\varvec{\Theta }}_c\).

Proof of Step 1 We first introduce several useful event sets. Define an event set \(\Omega _{D}\),

$$\begin{aligned} \Omega _{D} \equiv \{\omega | {\mathcal {O}}^{*} \subset \{E_{ik} | i = 1, \ldots , m, k = 1, \ldots , K_i \} \}. \end{aligned}$$
(A12)

In other words, on \(\Omega _D\) every possible sentence is observed at least once. Define sets \(\Omega _{E} = \{\omega | |\sum _i K_i| \ge m \kappa / 2 \}\), \(\Omega _K = \{\omega | K_i \le K_0, i = 1, \ldots , m\}\), \(\Omega _T = \{\omega | {\tilde{t}}_{ik,u} \le t_0, \text {for all}~ i,k,u\}\) and \(\Omega _b = \Omega _{E} \cap \Omega _K \cap \Omega _T \cap \Omega _D\). Next, we show that \(\Omega _b\) holds with high probability. Specifically, we bound \(P(\Omega _b^c)\) from above by decomposing \(\Omega _b^c\) into four parts, i.e.,

$$\begin{aligned} \Omega _b^c \subset \Omega _{E}^c \cup \Omega _{K}^c \cup (\Omega _T^c \cap \Omega _K) \cup (\Omega _D^c \cap \Omega _E \cap \Omega _K \cap \Omega _T) . \end{aligned}$$
  1. 1.

    By the large deviation property of the Poisson random variable \(\sum _i K_i\), we have that \(P(\Omega _{E}^c) \le \exp \{ - c m\}\), where \(c = (1 - \log 2) \kappa ^{*} / 2\).

  2. 2.

    It is not hard to see that \(P(\Omega _K^c) \le m P(K_i > K_0) \le m \exp \{- c K_0\}\) for some constant c by using the Poisson moment generating function.

  3. 3.

    By the union bound, we get that \(P(\Omega _T^c \cap \Omega _K) \le m K_0 l_{max} P(t_{ik,u} > t_0) \le m K_0 l_{max} \exp \{- \lambda _{min} t_0\}\). Here, \(l_{max}\) is the longest sentence length and \(\lambda _{min} = \min \{\lambda _j, j = 1, \ldots , J\}\).

  4. 4.

    We show that P(E) has a positive lower bound for every sentence \(E \in {\mathcal {O}}^{*}\). Taking an arbitrary E, we know that \(P(E) = \sum _{z = 1}^J \sum _{S \in {\mathcal {F}}(E)} P(S|z)\). Hence, \(P(E) \ge \prod _{w} (\theta _{jw}^{*})^{{\mathbf {1}}\{w \in S\}} (1 - \theta _{jw}^{*})^{{\mathbf {1}}\{w \notin S\}}\) for \(S \in \mathcal F(E)\) and j such that \(E \in {\mathcal {O}}_j\). Thus, P(E) is bounded below by \(\eta ^{v_D}\), i.e., \(P(E) \ge \eta ^{v_D}\). Therefore, we have that \(P(\Omega _D^c \cap \Omega _E \cap \Omega _K \cap \Omega _T) \le |{\mathcal {O}}^{*}| P(E \notin \{E_{ik} | i = 1, \ldots , m, k = 1, \ldots , K_i \}) \le |{\mathcal {O}}^{*}| (1 - \eta ^{v_D})^{|\{E_{ik}\}|} \le |{\mathcal {O}}^{*}| (1 - \eta ^{v_D})^{m \kappa / 2} \).

Hence, event \(\Omega _b^c\) holds with probability at most \(2 \exp \{ - c m\} + m \exp \{- c K_0\} + m K_0 l_{max} \exp \{- \lambda _{min} t_0\} + |{\mathcal {O}}^{*}| (1 - \eta ^{v_D})^{m \kappa / 2}\).

On the event \(\Omega _b\), all sentences in \(\mathcal O^{*}\) are observed at least once. By the dictionary identifiability from Theorem 1, we know that \(\hat{{\mathcal {D}}} = {\mathcal {D}}^{*}\). In other words, \(P(\hat{{\mathcal {D}}} \ne {\mathcal {D}}^{*}) \le P(\Omega _b^c)\). Choosing \(K_0 = (\log m)^2\) and \(t_0 = (\log m)^2\) completes Step 1.

Proof of Step 2 For any fixed parameter \(\Theta \equiv (\{\theta _{jw}\} , \{ \lambda _j\}, \pi , \kappa )\), let \(l(\Theta )\) denote the log-likelihood evaluated at \(\Theta \). By identifiability, we know that \({\mathbb {E}} l(\Theta ^{*}) > {\mathbb {E}} l(\Theta )\) for any \( \Theta \in {\varvec{\Theta }}_c\) distinct from \(\Theta ^{*}\). By compactness of \(B(\Theta ^{*}, \delta )^c \cap {\varvec{\Theta }}_c\) and continuity of \({\mathbb {E}} l(\Theta )\) (see (A19)), there exists a positive number \(\epsilon \) such that \({\mathbb {E}} l(\Theta ) \le {\mathbb {E}} l(\Theta ^{*}) - 3 \epsilon \) for any \(\Theta \in B(\Theta ^{*}, \delta )^c \cap {\varvec{\Theta }}_c\). Next, we prove the uniform convergence of \(l(\Theta )\) to its expected value.

By Bernstein inequality, we know that

$$\begin{aligned} P\left( \frac{1}{m} |\sum _i l_i(\Theta ) - {\mathbb {E}} l_i(\Theta )| \ge \sqrt{\mathrm {var}(l(\Theta ))} \cdot x \right) \le 2 \exp \{- m x^2\} \end{aligned}$$
(A13)

holds pointwise. By compactness, \(\text {var}(l(\Theta ))\) is bounded by some constant M. Thus,

$$\begin{aligned} P\left( \frac{1}{m} |\sum _i l_i(\Theta ) - {\mathbb {E}} l_i(\Theta )| \ge x \right) \le 2 \exp \{- m x^2 / M\} \end{aligned}$$
(A14)

for any fixed \(\Theta \).

Next, we bound the gap \(l_i(\Theta ) - l_i(\Theta ^{'})\). For notational simplicity, we omit the subscript i in the following displays. We know that

$$\begin{aligned}&l(\Theta ) - l(\Theta ^{'}) \nonumber \\&\quad = \log \{P(K)/P^{'}(K)\} + \log \left\{ \frac{\sum _j \pi _j \prod _{k = 1}^K P(E_k| j) P(T_k| j)}{\sum _j \pi _j^{'} \prod _{k = 1}^K P^{'}(E_k| j) P^{'}(T_k| j)} \right\} \nonumber \\&\quad \le \log \{P(K)/P^{'}(K)\} + \log \left\{ \max _j \frac{\pi _j \prod _{k = 1}^K P(E_k| j) P(T_k| j)}{\pi _j^{'} \prod _{k = 1}^K P^{'}(E_k| j) P^{'}(T_k| j)} \right\} \nonumber \\&\quad \le \log \{P(K)/P^{'}(K)\} + \max _j \log \left\{ \frac{\pi _j \prod _{k = 1}^K P(E_k| j) P(T_k| j)}{\pi _j^{'} \prod _{k = 1}^K P^{'}(E_k| j) P^{'}(T_k| j)} \right\} . \end{aligned}$$
(A15)

For \(\Vert \Theta - \Theta ^{'}\Vert _{\infty } \le \delta _1\), we can see that \( \log \{P(K)/P^{'}(K)\} \le C K \delta _1\) for some constant C. We can further show that \(\log \{ P(E|j)/P^{'}(E|j) \} \le C v_D \delta _1\) and \(\log \{P(T|j)/P^{'}(T|j) \} \le C l_{max} t_0 \delta _1\) on set \(\Omega _b\). This is because

$$\begin{aligned}&\log \{ P(E|j)/P^{'}(E|j) \} \nonumber \\&\quad = \log \frac{ \sum _{S \in {\mathcal {F}}(E)} P(S|j)}{\sum _{S \in {\mathcal {F}}(E)}P^{'}(S|j) } \nonumber \\&\quad \le \max _{S \in {\mathcal {F}}(E)} \log \{ P(S|j) / P^{'}(S|j) \} \nonumber \\&\quad \le \max _{S \in {\mathcal {F}}(E)} \sum _{w} \max \{\log \{\theta _{jw} / \theta _{jw}^{'}\}, \log \{(1 - \theta _{jw}) / (1 - \theta _{jw}^{'})\} \} \nonumber \\&\quad \le C v_D \delta _1 \end{aligned}$$
(A16)

and

$$\begin{aligned}&\log \{ P(T|j)/P^{'}(T|j) \} \nonumber \\&\quad = \log \{ \prod _{u = 1}^{l_E} \lambda _j \exp \{ - \lambda _j t_u \} \} - \log \{ \prod _{u = 1}^{l_E} \lambda _j^{'} \exp \{ - \lambda _j^{'} t_u \} \} \nonumber \\&\quad \le l_{max} (t_0 + 1) \delta _1. \end{aligned}$$
(A17)

With (A16) and (A17), (A15) becomes

$$\begin{aligned} l(\Theta ) - l(\Theta ^{'}) \le \sum _k \{Cv_D \delta _1 + l_{max} (t_0 + 1) \delta _1 \} \le C t_0 K_0 \delta _1, \end{aligned}$$
(A18)

by adjusting the constant.

Next, we prove that \({\mathbb {E}} l(\Theta )\) is a continuous function of \(\Theta \). Define set \(A_{k,t} = \{\omega | t - 1 \le \max \{\tilde{t}_{k_1,u}; k_1 = 1, \ldots , k, u = 1, \ldots , n_{k_1}\} \le t\}\) for \(k,t = 1, 2, \ldots \). By algebraic calculation, for any \(\Theta , \Theta ^{'}\) with \(\Vert \Theta - \Theta ^{'}\Vert _{\infty } \le \delta _1\), we have

$$\begin{aligned}&{\mathbb {E}} l(\Theta ) - {\mathbb {E}} l(\Theta ^{'}) \nonumber \\&\quad = \sum _{k=0}^{\infty } P^{*}(K = k) \{ \int \sum _{E \in {\mathcal {O}}^{*}} P^{*}(E, T | k) \log \{\frac{P(E,T,k)}{P^{'}(E,T,k)} \} dT \} \nonumber \\&\quad \le \sum _{k=0}^{\infty } P^{*}(K = k) \sum _{t = 1}^{\infty } \{ \int _{A_{k,t}} \sum _{E \in {\mathcal {O}}^{*}} P^{*}(E, T | k) \log \{\frac{P(E,T,k)}{P^{'}(E,T,k)} \} dT \} \nonumber \\&\quad \overset{(A18)}{\le } \sum _{k=0}^{\infty } P^{*}(K = k) \sum _{t = 1}^{\infty } \{ \int _{A_{k,t}} \sum _{E \in {\mathcal {O}}^{*}} P^{*}(E, T | k) (C t k \delta _1) dT \} \nonumber \\&\quad \le \sum _{k=0}^{\infty } P^{*}(K = k) \sum _{t = 1}^{\infty } (C t k \delta _1) P^{*}(A_{k,t}) \nonumber \\&\quad \le \sum _{k = 0}^{\infty } C k \delta _1 P^{*}(K = k) \sum _{t=1}^{\infty } k l_{max} \exp \{- \lambda _{min} t\} \nonumber \\&\quad \le \sum _{k = 0}^{\infty } C \delta _1 \frac{l_{max}}{1 - \exp \{-\lambda _{min}\}} k^2 P^{*}(K = k) \nonumber \\&\quad \le C \delta _1 \end{aligned}$$
(A19)

by adjusting the constant.

Thus, we have \(|{\mathbb {E}} l(\Theta ) - {\mathbb {E}} l(\Theta ^{'})| \le \frac{\epsilon }{4}\) for any \(\Theta , \Theta ^{'}\) such that \(\Vert \Theta - \Theta ^{'}\Vert \le \delta _2 \equiv \epsilon / (4C)\). Together with (A18), we then have

$$\begin{aligned}&\frac{1}{m} |\sum _i l_i(\Theta ) - {\mathbb {E}} l_i(\Theta )| - \frac{1}{m} |\sum _i l_i(\Theta ^{'}) - {\mathbb {E}} l_i(\Theta ^{'})| \nonumber \\&\quad \le \frac{1}{m} |\sum _i l_i(\Theta ) - {\mathbb {E}} l_i(\Theta ) - (\sum _i l_i(\Theta ^{'}) - {\mathbb {E}} l_i(\Theta ^{'}))| \nonumber \\&\quad \le \frac{1}{m} \sum _i \{|l_i(\Theta ) - l_i(\Theta ^{'})| + |{\mathbb {E}} l_i(\Theta ) - {\mathbb {E}} l_i(\Theta ^{'})| \} \nonumber \\&\quad \le \epsilon /4 + \epsilon /4 \le \epsilon /2, \end{aligned}$$
(A20)

when \(\Vert \Theta - \Theta ^{'}\Vert _{\infty } \le \delta _3\). Here, we take \(\delta _3\) to be \(\min \{\epsilon /(4C t_0 K_0), \delta _2\}\).

By the covering number technique, there exists a finite set \({\mathcal {N}} \subset {\varvec{\Theta }}_c\) such that every point in \({\varvec{\Theta }}_c\) is within distance \(\delta _3\) of some point in \({\mathcal {N}}\). Thus, by (A14) and a union bound over \({\mathcal {N}}\), we have

$$\begin{aligned} P( \sup _{\Theta \in {\mathcal {N}}} \frac{1}{m} |\sum _i l_i(\Theta ) - {\mathbb {E}} l_i(\Theta )| \ge \epsilon /2 ) \le 2 |{\mathcal {N}}| \exp \{- m \epsilon ^2 / (4 M)\}. \end{aligned}$$

Define the set \(\Omega _g = \{\omega | \sup _{\Theta \in {\varvec{\Theta }}_c } \frac{1}{m} |\sum _i l_i(\Theta ) - {\mathbb {E}} l_i(\Theta )| \le \epsilon \}\). Combined with (A20), it further gives us that

$$\begin{aligned} P( \Omega _g^c ~\text {and}~ \Omega _b)\le & {} 2 |{\mathcal {N}}|\exp \{- m \epsilon ^2 / (4 M)\} \nonumber \\\le & {} 2 (\frac{D}{\delta _3})^{n_p} \exp \{- m \epsilon ^2 / (4 M)\} \end{aligned}$$
(A21)

where D is the diameter of \(\varvec{\Theta }_c\) and \(n_p\) is the total number of model parameters.

Lastly, by the definition of \({\hat{\Theta }}\) and (A21), we have that

$$\begin{aligned} \frac{1}{m} \sum _i l_i({\hat{\Theta }})\ge & {} \frac{1}{m} \sum _i l_i(\Theta ^{*}) \nonumber \\\ge & {} \frac{1}{m} \sum _i {\mathbb {E}} l_i(\Theta ^{*}) - \epsilon \nonumber \\\ge & {} \sup _{\Theta \in \varvec{\Theta }_c \cap B(\Theta ^{*}, \delta )^c} \frac{1}{m} \sum _i {\mathbb {E}} l_i(\Theta ) + 2 \epsilon \nonumber \\\ge & {} \sup _{\Theta \in \varvec{\Theta }_c \cap B(\Theta ^{*}, \delta )^c} \frac{1}{m} \sum _i l_i(\Theta ) + \epsilon \end{aligned}$$
(A22)

on \(\Omega _b \cap \Omega _g\). In other words, (A22) implies that

$$\begin{aligned} P({\hat{\Theta }} \in \varvec{\Theta }_c \cap B(\Theta ^{*}, \delta )^c)\le & {} P((\Omega _b \cap \Omega _g)^c) \nonumber \\\le & {} 2 (\frac{D}{\delta _3})^{n_p} \exp \{- m \epsilon ^2 / (4 M)\} \nonumber \\&+ 2 \exp \{ - c m\} + m \exp \{- c K_0\} + m K_0 l_{max} \exp \{ - \lambda _{min} t_0\} \nonumber \\&+ |{\mathcal {O}}^{*}| (1 - \eta ^{v_D})^{m \kappa /2}. \end{aligned}$$
(A23)

By choosing \(K_0 = (\log m)^2\) and \(t_0 = (\log m)^2 \), the right-hand side of (A23) goes to zero as \(m \rightarrow \infty \), and hence so does the left-hand side. This concludes the proof. \(\square \)

Proposition 1

Under the LTDM setting, the probability mass function \(P(e_{1:N}, t_{1:N}; \Theta )\) can be written in the multiplicative form \(G(e_{1:N}; \Theta ) F(e_{1:N}, t_{1:N}; \Theta _1)\) if and only if \(\lambda _j = \lambda , j = 1, \ldots , J\). Here, \(\Theta _1\) denotes the model parameters excluding \(\{\theta _{jw}\}\), and G and F are some functions.

Proof of Proposition 1

First, we write out the likelihood function

$$\begin{aligned}&P(e_{1:N}, t_{1:N}; \Theta ) \nonumber \\&\quad = \frac{\kappa ^{K}\exp \{-\kappa \}}{K!} \sum _{j=1}^J \pi _j \prod _{k=1}^K \{P(E_k; \{\theta _{jw}\}) P({\tilde{T}}_k; \lambda _j) \} \nonumber \\&\quad = C(K, \kappa ) \sum _{j=1}^J \pi _j \prod _{k=1}^K \{P(E_k; \{\theta _{jw}\}) P({\tilde{T}}_k; \lambda _j) \}, \end{aligned}$$
(A24)

where \(C(K, \kappa )\) is some quantity depending on K and \(\kappa \).

We first prove the sufficient part. Suppose \(\lambda _j = \lambda , j = 1, \ldots , J\). Then, (A24) can be written as

$$\begin{aligned}&P(e_{1:N}, t_{1:N}; \Theta ) \nonumber \\&\quad = C(K, \kappa ) \sum _{j=1}^J \pi _j \prod _{k=1}^K \{P(E_k; \{\theta _{jw}\}) P({\tilde{T}}_k; \lambda _j) \} \end{aligned}$$
(A25)
$$\begin{aligned}&\quad = C(K, \kappa ) \bigg \{\sum _{j=1}^J \pi _j \prod _{k=1}^K P(E_k; \{\theta _{jw}\}) \bigg \} \prod _{k=1}^K P({\tilde{T}}_k; \lambda ). \end{aligned}$$
(A26)

Hence, we can take \(G(e_{1:N}; \Theta ) = C(K, \kappa ) \big \{\sum _{j=1}^J \pi _j \prod _{k=1}^K P(E_k; \{\theta _{jw}\}) \big \}\) and \(F(e_{1:N}, t_{1:N}; \Theta _1) = \prod _{k=1}^K P({\tilde{T}}_k; \lambda )\). This concludes the sufficient part.

We next prove the necessary part. Suppose it is not true. In other words, we can write \(P(e_{1:N},t_{1:N}; \Theta ) = G(e_{1:N}; \Theta ) F(e_{1:N},t_{1:N}; \Theta _1)\) when \(\lambda _j\)’s are not all the same. Without loss of generality, we assume \(\lambda _1 < \lambda _2 \le ... \le \lambda _J\). By assumption,

$$\begin{aligned} C(K, \kappa ) \sum _{j=1}^J \pi _j \prod _{k=1}^K \{P(E_k; \{\theta _{jw}\}) P({\tilde{T}}_k; \lambda _j) \} = G(e_{1:N}; \Theta ) F(e_{1:N},t_{1:N}; \Theta _1). \end{aligned}$$
(A27)

Dividing both sides of (A27) by \(\prod _{k=1}^K P({\tilde{T}}_k; \lambda _1)\), we get

$$\begin{aligned} C(K, \kappa ) \sum _{j=1}^J \pi _j \prod _{k=1}^K \{P(E_k; \{\theta _{jw}\}) \frac{P({\tilde{T}}_k; \lambda _j)}{P({\tilde{T}}_k; \lambda _1)} \} = G(e_{1:N}; \Theta ) \frac{F(e_{1:N},t_{1:N}; \Theta _1)}{\prod _{k=1}^K P({\tilde{T}}_k; \lambda _1)}. \end{aligned}$$
(A28)

By letting \({\tilde{t}}_n \rightarrow \infty , n = 1, \ldots , N\), the left-hand side of (A28) becomes \(C(K, \kappa ) \pi _1 \prod _{k=1}^K P(E_k; \{\theta _{1w}\})\). Then, we know that \(G(e_{1:N}; \Theta )\) has the form \(C_1(e_{1:N};\Theta _1) \prod _{k=1}^K P(E_k; \{\theta _{1w}\})\). Plugging this back into (A27), we get

$$\begin{aligned} C(K, \kappa ) \sum _{j=1}^J \pi _j \prod _{k=1}^K \{P(E_k; \{\theta _{jw}\}) P({\tilde{T}}_k; \lambda _j) \} = C_1(e_{1:N};\Theta _1) \prod _{k=1}^K P(E_k; \{\theta _{1w}\}) F(e_{1:N},t_{1:N}; \Theta _1). \end{aligned}$$

Notice that the left-hand side of the above equation is a polynomial in the \(\{\theta _{jw}\}\)'s, while the right-hand side is a polynomial in the \(\{\theta _{1w}\}\)'s. Hence, it must hold that \(\pi _j \prod _{k=1}^K P({\tilde{T}}_k; \lambda _j) \equiv 0\) for \(j = 2, \ldots , J\), which is impossible when \(\pi _j > 0\). This contradicts the assumption and concludes the proof of the necessary part. \(\square \)

Proof of Lemma 2

For any pattern w in \({\mathcal {D}}_{[j]}\), we need to show that it also belongs to \({\mathcal {D}}_{[j]}^{'}\). It is easy to see that if w has the form [A], then it must belong to \({\mathcal {D}}_{[j]}^{'}\), since [A] admits only one separation. In the following, we only need to consider \(w = [e_1 e_2 \ldots e_{l_w}]\) such that \(e_1, \ldots ,e_{l_w}\) (\(l_w \ge 2\)) are distinct according to Assumption A2. Without loss of generality, we assume w belongs to Class j.

Let \(\check{O}_j\) denote the longest sentence generated by \(\mathcal D_j\) satisfying that (1) each event belongs to \({\mathcal {E}} - \{e_1\}\); (2) the length of \(\check{O}_j\) is at least \(n_{j,e_1}\) (see the definition of \(n_{j,e_1}\) in Assumption A1). Notice that \(\check{O}_j\) may not be unique; we need only consider one of them. Let \(\grave{O}_j\) be a sentence of the form \((Q_1 Q_2)\) such that (1) \(Q_1\) contains \(\check{O}_j\) as a subsentence; (2) each event in \(Q_1\) belongs to \({\mathcal {E}} - \{e_1\}\); (3) \(Q_1\) is as long as possible; (4) the first event of \(Q_2\) is \(e_1\) (\(Q_2\) is empty if no such event exists); (5) \(Q_2\) is as short as possible. Again, \(\grave{O}_j\) may not be unique; we need only consider one of them. Next, we consider the decomposition of the sentence \(O_j = (\grave{O}_j w)\). With the aid of \(O_j\), we can show that w must belong to \(\mathcal D_{[j]}^{'}\).

Since \({\mathcal {O}}_{[j]} = {\mathcal {O}}_{[j]}^{'}\), we know that \(O_j \in {\mathcal {O}}_{[j]}^{'}\). Without loss of generality, we also assume that \(O_j\) appears in the j-th class of model \({\mathcal {P}}^{'}\). We claim that, in \({\mathcal {D}}^{'}\), \(O_j\) only admits separations of the form \(\{\mathcal S(\grave{O}_j), w\}\). (Here, \({\mathcal {S}}(O)\) denotes one realization of a separation of O.) If not, one of the following cases must occur.

Case 1: There is a separation \(S \in {\mathcal {F}}(O_j)\) such that \(S = \{{\mathcal {S}}(R_1), w_1\}\), where \(w_1\) is contained in w. By Assumption A2, we know that \(w_1\) does not contain \(e_1\). Then, we consider the sentence \((w_1 R_1)\). It is in \({\mathcal {O}}_{[j]}^{'}\), so it must be in \({\mathcal {O}}_{[j]}\). By Assumption A1, we know that \((w_1 R_1)\) must belong to \({\mathcal {O}}_j\), since it contains \(\check{O}_j\). Then, \((w_1 R_1)\) can be written in the form \((Q_1 Q_2)\) with a longer \(Q_1\). This contradicts the definition of \(\grave{O}_j\). Therefore, Case 1 cannot happen.

Case 2: There is a separation \(S \in {\mathcal {F}}(O_j)\) such that \(S = \{{\mathcal {S}}(R_2), w_2\}\) where \(w_2\) contains w. We further consider the following four situations.

  1. 2.a

    Suppose \(R_2\) contains \(\check{O}_j\) as its sub-sentence and does not contain the event \(e_1\). Since \({\mathcal {O}}_{[j]} = {\mathcal {O}}_{[j]}^{'}\), we know that \(R_2\) must belong to \({\mathcal {O}}_{[j]}\). By Assumption A1, \(R_2\) is also in \({\mathcal {O}}_j\). This leads to a contradiction since \(R_2\) is longer than \(\check{O}_j\).

  2. 2.b

    Suppose \(R_2\) contains \(\check{O}_j\) as its sub-sentence and contains the event \(e_1\). \(R_2\) is also in \({\mathcal {O}}_j\) for the same reason as before. This time, \(R_2\) can be written in the form \((Q_1 Q_2)\) with a shorter \(Q_2\), which contradicts the definition of \(\grave{O}_j\).

  3. 2.c

    Suppose \(R_2\) is contained in \(\check{O}_j\). If \(R_2\) were the longest sentence generated by \({\mathcal {D}}_{j}^{'}\) without \(e_1\), then by Assumption A1 we would know that \(\check{O}_j\) also belongs to this class. Therefore, \(R_2\) is not the longest such sentence. This implies that there exists a pattern \({\tilde{w}}\) in \({\mathcal {D}}^{'}_j\), with events in \({\mathcal {E}}_w\), that does not contribute to \(R_2\). Therefore, the sentence \(({\tilde{w}} R_2 w_2)\), which contains \(\check{O}_j\), must belong to \({\mathcal {O}}_j\). By the same argument as before, it can also be written in the form \((Q_1 Q_2)\) with a longer \(Q_1\). This contradicts the definition of \(\grave{O}_j\).

  4. 2.d

    Suppose \(R_2 = \check{O}_j\). If \(Q_2\) is not empty, then \(w_2\) contains two \(e_1\)'s, which contradicts Assumption A2. If \(Q_2\) is empty, then \(w_2 = w\) exactly.

Hence, we conclude the proof of this lemma. \(\square \)

Appendix C: Parameter Estimation in NB-LTDM

In the inner part of the NB-LTDM algorithm, we adopt a nonparametric Bayes method to avoid selecting a single finite number of mixture components J. We therefore replace the finite mixture with an infinite mixture, that is,

$$\begin{aligned} P(e_{1:N}, t_{1:N})= & {} P(K | \kappa ) \sum ^\infty _{j=1} v_j \prod _{k=1}^K P(E_k, T_k| j), \quad i=1,\ldots ,m, \\ \sum _{j=1}^{\infty } v_j= & {} 1. \end{aligned}$$

For the choice of prior, we specify \(\theta _{jw} \sim \text {Unif}(0,1)\), \(\kappa \sim \mathrm {Ga}(1,1)\), \(\lambda _j \sim \mathrm {Ga}(1,1)\) and \(v = (v_1, \ldots ) \sim Q\) where Q corresponds to a Dirichlet process. The stick-breaking representation, introduced by Sethuraman (1994), implies that \(v_j = V_j \prod _{l < j} (1- V_l)\) with \(V_j \sim \text {Beta}(1,\alpha )\) independently for \(j=1,\ldots ,\infty \), where \(\alpha >0\) is a precision parameter characterizing Q.
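
For concreteness, the following sketch draws a truncated approximation to the stick-breaking weights \(v_j = V_j \prod _{l < j} (1- V_l)\); the truncation level and the value of \(\alpha \) are illustrative assumptions.

```python
import numpy as np

def stick_breaking_weights(alpha, truncation, rng):
    """Draw truncated stick-breaking weights v_j = V_j * prod_{l<j}(1 - V_l)."""
    V = rng.beta(1.0, alpha, size=truncation)     # V_j ~ Beta(1, alpha)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))
    return V * remaining                          # approximately sums to one

rng = np.random.default_rng(1)
v = stick_breaking_weights(alpha=1.0, truncation=50, rng=rng)
print(v[:5], v.sum())
```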

Hence, our nonparametric Bayesian latent theme dictionary model can be written in the following hierarchical form:

$$\begin{aligned}&{ S_{ik},T_{ik}| z_i = j, \{\theta _{jw}\}, \{\lambda _j\} } \sim \frac{1}{n_{S_{ik}}!} \prod _{w=1}^{v_D} [\theta _{jw}^{\mathbf{1}_{\{w \in S_{ik}\}}} (1 - \theta _{jw})^{{\mathbf {1}}_{\{w \not \in S_{ik}\}}}] \cdot \prod _{u = 1}^{n_{ik}} [\lambda _{j}e^{-\lambda _{j} {\tilde{t}}_{ik,u}}] \\&E_{ik},T_{ik}| z_i, \{\theta _{jw}\}, \{\lambda _j\} \sim P(E_{ik}, T_{ik}| z_i) = \sum _{S \in {\mathcal {F}}(E_{ik})} P(S_{ik}, T_{ik}|z_i, \{\theta \}, \{\lambda \}), \quad i \in [m]; k \in [K_i]\\&z_i {\mathop {\sim }\limits ^{iid}} \sum ^\infty _{j=1} V_j \prod _{l < j} (1 - V_l) \delta _j, \quad i=1,\ldots ,m \\&V_j {\mathop {\sim }\limits ^{iid}} \text {Beta}(1,\alpha ), \quad j=1,\ldots ,\infty \\&\theta _{hw} {\mathop {\sim }\limits ^{iid}} \text {Unif}(0,1), \quad j=1,\ldots ,\infty , w=1,\ldots ,v_D \\&{ \lambda _{j} } {\mathop {\sim }\limits ^{iid}} \text {Ga}(1,1), \quad j = 1, \ldots , \infty \\&\alpha \sim \text {Ga}(1,1) \\&\kappa \sim \text {Ga}(1,1). \end{aligned}$$

where \(\delta _j(\cdot )\) is the Dirac measure at j.

We use a data augmentation Gibbs sampler (Walker 2007) to update all parameters and latent variables. Specifically, we introduce a vector of latent variables \(u=(u_1, \ldots , u_m)\), where \(u_i {\mathop {\sim }\limits ^{iid}} U[0,1]\). The full likelihood becomes

$$\begin{aligned}&\prod _{i=1}^m \bigg \{ \big \{ \prod _{k = 1}^{K_i} {\mathbf {1}}_{\{ u_i < v_{z_i}\}} P(S_{ik},T_{ik}|z_i, \{\theta _{jw}\}, \{\lambda _j\}) {\mathbf {1}}_{\{S_{ik} \in {\mathcal {F}}(E_{ik})\}} \big \} \frac{\kappa ^{K_i} \exp \{-\kappa \}}{K_i !} \bigg \} \\&\quad \cdot \prod _{j} \big \{ f_{Be}(V_j; \alpha ) f_{Ga}(\lambda _j) \prod _{w = 1}^{v_D} f_u(\theta _{jw}) \big \} f_{Ga}(\kappa ) f_{Ga}(\alpha ), \end{aligned}$$

where \(f_{Ga}\) is the density of \(\mathrm {Ga}(1,1)\), \(f_u\) is the density of \(\mathrm {Unif}(0,1)\) and \(f_{Be}(\cdot ;\alpha )\) is the density of \(\mathrm {Beta}(1, \alpha )\).

Then, the Gibbs sampler iterates through the following steps; a schematic code sketch of the conjugate updates is given after the list:

  1. 1.

    Update \(u_i\), for \(i=1,\ldots ,m\), by sampling from \(U(0,v_{z_i})\).

  2. 2.

    Update \(\theta _{hw}\), for \(h=1,\ldots ,j^*\), \(w=1,\ldots , v_D\), by sampling from

    $$\begin{aligned} \text {Beta} \bigg ( \sum _{i:z_i = h} \sum ^{K_i}_{k=1} 1(w \in S_{ik}) + 1, \sum _{i:z_i = h} \sum ^{K_i}_{k=1} 1(w \notin S_{ik} ) + 1 \bigg ). \end{aligned}$$
  3. 3.

    Update \(\lambda _j\), for \(j = 1, \ldots , j^{*}\) (\(j^{*} = \max \{z_i\}\)), by sampling from

    $$\begin{aligned} \text {Ga}\left( 1 + \sum _{i:z_i = j} \sum _{k=1}^{K_i} l_{T_{ik}}, 1 + \sum _{i:z_i = j} \sum _{k=1}^{K_i} \sum _{u = 1}^{l_{T_{ik}}} T_{ik,u}\right) . \end{aligned}$$
  4. 4.

    Update \(V_j\), for \(j=1,\ldots ,j^*\), by sampling from \(\text {Beta}(1,\alpha )\) truncated to fall into the interval

    $$\begin{aligned} \bigg [\max _{i:z_i = j} \frac{u_i}{\prod _{l< j} (1-V_l)}, 1- \max _{i: z_i > j} \frac{u_i}{V_{z_i} \prod _{l < z_i, l \ne j}(1-V_l)} \bigg ]. \end{aligned}$$
  5. 5.

    Update \(z_i\), for \(i=1,\ldots ,m\), by sampling from

    $$\begin{aligned} P(z_i&= j|e_{1:N_i},t_{1:N_i},{\mathbf {S}}_i, \{\theta _{jw}\},V,u,z_{-i})\\&= \frac{1(j \in A_i) \prod ^{K_i}_{k=1} P(S_{ik},T_{ik}|\{\theta _{jw}\}, \lambda _j) }{\sum _{l \in A_i} \prod ^{K_i}_{k=1} P(S_{ik}, T_{ik}|\{\theta _{lw}\}, \lambda _l) }1(S_{ik} \in {\mathcal {F}}(E_{ik})), \end{aligned}$$

    where \(A_i := \{ j: v_j > u_i\}\). To identify the elements in \(A_1,\ldots ,A_m\), first update \(V_j\) for \(j=1,\ldots ,\tilde{j}\), where \(\tilde{j}\) is the smallest value satisfying

    $$\begin{aligned} \sum ^{\tilde{j}}_{j=1} v_j > 1 - \min \{u_1,\ldots , u_m \}. \end{aligned}$$
    (A29)

    Therefore, \(1,\ldots ,\tilde{j}\) are the possible values for \(z_i\). Note that we have already updated \(V_j\) for \(j=1,\ldots ,j^*\). Therefore, we first check if \(j^*\) satisfies (A29). If yes, then we do not have to sample more; otherwise sample \(V_j \sim \text {Beta}(1,\alpha )\) for \(j=j^*+1,\ldots \) until (A29) is satisfied. In this case, we also have to sample \(\theta _{j w}\) from U(0, 1) and \(\lambda _j\) from \(\text {Ga}(1,1)\) for \(j = j^*+1,\ldots , \tilde{j}\) and \(w =1,\ldots ,v_D\) in order to compute \(P(S_{ik}, T_{ik} | \theta _j, \lambda _j)\) for \(j=j^* +1,\ldots ,\tilde{j}\).

  6. 6.

    Update \(S_{ik}\), for \(i=1,\ldots ,m\), \(k=1,\ldots ,K_i\), by sampling from

    $$\begin{aligned} { P(S_{ik} = S|E_{ik}, \theta _{z_i} ) } = \frac{P(S_{ik} = S |\theta _{z_i})}{ \sum _{S' \in {\mathcal {F}}(E_{ik})} P(S_{ik} = S' | \theta _{z_i}) } 1(S \in {\mathcal {F}}(E_{ik})). \end{aligned}$$
  7. 7.

    Update \(\kappa \), which follows gamma distribution \(\text {Ga}(1 + \sum _i K_i, 1 + m)\).

  8. 8.

    Sample \(\alpha \) from posterior \(\text {Ga}(1 + j^{*}, 1 - \sum _{j=1}^{j^{*}} \log (1 - V_{j}))\).
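
A schematic skeleton of one sweep of this sampler is sketched below. It is not a full implementation of NB-LTDM: the slice variables, the stick-breaking updates, the class labels \(z_i\), and the separations \(S_{ik}\) (Steps 1, 4, 5, and 6) are omitted, the data layout is a simplifying assumption, and only the conjugate updates for \(\theta _{jw}\), \(\lambda _j\), and \(\kappa \) (Steps 2, 3, and 7) are written out.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(data, state, v_D):
    """One schematic sweep over the conjugate updates of the NB-LTDM sampler.

    data:  list over individuals i; data[i] is a list of (S_ik, T_ik) pairs, where
           S_ik is a set of pattern indices and T_ik is an array of gap times.
    state: dict with 'z' (class labels), 'theta' (J* x v_D array), 'lambda' (length J*),
           and 'kappa'; the updates of u_i, V_j, z_i, and S_ik are omitted here.
    """
    z, theta, lam = state['z'], state['theta'], state['lambda']
    J_star = theta.shape[0]

    # Step 2: Beta posterior for theta_{jw}, counting how often pattern w is
    # present or absent in the current separations of class-j individuals.
    for j in range(J_star):
        members = [i for i, zi in enumerate(z) if zi == j]
        for w in range(v_D):
            present = sum(w in S for i in members for (S, _) in data[i])
            absent = sum(w not in S for i in members for (S, _) in data[i])
            theta[j, w] = rng.beta(present + 1, absent + 1)

    # Step 3: Gamma posterior for lambda_j from the exponential gap times.
    for j in range(J_star):
        times = [T for i, zi in enumerate(z) if zi == j for (_, T) in data[i]]
        count = sum(t.size for t in times)
        total = sum(t.sum() for t in times)
        lam[j] = rng.gamma(1 + count, 1.0 / (1 + total))

    # Step 7: Gamma posterior for kappa (Poisson number of sentences per person).
    K_total = sum(len(sentences) for sentences in data)
    state['kappa'] = rng.gamma(1 + K_total, 1.0 / (1 + len(data)))
    return state
```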

Appendix D: Estimated Dictionary in Traffic Item

In the “Traffic” item, the NB-LTDM algorithm found a dictionary \(\hat{{\mathcal {D}}}\) with \({\hat{v}}_D = 82\). For each pattern w in \(\hat{{\mathcal {D}}}\), we classified the six classes into two clusters based on their pattern probabilities \(\theta _{jw} (j = 1, \ldots , 6)\). Classes with high pattern probabilities are grouped together and shown in Table 11; a simple clustering rule of this kind is sketched after the table.

Table 11 Identified patterns from the “Traffic” item. Column “Class” gives the labels of the latent classes with a high corresponding pattern probability.
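
The clustering rule used to form the two groups is not spelled out above; one simple possibility, shown here purely as an illustrative assumption, is a one-dimensional 2-means split of the six pattern probabilities for each pattern.

```python
import numpy as np

def high_probability_classes(theta_w, n_iter=50):
    """Split classes into high/low groups for one pattern via 1-D 2-means.

    theta_w: length-J array of pattern probabilities theta_{jw} for a fixed pattern w.
    Returns the indices of the classes in the high-probability group.
    """
    centers = np.array([theta_w.min(), theta_w.max()], dtype=float)
    labels = np.zeros(len(theta_w), dtype=int)
    for _ in range(n_iter):
        labels = np.abs(theta_w[:, None] - centers[None, :]).argmin(axis=1)
        for c in (0, 1):
            if np.any(labels == c):
                centers[c] = theta_w[labels == c].mean()
    high = int(np.argmax(centers))
    return np.where(labels == high)[0]

theta_w = np.array([0.05, 0.10, 0.70, 0.65, 0.08, 0.72])   # hypothetical values
print(high_probability_classes(theta_w))                   # classes 2, 3, 5
```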

Appendix E: Latent Class Model and Theme Dictionary Model

In this section, we briefly recall two popular models, the latent class model (LCM; Gibson 1959) and the theme dictionary model (TDM; Deng et al. 2014), which are related to the proposed LTDM.

Widely adopted in the biostatistics, psychometrics, and machine learning literature (e.g., Goodman 1974; Vermunt and Magidson 2002; Templin et al. 2010), the LCM relates a set of observed variables to a discrete latent variable, which is often used to indicate the class label. The LCM assumes a local independence structure, i.e.,

$$\begin{aligned} P(X_1, \ldots , X_K | Z) = \prod _{k = 1}^K P(X_k | Z), \end{aligned}$$

where \(X_1, \ldots , X_K\) are K observed variables and Z is a discrete latent variable with density \(P(Z = j) = \pi _j, ~ j = 1, \ldots J\). Thus, the joint (marginal) distribution of \(X_1, \ldots , X_K\) takes form

$$\begin{aligned} P(X_1, \ldots , X_K) = \sum _{j = 1}^J \bigg \{ \pi _j \prod _{k = 1}^K P(X_k | Z = j)\bigg \}. \end{aligned}$$
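
For concreteness, this marginal can be evaluated directly by summing over the latent class; the sketch below uses small hypothetical probability tables.

```python
import numpy as np

pi = np.array([0.4, 0.6])                      # P(Z = j), hypothetical weights
# P(X_k = x | Z = j) for K = 2 binary items, indexed as cond[k, j, x].
cond = np.array([[[0.8, 0.2], [0.3, 0.7]],
                 [[0.6, 0.4], [0.1, 0.9]]])

def lcm_marginal(x, pi, cond):
    # P(x_1, ..., x_K) = sum_j pi_j * prod_k P(x_k | Z = j)
    per_class = np.prod(cond[np.arange(len(x)), :, x], axis=0)
    return np.sum(pi * per_class)

print(lcm_marginal((1, 0), pi, cond))   # 0.4*0.2*0.6 + 0.6*0.7*0.1 = 0.09
```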

The TDM (Deng et al. 2014) typically handles observations consisting of words/events and can be used to identify associated event patterns. The problem of finding event associations is also known as market basket analysis (Piatetsky-Shapiro 1991; Hastie et al. 2005; Chen et al. 2005). Under the TDM, a pattern is a combination of several events. A collection of distinct patterns forms a dictionary, \({\mathcal {D}}\). An observation, E, is a set of events. In the TDM, we observe E but do not know which patterns it consists of. In other words, E could be split into different possible partitions of patterns; each such partition is called a separation of E. The collection of multiple observations \({\mathbf {E}} = \{E_1, \ldots , E_K\}\) forms a document. The TDM does not take event ordering into account. For example, \(E = (A, B, C)\) is an observation with three events, A, B, and C. The TDM treats \(E^{'} = (C, B, A)\) as the same observation as E. Consequently, patterns are also unordered. For instance, the patterns [A B] and [B A] are viewed as the same. The TDM postulates that a pattern appears in an observation at most once. Let \(\theta _w \in [0,1]\) be the probability that pattern w appears in an observation. The probability distribution of one separation S for observation E is defined to be

$$\begin{aligned} P(S) = \prod _{w \in {\mathcal {D}}} \theta _w^{{\mathbf {1}}_{ \{ w \in S \} }} (1 - \theta _w)^{{\mathbf {1}}_{ \{ w \notin S \} }}. \end{aligned}$$
(30)

Since separation S is not observed, the marginal probability of E is

$$\begin{aligned} P(E) = \sum _{S \in {\mathcal {F}}(E)} P(E, S) = \sum _{S \in \mathcal F(E)} \prod _{w \in {\mathcal {D}}} \theta _w^{{\mathbf {1}}_{ \{ w \in S \} }} (1 - \theta _w)^{{\mathbf {1}}_{ \{ w \notin S \} }}, \end{aligned}$$

where \({\mathcal {F}}(E)\) is the set of all possible separations for E. Furthermore, observations are assumed to be independent, i.e., for \({\mathbf {E}} = \{E_1, \ldots , E_K\}\),

$$\begin{aligned} P({\mathbf {E}}) = \prod _{k=1}^K P(E_k). \end{aligned}$$
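
To illustrate the separation-based likelihood, the sketch below computes \(P(E)\) for a toy dictionary by brute-force enumeration of separations; the dictionary, the probabilities, and the enumeration routine are illustrative assumptions rather than the authors' implementation.

```python
# Toy dictionary of unordered patterns (tuples of events) with probabilities theta_w.
theta = {('A',): 0.3, ('B',): 0.4, ('C',): 0.2, ('A', 'B'): 0.5}

def separations(events):
    """Enumerate all partitions of the event list into dictionary patterns."""
    if not events:
        yield []
        return
    first = events[0]
    for w in theta:
        if first in w and all(events.count(e) >= w.count(e) for e in w):
            rest = list(events)
            for e in w:
                rest.remove(e)
            for tail in separations(rest):
                if w not in tail:          # each pattern appears at most once
                    yield [w] + tail

def prob_E(events):
    # P(E) = sum over separations S of prod_w theta_w^{1{w in S}} (1 - theta_w)^{1{w not in S}}
    total = 0.0
    for S in separations(sorted(events)):
        p = 1.0
        for w, theta_w in theta.items():
            p *= theta_w if w in S else 1.0 - theta_w
        total += p
    return total

# E = (A, B) admits two separations, {[A], [B]} and {[A B]}.
print(prob_E(['A', 'B']))   # 0.3*0.4*0.8*0.5 + 0.7*0.6*0.8*0.5 = 0.216
```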


Cite this article

Fang, G., Ying, Z. Latent Theme Dictionary Model for Finding Co-occurrent Patterns in Process Data. Psychometrika 85, 775–811 (2020). https://doi.org/10.1007/s11336-020-09725-2
