Abstract
Process data, which are temporally ordered sequences of categorical observations, are of recent interest due to their increasing abundance and the desire to extract useful information from them. A process is a collection of time-stamped events of different types, recording how an individual behaves in a given time period. Process data are too complex, in terms of size and irregularity, for classical psychometric models to be directly applicable, and consequently new approaches to modeling and analysis are needed. We introduce herein a latent theme dictionary model for processes that identifies co-occurrent event patterns and individuals with similar behavioral patterns. Theoretical properties are established under certain regularity conditions for the likelihood-based estimation and inference. A nonparametric Bayes algorithm using the Markov chain Monte Carlo method is proposed for computation. Simulation studies show that the proposed approach performs well in a range of situations. The proposed method is applied to an item in the 2012 Programme for International Student Assessment with interpretable findings.
References
Aalen, O., Borgan, O., & Gjessing, H. (2008). Survival and event history analysis: A process point of view. Berlin: Springer.
Allison, P. D. (1984). Event history analysis: Regression for longitudinal event data (Vol. 46). California: Sage.
Allman, E., Matias, C., & Rhodes, J. (2009). Identifiability of parameters in latent structure models with many observed variables. The Annals of Statistics, 37, 3099–3132.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
Borboudakis, G., & Tsamardinos, I. (2019). Forward-backward selection with early dropping. The Journal of Machine Learning Research, 20, 276–314.
Chen, Y. (2019). A continuous-time dynamic choice measurement model for problem-solving process data. arXiv preprint arXiv:1912.11335.
Chen, Y.-L., Tang, K., Shen, R.-J., & Hu, Y.-H. (2005). Market basket analysis in a multiple store environment. Decision Support Systems, 40, 339–354.
Deng, K., Geng, Z., & Liu, J. S. (2014). Association pattern discovery via theme dictionary models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76, 319–347.
Duchateau, L., & Janssen, P. (2007). The frailty model. Berlin: Springer.
Dunson, D. B., & Xing, C. (2009). Nonparametric Bayes modeling of multivariate categorical data. Journal of the American Statistical Association, 104, 1042–1051.
Fang, G., Liu, J., & Ying, Z. (2019). On the identifiability of diagnostic classification models. Psychometrika, 84, 19–40.
Gibson, W. A. (1959). Three multivariate models: Factor analysis, latent structure analysis, and latent profile analysis. Psychometrika, 24, 229–252.
Goodman, L. A. (1974). Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika, 61, 215–231.
Goodman, M., Finnegan, R., Mohadjer, L., Krenzke, T., & Hogan, J. (2013). Literacy, numeracy, and problem solving in technology-rich environments among US adults: Results from the program for the international assessment of adult competencies 2012. First look (NCES 2014-008). ERIC.
Griffin, P., McGaw, B., & Care, E. (2012). Assessment and teaching of 21st century skills. Berlin: Springer.
Han, Z., He, Q., & von Davier, M. (2019). Predictive feature generation and selection using process data from pisa interactive problem-solving items: An application of random forests. Frontiers in Psychology, 10, 2461.
Hastie, T., Tibshirani, R., Friedman, J., & Franklin, J. (2005). The elements of statistical learning: Data mining, inference and prediction. The Mathematical Intelligencer, 27, 83–85.
He, Q., & von Davier, M. (2016). Analyzing process data from problem-solving items with n-grams: Insights from a computer-based large-scale assessment. In Handbook of research on technology tools for real-world skill development, (pp. 750–777). IGI Global.
Ishwaran, H., & Rao, J. S. (2003). Detecting differentially expressed genes in microarrays using Bayesian model selection. Journal of the American Statistical Association, 98, 438–455.
Ishwaran, H., & Rao, J. S. (2005). Spike and slab variable selection: Frequentist and Bayesian strategies. The Annals of Statistics, 33, 730–773.
Kruskal, J. B. (1977). Three-way arrays: Rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algebra and its Applications, 18, 95–138.
Liu, J., Xu, G., & Ying, Z. (2012). Data-driven learning of q-matrix. Applied Psychological Measurement, 36, 548–564.
Liu, J., Xu, G., & Ying, Z. (2013). Theory of the self-learning q-matrix. Bernoulli: Official Journal of the Bernoulli Society for Mathematical Statistics and Probability, 19, 1790.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. UK: Routledge.
OECD. (2014a). Assessing problem-solving skills in PISA 2012.
OECD. (2014b). PISA 2012 technical report. (Available at) http://www.oecd.org/pisa/pisaproducts/pisa2012technicalreport.htm.
OECD. (2016). PISA 2015 results in focus. (Available at) https://www.oecd.org/pisa/pisa-2015-results-in-focus.pdf.
Piatetsky-Shapiro, G. (1991). Discovery, analysis, and presentation of strong rules. Knowledge discovery in databases, 229–238.
Qiao, X., & Jiao, H. (2018). Data mining techniques in analyzing process data: A didactic. Frontiers in Psychology, 9, 2231.
Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, 4, 639–650.
Templin, J., Henson, R. A., et al. (2010). Diagnostic measurement: Theory, methods, and applications. New York: Guilford Press.
Tibshirani, R. (1997). The Lasso method for variable selection in the Cox model. Statistics in Medicine, 16, 385–395.
van der Linden, W. J. (2006). A lognormal model for response times on test items. Journal of Educational and Behavioral Statistics, 31, 181–204.
Vermunt, J. K., & Magidson, J. (2002). Latent class cluster analysis. Applied Latent Class Analysis, 11, 89–106.
Walker, S. G. (2007). Sampling the Dirichlet mixture model with slices. Communications in Statistics–Simulation and Computation, 36, 45–54.
Xu, G., et al. (2017). Identifiability of restricted latent class models with binary responses. The Annals of Statistics, 45, 675–707.
Xu, H., Fang, G., Chen, Y., Liu, J., & Ying, Z. (2018). Latent class analysis of recurrent events in problem-solving items. Applied Psychological Measurement, 42, 478.
Xu, H., Fang, G., & Ying, Z. (2019). A latent topic model with Markovian transition for process data. arXiv preprint arXiv:1911.01583.
Appendices
Appendix A: Conditions C1 and C2
We provide the exact statements of Conditions C1 and C2 in this appendix.
-
C1.a
For each equivalence class [j] with size larger than 1, there exists a partition \(\{{\mathcal {I}}_{[j],1}, \mathcal I_{[j],2}, {\mathcal {I}}_{[j],3} \}\) of 1-grams such that for any \(e_1 \in {\mathcal {I}}_{[j],1}\), \(e_2 \in {\mathcal {I}}_{[j],2}\) and \(e_3 \in {\mathcal {I}}_{[j],3}\), the sentence \(E=(e_{l}, e_k), l \ne k \in \{1,2,3\}\), and the sentence \(E = (e_1, e_2, e_3)\) admit only one separation. The cardinalities of the three sets satisfy \(|\mathcal I_{[j],1}|\), \(|{\mathcal {I}}_{[j],2}|\), \(|{\mathcal {I}}_{[j],3}| \ge |[j]|\), where |[j]| is the cardinality of equivalence class [j].
-
C1.b
Define T-matrices \(T_{[j],1}\), \(T_{[j],2}\) and \(T_{[j],3}\) such that \(T_{[j],k} [l,j_1] = \frac{\theta _{j_1 l}}{1 - \theta _{j_1 l}}\) for \(e_l \in {\mathcal {I}}_{[j],k}\), \(j_1 \in [j]\), and \(k = 1, 2 ~\text {or}~ 3\). Matrices \(T_{[j],1}\), \(T_{[j],2}\) and \(T_{[j],3}\) have full column rank.
-
C2.a
For each equivalence class [j] with size larger than 1 and for any l-gram \(w = [e_1~e_2 \ldots ~ e_l]\) with \(l \ge 2\), there exists a subset of 1-grams \({\mathcal {D}}_{[j],w}\) such that (1) for any \(e \in {\mathcal {D}}_{[j],w}\), the sentence \(E = (e_1, \ldots e_l, e)\) does not admit other separations containing an \((l+1)\)-gram or an l-gram other than w; (2) the cardinality of \({\mathcal {D}}_{[j],w}\) is greater than or equal to |[j]|.
-
C2.b
Define matrix \(T_{[j],w}\) such that \(T_{[j],w} [e,j_1] = \frac{\theta _{j_1 e}}{1 - \theta _{j_1 e}}\) for \(e \in \mathcal D_{[j],w}\) and \(j_1 \in [j]\). Matrix \(T_{[j],w}\) has full column rank.
Conditions C1 and C2 pertain to the dictionary and parameter structures. Specifically, Condition C1.a places restrictions on 1-grams so that not all combinations of 1-grams are considered as patterns, which ensures that the pattern frequencies can be identified. It is similar to the sufficient conditions for identifiability of diagnostic classification models (DCMs; Xu et al. 2017; Fang et al. 2019), which require that all items can be divided into three non-overlapping item sets. Here, a 1-gram can be viewed as the counterpart of an item in DCMs. Condition C2.a requires that each l-gram does not overlap with other patterns to some extent and thus can be identified. Conditions C1.b and C2.b require that examinees from different groups have different pattern frequencies.
The T-matrices here share similar ideas with those in Liu et al. (2012, 2013). We use the following example to illustrate this idea.
Example 2
Consider a 2-class model with \(\lambda _1 = \lambda _2\) and \(\mathcal D = \{ [a], [b], [c], [d], [e], [f], [a~b], [c~d], [e~f]\}\). Pattern probability \(\{\theta _{jw}\}\) is
We claim this setting is identifiable.
Notice that Classes 1 and 2 are in the same equivalence class [1]. We can construct \({\mathcal {I}}_{[1],1} = \{[a], [b]\}\), \(\mathcal I_{[1],2} = \{[c],[d]\}\), and \({\mathcal {I}}_{[1],3} = \{[e],[f]\}\). It is easy to check that their T-matrices satisfy
Hence, Condition C1 is satisfied, since they all have full column rank. For \(w = [a~b]\), we can set \({\mathcal {D}}_{[1],w} = \{[c],[d]\}\) by checking that both sentences (a, b, c) and (a, b, d) admit only one separation. Its T-matrix is
which also has full column rank. Similarly, the conditions can be checked for [c d] and [e f]. Thus, Condition C2 is also satisfied. Furthermore, Assumption A1 holds since both classes contain all sentences in \({\mathcal {O}}\). Lastly, Assumption A2 obviously holds.
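The rank checks in Example 2 are easy to carry out numerically. The sketch below builds the block T-matrices of Condition C1.b from a hypothetical table of 1-gram probabilities (the paper's actual \(\{\theta_{jw}\}\) values are not reproduced here) and verifies full column rank; the partition follows Example 2.

```python
import numpy as np

# A minimal numerical check of Condition C1.b in the spirit of Example 2.
# The theta values below are hypothetical (the paper's probability table is
# not reproduced here); rows index the two classes and columns index the
# 1-grams [a], [b], [c], [d], [e], [f].
theta = np.array([
    [0.2, 0.3, 0.4, 0.5, 0.6, 0.7],  # class 1
    [0.7, 0.6, 0.5, 0.4, 0.3, 0.2],  # class 2
])

def t_matrix(cols):
    """T[l, j] = theta_{j l} / (1 - theta_{j l}) for 1-grams l in the block."""
    odds = theta[:, cols] / (1.0 - theta[:, cols])  # shape (J, |block|)
    return odds.T                                   # rows: 1-grams, cols: classes

# Partition into I_{[1],1} = {[a],[b]}, I_{[1],2} = {[c],[d]}, I_{[1],3} = {[e],[f]}
for k, cols in enumerate(([0, 1], [2, 3], [4, 5]), start=1):
    rank = np.linalg.matrix_rank(t_matrix(cols))
    assert rank == theta.shape[0], f"T_{k} is rank deficient"
```

For generic (non-degenerate) probability tables the odds matrices have full column rank; the condition fails only when two classes have proportional odds within a block.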
Appendix B: Proofs
To prove the main theoretical results, we start with two lemmas that play key roles in dictionary and parameter identifiability. The proof of Lemma 2 is presented at the end of this section.
Lemma 1
(Kruskal 1977) Suppose \(A,B,C,{\bar{A}},{\bar{B}},{\bar{C}}\) are six matrices with R columns. Suppose there exist integers \(I_0\), \(J_0\), and \(K_0\) such that \(I_0+J_0+K_0 \ge 2R+2\), every \(I_0\) columns of A are linearly independent, every \(J_0\) columns of B are linearly independent, and every \(K_0\) columns of C are linearly independent. Define the triple product to be the three-way array \([A,B,C] = (d_{ijk})\) where \(d_{ijk}=\sum _{r=1}^{R} a_{ir} b_{jr} c_{kr}\). Suppose that the two triple products are equal, \([A,B,C]=[{\bar{A}},{\bar{B}},{\bar{C}}]\). Then, there exists a column permutation matrix P such that \({\bar{A}}=AP\Lambda , {\bar{B}}=BPM, {\bar{C}}=CPN\), where \(\Lambda , M, N\) are diagonal matrices with \(\Lambda MN\) equal to the identity. A column permutation matrix is right-multiplied to a given matrix to permute its columns.
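The triple product and the permutation/scaling ambiguity in Kruskal's theorem can be illustrated numerically. In this sketch (matrix sizes and values are arbitrary illustrations, not from the paper), applying a common column permutation P and diagonal scalings \(\Lambda, M, N\) with \(\Lambda M N = I\) leaves the triple product unchanged:

```python
import numpy as np

# d_{ijk} = sum_r a_{ir} b_{jr} c_{kr}, the three-way array of Lemma 1.
def triple_product(A, B, C):
    return np.einsum('ir,jr,kr->ijk', A, B, C)

rng = np.random.default_rng(0)
R = 3
A = rng.uniform(0.1, 1.0, (4, R))
B = rng.uniform(0.1, 1.0, (5, R))
C = rng.uniform(0.1, 1.0, (6, R))

# Same column permutation P for all three matrices, plus diagonal scalings
# Lambda, M, N chosen so that Lambda M N = I.
P = np.eye(R)[:, [2, 0, 1]]
lam = np.array([2.0, 0.5, 4.0])
mu = np.array([1.0, 4.0, 0.25])
nu = 1.0 / (lam * mu)                 # ensures Lambda M N = identity
A_bar = A @ P @ np.diag(lam)
B_bar = B @ P @ np.diag(mu)
C_bar = C @ P @ np.diag(nu)

# The triple products coincide: (A, B, C) can only be recovered up to this
# common permutation and the compensating column scalings.
assert np.allclose(triple_product(A, B, C), triple_product(A_bar, B_bar, C_bar))
```

This is exactly the ambiguity that the full-rank conditions in C1.b and C2.b, together with the rows of ones used later in the proof of Theorem 1, pin down.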
Lemma 2
Under Assumptions A1 and A2, it holds that \({\mathcal {O}}_{[j]} = {\mathcal {O}}_{[j]}^{'}\) if and only if \({\mathcal {D}}_{[j]} = {\mathcal {D}}_{[j]}^{'}\).
Here, we recall that \({\mathcal {O}}_{[j]}\) is the observed sentence set generated from equivalence class [j] and \({\mathcal {D}}_{[j]}\) is the dictionary consisting of patterns from equivalence class [j].
Proof of Theorem 1
For every model \({\mathcal {P}} = ({\mathcal {D}}, \{\theta _{jw}\}, \{\lambda _j\}, \pi , \kappa ) \in {\mathfrak {P}}^0\), we need to show that if there exists another model \({\mathcal {P}}^{'}\) such that
then it must hold that \({\mathcal {P}} = {\mathcal {P}}^{'}\).
We prove it through the following steps. (1) \(\kappa \)-identifiability: we show that the parameter \(\kappa \) is identifiable. (2) \(\lambda \)-identifiability: we prove that \(\lambda _{[j]} = \lambda _{[j]}^{'}\) for any equivalence class [j]. (3) Dictionary identifiability: we show that \({\mathcal {O}} = \mathcal O^{'}\) implies \({\mathcal {D}} = {\mathcal {D}}^{'}\). (4) \(\{\theta \}, \pi \)-identifiability: we show that \(\{\theta _{jw}\} \overset{p}{=} \{\theta _{jw}^{'}\}\) and \(\pi \overset{p}{=} \pi ^{'} \).
For \(\kappa \)-identifiability, we can see that the marginal distribution of \(e_{1:N}\) and \(t_{1:N}\) is
By taking \(K = 0\), we have that \(P(e_{1:N}, t_{1:N}) = P(K = 0)\). Then, it must hold that
This implies that \(\kappa = \kappa ^{'}\).
For \(\lambda \)-identifiability, we take \(K = 1\), an event sentence \(E = (e)\), and \({\tilde{T}} = (t)\). Then, (A1) becomes
By \(\kappa \)-identifiability, we further have
after simplification. Letting \(t \rightarrow \infty \), we must have \(\lambda _{[j_0]} = \lambda _{[j_0]}^{'}\), where \([j_0]\) is the equivalence class with the minimum lambda value. Hence, we also have \(\sum _{j \in [j_0]} \pi _j \theta _{je} = \sum _{j \in [j_0]} \pi _j^{'} \theta _{je}^{'}\). Then, (A4) becomes
By a similar strategy, we can show that \(\lambda _{[j]} = \lambda _{[j]}^{'}\) for every equivalence class [j]. This gives \(\lambda \)-identifiability.
For the dictionary identifiability, we point out that its proof is not covered in Deng et al. (2014); therefore, we take an alternative approach to prove it.
We take \(K = 1\), an arbitrary sentence \(E \in {\mathcal {O}}\), and \({\tilde{T}} = (t_1, \ldots , t_{n_E})\), where \(n_E\) is the sentence length. Then, (A1) becomes
Comparing the coefficients on both sides of (A5), we then have
This implies that \({\mathcal {O}}_{[j]} = {\mathcal {O}}_{[j]}^{'}\). By Lemma 2, we then have \({\mathcal {D}}_{[j]} = {\mathcal {D}}_{[j]}^{'}\). Noticing that \({\mathcal {D}} = \cup _{[j]} {\mathcal {D}}_{[j]}\), this concludes the dictionary identifiability.
For \(\{\theta \}, \pi \)-identifiability, we prove it by making use of (A6). In (A6), we take \(E = (e)\) for \(e \in {\mathcal {E}}\), \(E = (e_1, e_2)\) with \(e_1, e_2\) from different partition sets, and \(E = (e_1, e_2, e_3)\) with \(e_k \in \mathcal I_{[j],k}, (k = 1,2,3)\), sequentially.
Without loss of generality, we suppose there is only one equivalence class. According to Condition C1.a, E admits only one separation, so (A6) can be simplified as
where we define \(\eta _j = \pi _j \prod _e (1 - \theta _{je})\), \(\varphi _{je} = \theta _{je}/(1 - \theta _{je})\).
In addition, if we take E to be an empty sentence, then it holds
It is not hard to write Eqs. (A7)–(A10) in terms of tensor products of matrices, that is,
where
and
Here, \(\Lambda \) is a J by J diagonal matrix with its j-th element equal to \(\eta _j\). By Condition C1.b, the matrices \({{\bar{T}}}_1\), \({{\bar{T}}}_2\) and \({{\bar{T}}}_3\) have full column rank. Therefore, by Lemma 1, we have that
where matrix P is a column permutation matrix, and A, B, C are diagonal matrices satisfying \(ABC = I\). Since the elements in the first rows of \({{\bar{T}}}_1, {{\bar{T}}}_2, {{\bar{T}}}_1^{'}, {{\bar{T}}}_2^{'}\) are all ones, it implies \(A = B = I\). Therefore, \(C = I\) as well. Thus, we have \({{\bar{T}}}_1^{'} = {{\bar{T}}}_1 P\), \({{\bar{T}}}_2^{'} = {{\bar{T}}}_2 P\) and \({\bar{T}}_3^{'} = {{\bar{T}}}_3 P\). By comparing element-wise, we can see that \(\eta = \eta ^{'}\) and \(\{\varphi _{je}\} = \{\varphi _{je}^{'}\}\) up to a label switch. Further, \(\{\theta _{je}\} \overset{p}{=} \{\theta _{je}^{'}\}\) due to the monotone relation between \(\varphi _{je}\) and \(\theta _{je}\).
In the following, we prove by induction that \(\theta _{jw}\) is identifiable, up to the same label switch, for any pattern \(w \in {\mathcal {D}}\). Suppose \(\theta _{jw}\) is identifiable when w belongs to {1-grams, ..., (k-1)-grams}. We need to show that \(\theta _{jw}\) is identifiable when w is a k-gram.
Let \({\mathcal {E}}_k\) be the sentence set including all k-grams in \({\mathcal {D}}\) and all possible combinations of a k-gram and a 1-gram that are not in \({\mathcal {D}}\). It is not hard to see that for each \(E \in {\mathcal {E}}_k\), its separation can only be a combination of m-grams (\(m < k\)) or a combination of a k-gram and a 1-gram.
By the previous results that \(\eta \overset{p}{=} \eta ^{'}\) and \(\varphi _{v} \overset{p}{=} \varphi _{v}^{'}\) for 1-grams v, we can write the above equations in the following matrix form:
where \({\tilde{\varphi }}_{w} = (\varphi _{1w} - \varphi _{1w}^{'}, \ldots , \varphi _{Jw} - \varphi _{Jw}^{'})^T\) and
Here, \(v_1, \ldots , v_J\) are J distinct 1-grams in \({\mathcal {D}}_v\). According to Conditions C2.a and C2.b, (A11) admits only one solution. Therefore, \({\tilde{\varphi }}_{w} = {\mathbf {0}}\), which implies \( \theta _{w} \overset{p}{=} \theta _{w}^{'}\). Hence, we conclude that \(\theta _{jw} = \theta _{jw}^{'}\) up to a label switch for all \(w \in {\mathcal {D}}\). This concludes the \(\{\theta \}, \pi \)-identifiability. Having completed all steps, we establish the identifiability results. \(\square \)
Proof of Theorem 2
We prove this result in two steps. In Step 1, we prove that the dictionary \({\mathcal {D}}\) can be estimated consistently. In Step 2, we show that the estimator \((\{{\hat{\theta }}_{jw}\} , \{\hat{\lambda }_j\}, {\hat{\pi }}, {\hat{\kappa }})\) is consistent. Without loss of generality, we take the compact set \({\varvec{\Theta }}_c\) = {\(\theta _{jw} \in [\eta , 1 - \eta ], \pi _j \in [\eta , 1 - \eta ], \sum _j \pi _j = 1, \lambda _j \in [c, C]\), \(\kappa \in [c, C]\)}, where \(\eta \), c, C are positive constants such that the true model parameter lies in \({\varvec{\Theta }}_c\).
Proof of Step 1 We first introduce several useful event sets. Define an event set \(\Omega _{D}\),
In other words, on \(\Omega _D\), all possible sentences are observed at least once. Define the sets \(\Omega _{E} = \{\omega \,|\, \sum _i K_i \ge m \kappa / 2 \}\), \(\Omega _K = \{\omega \,|\, K_i \le K_0, i = 1, \ldots , m\}\), \(\Omega _T = \{\omega \,|\, {\tilde{t}}_{ik,u} \le t_0 ~\text {for all}~ i,k,u\}\), and \(\Omega _b = \Omega _{E} \cap \Omega _K \cap \Omega _T \cap \Omega _D\). Next, we show that \(\Omega _b\) holds with high probability. Specifically, we derive an upper bound for \(P(\Omega _b^c)\) by decomposing \(\Omega _b^c\) into four parts, i.e.,
-
1.
By the large deviation property of the Poisson random variable \(\sum _i K_i\), we have that \(P(\Omega _{E}^c) \le \exp \{ - c m\}\), where \(c = (1 - \log 2) \kappa ^{*} / 2\).
-
2.
It is not hard to see that \(P(\Omega _K^c) \le m P(K_i > K_0) \le m \exp \{- c K_0\}\) for some constant c by using Poisson moment generating function.
-
3.
By union bound, we can get that \(P(\Omega _T^c \cap \Omega _K) \le m K_0 l_{max} P(t_{ik,u} > t_0) \le m K_0 l_{max} \exp \{- \lambda _{min} t_0\}\). Here, \(l_{max}\) is the longest sentence length and \(\lambda _{min} = \min \{\lambda _j, j = 1, \ldots , J\}\).
-
4.
We show that P(E) has a positive lower bound for every sentence \(E \in {\mathcal {O}}^{*}\). Taking an arbitrary E, we know that \(P(E) = \sum _{z = 1}^J \sum _{S \in {\mathcal {F}}(E)} P(S|z)\). Hence, \(P(E) \ge \prod _{w} (\theta _{jw}^{*})^{{\mathbf {1}}\{w \in S\}} (1 - \theta _{jw}^{*})^{{\mathbf {1}}\{w \notin S\}}\) for \(S \in {\mathcal {F}}(E)\) and j such that \(E \in {\mathcal {O}}_j\). Thus, P(E) is bounded below by \(\eta ^{v_D}\), i.e., \(P(E) \ge \eta ^{v_D}\). Therefore, we have that \(P(\Omega _D^c \cap \Omega _E \cap \Omega _K \cap \Omega _T) \le |{\mathcal {O}}^{*}| P(E \notin \{E_{ik} | i = 1, \ldots , m, k = 1, \ldots , K_i \}) \le |{\mathcal {O}}^{*}| (1 - \eta ^{v_D})^{|\{E_{ik}\}|} \le |{\mathcal {O}}^{*}| (1 - \eta ^{v_D})^{m \kappa / 2} \).
Hence, event \(\Omega _b^c\) holds with probability at most \(2 \exp \{ - c m\} + m \exp \{- c K_0\} + m K_0 l_{max} \exp \{- \lambda _{min} t_0\} + |{\mathcal {O}}^{*}| (1 - \eta ^{v_D})^{m \kappa / 2}\).
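The Poisson large-deviation bound in part 1 can be sanity-checked numerically: if \(K_1, \ldots, K_m\) are i.i.d. Poisson(\(\kappa\)), their sum is Poisson(\(m\kappa\)), and the Chernoff bound at \(m\kappa/2\) gives exactly the constant \(c = (1-\log 2)\kappa/2\). The values of \(m\) and \(\kappa\) below are illustrative, not from the paper.

```python
import math

# Check: P(sum_i K_i <= m*kappa/2) <= exp(-c m) with c = (1 - log 2) * kappa / 2,
# where sum_i K_i ~ Poisson(m * kappa).
def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), by direct summation of the pmf."""
    term = math.exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return total

kappa, m = 2.0, 50                     # illustrative values
lam = m * kappa
exact_tail = poisson_cdf(int(lam // 2), lam)                   # P(sum <= m*kappa/2)
chernoff = math.exp(-(1.0 - math.log(2.0)) * kappa / 2.0 * m)  # exp(-c m)
assert exact_tail <= chernoff
```

The exact lower tail is several orders of magnitude below the exponential bound, consistent with the Chernoff bound being loose but of the right exponential order in m.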
On event \(\Omega _b\), all sentences in \({\mathcal {O}}^{*}\) are observed at least once. By the dictionary identifiability from Theorem 1, we know that \(\hat{{\mathcal {D}}} = {\mathcal {D}}^{*}\). In other words, \(P(\hat{{\mathcal {D}}} \ne {\mathcal {D}}^{*}) \le P(\Omega _b^c)\). This completes Step 1 by choosing \(K_0 = (\log m)^2\) and \(t_0 = (\log m)^2\).
Proof of Step 2 Fix any parameter \(\Theta \equiv (\{\theta _{jw}\} , \{ \lambda _j\}, \pi , \kappa )\) and let \(l(\Theta )\) denote the log-likelihood evaluated at \(\Theta \). By identifiability, we know that \({\mathbb {E}} l(\Theta ^{*}) > {\mathbb {E}} l(\Theta )\) for any \( \Theta \in {\varvec{\Theta }}_c\) distinct from \(\Theta ^{*}\). By compactness of \(B(\Theta ^{*}, \delta )^c \cap {\varvec{\Theta }}_c\) and continuity of \({\mathbb {E}} l(\Theta )\) (see (A19)), there exists a positive number \(\epsilon \) such that \({\mathbb {E}} l(\Theta ) \le {\mathbb {E}} l(\Theta ^{*}) - 3 \epsilon \) for any \(\Theta \in B(\Theta ^{*}, \delta )^c \cap {\varvec{\Theta }}_c\). Next, we prove the uniform convergence of \(l(\Theta )\) to its expected value.
By Bernstein inequality, we know that
holds point-wisely. By compactness, \(\text {var}(l(\Theta ))\) is bounded by some constant M. Thus,
for any fixed \(\Theta \).
Next, we bound the gap \(l_i(\Theta ) - l_i(\Theta ^{'})\). For notational simplicity, we omit the subscript i in the following displays. We know that
For \(\Vert \Theta - \Theta ^{'}\Vert _{\infty } \le \delta _1\), we can see that \( \log \{P(K)/P^{'}(K)\} \le C K \delta _1\) for some constant C. We can further show that \(\log \{ P(E|j)/P^{'}(E|j) \} \le C v_D \delta _1\) and \(\log \{P(T|j)/P^{'}(T|j) \} \le C l_{max} t_0 \delta _1\) on set \(\Omega _b\). This is because
and
With (A16) and (A17), (A15) becomes
by adjusting the constant.
Next, we prove that \({\mathbb {E}} l(\Theta )\) is a continuous function of \(\Theta \). Define set \(A_{k,t} = \{\omega | t - 1 \le \max \{\tilde{t}_{k_1,u}; k_1 = 1, \ldots , k, u = 1, \ldots , n_{k_1}\} \le t\}\) for \(k,t = 1, 2, \ldots \). By algebraic calculation, for any \(\Theta , \Theta ^{'}\) with \(\Vert \Theta - \Theta ^{'}\Vert _{\infty } \le \delta _1\), we have
by adjusting the constant.
Thus, we have \(|{\mathbb {E}} l(\Theta ) - {\mathbb {E}} l(\Theta ^{'})| \le \frac{\epsilon }{4}\) for any \(\Theta , \Theta ^{'}\) such that \(\Vert \Theta - \Theta ^{'}\Vert \le \delta _2 \equiv \epsilon / (4C)\). Together with (A18), we then have
when \(\Vert \Theta - \Theta ^{'}\Vert _{\infty } \le \delta _3\), where we take \(\delta _3 = \min \{\epsilon /(4C t_0 K_0), \delta _2\}\).
By the covering number technique, there exists a finite set \({\mathcal {N}} \subset {\varvec{\Theta }}_c\) such that every point of \({\varvec{\Theta }}_c\) is within distance \(\delta _3\) of some point in \({\mathcal {N}}\). Thus, by (A14), we have
Define the set \(\Omega _g = \{\omega | \sup _{\Theta \in {\varvec{\Theta }}_c } \frac{1}{m} |\sum _i l_i(\Theta ) - {\mathbb {E}} l_i(\Theta )| \le \epsilon \}\). Combined with (A20), it further gives us that
where D is the diameter of \(\varvec{\Theta }_c\) and \(n_p\) is the number of total model parameters.
Lastly, by the definition of \({\hat{\Theta }}\) and (A21), we have that
on \(\Omega _b \cap \Omega _g\). In other words, (A22) implies that
By choosing \(K_0 = (\log m)^2\) and \(t_0 = (\log m)^2 \), the left hand side of (A23) goes to zero as \(m \rightarrow \infty \). This concludes the proof. \(\square \)
Proposition 1
Under LTDM setting, the probability mass function \(P(e_{1:N}, t_{1:N}; \Theta )\) can be written in the multiplicative form of \(G(e_{1:N}; \Theta ) F(e_{1:N}, t_{1:N}; \Theta _1)\) if and only if \(\lambda _j = \lambda , j = 1, \ldots , J\). Here, \(\Theta _1\) is the model parameter excluding \(\{\theta _{jw}\}\), G and F are some functions.
Proof of Proposition 1
First, we write out the likelihood function
where \(C(K, \kappa )\) is some quantity depending on K and \(\kappa \).
We first prove the sufficient part. Suppose \(\lambda _j = \lambda , j = 1, \ldots , J\). Then, (A24) can be written as
Hence, we can take \(G(e_{1:N}; \Theta ) = C(K, \kappa ) \sum _{j=1}^J \pi _j \prod _{k=1}^K P(E_k; \{\theta _{jw}\})\) and \(F(e_{1:N}, t_{1:N}; \Theta _1) = \prod _{k=1}^K P({\tilde{T}}_k; \lambda )\). This concludes the sufficient part.
We next prove the necessary part. Suppose, to the contrary, that we can write \(P(e_{1:N},t_{1:N}; \Theta ) = G(e_{1:N}; \Theta ) F(e_{1:N},t_{1:N}; \Theta _1)\) when the \(\lambda _j\)’s are not all the same. Without loss of generality, we assume \(\lambda _1 < \lambda _2 \le ... \le \lambda _J\). By assumption,
Dividing both sides of (A27) by \(\prod _{k=1}^K P({\tilde{T}}_k; \lambda _1)\), we get
By letting \({\tilde{t}}_n \rightarrow \infty , n = 1, \ldots , N\), the left-hand side of (A28) becomes \(C(K, \kappa ) \pi _1 \prod _{k=1}^K P(E_k; \{\theta _{1w}\})\). Then, we know that \(G(e_{1:N}; \Theta )\) has the form \(C_1(e_{1:N};\Theta _1) \prod _{k=1}^K P(E_k; \{\theta _{1w}\})\). Plugging this back into (A27), we get
Notice that the left-hand side of the above equation is a polynomial in the \(\{\theta _{jw}\}\)’s and the right-hand side is a polynomial in the \(\{\theta _{1w}\}\)’s. Hence, it must hold that \(\pi _j \prod _{k=1}^K P({\tilde{T}}_k, \lambda _j) \equiv 0\) for \(j = 2, \ldots , J\), which is impossible when \(\pi _j > 0\). This contradicts the assumption and concludes the proof of the necessary part. \(\square \)
Proof of Lemma 2
For any pattern w in \({\mathcal {D}}_{[j]}\), we need to show that it also belongs to \({\mathcal {D}}_{[j]}^{'}\). It is easy to see that if w has the form [A], then it must belong to \({\mathcal {D}}_{[j]}^{'}\), since [A] admits only one separation. In the following, we only need to consider \(w = [e_1 e_2 \ldots e_{l_w}]\) such that \(e_1, \ldots ,e_{l_w}\) (\(l_w \ge 2\)) are distinct according to Assumption A2. Without loss of generality, we assume w belongs to Class j.
Let \(\check{O}_j\) denote the longest sentence generated by \({\mathcal {D}}_j\) satisfying that (1) each event belongs to \({\mathcal {E}} - \{e_1\}\); (2) the length of \(\check{O}_j\) is at least \(n_{j,e_1}\) (see the definition of \(n_{j,e_1}\) in Assumption A1). Notice that \(\check{O}_j\) may not be unique; we only need to consider one of them. Let \(\grave{O}_j\) be a sentence of the form \((Q_1 Q_2)\) such that (1) \(Q_1\) contains \(\check{O}_j\) as a subsentence; (2) each event in \(Q_1\) belongs to \({\mathcal {E}} - \{e_1\}\); (3) \(Q_1\) is longest possible; (4) the first event of \(Q_2\) is \(e_1\) (\(Q_2\) is empty if no such event exists); (5) \(Q_2\) is shortest possible. Notice that \(\grave{O}_j\) may not be unique; we only need to consider one of them. Next, we consider the decomposition of the sentence \(O_j = (\grave{O}_j w)\). With the aid of \(O_j\), we can show that w must belong to \({\mathcal {D}}_{[j]}^{'}\).
Since \({\mathcal {O}}_{[j]} = {\mathcal {O}}_{[j]}^{'}\), we know that \(O_j \in {\mathcal {O}}_{[j]}^{'}\). Without loss of generality, we also assume that \(O_j\) appears in the j-th class of model \({\mathcal {P}}^{'}\). We claim that \(O_j\) only has separations of the form \(\{{\mathcal {S}}(\grave{O}_j), w\}\) in \({\mathcal {D}}^{'}\), where \({\mathcal {S}}(O)\) denotes one realization of a separation of O. If not, one of the following cases must occur.
Case 1: There is a separation \(S \in {\mathcal {F}}(O_j)\) such that \(S = \{{\mathcal {S}}(R_1), w_1\}\), where \(w_1\) is contained in w. By Assumption A2, we know that \(w_1\) does not contain \(e_1\). Then, we consider the sentence \((w_1 R_1)\). Since it is in \({\mathcal {O}}_{[j]}^{'}\), it must be in \({\mathcal {O}}_{[j]}\). By Assumption A1, we know that \((w_1 R_1)\) must belong to \({\mathcal {O}}_j\), since it contains \(\check{O}_j\). Then, \((w_1 R_1)\) can be written in the form \((Q_1 Q_2)\) with a longer \(Q_1\). This contradicts the definition of \(\grave{O}_j\). Therefore, Case 1 cannot happen.
Case 2: There is a separation \(S \in {\mathcal {F}}(O_j)\) such that \(S = \{{\mathcal {S}}(R_2), w_2\}\) where \(w_2\) contains w. We further consider the following four situations.
-
2.a
Suppose \(R_2\) contains \(\check{O}_j\) as a sub-sentence and does not contain the event \(e_1\). Since \({\mathcal {O}}_{[j]} = {\mathcal {O}}_{[j]}^{'}\), we know that \(R_2\) must belong to \({\mathcal {O}}_{[j]}\). By Assumption A1, \(R_2\) is also in \({\mathcal {O}}_j\). This leads to a contradiction, since \(R_2\) is longer than \(\check{O}_j\).
-
2.b
Suppose \(R_2\) contains \(\check{O}_j\) as a sub-sentence and contains the event \(e_1\). Then \(R_2\) is also in \({\mathcal {O}}_j\) for the same reason as before. This time, \(R_2\) can be written in the form \((Q_1 Q_2)\) with a shorter \(Q_2\), which contradicts the definition of \(\grave{O}_j\).
-
2.c
Suppose \(R_2\) is contained in \(\check{O}_j\). If \(R_2\) were the longest sentence generated by \({\mathcal {D}}_{j}^{'}\) without \(e_1\), then by Assumption A1 we would know that \(\check{O}_j\) must also belong to this class. Therefore, \(R_2\) is not the longest such sentence. This implies that there exists a pattern \({\tilde{w}} \in {\mathcal {D}}^{'}_j\) with events in \({\mathcal {E}}_w\) that does not contribute to \(R_2\). Therefore, the sentence \(({\tilde{w}} R_2 w_2)\), which contains \(\check{O}_j\), must belong to \({\mathcal {O}}_j\). By the same argument, it can also be written in the form \((Q_1 Q_2)\) with a longer \(Q_1\). This contradicts the definition of \(\grave{O}_j\).
-
2.d
Suppose \(R_2 = \check{O}_j\). If \(Q_2\) is not empty, then \(w_2\) contains two \(e_1\)’s, which contradicts Assumption A2. If \(Q_2\) is empty, then \(w_2 = w\) exactly.
This concludes the proof of the lemma. \(\square \)
Appendix C: Parameter Estimation in NB-LTDM
In the inner part of the NB-LTDM algorithm, we adopt a nonparametric Bayes method to avoid selecting a single finite number of mixture components J. We therefore replace the finite mixture components with an infinite mixture, that is,
For the choice of prior, we specify \(\theta _{jw} \sim \text {Unif}(0,1)\), \(\kappa \sim \mathrm {Ga}(1,1)\), \(\lambda _j \sim \mathrm {Ga}(1,1)\) and \(v = (v_1, \ldots ) \sim Q\) where Q corresponds to a Dirichlet process. The stick-breaking representation, introduced by Sethuraman (1994), implies that \(v_j = V_j \prod _{l < j} (1- V_l)\) with \(V_j \sim \text {Beta}(1,\alpha )\) independently for \(j=1,\ldots ,\infty \), where \(\alpha >0\) is a precision parameter characterizing Q.
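The stick-breaking representation can be sketched in a few lines. The helper below draws the weights \(v_j = V_j \prod_{l < j} (1 - V_l)\) with \(V_j \sim \text{Beta}(1, \alpha)\), truncating once the remaining stick mass is negligible; the truncation tolerance and \(\alpha = 1\) are illustrative choices, not from the paper.

```python
import random

# Sethuraman's stick-breaking construction of Dirichlet process weights.
def stick_breaking(alpha, tol=1e-12, max_atoms=100_000, seed=42):
    rng = random.Random(seed)
    weights, remaining = [], 1.0
    for _ in range(max_atoms):
        V = rng.betavariate(1.0, alpha)   # V_j ~ Beta(1, alpha), independently
        weights.append(V * remaining)     # v_j = V_j * prod_{l<j} (1 - V_l)
        remaining *= 1.0 - V              # mass of the stick still unbroken
        if remaining < tol:
            break
    return weights

v = stick_breaking(alpha=1.0)
# The truncated weights form a (numerically) valid probability vector.
assert all(w >= 0.0 for w in v) and abs(sum(v) - 1.0) < 1e-9
```

Larger \(\alpha\) spreads mass over more atoms, which is why \(\alpha\) is called a precision parameter: the expected weight of the j-th atom decays like \((\alpha/(1+\alpha))^{j-1}/(1+\alpha)\).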
Hence, our nonparametric Bayesian latent theme dictionary model can be written in the following hierarchical form:
where \(\delta _j(\cdot )\) is the Dirac measure at j.
We use a data augmentation Gibbs sampler (Walker 2007) to update all parameters and latent variables. Specifically, we introduce a vector of latent variables \(u=(u_1, \ldots , u_N)\), where \(u_i {\mathop {\sim }\limits ^{iid}} U[0,1]\). The full likelihood becomes
where \(f_{Ga}\) is the density of \(\mathrm {Ga}(1,1)\), \(f_u\) is the density of \(\mathrm {Unif}(0,1)\) and \(f_{Be}(\cdot ;\alpha )\) is the density of \(\mathrm {Beta}(1, \alpha )\).
Then, the Gibbs sampler iterates through the following steps:
-
1.
Update \(u_i\), for \(i=1,\ldots ,m\), by sampling from \(U(0,v_{z_i})\).
-
2.
Update \(\theta _{hw}\), for \(h=1,\ldots ,j^*\) and \(w=1,\ldots , v_D\), by sampling from
$$\begin{aligned} \text {Beta} \bigg ( \sum _{i:z_i = h} \sum ^{K_i}_{k=1} 1(w \in S_{ik}) + 1, \sum _{i:z_i = h} \sum ^{K_i}_{k=1} 1(w \notin S_{ik} ) + 1 \bigg ). \end{aligned}$$ -
3. Update \(\lambda _j\), for \(j = 1, \ldots , j^{*}\) (\(j^{*} = \max _i\{z_i\}\)), by sampling from
$$\begin{aligned} \text {Ga}\left( 1 + \sum _{i:z_i = j} \sum _{k=1}^{K_i} l_{T_{ik}}, 1 + \sum _{i:z_i = j} \sum _{k=1}^{K_i} \sum _{u = 1}^{l_{T_{ik}}} T_{ik,u}\right) . \end{aligned}$$
4. Update \(V_j\), for \(j=1,\ldots ,j^*\), by sampling from \(\text {Beta}(1,\alpha )\) truncated to the interval
$$\begin{aligned} \bigg [\max _{i:z_i = j} \frac{u_i}{\prod _{l< j} (1-V_l)},\ 1- \max _{i: z_i > j} \frac{u_i}{V_{z_i} \prod _{l < z_i, l \ne j}(1-V_l)} \bigg ]. \end{aligned}$$
5. Update \(z_i\), for \(i=1,\ldots ,m\), by sampling from
$$\begin{aligned} P(z_i&= j|e_{1:N_i},t_{1:N_i},{\mathbf {S}}_i, \{\theta _{jw}\},V,u,z_{-i})\\&= \frac{1(j \in A_i) \prod ^{K_i}_{k=1} P(S_{ik},T_{ik}|\{\theta _{jw}\}, \lambda _j) }{\sum _{l \in A_i} \prod ^{K_i}_{k=1} P(S_{ik}, T_{ik}|\{\theta _{lw}\}, \lambda _l) }, \end{aligned}$$
where \(A_i := \{ j: v_j > u_i\}\). To identify the elements in \(A_1,\ldots ,A_m\), first update \(V_j\) for \(j=1,\ldots ,\tilde{j}\), where \(\tilde{j}\) is the smallest value satisfying
$$\begin{aligned} \sum ^{\tilde{j}}_{j=1} v_j > 1 - \min \{u_1,\ldots , u_m \}. \end{aligned}$$
(A29)
Therefore, \(1,\ldots ,\tilde{j}\) are the possible values for \(z_i\). Note that we have already updated \(V_j\) for \(j=1,\ldots ,j^*\). Therefore, we first check whether \(j^*\) satisfies (A29). If yes, no further sampling is needed; otherwise sample \(V_j \sim \text {Beta}(1,\alpha )\) for \(j=j^*+1,\ldots \) until (A29) is satisfied. In this case, we also sample \(\theta _{j w}\) from \(U(0, 1)\) and \(\lambda _j\) from \(\text {Ga}(1,1)\) for \(j = j^*+1,\ldots , \tilde{j}\) and \(w =1,\ldots ,v_D\) in order to compute \(P(S_{ik}, T_{ik} | \theta _j, \lambda _j)\) for \(j=j^* +1,\ldots ,\tilde{j}\).
6. Update \(S_{ik}\), for \(i=1,\ldots ,m\), \(k=1,\ldots ,K_i\), by sampling from
$$\begin{aligned} P(S_{ik} = S|E_{ik}, \theta _{z_i} ) = \frac{P(S_{ik} = S |\theta _{z_i})}{ \sum _{S' \in {\mathcal {F}}(E_{ik})} P(S_{ik} = S' | \theta _{z_i}) } 1(S \in {\mathcal {F}}(E_{ik})). \end{aligned}$$
7. Update \(\kappa \) by sampling from \(\text {Ga}(1 + \sum _i K_i, 1 + m)\).
8. Update \(\alpha \) by sampling from its posterior \(\text {Ga}(1 + j^{*}, 1 - \sum _{j=1}^{j^{*}} \log (1 - V_{j}))\).
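Step 4 above requires draws from \(\text {Beta}(1,\alpha )\) restricted to an interval. Since \(\text {Beta}(1,\alpha )\) has the closed-form CDF \(F(x) = 1 - (1-x)^{\alpha }\), this can be done by inverse-CDF sampling; a minimal sketch (the helper name is ours, not part of the NB-LTDM code):

```python
import numpy as np

def sample_truncated_beta1(alpha, a, b, rng=None):
    """Sample V ~ Beta(1, alpha) truncated to [a, b] by inverse-CDF.
    For Beta(1, alpha): F(x) = 1 - (1 - x)**alpha, so
    F^{-1}(p) = 1 - (1 - p)**(1/alpha)."""
    rng = np.random.default_rng(rng)
    Fa = 1.0 - (1.0 - a) ** alpha   # CDF at lower endpoint
    Fb = 1.0 - (1.0 - b) ** alpha   # CDF at upper endpoint
    p = rng.uniform(Fa, Fb)         # uniform on the truncated CDF range
    return 1.0 - (1.0 - p) ** (1.0 / alpha)
```

Drawing `p` uniformly between \(F(a)\) and \(F(b)\) and inverting guarantees the sample lands in \([a,b]\) with the correct truncated density.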
Appendix D: Estimated Dictionary in Traffic Item
In the “Traffic” item, the NB-LTDM algorithm found a dictionary \(\hat{{\mathcal {D}}}\) with \({\hat{v}}_D = 82\). For each pattern w in \(\hat{{\mathcal {D}}}\), we classified the six latent classes into two clusters based on their pattern probabilities \(\theta _{jw}\) \((j = 1, \ldots , 6)\). The classes with high pattern probabilities are clustered together and shown in Table 11.
Appendix E: Latent Class Model and Theme Dictionary Model
In this section, we briefly recall two popular models related to the proposed LTDM: the latent class model (LCM; Gibson 1959) and the theme dictionary model (TDM; Deng et al. 2014).
Widely adopted in the biostatistics, psychometrics, and machine learning literature (e.g., Goodman 1974; Vermunt and Magidson 2002; Templin et al. 2010), LCM relates a set of observed variables to a discrete latent variable, often used to indicate the class label. LCM assumes a local independence structure, i.e.,
$$\begin{aligned} P(X_1, \ldots , X_K \mid Z = j) = \prod _{k=1}^{K} P(X_k \mid Z = j), \end{aligned}$$
where \(X_1, \ldots , X_K\) are K observed variables and Z is a discrete latent variable with \(P(Z = j) = \pi _j\), \(j = 1, \ldots , J\). Thus, the joint (marginal) distribution of \(X_1, \ldots , X_K\) takes the form
$$\begin{aligned} P(X_1, \ldots , X_K) = \sum _{j=1}^{J} \pi _j \prod _{k=1}^{K} P(X_k \mid Z = j). \end{aligned}$$
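As a toy numerical illustration of the LCM marginal, consider binary observed variables with conditional probabilities \(p_{jk} = P(X_k = 1 \mid Z = j)\); all names and values below are hypothetical:

```python
import numpy as np

def lcm_marginal(pi, cond_probs, x):
    """Marginal probability of a binary response vector x under an LCM:
    P(x) = sum_j pi_j * prod_k p_{jk}^{x_k} (1 - p_{jk})^{1 - x_k},
    where cond_probs[j, k] = P(X_k = 1 | Z = j)."""
    pi = np.asarray(pi)
    p = np.asarray(cond_probs)
    x = np.asarray(x)
    # Per-class likelihood via local independence: product over the K items
    per_class = np.prod(np.where(x == 1, p, 1.0 - p), axis=1)
    return float(pi @ per_class)

# Two latent classes, three binary items.
pi = [0.4, 0.6]
p = [[0.9, 0.8, 0.7],
     [0.2, 0.3, 0.1]]
prob = lcm_marginal(pi, p, [1, 1, 0])
# 0.4 * (0.9*0.8*0.3) + 0.6 * (0.2*0.3*0.9) = 0.1188
```

The inner product over classes is exactly the finite-mixture sum in the marginal above.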
TDM (Deng et al. 2014) typically handles observations known as words/events and can be used to identify associated event patterns. The problem of finding event associations is also known as market basket analysis (Piatetsky-Shapiro 1991; Hastie et al. 2005; Chen et al. 2005). Under TDM, a pattern is a combination of several events, and a collection of distinct patterns forms a dictionary, \({\mathcal {D}}\). An observation, E, is a set of events. In TDM, we observe E but do not know which patterns it consists of; that is, E could be split into different possible partitions of patterns, each of which is called a separation of E. The collection of multiple observations \({\mathbf {E}} = \{E_1, \ldots , E_K\}\) forms a document. TDM does not take event ordering into account. For example, if \(E = (A, B, C)\) is an observation with three events A, B and C, then TDM treats \(E' = (C, B, A)\) as the same observation as E. Consequently, patterns are also unordered; for instance, patterns [A B] and [B A] are viewed as the same. TDM postulates that a pattern appears in an observation at most once. Let \(\theta _w \in [0,1]\) be the probability that pattern w appears in an observation. The probability of one separation S for observation E is defined to be
$$\begin{aligned} P(S) = \prod _{w \in {\mathcal {D}}} \theta _w^{1(w \in S)} (1-\theta _w)^{1(w \notin S)}. \end{aligned}$$
Since the separation S is not observed, the marginal probability of E is
$$\begin{aligned} P(E) = \sum _{S \in {\mathcal {F}}(E)} P(S), \end{aligned}$$
where \({\mathcal {F}}(E)\) is the set of all possible separations of E. Furthermore, observations are assumed to be independent, i.e., for \({\mathbf {E}} = \{E_1, \ldots , E_K\}\),
$$\begin{aligned} P({\mathbf {E}}) = \prod _{k=1}^{K} P(E_k). \end{aligned}$$
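The TDM quantities above can be illustrated with a brute-force sketch that enumerates every separation of a small observation and sums their probabilities; the dictionary, patterns, and \(\theta _w\) values below are hypothetical:

```python
def separations(events, dictionary):
    """Enumerate all partitions of the unordered observation `events`
    into disjoint dictionary patterns, each pattern used at most once."""
    events = frozenset(events)
    if not events:
        return [frozenset()]
    e = min(events)  # fix one event; some chosen pattern must cover it
    out = []
    for w in dictionary:
        if e in w and w <= events:
            for rest in separations(events - w, dictionary):
                out.append(rest | {w})
    return out

def marginal_prob(events, dictionary, theta):
    """P(E) = sum over separations S of
    prod_w theta_w^{1(w in S)} * (1 - theta_w)^{1(w not in S)}."""
    total = 0.0
    for S in separations(events, dictionary):
        p = 1.0
        for w in dictionary:
            p *= theta[w] if w in S else 1.0 - theta[w]
        total += p
    return total

# Dictionary with three patterns: [A], [B], and [A B].
D = [frozenset({'A'}), frozenset({'B'}), frozenset({'A', 'B'})]
theta = {D[0]: 0.5, D[1]: 0.4, D[2]: 0.3}
# E = {A, B} has two separations: {[A],[B]} and {[A B]}, so
# P(E) = 0.5*0.4*(1-0.3) + (1-0.5)*(1-0.4)*0.3 = 0.23
prob = marginal_prob({'A', 'B'}, D, theta)
```

Enumerating \({\mathcal {F}}(E)\) is exponential in general; this sketch is only meant to make the definitions concrete for tiny examples.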
Fang, G., Ying, Z. Latent Theme Dictionary Model for Finding Co-occurrent Patterns in Process Data. Psychometrika 85, 775–811 (2020). https://doi.org/10.1007/s11336-020-09725-2