Enabling Hierarchical Dirichlet Processes to Work Better for Short Texts at Large Scale

  • Conference paper
  • Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9652)

Abstract

Analyzing texts from social media poses many challenges, including shortness, dynamics, and huge size. Short texts do not provide enough information, so statistical models often fail to work well on them. In this paper, we present a very simple approach (namely, bag-of-biterms) that helps statistical models such as Hierarchical Dirichlet Processes (HDP) work well with short texts. By using both terms (words) and biterms to represent documents, bag-of-biterms (BoB) provides significant benefits: (1) it naturally lengthens the representation and thus reduces the bad effects of shortness; (2) it makes posterior inference in a large class of probabilistic models, including HDP, less intractable; (3) it requires no modification of existing models/methods, and thus can easily be employed in a wide class of statistical models. To evaluate these benefits, we use Online HDP, which can deal with dynamic and massive text collections, and run experiments on three large corpora of short texts crawled from Twitter, Yahoo Q&A, and New York Times. Extensive experiments show that BoB helps HDP work significantly better in terms of both predictiveness and topic quality.
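The BoB representation described in the abstract — counting both single words and unordered word pairs (biterms) within a document — can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name and the tuple encoding of biterm keys are assumptions made here:

```python
from itertools import combinations
from collections import Counter

def bag_of_biterms(tokens):
    """Represent a short document by its terms (words) plus its
    unordered within-document word pairs (biterms), in the spirit
    of the BoB scheme. Merging (w_i, w_j) and (w_j, w_i) into one
    unordered biterm mirrors the merge described in the appendix."""
    counts = Counter(tokens)  # unigram term counts
    # every unordered pair of token positions yields one biterm;
    # sorting makes (w_i, w_j) and (w_j, w_i) the same key
    for wi, wj in combinations(tokens, 2):
        counts[("BITERM",) + tuple(sorted((wi, wj)))] += 1
    return counts
```

Because a document of n tokens yields n(n-1)/2 biterms, this naturally lengthens the representation of short texts, which is the first benefit claimed in the abstract.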



Acknowledgments

This work was partially supported by Vietnam National Foundation for Science and Technology Development (NAFOSTED Project No. 102.05-2014.28), and by AOARD (US Air Force) and ITC-PAC (US Army) under agreement number FA2386-15-1-4011.

Author information

Correspondence to Khoat Than.

Appendix: Conversion of topic-over-biterms (distribution over biterms) to topic-over-words (distribution over words)

In BoB, after training the model we obtain a set of topics, each of which is a multinomial distribution over biterms, and we want to convert them into distributions over words. Assume that \(\varvec{\phi }_k\) is the distribution over biterms of topic k. By the law of total probability:

\( p(w_i \mid z=k) = \sum _{j=1}^{V} p(w_i, w_j \mid z=k) = \sum _{j=1}^{V} p(b_{ij} \mid z=k) = \sum _{j=1}^{V} \phi _{k b_{ij}}. \)

As discussed in Sect. 4.2, when implementing BoB we can merge \(b_{ij}\) and \(b_{ji}\) into a single biterm \(b_{ij}\) with \(i<j\). Since the two occur identically in every document, after training the value of \(p(b_{ij}\mid z=k)\) is expected to equal \(p(b_{ji}\mid z=k)\). Therefore, when these biterms are grouped into one, the conversion for this implementation becomes: \( p(w_i\mid z=k) = \phi _{k b_{ii}} + \frac{1}{2}\sum _{b \ni w_i,\, b \ne b_{ii}} \phi _{kb}. \)
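A minimal sketch of this conversion, assuming \(\varvec{\phi }_k\) is stored as a vector aligned with a list of merged index pairs \((i, j)\), \(i \le j\). The function and variable names are illustrative, not taken from the paper's code:

```python
import numpy as np

def topic_over_words(phi_k, biterms, V):
    """Convert a topic's distribution over merged biterms into a
    distribution over words, following the appendix formula:
    p(w_i|z=k) = phi_{k,b_ii} + 0.5 * sum of phi_{k,b} over merged
    biterms b containing w_i with b != b_ii.
    phi_k   : probabilities aligned with `biterms`
    biterms : list of index pairs (i, j) with i <= j
    V       : vocabulary size
    """
    p_w = np.zeros(V)
    for prob, (i, j) in zip(phi_k, biterms):
        if i == j:
            p_w[i] += prob           # b_ii contributes fully to w_i
        else:
            p_w[i] += 0.5 * prob     # merged mass split between the
            p_w[j] += 0.5 * prob     # two words of the biterm
    return p_w
```

Since each merged biterm's mass is either assigned wholly to one word (when i = j) or split evenly between two words, the resulting word distribution still sums to one.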

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Mai, K., Mai, S., Nguyen, A., Van Linh, N., Than, K. (2016). Enabling Hierarchical Dirichlet Processes to Work Better for Short Texts at Large Scale. In: Bailey, J., Khan, L., Washio, T., Dobbie, G., Huang, J., Wang, R. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2016. Lecture Notes in Computer Science(), vol 9652. Springer, Cham. https://doi.org/10.1007/978-3-319-31750-2_34

  • DOI: https://doi.org/10.1007/978-3-319-31750-2_34

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-31749-6

  • Online ISBN: 978-3-319-31750-2
