Enabling Hierarchical Dirichlet Processes to Work Better for Short Texts at Large Scale

  • Conference paper
  • Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9652)

Abstract

Analyzing texts from social media poses many challenges, including shortness, dynamics, and huge size. Short texts do not provide enough information, so statistical models often fail to work well on them. In this paper, we present a very simple approach (namely, bag-of-biterms) that helps statistical models such as Hierarchical Dirichlet Processes (HDP) work well with short texts. By using both terms (words) and biterms to represent documents, bag-of-biterms (BoB) provides significant benefits: (1) it naturally lengthens the representation and thus reduces the bad effects of shortness; (2) it makes posterior inference in a large class of probabilistic models, including HDP, less intractable; (3) it requires no modification of existing models/methods, and thus can easily be employed in a wide class of statistical models. To evaluate these benefits, we use Online HDP, which can deal with dynamic and massive text collections, and run experiments on three large corpora of short texts crawled from Twitter, Yahoo Q&A, and New York Times. Extensive experiments show that BoB helps HDP work significantly better in terms of both predictiveness and topic quality.
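The BoB representation described in the abstract — counting both single words and unordered word pairs (biterms) within a document — can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name and the tuple encoding of biterm keys are assumptions made here:

```python
from itertools import combinations
from collections import Counter

def bag_of_biterms(tokens):
    """Represent a short document by its terms (words) plus its
    unordered within-document word pairs (biterms), in the spirit
    of the BoB scheme. Merging (w_i, w_j) and (w_j, w_i) into one
    unordered biterm mirrors the merge described in the appendix."""
    counts = Counter(tokens)  # unigram term counts
    # every unordered pair of token positions yields one biterm;
    # sorting makes (w_i, w_j) and (w_j, w_i) the same key
    for wi, wj in combinations(tokens, 2):
        counts[("BITERM",) + tuple(sorted((wi, wj)))] += 1
    return counts
```

Because a document of n tokens yields n(n-1)/2 biterms, this naturally lengthens the representation of short texts, which is the first benefit claimed in the abstract.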



Acknowledgments

This work was partially supported by Vietnam National Foundation for Science and Technology Development (NAFOSTED Project No. 102.05-2014.28), and by AOARD (US Air Force) and ITC-PAC (US Army) under agreement number FA2386-15-1-4011.

Author information

Correspondence to Khoat Than.

Appendix: Conversion of topic-over-biterms (distribution over biterms) to topic-over-words (distribution over words)

In BoB, after training the model we obtain a set of topics, each of which is a multinomial distribution over biterms, and we want to convert them into distributions over words. Assume that \(\varvec{\phi }_k\) is the distribution over biterms of topic k. By the law of total probability:

\( p(w_i \mid z=k) = \sum _{j=1}^{V} p(w_i, w_j \mid z=k) = \sum _{j=1}^{V} p(b_{ij} \mid z=k) = \sum _{j=1}^{V} \phi _{k b_{ij}}. \)

As discussed in Sect. 4.2, when implementing BoB we can merge \(b_{ij}\) and \(b_{ji}\) into a single biterm \(b_{ij}\) with \(i<j\). Since the two occur identically in every document, after training the value of \(p(b_{ij}\mid z=k)\) is expected to equal \(p(b_{ji}\mid z=k)\). Therefore, when these biterms are grouped into one, the conversion for this implementation becomes: \( p(w_i\mid z=k) = \phi _{k b_{ii}} + \frac{1}{2}\sum _{b \ni w_i,\, b \ne b_{ii}} \phi _{kb}. \)
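A minimal sketch of this conversion, assuming \(\varvec{\phi }_k\) is stored as a vector aligned with a list of merged index pairs \((i, j)\), \(i \le j\). The function and variable names are illustrative, not taken from the paper's code:

```python
import numpy as np

def topic_over_words(phi_k, biterms, V):
    """Convert a topic's distribution over merged biterms into a
    distribution over words, following the appendix formula:
    p(w_i|z=k) = phi_{k,b_ii} + 0.5 * sum of phi_{k,b} over merged
    biterms b containing w_i with b != b_ii.
    phi_k   : probabilities aligned with `biterms`
    biterms : list of index pairs (i, j) with i <= j
    V       : vocabulary size
    """
    p_w = np.zeros(V)
    for prob, (i, j) in zip(phi_k, biterms):
        if i == j:
            p_w[i] += prob           # b_ii contributes fully to w_i
        else:
            p_w[i] += 0.5 * prob     # merged mass split between the
            p_w[j] += 0.5 * prob     # two words of the biterm
    return p_w
```

Since each merged biterm's mass is either assigned wholly to one word (when i = j) or split evenly between two words, the resulting word distribution still sums to one.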

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Mai, K., Mai, S., Nguyen, A., Van Linh, N., Than, K. (2016). Enabling Hierarchical Dirichlet Processes to Work Better for Short Texts at Large Scale. In: Bailey, J., Khan, L., Washio, T., Dobbie, G., Huang, J., Wang, R. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2016. Lecture Notes in Computer Science(), vol 9652. Springer, Cham. https://doi.org/10.1007/978-3-319-31750-2_34

  • DOI: https://doi.org/10.1007/978-3-319-31750-2_34

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-31749-6

  • Online ISBN: 978-3-319-31750-2
