Finding maximal ranges with unique topics in a text database

World Wide Web

Abstract

Recent years have witnessed the rapid growth of text data, and with it the increasing importance of in-depth text analysis for various applications. Text data are often organized in a database, with documents labeled by attributes such as time and location. Different documents manifest different topics. The topics may change along the attributes of the documents, and such changes have been the subject of past research. However, previous analysis techniques, such as topic detection and tracking, topic lifetime, and burstiness, all focus on the topic behavior of the documents in a given attribute range without contrasting it with the documents in the overall range. This paper introduces the concept of unique topics: topics that appear frequently only within a small range of documents and not in the whole range. These unique topics may reflect characteristics of the documents in this small range that are not found outside it. The paper aims to provide an efficient pruning-based algorithm that, for a user-given set of keywords and a user-given attribute, finds the maximal ranges along the given attribute, together with their unique topics that are highly related to the given keyword set. Thorough experiments show that the algorithm is effective in various scenarios.
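
To make the notion of a unique topic concrete, the following is a minimal brute-force sketch, not the paper's pruning-based algorithm. It assumes each document already carries a single topic label and a position along one attribute, and it ignores keyword relevance entirely. The function name and the thresholds `min_in`, `max_all`, and `min_hits` are hypothetical choices made for illustration only.

```python
from collections import Counter


def unique_topic_ranges(docs, min_in=0.8, max_all=0.4, min_hits=2):
    """Illustrative only: find maximal ranges in which a topic is locally frequent.

    docs: list of (attribute_value, topic) pairs, sorted by attribute value.
    A topic is treated as unique to the index range [i, j] (inclusive) when
      - its share of all documents is at most max_all, and
      - its share of the documents in [i, j] is at least min_in, and
      - it occurs at least min_hits times in [i, j].
    Returns {topic: [(i, j), ...]} with only maximal ranges kept.
    """
    n = len(docs)
    overall = Counter(topic for _, topic in docs)
    result = {}
    for topic, total in overall.items():
        if total / n > max_all:
            continue  # frequent in the whole range, so not unique anywhere
        qualifying = []
        for i in range(n):
            hits = 0
            for j in range(i, n):
                hits += docs[j][1] == topic
                if hits >= min_hits and hits / (j - i + 1) >= min_in:
                    qualifying.append((i, j))
        # Keep a range only if no other qualifying range contains it.
        maximal = [
            r for r in qualifying
            if not any(o != r and o[0] <= r[0] and r[1] <= o[1]
                       for o in qualifying)
        ]
        if maximal:
            result[topic] = maximal
    return result


# Hypothetical toy data: (day, topic label) pairs.
docs = [
    (1, "sports"), (2, "sports"), (3, "riot"), (4, "riot"), (5, "riot"),
    (6, "sports"), (7, "weather"), (8, "sports"), (9, "sports"), (10, "weather"),
]
print(unique_topic_ranges(docs))
```

In this toy data, "sports" is frequent overall and is therefore excluded, "weather" is never dense enough in any range, and "riot" is concentrated in the index range (2, 4). The sketch enumerates all ranges, which is quadratic per topic; an efficient method would need to prune this search space.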

Notes

  1. http://www.chinanews.com/.

  2. \(r_{1}\cup \cdots \cup r_{n}\) labels a range containing \(r_{1}\) through \(r_{n}\).

  3. http://www.bbc.com/news.

  4. https://github.com/NLPchina/ansj_seg.

  5. London Riots. https://en.wikipedia.org/wiki/2011_England_riots.

  6. Costa Concordia Disaster. https://en.wikipedia.org/wiki/Costa_Concordia_disaster.

  7. Ya’an Earthquake. https://en.wikipedia.org/wiki/2013_Lushan_earthquake.

  8. Malaysia Airlines Flight 370. https://en.wikipedia.org/wiki/Malaysia_Airlines_Flight_370.

  9. https://github.com/cran/bursts.

Acknowledgments

We thank Yaoliang Chen for his useful comments, and Chenghao Guo and Kaiwen Zhou for their enthusiastic help during the data collection process. We also thank the anonymous reviewers for their invaluable feedback and suggestions that have greatly improved this work. This work was partially supported by the NSFC (No. 61370080, No. 61170007) and the Shanghai Innovation Action Project (Grant No. 16DZ1100200), as well as by respective grants from EMC and SAP.

Author information

Corresponding author

Correspondence to Zhihui Yang.

About this article

Cite this article

Yang, Z., Ma, H., He, Z. et al. Finding maximal ranges with unique topics in a text database. World Wide Web 21, 289–310 (2018). https://doi.org/10.1007/s11280-017-0448-y

  • DOI: https://doi.org/10.1007/s11280-017-0448-y
