Abstract
Recent years have witnessed the rapid growth of text data, and with it the increasing importance of in-depth analysis of text data for various applications. Text data are often organized in a database with documents labeled by attributes such as time and location. Different documents manifest different topics. The topics of the documents may change along the attributes of the documents, and such changes have been the subject of past research. However, previous analysis techniques, such as topic detection and tracking, topic lifetime, and burstiness, all focus on the topic behavior of the documents in a given attribute range without contrasting it with that of the documents in the overall range. This paper introduces the concept of unique topics, referring to those topics that appear frequently only within a small range of documents but not in the whole range. These unique topics may reflect characteristics of the documents in this small range that are not found outside of it. The paper presents an efficient pruning-based algorithm that, for a user-given set of keywords and a user-given attribute, finds the maximal ranges along the given attribute and their unique topics that are highly related to the given keyword set. Thorough experiments show that the algorithm is effective in various scenarios.
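To make the notion of a unique topic concrete, the following minimal Python sketch checks which topics are frequent inside an attribute range but infrequent over the whole collection. It is an illustration only, not the paper's pruning-based algorithm: the function name `unique_topics`, the thresholds `min_in` and `max_all`, the one-topic-per-document simplification, and the toy data are all assumptions made for this example.

```python
from collections import Counter


def unique_topics(docs, lo, hi, min_in=0.5, max_all=0.25):
    """Return topics frequent in the range [lo, hi] but rare overall.

    docs    : list of (attribute_value, topic) pairs (hypothetical format,
              one topic label per document for simplicity)
    min_in  : assumed minimum relative frequency inside the range
    max_all : assumed maximum relative frequency over all documents
    """
    in_range = [topic for value, topic in docs if lo <= value <= hi]
    if not in_range:
        return []
    local = Counter(in_range)
    overall = Counter(topic for _, topic in docs)
    result = []
    for topic, count in local.items():
        local_freq = count / len(in_range)
        global_freq = overall[topic] / len(docs)
        # A topic counts as unique only if it passes both thresholds.
        if local_freq >= min_in and global_freq <= max_all:
            result.append(topic)
    return sorted(result)


# Toy data: (day, topic) pairs; "riot" is concentrated on days 3 and 4.
docs = [
    (1, "sports"), (2, "politics"), (3, "riot"), (3, "riot"),
    (4, "riot"), (4, "sports"), (5, "sports"), (6, "politics"),
    (7, "sports"), (8, "sports"), (9, "politics"), (10, "sports"),
    (11, "sports"), (12, "politics"), (13, "sports"),
]

print(unique_topics(docs, 3, 4))   # narrow range around days 3 and 4
print(unique_topics(docs, 1, 13))  # the whole range
```

In this toy data, "riot" dominates days 3 to 4 but accounts for only a fifth of all documents, so it qualifies for that narrow range; over the whole range no topic passes both thresholds. Finding the maximal such ranges efficiently, and relating topics to a keyword set, is what the paper's algorithm addresses and is not attempted here.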
Notes
\(r_{1}\cup \cdots \cup r_{n}\) labels a range containing \(r_{1}\) through \(r_{n}\).
London Riots. https://en.wikipedia.org/wiki/2011_England_riots.
Costa Concordia Disaster. https://en.wikipedia.org/wiki/Costa_Concordia_disaster.
Ya’an Earthquake. https://en.wikipedia.org/wiki/2013_Lushan_earthquake.
Malaysia Airlines Flight 370. https://en.wikipedia.org/wiki/Malaysia_Airlines_Flight_370.
Acknowledgments
We thank Yaoliang Chen for his useful comments, and Chenghao Guo and Kaiwen Zhou for their enthusiastic help during the data collection process. We also thank the anonymous reviewers for their invaluable feedback and suggestions that have greatly improved this work. This work was partially supported by the NSFC (No. 61370080, No. 61170007) and the Shanghai Innovation Action Project (Grant No. 16DZ1100200), as well as by respective grants from EMC and SAP.
About this article
Cite this article
Yang, Z., Ma, H., He, Z. et al. Finding maximal ranges with unique topics in a text database. World Wide Web 21, 289–310 (2018). https://doi.org/10.1007/s11280-017-0448-y
DOI: https://doi.org/10.1007/s11280-017-0448-y