Abstract
This paper introduces an approach for discovering groups of thematically related documents (a topic mining task) in massive document collections by means of graph local clustering. The document collection is viewed as a directed graph in which vertices represent documents and arcs represent connections among them (e.g. hyperlinks). Because a document is likely to have more connections to documents of the same theme, we assume that topics have the structure of a graph cluster, i.e. a group of vertices with more arcs within the group than to the outside of it. Topics can therefore be discovered by clustering the document graph; we use a local approach to cope with scalability. We also extract properties (keywords and most representative documents) from the clusters to summarize each topic. The approach was tested on the Wikipedia collection, and we observed that the resulting clusters do correspond to topics, which indicates that topic mining can be treated as a graph clustering problem.
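The notion of a graph cluster used in the abstract can be illustrated with a minimal sketch. The abstract does not specify the fitness function or the search strategy, so everything below is an assumption made for illustration: the names `relative_density`, `local_cluster`, and `example_arcs` are hypothetical, cluster quality is scored as the fraction of arcs touching the cluster that stay inside it, and the cluster is grown greedily from a seed document using only its immediate neighborhood.

```python
from collections import defaultdict


def relative_density(cluster, arcs):
    """Fraction of arcs touching the cluster that have both ends inside it.

    Assumed quality measure; the abstract does not name the one actually used.
    """
    internal = external = 0
    for u, v in arcs:
        u_in, v_in = u in cluster, v in cluster
        if u_in and v_in:
            internal += 1
        elif u_in or v_in:
            external += 1
    total = internal + external
    return internal / total if total else 0.0


def local_cluster(seed, arcs):
    """Greedily grow a cluster around `seed` (assumed strategy).

    At each step, add the neighboring vertex that most improves the
    relative density; stop when no neighbor improves it.
    """
    # Arc direction is ignored when looking up neighbors.
    neighbors = defaultdict(set)
    for u, v in arcs:
        neighbors[u].add(v)
        neighbors[v].add(u)

    cluster = {seed}
    score = relative_density(cluster, arcs)
    while True:
        # Only vertices adjacent to the current cluster are examined,
        # which is what makes the search local.
        frontier = set().union(*(neighbors[u] for u in cluster)) - cluster
        best, best_score = None, score
        for candidate in sorted(frontier):
            s = relative_density(cluster | {candidate}, arcs)
            if s > best_score:
                best, best_score = candidate, s
        if best is None:
            return cluster
        cluster.add(best)
        score = best_score


# Toy document graph: two tightly linked groups joined by a single arc.
example_arcs = [
    ("a", "b"), ("b", "c"), ("c", "a"),  # first group
    ("c", "d"),                          # bridge between groups
    ("d", "e"), ("e", "f"), ("f", "d"),  # second group
]

print(sorted(local_cluster("a", example_arcs)))
print(sorted(local_cluster("d", example_arcs)))
```

For simplicity this sketch rescans the full arc list when scoring a candidate; a version meant for massive collections would instead update the internal and external arc counts incrementally from the adjacency of the vertex being added.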
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Garza Villarreal, S.E., Brena, R.F. (2011). Topic Mining Based on Graph Local Clustering. In: Batyrshin, I., Sidorov, G. (eds) Advances in Soft Computing. MICAI 2011. Lecture Notes in Computer Science(), vol 7095. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25330-0_18
DOI: https://doi.org/10.1007/978-3-642-25330-0_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25329-4
Online ISBN: 978-3-642-25330-0
eBook Packages: Computer Science, Computer Science (R0)