Discovery of Frequent Tree Structured Patterns in Semistructured Web Documents

Miyahara, Tetsuhiro; Shoudai, Takayoshi; Uchida, Tomoyuki; Takahashi, Kenichi; Ueda, Hiroaki

doi:10.1007/3-540-45357-1_8

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2035)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

Many documents, such as Web documents and XML files, have no rigid structure, and the number of such semistructured documents has been increasing rapidly. We propose a new method for discovering frequent tree-structured patterns in semistructured Web documents. We consider the data mining problem of finding all maximally frequent tag tree patterns in semistructured data such as Web documents. A tag tree pattern is an edge-labeled tree that has hyperedges as variables. An edge label is a tag or a keyword in Web documents, and a variable can be substituted by any tree. A tag tree pattern is therefore well suited to representing tree-structured patterns in semistructured Web documents. We present an algorithm for finding all maximally frequent tag tree patterns. We also report some experimental results obtained by applying our algorithm to XML documents.
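
To make the notion of a tag tree pattern concrete, the following Python sketch shows one simplified way to test whether a pattern matches an edge-labeled tree and to measure how frequent the pattern is in a set of documents. It is an illustration only, not the algorithm of the chapter. The names `VAR`, `matches`, and `frequency` are hypothetical, and the sketch assumes unordered trees, variables that end in a leaf of the pattern, and substitution of each variable by a tree with at least one edge.

```python
# A tree is a list of (edge_label, subtree) pairs; a leaf is the empty list.
# VAR is a hypothetical marker for a variable edge in a pattern (assumption).
VAR = None


def matches(pattern, data):
    """Return True if `data` can be obtained from `pattern` by replacing
    every variable edge with some tree that has at least one edge.

    Simplification (assumption): a variable edge always ends in a leaf of
    the pattern, so it simply absorbs one or more leftover data edges
    together with whatever hangs below them.
    """
    labeled = [(lab, sub) for lab, sub in pattern if lab is not VAR]
    n_vars = sum(1 for lab, _ in pattern if lab is VAR)

    def assign(i, unused):
        # Map the i-th labeled pattern edge to a distinct unused data edge,
        # backtracking over the choices (fine for small trees only).
        if i == len(labeled):
            if n_vars == 0:
                # No variables: every data edge must have been matched.
                return len(unused) == 0
            # Each variable needs at least one leftover data edge.
            return len(unused) >= n_vars
        lab, sub = labeled[i]
        for j in unused:
            dlab, dsub = data[j]
            if dlab == lab and matches(sub, dsub):
                if assign(i + 1, unused - {j}):
                    return True
        return False

    return assign(0, frozenset(range(len(data))))


def frequency(pattern, documents):
    """Fraction of documents matched by the pattern."""
    return sum(matches(pattern, d) for d in documents) / len(documents)


# Three tiny documents written as edge-labeled trees (made-up data).
d1 = [("book", [("title", []), ("author", [("name", [])])])]
d2 = [("book", [("title", []), ("price", [])])]
d3 = [("article", [("title", [])])]

# Pattern: a "book" edge whose child has a "title" edge and one variable.
pattern = [("book", [("title", []), (VAR, [])])]

print(frequency(pattern, [d1, d2, d3]))
```

In this toy data the pattern matches `d1` and `d2` but not `d3`. A frequent pattern is then one whose frequency reaches a chosen threshold; deciding which frequent patterns are maximal is outside the scope of this sketch.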

Author information

Authors and Affiliations

Faculty of Information Sciences, Hiroshima City University, Hiroshima, 731-3194, Japan
Tetsuhiro Miyahara, Tomoyuki Uchida, Kenichi Takahashi & Hiroaki Ueda
Department of Informatics, Kyushu University, Kasuga, 816-8580, Japan
Takayoshi Shoudai

Editor information

Editors and Affiliations

Dept. of Computer Science and Information Systems, The University of Hong Kong, Pokfulam, Hong Kong, China
David Cheung
CSIRO Mathematical and Information Sciences, GPO Box 664, Canberra, ACT 2601, Australia
Graham J. Williams
Department of Computer Science, City University of Hong Kong, 83 Tat Chee Ave., Kowloon, Hong Kong, China
Qing Li

Cite this paper

Miyahara, T., Shoudai, T., Uchida, T., Takahashi, K., Ueda, H. (2001). Discovery of Frequent Tree Structured Patterns in Semistructured Web Documents. In: Cheung, D., Williams, G.J., Li, Q. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2001. Lecture Notes in Computer Science, vol 2035. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45357-1_8

DOI: https://doi.org/10.1007/3-540-45357-1_8
Published: 11 April 2001
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41910-5
Online ISBN: 978-3-540-45357-4
eBook Packages: Springer Book Archive
