Abstract
The huge amount of information available on the Web has attracted many research efforts into developing wrappers that extract data from webpages. However, as most systems for generating wrappers focus on extracting data at the page level, data extraction at the site level remains a manual or semi-automatic process. In this paper, we study the problem of extracting a website's skeleton, i.e., the underlying hyperlink structure that is used to organize the content pages in a given website. We propose an automated algorithm, called the Sew algorithm, to discover the skeleton of a website. Given a page, the algorithm examines hyperlinks in groups and identifies the navigation links that point to pages in the next level of the website structure. The entire skeleton is then constructed by recursively fetching the pages pointed to by the discovered links and analyzing these pages using the same process. Our experiments on real-life websites show that the algorithm achieves high recall with moderate precision.
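The recursive process described above can be illustrated with a minimal Python sketch. It is not the Sew algorithm itself: the abstract does not state how navigation links are identified, so `pick_navigation_group` uses a placeholder heuristic (take the largest link group that contains no already-visited page), which is purely an assumption. The names `group_links`, `build_skeleton`, `fetch_links`, the group labels, and the toy `SITE` dictionary are likewise hypothetical and stand in for real page fetching and link grouping.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Set, Tuple

# A link is (group_id, target_url). group_id is a hypothetical label for the
# block of the page (for example a menu or a list) in which the link appears.
Link = Tuple[str, str]


def group_links(links: List[Link]) -> Dict[str, List[str]]:
    """Collect link targets by the group they appear in, preserving order."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for group_id, url in links:
        groups[group_id].append(url)
    return dict(groups)


def pick_navigation_group(groups: Dict[str, List[str]], visited: Set[str]) -> List[str]:
    """Placeholder heuristic (an assumption, not the paper's criterion):
    return the largest group in which no link points to a visited page."""
    best: List[str] = []
    for urls in groups.values():
        if all(u not in visited for u in urls) and len(urls) > len(best):
            best = urls
    return best


def build_skeleton(
    url: str,
    fetch_links: Callable[[str], List[Link]],
    visited: Optional[Set[str]] = None,
    max_depth: int = 5,
) -> dict:
    """Recursively build a tree of pages reachable through navigation links."""
    if visited is None:
        visited = set()
    visited.add(url)
    node = {"url": url, "children": []}
    if max_depth == 0:
        return node
    nav = pick_navigation_group(group_links(fetch_links(url)), visited)
    # Mark all siblings as visited first, so that a menu repeated on a child
    # page is not mistaken for that child's next level.
    visited.update(nav)
    for child in nav:
        node["children"].append(
            build_skeleton(child, fetch_links, visited, max_depth - 1)
        )
    return node


# Toy in-memory site used instead of real HTTP fetching.
SITE: Dict[str, List[Link]] = {
    "/": [("menu", "/a"), ("menu", "/b"), ("footer", "/")],
    "/a": [("menu", "/a"), ("menu", "/b"), ("list", "/a/1"), ("list", "/a/2")],
    "/b": [("menu", "/a"), ("menu", "/b")],
    "/a/1": [],
    "/a/2": [],
}

skeleton = build_skeleton("/", lambda u: SITE.get(u, []))
```

On the toy site, the menu links on the home page become the first level, and the list links on `/a` become its children; the repeated menu on `/a` and `/b` is ignored because its targets were already visited. A real implementation would need a far more careful criterion for choosing the navigation group.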
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Liu, Z., Ng, W.K., Lim, EP. (2004). An Automated Algorithm for Extracting Website Skeleton. In: Lee, Y., Li, J., Whang, KY., Lee, D. (eds) Database Systems for Advanced Applications. DASFAA 2004. Lecture Notes in Computer Science, vol 2973. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24571-1_70
DOI: https://doi.org/10.1007/978-3-540-24571-1_70
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-21047-4
Online ISBN: 978-3-540-24571-1