Representing Web Data as Complex Objects

  • Conference paper
Electronic Commerce and Web Technologies (EC-Web 2000)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1875)

Abstract

The popularization of the Web has made a huge volume of data available to a large audience. In many Web sites, such as bookstores, electronic catalogs, and travel agencies, the pages are documents composed of pieces of data whose overall structure can be easily recognized. Such pages are called data-rich and can be seen as collections of complex objects. In this paper, we show how such objects can be represented by nested tables, which are simple, intuitive, and convenient for expressing their implicit structure. Our assumption is that, for most sites of interest, only a few examples are required to reveal the structure of the objects. To corroborate this assumption, we describe a data extraction tool that adopts this approach and present the results of experiments carried out with it.
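
As an illustration of the nested-table idea described in the abstract, the following minimal Python sketch models a complex object as a table whose cells hold either atomic values or inner tables. It is not the extraction tool from the paper: the bookstore data, the column names, and the helper functions `infer_schema` and `unnest` are all hypothetical and were chosen only to show how a single example object can reveal the implicit structure.

```python
# Hypothetical example object, as it might appear on a bookstore page.
# A nested table is modeled as a list of rows (dicts); a cell holds either
# an atomic value or another nested table (a list of dicts).
books = [
    {
        "title": "Foundations of Databases",
        "publisher": "Addison-Wesley",
        "authors": [
            {"name": "S. Abiteboul"},
            {"name": "R. Hull"},
            {"name": "V. Vianu"},
        ],
    },
]


def infer_schema(table):
    """Return the column structure of a nested table.

    Atomic columns map to None; nested columns map to their own schema.
    This is an illustrative assumption, not the paper's algorithm.
    """
    schema = {}
    for row in table:
        for column, value in row.items():
            if isinstance(value, list):
                inner = schema.setdefault(column, {})
                if isinstance(inner, dict):
                    inner.update(infer_schema(value))
            else:
                schema.setdefault(column, None)
    return schema


def unnest(table, column):
    """Flatten one nested column into one flat row per inner row."""
    flat = []
    for row in table:
        rest = {k: v for k, v in row.items() if k != column}
        for inner_row in row.get(column, []):
            flat.append({**rest, **inner_row})
    return flat


schema = infer_schema(books)
flat_rows = unnest(books, "authors")
```

Under these assumptions, `schema` records that `authors` is itself a table with a `name` column, and `flat_rows` holds one flat row per author, each repeating the book's title and publisher.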

This work is supported by Project SIAM (grant MCT/FINEP/PRONEX 76.97.1016.00) and by individual research grants from CNPq and CAPES.


Copyright information

© 2000 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Laender, A.H.F., Ribeiro-Neto, B., da Silva, A.S., Silva, E.S. (2000). Representing Web Data as Complex Objects. In: Bauknecht, K., Madria, S.K., Pernul, G. (eds) Electronic Commerce and Web Technologies. EC-Web 2000. Lecture Notes in Computer Science, vol 1875. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44463-7_19

  • DOI: https://doi.org/10.1007/3-540-44463-7_19

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-67981-3

  • Online ISBN: 978-3-540-44463-3

  • eBook Packages: Springer Book Archive
