XML schema refinement through redundancy detection and normalization

  • Special Issue Paper
The VLDB Journal

Abstract

As XML grows in popularity, XML schema design has become an increasingly important issue. One of the central objectives of good schema design is to avoid data redundancies: redundantly stored information leads not only to higher storage costs but also to higher costs for data transfer and data manipulation. Such redundancies can also cause update anomalies that render the database inconsistent. One strategy for avoiding data redundancies is to design a redundancy-free schema from the start, on the basis of known functional dependencies. We observe, however, that XML databases are often “casually designed” and that XML FDs may not be determined in advance. Under such circumstances, discovering XML data redundancies from the data itself becomes necessary and is an integral part of the schema refinement (or re-design) process. We present the design and implementation of DiscoverXFD, the first system for efficient discovery of XML data redundancies. It employs a novel XML data structure and introduces a new class of partition-based algorithms. XML data redundancies are defined on the basis of a new notion of XML functional dependency (XML FD) that (1) extends previous notions by incorporating set elements into the XML FD specification, and (2) maintains tuple-based semantics through the novel concept of the Generalized Tree Tuple (GTT). Using this comprehensive XML FD notion, we introduce a new normal form (GTT-XNF) for XML documents and provide comprehensive comparisons with previous studies. Given the set of data redundancies (in the form of redundancy-indicating XML FDs) discovered by DiscoverXFD, we describe a normalization algorithm for converting any original XML schema into one in GTT-XNF.
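To make the idea of partition-based discovery concrete, the following is a minimal sketch of how a functional dependency can be checked by comparing partitions. It is not the DiscoverXFD algorithm: it assumes flat dictionaries as a stand-in for generalized tree tuples, and the function names (`partition`, `fd_holds`, `indicates_redundancy`) and the sample rows are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

def partition(rows, attrs):
    """Group row indices into classes that agree on every attribute in attrs."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[tuple(row[a] for a in attrs)].append(i)
    return list(groups.values())

def fd_holds(rows, lhs, rhs):
    """lhs -> rhs holds iff refining the lhs partition by rhs splits no class."""
    return len(partition(rows, lhs)) == len(partition(rows, lhs + rhs))

def indicates_redundancy(rows, lhs, rhs):
    """The FD holds and lhs is not a key, so some rhs value is stored repeatedly."""
    return fd_holds(rows, lhs, rhs) and any(
        len(group) > 1 for group in partition(rows, lhs)
    )

# Illustrative data (assumed, not from the paper): one flat tuple per chapter.
rows = [
    {"isbn": "1", "title": "A", "chapter": 1},
    {"isbn": "1", "title": "A", "chapter": 2},
    {"isbn": "2", "title": "B", "chapter": 1},
]

print(fd_holds(rows, ["isbn"], ["title"]))              # True
print(indicates_redundancy(rows, ["isbn"], ["title"]))  # True
print(fd_holds(rows, ["isbn"], ["chapter"]))            # False
```

In this sketch the dependency `isbn -> title` holds while `isbn` is not a key, so the title is repeated once per chapter; this is the kind of redundancy that a normalization step would remove by storing the title once per book.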

Corresponding author

Correspondence to Cong Yu.

About this article

Cite this article

Yu, C., Jagadish, H.V. XML schema refinement through redundancy detection and normalization. The VLDB Journal 17, 203–223 (2008). https://doi.org/10.1007/s00778-007-0063-0

  • DOI: https://doi.org/10.1007/s00778-007-0063-0
