Schema Design and Normalization Algorithm for XML Databases Model

Abstract: In this paper, we study the problem of schema design and normalization in the XML database model. We show that, as in relational databases, XML documents may contain redundant information, and that this redundancy may cause update anomalies. Such problems are caused by certain functional dependencies among paths in the document. Building on our earlier work, in which we presented functional dependencies and normal forms for XML Schema, we present a decomposition algorithm that converts any XML Schema into a normalized one that satisfies X-BCNF.


I. INTRODUCTION
The eXtensible Markup Language (XML) has recently emerged as a standard for data representation and interchange on the Internet [1]. Although many XML documents are views of relational data, the number of applications using native XML documents is increasing rapidly. Such applications may use native XML storage facilities [2] and update XML data [3]. As in relational databases, updates may cause anomalies if data is redundant. In the relational world, anomalies are avoided by developing a well-designed database schema. XML has its own schema languages too, such as DTD (Document Type Definition) and XML Schema [4]. Our goal is to find the principles for good XML Schema design. We believe that it is important to do this research now, as a lot of data is being put on the web. Once massive web databases are created, it is very hard to change their organization; thus, there is a risk of having large amounts of widely accessible, but at the same time poorly organized, data.
Normalization is a process that eliminates redundancy, organizes data efficiently, and improves data consistency. Whereas normalization in the relational world has been explored extensively, it is a new research area for native XML databases. Even though native XML databases mainly work with document-centric XML documents, and the structure of XML documents might differ from one to another, there is room for redundant information. This redundancy may affect document updates, query efficiency, etc. Figure 1 shows an overview of the XML normalization algorithms that we propose [10][11][12].
This paper focuses on normal form theory. This theory concerns the old question of what makes a database well designed or, in other words, the syntactic characterization of semantically desirable properties. These properties are tightly connected with dependencies such as keys, functional dependencies, weak functional dependencies, equality generating dependencies, multi-valued dependencies, inclusion dependencies, join dependencies, etc. All these classes of dependencies have been deeply investigated in the context of the relational data model [5][6][7][8]. This work now needs to be generalized to the XML (tree-like) model.
Our goal is to apply the concepts of relational database normalization to XML Schema design. We show how to transform an XML Schema X, together with a set of functional dependencies F, into a new specification (X', F') that is in XML normal form (X-BCNF) and contains the same information.

Figure 1. An overview of the XML normalization algorithms

II. MOTIVATING EXAMPLE
In this section, through an example, we show that, as in relational databases, XML documents may contain redundant information, and that this redundancy may cause update anomalies.
Example 1: Consider the following XML Schema that describes a part of a "university" database. For every course, we store its number (cno), its title and the list of students taking the course. For each student taking a course, we store the student number (sno), name, and the grade in the course.
An example of an XML document (tree) that conforms to this XML Schema is shown in Figure 2 [9]. This document satisfies the following constraint: "any two student elements with the same sno value must have the same name". This constraint (which looks like an FD) causes the document to store redundant information: for example, the name Deere for student st1 is stored twice. As in relational databases, such redundancies can lead to update anomalies: for example, updating the name of st1 for only one course results in an inconsistent document, and removing the student from a course may result in removing that student from the document altogether.
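The redundancy and the update anomaly described above can be illustrated with a minimal Python sketch. The document below is only an assumed instance in the shape that Example 1 describes; the course titles, the grades, and the name Smith are made up for illustration, and the helper names (`names_by_sno`, `redundant_snos`, `inconsistent_snos`) are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Illustrative document in the shape described in Example 1.
# Titles, grades and the name "Smith" are assumptions made for this sketch.
DOC = """
<courses>
  <course cno="csc200">
    <title>Automata Theory</title>
    <taken_by>
      <student sno="st1"><name>Deere</name><grade>A+</grade></student>
      <student sno="st2"><name>Smith</name><grade>B-</grade></student>
    </taken_by>
  </course>
  <course cno="mat100">
    <title>Calculus I</title>
    <taken_by>
      <student sno="st1"><name>Deere</name><grade>A-</grade></student>
      <student sno="st3"><name>Smith</name><grade>B+</grade></student>
    </taken_by>
  </course>
</courses>
"""

def names_by_sno(root):
    """Map each student number to the list of names stored for it."""
    found = defaultdict(list)
    for student in root.iter("student"):
        found[student.get("sno")].append(student.findtext("name"))
    return dict(found)

def redundant_snos(root):
    """Student numbers whose name is stored more than once."""
    return sorted(s for s, names in names_by_sno(root).items() if len(names) > 1)

def inconsistent_snos(root):
    """Student numbers violating 'same sno implies same name'."""
    return sorted(s for s, names in names_by_sno(root).items() if len(set(names)) > 1)

root = ET.fromstring(DOC)
print(redundant_snos(root))     # the name of st1 is stored twice
print(inconsistent_snos(root))  # no violation yet

# Update anomaly: rename st1 in only one of the two courses.
root.find("course/taken_by/student[@sno='st1']/name").text = "Deer"
print(inconsistent_snos(root))  # st1 now carries two different names
```

The check works on a single document instance only; it says nothing about which dependencies hold for every document conforming to the schema.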
In order to eliminate redundant information, we use a technique similar to the relational one, and split the information about the name and the grade. Since we deal with just one XML document, we must do it by creating an extra element of complexType, called info, for student information, as shown in the figure below.
Each info element has (as children) one name and a sequence of number elements, with sno as an attribute. Different students can have the same name, and we group all student numbers sno for each name under the same info element. A restructured document that conforms to this XML Schema is shown in Figure 3 [9]. Note that st2 and st3 are put together because both students have the same name.
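As a rough illustration of this restructuring, the following Python sketch moves the names out of the student elements and into info elements grouped by name. The input document is an assumed instance (titles, grades, and the name Smith are invented), and `move_names_to_info` is a hypothetical helper, not part of the method's formal description.

```python
import xml.etree.ElementTree as ET

# Assumed instance of the schema of Example 1; the values are illustrative.
DOC = """<courses>
  <course cno="csc200"><title>Automata Theory</title><taken_by>
    <student sno="st1"><name>Deere</name><grade>A+</grade></student>
    <student sno="st2"><name>Smith</name><grade>B-</grade></student>
  </taken_by></course>
  <course cno="mat100"><title>Calculus I</title><taken_by>
    <student sno="st1"><name>Deere</name><grade>A-</grade></student>
    <student sno="st3"><name>Smith</name><grade>B+</grade></student>
  </taken_by></course>
</courses>"""

def move_names_to_info(root):
    """Restructure in place: drop name from student, add one info per name."""
    snos_by_name = {}  # name -> distinct student numbers, in document order
    for student in list(root.iter("student")):
        name_el = student.find("name")
        snos = snos_by_name.setdefault(name_el.text, [])
        if student.get("sno") not in snos:
            snos.append(student.get("sno"))
        student.remove(name_el)  # the grade stays with the student
    for name, snos in snos_by_name.items():
        info = ET.SubElement(root, "info")
        for sno in snos:
            ET.SubElement(info, "number", sno=sno)
        ET.SubElement(info, "name").text = name
    return root

root = move_names_to_info(ET.fromstring(DOC))
for info in root.findall("info"):
    print(info.findtext("name"), [n.get("sno") for n in info.findall("number")])
```

After the transformation each name is stored once, and st2 and st3 share one info element because they have the same name.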
This example recalls the poor relational designs caused by non-key FDs, and how the database designer solves the problem by modifying the schema.

III. PRELIMINARY DEFINITIONS
To extend the notions of FDs to the XML model, we represent XML trees as sets of tuples [9], and find the correspondence between documents and relations that leads to the definition of functional dependency.
We first describe the formal definitions of XML Schema (XSchema) and of the conformance of an XML tree to an XSchema. The definition of XSchema is based on the regular tree grammar theory introduced in [14]. Assume that we have the following disjoint sets:
- Ê: set of element names,
- Â: set of attribute names,
- DΤ: set of atomic data types (e.g., ID, IDREF, IDREFS, string, integer, date, etc.),
- Str: set of possible values of string-valued attributes,
- Vert: set of node identifiers.
All attribute names start with the symbol @. The symbols φ and S represent the element type declarations EMPTY (null) and #PCDATA, respectively.
Definition 1 (XSchema):
An XSchema is denoted by a 6-tuple X = (E, A, M, P, r, ∑), where:
• E ⊆ Ê is a finite set of element names.
• A ⊆ Â is a finite set of attribute names.
• M is a function from E to its element type definitions, i.e., M(e) = α, where e ∈ E and α is a regular expression in which ε denotes the empty element, t ∈ DΤ is an atomic data type, "+" denotes union, "," denotes concatenation, α* denotes the Kleene closure, α? abbreviates (α + ε), and α+ abbreviates (α, α*).
• P is a function from an attribute name a to its attribute type definition, i.e., P(a) = β, where β is a 4-tuple (t, n, d, f) in which:
  - t ∈ DΤ,
  - n is either "?" (nullable) or "¬?" (not nullable),
  - d is a finite set of valid domain values of a, or ε if not known, and
  - f is a default value of a, or ε if not known.
• r ⊆ E is a finite set of root elements.
• ∑ is a finite set of integrity constraints for the XML model. The integrity constraints we consider are keys (primary keys, foreign keys, …) and dependencies (functional and inclusion).

Definition 2 (Path in XSchema):
Given an XSchema X = (E, A, M, P, r, ∑), a string p = p_1 … p_n is a path in X if p_1 = r, p_i is in the alphabet of M(p_{i−1}) for each i ∈ [2, n − 1], and p_n is in the alphabet of M(p_{n−1}) or p_n = @l for some @l ∈ P(p_{n−1}). We define length(p) as n and last(p) as p_n. We let paths(X) stand for the set of all paths in X, and EPaths(X) for the set of all paths that end with an element type (rather than an attribute or S), that is: EPaths(X) = {p ∈ paths(X) | last(p) ∈ E}.
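To make paths(X) and EPaths(X) concrete, here is a small Python sketch that enumerates them for a non-recursive schema. The dictionary encoding (`CHILDREN`, `ATTRIBUTES`, `ROOT`) is a hypothetical simplification that ignores occurrence constraints, and the "university" schema it encodes is an assumption reconstructed from the description in Example 1.

```python
# Hypothetical encoding of a non-recursive schema: for each element name,
# the names that occur in its content model ("S" stands for #PCDATA),
# and separately its attribute names.
CHILDREN = {
    "courses": ["course"],
    "course": ["title", "taken_by"],
    "title": ["S"],
    "taken_by": ["student"],
    "student": ["name", "grade"],
    "name": ["S"],
    "grade": ["S"],
}
ATTRIBUTES = {"course": ["@cno"], "student": ["@sno"]}
ROOT = "courses"

def paths(children, attributes, root):
    """Enumerate paths(X) as dot-separated strings (non-recursive schemas only)."""
    result = []

    def walk(prefix, element):
        result.append(prefix)
        for attr in attributes.get(element, []):
            result.append(f"{prefix}.{attr}")
        for child in children.get(element, []):
            if child == "S":
                result.append(f"{prefix}.S")
            else:
                walk(f"{prefix}.{child}", child)

    walk(root, root)
    return result

def epaths(all_paths):
    """EPaths(X): the paths whose last step is an element name."""
    def last(p):
        return p.split(".")[-1]
    return [p for p in all_paths if last(p) != "S" and not last(p).startswith("@")]

ALL_PATHS = paths(CHILDREN, ATTRIBUTES, ROOT)
for p in ALL_PATHS:
    print(p)
```

On a recursive schema the walk would not terminate, which mirrors the fact that the set of paths is then infinite.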

Definition 3 (XML Tree):
An XML tree T is defined to be a tree T = (V, lab, ele, att, root), where:
• V ⊆ Vert is a finite set of vertices (nodes).
• lab: V → Ê assigns an element name to each vertex.
• ele: V → Str ∪ V* assigns to each vertex either a string or a sequence of child vertices.
• att is a partial function V × Â → Str. For each v ∈ V, the set {@l ∈ Â | att(v, @l) is defined} is required to be finite.
• root ∈ V is called the root of T.
The parent-child edge relation on V, {(v_1, v_2) | v_2 occurs in ele(v_1)}, is required to form a rooted tree. Note that the children of an element node can be either zero or more element nodes or one string.

Definition 4 (Path in XML Tree):
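Given an XML tree T, a string p_1 … p_n with p_1, …, p_{n−1} ∈ Ê and p_n ∈ Ê ∪ Â ∪ {S} is a path in T if there are vertices v_1, …, v_{n−1} ∈ V such that v_1 = root, v_{i+1} is a child of v_i, and lab(v_i) = p_i. If p_n ∈ Ê, then v_{n−1} has a child v_n with lab(v_n) = p_n. If p_n = @l, with @l ∈ Â, then att(v_{n−1}, @l) is defined. If p_n = S, then v_{n−1} has a child in Str.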
• We let paths(T) stand for the set of paths in T.
Now, we give a definition of a tree conforming to the XSchema (T╞ X), and a tree compatible with X (T ⊲ X).

Definition 7:
Given two XML trees T_1 and T_2, we say that T_1 is equivalent to T_2, written T_1 ≡ T_2, iff T_1 ≤ T_2 and T_2 ≤ T_1 (i.e., T_1 ≡ T_2 iff T_1 and T_2 are equal as unordered trees).
• We define [T] to be the ≡-equivalence class of T.
In the following definition, we extend the notion of tuple for relational databases to the XML model. In a relational database, a tuple is a function that assigns to each attribute a value from the corresponding domain. In our setting, a tree tuple t in an XML Schema X is a function that assigns to each path in X a value in Vert ∪ Str ∪ {φ} in such a way that t represents a finite tree with paths from X containing at most one occurrence of each path. In this section, we show that an XML tree can be represented as a set of tree tuples.

Definition 8 (Tree tuples):
T(X) is defined to be the set of all tree tuples in X. For a tree tuple t and a path p, we write t.p for t(p).

Example 2:
Suppose that X is the XML Schema shown in Example 1. Then a tree tuple in X assigns values to each path in paths(X) such as:

Definition 9 (tree_X):
Given XML Schema X = (E, A, M, P, r, ∑) and a tree tuple t ∈ T(X), tree_X(t) is defined to be an XML tree (V, lab, ele, att, root), where:
• If v = t.p, @l ∈ A and t.p.@l ≠ φ, then att(v, @l) = t.p.@l.

Example 3:
Let X be the XML Schema from Example 1 and t the tree tuple from Example 2. Then, t gives rise to the following XML tree:

We would like to describe XML trees in terms of the tuples they contain. For this, we need to select tuples containing the maximal amount of information. This is done via the usual notion of ordering on tuples (relations).
• If we have two tree tuples t_1, t_2, we write t_1 ⊆ t_2 if whenever t_1.p is defined, then t_2.p is also defined, and t_1.p ≠ φ implies t_1.p = t_2.p.
• As usual, t_1 ⊂ t_2 means t_1 ⊆ t_2 and t_1 ≠ t_2.
• Given two sets of tree tuples, Y and Z, we write Y ⊆^b Z if for every t_1 ∈ Y there exists t_2 ∈ Z such that t_1 ⊆ t_2.

Definition 10 (tuples_X):
Given XML Schema X and an XML tree T such that T ⊲ X, tuples_X(T) is defined to be the set of maximal tree tuples t (with respect to ⊆) such that tree_X(t) is subsumed by T, that is: tuples_X(T) = max_⊆ {t ∈ T(X) | tree_X(t) ≤ T}.
• Hence, tuples_X applies to equivalence classes: tuples_X([T]) = tuples_X(T).
• The following proposition lists some simple properties of tuples_X(·).

Proof. We prove only monotonicity. Suppose that T_1 ≤ T_2 and t_1 ∈ tuples_X(T_1). We have to prove that there exists t_2 ∈ tuples_X(T_2) such that t_1 ⊆ t_2. If t_1 ∈ tuples_X(T_2), this is obvious, so assume that t_1 ∉ tuples_X(T_2). Given that t_1 ∈ tuples_X(T_1), tree_X(t_1) ≤ T_1, and therefore tree_X(t_1) ≤ T_2. Hence, by the definition of tuples_X(·), there exists t_2 ∈ tuples_X(T_2) such that t_1 ⊂ t_2, since t_1 ∉ tuples_X(T_2). □

Example 4:
In Example 1, we saw the XML Schema X and a tree T conforming to X. In Example 2, we saw one tree tuple t for that tree, with identifiers assigned to some of the element nodes of T. If we assign identifiers to the rest of the nodes, we can compute the set tuples_X(T).

Finally, we define the trees represented by a set of tuples Y as the minimal, with respect to ≤, trees containing all tuples in Y.
Definition 11 (trees_X):
Given XML Schema X and a set of tree tuples Y ⊆ T(X), trees_X(Y) is defined to be the set of minimal trees, with respect to ≤, that contain all tuples in Y: trees_X(Y) = min_≤ {T | T ⊲ X and tree_X(t) ≤ T for every t ∈ Y}. Notice that, if T ∈ trees_X(Y) and T' ≡ T, then T' is in trees_X(Y). The following shows that every XML document can be represented as a set of tree tuples, if we consider it as an unordered tree. That is, a tree T can be reconstructed from tuples_X(T), up to equivalence ≡.

Theorem 1. Given XML Schema X and an XML tree T, if T ⊲ X, then trees_X(tuples_X([T])) = [T].
Proof: Every XML tree is finite, and, therefore,

Let t_1, t_2 ∈ T(X) be defined as:

Since t_1.r ≠ t_2.r, there is no XML tree T such that tree_X(t_1) ≤ T and tree_X(t_2) ≤ T.
• We say that Y ⊆ T(X) is X-compatible if there is an XML tree T such that T ⊲ X and Y ⊆ tuples_X(T).
• For an X-compatible set of tree tuples Y, there is always an XML tree T such that tree_X(t) ≤ T for every t ∈ Y.
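The representation of a tree as a set of maximal tree tuples can be sketched in Python as follows. This is only an approximation under stated assumptions: a tuple is a dictionary from paths to values, a path that is absent from the dictionary plays the role of the null value φ, only paths that occur in the document are considered (the schema is not consulted), and node identifiers are generated by a counter. The function name `tree_tuples` and the sample document are hypothetical.

```python
import itertools
import xml.etree.ElementTree as ET

# Assumed instance of the schema of Example 1; the values are illustrative.
DOC = """<courses>
  <course cno="csc200"><title>Automata Theory</title><taken_by>
    <student sno="st1"><name>Deere</name><grade>A+</grade></student>
    <student sno="st2"><name>Smith</name><grade>B-</grade></student>
  </taken_by></course>
  <course cno="mat100"><title>Calculus I</title><taken_by>
    <student sno="st1"><name>Deere</name><grade>A-</grade></student>
    <student sno="st3"><name>Smith</name><grade>B+</grade></student>
  </taken_by></course>
</courses>"""

def tree_tuples(node, path, counter):
    """Maximal tree tuples of the subtree at node, as dicts path -> value."""
    base = {path: f"v{next(counter)}"}            # element path -> node identifier
    for attr, value in node.attrib.items():
        base[f"{path}.@{attr}"] = value           # attribute path -> string
    if node.text and node.text.strip():
        base[f"{path}.S"] = node.text.strip()     # string content
    # A tuple uses at most one child per element name, so the alternatives
    # for each name are pooled and then combined across names.
    options = {}
    for child in node:
        options.setdefault(child.tag, []).extend(
            tree_tuples(child, f"{path}.{child.tag}", counter))
    result = []
    for combo in itertools.product(*options.values()):
        merged = dict(base)
        for part in combo:
            merged.update(part)
        result.append(merged)
    return result

root = ET.fromstring(DOC)
TUPLES = tree_tuples(root, root.tag, itertools.count())
CNO = "courses.course.@cno"
SNO = "courses.course.taken_by.student.@sno"
NAME = "courses.course.taken_by.student.name.S"
for t in TUPLES:
    print(t[CNO], t[SNO], t[NAME])
```

For this document the sketch yields one tuple per (course, student) pair, and every tuple shares the same identifier for the root.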

Proposition 3. If Y ⊆ T (X) is X-compatible, then:
(a) There is an XML tree T such that T ⊲ X and trees X

If t ∈ tuples X ([T]), then the property holds trivially. Suppose that t∉ tuples X ([T]). Then, given that
In either case, we conclude that there is t' ∈ tuples_X(trees_X(Y)) such that t ⊆ t'. □

The example below shows that it could be the case that tuples_X(trees_X(Y)) properly dominates Y, that is, Y ⊆^b tuples_X(trees_X(Y)) and tuples_X(trees_X(Y)) ⊈^b Y. In particular, this example shows that the inverse of Theorem 1 does not hold, that is, tuples_X(trees_X(Y)) is not necessarily equal to Y for every set of tree tuples Y, even if this set is X-compatible. Let X be as in Example 5 and t_1, t_2 ∈ T(X) be defined as:

Let t_3 be a tree tuple defined as:

Then, tuples_X(trees_X({t_1, t_2})) = {t_3} since t_1 ⊂ t_3 and t_2 ⊂ t_3, and, therefore, {t_1, t_2} ⊆^b tuples_X(trees_X({t_1, t_2})) and tuples_X(trees_X({t_1, t_2})) ⊈^b {t_1, t_2}.

IV. NORMAL FORMS OF XML SCHEMA
In this section, and by using the definitions of the previous sections, we present the normal forms of XML documents. Our goal is to see what relational concepts we can usefully apply to XML. Can the normal forms that guide database design be applied meaningfully to XML document design?

Definition 12 (functional dependencies):
Given an XML Schema X, a functional dependency (FD) over X is an expression of the form S_1 → S_2, where S_1 and S_2 are nonempty subsets of paths(X). The set of all FDs over X is denoted by FD(X).
• For S ⊆ paths(X) and t, t' ∈ T(X), t.S = t'.S means t.p = t'.p for all p ∈ S.

Definition 13: If S_1 → S_2 ∈ FD(X) and T is an XML tree
• Note that if tree tuples t_1, t_2 satisfy an FD S_1 → S_2, then for every path p ∈ S_2, t_1.p and t_2.p are either both null or both not null.

Definition 14:
If for every pair of tree tuples t_1, t_2 in an XML tree T, t_1.S_1 = t_2.S_1 implies they have a null value on some p ∈ S_1, then the FD is trivially satisfied by T.
• The previous definitions extend to equivalence classes, since, for any FD f and T ≡ T', T ╞ f iff T' ╞ f.
• We write T ╞ F, for F ⊆ FD(X), if T ╞ f for each f ∈ F, and we write T ╞ (X, F) if T ╞ X and T ╞ F.

Definition 15:
Given XML Schema X, a set F ⊆ FD(X) and f ∈ FD(X), we say that (X, F) implies f, written (X, F) ┝ f, if for any tree T with T ╞ X and T ╞ F, it is the case that T ╞ f. The set of all FDs implied by (X, F) will be denoted by (X, F)^+.
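A check of FD satisfaction over a set of tree tuples can be sketched in Python as below. The sketch assumes the usual reading of satisfaction, namely that any two tuples which agree on S_1 with no null values on S_1 must also agree on S_2; this reading, the dictionary encoding of tuples (an absent path stands for the null value), and the hand-written sample tuples are assumptions made for illustration.

```python
STUDENT = "courses.course.taken_by.student"
SNO = "courses.course.taken_by.student.@sno"
NAME = "courses.course.taken_by.student.name.S"

# Hand-written tree tuples restricted to three paths; v1..v4 are
# made-up node identifiers for four distinct student nodes.
TUPLES = [
    {STUDENT: "v1", SNO: "st1", NAME: "Deere"},
    {STUDENT: "v2", SNO: "st2", NAME: "Smith"},
    {STUDENT: "v3", SNO: "st1", NAME: "Deere"},
    {STUDENT: "v4", SNO: "st3", NAME: "Smith"},
]

def satisfies(tuples, lhs, rhs):
    """Return True if the tuples satisfy the FD lhs -> rhs.

    A path missing from a tuple is treated as null. Pairs of tuples with a
    null value on the left-hand side never cause a violation.
    """
    for t1 in tuples:
        if any(t1.get(p) is None for p in lhs):
            continue
        for t2 in tuples:
            if all(t1.get(p) == t2.get(p) for p in lhs):
                if any(t1.get(p) != t2.get(p) for p in rhs):
                    return False
    return True

print(satisfies(TUPLES, [SNO], [NAME]))     # sno determines the name string
print(satisfies(TUPLES, [SNO], [STUDENT]))  # but not the student node
print(satisfies(TUPLES, [NAME], [SNO]))     # two students share a name
```

The first two calls show the pattern behind the redundancy in the motivating example: the student number determines the name value without determining the student node that carries it.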

A. Primary and Foreign Keys of XML Schema
In this section, we present the definitions of the primary and foreign keys of the XML Schema. We observe that while there are important differences between the XML and relational models, much of the thinking that commonly goes into relational database design can be applied to XML Schema design as well.
Definition 17 (key, foreign key, and superkey):
Let X = (E, A, M, P, r, ∑) be an XML Schema. A constraint in ∑ over X has one of the following forms:
• key: e(l) → e, where e ∈ E and l is a set of attributes in P(e). It indicates that the set l of attributes is a key of e elements.
• foreign key: e_1(l_1) ⊆ e_2(l_2) and e_2(l_2) → e_2, where e_1, e_2 ∈ E, and l_1, l_2 are non-empty sequences of attributes in P(e_1), P(e_2), respectively, of the same length. This constraint indicates that l_1 is a foreign key of e_1 elements referencing key l_2 of e_2 elements.
• A constraint of the form e_1(l_1) ⊆ e_2(l_2) is called an inclusion constraint.
• Observe that a foreign key is actually a pair of constraints, namely an inclusion constraint e_1(l_1) ⊆ e_2(l_2) and a key e_2(l_2) → e_2.
• superkey: suppose that e ⊆ E and, for any two distinct paths p_1 and p_2 in the XML Schema X, we have the constraint p_1(e) ≠ p_2(e). The subset e is then called a superkey of X.
• Every XML Schema has at least one default superkey: the set of all its elements.
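The key and inclusion constraints above can be checked on a document instance with a short Python sketch. It assumes that constraints are evaluated over the whole document (not relative to a context node) and that a missing attribute counts as an ordinary value; the helper names `key_holds` and `inclusion_holds` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Illustrative document in the restructured shape of the motivating example.
DOC = """<courses>
  <course cno="csc200"><taken_by>
    <student sno="st1"/><student sno="st2"/>
  </taken_by></course>
  <course cno="mat100"><taken_by>
    <student sno="st1"/><student sno="st3"/>
  </taken_by></course>
  <info><number sno="st1"/><name>Deere</name></info>
  <info><number sno="st2"/><number sno="st3"/><name>Smith</name></info>
</courses>"""

def key_holds(root, element, attrs):
    """Key e(l) -> e: no two e elements agree on all attributes in l."""
    seen = set()
    for node in root.iter(element):
        value = tuple(node.get(a) for a in attrs)
        if value in seen:
            return False
        seen.add(value)
    return True

def inclusion_holds(root, e1, l1, e2, l2):
    """Inclusion e1(l1) ⊆ e2(l2): every l1 value of e1 occurs as an l2 value of e2."""
    targets = {tuple(n.get(a) for a in l2) for n in root.iter(e2)}
    return all(tuple(n.get(a) for a in l1) in targets for n in root.iter(e1))

def foreign_key_holds(root, e1, l1, e2, l2):
    """A foreign key is the pair: inclusion constraint plus key on the target."""
    return inclusion_holds(root, e1, l1, e2, l2) and key_holds(root, e2, l2)

root = ET.fromstring(DOC)
print(key_holds(root, "course", ["cno"]))
print(key_holds(root, "student", ["sno"]))
print(foreign_key_holds(root, "student", ["sno"], "number", ["sno"]))
```

In this instance @sno is not a key of student elements (st1 occurs twice), but it is a key of number elements, so student(@sno) can act as a foreign key referencing number(@sno).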

B. First Normal Form for XML Schema (X-1NF)
First normal form (1NF) is now considered to be a part of the formal definition of a relation in the basic relational database model. Historically, it was defined as: "The domain of an attribute in a tuple must be a single value from the domain of that attribute" [13].
Of course, XML is hierarchical by nature. An XML "tuple" can deviate from first normal form in several ways, all of which are valid from a data-modeling point of view:

C. Second Normal Form of XML Schema (X-2NF)
X-2NF is based on the concept of full functional dependency.

Definition 18:
A FD S_1 → S_2 ∈ FD(X) is called a full FD if the removal of any element path p from S_1 means that the dependency no longer holds (i.e., for any p ∈ S_1, (S_1 − {p}) does not functionally determine S_2).

Definition 19:
A FD S_1 → S_2 is called a partial dependency if, for some p ∈ S_1, (S_1 − {p}) → S_2 holds.
• The test for X-2NF involves testing for FDs whose left-hand sides are part of the primary key. If the primary key contains a single element path, the test need not be applied at all.

D. Third Normal Form of XML Schema (X-3NF)
X-3NF is based on the concept of transitive dependency. A FD S_1 → S_2 ∈ FD(X) is a transitive dependency if there is a set of paths Z (that is neither a key nor a subset of any key of X) such that both S_1 → Z and Z → S_2 hold.

Example 10:
The XML Schema Emp_Dept in the above example is in X-2NF (since no partial dependencies on a key element exist), but Emp_Dept is not in X-3NF.

The Boyce-Codd normal form of XML Schema (X-BCNF) was proposed as a form similar to X-3NF, but it was found to be stricter than X-3NF: every XML Schema in X-BCNF is also in X-3NF, whereas an XML Schema in X-3NF is not necessarily in X-BCNF. An XML Schema X is in X-BCNF if, whenever a nontrivial FD S_1 → S_2 ∈ FD(X) holds, then S_1 is a superkey of X.

Definition 21:
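Also, we can consider the following definition of X-BCNF:

Definition 24:
Given XML Schema X and F ⊆ FD(X), (X, F) is in X-BCNF if for every FD f ∈ (X, F)^+ of the form S → p.@l, it is the case that S → p ∈ (X, F)^+. In Definition 24, we suppose that f is a nontrivial FD.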
Indeed, the trivial FD p.@l → p.@l is always in (X, F)^+, but often p.@l → p ∉ (X, F)^+, which does not necessarily represent a bad design. To show how X-BCNF distinguishes good XML design from bad design, we consider Example 1 again, when only functional dependencies are provided.
Example 11:
Consider the XML Schema from Example 1 whose FDs are FD1, FD2, and FD3, shown in Example 6. FD3 associates a unique name with each student number, so the name stored with each occurrence of a student is redundant. The design is not in X-BCNF, since it contains FD3 but does not imply the functional dependency:
courses.course.taken_by.student.@sno → courses.course.taken_by.student.name
To solve this problem, we gave a revised XML Schema in Example 1. The idea was to create a new element info for storing information about students. That design satisfies FD1 and FD2, as well as
courses.info.number.@sno → courses.info
and can be easily verified to be in X-BCNF.

V. NORMALIZATION ALGORITHM
The goal of this section is to show how to transform an XML Schema X and a set of FDs F into a new specification (X', F') that is in X-BCNF and contains the same information.
Throughout the section, we assume that the XML Schemas are non-recursive. This can be done without any loss of generality. Notice that in a recursive XML Schema X, the set of all paths is infinite. We make an additional assumption that all the FDs are of the form {q, p_1.@l_1, …, p_n.@l_n} → p. That is, they contain at most one element path on the left-hand side. While constraints of the form {q, q', …} are not forbidden, they appear to be quite unnatural. Furthermore, even if we have such constraints, they can be easily eliminated. To do so, we create a new attribute @l, remove {q, q'} ∪ S → p, and replace it by q'.@l → q' and {q, q'.@l} ∪ S → p.
We shall also assume that paths do not contain the symbol S (since p.S can always be replaced by a path of the form p.@l ).
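The rewriting that removes a second element path from the left-hand side of an FD can be sketched in Python. The encoding of an FD as a pair (set of paths, path), the function name `split_element_paths`, the attribute name @id, and the sample FD are all hypothetical and serve only to illustrate the substitution described above.

```python
def split_element_paths(lhs, rhs, q, q_prime, new_attr):
    """Rewrite {q, q'} ∪ S -> p into q'.@l -> q' and {q, q'.@l} ∪ S -> p.

    lhs is an iterable of paths, rhs a single path, and new_attr the name
    of the freshly created attribute @l (including the leading '@').
    """
    lhs = frozenset(lhs)
    if q not in lhs or q_prime not in lhs:
        raise ValueError("both element paths must occur on the left-hand side")
    rest = lhs - {q, q_prime}                    # the set S
    key_path = f"{q_prime}.{new_attr}"           # the new path q'.@l
    identifying_fd = (frozenset({key_path}), q_prime)
    rewritten_fd = (frozenset({q, key_path}) | rest, rhs)
    return identifying_fd, rewritten_fd

# Made-up FD with two element paths on its left-hand side.
LHS = {"courses", "courses.course", "courses.course.taken_by.student.@sno"}
RHS = "courses.course.taken_by.student"
identifying, rewritten = split_element_paths(
    LHS, RHS, "courses", "courses.course", "@id")
print(sorted(identifying[0]), "->", identifying[1])
print(sorted(rewritten[0]), "->", rewritten[1])
```

After the rewriting, each FD has at most one element path on its left-hand side, at the cost of one extra attribute that identifies q' nodes.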

A. The Decomposition Algorithm
For introducing the decomposition algorithm, we make the following assumption: if S → p.@l is an FD that causes a violation of X-BCNF, then every time that p.@l is not null, every path in S is not null. This will make our presentation simpler.
Given XML Schema X and a set of FDs F, a nontrivial FD S → p.@l is called anomalous, over (X, F), if it violates X-BCNF; that is, S → p.@l ∈ (X, F)^+ but S → p ∉ (X, F)^+. A path on the right-hand side of an anomalous FD is called an anomalous path, and the set of all such paths is denoted by APath(X, F).
In this sub-section we present an X-BCNF decomposition algorithm that combines two basic ideas: creating a new element type, and moving an attribute.

1) Creating New Element Types
Let X = (E, A, M, P, r, ∑) be an XML Schema and F a set of FDs over X. Assume that (X, F) contains an anomalous FD {q, p_1.@l_1, …, p_n.@l_n} → p.@l, where q ∈ EPaths(X) and n ≥ 1. For example, the "university" database shown in Example 1 contains an anomalous FD of this form (considering name.S as an attribute of student):
{courses, courses.course.taken_by.student.@sno} → courses.course.taken_by.student.name.S    (1)
To eliminate the anomalous FD, we proceed as follows, as shown in Figure 4:
• we create a new element type τ as a child of the last element of q;
• we make τ_1, …, τ_n its children, where τ_1, …, τ_n are new element types;
• we remove @l from the list of attributes of last(p) and make it an attribute of τ;
• we make @l_1, …, @l_n attributes of τ_1, …, τ_n, respectively, but without removing them from the sets of attributes of last(p_1), …, last(p_n).
For instance, to eliminate the anomalous functional dependency (1) in Example 1, we create a new element type info as a child of courses, we remove name.S from student and make it an "attribute" of info, and we create an element type number as a child of info and make @sno its attribute. Note that we do not remove @sno as an attribute of student.
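The schema-level effect of creating a new element type can be sketched in Python on a simplified encoding. The dictionaries `CHILDREN` and `ATTRIBUTES` are a hypothetical representation that ignores occurrence constraints and string content, name.S is modelled as an attribute @name of student (following the remark above), and the sketch does not update the set of FDs, which the full transformation would also have to do.

```python
import copy

# Hypothetical encoding of the "university" schema, with name.S treated
# as the attribute @name of student.
CHILDREN = {
    "courses": ["course"],
    "course": ["title", "taken_by"],
    "title": [],
    "taken_by": ["student"],
    "student": ["grade"],
    "grade": [],
}
ATTRIBUTES = {"course": ["@cno"], "student": ["@sno", "@name"]}

def create_element_type(children, attributes, q_last, lhs, rhs, tau, tau_children):
    """Apply the 'new element type' step to a copy of the schema.

    q_last       last element of the element path q
    lhs          list of (element, attribute) pairs for p_1.@l_1, ..., p_n.@l_n
    rhs          (element, attribute) pair for p.@l
    tau          name chosen for the new element type
    tau_children names chosen for tau_1, ..., tau_n
    """
    children = copy.deepcopy(children)
    attributes = copy.deepcopy(attributes)
    children.setdefault(q_last, []).append(tau)   # tau becomes a child of last(q)
    children[tau] = list(tau_children)            # tau_1..tau_n become children of tau
    rhs_element, rhs_attr = rhs
    attributes[rhs_element].remove(rhs_attr)      # @l leaves last(p) ...
    attributes[tau] = [rhs_attr]                  # ... and becomes an attribute of tau
    for name, (_, attr) in zip(tau_children, lhs):
        children[name] = []
        attributes[name] = [attr]                 # @l_i is copied, not moved
    return children, attributes

NEW_CHILDREN, NEW_ATTRIBUTES = create_element_type(
    CHILDREN, ATTRIBUTES,
    q_last="courses",
    lhs=[("student", "@sno")],
    rhs=("student", "@name"),
    tau="info",
    tau_children=["number"],
)
print(NEW_CHILDREN["courses"], NEW_CHILDREN["info"])
print(NEW_ATTRIBUTES)
```

The result matches the informal description: info appears under courses with a number child carrying @sno, the name moves to info, and student keeps @sno.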

3) The Algorithm
The algorithm applies the two transformations introduced in the previous sections until the schema is in X-BCNF, as shown in Figure 6.
The algorithm shown in Figure 6 involves FD implication, that is, testing membership in (X, F)^+ (and consequently testing X-BCNF and (X, F)-minimality).
Since each step reduces the number of anomalous paths, we obtain:

Proposition 4. The X-BCNF decomposition algorithm terminates, and outputs a specification (X, F) in X-BCNF.

VI. CONCLUSION AND FUTURE WORKS
We addressed the problem of schema design and normalization in the XML database model. The main contributions of this paper are the proposed normal forms for XML Schema and the decomposition algorithm used to convert any XML Schema into a normalized one that satisfies X-BCNF.
The decomposition algorithm can be improved in various ways, and we plan to work on making it more efficient. We also would like to find a complete classification of the complexity of the FD implication problem for various classes of XML Schemas. We plan to work on extending XML Schema normal form to more powerful normal forms, in particular by taking into account multi-valued dependencies.