Mapping XML to RDF: An algorithm based on element classification and aggregation

Much of the data on the Web is exchanged and stored in XML. RDF supports rapid response and reasoning by aggregating large amounts of knowledge, and it is widely used in intelligent applications as a way of organizing data. This paper focuses on building RDF from XML. It analyzes the tree structure of XML and divides XML elements into three sub-models according to that structure; it then classifies and aggregates the elements, abstracts the aggregates into three abstract structure models, and finally proposes the corresponding mapping rules. Building on and optimizing existing methods, the paper proposes an XML-to-RDF mapping algorithm based on element classification and aggregation. Its contributions are threefold: optimized mapping rules that preserve the completeness and accuracy of data content and semantics; a general algorithm that handles different types of XML; and identification of equivalent XML elements, which removes the RDF data redundancy that such elements cause.


Introduction
RDF originated in the description of metadata. It was invented by Apple researcher Guha in 1997. The concept of the Semantic Web was proposed by Tim Berners-Lee [1]. The Semantic Web is organized as a graph of "links", but what is linked is no longer a web page but an entity in the real world, and the links themselves carry semantic descriptions. RDF is the data model used to organize semantic data in the Semantic Web, and RDF Schema [2] gives it semantics by defining the vocabulary used in RDF. As a product of the development of the Semantic Web, the knowledge graph has become a hot research topic. Although not every knowledge graph is built on RDF, the essence of a knowledge graph is to describe entities together with their attributes and relationships. In a knowledge graph, triples are called knowledge. RDF supports rapid response and reasoning by aggregating large amounts of knowledge. It is widely used in intelligent applications as a way of organizing data and has become a W3C recommended standard.
The Web holds a large amount of data with heterogeneous structure and no semantic information. Building RDF or knowledge graphs from such data has become an important research direction. XML is a markup language [3], and its data model is called the Document Object Model (DOM). The elements and attributes of an XML document store the data. XML is a unified format for data storage and exchange on the Web, and because of its flexible structure and strong portability, data described in XML is widely available on the Web. This paper focuses on building RDF from XML.

Related work
Several studies have investigated methods for constructing RDF from XML. Michel Klein proposed in [5] to map the data in an XML document to a sequence of RDF triples on the basis of RDF Schema. The RDF Schema specifies the classes and attributes of the RDF data that the mapping generates. During a depth-first traversal of the XML DOM tree, the relevant data in the XML document is selected according to those classes and attributes and mapped to RDF components. However, the classes and attributes are defined entirely by hand and depend on the user's judgment, so semantic information in the original data may be lost and irrelevant semantics may be added. Moreover, the mapping process is not fully automatic, and human factors affect the resulting RDF data.
Pham et al. proposed in [8] [10] a method that constructs an RDF Schema from a DTD or XML Schema and then builds an RDF mapping algorithm on the constructed RDF Schema. This method is more effective for XML documents constrained by a structure document (XML Schema or DTD), and it retains the implicit semantics of the data. The structure document restricts how data is organized in the XML document by specifying the data types used by its elements and attributes. The method analyzes the XML Schema or DTD to obtain the definitions of the elements and attributes of the constrained XML document, maps them to classes or attributes according to their data types and the related mapping rules, then traverses the XML DOM tree according to the constructed classes and attributes and converts the XML elements and attributes into RDF components. However, not all XML documents are constrained by a DTD or XML Schema, and for XML documents with a relatively free structure these methods do not apply. In addition, in [12], Pham et al. argue that elements that have different tag names in XML Schema but are declared with the same data type should describe similar objects. If they are mapped to different classes, the generated RDF data is redundant, so they proposed a method for calculating the semantic similarity of such elements. The similarity of two elements is computed from the semantic similarity of their child elements and parent elements. If the similarity exceeds a certain value, the differently tagged elements are considered to describe the same kind of object and can be mapped to the same class.
Paper [11] proposes a more general method. It divides XML documents into two types: the first type has no structural restrictions; the second type is bound by an XML Schema or DTD. The method traverses the DOM tree of a first-type XML document and decides how to map elements and attributes to different RDF components according to their mutual nesting relationships. However, because element tags in first-type XML documents are more flexible, elements with the same tag name can have different embedded structures, and the method may then map them to different RDF components, although elements with the same tag should not be mapped to different components of RDF.
Papers [13] [14] designed a data semantic model based on the XQuery and XPath query languages, providing a unified semantic model for the XML and RDF data models. Papers [15] [16] integrated the XML query language XPath into the RDF query language SPARQL, so that data in an XML document can be queried while querying RDF triples. These methods rely on the relevant query language to output query results in the form of RDF. Papers [17] [18] use a mapping language to map various types of data into RDF triples, including building RDF from XML. This approach works well when XML data is updated frequently, but different mapping requirements call for different mapping documents, and constructing mapping documents is cumbersome.
A comprehensive analysis of the existing methods reveals the following shortcomings: the mapping rules are unreasonable and do not faithfully reflect the semantic information contained in XML; each method handles only a single type of XML, so its generality and portability are poor; and equivalent elements are not identified effectively, so redundancy in the constructed RDF cannot be avoided. Starting from the structure and content of XML, this paper proposes the automatic construction of RDF based on XML element classification and aggregation.

Tree model for parsing XML
Based on the characteristics of XML, a well-formed XML document can be visualized as a tree model. Let E and A denote the sets of elements and attributes in an XML document, respectively. The elements and attributes are divided into three sub-models.
Definition 1: Some elements in XML have a text value as their embedded content and a start tag without attributes. Let E_s denote the set of all such elements in the XML document. A text node of the XML document together with the corresponding element tag node forms a simple sub-model. The start tags of some elements contain attributes; the map from an attribute name node to its attribute value node is also a simple sub-model.
◆Let N_s be the set of tag names of the elements in E_s, and let N_a be the set of names of the attributes in A. A simple sub-model is expressed mathematically as a map from a name in N_s or N_a to the corresponding text value or attribute value. The set of all simple sub-models in the XML document is denoted S.
Definition 2: Some elements embed several child elements, and their start tags may contain attributes. Let E_c denote the set of all such elements in the XML document. The map from an element tag node to {child-element modules, attribute modules} is abstracted as a complex sub-model.
◆Let N_c be the set of tag names of all elements in E_c. A complex sub-model is expressed mathematically as a map from a name in N_c to its child-element modules and attribute modules. The set of all complex sub-models is denoted C.
Definition 3: Some elements contain not only child elements but also text values, and their start tags may also contain attributes. Let E_m denote the set of all such elements in the XML document. The map from an element tag node to {child-element modules, attribute modules, text node} is abstracted as a compound sub-model.
◆Let N_m be the set of tag names of all elements in E_m. A compound sub-model is expressed mathematically as a map from a name in N_m to its child-element modules, attribute modules and text node. The set of all compound sub-models is denoted M.
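The three definitions can be illustrated with a minimal Python sketch that labels each element of a small document. The function names and the sample document are hypothetical. The definitions do not state how to treat a childless element that carries attributes, so the sketch assumes such an element is complex when it has no text and compound when it has text.

```python
import xml.etree.ElementTree as ET

def has_text(elem):
    """True if the element carries non-whitespace text, directly or between children."""
    if (elem.text or "").strip():
        return True
    return any((child.tail or "").strip() for child in elem)

def classify(elem):
    """Return the sub-model type of one element: 'simple', 'complex' or 'compound'."""
    has_children = len(elem) > 0
    has_attrs = len(elem.attrib) > 0
    if not has_children and not has_attrs:
        return "simple"      # Definition 1: text content, start tag without attributes
    if has_text(elem):
        return "compound"    # Definition 3 (assumption: also childless elements with attributes and text)
    return "complex"         # Definition 2 (assumption: also childless elements with attributes only)

# Hypothetical sample document, used only for illustration.
SAMPLE = """<library>
  <book id="b1">
    <title>XML Basics</title>
    <note>Out of print <year>1999</year></note>
  </book>
</library>"""

root = ET.fromstring(SAMPLE)
kinds = {elem.tag: classify(elem) for elem in root.iter()}
for tag, kind in kinds.items():
    print(tag, kind)
```

Here `note` is labeled compound because it mixes text with the child element `year`, while `book` is complex because it holds only child elements and an attribute.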

Aggregation class and abstract structure model
The elements in the XML document correspond to different sub-models. Elements with the same tag name are classified and aggregated into one aggregation class, and the sub-models of the elements in the same aggregation class are integrated to construct an abstract structure model.
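As a rough illustration of this aggregation step, the following Python sketch groups elements by tag name and merges the child tags and attribute names seen across each group. The dictionary layout (`children`, `attributes`, `has_text`, `count`) and the sample document are assumptions made for this example, not the paper's formal abstract structure model.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_aggregation_classes(root):
    """Group elements by tag name and merge the structure observed in each group."""
    classes = defaultdict(
        lambda: {"children": set(), "attributes": set(), "has_text": False, "count": 0}
    )
    for elem in root.iter():
        model = classes[elem.tag]
        model["count"] += 1
        # Union of the child tags and attribute names over all elements with this tag.
        model["children"].update(child.tag for child in elem)
        model["attributes"].update(elem.attrib.keys())
        if (elem.text or "").strip():
            model["has_text"] = True
    return dict(classes)

# Hypothetical sample: two <person> elements with different embedded structures.
SAMPLE = """<people>
  <person><name>Li</name><email>li@example.org</email></person>
  <person id="p2"><name>Wu</name><phone>555-0100</phone></person>
</people>"""

classes = build_aggregation_classes(ET.fromstring(SAMPLE))
print(sorted(classes["person"]["children"]))
print(sorted(classes["person"]["attributes"]))
```

Both `person` elements end up in one class whose merged structure covers `name`, `email`, `phone` and the attribute `id`, so elements with the same tag are not split across different RDF components.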

Equivalent elements are elements that belong to the same aggregation class and have the same embedded content. Such equivalent elements may cause data redundancy.
Definition 6: Every element in the current XML document corresponds to one of three sub-model types, so whether two elements e_i and e_j are equivalent is determined case by case, according to whether both are simple sub-models, both are complex sub-models, or both are compound sub-models. In each case, the two elements must belong to the same aggregation class and have the same embedded content.
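One way to test this kind of equivalence in code is to compare a recursive signature of each element. The Python sketch below is an assumption-laden illustration: `signature` and `equivalent` are hypothetical helpers, whitespace-only text is ignored, and child order is treated as significant, none of which is specified by the text above.

```python
import xml.etree.ElementTree as ET

def signature(elem):
    """Recursive, hashable summary of an element's tag, attributes, text and children."""
    return (
        elem.tag,                                   # same tag, hence same aggregation class
        tuple(sorted(elem.attrib.items())),         # attribute name/value pairs
        (elem.text or "").strip(),                  # leading text value
        tuple(
            (signature(child), (child.tail or "").strip())  # children plus mixed-content text
            for child in elem
        ),
    )

def equivalent(e_i, e_j):
    """Assumed test: two elements are equivalent when their signatures match."""
    return signature(e_i) == signature(e_j)

# Hypothetical sample: the first two <author> elements differ only in whitespace.
doc = ET.fromstring(
    "<root>"
    "<author><name>Li</name></author>"
    "<author>\n  <name>Li</name>\n</author>"
    "<author><name>Wu</name></author>"
    "</root>"
)
a, b, c = list(doc)
print(equivalent(a, b), equivalent(a, c))
```

Because the signature is hashable, it can also serve as a dictionary key when many elements have to be compared at once.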

Build aggregation class and its abstract structure model
The aggregation class is the basis for building RDF. This section introduces the process of obtaining aggregation classes. XML documents can be divided into two categories: in the first type, the element tags can be customized by the user, which limits generality across different Web applications; in the second type, the elements and attributes are restricted by an XML Schema.
(2) Processing of second-type XML
Unlike first-type XML, second-type XML is based on an XML Schema, which specifies the elements and attributes used in the XML document. The steps are as follows:
Ⅰ. Get the collections of the components in the XML Schema that are embedded sub-elements of the schema's root tag. GXSE and GXSA denote the collections of global elements and global attributes; two further collections hold the global simple data types and the global complex data types.
Ⅱ. Traverse the global collections, including GXSE and GXSA, to get the aggregation class collection, and obtain the abstract model of each aggregation class.
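Step Ⅰ can be sketched in Python with the standard library by reading only the direct children of the schema root. The keys `GXSE` and `GXSA` follow the names used above; the keys `simple_types` and `complex_types`, the function name and the sample schema are hypothetical.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"   # XML Schema namespace in ElementTree notation

# Map XML Schema component names to the collections they populate.
KINDS = {
    "element": "GXSE",
    "attribute": "GXSA",
    "simpleType": "simple_types",     # hypothetical collection name
    "complexType": "complex_types",   # hypothetical collection name
}

def global_components(schema_text):
    """Collect the names of the global components declared directly under <xs:schema>."""
    root = ET.fromstring(schema_text)
    found = {key: [] for key in KINDS.values()}
    for child in root:                              # direct children only, so global declarations
        if not child.tag.startswith(XS):
            continue
        kind = child.tag[len(XS):]
        if kind in KINDS and "name" in child.attrib:
            found[KINDS[kind]].append(child.get("name"))
    return found

# Hypothetical sample schema, used only for illustration.
SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="book" type="BookType"/>
  <xs:attribute name="lang" type="xs:string"/>
  <xs:simpleType name="ISBN"><xs:restriction base="xs:string"/></xs:simpleType>
  <xs:complexType name="BookType">
    <xs:sequence><xs:element name="title" type="xs:string"/></xs:sequence>
  </xs:complexType>
</xs:schema>"""

components = global_components(SCHEMA)
print(components)
```

The nested `title` declaration is not collected because it is local to `BookType` rather than a child of the schema root.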

Mapping rules for RDF triples
The rules and steps for constructing RDF triples from XML data entities are as follows:
Ⅰ. Traverse the elements and attributes of the XML document and attach a distinct ID to each.
Ⅱ. Traverse the DOM tree and adjust the IDs so that equivalent elements and attributes share the same ID.
Ⅲ. Traverse the adjusted DOM tree and store the elements that have already been mapped in OID. The RDF sequence is then constructed by rules that depend on the sub-model type of the current element e_j, starting with the case in which e_j is a simple sub-model.
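The three steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the triple shapes, the blank-node style IDs such as `_:n4`, and the helpers `signature`, `assign_ids` and `to_triples` are hypothetical and are not the paper's mapping rules. Steps Ⅰ and Ⅱ are folded into one pass, and a simple sub-model is assumed to map to a triple whose object is its text value.

```python
import xml.etree.ElementTree as ET

def signature(elem):
    """Hashable summary used to detect equivalent elements (assumed criterion)."""
    return (elem.tag, tuple(sorted(elem.attrib.items())), (elem.text or "").strip(),
            tuple(signature(child) for child in elem))

def assign_ids(root):
    """Steps I and II: one ID per element, shared by equivalent elements."""
    first_seen, ids = {}, {}
    for n, elem in enumerate(root.iter(), start=1):
        sig = signature(elem)
        first_seen.setdefault(sig, f"_:n{n}")   # keep the ID of the first equivalent element
        ids[elem] = first_seen[sig]
    return ids

def to_triples(root):
    """Step III: emit (subject, predicate, object) tuples, pruning mapped IDs."""
    ids, oid, triples = assign_ids(root), set(), []

    def visit(elem):
        subject = ids[elem]
        if subject in oid:          # an equivalent element was already mapped
            return
        oid.add(subject)
        for name, value in elem.attrib.items():
            triples.append((subject, name, value))
        for child in elem:
            if len(child) == 0 and not child.attrib:     # simple sub-model: literal object
                triples.append((subject, child.tag, (child.text or "").strip()))
            else:                                        # otherwise link to the child's ID
                triples.append((subject, child.tag, ids[child]))
                visit(child)

    visit(root)
    return triples

# Hypothetical sample: both books share an equivalent <publisher> element.
SAMPLE = ("<catalog>"
          "<book><title>RDF</title><publisher><name>ACME</name></publisher></book>"
          "<book><title>XML</title><publisher><name>ACME</name></publisher></book>"
          "</catalog>")

triples = to_triples(ET.fromstring(SAMPLE))
for t in triples:
    print(t)
```

In this sample both `book` elements point to the same publisher ID, and the publisher's own triple is emitted only once.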

XML Data Repository
The mapping method proposed in this paper works well for both types of XML. This section analyzes the effect of the mapping algorithm qualitatively and quantitatively. To test the effectiveness of the algorithm, we validated it on the XML Data Repository dataset. The XML Data Repository collects publicly available data in XML form and provides statistical information about the datasets, as shown in table 1. Figure 1 shows the test results of several algorithms on the test data in table 1. The results show that the mapping algorithm in this paper is more versatile.
Figure 3. Comparison of the number of RDF triples generated by different mapping algorithms
Equivalent elements in XML cause data redundancy in RDF triples. The mapping algorithm in [11] does not identify equivalent elements in XML, so the RDF data it generates contains redundant data. We used the dataset in table 1 to test the mapping algorithm in this paper and the mapping algorithm in [11], and compared the total number of RDF triples generated by each. Figure 3 shows that, because the mapping algorithm in this paper identifies the equivalent elements in XML and prunes equivalent elements that have already been mapped, data redundancy in RDF is largely avoided. The redundancy rate is the ratio of the number of redundant triples eliminated to the number of RDF triples generated by the non-redundant mapping.

Conclusions
The key to constructing the RDF triples required for knowledge engineering from XML on the Web lies in the construction of mapping rules. Unlike existing mapping algorithms, this paper constructs its mapping rules on the basis of the classification and aggregation of XML elements. The rules for mapping elements to RDF triples ensure the completeness and accuracy of the RDF constructed by the mapping algorithm. At the same time, identifying duplicate data ensures that redundant data does not appear in the constructed RDF triples. Finally, the algorithm can map different types of XML to RDF. The experimental analysis shows that the mapping algorithm meets these requirements.