Provenance Method of Electronic Archives Based on Knowledge Graph in Big Data Environment

Abstract: With the advent of the era of big data, the development of science and technology has produced a large number of electronic archives. How to guarantee the evidential characteristics of electronic archives in the big data environment has attracted wide attention from the academic community. Provenance is an important technical means of guaranteeing the evidential value of electronic archives. In this paper, knowledge graph technology is used to provide conceptual provenance for electronic archives in the big data environment. This not only enriches provenance methods, but also helps guarantee the evidential value of electronic archives in the big data environment.


Introduction
In the era of big data, the connotation and extension of data have undergone tremendous changes [1]. The living environment and life cycle of data also differ from those of the information age, which brings various management and governance risks. Therefore, "data management" has become a hot topic in many disciplines from the perspective of big data [2]. As historical records carefully preserved by the state and society, electronic archives are of great value. The arrival of the era of big data has given new features to electronic archive data resources, and the main participants in the management of electronic archives are increasingly aware of their value [3].
Under the same technical conditions, archival big data shares the common characteristics of big data: huge volume, various types and fast processing speed. Its slightly different characteristics are: relatively scattered data, the high value density of single types of electronic archive data [4], and high reliability and fidelity.
The data resources of electronic archives are large and growing fast as a whole. At present, the total amount of electronic archive resources in any single archive in China has not reached the PB level, but the total archive resources held by each archive (office) can still be regarded as massive. At the same time, with the digitization of existing archives and the increase in new electronic archives received each year, electronic archives have reached a huge scale.
The data types of electronic archives are various, and their structure is complex. Electronic archives involve government agencies, finance, education and other industries. Each industry produces different archival data: not only traditional paper archives, but also electronic archives such as pictures, charts, audio and video. The formats and features vary widely, forming a large amount of heterogeneous data [5]. Archive data resources therefore contain a great deal of both structured and unstructured data.
Electronic archive data must be processed at high speed. In the era of big data, the timeliness of electronic archives is more pronounced, and it is more necessary to predict what will happen in time [6]. This requires fast processing, which is also the most remarkable feature distinguishing big data from traditional data. Moreover, the speed and efficiency of electronic archive processing are the lifeblood of an industry or enterprise.
The value of archival data resources is high. In contrast to the high total value but low value density of most data resources, electronic archives are the most authentic, reliable, authoritative and evidential primary information resources [7]; the more people use them, the greater their value. Electronic archives are relatively scattered: due to the separating effects of administrative scope and storage period, electronic archives across space and time are often stored in different physical locations [8], so although their numbers can be enormous, it is difficult to bring them together completely. The high value density of single kinds of electronic archives means that, because of archival appraisal and filing, the value density of electronic archives is higher than that of general, unappraised data [9]. The high density of data value derives from, and is proportional to, the value of the archives.
Archives objectively record historical situations and retain true historical markers, making them convincing historical evidence [10]. Evidential value is the basic value of electronic archives, and authentic electronic archives have this evidential value. One of the reasons for maintaining the veracity of electronic archives is to give full play to their evidential value [11]. Data provenance is an important technical means of ensuring the veracity of electronic archives. However, in the current big data environment, the number of electronic archives has increased dramatically [12], and traditional provenance methods cannot meet current needs. Therefore, this paper studies a provenance method for electronic archives in the big data environment.

Knowledge Graph
In essence, a knowledge graph is a semantic network that reveals the relationships between entities; it can formally describe real-world things and their relationships [13]. The term knowledge graph is now used to refer to a variety of large-scale knowledge bases.
The triple is the general representation of a knowledge graph. A knowledge base can be written as G = (E, R, S), where E = {e1, e2, ..., e|E|} is the set of entities in the knowledge base, containing |E| different entities; R = {r1, r2, ..., r|R|} is the set of relations, containing |R| different relations; and S ⊆ E × R × E is the set of triples in the knowledge base [14]. The basic forms of a triple include (entity 1, relation, entity 2) and (concept, attribute, attribute value). Entities are the most basic elements in a knowledge graph, and different entities are connected by different relationships [15]. Concepts mainly refer to collections, categories, object types and kinds of things, such as people and geography [16]; attributes mainly refer to the features, characteristics and parameters that an object may have, such as nationality and birthday; attribute values are the values of those attributes for a given object, such as China or 1988-09-08. Each entity can be identified by a globally unique ID, each (attribute, attribute value) pair describes an intrinsic characteristic of the entity, and relationships connect two entities and describe the association between them.
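The triple representation described above can be illustrated with a minimal sketch: the knowledge base holds a set S of (entity 1, relation, entity 2) triples, from which the entity set E and relation set R can be derived. The entities and facts used here are purely illustrative.

```python
# Illustrative triple set S; the names and facts are assumptions, not real data.
triples = {
    ("Li Na", "nationality", "China"),
    ("Li Na", "birthday", "1988-09-08"),  # an (attribute, attribute value) pair
    ("China", "contains", "Beijing"),
}

# Derive the entity set E and the relation set R from the triple set S.
entities = {h for h, _, _ in triples} | {t for _, _, t in triples}
relations = {r for _, r, _ in triples}
```

In a real system each entity would additionally carry the globally unique ID mentioned above; plain strings stand in for those IDs here.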

Logical Structure
The structure of a knowledge graph includes the logical structure of the graph itself and the technical (system) framework used to construct it.
Logically, the knowledge graph can be divided into two layers: the data layer and the pattern layer. In the data layer, knowledge is stored in units of facts in a graph database [17]. If "entity-relation-entity" or "entity-attribute-attribute value" triples are taken as the basic expression of facts, all the data stored in the graph database forms a huge entity-relationship network, i.e., a knowledge graph [18]. The pattern layer is the core of the knowledge graph; what it stores is refined knowledge. An ontology library is usually used to manage the pattern layer, and the library's support for axioms, rules and constraints is used to regulate the relations among entities, relationships, entity types, attributes and other objects.

Architecture
The architecture of knowledge graph refers to the construction of pattern structure, as shown in Fig. 1.

Figure 1: The architecture of knowledge graph

The part inside the dotted-line frame is the process of constructing the knowledge graph, which needs to be updated and iterated as people's cognitive ability develops.
There are two main ways to construct a knowledge graph: top-down and bottom-up. Top-down construction first defines the ontology and data patterns for the knowledge graph, and then adds entities to the knowledge base [19]. This method needs an existing structured knowledge base as its basic knowledge base, such as the Freebase project. Bottom-up construction extracts entities from open linked data, chooses those with high confidence to join the knowledge base, and then builds the top-level ontology model.

Problem Statement
At the 13th International Archives Congress, the Canadian archivist Terry Cook developed the traditional entity provenance into a highly abstract and generalized conceptual provenance on the basis of summing up the experience and lessons of archival practice in various countries. Conceptual provenance holds that provenance is not only the originator of electronic archives, but also an abstract and broad-ranging relationship [20]. It is an abstract conceptual provenance focusing on the functions and business activities of archivists within an organization. This theoretical innovation sublimates the principle of provenance in theory and deepens the understanding of the essence and value of electronic archives.

The Position of Provenance Principle in the Electronic Age
Under the current big data environment, a large number of electronic archives have been produced. We should go deep into the concrete formation process of electronic archives and understand the origin and development of document information [21]. In essence, the principle of provenance requires us to fully understand and maintain the traceable information of documents; in the era of electronic archives, this has risen to the level of acquiring background information. The content and emphasis of description have changed accordingly: the description of electronic archives should involve the various elements of document formation, including background, content and structure, and the emphasis should shift from the content of documents to the background of their formation.
The principle of provenance embodies the interrelationships among the formations of electronic archives and ensures their evidential value. Electronic archives do not exist in isolation and without connection [22]; each electronic archive exists as an integral part of the whole. To determine whether an electronic file is of great significance, it is necessary to know exactly by whom, under what conditions, for what purpose, and in what form it was created.

The Role of Provenance Principle in the Electronic Age
The role of the provenance principle in electronic archives filing. On the one hand, traditional archival work still exists objectively. As a sorting principle, the principle of provenance classifies archives from the perspective of their provenance so as to distinguish them from one another in basic method [23]. The principle of provenance fully guarantees the integrity of the whole file and makes it easier for filing staff to sort out electronic files. It requires that electronic archives be classified according to their sources, for example according to the public administration organs that produced them [24], so as to maintain the organic connection of electronic archives and the inherent historical connection between documents. Filing according to the principle of provenance helps to protect the integrity of electronic archives, to explain the importance of documents, to maintain their evidential value, and to facilitate subsequent management links. On the other hand, electronic archives have a series of technical characteristics that differ from traditional documents, but these characteristics have not fundamentally changed the nature of electronic archives as archives [25]. Like traditional documents, they are true records of human activities. Therefore, maintaining the historical organic relationship between documents is equally important for the collation of electronic archives.
The role of provenance principle in identification [26]. The concept of provenance plays an important guiding role in the field of electronic archives identification. The principle of provenance requires that the original history of electronic archives be maintained. According to the principle of provenance, electronic archives can maintain their complete provenance information and system. Its identification depends on real and reliable provenance. The electronic archives management organization is the starting point of the appraisal work. The identification work based on the formation of electronic archives has increasingly shown its feasibility and reliability.
In the current big data environment, how to provide concept provenance lacks the corresponding technical means. Therefore, this paper proposes to use knowledge graph technology to provide concept provenance [27].

Information Extraction of Electronic Archives
Information extraction is the first step in the construction of knowledge graph. Information extraction of electronic archives based on knowledge graph can be obtained by entity extraction, relationship extraction and attribute extraction [28].
With the continuous progress of named entity recognition technology, academia began to pay attention to information extraction in the open domain; that is, no longer limited to specific knowledge areas but oriented to the open internet, studying and solving information extraction across the whole web. First, a scientific and complete named entity classification system is established, which can guide algorithm research on the one hand and facilitate management of the extracted entity data on the other. Then, automatic classification of entities is realized using an adaptive perceptron algorithm.
Statistical machine learning (SML) is widely used to model the relationships between entities, instead of predefined grammatical and semantic rules. For example, lexical, syntactic and semantic features of natural language have been used to model entity relationships, and a rule-free approach based on the maximum entropy method has been successfully realized. Moreover, many supervised learning methods based on feature vectors or kernels are used in relation extraction, continuously improving accuracy. The goal of attribute extraction is to collect the attribute information of a specific entity from different sources. For example, for a public figure, information such as nicknames, birthdays, nationality and educational background can be obtained from public information on the internet [29][30]. Attribute extraction technology collects this information from a variety of data sources to complete a full sketch of an entity's attributes.
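As a minimal sketch of the attribute-extraction idea above, the snippet below parses a semi-structured record into (attribute, attribute value) pairs. The "key: value" field layout and the record content are assumptions for illustration, not a real archive schema; production systems would use the learned extractors described in the text.

```python
# An assumed semi-structured record; fields are "key: value" pairs
# separated by semicolons.
record = "Name: Li Na; Nationality: China; Birthday: 1988-09-08"

# Split into fields, then split each field once on ":" to get a pair.
attributes = {
    key.strip().lower(): value.strip()
    for key, value in (field.split(":", 1) for field in record.split(";"))
}
```

The resulting dictionary maps each attribute to its attribute value, ready to be attached to the corresponding entity in the knowledge base.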

Integration of Knowledge in Electronic Archives
Through information extraction, the goal of obtaining entities, relationships and entity attribute information from unstructured and semi-structured electronic archives is achieved. However, these results may contain much redundant and erroneous information [31], and the relationships among electronic archive data are flat, lacking hierarchy and logic, so the results need to be cleaned and integrated. Knowledge fusion includes two parts: entity linking and knowledge merging. Through knowledge fusion, the ambiguity of concepts can be resolved and redundant and erroneous concepts eliminated, ensuring the information quality of electronic archives.
Entity linking refers to the operation of linking an entity object extracted from electronic archives to the corresponding correct entity object in the knowledge base [32]. The basic idea is to select a group of candidate entity objects from the knowledge base according to a given entity mention, and then link the mention to the correct entity object through similarity calculation. In recent years, academia has begun to exploit the co-occurrence relationships of entities to link multiple entities to the knowledge base simultaneously, which is called collective entity linking [33].
The general process of entity linking is: extract entity mentions from electronic archives through entity extraction; perform entity disambiguation and co-referential resolution, judging whether a same-name entity in the knowledge base represents different meanings and whether other named entities in the knowledge base express the same meaning [34]; and, after confirming the correct entity object, link the entity mention to the corresponding entity in the knowledge base.
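The candidate-selection-plus-similarity idea above can be sketched as follows. String similarity over entity descriptions stands in for the richer similarity calculation a real linker would use, and the knowledge-base entries are illustrative assumptions.

```python
from difflib import SequenceMatcher

# Assumed knowledge-base entries: entity ID -> short description.
knowledge_base = {
    "Li Na (tennis player)": "Chinese tennis player and Grand Slam champion",
    "Li Na (singer)": "Chinese pop singer",
}

def link_mention(context: str) -> str:
    """Link a mention to the candidate entity whose description best
    matches the mention's surrounding context."""
    return max(
        knowledge_base,
        key=lambda entity: SequenceMatcher(
            None, context.lower(), knowledge_base[entity].lower()
        ).ratio(),
    )
```

For example, a mention of "Li Na" in a sentence about a tennis championship would be linked to the tennis-player entry rather than the singer.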
Entity disambiguation is a technology specially used to solve the ambiguity problem of same-name entities. In a real language environment, an entity mention often corresponds to multiple named entity objects. For example, the mention "Li Na" can correspond to the entity Li Na the singer, or to the entity Li Na the tennis player. Through entity disambiguation, the mention can be accurately linked to, say, the athlete Li Na according to the current context. Entity disambiguation mainly adopts clustering methods.
Co-referential resolution technology is mainly used to solve the problem of multiple mentions corresponding to the same entity object. For example, in a document, many pronouns may point to the same entity object; using co-referential resolution, these referring terms can be merged into the correct entity object [35]. This problem is of special importance in information retrieval and natural language processing and attracts much research effort [36]. Academic circles describe it in different terms, such as object alignment, entity matching and entity synonymy. In addition to treating co-referential resolution as a classification problem, it can also be solved as a clustering problem. The basic idea of the clustering method is to focus on entity mentions and realize the matching between mentions and entity objects through entity clustering [37]. The key problem is how to define the similarity measure between entities.
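A minimal clustering sketch of the idea above: mentions whose surface forms are sufficiently similar are greedily merged into one cluster representing a single entity object. The 0.8 threshold, the greedy strategy, and the mention list are illustrative assumptions; the key design question, as the text notes, is the choice of similarity measure.

```python
from difflib import SequenceMatcher

def cluster_mentions(mentions, threshold=0.8):
    """Greedily group mentions into clusters by pairwise string similarity
    against each cluster's first member."""
    clusters = []
    for mention in mentions:
        for cluster in clusters:
            similarity = SequenceMatcher(
                None, mention.lower(), cluster[0].lower()
            ).ratio()
            if similarity >= threshold:
                cluster.append(mention)
                break
        else:
            # No existing cluster is similar enough: start a new one.
            clusters.append([mention])
    return clusters
```

Each resulting cluster would then be mapped to one entity object in the knowledge base.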

Knowledge Processing of Electronic Archives
Through information extraction, knowledge elements such as entities, relationships and attributes can be extracted from the original corpus. After knowledge fusion, the ambiguity between entity mentions and entity objects can be eliminated, yielding a series of basic factual expressions. However, facts themselves are not equal to knowledge [38]; to finally obtain a structured, networked knowledge system, knowledge processing is required. Knowledge processing includes three aspects: ontology construction, knowledge reasoning and quality evaluation.
Ontology is a norm for modeling concepts and an abstract model for describing the objective world; it defines concepts and their links in a formal way. The greatest feature of an ontology is that it is shared: the knowledge it reflects is a well-defined consensus [39]. In a knowledge graph, the ontology is located in the pattern layer, describing the conceptual hierarchy as the conceptual template of knowledge in the knowledge base. At present, research on ontology generation methods mainly focuses on entity clustering. The main challenge is that entity descriptions obtained by information extraction are very short and lack the necessary context, which makes most statistical models unusable.
Knowledge reasoning refers to expanding and enriching the knowledge network by establishing new relationships among entities, through computer reasoning over the existing entity relationship data in the knowledge base [40]. Knowledge reasoning is an important means and key link in knowledge graph construction; through it, new knowledge can be discovered from existing knowledge. The object of knowledge reasoning is not limited to relationships between entities; it can also be the attribute values of entities or the conceptual hierarchy of the ontology. Conceptual reasoning can also be carried out according to the inheritance relationships of concepts in the ontology library.
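A simple instance of the reasoning described above is closing a relation under transitivity: from "A part_of B" and "B part_of C", the new triple "A part_of C" is derived. The "part_of" relation and the organizational facts below are illustrative assumptions.

```python
def infer_transitive(triples, relation):
    """Return the new triples implied by transitivity of the given relation."""
    known = set(triples)
    changed = True
    while changed:
        changed = False
        # Repeatedly combine pairs (a, r, b) and (b, r, d) into (a, r, d)
        # until no new triple can be added.
        for a, r1, b in list(known):
            for c, r2, d in list(known):
                if r1 == r2 == relation and b == c and (a, relation, d) not in known:
                    known.add((a, relation, d))
                    changed = True
    return known - set(triples)

# Illustrative facts about an assumed organizational hierarchy.
facts = {
    ("finance_section", "part_of", "bureau_a"),
    ("bureau_a", "part_of", "ministry_x"),
}
```

Running `infer_transitive(facts, "part_of")` discovers that the finance section is part of ministry_x, a fact not explicitly recorded.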
Quality assessment is also an important part of knowledge base construction technology. Due to the limitations of the current technology, the knowledge elements obtained by open-domain information extraction may contain errors, such as entity recognition errors and relationship extraction errors, and the quality of knowledge obtained by knowledge reasoning is likewise not guaranteed. Therefore, a quality assessment process is needed before adding knowledge to the knowledge base [41]. With the development of open linked data projects, the quality differences among the knowledge base products produced by each sub-project are increasing, and conflicts between data are growing, so evaluating the quality of the knowledge base plays an increasingly important role in knowledge graph construction. The significance of introducing quality assessment is that the credibility of knowledge can be quantified, and the quality of the knowledge base can be guaranteed by discarding knowledge with low confidence.
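The thresholding step at the end of the paragraph above can be sketched as follows: each candidate fact carries a confidence score, and only facts above a threshold enter the knowledge base. The candidate facts, their scores and the 0.7 threshold are illustrative assumptions.

```python
# Assumed candidate facts with confidence scores from extraction/reasoning.
candidates = [
    (("doc_001", "created_by", "finance_section"), 0.95),
    (("doc_001", "created_by", "hr_section"), 0.30),
]

CONFIDENCE_THRESHOLD = 0.7

# Keep high-confidence facts; discard the rest before they enter the base.
accepted = [fact for fact, score in candidates if score >= CONFIDENCE_THRESHOLD]
discarded = [fact for fact, score in candidates if score < CONFIDENCE_THRESHOLD]
```

Note that the two candidates here also conflict (two different creators for the same document); in practice quality assessment and conflict resolution work together.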

Intelligent Search of Electronic Archives Based on Knowledge Graph
After the above processing, we can build an intelligent search tool for electronic archives based on the knowledge graph. Intelligent search based on knowledge graph serves long-tail queries, and search engines display search results in the form of knowledge cards. A query from a user of electronic archives goes through two stages: query semantic understanding and knowledge retrieval.
Query semantic understanding. The semantic analysis of a query over a knowledge graph mainly includes: word segmentation, part-of-speech tagging and error correction of the query text; description normalization, to match knowledge in the knowledge base [42]; context analysis, since in different contexts the object of the user's query may differ, so the knowledge graph needs to take the user's current context into account and return answers in time; and query expansion, in which, after the query intention and related concepts of the electronic archives user are determined, relevant concepts from the current context are added to expand the query.
Knowledge retrieval. After query analysis, the standard query statements are passed to the knowledge base search engine, which retrieves the corresponding entities in the knowledge base, together with entities that have a high matching degree in category, relationship, correlation and so on.
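The two-stage flow above can be sketched as a minimal pipeline: simple semantic normalization (lowercasing plus alias mapping) followed by retrieval of the matching entity, whose attributes would populate a knowledge card. The alias table and knowledge-base entries are illustrative assumptions.

```python
# Assumed description-normalization table mapping aliases to canonical names.
aliases = {"prc": "china"}

# Assumed knowledge base: canonical entity name -> attributes for the card.
knowledge_base = {
    "china": {"type": "country", "capital": "Beijing"},
}

def search(query: str):
    """Normalize the query text, then retrieve the matching entity, if any."""
    term = query.strip().lower()     # query semantic understanding
    term = aliases.get(term, term)   # description normalization
    return knowledge_base.get(term)  # knowledge retrieval
```

A real engine would additionally rank entities by matching degree in category, relationship and correlation, rather than requiring an exact normalized match.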

Conclusion
With the advent of the era of big data, the development of science and technology has produced a large number of electronic archives. How to guarantee the evidential characteristics of electronic archives in the big data environment has attracted wide attention from the academic community. Provenance is an important technical means of guaranteeing the evidential value of electronic archives. In this paper, knowledge graph technology is used to provide conceptual provenance for electronic archives in the big data environment. This not only enriches provenance methods, but also helps guarantee the evidential value of electronic archives in the big data environment.

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.