Guarantee Mechanism of Data Continuity for Electronic Record Based on Linked Data

In electronic record management, especially in the current big data environment, data continuity has become a new research topic that is as important as security. This paper decomposes the data continuity guarantee of electronic records into a set of data protection requirements, namely data relevance, traceability and comprehensibility, and proposes using linked data technology to provide an integrated guarantee mechanism that meets these three requirements.


Introduction
The volume of electronic records that must be stored in a big data environment is growing geometrically. Traditional electronic record management systems, built on technology for managing small-scale structured data, struggle to make effective use of large-scale unstructured electronic records in the era of big data [1][2][3]. Ensuring the availability of electronic records has therefore become one of the major challenges for electronic record management research in the era of big data [4,5].
Traditional electronic record management is usually function-driven or target-driven: according to the business objectives of an enterprise or organization, the functional requirements of electronic record management are decomposed into many function points, and the electronic record management system is developed by implementing these function points one by one [6,7]. For example, retrieval, statistics, permission control, extraction, security classification management, programming interfaces and offline use are core function points of electronic record management [8,9]. When the business environment and its electronic documents are relatively stable, the function-driven design method is indeed the preferred solution [10][11][12][13]. In the era of big data, however, the business activities, service contents and data of organizations change constantly. A data-driven mode is therefore emerging as a new mode of electronic record management [14], in which data becomes an important driver of electronic record management.
Data continuity refers to a set of data protection measures covering data relevance, traceability, comprehensibility and internal connections. Its purpose is to ensure the availability, credibility and controllability of data and to reduce the risks of data misuse, loss of trust and loss of control. Its focus is data quality assurance: correlating data better and avoiding fragmentation; strengthening data tracking and identification so that data is evidence-based; and providing reasonable self-describing information so that the data subject retains understanding and control.
Data continuity assurance is the key to solving the big data challenges faced by electronic record management described above. First, turning records into a business resource requires that electronic records be machine-comprehensible; otherwise it is difficult to define new services based on record content or to optimize existing services. Second, data-driven electronic record management requires that the electronic records themselves be in an associated and traceable state. Third, the data-centric approach shifts the focus of electronic record management from computing to data, so that the association, traceability and semantic understanding of electronic records become its key activities. Finally, the real-time processing capability for electronic records depends not only on the choice of technology (for example, stream processing with Spark) but also on the state of the data, that is, on the relevance, traceability and comprehensibility of the electronic records.
Although the data continuity guarantee of electronic records has been proposed, technical means of providing it are still lacking. To address this problem, this paper proposes a data continuity guarantee mechanism for electronic records based on linked data.

Electronic Record
With the mass production of electronic records, they have gradually replaced paper records as the main form of social records, and electronic record management has become an important part of record management. China's research on electronic record management spans more than a decade; it is becoming increasingly mature and shows Chinese characteristics.
From a theoretical perspective, electronic records differ from paper records in many characteristics, three of which deserve special attention: first, the authenticity, integrity and validity of electronic records; second, the separability of the content of an electronic record from its carrier, and the resulting change in how records are managed; third, the integration of the content, background (context) and structural information of electronic records. Fundamentally, these three distinctive aspects constitute the essence of electronic records. They result from the fact that records are generated, managed, preserved or destroyed within a "system", and they are directly tied to the electronic record management system. Electronic records are generated and circulated to support business activities between the organization and its users, so the first purpose of management is to obtain business recognition. Given the variety of business systems to which electronic records are attached, the key difficulties in keeping records authentic and trustworthy are business process integration and version control. Important business activities have corresponding rules and regulations for managing their records, which govern the versioning, process control, audit trail and related handling of business documents.
Like security, continuity is an important attribute of data resources, and ensuring it is one of the primary tasks of data management. With the arrival of the era of big data, however, the problems of fragmented electronic records, garbage data and data islands have become increasingly prominent. The loss of usability, credibility and control over electronic records poses a new challenge to electronic record management, and the data continuity of electronic records has become an important subject in urgent need of study.

Linked Data
Linked data uses the RDF (Resource Description Framework) data model and uses URIs (Uniform Resource Identifiers) to name data entities and to publish and deploy instance data and class data, so that the data can be exposed and retrieved over the Hypertext Transfer Protocol (HTTP). At the same time, it emphasizes the interrelation of data and the contextual information that benefits both people and computers.
Linked data can create links between data from different sources. These sources may be databases maintained by two organizations in different geographic locations, or different systems within one organization that are not interoperable at the data level. Linked data is machine-readable and unambiguous; it links to external data sets and can in turn be linked to from them.
The linked data network differs from the current hypertext network, whose basic unit is a Hypertext Markup Language (HTML) file connected to others by hyperlinks. Linked data does not simply connect such files; it uses RDF to form a network that can link anything in the world, namely the data network, which can be described as a network of online data describing all the entities in the world. The emergence of the linked data network not only extends the current hypertext network but also helps to discriminate, select and locate the confusing information resources on the current network. The three modes of data storage are shown in Tab. 1.
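The linking of two independent sources can be illustrated with a minimal sketch. It assumes that each source exposes its data as a set of (subject, predicate, object) tuples; the two data sets, the example.org URIs and the helper names are hypothetical and are not taken from any system discussed in this paper. Because both sources name the same organization with the same URI, merging them is a plain set union, after which a link can be followed from one source into the other.

```python
# Real vocabulary terms (Dublin Core Terms and FOAF).
DCT_CREATOR = "http://purl.org/dc/terms/creator"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

# Hypothetical URIs for a record and for the organization that created it.
RECORD = "http://archive.example.org/record/42"
ORG = "http://registry.example.org/org/7"

# Source A: an archive system that only knows who created the record.
ARCHIVE = {(RECORD, DCT_CREATOR, ORG)}

# Source B: an organization registry that only knows the organization's name.
REGISTRY = {(ORG, FOAF_NAME, "Finance Department")}


def objects(graph, subject, predicate):
    """Return all objects of statements matching the subject and predicate."""
    return {o for s, p, o in graph if s == subject and p == predicate}


def creator_names(graph, record_uri):
    """Follow record -> creator -> name, possibly crossing source boundaries."""
    names = set()
    for creator in objects(graph, record_uri, DCT_CREATOR):
        names |= objects(graph, creator, FOAF_NAME)
    return names


# The shared URI is what makes the union meaningful.
merged = ARCHIVE | REGISTRY
print(creator_names(merged, RECORD))
```

Queried alone, neither source can answer the question "what is the name of the creator of this record"; the answer only becomes available once the two sets are combined.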

Problem Description
The main challenge facing electronic record management systems in the era of big data is to achieve the following four transitions. The first is the transition from records as documentation of business to records as a business resource. The second is the transition from target-driven to record-driven management. The third is the transition from computation-centric to data-centric design. The fourth is the transition from offline processing to real-time processing. Many problems must be addressed to achieve these four transitions, but the most important is the continuity of electronic record management. Data continuity is not only a prerequisite for record-based business and data-driven management, but also a core problem to be solved in data-centric design and real-time processing.
The connotation of the data continuity guarantee for electronic records is shown in Tab. 2. As Tab. 2 shows, relevance, traceability and comprehensibility are interrelated; spanning space and time, structure and semantics, they together ensure the integrity and usability of electronic records, so that the records are backed by verifiable data and are authentic and valid. Unlike the theory of digital continuity, which centers on long-term preservation, the theory of data continuity further emphasizes continuity at the content and semantic levels of electronic records.
The connotation of the data continuity guarantee indicates that research on electronic record management must address three core issues in response to the new challenges of the big data era: guaranteeing the relevance, the traceability and the comprehensibility of electronic records. To provide these three guarantees, this paper uses linked data technology as the corresponding technical support.

Guarantee Mechanism of Data Continuity for Electronic Record Based on Linked Data
The theory of linked data provides a theoretical basis for the study of data continuity, especially the relevance of electronic records. Data engineering practice based on linked data sets is an important reference for the data continuity of electronic records, especially for the implementation method and guarantee mechanism of data relevance.
At the same time, linked data broadens the theory and technology of data traceability. Data traceability methods based on linked data lay a good foundation for data continuity, especially for the design of data traceability. Existing tracing methods include annotation-based, inverse-function-based and bit-vector-based methods. Among them, the annotation-based method is relatively simple and easy to implement, and it is already used in some applications.
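The idea behind annotation-based tracing can be sketched as follows. This is a minimal illustration under stated assumptions, not the mechanism of any particular system: every derived item is assumed to carry an annotation naming the operation that produced it and the items it was derived from, so tracing reduces to walking these annotations backwards. The class `Annotated`, the helpers `derive` and `trace`, and the file names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotated:
    """A data item together with its provenance annotation."""
    value: str
    operation: str = "original"   # how the item was produced
    sources: tuple = ()           # the annotated items it was derived from


def derive(operation, value, *sources):
    """Create a derived item and annotate it with its direct sources."""
    return Annotated(value=value, operation=operation, sources=tuple(sources))


def trace(item):
    """Return the original items that `item` ultimately derives from."""
    if not item.sources:
        return [item]
    origins = []
    for source in item.sources:
        for origin in trace(source):
            if origin not in origins:   # keep each origin once, in order
                origins.append(origin)
    return origins


# Hypothetical records and processing steps.
contract = Annotated("contract-2019.pdf")
invoice = Annotated("invoice-0031.pdf")
summary = derive("summarize", "q3-summary.txt", contract, invoice)
report = derive("redact", "q3-summary-public.txt", summary)

print([origin.value for origin in trace(report)])
```

The cost of this simplicity is that annotations must be written at every processing step and stored alongside the data, which is the usual trade-off of the annotation-based approach.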
The Resource Description Framework (RDF) is the cornerstone of the Semantic Web and a standardized language for describing the metadata of network resources. It is intended to describe resources and the relationships between them. Linked data uses RDF as its description language: Uniform Resource Identifiers (URIs) identify things, and resources are described by attributes and attribute values. A description of a resource that asserts an attribute of the resource and the value of that attribute is called a statement. RDF uses a specific set of terms for the parts of a statement. The part that identifies the thing being described is called the subject. The part that identifies which attribute of the subject is being stated is called the predicate. The part that gives the value of that attribute is called the object. The object can be either an attribute value (a literal) or a resource object. Linked data expresses objects as resource objects whenever possible, which helps establish connections between data.
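The subject, predicate and object structure described above can be made concrete with a small sketch. It models a statement directly in plain Python rather than with an RDF library; the classes `URI`, `Literal` and `Statement`, the function `to_ntriples` and the example.org URIs are illustrative assumptions. The serializer emits lines in the style of N-Triples but escapes only backslashes and double quotes, so it is a simplification of that format.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class URI:
    value: str


@dataclass(frozen=True)
class Literal:
    value: str


@dataclass(frozen=True)
class Statement:
    subject: URI                # the thing being described
    predicate: URI              # which attribute is being stated
    obj: Union[URI, Literal]    # the value: a resource object or a literal


def to_ntriples(st: Statement) -> str:
    """Serialize one statement as an N-Triples-style line (simplified escaping)."""
    if isinstance(st.obj, URI):
        obj = f"<{st.obj.value}>"
    else:
        escaped = st.obj.value.replace("\\", "\\\\").replace('"', '\\"')
        obj = f'"{escaped}"'
    return f"<{st.subject.value}> <{st.predicate.value}> {obj} ."


RECORD = URI("http://archive.example.org/record/42")  # hypothetical record

statements = [
    # Object given as a literal attribute value.
    Statement(RECORD, URI("http://purl.org/dc/terms/title"),
              Literal("Annual report")),
    # Object given as a resource, which other statements can link to.
    Statement(RECORD, URI("http://purl.org/dc/terms/creator"),
              URI("http://archive.example.org/id/org/7")),
]

for st in statements:
    print(to_ntriples(st))
```

The second statement shows why resource objects are preferred: the creator is itself identified by a URI, so further statements can describe it, whereas a literal is an end point.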
Resource objects in linked data are divided into information resources and non-information resources. Information resources are themselves information, such as pictures and web pages, and generally have representations that can be accessed over HTTP; these representations may differ in format, protocol properties or natural language. Non-information resources are real-world concepts that exist outside the Web. Linked data assigns each non-information resource a Uniform Resource Identifier (URI), but the resource itself cannot be retrieved directly over HTTP. The URI leads not to the non-information resource itself but to an information resource associated with it. Links between resource objects connect different resource objects, their various representations, and information resources with non-information resources, thus forming a wide data network and providing a basis for data sharing. This kind of linking and understanding of data occurs both within a data set and across data sets.
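How a URI for a non-information resource can lead to an associated information resource may be sketched as follows. The sketch simulates dereferencing with in-memory tables instead of real HTTP requests. It assumes one common linked data convention that this paper does not itself prescribe: the server answers a request for a non-information resource with status 303 (See Other) and the location of a describing document. All URIs, table names and function names are hypothetical.

```python
# Information resources: documents that have a retrievable representation.
DOCUMENTS = {
    "http://archive.example.org/doc/org/7":
        "<html><body>Description of the Finance Department</body></html>",
}

# Non-information resources: real-world things, each mapped to the
# information resource that describes it.
DESCRIBED_BY = {
    "http://archive.example.org/id/org/7":
        "http://archive.example.org/doc/org/7",
}


def dereference(uri):
    """Simulate one HTTP lookup and return (status, payload)."""
    if uri in DOCUMENTS:
        return 200, DOCUMENTS[uri]        # representation of the document
    if uri in DESCRIBED_BY:
        return 303, DESCRIBED_BY[uri]     # redirect to the describing document
    return 404, None


def fetch_description(uri, max_hops=5):
    """Follow 303 redirects until a representation or an error is reached."""
    for _ in range(max_hops):
        status, payload = dereference(uri)
        if status != 303:
            return status, payload
        uri = payload
    raise RuntimeError("too many redirects")


status, body = fetch_description("http://archive.example.org/id/org/7")
print(status, body)
```

Keeping separate URIs for the thing and for the document about it lets statements about the organization stay distinct from statements about its description, such as who wrote the description or when it was last changed.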
The main goal of RDF is to provide a framework in which different domains can define their own metadata elements, together with a machine-understandable representation that facilitates data exchange in a big data environment. In other words, RDF provides a metadata solution for web data integration. In RDF, a resource can be of any type: a property of a resource is a special kind of resource, the value of a property is also a resource, and even a statement can be a resource; each resource has a unique URI reference. To make it possible to fuse different metadata sets, RDF is designed to allow anyone to define metadata for describing a particular resource. Because a resource usually has more than one attribute, what is defined is generally a metadata set, which RDF calls a vocabulary. Vocabularies include metadata sets such as DC (Dublin Core) metadata, ontologies, classification schemes and thesauri. A vocabulary is also a resource that can be uniquely identified by a URI. Thus, when RDF is used to describe resource attributes, several different vocabularies can be used together simply by specifying each with its URI.
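The use of several vocabularies in one description can be illustrated with a short sketch. The `dc` prefix below is bound to the actual Dublin Core element set namespace, while the `rec` vocabulary, the record URI and the helper names `expand` and `describe` are hypothetical. Each property is written as a prefixed name and expanded to a full URI, so properties from different vocabularies coexist without name clashes.

```python
PREFIXES = {
    "dc": "http://purl.org/dc/elements/1.1/",              # Dublin Core elements
    "rec": "http://archive.example.org/vocab/records#",    # hypothetical local vocabulary
}


def expand(prefixed_name):
    """Expand a name such as 'dc:title' to the full property URI."""
    prefix, sep, local = prefixed_name.partition(":")
    if not sep or prefix not in PREFIXES:
        raise KeyError(f"unknown vocabulary prefix in {prefixed_name!r}")
    return PREFIXES[prefix] + local


def describe(subject, properties):
    """Turn a {prefixed property: value} mapping into full-URI statements."""
    return [(subject, expand(name), value) for name, value in properties.items()]


triples = describe(
    "http://archive.example.org/record/42",   # hypothetical record URI
    {
        "dc:title": "Annual report",          # property from Dublin Core
        "rec:retentionYears": "10",           # property from the local vocabulary
    },
)

for triple in triples:
    print(triple)
```

Since each vocabulary is identified by its own URI, a consumer that understands only Dublin Core can still read the title and safely ignore the local property.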
Because RDF provides only a basic semantic representation and has no uniform labels for describing semantic relationships more specifically, a unified knowledge organization system standard that supports more specific semantic relationships and flexible extension needs to be established on top of RDF. When attribute values are recorded, they are taken from a specified vocabulary to ensure the consistency and relevance of the descriptions. This makes the data easy to merge and fuse with other RDFS data in the Semantic Web and supports interoperability among thesauri and between thesauri and other vocabularies.
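Taking attribute values from a specified vocabulary can be sketched as a simple normalization step. The vocabulary of security levels, its URIs and the function `controlled_value` are hypothetical and serve only as an illustration: a free-text label is accepted only if it matches a term of the vocabulary, and it is stored as the URI of that term rather than as the text that was typed.

```python
# Hypothetical controlled vocabulary: term URI -> preferred label.
SECURITY_LEVELS = {
    "http://archive.example.org/vocab/level/public": "Public",
    "http://archive.example.org/vocab/level/internal": "Internal",
    "http://archive.example.org/vocab/level/secret": "Secret",
}


def controlled_value(label_or_uri, vocabulary):
    """Return the term URI for a label or URI, or raise if it is not a term."""
    if label_or_uri in vocabulary:
        return label_or_uri                      # already a term URI
    wanted = label_or_uri.strip().casefold()
    for uri, label in vocabulary.items():
        if label.casefold() == wanted:           # case-insensitive label match
            return uri
    raise ValueError(f"{label_or_uri!r} is not a term of the vocabulary")


# Differently typed labels normalize to the same term URI.
print(controlled_value(" secret ", SECURITY_LEVELS))
print(controlled_value("SECRET", SECURITY_LEVELS))
```

Because every record then refers to the same URI for the same concept, descriptions produced by different systems can be merged without reconciling spelling variants afterwards.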

Conclusion
In electronic record management, especially in the current big data environment, data continuity has become a new research topic that is as important as security. This paper decomposes the data continuity guarantee of electronic records into a set of data protection requirements, namely data relevance, traceability and comprehensibility, and proposes using linked data technology to provide an integrated guarantee mechanism that meets these three requirements.
Funding Statement: This work is supported by the NSFC (61772280), the national training programs of innovation and entrepreneurship for undergraduates (Nos. 201910300123Y, 202010300200), and the PAPD fund from NUIST. Yongjun Ren is the corresponding author.

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.