Data reconciliation and fusion methods: a survey

With the progress of new information and communication technologies, more and more producers of data exist. On the other hand, the web forms a huge support for all these kinds of data. Unfortunately, existing data is not clean, due to the presence of the same information in different sources as well as erroneous and incomplete data. The aim of data integration systems is to offer the user a unique interface to query a number of sources. A key challenge of such systems is to deal with conflicting information from the same source or from different sources. We present, in this paper, the resolution of conflicts at the instance level in two stages: reference reconciliation and data fusion. Reference reconciliation methods seek to decide whether two data descriptions refer to the same real-world entity. We define the principles of a reconciliation method, then distinguish reference reconciliation methods, first by how they use the descriptions of references, then by the way they acquire knowledge. We finish this section by discussing some data reconciliation issues that are the subject of current research. Data fusion, in turn, has the objective of merging duplicates into a single representation while resolving conflicts between the data. We first define the classification of conflicts, the strategies for dealing with them and the implementation of conflict management strategies. We then present the relational operators and data fusion techniques. Likewise, we finish this section by discussing some data fusion issues that are the subject of current research.


Introduction
Resolving schema-level conflicts, or even using the same schema, cannot, however, avoid conflicts between the values of the instances themselves. Indeed, when data from different sources is integrated, different values can refer to the same real-world entity. These variations are due to different conventions and vocabularies, incomplete information, the presence of erroneous data, or data freshness.
The three main goals of data integration systems are to increase [1] the completeness, concision and accuracy of the data. Completeness indicates the quantity of data, measured by the number of entities as well as the number of attributes. Concision measures the uniqueness of object representations in the integrated data, both in terms of the number of unique entities and the number of attributes per entity. Finally, accuracy indicates the correctness of the data, that is, how consistent the data is with reality.

High completeness can be achieved by integrating more sources, whereas to be concise and accurate, two levels of tasks are performed by a data integration system [2]: schema-level matching and instance-level matching.
Schema-level matching is intended to establish semantic links between the contents of different data sources; this task is accomplished during the generation of the integration system's global schema. The purpose of instance-level matching is to produce the correct data for each entity after determining the reconciliation between instances that represent the same real-world entity. Some integration frameworks suppose the existence of a common identifier in all sources referring to the same concept. If such a unique identifier exists, query results over the sources can be reconciled thanks to relational operations. Nevertheless, since sources are autonomous, a common unique identifier does not exist in most cases. How, then, can a data integration system judge whether two descriptions refer to the same entity? For this purpose, integration systems use entity reconciliation methods. Instance-level matching proceeds in two phases: reference reconciliation and data fusion.
Even if schema conflicts are resolved, instances are not necessarily homogeneous: conflicts can occur when the same data is copied to multiple sources and the results of querying these sources contain conflicting data values for the same entities. Instance conflicts are often grouped into two classes: (i) reference conflicts and (ii) attribute value conflicts (see Figure 1).
The reference reconciliation step (key conflict resolution) aims to resolve conflicts at the instance level by determining which entities refer to the same real-world entity. The purpose of the data fusion step (attribute conflict resolution) is to merge records referring to the same real-world entity by combining them into one representation and resolving any conflicts.
The remainder of this paper is organized as follows. Section 2 presents reference reconciliation through the definition of its main bases and techniques. In Section 3, we discuss data fusion by classifying the various conflicts that may appear as well as the different techniques to resolve them. Both Section 2 and Section 3 end with a discussion of the open issues related to each step. Finally, the paper is concluded in Section 4.

Reference reconciliation (resolving key conflicts)
The problem of reconciling data is one of the major problems for data integration systems. It consists in deciding whether or not two data descriptions refer to the same real-world entity. Data reconciliation is also known as reference reconciliation, record matching [3,4], record linkage [5], entity resolution [6,7], object identification, duplicate detection [8] or data cleaning [9,10]. We consider that a data item is defined by an identifier (reference) and by a description; the question is therefore to reconcile different identifiers described with respect to the same schema.
Historically, reference reconciliation was first mentioned under the terminology of "record linkage" by Halbert Dunn [11]. In the 1950s, data was represented by records in files, which justifies the use of the term "record linkage" for the task of reference reconciliation in a data integration system. The problem of computer-based reference reconciliation was introduced by Newcombe et al. in 1959 [12] and formalized ten years later by Fellegi and Sunter [5]. Since then, the problem of reference reconciliation has been considered under different terminologies by different communities. In the field of databases, data matching and duplicate detection are used when merging or cleaning multiple databases. In natural language processing, co-reference and anaphora resolution seek the nominal groups that refer to the same entity. In order to automate reference reconciliation and make it effective, a large number of methods have been proposed; summaries of these methods can be found in [13][14][15][16][17][18][19][20].
The different reconciliation approaches available in the literature can be classified according to two criteria [21]: (i) the exploitation of the relations between data, and (ii) the exploitation of the knowledge for the reference reconciliation.

Evaluation of reference reconciliation methods
The quality of the results obtained by a reference reconciliation method is evaluated using measures from the Information Retrieval field: recall, precision and F-measure.
Recall: Proportion, among all possible pairs, of those for which the method has produced a correct result.
Precision: Proportion, among the pairs for which the method has produced a result (reconciliation or non-reconciliation), of those for which the result is correct.
F-Measure: Since finding a good compromise between recall and precision is difficult, results can be evaluated by computing a combination of the two measures, classically their harmonic mean.
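As an illustration, these measures can be computed over sets of reference pairs. The following sketch uses the common convention (precision over the pairs declared reconciled, recall over the true reconciliations), a slight simplification of the definitions above; the record identifiers are hypothetical:

```python
def evaluate_reconciliation(predicted, gold):
    """Return (precision, recall, f_measure) for a set of predicted
    reconciliations against a gold standard of true reconciliations."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure

# Hypothetical pairs of record identifiers judged to co-refer.
gold = {("r1", "r2"), ("r3", "r4"), ("r5", "r6")}
predicted = {("r1", "r2"), ("r3", "r4"), ("r7", "r8")}
p, r, f = evaluate_reconciliation(predicted, gold)  # each 2/3 here
```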

Similarity measures
One of the most important challenges that reconciliation methods face is the difficulty of the syntactical variation of the vocabulary used to describe the data. These variations may be due to typographical variations, abbreviations, the presence of acronyms, coding differences and synonyms. In such a context, most reference reconciliation methods rely on similarity measurement techniques between values.
Similarity measure: A function Sim: E × E → [0, 1] that, applied to a pair of elements (e1, e2) ∈ E × E, returns a real number in the interval [0, 1] expressing the similarity between these two elements.
Similarity Score: The similarity score is the actual value calculated by a similarity measure for a pair of elements.
A multitude of similarity measures between character strings have been developed, each one effective for particular types of syntactic variation. These similarity measures are categorized into three classes [21]: character-based measures, atomic chains-based measures and hybrid measures.
The comparison of the different measures shows that no non-hybrid measure is appropriate for all datasets; the best measures for some datasets may be the worst for others [22]. As a result, more flexible measures that combine several similarity measures are needed. More generally, the selection of these measures must be customizable in a reference reconciliation system so that it can adapt to the data.
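To make these classes concrete, here is a minimal sketch of one character-based measure (a normalized Levenshtein edit distance) and one atomic chains (token) based measure (Jaccard similarity on word sets); both return scores in [0, 1] as the definition above requires:

```python
def levenshtein(a, b):
    """Character-based edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def levenshtein_similarity(a, b):
    """Normalize the distance into a [0, 1] similarity score."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def jaccard_similarity(a, b):
    """Token-based (atomic chains) similarity on word sets."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0
```

Note how the two measures disagree on word reorderings: the Jaccard score of "ACM SIGMOD" and "SIGMOD ACM" is 1.0, while their character-based score is much lower, which is why hybrid and customizable combinations are needed.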

Exploitation of relationships between data
Reference reconciliation methods can be distinguished by their ability to exploit the relationships between data. Two types of approaches are distinguished [17]: local approaches and global approaches. 2.3.1 Local approaches. Some local approaches exploit reference descriptions that are not structured in attributes. The works [23][24][25][26], which adopt this vision, calculate similarity scores using textual data only, in the form of a single string. Such an approach is interesting (1) when a quick similarity computation is required [24,23,25], (2) when a set of candidate pairs for reconciliation is needed before applying a more sophisticated similarity computation [26], or (3) when the association (attribute, value) is not certain.
A second group of local approaches considers data structured in attributes. Many methods dealing with structured data propose probabilistic models [5,27,3] or calculate a similarity score for reference pairs [28][29][30]. The methods in the first case are highly dependent on the estimation of the probabilistic model parameters, contrary to those in the second case, which compute similarity scores for the attribute values and then combine them with an aggregation function to capture the possibly different degrees of influence of the attributes on the similarity of the reference pair.
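A minimal sketch of the second family, with hypothetical attributes and weights: per-attribute similarity scores are combined by a weighted average whose weights encode each attribute's influence on the reconciliation decision:

```python
def record_similarity(rec_a, rec_b, measures, weights):
    """Combine per-attribute similarity scores with a weighted
    average; the weights model each attribute's influence."""
    total = sum(weights.values())
    score = sum(weights[attr] * measures[attr](rec_a[attr], rec_b[attr])
                for attr in weights)
    return score / total

# Hypothetical attributes, weights and (trivial) equality measures;
# any similarity measure returning a score in [0, 1] could be used.
weights = {"name": 0.6, "city": 0.4}
measures = {"name": lambda a, b: 1.0 if a == b else 0.0,
            "city": lambda a, b: 1.0 if a == b else 0.0}
rec_a = {"name": "J. Smith", "city": "Oran"}
rec_b = {"name": "J. Smith", "city": "Algiers"}
score = record_similarity(rec_a, rec_b, measures, weights)  # 0.6
```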

2.3.2 Global approaches.
To improve the quality of the results of reconciliation methods, global approaches exploit the relationships between data. These relationships are either expressed explicitly, like foreign keys in relational databases or semantic relations in other data models [7,31,32], or are detected and used within the reference reconciliation algorithm [33]. Relationships, when they exist, make it possible to take more information into account when comparing reference descriptions. Some of these approaches, such as [31], perform simultaneous reconciliations of references of several types (people, articles, conferences) in order to capture the dependencies between reference pairs induced by the relationships linking these references. Thus, global methods improve the quality of the obtained results by reducing the number of false reconciliations (false positives) and the number of false non-reconciliations (false negatives).
2.3.3 Comparison between the two approaches. Some local reference reconciliation approaches, such as Fellegi and Sunter [5] and Winkler [27], use a probabilistic model. However, they are very dependent on the estimation of the probabilistic model parameters. This dependency can be an obstacle for some applications such as news websites or blogs: the parameters can hardly be estimated because no labelled data is available, and their estimation would make the method ineffective. Unlike local approaches based on probabilistic models, local approaches based on similarity measures do not face this dependency problem.
Local approaches consider each reference as an entity that is completely independent of other entities, ignoring that data is often represented in schemas expressing relationships between data. Global approaches, in contrast, exploit both the description of references in terms of attributes and the relationships linking different references in order to decide on reconciliation or non-reconciliation.

Exploitation of knowledge for reconciliation
Different types of knowledge can be used to improve the effectiveness of reference reconciliation approaches, in terms of the choice and parameters of similarity measures as well as the importance of different attributes in the reconciliation decision. Examples of such knowledge are knowledge about references and attributes, and knowledge about values, such as synonymy. Reconciliation approaches that exploit knowledge in a fixed way (by coding it directly into the approach) are very sensitive to changes of data sources and domain (medical, commercial, etc.), whereas dynamic exploitation of domain knowledge allows adaptation to changes in data and application characteristics. Two types of approaches can be distinguished [34,17] according to the way knowledge is acquired, namely learning-based (supervised) approaches and rule-based (unsupervised) approaches.
2.4.1 Supervised approaches. Supervised approaches [22,8] use learning algorithms to learn various kinds of knowledge from expert-labelled data. For these approaches, if the data changes, a new sample must be created on which the learning method is reapplied.
2.4.2 Unsupervised approaches. Unsupervised approaches [35][36][37] use a declarative language that allows an expert to specify the knowledge needed for reconciliation, for example in the form of rules. For these approaches, it is necessary to re-specify the required knowledge, such as transformation or reconciliation rules or concept profile descriptions, to account for changes in the data.
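A rule-based (unsupervised) approach can be sketched as follows; the records, attributes and rules are hypothetical examples of what a domain expert might declare, not the languages used in [35][36][37]:

```python
# Hypothetical declarative rules: each rule is a predicate over a
# pair of records; an expert specifies them, no training data needed.
def same_ssn(a, b):
    return a.get("ssn") is not None and a.get("ssn") == b.get("ssn")

def same_name_and_birth(a, b):
    return (a.get("name") == b.get("name")
            and a.get("birth") == b.get("birth"))

rules = [same_ssn, same_name_and_birth]

def reconcile(a, b):
    """Decide reconciliation when at least one rule fires."""
    return any(rule(a, b) for rule in rules)

a = {"name": "J. Smith", "birth": "1980-05-01", "ssn": None}
b = {"name": "J. Smith", "birth": "1980-05-01", "ssn": "123"}
# reconcile(a, b) is True: the second rule fires.
```

When the data or domain changes, it is this rule set (rather than a labelled sample) that must be re-specified.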
Knowledge exploitation allows the approaches to adapt to data evolution and domain changes (trade, travel, medicine, Web), in addition to improving result quality by reducing false decisions. However, these approaches depend on human effort to label the training data or to declare the rules manually, which makes them difficult to apply to large volumes of data or to evolving data.
2.4.3 Comparison between the two approaches. Reference reconciliation methods that exploit knowledge in a fixed way, by implementing it directly in the source code of the method, are highly vulnerable to changes of data sources and domain. For example, population census data in Algeria or China have different characteristics with regard to the names of persons and their dates of birth; non-explicit (segmented) information can be presented in different orders according to the culture of the population. In China, for example, the last name precedes the first name and the date of birth is written beginning with the year, followed by the month and ending with the day, whereas in Algeria the format (First Name Last Name, Day Month Year) is the most common. More generally [17], the quality of the data sources may differ. Unsupervised or supervised exploitation of domain knowledge makes methods capable of adapting to changes in data and application characteristics. For methods that acquire domain knowledge through supervised learning, when the data changes, a new sample is created and the learning method is reapplied.
For methods using a declarative language to specify knowledge [17], such as Hernandez and Stolfo [35], Wai Lup Low et al. [36] and Doan et al. [37], a re-specification of the necessary knowledge is needed, such as transformation or reconciliation rules or concept profile descriptions, taking into account changes in the data.
Methods exploiting domain knowledge are generic and produce fewer false positives and false negatives. Conversely, the cost in terms of the human effort required either to label the training data or to declare the rules manually makes these methods difficult to apply to large volumes of data or to frequently changing data.

Discussion on issues related to data reconciliation
Based on our reading of the literature and our work in this area, we summarize, in the following, some open issues and research directions for improving data reconciliation and data integration in general.
2.5.1 Compromise between effectiveness, efficiency, genericity and reduction of human intervention. A key challenge in developing an effective reconciliation solution is that some of the requirements conflict with one another. Blocking methods are used to reduce the search space and thus improve efficiency; nevertheless, they can reduce effectiveness by eliminating some relevant entity pairs. Conversely, effectiveness can be improved by combining several reconciliation algorithms, but this increases computing time and thus reduces efficiency. Reconciling entities within a single domain is easier than covering different domains with a generic reconciliation solution, and a domain-specific solution requires less human intervention to provide the training data. More effort is needed to reach such compromises [19,38].
2.5.2 Reconciliation in big data integration. The big data community is today giving more attention to data reconciliation from big and heterogeneous data sources [39]. Big data reconciliation is particularly difficult because big data sources contain unstructured data, have heterogeneous structures and evolve dynamically [40]. In practice, existing reconciliation methods cannot maintain their effectiveness when the volume of data becomes very large and the number of sources huge. New methods offer parallel and scalable reconciliation to deal with the volume dimension [41,39,42]; these include blocking techniques and techniques that distribute the load across different nodes. To deal with the execution time dimension, incremental clustering methods have been suggested [43].
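The blocking idea mentioned above can be sketched as follows: records are grouped by a (hypothetical) blocking key, and only records sharing a block are compared, which shrinks the quadratic search space at the risk of missing true pairs that land in different blocks:

```python
from collections import defaultdict

def block_by_key(records, key):
    """Group records by a blocking key; only records sharing a
    block are later compared in detail."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key(rec)].append(rec)
    return blocks

def candidate_pairs(blocks):
    """Enumerate the pairs that survive the blocking step."""
    for group in blocks.values():
        for i in range(len(group)):
            for j in range(i + 1, len(group)):
                yield group[i], group[j]

# Hypothetical key: the first three letters of the name attribute.
records = [{"name": "durand"}, {"name": "durant"}, {"name": "martin"}]
blocks = block_by_key(records, lambda r: r["name"][:3])
pairs = list(candidate_pairs(blocks))  # only the two "dur..." records
```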
2.5.3 Unstructured data reconciliation. The explosion of the number of data sources and the emergence of unstructured data have left information unclean, incomplete and redundant [44]. For years, the world of structured systems (transactions, databases, Oracle, DB2, Teradata, and others) grew side by side with the world of unstructured systems (email, phone conversation transcripts, spreadsheets, reports, documents, and others), and for years these worlds existed as if they were isolated. Once the two environments had matured and grown to a large size, it was recognized that they needed to be integrated, and data reconciliation is a crucial step in this integration. Hence, the unstructured data reconciliation research field is still relevant today. Highly heterogeneous information spaces also require new data reconciliation techniques. To address the variety aspect, proposed techniques generate structured data by tagging and matching text. Furthermore, the available data is usually inaccurate and noisy (http://tdan.com/matching-unstructured-data-and-structured-data/5009). To deal with this veracity aspect, clustering and assembly techniques that can handle noise and changing values have been proposed.
2.5.4 Real-time reconciliation. Real-time reconciliation (also known as dynamic, online or just-in-time reconciliation) assumes that data is matched dynamically and therefore changes regularly (for example, customer data in a widely used system). For this reason, some systems do not copy the working data; instead, the data is indexed in the source system using the same clustering rules as in the reconciliation process [45]. Making this type of reconciliation useful requires even more efficient methods and algorithms. Many recent efforts address this issue [46][47][48][49].

Data fusion (resolving attribute conflicts)
The last phase of a data integration system is to merge records referring to the same entity in reality by combining them in one representation and resolving eventual conflicts.
The problem of conflicting attribute values in the data integration field was raised for the first time by Umeshwar Dayal [50]. Since then, although the problem has received less attention, some techniques have been proposed [51]. In the following, we present data fusion [52], also known as data consolidation or entity resolution.

Conflicts classification
Data conflicts can be classified into two classes: (a) attribute value uncertainty, when information is missing, and (b) contradictions, when the attribute has different values [1,51].
Uncertainties: An attribute has a value in one source and a null value in one or more other sources.
Contradictions: The attribute value in one source differs from the attribute value in one or more other sources.
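This classification can be sketched as a small function over the values reported by the sources (all example values are hypothetical):

```python
def classify_conflict(values):
    """Classify attribute values coming from several sources."""
    non_null = {v for v in values if v is not None}
    if len(non_null) > 1:
        return "contradiction"   # distinct non-null values
    if None in values and non_null:
        return "uncertainty"     # one value vs. one or more nulls
    return "consistent"          # identical values (or all null)

# Hypothetical city values reported by sources for one entity.
assert classify_conflict(["Paris", "Paris", None]) == "uncertainty"
assert classify_conflict(["Paris", "Lyon"]) == "contradiction"
assert classify_conflict(["Paris", "Paris"]) == "consistent"
```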

Strategies for dealing with conflicts
Conflict-handling strategies are operations that define how incoherent data is handled. To provide a unique and coherent representation, sometimes one value is selected, other times values are combined or a new value is created.
There are several strategies for dealing with incoherence, some of which are mentioned repeatedly in the literature [53][54][55][56][57]. They are divided according to how they manage (or do not manage) conflicting data into three main classes [58]: ignorance, avoidance and resolution of conflicts.
3.2.1 Conflict ignorance. Conflict-ignoring strategies make no decisions about conflicting values; they are sometimes not even aware of the conflict and can therefore produce incoherent results.
The "skip conflicts" strategy and the "consider all possibilities" strategy are examples of conflict-ignoring strategies.
3.2.2 Conflict avoidance. These strategies are aware of conflicts but do not resolve each conflict individually. Instead, a single decision is made, for example, a preference for a particular source. Since the decision is often made before, or without, looking at the data values, conflicts are not always handled. These strategies are more efficient in processing time than conflict resolution strategies, at the price of a loss of precision due to not considering all the information that may be useful for conflict resolution.
Two classes of conflict avoidance techniques can be identified: a first class of strategies that take metadata into account when making decisions (metadata-based) and a second class that do not (instance-based).
The "confident source" strategy is an example of a metadata-based strategy; the "take information" and "only coherent data" strategies are examples of instance-based strategies.
3.2.3 Conflict resolution. Conflict resolution strategies, unlike the previous classes, take all values and metadata into account in the conflict resolution decision. These approaches are less efficient in processing time, but provide the most flexible conflict resolution capabilities.
As with the previous class, the decisions taken by conflict resolution strategies can be instance-based or metadata-based, using for instance the freshness of the data or the reliability of a source. According to the result produced, they can also be classified into deciding strategies, which choose one of the existing values, and mediating strategies, which create a new value, such as the average of the existing numbers.
The most common value strategy and the draw strategy are examples of instance-based conflict resolution strategies that produce decisions. An example of a mediating strategy is the middle value strategy. The most recent value strategy is a representative metadata-based strategy.
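A few of these strategies can be sketched as follows; the function names and example values are illustrative, not a standard API:

```python
from collections import Counter

def most_common_value(values):
    """Instance-based deciding strategy: keep the most frequent value."""
    non_null = [v for v in values if v is not None]
    return Counter(non_null).most_common(1)[0][0] if non_null else None

def most_recent_value(timestamped):
    """Metadata-based strategy: keep the freshest value,
    given (value, timestamp) pairs."""
    return max(timestamped, key=lambda vt: vt[1])[0]

def middle_value(values):
    """Mediating strategy: create a new value, here the average."""
    non_null = [v for v in values if v is not None]
    return sum(non_null) / len(non_null) if non_null else None

# Hypothetical conflicting values for one attribute.
city = most_common_value(["Oran", "Oran", "Algiers"])            # "Oran"
phone = most_recent_value([("555-0100", 2018), ("555-0199", 2021)])
price = middle_value([10.0, None, 14.0])                         # 12.0
```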

Relational operators and data fusion techniques
Data fusion can be performed by standard relational operators (union and join) or advanced ones. The join and union operators (and their variants) merge data of all kinds. Union-based approaches generate a common schema to which the records of the source tables are added. Other approaches define new operators and combine them with existing ones [51].

3.3.1 Joining approaches.
To join two tables, their schemas are expanded to allow adding unknown (null) values to tuples. The outer join operation is a variant that avoids losing tuples. The complete disjunction variant merges tuples of two or more relations into the same tuple [59].
3.3.2 Union approaches. The union of two relations merges identical tuples, that is to say, tuples with the same values. The minimal union operation is an improved variant that eliminates tuples that contain null values but share their non-null values with other tuples.
3.3.3 Other techniques. In addition to operators that resolve uncertainties, there exist relational operators for eliminating contradictions. The matchjoin operator first generates all possible tuples, then reduces them according to a user-defined policy [60]. The prioritized merge operator [61] follows the same procedure, except that it selects values from preferred sources.
Finally, data fusion can also be achieved by SQL-based techniques, through user-defined functions, aggregation functions and others.
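For example, using SQLite from Python, grouping duplicates by an entity key and applying an aggregation function resolves uncertainties; here MAX plays the role of a "take the non-null value" resolution function (the table and values are hypothetical):

```python
import sqlite3

# Two sources describe the same person; grouping by the entity key
# and aggregating resolves the null-vs-value uncertainty, because
# SQL aggregates such as MAX ignore NULLs.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE persons (id TEXT, name TEXT, phone TEXT)")
con.executemany("INSERT INTO persons VALUES (?, ?, ?)", [
    ("p1", "J. Smith", None),        # source A: phone unknown
    ("p1", "J. Smith", "555-0101"),  # source B: phone known
])
row = con.execute(
    "SELECT id, MAX(name), MAX(phone) FROM persons GROUP BY id"
).fetchone()
# row == ('p1', 'J. Smith', '555-0101')
```

For contradictions (two different non-null phones), a plain aggregate would silently pick one value; that is where the dedicated strategies and operators of the previous sections come in.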

Discussion on data fusion issues
Similar to what we did for data reconciliation, we summarize in the following some open issues and research directions for improving data fusion and data integration in general.
Advanced conflict resolution techniques Trusting the most accurate sources is not always the best solution, because even these sources can contain errors. The works in [62][63][64] propose to examine the accuracy of sources when deciding true values, through probabilistic models that iteratively compute the accuracy of the sources.
It is difficult to distinguish incorrect values from outdated ones: the most common value may be outdated, while the most recent value may be wrong. To find the correct values, a probabilistic model integrating the notion of source freshness is proposed in [65].
Sources can integrate instances from other sources; errors can therefore propagate quickly and bias the conflict resolution decision. The works in [66][67][68] propose to take dependencies between sources into account when discovering the correct values, using algorithms that iteratively detect dependency between sources.
Fusion in big data integration To meet the veracity challenges of big data, approaches extend existing ones to deal with data volume, response time and data variety. For instance, [69] proposed a framework offering three levels of transparency: a data resource integration level, a data fusion level and a data service provision level.
From data fusion to knowledge fusion Knowledge fusion identifies the true subject-predicate-object triples among those extracted by several information extractors from several information sources [70]. Traditionally, the process consists of schema mapping followed by data reconciliation with a focus on conflicts. Knowledge sources, however, are different and require more effort to handle [71]. While adapting and improving data fusion methods can solve some knowledge fusion problems, there is still room for improvement; deeper quality improvements, however, require more fundamental modifications of the initial assumptions that data fusion techniques make.
Multi-sensor data fusion To describe an environment or a process, observations from many sensors are combined; this combination of disparate sources performs better than any source used individually [72]. Multi-sensor data fusion is a difficult task because sensor technologies are imperfect and diverse, and their application environments are also of different natures. Even though some of these problems have been addressed, no existing technique can overcome all multi-sensor data fusion challenges [73,74].

Conclusion
Conflict resolution is accomplished in two phases: reference reconciliation first, then data fusion. Reference reconciliation methods aim to answer one question: do two given data descriptions refer to the same real-world entity or to two different entities? This problem occurs when more than one representation is used to describe a real-world entity, when the data contains errors, and when the information is incomplete. We defined the principles of a reconciliation method and then distinguished reference reconciliation methods, first by how they use reference descriptions, then by how they acquire knowledge.
Data fusion merges duplicates into a single representation while resolving possible conflicts between different values of the same attribute. This problem received little attention from researchers until the last two decades, when several works and methods emerged. At the end of each section, we also discussed some data reconciliation and fusion issues that are the subject of current research.
