Word Sense Disambiguation using Aggregated Similarity based on WordNet Graph Representation

The term word sense disambiguation (WSD) is introduced in the context of text document processing. A knowledge-based approach is built on the WordNet lexical ontology, whose structure and components are described as they are used to identify the context-related senses of each polysemous word. The principal distance measures defined on the graph associated with WordNet are presented, together with their advantages and disadvantages. A general model for aggregating distances and probabilities is proposed and implemented in an application that detects the contextual sense of each word. For words that do not exist in WordNet, a similarity measure based on co-occurrence probabilities is used. The WSD module is proposed for integration into document processing steps such as supervised and unsupervised classification, in order to maximize the correctness of the classification. Future work concerns the implementation of different domain-oriented ontologies.


Introduction
For the acquisition of knowledge in artificial intelligence, two approaches defined in [1] are used:
• a transfer process from human to knowledge base, whose major disadvantage is that the person who holds the knowledge cannot easily identify it;
• a conceptual modeling process, in which models are built and new knowledge is placed into them as it is acquired. This process led to the appearance of the ontology as a systematic organization of knowledge about reality, leading to the construction of theories about what exists.
An essential role of an ontology is to be reused in multiple applications. Mapping two or more ontologies is called alignment. This task is particularly difficult and is the main cause of limitation in extending existing ontologies [1]. The direction in which ontologies evolve is supported by the introduction of artificial intelligence techniques that emulate the mental representation of the concepts used and of the links between them.
The kernel of the ontology is defined as a system $O = (L, C, F, H, ROOT)$, where:
• $L$ is the lexicon formed of the terms of the natural language;
• $C$ is a set of concepts;
• $F$ is the reference function that maps the set of terms of the lexicon to the set of concepts;
• $H$ is the hierarchy of the taxonomy, given by a directed, acyclic, transitive and reflexive relation;
• $ROOT$ is the starting point on which the hierarchy is built.
There are two types of ontologies as defined in [1], depending on the area in which they are used:
• ontologies for knowledge-based systems, characterized by a relatively small number of concepts linked by a large and varied set of relationships; concepts are grouped into complex conceptual schemes or scenarios, and each concept can have one or more customizations;
• lexicalized ontologies, which include a large number of concepts linked by a small number of relationships, like the WordNet ontology, whose concepts are represented by sets of synonymous words; these ontologies are used in human language processing systems.
The concept of ontology is introduced here as a knowledge base in the classification of documents, in order to analyze documents semantically by resolving the ambiguity of terms. This integration improves the objective function defined for the classification techniques used. The main components of an ontology, namely the concepts and the relations between them, are described. These components are analyzed in order to identify methods of extracting knowledge from them. Using the relationships defined between concepts, a graph representation is created, seen as a taxonomy in which concepts belong to more general ones through "is-a" relations. The senses of a concept are defined, along with the possibility of representing each sense as a graph. In the context of the WordNet ontology, the concept of synset is introduced as an equality relation between concepts with similar senses. The graph representation is further used for evaluating the similarity between two concepts. The more similar the concepts, the shorter the path between the two corresponding nodes in the graph representation. Two elements from the same synset maximize the similarity measure.
The similarity calculation is used to evaluate the contextual senses of polysemous words, by finding, for each word in a phrase, the sense with the maximum probability of occurrence.
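As a minimal sketch of the structures described above, the five components of an ontology kernel can be held in a small Python container. The class name `OntologyKernel`, its field and method names, and the toy terms and sense labels are assumptions made for illustration; they are not taken from WordNet or from the application presented in this paper. The hierarchy is simplified to a single parent per concept.

```python
from dataclasses import dataclass


@dataclass
class OntologyKernel:
    """Toy container for the five kernel components (hypothetical names)."""
    lexicon: set      # terms of the natural language
    concepts: set     # concept identifiers
    reference: dict   # term -> set of concepts the term may denote
    hierarchy: dict   # concept -> direct parent concept (is-a relation)
    root: str         # starting point of the hierarchy

    def senses(self, term):
        """Return the concepts (senses) a term refers to, sorted by name."""
        return sorted(self.reference.get(term, set()))

    def share_synset(self, term_a, term_b):
        """True when the two terms refer to at least one common concept."""
        common = self.reference.get(term_a, set()) & self.reference.get(term_b, set())
        return bool(common)


# Illustrative data only; the sense labels are made up for this sketch.
kernel = OntologyKernel(
    lexicon={"car", "automobile", "vehicle"},
    concepts={"entity", "vehicle#1", "car#1", "car#2"},
    reference={
        "car": {"car#1", "car#2"},
        "automobile": {"car#1"},
        "vehicle": {"vehicle#1"},
    },
    hierarchy={"vehicle#1": "entity", "car#1": "vehicle#1", "car#2": "vehicle#1"},
    root="entity",
)

print(kernel.senses("car"))
print(kernel.share_synset("car", "automobile"))
```

In this sketch a polysemous term is simply a term whose reference set holds more than one concept, and two terms belong to the same synset when their reference sets intersect.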

Components and Structure of WordNet Lexical Ontology
WordNet is a database that contains information about the English vocabulary. Originally designed as a full-scale model of semantic organization, it was soon adopted in natural language processing (NLP). WordNet has become the database of choice in NLP, with Kilgarriff noting that not using this resource requires explanation and justification [19]. The popularity of the ontology is due to its open access and wide coverage. The WordNet ontology is created and maintained by Princeton University, and the database can be downloaded from [2]. It contains nouns, verbs, adjectives and adverbs, together with their lexical meanings and the relations between them. Words with similar meanings are organized into sets called synsets. The latest version, WordNet 3.0, contains about 155,000 words organized in about 117,000 synsets [3]. A synset consists of similar words, accompanied by a definition and examples of use of these words. Table 1 contains statistics on the number of existing synsets, along with the type of words from which they are formed. WordNet Domains is a lexical resource in which synsets are semi-automatically annotated with one or more membership classes from a set of 165 hierarchically organized domains [5].
The WordNet ontology is integrated into the representation and processing of documents as a component that solves problems such as [6]:
• ignorance of any relationship between words;
• high dimensionality of the representation space.
In [7], the WordNet structure is seen as intuitive. It consists of words that have multiple meanings, each sense forming a synset, which is the atomic structure of the WordNet ontology, and of relationships between words, such as synonymy and antonymy, which are represented as links in a graph.

Graph representation of WordNet components
In the WordNet ontology, several types of semantic relations are defined between the concepts represented by words and between the multiple meanings of words. Table 3 shows examples of the six types of relations that exist for nouns.

Fig. 2. Is-a relations for the first sense of the noun car
The graph representation is formed from the relations of "is-a" type (Figure 3).

Fig. 3. Graph associated with the first sense of the noun car using is-a relations
The similarity metric defined on this graph is then used in applications that process text documents, such as supervised classification and clustering, in order to solve the semantic problem.
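The construction of the is-a graph described in this section can be sketched in a few lines of Python. The edge list below is a hand-written, simplified fragment chosen for illustration and is not read from the WordNet database; the function names `build_graph` and `hypernym_chain` are likewise assumptions of this sketch.

```python
from collections import defaultdict

# Hand-written is-a edges (child, parent); an illustrative fragment only.
IS_A_EDGES = [
    ("car", "motor vehicle"),
    ("motor vehicle", "wheeled vehicle"),
    ("wheeled vehicle", "vehicle"),
    ("vehicle", "artifact"),
    ("artifact", "entity"),
]


def build_graph(edges):
    """Build an undirected adjacency map from (child, parent) pairs."""
    graph = defaultdict(set)
    for child, parent in edges:
        graph[child].add(parent)
        graph[parent].add(child)
    return graph


def hypernym_chain(edges, node):
    """Follow is-a links upward from node until no parent is left."""
    parent_of = dict(edges)  # each child has one parent in this toy data
    chain = [node]
    while chain[-1] in parent_of:
        chain.append(parent_of[chain[-1]])
    return chain


graph = build_graph(IS_A_EDGES)
print(hypernym_chain(IS_A_EDGES, "car"))
```

The undirected adjacency map is what a path-based similarity measure would traverse, while the upward chain gives the depth of a node relative to the root.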

Similarity Measure of Strength Connection between Two Nouns
The similarity between two concepts in the is-a hierarchy of the graph associated with the WordNet ontology quantifies how much the two objects resemble each other, based on the information contained in the hierarchy [8]. Measuring the correlation and the distance between words is used in applications such as identifying the contextual meanings of words, determining the structure of text documents, creating automatic summaries, information extraction and automatic indexing [9]. To understand how the similarity between WordNet concepts is calculated, the graph associated with the WordNet ontology is taken as the starting point. Figure 4 contains part of this representation for the examples car and bicycle.
When identifying the similarity between two concepts $c_1$ and $c_2$ that have multiple senses, the result of the metric is the maximum of the values of the similarity metric computed over each pair of senses of $c_1$ and $c_2$.
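The maximum-over-senses rule stated above can be illustrated with a small Python sketch. It assumes a path-based similarity of the form 1 / (1 + shortest path length), which is a common choice but is an assumption here, since the exact formulas belong to Table 4. The edges, sense labels and function names are hypothetical toy data.

```python
from collections import deque

# Toy is-a edges between sense nodes (child, parent); hypothetical data.
EDGES = [
    ("car#1", "motor vehicle#1"),
    ("motor vehicle#1", "wheeled vehicle#1"),
    ("bicycle#1", "wheeled vehicle#1"),
    ("car#2", "wheeled vehicle#1"),
    ("wheeled vehicle#1", "entity"),
]
SENSES = {"car": ["car#1", "car#2"], "bicycle": ["bicycle#1"]}


def neighbours(edges):
    """Undirected adjacency map built from the edge list."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def path_length(adj, start, goal):
    """Breadth-first search; returns the number of edges, or None."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def sense_similarity(adj, s1, s2):
    """Assumed path similarity: 1 / (1 + shortest path length)."""
    length = path_length(adj, s1, s2)
    return 0.0 if length is None else 1.0 / (1.0 + length)


def concept_similarity(adj, senses, c1, c2):
    """Maximum similarity over every pair of senses of c1 and c2."""
    return max(
        sense_similarity(adj, s1, s2)
        for s1 in senses[c1]
        for s2 in senses[c2]
    )


ADJ = neighbours(EDGES)
print(concept_similarity(ADJ, SENSES, "car", "bicycle"))
```

With this toy data the second sense of car lies two edges from bicycle, while the first sense lies three edges away, so the maximum is reached by the second sense.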
Denoting the general similarity measure by $sim$, defined over the set $C$ of concepts existing in the graph $G$, the similarity value is given by the relation [10]:

$$sim(c_1, c_2) = \max_{s_i \in S(c_1),\ s_j \in S(c_2)} sim(s_i, s_j)$$

where:
• $S(c)$ represents the set of senses of the concept $c$, with $c \in C$;
• $s_i$ represents a sense from the set of senses associated with concept $c$.
Table 4 contains the formulas for measuring the correlation or similarity between two concepts from the WordNet ontology, [11], [LINGLI12] and [12]. Two categories of similarity measures exist [10].

DOI: 10.12948/issn14531305/17.3.2013.15

Applying the metric of the minimum path length between two nodes is a correct measure of semantic distance when the density of terms across the semantic network is constant. In general, however, the density of a semantic network is not constant: the number of nodes increases with depth, in direct correlation with the increasing number of terms, so a density-based approach is required along with the shortest path evaluation. An example that reinforces this idea is given by the differences between the sets of concepts {plant, animal} and {zebra, horse}. Both pairs are separated by two links, but the connection between the first two concepts is weaker than the connection between the last two. This difference is given by the position of the concepts relative to the root level. The concepts plant and animal are more general and are situated at a higher level, whereas zebra and horse are more specific. By applying the simple process of calculating the depth of a node, the shortest path length metric is significantly improved. The problem that arises is translating the depth of a node into a density. The work of Richardson [15] suggests using the depth of each node itself as the density value. Thus, the distance between two nodes is calculated as the ratio between the length of the minimum path between the nodes and the density. Because this method assumes a linear relationship between depth and density, an assumption that does not hold in all cases, it is proposed to calculate the average density for each level of the graph. A function is created that associates an average density with each graph level. Let FDM be the function that receives the graph level as a parameter and returns the average density of that level. Using this approach, the distance function between two nodes is:

$$d_w(x, y) = \frac{len(x, y)}{FDM(L)}$$

where:
• $d_w(x, y)$ is the weighted distance between two WordNet concepts;
• $len(x, y)$ is the minimum length of the path between $x$ and $y$;
• $L$ is the level at which the nodes $x$ and $y$ are found within the graph. Since the two nodes are not necessarily found at the same level, one way of solving this problem is to assign to $L$ the level of the closest common parent of $x$ and $y$.
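The density-weighted distance described above can be sketched as follows. The sketch makes two assumptions that the text does not fix: the density of a level is taken as the average number of direct children of the nodes at that level, and the weighted distance is the minimum path length divided by FDM at the level of the closest common parent. The toy taxonomy and all function names are hypothetical.

```python
# Toy taxonomy as child -> parent; "entity" is the root (level 0).
PARENT = {
    "organism": "entity",
    "plant": "organism",
    "animal": "organism",
    "equine": "animal",
    "zebra": "equine",
    "horse": "equine",
    "donkey": "equine",
    "pony": "equine",
}


def chain_to_root(node):
    """List of nodes from node up to the root, inclusive."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def level(node):
    """Depth of a node; the root has level 0."""
    return len(chain_to_root(node)) - 1


def closest_common_parent(x, y):
    """First node on the way up from y that is also an ancestor of x."""
    ancestors_x = set(chain_to_root(x))
    for node in chain_to_root(y):
        if node in ancestors_x:
            return node
    raise ValueError("nodes are not connected")


def min_path_length(x, y):
    """Number of edges on the tree path between x and y."""
    common = closest_common_parent(x, y)
    return level(x) + level(y) - 2 * level(common)


def fdm(lvl):
    """Assumed density: average number of children of nodes at a level."""
    nodes = set(PARENT) | set(PARENT.values())
    at_level = {n for n in nodes if level(n) == lvl}
    children = sum(1 for parent in PARENT.values() if parent in at_level)
    return children / len(at_level)


def weighted_distance(x, y):
    """Path length divided by FDM at the closest common parent's level."""
    if x == y:
        return 0.0
    common = closest_common_parent(x, y)
    return min_path_length(x, y) / fdm(level(common))


print(weighted_distance("plant", "animal"))
print(weighted_distance("zebra", "horse"))
```

Both pairs are two links apart, but in this toy taxonomy the level of the common parent of zebra and horse is denser, so the weighted distance between them comes out smaller than the one between plant and animal.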

Word Sense Disambiguation of Polysemantic Nouns
Automatic evaluation of the contextual meanings of words has been a topic of interest since the beginning of natural language processing. Sense evaluation is not seen as an independent task, but as a necessary intermediate step toward the semantic processing of text objects [16]. One way to choose the contextual meaning of a polysemous word is to extend the analysis from word level to sense level. This increases the size of the representation of text documents in direct proportion to the number of senses added to the analysis, and requires a training base on which to perform statistical analysis of the occurrence of contextual meanings and of their correlations with the words they directly interact with. The starting point for analyzing the contextual meaning of a word is the set of meanings available in the WordNet ontology, along with a count of the number of times each meaning occurred. Figure 5 contains the meanings of the word car and their counts. To evaluate the senses of the word road in each of the phrases $F_1$ and $F_2$, the array of similarities between the word road and the remaining words is formed using the $d_{PATH}$ metric (Table 7). For the first phrase, the sense chosen for the word road is the first one, and for the second phrase, the sense of the word road is road#s2.
Once the sense of the word road has been selected, the next polysemous word, car, is analyzed. An optimization consists in considering only the senses for which statistics exist in the WordNet ontology; for the noun car these are senses 1, 2 and 3 (Table 9). Because the aggregated probability for the second sense of the noun car is the greatest, the contextual sense of the word car is car#s2.
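The selection step described above can be sketched in Python under explicit assumptions. The aggregation used here, summing the similarities of each sense to the context words and normalizing across senses, is one possible reading of the aggregated probability and is not confirmed by the text. All similarity values, usage counts, context words and function names below are made up for illustration and do not reproduce Tables 7 to 10.

```python
def sense_probabilities(similarities, counts=None):
    """Turn per-sense similarity values into normalized scores.

    similarities: {sense: {context_word: similarity}}
    counts: optional {sense: usage count}; senses without recorded
            usage are skipped, mirroring the optimization in the text.
    """
    if counts is not None:
        similarities = {
            sense: values
            for sense, values in similarities.items()
            if counts.get(sense, 0) > 0
        }
    totals = {sense: sum(values.values()) for sense, values in similarities.items()}
    if not totals:
        return {}
    grand_total = sum(totals.values())
    if grand_total == 0:
        # No evidence at all: fall back to a uniform distribution.
        return {sense: 1.0 / len(totals) for sense in totals}
    return {sense: total / grand_total for sense, total in totals.items()}


def disambiguate(similarities, counts=None):
    """Return the sense with the highest aggregated score, or None."""
    probabilities = sense_probabilities(similarities, counts)
    if not probabilities:
        return None
    return max(probabilities, key=probabilities.get)


# Made-up similarity values between senses of "car" and two context words.
CAR = {
    "car#s1": {"road": 0.10, "train": 0.05},
    "car#s2": {"road": 0.10, "train": 0.25},
    "car#s4": {"road": 0.50, "train": 0.50},
}
# Made-up usage counts; car#s4 has no statistics and is filtered out.
COUNTS = {"car#s1": 5, "car#s2": 3, "car#s4": 0}

print(disambiguate(CAR, COUNTS))
```

Filtering on usage counts changes the outcome in this toy example: without the filter the unattested sense car#s4 would win, which shows why the choice of candidate senses matters as much as the similarity values themselves.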
Figure 7 presents the source code of the WSD process. The testing process consists in running a set of phrases whose contextual senses were classified beforehand. The metric used for evaluating the WSD correctness is defined using:

Fig. 4. Graph associated with the nouns car and bicycle using is-a relations

Fig. 5. Senses and number of appearances of the noun car

Table 4. Formulas of the metrics for evaluating the similarity between two WordNet concepts

Table 7. Similarity values between the words from F 1 and F 2

Table 8. Probabilities of the senses of the word road

Table 9. Similarity measures between the senses of car and the other words

Table 10. The probabilities of the senses of the noun car