PASVAG: A Routine to Relate Different Geographical Names

This research aims to contribute to studies that investigate, from a geographical or ethno-historical standpoint, the origins of and changes in Geographical Names. According to Rostaing, "Toponymy intends to seek the origin of place names and also to study their transformations." Quantifying these changes can reveal the proximity relationships between words, which is what the proposed indicator measures. An example is the renaming of a neighborhood of the city of Rio de Janeiro, Brazil, formerly known as "REAL ENG °" and now the "Realengo" neighborhood. Assigning a distance value to this change can reveal associations that have never been made before. The research therefore sought solutions that combine computational linguistics and textual similarity indices applied to Geographical Names, so that these attributes can serve as linking keys when processing queries across different databases. The goal is to enable the management and retrieval of a large body of data. The similarity indicator proposed in this paper was tested in experiments with simulated data. The main results were the recognition of patterns in the tests, the importance of the position of the noise in the string, and the usage limits of the similarity component in the integration of databases.
Citation: Matos V, Coelho V (2015) PASVAG: A Routine to Relate Different Geographical Names. Arts Social Sci J 6: 120. doi:10.4172/21516200.1000120


Introduction
When working with information originating from various sciences, there is a need to seek strategies for data retrieval and visual exploration [1]. From a technological standpoint, using space as a reference for data exploration adds understanding and insight when building information [2]. For this reason, preparing information for the design of a geographic information system is a multidisciplinary and interdisciplinary task [3][4][5]. Such a heterogeneous data set depends on the integration of several sciences and, in this context, reflects the need to store different types of data which, grouped as records in a logical format, constitute a Geographic Database (GDB).
This work uses the notion of similarity between geographical names (NG), understood as a measure of how similar two strings are. The premise is that the quantification of similarity is based on a metric space, which in turn provides the notion of distance and relates proximity to the first law of geography: "all things are similar, but closer things are more related than distant things" [6].
The growing demand for information means that the search for innovative technological tools is premised on the need to generate and store large quantities of records. Thus, in the context of data replication, the Geographic Database (BDG) becomes the central issue of this work. The problem is treated from the perspective of the structural universe [7,8], which comprises the concepts of reality to be stored on the computer, i.e. the BDG. It uses variables that give spatial meaning to the formal models of geographic entities. They are: the geometric field, which stores the geometry of the record (point, line or polygon) as one or more pairs of coordinates; and the text field, which stores its geographical name.
The same reasoning is applied to Geographic Names: proximity is adopted as the response of the similarity indicator. Therefore, if the distance between two strings approaches zero, this indicates a match between the NG. Through the text field of the BDG, called "geographical name", the efficiency limits of the similarity index are examined as it varies, in order to propose an acceptance threshold for the similarity of Geographical Names.

Materials and Methods
The database used to test the proposed indicator (PASVAG) was implemented in Postgres, with strings limited to the range of 1 to 20 characters, which sets the maximum string size used to create the database. Through algorithms written in the Java programming language, we populated the database with simulated character sequences that include noise (a dissimilar character) in every possible position in the string. In this way, the performance of the similarity indicator is evaluated against changes in the cardinality of the strings, the position(s) of the noise and the amount of noise. This routine made it possible to generate the comparison material by introducing noise into the string to be compared, and thus to calculate the distance under controlled noise.
The pseudocode starts, at its first level, with data entry: at this stage, a variable is loaded with the sequence of characters. In the example, s1 receives "ABCD"; the next step starts a counter and checks the size of s1. Immediately below, the diamond is the counter that tests the condition and repeats the subsequent expressions as many times as the cardinality of the string. At the following level, position 1 of the array is replaced by noise (@) and a second counter is incremented. The condition on the second counter is then checked; it controls the procedures that compute the similarity and store the results in the database. Figure 1 shows how the records were stored in the database. This process was repeated up to strings of cardinality 20. Summed over all string sizes, and according to the law of permutations, the total universe comprises 2,097,130 records. This small sample of the database exemplifies how the noise is distributed over all possible positions.
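The generation step described above can be sketched as follows. This is a minimal illustration in Python rather than the authors' Java routine, which is not reproduced in the paper. It assumes that each record replaces a non-empty subset of the positions of the base string with the noise symbol "@"; the function name `noisy_variants` is hypothetical. Under this assumption a string of cardinality n yields 2^n - 1 records, and summing over cardinalities 1 to 20 gives the 2,097,130 records reported in the text.

```python
from itertools import combinations

NOISE = "@"  # noise symbol used in the paper


def noisy_variants(s, noise=NOISE):
    """Yield every copy of `s` with a non-empty set of positions replaced by noise.

    ASSUMPTION: this reading of the generation routine is inferred from the
    record count reported in the text, not taken from the authors' code.
    """
    n = len(s)
    for r in range(1, n + 1):                      # amount of noise
        for positions in combinations(range(n), r):
            chars = list(s)
            for p in positions:
                chars[p] = noise
            yield "".join(chars)


variants = list(noisy_variants("ABCD"))
print(len(variants))   # 15 noisy strings for cardinality 4
print(variants[:4])    # ['@BCD', 'A@CD', 'AB@D', 'ABC@']

# Number of records for cardinalities 1..20 under this reading
total = sum(2**n - 1 for n in range(1, 21))
print(total)           # 2097130
```

The total is computed from the closed form instead of enumerating every string, which keeps the example fast.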

First
Table 1 shows an excerpt of the records, illustrating how the dissimilar characters were allocated according to the methodology described above: a permutation covering all the possible noise positions for a string of four characters. Note that, in this approach, the noise, symbolized by the sign "@", is expressed as a percentage of the size of the string being compared, as given by Equation 1:

$$R_{\%} = \frac{r}{t} \times 100 \qquad (1)$$

where $R_{\%}$ is the noise percentage, $r$ is the number of dissimilar characters and $t$ is the number of characters in the string.
This concept makes strings of different cardinality comparable in the simulated scenario. The formulation also shows that the terms "r" and "t" cannot take non-integer values or values smaller than one, since the atomic unit of a string is invariably one character. This restricts the possible noise levels to a finite set of known values. Another observation about Equation 1 is the effect of the term "t" on the noise percentage. Table 2 shows that, for the same noise percentage (25%), the number of dissimilar characters grows with the size of the string; therefore, the higher the cardinality of the string, the larger the amount of noise distributed for the same percentage. To calculate this new similarity indicator within the scope of Geographic Names (NG), a set of assumptions must be met, which are:
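The effect of "t" can be illustrated with a short sketch. It assumes that Equation 1 is the ratio r / t expressed as a percentage, as the definitions of its terms suggest; the function name `noise_percentage` is hypothetical. The example lists the string sizes up to 20 characters that can carry exactly 25% noise, together with the number of dissimilar characters that this percentage represents.

```python
def noise_percentage(r, t):
    """Noise percentage for r dissimilar characters in a string of t characters.

    ASSUMPTION: Equation 1 is taken to be 100 * r / t.
    """
    if not (isinstance(r, int) and isinstance(t, int)) or r < 1 or t < 1 or r > t:
        raise ValueError("r and t must be integers with 1 <= r <= t")
    return 100 * r / t


# (t, r) pairs with exactly 25% noise for strings of 1 to 20 characters
pairs = [(t, r) for t in range(1, 21) for r in range(1, t + 1)
         if noise_percentage(r, t) == 25]
print(pairs)   # [(4, 1), (8, 2), (12, 3), (16, 4), (20, 5)]
```

Because r and t are whole numbers, only some cardinalities admit a given percentage, and the same 25% corresponds to one dissimilar character in a four-character string but five in a twenty-character string.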

Assumptions
Let $C$ be any character string. The cardinality of sets is then adopted, i.e. the number of elements in any set $W$, denoted $|W|$. The cardinality of $W$ is therefore simply the number of objects in the set. For example, for the set $L = \{R, i, o\}$, the cardinality is $|L| = 3$. The next proposition considers any two sets $W$ and $Z$.
Let the difference $(W - Z)$ be the set of elements that are in $W$ and not in $Z$, i.e. $W - Z = \{x \mid x \in W,\ x \notin Z\}$. This forms the basis for understanding the subtraction of sets. Figure 2 illustrates the difference between sets with a Venn diagram: in one order of subtraction, the non-hatched portion shows that the result is an empty set. When the order of subtraction is reversed, the result is a unit set (a set with a single element). The same proposition applies to the sets $L_1$ and $L_2$. Finally, $I$ is a similarity index that meets the requirements of a metric space.
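The cardinality and set-difference notions above map directly onto Python's built-in `set` type. The short sketch below uses the set {R, i, o} from the text; the sets `W` and `Z` are hypothetical examples chosen so that one order of subtraction gives the empty set and the other a unit set, as in the Venn diagram described for Figure 2.

```python
L = set("Rio")                     # the set {R, i, o} from the text
print(len(L))                      # 3, the cardinality |L|

W = set("RIO")                     # hypothetical example sets
Z = set("RIOS")
print(W - Z)                       # set(): every element of W is also in Z
print(Z - W)                       # {'S'}: a unit set
print(len(W - Z) + len(Z - W))     # 1, the summed cardinality of both differences
```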

Proposed Similarity Indicator for Geographic Names
To calculate this new indicator, the string is mapped into two units: characters and bigrams. For characters, the noise is first taken as the set of characters that are present in $L_2$ and not in $L_1$. Having counted these differences, we then treat the set of characters that are present in $L_1$ and not in $L_2$. The two counts are summed and termed the noise in the characters ($R_c$), i.e. $R_c = |L_1 - L_2| + |L_2 - L_1|$ (eq. 2). The noise ratio in the characters ($T_c$) is then obtained by applying the ratio (eq. 3) to (eq. 2).
In the next step the process is repeated with the string now mapped into bigrams. Summing the differences gives the amount of noise in the bigrams ($R_B$) (eq. 4).
The noise ratio in the bigrams ($T_B$) is then calculated by applying the ratio (eq. 5) to (eq. 4).
Finally, the indicator is obtained by applying the results of (eq. 3) and (eq. 5) to the variables of (eq. 6).
Thus, the proposed indicator satisfies the metric space condition in inverse form. The distance between two strings is therefore $(1 - I)$: if the indicator ($I$) reaches maximum similarity, its value is 1, and the corresponding distance between the compared strings is zero.
As seen previously, the formulation of the similarity indicator for NG must meet conditions on the cardinalities of the set differences. Once the formulation had been checked against the expected properties, the maximum and minimum limits of similarity were determined. For example, when the comparison of two strings yields zero noise, all the elements match and we have a case of maximum similarity. Applying (eq. 3) and (eq. 5) then gives
null noise ratios, $T_c = 0$ and $T_B = 0$, which, applied in (eq. 6),
result in the maximum similarity, $I = 1$. This validation shows that the proposed indicator fits the requirements of a metric space, as it obeys the conditions of positivity and of zero distance between identical strings.
At the other limit, that of minimum similarity or maximum distance, the geographical names are entirely dissimilar, that is, there is no similarity between the pair of strings. In this case the amount of noise is greater than zero, subject to the condition a ≠ b ≠ 0, and (eq. 3) and (eq. 5) are applied.
Therefore $I = 0$. This gives the value of the indicator even if the value of $T_B$ is not known, because the term in $T_c$ is zero, and zero multiplied by any amount results in zero.
This final result shows that the compared strings are at the maximum distance from one another, which means no similarity.
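The construction described in this section can be sketched in Python. The equations themselves did not survive in this text, so the sketch rests on two explicit assumptions that are not confirmed by the paper: each noise ratio is the amount of noise divided by the size of the union of the two sets, and the indicator is the product I = (1 - Tc)(1 - Tb). These choices were made only because they reproduce the limits stated above (I = 1 for zero noise, I = 0 for fully dissimilar strings) and the distance 1 - I. All function names are hypothetical, and the values produced need not match Tables 3 to 6.

```python
def bigrams(s):
    """Set of adjacent character pairs of s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def noise(a, b):
    """Amount of noise: summed cardinality of both set differences."""
    return len(a - b) + len(b - a)


def noise_ratio(a, b):
    """ASSUMPTION: noise normalised by the size of the union of both sets."""
    union = a | b
    return noise(a, b) / len(union) if union else 0.0


def similarity(s1, s2):
    """ASSUMPTION: I = (1 - Tc) * (1 - Tb); not the paper's confirmed eq. 6."""
    tc = noise_ratio(set(s1), set(s2))            # noise ratio in characters
    tb = noise_ratio(bigrams(s1), bigrams(s2))    # noise ratio in bigrams
    return (1 - tc) * (1 - tb)


def distance(s1, s2):
    """Distance taken as 1 - I, as stated in the text."""
    return 1 - similarity(s1, s2)


print(noise(set("BANANA"), set("ANANAIS")))   # 3 dissimilar characters: B, I, S
print(similarity("BANANA", "BANANA"))         # 1.0, maximum similarity
print(similarity("ABC", "XYZ"))               # 0.0, no similarity
```

Working on sets means that repeated characters and repeated bigrams count once, which follows the set-based description in the Assumptions section.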

Calculation Methods
Given the conditions for applying similarity to the specific case of NG, it is necessary to make measurable how different two NG are. For this, we take the notion of distance to measure similarity: the shorter the distance, the more similar the NG are, and the greater the distance, the less similar they are. The first property of a metric space is positivity, $f(C_1, C_2) \geq 0$ for all $C_1$, $C_2$ in X. As an example of its application, let I("BANANA", "ANANAIS") be the similarity of two strings $C_1$ and $C_2$ (Table 3).
The result is indicated in the last row of the table. It satisfies the first condition of a metric space, since its value is positive. The following test shows that the indicator reaches maximum similarity when the distance between the strings is zero. Let us check whether $C_1 = C_2$ gives distance$(C_1, C_2) = 0$, taking I("BANANA", "BANANA") (Table 4).
The zero obtained in response means that the distance between the words is null, which proves this second property. To validate the third property, symmetry, $f(C_1, C_2) = f(C_2, C_1)$, we reuse the example of the first property with the inputs swapped: let I("ANANAIS", "BANANA") be the similarity of two strings $C_1$ and $C_2$ (Table 5).
At the end of the calculation, the result is the same as in the first example, meaning that the order of the input variables of the similarity function does not matter. This check shows that the indicator meets the symmetry condition of a metric space. Next, the last condition, the triangle inequality, is verified to complete the premises of a metric space. Let $C_1$, $C_2$ and $C_3$ be the strings "BANANA", "ANANAIS" and "TOMATE", whose distances are to be calculated (Tables 4 and 6).
These results validate the triangle inequality property, $d(C_1, C_2) \leq d(C_1, C_3) + d(C_2, C_3)$. The figure shows visually that the sum of the distances f(ABACAXI, BANANA) + f(ABACAXI, ANANAIS) is greater than the distance f(ANANAIS, BANANA) (Figure 3).
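The four checks walked through above (positivity, zero distance for identical strings, symmetry and the triangle inequality) can be run numerically on the example strings. The sketch below is only a spot check on four names, not a proof. Its `distance` function is an assumed stand-in, since the paper's equations are not recoverable here: it takes the distance as 1 - I, with I the product of the character and bigram match ratios (set intersection over set union). The helper names are hypothetical.

```python
from itertools import permutations


def bigrams(s):
    """Set of adjacent character pairs of s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def match(a, b):
    """Share of elements common to both sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def distance(s1, s2):
    """ASSUMPTION: distance = 1 - I, with I = char match * bigram match."""
    return 1 - match(set(s1), set(s2)) * match(bigrams(s1), bigrams(s2))


def check_metric_properties(names, dist, tol=1e-12):
    """Check the four metric-space conditions on every combination of names."""
    return {
        "positivity": all(dist(a, b) >= 0 for a in names for b in names),
        "identity": all(abs(dist(a, a)) <= tol for a in names),
        "symmetry": all(abs(dist(a, b) - dist(b, a)) <= tol
                        for a in names for b in names),
        "triangle": all(dist(a, b) <= dist(a, c) + dist(c, b) + tol
                        for a, b, c in permutations(names, 3)),
    }


names = ["BANANA", "ANANAIS", "TOMATE", "ABACAXI"]
report = check_metric_properties(names, distance)
print(report)
# {'positivity': True, 'identity': True, 'symmetry': True, 'triangle': True}
```

Passing `dist` as a parameter lets the same checker be reused with any other distance function, including an implementation of the paper's exact equations.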

Results
For the similarity tests on strings with simulated data, an important point is the definition of the sequence of known characters that serves as the basis for the models studied. This representation reproduces, on a smaller scale, the universe of possibilities that an alphanumeric character can occupy in the composition of an NG. The tests with a simulated database were designed both to discover other parameters of the problem under study and to test the proposed similarity indicator and identify its advantages and limitations. In this simulation model, capturing the effect of noise on strings was of great value for formulating a theoretical model and refining knowledge about similarity indicators for NG.

Results with simulated data
Figure 4 shows the effect of 40% noise (two dissimilar characters) for a cardinality of six characters. As the position of the noise varies, different response patterns are observed:
1) In the first pattern, the noise occupies the ends of the string, i.e. the noise symbol "@" is grouped at the beginning, grouped at the end, or occupies both ends. This pattern gives the maximum cut for this range of cardinality and noise. This feature is useful for the treatment of addresses, because the indicator is tolerant of changes in the address type, such as "Street" or "Avenue", placed at the beginning of the string.
2) In the next pattern, the noise occupies one end or is grouped within the string. This pattern shows an average cut in relation to the other noise positions.
3) In the hard cut pattern, the noise appears within the string, occupying spaced positions.
Given the above, for any cardinality there is a response pattern that distinguishes the various known noise positions. This makes the approach applicable to any geographical name.
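The three response patterns can be reproduced with a small experiment on the hypothetical base string "ABCDEF" with two noise characters. The sketch assumes an indicator equal to the product of the character and bigram match ratios (set intersection over set union); this is an assumption standing in for the paper's equations, which are not recoverable from this text. Exact fractions are used so that equal similarity values group together without rounding issues.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import combinations


def bigrams(s):
    """Set of adjacent character pairs of s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def match(a, b):
    """Exact share of elements common to both sets."""
    union = a | b
    return Fraction(len(a & b), len(union)) if union else Fraction(1)


def similarity(s1, s2):
    """ASSUMPTION: I = character match * bigram match (not the paper's eq. 6)."""
    return match(set(s1), set(s2)) * match(bigrams(s1), bigrams(s2))


def group_by_similarity(base, r, noise="@"):
    """Group every placement of r noise characters by the similarity it yields."""
    groups = defaultdict(list)
    for positions in combinations(range(len(base)), r):
        chars = list(base)
        for p in positions:
            chars[p] = noise
        noisy = "".join(chars)
        groups[similarity(base, noisy)].append(noisy)
    return dict(groups)


groups = group_by_similarity("ABCDEF", 2)
for value in sorted(groups, reverse=True):
    print(round(float(value), 3), groups[value])
```

Under these assumptions the 15 placements fall into three groups, that is, the amount of noise plus one. The highest similarity is obtained for '@@CDEF', '@BCDE@' and 'ABCD@@' (noise at the ends), the lowest for 'A@C@EF', 'A@CD@F' and 'AB@D@F' (noise spaced inside the string), and the remaining nine placements share the intermediate value.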
Next, the outliers of the similarity indicator were checked. Figure 5 shows the performance of the similarity function for a cardinality of twenty characters with the minimal cut pattern. By premise, the value "one" is imposed for maximum similarity and absence of noise. However, according to the behavior of the function shown, the similarity indicator only measures an amount of dissimilar terms up to half the cardinality of the string; values above 50% noise are not measured.
The amplitude of the similarity indicator as the noise varies was investigated and is shown in Figure 6, through the similarity score. As the noise increases and the similarity values become smaller, the amplitude between the maximum and minimum values of each similarity class becomes greater. An example is the indicator at the position that admits only noise Rc=1: The marker has a similarity Rc= [8.5, 9. From the behavior of the tested similarity function, a rule can be created to quantify the amplitude of the response as a function of the cardinality and the amount of noise. The number of distinct similarity values is always the amount of noise plus one. That is, a string of twenty characters with two noise characters yields three valid similarity results. This rule is valid up to the limit of the similarity indicator, that is, half of the cardinality.
Another important observation is shown in Figure 7. When a five-character noise is introduced, the rule established to quantify the number of similarity responses still holds. As highlighted by the red rectangle, the pattern of six answers persists no matter how much the cardinality rises. With these data we find that the output values

Taxonomy of noise in geographical names
From the maximum and minimum intervals listed in the triangular matrix available in the appendix, some of the noise patterns studied with simulated data are listed below. The triangular matrix helps the user to infer the situations to be expected in a similarity query, i.e. to quantify how flexible the response pattern of a query by NG is. Figure 8 summarizes the strategy for relating records by similarity: for any fixed cardinality of an NG, the higher the amount of noise, the lower the similarity index and hence the lower the confidence in a relationship. Conversely, the smaller the amount of noise, the higher the similarity index and the greater the confidence in the linkage between records. We start with the similarity range I = [0.85, 0.90], with the interference of noise below 5% (relative to the EQ.5 expression).
This first pattern highlights subtle differences in the spelling of terms (Figure 9). At this level of similarity, with a small amount of noise and a high cardinality, the similarity variable alone is sufficient to trust the relationships between data.
In response patterns with a similarity of 0.81 and a noise percentage of 15 to 20% (relative to the eq.5 expression), a new pattern appears in which the similarity indicator relates incomplete NG. As exemplified in Figure 10, the range in question contains names with "PARQUE" or "VILA": even if the term is present in only one of the records, the relationship between both is possible. However, this range requires knowing one more variable in addition to the similarity, because similarity alone is not enough to trust the relationship between the NG. Compared with Figure 10, the example of Figure 11 shows how much the trust in the relationships diminishes for this amount of similarity with smaller cardinalities. Here the cardinality and similarity variables reveal another kind of relationship, one to many: a record of the IPP database table becomes related to more than one record of place names in the CNEFE table. This limit makes the relationship decision unfeasible, because trust is undermined even when the similarity variables are refined. For the range of values from below 0.65 down to the limit of the similarity metric, 0.5, we find the following: low cardinality and low similarity with high noise, which rules out any relationship.
For high cardinality and high noise ratio, with low similarity, the relationship between the records sought is possible as long as it is analyzed together with an external variable, its geographical position (Figure 12).

Discussion
In short, when dealing with smaller similarity values, from 0.6, a balanced retrieval of records can be performed provided that the noise considered is greater than eight characters. Even with low sensitivity and a low correct classification rate (43%), the low percentage of incorrect classifications (6%) minimizes the risk of erroneous classification.
Before accepting the hypothesis that there is a similarity threshold between Geographic Names, the experiments show that it is necessary to consider the position occupied by the noise in the string. In addition, one must ask whether the noise is clustered or dispersed. This is highly relevant for the acceptance scale of the similarity value in the tested databases. To set a lower limit of similarity based on the experiments, two parameters can be considered. The first parameter is observed in Figures 4 and 11: at the point of minimum similarity of a string of twenty characters, the maximum amount of noise that the indicator measures is equivalent to 50% of the cardinality of the string. The second parameter is the analysis of the different cardinalities of the Geographic Names. A string cannot be twice the size of its paired string, and when considering this difference in cardinality, attention should also be paid to the noise between them. Example: I("Avenida Guilherme Maxwell, Rio de Janeiro", "AvG Macwell, Rio de Janeiro") = 0.507. Therefore, the maximum difference of 62% takes into account changes in cardinality and/or noise.
For strings of the same cardinality, the number of possible results considering the variation of the noise position is the amount of noise plus one. This property is important for future mappings of similarity values.

Conclusion
The assessment of the applicability of the proposed indicator to the retrieval of records from different databases showed it to be quite effective for the correct classification of NG pairs. The results indicated that, for similarity values smaller than 0.9 and greater than 0.8, using an additional variable to refine the retrieved records increases confidence in the information. For cardinalities greater than ten characters, this has a positive effect on the correct classification of genuine positives. For the lower similarity range, between 0.5 and 0.6, the additional noise variable increases the efficiency of correct classification. When considering noise values greater than eight characters, the rate of incorrect classifications drops to 6%. This shows an asymmetry property, with high efficiency for the classification of negative impostors. The expected goal was reached: the similarity coefficients improved queries that require textual comparisons, as in the georeferencing of addresses.