Adaptive Geoparsing Method for Toponym Recognition and Resolution in Unstructured Text

Abstract: The automatic extraction of geospatial information is an important aspect of data mining. Computer systems capable of discovering geographic information from natural language involve a complex process called geoparsing, which includes two important tasks: geographic entity recognition and toponym resolution. The first task can be approached through machine learning, in which case a model is trained to recognize sequences of characters (words) corresponding to geographic entities. The second task consists of assigning such entities their most likely coordinates; frequently, this process involves solving referential ambiguities. In this paper, we propose an extensible geoparsing approach that includes geographic entity recognition based on a neural network model and disambiguation based on what we have called dynamic context disambiguation. Once place names are recognized in an input text, they are resolved using a grammar in which a set of rules specifies how ambiguities can be solved, in a way similar to how a person would, considering the context. As a result, we obtain an assignment of the most likely geographic properties of the recognized places. We propose an assessment measure based on a ranking of closeness between the predicted and actual locations of a place name. Regarding this measure, our method outperforms OpenStreetMap Nominatim. We include further measures to assess the recognition of place names and the prediction of what we call geographic levels (the administrative jurisdiction of places).


Introduction
Geoparsing is a sophisticated process of natural language processing (NLP) used to detect mentions of geographical entities and to encode them in coordinates [1]. Roughly, the text is analyzed to obtain place names and to visualize their locations on maps. One way to achieve this kind of analysis is to process textual information through a pipeline in which place names are first recognized and then geolocated. Named entity recognition (NER) is an important NLP task that seeks to recognize and classify entities in predefined categories such as quantities, people, organizations, places, expressions of time, and events, among others. This research topic has become very relevant, since high-performance NER systems usually precede other complex NLP tasks.

Related Work
Algorithms and methods behind geoparsing remain a very active research field, and new systems with better characteristics continue to be developed. Most existing approaches to geoparsing are knowledge-based [2]; they use external sources (vocabularies, dictionaries, gazetteers, ontologies). In general, these are heuristic-based methods (relying on place type, place popularity, place population, place adjacency, geographic levels, hierarchies, etc.) [3]. Both tasks, toponym recognition and resolution, rely strongly on the evidence contained in the knowledge sources used, alongside other natural language processing and machine learning tools. Next, some of these approaches are presented.
Buscaldi and Rosso [4] described a knowledge-based word sense disambiguation method for toponyms. The method disambiguates nouns by using an adaptation of the Conceptual Density approach [5] with WordNet [6] as the external knowledge resource. Conceptual Density is a word sense disambiguation algorithm that measures the correlation between the sense of a given word and its context, taking WordNet subhierarchies into account. The authors use holonymy relationships to create subhierarchies for disambiguating locations. The method was evaluated using geographical names from the SemCor corpus, available at www.sketchengine.eu/semcor-annotated-corpus (last accessed 26 August 2020). Michael et al. [7] presented a method for resolving the geographical scope of documents. It is based on the administrative jurisdiction of places and the identification of named entities of type person to assign a geographic scope to documents. The method applies a set of heuristics over geographical attributes (population-based prominence, distance-based proximity, and sibling relationships in a geographic hierarchy) and is focused on documents of international scope within a Geographic Information Retrieval system; it does not consider toponym disambiguation. In this line, Radke et al. [8] proposed an algorithm for the geographical labeling of web documents that considers all place names without solving possible ambiguities between them. Woodruff et al. [9] developed a method that automatically extracts the words and phrases (only in English) that contain names of geographical places and assigns them their most likely coordinates. From this method, a prototype system called Gipsy (georeferenced information processing system) was developed.
Inkpen et al. [10] developed an algorithm that extracts expressions composed of one or more words for each place name. The authors use a conditional random fields classifier, an undirected graphical model used for structured prediction [11]. They focused on location entities in tweets, defining heuristic-based disambiguation rules. The corpus contains English tweets from the states and provinces of the United States and Canada. Middleton et al. [12] presented a comparison of location extraction algorithms, two developed by the authors and three from the state of the art. The authors' algorithms use OpenStreetMap and a combination of a language model from social media with several gazetteers. The third-party algorithms use NER systems based on DBpedia, GeoNames and the Google Geocoder API. In the OpenStreetMap approach, location entities are disambiguated using linguistic and geospatial context. In the language-model approach, a corpus of geotagged Flickr posts and a term-cell map were used to create the model, and the authors applied some heuristics to refine it and enhance its accuracy. A fine-grained quantitative evaluation was conducted on labeled English tweets, taking streets and buildings into account. Karimzadeh et al. [13] described the GeoTxt system for toponym recognition and resolution in English tweets. For place name recognition, the system can use one of six publicly available NERs: Stanford NER (nlp.stanford.edu/software/CRF-NER.html), Illinois CogComp (github.com/IllinoisCogComp/illinois-cogcomp-nlp), GATE ANNIE (gate.ac.uk/ie/annie.html), MITIE (github.com/mit-nlp/MITIE), Apache OpenNLP (opennlp.apache.org), and LingPipe (www.alias-i.com/lingpipe); the above URLs were last accessed on 26 August 2020. For place name disambiguation within toponym resolution, the system offers three mechanisms: two hierarchical-relationship-based methods and a spatial proximity-based method.
The system uses the GeoNames gazetteer. GeoTxt has a global spatial scope and was optimized for English text from tweets.
An interesting work on historical documents was presented by Rupp et al. [14]. These documents were translated and transcribed from the corresponding original books. The authors use a system called VARD for spelling correction, together with a historical gazetteer. One drawback is that the names of current places appearing in historical texts may induce ambiguities with respect to places in past centuries. In this sense, Tobin et al. [15] presented another approach based on three previously digitized historical corpora. The authors used information extraction techniques to identify place names in the corpora and relied on gazetteers to compare the results obtained with human annotations of the three corpora. A similar method, made up of two main modules (a geotagger and a georesolver), was applied to the SpatialML corpus (catalog.ldc.upenn.edu/LDC2011T02, last accessed 20 August 2020), a geo-annotated corpus of newspapers [16]. The geotagger processes the input text and identifies the strings that denote place names. The georesolver takes the set of recognized place names as input, searches for them in one of several gazetteers, and determines the most likely georeference for each one. A disadvantage is that it cannot find place names that are misspelled, even slightly, since it relies only on exact matches. Also, Ardanuy and Sporleder [17] presented a supervised toponym disambiguation method for historical documents based on geographic and semantic features. The method consists of two parts: toponym resolution (using information from GeoNames) and toponym linking (using information from Wikipedia). Geographic features include latitude and longitude, population, the target country, and inlinks to the Wikipedia article. Semantic features include the title of the location's Wikipedia article, the name of the country and the target region, and context words referring to history. The method was tested on five data sets in Dutch and German.
Some approaches have been proposed for non-English documents. This is a research niche, since each language has specific patterns that provide extra information for identifying toponyms. Nes et al. [18] presented a system called GeLo, which extracts addresses and geographic coordinates of commercial companies, institutions and other organizations from their web domains. It is based on part-of-speech (POS) tagging, pattern recognition and annotations. The tests included web domains of organizations located in the Tuscany region of Italy. The platform was developed for Italian and English, though the authors proposed a language-independent option. The system is composed of two modules: (1) a tracking tool for indexing documents, and (2) a linguistic analyzer that takes as input the web documents retrieved by the previous module. Martins and Silva [19] presented an algorithm based on PageRank [20]. Here, geographical references are extracted from the text and two geographical ontologies; the first is based on global multilingual texts, and the second only on the region of Portugal. One limitation is that, as in the original PageRank algorithm, the same weight is assigned to all edges (references), which causes dense nodes (those with many references to other sites) to tend to produce higher scores, whether or not they are important. In another work, Martins and Silva [21] presented what they called the geographic scope of web pages, using a graph-based classification algorithm. The geographic scope is specified as a relationship between an entity in the web domain (a web page) and an entity in the geographic domain (such as an administrative location or region); the geographic scope of a web entity has the same footprint as the associated geographic entity. The scope assigned to a document is determined by the frequency of occurrence of terms and by its similarity to other documents.
The work focused on feature extraction and on the recognition and disambiguation of geographical references. The method makes extensive use of an ontology of geographical concepts and includes a system architecture for extracting geographic information from large collections of web documents. It was tested on English, Spanish, Portuguese, and German. Gelernter and Zhang [22] presented a geoparser for Spanish translations from English. This method is an ensemble of four parsers: a lexico-semantic named location parser, a rule-based building parser, a rule-based street parser, and a trained named entity parser. The method is capable of recognizing location words in Twitter hashtags, website addresses and names. The authors developed a parser for both languages; the NER parser is trained using the GeoNames gazetteer and the conditional random fields algorithm, and can deal with street and building names. The method was evaluated on a set of 4488 tweets collected by the authors, using the kappa statistic for human agreement. Moncla et al. [23] proposed a geocoding method for fine-grained toponyms. The approach consists of two main parts: geoparsing based on POS tagging and syntactic-semantic rules, and a disambiguation method based on spatial density clustering. This approach can deal with toponyms that are not in a gazetteer. The proposal was tested on a hiking corpus obtained from websites in French, Spanish, and Italian; these documents describe displacements using toponyms, spatial relations, and natural features or landscapes.
There is a lack of NER-annotated corpora for Spanish variants. In this sense, Molina-Villegas et al. [24] presented a promising project to compile a corpus of Mexican news and train a GNER model with dense vectors obtained from the corpus. This is important because, in general, the training corpus somehow restricts the outcome of a GNER model. For instance, a GNER model trained exclusively on news from Spain could recognize places or local names from Spain, but not from Mexico. Though this is an inescapable fact, it is possible to obtain GNER models through a processing pipeline in which the only change is the input corpus; that is, without making structural changes to the training process itself. Guided by this goal, in this paper we present a GNER framework in which the training corpus can be changed to extend its capability to recognize places in a wide set of geographic contexts. Another complex issue around geographic entities is the spatial ambiguity that arises from the fact that there are many places with the same name. In this regard, we propose what we have called dynamic context disambiguation, an approach based on a gazetteer and a knowledge base that mimics the human process of resolving a place mentioned in a text.

Geoparsing Approach
Our proposal includes two main modules: geographic-named entity recognition (GNER) and dynamic context disambiguation. The GNER module is used to detect entities alluding to place names in an unstructured text. The geographic disambiguation module solves and determines the most likely geographic properties (which we will discuss later) of these places. These modules, as well as their most important elements, are illustrated in Figure 1 and detailed throughout this section.

Geographic-Named Entity Recognition
The geographic-named entity recognition (GNER) module is based on a trained model (the GNER model) whose inputs are vector representations of words, also referred to as embeddings, in a semantic space. Basically, the input text is transformed into dense vectors, and the GNER model then determines when a specific word or n-gram is a geographic entity. A dense vector, as opposed to a sparse one, has mostly nonzero entries. In the field of natural language processing, representations of words in a vector space help learning algorithms achieve better performance in specific tasks such as named entity recognition. This mapping from the vocabulary to vectors of real numbers conceptually involves a mathematical embedding from a high-dimensional space (one dimension per word, which yields sparse frequency-based vectors) to a continuous vector space of much lower dimension. This lower-dimensional vector representation of a word is known as a word embedding, or dense vector. It is worth mentioning that, given the lack of NLP resources for Mexican Spanish, we deployed our own GNER module based on a fusion of lexical and semantic features alongside a neural network classifier.

Preprocessing and Vectorization
First, texts are preprocessed with standard tokenization, preserving capital letters. This is because capitalized words are part of the standard lexical features for GNER; most of the time, location entities appear capitalized. According to Cucerzan and Yarowsky [25], even disambiguation can be learned from a set of unambiguous named entities. Their semi-supervised learning technique illustrates that the problem can be greatly disentangled by using a combination of lexical heuristics and classical binary classification. In summary, including boolean lexical features such as "starts with a capital letter" or "has an internal period" (e.g., St.) improves the results; consequently, omitting this valuable information from the features would decrease the overall accuracy. Other lexical features are also included as binary variables. For instance, for each token, we check whether its individual characters are numeric, the number of characters, whether the token belongs to specified entities, whether it is a stopword, and its part-of-speech (POS) tag, among other features.
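As an illustration, the boolean lexical features described above could be computed per token as follows; the feature names and the stopword list are a minimal sketch under stated assumptions, not the authors' implementation:

```python
def lexical_features(token, stopwords=frozenset({"de", "la", "el", "en", "y"})):
    """Binary and numeric lexical features for a single token (illustrative set)."""
    return {
        "starts_capital": token[:1].isupper(),   # place names usually capitalized
        "has_period": "." in token,              # abbreviations such as "St."
        "all_numeric": token.isdigit(),
        "has_digit": any(c.isdigit() for c in token),
        "n_chars": len(token),
        "is_stopword": token.lower() in stopwords,
    }
```

In the full pipeline, these booleans would be concatenated with POS-tag indicators and the word embeddings before classification.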
The semantic features of this module are based on word embeddings. For this, we used the context encoder (ConEc), described in Section 5.1, to transform the original words into dense vectors.

Neural Network GNER Classifier
Once we obtain the lexical and semantic characteristics of words, we fuse them by concatenating all features into what we call a bag of features, and then use a one-layer perceptron neural network for binary classification (geographic entity or not). Although state-of-the-art models for general NER are based on modern deep encoder-decoder architectures, such as convolutional neural networks with bidirectional long short-term memory networks (CNN-BiLSTM) and attention mechanisms (Li, P.-H. et al. [26]) and, more recently, on language representations with transformers (BERT; Luo et al. [27], Li et al. [28]), it is well known that they require a huge amount of data to train. In our case, this is a drawback, given the lack of data for Mexican Spanish. For this reason, we prefer a simpler model, a one-layer perceptron neural network, whose inputs are the semantic and lexical characteristics of our corpus. We chose our model using cross-validation, a popular model selection method, which helps us choose the network complexity that achieves the best performance. After testing a wide range of architectures, the best model was a neural network with one hidden layer, 3 hidden units, a sigmoid activation function, and weight decay. The fact that a simple model suffices for classification means that the features we use, i.e., the semantic and lexical characteristics, already provide a good representation of the text, and it is not necessary to train a complex model to obtain good results. At the end of the neural network GNER classifier, the last layer assigns each token one of two possible classes, {<location>, <word>}. Finally, for entities composed of two or more tokens, we use a heuristic that reconstructs the whole entity from the token classes in the original text.
That is, two or more consecutive <location> labels will be considered as one single location entity. The labeled version of the original text is sent to the dynamic context disambiguation module.
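A minimal sketch of these two steps: a forward pass of a one-hidden-layer sigmoid perceptron (weights here are illustrative; the real network is trained with weight decay on the fused feature vectors) and the heuristic that merges consecutive <location> tokens into one entity:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mlp_forward(x, W1, b1, w2, b2):
    """Forward pass of a one-hidden-layer perceptron (e.g., 3 sigmoid units)
    returning the probability that the token is a geographic entity."""
    h = [sigmoid(sum(wi * xi for wi, xi in zip(row, x)) + bi)
         for row, bi in zip(W1, b1)]
    return sigmoid(sum(wi * hi for wi, hi in zip(w2, h)) + b2)

def merge_locations(tokens, labels):
    """Heuristic from the text: consecutive <location> tokens form one entity."""
    entities, current = [], []
    for tok, lab in zip(tokens, labels):
        if lab == "<location>":
            current.append(tok)
        elif current:
            entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities
```

For instance, labeling "El clima en Ciudad de México" with three trailing <location> tags yields the single entity "Ciudad de México".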

Dynamic Context Disambiguation
Our disambiguation approach derives decisions from rules and facts. The rules specify how ambiguities could be solved by considering the context, in a manner similar to that of a person. Given that the rules are activated as needed at execution time, we call it dynamic context disambiguation. This involves a set of elements that are illustrated in Figure 1 and described in what follows:
• Knowledge base (KB): A set of rules which mimic human knowledge about how a place mentioned in a text must be solved. A rule is an IF-THEN clause with an antecedent and a consequent. The IF part is made up of a series of clauses connected by a logical AND operator that causes the rule to be applicable, while the THEN part is made up of a set of actions to be executed when the rule is applicable. By "solved," we mean that the most likely geographic locations of the places have been determined.
• Visualization: Once all place names have been solved, this module allows us to visualize, on a map, the resulting place names in DS.
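For illustration, an IF-THEN rule with an AND-connected antecedent and an action consequent could be represented as follows; the class and attribute names are assumptions, not the system's actual implementation:

```python
class Rule:
    """A KB rule: all antecedent facts must hold for the consequent to fire."""

    def __init__(self, name, antecedent, consequent):
        self.name = name
        self.antecedent = antecedent   # list of predicates over the state, AND-ed
        self.consequent = consequent   # action executed when the rule applies

    def applicable(self, state):
        return all(fact(state) for fact in self.antecedent)

    def fire(self, state):
        """Execute the consequent if the antecedent holds; report whether it fired."""
        if self.applicable(state):
            self.consequent(state)
            return True
        return False
```

A rule such as "IF the entity matches the gazetteer AND the stack is empty THEN push it" then becomes two small predicates plus an action over a shared state.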

Notation
The rules in KB specify how ambiguities could be solved by considering the context, in a similar manner as that of a person. To describe them and show the way in which these are activated, we will use the following notation:

G — A gazetteer containing tuples of the form (place name, location, geographic level, parent geographic level).

Ri — The ith rule, defined as a Horn clause of the form P1 ∧ P2 ∧ … ∧ Pn ⇒ Q, where each Pj is a fact and Q is an action or set of actions to be executed when the antecedent is true.

D — The input text document containing the entities to be georeferenced.

A, B — Geographic entities in D.

S — A stack data structure, in what follows the disambiguation stack, wherein the entities are stored as soon as the rules are activated and executed.

C — An auxiliary stack used in situations in which we lack information to solve the ambiguities relative to a place name.

The properties location, geographic level and parent geographic level are denoted in what follows as geographic properties. Location is defined in terms of latitude and longitude. Geographic level and parent geographic level are nominal values corresponding to an administrative division (provinces, states, counties, etc.), such as those shown in Appendix A.
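To make the notation concrete, a toy in-memory gazetteer G could look as follows; the entries and coordinates are illustrative only, not taken from the actual gazetteer:

```python
# Hypothetical gazetteer G: (place name, (lat, lon), geographic level, parent level).
G = [
    ("Jalisco", (20.66, -103.35), "ADM1", "MX"),
    ("Zapopan", (20.72, -103.39), "ADM2", "ADM1"),
    ("Ixcatán", (20.85, -103.35), "PPL", "ADM2"),
]

def matches(name):
    """All gazetteer entries whose place name equals the entity name."""
    return [t for t in G if t[0] == name]
```

An ambiguous entity is precisely one for which `matches` returns more than one tuple, which is the situation the rules below must resolve.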

Base Facts and Rules
As we pointed out, KB is a set of rules representing how a place or entity mentioned in a text should be solved. Rules are Horn clauses consisting of facts and goals. A fact is a clause Pi representing a condition, while a goal is a clause Qi representing an action to be executed. For any rule, the sets of facts and goals correspond to its antecedent and consequent, respectively. For our purposes, we have defined the following set of base facts, from which the rules in KB are defined.

P1 — A matches a location name in G.
P2 — The disambiguation stack is empty.
P3 — A has a more specific administrative level than B (but A is not necessarily contained in B).
P4 — There is a relationship between A and B, meaning that there is a path between A and B in the administrative hierarchy.
P5 — There are no entities in D to be processed.
P6 — There are elements in C that must be processed.
The set of rules that guide the stages of the disambiguation process are defined in Table 1:

R0 — Condition: the entity to be processed (A) has a match in G and S is empty (P1 ∧ P2 ⇒ Q1). Action: assign A the geographic properties of the matched location in G that has the highest hierarchy (relative to the administrative level).
R1 — Condition: A has a match in G, S is not empty, A is lower than the entity (T) at the top of S, and there is a relationship between A and T. Action: assign A the geographic properties of the matched location in G whose parent code is equal to the parent code of T.
R2 — Condition: A has a match in G, S is not empty, A is lower than T, and there is no relationship between A and T. Action: assign A the geographic properties of the matched location in G that has the highest hierarchy (relative to the administrative level).
R3 — Condition: A has a match in G, S is not empty, A is not lower than T, and there is a relationship between A and T. Action: assign A the geographic properties of the matched location in G that has the highest hierarchy.
R4 — Condition: A does not have a match in G, and S is not empty.
R5 — Condition: A does not have a match in G, and S is empty; A is pushed onto the conflicts stack C.
R6 — Condition: A has a match in G, S is not empty, A is not lower than T, and there is no relationship between A and T. Action: assign A the geographic properties of the matched location in G that has the highest hierarchy.
R7 — Condition: there are no more entities from the text to be processed, but there are still entities in the conflicts stack C.
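The conditions of the per-entity rules above can be sketched as a dispatch function that names the applicable rule; the predicates has_match (A matches G), is_lower (administrative-level comparison, P3) and related (hierarchy path, P4) are assumed helpers standing in for the gazetteer and hierarchy logic:

```python
def select_rule(a, stack, has_match, is_lower, related):
    """Name the rule whose condition holds for entity `a`, given the
    disambiguation stack and the fact-checking predicates."""
    if not has_match(a):
        return "R5" if not stack else "R4"
    if not stack:
        return "R0"
    t = stack[-1]                      # entity at the top of S
    if is_lower(a, t):
        return "R1" if related(a, t) else "R2"
    return "R3" if related(a, t) else "R6"
```

With stub predicates, an entity with no gazetteer match and an empty stack selects R5, mirroring the first step of the worked example below.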

Geoparsing Process
After a set of place names has been found in D by the GNER module, dynamic context disambiguation must assign their most likely geographic properties. This involves the execution of the KB and FG modules, orchestrated by the IE (see Figure 1), which determines the rules that must be activated and applied through the AE module.
The above corresponds to the whole geoparsing process, which we have formalized in Algorithm 1, where the function entity_recognition encompasses the GNER process, from which a list L containing geographic entities is obtained. The function rules_inference identifies the most likely geographic properties of each entity e ∈ L, based on the information in G and the current state of S. This information generates instances of base facts (via FG) that are considered by IE to determine the rules that must be activated and executed by AE. As a result of this execution, the state of S changes, at this point containing solved entities. If a conflict is found as a consequence of the above process, e is pushed onto C, where by conflict we mean that there is no information to assign the geographic properties of e. Finally, when there are no more entities to be processed, the function solve_conflicts is called, which assigns the most suitable geographic properties to the entities in C. To illustrate the execution of Algorithm 1, we provide the following example. The input text is presented at the top of Figure 2, with the geographic entities emphasized. An approximate translation reads thus: A customer arrives at the Tu Hogar furniture store, Azcapotzalco branch, in Mexico City. He requests a table that is on sale and asks for the delivery to be made to the town of Ixcatán in the municipality of Zapopan, Jalisco. The problem is that the branch in charge of deliveries to that area is the Pedregal de Santo Domingo one, in the municipality of Coyoacán, next to the Zapatería Juárez, but in that branch the table is not on sale. The furniture store should reach an agreement with the client. On the left side of Figure 2, we have included the inherent hierarchy derived from prior classification based on the geographic levels defined in Appendix A. The top right presents the list of rules activated and applied during the execution of Algorithm 1.
Finally, the bottom right shows the final state of S. Algorithm 1 begins by determining the place names contained in the text, in which case the list L is initialized via the function entity_recognition. For each entity in L, the function rules_inference is iteratively called. The following actions are executed at each stage.

1. The entity in L to be processed (Mueblería Tu Hogar) does not have a match in G and, at this point, S is empty. This means that there is not enough information to determine the geographic properties of the entity, in which case rule R5 is activated and applied. As a consequence, the conflict condition is met, and the entity is pushed onto C.
2. The next entity in L (Ciudad de México) has several matches in G, from which the one with the highest geographic level is taken and pushed onto S according to R0.
3. At this point, the entity in L to be processed (Azcapotzalco) has a match in G, with Ciudad de México at the top of S, so rule R1 is activated; Azcapotzalco is pushed onto S as a child entity of Ciudad de México.
4. The entity to be processed (Ixcatán) has two matches in G. These names have the same geographic level, in which case either one can be selected and pushed onto S, according to R2.
5. At this point, the top of S is Ixcatán and the entity in L to be processed is Zapopan. This has two matches in G, and thus the place with the highest feature level is selected to be added to S. However, this place (Zapopan) is not lower than the top of S (Ixcatán), though there is a relationship between them. This situation causes rule R3 to be triggered.
6. Similarly, the entity in L to be processed (Jalisco) is not the predecessor of the current top (Ixcatán), and there is a relationship between them, so rule R3 is applied again.
7. Then, given that the entity in L to be processed (Pedregal de Santo Domingo) does not have any relationship with the top of S (Ixcatán), rule R6 proceeds.
8. At this point, the top of S is Pedregal de Santo Domingo. When the entity in L to be processed (Coyoacán) is evaluated, the top of S is its predecessor, so R3 must be applied.
9. The entity in L to be processed (Zapatería Juárez) appears in the text, but not in G, implying R4.
Finally, the function solve_conflicts is invoked in order to solve the remaining entities in C through rule R7.
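The walkthrough above follows the main loop of Algorithm 1, which could be sketched as follows; the function names match those in the text, but their signatures here are illustrative:

```python
def geoparse(document, entity_recognition, rules_inference, solve_conflicts):
    """High-level sketch of Algorithm 1: recognize entities, apply the rules,
    then resolve the entities left on the conflicts stack."""
    S, C = [], []                        # disambiguation and conflicts stacks
    L = entity_recognition(document)     # GNER module output (list of entities)
    for e in L:
        conflict = rules_inference(e, S) # fires rules, may push solved entity to S
        if conflict:
            C.append(e)                  # no information to solve e yet
    solve_conflicts(C, S)                # rule R7: assign properties to C entries
    return S
```

With stub functions that push every recognized entity except a marked "conflict" one, the loop yields the solved stack and routes the conflicted entity to solve_conflicts.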
The map presented in Figure 3 shows all entities with their corresponding geographic properties (location and geographic levels). For simplicity, Figure 3 is presented in two parts, because the two regions involved are so far apart that it is convenient to focus on them separately. In Figure 3a, all of the entities near the Jalisco region are mapped. Note that there are two toponyms of Jalisco in the gazetteer, and the algorithm has disambiguated Jalisco (ADM1) as the correct one. Another case is when there are two entities with the same name and geographic level; this is the case for Zapopan and Ixcatán. Algorithm 1 determines the most likely entity according to the rule activation; this situation is illustrated by the relationship between the instances of Zapopan (ADM2) and Ixcatán (PPL), which are indeed the nearest. Similar issues appear in Figure 3b for Ciudad de México and Coyoacán.
In addition to the mentioned issues, in Figure 3b, we observe that, for those entities that are not included in the gazetteer (Mueblería Tu Hogar and Zapatería Juarez), Algorithm 1 deduces the most likely location. This is just an example to exhibit the feasibility of our proposal. In Section 5, we will show the results of applying this proposal to a massive volume of news.

Data and Annotation Criteria
The assessment of our method included three different corpora. The first corpus C1 was used to produce word embeddings. As we will see in Section 5.1, this is necessary to feed a neural network in order to create the GNER model.
The second corpus C2 is the Corpus of Georeferenced Entities of Mexico (CEGEOMEX) (http://geoparsing.geoint.mx/mx/info/, last accessed 20 August 2020), reported in the project [24]. The corpus was annotated manually with geographic-named entities and is, as far as we know, the only existing data source in Mexican Spanish for GNER. CEGEOMEX was labeled according to the following criteria:
• Named entities are considered geo-entities only when it is possible to assign them geographic coordinates.
• A place name composed of many words, expressions or other entities must be spanned by a single label. For example, <loc>Hermosillo, Sonora</loc> must be a single geo-entity.
• Imprecise references such as "at the south of" or "on the outskirts of" must not be included as part of the geo-entity. For instance, in "on the outskirts of Mexico", only <loc>Mexico</loc> must be labeled.
• Place names that are part of other entity types, such as "the president of Mexico", "Bishop of Puebla" or the band name "Los tucanes de Tijuana", must not be labeled.
• The names of sports teams accompanied by names of places such as countries, states or municipalities, such as "Atletico de San Luis", must not be considered place names. Instead, the names of stadiums, sports centers or gyms, such as <loc>Estadio Azteca</loc>, must be considered place names.
• Country names must be considered place names when they point to the territory, but must not be labeled when they refer to the government. For example, in "California sends help to those affected by the earthquake in Mexico City", only <loc>Mexico City</loc> must be labeled.
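A simple way to extract the labeled geo-entities from text annotated under these criteria is sketched below, assuming the <loc>...</loc> tag style shown in the examples:

```python
import re

# Non-greedy match so adjacent annotations yield separate entities.
LOC = re.compile(r"<loc>(.*?)</loc>")

def loc_spans(annotated):
    """Return the geo-entity strings labeled in an annotated sentence."""
    return LOC.findall(annotated)
```

For instance, the criterion on imprecise references means that "on the outskirts of <loc>Mexico</loc>" yields only the entity "Mexico".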
Finally, a corpus C3 containing news not included in the training and validation process of GNER was used to test and validate the disambiguation process. Unlike previous corpora, the tagging process included the geographic properties (location and geographic level) of each tagged entity. As a summary, a description of the above corpora is shown in Table 2.

Experiments and Results
The experiment consisted of assessing the performance of our geoparsing approach in terms of three assessments: (1) the recognition ability of the GNER module in terms of standard evaluation metrics (accuracy, precision, recall and F-measure); (2) the accuracy of the geographic level predictions made by the toponym resolution module; and (3) a ranking consisting of six categories of closeness, where by closeness we mean the distance from the predicted location to the actual location.
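For assessment (1), the token-level metrics can be sketched as follows (a minimal illustration with hypothetical gold and predicted tags, not the evaluation code used in the experiments):

```python
def prf(gold, pred, positive="LOC"):
    """Token-level precision, recall and F-measure for one tag class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy gold and predicted tag sequences (LOC = geographic entity token).
gold = ["LOC", "O", "LOC", "O", "LOC"]
pred = ["LOC", "LOC", "LOC", "O", "O"]
p, r, f = prf(gold, pred)  # tp=2, fp=1, fn=1
```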

Geographic-Named Entity Recognition
We obtained the semantic features from word embeddings computed with word2vec [29]. Although word2vec offers computational advantages over other methods such as GloVe [30] or fastText [31], it has two main drawbacks: it cannot produce embeddings for out-of-vocabulary (OOV) words, and it cannot adequately represent words with several meanings. The OOV problem can be solved with fastText, but in order to tackle both problems, we used an extension of word2vec based on the context encoder (ConEc) [32]. ConEc training is identical to that of CBOW word2vec; the difference lies in how the final embeddings are computed after training is completed. In word2vec, the embedding of a word w is simply its corresponding row of the weight matrix W_0 obtained in the training process, while with ConEc the embedding is obtained by multiplying W_0 with the mean context vector of w, such that the local and global contexts of a word are distinguished. The global context vector is the mean of all binary context vectors x_w^i over the M_w occurrences of w in the training corpus, according to

\bar{x}_w^{global} = \frac{1}{M_w} \sum_{i=1}^{M_w} x_w^i,

while the local context vector is computed analogously over the m_w occurrences of w in a single document,

\bar{x}_w^{local} = \frac{1}{m_w} \sum_{i=1}^{m_w} x_w^i.

The final embedding y_w of a word is obtained from a weighted sum of both context vectors, as defined in Equation (1):

y_w = W_0^{\top} \left( \alpha \, \bar{x}_w^{local} + (1 - \alpha) \, \bar{x}_w^{global} \right), (1)

where α ∈ [0, 1] is a weight that determines the emphasis on the local context of a word. For OOV words, their embeddings are computed solely from the local context, i.e., setting α = 1.
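The combination in Equation (1) can be sketched numerically as follows (a toy illustration with placeholder matrices and context vectors, not the ConEc reference implementation):

```python
import numpy as np

def conec_embedding(W0, global_ctx_vecs, local_ctx_vecs, alpha=0.5):
    """ConEc-style final embedding: weighted mean context vector times W0.

    W0: (vocab_size, dim) weight matrix from word2vec training.
    *_ctx_vecs: lists of binary bag-of-words context vectors of length vocab_size.
    alpha: weight on the local context; for OOV words only the local context exists.
    """
    x_local = np.mean(local_ctx_vecs, axis=0)
    if global_ctx_vecs:                      # known word: mix local and global contexts
        x_global = np.mean(global_ctx_vecs, axis=0)
        x = alpha * x_local + (1 - alpha) * x_global
    else:                                    # OOV word: local context only (alpha = 1)
        x = x_local
    return W0.T @ x                          # embedding of dimension dim

# Toy example: vocabulary of 3 words, embedding dimension 2.
W0 = np.arange(6.0).reshape(3, 2)
local_ctx = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
global_ctx = [np.array([1.0, 1.0, 0.0])]
y_w = conec_embedding(W0, global_ctx, local_ctx, alpha=0.5)
```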
After an extensive search over classifiers and their parameters with cross-validation, we decided to use a simple neural network classifier. The GNER classifier has one hidden layer with 3 hidden units and a sigmoid activation function, trained with weight decay. We opted for this type of classifier (instead of a more complex one) because the bag of features we used already yields a representation of the texts in which geographic entities can be discriminated; a simple classifier such as this one, or even a support vector machine with a linear kernel, is sufficient. For entities composed of two or more tokens, we used a heuristic that reconstructs the whole entity from the classes of the tokens in the original text. Details can be found in [33].
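A classifier with this exact shape can be instantiated, for instance, with scikit-learn's MLPClassifier, where the alpha parameter plays the role of weight decay (a sketch with random placeholder features, not the authors' implementation):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))             # placeholder token feature vectors
y = (X[:, 0] > 0).astype(int)              # placeholder geo/non-geo token labels

# One hidden layer with 3 units, sigmoid activation, L2 weight decay (alpha).
clf = MLPClassifier(hidden_layer_sizes=(3,), activation="logistic",
                    alpha=1e-3, max_iter=2000, random_state=0)
clf.fit(X, y)
acc = clf.score(X, y)                      # training accuracy on the toy data
```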
The results of three different encoders are presented in Table 3. The best performance for GNER was obtained using the global and local context encoders. However, the local context encoder was useful to obtain embeddings for words outside of the vocabulary. The full procedure and results have been documented in [33].

Geographic Level Assignment
To evaluate the geographic level assignment accuracy of our method, we compared the predicted levels of 1956 entities from corpus C3 against their annotated levels. The resulting confusion matrix is presented in Table 4. Each column of the matrix represents the instances of the predicted levels (using the levels in the gazetteer of Table A1), while each row represents the actual geographic level. Note that in all cases the maximum is found on the main diagonal, suggesting that, in general, the algorithm assigns the correct geographic level to the entities. This was corroborated using the metrics presented in Table 5, where a remarkable performance of the algorithm regarding the geographic level assignment is observed, with a global accuracy of 0.9089.
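The confusion-matrix computation can be sketched as follows (with hypothetical level labels and toy data; the actual evaluation used the 1956 annotated entities of C3):

```python
from collections import Counter

def confusion_and_accuracy(actual, predicted, levels):
    """Rows: actual geographic level; columns: predicted level."""
    counts = Counter(zip(actual, predicted))
    matrix = [[counts[(a, p)] for p in levels] for a in levels]
    correct = sum(counts[(lvl, lvl)] for lvl in levels)
    return matrix, correct / len(actual)

# Toy labels (hypothetical; the real levels come from the gazetteer of Table A1).
levels = ["STATE", "MUNICIPALITY", "OTHER"]
actual = ["STATE", "STATE", "MUNICIPALITY", "OTHER", "OTHER"]
pred   = ["STATE", "MUNICIPALITY", "MUNICIPALITY", "OTHER", "OTHER"]
m, acc = confusion_and_accuracy(actual, pred, levels)
```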

Ranking of Closeness
As we have seen in Section 5.2, our geographic entity disambiguation method is very accurate with respect to the assignment of the geographic level of entities. Nevertheless, there is a natural question that arises when looking at those results: What if the geographic level is correct, but the real coordinates are not? To address this issue, we propose to carry out the evaluation one step further by measuring the distance from the coordinates determined by the proposed algorithm to the actual point.
It is difficult to define, in general, what the correct location of a place is. Locations can have a variety of shapes and sizes; specific points, lines, multi-lines, and polygons, among other shapes, cannot be treated with the same evaluation criteria for geoparsing. This is still an open issue. In this regard, we used ranges of distances (in km) between the point provided by our method and a manually annotated point (the actual coordinate) to decide whether a location is correct.
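This range-based scoring can be sketched as follows; note that only the under-5 km, 25 km, and 100 km boundaries are stated in the text, so the intermediate cut-offs used here (10 and 50 km) are illustrative placeholders, not the exact bins of Table 6:

```python
def closeness_rank(distance_km):
    """Map a predicted-to-actual distance (km) to a closeness category.

    Boundaries 10 and 50 km are assumed for illustration; only the
    <5 km, 25 km and >=100 km limits appear explicitly in the paper.
    """
    bounds = [(5, "[0, 5)"), (10, "[5, 10)"), (25, "[10, 25)"),
              (50, "[25, 50)"), (100, "[50, 100)")]
    for upper, label in bounds:
        if distance_km < upper:
            return label
    return "[100, inf)"
```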
As a baseline, we used the Nominatim importance score (https://nominatim.org/, last access 20 August 2020) to compare its results against our disambiguation approach. Nominatim is a search engine that looks up locations in OpenStreetMap (OSM) data by name or address. Nominatim uses several heuristics to determine the priority of each response that could match the query, including lexical similarity (between the query and OSM data), the bounding box of the current map (when used in a web interface) and a location importance estimation called the importance score. The importance score ranks the results of a Nominatim query according to their prominence in Wikipedia.
To contrast our approach against Nominatim, we simply obtain the coordinates of the best guess of each method. For Nominatim, this corresponds to the coordinate values (latitude, longitude) of the first entity in the ranking, if any; our method assigns the coordinate values after executing the algorithm described in Section 3. Note that a key feature of our approach is that it always produces coordinates, even when the entities are not found in the gazetteer.
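Obtaining the Nominatim best guess amounts to taking the first element of the JSON result list returned by its public search API (a sketch; best_guess and parse_best are our illustrative helper names, and the network call requires internet access):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def parse_best(body):
    """Return (lat, lon) of the top-ranked result in a Nominatim JSON response."""
    results = json.loads(body)
    if not results:                  # empty list: the query was not georeferenced
        return None
    return float(results[0]["lat"]), float(results[0]["lon"])

def best_guess(query):
    """Query the public Nominatim search API and return the best-guess coordinates."""
    url = "https://nominatim.openstreetmap.org/search?" + urlencode(
        {"q": query, "format": "json", "limit": 1})
    with urlopen(url) as resp:       # network call
        return parse_best(resp.read().decode("utf-8"))

# Offline example with a response in Nominatim's documented JSON shape:
sample = '[{"lat": "19.4326296", "lon": "-99.1331785"}]'
coords = parse_best(sample)
```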
In Table 6, we contrast the coordinates obtained by Nominatim and by our method in terms of a standard distance (the haversine distance) given in kilometers. Each row corresponds to a distance range, from less than 5 km up to 100 km or more. The columns are divided into two sections. The first section (entitled All Entities) includes the not georeferenced category, which corresponds to the cases in which Nominatim does not assign any coordinates. The same information is plotted in Figure 4, where we can observe that our method assigns very close coordinates (within approximately 5 km) in more than half of the cases; furthermore, in 75% of the cases, the assigned point lies within 25 km of the actual point. Taking as correctly located those places whose distance between the actual and predicted coordinates is less than 1 km, the proportion of correctly located places is 51% for our method compared to 23% for Nominatim.
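The haversine distance used in Table 6 can be computed as follows (a standard formulation; the example coordinates are two nearby points in Mexico City, chosen for illustration):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    R = 6371.0                                   # mean Earth radius in km
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * R * asin(sqrt(a))

# Two points a few km apart in Mexico City.
d = haversine_km(19.4326, -99.1332, 19.4270, -99.1677)
```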
In the second section of Table 6 (entitled Only Georeferenced Entities), entities not found in the gazetteer are excluded from the distribution. As presented in Figure 5, although this slightly improves the baseline, the prediction accuracy of our proposal for coordinates holds.
Figure 5. Frequencies of toponyms in ranges according to their distances to the actual points, only for entities found in the gazetteer.

Conclusions and Future Work
The assignment of geographic coordinates to place names in text is a challenge in NLP and data mining, which is desirable to solve before other tasks such as information retrieval, information extraction, and summarization, among others. In this paper, we presented a novel geoparsing approach based on word embeddings for toponym recognition and dynamic context identification for toponym resolution. The approach is composed of the geographic-named entity recognition and dynamic context disambiguation modules. In the first module, a neural network model is trained by means of dense vectors to recognize geographically named entities. In the second module, a set of rules and facts is applied to take advantage of context to assign the most suitable geographic level to place names, and then to identify the correct locations.
Experimentation was carried out to determine the performance improvement of our method over a well-known baseline (OpenStreetMap Nominatim). This experimentation included an annotated corpus in which the geographic properties (georeferences and geographic levels) of the entities are known beforehand. In the experiments, our approach led to promising results, outperforming the baseline. The performance was given in terms of two metrics: (1) accuracy relative to the geographic level and (2) distance between the georeferences assigned by our method and the prior georeferences. Our method manages to georeference 75% of entities within a range of 0 to 25 km and 50% within less than 5 km. In real-time applications, this type of result can be highly relevant, for example, for linking emergency medical systems [34], event detection on social media [12], and place detection in location-based services [35], among others.
We have proposed a geoparsing method that allows us to recognize locations in a text and assign their most likely geographical properties. For location recognition, we trained a machine learning model that predicts which words in the input correspond to locations. Once the locations have been recognized, our method attempts to find their geographical properties through an inference process based on a set of facts and rules, and a gazetteer. Though our method was successfully tested on Mexican-Spanish, it can be adapted to other languages taking into account the following remarks: (a) a GNER model must be reused or trained for the target language (or Spanish variant); (b) a gazetteer containing an appropriate set of places should be included (OpenStreetMap is suitable for English and Spanish); (c) the disambiguation process does not require modifications, since the facts and rules do not depend on the target language. Finally, it is worth mentioning that our proposal allows us to assign geographical properties to specific locations not contained in the gazetteer, which is very helpful in applications such as gazetteer enrichment and could also benefit the study of historical texts, where existing gazetteers are limited.
The created system is applied to the phenomenon of clandestine graves in Mexico, which has scarcely been addressed from a scientific point of view. The project raises frontier research questions that will facilitate the creation of a clandestine-grave search protocol that takes advantage of the scientific knowledge generated. The proposal tackles two large complementary problems around the search for clandestine graves. On the one hand, it proposes the development of potential distribution models generated with a machine learning approach; a relevant contribution in this area with respect to previously developed approaches is that it includes a geoparser specialized in extracting information on the location of clandestine graves from journalistic reports and publicly accessible official documents. Other geospatial modeling techniques and concepts, such as proximity analysis and network analysis, will then be incorporated.

Conflicts of Interest:
The authors declare no conflict of interest. Table A1. Geographic levels based on GeoNames (www.geonames.org/export/codes.html, last access 20 August 2020). We have included the level OTHER to agglomerate those specific places that do not lie within the administrative levels of first, second and third order; we have included NULL to group those places that are not found in the gazetteer.

Geographic Level
Description