Automatic Identification of Addresses: A Systematic Literature Review

Abstract: Address matching continues to play a central role at various levels, through geocoding and data integration from different sources, with a view to promoting activities such as urban planning, location-based services, and the construction of databases like those used in census operations. However, the task of address matching continues to face several challenges, such as non-standard or incomplete address records or addresses written in more complex languages. In order to better understand how current limitations can be overcome, this paper presents a systematic literature review focused on automated approaches to address matching and their evolution across time. The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines were followed, resulting in a final set of 41 papers published between 2002 and 2021, the great majority of which appeared after 2017, with Chinese authors leading the way. The main findings reveal a consistent move from more traditional approaches to deep learning methods based on semantics, encoder-decoder architectures, and attention mechanisms, as well as the very recent adoption of hybrid approaches making increased use of spatial constraints and entities. The adoption of evolutionary-based approaches and privacy-preserving methods stand out as some of the research gaps to be addressed in future studies.


Introduction
An address is a reference to a unique location on Earth and is usually expressed according to a certain addressing system (a combination of components such as street names, building numbers, units, levels, unit directions, postal codes, etc.), which can be distinguished from others based on its structure as well as on the types of components used [1]. Due to the hierarchical nature of the fields that compose an address, the association between addresses and address fields can be formally modelled, thereby taking into account the semantic characteristics of address fields [2].
In general terms, address matching consists of the process of identifying pairs of records through the comparison of full addresses or address fields, with the aim of obtaining the best matching result in relation to a searched address [3]. Address matching is also described as the process of relating the literal description of an address to its corresponding location on a map [4]. In this process, known as geocoding, addresses (up to the street name or street name and door number, combined with a postal code and/or an administrative division) are matched with a reference database in order to obtain the corresponding spatial geographic coordinates [5]. In the absence of a unique identifier (such as the social security number, for instance), addresses can also be used as quasi-identifiers for linking records related to the same entity in one or more data collections [6]. As such, the main application areas of address matching include, among others, the enrichment of data quality [3], named entity recognition [6], and location-based analyses in general [7]. In apartment buildings and other types of multifamily houses, the street number is further followed by address elements such as the unit, level, and unit direction, which are hard to tackle through the use of dictionaries or even more traditional machine learning approaches, due to writing variations and the use of non-standard abbreviations, combined with missing elements.
To present the proposed systematic literature review (SLR), the remainder of the paper is structured as follows: Section 2 presents the main data sources, search strategies, screening procedures, and tools; Section 3 contains the main results, their discussion, and the identified research gaps; and, finally, Section 4 presents our main conclusions, including some recommendations for future research.

Data Sources and Search Strategies
In order to select the most relevant set of articles, the following query was executed in April 2021 and December 2021 (in order to retrieve more recent papers) in the electronic repositories Scopus and Web of Science: ("address matching" OR "toponym matching" OR "address pars*" OR "address standardization" OR "address database*" OR "address string*" OR "postal address data" OR "non-standard addresses" OR "address element segmentation" OR "name and address data" OR "address geocoding" OR "geocoding addresses" OR "geocoded address*") AND ("machine learning" OR "deep learning" OR "neural network*" OR "vector representation" OR semantic* OR probabilistic OR automat* OR "similarity measure*") AND NOT ("IP address*" OR "mac address*" OR URL OR email*).
The Boolean operators OR/AND imply that any retrieved article should contain at least one keyword from the first subquery inside curved brackets and one keyword from the second one. The Boolean expression "AND NOT" further excludes any article containing one of the keywords inside the last subquery. The keywords included in the final query resulted from a fine-tuning process over alternative keyword combinations, based on recently published papers in this field, such as the ones by Comber and Arribas-Bel [3] and Lin et al. [9], as well as on some of the earlier works, namely by Churches et al. [6]. The keywords "address data" were explicitly excluded from the query due to the ambiguous use of address as a noun and as a verb and to the presence of a significant number of papers related to the study of data imbalance issues, among others. The keyword "geocoding", although relevant, was also excluded due to its more general nature and conceptual "fuzziness" (in the sense that it can have different meanings depending on the user's perception and experience), with postal addresses being only one of the possible inputs that can be assigned a geographic code [15]. A combination of the keywords "geocoding" or "geocoded" and "addresses" was considered instead. Lastly, a time period of 20 years was considered, since seminal work about the researched topic based on machine learning approaches was first published in 2002 [6].

Screening Procedures
A total of 122 distinct documents were retrieved from the previously mentioned databases, after deduplication of common articles. For the selection of relevant papers, the following exclusion criteria were considered:
• Exclusion of reviews, book chapters, reports, and other duplicates (e.g., articles published as book chapters in the Springer series "Studies in Computational Intelligence");
• Exclusion of conferences not ranked as "A" (as of April 2021), according to the conference ranking provided at http://www.conferenceranks.com/ (accessed on 9 November 2021) (e.g., the International Conference on Natural Computation);
• Exclusion of journals not ranked as Q1 or Q2 (as of April 2021), according to the SCImago Journal Rank indicator (https://www.scimagojr.com/ (accessed on 9 November 2021)) (e.g., the Russian Journal of Forest Science);
• Exclusion of articles not in the scope of the research (e.g., articles dealing with inputs not related to addresses).
Regarding the screening procedures, the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [16] were followed, resulting in the final inclusion of 41 articles as shown in Figure 1.


Tools
Excel 2010 (Microsoft Corp), VOSviewer 1.6.16 (https://www.VOSviewer.com (accessed on 9 November 2021)) and Gephi 0.9.2 (https://gephi.org/ (accessed on 9 November 2021)) were used for the qualitative and quantitative analyses of keyword and author co-occurrences, as well as publication trends, top countries, research gaps, application areas, and methods. VOSviewer is a bibliometric analysis tool for network analysis based on clustering techniques and text mining [17]. It enables three types of visualizations: network visualization, overlay visualization, and density visualization. In this analysis, only the first two were used, for the sake of simplicity. Gephi is open-source software for network analysis [18] in a broader range of subjects, enabling the extraction of graph theory metrics additional to the ones provided by VOSviewer. As such, in this paper, Gephi was mostly used in a subsidiary manner, whenever considered necessary.

Publication Venues of the Selected Papers
Of the 41 articles that met the inclusion criteria, 17 were published in Q1 journals, 10 in Q2 journals, 6 in a Q2 book series, and the remaining 8 in conference proceedings, as illustrated in Figure 2. The main journals and conference proceedings are presented in Tables 1 and 2, respectively. Regarding the number of publications, the top journals include the ISPRS International Journal of Geo-Information (with 4 articles), the Applied Sciences (Switzerland) journal (with 2 articles), the International Journal of Geographical Information Science (with 2 articles), the Transactions in GIS journal (with 2 articles), and the Wuhan Daxue Xuebao (Xinxi Kexue Ban)/Geomatics and Information Science of Wuhan University (with 2 articles), of which the first four are ranked as Q1. Among the top journal publishers are MDPI AG (Switzerland), MDPI Multidisciplinary Digital Publishing Institute (Switzerland), Taylor and Francis Ltd. (United Kingdom), Wiley-Blackwell Publishing Ltd. (United Kingdom), and Wuhan University (China). The most influential journals, in terms of the number of citations, are also ranked as Q1 (Table 3), as was expected. Most of the conference papers originate from the book series Lecture Notes in Computer Science (with 6 papers), followed by the International Conference on Document Analysis and Recognition (with 3 papers), with the latter being more influential as far as the number of citations is concerned. Overall, the main research fields identified in the analysis were computer science (34%), Earth and planetary sciences (16%), social sciences (15%), mathematics (9%), engineering (5%), medicine (4%), and physics and astronomy (4%), with the remaining areas corresponding to decision sciences, materials science, business, management and accounting, chemical engineering, economics, econometrics and finance, and environmental science. The top publishing countries comprise the United Kingdom (27%), the United States (24%), Germany (15%), and Switzerland (15%). Considering the authors' affiliations, the first position is held by China (with 38%), followed by the United States (12%), the United Kingdom (10%), and India (8%).
In the considered time period, spanning from 2002 to 2021, the number of published articles has been steadily increasing since 2017 (Figure 3). This trend can be explained by the large volume of unstructured address data created by the rapid development of the mobile internet and location-based services, and by the increasing need for effective address matching methods to facilitate geocoding and promote geospatial management [2,9]. In countries like China, rapid urban expansion has also led to an increased concern with the improvement of address quality and the retrieval of standard address data [19].

Keyword Occurrence Analysis
The keyword co-occurrence analysis was performed using VOSviewer, as depicted in Figure 4a. A keyword is represented by a circle and its importance by the circle's size, with circles of the same color belonging to the same cluster. The number of times two connected nodes are referenced together is represented by the thickness of the link connecting the circles. In particular, a full counting method was used, involving 55 screened keywords, with a minimum threshold of 2 occurrences. In Figure 4b, an overlay visualization is also included in order to reveal changing trends in keywords. Earlier occurrences are depicted in blue, and the more recent ones in yellow. Computational linguistics, attention mechanisms, LSTMs, and location-based services emerge as some of the most recent research topics. As shown in Table 4, four clusters were identified, based on VOSviewer's default clustering technique [17]: address matching and NLP, in green (e.g., Xu et al. [20]); GIS/geocoding and machine learning, in blue (e.g., Peng et al. [21]); address standardization, in yellow (e.g., Churches et al. [6]); and address recognition and parsing, in red (e.g., Wei et al. [22]).
Table 4. Co-words obtained in each cluster by VOSviewer.


Co-Authorship Analysis
VOSviewer was also used to perform the analysis on co-authorship. A full counting method and a minimum of 2 documents and 2 citations were chosen, resulting in a total of 23 authors. As shown in Table 5 and Figure 5a,b, 9 clusters were found, which appear to be organized around the collaborating authors' countries of origin (and, in most cases, the corresponding address model structures and the language in which addresses are written), the degree of collaboration between researchers (link strength), and the average year of publication: cluster 1 corresponds to authors from India (2010); clusters 2, 3, and 6 to Chinese researchers who published articles in 2020, 2021, and 2018-2019, respectively, with the latter exhibiting a weaker link strength than the former; cluster 4 includes Portuguese authors (2018); cluster 5 involves Australian researchers (2004); cluster 7 refers to a single English author (2019); and, finally, clusters 8 and 9 engage Chinese researchers publishing in different years (2006 and 2010), with these earlier works being focused on more traditional approaches to address matching and parsing. Overall, Chinese authors lead the way, with 19 papers that represent 46% of the published articles, half of them published between 2019 and 2021, after a peak in 2016 (11%). Nevertheless, the most cited authors are the Australians Peter Christen and Tim Churches, which may be influenced by the year and field of research of the corresponding papers, since different publication and citation cultures may put at a disadvantage papers from more recent time periods and specific subfields [23]. In order to better understand the connections between the different authors and their research, an analysis based on the references of each paper was also undertaken. Of the three different types of citation-based approaches available in VOSviewer, bibliographic coupling [24] was chosen, which measures the similarity between papers based on the number of references they share [25]. This approach is less affected by changes over time, since references remain stable [25], and it outperformed alternative methods in a comparative study by X. Liu [26]. In Figure 6, a bibliographic coupling analysis of authors is depicted, based on a full counting method and a minimum of 2 documents and 2 citations, pointing to the existence of bibliographic coupling relations between almost all of the researchers at hand, in spite of their relative isolation, outside each cluster, in terms of the co-authorship analysis.

Application and Methods Analysis
An evaluation of the main application areas, methods, and algorithms used in the papers under study was also undertaken using VOSviewer. Two separate keyword analyses using a full counting method and a minimum threshold of 1 occurrence were performed. In each of the analyses, only keywords related to applications or methods/algorithms were taken into consideration. Figure 7 shows that the top 5 application domains consist of geographical information systems (GIS)/census [27], POIs/spatial analysis [2], GIS/urban planning [9], GIS/health care [28], and location-based services [29]. Taking into account the average publication year (Figure 8), it is possible to observe that the most recent application domains consist of disease control (COVID-19), location-based services, and GIS/census/urban planning, in which geocoding, with an increasing importance in people's daily lives, stands as a common feature.

As far as methods/algorithms are concerned, Figure 9 shows the growing importance of deep learning algorithms in the field of address matching since 2018, such as recurrent neural networks [30], long short-term memory networks [31], gated recurrent units [30], bidirectional encoder representations from Transformers [32,33], and graph convolutional networks [34]. Recurrent neural networks (RNNs) were originally conceived to spot patterns in data sequences like character strings, through "a recurrent hidden state whose activation at each time step is dependent on the previous time step" [35] (p. 331). Long short-term memory networks (LSTMs) and gated recurrent units (GRUs) are two well-known extensions of RNNs, which are able to handle RNNs' difficulties in modelling long-term dependencies (i.e., long sequences). Graph convolutional networks (GCNs) are a special case of graph neural networks (GNNs), which were also originally introduced as extensions of RNNs [36]. Bidirectional encoder representations from Transformers (BERT) [32,33] consist of a simpler network architecture, based solely on attention mechanisms (which assign higher weights to the most important features), not requiring the sequential processing of data.
Probabilistic approaches for segmenting and labelling sequence data, such as hidden Markov models (HMMs) and conditional random fields (CRFs) [3], in turn, were mostly used before 2015 (it should be noted, however, that CRFs have continued to be used in combination with other, more advanced approaches [37]). A hidden Markov model [38-40] consists of a finite set of unobserved (hidden) states, a matrix of transition probabilities between those states, a collection of observable facts, and an observation (or emission) matrix comprising the probabilities with which each hidden state emits an observation. Conditional random fields are inherently conditional and assume that the output labels are not independent [41,42].
Semantics, which aims at understanding natural-language contents such as addresses [9], is the most important node of the network depicted in Figure 9, reflecting its central role and growing relevance in this research field. In order to better understand the relative importance of each node and the interactions between them, two centrality measures were additionally considered: eigenvector centrality, which characterizes the global centrality of a node in a network, and betweenness centrality, which can be described as the number of times that a certain node needs another one to reach a third node through the shortest path [43]. For that purpose, a Pajek (*.net) file containing the network of keyword occurrences related to methods/algorithms was extracted from VOSviewer and used as an input to Gephi. The obtained results are included in Table 6, confirming the global centrality of semantics.
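As an illustration of this step (not the authors' actual workflow, which relied on Gephi), the following Python sketch reads a Pajek file exported from VOSviewer and computes both centrality measures with networkx; the file name is hypothetical.

```python
# Hedged sketch: compute eigenvector and betweenness centrality for a keyword
# network exported from VOSviewer in Pajek (*.net) format, using networkx instead
# of Gephi. The file name "methods_keywords.net" is a hypothetical example.
import networkx as nx

G = nx.Graph(nx.read_pajek("methods_keywords.net"))   # collapse parallel edges

eigen = nx.eigenvector_centrality(G, max_iter=1000)
between = nx.betweenness_centrality(G, normalized=True)

# Print the five most central method/algorithm keywords.
for node in sorted(G, key=eigen.get, reverse=True)[:5]:
    print(f"{node}: eigenvector={eigen[node]:.3f}, betweenness={between[node]:.3f}")
```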
The present section provides a more detailed discussion of the different address matching algorithms, based on the full text of the selected articles (also summarized in Appendix A), with a view to extending the previously presented keyword-based analysis. This more detailed literature review is organized around the three types of methods found to be the most relevant: string similarity-based methods, address element-based methods, and deep learning methods [9].
String similarity-based methods are a standard approach to address matching and generally involve the computation of a similarity metric between the addresses under comparison. Three main groups of methods can be identified: character-based, vector-space-based, and hybrid approaches [44]. Character-based methods comprise edit operations, like sub-sequence comparisons, deletions, insertions, and substitutions. One of the best-known character-based methods is the Levenshtein edit distance metric [45], consisting of the minimum number of insertions, substitutions, or deletions required to convert one string into another (for instance, the edit distance between the toponyms Lisboa and Lisbonne is three, since it requires two insertions and one substitution) [44]. Another example of a character-based method is the Jaro metric [46], specifically conceived for matching short strings, like person names, with a more advanced version (the Jaro-Winkler similarity) proposed later [47], in order to give higher scores to strings matching from the beginning up to a given prefix length. Regarding vector-space approaches, the calculation of the cosine similarity between representations based on character n-grams (i.e., sequences of n consecutive characters) is a common approach, alongside the Jaccard similarity coefficient [44]. Lastly, hybrid metrics, while combining the advantages of the two previous approaches, also allow for small differences in word tokens and are more flexible with respect to word order and position [44]. Nevertheless, in terms of performance, there is no single best technique. The available metrics are task-dependent and, according to the study by Santos et al. [44], involving the comparison of thirteen different string similarity metrics, the differences in terms of performance are not significant, even when the metrics are combined with supervised methods to avoid the manual tuning of the decision threshold (one of the most important factors in obtaining good results).
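For illustration, the following Python sketch (not taken from any of the reviewed papers) implements three of the string similarity measures discussed above and applies them to the Lisboa/Lisbonne example; optimised implementations of these and of the Jaro-Winkler metric are available in libraries such as jellyfish or textdistance.

```python
# Minimal sketch of three string-similarity measures commonly used for
# address/toponym matching, applied to the Lisboa/Lisbonne example cited above.
from collections import Counter
from math import sqrt

def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def ngrams(s: str, n: int = 2) -> Counter:
    """Character n-gram profile of a string."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def jaccard(a: str, b: str, n: int = 2) -> float:
    ga, gb = set(ngrams(a, n)), set(ngrams(b, n))
    return len(ga & gb) / len(ga | gb) if ga | gb else 1.0

def cosine(a: str, b: str, n: int = 2) -> float:
    ga, gb = ngrams(a, n), ngrams(b, n)
    dot = sum(ga[g] * gb[g] for g in ga)
    norm = sqrt(sum(v * v for v in ga.values())) * sqrt(sum(v * v for v in gb.values()))
    return dot / norm if norm else 0.0

print(levenshtein("lisboa", "lisbonne"))          # 3, as in the example above
print(round(jaccard("lisboa", "lisbonne"), 2))    # bigram Jaccard similarity
print(round(cosine("lisboa", "lisbonne"), 2))     # bigram cosine similarity
```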
Address element-based methods, in turn, rely on address parsing, a sequence tagging task which has traditionally been approached using probabilistic methods, mainly based on hidden Markov models (HMMs) and conditional random fields (CRFs) [3], alongside other less common approaches not always involving machine learning methods.
In what concerns the application of HMMs in the context of residential addresses, the hidden states correspond to each segment of the address, and the observations consist of the tokens assigned to each word of the input address string (after the application of some cleaning procedures), which may be based on look-up tables and hard-coded rules [6]. For instance, the address "17 Epping St Smithfield New South Wales 2987", after cleaning and tokenization, would be converted into a sequence of tokens such as 'NU', 'LN', 'WT', 'TR', and 'PC', where 'NU' stands for other numbers, 'LN' for locality (town, suburb) names, 'WT' for wayfare type (street, road, avenue, etc.), 'TR' for territory (state, region), and 'PC' for postal (zip) code [6] (p. 6). In order to determine, by statistical induction, the most likely arrangement of hypothetical "emitters" behind the observed sequence, a set of training examples is used to learn both the transition matrix and the observation matrix, through the maximum likelihood approach. Since it is computationally infeasible to evaluate the probability of every possible path (for N states and T observations, there would be N^T different paths), the Viterbi algorithm is used to find the most probable path through the model [48]. As such, the most probable sequence of states, based on previously trained transition and emission matrices, will present the highest probability of occurring [6]. One of the main drawbacks of traditional HMMs is the fact that they do not support multiple simultaneous observations for one token. Even in more advanced versions of HMMs, such as maximum entropy Markov models [49], in which the current state depends both on the previous state and on existing observations, there is a weakness called the label bias problem [50]: "transitions leaving a given state to compete only against each other, rather than against all transitions in the model" [41] (p. 2). Within the present literature review, four of the considered articles propose HMM-based methods: the already mentioned one by Churches et al. [6], aiming at the preparation of name and address data for record linkage purposes, through a combined approach using lexicon-based tokenization and HMMs, with the obtained experimental results confirming it as an alternative to rule-based systems that is both feasible and cost-effective; a second paper by the same authors [51], in which a geocoding system based on HMMs and a rule-based matching engine (Febrl) for spatial data analysis is proposed and tested on small datasets of randomly selected addresses from different sources, with experimental results pointing to exact match rates between 89% and 94%, depending on the source and considering the total exact matches obtained at various levels (address level, street level, and locality level); the paper by X. Li et al. [40], in which an HMM-based large-scale address parser is proposed, obtaining an accuracy of 95.6% (F-measure) after being tested on data from various sources with varying degrees of quality and containing billions of registers, of which 20% were synthetically rearranged in order to reproduce normal address variations; and, finally, the paper by Fu et al.
[52], in which an HMM-based segmentation and recognition algorithm is proposed for the development of automatic mail-sorting systems involving handwritten Chinese characters (a problem which will be further addressed in the present literature review), with experimental results confirming its effectiveness.
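The following toy Python sketch illustrates the Viterbi decoding step described above for a coarse set of address states; all states, token classes, and probabilities are invented for illustration and are not the values learned by Febrl or any of the reviewed systems.

```python
# Illustrative sketch only: a toy Viterbi decoder for HMM-based address parsing.
# States, token classes, and probabilities are invented; real systems learn them
# from labelled training data.
import numpy as np

states = ["NU", "WT", "LN", "TR", "PC"]          # number, wayfare type, locality, territory, postcode
start  = np.log(np.array([0.6, 0.05, 0.2, 0.05, 0.1]))
trans  = np.log(np.array([                       # P(state_t | state_{t-1}), rows = previous state
    [0.05, 0.60, 0.25, 0.05, 0.05],
    [0.05, 0.05, 0.60, 0.20, 0.10],
    [0.05, 0.05, 0.30, 0.40, 0.20],
    [0.05, 0.05, 0.10, 0.30, 0.50],
    [0.20, 0.20, 0.20, 0.20, 0.20],
]))
obs_vocab = ["digits", "word", "street_kw", "state_kw", "postcode_like"]
emit = np.log(np.array([                         # P(token class | state)
    [0.80, 0.05, 0.05, 0.05, 0.05],   # NU
    [0.05, 0.15, 0.70, 0.05, 0.05],   # WT
    [0.05, 0.80, 0.05, 0.05, 0.05],   # LN
    [0.05, 0.30, 0.05, 0.55, 0.05],   # TR
    [0.30, 0.05, 0.05, 0.05, 0.55],   # PC
]))

def viterbi(observations):
    """Return the most probable state sequence for a list of token-class indices."""
    T, N = len(observations), len(states)
    delta = np.zeros((T, N)); psi = np.zeros((T, N), dtype=int)
    delta[0] = start + emit[:, observations[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + trans + emit[:, observations[t]]
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0)
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return [states[s] for s in reversed(path)]

# "17 Epping St Smithfield NSW 2987" mapped to coarse token classes:
tokens = [0, 1, 2, 1, 3, 4]   # digits, word, street_kw, word, state_kw, postcode_like
print(viterbi(tokens))        # expected roughly: NU, LN, WT, LN, TR, PC
```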
Conditional random fields (CRFs) are a more recent innovation in the field of text segmentation. CRFs are conditional by nature and assume no independence between output labels, reflecting real-world addresses, in which zip codes, for instance, are related to city names, localities, and even streets [3]. Having all the advantages of maximum entropy Markov models (MEMMs), CRFs also solve the label bias problem by letting the probability of a transition between labels depend on past and future elements, and not only on the current address element [3]. "The essential difference between CRFs and MEMMs is that the underlying graphical model structure of CRFs is undirected, while that of MEMMs is directed" [41] (p. 2). Considering, as an example, the address "3B Records, 5 Slater Street, Liverpool L1 4BW", an HMM parser would erroneously predict the first and second labels as standing for number ('3B') and street ('Records'), respectively, whereas a CRF parser, when reaching the actual property number (5), would give a higher score to the current label in order to revise it to a property number and the previous label (3B Records) to a business name [3]. Another recent approach to address parsing is based on so-called "word embeddings", the name given to the vector representation of words [3]. An implementation of such a method is word2vec [53], an unsupervised neural network language model which aims to make predictions about the next words by modelling the relationships between a given word and the words in its context, based on two possible architectures: the continuous skip-gram model (Skip-Gram) and the continuous bag-of-words model (CBOW) [53]. The latter is usually chosen over the former, since it is trained by inferring the meaning of a particular word from its context [9].
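As a minimal illustration of the word2vec step, the following sketch trains CBOW embeddings on a toy corpus of tokenised addresses using the gensim library (gensim 4.x API assumed); real applications would use corpora with millions of address records.

```python
# Hedged sketch (not the reviewed papers' code): CBOW word2vec embeddings trained
# on tokenised address strings with gensim. sg=0 selects CBOW; sg=1 would select
# the skip-gram architecture mentioned above.
from gensim.models import Word2Vec

# Toy corpus of segmented addresses; real corpora would be far larger.
corpus = [
    ["17", "epping", "st", "smithfield", "nsw", "2987"],
    ["5", "slater", "street", "liverpool", "l1", "4bw"],
    ["5", "slater", "st", "liverpool", "l1", "4bw"],
]

model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, sg=0)
print(model.wv.similarity("st", "street"))   # cosine similarity of two token vectors
```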
A practical comparison between HMMs, CRFs, and a CRF augmented with word2vec is undertaken in Comber and Arribas-Bel [3]. The VGI-based Libpostal library (https://github.com/openvenues/libpostal (accessed on 9 November 2021)), which trains a CRF model on 1 billion street addresses from OSM data, was used for the segmentation task. Although the obtained results are broadly consistent in terms of precision, the classifiers using the HMM technique present lower recall values than the ones obtained by the CRF, meaning that both methods are capable of distinguishing true positives from false positives, but the CRF is able to classify a greater proportion of matches [3]. The augmented version of the CRF model does not outperform the results obtained by the original one, but presents the advantage of not committing the user to a particular string distance and its biases [3]. In another recent work by the same author [54], a predictive model for address matching is proposed, based on recent innovations in machine learning and on a CRF parser for the segmentation of address strings. The biggest contribution of that paper, however, is the thorough documentation of all the steps required to execute the proposed model's workflow. In other papers included in the present literature review, CRFs are used as a benchmark model (e.g., Dani et al. [55]) or in combination with other methods, which will be further addressed [56-58].
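For reference, the Libpostal library mentioned above can be called from Python through its "postal" bindings roughly as follows; the address string is illustrative and the exact labels returned depend on the trained model.

```python
# Usage sketch of the libpostal Python bindings (package name "postal").
from postal.parser import parse_address

# Returns a list of (value, label) tuples, e.g. ('5', 'house_number'),
# ('slater street', 'road'), ('liverpool', 'city'), ('l1 4bw', 'postcode').
print(parse_address("3B Records, 5 Slater Street, Liverpool L1 4BW"))
```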
Other, less recent approaches have been proposed for address parsing/segmentation, namely within address standardization studies aiming at minimizing the size of labelled training data. One such example is the work by Kothari et al. [59], in which a nonparametric Bayesian approach to clustering grouped data, known as the hierarchical Dirichlet process (HDP) [60], is used with a view to discovering latent concepts representing common semantic elements across different sources and allowing the automatic transfer of supervision from a labeled source to an unlabeled one. The obtained latent clusters are used to segment and label address registers in an adapted CRF classifier, with experimental results pointing to a considerable improvement in classification accuracy [59]. A similar approach is proposed by Guo et al. [61], whose paper presents a supervised address standardization method with latent semantic association (LaSA), with a view to capturing latent semantic associations among words in the same domain. The obtained experimental results show that the performance of standardization is significantly improved by the proposed method. Expert systems have also been proposed, namely by Dani et al. [55], in which a Ripple Down Rules (RDR) framework is presented with a view to enabling a cost-effective migration of data cleansing algorithms between different datasets. RDR allows rules to be modified incrementally and exceptions to be added without unwanted side effects, based on a failure-driven approach in which a rule is only added when the existing system fails to classify an instance [55]. After comparison with traditional machine learning algorithms and a commercial system, experimental results show that the RDR approach requires significantly fewer rules and training examples to reach the same accuracy as the former methods [55].
Tree-based models have been proposed to handle automatic handwritten address recognition, a particular address parsing/segmentation task mostly studied by Chinese researchers, due to the greater complexity of the Chinese language (larger character set, different writing styles, great similarity between many of the characters) [62]. In the paper by Jiang et al. [62], a suffix tree is proposed to store and access addresses from any character. In relation to previous approaches also based on a tree data structure, the proposed suffix tree is able to deal with noise and address format variations. Basically, a hierarchical substring list is first built, after which the obtained input radicals are compared with candidate addresses (filtered by the postcode) with a view to optimizing a cost function that combines both recognition and matching accuracy [62]. A correct classification rate of 85.3% is obtained in the experimental results. However, according to Wei et al. [22], the recognition accuracy of character-level-tree (CLT) models is dependent on the completeness of the address list on which they are based. In order to overcome this limitation, the authors propose a structure tree built at the word level (WLT), in which each node consists of an address word and the path from the root to a leaf corresponds to a standardized address format. After initial recognition by a character classifier, segment candidate patterns are mapped to candidate address words based on the WLT database. In the final phase (path matching), candidate address words' scores are summed in order to obtain the address recognition result [22]. The obtained experimental results show that the proposed method outperforms four benchmarking methods, including the previously mentioned suffix tree. Address tree models for address parsing and standardization are also proposed in the papers by Tian et al. [63], Liu et al. [64], and Li et al. [65]. In the first two, the address tree model is mainly used for rule-based validation and error recognition, by providing information about the hierarchy of Chinese addresses, while in the paper by Li et al. [65], latent tree structures are designed with a view to capturing rich dependencies between the final segments of an address, which do not always follow the same order.
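The following simplified Python sketch illustrates the word-level-tree idea, under the assumption that each root-to-leaf path spells out one standardised address and that a character recogniser supplies per-level candidate words with scores; the tree contents and scores are invented examples, not data from Wei et al. [22].

```python
# Simplified sketch of a word-level address tree: root-to-leaf paths are
# standardised addresses, and recognition candidates are matched level by level.
from dataclasses import dataclass, field

@dataclass
class Node:
    children: dict = field(default_factory=dict)   # address word -> child node
    is_leaf: bool = False

def insert(root: Node, address_words):
    node = root
    for w in address_words:
        node = node.children.setdefault(w, Node())
    node.is_leaf = True

def best_path(root: Node, candidate_scores):
    """candidate_scores: per level, a dict {candidate word: recognition score}.
    Returns the standardised path with the highest summed score, if any."""
    best = (float("-inf"), None)
    def dfs(node, level, score, path):
        nonlocal best
        if level == len(candidate_scores):
            if node.is_leaf and score > best[0]:
                best = (score, list(path))
            return
        for word, child in node.children.items():
            if word in candidate_scores[level]:
                path.append(word)
                dfs(child, level + 1, score + candidate_scores[level][word], path)
                path.pop()
    dfs(root, 0, 0.0, [])
    return best

root = Node()
insert(root, ["guangdong", "shenzhen", "nanshan", "keyuan road"])
insert(root, ["guangdong", "shenzhen", "futian", "fuzhong road"])
print(best_path(root, [{"guangdong": 0.9}, {"shenzhen": 0.8},
                       {"nanshan": 0.6, "futian": 0.5}, {"keyuan road": 0.7}]))
```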
Within the address element-based methods, it is also worth highlighting geocoding as a means to enhance address standardization, through the correction of misspellings and the filling of missing attributes, some of the most common errors found in postal addresses [66]. After successful matching with a record from a standardized reference database (like Google Maps or OSM), reverse geocoding can be performed to obtain a valid and complete representation of the queried address. In the case of geocoded databases like GNAF, for instance [51], geographic coordinates can be used to calculate the spatial proximity between different records for conducting distance-based spatial analyses and for record linking purposes between different databases (up to the house number). Another important application of address geocoding relates to the matching of historical address records (such as census records) with contemporary data, by attaching grid references to the former in order to perform longitudinal spatial demographic analyses [27]. However, the successful automated geocoding of residential addresses depends on a number of factors, namely population densities (with positional error increasing as population density decreases) [27,67], the completeness of an address (existence or not of a number and street name), and changes in street names, among others [27]. These limitations can be tackled by the previous standardization and enrichment of addresses [68] and the choice of the most adequate geocoding method, including the use of property data [67] or the use of hybrid geocoding approaches [28].
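As a usage illustration of the geocoding and reverse geocoding steps (not tied to any specific reviewed system), the following sketch queries the OSM Nominatim service through the geopy client; production use is subject to the service's usage policy.

```python
# Hedged sketch of geocoding / reverse geocoding with geopy and Nominatim (OSM).
# Any reference geocoder with a comparable API would serve the same purpose.
from geopy.geocoders import Nominatim

geocoder = Nominatim(user_agent="address-matching-review-example")

location = geocoder.geocode("5 Slater Street, Liverpool L1 4BW")
if location is not None:
    print(location.latitude, location.longitude)
    # Reverse geocoding yields a standardised, complete representation of the point,
    # which can be used to correct misspellings and fill missing address attributes.
    print(geocoder.reverse((location.latitude, location.longitude)).address)
```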
With the advancement of deep learning methods, various authors have recently proposed the adoption of the previously mentioned extensions to RNNs (namely, LSTMs, GRUs, and GCNs), in order to better cope with non-standard address records and highly dissimilar toponyms. LSTMs and GRUs are both composed of gates, which consist of neural networks that regulate the flow of information from one time step to the next, thereby helping to solve the short memory problem. In particular, GRUs have two gates (update and reset gates), while LSTMs have three (input, forget, and output gates) [30]. In LSTMs, the amount of fresh information added through the input gate is unrelated to the amount of information retained through the forget gate. In GRUs, the retention of past memory and the input of new information to memory are not mutually exclusive. The GRU stores both long-term dependencies and short-term memories in a single hidden state, whereas the LSTM stores the former in the cell state and the latter in the hidden state. Because there are fewer weights and parameters to update during training, GRUs are faster to train than LSTMs [30].
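The parameter-count argument can be verified directly; the following sketch (PyTorch assumed, with illustrative layer sizes) compares a bidirectional LSTM and GRU layer of the same width.

```python
# Quick check of the parameter-count argument above: for the same hidden size, a GRU
# layer has three gate weight sets against the LSTM's four (illustrative sizes).
import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

lstm = nn.LSTM(input_size=128, hidden_size=128, bidirectional=True)
gru  = nn.GRU(input_size=128, hidden_size=128, bidirectional=True)
print(n_params(lstm), n_params(gru))   # the GRU has about 3/4 as many parameters
```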
Within the present literature review, several of the considered papers propose these types of methods, namely the ones by Santos et al. [35], Lin et al. [9], J. Liu et al. [58], Shan et al. [7,29], P. Li et al. [69], and Chen et al. [70]. To take into account contextual information from both previous and future tokens, by processing the sequence in two directions, bidirectional LSTM (BiLSTM) or GRU layers are also employed in the great majority of these studies. The best performing models further connect the encoder and decoder through an attention mechanism, in order to assign higher weights to the most important features [7,9,29]. With a view to reducing overfitting and enhancing the classification models' generalization abilities, a dropout regularization layer is also normally added [9,35,58]. The ESIM model [71] is an illustrative example of a deep learning architecture based on the principles previously described. After address tokenization (with the help of gazetteers and dictionaries, in the case of more complex languages with no natural separators) and the obtaining of vector representations of the different (labelled) address pairs (based on word2vec), the ESIM model is employed through the following four layers [9]:
• An input encoding layer, which encodes the input address vectors and extracts higher-level representations using the bidirectional long short-term memory (BiLSTM) model;
• A local inference modelling layer, which makes local inference of an address pair using a modified decomposable attention model [72];
• An inference composition layer, responsible for making a global inference between two compared address records based on their local inference, in which average and max pooling are used to summarize the local inference and output a final vector with a fixed length;
• Finally, a prediction layer, based on a multilayer perceptron (MLP) composed of three fully connected layers with rectified linear unit (ReLU), tanh, and softmax activation functions, which outputs the predictive results of address pairs (that is, whether there is a match or not).
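The following PyTorch sketch gives a strongly simplified illustration of this kind of siamese BiLSTM architecture with a soft-alignment attention step, pooling-based composition, and an MLP prediction head; it omits the re-encoding composition BiLSTM of the full ESIM model, and all dimensions and the toy usage are illustrative rather than a reproduction of any reviewed model.

```python
# Simplified sketch (PyTorch assumed) of a siamese BiLSTM address matcher with
# soft-alignment attention; a reduced illustration of the ESIM-style pipeline.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AddressMatcher(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(0.3)     # dropout regularization, as discussed above
        self.mlp = nn.Sequential(
            nn.Linear(8 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def encode(self, x):
        h, _ = self.encoder(self.emb(x))   # (batch, len, 2*hidden)
        return h

    def forward(self, addr_a, addr_b):
        a, b = self.encode(addr_a), self.encode(addr_b)
        # Soft alignment (decomposable attention): each token attends to the other address.
        scores = torch.bmm(a, b.transpose(1, 2))                     # (batch, len_a, len_b)
        a_aligned = torch.bmm(F.softmax(scores, dim=2), b)
        b_aligned = torch.bmm(F.softmax(scores, dim=1).transpose(1, 2), a)
        # Composition: average and max pooling summarize each enhanced sequence.
        def pool(h):
            return torch.cat([h.mean(dim=1), h.max(dim=1).values], dim=1)
        v = torch.cat([pool(a_aligned), pool(b_aligned)], dim=1)
        return torch.sigmoid(self.mlp(self.dropout(v))).squeeze(-1)  # match probability

# Toy usage with padded token-id tensors (batch of 4 address pairs, length 12).
model = AddressMatcher(vocab_size=1000)
a = torch.randint(1, 1000, (4, 12)); b = torch.randint(1, 1000, (4, 12))
print(model(a, b).shape)   # torch.Size([4])
```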
In terms of performance, all of the previously presented deep learning methods achieve a greater matching accuracy than the traditional text-matching models. In the case of the BiLSTM model proposed by Lin et al. [9], the precision, recall, and F1 score on the test set all reached 0.97, against the 0.92 scores achieved by the second-best performing model (Jaccard similarity coefficient + RF method). The deep neural network based on GRUs proposed by Santos et al. [35], to categorize toponym pairs as matches or non-matches, also outperforms traditional text-matching methods, achieving an increase of almost 10 points in most of the evaluation metrics (namely, accuracy, precision, and F1). The LM-LSTM-CRF+BP neural network model proposed by J. Liu et al. [58] achieves an accuracy and F1 score of 87%, compared with average scores of 70% by the benchmark methods (word2vec and edit distance). The address GCN method proposed by Shan et al. [7] also presents better results, on both precision (up to 8%) and recall (up to 12%), than the existing methods, which include the DeepAM model previously proposed by the same authors, based on an encoder-decoder architecture with two LSTM networks [29]. The Bi-GRU neural network proposed by P. Li et al. [69] presents a performance similar to that of a Bi-LSTM neural network (F1 score of 99%) and a higher performance than unidirectional GRU and LSTM neural networks (F1 score of 93%), as would be expected. Finally, the attention-Bi-LSTM-CNN network (ABLC) proposed by Chen et al. [70] achieves 4-10% higher accuracy than the baseline models, which include the previously mentioned ESIM model, the second-best overall performer.
In two of the most recent studies included in the present literature review, also based on deep learning methods, bidirectional encoder representations from Transformers (BERT) are proposed instead. The first one is the study by Xu et al. [20], which proposes a method for fusing addresses and geospatial data based on BERT (for learning address semantics) and a K-Means high-dimensional clustering algorithm, enhanced by innovative fine-tuning techniques, to create a geospatial-semantic address model (GSAM). The computational representation extracted from GSAM is then employed for predicting address location, based on a neural network architecture that minimizes the Euclidean distance between the predicted and real coordinates. In the second study [37], a new address element recognition method is proposed for dealing with address elements with short change periods (streets, lanes, landmarks, points of interest names, etc.) which have not yet been included in a segmentation dictionary. A model based on BERT is first applied to obtain the vector representations of the dataset and learn the contextual information and model address features, followed by the use of a CRF to predict the tags, with new address elements being recognized according to the tag [37].
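As a rough illustration of how such BERT-based pair classification can be set up with the HuggingFace transformers library (an assumption; the reviewed papers do not necessarily use this toolkit), the following sketch scores an address pair with an untrained classification head; the checkpoint name and addresses are illustrative.

```python
# Hedged sketch: BERT-based address-pair classification with HuggingFace transformers.
# Shows inference with an (untrained) pair-classification head only; fine-tuning on
# labelled match/non-match pairs would precede real use.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)

inputs = tokenizer(
    "浙江省杭州市西湖区文三路90号",          # address A (illustrative)
    "杭州市西湖区文三路90号东方通信大厦",      # address B (illustrative)
    return_tensors="pt", truncation=True, padding=True,
)
with torch.no_grad():
    logits = model(**inputs).logits
print(torch.softmax(logits, dim=-1))   # match / non-match probabilities after fine-tuning
```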
In terms of performance, the GSAM model [20] achieves a classification accuracy above 0.97, against a minimum expected accuracy of 0.91 by other methods; the BERT-CRF model [37] achieves the highest F1 score on generalization ability (0.78), when compared to benchmark models combining word2vec, BiLSTM, and CRF methods (with an average F1 score of 0.41), as well as an equally high F1 score on the testing dataset (0.95).
Although related to POIs' locations and descriptions, two final articles (both published in 2021) are worth mentioning, due to their combined use of the previously presented approaches and spatial correlation/reasoning methods. The first of these studies [2] presents a method for identifying POIs in large POI datasets in a quick and accurate manner, based on two components: an enhanced address-matching algorithm, combining string, semantic, and spatial similarity within an ontology model describing POIs' locations and relationships, in order to support the transition from semantic to spatial; and a grid-based algorithm capable of achieving compact representations of vast qualitative direction interactions between POIs and performing quick spatial reasoning, through the fast retrieval of direction relations and quantitative calculations. The second of the studies [8] proposes an unsupervised method to segment and standardize POIs' addresses, based on a GRU neural network combined with the spatial correlation between address elements for the automatic segmentation task, and a tree-based fuzzy matching of address elements for the standardization task, with experimental results pointing to a relatively high accuracy.
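As a purely illustrative sketch of the hybrid idea, the following code fuses a token-based string similarity with an exponentially decaying spatial proximity term for two POI records carrying coordinates; the weights, distance scale, and records are assumptions rather than values from [2] or [8].

```python
# Illustrative-only sketch of a hybrid score combining string similarity with
# spatial proximity for POI/address records; all parameters are assumptions.
from math import radians, sin, cos, asin, sqrt, exp

def haversine_km(p, q):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (*p, *q))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def jaccard_tokens(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def hybrid_score(rec_a, rec_b, w_text=0.6, w_space=0.4, scale_km=0.5):
    text_sim = jaccard_tokens(rec_a["address"], rec_b["address"])
    space_sim = exp(-haversine_km(rec_a["coord"], rec_b["coord"]) / scale_km)
    return w_text * text_sim + w_space * space_sim

a = {"address": "5 slater street liverpool", "coord": (53.4036, -2.9852)}
b = {"address": "5 slater st liverpool",     "coord": (53.4037, -2.9850)}
print(round(hybrid_score(a, b), 3))   # high score: near-identical text and location
```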

Research Gaps
Within the more recently published papers considered in the present literature review, the most relevant opportunities for further work can be summarized as follows: the use of representative and large enough datasets [20]; the inclusion of duplicate place names, in order to enable the application of the proposed methodology to a national address database [9]; the assignment of different weights to the address-element vectors depending on their hierarchy, in order to improve accuracy [9]; the need to fine-tune the weight ratio of fused features, such as coordinates and the semantic representation of addresses, alongside the improvement of the underlying concatenation method and measurement metrics [20]; the adoption of systematic approaches for tuning hyper-parameters and experimenting with different architectures [35]; and the need to involve more complex spatial objects and relations [2,8]. Some of the limitations highlighted in less recent studies, however, should also be taken into consideration in the application of the most recent methods, like the need to tackle privacy and confidentiality issues [51] when using personal quasi-identifiers such as addresses (especially residential ones). Another concern that should be addressed, and which was tackled in some of the earlier studies [55,59,61], is the minimization of human labelling when generating both training and test data. Lastly, no references have been found about the use of genetic programming (GP) [73] in the field of semantic address matching. GP has several advantages over other machine learning methods, including the ability to provide results that can be easily interpreted, based on programs, rules, or functions, as well as the ability to easily incorporate specific knowledge about a problem, despite its efficiency issues, which are primarily due to a time-consuming fitness function computation [74]. The main research gaps are illustrated in Figure 10.


Conclusions
In this study, a systematic literature review based on Scopus and Web of Science, covering a time span of 20 years, was undertaken in order to better understand how past and current limitations to address matching have been and can be overcome through the adoption of automated approaches. For the screening of the articles initially found, the PRISMA guidelines were followed, resulting in a final set of 41 relevant papers from highly ranked journals and conference proceedings. VOSviewer, a bibliometric analysis tool, was used to perform cluster analysis on the relationships between authors and popular research topics. The number of published articles has been increasing since 2017, a trend that may be closely related to the application of deep learning methods in this field. Chinese authors lead the way, with 19 papers that represent 46% of the published articles, half of them published between 2019 and 2021, after a peak in 2016 (11%). Disease control (COVID-19), location-based services, and GIS/census/urban planning stand as some of the most recent application domains in the field under study. The research seems to confirm that probabilistic methods (such as HMMs and CRFs) have been displaced by NLP methods based on semantics, encoder-decoder architectures, and attention mechanisms. There also seems to be some evidence pointing to the very recent adoption of hybrid approaches with an increased use of spatial constraints and entities. It should be noted, however, that this review has some limitations, such as the subjectivity of the search query and screening procedures. As such, a more effective search query should be considered in future research, with a view to avoiding the exclusion of potentially relevant papers. In spite of its limitations, the present review provides a concise and detailed overview of the research being produced in the field of automated address matching within a considerably long time span of 20 years. Future studies can build upon its main findings, mainly in what concerns the improved use of the identified deep learning algorithms, in terms of the adoption of unsupervised or semi-supervised settings, optimization strategies for deep neural network training, and/or systematic approaches to hyper-parameter tuning, as well as of privacy-preserving methods, namely when dealing with residential addresses acting as quasi-identifiers in record linkage processes. The use of more complex spatial objects and relations, as a means to enhance address matching and standardization, is another important gap to address, namely in domains not limited to POI retrieval. Lastly, no references have been found to evolutionary-based approaches in the field of semantic address matching, which may also be a potential research gap to address in future studies.