Augmenting natural hazard exposure modelling using natural language processing

Natural hazard exposure modelling involves constructing databases that describe the elements (people and built environment) exposed to some hazard in a selected location. These databases are often constructed using information from censuses, cadastral data, or satellite imagery. In this work, we suggest complementing hazard exposure modelling using an alternative and unconventional data source: the text components of building permits. The proposed methodology, Natural Language Processing for the Global Exposure Database (NLP4GED), adopts natural language processing techniques to extract building-by-building exposure attributes in line with the GED4ALL taxonomy (Global Exposure Database for ALL). This three-step methodology involves using: a classifier to filter permits potentially containing exposure information; a clustering algorithm to identify semantically similar permits; and regular expressions (or regex) to extract exposure attributes. As an illustrative application, we apply NLP4GED to wrangle an unstructured real-world dataset of 100,989 building permits in Malta. We effectively provide relevant exposure attributes (i.e., year of construction, building height, and occupancy) for 23,076 buildings presented in a geographic information system (GIS) environment.


Introduction
Physical exposure modelling captures and classifies the characteristics of the built environment, describing the attributes of different assets (e.g., buildings or infrastructure) and those assets' vulnerability to one or more hazards. For example, identifying the number of buildings (quantity) possessing basements (quality) in flood-prone areas is a form of exposure modelling. Risk assessment methodologies require robust and extendable taxonomies deployed to normalise the descriptions of an asset's features which affect its vulnerability to some hazard (e.g., building height with respect to seismic shaking). Significant efforts by various parties have been made to develop such taxonomies, such as the ATC-13 [1], the European macro-seismic scale [2], PAGER (Prompt Assessment of Global Earthquake Response) [3] and HAZUS 6.0 (HAZard United States) [4]. In this study, we propose a methodology for capturing exposure attributes as defined by the taxonomy of the Global Exposure Database for All (GED4ALL [33]). GED4ALL is a multi-hazard taxonomy that defines 15 main building attributes (namely: direction, material, lateral load resisting system, height, date of construction or retrofit, surroundings, occupancy, shape of building plan, structural irregularity, ground floor hydrodynamics, exterior walls, roof, floor, foundation, and fire protection). Each building in an exposure database with a known exposure attribute value is classified by the taxonomy in an exhaustive manner (e.g., a lateral load-resisting system comprised of unreinforced masonry would correspond to MUR).
Further to taxonomy systems, constructing building exposure databases requires that practitioners derive, and often layer, information from multiple sources. Census data is a common starting point and has been used to gather exposure attributes for dwellings such as their geographic density, distribution, and construction materials (e.g., [5,6]). Furthermore, data from maps or satellite imagery, reinforced by census-based statistics, have been deployed, for example, for modelling the flood exposure of coastal buildings [7]. Similarly, population density grids have been used to study the exposure of people near volcanoes [8], while on-site studies (e.g., [9]) provide real-world exposure data, albeit limited by practical considerations. In another example, several sources such as journal papers, technical guides and policies have been aggregated, as in the global database of Flood Protection Standards, FLOPROS [10].
In this paper, we posit that natural language, more specifically the text found in building permit descriptions, is a potentially untapped source of exposure information. Language has been utilized in disaster risk reduction (DRR) to, for example, monitor post-event impacts and responses. Researchers have, in some cases, processed social media content to analyse responses to natural hazard events (e.g., [11,12]). Alternatively, the same social media content has been utilized as an early warning system (e.g., [13]). Our exploitation of individual building permits in the context of DRR is not common: in a study reviewing natural hazard risk assessment literature, none of the exposure modules (where applicable) directly used language as a data source [14]. To our knowledge, the closest use of building permits in DRR comes from post-event studies. One example is an analysis following a tornado pass through the town of Joplin (Missouri, USA), in which the specific locations of building permits associated with roof repair were used to approximate the path of the tornado [15]. We therefore propose to fill this gap by investigating the potential use of building permits and natural language as an augmentation to existing exposure-modelling techniques.
As governments continue to shift towards e-government, building permit data has become more readily available. As early as 2007, the UK and Austria had, for example, fully digitized their planning processes [16]. The text components of building permits vary by region but can reasonably be expected to include at least an address and a brief description: i.e., the 'permit text' which briefly describes the permit proposal. We posit that the text embedded in these permit descriptions may offer insights into the relevant exposure attributes of the building to which the permit refers (see Table 1 for some examples). There is therefore a need for a methodology which extracts exposure attributes embedded in a given corpus of building permits. To do so, we turned to the field of computer science concerned with handling written or spoken word: natural language processing, or NLP [17,18].
The proposed methodology, Natural Language Processing for the Global Exposure Database for All (NLP4GED), first involves using classification algorithms to filter permits which contain exposure information (Section 2.2) from those that offer no insights. Classification algorithms are a form of supervised learning, where training data points (e.g., building permits) having some known class are first defined (e.g., data points containing exposure attributes vs. not containing exposure attributes). A classification model is constructed with this training data; the model can then parse new data (e.g., unseen building permits) and assign data points to one of the original classes. The second step of our methodology involves using a clustering algorithm on the filtered dataset of building permits to group them by linguistic similarity (Section 2.3). Clustering is an unsupervised statistical method, as no training data is passed with the dataset. We finally use regular expressions (or regex) to extract specific exposure attributes from the different clusters (Section 2.4). A regex is an abstract pattern for searching through text that returns a match if a text adheres to the designed pattern. For example, a regex could be designed to return a match if the word 'building' appears in a text, provided 'building' is preceded by 'construct' (i.e., 'construct a new building' would return a match, but 'to replace apertures in a building' would not). This regex would take the abstract form of: (?<=construct)(?:.*)(building). More details on regex patterns are provided in Section 2.4 and Appendix A.
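As a minimal illustration, the lookbehind example above can be reproduced with Python's `re` module:

```python
import re

# Match "building" only when it is preceded (anywhere earlier in the
# text) by "construct": a fixed-width lookbehind followed by a
# greedy gap and the target word.
pattern = re.compile(r"(?<=construct)(?:.*)(building)")

print(bool(pattern.search("construct a new building")))            # True
print(bool(pattern.search("to replace apertures in a building")))  # False
```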
Apart from this introduction, this paper is structured as follows: Section 2 describes NLP4GED on a general, step-by-step basis. Section 3 provides an illustrative application of NLP4GED in which we successfully extract three of the 15 exposure attributes defined by GED4ALL from a case study dataset. Finally, Section 4 draws the relevant conclusions.

Proposed NLP4GED methodology
The proposed methodology (Fig. 1a) involves sequentially applying three steps to some given corpus (Section 2.1) of building permits:
1. Build a classifier to act as a "noise filter" (Section 2.2) which categorises the dataset into two segments: trivial building permits (i.e., those not containing insights on exposure attributes) versus non-trivial ones that potentially contain exposure attributes. This model is trained on a small subset of manually-labelled data;
2. Build a clustering model (Section 2.3) which groups the filtered 'non-trivial' building permits into clusters based on similar semantic (i.e., meaning) and syntactic (i.e., grammar) properties. This aims at forming clusters containing permits with similar topics and language which provide insights on similar exposure attribute(s);
3. Following a checklist-based approach, design a library of regex patterns able to capture different GED4ALL exposure attributes within the clusters (Section 2.4).
The methodology is general enough to accommodate different building permit datasets (e.g., pertaining to different authorities and/or regions).
The constructed set of regex patterns is finally deployed on the non-trivial corpus of building permits (Fig. 1b) to provide a semi-supervised tagging of the corpus (i.e., tagging large portions of the corpus simultaneously, rather than manually tagging permits one by one). The tags correspond to the exposure attributes as per the GED4ALL taxonomy. A Python-based code to apply this methodology is openly available (https://github.com/justinschembri/nlp4ged).
To facilitate the description of each step of the methodology, in the following subsections we provide realistic examples related to the illustrative application (Section 3), which involves 100,989 publicly available building permits issued between 2005 and 2021 in Malta [19]. The full details of the illustrative application are shown in Section 3.

Initial corpus analysis
A corpus is a collection of text documents which, in this case, are building permits. Familiarization with the building permit corpus is fundamental as it provides a general idea of the exposure attributes which may be embedded in it. When dealing with an unknown corpus, it is common to first quantify a few key descriptors, such as (but not limited to): 1) the number of documents, or size; 2) the date range of the building permits; 3) the mean building permit word count; and 4) the evolution of mean word count with time.
The size of the corpus may guide the choice of the prediction algorithm(s) adopted for classification and clustering (see "Scikit Learn - Choosing the right estimator" [34] for a simple guide to estimator choice based on corpus size). Depending on the time span between the earliest and most recent permit, it may be desirable to split the corpus into smaller corpora that can be analysed separately, because the writing style of building permit descriptions may change over large periods of time. One rough proxy for identifying potential style changes is the mean word count of building permits (longer texts may also suggest more exposure insights are extractable). In our illustrative example, we quantify the mean word count on a yearly basis. The example shown in Table 2 and Fig. 2 shows a relatively short date range, with a slight trend towards the building permits becoming more verbose over time. This suggests that newer permits may offer more exposure insights than older ones.
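The key descriptors above can be computed with a few lines of standard-library Python; the mini-corpus of (year, permit text) pairs below is invented for illustration:

```python
from statistics import mean

# Hypothetical mini-corpus standing in for a real building-permit dataset.
corpus = [
    (2005, "To carry out internal alterations"),
    (2005, "To construct a garage"),
    (2017, "To demolish existing dwelling and construct five floors of apartments"),
    (2017, "To excavate site and construct basement garage and three overlying floors"),
]

size = len(corpus)                                      # 1) number of documents
date_range = (min(y for y, _ in corpus),                # 2) date range
              max(y for y, _ in corpus))
mean_wc = mean(len(t.split()) for _, t in corpus)       # 3) mean word count

# 4) mean word count per year, a rough proxy for style changes over time.
yearly = {}
for year, text in corpus:
    yearly.setdefault(year, []).append(len(text.split()))
yearly_mean_wc = {y: mean(c) for y, c in sorted(yearly.items())}

print(size, date_range, mean_wc, yearly_mean_wc)
```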
To further familiarise with the corpus, we suggest randomly selecting a small and manageable portion of it (e.g., in our illustrative example, we consider 250 permits) and manually reviewing them to discover if they contain any information transferable to the GED4ALL exposure attributes. Such building permits should then be classified as non-trivial (i.e., containing exposure attribute information) or trivial. Apart from providing insight into the overall writing style of the building permits, this task directly feeds into the deployment of the noise-filtering classifier discussed in the next section.
In this preliminary stage, we also recommend running some initial word searches through the corpus for keywords associated with the GED4ALL attributes (see Table 3). This simple step may, with minimal effort, already highlight which exposure attributes may be made available by the corpus.

Noise-filtering classifier
After gaining some familiarity with the corpus, the next step involves developing a text classifier. This model seeks to discriminate those permits which may contain exposure insights (i.e., non-trivial) from those that do not (i.e., trivial). Such a filtering process allows the clustering algorithm (Section 2.3) to be deployed only on data relevant to the task, thus facilitating the design of regex patterns in the next step (Section 2.4). Calibrating the classifier requires these steps:
1. Construct an NLP pipeline, which involves sequential pre-processing, tokenization, normalization, vectorization, and prediction modules (see Appendix A.1 for detailed definitions). Perform a grid search considering several combinations of models/techniques for each step in the pipeline, as well as combinations of their hyperparameters. Several model choices are possible, and their choice is highly dependent on the given corpus. However, some common general starting points are: 1) pre-process text to remove non-alphabetic symbols, lowercase the text and remove stop words (e.g., 'and', 'a', 'the'); 2) tokenize the sentence by white space; and 3) do not use any additional normalization techniques such as lemmatization. Vectorization techniques may include the Term Frequency-Inverse Document Frequency (TF-IDF) model (e.g., [20]) and Doc2Vec [21]. Prediction may include Linear Support Vector Classification (LinearSVC, e.g., [22]) and Naive Bayes (e.g., [23]) estimators.
2. Perform a k-fold validation for each combination of models in the grid search. This technique randomly subdivides the data into 'k' equally large subsets (or folds). The classifier is constructed 'k' times, with one fold of the data used as the testing set and the remaining folds as the training set. For each constructed classifier, a performance metric is calculated, and the average score is finally computed. The suggested performance metric for this task is the so-called "F1-score", which is the harmonic mean of precision and recall. Precision quantifies what proportion of positive identifications are actually correct; recall quantifies the proportion of actual positives that are identified correctly. As commonly done for classifiers (e.g., [24]), using the F1-score helps reduce the risk of filtering out building permits potentially containing exposure information. After calculating the F1-score of each combination, the model with the best performance (i.e., highest F1-score) is identified.
3. The pipeline and specific hyperparameters which produced the highest F1-score are compared against an (arbitrarily high) acceptability threshold which, in our illustrative example, is set to 85%. If the model identified in step 2 does not surpass the required threshold, another set of permits (e.g., 250 in our illustrative example) is manually labelled, and the process is repeated from step 1 until the threshold is met (Fig. 3). Such an iterative approach is suggested considering that, in general, there is no 'ideal' volume of training data to be labelled, as this depends on the particular language/style, the amount of noise, and the differences in language between a trivial and a non-trivial permit. It should be noted that increasing the tagged dataset too much may actually decrease the model performance by introducing too much noise and causing the model to overfit. In such cases, no further training data is suggested.
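Steps 1 and 2 can be sketched with scikit-learn (whose estimators the text already references); the permits, labels, and hyperparameter grid below are invented for illustration:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

# Hypothetical, manually-labelled training set: 1 = non-trivial (contains
# exposure insights), 0 = trivial. A real application would label ~250 permits.
texts = [
    "construct five floors of apartments and basement",
    "demolish dwelling and construct three storey residence",
    "excavate site and construct four floors of offices",
    "construct two floors of maisonettes over basement garage",
    "to replace apertures",
    "display of advertisement sign",
    "installation of satellite dish",
    "repainting of existing facade",
] * 3  # repeated only so cross-validation has enough samples
labels = [1, 1, 1, 1, 0, 0, 0, 0] * 3

pipeline = Pipeline([
    ("vectorize", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("predict", LinearSVC()),
])

# Grid search over a few hyperparameters, scored by k-fold (k=3) mean F1.
grid = GridSearchCV(
    pipeline,
    param_grid={"vectorize__ngram_range": [(1, 1), (1, 2)],
                "predict__C": [0.1, 1.0]},
    scoring="f1",
    cv=3,
)
grid.fit(texts, labels)
print(grid.best_params_, round(grid.best_score_, 2))
```

The best mean F1-score is then compared against the acceptability threshold (85% in the illustrative example) to decide whether further labelling is needed.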
With the best performing pipeline selected, the developed classifier is run on the entire corpus and the non-trivial corpus returned is passed to the next phase.

Building permit clustering
The specific exposure insights offered by each permit in the non-trivial corpus are yet to be identified. Building permit clustering is the next critical step to achieve this goal. Unlike classification, clustering (i.e., unsupervised learning) does not require manual tagging and relies on the semantic/topical similarity of text documents to group permits. The process of vectorization (see Appendix A) converts a building permit into a real-valued vector, and clustering is the process by which documents are grouped based on the distances between data points and examined using some similarity metric (e.g., silhouette score [25]). A well-performing clustering algorithm would define separate building permit clusters associated with separate activities/topics (e.g., demolitions; constructions; alterations; extensions). The boundaries between clusters occurring in natural language are unlikely to be clear. For example, it is reasonable to expect that a building permit may describe multiple building interventions, leading to ambiguity as to which cluster it belongs to. NLP clustering is an investigative tool, and the quality of the output should also be manually verified by reading through samples. By definition, however, a sample from a well-defined cluster should be somewhat representative of the remaining permits within that cluster (see Table 4). This facilitates the regex design (Section 2.4), which can be based on reviewing a smaller number of well-clustered permits.
We suggest designing a building permit clustering model according to the following steps.
1. Consider the non-trivial corpus obtained as per Section 2.2. If possible, based on any pre-existing domain knowledge, define a smaller subset representative of the entire non-trivial corpus (e.g., the non-trivial permits within a single district of a city).
2. Construct an NLP pipeline (see Appendix A for detailed definitions) consisting of the previously adopted pre-processing, tokenization, normalization, and vectorization modules. The prediction module should now involve some candidate clustering algorithms, such as K-Means [26] or DBSCAN [27]. As in Section 2.2, an iterative grid search through models and hyperparameters should be conducted. The number of clusters is an input of the grid search, and a reasonable range should be tentatively defined (e.g., between ten and 30, at intervals of ten). The final selection of the number of clusters is refined in the subsequent steps.
3. For each combination in the grid search, calculate a relevant performance metric. The silhouette score, SS = (b − a)/max(a, b), for example, is a metric that depends on the intra-cluster distance a (i.e., the average distance between any two points in a cluster) and the inter-cluster distance b (i.e., the average distance between any two clusters). This score ranges from −1 (clusters are not sufficiently separated and overlap significantly) to 1 (clusters are well separated). An SS value close to 0 suggests the boundaries between clusters are not particularly distinct (this is to be expected in natural language clustering). In summary, the silhouette score measures the compactness of the data points within each cluster as well as the separation between individual clusters.
4. Select the model/hyperparameter combination with the highest silhouette score. Use a dimensionality-reduction technique (e.g., Principal Component Analysis, PCA) to reduce the multidimensional dataset to 2 or 3 dimensions and hence plot it (see Fig. 10a in Section 4). Visualising the clusters provides an indication of the expected clustering performance: well-separated clusters suggest distinct clustering, i.e., a corpus with strong semantic differences. Cluster overlap, on the other hand, suggests a more uniform corpus with nominal content variety.
Given the selected model combination, choosing the number of clusters is a somewhat subjective choice. To facilitate this decision, it is suggested to run the optimal configuration in the grid search considering a more refined range of possible cluster numbers (e.g., in our illustrative example, between three and 30, at intervals of one) and calculate the "inertia" for each trained model. Inertia represents the sum of the squared distances between each data point and the centroid of the cluster it is assigned to. While inertia decreases asymptotically as the number of clusters increases, too many clusters may result in overfitting and a reduction in data interpretability. The so-called elbow method [28] suggests that the optimal number of clusters is the point of inflection in the curve of the inertia plotted against the number of clusters (Fig. 10b). As the number of clusters is increased, individual data points become more concentrated around their respective cluster centroid, i.e., decreasing overall inertia. This implies the forming of more distinct clusters. However, increasing the number of clusters too much will eventually return a nominal decrease in inertia, meaning the clusters are essentially as distinct as they are going to be. Although roughly identifying the elbow is straightforward (e.g., between 15 and 25 in Fig. 10b), the specific selection of the number of clusters remains a subjective choice. However, provided that it lies within the identified range, the final choice generally does not have a significant impact on the results. On the one hand, good-quality clustering may be achieved only if the adopted dataset contains underlying patterns. On the other hand, marginal increases in clustering performance are not pursued within NLP4GED, since the main purpose of clustering is to organise building permits in a useful manner for constructing the exposure-attribute-extracting regex (as described in Section 2.4).
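The silhouette-based model selection and the inertia values used by the elbow method can be sketched with scikit-learn; the toy permits and the cluster-number range below are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical non-trivial permits spanning two broad topics
# (new construction vs. demolition), standing in for a filtered corpus.
permits = [
    "construct five floors of apartments",
    "construct three floors of offices and basement",
    "construct two maisonettes and garage",
    "demolish existing dwelling",
    "demolish dilapidated building and clear site",
    "demolish boundary wall and garage",
] * 4

X = TfidfVectorizer().fit_transform(permits)

scores, inertias = {}, {}
for k in range(2, 6):  # coarse grid over the number of clusters
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    scores[k] = silhouette_score(X, model.labels_)  # cluster separation
    inertias[k] = model.inertia_                    # input to the elbow method

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 2))
```

Plotting `inertias` against `k` and visually locating the inflection point reproduces the elbow method described above.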

Designing regex patterns to extract exposure attributes
A regex is an abstract pattern of text and symbols used to search through a text document and return one or more matches if some part of the text complies with the pattern. Appendix A.2 provides some basic knowledge regarding regex patterns while specifically referring to their deployment on building permits. Although the clusters obtained according to Section 2.3 should contain building permit descriptions with similar content, within-cluster text variations are bound to be present. This section describes a step-by-step methodology to design a set of regex patterns to extract exposure attributes from each permit. Such patterns should be strict enough to capture any common linguistic structure in a cluster, and flexible enough to disregard the abovementioned text variations (often not offering relevant insights). The methodology below should be applied for each cluster.

Step 1: environment setting and text pre-processing
Writing and testing regex can be performed in most programming environments, but we recommend using a publicly available online regex writing tool (e.g., [29]; https://regex101.com/, last accessed June 2023) to take advantage of the intuitive visual interface. Moreover, to facilitate the regex design, it is necessary to perform a light pre-processing of the text. For example, in our illustrative application, the following steps were taken: 1) delete characters which are normally followed by a whitespace (e.g., full stops, commas, brackets); 2) replace special characters which are not normally followed by a white space (e.g., hyphens, back-slashes, forward-slashes) with a white space; 3) lowercase the text.
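The three pre-processing steps can be sketched as a small Python function (the exact character sets are assumptions; a real application should match the punctuation found in its own corpus):

```python
import re

def preprocess(text: str) -> str:
    """Light pre-processing ahead of regex design (sketch of steps 1-3)."""
    text = re.sub(r"[.,()\[\]]", "", text)    # 1) delete chars normally followed by whitespace
    text = re.sub(r"[-\\/]", " ", text)       # 2) replace hyphens/slashes with a space
    text = re.sub(r"\s+", " ", text).strip()  # collapse any doubled whitespace
    return text.lower()                       # 3) lowercase

print(preprocess("To demolish semi-detached dwelling, and construct flats (Class 4)."))
# -> "to demolish semi detached dwelling and construct flats class 4"
```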

Step 2: gather exposure insights from a cluster
An adequately defined cluster should contain semantically and syntactically similar building permits. Leveraging this property, we suggest randomly selecting a small subset of permits from a given cluster (e.g., in our illustrative example, we select 50) and introducing them in the regex writing tool. Based on the same property, we suggest selecting a random entry in the cluster to be analysed based on the scheme in Fig. 4. This involves identifying:
1. The main action verb and primary subject. For example, the text "demolish site and construct five floors of apartments and underlying basements" would result in "demolish" and "apartment", respectively;
2. The main exposure attribute(s) implied by the text. The previous text example refers to a new building, and therefore the text provides the year of construction (extractable from the permit reference number). Moreover, the text indicates that the building is residential, includes basement levels and has a total height of five floors.
By considering all the possible GED4ALL attributes, the process allows the gathering of insights on the exposure attributes included within the cluster. Some other examples of this process are shown in Section 3.4.

Step 3: build conservative regex pattern to avoid false positives
Based on the initial insights gained according to Section 2.4.2, we propose designing an initial, particularly strict regex pattern. Strictness in this context describes a regex which is limited in its scope, such that it is unlikely to unintentionally capture building permits with different topical content. By visually inspecting the cluster subset in the visual regex tool, this initial pattern is designed to guarantee a match for all the permits containing the desired exposure attribute, while not producing any false positives.
To begin, the root of the regex pattern should be designed to capture the main action verb and subject while maintaining their relative position to each other within the string of text (Fig. 5). Considering the example permit description, "demolish site and construct five floors of apartments and underlying basements", the pattern demolish(?:\s\w+){0,6}\sapartment returns a match only if the words "demolish" and "apartment" are separated by at most six words. Using the visual regex tool facilitates immediate checking of the matches produced by this pattern over the entire cluster subset. Modifications to the pattern may then be made until any false match is excluded. If false matches arise, introduce further strictness by using positive and/or negative lookaheads and/or lookbehinds (see Appendix A.2). Positive and negative lookaheads enforce a rule for which some other word(s) or patterns must be present (positive) or absent (negative) in the text for the overall pattern to return a capture. It is possible to exemplify this concept by considering the permit description "demolish site and construct apartments and underlying basements". Consider further the description "demolish canopy and paint apartments", which would return a false match ("demolish" is three words away from "apartments"). We can make the regex stricter by introducing a lookahead asserting that the word "construct" is located somewhere between "demolish" and "apartment". This lookahead would be the pattern (?=.*construct), and when combined with the full pattern results in demolish(?=.*construct)(?:\s\w+){0,6}\sapartment. Once false matches have been removed in the sample subset (using the word proximity concept and positive or negative lookaheads and lookbehinds), it is possible to move to Step 4.
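The strict root pattern with its lookahead can be checked directly in Python's `re` module, using the two example descriptions:

```python
import re

# Root pattern: "demolish" and "apartment" at most six words apart,
# with a lookahead requiring "construct" somewhere ahead of "demolish".
root = re.compile(r"demolish(?=.*construct)(?:\s\w+){0,6}\sapartment")

print(bool(root.search("demolish site and construct apartments and underlying basements")))
# True: "construct" satisfies the lookahead
print(bool(root.search("demolish canopy and paint apartments")))
# False: the lookahead rejects the false match
```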

Step 4: increase the flexibility of the pattern
The regex pattern designed in Step 3 may not return many matches over the cluster. This is due to the pattern's initial strictness, which is indeed beneficial in avoiding false matches. To safely increase the flexibility of the pattern while avoiding false matches, we propose modifying it to include synonyms of the main action verb and subject.
Referring to the example permit description from earlier ("demolish site and construct five floors of apartments and underlying basements"), appropriate synonyms for the words "demolish" and "apartment" could be "remove" and "residence", respectively. Synonyms may be introduced to the pattern by introducing an 'or' operator (i.e., the character |) and the synonym. For example, adding the word "residence" to the considered example pattern can be done as follows: demolish(?=.*construct)(?:\s\w+){0,6}\s(apartment|residence).
While the choice of synonyms can be performed using a thesaurus, we suggest interactively identifying those synonyms using the visual regex tool and the considered cluster subset. This is done by introducing the 'or' operator without adding a synonym (for example, (demolish|)(?=.*construct)(?:\s\w+){0,6}\s(apartment|)) and running the pattern on the cluster subset. This highlights matching permit descriptions showing any word instead of "demolish" or "apartment". By quickly examining the highlighted permit descriptions, one may identify the relevant synonyms of "demolish", "construct" and "apartment" that need to be included in the pattern. Considering the above example building permit, the resulting pattern may be: (demolish|dismantle)(?=.*construct|.*erect|.*build)(?:\s\w+){0,6}\s(apartment|residence). The regex pattern resulting from this step allows capturing some exposure attributes. We define these particular exposure attributes as the "first-pass" logical conclusions (i.e., the meaning or implication of a match, Fig. 6); they are those conclusions made when designing the strict root regex only. The first-pass logical conclusions for the example building permit are that the building occupancy type is "residential" and that the permit reference number contains its year of construction.
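A minimal check of the synonym-extended pattern in Python's `re` module (the third description is invented to show a non-match):

```python
import re

# Root pattern extended with synonyms via the 'or' operator.
flexible = re.compile(
    r"(demolish|dismantle)(?=.*construct|.*erect|.*build)"
    r"(?:\s\w+){0,6}\s(apartment|residence)"
)

for text in [
    "demolish site and construct five floors of apartments",  # match
    "dismantle garage and erect residence",                   # match via synonyms
    "demolish boundary wall",                                 # no match
]:
    print(bool(flexible.search(text)))
```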

Step 5: include conditional secondary regex patterns and related logic
Permit descriptions may include additional exposure parameters not captured by the strict regex pattern (e.g., height or material). For example, the permit description "demolish site and construct five floors of apartments and underlying basements" also contains information related to the number of floors and the presence of basements. To capture such additional exposure information, which may or may not be included in permit descriptions, we propose using a set of simpler secondary regex patterns to allow for "second-pass" logical conclusions. Since we propose to run those patterns conditional on a first-pass match, the risk of a false positive is considerably lower, and therefore the secondary patterns do not require the same level of strictness as the root regex.
The first step in designing the secondary regex patterns is to inspect, using the visual regex tool, the permit descriptions in the cluster subset that match the first-pass regex. This sheds light on additional exposure attributes which may be available in the text (considering the GED4ALL attributes not captured by the first-pass logic). Fig. 7 shows a schema to systematically consider (and take note of) potential secondary regex patterns to include. Conditional on a first-pass match, there now exist second-pass logical conclusions which provide insight on, in this particular example, the number of storeys and the presence of basements. Considering the permit description "demolish site and construct five floors of apartments and underlying basements" as an example, we annotate simpler secondary regex patterns capable of capturing descriptions of the building height of the form "… construct <numeral word> floor" and the presence of the word "basement". The resulting regex patterns are: (\w+) floor and basement.
For both the first-pass and second-pass regex, small pieces of functional logic code will be required to manipulate the match and assign it to the building permit. For the example first-pass conclusions given earlier, the code will need to assign any matched building permit's reference number (which includes the permit year) to the "Year of Construction" attribute, and automatically assign the "residential" occupancy type. For the second pass, the word numeral must be parsed, converted to an integer and assigned to the "Number of Floors" attribute value.
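A sketch of the second-pass patterns and their functional logic, assuming a hypothetical word-to-integer lookup (real corpora would need a fuller numeral parser):

```python
import re

# Hypothetical numeral lookup; extend as needed for a real corpus.
WORD_TO_INT = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}

floors_re = re.compile(r"construct (\w+) floor")  # "... construct <numeral> floor(s)"
basement_re = re.compile(r"basement")

def second_pass(text: str) -> dict:
    """Second-pass logic, run conditionally on a first-pass match."""
    attrs = {}
    m = floors_re.search(text)
    if m and m.group(1) in WORD_TO_INT:
        attrs["number_of_floors"] = WORD_TO_INT[m.group(1)]  # parse numeral word
    if basement_re.search(text):
        attrs["basement"] = True
    return attrs

print(second_pass("demolish site and construct five floors of apartments and underlying basements"))
# -> {'number_of_floors': 5, 'basement': True}
```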

Step 6: compile a regex library and apply it to the corpus
The above steps are repeated for each cluster to compile a library of regex patterns. By following this methodology, a compiled library of patterns should be able to cover whichever GED4ALL exposure attributes are extractable from the corpus (see Section 3.3 for the one compiled for the illustrative application). Although the methodology presents a standardised process, designing regex is highly dependent on the linguistic nature of the corpus. After the library is complete, the first- and second-pass logic for each regex is programmatically run over the entire corpus (Fig. 8). Building permits captured by the first-pass regex have the respective first-pass exposure attribute assigned to them. The subset captured by the regex is then exposed to the second-pass regex, which assigns additional exposure attributes to that building permit if they are present. Matches of the first regex are then removed from the corpus, and the process is repeated iteratively until the full regex suite is exhausted.
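The iterative sweep of Fig. 8 can be sketched as follows. The library structure (pairs of a first-pass regex and named second-pass patterns) and the toy corpus are illustrative assumptions, not the published implementation.

```python
import re

# Sketch of the iterative first-/second-pass sweep over the corpus (Fig. 8).
def run_library(corpus, library):
    results = {}                 # permit reference -> extracted attributes
    remaining = dict(corpus)     # permit reference -> description
    for first_pass, second_passes in library:
        matched = [ref for ref, text in remaining.items()
                   if first_pass.search(text)]
        for ref in matched:
            attrs = {"first_pass": first_pass.pattern}
            for name, pattern in second_passes.items():
                m = pattern.search(remaining[ref])
                if m:
                    attrs[name] = m.group(0)
            results[ref] = attrs
            del remaining[ref]   # matches are removed before the next pattern
    return results

library = [(re.compile(r"construct .*apartment"),
            {"basement": re.compile(r"basement")})]
corpus = {"PA/1/18": "construct five floors of apartments and basements",
          "PA/2/19": "change of use to office"}
results = run_library(corpus, library)
print(results)
```

Removing matched permits before the next first-pass pattern runs ensures each permit is captured by at most one root regex.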

Initial corpus analysis
This section describes the illustrative application of NLP4GED to a dataset of real building permits in Malta. The corpus consists of 100,989 building permits, with a date range between 2005 and 2021. The building permits are pre-divided into 68 subsets, each representing one of the Maltese local councils (Fig. 9a). Each council includes 1485 permits on average, while the median number of permits per council is 1200. The number of building permits submitted each year is variable (see Fig. 9b) and shows a minimum of 1896 (in 2011) and a maximum of 10,845 (in 2018). The mean word count per permit shows an increasing trend over time: 12 words in 2005 to 22 words after 2017. However, this increase is not deemed significant enough to require a subdivision of the corpus into smaller corpora. Running searches for the suggested exposure-related keywords revealed a number of permits which could contribute to three GED4ALL exposure attributes (see Table 5): 1) year of construction, 2) occupancy, and 3) height (including presence of basements).

Noise-filtering classifier
The building permits of the local council of Qormi (see Fig. 9a) are selected for the construction of the noise-filtering classifier. This local council is selected through prior domain knowledge as being reasonably representative of other local councils. There are 3563 building permits in this subset, placing this council above the third quartile of the data. Initially, 500 building permits of this subset (14%) are tagged into two categories, trivial and non-trivial, depending on the exposure-attribute insights present in the text.
The classifier pipeline, and its respective grid search, is built using the pipeline module of the Python package scikit-learn [30]. Pre-processing, normalization, and tokenization are grouped together and defined in three possible settings. In the first combination ("basic"), sentences are lowercased, punctuation is removed, no normalization techniques are applied, and sentences are tokenized using whitespace characters as separators. The second ("medium") includes the features of the basic configuration as well as the removal of stop-words (see Appendix A). Finally, the third level ("high") also includes the normalization technique of word stemming (see Appendix A). Two common vectorization candidates are selected: TF-IDF and Doc2Vec (see Section 2.2 and Appendix A.1), and a range of hyperparameters for both models is included in the grid search (Table 6a). The two candidate prediction models are the LinearSVC and Random Forest. Table 6b reports the candidate hyperparameters for such models. The grid search includes 1092 combinations of normaliser/vectorizer/predictive model and their respective candidate hyperparameters.
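A minimal sketch of such a pipeline and grid search using scikit-learn is shown below. The toy permit texts and labels, and the heavily reduced parameter grid, are illustrative assumptions (the actual search spans 1092 combinations).

```python
# Minimal sketch of the classification grid search using scikit-learn;
# the data and grid are toy stand-ins for the actual search.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer(lowercase=True)),
    ("classifier", LinearSVC(dual=True)),
])

param_grid = {
    "vectorizer__max_features": [128, 256],
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "classifier__C": [0.1, 1.0],
}

texts = ["construct five floors of apartments", "to sanction works",
         "demolish and construct dwelling", "minor internal alterations",
         "construct basement garage", "change of use to office",
         "erect three storey residence", "repair existing facade"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]  # 1 = non-trivial, 0 = trivial

# F1-score is measured for every hyperparameter combination via cross
# validation, and the best-performing configuration is retained.
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=2)
search.fit(texts, labels)
print(search.best_params_)
```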
For each combination, an eight-fold cross validation is performed (the testing fold is composed of 136 building permits). The best-performing combination for this initial n=500 dataset provides an F1-score of 82.9%. The entire process is repeated iteratively, adding 250 additional tagged building permits in each iteration (see Fig. 3 in Section 2.2). The F1-score of the best-performing combinations gradually increases and peaks at 88.1% for a dataset of n=1250 permits. Adding a further 250 permits leads to a drop in accuracy, suggesting an overfitting of the model. For this reason, the model with n=1250 permits is considered in the next steps of NLP4GED. The hyperparameter configuration of this model is the following: pre-processing/normalization/tokenization='Basic'; vectorization: TF-IDF, max features=256, n-gram range=(1,1); predictor: LinearSVC, Dual Optimization=True (see scikit-learn documentation for hyperparameter definitions).

Clustering
The clustering phase of the methodology is conducted using the building permits of the Zebbug council, since it has an approximately equal size and representation as Qormi (3139 building permits). First, the subset is passed to the classifier, identifying 895 non-trivial documents (i.e., 71% of the documents are filtered out as trivial). The optimal clustering configuration is obtained with the same NLP pipeline and grid search assumptions adopted for classification (Section 2.2), although K-means is selected as the candidate clustering algorithm in the prediction module (tentatively considering k=10, 20, 30 for the candidate number of clusters). The silhouette score is the initial performance metric used to select models and respective hyperparameters. The highest silhouette score is produced by the grid search combination with the following configuration: pre-processing/normalization/tokenization='medium'; vectorization: Doc2Vec, size=256, alpha=0.025, epochs=10; predictor: k (number of clusters)=10. The highest silhouette score is equal to 0.33, suggesting that cluster overlaps are present, possibly because most permit descriptions contain similar portions of text, in turn due to a somewhat standardised technical writing style.
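Scoring candidate cluster counts with the silhouette coefficient can be sketched as follows. For brevity, TF-IDF stands in for the Doc2Vec vectorizer actually selected, and the toy permit texts are assumptions.

```python
# Sketch of silhouette-based selection of the number of clusters;
# TF-IDF is used here in place of Doc2Vec for simplicity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

permits = ["construct terraced house", "construct dwelling with pool",
           "demolish and construct garage", "alterations to townhouse",
           "extension at first floor", "restoration of facade",
           "construct apartments and basement", "extension to dwelling"]

X = TfidfVectorizer().fit_transform(permits)

scores = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Silhouette lies in [-1, 1]; higher means more compact, distinct clusters.
    scores[k] = silhouette_score(X, labels)
print(scores)
```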
With the optimal configuration of the NLP pipeline, we adopted a finer definition of the number of clusters (i.e., between three and 75) and obtained the K-means inertia against an increasing number of clusters (Fig. 10a). According to the elbow method (Section 2.3),

Table 7
Sample building permits and proposed cluster topics associated with selected clusters.

a reasonable choice for the number of clusters lies between 15 and 25. The final (subjective) choice of 22 clusters is based on reading random samples from the clusters of the candidate clustering models, seeking a reasonable compromise between coherence among the clusters and data interpretability. Table 7 shows some examples of such random samples, showing that the dataset is somewhat topically homogenous, with most building permits describing construction or alteration interventions, albeit with dissimilar syntax. Finally, Fig. 10b shows the visualisation of the selected clustering model, considering the first two principal components. This confirms the above interpretation, since the data is subdivided into bands, suggesting cluster overlaps.
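The elbow computation behind this choice can be sketched as follows (toy corpus and TF-IDF features are illustrative assumptions): inertia is recorded for an increasing number of clusters, and a reasonable k lies where inertia stops dropping sharply.

```python
# Elbow-method sketch: K-means inertia for an increasing number of clusters.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

permits = ["construct terraced house", "construct dwelling",
           "demolish and construct garage", "alterations to townhouse",
           "extension at first floor", "restoration of facade"]

X = TfidfVectorizer().fit_transform(permits)

# Inertia (within-cluster sum of squares) is non-increasing as k grows;
# the "elbow" in its curve suggests a reasonable number of clusters.
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in (2, 3, 4, 5)]
print(inertias)
```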

Regex development
According to the procedure in Section 2.4, we selected a random building permit from each cluster, identifying the main action verb and corresponding subject. The regex patterns are initially built to be strict, using word proximity and positive and/or negative lookaheads and lookbehinds. Synonyms of the main verb and subject are then introduced, and finally the second-pass logic is identified. A set of 74 regex patterns is constructed and made available in the GitHub repository (see nlp4ged/regex/regex_list.csv), while a sample of the patterns is shown in Table 8. The table also includes examples of building permit descriptions matching the shown regex patterns, as well as the exposure attributes captured with the first- and second-pass logics. By closely matching the specific semantic patterns in the corpus, the produced regex patterns allowed the extraction of the following three exposure attributes: 1) building height (above and below ground), 2) date of construction or retrofit, and 3) occupancy type. This result confirms the effectiveness of the preliminary corpus search using simple keywords (Section 2.1) as a simple tool to familiarise oneself with the corpus.

Application of the regex patterns and result visualisation
The final step of NLP4GED refers to the entire building-permit corpus. The defined noise-filtering classifier is run on the full 100,989 permit descriptions, identifying 31,138 non-trivial cases. Then, the available 74 regex patterns are applied to the non-trivial corpus, identifying 23,076 matches (74% of the denoised corpus). Note that the clustering step is not required at this stage, as it is only instrumental for writing the regex patterns.
While the most-effective five regex patterns are responsible for 47% of the total captures (see Table 9a), it is important to note that defining an extensive suite of patterns is fundamental to maximise the amount of exposure data extracted from the corpus (as the 69 remaining patterns are responsible for 53% of the captures). The exposure attributes extracted are the following: 1) building heights of 1953 buildings; 2) the year of construction of 15,091 buildings; 3) the year of retrofit of 7985 buildings; 4) the presence of 2577 basements; 5) the occupancy type of 16,729 buildings. These results are tabulated in Table 9b.
The Malta planning authority offers and maintains a public GIS containing building information. Using stereoscopic aerial photography, building boundaries are digitised as individual blocks. Furthermore, during the planning process, architects are required to set out the footprint of the building (or site) they are modifying and submit it on an official base map, thus increasing the reliability of this tool. The building boundaries are then digitally added to the Malta planning authority's GIS system as vector objects. Each vector object contains several attributes, the most important being the building permit's unique reference number.
Given the availability of this data, the results obtained using NLP4GED can also be visualised. To do so, the exposure attributes shown in Table 9 are assigned to the corresponding building block using the permit reference number as a guide. Fig. 11 exemplifies the different phases of NLP4GED, including the initial case study area (a), the available building permit corpus (b), the result of applying the noise-filtering classifier (c), and the identification of matches of the different regex patterns (d). Finally, Fig. 11e shows the allocated year of construction (or retrofit) for the different buildings of a small area. Fig. 12 shows further examples of captured building attributes, such as the occupancy and the building height. Mapping exposure attributes geographically demonstrates the potential benefit of using building permits to augment exposure models. Moreover, since NLP4GED provides building-by-building data, it proves to be an effective tool to increase the refinement of the most-common exposure models, which rely on different data aggregation techniques.

Conclusion
With the steady global shift towards e-government, data related to building permit applications has become more readily available. In this study, we have proposed a methodology to use the text description of building permits to enhance natural hazard exposure modelling: NLP4GED (Natural Language Processing for the Global Exposure Database). The methodology first involves using a classifier to filter permits into two classes: non-trivial permits, which contain exposure attributes, and trivial permits, which do not. The second step of NLP4GED involves using a clustering algorithm on the non-trivial corpus of building permits to group them by linguistic similarity. Clustering facilitates the definition of regular expressions (or regex), which are used in the subsequent step of the methodology to identify specific exposure attributes from the different permits (e.g., year of construction, height, occupancy). Finally, a set of simple logical conclusions is attached to each regex to extract the relevant exposure attributes. Provided that a corpus of building permit descriptions is available in digital form, we have provided a general machine learning-aided methodology to extract building-by-building exposure features, which may then be concatenated into one building-specific taxonomy string (defined according to the GED4ALL taxonomy). A Python-based code to apply this methodology is openly available (https://github.com/justinschembri/nlp4ged, last accessed: August 2023).
To demonstrate the effectiveness of NLP4GED, we successfully extracted exposure attributes (i.e., year of construction, building height, and occupancy) for 23,076 buildings pertaining to an unstructured real-world dataset of 100,989 building permits in Malta.
The key limitations inherent to this data-driven methodology begin, quite obviously, with data absence. It is plausible that some buildings (most likely the oldest and most vulnerable ones) may not be covered by a building permit. Moreover, building permits not covered by a digitisation process cannot be adopted in NLP4GED. A further limitation refers to data quality. Indeed, it is unlikely that a building permit description contains all GED4ALL exposure attributes, as most of them are not relevant for a planning process. In our case study, for example, no building permits mentioned the lateral load-resisting system, as this information is not relevant for the planning regulations in Malta.
From a practical point of view, any exposure attribute captured using NLP4GED is effectively obtained using a set of deterministic regex patterns, and therefore no uncertainty propagation exercise can be carried out when using such attributes within a risk model. Moreover, false positive regex captures would generate errors in risk models, and therefore it is paramount to design strict regex patterns, such that false positives are minimised. On the other hand, an excess of false negatives would create an information loss rather than uncertainties, and is therefore less concerning in terms of error propagation in a risk model. The adopted machine learning models, although affected by errors, are only instrumental to the design of the regex patterns. A low accuracy of the noise-filtering classifier would lead to a slightly more contaminated dataset being fed to the clustering model. Most likely, however, the clustering algorithm itself would eventually group trivial permits together, effectively generating a cluster that one can discard entirely. Nonetheless, a lower accuracy of the classifier may generate data loss, excluding potentially valuable datapoints useful for designing the regex patterns.
The text content of building permits shows promise as an additional data source to enhance natural hazard exposure modelling, and it can complement the more-generally adopted census data and/or satellite data. Apart from facilitating the development of building-by-building exposure models, another possible use case for NLP4GED is the periodical update of existing exposure models as building permits are approved, virtually in real time. Furthermore, the methodology may contribute to forecasting exposure models by providing information on the rate of buildings being demolished and rebuilt, which is generally lacking in current exposure forecasting models.

Vectorizers and classifiers/clusterers also have their own internal sets of model hyperparameters. The function of a hyperparameter is model dependent. A common hyperparameter is the maximum number of features (i.e., the length of the feature vector) that can be created. For example, setting the max features to 256 in the same TF-IDF model would allow only the most frequent 256 words to appear in the vocabulary. This improves performance and reduces noise, but may (especially with larger, more diverse datasets) reduce resolution. Other hyperparameters exist: the TF-IDF vectorizer, for instance, includes the n-gram range hyperparameter, represented by a lower-bound and an upper-bound value (e.g., (1,3)). This particular hyperparameter allows the formation of n-grams (i.e., sequences of words) from the text, of a minimum length of 1 and a maximum length of 3. This allows repeated duos (bi-grams) or trios (tri-grams) of words, as opposed to just single words, when constructing the vector features, provided they appear frequently enough to be within the maximum feature length.
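The effect of these two hyperparameters can be seen directly in a small sketch using scikit-learn's TfidfVectorizer (the toy documents are an assumption for illustration):

```python
# Illustrating max_features (vocabulary cap) and ngram_range (admitting
# bi-grams and tri-grams alongside single words) in scikit-learn's TF-IDF.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["construct new building", "construct new dwelling",
        "construct new building and garage"]

# Only the 2 most frequent terms across the corpus are kept.
capped = TfidfVectorizer(max_features=2).fit(docs)
print(sorted(capped.vocabulary_))

# With ngram_range=(1, 3), repeated word sequences become features too.
ngrams = TfidfVectorizer(ngram_range=(1, 3)).fit(docs)
print("construct new building" in ngrams.vocabulary_)  # tri-gram feature
```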
The interaction between the dataset, the pre-processing, normalization, tokenization, models and their hyperparameter values, and the output results is somewhat unpredictable. To this end, it is quite common to experiment with hyperparameters. One method of experimentation is the organised and iterative grid search. A grid search is conducted by setting a range, or grid, of possible values for each of the hyperparameters and iterating through every single combination. A performance metric is selected and measured at each iteration, and the hyperparameters for the best-performing model are returned. Classification metrics are fairly straightforward, since tagged test data exist, such as the F1-score introduced earlier. Clustering metrics are somewhat more complex, and there is no one-size-fits-all metric. However, within the scope of this work, we are interested in cluster compactness and distinctiveness, which makes the silhouette score a reasonable candidate.

A.2 Regular Expressions -Regex
A regex may be as concise as one word (analogous to the 'find' function in a word processor). For example, the regex composed exclusively of the word building will return a match 'object' if the string (text) contains the word 'building', as many times as it appears in the string. The match object will also simply be the word: 'building'. In the absence of a match, None is returned. If we wish to capture, say, the word preceding the word 'building', this can be represented abstractly through the pattern \w+ building. The \w+ component in this pattern is a set of symbols that is interpreted by the regex engine as 'any given word'. The regex pattern may be explained as: "match the word building, and the word that precedes it". Running this regex pattern on the phrase "to construct new building" would return the match "new building".
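These two minimal patterns behave as described when run with Python's re module:

```python
import re

# The two minimal patterns discussed above.
m1 = re.search(r"building", "to construct new building")
m2 = re.search(r"\w+ building", "to construct new building")
m3 = re.search(r"building", "change of use")

print(m1.group())  # 'building'
print(m2.group())  # 'new building'
print(m3)          # None
```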
A significant part of regex knowledge consists of understanding the various tools of abstraction available. Referring back to the pattern \w+ building: here, the \w+ component is not one single abstraction, but rather a composition of three. Firstly, the \ character is a regex 'escape character': its function is to detach the default meaning of the character it precedes, in this case, the w. The character w, by default, literally means the letter w. When combined with the escape character (i.e., \w), the literal meaning is escaped, and w is elevated to a metacharacter meaning "any word character". The + character at the end of the construction is, by default, not interpreted as the literal + character; in this case, applying an escape character to it would actually restore the literal meaning, i.e., \+ means the literal character +. By default, + is an important regex abstraction known as a 'quantifier'. Quantifiers specify the number of times that the preceding pattern should be matched; in this case, + means the previous pattern should be matched one or more (unlimited) times. When the composition is put together, \w+ becomes an instruction to the regex engine: "capture the previous pattern (i.e., any word character) one or more times". For the full set of metacharacters, we refer to the Python re library ("re - Regular expression operations," n.d.). The + quantifier would continue to make matches an unlimited number of times, provided at least one match is made, i.e.
from one to infinite times, as many as needed for the pattern to continue. The * quantifier is similar, except that the preceding character's absence does not break the pattern, i.e., match the previous character zero or more times. The distinction is slight: take, for example, the pattern construct a+building, which would not return a match for the phrase "construct building", as the pattern expects the character "a" to be present (i.e., "construct a building"). Replacing the + character with a * character makes the "a" an optional character, meaning the pattern construct a*building would return matches for both the phrases "construct a building" and "construct building".
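The distinction between + and * can be checked directly. Note that, unlike the simplified patterns in the text, the sketch below makes the whitespace explicit (quantifying the group "a " rather than the bare character) so the patterns actually match the example phrases:

```python
import re

# '+' needs at least one occurrence of the preceding token; '*' allows zero.
plus = re.compile(r"construct (a )+building")
star = re.compile(r"construct (a )*building")

print(plus.search("construct building"))           # None: '+' needs the 'a '
print(star.search("construct building").group())   # 'construct building'
print(star.search("construct a building").group()) # 'construct a building'
```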
Quantifiers also exist which limit the minimum and maximum number of times the preceding characters or groups are matched, represented by the symbols {m,n}. This pattern matches the token preceding it between m and n times. This quantifier is used in this work to control word proximity. Consider the regex pattern (construct).*(building). This pattern would return matches for both 'to construct a building' and 'to construct a boundary wall and minor modifications to building' (see Fig. 14a). The second match here would be a 'false positive', and this can be overcome by setting a limit to the distance between the words, for example by modifying the pattern as follows: construct(\s\w+){1,3} building. Now, the word 'building' must lie between one and three words from the word 'construct' for the pattern to return a match (see Fig. 14b). The final regex tools to be discussed are lookaheads and lookbehinds. A lookahead, represented by the metacharacters ?=, is a point in the pattern where the engine stops searching the phrase, looks ahead to the rest of the document, and checks whether some given word lies ahead before proceeding. If the word does exist, the engine proceeds with the rest of the pattern; if not, the engine stops and restarts. Take, for example, the pattern construct(?=.*new).*building: here the word new is bound by a lookahead, and therefore must exist after the word 'construct' for the pattern to be satisfied (see Fig. 15). This pattern would therefore return a match for the phrase "construct a new building" but not for "construction of a five floor building". A lookbehind is a similar metacharacter, except that the pattern looks backwards instead of forwards. Both a lookahead and a lookbehind can be made negative, meaning, for example, that it is the absence of the word new that satisfies the pattern.
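Both the proximity quantifier and the lookahead behave as described when run with Python's re module:

```python
import re

# Word-proximity quantifier: 'building' must lie within 1-3 words of 'construct'.
near = re.compile(r"construct(\s\w+){1,3} building")
print(near.search("to construct a building").group())
print(near.search("to construct a boundary wall and minor modifications to building"))  # None

# Lookahead: 'new' must appear somewhere after 'construct'.
ahead = re.compile(r"construct(?=.*new).*building")
print(ahead.search("construct a new building").group())
print(ahead.search("construction of a five floor building"))  # None
```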
Finally, it is useful to imagine a regex engine as a trailing search that proceeds from left to right, attempting to match each component of its pattern until the full pattern can be satisfied (i.e., a match made), but continuing to trail to the end of the string even after the pattern has been satisfied once, allowing a regex to make multiple matches. To this end, the metacharacter pattern \w+ (i.e., any word) run on the four-word phrase "construct a new building" would actually return four separate matches, one for each word.
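This trailing behaviour is exposed directly by re.findall, which collects every non-overlapping match in the string:

```python
import re

# The engine keeps trailing after each match, so \w+ yields one match per word.
matches = re.findall(r"\w+", "construct a new building")
print(matches)       # ['construct', 'a', 'new', 'building']
print(len(matches))  # 4
```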

Fig. 2 .
Fig. 2. Mean building permit word count for the corpus in the illustrative application (Section 3).

Fig. 3 .
Fig. 3. Noise-filtering classification: iterative determination of the number of manually-labelled (i.e., training and test) building permits (F1-Score = 2 × (Precision × Recall) / (Precision + Recall)). Plot based on the results of the illustrative application (Section 3).

Fig. 5 .
Fig. 5. Schema to systematically consider the regex tools available when constructing the strict root regex.Example building permit description: "demolish site and construct five floors of apartments and underlying basements".

Fig. 4 .
Fig. 4. Schema for gathering exposure insights from building permit sampled from a cluster.Example building permit description: "demolish site and construct five floors of apartments and underlying basements".

Fig. 9 .
Fig. 9. Distribution of building permits submitted in the illustrative corpus: a) submissions by local council; b) submissions by year.

Fig. 8 .
Fig. 8. General logic for using the designed regex library on the corpus.

Fig. 10 .
Fig. 10. a) Elbow method to select the number of clusters; b) Principal Component Analysis-based dimensionality reduction for the selected clustering (k = 22 clusters).

Fig. 11 .
Fig. 11. a) GIS representation of the large Maltese town of St. Julian's; b) all building permits in the town; c) non-trivial building permits; d) buildings with one or more captured exposure attributes; e) example buildings with year of construction/retrofit identified.

Fig. 12 .
Fig. 12. GIS representations showing example buildings with captured exposure attributes: a) occupancy (here residential) in cyan and b) building height (storeys) in shades of red. Buildings with a permit associated with them are hatched in blue. (For interpretation of the references to colour in this figure legend, the reader is referred to the Web version of this article.)

Fig. 13 .
Fig. 13. Basic regex abstraction symbols: the \ character removes the default meaning of the character it precedes.

Fig. 15 .
Fig. 15. Using lookaheads to check for the presence of words or phrases in a string.

Table 1
Example insights offered by building permit descriptions (extracted from the illustrative dataset in Section 3). Note: the last 2 digits of the 'Permit Reference' correspond to the year of the application, and it is assumed that the year of construction is equal to the permit date.

Table 2
Sample statistics/features of the corpus of Maltese building permits.

Table 3
Suggested keywords for initial word/phrase search through a corpus.

Table 4
Sample permits from two clusters identified in the illustrative application.
To construct a terraced house and underlying garage 1
To construct a two-storey dwelling with swimming pool. 2
Demolition of existing building and construction of garage and terraced house with pool. 1 0
Restoration of facade, internal alterations to existing residence and construction of extension at roof level. 1
Proposed alterations to existing townhouse and extension at first floor 2
Extension and alterations to dwelling

Table 6
Grid search hyperparameters: a) vectorizer methods: b) classifier models.See sci-kit learn documentation for hyperparameters definitions.

Table 5
Keywords used in initial corpus investigation and example results showing exposure-attribute insights.

Table 9
(a) Contribution of the top five capturing regexes. (b) Exposure attributes captured.