CC BY 4.0 license, Open Access. Published by De Gruyter Open Access, August 11, 2023

Transforming text into knowledge graph: Extracting and structuring information from spatial development plans

Iwona Kaczmarek
From the journal Open Geosciences

Abstract

This article explores how natural language processing techniques can be applied to extract information from spatial planning documents and how this information can be represented in a knowledge graph. The proposed method uses named entity recognition to extract relevant information from text and structure it into labels and corresponding values. The extracted information is represented in the form of a knowledge graph, which allows for better understanding and management of complex relationships between different elements in spatial planning documents. For this purpose, a dedicated ontology was developed. The research demonstrates that the proposed method achieves good results with high precision, recall, and F1 scores for all entity types, with particularly remarkable results for biologically active area predictions. The practical application of this method in spatial planning can contribute to improving decision-making processes and streamlining collaboration between different entities involved in spatial planning.

1 Introduction

Ideas such as open data and open government are driving the digitalization of data in Europe, including data generated during the spatial planning process. Access to information on planned spatial policy is a fundamental pillar of civil society, and access to information contained in spatial plans is an essential component of the entire planning process, including public participation. In order to shape spatial policy, authorities at all levels require up-to-date information related to land management to reflect the dynamics of land use in real time [1]. Spatial development plans, along with reference data, such as the Land and Building Registry, are some of the most important pieces of information that should be made available to citizens.

The digitalization of spatial planning is not a new concept, but its development toward a full digital plan is an ongoing issue [2]. Digitalization is aimed at facilitating data exchange, effective analysis, supporting spatial monitoring, and increasing the transparency of spatial planning. The use of digital plan data today is increasingly embedded in planning practice, and these data are becoming part of wider integrated digital management [3].

During the past 10 years, many European countries have made significant strides in building nationwide digital plan registries and digitizing urban planning processes. Digitalization involves digitizing plans, that is, converting them from analog to digital form. Digitization, in turn, is often accompanied by the standardization and harmonization of data, whose main goal is to make planning documents exchangeable and comparable across the country.

The scope of a document, such as a spatial development plan, at the local level varies depending on the planning system in force in a given country. However, by definition, the purpose of developing a plan is to designate development zones and conditions, including the restrictions applicable to each zone [4]. A spatial development plan as a document includes a plan drawing and an integrally related plan text. Digital plan data, on the other hand, can be understood as a special form of geodata that covers a specific area with its defined attributes [3].

Depending on the level of digitization, plan data can be presented either as a georeferenced scanned image or as spatial data. The plans depicted as spatial data consist of spatial objects and their associated attributes. One of the fundamental spatial objects in a plan drawing is the boundary of the plan area and the planning zones, for which numerous findings are formulated and documented in the plan’s textual portion. These arrangements, related to the regulation of future land use, can vary depending on the specific conditions and characteristics of the area. Essential information formulated for planning zones relates primarily to their intended purposes and the manner of development. At the same time, urban indicators and parameters are determined based on the specific nature of the planning area. This information is typically included in the plan’s text, often in the form of detailed provisions for individual planning zones.

In cases where an entity maintains a database of plans as spatial data, the standard for recording plans in the descriptive part (attribute table) may contain the aforementioned information. However, extracting this information manually from the textual part of the plans is time-consuming and practically infeasible on a larger scale. This challenge arises due to the heterogeneity of spatial data structures and the code lists used, such as land use classifications.

Currently, the most advanced form of data representation for plans is geodata. However, data structuring can also take a more accessible and easily scalable form, such as a knowledge graph. A knowledge graph primarily represents real-world entities and their interrelations, organized in a graph structure, defines potential classes and relations within a schema, enables the potential interconnection of arbitrary entities, and spans various topical domains [5]. Knowledge graphs can relate and describe information from many domains. In the field of urban planning, the construction of a knowledge graph is the subject of the Cities Knowledge Graph project [6], which aims to support urban planning, among other things, by representing knowledge in the form of knowledge graphs.

The aim of this article is to present research focused on exploring the automatic extraction of information from spatial development plans and the construction of a comprehensive knowledge graph. In order to achieve this objective, a method using machine learning techniques, specifically natural language processing (NLP), was developed for the extraction of information from spatial development plans. A key component of the extraction process involved creating a dedicated named entity recognition (NER) model to identify entities or subjects of interest within the textual fragments of the plans. Essential information, such as future land use and development indicators, including maximum building intensity, minimum biologically active surface index, maximum building area, and maximum building height, was selected from the plans.

The method was tested for local spatial development plans in Poland, where the diversity of planning documents is significant, in terms of both the formulation of their content and presentation. Currently, there is no uniform standard for representing local plans throughout the country, but intensive work is underway on reforming spatial planning, including digitizing and standardizing local planning documents. In practice, larger cities or highly computerized units have developed their own standards for representing plans, which encompass both the graphical and textual aspects. However, the extent and quality of planning data provided vary depending on various factors, such as whether a standard for representing plans in digital form exists or not, or the level of computerization of the unit providing the data [7].

This article is organized as follows. In Section 2, related works are presented on the importance of data-driven approaches for spatial planning, as well as issues related to information extraction methods using machine learning. In Section 3, the methods for extracting information from spatial planning documents and the process of creating a knowledge graph are described. In Section 4, the evaluation results of the NER model developed in the first stage, the entire information extraction process, the ontology, and the knowledge graph are presented, along with a discussion. In Section 5, a use case for enriching digital spatial data is presented. In Section 6, a summary and future work are provided.

2 Related works

2.1 Data-driven urbanism

In recent decades, cities and urban planning have undergone significant digitalization, becoming increasingly complex and data-rich [8,9]. This evolution is characterized by the integration of modern technologies, which have the potential to not only alter the practice of spatial planning, but also change the way planners engage society in the planning process [10,11,12].

Data-driven urbanism is central to this transformation. Data play an increasingly crucial role in the functioning of contemporary agglomerations. Data-driven smart sustainable cities, in simplified terms, are cities that use modern information and communication technologies, as well as new data sources, to better and more effectively manage and monitor processes and phenomena occurring in cities [13]. Datafication also means that in the urban environment, where there is an immense amount of data, the efficiency of the city depends on having control over data management and analysis, as well as awareness of the potential provided by data [14].

Urbanism based on data reinforces the way we think about the city as a living and evolving organism. It also changes the way we plan and manage urban systems [15,16]. One of the consequences of adopting data-driven urbanism is that urban systems become increasingly interconnected and integrated, making urban areas highly coordinated [17]. The challenge in today’s data-driven urban planning is not just accumulating knowledge, but rather using it in automatic reasoning systems [18]. By integrating human expertise with artificial intelligence and advanced data processing techniques, we can create more effective and efficient planning solutions that adapt to the ever-changing urban landscape.

The availability of high-quality data is essential for the effective management and development of urban areas and forms the backbone of data-driven smart cities. For city authorities, these data are critical to making informed decisions, optimizing urban processes, and ensuring sustainability. Additionally, citizens rely on the accessibility of reliable and comprehensive information to better understand and actively participate in the urban management process. Access to reliable information about land is also crucial for businesses or landowners when making decisions regarding their economic activity, or for authorities when creating public policy in urban areas [19]. Spatial development plans in that case are important as they can impose restrictions and obligations on the land, grant privileges or benefits, and affect the entire investment process.

The importance of the information carried by plans has been recognized by Indrajit et al. [1], who proposed standardizing information on spatial planning and land administration as subsets of land information. They suggested developing a spatial planning package within the existing Land Administration Domain Model standard, which aims to enable the integration of land information from different sources in a consistent way. Spatial development plans must be interoperable for further integration with other data, such as 3D cadastre and 3D spatial plans.

Recent studies emphasize that digitization efforts in spatial planning are increasingly focused on enhancing transparency in planning practices. For example, Hersperger et al. [3] as part of the ESPON DIGIPLAN project [2] investigated the impact of digitalization on planning practice, while highlighting the potential use of deep learning and intelligent systems to support spatial planning, which is expected to become an important part of planning practice. Potts [9] highlights that advances in semantics and artificial intelligence offer planners new opportunities to develop innovative solutions, leading to a paradigm shift in urban planning known as Planning 3.0. This change is characterized by a more interactive, intelligent, and self-organizing approach to urban development [9].

2.2 Information extraction

The role of information extraction in data-driven urbanism is significant, as it offers valuable tools to handle and analyze large amounts of unstructured data produced by cities. The purpose of information extraction is the automatic acquisition, structuring, and interpretation of information from unstructured sources such as text, image, or video. In the context of text analytics, information extraction refers to tasks addressed with NLP methods, which are an important part of artificial intelligence. NLP covers methods for understanding, analyzing, translating, and generating natural language by machines. NER is a subtask of NLP that focuses on identifying specific phrases or words in the text (the so-called entities) and categorizing them into predefined categories such as people, places, and dates [20].

NLP methods have evolved from rule-based methods to statistical methods over time. Their development coincides with the growth of artificial intelligence, including machine learning methods, especially deep learning. A major breakthrough in the field of NLP was the introduction of word embeddings, which are vector-based representations of words that allow for their representation depending on the context in which they occur. One of the most popular word-embedding techniques is word2vec, created by Tomas Mikolov in 2013, which uses a shallow neural network [21]. Among the latest methods gaining increasing interest are those based on transformer architectures, which are effective and efficient in solving many NLP problems, such as text classification and generation.

In light of recent research, an increasing number of researchers are starting to use NLP techniques for analyzing various phenomena within urban environments. One particular data source that has gained significant attention in this regard is social media data. According to Cai [22], social media data is the most commonly used source of data for urban research, representing a new channel of communication between city authorities and citizens. Social media platforms have become virtual spaces where people share their thoughts, experiences, and emotions related to urban environments. In that context, these data are analyzed to study social sentiments, represent the identity of a place, and more [23]. In the case of social data related to cities, information extraction can include identifying places, people, events, emotions, opinions, or trends. The use of NER techniques is related to extracting and categorizing geographic information from textual data such as addresses or place names [24,25]. This is a special case of NER called geotagging, which is part of geoparsing, that is, the process of geotagging and geocoding entities extracted from the text [26]. NER can also be used in this area as part of larger machine learning models as demonstrated in the work by Szczepanek [27].

A different example of using NLP tools, which is related to our research, is through the analysis of building permit documents. Lai and Kontokosta [28] use NLP and topic modeling to explore building permit documents and create a knowledge base of permit changes, as well as to explore the spatiotemporal patterns of construction activities. This example illustrates how NLP can be used to extract valuable information from more formal sources such as permits, which is different from analyzing social media data.

3 Methods

The methods used in this research can be divided into two main areas. The first aspect involves the automatic extraction of information from spatial development plans, with the primary objective of extracting relevant information about indicators and parameters pertinent to the specific area addressed by the plan. The second one focuses on representing extracted information within a knowledge graph. This process entails, among other tasks, the development of a dedicated ontology to suit these requirements.

3.1 Information extraction from the textual part of the spatial development plan

The method for extracting information from spatial development plan texts consists of two stages. The first stage involves developing a dedicated NER model, specifically tailored to address the unique requirements of examining spatial development plans. This model is trained on a specially prepared training data set, enabling it to effectively detect and label phrases or sentences containing pertinent information. In the second stage, specific values are extracted from the sentence fragments identified in the first stage. The outline of the methodology is presented in Figure 1 and described in detail in the following sections.

Figure 1: General process of information extraction from the textual part of spatial development plans.

3.2 Creating a customized NER model

Creating a customized NER model is a multistage process that includes steps such as data acquisition and preprocessing, data annotation, feature extraction, and then training the model using selected machine learning algorithms. For the purpose of creating the NER model, spaCy version 3.5.0 was used. SpaCy [29] provides an open source library for performing basic NLP tasks, such as NER and text classification. It is also possible to train custom models and update existing models with more training data. SpaCy offers pre-trained models, including for the Polish language, which extract basic entities such as time, organization, and place. However, they are not able to detect domain-specific phrases or indicators for spatial planning, which are crucial for achieving the intended goal. Therefore, creating a dedicated NER model is essential to achieve high-quality information extraction from spatial development plan texts.

In this particular research, we decided to create a blank model that contains the basic components necessary for text processing in the Polish language, such as tokenizer, tagger, and parser. The SpaCy library was used to create this model. The pipeline consisted of several steps, including text tokenization, part-of-speech (POS) tagging, parsing, and NER. The first step was text tokenization. It allows splitting the text into tokens, which are single text units such as words and punctuation marks. This tokenizer has been specifically adapted to the Polish language, taking into account its unique features, such as the presence of inflections and word variations. POS tagging was then performed to assign the appropriate grammatical category to each token, such as a noun or verb. This was accomplished using the built-in tagger for the Polish language. Parsing was then performed to analyze the sentence structure and determine which tokens were related to each other in phrases. This was done using the built-in parser for the Polish language. Finally, a custom NER model was created to recognize specific named entities within the text.
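Assembling such a pipeline can be sketched with spaCy as follows. This is a minimal sketch: the label names are illustrative rather than the exact identifiers used in the study, and a blank model ships only the Polish tokenizer, with other components added and trained on top.

```python
import spacy

# Blank Polish pipeline: language-specific tokenization rules, no trained weights
nlp = spacy.blank("pl")
ner = nlp.add_pipe("ner")  # custom NER component, to be trained later

# Entity labels mirroring the ones used in this study (illustrative names)
for label in ("FUTURE_LAND_USE", "DEVELOPMENT_INTENSITY",
              "BIOLOGICALLY_ACTIVE_AREA", "BUILD_UP_AREA", "BUILDING_HEIGHT"):
    ner.add_label(label)

# Tokenization splits the text into words and punctuation marks
doc = nlp.make_doc("przeznaczenie: tereny rolnicze")  # "purpose: agricultural areas"
print([t.text for t in doc])  # ['przeznaczenie', ':', 'tereny', 'rolnicze']
```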

A custom NER model was trained using a data set prepared from examples of local spatial development plans in Poland. These plans consist of a textual part and a graphical part (plan drawing), which must be consistent with each other. The textual part specifies general provisions for the entire area covered by the plan and detailed provisions for individual planning areas or zones that are delineated in the plan. These provisions include, among others, land development conditions, such as land use, building shapes, and transportation systems (see Figure 2). The data set consisted of 2,800 documents containing detailed provisions for individual planning areas within local spatial development plans.

Figure 2: Example of a spatial development plan fragment with a description of detailed provisions for the planning area (source: National Geoportal, https://www.geoportal.gov.pl/).

Based on the data set mentioned earlier, the preparation of training data required data labeling and annotation, which involved assigning labels to entities that must be extracted. This task is often one of the most time-consuming stages in the process of information extraction.

For the annotation task, Doccano software was used. Doccano [30] is an open source web-based annotation tool that allows for the annotation of text. It provides a user-friendly interface for annotators to tag text data with predefined labels. The annotation task was carried out by a domain expert. The expert was responsible for labeling the data using predefined entity labels according to their domain knowledge. The use of an expert annotator ensures the high quality and consistency of the annotated data, which is crucial for training an accurate NER model. The entity labels included future land use, maximum development intensity, minimum biologically active surface area index, maximum build-up area, and maximum building height. The number of assigned labels in the entire data set is illustrated in Figure 3.
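Doccano typically exports annotations as JSONL, one record per text with character-offset labels. A hedged sketch of converting one such record (the text and offsets below are hypothetical) into the tuple format used for spaCy training:

```python
import json

# Hypothetical record in Doccano's JSONL export format: text plus
# [start, end, label] character-offset annotations
record = json.loads(
    '{"text": "minimalny udzial powierzchni biologicznie czynnej - 90%", '
    '"label": [[0, 55, "BIOLOGICALLY_ACTIVE_AREA"]]}'
)  # Polish: "minimum share of biologically active area - 90%"

def doccano_to_spacy(rec: dict) -> tuple[str, dict]:
    """Convert a Doccano record to spaCy's (text, {"entities": ...}) shape."""
    entities = [(start, end, label) for start, end, label in rec["label"]]
    return rec["text"], {"entities": entities}

text, annotations = doccano_to_spacy(record)
print(annotations)  # {'entities': [(0, 55, 'BIOLOGICALLY_ACTIVE_AREA')]}
```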

Figure 3: The number of individual labels in the entire data set (corpus of texts).

The next step was to train the NER model. For training purposes, the data set was divided into a training set (80%), a validation set (10%), and a test set (10%). Thus, the training set consisted of 2,240 texts, while the validation and test sets contained 280 texts each.

The training process for the custom NER model involved initializing the model weights randomly and making predictions for a number of examples using the current weights. The model then compared these predictions with the actual labels, determining how to adjust the weights to improve the predictions in future iterations. After making a slight correction to the existing weights, the process proceeded to the next set of data.
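The update loop described above can be sketched with spaCy's training API. This is a hedged sketch, not the study's training configuration: it uses a single hypothetical annotated example, whereas the real model was trained on 2,240 texts.

```python
import random

import spacy
from spacy.training import Example

nlp = spacy.blank("pl")
ner = nlp.add_pipe("ner")
ner.add_label("BIOLOGICALLY_ACTIVE_AREA")

# One hypothetical annotated example
# (Polish: "minimum share of biologically active area - 90%")
TRAIN_DATA = [
    ("minimalny udzial powierzchni biologicznie czynnej - 90%",
     {"entities": [(0, 55, "BIOLOGICALLY_ACTIVE_AREA")]}),
]

optimizer = nlp.initialize()  # random initial weights
losses = {}
for epoch in range(20):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        # Predict with the current weights, compare against the gold labels,
        # and make a slight correction to the weights
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer, losses=losses)
print(losses)
```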

In summary, a custom NER model was developed using a blank SpaCy model, which was then trained on a specially prepared data set annotated with the Doccano tool. In the developed method, it was assumed that whole phrases or sentences containing the target values of entities would be extracted first. Then, in the next stage, specific data assigned to the appropriate labels would be extracted using the rule-based method.

3.3 Extracting values from identified entities

The second stage of the information extraction process involved defining additional rules to extract the values of individual labels from the named entities identified in the first stage. For this purpose, rule-based patterns provided by the Matcher class from the SpaCy library were used, which allows defining patterns over token attributes, such as morphological features, part-of-speech tags, and entity annotations. Matcher also supports the use of regular expressions, significantly increasing the possibilities of pattern definition. The second stage therefore involves applying these patterns to the extracted text fragments to obtain the values of individual labels. In the processed texts, there were situations where more than one label appeared in a single text. For example, the intensity of development for single-family housing areas was presented separately for different types of buildings, for example, row houses and detached houses. In such cases, it was assumed that the maximum intensity of development for the entire zone (i.e., the text that represents the detailed provisions) would be extracted. Similarly, for biologically active areas, the lowest value was extracted, since the minimum allowed biologically active area indicator is of interest. For the labels of maximum build-up area and building height, the maximum indicators for the given area were also adopted.
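The paper uses spaCy's Matcher with regular-expression token patterns; the same value-extraction logic can be sketched with plain regular expressions. The spans below are hypothetical stage-one fragments, and the min/max rules follow the assumptions stated above.

```python
import re

# Hypothetical fragment detected in stage one for "biologically active area"
span = "minimum share of biologically active area - 90% of the land"

# Extract all percentage values and keep the lowest one, since the
# minimum allowed indicator is of interest
percentages = [int(m) for m in re.findall(r"(\d+)\s*%", span)]
value = min(percentages)
print(value)  # 90

# Hypothetical fragment with intensity given per building type; the
# maximum for the whole zone is extracted
intensity_span = "development intensity: from 0.4 to 1.2 for detached houses"
values = [float(m) for m in re.findall(r"\d+(?:\.\d+)?", intensity_span)]
intensity = max(values)
print(intensity)  # 1.2
```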

The final result of the entire process is presented in an example in Table 1. The original text is an example of provisions for three areas of the plan with symbols 1R, 2R, and 3R. These areas are represented as spatial data, consisting of three objects, i.e., planning zones. Each of these objects is subject to specific regulations (content of the “Original text” column). The second column (Step 1) shows the result of the first stage, where a text fragment is extracted in the form of a span containing multiple tokens, with its starting and ending indices provided. The third column (Step 2) presents the final extraction result, i.e., the label along with the given value. In this case, two entities have been extracted. The first is future land use with the value of “agricultural areas,” and the second entity is the percentage of biologically active surface, which amounts to 90%.

Table 1

Example of information extraction from text from a spatial development plan

Original text:
"§ 8.3. For areas 1R, 2R, and 3R, the following parameters and indicators of development and land use are established: 1) purpose: agricultural areas; 2) order to maintain the agricultural function of the land; 3) prohibition of development, subject to point 4; 4) minimum share of biologically active area – 90% of the land; 5) allowing the introduction of natural greenery, trees and bushes, meadows, pastures, and field crops; 6) allowing the construction, reconstruction, expansion, and repair of technical infrastructure networks and equipment; 8) allowing the location of drainage facilities; and 9) allowing the location of access roads and driveways"

Step 1 (spans detected by the NER model, with start and end character indices):
[148, 163] "agricultural areas" – Future land use
[320, 394] "minimum share of biologically active area – 90% of the land" – Biologically active area

Step 2 (extracted values):
Future land use: agricultural areas; Biologically active area: 90%

3.4 Developing an ontology and creating a knowledge graph

Techniques such as NER, relationship analysis, and semantic analysis are used in NLP to extract relevant information from text and structure it into knowledge graphs. These knowledge graphs, in conjunction with ontologies, serve as structured knowledge representations that can be used for various purposes, such as information retrieval, data analysis, decision support systems, and semantic search.

In this work, an ontology was developed that can be used to model entities such as planning zones and provide descriptions of them. The main goal of the developed ontology is to provide a description reflecting the informational scope of the extracted information in earlier stages.

The ontology was defined using the RDFS language and uses properties from the GeoSPARQL ontology to describe the geometric properties of the planning areas. GeoSPARQL [31] is a standard developed by the Open Geospatial Consortium that extends the RDF query language (SPARQL) with geospatial functions, allowing queries that involve spatial data.

Instantiation, which is the process of adding instances to an ontology and creating a knowledge graph, was implemented using the rdflib library. The rdflib library [32] is a popular tool for manipulating and analyzing data in RDF format, and its use enabled the addition of new instances and their relationships to the ontology.

4 Results and discussion

The following sections present the results of the performance evaluation of the information extraction method from spatial development plans, as well as the results of representing the extracted information within a knowledge graph. The results of the performance evaluation are categorized into two parts. The first part focuses on the evaluation results of the custom NER model (Section 4.1). The second part addresses the results of the overall extraction process, which concentrates on extracting specific values of named entities rather than sentences or phrases (Section 4.2). The final section presents examples of the representation of the extracted information in a knowledge graph using a dedicated ontology based on the RDFS language and the properties of the GeoSPARQL ontology. These properties allow for the reflection of the geometric properties of the planning zones.

4.1 Performance of the custom NER model

To evaluate the performance of the NER model created in the first stage of the information extraction process, we selected precision, recall, and F1 score as evaluation measures. These measures are defined as follows:

Precision = TP / (TP + FP),

Recall = TP / (TP + FN),

F1 = 2 × Precision × Recall / (Precision + Recall),

where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively.
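The definitions above can be expressed as a small helper function. The counts in the example are hypothetical, chosen only to reproduce the overall 0.78 scores reported below.

```python
def ner_metrics(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from entity-level counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical counts: 78 correct entities, 22 spurious, 22 missed
p, r, f1 = ner_metrics(78, 22, 22)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.78 0.78 0.78
```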

The model results for all entities (overall) show a precision of 0.78, a recall of 0.78, and an F1 score of 0.78, indicating balanced performance between precision and recall. For the future land use label, the precision is 0.74, the recall is 0.79, and the F1 score is 0.76. The build-up area label achieves a precision of 0.86, a recall of 0.70, and an F1 score of 0.77. The development intensity label is characterized by a precision of 0.81, a recall of 0.84, and an F1 score of 0.82. The highest accuracy is achieved for biologically active area, where all three metrics take a value of 0.93. The weakest result the model achieves is for the building height label, with precision, recall, and F1 all at a level of 0.66 (Table 2).

Table 2

Evaluation of custom NER model

Entity type Precision Recall F1 score
Overall 0.78 0.78 0.78
Future land use 0.74 0.79 0.76
Build-up area 0.86 0.70 0.77
Development intensity 0.81 0.84 0.82
Biologically active area 0.93 0.93 0.93
Building height 0.66 0.66 0.66

In reference to the aforementioned results, it should be noted that the method of extracting information from spatial development plans is based on two sequential stages. The evaluation results above concern the first stage, in which the NER model detects named entities, i.e., sentence fragments corresponding to specific labels. At this stage, the exact boundaries of the detected text fragment matter less than whether the fragment includes the value of interest.

For example, in the first stage, the model extracts an entity identified by the NER model as “minimum biologically active area – 15% of the plot area” for the label minimum biologically active area. However, the true value is “minimum biologically active area – 15% of the area of a building plot or group of plots construction.” In this case, the model prediction is not entirely accurate, but it is still sufficient for the analysis discussed in this article. Our primary focus is to determine whether the extracted phrase contains the 15% value and whether it is labeled as a valid entity type. This does not apply to the land use label, where the accuracy of detecting the text fragment is more important. For example, for the “service building areas – education services” land use, we will be interested in the entire extracted fragment, including the phrase “education services,” indicating a specific use for the given area.

4.2 Outcomes of the comprehensive information extraction process

The results of the overall information extraction process involve the extraction of specific values of named entities from the text. These values were extracted using pattern matching and regular expressions. In the analyzed case, where the process of information extraction is carried out in two stages, the quality measures for the second stage are more important, as they are directly related to solving the problem at hand. In the first stage, only sentence fragments are detected, which are then processed in the second stage, but the fragments themselves are not the final results. Therefore, the evaluation of the first stage provides only a partial picture of the quality of the entire process, as the most important values are obtained at the end of the second stage. In the second stage, the final quality of the model was tested on the entities that achieved the best metrics in the first stage, namely development intensity and biologically active area, as well as on future land use. The test data set for the final overall evaluation of the extraction process was developed by a domain expert.

Table 3 presents the evaluation results for three different types of named entities: future land use, development intensity, and biologically active area. The results are presented in the form of three metrics that were calculated: precision, recall, and F1 score.

Table 3

Performance metrics for predicting future land use, development intensity, and biologically active area

Entity type Precision Recall F1 score
Future land use 0.85 0.90 0.86
Development intensity 0.97 0.93 0.95
Biologically active area 0.99 0.99 0.99

For future land use predictions, the model achieved a precision of 0.85, indicating that 85% of the predictions for this type of entity were accurate. The recall score of 0.90 suggests that the model identified 90% of the actual instances of this entity type. The F1 score, which is the harmonic mean of precision and recall, is 0.86, signifying a strong performance for this type of entity. Regarding development intensity, the model achieved a precision of 0.97, implying that 97% of its predictions for this entity type were correct. The recall score of 0.93 shows that the model identified 93% of the actual instances of this entity type. The F1 score for development intensity is 0.95, demonstrating a high level of accuracy in the model’s predictions for this entity type. For biologically active area, the model demonstrated exceptional performance, with a precision of 0.99, indicating that 99% of its predictions for this entity type were accurate. The recall score, also at 0.99, suggests that the model successfully identified 99% of the actual instances of this entity type. The F1 score reached 0.99, confirming the model’s outstanding performance in predicting biologically active area instances.
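For reference, the reported metrics follow the standard definitions; the short sketch below (with made-up counts that do not reproduce Table 3) shows how they are computed from true positives (TP), false positives (FP), and false negatives (FN).

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple:
    """Entity-level metrics: precision, recall, and their harmonic mean (F1)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative counts only (not the ones behind Table 3):
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=10)
```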

It is worth noting that, for the effectiveness of the overall process, it is not crucial for the model to detect the entire phrase. The primary concern is whether the model can identify and extract the relevant values of interest from the given fragments. Our main goal is to check whether the detected phrase contains the target value and whether the model correctly identified the entity. The effectiveness of the model therefore depends on its ability to detect and assign appropriate values to the analyzed entities; the accuracy with which complete phrases are detected is less important.

4.3 Representation of plan data in a knowledge graph

To represent plan data within a knowledge graph, an ontology is used. The ontology defines two main classes: the Plan class, which represents spatial development plans, and the Zone class, which represents planning areas. Both classes inherit from the GeoSPARQL Feature class. The ontology specifically defines the properties related to the planning area. The data-type properties, which represent relationships between class instances and literals, include hasArea, representing the area of the planning zone; hasBiologicallyActiveArea, representing the minimum share of biologically active area in the planning zone; hasDevelopmentIntensity, representing the maximum development intensity; hasFutureLandUse, representing the land use; hasText, representing the detailed text about the planning area; and zoneID, representing the identifier of the spatial planning object. The ontology defines only one object property, isInPlan, which links a planning area to the spatial development plan of which it is a part (Figure 4).

Figure 4

Ontology for representation of planning zones in a spatial development plan (visualization created using TopBraid Composer).

The created knowledge graph then had to be populated with data. To do this, spatial objects representing planning areas were converted into graph form. Note that one text description can refer to many zones, so the number of objects in the Zone class is not equal to the number of texts. As a result, the knowledge graph contains more than 22,700 instances of the Zone class. In addition to the Zone instances, spatial development plans were included as a separate class labeled Plan, with 381 instances representing individual plans. Each object of the Plan class is associated with the Zone instances that represent the planning areas contained in that plan. This structure makes it easy to track connections between spatial development plans and individual planning areas, which facilitates the analysis and use of these data. An example instance of a planning area with its descriptive properties is shown in Figure 5.

Figure 5

An example instance of a zone in a spatial development plan (visualization created using TopBraid Composer).
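The population step can be illustrated with a minimal sketch that builds RDF-style triples for one plan and one zone. Plain Python tuples stand in here for a proper triple store such as rdflib, and the identifiers and values are invented for the example; only the property names mirror the ontology described above.

```python
SP = "http://spatialplanning.org/localplans#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def zone_triples(zone_id, plan_uri, land_use, intensity, active_area):
    """Build the triples describing one planning zone, mirroring the ontology."""
    zone = SP + "Feature_" + zone_id
    return {
        (zone, RDF_TYPE, SP + "Zone"),
        (zone, SP + "isInPlan", plan_uri),
        (zone, SP + "hasFutureLandUse", land_use),
        (zone, SP + "hasDevelopmentIntensity", intensity),
        (zone, SP + "hasBiologicallyActiveArea", active_area),
        (zone, SP + "zoneID", zone_id),
    }

# Populate a toy graph with one plan and one of its zones.
graph = set()
plan = SP + "Plan_101"
graph.add((plan, RDF_TYPE, SP + "Plan"))
graph |= zone_triples("16489", plan,
                      "Tereny zabudowy mieszkaniowej jednorodzinnej", 0.5, 60.0)
```

In the real pipeline, each of the 381 plans would be linked in this way to the Zone instances extracted from its text.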

The developed ontology provides a structured modeling approach that greatly enhances the analysis and exchange of plan data. It serves as a robust framework for storing comprehensive information about spatial development plans and their associated details, including information related to planning areas, development types, and imposed restrictions. Additionally, integration with GeoSPARQL allows for standardized geometry representation of objects, facilitating their analysis, exchange, and integration with other systems.

Advanced queries in the SPARQL query language enable searching, filtering, and merging data from different sources to answer specific questions. For example, it is possible to query planning areas with a specific purpose, as well as for specific limitations defined for those areas. An example SPARQL query with its result is presented in Table 4. The query returns spatial zones designated for residential development whose maximum development intensity index does not exceed 1.2 and whose minimum share of biologically active area exceeds 50% (the result is limited to five instances). In this example, instances were identified that met specific criteria related to residential development, the development intensity index, and biologically active area. However, the range of information about a particular planning area can be much broader. Owing to the flexible data structure, valuable information from other sources, such as environmental restrictions in a given area, can be included. As a result, efficient analysis is possible and areas in line with specific spatial planning objectives can be identified.

Table 4

A sample SPARQL query with an answer

PREFIX sp: <http://spatialplanning.org/localplans#>
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT DISTINCT ?plan ?zone ?intensity ?biolActiveArea ?landUse
WHERE {
 ?zone rdf:type sp:Zone ;
  sp:isInPlan ?plan ;
  sp:hasFutureLandUse ?landUse ;
  sp:hasBiologicallyActiveArea ?biolActiveArea ;
  sp:hasDevelopmentIntensity ?intensity .
 FILTER (CONTAINS(str(?landUse), "Tereny zabudowy mieszkaniowej"))
 FILTER (?biolActiveArea > 50)
 FILTER (0 < ?intensity && ?intensity <= 1.2)
}
LIMIT 5
Plan Zone Intensity BiolActiveArea LandUse
sp:Plan_101 sp:Feature_16489 0.5 60.0 Tereny zabudowy mieszkaniowej jednorodzinnej (Single-family housing areas)
sp:Plan_182 sp:Feature_18778 0.9 60.0 Tereny zabudowy mieszkaniowej jednorodzinnej (Single-family housing areas)
sp:Plan_226 sp:Feature_3264 0.2 55.0 Tereny zabudowy mieszkaniowej jednorodzinnej (Single-family housing areas)
sp:Plan_226 sp:Feature_2735 0.2 55.0 Tereny zabudowy mieszkaniowej jednorodzinnej (Single-family housing areas)
sp:Plan_407 sp:Feature_8062 0.7 40.0 Tereny zabudowy mieszkaniowej jednorodzinnej (Single-family housing areas)
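The selection logic of the Table 4 query can also be expressed directly over zone records, which may help readers unfamiliar with SPARQL. The records below are a hypothetical stand-in for the knowledge graph's contents; only the first and last rows echo values from the table.

```python
# Hypothetical zone records: (plan, zone, intensity, active_area, land_use)
zones = [
    ("sp:Plan_101", "sp:Feature_16489", 0.5, 60.0,
     "Tereny zabudowy mieszkaniowej jednorodzinnej"),
    ("sp:Plan_233", "sp:Feature_901", 1.8, 30.0,
     "Tereny zabudowy uslugowej"),
    ("sp:Plan_407", "sp:Feature_8062", 0.7, 40.0,
     "Tereny zabudowy mieszkaniowej jednorodzinnej"),
]

# Mirror of the three FILTER clauses in the SPARQL query:
selected = [
    z for z in zones
    if "Tereny zabudowy mieszkaniowej" in z[4]  # residential land use
    and z[3] > 50                               # biologically active area > 50%
    and 0 < z[2] <= 1.2                         # development intensity in (0, 1.2]
][:5]                                           # LIMIT 5
```

Of the three toy records, only the first passes all three filters: the second has the wrong land use, and the third falls below the 50% threshold.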

This article presents only a fragment of the representation of spatial planning information in the form of an ontology, limited to the basic classes and properties describing spatial planning areas. Other work on ontologies for the formal description of spatial planning has been carried out by, among others, the authors of [33], who developed semantic metadata for spatial planning, used to identify the relationships between provisions in the plan text, spatial objects in drawings, and references to external resources. Their ontology was used to describe the textual part of the plan represented in XHTML documents using semantic annotations. Another interesting example is an ontology developed for Singapore that aims to represent land use restrictions [6].

In the future, the ontology developed in this study could be expanded with additional classes and properties to better reflect the various aspects of spatial planning and allow for more complex knowledge exploration. These could include additional constraints that exist in a given area and would, as a result, create a complete picture of the defined spatial policy. Issues related to the extension or reuse of the ontology require additional investigation but appear feasible.

4.4 Use case: enriching spatial data with extracted information

Information extracted from the textual parts of plans can also enrich the spatial data that represent planning areas with descriptive attributes. Manual acquisition of information from plans is very time-consuming, so the scope of information in these data is often limited to land use. Figures 6 and 7 present an example fragment of the current spatial development plans, showing areas by their minimum biologically active surface ratio (Figure 6) and maximum building intensity ratio (Figure 7). Having such information facilitates monitoring of spatial policy at a larger scale, such as a region. It allows the investment process to be confronted and verified against the current spatial policy expressed in the plans. It can also serve as a basis for further analysis, such as verifying plan provisions and their compliance with existing development [34], or identifying potentially problematic areas, for example, areas where the planned intensity of development is high while the share of biologically active area is low (Figure 8).

Figure 6

Example of visualization of minimum biologically active area indicator.

Figure 7

Example of visualization of maximum development intensity ratio.

Figure 8

Example of the visualization of the relation between two indicators: biologically active surface and development intensity.

The use of automation in obtaining this type of information can significantly accelerate the process and enable the monitoring and analysis of spatial data at a larger scale and with greater precision. By automating the extraction process, the proposed method enables a more comprehensive representation of plan data, encompassing a wider range of attributes, such as development constraints and rules. As a result, the enriched spatial data can provide a deeper understanding of plan regulations, support more informed decision-making, and facilitate more effective communication among stakeholders, including urban planners, policymakers, and citizens.

5 Conclusions

Digitalization with the use of new technologies is changing spatial planning and influencing its development. Digitization and standardization are important elements of this task, as they allow for the exchange of plan data, efficient monitoring of spatial policy, and support for public participation in planning. However, it can be a long and complex process. Spatial development plans are difficult to formalize, as text and plan graphics complement each other, and the text contains numerous provisions concerning individual areas and their development conditions.

Currently, the most advanced representations of plans take the form of spatial data with an associated data model. However, other structures, such as knowledge graphs, enable data storage and integration in a new way, and current technological progress in artificial intelligence supports the search for solutions in this area.

The research focuses on two areas in this regard: NLP methods using machine learning and semantic technologies. It addresses the extraction of contextual information from the text of spatial planning documents, presenting a two-stage extraction method. The extracted information is then represented in the form of a knowledge graph, which allows for easy exploration and, in the future, linkage with other data. It can also be used to enrich spatial data related to spatial planning as part of plan data harmonization.

Research was carried out on spatial planning documents from Poland, where the main driving force behind the digitization process was the INSPIRE directive [35], which established frameworks and technical means for the exchange of spatial planning data among European countries [36]. In Poland, under the current legal framework, spatial data representing plans for spatial development must be made available in the form of vector plan boundaries, plan drawings with georeferencing, and attributes containing information about the plan act [37]. However, ultimately, the vector boundaries of planning divisions along with attributes describing these objects are also to be published. Activities regarding the reform of spatial planning, including digitization of planning documents in Poland, are currently very dynamic.

According to Michalik [36], the digitalization of spatial planning should be treated in a multifaceted manner, and any changes in this area must occur gradually, step by step, which seems to be the optimal approach. In Poland, consecutive steps are being taken toward this goal, and a new concept of a national planning geoportal has been proposed as the next step in the digitalization of spatial planning [38].

The method presented in this article can support the digitization of spatial planning documents by automatically extracting information about plan determinations and representing it in the form of a knowledge graph. An NLP technique, NER, was used in the extraction process, enabling relevant information to be extracted from text and structured as labels with their corresponding values.

The experiments conducted showed a high level of accuracy in extracting information from plans, which gives hope for the use of the developed methodology on a larger scale. However, this would require obtaining a larger amount of training data and preparing additional rules at the national level. The developed NER model demonstrated high levels of precision, recall, and F1 scores for all three analyzed entity types, with particularly remarkable results for biologically active area predictions. These findings suggest that the model can effectively predict and identify information about future land use, development intensity, and biologically active area. Further research could explore potential improvements to the model, as well as investigate its applicability to additional entity types and domains.

The proposed method goes beyond traditional NER techniques. Although the first stage involves extracting named entities, this step alone is insufficient to obtain comprehensive information on the plan’s regulations. Consequently, in the second stage, we use rule-based extraction to obtain specific values and assign them to the identified named entities. This differentiates our approach from standard NER methods that primarily focus on extracting named entities such as names and places. Our ultimate goal is to transform unstructured documents into structured information in the form of an ontology-based knowledge graph. This process involves not only extracting relevant information but also organizing it in a structured way that allows for efficient querying and analysis.

The developed method is not limited to spatial planning documents in Poland and can be applied to other spatial planning documents at the local level. However, developing a new NER model requires preparing new training data sets that cover the specific documents that need to be processed, taking into account their unique characteristics and terminology. The use of a knowledge graph structure to represent extracted information from spatial development plans also enables a better understanding and management of complex relationships between various elements, such as zones, objects, land use, and planning documents. However, to fully reflect the specificity of spatial planning depending on the country and planning system, the ontology should be modified or extended with additional elements.

Our research shows the effectiveness of the proposed method in obtaining and representing planning information from local spatial development plans. However, it is worth considering the possibility of generalizing this approach to other types of documents or data containing spatial descriptions. Possible applications in spatial planning include the analysis of other planning documents, such as studies of the conditions and directions of spatial development of communes. Another example may be protection plans for protected areas, such as national parks or nature reserves, whose regulations are largely contained in the textual part. A further promising possibility is applying the proposed approach to the analysis of environmental impact assessment (EIA) reports. Using NLP techniques and semantic technologies, valuable information can be extracted from these reports and then presented in a structured and easily accessible way. Ultimately, this can help identify trends and patterns across EIA reports.

Further research could focus on developing and optimizing information extraction algorithms, taking into account additional aspects, particularly the regulations contained in the plan. As the field of NER continues to evolve, with new models and algorithms emerging, there is potential for further research. In this study, we used the spaCy NER model. However, it would be valuable to explore and compare the performance of different NER models, especially those based on deep learning, for similar tasks. Such a comparative analysis could reveal the strengths and weaknesses of various NER models and guide the selection of the most suitable model for specific applications in data-driven urbanism and information extraction. Moreover, applying these methods in spatial planning practice can contribute to further improvement of decision-making processes and streamline collaboration between the different entities involved in spatial planning.

  1. Funding information: This article received financial support from the Institute of Spatial Management, Wrocław University of Environmental and Life Sciences.

  2. Conflict of interest: The author states no conflict of interest.

References

[1] Indrajit A, van Loenen B, Ploeger H, van Oosterom P. Developing a spatial planning information package in ISO 19152 land administration domain model. Land Use Policy. 2020;98:104111. doi:10.1016/j.landusepol.2019.104111

[2] ESPON DIGIPLAN. Evaluating spatial planning practices with digital plan data. Final report; 2021. https://www.espon.eu/digiplan

[3] Hersperger AM, Thurnheer-Wittenwiler C, Tobias S, Folvig S, Fertner C. Digitalization in land-use planning: Effects of digital plan data on efficiency, transparency and innovation. Eur Plan Stud. 2022;30:2537–53. doi:10.1080/09654313.2021.2016640

[4] Nowak M, Petrisor AI, Mitrea A, Kovács KF, Lukstina G, Jürgenson E, et al. The role of spatial plans adopted at the local level in the spatial planning systems of Central and Eastern European Countries. Land. 2022;11(9):1599. doi:10.3390/land11091599

[5] Cimiano P, Paulheim H. Knowledge graph refinement: A survey of approaches and evaluation methods. Semant Web. 2017;8:489–508. doi:10.3233/SW-160218

[6] Silvennoinen H, Chadzynski A, Farazi F, Grišiūtė A, Shi Z, von Richthofen A, et al. A semantic web approach to land use regulations in urban planning: The OntoZoning ontology of zones, land uses and programmes for Singapore. J Urban Manag. 2023;12:151–67. doi:10.1016/j.jum.2023.02.002

[7] Kaczmarek I, Iwaniak A, Łukowicz J. New spatial planning data access methods through the implementation of the INSPIRE directive. Real Estate Manag Valuat. 2014;22:9–21. doi:10.2478/remav-2014-0002

[8] Boland P, Durrant A, McHenry J, McKay S, Wilson A. A ‘planning revolution’ or an ‘attack on planning’ in England: digitization, digitalization, and democratization. Int Plan Stud. 2022;27:155–72. doi:10.1080/13563475.2021.1979942

[9] Potts R. Is a new ‘Planning 3.0’ paradigm emerging? Exploring the relationship between digital technologies and planning theory and practice. Plan Theory Pract. 2020;21:272–89. doi:10.1080/14649357.2020.1748699

[10] Jankowski P, Czepkiewicz M, Młodkowski M, Zwoliński Z, Wójcicki M. Evaluating the scalability of public participation in urban land use planning: A comparison of Geoweb methods with face-to-face meetings. Environ Plan B: Urban Analytics City Sci. 2019;46:511–33. doi:10.1177/2399808317719709

[11] Levenda AM, Keough N, Rock M, Miller B. Rethinking public participation in the smart city. Can Geographer/Le Géographe canadien. 2020;64:344–58. doi:10.1111/cag.12601

[12] Olszewski R, Cegiełka M, Szczepankowska U, Wesołowski J. Developing a serious game that supports the resolution of social and ecological problems in the toolset environment of Cities: Skylines. ISPRS Int J Geo-Inf. 2020;9(2):118. doi:10.3390/ijgi9020118

[13] Bibri SE. The evolving data-driven approach to smart sustainable urbanism for tackling the conundrums of sustainability and urbanization. In: Big data science and analytics for smart sustainable urbanism: Unprecedented paradigmatic shifts and practical advancements. Cham: Springer International Publishing; 2019. p. 1–10. doi:10.1007/978-3-030-17312-8_1

[14] Bibri SE. The anatomy of the data-driven smart sustainable city: Instrumentation, datafication, computerization and related applications. J Big Data. 2019;6:59. doi:10.1186/s40537-019-0221-4

[15] Kitchin R, Lauriault TP, McArdle G. Knowing and governing cities through urban indicators, city benchmarking and real-time dashboards. Regional Studies, Regional Sci. 2015;2:6–28. doi:10.1080/21681376.2014.983149

[16] Bibri SE. Compact urbanism and the synergic potential of its integration with data-driven smart urbanism: An extensive interdisciplinary literature review. Land Use Policy. 2020;97:104703. doi:10.1016/j.landusepol.2020.104703

[17] Bibri SE. Introduction: The rise of sustainability, ICT, and urbanization and the materialization of smart sustainable cities. In: Smart sustainable cities of the future: The untapped potential of big data analytics and context-aware computing for advancing sustainability. Cham: Springer International Publishing; 2018. p. 1–38. doi:10.1007/978-3-319-73981-6_1

[18] Laurini R. A primer of knowledge management for smart city governance. Land Use Policy. 2021;111:104832. doi:10.1016/j.landusepol.2020.104832

[19] Indrajit A, van Loenen B, Suprajaka, Jaya VE, Ploeger H, Lemmen C, et al. Implementation of the spatial plan information package for improving ease of doing business in Indonesian cities. Land Use Policy. 2021;105:105338. doi:10.1016/j.landusepol.2021.105338

[20] Nadkarni PM, Ohno-Machado L, Chapman WW. Natural language processing: An introduction. J Am Med Inform Assoc. 2011;18:544–51. doi:10.1136/amiajnl-2011-000464

[21] Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed representations of words and phrases and their compositionality. In: Burges CJC, Bottou L, Welling M, Ghahramani Z, Weinberger KQ, editors. Advances in Neural Information Processing Systems 26. Red Hook, NY: Curran Associates, Inc.; 2013. p. 3111–9.

[22] Cai M. Natural language processing for urban research: A systematic review. Heliyon. 2021;7:e06322. doi:10.1016/j.heliyon.2021.e06322

[23] Jang KM, Kim Y. Crowd-sourced cognitive mapping: A new way of displaying people’s cognitive perception of urban space. PLOS ONE. 2019;14:1–18. doi:10.1371/journal.pone.0218590

[24] Sharma P, Samal A, Soh LK, Joshi D. A spatially-aware algorithm for location extraction from structured documents. GeoInformatica. 2022. doi:10.1007/s10707-022-00482-1

[25] Halterman A. Mordecai: Full text geoparsing and event geocoding. J Open Source Softw. 2017;2:91. doi:10.21105/joss.00091

[26] Gritta M, Pilehvar MT, Limsopatham N, Collier N. What’s missing in geographical parsing? Lang Resour Evaluation. 2018;52:603–23. doi:10.1007/s10579-017-9385-8

[27] Szczepanek R. A deep learning model of spatial distance and named entity recognition (SD-NER) for flood mark text classification. Water. 2023;15(6):1197. doi:10.3390/w15061197

[28] Lai Y, Kontokosta CE. Topic modeling to discover the thematic structure and spatial-temporal patterns of building renovation and adaptive reuse in cities. Comput Environ Urban Syst. 2019;78:101383. doi:10.1016/j.compenvurbsys.2019.101383

[29] Honnibal M, Montani I, Van Landeghem S, Boyd A. spaCy: Industrial-strength natural language processing in Python; 2020. doi:10.5281/zenodo.1212303

[30] Nakayama H, Kubo T, Kamura J, Taniguchi Y, Liang X. doccano: Text annotation tool for humans; 2018. Software available from https://github.com/doccano/doccano

[31] Car NJ, Homburg T, Perry M, Herring J, Knibbe F, Cox SJD, et al. OGC GeoSPARQL – A geographic query language for RDF data. OGC Implementation Standard. Open Geospatial Consortium; 2022.

[32] Boettiger C. rdflib: A high level wrapper around the redland package for common RDF applications; 2018.

[33] Iwaniak A, Kaczmarek I, Łukowicz J, Strzelecki M, Coetzee S, Paluszyński W. Semantic metadata for heterogeneous spatial planning documents. ISPRS Ann Photogram Remote Sens Spat Inf Sci. 2016;IV-4/W1:27–36. doi:10.5194/isprs-annals-IV-4-W1-27-2016

[34] Błasik M, Wang T, Kazak JK. The effectiveness of master plans: Case studies of biologically active areas in suburban zones. Geomat Environ Eng. 2022;16:27–40. doi:10.7494/geom.2022.16.3.27

[35] Directive 2007/2/EC of the European Parliament and of the Council of 14 March 2007 establishing an Infrastructure for Spatial Information in the European Community (INSPIRE); 2007.

[36] Michalik A. Selected aspects of the digitisation of spatial planning in the context of legislative changes in Poland. Acta Sci Pol Architectura. 2022;21(2):63–73. doi:10.22630/ASPA.2022.21.2.15

[37] Ustawa z dnia 27 marca 2003 r. o planowaniu i zagospodarowaniu przestrzennym. Dz.U. 2003 nr 80, poz. 717 (Act of 27 March 2003 on planning and spatial development. Journal of Laws of 2003 no. 80, item 717).

[38] Michalik A, Zwirowicz-Rutkowska A. A geoportal supporting spatial planning in Poland: Concept and pilot version. Geomat Environ Eng. 2023;17:5–30. doi:10.7494/geom.2023.17.2.5

Received: 2023-04-11
Revised: 2023-06-28
Accepted: 2023-07-01
Published Online: 2023-08-11

© 2023 the author(s), published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.
