Semantic understanding of high spatial resolution remote sensing images using directional geospatial relationships

ABSTRACT Semantic understanding of high spatial resolution remote sensing (RS) images can be divided into object detection, object labelling, identification of geospatial relationships, and semantic description generation. Geospatial relations represent the spatial distribution dependencies and topologies between geospatial entities such as points, lines, and polygons, and play a very important role in describing how geographic objects are associated with one another. These relations can be broadly classified as topological, directional, and proximity relations. This paper proposes an approach to identify an appropriate directional geospatial relationship between geo-objects present in high spatial resolution RS images. Geospatial objects in the form of closed boundaries are taken as input and relationship triplets are generated. Two approaches are used to identify directional relationships and their results are compared: the first is based on the centroids of the objects, while the second considers the whole objects when calculating the direction. The identified relations are then represented in a knowledge graph, where nodes represent objects and edges represent their relationships. The knowledge graph plays an important role in overall scene understanding, since it shows the association of all objects with each other. Finally, the relationships are expressed as descriptions using template-based sentence generation. Results show that directional relationships are accurately identified between each pair of objects by both approaches, but the relations generated by considering whole objects are closer to human cognition. Semantic understanding of remote sensing images is of great significance in applications such as urban surveys, urban planning and management, and military intelligence.


Introduction
Semantic understanding of high spatial resolution (HSR) remote sensing (RS) images using geospatial relationships includes proper interpretation of RS scene type or location, different object properties, and their relationship with each other. Instead of simply predicting individual keywords, it generates a geospatial relation description for the semantic understanding of RS images.
In the past few years, the field of image description using natural language processing (NLP) has gained the attention of many researchers (Staniute and Šešok 2019). This area mainly focuses on description generation for natural images, and the techniques primarily use visual attention, scene-specific contexts (Fu et al. 2017; Li et al. 2019), and game-theoretic optimization (Sreela and Idicula 2019; He and Deng 2017). As the geo-objects in RS images differ from the objects present in natural images, identifying the geospatial relationships between geo-objects is a crucial task. After the need for semantic descriptions in the form of sentences was identified, a few RS image description methods were proposed in the literature. A method leveraging a deep convolutional neural network (CNN) and a recurrent neural network language model (RNNLM) was proposed by Zhang et al. (2017), and a technique that uses a fully convolutional network (FCN) was proposed by Shi and Zou (2017). These works used less complicated RS images as input, but for generating semantically correct descriptions it is important to focus on the complex relationships between objects. The generated descriptions do not give accurate semantic information about geospatial objects, and there are no metrics designed specifically for evaluating the performance and accuracy of results on RS images. Chen et al. proposed an attention-based mechanism for identifying the relationships between objects (Chen et al. 2019), and X. Zhang et al. proposed a technique based on the attribute attention mechanism. An approach for semantic segmentation of high-resolution RS images based on multiscale features was proposed by Zhao et al. (2022); the method works very well for the extraction of small objects when tested on two datasets with six classes of objects each. Wang et al. proposed a semantic network for the extraction of impervious surfaces from high-resolution RS images (Wang et al. 2021). Although the techniques reported in the literature perform well at extracting semantic information from RS images, there is still scope for research on the semantic understanding of RS images using geospatial relations.
The main contributions of this work are as follows: (i) a novel method is devised to identify the accurate directional relationship between each pair of objects present in an RS image; (ii) a knowledge graph is used to represent the identified geospatial directional relationship between each pair of objects.
The rest of the paper is organized as follows. The most significant approaches for the semantic understanding of images are discussed in section 2. The geospatial relationships and the ways to identify each of them are discussed in detail in section 3. The proposed methodology for the semantic understanding of high-resolution RS images is described in section 4. Section 5 comprises results and discussions, followed by the conclusion.

Related work
Some studies have focused on the semantic understanding of RS images. Quan et al. proposed a technique for the extraction of contextual information from images using semantic graphs (Muruganandham 2016). A few more studies used deep hashing networks and graph-based techniques for identifying semantic relations between objects (Wang et al. 2019; Xia et al. 2017). The existing techniques concentrate on contextual information extraction and retrieval using different deep learning algorithms. The general approach to the identification and description of the relationships between objects is shown in Figure 1.
Most methods for object relationship detection generate a scene graph from the input image. The scene graph, also referred to as a semantic graph or knowledge graph, consists of all the objects connected by some relationship (Wang et al. 2016; Catania 1996; Anderson et al. 2016; Xu et al. 2019; Yang et al. 2018). After identifying objects in the input image, every pair of objects is denoted as <subject, object> and the relationship between them is represented by a triplet <subject, predicate, object>, where the predicate can be one of the geospatial relations discussed in section 3. The detected object relationships are then refined using an attention mechanism. Some significant contributions in the field of relationship detection are explained below.
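As a concrete illustration of this triplet representation (not specific to any of the methods reviewed below), a relationship triplet can be modelled as a simple typed record; the class and field names in this sketch are hypothetical.

```python
from typing import NamedTuple

class RelationTriplet(NamedTuple):
    """<subject, predicate, object> relation between two detected objects."""
    subject: str     # e.g. "bridge"
    predicate: str   # a geospatial relation, e.g. "Northeast" or "near"
    obj: str         # e.g. "ground track field"

# Example triplet taken from the results discussed later in the paper.
triplet = RelationTriplet("bridge", "Northeast", "ground track field")
```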

Hierarchical recurrent neural network (HRNN)
A hierarchical recurrent neural network with a visual attention mechanism is proposed in (Gao et al. 2019), in which a CNN model is applied to the input image to detect objects. An attention-based HRNN (AHRNN) is then used to detect the relationship between these objects as a <subject-predicate-object> triplet, and the scene graph is constructed automatically from these relationship triplets. AHRNN comprises two sub-networks: an attention-based triplet recurrent neural network (ATRNN) and an attention-based word recurrent neural network (AWRNN). ATRNN takes feature vectors from the CNN as input and sequentially constructs topic vectors of object relationship triplets using LSTM units and attentive weights. AWRNN takes each topic vector as input and recognizes the triplet with the proper word for each relation. The flow of the method is shown in Figure 2. This method works well for natural images, but for complex RS images there is a need for a technique that considers high-level semantics while extracting geospatial relations.

GRU (gated recurrent unit)
The gated recurrent unit (GRU) is used as a recurrent unit that integrates two convolution layers. The GRU is preferred over the LSTM since it has already shown its effectiveness in remote sensing for optical and radar land cover classification from multi-temporal spatial data and for hyperspectral data analysis (Anderson et al. 2018; Interdonato et al. 2019). Furthermore, the GRU network involves fewer parameters to learn than the LSTM unit. This architecture has proven its performance on RS images but has not been tried with very-high-resolution (VHR) RS images, where the challenges are different.

Collective semantic metric learning framework (CSMLF)
Collective semantic metric learning was used to generate sentences using an attention mechanism, in which a collective sentence describing an input image is generated by combining five sentences using metric learning. A CNN is applied to the input image to embed it into the semantic space. Then, the distance between the image and the learned sentences is calculated using a Mahalanobis matrix, and the closest collective sentence is selected as the description of the image. Since this method compares the image with already generated descriptions showing the relationships between the objects, it cannot generate new sentences automatically on its own.

Vision spatial attention network (VSA-Net)
VSA-Net employs an attention-based mechanism to accurately detect small objects present in the image. The focus of this technique is to distinguish between subject and object in the image so as to precisely detect the relationship triplet <subject-predicate-object> (Han et al. 2018). VSA-Net is an end-to-end, attention-based visual relationship detection model. It focuses on the relationship of the subject with other objects, unlike other methods where a relationship graph is generated to show the relations of each object with every other object. This method outperforms other methods used for relationship extraction from natural images, but in RS images it is very difficult to single out one object as the subject.
These methods focus on relations such as the distance between objects, but a single geospatial relation cannot express the exact association among objects, because relations in geographic space often involve various objects that are mixed with each other. Geospatial relationships between geo-objects are of utmost importance for the semantic understanding of high spatial resolution (HSR) remote sensing (RS) images. Geo-objects can be considered as points, lines, or polygons (regions) (Fu et al. 2020), so the relations between objects in an RS image are categorized as point-point, point-line, point-polygon, line-line, line-polygon, and polygon-polygon. In low spatial resolution RS images, points and lines can be important objects, but for high-resolution RS images one can focus only on the relationships between polygons (region-region) (Shekhar and Xiong 2008). These relations can be classified as topological, directional, and proximity relations (Dube and Egenhofer 2020).

Geospatial relations
There are three major types of geospatial relationships between the geo-objects in the RS image as follows (Hu 2018; Zhou and Guan 2019):

Topological relations
Topological relations are the most important geospatial relations for the semantic understanding of RS images. These relations are rotation and scale invariant. The primary and derived (secondary) topological relations between objects are given as follows, and the primary relations are shown with examples in Figure 3. The spatial topological relations between two regions A and B can be represented as a matrix R(A, B), called an intersection model (IM). The IM is a square matrix that can be 2 × 2 (4IM), 3 × 3 (9IM), 4 × 4 (16IM) or 5 × 5 (25IM) (Zhou and Guan 2019). In this matrix, an object's interior is denoted as A°, its boundary as ∂A, and its exterior as A⁻. The 4IM and 9IM are given below; the 9+IM models capture the topological relationship in more depth by additionally considering vertices and edges.
• 4IM model: The 4IM can be used to identify 2⁴ = 16 topological relations by taking the element values as empty or non-empty. The 4IM matrix is given in Equation (1).
• 9IM model: The 9IM can be used to identify 2⁹ = 512 topological relations by taking the element values as empty or non-empty. The 9IM matrix is given in Equation (2).
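The intersection matrices referred to as Equations (1) and (2) are not reproduced in the extracted text above. A standard form, reconstructed from the notation just introduced (interior A°, boundary ∂A, exterior A⁻) and following the classical Egenhofer intersection models, is:

\[
R_4(A,B)=\begin{pmatrix} A^{\circ}\cap B^{\circ} & A^{\circ}\cap \partial B \\ \partial A\cap B^{\circ} & \partial A\cap \partial B \end{pmatrix} \qquad (1)
\]

\[
R_9(A,B)=\begin{pmatrix} A^{\circ}\cap B^{\circ} & A^{\circ}\cap \partial B & A^{\circ}\cap B^{-} \\ \partial A\cap B^{\circ} & \partial A\cap \partial B & \partial A\cap B^{-} \\ A^{-}\cap B^{\circ} & A^{-}\cap \partial B & A^{-}\cap B^{-} \end{pmatrix} \qquad (2)
\]

Each entry is recorded as empty or non-empty, which yields the 2⁴ and 2⁹ candidate configurations mentioned above.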

Directional relations/cardinal directions
Directional relations use descriptors based on a spatial perspective to indicate geographical locations. For example, people use directional descriptors such as East, West, Northeast, and Southwest to define relative directions between geographical objects.
Directional relations, or cardinal directions, are represented using a direction-relation matrix (Cicerone and di Felice 2004). Given two objects in the form of regions or polygons, A and B, the direction relation is denoted dir(A, B); the direction-relation matrix indicates the position of object B relative to A. The eight direction tiles around object A (O_A) are denoted E_A, W_A, N_A, S_A, NW_A, NE_A, SW_A, and SE_A. The direction-relation matrix is given in Equation (3).
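The direction-relation matrix referred to as Equation (3) is not reproduced above. A standard form, following Goyal and Egenhofer's formulation with the nine tiles defined by the reference object A, is:

\[
dir(A,B)=\begin{pmatrix} NW_{A}\cap B & N_{A}\cap B & NE_{A}\cap B \\ W_{A}\cap B & O_{A}\cap B & E_{A}\cap B \\ SW_{A}\cap B & S_{A}\cap B & SE_{A}\cap B \end{pmatrix} \qquad (3)
\]

Non-empty entries indicate in which tiles of A parts of B are located.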

Proximity relations
The geographical distances between geospatial objects are referred to as geospatial proximity relations (e.g. X is close to Y, A is far from B). The following proximity relations exist between geospatial objects: Right-side, Left-side, Between, Before, After, NextTo, Besides, Near, Far, etc.
The Euclidean distance between the centroids of two objects is calculated and divided by the diagonal length of the image using the formula given in Equation (4), as shown in Figure 4. A threshold value is set to decide proximity.
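Equation (4) itself is not reproduced above; based on the description (Euclidean distance between the centroids, normalised by the image diagonal), it presumably takes a form such as:

\[
d_{norm}(A,B)=\frac{\sqrt{(x_{A}-x_{B})^{2}+(y_{A}-y_{B})^{2}}}{\sqrt{W^{2}+H^{2}}} \qquad (4)
\]

where (x_A, y_A) and (x_B, y_B) are the centroids of objects A and B, and W and H are the image width and height. The two objects are treated as "near" when d_norm falls below the chosen threshold.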
These three types of relationships, when used individually, are not sufficient to cover all the important relations present in an RS image (Cicerone and di Felice 2004). Topological, directional, and proximity relations need to be considered together to cover the detailed semantic information present in high-resolution RS images.

Methodology
To implement geospatial relation cognition from the geospatial space conveyed by an HSR remote sensing image scene, geo-objects should first be recognized, and their geospatial relations should be expressed as basic components of the semantic understanding of the image. The overall idea behind the semantic understanding of RS images is shown in Figure 5.
The proposed methodology for the semantic understanding of high spatial resolution RS images using directional relationships is shown in Figure 6. High spatial resolution RS images are taken as input and object detection is carried out as the first step. After identifying the geospatial objects and their locations, directional relationship triplets are generated and represented using a knowledge graph. Semantic descriptions are generated from the relationship triplets using a fixed template. Accuracy assessment is carried out for both object detection and directional relationship identification. Object detection and directional relationship identification are discussed in the next subsections.

Geospatial object detection and labeling
The extraction of geospatial relationships from high spatial resolution remote sensing images requires object detection to be performed first for the identification of geospatial objects. These objects, in the form of closed boundaries, are taken as input for relationship extraction. Various object detection techniques are available, which can be categorized broadly as machine learning-based, deep learning-based, and OBIA (object-based image analysis) (Ahuja and Patil 2021). YOLOv5 object detection is used here to obtain objects in the form of bounding boxes, which also gives the location of each object. This location information is further used in the identification of relations between objects.
YOLO is an object detection technique that partitions the image into a grid; each grid cell is responsible for detecting the objects that fall inside it. YOLO is a single-stage deep learning approach that detects objects using a convolutional neural network and is popular because of its speed and accuracy (Nepal and Eslamiat 2022; Mahendrakar et al. 2021; Yu et al. 2021). YOLOv5 is the most recent and lightest version of the YOLO family and replaces the Darknet framework with the PyTorch framework. Several techniques, such as region-based convolutional neural networks (R-CNN), Faster R-CNN, YOLO, RetinaNet, and the Single Shot Detector (SSD), have been proposed in the literature for object detection in the form of rectangular bounding boxes. Of these, YOLO is chosen because it is the fastest algorithm with comparable accuracy (Zakria et al. 2022).
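As an illustration, the detection step can be run through the PyTorch Hub interface of YOLOv5. This is only a minimal sketch, assuming a model fine-tuned on the NWPU VHR-10 classes; the weights file name and the input image name are hypothetical placeholders, not taken from the paper.

```python
import torch

# Load a YOLOv5 model fine-tuned on NWPU VHR-10 (hypothetical weights file).
model = torch.hub.load("ultralytics/yolov5", "custom", path="nwpu_vhr10_best.pt")
model.conf = 0.65  # confidence threshold selected in the experiments

# Run detection on a single RS image (hypothetical file name).
results = model("sample_scene.jpg")

# Each row holds xmin, ymin, xmax, ymax, confidence, class index, and class name;
# the bounding-box coordinates are what the relationship-identification step consumes.
detections = results.pandas().xyxy[0]
print(detections[["name", "xmin", "ymin", "xmax", "ymax", "confidence"]])
```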

Identification of directional relationship
Once object detection and labeling are completed, the directional relationship identification process starts. The stepwise procedure for calculating the direction relationship between each pair of objects is given below and represented as a block diagram in Figure 7.
(1) Geospatial objects in the form of rectangular bounding boxes with labels and positions are taken as input.
(2) Objects that occur multiple times are renamed to differentiate between them. If an object appears once, its name is kept as it is; when the same type of object appears a second time, it is renamed <objectname1>; if it appears a third time, it is renamed <objectname2>, and so on.
(3) The directional relationship between each pair of objects is then calculated using two approaches (a short code sketch of both is given after this list).
a. Centroid-based approach: (i) The centroid of each object's bounding box is computed. (ii) The angle between the centroids of each pair of objects is found using Equation (6), as shown in Figure 8. If the angle is less than 0, 360 is added, and the result is divided by 45 to map it to one of the eight directions, counted anticlockwise starting from North.
b. Whole-object approach: (i) To find the directional relationship between each pair of objects, one object is considered at a time and the direction of all other objects is calculated with respect to that object. The RS image is first divided into 9 subparts by keeping the object of interest in the centre, as shown in Figure 9. (ii) The direction of every other object is checked with respect to the object of interest, and the direction containing the largest part of that object is assigned. If exactly the same portion of the object lies in two directions, the direction appearing first, going clockwise from North, is selected.
(4) The results of the two methods are compared by matching the relationship triplets generated using the centroid-based method with those generated using the whole-object method.
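The sketch below illustrates both approaches under stated assumptions: the direction labels, the exact 45° bin boundaries of the centroid-based method, and the tie/centre-cell handling of the whole-object method are choices made here to keep the example self-contained, since the paper defines them only informally.

```python
import math

# Labels anticlockwise from East (atan2's zero angle); the exact label-to-bin
# mapping is an assumption -- the paper only states "add 360 if negative,
# then divide by 45".
CENTROID_DIRECTIONS = ["East", "Northeast", "North", "Northwest",
                       "West", "Southwest", "South", "Southeast"]

def centroid(box):
    """Centroid (x, y) of a bounding box (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = box
    return ((xmin + xmax) / 2.0, (ymin + ymax) / 2.0)

def centroid_direction(box_a, box_b):
    """Direction of object B relative to object A from their centroids."""
    (xa, ya), (xb, yb) = centroid(box_a), centroid(box_b)
    # Negate dy because image row indices grow downwards ("North is up").
    angle = math.degrees(math.atan2(-(yb - ya), xb - xa))
    if angle < 0:                      # as in step (3)a(ii)
        angle += 360.0
    return CENTROID_DIRECTIONS[int(angle // 45.0) % 8]

def whole_object_direction(ref_box, other_box, width, height):
    """Direction of other_box w.r.t. ref_box using the whole objects.

    The image (width x height) is split into a 3x3 grid by extending the
    sides of the reference box; the outer cell holding the largest part of
    other_box wins, ties resolved clockwise from North.
    """
    rx1, ry1, rx2, ry2 = ref_box
    ox1, oy1, ox2, oy2 = other_box

    def overlap(lo, hi, a, b):
        # 1-D overlap length between intervals [lo, hi] and [a, b].
        return max(0.0, min(hi, b) - max(lo, a))

    x_band = {"W": overlap(0, rx1, ox1, ox2),
              "C": overlap(rx1, rx2, ox1, ox2),
              "E": overlap(rx2, width, ox1, ox2)}
    y_band = {"N": overlap(0, ry1, oy1, oy2),
              "C": overlap(ry1, ry2, oy1, oy2),
              "S": overlap(ry2, height, oy1, oy2)}

    # Outer cells listed clockwise from North; dict order doubles as the
    # tie-break order described in step (3)b(ii).
    cells = {"North": ("C", "N"), "Northeast": ("E", "N"), "East": ("E", "C"),
             "Southeast": ("E", "S"), "South": ("C", "S"), "Southwest": ("W", "S"),
             "West": ("W", "C"), "Northwest": ("W", "N")}
    best, best_area = None, -1.0
    for label, (xk, yk) in cells.items():
        area = x_band[xk] * y_band[yk]
        if area > best_area:           # strict > keeps the earlier label on ties
            best, best_area = label, area
    return best
```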

Knowledge graph construction
Knowledge graphs, which depict the structural relationships between items, have become a popular research topic in cognition and human intelligence (Ji et al. 2022). A knowledge graph is an organized representation of information that includes entities, relationships, and semantic descriptions. Entities can be real-world things or abstract concepts, relationships indicate how entities are related, and the semantic descriptions of entities and their relationships comprise well-defined types and characteristics. In property graphs, also known as attributed graphs, both nodes and relationships carry attributes.
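As an illustration, the relationship triplets produced in the previous step can be assembled into such a graph with a general-purpose graph library. The sketch below uses NetworkX and a single example triplet taken from the results section; it is not the authors' exact implementation.

```python
import networkx as nx

# Nodes are detected geo-objects; directed edges carry the direction label.
triplets = [
    ("bridge", "Northeast", "ground track field"),  # example from Figure 13
    # ... remaining triplets produced by the direction-identification step
]

kg = nx.DiGraph()
for subj, direction, obj in triplets:
    kg.add_edge(subj, obj, relation=direction)  # nodes are added implicitly

# Edges can then be read back to drive description generation or visualisation.
for subj, obj, data in kg.edges(data=True):
    print(subj, "-", data["relation"], "->", obj)
```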

Semantic description generation
There are various approaches for generating descriptions of RS images (Lu et al. 2018). Since the focus of the proposed methodology is on relationship triplet generation, a simple template-based method is used for semantic description generation. As directional relationships are considered here, the template used is 'The <Object i> is on the <direction> side of the <Object j>'. One sentence is generated for each relationship triplet <Object 1, Direction, Object 2>.
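A minimal sketch of this template-filling step, assuming each triplet is stored as a (subject, direction, object) tuple as in the earlier sketches:

```python
TEMPLATE = "The {subj} is on the {direction} side of the {obj}."

def describe(triplets):
    """Generate one sentence per directional relationship triplet."""
    return [TEMPLATE.format(subj=s, direction=d, obj=o) for s, d, o in triplets]

# describe([("bridge", "Northeast", "ground track field")])
# -> ["The bridge is on the Northeast side of the ground track field."]
```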

Results and discussions
In this section, the dataset and the experimental results obtained by applying the proposed methodology on that dataset are discussed.

Dataset used
NWPU VHR-10 dataset: the very-high-resolution (VHR) remote sensing image dataset constructed by Gong Cheng et al. from Northwestern Polytechnical University (NWPU) is used (Cheng et al. 2014; Cheng, Zhou, and Han 2016). The dataset has 10 classes of objects: airplane, ship, storage tank, baseball diamond, tennis court, basketball court, ground track field, harbor, bridge, and vehicle.

Experimental results
The results of YOLOv5 object detection on images from the NWPU VHR-10 dataset are shown in Figure 10, where objects are labeled and their positions are located.
The performance of object detection is tested with different confidence scores (thresholds), and the accuracy is calculated as shown in Tables 1, 2, and 3 for threshold values of 0.7, 0.65, 0.6, and 0.55, respectively. Although the overall accuracy is higher when the threshold is 0.6, lowering the threshold causes more false positive objects to be detected, which does not lead to accurate semantic description generation. Therefore, the optimum threshold is set at 0.65, where the maximum number of objects is detected correctly and fewer false negatives are generated.
After the detection of geospatial objects in the form of bounding boxes, the class and position of every detected object in each image are obtained (Table 2). If the same type of object appears more than once in an image, the repeating objects are renamed as shown in Figure 11. Table 3 shows the directions and angles calculated using the centroid-based method for each pair of objects, and Table 4 shows the corresponding directional relationship triplets. The knowledge graph representing the directional relations is shown in Figure 12, where nodes represent objects and edges represent directions.
Template-based descriptions are then generated for the directional relationship triplets, as shown in Figure 13. The template is: 'The <Object1> is on the <Relationship> side of the <Object2>'. The second approach for finding the directional relationship considers whole objects instead of only the centroids of the objects. In this approach, the RS image is divided into 9 subparts, keeping the object of interest in the centre. One object at a time is kept as the reference, and the directions of all other objects are calculated with respect to the reference object; the same process is repeated for every object present in the image. Results obtained using the second approach are shown in Figure 14. A total of four objects are detected in the image: bridge, ground track field, baseball diamond, and baseball diamond1. The bridge is taken as the reference object in Figure 14(a), and the directions of the other three objects are identified with respect to it. The image is divided into 9 sections based on the coordinates of the reference object, and one of the 8 directions is chosen based on the presence of the other object in that direction. If an object is partly located in more than one direction relative to the reference object, the direction containing the largest part of the object is chosen. Similarly, all relations are identified by keeping the other objects as reference, as shown in Figure 14(b-d). Figure 15 shows the results of directional relationship identification on more sample images.
The triplets generated using the whole-object-based approach are matched with those generated using the centroid-based approach. The matching and non-matching triplets between the two approaches are shown in Table 5. The reason for the non-matching triplets is that the directional relationships generated using the centroid-based method are symmetric, since they are point-point relations between the centroids of two objects, while those generated by the second approach, which considers whole objects, may not be symmetric in all cases because they are region-region relations between two polygons. In the centroid-based approach (Figure 13), 'The bridge is on the Northeast side of the ground track field' implicitly indicates that 'The ground track field is on the Southwest side of the bridge', but this is not true for all relations generated by considering whole objects (Figure 14).

Conclusion
A semantic understanding of high spatial resolution remote sensing images using directional geospatial relationships is proposed in this paper. Different kinds of geospatial relations exist in RS scene images, such as topological, directional, and proximity relations, and representing RS images using these relations is an effective means of understanding HSR remote sensing images. Two approaches have been used for identifying directional relationships from an RS image: the first is centroid-based and the second considers the whole object when calculating the directions of all other objects with respect to one object at a time. Although both approaches produce accurate directional relationships, the results generated using the second approach are more logical, since it takes whole objects into consideration rather than a single point (the centroid) while calculating the directions. The calculation of directions using the centroid-based approach is simpler, and the directions between all object pairs are identified in a single pass, whereas the second approach is comparably slow.
Although the proposed approach is effective in generating geospatial directional relationships between objects from a high-resolution RS image, a few limitations remain. For example, the method depends on detecting objects in the form of rectangular bounding boxes, and further research is needed to identify objects in their actual shapes. In future work, we would also like to represent all types of geospatial relationships from high spatial resolution remote sensing images in the form of semantic descriptions.

Disclosure statement
No potential conflict of interest was reported by the authors.

Funding
Authors declare that there is no funding received for the above research work.