GeoGlue: feature matching with self-supervised geometric priors for high-resolution UAV images

ABSTRACT We present GeoGlue, a novel method for accurate feature matching in high-resolution UAV imagery, a task that is normally challenging due to scene complexity. Current feature detection methods operate without the guidance of geometric priors (e.g., geometric lines), paying insufficient attention to salient geometric features, which are indispensable for accurate matching because of their stable existence across views. In this work, geometric lines are first detected by a CNN-based geometry detector (GD) that is pre-trained in a self-supervised manner on automatically generated images. The detected lines are then vectorized directly from the GD output, so that non-significant features can be discarded on the basis of their disordered geometric morphology. A graph attention network (GAT) spanning the image pair, informed by the geometric priors from GD, performs the final feature matching. Comprehensive experiments show that GeoGlue outperforms other state-of-the-art methods in feature-matching accuracy and stability, achieving pose estimation with maximum rotation and translation errors under 1% in challenging scenes from the benchmark datasets Tanks & Temples and ETH3D. This study also proposes the first self-supervised deep-learning model for curved-line detection, generating geometric priors that focus matching attention on prominent features and improve the visual quality of 3D reconstruction.


Introduction
Stereovision-based 3D reconstruction aims to rebuild a virtual 3D scene containing entities that are consistent with input images (Schönberger and Frahm 2016; Wei et al. 2020; Huang et al. 2018a). Accurate feature matching from different views is the cornerstone for excavating 3D information, by which the precise reconstruction of substructures and surfaces of 3D objects can be realized under the epipolar geometry model (Zhang 1998). Current methods, i.e. both conventional and learning-based feature-matching methods, obey a multi-step process (He et al. 2018; Sarlin et al. 2020). First, a feature detector finds distinctive keypoints within adjacent areas. Second, local descriptors for the corresponding keypoints are computed based on their locations, local image features, and even their spatial distribution features, and matches are generated through algorithms such as nearest-neighbor search (Cover and Hart 1967). However, these methods are prone to performing poorly on complex scenes due to ambiguities in regions displaying repetitive patterns (e.g. regularly arranged floor tiles) or low-texture regions (e.g. bare land) without common visual features that can be easily located (e.g. segments, junctions, corners, etc.). These factors result in mismatches that lead to a final 3D reconstruction product that is far from satisfactory. In general, these limitations are caused by the lack of a mechanism for intelligent feature discrimination in the above-mentioned challenging areas. We propose a novel feature-matching method named GeoGlue that leverages salient geometric elements using self-supervised techniques.
Literature review

Feature matching methods

As discussed above, a robust feature detection module is a key component of the feature-matching pipeline, building distinctive representations for each found feature (e.g. keypoints). Before the emergence of learning-based methods, hand-crafted detectors such as SIFT (Lowe 2004) and SURF (Bay et al. 2008) were widely used and proved successful for feature registration tasks. They contain complex operations for dealing with viewpoint changes. Methods such as BRIEF (Calonder et al. 2010), BRISK (Leutenegger, Chli, and Siegwart 2011), and ORB (Rublee et al. 2011) improve computational efficiency by providing binary descriptors that retain a competitive performance against floating-point descriptors. These methods have been employed in many applications such as image retrieval, simultaneous localization and mapping (Mur-Artal and Tardós 2017), and real-time aerial image mosaicing (Li et al. 2014; Wang et al. 2017; de Lima and Martinez-Carranza 2017; de Lima, Cabrera-Ponce, and Martinez-Carranza 2021).
Learning-based feature-matching methods such as LIFT (Yi et al. 2016), DGC-Net (Melekhov et al. 2019), and MatchNet (Han et al. 2015) have emerged with the remarkable progress of deep learning. These methods approach feature detection and matching for images with viewpoint changes in a supervised manner, getting rid of hand-engineered representations and traditional methods such as SIFT, SURF, and ORB. SuperPoint (DeTone, Malisiewicz, and Rabinovich 2018), which is employed in our method, builds a two-branch convolutional neural network (CNN) composed of one encoder and two decoders to detect interest points and provide descriptors in a single model. Specifically, SuperPoint proposes a self-supervised pipeline for model training to achieve effective feature detection and matching without manual annotation. This inspired us to develop a self-supervised model named geometry detector (GD) to provide geometric priors for robust feature matching. However, CNN-based methods (e.g. SuperPoint, DeepDesc (Simo-Serra et al. 2015), and Quad-networks (Savinov et al. 2017)) are prone to failure in the presence of large areas with low texture or repetitive patterns due to the CNN's finite receptive field, which prevents the spatial context awareness needed for matching in difficult regions. Fortunately, SuperGlue (Sarlin et al. 2020), a method based on the graph neural network (GNN) technique, establishes global context perception by spanning the attention of keypoint locations and features across the image pair. Moreover, the method generates high-quality descriptors that fuse extensive adjacent-area information. In contrast with other learning-based methods (Luo et al. 2019; Ono et al. 2018; Dusmanu et al. 2019; Revaud et al. 2019), SuperGlue significantly enhances feature-matching accuracy and robustness. However, an obvious defect still exists in its geometry extraction efficiency and feature-matching results, especially for unmanned aerial vehicle (UAV) imagery. The interest points generated by SuperPoint are not capable of fully excavating geometric elements, especially curved lines in UAV images, resulting in scattered keypoint proposals. A recently proposed method known as LoFTR (Sun et al. 2021), based on Transformer (Vaswani et al. 2017), partly addresses the above-mentioned problem. It generates evenly distributed keypoints within the image pair under a coarse-to-fine strategy and thus derives matches with relatively high density. However, it is considered inflexible due to the constraints of its advance image partitioning, which causes the loss of geometric detail.

Line detection methods
The 3D objects in high-resolution UAV images are commonly complex in type and size, since artificial objects (e.g. cars and buildings) and natural objects (e.g. trees and lakes) often appear randomly or jointly. In particular, corner points, the focus of SuperPoint, are salient features used to determine object structures, while lines can hold more importance in structural descriptions in the presence of blur and occlusion. Traditional methods such as Prewitt (Prewitt 1970), Sobel (Kittler 1983), and Canny (Canny 1986) focus on pixel-level gradients and employ thresholding to achieve line detection. There is also numerous research offering practical solutions for straight-line segment detection. Among early algorithms, the Hough transform (Ballard 1981) was proposed to search for straight lines in a discretized Hough space into which selected edge points are projected. Many subsequent works such as LSD (Grompone von Gioi et al. 2010), EDLines (Akinlar and Topal 2011), FLD (Lee et al. 2014), and CannyLines (Lu et al. 2015) have been proposed to enhance line detection performance and computational efficiency. Notably, LSD is one of the most popular line segment detectors. It uses region-growing and gradient-based thresholding strategies to reduce computational complexity and detection errors. In addition, CNNs have recently been introduced for line segment detection and have shown striking success (Huang et al. 2018b; Xue et al. 2019; Zhou, Qi, and Ma 2019; Xue et al. 2020). ULSD (Li et al. 2021) even unifies line segment detection across diverse sensor platforms by integrating a novel equipartition point-based Bezier curve representation and learning-based point regression to tackle challenges from distorted line segments. However, in the case of UAV platforms, detection methods that can handle ubiquitous edges, including both curved and straight lines, are more suitable. This is because some application scenes, for example nature scenes (e.g. a bird's-eye view of forests, mountains, etc.), do not contain artifacts with distinct straight lines. A large effort has been made to research edge detection with learning-based methods, e.g. DeepEdge (Bertasius, Shi, and Torresani 2015), CASENet (Yu et al. 2017), BDCN (He et al. 2022), and EDTER (Pu et al. 2022). These methods have become mainstream and outperform classical methods such as Prewitt, Sobel, and Canny, which use hand-designed heuristics. Nevertheless, learning-based methods are limited by their training data and tend to extract outer contours, leading to incomplete line recognition on the main bodies of objects. Furthermore, these methods do not perform keypoint sampling on the extracted edges, and thus they cannot be directly used for feature matching.

Contributions
The main contributions of this study are as follows. (i) We propose the first self-supervised deep-learning model (GD) for realizing edge detection of both straight and curved lines. The GD model aims to fully excavate the geometric features displayed in images to guide subsequent feature matching. (ii) We propose a self-supervised training scheme to endow the deep-learning model with edge detection ability and resistance to noise. (iii) We develop a GPU parallel algorithm for vectoring the extracted lines from raster imagery. This enables fast redundant-line filtering to offer quality geometric priors from which keypoints are sampled for accurate feature matching. Incidentally, adjacent relationships are also constructed among interest points, which can be valuable priors for surface mesh reconstruction after feature matching. (iv) In general, we propose an integration strategy between quality self-supervised geometric priors and a neural network framework with spatial context awareness (i.e. SuperGlue), named GeoGlue in this paper, which is demonstrated to be effective through comprehensive experiments against state-of-the-art methods.
The remainder of this paper is structured as follows. Section 2 illustrates the self-supervised line detection method used for generating the quality geometric priors that guide feature matching in the next step. Section 3 articulates the principles of the proposed GeoGlue, which performs feature matching under guidance from geometric priors. Extensive experiments are presented in Section 4, including a discussion of the remaining issues. Finally, Section 5 summarizes the research and presents future directions for improvement.

Self-supervised geometric prior generation
In the procedure for GeoGlue, the input image pair is first handled by the GD model, which attempts to exploit all quality lines, including straight or curved lines, to guide the subsequent feature-matching process (Section 3).
Similar to SuperPoint (DeTone, Malisiewicz, and Rabinovich 2018) and SOLD² (Pautrat et al. 2021), the GD model has a self-supervised training pipeline (Section 2.1) that produces pseudo images containing geometry objects and labels for training. The difference is that GD obtains the ability to detect curved lines based on supervision from geometry data comprising curved lines and sampled keypoints, whereas SuperPoint can only detect interest points and SOLD² can detect straight-line segments but not curved lines. After training, GD is transferred to real UAV imagery, and keypoints are sampled from the detected lines. Specifically, a GPU parallel algorithm provides the sampled keypoints that depict the shapes of the detected lines along with adjacent relationships (Section 2.2), which act as geometric priors for feature matching (Section 3).

Self-supervised line detection
As shown in Figure 1, the architecture for self-supervised line detection is composed of two stages, namely (1) self-supervised training and (2) line detection in real scenes. Once GD can stably detect the lines in computer-simulated images with randomly generated geometries (see Figure 2), it can be naturally adapted to real-scene imagery, since visual features in the real scene tend to be covered by the numerous pseudo images within the datasets. The essential steps include (i) pseudo image generation, (ii) image pre-processing, and (iii) model training, and are articulated in the following subsections.

Pseudo image generation
First, several types of geometry templates were designed in advance for random pseudo-image generation to simulate the different patterns, distributions, and aggregation levels of geometric elements within real-scene images. For example, as Figure 2 shows, line-type and stripe-type templates represent lines with prominent lengths. Star-type and checkerboard-type templates cover cases where lines are relatively concentrated, with more junctions than the former. In addition, random affine transformations were performed on the geometries so that their sizes and densities were randomized in the pseudo images, ensuring the robustness of the GD model in real scenes. The ability of GD to detect curved lines was trained by randomly selecting straight lines from a geometry template and replacing them with curved lines. Figure 2 shows the curving operation performed on a straight line based on Lagrange interpolation. The keypoints determining the shape of the curved line are placed at random. Note that the points sampled along the curved line are chosen as the keypoint labels (Figure 3), with gaps of 4 pixels in the straight-line state (Figure 2(b)). Notably, the supervision of junctions (i.e. endpoints of lines) and sampled points is weighted: all pixels in a 3 × 3 matrix belonging to a junction are labeled as 1, while the sampled points are marked based on their pixel coordinates (see Figure 3 and Appendix A).
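To make the curving operation concrete, the following minimal NumPy sketch fits a Lagrange polynomial through randomly perturbed control points and samples label points along the curve. The control-point count, offset range, and function names are our own illustrative assumptions; only the 4-pixel sampling gap comes from the text.

```python
import numpy as np

def lagrange_curve(xs, ys, x_query):
    """Evaluate the Lagrange interpolation polynomial through (xs, ys) at x_query."""
    x_query = np.asarray(x_query, dtype=float)
    result = np.zeros_like(x_query)
    for i in range(len(xs)):
        term = np.full_like(x_query, ys[i])
        for j in range(len(xs)):
            if j != i:
                term *= (x_query - xs[j]) / (xs[i] - xs[j])
        result += term
    return result

def curve_from_segment(p0, p1, n_ctrl=3, max_offset=10.0, step=4, rng=None):
    """Replace the straight segment p0 -> p1 with a curved line: interior control
    points are perturbed off the segment, a Lagrange polynomial is fitted, and
    points are sampled every `step` pixels along x as keypoint labels."""
    rng = rng or np.random.default_rng()
    xs = np.linspace(p0[0], p1[0], n_ctrl + 2)                # endpoints + controls
    ys = np.linspace(p0[1], p1[1], n_ctrl + 2)
    ys[1:-1] += rng.uniform(-max_offset, max_offset, n_ctrl)  # random curve shape
    x_samples = np.arange(p0[0], p1[0] + 1, step)             # 4-pixel gaps
    y_samples = lagrange_curve(xs, ys, x_samples)
    return np.stack([x_samples, y_samples], axis=1)           # sampled keypoint labels
```

Because Lagrange interpolation passes exactly through its nodes, the endpoints of the original segment are preserved while the interior bends randomly.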
Lastly, the strategy in SuperPoint (DeTone, Malisiewicz, and Rabinovich 2018) was adopted to generate random backgrounds for the pseudo images (Figure 3). Gaussian blurring was also applied to the generated images to enhance the ability of GD to detect lines. Randomly generated images containing only Gaussian noise were evenly inserted into the created dataset so that GD could achieve maximal resistance to noise within real images.

Image pre-processing
For better training and faster convergence, image pre-processing was performed on the generated pseudo images described in Section 2.1.1 before the training step shown in Figure 1. Figure 3 shows the two-step procedure of image pre-processing for model training, which includes gradient computing and masking. Specifically, gradient computing provides coarse edge priors for the initial stage of model training, which can be considered hints that accelerate the training process. Masking is an operation that replaces part of the pixels with zero values after gradient computing. The details of the image pre-processing algorithm are as follows.
First, the Sobel (Cristina and Holban 2013) and Laplacian operators are jointly adopted for image gradient computing. The two employed Sobel masks are:

$$s_v = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}, \qquad s_h = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}, \tag{1}$$

where $s_v$ and $s_h$ respectively correspond to the image gradient in the vertical and horizontal directions, and the two Laplacian masks (the 4- and 8-neighborhood variants) are denoted by:

$$l_1 = \begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix}, \qquad l_2 = \begin{bmatrix} 1 & 1 & 1 \\ 1 & -8 & 1 \\ 1 & 1 & 1 \end{bmatrix}. \tag{2}$$

The receptive field for each image coordinate $(x, y)$ is defined in advance as the 3 × 3 pixel patch

$$R_{x,y} = \{ I_{x+u,\, y+v} \mid u, v \in \{-1, 0, 1\} \}, \tag{3}$$

where $I_{x,y} \in [0, 255]$ refers to the pixel value at coordinate $(x, y)$ of image $I$.

To accomplish multi-operator gradient computing, the first operation on the image, with respect to the Sobel operator, is given as:

$$G_s(x, y) = \left| \langle s_v, R_{x,y} \rangle \right| + \left| \langle s_h, R_{x,y} \rangle \right|, \tag{4}$$

where $\langle \cdot, \cdot \rangle$ denotes the element-wise sum-product of a mask and the patch. The second operation is processed using the Laplacian operator and is given by:

$$G_l(x, y) = \left| \langle l_1, R_{x,y} \rangle \right| + \left| \langle l_2, R_{x,y} \rangle \right|. \tag{5}$$

Lastly, the result of image gradient computing is derived by:

$$G_{s+l}(x, y) = G_s(x, y) + G_l(x, y). \tag{6}$$

Second, in the masking step, the regular pattern of pixels selected to be replaced by zero values is defined by Equation (7), where $m_{x,y} \in \{0, 1\}$ represents whether the image coordinate $(x, y)$ is masked or not.

Finally, the pre-processed layer shown in Figure 3 is derived by applying the mask to the gradient map:

$$G(x, y) = (1 - m_{x,y}) \cdot G_{s+l}(x, y). \tag{8}$$

The input data $X$ for GD training (Figure 3) is constructed by:

$$X = [\, G \,\|\, I \,], \tag{9}$$

where $[\, \cdot \,\|\, \cdot \,]$ denotes the layer concatenation operation.
Referring to Equation (6), gradient computing activates the image areas occupied by the edges of geometries and outputs the feature map G_{s+l}. The masking operation based on Equations (7) and (8) then enriches the textures of the activated areas in G_{s+l}, while the inactivated areas without edge gradients tend to remain unchanged. This further enlarges the difference between areas with and without geometric features and benefits model training.
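The gradient-computing and masking steps above can be sketched in NumPy as follows. The Sobel and Laplacian kernels are the standard ones; the checkerboard-style zero pattern is only a hypothetical stand-in for the regular mask the paper defines in Equation (7).

```python
import numpy as np

SOBEL_V = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], float)
SOBEL_H = SOBEL_V.T
LAPLACE_4 = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], float)
LAPLACE_8 = np.array([[1, 1, 1], [1, -8, 1], [1, 1, 1]], float)

def conv3(img, kernel):
    """3x3 sliding-window correlation with zero padding; since only absolute
    responses are used below, the correlation/convolution distinction is
    immaterial for these kernels."""
    padded = np.pad(img, 1)
    out = np.zeros_like(img, dtype=float)
    h, w = img.shape
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

def preprocess(image, mask_period=2):
    """Gradient computing (Sobel + Laplacian magnitudes) followed by masking,
    returning the two-layer training input X = [G || I]."""
    img = image.astype(float)
    g = (np.abs(conv3(img, SOBEL_V)) + np.abs(conv3(img, SOBEL_H))
         + np.abs(conv3(img, LAPLACE_4)) + np.abs(conv3(img, LAPLACE_8)))
    yy, xx = np.mgrid[0:img.shape[0], 0:img.shape[1]]
    masked = (xx + yy) % mask_period == 0   # hypothetical regular zeroing pattern
    g = np.where(masked, 0.0, g)
    return np.stack([g, img])               # two layers: masked gradient + image
```

On a flat image the gradient layer is zero everywhere, while any intensity step produces activated (and partially zeroed) responses, which is the texture-enriching effect described above.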

Model inference and training
In the GD training process, the pre-processed two-layer image X (Section 2.1.2) with spatial resolution h × w is taken as input to the backbone encoder, which is connected to a keypoint decoder and a line decoder that output the predicted keypoint map HP^c and line map HL, respectively (Figure 3). The backbone encoder produces an h/8 × w/8 × 256 feature map. The keypoint branch, referring to the junction branch in SOLD², consists of a 3 × 3 convolution layer followed by a 1 × 1 convolution layer with 65 channels that outputs HP^c with a size of h/8 × w/8 × 65. Finally, the line branch is composed of two consecutive blocks, each containing a 3 × 3 convolution layer followed by batch normalization, ReLU, and a ×2 subpixel shuffle (Shi et al. 2016). The final operation of the line branch is a 1 × 1 convolution layer with 1 channel followed by sigmoid activation, deriving the line map HL with size h × w.
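A PyTorch sketch of the two decoder heads is given below. The intermediate channel widths and the final bilinear upsampling used to reach the full h × w resolution are our own assumptions; only the 65-channel keypoint output, the conv/BN/ReLU/pixel-shuffle blocks, and the 1-channel sigmoid output come from the text.

```python
import torch
import torch.nn as nn

class GDHeads(nn.Module):
    """Sketch of the GD decoders: a keypoint branch (3x3 conv -> 1x1 conv with
    65 channels) over the h/8 x w/8 x 256 backbone map, and a line branch of two
    conv/BN/ReLU/(x2 pixel-shuffle) blocks ending in a 1-channel sigmoid map."""
    def __init__(self, in_ch=256):
        super().__init__()
        self.keypoint = nn.Sequential(
            nn.Conv2d(in_ch, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 65, 1),          # 64 cell positions + 1 "no keypoint" bin
        )
        def up_block(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, 4 * c_out, 3, padding=1),
                nn.BatchNorm2d(4 * c_out), nn.ReLU(inplace=True),
                nn.PixelShuffle(2),         # x2 subpixel shuffle (Shi et al. 2016)
            )
        self.line = nn.Sequential(
            up_block(in_ch, 128), up_block(128, 64),
            # two x2 shuffles reach h/2; a final x2 upsample (our assumption)
            # restores the full h x w resolution
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(64, 1, 1), nn.Sigmoid(),
        )

    def forward(self, feat):
        return self.keypoint(feat), self.line(feat)
```

For a 64 × 64 input image, the backbone map would be 8 × 8 × 256, the keypoint map 8 × 8 × 65, and the line map 64 × 64 × 1.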
As illustrated in Figure 3, GD is trained with supervision from the keypoint and line labels using the loss functions defined by Equations (10) and (11), respectively:

$$L_p = \frac{1}{(h/8)(w/8)} \sum_{i,j} -\log \frac{\exp\!\big(HP^c_{i,j,\,y_{i,j}}\big)}{\sum_{k=1}^{65} \exp\!\big(HP^c_{i,j,k}\big)}, \tag{10}$$

$$L_{line} = -\frac{1}{hw} \sum_{i,j} \Big[ HL^{GT}_{i,j} \log HL_{i,j} + \big(1 - HL^{GT}_{i,j}\big) \log\big(1 - HL_{i,j}\big) \Big]. \tag{11}$$

Note that $y \in \{1, \ldots, 65\}^{(h/8) \times (w/8)}$ in Equation (10) denotes the ground-truth keypoint position index for each patch, and $HP^c_{i,j,\,y_{i,j}}$ is the corresponding activation of the keypoint map. $HL^{GT} \in \{0, 1\}^{h \times w}$ in Equation (11) represents the ground-truth line map. Finally, the multi-task learning approach proposed by Cipolla, Gal, and Kendall (2018) was adopted for jointly training the two branches, within which the dynamic parameters $w_p$ and $w_l$ are optimized during training to adaptively weigh the losses $L_p$ and $L_{line}$. The overall objective thus becomes:

$$L = \frac{1}{2w_p^2} L_p + \frac{1}{2w_l^2} L_{line} + \log w_p + \log w_l. \tag{12}$$
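A sketch of the joint objective in PyTorch follows; we parameterize the uncertainty weighting of Cipolla, Gal, and Kendall (2018) with learnable log-variances, which is one common formulation and may differ in detail from the paper's Equation (12).

```python
import torch
import torch.nn.functional as F

def gd_loss(hp, hp_labels, hl, hl_gt, log_sp, log_sl):
    """Joint GD objective sketch: cross-entropy keypoint loss over 65 classes
    (Equation (10)), binary cross-entropy line loss (Equation (11)), combined
    with homoscedastic-uncertainty weighting. log_sp / log_sl are the learnable
    log-variance parameters (our stand-ins for the paper's w_p and w_l)."""
    l_p = F.cross_entropy(hp, hp_labels)            # keypoint branch loss
    l_line = F.binary_cross_entropy(hl, hl_gt)      # line branch loss
    # exp(-log_s) weighs each task; the additive log_s terms act as regularizers
    return (torch.exp(-log_sp) * l_p + log_sp
            + torch.exp(-log_sl) * l_line + log_sl)
```

In training, `log_sp` and `log_sl` would be registered as `nn.Parameter`s so the optimizer balances the two branches automatically.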

GPU-based line vectoring and filtering
As discussed in Section 2.1.3, the straight or curved lines detected from a source image using the trained GD are output as a keypoint map and a line map (Figure 3), which respectively provide the positions and adjacent relationships of the keypoints. In this work, it is necessary to retain stable geometric features (e.g. building contours) and exclude trivial lines (e.g. ripples in lakes) to generate quality geometric priors and accomplish accurate feature matching for high-resolution UAV imagery. Hence, the line detection results need to be converted from raster to vector format so that efficient morphology analysis, such as line-length computation, can be conducted to filter out trivial lines even when GPU memory is limited.
A novel GPU parallel algorithm was developed based on CUDA to achieve high-performance vectorization of UAV imagery with dense predicted keypoints. Figure 4 illustrates the procedure of the vectoring algorithm. First, each predicted keypoint is allocated to an individual CUDA thread for further processing. Second, during CUDA runtime, each thread simultaneously performs a breadth-first search along the paths provided by the predicted line map until it meets another thread; the thread index is recorded at the traversed coordinate in each step. Finally, the adjacent relationship between a keypoint pair can be easily constructed if the two indices recorded at neighboring coordinates are different.
As a prerequisite for the vectoring algorithm, the output keypoint map HP^c is first translated to HP ∈ R^{h×w}. Keypoint candidates are filtered from HP using an activation threshold φ_p, and the final keypoints are derived through non-maximum suppression of HP under a distance constraint of ε = 8 pixels. Finally, the paths for the breadth-first search (Figure 4(b) and (c)) are extracted from the output line map using an activation threshold φ_l.
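The vectoring algorithm can be mimicked sequentially as below: a single FIFO queue interleaves the per-keypoint searches, standing in for the parallel CUDA threads. The function name and the 4-connected neighborhood are our assumptions.

```python
import numpy as np
from collections import deque

def vectorize(keypoints, line_map):
    """Sequential stand-in for the GPU vectoring algorithm (Section 2.2): each
    keypoint seeds a region that grows by breadth-first search along activated
    line pixels; an adjacency is recorded where two regions meet."""
    h, w = line_map.shape
    owner = -np.ones((h, w), int)        # which keypoint's "thread" reached a pixel
    queue = deque()
    for idx, (y, x) in enumerate(keypoints):
        owner[y, x] = idx
        queue.append((y, x))
    edges = set()
    while queue:
        y, x = queue.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and line_map[ny, nx]:
                if owner[ny, nx] == -1:
                    owner[ny, nx] = owner[y, x]
                    queue.append((ny, nx))
                elif owner[ny, nx] != owner[y, x]:
                    # two searches met: the keypoints are adjacent along a line
                    edges.add(tuple(sorted((owner[y, x], owner[ny, nx]))))
    return edges
```

For three keypoints on one detected line, only the two consecutive pairs become adjacent; the two endpoints are never linked directly, which is the desired chain topology.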

Feature matching with geometric priors
The feature matching problem for a UAV image pair can be viewed as the process of matching a pair of graphs, since the spatial distribution of keypoints is flexible and the final matched keypoint pairs satisfy a certain projection relationship. Hence, the GNN technique is well suited to such problems, with a data structure that treats keypoints as nodes and models keypoint relations as edges.
Motivation. A novel feature-matching method named GeoGlue is proposed in this paper to achieve precise feature matching for UAV imagery. The method combines SuperGlue (Sarlin et al. 2020) with the self-supervised geometric priors of Section 2. First, SuperGlue realizes feature matching in a human-like fashion, distinguishing salient keypoints for matching based on spatial contextual cues integrated from the visual and distributional patterns of co-visible keypoints. Additionally, a GNN structure with stacked self- and cross-attention layers performs iterative node aggregation to encode contextual clues into the node descriptors, enhancing keypoint discrimination. Second, the geometric priors provided by the self-supervised model GD (Section 2) are employed for keypoint proposal, which is more meticulous and comprehensive than the original SuperGlue pipeline built on SuperPoint (DeTone, Malisiewicz, and Rabinovich 2018). Moreover, adequate adjacent candidates can be used for fine-tuning, since the keypoints extracted from the geometric priors are uniformly distributed over the whole bodies of the detected lines, which is instrumental for optimal feature matching. Furthermore, keypoint candidates with prominent geometric properties are also advantageous for building the spatial context awareness needed to encode distinctive descriptors for matching.
Feature reasoning. The architecture for feature reasoning with GD and SuperGlue, including keypoint generation and description, is shown in Figure 5. For feature matching, the two images to be matched (i.e. image Q and image S) are processed as inputs to both GD and SuperPoint, and the two output keypoint groups are combined into one set for each image. The former focuses on meticulous geometry extraction, while the latter provides balanced attention over the whole image. The image coordinates of the keypoints are represented by $p = \{(x_i, y_i)\}_{i=1}^{n}$, where $n$ denotes the total number of keypoints from image Q or S. For the further computation of keypoint descriptors based on the GAT, the confidence values $c = \{c_i \in \mathbb{R}\}_{i=1}^{n}$ and initial visual descriptors $d = \{d_i \in \mathbb{R}^D\}_{i=1}^{n}$ of the keypoints are sampled from the feature maps produced by SuperPoint (DeTone, Malisiewicz, and Rabinovich 2018) in accordance with the positions $p$. Then, the initial node state $\omega_i^{(0)}$ of keypoint $i$ is obtained by integrating the keypoint's appearance, location, and certainty, as shown in Equation (13):

$$\omega_i^{(0)} = d_i + \mathrm{MLP}_{enc}([\, x_i \,\|\, y_i \,\|\, c_i \,]), \tag{13}$$

where $\|$ denotes the concatenation of two vectors. After node state initialization, these embeddings are taken as input for a stack of self- and cross-attention layers, as shown in Figure 5. Specifically, the data processed above can be considered a graph in which the nodes represent the keypoints from images Q and S and the edges reflect their associations. In a self-attention layer, each node is connected with all other nodes in the same image; conversely, each node in a cross-attention layer is connected with all of the nodes of the other image (Figure 6). The graph reasoning aspect of the GAT is composed of stacked self- and cross-attention layers repeated $L$ times (Figure 5), which iteratively perform information aggregation (Equation (14)) among the linked nodes to encode contextual clues (e.g. the pattern of the local distribution and visual information from neighbors) into the node descriptors. The mathematical principle of graph reasoning is as follows. To update the node embeddings, another MLP, named $\mathrm{MLP}_{Msg}$, fuses the intermediate embeddings $\omega^{(l-1)}$ and messages $\{m_{A,i}\}_{i=1}^{n}$ into new encodings through vector concatenation, where $A = \{(i, j) \in \{1, \ldots, n\} \times \{1, \ldots, n\}\}$ represents the edges among nodes. The graph reasoning process of the self- or cross-attention layer $l$ is summarized by Equation (14):

$$\omega_i^{(l)} = \omega_i^{(l-1)} + \mathrm{MLP}_{Msg}\big(\big[\, \omega_i^{(l-1)} \,\|\, m_{A,i} \,\big]\big). \tag{14}$$

Messages $\{m_{A,i}\}_{i=1}^{n}$ are high-dimensional vectors aggregated from the neighbors of each node. For neighbor information retrieval, the embeddings of node $i$ and its neighbor node $j$ are first transformed into vectors $q_i$, $k_j$, and $v_j$ through linear projection (Sarlin et al. 2020), as shown by Equations (15) and (16):

$$q_i = W_1 \omega_i^{(l-1)} + b_1, \tag{15}$$

$$[\, k_j \,\|\, v_j \,] = W_2 \omega_j^{(l-1)} + b_2. \tag{16}$$

Then, the messages $\{m_{A,i}\}_{i=1}^{n}$ are derived with the weighted-sum method illustrated by Equation (17):

$$m_{A,i} = \sum_{j : (i,j) \in A} a_{ij} v_j, \tag{17}$$

where $\{a_{ij}\}_{j:(i,j)\in A}$ denotes the attention weights allocated by the softmax function based on the similarities between node $i$ and its neighbors:

$$a_{ij} = \mathrm{Softmax}_j \big( q_i^{\top} k_j \big). \tag{18}$$
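For illustration, one attention aggregation step can be written as below. This is a dense single-head sketch; the √D scaling and the linear-layer shapes are our assumptions, and the real model stacks L such layers with separate self- and cross-attention wiring.

```python
import torch

def attention_update(w, w_src, q_proj, k_proj, v_proj, msg_mlp):
    """One self-/cross-attention step: queries come from the nodes being updated
    (w), keys/values from w_src (the same image for self-attention, the other
    image for cross-attention); messages are attention-weighted sums of values."""
    q = q_proj(w)                                              # Equation (15)
    k, v = k_proj(w_src), v_proj(w_src)                        # Equation (16)
    a = torch.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)    # weights, Eq. (18)
    m = a @ v                                                  # messages, Eq. (17)
    return w + msg_mlp(torch.cat([w, m], dim=-1))              # update, Eq. (14)
```

Self-attention corresponds to calling the function with `w_src = w`; cross-attention passes the node embeddings of the other image.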
Lastly, the reasoning results $\omega^{(L)}$ are output by the $L$-th layer after $L$ iterations of information diffusion across the self- and cross-attention graphs, and the ultimate node descriptors $\{f_i\}_{i=1}^{n}$ are obtained by linear projection:

$$f_i = W \omega_i^{(L)} + b. \tag{19}$$

Keypoint matching. Given the final descriptors of the keypoints within images Q and S, the Sinkhorn algorithm is employed to solve the keypoint matching problem as an optimal transport problem (Sinkhorn and Knopp 1967). The algorithm is differentiable and enables end-to-end training, as described in SuperGlue (Sarlin et al. 2020), to derive an optimized feature-matching ability. The algorithm is articulated as follows.
Supposing that the keypoint numbers of images Q and S are $n_Q$ and $n_S$, respectively, a matrix $M \in \mathbb{R}^{n_Q \times n_S}$, which stores the similarity score of each candidate pair $(i, j)$, is first computed by:

$$M_{i,j} = \langle f_i^Q, f_j^S \rangle, \tag{20}$$

where $\langle \cdot, \cdot \rangle$ denotes the inner product. To account for unmatched keypoints, an augmented score matrix $\bar{M} \in \mathbb{R}^{(n_Q+1) \times (n_S+1)}$ is set up from $M$ by adding a new row and column filled with a learnable parameter (Sarlin et al. 2020). Then, a probability matrix $P \in [0, 1]^{(n_Q+1) \times (n_S+1)}$ is defined as the objective indicating the final matching results, and $P$ is treated as an assignment matrix optimized to maximize the transmission value (i.e. $\sum_{i,j} \bar{M}_{i,j} P_{i,j}$) under the two constraints on the marginal probabilities:

$$P \mathbf{1}_{n_S+1} = a, \qquad P^{\top} \mathbf{1}_{n_Q+1} = b,$$

where $a$ and $b$ denote the marginal parameters to be solved. Finally, the optimized $P$ can be derived by the fixed-point (Sinkhorn) iterations of Equations (21) and (22), which alternately normalize the rows and columns of the transport matrix; the parameter $\lambda$ is set as $-1$ in this study, and $T = 20$ iterations were used in our experiments.
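A log-domain sketch of this optimal-transport step is given below. The dustbin score, marginal convention, and normalization order follow the public SuperGlue formulation rather than the paper's exact Equations (21) and (22), so treat it as an assumption-laden illustration.

```python
import torch

def sinkhorn_match(scores, bin_score=1.0, iters=20):
    """Augment the score matrix with a dustbin row/column (learnable in the real
    model), then run Sinkhorn normalization in the log domain for `iters`
    fixed-point iterations. Each real keypoint carries unit mass; each dustbin
    absorbs the mass of all potentially unmatched keypoints of the other image."""
    nq, ns = scores.shape
    b = torch.full((1,), float(bin_score))
    aug = torch.cat([torch.cat([scores, b.expand(nq, 1)], 1),
                     torch.cat([b.expand(1, ns), b.view(1, 1)], 1)], 0)
    log_mu = torch.cat([torch.zeros(nq), torch.tensor([float(ns)]).log()])
    log_nu = torch.cat([torch.zeros(ns), torch.tensor([float(nq)]).log()])
    u, v = torch.zeros(nq + 1), torch.zeros(ns + 1)
    for _ in range(iters):                      # alternate row/column scaling
        u = log_mu - torch.logsumexp(aug + v[None, :], dim=1)
        v = log_nu - torch.logsumexp(aug + u[:, None], dim=0)
    return (aug + u[:, None] + v[None, :]).exp()  # probability matrix P
```

Strong diagonal scores then concentrate the probability mass on the correct pairs, with leftover mass routed to the dustbins.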

Experiments and results
In this study, comprehensive experiments were conducted to verify the effectiveness of the self-supervised geometric priors and to show the superiority of GeoGlue compared with recent learning-based approaches and traditional mainstream methods, namely SIFT (Lowe 2004) and ORB (Rublee et al. 2011) with nearest-neighbor matching. The four specialized datasets are introduced in Section 4.1.
The experiment results are presented in Sections 4.2, 4.2.1, and 4.2.2.

Implementation details. The GD model (Section 2.1) was trained on the auto-generated (pseudo image) dataset illustrated in Section 4.1.1. The training process used Adam with a learning rate of 5e-4 and a batch size of 2. As described in Section 3, GeoGlue is built upon SuperGlue, and the pretrained network provided by Sarlin et al. (2020) was used in the following experiments. All experiments were implemented in PyTorch and OpenCV with an RTX 4000 GPU (8 GB).

Pseudo image dataset
In accordance with Sections 2.1.1 and 2.1.2, a pseudo image dataset consisting of rendered images and labels was first created to train the GD model. Table 1 presents the statistics of the training data generated using the six types of geometry templates (i.e. line, polygon, checkerboard, cube, star, and stripe). Notably, as described in Section 2.1.1, the shapes, sizes, and locations of the generated geometry objects were random. Some samples are provided in Appendix A. Finally, 10,000 pseudo images containing only Gaussian noise were evenly inserted into the created dataset to improve noise resistance.

Benchmark datasets
Two benchmark datasets, Tanks & Temples (Knapitsch et al. 2017) and ETH3D (Schöps et al. 2017), were utilized for the pose estimation experiments. Both datasets consist of images captured under strong viewpoint variations. In the Tanks & Temples dataset, four challenging scenes (auditorium, ballroom, lighthouse, and temple) containing rich regions with repetitive patterns were selected to evaluate robustness. Additionally, ETH3D features stronger light changes and sparser views in comparison to Tanks & Temples, and three difficult scenes (botanical garden, bridge, and exhibition hall) were chosen for the experiments. Some examples of the image pairs provided by Tanks & Temples and ETH3D are displayed in Figures 7 and 8. The two datasets used in the experiments are available at https://figshare.com/ndownloader/files/38945210.

High-resolution UAV image dataset
High-resolution UAV imagery covering diverse scenes was collected to fully test the feature matching and generalization of GeoGlue. The high-resolution UAV image dataset contains four types of aerial images: (a) city scenes with large features (e.g. tall buildings) (City A), (b) city scenes with small-sized features (e.g. sculptures) (City B), (c) village scenes, and (d) nature scenes (e.g. forests and lakes). Images in the dataset were organized as image pairs according to their overlap rates and have a resolution of H_I = 4000 (height) by W_I = 6000 (width). The counts of the different scenes are listed in Table 2. Some samples are shown in Figure 9, and the whole dataset can be directly downloaded at https://figshare.com/ndownloader/files/38095920.

Validation of self-supervised geometric priors
The GD model was first trained using the pseudo image dataset (Section 4.1.1). Figure 10 presents the training curves of the average losses on the keypoint and line maps (Section 2.1.3) across every 0.25 epoch. The losses stabilized at around 0.04 and 0.01 after 2 epochs, which demonstrates the learnability of the geometric prior extraction task. In this paper, we chose a model trained for 8 epochs, consuming 52 h, as the GD for the following experiments. The activation thresholds were set as φ_p = 6/255 and φ_l = 40/255 for the line vectoring and filtering processes described in Section 2.2.
Figure 11 shows comparative line detection results for GD and Canny (Canny 1986), where the parameters T_1 and T_2 denote the gradient thresholds used by the Canny algorithm. Although Canny proved effective for line detection, some defects, e.g. missing lines (see the yellow boxes in Figure 11), still exist due to the complexity of the textures and shading displayed in high-resolution UAV imagery. In comparison, GD achieves more acceptable results, while trivial lines can be disregarded using the line-length filtering process described in Section 2.2.
As described in Section 3 and displayed in Figure 11, the keypoints detected by GD were primarily distributed along lines, providing quality candidates for feature matching. We visualized the keypoint descriptors produced by GeoGlue using PCA (Section 3). This was carried out to test whether feature matching remains feasible when adjacent keypoints are inclined to be homogeneous in location and visual appearance. As shown in Figure 12, adjacent keypoints were distinguishable by their RGB colors, derived from the first three principal components via a linear transformation. In this sample, the resulting eigenvalues indicate that more than six components were significant for discriminating keypoint descriptors (Figure 12); the eigenvalues of the first four components are listed in Table 3. In addition, Appendix B (Figure 21) displays descriptor visualizations for a sample image pair under different combinations of the first four components. These visualizations show clear correspondence between regions of the two images that depict the same spatial location. Therefore, the feature representation provided by GeoGlue is position-dependent, built on the quality keypoints proposed by GD, which helps form rich spatial context information during model inference.
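The PCA-to-RGB visualization described above can be sketched in numpy. The 16-D random descriptors are placeholders for GeoGlue's real descriptors, and `descriptors_to_rgb` is a hypothetical helper.

```python
import numpy as np

def descriptors_to_rgb(desc):
    """Project D-dim keypoint descriptors onto their first three principal
    components and linearly rescale each component to [0, 1] for display
    (the visualization strategy described for Figure 12)."""
    centered = desc - desc.mean(axis=0)
    # Eigen-decomposition of the covariance matrix; columns of `vecs`
    # are principal directions, reordered by descending eigenvalue.
    cov = centered.T @ centered / (len(desc) - 1)
    vals, vecs = np.linalg.eigh(cov)
    order = np.argsort(vals)[::-1]
    proj = centered @ vecs[:, order[:3]]
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    return (proj - lo) / np.maximum(hi - lo, 1e-12), vals[order]

rng = np.random.default_rng(0)
desc = rng.normal(size=(200, 16))          # 200 hypothetical 16-D descriptors
rgb, eigvals = descriptors_to_rgb(desc)
print(rgb.shape, bool(np.all(eigvals[:-1] >= eigvals[1:])))
```

Inspecting the sorted `eigvals` gives the kind of significance ranking reported in Table 3: components whose eigenvalues remain large after the third contribute discriminative power that the RGB rendering alone cannot show.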

Pose estimation evaluated by the benchmark datasets
We conducted pose estimation experiments using the two benchmark datasets described in Section 4.1.2 to quantitatively verify the effectiveness of GeoGlue. GeoGlue was compared with recent learning-based approaches, namely SuperGlue (Sarlin et al. 2020) and LoFTR (Sun et al. 2021), and the traditional mainstream methods SIFT (Lowe 2004) and ORB (Rublee et al. 2011) with nearest-neighbor matching. Pose estimation involves solving the relative rotation R and translation T between a pair of adjacent frames via essential matrix decomposition (Hartley 1995), with random sample consensus (RANSAC) applied to exclude outliers. For a fair comparison, the poses derived from all methods were aligned to the same scale as the ground truth by scaling the estimated translation T. Pose accuracy was evaluated as follows. First, for each frame pair, the relative rotation error (in degrees) and translation error (L2 distance) were expressed as percentages of the total rotation and translation accumulated over all frame pairs of a scene. Then the number of frame pairs whose error percentages fall within a given threshold was counted. Two thresholds {0.2%, 0.5%} were chosen, and the counts are also reported as percentages (i.e. AUC) to intuitively show the stability of each method across the whole trajectory of a scene. In Tables 4 and 5, the 'max error' denotes the maximum relative pose error among all frame pairs, and the 'final error' is the relative pose error of the last frame against the pose of the first frame.
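The evaluation protocol above can be sketched as follows, under our reading that per-pair errors are normalized by the scene's accumulated ground-truth rotation and translation; the helper names are hypothetical.

```python
import numpy as np

def rotation_angle_deg(R):
    """Geodesic angle of a rotation matrix, in degrees."""
    c = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    return np.degrees(np.arccos(c))

def pose_error_auc(R_est, R_gt, t_est, t_gt, thresholds=(0.2, 0.5)):
    """Per-pair relative errors expressed as percentages of the scene
    totals, then the fraction of pairs under each threshold (the AUC-style
    statistic described in the text; the normalization is our reading).
    R_*: lists of 3x3 rotations, t_*: lists of 3-vectors, one per pair."""
    rot_err = [rotation_angle_deg(Re @ Rg.T) for Re, Rg in zip(R_est, R_gt)]
    trans_err = [np.linalg.norm(te - tg) for te, tg in zip(t_est, t_gt)]
    rot_total = sum(rotation_angle_deg(Rg) for Rg in R_gt) or 1.0
    trans_total = sum(np.linalg.norm(tg) for tg in t_gt) or 1.0
    rot_pct = [100 * e / rot_total for e in rot_err]
    trans_pct = [100 * e / trans_total for e in trans_err]
    auc = {th: np.mean([r <= th and t <= th
                        for r, t in zip(rot_pct, trans_pct)])
           for th in thresholds}
    return auc, max(rot_pct), max(trans_pct)

def rot_z(deg):
    a = np.radians(deg)
    return np.array([[np.cos(a), -np.sin(a), 0],
                     [np.sin(a),  np.cos(a), 0],
                     [0, 0, 1]])

# Five hypothetical frame pairs; one has a 0.5 degree rotation error.
R_gt = [rot_z(10)] * 5
t_gt = [np.array([1.0, 0, 0])] * 5
R_est = [rot_z(10)] * 4 + [rot_z(10.5)]
auc, max_rot, max_trans = pose_error_auc(R_est, R_gt, t_gt, t_gt)
print(auc[0.2], round(max_rot, 3))
```

Here the total scene rotation is 50 degrees, so the single 0.5 degree error becomes a 1% relative error, exceeding both thresholds for that pair.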
The quantitative results are shown in Tables 4 and 5. As described in Section 4.1.2, four challenging scenes from Tanks & Temples (Knapitsch et al. 2017) and three difficult scenes from ETH3D (Schöps et al. 2017) were chosen for comparing GeoGlue with other state-of-the-art methods. Compared with SuperGlue, GeoGlue achieved a prominent improvement in pose estimation robustness and accuracy for almost all scenes in Tanks & Temples and ETH3D. GeoGlue also showed more competitive performance than LoFTR on these scenes.
In addition to the tables, Figures 13 and 14 display qualitative results for the trajectories estimated by the different methods. The bridge scene in ETH3D was chosen for its large perspective changes between adjacent frames. The ballroom scene in Tanks & Temples was selected for its large areas of repetitive patterns, which are quite difficult even for manual registration, as shown in Section 4.1.2. Despite these challenges, GeoGlue achieved satisfactory stability for pose estimation and showed the best performance among the compared methods.

Feature matching evaluation using the UAV dataset
The reprojection error (RE), computed by Equation (23), was employed to evaluate feature-matching performance in experiments on the high-resolution UAV image dataset. Note that (x_rp, y_rp) is computed using Equation (24) and represents the reprojection coordinate of a matched keypoint (x_o, y_o) from a frame. As described in Section 4.1.3, H_I and W_I represent the height and width of a UAV image. In Equation (24), [X, Y, Z]^T denotes the normalized 3D coordinate of the keypoint (x_o, y_o) derived by triangulation under the epipolar geometry model (Zhang 1998) for a pair of adjacent frames. K is the camera intrinsic matrix and s denotes the scale factor. R and T are derived from essential matrix decomposition (Hartley 1995) with RANSAC and represent the relative pose between the frame pair. The estimated pose is not influenced by matches that RANSAC treats as outliers; such outliers therefore produce large reprojection deviations under Equation (23), so RE can reflect feature-matching performance during evaluation.
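A sketch of the reprojection computation as we read Equations (23) and (24), which are not reproduced here: the normalization of the horizontal and vertical deviations by W_I and H_I, and all numeric values, are illustrative assumptions.

```python
import numpy as np

def reproject(K, R, T, X, s=1.0):
    """Equation (24)-style reprojection of a triangulated 3-D point X into
    the adjacent frame: x_h ~ K (R X + s T), then dehomogenize.
    Sketch only; the paper's exact formulation is its Equation (24)."""
    x_h = K @ (R @ X + s * T)
    return x_h[:2] / x_h[2]

def reprojection_error(xy_obs, xy_rp, W_I=6000, H_I=4000):
    """Horizontal/vertical deviations normalized by image width/height,
    expressed as percentages (our reading of Equation (23))."""
    dx = 100 * abs(xy_rp[0] - xy_obs[0]) / W_I
    dy = 100 * abs(xy_rp[1] - xy_obs[1]) / H_I
    return dx, dy

K = np.array([[3000.0, 0, 3000], [0, 3000, 2000], [0, 0, 1]])
R = np.eye(3)
T = np.array([0.1, 0.0, 0.0])            # hypothetical relative pose
X = np.array([0.0, 0.0, 5.0])            # triangulated point, 5 m deep
xy_rp = reproject(K, R, T, X)
dx, dy = reprojection_error((3000.0, 2000.0), xy_rp)
print(np.round(xy_rp, 1), round(dx, 3), round(dy, 3))
```

Because an outlier match corresponds to an incorrectly triangulated X, its reprojection lands far from the observed keypoint, inflating dx or dy exactly as the text describes.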
Since GPU memory is limited, the input UAV image must be resized before model inference. Conveniently, image partitioning and sub-image matching are preparatory steps for dense matching, which is instrumental for 3D reconstruction with high-resolution UAV images and depends on the results of global matching. It is therefore important to conduct the following feature-matching (i.e. global-matching) experiments using RE (Equation (23)), which simultaneously reflects the deviations of matches in the horizontal and vertical directions, so the results correlate tightly with feature-matching performance. As discussed above, the input UAV image was resized to 2000 × 3000 (a quarter of the original size H_I × W_I) for the extraction of self-supervised geometric priors with GD. Line lengths were easy to calculate because the adjacency relationships among the extracted keypoints were already derived by line vectoring (Figure 5), and these lengths were then used to filter out trivial lines. In this study, the line-length threshold was dynamically adjusted to satisfy an upper limit on the keypoint number (j_u = 10,000) due to the GPU memory limitation.
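The dynamic line-length thresholding can be sketched as follows, assuming lines are kept longest-first until the keypoint budget j_u would be exceeded; the exact selection rule and tie-breaking in the paper may differ.

```python
def dynamic_length_threshold(lines, j_u=10000):
    """Pick the line-length threshold such that the keypoints on the kept
    (longest) lines do not exceed the budget j_u. `lines` is a list of
    (length, n_keypoints) tuples. Returns (threshold, kept lines).
    Sketch of the memory-capping strategy described in the text."""
    order = sorted(lines, key=lambda l: -l[0])   # longest lines first
    kept, total = [], 0
    for length, n_kp in order:
        if total + n_kp > j_u:
            break                                # budget exhausted
        kept.append((length, n_kp))
        total += n_kp
    threshold = kept[-1][0] if kept else float('inf')
    return threshold, kept

# Hypothetical vectorized lines: (length in px, keypoints on the line).
lines = [(120.0, 4000), (80.0, 5000), (30.0, 2000), (10.0, 500)]
thr, kept = dynamic_length_threshold(lines, j_u=10000)
print(thr, sum(n for _, n in kept))
```

This keeps the most salient (longest) structures under the memory cap, at the cost of dropping short lines that may still carry structural detail, the trade-off revisited in the 3D reconstruction discussion below.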
Table 6 presents comparative results between SuperGlue and GeoGlue for the various scenes in the high-resolution UAV image dataset. For GeoGlue, we separately report the matching results for the keypoints derived by SuperPoint (see Figure 5). The number of matched keypoints increased with the aid of geometric priors from GD because more high-quality candidates were offered for matching. In addition, the mean, standard deviation, and maximum RE values all declined compared with those of SuperGlue, demonstrating the quality of the keypoints proposed by GD.
Moreover, for all keypoints in GeoGlue, the mean and standard deviation of RE were both better than those of SuperGlue. This result validates the rationality and feasibility of the self-supervised geometric priors, since the keypoints from GD were the primary candidates used for matching. The extreme cases of maximum RE shown in the last column of Table 6 indicate that the maximum reprojection deviations (Equation (23)) produced by GeoGlue were reasonable and close to those of SuperGlue.

Table 6. RE for the four types of scenes in the high-resolution UAV dataset. GeoGlue (S) refers only to the keypoint matches from SuperPoint (i.e. the same keypoint proposal results as SuperGlue).
Comparative experiments were also performed against recent learning-based approaches, SuperGlue (Sarlin et al. 2020) and LoFTR (Sun et al. 2021), as well as the traditional mainstream method SIFT (Lowe 2004) with nearest-neighbor matching. LoFTR, whose inference is based on the Transformer (Vaswani et al. 2017), has been demonstrated to be an excellent approach for producing abundant high-quality matches. Image pairs with small overlapping regions were employed to test the performance of the above methods.
Figure 15 shows the matching quality of LoFTR, SuperGlue, and GeoGlue for various scenes, using image pairs with small overlapping regions. LoFTR distributed its matching candidates evenly across the image pairs and produced more matches, but REs ≥ 2% occurred frequently, with large deviations. In comparison, GeoGlue offered more matches than SuperGlue while retaining matching quality, and it produced fewer extreme-deviation cases than LoFTR.
Table 7 presents the quantitative feature-matching results for the above methods. GeoGlue generally outperformed the other methods and featured better performance stability, with small maximum REs and fewer extreme cases of large RE values (i.e. RE ≥ 2%) across diverse scenes. This result verifies the robustness and practicality of GeoGlue for feature matching with high-resolution UAV imagery. Moreover, as observed in Table 7, the number of matches provided by GeoGlue was reasonable and the statistical results for RE were acceptable, demonstrating the suitability of GeoGlue for global matching tasks. Lastly, the robustness of GeoGlue was evaluated on the entire UAV image dataset; the statistical results listed in Table 8 show that GeoGlue achieved stable performance across the various scenes.

Time performance and GPU memory cost
Tables 9, 10, and 11 respectively report the time performance and maximum GPU memory consumption of the feature matching step of GeoGlue for the Tanks & Temples, ETH3D, and high-resolution UAV datasets. The tables include the average time for line detection (AT_LD) on a frame pair, the average time for the whole matching step including line detection and feature matching (AT_Wh), and the total time (TT_Wh).
As discussed in Sections 2 and 3, most of the computational cost comes from the line detection and feature matching procedures. Thus, the resolution of the input images and the richness of geometric features in the corresponding scene were the primary factors affecting the time performance and GPU memory cost. From the tables, it can be seen that GeoGlue cannot achieve real-time processing: the average processing time for each frame pair is several seconds. In addition, the GPU memory cost is around 5 GB for Tanks & Temples and ETH3D and 7.5 GB for the UAV dataset, which is largely caused by the expensive computation of the Sinkhorn algorithm used for keypoint matching (Section 3).

Table 7. Reprojection error results for the four types of scenes using image pairs with small overlapping regions. The last four columns give the percentage of matches whose REs fall in the corresponding numerical intervals. The symbol '•' is a placeholder for a data row with a matching failure.

Figure 15. Qualitative results of feature matching from diverse methods using image pairs of various scenes with small overlapping regions, where the red lines represent matches with RE ≥ 2% in the horizontal or vertical direction.
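The Sinkhorn algorithm responsible for much of this memory cost alternates row and column normalizations of an N_A × N_B score matrix, so memory grows with the product of the keypoint counts. A minimal numpy sketch, omitting SuperGlue's dustbin row/column and log-domain stabilization:

```python
import numpy as np

def sinkhorn(scores, n_iters=100):
    """Entropic-OT normalization of a matching score matrix: alternate
    row and column normalizations until the result is (near) doubly
    stochastic. Minimal sketch of the Sinkhorn step used by
    SuperGlue-style matchers (no dustbins, no log-domain trick)."""
    P = np.exp(scores)
    for _ in range(n_iters):
        P /= P.sum(axis=1, keepdims=True)   # normalize rows
        P /= P.sum(axis=0, keepdims=True)   # then columns
    return P

rng = np.random.default_rng(0)
P = sinkhorn(rng.normal(size=(5, 5)))
print(bool(np.allclose(P.sum(axis=0), 1.0)))
```

With j_u = 10,000 keypoints per image, the score matrix alone holds 10^8 entries, which is consistent with the multi-gigabyte GPU footprint reported in the tables.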

3d reconstruction and remaining issues
As illustrated in Sections 4.2.1 and 4.2.2, GeoGlue shows superiority in feature matching since it is robust and effective across various scenes. As described in Section 4.2.2, global matching is the prerequisite for the image partitioning and sub-image matching applied to high-resolution UAV imagery, which aim at fine-grained matching for subsequent 3D reconstruction. The process therefore includes the following steps. First, regularly divide the query image into 4 × 4 sub-images (as in Figure 16). Second, for each sub-image of the query image, compute the relevant image block within the target image by averaging the deviations of the matches from global matching (see Figure 16). Lastly, perform feature matching between each sub-image pair.
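The first two steps above can be sketched as follows; the box layout and the per-block mean-deviation shift follow our reading of Figure 16, and all match coordinates are hypothetical.

```python
def subimage_blocks(H_I, W_I, grid=4):
    """Step 1: regularly divide an image into grid x grid sub-image boxes
    (x0, y0, x1, y1)."""
    hs, ws = H_I // grid, W_I // grid
    return [(c * ws, r * hs, (c + 1) * ws, (r + 1) * hs)
            for r in range(grid) for c in range(grid)]

def shifted_block(box, matches_q, matches_t):
    """Step 2: estimate the corresponding block in the target image by
    shifting the query box by the mean deviation of the global matches
    that fall inside it. Sketch; the paper averages deviations per
    sub-image, and the exact handling of empty blocks is our assumption."""
    x0, y0, x1, y1 = box
    inside = [(tx - qx, ty - qy)
              for (qx, qy), (tx, ty) in zip(matches_q, matches_t)
              if x0 <= qx < x1 and y0 <= qy < y1]
    if inside:
        dx = sum(d[0] for d in inside) / len(inside)
        dy = sum(d[1] for d in inside) / len(inside)
    else:
        dx = dy = 0.0                     # no evidence: keep the box in place
    return (x0 + dx, y0 + dy, x1 + dx, y1 + dy)

boxes = subimage_blocks(4000, 6000)
# Two hypothetical global matches inside the top-left 1500x1000 box,
# both displaced by roughly (+200, +100) pixels in the target image.
mq = [(100.0, 100.0), (900.0, 600.0)]
mt = [(300.0, 198.0), (1100.0, 702.0)]
print(len(boxes), shifted_block(boxes[0], mq, mt))
```

Step 3 then runs the full matcher on each (query box, shifted target box) pair at native resolution, which is what restores sub-pixel accuracy after the global pass on the downscaled image.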
Figure 16 shows samples of the fine-grained matching results from GeoGlue. The current method achieves effective results for 3D reconstruction and visualization, and the reprojection distances (Equation (23)) are close to one pixel at a resolution of 4000 × 6000. However, the structures of some small-sized objects (e.g. texture-less sculptures) are not entirely reconstructed because the keypoints depicting their structural details are not completely retained. This is caused by the memory-saving line-length thresholding strategy described in Section 4.2.2. Figures 17-19 provide additional 3D reconstruction results from fine-grained matching in diverse scenes, including comparisons between LoFTR (Sun et al. 2021) and GeoGlue. GeoGlue effectively reconstructed salient geometric elements, e.g. steps with undulating shapes, serpentine roads, and roofs. However, because GeoGlue focuses more on geometric objects, its reconstruction performance was not satisfactory for natural objects (e.g. trees) and some texture-less planes. Therefore, a strategy that diverts some attention away from abundant geometric elements toward such regions should be considered for GeoGlue.

Table 8. Statistical results for feature matching produced by GeoGlue on the whole UAV image dataset. The columns from the second onward present the numbers and percentages of image pairs, and the title bar shows the numerical intervals for the percentage of matches with RE ≥ 2% in an image pair, denoted ECs.

Conclusions and future work
Feature matching is challenging in high-resolution UAV imagery because the visual information is complicated by repetitive patterns (e.g. aligned windows on tall buildings), low-texture surfaces, shading, and image noise. We propose GeoGlue to overcome these issues. GeoGlue is a novel feature-matching method comprising a self-supervised geometry detector (GD) and a graph attention network (GAT). The method aims to facilitate spatial context awareness during model inference and achieve accurate feature-matching results.
GD is a CNN model within GeoGlue that is trained in a self-supervised manner on synthetic images. Comprehensive experiments confirm the effectiveness and robustness of GD in extracting geometric priors (i.e. keypoints and lines) and providing quality keypoint candidates for the matching step. The experiments also demonstrate the feasibility of matching the keypoints obtained from GD with a GNN architecture. The experimental results show the superiority of GeoGlue over other learning-based methods in matching accuracy and stability across various UAV scenes. The reliable global matching capability of GeoGlue also enables 3D reconstruction through fine-grained matching on high-resolution UAV image pairs. However, GeoGlue still has limitations. Since GPU memory is finite, it is impractical to use all the keypoints provided by GD for feature matching; a strategy that filters out trivial lines must be adopted, which may abandon some quality keypoints that are meaningful for depicting the structural details of small-sized objects. In future studies, more effort will be devoted to grading keypoints for intelligent keypoint selection according to their importance in structural reconstruction.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Figure 2 .
Figure 2. Geometry templates for automatic generation of the curved-line detection dataset.

Figure 3 .
Figure 3. Image pre-processing for model training of GD.
map where the 1st to 64th channels of a coordinate correspond to the relevant 8 × 8 patch. The last channel indicates the non-existence of keypoints, following a strategy similar to that of SOLD² (Pautrat et al. 2021) for the keypoint branch. The resolution of the output line map H_L is h × w, i.e. the same as X. The network structures of the backbone encoder, keypoint branch, and line branch (see Figure 3) were borrowed directly from SOLD² (Pautrat et al. 2021) to validate the practicability of the proposed self-supervised training scheme for detecting straight or curved lines. Specifically, the backbone encoder is the same stacked hourglass network proposed in (Newell, Yang, and Deng 2016), outputting h/4 × w/4 feature maps.

Figure 4 .
Figure 4. GPU parallel algorithm for vectoring: (a) initial stage; (b) -(c) breadth first search stage; and (d) linking stage.The yellow grids represent the keypoints extracted from the output keypoint map, and the blue grids represent the paths derived from the output line map for the breadth first search.

Figure 5. Figure 6.
Figure 5. The architecture of feature reasoning, including the generation and description of keypoints with GD and SuperGlue, which is composed of SuperPoint and a graph attention network (GAT). Note that the symbols (Op I) and (Op II) denote the concatenation operation || and the summation operation in Equation (13), respectively.

Figure 7 .
Figure 7. Examples of image pairs from Tanks & Temples.

Figure 8 .
Figure 8. Examples of image pairs from ETH3D.

Figure 9 .
Figure 9. High-resolution UAV image dataset used for feature matching experiments (Section 4.1.2).

Figure 10 .
Figure 10. Training curves for the GD model.

Figure 11 .
Figure 11. Comparison of line detection results produced by GD and Canny.

Figure 12 .
Figure 12. A sample of the eigenvalues of all descriptor components following PCA. The components are sorted by magnitude, and the descriptors for the first three components are visualized.

Figure 13 .
Figure 13. Qualitative results for the bridge scene in ETH3D.

Figure 14 .
Figure 14. Qualitative results for the ballroom scene in Tanks & Temples.

Funding
Research presented in this paper was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences [Grant No. XDA19080101], the Director Fund of the International Research Center of Big Data for Sustainable Development Goals [Grant No. CBAS2022DF015], the National Natural Science Foundation of China [Grant No. 41901328 and 41974108], and the National Key Research and Development Program of China [Grant No. 2022YFC3800700].

Figure 20 .
Figure 20. Training data samples of the six types of geometry templates in the pseudo image dataset.

Table 1 .
Statistical results of various types of geometry templates in the pseudo image dataset.

Table 2 .
Statistics of the four types of scenes in the high-resolution UAV dataset.

Table 3 .
Eigenvalues for the first four descriptor components based on PCA.

Table 4 .
Baseline comparison of pose estimation results for the first fifty frames in Tanks & Temples. The numbers in brackets in the 'rotation error AUC' and 'translation error AUC' columns refer to the specific frame pair number.

Table 5 .
Baseline comparison of pose estimation results for the first fifty frames in ETH3D. The numbers in brackets in the 'rotation error AUC' and 'translation error AUC' columns refer to the specific frame pair number.

Table 9 .
Time performance and maximum GPU memory cost of the feature matching step produced by GeoGlue on Tanks & Temples.

Table 10 .
Time performance and maximum GPU memory cost of the feature matching step produced by GeoGlue on ETH3D.

Table 11 .
Time performance and maximum GPU memory cost of feature matching produced by GeoGlue on the high-resolution UAV dataset.