Learning Dense Consistent Features for Aerial-to-Ground Structure-From-Motion

The integration of aerial and ground images is known to be effective for enhancing the quality of 3-D reconstruction in complex urban scenarios. However, directly applying the structure-from-motion (SfM) technique for unified 3-D reconstruction with aerial and ground images is particularly difficult, due to the large differences in viewpoint, scale, and appearance between those two types of images. Previous studies mainly rely on viewpoint rectification or view rendering/synthesis to improve the feature matching quality for aligning the aerial and ground models. Nevertheless, these approaches still fail to address the inherent information differences between aerial and ground images. In this article, we propose a learning-based matching framework for direct SfM with ground and aerial images. The key idea of our method is to learn the pixel-wise consistent features between aerial and ground images to handle the large heterogeneity of these two types of images. Specifically, we deploy a learning-based matching framework to robustly correspond the aerial and ground images. With the high-quality feature matching, learned feature maps are used for refining keypoint locations and fusing featuremetric error into bundle adjustment with the consideration of geometric error, both of which can further improve the accuracy and completeness of the recovered 3-D scene. Extensive experiments conducted on six datasets demonstrate that the proposed method can reconstruct high-fidelity 3-D models with direct aerial-to-ground SfM, which cannot be achieved by existing methods. In addition, our method also shows outstanding performance in subtasks of feature matching and point cloud recovery.


I. INTRODUCTION
WITH the increasing availability of aerial oblique images, the fast reconstruction of urban scenes has become feasible. However, challenges remain in achieving high-quality urban 3-D models due to occlusion and low resolution caused by the limitations in height and perspective. These issues often result in geometric holes and blurred textures in reconstructed models, particularly on building facades. To address these challenges, recent studies have explored the integration of aerial and ground images as a promising approach for improving the quality of urban 3-D reconstructions [1], [2]. As depicted in Fig. 1, ground images provide close and arbitrary views of the scene, which can complement aerial images to capture details and ensure completeness in the reconstructed model.
The key to joint reconstruction of aerial and ground images lies in accurately registering the 3-D data (e.g., point cloud and mesh) produced by each image source to the same coordinate system and ensuring geometric consistency constraints. Feature matching is a primary strategy that can precisely establish connection relationships between different images by extracting and matching features. However, commonly used handcrafted 2-D features (such as scale-invariant feature transform (SIFT) [3] and affine-SIFT (ASIFT) [4]) do not tolerate well the heterogeneity of aerial and ground images in terms of viewpoint, lighting, and appearance, making it difficult to find enough matching points to support the effective operation of various components in the structure-from-motion (SfM) pipeline [5], such as triangulation and bundle adjustment (BA). Apart from 2-D feature matching, 3-D features from separate 3-D models prebuilt with different image sources are also often used for model matching and alignment, such as spin images (SI) [6], fast point feature histogram [7], rotational projection statistics [8], and so on. Nevertheless, the differences in accuracy, density, and noise level between the aerial and ground models of the same scene still make it difficult to yield satisfactory fusion results.
In order to address the problem of 2-D/3-D feature-based registration difficulties between aerial and ground images or models, some researchers have conducted pioneering studies using viewpoint rectification [9], [10] or rendering/synthesis [1], [2], [11] to improve feature matching performance and achieve registration of aerial and ground models. Viewpoint rectification works focus on identifying view-independent planar structures (such as ground and building facades) in the scene to correct the aerial and ground images to a normalized viewpoint, thus reducing the differences in viewpoint between the aerial and ground images. However, the structure of urban scenes is usually complex, and it is difficult to guarantee that effective planar features can be extracted even from building facades, so the practicality of this method is often limited. On the other hand, viewpoint rendering/synthesis involves using recovered 3-D data (such as depth maps, point clouds, and meshes) to synthesize a new image at a target viewpoint and match it with the target image (often by projecting ground data onto the aerial viewpoint). Unlike the viewpoint rectification, this strategy combines information from multiple-view images to reduce the differences between aerial and ground images. However, successful implementation of this strategy relies on accurate GPS or geotagged labels for the rough registration of the 3-D data. This can be challenging when the data comes from crowd-sourced databases or when GPS information has high levels of error. In general, although the above two strategies have made some progress, the heterogeneous differences between aerial and ground images still hinder high-quality joint reconstruction with aerial-ground information.
In this article, we propose a dense correspondence learning based SfM approach for the integration of aerial and ground images. Inspired by current developments in deep learning for optical flow estimation [12], [13], the proposed method addresses the heterogeneous differences between aerial and ground images by learning pixel-wise consistent features. Based on dense consistent features, our method can accurately establish correspondences between 2-D local features (subpixel level) preextracted in the images. Furthermore, we propose a multiview feature consistency refinement to adjust the position of 2-D local features in each track, to overcome the low feature localization accuracy caused by differences in aerial and ground scales and obtain an optimized scene graph. Different from the existing studies on the integration of aerial and ground images, this article not only focuses on solving the feature matching problem existing between aerial and ground images, but also further proposes a feature map-based BA method, which further improves the accuracy of 3-D scene recovery with consideration of both featuremetric error (FE) and geometric error.
In summary, our main contribution is the proposal of a reliable method for integrating aerial and ground images, which mines consistent features from aerial and ground images with extreme information differences by a learning manner to achieve complete and refined sparse SfM point clouds.
The rest of this article is organized as follows. Section II reviews related works. Section III describes the proposed method in detail. In Section IV, we present experiments that validate the performance of our method. Finally, Section V concludes this article.

II. RELATED WORK
Here, we review the works related to 3-D reconstruction based on the integration of aerial and ground images. Specifically, the review is organized into the following three parts: 1) feature matching; 2) viewpoint rectification and 3) view rendering/synthesis.

A. Feature Matching
The first step of image-based 3-D reconstruction methods is usually to extract and match 2-D local features between images. Commonly used 2-D features, such as SIFT [3], ASIFT [4], and oriented FAST and rotated BRIEF (ORB) [14], are useful in handling general baseline scenes but ineffective for matching tasks with an extremely wide baseline, such as feature matching between aerial and ground images. Therefore, some studies have proposed self-similarity descriptors based on a simple observation that urban building facades often exhibit high self-similarity to achieve matching between aerial and ground images [15], [16], [17]. However, building facades in cities are usually complex, making it difficult to guarantee that reliable self-similarity features are captured. Additionally, these studies do not establish pixel-level correspondences between images, making it impossible to provide effective inputs for the SfM algorithm. Some researchers focus on outlier rejection to handle the aerial-ground image matching task by introducing matching priors. These outlier rejection approaches improve the robustness of the feature matching algorithm to high outlier rates and can effectively mine the inliers from cluttered matches. For example, Lin et al. [18] proposed to separate outliers by learning the bilateral function (BF) from candidate matches, based on the assumption that the correct matches are consistent in density, smoothness, and spatial distribution. In order to eliminate the incorrect matches brought by repetitive structures, Lin et al. [19] further proposed RepMatch to incorporate random sample consensus (RANSAC [20]) into the BF. Zheng et al. [21] considered that small local areas in real scenes cannot be simply considered as planes, and thus designed the local affine validation (LAV), which eliminates outliers by solving smoothly varying affine functions in small local areas.
To address the extreme scale differences between aerial and ground images, Zhou [22] studied a scale-space-based scale-invariant matching algorithm based on the assumption that the scale ratio of correctly matched feature pairs is close to the image scale ratio. The method introduced by [22] first estimates the image scale ratio based on bag-of-features encoding, and then achieves scale-aware image matching with the estimated scale ratio. In contrast, our method works in a learning-based way to extract features that are invariant to heterogeneous differences such as viewpoint, scale, and illumination between aerial and ground images, so as to achieve accurate matching of aerial-ground images.

B. Viewpoint Rectification
Existing viewpoint rectification based methods use the geometric priors of a given scene to correct the aerial and ground images to a normalized view, thus improving feature matching performance. Typically, the first step is to detect planar structures in the scene that are invariant to viewpoint changes, such as building facades or the ground, and assume that these structures are the same in all images. By projecting all images onto these planes, the viewpoint differences in image data can be reduced [23], [24]. However, in many cases, it is not possible to find planar structures that are visible in all images. In such cases, some studies resort to performing viewpoint rectification for pairs of images based on the planar structures detected in each pair. Wu et al. [9] corrected the images by projecting them onto virtual planes generated from dense point cloud data. Zheng et al. [10] first extracted the building façade structures in the aerial and ground images using the local consistency of features, then rectified the images based on the transform-invariant low-rank texture [25], and finally achieved aerial-ground image matching through mutual supervision between the extracted façade grid structures and the matched seeds. However, the viewpoint rectification approach suffers from the following two problems: 1) even if the viewpoints of both images are successfully rectified, the information difference between the aerial and ground images is not handled, and the further scale change brought by the viewpoint rectification will exacerbate the difficulty of feature matching; and 2) real scenes are often complex, and it is difficult to guarantee the extraction of planar structures, especially when the building facades are nonplanar.

C. View Rendering/Synthesis
The method of viewpoint rendering/synthesis often uses a coarse-to-fine strategy. First, the coarse registration of the aerial and ground models is achieved using the geotagging of the images. Then, viewpoint rendering/synthesis techniques are used to generate a synthesized image in the target viewpoint, which is then matched with the target image using feature matching. The matched 2-D features are then back-projected to establish 3-D correspondences between the aerial and ground models, and finally, the similarity matrix is estimated to achieve the registration of the aerial and ground models. Considering the low resolution of aerial data, view synthesis often takes the aerial view as the target viewpoint. The view synthesis can thus be realized by using depth maps to warp the ground images [11] or by projecting the dense point cloud visible in ground images to the aerial viewpoint [2]. Compared to the former generation method, the latter incorporates information from multiview images. To avoid holes in the synthetic images, [26] explored generating synthetic images using spatially continuous meshes generated from ground sparse point clouds. In addition, unlike previous studies, they merged point clouds by BA instead of estimating the transformations between models, which aims to deal with possible scene drift issues. Zhu et al. [1] inferred synthetic images from the ground view using a mesh model recovered from aerial data, and also generated depth and normal maps to tackle the problem of inaccurate feature correspondence caused by the low mesh geometric accuracy and texture blending. In short, the view rendering/synthesis based approach relies on GPS information or text labels for the initial coarse alignment of aerial and ground models, which limits the applicability of these methods, especially when the images come from a multisource database without localization information or when the GPS information is inaccurate.
Moreover, these approaches still do not address the heterogeneous differences between aerial and ground images, even though they tackle the inconsistency in viewpoint and scale to some extent.

III. METHOD
Given $N_g$ ground images and $N_a$ aerial images, our goal is to geometrically register these images, and recover high-quality 3-D camera information (intrinsic and extrinsic parameters) and scene structure. The main challenges lie in the significant variations between aerial and ground images in viewpoint, lighting, weather condition, and resolution, as well as the potential for significant occlusion and noise. These factors often result in feature matching failures when attempting to construct a complete scene graph from the set of aerial and ground images. Even in some cases where feature matching appears to be successful, SfM may still generate inaccurate 3-D information. In this work, we make it possible to jointly model complete and detailed scenes using SfM directly by learning consistent features between aerial and ground images. Specifically, we first design a dense correspondence network to learn consistent features among ground and aerial images and generate dense correspondences. To enhance generalization to ground and aerial images, we adopt a multiscale and multistage inference strategy to output high-quality correspondences. We then extract sparse keypoints from each image and establish correspondences between keypoints in the image pairs based on the output dense correspondences. To ensure the quality of keypoint matches input to the SfM pipeline, we further use the learned feature map to adjust the locations of the keypoints in the multiview images. Finally, based on the learned feature map, we introduce a new feature consistency error into BA, which effectively eliminates cumulative errors and improves the quality of the recovered 3-D information. The overall pipeline is illustrated in Fig. 2. It should be noted that the underlying SfM algorithm used in our method is the one provided by [5].

A. Dense Correspondence Network
Typically, SfM requires a matching graph based on sparse 2-D keypoint correspondences for geometric estimation. However, traditional handcrafted sparse features such as SIFT and ASIFT are inadequate for establishing high-quality keypoint correspondences between aerial and ground images. This is mainly due to the sensitivity of the keypoint extractor to changes in image conditions such as viewpoint, scale, and illumination [27], making it difficult to ensure high repeatability of keypoints extracted from aerial-ground image pairs. In addition, the invariance of existing feature descriptors is also insufficient to cope with significant changes in the conditions of aerial and ground images, resulting in the failure of matching strategies based on feature similarity. Although learning-based methods can effectively increase robustness to image condition changes, the large receptive field of convolutional neural networks (CNNs) and the down-sampling of feature maps often result in lower localization accuracy of learned keypoints compared to traditional handcrafted keypoints, which greatly affects the accuracy of geometric estimation [28]. Therefore, instead of directly utilizing the traditional detect-describe-match pipeline to establish keypoint correspondences between aerial and ground images, we first propose a dense correspondence network to learn consistent features between aerial and ground images and establish pixel-level correspondences. Then, in Section III-B1, dense correspondences are used to perform handcrafted keypoint matching. The idea behind this strategy is that we believe matching pixel by pixel is more accurate than directly using descriptors to match sparse keypoints.
The structure of the proposed dense correspondence network is illustrated in Fig. 3. We first extract two feature maps $\{F^{1/2} \in \mathbb{R}^{H/2 \times W/2 \times D}, F^{1/4} \in \mathbb{R}^{H/4 \times W/4 \times D}\}$ with 1/2 and 1/4 of the original spatial resolution from the reference image and the source image, respectively, where $H$ and $W$ denote the height and width of the input image, and $D$ denotes the channel dimension of the respective feature map. Considering the extreme differences in image conditions between ground and aerial images, we deploy a transformer block [29] to enlarge the receptive field of the CNN and ensure that each pixel can receive full-image contextual information, thus enhancing the discriminability of the features. The design of our dense correspondence network takes a coarse-to-fine strategy, as shown in Fig. 3. In the coarse phase, a global correlation softmax layer is used to construct pixel-level similarity between the source and reference feature maps $\{F^{1/4}_s, F^{1/4}_r\}$. The global correlation softmax is defined over the coordinates $x^i_s$ and $x^j_r$ in the source and reference feature maps, respectively. The resulting correlation volume $C^{1/4} \in \mathbb{R}^{H/4 \times W/4 \times H/4 \times W/4}$ is then fed into a flow estimator to output the coarse flow $f_c$. In the fine phase, the source feature map $F^{1/2}_s$ is warped to the reference viewpoint based on the coarse flow $f_c$, and a local correlation softmax layer is used to construct a local correlation volume $C^{1/2}$ between the warped feature map $F^{1/2}_{s(w)}$ and the reference feature map $F^{1/2}_r$. The local correlation softmax is defined over the $i$th coordinate $x^i_r$ in the reference feature map and an offset vector $d$. The local region used to compute the pixel-wise similarity is determined by the radius ($|d| < 4$). Finally, the resulting volume $C^{1/2} \in \mathbb{R}^{H \times W \times (2d+1)^2}$ is fed into another flow estimator to output the residual flow $f_r$.
The final refined flow $f_f$ is obtained by combining the coarse flow $f_c$ with the residual flow $f_r$. The final flow can accurately generate pixel-to-pixel correspondences (referred to as dense correspondences) between the images. Additionally, a confidence selection module is used to select reliable correspondences with correlation values exceeding a specific threshold as the final output. It should be noted that the confidence selection module is only used in the inference stage. Specifically, after the forward pass through the network, we calculate the correlation values of pixel-to-pixel correspondences on the feature maps $F^{1/2}$ and $F^{1/4}$, and set the threshold to 0.8. Further details of the network design and training can be found in Section IV-B1.
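The coarse matching step and the confidence selection can be illustrated with a small NumPy sketch. It assumes a scaled dot-product correlation followed by a softmax over all reference positions, in the spirit of the optical flow work the method builds on; the article's exact formulation is not reproduced here, and the function name `global_correlation_flow` is hypothetical.

```python
import numpy as np

def global_correlation_flow(feat_src, feat_ref):
    """Coarse flow from a global correlation softmax (illustrative sketch).

    feat_src, feat_ref: arrays of shape (h, w, d), e.g. 1/4-resolution maps.
    Returns (flow, conf): flow has shape (h, w, 2) in (x, y) order and points
    from each source pixel to its expected match in the reference map; conf is
    the peak softmax probability per source pixel.
    """
    h, w, d = feat_src.shape
    src = feat_src.reshape(h * w, d)
    ref = feat_ref.reshape(h * w, d)

    # Assumed similarity: scaled dot product between every pixel pair.
    corr = src @ ref.T / np.sqrt(d)                 # (h*w, h*w)

    # Softmax over all reference positions (numerically stabilised).
    corr = corr - corr.max(axis=1, keepdims=True)
    prob = np.exp(corr)
    prob /= prob.sum(axis=1, keepdims=True)

    # Pixel grid in (x, y) order, flattened row-major like the reshape above.
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    grid = np.stack([xs, ys], axis=-1).reshape(h * w, 2).astype(float)

    # Expected reference coordinate per source pixel; flow = match - source.
    matched = prob @ grid
    flow = (matched - grid).reshape(h, w, 2)
    conf = prob.max(axis=1).reshape(h, w)
    return flow, conf
```

With such an output, the confidence selection at inference time would amount to keeping only pixels where `conf > 0.8`, the threshold stated above.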
Furthermore, we believe that accurate matching between aerial and ground images is challenging to achieve through a single forward pass of the dense correspondence network. Therefore, a multiscale and multistage inference strategy is used to handle the large viewpoint and scale differences between aerial and ground images. For the multiscale inference, we adopt the idea in [30] to resize the ground images to four different scales of 0.5, 0.6, 0.88, and 1, resulting in four image pairs with the aerial images $\{(I^g_{0.5}, I^a_1), (I^g_{0.6}, I^a_1), (I^g_{0.88}, I^a_1), (I^g_1, I^a_1)\}$. Each of these image pairs is fed into the network separately to obtain the corresponding dense correspondence results. The resulting four sets of correspondences are then passed through RANSAC to solve for the homography matrix, and the final correspondences are determined based on the ratio of inliers. For the multistage inference, we follow a coarse-to-fine design similar to the network architecture. In the first forward pass of the network, we obtain the coarse dense correspondence results and estimate the homography matrix by using RANSAC. We then use this matrix to warp the source image into the reference viewpoint and obtain a roughly aligned image pair. The new image pair is then fed into the network in a second forward pass to generate a new flow, which can be considered as the residual flow. This flow is added to the flow obtained from the first forward pass. The combined flow results are then converted into correspondences and output as the final result.

B. Scene Graph Construction
The scene graph describes the connectivity between images, where each image is a node and there is an edge between any pair of images with matched keypoints. The set of matched keypoints across multiple views forms a track. As the input to SfM, the quality of the scene graph directly determines the quality of 3-D camera and scene information recovery. To construct a high-quality scene graph, we adopt the following steps to provide the necessary connectivity for recovering the complete model, and sufficient redundancy and accurate initial values for reliable estimation.
1) Keypoint extraction and assignment. Although dense correspondences outputted by the dense correspondence network establish pixel-level matches between images, they are insufficient for accurately estimating 3-D geometry and are limited by viewpoint and resolution, often resulting in many-to-one pixel matches. Therefore, dense correspondences cannot be directly applied to SfM. To address these issues, we first extract sparse keypoints (such as SIFT and SuperPoint [31]) from each image and establish rough matching relationships between keypoints based on dense correspondences, which can be viewed as candidate matches. Although these matches also inherit the many-to-one disadvantage of dense correspondences, they have higher keypoint localization accuracy. Additionally, we believe that the sparsity of keypoints ensures that the probability of multiple keypoints within the same pixel is low, allowing the correspondence problem of multiple keypoints in one matching pixel pair to be ignored. The outliers included in these candidate matches can be handled by RANSAC or other outlier rejection algorithms, such as LAV [21].
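The keypoint assignment described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the nearest-keypoint rule, the `max_dist` tolerance, and the function name `assign_keypoints` are all hypothetical, and the resulting candidates would still need outlier rejection (e.g., RANSAC).

```python
import numpy as np

def assign_keypoints(kps_src, kps_ref, flow, valid, max_dist=2.0):
    """Match sparse keypoints through a dense flow field (illustrative sketch).

    kps_src: (n, 2) keypoints in the source image, (x, y) in pixels.
    kps_ref: (m, 2) keypoints in the reference image, (x, y) in pixels.
    flow:    (H, W, 2) source-to-reference flow in (x, y) order.
    valid:   (H, W) boolean mask of confident dense correspondences.
    max_dist: hypothetical pixel tolerance for accepting a candidate.
    Returns a list of (source_index, reference_index) candidate matches.
    """
    if len(kps_ref) == 0:
        return []
    H, W = valid.shape
    matches = []
    for i, (x, y) in enumerate(kps_src):
        # Look up the dense correspondence at the pixel holding the keypoint.
        col, row = int(round(x)), int(round(y))
        if not (0 <= row < H and 0 <= col < W) or not valid[row, col]:
            continue
        target = np.array([x, y]) + flow[row, col]

        # Accept the nearest reference keypoint if it is close enough.
        dists = np.linalg.norm(kps_ref - target, axis=1)
        j = int(np.argmin(dists))
        if dists[j] <= max_dist:
            matches.append((i, j))
    return matches
```

Because several source keypoints may land near the same reference keypoint, the output can contain many-to-one candidates, which mirrors the limitation noted in the text.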
2) Refinement of keypoint locations. Usually, sparse keypoints are independently extracted on each image. Therefore, when there are significant variations in image conditions, it is difficult to ensure positional consistency of the matched keypoints across multiple images, resulting in a decrease in the accuracy of both scene structure and camera pose estimation. In order to tackle this problem, we propose a method that refines the keypoint locations based on multiview feature maps. This method adjusts the keypoint locations by minimizing the FEs between matched keypoints in the multiview feature maps, using the objective in (3), where $N_t$ and $N_j$ refer to the number of tracks and the number of keypoints on the $j$th track, respectively. $N^a_j$ and $N^g_j$ denote the number of aerial images and ground images involved in the $j$th track, respectively. $p_{i(j)}$ is the $i$th keypoint on the $j$th track, $F$ represents the feature map learned from the dense correspondence network (we use the feature map $F^{1/2}$ with 1/2 of the spatial resolution of the original image), and $[\cdot]$ is the sampling operator. We select the keypoint $p_{k(j)}$ with the most matching relationships on the $j$th track as the anchor point to adjust the locations of the other keypoints. Taking into account the resolution difference between aerial and ground images, we use $\gamma$ to regulate the influence of the keypoint offsets in both types of images on the overall loss. We adopt the Levenberg-Marquardt (LM) algorithm [32] to solve (3) and update the location estimate in each iteration according to (4), where $E(P) = [e_{i,j}]$ with $j = 1, \ldots, N_t$ and $i = 1, \ldots, N_j$, $J(P)$ is the Jacobian matrix of $E(P)$, $D(P)$ is a nonnegative diagonal matrix consisting of the square roots of the elements on the diagonal of $J(P)^T J(P)$, and $\lambda > 0$ controls the degree of regularization.
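The LM update described above can be sketched in a few lines. The code uses the textbook Marquardt-scaled step, which is assumed here to be consistent with the stated definitions of $J$, $D$, and $\lambda$; it is not a reproduction of the article's equation (4), and `lm_step` is a hypothetical helper name.

```python
import numpy as np

def lm_step(J, E, lam):
    """One Levenberg-Marquardt update with Marquardt diagonal scaling.

    J: (m, n) Jacobian of the residual vector, E: (m,) residuals, lam > 0.
    Returns the increment delta to be added to the current parameter estimate.
    """
    JtJ = J.T @ J
    # D^T D equals diag(J^T J) when D holds the square roots of that diagonal.
    DtD = np.diag(np.diag(JtJ))
    return -np.linalg.solve(JtJ + lam * DtD, J.T @ E)
```

A small $\lambda$ makes the step behave like Gauss-Newton, while a large $\lambda$ shrinks it toward a scaled gradient-descent step; in practice $\lambda$ is usually adapted between iterations depending on whether the error decreased.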

C. Geometric-Feature BA
To mitigate the cumulative errors during the incremental reconstruction, it is necessary to perform BA after image registration and triangulation to guarantee the accuracy of 3-D scene estimation. Specifically, given an initial estimation, BA refines the estimation of the scene points and camera poses by minimizing the reprojection error (RE), where $N_c$ is the number of images involved in 3-D reconstruction; the projection function maps a 3-D scene point to the 2-D image plane; $P_j$ is the $j$th scene point; and $C_i$ and $\{R_i, T_i\}$ are the intrinsic parameters and pose of the $i$th camera, respectively. For better performance of BA, we utilize the image feature maps ($\{F^{1/2}_i\}$, $i = 1, \ldots, N_c$) from the dense correspondence network and introduce a novel feature-based BA. The final objective function for BA, given in (7), minimizes both the geometric and the featuremetric consistency error, where $N^a_c$ and $N^g_c$ denote the number of aerial images and ground images involved in 3-D reconstruction, respectively. Here, we also use a parameter $\eta$ to balance the impact of errors on aerial and ground images on the overall loss.
Equation (7) is solved in the same way as (3): the LM algorithm iteratively updates the parameters according to (8), where $J(\mathcal{X})$ is the Jacobian matrix of $E(\mathcal{X})$, $D(\mathcal{X})$ is a nonnegative diagonal matrix consisting of the square roots of the elements on the diagonal of $J(\mathcal{X})^T J(\mathcal{X})$, and $\lambda > 0$ controls the degree of regularization.
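The geometric term of the BA objective can be illustrated with a short sketch of the reprojection residual. It assumes a simple pinhole camera without lens distortion, with intrinsics written as a 3×3 matrix `K`; the function names are hypothetical, and the featuremetric term and the weighting by $\eta$ are deliberately left out because their exact form is not given in the text.

```python
import numpy as np

def project(K, R, T, P):
    """Pinhole projection of 3-D points P (n, 3) into pixel coordinates (n, 2)."""
    cam = P @ R.T + T                 # world frame -> camera frame
    uvw = cam @ K.T                   # camera frame -> homogeneous pixels
    return uvw[:, :2] / uvw[:, 2:3]

def reprojection_residuals(K, R, T, P, observed):
    """Per-point residuals between observed keypoints and projected points."""
    return observed - project(K, R, T, P)
```

BA would stack these residuals over all cameras and observed scene points and minimize their squared norm, for instance with the LM iteration described above.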

IV. EXPERIMENTS
To evaluate the effectiveness of our proposed method in the task of aerial-ground image integration, we conduct a series of experiments on multiple datasets, some publicly available and some collected by ourselves. First, we compare the performance of our proposed method with state-of-the-art techniques in feature matching. Second, we evaluate the quality of the reconstructed 3-D scenes by integrating aerial and ground images. Finally, we demonstrate the impact of our proposed method on the reconstruction of complete and fine-grained surface models. The comparative results with prior art on these datasets are reported in Sections IV-C, IV-D, and IV-E.
A. Dataset
1) Training Data: We use MegaDepth [33] to train the proposed dense correspondence network. MegaDepth consists of images from photo-tourism with significant variations in appearance and viewpoint, which can simulate the differences between aerial and ground images. The authors use COLMAP to reconstruct 196 different scenes from 1 070 468 internet photos and provide the intrinsic and extrinsic camera parameters and depth maps for 102 681 images among them. Before training, we preprocess MegaDepth in the following steps.
1) Removing the scenes with low-quality depth maps, as indicated by [34].
In order to obtain the ground truth correspondences between image pairs, we project all the points in the source image with depth information into the reference image to obtain coarse correspondences. Subsequently, a depth check is performed to reject the incorrect correspondences and obtain the final ground-truth correspondences and the mask for the effective loss calculation, as proposed in [34]. It should be noted that the correspondences can be converted into ground-truth flow for training the dense correspondence network, as done in [35].
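The projection and depth check used to build ground-truth correspondences can be sketched as follows. This is a minimal illustration assuming pinhole intrinsics, a known relative pose from the source camera to the reference camera, and a relative depth tolerance `rel_tol`; the tolerance value and the function name are hypothetical and not taken from [34].

```python
import numpy as np

def gt_correspondences(depth_src, depth_ref, K_src, K_ref, R, T, rel_tol=0.05):
    """Project source pixels with depth into the reference view and depth-check.

    R, T map source-camera coordinates to reference-camera coordinates.
    rel_tol is a hypothetical relative tolerance for the depth consistency test.
    Returns (flow, mask): flow is (H, W, 2) in (x, y) order, mask is (H, W) bool.
    """
    H, W = depth_src.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).astype(float)

    # Back-project source pixels to 3-D, then move them to the reference frame.
    z = depth_src.reshape(-1)
    pts_src = (pix @ np.linalg.inv(K_src).T) * z[:, None]
    pts_ref = pts_src @ R.T + T

    # Project into the reference image (guarding against non-positive depth).
    z_ref = pts_ref[:, 2]
    safe_z = np.where(z_ref > 0, z_ref, 1.0)
    uv = (pts_ref @ K_ref.T)[:, :2] / safe_z[:, None]

    # Keep points with valid depth that land inside the reference image.
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    Hr, Wr = depth_ref.shape
    inside = (z > 0) & (z_ref > 0) & (u >= 0) & (u < Wr) & (v >= 0) & (v < Hr)

    # Depth check: projected depth must agree with the reference depth map.
    mask = np.zeros(H * W, dtype=bool)
    d_ref = depth_ref[v[inside], u[inside]]
    mask[inside] = (d_ref > 0) & (np.abs(d_ref - z_ref[inside]) <= rel_tol * d_ref)

    flow = (uv - pix[:, :2]).reshape(H, W, 2)
    return flow, mask.reshape(H, W)
```

The returned mask marks the pixels at which a flow loss would be evaluated; occluded points fail the depth check because the reference depth map records a nearer surface.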
2) Test Data: Six datasets are used to evaluate the proposed method, including the ISPRS benchmark dataset collected in the Centre of Dortmund and Zeche of Zurich [36], two datasets (SWJTU-LIB and SWJTU-BLD) collected on the campus of Southwest Jiaotong University (SWJTU) provided by [1], and two datasets (CQ-BAISHA and CQ-StudioCity) collected by ourselves in Baisha Town and a studio city in Chongqing (CQ). The ground sampling distance (GSD) of all images ranged from 0.16 to 1.8 cm. Table I describes the specific details of the six aerial-ground datasets, Fig. 5 shows examples of aerial-ground image pairs, and Fig. 6 shows the scenes reconstructed by SfM from the aerial and ground images separately. In order to obtain ground truth correspondences for quantitative evaluation, we manually select tie points in covisible multiview images to integrate aerial and ground images.

1) Dense Correspondence Network: We implement our network using PyTorch. The backbone of our network is ResNet50 [37], which is pretrained on ImageNet [38]. In the network, we introduce the transformer blocks to increase the global receptive field. However, the computational burden that comes with them cannot be ignored. Therefore, similar to [27], we use the linear transformer to address this issue. We only use one eight-head attention layer. For the flow decoder, we adopt the design provided by [39]. We follow [13] to supervise the training of the network by using the L1 distance between the predicted flow and the ground truth flow. Given the ground truth flow, we calculate the training loss, where $M_c$ and $M_f$ refer to the ground truth masks at the coarse and fine stages of our network, respectively, while $f^{gt}_c$ and $f^{gt}_f$ refer to the ground truth flow at the coarse and fine stages, respectively. During our experiments, we set the values of $\gamma_1$ and $\gamma_2$ to 0.7 and 0.9, respectively. For network training, we use a progressive strategy. Initially, we freeze all the weights of the backbone and train the remaining part of the network on a subset of the scenes with a scale ratio of [0.5-0.7]. The learning rate is set to $10^{-4}$ during this stage. Once this training is completed, we unfreeze the weights of the backbone and fine-tune them on a subset with a scale ratio of [0.3-0.5]. The learning rate is set to $4 \times 10^{-5}$ during this stage. Finally, to make the network adaptable to challenging scenarios, we train the entire network on the subset with the maximum scale ratio of [0.1-0.3], and set the learning rate to $10^{-6}$. The model is trained on image pairs of size 520 × 520.
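A masked L1 flow loss of the kind described above can be sketched as follows. The way the two stages are combined is an assumption: the sketch weights the coarse term by $\gamma_1$ and the fine term by $\gamma_2$ and sums them, which may differ from the article's exact loss. NumPy is used for brevity rather than PyTorch, and the function names are hypothetical.

```python
import numpy as np

def masked_l1(pred, gt, mask):
    """L1 distance between two flows, averaged over pixels where mask is True."""
    if not mask.any():
        return 0.0
    return float(np.abs(pred - gt)[mask].sum() / mask.sum())

def flow_loss(f_c, f_c_gt, m_c, f_f, f_f_gt, m_f, gamma1=0.7, gamma2=0.9):
    """Weighted sum of coarse and fine masked L1 flow losses (assumed form)."""
    coarse = masked_l1(f_c, f_c_gt, m_c)
    fine = masked_l1(f_f, f_f_gt, m_f)
    return gamma1 * coarse + gamma2 * fine
```

Restricting the loss to the masks keeps pixels without a valid ground-truth correspondence from contributing to the gradient.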
2) Keypoint Location Refinement (LR) and Optimized BA: In the process of refining keypoint locations and BA, we impose a maximum offset of $\beta = 10$ for each keypoint, and set $\eta = 0.3$ to focus the entire optimization process more on errors in aerial images. It is noteworthy that (4) and (8) reveal the similarity between the solving process of refining keypoint locations and that of BA. However, the key difference between the two lies in the fact that refining keypoint locations is performed on a single track, which makes it amenable to acceleration via parallel computing. In contrast, in BA, all camera poses and scene points are optimized simultaneously.

C. Evaluation of Feature Matching Between Aerial and Ground Images
We compare our matching results with the following four advanced methods: 1) the feature matching method embedded in the COLMAP system (COLMAP) [5]; 2) adaptive locally-affine matching (AdaLAM) [40]; 3) SIFT+SuperGlue [41]; and 4) SuperPoint [31] + SuperGlue. The first two methods use handcrafted feature extractors and outlier filters, the third method uses a graph neural network to learn correct matching relationships between features, and the fourth method adds a learned feature extractor on top of the graph neural network. It is worth noting that our matching results include results with both SIFT and SuperPoint as feature extractors. In addition to using precision (P), recall (R), and F1-score (F) to evaluate the quality of feature matching for individual image pairs, we also count the number of aerial-ground image pairs matched, the average number of matches per pair, and the average number of 3-D points observable per image in the six aerial-ground datasets to further evaluate the performance of our proposed matching method.
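For reference, the three per-pair scores can be computed as in the following sketch. It assumes that the correct correspondences of an image pair are available as a set of (aerial keypoint index, ground keypoint index) tuples; how that reference set is obtained is not part of the sketch, and the function name is hypothetical.

```python
def match_scores(predicted, ground_truth):
    """Return (precision, recall, F1) for one image pair.

    Both arguments are iterables of (aerial_idx, ground_idx) tuples.
    Empty inputs yield zero scores instead of a division error.
    """
    predicted = set(predicted)
    ground_truth = set(ground_truth)
    true_positives = len(predicted & ground_truth)

    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    pred = [(0, 0), (1, 1), (2, 5), (3, 3)]
    gt = [(0, 0), (1, 1), (3, 3), (4, 4), (5, 5)]
    print(match_scores(pred, gt))
```

A method that returns no correct match for a pair receives zero in all three scores under this definition.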
1) Evaluation on Two Views: We select one aerial-ground image pair from each of the six datasets for evaluation. Table II presents the quantitative comparison of our proposed method and the other methods in terms of precision, recall, and F1-score, while Fig. 7 shows the visual results of feature matching for each method on the six aerial-ground image pairs. From Table II, it can be observed that the classic SIFT feature and the geometric verification method provided by COLMAP are insufficient for matching aerial and ground images, resulting in zero scores in all three metrics. AdaLAM, an advanced outlier filter based on local geometric verification, surprisingly achieves the same results as COLMAP. This may be due to the following two reasons: 1) the scale, orientation, and other feature frame information of the SIFT features are not reliable for finding matches between aerial and ground images and 2) the small number of true SIFT matches between aerial and ground images is not enough to support local geometric verification. Even when the SIFT features are fed into SuperGlue to learn correct matching relationships, no effective matching results are obtained for the test image pairs. The results of these three methods indicate that relying solely on SIFT descriptors is insufficient for aerial-ground image matching under extreme differences.
When SuperPoint, a learned feature, is used instead of SIFT and combined with SuperGlue, the results of aerial-ground image matching improve significantly. This suggests that learned features perform better than handcrafted features on images with significant differences. Nevertheless, the SuperPoint + SuperGlue matching strategy still produces many false matches, and on the two image pairs from the Zeche and CQ-StudioCity datasets, almost no correct matches are obtained. This may be related to the dataset and training strategy used for network training. In contrast, using the output of our dense correspondence network to assist SIFT and SuperPoint matching achieves higher precision, recall, and F1-score. It is worth noting that SIFT still fails to obtain true matches on the SWJTU-LIB image pair. This is because, under extreme differences between aerial and ground images, SIFT features can hardly remain repeatable across images, which makes it difficult to obtain matches even with high-performance matchers. Additionally, the matches obtained using our method are more evenly distributed in space, as seen in Fig. 7.
2) Evaluation on MultiViews: To comprehensively evaluate the performance of our proposed method, we calculate three metrics on six datasets. The number of matched aerial-ground image pairs ($N_p$) reflects the robustness of the matching algorithm to differences between aerial and ground images. When all algorithms perform similarly on this metric, the average number of matches per aerial-ground image pair ($N_m$) is used to further distinguish the matching algorithms. The average number of 3-D points observed per image ($N_o$) is used to evaluate the matching results from the perspective of recovering 3-D information of the scene. This metric supports the hypothesis that the number and quality of matches do not necessarily have a proportional relationship. Table III reports a comparison between our method and other methods on these three metrics. The results in Table III show that our proposed strategy, which uses a dense correspondence network to assist SIFT and SuperPoint in matching, obtains correctly matched aerial-ground image pairs on all six test datasets and exceeds the other methods in number. Furthermore, because it has more matches per aerial-ground image pair, our method is able to generate more 3-D points. We also find that, for Centre, CQ-BAISHA, and CQ-StudioCity, the results based on SuperPoint are slightly lower than those based on SIFT for all three metrics. This could be attributed to the fact that the structures in these three scenes are more suitable for SIFT to extract salient features, such as corners. COLMAP, on the other hand, successfully integrates aerial and ground images in Zeche, Centre, and SWJTU-LIB, and outperforms SuperPoint + SuperGlue in terms of $N_p$ and $N_o$, but is much lower in $N_m$. This is because the location accuracy of SuperPoint is much lower than that of SIFT, which makes it difficult to effectively recover the 3-D information of the scene even if there are many true matches between the images. This further emphasizes the necessity of refining feature locations. As in the evaluation on two views, AdaLAM and SIFT+SuperGlue are unable to integrate aerial and ground images correctly, which is expected.
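The three multiview metrics can be computed from simple per-pair and per-image counts, as in the sketch below. It assumes that a pair counts as matched when it retains at least `min_matches` verified matches; this threshold, the input layout, and the function name are hypothetical and not specified in the text.

```python
def multiview_metrics(pair_matches, image_observations, min_matches=1):
    """Return (N_p, N_m, N_o).

    pair_matches: {(aerial_image, ground_image): number of verified matches}
    image_observations: {image: number of 3-D points observed in that image}
    min_matches: assumed threshold for a pair to count as matched
    """
    matched = [n for n in pair_matches.values() if n >= min_matches]
    n_p = len(matched)                                  # matched pairs
    n_m = sum(matched) / n_p if n_p else 0.0            # matches per matched pair
    n_o = (
        sum(image_observations.values()) / len(image_observations)
        if image_observations
        else 0.0
    )                                                   # 3-D points per image
    return n_p, n_m, n_o


if __name__ == "__main__":
    pairs = {("a1", "g1"): 120, ("a1", "g2"): 0, ("a2", "g1"): 80}
    observations = {"a1": 900, "a2": 700, "g1": 500}
    print(multiview_metrics(pairs, observations))
```

Averaging the match count over matched pairs only, rather than over all candidate pairs, is a design choice of this sketch.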

D. Evaluation on Integrated Sparse Point Clouds
To further evaluate the impact of feature matching results on subsequent 3-D scene recovery, as well as the effect of our proposed location refinement (LR) and BA with FE on the quality of 3-D scenes, we first feed the matching results of our method and of SuperPoint + SuperGlue into SfM to reconstruct 3-D point clouds. We then compare them with the 3-D point clouds generated by software such as COLMAP and RealityCapture. Furthermore, to demonstrate the effectiveness of LR and FE, we generate four different configurations of our method by including or excluding the corresponding modules. To evaluate the results, we use two metrics: 1) average RE and 2) average track length (TL), which are presented in Table IV.
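The two metrics can be computed from a sparse reconstruction as in the following sketch. It assumes that RE denotes the reprojection error in pixels, that each camera follows a simple pinhole model without distortion, and that a track lists the image observations of one 3-D point; the data layout and function names are hypothetical.

```python
import math


def project(point, camera):
    """Project a 3-D point with a pinhole camera (R, t, f, cx, cy)."""
    rotation, translation, focal, cx, cy = camera
    # Transform the point from world to camera coordinates.
    xc = [
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    return focal * xc[0] / xc[2] + cx, focal * xc[1] / xc[2] + cy


def average_re_and_tl(points, cameras, tracks):
    """Return (average reprojection error, average track length).

    points:  {point_id: (X, Y, Z)}
    cameras: {camera_id: (R, t, f, cx, cy)}
    tracks:  {point_id: [(camera_id, (u, v)), ...]}
    """
    errors = []
    for point_id, observations in tracks.items():
        for camera_id, (u, v) in observations:
            pu, pv = project(points[point_id], cameras[camera_id])
            errors.append(math.hypot(pu - u, pv - v))
    if not errors:
        return 0.0, 0.0
    return sum(errors) / len(errors), len(errors) / len(tracks)
```

A longer average TL means that each scene point is constrained by more images, which is why it is reported alongside the average RE.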
From Table IV, it can be seen that both COLMAP and RealityCapture fail to integrate aerial and ground images for the SWJTU-BLD, CQ-BAISHA, and CQ-StudioCity scenes, and RealityCapture also fails on Centre. This indicates a high failure rate when using existing software for aerial-ground image integration. In comparison, our method, using either SIFT or SuperPoint, not only successfully achieves aerial-ground integration but also has lower average REs and longer average TLs. It is worth noting that SuperPoint + ours has a significantly higher average RE than SIFT+ours, because feature learning-based methods often have lower location accuracy due to the large receptive field and downsampling operations of CNNs, which can affect the accuracy of 3-D scene reconstruction. Additionally, as shown in the table, adding the keypoint LR and the BA with featuremetric error significantly reduces the average RE and increases the average TL. For example, on Zeche, SIFT+ours(wLF, wPE) reduces the average RE by 25% and increases the average TL by 18% compared to SIFT+ours(w/oLF, w/oPE). Fig. 8 shows the sparse SfM point cloud obtained by our method using SIFT to integrate aerial and ground images, while Fig. 9 displays some examples of failures in other methods. In Fig. 9(a), all ground cameras are restored to the same facade view. This is because SuperPoint + SuperGlue lacks robustness to repeated structures. In Fig. 9(b), the addition of ground images even destroys the consistency of the original aerial scene, which also shows that the mismatches between aerial-ground images generated by SuperPoint + SuperGlue directly interfere with the operation of the SfM algorithm. In Fig. 9(c), the positional relationship between the aerial and ground cameras is misestimated, so the generated scene points are not accurately registered. In contrast, our method can successfully integrate aerial and ground images by learning more consistent features.

E. Evaluation on Texture Models
Fig. 10 compares the textured mesh models obtained using only aerial images (top row) and the fusion of aerial and ground images (bottom row). To highlight the comparison, we display only part of the facade of each model. From Fig. 10, it can be seen that integrating ground images into the aerial model makes the reconstructed model more complete and the texture clearer. This further illustrates that our proposed method can effectively integrate aerial and ground images to generate more accurate and complete 3-D information. Additionally, it is worth noting that in Fig. 10(e), there is ambiguity between the car objects in the model built from aerial images and those in the integrated model. This indicates that dynamic objects or image differences need to be further considered in the fusion modeling process to obtain more reasonable models.

V. CONCLUSION
In this article, we propose to improve the SfM algorithm by learning dense consistent features for the integration of aerial-ground images. This approach primarily uses the learned features to improve feature matching and BA, and introduces a method for adjusting keypoint locations to further refine the accuracy of 3-D scene reconstruction. Extensive experiments demonstrate that our method not only significantly improves modeling accuracy compared to existing algorithms and software but also achieves effective aerial-ground image integration in challenging scenarios. In the next step, we plan to improve the multiview stereo algorithm to better adapt to aerial-ground images and generate high-quality depth maps and dense point clouds.