Feature matching for multi-epoch historical aerial images

Historical imagery is characterized by high spatial resolution and stereo-scopic acquisitions, providing a valuable resource for recovering 3D land-cover information. Accurate geo-referencing of diachronic historical images by means of self-calibration remains a bottleneck because of the difficulty to find sufficient amount of feature correspondences under evolving landscapes. In this research, we present a fully automatic approach to detecting feature correspondences between historical images taken at different times (i.e., inter-epoch), without auxiliary data required. Based on relative orientations computed within the same epoch (i.e., intra-epoch), we obtain DSMs (Digital Surface Model) and incorporate them in a rough-to-precise matching. The method consists of: (1) an inter-epoch DSMs matching to roughly co-register the orientations and DSMs (i.e, the 3D Helmert transformation), followed by (2) a precise inter-epoch feature matching using the original RGB images. The innate ambiguity of the latter is largely alleviated by narrowing down the search space using the co-registered data. With the inter-epoch features, we refine the image orientations and quantitatively evaluate the results (1) with DoD (Difference of DSMs), (2) with ground check points, and (3) by quantifying ground displacement due to an earthquake. We demonstrate that our method: (1) can automatically georeference diachronic historical images; (2) can effectively mitigate systematic errors induced by poorly estimated camera parameters; (3) is robust to drastic scene changes. Compared to the state-of-the-art, our method improves the image georeferencing accuracy by a factor of 2. The proposed methods are implemented in MicMac, a free, open-source photogrammetric software.


Introduction
Historical imagery chronicles worldwide land-cover information, and as a result enables long-term environmental monitoring as well as 3D dynamic change detection. The images are of high spatial resolution, and are acquired in stereoscopic configuration. They have been acquired in many countries all over the world and can be traced back to the beginning of the 20 th century [1]. Millions of historical images have been digitized and made accessible through web services [2,3,4]. They are often accompanied by metadata, in most cases including the camera focal length and the physical sensor size. Other metadata such as flight plans, camera calibration certificates or orientations are not commonly available. Besides, inappropriate film/glass plate preservation and the scanning process enforce reestimating of the camera calibrations (i.e., the self-calibrating). Self-calibration estimates simultaneously interior and exterior orientations. It is generally solved with a bundle block adjustment (BBA) routine, taking feature correspondences and Ground Control Points (GCPs) as input observations. Extracting feature correspondences within a single epoch (also referred to as intra-epoch) can be efficiently done with local features such as SIFT [5]. Yet, due to drastic scene changes and heterogeneous acquisition conditions, it is challenging to automatically find feature correspondences across different epochs (also referred to as inter-epoch).
In this work we propose a fully automatic approach to computing dense and robust inter-epoch feature correspondences. Our method consists of a rough co-registration by finding feature correspondences between DSMs derived within single epochs, and a precise feature matching on original RGB images. Our main contributions include: • By matching DSMs, we are able to obtain robust rough co-registration as the 3D landscape often stays globally stable over time.
• Under the guidance of co-registered orientations and DSMs, we reduce the difficulty in precise matching by: (1) narrowing down the search space; (2) reducing the combinatorial complexity since only overlapping images are analyzed; • By proposing a tiling scheme (including one-to-many tiling in rough coregistration and one-to-one tiling in precise matching), we are opening up the possibility to scale-up the deep learning methods for feature matching. As we have shown in Section 4.3, using them out-of-the-box is inefficient. Large images demand high computing resources, while deep learning feature extraction methods are presumably trained on small images.
• By including priors about the geometry (in form of DSMs), we can filter candidate correspondences: each three correspondences projected to DSM serve to compute a 3D Helmert transformation between epochs, and most importantly provide a 2D constraint on all images' correspondences.
In the following we briefly describe the current states of local features, interepoch historical images alignment as well as robust matching. In Section 3 we introduce our methodology, and in Section 4 the experiments as well as results are given. The source code is available from MicMac Github [6].

Local features
Among hand-crafted features, SIFT [5] is undoubtedly a milestone. For decades, SIFT and its variants such as RootSIFT [7], RootSIFT-PCA [8], and DSP-SIFT [9] have been widely used both in academic and industrial fields. Other popular traditional features include ASIFT [10], SURF [11], Hessian Affine keypoint detector [12] and KAZE [13]. They are respectively powerful in affine invariance, computational efficiency, scale and affine invariance, retaining object boundaries. With the rise of deep learning, learned features have gained much attention due to their good performance on certain benchmarks. LIFT [14] is the first end-to-end network that implemented a full pipeline including detection, orientation estimation and feature description. From then on, a large number of learned methods related to feature matching proliferated, including: 2. Detector along with descriptor. Most of the methods fall into this category, including LF-Net [18], DELF [19], SuperPoint [20], D2-Net [21], ASLFeat [22], R2D2 [23], D2D [24], etc. 3. Matching methods based on off-the-shelf features, such as SuperGlue [25].
Originally combined with SuperPoint, SuperGlue was designed for realtime processing, and unlike the previous methods, it is capable of implicitly encoding spatial relation between neighboring features within an image and across images with the so-called attention mechanism.

Inter-epoch historical images alignment
When it comes to inter-epoch historical images, however, directly applying SIFT or SuperGlue often results in inferior results due to large radiometric differences. In Figure 8 we showed an example where SIFT and SuperGlue failed on an inter-epoch image pair with drastic scene changes. It is understandable as (1) SIFT is not sufficiently invariant over time, while (2) SuperGlue is not invariant to rotations and it underperforms on larger images because it was presumably trained on small images. Therefore, many previous researches bypassed the task of extracting interepoch correspondences by processing different epochs separately followed by an inter-epoch co-registration relying on Ground Control Points(GCPs). Between 10 and 169 GCPs are required in [29], [30], [31], [32], [33]. GCPs are usually measured with the help of photointerpretation on recent orthophotos, however, it is still monotonous and time-consuming. Furthermore, it is difficult to find salient points that are stable over time. Certain attempts were made to extract inter-epoch correspondences. Giordano et al. [34] extract feature correspondences between historical and recent images relying on HoG descriptors [35]. The authors require flight plans as input, which are not commonly available as mentioned in Section 1. Feurer et al. [36], Filhol et al. [37], Cook et al. [38], Parente et al. [39] and Blanch et al. [40] assume that a sufficient number of keypoints remain invariant across time and employ SIFT to extract inter-epoch feature correspondences. It remains questionable whether the method is capable of handling drastic scene changes. Zhang et al. [41] extract inter-epoch correspondences from SIFTdetected keypoints based on the hypothesis that points follow 2D and 3D spatial similarity model. This method works in simple cases with few scene changes. Additionally, a stream of research works focuses on historical terrestrial images ( [42], [43], [44], [45]) and historical video recordings ( [45]). However, their algorithms are not suitable to the aerial case. This work is an extention of [41]. Unlike in [41], we introduce a rough coregistration between different epochs based on matching DSMs with Super-Glue, and use it to guide a precise matching. Our rough co-registration is robust under extreme scene changes because (1) SuperGlue utilizes context to enhance feature descriptors and (2) DSMs are generally stable over time.
With the guidance of roughly co-registered orientations and DSMs, both SIFT and SuperGlue achieved good performance, as shown in our experiments.

Robust matching
The goal of robust matching is to tell apart inliers (i.e., true correspondences) from outliers (i.e., false correspondences), and eliminate the latter from further processing. Typically, an iterative sampling strategy based on RANSAC (Random Sample Consensus) [46] relying on some geometric model, such as homography [47] or essential matrix [47] is carried out to remove outliers. Substitutes of RANSAC include hand-crafted methods such as LMedS [48] (a meaningful groundwork before RANSAC), MLESAC [49] (maximizing inlier likelihood), PROSAC [50] (choosing samples progressively for acceleration), DEGENSAC [51] (increasing robustness when a dominant plane is present), GC-RANSAC [52] (adopting graph-cut algorithm), MAGSAC [53] (defining threshold automatically with marginalization), and learned methods such as DSAC [54] (simulating RANSAC in a differentiable way) and CNe [55] (end-to-end deep network). In this research, we use RANSAC to estimate the 3D Helmert transformation between surfaces (i.e., DSMs) calculated in different epochs. Compared to the classical essential/fundamental matrix filtering, with less data (3 versus 5 points) we impose stricter rules on the sets of points. Lastly, we eliminate the remaining false correspondences by looking at their cross-correlation. Figure 1(a) exhibits the workflow of our pipeline. It includes 3 main parts: intra-epoch processing, inter-epoch processing, and combined processing. The images' focal lengths and physical sensor sizes are supposed to be known, as we already mentioned in Section 1. The images are resampled to the geometry of the fiducial marks prior to processing. For the sake of simplicity, we only exhibit the processing flow of two epochs, however, it can be easily extended to more epochs. We use the term Patch and Guided to refer to two alternatives of our roughto-precise matching pipeline (patch matching or guided matching). The patch matching uses learned features, and the guided matching uses handcrafted features. We explore the two approaches because of their diverse characteristics: the Patch recovers more correspondences (cf. Figure 8 (b) and (c)), however, due to the resampling stage, they are computationally more intensive compared to Guided. Additionally, our collateral goal is to compare their performances. We adopt the following naming conventions -(1) images acquired in

Intra-epoch Processing
Intra-epoch processing furnishes complementary 3D information to help the latter inter-epoch processing. It is a standard photogrammetry or SfM pipeline and can be accomplished with lots of solutions (e.g. MicMac [56], COLMAP [57], OpenMVG [58], Theia [59], etc.). The solution used in our experiment is MicMac. It is performed within each epoch i individually as follows: 1. Extract intra-epoch correspondences between images I e i with SIFT [5]; 2. Compute interior and relative orientations (O e i ini ) with the sequential SfM; 3. Based on image orientations O e i ini , perform semi-global dense matching [60] between images I e i to get DSM (D e i ini ) in their arbitrary coordinate frames. Once multi-epochs are roughly co-registered based on feature correspondences computed on DSMs, a precise matching is carried out by performing patch matching or guided matching, followed by a 3D-RANSAC filter and a local-context validation, giving rise to the final inter-epoch feature correspondences. (b) Workflow of the rough co-registration. SuperGlue is applied on tiles of DSMs, followed by a RANSAC procedure to remove outliers. As SuperGlue is not invariant to rotations lager than 45 • , we test four rotation hypotheses and keep the best one. (c) Four rotation hypotheses. We rotate the secondary DSM by 90 • four times to match with master DSM and keep the best one with the largest number of RANSAC inliers.

Inter-epoch Processing
Inter-epoch processing follows a rough-to-precise matching strategy including: 1. Rough co-registration: match DSMs based on one-to-many tiling scheme (described in Section 3.2.1) to roughly co-register the image orientations and DSMs into a common reference frame.
2. Precise matching: match original RGB images under the guidance of the roughly co-registered data.
We choose matching DSMs for rough co-registration and RGB images for precise matching because: (1) DSMs computed from historical images turn to be more constant over time compared to RGB images; however, (2) due to low radiometric quality of the images, the DSMs are noisy (cf. Figure 7(d)). As a consequence, DSMs provide correspondences perfectly tailored for rough co-registration -less accurate but repetitive over time. Then, by narrowing down the search space based on co-registered orientations, matching RGB images will give us correspondences that are both accurate and robust.

Rough co-registration
The workflow of rough co-registration is displayed in Figure 1(b). The results (co-registered orientations and DSMs) will be used to (1) guide the precise matching, and (2) provide initial orientations for bundle adjustment in the combined processing. Therefore, the goal is to get a moderate number of reliable inter-epoch feature correspondences at low cost. We choose SuperGlue 1 to do the rough matching as it is more invariant over time than SIFT. Simply applying SuperGlue on multi-epoch images fails because it is not rotation invariant and it underperforms on large images. We improve the matching robustness by adopting the following modifications: (1) Match DSMs instead of original images. Assuming the numbers of the original images in epoch 1 and epoch 2 are P and Q respectively, instead of applying brute force rough matching P×Q times on original images, we only need to do it once on a pair of DSM images. To obtain the DSM images, we convert the DSMs (i.e., 2.5D rasters) from floating-point to [0-255] range grayscale images. Since the DSMs contain outliers, pixels with elevations larger than double the standard deviation of all elevations are ignored in the conversion process. Matching DSM images has the following merits: 1. Redundancy caused by the forward and side overlapping areas is removed; 2. It implicitly enables a follow-up search for globally consistent inliers; 3. It decreases the combinatorial complexity caused by rotation ambiguity of the product of P and Q images; 4. Even under important scene changes, DSMs generally provide stable information over time.
(2) One-to-many tiling scheme. As SuperGlue provides unsatisfactory result on large images, we propose a one-to-many tiling scheme to make up for the deficiency. It is performed as follows (cf. Figure 1(b)): 1. Crop both the master and secondary DSM images into M and N tiles of certain size, respectively; 2. Apply SuperGlue on M×N tile pairs individually; 3. Merge the feature correspondences and perform RANSAC based on 2D similarity transformation to find globally consistent inliers.
(3) Testing 4 rotation hypotheses. As SuperGlue is not invariant to rotations larger than 45 • , the one-to-many tiling scheme will be performed repeatedly over 4 rotation hypotheses as follows (cf. Figure 1(c)): 1. Rotate the secondary DSM image by 90 • four times; 2. Match each rotated DSM image with the master DSM image with the one-to-many tiling scheme; 3. Keep the hypothesis with the largest number of RANSAC inliers.
With RANSAC inliers, we build a 3D Helmert transformation to transform the orientations O e 1 ini and DSM D e 1 ini to the frame of epoch 2 , leading to coregistered orientations O e 1 co and DSM D e 1 co . The feature matching method based on DSMs will fail on perfectly flat terrain. In this rare scenario, feature correspondences can be extracted on orthophotos (see demonstration in Section 4.3), and their elevation coordinates can be retrieved from the DSMs. However, because the scene radiometry change is more pronounced than the 3D landscape change, fewer matches are found.

Precise matching
To compute precise inter-epoch feature correspondences, we perform matching on original RGB images under the guidance of co-registered orientations and DSMs. It consists of extracting tentative inter-epoch feature correspondences, followed by a 3D-RANSAC filter and a local-context validation stage to remove outliers.
(1) Get tentative inter-epoch correspondences. We offer two alternatives to get tentative correspondences: patch matching or guided matching. The former has better overall performance while the latter is more efficient in terms of the use of memory and CPU resources.
(1-1) Patch matching for learned features It is based on one-to-one tiling scheme (not to confuse with the one-to-many tiling scheme presented in the rough matching), as shown in Figure 2(a) and detailed below: 1. Crop the master RGB image I e 1 into M tiles (T e 1 ) of certain size, with a buffer zone overlapped with each other; 2. As the orientations and DSMs are roughly co-registered, we project each master tile T e 1 onto the DSM D e 1 co and backproject to secondary RGB image I e 2 to find the corresponding tile T e 2 ; 3. Resample T e 2 to T e 2 , so that the tile pair P (T e 1 , T e 2 ) is free from differences of rotation, scale and extent; 4. Apply SuperGlue on tile pair P (T e 1 , T e 2 ) to find feature correspondences M (K e 1 , K e 2 ) (K e i represents keypoints in I e i ); 5. Merge the feature correspondences.
Our patch matching experiments are performed based on SuperGlue, however, other learned methods can be adopted readily. Because the orientations and DSMs are only roughly co-registered, we have to take into account the margin of error when projecting tiles to overlapping images. This is why we add a buffer zone around the tile T e 1 , as shown in Figure 2(b). To avoid duplicate matches, we remove a posteriori the matches exceeding the original tile.
(1-2) Guided matching for hand-crafted features. The patch matching substitute orientated towards hand-crafted features is the guided matching, as shown in Figure 2(c). It leverages the positions of predicted keypoints, the known scale ratio and rotation differences to narrow down the list of the matching candidates. In our experiments, we use the SIFT points, but the method is suitable to any hand-crafted extractor. The strategy is as follows: 1. Compute the scale ratio R scl and the rotation D rot between two images by sequentially projecting the I e 1 image corners to the co-registered DSM D e 1 co and to image I e 2 ; 2. Extract keypoints K e 1 in image I e 1 and K e 2 in image I e 2 ; 3. Intersect the keypoints K e 1 with the co-registered DSM D e 1 co ; 4. Back-project them to image I e 2 , giving rise to predicted keypoints K e 2 ; 5. Search for a subset of points in K e 2 located within a radius S (100 pixels in our experiments) centered at the predicted positions K e 2 ; 6. Remove candidate matches whose scales and rotations computed by SIFT are incoherent with R scl and D rot computed from image orientations and the co-registered DSM (i.e., step 1); 7. Find the best match with mutual nearest neighbor and apply the first to second nearest neighbor ratio test [5].
To compute enhanced correspondences, we apply a 3D-RANSAC filter on the previously obtained tentative matches. More precisely, we do the following: (1) for each feature correspondence M (K e 1 , K e 2 ), the keypoints K e 1 and K e 2 are projected onto DSM D e 1 co and D e 2 ini individually to get 3D points M (G e 1 , G e 2 ); and (2) the correspondences M (G e 1 , G e 2 ) are iteratively sampled to compute the 3D spatial similarity RANSAC model: where λ is the scale factor, T is the translation vector and R is the rotation matrix. We set the number of RANSAC iterations to 1000, and consider feature correspondences within T r of its predicted position as inliers. In our experiment, T r was set to 10×GSD where GSD is the mean ground sampling distance in the coordinate frame of epoch e 2 . This distance is computed as the ground distance between two adjacent image pixels.
In the preceding step we got rid of a substantial number of outliers, however, we believe that not all outliers could be identified. We apply cross-correlation for final validation. Feature correspondences with their correlation scores below a predefined threshold (0.6 in our experiments) are discarded. The correlation window size was set to be large enough to take into account the context around a point (32×32 pixels in our experiment). Figure 3 shows an example of a false match (red) eliminated by cross correlation, while the true match (blue) is kept. Figure 3: Demonstration of the validation with cross-correlation. Considering poor quality of historical images, the window size (blue and red rectangles) was set to 32×32 pixels. False match (red) is eliminated by cross correlation, while true match (blue) is kept.

Combined Processing
Based on the intra-epoch and inter-epoch feature correspondences, a free network BBA is performed to refine all the image orientations and camera calibrations. If the results need to be analyzed in a metric scale, a spatial similarity transformation will be performed to move the refined acquisitions in an arbitrary reference frame to a metric one. If the precise orientations for one of the epochs were known (i.e., deemed as ground truth), their parameters will be fixed during the BBA and the subsequent spatial similarity transformation will be skipped. We adopted the Fraser model [61] to calibrate the cameras and allowed image-dependent affine parameters, the remaining parameters were shared among all images.

Implementation details
To reduce the image noise all historical images entering the pipeline are downsampled by a factor of 3. To calculate the DSMs, we further downsample the images by a factor of 4, which amounts to a total downsampling factor of 12 with respect to the input images. For example, the images in Fréjus 1970 are downsampled from [8766, 8763] to [730,730]. Note that the DSMs serve 3 purposes: (1) rough co-registration, (2) narrowing down the search space in precise matching, (3) providing 3D coordinates for 3D-RANSAC filter. A low resolution surface is good enough for these tasks, and keeps the computational cost low. To balance the number of the intra-and inter-epoch feature correspondences, we perform intra-epoch feature correspondences reduction available in Mic-Mac [62], followed by setting the relative observation weight in the bundle adjustment, if necessary. The feature correspondences reduction algorithm maximizes good spatial distribution, points' multiplicity and low reprojection error, it also helps to speed up the bundle adjustment. Inter-epoch feature correspondences are extracted for every possible combination of 2 epochs and finally merged. Algorithm 1 demonstrates the pseudo code of our pipeline.

Datasets
We tested our method on three datasets: Pezenas, Fréjus and Kobe. Details of the datasets are listed in Table 1 and   Pezenas. It is a 420 km 2 rectangular area located in Pezenas in the Occitanie region in southern France. The area is mainly covered with vegetation and several sparsely populated urban zones. We have at our disposal three sets of images acquired in 1971, 1981 and 2015. The epoch 2015 was acquired with the IGN's digital metric camera [63], and is treated as the ground truth (GT) during our processing (in other words, the 2015 image orientations are fixed during BBA). The area exhibits changes in scene appearance in the 44-year period.
Fréjus. It is a 15 km 2 rectangular area located in Fréjus, a commune in southeastern France. The area is mainly covered with buildings along with scattered farmlands, except a half-moon-shaped bay located in south. We have four sets of images acquired in 1954, 1966, 1970 and 2014. The epoch 2014 is treated as GT. The area exhibits drastic scene changes in the 60-year period, as can be seen in Figure 5, where the evolution of a subregion is displayed.
Kobe. It is a 90 km 2 area of irregular shape located in the north of Awaji Island, Japan. The well-known Kobe earthquake happened here in January 1995. We have two sets of images: pre-event acquired in 1991 and postevent acquired in 1995. It is mainly covered with mountain area and narrow urban zones along the sea. There is no GT, hence we measured 2 points on Google map to roughly scale the result to metric units. In this dataset we are interested in localizing the earthquake fault.

Comparison of co-registration with DSMs and orthophotos
In this section, we carry out the rough co-registration twice, with DSMs and orthophotos. We choose Fréjus in the years 1954 and 2014 for this experiment because it is the most difficult matching scenario, demonstrating significant scene changes and very different area extents. As can be seen in Figure 6(a-b)   followed by RANSAC), pure SuperGlue and SIFT. Feature correspondences are visualised in Figure 6 (c-h) and the inlier ratios are given in Table 2.
As can be seen, Ours inlier ratio reached 18.6% and 37.1% for orthophotos and DSMs, respectively. SuperGlue and SIFT reached 0% for both types of images. DSMs tend to have more inliers than orthophotos, because there is much less perceived change in the DSMs than there is in the orthophotos, see Figure 6 (c) and (d). Our rough co-registration is capable of recovering robust correspondences because one-to-many tiling scheme finds many good correspondences in the overlapping tile pairs, and few wrong ones in the non-overlapping pairs, which ensures the success of RANSAC.  Table 2: Feature correspondences number, inlier number and inlier ratio in Figure 6

Comparison of precise matching on DSMs and original images
In order to decide which type of images (DSMs or original images) is more suitable for executing the precise matching, we apply our pipeline Patch on both DSMs and original images of Fréjus 1970 and 2014. The final feature correspondences are displayed in Figure 7 (a) and (b). To asses quantitatively the results, we created a GT depth map and calculated the accuracy (correct matches / total matches). In Figure 7 (c) we plot the accuracy curves while varying the reprojection error threshold from 0 to 10 pixels. It is clear that the result using the original images is more accurate, even though the DSMs recovered more correspondences. This is because historical DSMs at full resolution are too noisy to guarantee high precision measurements (see the DSM shaded image in Figure 7 (d).

Comparison between Ours (Guided and Patch), SuperGlue and SIFT
In this section we compare the feature matching result between our roughto-precise pipeline (Guided and Patch), SuperGlue and SIFT. The image pair used here is Fréjus 1954 and 1966. Given that SuperGlue failed on the original pair (cf. Figure 8 (a)), the left image was rotated to match the upright position of the right image prior to processing. SuperGlue is employed with the off-the-shelf trained outdoor model provided by the authors. We can observe that pure SIFT fails to extract correct matches, and the out-of-thebox SuperGlue finds a lot of matches, most of which seem good, but at a closer look the details reveal poor localization precision. In Guided and Patch, all matches are correct, and the Patch approach detects denser correspondences, as can be seen in Figure 8 (b-e).
In Figure 8 (f) we plot the accuracy curves as we vary the reprojection error threshold from 0 to 10 pixels based on manually created GT. Notice that SIFT and SuperGlue had 0% and 30% correct matches under 10 pixels, respectively. In the meantime, both Guided and Patch reached over 90% under 6 pixels.

Evaluation Method
We refine the co-registered orientations in a BBA routine using the interand intra-epoch feature correspondences (cf. Figure 1(a)). The orientations of the latest epochs in Pezenas and Fréjus were treated as fixed during the combined BBA since they were accurately known a-priori, while all the remaining orientations were considered as free parameters. At first, interior orientation parameters were shared among all images. Once stable initial values were known, interior parameters were further refined with image-dependent affine parameters. The affine component of the camera calibration is expected to model, at least partially, the shear of the analog film. For comparison, we recover the image orientations with 5 methods, which can be divided into 2 categories of "rough" and "refined": 1. 3vGCPs ("rough"): We co-register all the epochs with 3 manually measured virtual GCPs in total. By virtual GCPs we mean points of known 3D coordinates in one of the epochs. 2. Co-Reg ("rough"): We match DSMs based on one-to-many tiling scheme to roughly co-register all the epochs (cf. Figure 1(b)). 3. PureSG ("refined"): Based on the Co-Reg result, we remove the scale, rotation and extent differences for inter-epoch image pairs (without one-to-one tiling scheme), and apply off-the-shell SuperGlue to get inter-epoch feature correspondences, which are then used to refine the Co-Reg result. 4. Guided ("refined"): Based on the Co-Reg result, inter-epoch feature correspondences are extracted with guided matching, as presented in Section 3.2.2. Finally, a BA routine refines the initial Co-Reg orientations. 5. Patch ("refined"): Same as in Guided, except here we adopt the alternative Patch approach as presented in Section 3.2.2.
Our first metric to evaluate the aforementioned 5 methods is DoD (Difference of DSMs). For Pezenas and Fréjus datasets, DoDs are calculated between historical epochs and the available GT epochs (cf. Figure 9). For Kobe dataset there is no GT so we calculate the DoD between 1991 and 1995 instead (cf. Figure 10 (a-e)). Statistics of the presented DoDs are demonstrated in Table 3.
Even though DoD provides a global evaluation over the entire scene, it is contaminated with real scene changes. Therefore, we present a supplementary metric for each dataset:  Table 3: Average value µ, standard deviation σ, and absolute average value |µ| of all the DoDs in Figure 9 and 10(a-e).
2. For Kobe, we: (1) calculated the DSMs; (2) orthorectified the images; and (3) performed 2D correlation of the respective orthophotos [64] to see whether we can observe the slip of the tectonic plate. The displacement maps (i.e., northeastward, denoted as Gd) are presented in Figure 10(f-j).

Result
DoDs. Table 3 reports the average value and standard deviations computed on all the DoDs in Figure 9 and 10(a-e). A dome artifact is present in DoD 3vGCP s and DoD Co−Reg for all the datasets. This kind of systematic error is known to originate from poorly modeled camera internal parameters [34]. The results of DoD P ureSG for datasets with smaller time span (e.g. Kobe and Pezenas) are relatively good, thanks to our roughly co-registered orientations and DSMs. However, the results of Fréjus are not satisfactorywithout adopting our pipeline (i.e., patch matching, 3D-RANSAC filter and cross-correlation), the inlier ratio and precision of the inter-epoch feature correspondences are low. DoD P atch and DoD Guided performed best for all the datasets. Given the numerous good inter-epoch feature correspondences, we are able to effectively mitigate the dome effect. DoD P atch performed slightly    better than DoD Guided on Fréjus dataset, as the Patch is more robust when extreme scene changes are present.
Ground displacements. For ground displacements in Kobe, an up-lateral strikeslip movement along the sea is present in Guided and Patch (cf. Figure 10(ij)), but not the other 3 methods. The observed signal is coherent with the fault of the Kobe earthquake known from independent data sources (cf. Figure 10(k)), according to which a striking north 30 • -60 • east rupture occurred along the northeast-southwest coastline across ∼18 km [65].
Check point accuracy. For the check point accuracy in Pezenas and Fréjus, the Patch performed best, while the Guided ranked second best for both datasets. Also, the Patch performed slightly better than Guided for Pezenas, while the gap widened for Fréjus, which is consistent with the result of DoDs.
Robustness. Table 5 demonstrates the number of inter-epoch feature correspondences obtained with patch matching and guided matching before and after the 3D-RANSAC and cross correlation filters. Observe that 3D-RANSAC filter and cross correlation removed considerable number of feature correspondences, at the same time enough feature correspondences survived, which guaranteed robustness of our method. Moreover, the Patch recovered considerably more feature correspondences than the Guided, which is understandable as SuperGlue is more invariant over time than SIFT.

Conclusion
This work proposed to adopt DSMs and tiling schemes to get robust feature correspondences between historical images taken at different epochs. It opens up the possibility to unlock the potential of millions of historical images, which until now have been relatively unexplored due to lack of reliable processing techniques. Our matching strategy is able to extract reliable interepoch feature correspondences under significant scene changes where SIFT [5] and SuperGlue [25] fail. Three sets of datasets are tested, including one that witnessed an earthquake. We showed that with the recovered feature correspondences we are able to mitigate the camera calibration systematic errors, leading to more accurate results, whether it is the geo-referenced DSM or ground displacements.
Our future work will leverage depth maps to train a neural network architecture in extracting robust features over time.