Synthetic feature pairs dataset and siamese convolutional model for image matching

In a previous publication [1], we created a dataset of feature patches for training detection models. In this paper, we use the same patches to create a new large synthetic dataset of feature pairs, similar and different, in order to perform the description and matching of the detected features with a siamese convolutional model. We thus complete the entire matching pipeline. Because accurate manual labeling of image features is very difficult, owing to their large number and the various associated parameters of position, scale and rotation, recent deep learning models use the results of handcrafted methods for training. Compared to existing datasets, ours avoids training models on false detections produced when feature patches are extracted by other algorithms, or on the inaccuracies of manual labeling. The other advantage of synthetic patches is that we can control their content (corners, edges, etc.) as well as their geometric and photometric parameters, and therefore control the invariance of the model. The proposed datasets thus allow a new approach to training the different matching modules without using traditional methods. To our knowledge, these are the first feature datasets based on generated synthetic patches for image matching.



Value of the Data
• Recent image matching methods based on deep learning still use the results of traditional methods for training, because manually and accurately labeling the position, orientation and scale of hundreds of features per image is very difficult. The proposed synthetic datasets make it possible to avoid relying on handcrafted methods when training feature detection, description and matching models. Using the same patches to train the different models of the matching pipeline also allows a better combination of these components.
• The datasets can be used by researchers working on any of the many applications of image matching, such as augmented reality or image registration, to train new models.
• The datasets are directly usable for training one or more modules of the matching pipeline. It is also possible to augment them with new transformations, or to follow the design methodology to create a new dataset more specific to a given application, for example by modifying the format and content of the images.
• Because it avoids relying on the results of traditional methods for training, and because the geometric and photometric parameters of the features can be controlled accurately and the features generated in very large numbers, the synthetic approach opens up new possibilities for training matching models.
• The dataset simulates different geometric and photometric transformations in order to obtain matching models that are invariant to these transformations.

Data Description
We present in this article three datasets of synthetic images that we designed to train deep learning models for image matching. The first dataset, called the patch dataset, is used to train convolutional (CNN) and artificial neural network (ANN) feature detection models (first matching step). The second dataset, called the mosaic images dataset, is used to train a convolutional sliding window model for fast feature detection. Finally, the third dataset, called the feature pairs dataset, is used to train a siamese convolutional model for description and matching of the detected features (last two steps of the matching pipeline). Note that the first two datasets were already the subject of a previous publication [1], but they are presented here in more detail about their design, along with some improvements that have been made. The main novelty here is the dataset of feature pairs and its use for training.
Fig. 1 shows some random samples of the patch dataset, simulating different variations of angle, rotation, luminance, contrast, noise and blur. Fig. 3 shows an example of a mosaic image and the corresponding feature map used to locate the features. Fig. 5 shows random samples from the feature pairs dataset.
The datasets are stored in HDF5 format. The patch dataset and the mosaic dataset are each split into two files, one for the train set and one for the test set, while the feature pairs dataset is stored in a single file.
Each dataset is accompanied by an IPYNB notebook that reads and displays the data along with other information (format, type, number of samples), in order to facilitate its use.
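Independently of the provided notebooks, an HDF5 file can be inspected with the `h5py` library. The sketch below uses an in-memory file with illustrative dataset names (`images`, `labels`) standing in for the real layout, which is documented in the notebooks:

```python
import h5py
import numpy as np

# Illustrative stand-in data shaped like the 15 x 15 grey-level patches.
images = np.zeros((100, 15, 15), dtype=np.uint8)
labels = np.zeros((100,), dtype=np.uint8)

# An in-memory HDF5 file stands in for e.g. the patch train set on disk;
# for the real files, open them with h5py.File(path, "r") instead.
with h5py.File("demo.h5", "w", driver="core", backing_store=False) as f:
    f.create_dataset("images", data=images)
    f.create_dataset("labels", data=labels)
    # List names, shapes and dtypes of the stored datasets.
    info = {name: (dset.shape, dset.dtype) for name, dset in f.items()}
    loaded = f["images"][:]  # read a dataset back as a numpy array

print(info)
```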
The datasets are public and available to the community at [2].

Experimental Design, Materials and Methods
Feature matching consists of finding correspondences between two images of the same scene acquired from different points of view. Computer vision applications are numerous [3]: augmented reality, stereo, registration, camera pose estimation, etc. Not all pixel regions of an image are reliable for matching; for example, lines and uniform regions are not locally unique in an image. Corners are more reliable features for matching [4]. The classical matching pipeline consists of three steps: feature detection, feature description and descriptor matching. Recent deep learning methods [5][6][7][8][9][10][11][12][13] use different models for each component. However, these models still rely on feature datasets [14][15][16][17][18] built from the results of traditional handcrafted methods [19] for training. This is due to the difficulty of accurately labeling the position, orientation and scale of several hundred features in each image.
In fact, different types of datasets are used to train these models. On the one hand, there are real image datasets [14][15][16] where features are localized by specific algorithms. On the other hand, synthetic datasets use either real images to which synthetic geometric transformations are applied, as in [17,18], or synthetic images of scenes/objects where the feature patches are labeled manually or detected by other algorithms, as in [20]. The novelty of the proposed datasets is that we synthesize the feature patches themselves, rather than images of scenes/objects. This avoids manual labeling of features or their detection by specific algorithms, and thus allows better localization accuracy, more control over the content, and avoids the false detections of such algorithms.
Based on this observation, and in order not to rely on traditional methods, we created in a previous article [1] a new dataset of synthetic features for training feature detection models; it is presented here in more detail, along with some improvements that have been made.
In this paper, in order to complete the whole matching pipeline with only deep learning models, we propose a new large dataset of feature pairs, to train a siamese convolutional model of description and matching that predicts the similarity between pairs of features.
We will start by describing the feature patch dataset used for detection [1] .

Patch dataset for feature detection
In this section, we present the image patch dataset previously used and briefly described in [1], and provide the full details of its design.
The idea of creating a synthetic dataset is inspired by the traditional Harris detector, still widely used today, which classifies the image into three regions: corners, lines and uniform regions. We thus generated synthetic image patches of 15 × 15 pixels for each of the three shapes, in order to train CNN and ANN models [1] that classify image regions into features (corners) and non-features (lines and uniform regions), as shown in Fig. 1.
For the corner features, we simulate, with constant steps: 13 corner angles varying from 90° to 130° (to avoid confusing them with edges and lines), 120 rotations from 0° to 360°, 18 contrasts greater than 0.6 with corners both darker and lighter than the background, a Gaussian noise N(0, 0.001²) (with and without), and a Gaussian blur with a 2 × 2 kernel and a standard deviation σ = 2/3 (with and without). We obtain a total of 112320 features (13 × 120 × 18 × 2 × 2).
For the line non-features, we simulated 90 rotations between 0° and 180° (lines being symmetrical), 20 luminances and contrasts, 6 Gaussian noises with σ between 0 and 0.002, and 3 Gaussian blurs with kernel sizes between 1 and 3 and σ = 1. For the half-line non-features, we simulated 180 rotations between 0° and 360°, 14 luminances and contrasts, 4 Gaussian noises with σ between 0 and 0.002, and 3 Gaussian blurs with kernel sizes between 1 and 3 and σ = 1. For the uniform region non-features, we simulated 205 grey levels between 0 and 1, 34 Gaussian noises with σ between 0 and 0.01, and 9 Gaussian blurs with kernel sizes between 1 and 9 and σ = 3.
We obtain a total of 125370 non-features (90 × 20 × 6 × 3 + 180 × 14 × 4 × 3 + 205 × 34 × 9).
The different values were chosen, empirically, after several tests, both to have a maximum of variabilities, visually perceptible corner and line structures, and to obtain balanced numbers of samples between the feature and non-feature classes.
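The sample counts above follow directly from the sizes of the parameter grids; a quick sketch to verify them:

```python
# Verify the patch-dataset sample counts from the parameter grid sizes
# given in the text (each factor is the number of values simulated for
# one parameter: angles x rotations x contrasts x noise x blur, etc.).
corners = 13 * 120 * 18 * 2 * 2        # corner features
lines = 90 * 20 * 6 * 3                # line non-features
half_lines = 180 * 14 * 4 * 3          # half-line non-features
uniform = 205 * 34 * 9                 # uniform-region non-features
non_features = lines + half_lines + uniform

print(corners)                 # 112320 features
print(non_features)            # 125370 non-features
print(corners + non_features)  # 237690 patches in total
```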
The data were shuffled randomly before being stored in HDF5 format in two files, a train set (90%) and a test set (10%). The images are grey level, coded on 8 bits (values between 0 and 255); they must therefore be normalized and converted to float before training.
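The normalization step can be sketched as follows (a dummy batch stands in for data loaded from the HDF5 files):

```python
import numpy as np

# The stored images are 8-bit grey levels (0..255); convert to float32
# in [0, 1] before feeding them to the network.
batch_uint8 = np.random.randint(0, 256, size=(32, 15, 15), dtype=np.uint8)
batch = batch_uint8.astype(np.float32) / 255.0
```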
In [1] , we used this dataset for training different ANN and CNN architectures, both shallow and deep, for feature detection.
For faster detection with a convolutional sliding window model, we created a feature mosaic dataset, presented in the next section.

Mosaic dataset for fast feature detection
The dataset and the model presented in this section are improved versions of those used in [1].
The convolutional sliding window model shown in Fig. 2 takes as input an image of 600 × 450 pixels and returns a feature map that localizes the features. This is a localization problem solved by regression, not a classification problem as with the models trained on the first patch dataset. Note that the model presented here produces a denser feature map than its first version in [1]; consequently, the dataset used is also slightly different from the one in [1].
To train this model, we generated, from the patch dataset, mosaic images and the corresponding feature maps, as shown in Fig. 3. The mosaic images are obtained by randomly concatenating the feature and non-feature patches into 600 × 450 images. Each image therefore contains 1200 patches (40 × 30 patches of 15 × 15 pixels). We obtain a total of 199 images from the 237690 patches; the remaining patch positions of the last image are filled with zeros. The feature maps are obtained by assigning to the features Gaussian shapes of size 15 × 15 with σ = 15/9, normalized between 0 and 1. The center of a feature is therefore labeled with a value of 1 and its neighborhood with values smaller than 1 but non-zero (because the neighborhood also contains part of the feature), which distinguishes them from the non-feature patches, labeled 0. By concatenating the patches we also generate, at their intersections, new corner features whenever one patch displays a significant contrast compared to all its neighbors. We consider that the 15 × 15 patch located at the intersection of 4 patches is a feature if one of the corners P_i of the intersection satisfies the condition: Ctr(M_i, M_j) ≥ Th and Ctr(M_i, M_k) ≥ Th and Ctr(M_i, M_l) ≥ Th, with Ctr(A, B) the contrast between two values A and B. M_i is the median value of the corner P_i of the intersection, with i ∈ {1, 2, 3, 4}; j, k and l are the indices of the neighbors P_j, P_k and P_l of P_i, with (i, j, k, l) ∈ {1, 2, 3, 4}.
Th = 0.6 is the contrast threshold used. The mosaics and feature maps are stored in HDF5 format, as floats with values between 0 and 1, and split into 179 images (90%) for the train set and 20 images (10%) for the test set.
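The Gaussian label assigned to each feature patch can be sketched as follows; the peak is 1 at the patch center and the values decay towards the borders:

```python
import numpy as np

# 15 x 15 Gaussian label for a feature patch, sigma = 15/9,
# peak normalized to 1 at the patch center.
size = 15
sigma = 15 / 9
c = size // 2  # center index (7)
y, x = np.mgrid[0:size, 0:size]
g = np.exp(-((x - c) ** 2 + (y - c) ** 2) / (2 * sigma ** 2))

print(g[c, c])  # 1.0 at the feature center
print(g.max())  # 1.0
```

Non-feature patches simply receive an all-zero 15 × 15 label instead.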
We performed the model training with the following parameters: 80% train set, 10% validation set, 10% test set; epochs = 1200; Adam optimizer; decreasing learning rate between 0.01 and 0.00001; mean squared error (MSE) loss function; cosine similarity metric.
The training took about 7 hours on GPU (Tesla P100-PCIE-16GB). Fig. 4 shows the learning curves and Table 1 the MSE and cosine similarity values at the end of the training.
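The two quantities reported in Table 1 can be computed as follows. This is a minimal numpy sketch, not the Keras/TensorFlow implementation used during training:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between a target and a predicted feature map."""
    return float(np.mean((y_true - y_pred) ** 2))

def cosine_similarity(y_true, y_pred):
    """Cosine similarity between the flattened feature maps."""
    a, b = y_true.ravel(), y_pred.ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Sanity check on identical maps: zero error, similarity 1.
m = np.random.rand(450, 600)
print(mse(m, m))                # 0.0
print(cosine_similarity(m, m))  # ~1.0
```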

Feature pairs dataset for image matching
We present here the dataset of pairs of features, used to train a siamese convolutional model for feature description and matching. To build this dataset we use only the corner patches, since the matching module has to find correspondences between detected features (corners). Note that we do not use all the corner patches, because of the limits imposed by RAM during training, as explained in more detail later in this article. We retain only 15600 corners, which are used to generate the dataset of feature pairs: we consider the same 13 corner angles as in the patch dataset, but only 10 contrasts, 30 rotations, 2 noise values and 2 blur values (13 × 10 × 30 × 2 × 2 = 15600).
We consider that two corners are similar if they have the same angle and the same contrast, but can have different rotations, noise and blur.
We denote by x_ij the j-th patch of the subset X_i of similar patches, with i ∈ [1; 130] and j ∈ [1; 120]: since similar patches share the same angle and contrast, there are 13 × 10 = 130 subsets, each containing 30 × 2 × 2 = 120 patches. To generate the feature pairs, we arrange the subsets of similar patches as X_i = { x_ij | j ∈ [1; 120] }, with i ∈ [1; 130]. From these data we generate a dataset of positive and negative feature pairs. We denote by P the set of positive pairs (similar patches) and N the set of negative pairs (different patches).
The set P consists of all combinations of pairs within each subset X_i, including pairs formed by a feature with itself: P = { (x_ij, x_ik) | i ∈ [1; 130], 1 ≤ j ≤ k ≤ 120 }. Thus, the number of positive samples is Card(P) = 130 × (120 × 121 / 2) = 943800. We use Algorithm 1 to generate the samples of class P.
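An index-level sketch of the positive-pair generation (the actual Algorithm 1 operates on the patch arrays; only indices are enumerated here to keep memory small):

```python
from itertools import combinations_with_replacement

def positive_pair_indices(n_subsets=130, subset_size=120):
    """Yield index pairs ((i, j), (i, k)) for all unordered pairs within
    each subset X_i, including a feature paired with itself (j == k)."""
    for i in range(n_subsets):
        for j, k in combinations_with_replacement(range(subset_size), 2):
            yield (i, j), (i, k)

n_pos = sum(1 for _ in positive_pair_indices())
print(n_pos)  # 943800 = 130 * (120 * 121 / 2)
```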
The set N of negative pairs corresponds to all combinations of each feature of a subset X_i with all the features of the other subsets X_k, with i ≠ k: N = { (x_ij, x_kl) | i, k ∈ [1; 130], i < k, j, l ∈ [1; 120] }. So the number of negative samples is Card(N) = (130 × 129 / 2) × 120 × 120 = 120744000. This very large number of negative pairs raises two problems. On the one hand, the two classes P and N are largely imbalanced. On the other hand, the amount of disk space, and especially of RAM, needed to store and process the set N is about 200 GiB for 15 × 15 patch pairs coded on 32 bits.
To solve these problems, we consider the set N* of negative pairs obtained by taking one feature out of 3 in the subset X_i and one feature out of 40 in the subset X_j. The steps 3 and 40 were chosen so as to obtain balanced P and N* classes. We use Algorithm 2 to generate the samples of class N*.
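The subsampled negative-pair generation can likewise be sketched at the index level (this assumes, consistently with the counts in the text, that each unordered subset pair i < j is visited once):

```python
def negative_pair_indices(n_subsets=130, subset_size=120, step_i=3, step_j=40):
    """Yield index pairs ((i, a), (j, b)) across subsets i < j, taking one
    feature out of step_i from X_i and one out of step_j from X_j."""
    for i in range(n_subsets):
        for j in range(i + 1, n_subsets):
            for a in range(0, subset_size, step_i):      # 40 features of X_i
                for b in range(0, subset_size, step_j):  # 3 features of X_j
                    yield (i, a), (j, b)

n_neg = sum(1 for _ in negative_pair_indices())
print(n_neg)  # 1006200 = (130 * 129 / 2) * 40 * 3
```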
We obtain Card(N*) = (130 × 129 / 2) × 40 × 3 = 1006200. The two classes P and N* are thus well balanced, and the number of samples no longer poses any storage problem.
The total number of samples in the dataset is therefore 943800 + 1006200 = 1.95 M feature pair images. Fig. 5 shows some random samples.
The samples are shuffled randomly and stored in HDF5 format. We used this dataset to train the siamese convolutional model shown in Fig. 6. The goal of the siamese model is to differentiate patches by maximizing the Euclidean distance between the descriptors of pairs of different features: D(a, b) = ||f(a) − f(b)||₂, with a and b two features, and f(a) and f(b) the corresponding descriptors obtained at the output of the CNN. For that purpose, we optimize the CNNs using a contrastive loss function that maximizes the contrast (distance) between the positive and negative classes: L(Y_true, Y_pred) = Y_true · Y_pred² + (1 − Y_true) · max(α − Y_pred, 0)², with α = 1 the margin that ensures that the distance of the negative pairs is greater than the distance of the positive pairs, Y_true the labels of the pairs, and Y_pred the distance predicted by the model. Note that both CNNs have the same architecture and learn the same parameters, so that they compute the descriptors in the same way.
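A minimal numpy version of the distance and of the contrastive loss described above (the training itself uses the equivalent Keras/TensorFlow operations; the label convention assumed here is Y_true = 1 for similar pairs, 0 for different pairs):

```python
import numpy as np

def euclidean_distance(fa, fb):
    """Distance between two batches of descriptors, shape (n, d)."""
    return np.sqrt(np.sum((fa - fb) ** 2, axis=1))

def contrastive_loss(y_true, d, alpha=1.0):
    """Contrastive loss: similar pairs (y_true = 1) are pulled to distance 0,
    different pairs (y_true = 0) are pushed beyond the margin alpha."""
    return float(np.mean(y_true * d ** 2
                         + (1 - y_true) * np.maximum(alpha - d, 0) ** 2))

# A similar pair at distance 0 and a different pair beyond the margin
# both incur zero loss; anything else is penalized.
d = np.array([0.0, 1.5])
y = np.array([1.0, 0.0])
print(contrastive_loss(y, d))  # 0.0
```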
The training took about 7 hours on GPU. Fig. 7 shows the learning curves and Table 2 the values of the contrastive loss and the accuracy at the end of the training.

Homography estimation
In order to test the combination of the two trained detection and matching models, and how they generalize to new images, we applied different transformations to images and estimated the homography from the matched features, eliminating the matching outliers with RANSAC. Fig. 8 shows some examples of estimation.
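The final estimation step can be sketched as follows: a minimal direct linear transform (DLT) estimate of the homography from matched points, assuming the outliers have already been removed by RANSAC (this is an illustrative sketch, not the exact implementation used in the paper):

```python
import numpy as np

def estimate_homography(src, dst):
    """DLT estimate of H such that dst ~ H @ src in homogeneous coordinates.

    src, dst: arrays of shape (n, 2) with n >= 4 matched points, assumed
    already filtered of outliers (e.g. by RANSAC).
    """
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        # Each correspondence contributes two linear equations in the
        # 9 entries of H (up to scale).
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.asarray(rows, dtype=float)
    # The solution is the right singular vector of A with the smallest
    # singular value (last row of Vt).
    _, _, vt = np.linalg.svd(A)
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]  # normalize so that H[2, 2] = 1

# Usage: recover a known homography from exact correspondences.
H = np.array([[1.2, 0.1, 5.0], [0.05, 0.9, -3.0], [0.001, 0.002, 1.0]])
src = np.array([[0, 0], [100, 0], [0, 100], [100, 100], [50, 30]], float)
p = np.hstack([src, np.ones((5, 1))]) @ H.T
dst = p[:, :2] / p[:, 2:]
H_est = estimate_homography(src, dst)
print(np.round(H_est, 4))
```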