On Sensitive Minima in Margin-Based Deep Distance Learning

This paper investigates sensitive minima in popular deep distance learning techniques such as Siamese and Triplet networks. We demonstrate that standard formulations may find solutions that are sensitive to small changes and thus generalize poorly. To alleviate sensitive minima, we propose a new approach to regularizing margin-based deep distance learning: introducing stochasticity in the loss that encourages robust solutions. Our experimental results on HPatches show promise compared to common regularization techniques, including weight decay and dropout, especially for small sample sizes.

Margin-based distance learning methods using the Contrastive loss [5] or the Triplet loss [24], [32] optimize a deep network by learning a nonlinear distance where similar image pairs are optimized to be closer than dissimilar pairs. For dissimilar image pairs, a hinge-loss variant max(m − d, 0) is often used so that the distance d between the dissimilar pair is at least m, where m is the margin. Their elegance and ease of implementation make margin-based approaches arguably the best-known and most popular current approach for deep distance learning [16], [19], [23], [25], [41].
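As a concrete illustration, the per-pair hinge behaviour described above can be sketched in a few lines of Python (the function name and margin value are ours, chosen for illustration, not taken from the cited works):

```python
def contrastive_pair_loss(d, similar, m=1.0):
    """Margin-based loss for one image pair with learned distance d.
    Similar pairs are pulled toward zero distance; dissimilar pairs
    are pushed until they are at least the margin m apart."""
    if similar:
        return d ** 2
    return max(m - d, 0.0) ** 2  # hinge: zero loss once d >= m

# A dissimilar pair beyond the margin incurs no loss at all,
# regardless of how sensitive the underlying solution is:
assert contrastive_pair_loss(1.5, similar=False) == 0.0
assert contrastive_pair_loss(0.5, similar=True) == 0.25
```

Note that the hinge is flat for any d ≥ m, which is exactly why all parameterizations beyond the margin look equally good to the optimizer.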
In this paper we posit our observation that margin-based approaches for deep distance learning suffer from sensitive minima. As illustrated in Figure 1, there are two cases.
Case 1: If the distance d of the dissimilar pair is too small, and the margin is thus violated, then gradient descent converges to points on the margin where the loss is zero but the gradient of d with respect to the learned parameters is high (red points); i.e., small changes in the parameters lead to large changes in the distance, and thus to sensitivity. Case 2: The dissimilar image pair has a large enough distance d, the margin is never violated, there is no loss, and thus any valid parameterization is selected, without taking robustness into consideration. The problem with sensitive solutions is that they do not generalize well to unseen data, as changes in the weights and in the input data are strongly correlated. Thus, using a margin for distance learning may lead to sensitive solutions, which in turn cause overfitting.
We propose a novel regularization technique to combat sensitive minima in margin-based deep distance learning. We introduce stochasticity in the loss, which forces the optimization to search for minima with a small gradient of the distance w.r.t. the parameterization, leading to robust minima. In addition, we explore the use of a margin not only on dissimilar pairs but also on similar image pairs, which prevents the network from selecting minima too close to 0 that would fit the training data too well and likely not generalize. In Figure 2 we show a 2D example fitted on 5 random 2D pairs, where the Contrastive loss severely overfits while our regularization approach is stable.
Our contributions are as follows. 1) We observe that margin-based deep distance learning suffers from sensitive minima. 2) We propose a regularization technique that introduces stochasticity in the loss, imposing non-zero gradients at sensitive minima and thus preventing gradient descent from settling at those unstable solutions. 3) We show that introducing a non-zero margin for similar pairs in the Siamese loss is beneficial. 4) We propose the use of a squared loss instead of a hinge loss in both the Siamese and Triplet formulations, which leads to more robust features in the learned latent space. 5) We evaluate our regularization on the recent HPatches dataset [3] and show that, compared to other popular regularization methods, it is particularly effective for small training set sizes.

II. RELATED WORK
A. REGULARIZATION IN DEEP DISTANCE LEARNING
Regularized convex distance metric learning allows for generalization bounds [14], yet these do not apply to current non-convex deep distance learning. We focus on practical regularization of popular margin-based metric learning, where Siamese [5] and Triplet networks [24], [32] are the best-known examples. Siamese networks existed in the 90s [4], but the modern Siamese network with a CNN architecture dates back a decade [5]. Overfitting in Siamese networks [17], [27], [38] is commonly treated with data augmentation [19], l2-norm normalization [28], l2-norm regularization [17], [38], and dropout [42]. Triplet networks tend to outperform Siamese networks at the expense of a more complicated training process [16], [19]. There exist diverse works based on Triplet networks for feature embedding, of which a few examples are [1], [9], [20], [30], [40]; overfitting is addressed by regularization techniques ranging from data augmentation [19] to more complicated sampling strategies [12], [34]. Learning local image descriptors is a well-known application of deep distance learning [38]. Examples in this category are MatchNet [8], LIFT [37], HardNet [22], L2-Net [28], and DOAP [10]. Typically some form of feature normalization and batch normalization [13] is used in these networks to reduce the effect of hyper-parameters. Overfitting is reduced by dropout and by limiting the architecture to few convolutional layers (only around 7), without fully connected layers. All such works, including Siamese and Triplet networks, typically use common regularization techniques proven in the image classification domain, such as dropout and l1/l2-norms. In contrast, we propose a regularization technique for deep metric learning specifically.

B. STOCHASTICITY FOR REGULARIZATION
Uncertainty injection is a regularization technique that makes a network insensitive to small random modifications [6]; in some cases it is equivalent to an analytical form, e.g., weight decay [18] with a Gaussian prior on the input is the same as l2-regularization. Uncertainty injection methods include dropout [26], drop-connect [31], standout [2], and shakeout [15], which inject Bernoulli noise into the hidden units of a deep neural network. In this context, practical Bayesian variational methods are discussed in [7]. Techniques such as [35] suggest that adding noise to the output labels improves generalization [6, Ch. 7.5].

Similar image pairs have the label y = 1 and dissimilar image pairs the label y = 0. We consider a Euclidean metric space where d(f(I), f(I')) outputs the Euclidean distance between the representation vectors of an image pair {I, I'}. The mapping function f is typically parameterized by a deep convolutional neural network (CNN).
Common CNN architectures for deep distance learning are Siamese and Triplet networks. In a Siamese network, two identical networks that share the same weights are trained using a Contrastive loss function [5], which aims at zero distance for all similar image pairs and at least the margin distance m for dissimilar pairs:

J_Siam(w, m) = (1/N_pos) Σ_{i=1}^{N_pos} d_{p,i}^2 + (1/N_neg) Σ_{j=1}^{N_neg} max(m − d_{n,j}, 0)^2,   (3)

where N_pos and N_neg are the numbers of similar and dissimilar training pairs, indexed by i and j. The loss is a function of the network weights w and the margin m.
A Triplet network uses three images: an anchor image is compared to a similar image and to a dissimilar image [24]. An example Triplet loss for N = N_pos + N_neg images is

J_Tripl(w, m) = (1/N) Σ_{i=1}^{N} max(d_{p,i} − d_{n,i} + m, 0)^2.   (4)
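A minimal sketch of such a per-triplet hinge term (squared here, matching the squared-hinge variant shown in Figure 1; the function name and margin value are illustrative):

```python
def triplet_hinge_loss(d_pos, d_neg, m=1.0):
    """Loss for one (anchor, positive, negative) triplet: the anchor's
    distance to the positive must undercut its distance to the
    negative by at least the margin m, otherwise a penalty is paid."""
    return max(d_pos - d_neg + m, 0.0) ** 2

# Margin satisfied: positive is more than m closer than the negative.
assert triplet_hinge_loss(0.2, 1.5) == 0.0
# Margin violated: positive and negative are nearly equidistant.
assert triplet_hinge_loss(1.0, 1.1) > 0.0
```

Unlike the Contrastive loss, the triplet term only constrains the relative ordering of the two distances, which is why its optimal points shift from sample to sample.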

B. SQUARE LOSS INSTEAD OF HINGE LOSS
Several important variants of the Contrastive and Triplet losses exist in the literature [19], yet in essence they are captured by the above formulation; e.g., the hinge loss in Eqs. (3)-(4) can be replaced by another function of choice. The hinge loss is 0 if the margin is not violated, which makes all valid parameter instantiations equal, as illustrated in case 2 of Figure 1.
To favour robust minima, we use the squared loss instead of the hinge loss in both the Siamese and Triplet formulations.

C. NON-ZERO MARGIN FOR SIMILAR PAIRS
Our experiments show a clear advantage once the desired distance for positive pairs is not set to zero, unlike the original Siamese loss in Eq. (3). By allowing non-zero distances between similar images, the natural intra-class variance is accounted for in the optimization process: although different images (patches) of the positive class have the same label, they differ slightly in texture, illumination, viewpoint, etc. The Siamese loss of our choice is illustrated in Figure 3. In contrast to the Siamese loss, the Triplet loss does not have fixed optimal points, i.e., m+ (the optimal point for positive pairs) and m− (the optimal point for negative pairs) alternate from sample to sample [19].
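Combining the squared loss of Section B with the non-zero positive margin, every pair gets a fixed optimal point, as in Figure 3. A sketch using the figure's values (m+ = 0.5, m = 3; names are ours):

```python
def siamese_squared_loss(d, similar, m_pos=0.5, m=3.0):
    """Squared Siamese loss with fixed optimal points: similar pairs
    are pulled toward d = m_pos (not zero, accounting for natural
    intra-class variance) and dissimilar pairs toward d = m_pos + m."""
    target = m_pos if similar else m_pos + m
    return (d - target) ** 2

# Both optimal points from Figure 3 (0.5 and 3.5) give zero loss:
assert siamese_squared_loss(0.5, similar=True) == 0.0
assert siamese_squared_loss(3.5, similar=False) == 0.0
```

Because the squared loss is curved everywhere, no flat region of equally "valid" parameterizations remains, unlike the hinge in case 2 of Figure 1.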

D. STOCHASTIC LOSS
Our proposed loss, referred to as the Stochastic loss, is modelled by adding a random variable to the distance representation. The proposed losses for Siamese and Triplet networks are

represented as follows:

J_pos(w) = (1/N_pos) Σ_{i=1}^{N_pos} (d_{p,i} + θ_{p,i} − m+)^2,   (5)
J_neg(w) = (1/N_neg) Σ_{j=1}^{N_neg} (d_{n,j} + θ_{n,j} − m−)^2, with m− = m+ + m,   (6)
J_Tripl(w) = (1/N) Σ_{i=1}^{N} (d_{p,i} − d_{n,i} + θ_i + m)^2.   (7)

Comparing Eqs. (3)-(4) with Eqs. (5)-(7) reveals three new variables inserted in the proposed loss. The random variables θ_p and θ_n in Eqs. (5)-(7) are the core of the proposed regularization method. The constant m+, in Eq. (5), is the non-zero optimal point (positive margin) for similar images. Note that setting θ_p, θ_n and m+ to zero reduces Eqs. (5)-(7) to the conventional Contrastive and Triplet losses with the max-loss replaced by a squared loss.
The appended random variables θ_p and θ_n impose uncertainty on the distance between the image pair representations in the Siamese and Triplet networks, and are zero-mean i.i.d. random variables. This property guarantees that the injected random variables do not affect the expected value of the empirical loss. For example, J_Siam will be optimized in an average sense around E{d_p} = m+ and E{d_n} = m+ + m for positive and negative pairs, respectively. This is important because the desired optimal points for the positive and negative samples remain intact when injecting uncertainty into the loss, unlike methods that insert randomness into the hidden layers, e.g., [6] and [26]. We keep the distributions of θ_p and θ_n equal to limit the number of introduced hyper-parameters to one.
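A sketch of the per-pair Stochastic Siamese term under these assumptions (zero-mean Bernoulli noise ±θ added to the distance; the function name and parameter values are illustrative):

```python
import random

def stochastic_siamese_loss(d, similar, theta=2.0, m_pos=0.5, m=3.0,
                            rng=random):
    """Stochastic Siamese loss: a zero-mean Bernoulli variable +/-theta
    is added to the distance before the squared penalty. Because the
    noise has zero mean, the expected optimum is unchanged:
    E{d} = m_pos for similar pairs and m_pos + m for dissimilar ones."""
    noise = rng.choice([-theta, theta])
    target = m_pos if similar else m_pos + m
    return (d + noise - target) ** 2

# With theta = 0 the loss reduces to the plain squared Siamese loss:
assert stochastic_siamese_loss(0.5, similar=True, theta=0.0) == 0.0
```

Averaging this loss over the noise gives (d − target)^2 + θ^2, i.e., a constant offset: the location of the optimum in expectation is untouched, while gradient variance is introduced.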

E. REGULARIZATION BY STOCHASTIC LOSS
The effect of the Stochastic loss in preventing the model from choosing sensitive minima can be explained by exploring the behaviour of the loss when it is optimized with gradient descent. At each iteration of gradient descent, the gradient of the loss function is computed and a step is taken towards a minimum of the loss. Taking the gradient of the Stochastic loss on positive samples in Eq. (5) leads to

∇_w J_pos = (2/N_pos) Σ_{i=1}^{N_pos} (d_{p,i} + θ_{p,i} − m+) ∇_w d_{p,i}.   (8)

In turn, the gradient of Eq. (6) is given by

∇_w J_neg = (2/N_neg) Σ_{j=1}^{N_neg} (d_{n,j} + θ_{n,j} − m−) ∇_w d_{n,j},   (9)

where ∇_w d_{p,i} (∇_w d_{n,j}) is the gradient of the positive-pair (negative-pair) distance w.r.t. the network parameters w for the ith (jth) training sample. Here we show that the expectation of the gradient is the same for the Stochastic loss and the Siamese loss; the difference lies in the variance added by the stochasticity. This is an important difference from norm-penalty regularization techniques, which change the expectation of the loss as well as of its gradient, while our method affects the variance instead.

To analyze this further, the sum gradient vector can be split into the original Siamese part and an appended stochastic part:

∇_w J = (2/N_pos) Σ_i (d_{p,i} − m+) ∇_w d_{p,i}   (10)
      + (2/N_neg) Σ_j (d_{n,j} − m−) ∇_w d_{n,j}   (11)
      + (2/N_pos) Σ_i θ_{p,i} ∇_w d_{p,i} + (2/N_neg) Σ_j θ_{n,j} ∇_w d_{n,j}.   (12)

Eq. (12) is the term appended to the loss gradient by the proposed stochasticity, while Eqs. (10)-(11) correspond to the original Siamese loss gradient. The expectation of the gradient of the proposed loss over all training samples is simply equal to that of the original Siamese loss, since both θ_p and θ_n are zero-mean i.i.d. random variables:

E{∇_w J} = E{∇_w J_Siam}.   (13)

By Eqs. (10)-(13), we explicitly show the regularization as an additive term in the gradient, which reveals the similarity and the difference between the Siamese and Stochastic losses. We now focus on the gradient behaviour at an optimal point, which is the key to understanding why the Stochastic loss regularizes the Siamese loss. The network parameters are trained when the sum of the gradients is close to zero, i.e., when the loss is no longer updated in any direction. For both the Siamese and Stochastic losses, the expectation of the loss gradient is zero once: 1) the scalars E{d_n − m−} and E{d_p − m+} (the distances from the optimal points) are zero, or 2) the vectors E{∇_w d_n} and E{∇_w d_p} (the expected metric gradient) are zero. Both options can equally make the expectation of the loss gradient zero at the optimal point. Nevertheless, only the second choice also makes the variance of the Stochastic loss gradient zero; therefore we regularize the solution space through stochasticity.
The covariance matrix of the loss gradient reveals the regularization property explicitly:

Cov{∇_w J} = Cov{∇_w J_Siam} + 4 E{θ^2} E{∇_w d ∇_w d^T},   (15)

where ∇_w d is the metric gradient and E{θ^2} is the regularization variance (we assume equal variance for θ_p and θ_n). Eq. (15) is derived assuming that the variables d, ∇_w d and θ are mutually independent. Note that E{θ^2} is always zero in the original Siamese loss, so the additional covariance term vanishes; this is not the case for the Stochastic loss. In other words, if gradient descent converges to an optimal point, meaning that there is no (or negligible) update from one iteration to another, then the Stochastic loss guarantees that the metric gradient is zero at that point, since the variance of θ is designed to be non-zero. Hence, Eq. (15) admits the regularization term that promotes solutions with zero metric gradient over other (sensitive) solutions.
Note that we do not claim any convergence guarantee; however, we showed through the gradient analysis that once gradient descent converges, the gradient of the metric is (almost) always zero at that solution point. Note also that ∇_w d_p must be a function of w so that it can be tuned during training; consequently, the Stochastic loss does not affect a linear network, for which ∇_w d_p is a constant. Strictly convex losses likewise do not benefit from the Stochastic loss. Moreover, our proposed loss is not equivalent to adding a norm penalty on the loss gradient to the original loss, since, as shown above, the loss expectation remains intact compared to the original loss. The same line of derivation holds for the Triplet loss and is omitted here for the sake of space.
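The claim that stochasticity leaves the expected gradient intact while inflating its variance in proportion to the metric gradient can be checked numerically. A sanity sketch with illustrative scalar values (our own toy setup, not the paper's experiment):

```python
import random

def grad_w(d, target, theta, slope, rng):
    """Gradient of the per-pair Stochastic loss w.r.t. a scalar weight w,
    by the chain rule: dJ/dw = 2 * (d + noise - target) * slope,
    where slope = dd/dw is the metric gradient and noise = +/-theta."""
    return 2.0 * (d + rng.choice([-theta, theta]) - target) * slope

rng = random.Random(1)
d = target = 1.5   # sitting exactly on the optimal point E{d} = target
theta = 2.0

# Robust minimum: the metric gradient (slope) is zero, so the sampled
# gradient is zero for every noise realization -- descent can rest here.
assert all(grad_w(d, target, theta, 0.0, rng) == 0.0 for _ in range(100))

# Sensitive minimum: slope = 2, so the gradient jumps between -8 and +8.
# Its mean is still ~0 (expectation unchanged, as in the analysis above),
# but its variance is (2*theta*slope)^2 > 0, so descent cannot settle.
samples = [grad_w(d, target, theta, 2.0, rng) for _ in range(10_000)]
assert all(abs(g) == 8.0 for g in samples)
assert abs(sum(samples) / len(samples)) < 1.0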

IV. EXPLANATORY EXAMPLE IN 1D
For ease of explanation, we start by building intuition with an illustrative 1D example in Figure 4. Let the blue line be a learned nonlinear distance function d = f(w) for parameter values w. We wish to find the best value of w where d(w) = m; the black curve is the corresponding loss (d − 1.5)^2. The red circles show all the minima of this loss.
At each iteration of gradient descent, we stochastically pick a constant from {−θ, θ} and simply add it to the distance d. The red and green dashed lines in Figure 4 show the resulting losses; gradient descent experiences one of these two losses at random at each iteration. Once θ is large enough compared to m, the gradients of the red and green losses have opposing signs. Almost always (with the exception of d = m), one of the two opposite-signed losses dominates the other and pushes gradient descent towards its own minimum. An example of gradient descent steps for θ = 2 is illustrated in Figure 4, where the initial point is randomly chosen at w0 = 4.21 (black star) and the optimal point is found after 500 iterations at w* = 1.74 (orange star). At a sensitive optimal point, where the metric gradient is non-zero, neither of the Stochastic losses has a minimum, so gradient descent cannot settle there; in contrast, the original loss can easily settle at these sensitive local minima.
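The exact curve d = f(w) behind Figure 4 is not given, but the mechanism can be reproduced with any nonlinear f that crosses d = m once steeply and once flatly. A toy choice of our own:

```python
def f(w):
    # Toy nonlinear distance d = f(w) with two zero-loss points:
    #   w = 0.5: d = m with a steep slope -> sensitive minimum
    #   w = 2.0: d = m with zero slope    -> robust minimum
    return 1.5 + (w - 2.0) ** 2 * (w - 0.5)

m, theta, eps = 1.5, 2.0, 1e-4

def perturbed_loss(w, noise):
    return (f(w) + noise - m) ** 2

def num_grad(w, noise):
    # central finite difference of the perturbed loss
    return (perturbed_loss(w + eps, noise)
            - perturbed_loss(w - eps, noise)) / (2 * eps)

# At the robust minimum both the +theta and -theta losses are flat, so
# gradient descent can come to rest whichever noise it happens to see:
assert abs(num_grad(2.0, +theta)) < 1e-3
assert abs(num_grad(2.0, -theta)) < 1e-3

# At the sensitive minimum the two perturbed losses pull in opposite
# directions with non-vanishing force, so it is never a resting point:
assert num_grad(0.5, +theta) * num_grad(0.5, -theta) < 0
```

This mirrors the figure: only points where the metric gradient vanishes are simultaneous minima of both perturbed losses.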
We also added the conventional gradient-norm regularization to the original loss and plotted the resulting function in Figure 4 (orange solid line). As expected, the unstable minima of the original function disappear due to the regularization term, which penalizes the loss when the norm of the gradient is non-zero. The minima of the orange function coincide with the minima of one of the Stochastic losses, which confirms the regularization effect of the proposed Stochastic loss. In other words, the proposed approach effectively samples the minima of the gradient-regularized loss without explicitly calculating the gradient of the metric. Analytical computation of that gradient is very expensive for high-dimensional optimization problems, as the partial derivative of a vector-valued function is a matrix that would need to be computed at every iteration of stochastic gradient descent.
In Figure 5 we report a small experiment where gradient descent is run 10,000 times with random initialization for various θ values, showing the effect of hyperparameter tuning. The regularization parameter θ = 4 yields the same (non-sensitive) solution 100% of the time.

V. EXPERIMENTS
Dataset: We use HPatches [3], with more than 2.5 million image patches, which is the largest and most recent benchmark for local descriptor learning. The HPatches benchmark is designed to evaluate image descriptors on three different tasks: patch verification, image matching and patch retrieval. The dataset is collected under various illuminations and viewpoints from 116 different scenes with 3 levels of difficulty (easy, hard and tough) based on different transformation noise, where 40 scenes are used for testing and the other 76 scenes for training. Evaluation is done by Mean Average Precision (mAP) [3].
Architecture: Our network architecture in Table 1 consists of 11 convolutional layers (including residual blocks [11]), followed by a fully connected layer. The number of learnable parameters for this network is approximately 1.2M. The activation function for all layers is ReLU, padding is ''same'', and average pooling is used for downsampling. Unlike other common networks, our proposed model does not use batch normalization, dropout, or any other conventional regularization method. Additionally, in all experiments, we do not use any form of data augmentation. The network input is a 32 × 32 grayscale image patch and the output is a 128-dimensional feature vector, resembling the SIFT descriptor [21]. For training, stochastic gradient descent with momentum 0.9 and learning rate 0.005 is used, where the learning rate is gradually decreased over iterations. The number of iterations in each experiment is set high enough, based on the data regime, for the loss to converge. The hyperparameter introduced by the Stochastic loss, θ, follows a zero-mean Bernoulli distribution with Pr(−θ) = Pr(θ) = 0.5, where θ is tuned per experiment.

A. DOES THE METRIC GRADIENT BECOME SMALLER?
We hypothesize that the Stochastic loss shrinks the metric gradient with respect to the network weights, ∇_w d. We train on 3K pairs taken from the HPatches [3] dataset. Both the Siamese and Triplet losses are trained in three different settings: 1) plain margin-based, 2) non-Stochastic (θ = 0) and 3) Stochastic (θ ≠ 0). Experiments are repeated 5 times with random initialization and averaged results are shown in Figure 6. The top row shows that the networks converge, while our loss is higher because of the added stochasticity. The middle row shows that the norm of the metric gradient indeed decreases more for the Stochastic loss. The histograms in the bottom row show that, for the Siamese loss, the variance of ∇_w d after training is smaller with stochasticity than without. For the Triplet loss, however, the histogram of ∇_w d for the Stochastic variant is closer to the non-Stochastic one, which confirms that the Triplet loss is inherently less sensitive to the weights than the Siamese loss, as reported in the literature [16], [19].

B. WHAT IS THE EFFECT ON THE ARCHITECTURE?
Since we introduce regularization, can we use deeper architectures with the Stochastic loss than with the Siamese loss? In Figure 7, the mean average precisions (mAPs) for the patch verification task on the HPatches [3] dataset are shown for 5 different models. Models 1 to 5 correspond to networks with 4, 5, 8, 12 and 20 convolutional layers plus a final FC layer. We used 600K pairs collected from the training portion of the HPatches dataset to learn all models, without any conventional type of regularization. The stochastic parameter is θ ∈ {−1, +1}. Three different losses, Contrastive (red), Siamese (green) and Stochastic (blue), are compared in this figure. The first observation in Figure 7 is the significant gap between the train and test mAPs, especially for the Contrastive loss, which confirms the overfitting problem; it deteriorates as the network goes deeper. The second observation is the overall performance improvement once the squared loss is used instead of the hinge loss and a positive margin is added for similar pairs (Siamese loss). In the third experiment we add the stochasticity to the Siamese loss, which results in the proposed Stochastic loss. One can see that with the Stochastic loss, the accuracy gap between the train and test sets is reduced even further; additionally, it yields a higher correlation between the train and test mAPs. These observations show the generalization power of the Stochastic loss across different models: while the mAP of the deeper models drops for the Siamese and Contrastive losses, the Stochastic loss retains its performance.

C. REGULARIZATION VS TRAINING SET SIZE
We conducted extensive experiments in 4 different data regimes of 3K, 30K, 300K and 3M pairs from the HPatches [3] training data, on our 12-layer network. In all cases the numbers of positive and negative pairs are equal. We evaluate 10 different loss settings: Contrastive, Stochastic Siamese, and Contrastive regularized by L2-regularization, by dropout, and by batch normalization; Triplet, Stochastic Triplet, and Triplet regularized by L2-regularization, by dropout, and by batch normalization. The CNN model is fixed for all experiments and the hyperparameters are tuned on the 30K data regime using patch verification mAP. We tuned the model for each of the listed methods individually to obtain the best mAP; the result of this tuning is shown in Figure 9 and Figure 10, where m+ = 1 and m = 2. We report the mean and variance of each setup in Figure 8. This experiment confirms that the Stochastic Siamese loss performs better on all tasks than the conventional Contrastive loss regularized by the different methods. However, the improvement of the Triplet loss from adding stochasticity is less significant than for the Siamese loss. One important observation from this experiment is the advantage of the Stochastic loss when less training data is available, shown in the left part of Figure 8.

D. COMPARISON TO WEIGHT PERTURBATION
The literature [6] suggests that random Gaussian weight perturbation (WP) of the network weights can approximate a norm-penalty regularization on the gradient, which is comparable to our proposed Stochastic loss. To compare the effect of WP and the Stochastic loss, we conducted the following experiment. We train our model in three settings: Siamese loss, Siamese loss with Gaussian noise added to the weights, and Stochastic loss. After tuning the network on the 30K data regime with different initializations, the best performance of these three settings is reported in Table 2. The tuned noise distribution is N(0, 2.5) and θ ∈ {+2, −2}. The results show that stochasticity generalizes better for the Siamese loss than WP does.

E. COMPARISON TO OTHERS
We compare the performance of our Stochastic Siamese and Stochastic Triplet losses with other methods on HPatches [3]. We trained our model on the HPatches training set, as the other methods do, except HardNet++, which is trained on the union of Liberty [33] and HPatches. Data augmentation is not used at all. The parameters θ = ±0.75 and θ = ±0.05 are tuned for the Stochastic Siamese and Stochastic Triplet losses, respectively. Results are shown in Figure 11. The accuracy of our Stochastic Triplet loss on the patch verification task is 94.8%, which sets a new state of the art on this dataset. For the image matching and patch retrieval tasks, our best mAPs are 58.67% and 80.23%, respectively, which are competitive. DOAP does better, which can be expected, as DOAP is a ranking loss and thus better correlated with ranking tasks such as retrieval.

VI. CONCLUSION
This paper introduces new Stochastic Siamese and Stochastic Triplet losses for deep distance learning that regularize the network at the loss layer to prevent overfitting. This is achieved by eliminating sensitive minima, where the metric gradient is non-zero, from the loss landscape. Experimental results show the effectiveness of the proposed Stochastic loss, particularly in the limited training data regime. The Stochastic loss achieves state of the art on patch verification, while having competitive performance on image matching and patch retrieval compared to ranking losses.

FIGURE 1 .
FIGURE 1. Adding a margin to distance learning creates sensitive minima. The blue line is a learned nonlinear distance function d, the purple line is the margin m, and the black line is a typical distance learning hinge loss: max(m − d, 0)^2. Case 1: If the margin is violated and the loss is minimized with gradient descent, the solutions (red dots) have a strong gradient for d, which is sensitive to small changes and thus unstable. Case 2: If the margin is not violated, any solution is satisfactory, disregarding all stability issues. In this paper we propose a method to find robust solutions (green dots).

FIGURE 3 .
FIGURE 3. Visualization of the loss w.r.t. the distance for the squared variation of the Contrastive loss in Eq. (3) with a non-zero positive optimal point. The positive (negative) optimal point is depicted with the blue (red) circle, which shows m+ (m+ + m). The relative margin is set to m = 3 and m+ = 0.5 in this plot; therefore, the negative optimal point is m− = 3.5.

FIGURE 4 .
FIGURE 4. Illustration of Stochastic loss optimization for a 1D function. The blue line is a nonlinear distance function d = f(w) for parameter values w. The black curve is the corresponding loss (d − m)^2, where m = 1.5. The red circles show all minima of the loss. The red and green dashed lines show the loss after adding either −θ or θ to the distance. The coloured stars indicate the random initial guess (w0 = 4.21) and the minimum (w* = 1.74) found by gradient descent. The orange curve shows the regularized loss with an added penalty on the norm of the metric gradient. The unstable minima of the loss do not coincide with the minima of the regularized loss. Note that the minima of the uncertain losses are the same as those of the regularized loss and are found by our method using gradient descent. The steps taken by gradient descent to reach the minimum are shown as light blue circles, where at each iteration θ is chosen randomly.

FIGURE 5 .
FIGURE 5. Histogram of the minima found by our method for different θ values. For θ = 4, our proposed method finds the minimum at w* = 1.74 with 100% probability.

FIGURE 6 .
FIGURE 6. Effect of the Stochastic loss on the metric gradient. The plots show three settings: plain margin-based in green, non-Stochastic (θ = 0) in red and Stochastic (θ ≠ 0) in blue, for both Siamese (left) and Triplet (right). The first row is the loss over iterations and the second row is the norm of the metric gradient ||Σ_{i=1}^{N_b} ∇_w d_i||_2 over iterations, where N_b is the number of pairs in a batch. The third row shows the histogram of all elements of ∇_w d over all training pairs after training, on a logarithmic scale. The hyperparameter θ is set to 2 and 0.15 for Stochastic Siamese and Stochastic Triplet, respectively. Note that the metric gradient becomes smaller once stochasticity is added to the loss.

FIGURE 7 .
FIGURE 7. Comparison of the Contrastive (red), Siamese (green), and Stochastic (blue) losses for 5 different CNN models on the HPatches dataset for the patch verification task. Models 1 to 5 correspond to 4, 5, 8, 12 and 20 convolutional layers, respectively, plus a fully connected layer at the end. Test and train mAPs show that the Stochastic loss reduces overfitting.

FIGURE 8 .
FIGURE 8. Comparison of the Stochastic Siamese and Stochastic Triplet losses with the conventional Contrastive and Triplet losses regularized by dropout, L2-regularization and batch normalization (BN) on 4 different training data regimes, with mAP reported for 3 different tasks. Generally, the Stochastic losses result in better mAP, especially in the lower data regimes.
The evaluations are reported on the three different HPatches tasks, where the training process is repeated 3 times with different initializations. The total number of experiments conducted for this section is 4 × 10 × 3 = 120, which makes it a reliable basis for comparison between the methods.
The goal is to learn a mapping f : I → R^k such that distances between similar image pairs d_p are smaller than distances between dissimilar image pairs d_n by a predefined margin m in the desired metric space, i.e., d_p + m ≤ d_n.

TABLE 1 .
Our model architecture.

TABLE 2 .
Comparison of weight perturbation and the Stochastic loss. The mAP[%] is reported for the three different HPatches tasks.

FIGURE 11. Results on HPatches. Our proposed Stochastic losses are state of the art in the patch verification task.